We have decided to stop processing new submissions to LUNA16, except for submissions that are accompanied by a complete description in the form of a high-quality scientific article. In this statement, we explain why we took this step.
We have recently received quite a few submissions whose performance, we believe, is 'too good to be true'. We have not added these to the leaderboard. LUNA16 differs from most challenges in medical image analysis: it is an open challenge, because the reference standard for the LIDC data set we use is publicly available. We ask all participants either to train their systems in a cross-validation procedure or not to use the LIDC data for training at all, and we have specified which scans belong in which folds. We have found that teams sometimes make errors, probably unintentional, in carrying out this cross-validation procedure. The overview article about LUNA16 that we have now published in Medical Image Analysis discusses this:
"Moreover, a cross-validation approach introduces some risks as people may make a mistake that goes unnoticed while carrying out a cross-validation experiment. In fact, one team that originally participated in the challenge and reported excellent results had to withdraw because of a bug in the reinitialization of the network weights when starting training for the next fold in cross-validation."
We have added a number of sanity checks to our evaluation code, and these checks flag submissions that are unlikely to be correct. We know, for example, that some locations included in the nodule candidate list are not listed as nodules in the reference standard, but are in fact nodules. If a submission with very good performance gives these locations an extremely low score for being a nodule, we think that is fishy. However, we cannot detect everything with such checks, and for several submissions we have engaged with the teams in sometimes lengthy exchanges, trying to find out what the underlying problem was. We no longer have the time to keep doing this.
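A hedged sketch of one such sanity check, assuming a submission is a mapping from candidate locations to nodule probabilities (the locations and threshold below are hypothetical, not from our actual evaluation code):

```python
# Flag a submission if locations that are known to be real nodules --
# though absent from the reference standard -- receive near-zero scores.
# A genuinely strong detector should not be this confident they are not
# nodules; such a pattern suggests leakage from the reference standard.

# Hypothetical set of (scan_id, x, y, z) locations.
KNOWN_UNLISTED_NODULES = {("scan_a", 10, 20, 30)}


def flag_suspicious(submission, threshold=1e-3):
    """submission: dict mapping (scan_id, x, y, z) -> nodule probability.
    Returns the known-nodule locations scored below the threshold."""
    return [loc for loc in KNOWN_UNLISTED_NODULES
            if submission.get(loc, 1.0) < threshold]
```

A check like this only raises a flag for manual review; as noted above, it cannot catch every problematic submission.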
It would be a good idea to repeat the LUNA16 challenge with a new test data set. Preferably, this test data would not be released publicly; instead, teams would submit their solutions in the form of a Docker container or something similar, and we would apply the submitted systems to the secret test data and report the results. If anybody is interested in co-organizing such a new competition, which we could call LUNA18, please feel free to contact us.
Submissions accompanied by a solid description will still be processed. We realize that reviewers these days insist that authors of scientific papers report the performance of their algorithms on well-established benchmark data sets. Note that the evaluation script is available for download, so you can still evaluate your results yourself and compare them to those listed on the leaderboard.
Bram van Ginneken
Arnaud A. A. Setio