Evaluation¶
The evaluation is performed by measuring the detection sensitivity of the algorithm and the corresponding false positive rate per scan. A candidate is considered a true positive if it detects a nodule from the reference standard. Our hit criterion is that a candidate must be located within a distance R of the nodule center, where R is half the nodule diameter. Candidates that do not detect any nodule are counted as false positives. Candidates that detect irrelevant findings (see the Data section for the definition) are ignored during the evaluation and are counted neither as false positives nor as true positives. Analysis is performed using free-response receiver operating characteristic (FROC) analysis. To obtain a point on the FROC curve, only those findings of a CAD system whose degree of suspicion is above a threshold t are selected, and the sensitivity and the average number of false positives per scan are determined. All thresholds that define a unique point on the FROC curve are evaluated. The point with the lowest false positive rate is connected to (0,0).
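As an illustration, the following is a minimal sketch (not the official evaluation script) of this hit criterion and of how FROC points can be derived from a list of scored candidates; the names is_hit, froc_points, candidates, nodules, and n_scans are purely illustrative.

import numpy as np

def is_hit(candidate_xyz, nodule_center_xyz, nodule_diameter_mm):
    # A candidate hits a nodule if it lies within R = diameter / 2 of the nodule center.
    r = nodule_diameter_mm / 2.0
    return np.linalg.norm(np.asarray(candidate_xyz) - np.asarray(nodule_center_xyz)) <= r

def froc_points(candidates, nodules, n_scans):
    # candidates: list of (degree_of_suspicion, hit_nodule_id or None); None marks a
    # false positive. Irrelevant findings are assumed to have been removed already.
    # nodules: set of all nodule ids in the reference standard.
    thresholds = sorted({p for p, _ in candidates}, reverse=True)
    points = []  # one (average FPs per scan, sensitivity) pair per unique threshold
    for t in thresholds:
        selected = [(p, nid) for p, nid in candidates if p >= t]
        hit_ids = {nid for _, nid in selected if nid is not None}
        false_positives = sum(1 for _, nid in selected if nid is None)
        points.append((false_positives / n_scans, len(hit_ids) / len(nodules)))
    return points  # the point with the lowest false positive rate is connected to (0, 0)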
The final score is defined as the average sensitivity at 7 predefined false positive rates: 1/8, 1/4, 1/2, 1, 2, 4, and 8 FPs per scan. A perfect system will therefore have a score of 1, and the lowest possible score is 0. Most CAD systems in clinical use today have their internal threshold set to operate somewhere between 1 and 4 false positives per scan on average; some systems allow the user to vary the threshold. To make the task more challenging, we included low false positive rates in our evaluation. This tests whether a system can also identify a significant percentage of nodules with very few false alarms, as might be needed for CAD algorithms that operate more or less autonomously.
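A rough sketch of how this score could be computed from a FROC curve is shown below, assuming numpy and linear interpolation between FROC points (the official evaluation script may differ in details); froc_fp_rates and froc_sensitivities are illustrative names.

import numpy as np

def average_sensitivity(froc_fp_rates, froc_sensitivities):
    # Average sensitivity at 1/8, 1/4, 1/2, 1, 2, 4 and 8 false positives per scan.
    # The inputs describe the FROC curve, including the point (0, 0).
    targets = [0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0]
    fp = np.asarray(froc_fp_rates, dtype=float)
    sens = np.asarray(froc_sensitivities, dtype=float)
    order = np.argsort(fp)  # np.interp expects increasing x values
    return float(np.mean(np.interp(targets, fp[order], sens[order])))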
We compute the 95% confidence interval using bootstrapping with 1,000 bootstrap iterations. For each bootstrap, a new set of candidates is constructed using (scan-level) sampling with replacement.
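A minimal sketch of such a scan-level bootstrap is given below; score_fn is a placeholder for the full FROC and score computation restricted to the resampled scans, and is not part of the provided evaluation script.

import numpy as np

def bootstrap_confidence_interval(scan_ids, score_fn, n_bootstraps=1000, alpha=0.05, seed=0):
    # Resample scans with replacement, recompute the score on the pooled candidates of
    # the sampled scans, and take percentiles of the resulting scores as the 95% CI.
    rng = np.random.default_rng(seed)
    scan_ids = list(scan_ids)
    scores = []
    for _ in range(n_bootstraps):
        sample = rng.choice(scan_ids, size=len(scan_ids), replace=True)
        scores.append(score_fn(sample))
    low, high = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high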
This performance metric was introduced in the ANODE09 study and is described in detail in the ANODE09 paper. The evaluation script is provided at this link, so that participants can use the same evaluation method while developing their algorithms.
Submission¶
Each participant should upload a results file in csv format and a description file. The name of the results file should follow the format [NAME]_[TRACK].csv, where [TRACK] is one of:
- NDET : the nodule detection track
- FPRED : the false positive reduction track (use candidates_V2.csv)
An example of a correct filename is LUNA16CAD_NDET.csv. A description of the algorithm in pdf format should be uploaded (with the same name as the results file, e.g. LUNA16CAD_NDET.pdf) and will be made available on the results page. The description file should explain the algorithm that was used.
Results file¶
The results file must be a simple comma-separated values (csv) file that contains a string and four numbers per line. Each line holds one finding. The fields must be separated by comma characters, as in the example below.
The string is the name of the scan (seriesuid) in which the finding is located. This string is followed by three numbers giving the x, y, and z coordinates of the finding, respectively. They can be floating point values if desired (use a decimal point, not a comma). Coordinates are given in world coordinates, i.e. in millimeters; for example, the first finding in the example file below is located at (-128.6,-175.3,-298.3). You can verify that you address voxels correctly by checking the supplied coordinates against the example data: the locations should refer to positions that hold a nodule.
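If your pipeline works in voxel indices, the conversion to world coordinates could, for example, be done as in the sketch below. It assumes the scans are read with SimpleITK; the file name and voxel index are placeholders.

import SimpleITK as sitk

# Placeholder path to one of the provided scans (.mhd header of the image volume).
image = sitk.ReadImage("1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860.mhd")

voxel_index = (100, 200, 50)  # illustrative voxel index in (x, y, z) order
world_xyz = image.TransformContinuousIndexToPhysicalPoint([float(v) for v in voxel_index])
print(world_xyz)  # world coordinates in millimeters, as expected in the results file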
The final number is the degree of suspicion, which should be higher for findings that are more likely to represent true nodules (it can be a floating point value; use a decimal point in that case). The FROC curve of the system will be determined by thresholding this value.
The order of the lines is irrelevant.
The following is an example result file:
seriesuid,coordX,coordY,coordZ,probability
1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860,-128.6,-175.3,-298.3,1.0
1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860, 103.7,-211.9,-227.1,0.8
1.3.6.1.4.1.14519.5.2.1.6279.6001.100398138793540579077826395208, 69.6,-140.9, 876.3,0.2
1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016233746780170740405, -24.0, 192.1,-391.0,0.5
1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016233746780170740405, 2.4, 172.4,-405.4,1.0
1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016233746780170740405, 90.9, 149.0,-426.5,1.0
1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016233746780170740405, 89.5, 196.4,-515.4,0.2
1.3.6.1.4.1.14519.5.2.1.6279.6001.100953483028192176989979435275, 81.5, 54.9,-150.3,0.1
This file contains 8 findings (obviously far too few) with 5 unique likelihood values. This means that there are 5 unique thresholds that each produce a distinct set of findings (each threshold discards the findings below it), so there will be 5 points on the FROC curve.
Note that for the 'false positive reduction' track, 754,975 findings (one per provided candidate) are expected. The order of the lines is again irrelevant.
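As a minimal illustration, a results file in the expected format could be written as follows; the findings list is a placeholder for the output of your system.

import csv

# (seriesuid, coordX, coordY, coordZ, probability), coordinates in world coordinates (mm)
findings = [
    ("1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860",
     -128.6, -175.3, -298.3, 1.0),
]

with open("LUNA16CAD_NDET.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["seriesuid", "coordX", "coordY", "coordZ", "probability"])
    writer.writerows(findings)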
Description of the algorithms¶
Every submitted result must be accompanied by a pdf file describing the system that generated it. Ideally this document is a paper (a scientific publication or technical report) that describes the system in enough detail that others can reimplement it.
References to accessible literature in the description are acceptable; there is no need to repeat information that can be found in such publications. If you have not yet written a detailed paper, have submitted one for publication and do not want it to become publicly available yet, or have other reasons for withholding detailed information about your method, please state those reasons in the pdf file you submit and describe the system only briefly, using the checklist given below. If your system is commercial and you do not want to divulge details, please provide whatever you can release, and state the name and version of the system and the company associated with it.
The reason we require a descriptive document with every submitted result is that we believe it is far less interesting to report results of systems whose workings are unknown. We reserve the right not to process results that are not accompanied by a suitable description.
For convenience, we provide a checklist below of items that we believe should be mentioned in a description of the CAD system.
- Contact details
- Mention the challenge track that you follow
- Does your system use training data? If so, describe the characteristics of the training data.
- Give the overall structure of the algorithm. In many cases this will be lung segmentation, candidate detection, feature computation for each candidate, and classification of candidates.
- Briefly describe each step in the structure of the algorithm (how were the lungs segmented, how were candidates detected, what features were used, what classifier was used).
- List the limitations of the algorithm. Is it specifically designed to detect only certain types of nodules? What size range was it optimized for? Can it detect nodules attached to vasculature or the pleural surface, and can it detect non-solid and part-solid nodules? Was it optimized for scans with thick or thin slices, and are other technical scan parameters expected to influence detection performance?
- If the algorithm has been tested on other databases, you could consider including those results.
How often can I submit results?¶
Teams can upload as often as they want, but each new submission is expected to be substantially different from previous submissions. If you have developed a new algorithm and are already listed on the site with another system, you can either use the same team or create a new team for your new algorithm.