Background: True volumetric measurement of tumors and pathologies in CT scans requires the delineation of the structure boundaries. It is well known that different radiologists generate different delineations. The delineation variability depends on many factors, e.g., the structure of interest; the resolution, contrast, and quality of the scan; the radiologist's clinical experience; the time available for the task; and the radiologist's patience and dedication, among others. To properly assess segmentation algorithms and their performance, it is thus essential to quantify the inter- and intra-observer variability. While quantifying observer variability is recognized as a major issue by radiologists and technologists alike, very few large-scale studies have been conducted to actually quantify it.
Method: We conducted a large manual delineation study at the Hadassah University Medical Center to obtain ground-truth segmentation variability data and to quantify the radiologists' delineation variability. We retrospectively selected 18 CT studies, 5 of liver tumors, 5 of lung tumors, and 6 of left kidneys, with dimensions of 512×512×350–466 voxels and resolutions of 0.76–0.98 × 0.76–0.98 × 1–3.3 mm³. Manual delineations of 2,829 axial slices from the 18 CT scans were made by 8–11 clinicians with various levels of expertise. The data analysis focuses on the observer variability as a function of the number of annotators, on the variability by structure and by radiologist expertise, and on the discrepancies among the radiologists' annotations. The structure area/volume variability is defined as the difference between the union (possible) and the intersection (consensus) of the voxels inside the delineations.
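As an illustration of the variability measure above, a minimal sketch follows, assuming each observer's delineation is available as a binary voxel mask (a NumPy boolean array). The function name volume_variability and the normalization by the union volume are illustrative assumptions, not the study's actual analysis code.

    import numpy as np

    def volume_variability(masks):
        # masks: iterable of same-shaped boolean 3D arrays, one per observer.
        # Returns (|union| - |intersection|) / |union|, i.e., the fraction of the
        # "possible" volume that is not part of the "consensus" volume.
        # (Normalizing by the union volume is an assumption for illustration.)
        masks = np.asarray(list(masks), dtype=bool)           # shape: (n_observers, z, y, x)
        union = np.logical_or.reduce(masks, axis=0)           # voxels marked by any observer (possible)
        intersection = np.logical_and.reduce(masks, axis=0)   # voxels marked by all observers (consensus)
        return (union.sum() - intersection.sum()) / union.sum()

    # Usage example with three synthetic observer masks of a cube-shaped structure:
    rng = np.random.default_rng(0)
    base = np.zeros((20, 20, 20), dtype=bool)
    base[5:15, 5:15, 5:15] = True
    observers = [np.logical_xor(base, rng.random(base.shape) < 0.02) for _ in range(3)]
    print(f"volume variability: {volume_variability(observers):.1%}")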
Results: The contour delineation variability for the kidney, liver tumors, and lung tumors is 7%, 14%, and 16% for 3 observers, and 13%, 26%, and 32% for 10 observers, respectively. The volume variability convergence rate reaches 52%–56% for 3 annotators and up to 86%–89% for 7 annotators. Other quantitative measures by structure and by radiologist expertise, as well as statistical confidence intervals, have been computed from the delineation results.
Conclusion: The analysis of our results indicates that: 1) the observer variability may be larger than originally perceived; 2) two or even three observers usually do not suffice to properly quantify observer variability; 3) the observer variability converges more slowly than expected; 4) the variability differences between radiologists of different expertise and seniority are smaller than anticipated; 5) there are significant differences between the convergence rates for different types of structures; and 6) there are significant variability differences from case to case.