There are various datasets available for lung cancer, both for classification and segmentation tasks. Here are some of the publicly available datasets for ML training:
The LIDC-IDRI (Lung Image Database Consortium and Image Database Resource Initiative) dataset contains thoracic CT scans with lung nodules annotated by radiologists. The dataset consists of 1,018 cases, and it is available through The Cancer Imaging Archive (TCIA), which is funded by the National Cancer Institute.
The NSCLC Radiogenomics dataset is a collection of imaging and genomic data from patients with non-small cell lung cancer (NSCLC). It includes CT scans, PET/CT scans, and genomic data from 211 patients, and it is available through The Cancer Imaging Archive (TCIA).
The Kaggle Data Science Bowl 2017 dataset is a collection of low-dose chest CT scans provided by the National Cancer Institute and used as a competition dataset on Kaggle. The dataset includes over 1,000 scans, each labeled at the patient level with a lung cancer diagnosis rather than with per-nodule annotations.
The TCGA (The Cancer Genome Atlas) Lung Adenocarcinoma dataset includes genomic data from patients with lung adenocarcinoma. The dataset includes gene expression data, methylation data, and clinical data, and it is available through the Genomic Data Commons (GDC).
The DeepLesion dataset contains CT scan images of lesions, including lung nodules, annotated by radiologists. The dataset consists of 32,735 lesions from 10,594 CT studies of 4,427 patients, and it is available through the NIH Clinical Center.
These are only a few of the many publicly available datasets for lung cancer ML training. Some of them restrict how the data may be used, so read and understand each dataset’s licensing and usage terms before training on it.
LUNA16: The LUng Nodule Analysis 2016 dataset consists of 888 CT scans with annotations of lung nodules, selected from LIDC-IDRI. It is available through the LUNA16 Challenge.
OPTIMAS: The Optimization of Radiation Therapy for Individualized Systemic Adaptive Treatment of Lung Cancer dataset contains clinical data from patients with lung cancer, including tumor response to treatment. It is available through The Cancer Imaging Archive (TCIA).
ImageCLEFmed: The ImageCLEFmed dataset includes CT scans of the lung, annotated with tumor location and type. It is available through the ImageCLEFmed challenge.
SPIE-AAPM-NCI: The SPIE-AAPM-NCI Lung Nodule Classification Challenge (LUNGx) dataset consists of CT scans of the lung with nodules labeled as benign or malignant. It is hosted on The Cancer Imaging Archive (TCIA) as the SPIE-AAPM Lung CT Challenge collection.
SEER: The Surveillance, Epidemiology, and End Results (SEER) database contains clinical and demographic data from patients with lung cancer. It is available through the National Cancer Institute.
CBIS-DDSM: The Curated Breast Imaging Subset of DDSM dataset includes mammography images of the breast with annotations of masses and calcifications. It is a breast cancer dataset and contains no lung imaging. It is available through The Cancer Imaging Archive (TCIA).
TUPAC16: The Tumor Proliferation Assessment Challenge 2016 (TUPAC16) dataset contains histological whole-slide images of breast cancer, annotated with tumor proliferation scores. It does not contain lung cancer samples. It is available through the TUPAC16 challenge.
CPTAC: The Clinical Proteomic Tumor Analysis Consortium (CPTAC) Lung Adenocarcinoma dataset includes proteomic and phosphoproteomic data from patients with lung adenocarcinoma. It is available through the Proteomic Data Commons (PDC).
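Lung1: The Lung1 dataset is the cohort published as NSCLC-Radiomics. It includes CT scans of 422 patients with non-small cell lung cancer, along with tumor delineations. It is available through The Cancer Imaging Archive (TCIA), not through the Lung Image Database Consortium (LIDC).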
NSCLC-Radiomics: The Non-Small Cell Lung Cancer Radiomics dataset includes pretreatment CT scans of lung tumors from 422 patients, along with manual tumor delineations and clinical outcome data from which radiomic features can be computed. It is available through The Cancer Imaging Archive (TCIA).
LIDC-IDRI-CT-Radiomics: The LIDC-IDRI CT Radiomics dataset includes CT scans of lung nodules, along with radiomic features extracted from the scans. It is available through the Cancer Imaging Archive (TCIA).
RIDER: The Reference Image Database to Evaluate Therapy Response (RIDER) Lung CT dataset includes same-day repeat CT scans of patients with non-small cell lung cancer, intended for measuring how much tumor measurements vary between scans. It is available through The Cancer Imaging Archive (TCIA).
CDMATH: The Computational Development of Mathematics for Application in Tumor Healthcare (CDMATH) dataset includes CT scans of the lung, along with segmentation masks of the tumors. It is available through the Cancer Imaging Archive (TCIA).
QIN-HEADNECK: The Quantitative Imaging Network (QIN) Head and Neck dataset includes PET/CT scans of patients with head and neck cancer, along with segmentation masks of the tumors. It does not cover lung cancer. It is available through The Cancer Imaging Archive (TCIA).
QIN-LUNG: The Quantitative Imaging Network (QIN) Lung dataset includes CT scans of the lung, along with segmentation masks of the tumors. It is available through the Cancer Imaging Archive (TCIA).
QIN-PANCAN: The Quantitative Imaging Network (QIN) Pancreatic dataset includes CT scans of the pancreas, along with segmentation masks of the tumors. It is available through the Cancer Imaging Archive (TCIA).
NSCLC-Radiomics-Genomics: The NSCLC Radiomics-Genomics dataset includes imaging data from CT scans, along with genomic data from patients with NSCLC. It is available through the Cancer Imaging Archive (TCIA).
ISIC: The International Skin Imaging Collaboration (ISIC) Archive contains dermoscopic images of skin lesions for skin cancer research. It does not include CT scans of the lung or lung nodule annotations. It is available through the ISIC Archive.
INbreast: The INbreast dataset includes mammography images of the breast, annotated with breast lesions such as masses and calcifications. It is a breast imaging dataset and contains no lung data. It is available through the INbreast website.
JSRT: The Japanese Society of Radiological Technology (JSRT) dataset includes chest radiography images, annotated with lung nodules. It is available through the JSRT website.
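CQ500: The CQ500 dataset from Qure.ai includes non-contrast CT scans of the head collected at centers in India, labeled by radiologists for findings such as intracranial hemorrhage and skull fractures. It does not contain chest CT or lung nodule annotations. It is available through the CQ500 website.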
OASIS: The Open Access Series of Imaging Studies (OASIS) dataset includes brain MRI scans collected for research on aging and Alzheimer's disease. It does not contain lung imaging. It is available through the OASIS website.
ANODE09: The ANODE09 dataset includes CT scans of the lung, annotated with lung nodules. It is available through the ANODE09 website.
LCP: The Lung Cancer Prognosis dataset includes clinical data from patients with lung cancer, including survival time and disease stage. It is available through the UCI Machine Learning Repository.
LUNGx Challenge: The LUNGx Challenge dataset includes CT scans of the lung, along with annotations of lung nodules. It is available through the LUNGx Challenge website.
NSCLC-Radiomics-RESP: The Non-Small Cell Lung Cancer Radiomics-Response dataset includes CT scans of lung tumors, along with radiomic features extracted from the scans and response to treatment. It is available through the Cancer Imaging Archive (TCIA).
NSCLC-Radiomics-Genomics-KRAS: The Non-Small Cell Lung Cancer Radiomics-Genomics-KRAS dataset includes CT scans of lung tumors, along with radiomic features extracted from the scans and genomic data related to the KRAS gene mutation. It is available through the Cancer Imaging Archive (TCIA).
NSCLC-Radiomics-Genomics-EGFR: The Non-Small Cell Lung Cancer Radiomics-Genomics-EGFR dataset includes CT scans of lung tumors, along with radiomic features extracted from the scans and genomic data related to the EGFR gene mutation. It is available through the Cancer Imaging Archive (TCIA).
DREAM Challenge: The Digital Mammography DREAM Challenge dataset includes screening mammography images of the breast, labeled for breast cancer status. It does not contain lung imaging. It is available through the DREAM Challenge website.
DIAGNijmegen: The DIAGNijmegen dataset includes CT scans of the lung, annotated with lung nodules. It is available through the DIAGNijmegen website.
Interobserver1: The Interobserver1 dataset includes CT scans of the lung, annotated with lung nodules by multiple observers. It is available through the Lung Image Database Consortium (LIDC).
Interobserver2: The Interobserver2 dataset includes CT scans of the lung, annotated with lung nodules by multiple observers. It is available through the Lung Image Database Consortium (LIDC).
LUNGxAI: The LUNGxAI dataset includes CT scans of the lung, along with annotations of lung nodules. It is available through the LUNGxAI website.
CBICA: The Center for Biomedical Image Computing and Analytics (CBICA) Lung dataset includes CT scans of the lung, along with annotations of lung nodules. It is available through the CBICA website.
NSCLC-Radiomics-Survival: The Non-Small Cell Lung Cancer Radiomics-Survival dataset includes CT scans of lung tumors, along with radiomic features extracted from the scans and survival data. It is available through the Cancer Imaging Archive (TCIA).
NSCLC-Radiomics-OverallSurvival: The Non-Small Cell Lung Cancer Radiomics-OverallSurvival dataset includes CT scans of lung tumors, along with radiomic features extracted from the scans and overall survival data. It is available through the Cancer Imaging Archive (TCIA).
CPTAC-LSCC: The Clinical Proteomic Tumor Analysis Consortium (CPTAC) Lung Squamous Cell Carcinoma dataset includes proteomic and phosphoproteomic data from patients with lung squamous cell carcinoma. It is available through the Proteomic Data Commons (PDC).
LUNGx Image Collection: The LUNGx Image Collection includes CT scans of the lung, along with annotations of lung nodules. It is available through the LUNGx Challenge website.
LUSC: The Lung Squamous Cell Carcinoma dataset (TCGA-LUSC) includes genomic data from patients with lung squamous cell carcinoma. It is available through The Cancer Genome Atlas (TCGA).
LUAD: The Lung Adenocarcinoma dataset (TCGA-LUAD) includes genomic data from patients with lung adenocarcinoma. It is available through The Cancer Genome Atlas (TCGA).
ADC: The Adenocarcinoma dataset includes genomic data from patients with lung adenocarcinoma. It is available through The Cancer Genome Atlas (TCGA).
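LC25000: The LC25000 dataset includes 25,000 histopathological images of lung and colon tissue in five classes, three of them lung (benign tissue, adenocarcinoma, and squamous cell carcinoma). It contains no CT scans. It is distributed by its authors and mirrored on Kaggle rather than hosted on the TCIA.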
Lung2: The Lung2 dataset includes CT scans of the lung, annotated with lung nodules. It is available through the LIDC.
Lung3: The Lung3 dataset is the cohort published as NSCLC-Radiomics-Genomics. It includes CT scans of 89 patients with NSCLC, along with gene expression data. It is available through the TCIA.
Lung4: The Lung4 dataset includes CT scans of the lung, annotated with lung nodules. It is available through the LIDC.
TCIA-Pancreas: The Pancreas-CT dataset includes contrast-enhanced abdominal CT scans, along with manual segmentations of the pancreas. It does not contain lung imaging. It is available through the TCIA.
CPTAC-LUAD: The Clinical Proteomic Tumor Analysis Consortium (CPTAC) Lung Adenocarcinoma dataset includes proteomic and phosphoproteomic data from patients with lung adenocarcinoma. It is available through the PDC.
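Several of the CT collections above, LUNA16 among them, give nodule positions in physical (world) coordinates in millimeters rather than as voxel indices, so the annotations have to be mapped onto the image grid before patches can be cropped for training. The following is a minimal sketch of that mapping in Python. It assumes an axis-aligned scan (identity direction matrix) whose origin and voxel spacing have already been read from the image header. The names ScanGeometry, world_to_voxel, and voxel_to_world are hypothetical helpers for illustration and do not belong to any dataset's official tooling.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ScanGeometry:
    """Hypothetical container for the header values needed for the mapping."""
    origin: Tuple[float, float, float]   # world position of voxel (0, 0, 0), in mm (x, y, z)
    spacing: Tuple[float, float, float]  # voxel size in mm along x, y, z


def world_to_voxel(world_xyz, geometry):
    """Map a world coordinate (mm) to the nearest voxel index.

    Assumes the scan axes are aligned with the world axes; a scan with a
    non-identity direction matrix needs an extra rotation step.
    """
    return tuple(
        round((w - o) / s)
        for w, o, s in zip(world_xyz, geometry.origin, geometry.spacing)
    )


def voxel_to_world(voxel_xyz, geometry):
    """Inverse mapping: voxel index back to a world coordinate (mm)."""
    return tuple(
        v * s + o
        for v, o, s in zip(voxel_xyz, geometry.origin, geometry.spacing)
    )


if __name__ == "__main__":
    # Made-up geometry and nodule position, for illustration only.
    geometry = ScanGeometry(origin=(-200.0, -150.0, -300.0), spacing=(0.5, 0.5, 2.0))
    nodule_world = (-100.0, -50.0, -100.0)
    print(world_to_voxel(nodule_world, geometry))  # (200, 200, 100)
```

The header values and the annotation row would come from whatever reader is used for the dataset's image format. Note that many array libraries index volumes as (z, y, x), so the resulting index may need to be reversed before slicing.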