Auditing DeepCDR: Data Bugs in Cancer Drug Response Prediction
An automated Natural Joints audit of DeepCDR (Liu et al., Bioinformatics 2020) — a widely used cancer drug response prediction model — uncovered critical data integrity issues in the underlying GDSC dataset.
Findings
Training Cancer Drug Models on Contradictory Labels
Fifteen drugs have IC50 measurements from two GDSC screening campaigns, but both sets of values are classified with a single threshold. Each of these drug names appears twice in the threshold file, and loading the file into a Python dict overwrites one campaign’s threshold with the other’s, so one campaign’s threshold ends up classifying the other campaign’s values. The result is 1,808 contradictory training labels: the same drug on the same cell line labeled both sensitive and resistant. We verified the same issue in GraphCDR (Briefings in Bioinformatics 2022), which ships the same data.
Affected drugs include Afatinib (FDA-approved for lung cancer), Olaparib (BRCA cancers), Selumetinib (neurofibromatosis), and Bicalutamide (prostate cancer).
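The overwrite mechanism can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the campaign names, the threshold and measurement values, the cell line ID, and the labeling rule (a value below the threshold counts as sensitive) are hypothetical and not taken from the DeepCDR code or the GDSC data.

```python
from collections import defaultdict

# Hypothetical header row and threshold row: "Afatinib" appears once per campaign.
names = ["Afatinib", "Olaparib", "Afatinib"]
campaigns = ["campaign_1", "campaign_1", "campaign_2"]
thresholds = [2.0, 3.0, -1.5]  # illustrative numbers only

# dict() keeps the last value for a repeated key, so campaign_1's 2.0 is lost.
overwritten = dict(zip(names, thresholds))

# Keying on (drug, campaign) preserves both thresholds.
per_campaign = {(n, c): t for n, c, t in zip(names, campaigns, thresholds)}


def classify(value, threshold):
    # Assumed labeling rule for this sketch: below the threshold means sensitive.
    return "sensitive" if value < threshold else "resistant"


# Two hypothetical measurements of the same drug on the same cell line.
measurements = [
    ("Afatinib", "CELL_A", "campaign_1", 1.0),
    ("Afatinib", "CELL_A", "campaign_2", -2.0),
]

buggy = defaultdict(set)
fixed = defaultdict(set)
for drug, cell, campaign, value in measurements:
    buggy[(drug, cell)].add(classify(value, overwritten[drug]))
    fixed[(drug, cell)].add(classify(value, per_campaign[(drug, campaign)]))

# A (drug, cell line) pair is contradictory if it carries more than one label.
contradictory = [pair for pair, labels in buggy.items() if len(labels) > 1]
print("contradictory with overwritten thresholds:", contradictory)
print("labels with per-campaign thresholds:", dict(fixed))
```

In this toy example the overwritten lookup labels the pair both sensitive and resistant, while the per-campaign lookup gives a single consistent label. Counting pairs with more than one label, as `contradictory` does, is one plausible way to arrive at a figure like the 1,808 reported above, though the audit's exact procedure is not described here.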
Cisplatin Mapped to the Wrong Molecule
The drug reference table maps Cisplatin, one of the most prescribed chemotherapy drugs in the world, to PubChem CID 84691, which is Transplatin (the biologically inactive trans isomer). The correct CID is 441203. Any model that derives molecular graph features from this entry learned the structure of the wrong compound. The fact-check also found 24 additional factual errors in the drug metadata: 3 more wrong PubChem CIDs, 3 invalid SMILES strings, 12 wrong target annotations, and naming errors.
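A check of this kind can be sketched as a cross-reference between the drug table and a small trusted lookup. This is a hedged illustration rather than the audit's actual method: the function name `find_cid_mismatches`, the `TRUSTED_CIDS` table, and the second sample row are hypothetical, and only the two Cisplatin CIDs come from the text above.

```python
# Hand-curated reference; only the Cisplatin entry comes from the finding above.
TRUSTED_CIDS = {"Cisplatin": "441203"}


def find_cid_mismatches(rows, trusted=TRUSTED_CIDS):
    """Return (name, listed_cid, expected_cid) for rows that disagree with the reference.

    Rows whose drug name is absent from the reference are skipped, not flagged.
    """
    mismatches = []
    for name, listed_cid in rows:
        expected = trusted.get(name)
        listed = listed_cid.strip()
        if expected is not None and listed != expected:
            mismatches.append((name, listed, expected))
    return mismatches


# Hypothetical (name, CID) pairs as they might be read from a drug table.
rows = [
    ("Cisplatin", "84691"),        # the CID reported in the drug reference table
    ("PlaceholderDrug", "12345"),  # made-up row with no reference entry
]

for name, listed, expected in find_cid_mismatches(rows):
    print(f"{name}: table lists CID {listed}, reference says {expected}")
```

The reference table is the weak point of this approach: it only catches errors for drugs someone has already verified by hand, so it complements rather than replaces a lookup against PubChem itself.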
Method
We used natural-joints to:
- Enrich the dataset — profiled 38,184 columns across 11 files, reduced to 112 representative columns, resolved 100% in 31 minutes
- Audit the analysis code — 6 parallel auditors (3 blind, 3 with enrichment context) found 17 issues
- Fact-check the drug reference tables — automated verification against domain knowledge found 25 factual errors
The threshold mapping bug was found by the enriched auditors (not the blind ones) — understanding that IC50 thresholds are “cutoff values for classifying sensitive vs resistant” was the key context that led to investigating the sign-flip.
Impact
- DeepCDR: ~150 citations, 104 GitHub stars, 29 forks
- Data reused by GraphCDR, GraTransDRP, DeepDR, JDACS4C-IMPROVE, and others
- Still actively benchmarked in 2025-2026 publications (Nature Communications Chemistry, Advanced Science, Scientific Reports)
- GDSC data is the standard benchmark for the cancer drug response prediction field
Reproducing
git clone https://github.com/kimmo1019/DeepCDR
cd DeepCDR
# Verify the threshold bug
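python3 -c "
from collections import defaultdict

# The first line holds drug names, the second the matching thresholds.
with open('data/CCLE/IC50_thred.txt') as f:
    names = f.readline().strip().split('\t')
    vals = f.readline().strip().split('\t')

# Collect every threshold listed under each drug name.
name_vals = defaultdict(list)
for n, v in zip(names, vals):
    name_vals[n].append(float(v))

# Print drugs whose name appears more than once with different thresholds.
for n, vs in name_vals.items():
    if len(vs) > 1 and vs[0] != vs[1]:
        print(f'{n}: {vs[0]:+.3f} vs {vs[1]:+.3f}')
"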
# Verify the Cisplatin CID error
python3 -c "
import csv

# As in the original check, index 1 is the drug name and index 5 the PubChem CID.
with open('data/GDSC/1.Drug_listMon Jun 24 09_00_55 2019.csv', newline='') as f:
    for row in csv.reader(f):
        if row[1] == 'Cisplatin':
            print(f'{row[1]}: PubChem CID {row[5]} = Transplatin (WRONG)')
            print('Correct CID: 441203 (Cisplatin)')
"
# Run Natural Joints + fact-check
# Install from natural-joints.com
natural-joints enrich data/ -v
natural-joints fact-check data/ -v
Disclosure
We did not contact the DeepCDR authors or the GDSC database maintainers before publishing this analysis. The code and data are publicly available. The threshold mapping bug and PubChem CID errors likely originate in the GDSC database rather than in DeepCDR’s code, which inherited them.
Audit conducted April 2026 using natural-joints.