Training Cancer Drug Models on Contradictory Labels: A Threshold Mapping Bug in GDSC Data
The One-Line Summary
Fifteen cancer drugs in a widely used pharmacogenomics dataset have IC50 measurements from two different screening campaigns, but only one sensitivity threshold. The code applies one campaign’s threshold to the other campaign’s values, producing opposite sensitivity labels for the same drug on the same cell line. Models trained on this data learn that a cell is simultaneously sensitive and resistant to the same drug.
The Dataset
The Genomics of Drug Sensitivity in Cancer (GDSC) project screened ~1,000 cancer cell lines against ~200 drugs, measuring the IC50: the drug concentration that reduces cell viability by half. A lower IC50 means the cell line is more sensitive to the drug. These measurements are the foundation of cancer drug response prediction, a field with hundreds of published models.
DeepCDR (Liu et al., Bioinformatics 2020) packaged GDSC data into a clean GitHub repository with pre-processed files ready for machine learning. The repo has been cloned, forked, and reused by dozens of downstream projects. The data includes:
- GDSC_IC50.csv: 266 drugs × 969 cell lines, with log-transformed IC50 values
- IC50_thred.txt: per-drug thresholds for classifying cells as sensitive or resistant
- 1.Drug_listMon Jun 24 09_00_55 2019.csv: drug metadata mapping drug IDs to names and PubChem compound IDs
The Bug
GDSC ran two screening campaigns — GDSC1 and GDSC2 — with different dose ranges and assay protocols. For 15 drugs, both campaigns produced IC50 data, and both sets of values are in the IC50 matrix under different drug IDs.
Take JQ1, a BET bromodomain inhibitor widely studied in cancer epigenetics:
Drug list CSV:
drug_id=163 Name="JQ1" PubCHEM=46907787 (GDSC1 screen)
drug_id=1218 Name="JQ1" PubCHEM=46907787 (GDSC2 screen)
Same drug name. Same PubChem ID. Two different rows in the IC50 matrix with very different value distributions:
IC50 matrix:
GDSC:163 (GDSC1): 865 cell lines, median IC50 = 0.05
GDSC:1218 (GDSC2): 895 cell lines, median IC50 = +2.28
The medians differ by 2.2 log-units — the two screens measured fundamentally different IC50 distributions for the same compound. This is expected: different dose ranges and protocols produce different absolute IC50 values.
Now look at the threshold file. It has drug names as column headers and threshold values in a single row:
IC50_thred.txt:
Column 57: "JQ1" threshold = -1.898
Column 238: "JQ1" threshold = +1.351
JQ1 appears twice: once with threshold -1.898 (calibrated for GDSC1) and once with +1.351 (calibrated for GDSC2). The two thresholds have opposite signs and differ by about 3.2 log-units. The classification rule is: if IC50 < threshold, the cell is sensitive.
The code builds a Python dictionary from this file:
# Line 82 of run_DeepCDR_classify.py
drug2thred = {name2pubchemid[a]:b
              for a,b in zip(drug_names, IC50_threds)
              if a in name2pubchemid.keys()}
The dict comprehension iterates left to right. Both “JQ1” entries match the same key (name2pubchemid["JQ1"] = "46907787"). The second value (+1.351) overwrites the first (-1.898):
Iteration 1: drug2thred["46907787"] = -1.898 (written)
Iteration 2: drug2thred["46907787"] = +1.351 (OVERWRITES)
Final: drug2thred["46907787"] = +1.351
Now both GDSC1 and GDSC2 IC50 values are classified against the GDSC2 threshold (+1.351). And both drug rows map to the same PubChem ID, so both produce training instances with identical molecular graph features.
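The overwrite can be reproduced in isolation. The following is a minimal sketch that uses only the JQ1 values quoted above; the variable names mirror the DeepCDR snippet, while `build_thresholds_strict` is a hypothetical helper (not part of DeepCDR) showing one way such a collision could be made to fail loudly instead of silently.

```python
# JQ1 values as quoted in this article; names mirror the DeepCDR snippet.
name2pubchemid = {"JQ1": "46907787"}
drug_names = ["JQ1", "JQ1"]      # the name appears twice in the header row
IC50_threds = [-1.898, 1.351]    # GDSC1-calibrated, then GDSC2-calibrated

# Same pattern as the original comprehension: a later duplicate key wins silently.
drug2thred = {name2pubchemid[a]: b
              for a, b in zip(drug_names, IC50_threds)
              if a in name2pubchemid}
print(drug2thred)  # {'46907787': 1.351}


def build_thresholds_strict(names, thresholds, name_to_id):
    """Hypothetical guard: refuse to overwrite a key with a different value."""
    out = {}
    for name, value in zip(names, thresholds):
        if name not in name_to_id:
            continue
        key = name_to_id[name]
        if key in out and out[key] != value:
            raise ValueError(
                f"conflicting thresholds for {name} ({key}): {out[key]} vs {value}"
            )
        out[key] = value
    return out
```

Calling `build_thresholds_strict(drug_names, IC50_threds, name2pubchemid)` on the JQ1 inputs raises a `ValueError`, which would have surfaced the duplicate at load time.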
The Concrete Consequence
For cell line ACH-000001:
From GDSC:163 (GDSC1 screen):
ln_IC50 = +0.030
Classification: 0.030 < 1.351 → SENSITIVE
From GDSC:1218 (GDSC2 screen):
ln_IC50 = +2.905
Classification: 2.905 < 1.351 is FALSE → RESISTANT
Same cell line. Same drug molecular graph (PubChem 46907787). The model receives two training instances:
Input: (ACH-000001 genomics, JQ1 graph) → Label: SENSITIVE
Input: (ACH-000001 genomics, JQ1 graph) → Label: RESISTANT
Identical features. Opposite labels. The model’s gradient for this drug on this cell line cancels out — it cannot learn anything meaningful from contradictory supervision.
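The cancellation can be made concrete with a small worked example. This sketch assumes a sigmoid output trained with binary cross-entropy, a common setup for binary classifiers that this article does not confirm for DeepCDR's code. Because the two instances share the same input, the model emits the same probability p for both, and the summed gradient with respect to the logit is (p - 1) + (p - 0) = 2p - 1, which is zero at p = 0.5.

```python
import math


def bce(p, y):
    """Binary cross-entropy for one predicted probability p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def pair_loss(p):
    # Same input, so the model emits the same p for both instances;
    # one is labeled sensitive (1), the other resistant (0).
    return bce(p, 1) + bce(p, 0)


def pair_grad_wrt_logit(p):
    # For a sigmoid output, d(BCE)/d(logit) = p - y; summed over y=1 and y=0.
    return (p - 1) + (p - 0)


# Scan a grid of probabilities and pick the one with the lowest combined loss.
candidates = [i / 100 for i in range(1, 100)]
best = min(candidates, key=pair_loss)
print(best, pair_grad_wrt_logit(best))  # 0.5 0.0
```

Under these assumptions, the best the model can do for such a pair is to predict 0.5, which carries no information about sensitivity.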
For JQ1, this label conflict occurs for 48 of 805 shared cell lines (6.0%).
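A count like this can be obtained by labeling both screens against the single surviving threshold and comparing the labels on shared cell lines. The sketch below is illustrative only: `label` and `count_conflicts` are hypothetical helpers, the ACH-000001 values are the ones quoted above, and the `CELL-*` entries are made up to exercise the logic.

```python
def label(ln_ic50, threshold):
    """Classification rule from the article: sensitive if IC50 < threshold."""
    return "SENSITIVE" if ln_ic50 < threshold else "RESISTANT"


def count_conflicts(screen_a, screen_b, threshold):
    """Return (number of shared cell lines, cell lines whose two labels disagree)."""
    shared = sorted(set(screen_a) & set(screen_b))
    conflicts = [c for c in shared
                 if label(screen_a[c], threshold) != label(screen_b[c], threshold)]
    return len(shared), conflicts


# ACH-000001 values are from the article; CELL-* entries are invented for illustration.
gdsc1 = {"ACH-000001": 0.030, "CELL-X": -0.5, "CELL-Y": 1.9}
gdsc2 = {"ACH-000001": 2.905, "CELL-X": 0.8, "CELL-Z": 3.1}

n_shared, conflicts = count_conflicts(gdsc1, gdsc2, threshold=1.351)
print(n_shared, conflicts)  # 2 ['ACH-000001']
```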
Across all 15 affected drugs:
| Drug | Shared Cells | Conflicts | Rate |
|---|---|---|---|
| Avagacestat | 851 | 490 | 57.6% |
| Selumetinib | 771 | 181 | 23.5% |
| AKT inhibitor VIII | 849 | 153 | 18.0% |
| CHIR-99021 | 802 | 136 | 17.0% |
| PLX-4720 | 797 | 121 | 15.2% |
| GSK269962A | 354 | 52 | 14.7% |
| BMS-536924 | 354 | 49 | 13.8% |
| AZD6482 | 788 | 108 | 13.7% |
| Afatinib | 793 | 105 | 13.2% |
| Olaparib | 784 | 87 | 11.1% |
| Pictilisib | 758 | 82 | 10.8% |
| Bicalutamide | 761 | 67 | 8.8% |
| Refametinib | 750 | 65 | 8.7% |
| UNC0638 | 852 | 64 | 7.5% |
| JQ1 | 805 | 48 | 6.0% |
| Total | 11,069 | 1,808 | 16.3% |
That is 1,808 cell line/drug pairs for which the model is told the same input produces two opposite outputs.
These are not obscure compounds. Afatinib is an FDA-approved EGFR inhibitor for lung cancer. Olaparib is approved for BRCA-mutant cancers. Selumetinib is approved for neurofibromatosis. Bicalutamide is a standard treatment for prostate cancer.
The Downstream Propagation
We verified that GraphCDR (Liu et al., Briefings in Bioinformatics 2022) ships with the same drug list file (identical filename, identical content) and the same IC50 matrix with both screening campaigns present. GraphCDR reformatted the threshold file to avoid the overwrite — it uses PubChem ID as the key directly and keeps only the GDSC1 threshold. But the IC50 matrix still has both GDSC1 and GDSC2 values, and both get classified against the GDSC1 threshold.
The result: GraphCDR trains on 1,808 contradictory labels across the same 15 drugs.
Any project that downloaded DeepCDR’s data directory and used the IC50 matrix for binary classification inherits this bug. The DeepCDR paper has ~150 citations. The repo has 29 forks. The data flows downstream through:
- GraphCDR (Briefings in Bioinformatics 2022) — verified: same files, same bug
- GraTransDRP, GraOmicDRP — follow-on architectures
- JDACS4C-IMPROVE — NCI’s benchmark framework (forked DeepCDR, but switched to different data)
- DeepDR (Bioinformatics 2024) — drug response library including DeepCDR as baseline
- A 2026 benchmarking study in Nature Communications Chemistry
How We Found It
We used Natural Joints to profile the dataset. Natural Joints processed 38,184 columns across 11 files — including the 34,674-column binary mutation matrix — reducing them to 112 representative columns for analysis in 31 minutes.
We then ran a parallel audit: three auditors working blind (without enrichment) and three working with the full semantic layer. The blind auditors found the duplicate drugs and noted the contradictory IC50 values, but they didn’t investigate the threshold file’s behavior. The enriched auditors, armed with the enrichment’s description of what the threshold values mean (“standard IC50 threshold values used to determine drug sensitivity or resistance”), traced the full chain from threshold file → dict overwrite → conflicting labels.
Separately, Natural Joints’s automated fact-checking module flagged 25 factual errors in the drug reference data — including that Cisplatin’s PubChem ID points to Transplatin, the biologically inactive isomer. This means the molecular graph features for one of the most important chemotherapy drugs in the world encode the wrong molecule. That finding, along with 3 other wrong PubChem CIDs, 3 invalid SMILES strings, and 12 wrong target annotations, compounds the threshold mapping bug: not only are labels contradictory for 15 drugs, but structural features are wrong for 10 drugs.
The Fix
For the threshold/screening bug:
# Option 1: Deduplicate IC50 matrix — keep one screen per drug
# For each PubChem ID with multiple GDSC rows, keep the more recent screen
# and drop the older one
# Option 2: Use screen-specific thresholds
# Map each GDSC drug_id to its corresponding threshold,
# not just the PubChem ID. This requires knowing which threshold
# row corresponds to which screen.
# Option 3: Average duplicate measurements
# For shared cell lines, average the IC50 values from both screens
# and apply a single threshold to the averaged values
Option 1 is the simplest and safest. The duplicate rows exist because GDSC rescreened compounds — the newer screen is generally more reliable.
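A minimal sketch of Option 1 follows. It assumes each drug row can be tagged with its screening campaign; the `(drug_id, pubchem_id, screen)` tuple layout, the `SCREEN_RANK` table, and the function name are hypothetical and not taken from the DeepCDR files, which may need a separate lookup to recover the campaign for each drug ID.

```python
from collections import defaultdict

# Assumption: later campaigns rank higher. Extend if other labels exist.
SCREEN_RANK = {"GDSC1": 1, "GDSC2": 2}


def select_one_screen_per_drug(drug_rows):
    """Pick one drug_id per PubChem ID, preferring the most recent screen.

    drug_rows: iterable of (drug_id, pubchem_id, screen) tuples.
    Returns the set of drug_ids to keep in the IC50 matrix.
    """
    by_pubchem = defaultdict(list)
    for drug_id, pubchem_id, screen in drug_rows:
        by_pubchem[pubchem_id].append((SCREEN_RANK[screen], drug_id))
    # max() compares the rank first, so the highest-ranked screen wins.
    return {max(candidates)[1] for candidates in by_pubchem.values()}


rows = [
    ("163", "46907787", "GDSC1"),   # JQ1, as listed in the article
    ("1218", "46907787", "GDSC2"),  # JQ1 rescreen
    ("9999", "00000000", "GDSC1"),  # hypothetical drug screened only once
]
keep = select_one_screen_per_drug(rows)
print(sorted(keep))  # ['1218', '9999']
```

After filtering, the threshold retained for each PubChem ID has to be the one calibrated for the screen that was kept; otherwise the mismatch simply moves to the other campaign.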
For the PubChem CID errors: correct the 4 wrong CIDs, fix the 3 CID-SMILES mismatches, fix the 3 invalid SMILES strings, and regenerate all molecular graph features.
The Broader Point
This bug is invisible at every level of standard ML development practice:
- Code review wouldn’t catch it — the dict comprehension is syntactically correct
- Unit tests wouldn’t catch it — the function returns valid data structures
- Training metrics wouldn’t catch it — the model converges and produces reasonable Pearson correlations (contradictory labels add noise but don’t prevent learning)
- Peer review didn’t catch it — across ~150 citing papers and multiple reimplementations
The bug lives in the gap between what the data means and what the code does with it. The threshold file has two entries for JQ1 with opposite signs. The IC50 matrix has two sets of values from different screens. The code stitches them together through a PubChem ID that doesn’t distinguish between screens. Understanding why this is wrong requires knowing that IC50 thresholds are screen-specific, that the same PubChem ID can have different IC50 distributions across screens, and that a sensitivity label should only be computed from matching threshold-value pairs.
This is exactly what Natural Joints catches. Natural Joints understands that IC50_thred.txt contains “threshold values used to determine drug sensitivity or resistance,” that GDSC_IC50.csv contains “log-transformed IC50 concentrations,” and that the Drug_list maps drug IDs to PubChem compound identifiers. An auditor reading these descriptions immediately sees the chain: threshold → classification → which values it applies to → whether the screens match.
Without that context, the threshold file is just 265 numbers with drug names. The IC50 matrix is just 266 rows of floating-point values. The drug list is just an ID mapping table. The bug is invisible because the data’s meaning is invisible.
Reproducing
# Clone and verify
git clone https://github.com/kimmo1019/DeepCDR
cd DeepCDR
# Check for duplicate drugs in the drug list
python3 -c "
import csv
from collections import Counter
reader = csv.reader(open('data/GDSC/1.Drug_listMon Jun 24 09_00_55 2019.csv'))
rows = list(reader)[1:]
pubchem_counts = Counter()
for row in rows:
    if row[5].isdigit():
        pubchem_counts[row[5]] += 1
dupes = {k:v for k,v in pubchem_counts.items() if v > 1}
print(f'{len(dupes)} drugs with duplicate PubChem IDs:')
for pid, count in dupes.items():
    names = [r[1] for r in rows if r[5] == pid]
    ids = [r[0] for r in rows if r[5] == pid]
    print(f'  PubChem {pid}: {names[0]} (drug_ids: {ids})')
"
# Check for duplicate thresholds with different values
python3 -c "
from collections import defaultdict
with open('data/CCLE/IC50_thred.txt') as f:
    names = f.readline().strip().split('\t')
    vals = f.readline().strip().split('\t')
name_vals = defaultdict(list)
for n, v in zip(names, vals):
    name_vals[n].append(float(v))
for n, vs in name_vals.items():
    if len(vs) > 1 and vs[0] != vs[1]:
        print(f'{n}: {vs[0]:+.3f} vs {vs[1]:+.3f} (diff={abs(vs[0]-vs[1]):.3f})')
"
# Run Natural Joints + fact-check
# Install from natural-joints.com
natural-joints enrich data/ -v
natural-joints fact-check data/ -v