Cisplatin or Transplatin? A Drug ID Mapping Error in Cancer Drug Response Prediction
Summary
We ran Natural Joints and its automated fact-checker on a widely used cancer pharmacogenomics dataset. The fact-checker flagged that Cisplatin, one of the most prescribed chemotherapy drugs in the world, is mapped to the wrong PubChem compound ID. The ID in the dataset points to Transplatin, the biologically inactive trans isomer, not Cisplatin, the active cis isomer used to treat cancer.
This mapping error propagates through the drug’s molecular graph features. Any model trained on this dataset that uses chemical structure to predict drug response has learned the structural features of the wrong molecule for Cisplatin. The dataset has been downloaded and reused by dozens of published papers.
The Error
The dataset ships with a drug list file (1.Drug_listMon Jun 24 09_00_55 2019.csv) containing 266 anti-cancer drugs from the Genomics of Drug Sensitivity in Cancer (GDSC) project. Each drug is mapped to a PubChem Compound ID (CID), which is used to retrieve the drug’s molecular structure.
Cisplatin’s entry:
drug_id: 1005
Name: Cisplatin
PubCHEM: 84691
PubChem CID 84691 is Transplatin — trans-diamminedichloroplatinum(II). This is the geometrical isomer of Cisplatin where the two chloride ligands are on opposite sides of the platinum center (trans configuration). Transplatin is biologically inactive and has no anti-cancer activity.
The correct CID for Cisplatin (cis-diamminedichloroplatinum(II)) is 441203 (or 5702198 for the hydrated form). In Cisplatin, the two chloride ligands are on the same side (cis configuration), which is essential for its mechanism of action — forming intrastrand DNA crosslinks that trigger apoptosis.
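One way to check which compound a CID resolves to is PubChem's PUG REST interface, which can return a compound's title and IUPAC name as JSON. The following is a minimal sketch, not part of the original analysis: the helper names `title_url`, `parse_properties`, and `fetch_properties` are hypothetical, and the sample payload only mimics the response shape with a placeholder title rather than real PubChem output.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def title_url(cid: int) -> str:
    """Build the PUG REST URL that returns Title and IUPACName for one CID."""
    return f"{PUG_REST}/compound/cid/{cid}/property/Title,IUPACName/JSON"

def parse_properties(payload: str) -> dict:
    """Extract the first property record from a PropertyTable response."""
    return json.loads(payload)["PropertyTable"]["Properties"][0]

def fetch_properties(cid: int, timeout: float = 10.0) -> dict:
    """Query PubChem over the network (not exercised in the offline demo)."""
    with urlopen(title_url(cid), timeout=timeout) as resp:
        return parse_properties(resp.read().decode("utf-8"))

# Offline demo: placeholder payload with the same shape as a real response.
sample = (
    '{"PropertyTable": {"Properties": '
    '[{"CID": 84691, "Title": "<title returned by PubChem>"}]}}'
)
record = parse_properties(sample)
print(title_url(84691))
print(record["CID"], record["Title"])
```

Calling `fetch_properties(84691)` and `fetch_properties(441203)` and comparing the returned titles would reproduce the manual lookup described above.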
The difference matters because models like DeepCDR encode drug structure as a molecular graph. The graph for Transplatin has different bond geometry than Cisplatin. A Graph Convolutional Network (GCN) processing these two molecules will compute different node features, different adjacency patterns, and different learned representations. The model has learned to associate Transplatin’s structure with Cisplatin’s IC50 values.
How We Found It
We used natural-joints to profile and enrich the dataset, then ran its fact-checking module. The fact-checker identifies concept-grain reference tables (tables where each row is a named real-world entity with verifiable attributes) and validates the entries against domain knowledge.
The drug list table was selected as a concept-grain table. The fact-checker verified each drug’s attributes:
FAIL: PubChem CID 84691 corresponds to Transplatin
(trans-diamminedichloroplatinum(II)), which is the inactive
trans isomer. The correct CID for the active cis isomer
(Cisplatin) is 441203 or 5702198.
This was one of 25 factual errors the fact-checker found in the drug reference data, including:
- Entinostat: PubChem CID truncated (4261 instead of 4261125 — points to a completely different molecule)
- Rapamycin: PubChem CID typo (5384616 instead of 5284616)
- PD173074: Wrong CID entirely (1401 instead of 5384048)
- 4 drugs with chemically invalid SMILES strings (impossible bond configurations)
- Multiple drugs with wrong or incomplete target annotations (e.g., Methotrexate listing “Antimetabolite” as its target instead of DHFR)
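The wrong CIDs in this list do not all fail the same way, and sorting them by pattern can help with triage. The sketch below is a simple string comparison over the CID pairs reported above; the function `classify_cid_discrepancy` and its category labels are illustrative, not part of Natural Joints.

```python
def classify_cid_discrepancy(listed: str, expected: str) -> str:
    """Describe how a listed CID differs from the expected one."""
    if listed == expected:
        return "match"
    if expected.startswith(listed):
        # The listed CID is a prefix of the expected CID.
        return "truncation"
    if len(listed) == len(expected):
        diffs = sum(a != b for a, b in zip(listed, expected))
        if diffs == 1:
            return "single-digit substitution"
    return "unrelated"

# (listed CID, expected CID) pairs as reported by the fact-checker above.
findings = {
    "Cisplatin": ("84691", "441203"),
    "Entinostat": ("4261", "4261125"),
    "Rapamycin": ("5384616", "5284616"),
    "PD173074": ("1401", "5384048"),
}

for drug, (listed, expected) in findings.items():
    print(f"{drug}: {classify_cid_discrepancy(listed, expected)}")
```

Truncations and single-digit substitutions look like transcription slips, while the "unrelated" cases suggest the identifier was looked up or copied from the wrong record.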
The Propagation Path
The Cisplatin CID error propagates through a chain of data transformations:
Step 1: Drug list → PubChem CID 84691 (wrong)
Step 2: SMILES retrieval → The 223drugs_pubchem_smiles.txt file maps PubChem CIDs to SMILES chemical structure strings. CID 84691 retrieves Transplatin’s SMILES, not Cisplatin’s.
Step 3: Molecular graph → The drug_graph_feat/ directory contains pre-computed molecular graphs for each drug, keyed by PubChem CID. The graph for CID 84691 encodes Transplatin’s atomic structure and bond geometry.
Step 4: Model training → DeepCDR’s Graph Convolutional Network takes these molecular graphs as input. During training, the model learns to associate Transplatin’s structural features with Cisplatin’s actual IC50 values (which measure cellular response to real Cisplatin, not Transplatin).
Step 5: Prediction → When the trained model predicts response to “Cisplatin,” it’s computing features from Transplatin’s graph. The model has learned a mapping from the wrong molecular structure to drug sensitivity.
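The chain above can be sketched as a series of lookups keyed by CID. This is a toy model under stated assumptions, not DeepCDR's code: the dictionaries stand in for the drug list and SMILES file, the SMILES values are placeholders, and `graph_features` is a hypothetical stand-in for the real featurizer.

```python
# Hypothetical stand-ins for the artifacts in the pipeline.
drug_list = {"Cisplatin": "84691"}  # Step 1: name -> CID as listed
smiles_by_cid = {                   # Step 2: CID -> SMILES (placeholders)
    "84691": "<SMILES stored for CID 84691>",
    "441203": "<SMILES stored for CID 441203>",
}

def graph_features(smiles: str) -> dict:
    """Step 3 stand-in: a real pipeline would featurize atoms and bonds."""
    return {"source_smiles": smiles}

def features_for(drug: str) -> dict:
    """Resolve a drug name to features through the same chain of lookups."""
    cid = drug_list[drug]
    return {"cid": cid, **graph_features(smiles_by_cid[cid])}

# Steps 4-5: whatever the first lookup returns is what the model sees.
trained_on = features_for("Cisplatin")
print(trained_on)

# Correcting only the first table changes everything downstream.
drug_list["Cisplatin"] = "441203"
print(features_for("Cisplatin"))
```

Because every later step is keyed by the CID from Step 1, no downstream stage has the information needed to notice the mismatch on its own.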
Why This Matters
Cisplatin is not a niche research compound. It is a cornerstone of cancer chemotherapy, used as first-line treatment for:
- Testicular cancer (cure rate >95%)
- Ovarian cancer
- Bladder cancer
- Head and neck cancers
- Non-small cell lung cancer
- Many other solid tumors
It is on the WHO Model List of Essential Medicines. Over 40% of all chemotherapy regimens include a platinum-based drug.
A drug response prediction model that has learned the wrong molecular features for Cisplatin cannot correctly capture how Cisplatin’s structure relates to sensitivity. Cisplatin’s anti-cancer activity depends on its cis geometry — the two chloride leaving groups must be on the same side of the platinum center to form 1,2-intrastrand d(GpG) DNA crosslinks. Transplatin’s trans geometry forms different adducts (primarily interstrand crosslinks) that are efficiently repaired by the cell’s repair machinery, which is why it has no clinical activity. The structural difference between cis and trans is precisely the kind of feature a GCN should learn — but it can’t learn it from the wrong molecule.
The Ecosystem Effect
The DeepCDR paper (Liu et al., Bioinformatics 2020) has been cited approximately 150 times. The GitHub repository has 104 stars and 29 forks. The dataset has been directly reused by multiple downstream projects:
- GraphCDR (Briefings in Bioinformatics 2022): We verified it ships with the same drug list file (identical filename, same data). Same wrong Cisplatin CID.
- JDACS4C-IMPROVE: NCI’s benchmark framework forked DeepCDR as a reference implementation.
- GraTransDRP, GraOmicDRP: Follow-on architectures built on the same data.
- DeepDR (Bioinformatics 2024): A drug response library that includes DeepCDR as a baseline.
- A 2026 Nature Communications Chemistry benchmarking study compared 5 models including DeepCDR.
Any project that:
- Downloaded the drug list from DeepCDR’s GitHub
- Used the PubChem CIDs to generate molecular features
- Trained a model using those features
…inherited the wrong Cisplatin structure. The error doesn’t affect models that ignore chemical structure (expression-only or mutation-only models), but it affects every model that encodes drug molecular graphs — which is the primary innovation of DeepCDR and its follow-on work.
Beyond Cisplatin, the three other wrong PubChem CIDs (Entinostat, Rapamycin, PD173074) mean four drugs in the dataset have molecular graph features computed from entirely different molecules. Four more drugs have chemically invalid SMILES strings (three with impossible bond configurations, plus Tubastatin A’s invalid salt fragment), and two others (AICAR and PI-103) have SMILES that depict a different compound than their CID. That’s 10 of 223 drugs (4.5%) with wrong or broken molecular features.
The Broader Lesson
This error was not found by peer review. It was not found by the original authors. It was not found by any of the ~150 papers that cited the work. It was not found by the teams that forked the code. It was found in seconds by an automated fact-checker that knows what PubChem CIDs mean and can verify them against domain knowledge.
The error is trivial to check — look up CID 84691 on PubChem and you’ll see “Transplatin” in the title. But nobody checked because:
- The CID is just a number in a CSV. Without semantic context, it’s opaque.
- Drug response prediction papers focus on model architecture, not data validation.
- The dataset is treated as a given — downloaded from a trusted source (GDSC) and used as-is.
- Cisplatin’s IC50 values are real measurements of real Cisplatin. The error is only in the structural metadata.
This is what Natural Joints catches. Natural Joints profiles every column, understands that PubChem CIDs are chemical compound identifiers, and validates them against known chemistry. The fact-checker flagged 25 errors in a 266-row drug reference table — a 9.4% error rate in a dataset used by dozens of published models.
The fix is simple: correct the 4 wrong PubChem CIDs, repair the invalid and mismatched SMILES strings, and regenerate the molecular graph features. Models already trained on the wrong features would then need to be retrained.
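A correction pass over the drug list might look like the sketch below. It assumes the column names `Name` and `PubCHEM` shown in the Cisplatin entry and that the drug names appear in the file exactly as written here; the function `correct_rows` and the `ExampleDrug` demo row are made up for illustration. The replacement CIDs are the ones reported in this article and should be verified against PubChem before being applied.

```python
import csv
import io

# (drug name, CID as listed) -> corrected CID, taken from this article.
# Verify each against PubChem before applying.
CORRECTIONS = {
    ("Cisplatin", "84691"): "441203",
    ("Entinostat", "4261"): "4261125",
    ("PD173074", "1401"): "5384048",
    ("Rapamycin", "5384616"): "5284616",
}

def correct_rows(rows, name_col="Name", cid_col="PubCHEM"):
    """Yield copies of rows with listed CIDs replaced; others pass through."""
    for row in rows:
        fixed = dict(row)
        key = (row[name_col].strip(), row[cid_col].strip())
        if key in CORRECTIONS:
            fixed[cid_col] = CORRECTIONS[key]
        yield fixed

# In-memory demo; the second row is a made-up control that must not change.
demo = io.StringIO(
    "drug_id,Name,PubCHEM\n"
    "1005,Cisplatin,84691\n"
    "9999,ExampleDrug,12345\n"
)
corrected = list(correct_rows(csv.DictReader(demo)))
print(corrected)
```

Matching on both the name and the listed CID keeps the pass from touching rows that have already been fixed. The SMILES file and the pre-computed graphs would still need to be regenerated from the corrected CIDs.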
All 25 Fact-Check Findings
The automated fact-checker flagged 25 errors across 2 reference tables (drug list and SMILES file). Here is the complete list.
Wrong PubChem Compound IDs (7 errors)
These drugs have PubChem CIDs that point to the wrong molecule. Any molecular graph or fingerprint derived from these CIDs encodes the wrong compound.
| Drug | CID in File | Actual Molecule at That CID | Correct CID |
|---|---|---|---|
| Cisplatin | 84691 | Transplatin (inactive trans isomer) | 441203 |
| Entinostat (MS-275) | 4261 | Completely different small molecule | 4261125 (truncation error) |
| PD173074 | 1401 | Different simple molecule | 5384048 |
| Rapamycin | 5384616 | Non-existent or wrong compound | 5284616 (single-digit typo: 3 in place of 2) |
Additionally, 3 drugs have correct CIDs but their SMILES strings don’t match the CID:
| Drug | Issue |
|---|---|
| AICA Ribonucleotide (AICAR) | CID 65110 is the unphosphorylated nucleoside, but the SMILES contains a phosphate group — depicts the phosphorylated form (ZMP) instead |
| PI-103 | CID 9956222 is correct, but the SMILES string depicts a completely different compound |
| Tubastatin A | CID 53394750 is correct, but the SMILES contains an invalid salt fragment .OOC#CF (chemically absurd) |
Chemically Invalid SMILES Strings (3 errors)
These drugs have SMILES strings that violate basic chemistry rules. Any molecular graph generated from them is structurally impossible.
| Drug | Issue |
|---|---|
| Drug with cyanoguanidine scaffold | Neutral divalent nitrogen [N] represents an unstable nitrene, not a stable amine |
| Drug with heterocyclic ring | Ring-closure atom assigned contradictory bonds (exocyclic double bond + aromatic ring), chemically impossible |
| Drug with aromatic system | Exocyclic double bonds on aromatic carbons break aromaticity rules, generates hypervalent atoms |
Wrong Drug Target Annotations (12 errors)
These errors don’t affect molecular graph features but corrupt any analysis that uses target/pathway annotations for biological interpretation.
| Drug | Listed Target/Pathway | Correct Target/Pathway |
|---|---|---|
| Methotrexate | “Antimetabolite” (drug class, not a target) | Dihydrofolate reductase (DHFR) |
| Phenformin | “Biguanide agent” (drug class) | Mitochondrial Complex I / AMPK pathway |
| Tretinoin | “Retinoic acid” (it IS retinoic acid) | Retinoic acid receptors (RARα, RARβ, RARγ) |
| GSK1070916 | Lists AURKA as target | Actually targets AURKB/AURKC (>250x selectivity over AURKA) |
| GSK650394 | Omits primary target | Primary target is SGK1 (IC50 ~62 nM), not the listed alternative |
| Tubastatin A | Lists HDAC1, HDAC6, HDAC8 | Highly selective for HDAC6 only (>1000x over HDAC1/8) |
| Vismodegib | Listed under wrong pathway | Should be Hedgehog signaling (SMO inhibitor), not the listed pathway |
| Z-LLNle-CHO | Missing pathway context | Gamma-secretase inhibitor acting through Notch signaling |
| AS601245 | “JNK1, JNK2, JNK2” | Should be JNK1, JNK2, JNK3 (JNK2 duplicated, JNK3 omitted) |
| Avagacestat | “Amyloid beta20” | Should be “Amyloid beta42” (typo: 20→42) |
| SB52334 | Compound name | Should be SB-525334 (missing digit ‘5’) |
| ALK4/ALK5 inhibitor | “RTK signaling” | ALK4/ALK5 are Serine/Threonine kinases, not RTKs |
Drug Naming Errors (2 errors)
| Drug | Issue |
|---|---|
| Lenalidomide | Listed as “CDC-501” — should be “CC-5013” (Celgene’s code name) |
| Thapsigargin | Lists “Octanoic acid” as synonym — octanoic acid is a simple fatty acid, not a synonym (Thapsigargin contains an octanoate ester but they are completely different molecules) |
Impact Summary
| Error Category | Count | Affects Molecular Features? | Affects Target Analysis? |
|---|---|---|---|
| Wrong PubChem CID | 4 | YES — wrong molecular graph | No |
| CID-SMILES mismatch | 3 | YES — wrong/corrupted structure | No |
| Invalid SMILES | 3 | YES — impossible structure | No |
| Wrong targets | 12 | No | YES — wrong biological interpretation |
| Wrong names | 2 | No | Partially — lookup failures |
| Total | 25 | 10 drugs (4.5%) with wrong structure | 12 drugs (5.4%) with wrong targets |
Of the 223 drugs in the SMILES file, 10 (4.5%) have incorrect or impossible molecular structures and 12 (5.4%) have wrong target annotations. Measured against the 266-row drug list, the 25 findings amount to roughly 9.4% of the drug reference data, with some drugs appearing in more than one category.
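The percentages in the Impact Summary use two different denominators, which the short calculation below makes explicit. It only restates figures given in this article; the helper `pct` and the constant names are illustrative.

```python
def pct(count: int, total: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)

SMILES_FILE_DRUGS = 223  # drugs in the SMILES file
DRUG_LIST_ROWS = 266     # rows in the drug list

# Wrong CID + CID-SMILES mismatch + invalid SMILES, from the summary table.
wrong_structure = 4 + 3 + 3

print(wrong_structure, pct(wrong_structure, SMILES_FILE_DRUGS))  # structure errors
print(pct(12, SMILES_FILE_DRUGS))                                # target errors
print(pct(25, DRUG_LIST_ROWS))                                   # all findings
```

The structure and target percentages are per drug in the SMILES file, while the overall rate is per row of the drug list.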
Reproducing
# Verify the Cisplatin CID error
python3 -c "
import csv

# Column positions: row[1] is the drug name, row[5] is the PubChem CID
with open('1.Drug_listMon Jun 24 09_00_55 2019.csv', newline='') as f:
    for row in csv.reader(f):
        if row[1] == 'Cisplatin':
            print(f'Drug: {row[1]}')
            print(f'PubChem CID in file: {row[5]}')
            print('PubChem CID 84691 = Transplatin (WRONG)')
            print('Correct CID: 441203 (Cisplatin)')
            print('https://pubchem.ncbi.nlm.nih.gov/compound/84691')
            print('https://pubchem.ncbi.nlm.nih.gov/compound/441203')
"
# Run Natural Joints + fact-check
natural-joints enrich ./data -v
natural-joints fact-check ./data -v
Disclosure
We did not contact the DeepCDR authors or the GDSC database maintainers before publishing this analysis. The code and data are publicly available under the MIT license. The PubChem CID mapping error likely originates in the GDSC database itself rather than in DeepCDR’s code, which inherited it. We recommend that GDSC verify all PubChem CID mappings in their drug metadata.