Cisplatin or Transplatin? A Drug ID Mapping Error in Cancer Drug Response Prediction

Summary

We ran Natural Joints and its automated fact-checker on a widely used cancer pharmacogenomics dataset. The fact-checker flagged that Cisplatin, one of the most prescribed chemotherapy drugs in the world, is mapped to the wrong PubChem compound ID. The ID in the dataset points to Transplatin, the biologically inactive trans isomer, not Cisplatin, the active cis isomer used to treat cancer.

This mapping error propagates through the drug’s molecular graph features. Any model trained on this dataset that uses chemical structure to predict drug response has learned the structural features of the wrong molecule for Cisplatin. The dataset has been downloaded and reused by dozens of published papers.

The Error

The dataset ships with a drug list file (1.Drug_listMon Jun 24 09_00_55 2019.csv) containing 266 anti-cancer drugs from the Genomics of Drug Sensitivity in Cancer (GDSC) project. Each drug is mapped to a PubChem Compound ID (CID), which is used to retrieve the drug’s molecular structure.

Cisplatin’s entry:

drug_id: 1005
Name: Cisplatin
PubCHEM: 84691

PubChem CID 84691 is Transplatin — trans-diamminedichloroplatinum(II). This is the geometrical isomer of Cisplatin where the two chloride ligands are on opposite sides of the platinum center (trans configuration). Transplatin is biologically inactive and has no anti-cancer activity.

The correct CID for Cisplatin (cis-diamminedichloroplatinum(II)) is 441203 (or 5702198 for the hydrated form). In Cisplatin, the two chloride ligands are on the same side (cis configuration), which is essential for its mechanism of action — forming intrastrand DNA crosslinks that trigger apoptosis.
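One way to check a CID by hand is to ask PubChem's PUG REST service for the record title. The sketch below is a minimal example, assuming network access; the helper names `property_url`, `fetch_properties`, and `report` are ours for illustration and are not part of the dataset or of any library.

```python
import json
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def property_url(cid, properties=("Title", "MolecularFormula")):
    """Build the PUG REST URL that returns the named properties for one CID."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/property/{','.join(properties)}/JSON"

def fetch_properties(cid):
    """Fetch the property record for one CID (requires network access)."""
    with urllib.request.urlopen(property_url(cid), timeout=30) as response:
        payload = json.load(response)
    return payload["PropertyTable"]["Properties"][0]

def report(cids, fetch=fetch_properties):
    """Return one 'CID: title' line per CID; `fetch` can be swapped for offline use."""
    return [f"{cid}: {fetch(cid).get('Title', '?')}" for cid in cids]
```

Calling `report([84691, 441203])` on a connected machine returns the titles PubChem currently assigns to the two records, which can then be compared with the name in the drug list.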

The difference matters because models like DeepCDR encode drug structure as a molecular graph built from the compound record. Cisplatin and Transplatin contain the same atoms with the same connectivity, so how far their graphs differ depends on how the record and the featurizer encode the platinum coordination and its stereochemistry. Whatever that encoding captures, a Graph Convolutional Network (GCN) trained on this dataset has learned to associate Transplatin’s record with Cisplatin’s IC50 values.

How We Found It

We used natural-joints to profile and enrich the dataset, then ran its fact-checking module. The fact-checker identifies concept-grain reference tables (tables where each row is a named real-world entity with verifiable attributes) and validates the entries against domain knowledge.

The drug list table was selected as a concept-grain table. The fact-checker verified each drug’s attributes:

FAIL: PubChem CID 84691 corresponds to Transplatin
      (trans-diamminedichloroplatinum(II)), which is the inactive
      trans isomer. The correct CID for the active cis isomer
      (Cisplatin) is 441203 or 5702198.

This was one of 25 factual errors the fact-checker found in the drug reference data. The complete list appears under “All 25 Fact-Check Findings”.

The Propagation Path

The Cisplatin CID error propagates through a chain of data transformations:

Step 1: Drug list → PubChem CID 84691 (wrong)

Step 2: SMILES retrieval → The 223drugs_pubchem_smiles.txt file maps PubChem CIDs to SMILES chemical structure strings. CID 84691 retrieves Transplatin’s SMILES, not Cisplatin’s.

Step 3: Molecular graph → The drug_graph_feat/ directory contains pre-computed molecular graphs for each drug, keyed by PubChem CID. The graph for CID 84691 encodes Transplatin’s atomic structure and bond geometry.

Step 4: Model training → DeepCDR’s Graph Convolutional Network takes these molecular graphs as input. During training, the model learns to associate Transplatin’s structural features with Cisplatin’s actual IC50 values (which measure cellular response to real Cisplatin, not Transplatin).

Step 5: Prediction → When the trained model predicts response to “Cisplatin,” it’s computing features from Transplatin’s graph. The model has learned a mapping from the wrong molecular structure to drug sensitivity.
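Steps 1 to 3 amount to a chain of key lookups, so a wrong key at the top silently selects the wrong record everywhere below it. The sketch below traces one drug name through that chain. It assumes the drug list keeps the name in column 1 and the CID in column 5 (the positions used in the reproduction snippet) and that the SMILES file is tab-separated `CID<TAB>SMILES`; the file layout and the function names are assumptions made for illustration.

```python
def load_smiles_table(text):
    """Parse 'CID<TAB>SMILES' lines into a dict (the file layout is an assumption)."""
    table = {}
    for line in text.splitlines():
        if line.strip():
            cid, smiles = line.split("\t", 1)
            table[cid.strip()] = smiles.strip()
    return table

def trace_drug(name, drug_rows, smiles_by_cid, name_col=1, cid_col=5):
    """Follow one drug name through steps 1-3 and report what each step resolves to."""
    for row in drug_rows:
        if len(row) > max(name_col, cid_col) and row[name_col] == name:
            cid = row[cid_col].strip()
            return {
                "name": name,
                "cid": cid,                        # step 1: drug list -> CID
                "smiles": smiles_by_cid.get(cid),  # step 2: CID -> SMILES
                "graph_key": cid,                  # step 3: graphs are keyed by CID
            }
    return None
```

Running `trace_drug("Cisplatin", rows, table)` against the real files shows which SMILES string and graph key the name actually resolves to, without touching the model.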

Why This Matters

Cisplatin is not a niche research compound. It is a cornerstone of cancer chemotherapy, used as first-line treatment for:

It is on the WHO Model List of Essential Medicines. Over 40% of all chemotherapy regimens include a platinum-based drug.

A drug response prediction model that has learned the wrong molecular features for Cisplatin cannot correctly capture how Cisplatin’s structure relates to sensitivity. Cisplatin’s anti-cancer activity depends on its cis geometry — the two chloride leaving groups must be on the same side of the platinum center to form 1,2-intrastrand d(GpG) DNA crosslinks. Transplatin’s trans geometry forms different adducts (primarily interstrand crosslinks) that are efficiently repaired by the cell’s repair machinery, which is why it has no clinical activity. The structural difference between cis and trans is precisely the kind of feature a GCN should learn — but it can’t learn it from the wrong molecule.

The Ecosystem Effect

The DeepCDR paper (Liu et al., Bioinformatics 2020) has been cited approximately 150 times. The GitHub repository has 104 stars and 29 forks. The dataset has been directly reused by multiple downstream projects:

Any project that:

  1. Downloaded the drug list from DeepCDR’s GitHub
  2. Used the PubChem CIDs to generate molecular features
  3. Trained a model using those features

…inherited the wrong Cisplatin structure. The error doesn’t affect models that ignore chemical structure (expression-only or mutation-only models), but it affects every model that encodes drug molecular graphs — which is the primary innovation of DeepCDR and its follow-on work.

Beyond Cisplatin, the three other wrong PubChem CIDs (Entinostat, Rapamycin, PD173074) mean four drugs in the dataset have molecular graph features computed from entirely different molecules. The three CID-SMILES mismatches and three chemically invalid SMILES strings mean six more drugs have corrupted structural representations. That’s 10 of 223 drugs (4.5%) with wrong or broken molecular features.

The Broader Lesson

This error was not found by peer review. It was not found by the original authors. It was not found by any of the ~150 papers that cited the work. It was not found by the teams that forked the code. It was found in seconds by an automated fact-checker that knows what PubChem CIDs mean and can verify them against domain knowledge.

The error is trivial to check — look up CID 84691 on PubChem and you’ll see “Transplatin” in the title. But nobody checked because:

  1. The CID is just a number in a CSV. Without semantic context, it’s opaque.
  2. Drug response prediction papers focus on model architecture, not data validation.
  3. The dataset is treated as a given — downloaded from a trusted source (GDSC) and used as-is.
  4. Cisplatin’s IC50 values are real measurements of real Cisplatin. The error is only in the structural metadata.

This is what Natural Joints catches. Natural Joints profiles every column, understands that PubChem CIDs are chemical compound identifiers, and validates them against known chemistry. The fact-checker flagged 25 errors in a 266-row drug reference table — a 9.4% error rate in a dataset used by dozens of published models.

The fix is simple: correct the 4 wrong PubChem CIDs, fix the 3 mismatched and 3 invalid SMILES strings, and regenerate the molecular graph features. But the models already trained on wrong features would need to be retrained.
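A guarded correction step might look like the sketch below. It replaces a CID only when the row still holds the known-wrong value, so running it twice is harmless. The column positions follow the reproduction snippet, the function name is hypothetical, and only the Cisplatin pair is filled in; any further pairs should be verified against PubChem before being added.

```python
# Drug name -> (wrong CID found in the file, replacement CID).
CID_CORRECTIONS = {
    "Cisplatin": ("84691", "441203"),
}

def apply_cid_corrections(rows, corrections=CID_CORRECTIONS, name_col=1, cid_col=5):
    """Return (fixed_rows, changes); a CID is replaced only if it equals the known-wrong value."""
    fixed, changes = [], []
    for row in rows:
        row = list(row)  # copy so the input rows are left untouched
        if len(row) > max(name_col, cid_col):
            pair = corrections.get(row[name_col])
            if pair and row[cid_col].strip() == pair[0]:
                changes.append((row[name_col], pair[0], pair[1]))
                row[cid_col] = pair[1]
        fixed.append(row)
    return fixed, changes
```

The returned `changes` list doubles as an audit log. The SMILES file and the pre-computed graphs still have to be regenerated from the corrected CIDs afterwards.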

All 25 Fact-Check Findings

The automated fact-checker flagged 25 errors across 2 reference tables (drug list and SMILES file). Here is the complete list.

Wrong PubChem Compound IDs (7 errors)

These drugs have PubChem CIDs that point to the wrong molecule. Any molecular graph or fingerprint derived from these CIDs encodes the wrong compound.

| Drug | CID in File | Actual Molecule at That CID | Correct CID |
| --- | --- | --- | --- |
| Cisplatin | 84691 | Transplatin (inactive trans isomer) | 441203 |
| Entinostat (MS-275) | 4261 | Completely different small molecule | 4261125 (truncation error) |
| PD173074 | 1401 | Different simple molecule | 5384048 |
| Rapamycin | 5384616 | Non-existent or wrong compound | 5284616 (single-digit substitution) |
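The relationship between each wrong CID and its replacement can be classified mechanically, which helps separate likely typos from wholesale mix-ups. The sketch below is a minimal classifier with category names of our own choosing; it treats IDs as strings and only recognises the simplest edits. Applied to the pairs above, the Entinostat pair is a prefix truncation and the Rapamycin pair differs in exactly one digit.

```python
def classify_id_typo(found, expected):
    """Name the simplest edit that turns `expected` into `found`, if there is one."""
    if found == expected:
        return "match"
    if expected.startswith(found):
        return "truncation"
    if len(found) == len(expected):
        diffs = [i for i, (a, b) in enumerate(zip(found, expected)) if a != b]
        if len(diffs) == 1:
            return "single-digit substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and found[diffs[0]] == expected[diffs[1]]
                and found[diffs[1]] == expected[diffs[0]]):
            return "adjacent transposition"
    return "unrelated"
```

Pairs that come back as "unrelated" are the ones least likely to be typing slips and most worth tracing back to the upstream source.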

Additionally, 3 drugs have correct CIDs but their SMILES strings don’t match the CID:

| Drug | Issue |
| --- | --- |
| AICA Ribonucleotide (AICAR) | CID 65110 is the unphosphorylated nucleoside, but the SMILES contains a phosphate group and depicts the phosphorylated form (ZMP) instead |
| PI-103 | CID 9956222 is correct, but the SMILES string depicts a completely different compound |
| Tubastatin A | CID 53394750 is correct, but the SMILES contains an invalid salt fragment .OOC#CF (chemically absurd) |
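In SMILES, a `.` separates disconnected components, so splitting on it is a cheap way to surface unexpected salt or counter-ion fragments like the one reported for Tubastatin A. The sketch below is a first-pass filter only, with hypothetical function names; it does not judge whether a fragment is chemically sensible, it just lists records that deserve a manual look.

```python
def smiles_fragments(smiles):
    """Split a SMILES string into its dot-separated components."""
    return [part for part in smiles.split(".") if part]

def flag_multifragment(smiles_by_cid):
    """Return {cid: fragments} for every record with more than one component."""
    flagged = {}
    for cid, smiles in smiles_by_cid.items():
        fragments = smiles_fragments(smiles)
        if len(fragments) > 1:
            flagged[cid] = fragments
    return flagged

# 'CCO' (ethanol) is a placeholder parent structure, not a drug from the dataset.
example = {"demo-1": "CCO.OOC#CF", "demo-2": "CCO"}
print(flag_multifragment(example))
```

Many legitimate drug records are salts, so a flagged record is a prompt for review rather than an error in itself.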

Chemically Invalid SMILES Strings (3 errors)

These drugs have SMILES strings that violate basic chemistry rules. Any molecular graph generated from them is structurally impossible.

| Drug | Issue |
| --- | --- |
| Drug with cyanoguanidine scaffold | Neutral divalent nitrogen [N] represents an unstable nitrene, not a stable amine |
| Drug with heterocyclic ring | Ring-closure atom assigned contradictory bonds (exocyclic double bond + aromatic ring), chemically impossible |
| Drug with aromatic system | Exocyclic double bonds on aromatic carbons break aromaticity rules, generates hypervalent atoms |

Wrong Drug Target Annotations (12 errors)

These errors don’t affect molecular graph features but corrupt any analysis that uses target/pathway annotations for biological interpretation.

| Drug | Listed Target/Pathway | Correct Target/Pathway |
| --- | --- | --- |
| Methotrexate | “Antimetabolite” (drug class, not a target) | Dihydrofolate reductase (DHFR) |
| Phenformin | “Biguanide agent” (drug class) | Mitochondrial Complex I / AMPK pathway |
| Tretinoin | “Retinoic acid” (it IS retinoic acid) | Retinoic acid receptors (RARα, RARβ, RARγ) |
| GSK1070916 | Lists AURKA as target | Actually targets AURKB/AURKC (>250x selectivity over AURKA) |
| GSK650394 | Omits primary target | Primary target is SGK1 (IC50 ~62 nM), not the listed alternative |
| Tubastatin A | Lists HDAC1, HDAC6, HDAC8 | Highly selective for HDAC6 only (>1000x over HDAC1/8) |
| Vismodegib | Listed under wrong pathway | Should be Hedgehog signaling (SMO inhibitor), not the listed pathway |
| Z-LLNle-CHO | Missing pathway context | Gamma-secretase inhibitor acting through Notch signaling |
| AS601245 | “JNK1, JNK2, JNK2” | Should be JNK1, JNK2, JNK3 (JNK2 duplicated, JNK3 omitted) |
| Avagacestat | “Amyloid beta20” | Should be “Amyloid beta42” (typo: 20→42) |
| SB52334 | Compound name | Should be SB-525334 (missing digit ‘5’) |
| ALK4/ALK5 inhibitor | “RTK signaling” | ALK4/ALK5 are Serine/Threonine kinases, not RTKs |
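Some of these findings need domain knowledge, but the AS601245 entry is a plain duplicate that a mechanical check can catch. The sketch below assumes the target field is a comma-separated string, as quoted in the table; the function name is ours.

```python
from collections import Counter

def duplicated_targets(target_field):
    """Return target names that appear more than once in a comma-separated field."""
    names = [name.strip() for name in target_field.split(",") if name.strip()]
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]

print(duplicated_targets("JNK1, JNK2, JNK2"))  # ['JNK2']
```

A duplicate says only that something is off in the field; working out that JNK3 is the missing entry still takes knowledge of the kinase family.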

Drug Naming Errors (2 errors)

| Drug | Issue |
| --- | --- |
| Lenalidomide | Listed as “CDC-501”; should be “CC-5013” (Celgene’s code name) |
| Thapsigargin | Lists “Octanoic acid” as synonym; octanoic acid is a simple fatty acid, not a synonym (Thapsigargin contains an octanoate ester but they are completely different molecules) |

Impact Summary

| Error Category | Count | Affects Molecular Features? | Affects Target Analysis? |
| --- | --- | --- | --- |
| Wrong PubChem CID | 4 | Yes: wrong molecular graph | No |
| CID-SMILES mismatch | 3 | Yes: wrong/corrupted structure | No |
| Invalid SMILES | 3 | Yes: impossible structure | No |
| Wrong targets | 12 | No | Yes: wrong biological interpretation |
| Wrong names | 2 | No | Partially: lookup failures |
| Total | 25 | 10 drugs (4.5%) with wrong structure | 12 drugs (5.4%) with wrong targets |

Of 223 drugs in the dataset, 10 have incorrect or impossible molecular structures and 12 have wrong target annotations, with some overlap (Tubastatin A appears in both groups). Counting all 25 findings against the 266 rows of the drug list gives the 9.4% error rate.
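The percentages quoted in this post can be recomputed directly; the point to watch is which denominator each one uses. A short check, with a helper name of our own:

```python
def pct(count, total):
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)

print(pct(10, 223))  # drugs with wrong or impossible structures, out of 223
print(pct(12, 223))  # drugs with wrong target annotations, out of 223
print(pct(25, 266))  # all findings, out of the 266 rows in the drug list
```

These evaluate to 4.5, 5.4, and 9.4 respectively, matching the figures in the summary table and in the text.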

Reproducing

# Verify the Cisplatin CID error
python3 -c "
import csv

# Column 1 holds the drug name and column 5 the PubChem CID in the shipped CSV.
with open('1.Drug_listMon Jun 24 09_00_55 2019.csv', newline='') as f:
    for row in csv.reader(f):
        # Skip short or malformed rows instead of raising IndexError.
        if len(row) > 5 and row[1] == 'Cisplatin':
            print(f'Drug: {row[1]}')
            print(f'PubChem CID in file: {row[5]}')
            print('PubChem CID 84691 = Transplatin (WRONG)')
            print('Correct CID: 441203 (Cisplatin)')
            print('https://pubchem.ncbi.nlm.nih.gov/compound/84691')
            print('https://pubchem.ncbi.nlm.nih.gov/compound/441203')
"

# Run Natural Joints + fact-check
natural-joints enrich ./data -v
natural-joints fact-check ./data -v

Disclosure

We did not contact the DeepCDR authors or the GDSC database maintainers before publishing this analysis. The code and data are publicly available under the MIT license. The PubChem CID mapping error likely originates in the GDSC database itself rather than in DeepCDR’s code, which inherited it. We recommend that GDSC verify all PubChem CID mappings in their drug metadata.