The original Dr. Duke database is a treasure trove of plant compound data, but it remains largely untapped because it cannot be easily integrated into modern machine learning pipelines.

My partner and I have spent the last few weeks manually cleaning and structurally validating 76,907 records from it. We assigned PubChem CIDs, verified the SMILES strings, and added bioactivity values from ChEMBL v35. We also built a query bridge to 1.55 million PubMed abstracts. The core dataset itself is now a strictly typed flat file.

I have uploaded a public 400-row sample with all 16 columns to GitHub and Zenodo so you can test the schema in Pandas or DuckDB.

GitHub: github.com/wirthal1990-tech/USDA-Phytochemical-Database-JSON

Zenodo DOI: 10.5281/zenodo.19660107

u/DoubleReception2962 — 1 month ago

I've been cleaning up the USDA botanical database and mapping it against modern APIs. I wanted to find compounds that are heavily researched in academia but completely ignored by commercial patents.

Initially, our script found 994 compounds. But after applying a strict structural validation gate checking SMILES and InChIKeys against ChEMBL, that number collapsed to exactly 1 valid outlier: Sorbose.

It shows how dirty historical chemical data is. Almost all the "hidden gems" were just data artifacts and broken joins.
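For anyone who wants to try a similar filter on the sample, here is a minimal sketch of the cheapest first step of such a gate. It is not the actual validation script: it only checks that a record carries a non-empty SMILES string and a well-formed standard InChIKey (14 letters, 10 letters, 1 letter, joined by hyphens), and it does not compare anything against ChEMBL. The function name `passes_gate` and the example records are assumptions for illustration; the ethanol entry is only there as a known-good identifier.

```python
import re

# Layout of a standard InChIKey: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def passes_gate(record):
    """Return True if the record has a SMILES string and a well-formed InChIKey.

    This is a format check only; it does not confirm that the identifiers
    describe the same structure or that they exist in ChEMBL.
    """
    smiles = (record.get("SMILES") or "").strip()
    key = (record.get("inchi_key") or "").strip()
    return bool(smiles) and INCHIKEY_RE.fullmatch(key) is not None


# Illustrative records (not taken from the dataset).
candidates = [
    {"Compound_Name": "ETHANOL", "SMILES": "CCO",
     "inchi_key": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"Compound_Name": "RESIN", "SMILES": "", "inchi_key": ""},
    {"Compound_Name": "BROKEN JOIN", "SMILES": "CCO", "inchi_key": "LFQSCWFLJHTTHZ"},
]

survivors = [c["Compound_Name"] for c in candidates if passes_gate(c)]
print(survivors)
```

Records that fail even this format check can be dropped before any API call, which keeps the expensive structural comparison for entries that at least look valid.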

Data source: Enriched USDA Phytochemical Database (v2.4.0) via PubChem and USPTO APIs. Tools: Python, DuckDB, Matplotlib.

Sample data is on my GitHub if anyone wants to run their own clustering on it.

u/DoubleReception2962 — 1 month ago

My partner and I are currently rebuilding the historical USDA Dr. Duke Phytochemical database to make it usable for modern computational pipelines.

While writing the validation scripts to map the historical records to PubChem CIDs, we hit a massive wall with legacy nomenclature. We found 35 specific stereoisomer prefix issues mapped to achiral compounds. The old database basically slapped chiral prefixes onto structures that PubChem explicitly registers as achiral.

We decided to build a validation gate that drops the prefix if the base InChIKey matches an achiral PubChem record, rather than completely invalidating the historical entry.
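A minimal sketch of that rule, under some assumptions: "base InChIKey" is read here as the first 14-character block of the key (the connectivity hash), the achirality flag is supplied by the caller rather than looked up, and the prefix pattern covers only a handful of common descriptors. The names `normalize_name`, `connectivity_block`, and `CHIRAL_PREFIX_RE` are made up for illustration and are not from our scripts.

```python
import re

# Leading chirality descriptors such as "(+)-", "(-)-", "(+/-)-", "(R)-", "(S)-",
# "D-", "L-", "DL-". This list is an assumption, not an exhaustive rule set.
CHIRAL_PREFIX_RE = re.compile(r"^(?:\((?:\+|-|\+/-|R|S)\)|DL|D|L)-")


def connectivity_block(inchikey):
    """Return the first block of an InChIKey, which hashes the connectivity."""
    return inchikey.split("-")[0]


def normalize_name(name, legacy_key, pubchem_key, pubchem_is_achiral):
    """Drop a chiral prefix only when the matched PubChem record is achiral.

    If the connectivity blocks differ, or the PubChem record is chiral,
    the historical name is returned unchanged.
    """
    same_skeleton = connectivity_block(legacy_key) == connectivity_block(pubchem_key)
    if pubchem_is_achiral and same_skeleton:
        return CHIRAL_PREFIX_RE.sub("", name, count=1)
    return name


GLYCINE_KEY = "DHMQDGOQFOQNFH-UHFFFAOYSA-N"    # glycine, achiral
L_ALANINE_KEY = "QNAYBMKLOCPYGJ-REOHSYLNSA-N"  # L-alanine, chiral

print(normalize_name("L-GLYCINE", GLYCINE_KEY, GLYCINE_KEY, True))
print(normalize_name("L-ALANINE", L_ALANINE_KEY, L_ALANINE_KEY, False))
```

The point of keeping the record and only rewriting the name is that the historical entry stays traceable; a hard drop would lose the plant and activity data attached to it.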

How do computational chemists here handle legacy naming conventions when standardizing old datasets against modern InChIKey/SMILES rules? Do you hard-drop the records or write exception scripts?

I uploaded our validation logic notes and a sample of the cleaned data on GitHub if anyone wants to critique the approach.

GitHub repo: wirthal1990-tech/USDA-Phytochemical-Database-JSON

u/DoubleReception2962 — 1 month ago

Version 2.4.0 of the derived dataset from the USDA Dr. Duke Phytochemicals Database has been uploaded to my GitHub and Hugging Face repos.

What the dataset includes: 76,907 records on plant compounds from 2,313 plant species, converted from the original Dr. Duke database into a structured flat file format for ML workflows.

Fields: Compound_Name, Plant_Species, Plant_Part, Chemical_Activity, PubChem_CID, SMILES, molecular_formula, compound_type, number_of_patents_since_2020, method_for_determining_number_of_patents, ClinicalTrials.gov_flag, iupac_verified, inchi_key, partner_CID, method_for_partner_mapping.
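If you want to sanity-check a record against this field list before loading the full file, a small stdlib-only sketch could look like the following. The field names are copied from the list above and may not match the file headers exactly, so treat MANIFEST_v2.json as the authority; `missing_fields` and the sample record are hypothetical.

```python
# Field names as given in the post; the real headers may differ.
EXPECTED_FIELDS = [
    "Compound_Name", "Plant_Species", "Plant_Part", "Chemical_Activity",
    "PubChem_CID", "SMILES", "molecular_formula", "compound_type",
    "number_of_patents_since_2020", "method_for_determining_number_of_patents",
    "ClinicalTrials.gov_flag", "iupac_verified", "inchi_key", "partner_CID",
    "method_for_partner_mapping",
]


def missing_fields(record):
    """Return the expected field names that are absent from a record dict."""
    return [field for field in EXPECTED_FIELDS if field not in record]


# Hypothetical record with every field present except inchi_key.
sample = dict.fromkeys(EXPECTED_FIELDS)
sample.pop("inchi_key")

print(missing_fields(sample))
```

The same check works on rows parsed with `json.load` from the JSON export, since each row is a plain dict.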

What has changed in v2.4.0:

1,534 previously zero-CID records now have verified PubChem CIDs. These were resolved through a systematic IUPAC name search against PubChem REST. The CIDs resulting from this process are marked in the “iupac_verified” column, and the “partner_match_method” column documents the resolution path.
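For readers who have not used it, a name-to-CID lookup against PubChem PUG REST can be sketched roughly like this. This is not the resolution script used for the release: the helper names are assumptions, there is no rate limiting or retry logic, and a real pipeline would still need to verify each hit (for example by comparing InChIKeys) before accepting it.

```python
import json
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(name):
    """Build the PUG REST URL that returns CIDs for a compound name."""
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/cids/JSON"


def parse_cids(payload):
    """Extract the CID list from a PUG REST JSON response body."""
    data = json.loads(payload)
    return data.get("IdentifierList", {}).get("CID", [])


def resolve_cids(name, timeout=10):
    """Query PubChem for a name; return [] when the name is not found."""
    try:
        with urlopen(cid_lookup_url(name), timeout=timeout) as resp:
            return parse_cids(resp.read().decode("utf-8"))
    except HTTPError as err:
        if err.code == 404:  # PubChem answers unknown names with 404
            return []
        raise


# Offline demonstration on a response body of the expected shape.
print(parse_cids('{"IdentifierList": {"CID": [750]}}'))
```

A name can return several CIDs, so the result is a list; deciding which one to keep is exactly the part that needs manual review or an InChIKey comparison.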

157 InChI keys were added to previously matched records.

Number of zero-CIDs: 19,150 in v2.3.1, 17,616 in v2.4.0.

All existing CID mappings underwent external review during this release cycle. My new partner, who has a cheminformatics background, manually reviewed 13,206 mappings. He identified and corrected one confirmed CID error. 35 issues with stereoisomer prefixes on achiral compounds were resolved. The methodology is documented per dataset.

File format: Parquet and JSON. Column documentation in MANIFEST_v2.json.

HuggingFace: wirthal1990-tech/USDA-Phytochemical-Database-JSON

GitHub: wirthal1990-tech/USDA-Phytochemical-Database-JSON

u/DoubleReception2962 — 1 month ago

The USDA Dr. Duke Database of Phytochemicals and Ethnobotany is one of the most comprehensive collections of plant-compound relationships in existence. Over 76,000 records. Decades of work. It includes notes on bioactivity, concentration ranges, and ethnobotanical uses for thousands of plant species.

The user interface hasn’t changed in about twenty years. There is no bulk export. The compounds have no standardized identifiers. SMILES strings do not exist. If your workflow requires PubChem CIDs, you have to start from scratch.

Every team working in the field of machine learning for natural products ultimately has to preprocess the same raw data independently. I know this because I’ve spoken with people who’ve done it, and the same problems came up every time.

So I rebuilt it.

The current version: 76,907 records. 9,098 unique compounds with PubChem CID mappings. SMILES via CID lookup. USPTO patent counts since 2020. Intervention data from ClinicalTrials.gov. Classification of compounds into discrete phytochemicals, complex mixtures, substance classes, and generic ambiguities.

The most time-consuming part was not the data enrichment. It was the question of how to handle records where the compound name is ambiguous. RESIN has no CID. ALKALOID FRACTION has no CID. Assigning one would be incorrect. Leaving them without documentation explaining why they are zero leaves the next researcher in the dark. That is why I added a “compound_type” column that classifies each record and documents the classification logic.
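To make the idea concrete, here is a toy sketch of a classifier that returns both a label and the reason for it, so a zero-CID record is never left unexplained. The keyword lists, the label strings, and the choice to treat RESIN as a mixture are all assumptions for illustration; the real logic is the one documented in the dataset.

```python
# Hypothetical keyword sets; the actual rules live in the dataset documentation.
MIXTURE_TERMS = {"FRACTION", "EXTRACT", "RESIN"}
CLASS_TERMS = {"ALKALOIDS", "FLAVONOIDS", "TANNINS", "SAPONINS"}


def classify(name, cid):
    """Return (compound_type, reason) for a compound name and its CID (or 0/None)."""
    if cid:
        return "discrete_phytochemical", "has a PubChem CID"
    words = set(name.upper().split())
    mixture_hits = sorted(words & MIXTURE_TERMS)
    if mixture_hits:
        return "complex_mixture", f"name contains mixture term {mixture_hits[0]!r}"
    class_hits = sorted(words & CLASS_TERMS)
    if class_hits:
        return "substance_class", f"name is a class term {class_hits[0]!r}"
    return "generic_ambiguity", "no CID and no rule matched"


for name, cid in [("RESIN", 0), ("ALKALOID FRACTION", 0), ("TANNINS", None)]:
    print(name, classify(name, cid))
```

Returning the reason alongside the label is the part that matters: it is what tells the next researcher why a CID is missing instead of leaving a bare zero.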

The dataset underwent an external CID review this month. A chemistry consultant manually reviewed 13,206 compound assignments and compared them with PubChem, COCONUT, and InChI keys. One confirmed error was found and corrected. 1,534 previously zero-CIDs were resolved by matching them with IUPAC names. The number of zero-CIDs has decreased by 8%.
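The 8% figure can be checked against the counts from the v2.4.0 release notes (19,150 zero-CID records in v2.3.1, 1,534 of them resolved). The variable names below are just for illustration.

```python
zero_cids_v231 = 19_150  # zero-CID records in v2.3.1
resolved = 1_534         # records that gained a verified CID in v2.4.0

zero_cids_v240 = zero_cids_v231 - resolved
pct_decrease = 100 * resolved / zero_cids_v231

print(zero_cids_v240)          # 17616
print(round(pct_decrease, 1))  # 8.0
```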

The dataset is provided as Parquet and JSON. Queryable in less than five minutes using DuckDB.

Available on HuggingFace (wirthal1990-tech/USDA-Phytochemical-Database-JSON). The GitHub repository (wirthal1990-tech/USDA-Phytochemical-Database-JSON) contains the complete MANIFEST and the methodology documentation.

u/DoubleReception2962 — 1 month ago
