Using PCA and TMAP to visualize the space of synthetic vs natural molecules

7 min read · Apr 2, 2022

TL;DR

Natural products provide the greatest opportunity in small-molecule drug discovery: over 60% of all small-molecule drugs approved between 1981 and 2014 come from natural products or their derivatives. Even more astounding, 95% of the natural world hasn’t been mapped chemically.
Screening libraries focus primarily on commercially available synthetic molecules, which lack the structural diversity found across natural products. The existing literature fails to visualize the significance of this opportunity space. Natural product drug discovery has been limited by the inability of the pharmaceutical industry to deal with commercial challenges in ‘purification, characterization, and chemical modification of complex natural product scaffolds’.
I conducted a PCA using 16 two-dimensional descriptors of the structure and stereochemistry of a combined ~390k natural products and synthetic molecules. Natural products show much greater variability in molecular complexity (most evidently in FSP3).
I also use TMAP clustering with MHFP fingerprints. Natural products (NPs) occupy a huge structural space that synthetic molecules do not explore and, as the PCA plots show, they are more diverse in terms of FSP3.
Decade-old measures such as QED, Lipinski’s Rule of 5, and Veber’s rule fail to account for this structural diversity.
The TMAP clustering only makes me more excited about how Enveda Bio’s novel approach, using metabolomics and machine learning, will truly unlock this amazing class of small-molecule drugs.

3D PCA plot comparing the structure and stereochemistry of natural products vs synthetic molecules (coloured by FSP3)

TMAP clustering (coloured by molecule type)

TMAP clustering (coloured by FSP3)

1. Introduction

Current limitations with screening libraries

Chen et al. (2019) note that

‘A recent analysis found that over 60% of all small-molecule drugs approved between 1981 and 2014 are genuine NPs, NP analogs or their derivatives, or compounds containing an NP pharmacophore’.

Yet current screening libraries for small molecules focus primarily on commercially available synthetic molecules, which 1) are often biased by decades-old conventions such as Lipinski’s rule of five (Lipinski, 1997) and Veber’s rule (Veber, 2002) and 2) are often limited in structural variability (with a focus on molecules that are structurally similar to approved small-molecule drugs). Why has the pharmaceutical industry shied away from natural product small-molecule drug discovery? Stratton et al. (2016) mention the commercial challenges in ‘purification, characterization, and chemical modification of complex natural product scaffolds’, and Atanasov et al. (2021) highlight the challenges of identifying the bioactive molecule and of ‘Accessing sufficient biological material to isolate and characterize a bioactive NP’.

The existing visualization of the natural product vs synthetic molecule space

While it is clear that natural products remain the most promising source of small-molecule drugs, Chen et al. (2019) do not visualize this structural and stereochemical diversity in a compelling way. Some issues that come to mind:

Keeping molecules with a molecular weight of up to 1500 Da leads to huge streaks in the natural product space. This does not truly represent the space of drug-likeness, especially when it comes to cell permeability.
Chen et al. (2019) do not remove terminal sugars from the molecules in their dataset. This could skew metrics of drug-likeness when we really only want to understand the properties of the potentially bioactive aglycon.

2. Methodology

I followed a similar approach to Chen et al. (2019) for data preprocessing. Natural products were taken from the UNPD, TCM Database@Taiwan and NP Atlas (the three largest datasets in Chen et al. 2019). For synthetic molecules, a sample was taken from the ZINC in-stock molecules (available for delivery within 2 weeks). I confirmed that this sample does not contain the ZINC natural products. I ended up with 194,952 natural product molecules and an equally sized sample of 194,952 synthetic molecules.

Dataset links

UNPD (as csv)

Location: http://oolonek.github.io/ISDB/
Date accessed: 01/03/2022

TCM Database@Taiwan (as mol2 file)

Location: TCM Database@Taiwan: The World’s Largest Traditional Chinese Medicine Database for Drug Screening In Silico, PLoS ONE 6(1): e15939. doi:10.1371/journal.pone.0015939
Date accessed: 01/03/2022

NP Atlas (as csv)

Location: https://www.npatlas.org/download
Date accessed: 01/03/2022

ZINC in-stock (as csv) — https://zinc15.docking.org/

Location: https://zinc15.docking.org/substances/subsets/now/
Date accessed: 01/03/2022

Cleaning procedures

Removing molecules with a molecular weight below 150 Da or above 1000 Da (Matsson and Kihlberg, 2017 assess how molecular weight affects cell permeability, which is why this upper bound differs from the 1500 Da used by Chen et al. 2019)
Removing molecules containing elements outside the set (H, B, C, N, O, F, Si, P, S, Cl, Se, Br, I)
Removing the few multi-component molecules, i.e. salts (fewer than 100)
Canonicalising and standardizing the SMILES using the MolVS library
Removing terminal sugars (I used the deglycosylation command line program created by Schaub et al. 2020)
Finally, removing duplicate SMILES
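To make the filtering steps above concrete, here is a minimal, dependency-free sketch. It is not the code used for this post: it assumes each molecule has already been turned into a record with a canonical SMILES string, a precomputed molecular weight and a list of element symbols (the `smiles`, `mw` and `elements` field names are hypothetical), and it leaves out the MolVS standardization and deglycosylation steps.

```python
# Sketch of the weight, element, multi-component and duplicate filters.
# Record field names ("smiles", "mw", "elements") are assumptions for illustration.
ALLOWED_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"}
MW_MIN, MW_MAX = 150.0, 1000.0


def keep_molecule(record):
    """Return True if a record passes the weight, element and single-component filters."""
    if not (MW_MIN <= record["mw"] <= MW_MAX):
        return False
    if not set(record["elements"]) <= ALLOWED_ELEMENTS:
        return False
    # In SMILES, "." separates disconnected components (e.g. a drug and its counter-ion).
    if "." in record["smiles"]:
        return False
    return True


def clean(records):
    """Apply the filters, then drop duplicate SMILES, keeping the first occurrence."""
    seen = set()
    cleaned = []
    for record in records:
        if keep_molecule(record) and record["smiles"] not in seen:
            seen.add(record["smiles"])
            cleaned.append(record)
    return cleaned


records = [
    {"smiles": "CC(=O)Oc1ccccc1C(=O)O", "mw": 180.16, "elements": ["C", "H", "O"]},  # aspirin
    {"smiles": "CC(=O)Oc1ccccc1C(=O)O", "mw": 180.16, "elements": ["C", "H", "O"]},  # duplicate
    {"smiles": "CCO", "mw": 46.07, "elements": ["C", "H", "O"]},  # below 150 Da
    {"smiles": "C[Hg]C", "mw": 230.66, "elements": ["C", "H", "Hg"]},  # Hg is not allowed
    {"smiles": "CNC(C)C(O)c1ccccc1.Cl", "mw": 201.69, "elements": ["C", "H", "N", "O", "Cl"]},  # salt
]
cleaned = clean(records)
```

Only the first aspirin record survives in this toy example. Deduplicating on the SMILES string only makes sense once the strings are canonical, which is why that filter comes last.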

PCA Features

16 two-dimensional descriptors were used in the creation of the PCA plots. 13 were taken from the Chen et al. (2019) paper, and 3 new features were added based on recent literature. Wei et al. (2020) classify the fraction of SP3 carbon atoms (FSP3) as a measure of ‘drug-likeness’:

‘It characterizes the spatial complexity of the molecule by describing its carbon saturation and improves the features of the compound by enhancing the water solubility’

Molecules with a higher FSP3 are likely to have a more pronounced 3D structure, which could support their ability to bind to novel sites. In addition, Stratton et al. (2016) conduct a cheminformatic comparison of approved drugs from 1981–2010 and use FSP3 as well as the number of stereocenters as measures of molecular complexity, which is why I incorporated both in my analysis. They state:

‘Notably, the normalized values for stereocenter count (nStMW) for NP and ND drugs were 2- to 6-fold higher than those for S* and S drugs (Table 2). These data are consistent with previous cheminformatic studies indicating that natural products have a greater degree of stereochemical diversity relative to synthetic drug-like compounds.’

The full list of features is below:

MW (Weight)
LogP (log P (o/w))
Topological polar surface area (TPSA)
Number of hydrogen bond acceptors (a_acc)
Number of hydrogen bond donors (a_don)
Number of heavy atoms (a_heavy)
Fraction of rotatable bonds (b_rotR)
Number of nitrogen atoms (a_nN)
Number of oxygen atoms (a_nO)
Sum of formal charges (FCharge)
Number of aromatic atoms (a_aro)
Number of chiral centers (chiral)
Number of rings (rings)
Number of stereocenters (stereo)
Fraction of sp3 carbon atoms (fsp3)
Number of spiroatoms (a_spiro)
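Because these descriptors live on very different scales (molecular weight in hundreds of Da, FSP3 between 0 and 1), a PCA on them usually starts by standardizing each column. The following is a dependency-free sketch of that idea, not the pipeline behind the plots in this post: it standardizes a tiny made-up descriptor table and extracts only the first principal component by power iteration on the covariance matrix. All values and function names are illustrative assumptions; in practice a library implementation of PCA would be the natural choice.

```python
import math
import random


def standardize(rows):
    """Scale every descriptor column to zero mean and unit sample variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / (n - 1))
        for j in range(d)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in rows]


def covariance(z):
    """Sample covariance matrix of already-centred data."""
    n, d = len(z), len(z[0])
    return [[sum(row[i] * row[j] for row in z) / (n - 1) for j in range(d)] for i in range(d)]


def first_component(cov, iterations=200, seed=0):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    rng = random.Random(seed)
    d = len(cov)
    v = [rng.random() for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the matching eigenvalue.
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, eigenvalue


# Made-up rows of (MW, FSP3, number of stereocenters), for illustration only.
descriptors = [
    [180.2, 0.11, 0],
    [300.4, 0.25, 1],
    [450.5, 0.40, 3],
    [600.7, 0.62, 6],
    [750.9, 0.80, 9],
]
z = standardize(descriptors)
pc1, eigenvalue = first_component(covariance(z))
scores = [sum(a * b for a, b in zip(row, pc1)) for row in z]  # PC1 coordinate per molecule
explained = eigenvalue / len(pc1)  # fraction of total variance (columns have unit variance)
```

Each entry of `scores` is the position of one molecule along the first principal component. A 2D or 3D plot needs the second and third components as well, which this sketch does not compute.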

TMAP clustering

Probst et al. (2020) introduce TMAP, a tree-based method for visualizing large data sets in two dimensions that combines locality sensitive hashing with graph theory. Generating a full similarity matrix based on Tanimoto similarity has a computational complexity of at least O(n²), and self-organizing maps become infeasible with large-scale data. TMAP, on the other hand, is built for large-scale visualization. The SMILES are encoded with MinHash (a locality sensitive hashing scheme) into MHFP fingerprints, which look at sub-structures within a diameter of 6 bonds. TMAP then uses its built-in LSH Forest data structure for fast k-NN search.
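The reason MinHash helps here is that the fraction of matching slots between two MinHash signatures estimates the Jaccard (Tanimoto) similarity of the underlying sets, so similar molecules can be found without comparing full substructure sets pairwise. The sketch below illustrates only that property using the standard library. It is not the MHFP or TMAP implementation, and the substructure tokens are hypothetical stand-ins for the substructures a real fingerprint would extract.

```python
import hashlib


def _hash(token, seed):
    """Deterministic 64-bit hash of a token; each seed acts as a separate hash function."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=512):
    """For every seeded hash function, keep the minimum hash over the token set."""
    return [min(_hash(token, seed) for token in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Jaccard (Tanimoto) similarity of two sets: shared items over all items."""
    return len(a & b) / len(a | b)


# Hypothetical substructure tokens for two related molecules (5 shared, 10 in total).
mol_a = {"c1ccccc1", "cC(=O)O", "C(=O)O", "cOC(C)=O", "CC(=O)O", "C=O", "cO", "CO"}
mol_b = {"c1ccccc1", "cC(=O)O", "C(=O)O", "cO", "C=O", "cc(O)c", "O"}

sig_a = minhash_signature(mol_a)
sig_b = minhash_signature(mol_b)
estimate = estimated_jaccard(sig_a, sig_b)
exact = exact_jaccard(mol_a, mol_b)
```

Here `exact` is 0.5, and `estimate` should land close to it, with the error shrinking as `num_hashes` grows. The fixed-length signatures are what an LSH Forest index can then bucket for approximate nearest-neighbour search.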

3. Results

💡 I first recreate the 2D PCA from Chen et al. (2019) to check that the data cleaning process was carried out correctly.

💡 Within a ‘drug-like’ range of molecular weight, natural products have much greater variability, especially within the 500–1000 Da range (which is neglected by Lipinski’s rule of 5).

💡 Natural products have much greater variability in terms of molecular complexity (comparing fraction of SP3 carbon atoms — FSP3).

💡 The visualization below, which splits molecules by whether they fit within the Beyond Rule of 5 outer limits, fails to show any clear pattern.

The Beyond Rule of 5 parameters (Doak et al. 2016) set new boundaries for the Lipinski features: they ‘define the current outer limits of physicochemical space where orally absorbed compounds may have a reasonable chance of being designed’.
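Splitting molecules by a set of descriptor limits can be sketched as a simple bounds check. The example below is a minimal illustration, not the code behind the plot: it encodes the classic Lipinski thresholds (MW ≤ 500, logP ≤ 5, at most 5 hydrogen bond donors and 10 acceptors) and shows how a wider limit set can be swapped in. The outer limits from Doak et al. are not reproduced here, so `custom_limits` holds placeholder values, and the descriptor keys and example molecule are hypothetical.

```python
# Each limit is a (lower, upper) pair; None means "no bound on that side".
LIPINSKI_RO5 = {
    "mw": (None, 500.0),
    "logp": (None, 5.0),
    "hbd": (None, 5),
    "hba": (None, 10),
}


def violations(descriptors, limits):
    """Return the names of the descriptors that fall outside their bounds."""
    failed = []
    for name, (lower, upper) in limits.items():
        value = descriptors[name]
        if (lower is not None and value < lower) or (upper is not None and value > upper):
            failed.append(name)
    return failed


# Illustrative descriptor values for a large, macrocycle-like natural product.
example = {"mw": 734.0, "logp": 2.5, "hbd": 5, "hba": 14}

# Placeholder wider limits: 1000 Da mirrors this post's weight cutoff, 15 is arbitrary.
custom_limits = dict(LIPINSKI_RO5, mw=(None, 1000.0), hba=(None, 15))

ro5_failures = violations(example, LIPINSKI_RO5)
custom_failures = violations(example, custom_limits)
```

With these numbers, `ro5_failures` is `["mw", "hba"]` while `custom_failures` is empty, which is the kind of split such a plot colours by.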

💡 In the TMAP of structural (MHFP) fingerprints, natural products occupy a new space that isn’t explored within the existing synthetic molecule space.

💡 Natural products are also evidently more structurally complex in terms of FSP3…

💡 … and to some degree in terms of the number of spiro atoms.

4. Limitations

I only incorporate three of the many natural product data sources used by Chen et al. (2019). I found this quite late on, but a collection of over 700,000 natural products can be downloaded here (Sorokina et al. 2021) and could enhance the map of natural products investigated.
The deglycosylation program used to remove both linear and terminal sugars may have slightly altered the bond types of 15,363 of the 389,904 molecules (as shown below). This should not affect most of the descriptors, although I did notice an effect on the number of chiral centres.