When a dataset is loaded inside Discngine Ideation SAR Slides, we process all input compounds and apply an in-house scaffold detection algorithm. This article gives some details on what happens behind the scenes and describes the properties that are attached to scaffolds.
What is a (good) scaffold?
A scaffold generally refers to the core structure or molecular framework of a compound: a (typically central) arrangement of atoms (rings and linkers) that defines the compound’s structural backbone. At the compound set level, the scaffold is typically found in multiple compounds and is used as a starting point for compound optimization by changing substituents (R-groups) or side chains.
Various automated scaffold definition methods exist in the literature. Most scaffold identification studies rely on a ring-based scaffold definition and typically use a specific set of rules to derive scaffolds from compound structures. The Murcko scaffold is probably one of the best-known definitions you'll find.
Scaffold Detection methodology overview
We have developed a method that relies on a compound fragmentation algorithm to derive a comprehensive (although not completely exhaustive; see the limitations at the end of this article) set of substructures. Every molecule is fragmented in multiple ways, generating a set of fragments with attachment points. Based on this list, we apply the following steps to derive scaffolds with attachment points:
- Remove very unlikely candidates (fragments having fewer than 5 atoms, or found in only a small fraction of compounds)
- Group scaffolds sharing the same substructure (ignoring attachment points)
- Within each group, merge scaffold attachment points to derive the final scaffolds.
- Keep the most populated ones
- Remap each scaffold to input structures to have a better estimate of the number of matching molecules.
- Score each remaining scaffold and return the top X (currently: 200).
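The steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the in-house implementation. Fragments are represented as hypothetical `(core, attachment_points, compound_id)` tuples, the core is treated as an opaque string, and `detect_scaffolds`, `atom_counts`, `min_fraction` and the other names are made up for the example (only the 5-atom cutoff and the top 200 come from the text). The order of the filtering steps is simplified, and the remapping and scoring steps are replaced by a plain count of supporting compounds.

```python
from collections import defaultdict


def detect_scaffolds(fragments, atom_counts, n_compounds,
                     min_atoms=5, min_fraction=0.05, top_n=200):
    """Toy version of the filtering, grouping and merging steps.

    fragments   -- iterable of (core, attachment_points, compound_id) tuples
    atom_counts -- dict mapping each core to its number of atoms
    """
    compounds_by_core = defaultdict(set)
    points_by_core = defaultdict(set)
    for core, attachment_points, compound_id in fragments:
        if atom_counts[core] < min_atoms:
            continue  # drop very small fragments
        # Group by core, ignoring attachment points.
        compounds_by_core[core].add(compound_id)
        # Merge the attachment points seen across the whole group.
        points_by_core[core].update(attachment_points)

    scaffolds = []
    for core, compounds in compounds_by_core.items():
        if len(compounds) / n_compounds < min_fraction:
            continue  # drop cores found in too few compounds
        scaffolds.append({
            "core": core,
            "attachment_points": sorted(points_by_core[core]),
            "n_compounds": len(compounds),
        })

    # Placeholder ranking: most populated first (assumption, not the real score).
    scaffolds.sort(key=lambda s: (-s["n_compounds"], s["core"]))
    return scaffolds[:top_n]


# Three compounds share an indole-like core with different attachment points.
fragments = [
    ("c1ccc2[nH]ccc2c1", {1}, "cpd1"),
    ("c1ccc2[nH]ccc2c1", {3}, "cpd2"),
    ("c1ccc2[nH]ccc2c1", {1, 3}, "cpd3"),
    ("CO", {0}, "cpd1"),
]
atom_counts = {"c1ccc2[nH]ccc2c1": 9, "CO": 2}
result = detect_scaffolds(fragments, atom_counts, n_compounds=3)
```

In this toy run the two-atom fragment is discarded, and the three fragments sharing the same core collapse into a single scaffold carrying the union of their attachment points.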
Descriptors
We provide a set of descriptors that we think might be useful to navigate and filter the list of detected scaffolds. Some of these are also used in the scoring scheme. A few of them are described in more detail below.
R-Group Counts
The number of R-groups identified for the scaffold, i.e. the number of positions on the scaffold's atoms that were identified as bearing at least one non-hydrogen substituent.
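As a rough illustration of this definition, the sketch below counts the scaffold positions that carry a non-hydrogen substituent in at least one matching compound. The data layout (one dictionary per compound, mapping a position index to a substituent label, with `"H"` meaning unsubstituted) and the function name `count_r_groups` are assumptions made for the example, not the product's internal representation.

```python
def count_r_groups(substituents_per_compound):
    """Count positions substituted by a non-hydrogen group in at least one compound.

    substituents_per_compound -- list of dicts, one per matching compound,
    mapping a scaffold position to the substituent found there.
    """
    substituted_positions = set()
    for substituents in substituents_per_compound:
        for position, substituent in substituents.items():
            if substituent != "H":
                substituted_positions.add(position)
    return len(substituted_positions)


# Position 5 only ever carries a hydrogen, so it is not counted as an R-group.
matches = [
    {1: "Cl", 3: "H", 5: "H"},
    {1: "H", 3: "OC", 5: "H"},
    {1: "F", 3: "H", 5: "H"},
]
r_group_count = count_r_groups(matches)
```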
Est. Total # Match
Each detected scaffold is associated with the estimated number of molecules that match it. This number is an estimate: especially for large datasets, computing an exact value remains computationally demanding given the number of candidate scaffolds we process. Until we find a more efficient approach, we compute it on a random subset of the dataset's molecules and derive an estimate from it.
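The general idea of estimating a count from a random subset can be sketched as follows. This is a minimal example under stated assumptions: the function `estimate_match_count`, its parameters, and the linear scaling of the sampled hit rate to the full dataset are illustrative choices, and a simple string test stands in for a real substructure search.

```python
import random


def estimate_match_count(molecules, matches_scaffold, sample_size, seed=0):
    """Estimate how many molecules match a scaffold from a random subset."""
    n_total = len(molecules)
    if n_total == 0:
        return 0
    sample_size = min(sample_size, n_total)
    rng = random.Random(seed)  # fixed seed only to make the example reproducible
    sample = rng.sample(molecules, sample_size)
    hits = sum(1 for molecule in sample if matches_scaffold(molecule))
    # Scale the hit count observed in the sample up to the full dataset.
    return round(hits * n_total / sample_size)


# Toy data: strings stand in for molecules.
molecules = ["indole-A"] * 300 + ["other-B"] * 700


def is_match(molecule):
    return molecule.startswith("indole")


estimate = estimate_match_count(molecules, is_match, sample_size=100)
exact = estimate_match_count(molecules, is_match, sample_size=len(molecules))
```

With a sample of 100 molecules the result is only approximate and depends on the draw. Sampling the whole dataset recovers the exact count, at the cost of testing every molecule.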
Structural descriptors
Various structural descriptors are also provided and are self-explanatory. Note, however, that unless explicitly stated otherwise, descriptors are computed on the scaffold alone, without considering R-groups. For example, the number of heavy atoms of this scaffold is 9:
[Image: example scaffold with 9 heavy atoms]
Score
The score is computed with an internal weighted multi-parameter scoring function. It is relative: each individual component of the score is normalized between the minimum and maximum value found within the dataset itself. This means that scores cannot be compared across datasets. It also means that a very low score does not necessarily make a structure unworthy of consideration; rather, structures with a higher score are probably more interesting.
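To make the "relative" aspect concrete, here is a minimal sketch of a weighted sum over min-max normalized components. The component names (`n_matches`, `n_rings`) and the weights are hypothetical and chosen only for the example; the actual components and weights of the internal scoring function are not described in this article.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using the min and max of this dataset only."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # no spread: the component cannot discriminate
    return [(v - lowest) / (highest - lowest) for v in values]


def relative_scores(descriptors, weights):
    """Weighted sum of per-dataset normalized components.

    descriptors -- dict: component name -> list of raw values, one per scaffold
    weights     -- dict: component name -> weight (hypothetical values)
    """
    n_scaffolds = len(next(iter(descriptors.values())))
    scores = [0.0] * n_scaffolds
    for name, values in descriptors.items():
        for i, normalized in enumerate(min_max_normalize(values)):
            scores[i] += weights[name] * normalized
    return scores


weights = {"n_matches": 0.75, "n_rings": 0.25}
dataset_a = {"n_matches": [10, 50, 90], "n_rings": [1, 2, 3]}
dataset_b = {"n_matches": [10, 500, 990], "n_rings": [1, 2, 3]}
scores_a = relative_scores(dataset_a, weights)
scores_b = relative_scores(dataset_b, weights)
```

In this toy case the middle scaffold matches 50 molecules in the first dataset and 500 in the second, yet it receives the same score in both, because each component is rescaled against its own dataset. This is why a score is only meaningful relative to the other scaffolds of the same dataset.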
Applicability and limitations
For datasets containing molecules which are close analogs, typically based on the same scaffold or sets of scaffolds showing a high degree of structural similarity, the automated scaffold detection algorithm is likely to provide relevant results.
For large datasets with very diverse sets of molecules, the results may be less relevant, because one heavily weighted attribute in the underlying scoring scheme is the (estimated) number of compounds matching a particular scaffold. Although we apply a penalty to very small scaffolds (e.g. a single ring with multiple attachment points), these usually irrelevant scaffolds would still show up for very diverse datasets.
Likewise, molecules based on scaffolds that do not contain any ring systems are less likely to be detected automatically: they may show up in the list, but they may be assigned a lower score since we favor scaffolds containing at least one ring in our scoring scheme. Various scaffold descriptors are available to filter the list of returned scaffolds to help you remove or focus on what you might find relevant in the context of your project.
Finally, our algorithm currently does not split fused rings. As a result, when a dataset contains molecules in which some moieties are attached to a given core scaffold through non-ring bonds while others form fused rings with it, the core scaffold itself is less likely to be detected properly.
