Retrosynthesis Prediction Task Overview

Definition: Retrosynthesis is the process of finding a set of reactants that can be used to synthesize a target molecule (the product); it is a fundamental task in drug manufacturing. The target is recursively transformed into simpler precursor molecules until commercially available "starting" molecules are identified. Each data sample contains exactly one product molecule and one or more reactant molecules. Retrosynthesis prediction can be seen as the reverse of reaction outcome prediction.
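
The recursive decomposition described above can be sketched with a toy example. This is a minimal illustration, not part of TDC: the lookup table ONE_STEP, the PURCHASABLE catalogue, and the functions plan_route and starting_materials are all hypothetical names, and the two hand-written disconnections stand in for what a trained single-step model would predict.

```python
# Toy single-step table: product SMILES -> list of reactant SMILES.
# Hand-written for illustration; a real system would query a trained model here.
ONE_STEP = {
    # paracetamol <- acetic anhydride + 4-aminophenol
    "CC(=O)Nc1ccc(O)cc1": ["CC(=O)OC(C)=O", "Nc1ccc(O)cc1"],
    # 4-aminophenol <- 4-nitrophenol (reduction; reagents omitted)
    "Nc1ccc(O)cc1": ["O=[N+]([O-])c1ccc(O)cc1"],
}

# Hypothetical catalogue of commercially available starting molecules.
PURCHASABLE = {"CC(=O)OC(C)=O", "O=[N+]([O-])c1ccc(O)cc1"}


def plan_route(target, depth=0, max_depth=5):
    """Recursively expand `target` until every leaf is purchasable.

    Returns a nested dict {molecule: [sub-routes]}; purchasable leaves map to [].
    Raises ValueError if no route is found within `max_depth` steps.
    """
    if target in PURCHASABLE:
        return {target: []}
    if depth >= max_depth or target not in ONE_STEP:
        raise ValueError(f"no route found for {target}")
    return {target: [plan_route(r, depth + 1, max_depth) for r in ONE_STEP[target]]}


def starting_materials(route):
    """Collect the purchasable leaves of a route returned by plan_route."""
    ((molecule, children),) = route.items()
    if not children:
        return {molecule}
    found = set()
    for child in children:
        found |= starting_materials(child)
    return found


route = plan_route("CC(=O)Nc1ccc(O)cc1")
print(starting_materials(route))
```

Each call to plan_route corresponds to one single-step prediction of the kind the datasets below are used to train; the recursion stops once every branch ends in a purchasable molecule.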

Impact: Retrosynthesis planning helps chemists design synthetic routes to target molecules. Computational retrosynthetic analysis tools can assist chemists in designing routes to novel molecules, and machine-learning-based methods have the potential to substantially reduce the time and cost involved.

Generalization: The model is expected to accurately generate reactant sets for novel drug candidates whose structures differ from those in the training set, across reaction types and under varying reaction conditions.

Product: Small-molecule.

Pipeline: Manufacturing - Synthesis planning.

USPTO-50K

Dataset Description: USPTO-50K is a subset of reactions from the USPTO (United States Patent and Trademark Office) data, consisting of about 50K extracted atom-mapped reactions that cover 10 reaction types.

Task Description: Given the product X, generate the reactant set Y.
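
To make "reactant set" concrete, here is a minimal sketch that assumes X and Y are stored as SMILES strings, with the reactants in Y joined by "." (the standard SMILES separator for disconnected molecules). The helper names reactant_set and same_reactants are hypothetical and not part of TDC.

```python
def reactant_set(smiles):
    """Split a dot-separated SMILES string into a frozenset of molecule strings."""
    return frozenset(part for part in smiles.split(".") if part)


def same_reactants(predicted, reference):
    """Compare two reactant strings as sets, ignoring the order of molecules.

    This is a plain string comparison, so it is only meaningful when both
    sides were canonicalized the same way beforehand (an assumption here).
    """
    return reactant_set(predicted) == reactant_set(reference)


product = "CC(=O)Nc1ccc(O)cc1"               # X: a single product molecule
reactants = "CC(=O)OC(C)=O.Nc1ccc(O)cc1"     # Y: two reactants joined by "."

print(sorted(reactant_set(reactants)))
print(same_reactants("Nc1ccc(O)cc1.CC(=O)OC(C)=O", reactants))
```

Treating Y as a set rather than a sequence reflects that the order in which reactants are written carries no chemical meaning.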

Dataset Statistics: 50,036 reactions.

Dataset Split: Random Split

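from tdc.generation import RetroSyn
# Load the USPTO-50K retrosynthesis dataset
data = RetroSyn(name = 'USPTO-50K')
# Random split into training, validation, and test sets
split = data.get_split()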
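
For readers unfamiliar with random splitting, the idea can be sketched on toy data using only the standard library. This is an illustration of the technique, not TDC's implementation: the function random_split, the 70/10/20 fractions, the seed value, and the "train"/"valid"/"test" keys are assumptions chosen for the example.

```python
import random


def random_split(items, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle `items` with a fixed seed and cut them into three partitions."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)       # seeded for reproducibility
    n_train = round(frac[0] * len(shuffled))
    n_valid = round(frac[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],   # remainder goes to test
    }


reactions = [f"reaction_{i}" for i in range(10)]  # stand-ins for real samples
toy_split = random_split(reactions)
print({name: len(part) for name, part in toy_split.items()})
```

Because a random split ignores molecular structure, similar reactions can end up in both the training and test partitions, which is worth keeping in mind when interpreting results against the generalization goal stated above.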

Note: To get the reaction type of each reaction, use:

from tdc.utils import get_reaction_type
get_reaction_type('USPTO-50K')

Note: Starting with version 0.3.5, you can also include the reaction type in the dataframe of each split by setting include_reaction_type to True:

split = data.get_split(include_reaction_type = True)

References:

[1] Lowe, Daniel Mark. Extraction of chemical structures and reactions from the literature. Diss. University of Cambridge, 2012.

[2] Liu, Bowen, et al. “Retrosynthetic reaction prediction using neural sequence-to-sequence models.” ACS central science 3.10 (2017): 1103-1113.

[3] Zheng, Shuangjia, et al. “Predicting retrosynthetic reactions using self-corrected transformer neural networks.” Journal of Chemical Information and Modeling 60.1 (2019): 47-55.

Dataset License: CC0.


USPTO

Dataset Description: The full USPTO (United States Patent and Trademark Office) retrosynthesis dataset.

Task Description: Given the product X, generate the reactant set Y.

Dataset Statistics: 1,939,253 reactions.

Dataset Split: Random Split

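from tdc.generation import RetroSyn
# Load the full USPTO retrosynthesis dataset
data = RetroSyn(name = 'USPTO')
# Random split into training, validation, and test sets
split = data.get_split()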

References:

[1] Lowe, Daniel. Chemical reactions from US patents (1976-Sep2016).

Dataset License: CC0.