Automatic Extraction of Reaction Templates for Synthesis Prediction

Abstract: Several tools for the computational planning of synthetic routes have been developed over the last 60 years. Traditionally these have been built on manually or automatically extracted reaction rules or templates, obtained from a deep knowledge of organic chemistry in the case of the former, and from reaction databases in the case of the latter. Herein we give an introductory overview of the process of automatically extracting reaction templates, starting from methods for reaction centre identification, through to their use in computer-aided synthesis planning and the de novo design of compounds.


Introduction
Reaction templates or rules encode the atom, bond, and bond order changes between a set of substances for a given chemical transformation. [1,2] Thus, a reaction template encodes the reaction centre, the automatic extraction of which was first proposed by Vleduts. [3] It follows that, given the ability to extract the reaction centre from a set of reaction examples, a knowledge base of reaction templates can be constructed that codifies organic chemistry. [4] In turn, the knowledge base can be applied to the task of synthesis planning, which, starting from a molecule of interest, aims to predict the most likely steps for its construction from a set of known building blocks. [5] Herein, we give an introductory overview of reaction templates as used in organic chemistry, starting from a generalised method for the identification of reaction centres, exemplified by a Claisen rearrangement, through to their use in computer-aided synthesis planning (CASP). For a more exhaustive coverage of methods used for reaction centre identification and extraction, we refer the reader to the references within. [1,6]

Automatic Template Extraction
The extraction of reaction templates from a set of examples starts with atom-atom mapping (AAM) to identify correspondence of atoms and bonds in the reaction. The reaction centre (RC) can subsequently be extracted by using AAM to identify the atoms and bonds that have changed. Subsequently, the RC can be encoded into a reaction template, and adapted for downstream modelling tasks as outlined in the following section.

Atom-Atom Mapping (AAM)
The identification of the RC and AAM are two closely related problems. Molecules can be represented as graphs, [7] thus the reaction centre can be represented in the form of sub-graphs describing the atom and bond changes between a set of substances, in this case organic compounds. Consequently, graph matching (isomorphism) techniques can be used to compare two or more sets of molecules. The maximum common subgraph (MCS) is defined as the largest substructure common to the collection of graphs under consideration. [6] In the context of reaction centre detection, the MCS algorithm can be used to map atoms in the products to those in the reactants (AAM), and in doing so identify the atoms and bonds that have changed during the course of a reaction. [8] While determining the correspondence between atoms and bonds may be trivial for a human chemist, finding the MCS exactly is computationally expensive (the general problem is NP-hard). Thus, approximate routines are often used to determine AAM. Several alternatives to MCS-based algorithms exist for AAM and have been comprehensively reviewed elsewhere. [2,9] Notably, a recent AAM benchmarking study found that RXNMapper, a deep learning-based AAM tool, outperformed previous approaches, obtaining 83.74% on the benchmark dataset. [9,10]
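As a minimal sketch of what an atom-atom mapping provides, the snippet below reads the map numbers from an atom-mapped reaction SMILES and reports which mapped atoms appear on only one side. The reaction (an esterification) and the function names `map_numbers` and `unmatched_maps` are illustrative assumptions and are not taken from the tools cited above. The regular expression only recognises map numbers written as `:n]` at the end of a bracket atom.

```python
import re

# Atom-map numbers are written as ":n" just before the closing bracket of an atom.
MAP_NUMBER = re.compile(r":(\d+)\]")


def map_numbers(side):
    """Collect the atom-map numbers found on one side of a reaction SMILES."""
    return {int(number) for number in MAP_NUMBER.findall(side)}


def unmatched_maps(reaction_smiles):
    """Return (maps only in reactants, maps only in products)."""
    # Reaction SMILES has the form reactants>agents>products.
    reactants, _agents, products = reaction_smiles.split(">")
    reactant_maps = map_numbers(reactants)
    product_maps = map_numbers(products)
    return reactant_maps - product_maps, product_maps - reactant_maps


# Hypothetical example: acetic acid + methanol -> methyl acetate.
# The hydroxyl oxygen of the acid (map 4) leaves as water and is not in the product.
rxn = "[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][OH:6]>>[CH3:1][C:2](=[O:3])[O:6][CH3:5]"
lost, gained = unmatched_maps(rxn)
```

Atoms that appear on only one side, such as the leaving oxygen here, cannot be compared between reactants and products, so a check of this kind is a plausible first step before any reaction centre is extracted.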

The Reaction Centre
Given that the RC considers all bond changes occurring during the course of a reaction, [3] several approaches have been developed for its representation.
The imaginary transition state (ITS) proposed by Fujita et al. may be intuitive for a chemist to grasp given an understanding of reaction mechanism. [11] The ITS can be considered a unitary reaction representation, meaning that all molecules involved in the reaction are merged into one molecular graph, and the reaction is considered a pseudomolecule. A more recent example of a unitary representation is the condensed graph of reaction (CGR), by Varnek and co-workers. [12,13] The corresponding Python library can additionally be used for reaction centre extraction and curation. [13] The result is a string representing the reaction in the form of a CGR signature. [13]
So-called 'shell/radius-based' approaches for extracting the reaction centre may also be applied. 'Shell/radius' (Radius-N) based approaches capture the reaction centre by examining the atomic environment (AE) up to a pre-specified number of bonds (N) away from the reaction centre (Fig. 1c-e), as shown for Radius-0 to Radius-2 in purple shading. To illustrate this, consider the structural representation of a retro-reaction for the Claisen rearrangement shown in Fig. 1a,b. [14,15] The reaction centre is highlighted in dark purple and constitutes the atoms and bonds that change as a result of the reaction. The atom-mapped reaction SMILES, [16] a string representation commonly used in reaction databases and for modelling tasks, is shown alongside the structural representation (Fig. 1b).
The reaction centre can be identified by iterating around the molecule using the atom mapping numbers until a change in the AE can be identified. For each atom, a change in AE is identified by evaluating whether the neighbouring atoms and bonding environment have changed between the reactants and products. If a change has been detected, a component of the RC has been found. The iterations continue until all atoms have been visited, and the fragments identified as having changed are combined to create the reaction centre.
The RC is then extracted in the form of SMARTS patterns (Fig. 1c-e). The simplistic radial approach is followed by the application of heuristics to tune the specificity of the reaction centre by accounting for groups known to influence the reaction.
The open source toolkit RDChiral facilitates reaction centre extraction using the aforementioned approach, [1] in line with approaches such as ARChem (Route Designer), [17] KOSP, [18] and InfoChem's CLASSIFY. [6]
Fig. 1. The Claisen rearrangement shown as a retrosynthetic reaction alongside the atom-mapped reaction SMILES, commonly used for data-processing tasks. The atom-mapping is annotated on the structure and shows the correspondence between atoms in the reactants and products. The highlighted atoms and bonds correspond to (c) the reaction centre (Radius-0), which constitutes the core atoms and bonds that have changed during the reaction. We see a C-C bond is broken between atoms 12 and 13, and a C-C bond formed between atoms 5 and 15. Additionally, all bond orders have been modified. The details of atom, bond, and bond order change are written as reaction SMARTS, shown below the extracted template. The mapping contained in the reaction SMARTS is self-consistent and does not reflect that annotated on the structural representation. (d) The reaction template extracted when the reaction centre is extended one bond away from the reaction centre (Radius-1), alongside the reaction SMARTS. (e) The reaction template extracted when the reaction centre is extended two bonds away from the reaction centre (Radius-2), alongside the reaction SMARTS. The templates correspond to sub-structures extracted from the reaction shown in (b) and have been highlighted accordingly.
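The identification and radial extension of the reaction centre can be sketched in a few lines, assuming each side of an atom-mapped reaction is available as a bond table keyed by pairs of map numbers. This is an illustrative simplification and not the RDChiral implementation: it compares bonds only, ignoring changes in charge, hydrogen count, and stereochemistry. The helper names are hypothetical, and the numbering belongs to a made-up substituted allyl vinyl ether rather than to Fig. 1.

```python
def bond_table(*bonds):
    """Build {frozenset({a, b}): bond order} from (a, b, order) triples."""
    return {frozenset((a, b)): order for a, b, order in bonds}


def reaction_centre(reactant_bonds, product_bonds):
    """Atoms taking part in any bond that is formed, broken, or changes order."""
    centre = set()
    for bond in set(reactant_bonds) | set(product_bonds):
        if reactant_bonds.get(bond) != product_bonds.get(bond):
            centre |= bond
    return centre


def expand(centre, bonds, radius):
    """Add all atoms within `radius` bonds of the reaction centre (Radius-N)."""
    adjacency = {}
    for bond in bonds:
        a, b = bond
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    selected = set(centre)
    frontier = set(centre)
    for _ in range(radius):
        frontier = {n for atom in frontier for n in adjacency.get(atom, ())} - selected
        selected |= frontier
    return selected


# Hypothetical Claisen rearrangement: C1=C2-O3-C4(-C7-C8)-C5=C6 -> O3=C2-C1-C6-C5=C4(-C7-C8)
reactant = bond_table((1, 2, 2), (2, 3, 1), (3, 4, 1), (4, 5, 1), (5, 6, 2), (4, 7, 1), (7, 8, 1))
product = bond_table((1, 2, 1), (2, 3, 2), (4, 5, 2), (5, 6, 1), (1, 6, 1), (4, 7, 1), (7, 8, 1))

centre = reaction_centre(reactant, product)   # Radius-0: atoms 1 to 6
radius_1 = expand(centre, reactant, 1)        # adds atom 7
radius_2 = expand(centre, reactant, 2)        # adds atoms 7 and 8
```

In this sketch the six atoms of the rearranging framework form the Radius-0 centre, and each increment of the radius pulls in one more shell of the ethyl substituent. Writing the selected atoms out as SMARTS would need a cheminformatics toolkit and is not shown.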
The ability of a template to generate an outcome upon application depends on there being a substructure match between the template and the substrate to which it is applied. Thus, as discussed in the section on template size and specificity, the requirements for size, specificity, and exclusivity depend on the downstream task for which the templates are required. Approaches for automated template extraction have limitations in that the templates may generalise poorly, as they can be too specific. [19] This has been shown for the task of retrosynthetic prediction, where increasing the radius at which the template is extracted led to a decrease in performance when searching for multi-step synthetic routes. [43]

Uses in Synthesis Informed de novo Design
Given that templates can be used for reaction prediction and retrosynthesis, it follows that starting from a set of building blocks, templates can be applied strategically to generate de novo compounds. [24,25,44-46] These strategies have been used in the past, for instance, for the combinatorial enumeration of virtual libraries. Current approaches combine developments in CASP using neural networks to prioritize and score reaction templates that may be used to generate a set of compounds with a given property profile. [47,48] This overcomes an inherent limitation of enumerative and generative approaches in that synthetic accessibility is considered as a design factor. In doing so, synthetic routes are predicted alongside de novo designed compounds.

Conclusions
Herein we have given an introductory overview of reaction templates as used in organic chemistry, starting from methods for the identification of reaction centres, through to their use in computer-aided synthesis planning (CASP) and the de novo design of compounds. Given the wide use of reaction templates in the field of cheminformatics, there remains a strong interest in developing methods for improved reaction centre identification and extraction. In addition, the development of underlying technologies such as atom-atom mapping, as well as methods addressing the specificity, diversity, and exclusivity of reaction templates, are currently under investigation in the community.

Size, Specificity, Diversity, and Exclusivity of Reaction Templates
The size and specificity of reaction templates govern the diversity of reaction centres and their exclusivity. Larger templates, obtained at larger radii or by extension of the reaction centre, are more specific to the substrate from which they were extracted, as shown in Fig. 1b-d. This has the advantage of capturing substrate diversity for analytical tasks; for synthesis planning, however, the exclusivity of the template means that it cannot broadly be applied to carry out the transformation it encodes. Thus, for synthesis planning tasks, non-exclusive templates that describe the same chemical transformation on overlapping sets of molecules are required. [19] In addition, the larger, more specific, and more exclusive a template, the greater the number of templates extracted from a reaction database. Thus, the requirements for size, specificity, and exclusivity depend on the downstream task for which the templates are required. In the context of synthesis planning, this can vary depending on the reaction type. For instance, consider enzyme-catalysed reactions, for which the bonding motifs required for substrate-enzyme binding are constrained. It is vital to consider both the reaction centre and the groups governing binding to the enzyme active site. Duigou et al. have tackled this issue by specifying a stereochemistry-aware set of reaction rules with different levels of specificity. [20] An issue arising from non-exclusive templates is that multiple templates may encode the same chemical transformation due to variations in the encoding. The variations arise from the existence of multiple solutions to atom-mapping, and from the order in which the atoms in the molecule are visited during algorithmic extraction. To address this issue, Heid et al. have built upon RDChiral, a template extraction tool, by developing a canonicalization algorithm to correct automatically extracted templates. [19]
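The relationship between template size and the number of distinct templates can be illustrated with a toy calculation. The sketch below assumes each reaction example is reduced to its element labels, bonds, and reaction centre, and uses a deliberately crude signature (element and distance from the centre) in place of a real SMARTS template. The three substrates and the function name `signature` are hypothetical.

```python
def signature(elements, bonds, centre, radius):
    """Crude stand-in for a template: sorted (element, distance from centre) pairs."""
    adjacency = {atom: set() for atom in elements}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    distance = {atom: 0 for atom in centre}
    frontier = set(centre)
    for step in range(1, radius + 1):
        frontier = {n for atom in frontier for n in adjacency[atom] if n not in distance}
        for atom in frontier:
            distance[atom] = step
    return tuple(sorted((elements[atom], d) for atom, d in distance.items()))


# Three hypothetical substrates sharing the same C-O reaction centre (atoms 1 and 2).
examples = [
    ({1: "C", 2: "O", 3: "C"}, [(1, 2), (1, 3)], {1, 2}),
    ({1: "C", 2: "O", 3: "N"}, [(1, 2), (1, 3)], {1, 2}),
    ({1: "C", 2: "O", 3: "C", 4: "F"}, [(1, 2), (1, 3), (3, 4)], {1, 2}),
]

# Number of distinct signatures obtained at Radius-0, Radius-1, and Radius-2.
unique_counts = [
    len({signature(elements, bonds, centre, radius) for elements, bonds, centre in examples})
    for radius in (0, 1, 2)
]
```

At Radius-0 all three examples collapse to a single signature, while at Radius-2 each example yields its own, mirroring the trade-off between generality and specificity described above.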

Uses in Computer Aided Synthesis Planning
Research into CASP has steadily been making progress since its beginnings in the 1960s. [5] Historically, CASP systems for the prediction of multi-step synthetic routes, and for the combinatorial enumeration of virtual libraries, have relied upon rule- or template-based systems, where the terms are used interchangeably. [21-24] Initially, templates were manually encoded based on expert knowledge; however, few rule-sets were made publicly available, and those that were remained limited in chemical diversity and scope. [25,26] The largest rule-base known was developed by the Grzybowski group for the CASP program SYNTHIA (formerly CHEMATICA). It consists of over 70 k manually encoded reaction rules, and has delivered successful syntheses for medicinally relevant and natural product targets. [22,27,28] Given the manpower and time taken to encode such large rule-bases, algorithmic approaches that automatically extract reaction rules have been investigated since the early 1990s. [1,4,6,17,29-34] The debate concerning the quality and scalability of manual versus automatic reaction rule encoding is still ongoing. [35,36] While the process of manual encoding is laborious, the quality of the rules may be higher, and their coverage may be sufficient for the rate at which organic chemistry is growing, argue Grzybowski et al. [36] However, purely data-driven approaches to CASP that negate the need for templates have now been developed, utilizing the SMILES representation of molecules, [37] combined with developments in natural language processing (NLP) from computer science. [38-40]
As templates encode the reaction centre, they can be applied to a set of reactants to generate the corresponding product (reaction prediction), [34,41] or applied to a product to generate a set of reactants (retrosynthesis). [32,41,42] To determine which reaction template to use for a given set of reactants or products, modern approaches to synthesis planning use neural networks to recommend which reactions, and thus which templates, are suitable for generating the desired transformation.
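The recommend-then-apply logic can be sketched as follows. The scores stand in for the output of a trained neural network, and the substructure match is reduced to a check on functional-group labels. Both are placeholder assumptions, as are the template names and the function `recommend`. A real system would match SMARTS patterns against a molecular graph.

```python
def recommend(scores, required_groups, target_groups, top_k=2):
    """Rank templates by score and keep the top_k whose pattern matches the target."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # A template can only produce an outcome if its pattern is present in the target.
    applicable = [name for name in ranked if required_groups[name] <= target_groups]
    return applicable[:top_k]


# Hypothetical scores, standing in for a neural network's template probabilities.
scores = {
    "amide_disconnection": 0.7,
    "suzuki_disconnection": 0.2,
    "ester_disconnection": 0.1,
}
# Hypothetical functional groups that each template needs to find in the target.
required_groups = {
    "amide_disconnection": {"amide"},
    "suzuki_disconnection": {"biaryl"},
    "ester_disconnection": {"ester"},
}
target_groups = {"amide", "ester"}  # groups present in the target molecule

selected = recommend(scores, required_groups, target_groups)
```

Here the second-ranked template is skipped because the target contains no biaryl motif, so the lower-scored ester disconnection is returned instead. Filtering before truncating to `top_k` is one possible design choice. Alternatively, the top-ranked templates can be applied directly, with those that fail to match discarded.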