Augmenting Chemical Space with DNA-encoded Library Technology and Machine Learning


DNA-encoded library (DEL) [1] technology (DELT) has emerged as one of the fastest and most cost-effective screening platforms available in industry, both for hit discovery [2] and, more recently, for druggability and tractability assessments and the subsequent prioritization of therapeutic targets in the early phase of drug discovery programs. [3] The key principle of DELs is the combinatorial assembly (synthesis) of library members from chemical building blocks (BBs) and the corresponding tagging of each BB with a unique DNA sequence (barcode), alternating chemical reactions with DNA ligations. In analogy to phage display technology, [4] this physical linkage (Fig. 1) of small organic molecules to distinctive DNA barcodes makes it possible to deconvolute the chemical identity (structure) of each and every molecule by next-generation sequencing (NGS) at any time. [5] Although the concept was originally proposed by Brenner and Lerner in a theoretical paper in 1992, [6] it was not until 2004, as a result of the remarkable advancements in NGS, that several academic groups [7] reduced the technology to practice, with multiple implementations of encoding schemes and library designs resulting in distinct IP space [8] and its commercial exploration by an entirely new industry.
Owing to the specific encoding system, the combined set of libraries (pool) can be stored in a single test tube, and billions of potential ligands can be screened as a mixture all at once in a simple, one-day binding experiment (panning) against the target of choice (in general, recombinant protein of high purity and quality is needed). The DNA tags of library members allow for exponential amplification by polymerase chain reaction (PCR); thus, even minute amounts of binders can be detected and unambiguously identified by deep sequencing after (heat) elution from the target. [9] The sequencing data obtained are evaluated by calculating an enrichment ratio (ER) or score [10] of preferential binders compared to the background (defined matrix/non-target control), and the results are displayed using dedicated chemical analysis software (e.g. TIBCO Spotfire). Identifying patterns or fingerprints (chemical series) within the same library and across different libraries facilitates the discrimination of binding from non-binding library members.
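The ER calculation described above can be illustrated with a minimal Python sketch. The exact scoring scheme of ref. [10] is not given here, so the depth-normalized frequency ratio with a pseudocount, the barcode labels, the read counts and the threshold of 2.0 are all assumptions made for illustration only.

```python
from collections import Counter


def enrichment_ratios(target_counts, control_counts, pseudocount=1.0):
    """Return {barcode: ER} for a target selection versus a non-target control.

    Counts are converted to frequencies so that samples sequenced at different
    depths are comparable; the pseudocount (an assumption of this sketch)
    avoids division by zero for barcodes missing from one sample.
    """
    barcodes = set(target_counts) | set(control_counts)
    n = len(barcodes)
    target_total = sum(target_counts.values()) + pseudocount * n
    control_total = sum(control_counts.values()) + pseudocount * n
    ratios = {}
    for barcode in barcodes:
        target_freq = (target_counts.get(barcode, 0) + pseudocount) / target_total
        control_freq = (control_counts.get(barcode, 0) + pseudocount) / control_total
        ratios[barcode] = target_freq / control_freq
    return ratios


# Hypothetical decoded read counts (barcode -> number of reads).
target = Counter({"A1-B7": 120, "A3-B2": 40, "A5-B9": 40})
control = Counter({"A1-B7": 10, "A3-B2": 45, "A5-B9": 45})

ers = enrichment_ratios(target, control)
# Keep library members above an arbitrary, illustrative ER threshold.
hits = sorted((b for b, er in ers.items() if er >= 2.0), key=ers.get, reverse=True)
print(hits)
```

In practice the threshold and the statistical treatment of low read counts would be chosen per screen; the sketch only shows the comparison of target counts with background counts.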
DELT has proven to be robust in delivering novel (and often radically different) chemical starting points for medicinal chemistry programs within Roche and also externally. Not surprisingly, DELT now holds a firm place in the screening armamentarium of almost every pharma company (either as an in-house operated platform or accessed through a CRO), with a constantly growing number of success stories [5] (e.g. the appearance of several DELT-derived molecules in the clinic [11]) and even more players in the market. [12] For academic researchers, who are often confronted with budget constraints, DELT offers a convenient and relatively cheap source of tool compounds from large chemical repertoires, e.g. hits from DELT screens can be used as probes for the elucidation of complex cellular signaling pathways or for the analysis of biostructural dynamics and interaction studies.

Medicinal Chemistry and Chemical Biology Highlights Division of Medicinal Chemistry and Chemical Biology
A Division of the Swiss Chemical Society
However, one limiting factor in the whole DELT process is the actual hit follow-up, i.e. the confirmation of binding or activity with hit compounds resynthesized off-DNA. Hits have to be resynthesized in milligram quantities because library members, in the form of DNA-compound conjugates, are present at extremely low abundance. No current chemical analytics technique can quantify or characterize these DELT library members in the library pool. The only method able to clearly identify molecules in the library is exponential amplification by PCR (with quantification by qPCR) followed by decoding of the DNA barcodes by deep sequencing (read and count).
Whereas the DELT screen from target arrival to the processed hit list with ER values may take less than one week, the de novo chemical synthesis of these hits without DNA barcodes may take up to 6 months, depending on the number of hits to follow up, the availability of the initial building blocks (starting material), the invested resources and the number of chemists allocated to the task. Two main avenues are followed to shorten this essential step of biophysical or biochemical hit validation:
1. Using assay methods that are amenable to testing the conjugates as they are present in the library. Such systems have been implemented, for example, by the group of Dario Neri, which uses short LNA duplexes (more stable than DNA) either with certain fluorophores incorporated for fluorescence anisotropy experiments or conjugated with biotin for immobilization on streptavidin-modified gold sensor chips for measuring binding kinetics by surface plasmon resonance (SPR). [13]
2. In silico methods are employed to search for close analogs of hits among in-house compound collections and in the catalogues of vendors who guarantee fast delivery. [14]
Both methods are currently applied with some success, but the throughput of 1) and the hit rate of 2) still limit their efficient use in daily medicinal chemistry expansion work.
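The in silico analog search of avenue 2 can be sketched as a similarity ranking over molecular fingerprints. The text does not state which similarity measure or fingerprint is used in practice; the Tanimoto coefficient on sets of on-bits is assumed here as a common choice, and the fingerprints, catalogue identifiers and the 0.6 cut-off are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def nearest_analogs(hit_fp, catalog, min_similarity=0.6, top_n=5):
    """Rank catalogue compounds by similarity to a DELT hit fingerprint."""
    scored = [(tanimoto(hit_fp, fp), cid) for cid, fp in catalog.items()]
    kept = [(s, cid) for s, cid in scored if s >= min_similarity]
    # Highest similarity first; ties broken by compound identifier.
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(cid, round(s, 3)) for s, cid in kept[:top_n]]


# Hypothetical fingerprints: each set holds the indices of bits that are on.
hit = frozenset({1, 4, 9, 15, 22, 31})
catalog = {
    "CAT-001": frozenset({1, 4, 9, 15, 22, 40}),
    "CAT-002": frozenset({2, 5, 11}),
    "CAT-003": frozenset({1, 4, 9, 15, 22, 31}),
}

analogs = nearest_analogs(hit, catalog)
print(analogs)
```

A real workflow would derive the fingerprints from chemical structures with a cheminformatics toolkit; the sets above merely stand in for that step.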
As mentioned earlier, the sequencing output of DELT screens is typically analyzed by calculating an ER for each individual library member from the sequence count in target selection conditions versus non-target controls. The hits above a certain ER threshold are then visually and interactively inspected library by library with the aim of identifying structural patterns and chemical motifs of interest.
These extensive analysis efforts limit the throughput of molecules considered, introduce bias, and make it difficult to fully survey and exploit the subtle patterns in the depth of DELT data.
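One way to surface such structural patterns programmatically is to aggregate read counts over pairs of building blocks, in the spirit of the disynthon representations mentioned in the caption of Fig. 2. The sketch below is an assumption about how such an aggregation could look, not the procedure used by the authors; the three-cycle library, building-block labels and counts are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def disynthon_counts(member_counts):
    """Sum read counts over every pair of synthesis cycles.

    member_counts maps a tuple of building-block IDs (one per cycle) to reads.
    Returns {((cycle_i, bb_i), (cycle_j, bb_j)): summed reads}, i.e. the
    remaining cycle(s) are marginalized out.
    """
    totals = defaultdict(int)
    for building_blocks, reads in member_counts.items():
        indexed = list(enumerate(building_blocks))
        for pair in combinations(indexed, 2):
            totals[pair] += reads
    return dict(totals)


# Hypothetical decoded counts for a three-cycle library.
counts = {
    ("A1", "B7", "C2"): 30,
    ("A1", "B7", "C5"): 25,
    ("A1", "B3", "C2"): 1,
    ("A4", "B7", "C2"): 2,
}

pairs = disynthon_counts(counts)
top_pair = max(pairs, key=pairs.get)
print(top_pair, pairs[top_pair])
```

Here the combination of A1 and B7 stands out regardless of the third building block, which is the kind of series-level signal that is otherwise looked for by eye.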
In a recent study [15] conducted as a collaboration between ZebiAI, Google Accelerated Science (GAS) and X-Chem, researchers presented a method to circumvent two of the current main limitations of DELT: human bias in result analysis and the time-consuming and expensive resynthesis of compounds off-DNA for hit validation. To this end, a combination of physical screening data from DELT selections was used to build a surprisingly effective machine-learning (ML) model (Fig. 2). The ML approach allows for the discovery of complex patterns that are otherwise nearly impossible for a scientist to detect by visual inspection of the hundreds of millions of data points derived from DELT selection/sequencing output.
By generating models for targets of interest, the ML algorithm is able to predict activities of collections of compounds that were not in the physical DELs. Hence, the universe of chemical space can be easily explored by sourcing from existing compound collections and vendors at little expense. ML-supported DELT analysis could also be of value for predicting matrix binders (molecules that do not bind to the target of interest) more accurately and consequently for diminishing the false-positive rate of DELT screens overall. Once these ML-based similarity searches in existing compound collections have become a robust, well-trained and routinely performed method for faster hit finding, one could even envision leveraging the enormous DELT data pool together with sophisticated ML approaches for de novo prediction/design of novel compounds, augmenting the chemical space of the original DELT screening data set to an entirely new universe.
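The train-on-DEL, predict-on-catalogue idea can be illustrated with a deliberately simple stand-in model. The architecture used in ref. [15] is not described in this text, so the Bernoulli naive Bayes classifier below is an assumption chosen only for brevity; the substructure features, labels and catalogue identifiers are hypothetical.

```python
import math


def train_naive_bayes(examples, alpha=1.0):
    """Fit a Bernoulli naive Bayes model.

    examples: list of (feature_set, label), label 1 = enriched in the DELT
    selection, 0 = not enriched. Both classes must be present.
    alpha is the Laplace smoothing constant.
    """
    vocab = set().union(*(features for features, _ in examples))
    model = {"vocab": vocab, "prior": {}, "prob": {}}
    for label in (0, 1):
        rows = [features for features, y in examples if y == label]
        model["prior"][label] = math.log(len(rows) / len(examples))
        model["prob"][label] = {
            feat: (sum(feat in row for row in rows) + alpha) / (len(rows) + 2 * alpha)
            for feat in vocab
        }
    return model


def predict_log_odds(model, features):
    """Log-odds that a compound is a binder; features unseen in training are ignored."""
    score = model["prior"][1] - model["prior"][0]
    for feat in model["vocab"]:
        p1, p0 = model["prob"][1][feat], model["prob"][0][feat]
        if feat in features:
            score += math.log(p1 / p0)
        else:
            score += math.log((1 - p1) / (1 - p0))
    return score


# Hypothetical training data derived from DELT selection output.
del_training = [
    ({"amide", "indole", "sulfonamide"}, 1),
    ({"amide", "indole"}, 1),
    ({"indole", "sulfonamide", "ether"}, 1),
    ({"amide", "ether"}, 0),
    ({"ester", "ether"}, 0),
    ({"ester", "amide"}, 0),
]
# Hypothetical catalogue compounds that were never part of the physical DEL.
catalog = {
    "CAT-A": {"indole", "sulfonamide", "pyridine"},
    "CAT-B": {"ester", "ether"},
}

model = train_naive_bayes(del_training)
scores = {cid: predict_log_odds(model, feats) for cid, feats in catalog.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The top-ranked compounds would then be ordered or synthesized and tested experimentally, as outlined in Fig. 2.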
In conclusion, DEL technology has become a widely accepted and routinely used method for hit finding across the pharma industry (and in academic labs), enabling access to broad chemical diversity through a fast, single-well binding assay and thereby complementing other high-throughput screening efforts.
As the number of DELT screens (and the resulting amount of data) is continuously growing, novel ML-based approaches expedite the data analysis to unprecedented levels. We are now routinely executing DELT screening and analysis and providing novel chemical starting points for our small-molecule research teams to explore.

Fig. 1. Schematic representation of a DNA-encoded library member (oligo-compound conjugate) binding to a target protein of interest. The DNA barcode is chemically conjugated to the small molecule (via a long linker) and carries the unambiguous structural information of the displayed compound (e.g. encoding tags A and B corresponding to the synthesis scheme and identity of building blocks (BBs) A and B of the final structure). Thanks to the DNA, even minute amounts of binders are effectively amplified by PCR and subsequently identified (sequenced and counted) by next-generation sequencing after heat elution from an affinity-based selection experiment with a library mixture input of billions of molecules.

Fig. 2. General concept of machine learning models based on DELT selection data. Starting with a chemically synthesized DNA-encoded library, an affinity-mediated selection is performed against the target of interest, and the DNA tags of binding molecules are PCR-amplified and sequenced following heat elution from the target. The aggregated DELT selection data (disynthon representations) are first used as a training set for machine learning models, and these trained algorithms are subsequently run to predict hits from virtual libraries or commercially available catalogs such as those provided by Mcule. Predicted hit compounds are ordered or synthesized and tested experimentally to confirm activity in functional assays. Reprinted (adapted) with permission from ref. [15], J. Med. Chem. 2020, 63, 16, 8857-8866, Publication Date: June 11, 2020, https://doi.org/10.1021/acs.jmedchem.0c00452. Copyright (2020) American Chemical Society.