Molecular Similarity for Drug Discovery, Target Prediction and Chemical Space Visualization

: Similar drug molecules often have similar properties and activities. Therefore, quantifying molecular similarity is central to drug discovery and optimization. Here I review computational methods using molecular similarity measures developed in my group within the interdisciplinary network NCCR TransCure investigating the physiology, structural biology and pharmacology of ion channels and membrane transporters. We designed a 3D molecular shape and pharmacophore comparison algorithm to optimize weak and unselective inhibitors by scaffold hopping and discovered potent and selective inhibitors of the ion channels TRPV6 and TRPM4, of endocannabinoid membrane transport, and of the divalent metal transporters DMT1 and ZIP8. We predicted off-target effects by combining molecular similarity searches from different molecular fingerprints against target annotated compounds from the ChEMBL database. Finally, we created interactive chemical space maps reflecting molecular similarities to facilitate the selection of screening compounds and the analysis of screening results. These different tools are available online at https://gdb.unibe.ch/tools/.


Introduction
The biological activity of small molecule drugs primarily depends on their complementarity to their targets, usually proteins, following the well-known lock-and-key principle, implying that similar drug molecules often have similar biological activities. [1,2] A broad variety of computational methods can be used to predict either the complementarity of potential drug molecules to their protein target, typically by docking, or the molecular similarity between potential drug molecules, most often using molecular fingerprints. [3][4][5] Although only moderately reliable, such predictions are extremely useful in drug discovery to focus experiments on small sets of test compounds, for example by selecting them from commercial catalogs of millions of screening compounds such as those accessible at the ZINC database, [6] and to guide hit and lead optimization by analog design. [7] Computational models can also predict physico-chemical [8] and ADME-Tox [9,10] properties as well as possible off-targets, [11][12][13] typically through multiple com-parisons with drug molecules of known properties and activities as collected in databases such as ChEMBL and PubChem. [14,15] Here I review molecular similarity measures developed in my group for virtual screening, target prediction and chemical space visualization, and how they were used to discover and optimize inhibitors of ion channels and membrane transporters within the NCCR TransCure project.

Drug Discovery
Our virtual screening approach for TransCure was designed in the context of identifying a potent and selective inhibitor of TRPV6, a calcium channel initially discovered by Matthias Hediger and a putative cancer target for which no structural information was available, but for which several weak inhibitors (IC 50~1 00 µM) had been described. [16][17][18][19][20] Our hypothesis was that these weak inhibitors must be partly correct but their scaffold was probably not the right one. Among the many options to compute molecular similarity, [2] we focused on molecular shape and pharmacophores, which is one of the most useful concepts in drug discovery because it enables scaffold-hopping and the discovery of new compound series. [21,22] Considering the weak inhibitors available as seed molecules, this approach appeared as a good option to identify different and potentially better scaffolds for this target.
The gold standard for molecular shape and pharmacophore comparison, then and now, is the ROCS score developed by the software company Open Eye. [23] The ROCS algorithm searches for the best overlay between 3D-molecular models of a seed and a query molecule in different conformations, quantifying the overlay of molecular surfaces either in terms of pure shape, or in terms of pharmacophore by assigning properties to the molecular surface. We developed a simpler version of this approach comparing only the lowest energy conformers of seed and query predicted by the 3D-builder CORINA [24] based on a modified version of the ligand overlap score. [25] Our extended ligand overlap score (xLOS) was calculated from the relative positions of hydrophobic, hydrogen-bond acceptor and hydrogen-bond donor atoms as three separate categories and was minimized by optimizing the relative position of seed and query. Benchmarking using a reference set [26] showed that xLOS performed as good as ROCS and better than a 3D-pharmacophore fingerprint [27] or a simple Daylight binary group allowed to solve X-ray and cryo-EM structures of several TRPV6 inhibitor complexes including cis-22a (PDB 7K4B) and 3OG (PDB 7K4D). These different structures showed that our inhibitors plug the open pore of TRPV6 and convert the channel into a nonconducting, inactive state. [33] In a similar approach in collaboration with Hugues Abriel, we searched for selective inhibitors of TRPM4, a monovalent cation channel involved in heart diseases and cancer. [34,35] We tested xLOS analogs of known cation channel blockers such as flufenamic acid [36] using a fluorescent sodium uptake assay for TRPM4 in transfected HEK293 cells. These studies led to the identification of CBA (IC 50 = 1.5 ± 0.1 µM) and NBA (IC 50 = 0.4 ± 0.3 µM) as the first potent and selective TRPM4 inhibitors, [37,38] with interesting species specific differences in their activities. [39] We have also used xLOS in virtual screening campaigns to optimize inhibitors of the iron transporter DMT1 such as the pyrazolyl-pyrimidine 3 (IC 50 = 0.17 µM), [40,41] however, in this case detailed studies showed that the literature seed compound as well as our inhibitor 13 (IC 50 = 1.1 ± 0.01 µM) acted indirectly by iron chelation, leaving bis-thioureas such as 2 (IC 50 = 0.29µM) as the only class of potent inhibitors for DMT1 as confirmed by X-ray crystallographic studies. [42,43] X-ray crystallographic studies were also critical in confirming the binding mode of Aurora A kinase inhibitors 6 (IC 50 = 2.3 µM, PDB 4ZS0) and 9 (IC 50 = 2 ± 0.5 nM, PDB 4ZTR) discovered by an xLOS driven screening campaign. [44] In this study, we found substructure fingerprint [5] in a virtual screening benchmark, and performed significant scaffold-hopping ( Fig. 1). [28] Starting with the chemical structure of five weak TRPV6 inhibitors described in the literature, [18] we used xLOS similarity to select 133 test compounds in the catalog of~800,000 drug-like molecules of a commercial provider. The activity assays were performed in the group of Matthias Hediger by monitoring calcium uptake using a fluorogenic calcium dye in a stably transfected HEK293 cell line expressing TRPV6, [29] and revealed five hit compounds with IC 50~2 0 µM. In particular inhibitor 8 (IC 50 = 31 µM) from a (4-phenylcyclohexyl)piperazine scaffold had a high xLOS similarity to the prolinol type seed inhibitor 1 (IC 50 = 90 µM). Further optimization by purchasing an additional 90 analogs, again selected by xLOS similarity, confirmed the activity of this scaffold, and provided a first potent inhibitor, which turned out to be most active as the 1,4-cis cyclohexane diastereomer cis-11a (IC 50 = 1.0 ± 0.11 µM). This diastereoselective inhibition was confirmed with additional analogs obtained by synthesis, which led to the optimized inhibitor cis-22a (IC 50 = 0.32 ± 0.12 µM). [30] Further studies focusing on electrophysiology and on introducing structural features from the natural product capsaicin to mitigate off-target effects led to the photoswitchable Z-9e (IC 50 = 1.7 ± 0.4 µM) only active as the Z-isomer, [31] and to the more potent and selective TRPV6 inhibitor 3OG (IC 50 = 0.082 ± 0.004 µM) (Fig. 2). [32] A collaboration with Alexander Sobolevsky and his effects. Following early reports such as PASS, [50,51] SEA [52] and OCEAN, [53] numerous approaches exploiting various statistical methods have been reported, a few of which have been made accessible online. [12,13,54] Most target prediction methods exploit compound-activity databases by measuring molecular similarity using a substructure molecular fingerprint such as the extended connectivity fingerprint ECFP4, originally developed in the 1960's as the Morgan fingerprint. [55,56] In our own approach to the problem, we used nearestneighbor searches combining ECFP4 with several other molecular fingerprints comparing molecules by composition [57] and molecular shape and pharmacophores, [58] as well as fusion fingerprints. [59] The system, called polypharmacology browser (PPB), [60] searched for similarity of a query molecule to ChEMBL compounds annotated with any of 4,613 possible bioactivities on single protein targets, cell lines and organisms. When tested with our TRPV6 inhibitor cis-22a, PPB successfully predicted five of twelve experimentally verified off-targets, and none of twelve experimentally verified inactive off-targets, which was comparable to the performance of several other off-target prediction tools available online (Fig. 3a).
We later reported a second, modified polypharmacology browser PPB2, which used the activity data of 344,163 ChEMBL that the thiazolidinone nucleus of the potent and highly selective inhibitor 9 did not exhibit any significant thiol reactivity despite being flagged as a potential pan-assay interference substructure (PAINS). [45] A similar lack of thiol reactivity and a very low toxicity were observed during the preclinical development of a thiazolidinone series identified in collaboration with Jürg Gertsch in an xLOS-driven screening campaign searching for inhibitors of endocannabinoid transport, [46,47] which is being further developed by Synendos Therapeutics. [48]

Ta rget Prediction
To be reasonably useful, small molecule modulators should not only be potent on their intended target, but also show good selectivity against other targets, including possible close isoforms as well as safety panels containing known problematic side activities. [49] This polypharmacology is usually unintended but very frequent and must be addressed as soon as possible in the course of a discovery project. Thanks to the availability of open access databases such as ChEMBL listing millions of compounds and their associated bioactivity data, [14] one can use molecular similarity to compare any hit compound with known bioactive molecules and deduce the probability of off-target lysophosphatidic acid acyl transferase beta (LPAAT-β), a prediction which was verified experimentally. Interestingly, the key molecular similarity comparison in this project was made by XFP, a molecular shape and pharmacophore similarity fingerprint, [58] and was missed by other target prediction models.

Visualizing Chemical Space
During applied projects such as those performed in the NCCR TransCure, the selection of test compounds guided by virtual screening engages substantial time and resources and determines project outcome. To enable better decision making in this critical early project phase, we have developed interactive tools to perform rapid similarity searches in large databases such as ZINC using various molecular fingerprints, and to visualize the content of the source database or the selected subsets in form of chemical space maps. In both cases our interactive tools display chemical structures, which is essential to evaluate and compare molecules. With such tools, various options for virtual screening can be considered and possible sets of test molecules can be compared against each other and against the entire source database.
compounds annotated with 1,720 different single protein targets. [61] In this second version, we used only three different fingerprints but investigated several models for target prediction including nearest neighbor (NN) searches with the different fingerprints, Naïve-Bayes (NB) machine learning and a deep neural network (DNN). The recall and precision of the off-target prediction was strongly influenced by the ECFP4 similarity of the query molecule to the target annotated ChEMBL compound, as well as by the target class. Among the different models, the DNN as well as a combination of NN and NB machine learning performed best. When tested with off-targets of cis-22a, PPB2 performed significantly better than PPB, with the DNN performing best by being able to predict eight of the twelve active off-target and none of the inactive ones (Fig. 3b).
Additional insight into the performance of PPB2 compared to other online prediction tools was provided by analyzing the details of a target identification project for a highly cytotoxic triazine series (IC 50~2 0 nM) identified in a phenotypic assay for angiogenesis inhibition and initially thought to target kinases. [62] In the absence of any measurable activity in whole kinome profiling, we followed a suggestion of PPB2 that this compound might inhibit a) b) Fig. 3. Predicting off-targets of the TRPV6 inhibitor cis-22a (structure shown in Fig. 2)  Our fastest and most versatile molecular similarity search tool is the multi-fingerprint browser for the ZINC database. [63] This tool offers a choice of four different molecular fingerprints to retrieve a list of analogs of any seed molecule, and to cluster the results by similarity. By contrast to typical online similarity searches such as those at the website of ChEMBL and PubChem which use only a single, substructure-based similarity method, our search tool additionally offers fuzzy comparisons using Molecular Quantum Numbers (MQN), a set of 42 descriptors counting different types of structural features, [57] and the SMILES fingerprint SMIPF, which counts the occurrence of 34 different characters in the SMILES notation of molecules. [64] Such fuzzy searches allow very substantial scaffold hopping [21] and can enable surprising and counter-intuitive analog discoveries. [65,66] To visualize collections of molecules, often referred to as chemical spaces, [67,68] we produce 2D-or 3D-maps organized by molecular fingerprint similarity where each point corresponds to one or several molecules. Principal component analysis (PCA) is well suited to produce maps from MQN and SMIFP fingerprints, for which the first two principal components usually cover more than 70% of data variability. For binary substructure fingerprints such as ECFP4 which have very high dimensionality, we produce the maps by applying PCA to an N-dimensional secondary molecular fingerprint obtained by computing similarities to N different reference compounds using the original fingerprint, which results in a strong dimensionality reduction. [69,70] PCA and similarity PCA maps provide nice overviews of large molecular databases such as the generated databases (GDBs) [71,72] PubChem, [73] or the RCSB Protein DataBank, [74] and can be inspected with interactive Java applets in 2D [75] or 3D, [38,76,77] or with the application Fearun. [78] Fearun can display millions of datapoints in 3D within a web-browser, and represents molecules in millisecond without any server communication using the drawing program Smilesdrawer. [79] This visualization was adapted for a virtual reality 3D-chemical space map of DrugBank, a database containing thousands of small molecule drugs. [80] Applying PCA and other dimensionality reduction methods such as t-SNE [81] and UMAP [82] to high-dimensional molecular fingerprint datasets usually creates images with uneven datapoint density, which hides the bulk of the data in dense clusters and draws attention to outliers. To overcome this limitation, we have reported the tree-map (TMAP), a dimensionality reduction method which organizes high-dimensional datasets in a 2D-tree connecting approximate nearest neighbors, such that the density of points is homogeneous over the entire map. TMAP largely outperforms UMAP in terms of computation speed, maximum dataset size, and preservation of nearest-neighbor relationships, and has been exemplified with datasets from computer science, physics, chemistry, biology, and literature. [83] For molecular datasets, we usually compute TMAPs using MinHashed molecular fingerprints encoding circular substructures (MHFP6) [84] or pairs of circular substructures and the topological distances between them (MAP4). [85] These molecular fingerprints outperform binary molecular fingerprints such as ECFP4 in bioactivity benchmarks [86] and allow fast determination of approximate nearest neighbors for TMAP construction. As for the PCA maps discussed above, TMAPs are displayed interactively in a web-browser using Faerun.
TMAPs representing various high-dimensional datasets are accessible at https://tmap.gdb.tools/ and https://tm.gdb.tools/ map4/. TMAPs can provide unsuspected insights into large datasets, as recently exemplified by the analysis of natural products, for which the TMAP showed that they separate in different groups according to their origin as from plants, bacteria or fungi, which implies that the origin of any natural product can be inferred from its molecular structure (https://np-svm-map4.gdb.tools/). [87,88] TMAPs provide useful project overviews, as exemplified here for a discovery project in collaboration with Matthias Hediger, targeting the divalent metal transporter ZIP8, a transporter implicated in various diseases but for which no previous pharmacology was available. [89][90][91] In this project, we purchased a set of diverse fragment-like [92] molecules belonging to the fragment subset FDB-17 [93] of the generated database GDB-17. [94] A limited screening using a transiently transfected cell line was sufficient to identify a hit series exemplified by the tricyclic indole (S)-3 (IC 50 = 17.2 ± 3.8 µM). A subsequent result analysis using TMAP showed that the screening set used for this discovery was not truly representative of the original FDB17 database and represented a selection typical of commercial fragment series, and provided useful hints for further screening efforts (Fig. 4). [95]

Conclusion and Outlook
This review presented methods developed in my research group to use molecular similarity for drug discovery, target prediction, and chemical space visualization and how they were used in NCCR TransCure projects for which no structural information was available on the protein targets. Transforming our computations into successful experiments depended on a well-organized pipeline to verify hit compound purity, structure and activity, on consistent follow-up studies, on the intrinsic hit-rate of the selected protein targets, and unavoidably also on chance. We currently keep improving our methods to address further challenges such as the discovery of bioactive peptides [95][96][97][98] and reaction informatics. [99][100][101] Fig. 4. TMAP visualization of a screening project leading to the discovery of (S)-3 as the first inhibitor of the divalent metal transporter ZIP8. The TMAP illustrates the relative chemical space coverage of the tested fragment set in the chemical space covered by fragments from different sources. The interactive map is accessible at https://gdb.unibe.ch/tools/.