HTE and Data Analysis for Discovery and Molecular-level Understanding of Catalysts

Abstract: The combination of high-throughput experimentation (HTE) and data analysis is a valuable methodology for mechanistic interrogation and rational development of catalysts. In this article, we outline the general structure of the HTE-data analysis workflow and illustrate how it can be applied, using olefin metathesis and cyanation reactions as examples.


Introduction
Catalysis is at the heart of efficient chemical processes and is directly associated with sustainable development, lowering energy consumption while optimizing the use of resources. Furthermore, it will also provide a way to transition from fossil energy and chemical resources to renewables. [1] In industrial settings, heterogeneous catalysts are essential as they allow process intensification (a decrease in energy-intensive steps, typically associated with separation and regeneration), but they suffer from their intrinsic complexity. Hence, they are developed mostly via empirical approaches. In some cases, they can be replaced by homogeneous catalysts, which are powerful alternatives due to their often higher selectivity, lower operating temperatures and easier rational development. For the latter, however, process operations (recycling) are far more complex and often require years of development. Overall, catalysis, whether homogeneous or heterogeneous, requires tedious optimization due to the large parameter space (concentration, temperature, pressure, additives…) that influences catalytic performance (activity, selectivity, and stability). In that context, one approach to speed up development that embraces the complexity of catalysis is high-throughput experimentation (HTE), [2] which emerged in the late 1990s and has gained momentum more recently, in particular with the emergence of improved data analysis and machine learning-based approaches. [3] For instance, laboratory robotics can perform multiple tasks simultaneously, enabling time-efficient screening of catalysts [4] under a broad range of well-defined conditions and thus generating large and reliable catalytic data sets. However, finding the underlying rationale behind the success of a particular catalyst formulation remains a formidable challenge.
At the opposite end, computational chemistry, in particular when based on density functional theory (DFT), has demonstrated its power to describe reaction pathways and to rationalize reactivity and selectivity patterns, but at the expense of long and tedious work. More recently, structure-activity studies based on multivariate linear regression analyses have demonstrated their efficiency in identifying, in a more timely and cost-effective way, promising correlations with predictive power. [5] In this article, we describe how combining the efficiency of HTE methods with data analysis, via multivariate linear regression fitting of catalytic results and computational rationalization of the data, allows for computer-guided prediction in catalysis research and rational catalyst design. Furthermore, we discuss how, in this context, the emergence of machine learning is offering yet more possibilities, which could revolutionize catalyst design and process implementation.

License and Terms
This is an Open Access article under the terms of the Creative Commons Attribution License CC BY 4.0. The material may not be used for commercial purposes. The license is subject to the CHIMIA terms and conditions: (https://chimia.ch/chimia/about).
The definitive version of this article is the electronic one that can be found at https://doi.org/10.2533/chimia.2022.346

Methodology
The general HTE-data analysis workflow is shown in Fig. 1. The approach described in this article is best suited to the development of catalytic systems where organic ligands and additives are used to modulate the catalytic performance and where the goal is to discover and optimize the catalyst structure or formulation, e.g. well-defined and ill-defined molecular and supported catalysts, including nanoparticles, where the organic ligands can play a major role.
Once the system under study is defined, high-throughput screening should be used to perform catalyst evaluation in a time-efficient and reproducible way (step 1a). Based on analysis of proposed mechanisms and catalyst structure, suitable ligand descriptors should be identified and validated (step 1b). For the construction of generalizable, unbiased models, which are aimed at making accurate predictions for a wide range of different molecules, the gathered experimental data (TOF, TON, yield, etc.) should be divided into a training set, used for model construction, and an external validation set, necessary for the verification of the models (step 2). The descriptors have to be normalized to possess the same scale and deviation, so that the magnitude of each coefficient in subsequent models reflects the relative weight of the corresponding parameter (step 3). First, univariate correlations are examined to see which ligand descriptors are most relevant for catalysis and to identify possible data subsets of structurally related ligands. Subsequently, preliminary multivariate models can be constructed, e.g. by least-squares linear regression with forward feature selection, which evaluates the change in model statistics caused by the addition or removal of each parameter and incorporates the most important term in each step (step 4). The generated models have to be validated by internal cross-validation (e.g. Q2 or k-fold methods), or by external validation against experimental results that were not used in model development (step 5).
The goal of this workflow is to gain a better understanding of reaction mechanisms through analysis of interactions that are described by ligand parameters, but also to predict new active catalysts via extrapolation in concert with virtual screening. The application of this workflow is discussed in two case studies. In the first study, the methodology was applied to a well-studied reaction, olefin metathesis, in order to investigate the key descriptors that drive the catalysis for both homogeneous and heterogeneous systems. The second example focuses on the development of a new cyanation protocol and investigation of the optimal ligand properties for the design of improved palladium cross-coupling catalysts.

Olefin Metathesis
Olefin metathesis is a prototypical example of an (atom) efficient reaction catalyzed by group 6 metals, with broad industrial interest, ranging from petrochemicals and polymers to the fine chemical industry. Olefin metathesis, a Nobel-prize-winning technology, [6] is used to produce propene, an essential component of polymers, via the OCT process (WO3/SiO2), long-chain olefins via the SHOP (MoO3/Al2O3), and biomass-derived oils or complex pheromones and drugs (molecular Mo or W Schrock-type catalysts). [7] Over the last decades, a multitude of catalysts have been synthesized by a serendipity-driven approach [8] and their reactivity was rationalized by computational studies. [9] Further understanding of the key parameters driving alkene metathesis and of the deactivation pathways remains a challenging task. Towards this goal, libraries of homogeneous and heterogeneous Schrock-type olefin metathesis catalysts were efficiently synthesized using high-throughput experimentation, specifically by using bis-pyrrolide-type Mo alkylidene molecular complexes as ideal candidates due to their ease of synthesis and modularity. [10] In fact, the pyrrolido ligand can readily be exchanged via protonolysis with an XH molecule, e.g. X = aryloxide or even silica as a support, so that over 200 formulations were readily prepared from 35 selected phenols and with/without silica partially dehydroxylated at 700 °C (SiO2-700). [11] In parallel, density functional theory (DFT) calculations on the phenolic ligands were used to acquire simple steric and electronic molecular descriptors to correlate with the anticipated reaction outputs (Fig. 2A, left). Testing the in situ generated complexes in the homometathesis of 1-nonene in a robotized way enabled the monitoring of the reaction progress at different time points by retrieval and successive gas chromatography analysis of reaction aliquots. Analysis of the raw data allowed for extraction of conversions, product selectivities, as well as respective turnover numbers (TON) and turnover frequencies (TOF) as catalytic output descriptors (Fig. 2A, right). After parameter processing, subsequent identification of univariate correlations highlighted the non-anticipated importance of splitting the phenolic data set into two subgroups differing by the presence/absence of aryl substituents in the ortho positions of the respective phenol ligands, drastically improving individual univariate correlations with several computed descriptors (Fig. 2B). Multivariate linear regression analysis was then utilized to obtain internally validated, predictive models that portray the impact of the interplay of stereo-electronic effects of the ligands on TOF and TON responses for both groups (Fig. 2C). The resulting models captured the well-established importance of the σ-donation ability of the ligand in modulating the activity of the catalysts, [12] which increased the confidence in the meaningfulness of the analysis. More importantly though, the models uncovered the influence of non-covalent interactions in tuning activity and performance, in particular for aryl-arm bearing phenolic ligands, hence providing a new lever that may be exploited for the future design of improved d0 metathesis catalysts.

Cyanation
Nitriles are important structural motifs in pharmaceuticals and natural products [13] and their cyano moiety serves as a valuable precursor for numerous functional group interconversions. [14] Ample efforts have been put into developing catalytic cyanation protocols, for instance employing nucleophilic or electrophilic cyano sources. [15] While a major concern remains the toxicity of the employed cyanation reagents, the choice of the optimal reaction conditions, and in particular of the ligand to catalyze the reaction, is often not clear. In this regard, the HTE-data analysis methodology was applied to develop a novel palladium-catalyzed electrophilic cyanation protocol, opting for classic Suzuki-Miyaura cross-coupling conditions, using aryl boronic acids and N-cyano succinimide as cyanating agent (Fig. 3). [16] Accelerated investigation of the ligand effect was automated using a liquid handling robot to screen 90 ligands belonging to the monophosphine, bisphosphine or miscellaneous subgroups. All tests were performed in triplicate to assess the reproducibility of the results, yielding 288 formulations to analyze via gas chromatography (Fig. 3A). Similar to what was described for olefin metathesis (Section 3.1), the workflow involved calculating DFT-derived ligand descriptors to relate the electronic and steric properties of the ligands to the experimentally determined yield for the mono- and bisphosphine subsets. For the bisphosphine subset, however, ligand parameters specific to the PdCl2 adduct were additionally assessed to describe the bidentate nature of the ligands. Relying on univariate and multivariate linear regression analysis, structurally-responsive ligand behavior [17] was identified as the main characteristic required in an optimal ligand: such ligands are able to stabilize the metal in the bisligated state, while their hemilability can open up a coordination site that may be needed to enable catalysis (Fig. 3B).
XantPhos turned out to excel in this regard and was further used as ligand to investigate the protocol for different substrates, demonstrating excellent functional group tolerance, in particular with aryl boronic acids bearing electron-withdrawing moieties (Fig. 3C).

Outlook: Beyond Multivariate Linear Regression
In recent years, tremendous progress in the area of machine learning (ML) and artificial intelligence has facilitated the implementation of algorithms by non-specialists. [18] The multidimensionality of chemical space complicates the use of such algorithms by the synthetic community, given the requirement for large amounts of data to efficiently navigate that space. HTE has been of utmost importance in unlocking the potential of ML in chemistry. The application of data-driven strategies to the chemical sciences has since seen a clear uptick [19] in both homogeneous and heterogeneous catalysis. [20] The power of machine learning resides in its pattern recognition and self-learning without explicitly tailored programming. In this regard, the combination of ML with HTE opens up new avenues towards self-driven laboratories by autonomously planning, executing and processing reactions. [21]
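
As a point of reference for such ML-based extensions, the regression baseline outlined under Methodology (steps 2 to 5) can be made concrete. The following is a minimal sketch rather than the procedure used in the case studies: the data are synthetic, the helper names (fit_ols, q2_loo, forward_select) are hypothetical, Q2 is assumed here to be the leave-one-out cross-validated R2, and only the addition of terms (not their removal) is evaluated during feature selection.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with intercept; returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def q2_loo(X, y):
    """Leave-one-out cross-validated R2, Q2 = 1 - PRESS / SS_tot (assumed definition)."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_ols(X[keep], y[keep])
        press += (y[i] - predict(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, tol=0.01):
    """Add, one at a time, the descriptor that raises Q2 most; stop below tol."""
    selected, best_q2 = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: q2_loo(X[:, selected + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_q2 < tol:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best_q2 = scores[j_best]
    return selected, best_q2

# Synthetic stand-in for an HTE data set: 40 ligands, 5 descriptors on very
# different scales. Only descriptors 0 and 2 influence the response (hypothetical).
rng = np.random.default_rng(0)
n_ligands, n_desc = 40, 5
latent = rng.normal(size=(n_ligands, n_desc))
scales = np.array([1.0, 50.0, 0.01, 10.0, 200.0])
X_raw = latent * scales + 5.0 * scales
y = 3.0 * latent[:, 0] - 1.0 * latent[:, 2] + rng.normal(scale=0.2, size=n_ligands)

# Step 2: split into a training set and an external validation set.
idx = rng.permutation(n_ligands)
train, valid = idx[:30], idx[30:]

# Step 3: normalize descriptors using training-set statistics only.
mu = X_raw[train].mean(axis=0)
sd = X_raw[train].std(axis=0, ddof=1)
Xz_train = (X_raw[train] - mu) / sd
Xz_valid = (X_raw[valid] - mu) / sd

# Step 4a: univariate correlations (Pearson r) of each descriptor with the response.
r_uni = np.array([np.corrcoef(Xz_train[:, j], y[train])[0, 1] for j in range(n_desc)])

# Step 4b: multivariate model by forward feature selection.
selected, q2 = forward_select(Xz_train, y[train])
coef = fit_ols(Xz_train[:, selected], y[train])

# Step 5: external validation on the held-out ligands.
y_pred = predict(coef, Xz_valid[:, selected])
r2_ext = 1.0 - np.sum((y[valid] - y_pred) ** 2) / np.sum((y[valid] - y[valid].mean()) ** 2)

print("univariate r:", np.round(r_uni, 2))
print("selected descriptors:", selected)
print("normalized coefficients:", np.round(coef[1:], 2))
print("Q2 (LOO):", round(float(q2), 3), "external R2:", round(float(r2_ext), 3))
```

Because the descriptors are normalized before fitting, the printed coefficients can be compared directly with one another, which is the interpretability argument made for linear models. A real application would replace the synthetic arrays with measured TOF, TON or yield values and computed ligand descriptors.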

Conclusion
In summary, the use of high-throughput experimentation in concert with data analysis was shown to be effective for mechanistic interrogation and accelerated reaction development. While molecularly well-defined systems were explored here, this approach is applicable to a broader range of catalyst classes, such as supported or unsupported nanoparticles, where ligand additives can play a major role. Such approaches are currently under investigation in our group. While HTE ensures time-efficient execution of synthetic steps and reproducibility of the performed reactions, data analysis, and in particular simple multivariate linear regression models, appeal through their ease of use and interpretability. Exciting developments in ML-based methods offer further possibilities to take this methodology to the next level.