Fuelling the Digital Chemistry Revolution with Language Models

Sandmeyer Award 2022

Authors

  • Antonio Cardinale IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
  • Alessandro Castrogiovanni IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
  • Theophile Gaudin IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
  • Joppe Geluykens IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
  • Teodoro Laino IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland; National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland
  • Matteo Manica IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
  • Daniel Probst IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
  • Philippe Schwaller IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland; National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland
  • Aleksandros Sobczyk IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland; National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland
  • Alessandra Toniato IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland; National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland
  • Alain C. Vaucher IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland; National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland
  • Heiko Wolf IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
  • Federico Zipoli IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland; National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland

DOI:

https://doi.org/10.2533/chimia.2023.484

PMID:

38047789

Keywords:

Digital chemistry, Language models, Machine learning, Sandmeyer Award 2022, Synthetic Organic Chemistry

Abstract

The RXN for Chemistry project, initiated by IBM Research Europe - Zurich in 2017, set out to develop a series of digital assets that use machine learning to promote data-driven methodologies in synthetic organic chemistry. The research treats chemical reaction data as language records and casts the prediction of a synthetic organic chemistry reaction as a translation task between the precursor and product languages. Over the years, the IBM Research team has developed language models for applications including forward reaction prediction, retrosynthesis, reaction classification, atom-mapping, procedure extraction from text, and inference of experimental protocols, as well as the use of these protocols to program commercial automation hardware and implement an autonomous chemical laboratory. More recently, the project has incorporated biochemical data into model training to support greener and more sustainable chemical reactions. The ease of constructing prediction models and continually improving them through data augmentation with minimal human intervention has led to the widespread adoption of language model technologies, facilitating the digitalization of chemistry in industrial sectors such as pharmaceuticals and chemical manufacturing. This manuscript provides a concise overview of the scientific components that contributed to the prestigious Sandmeyer Award in 2022.
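
To make the "reaction as translation" idea concrete, the following is a minimal sketch of how a reaction written as a SMILES string could be split into a source (precursor) token sequence and a target (product) token sequence. The regular expression, the function names `tokenize` and `to_translation_pair`, and the esterification example are illustrative assumptions; they are not taken from the manuscript and do not reproduce the tokenizer used by the RXN for Chemistry models.

```python
import re
from typing import List, Tuple

# Assumed, simplified token pattern (not the project's actual tokenizer):
# bracket atoms such as [Na+], the two-letter halogens Br and Cl,
# two-digit ring closures such as %12, then any remaining single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into a list of tokens ("words")."""
    return TOKEN_PATTERN.findall(smiles)


def to_translation_pair(reaction_smiles: str) -> Tuple[List[str], List[str]]:
    """Split 'precursors>>product' into (source tokens, target tokens).

    Assumes the simple form with an empty agents field ('>>').
    """
    precursors, _, product = reaction_smiles.partition(">>")
    return tokenize(precursors), tokenize(product)


# Illustrative example: acetic acid + ethanol -> ethyl acetate (water omitted).
source, target = to_translation_pair("CC(=O)O.OCC>>CC(=O)OCC")
print(" ".join(source))  # sequence in the "precursor language"
print(" ".join(target))  # sequence in the "product language"
```

In this framing, a forward reaction prediction model would be trained to map the source sequence to the target sequence, and a retrosynthesis model would be trained on the reverse direction.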

Published

2023-08-09

How to Cite

[1]
A. Cardinale, A. Castrogiovanni, T. Gaudin, J. Geluykens, T. Laino, M. Manica, D. Probst, P. Schwaller, A. Sobczyk, A. Toniato, A. C. Vaucher, H. Wolf, F. Zipoli, Chimia 2023, 77, 484, DOI: 10.2533/chimia.2023.484.