Research News
Researchers Design Novel Reaction Description Language for Encoding Molecular Editing Operations in Chemical Reactions
Editor: LIU Jia | May 19, 2025

Artificial intelligence technologies, exemplified by large language models (LLMs), have achieved major breakthroughs in natural language processing. In chemistry and pharmaceutical research, this progress has given rise to chemical language models (CLMs), which capitalize on chemist-defined molecular linear notations to learn and generate molecular structures.

To enhance the performance of CLMs on specific tasks, new molecular linear notations have been designed. However, these notations all describe the static structures of molecules; they cannot explicitly capture a crucial aspect of chemistry, namely how atoms and bonds change during a reaction.

In a study published in Nature Machine Intelligence, a research team led by ZHENG Mingyue from the Shanghai Institute of Materia Medica of the Chinese Academy of Sciences designed a new reaction description language called ReactSeq, which endows domain-specific LLMs with several emergent capabilities.

Inspired by the retrosynthesis process, ReactSeq defines both the product structure and the molecular editing operations (MEOs) required to transform it back into reactant molecules. These MEOs include the breaking and changing of chemical bonds, alterations in atomic charges, and the attachment of leaving groups (LGs).
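
The paper's exact ReactSeq grammar is not reproduced here, but the following minimal Python sketch illustrates the general idea of pairing a product with a list of MEO records and linearising them into a single string; the field names, example edits, and token format are illustrative assumptions only.

```python
# Hypothetical sketch only: a ReactSeq-style record is assumed here to pair a
# product with a list of molecular editing operations (MEOs); the field names,
# example edits, and linearised token format are illustrative, not the paper's grammar.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MEO:
    op: str                       # e.g. "break_bond", "change_bond", "change_charge", "attach_lg"
    atoms: tuple                  # atom-map numbers in the product that the edit acts on
    detail: Optional[str] = None  # e.g. new bond order, new formal charge, or leaving-group SMILES

# Hypothetical retrosynthetic edit set for an ester product (ethyl acetate):
# break the acyl C-O bond, then cap the acyl carbon with a hydroxyl leaving group.
edits = [
    MEO(op="break_bond", atoms=(2, 4)),
    MEO(op="attach_lg", atoms=(2,), detail="O"),
]

product_smiles = "CC(=O)OCC"
react_seq = product_smiles + " | " + " ; ".join(
    f"{e.op}({','.join(map(str, e.atoms))}{':' + e.detail if e.detail else ''})"
    for e in edits
)
print(react_seq)  # CC(=O)OCC | break_bond(2,4) ; attach_lg(2:O)
```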

In a ReactSeq-based retrosynthesis LM, reactants are not generated token-by-token from scratch. Instead, they are obtained by applying these MEOs to the product molecule, which ensures precise atom mapping between the predicted reactants and the product and thereby enhances the model's interpretability. Using ReactSeq, a vanilla Transformer can achieve state-of-the-art performance in retrosynthesis prediction.
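
As a rough illustration of this edit-the-product idea (not the authors' code), the sketch below uses RDKit to apply two hypothetical MEOs, a bond break and a leaving-group attachment, to an ester product; the original atom-map numbers carry over to the resulting reactant fragments.

```python
# A rough illustration (not the authors' code) of deriving reactants by editing
# the product: RDKit is used to break a bond and attach a leaving group, and the
# original atom-map numbers survive in the resulting reactant fragments.
from rdkit import Chem

product = Chem.MolFromSmiles("CC(=O)OCC")     # ethyl acetate as the product
for atom in product.GetAtoms():               # label product atoms so the
    atom.SetAtomMapNum(atom.GetIdx() + 1)     # product-reactant mapping stays explicit

rw = Chem.RWMol(product)
rw.RemoveBond(1, 3)                           # MEO: break the acyl C-O ester bond
oh = rw.AddAtom(Chem.Atom(8))                 # MEO: attach an -OH leaving group
rw.AddBond(1, oh, Chem.BondType.SINGLE)       # to the acyl carbon

reactants = rw.GetMol()
Chem.SanitizeMol(reactants)
print(Chem.MolToSmiles(reactants))            # two mapped fragments: acetic acid and ethanol
```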

ReactSeq also features explicit tokens denoting MEOs, which make it possible to encode human instructions. Expert prompts can significantly enhance the model's performance and even guide it toward exploring new reactions. These tokens also facilitate the extraction of reaction representations: focusing on the embeddings of the MEO tokens yields more faithful and intrinsic reaction representations than aggregating the embeddings of the entire ReactSeq string.
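
The following PyTorch fragment sketches that pooling choice in schematic form; the encoder output, token ids, and the set of MEO-token ids are placeholders rather than the actual model's vocabulary.

```python
# Schematic sketch of the pooling choice described above: average the encoder's
# hidden states only at MEO-token positions rather than over the whole ReactSeq
# string. The hidden states, token ids, and MEO-token id set are placeholders.
import torch

hidden = torch.randn(1, 12, 256)              # (batch, seq_len, dim) encoder output
token_ids = torch.tensor([[5, 17, 17, 3, 901, 17, 902, 17, 17, 903, 17, 2]])
MEO_TOKEN_IDS = (901, 902, 903)               # assumed ids of the explicit MEO tokens

meo_mask = torch.zeros_like(token_ids, dtype=torch.bool)
for tid in MEO_TOKEN_IDS:
    meo_mask |= token_ids == tid

meo_repr = hidden[meo_mask].mean(dim=0)       # reaction representation from MEO tokens, (dim,)
full_repr = hidden.mean(dim=1).squeeze(0)     # baseline: mean over the entire sequence, (dim,)
```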

Based on this strategy and self-supervised learning, the researchers developed a universal and reliable reaction representation. This representation proved effective across multiple tasks, including reaction classification, similar reaction retrieval, reaction yield prediction, and experimental procedure recommendation.
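
As one example of how such a representation can be put to work, the sketch below performs similar-reaction retrieval by ranking a library of reaction embeddings against a query by cosine similarity; the vectors here are random stand-ins for embeddings produced by the encoder.

```python
# Illustrative example of similar-reaction retrieval with such embeddings:
# rank a library of reaction vectors against a query by cosine similarity.
# The vectors are random stand-ins; real embeddings would come from the encoder.
import torch
import torch.nn.functional as F

library = F.normalize(torch.randn(1000, 256), dim=1)   # unit-norm reaction embeddings
query = F.normalize(torch.randn(256), dim=0)            # embedding of the query reaction

scores = library @ query                                 # cosine similarity per library entry
top_scores, top_idx = scores.topk(5)
print(top_idx.tolist())                                  # indices of the 5 most similar reactions
```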

This study significantly enhances the ability of natural language processing models to tackle complex chemical problems and points to a new direction for developing foundation models in chemical artificial intelligence.