Chinese Journal of Catalysis, 2025, Vol. 73: 159-173. DOI: 10.1016/S1872-2067(25)64725-5

• Article •

From lab to fab: A large language model for chemical engineering

Jibin Zhou a,1, Feiyang Xu b,1, Zhijun Chang c, Duiping Liu a, Lulu Li a, Jian Cui b, Yi Li b, Xin Li b,d,e,*, Li Qian c, Zhixiong Zhang c, Guoping Hu b,e, Mao Ye a,*, Zhongmin Liu a

  a. National Engineering Research Center of Lower-Carbon Catalysis Technology, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China
    b. Artificial Intelligence Research Institute, iFLYTEK Co., Ltd., Hefei 230000, Anhui, China
    c. National Science Library, Chinese Academy of Sciences, Beijing 100190, China
    d. University of Science and Technology of China, Hefei 230000, Anhui, China
    e. State Key Laboratory of Cognitive Intelligence, Hefei 230000, Anhui, China
  • Received: 2025-03-27  Accepted: 2025-05-13  Online: 2025-06-18  Published: 2025-06-12
  • Contact: *E-mail: maoye@dicp.ac.cn (M. Ye), leexin@ustc.edu.cn (X. Li).
  • About author: 1 These authors contributed equally to this work.
  • Supported by:
    Liaoning Binhai Laboratory (LBLF-2023-01); Strategic Priority Research Program of Chinese Academy of Sciences (XDA0490000); Key Research and Development Program of Liaoning (2023JH26/10200012)

Abstract:

The development of chemical technologies is a multistage process spanning laboratory research, scale-up, and industrial deployment; it requires interdisciplinary collaboration and often incurs substantial time and economic costs. To address these challenges, we report ChemELLM, a domain-specific large language model (LLM) with 70 billion parameters for chemical engineering. ChemELLM achieves state-of-the-art performance on tasks ranging from foundational understanding to professional problem-solving, outperforming mainstream LLMs (e.g., O1-Preview, GPT-4o, and DeepSeek-R1) on ChemEBench, the first multidimensional benchmark for chemical engineering, which covers 15 dimensions across 101 distinct essential tasks. To support robust model development, we curated ChemEData, a purpose-built dataset containing 19 billion tokens for pre-training and 1 billion tokens for fine-tuning. This work establishes a new paradigm for artificial intelligence-driven innovation that bridges the gap between laboratory-scale research and industrial-scale implementation, thereby accelerating technological advancement in chemical engineering. ChemELLM is publicly available at https://chemindustry.iflytek.com/chat.

Key words: Large language model, Chemical engineering, Process development, Multidimensional benchmark, Domain adaptation