Chinese Journal of Catalysis ›› 2025, Vol. 73: 159-173. DOI: 10.1016/S1872-2067(25)64725-5
Jibin Zhou a,1, Feiyang Xu b,1, Zhijun Chang c, Duiping Liu a, Lulu Li a, Jian Cui b, Yi Li b, Xin Li b,d,e,*, Li Qian c, Zhixiong Zhang c, Guoping Hu b,e, Mao Ye a,*, Zhongmin Liu a
Received: 2025-03-27
Accepted: 2025-05-13
Online: 2025-06-18
Published: 2025-06-12
Contact:
*E-mail: maoye@dicp.ac.cn (M. Ye), leexin@ustc.edu.cn (X. Li).
About author: 1 These authors contributed equally to this work.
Abstract:

The development of chemical engineering technology is a complex, multi-stage process spanning laboratory research, process scale-up, and industrial deployment. It requires close collaboration among chemistry, materials science, and engineering, and it entails long R&D cycles and high economic costs. Although generative artificial intelligence, exemplified by large language models (LLMs), has made remarkable progress in fundamental research, its deep application to complex engineering problems remains challenging. Existing general-purpose LLMs have only a limited grasp of chemical engineering expertise and can hardly support the full chain of technology transfer from laboratory innovation to industrial implementation. Moreover, the lack of systematic evaluation benchmarks makes it difficult to objectively assess how LLMs actually perform in professional chemical engineering scenarios.

To address these challenges, we developed ChemELLM, a domain-specific large language model for chemical engineering with 70 billion parameters, built on the Spark LLM as its base model. To evaluate LLM capabilities in chemical engineering comprehensively and systematically, we also constructed ChemEBench, the first multi-dimensional evaluation benchmark for the field. It adopts a progressive three-level framework, advancing from foundational knowledge understanding through advanced domain analysis to professional problem solving; it covers 15 core areas, including catalyst design, fluid simulation, equipment selection, and safety assessment, with 101 fine-grained evaluation tasks that span capabilities from basic theoretical understanding to complex engineering construction. Benchmark results show that ChemELLM excels on these key metrics, leading mainstream LLMs such as O1-Preview, GPT-4o, and DeepSeek-R1 in overall performance. In addition, to support high-quality training and fine-tuning of LLMs, we built the ChemEData dataset: a pre-training corpus of 19 billion tokens drawn from 1.06 million high-quality scholarly papers, 5.79 million high-value patents, and 1,200 professional books, plus a fine-tuning dataset of 1 billion tokens containing 2.75 million carefully designed question-answer pairs.

In summary, this work focuses on developing an LLM for chemical engineering and strengthening its domain understanding and reasoning. It promises to bridge laboratory research and industrial application, accelerate the deployment and industrialization of new chemical technologies, and establish a new paradigm of AI-driven innovation in chemical engineering. ChemELLM has been deployed and is publicly accessible,
Jibin Zhou, Feiyang Xu, Zhijun Chang, Duiping Liu, Lulu Li, Jian Cui, Yi Li, Xin Li, Li Qian, Zhixiong Zhang, Guoping Hu, Mao Ye, Zhongmin Liu. From lab to fab: A large language model for chemical engineering[J]. Chinese Journal of Catalysis, 2025, 73: 159-173.
Data source | Document | Size |
---|---|---|
Scholarly paper | 1.06 million | 30.5 GB |
Chemical patent | 5.79 million | 58.9 GB |
Professional book | 1200 | 106.2 GB |
Table 1 Statistics of pre-training data.
Type | Catalyst | Simulation | Equipment | Separation | Safety | Heat | Engineering |
---|---|---|---|---|---|---|---|
Multiple choice | 24600 | 43900 | 48600 | 96700 | 9200 | 10800 | 8000 |
True/False | 14100 | 46800 | 42400 | 80100 | 5800 | 10000 | 7000 |
Fill-in-the-blank | 19500 | 39900 | 44100 | 83100 | 9000 | 10000 | 2000 |
Calculation | 30500 | 54200 | 63700 | 116700 | 13100 | 12500 | 5000 |
Short answer | 31500 | 46300 | 63000 | 117500 | 13000 | 11200 | 7000 |
Sum | 120200 | 231100 | 261800 | 494100 | 50100 | 54500 | 29000 |
Total | 1240800 |
Table 2 Statistics of the supervised fine-tuning data: 1.24 million instruction-tuning Q&A pairs.
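As a sanity check on Table 2, the per-type counts can be cross-summed against the reported "Sum" row and the 1.24 million grand total; a minimal sketch (the dictionary layout is our own, not from the paper):

```python
# Q&A counts from Table 2 (rows: question types; columns: the seven domains
# Catalyst, Simulation, Equipment, Separation, Safety, Heat, Engineering).
sft_counts = {
    "multiple_choice":   [24600, 43900, 48600, 96700, 9200, 10800, 8000],
    "true_false":        [14100, 46800, 42400, 80100, 5800, 10000, 7000],
    "fill_in_the_blank": [19500, 39900, 44100, 83100, 9000, 10000, 2000],
    "calculation":       [30500, 54200, 63700, 116700, 13100, 12500, 5000],
    "short_answer":      [31500, 46300, 63000, 117500, 13000, 11200, 7000],
}

# Column (domain) sums match the "Sum" row of Table 2.
domain_sums = [sum(col) for col in zip(*sft_counts.values())]
assert domain_sums == [120200, 231100, 261800, 494100, 50100, 54500, 29000]

# Grand total matches the 1.24 million Q&A pairs quoted in the caption.
assert sum(domain_sums) == 1_240_800
```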
Dimension | Definition | Score range |
---|---|---|
Objectivity | The question should have a unique and objective answer under unified evaluation standards | 0-5 |
Rationality | The question and answer must be complete and clear, without omitting critical information | 0-5 |
Accuracy | The reasoning chain should be checked step by step to ensure the absence of factual, logical, computational, or knowledge errors | 0-5 |
Generalizability | Questions and answers should be based on general domain knowledge rather than relying on specific papers or patents | 0-5 |
Table 3 Criteria for model-based scoring of answer generation.
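Table 3's rubric lends itself to programmatic filtering, e.g. averaging the four 0-5 dimension scores and keeping only Q&A pairs above a threshold. A hypothetical sketch (the function name and the 4.0 threshold are illustrative assumptions, not the authors' pipeline):

```python
def rubric_score(scores: dict) -> float:
    """Mean of the four 0-5 rubric dimensions from Table 3."""
    dims = ("objectivity", "rationality", "accuracy", "generalizability")
    for d in dims:
        if not 0 <= scores[d] <= 5:
            raise ValueError(f"{d} must be in [0, 5]")
    return sum(scores[d] for d in dims) / len(dims)

# Illustrative filter: keep a Q&A pair only if its mean score is >= 4.0
# (the threshold is our assumption; the paper does not state one here).
qa = {"objectivity": 5, "rationality": 4, "accuracy": 5, "generalizability": 4}
assert rubric_score(qa) == 4.5
```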
Level | Category | Task | Type (Metric) |
---|---|---|---|
Foundational Knowledge | objective Q&A about domain knowledge | objective question | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc) |
subjective Q&A of domain knowledge | subjective question | short answer (score), calculation (score) | |
Advanced Knowledge | molecular name translation | SMILES to IUPAC | SMILES to IUPAC (Acc) |
molecular name generation | molecular name generation from text description | molecular name generation from text description (Score) | |
molecular description | generate text descriptions based on molecular SMILES | generate text descriptions based on molecular SMILES (Score) | |
molecular property prediction | prediction of molecular properties based on molecular SMILES | prediction of molecular properties based on molecular SMILES (Acc) | |
reaction prediction | reaction prediction | predict the reactants from the products (Acc), predict the products from the reactants (F1), and predict whether the reaction is high yield based on the reaction information (Acc) | |
Professional Skill | catalyst | catalyst deactivation | short answer (score) |
catalyst stability | short answer (score) | ||
catalyst industrial process | short answer (score) | ||
equipment | general equipment | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | |
reactor | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
dryer | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
centrifuge | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
pump | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
tower | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
fluid simulation | computational fluid dynamics | multiple choice (Acc), short answer (score) | |
discrete element method | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score), calculation (score) | ||
machine learning method | short answer (score) | ||
direct numerical simulation | short answer (score), calculation (score) | ||
separation | absorption | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | |
distillation | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
extraction | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
heat | heat exchanger | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score), calculation (score) | |
safety | regulations and standards | multiple choice (Acc) | |
process safety | multiple choice (Acc), true/false (Acc), short answer (score) | ||
environment safety | multiple choice (Acc), true/false (Acc), short answer (score) | ||
personnel safety | multiple choice (Acc) | ||
equipment safety | multiple choice (Acc) | ||
hazardous chemistry | multiple choice (Acc), true/false (Acc), short answer (score) | ||
economics | economics | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | |
engineering construction | electrical engineering | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score), calculation (score) | |
automatic control | multiple choice (Acc), true/false (Acc), short answer (score) | ||
material engineering | multiple choice (Acc) | ||
equipment engineering | multiple choice (Acc), true/false (Acc), short answer (score) | ||
civil engineering | multiple choice (Acc) | ||
thermal engineering | multiple choice (Acc) | ||
water supply and drainage engineering | multiple choice (Acc) | ||
general plot plan | multiple choice (Acc) | ||
chemical system | multiple choice (Acc), true/false (Acc) | ||
fire protection engineering | multiple choice (Acc) |
Table 4 Statistics of ChemEBench: 3 progressive levels evaluating 15 dimensions of LLM capability across 101 distinct chemical engineering tasks.
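Table 4 pairs every task with either an exact metric (Acc, F1) or a model-graded "score". One way to represent this hierarchy and route tasks to the right grader — the layout and names below are our own sketch, not the benchmark's actual code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    level: str      # "foundational", "advanced", or "professional"
    category: str   # one of the 15 dimensions, e.g. "safety"
    name: str
    metrics: tuple  # e.g. ("Acc",) for objective, ("score",) for subjective

# A few illustrative entries copied from Table 4.
tasks = [
    Task("professional", "safety", "process safety", ("Acc", "score")),
    Task("professional", "heat", "heat exchanger", ("Acc", "score")),
    Task("advanced", "molecular name translation", "SMILES to IUPAC", ("Acc",)),
]

# Objective metrics (Acc, F1) can be computed automatically; "score" entries
# require a model-based judge, consistent with the rubric in Table 3.
needs_judge = [t.name for t in tasks if "score" in t.metrics]
assert needs_judge == ["process safety", "heat exchanger"]
```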
Fig. 3. Distribution of questions in ChemEBench. The bar chart shows the number of questions in each sub-domain; the pie chart classifies questions by question structure.
Model | Developer | Size (parameter) | Access |
---|---|---|---|
O3-mini | OpenAI | undisclosed | API |
O1-Preview | OpenAI | undisclosed | API |
GPT-4o | OpenAI | undisclosed | API |
Claude-3.7 | Anthropic | undisclosed | API |
LLaMA 3.1-70B | Meta | 70B | weights |
DeepSeek-R1 | DeepSeek | 671B | API |
DeepSeek-V3 | DeepSeek | 671B | API |
Kimi | Moonshot AI | undisclosed | API |
GLM-4 | Zhipu AI | undisclosed | API |
ERNIE-4.0 | Baidu | undisclosed | API |
ChemDFM-13B | Suzhou Lab | 13B | weights |
ChemLLM-7B-Char-1.5-SFT | Shanghai AILab | 7B | weights |
LlaSMol-Mistral-7B | OSU | 7B | weights |
Table 5 Details of the LLMs chosen for evaluation in our experiments. The "size" column gives each model's parameter count; the "access" column indicates whether the model was accessed via API or by loading released weights.
Model | L1 | L2 | L3 | Mean score | Overall rank |
---|---|---|---|---|---|
O3-mini | 74.72 | 23.13 | 59.74 | 58.85 | 6 |
O1-Preview | 76.10 | 23.88 | 67.94 | 65.76 | 3 |
GPT-4o | 62.81 | 23.19 | 58.48 | 56.48 | 8 |
Claude-3.7 | 70.38 | 21.76 | 64.01 | 61.75 | 5 |
LLaMA 3.1-70B | 48.48 | 10.25 | 47.84 | 45.26 | 11 |
DeepSeek-R1 | 82.19 | 14.75 | 73.49 | 70.33 | 2 |
DeepSeek-V3 | 69.83 | 17.13 | 65.97 | 62.96 | 4 |
Kimi | 51.12 | 16.25 | 53.06 | 50.24 | 10 |
GLM-4 | 54.95 | 11.75 | 57.24 | 53.77 | 9 |
ERNIE-4.0 | 57.01 | 26.62 | 60.49 | 57.71 | 7 |
ChemDFM-13B | 29.71 | 28.25 | 31.69 | 31.22 | 12 |
ChemLLM-7B-Char-1.5-SFT | 20.10 | 6.50 | 21.97 | 20.67 | 13 |
LlaSMol-Mistral-7B | 16.90 | 26.38 | 19.64 | 19.81 | 14 |
ChemELLM | 73.88 | 50.25 | 74.72 | 72.90 | 1 |
Table 6 Performance of the selected LLMs and ChemELLM. The best and second-best results are labeled in bold and underlined, respectively.
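The "Overall rank" column in Table 6 follows directly from sorting the mean scores; a quick check with a few rows (values copied from the table):

```python
# Mean scores of selected models from Table 6.
mean_scores = {
    "ChemELLM": 72.90, "DeepSeek-R1": 70.33, "O1-Preview": 65.76,
    "DeepSeek-V3": 62.96, "Claude-3.7": 61.75, "O3-mini": 58.85,
}

# Rank models by descending mean score, as the table does.
ranked = sorted(mean_scores, key=mean_scores.get, reverse=True)
ranks = {model: i + 1 for i, model in enumerate(ranked)}

assert ranks["ChemELLM"] == 1 and ranks["DeepSeek-R1"] == 2
```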
Model | Objective task | Subjective task | Mean score | Overall rank | |||
---|---|---|---|---|---|---|---|
multiple choice | true/false | fill-in-the-blank | calculation | short answer | |||
O3-mini | 59.63 | 62.94 | 53.12 | 77.64 | 54.41 | 58.85 | 6 |
O1-Preview | 71.46 | 71.01 | 61.46 | 72.22 | 57.88 | 65.75 | 3 |
GPT-4o | 63.78 | 56.88 | 54.25 | 56.80 | 51.09 | 56.69 | 8 |
Claude-3.7 | 67.93 | 63.49 | 55.97 | 67.78 | 56.37 | 61.75 | 5 |
LLaMA 3.1-70B | 52.81 | 60.73 | 37.81 | 39.03 | 33.43 | 45.26 | 11 |
DeepSeek-R1 | 78.54 | 69.36 | 72.68 | 76.67 | 60.96 | 70.33 | 2 |
DeepSeek-V3 | 72.93 | 61.28 | 62.63 | 63.89 | 54.72 | 62.96 | 4 |
Kimi | 53.05 | 56.88 | 46.19 | 43.89 | 46.69 | 50.24 | 10 |
GLM-4 | 60.98 | 62.57 | 51.33 | 43.20 | 44.94 | 53.77 | 9 |
ERNIE-4.0 | 64.63 | 64.22 | 54.50 | 49.31 | 50.45 | 57.71 | 7 |
ChemDFM-13B | 29.51 | 43.67 | 25.61 | 11.39 | 31.75 | 31.23 | 12 |
ChemLLM-7B-Char-1.5-SFT | 21.10 | 35.05 | 21.14 | 5.14 | 14.35 | 20.67 | 13 |
LlaSMol-Mistral-7B | 13.90 | 48.81 | 13.83 | 1.67 | 13.84 | 19.81 | 14 |
ChemELLM | 77.32 | 80.18 | 66.60 | 64.93 | 68.81 | 72.90 | 1 |
Table 7 Performance of the selected LLMs and ChemELLM on different question types. The best and second-best results are labeled in bold and underlined, respectively.
Task type | Quantity | Metric | Models | ||||
---|---|---|---|---|---|---|---|
GPT-4o | O1-Preview | DeepSeek-R1 | ChemELLM | ||||
Property prediction | BACE | 100 | ACC | 35 | 40 | 38 | 64 |
BBBP | 100 | ACC | 61 | 56 | 52 | 67 | |
ClinTox | 100 | ACC | 50 | 52 | 31.5 | 57.5 | |
HIV | 100 | ACC | 33 | 78 | 40 | 81 | |
Tox21 | 1044 | ACC | 80.27 | 81.9 | 81.03 | 83.14 | |
Yield prediction | Buchwald-Hartwig | 100 | ACC | 62 | 75 | 63 | 61 |
Suzuki-Miyaura | 100 | ACC | 52 | 65 | 61 | 48 | |
Name prediction | iupac2formula | 100 | Exact | 28 | 65 | 38 | 4 |
smiles2iupac | 100 | Exact | 1 | 0 | 0 | 24 | |
iupac2smiles | 100 | Exact | 8 | 14 | 9 | 20 | |
smiles2formula | 100 | Exact | 9 | 42 | 24 | 5 | |
Molecule analysis | text-based molecule design | 100 | BLEU | 42.56 | 51.76 | 58.12 | 75.71 |
molecule captioning | 100 | score | 20 | 23.5 | 18.25 | 26.5 | |
Synthetic analysis | reactant prediction | 100 | F1 | 3 | 32.67 | 25 | 61 |
retrosynthesis | 100 | F1 | 4.9 | 14.13 | 11.5 | 33.83 | |
solvent selection | 100 | F1 | 51 | 51 | 51 | 51 | |
reactant selection | 100 | F1 | 24.7 | 20.83 | 26 | 50.47 | |
ligands selection | 100 | F1 | 15.27 | 18.19 | 16.9 | 17.97 | |
Overall | 2744 | mean | 48.78 | 56.67 | 51.36 | 58.89 |
Table 8 Performance comparison of different LLMs on ChemLLMBench tasks. The best and second-best results are labeled in bold and underlined, respectively.
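Several Table 8 tasks (solvent, reactant, and ligand selection) are scored with F1 over predicted versus reference candidates. A minimal set-based sketch of such a metric — our own formulation, since the benchmark's exact matching rules may differ:

```python
def set_f1(predicted: set, reference: set) -> float:
    """F1 between a predicted and a reference set of candidates."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)  # true positives: shared candidates
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# e.g. two of three predicted solvents match a two-item reference set
# (the solvent names here are illustrative, not benchmark data).
assert round(set_f1({"THF", "DMF", "toluene"}, {"THF", "DMF"}), 2) == 0.8
```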
Model | Shot | L1 | L2 | L3 | Overall |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | |||
O3-mini | 0* | 74.39 | 75.19 | 0.00 | 6.67 | 60.83 | 36.00 | 20.00 | 57.33 | 59.07 | 72.50 | 67.31 | 72.78 | 61.00 | 53.21 | 47.75 | 58.85 |
3* | 80.92 | 72.69 | 3.33 | 10.00 | 56.67 | 56.00 | 37.65 | 51.00 | 56.29 | 71.44 | 69.36 | 73.48 | 67.25 | 54.74 | 56.14 | 61.46 | |
O1-Preview | 0* | 80.01 | 70.39 | 3.33 | 3.33 | 45.83 | 44.00 | 24.71 | 59.67 | 66.92 | 73.57 | 74.31 | 77.26 | 73.33 | 63.21 | 59.53 | 65.76 |
3* | 79.12 | 72.88 | 6.67 | 20.00 | 39.17 | 64.00 | 30.59 | 54.67 | 64.02 | 70.55 | 73.89 | 75.46 | 70.92 | 61.99 | 66.16 | 65.94 | |
GPT-4o | 0* | 67.63 | 55.77 | 3.75 | 6.67 | 37.50 | 76.00 | 15.29 | 53.33 | 56.89 | 66.71 | 65.36 | 61.54 | 62.83 | 54.54 | 51.84 | 56.69 |
3* | 65.93 | 54.62 | 1.67 | 10.00 | 25.00 | 72.00 | 10.59 | 48.33 | 56.57 | 68.09 | 66.45 | 60.98 | 68.00 | 54.92 | 59.16 | 58.03 | |
Claude-3.7 | 0* | 74.59 | 64.23 | 3.33 | 16.67 | 51.67 | 36.00 | 15.30 | 60.50 | 60.64 | 73.32 | 65.88 | 71.11 | 69.25 | 59.60 | 59.30 | 61.75 |
3* | 79.21 | 63.27 | 6.67 | 30.00 | 62.50 | 48.00 | 29.41 | 55.33 | 62.73 | 74.38 | 73.55 | 69.46 | 73.75 | 61.45 | 64.74 | 65.48 | |
LLaMA 3.1-70B | 0* | 58.23 | 34.23 | 0.00 | 0.00 | 28.33 | 28.00 | 5.88 | 33.67 | 47.42 | 59.09 | 49.53 | 46.28 | 55.33 | 40.29 | 42.19 | 45.26 |
3* | 59.96 | 38.85 | 0.00 | 13.33 | 12.50 | 60.00 | 22.35 | 36.33 | 49.08 | 59.81 | 58.92 | 50.94 | 60.17 | 41.93 | 48.07 | 49.63 | |
DeepSeek-R1 | 0* | 85.92 | 76.73 | 3.33 | 16.67 | 45.00 | 12.00 | 8.24 | 59.00 | 73.44 | 78.13 | 79.16 | 81.46 | 78.50 | 67.92 | 66.68 | 70.33 |
3* | 85.03 | 70.58 | 3.33 | 20.00 | 41.67 | 48.00 | 40.00 | 55.67 | 71.19 | 77.30 | 79.97 | 78.00 | 80.08 | 71.55 | 75.28 | 72.38 | |
DeepSeek-V3 | 0* | 75.90 | 60.96 | 0.00 | 20.00 | 47.50 | 40.00 | 4.71 | 56.00 | 64.92 | 73.22 | 70.63 | 73.14 | 72.25 | 62.59 | 57.56 | 62.96 |
3* | 75.79 | 65.96 | 3.33 | 13.33 | 35.00 | 52.00 | 16.47 | 55.33 | 66.80 | 73.37 | 71.14 | 70.10 | 74.75 | 60.57 | 67.72 | 65.63 | |
Kimi | 0* | 57.80 | 41.35 | 3.33 | 0.00 | 25.00 | 72.00 | 7.06 | 45.67 | 53.02 | 63.63 | 58.40 | 57.32 | 60.92 | 45.59 | 42.07 | 50.24 |
3* | 63.29 | 40.19 | 0.00 | 0.00 | 11.67 | 84.00 | 30.59 | 44.33 | 53.54 | 64.26 | 57.19 | 55.86 | 62.25 | 52.52 | 53.59 | 53.64 | |
GLM-4 | 0* | 63.99 | 41.73 | 0.00 | 0.00 | 28.33 | 44.00 | 4.71 | 47.33 | 54.76 | 65.57 | 60.04 | 56.84 | 66.84 | 55.63 | 50.48 | 53.77 |
3* | 60.52 | 41.73 | 0.00 | 3.33 | 17.50 | 56.00 | 14.12 | 49.00 | 56.30 | 67.31 | 62.06 | 55.74 | 66.00 | 57.36 | 53.60 | 55.08 | |
ERNIE-4.0 | 0* | 63.51 | 47.50 | 3.33 | 3.33 | 37.50 | 72.00 | 25.88 | 46.33 | 62.45 | 67.58 | 61.64 | 61.55 | 67.75 | 57.71 | 51.93 | 57.71 |
3* | 58.34 | 43.27 | 0.00 | 6.67 | 20.00 | 76.00 | 32.94 | 42.00 | 57.00 | 62.61 | 63.97 | 57.34 | 69.17 | 56.42 | 60.88 | 57.13 | |
ChemDFM-13B | 0* | 40.17 | 14.42 | 10.00 | 56.67 | 28.33 | 40.00 | 21.18 | 31.67 | 30.69 | 34.03 | 33.86 | 31.16 | 45.67 | 31.13 | 21.91 | 31.23 |
3* | 43.68 | 15.96 | 10.00 | 40.00 | 11.67 | 44.00 | 20.00 | 26.33 | 35.44 | 36.64 | 33.72 | 34.81 | 44.58 | 32.45 | 31.29 | 33.98 | |
ChemLLM-7B-Char-1.5-SFT | 0* | 27.41 | 9.42 | 0.00 | 0.00 | 0.00 | 48.00 | 1.18 | 13.33 | 24.25 | 22.07 | 28.25 | 18.12 | 25.50 | 16.48 | 17.24 | 20.67 |
3* | 25.29 | 14.62 | 0.00 | 0.00 | 0.00 | 48.00 | 1.18 | 6.67 | 16.65 | 15.50 | 15.76 | 12.46 | 15.92 | 10.25 | 14.14 | 14.87 | |
LlaSMol-Mistral-7B | 0* | 25.70 | 4.04 | 3.33 | 30.00 | 19.17 | 64.00 | 24.71 | 9.00 | 24.48 | 17.23 | 18.31 | 23.54 | 23.00 | 24.67 | 12.16 | 19.81 |
3* | 28.27 | 7.31 | 0.00 | 3.33 | 0.00 | 80 | 1.18 | 4.33 | 18.51 | 12.43 | 12.67 | 31.18 | 16.00 | 15.85 | 24.11 | 17.65 | |
ChemELLM | 0* | 80.95 | 63.56 | 30.00 | 56.67 | 45.00 | 64.00 | 52.94 | 60.67 | 72.59 | 75.56 | 80.42 | 74.06 | 80.75 | 69.53 | 73.95 | 72.90 |
3* | 80.36 | 66.35 | 26.67 | 53.33 | 53.33 | 64.00 | 54.12 | 60.00 | 72.07 | 74.70 | 82.19 | 75.44 | 82.08 | 64.49 | 72.93 | 72.73 |
Table 9 3-shot (3*) versus 0-shot (0*) performance across LLMs and the task categories C1-C15. Bold indicates improvement over the corresponding 0-shot setting.
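The overall columns of Table 9 let one quantify few-shot sensitivity, e.g. the 3-shot minus 0-shot overall delta per model (values copied from the table; the comparison itself is our own summary):

```python
# (0-shot, 3-shot) overall scores for selected models from Table 9.
overall = {
    "O3-mini": (58.85, 61.46),
    "DeepSeek-R1": (70.33, 72.38),
    "ChemELLM": (72.90, 72.73),
}

# 3-shot minus 0-shot improvement, rounded to two decimals.
deltas = {m: round(s3 - s0, 2) for m, (s0, s3) in overall.items()}

assert deltas["O3-mini"] == 2.61
assert deltas["ChemELLM"] == -0.17  # ChemELLM is nearly insensitive to shots
```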