Chinese Journal of Catalysis ›› 2025, Vol. 73: 159-173. DOI: 10.1016/S1872-2067(25)64725-5
Jibin Zhou a,1, Feiyang Xu b,1, Zhijun Chang c, Duiping Liu a, Lulu Li a, Jian Cui b, Yi Li b, Xin Li b,d,e,*, Li Qian c, Zhixiong Zhang c, Guoping Hu b,e, Mao Ye a,*, Zhongmin Liu a
Received: 2025-03-27
Accepted: 2025-05-13
Online: 2025-06-18
Published: 2025-06-12
Contact:
*E-mail: maoye@dicp.ac.cn (M. Ye), leexin@ustc.edu.cn (X. Li).
About author: 1 These authors contributed equally to this work.
Abstract:

The development of chemical engineering technology is a complex, multi-stage process spanning laboratory research, process scale-up, and industrial deployment. It requires close collaboration among chemistry, materials science, and engineering, and it entails long R&D cycles and high economic costs. Although generative artificial intelligence, exemplified by large language models (LLMs), has made remarkable progress in fundamental research, its deep application to complex engineering problems remains challenging. Existing general-purpose LLMs have only a limited grasp of chemical engineering expertise and can hardly support the full chain of technology transfer from laboratory innovation to industrial implementation. Moreover, the lack of systematic evaluation benchmarks makes it difficult to objectively assess how LLMs actually perform in professional chemical engineering scenarios.

To address these challenges, we developed ChemELLM, a domain-specific large language model for chemical engineering with 70 billion parameters, built on the Spark LLM as its base model. To evaluate LLM capabilities in chemical engineering comprehensively and systematically, we also constructed ChemEBench, the first multi-dimensional evaluation benchmark for the field. It adopts a progressive three-level framework, advancing from foundational knowledge understanding through advanced domain analysis to professional problem solving; it covers 15 core areas, including catalyst design, fluid simulation, equipment selection, and safety assessment, with 101 fine-grained evaluation tasks that span capabilities from basic theoretical understanding to complex engineering construction. Benchmark results show that ChemELLM excels on these key metrics, leading mainstream LLMs such as O1-Preview, GPT-4o, and DeepSeek-R1 in overall performance. In addition, to support high-quality training and fine-tuning of LLMs, we built the ChemEData dataset: a pre-training corpus of 19 billion tokens drawn from 1.06 million high-quality scholarly papers, 5.79 million high-value patents, and 1,200 professional books, plus a fine-tuning dataset of 1 billion tokens containing 2.75 million carefully designed question-answer pairs.

In summary, this work focuses on developing an LLM for chemical engineering and strengthening its domain understanding and reasoning. It promises to bridge laboratory research and industrial application, accelerate the deployment and industrialization of new chemical technologies, and establish a new paradigm of AI-driven innovation in chemical engineering. ChemELLM has been deployed and is publicly accessible,
Jibin Zhou, Feiyang Xu, Zhijun Chang, Duiping Liu, Lulu Li, Jian Cui, Yi Li, Xin Li, Li Qian, Zhixiong Zhang, Guoping Hu, Mao Ye, Zhongmin Liu. From lab to fab: A large language model for chemical engineering[J]. Chinese Journal of Catalysis, 2025, 73: 159-173.
Data source | Document | Size |
---|---|---|
Scholarly paper | 1.06 million | 30.5 GB |
Chemical patent | 5.79 million | 58.9 GB |
Professional book | 1200 | 106.2 GB |
Table 1 Statistics of pre-training data.
Type | Catalyst | Simulation | Equipment | Separation | Safety | Heat | Engineering |
---|---|---|---|---|---|---|---|
Multiple choice | 24600 | 43900 | 48600 | 96700 | 9200 | 10800 | 8000 |
True/False | 14100 | 46800 | 42400 | 80100 | 5800 | 10000 | 7000 |
Fill-in-the-blank | 19500 | 39900 | 44100 | 83100 | 9000 | 10000 | 2000 |
Calculation | 30500 | 54200 | 63700 | 116700 | 13100 | 12500 | 5000 |
Short answer | 31500 | 46300 | 63000 | 117500 | 13000 | 11200 | 7000 |
Sum | 120200 | 231100 | 261800 | 494100 | 50100 | 54500 | 29000 |
Total | 1240800 |
Table 2 Statistics of the supervised fine-tuning data: 1.24 million instruction-tuning Q&A pairs.
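As a sanity check on Table 2, the per-type counts can be cross-summed against the reported "Sum" row and the 1.24 million grand total; a minimal sketch (the dictionary layout is our own, not from the paper):

```python
# Q&A counts from Table 2 (rows: question types; columns: the seven domains
# Catalyst, Simulation, Equipment, Separation, Safety, Heat, Engineering).
sft_counts = {
    "multiple_choice":   [24600, 43900, 48600, 96700, 9200, 10800, 8000],
    "true_false":        [14100, 46800, 42400, 80100, 5800, 10000, 7000],
    "fill_in_the_blank": [19500, 39900, 44100, 83100, 9000, 10000, 2000],
    "calculation":       [30500, 54200, 63700, 116700, 13100, 12500, 5000],
    "short_answer":      [31500, 46300, 63000, 117500, 13000, 11200, 7000],
}

# Column (domain) sums match the "Sum" row of Table 2.
domain_sums = [sum(col) for col in zip(*sft_counts.values())]
assert domain_sums == [120200, 231100, 261800, 494100, 50100, 54500, 29000]

# Grand total matches the 1.24 million Q&A pairs quoted in the caption.
assert sum(domain_sums) == 1_240_800
```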
Dimension | Definition | Score range |
---|---|---|
Objectivity | The question should have a unique and objective answer under unified evaluation standards | 0-5 |
Rationality | The question and answer must be complete and clear, without omitting critical information | 0-5 |
Accuracy | The reasoning chain should be checked step by step to ensure the absence of factual, logical, computational, or knowledge errors | 0-5 |
Generalizability | Questions and answers should be based on general domain knowledge rather than relying on specific papers or patents | 0-5 |
Table 3 Criteria for model-based scoring of answer generation.
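Table 3's rubric lends itself to programmatic filtering, e.g. averaging the four 0-5 dimension scores and keeping only Q&A pairs above a threshold. A hypothetical sketch (the function name and the 4.0 threshold are illustrative assumptions, not the authors' pipeline):

```python
def rubric_score(scores: dict) -> float:
    """Mean of the four 0-5 rubric dimensions from Table 3."""
    dims = ("objectivity", "rationality", "accuracy", "generalizability")
    for d in dims:
        if not 0 <= scores[d] <= 5:
            raise ValueError(f"{d} must be in [0, 5]")
    return sum(scores[d] for d in dims) / len(dims)

# Illustrative filter: keep a Q&A pair only if its mean score is >= 4.0
# (the threshold is our assumption; the paper does not state one here).
qa = {"objectivity": 5, "rationality": 4, "accuracy": 5, "generalizability": 4}
assert rubric_score(qa) == 4.5
```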
Level | Category | Task | Type (Metric) |
---|---|---|---|
Foundational Knowledge | objective Q&A about domain knowledge | objective question | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc) |
subjective Q&A of domain knowledge | subjective question | short answer (score), calculation (score) | |
Advanced Knowledge | molecular name translation | SMILES to IUPAC | SMILES to IUPAC (Acc) |
molecular name generation | molecular name generation from text description | molecular name generation from text description (Score) | |
molecular description | generate text descriptions based on molecular SMILES | generate text descriptions based on molecular SMILES (Score) | |
molecular property prediction | prediction of molecular properties based on molecular SMILES | prediction of molecular properties based on molecular SMILES (Acc) | |
reaction prediction | reaction prediction | predict the reactants from the products (Acc), predict the products from the reactants (F1), and predict whether the reaction is high yield based on the reaction information (Acc) | |
Professional Skill | catalyst | catalyst deactivation | short answer (score) |
catalyst stability | short answer (score) | ||
catalyst industrial process | short answer (score) | ||
equipment | general equipment | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | |
reactor | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
dryer | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
centrifuge | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
pump | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
tower | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
fluid simulation | computational fluid dynamics | multiple choice (Acc), short answer (score) | |
discrete element method | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score), calculation (score) | ||
machine learning method | short answer (score) | ||
direct numerical simulation | short answer (score), calculation (score) | ||
separation | absorption | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | |
distillation | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
extraction | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | ||
heat | heat exchanger | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score), calculation (score) | |
safety | regulations and standards | multiple choice (Acc) | |
process safety | multiple choice (Acc), true/false (Acc), short answer (score) | ||
environment safety | multiple choice (Acc), true/false (Acc), short answer (score) | ||
personnel safety | multiple choice (Acc) | ||
equipment safety | multiple choice (Acc) | ||
hazardous chemistry | multiple choice (Acc), true/false (Acc), short answer (score) | ||
economics | economics | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) | |
engineering construction | electrical engineering | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score), calculation (score) | |
automatic control | multiple choice (Acc), true/false (Acc), short answer (score) | ||
material engineering | multiple choice (Acc) | ||
equipment engineering | multiple choice (Acc), true/false (Acc), short answer (score) | ||
civil engineering | multiple choice (Acc) | ||
thermal engineering | multiple choice (Acc) | ||
water supply and drainage engineering | multiple choice (Acc) | ||
general plot plan | multiple choice (Acc) | ||
chemical system | multiple choice (Acc), true/false (Acc) | ||
fire protection engineering | multiple choice (Acc) |
Table 4 Statistics of ChemEBench: 3 progressive levels evaluating 15 dimensions of LLM capability across 101 distinct chemical engineering tasks.
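Table 4 pairs every task with either an exact metric (Acc, F1) or a model-graded "score". One way to represent this hierarchy and route tasks to the right grader — the layout and names below are our own sketch, not the benchmark's actual code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    level: str      # "foundational", "advanced", or "professional"
    category: str   # one of the 15 dimensions, e.g. "safety"
    name: str
    metrics: tuple  # e.g. ("Acc",) for objective, ("score",) for subjective

# A few illustrative entries copied from Table 4.
tasks = [
    Task("professional", "safety", "process safety", ("Acc", "score")),
    Task("professional", "heat", "heat exchanger", ("Acc", "score")),
    Task("advanced", "molecular name translation", "SMILES to IUPAC", ("Acc",)),
]

# Objective metrics (Acc, F1) can be computed automatically; "score" entries
# require a model-based judge, consistent with the rubric in Table 3.
needs_judge = [t.name for t in tasks if "score" in t.metrics]
assert needs_judge == ["process safety", "heat exchanger"]
```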
Fig. 3. Distribution of questions in ChemEBench. The bar chart shows the number of questions in each sub-domain; the pie chart classifies questions by question structure.
Model | Developer | Size (parameter) | Access |
---|---|---|---|
O3-mini | OpenAI | undisclosed | API |
O1-Preview | OpenAI | undisclosed | API |
GPT-4o | OpenAI | undisclosed | API |
Claude-3.7 | Anthropic | undisclosed | API |
LLaMA 3.1-70B | Meta | 70B | weights |
DeepSeek-R1 | DeepSeek | 671B | API |
DeepSeek-V3 | DeepSeek | 671B | API |
Kimi | Moonshot AI | undisclosed | API |
GLM-4 | Zhipu AI | undisclosed | API |
ERNIE-4.0 | Baidu | undisclosed | API |
ChemDFM-13B | Suzhou Lab | 13B | weights |
ChemLLM-7B-Char-1.5-SFT | Shanghai AILab | 7B | weights |
LlaSMol-Mistral-7B | OSU | 7B | weights |
Table 5 Details of the LLMs chosen for evaluation in our experiments. The "size" column gives each model's parameter count; the "access" column indicates whether the model was accessed via API or by loading released weights.
Model | L1 | L2 | L3 | Mean score | Overall rank |
---|---|---|---|---|---|
O3-mini | 74.72 | 23.13 | 59.74 | 58.85 | 6 |
O1-Preview | 76.10 | 23.88 | 67.94 | 65.76 | 3 |
GPT-4o | 62.81 | 23.19 | 58.48 | 56.48 | 8 |
Claude-3.7 | 70.38 | 21.76 | 64.01 | 61.75 | 5 |
LLaMA 3.1-70B | 48.48 | 10.25 | 47.84 | 45.26 | 11 |
DeepSeek-R1 | 82.19 | 14.75 | 73.49 | 70.33 | 2 |
DeepSeek-V3 | 69.83 | 17.13 | 65.97 | 62.96 | 4 |
Kimi | 51.12 | 16.25 | 53.06 | 50.24 | 10 |
GLM-4 | 54.95 | 11.75 | 57.24 | 53.77 | 9 |
ERNIE-4.0 | 57.01 | 26.62 | 60.49 | 57.71 | 7 |
ChemDFM-13B | 29.71 | 28.25 | 31.69 | 31.22 | 12 |
ChemLLM-7B-Char-1.5-SFT | 20.10 | 6.50 | 21.97 | 20.67 | 13 |
LlaSMol-Mistral-7B | 16.90 | 26.38 | 19.64 | 19.81 | 14 |
ChemELLM | 73.88 | 50.25 | 74.72 | 72.90 | 1 |
Table 6 Performance of the selected LLMs and ChemELLM. The best and second-best results are labeled in bold and underlined, respectively.
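The "Overall rank" column in Table 6 follows directly from sorting the mean scores; a quick check with a few rows (values copied from the table):

```python
# Mean scores of selected models from Table 6.
mean_scores = {
    "ChemELLM": 72.90, "DeepSeek-R1": 70.33, "O1-Preview": 65.76,
    "DeepSeek-V3": 62.96, "Claude-3.7": 61.75, "O3-mini": 58.85,
}

# Rank models by descending mean score, as the table does.
ranked = sorted(mean_scores, key=mean_scores.get, reverse=True)
ranks = {model: i + 1 for i, model in enumerate(ranked)}

assert ranks["ChemELLM"] == 1 and ranks["DeepSeek-R1"] == 2
```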
Model | Objective task | Subjective task | Mean score | Overall rank | |||
---|---|---|---|---|---|---|---|
multiple choice | true/false | fill-in-the-blank | calculation | short answer | |||
O3-mini | 59.63 | 62.94 | 53.12 | 77.64 | 54.41 | 58.85 | 6 |
O1-Preview | 71.46 | 71.01 | 61.46 | 72.22 | 57.88 | 65.75 | 3 |
GPT-4o | 63.78 | 56.88 | 54.25 | 56.80 | 51.09 | 56.69 | 8 |
Claude-3.7 | 67.93 | 63.49 | 55.97 | 67.78 | 56.37 | 61.75 | 5 |
LLaMA 3.1-70B | 52.81 | 60.73 | 37.81 | 39.03 | 33.43 | 45.26 | 11 |
DeepSeek-R1 | 78.54 | 69.36 | 72.68 | 76.67 | 60.96 | 70.33 | 2 |
DeepSeek-V3 | 72.93 | 61.28 | 62.63 | 63.89 | 54.72 | 62.96 | 4 |
Kimi | 53.05 | 56.88 | 46.19 | 43.89 | 46.69 | 50.24 | 10 |
GLM-4 | 60.98 | 62.57 | 51.33 | 43.20 | 44.94 | 53.77 | 9 |
ERNIE-4.0 | 64.63 | 64.22 | 54.50 | 49.31 | 50.45 | 57.71 | 7 |
ChemDFM-13B | 29.51 | 43.67 | 25.61 | 11.39 | 31.75 | 31.23 | 12 |
ChemLLM-7B-Char-1.5-SFT | 21.10 | 35.05 | 21.14 | 5.14 | 14.35 | 20.67 | 13 |
LlaSMol-Mistral-7B | 13.90 | 48.81 | 13.83 | 1.67 | 13.84 | 19.81 | 14 |
ChemELLM | 77.32 | 80.18 | 66.60 | 64.93 | 68.81 | 72.90 | 1 |
Table 7 Performance of the selected LLMs and ChemELLM on different question types. The best and second-best results are labeled in bold and underlined, respectively.
Task type | Quantity | Metric | Models | ||||
---|---|---|---|---|---|---|---|
GPT-4o | O1-Preview | DeepSeek-R1 | ChemELLM | ||||
Property prediction | BACE | 100 | ACC | 35 | 40 | 38 | 64 |
BBBP | 100 | ACC | 61 | 56 | 52 | 67 | |
ClinTox | 100 | ACC | 50 | 52 | 31.5 | 57.5 | |
HIV | 100 | ACC | 33 | 78 | 40 | 81 | |
Tox21 | 1044 | ACC | 80.27 | 81.9 | 81.03 | 83.14 | |
Yield prediction | Buchwald-Hartwig | 100 | ACC | 62 | 75 | 63 | 61 |
Suzuki-Miyaura | 100 | ACC | 52 | 65 | 61 | 48 | |
Name prediction | iupac2formula | 100 | Exact | 28 | 65 | 38 | 4 |
smiles2iupac | 100 | Exact | 1 | 0 | 0 | 24 | |
iupac2smiles | 100 | Exact | 8 | 14 | 9 | 20 | |
smiles2formula | 100 | Exact | 9 | 42 | 24 | 5 | |
Molecule analysis | text-based molecule design | 100 | BLEU | 42.56 | 51.76 | 58.12 | 75.71 |
molecule captioning | 100 | score | 20 | 23.5 | 18.25 | 26.5 | |
Synthetic analysis | reactant prediction | 100 | F1 | 3 | 32.67 | 25 | 61 |
retrosynthesis | 100 | F1 | 4.9 | 14.13 | 11.5 | 33.83 | |
solvent selection | 100 | F1 | 51 | 51 | 51 | 51 | |
reactant selection | 100 | F1 | 24.7 | 20.83 | 26 | 50.47 | |
ligands selection | 100 | F1 | 15.27 | 18.19 | 16.9 | 17.97 | |
Overall | 2744 | mean | 48.78 | 56.67 | 51.36 | 58.89 |
Table 8 Performance comparison of different LLMs on ChemLLMBench tasks. The best and second-best results are labeled in bold and underlined, respectively.
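Several Table 8 tasks (solvent, reactant, and ligand selection) are scored with F1 over predicted versus reference candidates. A minimal set-based sketch of such a metric — our own formulation, since the benchmark's exact matching rules may differ:

```python
def set_f1(predicted: set, reference: set) -> float:
    """F1 between a predicted and a reference set of candidates."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)  # true positives: shared candidates
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# e.g. two of three predicted solvents match a two-item reference set
# (the solvent names here are illustrative, not benchmark data).
assert round(set_f1({"THF", "DMF", "toluene"}, {"THF", "DMF"}), 2) == 0.8
```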
Model | Shot | L1 | L2 | L3 | Overall |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | |||
O3-mini | 0* | 74.39 | 75.19 | 0.00 | 6.67 | 60.83 | 36.00 | 20.00 | 57.33 | 59.07 | 72.50 | 67.31 | 72.78 | 61.00 | 53.21 | 47.75 | 58.85 |
3* | 80.92 | 72.69 | 3.33 | 10.00 | 56.67 | 56.00 | 37.65 | 51.00 | 56.29 | 71.44 | 69.36 | 73.48 | 67.25 | 54.74 | 56.14 | 61.46 | |
O1-Preview | 0* | 80.01 | 70.39 | 3.33 | 3.33 | 45.83 | 44.00 | 24.71 | 59.67 | 66.92 | 73.57 | 74.31 | 77.26 | 73.33 | 63.21 | 59.53 | 65.76 |
3* | 79.12 | 72.88 | 6.67 | 20.00 | 39.17 | 64.00 | 30.59 | 54.67 | 64.02 | 70.55 | 73.89 | 75.46 | 70.92 | 61.99 | 66.16 | 65.94 | |
GPT-4o | 0* | 67.63 | 55.77 | 3.75 | 6.67 | 37.50 | 76.00 | 15.29 | 53.33 | 56.89 | 66.71 | 65.36 | 61.54 | 62.83 | 54.54 | 51.84 | 56.69 |
3* | 65.93 | 54.62 | 1.67 | 10.00 | 25.00 | 72.00 | 10.59 | 48.33 | 56.57 | 68.09 | 66.45 | 60.98 | 68.00 | 54.92 | 59.16 | 58.03 | |
Claude-3.7 | 0* | 74.59 | 64.23 | 3.33 | 16.67 | 51.67 | 36.00 | 15.30 | 60.50 | 60.64 | 73.32 | 65.88 | 71.11 | 69.25 | 59.60 | 59.30 | 61.75 |
3* | 79.21 | 63.27 | 6.67 | 30.00 | 62.50 | 48.00 | 29.41 | 55.33 | 62.73 | 74.38 | 73.55 | 69.46 | 73.75 | 61.45 | 64.74 | 65.48 | |
LLaMA 3.1-70B | 0* | 58.23 | 34.23 | 0.00 | 0.00 | 28.33 | 28.00 | 5.88 | 33.67 | 47.42 | 59.09 | 49.53 | 46.28 | 55.33 | 40.29 | 42.19 | 45.26 |
3* | 59.96 | 38.85 | 0.00 | 13.33 | 12.50 | 60.00 | 22.35 | 36.33 | 49.08 | 59.81 | 58.92 | 50.94 | 60.17 | 41.93 | 48.07 | 49.63 | |
DeepSeek-R1 | 0* | 85.92 | 76.73 | 3.33 | 16.67 | 45.00 | 12.00 | 8.24 | 59.00 | 73.44 | 78.13 | 79.16 | 81.46 | 78.50 | 67.92 | 66.68 | 70.33 |
3* | 85.03 | 70.58 | 3.33 | 20.00 | 41.67 | 48.00 | 40.00 | 55.67 | 71.19 | 77.30 | 79.97 | 78.00 | 80.08 | 71.55 | 75.28 | 72.38 | |
DeepSeek-V3 | 0* | 75.90 | 60.96 | 0.00 | 20.00 | 47.50 | 40.00 | 4.71 | 56.00 | 64.92 | 73.22 | 70.63 | 73.14 | 72.25 | 62.59 | 57.56 | 62.96 |
3* | 75.79 | 65.96 | 3.33 | 13.33 | 35.00 | 52.00 | 16.47 | 55.33 | 66.80 | 73.37 | 71.14 | 70.10 | 74.75 | 60.57 | 67.72 | 65.63 | |
Kimi | 0* | 57.80 | 41.35 | 3.33 | 0.00 | 25.00 | 72.00 | 7.06 | 45.67 | 53.02 | 63.63 | 58.40 | 57.32 | 60.92 | 45.59 | 42.07 | 50.24 |
3* | 63.29 | 40.19 | 0.00 | 0.00 | 11.67 | 84.00 | 30.59 | 44.33 | 53.54 | 64.26 | 57.19 | 55.86 | 62.25 | 52.52 | 53.59 | 53.64 | |
GLM-4 | 0* | 63.99 | 41.73 | 0.00 | 0.00 | 28.33 | 44.00 | 4.71 | 47.33 | 54.76 | 65.57 | 60.04 | 56.84 | 66.84 | 55.63 | 50.48 | 53.77 |
3* | 60.52 | 41.73 | 0.00 | 3.33 | 17.50 | 56.00 | 14.12 | 49.00 | 56.30 | 67.31 | 62.06 | 55.74 | 66.00 | 57.36 | 53.60 | 55.08 | |
ERNIE-4.0 | 0* | 63.51 | 47.50 | 3.33 | 3.33 | 37.50 | 72.00 | 25.88 | 46.33 | 62.45 | 67.58 | 61.64 | 61.55 | 67.75 | 57.71 | 51.93 | 57.71 |
3* | 58.34 | 43.27 | 0.00 | 6.67 | 20.00 | 76.00 | 32.94 | 42.00 | 57.00 | 62.61 | 63.97 | 57.34 | 69.17 | 56.42 | 60.88 | 57.13 | |
ChemDFM-13B | 0* | 40.17 | 14.42 | 10.00 | 56.67 | 28.33 | 40.00 | 21.18 | 31.67 | 30.69 | 34.03 | 33.86 | 31.16 | 45.67 | 31.13 | 21.91 | 31.23 |
3* | 43.68 | 15.96 | 10.00 | 40.00 | 11.67 | 44.00 | 20.00 | 26.33 | 35.44 | 36.64 | 33.72 | 34.81 | 44.58 | 32.45 | 31.29 | 33.98 | |
ChemLLM-7B-Char-1.5-SFT | 0* | 27.41 | 9.42 | 0.00 | 0.00 | 0.00 | 48.00 | 1.18 | 13.33 | 24.25 | 22.07 | 28.25 | 18.12 | 25.50 | 16.48 | 17.24 | 20.67 |
3* | 25.29 | 14.62 | 0.00 | 0.00 | 0.00 | 48.00 | 1.18 | 6.67 | 16.65 | 15.50 | 15.76 | 12.46 | 15.92 | 10.25 | 14.14 | 14.87 | |
LlaSMol-Mistral-7B | 0* | 25.70 | 4.04 | 3.33 | 30.00 | 19.17 | 64.00 | 24.71 | 9.00 | 24.48 | 17.23 | 18.31 | 23.54 | 23.00 | 24.67 | 12.16 | 19.81 |
3* | 28.27 | 7.31 | 0.00 | 3.33 | 0.00 | 80 | 1.18 | 4.33 | 18.51 | 12.43 | 12.67 | 31.18 | 16.00 | 15.85 | 24.11 | 17.65 | |
ChemELLM | 0* | 80.95 | 63.56 | 30.00 | 56.67 | 45.00 | 64.00 | 52.94 | 60.67 | 72.59 | 75.56 | 80.42 | 74.06 | 80.75 | 69.53 | 73.95 | 72.90 |
3* | 80.36 | 66.35 | 26.67 | 53.33 | 53.33 | 64.00 | 54.12 | 60.00 | 72.07 | 74.70 | 82.19 | 75.44 | 82.08 | 64.49 | 72.93 | 72.73 |
Table 9 3-shot (3*) versus 0-shot (0*) performance across LLMs and the task categories C1-C15. Bold indicates improvement over the corresponding 0-shot setting.
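The overall columns of Table 9 let one quantify few-shot sensitivity, e.g. the 3-shot minus 0-shot overall delta per model (values copied from the table; the comparison itself is our own summary):

```python
# (0-shot, 3-shot) overall scores for selected models from Table 9.
overall = {
    "O3-mini": (58.85, 61.46),
    "DeepSeek-R1": (70.33, 72.38),
    "ChemELLM": (72.90, 72.73),
}

# 3-shot minus 0-shot improvement, rounded to two decimals.
deltas = {m: round(s3 - s0, 2) for m, (s0, s3) in overall.items()}

assert deltas["O3-mini"] == 2.61
assert deltas["ChemELLM"] == -0.17  # ChemELLM is nearly insensitive to shots
```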