Chinese Journal of Catalysis ›› 2025, Vol. 73: 159-173. DOI: 10.1016/S1872-2067(25)64725-5
Jibin Zhou a,1, Feiyang Xu b,1, Zhijun Chang c, Duiping Liu a, Lulu Li a, Jian Cui b, Yi Li b, Xin Li b,d,e,*, Li Qian c, Zhixiong Zhang c, Guoping Hu b,e, Mao Ye a,*, Zhongmin Liu a
Received: 2025-03-27
Accepted: 2025-05-13
Online: 2025-06-18
Published: 2025-06-12
Contact: *E-mail: maoye@dicp.ac.cn (M. Ye), leexin@ustc.edu.cn (X. Li).
About author: 1 These authors contributed equally to this work.
Citation: Jibin Zhou, Feiyang Xu, Zhijun Chang, Duiping Liu, Lulu Li, Jian Cui, Yi Li, Xin Li, Li Qian, Zhixiong Zhang, Guoping Hu, Mao Ye, Zhongmin Liu. From lab to fab: A large language model for chemical engineering[J]. Chinese Journal of Catalysis, 2025, 73: 159-173.
URL: https://www.cjcatal.com/EN/10.1016/S1872-2067(25)64725-5
Data source | Documents | Size |
---|---|---|
Scholarly papers | 1.06 million | 30.5 GB |
Chemical patents | 5.79 million | 58.9 GB |
Professional books | 1200 | 106.2 GB |
Table 1 Statistics of the pre-training data.
Type | Catalyst | Simulation | Equipment | Separation | Safety | Heat | Engineering |
---|---|---|---|---|---|---|---|
Multiple choice | 24600 | 43900 | 48600 | 96700 | 9200 | 10800 | 8000 |
True/False | 14100 | 46800 | 42400 | 80100 | 5800 | 10000 | 7000 |
Fill-in-the-blank | 19500 | 39900 | 44100 | 83100 | 9000 | 10000 | 2000 |
Calculation | 30500 | 54200 | 63700 | 116700 | 13100 | 12500 | 5000 |
Short answer | 31500 | 46300 | 63000 | 117500 | 13000 | 11200 | 7000 |
Sum | 120200 | 231100 | 261800 | 494100 | 50100 | 54500 | 29000 |
Table 2 The supervised fine-tuning data contain 1,240,800 (1.24 million) instruction-tuning Q&A pairs in total.
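Concretely, each instruction-tuning item can be stored as one JSON record pairing a question with a reference response. The sketch below is illustrative only; the field names are assumptions, not the authors' published schema.

```python
# A minimal sketch of one supervised fine-tuning record (hypothetical schema;
# the paper does not publish its exact data format).
import json

record = {
    "domain": "separation",        # one of the 7 sub-domains in Table 2
    "type": "calculation",         # one of the 5 question types in Table 2
    "instruction": "A distillation column operates at ...",  # question text
    "response": "Step 1: write the mass balance ...",        # reference answer
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```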
Dimension | Definition | Score range |
---|---|---|
Objectivity | The question should have a unique and objective answer under unified evaluation standards. | 0-5 |
Rationality | The question and answer must be complete and clear, without omitting critical information. | 0-5 |
Accuracy | The reasoning chain should be checked step by step to ensure the absence of factual, logical, computational, or knowledge errors. | 0-5 |
Generalizability | Questions and answers should be based on general domain knowledge rather than relying on specific papers or patents. | 0-5 |
Table 3 Criteria for model-based scoring of answer generation.
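Table 3 implies an LLM-as-judge filter: a grader model scores each candidate Q&A pair on the four dimensions, and low-scoring pairs are discarded. Below is a minimal sketch of such a filter, assuming a caller-supplied `llm_judge` function and an illustrative acceptance threshold; the paper does not report its exact prompt or cutoff.

```python
# Hypothetical quality filter based on the Table 3 rubric; prompt wording,
# reply format, and threshold are assumptions for illustration.
RUBRIC = """Score the following Q&A pair from 0 to 5 on each dimension:
- Objectivity: unique, objective answer under unified evaluation standards
- Rationality: question and answer are complete and clear
- Accuracy: reasoning chain is free of factual/logical/computational errors
- Generalizability: relies on general domain knowledge, not one paper/patent
Reply with four integers separated by spaces, e.g. "5 4 5 3".

Question: {q}
Answer: {a}"""

def keep_sample(q: str, a: str, llm_judge, threshold: int = 4) -> bool:
    """Return True if the judge scores all four dimensions >= threshold."""
    reply = llm_judge(RUBRIC.format(q=q, a=a))
    scores = [int(tok) for tok in reply.split()]
    return len(scores) == 4 and all(s >= threshold for s in scores)
```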
Level | Category | Task | Type (Metric) |
---|---|---|---|
Foundational Knowledge | objective Q&A of domain knowledge | objective question | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc) |
 | subjective Q&A of domain knowledge | subjective question | short answer (score), calculation (score) |
Advanced Knowledge | molecular name translation | SMILES to IUPAC | SMILES to IUPAC (Acc) |
 | molecular name generation | molecular name generation from text description | molecular name generation from text description (score) |
 | molecular description | generate text descriptions based on molecular SMILES | generate text descriptions based on molecular SMILES (score) |
 | molecular property prediction | prediction of molecular properties based on molecular SMILES | prediction of molecular properties based on molecular SMILES (Acc) |
 | reaction prediction | reaction prediction | predict the reactants from the products (Acc), predict the products from the reactants (F1), and predict whether the reaction is high yield based on the reaction information (Acc) |
Professional Skill | catalyst | catalyst deactivation | short answer (score) |
 | | catalyst stability | short answer (score) |
 | | catalyst industrial process | short answer (score) |
 | equipment | general equipment | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) |
 | | reactor | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) |
 | | dryer | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) |
 | | centrifuge | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) |
 | | pump | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) |
 | | tower | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) |
 | fluid simulation | computational fluid dynamics | multiple choice (Acc), short answer (score) |
 | | discrete element method | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score), calculation (score) |
 | | machine learning method | short answer (score) |
 | | direct numerical simulation | short answer (score), calculation (score) |
 | separation | absorption | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) |
 | | distillation | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) |
 | | extraction | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) |
 | heat | heat exchanger | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score), calculation (score) |
 | safety | regulations and standards | multiple choice (Acc) |
 | | process safety | multiple choice (Acc), true/false (Acc), short answer (score) |
 | | environment safety | multiple choice (Acc), true/false (Acc), short answer (score) |
 | | personnel safety | multiple choice (Acc) |
 | | equipment safety | multiple choice (Acc) |
 | | hazardous chemistry | multiple choice (Acc), true/false (Acc), short answer (score) |
 | economics | economics | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score) |
 | engineering construction | electrical engineering | multiple choice (Acc), fill-in-the-blank (Acc), true/false (Acc), short answer (score), calculation (score) |
 | | automatic control | multiple choice (Acc), true/false (Acc), short answer (score) |
 | | material engineering | multiple choice (Acc) |
 | | equipment engineering | multiple choice (Acc), true/false (Acc), short answer (score) |
 | | civil engineering | multiple choice (Acc) |
 | | thermal engineering | multiple choice (Acc) |
 | | water supply and drainage engineering | multiple choice (Acc) |
 | | general plot plan | multiple choice (Acc) |
 | | chemical system | multiple choice (Acc), true/false (Acc) |
 | | fire protection engineering | multiple choice (Acc) |
Table 4 The statistics of ChemEBench. It includes 3 progressive levels, evaluating 15 dimensions of LLM capabilities and featuring 101 distinct chemical tasks.
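For the reaction-prediction tasks above, accuracy applies to single-answer items, while product prediction is scored with F1. One plausible implementation treats the prediction and the reference as sets of product SMILES; the paper does not spell out the matching granularity, so this is a sketch under that assumption.

```python
# Set-level F1 between predicted and reference products (an illustrative
# reading of the "predict the products from the reactants (F1)" metric).
def product_f1(predicted: set[str], reference: set[str]) -> float:
    tp = len(predicted & reference)      # products that match exactly
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)
```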
Fig. 3. Distribution of questions in ChemEBench. The bar chart shows the number of questions in each sub-domain; the pie chart shows the questions classified by question structure.
Model | Developer | Size (parameters) | Access |
---|---|---|---|
O3-mini | OpenAI | undisclosed | API |
O1-Preview | OpenAI | undisclosed | API |
GPT-4o | OpenAI | undisclosed | API |
Claude-3.7 | Anthropic | undisclosed | API |
LLaMA 3.1-70B | Meta | 70B | weights |
DeepSeek-R1 | DeepSeek | 671B | API |
DeepSeek-V3 | DeepSeek | 671B | API |
Kimi | Moonshot AI | undisclosed | API |
GLM-4 | Zhipu AI | undisclosed | API |
ERNIE-4.0 | Baidu | undisclosed | API |
ChemDFM-13B | Suzhou Lab | 13B | weights |
ChemLLM-7B-Chat-1.5-SFT | Shanghai AI Lab | 7B | weights |
LlaSMol-Mistral-7B | OSU | 7B | weights |
Table 5 Detailed information on the LLMs chosen for evaluation in our experiments. The "Size" column gives the number of parameters of each model; the "Access" column indicates whether a model is queried through an API or run locally from released weights.
Model | L1 | L2 | L3 | Mean score | Overall rank |
---|---|---|---|---|---|
O3-mini | 74.72 | 23.13 | 59.74 | 58.85 | 6 |
O1-Preview | 76.10 | 23.88 | 67.94 | 65.76 | 3 |
GPT-4o | 62.81 | 23.19 | 58.48 | 56.48 | 8 |
Claude-3.7 | 70.38 | 21.76 | 64.01 | 61.75 | 5 |
LLaMA 3.1-70B | 48.48 | 10.25 | 47.84 | 45.26 | 11 |
DeepSeek-R1 | 82.19 | 14.75 | 73.49 | 70.33 | 2 |
DeepSeek-V3 | 69.83 | 17.13 | 65.97 | 62.96 | 4 |
Kimi | 51.12 | 16.25 | 53.06 | 50.24 | 10 |
GLM-4 | 54.95 | 11.75 | 57.24 | 53.77 | 9 |
ERNIE-4.0 | 57.01 | 26.62 | 60.49 | 57.71 | 7 |
ChemDFM-13B | 29.71 | 28.25 | 31.69 | 31.22 | 12 |
ChemLLM-7B-Chat-1.5-SFT | 20.10 | 6.50 | 21.97 | 20.67 | 13 |
LlaSMol-Mistral-7B | 16.90 | 26.38 | 19.64 | 19.81 | 14 |
ChemELLM | 73.88 | 50.25 | 74.72 | 72.90 | 1 |
Table 6 Performance of the selected LLMs and ChemELLM on the three ChemEBench levels (L1-L3). The best and second-best results are labeled in bold and underlined, respectively.
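Note that the mean score is not an unweighted average of the three level scores (for ChemELLM, (73.88 + 50.25 + 74.72)/3 ≈ 66.3, not 72.90), so the aggregation is presumably weighted by the number of questions per level. A sketch under that assumption, with placeholder counts:

```python
# Hypothetical question-count-weighted aggregation; the per-level question
# counts are not reported in this table, so the counts below are placeholders.
def weighted_mean(scores: list[float], counts: list[int]) -> float:
    return sum(s * n for s, n in zip(scores, counts)) / sum(counts)

# e.g. weighted_mean([73.88, 50.25, 74.72], [n1, n2, n3]) with the actual
# per-level question counts n1, n2, n3.
```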
Model | Multiple choice | True/False | Fill-in-the-blank | Calculation | Short answer | Mean score | Overall rank |
---|---|---|---|---|---|---|---|
O3-mini | 59.63 | 62.94 | 53.12 | 77.64 | 54.41 | 58.85 | 6 |
O1-Preview | 71.46 | 71.01 | 61.46 | 72.22 | 57.88 | 65.75 | 3 |
GPT-4o | 63.78 | 56.88 | 54.25 | 56.80 | 51.09 | 56.69 | 8 |
Claude-3.7 | 67.93 | 63.49 | 55.97 | 67.78 | 56.37 | 61.75 | 5 |
LLaMA 3.1-70B | 52.81 | 60.73 | 37.81 | 39.03 | 33.43 | 45.26 | 11 |
DeepSeek-R1 | 78.54 | 69.36 | 72.68 | 76.67 | 60.96 | 70.33 | 2 |
DeepSeek-V3 | 72.93 | 61.28 | 62.63 | 63.89 | 54.72 | 62.96 | 4 |
Kimi | 53.05 | 56.88 | 46.19 | 43.89 | 46.69 | 50.24 | 10 |
GLM-4 | 60.98 | 62.57 | 51.33 | 43.20 | 44.94 | 53.77 | 9 |
ERNIE-4.0 | 64.63 | 64.22 | 54.50 | 49.31 | 50.45 | 57.71 | 7 |
ChemDFM-13B | 29.51 | 43.67 | 25.61 | 11.39 | 31.75 | 31.23 | 12 |
ChemLLM-7B-Chat-1.5-SFT | 21.10 | 35.05 | 21.14 | 5.14 | 14.35 | 20.67 | 13 |
LlaSMol-Mistral-7B | 13.90 | 48.81 | 13.83 | 1.67 | 13.84 | 19.81 | 14 |
ChemELLM | 77.32 | 80.18 | 66.60 | 64.93 | 68.81 | 72.90 | 1 |
Table 7 Performance of the selected LLMs and ChemELLM on different question types. Multiple choice, true/false, and fill-in-the-blank are objective tasks; calculation and short answer are subjective tasks. The best and second-best results are labeled in bold and underlined, respectively.
Task type | Task | Quantity | Metric | GPT-4o | O1-Preview | DeepSeek-R1 | ChemELLM |
---|---|---|---|---|---|---|---|
Property prediction | BACE | 100 | Acc | 35 | 40 | 38 | 64 |
 | BBBP | 100 | Acc | 61 | 56 | 52 | 67 |
 | ClinTox | 100 | Acc | 50 | 52 | 31.5 | 57.5 |
 | HIV | 100 | Acc | 33 | 78 | 40 | 81 |
 | Tox21 | 1044 | Acc | 80.27 | 81.9 | 81.03 | 83.14 |
Yield prediction | Buchwald-Hartwig | 100 | Acc | 62 | 75 | 63 | 61 |
 | Suzuki-Miyaura | 100 | Acc | 52 | 65 | 61 | 48 |
Name prediction | iupac2formula | 100 | Exact | 28 | 65 | 38 | 4 |
 | smiles2iupac | 100 | Exact | 1 | 0 | 0 | 24 |
 | iupac2smiles | 100 | Exact | 8 | 14 | 9 | 20 |
 | smiles2formula | 100 | Exact | 9 | 42 | 24 | 5 |
Molecule analysis | text-based molecule design | 100 | BLEU | 42.56 | 51.76 | 58.12 | 75.71 |
 | molecule captioning | 100 | score | 20 | 23.5 | 18.25 | 26.5 |
Synthetic analysis | reactant prediction | 100 | F1 | 3 | 32.67 | 25 | 61 |
 | retrosynthesis | 100 | F1 | 4.9 | 14.13 | 11.5 | 33.83 |
 | solvent selection | 100 | F1 | 51 | 51 | 51 | 51 |
 | reagent selection | 100 | F1 | 24.7 | 20.83 | 26 | 50.47 |
 | ligand selection | 100 | F1 | 15.27 | 18.19 | 16.9 | 17.97 |
Overall | | 2744 | mean | 48.78 | 56.67 | 51.36 | 58.89 |
Table 8 Performance comparison of different LLMs on ChemLLMBench tasks. The best and second-best results are labeled in bold and underlined, respectively.
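For the name-prediction rows scored by exact match, a common convention when the output is a SMILES string is to compare canonical forms, so that different valid spellings of the same molecule count as equal. Whether the authors canonicalize is not stated; the sketch below uses RDKit to illustrate the idea.

```python
# Exact match on canonical SMILES (illustrative; requires `pip install rdkit`).
from rdkit import Chem

def smiles_exact_match(pred: str, gold: str) -> bool:
    mol_pred = Chem.MolFromSmiles(pred)
    mol_gold = Chem.MolFromSmiles(gold)
    if mol_pred is None or mol_gold is None:   # unparsable output -> no match
        return False
    return Chem.MolToSmiles(mol_pred) == Chem.MolToSmiles(mol_gold)
```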
Model | Shots | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | Overall |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
O3-mini | 0-shot | 74.39 | 75.19 | 0.00 | 6.67 | 60.83 | 36.00 | 20.00 | 57.33 | 59.07 | 72.50 | 67.31 | 72.78 | 61.00 | 53.21 | 47.75 | 58.85 |
 | 3-shot | 80.92 | 72.69 | 3.33 | 10.00 | 56.67 | 56.00 | 37.65 | 51.00 | 56.29 | 71.44 | 69.36 | 73.48 | 67.25 | 54.74 | 56.14 | 61.46 |
O1-Preview | 0-shot | 80.01 | 70.39 | 3.33 | 3.33 | 45.83 | 44.00 | 24.71 | 59.67 | 66.92 | 73.57 | 74.31 | 77.26 | 73.33 | 63.21 | 59.53 | 65.76 |
 | 3-shot | 79.12 | 72.88 | 6.67 | 20.00 | 39.17 | 64.00 | 30.59 | 54.67 | 64.02 | 70.55 | 73.89 | 75.46 | 70.92 | 61.99 | 66.16 | 65.94 |
GPT-4o | 0-shot | 67.63 | 55.77 | 3.75 | 6.67 | 37.50 | 76.00 | 15.29 | 53.33 | 56.89 | 66.71 | 65.36 | 61.54 | 62.83 | 54.54 | 51.84 | 56.69 |
 | 3-shot | 65.93 | 54.62 | 1.67 | 10.00 | 25.00 | 72.00 | 10.59 | 48.33 | 56.57 | 68.09 | 66.45 | 60.98 | 68.00 | 54.92 | 59.16 | 58.03 |
Claude-3.7 | 0-shot | 74.59 | 64.23 | 3.33 | 16.67 | 51.67 | 36.00 | 15.30 | 60.50 | 60.64 | 73.32 | 65.88 | 71.11 | 69.25 | 59.60 | 59.30 | 61.75 |
 | 3-shot | 79.21 | 63.27 | 6.67 | 30.00 | 62.50 | 48.00 | 29.41 | 55.33 | 62.73 | 74.38 | 73.55 | 69.46 | 73.75 | 61.45 | 64.74 | 65.48 |
LLaMA 3.1-70B | 0-shot | 58.23 | 34.23 | 0.00 | 0.00 | 28.33 | 28.00 | 5.88 | 33.67 | 47.42 | 59.09 | 49.53 | 46.28 | 55.33 | 40.29 | 42.19 | 45.26 |
 | 3-shot | 59.96 | 38.85 | 0.00 | 13.33 | 12.50 | 60.00 | 22.35 | 36.33 | 49.08 | 59.81 | 58.92 | 50.94 | 60.17 | 41.93 | 48.07 | 49.63 |
DeepSeek-R1 | 0-shot | 85.92 | 76.73 | 3.33 | 16.67 | 45.00 | 12.00 | 8.24 | 59.00 | 73.44 | 78.13 | 79.16 | 81.46 | 78.50 | 67.92 | 66.68 | 70.33 |
 | 3-shot | 85.03 | 70.58 | 3.33 | 20.00 | 41.67 | 48.00 | 40.00 | 55.67 | 71.19 | 77.30 | 79.97 | 78.00 | 80.08 | 71.55 | 75.28 | 72.38 |
DeepSeek-V3 | 0-shot | 75.90 | 60.96 | 0.00 | 20.00 | 47.50 | 40.00 | 4.71 | 56.00 | 64.92 | 73.22 | 70.63 | 73.14 | 72.25 | 62.59 | 57.56 | 62.96 |
 | 3-shot | 75.79 | 65.96 | 3.33 | 13.33 | 35.00 | 52.00 | 16.47 | 55.33 | 66.80 | 73.37 | 71.14 | 70.10 | 74.75 | 60.57 | 67.72 | 65.63 |
Kimi | 0-shot | 57.80 | 41.35 | 3.33 | 0.00 | 25.00 | 72.00 | 7.06 | 45.67 | 53.02 | 63.63 | 58.40 | 57.32 | 60.92 | 45.59 | 42.07 | 50.24 |
 | 3-shot | 63.29 | 40.19 | 0.00 | 0.00 | 11.67 | 84.00 | 30.59 | 44.33 | 53.54 | 64.26 | 57.19 | 55.86 | 62.25 | 52.52 | 53.59 | 53.64 |
GLM-4 | 0-shot | 63.99 | 41.73 | 0.00 | 0.00 | 28.33 | 44.00 | 4.71 | 47.33 | 54.76 | 65.57 | 60.04 | 56.84 | 66.84 | 55.63 | 50.48 | 53.77 |
 | 3-shot | 60.52 | 41.73 | 0.00 | 3.33 | 17.50 | 56.00 | 14.12 | 49.00 | 56.30 | 67.31 | 62.06 | 55.74 | 66.00 | 57.36 | 53.60 | 55.08 |
ERNIE-4.0 | 0-shot | 63.51 | 47.50 | 3.33 | 3.33 | 37.50 | 72.00 | 25.88 | 46.33 | 62.45 | 67.58 | 61.64 | 61.55 | 67.75 | 57.71 | 51.93 | 57.71 |
 | 3-shot | 58.34 | 43.27 | 0.00 | 6.67 | 20.00 | 76.00 | 32.94 | 42.00 | 57.00 | 62.61 | 63.97 | 57.34 | 69.17 | 56.42 | 60.88 | 57.13 |
ChemDFM-13B | 0-shot | 40.17 | 14.42 | 10.00 | 56.67 | 28.33 | 40.00 | 21.18 | 31.67 | 30.69 | 34.03 | 33.86 | 31.16 | 45.67 | 31.13 | 21.91 | 31.23 |
 | 3-shot | 43.68 | 15.96 | 10.00 | 40.00 | 11.67 | 44.00 | 20.00 | 26.33 | 35.44 | 36.64 | 33.72 | 34.81 | 44.58 | 32.45 | 31.29 | 33.98 |
ChemLLM-7B-Chat-1.5-SFT | 0-shot | 27.41 | 9.42 | 0.00 | 0.00 | 0.00 | 48.00 | 1.18 | 13.33 | 24.25 | 22.07 | 28.25 | 18.12 | 25.50 | 16.48 | 17.24 | 20.67 |
 | 3-shot | 25.29 | 14.62 | 0.00 | 0.00 | 0.00 | 48.00 | 1.18 | 6.67 | 16.65 | 15.50 | 15.76 | 12.46 | 15.92 | 10.25 | 14.14 | 14.87 |
LlaSMol-Mistral-7B | 0-shot | 25.70 | 4.04 | 3.33 | 30.00 | 19.17 | 64.00 | 24.71 | 9.00 | 24.48 | 17.23 | 18.31 | 23.54 | 23.00 | 24.67 | 12.16 | 19.81 |
 | 3-shot | 28.27 | 7.31 | 0.00 | 3.33 | 0.00 | 80.00 | 1.18 | 4.33 | 18.51 | 12.43 | 12.67 | 31.18 | 16.00 | 15.85 | 24.11 | 17.65 |
ChemELLM | 0-shot | 80.95 | 63.56 | 30.00 | 56.67 | 45.00 | 64.00 | 52.94 | 60.67 | 72.59 | 75.56 | 80.42 | 74.06 | 80.75 | 69.53 | 73.95 | 72.90 |
 | 3-shot | 80.36 | 66.35 | 26.67 | 53.33 | 53.33 | 64.00 | 54.12 | 60.00 | 72.07 | 74.70 | 82.19 | 75.44 | 82.08 | 64.49 | 72.93 | 72.73 |
Table 9 3-shot versus 0-shot performance across LLMs and task categories. C1-C2 correspond to the two L1 categories, C3-C7 to the five L2 categories, and C8-C15 to the eight L3 categories listed in Table 4. Bold indicates performance improvement compared to the 0-shot setting.
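The 0-shot and 3-shot settings differ only in whether worked examples are prepended to the test question. Below is a minimal sketch of how such prompts can be assembled; the delimiter and wording are assumptions, not the paper's exact template.

```python
# Hypothetical few-shot prompt builder: pass three (question, answer) pairs
# for the 3-shot setting, or an empty sequence for 0-shot.
def build_prompt(question: str, exemplars: tuple = ()) -> str:
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in exemplars]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)
```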