Timestamp 🔥
- [2024/05] Added Claude3-PyPDF, Qwen-api, Moonshot, and Skylark-PyPDF to the SciAssess Benchmark.
- [2024/05] Added Uni-Smart Nano to the SciAssess Benchmark.
- [2024/05] Added results for LLM base models.
- [2024/04] We officially released the SciAssess Benchmark and tested several baseline LLMs (Uni-Smart Pro, Gpt4-Withpdf, Gpt3.5-Withpdf).

Table 1: Base Models Performers

| Model | MMLU-s | CMMLU-s | PubMedQA | Xiezhi-ch-s | Xiezhi-en-s |
| --- | --- | --- | --- | --- | --- |
| **<10B** | | | | | |
| ChatGLM-6B | 47.35 | 46.07 | 66.90 | 59.45 | 55.15 |
| Mistral-7B | 58.97 | 36.97 | 60.70 | 48.74 | 64.31 |
| Qwen1.5-7B | 57.71 | 70.40 | 68.40 | 71.37 | 65.69 |
| Llama2-7B | 38.73 | 28.37 | 49.30 | 31.87 | 39.66 |
| Llama3-8B | 62.33 | 41.99 | 68.40 | 60.18 | 64.07 |
| **10B ~ 50B** | | | | | |
| Llama2-13B | 50.27 | 32.01 | 51.30 | 39.42 | 56.77 |
| Qwen1.5-14B | 66.32 | 77.04 | 69.50 | 74.29 | 69.75 |
| **>50B** | | | | | |
| Mixtral-8x7B | 68.34 | 45.03 | 68.50 | 61.56 | 69.91 |
| Llama2-70B | 65.16 | 44.99 | 67.70 | 61.31 | 64.15 |
| Llama3-70B | 77.44 | 64.95 | 77.50 | 74.94 | 73.48 |
| Qwen1.5-72B | 75.52 | 83.87 | 72.30 | 75.75 | 70.88 |
| Qwen1.5-110B | 77.85 | 88.69 | 77.20 | 74.45 | 73.24 |

Table 2: Performers Overview

| Model | Global Avg Metric | Drug Discovery | Biomedicine | Alloy Materials | General Chemistry | Organic Materials |
| --- | --- | --- | --- | --- | --- | --- |
| Uni-Smart Pro | 0.630 | 0.486 | 0.804 | 0.661 | 0.551 | 0.590 |
| Uni-Smart Nano | 0.524 | 0.401 | 0.697 | 0.446 | 0.484 | 0.590 |
| GPT4+PyPDF | 0.562 | 0.429 | 0.779 | 0.527 | 0.550 | 0.525 |
| GPT3.5+PyPDF | 0.390 | 0.294 | 0.640 | 0.359 | 0.349 | 0.281 |
| Claude3-PyPDF | 0.410 | 0.278 | 0.701 | 0.367 | 0.308 | 0.298 |
| Qwen-api | 0.410 | 0.261 | 0.611 | 0.341 | 0.514 | 0.365 |
| Moonshot | 0.540 | 0.347 | 0.730 | 0.582 | 0.479 | 0.520 |
| Skylark-PyPDF | 0.320 | 0.233 | 0.475 | 0.300 | 0.436 | 0.224 |

Table 3: Drug Discovery Performers

| Model | Drug Discovery | Affinity Extraction | Tag2Mol | Is Mol Covered | Markush2Mol | Targets Extraction | Reaction QA | Drug Chart QA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Uni-Smart Pro | 0.486 | 0.230 | 0.297 | 0.889 | 0.555 | 0.500 | 0.400 | 0.533 |
| Uni-Smart Nano | 0.401 | 0.286 | 0.184 | 0.711 | 0.412 | 0.750 | 0.200 | 0.267 |
| GPT4+PyPDF | 0.429 | 0.378 | 0.045 | 0.467 | 0.696 | 0.683 | 0.200 | 0.533 |
| GPT3.5+PyPDF | 0.294 | 0.270 | 0.002 | 0.422 | 0.463 | 0.433 | 0.133 | 0.333 |
| Claude3-PyPDF | 0.278 | 0.214 | 0.010 | 0.489 | 0.298 | 0.667 | 0.000 | 0.267 |
| Qwen-api | 0.261 | 0.214 | 0.079 | 0.511 | 0.270 | 0.417 | 0.133 | 0.200 |
| Moonshot | 0.347 | 0.240 | 0.018 | 0.489 | 0.313 | 0.500 | 0.333 | 0.533 |
| Skylark-PyPDF | 0.233 | 0.193 | 0.015 | 0.444 | 0.081 | 0.433 | 0.067 | 0.400 |
| Metric | average | accuracy_value | score | accuracy | score | score | accuracy | accuracy |
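
The `average` metric listed for each domain column appears to be the unweighted mean of that domain's task scores. This is an inference from the reported numbers, not a documented formula. The minimal sketch below checks it for Uni-Smart Pro using the task scores from the Drug Discovery table; the function name `domain_average` and the dictionary layout are illustrative assumptions, not part of the SciAssess code.

```python
from statistics import mean

# Task scores for Uni-Smart Pro, copied from the Drug Discovery table.
uni_smart_pro_drug_tasks = {
    "Affinity Extraction": 0.230,
    "Tag2Mol": 0.297,
    "Is Mol Covered": 0.889,
    "Markush2Mol": 0.555,
    "Targets Extraction": 0.500,
    "Reaction QA": 0.400,
    "Drug Chart QA": 0.533,
}


def domain_average(task_scores):
    """Return the unweighted mean of the per-task scores (hypothetical helper)."""
    return mean(list(task_scores.values()))


# Compare against the reported Drug Discovery score for Uni-Smart Pro (0.486).
print(round(domain_average(uni_smart_pro_drug_tasks), 3))  # prints 0.486
```

Because the task columns use different metrics (`accuracy`, `accuracy_value`, `score`), a plain mean treats every task as equally weighted regardless of metric type.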

Table 4: Biomedicine Performers

| Model | Biomedicine | Bio Chart QA | Chemical Entities Recognition | Compound Disease Recognition | Disease Entities Recognition | Gene Disease Function | Gene Disease Regulation | MedMCQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Uni-Smart Pro | 0.804 | 0.533 | 0.890 | 0.942 | 0.714 | 0.924 | 0.911 | 0.717 |
| Uni-Smart Nano | 0.697 | 0.467 | 0.314 | 0.969 | 0.353 | 0.957 | 0.948 | 0.870 |
| GPT4+PyPDF | 0.779 | 0.400 | 0.919 | 0.936 | 0.726 | 0.932 | 0.834 | 0.703 |
| GPT3.5+PyPDF | 0.640 | 0.200 | 0.636 | 0.943 | 0.568 | 0.915 | 0.693 | 0.527 |
| Claude3-PyPDF | 0.701 | 0.400 | 0.827 | 0.918 | 0.524 | 0.929 | 0.803 | 0.507 |
| Qwen-api | 0.611 | 0.267 | 0.891 | 0.859 | 0.582 | 0.906 | 0.706 | 0.067 |
| Moonshot | 0.730 | 0.467 | 0.776 | 0.929 | 0.667 | 0.914 | 0.799 | 0.560 |
| Skylark-PyPDF | 0.475 | 0.400 | 0.541 | 0.826 | 0.373 | 0.856 | 0.327 | 0.000 |
| Metric | average | accuracy | value_recall | value_recall | value_recall | value_recall | value_recall | accuracy |

Table 5: Alloy Materials Performers

| Model | Alloy Materials | Alloy Sample Differentiation | Alloy Composition Extraction | Alloy Temperature Extraction | Alloy Treatment Sequence | Alloy Chart QA |
| --- | --- | --- | --- | --- | --- | --- |
| Uni-Smart Pro | 0.661 | 0.824 | 0.451 | 0.840 | 0.792 | 0.400 |
| Uni-Smart Nano | 0.446 | 0.294 | 0.427 | 0.500 | 0.541 | 0.467 |
| GPT4+PyPDF | 0.527 | 0.431 | 0.449 | 0.540 | 0.750 | 0.467 |
| GPT3.5+PyPDF | 0.359 | 0.137 | 0.417 | 0.200 | 0.708 | 0.333 |
| Claude3-PyPDF | 0.367 | 0.157 | 0.420 | 0.300 | 0.625 | 0.333 |
| Qwen-api | 0.341 | 0.137 | 0.301 | 0.360 | 0.708 | 0.200 |
| Moonshot | 0.582 | 0.725 | 0.451 | 0.760 | 0.708 | 0.267 |
| Skylark-PyPDF | 0.300 | 0.216 | 0.358 | 0.200 | 0.458 | 0.267 |
| Metric | average | accuracy | accuracy_value | accuracy | accuracy | accuracy |

Table 6: General Chemistry Performers

| Model | General Chemistry | Balance Chemical Equation | MMLU Chemistry |
| --- | --- | --- | --- |
| Uni-Smart Pro | 0.551 | 0.470 | 0.633 |
| Uni-Smart Nano | 0.484 | 0.400 | 0.567 |
| GPT4+PyPDF | 0.550 | 0.400 | 0.700 |
| GPT3.5+PyPDF | 0.349 | 0.330 | 0.367 |
| Claude3-PyPDF | 0.308 | 0.350 | 0.267 |
| Qwen-api | 0.514 | 0.360 | 0.667 |
| Moonshot | 0.479 | 0.290 | 0.667 |
| Skylark-PyPDF | 0.436 | 0.340 | 0.533 |
| Metric | average | accuracy | accuracy |

Table 7: Organic Materials Performers

| Model | Organic Materials | Solubility Extraction | OLED Property Extraction | Polymer Property Extraction | Polymer Composition Extraction | Polymer Chart QA | Electrolyte Table QA | Reaction Mechanism QA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Uni-Smart Pro | 0.590 | 0.371 | 0.303 | 0.739 | 0.733 | 0.733 | 0.750 | 0.500 |
| Uni-Smart Nano | 0.590 | 0.390 | 0.362 | 0.800 | 0.600 | 0.800 | 0.542 | 0.636 |
| GPT4+PyPDF | 0.525 | 0.434 | 0.356 | 0.818 | 0.467 | 0.600 | 0.542 | 0.455 |
| GPT3.5+PyPDF | 0.281 | 0.367 | 0.345 | 0.493 | 0.133 | 0.267 | 0.229 | 0.136 |
| Claude3-PyPDF | 0.298 | 0.327 | 0.357 | 0.349 | 0.400 | 0.200 | 0.229 | 0.227 |
| Qwen-api | 0.365 | 0.347 | 0.363 | 0.568 | 0.133 | 0.333 | 0.354 | 0.455 |
| Moonshot | 0.520 | 0.248 | 0.307 | 0.682 | 0.933 | 0.600 | 0.458 | 0.409 |
| Skylark-PyPDF | 0.224 | 0.182 | 0.244 | 0.398 | 0.000 | 0.133 | 0.292 | 0.318 |
| Metric | average | accuracy_value | accuracy_value | accuracy_value | accuracy | accuracy | accuracy | accuracy |