LeaderBoard


Benchmarking LLM Proficiency on SciAssess
  • Timestamp 🔥
    • [2024/05] Added Claude3-PyPDF, Qwen-api, Moonshot, and Skylark-PyPDF to the SciAssess Benchmark.
    • [2024/05] Added Uni-Smart Nano to the SciAssess Benchmark.
    • [2024/05] Added results for base LLMs (Table 1).
    • [2024/04] We officially released the SciAssess Benchmark and tested several baseline LLMs (Uni-Smart Pro, Gpt4-Withpdf, Gpt3.5-Withpdf).

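Table 1: Base Models Performers

| Size | Model | MMLU-s | CMMLU-s | PubMedQA | Xiezhi-ch-s | Xiezhi-en-s |
| --- | --- | --- | --- | --- | --- | --- |
| <10B | ChatGLM-6B | 47.35 | 46.07 | 66.90 | 59.45 | 55.15 |
| <10B | Mistral-7B | 58.97 | 36.97 | 60.70 | 48.74 | 64.31 |
| <10B | Qwen1.5-7B | 57.71 | 70.40 | 68.40 | 71.37 | 65.69 |
| <10B | Llama2-7B | 38.73 | 28.37 | 49.30 | 31.87 | 39.66 |
| <10B | Llama3-8B | 62.33 | 41.99 | 68.40 | 60.18 | 64.07 |
| 10B ~ 50B | Llama2-13B | 50.27 | 32.01 | 51.30 | 39.42 | 56.77 |
| 10B ~ 50B | Qwen1.5-14B | 66.32 | 77.04 | 69.50 | 74.29 | 69.75 |
| >50B | Mixtral-8x7B | 68.34 | 45.03 | 68.50 | 61.56 | 69.91 |
| >50B | Llama2-70B | 65.16 | 44.99 | 67.70 | 61.31 | 64.15 |
| >50B | Llama3-70B | 77.44 | 64.95 | 77.50 | 74.94 | 73.48 |
| >50B | Qwen1.5-72B | 75.52 | 83.87 | 72.30 | 75.75 | 70.88 |
| >50B | Qwen1.5-110B | 77.85 | 88.69 | 77.20 | 74.45 | 73.24 |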

Table 2: Performers Overview

| Model | Global Avg Metric | Drug Discovery | Biomedicine | Alloy Materials | General Chemistry | Organic Materials |
| --- | --- | --- | --- | --- | --- | --- |
| Uni-Smart Pro | 0.630 | 0.486 | 0.804 | 0.661 | 0.551 | 0.590 |
| Uni-Smart Nano | 0.524 | 0.401 | 0.697 | 0.446 | 0.484 | 0.590 |
| GPT4+PyPDF | 0.562 | 0.429 | 0.779 | 0.527 | 0.550 | 0.525 |
| GPT3.5+PyPDF | 0.390 | 0.294 | 0.640 | 0.359 | 0.349 | 0.281 |
| Claude3-PyPDF | 0.410 | 0.278 | 0.701 | 0.367 | 0.308 | 0.298 |
| Qwen-api | 0.410 | 0.261 | 0.611 | 0.341 | 0.514 | 0.365 |
| Moonshot | 0.540 | 0.347 | 0.730 | 0.582 | 0.479 | 0.520 |
| Skylark-PyPDF | 0.320 | 0.233 | 0.475 | 0.300 | 0.436 | 0.224 |

Table 3: Drug Discovery Performers

| Model | Drug Discovery | Affinity Extraction | Tag2Mol | Is Mol Covered | Markush2Mol | Targets Extraction | Reaction QA | Drug Chart QA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Uni-Smart Pro | 0.486 | 0.230 | 0.297 | 0.889 | 0.555 | 0.500 | 0.400 | 0.533 |
| Uni-Smart Nano | 0.401 | 0.286 | 0.184 | 0.711 | 0.412 | 0.750 | 0.200 | 0.267 |
| GPT4+PyPDF | 0.429 | 0.378 | 0.045 | 0.467 | 0.696 | 0.683 | 0.200 | 0.533 |
| GPT3.5+PyPDF | 0.294 | 0.270 | 0.002 | 0.422 | 0.463 | 0.433 | 0.133 | 0.333 |
| Claude3-PyPDF | 0.278 | 0.214 | 0.010 | 0.489 | 0.298 | 0.667 | 0.000 | 0.267 |
| Qwen-api | 0.261 | 0.214 | 0.079 | 0.511 | 0.270 | 0.417 | 0.133 | 0.200 |
| Moonshot | 0.347 | 0.240 | 0.018 | 0.489 | 0.313 | 0.500 | 0.333 | 0.533 |
| Skylark-PyPDF | 0.233 | 0.193 | 0.015 | 0.444 | 0.081 | 0.433 | 0.067 | 0.400 |
| Metric | average | accuracy_value | score | accuracy | score | score | accuracy | accuracy |

Table 4: Biomedicine Performers

| Model | Biomedicine | Bio Chart QA | Chemical Entities Recognition | Compound Disease Recognition | Disease Entities Recognition | Gene Disease Function | Gene Disease Regulation | MedMCQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Uni-Smart Pro | 0.804 | 0.533 | 0.890 | 0.942 | 0.714 | 0.924 | 0.911 | 0.717 |
| Uni-Smart Nano | 0.697 | 0.467 | 0.314 | 0.969 | 0.353 | 0.957 | 0.948 | 0.870 |
| GPT4+PyPDF | 0.779 | 0.400 | 0.919 | 0.936 | 0.726 | 0.932 | 0.834 | 0.703 |
| GPT3.5+PyPDF | 0.640 | 0.200 | 0.636 | 0.943 | 0.568 | 0.915 | 0.693 | 0.527 |
| Claude3-PyPDF | 0.701 | 0.400 | 0.827 | 0.918 | 0.524 | 0.929 | 0.803 | 0.507 |
| Qwen-api | 0.611 | 0.267 | 0.891 | 0.859 | 0.582 | 0.906 | 0.706 | 0.067 |
| Moonshot | 0.730 | 0.467 | 0.776 | 0.929 | 0.667 | 0.914 | 0.799 | 0.560 |
| Skylark-PyPDF | 0.475 | 0.400 | 0.541 | 0.826 | 0.373 | 0.856 | 0.327 | 0.000 |
| Metric | average | accuracy | value_recall | value_recall | value_recall | value_recall | value_recall | accuracy |

Table 5: Alloy Materials Performers

| Model | Alloy Materials | Alloy Sample Differentiation | Alloy Composition Extraction | Alloy Temperature Extraction | Alloy Treatment Sequence | Alloy Chart QA |
| --- | --- | --- | --- | --- | --- | --- |
| Uni-Smart Pro | 0.661 | 0.824 | 0.451 | 0.840 | 0.792 | 0.400 |
| Uni-Smart Nano | 0.446 | 0.294 | 0.427 | 0.500 | 0.541 | 0.467 |
| GPT4+PyPDF | 0.527 | 0.431 | 0.449 | 0.540 | 0.750 | 0.467 |
| GPT3.5+PyPDF | 0.359 | 0.137 | 0.417 | 0.200 | 0.708 | 0.333 |
| Claude3-PyPDF | 0.367 | 0.157 | 0.420 | 0.300 | 0.625 | 0.333 |
| Qwen-api | 0.341 | 0.137 | 0.301 | 0.360 | 0.708 | 0.200 |
| Moonshot | 0.582 | 0.725 | 0.451 | 0.760 | 0.708 | 0.267 |
| Skylark-PyPDF | 0.300 | 0.216 | 0.358 | 0.200 | 0.458 | 0.267 |
| Metric | average | accuracy | accuracy_value | accuracy | accuracy | accuracy |

Table 6: General Chemistry Performers

| Model | General Chemistry | Balance Chemical Equation | MMLU Chemistry |
| --- | --- | --- | --- |
| Uni-Smart Pro | 0.551 | 0.470 | 0.633 |
| Uni-Smart Nano | 0.484 | 0.400 | 0.567 |
| GPT4+PyPDF | 0.550 | 0.400 | 0.700 |
| GPT3.5+PyPDF | 0.349 | 0.330 | 0.367 |
| Claude3-PyPDF | 0.308 | 0.350 | 0.267 |
| Qwen-api | 0.514 | 0.360 | 0.667 |
| Moonshot | 0.479 | 0.290 | 0.667 |
| Skylark-PyPDF | 0.436 | 0.340 | 0.533 |
| Metric | average | accuracy | accuracy |
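
The Metric row labels the General Chemistry column as "average". The values in Table 6 are consistent with an unweighted mean of the two task columns, up to rounding at three decimals. That reading is inferred from the table and is not confirmed by the benchmark's own evaluation code. The sketch below checks it; the names `TASK_SCORES`, `REPORTED`, `domain_average`, and `max_deviation` are hypothetical and exist only for this illustration.

```python
# Task-level scores copied from the two task columns of Table 6.
TASK_SCORES = {
    "Uni-Smart Pro": (0.470, 0.633),
    "Uni-Smart Nano": (0.400, 0.567),
    "GPT4+PyPDF": (0.400, 0.700),
    "GPT3.5+PyPDF": (0.330, 0.367),
    "Claude3-PyPDF": (0.350, 0.267),
    "Qwen-api": (0.360, 0.667),
    "Moonshot": (0.290, 0.667),
    "Skylark-PyPDF": (0.340, 0.533),
}

# Domain-level General Chemistry values as reported in Table 6.
REPORTED = {
    "Uni-Smart Pro": 0.551,
    "Uni-Smart Nano": 0.484,
    "GPT4+PyPDF": 0.550,
    "GPT3.5+PyPDF": 0.349,
    "Claude3-PyPDF": 0.308,
    "Qwen-api": 0.514,
    "Moonshot": 0.479,
    "Skylark-PyPDF": 0.436,
}


def domain_average(task_scores):
    """Unweighted mean of the task scores (assumed aggregation rule)."""
    return sum(task_scores) / len(task_scores)


def max_deviation(task_scores_by_model, reported_by_model):
    """Largest absolute gap between the recomputed and reported domain score."""
    return max(
        abs(domain_average(task_scores_by_model[model]) - reported_by_model[model])
        for model in reported_by_model
    )


if __name__ == "__main__":
    for model, scores in TASK_SCORES.items():
        print(f"{model}: computed {domain_average(scores):.4f}, "
              f"reported {REPORTED[model]:.3f}")
    print(f"max deviation: {max_deviation(TASK_SCORES, REPORTED):.4f}")
```

The recomputed values use task scores that are already rounded to three decimals, so small gaps against the reported column are expected and do not by themselves contradict the assumed rule.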

Table 7: Organic Materials Performers

| Model | Organic Materials | Solubility Extraction | OLED Property Extraction | Polymer Property Extraction | Polymer Composition Extraction | Polymer Chart QA | Electrolyte Table QA | Reaction Mechanism QA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Uni-Smart Pro | 0.590 | 0.371 | 0.303 | 0.739 | 0.733 | 0.733 | 0.750 | 0.500 |
| Uni-Smart Nano | 0.590 | 0.390 | 0.362 | 0.800 | 0.600 | 0.800 | 0.542 | 0.636 |
| GPT4+PyPDF | 0.525 | 0.434 | 0.356 | 0.818 | 0.467 | 0.600 | 0.542 | 0.455 |
| GPT3.5+PyPDF | 0.281 | 0.367 | 0.345 | 0.493 | 0.133 | 0.267 | 0.229 | 0.136 |
| Claude3-PyPDF | 0.298 | 0.327 | 0.357 | 0.349 | 0.400 | 0.200 | 0.229 | 0.227 |
| Qwen-api | 0.365 | 0.347 | 0.363 | 0.568 | 0.133 | 0.333 | 0.354 | 0.455 |
| Moonshot | 0.520 | 0.248 | 0.307 | 0.682 | 0.933 | 0.600 | 0.458 | 0.409 |
| Skylark-PyPDF | 0.224 | 0.182 | 0.244 | 0.398 | 0.000 | 0.133 | 0.292 | 0.318 |
| Metric | average | accuracy_value | accuracy_value | accuracy_value | accuracy | accuracy | accuracy | accuracy |