Abstract

Recent breakthroughs in Large Language Models (LLMs) have revolutionized natural language understanding and generation, igniting a surge of interest in applying these technologies to the nuanced field of scientific literature analysis. Existing benchmarks, however, inadequately evaluate the proficiency of LLMs in the scientific domain, especially in scenarios involving complex comprehension and multimodal data. In response, we introduce SciAssess, a benchmark tailored for the in-depth analysis of scientific literature, designed to provide a thorough assessment of LLMs' efficacy. SciAssess focuses on evaluating LLMs' abilities in memorization, comprehension, and analysis within scientific contexts, with representative tasks from diverse scientific fields such as general chemistry, organic materials, and alloy materials. Rigorous quality control measures ensure its reliability in terms of correctness, anonymization, and copyright compliance. SciAssess evaluates leading LLMs, including GPT-4, GPT-3.5-turbo, and Gemini, identifying their strengths and areas for improvement and supporting the ongoing development of LLM applications in scientific literature analysis. SciAssess and its resources are available at https://sci-assess.github.io/, offering a valuable tool for advancing LLM capabilities in scientific literature analysis.

Benchmark Details

SciAssess evaluates LLMs' abilities at three levels (see the illustrative sketch after this list):

  • L1 (Memorization): The model's ability to accurately answer common scientific factual questions from its own knowledge, without consulting external sources.
  • L2 (Comprehension): The ability to precisely identify and extract key information and facts from a given text.
  • L3 (Analysis and Reasoning): The model's capability to combine extracted information with its existing knowledge for logical reasoning and analysis.

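To make these levels concrete, here is a minimal, purely illustrative Python sketch of what a task item at each level might look like. The prompts, field names, and structure are invented for explanation and are not actual SciAssess benchmark items or part of its codebase.

    # Illustrative only: these prompts and fields are invented for explanation
    # and are not actual SciAssess benchmark items.
    EXAMPLE_ITEMS = [
        {   # L1 (Memorization): answered from the model's own knowledge, no context given
            "level": "L1",
            "context": None,
            "prompt": "What is the molecular formula of benzene?",
        },
        {   # L2 (Comprehension): extract a fact stated in the provided excerpt
            "level": "L2",
            "context": "<excerpt from an alloy paper>",
            "prompt": "Extract the tensile strength reported for the annealed sample.",
        },
        {   # L3 (Analysis and Reasoning): combine extracted values with prior knowledge
            "level": "L3",
            "context": "<excerpt with solubility measurements>",
            "prompt": "Based on the reported solubilities, which electrolyte is better suited to low-temperature operation, and why?",
        },
    ]
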
It covers a wide range of scientific fields and tasks, including but not limited to:

  • General Chemistry: MMLU High-School Chemistry, Abstract2Title, Balancing Equations
  • Alloy Materials: Composition Extraction, Alloy ChartQA, Sample Differentiation
  • Organic Materials: Electrolyte Solubility Data Extraction, Reaction Mechanism QA, Polymer Property Extraction
  • Drug Discovery: Affinity Data Extraction, Tag to Molecule, Drug ChartQA
  • Biology: MedMCQA, CompDisease Recognition, Biology ChartQA

For a complete list of domains and tasks, please refer to our paper.
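
The domains and tasks above pair naturally with the three ability levels. The following short Python sketch shows one possible way to organize tasks by domain and level and to aggregate per-task scores into per-level averages; the class and function names, and the level labels shown, are hypothetical illustrations rather than the official SciAssess implementation.

    # Hypothetical sketch: Task, TASKS, and aggregate_by_level are illustrative
    # names, not part of the official SciAssess codebase; the level labels below
    # are placeholders rather than the paper's official assignments.
    from collections import defaultdict
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class Task:
        domain: str   # e.g. "Alloy Materials"
        name: str     # e.g. "Composition Extraction"
        level: str    # "L1", "L2", or "L3"

    TASKS: List[Task] = [
        Task("General Chemistry", "MMLU High-School Chemistry", "L1"),
        Task("Alloy Materials", "Composition Extraction", "L2"),
        Task("Drug Discovery", "Tag to Molecule", "L3"),
    ]

    def aggregate_by_level(tasks: List[Task],
                           score_fn: Callable[[Task], float]) -> Dict[str, float]:
        """Average per-task scores within each ability level."""
        buckets: Dict[str, List[float]] = defaultdict(list)
        for task in tasks:
            buckets[task.level].append(score_fn(task))
        return {level: sum(scores) / len(scores)
                for level, scores in sorted(buckets.items())}

    # Example: replace the placeholder scorer with one that actually runs a model
    # on the task's items and computes task-specific metrics.
    print(aggregate_by_level(TASKS, lambda task: 0.0))

Grouping results by level in this way makes it easy to see, for instance, whether a model that is strong on memorization (L1) falls short on comprehension (L2) or reasoning (L3) tasks.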

Contributing

We welcome contributions to the SciAssess benchmark. If you have any suggestions or improvements, please feel free to open an issue or create a pull request on our GitHub repository.

Citation

If you use SciAssess in your research, please cite our paper:

@misc{cai2024sciassess,
      title={SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis},
      author={Hengxing Cai and Xiaochen Cai and Junhan Chang and Sihang Li and Lin Yao and Changxin Wang and Zhifeng Gao and Yongge Li and Mujie Lin and Shuwen Yang and Jiankun Wang and Yuqi Yin and Yaqi Li and Linfeng Zhang and Guolin Ke},