[1] KUNG T H, CHEATHAM M, MEDENILLA A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models[J]. PLoS Digital Health, 2023, 2(2): e0000198.
[2] ZHOU M, DUAN N, LIU S, et al. Progress in neural NLP: modeling, learning, and reasoning[J]. Engineering, 2020, 6(3): 275-290.
[3] OTTER D W, MEDINA J R, KALITA J K. A survey of the usages of deep learning for natural language processing[J]. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(2): 604-624.
[4] KATZ D M, BOMMARITO M J, GAO S, et al. GPT-4 passes the bar exam[J]. Philosophical Transactions of the Royal Society A, 2024, 382(2270): 20230254.
[5] SKALIDIS I, CAGNINA A, LUANGPHIPHAT W, et al. ChatGPT takes on the European Exam in core cardiology: an artificial intelligence success story?[J]. European Heart Journal-Digital Health, 2023, 4(3): 279-281.
[6] TSOUTSANIS P, TSOUTSANIS A. Evaluation of large language model performance on the multi-specialty recruitment assessment (MSRA) exam[J]. Computers in Biology and Medicine, 2024, 168: 107794.
[7] XU L, CONG X, WANG R, et al. Performance of the large language models on the Chinese national nurse licensure examination: cross-sectional evaluation study[J]. JMIR Medical Informatics, 2025, 13: e78279.
[8] HONG M, NG W, ZHANG C J, et al. QualBench: benchmarking Chinese LLM with localized professional qualifications for vertical domain evaluation[EB/OL]. (2025-09-03)[2025-11-22]. http://arxiv.org/abs/2505.05225.
[9] ZHAO R Z, QU Z C, CHEN G Y, et al. Research progress on evaluation techniques for large language models[J]. Journal of Data Acquisition and Processing, 2024, 39(3): 502-523.
[10] OUYANG L, WU J, JIANG X, et al. Training language models to follow instructions with human feedback[EB/OL]. (2022-05-04)[2025-11-22]. http://arxiv.org/abs/2203.02155.
[11] HUANG Y, BAI Y, ZHU Z, et al. C-EVAL: a multi-level multi-discipline Chinese evaluation suite for foundation models[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc., 2023: 62991-63010.
[12] LI H, ZHANG Y, KOTO F, et al. CMMLU: measuring massive multitask language understanding in Chinese[C]//Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics, 2024: 11260-11285.
[13] ARTIFICIAL ANALYSIS. LLM leaderboard: compare GPT-4o, Llama 3, Mistral, Gemini & other models[EB/OL]. [2025-04-16]. https://artificialanalysis.ai/leaderboards/models.
[14] HURST A, LERER A, et al. GPT-4o system card[EB/OL]. (2024-10-25)[2025-04-17]. http://arxiv.org/abs/2410.21276.
[15] LIU A X, FENG B, XUE B, et al. DeepSeek-V3 technical report[EB/OL]. (2025-02-18)[2025-04-16]. http://arxiv.org/abs/2412.19437.
[16] GUO D Y, YANG D J, ZHANG H W, et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning[EB/OL]. (2025-01-22)[2025-04-17]. http://arxiv.org/abs/2501.12948.
[17] SUN Y, WANG S H, LI Y K, et al. ERNIE: enhanced representation through knowledge integration[EB/OL]. (2019-04-19)[2025-04-17]. http://arxiv.org/abs/1904.09223.
[18] GEORGIEV P, LEI V I, BURNELL R, et al. Gemini 1.5: unlocking multimodal understanding across millions of tokens of context[EB/OL]. (2024-12-16)[2025-04-17]. http://arxiv.org/abs/2403.05530.
[19] MU L L, WANG X Y, CUI J J. Evaluation of large language models for Chinese text error correction tasks[C]//Proceedings of the 23rd Chinese National Conference on Computational Linguistics, July 25-28, 2024, Taiyuan, China. Beijing: Chinese Information Processing Society of China, 2024: 790-806.
[20] SAHOO P, SINGH A K, SAHA S, et al. A systematic survey of prompt engineering in large language models: techniques and applications[EB/OL]. (2025-03-16)[2025-04-17]. http://arxiv.org/abs/2402.07927.