[1] KUNG T H, CHEATHAM M, MEDENILLA A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models[J]. PLoS Digital Health, 2023, 2(2): e0000198.
[2] ZHOU M, DUAN N, LIU S, et al. Progress in neural NLP: modeling, learning, and reasoning[J]. Engineering, 2020, 6(3): 275-290.
[3] OTTER D W, MEDINA J R, KALITA J K. A survey of the usages of deep learning for natural language processing[J]. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(2): 604-624.
[4] KATZ D M, BOMMARITO M J, GAO S, et al. GPT-4 passes the bar exam[J]. Philosophical Transactions of the Royal Society A, 2024, 382(2270): 20230254.
[5] SKALIDIS I, CAGNINA A, LUANGPHIPHAT W, et al. ChatGPT takes on the European Exam in core cardiology: an artificial intelligence success story?[J]. European Heart Journal-Digital Health, 2023, 4(3): 279-281.
[6] TSOUTSANIS P, TSOUTSANIS A. Evaluation of large language model performance on the multi-specialty recruitment assessment (MSRA) exam[J]. Computers in Biology and Medicine, 2024, 168: 107794.
[7] XU L, CONG X, WANG R, et al. Performance of the large language models on the Chinese national nurse licensure examination: cross-sectional evaluation study[J]. JMIR Medical Informatics, 2025, 13: e78279.
[8] HONG M, NG W, ZHANG C J, et al. QualBench: benchmarking Chinese LLM with localized professional qualifications for vertical domain evaluation[EB/OL]. (2025-09-03)[2025-11-22]. http://arxiv.org/abs/2505.05225.
[9] ZHAO R Z, QU Z C, CHEN G Y, et al. Research progress on evaluation techniques for large language models[J]. Journal of Data Acquisition and Processing, 2024, 39(3): 502-523.
[10] OUYANG L, WU J, JIANG X, et al. Training language models to follow instructions with human feedback[EB/OL]. (2022-05-04)[2025-11-22]. http://arxiv.org/abs/2203.02155.
[11] HUANG Y, BAI Y, ZHU Z, et al. C-EVAL: a multi-level multi-discipline Chinese evaluation suite for foundation models[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc., 2023: 62991-63010.
[12] LI H, ZHANG Y, KOTO F, et al. CMMLU: measuring massive multitask language understanding in Chinese[C]//Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics, 2024: 11260-11285.
[13] ARTIFICIAL ANALYSIS. LLM leaderboard: compare GPT-4o, Llama 3, Mistral, Gemini & other models[EB/OL]. [2025-04-16]. https://artificialanalysis.ai/leaderboards/models.
[14] HURST A, LERER A, et al. GPT-4o system card[EB/OL]. (2024-10-25)[2025-04-17]. http://arxiv.org/abs/2410.21276.
[15] LIU A X, FENG B, XUE B, et al. DeepSeek-V3 technical report[EB/OL]. (2025-02-18)[2025-04-16]. http://arxiv.org/abs/2412.19437.
[16] GUO D Y, YANG D J, ZHANG H W, et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning[EB/OL]. (2025-01-22)[2025-04-17]. http://arxiv.org/abs/2501.12948.
[17] SUN Y, WANG S H, LI Y K, et al. ERNIE: enhanced representation through knowledge integration[EB/OL]. (2019-04-19)[2025-04-17]. http://arxiv.org/abs/1904.09223.
[18] GEORGIEV P, LEI V I, BURNELL R, et al. Gemini 1.5: unlocking multimodal understanding across millions of tokens of context[EB/OL]. (2024-12-16)[2025-04-17]. http://arxiv.org/abs/2403.05530.
[19] MU L L, WANG X Y, CUI J J. Evaluation of large language models for Chinese text error correction tasks[C]//Proceedings of the 23rd Chinese National Conference on Computational Linguistics, July 25-28, 2024, Taiyuan, China. Beijing: Chinese Information Processing Society of China, 2024: 790-806.
[20] SAHOO P, SINGH A K, SAHA S, et al. A systematic survey of prompt engineering in large language models: techniques and applications[EB/OL]. (2025-03-16)[2025-04-17]. http://arxiv.org/abs/2402.07927.