Journal of Civil Aviation University of China ›› 2025, Vol. 43 ›› Issue (6): 88-96.

• Artificial Intelligence •

Chinese civil aviation national civil servant exam assessment using large language models

  1. University of Liverpool, Liverpool L697ZX, UK; 2a. College of Artificial Intelligence; 2b. College of Engineering Physics, Shenzhen Technology University, Shenzhen 518118, Guangdong, China; 3. Xiamen University Malaysia, Kuala Lumpur 43900, Malaysia; 4. The Chinese University of Hong Kong (Shenzhen), Shenzhen 518172, Guangdong, China
  • Received: 2025-09-10; Revised: 2025-11-20; Online: 2025-12-20; Published: 2026-01-10
  • About the author: YANG Kaijie (2004— ), male, from Tangshan, Hebei, undergraduate student; research interests: artificial intelligence and large-model evaluation.
  • Supported by: Shenzhen Stable Support Program for Higher Education Institutions (20231127194506001); Guangdong Province University Innovation Project (2024KTSCX055)

Chinese civil aviation national civil servant exam assessment using large language models

  1. University of Liverpool, Liverpool L697ZX, UK; 2a. College of Artificial Intelligence; 2b. College of Engineering Physics, Shenzhen Technology University, Shenzhen 518118, Guangdong, China; 3. Xiamen University Malaysia, Kuala Lumpur 43900, Malaysia; 4. The Chinese University of Hong Kong (Shenzhen), Shenzhen 518172, Guangdong, China
  • Received: 2025-09-10; Revised: 2025-11-20; Online: 2025-12-20; Published: 2026-01-10

Abstract:

This paper systematically evaluates and analyzes the test-taking performance of five large language models (LLMs), including ChatGPT-4o, on the Chinese civil aviation national civil servant exam (NCSE). Using past NCSE papers from 2022 to 2024, each question was submitted to the five LLMs under a predefined standardized prompting format, their outputs were recorded, and their answer accuracy was computed as a measure of overall capability. Experimental results show that DeepSeek-V3, DeepSeek-R1, ChatGPT-4o, Gemini-1.5 Flash, and ERNIE Bot-4.0 Turbo achieved total scores of 145.20, 145.41, 127.47, 107.56, and 86.40, respectively; all except ERNIE Bot-4.0 Turbo exceeded the human candidates' average score of 93.50, and DeepSeek-V3 and DeepSeek-R1 reached the NCSE high-score band. In addition, the paper discusses the strengths and weaknesses of the five LLMs, compares their performance across question types (common-sense judgment, verbal comprehension and expression, quantitative relations, judgment and reasoning, and data analysis), and summarizes typical error types on questions requiring complex logical reasoning and multi-step calculation.

Keywords:

Abstract:

This paper systematically evaluates and analyzes the performance of five large language models (LLMs), including ChatGPT-4o, on the Chinese national civil servant exam (NCSE). The study selected NCSE past exam papers from 2022 to 2024, input the questions into the five LLMs using a predefined standardized query format, recorded their outputs, and calculated their accuracy rates to assess their overall capabilities. The experimental results show that the total scores for DeepSeek-V3, DeepSeek-R1, ChatGPT-4o, Gemini-1.5 Flash, and ERNIE Bot-4.0 Turbo are 145.20, 145.41, 127.47, 107.56, and 86.40, respectively; except for ERNIE Bot-4.0 Turbo, all models scored higher than the average human candidate score of 93.50. Among them, DeepSeek-V3 and DeepSeek-R1 achieved scores within the high-score range of the NCSE. Furthermore, this paper delves into the strengths and weaknesses of the five LLMs, provides a detailed comparison of their performance across question types such as common-sense judgment, verbal comprehension and expression, quantitative relations, judgment and reasoning, and data analysis, and summarizes typical error patterns in handling complex logical reasoning and multi-step calculation problems.
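The evaluation pipeline described above (standardized prompt, recorded outputs, accuracy over an answer key) can be sketched as follows. This is a minimal illustration, not the authors' code: the prompt template, question IDs, and answers below are hypothetical assumptions, and the paper's actual section weights and scoring rules are not reproduced here.

```python
# Hedged sketch of the scoring step: compare each model's recorded option
# letter against the official answer key and report the fraction correct.

# Hypothetical standardized query format (assumption, not from the paper).
PROMPT_TEMPLATE = (
    "You are taking the Chinese national civil servant exam.\n"
    "Answer with a single option letter (A/B/C/D).\n\n"
    "Question: {question}"
)

def accuracy(model_answers: dict, answer_key: dict) -> float:
    """Fraction of questions where the model's option matches the key.

    Missing answers count as wrong; comparison is case-insensitive.
    """
    correct = sum(
        model_answers.get(qid, "").strip().upper() == key
        for qid, key in answer_key.items()
    )
    return correct / len(answer_key)

# Toy example with three made-up questions and answers:
key = {"q1": "A", "q2": "C", "q3": "B"}
answers = {"q1": "A", "q2": "c", "q3": "D"}
print(round(accuracy(answers, key), 2))  # 2 of 3 correct -> 0.67
```

In practice, each question would first be formatted with `PROMPT_TEMPLATE`, sent to the model's API, and the returned option letter parsed from the response before scoring; per-section accuracies would then be weighted into a total exam score.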

Key words:

CLC number: