Journal of Civil Aviation University of China ›› 2025, Vol. 43 ›› Issue (6): 88-96.

• Artificial Intelligence •

Chinese civil aviation national civil servant exam assessment using large language models

  1. University of Liverpool, Liverpool L697ZX, UK; 2a. College of Artificial Intelligence; 2b. College of Engineering Physics, Shenzhen Technology University, Shenzhen 518118, Guangdong, China; 3. Xiamen University Malaysia, Kuala Lumpur 43900, Malaysia; 4. The Chinese University of Hong Kong (Shenzhen), Shenzhen 518172, Guangdong, China
  • Received: 2025-09-10; Revised: 2025-11-20; Online: 2025-12-20; Published: 2026-01-10
  • About the author: YANG Kaijie (2004— ), male, from Tangshan, Hebei, undergraduate student; research interests: artificial intelligence and large-model evaluation.
  • Supported by: Shenzhen Stable Support Program for Higher Education Institutions (20231127194506001); Guangdong Province University Innovation Project (2024KTSCX055)

Chinese civil aviation national civil servant exam assessment using large language models

  1. University of Liverpool, Liverpool L697ZX, UK; 2a. College of Artificial Intelligence; 2b. College of Engineering Physics, Shenzhen Technology University, Shenzhen 518118, Guangdong, China; 3. Xiamen University Malaysia, Kuala Lumpur 43900, Malaysia; 4. The Chinese University of Hong Kong (Shenzhen), Shenzhen 518172, Guangdong, China
  • Received: 2025-09-10; Revised: 2025-11-20; Online: 2025-12-20; Published: 2026-01-10

Abstract:

This paper systematically evaluates and analyzes the test-taking performance of five large language models (LLMs), including ChatGPT-4o, on the Chinese civil aviation national civil servant exam (NCSE). Using past NCSE papers from 2022 to 2024, each question was submitted to the five LLMs under a predefined standardized prompting format, their outputs were recorded, and their answer accuracy was computed as a measure of overall capability. Experimental results show that DeepSeek-V3, DeepSeek-R1, ChatGPT-4o, Gemini-1.5 Flash, and ERNIE Bot-4.0 Turbo achieved total scores of 145.20, 145.41, 127.47, 107.56, and 86.40, respectively; all except ERNIE Bot-4.0 Turbo exceeded the human candidates' average score of 93.50, and DeepSeek-V3 and DeepSeek-R1 reached the NCSE high-score band. In addition, the paper discusses the strengths and weaknesses of the five LLMs, compares their performance across question types (common-sense judgment, verbal comprehension and expression, quantitative relations, judgment and reasoning, and data analysis), and summarizes typical error types on questions requiring complex logical reasoning and multi-step calculation.

Keywords:

Abstract:

This paper systematically evaluates and analyzes the performance of five large language models (LLMs), including ChatGPT-4o, on the Chinese national civil servant exam (NCSE). The study selected NCSE past exam papers from 2022 to 2024, input the questions into the five LLMs using a predefined standardized query format, recorded their outputs, and calculated their accuracy rates to assess their overall capabilities. The experimental results show that the total scores for DeepSeek-V3, DeepSeek-R1, ChatGPT-4o, Gemini-1.5 Flash, and ERNIE Bot-4.0 Turbo are 145.20, 145.41, 127.47, 107.56, and 86.40, respectively; except for ERNIE Bot-4.0 Turbo, all models scored higher than the average human candidate score of 93.50. Among them, DeepSeek-V3 and DeepSeek-R1 achieved scores within the high-score range of the NCSE. Furthermore, this paper delves into the strengths and weaknesses of the five LLMs, provides a detailed comparison of their performance across question types such as common-sense judgment, verbal comprehension and expression, quantitative relations, judgment and reasoning, and data analysis, and summarizes typical error patterns in handling complex logical reasoning and multi-step calculation problems.
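The evaluation pipeline described above (standardized prompt, recorded outputs, accuracy over an answer key) can be sketched as follows. This is a minimal illustration, not the authors' code: the prompt template, question IDs, and answers below are hypothetical assumptions, and the paper's actual section weights and scoring rules are not reproduced here.

```python
# Hedged sketch of the scoring step: compare each model's recorded option
# letter against the official answer key and report the fraction correct.

# Hypothetical standardized query format (assumption, not from the paper).
PROMPT_TEMPLATE = (
    "You are taking the Chinese national civil servant exam.\n"
    "Answer with a single option letter (A/B/C/D).\n\n"
    "Question: {question}"
)

def accuracy(model_answers: dict, answer_key: dict) -> float:
    """Fraction of questions where the model's option matches the key.

    Missing answers count as wrong; comparison is case-insensitive.
    """
    correct = sum(
        model_answers.get(qid, "").strip().upper() == key
        for qid, key in answer_key.items()
    )
    return correct / len(answer_key)

# Toy example with three made-up questions and answers:
key = {"q1": "A", "q2": "C", "q3": "B"}
answers = {"q1": "A", "q2": "c", "q3": "D"}
print(round(accuracy(answers, key), 2))  # 2 of 3 correct -> 0.67
```

In practice, each question would first be formatted with `PROMPT_TEMPLATE`, sent to the model's API, and the returned option letter parsed from the response before scoring; per-section accuracies would then be weighted into a total exam score.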

Key words:

CLC number: