Journal of Civil Aviation University of China ›› 2025, Vol. 43 ›› Issue (6): 88-96.

• Artificial intelligence •

Chinese aviation national civil servant exam assessment using large language model

  

  1. University of Liverpool, Liverpool L69 7ZX, UK; 2a. College of Artificial Intelligence; 2b. College of Engineering Physics, Shenzhen
    University of Technology, Shenzhen 518118, Guangdong, China; 3. Xiamen University Malaysia, Kuala Lumpur 43900, Malaysia;
    4. The Chinese University of Hong Kong (Shenzhen), Shenzhen 518172, Guangdong, China
  • Received: 2025-09-10 Revised: 2025-11-20 Online: 2025-12-20 Published: 2026-01-10

Abstract:

This paper systematically evaluates the performance of five large language models (LLMs), including ChatGPT-4o, on the Chinese national civil servant exam (NCSE). Past NCSE papers from 2022 to 2024 were selected, the questions were submitted to the five LLMs in a predefined, standardized query format, the outputs were recorded, and accuracy rates were calculated to assess the models' overall capabilities. The experimental results show that the total scores of DeepSeek-V3, DeepSeek-R1, ChatGPT-4o, Gemini-1.5 Flash, and ERNIE Bot-4.0 Turbo are 145.20, 145.41, 127.47, 107.56, and 86.40, respectively. All models except ERNIE Bot-4.0 Turbo scored higher than the average human candidate score of 93.50, and DeepSeek-V3 and DeepSeek-R1 scored within the high-score range of the NCSE. Furthermore, this paper examines the strengths and weaknesses of the five LLMs, compares their performance in detail across question types (common sense judgment, verbal comprehension and expression, quantitative relationships, judgment and reasoning, and data analysis), and summarizes typical error patterns in complex logical reasoning and multi-step calculation problems.
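
As a rough illustration of the evaluation described above, the following Python sketch computes per-question-type accuracy from recorded answers and lists the models whose reported total exceeds the human average. The total scores and the 93.50 average are taken from the abstract; the function names, the record layout, and the exact-match grading rule are assumptions made for illustration only, since the abstract does not specify how answers were graded or how questions were weighted.

```python
from collections import defaultdict

# Total scores and human average as reported in the abstract.
REPORTED_TOTALS = {
    "DeepSeek-V3": 145.20,
    "DeepSeek-R1": 145.41,
    "ChatGPT-4o": 127.47,
    "Gemini-1.5 Flash": 107.56,
    "ERNIE Bot-4.0 Turbo": 86.40,
}
HUMAN_AVERAGE = 93.50


def accuracy_by_type(records):
    """Return the accuracy rate for each question type.

    `records` is an iterable of (question_type, model_answer, answer_key)
    tuples. This layout is hypothetical, and answers are graded by a
    case-insensitive exact match, which is an assumed grading rule.
    """
    right = defaultdict(int)
    total = defaultdict(int)
    for question_type, model_answer, answer_key in records:
        total[question_type] += 1
        if model_answer.strip().upper() == answer_key.strip().upper():
            right[question_type] += 1
    return {q: right[q] / total[q] for q in total}


def above_human_average(totals, human_average=HUMAN_AVERAGE):
    """Return the names of models scoring above the human average, sorted."""
    return sorted(name for name, score in totals.items() if score > human_average)
```

Calling `above_human_average(REPORTED_TOTALS)` returns every model except ERNIE Bot-4.0 Turbo, matching the comparison stated in the abstract. `accuracy_by_type` would be applied once per model to that model's recorded outputs.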

Key words:

CLC Number: