This paper systematically evaluates and analyzes the performance of five large language models (LLMs), including ChatGPT-4o, on the Chinese national civil servant exam (NCSE). The study selected NCSE past exam
papers from 2022 to 2024, input the questions into the five LLMs using a predefined standardized query format,
recorded their outputs, and calculated their accuracy rates to assess their overall capabilities. The experimental
results show that the total scores for DeepSeek-V3, DeepSeek-R1, ChatGPT-4o, Gemini-1.5 Flash, and
ERNIE Bot-4.0 Turbo are 145.20, 145.41, 127.47, 107.56, and 86.40, respectively. Except for ERNIE Bot-4.0
Turbo, all models scored higher than the average human candidate score of 93.50. Among them, DeepSeek-V3
and DeepSeek-R1 achieved scores within the high-score range of the NCSE. Furthermore, this paper examines
the strengths and weaknesses of the five LLMs, compares their performance in detail across different question types, such as common sense judgment, verbal comprehension and expression, quantitative relationships, judgment and reasoning, and data analysis, and summarizes typical error patterns in complex logical reasoning and multi-step calculation problems.