Large Language Models (LLMs) challenge the validity of traditional open-ended assessments by blurring the lines of authorship. While recent research has focused on the accuracy of automated essay scoring (AES), such static approaches fail to capture process evidence or verify genuine student understanding. This paper introduces a novel Human-AI Collaboration framework that enhances assessment integrity by combining rubric-based automated scoring with AI-generated, targeted follow-up questions. In a pilot study with university instructors (N=9), we demonstrate that while Stage 1 (Auto-Scoring) ensures procedural fairness and consistency, Stage 2 (Interactive Verification) is essential for construct validity, effectively diagnosing superficial reasoning and unverified AI use. We report on the system's design, instructor perceptions of fairness versus validity, and the necessity of adaptive difficulty in follow-up questioning. The findings offer a scalable pathway toward authentic assessment that moves beyond policing AI to integrating it as a synergistic partner in the evaluation process.