We introduce Humains-Junior, a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a $\pm 5$ pp equivalence margin.

Results. On Q1--Q500 under identical judges, GPT-4o scores 73.5% (95% CI 69.5--77.2) and Humains-Junior 72.7% (95% CI 68.7--76.5); the paired difference is 0.8 pp (bootstrap 95% CI $-3.1$ to $+4.7$; permutation $p = 0.72$; Cohen's $d = 0.023$). TOST establishes equivalence at $\pm 5$ pp (but not at $\pm 3$ pp). When purchased as managed APIs, Humains-Junior's base model (Phi-3.5-mini-instruct) is $\approx 19\times$ less expensive than GPT-4o at Microsoft AI Foundry pricing; self-hosted or edge deployments can drive incremental inference cost toward zero. Measured and estimated pricing sources are tabulated in Appendix E.

Method. Our approach combines minimal directed "Exoskeleton Reasoning" scaffolds with behavioral fine-tuning that teaches protocol compliance (epistemic discipline) rather than domain answers. Fine-tuning alone adds little; combined, the two synergize (+17.7 pp, $p < 0.001$) and reduce variance by $\approx 25\%$. In prompt-only settings on frontier models (Q1--Q100; non-comparable), directed reasoning improved GPT-4o by +11.8 pp to 85.3% and Gemini-2.5-Pro by +5.0 pp to 93.3% (baseline 88.3%, $n = 100$); see Section~5.

TL;DR. A 3.8B model achieves GPT-4o-level FACTS accuracy (equivalent within $\pm 5$ pp on Q1--Q500). Cloud pricing shows $\approx 19\times$ lower cost than GPT-4o, and self-hosted/edge deployments can approach zero marginal cost. Pricing sources are listed in Appendix E. Frontier prompt-only gains (Q1--Q100; non-comparable) and optimized-prompt exploratory results under earlier judges are summarized in Appendix F.

Keywords: Small Language Models, Factual Grounding, Directed Reasoning, Fine-Tuning, Model Alignment, Cost-Efficient AI
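As an illustration of the equivalence analysis reported above, the sketch below (not the authors' released code) shows a paired bootstrap confidence interval, a sign-flip permutation test, and a TOST-style check against the $\pm 5$ pp margin, using the convention that equivalence at $\alpha = 0.05$ holds when the 90% CI of the paired difference lies inside the margin. The per-example binary grounding scores are simulated and all names are hypothetical.

\begin{verbatim}
# Illustrative sketch only: paired bootstrap CI, sign-flip permutation test,
# and TOST-style equivalence check on per-example scores for Q1-Q500.
# The score vectors below are simulated; names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical per-example grounding scores (1 = fully grounded).
gpt4o  = rng.binomial(1, 0.735, n).astype(float)
junior = rng.binomial(1, 0.727, n).astype(float)

diff = junior - gpt4o              # paired per-example differences
point = diff.mean() * 100          # paired difference in percentage points

# Paired bootstrap: resample examples with replacement, keep mean difference.
boot = np.array([rng.choice(diff, n, replace=True).mean()
                 for _ in range(10_000)]) * 100
ci_lo, ci_hi = np.percentile(boot, [2.5, 97.5])   # two-sided 95% CI

# Sign-flip permutation test for the paired mean difference.
flips = rng.choice([-1.0, 1.0], size=(10_000, n))
perm = (flips * diff).mean(axis=1) * 100
p_perm = (np.abs(perm) >= abs(point)).mean()

# TOST-style check: equivalent within +/- 5 pp if the 90% CI of the
# difference lies entirely inside (-5, +5).
tost_lo, tost_hi = np.percentile(boot, [5.0, 95.0])
equivalent_5pp = (tost_lo > -5.0) and (tost_hi < 5.0)

print(f"diff = {point:.1f} pp, 95% CI [{ci_lo:.1f}, {ci_hi:.1f}], "
      f"permutation p = {p_perm:.2f}, equivalent within +/-5 pp: {equivalent_5pp}")
\end{verbatim}

With the authors' actual per-example judge scores in place of the simulated vectors, the same procedure would yield the reported paired difference, bootstrap CI, permutation $p$-value, and $\pm 5$ pp equivalence verdict.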