Domain-specific LLMs for traditional Chinese medicine (TCM) face limitations in research settings owing to constrained adaptability, insufficient evaluation datasets, and limited computational resources. This study presents TianHui, a specialized TCM LLM built through contextual data integration and domain knowledge fusion. We constructed a large-scale TCM corpus (0.97 GB of unsupervised data plus 611,312 QA pairs) and employed a two-stage training strategy combining QLoRA, DeepSpeed ZeRO Stage 2, and Flash Attention 2. Evaluation on 12 benchmarks showed that TianHui ranked in the top three on all metrics for six datasets (APQ, TCMCD, HFR, HCCA, DHPE, TLAW) and achieved the best results on the remaining six (TCMEE, APR, GCPMI, TCMKQA, TCMRC, ADTG). The optimal configuration was LoRA rank = 128, alpha = 256, epochs = 4, dropout = 0.2, and maximum sequence length = 2048. TianHui enables the systematic preservation and scalable application of TCM knowledge. All resources are open-sourced.
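For concreteness, the sketch below shows how the reported fine-tuning configuration could be expressed with Hugging Face `transformers` and `peft`. It is a minimal illustration, not the authors' released training code: the base-model path, the LoRA `target_modules`, the output directory, and the DeepSpeed config filename are assumed placeholders, while the QLoRA quantization settings, the Flash Attention 2 flag, and the hyperparameters (rank 128, alpha 256, dropout 0.2, 4 epochs, max length 2048) mirror the abstract.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE_MODEL = "path/to/base-model"  # placeholder; the base model is not named in this excerpt

# QLoRA: load the frozen base model in 4-bit NF4 with bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # Flash Attention 2
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA hyperparameters matching the optimal configuration reported in the abstract.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.2,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
# Inputs would be truncated to the reported maximum length of 2048 tokens, e.g.:
# tokenizer(text, truncation=True, max_length=2048)

# 4 training epochs with DeepSpeed ZeRO Stage 2 (config filename assumed).
training_args = TrainingArguments(
    output_dir="tianhui-sft",
    num_train_epochs=4,
    bf16=True,
    deepspeed="ds_zero2_config.json",
)
```

The same scaffold would presumably serve both stages of the two-stage strategy, paired first with the 0.97 GB unsupervised corpus and then with the 611,312 QA pairs, though the abstract does not detail the stage split.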