Large language models (LLMs) and large multimodal models (LMMs) promise to accelerate biomedical discovery, yet their reliability remains unclear. We introduce ARIEL (AI Research Assistant for Expert-in-the-Loop Learning), an open-source evaluation and optimization framework that pairs a curated multimodal biomedical corpus with expert-vetted tasks to probe two capabilities: full-length article summarization and fine-grained figure interpretation. Using uniform protocols and blinded PhD-level evaluation, we find that state-of-the-art models generate fluent but incomplete summaries, while LMMs struggle with detailed visual reasoning. We further show that prompt engineering and lightweight fine-tuning substantially improve textual coverage, and that an inference-time compute-scaling strategy enhances visual question answering. Finally, we build an ARIEL agent that integrates textual and visual cues and show that it can propose testable mechanistic hypotheses. ARIEL delineates the current strengths and limitations of foundation models and provides a reproducible platform for advancing trustworthy AI in biomedicine.