The rise of Large Language Models (LLMs) has driven progress in reasoning tasks -- from program synthesis to scientific hypothesis generation -- yet their ability to handle ranked preferences and structured algorithms in combinatorial domains remains underexplored. We study matching markets, a core framework behind applications like resource allocation and ride-sharing, which require reconciling individual ranked preferences to ensure stable outcomes. We evaluate several state-of-the-art models on a hierarchy of preference-based reasoning tasks -- ranging from stable-matching generation to instability detection, instability resolution, and fine-grained preference queries -- to systematically expose their logical and algorithmic limitations in handling ranked inputs. Surprisingly, even top-performing models with advanced reasoning struggle to resolve instability in large markets, often failing to identify blocking pairs or execute algorithms iteratively. We further show that parameter-efficient fine-tuning (LoRA) significantly improves performance in small markets, but fails to bring about a similar improvement on large instances, suggesting the need for more sophisticated strategies to improve LLMs' reasoning with larger-context inputs.
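The two core tasks named above, stable-matching generation and blocking-pair (instability) detection, can be grounded in a short sketch. The deferred-acceptance (Gale-Shapley) algorithm is the standard method for producing a stable matching; the code below is an illustrative implementation, not the paper's evaluation harness, and all function and variable names are ours.

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Deferred acceptance: each free proposer proposes down its list;
    each receiver tentatively holds the best proposal seen so far."""
    # rank[r][p] = position of proposer p in receiver r's list (lower = preferred)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    free = list(proposer_prefs)              # proposers still unmatched
    next_choice = {p: 0 for p in proposer_prefs}
    held = {}                                # receiver -> tentatively held proposer
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if r not in held:
            held[r] = p                      # receiver was free: accept tentatively
        elif rank[r][p] < rank[r][held[r]]:
            free.append(held[r])             # receiver trades up; old partner freed
            held[r] = p
        else:
            free.append(p)                   # rejected; p proposes again later
    return {p: r for r, p in held.items()}   # proposer -> receiver


def blocking_pairs(matching, proposer_prefs, receiver_prefs):
    """Pairs (p, r) where both strictly prefer each other to their partners;
    a matching is stable iff this list is empty."""
    prank = {p: {r: i for i, r in enumerate(prefs)}
             for p, prefs in proposer_prefs.items()}
    rrank = {r: {p: i for i, p in enumerate(prefs)}
             for r, prefs in receiver_prefs.items()}
    partner_of = {r: p for p, r in matching.items()}
    pairs = []
    for p, prefs in proposer_prefs.items():
        for r in prefs:
            if (prank[p][r] < prank[p][matching[p]]
                    and rrank[r][p] < rrank[r][partner_of[r]]):
                pairs.append((p, r))
    return pairs
```

Resolving an instability, the harder task the abstract highlights, amounts to repeatedly finding a blocking pair and satisfying it until `blocking_pairs` returns an empty list; models must carry out exactly this kind of iterative, rank-sensitive bookkeeping.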