Human-AI collaboration is increasingly promoted to improve high-stakes decision-making, yet its benefits have not been fully realized. Application-grounded evaluations are needed to assess methods for improving collaboration, but they typically require domain experts, making studies costly and limiting their generalizability. Evaluation is further constrained by the scarcity of public datasets and a reliance on proxy tasks. To address these challenges, we propose an application-grounded framework for large-scale, online evaluations of vision-based decision-making tasks. The framework introduces Blockies, a parametric approach for generating datasets of simulated diagnostic tasks, offering control over the traits and biases in the data used to train real-world models. These tasks are designed to be easy to learn but difficult to master, enabling participation by non-experts. The framework also incorporates storytelling and monetary incentives to manipulate perceived task stakes. An initial empirical study found that the high-stakes condition significantly reduced healthy distrust of AI, despite longer decision-making times. These findings underscore the importance of perceived stakes in fostering healthy distrust and demonstrate the framework's potential for scalable evaluation of high-stakes Human-AI collaboration.
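The abstract does not specify Blockies' actual trait set, rendering, or bias mechanism. The sketch below is a minimal illustration of what a parametric generator of simulated diagnostic tasks could look like, assuming a hypothetical diagnostic trait (the size of a central block), a hypothetical nuisance trait (background shade), and an illustrative `bias_strength` parameter that injects a known spurious correlation into the training data; none of these names or choices come from the paper.

```python
"""Illustrative sketch of a parametric generator in the spirit of Blockies.

All traits, parameter names, and rendering choices here are assumptions
for illustration; the paper's actual design may differ.
"""
import numpy as np

def sample_blocky(rng, p_disease=0.5, bias_strength=0.0, size=64):
    """Sample one simulated diagnostic case (image, label).

    bias_strength in [0, 1] controls how often a nuisance trait
    (background shade) spuriously agrees with the label, letting an
    experimenter inject a known, controllable bias into the dataset.
    """
    label = int(rng.random() < p_disease)

    # Assumed diagnostic trait: diseased cases grow a larger central block.
    side = int(rng.integers(12, 20)) + 8 * label

    # Assumed nuisance trait: background shade, optionally biased.
    if rng.random() < bias_strength:
        shade = 0.75 if label else 0.25   # spuriously predictive shade
    else:
        shade = rng.uniform(0.25, 0.75)   # uninformative shade

    img = np.full((size, size), shade, dtype=np.float32)
    lo, hi = size // 2 - side // 2, size // 2 + side // 2
    img[lo:hi, lo:hi] = 1.0               # render the central block
    return img, label

rng = np.random.default_rng(0)
# A strongly biased training set (90% of cases carry the spurious cue):
dataset = [sample_blocky(rng, bias_strength=0.9) for _ in range(1000)]
```

Under this kind of parameterization, a model trained on the generated data inherits a bias of known strength and direction, which is what makes the setup useful for studying whether participants develop healthy distrust of a flawed AI advisor.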