Estimating three-dimensional morphological traits such as volume from two-dimensional RGB images is inherently challenging. Depth information is lost, projections introduce distortions, and under field conditions parts of the object are occluded. In this work, we explore multiple approaches for non-destructive volume estimation of wheat spikes from RGB images, using structured-light 3D scans as ground-truth references. Wheat spike volume is a promising phenotyping trait because it correlates strongly with spike dry weight, a key component of fruiting efficiency. To account for the complex geometry of the spikes, we compare different neural network approaches for estimating volume from 2D images. We benchmark them against two conventional baselines: a 2D area-based projection and a geometric reconstruction using axis-aligned cross-sections. On six-view indoor images, fine-tuned Vision Transformers (DINOv2 and DINOv3) with MLP heads achieve the lowest MAPE (5.08% and 4.67%, respectively) and the highest correlations (0.96 and 0.97, respectively). They outperform fine-tuned CNNs (ResNet18 and ResNet50), wheat-specific backbones, and both baselines. With frozen DINO backbones, deeply supervised LSTMs outperform MLPs, whereas after fine-tuning, the improved high-level representations allow simple MLPs to outperform LSTMs. We show that object shape significantly affects volume estimation accuracy: irregular geometries such as wheat spikes pose greater challenges for geometric methods than for deep learning approaches. Fine-tuning DINOv3 on single side-view images captured in the field yields a MAPE of 8.39% and a correlation of 0.90. Together, these results provide a novel pipeline for wheat spike volume phenotyping that is fast, accurate, and non-destructive.
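The following minimal Python sketch illustrates the kind of geometric cross-section baseline and evaluation metrics referred to above. It is not the authors' implementation. The slicing scheme is a hypothetical assumption: each slice along the spike axis is treated as an ellipse whose two axes are the silhouette widths measured in two orthogonal views. The function names (`slice_volume`, `mape`, `pearson_r`) are also hypothetical. The reported "correlation" is assumed here to be Pearson's r.

```python
import math


def slice_volume(widths_front, widths_side, slice_height):
    """Hypothetical axis-aligned cross-section baseline.

    Each slice along the spike axis is modelled as an ellipse whose axes are
    the silhouette widths from two orthogonal views (an assumption, not the
    paper's stated method). Volume = sum of ellipse areas * slice height.
    """
    if len(widths_front) != len(widths_side):
        raise ValueError("both views must be sampled at the same slices")
    total = 0.0
    for a, b in zip(widths_front, widths_side):
        # Ellipse area with semi-axes a/2 and b/2.
        total += math.pi * (a / 2.0) * (b / 2.0) * slice_height
    return total


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (assumes non-zero ground truth)."""
    errors = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)


def pearson_r(x, y):
    """Pearson correlation coefficient (assumed to be the reported correlation)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


if __name__ == "__main__":
    # A cylinder of diameter 2 sliced into 10 unit-height slices has volume 10*pi.
    print(slice_volume([2.0] * 10, [2.0] * 10, 1.0))
    print(mape([100.0, 200.0], [110.0, 180.0]))
    print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The cylinder check shows why such a baseline works well for regular shapes. For a spike with protruding spikelets and awns, the elliptical-slice assumption breaks down, which is consistent with the observation that irregular geometries are harder for geometric methods.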