Automatic image captioning is the ongoing task of generating syntactically well-formed and semantically accurate natural-language descriptions of an image in context. The encoder-decoder architectures used throughout existing Bengali Image Captioning (BIC) research feed abstract image feature vectors into the encoder. We propose a novel transformer-based architecture with an attention mechanism, using a pre-trained ResNet-101 model as the image encoder for feature extraction. Experiments demonstrate that the language decoder in our approach captures fine-grained information in the caption and, paired with the image features, produces accurate and diverse captions on the BanglaLekhaImageCaptions dataset. Our approach outperforms all existing Bengali Image Captioning work, setting a new benchmark with scores of 0.694 on BLEU-1, 0.630 on BLEU-2, 0.582 on BLEU-3, and 0.337 on METEOR.