Experimenting with Self-Supervision using Rotation Prediction for Image Captioning

Image captioning is a task in Artificial Intelligence that combines computer vision and natural language processing. Its goal is to generate captions that describe images, with applications such as descriptions for assistive technology and image indexing (for search engines, for instance). This makes it an important and actively researched topic in AI. Like many other tasks, however, it is trained on large sets of images labeled by human annotators, which is cumbersome: annotation requires manual effort, incurs both financial and time costs, is error-prone, and can be difficult to carry out in some cases (e.g. medical images). To reduce the need for labels, we attempt to use self-supervised learning, a type of learning in which models use information contained in the images themselves as labels. This is challenging because the task is two-fold: images and captions come from two different modalities and are usually handled by different types of networks. It is thus not obvious what a completely self-supervised solution would look like, and how captioning could be achieved in a way comparable to how self-supervision is applied today to image recognition tasks is still an open research question. In this project, we use an encoder-decoder architecture in which the encoder is a convolutional neural network (CNN) trained on the OpenImages dataset that learns image features in a self-supervised fashion using the rotation pretext task. The decoder is a Long Short-Term Memory (LSTM) network; it is trained, along with the image captioning model, on the MS COCO dataset and is responsible for generating captions. Our GitHub repository can be found at: https://github.com/elhagry1/SSL_ImageCaptioning_RotationPrediction
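
To make the rotation pretext task concrete, the following minimal sketch shows how training labels can be derived from unlabeled images alone: each image is rotated by a multiple of 90 degrees, and the rotation index becomes the class the encoder must predict. The choice of four angles (0, 90, 180, 270 degrees), the use of nested lists in place of image tensors, and the function names `rotate90` and `make_rotation_examples` are assumptions made for illustration; they are not taken from the project's repository.

```python
def rotate90(image):
    """Rotate a 2D grid (list of rows) by 90 degrees counter-clockwise."""
    # Transpose the grid, then reverse the order of the rows.
    return [list(row) for row in zip(*image)][::-1]


def make_rotation_examples(image, num_rotations=4):
    """Return (rotated_image, label) pairs for one unlabeled image.

    Label k means the image was rotated k times by 90 degrees.
    num_rotations=4 is an assumed setting covering 0/90/180/270 degrees.
    """
    examples = []
    current = [list(row) for row in image]  # copy so the input is untouched
    for label in range(num_rotations):
        examples.append((current, label))
        current = rotate90(current)
    return examples


def build_pretext_dataset(images):
    """Turn a list of unlabeled images into a labeled rotation dataset."""
    dataset = []
    for image in images:
        dataset.extend(make_rotation_examples(image))
    return dataset


if __name__ == "__main__":
    # A tiny 2x3 "image" standing in for real pixel data.
    toy_image = [[1, 2, 3],
                 [4, 5, 6]]
    for rotated, label in build_pretext_dataset([toy_image]):
        print(label, rotated)
```

In a full pipeline, a CNN classifier would be trained on these (image, label) pairs, and its feature layers would then serve as the encoder for the captioning model; that training step is omitted here.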
