Large-Scale Video Semantic Search with GNES and Tensorflow 2.0





Background
Nov 22, 2019
Many people may know me from bert-as-service (and of course from Fashion-MNIST </bragging>). So when they first heard about my new project GNES: Generic Neural Elastic Search, people naturally thought that I was building a semantic text search solution. But actually, GNES has a more ambitious goal: to become the next-generation semantic search engine for all content forms, including text, image, video and audio. In this post, I will show you how to use the latest GNES Flow API and Tensorflow 2.0 to build a video semantic search system. For the impatient, feel free to watch the teaser video below before continuing.



I plan to have a series on the topic of video semantic search using GNES. This article serves as the first part. Readers who are looking for benchmarking, evaluations and model comparison, stay tuned and feel free to subscribe to my Blog.

Formulating the Problem in GNES Framework

The data we are using is Tumblr GIF (TGIF) dataset, which contains 100K animated GIFs and 120K sentences describing visual contents. Our problem is the following: given a video database and a query video, find the top-k semantically related videos from the database.


a woman in a car is singing.
A well-dressed young guy with gelled red hair glides across a room and scans it with his eyes.
a man wearing a suit smiles at something in the distance.



“Semantic” is a casual and ambiguous word, I know. Depending on your applications and scenarios, it could mean motion-wise similar (sports videos), emotionally similar (e.g. memes), etc. Right now I will just consider semantically-related as visually similar.




Text descriptions of the videos, though potentially very useful, are ignored at the moment. We are not building a cross-modality search solution (e.g. from text to video or vice versa), nor do we leverage textual information when building the video search solution. Nonetheless, those text descriptions can be used to evaluate/compare the effectiveness of the system in a quantitative manner.


Put into the GNES framework, the problem breaks down into the following steps:



Index time

  1. segment each video into workable semantic units (aka “Chunk” in GNES);
  2. encode each chunk as a fixed-length vector;
  3. store all vector representations in a vector database.

Query time

  1. do steps 1,2 in the index time for each incoming query;
  2. retrieve relevant chunks from database;
  3. aggregate the chunk-level score back to document-level;
  4. return the top-k results to users.



If you find these steps hard to follow, then please first read this blog post to understand the philosophy behind GNES. These steps can be accomplished by using the preprocessor, encoder, indexer and router microservices in GNES. Before we dig into the concrete design of each service, we can first write down these two runtimes using the GNES Flow API.




from gnes.flow import Flow

num_rep = 1

index_flow = (Flow()
      .add_preprocessor(name='chunk_proc', replicas=num_rep)
      .add_indexer(name='doc_idx')
      .add_encoder(replicas=num_rep, recv_from='chunk_proc')
      .add_indexer(name='vec_idx')
      .add_router(name='sync_barrier', yaml_path='BaseReduceRouter',
                  num_part=2, recv_from=['vec_idx', 'doc_idx']))

query_flow = (Flow()
     .add_preprocessor(name='chunk_proc', replicas=num_rep)
     .add_encoder(replicas=num_rep)
     .add_indexer(name='vec_idx')
     .add_router(name='scorer')
     .add_indexer(name='doc_idx', sorted_response='descend'))




One can visualize these flows by flow.build(backend=None).to_url(), which gives:

Index flow


Query flow
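
These diagrams are produced by calls like the following (a minimal sketch, reusing the index_flow and query_flow defined above):

print(index_flow.build(backend=None).to_url())
print(query_flow.build(backend=None).to_url())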


More usages and specifications of the GNES Flow API can be found in this post. We now move on to the concrete logic behind each component.

Preprocessing Videos

In the previous post, I stated that a good neural search is only possible when document and query are comparable semantic units. The preprocessor serves exactly this purpose. It segments a document into a list of semantic units, each of which is called a “chunk” in GNES. For video, a meaningful unary chunk could be a frame or a shot (i.e. a series of frames that runs for an uninterrupted period of time). In the Tumblr GIF dataset, most of the animations have fewer than three shots. Thus, I will simply use the frame as the chunk to represent the document.

GNES itself does not contain such a preprocessor (implementing all possible preprocessors/encoders is also not the design philosophy of GNES), so we need to write our own. Thanks to the well-designed GNES component API, this can be easily done by inheriting from BaseImagePreprocessor and implementing apply(), for example:


import numpy as np
from PIL import Image

from gnes.component import BaseImagePreprocessor
from gnes.proto import array2blob
from gif_reader import get_frames  # helper that yields the frames of a GIF


class GifPreprocessor(BaseImagePreprocessor):
    img_shape = 96

    def apply(self, doc: 'gnes_pb2.Document') -> None:
        super().apply(doc)
        # doc.raw_bytes carries the file path of the animation
        im = Image.open(doc.raw_bytes.decode())
        idx = 0
        for frame in get_frames(im):
            try:
                new_frame = frame.convert('RGB').resize([self.img_shape] * 2)
                img = (np.array(new_frame) / 255).astype(np.float32)
                c = doc.chunks.add()
                c.doc_id = doc.doc_id
                c.offset = idx
                c.weight = 1.
                c.blob.CopyFrom(array2blob(img))
            except Exception as ex:
                self.logger.error(ex)
            finally:
                idx = idx + 1




This preprocessor loads the animation, reads its frames into RGB format, resizes each of them to 96x96 and stores them in doc.chunks.blob as numpy.ndarray. At the moment we don't implement any keyframe detection in the preprocessor, so every chunk has a uniform weight, i.e. c.weight=1.
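
For completeness, here is a minimal sketch of what the get_frames helper in gif_reader.py could look like, assuming it simply iterates over the animation's frames with PIL (the actual helper may handle more, e.g. frame disposal modes):

from PIL import Image, ImageSequence

def get_frames(im: 'Image.Image'):
    # yield every frame of an animated image, one PIL Image at a time
    for frame in ImageSequence.Iterator(im):
        yield frame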


One may think of more sophisticated preprocessors. For example, smart sub-sampling to reduce the number of near-duplicated frames; using seam carving for better cropping and resizing frames; or adding image effects and enhancements. Everything is possible and I will leave these possibilities to the readers.
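
As one hypothetical flavor of such sub-sampling, consecutive near-duplicate frames could be dropped whenever they barely differ from the last kept frame (a rough sketch; the grayscale difference metric and the threshold are arbitrary choices here):

import numpy as np

def subsample(frames, threshold=10.0):
    # keep a frame only if its mean absolute grayscale difference
    # to the last kept frame exceeds the threshold
    last = None
    for f in frames:
        arr = np.asarray(f.convert('L'), dtype=np.float32)
        if last is None or np.abs(arr - last).mean() > threshold:
            yield f
            last = arr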








Encoding Chunks into Vectors

In the encoding step, we want to represent each chunk by a fixed-length vector. This can be easily done with the pretrained models in Tensorflow 2.0. For the sake of clarity and simplicity, we will employ MobileNetV2 as our encoder. The pretrained weights on ImageNet are downloaded automatically when instantiating the encoder in post_init. The full list of pretrained models can be found here.




from typing import List

import numpy as np
import tensorflow as tf

from gnes.component import BaseImageEncoder
from gnes.helper import batching, as_numpy_array


class TF2ImageEncoder(BaseImageEncoder):
    batch_size = 128
    img_shape = 96
    pooling_strategy = 'avg'
    model_name = 'MobileNetV2'

    def post_init(self):
        # build the pretrained model; ImageNet weights are downloaded
        # automatically on first instantiation
        self.model = getattr(tf.keras.applications, self.model_name)(
            input_shape=(self.img_shape, self.img_shape, 3),
            include_top=False,
            pooling=self.pooling_strategy,
            weights='imagenet')
        self.model.trainable = False

    @batching
    @as_numpy_array
    def encode(self, img: List['np.ndarray'], *args, **kwargs) -> np.ndarray:
        # stack the list of frames into one batch and run a forward pass
        img = np.stack(img, axis=0)
        return self.model(img)





Code should be fairly straightforward. I create a new encoder class by inheriting from BaseImageEncoder, in which the most important function encode() simply calls the model to extract features. The batching decorator is a very handy helper to control the size of the data flowing into the encoder. After all, an OOM error is the last thing you want to see.
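
Conceptually (an illustration only, not the actual gnes.helper implementation), the decorator does something along these lines:

import numpy as np

def batched_encode(encode_fn, data, batch_size=128):
    # call encode_fn on fixed-size slices and concatenate the results,
    # bounding peak memory regardless of the input length
    parts = [encode_fn(data[i:i + batch_size])
             for i in range(0, len(data), batch_size)]
    return np.concatenate(parts, axis=0)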

Indexing Chunks and Documents

For indexing, I will use the built-in chunk indexers and document indexers of GNES. Chunk indexing is essentially vector indexing: we need to store a map of chunk ids and their corresponding vector representations. As GNES supports the Faiss indexer already, you don't need to write any Python code. Simply write a YAML config vec.yml as follows:


!FaissIndexer
parameters:
  num_dim: -1  # automatically determined
  index_key: HNSW32
  data_path: $WORKDIR/idx.binary
gnes_config:
  name: my_vec_indexer  # a customized name
  work_dir: $WORKDIR
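
Under the hood, the index_key is handed to Faiss's index factory. In raw Faiss, the equivalent looks roughly like the sketch below (illustrative; 1280 is the feature dimension of MobileNetV2 with average pooling):

import faiss

# an HNSW graph index with 32 neighbors per node, as selected by HNSW32
index = faiss.index_factory(1280, 'HNSW32')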



As eventually, at query time, we are interested in documents, not chunks, the map between doc ids and chunk ids should also be stored. This is essentially a key-value database, and a simple Python dict structure will do the job. Again, only a YAML config doc.yml is required:



!DictIndexer
gnes_config:
  name: my_doc_indexer  # a customized name
  work_dir: $WORKDIR
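
To make the key-value idea concrete, a dict-based doc indexer boils down to something like this sketch (illustrative only, not the actual DictIndexer source):

class ToyDocIndexer:
    def __init__(self):
        self._store = {}

    def add(self, doc_ids, docs):
        # map each doc id to its serialized document
        self._store.update(zip(doc_ids, docs))

    def query(self, doc_ids):
        return [self._store.get(d) for d in doc_ids]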




Note that the doc indexer does not require the encoding step, thus it can be done in parallel with the chunk indexer. Notice how chunk_proc is broadcasting its output to the encoder and doc indexer, and how a sync barrier is placed afterwards to ensure all jobs are completed.

Scoring Results

Scoring is important but hard; it often requires domain-specific expertise and many iterations. You can simply take the average of all chunk scores as the document score, or you can weight chunks differently and combine them with some heuristics. In the current GNES, a scorer or ranker can be implemented by inheriting from BaseReduceRouter and overriding its apply method.

When designing your own score function, make sure to use the existing ones from gnes.score_fn.base as your basic building blocks. Stacking and combining these score functions can create a complicated yet explainable score function, greatly reducing the effort when debugging. Besides, all score functions from gnes.score_fn.base are trainable (via the .train() method), enabling advanced scoring techniques such as learning to rank.


class ScoreOps:
    multiply = CombinedScoreFn('multiply')
    sum = CombinedScoreFn('sum')
    max = CombinedScoreFn('max')
    min = CombinedScoreFn('min')
    avg = CombinedScoreFn('avg')
    none = ModifierScoreFn('none')
    log = ModifierScoreFn('log')
    log1p = ModifierScoreFn('log1p')
    log2p = ModifierScoreFn('log2p')
    ln = ModifierScoreFn('ln')
    ln1p = ModifierScoreFn('ln1p')
    ln2p = ModifierScoreFn('ln2p')
    square = ModifierScoreFn('square')
    sqrt = ModifierScoreFn('sqrt')
    abs = ModifierScoreFn('abs')
    reciprocal = ModifierScoreFn('reciprocal')
    reciprocal1p = ModifierScoreFn('reciprocal1p')
    const = ConstScoreFn()
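
As an illustration of stacking, a document score could be composed from its chunk scores roughly like this (hypothetical usage; check gnes.score_fn.base for the exact call signatures):

# average the chunk-level relevance scores into one document score,
# then dampen large values with log1p
doc_score = ScoreOps.log1p(ScoreOps.avg(*chunk_scores))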



Putting it All Together

With all the YAML configs and Python modules we just made, we can import them into the flow by specifying py_path and yaml_path. Besides scaling out the preprocessor and encoder to 4 replicas, I also make a small tweak in the flow: I add a thumbnail preprocessor thumbnail_proc to store all extracted frames in a row as a JPEG file.



replicas = 4

index_flow = (Flow()
    .add_preprocessor(name='chunk_proc', yaml_path='gif2chunk.yml',
                      py_path=['gif_reader.py', 'gif2chunk.py'],
                      replicas=replicas)
    .add_preprocessor(name='thumbnail_proc', yaml_path='chunks2jpg.yml',
                      py_path='chunks2jpg.py', replicas=replicas)
    .add_indexer(name='doc_idx', yaml_path='doc.yml')
    .add_encoder(yaml_path='encode.yml', py_path='encode.py',
                 replicas=replicas, recv_from='chunk_proc')
    .add_indexer(name='vec_idx', yaml_path='vec.yml')
    .add_router(name='sync_barrier', yaml_path='BaseReduceRouter',
                num_part=2, recv_from=['vec_idx', 'doc_idx']))

query_flow = (Flow()
    .add_preprocessor(name='chunk_proc', yaml_path='gif2chunk.yml',
                      py_path=['gif_reader.py', 'gif2chunk.py'],
                      replicas=replicas)
    .add_preprocessor(name='thumbnail_proc', yaml_path='chunks2jpg.yml',
                      py_path='chunks2jpg.py', replicas=replicas)
    .add_encoder(yaml_path='encode.yml', py_path='encode.py',
                 replicas=replicas, recv_from='chunk_proc')
    .add_indexer(name='vec_idx', yaml_path='vec.yml')
    .add_router(name='scorer', yaml_path='score.yml', py_path='videoscorer.py')
    .add_indexer(name='doc_idx', yaml_path='doc.yml', sorted_response='descend')
    .add_router(name='sync_barrier', yaml_path='BaseReduceRouter',
                num_part=2, recv_from=['thumbnail_proc', 'doc_idx']))



Visualizing these two flows gives:

Index flow


Query flow


What Should We Send/Receive?
Sending data to the flow is easy: simply build an Iterator[bytes] and feed it to flow.index(). The example below gets the absolute paths of all animation files and sends those paths to the flow:



import glob

bytes_gen = (g.encode() for g in glob.glob('dataset/*.gif'))

with index_flow.build(backend='process') as fl:
    fl.index(bytes_gen, batch_size=64)




Of course one can first read() the animation into memory and send the binary animation directly to the flow, but that would be very inefficient. We do not want IO ops to become the bottleneck, and that's why we spawn four preprocessors in the flow.
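
For contrast, the in-memory variant would look like this (a sketch of the discouraged approach; every message then carries a full binary GIF):

# discouraged: ships whole GIFs through the message pipeline
bytes_gen = (open(g, 'rb').read() for g in glob.glob('dataset/*.gif'))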



The indexing procedure is pretty fast. On my i7-8850H desktop with no GPU, indexing the full dataset (~100K videos) takes 4 hours. Things can be much faster if you have a powerful GPU.






Once the flow is indexed, we can throw a video query at it and retrieve relevant videos. To do that, we randomly sample some videos as queries:



import random

bytes_gen = (g.encode() for g in random.sample(glob.glob(GIF_BLOB), num_docs))
with query_flow.build(backend='process') as fl:
    fl.query(bytes_gen, callback=dump_result_to_json, top_k=60, batch_size=32)



Note the callback=dump_result_to_json in the code. Every time a search result is returned, this callback function is invoked. In this example, I simply dump the search result into JSON format so that I can later visualize it in the web frontend.



import json

from google.protobuf.json_format import MessageToDict

fp = open('/topk.json', 'w', encoding='utf8')

def dump_result_to_json(resp):
    # remove_envelope strips the GNES message envelope from the response
    resp = remove_envelope(resp)
    for r in resp.search.results:
        v = MessageToDict(r, including_default_value_fields=True)
        v['doc']['rawBytes'] = r.doc.raw_bytes.decode()
        for k, kk in zip(v['topkResults'], r.topk_results):
            k['doc']['rawBytes'] = kk.doc.raw_bytes.decode()
            k['score']['explained'] = json.loads(kk.score.explained)
        fp.write(json.dumps(v, sort_keys=True) + '\n')


Summary

Video semantic search is not only fun (seriously, I have spent even more time watching cat videos after building this system), but also has many uses in customer-facing applications, e.g. short-video apps and movie/film editors. Though it is too early to say GNES is the de facto solution to video semantic search, I hope this article sends a good signal: GNES goes far beyond bert-as-service, enabling the search of almost any content form, including text, image, video and audio.



In the second part, I will use the token-based similarity of the textual descriptions (e.g. Rouge-L) as the ground truth to evaluate our video search system. I also plan to benchmark different pretrained models, preprocessors, indexers and their combinations. If you are interested in reading more on this thread or knowing more about my plans for GNES, stay tuned.




