## ICLR 2019 | Akin in Spirit to Capsule Networks: Bengio et al. Propose Quaternion Recurrent Neural Networks

February 9, 机器之心

3 Quaternion Recurrent Neural Networks

3.2 Quaternion Representation

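The QRNN extends real-valued and complex-valued RNNs to hypercomplex numbers. In a quaternion dense layer, all parameters are quaternions, including the inputs, outputs, weights, and biases. The quaternion algebra is implemented by manipulating real-valued matrices. Consequently, for each input vector of size N and output vector of size M, the dimensions are split into four parts: the first equals r, the second xi, the third yj, and the last zk, which together compose a quaternion Q = r1 + xi + yj + zk. In a real-valued fully connected layer, inference is defined in real-valued space by the dot product between an input vector and a real-valued M × N weight matrix. In a QRNN, this operation is replaced by the Hamilton product with quaternion-valued matrices (that is, every entry of the weight matrix is a quaternion).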
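To make the "quaternion algebra via real-valued matrices" idea concrete, here is a minimal NumPy sketch. The function names and the `(r, x, y, z)` array layout are assumptions made for illustration, not code from the paper. It shows the Hamilton product of two quaternions and a quaternion dense layer expressed as one real-valued block-matrix product.

```python
import numpy as np

def hamilton_product(q, p):
    """Hamilton product q ⊗ p of two quaternions stored as (r, x, y, z)."""
    r1, x1, y1, z1 = q
    r2, x2, y2, z2 = p
    return np.array([
        r1 * r2 - x1 * x2 - y1 * y2 - z1 * z2,  # real part
        r1 * x2 + x1 * r2 + y1 * z2 - z1 * y2,  # i component
        r1 * y2 - x1 * z2 + y1 * r2 + z1 * x2,  # j component
        r1 * z2 + x1 * y2 - y1 * x2 + z1 * r2,  # k component
    ])

def quaternion_linear(W, x):
    """Quaternion matrix-vector product computed with a real-valued matrix.

    W: (M, N, 4) quaternion weights, x: (N, 4) quaternion inputs.
    Returns an (M, 4) array of quaternion outputs.
    """
    Wr, Wx, Wy, Wz = (W[..., c] for c in range(4))
    # Real 4M x 4N block matrix that performs left Hamilton multiplication by W.
    real_W = np.block([
        [Wr, -Wx, -Wy, -Wz],
        [Wx,  Wr, -Wz,  Wy],
        [Wy,  Wz,  Wr, -Wx],
        [Wz, -Wy,  Wx,  Wr],
    ])
    # Stack the input as [all r parts, all x parts, all y parts, all z parts].
    flat_x = x.T.reshape(-1)
    return (real_W @ flat_x).reshape(4, -1).T
```

Each output quaternion is the sum over n of `hamilton_product(W[m, n], x[n])`, so the four components of an input are mixed through shared weights rather than treated as four unrelated real features.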

3.3 Learning Algorithm

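The QRNN differs from the real-valued RNN in each learning subprocess. Let x_t be the input vector at time step t, h_t the hidden state, and W_hx, W_hy, and W_hh the input, output, and hidden-state weight matrices, respectively. The vector b_h is the hidden-state bias, and p_t and y_t are the output and the expected target vector, respectively.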

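f corresponds to any standard activation function. Based on prior assumptions, better stability (pure quaternion activation functions contain singularities), and simpler computation, this work favors the split approach, in which f is applied to each of the four quaternion components separately. The output vector p_t is computed as follows: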
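The update formulas are not reproduced in this text. As a minimal sketch, assume the usual RNN recurrence with every matrix product replaced by a Hamilton product: h_t = α(W_hh ⊗ h_{t-1} + W_hx ⊗ x_t + b_h) and p_t = β(W_hy ⊗ h_t), where α and β are split activations. The helper names, the array layout, and the use of the same f (tanh) for both α and β are assumptions made for illustration.

```python
import numpy as np

def hamilton_matvec(W, v):
    """Quaternion matrix-vector product. W: (M, N, 4), v: (N, 4) -> (M, 4)."""
    a, b, c, d = (W[..., k] for k in range(4))  # each (M, N)
    r, x, y, z = (v[:, k] for k in range(4))    # each (N,)
    return np.stack([
        a @ r - b @ x - c @ y - d @ z,  # real part
        a @ x + b @ r + c @ z - d @ y,  # i component
        a @ y - b @ z + c @ r + d @ x,  # j component
        a @ z + b @ y - c @ x + d @ r,  # k component
    ], axis=-1)

def split_activation(q, f=np.tanh):
    """Split approach: apply the real-valued f to each component independently."""
    return f(q)

def qrnn_step(x_t, h_prev, W_hx, W_hh, W_hy, b_h, f=np.tanh):
    """One forward step of the assumed recurrence; returns (h_t, p_t)."""
    pre_activation = hamilton_matvec(W_hh, h_prev) + hamilton_matvec(W_hx, x_t) + b_h
    h_t = split_activation(pre_activation, f)
    p_t = split_activation(hamilton_matvec(W_hy, h_t), f)
    return h_t, p_t
```

Because the split activation acts on plain real arrays, any standard function such as tanh or sigmoid can be dropped in without defining it over the quaternions, which avoids the singularities mentioned above.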

4 Experiments

✄------------------------------------------------

We develop a system for modeling hand-object interactions in 3D from RGB images that show a hand which is holding a novel object from a known category. We design a Convolutional Neural Network (CNN) for Hand-held Object Pose and Shape estimation called HOPS-Net and utilize prior work to estimate the hand pose and configuration. We leverage the insight that information about the hand facilitates object pose and shape estimation by incorporating the hand into both training and inference of the object pose and shape as well as the refinement of the estimated pose. The network is trained on a large synthetic dataset of objects in interaction with a human hand. To bridge the gap between real and synthetic images, we employ an image-to-image translation model (Augmented CycleGAN) that generates realistically textured objects given a synthetic rendering. This provides a scalable way of generating annotated data for training HOPS-Net. Our quantitative experiments show that even noisy hand parameters significantly help object pose and shape estimation. The qualitative experiments show results of pose and shape estimation of objects held by a hand "in the wild".

Graph deep learning models, such as graph convolutional networks (GCN), achieve remarkable performance for tasks on graph data. Like other types of deep models, graph deep learning models are often vulnerable to adversarial attacks. However, compared with non-graph data, the discrete features, graph connections, and different definitions of imperceptible perturbations bring unique challenges and opportunities for adversarial attacks and defences on graph data. In this paper, we propose both attack and defence techniques. For attack, we show that the discrete feature problem can easily be resolved by introducing integrated gradients, which accurately reflect the effect of perturbing certain features or edges while still benefiting from parallel computation. For defence, we propose to partially learn the adjacency matrix to integrate the information of distant nodes, so that the prediction for a given target is supported by more global graph information rather than by just a few neighbour nodes. This makes attacks harder, since one needs to perturb more features or edges for an attack to succeed. Our experiments on a number of datasets show the effectiveness of the proposed methods.
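As a rough illustration of the integrated-gradients idea mentioned in this abstract, the sketch below averages a model's gradient along the straight path from a baseline to the input and scales it by the input difference. The toy logistic model, its weights, the all-zero baseline, and the midpoint-rule approximation are assumptions for illustration; this is not the paper's GCN attack.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum."""
    alphas = (np.arange(steps) + 0.5) / steps
    # Average the gradient at points interpolated between baseline and x.
    avg_grad = np.mean(
        [grad_fn(baseline + a * (x - baseline)) for a in alphas], axis=0
    )
    return (x - baseline) * avg_grad

# Hypothetical toy model: a logistic score over three binary features.
w = np.array([1.5, -2.0, 0.5])

def model(x):
    return 1.0 / (1.0 + np.exp(-(w @ x)))

def model_grad(x):
    s = model(x)
    return s * (1.0 - s) * w  # analytic gradient of the logistic score

x = np.array([1.0, 0.0, 1.0])
baseline = np.zeros(3)
attributions = integrated_gradients(model_grad, x, baseline)
```

The attributions sum approximately to `model(x) - model(baseline)`, and a feature that is identical in the input and the baseline receives zero attribution, which is what makes the score usable for ranking candidate flips of discrete features.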

This paper studies the problem of vehicle make and model classification. Two of the main challenges are reaching high classification accuracy and reducing the time needed to annotate the images. To address these problems, we have created a fine-grained database using online vehicle marketplaces in Turkey. We propose a pipeline that combines an SSD (Single Shot Multibox Detector) model with a CNN (Convolutional Neural Network) model to train on the database. In the pipeline, we first detect the vehicles with an algorithm that reduces annotation time, and then feed them into the CNN model. This yields approximately 4% better classification accuracy than a conventional CNN model. Next, we propose using the detected vehicles as the ground truth bounding boxes (GTBB) of the images and feeding them into an SSD model in a second pipeline. At this stage, reasonable classification accuracy is reached without perfectly shaped GTBBs. Lastly, an application is implemented for a use case built on the proposed pipelines: it detects unauthorized vehicles by comparing their license plate numbers with their make and model. License plates are assumed to be readable.

Existing attention mechanisms are trained to attend to individual items in a collection (the memory) with a predefined, fixed granularity, e.g., a word token or an image grid. We propose area attention: a way to attend to areas in the memory, where each area contains a group of items that are structurally adjacent, e.g., spatially for a 2D memory such as images, or temporally for a 1D memory such as natural language sentences. Importantly, the shape and the size of an area are dynamically determined via learning, which enables a model to attend to information with varying granularity. Area attention can easily work with existing model architectures such as multi-head attention for simultaneously attending to multiple areas in the memory. We evaluate area attention on two tasks: neural machine translation (both character and token-level) and image captioning, and improve upon strong (state-of-the-art) baselines in all the cases. These improvements are obtainable with a basic form of area attention that is parameter free.
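A minimal sketch of what a parameter-free form of area attention over a 1D memory could look like follows. The abstract does not say how an area's key and value are formed, so this sketch assumes the mean of the item keys and the sum of the item values; the function name and the `max_area` cap on area size are likewise hypothetical.

```python
import numpy as np

def area_attention_1d(query, keys, values, max_area=2):
    """Attend over all contiguous areas of up to max_area items.

    query: (d,), keys: (n, d), values: (n, dv).
    Returns (output, weights, spans), where spans[i] = (start, end) of area i.
    """
    n = len(keys)
    area_keys, area_values, spans = [], [], []
    for size in range(1, max_area + 1):
        for start in range(n - size + 1):
            end = start + size
            area_keys.append(keys[start:end].mean(axis=0))     # assumed: mean of keys
            area_values.append(values[start:end].sum(axis=0))  # assumed: sum of values
            spans.append((start, end))
    area_keys = np.array(area_keys)
    area_values = np.array(area_values)
    # Standard dot-product attention, but over areas instead of single items.
    logits = area_keys @ query
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ area_values, weights, spans
```

With `max_area=1` this reduces to ordinary softmax attention over individual items. Larger values let the attention weights choose among coarser or finer groupings, and the area keys and values introduce no new parameters.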
