Paper walkthrough: ArcFace: Additive Angular Margin Loss for Deep Face Recognition

Reading through the ArcFace paper

I used ArcFace in a recent competition, so I am reading through the paper to understand the method properly.


One of the main challenges in feature learning using Deep Convolutional Neural Networks (DCNNs) for large-scale face recognition is the design of appropriate loss functions that enhance discriminative power. Centre loss penalises the distance between the deep features and their corresponding class centres in the Euclidean space to achieve intra-class compactness. SphereFace assumes that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in an angular space and penalises the angles between the deep features and their corresponding weights in a multiplicative way.
Recently, a popular line of research is to incorporate margins in well-established loss functions in order to maximise face class separability.
In this paper, we propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition.
The proposed ArcFace has a clear geometric interpretation due to the exact correspondence to the geodesic distance on the hypersphere.
We present arguably the most extensive experimental evaluation of all the recent state-of-the-art face recognition methods on over 10 face recognition benchmarks including a new large-scale image database with trillion level of pairs and a large-scale video dataset.
We show that ArcFace consistently outperforms the state-of-the-art and can be easily implemented with negligible computational overhead.
We release all refined training data, training codes, pre-trained models and training logs, which will help reproduce the results in this paper.

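In short: the paper proposes Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition.

  • Interpretation
    • Conventionally, the output of the fully connected layer was fed into the softmax function and used for the loss.
    • Additive Angular Margin Loss = a loss that adds a margin to the angle.
    • Validating on a large-scale image database with trillion-level pairs is a massive undertaking.
    • If performance improves just by changing the loss function attached to the final (fully connected) layer, that is a big help.
      • The feature-processing pipeline does not have to change, so the extra effort stays small.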


Face representation using Deep Convolutional Neural Network (DCNN) embedding is the method of choice for face recognition.
DCNNs map the face image, typically after a pose normalisation step, into a feature that has small intra-class and large inter-class distance.
There are two main lines of research to train DCNNs for face recognition.
Those that train a multi-class classifier which can separate different identities in the training set, such as by using a softmax classifier, and those that learn directly an embedding, such as the triplet loss.
Based on the large-scale training data and the elaborate DCNN architectures, both the softmax-loss-based methods and the triplet-loss-based methods can obtain excellent performance on face recognition.

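In short: face recognition relies on face representations built from Deep Convolutional Neural Network (DCNN) embeddings.

  • Interpretation
    • The conventional softmax function is not bad at all. It actually does its job well.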

However, both the softmax loss and the triplet loss have some drawbacks.
For the softmax loss:

  1. the size of the linear transformation matrix W ∈ R^{d \times n} increases linearly with the identities number n;
  2. the learned features are separable for the closed-set classification problem but not discriminative enough for the open-set face recognition problem.


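  1. The size of the linear transformation matrix W ∈ R^{d \times n} grows linearly with the number of identities n.
  2. The learned features are separable on a closed-set classification problem, but they are not discriminative enough for face recognition on broader, open-set data.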
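
To get a feel for drawback 1, the sketch below counts the entries of W for an embedding size of 512 (the value the paper uses for d) and two identity counts. The identity counts and the helper name `fc_weight_count` are illustrative assumptions, not figures from the paper.

```python
def fc_weight_count(d: int, n: int) -> int:
    """Number of entries in the final-layer weight matrix W of shape (d, n)."""
    return d * n


# d = 512 follows the paper; the identity counts n are made up for illustration.
for n in (10_000, 1_000_000):
    count = fc_weight_count(512, n)
    mib = count * 4 / 2**20  # 4 bytes per float32 weight
    print(f"n={n:>9,d}: {count:,d} weights, about {mib:,.0f} MiB as float32")
```

Doubling n doubles the size of W, which is the linear growth the paper points at.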

For the triplet loss:

  1. there is a combinatorial explosion in the number of face triplets especially for large-scale datasets, leading to a significant increase in the number of iteration steps.
  2. semi-hard sample mining is a quite difficult problem for effective model training.


  • Does this refer to cases where the decision boundary allows a sample to belong to more than one class?
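
The combinatorial explosion in drawback 1 of the triplet loss can be made concrete with a small count. The sketch below assumes a triplet is an ordered (anchor, positive, negative) choice where anchor and positive are two different images of one identity and the negative is any image of another identity. The function name and the dataset sizes are hypothetical.

```python
from typing import Sequence


def count_triplets(images_per_identity: Sequence[int]) -> int:
    """Count (anchor, positive, negative) triplets for the given per-identity image counts."""
    total = sum(images_per_identity)
    # k * (k - 1) ordered anchor/positive pairs, each paired with (total - k) negatives.
    return sum(k * (k - 1) * (total - k) for k in images_per_identity)


# Hypothetical dataset sizes: the same 10 images per identity, more identities each time.
for identities in (100, 1_000, 10_000):
    print(identities, "identities ->", f"{count_triplets([10] * identities):,d}", "triplets")
```

The count grows roughly quadratically with the number of images, so enumerating all triplets quickly becomes impractical and some form of sample mining is needed.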

Several variants have been proposed to enhance the discriminative power of the softmax loss.
Wen et al. pioneered the centre loss, the Euclidean distance between each feature vector and its class centre, to obtain intra-class compactness while the interclass dispersion is guaranteed by the joint penalisation of the softmax loss.
Nevertheless, updating the actual centres during training is extremely difficult as the number of face classes available for training has recently dramatically increased.



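  • I see. Classifying human faces does mean classifying into an enormous number of classes.
    • A practical constraint translates directly into the research problem to be solved.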
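
A minimal sketch of the centre-loss term described above, in plain Python. The factor 1/2 and the averaging over the batch are assumptions of this sketch, and `centre_loss` is a hypothetical helper name. In the actual method the centres are also updated during training, which is not shown here.

```python
from typing import Sequence


def centre_loss(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    centres: Sequence[Sequence[float]],
) -> float:
    """Half the mean squared Euclidean distance between each feature and its class centre."""
    total = 0.0
    for x, y in zip(features, labels):
        centre = centres[y]  # one centre vector per class
        total += sum((a - c) ** 2 for a, c in zip(x, centre))
    return 0.5 * total / len(features)


# Toy batch: the first sample sits on its centre, the second is at distance 1 from its centre.
loss = centre_loss(
    features=[[1.0, 0.0], [0.0, 1.0]],
    labels=[0, 1],
    centres=[[1.0, 0.0], [0.0, 0.0]],
)
print(loss)
```

The `centres` table needs one row per identity, so it grows with the number of classes just like W does.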

By observing that the weights from the last fully connected layer of a classification DCNN trained on the softmax loss bear conceptual similarities with the centres of each face class, the works in proposed a multiplicative angular margin penalty to enforce extra intra-class compactness and inter-class discrepancy simultaneously, leading to a better discriminative power of the trained model.
Even though SphereFace introduced the important idea of angular margin, their loss function required a series of approximations in order to be computed, which resulted in an unstable training of the network.
In order to stabilise training, they proposed a hybrid loss function which includes the standard softmax loss.
Empirically, the softmax loss dominates the training process, because the integer-based multiplicative angular margin makes the target logit curve very precipitous and thus hinders convergence.
CosFace directly adds cosine margin penalty to the target logit, which obtains better performance compared to SphereFace but admits much easier implementation and relieves the need for joint supervision from the softmax loss.



  • The improvements progressed in the order softmax, SphereFace, CosFace.
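
The three margin styles mentioned above differ only in how they turn the target angle θ into a target logit. The sketch below compares them, assuming normalised features and weights so that the plain logit is cos θ. The margin values are illustrative assumptions, and the SphereFace form cos(mθ) is the simplified version that is only valid for θ in [0, π/m].

```python
import math


def target_logit(theta: float, kind: str, m: float = 0.0) -> float:
    """Target logit for angle theta (radians) under different margin styles."""
    if kind == "softmax":
        return math.cos(theta)          # no margin
    if kind == "sphereface":
        return math.cos(m * theta)      # multiplicative angular margin (simplified form)
    if kind == "cosface":
        return math.cos(theta) - m      # additive cosine margin
    if kind == "arcface":
        return math.cos(theta + m)      # additive angular margin
    raise ValueError(f"unknown kind: {kind}")


# Illustrative margins, not values taken from the text above.
settings = [("softmax", 0.0), ("sphereface", 2.0), ("cosface", 0.35), ("arcface", 0.5)]
for degrees in (10, 30, 60):
    theta = math.radians(degrees)
    row = ", ".join(f"{kind}={target_logit(theta, kind, m):+.3f}" for kind, m in settings)
    print(f"theta={degrees:>2d} deg: {row}")
```

Every margin variant lowers the target logit relative to cos θ, so the network has to shrink θ further to reach the same loss.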

In this paper, we propose an Additive Angular Margin Loss (ArcFace) to further improve the discriminative power of the face recognition model and to stabilise the training process.
As illustrated in Figure 2, the dot product between the DCNN feature and the last fully connected layer is equal to the cosine distance after feature and weight normalisation.
We utilise the arc-cosine function to calculate the angle between the current feature and the target weight.
Afterwards, we add an additive angular margin to the target angle, and we get the target logit back again by the cosine function.

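In short: the paper proposes Additive Angular Margin Loss (ArcFace) to further improve the discriminative power of the face recognition model and to stabilise training.
An additive angular margin is then added to the target angle, and the target logit is recovered through the cosine function.

  • The arc-cosine, i.e. the inverse of the cosine function, is used.
    • It recovers the angle θ from the cosine value.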


Figure 2: Training a DCNN for face recognition supervised by the ArcFace loss. Based on the feature x_i and weight W normalisation, we get the cos θ_j (logit) for each class as W^T_j x_i.
We calculate arccos(cos θ_{y_i}) and get the angle between the feature x_i and the ground truth weight W_{y_i}.
In fact, W_j provides a kind of centre for each class.
Then, we add an angular margin penalty m on the target (ground truth) angle θ_{y_i}.
After that, we calculate cos(θ_{y_i} + m) and multiply all logits by the feature scale s.
The logits then go through the softmax function and contribute to the cross entropy loss.

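With the feature x_i and the weight W normalised, the cos θ_j (logit) of each class is obtained as W^T_j x_i.
Taking arccos(cos θ_{y_i}) gives the angle between the feature x_i and the target (ground truth) weight W_{y_i}.
In effect, W_j provides something like a centre for each class.
The angular margin penalty m is then added to the target (ground truth) angle θ_{y_i}.
Next, cos(θ_{y_i} + m) is computed and all logits are multiplied by the feature scale s.
Finally, the logits pass through the softmax function and contribute to the cross-entropy loss.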


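  • W_j can be read as the vector of a representative point for each class.
  • From the cosine similarity between the feature vector x_i and W_j (the representative vector of each class), the angle between the two vectors is obtained via the arc-cosine function.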
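
The steps in the caption can be sketched for a single sample in plain Python. This is a minimal sketch, not the paper's Algorithm 1: the function names are hypothetical, the defaults s = 64 and m = 0.5 are assumed values, W is stored as a list of per-class vectors W_j, and the case θ + m > π (where cos stops decreasing) is not handled.

```python
import math
from typing import Sequence


def l2_normalise(v: Sequence[float]) -> list:
    """Scale v to unit L2 norm (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def arcface_loss(x, class_weights, label, s=64.0, m=0.5):
    """ArcFace loss of one feature x against per-class weight vectors W_j."""
    x_hat = l2_normalise(x)
    # cos(theta_j) = W_j^T x_i after normalising both the feature and the weights.
    cosines = [sum(a * b for a, b in zip(l2_normalise(w), x_hat)) for w in class_weights]
    # Angle to the ground-truth class; clamp guards against rounding outside [-1, 1].
    theta_y = math.acos(max(-1.0, min(1.0, cosines[label])))
    # Scale all logits by s; only the target logit receives the angular margin m.
    logits = [s * c for c in cosines]
    logits[label] = s * math.cos(theta_y + m)
    # Numerically stable softmax cross-entropy.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]


# A sample that is on the correct side of the boundary, but not by much.
weights = [[1.0, 0.0], [0.0, 1.0]]
sample = [1.0, 0.8]
no_margin = arcface_loss(sample, weights, label=0, m=0.0)
with_margin = arcface_loss(sample, weights, label=0, m=0.5)
print(no_margin, with_margin)
```

With m = 0 the sample is already classified correctly and the loss is close to zero. With the margin the same sample is penalised heavily, which is what pushes features towards their class centre.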

Then, we re-scale all logits by a fixed feature norm, and the subsequent steps are exactly the same as in the softmax loss.
The advantages of the proposed ArcFace can be summarised as follows:


  • Engaging
    ArcFace directly optimises the geodesic distance margin by virtue of the exact correspondence between the angle and arc in the normalised hypersphere.
    We intuitively illustrate what happens in the 512-D space via analysing the angle statistics between features and weights.
  • Effective
    ArcFace achieves state-of-the-art performance on ten face recognition benchmarks including large-scale image and video datasets.
  • Easy
    ArcFace only needs several lines of code as given in Algorithm 1 and is extremely easy to implement in the computational-graph-based deep learning frameworks, e.g. MxNet, Pytorch and Tensorflow.
    Furthermore, contrary to the works in paper 18 and paper 19, ArcFace does not need to be combined with other loss functions in order to have stable performance, and can easily converge on any training datasets.
    In short: a few lines of code, as in Algorithm 1, are enough to implement ArcFace in computational-graph-based deep learning frameworks such as MxNet, Pytorch and Tensorflow.
  • Efficient
    ArcFace only adds negligible computational complexity during training.
    Current GPUs can easily support millions of identities for training and the model parallel strategy can easily support many more identities.


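The most widely used classification loss function, softmax loss, is presented as follows:

L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W^T_{y_i} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W^T_j x_i + b_j}}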


where x_i ∈ R^d denotes the deep feature of the i-th sample, belonging to the y_i-th class.
The embedding feature dimension d is set to 512 in this paper following papers 38, 46, 18, 37.
W_j ∈ R^d denotes the j-th column of the weight W ∈ R^{d \times n} and b_j ∈ R^n is the bias term.

Here x_i ∈ R^d is the deep feature of the i-th sample, which belongs to class y_i.
The embedding dimension d is set to 512, following papers 38, 46, 18, 37.
W_j ∈ R^d is the j-th column of the weight W ∈ R^{d \times n}, and b_j ∈ R^n is the bias term.

The batch size and the class number are N and n, respectively.
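
A minimal sketch of the softmax loss with the symbols defined above, in plain Python. W is stored as a list of its columns W_j (one vector per class). That layout and the function name `softmax_loss` are assumptions of this sketch.

```python
import math
from typing import Sequence


def softmax_loss(
    features: Sequence[Sequence[float]],       # N features x_i, each of dimension d
    labels: Sequence[int],                     # N class indices y_i
    class_weights: Sequence[Sequence[float]],  # n columns W_j, each of dimension d
    biases: Sequence[float],                   # n bias terms b_j
) -> float:
    """Mean cross-entropy of softmax(W^T x_i + b) over a batch of N samples."""
    total = 0.0
    for x, y in zip(features, labels):
        logits = [sum(a * b for a, b in zip(w, x)) + bias
                  for w, bias in zip(class_weights, biases)]
        top = max(logits)  # subtract the maximum for numerical stability
        log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_sum - logits[y]
    return total / len(features)


# Toy check with n = 2 classes: the same feature scored against either label.
weights = [[1.0, 0.0], [0.0, 1.0]]
right = softmax_loss([[2.0, 0.0]], [0], weights, [0.0, 0.0])
wrong = softmax_loss([[2.0, 0.0]], [1], weights, [0.0, 0.0])
print(right, wrong)
```

Note that the logits depend on the lengths of x_i and W_j as well as on their direction, which is what the later normalisation steps remove.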

Traditional softmax loss is widely used in deep face recognition.


However, the softmax loss function does not explicitly optimise the feature embedding to enforce higher similarity for intra-class samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations (e.g. pose variations and age gaps) and large-scale test scenarios (e.g. million or trillion pairs).



  • Weakness of the softmax loss
    • It is not explicitly optimised to make samples within a class highly similar and samples from different classes clearly different.

For simplicity, we fix the bias b_j = 0 as in paper 18.
Then, we transform the logit [26] as W^T_j x_i = ‖W_j‖ ‖x_i‖ cos θ_j, where θ_j is the angle between the weight W_j and the feature x_i.
Following papers 18, 37, 36, we fix the individual weight ‖W_j‖ = 1 by L2 normalisation.

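For simplicity, the bias is fixed to b_j = 0, as in paper 18.
The logit is then rewritten as W^T_j x_i = ‖W_j‖ ‖x_i‖ cos θ_j, where θ_j is the angle between the weight W_j and the feature x_i.
Following papers 18, 37, 36, each weight is fixed to ‖W_j‖ = 1 by L2 normalisation.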
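
The rewrite of the logit as a product of norms and a cosine can be checked numerically. The sketch below uses hypothetical 2-D vectors so that the angle can be measured from polar angles, independently of the dot product. All names and values are illustrative.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


# Hypothetical 2-D weight and feature, both in the first quadrant.
w_j = [3.0, 4.0]
x_i = [1.0, 2.0]

# Angle between the vectors, measured from their polar angles.
theta_j = abs(math.atan2(w_j[1], w_j[0]) - math.atan2(x_i[1], x_i[0]))

logit = dot(w_j, x_i)                                    # W_j^T x_i with b_j = 0
from_angle = norm(w_j) * norm(x_i) * math.cos(theta_j)   # norm * norm * cos(theta_j)

# After L2-normalising the weight, the logit reduces to norm(x_i) * cos(theta_j).
w_unit = [c / norm(w_j) for c in w_j]
normalised_logit = dot(w_unit, x_i)

print(logit, from_angle, normalised_logit)
```

Once the weight norm is fixed to 1, the only remaining quantities in the logit are the feature norm and the angle.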

Following papers 28, 37, 36, 35, we also fix the embedding feature ‖x_i‖ by L2 normalisation and re-scale it to s.
The normalisation step on features and weights makes the predictions only depend on the angle between the feature and the weight.
The learned embedding features are thus distributed on a hypersphere with a radius of s.

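Likewise, following papers 28, 37, 36, 35, the embedding feature norm ‖x_i‖ is fixed by L2 normalisation and then re-scaled by multiplying by s.
As a result, the learned embedding features are distributed on a hypersphere of radius s.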
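
A small sketch of this normalise-and-rescale step, in plain Python. The scale s = 64 is an assumed value, and the function names are hypothetical.

```python
import math


def normalise_and_scale(v, s):
    """L2-normalise v (assumed non-zero) and re-scale it to length s."""
    length = math.sqrt(sum(a * a for a in v))
    return [s * a / length for a in v]


def scaled_logit(x, w, s):
    """Logit s * cos(theta) between feature x and weight w after normalising both."""
    x_hat = normalise_and_scale(x, 1.0)
    w_hat = normalise_and_scale(w, 1.0)
    return s * sum(a * b for a, b in zip(x_hat, w_hat))


s = 64.0  # assumed feature scale
embedded = normalise_and_scale([3.0, 4.0], s)
radius = math.sqrt(sum(a * a for a in embedded))

# The logit ignores the original lengths: scaling the raw feature changes nothing.
short = scaled_logit([3.0, 4.0], [1.0, 0.0], s)
long_ = scaled_logit([30.0, 40.0], [1.0, 0.0], s)
print(radius, short, long_)
```

Every embedded feature ends up at distance s from the origin, so the prediction depends only on the angle between the feature and the weight.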


As the embedding features are distributed around each feature centre on the hypersphere, we add an additive angular margin penalty m between x_i and W_{y_i} to simultaneously enhance the intra-class compactness and inter-class discrepancy.
Since the proposed additive angular margin penalty is equal to the geodesic distance margin penalty in the normalised hypersphere, we name our method as ArcFace.

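Because the embedding features are distributed around each feature centre on the hypersphere (radius s), adding an additive angular margin penalty m between x_i and W_{y_i} strengthens intra-class compactness and inter-class discrepancy at the same time.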


  • The angular margin m is added only for the correct class y_i.
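
Written out, the normalisation step and the margin step give the two losses below. This is a reconstruction from the definitions in the text (bias fixed to 0, ‖W_j‖ = 1, ‖x_i‖ re-scaled to s), labelled L_2 and L_3 here for convenience. It should be checked against the paper before being relied on.

```latex
% Normalised softmax: the logits depend only on the angles \theta_j
L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log
      \frac{e^{s\cos\theta_{y_i}}}
           {e^{s\cos\theta_{y_i}} + \sum_{j=1,\, j\neq y_i}^{n} e^{s\cos\theta_j}}

% ArcFace: the margin m is added to the target angle \theta_{y_i} only
L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log
      \frac{e^{s\cos(\theta_{y_i}+m)}}
           {e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1,\, j\neq y_i}^{n} e^{s\cos\theta_j}}
```

Setting m = 0 in L_3 recovers L_2. On a unit hypersphere the arc length between two points equals the angle between them, so a margin on the angle is also a margin on the geodesic distance.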

We select face images from 8 different identities containing enough samples (around 1,500 images/class) to train 2-D feature embedding networks with the softmax and ArcFace loss, respectively.
As illustrated in Figure 3, the softmax loss provides roughly separable feature embedding but produces noticeable ambiguity in decision boundaries while the proposed ArcFace loss can obviously enforce a more evident gap between the nearest classes.



Figure 3: Toy examples under the softmax and ArcFace loss on 8 identities with 2D features.
Dots indicate samples and lines refer to the centre direction of each identity.
Based on the feature normalisation, all face features are pushed to the arc space with a fixed radius.
The geodesic distance gap between closest classes becomes evident as the additive angular margin penalty is incorporated.


  • The experiment brings the weakness of the softmax loss into relief.
    • It is weak on ambiguous samples.