# Deriving Self-Attention by Hand (with Code)

## Worked Example

1. Prepare the inputs
2. Initialize the weights
3. Derive the key, query and value representations
4. Calculate the attention scores for Input 1
5. Calculate the softmax
6. Multiply the attention scores by the values
7. Sum the weighted values to get Output 1
8. Repeat steps 4–7 for Input 2
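The eight steps above can be sketched end to end in a few lines of PyTorch (a minimal sketch; the inputs and weights are the ones used throughout this walkthrough, and the final row values differ slightly from the hand-derived output only because the walkthrough rounds the softmax weights):

```python
import torch
from torch.nn.functional import softmax

def self_attention(x, w_key, w_query, w_value):
    # Step 3: derive key, query and value representations
    keys = x @ w_key
    querys = x @ w_query
    values = x @ w_value
    # Step 4: attention scores are dot products of queries with keys
    attn_scores = querys @ keys.T
    # Step 5: softmax each row of scores
    attn = softmax(attn_scores, dim=-1)
    # Steps 6-7: attention-weighted sum of the values
    return attn @ values

x = torch.tensor([[1., 0., 1., 0.],   # Input 1
                  [0., 2., 0., 2.],   # Input 2
                  [1., 1., 1., 1.]])  # Input 3
w_key = torch.tensor([[0., 0., 1.], [1., 1., 0.], [0., 1., 0.], [1., 1., 0.]])
w_query = torch.tensor([[1., 0., 1.], [1., 0., 0.], [0., 0., 1.], [0., 1., 1.]])
w_value = torch.tensor([[0., 2., 0.], [0., 3., 0.], [1., 0., 3.], [1., 1., 0.]])

out = self_attention(x, w_key, w_query, w_value)
# out[0] is close to the hand-derived Output 1, [2.0, 7.0, 1.5]
```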

1 Prepare inputs
Fig. 1.1: Prepare inputs

```
Input 1: [1, 0, 1, 0]
Input 2: [0, 2, 0, 2]
Input 3: [1, 1, 1, 1]
```

2 Initialize weights

Note:

We'll see later that the dimension of the values is also the dimension of the output.
Fig. 1.2: Deriving key, query and value representations from each input

Weights for key:

```
[[0, 0, 1],
 [1, 1, 0],
 [0, 1, 0],
 [1, 1, 0]]
```

Weights for query:

```
[[1, 0, 1],
 [1, 0, 0],
 [0, 0, 1],
 [0, 1, 1]]
```

Weights for value:

```
[[0, 2, 0],
 [0, 3, 0],
 [1, 0, 3],
 [1, 1, 0]]
```
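In a real network these projection weights are learned parameters, typically initialized with small random values rather than hand-picked integers like the ones above. A minimal sketch (the 4 → 3 dimensions match the toy example; the choice of Xavier initialization is illustrative, not from the walkthrough):

```python
import torch

d_in, d_out = 4, 3  # toy-example dimensions: 4-dim inputs, 3-dim key/query/value

# In a trained model these are learned; here we only sketch the usual
# starting point: small random values (Xavier init is one common choice)
w_key = torch.nn.init.xavier_uniform_(torch.empty(d_in, d_out))
w_query = torch.nn.init.xavier_uniform_(torch.empty(d_in, d_out))
w_value = torch.nn.init.xavier_uniform_(torch.empty(d_in, d_out))
```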

3 Derive key, query and value representations from each input

```
               [0, 0, 1]
[1, 0, 1, 0] x [1, 1, 0] = [0, 1, 1]
               [0, 1, 0]
               [1, 1, 0]
```

```
               [0, 0, 1]
[0, 2, 0, 2] x [1, 1, 0] = [4, 4, 0]
               [0, 1, 0]
               [1, 1, 0]
```

```
               [0, 0, 1]
[1, 1, 1, 1] x [1, 1, 0] = [2, 3, 1]
               [0, 1, 0]
               [1, 1, 0]
```

```
               [0, 0, 1]
[1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
[0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]
```

Fig. 1.3a: Derive key representations from each input

```
               [0, 2, 0]
[1, 0, 1, 0]   [0, 3, 0]   [1, 2, 3]
[0, 2, 0, 2] x [1, 0, 3] = [2, 8, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 6, 3]
```

Fig. 1.3b: Derive value representations from each input

```
               [1, 0, 1]
[1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
[0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
[1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]
```

Fig. 1.3c: Derive query representations from each input

4 Calculate attention scores for Input 1

```
            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]
```

Fig. 1.4: Calculating attention scores (blue) from query 1

5 Calculate softmax

Fig. 1.5: Softmax the attention scores (blue)

```
softmax([2, 4, 4]) = [0.0, 0.5, 0.5]
```
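The values [0.0, 0.5, 0.5] are a rounded approximation for readability; the exact softmax can be checked in a couple of lines:

```python
import torch
from torch.nn.functional import softmax

# Exact softmax of the scores; roughly [0.0634, 0.4683, 0.4683],
# which the walkthrough rounds to [0.0, 0.5, 0.5]
weights = softmax(torch.tensor([2., 4., 4.]), dim=-1)
```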

6 Multiply the attention scores by the values

Fig. 1.6: Derive weighted value representation (yellow) from multiply value(purple) and score (blue)

```
1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]
```

7 Sum the weighted values to get Output 1

Fig. 1.7: Sum all weighted values (yellow) to get Output 1 (dark green)

```
  [0.0, 0.0, 0.0]
+ [1.0, 4.0, 0.0]
+ [1.0, 3.0, 1.5]
-----------------
= [2.0, 7.0, 1.5]
```

8 Repeat steps 4–7 for Input 2

Fig. 1.8: Repeat previous steps for Input 2 & Input 3

## Code

Step 1: Prepare inputs

```python
import torch

x = [
  [1, 0, 1, 0], # Input 1
  [0, 2, 0, 2], # Input 2
  [1, 1, 1, 1]  # Input 3
]
x = torch.tensor(x, dtype=torch.float32)
```

Step 2: Initialize weights

```python
w_key = [
  [0, 0, 1],
  [1, 1, 0],
  [0, 1, 0],
  [1, 1, 0]
]
w_query = [
  [1, 0, 1],
  [1, 0, 0],
  [0, 0, 1],
  [0, 1, 1]
]
w_value = [
  [0, 2, 0],
  [0, 3, 0],
  [1, 0, 3],
  [1, 1, 0]
]
w_key = torch.tensor(w_key, dtype=torch.float32)
w_query = torch.tensor(w_query, dtype=torch.float32)
w_value = torch.tensor(w_value, dtype=torch.float32)
```

Step 3: Derive key, query and value representations

```python
keys = x @ w_key
querys = x @ w_query
values = x @ w_value

print(keys)
# tensor([[0., 1., 1.],
#         [4., 4., 0.],
#         [2., 3., 1.]])

print(querys)
# tensor([[1., 0., 2.],
#         [2., 2., 2.],
#         [2., 1., 3.]])

print(values)
# tensor([[1., 2., 3.],
#         [2., 8., 0.],
#         [2., 6., 3.]])
```

Step 4: Calculate attention scores

```python
attn_scores = querys @ keys.T
# tensor([[ 2.,  4.,  4.],  # attention scores from Query 1
#         [ 4., 16., 12.],  # attention scores from Query 2
#         [ 4., 12., 10.]]) # attention scores from Query 3
```
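One detail worth noting: the Transformer paper divides these scores by the square root of the key dimension before the softmax (scaled dot-product attention), which this walkthrough omits for simplicity. A sketch of the scaled variant, using the query and key values derived above:

```python
import torch

querys = torch.tensor([[1., 0., 2.], [2., 2., 2.], [2., 1., 3.]])
keys = torch.tensor([[0., 1., 1.], [4., 4., 0.], [2., 3., 1.]])

# Scaling by sqrt(d_k) keeps the dot products from growing with the key
# dimension, which would otherwise push the softmax into a saturated,
# low-gradient regime
d_k = keys.shape[-1]
attn_scores_scaled = (querys @ keys.T) / d_k ** 0.5
```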

Step 5: Calculate softmax

```python
from torch.nn.functional import softmax

attn_scores_softmax = softmax(attn_scores, dim=-1)
# tensor([[6.3379e-02, 4.6831e-01, 4.6831e-01],
#         [6.0337e-06, 9.8201e-01, 1.7986e-02],
#         [2.9539e-04, 8.8054e-01, 1.1917e-01]])

# For readability, approximate the above as follows
attn_scores_softmax = [
  [0.0, 0.5, 0.5],
  [0.0, 1.0, 0.0],
  [0.0, 0.9, 0.1]
]
attn_scores_softmax = torch.tensor(attn_scores_softmax)
```

Step 6: Multiply attention scores by values

```python
weighted_values = values[:,None] * attn_scores_softmax.T[:,:,None]
# tensor([[[0.0000, 0.0000, 0.0000],
#          [0.0000, 0.0000, 0.0000],
#          [0.0000, 0.0000, 0.0000]],
#
#         [[1.0000, 4.0000, 0.0000],
#          [2.0000, 8.0000, 0.0000],
#          [1.8000, 7.2000, 0.0000]],
#
#         [[1.0000, 3.0000, 1.5000],
#          [0.0000, 0.0000, 0.0000],
#          [0.2000, 0.6000, 0.3000]]])
```

Step 7: Sum weighted values to get the output

```python
outputs = weighted_values.sum(dim=0)
# tensor([[2.0000, 7.0000, 1.5000],  # Output 1
#         [2.0000, 8.0000, 0.0000],  # Output 2
#         [2.0000, 7.8000, 0.3000]]) # Output 3
```
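The broadcasting in Step 6 keeps the per-input weighted values visible, but Steps 6 and 7 together are equivalent to a single matrix multiplication. A compact alternative, reusing the rounded scores and the values from above:

```python
import torch

attn_scores_softmax = torch.tensor([[0.0, 0.5, 0.5],
                                    [0.0, 1.0, 0.0],
                                    [0.0, 0.9, 0.1]])
values = torch.tensor([[1., 2., 3.], [2., 8., 0.], [2., 6., 3.]])

# Each output row is the attention-weighted sum of the value rows,
# collapsing Steps 6 and 7 into one matmul
outputs = attn_scores_softmax @ values
# tensor([[2.0000, 7.0000, 1.5000],
#         [2.0000, 8.0000, 0.0000],
#         [2.0000, 7.8000, 0.3000]])
```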

Note:

PyTorch provides an API for this called `nn.MultiheadAttention`. However, this API requires that you feed in key, query and value PyTorch tensors. Moreover, the outputs of this module undergo a further linear transformation.
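A brief sketch of that API (note that `nn.MultiheadAttention` applies its own randomly initialized input and output projections, so its numbers will not match the hand-derived values above; the `batch_first=True` layout is an illustrative choice):

```python
import torch
from torch import nn

embed_dim, num_heads = 4, 1
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.tensor([[[1., 0., 1., 0.],
                   [0., 2., 0., 2.],
                   [1., 1., 1., 1.]]])  # (batch, seq_len, embed_dim)

# Self-attention: the same tensor serves as query, key and value
attn_output, attn_weights = mha(x, x, x)
# attn_output: (1, 3, 4); attn_weights: (1, 3, 3)
```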

Step 8: Repeat steps 4–7 for Input 2

Because Steps 4–7 above operate on all three inputs at once in matrix form, the outputs for Inputs 2 and 3 have already been computed.

## Extending to Transformers

Within the self-attention module:

• Dimension
• Bias

Inputs to the self-attention module:

• Embedding module
• Positional encoding
• Truncating
• Layer stacking

Modules between self-attention modules:

• Linear transformations
• LayerNorm
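In practice, the key, query and value projections are implemented as `nn.Linear` layers, which bundle the weight matrix with an optional bias term, and the projection dimension is a model hyperparameter. A minimal sketch of a self-attention module along these lines (an illustrative implementation with scaled scores, not code from the walkthrough above):

```python
import torch
from torch import nn
from torch.nn.functional import softmax

class SelfAttention(nn.Module):
    """Single-head self-attention with learned projections and bias."""

    def __init__(self, d_in, d_attn):
        super().__init__()
        # nn.Linear holds the projection weight matrix plus a bias term
        self.to_key = nn.Linear(d_in, d_attn)
        self.to_query = nn.Linear(d_in, d_attn)
        self.to_value = nn.Linear(d_in, d_attn)

    def forward(self, x):
        k, q, v = self.to_key(x), self.to_query(x), self.to_value(x)
        # Scaled dot-product attention over the sequence dimension
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        return softmax(scores, dim=-1) @ v

attn = SelfAttention(d_in=4, d_attn=3)
out = attn(torch.randn(3, 4))  # 3 inputs of dimension 4 -> shape (3, 3)
```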

