import math

import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    """Construct a layernorm module (see citation for details). Layer normalization."""

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        # Normalize over the last (feature) dimension, then rescale and shift
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2


class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note: for code simplicity the norm is applied first (pre-norm) rather than last.
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply a residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))


class PositionwiseFeedForward(nn.Module):
    "Implements the FFN equation."

    def __init__(self, d_model, d_ff, dropout=0.1):
        """
        :param d_model: dimensionality of the token embeddings
        :param d_ff: hidden size of the feed-forward inner layer
        :param dropout: dropout rate
        """
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = GELU()

    def forward(self, x):
        return self.w_2(self.dropout(self.activation(self.w_1(x))))


class TransformerBlock(nn.Module):
    """
    Bidirectional Encoder = Transformer (self-attention)
    Transformer = MultiHead_Attention + Feed_Forward with sublayer connection
    """

    def __init__(self, hidden, attn_heads, feed_forward_hidden, dropout):
        """
        :param hidden: hidden size of transformer
        :param attn_heads: number of heads in multi-head attention
        :param feed_forward_hidden: feed-forward hidden size, usually 4*hidden_size
        :param dropout: dropout rate
        """
        super(TransformerBlock, self).__init__()
        self.attention = MultiHeadedAttention(h=attn_heads, d_model=hidden)
        self.feed_forward = PositionwiseFeedForward(d_model=hidden, d_ff=feed_forward_hidden, dropout=dropout)
        self.input_sublayer = SublayerConnection(size=hidden, dropout=dropout)
        self.output_sublayer = SublayerConnection(size=hidden, dropout=dropout)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, mask):
        x = self.input_sublayer(x, lambda _x: self.attention.forward(_x, mask=mask))
        x = self.output_sublayer(x, self.feed_forward)
        return self.dropout(x)
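To see how the pre-norm residual wrapper behaves, here is a minimal sanity-check sketch (the tensor shapes and dropout rate are illustrative assumptions, not values from the code above): it wraps an identity function as the "sublayer" and confirms that the residual path preserves the input shape.

import torch

sublayer_block = SublayerConnection(size=768, dropout=0.1)
x = torch.randn(2, 16, 768)            # (batch_size, seq_len, hidden)
out = sublayer_block(x, lambda t: t)   # identity sublayer: out = x + dropout(norm(x))
print(out.shape)                       # torch.Size([2, 16, 768])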
Activation function
BERT uses the GELU activation function in place of ReLU.
class GELU(nn.Module):
    """
    Paper Section 3.4, last paragraph: note that BERT uses GELU instead of ReLU.
    """

    def forward(self, x):
        # Tanh approximation of the Gaussian Error Linear Unit
        return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
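This is the tanh approximation of GELU. As a quick check (assuming a recent PyTorch version where F.gelu accepts approximate='tanh'), the hand-written module should closely match the built-in implementation:

import torch
import torch.nn.functional as F

x = torch.randn(8)
gelu = GELU()
# Compare the hand-written tanh approximation with PyTorch's built-in one.
print(torch.allclose(gelu(x), F.gelu(x, approximate='tanh'), atol=1e-6))  # True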
BERT network code
class BERT(nn.Module):
    """
    BERT model : Bidirectional Encoder Representations from Transformers.
    """

    def __init__(self, vocab_size, hidden=768, n_layers=12, attn_heads=12, dropout=0.1):
        """
        :param vocab_size: size of the total vocabulary
        :param hidden: BERT model hidden size
        :param n_layers: number of Transformer blocks (layers)
        :param attn_heads: number of attention heads
        :param dropout: dropout rate
        """
        super(BERT, self).__init__()
        self.hidden = hidden
        self.n_layers = n_layers
        self.attn_heads = attn_heads

        # paper noted they used 4*hidden_size for ff_network_hidden_size
        self.feed_forward_hidden = hidden * 4

        # embedding for BERT, sum of positional, segment, token embeddings
        self.embedding = BERTEmbedding(vocab_size=vocab_size, d_model=hidden)

        # multi-layer transformer blocks, deep network
        self.transformer_blocks = nn.ModuleList(
            [TransformerBlock(hidden, attn_heads, hidden * 4, dropout) for _ in range(n_layers)])

    def forward(self, x, segment_info):
        # attention masking for padded tokens
        # mask shape: torch.ByteTensor([batch_size, 1, seq_len, seq_len])
        mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)

        # embed the indexed sequence into a sequence of vectors
        x = self.embedding(x, segment_info)

        # run over multiple transformer blocks
        for transformer in self.transformer_blocks:
            x = transformer.forward(x, mask)

        return x
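A minimal forward-pass sketch follows, assuming the BERTEmbedding and MultiHeadedAttention modules from the earlier parts of this post are in scope; the small vocabulary, hidden size, and sequence length are illustrative, not the paper's values.

import torch

vocab_size = 1000
model = BERT(vocab_size=vocab_size, hidden=256, n_layers=2, attn_heads=4)

tokens = torch.randint(1, vocab_size, (2, 20))   # (batch_size, seq_len); 0 is treated as padding
segments = torch.randint(1, 3, (2, 20))          # segment ids (1 or 2)
out = model(tokens, segments)
print(out.shape)                                 # torch.Size([2, 20, 256])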