bi-LSTM + CRF Sequence Tagging

2018-02-15 NLP, LSTM, NER, CNN

This post walks through several recent papers that use BiLSTM + CRF for NER, together with concrete TensorFlow code, to understand the common deep-learning approaches to sequence tagging.

Sequence Tagging

Sequence tagging is one of the fundamental problems in natural language processing. Before deep learning took off, it was typically solved with HMMs, maximum-entropy models, and CRFs; the CRF in particular was the mainstream approach.

With the rise of deep learning, RNNs have achieved strong results on sequence tagging, and end-to-end training has made the whole setup considerably simpler.

Sequence tagging covers word segmentation, part-of-speech tagging, named entity recognition, keyword extraction, semantic role labeling, and so on. As long as we define an appropriate tag set for the task, it can be cast as sequence tagging.
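For instance, NER is commonly cast with the BIO tag set. Here is a toy illustration (the sentence and the PER/LOC entity types are made up for illustration and not taken from any of the papers below):

# Toy NER example with the BIO scheme: B-X starts an entity of type X,
# I-X continues it, and O marks tokens outside any entity.
tokens = ['John', 'lives', 'in', 'New', 'York']
tags   = ['B-PER', 'O', 'O', 'B-LOC', 'I-LOC']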

bi-LSTM + CRF

Paper: Bidirectional LSTM-CRF Models for Sequence Tagging

The classic BiLSTM-CRF architecture is not complicated: the bidirectional LSTM captures how the context at each position (both the preceding and the following text) influences the current state, while the CRF imposes sentence-level constraints on the tag sequence.

(Figure: BiLSTM-CRF model)

Note that the input to the model can be one-hot encodings of the tokens, token embeddings, or corresponding sparse features.

Finally, under the parameters

$$\tilde{\theta}=\theta \cup \{[A]_{i,j} \forall i,j\}$$

(where $\theta$ denotes the network parameters of the LSTM module, $[A]_{i,j}$ is the transition score from the $i$-th tag to the $j$-th tag, and the transition matrix is learned by the CRF layer), the score of a sentence $[x]_1^T$ with tag sequence $[i]_1^T$ is:

$$s([x]_1^T, [i]_1^T, \tilde{\theta})=\sum _{t=1}^T ([A]_{[i]_{t-1},[i]_t}+[f_{\theta}]_{[i]_t,t})$$
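For training, the parameters are fit by maximizing the log-probability of the gold tag sequence, where the probability is the usual CRF softmax over the scores of all possible tag sequences (the normalizer is computed efficiently with dynamic programming):

$$\log p([i]_1^T \mid [x]_1^T) = s([x]_1^T, [i]_1^T, \tilde{\theta}) - \log \sum_{[j]_1^T} \exp\left( s([x]_1^T, [j]_1^T, \tilde{\theta}) \right)$$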

The model is trained as follows:

(Figure: BiLSTM-CRF training procedure)

Below is a snippet of the training code that uses token embeddings as input:

# Note: the `words` below can be made the same length with `dataset.padded_batch` (shorter sequences are padded with a default value)

# Word Embeddings
vocab_words = tf.contrib.lookup.index_table_from_file(
    params['words_filepath'], num_oov_buckets=1)
word_ids = vocab_words.lookup(words)

glove = np.load(params['glove'])['embeddings']  # np.array
variable = np.vstack([glove, [[0.]*params['dim']]])  # append a zero vector as the last row (for the OOV bucket)
variable = tf.Variable(variable, dtype=tf.float32, trainable=False)
embeddings = tf.nn.embedding_lookup(variable, word_ids)
embeddings = tf.layers.dropout(embeddings, rate=dropout, training=training)  # dropout on the embedding layer
# BiLSTM
t = tf.transpose(embeddings, perm=[1, 0, 2])  # transpose to time-major

lstm_cell_fw = tf.contrib.rnn.LSTMBlockFusedCell(params['lstm_size'])  # forward LSTM

lstm_cell_bw = tf.contrib.rnn.LSTMBlockFusedCell(params['lstm_size'])
lstm_cell_bw = tf.contrib.rnn.TimeReversedFusedRNN(lstm_cell_bw)  # backward LSTM (runs over the reversed sequence)

output_fw, _ = lstm_cell_fw(t, dtype=tf.float32, sequence_length=nwords)
output_bw, _ = lstm_cell_bw(t, dtype=tf.float32, sequence_length=nwords)

output = tf.concat([output_fw, output_bw], axis=-1)
# concatenate the forward and backward hidden states along the feature dimension
output = tf.transpose(output, perm=[1, 0, 2])  # transpose back to batch-major
output = tf.layers.dropout(output, rate=dropout, training=training)  # dropout on the BiLSTM output
# CRF
logits = tf.layers.dense(output, num_tags)  # dense projection to one logit per tag
crf_params = tf.get_variable("crf", [num_tags, num_tags], dtype=tf.float32)
pred_ids, _ = tf.contrib.crf.crf_decode(logits, crf_params, nwords)
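Note that the snippet above only shows CRF decoding (`crf_decode`). For training, the loss is the negative CRF log-likelihood of the gold tags; a minimal sketch in the same style (assuming `tags` holds the gold tag ids of shape `[batch, max_len]` and an Adam optimizer) would look like:

# CRF loss: negative log-likelihood of the gold tag sequences
# (`tags` is assumed to hold the gold tag ids, shape [batch, max_len])
log_likelihood, _ = tf.contrib.crf.crf_log_likelihood(
    logits, tags, nwords, crf_params)
loss = tf.reduce_mean(-log_likelihood)

train_op = tf.train.AdamOptimizer().minimize(
    loss, global_step=tf.train.get_or_create_global_step())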

Chars LSTM + bi-LSTM + CRF

Paper: Neural Architectures for Named Entity Recognition

Compared with the classic BiLSTM-CRF architecture, this model additionally incorporates character-level features (for English corpora, the characters are letters). The paper lists the following benefits:

  1. It preserves the morphological information inside words.
  2. It alleviates the out-of-vocabulary (OOV) problem in tasks such as POS tagging, language modeling, and dependency parsing.

The char embeddings are likewise produced by a bi-LSTM, and the raw character embeddings from the lookup table can optionally be added in as well.

(Figure: character embeddings built with a bi-LSTM)

# Char Embeddings
char_ids = vocab_chars.lookup(chars)
variable = tf.get_variable(
    'chars_embeddings', [num_chars + 1, params['dim_chars']], tf.float32)  # +1 row for the OOV bucket, as in the CNN variant below
char_embeddings = tf.nn.embedding_lookup(variable, char_ids)
char_embeddings = tf.layers.dropout(char_embeddings, rate=dropout,
                                    training=training)

# Char LSTM
dim_words = tf.shape(char_embeddings)[1]  # number of words per sentence (after padding)
dim_chars = tf.shape(char_embeddings)[2]  # number of chars per word (after padding)
flat = tf.reshape(char_embeddings, [-1, dim_chars, params['dim_chars']])  # merge batch and word dims
t = tf.transpose(flat, perm=[1, 0, 2])  # time-major over chars
lstm_cell_fw = tf.contrib.rnn.LSTMBlockFusedCell(params['char_lstm_size'])
lstm_cell_bw = tf.contrib.rnn.LSTMBlockFusedCell(params['char_lstm_size'])
lstm_cell_bw = tf.contrib.rnn.TimeReversedFusedRNN(lstm_cell_bw)
# keep only the final hidden state h of each direction
_, (_, output_fw) = lstm_cell_fw(t, dtype=tf.float32,
                                 sequence_length=tf.reshape(nchars, [-1]))
_, (_, output_bw) = lstm_cell_bw(t, dtype=tf.float32,
                                 sequence_length=tf.reshape(nchars, [-1]))
output = tf.concat([output_fw, output_bw], axis=-1)  # optionally, the raw char embeddings from the lookup table could also be concatenated here
char_embeddings = tf.reshape(output, [-1, dim_words, 2 * params['char_lstm_size']])  # fw + bw final states per word

After computing the char embeddings, they are concatenated with the word embeddings, and the result is fed into the bi-LSTM:

# Word Embeddings
word_ids = vocab_words.lookup(words)
glove = np.load(params['glove'])['embeddings']  # np.array
variable = np.vstack([glove, [[0.] * params['dim']]])
variable = tf.Variable(variable, dtype=tf.float32, trainable=False)
word_embeddings = tf.nn.embedding_lookup(variable, word_ids)

# Concatenate Word and Char Embeddings
embeddings = tf.concat([word_embeddings, char_embeddings], axis=-1)
embeddings = tf.layers.dropout(embeddings, rate=dropout, training=training)

# LSTM
t = tf.transpose(embeddings, perm=[1, 0, 2])  # Need time-major
lstm_cell_fw = tf.contrib.rnn.LSTMBlockFusedCell(params['lstm_size'])
lstm_cell_bw = tf.contrib.rnn.LSTMBlockFusedCell(params['lstm_size'])
lstm_cell_bw = tf.contrib.rnn.TimeReversedFusedRNN(lstm_cell_bw)
output_fw, _ = lstm_cell_fw(t, dtype=tf.float32, sequence_length=nwords)
output_bw, _ = lstm_cell_bw(t, dtype=tf.float32, sequence_length=nwords)
output = tf.concat([output_fw, output_bw], axis=-1)
output = tf.transpose(output, perm=[1, 0, 2])
output = tf.layers.dropout(output, rate=dropout, training=training)

# CRF
logits = tf.layers.dense(output, num_tags)
crf_params = tf.get_variable("crf", [num_tags, num_tags], dtype=tf.float32)
pred_ids, _ = tf.contrib.crf.crf_decode(logits, crf_params, nwords)

Chars CNN + bi-LSTM + CRF

Paper: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF

This paper came out around the same time as the previous one and the idea is largely the same, except that the previous paper uses a bi-LSTM for the character-level embeddings whereas this one uses a CNN.

The paper notes that a CNN can effectively capture the morphological features of words, with much the same goal as the previous approach.

The CNN is structured as follows:

(Figure: char-level CNN)

from functools import reduce  # needed for the shape computation below

def masked_conv1d_and_max(t, weights, filters, kernel_size):
    """Applies 1d convolution and a masked max-pooling

    Parameters
    ----------
    t : tf.Tensor
        A tensor with at least 3 dimensions [d1, d2, ..., dn-1, dn]
    weights : tf.Tensor of tf.bool
        A Tensor of shape [d1, d2, ..., dn-1]
    filters : int
        number of filters
    kernel_size : int
        kernel size for the temporal convolution

    Returns
    -------
    tf.Tensor
        A tensor of shape [d1, d2, ..., dn-1, filters]

    """
    # Get shape and parameters
    shape = tf.shape(t)
    ndims = t.shape.ndims
    dim1 = reduce(lambda x, y: x*y, [shape[i] for i in range(ndims - 2)])
    dim2 = shape[-2]
    dim3 = t.shape[-1]

    # Reshape weights
    weights = tf.reshape(weights, shape=[dim1, dim2, 1])  # acts as a mask, obtained from tf.sequence_mask on the word lengths (number of chars per word)
    weights = tf.to_float(weights)

    # Reshape input and apply weights
    flat_shape = [dim1, dim2, dim3]
    t = tf.reshape(t, shape=flat_shape)
    t *= weights  # zero out the embeddings at <pad> positions

    # Apply convolution
    t_conv = tf.layers.conv1d(t, filters, kernel_size, padding='same')  # (dim1, dim2, filters)
    t_conv *= weights  # zero out the conv outputs at <pad> positions

    # Reduce max -- set to zero if all padded

    # the right-hand side broadcasts: (dim1, dim2, 1) * (dim1, 1, filters)
    # so the addition is: (dim1, dim2, filters) + (dim1, dim2, filters)
    t_conv += (1. - weights) * tf.reduce_min(t_conv, axis=-2, keepdims=True)  # fill <pad> positions with the column-wise min so they can never win the max-pooling

    t_max = tf.reduce_max(t_conv, axis=-2)  # max-pooling over the char dimension, (dim1, filters)

    # Reshape the output
    final_shape = [shape[i] for i in range(ndims-2)] + [filters]  # [d1, d2, ..., dn-1, filters]
    t_max = tf.reshape(t_max, shape=final_shape)

    return t_max

The full training snippet is as follows:

# Char Embeddings
char_ids = vocab_chars.lookup(chars)
variable = tf.get_variable(
    'chars_embeddings', [num_chars + 1, params['dim_chars']], tf.float32)
char_embeddings = tf.nn.embedding_lookup(variable, char_ids)  # trainable = True
char_embeddings = tf.layers.dropout(char_embeddings, rate=dropout,
                                    training=training)

# Char 1d convolution
weights = tf.sequence_mask(nchars)  # boolean mask over the char positions of each word
char_embeddings = masked_conv1d_and_max(
    char_embeddings, weights, params['filters'], params['kernel_size'])

# Word Embeddings
word_ids = vocab_words.lookup(words)
glove = np.load(params['glove'])['embeddings']  # np.array
variable = np.vstack([glove, [[0.] * params['dim']]])
variable = tf.Variable(variable, dtype=tf.float32, trainable=False)
word_embeddings = tf.nn.embedding_lookup(variable, word_ids)

# Concatenate Word and Char Embeddings
embeddings = tf.concat([word_embeddings, char_embeddings], axis=-1)
embeddings = tf.layers.dropout(embeddings, rate=dropout, training=training)

# LSTM
t = tf.transpose(embeddings, perm=[1, 0, 2])  # Need time-major
lstm_cell_fw = tf.contrib.rnn.LSTMBlockFusedCell(params['lstm_size'])
lstm_cell_bw = tf.contrib.rnn.LSTMBlockFusedCell(params['lstm_size'])
lstm_cell_bw = tf.contrib.rnn.TimeReversedFusedRNN(lstm_cell_bw)
output_fw, _ = lstm_cell_fw(t, dtype=tf.float32, sequence_length=nwords)
output_bw, _ = lstm_cell_bw(t, dtype=tf.float32, sequence_length=nwords)
output = tf.concat([output_fw, output_bw], axis=-1)
output = tf.transpose(output, perm=[1, 0, 2])
output = tf.layers.dropout(output, rate=dropout, training=training)

# CRF
logits = tf.layers.dense(output, num_tags)
crf_params = tf.get_variable("crf", [num_tags, num_tags], dtype=tf.float32)
pred_ids, _ = tf.contrib.crf.crf_decode(logits, crf_params, nwords)

Summary

This post covered the classic bi-LSTM + CRF architecture for sequence tagging and a few variants built on top of it.

The variants mainly differ in how they extract character-level features. My feeling is that this helps more for languages like English (a Chinese word usually contains only a few characters), and it also mitigates part of the OOV problem.

Comparing the three methods on the CoNLL-2003 shared task, the two that add char embeddings do bring some improvement, but the gain is not dramatic; detailed results can be found at guillaumegenthial/tf_ner.

Finally, here is the advice from guillaumegenthial/tf_ner on choosing among these three models:

Also, considering the complexity of the models and the relatively small gap in performance (0.6 F1), using the lstm_crf model is probably a safe bet for most of the concrete applications.