条件随机场

条件随机场(conditional random field,CRF)是给定一组输入随机变量条件下另一组输出随机变量的条件概率分布模型,并且假设输出随机变量构成马尔可夫随机场(概率无向图模型)
条件随机场定义:设$X$和$Y$是随机变量,$P(Y|X)$是在给定$X$的条件下$Y$的条件概率分布。若随机变量$Y$构成一个由无向图$G=(V,E)$表示的马尔可夫随机场,即式(1)
$$\begin{equation}P(Y_v|X,Y_w,w\neq v) = P(Y_v|X,Y_w,w\sim v)\tag{1}\end{equation}$$
对任意结点$v$成立,则条件概率分布$P(Y|X)$为条件随机场。式(1)中,$w \neq v$表示结点$v$以外的所有结点,$w\sim v$表示在图$G=(V,E)$中与结点$v$有边连接的所有结点$w$,$Y_v$,$Y_u$和$Y_w$为结点$v$,$u$和$w$对应的随机变量。

线性链条件随机场定义:设$X=(X_1,X_2,\cdots,X_n)$,$Y=(Y_1,Y_2,\cdots,Y_n)$均为线性链表示的随机变量序列,若在给定随机变量序列$X$的条件下,随机变量序列$Y$的条件概率分布构成条件随机场,即满足式(2),则称$P(Y|X)$为线性链条件随机场。
$$\begin{equation}P(Y_i|X,Y_1,\cdots,Y_{i-1},Y_{i+1},\cdots,Y_n)=P(Y_i|X,Y_{i-1},Y_{i+1})\tag{2}\end{equation}$$
其中,$i=1,\cdots,n$,当$i=1$或者$i=n$时,只考虑单边。

团与最大团定义:无向图中任意两个结点之间均有边连接的结点子集称为团(clique)。若$C$是无向图$G$的一个团,并且不能再加入任何一个$G$中的结点使其称为一个更大的团,则称$C$为最大团(maximul clique)。
将概率无向图的联合概率分布表示为其最大团上的随机变量的函数的乘积形式的操作,称为概率无向图模型的因子分解(factorization)。设无向图为$G$,$C$为$G$上的最大团,$Y_C$表示$C$对应的随机变量,则概率无向图模型的联合概率分布$P(Y)$可以写作所有最大团$C$上的函数$\Psi_C(Y_C)$的乘积形式,即
$$\begin{equation}P(Y)=\frac{1}{Z}\Pi_C \Psi_C(Y_C)\tag{3}\end{equation}$$
其中,$Z$为规范化因子,$Z=\sum_{Y}\Pi_C\Psi_C(Y_C)$,$\Psi_C(Y_C)$称为势函数(potential function),要求是严格正的,通常定义为指数函数,即$\Psi_C(Y_C)=exp(-E(Y_C))$。
根据概率无向图的因子分解,得出线性链条件随机场$P(Y|X)$的因子分解式,当无向图为线性链形式是,最大团的大小为2,因此因子分解式中的因子是定义在相邻两个结点上的函数

原始形式

设$P(Y|X)$为线性链条件随机场,则在随机变量$X$的取值为x的条件下,随机变量$Y$取值为$y$的条件概率如下,
$$\begin{equation}P(y|x)=\frac{1}{Z}\Pi_{t=1}^{T}\psi(y_{t-1},y_t,x)\tag{4}\end{equation}$$
其中$x=(x_1,\cdots,x_n)$为输入的观测序列,状态序列(输出序列)$y=(y_1,\cdots, y_T)$,$\psi(y_{t-1},y_t,x)\geq 0$为势函数。对势函数进行拆分,假定$\psi(y_{t-1},y_t,x)=\exp(\Delta_{y_{t-1},y_t,x} + \Delta_{y_t,x})$,令$\Delta_{y_{t-1},y_t,x}=\sum_{k}\lambda_k f_{k}^{transition}(y_{t-1},y_t,x,t)$,令$\Delta_{y_t,x}=\sum_{l}\mu_lf_l^{state}(y_t,x,t)$,则$P(y|x)$可以写为,
$$\begin{equation}P(y|x)=\frac{1}{Z(x)}\exp(\sum_{t,k}\lambda_k f_{k}^{transition}(y_{t-1},y_t,x,t)+\sum_{t,l}\mu_lf_l^{state}(y_t,x,t))\tag{5}\end{equation}$$
其中,$Z(x)$为规范化因子,$Z(x)=\sum_{y}\exp(\sum_{t,k}\lambda_k f_{k}^{transition}(y_{t-1},y_t,x,t)+\sum_{t,l}\mu_lf_l^{state}(y_t,x,t))$,$f_{k}^{transition}$为转移特征函数,$f_{l}^{state}$为状态特征函数。在传统的做法中,状态特征函数和转移特征函数是通过人工特征工程的方式设计的,它们的取值为1或0,即满足特征条件时取值为1,否则为0。

内积形式

对式(5)取对数,
$$\begin{equation}\begin{aligned}\log P(y|x)&=\sum_{t,k}\lambda_k f_{k}^{transition}(y_{t-1},y_t,x,t)+\sum_{t,l}\mu_lf_l^{state}(y_t,x,t)-\log Z(x)\\&=\sum_{t}(\sum_{i=1}^{K+L}w_i f_i(y_{t-1},y_t,x, t))-\log Z(x)\end{aligned}\tag{6}\end{equation}$$
其中,为了简便,对转移特征和状态特征用统一的符号表示,

$$ \begin{equation}f_i(y_{t-1},y_t,x,t) = \begin{cases} f^{transition}_i(y_{t-1},y_t,x,t),i=1,\cdots,K \\ f^{state}_l(y_t,x,t), i=K+l;l=1,\cdots,L \end{cases}\tag{7}\end{equation} $$

$$ \begin{equation}w_i = \begin{cases} \lambda_i,i=1,\cdots,K \\ \mu_l, i=K+l;l=1,\cdots,L \end{cases}\tag{8}\end{equation} $$

令$f_i(y,x)=\sum_{t=1}^{T}f_i(y_{t-1},y_t,x,t)$,$F(y,x)=(f_1(y,x),\cdots,f_{K+L}(y,x))^T$,$w=(w_1,\cdots,w_{K+L})^T$。
则条件随机场的对数形式$\log P(y|x)$可以写为$w$和$F$内积的形式
$$\begin{equation}\log P(y|x)=w\cdot F(y,x) - \log Z(x)\tag{9}\end{equation}$$

引入神经网络

式(4)的对数形式为,
$$\begin{equation}\log P(y|x)=\sum_{t=1}^T\log(\psi(y_{t-1},y_t,x_t)) - \log Z(x)\tag{10}\end{equation}$$
其中,$\psi(y_{t-1},y_t,x)$可以理解为根据$y_{t-1},y_t,x$提取的特征的加权和,这一部分可以使用神经网络代替人工特征工程。
引入神经网络表示转移特征函数$f^{transition}$和状态特征函数$f^{state}$,可以将条件随机场的对数形式简化为,
$$\begin{equation}\log P(y|x)=\sum_{t}f^{transition}(y_{t-1},y_t,x)+\sum_{t}f^{state}(y_t,x)-\log Z(x)\tag{11}\end{equation}$$
考虑到当前深度学习模型中,RNN或者层叠CNN或是Bert等模型已经能够使得$f^{state}$充分地捕捉各个y与输出x的联系,因此,不妨考虑转移特征函数$f^{transition}$与x无关,那么可以进一步简化为,
$$\begin{equation}\log P(y|x)=\underbrace{\sum_{t}f^{transition}(y_{t-1},y_t)}_{(a)}+\underbrace{\sum_{t}f^{state}(y_t,x)}_{(b)}-\underbrace{\log Z(x)}_{(c)}\tag{12}\end{equation}$$
对上式取负,使用极大似然法,求解参数。
因此,损失函数的计算分为三部分,分别对应式中的(a)、(b)和(c)。

一元概率
def crf_unary_score(tag_indices, sequence_lengths, inputs):
  """Computes the unary scores of tag sequences.

  Args:
    tag_indices: A [batch_size, max_seq_len] matrix of tag indices.
    sequence_lengths: A [batch_size] vector of true sequence lengths.
    inputs: A [batch_size, max_seq_len, num_tags] tensor of unary potentials.
  Returns:
    unary_scores: A [batch_size] vector of unary scores.
  """
  batch_size = array_ops.shape(inputs)[0]
  max_seq_len = array_ops.shape(inputs)[1]
  num_tags = array_ops.shape(inputs)[2]

  flattened_inputs = array_ops.reshape(inputs, [-1])

  offsets = array_ops.expand_dims(
      math_ops.range(batch_size) * max_seq_len * num_tags, 1)
  offsets += array_ops.expand_dims(math_ops.range(max_seq_len) * num_tags, 0)
  # Use int32 or int64 based on tag_indices' dtype.
  if tag_indices.dtype == dtypes.int64:
    offsets = math_ops.cast(offsets, dtypes.int64)
  flattened_tag_indices = array_ops.reshape(offsets + tag_indices, [-1])

  unary_scores = array_ops.reshape(
      array_ops.gather(flattened_inputs, flattened_tag_indices),
      [batch_size, max_seq_len])

  masks = array_ops.sequence_mask(sequence_lengths,
                                  maxlen=array_ops.shape(tag_indices)[1],
                                  dtype=dtypes.float32)

  unary_scores = math_ops.reduce_sum(unary_scores * masks, 1)
  return unary_scores

crf_unary_score计算一元概率,输出一个维度为[batch_size]的向量,向量中的每一个元素是一个sequence中所有的真实标签概率之和。对应式中的(b)项。输入inputs的维度为[batch_size, max_seq_len, num_tags],一般为BiLSTM或者是Bert的输出,num_tags为标签数,inputsi[k]表示的是,当前batch中的第i个序列,其位置j对应的输出属于类别k的概率。flattened_inputs将input变为一维, tag_indices为真实的标签,维度为[batch_size, max_seq_len],offsets的维度为[batch_size, max_seq_len], offsetsi = (batch_size i + max_seq_len j) * num_tags,因此,offsets + tag_indices中(i,j)位置的值表示的是,当前batch中的第i个序列,其位置j对应的输出属于类别tag_indicesi的概率位于flatten_inputs,拉平后得到 flattened_tag_indices,通过gather函数,得到所有的一元概率。最后,使用sequence_mask根据序列的真是长度,生成mask,得到所有有效的一元概率之和unary_scores

二元概率
def crf_binary_score(tag_indices, sequence_lengths, transition_params):
  """Computes the binary scores of tag sequences.

  Args:
    tag_indices: A [batch_size, max_seq_len] matrix of tag indices.
    sequence_lengths: A [batch_size] vector of true sequence lengths.
    transition_params: A [num_tags, num_tags] matrix of binary potentials.
  Returns:
    binary_scores: A [batch_size] vector of binary scores.
  """
  # Get shape information.
  num_tags = transition_params.get_shape()[0]
  num_transitions = array_ops.shape(tag_indices)[1] - 1

  # Truncate by one on each side of the sequence to get the start and end
  # indices of each transition.
  start_tag_indices = array_ops.slice(tag_indices, [0, 0],
                                      [-1, num_transitions])
  end_tag_indices = array_ops.slice(tag_indices, [0, 1], [-1, num_transitions])

  # Encode the indices in a flattened representation.
  flattened_transition_indices = start_tag_indices * num_tags + end_tag_indices
  flattened_transition_params = array_ops.reshape(transition_params, [-1])

  # Get the binary scores based on the flattened representation.
  binary_scores = array_ops.gather(flattened_transition_params,
                                   flattened_transition_indices)

  masks = array_ops.sequence_mask(sequence_lengths,
                                  maxlen=array_ops.shape(tag_indices)[1],
                                  dtype=dtypes.float32)
  truncated_masks = array_ops.slice(masks, [0, 1], [-1, -1])
  binary_scores = math_ops.reduce_sum(binary_scores * truncated_masks, 1)
  return binary_scores

crf_binary_score计算二元概率,输出维度为[batch_size],向量中的每个元素是一个sequence中所有的转移概率之和。对应式中的(a)项。序列的长度为num_transitions,则会发生num_transitions-1此转移,转移的起始下标start_tag_indices对应0,$\cdots$,num_transitions-2,结束下标end_tag_indices对应1,$\cdots$,num_transitions-1。使用与crf_unary_score类似的原理取出对应位置的转移概率,进行mask操作求和,返回binary_scores

log normalization factor
  def _multi_seq_fn():
    """Forward computation of alpha values."""
    rest_of_input = array_ops.slice(inputs, [0, 1, 0], [-1, -1, -1])

    # Compute the alpha values in the forward algorithm in order to get the
    # partition function.
    forward_cell = CrfForwardRnnCell(transition_params)
    # Sequence length is not allowed to be less than zero.
    sequence_lengths_less_one = math_ops.maximum(
        constant_op.constant(0, dtype=sequence_lengths.dtype),
        sequence_lengths - 1)
    _, alphas = rnn.dynamic_rnn(
        cell=forward_cell,
        inputs=rest_of_input,
        sequence_length=sequence_lengths_less_one,
        initial_state=first_input,
        dtype=dtypes.float32)
    log_norm = math_ops.reduce_logsumexp(alphas, [1])
    # Mask `log_norm` of the sequences with length <= zero.
    log_norm = array_ops.where(math_ops.less_equal(sequence_lengths, 0),
                               array_ops.zeros_like(log_norm),
                               log_norm)
    return log_norm

normalization factor $Z$的计算方式,
$$\begin{equation}Z=\sum_y \Pi_{t=1}^T\psi(y_{t-1},y_t,x)\tag{13}\end{equation}$$
使用前向算法计算$Z$,令
$$\begin{equation}\alpha_j(y_j=s)=\sum_{y1,\cdots,y_{j-1}}(\Pi_{t=1}^{j-1}\psi(y_{t-1},y_t,x))\psi(y_{j-1},y_j=s,x)\tag{14}\end{equation}$$
则有
$$\begin{equation}\begin{aligned}\alpha_{j+1}(y_{j+1}=s^{'})&=\sum_{y_1,\cdots,y_j}(\Pi_{t=1}^{j}\psi(y_{t-1},y_t,x))\psi(y_{j},y_{j+1}=s^{'},x)\\&=\sum_{s=1}^{N}(\sum_{y1,\cdots,y_{j-1}}(\Pi_{t=1}^{j-1}\psi(y_{t-1},y_t,x))\psi(y_{j-1},y_j=s,x))\psi(y_{j}=s,y_{j+1}=s^{'},x)\\&=\sum_{s=1}^N\alpha_j(y_j=s)\psi(y_{j}=s,y_{j+1}=s^{'},x)\\&=\sum_{s=1}^N\alpha_j(y_j=s)\exp(f^{transition}(y_j=s,y_{j+1}=s^{'})+\underbrace{f^{state}(y_{t+1},x)}_{与y_j无关})\\&=\sum_{s=1}^N\underbrace{\alpha_j(y_j=s)}_{(a)}\underbrace{\exp(f^{transition}(y_j=s,y_{j+1}=s^{'}))}_{(b)}\underbrace{\exp(f^{state}(y_{t+1},x))}_{(c)}\end{aligned}\tag{15}\end{equation}$$

class CrfForwardRnnCell(rnn_cell.RNNCell):
  """Computes the alpha values in a linear-chain CRF.

  See http://www.cs.columbia.edu/~mcollins/fb.pdf for reference.
  """

  def __init__(self, transition_params):
    """Initialize the CrfForwardRnnCell.

    Args:
      transition_params: A [num_tags, num_tags] matrix of binary potentials.
          This matrix is expanded into a [1, num_tags, num_tags] in preparation
          for the broadcast summation occurring within the cell.
    """
    self._transition_params = array_ops.expand_dims(transition_params, 0)
    self._num_tags = tensor_shape.dimension_value(transition_params.shape[0])

  @property
  def state_size(self):
    return self._num_tags

  @property
  def output_size(self):
    return self._num_tags

  def __call__(self, inputs, state, scope=None):
    """Build the CrfForwardRnnCell.

    Args:
      inputs: A [batch_size, num_tags] matrix of unary potentials.
      state: A [batch_size, num_tags] matrix containing the previous alpha
          values.
      scope: Unused variable scope of this cell.

    Returns:
      new_alphas, new_alphas: A pair of [batch_size, num_tags] matrices
          values containing the new alpha values.
    """
    state = array_ops.expand_dims(state, 2)

    # This addition op broadcasts self._transitions_params along the zeroth
    # dimension and state along the second dimension. This performs the
    # multiplication of previous alpha values and the current binary potentials
    # in log space.
    transition_scores = state + self._transition_params
    new_alphas = inputs + math_ops.reduce_logsumexp(transition_scores, [1])

    # Both the state and the output of this RNN cell contain the alphas values.
    # The output value is currently unused and simply satisfies the RNN API.
    # This could be useful in the future if we need to compute marginal
    # probabilities, which would require the accumulated alpha values at every
    # time step.
    return new_alphas, new_alphas

crf.png
CrfForwardRnnCell(rnn_cell.RNNCell)基于RNN的架构,在时间步t接收一个输入$x_t$,输出$o_t$,生成新的hidden state $h_t$。
其中,$x_t$为$f^{state}(y_t,x)$,$h_t$即为前向算法中的$\alpha_t=(\alpha_t(y_t=1),\cdots,\alpha_t(y_t=N))^T$,如图所示,在更新$\alpha$时,使用的是加法而非乘法,这是因为,所有的计算都是在log空间中计算的,对应式(15)中的$\log(a) + \log(b) = log(ab)$,所以transition_scores = state + self._transition_params计算得到的矩阵,其第i行第j列的元素是$\log(\alpha_{t-1}(y_{t-1}=i)\times \exp(f^{transition}(y_{t-1}=i,y_t=j)))$,math_ops.reduce_logsumexp(transition_scores, [1])先$\exp$再求和再取对数,得到$\log(\alpha_{t-1}^T F(y_{t-1},y_t))$,再加上inputs,实际是$\log(\sum ab)+ \log (c)$,即$\log((\sum ab)c)$,因此得到的新的$\alpha_{t+1}$(hidden state)结果也是在log空间中的,方便下一个时间步的计算。
inputs的维度是[batch_size, num_tags],state的维度为[batch_size,num_tags],transition_params的维度为[num_tags,num_tags]。

预测

条件随机场的预测使用维特比算法

def viterbi_decode(score, transition_params):
  """Decode the highest scoring sequence of tags outside of TensorFlow.

  This should only be used at test time.

  Args:
    score: A [seq_len, num_tags] matrix of unary potentials.
    transition_params: A [num_tags, num_tags] matrix of binary potentials.

  Returns:
    viterbi: A [seq_len] list of integers containing the highest scoring tag
        indices.
    viterbi_score: A float containing the score for the Viterbi sequence.
  """
  trellis = np.zeros_like(score)
  backpointers = np.zeros_like(score, dtype=np.int32)
  trellis[0] = score[0]

  for t in range(1, score.shape[0]):
    v = np.expand_dims(trellis[t - 1], 1) + transition_params
    trellis[t] = score[t] + np.max(v, 0)
    backpointers[t] = np.argmax(v, 0)

  viterbi = [np.argmax(trellis[-1])]
  for bp in reversed(backpointers[1:]):
    viterbi.append(bp[viterbi[-1]])
  viterbi.reverse()
  viterbi_score = np.max(trellis[-1])
  return viterbi, viterbi_score

其中,trellis记录非规范化概率,backpointers记录路径。

参考

http://www.cs.columbia.edu/~m... LSTM-Based NeuroCRFs for Name Entity Recognition 统计学习方法