[NLP] Character-Aware Neural Language Models

Alex_Rose 2021. 12. 5. 20:26

Contribution

Suggests a way to improve word-level language models by using character-level embeddings as the input.

 

Pros and Cons of the Approach

The network used in the paper was standard for its time, but the input was different: character-level embeddings. The results were better for languages with richer morphology. However, character-level embeddings come with a tradeoff between parameter efficiency and training time.

 

Model Architecture

(Figure: character-aware model architecture — CharCNN → highway network → word-level LSTM → softmax over words)

 

Summary

1. Preprocess

Uses the Penn Treebank dataset for English (Mikolov's preprocessed version, vocabulary size: 10K).

Three different input representations are compared:

1) a character-level embedding matrix (used as the input to the network; this is the main contribution of the paper — a small sketch of building this input follows the list)

2) a word-level embedding matrix (for comparison)

3) a morpheme matrix (prefix + stem + suffix + word-level embeddings, for comparison)
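
Below is a minimal sketch (not the paper's preprocessing code) of how one word can be turned into the character-level input: character indices with start-of-word / end-of-word markers, padded to max_word_length, then looked up in the $d \times |C|$ embedding matrix $Q$. The char2idx mapping and the '{' / '}' marker symbols are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A minimal sketch (not the paper's preprocessing code) of building the
# character-level input for one word. char2idx, the '{' / '}' start/end-of-word
# markers, and max_word_length = 65 are illustrative assumptions.
d, max_word_length = 15, 65

char2idx = {c: i + 3 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
char2idx.update({"<pad>": 0, "{": 1, "}": 2})        # padding, start-of-word, end-of-word

def word_to_char_ids(word):
    ids = [char2idx["{"]] + [char2idx[c] for c in word] + [char2idx["}"]]
    ids += [char2idx["<pad>"]] * (max_word_length - len(ids))
    return torch.tensor(ids)                          # (max_word_length,)

Q = nn.Embedding(len(char2idx), d, padding_idx=0)     # character embedding matrix Q
Ck = Q(word_to_char_ids("know")).T                    # C^k: (d, max_word_length) = 15 x 65
```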

 

2. CharCNN

1) kernel width w = [1,2,3,4,5,6,7] (a large model), w = [1,2,3,4,5,6] (a small model)

2) filter height: 15 (the character embedding dimension $d$); the number of filters per width was [50, 100, 150, 200, 200, 200, 200] (large model)

3) Slide each filter over the character embeddings (the convolution), apply tanh, and then take the max over time of each resulting feature map (max-over-time pooling), giving one value per filter.

4) Concatenate all the max values. This vector is the input to the highway network (sketched below).
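
The following PyTorch sketch shows these four steps with the large-model widths and filter counts listed above; it is an illustrative re-implementation, not the authors' code.

```python
import torch
import torch.nn as nn

# An illustrative CharCNN sketch (not the authors' code): per-width 1-D convolutions
# over the character embeddings, tanh, then max-over-time pooling, concatenated.
class CharCNN(nn.Module):
    def __init__(self, char_vocab_size, d=15,
                 widths=(1, 2, 3, 4, 5, 6, 7),
                 filters=(50, 100, 150, 200, 200, 200, 200)):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, d)
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, n, kernel_size=w) for w, n in zip(widths, filters)]
        )

    def forward(self, char_ids):                      # (batch, max_word_length)
        x = self.char_emb(char_ids).transpose(1, 2)   # (batch, d, max_word_length)
        pooled = []
        for conv in self.convs:
            f = torch.tanh(conv(x))                   # (batch, n_filters, L - w + 1)
            pooled.append(f.max(dim=2).values)        # max over time -> (batch, n_filters)
        return torch.cat(pooled, dim=1)               # (batch, sum of filter counts)
```

For a batch of words encoded as character indices, the output is one vector per word whose size is the sum of the filter counts (1,100 for the numbers above); this is the $\mathbf{y}$ that enters the highway network.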

 

3. Highway Network

The input is the vector $\mathbf{y}$, the output of the CharCNN.

1) Apply an affine transformation to $\mathbf{y}$: $\mathbf{W}_{T}\mathbf{y} + \mathbf{b}_{T}$ (weight matrix times input, plus bias)

2) Apply the sigmoid function to the result of 1). This is the transform gate $\mathbf{t}$.

3) Take the element-wise product of the result of 2) and $\mathrm{ReLU}(\mathbf{W}_{H}\mathbf{y} + \mathbf{b}_{H})$

4) Compute $(1-\mathbf{t})$, which is the carry gate

5) Take the element-wise product of the carry gate and $\mathbf{y}$

6) Lastly, add the results of 3) and 5); this sum is sent to the LSTM

 

Both $\mathbf{W}_{T}$ and $\mathbf{W}_{H}$ are square matrices, for computational convenience (nn.Linear(dim, dim) in PyTorch); see the sketch after the list below.

  • the bias of the transform gate $\mathbf{b}_{T}$ is initialized to -2
  • the number of highway layers: 2
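
A minimal PyTorch sketch of one highway layer following steps 1)-6) above, with the transform-gate bias initialized to -2. The class and attribute names are mine, not the paper's.

```python
import torch
import torch.nn as nn

# A minimal highway-layer sketch following steps 1)-6) in the text above.
class Highway(nn.Module):
    def __init__(self, dim, gate_bias=-2.0):
        super().__init__()
        self.W_T = nn.Linear(dim, dim)                # transform gate weights (square matrix)
        self.W_H = nn.Linear(dim, dim)                # nonlinear transform weights (square matrix)
        self.W_T.bias.data.fill_(gate_bias)           # bias -2 -> the gate starts mostly "carry"

    def forward(self, y):
        t = torch.sigmoid(self.W_T(y))                # steps 1)-2): transform gate t
        h = torch.relu(self.W_H(y))                   # step 3): ReLU(W_H y + b_H)
        return t * h + (1.0 - t) * y                  # steps 3)-6): z, sent on to the LSTM
```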

 

4. LSTM

  • to be added (a rough placeholder sketch follows)
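
Since this part is still to be written up, here is only a rough, hedged placeholder sketch of the word-level LSTM language model that consumes the highway outputs (hidden size 650 and 2 layers as in the large model; the class name WordLSTM, feat_dim, and the dropout value are my own choices, not the paper's code):

```python
import torch
import torch.nn as nn

# A rough placeholder sketch (not the authors' code): a 2-layer word-level LSTM
# language model over the CharCNN + highway word representations.
class WordLSTM(nn.Module):
    def __init__(self, feat_dim, vocab_size, hidden=650, layers=2, dropout=0.5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                            dropout=dropout, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)     # output embeddings P and bias q

    def forward(self, word_reprs):                    # (batch, seq_len, feat_dim)
        h, _ = self.lstm(word_reprs)                  # (batch, seq_len, hidden)
        return self.proj(h)                           # logits over the word vocabulary V

# training minimizes the negative log-likelihood of the next word, e.g.
# loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```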

 

Character-level Convolutional Neural Network

Notation:

  • $t$ : timestep, e.g. [someone, has, a, dream, ...]
  • $C$ : the vocabulary of characters
  • $d$ : the dimensionality of character embeddings, $d = 15$. If word = <'k','n','o','w'>, the character-level embedding matrix of the word is 15 by 4.
  • $Q \in \mathbf{R}^{d \times |C|}$ : the character embedding matrix
  • $\mathbf{C}^{k}$ : the matrix of character embeddings for word $k$. If word = <'k','n','o','w'>, it is 15 by 65 in the code, where 65 is max_word_length; start-of-word, end-of-word, and EOS symbols are counted as characters as well.
  • $H \in \mathbf{R}^{d \times w}$ : a convolution filter, where $w$ is the filter width; $w = [1,2,3,4,5,6,7]$ (large model). So $d \times w$ means [15 by 1], [15 by 2], ..., [15 by 7]. Every one of these kernel matrices is applied to every word.
  • $f^{k} \in \mathbf{R}^{l-w+1}$ : a feature map, where $l$ is the length of word $k$
  • $h$ : the total number of filters across all widths (the per-width counts are listed above)

 

Creation of a feature map $f$

For a filter $\mathbf{H}$ of width $w$, the $i$-th entry of the feature map is obtained by sliding $\mathbf{H}$ over the character embeddings of word $k$:

$$f^{k}[i] = \tanh\left(\left\langle \mathbf{C}^{k}[:, i:i+w-1], \mathbf{H} \right\rangle + b\right)$$

where $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{Tr}(\mathbf{A}\mathbf{B}^{T})$ is the Frobenius inner product.
In the paper, taking this inner product with the kernel matrix $\mathbf{H}$ at position $i$ is described as picking out a character n-gram.

After the feature map is computed, max-over-time pooling is applied per filter:

$$y^{k} = \max_{i} f^{k}[i]$$

so each filter keeps only its single strongest activation over the word.
The result of this process is $\mathbf{y}^{k}$ (a vector).

$\mathbf{y}^{k} = [y_{1}^{k}, \ldots, y_{h}^{k}]$ is the representation of the specific word $k$; this is the $\mathbf{y}$ that is fed into the highway network.
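
To make the character n-gram view concrete, here is a tiny illustrative check (random numbers and toy shapes, not the paper's code) that a 1-D convolution output at position $i$ equals the Frobenius inner product $\langle \mathbf{C}^{k}[:, i:i+w-1], \mathbf{H} \rangle$:

```python
import torch
import torch.nn.functional as F

# Tiny illustrative check: the conv1d output at position i equals the Frobenius
# inner product <C^k[:, i:i+w-1], H>, i.e. each filter scores one character n-gram.
d, l, w = 15, 4, 2                      # embedding dim, word length, filter width
Ck = torch.randn(d, l)                  # character embeddings of a word such as <'k','n','o','w'>
H = torch.randn(d, w)                   # one convolution filter

i = 0                                   # first character 2-gram
frob = (Ck[:, i:i + w] * H).sum()       # <A, B> = Tr(A B^T) = element-wise product, summed
conv = F.conv1d(Ck.unsqueeze(0), H.unsqueeze(0))[0, 0, i]   # same value from conv1d (no bias)

print(torch.allclose(frob, conv))       # True
```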

 

Highway Network

We could simply replace $\mathbf{x}^{k}$ with $\mathbf{y}^{k}$ at each timestep $t$ in the RNN-LM.

$\mathbf{y}$ : output from CharCNN

By construction, the dimensions of $\mathbf{y}$ and $\mathbf{z}$ have to match, and hence $\mathbf{W}_{T}$ and $\mathbf{W}_{H}$ are square matrices.

$$\mathbf{z} = \mathbf{t} \odot \mathrm{ReLU}(\mathbf{W}_{H}\mathbf{y} + \mathbf{b}_{H}) + (1 - \mathbf{t}) \odot \mathbf{y}, \qquad \mathbf{t} = \sigma(\mathbf{W}_{T}\mathbf{y} + \mathbf{b}_{T})$$

$\mathbf{t}$ is the transform gate, $(1 - \mathbf{t})$ is the carry gate, and $\mathbf{z}$ is what gets passed to the LSTM.
Recurrent Neural Network Language Model

$V$ : the fixed-size vocabulary of words

$P$ : output embedding matrix

${p}^{j}$ : output word embedding

$q$ : bias term (=$b$)

$g$ : composition function

Our model simply replaces the input embeddings $\mathbf{x}$ with the output from a character-level convolutional neural network.

$$\mathbf{h}_{t} = g(\mathbf{x}_{t}, \mathbf{h}_{t-1})$$

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp\left(\mathbf{h}_{t} \cdot \mathbf{p}^{j} + q^{j}\right)}{\sum_{j' \in V} \exp\left(\mathbf{h}_{t} \cdot \mathbf{p}^{j'} + q^{j'}\right)}$$

Training minimizes the negative log-likelihood of the training sequence, $NLL = -\sum_{t} \log \Pr(w_{t} \mid w_{1:t-1})$.
Baselines

 

Optimization

 

Dropout

 

Hierarchical Softmax

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \Pr(r \mid w_{1:t}) \times \Pr(j \mid r, w_{1:t})$$

where $r$ is the cluster that contains word $j$.
The first term of the equation is the probability of picking cluster $r$.

The second term of the equation is the probability of picking word $j$ given the cluster $r$.

The authors found that hierarchical softmax was not necessary for models trained on DATA-S (the smaller datasets).
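
A small illustrative sketch of this two-level factorization follows; the cluster count, equal cluster sizes, and the indices r / j_in_r are assumptions made for the example, not the paper's actual clustering.

```python
import torch
import torch.nn as nn

# Illustrative two-level (hierarchical) softmax: Pr(w = j) = Pr(cluster r) * Pr(j | r).
hidden, n_clusters, words_per_cluster = 650, 100, 100

cluster_proj = nn.Linear(hidden, n_clusters)                  # scores over clusters
word_proj = nn.ModuleList(
    [nn.Linear(hidden, words_per_cluster) for _ in range(n_clusters)]
)                                                             # scores within each cluster

h_t = torch.randn(1, hidden)                                  # LSTM hidden state at time t
r, j_in_r = 7, 42                                             # cluster of word j, index inside it

p_cluster = torch.softmax(cluster_proj(h_t), dim=-1)[0, r]    # first term
p_word = torch.softmax(word_proj[r](h_t), dim=-1)[0, j_in_r]  # second term
p_j = p_cluster * p_word                                      # Pr(w_{t+1} = j | w_{1:t})
print(p_j.item())
```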

