
[NLP] Character-Aware Neural Language Models

Alex_Rose 2021. 12. 5. 20:26

Contribution

Suggests a way to improve word-level language models with character-level embeddings.

 

Pros and Cons of the Approach

The network used in the paper was standard at the time, but the input was different: character-level embeddings instead of word embeddings. The results improved more for languages with richer morphology. Yet character-level embeddings involve a trade-off between efficiency and time: fewer parameters, but slower training.

 

Model Architecture

(Figure: char-aware model architecture)

 

Summary

1. Preprocess

Uses the Penn Treebank dataset for English (Mikolov et al.'s preprocessed version; vocabulary size: 10K).

Three different embedding matrices are used:

1) a character-level embedding matrix (used as the input to the network; this is the paper's contribution point)

2) a word-level embedding matrix (for comparison)

3) a morpheme embedding matrix (prefix + stem + suffix, added to the word-level embedding; for the morphological baselines)
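As a concrete illustration of 1), here is a toy sketch (my own code, not the paper's; the marker symbols and character set are assumptions) of mapping a word to a padded sequence of character indices:

```python
# Each word is wrapped in start-/end-of-word markers and padded to the
# maximum word length (65 in the released code).
MAX_WORD_LEN = 65
chars = ["<pad>", "{", "}"] + list("abcdefghijklmnopqrstuvwxyz")
char2idx = {c: i for i, c in enumerate(chars)}

def word_to_char_indices(word):
    tokens = ["{"] + list(word) + ["}"]          # '{' = START, '}' = END
    idxs = [char2idx[c] for c in tokens]
    return idxs + [char2idx["<pad>"]] * (MAX_WORD_LEN - len(idxs))

print(word_to_char_indices("know")[:8])  # [1, 13, 16, 17, 25, 2, 0, 0]
```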

 

2. CharCNN

1) kernel widths w = [1,2,3,4,5,6,7] (large model), w = [1,2,3,4,5,6] (small model)

2) kernel height: 15 (the character embedding dimensionality d); the number of filters per width is [50, 100, 150, 200, 200, 200, 200] in the large model (1,100 filters in total)

3) Slide each kernel over the word's character embeddings, apply tanh to the convolution output to get a feature map, then take the max value over time for each filter (max-over-time pooling), yielding one value per filter.

4) Concatenate all the max values into a single vector, which becomes the input to the highway network (see the sketch below).
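A minimal PyTorch sketch of this stage under the large-model hyperparameters (the character vocabulary size of 51 and padded word length of 65 are assumptions; this is an illustrative reimplementation, not the authors' code):

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, n_chars=51, char_dim=15,
                 widths=(1, 2, 3, 4, 5, 6, 7),
                 n_filters=(50, 100, 150, 200, 200, 200, 200)):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        # One 1-D convolution per kernel width; in_channels = char_dim.
        self.convs = nn.ModuleList([
            nn.Conv1d(char_dim, h, kernel_size=w)
            for w, h in zip(widths, n_filters)
        ])

    def forward(self, char_ids):                  # (batch, word_len)
        x = self.embed(char_ids).transpose(1, 2)  # (batch, char_dim, word_len)
        pooled = [torch.tanh(conv(x)).max(dim=2).values  # max over time per filter
                  for conv in self.convs]
        return torch.cat(pooled, dim=1)           # (batch, 1100) for the large model

y = CharCNN()(torch.randint(0, 51, (4, 65)))  # 4 words, each padded to 65 chars
print(y.shape)  # torch.Size([4, 1100])
```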

 

3. Highway Network

The input is the vector $\mathbf{y}$, the output of the CharCNN.

1) Apply an affine transformation to $\mathbf{y}$: $\mathbf{W}_T\mathbf{y} + \mathbf{b}_T$ (weight matrix × input + bias)

2) Apply the sigmoid function to the result of 1). This is the transform gate $\mathbf{t}$.

3) Compute the element-wise product of $\mathbf{t}$ and $\mathrm{ReLU}(\mathbf{W}_H\mathbf{y} + \mathbf{b}_H)$.

4) Compute $(\mathbf{1} - \mathbf{t})$, which is the carry gate.

5) Compute the element-wise product of the carry gate and $\mathbf{y}$.

6) Lastly, add the results of 3) and 5), then send the sum to the LSTM.

 

Both $\mathbf{W}_T$ and $\mathbf{W}_H$ are square matrices, for computational convenience (nn.Linear(dim, dim) in PyTorch).

  • the transform gate bias $\mathbf{b}_T$: initialized to −2
  • the number of highway layers: 2
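Putting the steps together, a sketch of one highway layer (assuming ReLU as the nonlinearity and the −2 gate-bias initialization mentioned above; an illustrative implementation, not the authors' code):

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W_H = nn.Linear(dim, dim)   # square matrices, per the note above
        self.W_T = nn.Linear(dim, dim)
        self.W_T.bias.data.fill_(-2.0)   # bias toward carrying y through early in training

    def forward(self, y):
        t = torch.sigmoid(self.W_T(y))                      # transform gate t
        return t * torch.relu(self.W_H(y)) + (1.0 - t) * y  # t⊙ReLU(W_H y + b_H) + (1−t)⊙y

z = Highway(1100)(torch.randn(4, 1100))  # the paper stacks two such layers
```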

 

4. LSTM

  • to be added (a minimal placeholder sketch below)
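Until then, a rough sketch of how this stage would be wired, assuming the large model's sizes (two LSTM layers, hidden size 650, 1100-dimensional highway outputs; batch size 20 and sequence length 35 as in the paper's training setup):

```python
import torch
import torch.nn as nn

# Word-level LSTM over the highway outputs (the wiring is my assumption,
# not the authors' code; sizes follow the large model).
lstm = nn.LSTM(input_size=1100, hidden_size=650, num_layers=2, batch_first=True)

word_vecs = torch.randn(20, 35, 1100)  # CharCNN + highway output per word
h, _ = lstm(word_vecs)                 # h: (20, 35, 650), fed to the softmax layer
```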

 

Character-level Convolutional Neural Network

Notation:
$t$ : timestep, e.g., [someone, has, a, dream, ...]
$\mathcal{C}$ : the vocabulary of characters
$d$ : the dimensionality of character embeddings, $d = 15$. If word = <'k','n','o','w'>, then the character-level embedding matrix is 15 by 4.
$\mathbf{Q} \in \mathbb{R}^{d \times |\mathcal{C}|}$ : the character embedding matrix
$\mathbf{C}^k$ : the matrix of character embeddings for word $k$; with padding it is 15 by 65, since 65 is the max_word_length in the code (START, END, and EOS markers are included).
$\mathbf{H} \in \mathbb{R}^{d \times w}$ : a convolution filter, where $w$ is the filter width; $w = [1,2,3,4,5,6,7]$ in the large model. Thus $d \times w$ means [15 by 1], [15 by 2], ..., [15 by 7], and every kernel matrix $\mathbf{H}$ is applied to every word.
$\mathbf{f}^k \in \mathbb{R}^{l-w+1}$ : a feature map, where $l$ is the length of word $k$ in characters
$h$ : the number of filters per kernel width (listed above); the LSTM hidden size is 650 in the large model

 

Creation of a feature map f

$$\mathbf{f}^k[i] = \tanh\left(\left\langle \mathbf{C}^k[\ast,\ i:i+w-1],\ \mathbf{H}\right\rangle + b\right)$$

where $\mathbf{C}^k[\ast,\ i:i+w-1]$ is the window of columns $i$ to $i+w-1$ of $\mathbf{C}^k$, and $\langle\mathbf{A},\mathbf{B}\rangle = \mathrm{Tr}(\mathbf{A}\mathbf{B}^{\top})$ is the Frobenius inner product.

In the paper, this inner product of the kernel matrix $\mathbf{H}$ with a window of $w$ character embeddings is described as capturing a character $n$-gram (with $n = w$).

Max-over-time pooling then keeps the single strongest response of each filter:

$$y^k = \max_i \mathbf{f}^k[i]$$

The result of this process, collected over all $h$ filters, is the vector $\mathbf{y}^k = [y_1^k, \ldots, y_h^k]$.

$\mathbf{y}^k$ is the CharCNN representation of the specific word $k$.

 

Highway Network

We could simply replace $\mathbf{x}^k$ with $\mathbf{y}^k$ at each timestep $t$ in the RNN-LM.

$\mathbf{y}$ : the output from the CharCNN

By construction, the dimensions of $\mathbf{y}$ and $\mathbf{z}$ have to match, and hence $\mathbf{W}_T$ and $\mathbf{W}_H$ are square matrices.

$$\mathbf{z} = \mathbf{t} \odot g(\mathbf{W}_H\mathbf{y} + \mathbf{b}_H) + (\mathbf{1} - \mathbf{t}) \odot \mathbf{y}, \qquad \mathbf{t} = \sigma(\mathbf{W}_T\mathbf{y} + \mathbf{b}_T)$$

where $g$ is a nonlinearity (ReLU here), $\mathbf{t}$ is the transform gate, and $(\mathbf{1} - \mathbf{t})$ is the carry gate.

Recurrent Neural Network Language Model

$\mathcal{V}$ : the fixed vocabulary of words

$\mathbf{P}$ : the output embedding matrix

$\mathbf{p}^j$ : the output embedding of word $j$

$q^j$ : the bias term (= $b$)

$g$ : the composition function

Our model simply replaces the input word embeddings $\mathbf{x}$ with the output from the character-level convolutional neural network. The next word is then predicted with a softmax over the vocabulary:

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp\left(\mathbf{h}_t \cdot \mathbf{p}^j + q^j\right)}{\sum_{j' \in \mathcal{V}} \exp\left(\mathbf{h}_t \cdot \mathbf{p}^{j'} + q^{j'}\right)}$$
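As a toy illustration of this output layer (all sizes and tensors below are made-up examples, not the paper's code):

```python
import torch

# h_t is the LSTM hidden state, P holds one output embedding p_j per word,
# and q is the bias term.
V, hidden = 10_000, 650
h_t = torch.randn(hidden)
P = torch.randn(V, hidden)   # output embedding matrix
q = torch.zeros(V)           # bias term

logits = P @ h_t + q                  # h_t · p_j + q_j for every word j
probs = torch.softmax(logits, dim=0)  # Pr(w_{t+1} = j | w_{1:t})
```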

Baselines

 

Optimization

 

Dropout

 

Hierarchical Softmax
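The two-level factorization referred to below, reconstructed from the paper's description (pick a cluster first, then a word within that cluster):

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \Pr(r \mid w_{1:t}) \times \Pr(w_{t+1} = j \mid r,\, w_{1:t})$$

where $r$ is the cluster containing word $j$.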

 

The first term of the equation is the probability of picking cluster $r$.

The second term of the equation is the probability of picking word $j$, given cluster $r$.

We found that hierarchical softmax was not necessary for models trained on DATA-S.
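A rough sketch of the two-level computation (the cluster count, even vocabulary split, and all shapes are assumptions for illustration): only n_clusters + words_per_cluster logits are computed, instead of all V.

```python
import torch

V, hidden, n_clusters = 10_000, 650, 100
words_per_cluster = V // n_clusters

h_t = torch.randn(hidden)
cluster_W = torch.randn(n_clusters, hidden)                  # scores each cluster r
word_W = torch.randn(n_clusters, words_per_cluster, hidden)  # scores words within a cluster

p_cluster = torch.softmax(cluster_W @ h_t, dim=0)    # first term: Pr(r | w_{1:t})
r = 7                                                # cluster of the target word (example)
p_word_in_r = torch.softmax(word_W[r] @ h_t, dim=0)  # second term: Pr(j | r, w_{1:t})
p_word = p_cluster[r] * p_word_in_r                  # probability of each word in cluster r
```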

