
[NLP] Character-Aware Neural Language Models

Alex_Rose 2021. 12. 5. 20:26

Contribution

Suggests a way to improve word-level language models with character-level embeddings.

 

Pros and Cons of the Approach

The network used in the paper was standard at the time, but the input was different: character-level embeddings instead of word embeddings. The results improved more for languages with richer morphology. Yet character-level embeddings involve a trade-off between efficiency and time: fewer parameters, but slower training.

 

Model Architecture

(Figure: char-aware model architecture)

 

Summary

1. Preprocess

Uses the Penn Treebank dataset for English (Mikolov et al.'s preprocessed version; vocabulary size: 10K).

Three different embedding matrices are used:

1) a character-level embedding matrix (used as the input to the network; this is the paper's contribution point)

2) a word-level embedding matrix (for comparison)

3) a morpheme embedding matrix (prefix + stem + suffix, added to the word-level embedding; for the morphological baselines)
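As a concrete illustration of 1), here is a toy sketch (my own code, not the paper's; the marker symbols and character set are assumptions) of mapping a word to a padded sequence of character indices:

```python
# Each word is wrapped in start-/end-of-word markers and padded to the
# maximum word length (65 in the released code).
MAX_WORD_LEN = 65
chars = ["<pad>", "{", "}"] + list("abcdefghijklmnopqrstuvwxyz")
char2idx = {c: i for i, c in enumerate(chars)}

def word_to_char_indices(word):
    tokens = ["{"] + list(word) + ["}"]          # '{' = START, '}' = END
    idxs = [char2idx[c] for c in tokens]
    return idxs + [char2idx["<pad>"]] * (MAX_WORD_LEN - len(idxs))

print(word_to_char_indices("know")[:8])  # [1, 13, 16, 17, 25, 2, 0, 0]
```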

 

2. CharCNN

1) kernel widths w = [1,2,3,4,5,6,7] (large model), w = [1,2,3,4,5,6] (small model)

2) kernel height: 15 (the character embedding dimensionality d); the number of filters per width is [50, 100, 150, 200, 200, 200, 200] in the large model (1,100 filters in total)

3) Slide each kernel over the word's character embeddings, apply tanh to the convolution output to get a feature map, then take the max value over time for each filter (max-over-time pooling), yielding one value per filter.

4) Concatenate all the max values into a single vector, which becomes the input to the highway network (see the sketch below).
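A minimal PyTorch sketch of this stage under the large-model hyperparameters (the character vocabulary size of 51 and padded word length of 65 are assumptions; this is an illustrative reimplementation, not the authors' code):

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, n_chars=51, char_dim=15,
                 widths=(1, 2, 3, 4, 5, 6, 7),
                 n_filters=(50, 100, 150, 200, 200, 200, 200)):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        # One 1-D convolution per kernel width; in_channels = char_dim.
        self.convs = nn.ModuleList([
            nn.Conv1d(char_dim, h, kernel_size=w)
            for w, h in zip(widths, n_filters)
        ])

    def forward(self, char_ids):                  # (batch, word_len)
        x = self.embed(char_ids).transpose(1, 2)  # (batch, char_dim, word_len)
        pooled = [torch.tanh(conv(x)).max(dim=2).values  # max over time per filter
                  for conv in self.convs]
        return torch.cat(pooled, dim=1)           # (batch, 1100) for the large model

y = CharCNN()(torch.randint(0, 51, (4, 65)))  # 4 words, each padded to 65 chars
print(y.shape)  # torch.Size([4, 1100])
```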

 

3. Highway Network

The input is the vector $\mathbf{y}$, the output of the CharCNN.

1) Apply an affine transformation to $\mathbf{y}$: $\mathbf{W}_T\mathbf{y} + \mathbf{b}_T$ (weight matrix × input + bias)

2) Apply the sigmoid function to the result of 1). This is the transform gate $\mathbf{t}$.

3) Compute the element-wise product of $\mathbf{t}$ and $\mathrm{ReLU}(\mathbf{W}_H\mathbf{y} + \mathbf{b}_H)$.

4) Compute $(\mathbf{1} - \mathbf{t})$, which is the carry gate.

5) Compute the element-wise product of the carry gate and $\mathbf{y}$.

6) Lastly, add the results of 3) and 5), then send the sum to the LSTM.

 

Both $\mathbf{W}_T$ and $\mathbf{W}_H$ are square matrices, for computational convenience (nn.Linear(dim, dim) in PyTorch).

  • the transform gate bias $\mathbf{b}_T$: initialized to −2
  • the number of highway layers: 2
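Putting the steps together, a sketch of one highway layer (assuming ReLU as the nonlinearity and the −2 gate-bias initialization mentioned above; an illustrative implementation, not the authors' code):

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W_H = nn.Linear(dim, dim)   # square matrices, per the note above
        self.W_T = nn.Linear(dim, dim)
        self.W_T.bias.data.fill_(-2.0)   # bias toward carrying y through early in training

    def forward(self, y):
        t = torch.sigmoid(self.W_T(y))                      # transform gate t
        return t * torch.relu(self.W_H(y)) + (1.0 - t) * y  # t⊙ReLU(W_H y + b_H) + (1−t)⊙y

z = Highway(1100)(torch.randn(4, 1100))  # the paper stacks two such layers
```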

 

4. LSTM

  • to be added (a minimal placeholder sketch below)
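Until then, a rough sketch of how this stage would be wired, assuming the large model's sizes (two LSTM layers, hidden size 650, 1100-dimensional highway outputs; batch size 20 and sequence length 35 as in the paper's training setup):

```python
import torch
import torch.nn as nn

# Word-level LSTM over the highway outputs (the wiring is my assumption,
# not the authors' code; sizes follow the large model).
lstm = nn.LSTM(input_size=1100, hidden_size=650, num_layers=2, batch_first=True)

word_vecs = torch.randn(20, 35, 1100)  # CharCNN + highway output per word
h, _ = lstm(word_vecs)                 # h: (20, 35, 650), fed to the softmax layer
```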

 

Character-level Convolutional Neural Network

Notation:
$t$ : timestep, e.g., [someone, has, a, dream, ...]
$\mathcal{C}$ : the vocabulary of characters
$d$ : the dimensionality of character embeddings, $d = 15$. If word = <'k','n','o','w'>, then the character-level embedding matrix is 15 by 4.
$\mathbf{Q} \in \mathbb{R}^{d \times |\mathcal{C}|}$ : the character embedding matrix
$\mathbf{C}^k$ : the matrix of character embeddings for word $k$; with padding it is 15 by 65, since 65 is the max_word_length in the code (START, END, and EOS markers are included).
$\mathbf{H} \in \mathbb{R}^{d \times w}$ : a convolution filter, where $w$ is the filter width; $w = [1,2,3,4,5,6,7]$ in the large model. Thus $d \times w$ means [15 by 1], [15 by 2], ..., [15 by 7], and every kernel matrix $\mathbf{H}$ is applied to every word.
$\mathbf{f}^k \in \mathbb{R}^{l-w+1}$ : a feature map, where $l$ is the length of word $k$ in characters
$h$ : the number of filters per kernel width (listed above); the LSTM hidden size is 650 in the large model

 

Creation of a feature map f

$$\mathbf{f}^k[i] = \tanh\left(\left\langle \mathbf{C}^k[\ast,\ i:i+w-1],\ \mathbf{H}\right\rangle + b\right)$$

where $\mathbf{C}^k[\ast,\ i:i+w-1]$ is the window of columns $i$ to $i+w-1$ of $\mathbf{C}^k$, and $\langle\mathbf{A},\mathbf{B}\rangle = \mathrm{Tr}(\mathbf{A}\mathbf{B}^{\top})$ is the Frobenius inner product.

In the paper, this inner product of the kernel matrix $\mathbf{H}$ with a window of $w$ character embeddings is described as capturing a character $n$-gram (with $n = w$).

Max-over-time pooling then keeps the single strongest response of each filter:

$$y^k = \max_i \mathbf{f}^k[i]$$

The result of this process, collected over all $h$ filters, is the vector $\mathbf{y}^k = [y_1^k, \ldots, y_h^k]$.

$\mathbf{y}^k$ is the CharCNN representation of the specific word $k$.

 

Highway Network

We could simply replace $\mathbf{x}^k$ with $\mathbf{y}^k$ at each timestep $t$ in the RNN-LM.

$\mathbf{y}$ : the output from the CharCNN

By construction, the dimensions of $\mathbf{y}$ and $\mathbf{z}$ have to match, and hence $\mathbf{W}_T$ and $\mathbf{W}_H$ are square matrices.

$$\mathbf{z} = \mathbf{t} \odot g(\mathbf{W}_H\mathbf{y} + \mathbf{b}_H) + (\mathbf{1} - \mathbf{t}) \odot \mathbf{y}, \qquad \mathbf{t} = \sigma(\mathbf{W}_T\mathbf{y} + \mathbf{b}_T)$$

where $g$ is a nonlinearity (ReLU here), $\mathbf{t}$ is the transform gate, and $(\mathbf{1} - \mathbf{t})$ is the carry gate.

Recurrent Neural Network Language Model

$\mathcal{V}$ : the fixed vocabulary of words

$\mathbf{P}$ : the output embedding matrix

$\mathbf{p}^j$ : the output embedding of word $j$

$q^j$ : the bias term (= $b$)

$g$ : the composition function

Our model simply replaces the input word embeddings $\mathbf{x}$ with the output from the character-level convolutional neural network. The next word is then predicted with a softmax over the vocabulary:

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp\left(\mathbf{h}_t \cdot \mathbf{p}^j + q^j\right)}{\sum_{j' \in \mathcal{V}} \exp\left(\mathbf{h}_t \cdot \mathbf{p}^{j'} + q^{j'}\right)}$$
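As a toy illustration of this output layer (all sizes and tensors below are made-up examples, not the paper's code):

```python
import torch

# h_t is the LSTM hidden state, P holds one output embedding p_j per word,
# and q is the bias term.
V, hidden = 10_000, 650
h_t = torch.randn(hidden)
P = torch.randn(V, hidden)   # output embedding matrix
q = torch.zeros(V)           # bias term

logits = P @ h_t + q                  # h_t · p_j + q_j for every word j
probs = torch.softmax(logits, dim=0)  # Pr(w_{t+1} = j | w_{1:t})
```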

Baselines

 

Optimization

 

Dropout

 

Hierarchical Softmax
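The two-level factorization referred to below, reconstructed from the paper's description (pick a cluster first, then a word within that cluster):

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \Pr(r \mid w_{1:t}) \times \Pr(w_{t+1} = j \mid r,\, w_{1:t})$$

where $r$ is the cluster containing word $j$.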

 

The first term of the equation is the probability of picking cluster $r$.

The second term of the equation is the probability of picking word $j$, given cluster $r$.

We found that hierarchical softmax was not necessary for models trained on DATA-S.
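A rough sketch of the two-level computation (the cluster count, even vocabulary split, and all shapes are assumptions for illustration): only n_clusters + words_per_cluster logits are computed, instead of all V.

```python
import torch

V, hidden, n_clusters = 10_000, 650, 100
words_per_cluster = V // n_clusters

h_t = torch.randn(hidden)
cluster_W = torch.randn(n_clusters, hidden)                  # scores each cluster r
word_W = torch.randn(n_clusters, words_per_cluster, hidden)  # scores words within a cluster

p_cluster = torch.softmax(cluster_W @ h_t, dim=0)    # first term: Pr(r | w_{1:t})
r = 7                                                # cluster of the target word (example)
p_word_in_r = torch.softmax(word_W[r] @ h_t, dim=0)  # second term: Pr(j | r, w_{1:t})
p_word = p_cluster[r] * p_word_in_r                  # probability of each word in cluster r
```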

