[NLP] Character-Aware Neural Language Models
Contribution
Suggests a way to improve a word-level language model by using character-level embeddings as its input.
Pros and Cons of the Approach
The network used in the paper is built from components that were standard at the time, but the input is different: character-level embeddings instead of word embeddings. The results were better for languages with richer morphology. However, character-level embeddings come with a tradeoff between parameter efficiency and training time.
Model Architecture
Summary
1. Preprocess
Use the Penn Treebank dataset for English (Mikolov's preprocessed version; vocabulary size: 10K).
Three different embedding matrices are used:
1) a character-level embedding matrix (used as the input to the network; this is the contribution point of the paper; see the sketch after this list)
2) a word-level embedding matrix (for comparison)
3) a morpheme matrix (prefix + stem + suffix + word-level embedding matrix)
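For reference, here is a minimal PyTorch sketch of how the character-level input for one word could be built. This is not the authors' code: the character set, the '{' / '}' marker symbols, the helper name, and the padding scheme are made up for illustration, while $d=15$ and max_word_length $=65$ match the numbers used in this post.

```python
import string

import torch
import torch.nn as nn

# Toy character vocabulary C with padding and start/end-of-word markers (assumptions).
CHARS = ['<pad>', '{', '}'] + list(string.ascii_lowercase)
char2idx = {c: i for i, c in enumerate(CHARS)}
MAX_WORD_LEN = 65   # max word length used in the reference code (incl. markers)
D = 15              # d = 15, the character embedding dimension

def word_to_char_ids(word: str) -> torch.Tensor:
    # Wrap the word in start/end markers, then pad to a fixed length.
    ids = [char2idx['{']] + [char2idx[c] for c in word] + [char2idx['}']]
    ids += [char2idx['<pad>']] * (MAX_WORD_LEN - len(ids))
    return torch.tensor(ids)

Q = nn.Embedding(len(CHARS), D)        # character embedding matrix Q
C_k = Q(word_to_char_ids("know")).t()  # C^k for word k: (d, MAX_WORD_LEN) = (15, 65)
print(C_k.shape)                       # torch.Size([15, 65])
```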
2. CharCNN
1) kernel widths: $w = [1,2,3,4,5,6,7]$ (large model), $w = [1,2,3,4,5,6]$ (small model)
2) filter height: 15, i.e. the character embedding dimension $d$; the number of filters per width is [50,100,150,200,200,200,200] in the large model
3) Slide each filter over the word's character embedding matrix and apply tanh to the convolution outputs to get a feature map; then take the max over positions (max-over-time pooling), which gives one value per filter.
4) Concatenate the max values from all filters. This vector is the input to the highway network (a minimal sketch of this step is given below).
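For reference, a minimal PyTorch sketch of this step, assuming the large-model hyperparameters quoted above. It is an illustrative re-implementation, not the authors' code.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Sketch of the CharCNN step: one 1-D convolution per filter width,
    tanh, max-over-time pooling, then concatenation."""
    def __init__(self, char_emb_dim=15,
                 widths=(1, 2, 3, 4, 5, 6, 7),
                 num_filters=(50, 100, 150, 200, 200, 200, 200)):
        super().__init__()
        # in_channels = character embedding dimension d, out_channels = number of filters
        self.convs = nn.ModuleList([
            nn.Conv1d(char_emb_dim, n, kernel_size=w)
            for w, n in zip(widths, num_filters)
        ])

    def forward(self, char_embs):
        # char_embs: (batch of words, d, max_word_len), i.e. one C^k per word
        pooled = []
        for conv in self.convs:
            f = torch.tanh(conv(char_embs))      # feature map: (batch, n_filters, l - w + 1)
            pooled.append(f.max(dim=2).values)   # max-over-time: one value per filter
        return torch.cat(pooled, dim=1)          # (batch, 50+100+150+200*4) = (batch, 1100)

words = torch.randn(32, 15, 65)                  # 32 words, d = 15, max_word_length = 65
print(CharCNN()(words).shape)                    # torch.Size([32, 1100])
```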
3. Highway Network
The input is a vector $\mathbf{y}$, the output of the CharCNN.
1) Apply an affine transformation to $\mathbf{y}$: $\mathbf{W}_{T}\mathbf{y} + \mathbf{b}_{T}$ (weight matrix times input, plus bias)
2) Apply the sigmoid function to the result of 1). This is the transform gate $\mathbf{t}$.
3) Take the element-wise product of the result of 2) and $\text{ReLU}(\mathbf{W}_{H}\mathbf{y} + \mathbf{b}_{H})$.
4) Compute $(1-\mathbf{t})$, the carry gate.
5) Take the element-wise product of the carry gate and $\mathbf{y}$.
6) Finally, add the results of 3) and 5), and feed the sum to the LSTM.
Both $\mathbf{W}_{T}$ and $\mathbf{W}_{H}$ are square matrices, for computational convenience (nn.Linear(dim, dim) in PyTorch).
- bias of the transform gate: -2
- number of highway layers: 2 (a minimal sketch of this network is given below)
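For reference, a minimal PyTorch sketch of this highway layer (not the authors' implementation), assuming the settings above: square nn.Linear(dim, dim) weights, a transform-gate bias of -2, and 2 layers.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Sketch of the highway layer: z = t * ReLU(W_H y + b_H) + (1 - t) * y,
    with transform gate t = sigmoid(W_T y + b_T) and carry gate (1 - t)."""
    def __init__(self, dim, num_layers=2, transform_gate_bias=-2.0):
        super().__init__()
        self.transforms = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])  # W_H, b_H
        self.gates = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])       # W_T, b_T
        for gate in self.gates:
            nn.init.constant_(gate.bias, transform_gate_bias)  # the -2 bias (assumed to be the gate bias)

    def forward(self, y):
        for transform, gate in zip(self.transforms, self.gates):
            t = torch.sigmoid(gate(y))                        # steps 1) and 2): transform gate
            y = t * torch.relu(transform(y)) + (1.0 - t) * y  # steps 3) to 6)
        return y                                              # sent on to the LSTM

y = torch.randn(32, 1100)          # output of the CharCNN (large model)
print(Highway(dim=1100)(y).shape)  # torch.Size([32, 1100]), same dimension in and out
```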
4. LSTM
- to be added
Character-level Convolutional Neural Network
notation | explanation |
---|---|
$t$ : timestep | e.g. the word sequence [someone, has, a, dream, ...] |
$C$ : vocabulary of characters | each character in $C$ gets an embedding of size 15 |
$d$ : dimensionality of character embeddings | $d=15$. If word = <'k','n','o','w'>, the character-level embedding matrix is 15 by 4 before padding |
$Q \in \mathbf{R}^{d \times \vert C \vert}$ : character embedding matrix | the lookup table whose columns are the character embeddings |
$\mathbf{C}^{k}$ : matrix of character embeddings for word $k$ | if word = <'k','n','o','w'>, then 15 by 65 after padding; 65 was max_word_length in the code, and START, END, EOS markers are applied as well |
$H \in \mathbf{R}^{d \times w}$ : convolution filter (kernel) | $w$ is the width of the convolution filter, $w = [1,2,3,4,5,6,7]$ in the large model, so $d \times w$ means [15 by 1], [15 by 2], ..., [15 by 7]; every filter $H$ is convolved with every word matrix $\mathbf{C}^{k}$ |
$\mathbf{f}^{k} \in \mathbf{R}^{l-w+1}$ : a feature map | $l$ is the word length; the result of convolving $\mathbf{C}^{k}$ with one filter of width $w$ |
$h$ (small $h$) | the number of filters per kernel width, $\min\{200, 50 \cdot w\}$ in the large model (the large model's LSTM hidden size is 650) |
Creation of a feature map $f$
In the paper, the inner product of a kernel matrix $H$ with a window of $\mathbf{C}^{k}$ is described as picking out a character n-gram.
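Written out for the $j$-th filter $\mathbf{H}_{j}$ of width $w$, the feature map and the max-over-time pooled value are

$$f^{k}_{j}[i] = \tanh\left(\left\langle \mathbf{C}^{k}[\ast,\, i:i+w-1],\ \mathbf{H}_{j} \right\rangle + b\right), \qquad y^{k}_{j} = \max_{i} f^{k}_{j}[i]$$

where $\langle \cdot, \cdot \rangle$ is the Frobenius inner product, so each position $i$ matches the filter against one character $w$-gram. Concatenating $y^{k}_{j}$ over all filters gives the vector described next.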
The result of this process is $\mathbf{y}^{k}$ (a vector).
$\mathbf{y}^{k}$ is the CharCNN representation of the specific word $k$; this is the $\mathbf{y}$ that is fed to the highway network.
Highway Network
We could simply replace $\mathbf{x}^{k}$ (the word embedding) with $\mathbf{y}^{k}$ at each timestep $t$ in the RNN-LM.
$\mathbf{y}$ : output from CharCNN
By construction, the dimensions of $\mathbf{y}$ and $\mathbf{z}$ have to match, and hence $\mathbf{W}_{T}$ and $\mathbf{W}_{H}$ are square matrices.
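Putting steps 1) to 6) of the summary together, one highway layer computes

$$\mathbf{z} = \mathbf{t} \odot g(\mathbf{W}_{H}\mathbf{y} + \mathbf{b}_{H}) + (1 - \mathbf{t}) \odot \mathbf{y}, \qquad \mathbf{t} = \sigma(\mathbf{W}_{T}\mathbf{y} + \mathbf{b}_{T})$$

where $\mathbf{t}$ is the transform gate, $(1 - \mathbf{t})$ is the carry gate, $\odot$ is the element-wise product, and $g$ is the nonlinearity (ReLU here).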
Recurrent Neural Network Language Model
$V$ : fixed size of vocabulary of words
$P$ : output embedding matrix
${p}^{j}$ : output word embedding
$q$ : bias term (=$b$)
$g$ : composition function
Our model simply replaces the input embeddings $\mathbf{x}$ with the output from a character-level convolutional neural network.
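With this notation, the RNN-LM produces the next-word distribution with a softmax over $V$:

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp\left(\mathbf{h}_{t} \cdot \mathbf{p}^{j} + q^{j}\right)}{\sum_{j' \in V} \exp\left(\mathbf{h}_{t} \cdot \mathbf{p}^{j'} + q^{j'}\right)}$$

where $\mathbf{h}_{t}$ is the hidden state at timestep $t$, $\mathbf{p}^{j}$ is the output embedding of word $j$ (a column of $P$), and $q^{j}$ is its bias.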
Baselines
Optimization
Dropout
Hierarchical Softmax
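The equation referred to below factors this softmax into two terms. Writing $c$ for the number of clusters, $V_{r}$ for the set of words in cluster $r$, and $\mathbf{s}^{r}, u^{r}$ for the cluster embedding and bias (the symbols are paraphrased here and may differ slightly from the paper), it is

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp\left(\mathbf{h}_{t} \cdot \mathbf{s}^{r} + u^{r}\right)}{\sum_{r'=1}^{c} \exp\left(\mathbf{h}_{t} \cdot \mathbf{s}^{r'} + u^{r'}\right)} \times \frac{\exp\left(\mathbf{h}_{t} \cdot \mathbf{p}^{j}_{r} + q^{j}_{r}\right)}{\sum_{j' \in V_{r}} \exp\left(\mathbf{h}_{t} \cdot \mathbf{p}^{j'}_{r} + q^{j'}_{r}\right)}$$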
The first term of the equation above is the probability of picking cluster $r$.
The second term is the probability of picking word $j$ given cluster $r$.
We found that hierarchical softmax was not necessary for models trained on DATA-S.