[NLP] Character-Aware Neural Language Models
Contribution
Suggests a way to improve a word-level language model by using character-level embeddings as input.
Pros and Cons of the Approach
The network used in the paper was standard for its time, but the input was different: character-level embeddings instead of word embeddings. The results were better for languages with richer morphology. However, character-level embeddings come with a tradeoff between parameter efficiency and training time.
Model Architecture
Summary
1. Preprocess
Use the Penn Treebank dataset for English (Mikolov's preprocessed version, vocabulary size: 10K).
Use three different embedding matrices:
1) a character-level embedding matrix (this matrix is used as the input to the network; this is the contribution point of the paper; a minimal sketch of this character-level input follows below)
2) a word-level embedding matrix (for comparison)
3) a morpheme matrix (prefix + stem + suffix + word-level embedding matrix)
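As an illustration of the character-level input, here is a minimal sketch of turning a word into a fixed-length list of character indices; the helper name `word_to_char_indices`, the `char2idx` mapping, and the special symbols are my own assumptions, not the paper's or the reference code's.

```python
MAX_WORD_LENGTH = 65  # the max_word_length reported in the referenced code

def word_to_char_indices(word, char2idx, max_word_length=MAX_WORD_LENGTH):
    """Sketch: map a word to character indices so that looking them up in the
    character embedding matrix Q yields the d x max_word_length matrix C^k."""
    # '{' and '}' stand in here for the start-of-word / end-of-word markers.
    chars = ['{'] + list(word) + ['}']
    indices = [char2idx.get(c, char2idx['<unk>']) for c in chars]
    # Pad (or truncate) to a fixed width so every word gives a matrix of the same shape.
    indices = indices[:max_word_length]
    indices += [char2idx['<pad>']] * (max_word_length - len(indices))
    return indices
```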
2. CharCNN
1) kernel widths w = [1,2,3,4,5,6,7] (large model), w = [1,2,3,4,5,6] (small model)
2) height: the 15-dimensional character embedding; the number of filters per width was [50,100,150,200,200,200,200] (large model)
3) Slide each filter over the character embedding matrix, apply tanh to the resulting feature map, and take its max over time (max-over-time pooling). This yields one value per filter.
4) Concatenate all of the max values. This vector is the input to the Highway Network (see the sketch after this list).
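A minimal PyTorch sketch of this CharCNN, under the settings listed above; this is an illustration, not the authors' implementation, and the class and argument names are my own.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character embeddings (d=15), kernel widths 1..7 with
    [50,100,150,200,200,200,200] filters (large model), tanh + max-over-time pooling."""

    def __init__(self, num_chars, char_dim=15,
                 widths=(1, 2, 3, 4, 5, 6, 7),
                 num_filters=(50, 100, 150, 200, 200, 200, 200)):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.convs = nn.ModuleList([
            nn.Conv1d(char_dim, n, kernel_size=w)
            for w, n in zip(widths, num_filters)
        ])

    def forward(self, char_ids):
        # char_ids: (batch, max_word_length) character indices, one word per row
        x = self.char_emb(char_ids)             # (batch, max_word_length, d)
        x = x.transpose(1, 2)                   # (batch, d, max_word_length) for Conv1d
        pooled = []
        for conv in self.convs:
            f = torch.tanh(conv(x))             # (batch, n_filters, l - w + 1)
            pooled.append(f.max(dim=2).values)  # max-over-time, one value per filter
        return torch.cat(pooled, dim=1)         # concatenated vector -> Highway Network
```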
3. Highway Network
The input is the vector y produced by the CharCNN.
1) Apply an affine transformation to y: $\mathbf{W}_T \mathbf{y} + \mathbf{b}_T$ (weight matrix * input + bias)
2) Apply the sigmoid function to the result of 1). This is the transform gate t.
3) Compute the element-wise product of t and $\mathrm{ReLU}(\mathbf{W}_H \mathbf{y} + \mathbf{b}_H)$.
4) Compute $(1 - \mathbf{t})$, which is the carry gate.
5) Multiply the carry gate element-wise with y.
6) Finally, add the results of 3) and 5), then send the sum to the LSTM.
Both $\mathbf{W}_T$ and $\mathbf{W}_H$ are square matrices, for computational convenience (nn.Linear(dim, dim) in PyTorch; see the sketch below).
- the transform gate bias $\mathbf{b}_T$: initialized to -2
- the number of layers: 2
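A minimal PyTorch sketch of these highway layers, following the steps above; this is an illustration, not the authors' code, and the class name is my own.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """z = t * ReLU(W_H y + b_H) + (1 - t) * y, with t = sigmoid(W_T y + b_T)."""

    def __init__(self, dim, num_layers=2, gate_bias=-2.0):
        super().__init__()
        self.transforms = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.gates = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        for gate in self.gates:
            nn.init.constant_(gate.bias, gate_bias)  # transform gate bias initialized to -2

    def forward(self, y):
        for transform, gate in zip(self.transforms, self.gates):
            t = torch.sigmoid(gate(y))      # transform gate
            h = torch.relu(transform(y))    # candidate update
            y = t * h + (1.0 - t) * y       # carry gate is (1 - t)
        return y  # sent on to the LSTM
```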
4. LSTM
- to be added
Character-level Convolutional Neural Network
notation | explanation |
---|---|
$t$ : timestep | e.g. [someone, has, a, dream, ...] |
$\mathcal{C}$ : the vocabulary of characters | the size of the character embeddings: 15 |
$d$ : the dimensionality of character embeddings | $d=15$. If word = <'k','n','o','w'>, then the character-level embedding matrix is 15 by 4 |
$\mathbf{Q} \in \mathbb{R}^{d \times \vert \mathcal{C} \vert}$ : the character embedding matrix | each column is the embedding of one character |
$\mathbf{C}^k$ : the matrix of character embeddings for word $k$ | If word = <'k','n','o','w'>, then 15 by 65. 65 was the max_word_length in the code; START, END, and EOS tokens were applied as well. |
$\mathbf{H} \in \mathbb{R}^{d \times w}$ : a convolution filter (kernel) | $w$ is the width of the filter, $w = [1,2,3,4,5,6,7]$ (large model). Thus $d \times w$ means [15 by 1], [15 by 2], [15 by 3], [15 by 4], [15 by 5], [15 by 6], [15 by 7]. Every kernel matrix $\mathbf{H}$ is applied to every word. |
$\mathbf{f}^k \in \mathbb{R}^{l-w+1}$ : a feature map | $l$ is the length of word $k$ (including the special tokens) |
$h$ (small h) | the hidden size of the kernels, 650 (large model) |
Creation of a feature map f
In the paper, the product with the kernel matrix $\mathbf{H}$ is described as picking out a character n-gram, where the size of the n-gram corresponds to the filter width.
The result of this process is $\mathbf{y}^k$ (a vector): the CharCNN representation of the specific word $k$. The equations are written out below.
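Written out in the paper's notation, each filter $\mathbf{H}$ of width $w$ produces a feature map by a narrow convolution followed by tanh, and only the maximum of each feature map is kept (max-over-time pooling):

$$\mathbf{f}^k[i] = \tanh\left(\left\langle \mathbf{C}^k[\ast,\, i:i+w-1],\ \mathbf{H} \right\rangle + b\right), \qquad y^k = \max_i \mathbf{f}^k[i]$$

where $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{Tr}(\mathbf{A}\mathbf{B}^T)$ is the Frobenius inner product and $i$ runs over $1, \dots, l-w+1$. Concatenating these max values over all filters gives the vector $\mathbf{y}^k$.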
Highway Network
We could simply replace $\mathbf{x}^k$ (the word embedding) with $\mathbf{y}^k$ at each timestep $t$ in the RNN-LM.
$\mathbf{y}$ : the output from the CharCNN
By construction, the dimensions of $\mathbf{y}$ and $\mathbf{z}$ have to match, and hence $\mathbf{W}_T$ and $\mathbf{W}_H$ are square matrices.
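Written out, one highway layer computes (with $g$ = ReLU here, $\sigma$ the sigmoid, and $\odot$ the element-wise product):

$$\mathbf{t} = \sigma(\mathbf{W}_T \mathbf{y} + \mathbf{b}_T), \qquad \mathbf{z} = \mathbf{t} \odot g(\mathbf{W}_H \mathbf{y} + \mathbf{b}_H) + (1 - \mathbf{t}) \odot \mathbf{y}$$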
Recurrent Neural Network Language Model
$\mathcal{V}$ : the fixed-size vocabulary of words
$\mathbf{P}$ : the output embedding matrix
$\mathbf{p}^j$ : the output embedding of word $j$
$q^j$ : the bias term (= $b$)
$g$ : a composition function
Our model simply replaces the input embeddings x with the output from a character-level convolutional neural network.
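With this notation, the prediction of the next word at timestep $t$ is the usual softmax over the word vocabulary, taking the LSTM hidden state $\mathbf{h}_t$:

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp\left(\mathbf{h}_t \cdot \mathbf{p}^j + q^j\right)}{\sum_{j' \in \mathcal{V}} \exp\left(\mathbf{h}_t \cdot \mathbf{p}^{j'} + q^{j'}\right)}$$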
Baselines
Optimization
Dropout
Hierarchical Softmax
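The equation discussed below factorizes the next-word probability into a cluster term and a within-cluster term; roughly, in the paper's notation (with $\mathbf{s}^r$, $t^r$ the embedding and bias of cluster $r$, $c$ the number of clusters, and $\mathcal{V}_r$ the words assigned to cluster $r$):

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp\left(\mathbf{h}_t \cdot \mathbf{s}^r + t^r\right)}{\sum_{r'=1}^{c} \exp\left(\mathbf{h}_t \cdot \mathbf{s}^{r'} + t^{r'}\right)} \times \frac{\exp\left(\mathbf{h}_t \cdot \mathbf{p}^r_j + q^r_j\right)}{\sum_{j' \in \mathcal{V}_r} \exp\left(\mathbf{h}_t \cdot \mathbf{p}^r_{j'} + q^r_{j'}\right)}$$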
The first term of the equation is the probability of picking cluster $r$.
The second term of the equation is the probability of picking word $j$ (given cluster $r$).
The authors found that hierarchical softmax was not necessary for models trained on DATA-S.
References
1) http://web.stanford.edu/class/cs224n/
2) https://www.quantumdl.com/entry/3%EC%A3%BC%EC%B0%A81-CharacterAware-Neural-Language-Models