
NLP

The Development of and Comparison between ELMo, GPT1, GPT2, BERT, RoBERTa, and ALBERT

Due to unavoidable circumstances I originally wrote this in English, so I am posting it in English for now. I would prefer Korean too, honestly.

If I find time, I will post a Korean version as well. GPT3 has just come out, and perhaps because they too realized that zero-shot was overreaching (as I point out below), they shifted the concept to few-shot. When I review that one, I will probably write in Korean.

The source is me, and this post is full of my own opinions. I am 정성희, a Ph.D. student at KAIST. If you are busy, you might want to read just the Discussion at the end.

I have implemented byte pair encoding, BERT, and Word2Vec from scratch on my GitHub, so if you need a more detailed understanding than the papers provide, feel free to stop by.

https://github.com/hash2430

 


1. ELMo: Peters, M., NAACL, 2018

This paper proposes a language embedding trained from a bi-directional language model.

They propose this because training language representations on machine translation is inefficient due to the lack of parallel data.

The name 'embeddings from language models' is, I guess, in contrast to 'embeddings from machine translation'.
Their bi-directional language modeling task is different from that of BERT.

They use simple concatenation of hidden states from forward language model and backward language model. 

I see why the authors of BERT call it 'shallow' bi-directional. 

Also, even though this paper opens by drawing attention to the 'polysemy' problem, I doubt this method actually solves it.

It feels like they themselves fell into the trap of the polysemy of the word 'context'.

It seems that at the beginning of the paper 'context' meant the literal context (which words are neighboring), but by the end of the paper 'context' means the 'task'.

I get that their language representation is task-dependent, but I doubt that it is context-dependent. It says "To add ELMo to the supervised model, we first freeze the weights of the biLM and then concatenate the ELMo vector $ELMo_k^{task}$ with $x_k$ and pass the ELMo enhanced representation $[x_k;ELMo_k^{task}]$ into the task RNN."

As the upstream task is LM, by the time $R_k$ is learned and stored, it will carry context. However, $k$ is the subword index of the downstream-task sentence.

I think that when $R_k$ is looked up in the table, serialized to form $ELMo_k^{task}$, and then loaded at the input of the task RNN, it has no context from the downstream-task sentence.

Of course it will have context after the task RNN, but isn't that also true of Word2Vec, which they claim is a context-independent representation?

When using BERT, the entire encoder is transplanted to the downstream task, and the output of that pre-trained encoder generates context-dependent representations.

Thus, BERT's language representation includes the downstream-task sentence context whether the downstream task requires context or not. However, if a certain downstream task of ELMo had a structure that does not require context, I doubt the ELMo representation would have context in it, and I doubt it would have different representations for polysemous words.

But I liked the part where they make their representation task-dependent.

Also, it was interesting to see that they serialize the language representation to match the task, while BERT serializes the task input. By the way, I am not sure whether we should call it a "feature-based approach" or a "fine-tuning approach".

Even though $R_k$ is simply looked up from a table, $s_j^{task}$ and $\gamma^{task}$ are randomly initialized and then trained on the downstream task, and this can be considered partly fine-tuning $ELMo_k^{task}$.

I think ELMo falls somewhere in between that division.
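The task-specific mixing step can be sketched in a few lines of plain Python. This is a toy illustration (function name and numbers are mine), following the paper's formula $ELMo_k^{task} = \gamma^{task} \sum_j s_j^{task} h_{k,j}$:

```python
import math

def elmo_mix(layer_states, scalars, gamma):
    """Collapse the frozen per-layer biLM states for one token into a
    single task-specific vector: gamma * sum_j softmax(s)_j * h_{k,j}.
    Only `scalars` (s_j^task) and `gamma` (gamma^task) are trained on
    the downstream task; the layer states themselves stay frozen."""
    exps = [math.exp(s) for s in scalars]
    total = sum(exps)
    weights = [e / total for e in exps]          # softmax over layers
    mixed = [0.0] * len(layer_states[0])
    for w, layer in zip(weights, layer_states):
        for i, v in enumerate(layer):
            mixed[i] += w * v
    return [gamma * x for x in mixed]

# Two biLM layers for one token; equal layer weights (s = [0, 0]), gamma = 2.0
print(elmo_mix([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0], 2.0))  # → [4.0, 6.0]
```

Note that nothing here sees the neighboring tokens of the downstream sentence, which is exactly my point above.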

 

2. GPT1: Radford, A., preprint, 2018

By generative pretraining of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each task, this paper aims to learn a universal language representation that can be used for many downstream tasks.
What I like less about this paper than BERT is that its protocol for serializing each downstream-task input to match the upstream LM task is less simple,

while BERT has one quite simple serialization protocol across all downstream tasks.

Because of this difference in serialization protocols, delimiter tokens that are only required for downstream tasks are trained from scratch during downstream training, while the other tokens are being fine-tuned.
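As an illustration, here is a sketch of GPT1-style input serialization for a sentence-pair task. The token names follow the spirit of the paper's figure, not an actual vocabulary:

```python
def serialize_pair(premise, hypothesis):
    """GPT1-style serialization for a sentence-pair task: the start,
    delimiter, and extract tokens exist only for fine-tuning, so their
    embeddings must be trained from scratch on the downstream task,
    while the ordinary tokens are merely fine-tuned."""
    return ["<s>"] + premise + ["<delim>"] + hypothesis + ["<extract>"]

print(serialize_pair(["a", "man", "sleeps"], ["a", "person", "rests"]))
# → ['<s>', 'a', 'man', 'sleeps', '<delim>', 'a', 'person', 'rests', '<extract>']
```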

This paper has the identical goal as BERT but there are some differences in design choices.

GPT1 keeps the LM objective during fine-tuning of a downstream task, along with the original objective of that task.

This results in the language model being modified by the target-task data, which is not a good idea if the target task has little training data.
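The combined fine-tuning objective is just a weighted sum. A minimal sketch (to my recollection the paper uses 0.5 as the auxiliary weight, but treat the default here as illustrative):

```python
def gpt1_finetune_loss(task_loss, lm_loss, lam=0.5):
    """GPT1 fine-tuning objective: L = L_task + lambda * L_LM.
    Keeping the LM term means the target-task text keeps reshaping the
    language model itself, which is risky with small task datasets."""
    return task_loss + lam * lm_loss

print(gpt1_finetune_loss(task_loss=1.2, lm_loss=0.8))  # task CE 1.2, LM CE 0.8
```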

Another main difference from BERT is that GPT1 uses a unidirectional LM; thus, a transformer decoder is used for language modeling instead of a transformer encoder.

The GPT1 language model is not forced to learn inter-sentence relationships during pretraining at all.

The training data is composed of consecutive sentences, so it can learn longer-range context than BERT, but I guess this is only partially good if you cannot learn inter-sentence relationships.
I think the term 'generative' in the name 'generative pretraining' is a little misleading, because 'generative' is usually used to refer to models based on variational autoencoders or GANs.

I guess they used that term because they use transformer decoder which was originally designed for generating sentences in the transformer model.

 

3. GPT2: Radford, A., preprint, 2019

This paper starts from the idea that "We can make one large model that can do all NLP downstream tasks without supervision." This sounds rather radical to me.

Even if you have lived in an English-speaking country and have no problem communicating in English, you would still want to take at least three practice exams before taking the TOEIC.

They argue GPT2 can do zero-shot summarization, and I find this 'zero-shot' expression over-reaching. Even though they don't use a dataset dedicated to summarization, training text containing 'TL;DR:' is already some sort of supervision, I think. This model uses a unidirectional LM as in GPT1, which is another limitation.
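A minimal sketch of what that 'zero-shot' setup amounts to: the only task signal is a textual cue appended at inference time (the function is mine, purely illustrative):

```python
def summarization_prompt(article):
    """GPT2-style 'zero-shot' summarization: no summarization dataset
    is used, but the prompt ends with the 'TL;DR:' cue that occurs in
    web text, and the LM simply continues from it. That cue in the
    training data is arguably a weak form of supervision."""
    return article + "\nTL;DR:"

print(summarization_prompt("A long article about language models ..."))
```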

 

4. BERT: Devlin, J., preprint, 2019

BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.

As a result, the pretrained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
If I am asked what is so great about this paper, I would say "masked language model". 

Because of this, they use a transformer encoder as the architecture, unlike GPT1 and GPT2, which use a transformer decoder.

Even though ELMo uses a bi-directional language model, its way is less refined than masked language modeling.

ELMo just concatenates the hidden states obtained from the forward LM and the backward LM, which results in a shallower mix than masked LM.
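The 'shallow' mix is literally a vector concatenation of two independently trained directions, something like:

```python
def shallow_bidirectional(h_forward, h_backward):
    """ELMo-style bidirectionality: the forward and backward LMs are
    trained independently, so neither direction ever conditions on the
    other; their hidden states are only concatenated afterward."""
    return h_forward + h_backward  # list concatenation = [h_fwd; h_bwd]

print(shallow_bidirectional([0.1, 0.2], [0.3, 0.4]))  # → [0.1, 0.2, 0.3, 0.4]
```

In masked LM, by contrast, every layer attends to both sides jointly.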

Other than that, with a bit of exaggeration, it is a repetition of preceding works.

I also personally like how they simplified the serialization protocol for all downstream tasks, eliminating task-specific tokens and the issues with training them.

Still, they had to bridge the gap created by the special token [MASK] between the pretraining task and downstream tasks in a heuristic manner, because there will be no [MASK] in a downstream task.
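That heuristic is the 80/10/10 rule: of the roughly 15% of positions chosen as prediction targets, only 80% actually become [MASK]. A self-contained sketch (helper name and defaults are mine):

```python
import random

def mask_for_mlm(tokens, vocab, mask_rate=0.15, rng=None):
    """BERT-style masked-LM corruption: ~15% of positions become
    prediction targets; of those, 80% are replaced with [MASK], 10%
    with a random vocabulary token, and 10% are kept unchanged, so the
    model cannot rely on always seeing [MASK] at target positions."""
    rng = rng or random.Random(0)
    corrupted, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue
        targets.append((i, tok))        # model must predict the original
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)
        # else: leave the original token in place
    return corrupted, targets

sentence = "the cat sat on the mat".split()
print(mask_for_mlm(sentence, vocab=["dog", "ran", "blue"]))
```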

They propose a next-sentence prediction loss alongside the masked language model loss in order to learn inter-sentence relationships. However, following studies suggested that this objective is too easy to learn and not so helpful: because the 'negative' pairs were sampled from different documents, the model only had to tell whether the 'topic' of each sentence matched, and did not have to learn the coherence between the two sentences.

 

5. RoBERTa: Liu, Y.,  preprint,  2019

The authors argue that with better hyperparameter settings, BERT's performance can be significantly enhanced, defeating other modified versions of BERT.

 Their approaches are 

(1) training the model longer, with bigger batches, over more data; 

(2) removing the next sentence prediction objective; 

(3) training on longer sequences; 

(4) dynamically changing the masking pattern applied to the training data.
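Point (4) is easy to picture: instead of fixing the masked copies of the corpus during preprocessing, a fresh mask is drawn each time a sentence is seen. A toy sketch (names mine):

```python
import random

def dynamic_masks(tokens, epochs, mask_rate=0.15, seed=0):
    """RoBERTa-style dynamic masking: BERT's original setup duplicated
    the data with masks chosen once in preprocessing; here a new mask
    is sampled for every pass, so the model sees different prediction
    targets for the same sentence over the course of training."""
    rng = random.Random(seed)
    return [
        ["[MASK]" if rng.random() < mask_rate else tok for tok in tokens]
        for _ in range(epochs)
    ]

for copy in dynamic_masks("the cat sat on the mat".split(), epochs=3):
    print(copy)
```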

I think there is not much to discuss about this paper, because it is mostly just convincing and believable.

I was simply surprised by the amount of resources needed to train this model, and thought I would never be able to apply this paper to my own research given the limited resources I have.

I am surprised that the University of Washington can offer so many resources to its researchers.

They used DGX-1 machines with 8 × V100 GPUs each. They used 160 GB of pretraining data, whereas the original BERT used 16 GB. They trained with a batch size of 8K sequences, whereas the original used 256.

How is that even possible? What I find interesting here is that they showed empirically that the next-sentence prediction objective is not helpful, even though the reasoning behind that finding was weak.

 

6. ALBERT: Lan, Z., ICLR, 2020.

This paper presents two parameter reduction techniques to lower memory consumption and increase the training speed of BERT.

Those are token embedding decomposition and cross-layer parameter sharing.

They also propose an inter-sentence coherence loss to replace BERT's next-sentence prediction loss.
The trend is to use ever bigger pre-trained models to obtain language representations that can be used universally.

The computation-resource issue is serious for researchers at universities, who have less access to GPUs than those who work at Google. This paper tackles the real-world problems of memory efficiency and training speed.

Now I understand why many top scores on the leader board come from ALBERT-based approaches.

Because it is light, it is easy to adopt.

You don't have to work at Google with access to multiple GPUs to use ALBERT.
The first time I heard of 'factorized embedding parameterization', the first parameter-reduction approach, I assumed it would involve advanced math. It didn't, and I liked it because it was simple.

It starts from the assumption that recent contextualized representations work better than context-independent models because of the deep hidden states, which carry context, rather than the initial input tokens.

From this, they decide to use different values for the hidden-state dimension ($H$) and the input token-embedding dimension ($E$).

They propose having $H \gg E$. This leads to huge parameter savings, because $E$ is the very number that is multiplied by the vocabulary size, which is huge.

By this decomposition, they change the parameter count from $O(V \times H)$ to $O(V \times E + E \times H)$.
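Plugging in BERT-base-like numbers makes the saving concrete (the vocabulary and dimension values below are the commonly cited ones, used here for illustration):

```python
def embedding_params(V, H, E=None):
    """Parameter count of the token-embedding block: a direct V x H
    lookup table versus ALBERT's factorization into a V x E table
    followed by a learned E x H projection."""
    return V * H if E is None else V * E + E * H

V, H, E = 30000, 768, 128         # vocab size, hidden dim, ALBERT's small E
print(embedding_params(V, H))     # → 23040000 (V x H)
print(embedding_params(V, H, E))  # → 3938304  (V x E + E x H), ~6x fewer
```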
The second method is parameter sharing across repeated transformer encoder blocks.

 They boldly choose to share all the parameters in the block.

Of course there is some performance degradation depending on the downstream task, but I would gladly pay this trade-off given that the number of parameters goes down from 108M to 31M. Also, the drop in performance is not serious for most of the tasks.
For the sentence-order prediction (SOP) loss, I think the authors make compelling argument.

When I implemented BERT for assignment 3, I made 'negative' sentence pairs from sentences that might come from the same paragraph, might even be the same sentence, or might even be consecutive but in reversed order.

The TAs said their model had been trained longer than mine, but my pretrained model captured context better when I visualized the hidden-state vectors in a sentence and analyzed them qualitatively. I bet that making the 'negative' pairs from two consecutive sentences in reversed order would give even better results.
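The SOP data construction is simple enough to state in code (a toy sketch; the example sentences are made up):

```python
def sop_examples(paragraph):
    """ALBERT's sentence-order prediction: the positive example is two
    consecutive sentences in their original order, and the negative is
    the same pair swapped. Since both classes share topic and vocabulary
    (unlike NSP's negatives drawn from other documents), the model is
    forced to learn coherence rather than topic matching."""
    examples = []
    for first, second in zip(paragraph, paragraph[1:]):
        examples.append(((first, second), 1))   # in order
        examples.append(((second, first), 0))   # swapped
    return examples

print(sop_examples(["It rained all day.", "So we stayed inside."]))
```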

 

7. Discussions

I like that they use a language model to obtain language representations that can be used universally for any downstream task. It is similar to how I studied a foreign language (English) years ago.

 As a kid, I would just listen and read a lot in English, sometimes memorizing some sentences.

 When I had to take any form of English exam, I just solved 3 trial exams and did not have to spend time on studying each problem type.

When I see grammar problems, I don't have to think about what the verb and the object are and things like that. Actually, there is no time for that during the test.

 It just feels like something is off if the grammar or the choice of word is wrong.

I could just feel that "that word does not belong in that position or in that form in that sentence", which is the definition of a language model (the conditional probability of a certain subword given the other subwords in the sentence).

So I agree with the idea that "if you can tell that some word coming at some position is weird (has low conditional probability), then you know the language and can solve problems with only a bit of supervised data".

"Knowing a language" can be an abstract concept, but they made it clear and concrete so that it is amenable to scientific study.
I was weak at one type of question (maybe I still am).

It was the "where in this paragraph does this sentence belong?" type, and after observing that I was consistently bad at this type of problem, I had to train for it separately.

I guess I could have resolved that problem more easily if I had known about ALBERT's SOP loss back then.
I also like that there are ongoing studies on distillation and on making BERT lighter.

As a researcher with limited access to GPUs, decreasing the memory footprint and computation cost is crucial, even at some cost in performance.

 

8. References

[1] Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). (ELMo) Deep Contextualized Word Representations, 2227–2237. doi.org/10.18653/v1/n18-1202

 

[2] Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). (GPT1) Improving Language Understanding by Generative Pre-Training. OpenAI, 1–10. Retrieved from s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language

 

[3] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). (GPT2) Language Models are Unsupervised Multitask Learners.

 

[4] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arxiv.org/abs/1810.04805

 

[5] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arxiv.org/abs/1907.11692

 

[6] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 1–17. arxiv.org/abs/1909.11942