How to become an expert in NLP in 2019 (1)
In this post, I will focus on all the theoretical knowledge you need for the latest trends in NLP. I made this reading list as I learned new concepts. In the next post, I will share what I use to practice these concepts, including fine-tuning and models that reached rank 1 on competition leaderboards. Use this link to get to part 2 (still to be written).
The resources include papers, blog posts, and videos.
It is not necessary to read everything. Your main goal should be to understand what each paper introduced, how it works, and how it compares with the state of the art.

Trend: Use bigger transformer-based models and solve multi-task learning.
Warning: An increasing trend in NLP is that if you come up with a new idea while reading any of these papers, you will need massive compute power to get any reasonable results. So in practice you are limited to the open-source models.
1. fastai:- I had already watched the videos, so I thought I should add it to the top of the list.
- Lesson 4 of Practical Deep Learning for Coders. It will get you up to speed on how to implement a language model in fastai.
- There is also Lesson 12 in part 2 of the course, but it has not been released officially yet, so I will update the link when it is uploaded.
2. LSTM:- Although transformers are mainly used nowadays, in some cases you can still use an LSTM, and it was the first successful model to get good results. You should use AWD_LSTM now if you want (there is a minimal usage sketch after the resources below).
- LONG SHORT-TERM MEMORY paper. A quick skim of the paper is sufficient.
- Understanding LSTM Networks blog. It explains all the details of the LSTM network graphically.
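To see the moving parts, here is a minimal sketch (with made-up sizes and none of the regularization tricks) of a word-level LSTM language model using PyTorch's nn.LSTM:

```python
import torch
import torch.nn as nn

# Minimal word-level LSTM language model (made-up sizes, no dropout or weight tying).
vocab_size, emb_dim, hidden_dim = 10000, 300, 512

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
decoder = nn.Linear(hidden_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (32, 70))   # batch of 32 sequences, 70 tokens each
hidden_states, _ = lstm(embedding(tokens))        # (32, 70, hidden_dim)
logits = decoder(hidden_states)                   # next-token scores: (32, 70, vocab_size)
```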
3. AWD_LSTM:- It was proposed to overcome the shortcomings of the LSTM by introducing dropout between hidden layers, embedding dropout, and weight tying. You should use AWD_LSTM instead of a plain LSTM (the weight-tying trick is sketched after the resources below).
- Regularizing and Optimizing LSTM Language Models paper (the AWD_LSTM paper).
- Official code by Salesforce
- fastai implementation
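Of the three tricks, weight tying is the easiest to show in isolation; a minimal sketch is below (the dropout variants are best read directly from the official or fastai code above):

```python
import torch.nn as nn

# Weight tying: the output projection reuses the embedding matrix, so the
# embedding size must match the decoder's input size (the paper uses 400).
vocab_size, emb_dim = 10000, 400

encoder = nn.Embedding(vocab_size, emb_dim)
decoder = nn.Linear(emb_dim, vocab_size)
decoder.weight = encoder.weight   # both layers now share one parameter matrix
```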
4. Pointer Models:- Although not necessary, these are a good read. You can think of them as pre-attention theory.
- Pointer Sentinel Mixture Models paper
- Official video of above paper.
- Improving Neural Language Models with a continuous cache paper
Extra: What is the difference between weight decay and L2 regularization? With weight decay, you directly add a term to the update rule, while with L2 regularization the penalty is added to the loss function. Why bring this up? Most probably the DL libraries are using weight_decay instead of L2 regularization under the hood.
In some papers, you will see that the authors preferred SGD over Adam, citing that Adam does not give good performance. The reason for that is (maybe) that PyTorch/TensorFlow make the above mistake. This is explained in detail in this post.
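To make the distinction concrete, here is a sketch of the two SGD update rules (the helper functions are hypothetical, not library internals). For plain SGD the two only differ by a rescaling of the coefficient, but for adaptive optimizers like Adam they genuinely differ, which is what AdamW fixes:

```python
import torch

w = torch.randn(10)   # a parameter
g = torch.randn(10)   # pretend this is dLoss/dw

def sgd_step_l2(param, grad, lr=0.1, l2=1e-4):
    # L2 regularization: the penalty enters through the gradient of the loss.
    return param - lr * (grad + l2 * param)

def sgd_step_weight_decay(param, grad, lr=0.1, wd=1e-4):
    # Decoupled weight decay: the parameter is shrunk directly in the update rule,
    # without passing through the gradient (or Adam's moment estimates).
    return param - lr * grad - wd * param
```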
5. Attention:- Just remember Attention is not all you need.
- CS224n video explaining attention. Attention starts at 1:00:55.
- Attention Is All You Need paper. This paper also introduces the Transformer, which is nothing but a stack of encoder and decoder blocks. The magic is in how these blocks are made and connected (a minimal sketch of the core attention operation follows this list).
- You can read an annotated version of the above paper in PyTorch.
- Official video explaining Attention
- Google Blog for Transformer
- If you are interested in videos, you can check these: link1, link2.
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context paper. A better version of the Transformer, but BERT does not use it.
- Google Blog for Transformer-XL
- Transformer-XL — Combining Transformers and RNNs Into a State-of-the-art Language Model blog
- If you are interested in video you can check this link.
- The Illustrated Transformer blog
- Attention and Memory in Deep Learning and NLP blog.
- Attention and Augmented Recurrent Neural Networks blog.
- Building the Mighty Transformer for Sequence Tagging in PyTorch: Part 1 blog.
- Building the Mighty Transformer for Sequence Tagging in PyTorch: Part 2 blog.
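For reference, the scaled dot-product attention at the heart of the Transformer fits in a few lines. Here is a minimal PyTorch sketch of the equation from the paper; multi-head attention just runs several of these in parallel on linear projections of Q, K, and V and concatenates the results:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, from "Attention Is All You Need".
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (batch, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)                    # how much each query attends to each key
    return weights @ v, weights

q = k = v = torch.randn(2, 10, 64)   # self-attention: queries, keys, values from the same sequence
out, attn = scaled_dot_product_attention(q, k, v)
```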
There is a lot of research going on to make better transformers; maybe I will read more papers on this in the future. Some other transformers include the Universal Transformer and the Evolved Transformer, which used AutoML to come up with a Transformer architecture.
New transformer architectures alone do not solve the problem, because for your NLP tasks you need language models built from these transformer blocks. In most cases, you will not have the compute resources necessary to train these models, since it has been found that the more transformer blocks you use, the better. You also need large batch sizes to train these language models, which means you need either an Nvidia DGX or Google Cloud TPUs (PyTorch support coming someday).
6. Random resources:- You can skip this section. But for completeness, I provide all the resources I used.
- Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) blog
- Character-Level Language Modeling with Deeper Self-Attention paper.
- Using the Output Embedding to Improve Language Models paper.
- Quasi-Recurrent Neural Networks paper. A much faster version of the LSTM. It uses convolutional layers to make LSTM computations parallel. Code can be found in the fastai_library or official_code.
- Deep Learning for NLP Best Practices blog by Sebastian Ruder. A collection of best practices to be used when training LSTM models.
- Notes on the state of the art techniques for language modeling blog. A quick summary in which Jeremy Howard goes over some of the tricks he uses in the fastai library.
- Language Models and Contextualized Word Embeddings blog. Gives a quick overview of ELMo, BERT, and other models.
- The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) blog.
7. Multi-task Learning:- I am really excited about this. In this case, you train a single model for multiple tasks (more than 10 if you want). Your data looks like "translate to english some_text_in_german", and the model learns to use that initial information to choose the task it should perform (see the sketch after the resources below).
- An overview of Multi-Task Learning in deep neural networks paper.
- The Natural Language Decathlon: Multitask Learning as Question Answering paper.
- Multi-Task Deep Neural Networks for Natural Language Understanding paper.
- OpenAI GPT is an example of this.
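A sketch of what that data looks like in practice (the task prefixes and examples here are made up, in the spirit of the decaNLP setup):

```python
# One text-to-text model sees every task; the task is specified in the input itself.
examples = [
    ("translate to english: Wie geht es dir?",             "How are you?"),
    ("summarize: <a long news article>",                   "<a short summary>"),
    ("question: Who wrote Faust? context: <some passage>", "Goethe"),
]

for source, target in examples:
    # Every (source, target) pair goes to the same seq2seq model,
    # so the model must read the prefix to decide which task to perform.
    print(source, "->", target)
```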
8. PyTorch:- PyTorch provides good tutorials that show you how to code up most things in NLP. Although transformers are not covered in the tutorials, you should still go through them once.
Now we come to the latest research in NLP, which has resulted in NLP's ImageNet moment. All you need to understand is how attention works and you are set.
9. ELMo:- The first prominent work in which we moved from pretrained word embeddings to using a pretrained model to get the word embeddings. You use the whole input sentence to get the embeddings for the tokens present in that sentence.
- Deep Contextualized word representations paper (ELMo paper)
- If you are interested in video check this link.
10. ULMFiT:- Is this better than BERT? Maybe not, but ULMFiT still gets first place in Kaggle and external competitions (a rough sketch of the recipe follows the resources below).
- Universal Language Model Fine-tuning for Text Classification paper.
- Jeremy Howard blog post announcing ULMFiT.
- It is explained in Lesson 10 of Cutting Edge Deep Learning course.
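As a rough sketch of the ULMFiT recipe with the fastai v1 API (the dataset, file name, and hyperparameters below come from the library's own examples, so treat them as placeholders):

```python
from fastai.text import *  # fastai v1, as used in the 2019 course

path = untar_data(URLs.IMDB_SAMPLE)

# 1. Fine-tune the pretrained AWD_LSTM language model on your corpus.
data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.unfreeze()
learn_lm.fit_one_cycle(1, 1e-3)
learn_lm.save_encoder('ft_enc')          # keep the fine-tuned encoder

# 2. Reuse that encoder in a classifier and fine-tune it (with gradual unfreezing).
data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.train_ds.vocab)
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('ft_enc')
learn_clas.fit_one_cycle(1, 1e-2)
```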
11. OpenAI GPT:- I have not compared BERT with GPT-2, but you can work on some kind of ensemble if you want. Do not use GPT-1, as BERT was made to overcome its limitations.
12. BERT:- The most successful language model right now (as of May 2019).
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper.
- Google blog on BERT
- Dissecting BERT Part 1: The Encoder blog
- Understanding BERT Part 2: BERT Specifics blog
- Dissecting BERT Appendix: The Decoder blog
In order to use all these models in PyTorch, you should use the huggingface/pytorch_pretrained_BERT repo, which gives you complete implementations along with pretrained models for BERT, GPT, GPT-2, and Transformer-XL.
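As a quick sketch of extracting contextual features with that repo (roughly following its README; double-check the exact usage there):

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# This repo expects you to add the special [CLS]/[SEP] tokens yourself.
tokens = tokenizer.tokenize("[CLS] who was jim henson ? [SEP]")
token_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    encoded_layers, pooled = model(token_ids)          # one hidden-state tensor per layer
print(len(encoded_layers), encoded_layers[-1].shape)   # 12 layers, each (1, seq_len, 768)
```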
13. Next Blog:- I may be late writing the next blog, so I wanted to share this last thing.
- Reducing BERT Pre-Training from 3 Days to 76 Minutes paper. A new optimizer is introduced for language models that can significantly reduce the training time.
Congrats you made it to the end. You now have most of the theoretical knowledge needed to practice NLP using the latest models and techniques.
What to do now?
You have only learned the theory so far; now practice as much as you can. Create crazy ensembles if you want, and try to get to the top of the leaderboards. I am struggling to practice my own NLP tasks right now, as I am busy with some computer vision projects, which you can check out below or on my github.
Most probably I will make a follow-up post by mid or end of June with a list like this one, covering some new techniques I plan to read about and the things I will do for practice.
If you find this post useful, share this with others that may benefit from it and give this post a clap (it helps a lot, you can give 50 claps max).
Follow me on Medium to get my latest blog posts in your medium feed. My socials linkedin, github, twitter.
My previous blog posts