BERT
BERT is a neural network model developed by Jacob Devlin et al. at Google in 2018. It significantly improves the performance of neural networks on natural language processing tasks compared with most earlier network types, including the previous leader, recurrent neural network (RNN) architectures. BERT addresses RNN weaknesses such as handling long sequences of text, scalability and parallelism. It does so by building on the Transformer, an architecture introduced in 2017 by Vaswani et al. in "Attention Is All You Need".
Transformers apply positional encoding and attention to build outputs. Positional encoding stores word-order information in the data itself, so the model does not have to read the words one at a time. Attention determines the relationship between every word in the input and establishes how it relates to each word in the output. These relationships are learned from data by seeing many examples.
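The two ideas above can be sketched in a few lines of plain Python. This is a toy illustration and not how BERT is implemented. The function names and the tiny hand-made word vectors are assumptions made for the example, the attention has no learned query/key/value projections and only one head, and the sinusoidal position formula is the one from the original Transformer paper (BERT itself learns its position embeddings).

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: one d_model-sized vector per position."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimensions use sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product attention with queries = keys = values = x."""
    d = len(x[0])
    weights = []
    for q in x:
        # Score this word against every word in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    # Each output vector is a weighted mix of all input vectors.
    output = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return output, weights

# Toy input: three words with made-up 4-dimensional vectors.
tokens = ["the", "cat", "sat"]
embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
pe = positional_encoding(len(tokens), 4)
x = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(embeddings, pe)]
output, weights = self_attention(x)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 3) for w in row])
```

Each printed row shows how strongly one word attends to every word in the sentence, itself included. Because every row is computed independently of the others, the whole sequence can be processed in parallel, which is the property RNNs lack.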
BERT stands for:
- Bidirectional - it uses both left and right context (i.e., the whole input, not just the preceding or following words) when dealing with a word
- Encoder Representations - a language modelling system, built from the encoder part of the Transformer, that is pre-trained on unlabelled data and then fine-tuned for a specific task
- from Transformers - based on the Transformer architecture used in NLP
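To make the "bidirectional" and "pre-trained on unlabelled data" points more concrete, here is a minimal sketch of how masked training examples can be produced from raw text. The function name `mask_tokens` and the whitespace tokenisation are assumptions for illustration only. BERT's actual procedure uses a subword tokenizer and, for some of the selected positions, substitutes a random token or keeps the original instead of always writing `[MASK]`.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and keep the originals as labels.

    Simplified: every selected token becomes "[MASK]".
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)    # the model is trained to predict this word
        else:
            masked.append(tok)
            labels.append(None)   # nothing to predict at this position
    return masked, labels

sentence = "the cat sat on the mat because it was tired".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

No human labelling is needed, because the hidden words themselves are the training targets. To guess a hidden word, the model can look at the words on both sides of it, which is what "bidirectional" refers to.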
The mechanism at the heart of BERT is self-attention, which it inherits from the Transformer; BERT's own novelty was to apply it in both directions at once during pre-training, so that the model learns the underlying meaning of its inputs. For example, the model can pick up word meaning, grammar rules, tenses and gender, as well as the context of each word. For a complete visual guide to the inner workings of Transformers, take a look at https://jalammar.github.io/illustrated-transformer/
Here is a quick example of how BERT can be used for text classification (other uses include question answering systems and masked language modelling, MLM):