Generative Pre-Training

https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

Summary: Large gains on language tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. Task-aware input transformations during fine-tuning achieve effective transfer while requiring minimal changes to the model architecture.

This paper: a combination of unsupervised pre-training and supervised fine-tuning. Learn a universal representation that transfers with little adaptation to a wide range of tasks. Employ a two-stage training procedure: 1) use a language modeling objective on unlabeled data to learn the initial parameters of a neural network; 2) subsequently adapt these parameters to a target task using the corresponding supervised objective.

Model architecture: transformer. Evaluate on 4 language understanding tasks - natural language inference, question answering, semantic similarity, and text classification.

Framework:

Unsupervised pre-training:

Given an unsupervised corpus of tokens $U = \{u_1, \dots, u_n\}$, use a standard language modeling objective to maximize the following likelihood:

$$L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$

Here, $k$ is the size of the context window and the conditional probability $P$ is modeled using a neural network with parameters $\Theta$.
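A minimal (deliberately unoptimized) sketch of the objective $L_1$, assuming a hypothetical `model` that maps a context of token ids to next-token logits over the vocabulary:

```python
import torch
import torch.nn.functional as F

def lm_objective(model, tokens, k):
    """L1(U): sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over a list of token ids.

    `model` is a hypothetical callable mapping a (1, context_len) tensor of
    token ids to (1, vocab_size) logits for the next token.
    """
    total_log_prob = torch.zeros(())
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - k):i]                    # u_{i-k}, ..., u_{i-1}
        logits = model(torch.tensor([context]))              # (1, vocab_size)
        log_probs = F.log_softmax(logits, dim=-1)
        total_log_prob = total_log_prob + log_probs[0, tokens[i]]
    return total_log_prob  # maximize this (e.g. minimize its negative with SGD)
```

In practice the per-position terms are computed in parallel with a causal attention mask rather than one context at a time.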

The language model is a transformer decoder, a variant of the transformer. It applies a multi-headed self-attention operation over the input context tokens, followed by position-wise feedforward layers, to produce an output distribution over target tokens:

$$h_0 = U W_e + W_p$$
$$h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n]$$
$$P(u) = \mathrm{softmax}(h_n W_e^T)$$

where $U$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.
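A minimal sketch of this decoder-only forward pass. The hyperparameter defaults follow the paper's setup, but the use of `nn.TransformerEncoderLayer` with a causal mask as the `transformer_block` is an illustrative assumption, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class TransformerLM(nn.Module):
    """h_0 = U W_e + W_p;  h_l = transformer_block(h_{l-1});  P(u) = softmax(h_n W_e^T)."""

    def __init__(self, vocab_size=40000, n_ctx=512, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)    # token embedding matrix
        self.W_p = nn.Embedding(n_ctx, d_model)         # learned position embedding matrix
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
            for _ in range(n_layers)
        ])

    def forward(self, idx):                             # idx: (batch, seq) token ids
        seq = idx.shape[1]
        pos = torch.arange(seq, device=idx.device)
        h = self.W_e(idx) + self.W_p(pos)               # h_0 = U W_e + W_p
        causal = torch.full((seq, seq), float("-inf"), device=idx.device).triu(1)
        for block in self.blocks:                       # h_l = transformer_block(h_{l-1})
            h = block(h, src_mask=causal)               # mask keeps attention left-to-right
        return h @ self.W_e.weight.T                    # logits; softmax over them is P(u)
```

Note that the output projection reuses the token embedding matrix $W_e$, matching the $h_n W_e^T$ term in the equations above.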

Supervised fine-tuning:

After pre-training, we adapt the parameters to the supervised target task. Assume a labelled dataset $C$, where each instance consists of a sequence of input tokens $x^1, \dots, x^m$ along with a label $y$. The inputs are passed through the pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:

$$P(y \mid x^1, \dots, x^m) = \mathrm{softmax}(h_l^m W_y)$$

This gives the following objective to maximize:

$$L_2(C) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)$$
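A sketch of the fine-tuning head and of $L_2$, assuming a hypothetical `pretrained_lm` that returns its final hidden states of shape (batch, seq, d_model); the hidden state at the last input token plays the role of $h_l^m$:

```python
import torch.nn as nn
import torch.nn.functional as F

class FineTuneClassifier(nn.Module):
    """Pre-trained transformer plus a single added linear output layer W_y."""

    def __init__(self, pretrained_lm, d_model, n_classes):
        super().__init__()
        self.lm = pretrained_lm                 # assumed to return (batch, seq, d_model) states
        self.W_y = nn.Linear(d_model, n_classes, bias=False)

    def forward(self, x):                       # x: (batch, seq) token ids
        h = self.lm(x)                          # final transformer block activations
        h_m = h[:, -1, :]                       # h_l^m: activation at the last input token
        return self.W_y(h_m)                    # logits; softmax gives P(y | x^1, ..., x^m)

def l2_loss(classifier, x, y):
    """Negative of L2(C) for a batch, so minimizing it maximizes the objective."""
    return F.cross_entropy(classifier(x), y, reduction="sum")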

Including language modeling as an auxiliary objective during fine-tuning helps learning by improving generalization of the supervised model and accelerating convergence, so we optimize the following objective:

$$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$$
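A sketch of the combined objective, written as a loss to minimize (negative log-likelihoods). The paper sets $\lambda = 0.5$ during fine-tuning; the tensor shapes below are assumptions about how the batch is laid out:

```python
import torch.nn.functional as F

def l3_loss(lm_logits, lm_targets, task_logits, labels, lam=0.5):
    """-L3(C) = -L2(C) - lambda * L1(C), evaluated on the fine-tuning data.

    lm_logits:   (batch, seq, vocab) next-token logits over the task inputs
    lm_targets:  (batch, seq) next-token ids (inputs shifted by one)
    task_logits: (batch, n_classes) outputs of the added linear layer
    labels:      (batch,) task labels
    """
    neg_l1 = F.cross_entropy(lm_logits.flatten(0, 1), lm_targets.flatten())  # language modeling term
    neg_l2 = F.cross_entropy(task_logits, labels)                            # supervised term
    return neg_l2 + lam * neg_l1
```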
Task-specific input transformations

Certain tasks (e.g., question answering, entailment) have structured inputs; handling them directly would require a significant amount of task-specific customization. Instead, use a traversal-style approach: convert the structured input into an ordered sequence that the pre-trained model can process, using start, delimiter, and extract tokens - see the paper for details and the sketch below.
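A sketch of these traversal-style transformations. The token spellings `<s>`, `<$>`, and `<e>` are placeholders for the paper's randomly initialized start, delimiter, and extract tokens:

```python
# Linearize structured inputs into single token sequences the pre-trained
# model can consume without architectural changes.

def classification_input(text):
    return f"<s> {text} <e>"

def entailment_input(premise, hypothesis):
    # Concatenate premise and hypothesis with a delimiter token in between.
    return f"<s> {premise} <$> {hypothesis} <e>"

def similarity_inputs(text_a, text_b):
    # Sentence pairs have no inherent ordering, so both orderings are fed
    # through the model and their final activations are added element-wise
    # before the linear output layer.
    return [f"<s> {text_a} <$> {text_b} <e>",
            f"<s> {text_b} <$> {text_a} <e>"]

def multiple_choice_inputs(context, question, answers):
    # QA / commonsense reasoning: one sequence per candidate answer, each
    # scored independently; the scores are normalized with a softmax.
    return [f"<s> {context} {question} <$> {answer} <e>" for answer in answers]
```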
