# Generative Pre-Training

Summary: Large gains on language tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. Task-aware input transformations during fine-tuning achieve effective transfer while requiring minimal changes to the model architecture.

This paper: a combination of unsupervised pre-training and supervised fine-tuning. Learn a universal representation that transfers with little adaptation to a wide range of tasks. Employ a two-stage training procedure: 1 - a language modeling objective on unlabeled data to learn the initial parameters of a neural network; 2 - subsequently, adapt these parameters to a target task using the corresponding supervised objective.

Model architecture: transformer. Evaluated on 4 language understanding tasks - natural language inference, question answering, semantic similarity, and text classification.

Framework:

Unsupervised pre-training.

Given an unsupervised corpus of tokens $$U = \{u_1, \dots, u_n\}$$, use a standard language modeling objective to maximize the following likelihood:

$$
L_1(U) = \sum_i \log P(u_i | u_{i-k}, \dots, u_{i-1}; \Theta)
$$

Here, $$k$$ is the size of the context window and the conditional probability $$P$$ is modeled using a neural network with parameters $$\Theta$$.
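The objective above can be sketched directly: slide a window of size $$k$$ over the corpus and sum the log-probability the model assigns to each next token. This is a minimal illustration with a dummy uniform model standing in for the neural network (the corpus, vocabulary size, and `model_probs` are all assumptions for the sketch, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, k = 10, 3                           # toy vocabulary and context-window size
tokens = rng.integers(0, vocab_size, size=20)   # stand-in for the corpus U

def model_probs(context):
    """Placeholder for P(u_i | u_{i-k}, ..., u_{i-1}; Theta).
    Any conditional distribution over the vocabulary works here;
    a real implementation would be the transformer decoder below."""
    logits = np.ones(vocab_size)                # uniform dummy model
    e = np.exp(logits - logits.max())
    return e / e.sum()

def lm_objective(tokens, k):
    # L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1}; Theta)
    total = 0.0
    for i in range(k, len(tokens)):
        p = model_probs(tokens[i - k:i])
        total += np.log(p[tokens[i]])
    return total

L1 = lm_objective(tokens, k)
```

With the uniform dummy model each term is $$\log(1/10)$$, so the sketch just makes the summation structure of $$L_1$$ concrete.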

Transformer decoder for the language model, a variant of the transformer. It applies a multi-headed self-attention operation over the input context tokens, followed by position-wise feedforward layers, to produce an output distribution over target tokens:

$$
h_0 = U W_e + W_p \\
h_l = \text{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n] \\
P(u) = \text{softmax}(h_n W_e^T)
$$

where $$U$$ is the context vector of tokens, $$n$$ is the number of layers, $$W_e$$ is the token embedding matrix, and $$W_p$$ is the position embedding matrix. Note the output projection reuses $$W_e$$ (tied embeddings).
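The three equations map to a short forward pass: embed tokens plus positions, apply $$n$$ blocks, then project back through the tied embedding matrix. A minimal numpy sketch, where `transformer_block` is a placeholder (a real block would contain masked self-attention and a feedforward network) and all sizes are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_ctx, n_layers = 12, 8, 5, 2       # toy sizes (assumptions)

W_e = rng.normal(size=(vocab, d_model)) * 0.1       # token embedding matrix
W_p = rng.normal(size=(n_ctx, d_model)) * 0.1       # position embedding matrix
U = rng.integers(0, vocab, size=n_ctx)              # context token ids

def transformer_block(h):
    """Stand-in for one decoder block (masked multi-head self-attention
    + position-wise FFN); omitted to keep the sketch short."""
    return np.tanh(h)                               # placeholder nonlinearity

h = W_e[U] + W_p                  # h_0 = U W_e + W_p (U as one-hot row lookup)
for _ in range(n_layers):         # h_l = transformer_block(h_{l-1})
    h = transformer_block(h)

logits = h @ W_e.T                # tied output projection: h_n W_e^T
P = np.exp(logits - logits.max(-1, keepdims=True))
P /= P.sum(-1, keepdims=True)     # P(u) = softmax(h_n W_e^T)
```

Each row of `P` is a distribution over the vocabulary for the next token at that position.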

Supervised fine-tuning:

After pre-training, we adapt the parameters to the supervised target task. Assume a labelled dataset $$C$$, where each instance consists of a sequence of input tokens $$x^1, \dots, x^m$$ along with a label $$y$$. The inputs are passed through the pre-trained model to obtain the final transformer block's activation $$h_l^m$$, which is then fed into an added linear output layer with parameters $$W_y$$ to predict $$y$$:

$$
P(y | x^1, \dots, x^m) = \text{softmax}(h_l^m W_y)
$$

This gives the following objective to maximize:

$$
L_2(C) = \sum_{(x,y)} \log P(y | x^1, \dots, x^m)
$$
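The fine-tuning head is just a linear layer plus softmax on top of the final activation. A sketch assuming the activations $$h_l^m$$ are already computed (here faked with random vectors; sizes and the dataset are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_classes = 8, 3                          # toy sizes (assumptions)

W_y = rng.normal(size=(d_model, n_classes)) * 0.1  # the added linear output layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def log_prob(h_final, y):
    """h_final plays the role of h_l^m, the final transformer activation
    for the input sequence; returns log P(y | x^1, ..., x^m)."""
    return np.log(softmax(h_final @ W_y)[y])

# L2(C): sum of log-probabilities over a labelled dataset C
C = [(rng.normal(size=d_model), int(rng.integers(0, n_classes)))
     for _ in range(4)]
L2 = sum(log_prob(h, y) for h, y in C)
```

In practice one would maximize $$L_2$$ by gradient ascent over both $$W_y$$ and the pre-trained parameters.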

Including language modeling as an auxiliary objective during fine-tuning helps learning by improving the generalization of the supervised model and accelerating convergence, so we optimize the following objective:

$$
L_3(C) = L_2(C) + \lambda \cdot L_1(C)
$$
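The combination itself is a one-liner; the paper sets $$\lambda = 0.5$$ during fine-tuning (the numeric inputs below are illustrative only):

```python
def combined_objective(l2, l1, lam=0.5):
    """L3(C) = L2(C) + lambda * L1(C): supervised objective plus a
    lambda-weighted language-modeling auxiliary objective."""
    return l2 + lam * l1

# e.g. a supervised log-likelihood of -2.0 and an LM log-likelihood of -4.0
L3 = combined_objective(-2.0, -4.0)
```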

<figure><img src="/files/uEgkv9Ok4g0TBOnmk6tu" alt=""><figcaption><p>Model Architecture</p></figcaption></figure>

Some tasks (e.g. question answering, entailment) have structured inputs and would otherwise require a significant amount of task-specific customization. Instead, use a traversal-style approach: convert the structured input into an ordered sequence that the pre-trained model can process - see paper for details.
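As a concrete illustration of the traversal-style transformation, an entailment pair can be flattened into one token sequence with a delimiter between premise and hypothesis, bracketed by start and extract tokens. The token strings here are hypothetical placeholders, not the paper's actual vocabulary entries (the paper uses randomly initialized special tokens):

```python
# Hypothetical special tokens (assumptions for illustration):
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def transform_entailment(premise_tokens, hypothesis_tokens):
    """Traversal-style input transformation for natural language inference:
    concatenate premise and hypothesis with a delimiter token between them,
    bracketed by start and extract tokens. The model's activation at the
    extract token feeds the classification head."""
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [EXTRACT]

seq = transform_entailment(["a", "dog", "runs"], ["an", "animal", "moves"])
```

Similarity and multiple-choice tasks follow the same idea with multiple such sequences processed independently.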

