Jekyll2023-09-07T18:07:49+00:00https://joeddav.github.io/feed.xmlJoe DavisonML and NLP ResearcherJoe Davisonjoeddav@utah.cs.eduEmbracing the LLM Moment: Diving Head-First Into a Future of NLP & AI Research2023-03-27T00:00:00+00:002023-03-27T00:00:00+00:00https://joeddav.github.io/PhD-announcement<p>I’m going back for a PhD!</p>
<p>This weekend I accepted an offer to join the PhD program in the University of Utah’s School of Computing where I’ll work with <a href="https://svivek.com/">Vivek Srikumar</a> in the Utah NLP Group.</p>
<p>I have no idea what I’m getting myself into, and not for the usual reasons. I have several years’ experience in ML & NLP research, both in academia during my master’s degree and in industry as a member of Hugging Face’s science team.</p>
<p>My trepidation comes not from a lack of experience, but from the rapidly evolving nature of research in my chosen field.</p>
<p>From my vantage point, there is a seismic shift underway in the landscape of NLP & ML, making it difficult to predict what my work as a researcher will look like over the course of a 5-year doctoral program.</p>
<p>The recent success of LLMs like GPT-4 has turned the world of NLP and ML on its head, and many of us are left wondering what the future looks like for our field. On the one hand, that makes it feel foolish to commit myself to research whose shape is constantly changing – almost like writing a blank check.</p>
<p>On the other hand, that is precisely what makes this the most thrilling time to dive back into academia. The questions we as a field will need to answer haven’t even been articulated yet – what better time to dive in and make an impact?</p>
<h3 id="llms-and-a-seismic-shift-in-research">LLMs and a Seismic Shift in Research</h3>
<p>The rise of LLMs and instruction-tuning methods like RLHF has shaken things up for researchers. Many have spent considerable time and attention developing methods for cleverly solving problems only to see LLMs come along and solve them better.</p>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-lang="en" data-theme="dark" data-align="center"><p lang="en" dir="ltr"><a href="https://twitter.com/andriy_mulyar/status/1636140367350312965">https://twitter.com/andriy_mulyar/status/1636140367350312965</a></p><a href="https://twitter.com/andriy_mulyar/status/1636140367350312965"></a></blockquote>
<p>We’ve been building elaborate Rube Goldberg machines, carefully engineering each component, only for a giant LLM to come along and accomplish the task from a prompt. While humbling and exciting to watch, it can also be a tough pill to swallow. It’s called the <em>bitter lesson</em> for a reason.</p>
<!-- <blockquote class="twitter-tweet" data-lang="en" data-theme="dark" data-conversation="none"><p lang="en" dir="ltr"><a href="https://twitter.com/MasterJeongK/status/1635967360866877442">https://twitter.com/MasterJeongK/status/1635967360866877442</a></p><a href="https://twitter.com/MasterJeongK/status/1635967360866877442"></a></blockquote> -->
<blockquote class="twitter-tweet" data-lang="en" data-theme="dark" data-align="center"><p lang="en" dir="ltr"><a href="https://twitter.com/joeddav/status/1636149957093900290">https://twitter.com/joeddav/status/1636149957093900290</a></p><a href="https://twitter.com/joeddav/status/1636149957093900290"></a></blockquote>
<p>But with this disruption comes opportunity. As we stand at the precipice of an uncertain future, we find ourselves at the most exciting moment in the history of NLP and ML research. We have the chance to ask new questions, explore new directions, and shape the trajectory of our field.</p>
<!-- <blockquote class="twitter-tweet" data-lang="en" data-theme="dark" data-conversation="none" data-align="center"><p lang="en" dir="ltr"><a href="https://twitter.com/srush_nlp/status/1636148196677242884">https://twitter.com/srush_nlp/status/1636148196677242884</a></p><a href="https://twitter.com/srush_nlp/status/1636148196677242884"></a></blockquote> -->
<h3 id="steering-the-future-of-research">Steering the Future of Research</h3>
<p>With the advent of LLMs, it’s easy to feel like our work is becoming obsolete. But I argue that the opposite is true. Now, more than ever, NLP and ML researchers are essential in shaping the future of AI.</p>
<blockquote class="reddit-embed-bq" data-embed-showtitle="true" data-embed-context="2" data-embed-depth="1" data-embed-showmedia="false" data-embed-theme="dark" data-embed-height="677"> <a href="https://www.reddit.com/r/MachineLearning/comments/11rizyb/d_anyone_else_witnessing_a_panic_inside_nlp_orgs/jcabzqg/">Comment</a><br /> by <a href="https://www.reddit.com/user/needlzor">u/needlzor</a> from discussion <a href="https://www.reddit.com/r/MachineLearning/comments/11rizyb/d_anyone_else_witnessing_a_panic_inside_nlp_orgs/">[D] Anyone else witnessing a panic inside NLP orgs of big tech companies?</a><br /> in <a href="https://www.reddit.com/r/MachineLearning/">MachineLearning</a> </blockquote>
<script async="" src="https://embed.reddit.com/widgets.js" charset="UTF-8"></script>
<p><br /></p>
<p>The uncertainty is precisely what is most exciting about this moment in AI research. Sure, it’s a little daunting, but it’s also invigorating. There’s an opportunity to make a serious impact on our field and on the world – to push for higher standards of scientific rigor and to act as grounded voices of sobriety as a counterweight to the throngs of overzealous “AI influencers.”</p>
<p>We have the opportunity to harness the power of LLMs toward a greater understanding of language, computation, and cognition. How can we make AI more transparent, responsible, and ethical? How can we ensure that AI benefits everyone, not just those with access to vast computational resources? How can we leverage LLMs to tackle previously intractable problems? What other key questions remain to be asked?</p>
<p>I, for one, am excited to contribute to this effort.</p>Joe Davisonjoeddav@utah.cs.eduThe success of LLMs has caused a seismic shift in the landscape of AI research. What better time to start a PhD?Zero-Shot Learning in Modern NLP2020-05-29T00:00:00+00:002020-05-29T00:00:00+00:00https://joeddav.github.io/ZSL<h2 id="state-of-the-art-nlp-models-for-text-classification-without-annotated-data">State-of-the-art NLP models for text classification without annotated data.</h2>
<p><em>Original post (with reader comments) on legacy site <a href="https://joeddav.github.io/blog/2020/05/29/ZSL.html">here</a></em></p>
<blockquote>
<p>Check out our live <a href="https://huggingface.co/spaces/joeddav/zero-shot-demo">zero-shot topic classification demo here</a>.</p>
</blockquote>
<p>Natural language processing is a very exciting field right now. In recent years, the community has begun to figure out some pretty effective methods of learning from the enormous amounts of unlabeled data available on the internet. The success of transfer learning from unsupervised models has allowed us to surpass virtually all existing benchmarks on downstream supervised learning tasks. As we continue to develop new model architectures and unsupervised learning objectives, “state of the art” continues to be a rapidly moving target for many tasks where large amounts of labeled data are available.</p>
<p>One major advantage as models continue to grow is that we see a very slow decrease in the reliance on large amounts of annotated data for downstream tasks. This week the team at OpenAI released a preprint describing their largest model yet, GPT-3, with 175 billion parameters. The paper, entitled <a href="https://arxiv.org/abs/2005.14165">“Language Models are Few-Shot Learners”</a>, shows that extremely large language models can perform competitively on downstream tasks with far less task-specific data than would be required by smaller models.</p>
<p><img src="https://joeddav.github.io/blog/images/zsl/gpt3_triviahq.png" alt="gpt3 triviahq performance" title="GPT-3 few-shot performance as # of parameters grows" /></p>
<p>However, models of this size remain impractical for real-world use. For instance, the largest version of GPT-3 must be partitioned across dozens of GPUs to even fit in memory. In many real-world settings, annotated data is either scarce or unavailable entirely. Models much smaller than GPT-3 such as BERT have still been shown to encode a tremendous amount of information in their weights (<a href="https://arxiv.org/abs/1909.01066">Petroni et al. 2019</a>). It seems like if we were smart about it, we would be able to figure out some techniques for applying these models to downstream tasks in a way that takes advantage of this latent information without the need for so much task-specific annotated data.</p>
<p>Of course, <em>some</em> research has in fact been done in this area. <strong>In this post, I will present a few techniques, both from published research and our own experiments at Hugging Face, for using state-of-the-art NLP models for sequence classification without large annotated training sets.</strong></p>
<h2 id="what-is-zero-shot-learning">What is zero-shot learning?</h2>
<p>Traditionally, zero-shot learning (ZSL) most often referred to a fairly specific type of task: learn a classifier on one set of labels and then evaluate on a different set of labels that the classifier has never seen before. Recently, especially in NLP, it’s been used much more broadly to mean <em>get a model to do something that it wasn’t explicitly trained to do.</em> A well-known example of this is in the <a href="https://pdfs.semanticscholar.org/9405/cc0d6169988371b2755e573cc28650d14dfe.pdf">GPT-2 paper</a> where the authors evaluate a language model on downstream tasks like machine translation without fine-tuning on these tasks directly.</p>
<p>The definition is not all that important, but it is useful to understand that the term is used in various ways and that we should therefore take care to understand the experimental setting when comparing different methods. For example, traditional zero-shot learning requires providing some kind of descriptor (<a href="http://proceedings.mlr.press/v37/romera-paredes15.pdf">Romera-Paredes et al. 2015</a>) for an unseen class (such as a set of visual attributes or simply the class name) in order for a model to be able to predict that class without training data. Understanding that different zero-shot methods may adopt different rules for what kind of class descriptors are allowed provides relevant context when communicating about these techniques.</p>
<h2 id="a-latent-embedding-approach">A latent embedding approach</h2>
<p>A common approach to zero-shot learning in the computer vision setting is to use an existing featurizer to embed an image and any possible class names into their corresponding latent representations (e.g. <a href="https://arxiv.org/abs/1301.3666">Socher et al. 2013</a>). One can then take some training set and use only a subset of the available labels to learn a linear projection to align the image and label embeddings. At test time, this framework allows one to embed any label (seen or unseen) and any image into the same latent space and measure their distance.</p>
<p>In the text domain, we have the advantage that we can trivially use a single model to embed both the data and the class names into the same space, eliminating the need for the data-hungry alignment step. This is not a new technique – researchers and practitioners have used pooled word vectors in similar ways for some time (such as <a href="https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2016-174.pdf">Veeranna et al. 2016</a>). But recently we have seen a dramatic increase in the quality of sentence embedding models. We therefore decided to run some experiments with Sentence-BERT, a recent technique which fine-tunes the pooled BERT sequence representations for increased semantic richness, as a method for obtaining sequence and label embeddings.</p>
<p>To formalize this, suppose we have a sequence embedding model $\Phi_\text{sent}$ and set of possible class names $C$. We classify a given sequence $x$ according to,</p>
\[\hat{c} = \arg\max_{c \in C} \cos(\Phi_\text{sent}(x), \Phi_\text{sent}(c))\]
<p>where $\cos$ is the cosine similarity. Here’s an example code snippet showing how this can be done using Sentence-BERT as our embedding model $\Phi_\text{sent}$:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># load the sentence-bert model from the HuggingFace model hub
</span><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">transformers</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">AutoModel</span>
<span class="kn">from</span> <span class="nn">torch.nn</span> <span class="kn">import</span> <span class="n">functional</span> <span class="k">as</span> <span class="n">F</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">'deepset/sentence_bert'</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModel</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">'deepset/sentence_bert'</span><span class="p">)</span>
<span class="n">sentence</span> <span class="o">=</span> <span class="s">'Who are you voting for in 2020?'</span>
<span class="n">labels</span> <span class="o">=</span> <span class="p">[</span><span class="s">'business'</span><span class="p">,</span> <span class="s">'art & culture'</span><span class="p">,</span> <span class="s">'politics'</span><span class="p">]</span>
<span class="c1"># run inputs through model and mean-pool over the sequence
# dimension to get sequence-level representations
</span><span class="n">inputs</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">batch_encode_plus</span><span class="p">([</span><span class="n">sentence</span><span class="p">]</span> <span class="o">+</span> <span class="n">labels</span><span class="p">,</span>
<span class="n">return_tensors</span><span class="o">=</span><span class="s">'pt'</span><span class="p">,</span>
<span class="n">pad_to_max_length</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">input_ids</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">[</span><span class="s">'input_ids'</span><span class="p">]</span>
<span class="n">attention_mask</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">[</span><span class="s">'attention_mask'</span><span class="p">]</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">attention_mask</span><span class="o">=</span><span class="n">attention_mask</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">sentence_rep</span> <span class="o">=</span> <span class="n">output</span><span class="p">[:</span><span class="mi">1</span><span class="p">].</span><span class="n">mean</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">label_reps</span> <span class="o">=</span> <span class="n">output</span><span class="p">[</span><span class="mi">1</span><span class="p">:].</span><span class="n">mean</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># now find the labels with the highest cosine similarities to
# the sentence
</span><span class="n">similarities</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">cosine_similarity</span><span class="p">(</span><span class="n">sentence_rep</span><span class="p">,</span> <span class="n">label_reps</span><span class="p">)</span>
<span class="n">closest</span> <span class="o">=</span> <span class="n">similarities</span><span class="p">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">descending</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">ind</span> <span class="ow">in</span> <span class="n">closest</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'label: </span><span class="si">{</span><span class="n">labels</span><span class="p">[</span><span class="n">ind</span><span class="p">]</span><span class="si">}</span><span class="s"> </span><span class="se">\t</span><span class="s"> similarity: </span><span class="si">{</span><span class="n">similarities</span><span class="p">[</span><span class="n">ind</span><span class="p">]</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>label: politics similarity: 0.21561521291732788
label: business similarity: 0.004524140153080225
label: art & culture similarity: -0.027396833524107933
</code></pre></div></div>
<blockquote>
<p>Note: This code snippet uses <code class="language-plaintext highlighter-rouge">deepset/sentence_bert</code> which is the smallest version of the S-BERT model. Our experiments use larger models which are currently available only in the <code class="language-plaintext highlighter-rouge">sentence-transformers</code> <a href="https://github.com/UKPLab/sentence-transformers">GitHub repo</a>, which we hope to make available in the Hugging Face model hub soon.</p>
</blockquote>
<p>One problem with this method is that Sentence-BERT is designed to learn effective sentence-level, not single- or multi-word representations like our class names. It is therefore reasonable to suppose that our label embeddings may not be as semantically salient as popular word-level embedding methods (e.g. word2vec). This is seen in the t-SNE visualization below where the data seems to cluster together by class (color) reasonably well, but the labels are poorly aligned. If we were to use word vectors as our label representations, however, we would need annotated data to learn an alignment between the S-BERT sequence representations and the word2vec label representations.</p>
<p><img src="https://joeddav.github.io/blog/images/zsl/tsne_no_projection.png" alt="visual of S-BERT label and text embeddings" title="t-SNE visualization of Yahoo Answers S-BERT embeddings. Plotted points correspond to data and text boxes to corresponding labels. While some labels like 'Computers & Internet' do appear near their corresponding data clusters in latent space, most are poorly aligned. " /></p>
<p>In some of our own internal experiments, we addressed this issue with the following procedure:</p>
<ol>
<li>Take the top $K$ most frequent words $V$ in the vocabulary of a word2vec model</li>
<li>Obtain embeddings for each word using word2vec, $\Phi_{\text{word}}(V)$</li>
<li>Obtain embeddings for each word using S-BERT, $\Phi_{\text{sent}}(V)$</li>
<li>Learn a least-squares linear projection matrix $Z$ with L2 regularization from $\Phi_{\text{sent}}(V)$ to $\Phi_{\text{word}}(V)$</li>
</ol>
<p>Since we have only learned this projection for embeddings of single words, we cannot expect it to learn an effective mapping between S-BERT sequence representations and labels embedded with word2vec. Instead, we use $Z$ in our classification only as an additional transformation to S-BERT embeddings for both sequences and labels:</p>
\[\hat{c} = \arg\max_{c \in C} \cos(\Phi_{\text{sent}}(x)Z, \Phi_{\text{sent}}(c)Z)\]
<p>This procedure can be thought of as a kind of dimensionality reduction. As seen in the t-SNE visual below, this projection makes the label embeddings much better aligned with their corresponding data clusters while maintaining the superior performance of S-BERT compared to pooled word vectors. Importantly, this procedure does not require any additional data beyond a word2vec dictionary sorted by word frequency.</p>
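<p>As a rough illustration of the procedure above, here is a minimal NumPy sketch of the regularized least-squares fit and the projected cosine classifier. The function names, the regularization strength, and the random arrays standing in for $\Phi_{\text{sent}}(V)$ and $\Phi_{\text{word}}(V)$ are all assumptions made for the example; this is not the code used in our experiments.</p>

```python
import numpy as np

def fit_ridge_projection(sent_emb, word_emb, l2=1.0):
    """Solve min_Z ||sent_emb @ Z - word_emb||^2 + l2 * ||Z||^2 in closed form."""
    d = sent_emb.shape[1]
    gram = sent_emb.T @ sent_emb + l2 * np.eye(d)
    return np.linalg.solve(gram, sent_emb.T @ word_emb)

def cosine_classify(x_emb, label_embs, Z):
    """Return the index of the label whose projected embedding is closest to x."""
    x = x_emb @ Z
    labels = label_embs @ Z
    sims = (labels @ x) / (np.linalg.norm(labels, axis=1) * np.linalg.norm(x))
    return int(np.argmax(sims))

# Random stand-ins for Phi_sent(V) and Phi_word(V). Real embeddings would
# come from S-BERT and word2vec for the top-K most frequent words.
rng = np.random.default_rng(0)
K, d_sent, d_word = 1000, 32, 16
phi_sent_V = rng.normal(size=(K, d_sent))
phi_word_V = rng.normal(size=(K, d_word))

Z = fit_ridge_projection(phi_sent_V, phi_word_V, l2=10.0)
print(Z.shape)  # (32, 16)
```

<p>Note that $Z$ maps from the S-BERT dimension to the (typically smaller) word2vec dimension, which is why the projection behaves like a dimensionality reduction.</p>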
<p>On the Yahoo Answers topic classification task, we find an F1 of $46.9$ and $31.2$ with and without this projection step, respectively. For context, Yahoo Answers has 10 classes and <a href="https://paperswithcode.com/sota/text-classification-on-yahoo-answers">supervised models</a> get an accuracy in the mid 70s.</p>
<p><img src="https://joeddav.github.io/blog/images/zsl/tsne_with_projection.png" alt="visual of S-BERT + projection label and text embeddings" title="t-SNE visualization of embeddings with S-BERT to word2vec projection. This extra projection step results in labels which appear much closer to their corresponding data clusters compared to the previous visual." /></p>
<h3 id="when-some-annotated-data-is-available">When some annotated data is available</h3>
<p>This technique is flexible and easily adapted to the case where a limited amount of labeled data is available (few-shot learning) or where we have annotated data for only a subset of the classes we’re interested in (traditional zero-shot learning).</p>
<p>To do so, we can simply learn an additional least-squares projection matrix to the embeddings of any available labels from their corresponding data embeddings. However, it is important that we do so in a way that does not overfit to our limited data. Our embeddings perform well on their own, so we need to find a projection between them that learns from what training data we have while still utilizing the semantic richness of these representations.</p>
<p>To this end, we add a variant of L2 regularization which pushes the weights towards the identity matrix rather than decreasing their norm. If we define $X_{Tr}, Y_{Tr}$ to be our training data and labels and $\Phi(X) = \Phi_\text{sent}(X)Z$ to be our embedding function as described above, our regularized objective is,</p>
\[W^\ast = \arg\min_W || \Phi(X_{Tr}) W - \Phi(Y_{Tr}) ||^2 + \lambda ||W - \mathbb{I}_d||^2\]
<p>This is equivalent to Bayesian linear regression with a Gaussian prior on the weights centered at the identity matrix and variance controlled by $\lambda$. By pushing $W$ towards the identity matrix, we’re effectively pushing the resulting projected embeddings $\Phi(X)W^\ast$ towards $\Phi(X)$, which is exactly what we want to do. Informally, we have a prior belief that the best representation for our data is our embedding function $\Phi(X)\mathbb{I}_d=\Phi(X)$ and we update that belief only as we encounter more training data.</p>
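<p>For intuition, this objective has a closed-form solution. Assuming the embeddings are stacked as rows of matrices $A$ (data) and $B$ (labels), setting the gradient to zero gives $(A^\top A + \lambda \mathbb{I}_d) W = A^\top B + \lambda \mathbb{I}_d$. The sketch below solves that system with NumPy; the function name and the random stand-in data are assumptions for illustration only.</p>

```python
import numpy as np

def fit_identity_regularized(A, B, lam):
    """Solve min_W ||A @ W - B||^2 + lam * ||W - I||^2 in closed form.

    Setting the gradient to zero gives (A^T A + lam * I) W = A^T B + lam * I.
    """
    d = A.shape[1]
    eye = np.eye(d)
    return np.linalg.solve(A.T @ A + lam * eye, A.T @ B + lam * eye)

rng = np.random.default_rng(0)
n, d = 20, 8
A = rng.normal(size=(n, d))  # stand-in for embedded training texts
B = rng.normal(size=(n, d))  # stand-in for embeddings of their labels

W_strong = fit_identity_regularized(A, B, lam=1e6)   # prior dominates
W_weak = fit_identity_regularized(A, B, lam=1e-6)    # data dominates
W_ols = np.linalg.lstsq(A, B, rcond=None)[0]         # unregularized fit
```

<p>With a very large $\lambda$ the solution stays close to the identity, and with a very small $\lambda$ it approaches the ordinary least-squares fit, which matches the prior-versus-data reading above.</p>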
<h2 id="classification-as-natural-language-inference">Classification as Natural Language Inference</h2>
<p>We will now explore an alternative method which not only embeds sequences and labels into the same latent space where their distance can be measured, but can also tell us something about the compatibility of two distinct sequences out of the box.</p>
<p>As a quick review, <a href="http://nlpprogress.com/english/natural_language_inference.html">natural language inference (NLI)</a> considers two sentences: a “premise” and a “hypothesis”. The task is to determine whether the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given the premise.</p>
<p><img src="https://joeddav.github.io/blog/images/zsl/nli-examples.png" alt="example NLI sentences" title="Examples from http://nlpprogress.com/english/natural_language_inference.html" /></p>
<p>When using transformer architectures like BERT, NLI datasets are typically modeled via <em>sequence-pair classification</em>. That is, we feed both the premise and the hypothesis through the model together as distinct segments and learn a classification head predicting one of <code class="language-plaintext highlighter-rouge">[contradiction, neutral, entailment]</code>.</p>
<p>The approach, proposed by <a href="https://arxiv.org/abs/1909.00161">Yin et al. (2019)</a>, uses a pre-trained MNLI sequence-pair classifier as an out-of-the-box zero-shot text classifier that actually works pretty well. The idea is to take the sequence we’re interested in labeling as the “premise” and to turn each candidate label into a “hypothesis.” If the NLI model predicts that the premise “entails” the hypothesis, we take the label to be true. See the code snippet below which demonstrates how easily this can be done with 🤗 Transformers.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># load model pretrained on MNLI
</span><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">BartForSequenceClassification</span><span class="p">,</span> <span class="n">BartTokenizer</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">BartTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">'facebook/bart-large-mnli'</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">BartForSequenceClassification</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">'facebook/bart-large-mnli'</span><span class="p">)</span>
<span class="c1"># pose sequence as a NLI premise and label (politics) as a hypothesis
</span><span class="n">premise</span> <span class="o">=</span> <span class="s">'Who are you voting for in 2020?'</span>
<span class="n">hypothesis</span> <span class="o">=</span> <span class="s">'This text is about politics.'</span>
<span class="c1"># run through model pre-trained on MNLI
</span><span class="n">input_ids</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">premise</span><span class="p">,</span> <span class="n">hypothesis</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s">'pt'</span><span class="p">)</span>
<span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true
</span><span class="n">entail_contradiction_logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[:,[</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">]]</span>
<span class="n">probs</span> <span class="o">=</span> <span class="n">entail_contradiction_logits</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">true_prob</span> <span class="o">=</span> <span class="n">probs</span><span class="p">[:,</span><span class="mi">1</span><span class="p">].</span><span class="n">item</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Probability that the label is true: </span><span class="si">{</span><span class="n">true_prob</span><span class="p">:</span><span class="mf">0.2</span><span class="n">f</span><span class="si">}</span><span class="s">%'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Probability that the label is true: 99.04%
</code></pre></div></div>
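<p>The snippet above scores a single candidate label. To choose among several candidates, one option is to compute the same entailment-versus-contradiction probability for each label's hypothesis and take the highest. The sketch below shows only that post-processing step; the helper name is hypothetical and the logits are made up for illustration rather than produced by a model.</p>

```python
import numpy as np

def label_probabilities(nli_logits):
    """Map (num_labels, 3) NLI logits to P(label is true) for each label.

    Columns are assumed to be [contradiction, neutral, entailment], as in
    the snippet above. Neutral is dropped and a softmax is taken over the
    remaining two logits.
    """
    logits = np.asarray(nli_logits, dtype=float)[:, [0, 2]]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp[:, 1] / exp.sum(axis=1)

labels = ['business', 'art & culture', 'politics']
# Made-up logits, one row per (premise, hypothesis) pair; not model output.
logits = [[2.0, 0.5, -1.0],
          [1.5, 0.2, -0.5],
          [-2.0, 0.1, 3.0]]
probs = label_probabilities(logits)
best = labels[int(np.argmax(probs))]
print(best)  # politics
```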
<p>In the paper, the authors report a label-weighted F1 of $37.9$ on Yahoo Answers using the smallest version of BERT fine-tuned only on the Multi-genre NLI (MNLI) corpus. By simply using the larger and more recent Bart model pre-trained on MNLI, we were able to bring this number up to $53.7$.</p>
<p>See <a href="https://huggingface.co/spaces/joeddav/zero-shot-demo">our live demo here</a> to try it out for yourself! Enter a sequence you want to classify and any labels of interest and watch Bart do its magic in real time.</p>
<p><img src="https://joeddav.github.io/blog/images/zsl/zsl-demo-screenshot.png" alt="live demo" /></p>
<h3 id="when-some-annotated-data-is-available-1">When some annotated data is available</h3>
<p>Fine-tuning this model on a small number of annotated data points is not effective, so it is not particularly amenable to the few-shot setting. However, in the traditional zero-shot setting where we have sufficient data for a limited number of classes, this model excels. Training can be done by passing a sequence through the model twice: once with the correct label and once with a randomly selected false label, optimizing cross-entropy.</p>
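<p>To make the two-pass training idea concrete, here is a small sketch of how the training pairs might be built for one labeled example. The hypothesis template, the helper name, and the choice of entailment/contradiction targets are assumptions for illustration; the paper's exact setup may differ.</p>

```python
import random

ENTAILMENT, CONTRADICTION = 2, 0  # class indices used in the MNLI snippet above

def make_training_pairs(text, true_label, all_labels, rng,
                        template='This text is about {}.'):
    """Return two (premise, hypothesis, target) examples for one labeled text."""
    negatives = [label for label in all_labels if label != true_label]
    false_label = rng.choice(negatives)  # randomly selected false label
    return [
        (text, template.format(true_label), ENTAILMENT),
        (text, template.format(false_label), CONTRADICTION),
    ]

rng = random.Random(0)
pairs = make_training_pairs('Who are you voting for in 2020?', 'politics',
                            ['business', 'art & culture', 'politics'], rng)
for premise, hypothesis, target in pairs:
    print(target, '|', premise, '|', hypothesis)
```

<p>Each pair would then be fed to the sequence-pair classifier with its target and trained with the usual cross-entropy loss.</p>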
<p>One problem that arises after fine-tuning is that the model predicts a much higher probability for labels it has seen than for those it has not. To mitigate this issue, the authors introduce a procedure that penalizes labels at test time which were seen at training time. See <a href="https://www.aclweb.org/anthology/D19-1404/">the paper</a> for full details.</p>
<p>Check out <a href="http://35.208.71.201:8000/">our demo</a> to try out a version of this model fine-tuned on Yahoo Answers. You can also find the authors’ GitHub repo <a href="https://github.com/yinwenpeng/BenchmarkingZeroShot">here</a>.</p>
<h2 id="classification-as-a-cloze-task">Classification as a cloze task</h2>
<p>One in-the-works approach to keep your eye on is a preprint on Pattern-Exploiting Training (PET) from <a href="https://arxiv.org/abs/2001.07676">Schick et al. (2020)</a>. In this paper, the authors reformulate text classification as a cloze task. A cloze question considers a sequence which is partially masked and requires predicting the missing value(s) from the context. PET requires a human practitioner to construct several task-appropriate cloze-style templates which, in the case of topic classification, could look something like the following:</p>
<p><img src="https://joeddav.github.io/blog/images/zsl/cloze.png" alt="cloze examples" title="examples of cloze templates for topic classification. a and b are the question and answers in the case of Yahoo Answers and ____ is the class name which the model must predict." /></p>
<p>A pre-trained masked language model is then tasked with choosing the most likely value for the masked (blank) word from among the possible class names for each cloze sentence.</p>
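<p>The selection step can be sketched as follows: take the masked LM's scores at the blank position, keep only the entries for the candidate class names, and pick the largest. The toy vocabulary, the made-up logits, and the assumption that each class name is a single vocabulary token are all simplifications for illustration, not part of PET itself.</p>

```python
import numpy as np

def cloze_classify(mask_logits, vocab, class_names):
    """Pick the class name with the highest score at the masked position.

    mask_logits: 1-D array of scores over the vocabulary for the blank.
    vocab: dict mapping token string -> index into mask_logits.
    class_names: candidate labels, each assumed to be a single vocab token.
    """
    ids = [vocab[name] for name in class_names]
    scores = np.asarray(mask_logits, dtype=float)[ids]
    scores = scores - scores.max()  # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return class_names[int(np.argmax(probs))], probs

# Toy vocabulary and made-up logits; a real masked LM would supply these.
vocab = {'the': 0, 'sports': 1, 'politics': 2, 'business': 3, 'cat': 4}
mask_logits = [5.0, 0.3, 2.1, 1.0, 4.0]
label, probs = cloze_classify(mask_logits, vocab,
                              ['sports', 'politics', 'business'])
print(label)  # politics
```

<p>Restricting the softmax to the class names matters here: tokens like "the" may score higher overall, but they are not valid labels.</p>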
<p>The result is a set of noisy class predictions for each data point. This process alone serves as a basic zero-shot classifier. In addition, the authors introduce a sort of knowledge distillation procedure. After generating a set of predictions from the cloze task, these predicted values are used as <em>proxy labels</em> on which a new classifier is trained from scratch. My intuition is that this step is effective because it allows us to do inference over the whole test set collectively, allowing the model to learn from the set over which it is predicting rather than treating each test point independently. I suspect that this step would be particularly helpful when adapting to novel domains which do not resemble the MLM’s training corpus.</p>
<p>In the most recent version of their paper, the authors also discuss an iterative self-training procedure on top of PET which reports an impressive accuracy of $70.7\%$ on Yahoo Answers, which approaches the performance of state-of-the-art supervised classification methods.</p>
<p>This brings me back to my earlier point about considering experimental parameters when comparing different methods. Although PET significantly outperforms the other methods described here, it also makes use of data which the other approaches do not assume access to: multiple task-specific, hand-crafted cloze sentences and a large set of unlabeled data for the distillation/self-training step. I say this not to knock PET by any means, nor do the authors compare themselves to the methods I’ve outlined here, but simply to emphasize the importance of taking care in comparing different approaches which can all be considered, in some sense, “zero-shot”.</p>
<h3 id="when-some-annotated-data-is-available-2">When some annotated data is available</h3>
<p>The authors present a well-developed method for using PET in the case where some training data is available, effectively minimizing a joint loss between optimizing cloze predictions for any available training data and the standard MLM loss. The details are somewhat involved, so if you’re interested I highly recommend checking out their <a href="https://arxiv.org/abs/2001.07676">preprint</a>, <a href="https://www.youtube.com/watch?v=01jRE9noSWw">YouTube tutorial</a>, or <a href="https://github.com/timoschick/pet">GitHub repo</a>.</p>
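<p>As a rough sketch of what such a joint objective looks like (the symbols are my shorthand, and the preprint has the exact formulation), the two terms can be combined as a weighted sum:</p>

$$L = (1 - \alpha) \, L_{\text{CE}} + \alpha \, L_{\text{MLM}}$$

<p>Here $L_{\text{CE}}$ is the cross-entropy of the cloze predictions on the labeled examples, $L_{\text{MLM}}$ is the standard masked language modeling loss, and $\alpha \in [0, 1]$ is a hyperparameter that controls how much weight the auxiliary MLM term receives.</p>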
<h2 id="on-low-resource-languages">On low-resource languages</h2>
<p>One extremely important data-scarce setting in NLP is low-resource languages. Fortunately, it’s a very active research area and much has been written about it. For those interested in this area, I’d highly recommend checking out Graham Neubig’s recently released <a href="https://github.com/neubig/lowresource-nlp-bootcamp-2020">Low Resource NLP Bootcamp</a>. This is a fantastic resource in the form of a GitHub repo containing 8 lectures (plus exercises) focused on NLP in data-scarce languages. Additionally, I’d recommend checking out Sebastian Ruder’s writings, including <a href="https://ruder.io/cross-lingual-embeddings/">“A survey of cross-lingual word embedding models”</a>.</p>
Joe Davison joeddav@utah.cs.edu
State-of-the-art NLP models for text classification without annotated data. See the live demo on HF spaces.
REALM: Knowledge and Transformers
2020-03-03T00:00:00+00:00 2020-03-03T00:00:00+00:00
https://joeddav.github.io/REALM
<p><em>Summary for the Hugging Face <a href="https://github.com/huggingface/awesome-papers">awesome-papers</a> reading group, March 3, 2020. Paper: <a href="https://arxiv.org/abs/2002.08909">https://arxiv.org/abs/2002.08909</a>.</em></p>
<h3 id="background-language-models-as-knowledge-bases">Background: Language Models as Knowledge Bases</h3>
<p>Due in large part to the massive size and scope of the text corpora on which they are trained, huge language representation learners like BERT have been shown to encode a surprising amount of world knowledge in their weights. <a href="https://arxiv.org/abs/1909.01066">A recent EMNLP paper</a> posed the question of whether models like BERT can be thought of as having some form of latent knowledge base in their parameters. Models like BERT, and in particular <a href="https://arxiv.org/abs/1910.10683">T5</a>, have been shown to do surprisingly well on open-domain question answering, a deliberately information-intensive task, despite having no access to external databases (incidentally, REALM shows how well we can do when such a model is given that access). All of this is to suggest the possibility that, given enough parameters and training data, models might be able to make external knowledge augmentation superfluous, instead inferring relevant knowledge from text corpora and encoding it in their parameters.</p>
<h3 id="background-instance-based-learning-and-retrieve-and-edit-models">Background: Instance-based Learning and Retrieve-and-edit Models</h3>
<p>Instance- or memory-based learning is a family of ML algorithms which compare a data point with instances already seen in training (or existing in some reference set), rather than relying entirely on learned model parameters for generalization. An example of this class of algorithm is KNN, which looks up the most similar points in a training set to generalize to a new instance. Retrieve-and-edit can be thought of as a type of instance-based learning with two components: a <em>retriever</em> which chooses similar training examples to a given data point, and an <em>editor</em> which then modifies the retrieved examples to form an appropriate prediction.</p>
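<p>For readers who haven’t seen it before, here is a minimal KNN classifier sketch in plain Python. The toy reference set and the <code class="language-plaintext highlighter-rouge">knn_predict</code> name are mine. The point is that there are no learned parameters at all: the prediction comes entirely from looking up stored instances.</p>

```python
import math
from collections import Counter

def knn_predict(query, reference_set, k=3):
    """Label `query` by majority vote among its k nearest reference points."""
    by_distance = sorted(reference_set, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D reference set: (point, label) pairs forming two clusters.
reference_set = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b"),
]

prediction = knn_predict((0.15, 0.1), reference_set)
print(prediction)
```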
<p>Recently, <a href="https://arxiv.org/abs/1911.00172" title="Generalization through Memorization: Nearest Neighbor Language Models">KNN Language Models</a> were proposed in an ICLR 2020 paper. This method looks up the nearest neighbors in LM-embedding space to generate target word predictions. The method does quite well in terms of perplexity, but the authors don’t evaluate on downstream task performance, and it’s questionable how amenable the method is to downstream fine-tuning at all. Regardless, it’s a great example of using instance-based learning to improve model performance.</p>
<p>Unlike KNN-LM, REALM incorporates retrieved instances by appending them to the model context, but it is not the first to propose such a method. Its direct precursor (published by one of REALM’s first authors), <a href="https://arxiv.org/abs/1906.00300">Latent Retrieval for Weakly Supervised Open Domain Question Answering</a>, introduces a pre-training task for the corpus retriever which “makes it possible” to train end-to-end on Wikipedia. Much of the work presented here as novel actually builds upon work done in this prior publication. The specific contributions presented in the REALM paper largely involve pre-training tasks and computational tricks, ultimately yielding impressive empirical results on OpenQA. Other systems, such as Facebook’s <a href="https://arxiv.org/abs/1704.00051">DrQA</a>, also retrieve knowledge from Wikipedia text.</p>
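<p>To give a feel for the idea, here is a small sketch of how neighbor lookups can be mixed into a language model’s next-word distribution. It weights retrieved neighbors by a softmax over negative distance and then interpolates with the base LM. The distances, probabilities, and the mixing weight <code class="language-plaintext highlighter-rouge">lam</code> are all invented, and this is my simplified reading rather than the authors’ implementation.</p>

```python
import math

def knn_distribution(neighbors):
    """Turn (distance, next_word) neighbors into a distribution via softmax(-distance)."""
    weights = {}
    for distance, word in neighbors:
        weights[word] = weights.get(word, 0.0) + math.exp(-distance)
    total = sum(weights.values())
    return {word: w / total for word, w in weights.items()}

def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1 - lam) * p_lm.get(w, 0.0) for w in vocab}

# Invented base-LM distribution and retrieved neighbors for one context.
p_lm = {"Paris": 0.4, "London": 0.5, "Rome": 0.1}
neighbors = [(0.5, "Paris"), (0.7, "Paris"), (2.0, "London")]

p_final = interpolate(p_lm, knn_distribution(neighbors))
print(max(p_final, key=p_final.get), p_final)
```

<p>In this toy case the base LM slightly prefers “London”, but the close neighbors ending in “Paris” are enough to flip the final prediction.</p>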
<h3 id="realm-retrieval-augmented-langauge-models">REALM: Retrieval-Augmented Language Models</h3>
<p>The authors propose a method for training a masked language model (MLM) by sparsely “attending” over all of Wikipedia in an end-to-end fashion.</p>
<p><img src="/assets/images/realm_fig.png" alt="" /></p>
<p>At a high level, the method goes like this: find the most similar text passages in BERT space, add those passages to the input as additional context, and then make a prediction.</p>
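<p>In code, that high-level recipe might look something like the sketch below. The two-dimensional vectors stand in for BERT embeddings, and the helper names and the <code class="language-plaintext highlighter-rouge">[SEP]</code> joining convention are my assumptions for illustration, not details taken from the paper’s implementation.</p>

```python
def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(query_vec, passages, k=2):
    """Return the k passages whose embeddings have the largest inner product with the query."""
    ranked = sorted(passages, key=lambda p: inner_product(query_vec, p["vec"]), reverse=True)
    return ranked[:k]

def build_inputs(text, retrieved, sep=" [SEP] "):
    """Pair the input with each retrieved passage as additional context."""
    return [text + sep + p["text"] for p in retrieved]

# Toy embeddings; in the real system these would come from a BERT-style encoder.
passages = [
    {"text": "passage A", "vec": (0.9, 0.1)},
    {"text": "passage B", "vec": (0.2, 0.8)},
    {"text": "passage C", "vec": (0.7, 0.6)},
]
query_vec = (1.0, 0.5)

inputs = build_inputs("some [MASK] sentence", retrieve_top_k(query_vec, passages))
print(inputs)
```

<p>Each resulting string is what the predictor would read: the original input plus one retrieved passage as context.</p>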
<p>Here’s the more formal, probabilistic explanation of their training objective: Suppose we have a corrupted sentence \(x\) and hidden tokens \(y\), as well as a textual knowledge corpus \(\mathcal{Z}\) (i.e., Wikipedia articles). The objective involves marginalizing over the entire Wikipedia corpus:</p>
\[p(y|x) = \sum_{z\in\mathcal{Z}} p(y|x,z) p(z|x)\]
<p>Of course, summing over every document in Wikipedia is computationally impractical. A sum like this is often approximated with Monte Carlo sampling:
\begin{equation}
p(y|x) \approx \frac{1}{K} \sum_{k=1}^{K} p(y|x,z_k), \qquad z_k \sim p(z|x)
\end{equation}
In other words, if we can sample \(K\) documents from the conditional distribution \(p(z|x)\) and average the resulting target likelihoods, we get an unbiased estimator of the objective.</p>
<p>In practice, rather than sampling, the authors simply take the top \(K\) documents most similar to \(x\). Selecting the top \(K\) allows them to use Maximum Inner Product Search (MIPS) for huge computational benefits, but at the cost of a biased approximation of the objective.</p>
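<p>To see the difference between these estimators, here is a toy comparison on a five-document “corpus”. All scores and likelihoods are invented, and the top-\(K\) version here is a plain truncated sum without renormalization, which is my simplification rather than a claim about the authors’ exact implementation.</p>

```python
import math
import random

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def exact_marginal(p_z, p_y_given_z):
    """p(y|x) = sum_z p(y|x,z) p(z|x), summed over the whole corpus."""
    return sum(pz * py for pz, py in zip(p_z, p_y_given_z))

def monte_carlo_marginal(p_z, p_y_given_z, k, rng):
    """Average p(y|x,z) over k documents sampled from p(z|x) (unbiased)."""
    sampled = rng.choices(range(len(p_z)), weights=p_z, k=k)
    return sum(p_y_given_z[i] for i in sampled) / k

def top_k_marginal(p_z, p_y_given_z, k):
    """Sum p(y|x,z) p(z|x) over only the k most probable documents (biased)."""
    top = sorted(range(len(p_z)), key=lambda i: p_z[i], reverse=True)[:k]
    return sum(p_z[i] * p_y_given_z[i] for i in top)

# Invented retrieval scores and target likelihoods for a five-document toy corpus.
p_z = softmax([3.0, 2.5, 0.5, 0.0, -1.0])
p_y_given_z = [0.9, 0.6, 0.2, 0.1, 0.05]

exact = exact_marginal(p_z, p_y_given_z)
sampled = monte_carlo_marginal(p_z, p_y_given_z, k=1000, rng=random.Random(0))
truncated = top_k_marginal(p_z, p_y_given_z, k=2)
print(exact, sampled, truncated)
```

<p>Because every dropped term is non-negative, the truncated sum can never exceed the exact marginal. How much it falls short depends on how much probability mass \(p(z|x)\) places outside the top \(K\).</p>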
<p><img src="/assets/images/realm_retrieval_examples.png" alt="" title="Example of text retrieval for a given text input" /></p>
<p>The authors evaluate their model on the downstream task of open-domain question answering, comfortably outperforming all other evaluated systems. It should also be noted that most of the included benchmark methods (excluding T5) also use Wikipedia in some way for external information at test time.</p>
<p><img src="/assets/images/realm_results.png" alt="" title="REALM OpenQA Results" /></p>
<h3 id="my-2">My 2¢</h3>
<p>I found this paper interesting because of its commentary on knowledge representation and instance-based language modeling. Do language models have latent knowledge bases? To what degree is that knowledge accessible? This paper makes the argument for “explicitly” modeling the relevant knowledge needed to perform a given task, rather than relying on inferred knowledge. However, the impressive performance on OpenQA benchmarks notwithstanding, they do nothing to substantiate their claims about improved interpretability and modularity of model predictions. For all we know, retrieval from Wikipedia could have <em>increased</em> the knowledge encoded in the model parameters, rather than decreased it.</p>
<p>It would also have been interesting to see more analysis on <em>what exactly</em> the model gets out of retrieved examples. Given that Wikipedia is a natural text corpus, is it possible that the model is attending to linguistic cues in retrieved examples in addition to factual information? The paper focused more on the computational aspect of things, which to be fair is arguably where their greatest contribution was, but I wish they had done some more analysis.</p>
<h3 id="discussion-questions">Discussion Questions</h3>
<ol>
<li>In the introduction, the authors state the following: “In contrast to models that store knowledge in their parameters, this approach explicitly exposes the role of world knowledge by asking the model to decide what knowledge to retrieve and use during inference.” In what way is this type of knowledge-augmented model valuable compared to standard models which rely on the “latent knowledge” in their parameters? Will the future of NLP involve explicitly incorporating external knowledge, or will such methods become obsolete with bigger and better models?</li>
<li>Should Wikipedia be the go-to general “knowledge base”? Should we focus more on structured knowledge (such as WikiData, which is built from Wikipedia) or large text corpora?</li>
<li>The REALM objective involves marginalizing over the document corpus. The authors approximate this by summing over the predictions corresponding to the top-k highest scoring documents. Presumably, they take this approach (rather than coming up with a stochastic sampling method) to get the computational advantages of asynchronous MIPS. Is this biased estimator of the objective problematic or not a big deal? Why?</li>
<li>Is there a future for instance-based models like REALM in <code class="language-plaintext highlighter-rouge">transformers</code>?</li>
<li>In the discussion, the authors mention that there are several different lenses through which you can view their method: a different take on knowledge representation and augmentation; a transformer with much larger, sparsely-attended contexts; a memory-based or retrieve-and-edit model with learned retrieval. Do any of these perspectives particularly resonate? Are there others that you prefer?</li>
</ol>
<h3 id="summary-of-hf-internal-discussion">Summary of HF Internal Discussion</h3>
<ul>
<li>When knowledge needs to be updated, models like T5 which encode knowledge in their weights must be retrained to reflect the new information. If the model relies on the knowledge base, theoretically all you have to do is update the knowledge base. That’s a major advantage for real-life deployed QA systems, for example.</li>
<li>Bootstrapping the retrieval model is a tough thing to do and the Inverse Cloze Task is a smart way to go about it.</li>
<li>Impossible to know whether the biased objective (Q3 above) has a negative impact on results without more analysis or experimentation, but it would have been nice if the authors discussed it more in the paper.</li>
<li>It would be nice to see evaluation on tasks other than just OpenQA.</li>
</ul>
Joe Davison joeddav@utah.cs.edu
Summary of "REALM: Retrieval-Augmented Language Model Pre-Training" for the Hugging Face science team reading group, March 3, 2020.