(Expert in Information Science) Lecture on Natural Language Processing

  • One definition of a language model: evaluating the “plausibility” of a sentence

    • Can also be used to select the most plausible candidate among several recognition hypotheses, e.g., in speech recognition
  • Probabilistic language model

    • Mathematical representation of a “sentence”
      • Sentence s = <s> hello world </s>
        • <s> / </s> represent the beginning/end of a sentence
    • Using the above, evaluate the plausibility of a sentence
      • P(a|b) is the probability of word a occurring after b
      • P(<s>) * P(hello | <s>) * P(world | hello) * P(</s> | hello world)
      • Evaluate how likely each word is based on the preceding context
      • How to estimate P(a|b)
        • Maximum likelihood estimation
          • Easily computed from occurrence counts in a corpus (see the sketch after this list)
          • Unreliable for low-frequency events
          • If any single estimate is 0, the product over the whole sentence becomes 0
        • Approximation using n-gram
          • Use only the previous n-1 words instead of all preceding words for maximum likelihood estimation
          • The smaller the value of n, the more robust it is to low-frequency items
          • The larger the value of n, the more it can consider longer contexts
            • Trade-Off
            • In machine translation, up to n=4 (4-gram) is commonly used
        • Estimation using Neural Network
          • Feed it into an RNN
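
The following is a minimal sketch of the above in Python; the toy corpus and all names are illustrative, not from the lecture. It estimates bigram probabilities P(a|b) by maximum likelihood and shows both the chain-rule product and the zero-frequency problem:

```python
from collections import Counter

# Toy corpus; <s> and </s> mark the beginning and end of each sentence.
corpus = [
    "<s> hello world </s>",
    "<s> hello there </s>",
    "<s> world hello </s>",
]

# Count unigrams and bigrams over the corpus.
unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    """Maximum-likelihood estimate P(word | prev) = count(prev word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    """Chain-rule product of bigram probabilities for a whole sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p(word, prev)  # a single unseen bigram zeroes the whole product
    return prob

print(sentence_prob("hello world"))  # all bigrams seen -> nonzero probability
print(sentence_prob("hello hello"))  # unseen bigram ("hello", "hello") -> 0.0
```
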
  • To measure plausibility, a language model must retain information about how words connect to each other

    • In other words, can a language model be viewed as something that encodes/decodes text into vectors?
  • Language Model using Neural Network

    • Feed it into an RNN for embedding (encoding)
      • The output is normalized into a probability distribution (values in 0~1 that sum to 1) using softmax (see the RNN sketch after this list)
    • Attention Mechanism
      • When the sentence becomes long, the influence of each individual word on the output vector is diluted
        • because the size of the output vector is fixed
      • Compute attention weights so that important words are strongly reflected (see the attention sketch after this list)
    • Transformer
      • Eliminates the recurrence of the RNN and performs encoding/decoding with the attention mechanism alone
    • If a sentence can be encoded into a vector and then decoded into another language, machine translation becomes possible
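
As a sketch of the RNN-based language model described above (PyTorch is assumed; the layer sizes and names are illustrative): a word sequence is embedded, fed through an RNN, and the output is normalized with softmax into a distribution over the vocabulary.

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Embed a word sequence, run it through an RNN, and predict the next word."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)             # (batch, seq_len, embed_dim)
        h, _ = self.rnn(x)                    # (batch, seq_len, hidden_dim)
        logits = self.out(h)                  # (batch, seq_len, vocab_size)
        return torch.softmax(logits, dim=-1)  # probabilities over the vocabulary at each step

vocab_size = 10
model = RNNLanguageModel(vocab_size)
tokens = torch.tensor([[1, 4, 2]])            # a 3-word sentence as token ids
probs = model(tokens)
print(probs.shape)                            # torch.Size([1, 3, 10])
print(probs.sum(dim=-1))                      # each step's probabilities sum to 1
```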
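
And a minimal NumPy sketch of the scaled dot-product attention the Transformer builds on; the formula softmax(QK^T / sqrt(d_k))V is the standard one, while the toy data is made up:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # attention weights; each row sums to 1
    return weights @ V, weights

# Toy self-attention: 3 words with 4-dimensional representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
# Queries, keys, and values all come from the same sequence here;
# a real Transformer first applies learned projections W_Q, W_K, W_V.
output, weights = scaled_dot_product_attention(X, X, X)
print(weights)  # row i: how much word i attends to every word in the sentence
```
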
  • GPT-3, BERT, etc. are applications of the Transformer

  • Pre-trained language models

    • Language models that can be adapted to various tasks
    • Large models are costly to maintain and difficult to handle
      • This is a disadvantage compared to small task-specific models
    • Lightweight models (such as DistilBERT) have also been developed (see the sketch below)
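
As an illustration, here is one way to use such a pre-trained lightweight model; this assumes the Hugging Face transformers library and the published distilbert-base-uncased checkpoint:

```python
from transformers import pipeline

# Load a pre-trained DistilBERT as a masked language model
# (assumes the Hugging Face transformers library and this public checkpoint).
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

# The model ranks candidate words by how plausible they are in context,
# which is exactly the "plausibility" a language model estimates.
for candidate in fill_mask("The weather today is [MASK]."):
    print(candidate["token_str"], candidate["score"])
```
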
  • Why Human Language Ability is More Amazing Than Language Models