• History of Encoding
    • Level 1: One-hot vector = [0,0,0,1,0,0,0] (exactly one element is 1, the “hot” one)
    • Level 2: Tf-idf = words that appear frequently across the whole corpus (globally) carry little information, while words that appear frequently within a particular document (locally) are important
    • Level 3: Word2vec = a dense vector [x,x,x] per word; the offset between male/female word pairs is roughly constant, so analogies hold as vector arithmetic, e.g. king − man + woman ≈ queen (also captures some of the essence of tf-idf)
    • Level 4: BERT
      • On its own, BERT cannot tell what a pronoun like he/she refers to
        • But it can infer this from the surrounding words, since each token’s embedding depends on its context (word sense disambiguation)
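The first three levels above can be sketched in a few lines of Python. This is a toy illustration only: the corpus, vocabulary, and the word2vec-style vectors are made up for demonstration, not learned from data.

```python
import math

# Toy corpus: each document is a list of tokens (illustrative data only).
docs = [
    ["the", "king", "rules", "the", "land"],
    ["the", "queen", "rules", "the", "land"],
    ["the", "cake", "is", "sweet"],
]

vocab = sorted({w for d in docs for w in d})

# Level 1: one-hot vector -- a single 1 at the word's index, 0 elsewhere.
def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

# Level 2: tf-idf -- frequent in this document (tf) but rare across the
# corpus (idf) => important; frequent everywhere => weight driven to 0.
def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)
    return tf * math.log(len(docs) / df)

print(one_hot("king"))                      # → [0, 0, 1, 0, 0, 0, 0, 0]
print(round(tf_idf("king", docs[0]), 3))    # → 0.22 (distinctive word)
print(round(tf_idf("the", docs[0]), 3))     # → 0.0 (appears everywhere)

# Level 3: word2vec-style dense vectors (values invented for illustration).
# The male/female offset is shared, so king - man + woman lands near queen.
vec = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.8, 0.9],
    "man":   [0.1, 0.2, 0.1],
    "woman": [0.1, 0.2, 0.9],
}
analogy = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
```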

https://ishitonton.hatenablog.com/entry/2018/11/25/200332

  • About Embedding

(Expert in Information Science)

  • Overlapping field of Linguistics and Information Science

  • Approach

    • Want to understand the mechanism by which humans produce and comprehend language
    • Directly observing how the brain processes information would require neuroscience
    • Therefore, explore the mechanism through what is observable: language itself
  • Some things to do

  • It’s not just about processing language

    • It connects to the deep aspects of human intelligence, such as the meaning, knowledge, and emotions that language carries
    • (blu3mo) A broader field than imagined
  • Methodology

    • Cannot process language as mere strings (e.g., “ケヤキ” (zelkova tree) and “ケーキ” (cake) are similar as strings but have completely different meanings)
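The ケヤキ/ケーキ point can be made concrete with edit distance: as strings the two words are a single substitution apart, even though their meanings are unrelated. A minimal sketch of Levenshtein distance via dynamic programming:

```python
# Levenshtein (edit) distance: minimum number of insertions, deletions,
# and substitutions needed to turn string a into string b.
def edit_distance(a, b):
    # dp[i][j] = distance between a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete from a
                           dp[i][j - 1] + 1,        # insert into a
                           dp[i - 1][j - 1] + cost) # substitute / match
    return dp[len(a)][len(b)]

# "ケヤキ" (zelkova tree) and "ケーキ" (cake): one substitution apart,
# yet completely unrelated in meaning.
print(edit_distance("ケヤキ", "ケーキ"))  # → 1
```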

    • How to handle meaning

      • What is meaning?: Something whose equivalence humans can judge
      • (Since we cannot observe the processing inside our minds, we rely on observable equivalence judgments instead)
    • How to combine “Discrete structures” and “Continuous regularities”

      • Natural language has clear-cut right and wrong = a discrete, symbolic structure
        • e.g., Changing one pixel in an image doesn’t have much impact, but changing one character in natural language is a big problem
      • However, there is also ambiguity and uncertainty (statistical, continuous value properties)
        • Directly related to the ambiguity of language
      • In other words, it has a complex nature of both discrete and continuous aspects
    • What to learn from the corpus

      • Natural language text data is called a “corpus”
      • Can learn regularities and other things from the corpus
      • e.g., Language Model (evaluating the naturalness of sentences)
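One hedged sketch of that language-model example: a bigram model with add-one smoothing, trained on a tiny made-up corpus, that assigns a higher log-probability (“more natural”) to word orders it has seen regularities for. Corpus and sentences are illustrative only.

```python
import math
from collections import Counter

# Tiny toy corpus of tokenized sentences with boundary markers.
corpus = [
    ["<s>", "the", "cat", "sat", "</s>"],
    ["<s>", "the", "dog", "sat", "</s>"],
    ["<s>", "the", "cat", "ran", "</s>"],
]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
V = len(unigrams)

# P(w2 | w1) with add-one smoothing, so unseen pairs get nonzero probability.
def prob(w1, w2):
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

# Log-probability of a sentence: higher = judged "more natural" by the model.
def score(tokens):
    tokens = ["<s>"] + tokens + ["</s>"]
    return sum(math.log(prob(a, b)) for a, b in zip(tokens, tokens[1:]))

print(score(["the", "cat", "sat"]))   # seen pattern: higher score
print(score(["sat", "the", "cat"]))   # scrambled order: lower score
```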
    • The most commonly used technology is Machine Learning

  • Syntax analysis

    • Technology to understand the syntax of sentences
    • Described in detail above
  • Semantic Analysis

    • Technology to understand the meaning of sentences/words
    • Described in detail above