Thoughts on the Information Alpha Challenge 2: Training a CNN on CIFAR-10

  • I came to see it as a process of balancing two goals: tuning the model to push accuracy up and improving generalization performance.
    • Concretely, it meant repeatedly looking at the training-accuracy and validation-accuracy curves and deciding which direction to push next.
    • I'm not sure this is the correct way to think about it, though.

from 東大1S情報α (University of Tokyo 1st Semester Information Alpha) Deep Learning Techniques

  • Determining the initial parameter values
    • What kind of initial values are good?
      • Since the parameter distribution tends to end up centered around 0 after training, it is good to start from values centered around 0.
      • Also, the larger the input dimension of a layer, the smaller the variance of its initial values should be.
      • Concretely, it is good to draw the initial values from a normal distribution with a well-chosen scale.
        • Such "nicely scaled" normal distributions are the Xavier (Glorot) initialization and the He initialization.
          • They differ in the variance of the distribution: roughly, Xavier uses variance 2/(fan_in + fan_out), while He uses 2/fan_in, which suits ReLU. A small sketch is at the end of these notes.
  • Gradient clipping
    • Setting an upper limit on the gradient (typically on its norm) so that exploding gradients don't blow up the update. A small sketch is at the end of these notes.
  • Preventing overfitting (a combined sketch of regularization, dropout, and early stopping is at the end of these notes)
    • Regularization
      • Adds a penalty on the parameters θ to keep them from becoming too large (e.g. L2 regularization / weight decay).
    • Dropout (Deep Learning)
      • Randomly dropping units (connections) during training so the network doesn't rely on any single one.
      • It seems to be less commonly used recently.
    • Early stopping
      • Simply stop training before overfitting sets in, e.g. when validation accuracy stops improving.
    • Batch normalization
      • Correcting the variation of the data within each mini-batch.
      • Normalizing each mini-batch so that its mean and variance match those of the data as a whole.
        • On the micro level you still get stochastic gradient descent, while on the macro level each batch looks like the whole dataset, which is convenient.
      • It’s not clear why it works.
        • In fact, the originally stated purpose (reducing the deviation from the whole-data statistics) turns out not to matter much.
        • But, for some reason, it smooths out the roughness of the loss function, so it is used.
          • Magic~ (blu3mo)
  • There are various techniques, but
    • Since there is no single best method, in practice you combine several techniques.
    • As for which one to use, well, it’s trial and error.
    • For now, batch normalization seems to be particularly strong, so I want to remember it (a small sketch of it is at the end of these notes).
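
Below are a few minimal sketches of the techniques above, written by me in PyTorch for illustration; they are not the course code, and the layer sizes and hyperparameters are arbitrary choices of mine.

Initialization: Xavier (Glorot) and He initialization both draw weights from a zero-centered normal distribution and differ only in the variance.

```python
import torch.nn as nn

# Two linear layers of the same shape, initialized in two different ways.
fc_xavier = nn.Linear(512, 256)
fc_he = nn.Linear(512, 256)

# Xavier (Glorot) normal: std = sqrt(2 / (fan_in + fan_out)); suits tanh/sigmoid.
nn.init.xavier_normal_(fc_xavier.weight)

# He (Kaiming) normal: std = sqrt(2 / fan_in); suits ReLU.
nn.init.kaiming_normal_(fc_he.weight, nonlinearity="relu")

# Both are centered around 0, but the spread differs.
print(fc_xavier.weight.std().item())  # ≈ sqrt(2/768) ≈ 0.051
print(fc_he.weight.std().item())      # ≈ sqrt(2/512) ≈ 0.063
```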
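
Gradient clipping: after backward() and before the optimizer step, cap the gradient norm at an upper limit (the model, data, and max_norm here are placeholders of my own choosing).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)            # dummy batch

optimizer.zero_grad()
loss_fn(model(x), y).backward()

# Cap the total gradient norm at 1.0 before the parameter update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```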
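
Overfitting countermeasures combined: L2 regularization via the optimizer's weight_decay, dropout as a layer, and early stopping as a check on validation accuracy. The random tensors stand in for CIFAR-10 data.

```python
import torch
import torch.nn as nn

# Dummy data standing in for flattened CIFAR-10 images (3x32x32, 10 classes).
x_train, y_train = torch.randn(256, 3 * 32 * 32), torch.randint(0, 10, (256,))
x_val, y_val = torch.randn(64, 3 * 32 * 32), torch.randint(0, 10, (64,))

# Small classifier with a dropout layer (randomly zeroes units during training).
model = nn.Sequential(
    nn.Linear(3 * 32 * 32, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)
loss_fn = nn.CrossEntropyLoss()

# weight_decay adds an L2 penalty, which keeps the parameters θ from growing too large.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Early stopping: stop once validation accuracy hasn't improved for `patience` epochs.
best_val_acc, bad_epochs, patience = 0.0, 0, 5
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_acc = (model(x_val).argmax(dim=1) == y_val).float().mean().item()
    if val_acc > best_val_acc:
        best_val_acc, bad_epochs = val_acc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```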
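
Batch normalization: in a small CIFAR-10-style CNN, each BatchNorm2d layer normalizes its input channel-wise using the mini-batch mean and variance, then rescales it with learned parameters (the architecture is just an illustration).

```python
import torch
import torch.nn as nn

# A tiny CNN; Conv -> BatchNorm -> ReLU is a common ordering.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),   # normalize over the mini-batch, then scale/shift (learned γ, β)
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 10),
)

# A fake CIFAR-10 mini-batch: 8 images of shape 3x32x32.
x = torch.randn(8, 3, 32, 32)
print(model(x).shape)  # torch.Size([8, 10])
```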