• Exploring different ways to represent data

  • Choosing a good data representation is one of a data scientist's most important tasks

  • How the data is represented can matter more than tuning model parameters

  • One-hot Encoding

    • Example: when a variable such as job takes one of three values {student, engineer, chef}, One-hot Encoding gives each value its own 0/1 column: student becomes {1,0,0}, engineer {0,1,0}, and chef {0,0,1}
    • Categories are sometimes stored as integer codes, but because they are not continuous values (the order carries no meaning), the codes should not be used directly as numerical features
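
As a minimal sketch of the job example above, the following builds the one-hot columns by hand with NumPy. The sample rows and the column order are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical sample rows for the "job" variable from the example above.
jobs = ["student", "engineer", "chef", "student"]

# Fix an explicit category order so the column layout is predictable.
categories = ["student", "engineer", "chef"]
index = {name: i for i, name in enumerate(categories)}

# Each row gets a 1 in the column of its category and 0 everywhere else.
one_hot = np.zeros((len(jobs), len(categories)), dtype=int)
for row, job in enumerate(jobs):
    one_hot[row, index[job]] = 1

print(one_hot)
```

In practice a library helper such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would usually do this step; the manual version is only meant to show what the encoding produces.
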
  • Binning (Discretization)

    • Dividing the range of a continuous feature into intervals (bins) and treating each interval as a discrete category
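
A rough sketch of binning with NumPy follows. The age values and the bin edges are hypothetical, chosen only to illustrate the idea; `np.digitize` assigns each value the index of the interval it falls into, and the bin index is then one-hot encoded.

```python
import numpy as np

# Hypothetical continuous feature (ages in years).
ages = np.array([3.0, 17.5, 25.0, 42.0, 67.0, 80.0])

# Assumed bin edges: (-inf, 18), [18, 40), [40, 65), [65, inf).
bin_edges = np.array([18.0, 40.0, 65.0])

# For each value, np.digitize returns the index of the bin it belongs to.
bin_ids = np.digitize(ages, bin_edges)

# Treat each bin as a category: one 0/1 column per bin.
n_bins = len(bin_edges) + 1
binned = np.eye(n_bins, dtype=int)[bin_ids]

print(bin_ids)
print(binned)
```

scikit-learn's `KBinsDiscretizer` offers a comparable transformation and can choose the bin edges from the data.
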
  • Polynomial features

    • In linear models, adding terms such as x^2 and x^3 alongside x lets the model fit curved (nonlinear) relationships
    • In contrast, these extra features may reduce performance in models such as decision trees
    • Besides powers, functions such as sin, cos, and log can also be applied to features
      • These are particularly useful when the transformation makes the data's distribution closer to a bell curve
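
The two ideas above can be sketched with NumPy alone. The data here is synthetic and the helper names (`fit_mse`, `skewness`) are hypothetical. Part 1 fits a least-squares linear model with and without the x^2 and x^3 columns; Part 2 applies a log transform to a right-skewed feature and compares the skewness before and after.

```python
import numpy as np

rng = np.random.default_rng(0)

# Part 1: polynomial features for a linear model.
# Synthetic data: a cubic relationship plus a little noise.
x = np.linspace(-3, 3, 50)
y = x**3 - 2 * x + rng.normal(scale=0.5, size=x.shape)

def fit_mse(design, target):
    """Least-squares fit of target on the design matrix; returns the training MSE."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.mean((design @ coef - target) ** 2))

ones = np.ones_like(x)
mse_linear = fit_mse(np.column_stack([ones, x]), y)
mse_poly = fit_mse(np.column_stack([ones, x, x**2, x**3]), y)
print(f"x only: {mse_linear:.3f}, with x^2 and x^3: {mse_poly:.3f}")

# Part 2: a log transform on a right-skewed, strictly positive feature.
counts = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

def skewness(values):
    """Sample skewness: roughly 0 for a symmetric, bell-shaped distribution."""
    centered = values - values.mean()
    return float(np.mean(centered**3) / values.std() ** 3)

skew_raw = skewness(counts)
skew_log = skewness(np.log(counts))
print(f"skewness before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

The errors in Part 1 are training errors, so they only show that the expanded model can follow the curve, not that it generalizes better. scikit-learn's `PolynomialFeatures` generates such columns automatically.
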
  • Automated feature selection

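    • Automatically keeping only the useful features reduces the number of features and can improve generalization performance
      • Univariate statistics: selecting the features that have a strong statistical relationship with the target
      • Model-based selection: training a model to obtain feature importances, then training again using only the important features
      • Iterative selection: repeating the model-based step, adding or removing features on each pass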
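
As one possible illustration of the first strategy, the sketch below scores each feature by the absolute value of its correlation with the target and keeps the top `k`. The synthetic data, the choice of Pearson correlation as the score, and `k = 2` are all assumptions; this is a simple filter, not the only way to do univariate selection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 10 features, of which only columns 0 and 3 drive the target.
n_samples, n_features = 200, 10
X = rng.normal(size=(n_samples, n_features))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n_samples)

# Score each feature by the absolute Pearson correlation with the target.
scores = np.array(
    [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
)

# Keep the k highest-scoring features (k is an assumed setting).
k = 2
selected = np.sort(np.argsort(scores)[-k:])
X_selected = X[:, selected]

print("selected feature indices:", selected)
```

For real projects, scikit-learn's `feature_selection` module provides `SelectKBest` for univariate scoring, `SelectFromModel` for model-based selection, and `RFE` for iterative elimination.
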
  • Utilizing domain knowledge

    • Human experts may have insights that cannot be derived from the data alone
    • This knowledge should be incorporated when performing feature engineering

#getting Started with Machine Learning in Python