• There are mainly two purposes for using unsupervised learning algorithms:

    • Data transformation
    • Clustering
  • PCA (Principal Component Analysis)

  • NMF (Non-negative Matrix Factorization)

  • Clustering

  • k-means algorithm:

    • Can be used for vector quantization: each point is represented by its assigned cluster center.
    • Unlike PCA, the number of clusters is not limited by the number of input dimensions, so k-means can encode the data with more clusters than there are input features.
    • Dividing the points into 10 clusters is equivalent to encoding each point as a 10-dimensional one-hot vector.
      • For example: {0,0,0,1,0,0,0,0,0,0}
      • Alternatively, the distances from a point to each of the 10 cluster centers can be used as a 10-dimensional feature vector.
  • Agglomerative Clustering

  • Evaluation:

    • When ground truth is available: metrics such as the Adjusted Rand Index (ARI) are used.

    • However, if ground truth is available, supervised learning can be applied instead.

    • When ground truth is not available: metrics such as the silhouette coefficient are used.

      • However, verifying that the clusters are actually meaningful still requires a human to inspect the data.
      • Unlike supervised learning, where a metric such as R-squared can validate a model automatically, clustering evaluation relies on human assessment, which makes it difficult.
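
The k-means notes above (one-hot encoding versus distance encoding) can be sketched with scikit-learn. This is only an illustration under assumptions that are not in the notes: the toy `make_moons` data, the sample count, and the random seeds are all arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Toy 2-D data (an assumption for illustration; any numeric array works).
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# 10 clusters on 2 input dimensions: more clusters than input features.
n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# One-hot encoding: a 10-dimensional vector with a single 1 at the
# index of the assigned cluster.
one_hot = np.eye(n_clusters, dtype=int)[labels]

# Distance encoding: the distance from each point to each cluster center.
distances = kmeans.transform(X)

# Vector quantization: replace each point by its assigned cluster center.
X_quantized = kmeans.cluster_centers_[labels]

print(one_hot.shape, distances.shape, X_quantized.shape)
```

Either `one_hot` or `distances` could then be passed to a downstream model as a 10-dimensional representation of the original 2-dimensional points.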

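The two kinds of evaluation above can be compared on synthetic data. This is a minimal sketch, assuming scikit-learn and a toy `make_blobs` dataset whose generating labels stand in for ground truth; the dataset, cluster count, and seeds are illustrative choices, not from the notes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data; y_true plays the role of ground-truth labels.
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)

# Cluster without looking at y_true.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# With ground truth: ARI compares the two labelings and ignores how
# the cluster indices happen to be numbered.
ari = adjusted_rand_score(y_true, labels)

# Without ground truth: the silhouette coefficient uses only the data
# and the predicted labels.
sil = silhouette_score(X, labels)

print(f"ARI: {ari:.2f}")
print(f"Silhouette: {sil:.2f}")
```

A high silhouette score only says the clusters are compact and well separated; whether they correspond to anything meaningful still has to be checked by looking at the data.
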
Getting Started with Machine Learning in Python


(From the book “Mastering Information Science”)

  • Assumption: points that belong to the same cluster have the same label.