February 23, 2025

Weeks 1 to 4 of MLT

Week 1: Introduction to Unsupervised Learning

Unsupervised learning is a type of machine learning where the algorithm identifies patterns in data without labeled responses. It is widely used in clustering, dimensionality reduction, and anomaly detection.

Why Are We Learning This?

  • Helps in identifying hidden patterns in data.
  • Essential for big data analysis and AI applications.
  • Used in recommendation systems, fraud detection, and image compression.
  • Prepares data for supervised learning by reducing noise and dimensionality.

Feature | Supervised Learning | Unsupervised Learning
Data Type | Labeled data | Unlabeled data
Goal | Predict outcomes | Identify patterns
Example | Spam detection | Customer segmentation

Representation Learning

Representation learning uncovers the underlying structure of data by learning a more compact or informative representation of it, which makes later tasks such as clustering or classification easier.

Principal Component Analysis (PCA)

PCA is a statistical method for dimensionality reduction: it transforms high-dimensional data into a smaller set of new features (principal components) while retaining as much of the original variance as possible.

Why Is PCA Useful?

  • Reduces computational cost by removing redundant features.
  • Helps visualize high-dimensional data.
  • Used in image processing, genetics, and finance.

Steps of PCA:

  1. Compute the mean and standardize the dataset.
  2. Calculate the covariance matrix.
  3. Compute eigenvalues and eigenvectors.
  4. Choose the top k eigenvectors (principal components).
  5. Transform the original dataset.
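
A minimal NumPy sketch of these steps (an illustrative, unoptimized implementation; X is assumed to be an array of shape (n_samples, n_features)):

```python
import numpy as np

def pca(X, k):
    """Illustrative PCA following the steps listed above."""
    # 1. Standardize: zero mean and unit variance for every feature.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features.
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigenvalues and eigenvectors (eigh, since the covariance matrix is symmetric).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Keep the k eigenvectors with the largest eigenvalues: the principal components.
    order = np.argsort(eigvals)[::-1][:k]
    components = eigvecs[:, order]
    # 5. Project the standardized data onto the principal components.
    return X_std @ components, eigvals[order]
```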

Example:

If we have a dataset with 10 features and we reduce it to 2 principal components, we can visualize data in a 2D space while retaining most of the information.
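
A short scikit-learn sketch of this scenario (scikit-learn is assumed to be available; the data is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 10))  # 200 samples, 10 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (200, 2): ready for a 2D scatter plot
print(pca.explained_variance_ratio_)  # fraction of total variance kept by each component
```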

Additional Example:

PCA is commonly used in facial recognition systems to reduce image dimensions while keeping key facial features intact.

Hints for Exams:

  • Remember PCA reduces dimensions but retains maximum variance.
  • Eigenvalues indicate importance of principal components.
  • Covariance matrix shows feature relationships.

Questions and Answers (Week 1)

Question Answer
What is unsupervised learning? It is a type of ML that finds patterns without labeled data.
What is PCA used for? Dimensionality reduction.
How does PCA transform data? By finding new axes that maximize variance.
What is an eigenvector? A direction that the covariance matrix merely scales; in PCA, the top eigenvectors point along the directions of greatest variance.
What is an eigenvalue? A measure of variance along an eigenvector.
Why do we standardize data in PCA? To ensure features contribute equally.
What does the covariance matrix do? Captures feature relationships.
What is the main limitation of PCA? It assumes linear relationships.
Can PCA be used for feature selection? Not directly; PCA performs feature extraction, building new combined features rather than selecting original ones.
How do we determine the number of principal components? Using a scree plot or explained variance ratio.
What happens if too many components are removed? Information loss occurs.
What is a scree plot? A plot of eigenvalues (explained variance) against component number, used to decide how many components to keep.
What type of data is best suited for PCA? Linearly correlated data.
Why is PCA useful in image processing? It compresses high-dimensional pixel data into far fewer dimensions while keeping most of the visual information.
What is dimensionality reduction? Reducing the number of features in a dataset.
What are some alternatives to PCA? t-SNE, LDA, Kernel PCA.
What is the curse of dimensionality? As dimensions increase, data becomes sparse and harder to analyze.
What industries use PCA? Finance, healthcare, image recognition, etc.
What is a principal component? A new feature that captures maximum variance.
Why is high variance important in PCA? It helps retain the most important information.
Can PCA work on categorical data? No, it is designed for numerical data.
What are some practical applications of PCA? Face recognition, recommendation systems, etc.
How do you interpret PCA results? By analyzing principal components and their variance.
What is a real-world example of PCA? Reducing features in stock market prediction.

Week 2: Kernel PCA

Why Is Kernel PCA Important?

  • PCA is limited to linear transformations.
  • Kernel PCA allows for non-linear feature extraction.
  • Useful in complex domains such as image recognition, bioinformatics, and speech recognition.

Kernel Trick in PCA

Kernel PCA extends PCA using a kernel function to map data into higher-dimensional space before applying PCA.

Example:

If data is shaped like concentric circles, linear PCA would fail. Kernel PCA with an RBF kernel can separate the inner and outer circles.
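
A brief scikit-learn sketch of exactly this situation (assuming scikit-learn is installed; the gamma value is just a plausible choice, not a prescribed one):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: class 0 is the outer ring, class 1 the inner ring.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)                                   # linear: rings stay nested
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)  # non-linear mapping

# After the RBF mapping, the first kernel principal component tends to
# separate the inner circle from the outer one; the linear projection does not.
print("linear PCA :", X_pca[y == 0, 0].mean(), X_pca[y == 1, 0].mean())
print("kernel PCA :", X_kpca[y == 0, 0].mean(), X_kpca[y == 1, 0].mean())
```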

Hints for Exams:

  • Kernel PCA handles non-linear data better than PCA.
  • Common kernels: RBF, Polynomial, Sigmoid.
  • Kernel matrix (Gram matrix) stores pairwise similarities.

Questions and Answers (Week 2)

Question Answer
What does Kernel PCA do? Extends PCA to non-linear transformations.
What is the kernel trick? Computing inner products in a high-dimensional feature space through a kernel function, without explicitly mapping the data there.
What is a common kernel used in Kernel PCA? RBF Kernel.
What are some applications of Kernel PCA? Image recognition, anomaly detection.
How does Kernel PCA compare to PCA? It works for non-linear data.
What is a Gram matrix? A kernel matrix storing similarity scores.
Why is Kernel PCA computationally expensive? Because it requires computing the kernel matrix.
What is a practical use case of Kernel PCA? Handwriting recognition.
How do you choose a kernel function? Experiment with different kernels and evaluate performance.
What happens if the wrong kernel is used? Poor results and inefficient learning.
Can Kernel PCA work on text data? Yes, with appropriate feature extraction.
What is the main limitation of Kernel PCA? High memory usage for large datasets.
Why is Kernel PCA useful in bioinformatics? It reduces high-dimensional gene expression data before tasks such as gene classification.
How does Kernel PCA relate to SVMs? Both use the kernel trick.
What type of data does Kernel PCA handle well? Non-linearly separable data.
What does the RBF kernel do? Maps data into infinite-dimensional space.
Is Kernel PCA useful for dimensionality reduction? Yes, for complex data structures.
How does Kernel PCA improve feature extraction? By capturing non-linear patterns.
What are the challenges of implementing Kernel PCA? High computational cost and kernel selection.
What is the primary advantage of Kernel PCA? It captures complex patterns in data.

Week 3: Unsupervised Learning - Clustering

Why Is Clustering Important?

  • Helps in grouping similar data points automatically.
  • Used in customer segmentation, anomaly detection, and image segmentation.
  • Helps in data pre-processing before supervised learning.

K-Means Clustering

A simple and widely used clustering algorithm that partitions data into K clusters.

Steps:

  1. Choose K cluster centers (randomly or using initialization techniques like K-means++).
  2. Assign each point to the nearest cluster center.
  3. Recalculate cluster centers by taking the mean of assigned points.
  4. Repeat until convergence.

Example:

Imagine grouping customers based on their spending habits. K-means can classify them into low, medium, and high spenders.
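
A minimal sketch of this customer example (hypothetical spending data; scikit-learn assumed available):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: annual spend and number of purchases for 150 customers,
# drawn from three loosely separated groups (low, medium, high spenders).
rng = np.random.default_rng(0)
spend = np.concatenate([rng.normal(200, 30, 50),
                        rng.normal(800, 80, 50),
                        rng.normal(2000, 150, 50)])
purchases = np.concatenate([rng.normal(5, 1, 50),
                            rng.normal(20, 3, 50),
                            rng.normal(60, 8, 50)])
X = np.column_stack([spend, purchases])

# Standardize first so both features contribute equally, then cluster with K = 3.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)
print(np.bincount(labels))  # roughly 50 customers per cluster
```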

Kernel K-Means

An extension of K-Means that uses the kernel trick to map data into a higher-dimensional space, allowing for non-linear cluster boundaries.

When to Use Kernel K-Means?

  • When clusters are not linearly separable.
  • When data has a complex shape (e.g., concentric circles).

Example:

Grouping images based on texture patterns where simple K-Means fails.
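
scikit-learn does not ship a Kernel K-Means estimator, so below is a rough from-scratch sketch of the idea (the function name, parameters, and gamma value are all illustrative):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

def kernel_kmeans(X, n_clusters, gamma=1.0, n_iter=50, seed=0):
    """Toy Kernel K-Means: K-Means in the feature space induced by an RBF kernel."""
    K = rbf_kernel(X, gamma=gamma)                 # Gram matrix of pairwise similarities
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            size = mask.sum()
            if size == 0:
                continue
            # Squared distance to the cluster mean in feature space:
            # K(x, x) - 2/|C| * sum_j K(x, x_j) + 1/|C|^2 * sum_{j,l} K(x_j, x_l)
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / size
                          + K[np.ix_(mask, mask)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Example: two concentric circles, which plain K-Means cannot split correctly.
X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
labels = kernel_kmeans(X, n_clusters=2, gamma=15.0)
# With a suitable gamma, the two rings tend to end up in different clusters.
```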

Hints for Exams:

  • K-Means minimizes intra-cluster variance.
  • Random initialization can lead to different results.
  • Elbow method is used to find the optimal K (see the sketch below).
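
To illustrate the elbow-method hint, a short sketch (scikit-learn assumed) that prints the within-cluster sum of squares for a range of K; the K where the drop flattens out is a reasonable choice:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true groups; look for the "elbow" around K = 4.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))
```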

Questions and Answers (Week 3)

Question Answer
What is clustering? A technique to group similar data points together.
How does K-Means work? It partitions data into K clusters by minimizing variance.
What is the primary limitation of K-Means? It assumes spherical clusters and requires specifying K in advance.
What is the elbow method? A technique to find the optimal number of clusters.
How does Kernel K-Means differ from K-Means? It applies the kernel trick to handle non-linearly separable data.
What happens if K is too high in K-Means? Overfitting and unnecessary complexity.
What distance metric does K-Means use? Euclidean distance.
Can K-Means handle categorical data? No, K-Means works best with numerical data.
What is a centroid in K-Means? The mean of points assigned to a cluster.
Why is K-Means sensitive to initialization? Poor initialization can lead to bad clustering.
What is K-Means++? An improved initialization technique for K-Means.
What is a real-world application of clustering? Customer segmentation in marketing.
Why do we standardize data before K-Means? To ensure equal contribution from all features.
What is the difference between hard and soft clustering? Hard assigns each point to one cluster; soft assigns probabilities.
Can K-Means be used for anomaly detection? Yes, anomalies fall far from cluster centers.
What are some alternatives to K-Means? DBSCAN, Hierarchical Clustering, Gaussian Mixture Models.
Why do we need multiple runs of K-Means? To reduce sensitivity to initialization.
How does Kernel K-Means improve K-Means? It enables clustering of non-linearly separable data.
What industries use clustering? Healthcare, finance, retail, etc.
What are hierarchical clustering methods? Techniques that build a tree of clusters.
What are the two types of hierarchical clustering? Agglomerative (bottom-up) and Divisive (top-down).
How do we visualize K-Means clustering? Using scatter plots or PCA projections.
What are some challenges in clustering? Choosing the right number of clusters and handling noise.
What is a practical use case of Kernel K-Means? Image segmentation in medical imaging.

Week 4: Unsupervised Learning - Estimation

Recap of Maximum Likelihood Estimation (MLE) & Bayesian Estimation

  • MLE: Finds parameters that maximize the likelihood of the observed data.
  • Bayesian Estimation: Incorporates prior knowledge and updates beliefs as data is observed.
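
As a concrete numerical sketch (illustrative, not from the notes): for Gaussian data with known variance, the MLE of the mean is the sample mean, while a Bayesian estimate with a conjugate normal prior pulls it toward the prior mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=30)   # observed samples

# MLE of the Gaussian mean: the sample mean maximizes the likelihood.
mu_mle = data.mean()

# Bayesian estimate: Normal(mu0, tau2) prior on the mean, known data variance sigma2.
mu0, tau2, sigma2, n = 0.0, 1.0, 4.0, len(data)
mu_post = (mu0 / tau2 + data.sum() / sigma2) / (1 / tau2 + n / sigma2)

print(round(mu_mle, 2), round(mu_post, 2))  # the posterior mean lies between the prior and the MLE
```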

Gaussian Mixture Model (GMM)

A probabilistic clustering model that assumes data is generated from multiple Gaussian distributions.

Why Use GMM Over K-Means?

  • Handles clusters of different shapes and sizes.
  • Soft clustering: Assigns probabilities instead of fixed labels.

Expectation-Maximization (EM) Algorithm

An iterative method to find maximum likelihood estimates when data has latent (hidden) variables.

Steps:

  1. Expectation (E-step): Compute probabilities of belonging to each cluster.
  2. Maximization (M-step): Update model parameters based on probabilities.
  3. Repeat until convergence.

Example:

GMM can be used in speech recognition to model different phonemes as Gaussian distributions.
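
A short scikit-learn sketch (GaussianMixture runs EM internally; the data is synthetic and purely illustrative) showing the soft assignments GMM produces:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic Gaussian clusters in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(100, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                       # E-step and M-step alternate until convergence

print(gmm.means_)                # estimated component means
print(gmm.predict_proba(X[:3]))  # soft assignments: probability of each component
print(gmm.converged_, gmm.n_iter_)
```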

Hints for Exams:

  • MLE finds parameters that best fit the data.
  • GMM gives soft (probabilistic) cluster assignments; K-Means gives hard assignments.
  • EM is used to optimize GMM.

Questions and Answers (Week 4)

Question Answer
What is MLE? A method to estimate parameters that maximize likelihood.
How does Bayesian estimation differ from MLE? It incorporates prior beliefs.
What is a Gaussian Mixture Model? A model that represents data as multiple Gaussian distributions.
Why use GMM over K-Means? GMM handles soft clustering and different cluster shapes.
What does the EM algorithm do? It finds parameter estimates for models with hidden variables.
What is an example of GMM in real life? Speaker identification in voice assistants.
Why is GMM called a mixture model? Because it represents data as a mixture of multiple Gaussians.
What does the E-step in EM do? It estimates probabilities of data points belonging to each cluster.
What does the M-step in EM do? It updates model parameters based on probabilities.
Can EM get stuck in local optima? Yes, multiple runs may be needed.
What is a latent variable? A hidden variable not directly observed.
Why is GMM probabilistic? It assigns soft probabilities instead of hard labels.
What does convergence mean in EM? When parameter updates become negligible.
How do we determine the number of Gaussians in GMM? Using methods like the Akaike Information Criterion (AIC).
What industries use GMM? Speech recognition, finance, and bioinformatics.
What is the main limitation of GMM? It assumes data follows a Gaussian distribution.
How does EM relate to GMM? EM is used to estimate GMM parameters.
What are some alternatives to GMM? DBSCAN, K-Means, Spectral Clustering.
How is GMM different from Hierarchical Clustering? GMM is probabilistic; hierarchical is tree-based.
What is a common initialization method for GMM? Using K-Means to initialize cluster centers.
What happens if the number of Gaussians is too high? Overfitting and unnecessary complexity.
How do we visualize GMM results? Using contour plots or PCA projections.
What is a practical use case of EM? Image segmentation in computer vision.
Why is EM important in missing data problems? It estimates missing values iteratively.
