Week 1: Introduction to Unsupervised Learning
Unsupervised learning is a type of machine learning where the algorithm identifies patterns in data without labeled responses. It is widely used in clustering, dimensionality reduction, and anomaly detection.
Why Are We Learning This?
- Helps in identifying hidden patterns in data.
- Essential for big data analysis and AI applications.
- Used in recommendation systems, fraud detection, and image compression.
- Prepares data for supervised learning by reducing noise and dimensionality.
| Feature | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Data Type | Labeled data | Unlabeled data |
| Goal | Predict outcomes | Identify patterns |
| Example | Spam detection | Customer segmentation |
Representation Learning
Representation learning transforms raw data into features that expose its underlying structure, making the data easier to use in downstream tasks such as clustering or classification.
Principal Component Analysis (PCA)
PCA is a statistical method for dimensionality reduction: it transforms high-dimensional data into a smaller set of uncorrelated features (principal components) while retaining as much of the original variance as possible.
Why Is PCA Useful?
- Reduces computational cost by removing redundant features.
- Helps visualize high-dimensional data.
- Used in image processing, genetics, and finance.
Steps of PCA:
- Compute the mean and standardize the dataset.
- Calculate the covariance matrix.
- Compute eigenvalues and eigenvectors.
- Choose the top k eigenvectors (the principal components).
- Transform the original dataset.
Example:
If we have a dataset with 10 features and we reduce it to 2 principal components, we can visualize data in a 2D space while retaining most of the information.
Additional Example:
PCA is commonly used in facial recognition systems to reduce image dimensions while keeping key facial features intact.
Hints for Exams:
- Remember PCA reduces dimensions but retains maximum variance.
- Eigenvalues indicate importance of principal components.
- Covariance matrix shows feature relationships.
Questions and Answers (Week 1)
| Question | Answer |
| --- | --- |
| What is unsupervised learning? | A type of ML that finds patterns in data without labeled responses. |
| What is PCA used for? | Dimensionality reduction. |
| How does PCA transform data? | By finding new axes (directions) that maximize variance. |
| What is an eigenvector? | A direction along which the data varies the most. |
| What is an eigenvalue? | A measure of the variance along its eigenvector. |
| Why do we standardize data in PCA? | To ensure all features contribute equally. |
| What does the covariance matrix do? | Captures the pairwise relationships between features. |
| What is the main limitation of PCA? | It assumes linear relationships. |
| Can PCA be used for feature selection? | Not directly; it builds new features (components) rather than selecting original ones, though component loadings can guide selection. |
| How do we determine the number of principal components? | Using a scree plot or the explained variance ratio. |
| What happens if too many components are removed? | Information loss occurs. |
| What is a scree plot? | A graph of eigenvalues (explained variance) plotted against component number. |
| What type of data is best suited for PCA? | Linearly correlated numerical data. |
| Why is PCA useful in image processing? | It represents images with far fewer dimensions while keeping the important information. |
| What is dimensionality reduction? | Reducing the number of features in a dataset. |
| What are some alternatives to PCA? | t-SNE, LDA, Kernel PCA. |
| What is the curse of dimensionality? | As dimensions increase, data becomes sparse and harder to analyze. |
| What industries use PCA? | Finance, healthcare, image recognition, etc. |
| What is a principal component? | A new feature (axis) that captures maximum variance. |
| Why is high variance important in PCA? | It helps retain the most important information. |
| Can PCA work on categorical data? | No, it is designed for numerical data. |
| What are some practical applications of PCA? | Face recognition, recommendation systems, etc. |
| How do you interpret PCA results? | By analyzing the principal components and the variance each one explains. |
| What is a real-world example of PCA? | Reducing features in stock market prediction. |
Week 2: Kernel PCA
Why Is Kernel PCA Important?
- PCA is limited to linear transformations.
- Kernel PCA allows for non-linear feature extraction.
- Useful for complex image recognition, bioinformatics, and speech recognition.
Kernel Trick in PCA
Kernel PCA extends PCA by using a kernel function to implicitly map the data into a higher-dimensional feature space, where standard PCA is then applied.
Example:
If data is shaped like concentric circles, linear PCA would fail. Kernel PCA with an RBF kernel can separate the inner and outer circles.
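A minimal scikit-learn sketch of this example (the synthetic circles and the kernel bandwidth `gamma=10` are illustrative choices):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: a shape linear PCA cannot untangle
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# RBF kernel implicitly maps points to a higher-dimensional space before applying PCA
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)

# Plot the projection; the inner and outer circles land in separate regions
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y, cmap="coolwarm", s=10)
plt.xlabel("1st kernel principal component")
plt.ylabel("2nd kernel principal component")
plt.show()
```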
Hints for Exams:
- Kernel PCA handles non-linear data better than PCA.
- Common kernels: RBF, Polynomial, Sigmoid.
- Kernel matrix (Gram matrix) stores pairwise similarities.
Questions and Answers (Week 2)
| Question | Answer |
| --- | --- |
| What does Kernel PCA do? | Extends PCA to non-linear transformations. |
| What is the kernel trick? | Computing inner products in a higher-dimensional feature space via a kernel function, without performing the mapping explicitly. |
| What is a common kernel used in Kernel PCA? | The RBF (Gaussian) kernel. |
| What are some applications of Kernel PCA? | Image recognition, anomaly detection. |
| How does Kernel PCA compare to PCA? | It works for non-linear data. |
| What is a Gram matrix? | The kernel matrix storing pairwise similarity scores. |
| Why is Kernel PCA computationally expensive? | Because it requires computing (and eigendecomposing) the n x n kernel matrix. |
| What is a practical use case of Kernel PCA? | Handwriting recognition. |
| How do you choose a kernel function? | Experiment with different kernels and evaluate performance. |
| What happens if the wrong kernel is used? | Poor results and inefficient learning. |
| Can Kernel PCA work on text data? | Yes, with appropriate feature extraction. |
| What is the main limitation of Kernel PCA? | High memory usage for large datasets. |
| Why is Kernel PCA useful in bioinformatics? | It helps in gene classification. |
| How does Kernel PCA relate to SVMs? | Both use the kernel trick. |
| What type of data does Kernel PCA handle well? | Non-linearly separable data. |
| What does the RBF kernel do? | Implicitly maps data into an infinite-dimensional space. |
| Is Kernel PCA useful for dimensionality reduction? | Yes, especially for complex data structures. |
| How does Kernel PCA improve feature extraction? | By capturing non-linear patterns. |
| What are the challenges of implementing Kernel PCA? | High computational cost and kernel selection. |
| What is the primary advantage of Kernel PCA? | It captures complex, non-linear patterns in data. |
Week 3: Unsupervised Learning - Clustering
Why Is Clustering Important?
- Helps in grouping similar data points automatically.
- Used in customer segmentation, anomaly detection, and image segmentation.
- Helps in data pre-processing before supervised learning.
K-Means Clustering
A simple and widely used clustering algorithm that partitions data into K clusters.
Steps:
- Choose K cluster centers (randomly or using initialization techniques like K-Means++).
- Assign each point to the nearest cluster center.
- Recalculate cluster centers by taking the mean of assigned points.
- Repeat until convergence.
Example:
Imagine grouping customers based on their spending habits. K-means can classify them into low, medium, and high spenders.
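A minimal scikit-learn sketch of this example (the customer spending data is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer data: annual spend and number of purchases,
# drawn from three loose groups (low, medium, high spenders)
rng = np.random.default_rng(0)
spend = np.concatenate([rng.normal(500, 100, 50),
                        rng.normal(2000, 300, 50),
                        rng.normal(6000, 800, 50)])
visits = np.concatenate([rng.normal(5, 2, 50),
                         rng.normal(20, 5, 50),
                         rng.normal(40, 8, 50)])
X = np.column_stack([spend, visits])

# Standardize so both features contribute equally, then cluster into K = 3
X_std = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X_std)

print(kmeans.cluster_centers_)   # centroids in standardized units
print(np.bincount(labels))       # cluster sizes (roughly 50 each)
```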
Kernel K-Means
An extension of K-Means that uses the kernel trick to map data into a higher-dimensional space, allowing for non-linear cluster boundaries.
When to Use Kernel K-Means?
- When clusters are not linearly separable.
- When data has a complex shape (e.g., concentric circles).
Example:
Grouping images based on texture patterns where simple K-Means fails.
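A minimal NumPy sketch of kernel K-Means built on the kernel trick (the `kernel_kmeans` helper and the RBF bandwidth `gamma=5.0` are illustrative assumptions, not a library API): distances to cluster means in feature space are computed entirely from the Gram matrix.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

def kernel_kmeans(K, n_clusters, n_iter=100, seed=0):
    """Lloyd-style kernel K-Means on a precomputed Gram matrix K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)          # random initial assignment
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            size = mask.sum()
            if size == 0:
                continue
            # Squared distance to the cluster mean in feature space,
            # expressed purely in kernel evaluations (the kernel trick)
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, mask].sum(axis=1) / size
                          + K[np.ix_(mask, mask)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Concentric circles: plain K-Means fails here, while kernel K-Means with an
# RBF kernel can separate the two rings (the result depends on initialization)
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
labels = kernel_kmeans(rbf_kernel(X, gamma=5.0), n_clusters=2)
```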
Hints for Exams:
- K-Means minimizes intra-cluster variance.
- Random initialization can lead to different results.
- The elbow method is used to find the optimal K (see the sketch below).
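A short sketch of the elbow method on synthetic data (the blob dataset and the range of K values are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Inertia (total within-cluster variance) for K = 1..8;
# the "elbow" where the curve flattens suggests a good K
for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))
```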
Questions and Answers (Week 3)
| Question | Answer |
| --- | --- |
| What is clustering? | A technique for grouping similar data points together. |
| How does K-Means work? | It partitions data into K clusters by minimizing within-cluster variance. |
| What is the primary limitation of K-Means? | It assumes spherical clusters and requires specifying K in advance. |
| What is the elbow method? | A technique for finding the optimal number of clusters. |
| How does Kernel K-Means differ from K-Means? | It applies the kernel trick to handle non-linearly separable data. |
| What happens if K is too high in K-Means? | Overfitting and unnecessary complexity. |
| What distance metric does K-Means use? | Euclidean distance. |
| Can K-Means handle categorical data? | No, K-Means works best with numerical data. |
| What is a centroid in K-Means? | The mean of the points assigned to a cluster. |
| Why is K-Means sensitive to initialization? | Poor initialization can lead to bad clustering. |
| What is K-Means++? | An improved initialization technique for K-Means. |
| What is a real-world application of clustering? | Customer segmentation in marketing. |
| Why do we standardize data before K-Means? | To ensure equal contribution from all features. |
| What is the difference between hard and soft clustering? | Hard clustering assigns each point to one cluster; soft clustering assigns probabilities. |
| Can K-Means be used for anomaly detection? | Yes; anomalies fall far from cluster centers. |
| What are some alternatives to K-Means? | DBSCAN, hierarchical clustering, Gaussian mixture models. |
| Why do we need multiple runs of K-Means? | To reduce sensitivity to initialization. |
| How does Kernel K-Means improve on K-Means? | It enables clustering of non-linearly separable data. |
| What industries use clustering? | Healthcare, finance, retail, etc. |
| What are hierarchical clustering methods? | Techniques that build a tree (dendrogram) of clusters. |
| What are the two types of hierarchical clustering? | Agglomerative (bottom-up) and divisive (top-down). |
| How do we visualize K-Means clustering? | Using scatter plots or PCA projections. |
| What are some challenges in clustering? | Choosing the right number of clusters and handling noise. |
| What is a practical use case of Kernel K-Means? | Image segmentation in medical imaging. |
Week 4: Unsupervised Learning - Estimation
Recap of Maximum Likelihood Estimation (MLE) & Bayesian Estimation
MLE: Finds parameters that maximize the likelihood of observed data.
Bayesian Estimation: Incorporates prior knowledge and updates beliefs as data is observed.
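A small NumPy sketch contrasting the two for estimating a Gaussian mean (the prior parameters `mu0`, `tau0_sq` and the assumed known data variance `sigma_sq` are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=50)   # observations from an unknown Gaussian

# MLE for a Gaussian: the sample mean and (biased) sample variance maximize the likelihood
mu_mle = data.mean()
var_mle = ((data - mu_mle) ** 2).mean()

# Bayesian estimate of the mean with a Normal prior N(mu0, tau0_sq),
# assuming the data variance sigma_sq is known (conjugate update)
mu0, tau0_sq, sigma_sq = 0.0, 1.0, 4.0
n = len(data)
post_var = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
post_mean = post_var * (mu0 / tau0_sq + n * data.mean() / sigma_sq)

print(mu_mle, var_mle)        # MLE: driven only by the observed data
print(post_mean, post_var)    # Bayesian: the prior pulls the estimate toward mu0
```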
Gaussian Mixture Model (GMM)
A probabilistic clustering model that assumes data is generated from multiple Gaussian distributions.
Why Use GMM Over K-Means?
- Handles clusters of different shapes and sizes.
- Soft clustering: Assigns probabilities instead of fixed labels.
Expectation-Maximization (EM) Algorithm
An iterative method to find maximum likelihood estimates when data has latent (hidden) variables.
Steps:
- Expectation (E-step): Compute probabilities of belonging to each cluster.
- Maximization (M-step): Update model parameters based on probabilities.
- Repeat until convergence.
Example:
GMM can be used in speech recognition to model different phonemes as Gaussian distributions.
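A minimal scikit-learn sketch (the two-Gaussian synthetic data is illustrative); `GaussianMixture.fit` runs the EM algorithm internally, and `predict_proba` exposes the soft cluster assignments:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1D data drawn from two overlapping Gaussians
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1.5, 300)]).reshape(-1, 1)

# Fit a 2-component GMM; fit() iterates E-steps and M-steps until convergence
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.means_.ravel())          # estimated component means
print(gmm.covariances_.ravel())    # estimated component variances
print(gmm.weights_)                # mixing proportions

# Soft clustering: each point gets a probability of belonging to each component
print(gmm.predict_proba(X[:5]).round(3))
```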
Hints for Exams:
- MLE finds parameters that best fit the data.
- GMM is probabilistic; K-Means is deterministic.
- EM is used to optimize GMM.
Questions and Answers (Week 4)
| Question | Answer |
| --- | --- |
| What is MLE? | A method for estimating parameters that maximize the likelihood of the observed data. |
| How does Bayesian estimation differ from MLE? | It incorporates prior beliefs and updates them with data. |
| What is a Gaussian mixture model? | A model that represents data as a mixture of multiple Gaussian distributions. |
| Why use GMM over K-Means? | GMM handles soft clustering and clusters of different shapes. |
| What does the EM algorithm do? | It finds parameter estimates for models with hidden (latent) variables. |
| What is an example of GMM in real life? | Speaker identification in voice assistants. |
| Why is GMM called a mixture model? | Because it represents data as a mixture of multiple Gaussians. |
| What does the E-step in EM do? | It estimates the probabilities of data points belonging to each cluster. |
| What does the M-step in EM do? | It updates the model parameters based on those probabilities. |
| Can EM get stuck in local optima? | Yes; multiple runs may be needed. |
| What is a latent variable? | A hidden variable that is not directly observed. |
| Why is GMM probabilistic? | It assigns soft probabilities instead of hard labels. |
| What does convergence mean in EM? | The point at which parameter updates become negligible. |
| How do we determine the number of Gaussians in GMM? | Using criteria such as the Akaike Information Criterion (AIC). |
| What industries use GMM? | Speech recognition, finance, and bioinformatics. |
| What is the main limitation of GMM? | It assumes the data follows a mixture of Gaussian distributions. |
| How does EM relate to GMM? | EM is used to estimate the GMM parameters. |
| What are some alternatives to GMM? | DBSCAN, K-Means, spectral clustering. |
| How is GMM different from hierarchical clustering? | GMM is probabilistic; hierarchical clustering is tree-based. |
| What is a common initialization method for GMM? | Using K-Means to initialize the cluster centers. |
| What happens if the number of Gaussians is too high? | Overfitting and unnecessary complexity. |
| How do we visualize GMM results? | Using contour plots or PCA projections. |
| What is a practical use case of EM? | Image segmentation in computer vision. |
| Why is EM important in missing-data problems? | It estimates missing values iteratively. |