Week 1: Introduction to Unsupervised Learning
Unsupervised learning is a type of machine learning where the algorithm identifies patterns in data without labeled responses. It is widely used in clustering, dimensionality reduction, and anomaly detection.
Why Are We Learning This?
- Helps in identifying hidden patterns in data.
- Essential for big data analysis and AI applications.
- Used in recommendation systems, fraud detection, and image compression.
- Prepares data for supervised learning by reducing noise and dimensionality.
Feature | Supervised Learning | Unsupervised Learning |
---|---|---|
Data Type | Labeled data | Unlabeled data |
Goal | Predict outcomes | Identify patterns |
Example | Spam detection | Customer segmentation |
Representation Learning
Representation learning uncovers the underlying structure of data and re-expresses it in a form that is more useful for downstream tasks such as clustering or classification.
Principal Component Analysis (PCA)
PCA is a statistical method used for dimensionality reduction, helping to transform high-dimensional data into a smaller set while retaining essential information.
Why Is PCA Useful?
- Reduces computational cost by removing redundant features.
- Helps visualize high-dimensional data.
- Used in image processing, genetics, and finance.
Steps of PCA:
- Compute the mean and standardize the dataset.
- Calculate the covariance matrix.
- Compute eigenvalues and eigenvectors.
- Choose the top k eigenvectors (principal components).
- Transform the original dataset.
Example:
If we have a dataset with 10 features and we reduce it to 2 principal components, we can visualize data in a 2D space while retaining most of the information.
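A minimal sketch of these steps in Python, using scikit-learn on randomly generated data (the 10-feature dataset and all parameter values here are illustrative assumptions, not course data):

```python
# Minimal PCA sketch: reduce 10 synthetic features to 2 principal components.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # 200 samples, 10 features (illustrative)

X_std = StandardScaler().fit_transform(X)    # step 1: standardize the data
pca = PCA(n_components=2)                    # steps 2-4: covariance, eigenvectors, top k
X_2d = pca.fit_transform(X_std)              # step 5: project onto the components

print(X_2d.shape)                            # (200, 2)
print(pca.explained_variance_ratio_)         # fraction of variance retained per component
```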
Additional Example:
PCA is commonly used in facial recognition systems to reduce image dimensions while keeping key facial features intact.
Hints for Exams:
- Remember PCA reduces dimensions but retains maximum variance.
- Eigenvalues indicate importance of principal components.
- Covariance matrix shows feature relationships.
Questions and Answers (Week 1)
Question | Answer |
---|---|
What is unsupervised learning? | It is a type of ML that finds patterns without labeled data. |
What is PCA used for? | Dimensionality reduction. |
How does PCA transform data? | By finding new axes that maximize variance. |
What is an eigenvector? | A direction (axis) in feature space; the covariance matrix's leading eigenvector points in the direction of maximum variance. |
What is an eigenvalue? | A measure of variance along an eigenvector. |
Why do we standardize data in PCA? | To ensure features contribute equally. |
What does the covariance matrix do? | Captures feature relationships. |
What is the main limitation of PCA? | It assumes linear relationships. |
Can PCA be used for feature selection? | Not directly; PCA performs feature extraction by creating new components, though component loadings can guide selection. |
How do we determine the number of principal components? | Using a scree plot or explained variance ratio. |
What happens if too many components are removed? | Information loss occurs. |
What is a scree plot? | A plot of eigenvalues (explained variance) against component number, used to choose how many components to keep. |
What type of data is best suited for PCA? | Linearly correlated data. |
Why is PCA useful in image processing? | It compresses images by representing them with far fewer components while keeping the important visual information. |
What is dimensionality reduction? | Reducing the number of features in a dataset. |
What are some alternatives to PCA? | t-SNE, LDA, Kernel PCA. |
What is the curse of dimensionality? | As dimensions increase, data becomes sparse and harder to analyze. |
What industries use PCA? | Finance, healthcare, image recognition, etc. |
What is a principal component? | A new feature that captures maximum variance. |
Why is high variance important in PCA? | It helps retain the most important information. |
Can PCA work on categorical data? | No, it is designed for numerical data. |
What are some practical applications of PCA? | Face recognition, recommendation systems, etc. |
How do you interpret PCA results? | By analyzing principal components and their variance. |
What is a real-world example of PCA? | Reducing features in stock market prediction. |
Week 2: Kernel PCA
Why Is Kernel PCA Important?
- PCA is limited to linear transformations.
- Kernel PCA allows for non-linear feature extraction.
- Useful for complex image recognition, bioinformatics, and speech recognition.
Kernel Trick in PCA
Kernel PCA extends PCA using a kernel function to map data into higher-dimensional space before applying PCA.
Example:
If data is shaped like concentric circles, linear PCA would fail. Kernel PCA with an RBF kernel can separate the inner and outer circles.
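A minimal sketch of this concentric-circles case, assuming scikit-learn's make_circles data and an illustrative gamma value:

```python
# Kernel PCA on concentric circles, where linear PCA cannot separate the rings.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear = PCA(n_components=2).fit_transform(X)               # projection still shows two rings
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)  # RBF kernel: non-linear mapping
nonlinear = kpca.fit_transform(X)

# In the Kernel PCA projection, the first component already separates the two circles.
print(nonlinear[y == 0, 0].mean(), nonlinear[y == 1, 0].mean())
```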
Hints for Exams:
- Kernel PCA handles non-linear data better than PCA.
- Common kernels: RBF, Polynomial, Sigmoid.
- Kernel matrix (Gram matrix) stores pairwise similarities.
Questions and Answers (Week 2)
Question | Answer |
---|---|
What does Kernel PCA do? | Extends PCA to non-linear transformations. |
What is the kernel trick? | Computing inner products in a higher-dimensional feature space via a kernel function, without explicitly mapping the data. |
What is a common kernel used in Kernel PCA? | RBF Kernel. |
What are some applications of Kernel PCA? | Image recognition, anomaly detection. |
How does Kernel PCA compare to PCA? | It works for non-linear data. |
What is a Gram matrix? | A kernel matrix storing similarity scores. |
Why is Kernel PCA computationally expensive? | Because it requires computing the kernel matrix. |
What is a practical use case of Kernel PCA? | Handwriting recognition. |
How do you choose a kernel function? | Experiment with different kernels and evaluate performance. |
What happens if the wrong kernel is used? | Poor results and inefficient learning. |
Can Kernel PCA work on text data? | Yes, with appropriate feature extraction. |
What is the main limitation of Kernel PCA? | High memory usage for large datasets. |
Why is Kernel PCA useful in bioinformatics? | It helps in gene classification. |
How does Kernel PCA relate to SVMs? | Both use the kernel trick. |
What type of data does Kernel PCA handle well? | Non-linearly separable data. |
What does the RBF kernel do? | Maps data into infinite-dimensional space. |
Is Kernel PCA useful for dimensionality reduction? | Yes, for complex data structures. |
How does Kernel PCA improve feature extraction? | By capturing non-linear patterns. |
What are the challenges of implementing Kernel PCA? | High computational cost and kernel selection. |
What is the primary advantage of Kernel PCA? | It captures complex patterns in data. |
Week 3: Unsupervised Learning - Clustering
Why Is Clustering Important?
- Helps in grouping similar data points automatically.
- Used in customer segmentation, anomaly detection, and image segmentation.
- Helps in data pre-processing before supervised learning.
K-Means Clustering
A simple and widely used clustering algorithm that partitions data into K clusters.
Steps:
- Choose K cluster centers (randomly or using initialization techniques like K-means++).
- Assign each point to the nearest cluster center.
- Recalculate cluster centers by taking the mean of assigned points.
- Repeat until convergence.
Example:
Imagine grouping customers based on their spending habits. K-means can classify them into low, medium, and high spenders.
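A minimal sketch of this customer-segmentation idea with scikit-learn's KMeans; the synthetic spending figures are illustrative assumptions:

```python
# K-Means sketch: group customers by two spending features into 3 clusters.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
low = rng.normal([200, 5], 30, size=(50, 2))      # [annual spend, monthly visits]
mid = rng.normal([800, 15], 60, size=(50, 2))
high = rng.normal([2000, 30], 120, size=(50, 2))
X = np.vstack([low, mid, high])

X_std = StandardScaler().fit_transform(X)          # equal contribution from both features
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X_std)
print(km.cluster_centers_)                         # one centroid per spending segment
```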
Kernel K-Means
An extension of K-Means that uses the kernel trick to map data into a higher-dimensional space, allowing for non-linear cluster boundaries.
When to Use Kernel K-Means?
- When clusters are not linearly separable.
- When data has a complex shape (e.g., concentric circles).
Example:
Grouping images based on texture patterns where simple K-Means fails.
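scikit-learn does not ship a kernel K-Means estimator, so the sketch below hand-rolls the kernelized assignment step with NumPy and an RBF Gram matrix; the data, gamma, and iteration count are illustrative assumptions:

```python
# Minimal kernel K-Means sketch: cluster distances are computed purely from the
# Gram matrix, so no explicit mapping to the higher-dimensional space is needed.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
K = rbf_kernel(X, gamma=10.0)                # Gram matrix of pairwise similarities
n, k = len(X), 2
rng = np.random.default_rng(0)
labels = rng.integers(0, k, size=n)          # random initial assignment

for _ in range(20):                          # a few Lloyd-style iterations
    dist = np.zeros((n, k))
    for c in range(k):
        members = labels == c
        m = members.sum()
        if m == 0:
            dist[:, c] = np.inf              # empty cluster: never the closest
            continue
        # squared distance to cluster c in feature space, expressed with kernels only
        dist[:, c] = (np.diag(K)
                      - 2.0 * K[:, members].sum(axis=1) / m
                      + K[np.ix_(members, members)].sum() / m**2)
    new_labels = dist.argmin(axis=1)
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels

print(np.bincount(labels))                   # cluster sizes after convergence
```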
Hints for Exams:
- K-Means minimizes intra-cluster variance.
- Random initialization can lead to different results.
- The elbow method is used to find the optimal K (see the sketch below).
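A minimal elbow-method sketch, assuming synthetic blob data and scikit-learn's KMeans inertia as the within-cluster variance measure:

```python
# Elbow method: plot K-Means inertia against K and look for the "bend".
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # illustrative data

ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Inertia (within-cluster variance)")
plt.title("Elbow method")
plt.show()
```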
Questions and Answers (Week 3)
Question | Answer |
---|---|
What is clustering? | A technique to group similar data points together. |
How does K-Means work? | It partitions data into K clusters by minimizing variance. |
What is the primary limitation of K-Means? | It assumes spherical clusters and requires specifying K in advance. |
What is the elbow method? | A technique to find the optimal number of clusters. |
How does Kernel K-Means differ from K-Means? | It applies the kernel trick to handle non-linearly separable data. |
What happens if K is too high in K-Means? | Overfitting and unnecessary complexity. |
What distance metric does K-Means use? | Euclidean distance. |
Can K-Means handle categorical data? | No, K-Means works best with numerical data. |
What is a centroid in K-Means? | The mean of points assigned to a cluster. |
Why is K-Means sensitive to initialization? | Poor initialization can lead to bad clustering. |
What is K-Means++? | An improved initialization technique for K-Means. |
What is a real-world application of clustering? | Customer segmentation in marketing. |
Why do we standardize data before K-Means? | To ensure equal contribution from all features. |
What is the difference between hard and soft clustering? | Hard assigns each point to one cluster; soft assigns probabilities. |
Can K-Means be used for anomaly detection? | Yes, anomalies fall far from cluster centers. |
What are some alternatives to K-Means? | DBSCAN, Hierarchical Clustering, Gaussian Mixture Models. |
Why do we need multiple runs of K-Means? | To reduce sensitivity to initialization. |
How does Kernel K-Means improve K-Means? | It enables clustering of non-linearly separable data. |
What industries use clustering? | Healthcare, finance, retail, etc. |
What are hierarchical clustering methods? | Techniques that build a tree of clusters. |
What are the two types of hierarchical clustering? | Agglomerative (bottom-up) and Divisive (top-down). |
How do we visualize K-Means clustering? | Using scatter plots or PCA projections. |
What are some challenges in clustering? | Choosing the right number of clusters and handling noise. |
What is a practical use case of Kernel K-Means? | Image segmentation in medical imaging. |
Week 4: Unsupervised Learning - Estimation
Recap of Maximum Likelihood Estimation (MLE) & Bayesian Estimation
- MLE: finds the parameters that maximize the likelihood of the observed data.
- Bayesian estimation: incorporates prior knowledge and updates beliefs as data is observed.
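A minimal sketch contrasting the two, assuming a Gaussian model with known variance and a conjugate Normal prior (the prior and the simulated data are illustrative assumptions):

```python
# MLE vs. Bayesian estimation of a Gaussian mean (known variance).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=30)     # observed samples (illustrative)
sigma2 = 4.0                                        # known data variance

# MLE: the sample mean maximizes the likelihood.
mu_mle = data.mean()

# Bayesian: a conjugate Normal prior N(mu0, tau2) gives a closed-form posterior mean
# that blends the prior with the data and shifts toward the data as n grows.
mu0, tau2 = 0.0, 1.0
n = len(data)
posterior_mean = (mu0 / tau2 + data.sum() / sigma2) / (1 / tau2 + n / sigma2)

print(f"MLE estimate: {mu_mle:.3f}, Bayesian posterior mean: {posterior_mean:.3f}")
```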
Gaussian Mixture Model (GMM)
A probabilistic clustering model that assumes data is generated from multiple Gaussian distributions.
Why Use GMM Over K-Means?
- Handles clusters of different shapes and sizes.
- Soft clustering: Assigns probabilities instead of fixed labels.
Expectation-Maximization (EM) Algorithm
An iterative method to find maximum likelihood estimates when data has latent (hidden) variables.
Steps:
- Expectation (E-step): Compute probabilities of belonging to each cluster.
- Maximization (M-step): Update model parameters based on probabilities.
- Repeat until convergence.
Example:
GMM can be used in speech recognition to model different phonemes as Gaussian distributions.
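A minimal sketch of soft clustering with scikit-learn's GaussianMixture, which fits the model by EM; the two-component synthetic data is an illustrative assumption:

```python
# GMM sketch: soft cluster assignments from a mixture fitted by EM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.8, size=(150, 2)),
               rng.normal(3.0, 1.5, size=(150, 2))])

gmm = GaussianMixture(n_components=2, random_state=0)   # parameters estimated via EM
gmm.fit(X)

probs = gmm.predict_proba(X[:3])     # soft assignments: probability per component
print(np.round(probs, 3))
print(gmm.means_)                    # estimated Gaussian means
```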
Hints for Exams:
- MLE finds parameters that best fit the data.
- GMM gives probabilistic (soft) assignments; K-Means gives hard assignments.
- EM is used to optimize GMM.
Questions and Answers (Week 4)
Question | Answer |
---|---|
What is MLE? | A method to estimate parameters that maximize likelihood. |
How does Bayesian estimation differ from MLE? | It incorporates prior beliefs. |
What is a Gaussian Mixture Model? | A model that represents data as multiple Gaussian distributions. |
Why use GMM over K-Means? | GMM handles soft clustering and different cluster shapes. |
What does the EM algorithm do? | It finds parameter estimates for models with hidden variables. |
What is an example of GMM in real life? | Speaker identification in voice assistants. |
Why is GMM called a mixture model? | Because it represents data as a mixture of multiple Gaussians. |
What does the E-step in EM do? | It estimates probabilities of data points belonging to each cluster. |
What does the M-step in EM do? | It updates model parameters based on probabilities. |
Can EM get stuck in local optima? | Yes, multiple runs may be needed. |
What is a latent variable? | A hidden variable not directly observed. |
Why is GMM probabilistic? | It assigns soft probabilities instead of hard labels. |
What does convergence mean in EM? | When parameter updates become negligible. |
How do we determine the number of Gaussians in GMM? | Using methods like the Akaike Information Criterion (AIC). |
What industries use GMM? | Speech recognition, finance, and bioinformatics. |
What is the main limitation of GMM? | It assumes data follows a Gaussian distribution. |
How does EM relate to GMM? | EM is used to estimate GMM parameters. |
What are some alternatives to GMM? | DBSCAN, K-Means, Spectral Clustering. |
How is GMM different from Hierarchical Clustering? | GMM is probabilistic; hierarchical is tree-based. |
What is a common initialization method for GMM? | Using K-Means to initialize cluster centers. |
What happens if the number of Gaussians is too high? | Overfitting and unnecessary complexity. |
How do we visualize GMM results? | Using contour plots or PCA projections. |
What is a practical use case of EM? | Image segmentation in computer vision. |
Why is EM important in missing data problems? | It estimates missing values iteratively. |