Understanding Data & Common Concepts
To build a strong foundation in machine learning, we must first understand the core elements of data representation:
- Notation: The mathematical symbols used to represent features, labels, and model parameters.
- Labeled Dataset: Data where each input has a corresponding output, essential for supervised learning.
- Data-matrix: A structured table where rows represent samples and columns represent features.
- Label Vector: A column of output values corresponding to each data point.
- Data-point: A single example from the dataset.
Example: House Price Prediction
Imagine you're trying to predict house prices. The dataset contains information about house size, number of rooms, location, and price.
- Data-matrix: Each row represents a house, each column represents a feature (size, rooms, location, etc.).
- Label Vector: The final column represents the actual price of each house.
- Data-point: A single house with its features and price.
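To make these terms concrete, here is a minimal sketch of how such a dataset could be represented with NumPy; the feature values and prices are made up purely for illustration:

```python
import numpy as np

# Data-matrix X: each row is a data-point (one house), each column a feature
# (size in square metres, number of rooms, distance to the city centre in km).
X = np.array([
    [120.0, 3, 5.0],
    [ 85.0, 2, 2.5],
    [200.0, 5, 8.0],
])

# Label vector y: the price (in thousands) of each house, in the same row order as X.
y = np.array([350.0, 280.0, 520.0])

# A single data-point: the features and label of the first house.
x_first, y_first = X[0], y[0]
print(x_first, y_first)
```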
Mean Squared Error (MSE) – Measuring Model Accuracy
MSE is a widely used loss function that measures the average squared difference between actual and predicted values: MSE = (1/n) * Σ (y_i - ŷ_i)^2, where y_i is the actual value and ŷ_i the prediction. A lower MSE means better model performance.
Example: Predicting Student Exam Scores
If a model predicts that a student will score 85, but the actual score is 90, the squared error is (90-85)^2 = 25. MSE averages these errors across multiple predictions.
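A minimal sketch of this computation for a handful of students; the scores are illustrative:

```python
import numpy as np

actual    = np.array([90, 75, 88, 60])   # true exam scores
predicted = np.array([85, 80, 85, 65])   # model predictions

# Mean Squared Error: the average of the squared differences.
mse = np.mean((actual - predicted) ** 2)
print(mse)  # (25 + 25 + 9 + 25) / 4 = 21.0
```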
Overfitting vs. Underfitting – The Bias-Variance Tradeoff
Understanding how models generalize is critical to machine learning success.
Overfitting – Learning Too Much
When a model memorizes the training data rather than learning general patterns, it performs poorly on unseen data.
- Example: A student memorizing answers instead of understanding concepts.
- Solution: Data augmentation, regularization, and pruning.
Toy Dataset – A Small-Scale Example
A toy dataset is a small, simplified dataset used for quick experiments. It helps in understanding model behavior before scaling to large datasets.
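The sketch below uses a small synthetic toy dataset to contrast the two failure modes discussed in this section: a degree-1 polynomial is too simple for quadratic data (underfitting), while a degree-15 polynomial chases the noise (overfitting). It assumes scikit-learn is available; the degrees and noise level are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Toy dataset: 30 noisy samples from a quadratic function.
X = np.sort(rng.uniform(-3, 3, size=30)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=1.0, size=30)

# Noise-free test grid to measure how well each model generalizes.
X_test = np.linspace(-3, 3, 100).reshape(-1, 1)
y_test = 0.5 * X_test.ravel() ** 2

for degree in (1, 2, 15):  # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```

With this setup you would expect the degree-15 model to achieve the lowest training error but a worse test error than the degree-2 fit (overfitting), and the degree-1 model to do poorly on both (underfitting).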
Data Augmentation – Expanding Training Data
To combat overfitting, we can artificially expand the training data by:
- Rotating or flipping images in image classification.
- Adding noise to numerical datasets.
- Translating or paraphrasing text (e.g., back-translation) for NLP models.
Example: Handwriting Recognition
If you only train a model on perfectly written letters, it may struggle with different handwriting styles. Data augmentation (adding slight distortions) improves generalization.
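A minimal sketch of two such augmentations using only NumPy: flipping an image array and jittering numeric features with small Gaussian noise. The image is random and the noise scale is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Image augmentation: horizontally flip a (height, width) grayscale image.
image = rng.random((28, 28))
flipped = image[:, ::-1]

# Numeric augmentation: add small Gaussian noise, scaled per feature.
features = np.array([[120.0, 3.0], [85.0, 2.0]])
noise = rng.normal(scale=0.05 * features.std(axis=0), size=features.shape)
augmented = features + noise

print(flipped.shape, augmented)
```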
Underfitting – Learning Too Little
A model that is too simple fails to capture the underlying patterns in data.
- Example: A student only learning addition when trying to solve algebra problems.
- Solution: Increasing model complexity, adding more features, or reducing regularization.
Model Complexity – Finding the Right Balance
A model should be complex enough to capture patterns but simple enough to generalize well.
Regularization – Controlling Model Complexity
Regularization techniques help prevent overfitting by penalizing overly complex models.
Ridge Regression – L2 Regularization
Ridge regression adds a penalty proportional to the sum of squared coefficient values to the loss: Loss = MSE + λ * Σ w_j^2. This prevents overfitting by shrinking parameter values toward zero.
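A minimal sketch of ridge regression with scikit-learn on synthetic data; alpha (the penalty strength λ above) is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 2.0]) + rng.normal(scale=0.1, size=50)

# alpha controls the strength of the L2 penalty on the coefficients.
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_)  # coefficients are shrunk toward zero, but rarely exactly zero
```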
LASSO Regression – L1 Regularization
LASSO (Least Absolute Shrinkage and Selection Operator) adds a penalty proportional to the sum of absolute coefficient values: Loss = MSE + λ * Σ |w_j|. Because this penalty can drive some coefficients exactly to zero, LASSO effectively performs feature selection, which helps in high-dimensional data.
Example: Movie Recommendation System
LASSO regression can eliminate unimportant features (like a user’s browser history) while keeping relevant ones (like movie genre preference) to improve recommendations.
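A minimal sketch showing this zeroing effect on synthetic data where only two of six features matter; alpha is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
# Only the first two features actually influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # coefficients of the irrelevant features are typically driven to exactly 0.0
```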
Cross-Validation – Evaluating Model Performance
To ensure our model generalizes well, we use cross-validation techniques; the two most common schemes are described below, followed by a short code sketch of both.
k-Fold Cross-Validation
- Splits data into k subsets (folds)
- Trains model on k-1 folds and tests on the remaining fold
- Repeats k times to ensure robustness
Leave-One-Out Cross-Validation (LOOCV)
- Uses all data points except one for training
- Tests on the excluded data point
- Repeats for every data point
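A minimal sketch of both schemes using scikit-learn's cross_val_score on synthetic data; the choice of 5 folds and a ridge model is arbitrary and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=40)

model = Ridge(alpha=1.0)

# k-fold: train on k-1 folds, test on the held-out fold, repeat k times.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_squared_error")
print("5-fold mean MSE:", -kfold_scores.mean())

# LOOCV: every data-point is held out exactly once.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print("LOOCV mean MSE:", -loo_scores.mean())
```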
Example: Diagnosing Disease with Medical Data
Cross-validation ensures that a model predicting disease outcomes generalizes well across different patients, avoiding bias from a specific subset of data.
Probabilistic View of Regression
Regression can also be viewed through a probabilistic lens by modeling the likelihood of output values given input features. This view supports uncertainty estimation and underlies Bayesian regression techniques.
Example: Weather Prediction
Instead of predicting a single temperature, a probabilistic regression model can output a temperature range with probabilities, helping meteorologists communicate uncertainty.
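A minimal sketch of this idea: fit an ordinary linear regression, estimate the noise standard deviation from the residuals, and report a prediction with an approximate 95% interval instead of a single number. The synthetic "temperature" data and the Gaussian-noise assumption are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: tomorrow's temperature as a noisy function of today's.
x_today = rng.uniform(10, 30, size=200).reshape(-1, 1)
y_tomorrow = 0.8 * x_today.ravel() + 4.0 + rng.normal(scale=2.0, size=200)

model = LinearRegression().fit(x_today, y_tomorrow)

# Assume Gaussian noise and estimate its standard deviation from the residuals.
residuals = y_tomorrow - model.predict(x_today)
sigma = residuals.std()

x_new = np.array([[25.0]])
mean = model.predict(x_new)[0]
print(f"predicted temperature: {mean:.1f} ± {1.96 * sigma:.1f} (approx. 95% interval)")
```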
By mastering these advanced concepts in W6, students will gain a deeper understanding of model evaluation, regularization techniques, and strategies for handling overfitting and underfitting.