February 28, 2025

forEach in Java: Evolution, Internal Working, Performance & Future Enhancement

Introduction

Iteration is a core programming concept, and Java's forEach has evolved to provide a more readable and functional approach to iteration. However, to narrow the performance gap with lower-level languages like C, Rust, or Go, Java developers must understand its limitations, optimize hot paths, and understand the JVM's internal optimizations. This article offers a broad perspective, detailing forEach internals, performance considerations, and possible future improvements.


The Need for forEach in Java

Before Java 5, iteration relied on traditional loops:

for (int i = 0; i < list.size(); i++) {
    System.out.println(list.get(i));
}

However, this approach was verbose and prone to off-by-one mistakes such as IndexOutOfBoundsException. Java 5 introduced the enhanced for-loop, reducing boilerplate:

for (String item : list) {
    System.out.println(item);
}

Java 8 then introduced the forEach method, integrating functional programming paradigms:

list.forEach(item -> System.out.println(item));

Why Was forEach Introduced?

  • Readability: Less verbose compared to index-based loops.
  • Encapsulation: Abstracts iteration logic.
  • Functional Programming: Supports lambda expressions.
  • Parallel Processing: Works well with Streams.

Internal Working of forEach in JVM

1. How Enhanced for-loop Works Internally

The enhanced for loop internally relies on an Iterator:

Iterator<String> iterator = list.iterator();
while (iterator.hasNext()) {
    String item = iterator.next();
    System.out.println(item);
}

The compiler translates:

for (String item : list) {
    System.out.println(item);
}

to an iterator-based approach.

2. How forEach Works Internally

  • When forEach is called, it uses internal iteration, passing each element to a consumer function (Consumer<T> from java.util.function).
  • Internally, forEach on ArrayList is implemented essentially as follows (lightly simplified from the JDK source):

public void forEach(Consumer<? super E> action) {
    Objects.requireNonNull(action);
    final int expectedModCount = modCount;
    @SuppressWarnings("unchecked")
    final E[] elementData = (E[]) this.elementData;
    final int size = this.size;
    // The loop also bails out early if the list is structurally modified mid-iteration
    for (int i = 0; modCount == expectedModCount && i < size; i++) {
        action.accept(elementData[i]);
    }
    if (modCount != expectedModCount) {
        throw new ConcurrentModificationException();
    }
}
  • It fetches elements sequentially and fails fast, throwing a ConcurrentModificationException if the list is structurally modified during iteration.
  • Since forEach is not parallel by default, it does not take advantage of multiple CPU cores unless parallelStream() is used.

Performance Analysis of forEach

Comparing Different Looping Mechanisms

| Method               | Performance                                   | Parallelism | Readability | Index Access |
|----------------------|-----------------------------------------------|-------------|-------------|--------------|
| Traditional for loop | Fastest (JIT-optimized)                       | No          | Medium      | Yes          |
| Enhanced for loop    | Slight overhead (uses iterator)               | No          | High        | No           |
| forEach method       | Moderate                                      | No          | Very High   | No           |
| Stream forEach       | Slow single-threaded, fast in parallel        | Yes         | Very High   | No           |

Key Performance Takeaways

  • Traditional for-loops are still the fastest due to JIT (Just-In-Time) compiler optimizations.
  • Enhanced for-loops use iterators internally, adding slight overhead.
  • forEach has lambda overhead, especially in large collections.
  • Stream forEach is beneficial only in parallel execution; otherwise it is typically slower than a plain loop.
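
These numbers vary with JVM version, collection size, and workload, so measure before optimizing. Below is a minimal JMH sketch (class name and list size are illustrative) comparing the four approaches; run it via the JMH Maven archetype or Gradle plugin:

import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class ForEachBenchmark {

    private List<Integer> list;

    @Setup
    public void setup() {
        // 100k boxed integers; adjust to your workload
        list = IntStream.range(0, 100_000).boxed().collect(Collectors.toList());
    }

    @Benchmark
    public void indexedFor(Blackhole bh) {
        for (int i = 0; i < list.size(); i++) {
            bh.consume(list.get(i));
        }
    }

    @Benchmark
    public void enhancedFor(Blackhole bh) {
        for (Integer item : list) {
            bh.consume(item);
        }
    }

    @Benchmark
    public void forEachMethod(Blackhole bh) {
        list.forEach(bh::consume);
    }

    @Benchmark
    public void streamForEach(Blackhole bh) {
        list.stream().forEach(bh::consume);
    }
}

The Blackhole prevents the JIT from eliminating the loop bodies as dead code, which would otherwise make all four variants look equally "fast".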

Optimizations for Maximum Speed

  1. Use Indexed for-loops for Primitives
    for (int i = 0; i < arr.length; i++) {
        sum += arr[i];
    }
    
  2. Avoid Stream forEach for Large Sequential Operations
    for (String item : list) {
        process(item);
    }
    
  3. Use Parallel Streams with Caution (only for large, CPU-bound workloads without shared mutable state or ordering requirements)
    list.parallelStream().forEach(item -> process(item));
    
  4. Leverage ForkJoinPool for Custom Parallelism
    ForkJoinPool customPool = new ForkJoinPool(4);
    // join() waits for completion; running a parallel stream inside a custom
    // pool like this is a well-known but unofficial trick
    customPool.submit(() -> list.parallelStream().forEach(System.out::println)).join();
    

Exceptions & Drawbacks of forEach

1. No Break or Continue Support

list.forEach(item -> {
    if (item.equals("Stop")) break; // Compilation Error
});
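
If an early exit is needed today, the idiomatic route is a short-circuiting stream operation rather than forEach:

// anyMatch short-circuits: iteration stops at the first match
boolean found = list.stream().anyMatch(item -> item.equals("Stop"));

// Java 9+: takeWhile processes elements only until the sentinel appears
list.stream()
    .takeWhile(item -> !item.equals("Stop"))
    .forEach(System.out::println);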

2. ConcurrentModificationException

list.forEach(item -> {
    if (item.equals("X")) list.remove(item); // Error!
});

Solution: Use Iterator.remove()

Iterator<String> iterator = list.iterator();
while (iterator.hasNext()) {
    if (iterator.next().equals("X")) iterator.remove();
}
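
Since Java 8, Collection.removeIf wraps this same iterator-based removal in a single call:

list.removeIf(item -> item.equals("X"));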

Future Improvements for forEach

1. Support for Breaking & Continuing Loops

// Hypothetical API, not in the JDK
list.forEachBreakable(item -> {
    if (item.equals("Stop")) break; // would require language-level support
});

2. Indexed forEach Variant

list.forEach((index, item) -> System.out.println(index + ": " + item)); // hypothetical overload, not in the JDK

3. Smarter Parallel Execution

list.autoParallelForEach(item -> process(item)); // hypothetical method, not in the JDK
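
None of these methods exist in the JDK today. The closest current idiom for index-aware iteration is an IntStream over the indices:

import java.util.stream.IntStream;

IntStream.range(0, list.size())
         .forEach(i -> System.out.println(i + ": " + list.get(i)));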

Conclusion

Java's forEach provides a clean and functional approach to iteration, but it isn't always the fastest. Using indexed loops for primitives and optimizing stream operations can help match or exceed the speed of lower-level languages like C or Rust. Future improvements like break support, index-aware iteration, and smarter parallelization could further enhance performance.

What’s your preferred way of iterating collections in Java? Let’s discuss! πŸš€

February 25, 2025

Agile & Scrum: Best Practices for Efficient Workflows

In the fast-paced world of software development, efficiency, scalability, and seamless collaboration are essential. To prevent delays, miscommunication, and inefficiencies, teams must embrace structured workflows, modern frameworks, and clear documentation.

This guide covers:
βœ… Best practices for efficient workflows
βœ… Common pitfalls in frameworks and how to avoid them
βœ… Effective ownership, documentation, and deployment
βœ… Clear communication strategies for cross-functional teams
βœ… What happens if we don’t follow these best practices?

By implementing these strategies, teams can boost productivity, streamline onboarding, and enhance software quality.




1. Adopting Frameworks for Efficient Development

1.1 Agile & Scrum for Structured Workflows

πŸš€ Why Agile?
Agile promotes incremental development, adaptability, and continuous improvement. Scrum provides a structured approach with defined roles, sprints, and ceremonies.

βœ” Best Practices:

  • Sprint Planning: Define scope, assess priority, and identify dependencies.
  • Daily Stand-ups: Share progress, blockers, and next steps.
  • Retrospectives: Continuously improve processes based on past sprints.

πŸ“Œ Essential Questions to Ask in Agile & Scrum Calls

πŸ’‘ During Sprint Planning:

  1. Is the scope of the ticket clear?
  2. Are there any dependencies blocking this task?
  3. What is the expected effort level (small, medium, large)?
  4. Are the acceptance criteria well-defined?
  5. Do we have all the necessary requirements, or do we need clarification?
  6. Is this task a high-priority item for the sprint?
  7. Who is the owner of this task, and who will collaborate on it?
  8. Do we need additional testing scenarios for this feature?
  9. Has the team estimated the task realistically?

πŸ’‘ During Daily Stand-ups:

  1. What did I accomplish yesterday?
  2. What will I be working on today?
  3. Am I facing any blockers?
  4. Do I need help from another team member?
  5. Are there any unexpected changes impacting the sprint?

πŸ’‘ During Sprint Grooming / Backlog Refinement:

  1. Are all backlog tickets well-defined and refined?
  2. Do we need more details or clarification from the product owner?
  3. Are there any tasks that need to be reprioritized?
  4. Do we need to break down large tasks into smaller user stories?
  5. Are there any dependencies that could cause delays?

πŸ’‘ During Sprint Retrospectives:

  1. What went well in the sprint?
  2. What challenges did we face?
  3. How can we improve collaboration and efficiency?
  4. Were there any miscommunications or unclear expectations?
  5. How can we prevent blockers in future sprints?

⚠ Common Pitfalls & Solutions:
❌ Poor backlog grooming β†’ Leads to unclear requirements. Always refine tickets before the sprint.
❌ Overcommitting β†’ Causes unfinished work. Plan realistically and prioritize.
❌ Skipping retrospectives β†’ Misses learning opportunities. Treat them as essential.

πŸ”΄ If Ignored, Future Risks:
🚨 Unstructured development leads to missed deadlines.
🚨 Unclear requirements cause last-minute changes and stress.
🚨 Lack of continuous improvement results in stagnant teams.


1.2 Kanban for Real-Time Task Management

πŸ“Œ Why Kanban?
Kanban visualizes workflows, making it easier to track progress, identify bottlenecks, and optimize WIP (Work in Progress).

βœ” Best Practices:

  • Use a Kanban board (To Do β†’ In Progress β†’ Code Review β†’ QA β†’ Done).
  • Limit Work in Progress (WIP) to maintain focus.
  • Ensure ticket statuses are updated regularly.

⚠ Common Pitfalls & Solutions:
❌ Too many tasks in progress β†’ Creates inefficiency. Set strict WIP limits.
❌ Lack of visibility β†’ Causes misalignment. Update boards consistently.

πŸ”΄ If Ignored, Future Risks:
🚨 Bottlenecks slow down development, delaying releases.
🚨 Teams work in silos, leading to confusion.
🚨 Higher chances of work duplication and wasted effort.


1.3 DevOps & CI/CD for Seamless Deployment

πŸ›  Why CI/CD?
Continuous Integration and Continuous Deployment ensure that code changes are tested, integrated, and deployed efficiently.

βœ” Best Practices:

  • Automate build, test, and deployment using Jenkins, GitHub Actions, or GitLab CI/CD.
  • Follow a Git branching strategy like Git Flow.
  • Use real-time monitoring tools (e.g., Prometheus, ELK Stack) for issue detection.

⚠ Common Pitfalls & Solutions:
❌ Skipping automated tests β†’ Leads to unstable releases. Enforce test automation.
❌ Manual deployments β†’ Error-prone. Use Infrastructure as Code (IaC) with Terraform or Ansible.

πŸ”΄ If Ignored, Future Risks:
🚨 Unstable releases cause production outages.
🚨 Higher bug rates lead to customer dissatisfaction.
🚨 Manual deployments increase risk of human errors.


1.4 Shift-Left Testing: Detect Issues Early

πŸ” Why Shift-Left Testing?
Testing earlier in the development cycle prevents last-minute defects and reduces costs.

βœ” Best Practices:

  • Write unit and integration tests alongside feature development.
  • Automate regression testing using JUnit, TestNG, or Selenium.
  • Collaborate with QA before the sprint ends to ensure test readiness.

⚠ Common Pitfalls & Solutions:
❌ Testing only after development β†’ Delays bug fixes. Integrate testing into sprints.
❌ Lack of test documentation β†’ Causes confusion. Ensure all test cases are well-documented.

πŸ”΄ If Ignored, Future Risks:
🚨 Undetected bugs appear in production, causing rollbacks.
🚨 QA struggles with unclear testing requirements.
🚨 More time is spent fixing issues than delivering new features.


2. Best Practices for Day-to-Day Work

2.1 Ticket & Task Management

  • Understand ticket priority, complexity, and dependencies before the sprint begins.
  • Ask clarifying questions upfront to avoid delays.
  • Regularly update ticket status and blockers to maintain transparency.

πŸ”΄ If Ignored, Future Risks:
🚨 Critical tasks may be deprioritized, leading to rushed work.
🚨 Unresolved dependencies delay feature completion.


2.2 Ownership & Accountability

  • Take full ownership of documentation, QA testing, and production deployment.
  • Ensure all team members attend backlog grooming and sprint planning.
  • If a task is at risk of spilling over, raise it in advanceβ€”don’t wait until the sprint ends.

πŸ”΄ If Ignored, Future Risks:
🚨 Lack of ownership causes finger-pointing during failures.
🚨 Missed deadlines impact overall project success.


2.3 Documentation & Knowledge Sharing

πŸ“– Why Does Documentation Matter?
Lack of documentation causes confusion, delays onboarding, and increases dependency on individuals.

βœ” Best Practices:

  • Maintain a centralized wiki (Confluence, Notion, Google Docs).
  • Include requirements, design approaches, troubleshooting steps, and QA test notes.

πŸ”΄ If Ignored, Future Risks:
🚨 New developers struggle with onboarding.
🚨 Teams become over-dependent on specific individuals.
🚨 QA lacks clarity, leading to longer testing cycles.


3. Maintaining Clear Communication Across Teams

βœ… Join sprint grooming calls β†’ Clarify doubts early to avoid last-minute blockers.
βœ… Proactively update the team β†’ If a feature is at risk of delay, inform early.
βœ… Commit to a production release date β†’ Ensure alignment with stakeholders.

πŸ”΄ If Ignored, Future Risks:
🚨 Delays in product releases.
🚨 Unclear goals cause misalignment with business objectives.


Final Thoughts

By following these future-proof frameworks and best practices, teams can improve collaboration, reduce errors, and enhance overall productivity.

πŸ”₯ Key Takeaways:
βœ… Proactive communication prevents last-minute blockers.
βœ… Strong ownership ensures quality & accountability.
βœ… Continuous learning improves efficiency.
βœ… Automation-first mindset reduces manual errors.

πŸš€ If these best practices aren’t followed, the risks include:

  • Delayed releases & missed deadlines
  • Production outages & increased bug rates
  • Confused QA processes & inefficient onboarding
  • Poor team alignment, leading to project failures

πŸ‘‰ Let’s implement these strategies to create a scalable, efficient, and high-performing development team! πŸš€

Mastering Multi-Realm Authentication in NestJS with Keycloak

Introduction

In modern applications, especially those built for multi-tenancy, managing authentication across multiple Keycloak realms is a common requirement. Each realm can represent a different tenant, organization, or security boundary.

This guide will teach you how to configure NestJS with Keycloak to support multiple realms, implement guard enforcement policies, and perform token introspection dynamically based on different client_ids and roles.

By the end of this guide, you will:

  • βœ… Understand how Keycloak realms work
  • βœ… Learn how to integrate multiple realms dynamically
  • βœ… Implement role-based access control
  • βœ… Secure routes using Keycloak Guards
  • βœ… Perform token introspection with multiple client_ids and roles

πŸ”Ή What is a Keycloak Realm?

A realm in Keycloak is an isolated authentication domain. Each realm has its own users, roles, groups, and clients. This is useful for multi-tenancy, where different clients or organizations should have separate authentication policies.

Example Use Case:

  • CompanyA uses realm-a with client-1.
  • CompanyB uses realm-b with client-2.

In this scenario, authentication should be dynamically determined based on the request context.


Step 1: Install Dependencies

First, install the necessary packages to integrate Keycloak with NestJS.

npm install nest-keycloak-connect

Step 2: Create a Multi-Realm Configuration Service

Since nest-keycloak-connect expects a single realm configuration, we need a service that resolves realm and client dynamically.

MultiTenantKeycloakConfigService

Create a service that provides dynamic configuration based on the request.

import { Inject, Injectable, Scope } from '@nestjs/common';
import { REQUEST } from '@nestjs/core';
import {
  KeycloakConnectOptions,
  KeycloakConnectOptionsFactory,
} from 'nest-keycloak-connect';

@Injectable({ scope: Scope.REQUEST }) // Per-request scope, so each request can resolve its own tenant
export class MultiTenantKeycloakConfigService implements KeycloakConnectOptionsFactory {
  private readonly realmConfigs = {
    tenant1: {
      realm: 'realm-1',
      clientId: 'client-1',
      secret: 'secret-1',
    },
    tenant2: {
      realm: 'realm-2',
      clientId: 'client-2',
      secret: 'secret-2',
    },
  };

  // The current request is injected via the REQUEST token; parameter decorators
  // such as @Request() only work in controllers, not in provider methods.
  constructor(@Inject(REQUEST) private readonly req: any) {}

  createKeycloakConnectOptions(): KeycloakConnectOptions {
    const realmKey = this.getTenantFromRequest(this.req);
    const config = this.realmConfigs[realmKey];

    if (!config) {
      throw new Error(`No configuration found for tenant: ${realmKey}`);
    }

    return {
      authServerUrl: 'http://your-keycloak-server/auth',
      realm: config.realm,
      clientId: config.clientId,
      secret: config.secret,
      cookieKey: 'KEYCLOAK_JWT',
    };
  }

  private getTenantFromRequest(req: any): string {
    return req.headers['x-tenant-id'] || 'default';
  }
}

Step 3: Register Multi-Realm Configuration in app.module.ts

Modify AppModule to use the dynamic Keycloak configuration.

import { Module } from '@nestjs/common';
import { KeycloakConnectModule } from 'nest-keycloak-connect';
import { MultiTenantKeycloakConfigService } from './multi-tenant-keycloak-config.service';

@Module({
  imports: [
    KeycloakConnectModule.registerAsync({
      useClass: MultiTenantKeycloakConfigService,
    }),
  ],
})
export class AppModule {}

Step 4: Implement Keycloak Guards for Role-Based Access Control (RBAC)

Guards enforce authentication and authorization policies. The guard below implements CanActivate directly and assumes an upstream Keycloak authentication guard has already attached the authenticated user (including roles) to the request.

import { Injectable, CanActivate, ExecutionContext } from '@nestjs/common';
import { Reflector } from '@nestjs/core';

@Injectable()
export class RoleGuard implements CanActivate {
  constructor(private readonly reflector: Reflector) {}

  canActivate(context: ExecutionContext): boolean {
    // Roles required by the @Roles() decorator on the route handler
    const requiredRoles = this.reflector.get<string[]>('roles', context.getHandler());
    if (!requiredRoles) return true;

    const request = context.switchToHttp().getRequest();
    const userRoles: string[] = request.user?.roles ?? [];
    return requiredRoles.some((role) => userRoles.includes(role));
  }
}

Usage in Controllers:

import { Controller, Get, UseGuards } from '@nestjs/common';
import { Roles } from 'nest-keycloak-connect';
import { RoleGuard } from './role.guard';

@Controller('secure')
export class SecureController {
  @Get()
  @Roles('admin') // Restrict access to users with 'admin' role
  @UseGuards(RoleGuard)
  getSecureData() {
    return { message: 'You have accessed a protected route' };
  }
}

Step 5: Implement Token Introspection with Different Clients & Roles

Token introspection allows validating and extracting claims from tokens issued by different clients in multiple realms.

import { Injectable, ExecutionContext, CanActivate } from '@nestjs/common';
import axios from 'axios';

@Injectable()
export class TokenIntrospectionGuard implements CanActivate {
  async canActivate(context: ExecutionContext): Promise<boolean> {
    const request = context.switchToHttp().getRequest();
    const token = request.headers.authorization?.split(' ')[1];
    if (!token) return false;

    const clientConfig = this.getClientConfig(request.headers['x-tenant-id']);
    const introspectionUrl = `${clientConfig.authServerUrl}/realms/${clientConfig.realm}/protocol/openid-connect/token/introspect`;

    // Keycloak's introspection endpoint expects application/x-www-form-urlencoded,
    // not JSON, so the body is sent as URLSearchParams
    const response = await axios.post(
      introspectionUrl,
      new URLSearchParams({
        token,
        client_id: clientConfig.clientId,
        client_secret: clientConfig.secret,
      }),
    );

    return response.data.active;
  }

  // Placeholder: a real implementation would look the tenant up in configuration
  private getClientConfig(tenant: string) {
    return { realm: 'realm-1', clientId: 'client-1', secret: 'secret-1', authServerUrl: 'http://your-keycloak-server/auth' };
  }
}

Conclusion

βœ… Dynamic multi-realm authentication is implemented
βœ… Guards enforce role-based access
βœ… Token introspection supports multiple clients

This approach enables a flexible, scalable, and secure multi-tenant authentication system. πŸš€

February 24, 2025

Understanding Multiple Login Events in Keycloak During Google SSO

Introduction

When implementing Single Sign-On (SSO) with Google as an Identity Provider (IdP) in Keycloak, you may notice that a single user login generates multiple login eventsβ€”one for each client application. This can cause confusion, especially when debugging authentication flows or tracking user sessions.

In this blog post, we will explore why this happens, how to replicate it locally, and how to fix or control this behavior using Keycloak settings.

Why Do Multiple Login Events Occur in Keycloak During Google SSO?

This issue arises from the OAuth 2.0 Authorization Code Flow with OpenID Connect (OIDC) that Keycloak uses when brokering logins through Google. When a user logs into one client application and then accesses another client application under the same Keycloak realm, Keycloak registers a separate login event for each client.

Key Factors Contributing to Multiple Login Events

  1. Google Uses a Shared Authentication Session:

    • When a user logs into Client1 via Google, Google creates a session.
    • When the user accesses Client2, Google detects the active session and automatically authenticates the user, without prompting for credentials.
    • Even though the user did not re-enter credentials, Keycloak registers a new login event.
  2. Keycloak Treats Each Client Separately:

    • Keycloak considers each client as an independent entity.
    • When a user accesses multiple clients, even within the same session, Keycloak triggers separate login events for tracking purposes.
  3. Google Login Response Includes New ID Tokens Per Client:

    • When Keycloak redirects the user to Google for login, Google issues a new ID Token for each client.
    • Even though the user remains authenticated, the new token issuance results in a new event.
  4. Keycloak Logs Each Authentication Request as a New Event:

    • Keycloak logs a separate authentication event each time an authentication request is made to the IdP (Google), even if credentials are not re-entered.

How to Configure Session Handling in Keycloak

Keycloak allows you to control whether users share the same session across multiple clients or have separate sessions per client. This can help manage login events more effectively.

Using the Same Session Across Clients

To enforce session sharing across multiple clients:

  1. Navigate to Realm Settings β†’ Tokens

  2. Adjust the SSO Session Idle Timeout and SSO Session Max Lifespan to maintain session continuity.

  3. Ensure that Full Scope Allowed is enabled for clients to share user sessions.

  4. Use the SameSite=None cookie setting to ensure session continuity in cross-client authentication.


Controlling the Number of Sessions in Keycloak

To limit how many active sessions a user can have:

  1. Navigate to Realm Settings β†’ Sessions.

  2. Configure Maximum Sessions per User to define the maximum concurrent sessions allowed.

  3. Enable Single Sign-Out so that logging out from one session logs the user out from all active sessions.

  4. Adjust Session Limits per client under Clients β†’ client1 β†’ Settings.

These settings help you control how sessions are handled across different clients and prevent excessive login events.

Using Separate Sessions Per Client

If you need independent sessions for each client:

  1. Disable Full Scope Allowed for the client under Clients β†’ client1 or client2.

  2. Set Client Session Idle Timeout and Client Session Max Lifespan separately for each client under Advanced Settings.

  3. Use the prompt=login parameter to force Google authentication for each client.

How to Replicate Multiple Login Events Locally

Step 1: Set Up Keycloak Locally

  1. Start Keycloak in development mode:
    ./kc.sh start-dev
    
  2. Create a new realm (e.g., my-realm).
  3. Create two clients (client1 and client2) in the same realm.
  4. Configure both clients to use Google as an Identity Provider:
    • Go to Identity Providers β†’ Add provider β†’ Google.
    • Configure the Client ID and Client Secret from your Google Cloud Console.
    • Set Redirect URIs for both clients (http://localhost:8081/* and http://localhost:8082/*).
    • Enable Standard Flow (enable Implicit Flow only if a legacy client requires it; it is deprecated in OAuth 2.0).

Step 2: Run Two Applications Using Keycloak Authentication

Run two separate web applications:

  • Client1 β†’ http://localhost:8081
  • Client2 β†’ http://localhost:8082

Ensure both applications:

  • Use the same Keycloak realm.
  • Have client_id configured accordingly.
  • Redirect to Google for authentication.

Step 3: Simulate an SSO Login Flow

  1. Open an Incognito browser window.
  2. Navigate to http://localhost:8081 and initiate login.
  3. Authenticate using Google.
  4. Open another tab and go to http://localhost:8082.
  5. Notice that the second application logs in without prompting for credentials.
  6. Check Keycloak logs under Events β†’ Login Events, and you will see two separate login entries.

How to Fix or Control This Behavior

1. Enable Single Session Across Clients

By default, Keycloak logs each client authentication as a separate event. However, we can enforce session sharing across clients:

  • Go to Realm Settings β†’ Tokens
  • Increase the SSO Session Max Lifespan to keep the session active longer.
  • Increase SSO Session Idle Timeout to avoid unnecessary re-authentication.

2. Adjust Client Session Settings

Each client may have different session handling settings. To standardize session behavior across clients:

  • Navigate to Clients β†’ client1 or client2.
  • Under Advanced Settings, adjust the SSO Session Idle Timeout.
  • Ensure Full Scope Allowed is enabled to share user sessions across multiple clients.

3. Enforce User Session Limits

To control how many login events are generated:

  • Go to Realm Settings β†’ Sessions.
  • Set Maximum Sessions per User to 1 to ensure a single session is active at a time.
  • Enable Single Sign-Out to ensure logging out from one client logs the user out from all clients.

4. Modify Google Authentication Prompt Behavior

If you want Google to prompt users for login every time (rather than using an existing session), modify the authentication request:

  • In the Keycloak Identity Provider settings for Google, set the prompt parameter:
    prompt=select_account
    
  • This forces Google to show the account selection screen every time a login is requested.

5. Control Login Event Logging in Keycloak

If you want to reduce unnecessary login event logging, you can modify Keycloak’s logging levels:

./kc.sh start-dev --log-level=INFO

Or, on the legacy WildFly-based distribution, update the logging configuration in standalone.xml (the Quarkus-based distribution started with kc.sh is configured via CLI options or conf/keycloak.conf instead):

<logger category="org.keycloak.events">
    <level name="WARN"/>
</logger>

How Custom SSO Calls Trigger Login Events in Keycloak

Custom authentication flows in Keycloak can also trigger login events. Here’s how:

  1. Using Keycloak REST API for Login:

    • When calling the POST /realms/{realm}/protocol/openid-connect/token endpoint with valid credentials, Keycloak registers a login event.
  2. Programmatic Login via JavaScript Adapter:

    keycloak.login({ redirectUri: 'http://localhost:8081/home' });
    
    • This triggers a login event for the client initiating the request.
  3. Custom Authentication Flows:

    • Implementing a custom authenticator in Keycloak SPI that handles login logic can generate login events.
  4. Silent Authentication Requests:

    • When the frontend attempts a silent login via check-sso, Keycloak may log an event if a session refresh is required.

Conclusion

Experiencing multiple login events when using Google SSO with Keycloak is a common scenario due to how OAuth 2.0 and OIDC work. The main reasons include Google’s session management, Keycloak’s client separation, and Google issuing new ID tokens per client.

By following the recommended fixes and understanding how custom SSO calls interact with Keycloak, you can control and optimize login event handling effectively.

February 23, 2025

Week 3: Unsupervised Learning - Clustering - K-means/Kernel K-means


Introduction

Clustering is a fundamental technique in unsupervised learning, which aims to group similar data points together. In this blog, we will cover K-means and Kernel K-means clustering in depth, including their mathematical foundations, examples, and real-world applications.


What is Clustering?

Clustering is the task of dividing a dataset into groups (clusters) where objects in the same group are more similar to each other than to those in other groups. It is widely used in customer segmentation, image compression, anomaly detection, and bioinformatics.

Types of Clustering Algorithms:

  1. Partition-based Clustering (e.g., K-means, Kernel K-means)
  2. Hierarchical Clustering (Agglomerative, Divisive)
  3. Density-based Clustering (DBSCAN, Mean Shift)
  4. Model-based Clustering (Gaussian Mixture Models)

K-Means Clustering

Algorithm Steps:

  1. Select the number of clusters, K.
  2. Initialize K cluster centroids randomly.
  3. Assign each data point to the nearest centroid.
  4. Recalculate the centroids by taking the mean of all points assigned to that cluster.
  5. Repeat steps 3 and 4 until convergence (centroids no longer change significantly).

Example 1: Clustering Customers Based on Spending Behavior

Given Dataset:

| Customer ID | Annual Income (k$) | Spending Score (1-100) |
|-------------|--------------------|------------------------|
| 1           | 15                 | 39                     |
| 2           | 16                 | 81                     |
| 3           | 17                 | 6                      |
| 4           | 18                 | 77                     |
| 5           | 19                 | 40                     |

Step 1: Choose K = 2 and initialize centroids randomly.

Step 2: Compute distances and assign points to the closest centroid.

Step 3: Recalculate centroids.

Step 4: Repeat until convergence.

Conclusion: K-means successfully clusters customers into different spending groups, allowing businesses to tailor marketing strategies accordingly.
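
To make the steps concrete, here is a compact, illustrative K-means implementation in Java (the language used earlier in this blog) over the same five customers; random initialization and squared Euclidean distance, not production code:

import java.util.Arrays;
import java.util.Random;

/** Minimal K-means for small numeric datasets; illustrative only. */
public class KMeansDemo {

    static int[] cluster(double[][] points, int k, int maxIter, long seed) {
        Random rnd = new Random(seed);
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            // Naive init: random data points (K-means++ would be a better choice)
            centroids[c] = points[rnd.nextInt(points.length)].clone();
        }
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: each point joins its nearest centroid
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (sqDist(points[p], centroids[c]) < sqDist(points[p], centroids[best])) best = c;
                }
                if (assignment[p] != best) { assignment[p] = best; changed = true; }
            }
            if (!changed && iter > 0) break; // convergence: no reassignments
            // Update step: recompute each centroid as the mean of its members
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < points[p].length; d++) sums[assignment[p]][d] += points[p][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its old centroid
                for (int d = 0; d < sums[c].length; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return assignment;
    }

    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    public static void main(String[] args) {
        // (income, spending score) rows from the table above
        double[][] customers = {{15, 39}, {16, 81}, {17, 6}, {18, 77}, {19, 40}};
        System.out.println(Arrays.toString(cluster(customers, 2, 100, 42)));
    }
}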


Kernel K-Means Clustering

Kernel K-means is an extension of K-means that maps data into a higher-dimensional space using a kernel function before performing clustering.

Common Kernel Functions:

  1. Linear Kernel: $K(x_i, x_j) = x_i \cdot x_j$
  2. Polynomial Kernel: $K(x_i, x_j) = (x_i \cdot x_j + c)^d$
  3. Gaussian (RBF) Kernel: $K(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}$

Example 2: Clustering Non-Linearly Separable Data

Dataset: Imagine we have two circular clusters that are not linearly separable.

  1. Apply the Gaussian kernel to transform the data.
  2. Use K-means on the transformed space.

Conclusion: Kernel K-means enables clustering in situations where standard K-means fails.


Advantages & Disadvantages

| Method         | Advantages                         | Disadvantages                                 |
|----------------|------------------------------------|-----------------------------------------------|
| K-Means        | Fast, easy to implement, scalable  | Sensitive to outliers, requires specifying K  |
| Kernel K-Means | Works on non-linear data           | Computationally expensive                     |

Real-World Applications

  1. Marketing Segmentation – Group customers based on behavior.
  2. Image Segmentation – Divide images into meaningful regions.
  3. Anomaly Detection – Detect fraud in transactions.

20+ Questions and Answers for Understanding

| #  | Question | Answer |
|----|----------|--------|
| 1  | What is clustering in machine learning? | Clustering is an unsupervised learning technique that groups similar data points together. |
| 2  | What is the primary objective of K-means? | To minimize the variance within each cluster. |
| 3  | What is a centroid in K-means? | The center of a cluster, computed as the mean of all points in that cluster. |
| 4  | How does Kernel K-means differ from K-means? | It applies a kernel function to transform data into a higher-dimensional space before clustering. |
| 5  | What is the time complexity of K-means? | O(n Β· k Β· d Β· i), where n is the number of data points, k the number of clusters, d the number of dimensions, and i the number of iterations. |
| 6  | What happens if you choose a bad initial centroid? | The algorithm may converge to a local minimum. |
| 7  | How can you determine the best value for K? | Using the Elbow Method or Silhouette Score. |
| 8  | What metric is used to measure clustering performance? | Inertia, Davies-Bouldin Index, Silhouette Score. |
| 9  | What type of clustering is K-means? | Partition-based clustering. |
| 10 | What is an application of K-means in healthcare? | Grouping patients based on medical conditions. |
| 11 | What is an outlier in clustering? | A data point far from all cluster centroids; it can distort the centroids. |
| 12 | What kernel is commonly used in Kernel K-means? | Gaussian (RBF) kernel. |
| 13 | Can K-means work with categorical data? | No, K-means works best with numerical data. |
| 14 | What is the Silhouette Score? | A metric that evaluates how well clusters are separated. |
| 15 | Why does K-means require normalization? | To prevent features with large ranges from dominating the distance calculation. |
| 16 | How do you deal with outliers in K-means? | Use K-medoids or remove extreme values. |
| 17 | What does the term "inertia" mean in K-means? | The sum of squared distances from each point to its assigned centroid. |
| 18 | How do you speed up K-means? | Use K-means++ initialization or Mini-batch K-means. |
| 19 | What is a drawback of Kernel K-means? | Higher computational cost. |
| 20 | When should you use Kernel K-means over K-means? | When the data is not linearly separable. |

Conclusion

K-means and Kernel K-means are powerful clustering techniques that help analyze and segment data efficiently. While K-means is simple and scalable, Kernel K-means is better suited for complex datasets. Understanding these methods, along with their mathematical foundations and real-world applications, will prepare you well for exams and practical implementations.


Week 3: Unsupervised Learning - Clustering - K-means/Kernel K-means

Introduction to Clustering

Clustering is an unsupervised learning technique used to group data points into clusters based on their similarities. Unlike supervised learning, clustering does not rely on labeled data. Instead, it finds inherent patterns and structures within the data.

K-Means Clustering

K-Means is one of the most widely used clustering algorithms. It partitions data into K clusters, where each cluster is represented by its centroid.

Steps of K-Means Algorithm

  1. Choose the number of clusters, K.
  2. Randomly initialize K centroids.
  3. Assign each data point to the nearest centroid.
  4. Update centroids as the mean of the assigned points.
  5. Repeat until centroids do not change significantly (convergence).

Mathematical Formulation

The objective function (loss function) for K-Means is the sum of squared distances between data points and their respective cluster centroids:

J = \sum_{i=1}^{K} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2

Where:

  • K is the number of clusters.
  • C_i is the set of points belonging to cluster i.
  • \mu_i is the centroid of cluster i.
  • \| x_j - \mu_i \|^2 is the squared Euclidean distance.

Choosing the Value of K

Several methods can help determine an optimal value for K:

  1. Elbow Method: Plot the Within-Cluster Sum of Squares (WCSS) for different values of K and look for an 'elbow' point where adding more clusters provides diminishing returns.
  2. Silhouette Score: Measures how similar a data point is to its assigned cluster versus other clusters.
  3. Gap Statistic: Compares WCSS with expected WCSS under random distribution.

Example of K-Means Clustering

Step 1: Given Data Points

Assume we have the following data points: (2,3), (3,4), (8,7), (9,6), (10,8)

Step 2: Choose K=2 and Initialize Centroids

Let's assume initial centroids are (2,3) and (8,7).

Step 3: Compute Distance and Assign Clusters

Using Euclidean distance, we compute the distance of each point to the centroids and assign clusters.

Step 4: Compute New Centroids

After assigning clusters, update centroids and repeat the process until convergence.
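
Concretely, the first iteration with centroids (2,3) and (8,7) works out as follows:

% Distance of (3,4) to each initial centroid (Euclidean):
d\big((3,4),(2,3)\big) = \sqrt{1^2 + 1^2} = \sqrt{2} \approx 1.41, \qquad
d\big((3,4),(8,7)\big) = \sqrt{5^2 + 3^2} = \sqrt{34} \approx 5.83

% So (3,4) joins the first cluster; likewise (9,6) and (10,8) join the second:
C_1 = \{(2,3),\,(3,4)\}, \qquad C_2 = \{(8,7),\,(9,6),\,(10,8)\}

% Updated centroids are the cluster means:
\mu_1 = (2.5,\ 3.5), \qquad \mu_2 = (9,\ 7)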

Kernel K-Means Clustering

Kernel K-Means extends K-Means to non-linearly separable data using a kernel function. Instead of computing distances in the original feature space, it transforms data into a higher-dimensional space where clusters become linearly separable.

Kernel Trick

A kernel function $K(x_i, x_j)$ is used to compute distances in the transformed space, avoiding the explicit transformation. Common kernels:

  1. Linear Kernel: $K(x, y) = x^\top y$
  2. Polynomial Kernel: $K(x, y) = (x^\top y + c)^d$
  3. Gaussian (RBF) Kernel: $K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}$

Mathematical Formulation

Instead of centroid-based calculations, Kernel K-Means minimizes the same sum of squared distances, but evaluated in the feature space defined by the implicit mapping \phi. Expanding \| \phi(x_j) - \mu_i \|^2 in terms of the kernel gives:

J = \sum_{i=1}^{K} \sum_{x_j \in C_i} \left[ K(x_j, x_j) - \frac{2}{|C_i|} \sum_{x_k \in C_i} K(x_j, x_k) + \frac{1}{|C_i|^2} \sum_{x_k \in C_i} \sum_{x_l \in C_i} K(x_k, x_l) \right]

Advantages of Kernel K-Means

  • Handles non-linearly separable clusters
  • Works well with high-dimensional data
  • Utilizes different kernels to adapt clustering to various data distributions

20 Knowledge Testing Questions & Answers

| #  | Question | Answer & Explanation |
|----|----------|----------------------|
| 1  | What is the main difference between K-Means and Kernel K-Means? | K-Means uses Euclidean distance, while Kernel K-Means uses a kernel function to transform data into higher dimensions. |
| 2  | How does K-Means decide the initial cluster centers? | Randomly selects K points from the dataset as initial centroids. |
| 3  | What is the stopping criterion in K-Means? | The algorithm stops when centroids do not change significantly between iterations. |
| 4  | What type of clustering does K-Means perform? | Hard clustering, meaning each point belongs to exactly one cluster. |
| 5  | What is the computational complexity of K-Means? | O(n * K * d * i), where n = data points, K = clusters, d = dimensions, i = iterations. |
| 6  | What is the purpose of the Elbow Method? | It helps determine the optimal number of clusters by plotting WCSS vs. K. |
| 7  | How do we compute the new centroids in K-Means? | By taking the mean of all points assigned to each cluster. |
| 8  | How is the Silhouette Score interpreted? | A higher silhouette score (close to 1) indicates better clustering. |
| 9  | Which kernel is commonly used in Kernel K-Means? | Gaussian (RBF) kernel is commonly used. |
| 10 | How does Kernel K-Means differ in centroid computation? | Instead of averaging points, Kernel K-Means uses a similarity matrix. |
| 11 | What distance metric does K-Means use? | Euclidean distance. |
| 12 | What is a disadvantage of K-Means? | Sensitive to outliers and requires predefining the number of clusters. |
| 13 | What is the formula for WCSS? | Sum of squared distances of points from their centroids. |
| 14 | How do we choose the best kernel in Kernel K-Means? | Experimentation or cross-validation with different kernels. |
| 15 | Why does Kernel K-Means require a similarity matrix? | Because it operates in an implicit high-dimensional space. |
| 16 | Why does K-Means sometimes fail to converge optimally? | Poor centroid initialization can lead to local minima. |
| 17 | What technique improves K-Means initialization? | K-Means++ selects initial centroids more strategically. |
| 18 | What is the role of the kernel trick in Kernel K-Means? | It enables clustering in higher dimensions without explicitly computing transformations. |
| 19 | How does Kernel K-Means handle non-linearly separable data? | It transforms data into a space where clusters become linearly separable. |
| 20 | Can Kernel K-Means be applied to image segmentation? | Yes, Kernel K-Means is often used for image segmentation in computer vision. |

Conclusion

K-Means and Kernel K-Means are powerful clustering techniques. While K-Means is efficient for well-separated clusters, Kernel K-Means extends its capabilities to complex structures using kernel functions. Understanding these algorithms helps in practical applications like image segmentation, anomaly detection, and recommendation systems.

By mastering these concepts, you will be well-prepared for exams and real-world clustering problems!

4 Weeks of MLT

Week 1: Introduction to Unsupervised Learning

Unsupervised learning is a type of machine learning where the algorithm identifies patterns in data without labeled responses. It is widely used in clustering, dimensionality reduction, and anomaly detection.

Why Are We Learning This?

  • Helps in identifying hidden patterns in data.
  • Essential for big data analysis and AI applications.
  • Used in recommendation systems, fraud detection, and image compression.
  • Prepares data for supervised learning by reducing noise and dimensionality.

| Feature   | Supervised Learning | Unsupervised Learning  |
|-----------|---------------------|------------------------|
| Data Type | Labeled data        | Unlabeled data         |
| Goal      | Predict outcomes    | Identify patterns      |
| Example   | Spam detection      | Customer segmentation  |

Representation Learning

Representation learning helps uncover the underlying structure of data, making it useful for further processing.

Principal Component Analysis (PCA)

PCA is a statistical method used for dimensionality reduction, helping to transform high-dimensional data into a smaller set while retaining essential information.

Why Is PCA Useful?

  • Reduces computational cost by removing redundant features.
  • Helps visualize high-dimensional data.
  • Used in image processing, genetics, and finance.

Steps of PCA:

  1. Compute the mean and standardize the dataset.
  2. Calculate the covariance matrix.
  3. Compute eigenvalues and eigenvectors.
  4. Choose the top k eigenvectors (principal components).
  5. Transform the original dataset.
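
In matrix form these steps compress to two formulas (with X the standardized n Γ— d data matrix and W_k the d Γ— k matrix whose columns are the top k eigenvectors):

\Sigma = \frac{1}{n-1} X^\top X, \qquad \Sigma\, w_i = \lambda_i w_i, \qquad Z = X W_k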

Example:

If we have a dataset with 10 features and we reduce it to 2 principal components, we can visualize data in a 2D space while retaining most of the information.

Additional Example:

PCA is commonly used in facial recognition systems to reduce image dimensions while keeping key facial features intact.

Hints for Exams:

  • Remember PCA reduces dimensions but retains maximum variance.
  • Eigenvalues indicate importance of principal components.
  • Covariance matrix shows feature relationships.

Questions and Answers (Week 1)

| Question | Answer |
|----------|--------|
| What is unsupervised learning? | It is a type of ML that finds patterns without labeled data. |
| What is PCA used for? | Dimensionality reduction. |
| How does PCA transform data? | By finding new axes that maximize variance. |
| What is an eigenvector? | A direction along which data varies the most. |
| What is an eigenvalue? | A measure of variance along an eigenvector. |
| Why do we standardize data in PCA? | To ensure features contribute equally. |
| What does the covariance matrix do? | Captures feature relationships. |
| What is the main limitation of PCA? | It assumes linear relationships. |
| Can PCA be used for feature selection? | Yes, it selects important features. |
| How do we determine the number of principal components? | Using a scree plot or explained variance ratio. |
| What happens if too many components are removed? | Information loss occurs. |
| What is a scree plot? | A graph that shows eigenvalues. |
| What type of data is best suited for PCA? | Linearly correlated data. |
| Why is PCA useful in image processing? | It reduces the number of pixels while keeping important information. |
| What is dimensionality reduction? | Reducing the number of features in a dataset. |
| What are some alternatives to PCA? | t-SNE, LDA, Kernel PCA. |
| What is the curse of dimensionality? | As dimensions increase, data becomes sparse and harder to analyze. |
| What industries use PCA? | Finance, healthcare, image recognition, etc. |
| What is a principal component? | A new feature that captures maximum variance. |
| Why is high variance important in PCA? | It helps retain the most important information. |
| Can PCA work on categorical data? | No, it is designed for numerical data. |
| What are some practical applications of PCA? | Face recognition, recommendation systems, etc. |
| How do you interpret PCA results? | By analyzing principal components and their variance. |
| What is a real-world example of PCA? | Reducing features in stock market prediction. |

Week 2: Kernel PCA

Why Is Kernel PCA Important?

  • PCA is limited to linear transformations.
  • Kernel PCA allows for non-linear feature extraction.
  • Useful for complex image recognition, bioinformatics, and speech recognition.

Kernel Trick in PCA

Kernel PCA extends PCA using a kernel function to map data into higher-dimensional space before applying PCA.

Example:

If data is shaped like concentric circles, linear PCA would fail. Kernel PCA with an RBF kernel can separate the inner and outer circles.
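
For reference, Kernel PCA diagonalizes a centered Gram (kernel) matrix rather than a covariance matrix; with $\mathbf{1}_n$ the $n \times n$ matrix whose entries are all $1/n$, the centering step is:

\tilde{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n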

Hints for Exams:

  • Kernel PCA handles non-linear data better than PCA.
  • Common kernels: RBF, Polynomial, Sigmoid.
  • Kernel matrix (Gram matrix) stores pairwise similarities.

Questions and Answers (Week 2)

| Question | Answer |
|----------|--------|
| What does Kernel PCA do? | Extends PCA to non-linear transformations. |
| What is the kernel trick? | Mapping data to higher dimensions. |
| What is a common kernel used in Kernel PCA? | RBF Kernel. |
| What are some applications of Kernel PCA? | Image recognition, anomaly detection. |
| How does Kernel PCA compare to PCA? | It works for non-linear data. |
| What is a Gram matrix? | A kernel matrix storing similarity scores. |
| Why is Kernel PCA computationally expensive? | Because it requires computing the kernel matrix. |
| What is a practical use case of Kernel PCA? | Handwriting recognition. |
| How do you choose a kernel function? | Experiment with different kernels and evaluate performance. |
| What happens if the wrong kernel is used? | Poor results and inefficient learning. |
| Can Kernel PCA work on text data? | Yes, with appropriate feature extraction. |
| What is the main limitation of Kernel PCA? | High memory usage for large datasets. |
| Why is Kernel PCA useful in bioinformatics? | It helps in gene classification. |
| How does Kernel PCA relate to SVMs? | Both use the kernel trick. |
| What type of data does Kernel PCA handle well? | Non-linearly separable data. |
| What does the RBF kernel do? | Maps data into infinite-dimensional space. |
| Is Kernel PCA useful for dimensionality reduction? | Yes, for complex data structures. |
| How does Kernel PCA improve feature extraction? | By capturing non-linear patterns. |
| What are the challenges of implementing Kernel PCA? | High computational cost and kernel selection. |
| What is the primary advantage of Kernel PCA? | It captures complex patterns in data. |

Week 3: Unsupervised Learning - Clustering

Why Is Clustering Important?

  • Helps in grouping similar data points automatically.
  • Used in customer segmentation, anomaly detection, and image segmentation.
  • Helps in data pre-processing before supervised learning.

K-Means Clustering

A simple and widely used clustering algorithm that partitions data into K clusters.

Steps:

  1. Choose K cluster centers (randomly or using initialization techniques like K-means++).
  2. Assign each point to the nearest cluster center.
  3. Recalculate cluster centers by taking the mean of assigned points.
  4. Repeat until convergence.

Example:

Imagine grouping customers based on their spending habits. K-means can classify them into low, medium, and high spenders.

Kernel K-Means

An extension of K-Means that uses the kernel trick to map data into a higher-dimensional space, allowing for non-linear cluster boundaries.

When to Use Kernel K-Means?

  • When clusters are not linearly separable.
  • When data has a complex shape (e.g., concentric circles).

Example:

Grouping images based on texture patterns where simple K-Means fails.

Hints for Exams:

  • K-Means minimizes intra-cluster variance.
  • Random initialization can lead to different results.
  • Elbow method is used to find the optimal K.

Questions and Answers (Week 3)

| Question | Answer |
|----------|--------|
| What is clustering? | A technique to group similar data points together. |
| How does K-Means work? | It partitions data into K clusters by minimizing variance. |
| What is the primary limitation of K-Means? | It assumes spherical clusters and requires specifying K in advance. |
| What is the elbow method? | A technique to find the optimal number of clusters. |
| How does Kernel K-Means differ from K-Means? | It applies the kernel trick to handle non-linearly separable data. |
| What happens if K is too high in K-Means? | Overfitting and unnecessary complexity. |
| What distance metric does K-Means use? | Euclidean distance. |
| Can K-Means handle categorical data? | No, K-Means works best with numerical data. |
| What is a centroid in K-Means? | The mean of points assigned to a cluster. |
| Why is K-Means sensitive to initialization? | Poor initialization can lead to bad clustering. |
| What is K-Means++? | An improved initialization technique for K-Means. |
| What is a real-world application of clustering? | Customer segmentation in marketing. |
| Why do we standardize data before K-Means? | To ensure equal contribution from all features. |
| What is the difference between hard and soft clustering? | Hard assigns each point to one cluster; soft assigns probabilities. |
| Can K-Means be used for anomaly detection? | Yes, anomalies fall far from cluster centers. |
| What are some alternatives to K-Means? | DBSCAN, Hierarchical Clustering, Gaussian Mixture Models. |
| Why do we need multiple runs of K-Means? | To reduce sensitivity to initialization. |
| How does Kernel K-Means improve K-Means? | It enables clustering of non-linearly separable data. |
| What industries use clustering? | Healthcare, finance, retail, etc. |
| What are hierarchical clustering methods? | Techniques that build a tree of clusters. |
| What are the two types of hierarchical clustering? | Agglomerative (bottom-up) and Divisive (top-down). |
| How do we visualize K-Means clustering? | Using scatter plots or PCA projections. |
| What are some challenges in clustering? | Choosing the right number of clusters and handling noise. |
| What is a practical use case of Kernel K-Means? | Image segmentation in medical imaging. |

Week 4: Unsupervised Learning - Estimation

Recap of Maximum Likelihood Estimation (MLE) & Bayesian Estimation

  • MLE: finds the parameters that maximize the likelihood of the observed data.
  • Bayesian Estimation: incorporates prior knowledge and updates beliefs as data is observed.
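
In symbols, with data $X = \{x_1, \dots, x_n\}$:

\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta),
\qquad
p(\theta \mid X) \propto p(X \mid \theta)\, p(\theta) \quad \text{(Bayesian update)}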

Gaussian Mixture Model (GMM)

A probabilistic clustering model that assumes data is generated from multiple Gaussian distributions.
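
Formally, a GMM models the density as a weighted sum of K Gaussians, with mixing weights that sum to one:

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1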

Why Use GMM Over K-Means?

  • Handles clusters of different shapes and sizes.
  • Soft clustering: Assigns probabilities instead of fixed labels.

Expectation-Maximization (EM) Algorithm

An iterative method to find maximum likelihood estimates when data has latent (hidden) variables.

Steps:

  1. Expectation (E-step): Compute probabilities of belonging to each cluster.
  2. Maximization (M-step): Update model parameters based on probabilities.
  3. Repeat until convergence.
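
For a GMM, the two steps have closed forms (standard results, stated here for reference): the E-step computes each point's responsibility, and the M-step re-estimates parameters from those responsibilities:

\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
\qquad \text{(E-step)}

\mu_k = \frac{\sum_i \gamma_{ik} \, x_i}{\sum_i \gamma_{ik}}, \qquad
\pi_k = \frac{1}{n} \sum_i \gamma_{ik}
\qquad \text{(M-step)}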

Example:

GMM can be used in speech recognition to model different phonemes as Gaussian distributions.

Hints for Exams:

  • MLE finds parameters that best fit the data.
  • GMM is probabilistic; K-Means is deterministic.
  • EM is used to optimize GMM.

Questions and Answers (Week 4)

| Question | Answer |
|----------|--------|
| What is MLE? | A method to estimate parameters that maximize likelihood. |
| How does Bayesian estimation differ from MLE? | It incorporates prior beliefs. |
| What is a Gaussian Mixture Model? | A model that represents data as multiple Gaussian distributions. |
| Why use GMM over K-Means? | GMM handles soft clustering and different cluster shapes. |
| What does the EM algorithm do? | It finds parameter estimates for models with hidden variables. |
| What is an example of GMM in real life? | Speaker identification in voice assistants. |
| Why is GMM called a mixture model? | Because it represents data as a mixture of multiple Gaussians. |
| What does the E-step in EM do? | It estimates probabilities of data points belonging to each cluster. |
| What does the M-step in EM do? | It updates model parameters based on probabilities. |
| Can EM get stuck in local optima? | Yes, multiple runs may be needed. |
| What is a latent variable? | A hidden variable not directly observed. |
| Why is GMM probabilistic? | It assigns soft probabilities instead of hard labels. |
| What does convergence mean in EM? | When parameter updates become negligible. |
| How do we determine the number of Gaussians in GMM? | Using methods like the Akaike Information Criterion (AIC). |
| What industries use GMM? | Speech recognition, finance, and bioinformatics. |
| What is the main limitation of GMM? | It assumes data follows a Gaussian distribution. |
| How does EM relate to GMM? | EM is used to estimate GMM parameters. |
| What are some alternatives to GMM? | DBSCAN, K-Means, Spectral Clustering. |
| How is GMM different from Hierarchical Clustering? | GMM is probabilistic; hierarchical is tree-based. |
| What is a common initialization method for GMM? | Using K-Means to initialize cluster centers. |
| What happens if the number of Gaussians is too high? | Overfitting and unnecessary complexity. |
| How do we visualize GMM results? | Using contour plots or PCA projections. |
| What is a practical use case of EM? | Image segmentation in computer vision. |
| Why is EM important in missing data problems? | It estimates missing values iteratively. |
