July 31, 2025

📬 Amazon SQS Retry Mechanisms: When to Use DelaySeconds vs Visibility Timeout vs DLQ

Message-driven systems need solid retry mechanisms to gracefully handle transient failures. Amazon SQS (Simple Queue Service) provides a few tools to control retries:

  • DelaySeconds

  • VisibilityTimeout

  • Dead Letter Queues (DLQs)

But when should you use each? Let’s break it down.


🔁 1. Visibility Timeout – Automatic Retry Handling

What it is:
When a consumer receives a message from SQS, that message becomes invisible for a duration defined by VisibilityTimeout.

If the consumer fails to delete the message within that time, the message becomes visible again and SQS retries it automatically.

Use Case:

  • For retrying transient failures without custom logic.

  • When using frameworks like Spring Cloud SQS or AWS SDK consumers.

Pros:

  • Simple and automatic

  • No extra code for retrying

Cons:

  • No per-retry control over timing: the retry delay is simply the remaining visibility timeout, unless you call ChangeMessageVisibility per message

  • Can cause duplicate processing if visibility timeout is too short

Recommended when you just want basic retries with no backoff logic.


⏳ 2. DelaySeconds – Scheduled Retry (Manual Logic)

What it is:
SQS allows setting DelaySeconds on a per-message basis when sending (standard queues only; FIFO queues support just a queue-level default delay). It means: don’t deliver this message until X seconds have passed.

amazonSQS.sendMessage(new SendMessageRequest()
    .withQueueUrl(queueUrl)
    .withMessageBody(messageBody)
    .withDelaySeconds(60)); // Delay for 60 seconds

Use Case:

  • Implement exponential backoff retry logic after a failure

  • Avoid hammering systems with immediate retries

  • Fine-grained control over retry schedule

Pros:

  • Controlled and customizable retry strategy

  • Can implement exponential backoff easily

Cons:

  • Requires extra logic (like tracking retry count)

  • Only works up to 900 seconds (15 minutes)

Recommended when you want advanced retry strategies like exponential backoff.
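The backoff math itself is just a doubling delay with a cap. A minimal helper (class and constant names are hypothetical, not part of any SDK) that computes the DelaySeconds value for the nth retry:

```java
// Hypothetical helper: doubling backoff from a 30s base, capped at the
// SQS per-message maximum of 900 seconds (15 minutes).
public class Backoff {
    static final int BASE_SECONDS = 30;
    static final int MAX_DELAY_SECONDS = 900; // hard SQS limit for DelaySeconds

    static int delayForRetry(int retryCount) {
        // Shift instead of Math.pow to avoid overflow for large retry counts
        long delay = (long) BASE_SECONDS << Math.min(retryCount, 10);
        return (int) Math.min(MAX_DELAY_SECONDS, delay);
    }

    public static void main(String[] args) {
        for (int i = 0; i <= 6; i++) {
            System.out.println("retry " + i + " -> " + delayForRetry(i) + "s");
        }
    }
}
```

This yields 30s, 60s, 120s, 240s, 480s, then 900s for every retry after that; anything needing a longer wait than 15 minutes has to be re-enqueued in stages or handled outside SQS.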


☠️ 3. Dead Letter Queue (DLQ) – Retry Limit Fallback

What it is:
A DLQ is a secondary queue attached to your primary queue. After a message has been received maxReceiveCount times without being successfully deleted, SQS automatically moves it to the DLQ.

Use Case:

  • Capture and isolate messages that consistently fail

  • Avoid infinite retry loops

  • Debug or manually intervene failed cases

Pros:

  • Separates bad messages for inspection

  • Easy monitoring with CloudWatch

Cons:

  • Not a retry mechanism per se — more of a final fallback

Recommended as a safety net, not a retry strategy.
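Wiring up a DLQ is a single queue attribute, the RedrivePolicy. A sketch of the JSON value (the ARN, account ID, and queue name here are placeholders, and maxReceiveCount of 5 is just an example):

```json
{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:my-queue-dlq",
  "maxReceiveCount": "5"
}
```

This is set on the source queue (via SetQueueAttributes, the console, or infrastructure-as-code), not on the DLQ itself.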


💡 When to Use What?

| Scenario | Use VisibilityTimeout | Use DelaySeconds | Use DLQ |
| --- | --- | --- | --- |
| Auto retry after failure | ✅ Yes | ❌ No | ❌ No |
| Manual retry with delay/backoff | ❌ No | ✅ Yes | ❌ No |
| Retry with increasing delay (backoff) | ❌ No | ✅ Yes | ❌ No |
| Capturing failed messages after max tries | ✅ (counts receives) | ✅ (track retries) | ✅ Yes |
| Real-time processing, quick retry | ✅ Yes | ❌ No | ❌ No |
| Complex processing with risk of overload | ❌ No | ✅ Yes | ✅ Yes |

🛠️ Best Practice Combo

For most robust production systems, use a combination:

  1. Set VisibilityTimeout = 15 min

  2. Implement DelaySeconds with exponential backoff on retries

  3. Configure a DLQ for messages exceeding retry threshold


📦 Sample Java (AWS SDK) Exponential Retry Logic

int retryCount = msg.getRetryCount(); // retry count tracked by the app, e.g. in a message attribute
int delay = (int) Math.min(900L, 30L * (1L << Math.min(retryCount, 5))); // 30s, 60s, ... capped at 15 min

SendMessageRequest request = new SendMessageRequest()
    .withQueueUrl(queueUrl)
    .withMessageBody(objectMapper.writeValueAsString(msg))
    .withDelaySeconds(delay);

amazonSQS.sendMessage(request);

✅ TL;DR

  • Use VisibilityTimeout for simple, default retry behavior

  • Use DelaySeconds for smart retry logic (like exponential backoff)

  • Always use a DLQ as a fallback to avoid infinite retries



July 30, 2025

Understanding Locking in Concurrent Systems (Optimistic, Pessimistic, Distributed)

Imagine you and your friends are editing a shared document. If two people make changes at the same time, someone’s changes might get overwritten. In software systems — especially multi-threaded or distributed applications — this same issue occurs when multiple services or users try to update the same data at once. This is called a concurrency problem.

To solve this, we use locking mechanisms.


🔐 1. Optimistic Locking — "Hope for the best, prepare for conflict"

What is it?

Optimistic locking assumes conflicts are rare, so it doesn’t lock anything initially. Instead, it checks whether the data has been modified by someone else just before saving.

How it works?

  • A special field, often called version, is added to your data (e.g., @Version in JPA).

  • When you fetch a record, say version = 3.

  • You modify it and try to save it back.

  • The system checks: is the current version in DB still 3?

    • ✅ Yes: Save and bump version to 4.

    • ❌ No: Someone else updated it. Throw an OptimisticLockException.

@Entity
public class User {
    @Id
    private Long id;

    private String name;

    @Version
    private Long version;
}

Frameworks like Hibernate or JPA handle this automatically behind the scenes using SQL like:

UPDATE user SET name='Jatin', version=4 WHERE id=1 AND version=3;

If no rows are updated, it means someone else already changed it.
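The same check-then-bump idea can be sketched without a database. This in-memory class (all names hypothetical; in a real app JPA/Hibernate does this for you via @Version) mirrors what the conditional UPDATE above enforces:

```java
// In-memory sketch of optimistic locking: save succeeds only if the
// version has not moved since the caller read it.
class VersionedRecord {
    private String name;
    private long version = 0;

    synchronized long currentVersion() { return version; }

    synchronized String name() { return name; }

    // Mirrors: UPDATE ... SET name=?, version=v+1 WHERE id=? AND version=v
    synchronized boolean save(String newName, long expectedVersion) {
        if (version != expectedVersion) {
            return false; // stale read: caller must re-fetch and retry
        }
        name = newName;
        version++;
        return true;
    }
}

public class OptimisticDemo {
    public static void main(String[] args) {
        VersionedRecord user = new VersionedRecord();
        long v = user.currentVersion();            // read at version 0
        System.out.println(user.save("Jatin", v)); // true: version becomes 1
        System.out.println(user.save("Asha", v));  // false: v is now stale
    }
}
```

The failed second save is the in-memory equivalent of an OptimisticLockException: the caller is expected to re-read and retry, not to block.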

When to use?

  • Low contention systems (e.g., user profile updates).

  • REST APIs with stateless calls.

  • Systems where retrying is acceptable.

✅ Pros:

  • No locks, better performance.

  • Scales well.

❌ Cons:

  • Write conflicts lead to retries.

  • Not ideal for high-write, high-conflict cases.


🔒 2. Pessimistic Locking — "Lock first, then act"

What is it?

Pessimistic locking assumes conflicts are likely. So, it locks the data before anyone can modify it.

How it works?

When one process reads data with a lock (SELECT ... FOR UPDATE), others are blocked from locking or updating that row (plain reads may still succeed, depending on the database) until the lock is released, usually after a commit or rollback.

@Lock(LockModeType.PESSIMISTIC_WRITE)
Optional<User> findById(Long id);

Internally, this generates SQL like:

SELECT * FROM user WHERE id = 1 FOR UPDATE;

This locks the row, so no other transaction can modify it until yours is done.
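The lock-first-then-act principle can be shown in plain Java with a ReentrantLock as an in-memory analogue (the Account class here is hypothetical; a real system would rely on the database's row lock instead):

```java
import java.util.concurrent.locks.ReentrantLock;

// In-memory analogue of pessimistic locking: the balance can only be
// touched while holding the lock, so concurrent updates serialize.
class Account {
    private final ReentrantLock lock = new ReentrantLock();
    private long balance;

    Account(long initialBalance) { this.balance = initialBalance; }

    void deposit(long amount) {
        lock.lock();            // acquire before touching data, like FOR UPDATE
        try {
            balance += amount;  // critical section: no lost updates possible
        } finally {
            lock.unlock();      // release, like COMMIT
        }
    }

    long balance() {
        lock.lock();
        try { return balance; } finally { lock.unlock(); }
    }
}

public class PessimisticDemo {
    public static void main(String[] args) throws InterruptedException {
        Account acc = new Account(0);
        Runnable worker = () -> { for (int i = 0; i < 1000; i++) acc.deposit(1); };
        Thread t1 = new Thread(worker), t2 = new Thread(worker);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(acc.balance()); // 2000: both threads' updates survive
    }
}
```

Without the lock, the two threads' read-modify-write cycles would interleave and some deposits would be lost, which is exactly the lost-update problem FOR UPDATE prevents at the row level.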

When to use?

  • High contention systems.

  • Critical operations like bank transfers, inventory deduction.

  • When conflicts must be avoided at all costs.

✅ Pros:

  • Safe and conflict-free.

  • No need to retry updates.

❌ Cons:

  • Performance hit due to locking.

  • Can cause deadlocks or long wait times.

  • Doesn't scale well in high concurrency.


🌐 3. Distributed Locking — "One lock to rule them all"

What is it?

Distributed locking is used in distributed systems (multiple services or pods/machines) where shared DB row-level locks or JVM locks don’t work.

How it works?

  • A centralized locking service (like Redis, Zookeeper, or etcd) is used.

  • The system tries to acquire a lock on a key (SET lock_key "uuid" NX PX 30000 in Redis).

  • If successful, it proceeds. Otherwise, waits or fails.

  • Releases the lock explicitly or after timeout.
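The acquire/release steps above can be sketched with a ConcurrentHashMap standing in for Redis (class and method names are hypothetical; note a real lock also needs the PX expiry so a crashed holder can't block everyone forever):

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for Redis: putIfAbsent acts like SET key token NX, and
// token-checked removal acts like the usual compare-and-delete Lua script.
class InMemoryLockService {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    // Returns a token on success, or null if someone else holds the lock
    String tryAcquire(String key) {
        String token = UUID.randomUUID().toString();
        return store.putIfAbsent(key, token) == null ? token : null;
    }

    // Only the holder's token can release: prevents unlocking someone else's lock
    boolean release(String key, String token) {
        return store.remove(key, token);
    }
}

public class DistributedLockDemo {
    public static void main(String[] args) {
        InMemoryLockService locks = new InMemoryLockService();
        String token = locks.tryAcquire("lock:resource:123");
        System.out.println(token != null);                                     // acquired
        System.out.println(locks.tryAcquire("lock:resource:123"));             // null: held
        System.out.println(locks.release("lock:resource:123", "wrong-token")); // false
        System.out.println(locks.release("lock:resource:123", token));         // true
    }
}
```

The token check on release matters: without it, a slow worker whose lock expired could delete a lock that a different worker now legitimately holds.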

Libraries/tools:

  • Redisson for Redis

  • Hazelcast

  • Apache Curator (for Zookeeper)

RLock lock = redissonClient.getLock("lock:resource:123");
boolean acquired = false;
try {
    // Wait up to 10s to acquire; auto-release after 30s if we crash mid-work
    acquired = lock.tryLock(10, 30, TimeUnit.SECONDS);
    if (acquired) {
        // Do critical work
    }
} finally {
    // Unlocking without holding the lock throws IllegalMonitorStateException
    if (acquired && lock.isHeldByCurrentThread()) {
        lock.unlock();
    }
}

When to use?

  • Multi-instance services (e.g., microservices, Kubernetes pods).

  • Cron jobs or batch processing to ensure only one instance does the work.

  • Event-driven or message-processing systems.

✅ Pros:

  • Works across machines, containers, services.

  • Flexible with fine-grained control.

❌ Cons:

  • More complex.

  • Needs external infra (Redis, etc.).

  • Must handle failures, timeouts, and race conditions properly.


🧠 How to Learn More

| Topic | Resources |
| --- | --- |
| JPA Optimistic Locking | Baeldung Optimistic Locking |
| SQL Pessimistic Locking | PostgreSQL FOR UPDATE |
| Distributed Locking | Redisson Docs |
| Concepts & Patterns | Designing Data-Intensive Applications by Martin Kleppmann |

🤔 Summary: Which One to Use?

| Locking Type | Use Case | Scalability | Conflict Handling |
| --- | --- | --- | --- |
| Optimistic | Most REST APIs, low-write conflicts | ✅ High | Retry |
| Pessimistic | High-conflict updates (banking, booking) | ❌ Low | Block & wait |
| Distributed | Multi-instance processing (jobs, microservices) | ✅ High | Explicit locking logic |
