Message-driven systems need solid retry mechanisms to gracefully handle transient failures. Amazon SQS (Simple Queue Service) provides a few tools to control retries:
- `DelaySeconds`
- `VisibilityTimeout`
- Dead Letter Queues (DLQs)
But when should you use each? Let’s break it down.
🔁 1. Visibility Timeout – Automatic Retry Handling
What it is:
When a consumer receives a message from SQS, that message becomes invisible for the duration defined by `VisibilityTimeout`.
If the consumer fails to delete the message within that time, the message becomes visible again and SQS retries it automatically.
Use Case:
- Retrying transient failures without custom logic
- Consuming with frameworks like Spring Cloud SQS or AWS SDK consumers
Pros:
- Simple and automatic
- No extra code needed for retrying
Cons:
- No control over retry timing (the message reappears as soon as the timeout expires)
- Can cause duplicate processing if the visibility timeout is too short
✅ Recommended when you just want basic retries with no backoff logic.
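The duplicate-processing risk above comes down to a simple timing condition, sketched below. The helper and class names are illustrative, and the SDK call in the comment is shown only as the usual guard, not executed here:

```java
class VisibilityTimeoutDemo {
    // A message is redelivered (risking duplicate processing) whenever
    // handling takes longer than the visibility timeout.
    static boolean risksDuplicate(int processingSeconds, int visibilityTimeoutSeconds) {
        return processingSeconds > visibilityTimeoutSeconds;
    }

    public static void main(String[] args) {
        // A 45-second job against a 30-second visibility timeout reappears mid-flight:
        System.out.println(risksDuplicate(45, 30)); // true
        // A common guard while still processing (AWS SDK v1, sketched only):
        // amazonSQS.changeMessageVisibility(queueUrl, receiptHandle, 120);
    }
}
```

In practice, set the visibility timeout comfortably above your worst-case processing time, or extend it mid-processing as in the commented call.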
⏳ 2. DelaySeconds – Scheduled Retry (Manual Logic)
What it is:
SQS allows setting `DelaySeconds` on a per-message basis when sending a message (standard queues only; FIFO queues support just a queue-level delay). It means: don't deliver this message until X seconds have passed.
```java
amazonSQS.sendMessage(new SendMessageRequest()
        .withQueueUrl(queueUrl)
        .withMessageBody(messageBody)
        .withDelaySeconds(60)); // Delay for 60 seconds
```
Use Case:
- Implementing exponential backoff retry logic after a failure
- Avoiding hammering downstream systems with immediate retries
- Fine-grained control over the retry schedule
Pros:
- Controlled and customizable retry strategy
- Easy to implement exponential backoff
Cons:
- Requires extra logic (such as tracking the retry count)
- Delays are capped at 900 seconds (15 minutes)
✅ Recommended when you want advanced retry strategies like exponential backoff.
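A minimal, self-contained sketch of the capped exponential backoff this section recommends. The 30-second base and the names are illustrative; SQS itself only enforces the 900-second cap:

```java
class BackoffDemo {
    // Exponential backoff: 30s, 60s, 120s, ... capped at SQS's
    // 900-second per-message delay limit.
    static int backoffDelaySeconds(int retryCount) {
        long delay = 30L * (1L << retryCount); // doubles each retry
        return (int) Math.min(900L, delay);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt <= 5; attempt++) {
            System.out.println("retry " + attempt + " -> " + backoffDelaySeconds(attempt) + "s");
        }
        // Prints 30, 60, 120, 240, 480, then 900 (capped).
    }
}
```

The computed value is what you would pass to `withDelaySeconds(...)` when re-enqueuing a failed message.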
☠️ 3. Dead Letter Queue (DLQ) – Retry Limit Fallback
What it is:
A DLQ is a secondary queue attached to your primary queue. Once a message has been received more than `maxReceiveCount` times without being deleted (each failed attempt governed by the visibility timeout), SQS automatically moves it to the DLQ.
Use Case:
- Capturing and isolating messages that consistently fail
- Avoiding infinite retry loops
- Debugging failed messages or intervening manually
Pros:
- Separates bad messages for inspection
- Easy to monitor with CloudWatch
Cons:
- Not a retry mechanism per se; more of a final fallback
✅ Recommended as a safety net, not a retry strategy.
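Wiring a DLQ means attaching a redrive policy, a JSON queue attribute, to the main queue. A sketch with an illustrative ARN and threshold; the SDK call in the comment is shown but not executed here:

```java
class RedrivePolicyDemo {
    // Builds the RedrivePolicy JSON that SQS expects as a queue attribute.
    static String redrivePolicyJson(String dlqArn, int maxReceiveCount) {
        return String.format(
                "{\"maxReceiveCount\":\"%d\",\"deadLetterTargetArn\":\"%s\"}",
                maxReceiveCount, dlqArn);
    }

    public static void main(String[] args) {
        // Illustrative ARN and threshold:
        String policy = redrivePolicyJson(
                "arn:aws:sqs:us-east-1:123456789012:orders-dlq", 5);
        System.out.println(policy);
        // Applied to the main queue (AWS SDK v1, sketched only):
        // amazonSQS.setQueueAttributes(new SetQueueAttributesRequest()
        //         .withQueueUrl(mainQueueUrl)
        //         .addAttributesEntry("RedrivePolicy", policy));
    }
}
```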
💡 When to Use What?
| Scenario | Use VisibilityTimeout | Use DelaySeconds | Use DLQ |
|---|---|---|---|
| Auto retry after failure | ✅ Yes | ❌ No | ❌ No |
| Manual retry with delay/backoff | ❌ No | ✅ Yes | ❌ No |
| Retry with increasing delay (backoff) | ❌ No | ✅ Yes | ❌ No |
| Capturing failed messages after max tries | ✅ (to count retries) | ✅ (to track retries) | ✅ Yes |
| Real-time processing, quick retry | ✅ Yes | ❌ No | ❌ No |
| Complex processing with risk of overload | ❌ No | ✅ Yes | ✅ Yes |
🛠️ Best Practice Combo
For most robust production systems, use a combination:
- Set `VisibilityTimeout` to 15 minutes
- Implement `DelaySeconds` with exponential backoff on retries
- Configure a DLQ for messages that exceed the retry threshold
📦 Sample Java (AWS SDK) Exponential Retry Logic
```java
// `msg.getRetryCount()` assumes your payload tracks its own retry count
// (for example, a field or message attribute incremented on each resend).
int retryCount = msg.getRetryCount();
int delay = Math.min(900, (int) Math.pow(2, retryCount) * 30); // cap at 15 min

SendMessageRequest request = new SendMessageRequest()
        .withQueueUrl(queueUrl)
        .withMessageBody(objectMapper.writeValueAsString(msg)) // may throw JsonProcessingException
        .withDelaySeconds(delay);
amazonSQS.sendMessage(request);
```
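The sample above assumes a `getRetryCount()` helper on the payload. One common way to track this (a sketch, not part of the original post) is to carry the count in an SQS message attribute and increment it on each resend:

```java
class RetryCountDemo {
    // Parses the current count from a "retryCount" message attribute
    // (null on first delivery) and returns the value for the next resend.
    static int nextRetryCount(String attributeValue) {
        int current = (attributeValue == null) ? 0 : Integer.parseInt(attributeValue);
        return current + 1;
    }

    public static void main(String[] args) {
        System.out.println(nextRetryCount(null)); // 1 (first delivery has no attribute)
        System.out.println(nextRetryCount("3"));  // 4
        // Attached when resending (AWS SDK v1, sketched only):
        // request.addMessageAttributesEntry("retryCount",
        //         new MessageAttributeValue().withDataType("Number")
        //                 .withStringValue(String.valueOf(next)));
    }
}
```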
✅ TL;DR
- Use `VisibilityTimeout` for simple, default retry behavior
- Use `DelaySeconds` for smarter retry logic (like exponential backoff)
- Always configure a DLQ as a fallback to avoid infinite retries