How a Logging Endpoint Took Down Our API Gateway

The Day Logging Became a Denial-of-Service Attack

The Morning Everything Broke

It was a regular morning. Then alerts started firing: users couldn't log in, and multiple APIs were returning 504 Gateway Timeout. The downstream services? Perfectly healthy. They never even received the requests.

The culprit? A logging endpoint running on a completely different server.

This is the story of how a synchronous logging pipeline, a shared API gateway, and a feedback loop combined to create a cascading failure that took down unrelated services for hours.

The Architecture (Before Things Went Wrong)

Client applications collect telemetry and ship it to an Elasticsearch cluster via a REST API. Here's the simplified flow:

Client Application
    │  POST /api/logs  (batch JSON)
    ▼
API Gateway (shared across ALL services)
    │  proxies to backend service
    ▼
Backend Log Service
    │  POST to Elasticsearch /_doc  (synchronous, blocking)
    ▼
Elasticsearch (single node)

Notice two critical design decisions:

The ES write is synchronous and blocking: the backend holds the HTTP connection open until Elasticsearch responds.

The API Gateway is shared: logging, authentication, sync, and every other API route go through the same gateway with the same connection pool.

Meanwhile, our internal services had a separate, well-designed logging path:

Service → writes to local log file → Filebeat (async) → Elasticsearch

This path was non-blocking. The client log ingestion path was not.

The Incident: A Cascading Failure

The Spark

Log volume from client applications spiked significantly above normal. A small set of devices started sending considerably more telemetry than usual.

Under normal load, each ES write took single-digit milliseconds, and the system handled it fine. But at elevated volume, Elasticsearch, running on a single node, started to saturate.

The Bottleneck Forms

Elasticsearch began rejecting writes. Some returned HTTP 400, others 429 (throttled). Each rejection took longer than a successful write because ES was struggling.

Here's where the synchronous design became lethal:

Client sends log batch
    → Gateway allocates a worker thread + connection
    → Backend receives request, POSTs to ES
    → ES is saturated... waiting... waiting...
    → Seconds later: ES returns error
    → Backend wraps it as HTTP 400 → sends back to client
    → Gateway finally releases the connection

Each failed request held a gateway connection hostage for seconds instead of milliseconds. Connection hold time increased by orders of magnitude.
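A rough way to see why hold time matters is Little's Law: the average number of connections in use equals the arrival rate multiplied by the average hold time. The sketch below uses made-up numbers (200 log uploads per second, 5 ms versus 5 s hold time) purely to show the scaling; they are assumptions, not measurements from this incident.

```javascript
// Little's Law: connections in use = arrival rate (req/s) x hold time (s).
// All figures are hypothetical, chosen only to illustrate the effect.
function connectionsInUse(requestsPerSecond, holdTimeMs) {
    return requestsPerSecond * holdTimeMs / 1000;
}

const logRequestsPerSecond = 200; // assumed log upload rate

// Healthy ES: each request holds a connection for ~5 ms.
const healthy = connectionsInUse(logRequestsPerSecond, 5);

// Saturated ES: each failed request holds a connection for ~5 s.
const saturated = connectionsInUse(logRequestsPerSecond, 5000);

console.log(`healthy:   ~${healthy} connection(s) in use`);
console.log(`saturated: ~${saturated} connection(s) in use`);
```

With the same request rate, a thousandfold increase in hold time means a thousandfold increase in connections held, which is how an unchanged gateway pool can go from mostly idle to exhausted.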

The Feedback Loop

Here's where it got ugly. When a log upload failed, the client application did the responsible thing: it logged the failure. But that failure log was itself sent via... POST /api/logs.

Failed upload
    → generates "upload failed" log
    → POST /api/logs
    → fails
    → generates another "upload failed" log
    → ...

We had a partial guard against infinite recursion: if a batch contained only failure-summary logs, it was suppressed. But each failed batch still generated one failure event, which got bundled with the next batch of real logs. Even from a small set of devices, this amplification produced a sustained flood of failure events over hours.

The Collateral Damage

This is the part that surprised everyone.

The API Gateway is like a toll booth with limited lanes. Log requests filled every lane. Each one sat there waiting for ES to respond or time out.

When requests for other services arrived at the gateway:

Log batch #1  → Gateway → Backend → waiting on ES...
Log batch #2  → Gateway → holding connection...
... many more log requests queued ...
Auth request arrives at Gateway
→ No free workers: all busy with log requests
→ Gateway timeout
→ User sees: 504 Gateway Timeout
   (Auth server never received the request)

The downstream services were completely healthy: idle, waiting for requests that never arrived. The gateway was the chokepoint. Logging starved everything else.

The Failure Chain

Log volume spike
        ↓
Client applications × batch uploads → POST /api/logs
        ↓
Backend blocks on synchronous ES write
        ↓
ES rejects/throttles (HTTP 400, 429)
        ↓
Error wrapped → client sees 400
        ↓
"Upload failed" event → more /api/logs traffic (amplification)
        ↓
Gateway connection pool exhausted by stuck log requests
        ↓
All other APIs starved → 504 Gateway Timeout

Why This Architecture Was a Ticking Time Bomb

Problem 1 - Synchronous Writes to a Shared Data Store

The backend treated Elasticsearch like a transactional database: wait for confirmation before responding. For a logging endpoint, this is fundamentally wrong. Logs are not transactions. They don't need synchronous acknowledgment.

Problem 2 - Shared Gateway, No Isolation

Every API route (logging, auth, sync, data) competed for the same gateway worker pool. There was no priority lane for critical services. Any misbehaving service could starve the rest.

Problem 3 - Single Elasticsearch Node

A single-node ES cluster has no redundancy and limited throughput. Under normal load it was fine. Under elevated load, it became the single point of failure for the entire platform.

Problem 4 - Failure Amplification

The logging client's retry-and-report behavior turned a spike into a sustained flood. Each failure generated more traffic, which generated more failures. The error-reporting mechanism used the same path that was failing.

The Fix: Decouple, Isolate, Protect

1. Make Log Ingestion Asynchronous

Before:

Client → Gateway → Backend → ES (blocking) → Response

After:

Client → Gateway → Backend → Queue → 202 Accepted (immediate)
                                ↓
                          Consumer → ES (async, with backpressure)

Return 202 Accepted immediately. Push logs to a message queue (Kafka, SQS, Redis Streams). A separate consumer writes to ES at its own pace. The HTTP connection is released in milliseconds, not seconds.
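As a minimal sketch of that shape, the snippet below uses a bounded in-memory array as a stand-in for the real queue (Kafka, SQS, or Redis Streams). The names `LogQueue`, `handleLogUpload`, `drain`, and `writeToStore` are hypothetical, and answering 429 when the queue is full is one assumed policy among several (dropping the batch is another).

```javascript
// Bounded queue: a stand-in for a real message queue.
class LogQueue {
    constructor(maxBatches) {
        this.maxBatches = maxBatches;
        this.batches = [];
    }

    // Returns false instead of blocking when the queue is full.
    tryEnqueue(batch) {
        if (this.batches.length >= this.maxBatches) return false;
        this.batches.push(batch);
        return true;
    }

    dequeue() {
        return this.batches.shift();
    }

    get size() {
        return this.batches.length;
    }
}

// Request-path logic, framework-agnostic: decides the status to send.
// It never touches the data store, so it returns in constant time.
function handleLogUpload(queue, batch) {
    return queue.tryEnqueue(batch)
        ? { status: 202 }                          // accepted, not yet stored
        : { status: 429, retryAfterSeconds: 30 };  // assumed load-shedding policy
}

// Consumer: drains the queue at its own pace, off the request path.
// writeToStore is a hypothetical async function that writes one batch.
async function drain(queue, writeToStore) {
    let written = 0;
    while (queue.size > 0) {
        const batch = queue.dequeue();
        await writeToStore(batch); // a slow store delays only the consumer
        written++;
    }
    return written;
}
```

The point of the design is that a slow or failing data store now backs up the queue, not the gateway's connection pool.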

2. Isolate Critical Routes

Dedicate separate gateway worker pools or separate gateway instances for critical paths. Log ingestion should never compete with auth or core APIs for the same connection pool.

Gateway Pool A (auth, core APIs)  → Downstream Services
Gateway Pool B (log ingestion)    → Log Service
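Most gateways express this split in configuration, but the underlying idea (often called the bulkhead pattern) can be sketched in a few lines. The class name, the pool limits, and the route prefix check below are illustrative assumptions, not our gateway's actual settings.

```javascript
// Bulkhead sketch: each route class gets its own concurrency budget, so
// stuck log requests cannot consume slots reserved for auth.
class Bulkhead {
    constructor(limit) {
        this.limit = limit;
        this.inFlight = 0;
    }

    // Runs the task if a slot is free; otherwise fails fast.
    async run(task) {
        if (this.inFlight >= this.limit) {
            throw new Error('pool exhausted');
        }
        this.inFlight++;
        try {
            return await task();
        } finally {
            this.inFlight--; // always release the slot
        }
    }
}

// Hypothetical limits: the split matters more than the numbers.
const pools = {
    critical: new Bulkhead(80), // auth, core APIs
    logs: new Bulkhead(20),     // log ingestion
};

function poolFor(path) {
    return path.startsWith('/api/logs') ? pools.logs : pools.critical;
}
```

With this in place, a flood of stuck log uploads fails fast once the logs pool is full, while requests routed to the critical pool still find free slots.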

3. Rate-Limit at the Edge

Apply per-client rate limiting at the gateway. A small set of misbehaving clients should not be able to saturate the entire platform. A simple token bucket per client ID would have contained this incident to a minor ES slowdown.
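A token bucket is small enough to sketch in full. The capacity (10) and refill rate (1 token per second) below are placeholder values, and `allowRequest` is a hypothetical helper; in practice this logic usually lives in the gateway's rate-limiting plugin rather than in application code.

```javascript
// Token bucket: allows short bursts up to `capacity`, then a steady
// rate of `refillPerSecond`. Timestamps are in milliseconds.
class TokenBucket {
    constructor(capacity, refillPerSecond, now = Date.now()) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;
        this.lastRefill = now;
    }

    tryConsume(now = Date.now()) {
        // Refill based on elapsed time, capped at capacity.
        const elapsedSeconds = (now - this.lastRefill) / 1000;
        this.tokens = Math.min(
            this.capacity,
            this.tokens + elapsedSeconds * this.refillPerSecond
        );
        this.lastRefill = now;

        if (this.tokens < 1) return false; // caller should answer 429
        this.tokens -= 1;
        return true;
    }
}

// One bucket per client ID. Note: this map grows without bound; a real
// deployment would need to evict idle clients.
const buckets = new Map();

function allowRequest(clientId, now = Date.now()) {
    let bucket = buckets.get(clientId);
    if (!bucket) {
        bucket = new TokenBucket(10, 1, now); // assumed limits
        buckets.set(clientId, bucket);
    }
    return bucket.tryConsume(now);
}
```

Because each client has its own bucket, a noisy device exhausts only its own budget and is throttled without affecting anyone else.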

4. Break the Feedback Loop

Under backpressure, suppress failure-summary uploads entirely. If the logging endpoint is failing, sending more logs about the failure only makes things worse.

Implement exponential backoff with jitter, and cap retry attempts:

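// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Assumes `batch` is a JSON-serializable object.
async function uploadLogs(batch) {
    const maxAttempts = 3;

    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            const res = await fetch('/api/logs', {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify(batch),
            });
            if (res.ok) return true;
            // fetch() rejects only on network errors, so an HTTP error
            // status (400, 429, 5xx) must be treated as a failure here.
        } catch (err) {
            // Network error: fall through to the backoff below.
        }

        if (attempt === maxAttempts) break; // don't sleep after the last try

        // Exponential backoff + jitter: don't hammer the endpoint
        const delay = Math.min(1000 * 2 ** attempt, 30000)
                    + Math.random() * 1000;
        await sleep(delay);
    }

    // Don't log the failure back through the same path.
    // Store locally or drop silently under backpressure.
    return false;
}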
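One way to "store locally" without feeding the loop is to count failures on the device and release a single summary only after an upload succeeds again. This is a sketch of that idea, not our client's actual code; `FailureTracker` and the `upload_failure_summary` event type are hypothetical names.

```javascript
// Counts upload failures locally instead of emitting one log event per
// failure. Nothing is sent while the endpoint is unhealthy.
class FailureTracker {
    constructor() {
        this.failedUploads = 0;
    }

    // Call when an upload fails. No network traffic, no new log event.
    recordFailure() {
        this.failedUploads++;
    }

    // Call after a successful upload. Returns one summary event covering
    // every failure since the last success, or null if there were none.
    recordSuccess() {
        if (this.failedUploads === 0) return null;
        const summary = {
            type: 'upload_failure_summary',
            failedUploads: this.failedUploads,
        };
        this.failedUploads = 0;
        return summary;
    }
}
```

However long the outage lasts, the failure information costs one extra event after recovery instead of one extra event per failed batch.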

5. Scale the Data Store

For a logging workload, consider:

Multi-node ES cluster with dedicated ingest nodes
Longer refresh intervals for write-heavy indices
ILM (Index Lifecycle Management) policies to manage growth

The Mental Model

Think of your API Gateway as an airport security checkpoint with 20 lanes.

On a normal day, logging passengers (frequent, low-priority) and auth passengers (critical, must-board) both pass through quickly.

Now imagine logging passengers start taking 10 minutes each because the scanning machine (Elasticsearch) is broken. They don't leave the lane; they stand there waiting. All 20 lanes fill up.

Auth passengers arrive. They're ready to go. Their gate is open. But they can't get through security. They miss their flight (504 timeout).

The solution isn't to fix the scanning machine faster. It's to give logging passengers a separate entrance entirely.

Key Takeaways

Your observability pipeline should never be able to take down the thing it's observing. If your logging system can cause outages, it's a liability, not an asset.

Synchronous writes for logging are an anti-pattern. Logs are fire-and-forget. Treat them that way.

Shared infrastructure needs isolation boundaries. Without resource isolation, any service can become a noisy neighbor that starves critical paths.

Failure handling can amplify failures. If your error-reporting mechanism uses the same path that's failing, you've built an amplification loop.

The blast radius of a "non-critical" service is determined by what it shares, not what it does. A logging endpoint seems harmless until it shares a connection pool with every other service on the platform.

Final Thought

This incident wasn't caused by a bug. The code worked exactly as written. Every component did its job.

The problem was architectural coupling: two systems that should have been independent were sharing a critical resource without isolation.

The most dangerous failures aren't the ones where something breaks. They're the ones where everything works exactly as designed, and the design is wrong.

Have you hit a similar cascading failure? What was the unexpected blast radius? Drop it in the comments.
