Who slowed down my webhooks?

For a while now, the Automations team at ZenBusiness has been seeing partner webhook events show up in their consumer way later than they should. We're talking 30 seconds to 2+ minutes from the time our publicly exposed webhook receiver accepted the webhook to the time the event landed in Pub/Sub. Not great, but workable. The issue became impossible to ignore when events were delayed so long that a signed URL in the webhook payload had already expired by the time they arrived. The team raised the alarm.

Spoiler: It was a one-line default in a helper package that everyone has been using "correctly" for years.

The Story

The automation partner fires a webhook at ZenBusiness. We validate it, then a magical internal service called the Postgres Event Distributor (PGED) takes the message and publishes it to a Pub/Sub topic. The automations service consumes from a subscription on that topic. Standard stuff.

Except this partner's webhooks were lagging. Another external partner, which uses the exact same plumbing, was fine at about 4 seconds end-to-end, while the automation partner was averaging 80 seconds and trending up over time.

The Problem

The PGED code has a method called GetOrderingKey which looks something like this:

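// GetOrderingKey returns the Pub/Sub ordering key for an event.
// Side effect: any event with a non-empty PartitionKey switches
// the whole topic into ordered mode.
func (pst *PubSubTopic) GetOrderingKey(event *buffer.BufferedEvent) string {
    if event.PartitionKey != "" {
        pst.topic.EnableMessageOrdering = true
        return event.PartitionKey
    }
    // No partition key: publish without an ordering key.
    return ""
}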

The moment PGED sees an event with a non-empty PartitionKey, it flips the topic into ordered mode and uses that key as the Pub/Sub OrderingKey. Sounds reasonable. The catch is what Google's Pub/Sub client does once ordering is enabled — it spins up a per-key publish bundler and serializes publishes for each unique key, one at a time.

In plain English, the publisher has to do extra bookkeeping for every distinct key it sees, so the delay keeps growing as new keys pile up.

If your keys repeat (think: user IDs, where the same user fires many webhooks), you get a handful of serialized streams, and life is good.

If your keys are unique per event, the client has to maintain one serialized stream for every event it has ever seen. Bookkeeping piles up. Throughput tanks. Lag grows. Google literally warns you about this:

"Using message ordering with many ordering keys is expensive. If your application uses many ordering keys, the publisher might use a lot of memory and have lower throughput."
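
To make the difference concrete, here is a small sketch that counts how many distinct ordering keys, and therefore how many per-key serialized streams, a publisher would have to track for two workloads. It never calls Pub/Sub; countOrderedStreams and the sample events are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper: counts the distinct ordering keys in a batch of events.
// Each distinct non-empty key stands for one serialized publish stream that an
// ordering-enabled publisher has to keep track of.
function countOrderedStreams(events) {
  const keys = new Set();
  for (const event of events) {
    if (event.partitionKey !== '') {
      keys.add(event.partitionKey);
    }
  }
  return keys.size;
}

// Workload 1: 1,000 webhooks spread across 10 users, keyed by user ID.
const repeatingKeys = Array.from({ length: 1000 }, (_, i) => ({
  partitionKey: `user-${i % 10}`,
}));

// Workload 2: 1,000 webhooks where every event carries its own unique key.
const uniqueKeys = Array.from({ length: 1000 }, (_, i) => ({
  partitionKey: `event-${i}`,
}));

console.log(countOrderedStreams(repeatingKeys)); // 10
console.log(countOrderedStreams(uniqueKeys)); // 1000
```

The first workload stays at 10 streams no matter how many events arrive. The second grows by one stream per event, which is the pattern the warning is about.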

So… why were the automation partner's webhook keys unique per event when no one set them that way?

The Smoking Gun

Buried deep inside one of the helper libraries was this snippet of code:

const eventId = randomUUID();

const baseParams = [
    eventId,
    typeName,
    source,
    requestUuid,
    partitionKey || eventId,   // ← right here
    eventData
];

If you called the library without passing a partitionKey (or passed any falsy value, such as an empty string), the package silently filled in the freshly generated eventId, a brand-new UUID for every single call, as the partition_key.
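
The fallback is easy to reproduce in isolation. The sketch below reduces the pattern to a hypothetical buildPartitionKey function, with a simple counter (fakeEventId) standing in for randomUUID(); it is not the library's actual code.

```javascript
// Stand-in for randomUUID(): returns a new unique ID on every call.
let nextId = 0;
function fakeEventId() {
  nextId += 1;
  return `event-${nextId}`;
}

// Hypothetical reduction of the pattern in the snippet above.
function buildPartitionKey(partitionKey) {
  const eventId = fakeEventId();
  // `||` falls back for ANY falsy value: undefined, null, and also ''.
  return partitionKey || eventId;
}

console.log(buildPartitionKey('user-42')); // user-42
console.log(buildPartitionKey(undefined)); // event-2
console.log(buildPartitionKey(''));        // event-3
```

Note the last call: even a caller who deliberately passes '' ends up with a unique key, so there is no way to ask for "no ordering" through this code path.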

PGED reads that row, sees partition_key != "", flips the topic into ordered mode, and now every event has its own dedicated, serialized publish stream. Bingo.

The Naming Trap

Here's the part that hid this bug for years. The same value travels under different names at different layers:

Public Webhook Receiver - partitionKey
Postgres - partition_key
PGED - PartitionKey
Pub/Sub - OrderingKey

"Partition key" in normal-engineer-speak (Kafka, Kinesis) means shard for parallelism: same key → same shard → throughput scales horizontally.

"Ordering key" in Pub/Sub means serialize strictly in order: same key → one-at-a-time publishes → throughput is limited per key.

A developer reading partitionKey reasonably assumes throughput-improving sharding. The actual effect when handed to Pub/Sub is the opposite — throughput-limiting serialization. Worth a rename across the stack.

The Fix

In the end, the fix was straightforward. Instead of requiring callers to opt in, the library now defaults a missing partition key to '' (empty), which PGED reads as "no ordering, just publish in parallel".

// New helper, used by both raiseEvent and raiseEvents
let conformedPartitionKey = partitionKey;
if (partitionKey === null || partitionKey === undefined) {
    conformedPartitionKey = '';
}
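
Wrapped in a function, the new behavior looks roughly like this. conformPartitionKey is a hypothetical name for the helper; the point is that only null and undefined are replaced, and they become '' rather than a fresh ID.

```javascript
// Hypothetical wrapper around the snippet above.
function conformPartitionKey(partitionKey) {
  // Only a missing key is replaced; a caller-supplied key passes through.
  if (partitionKey === null || partitionKey === undefined) {
    return '';
  }
  return partitionKey;
}

console.log(JSON.stringify(conformPartitionKey(undefined))); // ""
console.log(JSON.stringify(conformPartitionKey(null)));      // ""
console.log(JSON.stringify(conformPartitionKey('user-42'))); // "user-42"
```

Callers that really do need ordering still get it by passing a key; everyone else gets an empty key and unordered, parallel publishes.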

Validating the Fix

The lag symptom was not visible in dev under natural traffic. With about 14 events/day in dev versus ~10/minute in prod, dev never accumulates enough distinct keys for the per-key bundlers to misbehave. We literally had to test the fix in production.

Lessons Learned

Silent defaults are a trap. A || fallback that quietly substitutes a unique value for an absent one is much worse than throwing an error. The package was doing exactly what the code said — it just wasn't what anyone reading the call site would have guessed.
Opt-in fixes age badly. The first attempt (PR#1) added an opt-in flag. The second attempt (PR#2) made the right behavior the default. The second one is the one that actually fixed every partner without anyone having to remember anything.
Names matter. If "partition" and "ordering" mean opposite things in your stack, expect engineers to wire them up backwards.
The automations partner was the canary, not the only victim. When something looks like a one-partner problem, check the other partners using the same plumbing before you ship a one-partner fix.
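
As an illustration of the first lesson, a stricter helper could refuse to guess at all. This is a hypothetical alternative, not what the team shipped (the actual fix defaults to ''); requirePartitionKey and its error message are made up for the example.

```javascript
// Hypothetical strict variant: fail loudly instead of substituting a value.
function requirePartitionKey(partitionKey) {
  if (partitionKey === null || partitionKey === undefined) {
    throw new TypeError(
      "partitionKey is required; pass '' explicitly to publish without ordering"
    );
  }
  return partitionKey;
}

try {
  requirePartitionKey(undefined);
} catch (err) {
  // The missing key surfaces at the call site instead of in production lag.
  console.log(err.message);
}
console.log(requirePartitionKey('user-42')); // user-42
```

The trade-off is that a throwing helper breaks existing callers until they are updated, which is presumably part of why a safe default is the gentler fix.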
