Talking to a database is slow, don't do it if you don't have to.
Written by
Andreas Thomas
Published on
7 months ago, I wrote about leaving serverless behind, so let's look at how things are today. This post is not a guideline, there are many ways to do it, but here's how we're building our API for low latency. These days we're sustaining throughput around 10,000 requests per second with p99 service latency below 1 millisecond.
I don't want to claim anything that is misleading, so let me be clear about what we're measuring here. I define service latency as the time from when our API receives a request to when it sends a response, excluding network time or TLS. This is the time spent inside our API handler, including any serialisation, validation, database calls or cache lookups.
TLDR it comes down to two design choices:
[TODO] Insert p50, p95, p99 over selected production window.[TODO] Insert cache hit rate per hot cache.The most latency sensitive path in our API is /v2/keys.verifyKey. The route accepts a managed API key and needs to verify its validity, permissions, ratelimits and more.
A request to this route looks like this:
1curl -X POST "https://api.unkey.dev/v2/keys.verify" \
2 -H "Content-Type: application/json" \
3 -H "Authorization: Bearer unkey_xxxx" \
4 -d '{
5 "key": "sk_live_1234567890abcdef",
6 "ratelimits": [...]
7 "permissions": "dns.create_record AND dns.update_record"
8 }'Notice how there are 2 distinct keys in the request: the Authorization header and the key field in the body. The former is a root key that identifies the caller, to ensure you cannot verify keys from other tenants. While the latter is provided by the end user, that the caller wants to authorize.
In an unoptimized implementation, that would involve multiple round trips to the database, which adds up quickly. Some of these can be combined into a single query, but not all of them. We're still left with one query for the root key and one for the user key.
The maybe obvious answer is to just cache these, and that's what we do. API keys themselves are a great candidate for caching, because they are read much more often than they are updated. However invalidation becomes a concern, especially for a critical path like this. If we cache too aggressively, we might end up authorizing revoked keys for longer than we're comfortable with. Furthermore, ratelimits are very dynamic, we need to track the usage of a key in real time to enforce them correctly across many api nodes.
So let's take a step back and fix the low hanging fruits first.
The biggest impact on the tail latency is the round trip time to the database. Even if you do in-memory caching, you will still have cache misses, and those need to be as fast as possible. Going halfway across the globe to your database adds hundreds of milliseconds per request. Imagine doing multiple reuqests in series...
Fortunately many database providers these days offer read-replicas in multiple regions. We're using PlanetScale and the replication lag so far has been sub second, which is good enough for our use case. Our current setup is to have read-replicas in key locations to serve the majority of our traffic with low latency. We have a primary in us-east-1 and read replicas in ap-southeast-2, ap-northeast-1, us-west-2, ap-south-1 and eu-central-1.
Here are our real-world read latency measured from our API nodes (in each region) to planetscale.
It's pretty fast already, and we don't have to worry about stale data from the replicas, because they replicate very quickly. (Anecdotally, I have never seen it lag more than 1 second.)
The downside is that we're paying for our data storage multiple times now, but that's well worth it. Our datasize is low and our read to write ratio is very high.
The next optimisation is to cache the results of these database queries in memory. It's fast, it's easy and it's a nightmare to get right. Caching is a double-edged sword, it can make things faster but it can also make things worse if you don't do it right.
In the simplest case, we would cache the key's configuration by its hash. So as a request comes in, we hash the raw key and then look it up in the cache. If it's a hit, we can return the cached configuration immediately. If it's a miss, we need to go to the database, fetch the configuration, and then populate the cache for future requests. But this is where the problems begin. How long do you keep the cache entry? If you keep it for too long, you might end up authorizing revoked keys. If you keep it for too short, you might end up with a high miss rate and gained nothing.
To help with this, we use a caching strategy called stale-while-revalidate (SWR). The idea is that when a cache entry becomes too old, we can still return it to the caller, but we trigger a background refresh to get a fresh value from the origin. This way we can keep the cache hit rate high while still ensuring that we don't serve outdated data.
As an example, let's say a key is cached at t=0 with Fresh=10s and Stale=60s:
1API request
2 |
3 v
4Check cache age
5 |
6 +-- age <= 10s (fresh)
7 | -> return cached value
8 |
9 +-- 10s < age <= 60s (stale)
10 | -> return stale value now
11 | -> revalidate in background (DB)
12 |
13 +-- age > 60s (expired)
14 -> fetch from DB
15 -> update cache
16 -> return fresh valueRequests for frequently used keys return from memory and the worst case is that we serve data that is 60s old.
Caching the absence of a record is just as important as caching its presence. If we only cache hits, then every request for a missing key would go to the database. While the latency here doesn't really matter much, it's important that make this as cheap as possible for us. Common negative caches for us are often the result of a revoked key, where existing apps might still be trying to use it.
We just use a special cache entry to indicate that the record is not found, and then we can return a 404 immediately without hitting the database again for a while.
During cache population, we transform the raw data into a more convenient format for the handler. For example, we parse the IP whitelist into a map for O(1) lookups, so that we don't have to do that on every request. We store IP allow lists as comma separated strings in the database, but we want to use them as maps in the handler. So we do that transformation once during cache population, and then we can just use the parsed map for every request.
1// internal/services/keys/get.go
2key, hit, err := s.keyCache.SWR(ctx, sha256Hash, func(ctx context.Context) (db.CachedKeyData, error) {
3 row, err := db.WithRetryContext(ctx, func() (db.FindKeyForVerificationRow, error) {
4 return db.Query.FindKeyForVerification(ctx, s.db.RO(), sha256Hash)
5 })
6
7 parsedIPWhitelist := make(map[string]struct{})
8
9 ips := strings.Split(row.IpWhitelist.String, ",")
10 for _, ip := range ips {
11 trimmed := strings.TrimSpace(ip)
12 if trimmed != "" {
13 parsedIPWhitelist[trimmed] = struct{}{}
14 }
15 }
16 return db.CachedKeyData{
17 FindKeyForVerificationRow: row,
18 ParsedIPWhitelist: parsedIPWhitelist,
19 }, nil
20}, caches.DefaultFindFirstOp)Small transformations like this look minor in isolation, but at thousands of requests per second they add up.
Rate limiting sounds like a solved problem. Count requests, reject when the count exceeds a threshold. But when your API runs across multiple nodes or even regions, the problem gets interesting fast.
The naive approach forces every request through a single coordination point like Redis, adding a network round trip to every API call. For a service where latency matters, that's unacceptable. We wanted something better: sub-millisecond decisions on the hot path, with distributed consistency where it matters.
The simplest distributed rate limiter is a Redis INCR with a TTL. Every request increments a key, and if the value exceeds the limit, you reject. It's correct, it's simple, and it adds 1-2ms of latency to every single request.
For most applications running in a single region, that's fine. But it's not possible across regions. And even in a single region, Redis becomes a single point of failure. If it goes down or gets slow, you have to make a decision, either every API call in the system blocks or fails or you open the floodgates.
We wanted a design where:
Before we get into distribution, let's talk about the core algorithm. We use a sliding window instead. The idea is simple: maintain two counters (the current window and the previous window) and blend them based on how far into the current window we are.
1effectiveCount = current.counter + previous.counter × (1 - elapsed)Where elapsed is the fraction of the current window that has passed. If we're 40% into the current minute, we count 100% of this minute's requests plus 60% of last minute's. The previous window's influence decays linearly until it drops off entirely at the boundary.
This means a client who used 80 of 100 requests in the previous window and is 30% into the current one effectively has current + 80 × 0.7 = current + 56 counted against them. They have 44 tokens available, not a fresh 100.
The sliding window never resets abruptly, eliminating the burst-at-boundary problem while requiring only two integers of state per rate limit.
1Previous Window Current Window
2 ┌───────────────┐┌───────────────┐
3 │ counter = 80 ││ counter = 10 │
4 └───────────────┘└───────────────┘
5 ▲
6 │ elapsed = 0.3 (30% into current window)
7 │
8
9 effective = 10 + 80 × (1 - 0.3) = 66
10 remaining = 100 - 66 = 34The foundation of our distributed design is a surprisingly simple idea: if every node agrees on how to slice time into windows, they can make independent decisions without talking to each other.
Every rate limit has a duration, say 60 seconds. We divide the Unix timestamp (in milliseconds) by this duration to produce a sequence number:
1func calculateSequence(t time.Time, duration time.Duration) int64 {
2 return t.UnixMilli() / duration.Milliseconds()
3}At 12:30:00 UTC with a 60-second window, the sequence is time / 60000. At 12:30:45, it's the same number. At 12:31:00, it increments by one. Every node in the cluster—regardless of when it last communicated with any other node—computes the same sequence number for the same moment in time.
No coordination needed. Wall clock alignment gives us a shared frame of reference. (NTP keeps our nodes synchronized to within a few milliseconds, which is more than sufficient for rate limit windows measured in seconds up to months.)
Each unique combination of (name, identifier, limit, duration) maps to a bucket. For example allowing user_123 to use up to 10M inference tokens per minute would look like this: ("inference-tokens", "user_123", 10000000, 60000). A bucket is an in-memory struct holding a map of sequence numbers to windows, protected by its own mutex. This per-bucket locking is critical. A request rate-limiting user-123 never contends with a request rate-limiting user-456, even though both flow through the same service.
When a request arrives, we look up (or create) its bucket, lock it, retrieve the current and previous windows, run the sliding window calculation, and return a decision. The entire critical section is a few arithmetic operations on local memory.
Each node makes rate limit decisions instantly against local counters, but those counters need to eventually converge with the global state in Redis. This happens through a replay buffer.
Every successful rate limit check pushes the request into the buffer. Background goroutines continuously drain it, and for each request, they increment a shared counter in Redis: If the counter in redis is higher than our local counter, because the same ratelimit has been requested on other nodes, we increment our local counter to merge the global state into our local state.
1func (s *service) syncWithOrigin(ctx context.Context, req RatelimitRequest) error {
2
3 w := getCurrentWindow(req.Time)
4
5 newCounter, err := s.redis.Increment(
6 ctx,
7 fmt.Sprintf("%s:%d", req.Identifier, w.sequence)
8 req.Duration * 3, // TTL
9 )
10 if err != nil {
11 return err
12 }
13
14 // One-way ratchet: only update local if Redis is higher
15 if newCounter > w.counter {
16 w.counter = newCounter
17 }
18
19 return nil
20}The critical detail is the max-merge on the response: when Redis returns the new global counter after the increment, we update the local counter only if Redis is higher. This is a one-way ratchet. Local state is always brought up to the global count, never down. It ensures nodes converge toward the true total without any risk of lost increments or double-counting.
The TTL on each Redis key is set to 3× the window duration, giving ample time for the sliding window to reference the previous window before the key expires.
1Request
2 ┃ (sync)
3 ▼
4 ┌───────────┐··············►┌────────────┐···INCRBY····►┌───────┐
5 │ Local │ (non-block) │ Replay │ │ Redis │
6 │ Bucket │◄··············│ Buffer │◄·············│ │
7 │ (in-mem) │ max-merge │ │ new count └───────┘
8 └───────────┘ └────────────┘
9 ┃ (sync)
10 ▼
11 ResponseThis architecture has an explicit tradeoff. During the brief interval between a local decision and its replay to Redis, two nodes can each independently approve a request that collectively exceeds the limit. If Node A and Node B both see 99/100 locally and each approve one more request, the true global count is 101.
For most traffic patterns, where usage is well below the limit, this never matters. The replay catches up in milliseconds, and the next check on either node will see the corrected count.
But right at the limit boundary, this slack could matter. That's where our strict mode comes in.
When any node detects that a request exceeds the limit, it sets a strictUntil timestamp on that bucket, one full window duration into the future:
1if exceeded {
2 b.strictUntil = req.Time.Add(req.Duration)
3}When a node does not have an existing local state or when strictUntil is active, the node bypasses local counters and queries Redis directly before making a decision:
1goToOrigin := req.Time.UnixMilli() < b.strictUntil.UnixMilli()
2if goToOrigin || !currentWindowExisted {
3 currentWindow.counter = max(currentWindow.counter, s.counter.Get(ctx, key))
4}This synchronous Redis check costs a network round trip (~1-2ms), but it only triggers after a previous denial, exactly when over-admission would be most harmful.
The result is an adaptive consistency model:
strictUntil expires and the bucket returns to local-first mode.This means most requests pay zero coordination cost, while the critical edge cases get full accuracy.
1┌────────────┐ request denied ┌─────────────────┐ strictUntil expires ┌────────────┐
2 │ Local-Only │─────────────────►│ Strict Mode │──────────────────────►│ Local-Only │
3 │ (fast path)│ │ (Redis check) │ │ (fast path)│
4 │ ~μs │ │ ~1-2ms │ │ ~μs │
5 └────────────┘ └─────────────────┘ └────────────┘
6 │ │
7 │ for every request: │ for every request:
8 │ check local counters only │ GET from Redis, then check
9 │ buffer to replay │ buffer to replaySometimes a single action needs to satisfy multiple rate limits simultaneously. An API key might have both a per-second limit and a per-day limit. If the per-day check passes but the per-second check fails, you don't want the per-day counter incremented. Only when all ratelimits are passed should a request be counted.
The implementation acquires locks on all involved buckets, checks every limit, and only increments counters if every check passes. But locking multiple buckets introduces a deadlock risk: goroutine A locks bucket X then waits on Y, while goroutine B locks Y then waits on X.
We prevent this by sorting all bucket keys lexicographically before acquiring any locks:
1sort.Slice(reqsWithKeys, func(i, j int) bool {
2 return reqsWithKeys[i].key.toString() < reqsWithKeys[j].key.toString()
3})
4
5for _, b := range uniqueBuckets {
6 b.mu.Lock()
7 defer b.mu.Unlock()
8}Every goroutine acquires locks in the same global order, making circular wait impossible. This is a classic technique, but it's easy to forget when you're working with dynamic sets of locks.
1Without ordering (deadlock!): With sorted ordering (safe):
2
3 Goroutine A Goroutine B Goroutine A Goroutine B
4 ────────── ────────── ────────── ──────────
5 Lock(X) ✓ Lock(Y) ✓ Lock(X) ✓ Lock(X) ⏳
6 Lock(Y) ⏳ Lock(X) ⏳ Lock(Y) ✓ Lock(X) ⏳
7 💀 DEADLOCK Lock(Z) ✓ Lock(X) ⏳
8 Unlock all Lock(X) ✓
9 Lock(Y) ✓
10 Lock(Z) ✓
11 Unlock allIf any limit check fails, we release all locks without incrementing anything.
The replay path to Redis is wrapped in a circuit breaker. After repeated failures, the circuit opens and stops attempting replays entirely. During this period, nodes continue making local-only decisions—accuracy degrades slightly (nodes can't see each other's increments), but availability is preserved.
Redis timeouts are deliberately aggressive: 500ms read/write, 1s dial. We'd rather fail fast and fall back to local state than let a slow Redis response block the replay pipeline and cause backpressure.
The replay buffer itself is bounded at 10,000 entries with a drop policy. If Redis is down long enough for the buffer to fill, new replay events are dropped rather than blocking the rate limit decision path. The rate limit check always returns instantly—it never waits on Redis.
When Redis recovers, the circuit closes and convergence resumes automatically. The max-merge behavior means the local counters will reconcile with whatever global state Redis has, with no manual intervention needed.
Without cleanup, the bucket map would grow forever as new identifiers appear. A background janitor runs every minute and does two things:
This keeps memory proportional to the number of active rate-limited identifiers, not the total number of identifiers ever seen.
So far we've just been relying on time-based expiration, but that comes with the problem of limiting how long we can cache data for. If we want to cache for longer, we need a way to invalidate the cache when the underlying data changes. This is especially important for critical paths like key verification, where we don't want to risk authorizing revoked keys for too long.
We have a proof of concept to use gossip for emitting cache invalidation events. If we can reliably tell each node to revalidate their cache, we could use much longer freshness and stale TTLs, increasing the cache hit rate even more but time will tell how well that goes.
The cache layer records timing entries with status labels (fresh, stale, miss), we can expose them via X-Unkey-Timing headers.
1< x-unkey-timing: cache_swr{cache=verification_key_by_hash,status=stale}=3.961us
2< x-unkey-timing: cache_swr{cache=workspace_quota,status=fresh}=1.56us
3< x-unkey-timing: cache_swr{cache=verification_key_by_hash,status=stale}=2.44usHere we can see we made 3 cache lookups, one for the root key, one for workspace quotas and one more for a user key.
That gives us per-request visibility into:
The core idea is not one trick. It is a stack of decisions that reinforce each other:
150,000 requests per month. No CC required.