Inside RaftLock: the state machine behind the lock
Heaps for expiration and priority, an LRU read cache, and the snapshot detail that channels can't survive.
On this page
Raft guarantees every node sees the same log of commands in the same order. But a log of commands isn’t a lock — something has to execute those commands and enforce what “holding a lock” actually means. That something is the lock package: RaftLock’s state machine, the layer that turns a consensus stream into mutual exclusion.
It implements the business logic:
- State management — the in-memory truth of who holds which lock, for how long, and who’s waiting.
- Lifecycle enforcement — acquisition, renewal, release, and expiration, never violating mutual exclusion.
- Wait management — blocked clients held in priority queues, woken the instant a lock frees up.
Driven deterministically by Raft
The LockManager is deliberately not in charge of when things happen — Raft is. Everything flows through a single Apply method:
- Decode — Raft commits an entry (“Client A acquires Lock X”);
Applydecodes the command bytes. - Idempotency check — it compares against
lastAppliedIndexso the same command never executes twice. This is non-negotiable for replay safety: when a node restarts and Raft replays its log, applying a command a second time would corrupt the state. - Execute — it switches on the operation (
Acquire,Release,Renew,EnqueueWaiter) and mutates the state maps. - Respond — the result goes back to Raft, which routes it to the client.
Because every node runs the same Apply over the same ordered log, every node reaches the same state. That’s the entire contract.
Expiring locks without scanning everything
Locks have TTLs, so something has to find and expire them. The naive approach — scan every lock on a timer — is O(N) and gets worse exactly when you have the most locks. I didn’t want CPU usage that climbs with load, so I used a min-heap, the expirationHeap, ordered by ExpiresAt:
- finding the next lock to expire is
O(1)(it’s the top of the heap), - removing it is
O(log N), - a periodic
Tickchecks only the top — if the soonest-to-expire lock is still valid, the loop stops immediately.
So with thousands of active locks, the expiration sweep does a constant amount of work per tick unless something’s actually expiring. The CPU cost tracks real expirations, not the size of the lock table.
Fair, preemptable waiting
RaftLock supports blocking acquires, and not as a plain FIFO line. The waitQueue is a priority heap:
- Ranking — waiters are ordered by
Priorityfirst (higher wins), then byEnqueuedtime, so ties break FIFO and fairness is preserved within a priority level. - Promotion — when a lock releases or expires,
tryPromoteWaiterLockedpops the highest-priority waiter and immediately grants ownership, notifying them over a Go channel (notifyCh). - Cleanup — a
pendingWaitsmap ties each client to its heap index, so cancelling a wait isO(1)instead of a linear scan.
Priority plus FIFO-within-priority gives you something that’s both fair and expressive — urgent work can jump the queue without starving everyone else.
An LRU cache for reads
The manager protects its state with a single central sync.RWMutex. That’s simple and safe, but heavy read traffic (GetLockInfo) can contend with writes (Acquire/Release). So reads get an optional LRU cache:
- it stores copies of
LockInfoobjects, and reads check it first, - it’s invalidated precisely — whenever a Raft command modifies a lock,
cache.Invalidatedrops the stale entry.
Because invalidation is driven by the same committed commands that change state, readers almost never see stale data, and the central mutex sees far less contention. It’s an optimization that’s safe specifically because it’s wired into the deterministic apply path.
A tour of the code
Modular by responsibility:
lock.go— the coreLockManager: theApplyloop, the state maps (locks,waiters), and the central mutex.waiter.go— thewaitQueuepriority heap and thewaiterstruct with its notification channels.expiration.go— theexpirationHeapforO(1)access to the next expiring lock.cache.go— the thread-safe LRU cache.filters.go— functional options for querying locks (FilterByOwner,FilterExpiringSoon).
The snapshot detail I had to think hardest about
Snapshotting lets Raft truncate its log, so the LockManager has to serialize its entire state. The lockSnapshot struct captures Locks, Waiters, and LastAppliedIndex as JSON. On RestoreSnapshot, I rebuild the optimization structures — the expirationHeap and the clientLocks map — from the flat list of locks.
But here’s the catch that took me a while to fully appreciate: notification channels can’t be serialized. A notifyCh is a transient in-memory Go construct, not data. So if a node restarts and restores from a snapshot, any pending local waits are reset — the channels they’d be notified on no longer exist. The persisted lock state is perfectly consistent; the ephemeral “I’m a client blocked waiting right here” relationship is not, because it never could be. Recognizing which parts of your state are durable data and which are transient runtime wiring is, it turns out, a core skill in building state machines.
The trade-offs
- Complexity for performance. Heaps for both expiration and waiting add real code complexity — managing heap indices in
waiter.goandexpiration.gois fiddly — but they keep the system atO(log N)under heavy load instead of degrading toO(N). For the hot path, worth it. - Memory for consistency. The entire lock state lives in RAM. That makes applying Raft commands extremely fast, but it bounds the maximum number of active locks to available server memory. A deliberate ceiling.
- Granular locking. One
sync.RWMutexfor the manager technically serializes command execution. I mitigated it by keeping critical sections tiny (map updates only) and offloading reads to the cache — simple and safe beats clever and racy.
Closing the loop
This is the layer where everything else pays off. The storage engine makes the log durable, the consensus engine makes it agreed-upon and ordered, the server and client make it usable — and right here, finally, an ordered stream of bytes becomes the simple promise a developer actually wanted: you, and only you, hold this lock right now.
The takeaway from building it: a state machine’s hardest job isn’t the logic, it’s being honest about what is durable data and what is just runtime wiring that has to be rebuilt. Get that boundary right and replay, restart, and snapshot all just work. If you’ve read this far, go break the live cluster — and the code is on GitHub.