fix(presence-sync): tick mutex so setInterval overlap cannot spawn parallel ticks #8

Merged
hzhang merged 1 commits from fix/presence-sync-tick-mutex into main 2026-05-26 02:06:21 +00:00

What

PresenceSync.tick() iterates accounts serially, awaiting each agent-login + PUT round-trip. With several accounts a single tick can easily run 20+ s (each agent-login is a network round-trip to Center, each PUT a round-trip to Guild). setInterval(intervalMs) does NOT wait for the previous callback to settle: it ignores the promise an async callback returns, so on a busy gateway the next tick fires on top of a still-running one, and two parallel iterations each PUT the same agentId within milliseconds.
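To make the overlap concrete, here is a minimal sketch, not the plugin's code: `measureOverlap` and its parameters are hypothetical names, and the slow tick is simulated with a timer rather than real Center/Guild round-trips. It reports the highest number of tick bodies that were running at the same time.

```typescript
// Hypothetical demo: how many async ticks overlap when each tick takes
// longer than the setInterval period. Names and timings are assumptions.
function measureOverlap(
  intervalMs: number,
  workMs: number,
  durationMs: number,
): Promise<number> {
  let running = 0;
  let maxRunning = 0;

  // Stand-in for a slow tick: "works" for workMs before finishing.
  const tick = async (): Promise<void> => {
    running += 1;
    maxRunning = Math.max(maxRunning, running);
    await new Promise<void>((resolve) => setTimeout(resolve, workMs));
    running -= 1;
  };

  return new Promise<number>((resolve) => {
    // setInterval discards the promise returned by tick(), so it never waits.
    const timer = setInterval(() => {
      void tick();
    }, intervalMs);

    setTimeout(() => {
      clearInterval(timer);
      resolve(maxRunning);
    }, durationMs);
  });
}

// A 35 ms tick on a 10 ms interval: more than one tick is in flight at once.
measureOverlap(10, 35, 100).then((max) => {
  console.log("max concurrent ticks:", max);
});
```

Whenever `workMs` exceeds `intervalMs`, the reported maximum is greater than 1, which is the condition this PR guards against.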

Caught in prod (2026-05-25 23:23:35Z)

The t2 gateway log shows a single "presence-sync started for 5 account(s)" line (single instance), yet paired log lines 4-10 ms apart for the same agent:

23:23:24.334  fabric: presence-sync manager → idle
23:23:24.338  fabric: presence-sync manager → idle              ← 4 ms later, duplicate
23:23:27.961  fabric: presence-sync mentor → idle
23:23:27.967  fabric: presence-sync mentor → idle
23:23:35.311  fabric: presence-sync administrative-secretary → idle
23:23:35.316  fabric: presence-sync PUT administrative-secretary failed: 500   ← race tipped backend's INSERT
23:23:39.049  fabric: presence-sync agent-resource-director → idle
23:23:39.052  fabric: presence-sync agent-resource-director → idle

The 500 is a separate backend bug (race-prone findOne+save in agent-presence.service.ts, fixed in nav/Fabric.Backend.Guild#presence-upsert-race). This PR fixes the plugin-side amplifier.

Fix

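// True while a tick is still awaiting its round-trips.
private inflight = false;

private async tick(): Promise<void> {
  // Drop this beat if the previous tick has not finished yet.
  if (this.inflight) return;
  this.inflight = true;
  try {
    await this.tickInner();
  } finally {
    // Always release the guard, even if tickInner() throws.
    this.inflight = false;
  }
}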
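The guard can be exercised in isolation. The sketch below is self-contained and uses assumed names: `GuardedTicker`, the `started`/`skipped` counters, and the 20 ms timer standing in for `tickInner()` are all hypothetical, added only so the behaviour is observable.

```typescript
// Hypothetical harness around the same inflight guard as in this PR.
class GuardedTicker {
  private inflight = false;
  started = 0; // tick bodies that actually began
  skipped = 0; // ticks dropped because one was already running

  async tick(): Promise<void> {
    if (this.inflight) {
      this.skipped += 1;
      return;
    }
    this.inflight = true;
    try {
      await this.tickInner();
    } finally {
      this.inflight = false;
    }
  }

  // Stand-in for the real per-account loop: counts itself, then waits 20 ms.
  private async tickInner(): Promise<void> {
    this.started += 1;
    await new Promise<void>((resolve) => setTimeout(resolve, 20));
  }
}

const ticker = new GuardedTicker();
const first = ticker.tick(); // starts the body
const second = ticker.tick(); // fires while the first is in flight: dropped
// At this point ticker.started is 1 and ticker.skipped is 1.

// Once the first tick settles, the guard is released and ticks run again.
const third = Promise.all([first, second]).then(() => ticker.tick());
```

Because JavaScript runs the synchronous part of `tick()` to completion before any other callback, the check-and-set on `inflight` cannot interleave with another tick, so a plain boolean is enough here.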

Dropping overlapping ticks is safe: a tick only PUTs when lastStatus !== bridge.get, so a status change that lands during a skipped beat is still pending when the next tick runs and is picked up then. Skipping a beat costs nothing the next beat won't fix.
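A rough illustration of why a dropped beat loses nothing, under loud assumptions: the `bridge` and `lastStatus` maps, the `Status` type, and `syncOnce` are hypothetical stand-ins for the gating described above, and the `puts` array stands in for the PUT to Guild.

```typescript
type Status = "idle" | "busy";

// Hypothetical stand-ins: current status source and last value sent.
const bridge = new Map<string, Status>([["manager", "idle"]]);
const lastStatus = new Map<string, Status>();
const puts: string[] = []; // records what would have been PUT

// One gated sync step: only "PUT" when the status differs from the last sent.
function syncOnce(agentId: string): void {
  const current = bridge.get(agentId);
  if (current === undefined || lastStatus.get(agentId) === current) return;
  puts.push(`${agentId} -> ${current}`);
  lastStatus.set(agentId, current);
}

syncOnce("manager"); // first tick sends "idle"
bridge.set("manager", "busy"); // status changes...
// ...and the beat that would have seen it is dropped by the inflight guard.
syncOnce("manager"); // next tick still sees busy !== idle and sends it
syncOnce("manager"); // nothing changed, nothing sent

console.log(puts); // [ 'manager -> idle', 'manager -> busy' ]
```

The change is delayed by at most one interval, never lost, because the comparison is against the last value sent rather than against the previous tick.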

Sim test

Sim has only 1 fabric account (no second account to overlap on), so the prod symptom can't be reproduced locally. The code change is mechanical, though: a 4-line mutex around the existing tick() body, which is renamed to tickInner. Will validate on prod after deploy: the count of paired log lines within 1 s should drop to 0.
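One possible way to compute that count from a gateway log, as a sketch only: `countPairedLines` and the regular expression are assumptions based on the log excerpt quoted above, not an existing tool. It counts lines that follow another presence-sync line for the same agent within the window, and it does not handle logs that cross midnight.

```typescript
// Matches "HH:MM:SS.mmm  fabric: presence-sync [PUT ]<agent> ..." (assumed format).
const LINE = /^(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+fabric: presence-sync (?:PUT )?(\S+)/;

function countPairedLines(log: string, windowMs = 1000): number {
  const lastSeen = new Map<string, number>(); // agent -> ms since midnight
  let pairs = 0;
  for (const raw of log.split("\n")) {
    const m = LINE.exec(raw.trim());
    if (!m) continue;
    const [, hh = "0", mm = "0", ss = "0", mmm = "0", agent = ""] = m;
    const ms =
      ((Number(hh) * 60 + Number(mm)) * 60 + Number(ss)) * 1000 + Number(mmm);
    const prev = lastSeen.get(agent);
    // Count a pair when the same agent was logged at most windowMs earlier.
    if (prev !== undefined && ms - prev >= 0 && ms - prev <= windowMs) {
      pairs += 1;
    }
    lastSeen.set(agent, ms);
  }
  return pairs;
}

// The excerpt from this PR's description.
const sample = [
  "23:23:24.334  fabric: presence-sync manager → idle",
  "23:23:24.338  fabric: presence-sync manager → idle",
  "23:23:27.961  fabric: presence-sync mentor → idle",
  "23:23:27.967  fabric: presence-sync mentor → idle",
  "23:23:35.311  fabric: presence-sync administrative-secretary → idle",
  "23:23:35.316  fabric: presence-sync PUT administrative-secretary failed: 500",
  "23:23:39.049  fabric: presence-sync agent-resource-director → idle",
  "23:23:39.052  fabric: presence-sync agent-resource-director → idle",
].join("\n");

console.log(countPairedLines(sample)); // 4
```

On the pre-fix excerpt this yields 4 (one pair per agent shown); after the deploy the same count over a comparable window is expected to be 0.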

hzhang added 1 commit 2026-05-26 01:25:47 +00:00
The presence-sync tick iterates accounts serially with await on each
agent-login + PUT round-trip — a single tick can easily run 20+s when
there are several accounts. setInterval(intervalMs) does NOT wait for
the previous tick to finish, so on a busy gateway the next tick fires
on top of a still-running one and two parallel iterations each PUT
the same agentId within ~10 ms. That tipped the guild backend's
first-time-insert race (separate fix in nav/Fabric.Backend.Guild) into
500s on prod (caught in t2 gateway 2026-05-25 23:23:35Z; 6 of 6 agents
showed paired log lines 4-10 ms apart for the same agent → idle).

Fix: a simple `inflight` boolean. tick() returns immediately if
already running; the next interval beat catches up. lastStatus !==
bridge.get gating already means status changes catch the next tick
anyway, so skipping a beat costs nothing the next beat won't fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hzhang merged commit 2acb084ee4 into main 2026-05-26 02:06:21 +00:00
Reference: nav/Fabric.OpenclawPlugin#8