fix(agent-presence): upsert atomically — kill first-time-insert race #3

Merged
hzhang merged 1 commits from fix/presence-upsert-race into main 2026-05-26 02:06:20 +00:00
Contributor

What

AgentPresenceService.setStatus() was a classic read-modify-write race:

const existing = await this.repo.findOne({ where: { userId } });
if (existing) { /* UPDATE */ }
else { /* INSERT */ }

Two concurrent first-time writes for the same userId:

  1. T0: req1 → findOne → undefined → INSERT
  2. T0+10ms: req2 → findOne → undefined → INSERTDuplicate entry '<userId>' for key 'agent_presences.PRIMARY' → 500

Caught in prod (2026-05-25 23:23:35Z)

PUT /api/agents/fdf37ed8-5a69-4229-a23e-b4e480a9829d/presence  200  23:23:35.205Z
PUT /api/agents/fdf37ed8-5a69-4229-a23e-b4e480a9829d/presence  500  23:23:35.215Z

[ExceptionsHandler] Duplicate entry 'fdf37ed8-5a69-4229-a23e-b4e480a9829d' for key 'agent_presences.PRIMARY'
QueryFailedError ... at MysqlQueryRunner.js:168

The 10-ms-apart pair comes from Fabric.OpenclawPlugin's presence-sync emitting two PUTs from overlapping ticks (nav/Fabric.OpenclawPlugin#presence-sync-tick-mutex fixes the plugin side). Even with that fix landing, this race-prone code path stays exposed to any future caller (hf-plugin, manual curl, monitoring, etc.), so the backend should be the source of truth for atomicity.

Fix

async setStatus(userId, status, source) {
  await this.repo.upsert({ userId, status, source }, ['userId']);
  return this.repo.create({ userId, status, source });
}

repo.upsert() compiles to MySQL INSERT … ON DUPLICATE KEY UPDATE — atomic at the storage engine, no read needed, no race window. Return the synthesized entity since the controller only reads {userId, status} off it (no SELECT round-trip).

Sim test

Rebuilt fabric-backend-guild1 from this branch, fired 5 parallel PUTs to a fresh userId:

Old code This PR
HTTP 200 1 5
HTTP 500 4 0
Guild log "Duplicate entry" 4 0
## What `AgentPresenceService.setStatus()` was a classic read-modify-write race: ```ts const existing = await this.repo.findOne({ where: { userId } }); if (existing) { /* UPDATE */ } else { /* INSERT */ } ``` Two concurrent first-time writes for the same `userId`: 1. T0: req1 → `findOne` → undefined → `INSERT` 2. T0+10ms: req2 → `findOne` → undefined → `INSERT` → **`Duplicate entry '<userId>' for key 'agent_presences.PRIMARY'`** → 500 ## Caught in prod (2026-05-25 23:23:35Z) ``` PUT /api/agents/fdf37ed8-5a69-4229-a23e-b4e480a9829d/presence 200 23:23:35.205Z PUT /api/agents/fdf37ed8-5a69-4229-a23e-b4e480a9829d/presence 500 23:23:35.215Z [ExceptionsHandler] Duplicate entry 'fdf37ed8-5a69-4229-a23e-b4e480a9829d' for key 'agent_presences.PRIMARY' QueryFailedError ... at MysqlQueryRunner.js:168 ``` The 10-ms-apart pair comes from `Fabric.OpenclawPlugin`'s presence-sync emitting two PUTs from overlapping ticks ([nav/Fabric.OpenclawPlugin#presence-sync-tick-mutex](https://git.hangman-lab.top/nav/Fabric.OpenclawPlugin/pulls) fixes the plugin side). Even with that fix landing, this race-prone code path stays exposed to any future caller (hf-plugin, manual curl, monitoring, etc.), so the backend should be the source of truth for atomicity. ## Fix ```ts async setStatus(userId, status, source) { await this.repo.upsert({ userId, status, source }, ['userId']); return this.repo.create({ userId, status, source }); } ``` `repo.upsert()` compiles to MySQL `INSERT … ON DUPLICATE KEY UPDATE` — atomic at the storage engine, no read needed, no race window. Return the synthesized entity since the controller only reads `{userId, status}` off it (no SELECT round-trip). ## Sim test Rebuilt `fabric-backend-guild1` from this branch, fired 5 parallel PUTs to a fresh `userId`: | | Old code | This PR | |---|---|---| | HTTP 200 | 1 | **5** | | HTTP 500 | 4 | **0** | | Guild log "Duplicate entry" | 4 | **0** |
hzhang added 1 commit 2026-05-26 01:25:46 +00:00
Previous setStatus() did read-modify-write:
  findOne → if-exists save / else create+save

Two concurrent first-time writes for the same userId both saw no row,
both INSERT'd, second hit unique-key (agent_presences.PRIMARY) and 500'd
with "Duplicate entry '<userId>' for key 'agent_presences.PRIMARY'" —
visible in prod (2026-05-25 23:23:35Z) when Fabric.OpenclawPlugin's
presence-sync emitted two PUTs ~10 ms apart for the same agent (its
tick-overlap is being fixed separately in nav/Fabric.OpenclawPlugin).

Replace with repo.upsert(values, ['userId']) — compiles to MySQL
`INSERT … ON DUPLICATE KEY UPDATE`, atomic at the storage engine,
no read needed, no race window. Synthesize the returned entity from
the values we just wrote rather than a SELECT round-trip; controller
only reads {userId, status} off it.

Sim verified with 5 parallel PUTs to a fresh userId: all 200, no
Duplicate errors in guild log (was: 1 × 200 + 4 × 500 with the
old code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hzhang merged commit 3f77c0e35d into main 2026-05-26 02:06:20 +00:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: nav/Fabric.Backend.Guild#3