fix(inbound): refresh socket.io auth on (re)connect via callback #9

Merged
hzhang merged 1 commit from fix/socket-auth-callback-refresh into main 2026-05-26 12:51:04 +00:00

Why

bf3b74f6 (nav/manager) silently stopped receiving DMs ~15 minutes after a gateway start. Backend log (sim & prod) shows:

10:30:51 connected: 5 sockets (×5 agents)
11:04:22 disconnected: ALL 5         ← guildAccessToken TTL hit (900s)
11:04:41 socket rejected ×6          ← reconnect retried with same stale JWT
11:16:46 connected: 1 socket         ← one happened to land a fresh token
11:55:34 disconnected                ← that one expired too
11:55:38–42 flap connect/disconnect

Plugin-side log meanwhile says fabric: agent X joined N channel(s) and channel resync +1 -0 — the connect event fires on the client because TCP succeeded, but the backend rejected auth at the application layer and never put the socket into any channel room. Result: backend emits message.created into rooms with zero subscribers; agent looks unresponsive.

Root cause is one line in connectAgent:

const socket = io(`${g.endpoint}/realtime`, {
  auth: { token: tok },     // ← captured in closure at first connect, never refreshed
  ...
});

socket.io supports a callback form for auth, which it re-invokes on every (re)connect.
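
To see why the captured token goes stale, the handshake can be modelled without socket.io at all. This is a minimal sketch: `issueToken`, `backendAccepts`, `staticAuth`, `callbackAuth`, and the connect times are hypothetical stand-ins, and only the 900 s TTL is taken from the backend log above.

```typescript
// Hypothetical model: a token is just its issue time, and the backend
// rejects it once it is older than the TTL.
const TTL_SECONDS = 900; // guildAccessToken TTL from the backend log

interface Token { issuedAt: number }

const issueToken = (now: number): Token => ({ issuedAt: now });
const backendAccepts = (t: Token, now: number): boolean =>
  now - t.issuedAt < TTL_SECONDS;

// Object form: the token is captured once and replayed on every reconnect.
function staticAuth(first: Token): (now: number) => Token {
  return () => first;
}

// Callback form: a token is obtained at each (re)connect attempt.
function callbackAuth(): (now: number) => Token {
  return (now) => issueToken(now);
}

const connectTimes = [0, 600, 1000, 2500]; // seconds since gateway start (made up)
const stale = staticAuth(issueToken(0));
const fresh = callbackAuth();

// Object form: accepted, accepted, rejected, rejected.
const staticResults = connectTimes.map((t) => backendAccepts(stale(t), t));
// Callback form: accepted at every attempt.
const callbackResults = connectTimes.map((t) => backendAccepts(fresh(t), t));

console.log(staticResults, callbackResults);
```

In this model every reconnect after the 900 s mark fails with the object form, which matches the "socket rejected" entries in the log.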

What

  • auth: { token } → auth: (cb) => freshGuildToken(...).then(t => cb({ token: t ?? tok }))
  • Same freshGuildToken we already use to keep the HTTP path (attachment download, reply post) un-stale via tokenCache. No new code paths.
  • ts source + emitted dist both updated.
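
The change in the first bullet can be sketched in isolation as follows. socket.io-client accepts `auth` either as an object or as a function that receives a callback. Everything else here (`makeAuthCallback`, `fetchToken`, the stub token strings) is hypothetical and is not the plugin's code. The rejection handler is an assumption added for illustration: if `freshGuildToken` can reject rather than resolve to `undefined`, the one-line form above would never call `cb`.

```typescript
type AuthPayload = { token: string };
// Shape of the function form of socket.io-client's `auth` option.
type AuthFn = (cb: (data: AuthPayload) => void) => void;

// Hypothetical helper: builds an auth function that asks for a token
// at each handshake instead of capturing one up front.
function makeAuthCallback(
  fetchToken: () => Promise<string | undefined>, // stand-in for freshGuildToken(...)
  fallback: string,                              // token captured at first connect
): AuthFn {
  return (cb) => {
    fetchToken().then(
      (t) => cb({ token: t ?? fallback }),
      () => cb({ token: fallback }), // assumption: fall back rather than stall the handshake
    );
  };
}

// Simulate two handshakes with a stub fetcher that returns a new token each time.
let issued = 0;
const auth = makeAuthCallback(async () => `jwt-${++issued}`, "jwt-initial");

const seen: string[] = [];
const handshake = () =>
  new Promise<void>((resolve) =>
    auth(({ token }) => {
      seen.push(token);
      resolve();
    }),
  );

const done = handshake()
  .then(handshake)
  .then(() => {
    console.log(seen.join(",")); // jwt-1,jwt-2
    return seen;
  });
```

With socket.io-client, a function like this would be passed as the `auth` option in place of the object, so each reconnect attempt sends whatever the fetcher returns at that moment.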

Verification (sim)

  1. harborforge-backend rebuilt to prod commit d2b83ad equivalent, fabric plugin replaced with this patch.
  2. New DM channel 5745567f-… between hzhang@sim.local and recruiter:
    • second msg → [plugins] fabric: dispatch agent=recruiter → deliver agent=recruiter len=458 → recruiter actually replied ✓
  3. docker restart fabric-backend-guild to force a socket disconnect:
    • [plugins] fabric: agent recruiter joined 12 channel(s) on sim-guild-1 re-fires within 30s ✓
    • Backend log: RealtimeGateway socket connected: t1pVmKtsWJAjFoeyAAAB (no socket rejected) ✓

Pending separately

  • _fabricInboundStarted global guard makes plugin-only reloads (without gateway restart) no-ops. Sim test required full pkill openclaw to pick up the new plugin code. Out of scope for this PR.
  • health-monitor keeps restarting fabric:default because setup-entry.ts:inspect(cfg) returns configured: true whenever centerApiBase is set, even though no real account is bound to default. Out of scope for this PR.
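
The first pending item can be illustrated with a hypothetical sketch of a process-wide start guard. Only the `_fabricInboundStarted` name comes from the item above. How the plugin actually sets and checks the flag is an assumption, as are `startInbound` and the build labels.

```typescript
// Assumed shape: a flag on globalThis, which survives module reloads
// within the same process.
const g = globalThis as typeof globalThis & { _fabricInboundStarted?: boolean };

const active: string[] = []; // records which plugin build actually started listeners

function startInbound(build: string): boolean {
  if (g._fabricInboundStarted) return false; // a reloaded plugin lands here and does nothing
  g._fabricInboundStarted = true;
  active.push(build);
  return true;
}

const firstLoad = startInbound("old-build");  // starts the inbound listeners
const reload = startInbound("patched-build"); // no-op: the flag is still set in this process

console.log(firstLoad, reload, active);
```

In this model only a full process restart clears the flag, which is consistent with the sim test needing `pkill openclaw` to pick up the patched plugin.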

🤖 Generated with Claude Code

hzhang added 1 commit 2026-05-26 12:50:58 +00:00
Backend issues short-lived guildAccessToken (TTL=900s). The previous
`auth: { token: tok }` shape captured the JWT once in connectAgent's
closure: after socket.io's auto-reconnect the backend kept getting the
same expired JWT and silently rejected the handshake at the application
layer (RealtimeGateway logs 'socket rejected: <id>'). The client's
'connect' event still fired (TCP succeeded) so the plugin happily ran
the channel-resync, emitted join_channel into the void, and logged
'joined N channel(s)' while the backend was actually broadcasting
message.created to a room with zero subscribers. End-user symptom:
DMs/group messages to agents silently dropped 15 min after gateway
start, with no error anywhere on the agent side.

Switch to the callback form, which socket.io re-evaluates on every
(re)connect — same call site we already use for the HTTP path via
freshGuildToken/tokenCache.

Verified in sim (commit 2acb084 + this patch):
1. Connect new DM channel + post msg -> dispatch + reply ✓
2. `docker restart fabric-backend-guild` to force socket disconnect
3. Plugin reconnects automatically and logs
   'fabric: agent recruiter joined 12 channel(s) on sim-guild-1' ✓
   (without the fix this reconnect was silently rejected; sim used to
    log 'WARN socket rejected: <id>' on the guild backend)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hzhang merged commit d47d3467df into main 2026-05-26 12:51:04 +00:00

Reference: nav/Fabric.OpenclawPlugin#9