fix(inbound): refresh socket.io auth on (re)connect via callback #9
Why
`bf3b74f6` (nav/manager) silently stopped receiving DMs ~15 minutes after a gateway start. The backend log (sim & prod) shows the handshake being rejected (`socket rejected: <id>`), while the plugin-side log says `fabric: agent X joined N channel(s)` and `channel resync +1 -0`. The `connect` event fires on the client because TCP succeeded, but the backend rejected auth at the application layer and never put the socket into any channel room. Result: the backend emits `message.created` into rooms with zero subscribers, and the agent looks unresponsive.

Root cause is one line in `connectAgent`: the static `auth: { token }` object is captured once, so every auto-reconnect resends the same token. socket.io supports a callback form, which is re-invoked on every (re)connect.
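The stale-token failure described above can be reproduced in a toy model. The sketch below is not socket.io and not the plugin's code: `issueToken`, `backendAccepts`, `handshake`, the `jwt@<issuedAt>` token format and the fake clock are all hypothetical stand-ins. The only behaviour it borrows from the client is that an `auth` function is called again on each connection attempt, while an `auth` object is reused as-is.

```typescript
// Toy model only: NOT socket.io, just the one behaviour that matters here.
type AuthPayload = { token: string };
type AuthOption = AuthPayload | ((cb: (data: AuthPayload) => void) => void);

const TTL_SECONDS = 900; // guildAccessToken TTL quoted in this PR

let now = 0; // fake clock, in seconds

// Hypothetical token format "jwt@<issuedAt>" so expiry is trivial to check.
function issueToken(): string {
  return `jwt@${now}`;
}

// Stand-in for the backend's application-layer auth check.
function backendAccepts(token: string): boolean {
  const issuedAt = Number(token.split("@")[1]);
  return now - issuedAt < TTL_SECONDS;
}

// Resolves the handshake payload once per connection attempt:
// a function is invoked again, an object is reused unchanged.
function handshake(auth: AuthOption): boolean {
  let accepted = false;
  if (typeof auth === "function") {
    auth((data) => {
      accepted = backendAccepts(data.token);
    });
  } else {
    accepted = backendAccepts(auth.token);
  }
  return accepted;
}

// Static form: the token is captured once, at construction time.
const staticAuth: AuthOption = { token: issueToken() };
// Callback form: a token is minted on every attempt.
const callbackAuth: AuthOption = (cb) => cb({ token: issueToken() });

const firstStatic = handshake(staticAuth);
const firstCallback = handshake(callbackAuth);

now += 1000; // a reconnect more than 900 s after the first connect

const reconnectStatic = handshake(staticAuth);
const reconnectCallback = handshake(callbackAuth);

console.log({ firstStatic, firstCallback, reconnectStatic, reconnectCallback });
```

Both forms pass the first handshake. After the clock moves past the TTL, only the callback form is accepted, which matches the "works for ~15 minutes, then goes silent" symptom.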
What
`auth: { token }` → `auth: (cb) => freshGuildToken(...).then(t => cb({ token: t ?? tok }))`

Same `freshGuildToken` we already use to keep the HTTP path (attachment download, reply post) un-stale via `tokenCache`. No new code paths.

Verification (sim)
- `harborforge-backend` rebuilt to prod commit d2b83ad equivalent, fabric plugin replaced with this patch.
- New DM channel `5745567f-…` between `hzhang@sim.local` and `recruiter`: `[plugins] fabric: dispatch agent=recruiter` → `deliver agent=recruiter len=458` → recruiter actually replied ✓
- `docker restart fabric-backend-guild` to force a socket disconnect:
  - `[plugins] fabric: agent recruiter joined 12 channel(s) on sim-guild-1` re-fires within 30s ✓
  - `RealtimeGateway socket connected: t1pVmKtsWJAjFoeyAAAB` (no `socket rejected`) ✓

Pending separately
- `_fabricInboundStarted` global guard makes plugin-only reloads (without gateway restart) no-ops. Sim test required a full `pkill openclaw` to pick up the new plugin code. Out of scope for this PR.
- `health-monitor` keeps restarting `fabric:default` because `setup-entry.ts:inspect(cfg)` returns `configured: true` whenever `centerApiBase` is set, even though no real account is bound to `default`. Out of scope for this PR.

🤖 Generated with Claude Code
Backend issues short-lived guildAccessToken (TTL=900s). The previous `auth: { token: tok }` shape captured the JWT once in connectAgent's closure: after socket.io's auto-reconnect the backend kept getting the same expired JWT and silently rejected the handshake at the application layer (RealtimeGateway logs 'socket rejected: <id>').

The client's 'connect' event still fired (TCP succeeded) so the plugin happily ran the channel-resync, emitted join_channel into the void, and logged 'joined N channel(s)' while the backend was actually broadcasting message.created to a room with zero subscribers.

End-user symptom: DMs/group messages to agents silently dropped 15 min after gateway start, with no error anywhere on the agent side.

Switch to the callback form, which socket.io re-evaluates on every (re)connect; same call site we already use for the HTTP path via freshGuildToken/tokenCache.

Verified in sim (commit 2acb084 + this patch):
1. Connect new DM channel + post msg -> dispatch + reply ✓
2. `docker restart fabric-backend-guild` to force socket disconnect
3. Plugin reconnects automatically and logs 'fabric: agent recruiter joined 12 channel(s) on sim-guild-1' ✓ (without the fix this reconnect was silently rejected; sim used to log 'WARN socket rejected: <id>' on the guild backend)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
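For illustration, the callback shape the commit switches to could be factored roughly as below. This is a sketch under assumptions, not the plugin's actual code: `makeAuthCallback`, `resolveAuth` and the stubbed refreshers are hypothetical names, and `fetchFresh` only stands in for `freshGuildToken(...)`, whose real signature is not shown here. The rejection handler is an addition beyond the one-line change in this PR. It answers the handshake with the captured token if the refresh fails, instead of leaving the callback uncalled.

```typescript
type AuthPayload = { token: string };
type AuthCallback = (cb: (data: AuthPayload) => void) => void;

// Builds an auth callback that fetches a token on every invocation.
// `fetchFresh` is a stand-in for freshGuildToken(...); `fallbackToken`
// plays the role of the `tok` captured at connect time.
function makeAuthCallback(
  fetchFresh: () => Promise<string | null | undefined>,
  fallbackToken: string,
): AuthCallback {
  return (cb) => {
    fetchFresh().then(
      // Same fallback as the PR: use the captured token if the refresh is empty.
      (t) => cb({ token: t ?? fallbackToken }),
      // Assumption beyond the PR: on a rejected refresh, still complete the
      // handshake with the captured token rather than never calling cb.
      () => cb({ token: fallbackToken }),
    );
  };
}

// Test helper: adapts the callback style to a Promise.
function resolveAuth(auth: AuthCallback): Promise<AuthPayload> {
  return new Promise((resolve) => auth(resolve));
}

// Usage with stubbed refreshers (all values hypothetical).
const results = Promise.all([
  resolveAuth(makeAuthCallback(async () => "fresh", "captured")),
  resolveAuth(makeAuthCallback(async () => null, "captured")),
  resolveAuth(
    makeAuthCallback(async () => {
      throw new Error("refresh failed");
    }, "captured"),
  ),
]);

results.then((payloads) => console.log(payloads.map((p) => p.token)));
```

With socket.io-client, a callback like this would be passed as the `auth` option, for example `io(url, { auth: makeAuthCallback(() => freshGuildToken(...), tok) })`. Whether falling back to a possibly expired token is preferable to failing the handshake visibly is a design choice this PR does not address.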