fix(channel): add describeAccount so health-monitor sees real configured state #10

Merged
hzhang merged 1 commit from fix/describe-account-stops-default-restart-loop into main 2026-05-26 15:49:22 +00:00

Why

[fabric:default] health-monitor: restarting (reason: stopped) every ~10 minutes on prod t2 (and sim) since forever. Root cause traced today.

openclaw's channelManager.getRuntimeSnapshot() — what channel-health-monitor reads — runs each account through:

function applyDescribedAccountFields(next, described) {
    if (!described) {
        next.configured ??= true;   // ← defaults TRUE when plugin has no describeAccount
        return next;
    }
    ...
}

Fabric never defined describeAccount, so every cycle:

snapshot = { enabled: true, configured: true, running: false }
isManagedAccount === true
→ reason: "not-running" → "stopped" → restart

For the synthetic default account (listFabricAccountIds falls back to [default] when channels.fabric.accounts is empty — the prod shape), running is permanently false because fabric's gateway.startAccount is absent, so startChannelInternal returns early. The restart action is a no-op — but the log noise wasted real triage time today chasing it as a real failure.

(The status path is different: the account snapshot behind openclaw status --json goes through buildChannelAccountSnapshotFromAccount, which DOES call isConfigured(account) and so correctly reported configured: false. That's why the CLI displayed "Fabric default: enabled, not configured" while health-monitor saw the opposite.)
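
To make the mismatch concrete, here is a minimal sketch of the two snapshot paths. It is a simplified model, not openclaw's actual code: the `AccountSnapshot` and `FabricAccountSketch` types, the `*Sketch` helpers, and the handling of the described branch (elided in the snippet above) are assumptions for illustration only.

```typescript
// Simplified, hypothetical model of the two snapshot paths; not openclaw's code.
type AccountSnapshot = { enabled: boolean; configured?: boolean; running: boolean };
type FabricAccountSketch = { accountId: string; fabricApiKey?: string }; // assumed shape

// Assumption: isConfigured checks for an API key, as describeAccount in this PR does.
const isConfiguredSketch = (account: FabricAccountSketch): boolean =>
  Boolean(account.fabricApiKey);

// Health-monitor path: with no describeAccount result, `configured` defaults to true.
function healthMonitorSnapshot(described?: { configured?: boolean }): AccountSnapshot {
  const next: AccountSnapshot = { enabled: true, running: false };
  if (!described) {
    next.configured ??= true;
    return next;
  }
  // Assumption: the described value is copied over (the real branch is elided above).
  if (described.configured !== undefined) {
    next.configured = described.configured;
  }
  return next;
}

// Status path: asks the plugin directly through isConfigured.
function statusSnapshot(account: FabricAccountSketch): AccountSnapshot {
  return { enabled: true, configured: isConfiguredSketch(account), running: false };
}

const defaultAccount: FabricAccountSketch = { accountId: "default" };
const monitorView = healthMonitorSnapshot(undefined); // plugin has no describeAccount
const statusView = statusSnapshot(defaultAccount);

console.log(monitorView.configured, statusView.configured); // prints: true false
```

The same keyless account therefore looks configured to the health monitor and unconfigured to the CLI, which matches the behaviour described above.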

What

Add describeAccount mirroring isConfigured:

describeAccount: (account: ResolvedFabricAccount) => ({
  configured: Boolean(account.fabricApiKey),
}),

Real per-agent accounts (managed via ~/.openclaw/fabric-identity.json on prod) still go through gateway_start → FabricInbound.start() as before. The framework just no longer thinks default (or any keyless account) is something it should restart.
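
As a quick illustration of what the callback returns, the sketch below runs it against a few inputs. `ResolvedFabricAccountSketch` is a hypothetical stand-in for `ResolvedFabricAccount` (only `fabricApiKey` comes from this PR), and the account ids and key value are made up.

```typescript
// Hypothetical account shape; only `fabricApiKey` is taken from this PR.
interface ResolvedFabricAccountSketch {
  accountId: string;
  fabricApiKey?: string;
}

// Same body as the describeAccount added in this PR.
const describeAccountSketch = (account: ResolvedFabricAccountSketch) => ({
  configured: Boolean(account.fabricApiKey),
});

// Synthetic default account: no key at all.
const keyless = describeAccountSketch({ accountId: "default" });
// An empty string is falsy, so it also counts as not configured.
const emptyKey = describeAccountSketch({ accountId: "default", fabricApiKey: "" });
// Made-up account with a made-up key.
const keyed = describeAccountSketch({
  accountId: "example-agent",
  fabricApiKey: "example-key",
});

console.log(keyless.configured, emptyKey.configured, keyed.configured);
// prints: false false true
```

Note that `Boolean(...)` treats an empty-string key the same as a missing one, so a blank config value would not flip the account to configured.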

Verification (sim)

Temporary console.log instrumentation in channel-health-policy-D_eDwUBm.js confirmed:

  • Pre-patch: evaluateChannelHealth got {enabled:true, configured:true, accountId:"default"} every 10min → restart fired
  • Post-patch: {enabled:true, configured:false, accountId:"default"} every cycle → isManagedAccount=false → unmanaged → no restart

Sim gateway up 8+ minutes with the patch, 0 restart events (pre-patch sim with same config restarted at 5min mark).
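
The pre-patch and post-patch outcomes can be reproduced with a small model of the health decision. This is a sketch under stated assumptions: the real `evaluateChannelHealth` and `isManagedAccount` were not read for this example, and the rule "managed means enabled and configured" is inferred from the two observations listed above.

```typescript
// Hypothetical model of the health decision; names ending in Sketch are not openclaw's.
type HealthInput = { enabled: boolean; configured: boolean; running: boolean };
type HealthDecision = { managed: boolean; restart: boolean; reason: string };

// Assumption: an account is managed only when it is both enabled and configured.
function isManagedAccountSketch(s: HealthInput): boolean {
  return s.enabled && s.configured;
}

function evaluateChannelHealthSketch(s: HealthInput): HealthDecision {
  if (!isManagedAccountSketch(s)) {
    // Unmanaged accounts are left alone.
    return { managed: false, restart: false, reason: "unmanaged" };
  }
  if (!s.running) {
    // Managed but not running: reported as stopped and restarted.
    return { managed: true, restart: true, reason: "stopped" };
  }
  return { managed: true, restart: false, reason: "healthy" };
}

// Pre-patch snapshot for the synthetic default account.
const prePatch = evaluateChannelHealthSketch({
  enabled: true,
  configured: true,
  running: false,
});
// Post-patch snapshot: describeAccount now reports configured:false.
const postPatch = evaluateChannelHealthSketch({
  enabled: true,
  configured: false,
  running: false,
});

console.log(prePatch.restart, postPatch.restart); // prints: true false
```

Under this model the only input that changes between the two runs is `configured`, which is the field the new `describeAccount` controls.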

Pending separately

Not covered here:

  • _fabricInboundStarted global guard prevents plugin-only reload from re-binding inbound (needs full gateway restart to pick up plugin code changes). Out of scope.
  • For configurations that DO register real accounts under channels.fabric.accounts, those accounts get running:false for the same reason default does — fabric doesn't implement gateway.startAccount lifecycle. That would still restart. The current prod doesn't use that shape (everything goes through identity registry), so this PR makes the existing prod silent. Refactoring to the framework lifecycle is a bigger separate piece.

🤖 Generated with Claude Code

hzhang added 1 commit 2026-05-26 15:49:21 +00:00
openclaw's `channelManager.getRuntimeSnapshot()` — called every minute
by the channel-health-monitor — runs accounts through
`applyDescribedAccountFields(next, plugin.config.describeAccount?.(...))`.
When the callback is missing it defaults `configured: true`. Fabric
never defined it, so every health-monitor cycle:

  snapshot = { enabled: true, configured: true, running: false }

For fabric's synthetic 'default' account (returned by
`listFabricAccountIds` when `channels.fabric.accounts` is empty —
the prod shape, where per-agent api-keys live in
`~/.openclaw/fabric-identity.json` and the channel framework never
runs `startAccount` so `running` stays false):

  isManagedAccount({enabled:true, configured:true}) === true
  -> not-running -> 'stopped' -> restart every ~10 min, logging
  '[fabric:default] health-monitor: restarting (reason: stopped)'

The restart is a no-op (fabric's `gateway.startAccount` is absent so
`startChannelInternal` returns early), but the log is loud and
operators chasing real outages keep wasting time on it.

Mirror `isConfigured` from describeAccount so the snapshot
truthfully reports configured:false for any account without a
fabricApiKey. The fabric plugin still self-manages real agents via
`gateway_start` -> `FabricInbound.start()`; the framework just no
longer thinks 'default' is something it should restart.

Verified in sim (this patch alone, no debug instrumentation):
- gateway up 8+ minutes, 0 restart events
- pre-patch sim with same config restarted at 5min mark
- evaluateChannelHealth snapshot for both 'default' and 'recruiter'
  accountId reads configured:false (instrumented with temporary
  console.log in channel-health-policy, since reverted)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hzhang merged commit b659dadb9e into main 2026-05-26 15:49:22 +00:00

Reference: nav/Fabric.OpenclawPlugin#10