Merge pull request 'perf(meta-push): use cached api.config instead of deprecated loadConfig() — kills ~25% chronic baseline CPU' (#11 ) from fix/meta-push-use-cached-api-config into main

perf(meta-push): use cached api.config instead of deprecated loadConfig()
`pushMetaToMonitor` and `resolveAgentId` were both calling `api.runtime?.config?.loadConfig?.()` to read the agent list. That deprecated path (openclaw warns at gateway start: "plugin runtime config.loadConfig() is deprecated; use config.current()") synchronously rebuilds the full plugin-metadata snapshot: realpathSync walks every plugin's package.json + manifest + source up the directory tree, hashWatchedFiles fingerprints every watched plugin file, and discoverInDirectory re-scans every `dist/extensions/<plugin>` (~100 of them on prod t2). Each rebuild costs ~6-7s of gateway CPU.

`pushMetaToMonitor` fires every `reportIntervalSec` (default 30s) from `hooks/gateway-start.js`. With 100 plugins that put the gateway into a chronic ~22-30% CPU baseline even with zero agent activity.

V8 profile 2026-05-27 08:14:00 60s window (0 turns, 2 metadata pushes during): lstat 44.2%, statSync(buildInstalledManifestRegistryIndexKey) 6.9%, hashWatchedFiles via memo key 1.7%, all routed through `readPersistedInstalledPluginIndexInstallRecordsSync` -> per-plugin `discoverInDirectory`.

Switching to `(api as any).config ?? api.runtime?.config?.loadConfig?.()` reads from the snapshot cache the gateway already maintains, which is the same pattern already used elsewhere in this file (e.g. the calendar wakeAgent dispatcher at line 284). Same change applied to `resolveAgentId` (only runs once at start, but same anti-pattern).

This is a plugin-side perf workaround. The underlying openclaw bug is that `loadConfig()` rebuilds the snapshot rather than returning the cached one: a chronic 'all sync cache validity checks pay the full discovery cost' design issue worth pushing upstream separately (the per-call walk cost we measured here is separate from, and amplifies, any agent-turn-triggered walk path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
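The fallback read described in the commit message can be sketched in isolation. This is a minimal illustration, not the plugin's code: `PluginApiLike`, `AgentsConfig`, `readConfig`, and `firstAgentId` are hypothetical names, and the sketch assumes `api.config` is a plain property holding the cached snapshot. The loader is wrapped in a counter to show that the right-hand side of `??` is not evaluated when the cached value is present.

```typescript
// Hypothetical, trimmed-down shapes; the real PluginAPI type lives in openclaw.
interface AgentsConfig {
  agents?: { list?: Array<{ id?: string }>; defaults?: { id?: string } };
}

interface PluginApiLike {
  // Assumption: the cached snapshot is exposed as a plain property.
  config?: AgentsConfig;
  runtime?: { config?: { loadConfig?: () => AgentsConfig | undefined } };
}

/** Prefer the cached snapshot; call the deprecated loader only when it is absent. */
function readConfig(api: PluginApiLike): AgentsConfig | undefined {
  return api.config ?? api.runtime?.config?.loadConfig?.();
}

/** Same resolution order as the patched lookup: list[0], then defaults, then 'unknown'. */
function firstAgentId(api: PluginApiLike): string {
  const cfg = readConfig(api);
  return cfg?.agents?.list?.[0]?.id ?? cfg?.agents?.defaults?.id ?? 'unknown';
}

// Count loader calls to show the cached path never triggers the expensive rebuild.
let loaderCalls = 0;
const loader = (): AgentsConfig => {
  loaderCalls += 1;
  return { agents: { list: [{ id: 'from-loader' }] } };
};

const withCache: PluginApiLike = {
  config: { agents: { list: [{ id: 'from-cache' }] } },
  runtime: { config: { loadConfig: loader } },
};
const withoutCache: PluginApiLike = { runtime: { config: { loadConfig: loader } } };

const cachedId = firstAgentId(withCache);
const callsAfterCached = loaderCalls;
const fallbackId = firstAgentId(withoutCache);
const callsAfterFallback = loaderCalls;
```

As a rough sanity check on the numbers in the message: 6-7s of CPU per 30s interval is about 20-23% of one core, which is in the same range as the reported baseline.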
1 changed file with 126 additions and 16 deletions
--- a/plugin/index.ts
+++ b/plugin/index.ts
@@ -128,7 +128,9 @@ function register(api: PluginAPI): void {
    /** Resolve agent ID from env, config, or fallback. */
    function resolveAgentId(): string {
      if (process.env.AGENT_ID) return process.env.AGENT_ID;
-      const cfg = api.runtime?.config?.loadConfig?.();
+      // Read from cached `api.config` first — see pushMetaToMonitor for why
+      // the deprecated `api.runtime?.config?.loadConfig?.()` path is heavy.
+      const cfg = (api as any).config ?? api.runtime?.config?.loadConfig?.();
      return cfg?.agents?.list?.[0]?.id ?? cfg?.agents?.defaults?.id ?? 'unknown';
    }
@@ -184,6 +186,25 @@ function register(api: PluginAPI): void {
     * Push OpenClaw metadata to the Monitor bridge.
     * This enriches Monitor heartbeats with OpenClaw version/plugin/agent info.
     * Failures are non-fatal — Monitor continues to work without this data.
+     *
+     * IMPORTANT — read config from the cached `api.config` surface, NOT from
+     * the deprecated `api.runtime?.config?.loadConfig?.()` path. The
+     * deprecated path triggers a full plugin-metadata-snapshot rebuild on
+     * every call: realpathSync walks every plugin's package.json + manifest
+     * + source paths (lstats up the directory tree), `hashWatchedFiles`
+     * fingerprints all watched plugin files, and `discoverInDirectory`
+     * re-scans every `dist/extensions/<plugin>` dir. On t2 with ~100 plugins
+     * each rebuild costs ~6-7s of CPU; with this push firing every 30s
+     * (default reportIntervalSec) the chronic baseline was ~22-25% gateway
+     * CPU even with zero agent activity (V8 profile 2026-05-27 08:14:00 60s:
+     * lstat 44.2%, statSync 6.9%, hashWatchedFiles via memo key 1.7%, all
+     * routed through readPersistedInstalledPluginIndexInstallRecordsSync ->
+     * discoverInDirectory). Switching to `api.config` reads from the
+     * already-loaded snapshot cache; the elsewhere-in-this-file pattern was
+     * already `api.config ?? api.runtime?.config?.loadConfig?.()`.
+     *
+     * Same fix is applied to `resolveAgentId` below — that's read once at
+     * gateway start so the impact is smaller, but it's the same anti-pattern.
     */
    async function pushMetaToMonitor() {
      const bridgeClient = getBridgeClient();
@@ -191,7 +212,7 @@ function register(api: PluginAPI): void {
      let agentNames: string[] = [];
      try {
-        const cfg = api.runtime?.config?.loadConfig?.();
+        const cfg = (api as any).config ?? api.runtime?.config?.loadConfig?.();
        const agentsList = cfg?.agents?.list;
        if (Array.isArray(agentsList)) {
          agentNames = agentsList
@@ -307,21 +328,22 @@ function register(api: PluginAPI): void {
          )}\n\`\`\``;
        }
-        // First-line ack `WAKEUP_OK` is the plugin's ack-receipt token; the
-        // agent MUST then continue in the same session and drive the
-        // `hf-wakeup` workflow to completion (calendar_status → task fetch →
-        // sub-workflow → calendar_complete/abort). Without that continuation
-        // the scheduler keeps re-waking every 30s because the slot stays
-        // `not_started` forever.
+        // The wakeup dispatcher's `deliver` callback below only logs the
+        // reply text — it does NOT inspect any ack token. The earlier
+        // `WAKEUP_OK` first-line-ack convention was prompt-only theatre;
+        // nothing in this plugin or in openclaw acted on it. The only
+        // thing that ends a wake cycle is the slot transitioning out of
+        // `not_started`, which happens when the agent calls
+        // `harborforge_calendar_complete` or `harborforge_calendar_abort`.
+        // Tell the agent that plainly instead of asking for a fake ack.
        const wakeupMessage =
-          `You have due slots. **First line of your reply MUST be exactly ` +
-          `\`WAKEUP_OK\`** so the plugin records the ack. Then, **in this ` +
-          `same session**, drive the \`hf-wakeup\` workflow of skill ` +
-          `\`hf-hangman-lab\` to completion — read slot context, call the ` +
-          `harborforge_calendar_* tools, route to the right sub-workflow, ` +
-          `and finish with harborforge_calendar_complete or abort. Do NOT ` +
-          `stop after the ack — the scheduler will re-wake you every 30s ` +
-          `until the slot transitions out of \`not_started\`.${slotBlock}`;
+          `You have due slots. Drive the \`hf-wakeup\` workflow of skill ` +
+          `\`hf-hangman-lab\` to completion in this session — read slot ` +
+          `context, call the harborforge_calendar_* tools, route to the ` +
+          `right sub-workflow, and finish with harborforge_calendar_complete ` +
+          `or harborforge_calendar_abort. The scheduler keeps re-waking you ` +
+          `every 30s until the slot transitions out of \`not_started\`, so ` +
+          `partial work or silence just produces another wake.${slotBlock}`;
        const result = await dispatchInboundMessageWithDispatcher({
          ctx: {
@@ -396,6 +418,94 @@ function register(api: PluginAPI): void {
        }
      }
+      // Cross-plugin exposure: agent status lookup for other plugins
+      // (currently Fabric.OpenclawPlugin uses this to skip delivering
+      // `announce` channel messages to busy agents — see DIALECTIC-V2
+      // design doc, Phase 1). Backed by calendarBridge.getAgentStatus
+      // with a small TTL cache to avoid hammering the HF backend.
+      type HfStatus = 'idle' | 'on_call' | 'busy' | 'exhausted' | 'offline';
+      const HF_STATUS_CACHE_TTL_MS = 30_000;
+      const hfStatusCache = new Map<string, { status: HfStatus; at: number }>();
+      const _G = globalThis as Record<string, unknown>;
+      _G['__hfAgentStatus'] = {
+        async get(agentId: string): Promise<HfStatus | undefined> {
+          if (!agentId) return undefined;
+          const cached = hfStatusCache.get(agentId);
+          if (cached && Date.now() - cached.at < HF_STATUS_CACHE_TTL_MS) {
+            return cached.status;
+          }
+          try {
+            const status = await calendarBridge.getAgentStatus(agentId);
+            if (status) {
+              const typed = status as HfStatus;
+              hfStatusCache.set(agentId, { status: typed, at: Date.now() });
+              return typed;
+            }
+          } catch {
+            /* fall through to cached-or-undefined */
+          }
+          return cached?.status;
+        },
+        /**
+         * Approximate "does agent have an on_call slot covering [from, to]?"
+         * for cross-plugin pre-check use (currently:
+         * Dialectic.OpenclawPlugin's signup HF coverage).
+         *
+         * v1 honest scope: we only have today's slots in scheduleCache
+         * (synced from /calendar/sync which is today-only). Returns:
+         *   - true  iff window is same-day AND some cached on_call slot
+         *           starts <= from AND ends >= to
+         *   - false iff window is same-day AND no such slot
+         *   - undefined for cross-day windows OR cache empty for this
+         *     agent (caller treats undefined as "I don't know" — see
+         *     Dialectic plugin's hf-precheck.ts which degrades to
+         *     "skipped" gracefully)
+         *
+         * Phase TBD: when HF backend ships a `/calendar/slots?agent&from&to`
+         * endpoint, swap this to call it for arbitrary windows. Until then,
+         * same-day-only coverage gates ~all debates created by analyze-intel
+         * (which schedules <2h windows) without needing a backend change.
+         */
+        async hasOnCallCovering(
+          agentId: string,
+          fromIso: string,
+          toIso: string,
+        ): Promise<boolean | undefined> {
+          if (!agentId || !fromIso || !toIso) return undefined;
+          const from = new Date(fromIso);
+          const to = new Date(toIso);
+          if (isNaN(from.getTime()) || isNaN(to.getTime())) return undefined;
+          if (!(from < to)) return undefined;
+          // Cross-day → cache only has today; can't decide.
+          const fromDate = from.toISOString().slice(0, 10);
+          const toDate = to.toISOString().slice(0, 10);
+          if (fromDate !== toDate) return undefined;
+          // Cache's cachedDate must match our window's date.
+          const cacheStatus = scheduleCache.getStatus();
+          if (cacheStatus.cachedDate !== fromDate) return undefined;
+          const slots = scheduleCache.getAgentSlots(agentId);
+          if (slots.length === 0) return undefined; // cache empty for this agent — can't decide
+          for (const s of slots) {
+            if (s.slot_type !== 'on_call') continue;
+            // status: ignore aborted/cancelled, accept not_started / ongoing / finished
+            if (s.status === 'aborted' || s.status === 'cancelled') continue;
+            const startStr = s.scheduled_at;
+            if (typeof startStr !== 'string') continue;
+            // scheduled_at can be HH:MM:SS (cache-relative date) or full ISO
+            const start =
+              /^\d{2}:\d{2}(:\d{2})?$/.test(startStr)
+                ? new Date(`${fromDate}T${startStr}Z`)
+                : new Date(startStr);
+            if (isNaN(start.getTime())) continue;
+            const dur = typeof s.estimated_duration === 'number' ? s.estimated_duration : 0;
+            const end = new Date(start.getTime() + dur * 60_000);
+            if (start <= from && end >= to) return true;
+          }
+          return false;
+        },
+      };
      // Track wakes already dispatched for a slot in the current sync
      // window — the simplified inline scheduler does not PATCH slot
      // status server-side, so without dedupe the check loop re-wakes
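The containment test at the heart of `hasOnCallCovering` in the hunk above can be isolated as a pure function, which makes the slot filtering and the `HH:MM:SS` versus full-ISO parsing easier to exercise. This is a sketch under stated assumptions: `SlotLike`, `slotStart`, and `covers` are hypothetical names, the slot shape is reduced to the fields the check reads, and the three-valued `undefined` handling (cross-day windows, empty cache) is deliberately left out.

```typescript
// Hypothetical slot shape, reduced to the fields the coverage check reads.
interface SlotLike {
  slot_type: string;
  status: string;
  scheduled_at: string;        // "HH:MM", "HH:MM:SS", or a full ISO timestamp
  estimated_duration?: number; // minutes
}

/** Parse scheduled_at, resolving bare times against a UTC calendar date. */
function slotStart(scheduledAt: string, utcDate: string): Date {
  return /^\d{2}:\d{2}(:\d{2})?$/.test(scheduledAt)
    ? new Date(`${utcDate}T${scheduledAt}Z`)
    : new Date(scheduledAt);
}

/** True when a single live on_call slot fully contains [from, to]. */
function covers(slots: SlotLike[], from: Date, to: Date): boolean {
  const utcDate = from.toISOString().slice(0, 10);
  return slots.some((s) => {
    if (s.slot_type !== 'on_call') return false;
    if (s.status === 'aborted' || s.status === 'cancelled') return false;
    const start = slotStart(s.scheduled_at, utcDate);
    if (Number.isNaN(start.getTime())) return false;
    const end = new Date(start.getTime() + (s.estimated_duration ?? 0) * 60_000);
    return start.getTime() <= from.getTime() && end.getTime() >= to.getTime();
  });
}

// Example data: one live two-hour slot from 09:00 UTC, one aborted slot at noon.
const slots: SlotLike[] = [
  { slot_type: 'on_call', status: 'not_started', scheduled_at: '09:00:00', estimated_duration: 120 },
  { slot_type: 'on_call', status: 'aborted', scheduled_at: '2026-05-27T12:00:00Z', estimated_duration: 60 },
];

const inside = covers(slots, new Date('2026-05-27T09:30:00Z'), new Date('2026-05-27T10:30:00Z'));
const overrun = covers(slots, new Date('2026-05-27T09:30:00Z'), new Date('2026-05-27T11:30:00Z'));
const abortedOnly = covers(slots, new Date('2026-05-27T12:10:00Z'), new Date('2026-05-27T12:40:00Z'));
```

Note that a window spanning two adjacent slots is reported as uncovered, because the check requires one slot to contain the whole window; the diff's loop behaves the same way.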
Author	SHA1	Message	Date
zhi	c8998c6b0d	Merge pull request 'perf(meta-push): use cached api.config instead of deprecated loadConfig() — kills ~25% chronic baseline CPU' (#11 ) from fix/meta-push-use-cached-api-config into main	2026-05-27 08:25:34 +00:00
hzhang	686f2c7cb0	perf(meta-push): use cached api.config instead of deprecated loadConfig() `pushMetaToMonitor` and `resolveAgentId` were both calling `api.runtime?.config?.loadConfig?.()` to read the agent list. That deprecated path (openclaw warns at gateway start: "plugin runtime config.loadConfig() is deprecated; use config.current()") synchronously rebuilds the full plugin-metadata snapshot — realpathSync walks every plugin's package.json + manifest + source up the directory tree, hashWatchedFiles fingerprints every watched plugin file, and discoverInDirectory re-scans every `dist/extensions/<plugin>` (~100 of them on prod t2). Each rebuild costs ~6-7s of gateway CPU. `pushMetaToMonitor` fires every `reportIntervalSec` (default 30s) from `hooks/gateway-start.js`. With 100 plugins that put the gateway into a chronic ~22-30% CPU baseline even with zero agent activity. V8 profile 2026-05-27 08:14:00 60s window (0 turns, 2 metadata pushes during): lstat 44.2%, statSync(buildInstalledManifestRegistryIndexKey) 6.9%, hashWatchedFiles via memo key 1.7%, all routed through `readPersistedInstalledPluginIndexInstallRecordsSync` -> per-plugin `discoverInDirectory`. Switching to `(api as any).config ?? api.runtime?.config?.loadConfig?.()` reads from the snapshot cache the gateway already maintains — the same pattern already used elsewhere in this file (e.g. the calendar wakeAgent dispatcher at line 284). Same change applied to `resolveAgentId` (only runs once at start, but same anti-pattern). This is a plugin-side perf workaround. The underlying openclaw bug is that `loadConfig()` rebuilds the snapshot rather than returning the cached one — a chronic 'all sync cache validity checks pay the full discovery cost' design issue worth pushing upstream separately (the walks per-call cost we measured here is unrelated to and amplifies any agent-turn-triggered walk path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 09:17:39 +01:00
h z	81d40ae63d	fix(wakeup): drop WAKEUP_OK ack-token theatre (#10 )	2026-05-26 08:13:42 +00:00
hzhang	65a3fb8d2d	fix(wakeup): drop WAKEUP_OK ack-token theatre from wakeup message The wakeup dispatcher's `deliver` callback only does `logger.info(reply.slice(0,100))` — no token detection, no scheduler state change. The "first line of your reply MUST be exactly WAKEUP_OK so the plugin records the ack" instruction was prompt theatre that nothing in this plugin (or in openclaw) acted on. Confirmed by reading openclaw/dist/plugin-sdk/src/auto-reply/tokens.d.ts which declares HEARTBEAT_OK and SILENT_REPLY tokens but nothing for wakeup. Symptom in the wild: agents would replay WAKEUP_OK every turn for no gain — costing model budget on a no-op token — and the workflow doc (`ClawSkills/workflows/hf-wakeup/flow.md`) carried a wandering appendix explaining the ack "doesn't actually do anything anyway". Rewrite the wakeup message to tell the agent the truth: drive the hf-wakeup workflow to completion; the scheduler keeps re-waking every 30s until the slot transitions out of `not_started` via harborforge_calendar_complete or _abort. No ack token expected. ClawSkills companion change (lyn/ClawSkills d0109f3) removes WAKEUP_OK from skills/hf-hangman-lab/SKILL.md and workflows/hf-wakeup/flow.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 09:10:47 +01:00
hzhang	c2d00c18a7	feat(hf-plugin): __hfAgentStatus.hasOnCallCovering(agentId, from, to) Cross-plugin accessor for "does agent have on_call slot covering this window?" — first consumer is Dialectic.OpenclawPlugin signup pre-check (its hf-precheck.ts has been degrading to "skipped" since Phase 3 ship pending this). v1 honest scope: same-day windows only (scheduleCache is today-only from /calendar/sync). Cross-day or empty-cache windows return undefined which the caller treats as "skipped" (Dialectic backend stores pre_validated:false as audit signal — same as before, just now we actually validate when we can). Logic: for each cached slot where slot_type=on_call AND status not in {aborted,cancelled}, parse scheduled_at (HH:MM:SS or full ISO) and estimated_duration to compute end; return true iff start<=from AND end>=to. Returns false (not undefined) when cache has slots for the agent on this date but none covers — that means "actually no coverage" vs "I dont know". Pairs with Dialectic.OpenclawPlugin/src/hf-precheck.ts which already calls hf.hasOnCallCovering and handles all 3 return shapes. No backend change required.	2026-05-23 14:58:37 +01:00
hzhang	709f7e09ab	feat(hf-plugin): expose globalThis.__hfAgentStatus.get(agentId) Cross-plugin agent-status accessor for use by Fabric.OpenclawPlugin's presence-sync loop (and any future plugin needing 'is agent X busy right now'). Backed by CalendarBridgeClient.getAgentStatus() with a 30s in-memory TTL cache to avoid hammering the HF backend. Returns one of 'idle' \| 'on_call' \| 'busy' \| 'exhausted' \| 'offline' or undefined when the agent isn't known to HF. Cache miss + bridge failure returns the last cached value (stale-data better than no data for delivery-decision use cases). Part of DIALECTIC-V2 Phase 1 (Fabric announce channel + busy-discard). See /home/hzhang/arch/DIALECTIC-V2-DESIGN.md sections 7+8.	2026-05-23 11:31:27 +01:00
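The 30s TTL cache with stale fallback described for `__hfAgentStatus.get` can be sketched as a small generic helper. This is illustrative only: `createStaleOkCache` and `Fetcher` are hypothetical names, the clock is injected so expiry can be tested without waiting, and the sketch treats `undefined` as "no value" where the plugin code uses a truthiness check.

```typescript
type Fetcher<V> = (key: string) => Promise<V | undefined>;

/** Hypothetical helper: TTL cache that serves the last known value when a refresh fails. */
function createStaleOkCache<V>(
  fetcher: Fetcher<V>,
  ttlMs: number,
  now: () => number = Date.now,
): (key: string) => Promise<V | undefined> {
  const entries = new Map<string, { value: V; at: number }>();

  return async function get(key: string): Promise<V | undefined> {
    const cached = entries.get(key);
    // Fresh hit: answer from memory without touching the backend.
    if (cached && now() - cached.at < ttlMs) return cached.value;
    try {
      const value = await fetcher(key);
      if (value !== undefined) {
        entries.set(key, { value, at: now() });
        return value;
      }
    } catch {
      // Refresh failed: fall through to the stale entry, if any.
    }
    // Stale data is preferred over no data; unknown keys yield undefined.
    return cached?.value;
  };
}
```

The trade-off matches the one stated in the commit message: for delivery decisions a possibly outdated status is more useful than none, at the cost of serving an expired value for as long as the backend stays unreachable.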