Merge pull request 'perf(meta-push): use cached api.config instead of deprecated loadConfig() — kills ~25% chronic baseline CPU' (#11 ) from fix/meta-push-use-cached-api-config into main

perf(meta-push): use cached api.config instead of deprecated loadConfig()

`pushMetaToMonitor` and `resolveAgentId` were both calling `api.runtime?.config?.loadConfig?.()` to read the agent list. That deprecated path (openclaw warns at gateway start: "plugin runtime config.loadConfig() is deprecated; use config.current()") synchronously rebuilds the full plugin-metadata snapshot: realpathSync walks every plugin's package.json + manifest + source up the directory tree, hashWatchedFiles fingerprints every watched plugin file, and discoverInDirectory re-scans every `dist/extensions/<plugin>` (~100 of them on prod t2). Each rebuild costs ~6-7s of gateway CPU.

`pushMetaToMonitor` fires every `reportIntervalSec` (default 30s) from `hooks/gateway-start.js`. With 100 plugins that put the gateway into a chronic ~22-30% CPU baseline even with zero agent activity.

V8 profile 2026-05-27 08:14:00 60s window (0 turns, 2 metadata pushes during): lstat 44.2%, statSync(buildInstalledManifestRegistryIndexKey) 6.9%, hashWatchedFiles via memo key 1.7%, all routed through `readPersistedInstalledPluginIndexInstallRecordsSync` -> per-plugin `discoverInDirectory`.

Switching to `(api as any).config ?? api.runtime?.config?.loadConfig?.()` reads from the snapshot cache the gateway already maintains. This is the same pattern already used elsewhere in this file (e.g. the calendar wakeAgent dispatcher at line 284). Same change applied to `resolveAgentId` (only runs once at start, but same anti-pattern).

This is a plugin-side perf workaround. The underlying openclaw bug is that `loadConfig()` rebuilds the snapshot rather than returning the cached one: a chronic 'all sync cache validity checks pay the full discovery cost' design issue worth pushing upstream separately (the per-call walk cost we measured here is unrelated to, and amplifies, any agent-turn-triggered walk path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
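The fallback expression in this commit relies on `??` short-circuiting: when a cached snapshot is present, the deprecated loader is never invoked. The following is a minimal sketch of that behaviour, together with the back-of-envelope duty cycle implied by the figures in the commit message. The `ApiLike` and `AgentsConfig` shapes and the `readConfig` and `dutyCycle` helpers are hypothetical stand-ins for illustration, not openclaw's real types or API.

```typescript
// Hypothetical shapes for illustration only; the real plugin API types are not shown in this commit.
interface AgentsConfig {
  agents?: { list?: Array<{ id?: string }> };
}
interface ApiLike {
  config?: AgentsConfig; // cached snapshot (assumed to be populated by the gateway)
  runtime?: { config?: { loadConfig?: () => AgentsConfig } }; // deprecated, expensive path
}

/** Prefer the cached snapshot; fall back to the deprecated loader only when no cache exists. */
function readConfig(api: ApiLike): AgentsConfig | undefined {
  return api.config ?? api.runtime?.config?.loadConfig?.();
}

/** Fraction of one core consumed by work costing `costSec` every `intervalSec`. */
function dutyCycle(costSec: number, intervalSec: number): number {
  return costSec / intervalSec;
}

let loaderCalls = 0;
const api: ApiLike = {
  config: { agents: { list: [{ id: 'agent-a' }] } },
  runtime: {
    config: {
      loadConfig: () => {
        loaderCalls += 1; // counts how often the expensive path runs
        return { agents: { list: [] } };
      },
    },
  },
};

const cfg = readConfig(api);
console.log(cfg?.agents?.list?.[0]?.id, loaderCalls); // prints "agent-a 0"
console.log(dutyCycle(6, 30).toFixed(3), dutyCycle(7, 30).toFixed(3)); // prints "0.200 0.233"
```

With a 6-7s rebuild every 30s, the duty cycle works out to roughly 20-23% of one core, which is close to the lower end of the ~22-30% baseline reported in the commit message.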
1 changed files with 166 additions and 16 deletions
--- a/plugin/index.ts
+++ b/plugin/index.ts
@@ -73,6 +73,30 @@ interface PluginAPI {
  getAgentStatus?: () => Promise<{ status: string } | null>;
 }
 /**
 * Coerce a tool execute() return value into the MCP `{ content: [...] }`
 * shape that the openclaw Codex tool dispatcher requires.
 *
 * Background: openclaw's `convertToolContents()` does `result.content.reduce(...)`
 * to compute total text length before flattening. Every HF tool here returned a
 * bare object (`{ running, processing, currentSlot, ... }`) which has no
 * `.content` field, so `undefined.reduce` threw and every call to
 * `harborforge_*` from a Codex-harness agent surfaced as the cryptic
 * `Cannot read properties of undefined (reading 'reduce')`. The fix is to
 * wrap every tool's execute return; doing it at the `registerTool` boundary
 * keeps each tool body unchanged.
 */
 function ensureMcpContentShape(result: unknown): { content: Array<{ type: 'text'; text: string }> } {
  if (
    result && typeof result === 'object' &&
    Array.isArray((result as { content?: unknown }).content)
  ) {
    return result as { content: Array<{ type: 'text'; text: string }> };
  }
  const text = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
  return { content: [{ type: 'text', text }] };
 }
 function register(api: PluginAPI): void {
    const logger = api.logger || {
      info: (...args: any[]) => console.log('[HarborForge]', ...args),
@@ -81,6 +105,22 @@ function register(api: PluginAPI): void {
      warn: (...args: any[]) => console.warn('[HarborForge]', ...args),
    };
    // Wrap api.registerTool so every tool's execute() return is coerced into
    // the MCP `{ content: [...] }` shape openclaw expects. See
    // `ensureMcpContentShape` above.
    const _origRegisterTool = api.registerTool.bind(api);
    api.registerTool = (factory: (ctx: any) => any) => {
      _origRegisterTool((ctx: any) => {
        const def = factory(ctx);
        if (!def || typeof def.execute !== 'function') return def;
        const origExecute = def.execute;
        return {
          ...def,
          execute: async (...args: any[]) => ensureMcpContentShape(await origExecute(...args)),
        };
      });
    };
    function resolveConfig() {
      return getPluginConfig(api);
    }
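To see what the `registerTool` wrapper in this hunk does to tool results, here is a self-contained sketch. `ensureMcpContentShape` is copied from the diff; `fakeApi`, `ToolDef`, and `Factory` are hypothetical stand-ins for the real plugin API, which is not shown here.

```typescript
type McpContent = { content: Array<{ type: 'text'; text: string }> };

// Copied from the diff: results that already carry a `content` array pass
// through untouched; anything else is wrapped as a single text item.
function ensureMcpContentShape(result: unknown): McpContent {
  if (
    result && typeof result === 'object' &&
    Array.isArray((result as { content?: unknown }).content)
  ) {
    return result as McpContent;
  }
  const text = typeof result === 'string' ? result : JSON.stringify(result, null, 2);
  return { content: [{ type: 'text', text }] };
}

// Hypothetical stand-ins for the plugin API surface.
type ToolDef = { name: string; execute: (...args: unknown[]) => unknown };
type Factory = (ctx: unknown) => ToolDef;

const registered: Factory[] = [];
const fakeApi = {
  registerTool(factory: Factory): void {
    registered.push(factory);
  },
};

// Same idea as the commit: wrap registerTool once so every tool's execute()
// result is coerced, without editing any individual tool body.
const origRegisterTool = fakeApi.registerTool.bind(fakeApi);
fakeApi.registerTool = (factory: Factory): void => {
  origRegisterTool((ctx: unknown) => {
    const def = factory(ctx);
    const origExecute = def.execute;
    return {
      ...def,
      execute: async (...args: unknown[]) => ensureMcpContentShape(await origExecute(...args)),
    };
  });
};

// A tool that returns a bare object, as the harborforge_* tools did.
fakeApi.registerTool(() => ({ name: 'status', execute: () => ({ running: true }) }));

for (const factory of registered) {
  const tool = factory({});
  void Promise.resolve(tool.execute()).then((r) => console.log(tool.name, JSON.stringify(r)));
}
```

One edge case worth keeping in mind: `JSON.stringify(undefined)` evaluates to `undefined`, so a tool that returns nothing would yield a text item whose `text` is not a string at runtime.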
@@ -88,7 +128,9 @@ function register(api: PluginAPI): void {
    /** Resolve agent ID from env, config, or fallback. */
    function resolveAgentId(): string {
      if (process.env.AGENT_ID) return process.env.AGENT_ID;
-      const cfg = api.runtime?.config?.loadConfig?.();
+      // Read from cached `api.config` first — see pushMetaToMonitor for why
+      // the deprecated `api.runtime?.config?.loadConfig?.()` path is heavy.
+      const cfg = (api as any).config ?? api.runtime?.config?.loadConfig?.();
      return cfg?.agents?.list?.[0]?.id ?? cfg?.agents?.defaults?.id ?? 'unknown';
    }
@@ -144,6 +186,25 @@ function register(api: PluginAPI): void {
     * Push OpenClaw metadata to the Monitor bridge.
     * This enriches Monitor heartbeats with OpenClaw version/plugin/agent info.
     * Failures are non-fatal — Monitor continues to work without this data.
     *
     * IMPORTANT — read config from the cached `api.config` surface, NOT from
     * the deprecated `api.runtime?.config?.loadConfig?.()` path. The
     * deprecated path triggers a full plugin-metadata-snapshot rebuild on
     * every call: realpathSync walks every plugin's package.json + manifest
     * + source paths (lstats up the directory tree), `hashWatchedFiles`
     * fingerprints all watched plugin files, and `discoverInDirectory`
     * re-scans every `dist/extensions/<plugin>` dir. On t2 with ~100 plugins
     * each rebuild costs ~6-7s of CPU; with this push firing every 30s
     * (default reportIntervalSec) the chronic baseline was ~22-25% gateway
     * CPU even with zero agent activity (V8 profile 2026-05-27 08:14:00 60s:
     * lstat 44.2%, statSync 6.9%, hashWatchedFiles via memo key 1.7%, all
     * routed through readPersistedInstalledPluginIndexInstallRecordsSync ->
     * discoverInDirectory). Switching to `api.config` reads from the
     * already-loaded snapshot cache; the elsewhere-in-this-file pattern was
     * already `api.config ?? api.runtime?.config?.loadConfig?.()`.
     *
     * Same fix is applied to `resolveAgentId` below — that's read once at
     * gateway start so the impact is smaller, but it's the same anti-pattern.
     */
    async function pushMetaToMonitor() {
      const bridgeClient = getBridgeClient();
@@ -151,7 +212,7 @@ function register(api: PluginAPI): void {
      let agentNames: string[] = [];
      try {
-        const cfg = api.runtime?.config?.loadConfig?.();
+        const cfg = (api as any).config ?? api.runtime?.config?.loadConfig?.();
        const agentsList = cfg?.agents?.list;
        if (Array.isArray(agentsList)) {
          agentNames = agentsList
@@ -267,21 +328,22 @@ function register(api: PluginAPI): void {
          )}\n\`\`\``;
        }
-        // First-line ack `WAKEUP_OK` is the plugin's ack-receipt token; the
-        // agent MUST then continue in the same session and drive the
-        // `hf-wakeup` workflow to completion (calendar_status → task fetch →
-        // sub-workflow → calendar_complete/abort). Without that continuation
-        // the scheduler keeps re-waking every 30s because the slot stays
-        // `not_started` forever.
+        // The wakeup dispatcher's `deliver` callback below only logs the
+        // reply text — it does NOT inspect any ack token. The earlier
+        // `WAKEUP_OK` first-line-ack convention was prompt-only theatre;
+        // nothing in this plugin or in openclaw acted on it. The only
+        // thing that ends a wake cycle is the slot transitioning out of
+        // `not_started`, which happens when the agent calls
+        // `harborforge_calendar_complete` or `harborforge_calendar_abort`.
+        // Tell the agent that plainly instead of asking for a fake ack.
         const wakeupMessage =
-          `You have due slots. **First line of your reply MUST be exactly ` +
-          `\`WAKEUP_OK\`** so the plugin records the ack. Then, **in this ` +
-          `same session**, drive the \`hf-wakeup\` workflow of skill ` +
-          `\`hf-hangman-lab\` to completion — read slot context, call the ` +
-          `harborforge_calendar_* tools, route to the right sub-workflow, ` +
-          `and finish with harborforge_calendar_complete or abort. Do NOT ` +
-          `stop after the ack — the scheduler will re-wake you every 30s ` +
-          `until the slot transitions out of \`not_started\`.${slotBlock}`;
+          `You have due slots. Drive the \`hf-wakeup\` workflow of skill ` +
+          `\`hf-hangman-lab\` to completion in this session — read slot ` +
+          `context, call the harborforge_calendar_* tools, route to the ` +
+          `right sub-workflow, and finish with harborforge_calendar_complete ` +
+          `or harborforge_calendar_abort. The scheduler keeps re-waking you ` +
+          `every 30s until the slot transitions out of \`not_started\`, so ` +
+          `partial work or silence just produces another wake.${slotBlock}`;
        const result = await dispatchInboundMessageWithDispatcher({
          ctx: {
@@ -356,6 +418,94 @@ function register(api: PluginAPI): void {
        }
      }
      // Cross-plugin exposure: agent status lookup for other plugins
      // (currently Fabric.OpenclawPlugin uses this to skip delivering
      // `announce` channel messages to busy agents — see DIALECTIC-V2
      // design doc, Phase 1). Backed by calendarBridge.getAgentStatus
      // with a small TTL cache to avoid hammering the HF backend.
      type HfStatus = 'idle' | 'on_call' | 'busy' | 'exhausted' | 'offline';
      const HF_STATUS_CACHE_TTL_MS = 30_000;
      const hfStatusCache = new Map<string, { status: HfStatus; at: number }>();
      const _G = globalThis as Record<string, unknown>;
      _G['__hfAgentStatus'] = {
        async get(agentId: string): Promise<HfStatus | undefined> {
          if (!agentId) return undefined;
          const cached = hfStatusCache.get(agentId);
          if (cached && Date.now() - cached.at < HF_STATUS_CACHE_TTL_MS) {
            return cached.status;
          }
          try {
            const status = await calendarBridge.getAgentStatus(agentId);
            if (status) {
              const typed = status as HfStatus;
              hfStatusCache.set(agentId, { status: typed, at: Date.now() });
              return typed;
            }
          } catch {
            /* fall through to cached-or-undefined */
          }
          return cached?.status;
        },
        /**
         * Approximate "does agent have an on_call slot covering [from, to]?"
         * for cross-plugin pre-check use (currently:
         * Dialectic.OpenclawPlugin's signup HF coverage).
         *
         * v1 honest scope: we only have today's slots in scheduleCache
         * (synced from /calendar/sync which is today-only). Returns:
         *   - true  iff window is same-day AND some cached on_call slot
         *           starts <= from AND ends >= to
         *   - false iff window is same-day AND no such slot
         *   - undefined for cross-day windows OR cache empty for this
         *     agent (caller treats undefined as "I don't know" — see
         *     Dialectic plugin's hf-precheck.ts which degrades to
         *     "skipped" gracefully)
         *
         * Phase TBD: when HF backend ships a `/calendar/slots?agent&from&to`
         * endpoint, swap this to call it for arbitrary windows. Until then,
         * same-day-only coverage gates ~all debates created by analyze-intel
         * (which schedules <2h windows) without needing a backend change.
         */
        async hasOnCallCovering(
          agentId: string,
          fromIso: string,
          toIso: string,
        ): Promise<boolean | undefined> {
          if (!agentId || !fromIso || !toIso) return undefined;
          const from = new Date(fromIso);
          const to = new Date(toIso);
          if (isNaN(from.getTime()) || isNaN(to.getTime())) return undefined;
          if (!(from < to)) return undefined;
          // Cross-day → cache only has today; can't decide.
          const fromDate = from.toISOString().slice(0, 10);
          const toDate = to.toISOString().slice(0, 10);
          if (fromDate !== toDate) return undefined;
          // Cache's cachedDate must match our window's date.
          const cacheStatus = scheduleCache.getStatus();
          if (cacheStatus.cachedDate !== fromDate) return undefined;
          const slots = scheduleCache.getAgentSlots(agentId);
          if (slots.length === 0) return undefined; // cache empty for this agent — can't decide
          for (const s of slots) {
            if (s.slot_type !== 'on_call') continue;
            // status: ignore aborted/cancelled, accept not_started / ongoing / finished
            if (s.status === 'aborted' || s.status === 'cancelled') continue;
            const startStr = s.scheduled_at;
            if (typeof startStr !== 'string') continue;
            // scheduled_at can be HH:MM:SS (cache-relative date) or full ISO
            const start =
              /^\d{2}:\d{2}(:\d{2})?$/.test(startStr)
                ? new Date(`${fromDate}T${startStr}Z`)
                : new Date(startStr);
            if (isNaN(start.getTime())) continue;
            const dur = typeof s.estimated_duration === 'number' ? s.estimated_duration : 0;
            const end = new Date(start.getTime() + dur * 60_000);
            if (start <= from && end >= to) return true;
          }
          return false;
        },
      };
      // Track wakes already dispatched for a slot in the current sync
      // window — the simplified inline scheduler does not PATCH slot
      // status server-side, so without dedupe the check loop re-wakes
Author	SHA1	Message	Date
zhi	c8998c6b0d	Merge pull request 'perf(meta-push): use cached api.config instead of deprecated loadConfig() — kills ~25% chronic baseline CPU' (#11 ) from fix/meta-push-use-cached-api-config into main	2026-05-27 08:25:34 +00:00
hzhang	686f2c7cb0	perf(meta-push): use cached api.config instead of deprecated loadConfig()	2026-05-27 09:17:39 +01:00
h z	81d40ae63d	fix(wakeup): drop WAKEUP_OK ack-token theatre (#10 )	2026-05-26 08:13:42 +00:00
hzhang	65a3fb8d2d	fix(wakeup): drop WAKEUP_OK ack-token theatre from wakeup message The wakeup dispatcher's `deliver` callback only does `logger.info(reply.slice(0,100))` — no token detection, no scheduler state change. The "first line of your reply MUST be exactly WAKEUP_OK so the plugin records the ack" instruction was prompt theatre that nothing in this plugin (or in openclaw) acted on. Confirmed by reading openclaw/dist/plugin-sdk/src/auto-reply/tokens.d.ts which declares HEARTBEAT_OK and SILENT_REPLY tokens but nothing for wakeup. Symptom in the wild: agents would replay WAKEUP_OK every turn for no gain — costing model budget on a no-op token — and the workflow doc (`ClawSkills/workflows/hf-wakeup/flow.md`) carried a wandering appendix explaining the ack "doesn't actually do anything anyway". Rewrite the wakeup message to tell the agent the truth: drive the hf-wakeup workflow to completion; the scheduler keeps re-waking every 30s until the slot transitions out of `not_started` via harborforge_calendar_complete or _abort. No ack token expected. ClawSkills companion change (lyn/ClawSkills d0109f3) removes WAKEUP_OK from skills/hf-hangman-lab/SKILL.md and workflows/hf-wakeup/flow.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 09:10:47 +01:00
hzhang	c2d00c18a7	feat(hf-plugin): __hfAgentStatus.hasOnCallCovering(agentId, from, to) Cross-plugin accessor for "does agent have on_call slot covering this window?" — first consumer is Dialectic.OpenclawPlugin signup pre-check (its hf-precheck.ts has been degrading to "skipped" since Phase 3 ship pending this). v1 honest scope: same-day windows only (scheduleCache is today-only from /calendar/sync). Cross-day or empty-cache windows return undefined which the caller treats as "skipped" (Dialectic backend stores pre_validated:false as audit signal — same as before, just now we actually validate when we can). Logic: for each cached slot where slot_type=on_call AND status not in {aborted,cancelled}, parse scheduled_at (HH:MM:SS or full ISO) and estimated_duration to compute end; return true iff start<=from AND end>=to. Returns false (not undefined) when cache has slots for the agent on this date but none covers — that means "actually no coverage" vs "I dont know". Pairs with Dialectic.OpenclawPlugin/src/hf-precheck.ts which already calls hf.hasOnCallCovering and handles all 3 return shapes. No backend change required.	2026-05-23 14:58:37 +01:00
hzhang	709f7e09ab	feat(hf-plugin): expose globalThis.__hfAgentStatus.get(agentId) Cross-plugin agent-status accessor for use by Fabric.OpenclawPlugin's presence-sync loop (and any future plugin needing 'is agent X busy right now'). Backed by CalendarBridgeClient.getAgentStatus() with a 30s in-memory TTL cache to avoid hammering the HF backend. Returns one of 'idle' \| 'on_call' \| 'busy' \| 'exhausted' \| 'offline' or undefined when the agent isn't known to HF. Cache miss + bridge failure returns the last cached value (stale-data better than no data for delivery-decision use cases). Part of DIALECTIC-V2 Phase 1 (Fabric announce channel + busy-discard). See /home/hzhang/arch/DIALECTIC-V2-DESIGN.md sections 7+8.	2026-05-23 11:31:27 +01:00
hzhang	1c4cf773e5	fix(hf-plugin): wrap tool returns in MCP {content:[...]} shape OpenClaw's Codex tool dispatcher (thread-lifecycle:255) expects every tool execute() to return { content: [...] } and calls result.content.reduce() to compute total text length. All 9 harborforge_* tools returned bare objects ({ running, processing, currentSlot, ... }) which has no .content field — so .reduce of undefined threw, and the agent saw the cryptic 'Cannot read properties of undefined (reading reduce)' on every call. This silently blocked every calendar slot transition on prod for hours: agents could call harborforge_calendar_complete but it always errored, so slots never moved out of not_started. Fix is at the registerTool boundary: api.registerTool is wrapped once to coerce every tool's execute return through ensureMcpContentShape. Tools that already return the correct shape are unchanged. No per-tool edits needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 08:48:05 +01:00
h z	ba420e858a	Merge pull request 'fix: wakeup message says 'continue in same session', not 'only reply WAKEUP_OK'' (#9 ) from fix/wakeup-message-no-ack-only into main	2026-05-21 10:05:34 +00:00
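The three return values of `hasOnCallCovering` (commit c2d00c18a7 in the table above) are easier to follow with concrete inputs. Below is a simplified sketch of the same decision logic as a pure function. The `Slot` interface and the `covers` helper are hypothetical names introduced for illustration, and the cache lookup is replaced by plain arguments.

```typescript
// Hypothetical slot shape, reduced to the fields the coverage check reads.
interface Slot {
  slot_type: string;
  status: string;
  scheduled_at: string; // "HH:MM:SS" (relative to the cached date) or a full ISO timestamp
  estimated_duration: number; // minutes
}

/**
 * true      -> some on_call slot starts <= from and ends >= to
 * false     -> slots are cached for this day, but none covers the window
 * undefined -> cannot decide (bad input, cross-day window, or no cached data)
 */
function covers(
  slots: Slot[],
  cachedDate: string,
  fromIso: string,
  toIso: string,
): boolean | undefined {
  const from = new Date(fromIso).getTime();
  const to = new Date(toIso).getTime();
  if (Number.isNaN(from) || Number.isNaN(to) || !(from < to)) return undefined;

  const day = new Date(from).toISOString().slice(0, 10);
  if (day !== new Date(to).toISOString().slice(0, 10)) return undefined; // cross-day window
  if (day !== cachedDate || slots.length === 0) return undefined; // no usable cache

  for (const s of slots) {
    if (s.slot_type !== 'on_call') continue;
    if (s.status === 'aborted' || s.status === 'cancelled') continue;
    const start = /^\d{2}:\d{2}(:\d{2})?$/.test(s.scheduled_at)
      ? new Date(`${day}T${s.scheduled_at}Z`).getTime()
      : new Date(s.scheduled_at).getTime();
    if (Number.isNaN(start)) continue;
    const end = start + s.estimated_duration * 60_000;
    if (start <= from && end >= to) return true;
  }
  return false;
}

const today = '2026-05-23';
const slots: Slot[] = [
  { slot_type: 'on_call', status: 'not_started', scheduled_at: '09:00:00', estimated_duration: 120 },
];

console.log(covers(slots, today, '2026-05-23T09:30:00Z', '2026-05-23T10:30:00Z')); // true
console.log(covers(slots, today, '2026-05-23T10:30:00Z', '2026-05-23T11:30:00Z')); // false
console.log(covers(slots, today, '2026-05-23T23:00:00Z', '2026-05-24T00:30:00Z')); // undefined
console.log(covers([], today, '2026-05-23T09:30:00Z', '2026-05-23T10:30:00Z')); // undefined
```

Because the day is taken from `toISOString()` and bare `HH:MM:SS` values get a `Z` suffix, "same day" here means the same UTC day, and cache-relative times are interpreted as UTC.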