ContractorAgent

Author	SHA1	Message	Date
hzhang	0f49edf59c	fix(bridge): recover inline-prefixed metadata in user message body OpenClaw's canonical convention is to emit metadata envelopes (chat_id, sender, reply target, …) as SEPARATE user-role messages folded into the openai-completions request right after the real one — `extractLatestUserMessage` skips those whole. Fabric.OpenclawPlugin's dispatch does not split: it passes metadata blocks and the real user content as ONE merged user-message body, separated by blank lines. With the prior filter that meant the entire turn was dropped with "no user message found" (HTTP 400) because the first line matched a sentinel — the actual prompt sitting after the metadata blocks never reached the bridge. When the whole-body check fails for a single-message body, walk past leading sentinel-prefixed blocks (sentinel header + optional ```json code fence + blank-line separator) and use whatever non-metadata block follows. Falls back to the previous "skip entirely" semantics when the body is metadata-only. End-user symptom that surfaced this: every contractor agent (Claude / Gemini) subscribed to a Fabric channel silently failed to reply to sub-discussion messages during recruitment — fabric dispatch said "completed" in 1.6s but trajectory had `assistantTexts: []`, `terminalError: non_deliverable_terminal_turn`, `errorMessage: "400 \"no user message found\""`. Surfaced recruiting developer1 on prod-t2 2026-05-31.	2026-05-31 20:52:15 +01:00
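
The recovery walk described in 0f49edf59c can be sketched roughly as follows. This is a minimal illustration, not the bridge's actual code: the function name `recoverInlinePrompt` is hypothetical, the sentinel list contains only the two headers quoted elsewhere in this log (the real list is longer), and the block layout (header, optional json fence, blank-line separator) is taken from the commit message.

```typescript
// Hypothetical, partial sentinel list: only the two headers quoted in this log.
const SENTINELS = new Set([
  "Conversation info (untrusted metadata):",
  "Sender (untrusted metadata):",
]);

// Fence marker built at runtime so this example can sit inside a fenced block.
const FENCE = "`".repeat(3);

/** Returns the text after leading metadata blocks, or null if the body is metadata-only. */
function recoverInlinePrompt(body: string): string | null {
  const lines = body.split("\n");
  let i = 0;
  while (i < lines.length) {
    // Skip blank-line separators between blocks.
    while (i < lines.length && lines[i].trim() === "") i++;
    if (i >= lines.length) return null;
    // First non-sentinel line: the real user content starts here.
    if (!SENTINELS.has(lines[i].trim())) break;
    i++; // consume the sentinel header
    if (i < lines.length && lines[i].trim().startsWith(FENCE)) {
      i++; // opening fence
      while (i < lines.length && lines[i].trim() !== FENCE) i++;
      i++; // closing fence (or past the end if the fence is unterminated)
    } else {
      // Unfenced payload: consume up to the next blank line.
      while (i < lines.length && lines[i].trim() !== "") i++;
    }
  }
  const rest = lines.slice(i).join("\n").trim();
  return rest === "" ? null : rest;
}
```

Returning `null` for a metadata-only body keeps the previous "skip entirely" behaviour available to the caller.
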
hzhang	037e92b421	fix(bridge): /mcp/execute handles raw-object tool results (not just AgentToolResult) OpenClaw plugins return tool results in one of two shapes: (a) AgentToolResult — { content: [{type:'text', text:'...'}] } used when the plugin wraps via asContent() helper. Every Dialectic.OpenclawPlugin tool follows this pattern. (b) raw JSON-able object — { ok:true, ...domain fields } used when the plugin returns data directly. Every Fabric.OpenclawPlugin tool follows this pattern (fabric-channel-list, fabric-guild-list, fabric-send-message, fabric-channel-set-purpose, etc). The bridge's /mcp/execute handler only handled shape (a). When a contractor agent (developer / contractor-test) called any fabric tool through Claude Code, the bridge ran the tool successfully but fell back to the literal string '(no result)' because toolResult.content was undefined. Claude Code then dutifully rendered '(no result)' as the tool result. Reproduced on prod: openclaw agent --agent developer -m 'Call fabric-channel-list ...' → claude code session called mcp__openclaw__fabric-channel-list → bridge logged: mcp/execute tool=fabric-channel-list ... → bridge replied: { result: '(no result)' } → claude code rendered: '' Fix: normalize the result in the bridge. If toolResult is null → empty string; if it has a .content array → join the text segments (shape a); if it's a string → use directly; else → JSON.stringify the whole thing (shape b). Falls back to '(no result)' only when all of those produce empty string. Verified on prod after fix: agent receives real {"ok":true,"count":1,"channels":[...]} JSON payload (one real prod-push-test channel) in the response.	2026-05-24 09:33:12 +01:00
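
The normalization order described in 037e92b421 might look like the sketch below. The function name `normalizeToolResult` and the newline used to join text segments are assumptions; the commit message only says the segments are joined.

```typescript
type TextSegment = { type: string; text?: string };

/** Flattens either tool-result shape into the string returned to the CLI. */
function normalizeToolResult(toolResult: unknown): string {
  let text: string;
  if (toolResult === null || toolResult === undefined) {
    text = "";
  } else if (
    typeof toolResult === "object" &&
    Array.isArray((toolResult as { content?: unknown }).content)
  ) {
    // Shape (a): AgentToolResult with a content array of text segments.
    const segments = (toolResult as { content: TextSegment[] }).content;
    text = segments
      .filter((s) => s && s.type === "text" && typeof s.text === "string")
      .map((s) => s.text as string)
      .join("\n"); // separator is an assumption
  } else if (typeof toolResult === "string") {
    text = toolResult;
  } else {
    // Shape (b): raw JSON-able object, serialized whole.
    text = JSON.stringify(toolResult) ?? "";
  }
  // The placeholder is used only when every branch produced an empty string.
  return text === "" ? "(no result)" : text;
}
```
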
zhi	0b24330787	fix(bridge): emit empty content delta as heartbeat; preserve user provider fields on reinstall OpenClaw's LLM idle watchdog (default 120s) fires on lack of model progress, not lack of bytes — an SSE comment frame (": keepalive\n\n") keeps the TCP socket alive but isn't recognized as progress, so a long quiet tool-call phase still idles out. When that happens OpenClaw falls back to re-sending the prior turn's assistant text (pi-embedded:1308 fallbackAnswerText), producing duplicate-Discord-message symptoms. Heartbeat now emits a real chat.completion.chunk with an empty content delta every 30s. Clients drop empty deltas; the upstream idle watchdog should count it as model progress because it's a real event on the canonical streaming channel. scripts/install.mjs now spreads the existing provider entry before overriding script-managed fields, so user-added fields like timeoutSeconds survive reinstall. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-14 08:53:22 +00:00
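
For illustration, a heartbeat of the kind described in 0b24330787 could be built as below. The helper names `heartbeatFrame` and `startHeartbeat` and the `write` callback are hypothetical; the chunk layout follows the OpenAI streaming `chat.completion.chunk` shape, and whether the upstream watchdog counts it as progress is the commit's own expectation, not something this sketch verifies.

```typescript
/** Builds one SSE frame carrying a chat.completion.chunk with an empty content delta. */
function heartbeatFrame(id: string, model: string, nowMs: number = Date.now()): string {
  const chunk = {
    id,
    object: "chat.completion.chunk",
    created: Math.floor(nowMs / 1000),
    model,
    choices: [{ index: 0, delta: { content: "" }, finish_reason: null }],
  };
  return `data: ${JSON.stringify(chunk)}\n\n`;
}

/** Emits a heartbeat every intervalMs; call the returned function when the turn ends. */
function startHeartbeat(
  write: (frame: string) => void, // hypothetical sink, e.g. a wrapper around res.write
  id: string,
  model: string,
  intervalMs: number = 30_000,
): () => void {
  const timer = setInterval(() => write(heartbeatFrame(id, model)), intervalMs);
  return () => clearInterval(timer);
}
```
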
zhi	1b7cd6b215	fix(bridge): skip new untrusted-metadata envelopes too, not just legacy header extractLatestUserMessage used isRuntimeContextMessage to skip envelopes OpenClaw splices into the request as extra role=user messages. It only recognized the legacy "OpenClaw runtime context for the immediately preceding user message" header, but current OpenClaw emits a different family of envelopes — INBOUND_META_SENTINELS in strip-inbound-meta-*.js: "Conversation info (untrusted metadata):", "Sender (untrusted metadata):", reply target / thread starter / forwarded / chat history / untrusted context. These slipped through the filter, so the newest-first scan picked the Conversation info envelope as the "latest user message" and forwarded only chat_id / sender JSON to claude. Claude saw no actual prompt and replied with a stock greeting, while the user's real message a few slots earlier was ignored. Add the seven inbound-meta headers to isRuntimeContextMessage, matched by exact equality of the trimmed first line to avoid swallowing user text that happens to mention the phrase. Must stay in sync with INBOUND_META_SENTINELS in OpenClaw's strip-inbound-meta module — any new envelope type added upstream needs to be appended here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:03:02 +00:00
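
The exact-match filter from 1b7cd6b215 can be sketched like this. The function names match the ones in the commit message, but the bodies are illustrative only; the sentinel set lists just the two headers quoted above rather than all seven, and message content is assumed to be a plain string.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

const LEGACY_HEADER = "OpenClaw runtime context for the immediately preceding user message";

// Partial list: the log quotes only these two of the seven inbound-meta headers.
const INBOUND_META = new Set([
  "Conversation info (untrusted metadata):",
  "Sender (untrusted metadata):",
]);

function isRuntimeContextMessage(content: string): boolean {
  const firstLine = content.trimStart().split("\n", 1)[0].trim();
  // Exact equality for the new headers, so user text that merely mentions
  // one of the phrases is not swallowed.
  return firstLine.startsWith(LEGACY_HEADER) || INBOUND_META.has(firstLine);
}

/** Scans newest-first and returns the most recent user message that is not an envelope. */
function extractLatestUserMessage(messages: ChatMessage[]): string | null {
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    if (m.role === "user" && !isRuntimeContextMessage(m.content)) return m.content;
  }
  return null;
}
```
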
zhi	cce85a9be8	fix(bridge): reset claude session when OpenClaw sends no assistant history The May 7 fix made the bridge detect /new turns by scanning messages for the bare-reset marker ("A new session was started via /new or /reset"). That handles the case where /new is the body of the current user turn, but misses a very common path: the user types `/new` as a standalone slash command. OpenClaw processes those in a side lane (e.g. agent:<id>:discord:slash:<chat>) that doesn't go through the bridge — it just renames the old session file aside. The follow-up real message then lands on a brand-new OpenClaw session, but as a normal turn with `softResetTriggered=false`, non-empty body, not bare /new — so isBareSessionReset is false in OpenClaw (get-reply isBareSessionReset condition) and the marker is never injected. The bridge keeps resuming the long-stale claudeSessionId from before the reset. OpenClaw always sends the full conversation history each turn (system + user/assistant pairs + latest user). A request with zero assistant turns in messages[] is therefore a positive signal that the OpenClaw session is brand-new and any prior claudeSessionId we hold belongs to an abandoned OpenClaw session. Treat "no assistant history" as equivalent to bareSessionReset: removeSession + existingEntry = null, so dispatchToClaude is called without --resume and claude starts a fresh CLI session whose id we then store. Also covers any future OpenClaw reset path that resets the session without injecting the marker (idle timeout new-session, admin tooling, etc.). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 16:40:16 +00:00
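
A rough sketch of the "no assistant history" check from cce85a9be8 follows. The in-memory `sessions` map and the helper `resolveResumeId` are hypothetical stand-ins for the bridge's session-map store and dispatch logic.

```typescript
type ChatMessage = { role: string; content: unknown };

// Hypothetical stand-in for the session-map store: sessionKey -> claudeSessionId.
const sessions = new Map<string, string>();

function hasAssistantHistory(messages: ChatMessage[]): boolean {
  return messages.some((m) => m.role === "assistant");
}

/** Returns the id to pass to --resume, or null when the CLI should start fresh. */
function resolveResumeId(
  sessionKey: string,
  messages: ChatMessage[],
  bareSessionReset: boolean,
): string | null {
  if (bareSessionReset || !hasAssistantHistory(messages)) {
    // Any stored id belongs to an abandoned OpenClaw session: drop it.
    sessions.delete(sessionKey);
    return null;
  }
  return sessions.get(sessionKey) ?? null;
}
```
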
zhi	2e64e9ce02	fix(bridge): abort propagation, SSE heartbeat, per-session FIFO queue Three coordinated fixes for the duplicate-Discord-message bug where the same prompt would be answered by two different claude subprocesses running in parallel. Root cause: handleChatCompletions had no concurrency control and no way to detect when OpenClaw closed the upstream HTTP connection. When OpenClaw's idle watchdog tripped (default 120s of stream silence), it would close the socket and retry the prompt, but the original claude subprocess kept running, and the bridge spawned a second one alongside it. Both eventually streamed back, both got delivered to Discord. Native (non-bridge) flow doesn't hit this because OpenClaw's fetch is abort-aware end-to-end: attempt timeout fires AbortSignal, fetch closes the socket, the model provider sees it, work stops. Bridge broke the chain at "spawn subprocess"; this restores it. Changes: * SSE heartbeat (server.ts): write a `: keepalive\n\n` SSE comment every 30s while a turn is in flight. Counts as bytes on the wire so upstream idle timer resets, but is a spec-mandated no-op for the OpenAI stream parser. Eliminates the 120s-silence trigger that was causing OpenClaw to give up on long tool-call sequences in the first place. * Abort propagation (server.ts + both adapters): hook req.on('close') to an AbortController and pass signal: through to dispatchToClaude / dispatchToGemini. Adapters listen on signal abort and call markDone → scheduleCleanup which SIGTERMs the child process group (3s grace for claude, 5s for gemini) then SIGKILLs. Mirrors what native fetch does when its caller aborts. * Per-sessionKey FIFO queue (server.ts): same-session turns serialize via a Map<sessionKey, Promise<void>> chain so a user firing multiple Discord messages back-to-back gets them processed in order rather than spawning concurrent subprocesses (which would corrupt the shared --resume session file). Cross-session requests live on independent chains and run in parallel. Subtle correctness points: * getSession() moved to head-of-queue so we resume into the latest claudeSessionId from the just-finished prior turn instead of a stale request-arrival snapshot. * Aborted turns skip session-map persistence: the subprocess may have already updated its own session file on disk, so the next retry resumes from there. * Queue chain GC uses Map identity check so we don't delete an entry that a later request has already chained onto. * prev.then(() => mySlot, () => mySlot) tolerates a crashed prior turn so the chain doesn't poison forever. * writeHead(200) before queue wait so OpenClaw sees response status immediately; heartbeat covers the queue-wait quiet period. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 23:58:17 +00:00
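
The per-sessionKey FIFO queue from 2e64e9ce02 can be sketched as below. The names `chains` and `enqueueTurn` are hypothetical; the `prev.then(() => mySlot, () => mySlot)` chaining and the identity check before deleting the map entry follow the commit message.

```typescript
// One promise chain per sessionKey; different keys never wait on each other.
const chains = new Map<string, Promise<void>>();

/** Runs task after every earlier task for the same sessionKey has settled. */
async function enqueueTurn<T>(sessionKey: string, task: () => Promise<T>): Promise<T> {
  const prev = chains.get(sessionKey) ?? Promise.resolve();
  let release: () => void = () => {};
  const mySlot = new Promise<void>((resolve) => {
    release = resolve;
  });
  // The new tail settles once prev has settled (either way) and this turn is done,
  // so a crashed prior turn cannot poison the chain.
  const tail = prev.then(() => mySlot, () => mySlot);
  chains.set(sessionKey, tail);
  try {
    await prev.catch(() => undefined);
    return await task();
  } finally {
    release();
    // Identity check: only delete the entry if no later request chained onto it.
    if (chains.get(sessionKey) === tail) chains.delete(sessionKey);
  }
}
```

Because the chain is registered before the first `await`, two calls made back-to-back for the same key are ordered by call time even if the first task is slower.
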
zhi	91acce9b32	fix(bridge): skip OpenClaw runtime-context envelope when picking prompt OpenClaw emits its runtime-context block as a separate custom_message; the openai-completions adapter folds that into the request as an extra role=user message after the real user input. extractLatestUserMessage was taking the last user message unconditionally, so Claude received only the metadata envelope and replied "your message came through empty". Walk user messages backward, skip ones starting with the runtime-context marker, and return the most recent real user message instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 08:19:59 +00:00
zhi	992f4d8703	fix(bridge): scope CLI sessions per OpenClaw session and reset on /new The bridge was keying claudeSessionId by agentId alone, so every Discord channel, DM, and cron run for a single agent shared one Claude CLI session. Two consequences in the wild: - Cross-channel context bleed: 8.7MB session for `developer` mixed references from channels 1474327736242798612 and 1498579994044010566 plus the operator DM all in one --resume thread. - `/new` had no effect on the CLI side. OpenClaw rotated its session file but the bridge kept --resume-ing the same long-lived claudeSessionId, eventually crossing the 1M model context (debug log showed `prompt is too long: 1179616 tokens > 1000000 maximum`). Changes: * input-filter: extract `chat_id` from the Conversation-info untrusted-metadata block (scanning all messages, since runtimeOnly turns put it in the system prompt) and detect bare `/new`/`/reset` via the BARE_SESSION_RESET_PROMPT_BASE marker. Add buildSessionKey `${agentId}::${chatId}` and resolveDispatchPrompt fallback for the empty user message that OpenClaw sends on bare resets. * server: use the composite session key for getSession/putSession; on bareSessionReset, removeSession before dispatching so the CLI starts a fresh session; on a CLI result_error (typically prompt_too_long) drop the entry too so the next turn doesn't re-resume into the poisoned context. * claude/sdk-adapter: surface CLI terminal errors via a new `result_error` event (carries reason + sessionId) so the bridge can react instead of just streaming the synthetic "Prompt is too long" assistant text and silently re-using the same session. * index: convert register() to synchronous (OpenClaw rejects async register with "plugin register must be synchronous"); replace the pre-bind port probe with a server-level EADDRINUSE handler. * .gitignore: ignore node_modules/ and dist/.	2026-04-28 12:32:37 +00:00
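
The composite session key from 992f4d8703 might be derived along these lines. The envelope layout (sentinel header followed by a JSON payload, optionally inside a json fence) is an assumption based on the commit messages in this log, `extractChatId` is a hypothetical helper name, and `chat_id` is assumed to be a string.

```typescript
const CONVERSATION_INFO = "Conversation info (untrusted metadata):";

// Fence marker built at runtime so this example can sit inside a fenced block.
const FENCE = "`".repeat(3);

/** Scans all message texts for the Conversation-info block and returns its chat_id. */
function extractChatId(texts: string[]): string | null {
  for (const text of texts) {
    const at = text.indexOf(CONVERSATION_INFO);
    if (at === -1) continue;
    // Take the block up to the next blank line, then drop any fence lines.
    const block = text
      .slice(at + CONVERSATION_INFO.length)
      .trimStart()
      .split(/\n\s*\n/, 1)[0];
    const json = block
      .split("\n")
      .filter((line) => !line.trim().startsWith(FENCE))
      .join("\n");
    try {
      const parsed = JSON.parse(json) as { chat_id?: unknown };
      if (typeof parsed.chat_id === "string" && parsed.chat_id !== "") return parsed.chat_id;
    } catch {
      // Malformed payload: keep scanning the remaining messages.
    }
  }
  return null;
}

function buildSessionKey(agentId: string, chatId: string): string {
  return `${agentId}::${chatId}`;
}
```
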
zhi	e73a7ea049	fix: support root execution and factory-registered tool lookup - Replace --dangerously-skip-permissions with --allowedTools whitelist to support running Claude Code as root (root blocks the former flag) - Fix /mcp/execute tool lookup for plugins that register tools via factory functions (e.g. padded-cell pcexec) where the global registry names array is empty — now falls back to instantiating factories and matching by returned tool name Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-17 12:23:43 +00:00
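
The factory fallback for /mcp/execute described in e73a7ea049 can be illustrated as follows. The `Registration` shape, the `create` field, and `findTool` are hypothetical; OpenClaw's real registry types do not appear in this log.

```typescript
type Tool = { name: string };

// Hypothetical registry entry: declared names may be empty for factory-registered tools.
type Registration = { names: string[]; create: () => Tool };

function findTool(registry: Registration[], wanted: string): Tool | null {
  // Fast path: the registration declares its tool names up front.
  const declared = registry.find((reg) => reg.names.includes(wanted));
  if (declared) return declared.create();
  // Fallback: instantiate factories with no declared names and match by returned name.
  for (const reg of registry) {
    if (reg.names.length > 0) continue;
    try {
      const tool = reg.create();
      if (tool.name === wanted) return tool;
    } catch {
      // A factory that throws cannot provide the tool; try the next one.
    }
  }
  return null;
}
```
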
hzhang	07a0f06e2e	refactor: restructure to plugin/ + services/ layout and add per-turn bootstrap injection - Migrate src/ → plugin/ (plugin/core/, plugin/web/, plugin/commands/) and src/mcp/ → services/ per OpenClaw plugin dev spec - Add Gemini CLI backend (plugin/core/gemini/sdk-adapter.ts) with GEMINI.md system-prompt injection - Inject bootstrap as stateless system prompt on every turn instead of first turn only: Claude via --system-prompt, Gemini via workspace/GEMINI.md; eliminates isFirstTurn branch, keeps skills in sync with OpenClaw snapshots - Fix session-map-store defensive parsing (sessions ?? []) to handle bare {} reset files without crashing on .find() - Add docs/TEST_FLOW.md with E2E test scenarios and expected outcomes - Add docs/claude/BRIDGE_MODEL_FINDINGS.md with contractor-probe results Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-11 21:21:32 +01:00
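
The defensive parsing mentioned in 07a0f06e2e (`sessions ?? []`) amounts to something like the sketch below. The `SessionEntry` fields and the function names are assumptions for illustration only.

```typescript
// Hypothetical entry shape; the real store's fields are not shown in this log.
type SessionEntry = { key: string; claudeSessionId: string };

/** Parses the session-map file contents, tolerating a bare {} reset file. */
function parseSessionMap(raw: string): SessionEntry[] {
  const parsed = JSON.parse(raw) as { sessions?: SessionEntry[] } | null;
  return parsed?.sessions ?? [];
}

function findSession(raw: string, key: string): SessionEntry | undefined {
  // Safe even when the file has no sessions array: .find() runs on [].
  return parseSessionMap(raw).find((entry) => entry.key === key);
}
```
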

10 Commits