Commit Graph

25 Commits

Author SHA1 Message Date
0f49edf59c fix(bridge): recover inline-prefixed metadata in user message body
OpenClaw's canonical convention is to emit metadata envelopes (chat_id,
sender, reply target, …) as SEPARATE user-role messages folded into the
openai-completions request right after the real one — `extractLatestUserMessage`
skips those whole.

Fabric.OpenclawPlugin's dispatch does not split: it passes metadata
blocks and the real user content as ONE merged user-message body,
separated by blank lines. With the prior filter that meant the entire
turn was dropped with "no user message found" (HTTP 400) because the
first line matched a sentinel — the actual prompt sitting after the
metadata blocks never reached the bridge.

When the whole-body check fails for a single-message body, walk past
leading sentinel-prefixed blocks (sentinel header + optional ```json
code fence + blank-line separator) and use whatever non-metadata block
follows. Falls back to the previous "skip entirely" semantics when the
body is metadata-only.

End-user symptom that surfaced this: every contractor agent (Claude /
Gemini) subscribed to a Fabric channel silently failed to reply to
sub-discussion messages during recruitment — fabric dispatch said
"completed" in 1.6s but trajectory had `assistantTexts: []`,
`terminalError: non_deliverable_terminal_turn`,
`errorMessage: "400 \"no user message found\""`. Surfaced
recruiting developer1 on prod-t2 2026-05-31.
2026-05-31 20:52:15 +01:00
037e92b421 fix(bridge): /mcp/execute handles raw-object tool results (not just AgentToolResult)
OpenClaw plugins return tool results in one of two shapes:

  (a) AgentToolResult — { content: [{type:'text', text:'...'}] }
      used when the plugin wraps via asContent() helper. Every
      Dialectic.OpenclawPlugin tool follows this pattern.

  (b) raw JSON-able object — { ok:true, ...domain fields }
      used when the plugin returns data directly. Every
      Fabric.OpenclawPlugin tool follows this pattern
      (fabric-channel-list, fabric-guild-list, fabric-send-message,
      fabric-channel-set-purpose, etc).

The bridge's /mcp/execute handler only handled shape (a). When a
contractor agent (developer / contractor-test) called any fabric
tool through Claude Code, the bridge ran the tool successfully but
fell back to the literal string '(no result)' because
toolResult.content was undefined. Claude Code then dutifully
rendered '(no result)' as the tool result.

Reproduced on prod:
  openclaw agent --agent developer -m 'Call fabric-channel-list ...'
  → claude code session called mcp__openclaw__fabric-channel-list
  → bridge logged: mcp/execute tool=fabric-channel-list ...
  → bridge replied: { result: '(no result)' }
  → claude code rendered: ''

Fix: normalize the result in the bridge. If toolResult is null →
empty string; if it has a .content array → join the text segments
(shape a); if it's a string → use directly; else → JSON.stringify
the whole thing (shape b). Falls back to '(no result)' only when
all of those produce empty string.

Verified on prod after fix:
  agent receives real {"ok":true,"count":1,"channels":[...]}
  JSON payload (one real prod-push-test channel) in the response.
2026-05-24 09:33:12 +01:00
zhi
0b24330787 fix(bridge): emit empty content delta as heartbeat; preserve user provider fields on reinstall
OpenClaw's LLM idle watchdog (default 120s) fires on lack of *model
progress*, not lack of bytes — an SSE comment frame (": keepalive\n\n")
keeps the TCP socket alive but isn't recognized as progress, so a long
quiet tool-call phase still idles out. When that happens OpenClaw falls
back to re-sending the prior turn's assistant text (pi-embedded:1308
fallbackAnswerText), producing duplicate-Discord-message symptoms.

Heartbeat now emits a real chat.completion.chunk with an empty content
delta every 30s. Clients drop empty deltas; the upstream idle watchdog
should count it as model progress because it's a real event on the
canonical streaming channel.

scripts/install.mjs now spreads the existing provider entry before
overriding script-managed fields, so user-added fields like
timeoutSeconds survive reinstall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-14 08:53:22 +00:00
zhi
1b7cd6b215 fix(bridge): skip new untrusted-metadata envelopes too, not just legacy header
extractLatestUserMessage used isRuntimeContextMessage to skip envelopes
OpenClaw splices into the request as extra role=user messages. It only
recognized the legacy "OpenClaw runtime context for the immediately
preceding user message" header, but current OpenClaw emits a different
family of envelopes — INBOUND_META_SENTINELS in
strip-inbound-meta-*.js: "Conversation info (untrusted metadata):",
"Sender (untrusted metadata):", reply target / thread starter /
forwarded / chat history / untrusted context.

These slipped through the filter, so the newest-first scan picked the
Conversation info envelope as the "latest user message" and forwarded
only chat_id / sender JSON to claude. Claude saw no actual prompt and
replied with a stock greeting, while the user's real message a few
slots earlier was ignored.

Add the seven inbound-meta headers to isRuntimeContextMessage, matched
by exact equality of the trimmed first line to avoid swallowing user
text that happens to mention the phrase.

Must stay in sync with INBOUND_META_SENTINELS in OpenClaw's
strip-inbound-meta module — any new envelope type added upstream needs
to be appended here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:03:02 +00:00
zhi
cce85a9be8 fix(bridge): reset claude session when OpenClaw sends no assistant history
The May 7 fix made the bridge detect /new turns by scanning messages
for the bare-reset marker ("A new session was started via /new or
/reset"). That handles the case where /new is the body of the
current user turn, but misses a very common path: the user types
`/new` as a standalone slash command. OpenClaw processes those in a
side lane (e.g. agent:<id>:discord:slash:<chat>) that doesn't go
through the bridge — it just renames the old session file aside.
The follow-up real message then lands on a brand-new OpenClaw
session, but as a normal turn with `softResetTriggered=false`,
non-empty body, not bare /new — so isBareSessionReset is false in
OpenClaw (get-reply isBareSessionReset condition) and the marker is
never injected. The bridge keeps resuming the long-stale
claudeSessionId from before the reset.

OpenClaw always sends the full conversation history each turn
(system + user/assistant pairs + latest user). A request with zero
assistant turns in messages[] is therefore a positive signal that
the OpenClaw session is brand-new and any prior claudeSessionId we
hold belongs to an abandoned OpenClaw session.

Treat "no assistant history" as equivalent to bareSessionReset:
removeSession + existingEntry = null, so dispatchToClaude is called
without --resume and claude starts a fresh CLI session whose id we
then store. Also covers any future OpenClaw reset path that resets
the session without injecting the marker (idle timeout new-session,
admin tooling, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:40:16 +00:00
zhi
5fca8f5da1 fix: don't block one-shot openclaw subcommands; migrate to current plugin SDK
Bridge server lifecycle:
- Move createBridgeServer() out of register() into an api.on("gateway_start", ...)
  handler. register() runs in every CLI subprocess that loads plugins
  (e.g. `openclaw completion`, `openclaw doctor`); eagerly binding the
  bridge HTTP listener there could pin those processes when no gateway
  is already holding the port.
- Call server.unref() so the listener never pins the host's event loop,
  even if startup somehow runs outside the gateway.

Plugin SDK convention update:
- Wrap default export with definePluginEntry({ id, name, description, register })
  per the current openclaw plugin authoring contract.
- Switch imports from the deprecated root barrel "openclaw/plugin-sdk" to
  focused "openclaw/plugin-sdk/core" / "openclaw/plugin-sdk/plugin-entry".
- Modernize openclaw.plugin.json: drop version/main, add activation.onStartup
  so gateway_start fires for this plugin at boot, declare commandAliases
  for the contractor-agents CLI command.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 07:54:11 +00:00
zhi
2e64e9ce02 fix(bridge): abort propagation, SSE heartbeat, per-session FIFO queue
Three coordinated fixes for the duplicate-Discord-message bug where the
same prompt would be answered by two different claude subprocesses
running in parallel.

Root cause: handleChatCompletions had no concurrency control and no
way to detect when OpenClaw closed the upstream HTTP connection. When
OpenClaw's idle watchdog tripped (default 120s of stream silence), it
would close the socket and retry the prompt — but the original claude
subprocess kept running, and the bridge spawned a second one alongside
it. Both eventually streamed back, both got delivered to Discord.

Native (non-bridge) flow doesn't hit this because OpenClaw's fetch is
abort-aware end-to-end: attempt timeout fires AbortSignal, fetch closes
the socket, the model provider sees it, work stops. Bridge broke the
chain at "spawn subprocess" — this restores it.

Changes:

* SSE heartbeat (server.ts): write a `: keepalive\n\n` SSE comment
  every 30s while a turn is in flight. Counts as bytes on the wire so
  upstream idle timer resets, but is a spec-mandated no-op for the
  OpenAI stream parser. Eliminates the 120s-silence trigger that was
  causing OpenClaw to give up on long tool-call sequences in the first
  place.

* Abort propagation (server.ts + both adapters): hook req.on('close')
  to an AbortController and pass signal: through to dispatchToClaude /
  dispatchToGemini. Adapters listen on signal abort and call markDone
  → scheduleCleanup which SIGTERMs the child process group (3s grace
  for claude, 5s for gemini) then SIGKILLs. Mirrors what native fetch
  does when its caller aborts.

* Per-sessionKey FIFO queue (server.ts): same-session turns serialize
  via a Map<sessionKey, Promise<void>> chain so a user firing multiple
  Discord messages back-to-back gets them processed in order rather
  than spawning concurrent subprocesses (which would corrupt the shared
  --resume session file). Cross-session requests live on independent
  chains and run in parallel.

Subtle correctness points:

* getSession() moved to head-of-queue so we resume into the latest
  claudeSessionId from the just-finished prior turn instead of a stale
  request-arrival snapshot.
* Aborted turns skip session-map persistence — the subprocess may have
  already updated its own session file on disk, so the next retry
  resumes from there.
* Queue chain GC uses Map identity check so we don't delete an entry
  that a later request has already chained onto.
* prev.then(() => mySlot, () => mySlot) tolerates a crashed prior turn
  so the chain doesn't poison forever.
* writeHead(200) before queue wait so OpenClaw sees response status
  immediately; heartbeat covers the queue-wait quiet period.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 23:58:17 +00:00
h z
b94e0d25f6 Merge pull request 'fix(bridge): skip OpenClaw runtime-context envelope when picking prompt' (#3) from fix/skip-runtime-context-message into main
Reviewed-on: #3
2026-04-29 08:32:52 +00:00
zhi
91acce9b32 fix(bridge): skip OpenClaw runtime-context envelope when picking prompt
OpenClaw emits its runtime-context block as a separate custom_message; the
openai-completions adapter folds that into the request as an extra role=user
message after the real user input. extractLatestUserMessage was taking the
last user message unconditionally, so Claude received only the metadata
envelope and replied "your message came through empty".

Walk user messages backward, skip ones starting with the runtime-context
marker, and return the most recent real user message instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:19:59 +00:00
h z
4e015c677b Merge pull request 'fix(bridge): scope CLI sessions per OpenClaw session and reset on /new' (#2) from fix/per-openclaw-session-cli-mapping into main
Reviewed-on: #2
2026-04-29 07:27:09 +00:00
zhi
992f4d8703 fix(bridge): scope CLI sessions per OpenClaw session and reset on /new
The bridge was keying claudeSessionId by agentId alone, so every Discord
channel, DM, and cron run for a single agent shared one Claude CLI
session. Two consequences in the wild:

  - Cross-channel context bleed: 8.7MB session for `developer` mixed
    references from channels 1474327736242798612 and 1498579994044010566
    plus the operator DM all in one --resume thread.
  - `/new` had no effect on the CLI side. OpenClaw rotated its session
    file but the bridge kept --resume-ing the same long-lived
    claudeSessionId, eventually crossing the 1M model context (debug log
    showed `prompt is too long: 1179616 tokens > 1000000 maximum`).

Changes:

  * input-filter: extract `chat_id` from the Conversation-info
    untrusted-metadata block (scanning all messages, since runtimeOnly
    turns put it in the system prompt) and detect bare `/new`/`/reset`
    via the BARE_SESSION_RESET_PROMPT_BASE marker. Add buildSessionKey
    `${agentId}::${chatId}` and resolveDispatchPrompt fallback for the
    empty user message that OpenClaw sends on bare resets.

  * server: use the composite session key for getSession/putSession;
    on bareSessionReset, removeSession before dispatching so the CLI
    starts a fresh session; on a CLI result_error (typically
    prompt_too_long) drop the entry too so the next turn doesn't
    re-resume into the poisoned context.

  * claude/sdk-adapter: surface CLI terminal errors via a new
    `result_error` event (carries reason + sessionId) so the bridge
    can react instead of just streaming the synthetic
    "Prompt is too long" assistant text and silently re-using the
    same session.

  * index: convert register() to synchronous (OpenClaw rejects async
    register with "plugin register must be synchronous"); replace the
    pre-bind port probe with a server-level EADDRINUSE handler.

  * .gitignore: ignore node_modules/ and dist/.
2026-04-28 12:32:37 +00:00
zhi
6be8d47982 fix(claude-adapter): commit turn on result event, dont block on process exit
Previously dispatchToClaude awaited child.on(close) before yielding the done
event. Claude CLIs Bash tool occasionally leaves ssh/bash grandchildren alive
(e.g. a GUI app that ignores SIGPIPE on the remote end of a piped ssh command);
that kept claude -p alive past end-of-turn, which kept the bridge SSE stream
open, which kept OpenClaw from committing the turn to its session jsonl.

Switch to emitting done as soon as the terminal result stream-json event
arrives. Spawn claude in its own process group (detached:true) and schedule
a best-effort SIGTERM/SIGKILL sweep of leaked descendants; temp-file cleanup
runs asynchronously on actual process close.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-21 08:52:57 +00:00
zhi
e73a7ea049 fix: support root execution and factory-registered tool lookup
- Replace --dangerously-skip-permissions with --allowedTools whitelist
  to support running Claude Code as root (root blocks the former flag)
- Fix /mcp/execute tool lookup for plugins that register tools via
  factory functions (e.g. padded-cell pcexec) where the global registry
  names array is empty — now falls back to instantiating factories and
  matching by returned tool name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-17 12:23:43 +00:00
49af2129ae --append 2026-04-12 14:05:56 +01:00
h z
9cd90f7213 Merge pull request 'feat/plugin-services-restructure' (#1) from feat/plugin-services-restructure into main
Reviewed-on: #1
2026-04-11 20:38:30 +00:00
6ae795eea8 chore: add .gitignore and untrack internal Claude dev notes
- Add .gitignore covering node_modules, .idea, credential files
- Untrack docs/claude/LESSONS_LEARNED.md and OPENCLAW_PLUGIN_DEV.md
  (internal dev notes, not part of the plugin repo)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 21:28:56 +01:00
07a0f06e2e refactor: restructure to plugin/ + services/ layout and add per-turn bootstrap injection
- Migrate src/ → plugin/ (plugin/core/, plugin/web/, plugin/commands/)
  and src/mcp/ → services/ per OpenClaw plugin dev spec
- Add Gemini CLI backend (plugin/core/gemini/sdk-adapter.ts) with GEMINI.md
  system-prompt injection
- Inject bootstrap as stateless system prompt on every turn instead of
  first turn only: Claude via --system-prompt, Gemini via workspace/GEMINI.md;
  eliminates isFirstTurn branch, keeps skills in sync with OpenClaw snapshots
- Fix session-map-store defensive parsing (sessions ?? []) to handle bare {}
  reset files without crashing on .find()
- Add docs/TEST_FLOW.md with E2E test scenarios and expected outcomes
- Add docs/claude/BRIDGE_MODEL_FINDINGS.md with contractor-probe results

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 21:21:32 +01:00
eee62efbf1 feat: add openclaw origin-name alias to MCP tool descriptions
Claude Code sees tools as mcp__openclaw__<name> but skills/instructions
reference the original OpenClaw name (e.g. "memory_search"). Each tool
description now includes a disambiguation line:

  [openclaw tool: <name>] If any skill or instruction refers to "<name>",
  this is the tool to call.

This ensures Claude can correctly map skill references to the right tool
even when the MCP prefix makes the name appear different.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 17:25:33 +01:00
7a779c8560 feat: relay MCP tool calls to OpenClaw plugin registry instead of reimplementing
When Claude Code calls an MCP tool via /mcp/execute, the bridge now:
1. Looks up the tool in OpenClaw's global plugin registry (getGlobalPluginRegistry)
2. Calls the tool's factory with proper context (config from globalThis, canonical
   session key "agent:<agentId>:direct:bridge" for memory agentId resolution)
3. Executes the real tool implementation and returns the AgentToolResult text

This replaces the manual memory_search/memory_get implementations in memory-handlers.ts
with a generic relay that works for any registered OpenClaw tool. The agentId is now
propagated from dispatchToClaude through the MCP server env (AGENT_ID) to /mcp/execute.

The OpenClaw config is stored in globalThis._contractorOpenClawConfig during plugin
registration (index.ts api.config) since getRuntimeConfigSnapshot() uses module-level
state not shared across bundle boundaries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 17:09:02 +01:00
f9d43477bf feat: implement memory_search/memory_get MCP tools and fix resume MCP disconnect
Key fixes:
- Pass --mcp-config on every turn (not just first): MCP server exits with each
  claude -p process, so --resume also needs a fresh MCP server
- Pass openclawTools on every turn for the same reason
- Add WORKSPACE env var to MCP server so execute requests include workspace path
- Implement memory_search (keyword search over workspace/memory/*.md + MEMORY.md)
- Implement memory_get (line-range read from workspace memory files)
- Both are workspace-aware, resolved per-request via request body workspace field

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 15:55:44 +01:00
1d9b765a6d docs: document Claude Code identity/SOUL.md conflict analysis (14.9)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 15:46:51 +01:00
fb1ec1b5c4 feat: inject SOUL.md/IDENTITY.md persona and detect workspace context files
- input-filter: scan workspace for SOUL.md, IDENTITY.md, MEMORY.md etc. on each turn
- bootstrap: when these files exist, instruct Claude to Read them at session start
- This gives the contractor agent its OpenClaw persona (SOUL.md embody, IDENTITY.md fill)
- Memory note added to bootstrap when workspace has memory files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 15:41:25 +01:00
76a7931f97 feat: implement MCP proxy for OpenClaw tool access in contractor agent
Complete the MCP tool call chain:
- contractor-agent bridge exposes /mcp/execute endpoint for tool callbacks
- openclaw-mcp-server.mjs proxies OpenClaw tool defs to Claude as MCP tools
- sdk-adapter passes --mcp-config on first turn with all OpenClaw tools
- tool-test plugin registers contractor_echo in globalThis tool handler map
- agent-config-writer auto-sets tools.profile=full so OpenClaw sends tool defs
- Fix --mcp-config arg ordering: prompt must come before <configs...> flag

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 13:05:03 +01:00
nav
f7c5875eeb docs: expand Claude contractor design 2026-04-11 06:57:11 +00:00
nav
ca91c1de41 docs: add initial ContractorAgent planning 2026-04-11 06:39:54 +00:00