4 Commits

Author SHA1 Message Date
h z
d726c3c35d Merge pull request 'fix(bridge): strip NODE_OPTIONS --inspect before spawning claude/gemini' (#5) from fix/strip-inspect-from-spawn-env into main 2026-05-31 20:04:54 +00:00
d381c486ab fix(bridge): strip NODE_OPTIONS --inspect before spawning claude/gemini
claude-code and gemini-cli are both Node binaries. When the parent
gateway is launched with `NODE_OPTIONS=--inspect=127.0.0.1:9229` (for
debugging), spawn(child).env = {...process.env} propagates the flag into
the child. The child Node then tries to bind the same inspector port,
fails EADDRINUSE, and exits SILENTLY (no stdout, no stderr).

Bridge sees an empty stream and reports `claude did not return a
session_id` with an empty stderr summary — extremely opaque diagnostic
that took non-trivial digging to root-cause.

Sanitize NODE_OPTIONS before spawn: keep everything except
`--inspect*` / `--inspect-brk*` / `--debug*`. Operators that legitimately
need other NODE_OPTIONS values (e.g. `--max-old-space-size`) keep them.

Verified end-user repro on prod-t2 2026-05-31: with
`Environment=NODE_OPTIONS=--inspect=127.0.0.1:9229` in the gateway
systemd drop-in, `claude -p "hi" --output-format stream-json --verbose`
spawned from the bridge returned ZERO bytes; running the exact same
command from a shell without the env var returned the full init →
assistant → result stream in ~6s. Surfaced recruiting developer1
(Cody, contractor-claude-bridge).
2026-05-31 21:04:53 +01:00
h z
689e3da0ba Merge pull request 'fix(bridge): recover inline-prefixed metadata in user message body' (#4) from fix/inline-metadata-strip into main 2026-05-31 19:52:39 +00:00
0f49edf59c fix(bridge): recover inline-prefixed metadata in user message body
OpenClaw's canonical convention is to emit metadata envelopes (chat_id,
sender, reply target, …) as SEPARATE user-role messages folded into the
openai-completions request right after the real one — `extractLatestUserMessage`
skips those whole.

Fabric.OpenclawPlugin's dispatch does not split: it passes metadata
blocks and the real user content as ONE merged user-message body,
separated by blank lines. With the prior filter that meant the entire
turn was dropped with "no user message found" (HTTP 400) because the
first line matched a sentinel — the actual prompt sitting after the
metadata blocks never reached the bridge.

When the whole-body check fails for a single-message body, walk past
leading sentinel-prefixed blocks (sentinel header + optional ```json
code fence + blank-line separator) and use whatever non-metadata block
follows. Falls back to the previous "skip entirely" semantics when the
body is metadata-only.

End-user symptom that surfaced this: every contractor agent (Claude /
Gemini) subscribed to a Fabric channel silently failed to reply to
sub-discussion messages during recruitment — fabric dispatch said
"completed" in 1.6s but trajectory had `assistantTexts: []`,
`terminalError: non_deliverable_terminal_turn`,
`errorMessage: "400 \"no user message found\""`. Surfaced
recruiting developer1 on prod-t2 2026-05-31.
2026-05-31 20:52:15 +01:00
3 changed files with 92 additions and 3 deletions

View File

@@ -158,10 +158,29 @@ export async function* dispatchToClaude(
// detached:true puts claude in its own process group. Claude's Bash tool
// occasionally leaks shells/ssh that keep claude alive past end-of-turn; when
// that happens we SIGKILL the whole group rather than wait forever.
// Sanitize NODE_OPTIONS before spawning. Claude Code is a Node CLI; if
// the parent gateway runs with `NODE_OPTIONS=--inspect=...:9229`, every
// child Node process — including claude — tries to bind the same inspector
// port, fails (EADDRINUSE), and exits SILENTLY (no stdout, no stderr).
// Bridge then sees an empty stream and reports `claude did not return a
// session_id` with no useful diagnostic. Strip any --inspect* /
// --inspect-brk* / --debug* flag from NODE_OPTIONS; keep everything else
// (e.g. --max-old-space-size) in case operators depend on it.
const childEnv: NodeJS.ProcessEnv = { ...process.env };
if (childEnv.NODE_OPTIONS) {
const filtered = childEnv.NODE_OPTIONS
.split(/\s+/)
.filter((tok) => tok && !tok.startsWith("--inspect") && !tok.startsWith("--debug"))
.join(" ")
.trim();
if (filtered) childEnv.NODE_OPTIONS = filtered;
else delete childEnv.NODE_OPTIONS;
}
const child = spawn("claude", args, {
cwd: workspace,
stdio: ["ignore", "pipe", "pipe"],
env: { ...process.env },
env: childEnv,
detached: true,
});

View File

@@ -156,10 +156,24 @@ export async function* dispatchToGemini(
args.push("--resume", resumeSessionId);
}
// Sanitize NODE_OPTIONS before spawning — same reason as the claude
// adapter: gemini-cli is a Node binary; inheriting a parent
// `NODE_OPTIONS=--inspect=...:9229` makes every child silently EADDRINUSE.
const childEnv: NodeJS.ProcessEnv = { ...process.env };
if (childEnv.NODE_OPTIONS) {
const filtered = childEnv.NODE_OPTIONS
.split(/\s+/)
.filter((tok) => tok && !tok.startsWith("--inspect") && !tok.startsWith("--debug"))
.join(" ")
.trim();
if (filtered) childEnv.NODE_OPTIONS = filtered;
else delete childEnv.NODE_OPTIONS;
}
const child = spawn("gemini", args, {
cwd: workspace,
stdio: ["ignore", "pipe", "pipe"],
env: { ...process.env },
env: childEnv,
});
const stderrLines: string[] = [];

View File

@@ -61,6 +61,51 @@ function isRuntimeContextMessage(text: string): boolean {
return INBOUND_META_SENTINELS.includes(firstLine);
}
/**
* Strip *prefixed* metadata blocks from a single user message body.
*
* Background: OpenClaw's older / canonical convention is to emit metadata
* envelopes (chat_id, sender, reply target, …) as SEPARATE user-role
* messages folded into the openai-completions request right after the real
* one. `extractLatestUserMessage` skips those separate messages whole.
*
* The Fabric channel plugin (Fabric.OpenclawPlugin) does not split: it
* passes the metadata blocks and the actual user content as ONE merged
* user-message body, separated by blank lines. Result: a Fabric inbound
* turn has a single user message whose first line matches a sentinel, so
* `isRuntimeContextMessage` returned true and the whole turn was dropped
* with "no user message found" (HTTP 400). Symptom: every contractor agent
* subscribed to a Fabric channel silently fails to reply.
*
* Strip leading sentinel-prefixed blocks (sentinel header + optional code
* fence + blank-line separator) until we hit a block whose first line is
* NOT a sentinel — that's the real prompt. Returns "" if the entire body
* is metadata-only (still a no-op turn, same as before).
*/
function stripPrefixedMetadataBlocks(raw: string): string {
// Split on blank-line block boundaries. Within a metadata block the JSON
// code fence has only single newlines, so it stays in the same chunk as
// its sentinel header.
const blocks = raw.split(/\n\n+/);
const kept: string[] = [];
let stillStripping = true;
for (const block of blocks) {
if (stillStripping) {
const trimmed = block.trimStart();
const firstLine = trimmed.split("\n", 1)[0].trim();
if (
INBOUND_META_SENTINELS.includes(firstLine) ||
trimmed.startsWith(LEGACY_RUNTIME_CONTEXT_MARKER)
) {
continue; // drop this prefix block
}
stillStripping = false;
}
kept.push(block);
}
return kept.join("\n\n").trim();
}
/**
* Extract the latest user-authored message from the OpenClaw request.
*
@@ -85,7 +130,18 @@ export function extractLatestUserMessage(req: BridgeInboundRequest): string {
for (let i = userMessages.length - 1; i >= 0; i -= 1) {
const raw = messageText(userMessages[i]);
if (!raw) continue;
if (isRuntimeContextMessage(raw)) continue;
// First try the separate-message convention (Discord/Telegram path): a
// user message whose entire body is one metadata envelope — skip whole.
if (isRuntimeContextMessage(raw)) {
// ...but also try inline-prefixed-metadata recovery (Fabric path):
// some channels splice metadata + real content into one body. Walk
// past leading sentinel blocks; if any non-metadata block follows,
// that's the prompt. If nothing follows, this turn really is pure
// metadata and we keep skipping.
const stripped = stripPrefixedMetadataBlocks(raw);
if (stripped) return stripOpenClawTimestampPrefix(stripped);
continue;
}
return stripOpenClawTimestampPrefix(raw);
}
return "";