diff --git a/PROJECT_PLAN.md b/PROJECT_PLAN.md new file mode 100644 index 0000000..0d779bd --- /dev/null +++ b/PROJECT_PLAN.md @@ -0,0 +1,511 @@ +# ClawSpec Project Plan + +## 1. Project Overview + +**ClawSpec** is an automation testing framework for OpenClaw plugins. + +Its purpose is to provide a deterministic, reproducible way to validate plugin behavior in a realistic OpenClaw runtime environment without relying on an actual LLM provider. + +Instead of calling real language models, ClawSpec will run OpenClaw instances against a rule-based fake model service. This allows plugin developers to test message handling, tool-calling flows, plugin configuration, integration boundaries, and observable side effects with stable results. + +--- + +## 2. Problem Statement + +OpenClaw plugin testing is currently hard to standardize because: + +- plugin behavior often depends on runtime integration rather than isolated pure functions +- real LLM responses are non-deterministic and expensive +- testing usually requires manual setup of OpenClaw, plugin installation, configuration, and message simulation +- there is no unified way to express plugin test environments and expected outcomes + +ClawSpec aims to solve this by offering: + +- declarative test environment definitions +- deterministic model behavior via rules instead of real LLMs +- automated provisioning of OpenClaw runtime units +- repeatable execution of plugin integration tests +- structured validation and test reporting + +--- + +## 3. 
Project Goals + +### Primary Goals + +- Define test environments for one or more OpenClaw runtime units +- Install and configure plugins automatically per test spec +- Provide a fake model service that responds according to declarative rules +- Execute test cases against running OpenClaw units +- Verify expected behavior through built-in assertions and custom verifier scripts +- Produce reproducible test artifacts and reports + +### Secondary Goals + +- Support multiple OpenClaw versions for compatibility testing +- Support multi-unit scenarios in a single test suite +- Provide reusable example specs for plugin developers +- Make local plugin integration testing fast enough for everyday development + +### Non-Goals for V1 + +- Full UI dashboard +- Large-scale distributed execution +- Performance benchmarking +- Fuzzing or random conversation generation +- Automatic support for every possible provider protocol +- Replacing unit tests inside plugin repositories + +--- + +## 4. Product Positioning + +ClawSpec is **not** a generic unit test runner. + +It is a **scenario-driven integration testing framework** for validating the behavior of: + +- OpenClaw runtime +- installed plugins +- model interaction boundaries +- tool-calling flows +- message outputs +- side effects exposed through plugins or configured backends + +The core value is deterministic validation of complex runtime behavior. + +--- + +## 5. High-Level Architecture + +ClawSpec is expected to contain four major components. 
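As a rough TypeScript sketch of how these four components might hand off to one another — every name and signature below is an illustrative assumption, not a committed ClawSpec API:

```typescript
// Illustrative sketch only: all names and signatures are assumptions,
// not a committed ClawSpec API.

interface ExecutionPlan {
  unitIds: string[]; // OpenClaw runtime units to provision
  testIds: string[]; // test cases to execute against them
}

interface SpecLoader {
  // read the spec file, validate it, and normalize it into a plan
  load(specPath: string): ExecutionPlan;
}

interface EnvironmentOrchestrator {
  up(plan: ExecutionPlan): Promise<void>; // provision units, install plugins
  down(): Promise<void>;                  // tear down unless retention is requested
}

interface FakeModelService {
  // deterministic stand-in for a real model backend
  handle(request: { unitId: string; message: string }): { text: string };
}

interface TestRunner {
  run(plan: ExecutionPlan): Promise<{ testId: string; passed: boolean }[]>;
}

// Minimal in-memory stub showing the intended hand-off (not real behavior):
const loader: SpecLoader = {
  load: () => ({ unitIds: ["calendar-agent"], testIds: ["calendar-reminder-basic"] }),
};
const plan = loader.load("./examples/calendar-plugin.json");
```

The point of the sketch is the data flow, not the types themselves: the loader produces a normalized plan, the orchestrator and fake model service consume it, and the runner reports per-test results.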
+ +### 5.1 Spec Loader + +Responsible for: + +- reading the project spec file +- validating structure against schema +- normalizing runtime definitions +- producing an execution plan for the runner + +### 5.2 Environment Orchestrator + +Responsible for: + +- provisioning OpenClaw test units +- generating or managing Docker Compose definitions +- preparing workspace directories and mounted volumes +- installing plugins +- applying plugin configuration +- starting and stopping test environments + +### 5.3 Fake Model Service + +Responsible for: + +- exposing a model-compatible endpoint for OpenClaw +- receiving model requests from test units +- matching incoming requests against declarative rules +- returning deterministic text responses and/or tool-calling instructions +- logging interactions for debugging and verification + +### 5.4 Test Runner + +Responsible for: + +- selecting target test cases +- injecting input events/messages +- collecting outputs, logs, and tool-call traces +- evaluating built-in assertions +- executing optional verifier scripts +- producing final pass/fail results and artifacts + +--- + +## 6. Core Design Principles + +### Determinism First + +The framework should avoid real LLM randomness in automated tests. + +### Runtime Realism + +Tests should run against realistic OpenClaw environments, not only mocked plugin internals. + +### Declarative Configuration + +Test environments and cases should be defined in configuration files rather than hard-coded scripts. + +### Extensible Verification + +Built-in assertions should cover common cases, while custom scripts should support project-specific validation. + +### Reproducible Artifacts + +All important outputs should be captured for debugging, including logs, matched model rules, tool-call traces, and verifier results. + +--- + +## 7. 
Proposed Spec Structure + +The initial idea is to define a single spec file that describes: + +- OpenClaw runtime units +- plugin installation and configuration +- test cases +- fake model behavior +- expected validation steps + +A normalized V1 structure may look like this: + +```json +{ + "version": "v1", + "environment": { + "networkName": "clawspec-net", + "workspaceRoot": "./.clawspec/workspaces", + "artifactsRoot": "./.clawspec/artifacts" + }, + "clawUnits": [ + { + "unitId": "calendar-agent", + "openClawVersion": "0.5.0", + "image": "ghcr.io/openclaw/openclaw:0.5.0", + "plugins": [ + { + "pluginName": "harborforge-calendar", + "installCommand": "openclaw plugins add harborforge-calendar", + "configs": { + "backendUrl": "http://fake-backend:8080", + "agentId": "calendar-test-agent" + } + } + ] + } + ], + "testCases": [ + { + "testId": "calendar-reminder-basic", + "targetUnitId": "calendar-agent", + "input": { + "channel": "discord", + "chatType": "direct", + "message": "明天下午3点提醒我开会" + }, + "modelRules": [ + { + "receive": ".*明天下午3点提醒我开会.*", + "action": { + "type": "tool_call_then_respond", + "toolName": "harborforge_calendar_create", + "toolParameters": { + "title": "开会", + "time": "tomorrow 15:00" + }, + "text": "已经帮你记下了" + } + } + ], + "expected": [ + { + "type": "tool_called", + "toolName": "harborforge_calendar_create" + }, + { + "type": "message_contains", + "value": "已经帮你记下了" + } + ], + "verifier": { + "type": "script", + "path": "./verifiers/calendar-reminder-basic.sh" + } + } + ] +} +``` + +This is only a starting point. The exact schema should be refined during implementation. + +--- + +## 8. Fake Model Service Design + +The fake model service is one of the most important parts of ClawSpec. + +It should behave like a deterministic model backend that OpenClaw can call during tests. 
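The heart of that service is ordered rule evaluation. A minimal TypeScript sketch — the rule shape mirrors the `modelRules` entries in the section 7 example, while the matching semantics (first rule whose `receive` regex matches wins) are an assumption to be confirmed during implementation:

```typescript
// Rule shape mirrors the `modelRules` entries in the section 7 example.
// First-match-wins semantics are an assumption, not settled design.

interface ModelRule {
  receive: string; // regex tested against the incoming message
  action:
    | { type: "respond"; text: string }
    | {
        type: "tool_call_then_respond";
        toolName: string;
        toolParameters: Record<string, unknown>;
        text: string;
      };
}

function matchRule(
  rules: ModelRule[],
  message: string,
): ModelRule["action"] | undefined {
  for (const rule of rules) {
    if (new RegExp(rule.receive).test(message)) {
      return rule.action; // first matching rule wins; the real service should log the match
    }
  }
  return undefined; // no match: the real service should log this and fail loudly
}

// Example: a text-only rule, matched deterministically.
const rules: ModelRule[] = [
  { receive: ".*remind me.*", action: { type: "respond", text: "Reminder noted." } },
];
const action = matchRule(rules, "please remind me tomorrow"); // -> the "respond" action
```

Because the same rules and inputs always select the same action, a failing test reproduces identically on every machine, which is the property the whole framework is built around.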
+ +### Responsibilities + +- receive model requests from OpenClaw +- inspect request content and context +- match a rule set in declared order +- return predefined outputs +- support text-only responses +- support tool-calling responses +- support tool-call plus final text patterns +- emit logs showing which rule matched and what output was generated + +### Why This Matters + +Without this service, tests would depend on live model providers, causing: + +- unstable results +- variable tool-calling behavior +- token costs +- difficult reproduction of failures + +The fake model service turns model behavior into a controlled part of the test spec. + +--- + +## 9. Verification Model + +ClawSpec should support two layers of verification. + +### 9.1 Built-in Assertions + +Common assertions should be supported directly by the framework, such as: + +- `message_contains` +- `message_equals` +- `tool_called` +- `tool_called_with` +- `exit_code` +- `log_contains` + +### 9.2 External Verifier Scripts + +Custom verifier scripts should be supported for advanced cases, such as: + +- checking database state +- validating generated files +- verifying HTTP side effects +- checking plugin-specific external systems + +This combination keeps common tests simple while preserving flexibility. + +--- + +## 10. Execution Flow + +A typical ClawSpec test run should look like this: + +1. Load and validate spec file +2. Prepare workspace and artifact directories +3. Materialize runtime environment definitions +4. Start fake model service +5. Start target OpenClaw unit(s) +6. Install and configure required plugins +7. Inject test input into the target unit +8. Let the runtime interact with the fake model service +9. Collect outputs, logs, tool traces, and events +10. Evaluate expected assertions +11. Run external verifier script if defined +12. Produce final result summary and artifact bundle +13. Tear down environment unless retention is requested + +--- + +## 11. 
Proposed Directory Structure + +```text +ClawSpec/ +├── README.md +├── PROJECT_PLAN.md +├── docs/ +│ ├── architecture.md +│ ├── spec-schema.md +│ ├── fake-model.md +│ └── runner.md +├── schema/ +│ └── clawspec.schema.json +├── examples/ +│ ├── basic.json +│ └── calendar-plugin.json +├── docker/ +│ ├── compose.template.yml +│ └── fake-model.Dockerfile +├── src/ +│ ├── cli/ +│ ├── spec/ +│ ├── orchestrator/ +│ ├── model/ +│ ├── runner/ +│ └── report/ +├── verifiers/ +│ └── examples/ +└── .clawspec/ + ├── workspaces/ + └── artifacts/ +``` + +--- + +## 12. Recommended Tech Stack + +### Preferred Language + +**TypeScript / Node.js** + +Reasoning: + +- fits well with OpenClaw ecosystem conventions +- convenient for JSON schema validation +- good support for CLI tooling +- convenient for HTTP fake service implementation +- straightforward Docker and subprocess orchestration + +### Suggested Libraries + +- `ajv` for schema validation +- `commander` or `yargs` for CLI +- `execa` for shell and Docker command orchestration +- `fastify` or `express` for fake model service +- `yaml` for optional YAML support in the future +- `vitest` for ClawSpec self-tests + +--- + +## 13. V1 Scope + +The first version should focus on the smallest useful end-to-end workflow. + +### V1 Must Include + +- load one spec file +- validate the basic schema +- start one OpenClaw test unit +- install one or more plugins in that unit +- apply plugin configuration entries +- start one fake model service +- inject one test input +- support rule-based text responses +- support rule-based tool-calling responses +- support basic assertions: + - message contains + - tool called + - script verifier +- generate logs and a pass/fail summary + +### V1 Should Avoid + +- complex multi-turn state machines +- distributed execution +- UI dashboard +- performance benchmarking +- broad provider emulation beyond test needs +- advanced matrix test expansion + +--- + +## 14. 
Milestone Proposal + +### Milestone 0 - Project Bootstrap + +- initialize repository layout +- define coding conventions +- write initial README and project plan +- select runtime and libraries + +### Milestone 1 - Spec Definition + +- draft spec schema v0.1 +- implement spec parser and validation +- add example specs + +### Milestone 2 - Fake Model Service + +- define internal rule format +- implement rule matcher +- implement deterministic response generation +- add request/response logging + +### Milestone 3 - Environment Orchestrator + +- generate runtime environment configuration +- start and stop OpenClaw containers +- apply plugin install commands +- apply plugin configs + +### Milestone 4 - Test Runner + +- inject test inputs +- collect runtime outputs +- evaluate assertions +- execute verifier scripts +- output structured test reports + +### Milestone 5 - First Real Plugin Demo + +- create an example test suite for a real OpenClaw plugin +- validate the full workflow end to end +- document limitations and next steps + +--- + +## 15. Risks and Open Questions + +### Runtime Interface Risk + +The exact model-provider interface expected by OpenClaw must be verified early. The fake model service depends on matching this contract well enough for tests. + +### Plugin Installation Variability + +Different plugins may require different setup flows. ClawSpec must decide how much it standardizes versus how much it leaves to custom setup hooks. + +### Observable Output Boundaries + +Some plugins expose behavior through logs, some through tool calls, some through external HTTP effects. The framework must define what counts as the authoritative observable result. 
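One way to make the "authoritative observable result" concrete is to normalize everything the runner collects into a single run record and evaluate the section 9 assertions against it. A TypeScript sketch covering a subset of the built-in assertions — the record shape and all names here are assumptions:

```typescript
// Assumed shape of what the runner collects per test; the real record will be
// richer (exit codes, HTTP effects, matched model rules, ...).

interface RunRecord {
  messages: string[];                // messages emitted by the OpenClaw unit
  toolCalls: { toolName: string }[]; // observed tool-call trace
  logs: string[];
}

type Assertion =
  | { type: "message_contains"; value: string }
  | { type: "tool_called"; toolName: string }
  | { type: "log_contains"; value: string };

function evaluate(record: RunRecord, expected: Assertion[]): boolean {
  return expected.every((a) => {
    switch (a.type) {
      case "message_contains":
        return record.messages.some((m) => m.includes(a.value));
      case "tool_called":
        return record.toolCalls.some((c) => c.toolName === a.toolName);
      case "log_contains":
        return record.logs.some((l) => l.includes(a.value));
    }
  });
}

// Example record, echoing the shape of the section 7 test case:
const record: RunRecord = {
  messages: ["Reminder noted."],
  toolCalls: [{ toolName: "harborforge_calendar_create" }],
  logs: ["matched rule 0"],
};
const ok = evaluate(record, [
  { type: "tool_called", toolName: "harborforge_calendar_create" },
  { type: "message_contains", value: "Reminder" },
]);
```

Under this framing, anything not representable in the run record is by definition outside the framework's authoritative result and belongs to external verifier scripts.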
+ +### Docker/Image Strategy + +The project needs a clear policy for: + +- official base images +- local image overrides +- plugin source mounting during local development +- OpenClaw version pinning + +### Test Case Reuse + +It may be useful later to split infra definitions, model rules, and assertions into reusable modules rather than keeping everything in one file. + +--- + +## 16. Success Criteria + +ClawSpec can be considered successful in its first phase if: + +- a plugin developer can define a test spec without writing framework code +- a test run is reproducible across machines with the same environment +- plugin integration behavior can be validated without a real LLM +- failed runs produce enough artifacts to diagnose the issue quickly +- at least one real plugin can be tested end-to-end using the framework + +--- + +## 17. Next Recommended Deliverables + +After this plan, the next most useful documents are: + +1. `README.md` — concise positioning and quick start +2. `docs/spec-schema.md` — formalize the spec design +3. `schema/clawspec.schema.json` — machine-validatable V0 schema +4. `docs/fake-model.md` — define fake model request/response behavior +5. `TASKLIST.md` or milestone tracker — implementation breakdown + +--- + +## 18. Summary + +ClawSpec should become a deterministic integration testing framework for OpenClaw plugins. + +Its core innovation is simple: + +- run real OpenClaw runtime environments +- replace real LLM behavior with a rule-driven fake model service +- execute declarative test cases +- verify runtime behavior with stable, repeatable assertions + +If implemented well, ClawSpec can become the standard foundation for plugin-level automated testing in the OpenClaw ecosystem.