docs: add initial project plan
PROJECT_PLAN.md
@@ -0,0 +1,511 @@
# ClawSpec Project Plan

## 1. Project Overview

**ClawSpec** is an automated testing framework for OpenClaw plugins.

Its purpose is to provide a deterministic, reproducible way to validate plugin behavior in a realistic OpenClaw runtime environment without relying on an actual LLM provider.

Instead of calling real language models, ClawSpec will run OpenClaw instances against a rule-based fake model service. This allows plugin developers to test message handling, tool-calling flows, plugin configuration, integration boundaries, and observable side effects with stable results.

---

## 2. Problem Statement

OpenClaw plugin testing is currently hard to standardize because:

- plugin behavior often depends on runtime integration rather than isolated pure functions
- real LLM responses are non-deterministic and expensive
- testing usually requires manual setup of OpenClaw, plugin installation, configuration, and message simulation
- there is no unified way to express plugin test environments and expected outcomes

ClawSpec aims to solve this by offering:

- declarative test environment definitions
- deterministic model behavior via rules instead of real LLMs
- automated provisioning of OpenClaw runtime units
- repeatable execution of plugin integration tests
- structured validation and test reporting

---

## 3. Project Goals

### Primary Goals

- Define test environments for one or more OpenClaw runtime units
- Install and configure plugins automatically per test spec
- Provide a fake model service that responds according to declarative rules
- Execute test cases against running OpenClaw units
- Verify expected behavior through built-in assertions and custom verifier scripts
- Produce reproducible test artifacts and reports

### Secondary Goals

- Support multiple OpenClaw versions for compatibility testing
- Support multi-unit scenarios in a single test suite
- Provide reusable example specs for plugin developers
- Make local plugin integration testing fast enough for everyday development

### Non-Goals for V1

- Full UI dashboard
- Large-scale distributed execution
- Performance benchmarking
- Fuzzing or random conversation generation
- Automatic support for every possible provider protocol
- Replacing unit tests inside plugin repositories

---

## 4. Product Positioning

ClawSpec is **not** a generic unit test runner.

It is a **scenario-driven integration testing framework** for validating the behavior of:

- OpenClaw runtime
- installed plugins
- model interaction boundaries
- tool-calling flows
- message outputs
- side effects exposed through plugins or configured backends

The core value is deterministic validation of complex runtime behavior.

---

## 5. High-Level Architecture

ClawSpec is expected to contain four major components.

### 5.1 Spec Loader

Responsible for:

- reading the project spec file
- validating structure against schema
- normalizing runtime definitions
- producing an execution plan for the runner
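
As a rough illustration of the last two responsibilities, the sketch below normalizes unit definitions and derives an ordered execution plan. The type names (`RawSpec`, `PlanStep`) and the function `buildPlan` are hypothetical and not part of any existing ClawSpec or OpenClaw API. The default image name is an assumption modeled on the example spec later in this plan.

```typescript
// Hypothetical shapes; the real schema is still to be defined.
interface RawPlugin { pluginName: string; installCommand: string; }
interface RawUnit { unitId: string; openClawVersion: string; image?: string; plugins?: RawPlugin[]; }
interface RawTestCase { testId: string; targetUnitId: string; }
interface RawSpec { clawUnits: RawUnit[]; testCases: RawTestCase[]; }

type PlanStep =
  | { kind: "start_unit"; unitId: string; image: string }
  | { kind: "install_plugin"; unitId: string; command: string }
  | { kind: "run_test"; unitId: string; testId: string };

// Fill in defaults so later stages never deal with optional fields.
function normalizeUnit(unit: RawUnit): Required<RawUnit> {
  return {
    unitId: unit.unitId,
    openClawVersion: unit.openClawVersion,
    // Assumed default: derive the image tag from the pinned version.
    image: unit.image ?? `ghcr.io/openclaw/openclaw:${unit.openClawVersion}`,
    plugins: unit.plugins ?? [],
  };
}

// Units are started and provisioned first, then tests run in declared order.
function buildPlan(spec: RawSpec): PlanStep[] {
  const units = spec.clawUnits.map(normalizeUnit);
  const knownUnitIds = new Set(units.map((u) => u.unitId));
  const steps: PlanStep[] = [];
  for (const unit of units) {
    steps.push({ kind: "start_unit", unitId: unit.unitId, image: unit.image });
    for (const plugin of unit.plugins) {
      steps.push({ kind: "install_plugin", unitId: unit.unitId, command: plugin.installCommand });
    }
  }
  for (const test of spec.testCases) {
    if (!knownUnitIds.has(test.targetUnitId)) {
      throw new Error(`test ${test.testId} targets unknown unit ${test.targetUnitId}`);
    }
    steps.push({ kind: "run_test", unitId: test.targetUnitId, testId: test.testId });
  }
  return steps;
}
```

Rejecting a test case that targets an undeclared unit at plan time keeps that class of mistake out of the much slower container startup phase.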

### 5.2 Environment Orchestrator

Responsible for:

- provisioning OpenClaw test units
- generating or managing Docker Compose definitions
- preparing workspace directories and mounted volumes
- installing plugins
- applying plugin configuration
- starting and stopping test environments
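
One way the orchestrator might materialize a Compose definition is to build it as a plain object and serialize it afterwards. This is a sketch under stated assumptions: `buildCompose`, the `fake-model` service name, port `8080`, the `/workspace` mount point, and the `MODEL_BASE_URL` variable are all hypothetical. How OpenClaw is actually pointed at a model endpoint still needs to be verified. Only the Compose keys themselves (`services`, `networks`, `image`, `volumes`, `environment`, `depends_on`) are standard.

```typescript
interface UnitDef { unitId: string; image: string; }
interface EnvDef { networkName: string; workspaceRoot: string; }

interface ComposeService {
  image: string;
  networks: string[];
  volumes?: string[];
  environment?: Record<string, string>;
  depends_on?: string[];
}
interface ComposeFile {
  services: Record<string, ComposeService>;
  networks: Record<string, object>;
}

function buildCompose(env: EnvDef, units: UnitDef[], fakeModelImage: string): ComposeFile {
  // The fake model service is declared first so units can depend on it.
  const services: Record<string, ComposeService> = {
    "fake-model": { image: fakeModelImage, networks: [env.networkName] },
  };
  for (const unit of units) {
    services[unit.unitId] = {
      image: unit.image,
      networks: [env.networkName],
      // One workspace directory per unit, mounted at an assumed path.
      volumes: [`${env.workspaceRoot}/${unit.unitId}:/workspace`],
      // Hypothetical variable name; the real setting must match OpenClaw.
      environment: { MODEL_BASE_URL: "http://fake-model:8080" },
      depends_on: ["fake-model"],
    };
  }
  return { services, networks: { [env.networkName]: {} } };
}
```

The resulting object could be written out as YAML, for example with the `yaml` package suggested later in this plan, and passed to Docker Compose.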

### 5.3 Fake Model Service

Responsible for:

- exposing a model-compatible endpoint for OpenClaw
- receiving model requests from test units
- matching incoming requests against declarative rules
- returning deterministic text responses and/or tool-calling instructions
- logging interactions for debugging and verification

### 5.4 Test Runner

Responsible for:

- selecting target test cases
- injecting input events/messages
- collecting outputs, logs, and tool-call traces
- evaluating built-in assertions
- executing optional verifier scripts
- producing final pass/fail results and artifacts

---

## 6. Core Design Principles

### Determinism First

The framework should avoid real LLM randomness in automated tests.

### Runtime Realism

Tests should run against realistic OpenClaw environments, not only mocked plugin internals.

### Declarative Configuration

Test environments and cases should be defined in configuration files rather than hard-coded scripts.

### Extensible Verification

Built-in assertions should cover common cases, while custom scripts should support project-specific validation.

### Reproducible Artifacts

All important outputs should be captured for debugging, including logs, matched model rules, tool-call traces, and verifier results.

---

## 7. Proposed Spec Structure

The initial idea is to define a single spec file that describes:

- OpenClaw runtime units
- plugin installation and configuration
- test cases
- fake model behavior
- expected validation steps

A normalized V1 structure may look like this:

```json
{
  "version": "v1",
  "environment": {
    "networkName": "clawspec-net",
    "workspaceRoot": "./.clawspec/workspaces",
    "artifactsRoot": "./.clawspec/artifacts"
  },
  "clawUnits": [
    {
      "unitId": "calendar-agent",
      "openClawVersion": "0.5.0",
      "image": "ghcr.io/openclaw/openclaw:0.5.0",
      "plugins": [
        {
          "pluginName": "harborforge-calendar",
          "installCommand": "openclaw plugins add harborforge-calendar",
          "configs": {
            "backendUrl": "http://fake-backend:8080",
            "agentId": "calendar-test-agent"
          }
        }
      ]
    }
  ],
  "testCases": [
    {
      "testId": "calendar-reminder-basic",
      "targetUnitId": "calendar-agent",
      "input": {
        "channel": "discord",
        "chatType": "direct",
        "message": "Remind me about the meeting at 3 PM tomorrow"
      },
      "modelRules": [
        {
          "receive": ".*Remind me about the meeting at 3 PM tomorrow.*",
          "action": {
            "type": "tool_call_then_respond",
            "toolName": "harborforge_calendar_create",
            "toolParameters": {
              "title": "Meeting",
              "time": "tomorrow 15:00"
            },
            "text": "I've noted it down for you"
          }
        }
      ],
      "expected": [
        {
          "type": "tool_called",
          "toolName": "harborforge_calendar_create"
        },
        {
          "type": "message_contains",
          "value": "I've noted it down for you"
        }
      ],
      "verifier": {
        "type": "script",
        "path": "./verifiers/calendar-reminder-basic.sh"
      }
    }
  ]
}
```
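
To make the intended constraints more concrete, here is a minimal hand-written structural check over a parsed spec. It is a sketch only: `validateSpec` and its error messages are hypothetical, it covers just a few fields from the example above, and a real implementation would more likely validate against a JSON Schema (for instance with `ajv`, as suggested in the tech stack section).

```typescript
type Json = Record<string, unknown>;

function isObject(value: unknown): value is Json {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a list of human-readable problems; an empty list means "looks valid".
function validateSpec(input: unknown): string[] {
  if (!isObject(input)) return ["spec must be a JSON object"];
  const errors: string[] = [];
  if (input["version"] !== "v1") errors.push('version must be "v1"');

  const rawUnits = input["clawUnits"];
  const units: unknown[] = Array.isArray(rawUnits) ? rawUnits : [];
  if (units.length === 0) errors.push("clawUnits must be a non-empty array");
  const unitIds = new Set<string>();
  units.forEach((unit: unknown, i: number) => {
    const id = isObject(unit) ? unit["unitId"] : undefined;
    if (typeof id !== "string") errors.push(`clawUnits[${i}].unitId must be a string`);
    else if (unitIds.has(id)) errors.push(`duplicate unitId "${id}"`);
    else unitIds.add(id);
  });

  const rawTests = input["testCases"];
  const tests: unknown[] = Array.isArray(rawTests) ? rawTests : [];
  tests.forEach((test: unknown, i: number) => {
    const testId = isObject(test) ? test["testId"] : undefined;
    const target = isObject(test) ? test["targetUnitId"] : undefined;
    if (typeof testId !== "string") errors.push(`testCases[${i}].testId must be a string`);
    // Cross-reference check: every test must point at a declared unit.
    if (typeof target !== "string" || !unitIds.has(target)) {
      errors.push(`testCases[${i}].targetUnitId must reference a declared unit`);
    }
  });
  return errors;
}
```

Collecting all problems instead of failing on the first one lets a spec author fix several mistakes in a single pass.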

This is only a starting point. The exact schema should be refined during implementation.

---

## 8. Fake Model Service Design

The fake model service is one of the most important parts of ClawSpec.

It should behave like a deterministic model backend that OpenClaw can call during tests.

### Responsibilities

- receive model requests from OpenClaw
- inspect request content and context
- match a rule set in declared order
- return predefined outputs
- support text-only responses
- support tool-calling responses
- support tool-call plus final text patterns
- emit logs showing which rule matched and what output was generated
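
The matching behavior described above can be sketched as a first-match-wins loop over regular-expression rules. The `receive` field and the `tool_call_then_respond` action type follow the example spec in section 7. The `respond` and `tool_call` action types, and the names `matchRule` and `describeMatch`, are assumptions made for illustration.

```typescript
type ModelAction =
  | { type: "respond"; text: string }
  | { type: "tool_call"; toolName: string; toolParameters: Record<string, unknown> }
  | { type: "tool_call_then_respond"; toolName: string; toolParameters: Record<string, unknown>; text: string };

interface ModelRule { receive: string; action: ModelAction; }
interface MatchResult { ruleIndex: number; action: ModelAction; }

// Rules are tried in declared order; the first pattern that matches wins.
function matchRule(rules: ModelRule[], lastUserMessage: string): MatchResult | null {
  for (let i = 0; i < rules.length; i++) {
    const rule = rules[i];
    if (new RegExp(rule.receive).test(lastUserMessage)) {
      return { ruleIndex: i, action: rule.action };
    }
  }
  return null;
}

// One log line per request, so a failed run shows which rule fired.
function describeMatch(result: MatchResult | null): string {
  return result === null
    ? "no rule matched"
    : `rule #${result.ruleIndex} matched -> ${result.action.type}`;
}
```

Returning `null` for an unmatched request, rather than a default reply, makes a missing rule visible as a test failure instead of a silent fallback. Whether that is the right default is an open design choice.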

### Why This Matters

Without this service, tests would depend on live model providers, causing:

- unstable results
- variable tool-calling behavior
- token costs
- difficult reproduction of failures

The fake model service turns model behavior into a controlled part of the test spec.

---

## 9. Verification Model

ClawSpec should support two layers of verification.

### 9.1 Built-in Assertions

Common assertions should be supported directly by the framework, such as:

- `message_contains`
- `message_equals`
- `tool_called`
- `tool_called_with`
- `exit_code`
- `log_contains`
- `log_contains`
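
A possible evaluator for these assertion types is sketched below. The shapes of `message_contains` and `tool_called` follow the example spec. The fields of the other assertion types, the `RunObservation` structure, and the function name `evaluate` are hypothetical.

```typescript
interface ToolCall { toolName: string; parameters: Record<string, unknown>; }
interface RunObservation { messages: string[]; toolCalls: ToolCall[]; logs: string[]; exitCode: number; }

type Assertion =
  | { type: "message_contains"; value: string }
  | { type: "message_equals"; value: string }
  | { type: "tool_called"; toolName: string }
  | { type: "tool_called_with"; toolName: string; parameters: Record<string, unknown> }
  | { type: "exit_code"; value: number }
  | { type: "log_contains"; value: string };

function evaluate(assertion: Assertion, obs: RunObservation): boolean {
  switch (assertion.type) {
    case "message_contains": {
      const { value } = assertion;
      return obs.messages.some((m) => m.includes(value));
    }
    case "message_equals": {
      const { value } = assertion;
      return obs.messages.some((m) => m === value);
    }
    case "tool_called": {
      const { toolName } = assertion;
      return obs.toolCalls.some((c) => c.toolName === toolName);
    }
    case "tool_called_with": {
      const { toolName, parameters } = assertion;
      // Subset match: every expected parameter must be present and equal.
      return obs.toolCalls.some((c) =>
        c.toolName === toolName &&
        Object.keys(parameters).every((k) => JSON.stringify(c.parameters[k]) === JSON.stringify(parameters[k])));
    }
    case "exit_code":
      return obs.exitCode === assertion.value;
    case "log_contains": {
      const { value } = assertion;
      return obs.logs.some((line) => line.includes(value));
    }
  }
}
```

Here `tool_called_with` is treated as a subset match, so extra parameters in the actual call do not fail the assertion. An exact-match variant may also be worth supporting.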

### 9.2 External Verifier Scripts

Custom verifier scripts should be supported for advanced cases, such as:

- checking database state
- validating generated files
- verifying HTTP side effects
- checking plugin-specific external systems

This combination keeps common tests simple while preserving flexibility.

---

## 10. Execution Flow

A typical ClawSpec test run should look like this:

1. Load and validate spec file
2. Prepare workspace and artifact directories
3. Materialize runtime environment definitions
4. Start fake model service
5. Start target OpenClaw unit(s)
6. Install and configure required plugins
7. Inject test input into the target unit
8. Let the runtime interact with the fake model service
9. Collect outputs, logs, tool traces, and events
10. Evaluate expected assertions
11. Run external verifier script if defined
12. Produce final result summary and artifact bundle
13. Tear down environment unless retention is requested
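
The steps above can be sketched as a small driver that always reports and cleans up, even when an earlier step throws. The `RunSteps` interface and `runOnce` are hypothetical names. Grouping the thirteen steps into five callbacks is only one possible decomposition, and running the report step inside `finally` is a design assumption rather than a requirement of this plan.

```typescript
interface RunSteps {
  prepare(): Promise<void>;                // steps 1-3
  startServices(): Promise<void>;          // steps 4-6
  execute(): Promise<boolean>;             // steps 7-11, resolves to pass/fail
  report(passed: boolean): Promise<void>;  // step 12
  teardown(): Promise<void>;               // step 13
}

async function runOnce(steps: RunSteps, retainEnvironment: boolean): Promise<boolean> {
  await steps.prepare();
  let passed = false;
  try {
    await steps.startServices();
    passed = await steps.execute();
  } finally {
    // Report and clean up even if startup or execution failed.
    await steps.report(passed);
    if (!retainEnvironment) {
      await steps.teardown();
    }
  }
  return passed;
}
```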

---

## 11. Proposed Directory Structure

```text
ClawSpec/
├── README.md
├── PROJECT_PLAN.md
├── docs/
│   ├── architecture.md
│   ├── spec-schema.md
│   ├── fake-model.md
│   └── runner.md
├── schema/
│   └── clawspec.schema.json
├── examples/
│   ├── basic.json
│   └── calendar-plugin.json
├── docker/
│   ├── compose.template.yml
│   └── fake-model.Dockerfile
├── src/
│   ├── cli/
│   ├── spec/
│   ├── orchestrator/
│   ├── model/
│   ├── runner/
│   └── report/
├── verifiers/
│   └── examples/
└── .clawspec/
    ├── workspaces/
    └── artifacts/
```

---

## 12. Recommended Tech Stack

### Preferred Language

**TypeScript / Node.js**

Reasoning:

- fits well with OpenClaw ecosystem conventions
- convenient for JSON schema validation
- good support for CLI tooling
- convenient for HTTP fake service implementation
- straightforward Docker and subprocess orchestration

### Suggested Libraries

- `ajv` for schema validation
- `commander` or `yargs` for CLI
- `execa` for shell and Docker command orchestration
- `fastify` or `express` for fake model service
- `yaml` for optional YAML support in the future
- `vitest` for ClawSpec self-tests

---

## 13. V1 Scope

The first version should focus on the smallest useful end-to-end workflow.

### V1 Must Include

- load one spec file
- validate the basic schema
- start one OpenClaw test unit
- install one or more plugins in that unit
- apply plugin configuration entries
- start one fake model service
- inject one test input
- support rule-based text responses
- support rule-based tool-calling responses
- support basic assertions:
  - message contains
  - tool called
  - script verifier
- generate logs and a pass/fail summary

### V1 Should Avoid

- complex multi-turn state machines
- distributed execution
- UI dashboard
- performance benchmarking
- broad provider emulation beyond test needs
- advanced matrix test expansion

---

## 14. Milestone Proposal

### Milestone 0 - Project Bootstrap

- initialize repository layout
- define coding conventions
- write initial README and project plan
- select runtime and libraries

### Milestone 1 - Spec Definition

- draft spec schema v0.1
- implement spec parser and validation
- add example specs

### Milestone 2 - Fake Model Service

- define internal rule format
- implement rule matcher
- implement deterministic response generation
- add request/response logging

### Milestone 3 - Environment Orchestrator

- generate runtime environment configuration
- start and stop OpenClaw containers
- apply plugin install commands
- apply plugin configs

### Milestone 4 - Test Runner

- inject test inputs
- collect runtime outputs
- evaluate assertions
- execute verifier scripts
- output structured test reports

### Milestone 5 - First Real Plugin Demo

- create an example test suite for a real OpenClaw plugin
- validate the full workflow end to end
- document limitations and next steps

---

## 15. Risks and Open Questions

### Runtime Interface Risk

The exact model-provider interface expected by OpenClaw must be verified early. The fake model service depends on matching this contract well enough for tests.

### Plugin Installation Variability

Different plugins may require different setup flows. ClawSpec must decide how much it standardizes versus how much it leaves to custom setup hooks.

### Observable Output Boundaries

Some plugins expose behavior through logs, some through tool calls, some through external HTTP effects. The framework must define what counts as the authoritative observable result.

### Docker/Image Strategy

The project needs a clear policy for:

- official base images
- local image overrides
- plugin source mounting during local development
- OpenClaw version pinning

### Test Case Reuse

It may be useful later to split infra definitions, model rules, and assertions into reusable modules rather than keeping everything in one file.

---

## 16. Success Criteria

ClawSpec can be considered successful in its first phase if:

- a plugin developer can define a test spec without writing framework code
- a test run is reproducible across machines with the same environment
- plugin integration behavior can be validated without a real LLM
- failed runs produce enough artifacts to diagnose the issue quickly
- at least one real plugin can be tested end-to-end using the framework

---

## 17. Next Recommended Deliverables

After this plan, the next most useful documents are:

1. `README.md`: concise positioning and quick start
2. `docs/spec-schema.md`: formalize the spec design
3. `schema/clawspec.schema.json`: machine-validatable v0.1 schema, matching the draft planned in Milestone 1
4. `docs/fake-model.md`: define fake model request/response behavior
5. `TASKLIST.md` or milestone tracker: implementation breakdown

---

## 18. Summary

ClawSpec should become a deterministic integration testing framework for OpenClaw plugins.

Its core innovation is simple:

- run real OpenClaw runtime environments
- replace real LLM behavior with a rule-driven fake model service
- execute declarative test cases
- verify runtime behavior with stable, repeatable assertions

If implemented well, ClawSpec can become the standard foundation for plugin-level automated testing in the OpenClaw ecosystem.