docs: add initial project plan
# ClawSpec Project Plan

## 1. Project Overview

**ClawSpec** is an automation testing framework for OpenClaw plugins.

Its purpose is to provide a deterministic, reproducible way to validate plugin behavior in a realistic OpenClaw runtime environment without relying on an actual LLM provider.

Instead of calling real language models, ClawSpec will run OpenClaw instances against a rule-based fake model service. This allows plugin developers to test message handling, tool-calling flows, plugin configuration, integration boundaries, and observable side effects with stable results.

---

## 2. Problem Statement

OpenClaw plugin testing is currently hard to standardize because:

- plugin behavior often depends on runtime integration rather than isolated pure functions
- real LLM responses are non-deterministic and expensive
- testing usually requires manual setup of OpenClaw, plugin installation, configuration, and message simulation
- there is no unified way to express plugin test environments and expected outcomes

ClawSpec aims to solve this by offering:

- declarative test environment definitions
- deterministic model behavior via rules instead of real LLMs
- automated provisioning of OpenClaw runtime units
- repeatable execution of plugin integration tests
- structured validation and test reporting

---

## 3. Project Goals

### Primary Goals

- Define test environments for one or more OpenClaw runtime units
- Install and configure plugins automatically per test spec
- Provide a fake model service that responds according to declarative rules
- Execute test cases against running OpenClaw units
- Verify expected behavior through built-in assertions and custom verifier scripts
- Produce reproducible test artifacts and reports

### Secondary Goals

- Support multiple OpenClaw versions for compatibility testing
- Support multi-unit scenarios in a single test suite
- Provide reusable example specs for plugin developers
- Make local plugin integration testing fast enough for everyday development

### Non-Goals for V1

- Full UI dashboard
- Large-scale distributed execution
- Performance benchmarking
- Fuzzing or random conversation generation
- Automatic support for every possible provider protocol
- Replacing unit tests inside plugin repositories

---

## 4. Product Positioning

ClawSpec is **not** a generic unit test runner.

It is a **scenario-driven integration testing framework** for validating the behavior of:

- the OpenClaw runtime
- installed plugins
- model interaction boundaries
- tool-calling flows
- message outputs
- side effects exposed through plugins or configured backends

The core value is deterministic validation of complex runtime behavior.

---

## 5. High-Level Architecture

ClawSpec is expected to contain four major components.

### 5.1 Spec Loader

Responsible for:

- reading the project spec file
- validating its structure against the schema
- normalizing runtime definitions
- producing an execution plan for the runner

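As a rough sketch of the loader's job, the following shows minimal validation plus normalization into a plan. All type and function names here (`SpecFile`, `ExecutionPlan`, `buildExecutionPlan`) are placeholders invented for illustration, and a real implementation would delegate structural checks to a JSON Schema validator such as `ajv` rather than hand-written code:

```typescript
// Placeholder shapes loosely based on the spec example in section 7.
interface SpecFile {
  version: string;
  clawUnits: { unitId: string }[];
  testCases: { testId: string; targetUnitId: string }[];
}

interface ExecutionPlan {
  unitsToStart: string[];
  orderedTests: string[];
}

// Validate minimal invariants, then normalize the spec into a plan the
// runner can execute: which units to start and which tests to run.
function buildExecutionPlan(spec: SpecFile): ExecutionPlan {
  if (spec.version !== "v1") {
    throw new Error(`unsupported spec version: ${spec.version}`);
  }
  const unitIds = new Set(spec.clawUnits.map((u) => u.unitId));
  for (const t of spec.testCases) {
    if (!unitIds.has(t.targetUnitId)) {
      throw new Error(`test ${t.testId} targets unknown unit ${t.targetUnitId}`);
    }
  }
  return {
    unitsToStart: [...unitIds],
    orderedTests: spec.testCases.map((t) => t.testId),
  };
}
```

Failing fast on cross-references (a test targeting an undeclared unit) at load time keeps orchestration errors out of the runtime phase.
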
### 5.2 Environment Orchestrator

Responsible for:

- provisioning OpenClaw test units
- generating or managing Docker Compose definitions
- preparing workspace directories and mounted volumes
- installing plugins
- applying plugin configuration
- starting and stopping test environments

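A hypothetical sketch of the "generating Docker Compose definitions" step: deriving one Compose service entry per unit from the spec. The field names (`unitId`, `image`) mirror the spec example in section 7, and the volume path assumes the default `workspaceRoot`; real code would merge all services into one document and serialize it to YAML:

```typescript
// Minimal unit shape, taken from the section 7 spec example.
interface ClawUnit {
  unitId: string;
  image: string;
}

// Build the Compose `services` entry for one unit: pinned image,
// shared test network, and a per-unit workspace mount.
function composeService(unit: ClawUnit, networkName: string) {
  return {
    [unit.unitId]: {
      image: unit.image,
      networks: [networkName],
      volumes: [`./.clawspec/workspaces/${unit.unitId}:/workspace`],
    },
  };
}
```
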
### 5.3 Fake Model Service

Responsible for:

- exposing a model-compatible endpoint for OpenClaw
- receiving model requests from test units
- matching incoming requests against declarative rules
- returning deterministic text responses and/or tool-calling instructions
- logging interactions for debugging and verification

### 5.4 Test Runner

Responsible for:

- selecting target test cases
- injecting input events/messages
- collecting outputs, logs, and tool-call traces
- evaluating built-in assertions
- executing optional verifier scripts
- producing final pass/fail results and artifacts

---

## 6. Core Design Principles

### Determinism First

The framework should avoid real LLM randomness in automated tests.

### Runtime Realism

Tests should run against realistic OpenClaw environments, not only mocked plugin internals.

### Declarative Configuration

Test environments and cases should be defined in configuration files rather than hard-coded scripts.

### Extensible Verification

Built-in assertions should cover common cases, while custom scripts should support project-specific validation.

### Reproducible Artifacts

All important outputs should be captured for debugging, including logs, matched model rules, tool-call traces, and verifier results.

---

## 7. Proposed Spec Structure

The initial idea is to define a single spec file that describes:

- OpenClaw runtime units
- plugin installation and configuration
- test cases
- fake model behavior
- expected validation steps

A normalized V1 structure may look like this:

```json
{
  "version": "v1",
  "environment": {
    "networkName": "clawspec-net",
    "workspaceRoot": "./.clawspec/workspaces",
    "artifactsRoot": "./.clawspec/artifacts"
  },
  "clawUnits": [
    {
      "unitId": "calendar-agent",
      "openClawVersion": "0.5.0",
      "image": "ghcr.io/openclaw/openclaw:0.5.0",
      "plugins": [
        {
          "pluginName": "harborforge-calendar",
          "installCommand": "openclaw plugins add harborforge-calendar",
          "configs": {
            "backendUrl": "http://fake-backend:8080",
            "agentId": "calendar-test-agent"
          }
        }
      ]
    }
  ],
  "testCases": [
    {
      "testId": "calendar-reminder-basic",
      "targetUnitId": "calendar-agent",
      "input": {
        "channel": "discord",
        "chatType": "direct",
        "message": "Remind me about the meeting at 3pm tomorrow"
      },
      "modelRules": [
        {
          "receive": ".*Remind me about the meeting at 3pm tomorrow.*",
          "action": {
            "type": "tool_call_then_respond",
            "toolName": "harborforge_calendar_create",
            "toolParameters": {
              "title": "Meeting",
              "time": "tomorrow 15:00"
            },
            "text": "Done, I have noted it down"
          }
        }
      ],
      "expected": [
        {
          "type": "tool_called",
          "toolName": "harborforge_calendar_create"
        },
        {
          "type": "message_contains",
          "value": "Done, I have noted it down"
        }
      ],
      "verifier": {
        "type": "script",
        "path": "./verifiers/calendar-reminder-basic.sh"
      }
    }
  ]
}
```

This is only a starting point; the exact schema should be refined during implementation.

---

## 8. Fake Model Service Design

The fake model service is one of the most important parts of ClawSpec.

It should behave like a deterministic model backend that OpenClaw can call during tests.

### Responsibilities

- receive model requests from OpenClaw
- inspect request content and context
- match against the rule set in declared order
- return predefined outputs
- support text-only responses
- support tool-calling responses
- support tool-call-plus-final-text patterns
- emit logs showing which rule matched and what output was generated

### Why This Matters

Without this service, tests would depend on live model providers, causing:

- unstable results
- variable tool-calling behavior
- token costs
- difficult reproduction of failures

The fake model service turns model behavior into a controlled part of the test spec.

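The core of the service can be sketched as first-match rule resolution. The rule shape (a `receive` regex plus an `action` payload) follows the spec example in section 7; the function name and signature are illustrative, not a final API, and the real service would wrap this in an HTTP endpoint speaking whatever model-provider protocol OpenClaw expects:

```typescript
// Rule shape, mirroring the `modelRules` entries in the section 7 spec.
interface ModelRule {
  receive: string; // regex tested against the incoming request text
  action: { type: string; text?: string; toolName?: string };
}

// Rules are checked in declared order and the first match wins, which is
// what makes the response for a given request deterministic.
function matchRule(rules: ModelRule[], requestText: string): ModelRule["action"] | undefined {
  for (const rule of rules) {
    if (new RegExp(rule.receive).test(requestText)) {
      return rule.action;
    }
  }
  return undefined; // no rule matched: log it and return a fixed fallback
}
```

Declared-order matching also gives test authors an easy escape hatch: a catch-all `".*"` rule at the end of the list acts as the deterministic fallback response.
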
---

## 9. Verification Model

ClawSpec should support two layers of verification.

### 9.1 Built-in Assertions

Common assertions should be supported directly by the framework, such as:

- `message_contains`
- `message_equals`
- `tool_called`
- `tool_called_with`
- `exit_code`
- `log_contains`

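A sketch of how a few of these assertions could be evaluated against collected run artifacts. The artifact shape (`messages`, `toolCalls`, `logs`) is an assumption for illustration, not a committed interface:

```typescript
// Assumed shape of what the runner collects from one test execution.
interface RunArtifacts {
  messages: string[];
  toolCalls: { toolName: string; parameters: Record<string, unknown> }[];
  logs: string[];
}

// A subset of the built-in assertion types listed above.
type Assertion =
  | { type: "message_contains"; value: string }
  | { type: "tool_called"; toolName: string }
  | { type: "log_contains"; value: string };

// Each assertion reduces to a pure predicate over the artifacts, which
// keeps evaluation deterministic and easy to report on.
function evaluateAssertion(a: Assertion, run: RunArtifacts): boolean {
  switch (a.type) {
    case "message_contains":
      return run.messages.some((m) => m.includes(a.value));
    case "tool_called":
      return run.toolCalls.some((c) => c.toolName === a.toolName);
    case "log_contains":
      return run.logs.some((l) => l.includes(a.value));
  }
}
```
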
### 9.2 External Verifier Scripts

Custom verifier scripts should be supported for advanced cases, such as:

- checking database state
- validating generated files
- verifying HTTP side effects
- checking plugin-specific external systems

This combination keeps common tests simple while preserving flexibility.

---

## 10. Execution Flow

A typical ClawSpec test run should look like this:

1. Load and validate the spec file
2. Prepare workspace and artifact directories
3. Materialize runtime environment definitions
4. Start the fake model service
5. Start the target OpenClaw unit(s)
6. Install and configure required plugins
7. Inject test input into the target unit
8. Let the runtime interact with the fake model service
9. Collect outputs, logs, tool traces, and events
10. Evaluate expected assertions
11. Run the external verifier script, if defined
12. Produce the final result summary and artifact bundle
13. Tear down the environment unless retention is requested

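The flow above can be sketched as a run loop with the steps stubbed out. All step names are placeholders; the point is the control-flow shape, in particular that teardown (step 13) runs in `finally` unless the caller asked to retain the environment. A real runner would make each step asynchronous and surface per-step failures in the report:

```typescript
// A step records its name into a trace; a real step would do work and
// may throw, which must still trigger teardown.
type Step = (trace: string[]) => void;

const step = (name: string): Step => (trace) => {
  trace.push(name);
};

function runSuite(retainEnvironment = false): string[] {
  const trace: string[] = [];
  const steps: Step[] = [
    step("load-spec"), step("prepare-dirs"), step("materialize-env"),
    step("start-fake-model"), step("start-units"), step("install-plugins"),
    step("inject-input"), step("collect-outputs"), step("evaluate-assertions"),
    step("run-verifier"), step("report"),
  ];
  try {
    for (const s of steps) s(trace);
  } finally {
    // Step 13: always tear down unless retention was requested,
    // even when an earlier step failed.
    if (!retainEnvironment) trace.push("teardown");
  }
  return trace;
}
```
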
---

## 11. Proposed Directory Structure

```text
ClawSpec/
├── README.md
├── PROJECT_PLAN.md
├── docs/
│   ├── architecture.md
│   ├── spec-schema.md
│   ├── fake-model.md
│   └── runner.md
├── schema/
│   └── clawspec.schema.json
├── examples/
│   ├── basic.json
│   └── calendar-plugin.json
├── docker/
│   ├── compose.template.yml
│   └── fake-model.Dockerfile
├── src/
│   ├── cli/
│   ├── spec/
│   ├── orchestrator/
│   ├── model/
│   ├── runner/
│   └── report/
├── verifiers/
│   └── examples/
└── .clawspec/
    ├── workspaces/
    └── artifacts/
```

---

## 12. Recommended Tech Stack

### Preferred Language

**TypeScript / Node.js**

Reasoning:

- fits well with OpenClaw ecosystem conventions
- convenient for JSON schema validation
- good support for CLI tooling
- convenient for implementing the fake HTTP model service
- straightforward Docker and subprocess orchestration

### Suggested Libraries

- `ajv` for schema validation
- `commander` or `yargs` for the CLI
- `execa` for shell and Docker command orchestration
- `fastify` or `express` for the fake model service
- `yaml` for optional YAML support in the future
- `vitest` for ClawSpec self-tests

---

## 13. V1 Scope

The first version should focus on the smallest useful end-to-end workflow.

### V1 Must Include

- load one spec file
- validate the basic schema
- start one OpenClaw test unit
- install one or more plugins in that unit
- apply plugin configuration entries
- start one fake model service
- inject one test input
- support rule-based text responses
- support rule-based tool-calling responses
- support basic assertions:
  - message contains
  - tool called
  - script verifier
- generate logs and a pass/fail summary

### V1 Should Avoid

- complex multi-turn state machines
- distributed execution
- UI dashboard
- performance benchmarking
- broad provider emulation beyond test needs
- advanced matrix test expansion

---

## 14. Milestone Proposal

### Milestone 0 - Project Bootstrap

- initialize repository layout
- define coding conventions
- write initial README and project plan
- select runtime and libraries

### Milestone 1 - Spec Definition

- draft spec schema v0.1
- implement spec parser and validation
- add example specs

### Milestone 2 - Fake Model Service

- define internal rule format
- implement rule matcher
- implement deterministic response generation
- add request/response logging

### Milestone 3 - Environment Orchestrator

- generate runtime environment configuration
- start and stop OpenClaw containers
- apply plugin install commands
- apply plugin configs

### Milestone 4 - Test Runner

- inject test inputs
- collect runtime outputs
- evaluate assertions
- execute verifier scripts
- output structured test reports

### Milestone 5 - First Real Plugin Demo

- create an example test suite for a real OpenClaw plugin
- validate the full workflow end to end
- document limitations and next steps

---

## 15. Risks and Open Questions

### Runtime Interface Risk

The exact model-provider interface expected by OpenClaw must be verified early; the fake model service depends on matching this contract closely enough for tests to be meaningful.

### Plugin Installation Variability

Different plugins may require different setup flows. ClawSpec must decide how much it standardizes versus how much it leaves to custom setup hooks.

### Observable Output Boundaries

Some plugins expose behavior through logs, some through tool calls, and some through external HTTP effects. The framework must define what counts as the authoritative observable result.

### Docker/Image Strategy

The project needs a clear policy for:

- official base images
- local image overrides
- plugin source mounting during local development
- OpenClaw version pinning

### Test Case Reuse

It may be useful later to split infrastructure definitions, model rules, and assertions into reusable modules rather than keeping everything in one file.

---

## 16. Success Criteria

ClawSpec can be considered successful in its first phase if:

- a plugin developer can define a test spec without writing framework code
- a test run is reproducible across machines with the same environment
- plugin integration behavior can be validated without a real LLM
- failed runs produce enough artifacts to diagnose the issue quickly
- at least one real plugin can be tested end to end using the framework

---

## 17. Next Recommended Deliverables

After this plan, the next most useful documents are:

1. `README.md` — concise positioning and quick start
2. `docs/spec-schema.md` — formalize the spec design
3. `schema/clawspec.schema.json` — machine-validatable V0 schema
4. `docs/fake-model.md` — define fake model request/response behavior
5. `TASKLIST.md` or a milestone tracker — implementation breakdown

---

## 18. Summary

ClawSpec should become a deterministic integration testing framework for OpenClaw plugins.

Its core idea is simple:

- run real OpenClaw runtime environments
- replace real LLM behavior with a rule-driven fake model service
- execute declarative test cases
- verify runtime behavior with stable, repeatable assertions

If implemented well, ClawSpec can become the standard foundation for plugin-level automated testing in the OpenClaw ecosystem.