ClawSpec Project Plan
1. Project Overview
ClawSpec is an automation testing framework for OpenClaw plugins.
Its purpose is to provide a deterministic, reproducible way to validate plugin behavior in a realistic OpenClaw runtime environment without relying on an actual LLM provider.
Instead of calling real language models, ClawSpec will run OpenClaw instances against a rule-based fake model service. This allows plugin developers to test message handling, tool-calling flows, plugin configuration, integration boundaries, and observable side effects with stable results.
2. Problem Statement
OpenClaw plugin testing is currently hard to standardize because:
- plugin behavior often depends on runtime integration rather than isolated pure functions
- real LLM responses are non-deterministic and expensive
- testing usually requires manual setup of OpenClaw, plugin installation, configuration, and message simulation
- there is no unified way to express plugin test environments and expected outcomes
ClawSpec aims to solve this by offering:
- declarative test environment definitions
- deterministic model behavior via rules instead of real LLMs
- automated provisioning of OpenClaw runtime units
- repeatable execution of plugin integration tests
- structured validation and test reporting
3. Project Goals
Primary Goals
- Define test environments for one or more OpenClaw runtime units
- Install and configure plugins automatically per test spec
- Provide a fake model service that responds according to declarative rules
- Execute test cases against running OpenClaw units
- Verify expected behavior through built-in assertions and custom verifier scripts
- Produce reproducible test artifacts and reports
Secondary Goals
- Support multiple OpenClaw versions for compatibility testing
- Support multi-unit scenarios in a single test suite
- Provide reusable example specs for plugin developers
- Make local plugin integration testing fast enough for everyday development
Non-Goals for V1
- Full UI dashboard
- Large-scale distributed execution
- Performance benchmarking
- Fuzzing or random conversation generation
- Automatic support for every possible provider protocol
- Replacing unit tests inside plugin repositories
4. Product Positioning
ClawSpec is not a generic unit test runner.
It is a scenario-driven integration testing framework for validating the behavior of:
- OpenClaw runtime
- installed plugins
- model interaction boundaries
- tool-calling flows
- message outputs
- side effects exposed through plugins or configured backends
The core value is deterministic validation of complex runtime behavior.
5. High-Level Architecture
ClawSpec is expected to contain four major components.
5.1 Spec Loader
Responsible for:
- reading the project spec file
- validating structure against schema
- normalizing runtime definitions
- producing an execution plan for the runner
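A minimal sketch of the shape checks the loader might run before full JSON Schema validation. The field names follow the example spec in section 7; the function name and error format are illustrative, not a committed API:

```typescript
// Lightweight structural checks a spec loader could run before handing the
// document to a JSON Schema validator. Field names match the V1 example spec.
export interface SpecCheck {
  ok: boolean;
  errors: string[];
}

export function checkSpecShape(spec: unknown): SpecCheck {
  if (typeof spec !== "object" || spec === null) {
    return { ok: false, errors: ["spec must be an object"] };
  }
  const s = spec as Record<string, unknown>;
  const errors: string[] = [];
  if (s["version"] !== "v1") errors.push('version must be "v1"');
  if (!Array.isArray(s["clawUnits"]) || s["clawUnits"].length === 0) {
    errors.push("clawUnits must be a non-empty array");
  }
  if (!Array.isArray(s["testCases"])) {
    errors.push("testCases must be an array");
  }
  return { ok: errors.length === 0, errors };
}
```

Full validation would still belong to the schema layer; a pre-check like this only exists to fail fast with readable messages.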
5.2 Environment Orchestrator
Responsible for:
- provisioning OpenClaw test units
- generating or managing Docker Compose definitions
- preparing workspace directories and mounted volumes
- installing plugins
- applying plugin configuration
- starting and stopping test environments
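As a sketch of the compose-generation step, a service entry for one unit might be materialized as below. The service, network, and volume layout are assumptions rather than a fixed format; the unit fields mirror the example spec in section 7:

```typescript
// Materialize a Docker Compose service fragment for one OpenClaw test unit.
// The volume and network layout shown here is illustrative.
export interface UnitSpec {
  unitId: string;
  image: string;
}

export function composeService(unit: UnitSpec, network: string): string {
  // Emit a YAML fragment directly to keep the sketch dependency-free;
  // a real implementation would likely build an object and serialize it.
  return [
    `  ${unit.unitId}:`,
    `    image: ${unit.image}`,
    `    networks: [${network}]`,
    `    volumes:`,
    `      - ./.clawspec/workspaces/${unit.unitId}:/workspace`,
  ].join("\n");
}
```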
5.3 Fake Model Service
Responsible for:
- exposing a model-compatible endpoint for OpenClaw
- receiving model requests from test units
- matching incoming requests against declarative rules
- returning deterministic text responses and/or tool-calling instructions
- logging interactions for debugging and verification
5.4 Test Runner
Responsible for:
- selecting target test cases
- injecting input events/messages
- collecting outputs, logs, and tool-call traces
- evaluating built-in assertions
- executing optional verifier scripts
- producing final pass/fail results and artifacts
6. Core Design Principles
Determinism First
The framework should avoid real LLM randomness in automated tests.
Runtime Realism
Tests should run against realistic OpenClaw environments, not only mocked plugin internals.
Declarative Configuration
Test environments and cases should be defined in configuration files rather than hard-coded scripts.
Extensible Verification
Built-in assertions should cover common cases, while custom scripts should support project-specific validation.
Reproducible Artifacts
All important outputs should be captured for debugging, including logs, matched model rules, tool-call traces, and verifier results.
7. Proposed Spec Structure
The initial idea is to define a single spec file that describes:
- OpenClaw runtime units
- plugin installation and configuration
- test cases
- fake model behavior
- expected validation steps
A normalized V1 structure may look like this:
{
  "version": "v1",
  "environment": {
    "networkName": "clawspec-net",
    "workspaceRoot": "./.clawspec/workspaces",
    "artifactsRoot": "./.clawspec/artifacts"
  },
  "clawUnits": [
    {
      "unitId": "calendar-agent",
      "openClawVersion": "0.5.0",
      "image": "ghcr.io/openclaw/openclaw:0.5.0",
      "plugins": [
        {
          "pluginName": "harborforge-calendar",
          "installCommand": "openclaw plugins add harborforge-calendar",
          "configs": {
            "backendUrl": "http://fake-backend:8080",
            "agentId": "calendar-test-agent"
          }
        }
      ]
    }
  ],
  "testCases": [
    {
      "testId": "calendar-reminder-basic",
      "targetUnitId": "calendar-agent",
      "input": {
        "channel": "discord",
        "chatType": "direct",
        "message": "Remind me about the meeting at 3pm tomorrow"
      },
      "modelRules": [
        {
          "receive": ".*[Rr]emind me about the meeting.*",
          "action": {
            "type": "tool_call_then_respond",
            "toolName": "harborforge_calendar_create",
            "toolParameters": {
              "title": "Meeting",
              "time": "tomorrow 15:00"
            },
            "text": "Done, I have noted it down for you"
          }
        }
      ],
      "expected": [
        {
          "type": "tool_called",
          "toolName": "harborforge_calendar_create"
        },
        {
          "type": "message_contains",
          "value": "Done, I have noted it down for you"
        }
      ],
      "verifier": {
        "type": "script",
        "path": "./verifiers/calendar-reminder-basic.sh"
      }
    }
  ]
}
This is only a starting point. The exact schema should be refined during implementation.
8. Fake Model Service Design
The fake model service is one of the most important parts of ClawSpec.
It should behave like a deterministic model backend that OpenClaw can call during tests.
Responsibilities
- receive model requests from OpenClaw
- inspect request content and context
- match a rule set in declared order
- return predefined outputs
- support text-only responses
- support tool-calling responses
- support tool-call plus final text patterns
- emit logs showing which rule matched and what output was generated
Why This Matters
Without this service, tests would depend on live model providers, causing:
- unstable results
- variable tool-calling behavior
- token costs
- difficult reproduction of failures
The fake model service turns model behavior into a controlled part of the test spec.
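A first-match rule lookup could be as small as the sketch below. The rule shape mirrors the modelRules example in section 7; the first-match-wins semantics over the raw message text are an assumption to be confirmed during design:

```typescript
// Match an incoming message against declarative rules in declared order.
// "receive" is treated as a regular expression over the message text.
export interface ModelRule {
  receive: string;
  action: { type: string; text?: string; toolName?: string };
}

export function matchRule(rules: ModelRule[], message: string): ModelRule | null {
  for (const rule of rules) {
    if (new RegExp(rule.receive).test(message)) {
      return rule; // first match wins; later rules are ignored
    }
  }
  return null; // no rule matched; the service can fall back to a fixed default
}
```

Because matching is ordered and purely textual, the same request always produces the same response, which is exactly the determinism the tests need.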
9. Verification Model
ClawSpec should support two layers of verification.
9.1 Built-in Assertions
Common assertions should be supported directly by the framework, such as:
- message_contains
- message_equals
- tool_called
- tool_called_with
- exit_code
- log_contains
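A sketch of how a few of these assertions could be evaluated against collected outputs. The Observed shape and the structural comparison for tool_called_with are assumptions, not a committed format:

```typescript
// Assertion variants mirror the built-in assertion names above.
type Assertion =
  | { type: "message_contains"; value: string }
  | { type: "message_equals"; value: string }
  | { type: "tool_called"; toolName: string }
  | { type: "tool_called_with"; toolName: string; parameters: Record<string, unknown> };

// Assumed collection format for runtime outputs gathered by the runner.
interface Observed {
  messages: string[];
  toolCalls: { toolName: string; parameters?: Record<string, unknown> }[];
}

export function checkAssertion(a: Assertion, obs: Observed): boolean {
  switch (a.type) {
    case "message_contains":
      return obs.messages.some((m) => m.includes(a.value));
    case "message_equals":
      return obs.messages.some((m) => m === a.value);
    case "tool_called":
      return obs.toolCalls.some((t) => t.toolName === a.toolName);
    case "tool_called_with":
      // Naive structural comparison; key order matters here, which is fine
      // for a sketch but a real implementation should deep-compare.
      return obs.toolCalls.some(
        (t) =>
          t.toolName === a.toolName &&
          JSON.stringify(t.parameters) === JSON.stringify(a.parameters),
      );
  }
}
```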
9.2 External Verifier Scripts
Custom verifier scripts should be supported for advanced cases, such as:
- checking database state
- validating generated files
- verifying HTTP side effects
- checking plugin-specific external systems
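One way to wire verifier scripts in, assuming the common convention that exit code 0 means pass; the function name, result shape, and timeout value are illustrative:

```typescript
import { spawnSync } from "node:child_process";

// Run an external verifier and map its exit code to a test verdict.
export interface VerifierResult {
  passed: boolean;
  exitCode: number | null;
  stdout: string;
  stderr: string;
}

export function runVerifier(
  command: string,
  args: string[],
  env: Record<string, string> = {},
): VerifierResult {
  const proc = spawnSync(command, args, {
    encoding: "utf8",
    env: { ...process.env, ...env }, // pass artifact paths etc. via env vars
    timeout: 60_000, // kill verifiers that hang
  });
  return {
    passed: proc.status === 0, // convention: exit code 0 means pass
    exitCode: proc.status,
    stdout: proc.stdout ?? "",
    stderr: proc.stderr ?? "",
  };
}
```

Capturing stdout and stderr alongside the exit code keeps verifier output available in the artifact bundle when a run fails.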
This combination keeps common tests simple while preserving flexibility.
10. Execution Flow
A typical ClawSpec test run should look like this:
- Load and validate spec file
- Prepare workspace and artifact directories
- Materialize runtime environment definitions
- Start fake model service
- Start target OpenClaw unit(s)
- Install and configure required plugins
- Inject test input into the target unit
- Let the runtime interact with the fake model service
- Collect outputs, logs, tool traces, and events
- Evaluate expected assertions
- Run external verifier script if defined
- Produce final result summary and artifact bundle
- Tear down environment unless retention is requested
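The steps above can be sketched as a single run function over injected components. All interface and method names here are assumptions standing in for the real orchestrator, model service, and runner modules:

```typescript
// Injected step implementations; each maps to one phase of the run flow.
export interface RunSteps {
  startFakeModel(): Promise<void>;
  startUnits(): Promise<void>;
  installPlugins(): Promise<void>;
  injectInput(): Promise<void>;
  collect(): Promise<string[]>; // outputs, logs, tool traces
  evaluate(outputs: string[]): boolean;
  teardown(): Promise<void>;
}

export async function runTestCase(steps: RunSteps, keepEnv = false): Promise<boolean> {
  try {
    await steps.startFakeModel();
    await steps.startUnits();
    await steps.installPlugins();
    await steps.injectInput();
    const outputs = await steps.collect();
    return steps.evaluate(outputs);
  } finally {
    // Tear down unless retention was requested, even when a step throws.
    if (!keepEnv) await steps.teardown();
  }
}
```

Putting teardown in a finally block guarantees cleanup on failure, while the keepEnv flag covers the "retention requested" case from the flow above.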
11. Proposed Directory Structure
ClawSpec/
├── README.md
├── PROJECT_PLAN.md
├── docs/
│   ├── architecture.md
│   ├── spec-schema.md
│   ├── fake-model.md
│   └── runner.md
├── schema/
│   └── clawspec.schema.json
├── examples/
│   ├── basic.json
│   └── calendar-plugin.json
├── docker/
│   ├── compose.template.yml
│   └── fake-model.Dockerfile
├── src/
│   ├── cli/
│   ├── spec/
│   ├── orchestrator/
│   ├── model/
│   ├── runner/
│   └── report/
├── verifiers/
│   └── examples/
└── .clawspec/
    ├── workspaces/
    └── artifacts/
12. Recommended Tech Stack
Preferred Language
TypeScript / Node.js
Reasoning:
- fits well with OpenClaw ecosystem conventions
- convenient for JSON schema validation
- good support for CLI tooling
- convenient for HTTP fake service implementation
- straightforward Docker and subprocess orchestration
Suggested Libraries
- ajv for schema validation
- commander or yargs for the CLI
- execa for shell and Docker command orchestration
- fastify or express for the fake model service
- yaml for optional YAML support in the future
- vitest for ClawSpec self-tests
13. V1 Scope
The first version should focus on the smallest useful end-to-end workflow.
V1 Must Include
- load one spec file
- validate the basic schema
- start one OpenClaw test unit
- install one or more plugins in that unit
- apply plugin configuration entries
- start one fake model service
- inject one test input
- support rule-based text responses
- support rule-based tool-calling responses
- support basic assertions:
- message contains
- tool called
- script verifier
- generate logs and a pass/fail summary
V1 Should Avoid
- complex multi-turn state machines
- distributed execution
- UI dashboard
- performance benchmarking
- broad provider emulation beyond test needs
- advanced matrix test expansion
14. Milestone Proposal
Milestone 0 - Project Bootstrap
- initialize repository layout
- define coding conventions
- write initial README and project plan
- select runtime and libraries
Milestone 1 - Spec Definition
- draft spec schema v0.1
- implement spec parser and validation
- add example specs
Milestone 2 - Fake Model Service
- define internal rule format
- implement rule matcher
- implement deterministic response generation
- add request/response logging
Milestone 3 - Environment Orchestrator
- generate runtime environment configuration
- start and stop OpenClaw containers
- apply plugin install commands
- apply plugin configs
Milestone 4 - Test Runner
- inject test inputs
- collect runtime outputs
- evaluate assertions
- execute verifier scripts
- output structured test reports
Milestone 5 - First Real Plugin Demo
- create an example test suite for a real OpenClaw plugin
- validate the full workflow end to end
- document limitations and next steps
15. Risks and Open Questions
Runtime Interface Risk
The exact model-provider interface expected by OpenClaw must be verified early, since the fake model service is only useful if it matches this contract closely enough for tests to be meaningful.
Plugin Installation Variability
Different plugins may require different setup flows. ClawSpec must decide how much it standardizes versus how much it leaves to custom setup hooks.
Observable Output Boundaries
Some plugins expose behavior through logs, some through tool calls, some through external HTTP effects. The framework must define what counts as the authoritative observable result.
Docker/Image Strategy
The project needs a clear policy for:
- official base images
- local image overrides
- plugin source mounting during local development
- OpenClaw version pinning
Test Case Reuse
It may be useful later to split infra definitions, model rules, and assertions into reusable modules rather than keeping everything in one file.
16. Success Criteria
ClawSpec can be considered successful in its first phase if:
- a plugin developer can define a test spec without writing framework code
- a test run is reproducible across machines with the same environment
- plugin integration behavior can be validated without a real LLM
- failed runs produce enough artifacts to diagnose the issue quickly
- at least one real plugin can be tested end-to-end using the framework
17. Next Recommended Deliverables
After this plan, the next most useful documents are:
- README.md: concise positioning and quick start
- docs/spec-schema.md: formalize the spec design
- schema/clawspec.schema.json: machine-validatable V0 schema
- docs/fake-model.md: define fake model request/response behavior
- TASKLIST.md or milestone tracker: implementation breakdown
18. Summary
ClawSpec should become a deterministic integration testing framework for OpenClaw plugins.
Its core innovation is simple:
- run real OpenClaw runtime environments
- replace real LLM behavior with a rule-driven fake model service
- execute declarative test cases
- verify runtime behavior with stable, repeatable assertions
If implemented well, ClawSpec can become the standard foundation for plugin-level automated testing in the OpenClaw ecosystem.