ClawSpec Project Plan
1. Project Overview
ClawSpec is an automation testing framework for OpenClaw plugins.
Its purpose is to provide a deterministic, reproducible way to validate plugin behavior in a realistic OpenClaw runtime environment without relying on an actual LLM provider.
Instead of calling real language models, ClawSpec will run OpenClaw instances against a rule-based fake model service. This allows plugin developers to test message handling, tool-calling flows, plugin configuration, integration boundaries, and observable side effects with stable results.
2. Problem Statement
OpenClaw plugin testing is currently hard to standardize because:
- plugin behavior often depends on runtime integration rather than isolated pure functions
- real LLM responses are non-deterministic and expensive
- testing usually requires manual setup of OpenClaw, plugin installation, configuration, and message simulation
- there is no unified way to express plugin test environments and expected outcomes
ClawSpec aims to solve this by offering:
- declarative test environment definitions
- deterministic model behavior via rules instead of real LLMs
- automated provisioning of OpenClaw runtime units
- repeatable execution of plugin integration tests
- structured validation and test reporting
3. Project Goals
Primary Goals
- Define test environments for one or more OpenClaw runtime units
- Install and configure plugins automatically per test spec
- Provide a fake model service that responds according to declarative rules
- Execute test cases against running OpenClaw units
- Verify expected behavior through built-in assertions and custom verifier scripts
- Produce reproducible test artifacts and reports
Secondary Goals
- Support multiple OpenClaw versions for compatibility testing
- Support multi-unit scenarios in a single test suite
- Provide reusable example specs for plugin developers
- Make local plugin integration testing fast enough for everyday development
Non-Goals for V1
- Full UI dashboard
- Large-scale distributed execution
- Performance benchmarking
- Fuzzing or random conversation generation
- Automatic support for every possible provider protocol
- Replacing unit tests inside plugin repositories
4. Product Positioning
ClawSpec is not a generic unit test runner.
It is a scenario-driven integration testing framework for validating the behavior of:
- OpenClaw runtime
- installed plugins
- model interaction boundaries
- tool-calling flows
- message outputs
- side effects exposed through plugins or configured backends
The core value is deterministic validation of complex runtime behavior.
5. High-Level Architecture
ClawSpec is expected to contain four major components.
5.1 Spec Loader
Responsible for:
- reading the project spec file
- validating structure against schema
- normalizing runtime definitions
- producing an execution plan for the runner
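A minimal sketch of the shape checks the loader might run before full JSON Schema validation. The field names follow the example spec in section 7; the function name and error format are illustrative, not a committed API:

```typescript
// Lightweight structural checks a spec loader could run before handing the
// document to a JSON Schema validator. Field names match the V1 example spec.
export interface SpecCheck {
  ok: boolean;
  errors: string[];
}

export function checkSpecShape(spec: unknown): SpecCheck {
  if (typeof spec !== "object" || spec === null) {
    return { ok: false, errors: ["spec must be an object"] };
  }
  const s = spec as Record<string, unknown>;
  const errors: string[] = [];
  if (s["version"] !== "v1") errors.push('version must be "v1"');
  if (!Array.isArray(s["clawUnits"]) || s["clawUnits"].length === 0) {
    errors.push("clawUnits must be a non-empty array");
  }
  if (!Array.isArray(s["testCases"])) {
    errors.push("testCases must be an array");
  }
  return { ok: errors.length === 0, errors };
}
```

Full validation would still belong to the schema layer; a pre-check like this only exists to fail fast with readable messages.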
5.2 Environment Orchestrator
Responsible for:
- provisioning OpenClaw test units
- generating or managing Docker Compose definitions
- preparing workspace directories and mounted volumes
- installing plugins
- applying plugin configuration
- starting and stopping test environments
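As a sketch of the compose-generation step, a service entry for one unit might be materialized as below. The service, network, and volume layout are assumptions rather than a fixed format; the unit fields mirror the example spec in section 7:

```typescript
// Materialize a Docker Compose service fragment for one OpenClaw test unit.
// The volume and network layout shown here is illustrative.
export interface UnitSpec {
  unitId: string;
  image: string;
}

export function composeService(unit: UnitSpec, network: string): string {
  // Emit a YAML fragment directly to keep the sketch dependency-free;
  // a real implementation would likely build an object and serialize it.
  return [
    `  ${unit.unitId}:`,
    `    image: ${unit.image}`,
    `    networks: [${network}]`,
    `    volumes:`,
    `      - ./.clawspec/workspaces/${unit.unitId}:/workspace`,
  ].join("\n");
}
```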
5.3 Fake Model Service
Responsible for:
- exposing a model-compatible endpoint for OpenClaw
- receiving model requests from test units
- matching incoming requests against declarative rules
- returning deterministic text responses and/or tool-calling instructions
- logging interactions for debugging and verification
5.4 Test Runner
Responsible for:
- selecting target test cases
- injecting input events/messages
- collecting outputs, logs, and tool-call traces
- evaluating built-in assertions
- executing optional verifier scripts
- producing final pass/fail results and artifacts
6. Core Design Principles
Determinism First
The framework should avoid real LLM randomness in automated tests.
Runtime Realism
Tests should run against realistic OpenClaw environments, not only mocked plugin internals.
Declarative Configuration
Test environments and cases should be defined in configuration files rather than hard-coded scripts.
Extensible Verification
Built-in assertions should cover common cases, while custom scripts should support project-specific validation.
Reproducible Artifacts
All important outputs should be captured for debugging, including logs, matched model rules, tool-call traces, and verifier results.
7. Proposed Spec Structure
The initial idea is to define a single spec file that describes:
- OpenClaw runtime units
- plugin installation and configuration
- test cases
- fake model behavior
- expected validation steps
A normalized V1 structure may look like this:
{
  "version": "v1",
  "environment": {
    "networkName": "clawspec-net",
    "workspaceRoot": "./.clawspec/workspaces",
    "artifactsRoot": "./.clawspec/artifacts"
  },
  "clawUnits": [
    {
      "unitId": "calendar-agent",
      "openClawVersion": "0.5.0",
      "image": "ghcr.io/openclaw/openclaw:0.5.0",
      "plugins": [
        {
          "pluginName": "harborforge-calendar",
          "installCommand": "openclaw plugins add harborforge-calendar",
          "configs": {
            "backendUrl": "http://fake-backend:8080",
            "agentId": "calendar-test-agent"
          }
        }
      ]
    }
  ],
  "testCases": [
    {
      "testId": "calendar-reminder-basic",
      "targetUnitId": "calendar-agent",
      "input": {
        "channel": "discord",
        "chatType": "direct",
        "message": "Remind me about the meeting at 3pm tomorrow"
      },
      "modelRules": [
        {
          "receive": ".*[Rr]emind me about the meeting.*",
          "action": {
            "type": "tool_call_then_respond",
            "toolName": "harborforge_calendar_create",
            "toolParameters": {
              "title": "Meeting",
              "time": "tomorrow 15:00"
            },
            "text": "Done, I have noted it down for you"
          }
        }
      ],
      "expected": [
        {
          "type": "tool_called",
          "toolName": "harborforge_calendar_create"
        },
        {
          "type": "message_contains",
          "value": "Done, I have noted it down for you"
        }
      ],
      "verifier": {
        "type": "script",
        "path": "./verifiers/calendar-reminder-basic.sh"
      }
    }
  ]
}
This is only a starting point. The exact schema should be refined during implementation.
8. Fake Model Service Design
The fake model service is one of the most important parts of ClawSpec.
It should behave like a deterministic model backend that OpenClaw can call during tests.
Responsibilities
- receive model requests from OpenClaw
- inspect request content and context
- match a rule set in declared order
- return predefined outputs
- support text-only responses
- support tool-calling responses
- support tool-call plus final text patterns
- emit logs showing which rule matched and what output was generated
Why This Matters
Without this service, tests would depend on live model providers, causing:
- unstable results
- variable tool-calling behavior
- token costs
- difficult reproduction of failures
The fake model service turns model behavior into a controlled part of the test spec.
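A first-match rule lookup could be as small as the sketch below. The rule shape mirrors the modelRules example in section 7; the first-match-wins semantics over the raw message text are an assumption to be confirmed during design:

```typescript
// Match an incoming message against declarative rules in declared order.
// "receive" is treated as a regular expression over the message text.
export interface ModelRule {
  receive: string;
  action: { type: string; text?: string; toolName?: string };
}

export function matchRule(rules: ModelRule[], message: string): ModelRule | null {
  for (const rule of rules) {
    if (new RegExp(rule.receive).test(message)) {
      return rule; // first match wins; later rules are ignored
    }
  }
  return null; // no rule matched; the service can fall back to a fixed default
}
```

Because matching is ordered and purely textual, the same request always produces the same response, which is exactly the determinism the tests need.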
9. Verification Model
ClawSpec should support two layers of verification.
9.1 Built-in Assertions
Common assertions should be supported directly by the framework, such as:
- message_contains
- message_equals
- tool_called
- tool_called_with
- exit_code
- log_contains
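A sketch of how a few of these assertions could be evaluated against collected outputs. The Observed shape and the structural comparison for tool_called_with are assumptions, not a committed format:

```typescript
// Assertion variants mirror the built-in assertion names above.
type Assertion =
  | { type: "message_contains"; value: string }
  | { type: "message_equals"; value: string }
  | { type: "tool_called"; toolName: string }
  | { type: "tool_called_with"; toolName: string; parameters: Record<string, unknown> };

// Assumed collection format for runtime outputs gathered by the runner.
interface Observed {
  messages: string[];
  toolCalls: { toolName: string; parameters?: Record<string, unknown> }[];
}

export function checkAssertion(a: Assertion, obs: Observed): boolean {
  switch (a.type) {
    case "message_contains":
      return obs.messages.some((m) => m.includes(a.value));
    case "message_equals":
      return obs.messages.some((m) => m === a.value);
    case "tool_called":
      return obs.toolCalls.some((t) => t.toolName === a.toolName);
    case "tool_called_with":
      // Naive structural comparison; key order matters here, which is fine
      // for a sketch but a real implementation should deep-compare.
      return obs.toolCalls.some(
        (t) =>
          t.toolName === a.toolName &&
          JSON.stringify(t.parameters) === JSON.stringify(a.parameters),
      );
  }
}
```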
9.2 External Verifier Scripts
Custom verifier scripts should be supported for advanced cases, such as:
- checking database state
- validating generated files
- verifying HTTP side effects
- checking plugin-specific external systems
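One way to wire verifier scripts in, assuming the common convention that exit code 0 means pass; the function name, result shape, and timeout value are illustrative:

```typescript
import { spawnSync } from "node:child_process";

// Run an external verifier and map its exit code to a test verdict.
export interface VerifierResult {
  passed: boolean;
  exitCode: number | null;
  stdout: string;
  stderr: string;
}

export function runVerifier(
  command: string,
  args: string[],
  env: Record<string, string> = {},
): VerifierResult {
  const proc = spawnSync(command, args, {
    encoding: "utf8",
    env: { ...process.env, ...env }, // pass artifact paths etc. via env vars
    timeout: 60_000, // kill verifiers that hang
  });
  return {
    passed: proc.status === 0, // convention: exit code 0 means pass
    exitCode: proc.status,
    stdout: proc.stdout ?? "",
    stderr: proc.stderr ?? "",
  };
}
```

Capturing stdout and stderr alongside the exit code keeps verifier output available in the artifact bundle when a run fails.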
This combination keeps common tests simple while preserving flexibility.
10. Execution Flow
A typical ClawSpec test run should look like this:
- Load and validate spec file
- Prepare workspace and artifact directories
- Materialize runtime environment definitions
- Start fake model service
- Start target OpenClaw unit(s)
- Install and configure required plugins
- Inject test input into the target unit
- Let the runtime interact with the fake model service
- Collect outputs, logs, tool traces, and events
- Evaluate expected assertions
- Run external verifier script if defined
- Produce final result summary and artifact bundle
- Tear down environment unless retention is requested
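The steps above can be sketched as a single run function over injected components. All interface and method names here are assumptions standing in for the real orchestrator, model service, and runner modules:

```typescript
// Injected step implementations; each maps to one phase of the run flow.
export interface RunSteps {
  startFakeModel(): Promise<void>;
  startUnits(): Promise<void>;
  installPlugins(): Promise<void>;
  injectInput(): Promise<void>;
  collect(): Promise<string[]>; // outputs, logs, tool traces
  evaluate(outputs: string[]): boolean;
  teardown(): Promise<void>;
}

export async function runTestCase(steps: RunSteps, keepEnv = false): Promise<boolean> {
  try {
    await steps.startFakeModel();
    await steps.startUnits();
    await steps.installPlugins();
    await steps.injectInput();
    const outputs = await steps.collect();
    return steps.evaluate(outputs);
  } finally {
    // Tear down unless retention was requested, even when a step throws.
    if (!keepEnv) await steps.teardown();
  }
}
```

Putting teardown in a finally block guarantees cleanup on failure, while the keepEnv flag covers the "retention requested" case from the flow above.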
11. Proposed Directory Structure
ClawSpec/
├── README.md
├── PROJECT_PLAN.md
├── docs/
│   ├── architecture.md
│   ├── spec-schema.md
│   ├── fake-model.md
│   └── runner.md
├── schema/
│   └── clawspec.schema.json
├── examples/
│   ├── basic.json
│   └── calendar-plugin.json
├── docker/
│   ├── compose.template.yml
│   └── fake-model.Dockerfile
├── src/
│   ├── cli/
│   ├── spec/
│   ├── orchestrator/
│   ├── model/
│   ├── runner/
│   └── report/
├── verifiers/
│   └── examples/
└── .clawspec/
    ├── workspaces/
    └── artifacts/
12. Recommended Tech Stack
Preferred Language
TypeScript / Node.js
Reasoning:
- fits well with OpenClaw ecosystem conventions
- convenient for JSON schema validation
- good support for CLI tooling
- convenient for HTTP fake service implementation
- straightforward Docker and subprocess orchestration
Suggested Libraries
- ajv for schema validation
- commander or yargs for the CLI
- execa for shell and Docker command orchestration
- fastify or express for the fake model service
- yaml for optional YAML support in the future
- vitest for ClawSpec self-tests
13. V1 Scope
The first version should focus on the smallest useful end-to-end workflow.
V1 Must Include
- load one spec file
- validate the basic schema
- start one OpenClaw test unit
- install one or more plugins in that unit
- apply plugin configuration entries
- start one fake model service
- inject one test input
- support rule-based text responses
- support rule-based tool-calling responses
- support basic assertions:
- message contains
- tool called
- script verifier
- generate logs and a pass/fail summary
V1 Should Avoid
- complex multi-turn state machines
- distributed execution
- UI dashboard
- performance benchmarking
- broad provider emulation beyond test needs
- advanced matrix test expansion
14. Milestone Proposal
Milestone 0 - Project Bootstrap
- initialize repository layout
- define coding conventions
- write initial README and project plan
- select runtime and libraries
Milestone 1 - Spec Definition
- draft spec schema v0.1
- implement spec parser and validation
- add example specs
Milestone 2 - Fake Model Service
- define internal rule format
- implement rule matcher
- implement deterministic response generation
- add request/response logging
Milestone 3 - Environment Orchestrator
- generate runtime environment configuration
- start and stop OpenClaw containers
- apply plugin install commands
- apply plugin configs
Milestone 4 - Test Runner
- inject test inputs
- collect runtime outputs
- evaluate assertions
- execute verifier scripts
- output structured test reports
Milestone 5 - First Real Plugin Demo
- create an example test suite for a real OpenClaw plugin
- validate the full workflow end to end
- document limitations and next steps
15. Risks and Open Questions
Runtime Interface Risk
The exact model-provider interface expected by OpenClaw must be verified early, since the fake model service is only useful if it matches this contract closely enough for tests to be meaningful.
Plugin Installation Variability
Different plugins may require different setup flows. ClawSpec must decide how much it standardizes versus how much it leaves to custom setup hooks.
Observable Output Boundaries
Some plugins expose behavior through logs, some through tool calls, some through external HTTP effects. The framework must define what counts as the authoritative observable result.
Docker/Image Strategy
The project needs a clear policy for:
- official base images
- local image overrides
- plugin source mounting during local development
- OpenClaw version pinning
Test Case Reuse
It may be useful later to split infra definitions, model rules, and assertions into reusable modules rather than keeping everything in one file.
16. Success Criteria
ClawSpec can be considered successful in its first phase if:
- a plugin developer can define a test spec without writing framework code
- a test run is reproducible across machines with the same environment
- plugin integration behavior can be validated without a real LLM
- failed runs produce enough artifacts to diagnose the issue quickly
- at least one real plugin can be tested end-to-end using the framework
17. Next Recommended Deliverables
After this plan, the next most useful documents are:
- README.md: concise positioning and quick start
- docs/spec-schema.md: formalize the spec design
- schema/clawspec.schema.json: machine-validatable V0 schema
- docs/fake-model.md: define fake model request/response behavior
- TASKLIST.md or milestone tracker: implementation breakdown
18. Summary
ClawSpec should become a deterministic integration testing framework for OpenClaw plugins.
Its core innovation is simple:
- run real OpenClaw runtime environments
- replace real LLM behavior with a rule-driven fake model service
- execute declarative test cases
- verify runtime behavior with stable, repeatable assertions
If implemented well, ClawSpec can become the standard foundation for plugin-level automated testing in the OpenClaw ecosystem.