# ClawSpec Project Plan

## 1. Project Overview

**ClawSpec** is an automation testing framework for OpenClaw plugins. Its purpose is to provide a deterministic, reproducible way to validate plugin behavior in a realistic OpenClaw runtime environment without relying on an actual LLM provider.

Instead of calling real language models, ClawSpec will run OpenClaw instances against a rule-based fake model service. This allows plugin developers to test message handling, tool-calling flows, plugin configuration, integration boundaries, and observable side effects with stable results.

---

## 2. Problem Statement

OpenClaw plugin testing is currently hard to standardize because:

- plugin behavior often depends on runtime integration rather than isolated pure functions
- real LLM responses are non-deterministic and expensive
- testing usually requires manual setup of OpenClaw, plugin installation, configuration, and message simulation
- there is no unified way to express plugin test environments and expected outcomes

ClawSpec aims to solve this by offering:

- declarative test environment definitions
- deterministic model behavior via rules instead of real LLMs
- automated provisioning of OpenClaw runtime units
- repeatable execution of plugin integration tests
- structured validation and test reporting

---

## 3. Project Goals

### Primary Goals

- Define test environments for one or more OpenClaw runtime units
- Install and configure plugins automatically per test spec
- Provide a fake model service that responds according to declarative rules
- Execute test cases against running OpenClaw units
- Verify expected behavior through built-in assertions and custom verifier scripts
- Produce reproducible test artifacts and reports

### Secondary Goals

- Support multiple OpenClaw versions for compatibility testing
- Support multi-unit scenarios in a single test suite
- Provide reusable example specs for plugin developers
- Make local plugin integration testing fast enough for everyday development

### Non-Goals for V1

- Full UI dashboard
- Large-scale distributed execution
- Performance benchmarking
- Fuzzing or random conversation generation
- Automatic support for every possible provider protocol
- Replacing unit tests inside plugin repositories

---

## 4. Product Positioning

ClawSpec is **not** a generic unit test runner. It is a **scenario-driven integration testing framework** for validating the behavior of:

- OpenClaw runtime
- installed plugins
- model interaction boundaries
- tool-calling flows
- message outputs
- side effects exposed through plugins or configured backends

The core value is deterministic validation of complex runtime behavior.

---

## 5. High-Level Architecture

ClawSpec is expected to contain four major components.
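Under the suggested TypeScript stack (section 12), the contracts between these components might be sketched as follows. Every interface and field name here is illustrative, not a committed API:

```typescript
// Hypothetical component contracts; names and shapes are illustrative only.

/** Output of the Spec Loader: a normalized, validated plan for the runner. */
interface ExecutionPlan {
  units: UnitDefinition[];
  testCases: TestCaseDefinition[];
}

interface UnitDefinition {
  unitId: string;
  image: string; // e.g. "ghcr.io/openclaw/openclaw:0.5.0"
  plugins: {
    pluginName: string;
    installCommand: string;
    configs: Record<string, string>;
  }[];
}

interface TestCaseDefinition {
  testId: string;
  targetUnitId: string;
}

/** 5.1 Spec Loader: spec file -> validated execution plan. */
interface SpecLoader {
  load(specPath: string): Promise<ExecutionPlan>;
}

/** 5.2 Environment Orchestrator: provisions and tears down test units. */
interface EnvironmentOrchestrator {
  start(plan: ExecutionPlan): Promise<void>;
  stop(plan: ExecutionPlan): Promise<void>;
}

/** 5.3 Fake Model Service: deterministic, rule-driven model endpoint. */
interface FakeModelService {
  endpointUrl(): string;
}

/** 5.4 Test Runner: injects inputs, collects outputs, evaluates assertions. */
interface TestRunner {
  run(plan: ExecutionPlan): Promise<{ passed: boolean; artifactsDir: string }>;
}
```

Keeping these boundaries as narrow interfaces would let each component be developed and tested independently.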
### 5.1 Spec Loader

Responsible for:

- reading the project spec file
- validating structure against schema
- normalizing runtime definitions
- producing an execution plan for the runner

### 5.2 Environment Orchestrator

Responsible for:

- provisioning OpenClaw test units
- generating or managing Docker Compose definitions
- preparing workspace directories and mounted volumes
- installing plugins
- applying plugin configuration
- starting and stopping test environments

### 5.3 Fake Model Service

Responsible for:

- exposing a model-compatible endpoint for OpenClaw
- receiving model requests from test units
- matching incoming requests against declarative rules
- returning deterministic text responses and/or tool-calling instructions
- logging interactions for debugging and verification

### 5.4 Test Runner

Responsible for:

- selecting target test cases
- injecting input events/messages
- collecting outputs, logs, and tool-call traces
- evaluating built-in assertions
- executing optional verifier scripts
- producing final pass/fail results and artifacts

---

## 6. Core Design Principles

### Determinism First

The framework should avoid real LLM randomness in automated tests.

### Runtime Realism

Tests should run against realistic OpenClaw environments, not only mocked plugin internals.

### Declarative Configuration

Test environments and cases should be defined in configuration files rather than hard-coded scripts.

### Extensible Verification

Built-in assertions should cover common cases, while custom scripts should support project-specific validation.

### Reproducible Artifacts

All important outputs should be captured for debugging, including logs, matched model rules, tool-call traces, and verifier results.

---

## 7. Proposed Spec Structure

The initial idea is to define a single spec file that describes:

- OpenClaw runtime units
- plugin installation and configuration
- test cases
- fake model behavior
- expected validation steps

A normalized V1 structure may look like this:

```json
{
  "version": "v1",
  "environment": {
    "networkName": "clawspec-net",
    "workspaceRoot": "./.clawspec/workspaces",
    "artifactsRoot": "./.clawspec/artifacts"
  },
  "clawUnits": [
    {
      "unitId": "calendar-agent",
      "openClawVersion": "0.5.0",
      "image": "ghcr.io/openclaw/openclaw:0.5.0",
      "plugins": [
        {
          "pluginName": "harborforge-calendar",
          "installCommand": "openclaw plugins add harborforge-calendar",
          "configs": {
            "backendUrl": "http://fake-backend:8080",
            "agentId": "calendar-test-agent"
          }
        }
      ]
    }
  ],
  "testCases": [
    {
      "testId": "calendar-reminder-basic",
      "targetUnitId": "calendar-agent",
      "input": {
        "channel": "discord",
        "chatType": "direct",
        "message": "Remind me about the meeting tomorrow at 3pm"
      },
      "modelRules": [
        {
          "receive": ".*Remind me about the meeting tomorrow at 3pm.*",
          "action": {
            "type": "tool_call_then_respond",
            "toolName": "harborforge_calendar_create",
            "toolParameters": {
              "title": "Meeting",
              "time": "tomorrow 15:00"
            },
            "text": "Got it, I've noted that down for you"
          }
        }
      ],
      "expected": [
        {
          "type": "tool_called",
          "toolName": "harborforge_calendar_create"
        },
        {
          "type": "message_contains",
          "value": "Got it, I've noted that down for you"
        }
      ],
      "verifier": {
        "type": "script",
        "path": "./verifiers/calendar-reminder-basic.sh"
      }
    }
  ]
}
```

This is only a starting point. The exact schema should be refined during implementation.

---

## 8. Fake Model Service Design

The fake model service is one of the most important parts of ClawSpec. It should behave like a deterministic model backend that OpenClaw can call during tests.
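The core of this service is first-match rule evaluation. A minimal sketch, assuming the rule and action shapes from the example spec in section 7 (all names are illustrative):

```typescript
// Minimal sketch of first-match rule evaluation for the fake model service.
// Rule and action shapes follow the example spec in section 7; not a fixed API.

type ModelAction =
  | { type: "respond"; text: string }
  | {
      type: "tool_call_then_respond";
      toolName: string;
      toolParameters: Record<string, unknown>;
      text: string;
    };

interface ModelRule {
  receive: string; // regex matched against the incoming message
  action: ModelAction;
}

/** Return the action of the first rule whose pattern matches, or null. */
function matchRule(rules: ModelRule[], message: string): ModelAction | null {
  for (const rule of rules) {
    if (new RegExp(rule.receive).test(message)) {
      return rule.action; // rules are evaluated in declared order
    }
  }
  return null; // no match: the service should log this and fail the test loudly
}
```

Because the rules come straight from the spec file, the same input always produces the same output, which is the property the rest of the framework depends on.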
### Responsibilities

- receive model requests from OpenClaw
- inspect request content and context
- match a rule set in declared order
- return predefined outputs
- support text-only responses
- support tool-calling responses
- support tool-call plus final text patterns
- emit logs showing which rule matched and what output was generated

### Why This Matters

Without this service, tests would depend on live model providers, causing:

- unstable results
- variable tool-calling behavior
- token costs
- difficult reproduction of failures

The fake model service turns model behavior into a controlled part of the test spec.

---

## 9. Verification Model

ClawSpec should support two layers of verification.

### 9.1 Built-in Assertions

Common assertions should be supported directly by the framework, such as:

- `message_contains`
- `message_equals`
- `tool_called`
- `tool_called_with`
- `exit_code`
- `log_contains`

### 9.2 External Verifier Scripts

Custom verifier scripts should be supported for advanced cases, such as:

- checking database state
- validating generated files
- verifying HTTP side effects
- checking plugin-specific external systems

This combination keeps common tests simple while preserving flexibility.

---

## 10. Execution Flow

A typical ClawSpec test run should look like this:

1. Load and validate spec file
2. Prepare workspace and artifact directories
3. Materialize runtime environment definitions
4. Start fake model service
5. Start target OpenClaw unit(s)
6. Install and configure required plugins
7. Inject test input into the target unit
8. Let the runtime interact with the fake model service
9. Collect outputs, logs, tool traces, and events
10. Evaluate expected assertions
11. Run external verifier script if defined
12. Produce final result summary and artifact bundle
13. Tear down environment unless retention is requested
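Step 10 of this flow, evaluating the built-in assertions from section 9.1, could be sketched roughly as follows. The collected-output shape here is an assumption, not a fixed contract:

```typescript
// Sketch of built-in assertion evaluation against collected run outputs.
// The CollectedOutput shape is hypothetical; only three assertion types shown.

interface CollectedOutput {
  messages: string[]; // messages emitted by the unit during the test
  toolCalls: { toolName: string; params: Record<string, unknown> }[];
}

type Expectation =
  | { type: "message_contains"; value: string }
  | { type: "message_equals"; value: string }
  | { type: "tool_called"; toolName: string };

/** Returns the list of failed expectations; an empty list means the case passed. */
function evaluate(out: CollectedOutput, expected: Expectation[]): Expectation[] {
  return expected.filter((e) => {
    switch (e.type) {
      case "message_contains":
        return !out.messages.some((m) => m.includes(e.value));
      case "message_equals":
        return !out.messages.some((m) => m === e.value);
      case "tool_called":
        return !out.toolCalls.some((c) => c.toolName === e.toolName);
    }
  });
}
```

Returning the failed expectations rather than a bare boolean keeps the final report informative: each failure can be printed alongside the captured outputs.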
---

## 11. Proposed Directory Structure

```text
ClawSpec/
├── README.md
├── PROJECT_PLAN.md
├── docs/
│   ├── architecture.md
│   ├── spec-schema.md
│   ├── fake-model.md
│   └── runner.md
├── schema/
│   └── clawspec.schema.json
├── examples/
│   ├── basic.json
│   └── calendar-plugin.json
├── docker/
│   ├── compose.template.yml
│   └── fake-model.Dockerfile
├── src/
│   ├── cli/
│   ├── spec/
│   ├── orchestrator/
│   ├── model/
│   ├── runner/
│   └── report/
├── verifiers/
│   └── examples/
└── .clawspec/
    ├── workspaces/
    └── artifacts/
```

---

## 12. Recommended Tech Stack

### Preferred Language

**TypeScript / Node.js**

Reasoning:

- fits well with OpenClaw ecosystem conventions
- convenient for JSON schema validation
- good support for CLI tooling
- convenient for HTTP fake service implementation
- straightforward Docker and subprocess orchestration

### Suggested Libraries

- `ajv` for schema validation
- `commander` or `yargs` for CLI
- `execa` for shell and Docker command orchestration
- `fastify` or `express` for fake model service
- `yaml` for optional YAML support in the future
- `vitest` for ClawSpec self-tests

---

## 13. V1 Scope

The first version should focus on the smallest useful end-to-end workflow.

### V1 Must Include

- load one spec file
- validate the basic schema
- start one OpenClaw test unit
- install one or more plugins in that unit
- apply plugin configuration entries
- start one fake model service
- inject one test input
- support rule-based text responses
- support rule-based tool-calling responses
- support basic assertions:
  - message contains
  - tool called
  - script verifier
- generate logs and a pass/fail summary

### V1 Should Avoid

- complex multi-turn state machines
- distributed execution
- UI dashboard
- performance benchmarking
- broad provider emulation beyond test needs
- advanced matrix test expansion
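As a starting point for the "validate the basic schema" item, a hand-rolled structural check might look like this; in the suggested stack it would later be replaced by `ajv` validation against `clawspec.schema.json`. Field names follow the example spec in section 7:

```typescript
// Minimal structural check for a loaded spec object. In the suggested stack,
// this would be replaced by ajv validation against clawspec.schema.json.

function validateSpecShape(spec: unknown): string[] {
  const errors: string[] = [];
  if (typeof spec !== "object" || spec === null) {
    return ["spec must be an object"];
  }
  const s = spec as Record<string, unknown>;
  if (s.version !== "v1") errors.push('version must be "v1"');
  if (!Array.isArray(s.clawUnits) || s.clawUnits.length === 0) {
    errors.push("clawUnits must be a non-empty array");
  }
  if (!Array.isArray(s.testCases)) errors.push("testCases must be an array");
  return errors; // empty list means the top-level shape is acceptable
}
```

Collecting all errors instead of failing on the first one gives spec authors a complete picture in a single run, which matters when specs are edited by hand.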
---

## 14. Milestone Proposal

### Milestone 0 - Project Bootstrap

- initialize repository layout
- define coding conventions
- write initial README and project plan
- select runtime and libraries

### Milestone 1 - Spec Definition

- draft spec schema v0.1
- implement spec parser and validation
- add example specs

### Milestone 2 - Fake Model Service

- define internal rule format
- implement rule matcher
- implement deterministic response generation
- add request/response logging

### Milestone 3 - Environment Orchestrator

- generate runtime environment configuration
- start and stop OpenClaw containers
- apply plugin install commands
- apply plugin configs

### Milestone 4 - Test Runner

- inject test inputs
- collect runtime outputs
- evaluate assertions
- execute verifier scripts
- output structured test reports

### Milestone 5 - First Real Plugin Demo

- create an example test suite for a real OpenClaw plugin
- validate the full workflow end to end
- document limitations and next steps

---

## 15. Risks and Open Questions

### Runtime Interface Risk

The exact model-provider interface expected by OpenClaw must be verified early. The fake model service depends on matching this contract well enough for tests.

### Plugin Installation Variability

Different plugins may require different setup flows. ClawSpec must decide how much it standardizes versus how much it leaves to custom setup hooks.

### Observable Output Boundaries

Some plugins expose behavior through logs, some through tool calls, and some through external HTTP effects. The framework must define what counts as the authoritative observable result.

### Docker/Image Strategy

The project needs a clear policy for:

- official base images
- local image overrides
- plugin source mounting during local development
- OpenClaw version pinning

### Test Case Reuse

It may be useful later to split infra definitions, model rules, and assertions into reusable modules rather than keeping everything in one file.

---

## 16. Success Criteria

ClawSpec can be considered successful in its first phase if:

- a plugin developer can define a test spec without writing framework code
- a test run is reproducible across machines with the same environment
- plugin integration behavior can be validated without a real LLM
- failed runs produce enough artifacts to diagnose the issue quickly
- at least one real plugin can be tested end-to-end using the framework

---

## 17. Next Recommended Deliverables

After this plan, the next most useful documents are:

1. `README.md` — concise positioning and quick start
2. `docs/spec-schema.md` — formalize the spec design
3. `schema/clawspec.schema.json` — machine-validatable V0 schema
4. `docs/fake-model.md` — define fake model request/response behavior
5. `TASKLIST.md` or milestone tracker — implementation breakdown

---

## 18. Summary

ClawSpec should become a deterministic integration testing framework for OpenClaw plugins. Its core idea is simple:

- run real OpenClaw runtime environments
- replace real LLM behavior with a rule-driven fake model service
- execute declarative test cases
- verify runtime behavior with stable, repeatable assertions

If implemented well, ClawSpec can become the standard foundation for plugin-level automated testing in the OpenClaw ecosystem.