docs: rewrite project plan in chinese
689
PROJECT_PLAN.md
@@ -1,162 +1,208 @@
|
|||||||
# ClawSpec Project Plan
|
# ClawSpec Project Plan
|
||||||
|
|
||||||
## 1. Project Overview
|
## 1. Project Overview
|
||||||
|
|
||||||
**ClawSpec** is an automation testing framework for OpenClaw plugins.
|
**ClawSpec** is a framework for **automated testing of OpenClaw plugins**.
|
||||||
|
|
||||||
Its purpose is to provide a deterministic, reproducible way to validate plugin behavior in a realistic OpenClaw runtime environment without relying on an actual LLM provider.
|
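It is not meant to replace the unit tests that already live inside plugin repositories. Instead, it provides a repeatable, orchestrated, and verifiable way to test **how a plugin behaves when integrated into a real OpenClaw runtime**.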
|
||||||
|
|
||||||
Instead of calling real language models, ClawSpec will run OpenClaw instances against a rule-based fake model service. This allows plugin developers to test message handling, tool-calling flows, plugin configuration, integration boundaries, and observable side effects with stable results.
|
The core idea behind ClawSpec is to:
|
||||||
|
|
||||||
|
- declare test scenarios in configuration files
|
||||||
|
- start one or more OpenClaw test entities (claw units)
|
||||||
|
- install and configure the target plugins automatically
|
||||||
|
- replace the real LLM with a **rule-driven fake model service**
|
||||||
|
- use rules to control model replies, tool calls, and when a test ends
|
||||||
|
- run assertions and verifier scripts automatically once the test finishes or times out
|
||||||
|
- produce stable, reproducible test results and debugging artifacts
|
||||||
|
|
||||||
|
The problem it sets out to solve is essentially:
|
||||||
|
|
||||||
|
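> How can OpenClaw plugins be integration-tested reliably without depending on a real LLM, without setting up environments by hand, and without someone clicking around and reading logs?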
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 2. Problem Statement
|
## 2. Problem Statement
|
||||||
|
|
||||||
OpenClaw plugin testing is currently hard to standardize because:
|
OpenClaw plugin testing currently faces several practical problems:
|
||||||
|
|
||||||
- plugin behavior often depends on runtime integration rather than isolated pure functions
|
- much plugin behavior is not a pure function; it depends on the OpenClaw runtime context
|
||||||
- real LLM responses are non-deterministic and expensive
|
- plugins often have to work together with model calls, tool calls, message routing, and configuration entries
|
||||||
- testing usually requires manual setup of OpenClaw, plugin installation, configuration, and message simulation
|
- using a real LLM introduces non-determinism, cost, and failures that are hard to reproduce
|
||||||
- there is no unified way to express plugin test environments and expected outcomes
|
- setting up test environments by hand is costly, and the process is not standardized
|
||||||
|
- there is no unified way to describe the test environment, test inputs, and expected results
|
||||||
|
|
||||||
ClawSpec aims to solve this by offering:
|
ClawSpec therefore needs to provide a standardized set of capabilities:
|
||||||
|
|
||||||
- declarative test environment definitions
|
- describe test environments and test cases with declarative configuration
|
||||||
- deterministic model behavior via rules instead of real LLMs
|
- replace the real LLM with a fake model to keep tests stable
|
||||||
- automated provisioning of OpenClaw runtime units
|
- bring up the OpenClaw runtime environment automatically and complete plugin installation and configuration
|
||||||
- repeatable execution of plugin integration tests
|
- support both single-unit and multi-unit collaboration scenarios
|
||||||
- structured validation and test reporting
|
- verify results automatically after the test ends and generate a report
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 3. Project Goals
|
## 3. Project Goals
|
||||||
|
|
||||||
### Primary Goals
|
### 3.1 Primary Goals
|
||||||
|
|
||||||
- Define test environments for one or more OpenClaw runtime units
|
- Define a test configuration file format for describing OpenClaw test scenarios
|
||||||
- Install and configure plugins automatically per test spec
|
- Create one or more OpenClaw test entities automatically from that configuration
|
||||||
- Provide a fake model service that responds according to declarative rules
|
- Install and configure the plugins a test requires automatically
|
||||||
- Execute test cases against running OpenClaw units
|
- Provide a rule-driven fake model service with no connection to a real LLM
|
||||||
- Verify expected behavior through built-in assertions and custom verifier scripts
|
- Support single-turn and multi-turn conversation tests
|
||||||
- Produce reproducible test artifacts and reports
|
- Support result verification through assertions and scripts
|
||||||
|
- Produce test artifacts that are stable, reproducible, and easy to troubleshoot with
|
||||||
|
|
||||||
### Secondary Goals
|
### 3.2 Secondary Goals
|
||||||
|
|
||||||
- Support multiple OpenClaw versions for compatibility testing
|
- Support compatibility testing across OpenClaw versions when needed
|
||||||
- Support multi-unit scenarios in a single test suite
|
- Support collaboration tests across multiple OpenClaw entities
|
||||||
- Provide reusable example specs for plugin developers
|
- Provide reusable example test configurations for plugin developers
|
||||||
- Make local plugin integration testing fast enough for everyday development
|
- Keep everyday local plugin integration testing lightweight and fast
|
||||||
|
|
||||||
### Non-Goals for V1
|
### 3.3 Non-Goals (at least not in V1)
|
||||||
|
|
||||||
- Full UI dashboard
|
- Replacing unit tests inside plugin repositories
|
||||||
- Large-scale distributed execution
|
- Building a graphical management UI from the start
|
||||||
- Performance benchmarking
|
- Building large-scale distributed concurrent execution from the start
|
||||||
- Fuzzing or random conversation generation
|
- Doing performance, stress, or fuzz testing from the start
|
||||||
- Automatic support for every possible provider protocol
|
- Emulating every protocol detail of every model provider from the start
|
||||||
- Replacing unit tests inside plugin repositories
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 4. Product Positioning
|
## 4. Product Positioning
|
||||||
|
|
||||||
ClawSpec is **not** a generic unit test runner.
|
ClawSpec is **not a general-purpose test framework**, and it is **not a pure unit-testing tool** either.
|
||||||
|
|
||||||
It is a **scenario-driven integration testing framework** for validating the behavior of:
|
A more accurate description is:
|
||||||
|
|
||||||
- OpenClaw runtime
|
> A scenario-driven, deterministic integration testing framework for the OpenClaw plugin ecosystem.
|
||||||
- installed plugins
|
|
||||||
- model interaction boundaries
|
|
||||||
- tool-calling flows
|
|
||||||
- message outputs
|
|
||||||
- side effects exposed through plugins or configured backends
|
|
||||||
|
|
||||||
The core value is deterministic validation of complex runtime behavior.
|
It is concerned with the combined behavior of:
|
||||||
|
|
||||||
|
- the OpenClaw runtime itself
|
||||||
|
- the behavior of installed plugins
|
||||||
|
- the model request and response boundary
|
||||||
|
- tool-calling chains
|
||||||
|
- message outputs
|
||||||
|
- observable side effects that plugins cause in external systems
|
||||||
|
|
||||||
|
So what it tests is not "is this function's return value correct", but rather:
|
||||||
|
|
||||||
|
> When OpenClaw, plugins, model interaction, and the tool chain all work together, does the end result match expectations?
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 5. High-Level Architecture
|
## 5. Core Design Principles
|
||||||
|
|
||||||
ClawSpec is expected to contain four major components.
|
### 5.1 Determinism First
|
||||||
|
|
||||||
### 5.1 Spec Loader
|
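Automated tests should avoid depending on the randomness of a real LLM wherever possible. The fake model service must be a controlled variable in the test process.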
|
||||||
|
|
||||||
Responsible for:
|
### 5.2 Runtime Realism
|
||||||
|
|
||||||
- reading the project spec file
|
The goal is to test against something close to a real OpenClaw runtime, not merely to mock a plugin's internal functions.
|
||||||
- validating structure against schema
|
|
||||||
- normalizing runtime definitions
|
|
||||||
- producing an execution plan for the runner
|
|
||||||
|
|
||||||
### 5.2 Environment Orchestrator
|
### 5.3 Declarative Configuration
|
||||||
|
|
||||||
Responsible for:
|
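The test environment, test inputs, model rules, and assertion conditions should be described in configuration as far as possible rather than scattered across scripts.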
|
||||||
|
|
||||||
- provisioning OpenClaw test units
|
### 5.4 Simple by Default, Extensible on Demand
|
||||||
- generating or managing Docker Compose definitions
|
|
||||||
- preparing workspace directories and mounted volumes
|
|
||||||
- installing plugins
|
|
||||||
- applying plugin configuration
|
|
||||||
- starting and stopping test environments
|
|
||||||
|
|
||||||
### 5.3 Fake Model Service
|
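Most plugin tests need only one claw unit, so single-unit testing should be the default path. Multi-unit collaboration should be supported, but it must not make single-unit tests more complicated.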
|
||||||
|
|
||||||
Responsible for:
|
### 5.5 Debuggability
|
||||||
|
|
||||||
- exposing a model-compatible endpoint for OpenClaw
|
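Every test run should leave behind enough artifacts for troubleshooting, such as logs, rule-match records, tool-call traces, and verifier script output.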
|
||||||
- receiving model requests from test units
|
|
||||||
- matching incoming requests against declarative rules
|
|
||||||
- returning deterministic text responses and/or tool-calling instructions
|
|
||||||
- logging interactions for debugging and verification
|
|
||||||
|
|
||||||
### 5.4 Test Runner
|
|
||||||
|
|
||||||
Responsible for:
|
|
||||||
|
|
||||||
- selecting target test cases
|
|
||||||
- injecting input events/messages
|
|
||||||
- collecting outputs, logs, and tool-call traces
|
|
||||||
- evaluating built-in assertions
|
|
||||||
- executing optional verifier scripts
|
|
||||||
- producing final pass/fail results and artifacts
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 6. Core Design Principles
|
## 6. The Role of the claw unit
|
||||||
|
|
||||||
### Determinism First
|
One key concept needs to be made explicit here:
|
||||||
|
|
||||||
The framework should avoid real LLM randomness in automated tests.
|
`clawUnits` exist **primarily not to test multiple versions or configurations at once**, but to support cases where:
|
||||||
|
|
||||||
### Runtime Realism
|
- a plugin acts on multiple OpenClaw entities
|
||||||
|
- plugin behavior depends on several agents or runtimes collaborating
|
||||||
|
- the test scenario itself requires multiple OpenClaw nodes
|
||||||
|
|
||||||
Tests should run against realistic OpenClaw environments, not only mocked plugin internals.
|
For example:
|
||||||
|
|
||||||
### Declarative Configuration
|
- entity A sends a message and entity B responds
|
||||||
|
- a plugin listens to the behavior of another entity
|
||||||
|
- plugins interacting in a multi-agent collaboration scenario
|
||||||
|
|
||||||
Test environments and cases should be defined in configuration files rather than hard-coded scripts.
|
Therefore:
|
||||||
|
|
||||||
### Extensible Verification
|
- **the vast majority of plugin tests should need only one `clawUnit`**
|
||||||
|
- `multi-unit` exists for collaboration tests, not for cramming every variation into one scenario
|
||||||
Built-in assertions should cover common cases, while custom scripts should support project-specific validation.
|
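- if the goal is only to test version compatibility or configuration differences, it is better to run several separate test scenarios than to rely on multiple units running at the same time by default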
|
||||||
|
|
||||||
### Reproducible Artifacts
|
|
||||||
|
|
||||||
All important outputs should be captured for debugging, including logs, matched model rules, tool-call traces, and verifier results.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 7. Proposed Spec Structure
|
## 7. Overall Architecture
|
||||||
|
|
||||||
The initial idea is to define a single spec file that describes:
|
ClawSpec can be split into four core modules.
|
||||||
|
|
||||||
- OpenClaw runtime units
|
### 7.1 Spec Loader
|
||||||
- plugin installation and configuration
|
|
||||||
- test cases
|
|
||||||
- fake model behavior
|
|
||||||
- expected validation steps
|
|
||||||
|
|
||||||
A normalized V1 structure may look like this:
|
Responsible for:
|
||||||
|
|
||||||
|
- reading the test configuration file
|
||||||
|
- validating that the configuration structure is well-formed
|
||||||
|
- normalizing fields
|
||||||
|
- generating the internal execution plan
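
As a rough illustration of these four responsibilities, the TypeScript sketch below parses a spec, checks a few structural rules by hand, and flattens the result into one plan entry per test case. The field names `unitId` and `plugins`, the `PlannedTest` shape, and the `loadSpec` function are assumptions made for illustration only; the real schema is still to be defined, and a real loader would more likely validate against a JSON Schema with a library such as `ajv`.

```typescript
// Hypothetical, simplified spec shapes. Only clawUnits, testCases, testId,
// targetUnitId, and timeout appear in this plan; everything else is assumed.
interface ClawUnitSpec {
  unitId: string;
  plugins?: string[];
}

interface TestCaseSpec {
  testId: string;
  targetUnitId: string;
  timeout: number;
}

interface PlannedTest {
  testId: string;
  unitId: string;
  plugins: string[];
  timeoutMs: number;
}

// Parse, validate, and normalize a spec into one plan entry per test case.
function loadSpec(rawJson: string): PlannedTest[] {
  const spec = JSON.parse(rawJson) as {
    clawUnits?: ClawUnitSpec[];
    testCases?: TestCaseSpec[];
  };
  const units = spec.clawUnits;
  const cases = spec.testCases;
  if (!Array.isArray(units) || units.length === 0) {
    throw new Error("spec must declare at least one claw unit");
  }
  if (!Array.isArray(cases)) {
    throw new Error("spec must declare a testCases array");
  }
  return cases.map((tc) => {
    // Every test case must point at a declared unit.
    const unit = units.filter((u) => u.unitId === tc.targetUnitId)[0];
    if (!unit) {
      throw new Error(`test ${tc.testId}: unknown unit ${tc.targetUnitId}`);
    }
    // Every test case must carry a positive timeout.
    if (typeof tc.timeout !== "number" || tc.timeout <= 0) {
      throw new Error(`test ${tc.testId}: timeout must be a positive number`);
    }
    // Normalization: optional fields get explicit defaults.
    return {
      testId: tc.testId,
      unitId: unit.unitId,
      plugins: unit.plugins ?? [],
      timeoutMs: tc.timeout,
    };
  });
}
```

Failing fast on an unknown `targetUnitId` or a missing `timeout` keeps configuration mistakes out of the slower environment start-up phase.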
|
||||||
|
|
||||||
|
### 7.2 Environment Orchestrator
|
||||||
|
|
||||||
|
Responsible for:
|
||||||
|
|
||||||
|
- creating OpenClaw test environments
|
||||||
|
- generating and managing Docker Compose run definitions
|
||||||
|
- preparing working, mount, and artifact directories
|
||||||
|
- installing plugins
|
||||||
|
- writing plugin configuration
|
||||||
|
- starting and tearing down test environments
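
To make the "generate Docker Compose run definitions" step more concrete, here is a minimal TypeScript sketch that builds the `services` section of a Compose file as a plain object. The keys `image`, `environment`, `volumes`, and `depends_on` are standard Compose service keys, but everything else is an assumption: the `fake-model` image tag, the container paths, the port, and especially the `CLAWSPEC_MODEL_BASE_URL` variable are invented for illustration, since how OpenClaw is pointed at a model endpoint still has to be verified.

```typescript
interface UnitDefinition {
  unitId: string;
  image: string;
}

interface ComposeService {
  image: string;
  environment: Record<string, string>;
  volumes: string[];
  depends_on: string[];
}

// Build the `services` section of a Compose file for one test run.
function buildComposeServices(
  units: UnitDefinition[],
  runDir: string
): Record<string, ComposeService> {
  const services: Record<string, ComposeService> = {
    "fake-model": {
      image: "clawspec/fake-model:dev", // hypothetical image tag
      environment: {},
      volumes: [`${runDir}/artifacts/fake-model:/artifacts`],
      depends_on: [],
    },
  };
  units.forEach((unit) => {
    services[unit.unitId] = {
      image: unit.image,
      // Hypothetical variable name and port; the real wiring is unverified.
      environment: { CLAWSPEC_MODEL_BASE_URL: "http://fake-model:8080" },
      // Each unit gets its own workspace and artifact directory.
      volumes: [
        `${runDir}/workspace/${unit.unitId}:/workspace`,
        `${runDir}/artifacts/${unit.unitId}:/artifacts`,
      ],
      // Units start only after the fake model container has been started.
      depends_on: ["fake-model"],
    };
  });
  return services;
}
```

Keeping the definition as data until the last moment means the same object can be serialized for Docker Compose or inspected directly in the framework's own tests.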
|
||||||
|
|
||||||
|
### 7.3 Fake Model Service
|
||||||
|
|
||||||
|
Responsible for:
|
||||||
|
|
||||||
|
- exposing a model interface for OpenClaw to call
|
||||||
|
- receiving model requests sent by OpenClaw
|
||||||
|
- deciding what to return based on the configured rules
|
||||||
|
- generating stable text replies or tool calls
|
||||||
|
- calling `test-finished` when the termination condition is met
|
||||||
|
- recording requests, rule matches, and outputs
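
The rule-matching core of such a service can be quite small. The sketch below, written under several assumptions, takes rules shaped like the `receive` / `action` pairs in the example spec, tries them in declared order, and reports whether the chosen action is the `test-finished` tool call. The `"text"` action type, the `MatchResult` shape, and the `matchRule` function name are hypothetical; only `receive`, `action`, `tool_call`, `toolName`, and `toolParameters` come from the draft spec.

```typescript
// "text" as an action type is an assumption; "tool_call" comes from the draft spec.
type RuleAction =
  | { type: "text"; text: string }
  | { type: "tool_call"; toolName: string; toolParameters: Record<string, unknown> };

interface ModelRule {
  receive: string; // regular expression source matched against the incoming text
  action: RuleAction;
}

interface MatchResult {
  ruleIndex: number; // which rule fired, useful for the rule-match log
  action: RuleAction;
  finished: boolean; // true when the action is the test-finished tool call
}

// Return the first rule (in declared order) whose pattern matches, or null.
function matchRule(rules: ModelRule[], incoming: string): MatchResult | null {
  let index = 0;
  for (const rule of rules) {
    if (new RegExp(rule.receive).test(incoming)) {
      const action = rule.action;
      const finished =
        action.type === "tool_call" && action.toolName === "test-finished";
      return { ruleIndex: index, action, finished };
    }
    index++;
  }
  // No rule matched; the caller decides the default behavior.
  return null;
}
```

Returning `null` instead of inventing a reply leaves the choice of default behavior (empty result, error, or a logged miss) to the caller.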
|
||||||
|
|
||||||
|
### 7.4 Test Runner
|
||||||
|
|
||||||
|
Responsible for:
|
||||||
|
|
||||||
|
- selecting test cases
|
||||||
|
- injecting input events or test messages
|
||||||
|
- collecting output messages, tool calls, logs, and events
|
||||||
|
- deciding when a test has ended
|
||||||
|
- running assertions and external verifier scripts
|
||||||
|
- aggregating results and producing a report
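
The "decide when a test has ended" responsibility boils down to a race between an observed `test-finished` tool call and the test case's `timeout`. The following TypeScript sketch shows one possible shape for that logic; the `FinishSignal` class and its method names are hypothetical, and the way the runner actually observes tool calls is left open.

```typescript
type EndReason = "finished" | "timeout";

class FinishSignal {
  private finished = false;
  private waiters: Array<() => void> = [];

  // Called by the runner for every tool call it observes from the fake model.
  observeToolCall(toolName: string): void {
    if (toolName !== "test-finished" || this.finished) return;
    this.finished = true;
    this.waiters.forEach((notify) => notify());
    this.waiters = [];
  }

  // Resolve with "finished" once test-finished is seen, or "timeout" otherwise.
  wait(timeoutMs: number): Promise<EndReason> {
    if (this.finished) return Promise.resolve<EndReason>("finished");
    return new Promise<EndReason>((resolve) => {
      const timer = setTimeout(() => resolve("timeout"), timeoutMs);
      this.waiters.push(() => {
        // Cancel the pending timeout so it cannot keep the process alive.
        clearTimeout(timer);
        resolve("finished");
      });
    });
  }
}
```

Either outcome moves the run on to verification, so the caller only needs the returned reason for reporting purposes.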
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Overall Approach to Test Configuration
|
||||||
|
|
||||||
|
ClawSpec needs a single, unified configuration file that describes:
|
||||||
|
|
||||||
|
- the test environment
|
||||||
|
- one or more claw units
|
||||||
|
- the OpenClaw version or image each unit uses
|
||||||
|
- the plugins to install and their configuration
|
||||||
|
- test inputs
|
||||||
|
- fake model behavior rules
|
||||||
|
- assertions and verifier scripts
|
||||||
|
|
||||||
|
A proposed configuration structure looks like this:
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
@@ -187,6 +233,7 @@ A normalized V1 structure may look like this:
|
|||||||
{
|
{
|
||||||
"testId": "calendar-reminder-basic",
|
"testId": "calendar-reminder-basic",
|
||||||
"targetUnitId": "calendar-agent",
|
"targetUnitId": "calendar-agent",
|
||||||
|
"timeout": 15000,
|
||||||
"input": {
|
"input": {
|
||||||
"channel": "discord",
|
"channel": "discord",
|
||||||
"chatType": "direct",
|
"chatType": "direct",
|
||||||
@@ -204,6 +251,16 @@ A normalized V1 structure may look like this:
|
|||||||
},
|
},
|
||||||
"text": "已经帮你记下了"
|
"text": "已经帮你记下了"
|
||||||
}
|
}
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"receive": ".*已经帮你记下了.*",
|
||||||
|
"action": {
|
||||||
|
"type": "tool_call",
|
||||||
|
"toolName": "test-finished",
|
||||||
|
"toolParameters": {
|
||||||
|
"reason": "expected final response observed"
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
],
|
],
|
||||||
"expected": [
|
"expected": [
|
||||||
@@ -225,89 +282,176 @@ A normalized V1 structure may look like this:
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
This is only a starting point. The exact schema should be refined during implementation.
|
This is only a directional draft; it needs to be refined into a formal schema later.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 8. Fake Model Service Design
|
## 9. Fake Model Service Design
|
||||||
|
|
||||||
The fake model service is one of the most important parts of ClawSpec.
|
The fake model service is one of ClawSpec's key capabilities.
|
||||||
|
|
||||||
It should behave like a deterministic model backend that OpenClaw can call during tests.
|
Without it, tests would depend on a real model, which brings the following problems:
|
||||||
|
|
||||||
### Responsibilities
|
- unstable results
|
||||||
|
- unpredictable tool-calling behavior
|
||||||
|
- rising costs
|
||||||
|
- failures that are hard to reproduce
|
||||||
|
|
||||||
- receive model requests from OpenClaw
|
The fake model in ClawSpec should therefore not just "return some arbitrary sentence"; it should be a **rule-driven test model**.
|
||||||
- inspect request content and context
|
|
||||||
- match a rule set in declared order
|
|
||||||
- return predefined outputs
|
|
||||||
- support text-only responses
|
|
||||||
- support tool-calling responses
|
|
||||||
- support tool-call plus final text patterns
|
|
||||||
- emit logs showing which rule matched and what output was generated
|
|
||||||
|
|
||||||
### Why This Matters
|
### 9.1 Responsibilities of the Fake Model
|
||||||
|
|
||||||
Without this service, tests would depend on live model providers, causing:
|
- receive model requests sent by OpenClaw
|
||||||
|
- load the rule set for the current test case
|
||||||
|
- decide the response action by rule order or match priority
|
||||||
|
- return a text reply, a tool call, or a combination of both
|
||||||
|
- trigger `test-finished` when the end condition is met
|
||||||
|
- emit detailed logs recording every matched rule and returned action
|
||||||
|
|
||||||
- unstable results
|
### 9.2 Action Types Rules Should Support
|
||||||
- variable tool-calling behavior
|
|
||||||
- token costs
|
|
||||||
- difficult reproduction of failures
|
|
||||||
|
|
||||||
The fake model service turns model behavior into a controlled part of the test spec.
|
At a minimum, rules should support:
|
||||||
|
|
||||||
|
- plain text replies
|
||||||
|
- tool calls
|
||||||
|
- a tool call followed by a text reply
|
||||||
|
- explicitly ending the test (by calling `test-finished`)
|
||||||
|
- a default behavior when nothing matches (for example, returning an empty result, raising an error, or logging the miss)
|
||||||
|
|
||||||
|
### 9.3 Multi-Turn Conversation Support
|
||||||
|
|
||||||
|
One more point needs to be made explicit:
|
||||||
|
|
||||||
|
> Multi-turn conversation is not hard in itself, and there is no need to design it as a complex state machine from the start.
|
||||||
|
|
||||||
|
A set of rules that is practical enough and easy to implement is:
|
||||||
|
|
||||||
|
- every test case must define a `timeout`
|
||||||
|
- every test case's `modelRules` must include a **test-end rule**
|
||||||
|
- when the end rule matches, the fake model calls the `test-finished` tool
|
||||||
|
- as soon as `test-finished` is observed, the test can move on to result checking
|
||||||
|
- if `test-finished` is not observed within `timeout`, the runner should also stop waiting automatically and begin checking results
|
||||||
|
|
||||||
|
The benefits of this approach:
|
||||||
|
|
||||||
|
- multi-turn conversations are supported naturally
|
||||||
|
- no complex state machine is needed up front
|
||||||
|
- the termination condition is explicit
|
||||||
|
- timeout handling is simple and uniform
|
||||||
|
|
||||||
|
In other words, the first version of ClawSpec's multi-turn mechanism can be built entirely on the following:
|
||||||
|
|
||||||
|
- rule matching
|
||||||
|
- the explicit end signal `test-finished`
|
||||||
|
- a uniform timeout
|
||||||
|
|
||||||
|
Nothing more than these three pieces is needed.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 9. Verification Model
|
## 10. Test Termination Mechanism
|
||||||
|
|
||||||
ClawSpec should support two layers of verification.
|
Every test case in ClawSpec should have two ways of ending:
|
||||||
|
|
||||||
### 9.1 Built-in Assertions
|
### 10.1 Explicit End
|
||||||
|
|
||||||
Common assertions should be supported directly by the framework, such as:
|
The fake model issues the tool call:
|
||||||
|
|
||||||
|
- `test-finished`
|
||||||
|
|
||||||
|
to signal that the test has reached its expected end point.
|
||||||
|
|
||||||
|
This means that:
|
||||||
|
|
||||||
|
- the test logic has run through to its designed end condition
|
||||||
|
- the runner can stop waiting for another round of interaction
|
||||||
|
- assertions and verifier scripts can start running
|
||||||
|
|
||||||
|
### 10.2 Timeout End
|
||||||
|
|
||||||
|
Every test case must provide a `timeout` field.
|
||||||
|
|
||||||
|
If `test-finished` is not observed within the timeout, then:
|
||||||
|
|
||||||
|
- the runner stops waiting
|
||||||
|
- the logs, messages, and tool calls collected so far become the input for result evaluation
|
||||||
|
- assertions and verifier scripts start running automatically
|
||||||
|
|
||||||
|
The point of doing this is to:
|
||||||
|
|
||||||
|
- prevent tests from hanging indefinitely
|
||||||
|
- allow scenarios that do not require an explicit end tool to still have their results evaluated
|
||||||
|
- give developers a uniform entry point for diagnosing failures
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 11. Verification Model
|
||||||
|
|
||||||
|
ClawSpec needs to support at least two layers of verification.
|
||||||
|
|
||||||
|
### 11.1 Built-in Assertions
|
||||||
|
|
||||||
|
These suit frequent, common scenarios, for example:
|
||||||
|
|
||||||
- `message_contains`
|
- `message_contains`
|
||||||
- `message_equals`
|
- `message_equals`
|
||||||
- `tool_called`
|
- `tool_called`
|
||||||
- `tool_called_with`
|
- `tool_called_with`
|
||||||
- `exit_code`
|
|
||||||
- `log_contains`
|
- `log_contains`
|
||||||
|
- `exit_code`
|
||||||
|
|
||||||
### 9.2 External Verifier Scripts
|
These assertions are executed directly by the framework and suit most common plugin tests.
|
||||||
|
|
||||||
Custom verifier scripts should be supported for advanced cases, such as:
|
### 11.2 External Verifier Scripts
|
||||||
|
|
||||||
- checking database state
|
These suit complex or highly project-specific checks, for example:
|
||||||
- validating generated files
|
|
||||||
- verifying HTTP side effects
|
|
||||||
- checking plugin-specific external systems
|
|
||||||
|
|
||||||
This combination keeps common tests simple while preserving flexibility.
|
- whether the database state is correct
|
||||||
|
- whether a particular file was generated
|
||||||
|
- whether a particular HTTP callback occurred
|
||||||
|
- whether the state of an external service changed
|
||||||
|
|
||||||
|
Scenarios like these can be handled with a verifier entry such as:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"verifier": {
|
||||||
|
"type": "script",
|
||||||
|
"path": "./verifiers/check-result.sh"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
The script at `path` is then executed as the verifier.
|
||||||
|
|
||||||
|
Built-in assertions cover the common cases, while script verification preserves extensibility.
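
To give a feel for how small the built-in layer can be, the sketch below evaluates the six assertion types listed in this section against the data collected from one run. The assertion names come from this plan, but the payload field names (`value`, `toolName`, `parameters`), the `RunObservation` shape, and the "subset match" semantics chosen for `tool_called_with` are assumptions, not a settled design.

```typescript
interface ObservedToolCall {
  name: string;
  parameters: Record<string, unknown>;
}

// Hypothetical container for everything the runner collected during one test.
interface RunObservation {
  messages: string[];
  toolCalls: ObservedToolCall[];
  logs: string[];
  exitCode: number;
}

type Assertion =
  | { type: "message_contains"; value: string }
  | { type: "message_equals"; value: string }
  | { type: "tool_called"; toolName: string }
  | { type: "tool_called_with"; toolName: string; parameters: Record<string, unknown> }
  | { type: "log_contains"; value: string }
  | { type: "exit_code"; value: number };

// Evaluate one built-in assertion against the observations of a run.
function checkAssertion(a: Assertion, run: RunObservation): boolean {
  switch (a.type) {
    case "message_contains": {
      const needle = a.value;
      return run.messages.some((m) => m.indexOf(needle) !== -1);
    }
    case "message_equals": {
      const expected = a.value;
      return run.messages.some((m) => m === expected);
    }
    case "tool_called": {
      const name = a.toolName;
      return run.toolCalls.some((c) => c.name === name);
    }
    case "tool_called_with": {
      const name = a.toolName;
      const wanted = a.parameters;
      // Assumed semantics: every expected parameter must be present and equal.
      return run.toolCalls.some(
        (c) =>
          c.name === name &&
          Object.keys(wanted).every(
            (k) => JSON.stringify(c.parameters[k]) === JSON.stringify(wanted[k])
          )
      );
    }
    case "log_contains": {
      const needle = a.value;
      return run.logs.some((line) => line.indexOf(needle) !== -1);
    }
    case "exit_code":
      return run.exitCode === a.value;
  }
}
```

Because each assertion is a pure function of the collected observations, the same check works whether the run ended through `test-finished` or through a timeout.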
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 10. Execution Flow
|
## 12. Execution Flow
|
||||||
|
|
||||||
A typical ClawSpec test run should look like this:
|
A typical ClawSpec test run should proceed as follows:
|
||||||
|
|
||||||
1. Load and validate spec file
|
1. Load and validate the test configuration
|
||||||
2. Prepare workspace and artifact directories
|
2. Prepare the workspace, artifacts, and other directories
|
||||||
3. Materialize runtime environment definitions
|
3. Generate runtime environment definitions from the configuration
|
||||||
4. Start fake model service
|
4. Start the fake model service
|
||||||
5. Start target OpenClaw unit(s)
|
5. Start the target OpenClaw unit (one or more)
|
||||||
6. Install and configure required plugins
|
6. Install and configure the target plugins
|
||||||
7. Inject test input into the target unit
|
7. Inject test input into the target unit
|
||||||
8. Let the runtime interact with the fake model service
|
8. Let OpenClaw interact with the fake model for one or more turns
|
||||||
9. Collect outputs, logs, tool traces, and events
|
9. Keep watching for `test-finished`
|
||||||
10. Evaluate expected assertions
|
10. If the end signal is hit, proceed to result verification
|
||||||
11. Run external verifier script if defined
|
11. If the timeout elapses, stop waiting and proceed to result verification
|
||||||
12. Produce final result summary and artifact bundle
|
12. Collect logs, messages, tool-call records, and event traces
|
||||||
13. Tear down environment unless retention is requested
|
13. Run built-in assertions
|
||||||
|
14. Run the external verifier script if needed
|
||||||
|
15. Output pass/fail, a summary, and artifact paths
|
||||||
|
16. Tear down the environment or keep it, as configured
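
The ordering above can be condensed into a small driver. The TypeScript sketch below is only a skeleton under stated assumptions: the `RunSteps` interface, its method names, and the `keepEnvironment` flag are hypothetical, and each step stands in for several of the numbered items. What it illustrates is that verification runs after both a `test-finished` signal and a timeout, and that teardown happens in a `finally` block unless retention was requested.

```typescript
type StopReason = "finished" | "timeout";

// Hypothetical step interface; each method groups several numbered steps.
interface RunSteps {
  prepare(): Promise<void>; // steps 1-3: load spec, create directories, generate definitions
  startEnvironment(): Promise<void>; // steps 4-6: fake model, units, plugins
  injectAndWait(): Promise<StopReason>; // steps 7-11: input, interaction, end detection
  verify(): Promise<boolean>; // steps 12-14: collect artifacts, assertions, verifier script
  teardown(): Promise<void>; // step 16
}

interface RunResult {
  passed: boolean;
  stopReason: StopReason;
}

async function runTestCase(steps: RunSteps, keepEnvironment: boolean): Promise<RunResult> {
  await steps.prepare();
  try {
    await steps.startEnvironment();
    // Both "finished" and "timeout" lead to verification.
    const stopReason = await steps.injectAndWait();
    const passed = await steps.verify();
    return { passed, stopReason }; // step 15: the caller reports this result
  } finally {
    // Tear down even when a step throws, unless the environment should be kept.
    if (!keepEnvironment) await steps.teardown();
  }
}
```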
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 11. Proposed Directory Structure
|
## 13. Recommended Directory Structure
|
||||||
|
|
||||||
```text
|
```text
|
||||||
ClawSpec/
|
ClawSpec/
|
||||||
@@ -322,7 +466,7 @@ ClawSpec/
|
|||||||
│ └── clawspec.schema.json
|
│ └── clawspec.schema.json
|
||||||
├── examples/
|
├── examples/
|
||||||
│ ├── basic.json
|
│ ├── basic.json
|
||||||
│ └── calendar-plugin.json
|
│ └── multi-unit.json
|
||||||
├── docker/
|
├── docker/
|
||||||
│ ├── compose.template.yml
|
│ ├── compose.template.yml
|
||||||
│ └── fake-model.Dockerfile
|
│ └── fake-model.Dockerfile
|
||||||
@@ -342,170 +486,197 @@ ClawSpec/
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 12. Recommended Tech Stack
|
## 14. Technology Recommendations
|
||||||
|
|
||||||
### Preferred Language
|
### 14.1 Recommended Language
|
||||||
|
|
||||||
**TypeScript / Node.js**
|
**TypeScript / Node.js** is the recommended first choice.
|
||||||
|
|
||||||
Reasoning:
|
Reasons:
|
||||||
|
|
||||||
- fits well with OpenClaw ecosystem conventions
|
- closer to the OpenClaw ecosystem
|
||||||
- convenient for JSON schema validation
|
- handles JSON, schemas, and CLIs comfortably
|
||||||
- good support for CLI tooling
|
- low cost to implement the fake model HTTP service
|
||||||
- convenient for HTTP fake service implementation
|
- convenient for invoking Docker, the OpenClaw CLI, and external scripts
|
||||||
- straightforward Docker and subprocess orchestration
|
|
||||||
|
|
||||||
### Suggested Libraries
|
### 14.2 Recommended Base Libraries
|
||||||
|
|
||||||
- `ajv` for schema validation
|
- `ajv`: schema validation
|
||||||
- `commander` or `yargs` for CLI
|
- `commander` or `yargs`: CLI
|
||||||
- `execa` for shell and Docker command orchestration
|
- `execa`: subprocess and command orchestration
|
||||||
- `fastify` or `express` for fake model service
|
- `fastify` or `express`: fake model service
|
||||||
- `yaml` for optional YAML support in the future
|
- `yaml`: future YAML configuration support
|
||||||
- `vitest` for ClawSpec self-tests
|
- `vitest`: tests for the framework itself
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 13. V1 Scope
|
## 15. Recommended V1 Scope
|
||||||
|
|
||||||
The first version should focus on the smallest useful end-to-end workflow.
|
The first version should close the smallest end-to-end loop first rather than expand prematurely.
|
||||||
|
|
||||||
### V1 Must Include
|
### 15.1 V1 Must Include
|
||||||
|
|
||||||
- load one spec file
|
- load one spec file
|
||||||
- validate the basic schema
|
- validate the basic schema
|
||||||
- start one OpenClaw test unit
|
- start one OpenClaw test unit
|
||||||
- install one or more plugins in that unit
|
- install one or more plugins in that unit
|
||||||
- apply plugin configuration entries
|
- apply plugin configuration entries
|
||||||
- start one fake model service
|
- start one fake model service
|
||||||
- inject one test input
|
- inject one test input
|
||||||
- support rule-based text responses
|
- support single-turn and multi-turn interaction
|
||||||
- support rule-based tool-calling responses
|
- require every test case to include a `timeout`
|
||||||
- support basic assertions:
|
- require every test case to define an explicit end rule and to end via `test-finished`
|
||||||
- message contains
|
- support rule-driven text replies
|
||||||
- tool called
|
- support rule-driven tool calls
|
||||||
- script verifier
|
- support basic assertions:
|
||||||
- generate logs and a pass/fail summary
|
- `message_contains`
|
||||||
|
- `tool_called`
|
||||||
|
- `script verifier`
|
||||||
|
- output logs and a pass/fail summary
|
||||||
|
|
||||||
### V1 Should Avoid
|
### 15.2 V1 Can Defer
|
||||||
|
|
||||||
- complex multi-turn state machines
|
- graphical UI
|
||||||
- distributed execution
|
- distributed execution
|
||||||
- UI dashboard
|
- performance testing
|
||||||
- performance benchmarking
|
- advanced provider protocol emulation
|
||||||
- broad provider emulation beyond test needs
|
- complex matrix test expansion
|
||||||
- advanced matrix test expansion
|
- overly complex conversation state machines
|
||||||
|
|
||||||
|
In other words, V1 only needs to make sure that:
|
||||||
|
|
||||||
|
- the environment comes up
|
||||||
|
- the plugins are installed
|
||||||
|
- the fake model replies according to the rules
|
||||||
|
- the test can end
|
||||||
|
- the results can be verified
|
||||||
|
|
||||||
|
Once this chain works end to end, the framework already delivers practical value.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 14. Milestone Proposal
|
## 16. Proposed Milestones
|
||||||
|
|
||||||
### Milestone 0 - Project Bootstrap
|
### Milestone 0: Project Bootstrap
|
||||||
|
|
||||||
- initialize repository layout
|
- initialize the repository layout
|
||||||
- define coding conventions
|
- set up the basic project scaffolding
|
||||||
- write initial README and project plan
|
- write the README and project plan
|
||||||
- select runtime and libraries
|
- settle the tech stack
|
||||||
|
|
||||||
### Milestone 1 - Spec Definition
|
### Milestone 1: Spec Definition
|
||||||
|
|
||||||
- draft spec schema v0.1
|
- settle the spec field design
|
||||||
- implement spec parser and validation
|
- write `docs/spec-schema.md`
|
||||||
- add example specs
|
- write the base schema file
|
||||||
|
- provide a minimal example
|
||||||
|
|
||||||
### Milestone 2 - Fake Model Service
|
### Milestone 2: Fake Model Service
|
||||||
|
|
||||||
- define internal rule format
|
- define the rule structure
|
||||||
- implement rule matcher
|
- implement rule matching
|
||||||
- implement deterministic response generation
|
- implement text and tool-call responses
|
||||||
- add request/response logging
|
- implement the `test-finished` end mechanism
|
||||||
|
- implement request and rule-match logging
|
||||||
|
|
||||||
### Milestone 3 - Environment Orchestrator
|
### Milestone 3: Environment Orchestration
|
||||||
|
|
||||||
- generate runtime environment configuration
|
- start OpenClaw containers or runtime instances
|
||||||
- start and stop OpenClaw containers
|
- install plugins
|
||||||
- apply plugin install commands
|
- write plugin configuration
|
||||||
- apply plugin configs
|
- manage the test lifecycle
|
||||||
|
|
||||||
### Milestone 4 - Test Runner
|
### Milestone 4: Test Runner
|
||||||
|
|
||||||
- inject test inputs
|
- inject test inputs
|
||||||
- collect runtime outputs
|
- collect outputs and tool-call traces
|
||||||
- evaluate assertions
|
- handle timeouts and end rules
|
||||||
- execute verifier scripts
|
- run assertions and verifier scripts
|
||||||
- output structured test reports
|
- output reports
|
||||||
|
|
||||||
### Milestone 5 - First Real Plugin Demo
|
### Milestone 5: Real Plugin Validation
|
||||||
|
|
||||||
- create an example test suite for a real OpenClaw plugin
|
- pick a real OpenClaw plugin for an end-to-end example
|
||||||
- validate the full workflow end to end
|
- verify that the framework design can support real needs
|
||||||
- document limitations and next steps
|
- document limitations and the next directions for evolution
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 15. Risks and Open Questions
|
## 17. Risks and Open Questions
|
||||||
|
|
||||||
### Runtime Interface Risk
|
### 17.1 OpenClaw Model Integration Interface
|
||||||
|
|
||||||
The exact model-provider interface expected by OpenClaw must be verified early. The fake model service depends on matching this contract well enough for tests.
|
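For the fake model service to work smoothly, the interface contract OpenClaw uses when calling a model must be understood clearly enough. This has to be verified as early as possible.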
|
||||||
|
|
||||||
### Plugin Installation Variability
|
### 17.2 Differences in Plugin Installation Flows
|
||||||
|
|
||||||
Different plugins may require different setup flows. ClawSpec must decide how much it standardizes versus how much it leaves to custom setup hooks.
|
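Different plugins may need different installation methods, initialization steps, configuration entries, and extra dependencies. ClawSpec must decide which of these become general capabilities and which are left to setup hooks or scripts.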
|
||||||
|
|
||||||
### Observable Output Boundaries
|
### 17.3 Boundaries of Observable Results
|
||||||
|
|
||||||
Some plugins expose behavior through logs, some through tool calls, some through external HTTP effects. The framework must define what counts as the authoritative observable result.
|
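Some plugins surface their results in messages, some in tool calls, and some as side effects in external systems. The framework needs to define clearly what counts as a source of results.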
|
||||||
|
|
||||||
### Docker/Image Strategy
|
### 17.4 Docker / Image Strategy
|
||||||
|
|
||||||
The project needs a clear policy for:
|
The following need to be settled:
|
||||||
|
|
||||||
- official base images
|
- how to choose base images
|
||||||
- local image overrides
|
- how to manage OpenClaw versions
|
||||||
- plugin source mounting during local development
|
- how to mount plugin source code during local development
|
||||||
- OpenClaw version pinning
|
- how to balance fast iteration against environment stability
|
||||||
|
|
||||||
### Test Case Reuse
|
### 17.5 Rule Expressiveness
|
||||||
|
|
||||||
It may be useful later to split infra definitions, model rules, and assertions into reusable modules rather than keeping everything in one file.
|
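If rules are too simple, complex plugins cannot be tested; if rules are too complex, configurations become hard to write. A balance between expressiveness and maintainability is needed.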
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 16. Success Criteria
|
## 18. Success Criteria
|
||||||
|
|
||||||
ClawSpec can be considered successful in its first phase if:
|
ClawSpec can be considered successful in its first phase if it meets the following conditions:
|
||||||
|
|
||||||
- a plugin developer can define a test spec without writing framework code
|
- a plugin developer can describe a test scenario without writing framework code
|
||||||
- a test run is reproducible across machines with the same environment
|
- the same configuration yields the same result repeatedly in the same environment
|
||||||
- plugin integration behavior can be validated without a real LLM
|
- plugin integration behavior can be verified without depending on a real LLM
|
||||||
- failed runs produce enough artifacts to diagnose the issue quickly
|
- failed tests leave behind enough artifacts to locate the problem
|
||||||
- at least one real plugin can be tested end-to-end using the framework
|
- at least one real plugin can complete end-to-end automated testing through ClawSpec
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 17. Next Recommended Deliverables
|
## 19. Recommended Next Deliverables
|
||||||
|
|
||||||
After this plan, the next most useful documents are:
|
After this project plan, the documents most worth adding next are:
|
||||||
|
|
||||||
1. `README.md` — concise positioning and quick start
|
1. `README.md`
|
||||||
2. `docs/spec-schema.md` — formalize the spec design
|
- explain concisely what the project is, what problem it solves, and how to get started quickly
|
||||||
3. `schema/clawspec.schema.json` — machine-validatable V0 schema
|
|
||||||
4. `docs/fake-model.md` — define fake model request/response behavior
|
2. `docs/spec-schema.md`
|
||||||
5. `TASKLIST.md` or milestone tracker — implementation breakdown
|
- write down the configuration structure formally
|
||||||
|
- pin down fields such as `clawUnits`, `testCases`, `modelRules`, `timeout`, and `test-finished`
|
||||||
|
|
||||||
|
3. `schema/clawspec.schema.json`
|
||||||
|
- provide a machine-readable schema that can be used directly for validation
|
||||||
|
|
||||||
|
4. `docs/fake-model.md`
|
||||||
|
- specify the fake model service's input/output protocol, rule matching, and end mechanism
|
||||||
|
|
||||||
|
5. `TASKLIST.md`
|
||||||
|
- break the milestones down into executable tasks
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 18. Summary
|
## 20. Summary
|
||||||
|
|
||||||
ClawSpec should become a deterministic integration testing framework for OpenClaw plugins.
|
ClawSpec's goal can be summed up in one sentence:
|
||||||
|
|
||||||
Its core innovation is simple:
|
> Run stable, repeatable automated integration tests on plugins inside a real OpenClaw runtime, with a rule-driven fake model standing in for the real LLM.
|
||||||
|
|
||||||
- run real OpenClaw runtime environments
|
Its key value lies in:
|
||||||
- replace real LLM behavior with a rule-driven fake model service
|
|
||||||
- execute declarative test cases
|
|
||||||
- verify runtime behavior with stable, repeatable assertions
|
|
||||||
|
|
||||||
If implemented well, ClawSpec can become the standard foundation for plugin-level automated testing in the OpenClaw ecosystem.
|
- testing integration behavior with a real runtime
|
||||||
|
- eliminating LLM randomness with a fake model
|
||||||
|
- unifying test scenarios through declarative configuration
|
||||||
|
- closing out multi-turn tests cleanly with the explicit end rule `test-finished` plus `timeout`
|
||||||
|
- balancing generality and extensibility through assertions and script verification
|
||||||
|
|
||||||
|
If this framework is built, it will be well suited as foundational testing infrastructure for OpenClaw plugin development.
|
||||||