diff --git a/PROJECT_PLAN.md b/PROJECT_PLAN.md
index 0d779bd..6b1dbb0 100644
--- a/PROJECT_PLAN.md
+++ b/PROJECT_PLAN.md
@@ -1,162 +1,208 @@
-# ClawSpec Project Plan
+# ClawSpec Project Plan
 
-## 1. Project Overview
+## 1. Project Overview
 
-**ClawSpec** is an automation testing framework for OpenClaw plugins.
+**ClawSpec** is a framework for **automated testing of OpenClaw plugins**.
 
-Its purpose is to provide a deterministic, reproducible way to validate plugin behavior in a realistic OpenClaw runtime environment without relying on an actual LLM provider.
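+It does not aim to replace the unit tests that already live inside plugin repositories. Instead, it provides a repeatable, orchestrated, and verifiable way to test **how a plugin behaves when integrated into a real OpenClaw runtime**.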
 
-Instead of calling real language models, ClawSpec will run OpenClaw instances against a rule-based fake model service. This allows plugin developers to test message handling, tool-calling flows, plugin configuration, integration boundaries, and observable side effects with stable results.
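+The core idea behind ClawSpec is to:
+
+- declare test scenarios in a configuration file
+- start one or more OpenClaw test entities (claw units)
+- install and configure the target plugins automatically
+- replace the real LLM with a **rule-driven fake model service**
+- use rules to control model replies, tool calls, and when a test ends
+- run assertions and verifier scripts automatically once the test finishes or times out
+- produce stable, reproducible test results and debugging artifacts
+
+The problem it sets out to solve is essentially:
+
+> How to run stable integration tests for OpenClaw plugins without depending on a real large language model, without setting up environments by hand, and without manually clicking around and watching logs.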
 
 ---
 
-## 2. Problem Statement
+## 2. Problem Statement
 
-OpenClaw plugin testing is currently hard to standardize because:
+Testing OpenClaw plugins today runs into several practical problems:
 
-- plugin behavior often depends on runtime integration rather than isolated pure functions
-- real LLM responses are non-deterministic and expensive
-- testing usually requires manual setup of OpenClaw, plugin installation, configuration, and message simulation
-- there is no unified way to express plugin test environments and expected outcomes
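+- much plugin behavior is not a pure function; it depends on the OpenClaw runtime context
+- plugins often have to work together with model calls, tool calls, message routing, and configuration entries
+- real LLMs introduce non-determinism, cost, and failures that are hard to reproduce
+- setting up a test environment by hand is expensive, and the process is not standardized
+- there is no unified way to describe test environments, test inputs, and expected results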
 
-ClawSpec aims to solve this by offering:
+ClawSpec therefore needs to provide a standardized set of capabilities:
 
-- declarative test environment definitions
-- deterministic model behavior via rules instead of real LLMs
-- automated provisioning of OpenClaw runtime units
-- repeatable execution of plugin integration tests
-- structured validation and test reporting
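+- describe test environments and test cases with declarative configuration
+- replace the real LLM with a fake model to keep tests stable
+- bring up the OpenClaw runtime environment automatically, including plugin installation and configuration
+- support both single-unit and multi-unit collaboration scenarios
+- validate results automatically when a test ends and generate a report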
 
 ---
 
-## 3. Project Goals
+## 3. Project Goals
 
-### Primary Goals
+### 3.1 Primary Goals
 
-- Define test environments for one or more OpenClaw runtime units
-- Install and configure plugins automatically per test spec
-- Provide a fake model service that responds according to declarative rules
-- Execute test cases against running OpenClaw units
-- Verify expected behavior through built-in assertions and custom verifier scripts
-- Produce reproducible test artifacts and reports
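+- Define a test configuration file format for describing OpenClaw test scenarios
+- Create one or more OpenClaw test entities automatically from that configuration
+- Install and configure the plugins a test requires automatically
+- Provide a rule-driven fake model service that never connects to a real LLM
+- Support single-turn and multi-turn conversation tests
+- Support result verification through assertions and scripts
+- Produce test artifacts that are stable, reproducible, and easy to debug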
 
-### Secondary Goals
+### 3.2 Secondary Goals
 
-- Support multiple OpenClaw versions for compatibility testing
-- Support multi-unit scenarios in a single test suite
-- Provide reusable example specs for plugin developers
-- Make local plugin integration testing fast enough for everyday development
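+- Support compatibility testing across different OpenClaw versions when needed
+- Support collaboration tests between multiple OpenClaw entities
+- Provide reusable example test configurations for plugin developers
+- Keep everyday local plugin integration testing lightweight and fast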
 
-### Non-Goals for V1
+### 3.3 Non-Goals (at least for V1)
 
-- Full UI dashboard
-- Large-scale distributed execution
-- Performance benchmarking
-- Fuzzing or random conversation generation
-- Automatic support for every possible provider protocol
-- Replacing unit tests inside plugin repositories
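+- Replacing unit tests inside plugin repositories
+- A graphical management UI from day one
+- Large-scale distributed or concurrent execution from day one
+- Performance benchmarking, stress testing, or fuzzing from day one
+- Emulating every protocol detail of every model provider from day one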
 
 ---
 
-## 4. Product Positioning
+## 4. Product Positioning
 
-ClawSpec is **not** a generic unit test runner.
+ClawSpec is **not a general-purpose testing framework**, and it is **not a pure unit testing tool** either.
 
-It is a **scenario-driven integration testing framework** for validating the behavior of:
+A more accurate description is:
 
-- OpenClaw runtime
-- installed plugins
-- model interaction boundaries
-- tool-calling flows
-- message outputs
-- side effects exposed through plugins or configured backends
+> A scenario-driven, deterministic integration testing framework for the OpenClaw plugin ecosystem.
 
-The core value is deterministic validation of complex runtime behavior.
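+It focuses on the combined behavior of:
+
+- the OpenClaw runtime itself
+- installed plugins
+- model request and response boundaries
+- tool-calling flows
+- message outputs
+- observable side effects that plugins produce on external systems
+
+So what it tests is not whether some function returns the right value, but:
+
+> When OpenClaw, plugins, model interaction, and tool-calling flows all work together, does the end result match expectations?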
 
 ---
 
-## 5. High-Level Architecture
+## 5. Core Design Principles
 
-ClawSpec is expected to contain four major components.
+### 5.1 Determinism First
 
-### 5.1 Spec Loader
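+Automated tests should avoid depending on the randomness of a real LLM wherever possible. The fake model service must be a controlled variable in every test run.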
 
-Responsible for:
+### 5.2 Runtime Realism
 
-- reading the project spec file
-- validating structure against schema
-- normalizing runtime definitions
-- producing an execution plan for the runner
+Tests should run against something close to a real OpenClaw runtime, not just mocked plugin internals.
 
-### 5.2 Environment Orchestrator
+### 5.3 Declarative Configuration
 
-Responsible for:
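+Test environments, test inputs, model rules, and assertion conditions should be described in configuration wherever possible rather than scattered across scripts.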
 
-- provisioning OpenClaw test units
-- generating or managing Docker Compose definitions
-- preparing workspace directories and mounted volumes
-- installing plugins
-- applying plugin configuration
-- starting and stopping test environments
+### 5.4 Simple by Default, Extensible on Demand
 
-### 5.3 Fake Model Service
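+Most plugin tests need only one claw unit, so single-unit testing should be the default path. Multi-unit collaboration should be supported, but it must not make single-unit tests more complicated.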
 
-Responsible for:
+### 5.5 Debuggability
 
-- exposing a model-compatible endpoint for OpenClaw
-- receiving model requests from test units
-- matching incoming requests against declarative rules
-- returning deterministic text responses and/or tool-calling instructions
-- logging interactions for debugging and verification
-
-### 5.4 Test Runner
-
-Responsible for:
-
-- selecting target test cases
-- injecting input events/messages
-- collecting outputs, logs, and tool-call traces
-- evaluating built-in assertions
-- executing optional verifier scripts
-- producing final pass/fail results and artifacts
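+Every test run should leave behind enough artifacts for troubleshooting, such as logs, rule-match records, tool-call traces, and verifier script output.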
 
 ---
 
-## 6. Core Design Principles
+## 6. The Role of the claw unit
 
-### Determinism First
+One key concept needs to be made explicit here:
 
-The framework should avoid real LLM randomness in automated tests.
+`clawUnits` exists **primarily not to test multiple versions or configurations at once**, but to support cases where:
 
-### Runtime Realism
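+- one plugin has to act on multiple OpenClaw entities
+- plugin behavior depends on several agents or runtimes collaborating
+- the test scenario itself requires multiple OpenClaw nodes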
 
-Tests should run against realistic OpenClaw environments, not only mocked plugin internals.
+For example:
 
-### Declarative Configuration
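+- entity A sends a message and entity B responds
+- a plugin observes the behavior of another entity
+- plugins interact with each other in a multi-agent collaboration scenario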
 
-Test environments and cases should be defined in configuration files rather than hard-coded scripts.
+Therefore:
 
-### Extensible Verification
-
-Built-in assertions should cover common cases, while custom scripts should support project-specific validation.
-
-### Reproducible Artifacts
-
-All important outputs should be captured for debugging, including logs, matched model rules, tool-call traces, and verifier results.
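+- **the vast majority of plugin tests should need only one `clawUnit`**
+- `multi-unit` exists for collaboration testing, not for cramming every variation into one scenario
+- version compatibility or configuration differences are better covered by running several separate test scenarios than by relying on multiple units running at once by default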
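+
+As a rough illustration, a collaboration scenario might declare two units and point a test case at one of them. This is only a sketch under stated assumptions: `unitId`, `image`, and the image tag are hypothetical names used for illustration, not a settled schema, and `timeout` is assumed to be in milliseconds. `targetUnitId` and `testId` follow the draft spec in this plan.
+
+```json
+{
+  "clawUnits": [
+    { "unitId": "sender-agent", "image": "openclaw:example" },
+    { "unitId": "receiver-agent", "image": "openclaw:example" }
+  ],
+  "testCases": [
+    { "testId": "relay-basic", "targetUnitId": "sender-agent", "timeout": 15000 }
+  ]
+}
+```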
 
 ---
 
-## 7. Proposed Spec Structure
+## 7. Overall Architecture
 
-The initial idea is to define a single spec file that describes:
+ClawSpec can be split into four core modules.
 
-- OpenClaw runtime units
-- plugin installation and configuration
-- test cases
-- fake model behavior
-- expected validation steps
+### 7.1 Spec Loader
 
-A normalized V1 structure may look like this:
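+Responsible for:
+
+- reading the test configuration file
+- validating that the configuration structure is well-formed
+- normalizing fields
+- producing an internal execution plan
+
+### 7.2 Environment Orchestrator
+
+Responsible for:
+
+- creating the OpenClaw test environment
+- generating and managing Docker Compose definitions
+- preparing working, mounted, and artifact directories
+- installing plugins
+- writing plugin configuration
+- starting and tearing down the test environment
+
+### 7.3 Fake Model Service
+
+Responsible for:
+
+- exposing a model endpoint that OpenClaw can call
+- receiving model requests sent by OpenClaw
+- deciding what to return based on the configured rules
+- generating stable text replies or tool calls
+- calling `test-finished` when the termination condition is met
+- recording requests, rule matches, and outputs
+
+### 7.4 Test Runner
+
+Responsible for:
+
+- selecting test cases
+- injecting input events or test messages
+- collecting output messages, tool calls, logs, and events
+- deciding when a test has ended
+- executing assertions and external verifier scripts
+- aggregating results and producing a report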
+
+---
+
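+## 8. Test Configuration Approach
+
+ClawSpec needs a single, unified configuration file that describes:
+
+- the test environment
+- one or more claw units
+- the OpenClaw version or image each unit uses
+- the plugins to install and their configuration
+- test inputs
+- fake model behavior rules
+- assertions and verifier scripts
+
+A proposed configuration structure looks like this: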
 
 ```json
 {
@@ -187,6 +233,7 @@ A normalized V1 structure may look like this:
     {
       "testId": "calendar-reminder-basic",
       "targetUnitId": "calendar-agent",
+      "timeout": 15000,
       "input": {
         "channel": "discord",
         "chatType": "direct",
@@ -204,6 +251,16 @@ A normalized V1 structure may look like this:
             },
             "text": "已经帮你记下了"
           }
+        },
+        {
+          "receive": ".*已经帮你记下了.*",
+          "action": {
+            "type": "tool_call",
+            "toolName": "test-finished",
+            "toolParameters": {
+              "reason": "expected final response observed"
+            }
+          }
         }
       ],
       "expected": [
@@ -225,89 +282,176 @@ A normalized V1 structure may look like this:
 }
 ```
 
-This is only a starting point. The exact schema should be refined during implementation.
+This is only a directional draft; it still needs to be refined into a formal schema.
 
 ---
 
-## 8. Fake Model Service Design
+## 9. Fake Model Service Design
 
-The fake model service is one of the most important parts of ClawSpec.
+The fake model service is one of ClawSpec's key capabilities.
 
-It should behave like a deterministic model backend that OpenClaw can call during tests.
+Without it, tests would depend on a real model, which causes the following problems:
 
-### Responsibilities
+- unstable results
+- unpredictable tool-calling behavior
+- higher cost
+- failures that are hard to reproduce
 
-- receive model requests from OpenClaw
-- inspect request content and context
-- match a rule set in declared order
-- return predefined outputs
-- support text-only responses
-- support tool-calling responses
-- support tool-call plus final text patterns
-- emit logs showing which rule matched and what output was generated
+So the fake model in ClawSpec should not simply return an arbitrary sentence; it should be a **rule-driven test model**.
 
-### Why This Matters
+### 9.1 Responsibilities
 
-Without this service, tests would depend on live model providers, causing:
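+- receive model requests sent by OpenClaw
+- load the rule set for the current test case
+- decide the response action by rule order or match priority
+- return a text reply, a tool call, or a combination of both
+- trigger `test-finished` when the termination condition is met
+- emit detailed logs recording each matched rule and the action returned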
 
-- unstable results
-- variable tool-calling behavior
-- token costs
-- difficult reproduction of failures
+### 9.2 Supported Rule Actions
 
-The fake model service turns model behavior into a controlled part of the test spec.
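+At minimum, rules should support:
+
+- a plain text reply
+- a tool call
+- a tool call followed by a text reply
+- explicitly ending the test (calling `test-finished`)
+- a default behavior when nothing matches (for example, returning an empty result, raising an error, or logging the miss)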
+
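+### 9.3 Multi-Turn Conversation Support
+
+This also needs to be stated clearly:
+
+> Multi-turn conversation is not inherently hard, and there is no need to design it as a complex state machine from the start.
+
+A set of rules that is practical and easy to implement:
+
+- every test case must define a `timeout`
+- the `modelRules` of every test case must include a **test-termination rule**
+- when the termination rule matches, the fake model calls the `test-finished` tool
+- as soon as `test-finished` is observed, the test can move on to result checking
+- if `test-finished` is not observed within `timeout`, the runner should also stop waiting automatically and begin checking results
+
+The benefits:
+
+- multi-turn conversations are supported naturally
+- no complex state machine is needed up front
+- the termination condition is explicit
+- timeout handling is simple and uniform
+
+In other words, ClawSpec's multi-turn testing mechanism can initially be built on:
+
+- rule matching
+- the explicit end signal `test-finished`
+- a uniform timeout
+
+These three pieces are enough for a first version.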
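+
+A minimal sketch of a two-turn test case under these rules. The `receive` / `action` shape mirrors the draft spec earlier in this plan; the `create_reminder` tool, its parameters, and the regex patterns are hypothetical, and `timeout` is assumed here to be in milliseconds.
+
+```json
+{
+  "testId": "reminder-multi-turn",
+  "timeout": 15000,
+  "modelRules": [
+    {
+      "receive": ".*remind me.*",
+      "action": {
+        "type": "tool_call",
+        "toolName": "create_reminder",
+        "toolParameters": { "title": "standup" }
+      }
+    },
+    {
+      "receive": ".*reminder created.*",
+      "action": {
+        "type": "tool_call",
+        "toolName": "test-finished",
+        "toolParameters": { "reason": "reminder flow completed" }
+      }
+    }
+  ]
+}
+```
+
+The first rule drives one round of tool calling; the second is the termination rule. If neither pattern ever matches, the test still ends when `timeout` expires.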
 
 ---
 
-## 9. Verification Model
+## 10. Test Termination Mechanism
 
-ClawSpec should support two layers of verification.
+Every test case in ClawSpec should have two ways of ending:
 
-### 9.1 Built-in Assertions
+### 10.1 Explicit Termination
 
-Common assertions should be supported directly by the framework, such as:
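+The fake model makes the tool call:
+
+- `test-finished`
+
+to signal that the test has reached its expected end point.
+
+This means:
+
+- the test logic has reached its designed termination condition
+- the runner can stop waiting for further rounds of interaction
+- assertions and verifier scripts can start running
+
+### 10.2 Timeout Termination
+
+Every test case must provide a `timeout` field.
+
+If `test-finished` is not observed before the timeout expires:
+
+- the runner stops waiting
+- the logs, messages, and tool calls collected so far become the input for result evaluation
+- assertions and verifier scripts start running automatically
+
+The point of this is to:
+
+- prevent tests from hanging indefinitely
+- still allow results to be evaluated in scenarios that do not require an explicit termination tool call
+- give developers a single, uniform entry point for diagnosing failures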
+
+---
+
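+## 11. Verification Model
+
+ClawSpec needs to support at least two layers of verification.
+
+### 11.1 Built-in Assertions
+
+Suited to frequent, common cases, for example: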
 
 - `message_contains`
 - `message_equals`
 - `tool_called`
 - `tool_called_with`
-- `exit_code`
 - `log_contains`
+- `exit_code`
 
-### 9.2 External Verifier Scripts
+The framework executes these assertions directly; they cover most common plugin tests.
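+
+As a hedged sketch, built-in assertions could appear in a test case's `expected` array roughly as follows. The assertion names come from the list above; the entry fields (`type`, `value`, `toolName`) and the `create_reminder` tool are hypothetical and would need to be settled in the formal schema.
+
+```json
+{
+  "expected": [
+    { "type": "message_contains", "value": "reminder" },
+    { "type": "tool_called", "toolName": "create_reminder" }
+  ]
+}
+```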
 
-Custom verifier scripts should be supported for advanced cases, such as:
+### 11.2 External Verifier Scripts
 
-- checking database state
-- validating generated files
-- verifying HTTP side effects
-- checking plugin-specific external systems
+Suited to complex or highly project-specific checks, for example:
 
-This combination keeps common tests simple while preserving flexibility.
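+- whether database state is correct
+- whether a given file was generated
+- whether a given HTTP callback occurred
+- whether the state of an external service changed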
+
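+Such checks can be declared with a verifier entry like this:
+
+```json
+{
+  "verifier": {
+    "type": "script",
+    "path": "./verifiers/check-result.sh"
+  }
+}
+```
+
+The framework then runs the referenced script.
+
+Built-in assertions cover the common cases, while script verification preserves extensibility.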
 
 ---
 
-## 10. Execution Flow
+## 12. Execution Flow
 
-A typical ClawSpec test run should look like this:
+A typical ClawSpec test run should proceed as follows:
 
-1. Load and validate spec file
-2. Prepare workspace and artifact directories
-3. Materialize runtime environment definitions
-4. Start fake model service
-5. Start target OpenClaw unit(s)
-6. Install and configure required plugins
-7. Inject test input into the target unit
-8. Let the runtime interact with the fake model service
-9. Collect outputs, logs, tool traces, and events
-10. Evaluate expected assertions
-11. Run external verifier script if defined
-12. Produce final result summary and artifact bundle
-13. Tear down environment unless retention is requested
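+1. Load and validate the test configuration
+2. Prepare the workspace, artifacts, and other directories
+3. Generate runtime environment definitions from the configuration
+4. Start the fake model service
+5. Start the target OpenClaw unit(s)
+6. Install and configure the target plugins
+7. Inject test input into the target unit
+8. Let OpenClaw interact with the fake model for one or more turns
+9. Watch continuously for `test-finished`
+10. If the end signal is observed, proceed to result verification
+11. If the timeout expires, stop waiting and proceed to result verification
+12. Collect logs, messages, tool-call records, and event traces
+13. Evaluate built-in assertions
+14. Run external verifier scripts if any are defined
+15. Output pass/fail, a summary, and artifact paths
+16. Tear down the environment or keep it, as configured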
 
 ---
 
-## 11. Proposed Directory Structure
+## 13. Recommended Directory Structure
 
 ```text
 ClawSpec/
@@ -322,7 +466,7 @@ ClawSpec/
 │   └── clawspec.schema.json
 ├── examples/
 │   ├── basic.json
-│   └── calendar-plugin.json
+│   └── multi-unit.json
 ├── docker/
 │   ├── compose.template.yml
 │   └── fake-model.Dockerfile
@@ -342,170 +486,197 @@ ClawSpec/
 
 ---
 
-## 12. Recommended Tech Stack
+## 14. Recommended Tech Stack
 
-### Preferred Language
+### 14.1 Preferred Language
 
-**TypeScript / Node.js**
+**TypeScript / Node.js** is the preferred choice.
 
-Reasoning:
+Reasoning:
 
-- fits well with OpenClaw ecosystem conventions
-- convenient for JSON schema validation
-- good support for CLI tooling
-- convenient for HTTP fake service implementation
-- straightforward Docker and subprocess orchestration
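+- it is closer to the OpenClaw ecosystem
+- it handles JSON, schemas, and CLI tooling comfortably
+- implementing the fake model HTTP service is cheap
+- calling Docker, the OpenClaw CLI, and external scripts is convenient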
 
-### Suggested Libraries
+### 14.2 Suggested Libraries
 
-- `ajv` for schema validation
-- `commander` or `yargs` for CLI
-- `execa` for shell and Docker command orchestration
-- `fastify` or `express` for fake model service
-- `yaml` for optional YAML support in the future
-- `vitest` for ClawSpec self-tests
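+- `ajv`: schema validation
+- `commander` or `yargs`: CLI
+- `execa`: subprocess and command orchestration
+- `fastify` or `express`: fake model service
+- `yaml`: YAML configuration support in the future
+- `vitest`: tests for the framework itself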
 
 ---
 
-## 13. V1 Scope
+## 15. Proposed V1 Scope
 
-The first version should focus on the smallest useful end-to-end workflow.
+The first version should close the smallest end-to-end loop first rather than expand prematurely.
 
-### V1 Must Include
+### 15.1 V1 Must Include
 
-- load one spec file
-- validate the basic schema
-- start one OpenClaw test unit
-- install one or more plugins in that unit
-- apply plugin configuration entries
-- start one fake model service
-- inject one test input
-- support rule-based text responses
-- support rule-based tool-calling responses
-- support basic assertions:
-  - message contains
-  - tool called
-  - script verifier
-- generate logs and a pass/fail summary
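+- load one spec file
+- validate the basic schema
+- start one OpenClaw test unit
+- install one or more plugins in that unit
+- apply plugin configuration entries
+- start one fake model service
+- inject one test input
+- support single-turn and multi-turn interaction
+- require every test case to include a `timeout`
+- require every test case to define an explicit termination rule that ends the test through `test-finished`
+- support rule-driven text replies
+- support rule-driven tool calls
+- support basic assertions:
+  - `message_contains`
+  - `tool_called`
+  - external script verifier
+- output logs and a pass/fail summary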
 
-### V1 Should Avoid
+### 15.2 V1 Can Defer
 
-- complex multi-turn state machines
-- distributed execution
-- UI dashboard
-- performance benchmarking
-- broad provider emulation beyond test needs
-- advanced matrix test expansion
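+- graphical UI
+- distributed execution
+- performance testing
+- advanced provider protocol emulation
+- complex matrix test expansion
+- overly complex conversation state machines
+
+In other words, V1 only needs to get this chain working end to end:
+
+- the environment comes up
+- the plugins are installed
+- the fake model replies according to rules
+- the test can end
+- the results can be verified
+
+Once that works, V1 already delivers real value.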
 
 ---
 
-## 14. Milestone Proposal
+## 16. Proposed Milestones
 
-### Milestone 0 - Project Bootstrap
+### Milestone 0: Project Bootstrap
 
-- initialize repository layout
-- define coding conventions
-- write initial README and project plan
-- select runtime and libraries
+- initialize the repository layout
+- set up the basic project scaffolding
+- write the README and project plan
+- settle on the tech stack
 
-### Milestone 1 - Spec Definition
+### Milestone 1: Spec Definition
 
-- draft spec schema v0.1
-- implement spec parser and validation
-- add example specs
+- settle the spec field design
+- write `docs/spec-schema.md`
+- write the basic schema file
+- provide a minimal example
 
-### Milestone 2 - Fake Model Service
+### Milestone 2: Fake Model Service
 
-- define internal rule format
-- implement rule matcher
-- implement deterministic response generation
-- add request/response logging
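+- define the rule structure
+- implement rule matching
+- implement text and tool-call responses
+- implement the `test-finished` termination mechanism
+- implement request and rule-match logging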
 
-### Milestone 3 - Environment Orchestrator
+### Milestone 3: Environment Orchestration
 
-- generate runtime environment configuration
-- start and stop OpenClaw containers
-- apply plugin install commands
-- apply plugin configs
+- start OpenClaw containers or runtime instances
+- install plugins
+- write plugin configuration
+- manage the test lifecycle
 
-### Milestone 4 - Test Runner
+### Milestone 4: Test Runner
 
-- inject test inputs
-- collect runtime outputs
-- evaluate assertions
-- execute verifier scripts
-- output structured test reports
+- inject test inputs
+- collect outputs and tool-call traces
+- handle timeouts and termination rules
+- execute assertions and verifier scripts
+- output reports
 
-### Milestone 5 - First Real Plugin Demo
+### Milestone 5: Real Plugin Validation
 
-- create an example test suite for a real OpenClaw plugin
-- validate the full workflow end to end
-- document limitations and next steps
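+- pick a real OpenClaw plugin as an end-to-end example
+- check whether the framework design is sufficient for real-world needs
+- document limitations and next steps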
 
 ---
 
-## 15. Risks and Open Questions
+## 17. Risks and Open Questions
 
-### Runtime Interface Risk
+### 17.1 OpenClaw Model Integration Interface
 
-The exact model-provider interface expected by OpenClaw must be verified early. The fake model service depends on matching this contract well enough for tests.
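+For the fake model service to work, the interface contract that OpenClaw uses when calling a model must be well understood. This has to be verified as early as possible.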
 
-### Plugin Installation Variability
+### 17.2 Plugin Installation Variability
 
-Different plugins may require different setup flows. ClawSpec must decide how much it standardizes versus how much it leaves to custom setup hooks.
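+Different plugins may need different installation methods, initialization steps, configuration entries, and extra dependencies. ClawSpec has to decide which of these become general capabilities and which are left to setup hooks or scripts.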
 
-### Observable Output Boundaries
+### 17.3 Observable Result Boundaries
 
-Some plugins expose behavior through logs, some through tool calls, some through external HTTP effects. The framework must define what counts as the authoritative observable result.
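+Some plugins surface their results in messages, some in tool calls, and some as side effects on external systems. The framework needs to define clearly what counts as the source of a result.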
 
-### Docker/Image Strategy
+### 17.4 Docker / Image Strategy
 
-The project needs a clear policy for:
+The following need to be decided:
 
-- official base images
-- local image overrides
-- plugin source mounting during local development
-- OpenClaw version pinning
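+- how base images are chosen
+- how OpenClaw versions are managed
+- how plugin source is mounted during local development
+- how to balance fast iteration against environment stability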
 
-### Test Case Reuse
+### 17.5 Rule Expressiveness
 
-It may be useful later to split infra definitions, model rules, and assertions into reusable modules rather than keeping everything in one file.
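+If rules are too simple, complex plugins cannot be tested; if they are too complex, configurations become hard to write. A balance between expressiveness and maintainability is needed.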
 
 ---
 
-## 16. Success Criteria
+## 18. Success Criteria
 
-ClawSpec can be considered successful in its first phase if:
+ClawSpec can be considered successful in its first phase if it meets the following conditions:
 
-- a plugin developer can define a test spec without writing framework code
-- a test run is reproducible across machines with the same environment
-- plugin integration behavior can be validated without a real LLM
-- failed runs produce enough artifacts to diagnose the issue quickly
-- at least one real plugin can be tested end-to-end using the framework
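+- a plugin developer can describe a test scenario without writing framework code
+- the same configuration yields the same results when repeated in the same environment
+- plugin integration behavior can be verified without depending on a real LLM
+- failed tests leave behind enough artifacts to locate the problem
+- at least one real plugin can complete end-to-end automated testing with ClawSpec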
 
 ---
 
-## 17. Next Recommended Deliverables
+## 19. Next Recommended Deliverables
 
-After this plan, the next most useful documents are:
+After this project plan, the documents most worth writing next are:
 
-1. `README.md` — concise positioning and quick start
-2. `docs/spec-schema.md` — formalize the spec design
-3. `schema/clawspec.schema.json` — machine-validatable V0 schema
-4. `docs/fake-model.md` — define fake model request/response behavior
-5. `TASKLIST.md` or milestone tracker — implementation breakdown
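+1. `README.md`
+   - explain concisely what the project is, what problem it solves, and how to get started quickly
+
+2. `docs/spec-schema.md`
+   - write out the configuration structure formally
+   - specify fields and conventions such as `clawUnits`, `testCases`, `modelRules`, `timeout`, and `test-finished`
+
+3. `schema/clawspec.schema.json`
+   - provide a machine-readable schema that can be used directly for validation
+
+4. `docs/fake-model.md`
+   - specify the fake model service's input/output protocol, rule-matching behavior, and termination mechanism
+
+5. `TASKLIST.md`
+   - break the milestones down into actionable tasks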
 
 ---
 
-## 18. Summary
+## 20. Summary
 
-ClawSpec should become a deterministic integration testing framework for OpenClaw plugins.
+ClawSpec's goal can be summed up in one sentence:
 
-Its core innovation is simple:
+> Run stable, repeatable automated integration tests for plugins in a real OpenClaw runtime, with a rule-driven fake model in place of a real LLM.
 
-- run real OpenClaw runtime environments
-- replace real LLM behavior with a rule-driven fake model service
-- execute declarative test cases
-- verify runtime behavior with stable, repeatable assertions
+它的关键价值在于：
 
-If implemented well, ClawSpec can become the standard foundation for plugin-level automated testing in the OpenClaw ecosystem.
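+- testing integration behavior against a real runtime
+- eliminating LLM randomness with a fake model
+- unifying test scenarios through declarative configuration
+- solving the problem of closing out multi-turn tests with the explicit termination rule `test-finished` plus `timeout`
+- balancing generality and extensibility with assertions and script verification
+
+If this framework is built well, it will be well suited to serve as foundational testing infrastructure for OpenClaw plugin development.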
\ No newline at end of file