Changelog

Human-in-the-Loop Integration

Runtime: LocalRuntime now routes human agents (agent.human == True) to registered HumanInterface implementations instead of the LLM.
New: GitHubHuman — GitHub comment-based human interface (post prompt, poll for reply).
New: DashboardHuman — WebSocket-based human interface for the dashboard.
New: GitHubCommentWatcher — background task that polls GitHub comments and injects them as events into a running society.
EventBus: Added inject() method for external event injection (async human feedback).
Demo: github_live.py now supports --human flag for human maintainer oversight.
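
The poll-and-inject flow described above can be pictured with a small asyncio sketch. Everything in it is hypothetical (`SketchBus`, `poll_comments`, the `fetch` callable, the event dictionary shape); it is not the claw `EventBus` or `GitHubCommentWatcher` API, only an illustration of a background task feeding external feedback into a running loop.

```python
import asyncio
from typing import Awaitable, Callable


class SketchBus:
    """Hypothetical stand-in for an event bus; not the claw EventBus."""

    def __init__(self) -> None:
        self._queue = asyncio.Queue()

    def inject(self, event: dict) -> None:
        # External producers (e.g. a comment poller) push events here.
        self._queue.put_nowait(event)

    async def next_event(self) -> dict:
        return await self._queue.get()


async def poll_comments(
    bus: SketchBus,
    fetch: Callable[[], Awaitable[list[str]]],
    interval: float,
    rounds: int,
) -> None:
    """Poll `fetch` a fixed number of times and inject unseen comments."""
    seen: set[str] = set()
    for _ in range(rounds):
        for body in await fetch():
            if body not in seen:
                seen.add(body)
                bus.inject({"type": "human_feedback", "body": body})
        await asyncio.sleep(interval)


async def demo() -> list[str]:
    bus = SketchBus()
    batches = [["looks good"], ["looks good", "please add tests"]]

    async def fake_fetch() -> list[str]:
        # Stand-in for a real GitHub API call (assumption for illustration).
        return batches.pop(0) if batches else []

    watcher = asyncio.create_task(poll_comments(bus, fake_fetch, 0.0, 2))
    first = await bus.next_event()
    second = await bus.next_event()
    await watcher
    return [first["body"], second["body"]]
```

The `seen` set is what keeps a comment returned by two consecutive polls from being injected twice; a real watcher would more likely track comment IDs than bodies.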

Phase 9c — Resilience & Production Hardening

Circuit breaker: CircuitBreaker trips after N consecutive LLM failures, fast-failing subsequent calls with CircuitBreakerOpen instead of burning through retries. Opt-in via LocalRuntime(circuit_breaker=CircuitBreaker(max_failures=5)).
Budget rollback: ArtifactStore gains checkpoint() and rollback(). When SocietyConfig(on_budget_exceeded=BudgetExceededPolicy.ROLLBACK) is set, artifacts revert to their pre-run state when the budget is exceeded.
Agent timeout enforcement: Watchdog integration in the drain loop. When LocalRuntime(watchdog=Watchdog(default_timeout=timedelta(seconds=30))) is set, agent turns that exceed the timeout produce a timeout event instead of hanging.
Structured logging: RunContext generates unique correlation IDs per society run. StructuredLogFormatter outputs JSON log lines with run_id, agent, event_type fields. SocietyResult.run_id tracks the correlation ID.
Society deserialization: Society.from_dict() and Society.from_json() reconstruct societies from serialized form (inverse of to_dict() / to_json()). Full roundtrip support for all edge types, group edges, resolve strategies, and config.
Resilience tests: Edge-case coverage for malformed LLM responses, concurrent artifact writes, LLM failures mid-run, zero-budget termination, empty societies, and disconnected agents.
Example acceptance tests: Parametrized subprocess tests verify all example scripts (pr_review.py, competitive_codegen.py, research_pipeline.py, dev_team.py) exit cleanly.
Exports: Added CircuitBreaker, CircuitBreakerOpen, RunContext, StructuredLogFormatter to claw.__init__.
Tests: 61 new tests (634 total).
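
For readers unfamiliar with the circuit-breaker pattern mentioned above, here is a minimal standalone sketch. `SketchBreaker` and `BreakerOpen` are hypothetical names chosen for illustration; the actual `CircuitBreaker` in claw may differ (for example, it may recover after a cool-down, which this sketch omits).

```python
class BreakerOpen(RuntimeError):
    """Raised when the sketch breaker refuses a call (hypothetical name)."""


class SketchBreaker:
    """Trips after `max_failures` consecutive failures; a success resets it."""

    def __init__(self, max_failures: int = 5) -> None:
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.max_failures

    def call(self, fn, *args, **kwargs):
        if self.is_open:
            # Fast-fail: once tripped, fn is not invoked at all.
            raise BreakerOpen(f"open after {self.consecutive_failures} failures")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0
        return result


def demo():
    attempts = []

    def flaky():
        # Stand-in for an LLM call that always fails.
        attempts.append(1)
        raise TimeoutError("LLM unavailable")

    breaker = SketchBreaker(max_failures=3)
    outcomes = []
    for _ in range(5):
        try:
            breaker.call(flaky)
        except BreakerOpen:
            outcomes.append("fast-fail")
        except TimeoutError:
            outcomes.append("failed")
    return outcomes, len(attempts)
```

In `demo()`, the first three calls reach the failing function and the last two are rejected without invoking it, which is the "fast-failing instead of burning through retries" behavior the entry describes.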

Phase 9b — CLI + Documentation + Real-LLM Tests

CLI entrypoint: claw run <script.py> executes society scripts (sync and async main()). claw run --dry-run sets CLAW_DRY_RUN=1 env var. claw init <name> scaffolds a new project with society.py and pyproject.toml. Registered via [project.scripts] in pyproject.toml.
Documentation guides (7 new, 8 total):
quickstart.md — 5-minute getting started with a complete runnable example
agents.md — agent creation, tools, human agents, validation
edges.md — all 5 edge types, group edges, resolution strategies, decision matrix
artifacts.md — built-in types, versioning, budget-aware reads, custom artifacts
societies.md — building, configuration, graph queries, serialization, common patterns
runtime.md — execution model, LocalRuntime, ReAct loop, observer, persistence
llm-backends.md — LiteLLM multi-provider, MockLLM testing, prompt structure, multi-turn
Real-LLM smoke tests: Anthropic Claude tests (test_smoke_anthropic.py) and a Gemini ReAct loop test. All are gated behind API-key environment variables and marked @pytest.mark.slow.
Tests: 12 new tests (573 total).
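
The command shape described above (`run` with an optional `--dry-run`, and `init`) could be wired up roughly as follows. This is an argparse sketch under stated assumptions, not claw's actual CLI code: `build_parser` and `child_env` are hypothetical helpers, and only the command names and the `CLAW_DRY_RUN=1` variable come from the changelog.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="claw")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="execute a society script")
    run.add_argument("script")
    run.add_argument("--dry-run", action="store_true")

    init = sub.add_parser("init", help="scaffold a new project")
    init.add_argument("name")
    return parser


def child_env(args: argparse.Namespace, base: dict[str, str]) -> dict[str, str]:
    """Return the environment a script would run under (copy, not in place)."""
    env = dict(base)
    if args.command == "run" and args.dry_run:
        env["CLAW_DRY_RUN"] = "1"
    return env


args = build_parser().parse_args(["run", "society.py", "--dry-run"])
env = child_env(args, {})
```

A society script can then branch on `os.environ.get("CLAW_DRY_RUN") == "1"` to skip external side effects.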

Phase 9a — ReAct Tool Execution Loop

Multi-turn conversation: Message dataclass, ToolCall.id field, Prompt.messages for conversation history within agent turns.
LiteLLM multi-turn: _build_messages() includes assistant/tool messages in OpenAI format. _parse_response() captures tool call IDs.
Tool call classification: classify_tool_calls() separates executable tools (file_edit, shell_exec, github) from claw actions (emit_event, write_artifact).
ReAct inner loop: LocalRuntime._execute_agent() enters a think→act→observe loop when tool_executor is provided. Agents can execute tools, see results, and iterate up to max_tool_rounds (default 10).
Budget-accurate counting: Each LLM call in the ReAct loop counts toward max_llm_calls via AgentOutput.llm_call_count.
Backward compatible: Without tool_executor, runtime behaves exactly as before.
Tests: 42 new tests (561 total).
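
The think→act→observe loop and its call counting can be sketched in a few lines. This is an illustrative model only: `Reply`, `react_loop`, and the tuple-based message list are hypothetical, and the real `LocalRuntime._execute_agent()` works with claw's own `Message`, `ToolCall`, and `AgentOutput` types.

```python
from dataclasses import dataclass, field


@dataclass
class Reply:
    """One scripted model reply: either tool calls or a final answer."""
    text: str = ""
    tool_calls: list[tuple[str, str]] = field(default_factory=list)  # (tool, argument)


def react_loop(llm, tools, task, max_tool_rounds=10):
    """Run think -> act -> observe until the model stops calling tools."""
    messages = [("user", task)]
    llm_call_count = 0
    for _ in range(max_tool_rounds):
        reply = llm(messages)  # think
        llm_call_count += 1    # every LLM call counts toward the budget
        if not reply.tool_calls:
            return reply.text, llm_call_count
        messages.append(("assistant", reply.text))
        for name, arg in reply.tool_calls:
            result = tools[name](arg)                       # act
            messages.append(("tool", f"{name}: {result}"))  # observe
    return "stopped: max_tool_rounds reached", llm_call_count


def scripted_llm(replies):
    """Return a fake LLM that yields the given replies in order."""
    it = iter(replies)
    return lambda messages: next(it)


tools = {"shell_exec": lambda cmd: "3 passed"}
llm = scripted_llm([
    Reply(tool_calls=[("shell_exec", "pytest -q")]),
    Reply(text="tests pass"),
])
answer, calls = react_loop(llm, tools, "run the tests")
```

Here `answer` is `"tests pass"` and `calls` is `2`: one call that requested a tool and one that produced the final answer, which is why a single agent turn can consume several units of `max_llm_calls`.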

Phase 8 — GitHub Integration & Demo

claw.tools.FileEditTool: Sandbox-scoped file editing with read/write/patch/delete/list operations. Absolute paths and .. escapes are rejected. Writes are auto-staged with git add.
claw.tools.ShellExecTool: Strict allowlist-only command execution via asyncio.subprocess. No shell (no sh -c), preventing injection. Configurable timeout (SIGTERM→SIGKILL) and output cap (100KB).
claw.tools.GitHubTool: gh CLI wrapper for PR lifecycle: create, review, approve, request changes, merge. Also handles issues: get, comment, list. Full dry-run mode.
claw.tools.GitWorkflow: Git worktree management for agent isolation. Creates feature branches (claw/<issue>-<slug>), manages commits, push, PR creation, merge, and cleanup.
claw.tools.ToolExecutor: Middleware dispatching LLM tool calls to real tool implementations. Supports dry-run mode and tracks execution history.
claw.artifacts.CodeFileArtifact: File-backed artifact with git commits as versions. diff() uses git diff between commits.
claw.artifacts.GitHubIssueArtifact: Artifact backed by a GitHub issue via gh CLI. Reads issue data, posts comments, manages labels.
claw.artifacts.GitHubPRArtifact: Artifact backed by a GitHub PR. Supports approve, request changes, and merge lifecycle actions.
claw.triggers.GitHubPollingTrigger: Polls gh issue list on a schedule. Tracks seen issues and triggers callbacks for new ones.
claw.triggers.GitHubWebhookTrigger: FastAPI router for GitHub webhook payloads. HMAC signature verification. Maps issues.opened, issue_comment.created, and pull_request.opened to Claw events.
Demo: claw-demo/ — todo CLI app with pre-filed issues. examples/github_demo.py — investor demo script.
Safety guardrails: Allowlist-only shell execution, sandbox-scoped file access, dry-run mode for all external operations.
Tests: 127 new tests (519 total).
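
Two of the guardrails above, sandbox-scoped paths and allowlist-only commands, can be illustrated with the standard library alone. The helper names `resolve_in_sandbox` and `parse_allowlisted` are hypothetical; this is a sketch of the general technique, not the checks `FileEditTool` or `ShellExecTool` actually perform.

```python
import shlex
from pathlib import Path


def resolve_in_sandbox(sandbox: Path, user_path: str) -> Path:
    """Return the resolved path, or raise ValueError if it leaves the sandbox."""
    if Path(user_path).is_absolute():
        raise ValueError(f"absolute path rejected: {user_path}")
    root = sandbox.resolve()
    candidate = (root / user_path).resolve()  # collapses any ".." segments
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes sandbox: {user_path}")
    return candidate


def parse_allowlisted(command: str, allowlist: set[str]) -> list[str]:
    """Split a command into argv and reject programs not on the allowlist."""
    argv = shlex.split(command)
    if not argv or argv[0] not in allowlist:
        raise ValueError(f"command not allowed: {command!r}")
    return argv  # hand this list to a subprocess API, never to a shell


root = Path("sandbox_root")
inside = resolve_in_sandbox(root, "src/app.py")
argv = parse_allowlisted("pytest -q", {"pytest", "git"})
```

Because the command is passed on as an argument list rather than a shell string, an input such as `pytest; rm -rf /` is split into a first token of `pytest;`, which is not on the allowlist and is rejected.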

Phase 7 — Live Observer Dashboard

claw.server.WebSocketObserver: RuntimeObserver implementation that serializes lifecycle events to JSON and broadcasts to WebSocket clients. Ring buffer (default 500) replays events for late-joining clients.
claw.server.create_app(): FastAPI application with REST endpoints (/api/society, /api/artifacts, /api/trace), WebSocket endpoint (/ws), static file serving, and configurable CORS.
claw.server.serve(): One-call launcher that runs a society with the dashboard server concurrently via asyncio.gather(). Drop-in replacement for LocalRuntime.run().
React dashboard (dashboard/): Vite + React + TypeScript + TailwindCSS v4 + @xyflow/react. Graph visualization with agent status indicators, color-coded typed edges, chronological event feed with filtering, agent detail cards, artifact version history with diffs.
Example: examples/pr_review_dashboard.py — PR review society with live dashboard.
Guide: docs/guides/dashboard.md — usage, layout, dev workflow, API reference.
Tests: 26 new tests (392 total).
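
The ring-buffer replay for late-joining clients can be pictured with `collections.deque`. `ReplayBuffer` is a hypothetical class that uses plain lists as client inboxes; it does not reflect how `WebSocketObserver` is implemented, only the replay idea.

```python
from collections import deque


class ReplayBuffer:
    """Keep the most recent `capacity` events for late-joining clients."""

    def __init__(self, capacity: int = 500) -> None:
        self._events = deque(maxlen=capacity)
        self._clients: list[list[str]] = []

    def broadcast(self, event: str) -> None:
        self._events.append(event)  # the oldest entry is dropped when full
        for inbox in self._clients:
            inbox.append(event)

    def join(self) -> list[str]:
        """Register a client; its inbox starts with the buffered backlog."""
        inbox = list(self._events)
        self._clients.append(inbox)
        return inbox


buffer = ReplayBuffer(capacity=3)
for name in ["e1", "e2", "e3", "e4", "e5"]:
    buffer.broadcast(name)
late_client = buffer.join()   # sees only the last three events
buffer.broadcast("e6")        # live events continue after the replay
```

With a capacity of 3, the late client starts with `e3`, `e4`, `e5` and then receives `e6` live; events older than the buffer are not recoverable through replay.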

Phase 6 — E2E Use Cases

4 example scripts (examples/): pr_review, competitive_codegen, research_pipeline, dev_team
11 E2E integration tests across 5 test modules
Cross-edge event routing (_resolve_edge_id())
artifacts parameter on LocalRuntime.run()

Phase 5 — LLM & Human Backends

MockLLM with scripted responses and specificity-based matching
LiteLLMBackend wrapping litellm for all providers
MockHuman and CLIHuman for human-in-the-loop
App-level LLM retry (3 attempts, exponential backoff)
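
A retry policy of three attempts with exponential backoff might look like the sketch below. `retry_with_backoff` is a hypothetical helper, and the 0.5-second base delay and doubling factor are assumptions; the changelog states only the attempt count and the backoff style.

```python
import asyncio


async def retry_with_backoff(call, attempts=3, base_delay=0.5, sleep=asyncio.sleep):
    """Await `call()` up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await sleep(base_delay * 2 ** attempt)


delays = []
state = {"calls": 0}


async def fake_sleep(seconds):
    # Record the delay instead of actually waiting.
    delays.append(seconds)


async def flaky():
    # Fails twice with a transient error, then succeeds.
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient")
    return "ok"


result = asyncio.run(retry_with_backoff(flaky, sleep=fake_sleep))
```

Injecting `sleep` keeps the example fast and makes the backoff schedule observable: two failures produce delays of 0.5 and 1.0 seconds before the third attempt succeeds.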

Phase 4 — Runtime Engine

Event model with monotonic sequence IDs
EventBus with edge-routed delivery and batch independence detection
LocalRuntime: compile → seed → drain → return loop
EdgeResolutionTracker, Watchdog, EventLogger
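
The seed and drain steps, together with monotonic sequence IDs, can be modeled in a few lines. This is a deliberately simplified sketch: `Event`, `Sequencer`, and `drain` are hypothetical, handlers are plain functions keyed by target name, and edge routing and batching are left out.

```python
import itertools
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    seq: int
    target: str
    payload: str


class Sequencer:
    """Hand out strictly increasing sequence IDs."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)

    def make(self, target: str, payload: str) -> Event:
        return Event(next(self._counter), target, payload)


def drain(seed, handlers, sequencer):
    """Process events FIFO until the queue is empty; handlers may emit more."""
    queue = deque(seed)
    log = []
    while queue:
        event = queue.popleft()
        log.append(event)
        for target, payload in handlers[event.target](event):
            queue.append(sequencer.make(target, payload))
    return log


sequencer = Sequencer()
handlers = {
    "author": lambda event: [("reviewer", "please review")],
    "reviewer": lambda event: [],  # emits nothing, so the loop terminates
}
log = drain([sequencer.make("author", "start")], handlers, sequencer)
```

The run ends when a pass over the queue produces no new events, and the sequence IDs in `log` give a total order that can be used when replaying or tracing a run.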

Phase 3 — Compiler

Graph-to-prompt compilation (compile(), compile_for_event())
Per-agent system prompts, context filtering, tool binding

Phase 2 — Artifacts

StringArtifact, JsonArtifact with version history
Budget-aware reads, RFC 7386 merge-patch, diff support
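
For reference, the JSON Merge Patch algorithm from RFC 7386 is short enough to show in full. The function below is a standalone sketch of the algorithm as the RFC specifies it, not the code `JsonArtifact` uses; `merge_patch` is a name chosen for illustration.

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7386) and return the result."""
    if not isinstance(patch, dict):
        return patch  # non-object patches replace the target wholesale
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null means "delete this member"
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


original = {"a": "b", "c": {"d": "e", "f": "g"}}
patched = merge_patch(original, {"a": "z", "c": {"f": None}})
```

`patched` is `{"a": "z", "c": {"d": "e"}}`, and `original` is left unmodified. Two consequences of the format are worth remembering: arrays are always replaced as a whole, and a value cannot be set to null because null is reserved for deletion.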

Phase 1 — Scaffold & Data Models

Agent, Edge, GroupEdge, Society models
5 edge types: Cooperation, Competition, Oversight, Delegation, Coopetition
SocietyConfig with budget/timeout policies