Changelog

Human-in-the-Loop Integration

  • Runtime: LocalRuntime now routes human agents (agent.human == True) to registered HumanInterface implementations instead of the LLM.
  • New: GitHubHuman — GitHub comment-based human interface (post prompt, poll for reply).
  • New: DashboardHuman — WebSocket-based human interface for the dashboard.
  • New: GitHubCommentWatcher — background task that polls GitHub comments and injects them as events into a running society.
  • EventBus: Added inject() method for external event injection (async human feedback).
  • Demo: github_live.py now supports --human flag for human maintainer oversight.
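
The external-injection idea behind inject() can be illustrated with a toy bus. This is only a sketch under assumptions: SketchBus, SketchEvent, and drain() are hypothetical names invented here, and the real EventBus signature and routing are not shown in this changelog.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SketchEvent:
    """Hypothetical event shape; the real Event model may differ."""
    source: str
    payload: str


@dataclass
class SketchBus:
    """Toy bus: one asyncio queue fed by agents and by external producers."""
    _queue: asyncio.Queue = field(default_factory=asyncio.Queue)

    def inject(self, event: SketchEvent) -> None:
        # External producers (for example a comment watcher) enqueue without awaiting.
        self._queue.put_nowait(event)

    async def drain(self) -> list:
        # Pull everything currently queued, in arrival order.
        drained = []
        while not self._queue.empty():
            drained.append(self._queue.get_nowait())
        return drained


async def demo() -> list:
    bus = SketchBus()
    bus.inject(SketchEvent(source="agent:reviewer", payload="LGTM"))
    bus.inject(SketchEvent(source="human:github", payload="Please add tests"))
    return [f"{e.source}: {e.payload}" for e in await bus.drain()]


result = asyncio.run(demo())
```

The point of the design is that human feedback arrives asynchronously and joins the same ordered stream the runtime already drains.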

Phase 9c — Resilience & Production Hardening

  • Circuit breaker: CircuitBreaker trips after N consecutive LLM failures, fast-failing subsequent calls with CircuitBreakerOpen instead of burning through retries. Opt-in via LocalRuntime(circuit_breaker=CircuitBreaker(max_failures=5)).
  • Budget rollback: ArtifactStore.checkpoint() / rollback() support. When SocietyConfig(on_budget_exceeded=BudgetExceededPolicy.ROLLBACK) is set, artifacts revert to their pre-run state if the budget is exceeded.
  • Agent timeout enforcement: Watchdog integration in the drain loop. When LocalRuntime(watchdog=Watchdog(default_timeout=timedelta(seconds=30))) is set, agent turns that exceed the timeout produce a timeout event instead of hanging.
  • Structured logging: RunContext generates unique correlation IDs per society run. StructuredLogFormatter outputs JSON log lines with run_id, agent, event_type fields. SocietyResult.run_id tracks the correlation ID.
  • Society deserialization: Society.from_dict() and Society.from_json() reconstruct societies from serialized form (inverse of to_dict() / to_json()). Full roundtrip support for all edge types, group edges, resolve strategies, and config.
  • Resilience tests: Edge-case coverage for malformed LLM responses, concurrent artifact writes, LLM failures mid-run, zero-budget termination, empty societies, and disconnected agents.
  • Example acceptance tests: Parametrized subprocess tests verify all example scripts (pr_review.py, competitive_codegen.py, research_pipeline.py, dev_team.py) exit cleanly.
  • Exports: Added CircuitBreaker, CircuitBreakerOpen, RunContext, StructuredLogFormatter to claw.__init__.
  • Tests: 61 new tests (634 total).
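
The circuit-breaker behavior described above (trip after N consecutive failures, then fail fast) can be sketched in a few lines. SketchBreaker and SketchBreakerOpen are hypothetical stand-ins, not the library's CircuitBreaker; this version is synchronous and has no reset timer or half-open state, which a production breaker would likely need.

```python
class SketchBreakerOpen(Exception):
    """Raised when the breaker refuses a call (stand-in for CircuitBreakerOpen)."""


class SketchBreaker:
    def __init__(self, max_failures: int = 5) -> None:
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.max_failures

    def call(self, fn, *args, **kwargs):
        if self.is_open:
            # Fail fast: the wrapped function is not invoked at all.
            raise SketchBreakerOpen("too many consecutive failures")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            raise
        # Any success resets the streak.
        self.consecutive_failures = 0
        return result


attempts = []


def always_fails():
    attempts.append(1)
    raise RuntimeError("LLM unavailable")


breaker = SketchBreaker(max_failures=3)
outcomes = []
for _ in range(5):
    try:
        breaker.call(always_fails)
    except SketchBreakerOpen:
        outcomes.append("open")
    except RuntimeError:
        outcomes.append("failed")
```

After three real failures the remaining calls are rejected without reaching the failing function, which is what saves the retry budget.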

Phase 9b — CLI + Documentation + Real-LLM Tests

  • CLI entrypoint: claw run <script.py> executes society scripts (sync and async main()). claw run --dry-run sets the CLAW_DRY_RUN=1 environment variable. claw init <name> scaffolds a new project with society.py and pyproject.toml. Registered via [project.scripts] in pyproject.toml.
  • Documentation guides (7 new, 8 total):
      • quickstart.md: 5-minute getting started with a complete runnable example
      • agents.md: agent creation, tools, human agents, validation
      • edges.md: all 5 edge types, group edges, resolution strategies, decision matrix
      • artifacts.md: built-in types, versioning, budget-aware reads, custom artifacts
      • societies.md: building, configuration, graph queries, serialization, common patterns
      • runtime.md: execution model, LocalRuntime, ReAct loop, observer, persistence
      • llm-backends.md: LiteLLM multi-provider, MockLLM testing, prompt structure, multi-turn
  • Real-LLM smoke tests: Anthropic Claude tests (test_smoke_anthropic.py), Gemini ReAct loop test. All gated behind API key env vars, marked @pytest.mark.slow.
  • Tests: 12 new tests (573 total).
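
Since --dry-run is communicated through the CLAW_DRY_RUN environment variable, a society script might check it as follows. This is a sketch only: is_dry_run() and the returned strings are hypothetical, and the changelog does not say how scripts are expected to read the flag.

```python
import os


def is_dry_run(environ=None) -> bool:
    # Accept an explicit mapping so the check is easy to test.
    env = os.environ if environ is None else environ
    return env.get("CLAW_DRY_RUN") == "1"


def main() -> str:
    if is_dry_run():
        return "dry run: skipping external calls"
    return "live run"
```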

Phase 9a — ReAct Tool Execution Loop

  • Multi-turn conversation: Message dataclass, ToolCall.id field, Prompt.messages for conversation history within agent turns.
  • LiteLLM multi-turn: _build_messages() includes assistant/tool messages in OpenAI format. _parse_response() captures tool call IDs.
  • Tool call classification: classify_tool_calls() separates executable tools (file_edit, shell_exec, github) from claw actions (emit_event, write_artifact).
  • ReAct inner loop: LocalRuntime._execute_agent() enters a think→act→observe loop when tool_executor is provided. Agents can execute tools, see results, and iterate up to max_tool_rounds (default 10).
  • Budget-accurate counting: Each LLM call in the ReAct loop counts toward max_llm_calls via AgentOutput.llm_call_count.
  • Backward compatible: Without tool_executor, runtime behaves exactly as before.
  • Tests: 42 new tests (561 total).
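
The think, act, observe cycle and its call counting can be sketched with a scripted stand-in for the LLM. Everything here (SketchReply, react_loop, the tuple-based history) is hypothetical and much simpler than LocalRuntime._execute_agent(); only the max_tool_rounds default of 10 and the idea of counting every LLM call come from the notes above.

```python
from dataclasses import dataclass, field


@dataclass
class SketchReply:
    text: str
    tool_calls: list = field(default_factory=list)  # (tool_name, argument) pairs


def react_loop(llm, tools, task, max_tool_rounds=10):
    """Return (final_text, llm_call_count)."""
    history = [("user", task)]
    llm_call_count = 0
    for _ in range(max_tool_rounds):
        reply = llm(history)          # think
        llm_call_count += 1           # every call counts toward the budget
        if not reply.tool_calls:
            return reply.text, llm_call_count
        history.append(("assistant", reply.text))
        for name, arg in reply.tool_calls:
            observation = tools[name](arg)                        # act
            history.append(("tool", f"{name} -> {observation}"))  # observe
    return "stopped: max_tool_rounds reached", llm_call_count


def scripted_llm(history):
    # Ask for a tool first; answer once a tool result is visible.
    if any(role == "tool" for role, _ in history):
        return SketchReply(text="The file has 3 lines.")
    return SketchReply(text="I need to count lines.",
                       tool_calls=[("count_lines", "a\nb\nc")])


def looping_llm(history):
    # Never stops asking for tools, so the round cap must end the loop.
    return SketchReply(text="again", tool_calls=[("count_lines", "x")])


tools = {"count_lines": lambda text: len(text.splitlines())}
answer, calls = react_loop(scripted_llm, tools, "How many lines?")
capped_answer, capped_calls = react_loop(looping_llm, tools, "loop", max_tool_rounds=3)
```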

Phase 8 — GitHub Integration & Demo

  • claw.tools.FileEditTool: Sandbox-scoped file editing with read/write/patch/delete/list operations. Absolute paths and .. escapes are rejected. Writes are auto-staged with git add.
  • claw.tools.ShellExecTool: Strict allowlist-only command execution via asyncio.subprocess. No shell (no sh -c), preventing injection. Configurable timeout (SIGTERM→SIGKILL) and output cap (100KB).
  • claw.tools.GitHubTool: gh CLI wrapper for PR lifecycle: create, review, approve, request changes, merge. Also handles issues: get, comment, list. Full dry-run mode.
  • claw.tools.GitWorkflow: Git worktree management for agent isolation. Creates feature branches (claw/<issue>-<slug>), manages commits, push, PR creation, merge, and cleanup.
  • claw.tools.ToolExecutor: Middleware dispatching LLM tool calls to real tool implementations. Supports dry-run mode and tracks execution history.
  • claw.artifacts.CodeFileArtifact: File-backed artifact with git commits as versions. diff() uses git diff between commits.
  • claw.artifacts.GitHubIssueArtifact: Artifact backed by a GitHub issue via gh CLI. Reads issue data, posts comments, manages labels.
  • claw.artifacts.GitHubPRArtifact: Artifact backed by a GitHub PR. Supports approve, request changes, and merge lifecycle actions.
  • claw.triggers.GitHubPollingTrigger: Polls gh issue list on a schedule. Tracks seen issues and triggers callbacks for new ones.
  • claw.triggers.GitHubWebhookTrigger: FastAPI router for GitHub webhook payloads. HMAC signature verification. Maps issues.opened, issue_comment.created, and pull_request.opened to Claw events.
  • Demo: claw-demo/ — todo CLI app with pre-filed issues. examples/github_demo.py — investor demo script.
  • Safety guardrails: Allowlist-only shell execution, sandbox-scoped file access, dry-run mode for all external operations.
  • Tests: 127 new tests (519 total).
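
For the webhook trigger's HMAC check, GitHub signs the raw request body with HMAC-SHA256 and sends the result in the X-Hub-Signature-256 header as "sha256=<hex digest>". A minimal verification sketch follows; verify_signature() is a hypothetical helper and the secret is a placeholder, since the trigger's actual implementation is not shown in this changelog.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    # Recompute the digest over the raw body and compare in constant time.
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_header.encode())


secret = b"not-a-real-secret"  # placeholder value for illustration
body = b'{"action": "opened"}'
good_header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

accepted = verify_signature(secret, body, good_header)
rejected = verify_signature(secret, body + b" ", good_header)
```

The comparison uses hmac.compare_digest rather than == so that the check does not leak timing information about how many characters matched.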

Phase 7 — Live Observer Dashboard

  • claw.server.WebSocketObserver: RuntimeObserver implementation that serializes lifecycle events to JSON and broadcasts to WebSocket clients. Ring buffer (default 500) replays events for late-joining clients.
  • claw.server.create_app(): FastAPI application with REST endpoints (/api/society, /api/artifacts, /api/trace), WebSocket endpoint (/ws), static file serving, and configurable CORS.
  • claw.server.serve(): One-call launcher that runs a society with the dashboard server concurrently via asyncio.gather(). Drop-in replacement for LocalRuntime.run().
  • React dashboard (dashboard/): Vite + React + TypeScript + TailwindCSS v4 + @xyflow/react. Graph visualization with agent status indicators, color-coded typed edges, chronological event feed with filtering, agent detail cards, artifact version history with diffs.
  • Example: examples/pr_review_dashboard.py — PR review society with live dashboard.
  • Guide: docs/guides/dashboard.md — usage, layout, dev workflow, API reference.
  • Tests: 26 new tests (392 total).
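
The ring-buffer replay for late-joining clients can be sketched with a bounded deque. ReplayBuffer is a hypothetical name and the events are plain strings; only the default capacity of 500 is taken from the notes above.

```python
from collections import deque


class ReplayBuffer:
    def __init__(self, capacity: int = 500) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._events = deque(maxlen=capacity)

    def record(self, event: str) -> None:
        self._events.append(event)

    def replay(self) -> list:
        # Snapshot for a newly connected client, oldest first.
        return list(self._events)


buffer = ReplayBuffer(capacity=3)
for i in range(5):
    buffer.record(f"event-{i}")
late_joiner_view = buffer.replay()
```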

Phase 6 — E2E Use Cases

  • 4 example scripts (examples/): pr_review, competitive_codegen, research_pipeline, dev_team
  • 11 E2E integration tests across 5 test modules
  • Cross-edge event routing (_resolve_edge_id())
  • artifacts parameter on LocalRuntime.run()

Phase 5 — LLM & Human Backends

  • MockLLM with scripted responses and specificity-based matching
  • LiteLLMBackend wrapping litellm for all providers
  • MockHuman and CLIHuman for human-in-the-loop
  • App-level LLM retry (3 attempts, exponential backoff)
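
Retrying with exponential backoff, as in the last item, might look like the sketch below. The function name, the 0.5 s base delay, and the doubling factor are assumptions; the changelog only states 3 attempts and exponential backoff.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1.0 s, ...


state = {"calls": 0}


def flaky():
    # Fails twice, then succeeds on the third call.
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient error")
    return "ok"


delays = []
outcome = call_with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
```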

Phase 4 — Runtime Engine

  • Event model with monotonic sequence IDs
  • EventBus with edge-routed delivery and batch independence detection
  • LocalRuntime: compile → seed → drain → return loop
  • EdgeResolutionTracker, Watchdog, EventLogger

Phase 3 — Compiler

  • Graph-to-prompt compilation (compile(), compile_for_event())
  • Per-agent system prompts, context filtering, tool binding

Phase 2 — Artifacts

  • StringArtifact, JsonArtifact with version history
  • Budget-aware reads, RFC 7386 merge-patch, diff support
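
RFC 7386 merge-patch semantics are compact enough to sketch directly: a null in the patch removes a key, nested objects merge recursively, and any non-object patch replaces the target outright. merge_patch() below is a standalone illustration, not JsonArtifact's own method.

```python
def merge_patch(target, patch):
    # A non-object patch replaces the target entirely.
    if not isinstance(patch, dict):
        return patch
    # Patching a non-object target starts from an empty object.
    if not isinstance(target, dict):
        target = {}
    result = dict(target)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null means "delete this key"
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


original = {"a": "b", "c": {"d": "e", "f": "g"}}
patched = merge_patch(original, {"a": "z", "c": {"f": None}})
```

One consequence of this format is that a key cannot be set to null through a merge patch, because null is reserved for deletion.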

Phase 1 — Scaffold & Data Models

  • Agent, Edge, GroupEdge, Society models
  • 5 edge types: Cooperation, Competition, Oversight, Delegation, Coopetition
  • SocietyConfig with budget/timeout policies