Chump roadmap
This file is the single source of truth for what to work on. Doc index: README.md. Heartbeat (work, opportunity, cursor_improve rounds), the Discord bot, and Cursor agents should read this file—and docs/CHUMP_PROJECT_BRIEF.md for focus and conventions—to know what they're doing. Do not invent your own roadmap; pick from the unchecked items below, from the task queue, or from codebase scans (TODOs, clippy, tests).
Ordered achievable plan: The unchecked items in this file are the prioritized backlog. Choose work based on value/effort; use this file to check boxes when work merges.
Architecture vision: For cognitive architecture roadmap, empirical status, and frontier research direction, see CHUMP_TO_COMPLEX.md.
North star: Roadmap and focus should improve implementation (ship working code and docs), speed (faster rounds, less friction, quicker handoffs), quality (tests, clippy, error handling, clarity), and bot capabilities—especially understanding the user in Discord and taking action from intent (infer what they want from natural language; create tasks, run commands, or answer without over-asking).
How to use this file
- Full prioritized backlog: Pick from the unchecked items in this file, ordered by priority.
- Chump (heartbeat / Discord): In work rounds, use the task queue first; when the queue is empty or in opportunity/cursor_improve rounds, read this file and
docs/CHUMP_PROJECT_BRIEF.md, then create tasks or do work from the unchecked items. - Cursor (when Chump delegates or you're in this repo): Read this file and
docs/CHUMP_PROJECT_BRIEF.mdwhen starting. Pick implementation work from the roadmap priorities or from the prompt Chump gave you. Align with conventions in CHUMP_PROJECT_BRIEF and.cursor/rules/.
Aspirational: Claude-tier core upgrades
Long-horizon architecture backlog (semantic context vs summarization, smarter edits, task-driven autonomy continuations, structured reasoning, delegate preprocessing of huge tool output): tracked in docs/gaps.yaml. Open gaps there are candidates for this section.
Current focus (align with CHUMP_PROJECT_BRIEF)
- Implementation, speed, quality, bot capabilities: Prioritize work that improves what we ship, how fast we ship it, how good it is, and how well the Discord bot understands and acts on user intent (NLP / natural language).
- Improve the product and the Chump–Cursor relationship: rules, docs, handoffs, use Cursor to implement.
- Task queue and GitHub (optional): create tasks from Discord or issues; use chump/* branches and PRs unless CHUMP_AUTO_PUBLISH is set.
- Keep the stack healthy: Ollama, embed server, battle QA self-heal, autonomy tests. Run the roles in the background: Farmer Brown, Heartbeat Shepherd, Memory Keeper, Sentinel, Oven Tender (Chump Menu → Roles tab; schedule with launchd/cron per docs/OPERATIONS.md).
- Fleet expansion: Chump external work, research rounds, review round; Mabel watch rounds; Scout/PWA as primary interface — see FLEET_ROLES.md.
- Long-term vision: In-process inference (mistral.rs), eBPF observability, managed browser (Firecrawl), stateless task decomposition, JIT WASM tools — see CHUMP_TO_COMPLEX.md for the frontier roadmap.
Product: Chief of staff (COS) — autonomous staff + product factory
Product vision, 60 user stories, phased waves (instrument → close the loop → discovery factory → adjacent products): PRODUCT_ROADMAP_CHIEF_OF_STAFF.md. Weekly snapshot script: ./scripts/generate-cos-weekly-snapshot.sh → logs/cos-weekly-*.md.
Wave 1 (instrument):
-
COS weekly Markdown snapshot from
chump_memory.db(scripts/generate-cos-weekly-snapshot.sh). -
Schedule snapshot:
cos-weekly-snapshot.plist.example+./scripts/install-roles-launchd.sh(Monday 08:00); unload inunload-roles-launchd.sh. -
[COS]task template in PRODUCT_ROADMAP_CHIEF_OF_STAFF.md; heartbeat context injects latestlogs/cos-weekly-*.mdon COS-oriented rounds (context_assembly). - ChumpMenu README links to PRODUCT_ROADMAP_CHIEF_OF_STAFF.md.
Wave 2 (COS close the loop) — partial:
-
W2.1 Weekly COS heartbeat:
scripts/heartbeat-self-improve.shrunsWEEKLY_COS_PROMPTon Mondays (local, 05:00–22:00) once per day (logs/.weekly-cos-last-run); disable withCHUMP_WEEKLY_COS_HEARTBEAT=0. Context typeweekly_cosgets COS snapshot injection (context_assembly). -
W2.2 Interrupt notify policy:
CHUMP_INTERRUPT_NOTIFY_POLICY=restrict,CHUMP_NOTIFY_INTERRUPT_EXTRA,src/interrupt_notify.rs,docs/COS_DECISION_LOG.md; context hint inassemble_context. -
W2.3 Decision log:
docs/COS_DECISION_LOG.md(brain-relativecos/decisions/YYYY-MM-DD.md+ template + interrupt tags). -
W2.4 ChumpMenu Chat tab: streaming
/api/chat+ Allow once / Deny →POST /api/approve(same bearer as chat).
Wave 3 (discovery factory) — scripts landed:
-
W3.1
scripts/github-triage-snapshot.sh+ W3.2scripts/ci-failure-digest.sh(SHA dedupe file) + W3.3scripts/repo-health-sweep.sh(REPO_HEALTH_AUTOFIX=1) + W3.4scripts/golden-path-timing.sh(CI artifact + relaxed limit in .github/workflows/ci.yml).
Wave 4 (adjacent products / COS factory):
-
W4.1 PROBLEM_VALIDATION_CHECKLIST.md · W4.2
scripts/scaffold-side-repo.sh+templates/side-repo/· W4.3 templates/cos-portfolio.md · W4.4scripts/quarterly-cos-memo.sh
Market wedge and pilot metrics (H1 + market demands plan)
Single index: MARKET_EVALUATION.md §8. Supporting docs and scripts:
- Pilot SQL / API / JSONL recipes for N3–N4: WEDGE_PILOT_METRICS.md
-
Golden path extension (PWA task + optional
autonomy_once): WEDGE_H1_GOLDEN_EXTENSION.md, scripts/wedge-h1-smoke.sh - Intent calibration harness (labeled set + procedure): INTENT_CALIBRATION.md
- Model flap drill (reliability acceptance): INFERENCE_STABILITY.md (Model flap drill)
- Public trust summary + diagram (speculative rollback limits): TRUST_SPECULATIVE_ROLLBACK.md
- PWA-first H1 path audit (no Discord required for wedge): PWA_WEDGE_PATH.md
- PWA in-app discoverability for task create / wedge hint — web/index.html Tasks panel + PWA_WEDGE_PATH.md
-
N4 pilot export:
GET /api/pilot-summary+ scripts/export-pilot-summary.sh + WEB_API_REFERENCE.md + WEDGE_PILOT_METRICS.md - Phase 2 market critique (docs): MARKET_EVALUATION.md §2b baseline scores, §4.2 sprint tracker, §4.4 progress line; PRODUCT_CRITIQUE.md quarterly pass; README troubleshooting; CONTRIBUTING.md repro
- Phase 2 research scaffolding: evidence tables + blind scratch pad in MARKET_RESEARCH_EVIDENCE_LOG.md; §4.2/§4.4 cross-links in MARKET_EVALUATION.md (sessions themselves still tracked below).
- Phase 2 research execution: complete ≥5 blind sessions (log B1–B5) + ≥8 interviews; then refresh market evaluation scores from evidence.
Universal power / daily driver (full program)
Goal: Make Chump reliable, reachable, governable, context-rich, and polished enough to serve as a primary execution layer (overcome “hobby stack” limits). Authoritative pillar backlog and acceptance criteria: ROADMAP_UNIVERSAL_POWER.md (items P1.x–P5.x).
Rollup — check a box when that pillar’s exit criteria in that doc are met:
-
P1 — Reliability boring — green-path + preflight + CI + degraded UX matrix +
turn_errorhints + local OpenAI retry/circuit doc (P1.5–P1.6 done in ROADMAP_UNIVERSAL_POWER.md). - P2 — Reach — P2.1–P2.5 shipped in ROADMAP_UNIVERSAL_POWER.md (Web Push MVP, async jobs, webhook hardening, cron snippets, repo profiles). Stretch: P2.6 remote runner RFC/MVP.
- P3 — Governance — P3.1–P3.5 shipped (approval parity, baseline approve tests + policy overrides + audit export + autopilot controls). Optional tighten: full P3.2 SSE-continues-after-approve e2e behind stub provider; dedicated filterable audit page.
- P4 — Compounding context — P4.1–P4.5 shipped (CONTEXT_PRECEDENCE.md, session limits doc, optional LLM e2e flag, task spine hints, COS decisions API + PWA). Optional: automated long-thread soak.
- P5 — Product polish — P5.2 mobile pass done (touch targets, sidecar overlay, drawer responsive, input/approval compacted); P5.3 parity matrix done; P5.4 turn_error copy done. Remaining: P5.1 onboarding (partial: PWA bar + Settings + step track, needs pilot friction log rows); P5.5 signed/notarized distribution (needs Apple Developer cert).
Execution order: P1 → P2 → P3 → P4 → P5 (see dependency notes in ROADMAP_UNIVERSAL_POWER.md).
Architecture vs proof (sustained use)
External reviews often praise runtime depth (cascade, context assembly, approvals, consciousness, speculative batches) while warning “built but not proven.” The roadmap already tracks most features; this block tracks evidence so claims stay tied to the repo and DAILY_DRIVER_95_STEPS.md.
| Review theme | Already in roadmap / docs | Gap to close |
|---|---|---|
| Policy-driven cascade, privacy, regimes | P1–P4, PROVIDER_CASCADE.md, CONTEXT_PRECEDENCE.md | Keep green; extend only with metrics when changing defaults. |
| Speculative rollback ≠ file/HTTP undo | TRUST_SPECULATIVE_ROLLBACK.md, ADR-001, sandbox_tool | Prefer sandbox / git worktrees for reversible file work; do not imply full transactional side effects. |
| PWA “developer-grade” / scaling | P5 polish, PWA_TIER2_SPEC.md, UI_MANUAL_TEST_MATRIX_20.md | FE architecture gate — ADR-003-pwa-dashboard-fe-gate.md (accepted); still scope large dashboard work deliberately. |
| Inference wall time dominates UX | PERFORMANCE.md, STEADY_RUN.md, INFERENCE_STABILITY.md, CHUMP_LIGHT_CONTEXT | Latency envelope below; hardware/model path is primary lever—document baseline before arguing “fast enough.” |
| Consciousness adds latency; utility unclear | CHUMP_TO_COMPLEX.md, A/B harness in ROADMAP “Chump-to-Complex” | Utility pass below (same tasks, on vs off). |
| One operator, intermittent use | Phase 2 blinds, daily driver | Blinds + 95-step plan are the corrective—PRODUCT_REALITY_CHECK.md for review hygiene. |
Unchecked proof work (pick in order; do not skip P5 while inventing new “consciousness” features):
-
Latency envelope (daily driver): Measured and documented in LATENCY_ENVELOPE.md. Tool-free fast path + schema compaction + KV cache keep-alive: 26s → 0.5s (warm cache) on qwen2.5:7b Ollama. Three optimization layers:
compact_tools_for_light(),message_likely_needs_tools()withresponse_wanted_tools()auto-retry,keep_alive=30m. See PERFORMANCE.md §8. - PWA / dashboard FE gate: Architecture choice recorded in ADR-003-pwa-dashboard-fe-gate.md; linked from PWA_TIER2_SPEC.md and ROADMAP_UNIVERSAL_POWER.md P5.
-
Overnight / 72h soak: Run all roles + primary surface for 72h. Capture pre/post: SQLite size/WAL pattern, model server restarts,
logs/growth, andGET /api/stack-statussamples; append findings to INFERENCE_STABILITY.md §Soak. -
Consciousness utility pass: Same scripted task mix with
CHUMP_CONSCIOUSNESS_ENABLED=0vs1(wall time, pass/fail, optional baseline JSON). Procedure + log table: CONSCIOUSNESS_UTILITY_PASS.md. Extend MISTRALRS_AGENT_POWER_PATH.md §8 when correlating with inference A/Bs; cross-link METRICS.md. -
Review stat hygiene: PRODUCT_REALITY_CHECK.md +
./scripts/print-repo-metrics.sh; CI prints metrics after verify-external-golden-path.sh.
Prioritized goals (unchecked = work to do)
Bot capabilities (Discord: understanding and intent)
- Understand user intent in Discord: infer what the user wants (create task, run something, answer question, remember something) from natural language; take the right action (task create, run_cli, memory store, etc.) without asking for clarification when intent is clear. Soul and INTENT_ACTION_PATTERNS.md guide this.
- Document intent→action patterns: add examples or rules (e.g. in .cursor/rules or docs) so Chump and Cursor improve at parsing "can you …", "remind me …", "run …", "add a task …", etc.
- Reduce over-asking: when the user's message implies a clear action, do it and confirm briefly; only ask when genuinely ambiguous or dangerous. In soul: "Prefer action over asking."
- Improve reply quality and speed in Discord: concise answers, optional structured follow-ups (e.g. "I created task 3; say 'work on it' to start"). In soul: "Reply concisely; add a short follow-up when relevant."
Push to Chump repo and self-reboot
-
Ensure Chump repo is in
CHUMP_GITHUB_REPOSandGITHUB_TOKENis set so the bot can git_commit and git_push to chump/* branches. SetCHUMP_AUTO_PUSH=1so the bot may push after commit without asking. Documented in OPERATIONS.md and .env.example. -
After pushing changes that affect the bot (soul, tools, src): run
scripts/self-reboot.shto kill the current Discord process, rebuild release, and start the new bot. Documented in OPERATIONS.md "Push to Chump repo and self-reboot"; user can say "reboot yourself" or invoke via run_cli. Optional:CHUMP_SELF_REBOOT_DELAY=10.
Capability improvements (no model changes)
-
Context window summarize-and-trim: when token count exceeds
CHUMP_CONTEXT_SUMMARY_THRESHOLD, delegate summarizes oldest messages and one summary block is injected;CHUMP_CONTEXT_MAX_TOKENSwired in context_window and local_openai. -
Soul / system prompt reorder: hard rules first, tool examples, routing table, assemble_context, soul and brain last (primacy/recency for small models).
CHUMP_TOOL_EXAMPLESoverride. -
Context round filter:
assemble_context()gates sections byCHUMP_HEARTBEAT_TYPE(work = tasks only; research = episodes; cursor_improve = git diff + frustrating episodes; CLI = all). - Delegate task types: classify (text + categories) and validate (text + criteria) added in delegate_tool.rs.
-
Tool-side intelligence: read_file auto-summary when file exceeds
CHUMP_READ_FILE_MAX_CHARS(default 4000); run_cli middle-trim (first 1K + last 2K with marker).
Product and Chump–Cursor
-
Add or refine
.cursor/rules/*.mdcso Cursor follows repo conventions and handoff format. - Update AGENTS.md and docs (e.g. CURSOR_CLI_INTEGRATION.md, CHUMP_PROJECT_BRIEF.md) so Cursor and Chump have clear context.
- Improve handoffs: when Chump calls Cursor CLI, pass enough context in the prompt; document what works in docs.
- Run cursor_improve rounds (or Cursor) to implement one roadmap item at a time; mark done here when complete.
- Define Chump–Cursor communication protocol and direct API contract: roles, shared context, message types, lifecycle (docs/CHUMP_CURSOR_PROTOCOL.md); expand CURSOR_CLI_INTEGRATION.md with prompt format, timeouts, and API contract for future HTTP bridge.
Keep roles running (background help)
-
Run Farmer Brown on a schedule (e.g. launchd every 120s) so the stack is diagnosed and repaired automatically. Run Heartbeat Shepherd, Sentinel, Memory Keeper, Oven Tender on their recommended schedules. See docs/OPERATIONS.md "Roles" and "Farmer Brown"; one-shot:
./scripts/install-roles-launchd.shinstalls all five plists for 24/7. Chump Menu → Roles tab shows all five.
Implementation, speed, and quality
- Reduce unwrap() in non-test code: high-impact call sites fixed (limits, agent_loop, github_tools). Remaining unwraps verified as test-only (delegate_tool, episode_db, state_db, schedule_db, task_db, repo_tools, memory_*, calc_tool, local_openai, main, cli_tool).
-
Fix or document TODOs in
src/: no TODO/FIXME in src/ currently; add docs/TODO.md or code comments when introducing new work. -
Keep battle QA green: run
BATTLE_QA_ITERATIONS=5 ./scripts/battle-qa.shuntil pass; fix failures in logs/battle-qa-failures.txt. Self-heal: see docs/BATTLE_QA_SELF_FIX.md and WORK_PROMPT "run battle QA and fix yourself." -
Clippy clean: run
cargo clippyand fix warnings. - Speed: shorten round latency where possible (prompt size, tool use batching, model choice). Documented in docs/OPERATIONS.md "What slows rounds (speed)".
- Quality: ensure edits include tests/docs where appropriate; clear PR descriptions and handoff summaries. In docs/CHUMP_PROJECT_BRIEF.md "Quality".
Optional integrations
- GitHub: add repo to CHUMP_GITHUB_REPOS, set GITHUB_TOKEN; Chump can list issues, create branches, open PRs. Documented in .env.example, docs/OPERATIONS.md "Push to Chump repo", docs/AUTONOMOUS_PR_WORKFLOW.md.
- ADB tool: see docs/ROADMAP_ADB.md for Pixel/Termux companion; enable via CHUMP_ADB_* in .env (see .env.example).
Fleet / Mabel–Chump symbiosis
See ROADMAP_MABEL_DRIVER.md and FLEET_ROLES.md for context.
-
Mutual supervision: Mac has PIXEL_SSH_HOST (and PIXEL_SSH_PORT); Pixel has MAC_TAILSCALE_IP, MAC_SSH_PORT, MAC_CHUMP_HOME; Pixel SSH key on Mac. Both restart scripts run and exit 0 when heartbeats are up. Checklist + gate: OPERATIONS.md (Mutual supervision);
./scripts/verify-mutual-supervision.shfrom the Mac (exit 0 = both directions OK). -
Single fleet report: Mabel's report round writes
logs/mabel-report-*.md+ notify. Retire Mac hourly-update when stable:./scripts/retire-mac-hourly-fleet-report.sh(see OPERATIONS.md Single fleet report). Chump keeps notify for ad-hoc. -
Hybrid inference: On the Pixel set
MABEL_HEAVY_MODEL_BASE(e.g.http://<MAC_TAILSCALE_IP>:8000/v1);heartbeat-mabel.shswitches API for research and report only; patrol/intel/verify/peer_sync stay on localOPENAI_API_BASE. Documented in OPERATIONS.md Hybrid inference + ANDROID_COMPANION.md; helper:scripts/apply-mabel-badass-env.sh. -
Peer_sync loop: Chump writes
brain/a2a/chump-last-reply.mdviacontext_assembly::record_last_reply(Discord + web).PEER_SYNC_PROMPTinscripts/heartbeat-mabel.shinstructsmemory_brain read_file a2a/chump-last-reply.mdand episode log line "Chump said: …". -
Mabel self-heal (Pixel):
scripts/mabel-farmer.shrunsstart-companion.shwhen local model/bot is down ifMABEL_FARMER_FIX_LOCAL=1(default). See script header and OPERATIONS Keeping the stack running. -
On-demand status: Discord
!status/status report— Chump and Mabel reply with latestlogs/mabel-report-*.mdwhen present; otherwise Chump points to Mabel/Pixel and the retire script (discord.rson_demand_fleet_status_markdown).
PWA / brain workflows (Phase D — pragmatic)
-
Quick capture hardening:
POST /api/ingestand/api/shortcut/captureenforce 512 KiB max payload, optionalsourceprovenance comment,RequestBodyLimitLayeron JSON routes; PWA sendssource: pwa. See WEB_API_REFERENCE.md, CHUMP_BRAIN.md Capture size. -
External repo + projects: Documented
CHUMP_REPO/CHUMP_HOME, multi-repo,projects/playbooks, and PWA/api/projectsin CHUMP_BRAIN.md External repos; heartbeat prompts already usememory_brain+set_working_repo. -
Research pipeline (baseline): PWA
/api/researchcreates queued briefs underresearch/; agent-side multi-pass synthesis viaRESEARCH_BRIEF_PROMPT→research/latest.mdand research rounds inheartbeat-self-improve.sh. Full “Research X for me” one-shot product flow remains incremental (see ROADMAP_FULL.md Tier 1). -
Watchlists + alerts:
GET /api/watch/alertsscanswatch/*.mdfor flagged bullets (urgent / deadline /[!]/ asap / etc.);GET /api/briefingincludes Watchlists + Watch alerts. MabelINTEL_PROMPTreadswatch/when present (heartbeat-mabel.sh). -
Morning briefing DM:
scripts/morning-briefing-dm.sh— fetch/api/briefing, format withjq, pipe tochump --notify(schedule via cron/launchd). Optional Web Push “research ready” still future.
Rust infrastructure (reliability & velocity)
Design and status: RUST_INFRASTRUCTURE.md. Suggested sequence: Tower → tracing → proc macro → inventory → typestate → pool → notify.
-
Tower middleware (~1 d): Wrap every tool call in a composable stack (timeout, concurrency limit, rate limit, circuit breaker, tracing). Replaces ad-hoc tool timeouts and collapses tool health / error-budget into one layer. Build once at startup; all tools get same guarantees. Done:
tool_middleware.rswith 30s timeout + tool_health_db recording + per-tool circuit breaker + process-wideCHUMP_TOOL_MAX_IN_FLIGHTconcurrency; all Discord/CLI/web registrations usewrap_tool(). Full Tower ServiceBuilder layers (rate limit, extra layers) can be added next. -
tracing migration (1–2 d): Replace/adjoin
chump_logwithtracingspans (agent turn = span, tool call = child span). Unifies logging, episode recording, tool health, introspect; span DB makes "what did I do last session?" trivial. Done (first phase): tracing + tracing-subscriber in main (RUST_LOG); agent_loop events (agent_turn, tool_calls); tool_middleware#[instrument]on execute. chump_log kept; span DB / introspect later. -
Proc macro for tools (~1.5 d):
#[chump_tool(name, description, schema)]on impl block generatesname(),description(),input_schema(); ~30 lines per tool. Done: chump-tool-macro crate, calc_tool migrated. See RUST_INFRASTRUCTURE.md. -
inventory tool registration (~0.5 d): Auto-collect tools at link time via
inventory;register_from_inventory()in discord.rs; new tool = onesubmit!in tool_inventory (or per-tool file). Enables Chump self-discovery. Done: see RUST_INFRASTRUCTURE.md §3. -
Typestate session (~0.5 d):
Session<S: SessionState>(Uninitialized → Ready → Running → Closed); CLI uses start/close so double-close and tools-before-assemble don't compile. Done:src/session.rs; see RUST_INFRASTRUCTURE.md §5. -
rusqlite connection pool (~0.5 d): r2d2-sqlite + WAL + busy_timeout in
src/db_pool.rs; all DB modules use pool. Done: see RUST_INFRASTRUCTURE.md §7. -
notify file watcher (~0.5 d): Real-time repo watch via
notifyinsrc/file_watch.rs;assemble_contextdrains "Files changed since last run (live)". Done: see RUST_INFRASTRUCTURE.md §6.
External readiness (adoption / “take flight”)
Baseline docs: EXTERNAL_GOLDEN_PATH.md, PRODUCT_CRITIQUE.md, ONBOARDING_FRICTION_LOG.md. README quick start must stay aligned with the golden path.
- README + golden path: Root README.md describes Chump (not a placeholder), links LICENSE, and quick start matches EXTERNAL_GOLDEN_PATH.md.
-
External safety banner in
.env.example(executive mode, auto-push, cascade privacy, autonomy/RPC cautions). -
Naive onboarding pass: Cold clone + timed
cargo buildrecorded in ONBOARDING_FRICTION_LOG.md; launch gates L2/L6 updated in PRODUCT_CRITIQUE.md; smoke scriptverify-external-golden-path.sh. Optional: third-party reviewer still welcome. -
Optional polish: README architecture diagram + PWA preview asset; GitHub issue template for bugs (see
.github/ISSUE_TEMPLATE/). -
Novice OOTB desktop distribution: In-tree (unsigned QA): bundled
chump+ Tauri shell, first-run wizard (Ollama + optional OpenAI-compatible base, streamingollama pull, Application Support.env, health-gated start), retail plist modeCHUMP_BUNDLE_RETAIL=1inscripts/macos-cowork-dock-app.sh, macOS bundle CI.github/workflows/tauri-desktop.yml. Still open for public download: Apple signing + notarization + versioned DMG/pkg.
Strategic evaluation alignment (external enterprise / defense doc)
Living map of an external strategy paper vs this repo: EXTERNAL_PLAN_ALIGNMENT.md. Granular work packages (WP-IDs), priorities, and completion rules: HIGH_ASSURANCE_AGENT_PHASES.md (§3 includes WP-1.4 matrix + WP-1.5 multimodal RFC Proposed). Theme order defaults to inference/ops → pilot kit → fleet transport → research/RFCs. Optional in-process depth: MISTRALRS_CAPABILITY_MATRIX.md.
- Alignment doc + pilot repro kit + inference RFC skeleton: EXTERNAL_PLAN_ALIGNMENT.md, DEFENSE_PILOT_REPRO_KIT.md, rfcs/RFC-inference-backends.md.
-
Inference hardening (ops + UX): Extend INFERENCE_STABILITY.md and OPERATIONS.md with a degraded-mode playbook (MLX OOM symptoms, Ollama fallback, when
farmer-brownapplies); ensure browser/PWA surfacesstack-statusinference.errorwhere users already load stack status (e.g. Providers/settings flows) whenmodels_reachable === false.
mistral.rs — higher-performance agents (measurement + next tier)
-
Agent power path: MISTRALRS_AGENT_POWER_PATH.md (metrics, fixed AB prompts, modes A/B/C),
scripts/mistralrs-inference-ab-smoke.sh,scripts/env-mistralrs-power.sh; PWA streaming default inscripts/run-web-mistralrs-infer.sh. - RFC multimodal (WP-1.5): Accept or reject RFC-mistralrs-multimodal-in-tree.md with rationale, then implement per RFC if accepted (MISTRALRS_CAPABILITY_MATRIX.md).
-
Structured output / grammar (in-process mistral): S3 spike: ADR-002, matrix row, opt-in
CHUMP_MISTRALRS_OUTPUT_JSON_SCHEMAon tool-free completions inmistralrs_provider.rs. Follow-up: tool-argument grammar / repair when JSON reliability is the bottleneck (MISTRALRS_CAPABILITY_MATRIX.md). Sprint: S3. -
run_cli governance (pilot tier): Document sponsor-safe defaults (
CHUMP_TOOLS_ASK,CHUMP_AUTO_APPROVE_*off for demos) in DEFENSE_PILOT_REPRO_KIT.md or TOOL_APPROVAL.md; optional follow-up issue for containerized or SSH-jump execution profile. - Fleet transport spike: Design note under FLEET_ROLES.md or ROADMAP_MABEL_DRIVER.md + time-boxed prototype — outbound WebSocket or MQTT over Tailscale from Pixel to Mac; Mac pauses sentinel-delegated repair when peer last-seen exceeds threshold (no infinite wait).
-
WASM tool lane: Extend WASM_TOOLS.md with a “new sandboxed tool” checklist; explicit non-goal near term: WASM-wrapping all of
run_cli. - High-assurance agent architecture (paper → phases): HIGH_ASSURANCE_AGENT_PHASES.md — §3 master registry (WP-1.1 … WP-1.4 … WP-8.1), §4 handoff template, §17 when to check this box. Rule: one WP-ID per Cursor run; set WP Status to Done in §3 when merged. Closed under §17 strict (2026-04-09): all P0 WPs 2.2, 3.1, 4.1 are Done in §3. (Use §17 loose if you later reopen the umbrella until Phases 1–5 are materially complete—document in a follow-up PR.)
Repo hygiene and storage (periodic; see STORAGE_AND_ARCHIVE.md)
Baseline: scripts/cleanup-repo.sh + archive layout documented. Below = optional polish when disk or clone maintenance matters.
-
Embed cache hygiene — Document or script safe pruning of
.fastembed_cache/when usinginprocess-embed(re-download cost vs disk); cross-link STORAGE_AND_ARCHIVE.md. Done: STORAGE_AND_ARCHIVE.md § In-process embed cache. -
Git maintenance runbook — Short maintainer note: when to run
git gc, how to spot history bloat / large blobs, links to GitHub limits; no obligation for routine devs. Done: STORAGE_AND_ARCHIVE.md § Git maintenance. -
Quarterly cold export — Runbook: tarball
sessions/,logs/, and a defined subset ofchump-brain/(or full brain) to cold storage; one-page restore/smoke check so archives are trustworthy. Done: STORAGE_AND_ARCHIVE.md § Quarterly cold export +cleanup-repo.sh.
Turnstone-inspired deployment (observability, safety, governance)
Phased deployment for production-ready ops and compliance. See plan in repo; OPERATIONS.md and ARCHITECTURE.md document the result.
-
Phase 1 — Observability: Tool-call metrics in middleware; health endpoint includes
model_circuit,status(healthy/degraded),tool_calls. OPERATIONS.md "Observability (GET /health)". - Phase 2 — Safety: Heuristic risk for run_cli (and optional write_file); CHUMP_TOOLS_ASK; approval flow with ToolApprovalRequest; one approval UX (Discord + Web); audit logging (tool_approval_audit in chump.log). OPERATIONS.md "Tool approval", docs/TOOL_APPROVAL.md, ARCHITECTURE.md "Tool policy (allow / deny / ask)".
- Phase 3 — Resilience and governance: Per-tool circuit breaker (CHUMP_TOOL_CIRCUIT_*); retention and audit documented (OPERATIONS.md "Retention and audit"); RUST_INFRASTRUCTURE.md updated. Session eviction at capacity is optional and deferred (single-session or low concurrency).
Backlog (see docs/WISHLIST.md)
- run_test tool: structured pass/fail, which tests failed (wrap cargo/npm test). Implemented in src/run_test_tool.rs; registered in Discord and CLI agent builds.
- read_url: fetch docs page (strip nav/footer) for research. Implemented in src/read_url_tool.rs; registered in Discord and CLI agent builds.
- Task routing (assignee): task_db assignee column (chump/mabel/jeff/any); task tool create/list; context_assembly "Tasks for Jeff". See docs/FLEET_ROLES.md.
- Other wishlist items as prioritized (screenshot+vision, watch_file; sandbox / introspect done; emotional memory done — episode sentiment + recent frustrating in context_assembly).
Autonomy (planning + task execution)
See docs/AUTONOMY_ROADMAP.md for the detailed milestone plan.
-
Task contract: structured task notes (Context/Plan/Acceptance/Verify/Risks) +
task_contracthelpers (ensure_contract, section accessors) + tests. Task tool applies template on create. -
Planner → Executor → Verifier loop:
autonomy_loop::autonomy_once— pick task, lease, contract preflight, agent executor prompt, verify (run_test/ Verify commands),doneorblocked+ episode + follow-up task. -
Task claim/lease locking: DB-backed leases in
task_db+autonomy_loop.rs(claim, renew, release);chump --reap-leasesand task toolreap_leases. Tests:task_db::task_lease_second_owner_cannot_claim_until_released; ops: OPERATIONS.md. -
Autonomy driver / ops:
scripts/autonomy-cron.sh(reap-leases +--autonomy-once);CHUMP_RPC_JSONL_LOGmirrorschump --rpcJSONL to a file. Auto-approve (opt-in):CHUMP_AUTO_APPROVE_LOW_RISK(low-riskrun_cli) andCHUMP_AUTO_APPROVE_TOOLS; audited astool_approval_audit(see OPERATIONS.md). -
Autonomy conformance tests:
autonomy_looptests with fake executor/verifier; lease contention test intask_db; CI:.github/workflows/ci.ymlrunscargo test+cargo clippy.
Chump-to-Complex transition (synthetic consciousness)
Master vision and detail: CHUMP_TO_COMPLEX.md. Research brief for external review: CHUMP_RESEARCH_BRIEF.md.
Section 1 — Harden and measure (near-term)
-
Metric definitions (
docs/METRICS.md): CIS, Turn Duration, Auto-approve Rate, Phi Proxy, Surprisal Threshold — exact computation from DB/logs. -
A/B harness: consciousness modules enabled vs disabled (
CHUMP_CONSCIOUSNESS_ENABLED=0); compare task success, tool calls, latency. (Live runs complete: 2026-04-15) - A/B Round 2 (Paper Grade): Add LLM-as-a-judge scoring for prompt semantic accuracy, and capture scaling curves across 3+ models (e.g. 3B vs 9B vs 14B) to correlate latency penalty with parameter counts.
- memory_graph in context_assembly: inject triple count when graph has triples.
-
Blackboard persistence: persist high-salience entries to
chump_blackboard_persist; restore on startup; prune to top 50. -
Phi proxy calibration: per-session metrics to
chump_consciousness_metricstable for phi–surprisal correlation tracking. - Consciousness regression suite: 5 regression tests asserting module state transitions (high-surprise regime shift, persistence roundtrip, metrics recording, A/B toggle, memory_graph in context).
- Battle QA consciousness gate: compares consciousness baselines; warns on surprisal regression (>50%) and lesson count drops.
Section 2 — Build missing core (medium-term)
-
Belief state module (
src/belief_state.rs): per-tool Beta(α,β) confidence, task trajectory tracking, EFE scoring (G = ambiguity + risk − pragmatic_value), context injection.update_tool_belief()anddecay_turn()called from agent_loop hot path after every tool result. 9 tests. -
EFE-based tool ordering (2026-04-14):
efe_order_tool_calls()in agent_loop scores tools by Expected Free Energy and reorders execution (lowest G first). Combined with epsilon-greedy exploration. Belief state now drives action selection, not just context. -
Precision-weighted surprisal (2026-04-14):
surprise_tracker::compute_surprisal()amplifies surprise when beliefs are confident (×1.4 at low uncertainty), dampens when uncertain (×0.6). Closes the Active Inference perception-action loop. -
Surprise-driven escalation: epistemic uncertainty check in agent_loop after tool calls; posts high-urgency blackboard entry when task uncertainty exceeds threshold (
CHUMP_EPISTEMIC_ESCALATION_THRESHOLD). -
Control shell for blackboard: regime-adaptive
SalienceWeights(exploit/balanced/explore/conservative) replacing static weights; manual override viaset_salience_weights(). -
Async module posting:
tokio::sync::mpscunbounded channel withpost_async()andinit_async_channel()drain task; falls back to synchronous post if channel not initialized. -
Subscriber filtering:
Blackboard::subscribe()registers module interests;read_subscribed()returns only matching entries with cross-module read tracking. -
LLM-assisted triple extraction:
extract_triples_llm()sends text to worker model, parses JSON array of (S,R,O,confidence); regex fallback on any failure.store_triples_with_confidence()uses confidence as weight. -
Personalized PageRank: proper iterative PPR with power method (α=0.85, ε=1e-6 convergence) over adjacency loaded from connected component BFS. Replaces bounded BFS in
associative_recall(). -
Valence and gist:
relation_valence()maps relations to [-1,+1];entity_valence()computes weighted average;entity_gist()produces one-sentence summary with tone and top relations. -
Noise-as-resource exploration:
exploration_epsilon()returns regime-dependent ε;epsilon_greedy_select()picks random non-best index with probability ε. Wired into agent_loop viaefe_order_tool_calls()(2026-04-14). -
Dissipation tracking:
record_turn_metrics()logs tool_calls, tokens, duration, regime, surprisal EMA, and dissipation_rate tochump_turn_metricstable. Wired into agent_loop at turn end. -
Episode causal graph:
CausalGraphwith nodes (Action/Outcome/Observation) and edges;build_causal_graph_heuristic()constructs DAG from episode tool calls;paths_from()for traversal; JSON serialization. -
Counterfactual query engine:
counterfactual_query()implements simplified do-calculus — single intervention, graph path analysis, past lesson lookup. Returns predicted outcome with confidence and reasoning. -
Human review loop:
claims_for_review()surfaces high-confidence frequently-applied lessons;review_causal_claim()boosts or reduces confidence based on user confirmation.
Shipped (2026-04-15) — perception, eval, enriched memory, retrieval, verification
-
Structured perception layer (
src/perception.rs): TaskType classification, entity extraction, constraint detection, risk indicators, ambiguity scoring. Wired into agent_loop before the main model call. -
Eval framework (
src/eval_harness.rs): EvalCase, EvalCategory, ExpectedProperty types. DB tables chump_eval_cases, chump_eval_runs. Property-based checking with regression detection, wired into battle_qa. - Memory enrichment: chump_memory gains confidence, verified, sensitivity, expires_at, memory_type columns. Memory tool accepts confidence, memory_type, expires_after_hours params.
- Retrieval improvements: RRF merge weighted by freshness decay and confidence. Query expansion via memory graph. Context compression to 4K char budget.
- Action verification: ToolVerification struct in tool_middleware.rs. Post-execution verification for write tools. ToolVerificationResult SSE event.
- Configurable thresholds: CHUMP_EXPLOIT_THRESHOLD, CHUMP_BALANCED_THRESHOLD, CHUMP_EXPLORE_THRESHOLD, CHUMP_NEUROMOD_NA_ALPHA, CHUMP_NEUROMOD_SERO_ALPHA, CHUMP_LLM_RETRY_DELAYS_MS, CHUMP_ADAPTIVE_OUTCOME_WINDOW.
-
cargo-audit CI job:
.github/workflows/runs cargo-audit for dependency vulnerability scanning. - Error handling fixes: ask_jeff_tool, provider_quality, rpc_mode hardened.
Section 3 — Frontier concepts (long-term, research-grade; gate criteria in CHUMP_TO_COMPLEX.md)
- Quantum cognition prototype: density matrix belief states for ambiguity resolution; gate: >5% improvement on multi-choice tool selection.
- Topological integration metric (TDA): persistent homology on blackboard traffic; gate: better correlation with task success than phi_proxy.
-
Synthetic neuromodulation (
src/neuromodulation.rs): three modulators (dopamine, noradrenaline, serotonin) as system-wide meta-parameters. DA scales reward sensitivity, NA modulates regime thresholds (wired into precision_controller), 5HT controls tool budget, temporal patience, and tool-free fast path threshold (wired into agent_loop 2026-04-14). Context injection and health endpoint metrics. 8 tests. -
Holographic Global Workspace (
src/holographic_workspace.rs):amari-holographicv0.19 ProductCl3x32 (256-dim, ~46 capacity). Encodes blackboard entries as HRR key-value pairs;sync_from_blackboard()called in context_assembly; query_similarity and retrieve_by_key for content-based and key-based lookup. Health endpoint metrics. 7 tests. -
Speculative execution (
speculative_execution.rs, wired fromagent_loopfor ≥3 tools/batch): snapshots belief_state, neuromod, full blackboard;evaluate()uses surprisal EMA delta since fork plus confidence and failure ratio;rollback()restores in-process state only. Seedocs/ADR-001-transactional-tool-speculation.md. Tests inspeculative_execution+ integration coverage. - Workspace merge for fleet: two Chump instances share blackboard via peer_sync for bounded turns (dynamic autopoiesis).
-
Abstraction audit (
src/consciousness_traits.rs): 9 trait interfaces —SurpriseSource,BeliefTracker,PrecisionPolicy,GlobalWorkspace,IntegrationMetric,CausalReasoner,AssociativeMemory,Neuromodulator,HolographicStore— each with aDefault*implementation backed by the current singleton modules.ConsciousnessSubstratebundles all 9 into a single injectable struct. 9 tests.
When you complete an item
- Uncheck → check the box in this file (patch_file or write_file:
- [ ]→- [x]). - If it was a task, set task status to done and episode log.
- Optionally notify if something is ready for review.
Related docs
Full doc index: README.md. Key references: CHUMP_TO_COMPLEX.md (architecture vision, empirical status, and frontier roadmap), CHUMP_PROJECT_BRIEF.md (focus and conventions), FLEET_ROLES.md, RUST_INFRASTRUCTURE.md (Tower, tracing, proc macro, inventory, typestate, pool, notify), EXTERNAL_GOLDEN_PATH.md (external adoption), CONSCIOUSNESS_AB_RESULTS.md (A/B study data), research/consciousness-framework-paper.md (research paper with Scaffolding U-curve + neuromod findings).