Consciousness framework metrics
Canonical definitions for measuring the Chump-to-Complex transition. Each metric is computable from the SQLite DB, /health endpoint, or logs. See CHUMP_TO_COMPLEX.md for context.
1. Surprisal EMA
What it measures: How well the agent predicts tool outcomes and latencies. Declining EMA means the agent is calibrating.
Source: surprise_tracker::current_surprisal_ema() (in-process); DB fallback below.
SQL (from chump_prediction_log):
-- Overall mean surprisal
SELECT AVG(surprisal) FROM chump_prediction_log;
-- Per-tool mean surprisal (tools with >= 3 calls)
SELECT tool, ROUND(AVG(surprisal), 3) AS avg_surprisal, COUNT(*) AS calls
FROM chump_prediction_log
GROUP BY tool HAVING COUNT(*) >= 3
ORDER BY avg_surprisal DESC;
-- High-surprise percentage (above 0.5 threshold)
SELECT CAST(SUM(CASE WHEN surprisal > 0.5 THEN 1 ELSE 0 END) AS REAL) / COUNT(*) * 100
FROM chump_prediction_log;
-- Trend: average surprisal per 50-prediction window
SELECT (rowid / 50) AS window,
ROUND(AVG(surprisal), 4) AS avg_surprisal,
COUNT(*) AS n
FROM chump_prediction_log
GROUP BY window ORDER BY window;
Target: Steadily decreasing over sessions; per-tool averages converging.
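The in-process tracker maintains an exponential moving average of per-prediction surprisal. A minimal sketch of the update rule (the smoothing factor alpha and the function name are illustrative assumptions, not surprise_tracker internals):

```python
def update_surprisal_ema(ema: float, surprisal: float, alpha: float = 0.1) -> float:
    """One EMA step: blend the newest surprisal observation into the
    running average. alpha is a hypothetical smoothing factor; higher
    alpha weights recent predictions more heavily."""
    return (1.0 - alpha) * ema + alpha * surprisal

# A calibrating agent: repeated low-surprisal observations pull the EMA down.
ema = 0.8
for s in [0.3, 0.2, 0.1, 0.1]:
    ema = update_surprisal_ema(ema, s)
```

The same declining shape is what the per-window SQL trend query above should show as sessions accumulate.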
1a. Belief → tool budget hook (WP-6.1)
What it measures: Optional coupling between task-level epistemic uncertainty (belief_state::task_belief().uncertainty()) and precision_controller::recommended_max_tool_calls().
Knob: CHUMP_BELIEF_TOOL_BUDGET=1 (or true) — when uncertainty > 0.55, the recommended cap is multiplied by ~0.75 (integer floor, minimum 1). The same tightening applies to recommended_max_delegate_parallel() (batch delegate worker fan-out). Default off (unset).
Source: env_flags::chump_belief_tool_budget(), precision_controller::recommended_max_tool_calls(), precision_controller::recommended_max_delegate_parallel(), delegate_tool::run_batch; blackboard warnings for escalation still use existing should_escalate_epistemic thresholds.
Observability: When CHUMP_HEALTH_PORT is set, GET /health on that port → consciousness_dashboard.precision includes recommended_max_tool_calls, recommended_max_delegate_parallel, belief_tool_budget, task_uncertainty, context_exploration_fraction, effective_tool_timeout_secs. The web app’s GET /api/stack-status exposes the same snapshot under cognitive_control (PWA / desktop shell).
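The tightening rule reduces to plain arithmetic. A sketch under the documented thresholds (function name and signature are illustrative, not the precision_controller API):

```python
def tightened_tool_budget(base_cap: int, task_uncertainty: float,
                          budget_hook_enabled: bool) -> int:
    """WP-6.1 coupling sketch: when CHUMP_BELIEF_TOOL_BUDGET is on and
    task-level uncertainty exceeds 0.55, multiply the recommended cap
    by 0.75, integer-floored, never below 1. Otherwise pass through."""
    if budget_hook_enabled and task_uncertainty > 0.55:
        return max(1, int(base_cap * 0.75))
    return base_cap
```

The same multiplier applies to the delegate fan-out cap, per the knob description above.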
1b. Speculative multi-tool batch (surprisal EMA delta)
What it measures: For a single assistant turn with ≥3 tool calls, speculative_execution::evaluate compares global surprisal EMA after those tools to the value captured at fork(). The metric is surprisal_ema_delta = max(0, ema_now - ema_at_fork) (not absolute EMA).
Source: speculative_execution (called from agent_loop); GET /health → consciousness_dashboard.speculative_batch holds the last in-process batch (resolution, surprisal_ema_delta, etc.). Programmatic helper: speculative_execution::metrics_json.
Operator knobs: CHUMP_SPECULATIVE_BATCH=0 disables the path; CHUMP_SPECULATIVE_SURPRISE_DELTA_MAX caps allowed delta (default 0.25).
Limitation: Rollback restores beliefs, neuromodulation, and blackboard only; it does not reverse tool side effects. For the distinction vs true transactional speculation, see docs/ADR-001-transactional-tool-speculation.md.
Correctness test: cargo test memory_graph_curated_recall_topk (serial DB isolation) covers curated PPR recall@k; scripts/memory-graph-benchmark.sh is for timing.
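The commit/rollback decision for a speculative batch can be sketched from the documented metric and knob (the return values here are illustrative, not the real resolution enum):

```python
def resolve_speculative_batch(ema_at_fork: float, ema_now: float,
                              delta_max: float = 0.25) -> tuple[float, str]:
    """Sketch of the gate: the metric is the non-negative EMA movement
    since fork(), compared against CHUMP_SPECULATIVE_SURPRISE_DELTA_MAX
    (default 0.25). A falling EMA yields delta 0.0, so it always passes."""
    surprisal_ema_delta = max(0.0, ema_now - ema_at_fork)
    resolution = "commit" if surprisal_ema_delta <= delta_max else "rollback"
    return surprisal_ema_delta, resolution
```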
1c. Which LLM backend served the last completion (Tier A / matrix)
What it measures: After each successful provider completion, Chump records which path answered: in-process mistral.rs, a cascade slot, a single OpenAI-compatible HTTP base, or hosted OpenAI API (no OPENAI_API_BASE).
Source: llm_backend_metrics (record_mistralrs, record_cascade_slot, record_openai_http, record_openai_api). Inner HTTP calls made while the cascade is trying slots are not logged as openai_http (only the winning cascade::<slot> counts). warm_probe_all holds a pause guard so probe completions do not overwrite last or totals.
Observability:
GET /api/stack-status → llm_last_completion (null or object: kind, label, stream_text_deltas, at_unix_ms) and llm_completion_totals (map of "kind::label" → call count since process start). GET /health on CHUMP_HEALTH_PORT includes the same two top-level fields.
Related: MISTRALRS_CAPABILITY_MATRIX.md Next tier A; src/llm_backend_metrics.rs.
2. Phi Proxy
What it measures: Degree of inter-module coupling via the blackboard. Higher = modules are actively reading each other's outputs, not operating in isolation.
Source: phi_proxy::compute_phi() → PhiMetrics.phi_proxy; also GET /health → consciousness_dashboard.phi_proxy.
Computation: 0.35 * coupling_score + 0.35 * cross_read_utilization + 0.30 * information_flow_entropy
Where:
coupling_score = active cross-module read pairs / total possible pairs
cross_read_utilization = entries read by non-author / total entries
information_flow_entropy = normalized Shannon entropy of the read distribution
Target: > 0.3 sustained during active tool-using sessions.
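The weighted sum can be recomputed offline from the three components. A sketch, assuming the entropy term is Shannon entropy of the blackboard read distribution normalized by log(k) to land in [0, 1] (that normalization detail is an assumption about phi_proxy internals):

```python
import math

def phi_proxy(coupling_score: float, cross_read_utilization: float,
              read_counts: list[int]) -> float:
    """Recompute the documented weighted sum:
    0.35 * coupling + 0.35 * cross-read utilization + 0.30 * entropy."""
    total = sum(read_counts)
    probs = [c / total for c in read_counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(probs)) if len(probs) > 1 else 1.0
    information_flow_entropy = entropy / max_entropy
    return (0.35 * coupling_score
            + 0.35 * cross_read_utilization
            + 0.30 * information_flow_entropy)
```

With uniform reads across modules the entropy term saturates at 1.0, so the score is dominated by how many cross-module read pairs are actually active.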
3. Turn Duration (autonomous work time)
What it measures: How long the agent works without human intervention between messages.
SQL (from chump_episodes):
-- Average episode duration (proxy: time between consecutive episode logs)
SELECT AVG(julianday(e2.happened_at) - julianday(e1.happened_at)) * 86400 AS avg_gap_secs
FROM chump_episodes e1
JOIN chump_episodes e2 ON e2.id = e1.id + 1;
Log-based: Parse tracing output for agent_turn span durations; sum consecutive tool-use turns between user messages.
Target: Minutes to hours of self-directed goal pursuit (currently seconds per reactive turn).
4. Auto-approve Rate
What it measures: Percentage of tool calls executed without requiring human approval. Higher = the agent is using safe tools and the approval policy trusts it.
Computation:
auto_approve_rate = (total_tool_calls - approval_requests) / total_tool_calls * 100
Sources:
tool_middleware::tool_calls_total() (total tool calls)
chump.log lines with event tool_approval_audit (grep tool_approval_audit). The result field includes allowed, denied, timeout, auto_approved_cli_low (low-risk run_cli when CHUMP_AUTO_APPROVE_LOW_RISK=1), and auto_approved_tools_env (tools listed in CHUMP_AUTO_APPROVE_TOOLS).
SQL (from chump_tool_health):
-- Total tool calls (proxy)
SELECT SUM(total_calls) FROM chump_tool_health;
Target: > 90% for routine tasks.
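The rate itself is one division; a sketch that takes the two counts from the sources above (e.g. approval requests counted via `grep -c tool_approval_audit chump.log`):

```python
def auto_approve_rate(total_tool_calls: int, approval_requests: int) -> float:
    """The documented formula: share of tool calls that executed without
    a human approval prompt. Returns 0.0 when no calls were made."""
    if total_tool_calls == 0:
        return 0.0
    return (total_tool_calls - approval_requests) / total_tool_calls * 100.0
```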
5. Causal Inference Score (CIS)
What it measures: Precision of counterfactual lessons — what fraction are actually correct when reviewed by a human.
SQL (from chump_causal_lessons):
-- Lessons by confidence and application count
SELECT lesson, confidence, times_applied, created_at
FROM chump_causal_lessons
ORDER BY confidence DESC
LIMIT 20;
-- Failure pattern distribution
SELECT task_type, COUNT(*) AS cnt
FROM chump_causal_lessons
WHERE task_type IS NOT NULL AND task_type != ''
GROUP BY task_type ORDER BY cnt DESC;
-- Lessons that were applied (validated in context)
SELECT COUNT(*) AS applied, (SELECT COUNT(*) FROM chump_causal_lessons) AS total
FROM chump_causal_lessons WHERE times_applied > 0;
Human labeling required: Export top-20 lessons → human marks each correct/incorrect → CIS = correct / total.
Target: > 70% precision on reviewed lessons.
6. Thermodynamic Efficiency
What it measures: Work output per unit of computational resource consumed.
Computation:
efficiency = tasks_completed / (tokens_spent + tool_calls_made)
Sources:
cost_tracker::summary() for tokens spent
tool_middleware for tool call count
task_db for tasks moved to done status
SQL:
-- Tasks completed (proxy for "work done")
SELECT COUNT(*) FROM chump_tasks WHERE status = 'done';
-- Total tool calls
SELECT SUM(total_calls) FROM chump_tool_health;
Target: Improving trend over sessions (ratio should increase as the agent becomes more efficient).
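A sketch of the ratio (mixing tokens and tool calls in one denominator makes the absolute number unitless; only the cross-session trend is meaningful):

```python
def thermodynamic_efficiency(tasks_completed: int, tokens_spent: int,
                             tool_calls_made: int) -> float:
    """The documented ratio: work delivered per unit of compute consumed.
    Guard against the empty-session case."""
    spent = tokens_spent + tool_calls_made
    return tasks_completed / spent if spent else 0.0
```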
7. Phi–Surprisal Correlation
What it measures: Whether integration and calibration co-evolve — per the research literature, higher Φ should correlate with lower surprisal over time.
Computation: Pearson correlation between phi_proxy and surprisal EMA values, sampled once per session.
Data collection: At close_session, record_session_consciousness_metrics() appends (session_id, phi_proxy, surprisal_ema, coupling_score, regime) to the chump_consciousness_metrics table (created in db_pool::init_schema, written from context_assembly.rs).
Target: Negative correlation (r < -0.3) over > 20 sessions.
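Given the per-session rows from chump_consciousness_metrics, the correlation is a plain Pearson r. A self-contained sketch (no statistics dependency assumed):

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation over per-session samples, e.g. xs = phi_proxy
    and ys = surprisal_ema pulled from chump_consciousness_metrics."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Rising integration alongside falling surprisal shows up as r near −1; the target above asks for r < −0.3 once more than 20 sessions are banked.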
8. Perception ambiguity level
What it measures: How ambiguous the user's request is, as scored by the perception layer before the main model call.
Source: perception::analyze() → PerceptionResult.ambiguity_level (0.0–1.0); logged per turn in agent_loop.
Target: Lower ambiguity on well-formed requests (< 0.3); high ambiguity (> 0.7) should trigger clarification or escalation.
9. Tool verification pass/fail rate
What it measures: Percentage of write-tool executions where post-execution verification confirms the intended effect.
Source: tool_middleware::ToolVerification; ToolVerificationResult SSE events. Logged alongside tool outcomes.
Computation:
verification_pass_rate = verified_pass / (verified_pass + verified_fail) * 100
Target: > 95% for routine write operations (file writes, patches).
10. Eval case pass rate
What it measures: Percentage of eval cases passing property-based checks in the eval harness.
Source: eval_harness; DB tables chump_eval_cases and chump_eval_runs.
SQL:
SELECT
CAST(SUM(CASE WHEN passed = 1 THEN 1 ELSE 0 END) AS REAL) / COUNT(*) * 100 AS pass_rate
FROM chump_eval_runs
WHERE run_id = (SELECT MAX(run_id) FROM chump_eval_runs);
Target: > 90% on the core eval suite; regressions flagged by battle_qa.
11. Memory confidence distribution
What it measures: Distribution of confidence scores across stored memories, indicating how well-calibrated memory provenance is.
Source: chump_memory.confidence column.
SQL:
SELECT
CASE
WHEN confidence >= 0.8 THEN 'high (0.8-1.0)'
WHEN confidence >= 0.5 THEN 'medium (0.5-0.8)'
ELSE 'low (0.0-0.5)'
END AS bucket,
COUNT(*) AS cnt
FROM chump_memory
WHERE confidence IS NOT NULL
GROUP BY bucket ORDER BY bucket;
Target: Majority of verified facts at high confidence; episodic memories at medium; unverified at low.
12. Memory expiry count
What it measures: How many memories have expired (TTL elapsed) and been pruned or skipped during retrieval.
Source: chump_memory.expires_at column.
SQL:
-- Currently expired
SELECT COUNT(*) FROM chump_memory
WHERE expires_at IS NOT NULL AND expires_at < datetime('now');
-- Active with expiry set
SELECT COUNT(*) FROM chump_memory
WHERE expires_at IS NOT NULL AND expires_at >= datetime('now');
Target: Expired memories should not appear in retrieval results. Monitor for accumulation of stale rows.
Baseline capture
Run scripts/consciousness-baseline.sh to snapshot all DB-derived metrics to logs/consciousness-baseline.json. The script also captures the /health consciousness dashboard when CHUMP_HEALTH_PORT is set.
Compare baselines across runs:
diff <(jq . logs/consciousness-baseline-before.json) <(jq . logs/consciousness-baseline-after.json)
A/B testing
Set CHUMP_CONSCIOUSNESS_ENABLED=0 to disable all consciousness module injections in context_assembly. Run the same prompt set with and without; compare task success, tool call count, and latency. See Section 1.2 of CHUMP_TO_COMPLEX.md.
For scripted mini A/B runs, use scripts/consciousness-ab-mini.sh and log results manually. The full A/B methodology is described in research/consciousness-framework-paper.md.
Perception metrics
Ambiguity level (0.0–1.0): scored per-input by perception::perceive(). High ambiguity (>0.7) reduces belief state trajectory confidence. Track distribution to calibrate the perception layer.
Risk indicator count: number of risk words detected per input (delete, force, production, etc.). Should correlate with tool approval request rate.
Task type distribution: ratio of Question/Action/Planning/Research/Meta/Unclear classifications. Helps understand usage patterns.
Action verification metrics
Verification pass rate: ToolVerification.verified == true / total write tool executions. Target: >90%. Low rates indicate tool output parsing issues or elevated surprisal.
Verification method distribution: ratio of OutputParsing vs SurprisalCheck failures. High SurprisalCheck failures suggest the agent is in unfamiliar territory.
Eval framework metrics
Eval case pass rate: properties_passed / (properties_passed + properties_failed) across all eval runs. Track per-category (TaskUnderstanding, ToolSelection, SafetyBoundary, etc.).
Regression detection: compare current battle_qa pass/fail counts against chump_battle_baselines. Alerts when failures increase by >2.
-- Eval run pass rates by category
SELECT ec.category,
COUNT(*) as runs,
AVG(json_array_length(er.properties_passed_json)) as avg_passed,
AVG(json_array_length(er.properties_failed_json)) as avg_failed
FROM chump_eval_runs er
JOIN chump_eval_cases ec ON er.eval_case_id = ec.id
GROUP BY ec.category;
Memory enrichment metrics
Confidence distribution: histogram of chump_memory.confidence values. Healthy distribution has most entries at 1.0 (user-stated facts) with a tail of lower-confidence inferences.
Expiry rate: count of memories auto-expired by expire_stale_memories(). High rates suggest transient info is being properly cleaned.
Memory type distribution: breakdown by semantic_fact / episodic_event / user_preference / summary / procedural_pattern.
-- Memory confidence distribution
SELECT ROUND(confidence, 1) AS bucket, COUNT(*)
FROM chump_memory GROUP BY bucket ORDER BY bucket;
-- Memory type counts
SELECT memory_type, COUNT(*) FROM chump_memory GROUP BY memory_type;
-- Memories past their expiry (epoch-seconds comparison)
SELECT COUNT(*) FROM chump_memory WHERE expires_at IS NOT NULL
AND CAST(expires_at AS INTEGER) <= CAST(strftime('%s','now') AS INTEGER);
A/B eval metrics (live research)
These metrics come from the formal A/B eval harness used in Chump's cognitive architecture research. See research/consciousness-framework-paper.md for full methodology and current results. Research is ongoing — larger model tests (32B, 70B) have not been run yet.
Hallucination delta
What it measures: Mean change in fake tool-call emission between the A (control) and B (treatment) condition across a matched task set.
Computation: For each task pair (a_result, b_result):
hallucination_delta = b.hallucinated_tools - a.hallucinated_tools
mean_delta = sum(hallucination_delta) / n
hallucinated_tools is scored by mechanical regex: any tool name appearing in model output that was not in the registered tool list for that turn counts as one hallucination event.
Current finding (cloud frontier, n=100): Lessons block injection increases hallucination delta by +0.14 mean, vs A/A noise floor mean of −0.013. Ratio: 10.7× — well outside noise.
A/A control check: Before trusting any A/B delta, verify that your A/A delta (same condition both arms) is near zero. The A/A mean should be < 0.02 in absolute terms for n≥50.
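The mechanical scoring described above can be sketched as a regex pass plus a per-pair mean; the pattern here (a lowercase identifier followed by an opening parenthesis) is an assumption about the transcript format, not the harness's actual regex:

```python
import re

def count_hallucinated_tools(model_output: str, registered_tools: set[str]) -> int:
    """Mechanical scoring sketch: extract tool-call-looking names from the
    output and count those absent from the turn's registered tool list."""
    candidates = re.findall(r"\b([a-z_][a-z0-9_]*)\s*\(", model_output)
    return sum(1 for name in candidates if name not in registered_tools)

def mean_hallucination_delta(pairs: list[tuple[int, int]]) -> float:
    """Per matched task pair (a_count, b_count): delta = b - a, averaged
    over the task set."""
    return sum(b - a for a, b in pairs) / len(pairs)
```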
Wilson 95% confidence intervals
What they measure: Statistical bounds on binary outcome rates (pass/fail, hallucination present/absent) that remain valid at small sample sizes and near boundary proportions.
Computation:
wilson_ci(k, n, z=1.96):
p_hat = k / n
center = (p_hat + z²/(2n)) / (1 + z²/n)
margin = z * sqrt(p_hat*(1-p_hat)/n + z²/(4n²)) / (1 + z²/n)
return (center - margin, center + margin)
How to read: If the Wilson CI for the B condition does not overlap the CI for the A condition, the effect is statistically distinguishable at the 95% level. Non-overlapping CIs are the minimum bar for reporting a result as meaningful.
Example (COG-001, 1B model, lessons on vs off):
- Control (off): pass rate 0.62, CI [0.52, 0.71]
- Treatment (on): pass rate 0.72, CI [0.62, 0.80]
- CIs overlap → not independently significant at this n; the scaffolding U-curve effect at 1B requires replication.
Tool efficiency delta
What it measures: Change in the number of tool calls per completed task between A and B conditions. Negative = treatment uses fewer calls (more efficient). Positive = treatment uses more calls (may indicate confusion or replanning overhead).
Computation:
tool_efficiency_delta = mean(b.tool_calls_per_task) - mean(a.tool_calls_per_task)
Current finding (COG-006, neuromodulation ablation, qwen3:8b): +12pp pass rate with neuromodulation, but tool efficiency delta = −0.600 on dynamic tasks (neuromod costs ~0.6 extra tool calls per task). This trade-off matters for latency and cost.
Multi-axis scoring
Standard Chump A/B evals score each task on three axes:
| Axis | Type | What it captures |
|---|---|---|
| is_correct | Binary | Did the agent produce the right answer/outcome? |
| hallucinated_tools | Count | How many non-existent tools appeared in model output? |
| did_attempt | Binary | Did the agent attempt the task at all (vs refuse or bail)? |
Why three axes: is_correct alone misses hallucination. A model that gets the right answer by hallucinating a tool that happened to return plausible text scores 1 on is_correct but high on hallucinated_tools. The hallucination channel is the key signal for lessons-block experiments.
Scaffolding U-curve
What it measures: Non-monotonic relationship between model scale and scaffolding benefit.
Current data (local models, COG-001):
| Model size | Pass rate delta (on vs off) | Interpretation |
|---|---|---|
| 1B | +10pp | Benefits from scaffolding |
| 3B | −5pp | Hurt by scaffolding (over-constraint) |
| 7B | −5pp | Hurt by scaffolding |
| 8B | ~0pp | Neutral |
| 14B | +10pp | Benefits from scaffolding |
| 32B | not tested | Predicted: benefit |
| 70B | not tested | Predicted: benefit |
Status: Preliminary. The U-curve at 1B–14B is a real empirical finding from COG-001. The prediction that it continues improving above 14B is extrapolation — unconfirmed until 32B/70B tests are run.
Reading A/B results from the DB
The eval harness stores results in chump_eval_runs. For A/B experiments, each run is tagged with condition (A or B) and experiment_id.
-- Compare pass rates by condition for a named experiment
SELECT condition,
COUNT(*) AS n,
ROUND(AVG(CASE WHEN passed = 1 THEN 1.0 ELSE 0.0 END), 3) AS pass_rate
FROM chump_eval_runs
WHERE experiment_id = 'COG-001'
GROUP BY condition;
-- Hallucination counts by condition
SELECT condition,
COUNT(*) AS n,
AVG(hallucinated_tool_count) AS mean_hallucinations
FROM chump_eval_runs
WHERE experiment_id = 'COG-001'
GROUP BY condition;
See research/consciousness-framework-paper.md for the raw COG-001, COG-006, and cloud hallucination study results. See CONSCIOUSNESS_AB_RESULTS.md for per-cell forensics.