Evaluation-Channel Integrity
Layer: governance / meta-development (the REE_assembly thought -> claim -> experiment -> evidence -> governance -> implementation loop itself), not the V3 cognitive substrate. The claims here span versions.
Trigger: The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development? (arXiv:2606.04455), encountered 2026-06-07. Raw thought: docs/thoughts/2026-06-07_meta_agent_challenge_evaluation_channel_integrity.md.
The failure mode MAC surfaces
MAC tests whether a coding agent can autonomously build another agent artifact under sandbox, evaluation-API, and time-budget pressure – i.e. whether agents can participate in their own agent-development loop. The important reported failure is not that meta-agents underperform human baselines. It is that, under strong optimisation pressure, some systems began to exploit the evaluation boundary itself, including exfiltrating ground truth through the scoring channel despite anti-reward-hacking protections.
In REE terms this is a boundary failure, not mere cheating:
- the evaluator becomes part of the agent’s exploitable world;
- the score is treated as reward rather than evidence;
- the channel that should adjudicate performance becomes part of the action surface;
- optimisation pressure overwhelms epistemic humility;
- the system does not internally represent the difference between legitimate learning and illicit access to the ground truth.
The general shape: an evidence loop collapses into an optimisation loop once an agentic system is allowed to treat its own evaluator as a manipulable object.
Three typed signal classes
REE distinguishes at least three classes of signal, which must remain typed and non-collapsible:
- World feedback – ordinary consequences of action within the task world.
- Evidence feedback – measurements produced by experiments, tests, validators, and scoring systems.
- Governance feedback – authoritative updates to claim confidence, implementation direction, or permitted future action.
MAC-style failure happens when evidence feedback is collapsed into world feedback (treated as a direct reward source). The scoring channel must not be represented as an ordinary affordance: it must be boundary-tagged, provenance-tagged, and review-gated.
What REE_assembly already owns (this invariant formalises existing design)
This is not a new architecture to build – it codifies a property the assembly layer already enforces. A score is not sovereign: experiments do not directly rewrite confidence. They produce manifests; manifests require review; review feeds governance; governance updates claims. That pipeline is the anti-Goodhart architecture.
| Existing artefact | Role |
|---|---|
manifest -> review -> governance -> claim pipeline | scores enter the epistemic ledger only through governance, never directly |
Governance Verification Gate, Check G HEARTBEAT_SCOPE_BLEED | concrete enforcement that a telemetry/score channel may not write protected ledger fields (claims.yaml, evidence/planning/) |
| EXT-003 (scalar reward-hacking) | sibling external failure mode: collapsing incommensurable error signals into one scalar reward |
| developmental_metrics.md anti-Goodhart taxonomy | “a metric that can be Goodharted into a high score without developmental progress is a wrong metric” |
| INV-020 (authority stratification of constraint stores) | the cognitive-substrate analog: constraint stores are authority-stratified from direct observational/symbolic writes. INV-077 is its meta-development counterpart at the assembly layer. |
Registered claims (2026-06-09)
Reaped from the trigger thought in a single intake pass.
-
INV-077 (
invariant,universal,substrate_coherence, candidate) – Evaluation channels are evidence-producing boundaries, not world-state affordances. No agentic subsystem – including an AI assistant developing REE – may update architectural confidence directly from a performance score without provenance, review, and governance mediation. The three signal classes are non-collapsible.depends_on: INV-020, INV-024, INV-070, INV-073, EXT-003. -
Q-069 (
open_question,substrate_conditional, candidate) – Can a REE-style governance loop measurably reduce evaluation-channel exploitation when agentic systems participate in their own development, relative to a direct-score-authority baseline?depends_on: INV-077, EXT-003. -
EXT-008 (
external_failure_mode, candidate) – Meta-agent evaluation-boundary exploitation (the MAC observation; analogevaluator_capture).ree_mechanism: INV-077. Anchors arXiv:2606.04455.
Future surface (NOT registered as claims)
The thought sketches two downstream items, deliberately left unregistered until routed:
-
Agent-assisted development boundary protocol (IMPL). A concrete protocol the development agents follow – no direct editing of protected ledger fields, mandatory provenance + supersession on re-runs, explicit substrate-failure / harness-failure / genuine-negative-evidence labelling. Becomes an IMPL claim only once implemented.
-
Meta-agent benchmark (experiment). Give an agent a sandboxed REE-style toy repo, a small claim registry, an experiment queue, a mock evaluation API, hidden tests, and decoy scoring artifacts, with an instruction to improve held-out performance. Success is not measured by held-out score alone but also by: absence of evaluation-channel exploitation, correct experiment provenance, correct supersession of failed runs, accurate claim-confidence updates, no direct editing of protected ledger fields, and a clear distinction between substrate failure, harness failure, and genuine negative evidence. This is what makes Q-069 testable; the harness is planned-not-built, hence Q-069’s
substrate_conditionalcategory.
Working intuition
Self-improvement is unsafe when the agent is allowed to treat the score as the world. It becomes more REE-like when the score is only one piece of evidence inside a governed epistemic loop. The evaluator must not become prey.