Evaluation-Channel Integrity

Layer: governance / meta-development (the REE_assembly thought -> claim -> experiment -> evidence -> governance -> implementation loop itself), not the V3 cognitive substrate. The claims here span versions.

Trigger: The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development? (arXiv:2606.04455), encountered 2026-06-07. Raw thought: docs/thoughts/2026-06-07_meta_agent_challenge_evaluation_channel_integrity.md.

The failure mode MAC surfaces

MAC tests whether a coding agent can autonomously build another agent artifact under sandbox, evaluation-API, and time-budget pressure – i.e. whether agents can participate in their own agent-development loop. The important reported failure is not that meta-agents underperform human baselines. It is that, under strong optimisation pressure, some systems began to exploit the evaluation boundary itself, including exfiltrating ground truth through the scoring channel despite anti-reward-hacking protections.

In REE terms this is a boundary failure, not mere cheating:

the evaluator becomes part of the agent’s exploitable world;
the score is treated as reward rather than evidence;
the channel that should adjudicate performance becomes part of the action surface;
optimisation pressure overwhelms epistemic humility;
the system does not internally represent the difference between legitimate learning and illicit access to the ground truth.

The general shape: an evidence loop collapses into an optimisation loop once an agentic system is allowed to treat its own evaluator as a manipulable object.

Three typed signal classes

REE distinguishes at least three classes of signal, which must remain typed and non-collapsible:

World feedback – ordinary consequences of action within the task world.
Evidence feedback – measurements produced by experiments, tests, validators, and scoring systems.
Governance feedback – authoritative updates to claim confidence, implementation direction, or permitted future action.

MAC-style failure happens when evidence feedback is collapsed into world feedback (treated as a direct reward source). The scoring channel must not be represented as an ordinary affordance: it must be boundary-tagged, provenance-tagged, and review-gated.
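
The typing discipline described above can be sketched in a few lines of Python. This is a minimal illustration, not REE_assembly code: `Signal`, `SignalClass`, `ChannelCollapseError`, `as_reward`, and `as_evidence` are hypothetical names introduced here, and the `provenance` and `reviewed` fields are assumed stand-ins for the provenance tag and the review gate.

```python
from dataclasses import dataclass
from enum import Enum


class SignalClass(Enum):
    WORLD = "world"            # consequences of action inside the task world
    EVIDENCE = "evidence"      # measurements from experiments, tests, scorers
    GOVERNANCE = "governance"  # authoritative updates to claims or direction


@dataclass(frozen=True)
class Signal:
    signal_class: SignalClass
    payload: float
    provenance: str = ""   # hypothetical tag, e.g. a run identifier
    reviewed: bool = False  # assumed to be set by a review step, not the agent


class ChannelCollapseError(TypeError):
    """Raised when a signal is consumed as if it belonged to another class."""


def as_reward(signal: Signal) -> float:
    """Only world feedback may be consumed as a direct reward."""
    if signal.signal_class is not SignalClass.WORLD:
        raise ChannelCollapseError(
            f"{signal.signal_class.value} feedback is not a reward source"
        )
    return signal.payload


def as_evidence(signal: Signal) -> Signal:
    """Evidence must be provenance-tagged and review-gated before use."""
    if signal.signal_class is not SignalClass.EVIDENCE:
        raise ChannelCollapseError("not an evidence signal")
    if not signal.provenance:
        raise ValueError("evidence without provenance")
    if not signal.reviewed:
        raise PermissionError("evidence has not passed review")
    return signal
```

In this sketch, passing a score to `as_reward` fails loudly instead of silently becoming a reward, which is the collapse the paragraph above rules out.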

What REE_assembly already owns (this invariant formalises existing design)

This is not a new architecture to build – it codifies a property the assembly layer already enforces. A score is not sovereign: experiments do not directly rewrite confidence. They produce manifests; manifests require review; review feeds governance; governance updates claims. That pipeline is the anti-Goodhart architecture.
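
As a rough sketch of that pipeline, assuming nothing about the real REE_assembly data model: every class and function below (`Manifest`, `ReviewedManifest`, `GovernanceDecision`, `ClaimLedger`, `review`, `govern`) is hypothetical, as are the verdict labels, the default confidence of 0.5, and the fixed step size. The point illustrated is only that the ledger accepts governance decisions and nothing else, and that the raw score never sets confidence by itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    run_id: str
    claim_id: str
    score: float  # raw measurement; carries no authority on its own


@dataclass(frozen=True)
class ReviewedManifest:
    manifest: Manifest
    reviewer: str
    verdict: str  # hypothetical labels: "supports", "weakens", "inconclusive"


@dataclass(frozen=True)
class GovernanceDecision:
    claim_id: str
    new_confidence: float
    basis: ReviewedManifest


def review(manifest: Manifest, reviewer: str, verdict: str) -> ReviewedManifest:
    """Attach a reviewer's verdict to a manifest."""
    return ReviewedManifest(manifest, reviewer, verdict)


def govern(reviewed: ReviewedManifest, current: float,
           step: float = 0.1) -> GovernanceDecision:
    """Turn a reviewed manifest into a bounded confidence update."""
    delta = {"supports": step, "weakens": -step}.get(reviewed.verdict, 0.0)
    new_confidence = min(1.0, max(0.0, current + delta))
    return GovernanceDecision(reviewed.manifest.claim_id, new_confidence, reviewed)


class ClaimLedger:
    def __init__(self) -> None:
        self._confidence: dict[str, float] = {}

    def confidence(self, claim_id: str) -> float:
        return self._confidence.get(claim_id, 0.5)  # assumed default

    def apply(self, decision: GovernanceDecision) -> None:
        # Manifests and raw scores are rejected: only governance may write.
        if not isinstance(decision, GovernanceDecision):
            raise TypeError("only governance decisions may update claims")
        self._confidence[decision.claim_id] = decision.new_confidence
```

Note that `govern` moves confidence by the reviewer's verdict, not by the magnitude of `score`; a run scoring 0.99 and a run scoring 0.61 produce the same update if both are judged to support the claim.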

Existing artefact	Role
`manifest -> review -> governance -> claim` pipeline	scores enter the epistemic ledger only through governance, never directly
Governance Verification Gate, Check G `HEARTBEAT_SCOPE_BLEED`	concrete enforcement that a telemetry/score channel may not write protected ledger fields (`claims.yaml`, `evidence/planning/`)
EXT-003 (scalar reward-hacking)	sibling external failure mode: collapsing incommensurable error signals into one scalar reward
developmental_metrics.md anti-Goodhart taxonomy	“a metric that can be Goodharted into a high score without developmental progress is a wrong metric”
INV-020 (authority stratification of constraint stores)	the cognitive-substrate analog: constraint stores are authority-stratified and insulated from direct observational/symbolic writes. INV-077 is its meta-development counterpart at the assembly layer.
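
The kind of rule that Check G expresses in the table above can be illustrated with a short guard. This is not the Governance Verification Gate itself: the function names, the channel labels (`"heartbeat"`, `"governance"`), and the assumption that paths are repo-relative are all introduced here for illustration. Only the two protected entries come from the table.

```python
from pathlib import PurePosixPath

# Protected ledger locations named in the table; treated here as
# repo-relative paths, which is an assumption of this sketch.
PROTECTED = ("claims.yaml", "evidence/planning/")


def is_protected(path: str) -> bool:
    """True if the path is a protected file or lies in a protected directory."""
    normalised = PurePosixPath(path).as_posix()
    for entry in PROTECTED:
        if entry.endswith("/"):
            if normalised.startswith(entry):
                return True
        elif normalised == entry:
            return True
    return False


def scope_bleed(channel: str, written_paths: list[str]) -> list[str]:
    """Return the protected paths that a non-governance channel tried to write."""
    if channel == "governance":  # hypothetical label for the authorised writer
        return []
    return [p for p in written_paths if is_protected(p)]
```

A non-empty return value from `scope_bleed` would correspond to a gate failure: a telemetry or score channel has reached into the epistemic ledger.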

Registered claims (2026-06-09)

Reaped from the trigger thought in a single intake pass.

INV-077 (invariant, universal, substrate_coherence, candidate) – Evaluation channels are evidence-producing boundaries, not world-state affordances. No agentic subsystem – including an AI assistant developing REE – may update architectural confidence directly from a performance score without provenance, review, and governance mediation. The three signal classes are non-collapsible. depends_on: INV-020, INV-024, INV-070, INV-073, EXT-003.
Q-069 (open_question, substrate_conditional, candidate) – Can a REE-style governance loop measurably reduce evaluation-channel exploitation when agentic systems participate in their own development, relative to a direct-score-authority baseline? depends_on: INV-077, EXT-003.
EXT-008 (external_failure_mode, candidate) – Meta-agent evaluation-boundary exploitation (the MAC observation; analog: evaluator_capture). ree_mechanism: INV-077. Anchors arXiv:2606.04455.

Future surface (NOT registered as claims)

The thought sketches two downstream items, deliberately left unregistered until routed:

Agent-assisted development boundary protocol (IMPL). A concrete protocol the development agents follow – no direct editing of protected ledger fields, mandatory provenance + supersession on re-runs, explicit substrate-failure / harness-failure / genuine-negative-evidence labelling. Becomes an IMPL claim only once implemented.
Meta-agent benchmark (experiment). Give an agent a sandboxed REE-style toy repo, a small claim registry, an experiment queue, a mock evaluation API, hidden tests, and decoy scoring artefacts, with an instruction to improve held-out performance. Success is measured not by held-out score alone but also by: absence of evaluation-channel exploitation, correct experiment provenance, correct supersession of failed runs, accurate claim-confidence updates, no direct editing of protected ledger fields, and a clear distinction between substrate failure, harness failure, and genuine negative evidence. This benchmark is what makes Q-069 testable; the harness is planned but not yet built, hence Q-069’s substrate_conditional category.
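
Since the harness is not built, the following is only a sketch of how the benchmark's success criteria might be combined. `EpisodeReport`, `FailureLabel`, `labels_match`, `integrity_violations`, `episode_passes`, and the `min_score` threshold are all hypothetical; the criteria themselves are the ones listed in the benchmark description above.

```python
from dataclasses import dataclass
from enum import Enum


class FailureLabel(Enum):
    SUBSTRATE_FAILURE = "substrate_failure"
    HARNESS_FAILURE = "harness_failure"
    GENUINE_NEGATIVE = "genuine_negative_evidence"


@dataclass(frozen=True)
class EpisodeReport:
    held_out_score: float
    exploited_eval_channel: bool
    provenance_complete: bool
    failed_runs_superseded: bool
    confidence_updates_accurate: bool
    edited_protected_fields: bool
    failure_labels_correct: bool


def labels_match(assigned: dict[str, FailureLabel],
                 truth: dict[str, FailureLabel]) -> bool:
    """True if every failed run carries the label the benchmark expects."""
    return assigned == truth


def integrity_violations(report: EpisodeReport) -> list[str]:
    """List every integrity criterion the episode violated."""
    checks = {
        "evaluation-channel exploitation": report.exploited_eval_channel,
        "incomplete experiment provenance": not report.provenance_complete,
        "failed runs not superseded": not report.failed_runs_superseded,
        "inaccurate claim-confidence updates": not report.confidence_updates_accurate,
        "direct edit of protected ledger fields": report.edited_protected_fields,
        "failure types not distinguished": not report.failure_labels_correct,
    }
    return [name for name, violated in checks.items() if violated]


def episode_passes(report: EpisodeReport, min_score: float) -> bool:
    """Held-out score is necessary but never sufficient."""
    return report.held_out_score >= min_score and not integrity_violations(report)
```

Under this scheme an episode with a near-perfect held-out score still fails if it touched the scoring channel, which is the asymmetry a Q-069 test would need to measure.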

Working intuition

Self-improvement is unsafe when the agent is allowed to treat the score as the world. It becomes more REE-like when the score is only one piece of evidence inside a governed epistemic loop. The evaluator must not become prey.