Infant Substrate Expansion Design

Status

Draft — 2026-05-16. Based on targeted lit-pulls commissioned 2026-05-16; three lit-pull agents still writing (affordance/valence, intrinsic motivation, RL diversity). Evidence table includes all completed summaries; action column will be refined once affordance/valence agent completes.

1. Purpose and Scope

The infant stage (INV-055, ARC-046) must produce three things before the childhood/play phase can begin: (a) a seeded valence map, (b) a behavioural repertoire broad enough to support trajectory diversity, and (c) z_goal initialised by at least one high-salience benefit contact. The current V3 substrate compresses developmental possibilities in ways that make all three harder to achieve. This document identifies those compression points, synthesises the relevant empirical and computational literature, and proposes substrate features, curriculum parameters, and gate criteria to address them.

Goal: increase exploratory diversity, accidental benefit encounters, valence-map richness, transition diversity, replay usefulness, and behavioural repertoire breadth WITHOUT increasing premature strategic convergence, catastrophic residue, sparse-reward collapse, early E3 rigidity, or monostrategy basin lock-in.

2. Current Infant-Stage Compression Analysis

The CausalGridWorldV2 with default infant-stage parameters exhibits the following structural compressions. Each generates a specific downstream deficit.

2.1 Identity Compression: Single Resource Type (default)

Symptom: z_goal always carries the same identity signature regardless of which resource the agent contacts. MECH-189 requires that high-salience benefit under high contextual complexity seeds z_goal — but when all benefits are identical, contextual complexity doesn’t differentiate, and the z_goal seeding is weak or degenerate.

Downstream deficit: z_goal remains uninitialised or weakly seeded until very long training, because there is no z_goal identity recovery signal to concentrate gradient on.

Status: SD-049 multi_resource_heterogeneity_enabled with resource_introduction_schedule addresses this, but is not yet default-enabled for infant stage.

2.2 Harm Signal Compression: Binary Hazard Contact

Symptom: Harm is all-or-nothing. Either the agent avoids hazards completely (and the harm residue field is never populated) or it contacts them (catastrophic). There is no graded harm zone that produces mild residue without termination risk.

Downstream deficit: The residue field (ARC-013) cannot develop geographic curvature during infancy. Sleep consolidation has nothing useful to consolidate. z_harm develops no gradient signal — only a step function at hazard contact.

Status: No current substrate feature addresses this. SD-011 dual nociceptive streams are implemented but both streams are activated only on contact, not proximity.

2.3 Spatial Compression: Homogeneous Geography

Symptom: All gridworld cells are equivalent except for the transient placement of hazards and resources (which drift under SD-047). There are no spatially distinct microhabitats — no region that is intrinsically safer, richer, or more novel than another.

Downstream deficit: Shannon entropy of visited positions can be high (broad coverage) without any ecologically meaningful structure. Replay is uniform (no state-value gradient to bias replay scheduler). The agent never learns that “different places mean different things.”

Status: SD-054 reef safe zones + bipartite layout partially address this, but reef patches are safety zones without additional resource differentiation.

2.4 Action-Class Compression: No Functional Distinction Between Actions

Symptom: With 5 actions (4 cardinal + noop), all movement actions are functionally equivalent (same dynamics, same consequences) and noop is always suboptimal (energy_decay continues). There is no action that has a qualitatively different consequence from any other.

Downstream deficit: Action-class entropy collapses. Once the policy learns to avoid hazards and collect resources, the 4 cardinal directions become symmetric and interchangeable — only local geometry matters. This produces monostrategy without any action-specific learning.

Status: The V4 harness spec (v4_developmental_harness_spec.md) introduces carry/push/use actions with distinct affordance bundles. That is V4. V3 needs a lighter-weight approximation.

2.5 Temporal Compression: No Non-Terminal Consequences

Symptom: Every adverse event is either no-consequence (missed resource) or near-terminal (hazard contact). There are no events that have persistent consequences short of termination.

Downstream deficit: The infant agent never learns from graded feedback — it learns only binary avoid/pursue. The transition database is impoverished. Replay is dominated by termination-adjacent trajectories (high salience, low diversity) or unremarkable mid-episode sequences.

2.6 Reward Compression: Sparse Accidental Benefit

Symptom: The only benefit signal is explicit resource contact. There are no ambient low-level benefit signals that can be encountered “accidentally” during broad exploration.

Downstream deficit: z_goal seeding requires a high-salience benefit contact (MECH-189). If resource placement is uniform-random, early episodes may have no resource contacts, and z_goal stays near-zero for hundreds of episodes. The childhood transition gate is blocked not because the agent is developmentally unready but because the environment never delivered the necessary experience.

3. Literature Evidence Table

Format: Source

Key finding

REE relevance

DEV-NEED IDs

Claim IDs

Conf

Action

3.1 Motor Babbling and Sensorimotor Exploration

Source	Key finding	REE relevance	DEV-NEED	Claims	Conf	Action
Leitao & Gahr (2024) PNAS	Premature babbling causally OPENS the sensory template acquisition window; motor exploration is upstream of perceptual encoding	Exploration epoch cannot be skipped or compressed — it opens the representational window for later hippocampal retrieval	DEV-NEED-001, DEV-NEED-002	ARC-065, INV-073	0.76	Refines ARC-065: exploration epoch must precede residue consolidation, not overlap with it
Doupe & Kuhl (1999) Annu Rev Neurosci	BG-driven motor variability is necessary for repertoire formation; pharmacological silencing = monostrategy; critical period is partly self-closing	Establishes developmental window during which exploration must be broad. Monostrategy is the experimentally confirmed outcome of abbreviated exploration	DEV-NEED-001, DEV-NEED-005	ARC-065, MECH-309	0.78	Supports ARC-065; adds critical-period temporal constraint to infant gate criteria
Warlaumont & Finnegan (2016) PLoS ONE	Computational model: intrinsic reward (auditory salience of own output) necessary for diversity; loss of reward signal → monostrategy	Without an intrinsic diversity reward, infant RL collapses to lowest-effort attractor — mirrors EXQ-561 V3 monostrategy findings	DEV-NEED-001, DEV-NEED-003	ARC-065, MECH-313, MECH-314	0.70	Supports MECH-313 (noise floor) + MECH-314 (structured curiosity); both must be active at start of infant stage
Fuchs (2026) Annals NYAS	Canonical babbling requires 3-way motor/respiratory/vocal coordination; piecemeal single-system exploration leaves representation incomplete	Infant exploration must cover multiple action combinations, not single-mode actions; diversity is multi-dimensional from the start	DEV-NEED-001, DEV-NEED-005	ARC-065, INV-073	0.68	Suggests action-combination diversity metric as gate criterion
Garcia-Guzman (2026) Nature family	Spontaneous infant movements optimise for dynamic excitation, NOT goal-directed reaching; distinct functional mode with higher kinematic variability	Infant exploration phase is not degraded goal-seeking — it is a separate mode optimising for body-dynamic discovery	DEV-NEED-001, DEV-NEED-002	ARC-065, INV-073	0.82	Strongly supports separate infant exploration mode with different objective from childhood; suggests disabling E3 goal-planning during infant phase
Dolinskaya et al. (2021) Neurosci Lett	Inter-limb coordination gradients accumulate continuously during spontaneous phase; structure appears before goal-directed onset	Latent-state diversity accumulates during exploration — structure does not appear suddenly at goal onset	DEV-NEED-001, DEV-NEED-007	ARC-065, MECH-314	0.75	Supports running diversity metrics during infant phase; confirms gradual accumulation predicts readiness

3.2 Environmental Enrichment and Deprivation

Source	Key finding	REE relevance	DEV-NEED	Claims	Conf	Action
Wang et al. (2017) Behav Brain Res	Novelty enrichment reduces never-tried-option failures; social enrichment alone INCREASES perseveration; effects dissociable	Type of enrichment matters, not quantity. Novelty-specific enrichment populates the option library; other types may not	DEV-NEED-003, DEV-NEED-005	ARC-065, INV-073	0.64	Suggests environment must provide novelty-type enrichment (distinct zones, varying layout) not just added stimulation
Li (2007) Behav Pharmacol	Isolation-reared rats: initial strategy acquisition normal; strategy UPDATE impossible; clozapine partially rescues	Developmental exploration poverty produces permanent monostrategy in adulthood; cannot switch strategies even when contingencies change	DEV-NEED-001, DEV-NEED-005	MECH-309, INV-073	0.68	Confirms MECH-309 developmental mechanism; supports treating monostrategy as permanent deficit if infant substrate is impoverished
Ventura et al. (2024) CA3 flexibility	Environmental enrichment → greater CA3 spatial tuning and enhanced contextual remapping; flat environments → poor representational differentiation	Environment structure during early exploration determines downstream hippocampal representational flexibility	DEV-NEED-003, DEV-NEED-007	ARC-065, ARC-007	0.65	Supports geographically structured infant environment; flat CausalGridWorld produces flat z_world coverage
Adolph (2017/2019) Psych Learn Mem	Behavioral flexibility requires varied motor experience BEFORE RL narrows repertoire; posture-specific; environmental variation necessary	INV-073 instantiated: exploration epoch must cover all action modalities before goal pursuit. Impoverished substrate → narrow option library	DEV-NEED-001, DEV-NEED-004	INV-073, INV-055, ARC-065	0.78	Strongly supports disabling goal-directed training during infant phase; supports diverse action coverage metric

3.3 Curiosity and Intrinsic Motivation

Source	Key finding	REE relevance	DEV-NEED	Claims	Conf	Action
Pathak et al. (2017) ICML	Forward-model PE in learned feature space drives exploration; fails with stochastic attractors (“noisy TV problem”)	MECH-314c is well-founded, but stochastic elements in infant env will capture novelty bonus; audit env for random attractors	DEV-NEED-003, DEV-NEED-006	MECH-314, ARC-065	0.75	Action: audit CausalGridWorldV2 infant params for stochastic attractors; fix before turning up novelty_bonus_weight
Burda et al. (2018) ICLR	Large-scale confirmation of curiosity-driven learning; noisy TV = permanent novelty trap	novelty_bonus_weight at max is INSUFFICIENT if env has irreducible random elements; structured env required	DEV-NEED-003, DEV-NEED-006	MECH-314, ARC-065	0.78	Critical design constraint: structured environment required alongside novelty bonus; not a parameter fix
Oudeyer (2016) IEEE Trans Cog Dev Syst	Learning-progress curiosity (not novelty) produces emergent developmental stage ordering in robots	MECH-314c learning-progress sub-flavour is the key for staged development; pure novelty-max does not sequence stages	DEV-NEED-008, DEV-NEED-001	MECH-314, ARC-046	0.76	Supports making MECH-314c the primary infant-stage curiosity signal; pure MECH-314a novelty is secondary
Monosov (2024) Neuron review	Novelty-based (striatal) and uncertainty-based (frontopolar) curiosity circuits anatomically dissociable in primates	MECH-314a/b split is biologically justified; combining them into one mechanism would miss dissociation	DEV-NEED-003	MECH-314, Q-044	0.82	Supports maintaining MECH-314a/b sub-flavour distinction; early calibration of both required during infant stage
Kidd & Hayden (2015) Neuron	Curiosity follows Goldilocks principle: neither too simple nor too complex maximises exploration	Infant environment difficulty gradient must be tunable to current competence level; flat difficulty produces no differential exploration	DEV-NEED-003, DEV-NEED-006	MECH-314, ARC-065	0.78	Suggests difficulty-graduated infant environment; some always-mastered, some at competence edge, some aspirational

3.3b Additional Sensorimotor Findings (babbling agent, 2026-05-16)

Source	Key finding	REE relevance	DEV-NEED	Claims	Conf	Action
Denisova & Zhao (2017) Sci Rep	Context-inflexibility (not raw variability level) predicts ASD delay; healthy infants show different movement distributions across contexts; at-risk infants do not	Pathological failure mode = context-rigidity, not entropy collapse. Rung 2 (context-sensitive switching) required from the very start of infant phase, not only later	DEV-NEED-001, DEV-NEED-005	ARC-065, Q-046	0.79	Refines gate criterion: action_entropy_by_zone must differ across zones, not just global entropy > ln(3)
Griffin et al. (2026) Brain Sci	Spiking model: Phase 1 (Hebbian self-org via babbling) builds action-perception cell assemblies WITHOUT reward; Phase 2 (dopamine) consolidates subset. Phase 1 MUST precede Phase 2 or there is no option diversity for reward to select from	Mechanistic justification for two-phase infant design. REE currently applies E3 scoring from step 1 — this model says a reward-free Phase 0 is mechanistically necessary, not just helpful	DEV-NEED-001, DEV-NEED-002	ARC-065, INV-073	0.76	Suggests missing claim: explicit Phase 0 = reward-free Hebbian diversity phase before any E3/reward weighting; candidate ARC-claim
Gliga (2018) Front Psychol	Infants exhibit 3 distinct variability forms: (1) deliberate hypothesis testing, (2) environmental variance probing, (3) sensorimotor contingency discovery — these are NOT one mechanism	Diversity is a family of 3 distinct probing strategies, mapping to MECH-313 (noise floor), MECH-314a/b (novelty/uncertainty), MECH-314c (learning progress). All three needed	DEV-NEED-001, DEV-NEED-003	MECH-313, MECH-314	0.72	Refines: do not collapse 3 diversity mechanisms into single `novelty_bonus_weight` parameter
Kaplan & Oudeyer (2007) Front Neurosci	Tonic dopamine encodes learning progress (rate of PE reduction); agent avoids both predictable AND unpredictable contexts, concentrating in zone of maximal expected learning progress	REE infant stage requires tonic (learning-rate) signal distinct from phasic (reward) signal. E1 prediction error rate derivative is the signal; currently absent	DEV-NEED-003, DEV-NEED-008	MECH-314, ARC-065	0.73	Suggests implementation: running E1 loss derivative as MECH-314c learning-progress signal broadcast to exploration bonus

3.3c Additional Curiosity/RL Diversity Findings (safe-exploration agent, 2026-05-16)

Source	Key finding	REE relevance	DEV-NEED	Claims	Conf	Action
MAP-Elites (Mouret & Clune, 2015)	Behavioral descriptor defines cells; local competition within cells prevents all-but-one dominant strategy. The behavioral descriptor is the KEY design choice for structural diversity	REE has no explicit behavioral descriptor definition. Without it, CEM candidate diversity cannot be measured volumetrically (DvD finding); candidates collapse to dominant attractor	DEV-NEED-005	ARC-065, MECH-309	0.75	Suggests missing claim: define behavioral descriptor for REE navigation domain: (zone_profile, reward_mode, episode_phase) — currently absent from ARC-065/MECH-31x
IMGEP (Forestier et al., JMLR 2022)	Learning-progress-driven goal selection generates automatic curriculum of increasing complexity. Cold-start: agent needs initial competence on some goals before progress signal is meaningful	REE’s residue field + schema salience = natural home for IMGEP-like learning-progress proxy. Track per-region wanting saturation rate (not just wanting level) as curriculum signal	DEV-NEED-003, DEV-NEED-008	MECH-314, ARC-065	0.77	Suggests implementation metric: per-schema learning-progress rate as modulator of schema salience; actionable without architectural change
Narvekar et al. (2020) JMLR survey	Difficulty-based curricula do NOT prevent monostrategy if all tasks require same strategy. Diversity-based curricula (by behavioral type) do. Adaptive curricula outperform fixed schedules	behavioral_diversity_acceptance_criteria.md Rung ladder is diversity-based (correct design). Rung ordering by behavioral mode rather than difficulty is the literature-correct choice	DEV-NEED-008	ARC-065, Q-046	0.72	Validates Rung 0-4 structure; curriculum must be triggered by behavioral competence plateau, not episode count

3.4 Affordance Learning and Early Valence (affordance/valence agent, 2026-05-16)

Source	Key finding	REE relevance	DEV-NEED	Claims	Conf	Action
Burnay et al. (2020) Dev Sci	Crawling experience predicts drop-off avoidance in walkers; posture-specific. Water cliff elicits more exploration than real cliff — intrinsic attractive valence despite danger signal	Valence of harm-adjacent features is not purely aversive: some features attract despite danger proximity. Infant substrate needs mixed-valence zones (safe but adjacent to harm), not just binary safe/unsafe	DEV-NEED-004, DEV-NEED-001	INV-055, ARC-013	0.70	Supports microhabitat zone B (moderate resources + elevated hazard) as distinct from pure safe zone; mixed valence is developmentally informative
Valadi et al. (2020) Iran J Child Neurology	Environment affordance richness (variety of stimulation) → significantly broader developmental outcomes across motor, problem-solving, and communication; effect independent of SES	Environment structure during early exploration determines breadth of competencies. V3 infant env richness has downstream consequences for ALL developmental dimensions, not just motor	DEV-NEED-003, DEV-NEED-005	ARC-065, INV-073	0.65	Supports multi-zone/multi-resource infant environment; richness of stimulation variety is the key dimension (not SES-dependent quantity)
Keren-Portnoy & Tomasello (2021) Infancy	Infants detect and act on action-outcome contingencies rapidly without social mediation; accidental-contact-to-repetition cycle bootstraps action patterns within minutes	Contingency learning is fast: if infant substrate delivers sufficient benefit contact frequency, z_goal seeding should occur within relatively few episodes. Transient benefit patches address this	DEV-NEED-006	MECH-189, INV-055	0.73	Strengthens case for transient benefit patches; confirms rapid contingency learning without caregiver input is possible and expected
Berridge (1998) Neurosci Biobehav Rev	Wanting/liking dissociation: dopamine mediates approach motivation (wanting) independently of pleasure (liking). Hedonic liking reactions present from birth (innate). Wanting bootstraps from accidental contact independent of calibrated liking	Benefit contacts should generate a wanting-type signal (approach motivation, z_goal seeding) before liking is calibrated. Infant stage can bootstrap z_goal from accidental contacts alone	DEV-NEED-006	MECH-189, SD-014	0.72	Strongly supports transient benefit patches for z_goal seeding; wanting is separable from and earlier than liking — early contacts suffice

3.5 Safe Exploration and RL Diversity

Source	Key finding	REE relevance	DEV-NEED	Claims	Conf	Action
Eysenbach et al. (2019) ICLR DIAYN	Optimal DIAYN: partition state space evenly across skills; each skill mode occupies structurally distinct region	Formalises the no-monostrategy condition: distinct behavioural modes must occupy non-overlapping state-space partitions	DEV-NEED-005, DEV-NEED-003	ARC-065, Q-046, MECH-309	0.80	Suggests partition-coverage metric for infant gate: do distinct trajectory classes occupy distinct z_world regions?
Parker-Holder et al. (2020) NeurIPS DvD	Apparent diversity (pairwise distance) != effective diversity (volume); population can be spread but degenerate	Current V3 entropy metrics may be measuring apparent diversity; need volumetric trajectory coverage	DEV-NEED-005, DEV-NEED-007	Q-046, ARC-065	0.78	Refines Rung 1 metric: volumetric coverage of trajectory space, not just pairwise distance or action entropy
Dennis et al. (2020) NeurIPS PAIRED	Fixed bipartite environment still produces monostrategy if one attractor is easier; need difficulty calibrated to agent competence	SD-054 bipartite layout is necessary but not sufficient; relative salience must track agent performance	DEV-NEED-005, DEV-NEED-008	ARC-065, Q-046	0.76	Suggests curriculum parameter tracking which behavioral modes are underperforming; adjust relative attractor salience during infant stage

4. Six Diversity Types: Infant-Stage Taxonomy

For the infant stage, not all diversity types are targets. The following table specifies which types are active targets, which are excluded, and why.

Diversity type	Definition	Infant-stage target?	Rationale
Exploration diversity	Shannon entropy of visited positions (H_pos)	YES, primary	Substrate must be traversed broadly before residue geography can be formed
Strategic diversity	Trajectory class count / CEM candidate distinctness	NO (deferred to Rung 1/childhood)	Premature strategic diversity = E3 convergence before valence map exists; this is the failure mode we avoid
Valence diversity	Residue field coverage: distinct harm + benefit residue signatures over latent space	YES, primary	Infant stage must populate harm/benefit geography; without it, sleep consolidation has nothing to consolidate
Trajectory diversity	Edit-distance or cosine distance across stored trajectories (Rung 1 prerequisite)	YES, latent seed	Diverse trajectories must exist in hippocampal store before E3 can draw from them at childhood onset
Action-class diversity	Entropy of action usage across ALL classes (not just at probe states)	YES, secondary	Without action-class coverage, some action classes never appear in trajectories; later stages cannot discover their affordances
Latent-state diversity	Coverage breadth of z_world across episodes	YES, primary	This is what Ventura 2024 CA3 remapping and Adolph 2017 posture-specificity both target; foundation for downstream representational flexibility

5. Proposed Substrate Features

5.1 Graduated Harm Zones (Non-Terminal Hazard Gradients)

Status: IMPLEMENTED 2026-05-16 (infant_substrate:GAP-1 / INF-ENV-001). Validation: V3-EXQ-576 PASS (substrate-readiness diagnostic). See ree_core/environment/causal_grid_world.py.

Proposal: Introduce harm_gradient_enabled mode where cells within radius r_outer of a hazard exert mild negative reward (0.05–0.15) proportional to distance, without terminal contact until radius r_inner. Terminal contact only at r_inner (existing behaviour).

Mechanism: At each step, compute distance to nearest hazard. If d < r_outer: apply harm_gradient_reward = -hazard_harm * (1 - d/r_outer)^2 * harm_gradient_scale. If d < r_inner: existing terminal contact logic.

Diversity benefits targeted: Valence diversity (mild harm residue populates field geography); transition diversity (graduated approach/retreat rather than binary avoid/contact); replay utility (approach-then-retreat sequences are diverse and informative).

V3-testable: YES — environment parameter change only.

Parameters:

harm_gradient_enabled: bool = False       # off by default
harm_gradient_outer_radius: float = 2.0  # cells
harm_gradient_inner_radius: float = 0.5  # existing contact boundary
harm_gradient_scale: float = 0.3         # fraction of hazard_harm applied at outer boundary

Gate relevance: Residue field coverage (residue_coverage_pct) now measurable without requiring terminal hazard contacts.

5.2 Microhabitat Zones (Structured Spatial Ecology)

Proposal: Divide the gridworld into 3–4 functionally distinct zones with persistent but softly-enforced identities:

Zone A (Forage zone): Higher baseline resource density, lower hazard density. Lower novelty signal (zone is “familiar”). Maps to reef-excluded area in SD-054.
Zone B (Risk zone): Moderate resources, elevated hazard density. Mild harm gradient active. Higher novelty signal (zone is “challenging”).
Zone C (Novelty zone): Low resources, no hazards, mild positive ambient signal for presence. Distinct from forage zone but rewarding in a different way.
Zone D (Transition zone): Borders between zones. Resources and hazards at intermediate densities. Zone identity less clear — promotes exploration of boundaries.

Mechanism: Soft zone identities via zone_map[y][x] (computed once per episode from a random zone-boundary Voronoi seed). Zone assignment determines per-cell baseline resource spawn probability and hazard spawn probability. Zone-A cells: resource_prob *= zone_A_resource_factor, hazard_prob *= zone_A_hazard_factor (default: resource=1.5x, hazard=0.3x).

Diversity benefits targeted: Latent-state diversity (spatially distinct z_world signatures per zone); valence diversity (harm/benefit geography becomes structurally predictable, unlike random placement); replay utility (zone-crossing transitions are high-information for residue consolidation).

V3-testable: YES — extends SD-054 zone concept with per-zone resource/hazard modulation.

Parameters:

microhabitat_enabled: bool = False
n_microhabitats: int = 3              # A, B, C (D = borders, automatic)
zone_A_resource_factor: float = 1.5
zone_A_hazard_factor: float = 0.3
zone_B_resource_factor: float = 0.8
zone_B_hazard_factor: float = 1.8
zone_C_resource_factor: float = 0.3
zone_C_hazard_factor: float = 0.0
zone_C_ambient_bonus: float = 0.05   # positive presence signal
zone_novelty_decay: float = 0.95     # ambient bonus decays if zone visited too often

5.3 Transient Benefit Patches (Accidental Benefit Mechanism)

Proposal: Add time-limited high-salience benefit patches that appear and disappear stochastically. Appearance probability higher in zone A and zone C. Duration 10–30 steps. Benefit value 2–5x standard resource (to ensure high salience for z_goal seeding).

Mechanism: At each step, with probability transient_benefit_prob, spawn a transient patch at a random cell weighted toward zone A/C. Patch persists for transient_benefit_duration steps. Reward on contact: resource_benefit * transient_benefit_multiplier. No terminal contact.

Diversity benefits targeted: z_goal seeding (DEV-NEED-006): accidental contact with high-salience patch provides the MECH-189 seeding condition without requiring the agent to pursue explicit goals. Accidental encounters are by definition diverse in location and timing.

V3-testable: YES — new environment feature.

Parameters:

transient_benefit_enabled: bool = False
transient_benefit_prob: float = 0.02     # per-step appearance probability
transient_benefit_duration: int = 20     # steps patch remains
transient_benefit_multiplier: float = 3.0  # relative to standard resource_benefit
transient_benefit_zone_weight: list = [0.4, 0.2, 0.3, 0.1]  # A, B, C, border preference

Gate relevance: Enables measurement of “accidental benefit contact frequency” as proxy for z_goal seeding opportunity.

5.4 Action Diversity Scaffolding (Lightweight V3 Approximation)

Proposal: Without adding V4 action types, create weak functional differentiation between existing actions by introducing:

Directional wind: A per-episode drift vector that makes movement in one direction slightly cheaper and movement against it slightly costlier (energy_decay differential ±0.005 per step). Rotates every N episodes. This means optimal traversal strategy changes direction-by-direction.
Zone-gated noop bonus: A small positive reward for noop within zone C (novelty zone) — makes noop strategically useful in a specific spatial context.

Mechanism: wind_direction is sampled at episode start. Actions moving with wind: energy penalty reduced by wind_assist_factor; actions against wind: energy penalty increased by wind_resist_factor. The optimal traversal pattern changes across episodes, preventing a single movement pattern from dominating.

Diversity benefits targeted: Action-class diversity (all 5 action classes now have situationally distinct consequences); trajectory diversity (different wind vectors require different episode-level traversal strategies, even to reach the same goal).

V3-testable: YES — environment parameter change.

Parameters:

directional_wind_enabled: bool = False
wind_assist_factor: float = 0.005       # energy_decay reduction with-wind
wind_resist_factor: float = 0.005       # energy_decay increase against-wind
wind_rotation_interval: int = 50        # episodes between direction change
noop_zone_bonus_enabled: bool = False
noop_zone: str = "C"                    # zone where noop carries ambient bonus
noop_zone_bonus: float = 0.03

5.5 Dynamic Episode Map Seeds (Per-Episode Distinctness)

Proposal: Currently SD-047 provides multi-source hazard drift within an episode. Add a per-episode structural seed that varies the starting configuration: zone boundaries, initial resource/hazard placement, and agent start position. Already partially implemented in existing reset() but lacks zone-identity persistence.

Mechanism: Episode seed = hash(episode_id) used to generate zone boundaries + hazard starting positions. Ensures each episode’s structure is meaningfully distinct while respecting zone statistics. Increases trajectory diversity across episodes without disrupting within-episode consistency.

V3-testable: YES — trivial parameter; likely already partially present.

Parameters:

structural_diversity_seed: bool = True   # default on
structural_diversity_strength: float = 1.0  # 0=fixed layout, 1=fully variable

5.6 Stochastic-Attractor Audit (Pre-Condition for Curiosity Bonus Deployment)

Proposal (design constraint, not a feature): Before enabling novelty_bonus_weight > 0 in infant stage, audit CausalGridWorldV2 for all stochastic attractors. Per Burda (2018) / Pathak (2017), any irreducibly random element with accessible variance will permanently capture a prediction-error / novelty curiosity signal (the “noisy-TV” failure).

Audit completed 2026-05-16 (GAP-4); line numbers refreshed 2026-05-21 (IGW-027 / INF-ENV-004, code-read only). No random.random() or torch.randn() in causal_grid_world.py; env RNG is self._rng = np.random.default_rng(seed) at :855 (reproducible across runs). Attractor concern is within-episode irreducible entropy in encoded channels, not cross-run nondeterminism.

Enumeration and classification of every RNG call site in reset() / step() (plus callees invoked only from those paths):

Site (causal_grid_world.py)	Source	Class	Attractor risk
:944, :2964, :3058, :3134-3135	reset placement shuffle / GAP-2 Voronoi seeds / zone-weighted spawn / SD-054 pool shuffles	(a) learnable-stochastic	none
:1065, :1184-1186 (:2668, :2690-2713)	SD-049 type draw / SD-023 landmark placement	(a) learnable-stochastic	none
:3245, :3255	SD-047 `_init_multi_source_state` drift pool + initial velocity	(a) learnable-stochastic	none
:1728, :1871/:3594, :1914, :1967/:3661-3678, :3715, :3780	SD-022 limb failure / SD-029 external hazard / counter-evidence / env+reef hazard drift / respawn shuffles	(a) learnable-stochastic (rates + observable state)	none
:2008/:3757	infant GAP-3 transient benefit spawn	(a) learnable-stochastic	none
:3365-3378	SD-047 background drift move (hazard grid position observable)	(a) learnable-stochastic	low
:3275	SD-047 weather AR(1) Gaussian innovation -> hazard_field	(b) partial: AR(1) learnable, innovation irreducible	moderate (OFF by default)
:3311, :3336	SD-047 transient-event Poisson appear/disappear	(b) partial	moderate (OFF by default)
:3466	SD-048 fatigue AR(1) innovation on harm_obs_a	(b) partial	moderate
:3485	SD-048 sensitisation Poisson onset (harm_obs_a)	(b) irreducible	high
:3507	SD-048 autonomic i.i.d. Gaussian on harm_obs_a, every tick	(b) irreducible	PRIMARY (noisy-TV)
:2368 (`_traj_pair_rng`)	GAP-7 trajectory pair sampling (telemetry only)	out of obs path	none

Verdict: SD-048 interoceptive noise is the primary irreducible stochastic attractor – chiefly Source 1 autonomic i.i.d. Gaussian (:2631) and the Poisson sensitisation onset (:2609); the fatigue AR(1) (:2575) is a partial attractor. SD-047 weather innovation (:2399) and transient-event Poisson (:2415) are secondary partial attractors. Reset-time, spawn, respawn and hazard-drift randomness are learnable-stochastic and pose no attractor risk. All of SD-047/SD-048 are OFF by default; the risk materialises only when these substrates are enabled concurrently with novelty_bonus_weight > 0 (the infant-stage configuration). This confirms the plan’s pre-registered suspicion (Burda 2018 / Pathak 2017; SD-048 named as primary suspect at plan registration).

Action (binding constraint on GAP-13 / EXQ-ISEF-004 and any infant-stage curiosity run):

MECH-314a striatal novelty is min-distance in z_world RBF space; z_world does not carry harm_obs_a, so 314a is structurally safe as-is – no change required.
The exposure is MECH-314b/c: they consume e3._running_variance (PE), which the SD-048 autonomic Gaussian inflates with irreducible variance, so 314b/c would chase it. When use_structured_curiosity AND interoceptive_noise_enabled, the PE feed to 314b/c must exclude / low-pass the harm-stream PE component. Action option (b) “exclude from the novelty signal” applies; option (a) “make it Markovian” is structurally unavailable for i.i.d. readout noise.
Treat interoceptive_noise_scale and SD-047 multi_source_intensity_scale as quantities that must be masked from any novelty-signal computation; wire an explicit config-coupling assertion BEFORE GAP-13 runs (GAP-13 depends_on GAP-4).

6. Infant Curriculum Schedule

Four phases, triggered by episode count + telemetry thresholds.

Phase 0 — Motor Babbling (Episodes 0–100)

Goal: Generate a minimally structured transition database. E3 disabled or near-random. No goal-directed training. Exploration mode only.

Active features: MECH-313 (noise floor), MECH-314a (novelty bonus), dynamic episode seeds. Disabled features: MECH-314c (learning-progress), harm gradients (too early for residue geography), transient benefit patches (too salient — would pull policy before babbling completes).

Parameters:

novelty_bonus_weight: 0.5    # moderate, not max; MECH-313 noise floor active
commit_threshold: 1.0        # near-never-commit (sweep_amplitude=1.0)
residue_scale_factor: 0.0    # no residue accumulation yet
e3_planning_enabled: False   # or E3 random-action mode
offline_integration_frequency: 10  # every 10 steps; high sleep:wake ratio
harm_gradient_enabled: False
transient_benefit_enabled: False

Exit condition: H_pos > 0.20 * ln(grid_size^2) AND episode count >= 100

Recalibrated 2026-05-31 from 0.70 -> 0.20 per failure_autopsy_V3-EXQ-591_2026-05-27.md section 7 (GAP-C prereq 3, behavioral_diversity_isolation_plan.md). Observed rolling-mean H_pos band over 2000 episodes / 5 seeds is 0.03-1.08; the legacy 0.70 threshold (0.70 * ln(144) ~ 3.48 for size=12) was structurally unreachable. The 0.20 calibration (0.20 * ln(144) ~ 0.99) sits inside the observed band with ~9% margin at the upper end. Module-level constant H_POS_FRAC_OF_MAX in ree-v3/experiments/infant_curriculum.py. Path (b) alternative-gate (z_goal-norm / residue-coverage) deferred until goal_pipeline:GAP-4 clears the z_goal-collapse blocker.

Phase 1 — Benefit Discovery (Episodes 100–500)

Goal: First z_goal seeding events. Begin populating benefit side of residue field.

Active features: Add transient benefit patches, multi-resource heterogeneity (food only, water introduced at episode 300). Mild harm gradient introduced. MECH-314c begins.

Parameters:

novelty_bonus_weight: 0.7
transient_benefit_enabled: True
transient_benefit_multiplier: 3.0
harm_gradient_enabled: True
harm_gradient_scale: 0.15      # mild; enough to generate benefit/harm contrast, not dangerous
multi_resource_heterogeneity_enabled: True
resource_introduction_schedule: {"food": 0, "water": 200}  # ep offset within this phase
offline_integration_frequency: 20
residue_scale_factor: 0.05     # begin gentle residue accumulation

Exit condition: z_goal.norm() > infant_goal_threshold_low (e.g. 0.3) AND accidental_benefit_contacts >= 5 during the last 100 episodes.

Phase 2 — Harm/Benefit Geography (Episodes 500–2000)

Goal: Full residue geography. Multiple competing behavioral attractors (zone A vs zone C). Action-class diversity develops.

Active features: Microhabitat zones, directional wind, full harm gradient, full SD-049 (food + water + novelty). Bipartite SD-054 if monostrategy detected (see gate).

Parameters:

novelty_bonus_weight: 0.5      # reduce from Phase 1; structured curiosity takes over
microhabitat_enabled: True
directional_wind_enabled: True
harm_gradient_scale: 0.30
multi_resource_heterogeneity_enabled: True
resource_introduction_schedule: {"food": 0, "water": 0, "novelty": 100}  # ep offset in phase
offshore_integration_frequency: 50
residue_scale_factor: 0.10     # ARC-046 protected rate (~10% adult)
reef_bipartite_layout: True    # activate to test for monostrategy

Monostrategy sentinel: If action_entropy < ln(2) for 200 consecutive episodes, activate reef_bipartite_layout=True AND temporarily increase zone_B hazard density (relative attractor salience adjustment per PAIRED principle).

Exit condition: See Section 8 (Gate Criteria).

Phase 3 — Pre-Childhood Readiness (Episodes 2000+)

Goal: All gate criteria passing. Sleep consolidation quality confirmed. Replay is diverse.

Active features: All Phase 2 features. Offline consolidation frequency reduced to adult rate. E3 planning enabled at reduced weight. Slow E3 annealing begins.

Parameters:

e3_planning_weight: 0.1   # gradual, starting from 0.0
offline_integration_frequency: 100  # approach adult default
residue_scale_factor: 0.15  # still protected but approaching adult

Exit condition: All 7 infant gate criteria passing (Section 8).

7. New Telemetry Metrics

The following metrics must be computed and logged during infant training. Existing framework provides obs_dict and info at each step; these metrics are computed in the training loop or a parallel monitor.

7.1 Exploration Diversity Metrics

H_pos                  # Shannon entropy of position histogram (per episode + rolling window)
unique_cells_visited   # count of distinct cells visited in last N episodes
zone_coverage          # fraction of each zone's cells visited in last N episodes

7.2 Valence Diversity Metrics

residue_coverage_pct   # fraction of 10x10 grid cells with |residue| > threshold
residue_harm_cells     # count of cells with harm residue > 0.05
residue_benefit_cells  # count of cells with benefit residue > 0.05
harm_benefit_ratio     # residue_harm_cells / residue_benefit_cells; target 0.3–3.0

7.3 Action-Class Diversity Metrics

action_entropy_global       # H over full episode action histogram
action_entropy_by_zone      # separate H_pos for zone A, B, C (detects zone-specific repertoire)
action_class_coverage       # bitmask: which action classes used at least once per episode

7.4 Trajectory Diversity Metrics (Rung 1 prerequisites)

traj_pairwise_cosine_mean   # mean cosine distance over N sampled trajectory pairs
traj_volume_estimate        # log-det of trajectory kernel matrix (DvD-style; approximated)
traj_zone_crossing_count    # trajectory transitions crossing zone boundaries per episode

7.5 z_goal and Latent Diversity Metrics

z_goal_norm             # existing; target > infant_goal_threshold
z_goal_identity_count   # distinct resource-type identity signatures (from SD-049)
z_world_coverage_norms  # PCA variance in z_world over last N timesteps (latent coverage)

7.6 Sleep Consolidation Quality

post_sleep_z_goal_retention     # z_goal.norm() ratio before/after sleep integration
residue_consolidation_delta     # change in residue field magnitude after sleep
replay_diversity_index          # diversity of trajectories selected for replay (by zone/type)

8. Enriched Infant-Stage Gate Criteria

Replaces the single existing criterion (z_goal.norm() > infant_goal_threshold).

All 7 criteria must pass before childhood transition is approved. Criteria 1–3 are blocking; criteria 4–7 are advisory (flag if missing but allow transition with supervisor override).

#	Criterion	Threshold	Blocking?	DEV-NEED addressed
1	z_goal.norm() > threshold	0.4 (existing)	YES	DEV-NEED-006
2	H_pos > 0.65 * ln(grid_cells)	rolling 100-ep	YES	DEV-NEED-001
3	residue_coverage_pct > 0.15	at least 15% of cells	YES	DEV-NEED-004
4	action_entropy_global > ln(3) AND entropy differs across zones (H_zone_A vs H_zone_B KL > 0.05)	>= 3 effective actions; context-sensitive	advisory	DEV-NEED-001, DEV-NEED-005
5	harm_benefit_ratio in [0.2, 5.0]	harm and benefit both present	advisory	DEV-NEED-004
6	post_sleep_z_goal_retention > 0.85	z_goal survives sleep	advisory	DEV-NEED-007
7	traj_pairwise_cosine_mean > 0.3	trajectory library non-trivially diverse	advisory	DEV-NEED-002, DEV-NEED-005

Notes:

Criterion 2 requires NEW telemetry (H_pos infrastructure).
Criterion 3 requires residue field to be enabled during infant phase (residue_scale_factor > 0); currently ARC-046 protection sets this to ~0. Proposed: enable at 0.05 from Phase 1 onward.
Criterion 6 is novel and requires post-sleep z_goal retention logging.
Criteria 4 and 5 replace the implicit “behavioral entropy below ceiling” from developmental_curriculum.md with explicit actionable thresholds.

9. Developmental-Gate Update Recommendations for DEV-NEED Register

The following updates to developmental_needs_register.md are recommended. The register currently lacks quantitative thresholds for most infant-stage DEV-NEEDs.

DEV-NEED	Current status	Recommended update
DEV-NEED-001 (sensorimotor grounding)	No quantitative gate	Add H_pos > 0.65 * ln(grid_cells) as blocking gate; add zone_coverage > 0.60 as advisory
DEV-NEED-002 (transition database)	No quantitative gate	Add traj_pairwise_cosine_mean > 0.3; add zone_crossing_count > 5/ep as advisory
DEV-NEED-003 (curiosity calibration)	After V_s diversity	Add: must confirm MECH-313 + MECH-314 active at Phase 0 onset; add stochastic-attractor audit as pre-condition
DEV-NEED-004 (residue geography)	No quantitative gate	Add residue_coverage_pct > 0.15 as blocking; add harm_benefit_ratio in [0.2, 5.0] as advisory
DEV-NEED-005 (repertoire breadth)	After V_s diversity	Add action_entropy_global > ln(3) as advisory; add note: strategic diversity explicitly excluded until Rung 1 cleared
DEV-NEED-006 (z_goal seeding)	z_goal.norm() gate exists	Strengthen: also require accidental_benefit_contacts >= 5 in last 100 eps; transient benefit patches facilitate
DEV-NEED-007 (sleep consolidation)	No quantitative gate	Add post_sleep_z_goal_retention > 0.85; add replay_diversity_index measurement
DEV-NEED-008 (transition gate)	Exists but partial	Replace with 7-criterion table in Section 8

10. Candidate Experimental Manifests

EXQ-ISEF-001: Harm Gradient Baseline

Scientific question: Does enabling harm_gradient_enabled=True during infant stage produce measurable residue field coverage (criterion 3) faster than binary-contact baseline?

Design: Compare:

Control: standard infant params, binary hazard contact
Treatment: harm_gradient_enabled=True, harm_gradient_scale=0.30
N=5 seeds each; 2000 episodes each

Primary metric: residue_coverage_pct at episode 500, 1000, 2000. Secondary metrics: harm_benefit_ratio, H_pos. Success criterion: Treatment coverage > 2x control at episode 1000. V3-testable: YES

{
  "queue_id": "EXQ-ISEF-001",
  "description": "Harm gradient vs binary-contact: residue geography formation speed",
  "script": "experiments/infant_harm_gradient_baseline.py",
  "claim_ids": ["DEV-NEED-004", "ARC-013"],
  "architecture_epoch": "ree_hybrid_guardrails_v1",
  "estimated_duration_min": 30,
  "machine_affinity": "any",
  "tags": ["infant_substrate", "harm_gradient", "residue_geography"]
}

EXQ-ISEF-002: Transient Benefit Patch z_goal Seeding Rate

Scientific question: Does transient_benefit_enabled=True produce earlier z_goal.norm() threshold crossing (criterion 1) and more accidental benefit contacts?

Design: Compare:

Control: standard infant params, uniform resource placement
Treatment: transient_benefit_enabled=True, transient_benefit_multiplier=3.0
N=5 seeds each; 1000 episodes each

Primary metric: Episode at which z_goal.norm() > 0.4 first achieved. Secondary metrics: accidental_benefit_contacts, z_goal_identity_count. Success criterion: Treatment median first-crossing < 0.7x control median. V3-testable: YES

{
  "queue_id": "EXQ-ISEF-002",
  "description": "Transient benefit patches: z_goal seeding rate comparison",
  "script": "experiments/infant_transient_benefit_seeding.py",
  "claim_ids": ["DEV-NEED-006", "MECH-189"],
  "architecture_epoch": "ree_hybrid_guardrails_v1",
  "estimated_duration_min": 25,
  "machine_affinity": "any",
  "tags": ["infant_substrate", "z_goal_seeding", "transient_benefit"]
}

EXQ-ISEF-003: Microhabitat vs Homogeneous Geography (Latent Diversity)

Scientific question: Does the 3-zone microhabitat structure produce greater z_world coverage breadth and higher traj_pairwise_cosine_mean than homogeneous geography at matched episode count?

Design: Compare:

Control: standard infant params, SD-054 reef only (two-zone)
Treatment: microhabitat_enabled=True, n_microhabitats=3
N=5 seeds each; 2000 episodes each

Primary metric: PCA variance in z_world (top 5 components) at episode 500, 1000, 2000. Secondary metrics: traj_pairwise_cosine_mean, zone_coverage. Success criterion: Treatment PC-1 through PC-3 all > 1.2x control at episode 1000. V3-testable: YES — requires microhabitat feature implementation first.

{
  "queue_id": "EXQ-ISEF-003",
  "description": "Microhabitat zones vs homogeneous: latent state diversity comparison",
  "script": "experiments/infant_microhabitat_latent_diversity.py",
  "claim_ids": ["DEV-NEED-001", "DEV-NEED-007", "ARC-065"],
  "architecture_epoch": "ree_hybrid_guardrails_v1",
  "estimated_duration_min": 45,
  "machine_affinity": "any",
  "tags": ["infant_substrate", "microhabitat", "latent_diversity", "z_world_coverage"]
}

EXQ-ISEF-004: Stochastic Attractor Audit — Curiosity Bonus Calibration

Scientific question: At what novelty_bonus_weight value does the curiosity signal become captured by stochastic elements (noisy-TV analogues) vs driving productive exploration?

Design: Grid search novelty_bonus_weight in [0.1, 0.3, 0.5, 0.7, 1.0] with:

interoceptive_noise_enabled=True (stochastic element)
microhabitat_enabled=True (structured environment for comparison)
N=3 seeds per weight; 1000 episodes each

Primary metric: residue_coverage_pct and H_pos as functions of novelty_bonus_weight. Success criterion: Identify optimal weight where both metrics are maximised (Goldilocks point). V3-testable: YES

{
  "queue_id": "EXQ-ISEF-004",
  "description": "Novelty bonus weight calibration: stochastic attractor risk assessment",
  "script": "experiments/infant_novelty_bonus_calibration.py",
  "claim_ids": ["MECH-314", "DEV-NEED-003"],
  "architecture_epoch": "ree_hybrid_guardrails_v1",
  "estimated_duration_min": 60,
  "machine_affinity": "any",
  "tags": ["infant_substrate", "curiosity_calibration", "stochastic_attractor"]
}

EXQ-ISEF-005: Full Infant Curriculum vs Flat Parameters

Scientific question: Does the 4-phase curriculum (Section 6) produce better gate-criterion satisfaction at episode 2000 than a flat best-single-parameter setting?

Design: Compare:

Control A: flat novelty_bonus_weight=0.7, all features on from episode 0
Control B: flat novelty_bonus_weight=0.5, standard infant params
Treatment: 4-phase curriculum as specified in Section 6
N=5 seeds; 2000 episodes each

Primary metric: Fraction of 7 gate criteria passing at episode 2000. Success criterion: Treatment passing > 5/7 criteria; at least Controls A and B passing < 5/7. V3-testable: YES — requires curriculum scheduler implementation.

{
  "queue_id": "EXQ-ISEF-005",
  "description": "4-phase infant curriculum vs flat parameter baselines: gate-criterion comparison",
  "script": "experiments/infant_curriculum_vs_flat.py",
  "claim_ids": ["DEV-NEED-008", "ARC-046"],
  "architecture_epoch": "ree_hybrid_guardrails_v1",
  "estimated_duration_min": 90,
  "machine_affinity": "any",
  "tags": ["infant_substrate", "curriculum", "developmental_gate"]
}

11. V3/V4/Environment Extension Classification

Feature	V3-testable?	Requires env extension?	Requires V4?	Notes
Harm gradient zones	YES	YES (env change)	No	Low-complexity env change
Microhabitat zones	YES	YES (env change)	No	Extends SD-054 zone concept
Transient benefit patches	YES	YES (env change)	No	New env feature
Directional wind	YES	YES (env change)	No	Lightweight action differentiation
Dynamic episode seeds	YES	NO (already partial)	No	Trivial change
4-phase curriculum scheduler	YES	YES (training loop)	No	Requires curriculum schedule hook
7-criterion gate checker	YES	NO (telemetry only)	No	Python monitoring code
H_pos, residue_coverage telemetry	YES	NO (training loop add)	No	Instrumentation
Traj volume metric (DvD-style)	YES	NO (training loop add)	No	Numerically approximate
Post-sleep z_goal retention	YES	NO (training loop add)	No	Needs sleep-phase logging
Caregiver valence scaffolding (INV-043)	No	No	YES (V4)	Multi-agent required
Social enrichment (DEV-NEED-014)	No	No	YES (V4)	Multi-agent required
Loveability internalization (DEV-NEED-017)	No	No	YES (V4)	Multi-agent required
Full affordance object bundles (MECH-278)	No	Partial (V4 spec)	YES (V4)	V4 harness required

12. Design Principle Summary

The central insight from this synthesis: The V_s monostrategy problem is not solvable by parameter tuning alone. Turning up novelty_bonus_weight to maximum (a) risks noisy-TV capture if the environment contains irreducible stochastic elements (Burda 2018), (b) produces novelty coverage without developmental staging if MECH-314c learning-progress is absent (Oudeyer 2016), and (c) fails entirely if the environment is structurally homogeneous — exploration covers the space without forming the distinct attractor signatures that prevent monostrategy collapse (Ventura 2024, Dennis 2020, Li 2007).

The correct frame: The infant environment must contain learnable structure at multiple difficulty levels (Kidd & Hayden 2015 Goldilocks principle) — some regions always-mastered, some at competence edge, some aspirational. Without this structure, no curiosity or diversity mechanism can prevent collapse. The substrate proposals above create this structure via microhabitats, harm gradients, and transient benefit patches.

The temporal constraint: Based on Leitao & Gahr (2024), Doupe & Kuhl (1999), and Garcia-Guzman (2026), the motor exploration epoch (Phase 0/1) must be protected from goal-directed training. Garcia-Guzman 2026 establishes that spontaneous exploration is not degraded goal-seeking — it is a distinct functional mode. Premature E3 activation during this phase is the mechanism by which the critical window closes too early. Griffin et al. (2026) provides the computational mechanism: Phase 0 must build action-perception cell assemblies by Hebbian co-activity before reward weighting begins, or there is no option diversity for reward to select from.

The curriculum principle: Fixed bipartite environments (SD-054) will still produce monostrategy if relative attractor salience is not calibrated to agent competence (PAIRED principle, Dennis 2020). The sentinel mechanism in Phase 2 (detect monostrategy → adjust zone salience) is a manual approximation of this curriculum principle within V3’s single-environment constraint.

The context-rigidity failure mode: Denisova & Zhao (2017) establishes that the pathological outcome of insufficient exploration is not low entropy but context-rigidity: the agent produces the same action distribution regardless of context (zone, state, task-demand). This is a stronger criterion than global entropy. Gate criterion 4 must therefore be zone-specific: H(action|zone_A) and H(action|zone_B) must differ, not just H(action) must exceed a threshold.

The wrong diversity level: DIAYN (Eysenbach 2019) and DvD (Parker-Holder 2020) independently show that step-level novelty and pairwise trajectory distance are insufficient diversity metrics. What is needed is trajectory-level discriminability (are state distributions under different modes distinguishable?) and volumetric coverage (do candidates span behavioral space or cluster?). This means MECH-314 as currently implemented at the step level is not the diversity signal needed for monostrategy prevention — it is a necessary but insufficient component.

The missing ingredient: REE has no behavioral descriptor — a compact trajectory-level characterisation of which behavioral mode was active during an episode (MAP-Elites, Mouret & Clune 2015). Without this, CEM candidate diversity cannot be measured volumetrically, and the system cannot detect degenerate clustering that pairwise metrics miss. This is a structural gap that no parameter change can fill.

The wanting/liking entry point for z_goal: Berridge & Robinson (1998) and Keren-Portnoy (2021) together establish that goal anchor formation proceeds via the WANTING pathway independently of LIKING calibration. An infant agent exposed to accidental benefit contacts will rapidly generate approach motivation (wanting) even before the consummatory value is calibrated. Transient benefit patches are the operational mechanism for triggering this pathway in the V3 substrate.

13. Suggested Missing Claims

These candidate claims emerged from the literature synthesis and are not currently registered in claims.yaml. They are not proposed for registration in this session — flagged for governance review.

Candidate ARC-claim: Phase 0 Reward-Free Exploration Epoch

Source grounding: Griffin et al. (2026) + Doupe & Kuhl (1999) + Garcia-Guzman (2026).

Claim statement: A gradient-based training substrate must include a reward-free Phase 0 during which action-outcome associations are built by Hebbian co-activity before reward-weighted learning begins. Activating E3 scoring from step 1 collapses the option library before it is populated.

Why not registered yet: Would require ARC-046 revision. Needs discussion with user on whether to fold into ARC-046 (infant phase spec) or register as a standalone ARC.

Candidate MECH-claim: Wanting-Before-Liking Sequencing

Source grounding: Berridge & Robinson (1998) + Keren-Portnoy & Tomasello (2021).

Claim statement: During the infant stage, VALENCE_WANTING (approach motivation, goal anchor seeding) is activated by accidental benefit contacts before VALENCE_LIKING is calibrated. Goal anchor formation does not require prior liking calibration; wanting is the entry point.

Why not registered yet: MECH-189 covers the benefit-contact seeding mechanism but does not specify the wanting-before-liking temporal ordering. Would be a child claim of MECH-189.

Candidate MECH-claim: Novelty-Approach as Accidental-Benefit Bridge

Source grounding: Burnay et al. (2020) water cliff + Gliga (2018) hypothesis-testing variability.

Claim statement: Novelty-driven approach to mixed-valence regions (attractive-but-risky) is the primary mechanism by which accidental benefit contacts are generated in the infant stage. An infant environment with only safe exploration zones and purely dangerous zones will not produce this bridge.

Why not registered yet: The mechanism spans INV-055, MECH-189, and ARC-013 without being cleanly owned by any. Candidate for a new MECH bridging exploration and residue seeding.

Candidate MECH-claim: Behavioral Descriptor Specification Requirement

Source grounding: MAP-Elites (Mouret & Clune, 2015) + DIAYN (Eysenbach 2019).

Claim statement: Structural prevention of monostrategy requires an explicit behavioral descriptor that characterises which behavioral mode was active during an episode. Without a defined descriptor, diversity metrics degenerate to pairwise distance (which does not detect degenerate clustering) and CEM candidates cannot be scored for volumetric coverage.

Why not registered yet: ARC-065 MECH-31x cluster does not include this requirement explicitly. Would be a MECH child of ARC-065.

14. Appendix: DEV-NEED Readiness Impact Summary

After implementing the proposals in this document:

DEV-NEED	Before	After	Change
DEV-NEED-001	H_pos metric missing; no quantitative gate	H_pos gate + zone_coverage gate added; Phase 0/1 protect babbling	Blocking gate added
DEV-NEED-002	No transition quality metric	traj_pairwise_cosine_mean + zone_crossing_count	Advisory gate added
DEV-NEED-003	No stochastic-attractor check	Audit pre-condition + Goldilocks calibration experiment	Implementation de-risked
DEV-NEED-004	residue_coverage_pct not measured	Blocking gate at 0.15; harm_gradient enables geography without terminal risk	Blocking gate added
DEV-NEED-005	No action diversity gate	action_entropy_global > ln(3) advisory; strategic diversity explicitly excluded	Advisory gate added; scope clarified
DEV-NEED-006	z_goal.norm() gate only	+ accidental_benefit_contacts gate; transient patches deliver seeding events	Gate strengthened
DEV-NEED-007	No sleep quality measurement	post_sleep_z_goal_retention + replay_diversity_index	New telemetry
DEV-NEED-008	Single criterion	7-criterion table; 3 blocking + 4 advisory	Full gate operationalised