Developmental Experiment Priorities

Date: 2026-05-16
Cross-references: developmental_needs_register.md | developmental_curriculum.md | developmental_metrics.md | infant_substrate_expansion.md | replay_development_analysis.md
Status: Planning document — not a registered claim. Informs experiment queue decisions and session governance.
Governing principle: Optimise for developmental understanding, not short-term PASS count.

1. Executive Summary

The dominant bottleneck as of 2026-05-16 is the cold-start warm-start failure (DEV-NEED-029): all ARC-065 diversity mechanisms (MECH-313 noise floor, MECH-314 structured curiosity, MECH-320 tonic vigor) produce zero differential on a cold-start substrate because the ResidueField, EWMA, and E3 score variance are all near-zero. EXQ-573 confirmed this: 10x bias scale sweep produced bit-for-bit identical arms. The same floor blocks INV-049 sleep consolidation tests, Q-043/044/045 behavioral ablations, and every replay diversity experiment currently proposed.

The highest-leverage action is not to run more behavioral experiments on the existing substrate. It is to enrich the infant substrate so the warm-start gate (DEV-NEED-029) becomes passable.

Once the warm-start gate is established, the experiment sequence becomes:

Confirm warm-start via EXQ-ISEF-001/002/003/004 (substrate validation series)
Re-run Q-043/044/045 ARC-065 behavioral ablation on warm substrate
Run EXQ-IDEV-001 (longitudinal replay quality), EXQ-IDEV-006 (SD-017 functional validation)
Queue sleep aggregation cluster (MECH-272/273/275/285) only after SD-017 functional test PASSes

2. Current State Assessment

2.1 FAIL Pattern Analysis

The recent FAIL cluster falls into three mechanistically distinct groups:

FAIL group	Representative experiments	Root cause	Correct response
Cold-start warm-start	EXQ-573 (10x bias sweep), EXQ-549 (MECH-320 discriminative), EXQ-552 (forced warmup), EXQ-553 (orthogonal CEM)	Diversity mechanisms require warm substrate; infant is cold-start by definition	Infant substrate enrichment (EXQ-ISEF-001..005) before any diversity behavioral test
Monostrategy basin	EXQ-550 (z_goal falsifier), EXQ-554 (decoder collapse), EXQ-555/557 (seed sweep), EXQ-556 (module swap)	CEM proposer collapses to single strategy; seed-dependent (some seeds escape)	SD-055 differentiable CEM (substrate-readiness PASS EXQ-568); ARC-062 Phase 3 wiring; microhabitat zones
Substrate ceiling	EXQ-530/530c (ARC-016 precision commit), EXQ-532 (SD-005 latent domain), EXQ-537-537d (SD-029 self-attribution)	Claim requires substrate capabilities beyond current V3 baseline	Hold pending substrate enrichment; do not re-queue under same conditions
Methods-valid but function-unproven	EXQ-265/265a PASS (SD-017 phase infrastructure), EXQ-563c PASS (CEM bias calibration)	Substrate method works; behavioral discriminative test not yet run	Queue functional validation experiments in dependency order

2.2 PASS Pattern Analysis

The recent PASS cluster is concentrated in two categories:

Substrate-readiness probes (EXQ-540c, EXQ-542, EXQ-544, EXQ-545, EXQ-546, EXQ-547, EXQ-548, EXQ-559/559a, EXQ-560, EXQ-561, EXQ-563..563c, EXQ-564, EXQ-565, EXQ-566, EXQ-567, EXQ-568): The substrate machinery is landing successfully. This is real progress — infrastructure is in place. But infrastructure PASSes do not count as behavioral evidence.
Diagnostic probes (EXQ-551/551a pipeline entropy, EXQ-555 seed factorization, EXQ-557/558 seed sweep): These correctly localise the monostrategy cliff to the CEM proposer stage and establish that the basin is large but not universal. This is the right scientific methodology.

Interpretation: The substrate is richer than six months ago, but behavioral tests on it remain consistently blocked by the cold-start problem. The PASSes are building the correct machinery; the FAILs are correctly diagnosing what machinery is not yet sufficient for discriminative tests.

2.3 Monostrategy Characterisation (as of 2026-05-16)

From EXQ-551/551a (pipeline entropy), EXQ-555 (2x2 env/agent seed), EXQ-557 (30-cell agent sweep), EXQ-558 (seed-pair rank):

Entropy cliff is at the CEM proposer stage (not decoder, not E3 scoring, not environment)
The collapse is agent-init-dependent (some seeds escape with action_class_entropy ~0.68; seeds 42/17 collapse to 0.0)
The collapse is not attributable to a single submodule (EXQ-556 module-swap FAIL)
SD-054 bipartite layout (EXQ-548 PASS) provides 1.27x structural divergence uplift but does not fully escape the basin
SD-055 differentiable CEM (EXQ-568 PASS substrate-readiness) is the next proposer-stage fix

This characterises the monostrategy as a deep attractor in initialization space that is too large to escape via parameter perturbations, but escapes via specific init seeds or substrate enrichment that creates competing attractors. The infant substrate enrichment proposals (microhabitat zones, harm gradients, transient benefit patches) are designed to reshape the attractor landscape during training — creating multiple distinct learned attractor zones that compete for policy dominance.

3. Literature Evidence Mapping Table

The following table synthesises all seven required lit-pull topics. Topics 5-7 draw from the existing synthesis in infant_substrate_expansion.md and replay_development_analysis.md. Topics 1-4 are from methodological literature.

3.1 Experiment Prioritization Under Uncertainty

Source	Finding	REE relevance	DEV-NEED IDs	Claim IDs	Confidence	Classification
Lindley (1956) Biometrika 43:70 — Bayesian experiment design	Expected information gain (EIG) = expected reduction in posterior entropy. Optimal experiment selection maximises EIG per unit cost	When theory is well-founded but experiments keep failing, EIG of further same-type experiments approaches zero — invest in substrate investigation instead	DEV-NEED-029	ARC-065, INV-049	High	suggests experiment (warm-start gate calibration before behavioral tests; EIG near-zero for repeated cold-start diversity tests)
Chaloner & Verdinelli (1995) Statistical Science 13:273 — BOED survey	Bayesian Optimal Experimental Design (BOED): prioritise experiments that maximally discriminate between competing hypotheses. When hypotheses make similar predictions under current substrate, discriminative experiments are uninformative	The repeated FAIL pattern (EXQ-385/385a for INV-049, EXQ-537..537d for SD-029) suggests competing hypotheses (theory wrong vs substrate not ready) make identical predictions on current substrate. BOED says: first run the substrate-discrimination experiment	DEV-NEED-029	INV-049, SD-029	High	refines existing claim (substrate-validation experiment is the BOED-optimal next step, not theory-test replication)
Srinivas et al. (2010) ICML GP-UCB	Upper Confidence Bound (UCB) for sequential experiment selection: balance exploiting known-good conditions (exploit) with testing unexplored substrate regimes (explore). UCB-regret bounds show systematic exploration is required even when exploit looks locally optimal	REE has been in an exploit loop: re-testing claims under conditions known to be substrate-limited. UCB principle: switch to explore (new substrate regime = infant enrichment) when current regime yields zero-information	DEV-NEED-001, DEV-NEED-029	ARC-065	0.85	suggests experiment (schedule substrate exploration experiments before behavioral exploitation tests)
Settles (2012) Active Learning synthesis	Uncertainty sampling: query points where model is most uncertain. Expected model change: query points that would most change the model. In scientific discovery, “model” = theory belief state; “query” = experiment	INV-049 double-FAIL raises genuine uncertainty about whether offline integration matters for this substrate — but the correct query is not “run INV-049 again under same conditions” but “establish whether current substrate can produce the conditions under which INV-049 would be discriminative”	DEV-NEED-007, DEV-NEED-029	INV-049, ARC-011	High	suggests experiment (EXQ-IDEV-001 longitudinal replay quality is the active-learning-optimal query for INV-049)
Frazier (2018) arXiv:1807.02811 — Bayesian Optimization tutorial	Knowledge gradient: the most valuable next experiment is the one that would most change your decision about which design to deploy. In RL: if you wouldn’t change anything after seeing the result, the experiment has zero gradient value	Applied to monostrategy: an experiment that will FAIL regardless of substrate state (because entropy=0 regardless of mechanism state) has zero knowledge gradient. DO NOT queue these until warm-start gate established	DEV-NEED-029	ARC-065, MECH-313, MECH-314, MECH-320	High	refines existing claim (blocking constraint: zero KG experiments are waste; warm-start gate is prerequisite for non-zero KG behavioral tests)

3.2 Active Learning for Scientific Discovery

Source	Finding	REE relevance	DEV-NEED IDs	Claim IDs	Confidence	Classification
Reker et al. (2019) Nature Chemistry 11:895	Active learning in drug discovery: sequential querying reduces experiments needed 3-5x vs random screening. Key: pool of candidates must be screened for plausibility before querying — uninformative candidates waste the oracle’s budget	Analogy: REE oracle = experiment runner (expensive). Uninformative experiments (zero-differential arms on cold-start substrate) consume runner hours with zero evidence gain. Plausibility screening = warm-start gate check before queuing	DEV-NEED-029	ARC-065	0.80	refines existing claim (warm-start gate IS the plausibility screen for ARC-065 behavioral experiments)
Hernandez-Lobato et al. (2017) PMLR 54 — Parallel BOED	Batch query selection under uncertainty: select the most diverse batch of experiments when parallel evaluation is available. Diversity of experiments != diversity of results; need behaviorally-distinct experimental conditions	REE runs experiments in parallel on 5 machines. But recent queued batches have been behaviorally equivalent (all collapse to monostrategy) — the batch is diverse in mechanism flags but not in behavioral regime	DEV-NEED-001, DEV-NEED-005	ARC-065	0.78	suggests experiment (batch selection criterion: experiments must differ in substrate regime, not just mechanism flag)

3.3 Causal Experiment Design

Source	Finding	REE relevance	DEV-NEED IDs	Claim IDs	Confidence	Classification
Pearl (2009) Causality Ch.3 — do-calculus	Interventional evidence (do(X)) is strictly stronger than observational evidence. An ablation that removes mechanism X while not controlling for indirect effects of X’s removal is confounded — the result cannot be attributed to X specifically	INV-049 ablation (remove sleep = no sleep) also reduces total training steps and removes the offline-integration’s indirect effect on E1 update timing. The FAIL may be confounded — need matched-compute control	DEV-NEED-007	INV-049, SD-017	High	refines existing claim (INV-049 ablation design must add matched-compute control arm before result is interpretable)
Sculley et al. (2015) NIPS — ML Technical Debt	Ablations in large systems accumulate “entanglement debt” — removing feature X changes the effective distribution that feature Y sees, invalidating the ablation’s clean interpretation. Multiple ablations accumulated over iterations compound this	REE experiment chains (EXQ-537 -> 537b -> 537c -> 537d SD-029 series) each added a new intervention while inheriting the previous design’s substrate assumptions. The FAIL cascade is partly confound accumulation	DEV-NEED-029	SD-029, ARC-016	0.82	suggests experiment (SD-029 series needs a clean baseline re-run after substrate enrichment, not incremental patch on failed design)
Meyes et al. (2019) arXiv:1901.08644 — Ablation Studies survey	Most ablation studies in deep learning fail one or more validity criteria: (a) no random seeds, (b) no comparison to random baseline, (c) ablation evaluated on same data distribution as training. Benchmark: a correct ablation is meaningless unless its discriminative power under the specific substrate is characterised first	DEV-NEED-029 directly addresses criterion (c): ARC-065 mechanisms are evaluated under the same cold-start distribution they were designed to ameliorate — the evaluation domain is the domain where the ablation has zero discriminative power	DEV-NEED-029	ARC-065, MECH-313, MECH-314, MECH-320	High	supports existing claim (DEV-NEED-029 warm-start gate IS the Meyes criterion-c check)
Imai & Dyk (2004) Political Analysis 12:341 — Causal mechanism identification	Mechanism identification requires both (a) causal identification of the outcome effect and (b) a mediator that correctly represents the proposed mechanism. If the mediator is not operating (e.g. tonic vigor EWMA=0 because substrate is cold), mechanism identification is impossible regardless of experimental design	MECH-320 tonic vigor: substrate-readiness PASS (mechanism instantiated correctly) but discriminative pair FAIL (EXQ-549). The mechanism is not operating during the test because the EWMA hasn’t accumulated. This is not a theory failure — it is a mediator-inactivity confound	DEV-NEED-029	MECH-320, ARC-066	0.85	supports existing claim (mediator-inactivity pattern is the formal causal-inference description of cold-start null results)

3.4 Ablation Study Design

Source	Finding	REE relevance	DEV-NEED IDs	Claim IDs	Confidence	Classification
Lipton & Steinhardt (2019) ICML — Troubling Trends in ML	Ablation results without confidence intervals and with insufficient seeds are unreliable. Single-seed ablations on stochastic systems frequently reverse at N=5. Pre-registering the ablation hypotheses before running prevents p-hacking	EXQ-573 ran 10 arms but all bit-for-bit identical — this is actually conclusive evidence of the substrate problem, not ambiguous. Multi-seed, pre-registered ablations that produce identical results across all arms are strong evidence that the ablation’s discriminative condition isn’t met	DEV-NEED-029	ARC-065	High	supports existing claim (EXQ-573 methodology was correct; result is conclusive rather than ambiguous)
Kohavi et al. (2020) Trustworthy Online Controlled Experiments	Sequential experimentation: once an experiment shows null result with clear diagnosis (substrate not warm), immediately route to the substrate fix rather than running parametric variations. “Continuing the experiment after a clear negative result wastes resources without adding information”	The 540a/540b/540c/540d/540e/540f MECH-307 iteration chain is a positive example of this principle being followed. The diversity-mechanism chain (EXQ-573 + monostrategy series) should now follow the same principle: substrate fix is the route, not more variation	DEV-NEED-029	MECH-307, ARC-065	High	suggests experiment (route to EXQ-ISEF-001..005 immediately; do not queue more ARC-065 behavioral variations on cold substrate)
Bouthillier et al. (2021) NeurIPS — Accounting for Variance in ML Benchmarks	Performance differences < 1 std(seed variation) are not meaningful. For developmental AI, the analogous principle is: mechanism effect < substrate noise = the experiment cannot detect the mechanism	ARC-065 diversity mechanisms produce ~0 differential when the substrate baseline variance itself is 0 (monostrategy cold-start). The signal-to-noise ratio for any diversity test is 0/0. The denominator must be non-zero before testing	DEV-NEED-029	ARC-065, MECH-313, MECH-314, MECH-320	High	refines existing claim (SNR criterion for DEV-NEED-029 gate: E3 score variance must exceed noise floor before any diversity mechanism test is valid)

3.5 Developmental Robotics Benchmarking

(Synthesised from infant_substrate_expansion.md §3 and replay_development_analysis.md §2 — both 2026-05-16)

Source	Finding	REE relevance	DEV-NEED IDs	Claim IDs	Confidence	Classification
Garcia-Guzman (2026) Nature family	Infant exploration is not degraded goal-seeking — it is a functionally distinct mode optimising for body-dynamic discovery. Premature goal-activation closes the window	Disabling E3 during Phase 0 (motor babbling) is not an approximation — it IS the biologically-appropriate design	DEV-NEED-001	INV-055, ARC-065, INV-073	0.82	supports existing claim (Phase 0 reward-free epoch in infant_substrate_expansion.md §6.1 is correct design)
Adolph (2017/2019) Psych Learn Mem	Behavioral flexibility requires varied motor experience BEFORE RL narrows repertoire; posture-specific	Infant exploration must cover all action modalities before goal pursuit — impoverished substrate produces narrow option library permanently	DEV-NEED-001, DEV-NEED-004	INV-073, INV-055, ARC-065	0.78	supports existing claim (7-criterion gate from infant_substrate_expansion.md §8 is necessary)
Doupe & Kuhl (1999) Annu Rev Neurosci	BG-driven motor variability necessary for repertoire; pharmacological silencing = permanent monostrategy	If waking exploration is monostrategy (current V3 state), the repertoire deficit is permanent unless substrate is fixed during the developmental window	DEV-NEED-001, DEV-NEED-005	ARC-065, MECH-309	0.78	supports existing claim; time-urgency argument for prioritising substrate enrichment

3.6 Curriculum Ablation in Reinforcement Learning

(Synthesised from infant_substrate_expansion.md §3.3 and §3.5)

Source	Finding	REE relevance	DEV-NEED IDs	Claim IDs	Confidence	Classification
Narvekar et al. (2020) JMLR survey	Difficulty-based curricula do NOT prevent monostrategy if all tasks require the same strategy. Diversity-based curricula (by behavioral type) do. Adaptive curricula outperform fixed schedules	REE’s Rung 0-4 behavioral curriculum is diversity-based (correct). But the Rung trigger must be behavioral competence plateau, not episode count	DEV-NEED-008	ARC-065, Q-046	0.72	validates existing design; suggests triggering mechanism
Dennis et al. (2020) NeurIPS PAIRED	Fixed bipartite environment produces monostrategy if one attractor is easier. Need difficulty calibrated to agent competence	SD-054 bipartite layout necessary but insufficient. Relative salience must track agent performance — monostrategy sentinel mechanism in infant_substrate_expansion.md §6 Phase 2	DEV-NEED-005, DEV-NEED-008	ARC-065, Q-046	0.76	refines existing claim (sentinel mechanism is the PAIRED-analog; implementation required)
Oudeyer (2016) IEEE Trans Cog Dev Syst	Learning-progress curiosity produces emergent developmental stage ordering; pure novelty-max does not sequence stages	MECH-314c (learning progress sub-flavour) is the primary infant curiosity signal; needs E1 loss derivative as the progress signal	DEV-NEED-008	MECH-314, ARC-046	0.76	refines existing claim (MECH-314c priority over 314a during phase 0/1)

3.7 Information Gain in Experimental Design

(Synthesised from replay_development_analysis.md §1.3, §4)

Source	Finding	REE relevance	DEV-NEED IDs	Claim IDs	Confidence	Classification
Parker-Holder et al. (2020) NeurIPS DvD	Apparent diversity (pairwise distance) != effective diversity (volume). Population can be spread but degenerate	Current V3 entropy metrics measure apparent diversity. Need volumetric trajectory coverage — traj_volume_estimate (DvD-style log-det). Zero pairwise distance means zero information gain from diversity experiments	DEV-NEED-005, DEV-NEED-007	Q-046, ARC-065	0.78	refines existing claim (traj_volume_estimate is the correct info-gain metric; current entropy metrics are insufficient)
Schapiro et al. (2017) Phil Trans R Soc B	Dense, diverse replay required for cross-episode generalisation. Biased or sparse replay fails to extract transferable principles	Replay info-gain is bounded by waking diversity. Sleep cannot restructure content that was never created	DEV-NEED-007, DEV-NEED-011	ARC-011, MECH-194	High	supports existing claim; confirms infant substrate enrichment is upstream prerequisite for replay info-gain
Joo & Frank (2024) Science	SWR incidence increases with novelty and harm salience. Salient trials over-represented in offline replay	Harm salience is the natural SWR-gating factor. High-harm episodes MUST exist in waking experience for replay to exploit them. Current infant phase (binary hazard, near-zero residue accumulation) does not generate these	DEV-NEED-007	MECH-285, MECH-205	High	supports existing claim; confirms harm gradient feature (EXQ-ISEF-001) is prerequisite for replay quality

4. Ranked Experiment Proposals

Ranking criteria: expected information gain × developmental leverage × bottleneck reduction × future unblocking. Cost and misleading-success risk are tiebreakers.

RANK 1 — EXQ-ISEF-001: Harm Gradient Baseline

Status: Script not yet written — must go through /queue-experiment skill
DEV-NEED IDs: DEV-NEED-002, DEV-NEED-003, DEV-NEED-004
Claim IDs: ARC-013, ARC-046, INV-055
Priority: Highest. This is the most direct substrate fix for content poverty (§4.1/B1 of replay_development_analysis.md).

Why first:

Residue field geography (DEV-NEED-004) is the single structural prerequisite for: (a) sleep consolidation experiments, (b) MECH-285 staleness-priority replay, (c) DEV-NEED-029 warm-start gate calibration, (d) the harm-benefit separation gate (DEV-NEED-002). Without residue geography, everything downstream is substrate-limited.
Binary hazard contact (current default) means the residue field never accumulates geography during normal episodes — agent either avoids hazards entirely (no residue) or contacts them terminally (brief residue spike, then episode ends).
Harm gradients are an environment-level change only — no ree_core architecture change required. Low implementation cost, high downstream leverage.

Design: Control (binary contact) vs Treatment (harm_gradient_enabled=True, scale=0.30). N=5 seeds each, 2000 episodes. Compare residue_coverage_pct at ep 500/1000/2000.

Expected observations:

Treatment: residue_coverage_pct 0.15+ by episode 1000; harm zones show graded approach/retreat trajectories in replay buffer; traj_pairwise_cosine_mean higher than control.
Control: residue_coverage_pct near-zero unless terminal contact; replay dominated by termination-adjacent sequences only.

Interpretation grid:

Outcome	Interpretation	Next action
Treatment residue_coverage > 2x control at ep 1000	Harm gradients are the bottleneck; binary contact is the root cause	Enable harm gradients as infant default; proceed to EXQ-ISEF-002
Both conditions show low residue_coverage	Harm gradient alone is insufficient; likely also needs microhabitat zones for spatial differentiation	Run EXQ-ISEF-003 in parallel or serially
Treatment shows higher coverage but still below 0.15	Partial fix; need higher harm_gradient_scale or larger outer radius	Re-queue as EXQ-ISEF-001a with parameter sweep
Treatment produces catastrophic residue saturation	harm_gradient_scale=0.30 is too high; residue_saturation_pct > 0.15	Reduce harm_gradient_scale to 0.15; re-queue

Failure interpretation: A Treatment FAIL where coverage remains near-zero despite gradients is strong evidence that the spatial homogeneity compression (§2.3 of infant_substrate_expansion.md) is the root cause, not the binary-contact compression — route immediately to EXQ-ISEF-003.

Downstream decision: PASS enables queuing EXQ-ISEF-002 and EXQ-ISEF-003 simultaneously. Also enables the DEV-NEED-029 warm-start gate calibration experiment (EXQ-ISEF-004 needs warm residue field to have a meaningful stochastic-attractor audit).

RANK 2 — EXQ-ISEF-002: Transient Benefit Patches (z_goal Seeding Rate)

Status: Script not yet written — must go through /queue-experiment skill
DEV-NEED IDs: DEV-NEED-006, DEV-NEED-008
Claim IDs: MECH-189, INV-055
Priority: High. The infant-to-childhood gate (DEV-NEED-008) is blocked by z_goal norm remaining near-zero. This directly unblocks the transition gate.

Why second:

z_goal.norm() > 0.4 is the #1 blocking criterion for the infant gate. It cannot be met if resource contacts are sparse and uniform. Transient benefit patches create the high-salience accidental contact events that MECH-189 requires.
Berridge & Robinson (1998) / Keren-Portnoy (2021) ground the wanting-before-liking pathway: accidental benefit contact creates approach motivation (z_goal seeding) before consummatory calibration. The mechanism is fast (minutes in infants) and environment-driven.
Can run in parallel with EXQ-ISEF-001 (independent environment features).

Design: Control (uniform resource placement) vs Treatment (transient_benefit_enabled=True, multiplier=3.0). N=5 seeds, 1000 episodes. Primary: episode at which z_goal.norm() > 0.4 first achieved. Secondary: accidental_benefit_contacts, z_goal_identity_count.

Expected observations:

Treatment: z_goal first-crossing before episode 300 in median; z_goal_identity_count > 1 (multiple resource type seeding) by episode 500.
Control: z_goal first-crossing after episode 700 or never within 1000 episodes for some seeds.

Interpretation grid:

Outcome	Interpretation	Next action
Treatment median first-crossing < 0.7x control	Transient patches are effective; z_goal seeding unblocked	Enable as infant default; include in 7-criterion gate
Both conditions show similar first-crossing times	Sparse resource contacts are not the bottleneck; likely the goal encoder or MECH-189 write path is the issue	Run goal-seeding pipeline diagnostic (similar to EXQ-536a instrumentation)
Treatment shows z_goal seeding but z_goal_identity_count stays at 1	Single resource type is dominating; SD-049 multi-resource heterogeneity needed alongside	Queue EXQ testing SD-049 multi-resource + transient patches combined

Failure interpretation: If z_goal seeding occurs but does not improve the transition gate, the gate metric (z_goal.norm() threshold) may need recalibration — or the childhood-readiness issue is not z_goal seeding but behavioral entropy.

Downstream decision: PASS enables: (a) enabling transient patches as infant default, (b) re-testing the infant-to-childhood gate as a meaningful developmental transition rather than a trivially-blocked one.

RANK 3 — EXQ-ISEF-003: Microhabitat Zones (Latent State Diversity)

Status: Script not yet written (requires microhabitat_enabled feature implementation first)
DEV-NEED IDs: DEV-NEED-001, DEV-NEED-003, DEV-NEED-007
Claim IDs: ARC-065, ARC-007, INV-055
Priority: High. Directly addresses spatial homogeneity compression (§2.3), which prevents z_world coverage and replay geography.

Why third:

Ventura et al. (2024) CA3 flexibility: environment structure during infant exploration determines downstream hippocampal representational flexibility. Flat environment → flat z_world → flat replay.
Without spatially distinct zones, the warm-start gate criteria for MECH-314a (novelty bias needs distinct RBF centers) cannot be met — the ResidueField will not populate distinct zones.
Implementation requires environment extension (Voronoi zone boundaries, per-zone resource/hazard modulation), which is a moderate-cost change.

Design: Control (SD-054 reef bipartite, two-zone) vs Treatment (microhabitat_enabled=True, n_microhabitats=3). N=5 seeds, 2000 episodes. Primary: PCA variance in z_world (top 5 components) at ep 500/1000/2000. Secondary: traj_pairwise_cosine_mean, zone_coverage.

Expected observations:

Treatment: z_world PC-1 through PC-3 all > 1.2x control at ep 1000; zone-specific action distributions differ (H(action|zone_A) vs H(action|zone_B) KL > 0.05).
Control: z_world PC structure dominated by single geography; zone-specific entropy near-identical.

Interpretation grid:

Outcome	Interpretation	Next action
Treatment PCs > 1.2x and zone entropy differs	Microhabitat zones create the structural diversity needed for warm-start gate	Enable microhabitats as infant default; proceed to warm-start calibration
Treatment PCs higher but zone entropy identical	Zones create distinct z_world representations but agent hasn’t learned zone-conditional behavior	Check whether E3 planning weight is too low for zone-specific learning; may need minimal E3 in Phase 2
Both conditions similar PCA structure	CausalGridWorldV2 spatial structure doesn’t propagate to z_world; the encoding is geography-blind	Investigate z_world encoder’s sensitivity to spatial position; may need positional encoding

Failure interpretation: A FAIL here would be important: it would mean the z_world encoder is not sensitive to spatial geography even when the environment has structured geography. This would be evidence for a representation quality issue upstream of the diversity mechanisms.

Downstream decision: PASS enables DEV-NEED-029 warm-start gate calibration (ResidueField needs populated zones before the zone-center-count threshold can be measured).

RANK 4 — EXQ-ISEF-004: Stochastic Attractor Audit + Curiosity Bonus Calibration

Status: Script not yet written
DEV-NEED IDs: DEV-NEED-003
Claim IDs: MECH-314, ARC-065
Priority: High. This is a PREREQUISITE for turning up novelty_bonus_weight to its intended operating level — without this audit, increasing MECH-314a risks permanent noisy-TV capture.

Why fourth:

Burda et al. (2018) ICLR: large-scale curiosity; noisy-TV problem is real and blocks permanent novelty trap if environment has irreducible random elements.
Pathak et al. (2017) ICML: forward-model PE in learned feature space — fails with stochastic attractors. CausalGridWorldV2 has: (a) hazard drift (SD-047 stochastic), (b) resource respawn (random), (c) SD-048 interoceptive noise.
These elements capture the MECH-314c learning-progress signal permanently if not controlled.

Design: Grid search novelty_bonus_weight in {0.1, 0.3, 0.5, 0.7, 1.0} with structured environment (microhabitat_enabled=True). Primary: residue_coverage_pct and H_pos as functions of weight. Success: identify Goldilocks point where both metrics maximise.

Interpretation grid:

Outcome	Interpretation	Next action
Clear Goldilocks maximum (both metrics peak at same weight)	Noisy-TV capture visible at high weights; structured env protects at mid-range	Set novelty_bonus_weight to optimal point; include in infant defaults
Both metrics monotone increasing with weight	Stochastic elements not capturing novelty signal; environment is well-structured	Can safely use high novelty_bonus_weight; simplifies infant curriculum
residue_coverage peaks at low weight, H_pos flat	Noisy-TV capture dominant even at low weight	Audit SD-047/SD-048 for irreducible stochasticity; Markovianise or exclude from novelty signal

Downstream decision: Sets the safe operating range for novelty_bonus_weight in infant Phase 1/2, which directly determines how effectively MECH-314a/314c can drive exploration coverage.

RANK 5 — EXQ-ISEF-005: Full Infant Curriculum vs Flat Parameters

Status: Script not yet written (requires curriculum scheduler hook implementation)
DEV-NEED IDs: DEV-NEED-008
Claim IDs: ARC-046, INV-055
Priority: Medium-High. This is the integration test that checks whether the four-phase curriculum (§6 of infant_substrate_expansion.md) produces better gate-criterion satisfaction than flat parameter settings.

Prerequisite: EXQ-ISEF-001, 002, 003, 004 must all PASS to validate individual features before the combined curriculum is tested. This avoids confounded interpretation if the curriculum test fails.

Design: Three-arm: (A) flat novelty_bonus=0.7 with all features on from ep 0, (B) flat with standard infant params, (C) 4-phase curriculum. N=5 seeds, 2000 episodes. Primary: fraction of 7 gate criteria passing at ep 2000.

Interpretation grid:

Outcome	Interpretation	Next action
Curriculum > 5/7 criteria; controls < 5/7	Staging is necessary; features cause harm if introduced without sequencing	Adopt curriculum as infant default; proceed to warm-start calibration
All conditions > 5/7 at ep 2000	Gate criteria are too easy; need harder thresholds or longer runs	Recalibrate thresholds; may indicate curriculum is not differentiating enough
Curriculum fails specific criteria that flat passes	Early-phase feature suppression hurts specific gates	Investigate specific phase boundaries; adjust Phase 0 exit criterion

RANK 6 — DEV-NEED-029 Warm-Start Gate Calibration (EXQ-ISEF-006, to be designed)

Status: Not yet designed — depends on EXQ-ISEF-001/003 PASS
DEV-NEED IDs: DEV-NEED-029
Claim IDs: ARC-065, MECH-313, MECH-314, MECH-320
Priority: Medium-High. This experiment establishes the empirical thresholds for the three DEV-NEED-029 criteria:

ResidueField center count > N_min (threshold TBD)
MECH-320 EWMA > epsilon (threshold TBD)
E3 score variance > noise floor (threshold TBD)

Design: Two-phase within a single run: (Phase 1) allow substrate to warm naturally until residue_coverage_pct > 0.15 and z_goal.norm() > 0.4 (infant gate criteria met); (Phase 2) run diversity mechanism sweep on the now-warm substrate. Observe at what substrate state each diversity mechanism first produces non-zero differential. That state becomes the gate threshold.

Interpretation grid:

Outcome	Interpretation	Next action
All mechanisms produce non-zero differential once residue_coverage > 0.15	Gate criterion is residue coverage; this is the single warm-start proxy	Adopt residue_coverage > 0.15 as DEV-NEED-029 gate; unblocks Q-043/044/045 queuing
MECH-320 needs EWMA > specific value independent of residue	Gate criterion is EWMA-based; both thresholds needed	Register both thresholds; update DEV-NEED-029 with dual criteria
E3 score variance is the limiting factor	Gate criterion is E3 training depth, not substrate coverage	Gate criterion needs episode count + E3 training metric

Downstream decision: PASS establishes the empirical thresholds for DEV-NEED-029, converting it from a PROPOSED gap into a blocking gate with measurable criteria. This is the meta-experiment that makes all subsequent ARC-065 behavioral experiments scientifically valid.

RANK 7 — Q-043 / Q-044 / Q-045 ARC-065 Behavioral Ablation Cohort

Status: Designed; blocked on warm-start gate confirmation
DEV-NEED IDs: DEV-NEED-003, DEV-NEED-005, DEV-NEED-029
Claim IDs: MECH-313, MECH-314, MECH-320, Q-043, Q-044, Q-045
Priority: Medium (becomes Highest once DEV-NEED-029 gate is passed)

Why seventh (not first):

These experiments have substrate-readiness PASSes (EXQ-544, EXQ-545, EXQ-547).
But behavioral discriminative tests require warm substrate (DEV-NEED-029 gate).
Running now = zero information (EXQ-573 demonstrated this).
Running after EXQ-ISEF-006 warm-start calibration = maximum information.

Q-043 design: Relative weight calibration (MECH-313 vs MECH-314 vs MECH-320 contribution weight sweep). 4-arm factorial. Run AFTER DEV-NEED-029 gate established.

Q-044 design: MECH-314 sub-flavour independence test (314a alone, 314b alone, 314c alone, all three). Establishes whether any single sub-flavour drives the diversity benefit.

Q-045 design: MECH-313 vs MECH-260 collapse falsifier with extended 8-cell design (per R4 verdict from Q-045 lit-pull). Requires SD-054 multi-trial temporal horizon precondition (R5 precondition from Kennerley 2006).

RANK 8 — EXQ-IDEV-006: SD-017 Functional Validation (Context Discrimination)

Status: Partially designed in replay_development_analysis.md §7.6
DEV-NEED IDs: DEV-NEED-007
Claim IDs: SD-017, MECH-166, INV-044
Priority: Medium. This is the functional test that EXQ-265/265a methods PASSes built toward.

Design: Two conditions over 3 waking+sleep cycles: (A) 3 sessions + 3 sleep cycles, (B) 3 sessions + no sleep. Measure context cosine_sim before and after each cycle. Target: sleep condition context_cosine_sim < 0.95 by cycle 3; no-sleep stays > 0.95.

Prerequisite: SD-017 SWS/REM phase infrastructure is confirmed working (EXQ-265a PASS — confirmed). Residue geography must exist for context differentiation to be meaningful — needs EXQ-ISEF-001/003 PASS.

Why this matters: INV-049 (offline phases mathematical necessity) has FAILed twice. The functional SD-017 test would provide direct positive evidence for the consolidation mechanism rather than trying to ablate it. If this PASSes, it strongly supports INV-049 despite its ablation failures, and confirms the content-poverty interpretation of those FAILs.

RANK 9 — EXQ-IDEV-001: Longitudinal Infant Replay Quality

Status: Designed in replay_development_analysis.md §7.1
DEV-NEED IDs: DEV-NEED-007, DEV-NEED-008
Claim IDs: INV-055, ARC-011, INV-049
Priority: Medium. This is the active-learning-optimal query for INV-049 (Settles 2012 criterion).

Design: Track replay metrics longitudinally across the infant phase. If replay quality metrics (replay_zone_coverage_fraction, replay_residue_coverage_delta, post_sleep_z_goal_retention) improve monotonically and converge with gate criteria at the same episode count, this confirms that replay is driving developmental progress, not just measuring noise.

Critical failure interpretation: If replay quality metrics improve but gate criteria do NOT improve, replay is reorganizing impoverished content rather than creating new developmental capacity — the substrate fix is the correct intervention, confirming the content-poverty theory of INV-049 FAILs.

Prerequisite: Requires warm infant substrate (EXQ-ISEF-001..004 PASS).

RANK 10 — ARC-062 Phase 3 Wiring Redesign

Status: V3-EXQ-543d FAIL; redesign needed
DEV-NEED IDs: DEV-NEED-019 (gradual responsibility expansion)
Claim IDs: ARC-062, SD-033a, MECH-260
Priority: Medium. This is the next architectural wiring pass on a separate (non-monostrategy-dependent) thread.

ARC-062 Phase 3 (discriminator → SD-033a LateralPFCAnalog.update() + bias-head parameters to E3 optimizer) has FAILed in three iterations (543b, 543c, 543d). The 2x2 factorial (gated_policy x dACC) with MECH-260 anti-recency=0.5 did not isolate the wiring issue. The next iteration needs to: (a) verify the discriminator output is reaching its consumer with non-trivial signal, (b) instrument the LateralPFCAnalog.update() path for signal through-flow before the full training run.

Downstream decision: Phase 3 PASS closes commitment_closure_plan.md GAP-1 and unblocks the SD-033a/SD-034 consumer cascade (goal_pipeline_plan.md GAP-2/GAP-5).

5. Dependency-Aware Scheduling

EXQ-ISEF-001 (harm gradient)
    │
    ├──> EXQ-ISEF-002 (transient benefit patches) [parallel safe]
    │        │
    │        └──> Infant gate criteria become testable
    │
    ├──> EXQ-ISEF-003 (microhabitat zones) [parallel safe after env feature lands]
    │        │
    │        └──> z_world coverage testable
    │
    └──> EXQ-ISEF-004 (curiosity bonus calibration) [needs EXQ-ISEF-003 warm substrate]
             │
             ├──> novelty_bonus_weight calibrated for Phase 1/2
             │
             └──> EXQ-ISEF-006 (DEV-NEED-029 warm-start gate calibration)
                      │
                      ├──> Q-043 / Q-044 / Q-045 ARC-065 behavioral ablation [UNBLOCKED]
                      │
                      ├──> EXQ-IDEV-006 (SD-017 functional validation) [UNBLOCKED]
                      │        │
                      │        └──> EXQ-IDEV-001 (longitudinal replay quality)
                      │
                      └──> EXQ-IDEV-003 (coverage vs RPE replay policy) [UNBLOCKED]

EXQ-ISEF-005 (full curriculum vs flat) [after EXQ-ISEF-001..004 all PASS]

ARC-062 Phase 3 wiring [independent thread, no dependency on infant substrate]
    │
    └──> SD-033a consumer cascade (goal_pipeline GAP-2/5)
             │
             └──> SD-049 Phase 2 behavioral validation (after GAP-3 env extensions)

Critical path: EXQ-ISEF-001 → EXQ-ISEF-003 → EXQ-ISEF-006 → Q-043/044/045

Parallel-safe pairs:

EXQ-ISEF-001 ‖ EXQ-ISEF-002 (independent environment features)
ARC-062 Phase 3 redesign ‖ any EXQ-ISEF experiment (different code paths)
EXQ-IDEV-006 ‖ Q-043/044/045 (once warm-start gate established)

Serial dependencies:

EXQ-ISEF-004 must follow EXQ-ISEF-003 (needs warm microhabitat substrate for meaningful attractor audit)
EXQ-ISEF-006 must follow EXQ-ISEF-001 + EXQ-ISEF-003 (needs populated ResidueField and z_world zones)
Q-043/044/045 must follow DEV-NEED-029 gate establishment (EXQ-ISEF-006)

6. Experiments to Defer

The following experiments should NOT be queued until the specified prerequisite conditions are met. Queuing them now would produce zero-information results at non-trivial runner cost.

Experiment type	Prerequisite not met	Approximate prerequisite experiment	Deferral rationale
Any ARC-065 behavioral discriminative pair	DEV-NEED-029 warm-start gate not established	EXQ-ISEF-006	Zero KG / zero SNR on cold substrate (EXQ-573 demonstrated)
INV-049 retest (offline consolidation necessity ablation)	Infant substrate content poverty not fixed	EXQ-ISEF-001, EXQ-ISEF-003	INV-049 FAILs are most plausibly content-poverty failures, not theory failures; retesting under same conditions wastes oracle budget
SD-029 self-attribution retest	V_s monostrategy not resolved	ARC-065 behavioral validation + SD-055 differentiable CEM	SD-029 requires agent-vs-environment event distributions; monomodal policy cannot generate balanced C2/C3 distributions
ARC-016 precision commitment retest	Substrate-ceiling determination pending	V4 spec or major substrate enrichment	EXQ-530/530c FAIL pattern is consistent with substrate ceiling; more experiments on same substrate will not advance this
EXQ-IDEV-002 (play+real interleaved replay)	Play substrate not implemented	V4 developmental harness spec	Requires bilateral play frame + caregiver scaffold; full V4 play substrate needed
EXQ-IDEV-003 (coverage vs RPE replay policy)	Warm-start gate not established AND MECH-285 staleness priority not implemented	DEV-NEED-029 gate + MECH-285 implementation	Replay policy comparison is only meaningful on warm substrate with staleness mechanism present
EXQ-IDEV-004 (sleep prevents residue saturation)	Warm-start gate not established	EXQ-ISEF-006	Residue saturation prevention requires populated residue field; cold substrate has nothing to saturate
Q-043/044/045 before warm-start gate	DEV-NEED-029 gate not met	EXQ-ISEF-006	See EXQ-573; mechanism ABs produce identical arms on cold substrate
SD-032b dACC arbitration behavioral test	ARC-062 Phase 3 wiring not complete	commitment_closure_plan GAP-1	Consumer of Phase 3 wiring; premature test cannot route arbitration signal
Any V4 social/language experiment	V4 multi-agent substrate not built	V4 spec and implementation	INV-043, MECH-158, ARC-047, ARC-048 require genuine caregiver + peer agents; no V3 proxy is valid

Deferral Principle

Do not queue an experiment where the pre-registered interpretation grid contains no row that would produce a theory-informative result. If every plausible experimental outcome routes to “substrate not ready, fix substrate first,” the experiment has zero developmental information value.

For INV-049 specifically: a third FAIL under current substrate would not change the current interpretation (content poverty, not theory failure). The correct response is EXQ-IDEV-001 (longitudinal replay quality), which tests whether replay IS improving within-stage even if the between-condition ablation is not yet discriminative.

7. Governing Principles for This Priority List

7.1 Substrate First, Mechanisms Second

The ARC-065 mechanism cluster (MECH-313, MECH-314, MECH-320) is implemented and has substrate-readiness PASSes. The infant substrate expansion (EXQ-ISEF series) is the prerequisite for these mechanisms to produce non-zero differential. Queuing behavioral validation before the substrate produces warm states is equivalent to testing whether a car engine works by testing it with no fuel — the negative result tells you nothing about the engine.

7.2 The Warm-Start Gate Is the Science Gate

DEV-NEED-029 is not just a technical prerequisite. It is the scientific gate that determines whether ARC-065 behavioral experiments can produce theory-informative results. Until DEV-NEED-029 thresholds are empirically calibrated (EXQ-ISEF-006), every ARC-065 behavioral test has unknown interpretability.

7.3 FAILs Are Information — Use Them

The recent FAIL cluster (EXQ-573, EXQ-549, EXQ-552, EXQ-553) is extremely informative:

They confirm the cold-start warm-start failure theory.
They localise the monostrategy cliff to the CEM proposer.
They establish that the basin is agent-init-dependent (not universal). The correct response is to use this information to design better experiments, not to add more parameters or retry under near-identical conditions.

7.4 PASS Infrastructure Counts — Build On It

The recent PASS cluster (substrate-readiness probes, SD-055 differentiable CEM, SD-054 bipartite layout) represents real capability gains. The SD-055 differentiable CEM (EXQ-568 PASS) is particularly valuable as it directly targets the CEM proposer-stage monostrategy cliff. This should be integrated into the infant substrate alongside the environment enrichment features.

7.5 Distinguish Curriculum Failure from Substrate Failure

A substrate failure is when the experimental conditions cannot produce the theoretical outcome regardless of curriculum (e.g., no residue geography, no z_world diversity). A curriculum failure is when the conditions are present but the staging is wrong (e.g., E3 activated too early, suppressing the Phase 0 babbling window). The EXQ-ISEF series tests for substrate failures. EXQ-ISEF-005 tests for curriculum failures. These must not be conflated.

7.6 The Monostrategy Is Developmental, Not Mechanical

The seed-dependent escape from monostrategy (EXQ-555/557) suggests the monostrategy basin is not a mathematical certainty for any given architecture. Some initializations escape. The infant substrate enrichment is designed to reshape the attractor landscape during the developmental window — creating competing benefit/harm attractors before the policy can converge to a single basin. This is a developmental intervention, not a parameter fix. Getting it right requires the sequenced curriculum (Phase 0 protected babbling, Phase 1 benefit discovery, Phase 2 competing attractors) specified in infant_substrate_expansion.md.

8. Open Questions for Governance

DEV-NEED-029 thresholds: The three threshold values (N_min for ResidueField centers, epsilon for MECH-320 EWMA, noise floor for E3 score variance) are TBD. EXQ-ISEF-006 must be designed and PASSed before these can become blocking gate criteria. Who decides the minimum acceptable thresholds?
Phase 0 reward-free epoch: Griffin et al. (2026) + Garcia-Guzman (2026) ground the mechanistic necessity of a reward-free Phase 0. Currently ARC-046 mentions the infant stage but does not explicitly mandate a Phase 0 epoch. Should this be registered as an ARC-046 amendment or as a new candidate ARC-claim?
Wanting-before-liking sequencing (candidate MECH): The Berridge/Keren-Portnoy finding (wanting = approach motivation, seeded before liking is calibrated) is not currently a registered MECH. It would be a child of MECH-189. Governance decision needed before EXQ-ISEF-002 results can be properly tagged.
INV-049 status: Two FAILs (EXQ-385, EXQ-385a) but strong literature grounding. The content-poverty interpretation is now well-supported. Should INV-049 be flagged pending_substrate_reconfirmation while EXQ-ISEF series is in progress? This would prevent the two FAILs from continuing to weigh against INV-049 in governance scoring.
ARC-070/071 R6 SAFETY-CRITICAL: The governance decision on MECH-094 hypothesis_tag strict-vs-relaxed for chunking writes (ARC-071) is gated and should not be held up by the infant substrate work. These are independent threads.

infant_substrate_expansion.md — substrate feature proposals (EXQ-ISEF series scripts + gate criteria)
replay_development_analysis.md — replay bottleneck analysis + candidate experiments (EXQ-IDEV series)
developmental_needs_register.md — full DEV-NEED traceability table
developmental_metrics.md — quantitative gate thresholds
behavioral_diversity_acceptance_criteria.md — Rung 0-4 diversity ladder
evidence/planning/infant_substrate_expansion_plan.md — when created, this will be the plan-of-record for EXQ-ISEF series

Developmental Experiment Priorities

1. Executive Summary

2. Current State Assessment

2.1 FAIL Pattern Analysis

2.2 PASS Pattern Analysis

2.3 Monostrategy Characterisation (as of 2026-05-16)

3. Literature Evidence Mapping Table

3.1 Experiment Prioritization Under Uncertainty

3.2 Active Learning for Scientific Discovery

3.3 Causal Experiment Design

3.4 Ablation Study Design

3.5 Developmental Robotics Benchmarking

3.6 Curriculum Ablation in Reinforcement Learning

3.7 Information Gain in Experimental Design

4. Ranked Experiment Proposals

RANK 1 — EXQ-ISEF-001: Harm Gradient Baseline

RANK 2 — EXQ-ISEF-002: Transient Benefit Patches (z_goal Seeding Rate)

RANK 3 — EXQ-ISEF-003: Microhabitat Zones (Latent State Diversity)

RANK 4 — EXQ-ISEF-004: Stochastic Attractor Audit + Curiosity Bonus Calibration

RANK 5 — EXQ-ISEF-005: Full Infant Curriculum vs Flat Parameters

RANK 6 — DEV-NEED-029 Warm-Start Gate Calibration (EXQ-ISEF-006, to be designed)

RANK 7 — Q-043 / Q-044 / Q-045 ARC-065 Behavioral Ablation Cohort

RANK 8 — EXQ-IDEV-006: SD-017 Functional Validation (Context Discrimination)

RANK 9 — EXQ-IDEV-001: Longitudinal Infant Replay Quality

RANK 10 — ARC-062 Phase 3 Wiring Redesign

5. Dependency-Aware Scheduling

6. Experiments to Defer

Deferral Principle

7. Governing Principles for This Priority List

7.1 Substrate First, Mechanisms Second

7.2 The Warm-Start Gate Is the Science Gate

7.3 FAILs Are Information — Use Them

7.4 PASS Infrastructure Counts — Build On It

7.5 Distinguish Curriculum Failure from Substrate Failure

7.6 The Monostrategy Is Developmental, Not Mechanical

8. Open Questions for Governance

Related Documents