SpilloverDiD survey_design= on HC1/CR1 via Binder TSL (Wave E.1) by igerber · Pull Request #468 · igerber/diff-diff

igerber · 2026-05-19T11:07:42Z

Summary

SpilloverDiD now supports survey_design= for vcov_type ∈ {hc1} plus cluster=<col> (CR1), closing the Wave B/C/D NotImplementedError gate. Documented synthesis of Gerber (2026, arXiv:2605.04124) Proposition 1 (Binder Taylor Series Linearization for IF representations of smooth functionals; explicitly derived for TwoStageDiD in the Appendix) with the Wave D Gardner GMM first-stage uncertainty correction (Butts 2021 §3.1 + Gardner 2022 §4) applied to SpilloverDiD's ring-indicator stage-2 design. No reference software combines all ingredients. Mechanical: Wave D Psi → PSU-aggregated → audited Binder TSL meat; survey weights enter via Hájek normalization at gamma_hat / eps / bread.
Both event_study=False AND event_study=True branches; both is_staggered=True AND is_staggered=False paths audited per feedback_cohort_loop_trigger_cache_both_branches.
Public surface deferrals: vcov_type="conley" + survey_design= → NotImplementedError (Wave E.2 will compose Conley × survey product-kernel with within-stratum Conley sandwich on PSU totals); replicate-weight variance (BRR/Fay/JK1/JKn/SDR) → NotImplementedError (Gerber 2026 Appendix A notes the IF-reweighting shortcut does not apply because gamma_hat is weight-sensitive — needs per-replicate full re-fit); non-pweight → ValueError. Tracked in TODO.md.
Shared-helper fix: _inject_cluster_as_psu now honors SurveyDesign.nest (raises on cross-stratum cluster overlap under nest=False), matching the explicit-PSU resolver. Also benefits TwoStageDiD survey path (18 existing TwoStageDiD survey tests pass unchanged).

Methodology references

Method: SpilloverDiD ring-indicator estimator with survey-design variance (Wave E.1 — HC1 / CR1 via Binder TSL composed with Wave D Gardner GMM)
Papers / sources:
- Gerber, I. (2026). "Design-Based Variance Estimation for Modern Heterogeneity-Robust Difference-in-Differences Estimators." arXiv:2605.04124. (Proposition 1 + Appendix derives TwoStageDiD's IF.)
- Binder, D. A. (1983). "On the Variances of Asymptotically Normal Estimators from Complex Surveys." International Statistical Review, 51(3), 279-292.
- Butts, K. (2023, originally 2021). "Difference-in-Differences with Spatial Spillovers." arXiv:2105.03737v3.
- Gardner, J. (2022). "Two-stage differences in differences." arXiv:2207.05943.
Intentional deviation from TwoStageDiD: under saturated df_survey = 0 (singleton PSUs + lonely_psu="remove"), Wave E.1 surfaces a UserWarning matching "df_survey" and NaN-fails. TwoStageDiD's _compute_gmm_variance (two_stage.py:2003-2005) currently NaN-fails silently; Wave E.1 surfaces the diagnostic per feedback_no_silent_failures. Documented as a **Note:** in docs/methodology/REGISTRY.md SpilloverDiD "Variance (Wave E.1)" subsection.
Full synthesis derivation + restrictions documented in docs/methodology/REGISTRY.md SpilloverDiD section and docs/api/spillover.rst.

Validation

Tests added: TestSpilloverDiDWaveE1SurveyDesignHc1 (17 invariants — bit-identity fallback, Binder TSL hand-check uniform + non-uniform, lonely_psu, FPC degenerate limits ×3, saturated NaN-fail with pytest.warns(match="df_survey"), rejections (conley+survey, replicate, non-pweight), fit idempotency, finite_mask subsetting, no-PSU regressions (weights-only, weights+strata, cluster+survey-no-PSU, cluster overlap nest=False/True), zero-weight Omega_0 exclusion + all-zero raises) + TestSpilloverDiDWaveE1SurveyDesignEventStudy (7 — both is_staggered branches with df_survey lincom verification, distinguishability between survey-share and sample-share lincom rules via cohort-correlated weights + non-constant tau_k + manual reconstruction, aggregate-vs-event-study parity, drift goldens, subset-path invariant).
Full SpilloverDiD test pass: 250 tests pass (224 prior + 24 new + 2 existing tests updated for the new contract). TwoStageDiD survey path unaffected: 18 tests pass.
Methodology drift goldens pinned at rtol=1e-12, atol=1e-14 per feedback_assert_allclose_numerical_parity.

Security / privacy

Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

github-actions · 2026-05-19T11:20:05Z

Overall assessment

✅ Looks good

Executive summary

No P0/P1 methodology issues found. The Wave E.1 survey path matches the cited methodological stack: Binder-style Taylor linearization for asymptotically linear estimators, Gardner’s two-stage GMM variance framing, and Butts’ spillover DiD setup. The two numerically meaningful departures visible in the diff are explicitly documented in docs/methodology/REGISTRY.md:L3053-L3116, so I do not count them as defects. citeturn6academia0turn6academia1turn6academia2turn6academia3
One concrete P2 issue remains: _inject_cluster_as_psu changed its default cross-stratum behavior, but its direct unit test in tests/test_survey.py still asserts the pre-PR contract.
The deferred Conley×survey and replicate-weight paths are properly tracked in TODO.md, so those follow-ups are mitigated per the review rules.
I could not run the test suite here because the sandbox is missing Python deps (numpy import failed); the test finding below is from static inspection.

Methodology

No findings. The new survey-design variance route is consistent with the cited source material, and the changed event-study share rule plus the surfaced df_survey=0 warning are both documented in docs/methodology/REGISTRY.md:L3053-L3116. citeturn6academia0turn6academia1turn6academia2turn6academia3

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The unsupported vcov_type="conley" + survey_design and replicate-weight variance paths are explicitly tracked in TODO.md, so they are non-blocking.

Security

No findings.

Documentation/Tests

Severity: P2
Impact: diff_diff/survey.py:L1285-L1343 now makes _inject_cluster_as_psu raise on cross-stratum cluster-label reuse unless nest=True, but tests/test_survey.py:L2906-L2927 still expects the old always-nest behavior. The new spillover tests already encode the revised contract at tests/test_spillover.py:L5016-L5094, so the test surface is internally inconsistent and likely red in a full run.
Concrete fix: Update tests/test_survey.py::TestRound16Fixes.test_injected_cluster_nested_in_strata so it either sets nest=True explicitly, or replace it with the same two-branch contract now used in spillover: default nest=False should raise on overlap, and nest=True should succeed.

github-actions · 2026-05-19T11:57:04Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 15044a488804b49763fe538a7ba2a57a75597157

Overall assessment

✅ Looks good

Executive summary

The previous re-review issue is resolved: tests/test_survey.py now matches the new nest contract, and spillover coverage exercises both nest=False rejection and nest=True success in tests/test_survey.py:L2906-L2954 and tests/test_spillover.py:L4985-L5056.
No unmitigated P0/P1 methodology issues found in the new SpilloverDiD survey path. The changed code matches the documented Wave E.1 contract for Binder TSL on the Wave D GMM scores, survey-weighted event-study aggregation, and df_survey-based inference in diff_diff/spillover.py:L2943-L2954, diff_diff/spillover.py:L3071-L3164, and docs/methodology/REGISTRY.md:L3053-L3057, docs/methodology/REGISTRY.md:L3100-L3116.
The deferred vcov_type="conley" + survey_design and replicate-weight paths are properly front-door rejected and tracked in TODO.md, so they are non-blocking in diff_diff/spillover.py:L2187-L2401 and TODO.md:L135-L136.
One minor follow-up remains: ResolvedSurveyDesign.subset_to_units() does not propagate the new nest flag, so unit-level survey copies do not fully preserve the source design contract in diff_diff/survey.py:L635-L679.

Methodology

No findings. Affected method: SpilloverDiD Wave E.1 survey-design variance. The two behavior changes most likely to look like deviations, survey-weighted horizon shares and the surfaced df_survey=0 warning/NaN-fail, are explicitly documented in docs/methodology/REGISTRY.md:L3053-L3057 and docs/methodology/REGISTRY.md:L3112-L3116, so they are informational rather than defects.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3 [Newly identified]
Impact: ResolvedSurveyDesign.subset_to_units() rebuilds a new ResolvedSurveyDesign without carrying over nest, even though this PR makes nest load-bearing for _inject_cluster_as_psu. That means unit-level survey copies now silently revert to nest=False, so the new nest-aware design contract is only partially propagated. This does not affect the new SpilloverDiD path directly, but it leaves the PR's propagation incomplete. Relevant locations: diff_diff/survey.py:L560-L588, diff_diff/survey.py:L635-L679, with callers in diff_diff/continuous_did.py:L623-L651, diff_diff/continuous_did.py:L1276-L1302, diff_diff/continuous_did.py:L1420-L1437, and diff_diff/efficient_did.py:L504-L527.
Concrete fix: add nest=self.nest in ResolvedSurveyDesign.subset_to_units(), and add a regression test that collapses a nest=True design to unit level and verifies the flag is preserved.

Tech Debt

No findings. The deferred Wave E.2 and replicate-weight work is already tracked in TODO.md:L135-L136.

Security

No findings.

Documentation/Tests

No findings. The prior test-surface inconsistency is fixed. Validation note: this was a static review only; I could not execute the test suite in this sandbox because pytest and numpy are not installed.

github-actions · 2026-05-19T12:43:56Z

🔁 AI review rerun (requested by @igerber)

Head SHA: a161725794fafbca89bfa873d9c3693e7fc789b6

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior re-review item is resolved: nest now propagates through ResolvedSurveyDesign.subset_to_units() and is covered in diff_diff/survey.py and tests/test_survey.py.
The Wave E.1 implementation otherwise matches the documented registry contract for Binder TSL composition, survey-weighted event-study aggregation, df_survey-based inference, and the documented Wave E.1 deviations/rejections in docs/methodology/REGISTRY.md and docs/methodology/REGISTRY.md.
Severity: P1. The new cluster=<col> survey path turns cluster into the effective PSU after the only within-unit survey-constancy validation has already run, so a time-varying cluster can silently alter n_psu, df_survey, and Binder TSL variance instead of being rejected.
Severity: P2. The new finite_mask survey subsetting branch is not directly exercised by the added tests; both new tests explicitly say they only cover the no-op or indirect path.
Static review only. I could not run the suite here because pytest and numpy are not installed in this sandbox.

Methodology

Severity: P1. Impact: Affected method is the new SpilloverDiD Wave E.1 survey-design variance, specifically the cluster=<col> CR1 path. fit() validates within-unit constancy only for columns named in survey_design, then later injects cluster=<col> as the effective PSU. That means survey_design.psu="cluster" would be rejected by the panel-survey contract in diff_diff/survey.py, but the same time-varying labels passed via cluster= are accepted and used to compute PSU totals, n_psu, and df_survey in diff_diff/spillover.py and diff_diff/spillover.py. That is an undocumented missing assumption check on the new variance path. Concrete fix: before _inject_cluster_as_psu, validate that cluster is constant within unit whenever it is being used as an implicit PSU, or extend _validate_unit_constant_survey() to cover the injected cluster column too.
No other methodology findings. The survey-share att aggregation, df_survey threading, documented df_survey=0 warning/NaN-fail, and the Conley/replicate rejections all match the registry and tracked TODO items in diff_diff/spillover.py, diff_diff/spillover.py, diff_diff/two_stage.py, and TODO.md.

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings. The deferred conley + survey_design and replicate-weight follow-ons are properly tracked in TODO.md.

Security
No findings.

Documentation/Tests

Severity: P2. Impact: The load-bearing branch that subsets survey_weights/strata/psu/fpc after finite_mask drops rows lives in diff_diff/spillover.py, but the added tests explicitly acknowledge they do not hit that n_nan > 0 path and only check the pass-through or an indirect invariant in tests/test_spillover.py and tests/test_spillover.py. Concrete fix: add a survey-enabled SpilloverDiD fixture that actually forces finite_mask row drops and then asserts post-mask array alignment plus recomputed n_psu, n_strata, and df_survey.

Path to Approval

Add a within-unit-constancy guard for cluster=<col> when it is being injected as the effective PSU on the survey path.
Add a regression showing that a time-varying cluster is rejected on the implicit-PSU path, and that a unit-constant cluster still matches the explicit survey_design.psu= path.

github-actions · 2026-05-19T13:05:09Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 9c9532eae457fa169e188340821adb4f18c726f8

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings remain in this re-review.

Executive Summary

The prior P1 is fixed: Spillover’s implicit-PSU path now rejects time-varying cluster=<col> before cluster-as-PSU injection, matching the explicit-PSU survey contract (diff_diff/spillover.py:L2904-L2954, tests/test_spillover.py:L5021-L5088).
The prior P2 is fixed: the survey finite_mask drop/subset branch now has a real regression test instead of only no-op coverage (tests/test_spillover.py:L5516-L5565).
Spillover Wave E.1’s HC1/CR1 survey implementation matches the documented Binder TSL + Gardner/GMM contract, df_survey inference threading, survey-share event-study aggregation, and explicit Wave E.1 deviations in the Methodology Registry (docs/methodology/REGISTRY.md:L3047-L3118).
The remaining PreTrendsPower methodology gaps are explicitly documented/tracked in the authoritative registry/TODO, so they are informational rather than blocking (docs/methodology/REGISTRY.md:L2805-L2823, TODO.md:L97-L101).
One P3 docs issue remains: the public PreTrends API page still describes a nonexistent custom_delta argument and unnormalized violation formulas that do not match the shipped implementation (docs/api/pretrends.rst:L115-L128, diff_diff/pretrends.py:L470-L476, diff_diff/pretrends.py:L516-L559).
Static review only: I could not run tests in this sandbox because pytest, numpy, pandas, and scipy are unavailable.

Methodology

Severity: P3. Impact: PreTrendsPower still carries known deviations from Roth (linear-γ scaling and CS/SA diagonal Σ22 fallback), but they are explicitly documented in the authoritative registry and tracked in TODO, so they do not block this PR (docs/methodology/REGISTRY.md:L2805-L2823, TODO.md:L97-L101). Concrete fix: none required in this PR; handle in the tracked follow-up audit.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: this PR removes the dedicated PreTrends methodology fixture/generator (tests/test_methodology_pretrends.py, benchmarks/R/generate_pretrends_golden.R) and correspondingly demotes the surface back to tracked debt: METHODOLOGY_REVIEW.md now marks PreTrendsPower as In Progress, with the missing audit/parity work recorded in TODO.md (METHODOLOGY_REVIEW.md:L1043-L1061, TODO.md:L97-L101). Concrete fix: none required for approval; restore the pinned-R/methodology harness in the follow-up audit PR.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: docs/api/pretrends.rst misdocuments the public PreTrendsPower API. It says custom uses custom_delta and describes linear/constant violations as raw delta[t] = M * t / delta[t] = M, but the shipped API takes violation_weights and normalizes every violation direction to unit L2 norm before scaling by M (docs/api/pretrends.rst:L115-L128, diff_diff/pretrends.py:L470-L476, diff_diff/pretrends.py:L516-L559). That will mislead users about both the argument name and the meaning of M. Concrete fix: update the API page to use violation_weights and either describe the normalized-direction contract directly or link readers to the registry deviation note (docs/methodology/REGISTRY.md:L2805-L2823).

github-actions · 2026-05-19T14:46:06Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 0391c79d6c2788098d79505429212996491f8b4c

Overall Assessment

⚠️ Needs changes

The Spillover Wave E.1 survey work looks coherent in this re-review, but the PR still leaves unmitigated P1 methodology/reporting issues in the PreTrendsPower stack.

Executive Summary

Prior Spillover survey-path issues appear resolved; I did not find a new P0/P1 in the Wave E.1 Binder-TSL path.
[Newly identified] P1: PreTrendsPower is still Wald-only, and the updated methodology docs still do not record that as a deviation from Roth’s paper-analyzed NIS pretest.
[Newly identified] P1: violation_type="constant" is still normalized internally, so exposed M/MDV are mis-scaled relative to the registry/API contract δ_t = M.
[Newly identified] P1: the new mdv / |ATT| tiering in DiagnosticReport / BusinessReport compares incompatible quantities and can misclassify power.
The previously tracked linear-weight/γ-units issue is now explicitly documented in the registry, so I treat that one as informational rather than blocking.
The prior P3 docs issue still remains: docs/api/pretrends.rst still mentions nonexistent custom_delta.

Path to Approval

Either implement Roth-faithful NIS support/default for PreTrendsPower, or add an explicit **Note (deviation from paper):** in docs/methodology/REGISTRY.md and matching API/reporting docs that this surface is intentionally Wald-only.
Align violation_type="constant" with its public contract: either expose/report per-period level-shift units (δ_t = M) or add an explicit deviation note and update the API docs so the reported M/MDV are not mislabeled.
Remove or redesign mdv_share_of_att so the numerator and denominator are on the same contract, and add a regression test that equivalent pre-period codings do not change the power tier mechanically.

Methodology

Severity: P1 [Newly identified]. PreTrendsPower still computes only Wald/noncentral-χ² power and MDV in diff_diff/pretrends.py:L726-L861, and the registry now presents that as the main contract in docs/methodology/REGISTRY.md:L2782-L2797. Roth’s paper review in-repo still describes the paper-analyzed/common pretest as the NIS box and treats Wald as only a paper-supported alternative in docs/methodology/papers/roth-2022-review.md:L40-L49 and docs/methodology/papers/roth-2022-review.md:L164-L178. Impact: this PR leaves a source-material deviation undocumented under the project’s required Note/Deviation labels, while downstream reporting consumes the resulting MDV/power as if they were the Roth-cited object. Concrete fix: either implement/expose NIS as the paper-faithful form, or add an explicit deviation note in REGISTRY.md and the public docs that this feature is intentionally Wald-based.
Severity: P1 [Newly identified]. violation_type="constant" is still exposed on the wrong scale. The implementation normalizes equal weights to unit norm in diff_diff/pretrends.py:L544-L557, and the tests lock that behavior in at tests/test_pretrends.py:L242-L252, so a reported constant MDV of M actually means a per-period deviation of M / sqrt(K). The registry/API still describe the type as δ_t = c / δ[t] = M with no deviation note in docs/methodology/REGISTRY.md:L2799-L2803 and docs/api/pretrends.rst:L119-L128. Impact: constant-form MDV/power are numerically mislabeled and not comparable across numbers of pre-periods. Concrete fix: either stop normalizing the constant pattern before exposing M/MDV, or keep the internal normalization but convert all reported quantities back to per-period level-shift units and document that choice.
Severity: P1 [Newly identified]. The new power tier mixes incompatible quantities. DiagnosticReport now hard-codes compute_pretrends_power(..., violation_type="linear") and computes mdv_share_of_att = mdv / abs(att) in diff_diff/diagnostic_report.py:L1415-L1452; REPORTING.md makes that ratio the tier driver in docs/methodology/REPORTING.md:L284-L344; BusinessReport renders it as MDV / |ATT| in diff_diff/business_report.py:L2467-L2474. But the numerator is a linear-pattern MDV built from normalized pretrend weights in diff_diff/pretrends.py:L516-L559, not an ATT-scale level effect. Impact: well_powered / moderately_powered / underpowered can change mechanically with the pre-period grid/normalization rather than substantive detectability. Concrete fix: base the tier on a same-scale quantity, or drop the numeric MDV / |ATT| ratio until the numerator is translated onto the ATT scale.
Severity: P3. The previously identified linear-pattern γ-units mismatch is now explicitly documented in docs/methodology/REGISTRY.md:L2805-L2807 and tracked in TODO.md:L101, so I do not count it against approval. Impact: informational only for this review. Concrete fix: none required in this PR.

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt

Severity: P3. The PR removes the dedicated PreTrends methodology/parity harness (tests/test_methodology_pretrends.py, benchmarks/R/generate_pretrends_golden.R) and METHODOLOGY_REVIEW.md explicitly demotes PreTrendsPower back to In Progress at METHODOLOGY_REVIEW.md:L1043-L1060. This is tracked in TODO.md:L97-L101, so it is non-blocking. Impact: lower automated protection against future methodology drift in this area. Concrete fix: none required for approval beyond keeping the tracked follow-up on the audit/parity path.

Security
No findings.

Documentation/Tests

Severity: P3. The public API page still documents a nonexistent custom_delta argument in docs/api/pretrends.rst:L127-L128, while the actual surface is violation_weights in diff_diff/pretrends.py:L470-L476, and the helpers explicitly do not support custom weights in diff_diff/pretrends.py:L1076-L1109. Impact: the user-facing docs still do not match the shipped API. Concrete fix: update the API page to use violation_weights and describe the helper limitation directly.
Static review only: I could not run tests in this sandbox because pytest, numpy, pandas, and scipy are unavailable.

github-actions · 2026-05-19T15:10:18Z

🔁 AI review rerun (requested by @igerber)

Head SHA: a04c96b2a8f3f543df31fdb99e9e2cbb974aadc3

Overall Assessment

✅ Looks good

Executive Summary

I did not find a new P0/P1 in the changed SpilloverDiD Wave E.1 survey-design path. The Binder-TSL/Gardner composition, df_survey inference, and survey-share event-study aggregation are consistent with docs/methodology/REGISTRY.md.
The prior AI review’s PreTrendsPower findings are outside the current changed-file set, so I did not count them in this diff-scoped re-review.
P2: the new Binder meat helper does repeated full-array scans per PSU, which makes the newly added no-PSU survey branches quadratic in n_obs when strata/FPC are present.
P3: diff_diff/guides/llms.txt describes "cluster" as a vcov_type, but the actual public contract is vcov_type="hc1" plus cluster=<col> for CR1.
P3: the new “hand-check” Binder test does not actually compare SpilloverDiD’s vcov to a manual end-to-end Binder reconstruction; it only checks metadata and then tests the shared helper separately.
Static review only: I could not run tests here because the sandbox lacks pytest, numpy, pandas, and scipy.

Methodology

No findings. I checked the affected methods against docs/methodology/REGISTRY.md:L3071-L3142: SpilloverDiD Wave E.1 Binder-TSL survey variance, the survey-weighted event-study scalar att, and the nest-aware cluster-as-PSU path. The implementation in diff_diff/spillover.py / diff_diff/two_stage.py matches the documented contract, including the documented Wave E.1 deviation that saturated df_survey=0 warns and NaN-fails instead of silently NaN-failing.

Code Quality

No findings.

Performance

Severity: P2. Impact: diff_diff/two_stage.py:L466-L480 maps PSU-to-stratum and PSU-to-FPC by calling np.where(psu_arr == c)[0][0] once per PSU. In the new no-PSU survey branches (psu_arr = arange(n_obs)), weights+strata / weights+fpc fits become quadratic in n_obs before the Binder meat is even computed. That materially undermines the new no-PSU support on larger panels. Concrete fix: compute first-occurrence indices once (for example with np.unique(psu_arr, return_index=True) or a single scatter from psu_indices) and index strata_arr[first_idx] / fpc_arr[first_idx] directly.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: diff_diff/guides/llms.txt:L61 says the feature works for vcov_type ∈ {hc1, cluster}, but the actual contract is vcov_type="hc1" plus optional cluster=<col> for CR1, as documented in docs/api/spillover.rst:L221-L223 and implemented in diff_diff/spillover.py:L1748-L1753, diff_diff/spillover.py:L3082-L3087. An agent following llms.txt can set vcov_type="cluster" and silently miss the intended CR1 path unless it also supplies cluster=. Concrete fix: change the guide text to vcov_type="hc1" with optional cluster=<col> for CR1, and ideally add front-door validation rejecting unsupported vcov_type values like "cluster".
Severity: P3. Impact: tests/test_spillover.py:L4677-L4729 is labeled as an end-to-end Binder hand-check, but it never compares SpilloverDiD’s vcov to a manual PSU-aggregated reconstruction; it checks metadata and then unit-tests _compute_stratified_meat_from_psu_scores on unrelated synthetic scores. That leaves the new Wave E.1 composition without the direct parity test the docstring claims. Concrete fix: either reconstruct Psi/bread and compare the estimator vcov end-to-end, or rename the test/docstring so it no longer overstates coverage.

github-actions · 2026-05-19T16:04:00Z

🔁 AI review rerun (requested by @igerber)

Head SHA: c0e8967c9c87ce95a401e33963e84bfb6bee2a14

Overall Assessment

✅ Looks good

Executive Summary

No new P0/P1 in the changed SpilloverDiD Wave E.1 survey path. The implementation matches the documented registry contract, and the source-material cross-check is consistent with the cited Butts/Gardner/Gerber stack. docs/methodology/REGISTRY.md:L3124-L3159, diff_diff/spillover.py:L2578-L3252, diff_diff/two_stage.py:L62-L514. citeturn3view0turn2view0turn4view0
The prior P2 performance finding appears resolved: _compute_binder_tsl_meat() now uses a single np.unique(..., return_index=True, return_inverse=True) mapping instead of repeated per-PSU scans. diff_diff/two_stage.py:L454-L480.
The prior llms.txt contract issue is resolved: the guide now documents vcov_type="hc1" with optional cluster=<col> for CR1. diff_diff/guides/llms.txt:L61.
One prior P3 remains: the new “Binder hand-check” test still does not do the end-to-end manual vcov reconstruction its docstring claims. tests/test_spillover.py:L4677-L4729.
Static review only: I could not run the suite here because this environment is missing both numpy and pytest.

Methodology

No findings.

Affected methods are SpilloverDiD Wave E.1 survey-design variance, the survey-weighted event-study scalar att, and nest-aware cluster-as-PSU resolution. The registry explicitly documents this as a synthesis, and the changed implementation follows that contract: weighted stage-1 FE support, weighted gamma_hat/Psi/bread, df_survey propagation, survey-share aggregation, and PSU injection behavior line up across docs/methodology/REGISTRY.md:L3071-L3159, diff_diff/spillover.py:L2578-L3252, diff_diff/two_stage.py:L62-L514, and diff_diff/survey.py:L596-L633 plus L1286-L1344. The paper cross-check is supportive rather than a SpilloverDiD-specific closed-form derivation: Butts defines the spillover DiD problem, Gardner provides the two-stage untreated-first estimator, and Gerber states that smooth IF- and regression-based modern DiD estimators admit Binder-style stratified-cluster survey variance. citeturn3view0turn2view0turn4view0

Code Quality

No findings. The changed inference sites still route through safe_inference, so I did not find a new inline t-stat / p-value / CI anti-pattern in the modified estimator path. diff_diff/spillover.py:L836-L1025, diff_diff/spillover.py:L3240-L3265.

Performance

No findings. The earlier O(G·n) PSU-to-stratum/FPC mapping issue is fixed in diff_diff/two_stage.py:L454-L480.

Maintainability

No findings. The new nest field is propagated through the changed survey-design copy/resolve paths and covered by targeted regression tests. diff_diff/survey.py:L243-L270, diff_diff/survey.py:L412-L422, diff_diff/survey.py:L661-L680, tests/test_survey.py:L2906-L3001.

Tech Debt

No findings. The remaining deferred pieces from this initiative, vcov_type="conley" + survey_design and replicate-weight variance, are explicitly tracked in TODO.md:L131-L134, so they are mitigated under the stated review rubric.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: tests/test_spillover.py:L4677-L4729 is still labeled as an end-to-end Binder hand-check, but it only verifies metadata on the fitted result and then unit-tests _compute_stratified_meat_from_psu_scores() on synthetic scores. That leaves the new estimator-level composition in diff_diff/spillover.py / diff_diff/two_stage.py without the direct parity check the docstring claims. Concrete fix: either reconstruct Psi and the bread on an actual SpilloverDiD.fit(...) run and compare result.vcov to a manual bread @ meat @ bread, or rename the test/docstring so it clearly describes helper-level coverage instead of end-to-end parity.

github-actions · 2026-05-19T16:18:59Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 5866913f2159f89b6ee67fdb6688db4693ab6f3d

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. The changed Wave E.1 survey-design path is consistent with the registry’s documented synthesis and with the cited source material at the level claimed by the PR: Binder-style design-based variance for smooth IF/regression-based modern DiD estimators, Gardner’s untreated-first two-stage structure, and Butts’s spillover DiD identification setting. (arxiv.org)

Executive Summary

No new P0/P1 in the changed SpilloverDiD(survey_design=...) path. The survey gates, positive-weight Omega_0 handling, PSU aggregation, df_survey propagation, and event-study survey-share aggregation are all documented and wired through the changed implementation. docs/methodology/REGISTRY.md:L3071-L3142, diff_diff/spillover.py:L2357-L2401, diff_diff/spillover.py:L2578-L2689, diff_diff/spillover.py:L2838-L3236, diff_diff/two_stage.py:L62-L514
The prior P3 from the last review is resolved: tests/test_spillover.py:L4677-L4696 no longer presents the Binder helper check as an end-to-end vcov reconstruction.
The earlier PSU-mapping performance concern appears resolved in diff_diff/two_stage.py:L454-L480.
One minor P3 remains in test prose: tests/test_spillover.py:L5414-L5418 says att_dynamic.n_obs carries survey-weight totals, but the implementation still stores raw counts and keeps survey-share totals only in internal weight_sum_per_col.
Deferred Conley×survey and replicate-weight variance work is correctly tracked in TODO.md:L131-L134, so those follow-ups are mitigated under the stated rubric.
Static review only: pytest is not installed in this environment, so I could not execute the new tests.

Methodology

No findings. Affected methods are the Wave E.1 survey-design variance path, the survey-weighted scalar att aggregation in event-study mode, and cluster→PSU resolution under survey inference. The registry explicitly documents these as a synthesis, and the cited papers support the high-level pieces the PR relies on: Binder covers complex-survey variance for asymptotically normal estimators, Gerber extends Binder-type design-based variance to smooth IF/regression-based modern DiD estimators, Gardner supplies the untreated-first two-stage structure, and Butts supplies the spillover DiD identification setting. docs/methodology/REGISTRY.md:L3071-L3142, diff_diff/spillover.py:L2357-L2401, diff_diff/spillover.py:L2578-L2689, diff_diff/spillover.py:L2838-L3236, diff_diff/two_stage.py:L62-L514 (www150.statcan.gc.ca)

No findings. The potentially controversial behaviors are explicitly documented rather than silent deviations: df_survey=0 warning+NaN propagation, Conley×survey rejection, and replicate-weight deferral. docs/methodology/REGISTRY.md:L3136-L3140, TODO.md:L131-L134

Code Quality

No findings. The modified inference call sites still route through safe_inference, so I did not find a new inline t-stat / p-value / CI anti-pattern in the changed estimator path. diff_diff/spillover.py:L828-L845, diff_diff/spillover.py:L993-L1024, diff_diff/spillover.py:L3240-L3265

Performance

No findings. _compute_binder_tsl_meat() now uses a single np.unique(..., return_index=True, return_inverse=True) mapping instead of repeated per-PSU scans. diff_diff/two_stage.py:L454-L480

Maintainability

No findings. The changed survey helpers consistently propagate nest in the touched construction/copy paths, and the new _inject_cluster_as_psu() behavior is covered by focused regression tests. diff_diff/survey.py:L243-L270, diff_diff/survey.py:L412-L422, diff_diff/survey.py:L661-L680, diff_diff/survey.py:L1286-L1344, tests/test_survey.py:L2906-L3002

Tech Debt

No findings. The remaining Wave E.2 Conley×survey and replicate-weight follow-ups are explicitly tracked in TODO.md, which mitigates them under the review rubric. TODO.md:L131-L134

Security

No findings.

Documentation/Tests

Severity: P3. Impact: tests/test_spillover.py:L5414-L5418 says att_dynamic.n_obs holds survey-weight totals on the survey path, but the implementation still fills n_obs from raw post-finite_mask row counts and keeps survey-share totals only in internal weight_sum_per_col. This is only a prose mismatch, but it can mislead future maintainers about the public results contract. Concrete fix: rewrite that docstring block to say the test reconstructs survey shares directly from normalized weights, or expose the share totals on the result object if that metadata is intended to be public. diff_diff/spillover.py:L821-L826, diff_diff/spillover.py:L844-L845, diff_diff/spillover.py:L2984-L2995

No further findings. The prior helper-vs-end-to-end test-description issue is resolved at tests/test_spillover.py:L4677-L4696.

github-actions · 2026-05-19T16:32:28Z

🔁 AI review rerun (requested by @igerber)

Head SHA: fe7c2cb59f64a0e43ba67bdccd23a851e0605b55

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. The changed Wave E.1 survey_design= path is consistent with the Methodology Registry’s documented Gerber/Binder + Gardner + Butts synthesis at the level claimed by the PR. docs/methodology/REGISTRY.md:L3124-L3142, diff_diff/spillover.py:L2357-L3365, diff_diff/two_stage.py:L62-L514 (arxiv.org)

Executive Summary

No new P0/P1 in the changed SpilloverDiD(survey_design=...) path. Survey weights are threaded through stage 1, stage 2, Psi, the bread, and df_survey, and the survey-share event-study aggregation is both documented and implemented. docs/methodology/REGISTRY.md:L3071-L3142, diff_diff/spillover.py:L790-L1024, diff_diff/spillover.py:L2357-L3365, diff_diff/two_stage.py:L62-L514
The prior re-review P3 is resolved: the event-study test prose now correctly distinguishes raw att_dynamic.n_obs counts from the internal survey-share weights. tests/test_spillover.py:L5401-L5420
The changed cluster/PSU logic is internally consistent: SurveyDesign.nest is propagated in the changed survey copy paths, _inject_cluster_as_psu() now enforces nest=False overlap rejection, and Spillover adds the missing within-unit constancy guard when cluster=<col> becomes the effective PSU. diff_diff/survey.py:L243-L270, diff_diff/survey.py:L412-L422, diff_diff/survey.py:L661-L676, diff_diff/survey.py:L1286-L1344, diff_diff/spillover.py:L2895-L2969
Deferred Conley×survey and replicate-weight variance work is explicitly rejected and tracked, so those items are mitigated under the review rubric. TODO.md:L133-L134
One minor P3 remains: tests/test_spillover.py:L4741-L4761 is labeled as a non-uniform-weight “hand-check,” but it only smoke-tests finiteness and unchanged df_survey.

Methodology

No findings.

Affected methods are the new Wave E.1 survey-design variance path for SpilloverDiD, the survey-weighted scalar att aggregation in event-study mode, and the survey-aware cluster/PSU resolution rules. The registry explicitly documents this as a synthesis rather than a claim of direct reference-software parity, and the cited source material supports the underlying ingredients the PR relies on: Gerber extends Binder-style design-based variance to smooth IF-/regression-based modern DiD estimators, Gardner’s did2s formulation gives the two-stage GMM sandwich structure, Butts provides the spillover DiD identification framework, and Binder is the foundational survey-linearization reference. I did not find an undocumented deviation from that documented synthesis in the changed implementation. docs/methodology/REGISTRY.md:L3071-L3142, docs/api/spillover.rst:L221-L259, diff_diff/spillover.py:L790-L1024, diff_diff/spillover.py:L2357-L3365, diff_diff/two_stage.py:L62-L514 (arxiv.org)

The potentially controversial behaviors are documented rather than silent: survey-weighted event-study shares, df_survey=0 warning+NaN propagation, PSU-over-cluster precedence, and explicit rejection of Conley×survey and replicate-weight variance. docs/methodology/REGISTRY.md:L3077-L3081, docs/methodology/REGISTRY.md:L3134-L3140, TODO.md:L133-L134

Code Quality

No findings.

The changed inference sites continue to route through safe_inference(), so I did not find a new inline t-stat / p-value / CI anti-pattern in the modified estimator path. diff_diff/spillover.py:L830-L836, diff_diff/spillover.py:L1024-L1024, diff_diff/spillover.py:L3251-L3265

Performance

No findings.

The new Binder helper avoids repeated per-PSU scans by using a single np.unique(..., return_index=True, return_inverse=True) mapping before aggregation. diff_diff/two_stage.py:L454-L479

Maintainability

No findings.

The changed survey helper paths consistently preserve nest where they now copy ResolvedSurveyDesign, and the new _inject_cluster_as_psu() contract is covered by focused regression tests. diff_diff/survey.py:L243-L270, diff_diff/survey.py:L412-L422, diff_diff/survey.py:L661-L676, diff_diff/survey.py:L1286-L1344, tests/test_survey.py:L2901-L3002

Tech Debt

No findings.

The remaining Conley×survey and replicate-weight follow-ups are explicitly tracked in TODO.md, which mitigates them under the stated rubric. TODO.md:L133-L134

Security

No findings.

Documentation/Tests

Severity: P3. Impact: tests/test_spillover.py:L4741-L4761 is described as a non-uniform-weight Binder/weighted-Psi “hand-check,” but it only verifies finite att/se and unchanged df_survey. That leaves the most methodology-sensitive part of the weighted path without a real non-uniform numerical parity check in the estimator-level test suite. Concrete fix: either rename/reword this test as a smoke test, or replace it with an actual non-uniform-weight reconstruction/golden that numerically checks the weighted gamma_hat / Psi / bread composition on a small fixture.

No further findings. The prior prose mismatch around att_dynamic.n_obs appears resolved at tests/test_spillover.py:L5401-L5420.

Static review only: pytest is not installed in this environment, so I could not execute the new tests.

github-actions · 2026-05-19T16:52:42Z

🔁 AI review rerun (requested by @igerber)

Head SHA: cce7bed58ef48bc9e2c39f1ea350949d5129a2dd

Overall Assessment

⚠️ Needs changes

Executive Summary

The high-level Wave E.1 composition is plausible against the cited source material: Butts provides the spillover DiD setup, Gardner the two-stage DiD structure, and Gerber explicitly argues that Binder-style stratified-cluster linearization applies to smooth IF-/regression-based modern DiD estimators. (arxiv.org)
P1: the new survey-path finite_mask subsetting drops zero-weight padding rows/PSUs from the survey design when they lose FE support, so n_psu, df_survey, and the Binder meat are recomputed on the reduced sample rather than the full design. That is an undocumented methodology mismatch against the registry’s subpopulation contract and can silently change SEs/CIs/p-values. diff_diff/spillover.py:L2578-L2689, diff_diff/spillover.py:L2838-L2887, diff_diff/spillover.py:L3178-L3205, diff_diff/survey.py:L1810-L1898, docs/methodology/REGISTRY.md:L3748-L3772
I did not find a new P0 NaN/inference anti-pattern in the changed estimator path; the modified inference sites still go through safe_inference(). diff_diff/spillover.py:L830-L844, diff_diff/spillover.py:L1024-L1024, diff_diff/spillover.py:L3251-L3265
The nest propagation and cluster-as-PSU fixes look internally consistent and are covered by focused tests. diff_diff/survey.py:L240-L270, diff_diff/survey.py:L412-L422, diff_diff/survey.py:L635-L680, diff_diff/survey.py:L1286-L1344, tests/test_survey.py:L2906-L3000
The prior re-review P3 on the mislabeled non-uniform “hand-check” looks addressed. The remaining low-priority issue is a docs/code mismatch on whether cluster=<col> + survey_design.psu always warns or only warns when the groupings differ. tests/test_spillover.py:L4732-L4771, docs/api/spillover.rst:L257-L259, diff_diff/survey.py:L1261-L1283

Methodology

Affected methods are the new SpilloverDiD Wave E.1 survey sandwich and the survey-weighted event-study ATT aggregation. Aside from the edge case below, I did not find another unmitigated mismatch in the intended Gerber/Binder + Gardner + Butts synthesis. (arxiv.org)

Severity: P1. Impact: diff_diff/spillover.py:L2838-L2887 subsets resolved_survey to the post-finite_mask sample and then recomputes survey_metadata; diff_diff/spillover.py:L3178-L3205 uses that reduced-design df_survey for inference. When zero-weight padding rows (for example from SurveyDesign.subpopulation(), or a manually zero-weighted unit/PSU) lose FE support and become non-finite, they are physically removed from the survey design instead of retained as zero-score domain padding. That conflicts with the registry’s explicit full-design subpopulation rule, and with standard domain-estimation guidance that subpopulation inference should preserve the original design rather than treat the kept rows as a new survey. Because _compute_stratified_meat_from_psu_scores() centers PSU totals within strata and applies n_h/(n_h-1) / lonely_psu logic, dropping a zero-weight PSU can change the Binder meat itself, not just the reported df_survey. diff_diff/spillover.py:L2578-L2689, diff_diff/spillover.py:L2838-L2887, diff_diff/spillover.py:L3178-L3205, diff_diff/survey.py:L1810-L1898, docs/methodology/REGISTRY.md:L3748-L3772, tests/test_spillover.py:L5201-L5241 (faculty.washington.edu)
Concrete fix: keep design-level bookkeeping separate from fit-sample alignment. The Binder score matrix still needs fit-sample-aligned arrays, but zero-weight rows/PSUs that are dropped only because their FE is undefined should remain in the survey design as zero-score padding for PSU/strata counts, centering, lonely_psu, and df_survey. A practical fix is to preserve a full-design resolved_survey/survey_metadata for inference metadata and zero-pad dropped zero-weight rows’ IF contribution instead of removing them from the design object.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new docs say cluster=<col> + survey_design.psu “emits a UserWarning,” but _resolve_effective_cluster() only warns when the cluster and PSU partitions differ. Equivalent groupings quietly proceed, so the public docs overstate the warning contract. docs/api/spillover.rst:L257-L259, docs/methodology/REGISTRY.md:L3136-L3138, CHANGELOG.md:L10-L11, diff_diff/survey.py:L1261-L1283
Concrete fix: either change the docs to “warns if the groupings differ” or emit the warning unconditionally whenever both knobs are supplied.

Path to Approval

Preserve full-design survey structure for zero-weight padding rows/PSUs that are dropped only because their FE is undefined; df_survey, n_psu, and Binder centering should not be recomputed on a naively subsetted domain sample.
Add a regression test using SurveyDesign.subpopulation() (or an equivalent all-zero-weight unit/PSU) where the excluded unit has no positive-weight Omega_0, and assert n_psu, df_survey, and SE match the full-design reference rather than the dropped-row fit.
Align the cluster + survey_design.psu warning contract between docs and code.

github-actions · 2026-05-19T17:17:38Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8c7906597d0737f332ad7bfc214951b066571425

Overall Assessment

✅ Looks good

Executive Summary

The previously raised P1 on survey subpopulation preservation is now explicitly documented in docs/methodology/REGISTRY.md and tracked in TODO.md, so under this review rubric it is mitigated rather than blocking.
The new Wave E.1 survey path is internally consistent with the documented Gerber/Binder + Gardner synthesis: weighted stage-1 FE support, weighted gamma_hat/Psi/bread, Binder TSL meat, and df_survey-based safe_inference() are all wired through the changed estimator path.
The event-study survey path now aggregates scalar att with survey-weight totals rather than raw counts, which matches the registry note and fixes the weighted-horizon estimand mismatch.
The prior docs/code mismatch on cluster=<col> + survey_design.psu warning behavior appears resolved; docs now match _resolve_effective_cluster()’s “warn only when groupings differ” contract.
I did not find new unmitigated P0/P1 issues in the changed estimator, survey helper, or inference code.
Static review only: I could not run the test suite in this environment because pytest and numpy are unavailable.

Methodology

Severity: P3
Impact: Affected methods are the new SpilloverDiD Wave E.1 survey sandwich and the event-study survey-share aggregation. The previously raised SurveyDesign.subpopulation() / finite_mask design-preservation mismatch remains in code, but this PR now explicitly documents it as a known Wave E.3 limitation and tracks it in TODO, so it is informational rather than blocking under the stated rules.
Concrete fix: None required for approval in this PR. The already-tracked Wave E.3 fix is to preserve the full resolved design for PSU/strata bookkeeping and zero-pad dropped zero-weight rows’ IF contribution instead of physically subsetting the survey design.
References: diff_diff/spillover.py:L2838-L2887, docs/methodology/REGISTRY.md:L3136-L3139, docs/api/spillover.rst:L262-L268, TODO.md:L133-L135, tests/test_spillover.py:L5544-L5647
No other findings.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No additional findings. The remaining Wave E.3 subpopulation-preservation work is appropriately tracked in TODO.md:L135.

Security

No findings.

Documentation/Tests

No findings in the changed docs/tests. The prior warning-contract mismatch is aligned between diff_diff/survey.py:L1261-L1283 and docs/api/spillover.rst:L257-L261.
Verification note: runtime test execution was not possible here because the environment is missing pytest and numpy.

github-actions · 2026-05-19T19:04:58Z

🔁 AI review rerun (requested by @igerber)

Head SHA: bf28e106b7bbeae1b2802a23723a3245e17e431c

Overall Assessment

✅ Looks good

Executive Summary

Re-review scope: the prior P1/P3 concerns I could trace are either fixed in code or now explicitly mitigated by REGISTRY.md/TODO.md, so I do not see a remaining unmitigated blocker.
The affected methods are SpilloverDiD’s new survey_design= HC1/CR1 variance path, the event-study scalar att aggregation under survey weights, and the shared survey helper logic that determines the effective PSU.
The documented Wave E.1 composition is consistent with the cited source material: Gerber’s Proposition 1 covers design-based variance for IF-based DiD estimators, Gerber’s appendix/replicate section treats two-stage DiD as an IF-based refit case and requires full re-estimation for replicate weights when first-stage coefficients move, Gardner provides the untreated-first-stage / residualized-second-stage structure, and Butts provides the spillover-aware DiD framework with staggered-timing identification. (arxiv.org)
The earlier cluster/PSU consistency issues also look resolved: _inject_cluster_as_psu() now honors nest, and the implicit-PSU path now rejects time-varying within-unit cluster labels before Binder aggregation (diff_diff/survey.py:L1286-L1344, diff_diff/spillover.py:L2891-L2975).
I did not find any new unmitigated P0/P1 issues in the changed estimator, variance, or inference paths.
Static review only: I could not execute the test suite here because numpy, pandas, scipy, and pytest are not installed in the environment.

Methodology

Affected methods reviewed: diff_diff/spillover.py:L2357-L2401, diff_diff/spillover.py:L2578-L3236, diff_diff/two_stage.py:L62-L514, and the survey-helper changes in diff_diff/survey.py:L635-L680 and diff_diff/survey.py:L1286-L1344.

Against the cited papers and the registry, the new survey path is internally coherent: survey weights are threaded through stage 1, Psi, and the bread; Psi is aggregated to PSU totals for Binder TSL; df_survey is used at the inference call sites; and the event-study scalar att uses survey-weight shares on the survey path (diff_diff/spillover.py:L784-L1024, diff_diff/spillover.py:L2991-L3236, diff_diff/two_stage.py:L226-L297, docs/methodology/REGISTRY.md:L3124-L3143). That matches the Gerber/Gardner/Butts synthesis the PR claims. (arxiv.org)

Severity: P3
Impact: SurveyDesign.subpopulation() designs with zero-weight padding rows still use the reduced post-finite_mask sample for n_psu, df_survey, and Binder centering when those padded rows lose stage-1 FE support. This is a real limitation, but it is explicitly documented and tracked, so under this review rubric it is mitigated rather than blocking. References: diff_diff/spillover.py:L2838-L2887, docs/methodology/REGISTRY.md:L3137-L3141, TODO.md:L134-L136.
Concrete fix: None required for approval in this PR. The tracked Wave E.3 fix is to preserve full-design bookkeeping and zero-pad dropped rows’ IF contributions instead of shrinking the survey design.

No other methodology findings.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining Conley+survey, replicate-weight survey variance, and subpopulation-preservation follow-ups are explicitly tracked in TODO.md:L134-L136, so they are non-blocking for this review.

Security

No findings.

Documentation/Tests

No findings in the changed docs/tests. The prior docs/code mismatch around cluster=<col> + survey_design.psu warning behavior appears resolved between diff_diff/survey.py:L1261-L1283 and docs/api/spillover.rst:L257-L261.

Verification note: this was a static review only. I could not run the tests here because the environment is missing numpy, pandas, scipy, and pytest.

Composes Gerber (2026, arXiv:2605.04124) Proposition 1 Binder Taylor Series Linearization for IF representations of smooth functionals -- explicitly derived for TwoStageDiD in the paper's Appendix -- with the Wave D Gardner GMM first-stage uncertainty correction (Butts 2021 §3.1 + Gardner 2022 §4) applied to SpilloverDiD's ring-indicator stage-2 design. No reference software combines all ingredients. Mechanical composition: SpilloverDiD's per-obs Wave D IF `psi_i = gamma_hat' * X_{10,i} * eps_{10,i} - X_{2,i} * eps_{2,i}` (with survey weights threaded through gamma_hat solve, eps construction, and bread inversion via Hajek normalization) is aggregated to PSU totals and passed to the audited `_compute_stratified_meat_from_psu_scores` Binder TSL meat helper. Stage-1 FE estimation extends `_iterative_fe_subset` with a `weights=` kwarg implementing WLS-FE via weighted bincount; the `weights is None` path is bit-identical to Wave B/C/D unweighted bincount. Identification support correctness: under survey_design, stage-1 FE support and connectivity are evaluated on the POSITIVE-WEIGHT portion of Omega_0 (`omega_0_effective = omega_0_mask & (survey_weights > 0)`), consistently for unsupported-period / unsupported-unit checks, `_check_omega_0_connectivity`, and `stage1_n_obs`. Zero-weight rows are outside the WLS estimating sample; treating them as identifying support would silently corrupt point estimates and SEs when raw Omega_0 membership masquerades as positive-weight support. Event-study scalar `att`: under survey_design, per-horizon shares are SURVEY-WEIGHT TOTALS rather than raw observation counts. Using raw n_obs_per_col shares on weighted WLS horizon coefficients would mix unweighted aggregation with weighted horizons and target the wrong estimand. The same survey-weight totals enter both `att` and `Var(att) = w' V_subset w`, keeping the lincom variance consistent with the point estimate. Cluster-vs-PSU resolution: `cluster=<col> + survey_design.psu` warns and uses PSU (TwoStageDiD parity). `cluster=<col> + survey_design` without PSU injects cluster as the effective PSU via `_inject_cluster_as_psu`, which now honors `SurveyDesign.nest`: under `nest=False`, cluster labels must be globally unique across strata (raises if they repeat, matching the explicit-PSU resolver's contract). The result's `cluster_name` reports the effective PSU label when PSU wins, not the user-supplied cluster column. DOF: `ResolvedSurveyDesign.df_survey` (4-way branch: PSU+strata -> n_PSU - n_strata; PSU only -> n_PSU - 1; strata only -> n_obs - n_strata; neither -> n_obs - 1) threaded through all four `safe_inference` call sites (aggregate tau_total, per-ring delta_j, event-study per-event-time tau_k / delta_jk, scalar att lincom). Survey-array subsetting: when `finite_mask` drops baseline-treated rows, `survey_weights` and `ResolvedSurveyDesign.{weights, strata, psu, fpc, replicate_weights}` are subsetted in parallel; `n_psu`, `n_strata`, and `survey_metadata` are recomputed after PSU injection so summary() / to_dict() reflect the actual variance design. Saturated `df_survey = 0` NaN-fail: when `lonely_psu="remove"` removes all strata (singleton PSUs), the meat helper returns `(_, var_computed=False, legit_zero=0)` and SpilloverDiD's Wave E.1 path returns NaN meat with a UserWarning matching "df_survey" so callers can `pytest.warns(match="df_survey")`. This is a departure from TwoStageDiD (`two_stage.py:2003-2005`) which currently NaN-fails silently; Wave E.1 surfaces the diagnostic per `feedback_no_silent_failures`. Public surface restrictions: - `vcov_type="conley" + survey_design=` raises NotImplementedError pointing at planned Wave E.2 (Conley x survey product-kernel synthesis with within-stratum Conley sandwich on PSU totals). - Replicate-weight variance (BRR/Fay/JK1/JKn/SDR) raises NotImplementedError per Gerber (2026) Appendix A (IF-reweighting shortcut does NOT apply to TwoStageDiD-class estimators because gamma_hat is weight-sensitive; correct support requires per-replicate full re-fit). - Non-pweight (`weight_type ∈ {"fweight", "aweight"}`) raises ValueError (the Binder TSL assumes probability weights). Implementation: - `_compute_gmm_corrected_meat` extended with `survey_weights` + `resolved_survey` kwargs at `diff_diff/two_stage.py`. - New module-level helper `_compute_binder_tsl_meat` at `diff_diff/two_stage.py` wraps `_compute_stratified_meat_from_psu_scores` with implicit per-obs PSU synthesis for no-PSU survey designs (matches `ResolvedSurveyDesign.df_survey` no-PSU branches). - `_iterative_fe_subset` weighted path with bit-identical no-weights fallback + positive-weight identification gate. - `_inject_cluster_as_psu` honors `nest` (shared helper fix that also benefits TwoStageDiD survey path; 18 existing TwoStageDiD survey tests unaffected). - `ResolvedSurveyDesign` gains a `nest` field propagated through all 5 construction sites in resolve()/subpopulation()/replicate-helpers. - `SpilloverDiDResults` extended with `survey_metadata`, `n_psu`, `n_strata` fields at `diff_diff/results.py`. Tests: 24 new invariants across `TestSpilloverDiDWaveE1SurveyDesignHc1` (17: bit-identity fallback, Binder TSL hand-check uniform + non-uniform weights, lonely_psu, FPC e1/e2/e3, saturated NaN-fail with `pytest.warns(match="df_survey")`, rejections (conley+survey, replicate, non-pweight), fit idempotency, finite_mask subsetting, no-PSU regressions (weights-only, weights+strata, cluster+survey-no-PSU, cluster overlap with nest=False/True), zero-weight Omega_0 exclusion + all-zero raises) and `TestSpilloverDiDWaveE1SurveyDesignEventStudy` (7: both is_staggered branches with df_survey lincom verification, distinguishability between survey-share and sample-share lincom rules via cohort-correlated weights + non-constant tau_k + manual reconstruction, aggregate-vs-event-study parity, drift goldens, subset-path invariant). Docs: REGISTRY new "Variance (Wave E.1)" subsection + restrictions block update + survey-weighted share rule note; api/spillover.rst parallel .. note:: block; references.rst adds Gerber (2026); CHANGELOG [Unreleased] new bullet; TODO.md replaces deferred row with Wave E.2 + replicate-weight follow-ups; llms.txt + README.md catalog entries appended. Folds the `.gitignore` carryover (survey-did-paper-arxiv-v2/ entry). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T20:05:11Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 967c6a3e9058e86711be0ea3c6010677c7588248

Overall Assessment

✅ Looks good

Executive Summary

Re-review scope: I did not find any surviving P1+ issues from the prior review. The earlier cluster/PSU parity concerns I could trace now look addressed by _inject_cluster_as_psu() honoring nest and by the new within-unit constancy guard when cluster=<col> is used as the effective PSU (diff_diff/survey.py:L1286-L1344, diff_diff/spillover.py:L2904-L2975).
Affected methods reviewed: SpilloverDiD’s new survey_design= HC1/CR1 variance path, the survey-weighted scalar att aggregation in event-study mode, and the shared survey-helper PSU resolution (diff_diff/spillover.py:L784-L1024, diff_diff/spillover.py:L2357-L3369, diff_diff/two_stage.py:L62-L514).
The Wave E.1 methodology is consistent with the cited source stack at the level claimed in the PR: Gerber frames design-based stratified-cluster variance for smooth IF-based/regression-based modern DiD estimators, Gardner frames the untreated-first-stage / treated-second-stage estimator structure, and Butts frames the spillover-aware DiD estimand and staggered-timing setting. citeturn1view0turn6view0turn2view0
The only remaining in-scope limitation I see is already documented and tracked: SurveyDesign.subpopulation() designs with FE-undefined zero-weight padding rows still recompute design bookkeeping on the reduced fit sample (docs/methodology/REGISTRY.md:L3138-L3141, TODO.md:L135-L137).
Static review only: I could not run the suite here because this environment lacks numpy, pandas, scipy, and pytest.

Methodology

The affected method set is SpilloverDiD.fit()’s survey HC1/CR1 branch, _extract_event_study_results()’s survey-share att aggregation, and the shared Binder meat / PSU-injection helpers. I do not see an unmitigated source-material mismatch in those paths (diff_diff/spillover.py:L784-L1024, diff_diff/spillover.py:L2357-L3369, diff_diff/two_stage.py:L62-L514, diff_diff/survey.py:L1261-L1344, docs/methodology/REGISTRY.md:L3077-L3141). The Gerber/Gardner/Butts synthesis claimed in the registry is a reasonable fit for what the code is doing. citeturn1view0turn6view0turn2view0

Severity: P3. Impact: SurveyDesign.subpopulation() designs with FE-undefined zero-weight rows still shrink the effective design before Binder centering, so n_psu / df_survey can reflect the reduced fit sample rather than the full domain design. This is explicitly documented and tracked, so it is mitigated and non-blocking. References: docs/methodology/REGISTRY.md:L3138-L3141, TODO.md:L135-L137. Concrete fix: None for approval in this PR.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: The remaining survey follow-ups in scope, Conley×survey, replicate-weight survey variance, and Wave E.3 subpopulation preservation, are explicitly tracked in TODO.md:L135-L137, so they are non-blocking under this rubric. Concrete fix: None for approval in this PR.

Security

No findings.

Documentation/Tests

No findings. The docs now disclose the survey-share att rule, the df_survey=0 NaN-fail, and the deferred follow-ups, and the added tests cover aggregate/event-study survey fits plus the nest / cluster-as-PSU regressions (docs/api/spillover.rst:L212-L268, tests/test_spillover.py:L4547-L5686, tests/test_survey.py:L2903-L2992).

igerber force-pushed the spillover-conley-wave-e-survey-design branch from ec641b8 to 15044a4 Compare May 19, 2026 11:47

igerber force-pushed the spillover-conley-wave-e-survey-design branch from 15044a4 to a161725 Compare May 19, 2026 12:33

igerber force-pushed the spillover-conley-wave-e-survey-design branch from a161725 to 9c9532e Compare May 19, 2026 12:55

igerber added the ready-for-ci Triggers CI test workflows label May 19, 2026

igerber force-pushed the spillover-conley-wave-e-survey-design branch from 9c9532e to 0391c79 Compare May 19, 2026 14:35

igerber force-pushed the spillover-conley-wave-e-survey-design branch from 0391c79 to a04c96b Compare May 19, 2026 14:58

igerber removed the ready-for-ci Triggers CI test workflows label May 19, 2026

igerber force-pushed the spillover-conley-wave-e-survey-design branch from a04c96b to c0e8967 Compare May 19, 2026 15:54

igerber force-pushed the spillover-conley-wave-e-survey-design branch from c0e8967 to 5866913 Compare May 19, 2026 16:07

igerber force-pushed the spillover-conley-wave-e-survey-design branch from 5866913 to fe7c2cb Compare May 19, 2026 16:24

igerber force-pushed the spillover-conley-wave-e-survey-design branch from fe7c2cb to cce7bed Compare May 19, 2026 16:41

igerber force-pushed the spillover-conley-wave-e-survey-design branch from cce7bed to 8c79065 Compare May 19, 2026 17:11

igerber added ready-for-ci Triggers CI test workflows and removed ready-for-ci Triggers CI test workflows labels May 19, 2026

igerber force-pushed the spillover-conley-wave-e-survey-design branch from 8c79065 to bf28e10 Compare May 19, 2026 18:49

igerber added ready-for-ci Triggers CI test workflows and removed ready-for-ci Triggers CI test workflows labels May 19, 2026

igerber force-pushed the spillover-conley-wave-e-survey-design branch from bf28e10 to 967c6a3 Compare May 19, 2026 19:53

igerber added the ready-for-ci Triggers CI test workflows label May 19, 2026

Conversation

igerber commented May 19, 2026

Summary

Methodology references

Validation

Security / privacy

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant