Lift Gate 1: HC2/HC2-BM for TwoWayFixedEffects via full-dummy auto-route by igerber · Pull Request #469 · igerber/diff-diff

igerber · 2026-05-19T13:07:31Z

Summary

TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) now produces finite small-sample inference via an inline full-dummy build inside TWFE.fit() — the last absorbed-FE HC2/HC2-BM gate (DiD-absorb DiD-absorb HC2/HC2-BM: auto-route to fixed_effects internally #458, MPD-absorb Lift MultiPeriodDiD-absorb HC2/HC2-BM gate via auto-route #459, MPD cluster+contrast-DOF Lift Gate 6: cluster-aware CR2 Bell-McCaffrey contrast DOF for MultiPeriodDiD avg_att #465 already merged).
Auto-cluster-at-unit default preserved on hc2_bm (routes to CR2-BM at unit) and on hc2 + wild_bootstrap; dropped on explicit hc2 + analytical to match the one-way contract.
Surface change (matches DiD/MPD absorb auto-route): under hc2/hc2_bm, result.coefficients (incl. FE-dummy entries), vcov, residuals, fitted_values, r_squared reflect the full-dummy fit; result.att and analytical inference are FWL-invariant. len(coefficients) == vcov.shape[0] invariant upheld.
hc2 / hc2_bm + replicate-weight survey designs explicitly rejected (replicate path re-demeans per replicate, doesn't compose with full-dummy).

Methodology references (required if estimator / math changes)

Method name(s): HC2 (MacKinnon-White 1985), CR2 Bell-McCaffrey (Pustejovsky-Tipton 2018 §3.3; Imbens-Kolesar 2016)
Paper / source link(s): sandwich::vcovHC(type="HC2") and clubSandwich::vcovCR(type="CR2") + coef_test()$df_Satt (R parity targets; singleton-cluster trick = one-way HC2-BM)
Any intentional deviations from the source (and why): None — empirical parity to R at atol=1e-10 verified on the new twfe_two_period scenario in benchmarks/data/clubsandwich_cr2_golden.json. The full-dummy build is the FWL-correct algebra (within-transform preserves coefficients but NOT the hat matrix, so HC2 leverage on the demeaned design would be wrong).

Validation

Tests added/updated:
- tests/test_estimators_vcov_type.py::TestFitBehavior — 10 behavioral tests (rejection flip → behavioral; refactor regression vs DiD(fixed_effects=[unit, time]) at atol=1e-12; auto-cluster default coverage on hc2_bm with distinguishability check vs one-way HC2-BM at >1% gap; explicit hc2 + analytical no-auto-cluster; hc2 + wild_bootstrap auto-cluster preserved; hc2 / hc2_bm + replicate rejection; always-treated unit finite ATT; coefficients-vs-vcov alignment invariant)
- tests/test_methodology_twfe.py::TestTWFEHC2RParity — 3 R-parity tests at atol=1e-10 (HC2 SE vs sandwich::vcovHC; one-way BM DOF vs singleton-cluster CR2; CR2-BM clustered-at-unit DOF vs vcovCR(cluster=unit))
- benchmarks/R/generate_clubsandwich_golden.R — new twfe_two_period scenario (8 units × 4 periods, binary post indicator); JSON regenerated with meta.source = "clubSandwich"
Backtest / simulation / notebook evidence: N/A (small-sample SE refinement, not a new estimator)

Security / privacy

Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

Replaces the unconditional NotImplementedError at twfe.py for `vcov_type in {"hc2","hc2_bm"}` with an inline full-dummy branch. TWFE has no absorb=/fixed_effects= parameter to swap (unit + time FE are baked into the estimator's identity), so the auto-route trick used for DiD-absorb / MPD-absorb doesn't apply directly. Instead, `TwoWayFixedEffects.fit()` bypasses the within-transform on hc2/hc2_bm and stacks [intercept, treated×post, covariates, factor(unit), factor(time)] explicitly so the leverage correction and BM DOF compute on the full FE projection (FWL preserves coefficients and residuals but NOT the hat matrix). **Auto-cluster default:** preserved on hc2_bm (routes to CR2-BM at unit) and on hc2 + wild_bootstrap; dropped on explicit hc2 + analytical to match the one-way contract (the linalg validator rejects hc2 + cluster_ids). **Surface change disclosure** (matches DiD-absorb / MPD-absorb): under vcov_type in {"hc2","hc2_bm"}, result.coefficients, result.vcov, result.residuals, result.fitted_values, and result.r_squared reflect the full-dummy fit. FE-dummy entries are included alongside the "ATT" key (len(coefficients) == vcov.shape[0] invariant upheld). result.att, its SE, and analytical inference are unchanged (FWL-equivalent). **Rejected combos:** vcov_type in {"hc2","hc2_bm"} + replicate- weight survey designs raises NotImplementedError because the replicate path re-demeans per replicate, which doesn't compose with the full-dummy build. Survey variance precedence: any resolved SurveyDesign drives variance via TSL/replicate (matches existing DiD/MPD contract), not the analytical small-sample sandwich. Verified at atol=1e-10 vs `lm() + sandwich::vcovHC(type="HC2")` and `lm() + clubSandwich::vcovCR(cluster=seq_len(n), type="CR2") + coef_test()$df_Satt` on a new `twfe_two_period` scenario in benchmarks/data/clubsandwich_cr2_golden.json. New tests: - TestFitBehavior (10 behavioral tests including refactor regression vs DiD(fixed_effects=[unit, time]) at atol=1e-12, auto-cluster distinguishability check vs one-way HC2-BM at 1% gap, replicate- weight rejection, coefficients-vs-vcov alignment invariant) - TestTWFEHC2RParity (3 R-parity tests at atol=1e-10) Lifts Gate 1 of the six HC2/HC2-BM NotImplementedError gates — the last absorbed-FE gate. Remaining gates: weighted one-way HC2-BM, weighted CR2-BM (both blocked on the clubSandwich WLS algebra derivation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T13:17:30Z

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. The HC2/HC2-BM methodology change in TwoWayFixedEffects is consistent with the Methodology Registry and the intended full-projection HC2/CR2 logic.

Executive Summary

The new TWFE hc2 / hc2_bm route uses the full dummy projection the registry requires for leverage-based inference, rather than the within-transformed design. I did not find an undocumented estimator/variance mismatch. diff_diff/twfe.py:L147-L315 docs/methodology/REGISTRY.md:L2560-L2564
The inference surface still flows through LinearRegression.get_inference() / safe_inference; I did not find a new inline t-stat / partial-NaN anti-pattern in the changed code. diff_diff/twfe.py:L511-L582
The replicate-weight survey rejection is documented in the registry and tracked in TODO.md, so it is informational rather than blocking. diff_diff/twfe.py:L221-L236 docs/methodology/REGISTRY.md:L2562-L2562 TODO.md:L103-L103
Main residual risk: the new path eagerly materializes dense unit/time dummies, which can be expensive on the large panels TWFE is usually chosen for. diff_diff/twfe.py:L279-L300
Main test gap: the added coverage pins ATT/SE/DOF behavior, but not the newly documented residuals / fitted_values / r_squared surface change or the non-replicate survey-weighted hc2 / hc2_bm path. tests/test_estimators_vcov_type.py:L657-L935 tests/test_methodology_twfe.py:L1230-L1350
Static review only: I could not execute the test suite in this sandbox because pytest / pandas are unavailable.

Methodology

No findings. The implementation in diff_diff/twfe.py:L147-L315 matches the absorbed-FE note in docs/methodology/REGISTRY.md:L2560-L2564, and the added parity coverage targets the right HC2 / CR2 references. tests/test_methodology_twfe.py:L1230-L1350

Code Quality

No findings.

Performance

Severity: P2. Impact: the new hc2 / hc2_bm path always builds dense unit and time dummy matrices and then dense-stacks them into X. On realistically large TWFE panels, that can turn the old explicit rejection into OOMs or severe latency. Concrete fix: add a size guard or warning before dummy expansion, or route this path through a sparse/shared full-dummy helper instead of unconditional dense get_dummies + np.column_stack. diff_diff/twfe.py:L279-L300

Maintainability

Severity: P3. Impact: TWFE now carries a second bespoke full-dummy builder that overlaps with the existing DifferenceInDifferences(fixed_effects=...) implementation, which increases drift risk around FE naming, survey behavior, and result-surface conventions. Concrete fix: extract a shared helper or delegate the TWFE hc2 / hc2_bm route to the existing fixed-effects DiD path. diff_diff/twfe.py:L279-L315 diff_diff/estimators.py:L467-L485

Tech Debt

Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still rejects replicate-weight survey designs, but that limitation is explicitly documented and tracked, so it is not a blocker under the project’s deferred-work policy. Concrete fix: none required for approval; implement the tracked follow-up when replicate support is needed. diff_diff/twfe.py:L221-L236 docs/methodology/REGISTRY.md:L2562-L2562 TODO.md:L103-L103

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new tests cover ATT/SE parity, CR2 auto-clustering, replicate rejection, and coefficients-vs-vcov alignment, but they do not pin the newly documented full-surface behavior for result.residuals, result.fitted_values, result.r_squared, or the non-replicate survey-weighted hc2 / hc2_bm path. A regression there could ship while the current parity tests still pass. Concrete fix: add one regression comparing those fields to DifferenceInDifferences(fixed_effects=["unit", time]) on hc2 / hc2_bm, plus one survey-weighted non-replicate TWFE test on the new path. tests/test_estimators_vcov_type.py:L657-L935 tests/test_methodology_twfe.py:L1230-L1350

Addresses CI Codex review findings on PR #469: P2 (Performance): the HC2/HC2-BM full-dummy build can OOM on large TWFE panels (n × (n_units + n_times) float64 entries). Add a memory- size warning at >50M entries (~400 MB) suggesting hc1 (within- transform) for large panels. P3 (Docs/Tests): new tests pinned ATT/SE/DOF but not the documented full-surface change (residuals/fitted_values/r_squared reflect the full-dummy fit). Add `test_twfe_hc2_full_surface_matches_did_fixed_effects` parametrized over hc2/hc2_bm, asserting bit-equality against DifferenceInDifferences(fixed_effects=[unit, time]) at atol=1e-12 on all three fields. P3 (Maintainability): TWFE's inline full-dummy builder duplicates DiD's fixed_effects= dummy-construction logic. Substantive refactor — better as a follow-up than inline in this PR. Added a TODO row. P3 (Tech debt — replicate weights): already tracked, no action needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T13:29:57Z

🔁 AI review rerun (requested by @igerber)

Head SHA: b0ccc33c77cbf04637ecc5abe5ca34af19445140

Overall Assessment

✅ Looks good

Executive Summary

I did not find an unmitigated P0/P1 issue in the new TWFE hc2 / hc2_bm route. The changed estimator now computes the small-sample leverage/CR2 quantities on an explicit full-dummy FE design, which is the right methodological direction for this gate lift. diff_diff/twfe.py:L279-L335 docs/methodology/REGISTRY.md:L2560-L2564
The prior non-blocking review items were mostly addressed: the full results-surface regression was added, the dense full-dummy path now warns before large allocations, and the duplicate-builder / replicate-weight limitations are explicitly tracked in TODO.md. tests/test_estimators_vcov_type.py:L938-L966 diff_diff/twfe.py:L296-L309 TODO.md:L103-L104
Remaining non-blocker: I still do not see a direct regression for the new non-replicate survey-weighted TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) branch. The added tests cover non-survey full-dummy behavior and replicate-weight rejection, while the existing TWFE survey tests stay on the default path. tests/test_estimators_vcov_type.py:L657-L966 tests/test_survey.py:L893-L921 tests/test_survey.py:L3237-L3285
Static review only: I could not execute the test suite in this sandbox because numpy / pandas are unavailable.

Methodology

No findings. Methods affected: HC2, one-way Bell-McCaffrey DOF, and clustered CR2 Bell-McCaffrey on TWFE. The new branch in diff_diff/twfe.py:L279-L335 matches the registry update in docs/methodology/REGISTRY.md:L2560-L2564: HC estimators depend on hat-matrix diagonals, and the absorbed-FE CR2 shortcut is not generally equivalent to computing the adjustment from the full fixed-effects model, so the move to an explicit full-dummy projection is methodologically consistent. The new R-parity coverage also targets the right reference surfaces. tests/test_methodology_twfe.py:L1230-L1352 (sandwich.r-forge.r-project.org)

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3. Impact: TWFE still carries a bespoke full-dummy builder that duplicates the DifferenceInDifferences(fixed_effects=...) path, so FE naming / survey behavior / results-surface drift remains possible. Concrete fix: none required for approval; this is now explicitly tracked in TODO.md:L104-L104 against diff_diff/twfe.py:L279-L335 and diff_diff/estimators.py:L478-L485.

Tech Debt

Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still rejects replicate-weight survey designs. This is a documented fast-fail rather than silent wrong inference, and it is tracked under the project’s deferred-work policy. Concrete fix: none required for approval; follow the tracked per-replicate full-dummy refit work in diff_diff/twfe.py:L221-L236, docs/methodology/REGISTRY.md:L2562-L2562, and TODO.md:L103-L103.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the PR now pins the non-survey full-dummy behavior well, but it still does not directly exercise the new non-replicate survey-weighted hc2 / hc2_bm TWFE path end-to-end. A regression where that branch accidentally reverts to the within-transform path, or mishandles survey PSU injection under the new FE route, could still slip through. Concrete fix: add one SurveyDesign(weights=..., strata=..., psu=...) TWFE-vs-DifferenceInDifferences(fixed_effects=["unit","time"]) comparison and one weights-only survey case for both hc2 and hc2_bm. tests/test_estimators_vcov_type.py:L657-L966 tests/test_methodology_twfe.py:L1230-L1352 tests/test_survey.py:L893-L921 tests/test_survey.py:L3237-L3285

CI review flagged the missing end-to-end test for the new non-replicate survey-weighted hc2/hc2_bm TWFE path. A regression there could ship while existing tests still pass. New parametrized test compares TWFE(vcov_type=vcov, cluster=...) with SurveyDesign(weights=...) against DifferenceInDifferences(vcov_type=vcov, cluster=..., fixed_effects=['unit', 'time']) with the same design. Both paths feed the survey-resolved full-dummy X to LinearRegression's compute_survey_vcov (TSL), so ATT and SE match bit-equally at atol=1e-12. Test documents the explicit-cluster requirement on hc2_bm: TWFE's implicit auto-cluster + survey path intentionally drops PSU injection (per the survey-design scope rule in _resolve_effective_cluster); the explicit cluster='unit' form is what aligns with DiD's clustered- survey behavior and is the documented user-facing way to invoke clustered survey-aware HC2-BM on TWFE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T13:39:05Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 0a2706655c849b9296325ee621d7e94d9bb42f63

Overall Assessment

✅ Looks good

Executive Summary

No unmitigated P0/P1 issues found in the new TWFE hc2 / hc2_bm route.
Methodology is aligned with the absorbed-FE registry note: the PR moves HC2 / CR2 leverage computation onto an explicit full-dummy FE design instead of the within-transformed design, which is the correct fix for FWL hat-matrix non-equivalence. See diff_diff/twfe.py:L147-L155, diff_diff/twfe.py:L279-L325, docs/methodology/REGISTRY.md:L2560-L2564.
The prior non-blocking survey-path gap is only partially closed: the new weights-only survey regression is valuable, but I still do not see a direct SurveyDesign(weights=..., strata=..., psu=...) parity test on the new full-dummy hc2 / hc2_bm branch. See tests/test_estimators_vcov_type.py:L968-L1017.
The replicate-weight rejection and duplicated full-dummy builder are now explicitly tracked in TODO.md, so they are informational rather than blockers. See TODO.md:L103-L104.
Static review only: I could not run the tests in this environment because numpy, pandas, and pytest are unavailable.

Methodology

No findings. Methods affected are HC2, one-way Bell-McCaffrey / Imbens-Kolesar DOF, and clustered CR2 Bell-McCaffrey. The new full-dummy route and the added R fixture/parity tests target the right methodology surfaces. See diff_diff/twfe.py:L279-L325, benchmarks/R/generate_clubsandwich_golden.R:L235-L284, tests/test_methodology_twfe.py:L1230-L1352.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3. Impact: TWFE still carries a bespoke full-dummy builder instead of reusing the existing DifferenceInDifferences(fixed_effects=...) construction, so future FE naming / survey-handling / results-surface drift remains possible. Concrete fix: none required for approval; this is already tracked in TODO.md:L104-L104. Relevant code: diff_diff/twfe.py:L279-L335, diff_diff/estimators.py:L477-L485.

Tech Debt

Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still fast-fails on replicate-weight survey designs. This is an explicit guard, not silent wrong inference. Concrete fix: none required for approval; follow the tracked per-replicate full-dummy refit work in TODO.md:L103-L103 and diff_diff/twfe.py:L221-L237.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: [Previously noted, partially addressed] the new weights-only survey parity test closes the most obvious gap, but I still do not see a direct SurveyDesign(weights=..., strata=..., psu=...) TWFE-vs-DiD full-dummy regression for hc2 / hc2_bm. Also, the new test block itself documents a survey exception to the broad “hc2_bm auto-cluster is preserved” wording: with survey_design= and no explicit cluster=, TWFE intentionally keeps the established implicit-PSU path, so users need explicit cluster=unit or survey_design.psu to get unit-level PSU injection. Concrete fix: add one explicit strata+psu parity test, and add one sentence to the TWFE docstring / registry note clarifying that the auto-cluster note applies to the non-survey analytical path. See tests/test_estimators_vcov_type.py:L968-L1017, tests/test_survey.py:L3237-L3286, diff_diff/twfe.py:L374-L393, docs/methodology/REGISTRY.md:L2562-L2562.

… for survey path CI review polish: - Add `test_twfe_hc2_with_survey_strata_psu_matches_did_fixed_effects` (parametrized over hc2/hc2_bm) — extends the weights-only survey regression with `SurveyDesign(weights="w", strata="stratum", psu="psu")`. Both TWFE and DiD(fixed_effects=[unit, time]) paths feed the survey-resolved full-dummy X to LinearRegression's TSL with stratified-design adjustments, so ATT/SE match bit-equally at atol=1e-12. - Clarify the "auto-cluster preserved on hc2_bm" wording in TWFE docstring and REGISTRY: the auto-cluster default applies to the non-survey analytical path. Under `survey_design=` with no explicit `cluster=`, TWFE keeps the documented implicit-PSU path (auto-cluster NOT injected into survey PSU); users wanting unit-level PSU under a survey design must pass explicit `cluster="unit"` or set `survey_design.psu`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T13:49:36Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 382824162ef52d45a83481d1c9b302579646b9cc

Overall Assessment

✅ Looks good

Executive Summary

Re-review focus: the prior non-blocking survey-path concern appears addressed. The PR now documents the non-survey-only auto-cluster wording and adds explicit survey parity coverage, including a SurveyDesign(weights, strata, psu) case in diff_diff/twfe.py:L60-L81, docs/methodology/REGISTRY.md:L2560-L2563, and tests/test_estimators_vcov_type.py:L968-L1062.
Methods affected are HC2 and HC2-Bell-McCaffrey / CR2 Bell-McCaffrey. The new TWFE route correctly computes leverage / BM DOF on the full FE dummy design instead of the within-transformed design, matching the absorbed-FE registry note and the new R-golden setup in diff_diff/twfe.py:L153-L161, diff_diff/twfe.py:L285-L341, and benchmarks/R/generate_clubsandwich_golden.R:L235-L284.
No unmitigated P0/P1 findings found in the changed estimator, variance, or inference paths.
Remaining replicate-weight rejection and the duplicated full-dummy builder are explicitly tracked in TODO.md, so they are informational rather than blockers at TODO.md:L103-L104.
One documentation-only inconsistency remains: older CHANGELOG.md bullets still say TWFE HC2/HC2-BM rejection remains a follow-up, which now contradicts the new lead entry at CHANGELOG.md:L10-L11 and CHANGELOG.md:L23-L24.
Static review only: I could not run the new tests here because numpy and pytest are unavailable in this environment.

Methodology

No findings. The affected methods are HC2 and HC2-Bell-McCaffrey / CR2 Bell-McCaffrey, and the implementation now routes TWFE through an explicit full-dummy FE design before calling the shared OLS / vcov path, which is the correct fix for the FWL hat-matrix issue documented in diff_diff/twfe.py:L285-L341, docs/methodology/REGISTRY.md:L2560-L2563, and the new parity tests at tests/test_methodology_twfe.py:L1205-L1352.

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P3. Impact: the TWFE HC2/HC2-BM path still duplicates the full-dummy FE construction already present in DifferenceInDifferences(fixed_effects=...), so future drift in FE naming, survey handling, or result-surface behavior remains possible. Concrete fix: follow the tracked refactor in TODO.md:L104-L104, pulling the shared dummy-construction logic out of diff_diff/twfe.py:L285-L341 and diff_diff/estimators.py:L477-L485. Tracked in TODO.md, so informational only.

Tech Debt

Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still rejects replicate-weight survey designs; this is an explicit fast-fail, not silent wrong inference. Concrete fix: implement per-replicate full-dummy refits as tracked in diff_diff/twfe.py:L227-L243 and TODO.md:L103-L103. Tracked in TODO.md, so informational only.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: CHANGELOG.md now contradicts itself. The new top entry says TWFE HC2/HC2-BM is supported, but older bullets still state that TWFE rejection remains a follow-up, which can mislead release-note readers about the current shipped behavior. Concrete fix: update or remove the stale TWFE-follow-up sentences in CHANGELOG.md:L23-L24 so they agree with CHANGELOG.md:L10-L11.
No additional findings on coverage. The previously noted survey-path gap looks closed by the new behavioral and parity tests in tests/test_estimators_vcov_type.py:L968-L1062 and tests/test_methodology_twfe.py:L1205-L1352.

CI review (R4) flagged that the new top entry says TWFE HC2/HC2-BM is supported, but two older bullets in the same [Unreleased] block (DiD- absorb and MPD-absorb entries) still claim "TwoWayFixedEffects rejection remains as a follow-up" / "remain as follow-ups". Those sentences were factually correct when those PRs landed but become contradictory once Gate 1 ships in the same release. Replace both with cross-references to the Gate 1 top entry, noting that TWFE uses a separate full-dummy branch (no fixed_effects= equivalent inside TWFE) rather than the absorb→fixed_effects parameter swap used by DiD/MPD. Per feedback_changelog_accuracy_fixes.md: scanned all CHANGELOG bullets for similar stale claims; the remaining matches are unrelated entries (Spillover, BaconDecomposition, Conley waves) that don't reference TWFE rejection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T16:02:16Z

🔁 AI review rerun (requested by @igerber)

Head SHA: baceb22bda068f738f540f24fd6b98d7a12af7ce

Overall Assessment

⚠️ Needs changes

Executive Summary

Re-review focus: the prior documentation-only concern looks addressed. CHANGELOG.md, REGISTRY.md, and TODO.md now consistently describe TWFE HC2/HC2-BM support and the remaining tracked limitations.
Methods affected are HC2 and HC2 Bell-McCaffrey / CR2 Bell-McCaffrey. The new analytical TWFE full-dummy route is aligned with the Methodology Registry and the new R-parity harness.
Severity P1 [Newly identified]: the new TwoWayFixedEffects(vcov_type="hc2", inference="wild_bootstrap") path is not rank-deficiency-safe. A dropped nuisance column on the full-dummy design can poison the bootstrap sample construction and crash the fit.
The replicate-weight rejection and duplicated full-dummy builder are now explicitly tracked in TODO.md, so they are informational only.
Static review only: I could not run the test suite here because this environment is missing pandas/pytest.

Methodology

No findings. The affected methods are HC2 and HC2 Bell-McCaffrey / CR2 Bell-McCaffrey, and the analytical TWFE implementation matches the documented full-FE-hat-matrix requirement in docs/methodology/REGISTRY.md L2559-L2564, the new full-dummy branch in diff_diff/twfe.py L285-L341, and the new parity tests in tests/test_methodology_twfe.py L1205-L1352.

Code Quality

Severity: P1 [Newly identified]. Impact: the newly supported hc2/hc2_bm TWFE path uses an explicit full-dummy design (diff_diff/twfe.py L285-L341) and now routes inference="wild_bootstrap" through the shared bootstrap helper (diff_diff/twfe.py L596-L600). That helper re-estimates the restricted model and then builds y_star with raw X @ beta_restricted (diff_diff/utils.py L568-L596). If the full-dummy solve dropped any nuisance column to NaN (for example, a time-invariant covariate collinear with unit FE), beta_restricted contains NaN, y_star becomes NaN, and the bootstrap crashes even though ATT is still identified analytically. The new bootstrap coverage only smoke-tests a full-rank panel (tests/test_estimators_vcov_type.py L828-L846). Concrete fix: make wild_bootstrap_se() reconstruct fitted values from identified columns only, mirroring solve_ols()’s rank-deficient fitted-value logic; while touching that helper, do not coerce invalid bootstrap draws to t*=0.0.

Performance

No findings.

Maintainability

Severity: P3. Impact: TWFE’s HC2/HC2-BM inline full-dummy builder still duplicates the DifferenceInDifferences(fixed_effects=...) dummy-construction logic, which leaves drift risk, but this is now explicitly tracked in TODO.md L104 and does not block approval. Concrete fix: follow the tracked shared-helper / delegation refactor.

Tech Debt

Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still rejects replicate-weight survey designs, but that limitation is explicitly documented in docs/methodology/REGISTRY.md L2562 and tracked in TODO.md L103, so it is informational only. Concrete fix: defer to the tracked per-replicate full-dummy refit follow-up.

Security

No findings.

Documentation/Tests

No unmitigated findings. The stale CHANGELOG contradiction from the previous review appears resolved in CHANGELOG.md L10-L24. Static review only: I could not execute the new tests in this environment.

Path to Approval

Make wild_bootstrap_se() rank-deficiency-safe for dropped nuisance columns by constructing restricted fitted values from the kept columns only, not raw X @ beta_restricted.
Add a regression test for TwoWayFixedEffects(vcov_type="hc2", inference="wild_bootstrap") on a full-dummy rank-deficient-but-ATT-identified design (for example, a time-invariant covariate collinear with unit FE) and assert the fit still returns finite bootstrap inference.

CI review (R5) identified a P1 bug in wild_bootstrap_se() that was newly reachable via the TWFE HC2/HC2-BM full-dummy path: Before this fix, wild_bootstrap_se built each draw's pseudo-outcome as `y_star = X @ beta_restricted`. When solve_ols dropped a rank- deficient nuisance column (e.g. a time-invariant covariate collinear with the unit FE on the full-dummy design), beta_restricted contained NaN on the dropped slot, and X @ beta_restricted propagated NaN through every observation. The ATT was analytically identified but the bootstrap crashed because y_star was all-NaN. Pre-PR this was unreachable on TWFE (the within-transform absorbed time-invariant covariates before they entered X), but the new full- dummy HC2/HC2-BM branch keeps unit/time dummies explicit alongside covariates, exposing the bug. Two fixes in wild_bootstrap_se (diff_diff/utils.py): 1. Use solve_ols(return_fitted=True) to get NaN-safe fitted values from the kept columns; build y_star = fitted_restricted + residuals_restricted * obs_weights instead of X @ beta_restricted. fitted_restricted is computed from the kept columns by solve_ols, so dropped nuisance NaN doesn't propagate. 2. Replace bootstrap_t_stats[b] = 0.0 fallback for singular draws with np.nan + a finite_mask filter at the p-value step. Setting t* = 0 biased the p-value downward (|0| < |t_original| counts as non-rejection, but those draws are invalid, not non-rejections). The same nan-safe filter applies to bootstrap_coefs for the SE and percentile CI. New regression test `test_twfe_hc2_wild_bootstrap_survives_rank_deficient_full_dummy` fits TwoWayFixedEffects(vcov_type='hc2', inference='wild_bootstrap', covariates=['x_invariant']) on a panel where x_invariant is time- invariant (collinear with unit FE on the full-dummy design); asserts finite ATT, SE, p-value, and CI. Pre-fix this test crashed with all-NaN y_star. No regression in the existing 53 wild_bootstrap tests across test_wild_bootstrap, test_methodology_did, test_methodology_twfe, test_conley_vcov, test_estimators_vcov_type, test_business_report, test_replicate_weight_expansion, test_survey. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T16:12:04Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 7f5df65fb038ac807fd7d7ef84e1d34af3b91c71

Overall Assessment

⛔ Blocker

Executive Summary

Previous re-review P1 looks addressed: wild_bootstrap_se() now reconstructs restricted fitted values from identified columns instead of X @ beta_restricted, and the new TWFE regression test covers that rank-deficient path. diff_diff/utils.py:L573-L585 tests/test_estimators_vcov_type.py:L830-L879
Methodology check: the affected methods are HC2 and Bell-McCaffrey / CR2-BM for TWFE. The new full-dummy route is consistent with the Methodology Registry and the new R parity harness; I did not find an undocumented paper mismatch in that algebra. docs/methodology/REGISTRY.md:L2559-L2564 diff_diff/twfe.py:L285-L341 benchmarks/R/generate_clubsandwich_golden.R:L235-L284 tests/test_methodology_twfe.py:L1230-L1352
Severity P0: the changed bootstrap helper can now return se = NaN while still surfacing a finite p_value and/or percentile CI when too few valid bootstrap draws survive, and the estimator layer passes those fields straight through. diff_diff/utils.py:L627-L655 diff_diff/estimators.py:L778-L786 diff_diff/twfe.py:L596-L600
The remaining replicate-weight rejection and duplicated full-dummy builder are both documented/tracked in TODO.md, so they are informational only. TODO.md:L103-L104
Static review only: I could not run pytest here because pytest is not installed in this environment.

Methodology

No findings. The HC2 and HC2-BM/CR2-BM TWFE route matches the absorbed-FE note in the Registry and the new parity target/design described in the benchmark and methodology tests. docs/methodology/REGISTRY.md:L2559-L2564 diff_diff/twfe.py:L285-L341 benchmarks/R/generate_clubsandwich_golden.R:L235-L284 tests/test_methodology_twfe.py:L1230-L1352

Code Quality

Severity P0. Impact: wild_bootstrap_se() now filters invalid draws, but when that leaves n_valid == 0 it returns p_value = 1.0, and when valid_coefs.size == 1 it still returns a percentile CI from that single draw even though se_bootstrap is NaN. _run_wild_bootstrap_inference() maps those fields directly onto estimator results, so the new TWFE hc2 + wild_bootstrap path can emit undefined SE alongside non-NaN p-value/CI/t-stat. That is misleading statistical output and violates the project’s all-or-nothing NaN inference contract. diff_diff/utils.py:L627-L655 diff_diff/estimators.py:L778-L786 diff_diff/twfe.py:L596-L600 Concrete fix: when fewer than two valid coefficient draws survive, return NaN for t_stat, p_value, and both CI endpoints as well; do not special-case p_value=1.0. If you want to preserve the raw original t-stat for diagnostics, keep it off the top-level inference surface.

Performance

No findings.

Maintainability

Severity P3. Impact: the TWFE HC2/HC2-BM full-dummy builder duplicates the DiD fixed-effects dummy-construction logic, which is a drift risk, but this is already tracked in TODO.md and is non-blocking. TODO.md:L104-L104 diff_diff/twfe.py:L285-L341 diff_diff/estimators.py:L478-L486 Concrete fix: extract a shared full-dummy FE builder or delegate TWFE’s HC2/HC2-BM path to the DiD fixed_effects= branch with TWFE-specific cluster handling.

Tech Debt

Severity P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still rejects replicate-weight survey designs; that limitation is documented in the Registry and tracked in TODO.md, so it is informational rather than blocking. diff_diff/twfe.py:L227-L243 docs/methodology/REGISTRY.md:L2562-L2562 TODO.md:L103-L103 Concrete fix: implement per-replicate full-dummy refits for the HC2/HC2-BM survey path.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new tests cover the repaired rank-deficient finite-case bootstrap path, but they do not exercise the n_valid == 0 / valid_coefs.size == 1 branches in wild_bootstrap_se(), which is why the P0 slipped through. tests/test_estimators_vcov_type.py:L830-L879 diff_diff/utils.py:L627-L655 Concrete fix: add direct regression tests for both insufficient-valid-draw branches and assert se, t_stat, p_value, and both CI endpoints are all NaN.

Path to Approval

Make wild_bootstrap_se() return all-NaN inference fields when bootstrap output is not fully identified, at minimum for n_valid == 0 and valid_coefs.size < 2, and ensure the surfaced estimator t_stat does not remain finite on that path.
Add regression tests that force those insufficient-valid-draw branches and assert top-level estimator outputs satisfy the project’s NaN inference contract.

…otstrap CI review (R6) caught a new P0 in my R5 wild_bootstrap_se() fix: the degenerate-bootstrap branches violated the all-or-nothing NaN contract from feedback_bootstrap_nan_on_invalid_contract: - n_valid == 0 returned p_value = 1.0 with se = NaN (split inference) - valid_coefs.size == 1 returned a finite percentile CI from a single draw alongside se = NaN - t_stat_original was always finite (analytical), surfacing alongside NaN bootstrap se when bootstrap was degenerate Fix: when n_valid < 2 OR valid_coefs.size < 2, NaN-out the entire inference quadruple (se, p_value, ci_lower, ci_upper) AND the surfaced t_stat_original. The analytical t-stat from step 1 is still computed for diagnostic use inside the helper but not propagated to the user-facing result on a degenerate bootstrap — this prevents the estimator wrapper from emitting an analytical t-stat alongside NaN bootstrap fields, which would mix inference families on the same coefficient. New regression tests in tests/test_wild_bootstrap.py:: TestWildBootstrapDegenerateAllNaN: - test_degenerate_n_valid_zero_returns_all_nan: monkeypatches solve_ols so every bootstrap draw has singular vcov; asserts ALL five user-surface fields are NaN. - test_degenerate_single_valid_draw_returns_all_nan: forces exactly one valid draw (n_valid == 1); asserts ALL five fields NaN — no percentile CI from a single-point sample. Both branches were previously not exercised by the analytical-design tests, which is why the R5 fix passed but the R6 reviewer caught the contract violation via code inspection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T16:29:52Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 25c364f77c681db7320b8e652d6dfa3deae2fe7e

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings remain in the changed estimator/inference paths. The prior blocker around degenerate wild-bootstrap inference appears resolved. Highest remaining unmitigated issue is a P2 documentation/test-contract gap.

Executive Summary

The previous blocker is fixed: wild_bootstrap_se() now reconstructs restricted fitted values from identified columns and NaN-gates the full inference surface when fewer than two valid draws survive; estimator results propagate those NaNs consistently. diff_diff/utils.py:L573-L676, diff_diff/estimators.py:L764-L786, tests/test_wild_bootstrap.py:L719-L866
Affected methods are HC2 and HC2 Bell-McCaffrey / CR2-BM for TwoWayFixedEffects. The new inline full-dummy route matches the Registry’s full-hat-matrix rationale and is backed by new R-parity coverage for HC2 SE plus one-way and clustered BM DOF. docs/methodology/REGISTRY.md:L2559-L2564, diff_diff/twfe.py:L153-L161, diff_diff/twfe.py:L245-L341, benchmarks/R/generate_clubsandwich_golden.R:L235-L284, tests/test_methodology_twfe.py:L1205-L1352
I did not find any remaining unmitigated P0/P1 in the changed estimator, variance, clustering, or survey-routing logic.
Severity P2: TwoWayFixedEffects.fit() now broadens result.coefficients on the unchanged within-transform HC1/classical/Conley paths too, but the changelog/registry/tests only disclose and pin that surface change under vcov_type in {"hc2","hc2_bm"}. diff_diff/twfe.py:L342-L364, diff_diff/twfe.py:L646-L672, CHANGELOG.md:L11-L11, docs/methodology/REGISTRY.md:L2562-L2562, tests/test_estimators_vcov_type.py:L960-L1019
The remaining replicate-weight rejection and duplicate full-dummy builder are already tracked in TODO.md, so they are informational only. TODO.md:L103-L104
Static review only: I could not run pytest here because pytest is not installed, and this environment has no usable temporary directory.

Methodology

No findings. Severity: none. Impact: the HC2 / HC2-BM TWFE change is documented in the Methodology Registry and the implementation follows the cited requirement that leverage / CR2 adjustments be computed on the full FE projection rather than the within-transformed design. Concrete fix: none. docs/methodology/REGISTRY.md:L2559-L2564, diff_diff/twfe.py:L153-L161, diff_diff/twfe.py:L285-L341, benchmarks/R/generate_clubsandwich_golden.R:L235-L284, tests/test_methodology_twfe.py:L1205-L1352

Code Quality

No remaining findings. Severity: none. Impact: the prior P0 on partial-NaN bootstrap inference is addressed; undefined bootstrap output now forces se, t_stat, p_value, and both CI endpoints to move together as NaN. Concrete fix: none. diff_diff/utils.py:L573-L676, diff_diff/estimators.py:L764-L786, tests/test_wild_bootstrap.py:L719-L866

Performance

No findings. Severity: none. Impact: none identified in the changed code; the new dense TWFE full-dummy route at least warns on large designs instead of silently risking extreme memory use. Concrete fix: none. diff_diff/twfe.py:L296-L315

Maintainability

Severity: P3. Impact: the TWFE HC2/HC2-BM full-dummy builder duplicates the fixed-effect dummy construction already present in DifferenceInDifferences.fit(), which creates drift risk on FE naming, survey behavior, and result-surface conventions, but this is already tracked in TODO.md and is non-blocking. Concrete fix: extract a shared FE-dummy builder or delegate TWFE’s HC2/HC2-BM path to the existing DiD full-dummy path. diff_diff/twfe.py:L285-L341, diff_diff/estimators.py:L478-L485, TODO.md:L104-L104

Tech Debt

Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still rejects replicate-weight survey designs; this limitation is documented in the Registry and tracked in TODO.md, so it is informational rather than blocking. Concrete fix: implement per-replicate full-dummy refits for the TWFE HC2/HC2-BM survey path. diff_diff/twfe.py:L227-L243, docs/methodology/REGISTRY.md:L2562-L2562, TODO.md:L103-L103

Security

No findings. Severity: none. Impact: I did not see security-relevant changes or accidental secrets in the changed files. Concrete fix: none.

Documentation/Tests

Severity: P2. Impact: TwoWayFixedEffects.fit() now builds result.coefficients from _twfe_var_names on both branches, so HC1/classical/Conley within-transform fits also expose const and covariate entries. The changelog/registry only disclose this result-surface change for the HC2/HC2-BM full-dummy path, and the new regression tests only assert the expanded coefficients/vcov contract for hc2/hc2_bm. That leaves a user-visible API change on unchanged TWFE paths undocumented and unpinned. Concrete fix: either restore the old within-transform contract ({"ATT": att} unless use_full_dummy is active) or document and regression-test the broader all-path TWFE result.coefficients contract. diff_diff/twfe.py:L342-L364, diff_diff/twfe.py:L646-L672, CHANGELOG.md:L11-L11, docs/methodology/REGISTRY.md:L2562-L2562, tests/test_estimators_vcov_type.py:L960-L1019

CI review (R7) flagged a P2 scope creep in the R1 coefficients-dict fix: I built `_twfe_var_names` on BOTH the full-dummy and within- transform branches, which silently broadened `result.coefficients` on HC1/classical/Conley paths from `{"ATT": att}` to `{"const": c, "ATT": att, ...covariates}`. That's a user-visible API change on unchanged TWFE paths that wasn't documented in CHANGELOG/REGISTRY or regression-tested. Per the reviewer's recommendation, restoring the historical `{"ATT": att}` contract on within-transform paths by setting `_twfe_var_names = None` on the else branch (the fallback at the DiDResults construction site handles None via the existing `{"ATT": float(att)}` literal). Only the HC2/HC2-BM full-dummy path now broadens the dict — which is what the REGISTRY/CHANGELOG surface-change disclosure documents, and what the alignment-invariant test and full-surface regression test pin. Verified end-to-end: hc1/classical → `{'ATT'}`; hc2/hc2_bm → `{'const', 'ATT', '_fe_unit_*', '_fe_time_*'}`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-19T16:37:03Z

🔁 AI review rerun (requested by @igerber)

Head SHA: a66b9ada5efde2836301123b6985fde6d449e1ce

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings remain in the changed TWFE HC2/HC2-BM or wild-bootstrap paths.

Executive Summary

Affected methods are TwoWayFixedEffects with vcov_type="hc2" and vcov_type="hc2_bm". The new inline full-dummy route matches the Registry’s requirement that HC2 leverage / CR2 Bell-McCaffrey DOF be computed on the full FE projection, not the within-transformed design. docs/methodology/REGISTRY.md:L2559-L2564, diff_diff/twfe.py:L153-L368, benchmarks/R/generate_clubsandwich_golden.R:L235-L284, tests/test_methodology_twfe.py:L1205-L1352
I did not find any remaining unmitigated P0/P1 in the changed estimator, variance, clustering, survey, or bootstrap paths.
The prior P2 documentation/test-contract gap is resolved: HC1 / classical / Conley now keep the historical {"ATT": ...} coefficients surface, while the expanded coefficients/vcov contract is explicitly scoped to the HC2/HC2-BM full-dummy path. diff_diff/twfe.py:L342-L368, diff_diff/twfe.py:L650-L664, CHANGELOG.md:L11-L11, tests/test_estimators_vcov_type.py:L960-L1019
The prior bootstrap blocker is addressed: wild_bootstrap_se() now reconstructs restricted fitted values from kept columns and NaN-gates the full inference surface on degenerate bootstrap output. diff_diff/utils.py:L577-L665, tests/test_estimators_vcov_type.py:L830-L899, tests/test_wild_bootstrap.py:L719-L866
Remaining issues are P3 informational only and already tracked in TODO.md: TWFE replicate-weight survey rejection and duplicated FE-dummy construction. TODO.md:L103-L104
Static review only: I could not run pytest here because pytest is not installed.

Methodology

No findings. Severity: none. Impact: the HC2 / HC2-BM TWFE change is documented in the Methodology Registry, implemented via an explicit full-dummy design, and pinned against R goldens for HC2 SE plus one-way and unit-clustered BM DOF. Concrete fix: none. docs/methodology/REGISTRY.md:L2560-L2564, diff_diff/twfe.py:L285-L341, benchmarks/data/clubsandwich_cr2_golden.json:L85-L106, tests/test_methodology_twfe.py:L1230-L1352

Code Quality

No findings. Severity: none. Impact: the changed bootstrap path now avoids NaN propagation from dropped nuisance columns and keeps the full inference tuple aligned under degenerate bootstrap output. Concrete fix: none. diff_diff/utils.py:L577-L665, tests/test_wild_bootstrap.py:L719-L866

Performance

No findings. Severity: none. Impact: the new dense full-dummy TWFE path warns before very large allocations instead of silently pushing users into an obvious memory hazard. Concrete fix: none. diff_diff/twfe.py:L296-L315

Maintainability

Severity: P3. Impact: the TWFE HC2/HC2-BM full-dummy builder duplicates the fixed-effect dummy construction already present in DifferenceInDifferences(fixed_effects=...), which creates drift risk on FE naming, survey behavior, and result-surface conventions, but this is already tracked in TODO.md and is non-blocking. Concrete fix: extract a shared FE-dummy builder or delegate TWFE’s HC2/HC2-BM path to the DiD fixed-effects branch. diff_diff/twfe.py:L285-L341, diff_diff/estimators.py:L478-L485, TODO.md:L104-L104

Tech Debt

Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still rejects replicate-weight survey designs; this limitation is documented in the Registry and tracked in TODO.md, so it is informational rather than blocking. Concrete fix: implement per-replicate full-dummy refits on the TWFE replicate-variance path. diff_diff/twfe.py:L227-L243, docs/methodology/REGISTRY.md:L2562-L2562, TODO.md:L103-L103

Security

No findings. Severity: none. Impact: I did not see security-relevant changes or accidental secrets in the changed files. Concrete fix: none.

Documentation/Tests

No findings. Severity: none. Impact: the previous doc/test-contract issue is resolved, and the new behavior is now documented in CHANGELOG.md / REGISTRY.md and pinned by behavioral coverage plus R-parity tests. Concrete fix: none. CHANGELOG.md:L11-L11, docs/methodology/REGISTRY.md:L2562-L2562, tests/test_estimators_vcov_type.py:L660-L1035, tests/test_methodology_twfe.py:L1205-L1352

igerber added the ready-for-ci Triggers CI test workflows label May 19, 2026

igerber merged commit 4c505d6 into main May 19, 2026
33 of 34 checks passed

igerber deleted the twfe-hc2-auto-route branch May 19, 2026 18:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lift Gate 1: HC2/HC2-BM for TwoWayFixedEffects via full-dummy auto-route#469

Lift Gate 1: HC2/HC2-BM for TwoWayFixedEffects via full-dummy auto-route#469
igerber merged 8 commits into
mainfrom
twfe-hc2-auto-route

igerber commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented May 19, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

github-actions Bot commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant