Skip to content

Lift Gate 1: HC2/HC2-BM for TwoWayFixedEffects via full-dummy auto-route#469

Merged
igerber merged 8 commits into
mainfrom
twfe-hc2-auto-route
May 19, 2026
Merged

Lift Gate 1: HC2/HC2-BM for TwoWayFixedEffects via full-dummy auto-route#469
igerber merged 8 commits into
mainfrom
twfe-hc2-auto-route

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 19, 2026

Summary

Methodology references (required if estimator / math changes)

  • Method name(s): HC2 (MacKinnon-White 1985), CR2 Bell-McCaffrey (Pustejovsky-Tipton 2018 §3.3; Imbens-Kolesar 2016)
  • Paper / source link(s): sandwich::vcovHC(type="HC2") and clubSandwich::vcovCR(type="CR2") + coef_test()$df_Satt (R parity targets; singleton-cluster trick = one-way HC2-BM)
  • Any intentional deviations from the source (and why): None — empirical parity to R at atol=1e-10 verified on the new twfe_two_period scenario in benchmarks/data/clubsandwich_cr2_golden.json. The full-dummy build is the FWL-correct algebra (within-transform preserves coefficients but NOT the hat matrix, so HC2 leverage on the demeaned design would be wrong).

Validation

  • Tests added/updated:
    • tests/test_estimators_vcov_type.py::TestFitBehavior — 10 behavioral tests (rejection flip → behavioral; refactor regression vs DiD(fixed_effects=[unit, time]) at atol=1e-12; auto-cluster default coverage on hc2_bm with distinguishability check vs one-way HC2-BM at >1% gap; explicit hc2 + analytical no-auto-cluster; hc2 + wild_bootstrap auto-cluster preserved; hc2 / hc2_bm + replicate rejection; always-treated unit finite ATT; coefficients-vs-vcov alignment invariant)
    • tests/test_methodology_twfe.py::TestTWFEHC2RParity — 3 R-parity tests at atol=1e-10 (HC2 SE vs sandwich::vcovHC; one-way BM DOF vs singleton-cluster CR2; CR2-BM clustered-at-unit DOF vs vcovCR(cluster=unit))
    • benchmarks/R/generate_clubsandwich_golden.R — new twfe_two_period scenario (8 units × 4 periods, binary post indicator); JSON regenerated with meta.source = "clubSandwich"
  • Backtest / simulation / notebook evidence: N/A (small-sample SE refinement, not a new estimator)

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

Replaces the unconditional NotImplementedError at twfe.py for
`vcov_type in {"hc2","hc2_bm"}` with an inline full-dummy branch.
TWFE has no absorb=/fixed_effects= parameter to swap (unit + time FE
are baked into the estimator's identity), so the auto-route trick
used for DiD-absorb / MPD-absorb doesn't apply directly. Instead,
`TwoWayFixedEffects.fit()` bypasses the within-transform on
hc2/hc2_bm and stacks [intercept, treated×post, covariates,
factor(unit), factor(time)] explicitly so the leverage correction
and BM DOF compute on the full FE projection (FWL preserves
coefficients and residuals but NOT the hat matrix).

**Auto-cluster default:** preserved on hc2_bm (routes to CR2-BM at
unit) and on hc2 + wild_bootstrap; dropped on explicit hc2 +
analytical to match the one-way contract (the linalg validator
rejects hc2 + cluster_ids).

**Surface change disclosure** (matches DiD-absorb / MPD-absorb):
under vcov_type in {"hc2","hc2_bm"}, result.coefficients,
result.vcov, result.residuals, result.fitted_values, and
result.r_squared reflect the full-dummy fit. FE-dummy entries are
included alongside the "ATT" key (len(coefficients) ==
vcov.shape[0] invariant upheld). result.att, its SE, and analytical
inference are unchanged (FWL-equivalent).

**Rejected combos:** vcov_type in {"hc2","hc2_bm"} + replicate-
weight survey designs raises NotImplementedError because the
replicate path re-demeans per replicate, which doesn't compose with
the full-dummy build. Survey variance precedence: any resolved
SurveyDesign drives variance via TSL/replicate (matches existing
DiD/MPD contract), not the analytical small-sample sandwich.

Verified at atol=1e-10 vs `lm() + sandwich::vcovHC(type="HC2")` and
`lm() + clubSandwich::vcovCR(cluster=seq_len(n), type="CR2") +
coef_test()$df_Satt` on a new `twfe_two_period` scenario in
benchmarks/data/clubsandwich_cr2_golden.json. New tests:
- TestFitBehavior (10 behavioral tests including refactor regression
  vs DiD(fixed_effects=[unit, time]) at atol=1e-12, auto-cluster
  distinguishability check vs one-way HC2-BM at 1% gap, replicate-
  weight rejection, coefficients-vs-vcov alignment invariant)
- TestTWFEHC2RParity (3 R-parity tests at atol=1e-10)

Lifts Gate 1 of the six HC2/HC2-BM NotImplementedError gates — the
last absorbed-FE gate. Remaining gates: weighted one-way HC2-BM,
weighted CR2-BM (both blocked on the clubSandwich WLS algebra
derivation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. The HC2/HC2-BM methodology change in TwoWayFixedEffects is consistent with the Methodology Registry and the intended full-projection HC2/CR2 logic.

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • Severity: P2. Impact: the new hc2 / hc2_bm path always builds dense unit and time dummy matrices and then dense-stacks them into X. On realistically large TWFE panels, that can turn the old explicit rejection into OOMs or severe latency. Concrete fix: add a size guard or warning before dummy expansion, or route this path through a sparse/shared full-dummy helper instead of unconditional dense get_dummies + np.column_stack. diff_diff/twfe.py:L279-L300

Maintainability

  • Severity: P3. Impact: TWFE now carries a second bespoke full-dummy builder that overlaps with the existing DifferenceInDifferences(fixed_effects=...) implementation, which increases drift risk around FE naming, survey behavior, and result-surface conventions. Concrete fix: extract a shared helper or delegate the TWFE hc2 / hc2_bm route to the existing fixed-effects DiD path. diff_diff/twfe.py:L279-L315 diff_diff/estimators.py:L467-L485

Tech Debt

  • Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still rejects replicate-weight survey designs, but that limitation is explicitly documented and tracked, so it is not a blocker under the project’s deferred-work policy. Concrete fix: none required for approval; implement the tracked follow-up when replicate support is needed. diff_diff/twfe.py:L221-L236 docs/methodology/REGISTRY.md:L2562-L2562 TODO.md:L103-L103

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the new tests cover ATT/SE parity, CR2 auto-clustering, replicate rejection, and coefficients-vs-vcov alignment, but they do not pin the newly documented full-surface behavior for result.residuals, result.fitted_values, result.r_squared, or the non-replicate survey-weighted hc2 / hc2_bm path. A regression there could ship while the current parity tests still pass. Concrete fix: add one regression comparing those fields to DifferenceInDifferences(fixed_effects=["unit", time]) on hc2 / hc2_bm, plus one survey-weighted non-replicate TWFE test on the new path. tests/test_estimators_vcov_type.py:L657-L935 tests/test_methodology_twfe.py:L1230-L1350

Addresses CI Codex review findings on PR #469:

P2 (Performance): the HC2/HC2-BM full-dummy build can OOM on large
TWFE panels (n × (n_units + n_times) float64 entries). Add a memory-
size warning at >50M entries (~400 MB) suggesting hc1 (within-
transform) for large panels.

P3 (Docs/Tests): new tests pinned ATT/SE/DOF but not the documented
full-surface change (residuals/fitted_values/r_squared reflect the
full-dummy fit). Add `test_twfe_hc2_full_surface_matches_did_fixed_effects`
parametrized over hc2/hc2_bm, asserting bit-equality against
DifferenceInDifferences(fixed_effects=[unit, time]) at atol=1e-12 on
all three fields.

P3 (Maintainability): TWFE's inline full-dummy builder duplicates
DiD's fixed_effects= dummy-construction logic. Substantive refactor —
better as a follow-up than inline in this PR. Added a TODO row.

P3 (Tech debt — replicate weights): already tracked, no action needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: b0ccc33c77cbf04637ecc5abe5ca34af19445140


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

Tech Debt

  • Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still rejects replicate-weight survey designs. This is a documented fast-fail rather than silent wrong inference, and it is tracked under the project’s deferred-work policy. Concrete fix: none required for approval; follow the tracked per-replicate full-dummy refit work in diff_diff/twfe.py:L221-L236, docs/methodology/REGISTRY.md:L2562-L2562, and TODO.md:L103-L103.

Security

  • No findings.

Documentation/Tests

CI review flagged the missing end-to-end test for the new non-replicate
survey-weighted hc2/hc2_bm TWFE path. A regression there could ship
while existing tests still pass.

New parametrized test compares TWFE(vcov_type=vcov, cluster=...) with
SurveyDesign(weights=...) against DifferenceInDifferences(vcov_type=vcov,
cluster=..., fixed_effects=['unit', 'time']) with the same design. Both
paths feed the survey-resolved full-dummy X to LinearRegression's
compute_survey_vcov (TSL), so ATT and SE match bit-equally at atol=1e-12.

Test documents the explicit-cluster requirement on hc2_bm: TWFE's
implicit auto-cluster + survey path intentionally drops PSU injection
(per the survey-design scope rule in _resolve_effective_cluster);
the explicit cluster='unit' form is what aligns with DiD's clustered-
survey behavior and is the documented user-facing way to invoke
clustered survey-aware HC2-BM on TWFE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 0a2706655c849b9296325ee621d7e94d9bb42f63


Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated P0/P1 issues found in the new TWFE hc2 / hc2_bm route.
  • Methodology is aligned with the absorbed-FE registry note: the PR moves HC2 / CR2 leverage computation onto an explicit full-dummy FE design instead of the within-transformed design, which is the correct fix for FWL hat-matrix non-equivalence. See diff_diff/twfe.py:L147-L155, diff_diff/twfe.py:L279-L325, docs/methodology/REGISTRY.md:L2560-L2564.
  • The prior non-blocking survey-path gap is only partially closed: the new weights-only survey regression is valuable, but I still do not see a direct SurveyDesign(weights=..., strata=..., psu=...) parity test on the new full-dummy hc2 / hc2_bm branch. See tests/test_estimators_vcov_type.py:L968-L1017.
  • The replicate-weight rejection and duplicated full-dummy builder are now explicitly tracked in TODO.md, so they are informational rather than blockers. See TODO.md:L103-L104.
  • Static review only: I could not run the tests in this environment because numpy, pandas, and pytest are unavailable.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3. Impact: TWFE still carries a bespoke full-dummy builder instead of reusing the existing DifferenceInDifferences(fixed_effects=...) construction, so future FE naming / survey-handling / results-surface drift remains possible. Concrete fix: none required for approval; this is already tracked in TODO.md:L104-L104. Relevant code: diff_diff/twfe.py:L279-L335, diff_diff/estimators.py:L477-L485.

Tech Debt

  • Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still fast-fails on replicate-weight survey designs. This is an explicit guard, not silent wrong inference. Concrete fix: none required for approval; follow the tracked per-replicate full-dummy refit work in TODO.md:L103-L103 and diff_diff/twfe.py:L221-L237.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: [Previously noted, partially addressed] the new weights-only survey parity test closes the most obvious gap, but I still do not see a direct SurveyDesign(weights=..., strata=..., psu=...) TWFE-vs-DiD full-dummy regression for hc2 / hc2_bm. Also, the new test block itself documents a survey exception to the broad “hc2_bm auto-cluster is preserved” wording: with survey_design= and no explicit cluster=, TWFE intentionally keeps the established implicit-PSU path, so users need explicit cluster=unit or survey_design.psu to get unit-level PSU injection. Concrete fix: add one explicit strata+psu parity test, and add one sentence to the TWFE docstring / registry note clarifying that the auto-cluster note applies to the non-survey analytical path. See tests/test_estimators_vcov_type.py:L968-L1017, tests/test_survey.py:L3237-L3286, diff_diff/twfe.py:L374-L393, docs/methodology/REGISTRY.md:L2562-L2562.

… for survey path

CI review polish:

- Add `test_twfe_hc2_with_survey_strata_psu_matches_did_fixed_effects`
  (parametrized over hc2/hc2_bm) — extends the weights-only survey
  regression with `SurveyDesign(weights="w", strata="stratum",
  psu="psu")`. Both TWFE and DiD(fixed_effects=[unit, time]) paths
  feed the survey-resolved full-dummy X to LinearRegression's TSL
  with stratified-design adjustments, so ATT/SE match bit-equally
  at atol=1e-12.

- Clarify the "auto-cluster preserved on hc2_bm" wording in
  TWFE docstring and REGISTRY: the auto-cluster default applies to
  the non-survey analytical path. Under `survey_design=` with no
  explicit `cluster=`, TWFE keeps the documented implicit-PSU path
  (auto-cluster NOT injected into survey PSU); users wanting
  unit-level PSU under a survey design must pass explicit
  `cluster="unit"` or set `survey_design.psu`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 382824162ef52d45a83481d1c9b302579646b9cc


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3. Impact: the TWFE HC2/HC2-BM path still duplicates the full-dummy FE construction already present in DifferenceInDifferences(fixed_effects=...), so future drift in FE naming, survey handling, or result-surface behavior remains possible. Concrete fix: follow the tracked refactor in TODO.md:L104-L104, pulling the shared dummy-construction logic out of diff_diff/twfe.py:L285-L341 and diff_diff/estimators.py:L477-L485. Tracked in TODO.md, so informational only.

Tech Debt

  • Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still rejects replicate-weight survey designs; this is an explicit fast-fail, not silent wrong inference. Concrete fix: implement per-replicate full-dummy refits as tracked in diff_diff/twfe.py:L227-L243 and TODO.md:L103-L103. Tracked in TODO.md, so informational only.

Security

  • No findings.

Documentation/Tests

CI review (R4) flagged that the new top entry says TWFE HC2/HC2-BM is
supported, but two older bullets in the same [Unreleased] block (DiD-
absorb and MPD-absorb entries) still claim "TwoWayFixedEffects
rejection remains as a follow-up" / "remain as follow-ups". Those
sentences were factually correct when those PRs landed but become
contradictory once Gate 1 ships in the same release.

Replace both with cross-references to the Gate 1 top entry, noting that
TWFE uses a separate full-dummy branch (no fixed_effects= equivalent
inside TWFE) rather than the absorb→fixed_effects parameter swap used
by DiD/MPD.

Per feedback_changelog_accuracy_fixes.md: scanned all CHANGELOG bullets
for similar stale claims; the remaining matches are unrelated entries
(Spillover, BaconDecomposition, Conley waves) that don't reference TWFE
rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: baceb22bda068f738f540f24fd6b98d7a12af7ce


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Re-review focus: the prior documentation-only concern looks addressed. CHANGELOG.md, REGISTRY.md, and TODO.md now consistently describe TWFE HC2/HC2-BM support and the remaining tracked limitations.
  • Methods affected are HC2 and HC2 Bell-McCaffrey / CR2 Bell-McCaffrey. The new analytical TWFE full-dummy route is aligned with the Methodology Registry and the new R-parity harness.
  • Severity P1 [Newly identified]: the new TwoWayFixedEffects(vcov_type="hc2", inference="wild_bootstrap") path is not rank-deficiency-safe. A dropped nuisance column on the full-dummy design can poison the bootstrap sample construction and crash the fit.
  • The replicate-weight rejection and duplicated full-dummy builder are now explicitly tracked in TODO.md, so they are informational only.
  • Static review only: I could not run the test suite here because this environment is missing pandas/pytest.

Methodology

Code Quality

  • Severity: P1 [Newly identified]. Impact: the newly supported hc2/hc2_bm TWFE path uses an explicit full-dummy design (diff_diff/twfe.py L285-L341) and now routes inference="wild_bootstrap" through the shared bootstrap helper (diff_diff/twfe.py L596-L600). That helper re-estimates the restricted model and then builds y_star with raw X @ beta_restricted (diff_diff/utils.py L568-L596). If the full-dummy solve dropped any nuisance column to NaN (for example, a time-invariant covariate collinear with unit FE), beta_restricted contains NaN, y_star becomes NaN, and the bootstrap crashes even though ATT is still identified analytically. The new bootstrap coverage only smoke-tests a full-rank panel (tests/test_estimators_vcov_type.py L828-L846). Concrete fix: make wild_bootstrap_se() reconstruct fitted values from identified columns only, mirroring solve_ols()’s rank-deficient fitted-value logic; while touching that helper, do not coerce invalid bootstrap draws to t*=0.0.

Performance

  • No findings.

Maintainability

  • Severity: P3. Impact: TWFE’s HC2/HC2-BM inline full-dummy builder still duplicates the DifferenceInDifferences(fixed_effects=...) dummy-construction logic, which leaves drift risk, but this is now explicitly tracked in TODO.md L104 and does not block approval. Concrete fix: follow the tracked shared-helper / delegation refactor.

Tech Debt

  • Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still rejects replicate-weight survey designs, but that limitation is explicitly documented in docs/methodology/REGISTRY.md L2562 and tracked in TODO.md L103, so it is informational only. Concrete fix: defer to the tracked per-replicate full-dummy refit follow-up.

Security

  • No findings.

Documentation/Tests

  • No unmitigated findings. The stale CHANGELOG contradiction from the previous review appears resolved in CHANGELOG.md L10-L24. Static review only: I could not execute the new tests in this environment.

Path to Approval

  1. Make wild_bootstrap_se() rank-deficiency-safe for dropped nuisance columns by constructing restricted fitted values from the kept columns only, not raw X @ beta_restricted.
  2. Add a regression test for TwoWayFixedEffects(vcov_type="hc2", inference="wild_bootstrap") on a full-dummy rank-deficient-but-ATT-identified design (for example, a time-invariant covariate collinear with unit FE) and assert the fit still returns finite bootstrap inference.

CI review (R5) identified a P1 bug in wild_bootstrap_se() that was
newly reachable via the TWFE HC2/HC2-BM full-dummy path:

Before this fix, wild_bootstrap_se built each draw's pseudo-outcome
as `y_star = X @ beta_restricted`. When solve_ols dropped a rank-
deficient nuisance column (e.g. a time-invariant covariate collinear
with the unit FE on the full-dummy design), beta_restricted contained
NaN on the dropped slot, and X @ beta_restricted propagated NaN
through every observation. The ATT was analytically identified but
the bootstrap crashed because y_star was all-NaN.

Pre-PR this was unreachable on TWFE (the within-transform absorbed
time-invariant covariates before they entered X), but the new full-
dummy HC2/HC2-BM branch keeps unit/time dummies explicit alongside
covariates, exposing the bug.

Two fixes in wild_bootstrap_se (diff_diff/utils.py):

1. Use solve_ols(return_fitted=True) to get NaN-safe fitted values
   from the kept columns; build y_star = fitted_restricted +
   residuals_restricted * obs_weights instead of X @ beta_restricted.
   fitted_restricted is computed from the kept columns by solve_ols,
   so dropped nuisance NaN doesn't propagate.

2. Replace bootstrap_t_stats[b] = 0.0 fallback for singular draws
   with np.nan + a finite_mask filter at the p-value step. Setting
   t* = 0 biased the p-value downward (|0| < |t_original| counts as
   non-rejection, but those draws are invalid, not non-rejections).
   The same nan-safe filter applies to bootstrap_coefs for the SE
   and percentile CI.

New regression test
`test_twfe_hc2_wild_bootstrap_survives_rank_deficient_full_dummy`
fits TwoWayFixedEffects(vcov_type='hc2', inference='wild_bootstrap',
covariates=['x_invariant']) on a panel where x_invariant is time-
invariant (collinear with unit FE on the full-dummy design); asserts
finite ATT, SE, p-value, and CI. Pre-fix this test crashed with
all-NaN y_star.

No regression in the existing 53 wild_bootstrap tests across
test_wild_bootstrap, test_methodology_did, test_methodology_twfe,
test_conley_vcov, test_estimators_vcov_type, test_business_report,
test_replicate_weight_expansion, test_survey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 7f5df65fb038ac807fd7d7ef84e1d34af3b91c71


Overall Assessment

⛔ Blocker

Executive Summary

Methodology

Code Quality

  • Severity P0. Impact: wild_bootstrap_se() now filters invalid draws, but when that leaves n_valid == 0 it returns p_value = 1.0, and when valid_coefs.size == 1 it still returns a percentile CI from that single draw even though se_bootstrap is NaN. _run_wild_bootstrap_inference() maps those fields directly onto estimator results, so the new TWFE hc2 + wild_bootstrap path can emit undefined SE alongside non-NaN p-value/CI/t-stat. That is misleading statistical output and violates the project’s all-or-nothing NaN inference contract. diff_diff/utils.py:L627-L655 diff_diff/estimators.py:L778-L786 diff_diff/twfe.py:L596-L600 Concrete fix: when fewer than two valid coefficient draws survive, return NaN for t_stat, p_value, and both CI endpoints as well; do not special-case p_value=1.0. If you want to preserve the raw original t-stat for diagnostics, keep it off the top-level inference surface.

Performance

  • No findings.

Maintainability

  • Severity P3. Impact: the TWFE HC2/HC2-BM full-dummy builder duplicates the DiD fixed-effects dummy-construction logic, which is a drift risk, but this is already tracked in TODO.md and is non-blocking. TODO.md:L104-L104 diff_diff/twfe.py:L285-L341 diff_diff/estimators.py:L478-L486 Concrete fix: extract a shared full-dummy FE builder or delegate TWFE’s HC2/HC2-BM path to the DiD fixed_effects= branch with TWFE-specific cluster handling.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new tests cover the repaired rank-deficient finite-case bootstrap path, but they do not exercise the n_valid == 0 / valid_coefs.size == 1 branches in wild_bootstrap_se(), which is why the P0 slipped through. tests/test_estimators_vcov_type.py:L830-L879 diff_diff/utils.py:L627-L655 Concrete fix: add direct regression tests for both insufficient-valid-draw branches and assert se, t_stat, p_value, and both CI endpoints are all NaN.

Path to Approval

  1. Make wild_bootstrap_se() return all-NaN inference fields when bootstrap output is not fully identified, at minimum for n_valid == 0 and valid_coefs.size < 2, and ensure the surfaced estimator t_stat does not remain finite on that path.
  2. Add regression tests that force those insufficient-valid-draw branches and assert top-level estimator outputs satisfy the project’s NaN inference contract.

…otstrap

CI review (R6) caught a new P0 in my R5 wild_bootstrap_se() fix:
the degenerate-bootstrap branches violated the all-or-nothing NaN
contract from feedback_bootstrap_nan_on_invalid_contract:

- n_valid == 0 returned p_value = 1.0 with se = NaN (split inference)
- valid_coefs.size == 1 returned a finite percentile CI from a single
  draw alongside se = NaN
- t_stat_original was always finite (analytical), surfacing alongside
  NaN bootstrap se when bootstrap was degenerate

Fix: when n_valid < 2 OR valid_coefs.size < 2, NaN-out the entire
inference quadruple (se, p_value, ci_lower, ci_upper) AND the
surfaced t_stat_original. The analytical t-stat from step 1 is still
computed for diagnostic use inside the helper but not propagated
to the user-facing result on a degenerate bootstrap — this prevents
the estimator wrapper from emitting an analytical t-stat alongside
NaN bootstrap fields, which would mix inference families on the
same coefficient.

New regression tests in tests/test_wild_bootstrap.py::
TestWildBootstrapDegenerateAllNaN:

- test_degenerate_n_valid_zero_returns_all_nan: monkeypatches
  solve_ols so every bootstrap draw has singular vcov; asserts
  ALL five user-surface fields are NaN.

- test_degenerate_single_valid_draw_returns_all_nan: forces exactly
  one valid draw (n_valid == 1); asserts ALL five fields NaN — no
  percentile CI from a single-point sample.

Both branches were previously not exercised by the analytical-design
tests, which is why the R5 fix passed but the R6 reviewer caught the
contract violation via code inspection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 25c364f77c681db7320b8e652d6dfa3deae2fe7e


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings remain in the changed estimator/inference paths. The prior blocker around degenerate wild-bootstrap inference appears resolved. Highest remaining unmitigated issue is a P2 documentation/test-contract gap.

Executive Summary

  • The previous blocker is fixed: wild_bootstrap_se() now reconstructs restricted fitted values from identified columns and NaN-gates the full inference surface when fewer than two valid draws survive; estimator results propagate those NaNs consistently. diff_diff/utils.py:L573-L676, diff_diff/estimators.py:L764-L786, tests/test_wild_bootstrap.py:L719-L866
  • Affected methods are HC2 and HC2 Bell-McCaffrey / CR2-BM for TwoWayFixedEffects. The new inline full-dummy route matches the Registry’s full-hat-matrix rationale and is backed by new R-parity coverage for HC2 SE plus one-way and clustered BM DOF. docs/methodology/REGISTRY.md:L2559-L2564, diff_diff/twfe.py:L153-L161, diff_diff/twfe.py:L245-L341, benchmarks/R/generate_clubsandwich_golden.R:L235-L284, tests/test_methodology_twfe.py:L1205-L1352
  • I did not find any remaining unmitigated P0/P1 in the changed estimator, variance, clustering, or survey-routing logic.
  • Severity P2: TwoWayFixedEffects.fit() now broadens result.coefficients on the unchanged within-transform HC1/classical/Conley paths too, but the changelog/registry/tests only disclose and pin that surface change under vcov_type in {"hc2","hc2_bm"}. diff_diff/twfe.py:L342-L364, diff_diff/twfe.py:L646-L672, CHANGELOG.md:L11-L11, docs/methodology/REGISTRY.md:L2562-L2562, tests/test_estimators_vcov_type.py:L960-L1019
  • The remaining replicate-weight rejection and duplicate full-dummy builder are already tracked in TODO.md, so they are informational only. TODO.md:L103-L104
  • Static review only: I could not run pytest here because pytest is not installed, and this environment has no usable temporary directory.

Methodology

  • No findings. Severity: none. Impact: the HC2 / HC2-BM TWFE change is documented in the Methodology Registry and the implementation follows the cited requirement that leverage / CR2 adjustments be computed on the full FE projection rather than the within-transformed design. Concrete fix: none. docs/methodology/REGISTRY.md:L2559-L2564, diff_diff/twfe.py:L153-L161, diff_diff/twfe.py:L285-L341, benchmarks/R/generate_clubsandwich_golden.R:L235-L284, tests/test_methodology_twfe.py:L1205-L1352

Code Quality

  • No remaining findings. Severity: none. Impact: the prior P0 on partial-NaN bootstrap inference is addressed; undefined bootstrap output now forces se, t_stat, p_value, and both CI endpoints to move together as NaN. Concrete fix: none. diff_diff/utils.py:L573-L676, diff_diff/estimators.py:L764-L786, tests/test_wild_bootstrap.py:L719-L866

Performance

  • No findings. Severity: none. Impact: none identified in the changed code; the new dense TWFE full-dummy route at least warns on large designs instead of silently risking extreme memory use. Concrete fix: none. diff_diff/twfe.py:L296-L315

Maintainability

  • Severity: P3. Impact: the TWFE HC2/HC2-BM full-dummy builder duplicates the fixed-effect dummy construction already present in DifferenceInDifferences.fit(), which creates drift risk on FE naming, survey behavior, and result-surface conventions, but this is already tracked in TODO.md and is non-blocking. Concrete fix: extract a shared FE-dummy builder or delegate TWFE’s HC2/HC2-BM path to the existing DiD full-dummy path. diff_diff/twfe.py:L285-L341, diff_diff/estimators.py:L478-L485, TODO.md:L104-L104

Tech Debt

  • Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still rejects replicate-weight survey designs; this limitation is documented in the Registry and tracked in TODO.md, so it is informational rather than blocking. Concrete fix: implement per-replicate full-dummy refits for the TWFE HC2/HC2-BM survey path. diff_diff/twfe.py:L227-L243, docs/methodology/REGISTRY.md:L2562-L2562, TODO.md:L103-L103

Security

  • No findings. Severity: none. Impact: I did not see security-relevant changes or accidental secrets in the changed files. Concrete fix: none.

Documentation/Tests

  • Severity: P2. Impact: TwoWayFixedEffects.fit() now builds result.coefficients from _twfe_var_names on both branches, so HC1/classical/Conley within-transform fits also expose const and covariate entries. The changelog/registry only disclose this result-surface change for the HC2/HC2-BM full-dummy path, and the new regression tests only assert the expanded coefficients/vcov contract for hc2/hc2_bm. That leaves a user-visible API change on unchanged TWFE paths undocumented and unpinned. Concrete fix: either restore the old within-transform contract ({"ATT": att} unless use_full_dummy is active) or document and regression-test the broader all-path TWFE result.coefficients contract. diff_diff/twfe.py:L342-L364, diff_diff/twfe.py:L646-L672, CHANGELOG.md:L11-L11, docs/methodology/REGISTRY.md:L2562-L2562, tests/test_estimators_vcov_type.py:L960-L1019

CI review (R7) flagged a P2 scope creep in the R1 coefficients-dict
fix: I built `_twfe_var_names` on BOTH the full-dummy and within-
transform branches, which silently broadened
`result.coefficients` on HC1/classical/Conley paths from
`{"ATT": att}` to `{"const": c, "ATT": att, ...covariates}`. That's
a user-visible API change on unchanged TWFE paths that wasn't
documented in CHANGELOG/REGISTRY or regression-tested.

Per the reviewer's recommendation, restoring the historical
`{"ATT": att}` contract on within-transform paths by setting
`_twfe_var_names = None` on the else branch (the fallback at the
DiDResults construction site handles None via the existing
`{"ATT": float(att)}` literal). Only the HC2/HC2-BM full-dummy path
now broadens the dict — which is what the REGISTRY/CHANGELOG
surface-change disclosure documents, and what the alignment-invariant
test and full-surface regression test pin.

Verified end-to-end: hc1/classical → `{'ATT'}`; hc2/hc2_bm →
`{'const', 'ATT', '_fe_unit_*', '_fe_time_*'}`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: a66b9ada5efde2836301123b6985fde6d449e1ce


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings remain in the changed TWFE HC2/HC2-BM or wild-bootstrap paths.

Executive Summary

  • Affected methods are TwoWayFixedEffects with vcov_type="hc2" and vcov_type="hc2_bm". The new inline full-dummy route matches the Registry’s requirement that HC2 leverage / CR2 Bell-McCaffrey DOF be computed on the full FE projection, not the within-transformed design. docs/methodology/REGISTRY.md:L2559-L2564, diff_diff/twfe.py:L153-L368, benchmarks/R/generate_clubsandwich_golden.R:L235-L284, tests/test_methodology_twfe.py:L1205-L1352
  • I did not find any remaining unmitigated P0/P1 in the changed estimator, variance, clustering, survey, or bootstrap paths.
  • The prior P2 documentation/test-contract gap is resolved: HC1 / classical / Conley now keep the historical {"ATT": ...} coefficients surface, while the expanded coefficients/vcov contract is explicitly scoped to the HC2/HC2-BM full-dummy path. diff_diff/twfe.py:L342-L368, diff_diff/twfe.py:L650-L664, CHANGELOG.md:L11-L11, tests/test_estimators_vcov_type.py:L960-L1019
  • The prior bootstrap blocker is addressed: wild_bootstrap_se() now reconstructs restricted fitted values from kept columns and NaN-gates the full inference surface on degenerate bootstrap output. diff_diff/utils.py:L577-L665, tests/test_estimators_vcov_type.py:L830-L899, tests/test_wild_bootstrap.py:L719-L866
  • Remaining issues are P3 informational only and already tracked in TODO.md: TWFE replicate-weight survey rejection and duplicated FE-dummy construction. TODO.md:L103-L104
  • Static review only: I could not run pytest here because pytest is not installed.

Methodology

  • No findings. Severity: none. Impact: the HC2 / HC2-BM TWFE change is documented in the Methodology Registry, implemented via an explicit full-dummy design, and pinned against R goldens for HC2 SE plus one-way and unit-clustered BM DOF. Concrete fix: none. docs/methodology/REGISTRY.md:L2560-L2564, diff_diff/twfe.py:L285-L341, benchmarks/data/clubsandwich_cr2_golden.json:L85-L106, tests/test_methodology_twfe.py:L1230-L1352

Code Quality

  • No findings. Severity: none. Impact: the changed bootstrap path now avoids NaN propagation from dropped nuisance columns and keeps the full inference tuple aligned under degenerate bootstrap output. Concrete fix: none. diff_diff/utils.py:L577-L665, tests/test_wild_bootstrap.py:L719-L866

Performance

  • No findings. Severity: none. Impact: the new dense full-dummy TWFE path warns before very large allocations instead of silently pushing users into an obvious memory hazard. Concrete fix: none. diff_diff/twfe.py:L296-L315

Maintainability

  • Severity: P3. Impact: the TWFE HC2/HC2-BM full-dummy builder duplicates the fixed-effect dummy construction already present in DifferenceInDifferences(fixed_effects=...), which creates drift risk on FE naming, survey behavior, and result-surface conventions, but this is already tracked in TODO.md and is non-blocking. Concrete fix: extract a shared FE-dummy builder or delegate TWFE’s HC2/HC2-BM path to the DiD fixed-effects branch. diff_diff/twfe.py:L285-L341, diff_diff/estimators.py:L478-L485, TODO.md:L104-L104

Tech Debt

  • Severity: P3. Impact: TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) still rejects replicate-weight survey designs; this limitation is documented in the Registry and tracked in TODO.md, so it is informational rather than blocking. Concrete fix: implement per-replicate full-dummy refits on the TWFE replicate-variance path. diff_diff/twfe.py:L227-L243, docs/methodology/REGISTRY.md:L2562-L2562, TODO.md:L103-L103

Security

  • No findings. Severity: none. Impact: I did not see security-relevant changes or accidental secrets in the changed files. Concrete fix: none.

Documentation/Tests

  • No findings. Severity: none. Impact: the previous doc/test-contract issue is resolved, and the new behavior is now documented in CHANGELOG.md / REGISTRY.md and pinned by behavioral coverage plus R-parity tests. Concrete fix: none. CHANGELOG.md:L11-L11, docs/methodology/REGISTRY.md:L2562-L2562, tests/test_estimators_vcov_type.py:L660-L1035, tests/test_methodology_twfe.py:L1205-L1352

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 19, 2026
@igerber igerber merged commit 4c505d6 into main May 19, 2026
33 of 34 checks passed
@igerber igerber deleted the twfe-hc2-auto-route branch May 19, 2026 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant