Skip to content

fix(upsert): early rejection of unsupported join column types#3384

Open
abnobdoss wants to merge 4 commits into
apache:mainfrom
abnobdoss:fix/upsert-pr1-type-safety
Open

fix(upsert): early rejection of unsupported join column types#3384
abnobdoss wants to merge 4 commits into
apache:mainfrom
abnobdoss:fix/upsert-pr1-type-safety

Conversation

@abnobdoss
Copy link
Copy Markdown

Rationale for this change

This is the first in a planned series of PRs to improve the stability and speed of the Table.upsert operation. This PR focuses on improving the correctness foundations by implementing "Fail Fast" validation for join key types.

By rejecting unsupported types upfront, we prevent two major classes of issues:

  1. Silent Data Loss: Floating-point join keys are rejected because PyArrow joins treat -0.0 and 0.0 as distinct values while Iceberg filters treat them as equal, leading to missed updates.
  2. Engine Crashes: Nested types (structs, lists, maps), dictionary-encoded columns, and extension types (e.g., UUID) are rejected early to avoid cryptic C++ crashes in the underlying PyArrow join kernels.

This establishes a safe contract for the subsequent performance-focused PRs (Vectorization and Anti-Join de-duplication).

Are these changes tested?

Yes. I have added a comprehensive suite of parameterized tests in tests/table/test_upsert.py.

  • Validation Matrix: Verified that ValueError or NotImplementedError are correctly raised for Floating Point, Nested, Dictionary, Null, and Extension types.
  • Correctness: Confirmed that standard primitive types (String, Int, Long, Decimal, etc.) continue to function as expected.
  • Schema Authority: Added tests ensuring that validation happens against both the Table Schema (architectural integrity) and the Dataframe Schema (memory format implementation).

Are there any user-facing changes?

Yes. The Table.upsert method now includes strict type validation for the join columns a user provides.

  • Users attempting to upsert on floating-point or nested columns will now receive a descriptive error message explaining the risk and suggesting a cast to Decimal or Integer.
  • This is a protective change that prevents users from accidentally writing corrupt data or encountering low-level engine crashes.

For full disclosure - this PR was developed with the assistance of an AI coding assistant (Antigravity) to help refine the type-safety checks and edge-case validation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant