Implement ParquetFormatModel and update write_file to use the format API #3381

Open
nssalian wants to merge 1 commit into apache:main from nssalian:file-format-parquet-impl

@nssalian
Contributor

Continued work on #3100

PR Description

Follow-up to #3119. Implements ParquetFormatWriter and ParquetFormatModel, registers Parquet in the FileFormatFactory, and rewrites write_file to dispatch through the factory using the write.format.default table property. Future formats can be added in a similar way.
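For readers skimming the thread, the dispatch described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `FormatModelSketch`, `register`, `resolve_write_format`, and the `"parquet"` fallback are hypothetical stand-ins, and only the `write.format.default` property name comes from the description.

```python
from dataclasses import dataclass

# Hypothetical stand-ins; the PR's real classes live in pyiceberg and are not reproduced here.
WRITE_FORMAT_PROPERTY = "write.format.default"
DEFAULT_WRITE_FORMAT = "parquet"  # assumption: used when the property is unset


@dataclass(frozen=True)
class FormatModelSketch:
    """Placeholder for a per-format model object."""

    name: str


_REGISTRY: dict[str, FormatModelSketch] = {}


def register(model: FormatModelSketch) -> None:
    """Make a format available to the write path."""
    _REGISTRY[model.name] = model


def resolve_write_format(properties: dict[str, str]) -> FormatModelSketch:
    """Look up the model named by the table property, failing loudly if it is unknown."""
    name = properties.get(WRITE_FORMAT_PROPERTY, DEFAULT_WRITE_FORMAT).lower()
    if name not in _REGISTRY:
        raise ValueError(f"No format model registered for {name!r}")
    return _REGISTRY[name]


# Only Parquet is registered, mirroring the scope of this PR.
register(FormatModelSketch("parquet"))
```

With only Parquet registered, `resolve_write_format({})` returns the Parquet model, while a table whose property names another format raises `ValueError` until that format is registered.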

Rationale for this change

The write.format.default table property was never read; the write path was hardcoded to Parquet. This PR makes the property functional. It also threads file_format through _to_requested_schema / ArrowProjectionVisitor / _construct_field so that field ID metadata keys are correct per format (PARQUET:field_id for Parquet, iceberg.id plus iceberg.required for ORC), preparing the write path for ORC support without changing default behavior.
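As a rough illustration of the per-format keys, the mapping could be expressed like this. The sketch uses plain dictionaries and a local `FileFormat` enum rather than the pyiceberg types; `field_id_metadata` is a hypothetical helper, and the `"true"`/`"false"` encoding of the required flag is an assumption.

```python
from enum import Enum


class FileFormat(Enum):
    """Local stand-in for the real enum, for illustration only."""

    PARQUET = "parquet"
    ORC = "orc"


def field_id_metadata(file_format: FileFormat, field_id: int, required: bool) -> dict[str, str]:
    """Return the metadata keys that carry Iceberg field information for one format."""
    if file_format is FileFormat.PARQUET:
        return {"PARQUET:field_id": str(field_id)}
    if file_format is FileFormat.ORC:
        # Assumption: the required flag is serialized as a lowercase boolean string.
        return {"iceberg.id": str(field_id), "iceberg.required": str(required).lower()}
    raise ValueError(f"Unsupported file format: {file_format}")
```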

Are these changes tested?

  • tests/io/test_format_writers.py adds parametrized tests modeled after Java's BaseFormatModelTests covering round-trip, statistics, null handling, context manager caching, close idempotency, close-without-write, and ORC vs Parquet field ID dispatch.
  • tests/io/test_pyarrow.py adds test_write_file_parquet_round_trip and test_write_file_dispatches_on_write_format_default exercising the full write_file path.

Are there any user-facing changes?

No. Default behavior is unchanged. Setting write.format.default to an unregistered format now raises a ValueError.

nssalian changed the title from "Implement ParquetFormatModel and wire write_file to use the format API" to "Implement ParquetFormatModel and update write_file to use the format API" on May 19, 2026
nssalian marked this pull request as ready for review on May 19, 2026, 03:51
@nssalian
Contributor Author

@kevinjqliu @Fokko @geruh PTAL when you can

Comment thread pyiceberg/io/pyarrow.py
# For projection visitor, we don't know the file format, so default to Parquet
# This is used for schema conversion during reads, not writes
metadata[PYARROW_PARQUET_FIELD_ID_KEY] = str(field.field_id)
if self._file_format == FileFormat.ORC:
Contributor

Ideally, we'd have a FileFormat API method called add_metadata_for_field (not opinionated on name).

Part of the hope for the FileFormat API was to avoid these kinds of switch statements based on the format.
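A rough sketch of that shape, assuming a method named add_metadata_for_field on each format model. Everything here is hypothetical (`FormatModelSketch`, `ParquetModelSketch`, `OrcModelSketch`, `construct_field_metadata`); the point is only that the caller stops branching on the format.

```python
from abc import ABC, abstractmethod


class FormatModelSketch(ABC):
    """Hypothetical base class; each format decides which metadata keys it needs."""

    @abstractmethod
    def add_metadata_for_field(self, metadata: dict[str, str], field_id: int, required: bool) -> None:
        """Add this format's field-level keys to ``metadata`` in place."""


class ParquetModelSketch(FormatModelSketch):
    def add_metadata_for_field(self, metadata: dict[str, str], field_id: int, required: bool) -> None:
        metadata["PARQUET:field_id"] = str(field_id)


class OrcModelSketch(FormatModelSketch):
    def add_metadata_for_field(self, metadata: dict[str, str], field_id: int, required: bool) -> None:
        metadata["iceberg.id"] = str(field_id)
        # Assumption: the required flag is stored as a lowercase boolean string.
        metadata["iceberg.required"] = str(required).lower()


def construct_field_metadata(model: FormatModelSketch, field_id: int, required: bool) -> dict[str, str]:
    """Caller side: no format check, the model fills in its own keys."""
    metadata: dict[str, str] = {}
    model.add_metadata_for_field(metadata, field_id, required)
    return metadata
```

Adding another format would then mean adding one class rather than touching every call site that builds field metadata.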



@pytest.fixture
def simple_table() -> pa.Table:
Contributor

We've got a few tables in tests/conftest.py. Any reason not to use those?


def test_statistics_record_count(format_model: FileFormatModel, table_schema_simple: Schema, tmp_path: Path) -> None:
    """close() returns DataFileStatistics with correct record count."""
    table = pa.table(
Contributor

Why recreate a different table here?

Comment thread pyiceberg/io/pyarrow.py
include_field_ids: bool = False,
projected_missing_fields: dict[int, Any] = EMPTY_DICT,
allow_timestamp_tz_mismatch: bool = False,
file_format: FileFormat = FileFormat.PARQUET,
Contributor

I'm not wild about making PARQUET the default value (I don't think we should have default values...), but that's a light opinion.
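For illustration, a signature without a default could look like the sketch below. The function name, return value, and local `FileFormat` enum are hypothetical; the only point is that a required keyword-only parameter makes every caller state the format explicitly.

```python
from enum import Enum


class FileFormat(Enum):
    """Local stand-in for the real enum, for illustration only."""

    PARQUET = "parquet"
    ORC = "orc"


def to_requested_schema_sketch(*, file_format: FileFormat, include_field_ids: bool = False) -> str:
    """Hypothetical signature: file_format has no default, so omitting it is a TypeError."""
    return f"{file_format.value}:{include_field_ids}"
```

The trade-off is that existing call sites would all need updating, which may be why the PR keeps the default for now.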

Comment thread pyiceberg/io/pyarrow.py
include_field_ids: bool = False,
projected_missing_fields: dict[int, Any] = EMPTY_DICT,
allow_timestamp_tz_mismatch: bool = False,
file_format: FileFormat = FileFormat.PARQUET,
Contributor

Same thing, not wild about the default value.
