Implement ParquetFormatModel and update write_file to use the format API by nssalian · Pull Request #3381 · apache/iceberg-python

nssalian · 2026-05-19T03:36:06Z

Continued work on #3100

PR Description

Follow-up to #3119. Implements ParquetFormatWriter and ParquetFormatModel, registers Parquet in the FileFormatFactory, and rewrites write_file to dispatch through the factory using the write.format.default table property. Future formats can be added in a similar way.
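For readers unfamiliar with the pattern, the registration and dispatch described above might look roughly like the following. This is a minimal sketch, not the PR's code: `FormatModel`, `register`, `model_for_properties`, and the module-level `_REGISTRY` are hypothetical names, and the `FileFormat` enum here is a two-member stand-in for the real one.

```python
from enum import Enum
from typing import Callable, Dict


class FileFormat(Enum):
    # Stand-in for pyiceberg's FileFormat enum; two members for brevity.
    PARQUET = "parquet"
    ORC = "orc"


class FormatModel:
    """Hypothetical placeholder for a per-format model object."""

    def __init__(self, file_format: FileFormat) -> None:
        self.file_format = file_format


# Hypothetical registry: maps a format to a zero-argument model constructor.
_REGISTRY: Dict[FileFormat, Callable[[], FormatModel]] = {}


def register(file_format: FileFormat, factory: Callable[[], FormatModel]) -> None:
    """Make a format available to the write path."""
    _REGISTRY[file_format] = factory


def model_for_properties(properties: Dict[str, str]) -> FormatModel:
    """Pick the model named by write.format.default (assumed default: parquet)."""
    name = properties.get("write.format.default", "parquet")
    try:
        factory = _REGISTRY[FileFormat(name.lower())]
    except (ValueError, KeyError):
        # Unknown name or known-but-unregistered format: same error for both.
        raise ValueError(f"No format model registered for: {name}") from None
    return factory()


# Only Parquet is registered, mirroring the scope of this PR.
register(FileFormat.PARQUET, lambda: FormatModel(FileFormat.PARQUET))
```

With only Parquet registered, a table whose `write.format.default` is `orc` fails at lookup time with a `ValueError` rather than silently writing Parquet.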

Rationale for this change

The write.format.default table property was never read: the write path was hardcoded to Parquet. This PR makes the property functional. It also threads file_format through _to_requested_schema / ArrowProjectionVisitor / _construct_field so that field ID metadata keys are correct per format (PARQUET:field_id for Parquet, iceberg.id plus iceberg.required for ORC), which prepares the write path for ORC support without changing default behavior.
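To make the per-format keys concrete, here is a small sketch of the mapping the rationale describes. The key names come from the description above; the helper name `field_id_metadata`, the plain-string dictionary, and the lowercase-string encoding of the required flag are assumptions for illustration only.

```python
from enum import Enum
from typing import Dict


class FileFormat(Enum):
    # Stand-in for pyiceberg's FileFormat enum.
    PARQUET = "parquet"
    ORC = "orc"


def field_id_metadata(file_format: FileFormat, field_id: int, required: bool) -> Dict[str, str]:
    """Return field-level metadata carrying the Iceberg field ID (hypothetical helper)."""
    if file_format == FileFormat.ORC:
        return {
            "iceberg.id": str(field_id),
            # Assumption: the flag is serialized as a lowercase string.
            "iceberg.required": str(required).lower(),
        }
    # Parquet stores the ID under a single key.
    return {"PARQUET:field_id": str(field_id)}
```

Calling `field_id_metadata(FileFormat.PARQUET, 7, True)` yields a single `PARQUET:field_id` entry, while the ORC branch yields two entries.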

Are these changes tested?

tests/io/test_format_writers.py adds parametrized tests modeled after Java's BaseFormatModelTests covering round-trip, statistics, null handling, context manager caching, close idempotency, close-without-write, and ORC vs Parquet field ID dispatch.
tests/io/test_pyarrow.py adds test_write_file_parquet_round_trip and test_write_file_dispatches_on_write_format_default exercising the full write_file path.
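As an illustration of the writer contract those tests target (close idempotency, close-without-write, and use as a context manager), here is a toy writer. It is a sketch under assumptions: `SketchWriter` and `Stats` are hypothetical names, rows are plain dicts rather than Arrow data, and reading "caching" as "close() returns the same cached result every time" is an interpretation, not something the PR text spells out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stats:
    # Hypothetical stand-in for the statistics object returned by close().
    record_count: int


class SketchWriter:
    """Toy writer: buffers rows in memory and reports a record count on close."""

    def __init__(self) -> None:
        self._rows: List[dict] = []
        self._result: Optional[Stats] = None

    def write(self, rows: List[dict]) -> None:
        if self._result is not None:
            raise RuntimeError("writer is already closed")
        self._rows.extend(rows)

    def close(self) -> Stats:
        # Compute the result once; later calls return the cached object.
        if self._result is None:
            self._result = Stats(record_count=len(self._rows))
        return self._result

    def __enter__(self) -> "SketchWriter":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


with SketchWriter() as writer:
    writer.write([{"id": 1}, {"id": 2}])

first = writer.close()   # already closed by __exit__; returns the cached result
second = writer.close()  # idempotent: same object again
empty = SketchWriter().close()  # closing without writing is allowed
```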

Are there any user-facing changes?

No, default behavior is unchanged. The one visible difference is that setting write.format.default to an unregistered format now raises a ValueError.

nssalian · 2026-05-19T03:56:56Z

@kevinjqliu @Fokko @geruh PTAL when you can

rambleraptor · 2026-05-19T17:19:44Z

-            # For projection visitor, we don't know the file format, so default to Parquet
-            # This is used for schema conversion during reads, not writes
-            metadata[PYARROW_PARQUET_FIELD_ID_KEY] = str(field.field_id)
+            if self._file_format == FileFormat.ORC:


Ideally, we'd have a FileFormat API method called add_metadata_for_field (not opinionated on name).

Part of the hope for the FileFormat API was to avoid this kind of switch statement based on the format.
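One way to read this suggestion is that each format model owns its metadata keys, so the visitor delegates instead of branching. The sketch below is only an illustration of that shape: the class names, the method name `metadata_for_field`, and the string encoding of the required flag are all hypothetical and not part of the existing FileFormat API.

```python
from abc import ABC, abstractmethod
from typing import Dict


class FormatModelSketch(ABC):
    """Hypothetical slice of a format model: only the metadata hook is shown."""

    @abstractmethod
    def metadata_for_field(self, field_id: int, required: bool) -> Dict[str, str]:
        """Return the field-level metadata this format expects."""


class ParquetModelSketch(FormatModelSketch):
    def metadata_for_field(self, field_id: int, required: bool) -> Dict[str, str]:
        return {"PARQUET:field_id": str(field_id)}


class OrcModelSketch(FormatModelSketch):
    def metadata_for_field(self, field_id: int, required: bool) -> Dict[str, str]:
        # Assumption: the required flag is serialized as a lowercase string.
        return {"iceberg.id": str(field_id), "iceberg.required": str(required).lower()}


def construct_field_metadata(model: FormatModelSketch, field_id: int, required: bool) -> Dict[str, str]:
    # The caller never inspects the format; it asks the model.
    return model.metadata_for_field(field_id, required)
```

Adding a new format then means adding a subclass, with no edits to the visitor.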

rambleraptor · 2026-05-19T17:28:02Z

+
+
+@pytest.fixture
+def simple_table() -> pa.Table:


We've got a few tables in tests/conftest.py. Any reason not to use those?

rambleraptor · 2026-05-19T17:28:23Z

+
+def test_statistics_record_count(format_model: FileFormatModel, table_schema_simple: Schema, tmp_path: Path) -> None:
+    """close() returns DataFileStatistics with correct record count."""
+    table = pa.table(


Why recreate a different table here?

rambleraptor · 2026-05-19T17:32:57Z

        include_field_ids: bool = False,
        projected_missing_fields: dict[int, Any] = EMPTY_DICT,
        allow_timestamp_tz_mismatch: bool = False,
+        file_format: FileFormat = FileFormat.PARQUET,


I'm not wild about making PARQUET the default value (I don't think we should have default values...), but that's a light opinion.

rambleraptor · 2026-05-19T17:33:28Z

    include_field_ids: bool = False,
    projected_missing_fields: dict[int, Any] = EMPTY_DICT,
    allow_timestamp_tz_mismatch: bool = False,
+    file_format: FileFormat = FileFormat.PARQUET,


Same thing, not wild about the default value.
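For reference, the alternative raised in the two comments above is a required, keyword-only parameter, which forces every call site to name its format. A minimal sketch, with a hypothetical function name and a stand-in enum:

```python
from enum import Enum
from typing import Dict, List


class FileFormat(Enum):
    # Stand-in for pyiceberg's FileFormat enum.
    PARQUET = "parquet"
    ORC = "orc"


def id_key_per_column(column_names: List[str], *, file_format: FileFormat) -> Dict[str, str]:
    """Hypothetical helper: file_format has no default and must be passed by keyword."""
    key = "iceberg.id" if file_format == FileFormat.ORC else "PARQUET:field_id"
    return {name: key for name in column_names}


# Call sites must be explicit about the format:
explicit = id_key_per_column(["id", "name"], file_format=FileFormat.PARQUET)
```

Omitting `file_format` raises a `TypeError` at the call site, so a forgotten argument cannot silently fall back to Parquet. The trade-off is that every existing caller has to be updated.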

Implement ParquetFormatModel and wire write_file to use the format API

e64df3c

nssalian changed the title ~~Implement ParquetFormatModel and wire write_file to use the format API~~ Implement ParquetFormatModel and update write_file to use the format API May 19, 2026

nssalian marked this pull request as ready for review May 19, 2026 03:51

rambleraptor reviewed May 19, 2026

nssalian wants to merge 1 commit into apache:main from nssalian:file-format-parquet-impl

2 participants


