Update schema projection to support `initial-defaults` #1644

Fokko · 2025-02-11T13:28:10Z

Add the projection piece of the initial defaults.

gabeiglio · 2025-02-12T21:27:35Z

Since the initial-default projection happens after filtering in _task_to_record_batches, I'm wondering whether this will yield correct results when a pyarrow_filter is given for this field.

Fokko · 2025-02-14T19:37:57Z

> Since the initial-default projection happens after filtering in _task_to_record_batches, I'm wondering whether this will yield correct results when a pyarrow_filter is given for this field.

Thanks for pointing this out. You're right, it doesn't handle the filtering correctly. Let me work on a fix. Thanks!

gabeiglio · 2025-02-14T20:22:41Z

No problem! I was trying to write a test case for this by evolving the schema of a table and adding a new field with an initial-default value, but I think that will have to wait for the V3 table spec.
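To make the ordering concern concrete, here is a minimal pure-Python sketch. It is not PyIceberg code: the column name `cadence`, its default, and the helpers `project`, `filter_then_project`, and `project_then_filter` are all hypothetical, and plain dicts stand in for Arrow record batches. It assumes an older data file that predates the column, so a filter applied to the raw file rows never sees the initial-default.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# Rows as stored in an older data file: the "cadence" column did not exist yet.
FILE_ROWS: List[Row] = [{"id": 1}, {"id": 2}]

# Assumed initial-default for the column that was added later (hypothetical).
INITIAL_DEFAULTS: Dict[str, Any] = {"cadence": "UNDEFINED"}


def project(rows: List[Row], defaults: Dict[str, Any]) -> List[Row]:
    """Fill columns that are absent from the file with their initial-default."""
    return [{**defaults, **row} for row in rows]


def filter_then_project(
    rows: List[Row], predicate: Callable[[Row], bool], defaults: Dict[str, Any]
) -> List[Row]:
    # The predicate sees raw file rows, where the new column is missing.
    return project([row for row in rows if predicate(row)], defaults)


def project_then_filter(
    rows: List[Row], predicate: Callable[[Row], bool], defaults: Dict[str, Any]
) -> List[Row]:
    # The predicate sees the projected rows, including the initial-default.
    return [row for row in project(rows, defaults) if predicate(row)]


def is_undefined(row: Row) -> bool:
    return row.get("cadence") == "UNDEFINED"


wrong = filter_then_project(FILE_ROWS, is_undefined, INITIAL_DEFAULTS)
right = project_then_filter(FILE_ROWS, is_undefined, INITIAL_DEFAULTS)
```

In this sketch `wrong` is empty because the filter ran against rows without the column, while `right` keeps both rows. Reordering is only one way out; rewriting the filter so that it accounts for the default is another.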


kevinjqliu · 2025-04-19T18:40:47Z

@Fokko could you rebase this when you get a chance

Fokko · 2025-04-23T06:08:33Z

@kevinjqliu Sure, but I think this relies on #1770 to do some proper testing 👍


kevinjqliu

Thanks for adding this feature! The PR generally LGTM. I added a few comments on reading initial-defaults for optional/required fields.

I think it would be great to add tests to cover these scenarios:

Optional field might have initial-default set. If set, use initial-default. If not set, use null
Required field will always have initial-default set.

There are tests covering an optional field with initial-default set and a required field with initial-default set. We just need to add a test for an optional field without initial-default set, and perhaps a test that throws when a required field does not have initial-default set.

The spec also mentions V3 data types

All columns of unknown, variant, geometry, and geography types must default to null. Non-null values for initial-default or write-default are invalid.

and nested struct types

When a field that is a struct type is added, its default may only be null or a non-null struct with no field values. Default values for fields must be stored in field metadata.

Should we address them as part of this PR?
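As a rough illustration of the two spec rules quoted above, a validation check could look like the sketch below. This is an assumption-laden sketch, not PyIceberg's API: `check_default` and the string type names are hypothetical, and an empty dict stands in for "a non-null struct with no field values".

```python
from typing import Any

# Types that, per the quoted spec text, must always default to null.
NULL_ONLY_TYPES = {"unknown", "variant", "geometry", "geography"}


def check_default(type_name: str, default: Any) -> None:
    """Raise ValueError if `default` is not allowed for `type_name` (hypothetical helper)."""
    if type_name in NULL_ONLY_TYPES and default is not None:
        raise ValueError(f"{type_name} columns must default to null, got {default!r}")
    # Assumption: an empty dict models a non-null struct with no field values.
    if type_name == "struct" and default not in (None, {}):
        raise ValueError("struct defaults may only be null or a struct with no field values")
```

The same check would apply to `write-default`, since the quoted rule covers both defaults.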

kevinjqliu · 2025-06-28T16:08:45Z

tests/integration/test_reads.py

+
+
+@pytest.mark.integration
+# TODO: For Hive we require writing V3


We can't write V3 tables using the Hive catalog yet?

No, we block on writing metadata since that's not fully supported yet (mostly row-lineage). For REST, the table metadata is written by the catalog.


kevinjqliu · 2025-06-28T16:30:05Z

pyiceberg/expressions/visitors.py


        if file_column_name is None:
            # In the case of schema evolution, the column might not be present
-            # in the file schema when reading older data
-            if isinstance(predicate, BoundIsNull):


this checks against BoundIsNull, but the same BoundIsNull check will raise ValueError in the new implementation

BoundIsNull is part of the BoundUnaryPredicate:

This generalizes pretty well, because for tables with format version ≤2 the default value is always null, which covers the previous behavior. Those cases are already covered by tests: 5af05bd

kevinjqliu · 2025-06-28T16:31:25Z

pyiceberg/expressions/visitors.py

+                AlwaysTrue()
+                if expression_evaluator(Schema(field), pred, case_sensitive=self.case_sensitive)(Record(field.initial_default))
+                else AlwaysFalse()
+            )

        if isinstance(predicate, BoundUnaryPredicate):


Are lines L919 to L926 duplicated?

No, this is the normal branch. The branch above applies when the field is not in the Parquet file.
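The idea behind the missing-column branch can be sketched in plain Python: evaluate the predicate once against the field's initial-default and fold the result into a constant expression. The names below (`fold_missing_column`, `ALWAYS_TRUE`, `ALWAYS_FALSE`) are hypothetical stand-ins, not the PyIceberg classes, and a plain callable stands in for a bound predicate.

```python
from typing import Any, Callable, Optional

# String stand-ins for the AlwaysTrue()/AlwaysFalse() expressions in the diff.
ALWAYS_TRUE = "AlwaysTrue"
ALWAYS_FALSE = "AlwaysFalse"


def fold_missing_column(
    predicate: Callable[[Optional[Any]], bool], initial_default: Optional[Any]
) -> str:
    """Fold a predicate on a column absent from the file into a constant.

    Every row in the file takes the same value for the missing column (its
    initial-default, or null when none is set), so one evaluation decides
    the outcome for the whole file.
    """
    return ALWAYS_TRUE if predicate(initial_default) else ALWAYS_FALSE


def is_null(value: Optional[Any]) -> bool:
    return value is None


def equals_undefined(value: Optional[Any]) -> bool:
    return value == "UNDEFINED"
```

With a null default, `is_null` folds to always-true, which matches the earlier special case for an is-null predicate on a missing column.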


kevinjqliu · 2025-06-29T15:49:06Z

pyiceberg/io/pyarrow.py

+            elif field.optional or field.initial_default is not None:
                arrow_type = schema_to_pyarrow(field.field_type, include_field_ids=self._include_field_ids)
-                field_arrays.append(pa.nulls(len(struct_array), type=arrow_type))
+                if field.initial_default is None:
+                    field_arrays.append(pa.nulls(len(struct_array), type=arrow_type))
+                else:
+                    field_arrays.append(pa.repeat(field.initial_default, len(struct_array)))
                fields.append(self._construct_field(field, arrow_type))
            else:
                raise ResolveError(f"Field is required, and could not be found in the file: {field}")


wdyt about restructuring this logic a bit to be more readable?

Optional field might have initial-default set. If set, use initial-default. If not set, use null

When an optional field is added, the defaults may be null and should be explicitly set. If initial-default is not set for an optional field, then the default value is null for compatibility with older spec versions.

Required field will always have initial-default set. Should use initial-default if field_array is None

When a required field is added, both defaults must be set to a non-null value

I've added a comment, though a different one than you suggested.

> When an optional field is added, the defaults may be null and should be explicitly set

I don't think this is true. For V2 you can just add an optional field:

Hive:

ALTER TABLE aa_monthly ADD cadence STRING

Here the null default is set implicitly.

In V3, with a required field with default value:

ALTER TABLE aa_monthly ADD cadence string NOT NULL DEFAULT 'UNDEFINED'

However, it is also fine to make it optional:

ALTER TABLE aa_monthly ADD cadence string DEFAULT 'UNDEFINED'

But that's a questionable data modelling approach :)
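For reference, the three cases discussed in this thread can be written out explicitly. This is a minimal sketch with hypothetical names (`Field`, `fill_missing_column`); Python lists stand in for the Arrow arrays that the real code builds with `pa.repeat` and `pa.nulls`.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Field:
    """Hypothetical stand-in for a schema field."""

    name: str
    required: bool
    initial_default: Optional[Any] = None


def fill_missing_column(field: Field, num_rows: int) -> List[Any]:
    """Produce values for a column that is absent from the data file."""
    if field.initial_default is not None:
        # Required or optional with an initial-default: repeat the default.
        return [field.initial_default] * num_rows
    if not field.required:
        # Optional without an initial-default: null, as in V1/V2 tables.
        return [None] * num_rows
    # Required without an initial-default cannot be projected.
    raise ValueError(f"Field is required, and could not be found in the file: {field}")
```

Checking the default first keeps the branches flat; it is equivalent to the `field.optional or field.initial_default is not None` condition in the diff.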


Support reading initial-defaults

6f841db

gabeiglio mentioned this pull request Feb 15, 2025

[feature] Add all column projection logic #1636

Open

4 tasks

Fokko added 2 commits March 6, 2025 08:52

Merge branch 'main' of github.com:apache/iceberg-python into fd-support-initial-value

06334e5

Merge branch 'main' of github.com:apache/iceberg-python into fd-support-initial-value

e99707d

Fokko mentioned this pull request Apr 22, 2025

Update-schema: Add support for initial-default #1770

Merged

Fokko added 3 commits April 23, 2025 08:09

Merge branch 'main' of github.com:apache/iceberg-python into fd-support-initial-value

39437f2

Merge branch 'main' of github.com:apache/iceberg-python into fd-support-initial-value

4860bb8

Add an aditional test

1653c7c

Fokko force-pushed the fd-support-initial-value branch from 532b8b6 to 1653c7c Compare June 24, 2025 13:56

Fokko added this to the PyIceberg 0.10.0 milestone Jun 24, 2025

Fokko changed the title ~~Support reading initial-defaults~~ Update schema projection to support initial-defaults Jun 24, 2025


Include hive tests as well

7cee104

This was referenced Jun 24, 2025

Implement default-value projection #1836

Open

V3 Tracking issue #1818

Open

Fokko added 2 commits June 25, 2025 10:57

Disable Hive for now

80b5e58

Merge branch 'fd-support-initial-value' of github.com:Fokko/iceberg-python into fd-support-initial-value

265cf1f

Fokko requested a review from kevinjqliu June 25, 2025 12:56

This was referenced Jun 27, 2025

Detect the case to identify missing column from the file using file's max field id in StrictMetricsEvaluator apache/iceberg#13397

Open

Fix projected fields predicate evaluation #2029

Open

kevinjqliu reviewed Jun 29, 2025


Fokko and others added 2 commits June 30, 2025 10:35

Explicit is better than implicit

a7c8370

Co-authored-by: Kevin Liu <[email protected]>

Cleanup

ae34a2d


