Skip to content

Update schema projection to support initial-defaults #1644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Feb 11, 2025

Add the projection piece of the initial defaults.

Closes #1836

@gabeiglio
Copy link
Contributor

gabeiglio commented Feb 12, 2025

Since initial-default projection happens after filtering in _task_to_record_batches Im wondering if this will yield the correct results given a pyarrow_filter for this field.

@Fokko
Copy link
Contributor Author

Fokko commented Feb 14, 2025

Since initial-default projection happens after filtering in _task_to_record_batches Im wondering if this will yield the correct results given a pyarrow_filter for this field.

Thanks for pointing this out, and it doesn't handle the filtering correctly. Let me work on a fix. Thanks!

@gabeiglio
Copy link
Contributor

gabeiglio commented Feb 14, 2025

No problem! I was trying to get a test case for this by evolving the schema of a table and adding a new field with some initial-default value, but i think that have to wait for V3 table spec

@kevinjqliu
Copy link
Contributor

@Fokko could you rebase this when you get a chance

@Fokko
Copy link
Contributor Author

Fokko commented Apr 23, 2025

@kevinjqliu Sure, but I think this relies on #1770 to do some proper testing 👍

@Fokko Fokko force-pushed the fd-support-initial-value branch from 532b8b6 to 1653c7c Compare June 24, 2025 13:56
@Fokko Fokko added this to the PyIceberg 0.10.0 milestone Jun 24, 2025
@Fokko Fokko changed the title Support reading initial-defaults Update schema projection to support initial-defaults Jun 24, 2025
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this feature! The PR generally LGTM. I added a few comments on reading initial-defaults for optional/required fields.

I think it would be great to add tests to cover these scenarios:

  • Optional field might have initial-default set. If set, use initial-default. If not set, use null
  • Required field will always have initial-default set.

There are tests covering optional field with initial-default set and required field with initial-defaultset. We just need to addoptional field without initial-default set. And perhaps a test to throw when required field does not have initial-default set`

The spec also mentions V3 data types

All columns of unknown, variant, geometry, and geography types must default to null. Non-null values for initial-default or write-default are invalid.

and nested struct types

When a field that is a struct type is added, its default may only be null or a non-null struct with no field values. Default values for fields must be stored in field metadata.

Should we address them as part of this PR?



@pytest.mark.integration
# TODO: For Hive we require writing V3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we cant write v3 tables using the hive catalog yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we block on writing metadata since that's not fully supported yet (mostly row-lineage). For REST, the table metadata is written by the catalog.


if file_column_name is None:
# In the case of schema evolution, the column might not be present
# in the file schema when reading older data
if isinstance(predicate, BoundIsNull):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this checks against BoundIsNull, but the same BoundIsNull check will raise ValueError in the new implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BoundIsNull is part of the BoundUnaryPredicate:

image

This generalized pretty well, because for ≤2 tables the default value is always null, which covers the previous behavior. Those are already covered by tests: 5af05bd

AlwaysTrue()
if expression_evaluator(Schema(field), pred, case_sensitive=self.case_sensitive)(Record(field.initial_default))
else AlwaysFalse()
)

if isinstance(predicate, BoundUnaryPredicate):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this line L919 to L926 duplicate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is the normal branch. The branch above applies when the field is not in the Parquet file.

Comment on lines 1817 to 1825
elif field.optional or field.initial_default is not None:
arrow_type = schema_to_pyarrow(field.field_type, include_field_ids=self._include_field_ids)
field_arrays.append(pa.nulls(len(struct_array), type=arrow_type))
if field.initial_default is None:
field_arrays.append(pa.nulls(len(struct_array), type=arrow_type))
else:
field_arrays.append(pa.repeat(field.initial_default, len(struct_array)))
fields.append(self._construct_field(field, arrow_type))
else:
raise ResolveError(f"Field is required, and could not be found in the file: {field}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wdyt about restructuring this logic a bit to be more readable?

Optional field might have initial-default set. If set, use initial-default. If not set, use null

When an optional field is added, the defaults may be null and should be explicitly set
If initial-default is not set for an optional field, then the default value is null for compatibility with older spec versions.

Required field will always have initial-default set. Should use initial-default if field_array is None

When a required field is added, both defaults must be set to a non-null value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment, but different than you suggested.

When an optional field is added, the defaults may be null and should be explicitly set

I don't think this is true, for V2 you can just add an optional field:

Hive:

ALTER TABLE aa_monthly ADD cadence STRING

Here you set the null implicitly.

In V3, with a required field with default value:

ALTER TABLE aa_monthly ADD cadence string NOT NULL DEFAULT 'UNDEFINED'

However, it is also fine to make it optional:

ALTER TABLE aa_monthly ADD cadence string DEFAULT 'UNDEFINED'

But that's a questionable data modelling approach :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement default-value projection
3 participants