-
Notifications
You must be signed in to change notification settings - Fork 318
Update schema projection to support initial-defaults
#1644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Since initial-default projection happens after filtering in _task_to_record_batches Im wondering if this will yield the correct results given a pyarrow_filter for this field. |
Thanks for pointing this out, and it doesn't handle the filtering correctly. Let me work on a fix. Thanks! |
No problem! I was trying to get a test case for this by evolving the schema of a table and adding a new field with some initial-default value, but i think that have to wait for V3 table spec |
@Fokko could you rebase this when you get a chance |
@kevinjqliu Sure, but I think this relies on #1770 to do some proper testing 👍 |
532b8b6
to
1653c7c
Compare
initial-defaults
…ython into fd-support-initial-value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this feature! The PR generally LGTM. I added a few comments on reading initial-defaults for optional/required fields.
I think it would be great to add tests to cover these scenarios:
- Optional field might have initial-default set. If set, use initial-default. If not set, use null
- Required field will always have initial-default set.
There are tests covering optional field with initial-default set
and required field with initial-defaultset. We just need to add
optional field without initial-default set. And perhaps a test to throw when
required field does not have initial-default set`
The spec also mentions V3 data types
All columns of unknown, variant, geometry, and geography types must default to null. Non-null values for initial-default or write-default are invalid.
and nested struct types
When a field that is a struct type is added, its default may only be null or a non-null struct with no field values. Default values for fields must be stored in field metadata.
Should we address them as part of this PR?
|
||
|
||
@pytest.mark.integration | ||
# TODO: For Hive we require writing V3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we cant write v3 tables using the hive catalog yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we block on writing metadata since that's not fully supported yet (mostly row-lineage). For REST, the table metadata is written by the catalog.
|
||
if file_column_name is None: | ||
# In the case of schema evolution, the column might not be present | ||
# in the file schema when reading older data | ||
if isinstance(predicate, BoundIsNull): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this checks against BoundIsNull
, but the same BoundIsNull
check will raise ValueError in the new implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BoundIsNull
is part of the BoundUnaryPredicate
:

This generalized pretty well, because for ≤2 tables the default value is always null
, which covers the previous behavior. Those are already covered by tests: 5af05bd
AlwaysTrue() | ||
if expression_evaluator(Schema(field), pred, case_sensitive=self.case_sensitive)(Record(field.initial_default)) | ||
else AlwaysFalse() | ||
) | ||
|
||
if isinstance(predicate, BoundUnaryPredicate): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this line L919 to L926 duplicate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is the normal branch. The branch above applies when the field is not in the Parquet file.
pyiceberg/io/pyarrow.py
Outdated
elif field.optional or field.initial_default is not None: | ||
arrow_type = schema_to_pyarrow(field.field_type, include_field_ids=self._include_field_ids) | ||
field_arrays.append(pa.nulls(len(struct_array), type=arrow_type)) | ||
if field.initial_default is None: | ||
field_arrays.append(pa.nulls(len(struct_array), type=arrow_type)) | ||
else: | ||
field_arrays.append(pa.repeat(field.initial_default, len(struct_array))) | ||
fields.append(self._construct_field(field, arrow_type)) | ||
else: | ||
raise ResolveError(f"Field is required, and could not be found in the file: {field}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wdyt about restructuring this logic a bit to be more readable?
Optional field might have initial-default set. If set, use initial-default. If not set, use null
When an optional field is added, the defaults may be null and should be explicitly set
If initial-default is not set for an optional field, then the default value is null for compatibility with older spec versions.
Required field will always have initial-default set. Should use initial-default if field_array
is None
When a required field is added, both defaults must be set to a non-null value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a comment, but different than you suggested.
When an optional field is added, the defaults may be null and should be explicitly set
I don't think this is true, for V2 you can just add an optional field:
Hive:
ALTER TABLE aa_monthly ADD cadence STRING
Here you set the null implicitly.
In V3, with a required field with default value:
ALTER TABLE aa_monthly ADD cadence string NOT NULL DEFAULT 'UNDEFINED'
However, it is also fine to make it optional:
ALTER TABLE aa_monthly ADD cadence string DEFAULT 'UNDEFINED'
But that's a questionable data modelling approach :)
Co-authored-by: Kevin Liu <[email protected]>
Add the projection piece of the initial defaults.
Closes #1836