-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
What is your issue?
The upcoming zarr-python PR: zarr-developers/zarr-python#2874 will change how zarr python 3 handles dtypes on zarr stores which will have implications for Xarray. I've run the branch against the Xarray tests in order to see what changes will be necessary here. I've found two categories changes required
1. Encoding Fill Values
Currently xarray
handles the encoding and decoding of fill values via it's own FillValueCoder
xarray/xarray/backends/zarr.py
Line 116 in 7551a7a
class FillValueCoder: |
with the dtypes refactors the new zarr dtypes will provide their own methods to encode and decode fill values. This is a pretty simple update. the reading can be handled with a commit like this: ianhi@70a7e8b
2. Fixed Length Strings and Appending
This other types of tests that fail seem to be related to the validation logic of when appending can happen.
With the changes the test for appending mismatched length strings:
xarray/xarray/tests/test_backends.py
Lines 2882 to 2891 in 7551a7a
def test_append_string_length_mismatch_works(self, dtype) -> None: | |
skip_if_zarr_format_2("This doesn't work with Zarr format 2") | |
# ...but it probably would if we used object dtype | |
ds, ds_to_append = create_append_string_length_mismatch_test_data(dtype) | |
expected = xr.concat([ds, ds_to_append], dim="time") | |
with self.create_zarr_target() as store_target: | |
ds.to_zarr(store_target, mode="w", **self.version_kwargs) | |
ds_to_append.to_zarr(store_target, append_dim="time", **self.version_kwargs) | |
actual = xr.open_dataset(store_target, engine="zarr") | |
xr.testing.assert_identical(expected, actual) |
fails on this validation check:
xarray/xarray/backends/zarr.py
Lines 544 to 570 in 7551a7a
def _validate_datatypes_for_zarr_append(vname, existing_var, new_var): | |
"""If variable exists in the store, confirm dtype of the data to append is compatible with | |
existing dtype. | |
""" | |
if ( | |
np.issubdtype(new_var.dtype, np.number) | |
or np.issubdtype(new_var.dtype, np.datetime64) | |
or np.issubdtype(new_var.dtype, np.bool_) | |
or new_var.dtype == object | |
or (new_var.dtype.kind in ("S", "U") and existing_var.dtype == object) | |
): | |
# We can skip dtype equality checks under two conditions: (1) if the var to append is | |
# new to the dataset, because in this case there is no existing var to compare it to; | |
# or (2) if var to append's dtype is known to be easy-to-append, because in this case | |
# we can be confident appending won't cause problems. Examples of dtypes which are not | |
# easy-to-append include length-specified strings of type `|S*` or `<U*` (where * is a | |
# positive integer character length). For these dtypes, appending dissimilar lengths | |
# can result in truncation of appended data. Therefore, variables which already exist | |
# in the dataset, and with dtypes which are not known to be easy-to-append, necessitate | |
# exact dtype equality, as checked below. | |
pass | |
elif not new_var.dtype == existing_var.dtype: | |
raise ValueError( | |
f"Mismatched dtypes for variable {vname} between Zarr store on disk " | |
f"and dataset to append. Store has dtype {existing_var.dtype} but " | |
f"dataset to append has dtype {new_var.dtype}." | |
) |
ValueError: Mismatched dtypes for variable temperature between Zarr store on disk and dataset to append. Store has dtype |S2 but dataset to append has dtype |S3.
/Users/ian/Documents/dev/xarray/xarray/backends/zarr.py:576: ValueError
If I bypass the validation check to allow the code to continue as it would have for prior versions of zarr then the expected dataset no longer matches the actual:
_____________________________________________________________________ TestZarrDirectoryStore.test_append_string_length_mismatch_works[3-S] _____________________________________________________________________
self = <xarray.tests.test_backends.TestZarrDirectoryStore object at 0x111684a10>, dtype = 'S'
@pytest.mark.parametrize("dtype", ["U", "S"])
def test_append_string_length_mismatch_works(self, dtype) -> None:
skip_if_zarr_format_2("This doesn't work with Zarr format 2")
# ...but it probably would if we used object dtype
ds, ds_to_append = create_append_string_length_mismatch_test_data(dtype)
expected = xr.concat([ds, ds_to_append], dim="time")
with self.create_zarr_target() as store_target:
ds.to_zarr(store_target, mode="w", **self.version_kwargs)
ds_to_append.to_zarr(store_target, append_dim="time", **self.version_kwargs)
actual = xr.open_dataset(store_target, engine="zarr")
> xr.testing.assert_identical(expected, actual)
E AssertionError: Left and right Dataset objects are not identical
E Differing data variables:
E L temperature (time) |S3 18B b'aa' b'bb' b'cc' b'aaa' b'bbb' b'ccc'
E R temperature (time) |S2 12B b'aa' b'bb' b'cc' b'aa' b'bb' b'cc'
/Users/ian/Documents/dev/xarray/xarray/tests/test_backends.py:2897: AssertionError
the comment in the validation function by @dcherian notes that fixed length strings are not easy to append, but seemingly the function used to pass them through as ok. I'm not 100% clear on what the intended behavior is here, any advice appreciated