Zarr v3 Dtype refactor changes needed in Xarray

### What is your issue?

The upcoming zarr-python PR: https://github.com/zarr-developers/zarr-python/pull/2874/ will change how zarr python 3 handles dtypes on zarr stores which will have implications for Xarray. I've run the branch against the Xarray tests in order to see what changes will be necessary here. I've found two categories changes required


## 1. Encoding Fill Values
Currently `xarray` handles the encoding and decoding of fill values via it's own `FillValueCoder`  https://github.com/pydata/xarray/blob/7551a7a4fcad82a27b3701cfa0af85f74404a45c/xarray/backends/zarr.py#L116

with the dtypes refactors the new zarr dtypes will provide their own methods to encode and decode fill values. This is a pretty simple update. the reading can be handled with a commit like this: https://github.com/ianhi/xarray/commit/70a7e8b49f2d68072984fee51527f47b8f5865af


## 2. Fixed Length Strings and Appending

This other types of tests that fail seem to be related to the validation logic of when appending can happen.

With the changes the test for appending mismatched length strings:
https://github.com/pydata/xarray/blob/7551a7a4fcad82a27b3701cfa0af85f74404a45c/xarray/tests/test_backends.py#L2882-L2891


 fails on this validation check:

https://github.com/pydata/xarray/blob/7551a7a4fcad82a27b3701cfa0af85f74404a45c/xarray/backends/zarr.py#L544-L570

```python
ValueError: Mismatched dtypes for variable temperature between Zarr store on disk and dataset to append. Store has dtype |S2 but dataset to append has dtype |S3.

/Users/ian/Documents/dev/xarray/xarray/backends/zarr.py:576: ValueError
```

If I bypass the validation check to allow the code to continue as it would have for prior versions of zarr then the expected dataset no longer matches the actual:


```python
_____________________________________________________________________ TestZarrDirectoryStore.test_append_string_length_mismatch_works[3-S] _____________________________________________________________________

self = <xarray.tests.test_backends.TestZarrDirectoryStore object at 0x111684a10>, dtype = 'S'

    @pytest.mark.parametrize("dtype", ["U", "S"])
    def test_append_string_length_mismatch_works(self, dtype) -> None:
        skip_if_zarr_format_2("This doesn't work with Zarr format 2")
        # ...but it probably would if we used object dtype
        ds, ds_to_append = create_append_string_length_mismatch_test_data(dtype)
        expected = xr.concat([ds, ds_to_append], dim="time")
        with self.create_zarr_target() as store_target:
            ds.to_zarr(store_target, mode="w", **self.version_kwargs)
            ds_to_append.to_zarr(store_target, append_dim="time", **self.version_kwargs)
            actual = xr.open_dataset(store_target, engine="zarr")
>           xr.testing.assert_identical(expected, actual)
E           AssertionError: Left and right Dataset objects are not identical
E           Differing data variables:
E           L   temperature  (time) |S3 18B b'aa' b'bb' b'cc' b'aaa' b'bbb' b'ccc'
E           R   temperature  (time) |S2 12B b'aa' b'bb' b'cc' b'aa' b'bb' b'cc'

/Users/ian/Documents/dev/xarray/xarray/tests/test_backends.py:2897: AssertionError
```

the comment in the validation function by @dcherian notes that fixed length strings are not easy to append, but seemingly the function used to pass them through as ok. I'm not 100% clear on what the intended behavior is here, any advice appreciated

	def _validate_datatypes_for_zarr_append(vname, existing_var, new_var):
	"""If variable exists in the store, confirm dtype of the data to append is compatible with
	existing dtype.
	"""
	if (
	np.issubdtype(new_var.dtype, np.number)
	or np.issubdtype(new_var.dtype, np.datetime64)
	or np.issubdtype(new_var.dtype, np.bool_)
	or new_var.dtype == object
	or (new_var.dtype.kind in ("S", "U") and existing_var.dtype == object)
	):
	# We can skip dtype equality checks under two conditions: (1) if the var to append is
	# new to the dataset, because in this case there is no existing var to compare it to;
	# or (2) if var to append's dtype is known to be easy-to-append, because in this case
	# we can be confident appending won't cause problems. Examples of dtypes which are not
	# easy-to-append include length-specified strings of type `\|S` or `<U` (where * is a
	# positive integer character length). For these dtypes, appending dissimilar lengths
	# can result in truncation of appended data. Therefore, variables which already exist
	# in the dataset, and with dtypes which are not known to be easy-to-append, necessitate
	# exact dtype equality, as checked below.
	pass
	elif not new_var.dtype == existing_var.dtype:
	raise ValueError(
	f"Mismatched dtypes for variable {vname} between Zarr store on disk "
	f"and dataset to append. Store has dtype {existing_var.dtype} but "
	f"dataset to append has dtype {new_var.dtype}."
	)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Zarr v3 Dtype refactor changes needed in Xarray #10329

What is your issue?

1. Encoding Fill Values

2. Fixed Length Strings and Appending

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

	def test_append_string_length_mismatch_works(self, dtype) -> None:
	skip_if_zarr_format_2("This doesn't work with Zarr format 2")
	# ...but it probably would if we used object dtype
	ds, ds_to_append = create_append_string_length_mismatch_test_data(dtype)
	expected = xr.concat([ds, ds_to_append], dim="time")
	with self.create_zarr_target() as store_target:
	ds.to_zarr(store_target, mode="w", **self.version_kwargs)
	ds_to_append.to_zarr(store_target, append_dim="time", **self.version_kwargs)
	actual = xr.open_dataset(store_target, engine="zarr")
	xr.testing.assert_identical(expected, actual)

Uh oh!

Zarr v3 Dtype refactor changes needed in Xarray #10329

Description

What is your issue?

1. Encoding Fill Values

2. Fixed Length Strings and Appending

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions