
Virtual Dataset Workflow Tracking Issue #197

@mpiannucci


To create and use virtual datasets with Python, users will want to use kerchunk and VirtualiZarr. Both are just starting down the path to Zarr 3 and Icechunk compatibility. This issue tracks progress and relevant PRs.

All of this can be installed with pip. However, for now we need to install in three steps to avoid version conflicts:

pip install icechunk xarray VirtualiZarr kerchunk

This also assumes that fsspec, s3fs, h5py, and h5netcdf are installed:

pip install fsspec s3fs h5py h5netcdf
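
Since version conflicts are the concern here, it can help to confirm what actually got installed before running the example. The following is a minimal sketch, not part of the original workflow: the helper name `installed_versions` is hypothetical, and the package list is simply copied from the two pip commands above. It uses only the standard library, so it runs even if some of the packages are missing.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names copied from the two pip commands above.
PACKAGES = [
    "icechunk", "xarray", "virtualizarr", "kerchunk",
    "fsspec", "s3fs", "h5py", "h5netcdf",
]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if missing.

    Hypothetical helper for illustration only.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Including this output when reporting a problem on this issue makes it easier to tell which combination of versions was in use.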

With all of this installed, HDF5 virtual datasets currently work like this:

from datetime import datetime, timezone
import icechunk
import xarray as xr
import virtualizarr

url = 's3://met-office-atmospheric-model-data/global-deterministic-10km/20250204T0000Z/20250204T0000Z-PT0000H00M-pressure_at_mean_sea_level.nc'
so = dict(anon=True, default_fill_cache=False, default_cache_type="none")

# create virtualizarr dataset
vds = virtualizarr.open_virtual_dataset(url, reader_options={'storage_options': so}, indexes={})

# create an icechunk repo that can read virtual chunks from the eu-west-2 region with anonymous access
storage = icechunk.local_filesystem_storage("./ukmet")
config = icechunk.RepositoryConfig.default()

config.set_virtual_chunk_container(icechunk.VirtualChunkContainer("s3", "s3://", icechunk.s3_store(region="eu-west-2")))
credentials = icechunk.containers_credentials(s3=icechunk.s3_credentials(anonymous=True))

repo = icechunk.Repository.create(storage, config, credentials)

# create a session, and write to a group inside it using virtualizarr
session = repo.writable_session("main")
vds.virtualize.to_icechunk(session.store, group="msl", last_updated_at=datetime.now(timezone.utc))

# commit to save progress
session.commit("Add msl pressure")

# open it back up
ds = xr.open_zarr(session.store, group="msl", zarr_format=3, consolidated=False, decode_times=False)
ds

# plot!
ds.air_pressure_at_sea_level.plot()


Updated 2/4/2025
