
Recreate @gajomi's #2070 to keep attrs when calling astype() #4314


Merged
10 commits merged on Aug 19, 2020

Conversation

dnowacki-usgs
Contributor

@dnowacki-usgs commented Aug 5, 2020

This is a do-over of @gajomi's #2070 with minimal additional tests to preserve attrs on Datasets as well as DataArrays.
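For illustration, a minimal sketch of the behavior this PR restores (the data and attribute names here are made up, not taken from the PR's test suite):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(3), attrs={"units": "m"})
ds = xr.Dataset({"x": da}, attrs={"title": "demo"})

# With this change, attrs survive the cast on both DataArrays and Datasets.
assert da.astype(float).attrs == {"units": "m"}
assert ds.astype(float).attrs == {"title": "demo"}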

Collaborator

@max-sixty left a comment


Looks great! One comment and then good-to-go on the code

def test_astype_attrs(self):
mda1 = self.mda.copy()
mda1.attrs["foo"] = "bar"
mda2 = mda1.astype(bool)
Collaborator

Do we already have tests for .astype working correctly? If not, could we add a couple of simple cases (no need to do too much tbc)

@dnowacki-usgs
Contributor Author

Thanks @max-sixty. I pushed a second commit, including a keep_attrs argument so people can keep the old behavior of losing the attrs. I also added a bit more testing of general astype behavior, checking that dtypes are correct and values are unchanged.

What do you make of the failing tests in test_sparse.py (test_dataarray_method and test_variable_method)? I'm not really familiar enough with the pytest parametrize syntax to figure out what's going on.
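As a quick sketch of the keep_attrs flag described above (illustrative values, not the actual test code):

import numpy as np
import xarray as xr

da = xr.DataArray(np.array([-1, 1, 2, 3, 1000]), attrs={"units": "m"})

converted = da.astype(float)
assert converted.attrs == {"units": "m"}            # attrs kept by default
assert np.issubdtype(converted.dtype, np.floating)  # dtype actually changed

# Opting back into the old behavior drops the attrs:
assert da.astype(float, keep_attrs=False).attrs == {}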

@max-sixty
Collaborator

I'm not sure about the tests at first glance; I don't see how they're related, but master seems to pass.
I pushed a merge with master as a check — hope that's OK. Let's see if that sheds any light.

@dcherian
Contributor

dcherian commented Aug 6, 2020

Looks like COO.astype doesn't accept the casting kwarg: https://sparse.pydata.org/en/stable/generated/sparse.COO.astype.html

@dcherian
Contributor

dcherian commented Aug 6, 2020

I've opened an issue at sparse. Let's see what the outcome is.

@dcherian
Contributor

dcherian commented Aug 6, 2020

Well, that was fixed quickly! Now we need to think about backcompat workarounds.

@max-sixty
Collaborator

What do you think about reraising the error with a message about upgrading? Is this a narrow enough use case that we can avoid a backward-compat fix in xarray?

@dnowacki-usgs
Contributor Author

Nice! Thanks a million @dcherian for raising that issue over at sparse.

@max-sixty regarding the minimum version, it seems like sparse has a relatively infrequent release cycle (many months) so it could be a while before we have a minimum version to require. Could we check whether we are dealing with a sparse array and call the numpy method instead for that use case, until sparse supports it?

@max-sixty
Collaborator

For sure — good idea to defer to numpy rather than raise. Should we do that only for existing versions, though? Otherwise we won't pick up the fix when they do release their next version.
(the version check can be a single line, there are lots of examples in the repo, lmk if that's not clear)
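For reference, a rough sketch of the kind of one-line version gate being suggested here (the 0.11.0 threshold is an assumption; the release containing the fix wasn't known yet at this point in the thread):

from distutils.version import LooseVersion

import sparse

# Sketch only: apply the casting workaround on sparse versions without the fix.
needs_casting_workaround = LooseVersion(sparse.__version__) < LooseVersion("0.11.0")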

@dcherian
Contributor

dcherian commented Aug 6, 2020

I wonder if we should just let it raise (but xfail our test on sparse <= 0.10.0). This is an "imperfect NEP-18 implementation" issue and it isn't feasible to catch all such errors.
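A hedged sketch of what that xfail could look like (the test name, threshold, and reason string are illustrative, not the actual test in the suite):

import pytest
import sparse
from distutils.version import LooseVersion


@pytest.mark.xfail(
    LooseVersion(sparse.__version__) < LooseVersion("0.11.0"),
    reason="COO.astype does not accept the 'casting' kwarg before the upstream fix",
)
def test_astype_sparse():
    ...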

@dnowacki-usgs
Contributor Author

My solution of try/except and popping casting from kwargs felt a bit kludgey, so I'm definitely open to other ideas. Letting it raise and xfailing the test would mean that we can no longer perform astype on sparse arrays until a new sparse version is released. Though I suppose that depends on the timing of the next xarray vs. sparse releases.

@max-sixty
Collaborator

I'm fine with letting it raise — though I think we should reraise with a clearer message; otherwise isn't it confusing?

The try-except is probably fine given the narrowness (very open to the view that this sets too low a bar, though); one step up would be checking that it is a sparse array.

@dnowacki-usgs
Contributor Author

Maybe I am overcomplicating things but I came up with this solution:

Catch the error. If it is the error about the unsupported casting kwarg, and we are dealing with a sparse array, pop the casting kwarg and issue a RuntimeWarning saying we are doing so. Otherwise, raise the original error, since it might be something else not related to sparse.

Does that seem reasonable?

@pep8speaks

pep8speaks commented Aug 7, 2020

Hello @dnowacki-usgs! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-19 19:20:14 UTC

@dnowacki-usgs
Contributor Author

I think(?) the PEP 8 warnings are from the commits @max-sixty pushed several days ago during testing.

@keewis
Collaborator

keewis commented Aug 11, 2020

I think that's a false positive; our flake8 and black CI don't report that. I guess the bot will remove that FileNotFoundError report once you push a new commit, so it should be safe to ignore.

@max-sixty
Collaborator

Great — I think this looks good — any final thoughts?

@mathause
Collaborator

Somehow the contents of #4248 got in here. Can you try to merge master and see if they disappear?

@dnowacki-usgs
Contributor Author

Did a fetch and rebase; not sure why commits from other PRs that were added to master in the meantime show up in this one 🤷

@mathause
Collaborator

Weird - it got even worse. I can try tomorrow morning (European time) unless @max-sixty wants to have a go.

@max-sixty
Collaborator

The weirdest part is that the local diff is fine; here's the output of git --no-pager diff upstream/master. Is it maybe because one of the merge commits was mine rather than @dnowacki-usgs's?

Sorry if I contributed to this.

One option is to apply this patch on master and force-push this branch. I'm happy to do it if @dnowacki-usgs agrees.

diff --git a/doc/whats-new.rst b/doc/whats-new.rst
index 793f72fb..4b7b117b 100644
--- a/doc/whats-new.rst
+++ b/doc/whats-new.rst
@@ -21,7 +21,9 @@ v0.16.1 (unreleased)
 
 Breaking changes
 ~~~~~~~~~~~~~~~~
-
+- :py:meth:`DataArray.astype` and :py:meth:`Dataset.astype` now preserve attributes. Keep the
+  old behavior by passing `keep_attrs=False` (:issue:`2049`, :pull:`4314`).
+  By `Dan Nowacki <https://github.com/dnowacki-usgs>`_ and `Gabriel Joel Mitchell <https://github.com/gajomi>`_.
 
 New Features
 ~~~~~~~~~~~~
diff --git a/xarray/core/common.py b/xarray/core/common.py
index bc5035b6..7b6fc13c 100644
--- a/xarray/core/common.py
+++ b/xarray/core/common.py
@@ -1305,6 +1305,48 @@ class DataWithCoords(SupportsArithmetic, AttrAccessMixin):
             dask="allowed",
         )
 
+    def astype(self, dtype, casting="unsafe", copy=True, keep_attrs=True):
+        """
+        Copy of the xarray object, with data cast to a specified type.
+        Leaves coordinate dtype unchanged.
+
+        Parameters
+        ----------
+        dtype : str or dtype
+            Typecode or data-type to which the array is cast.
+        casting : {'no', 'equiv', 'safe', 'same_kind', 'unsafe'}, optional
+            Controls what kind of data casting may occur. Defaults to 'unsafe'
+            for backwards compatibility.
+
+            * 'no' means the data types should not be cast at all.
+            * 'equiv' means only byte-order changes are allowed.
+            * 'safe' means only casts which can preserve values are allowed.
+            * 'same_kind' means only safe casts or casts within a kind,
+                like float64 to float32, are allowed.
+            * 'unsafe' means any data conversions may be done.
+        copy : bool, optional
+            By default, astype always returns a newly allocated array. If this
+            is set to False and the `dtype` requirement is satisfied, the input
+            array is returned instead of a copy.
+        keep_attrs : bool, optional
+            By default, astype keeps attributes. Set to False to remove
+            attributes in the returned object.
+
+        See also
+        --------
+        np.ndarray.astype
+        dask.array.Array.astype
+        """
+        from .computation import apply_ufunc
+
+        return apply_ufunc(
+            duck_array_ops.astype,
+            self,
+            kwargs=dict(dtype=dtype, casting=casting, copy=copy),
+            keep_attrs=keep_attrs,
+            dask="allowed",
+        )
+
     def __enter__(self: T) -> T:
         return self
 
diff --git a/xarray/core/duck_array_ops.py b/xarray/core/duck_array_ops.py
index 3d192882..7b28d396 100644
--- a/xarray/core/duck_array_ops.py
+++ b/xarray/core/duck_array_ops.py
@@ -149,6 +149,33 @@ masked_invalid = _dask_or_eager_func(
 )
 
 
+def astype(data, **kwargs):
+    try:
+        return data.astype(**kwargs)
+    except TypeError as e:
+        # FIXME: This should no longer be necessary in future versions of sparse
+        # Current versions of sparse (v0.10.0) don't support the "casting" kwarg
+        # This was fixed by https://github.com/pydata/sparse/pull/392
+        try:
+            import sparse
+        except ImportError:
+            sparse = None
+        if (
+            "got an unexpected keyword argument 'casting'" in repr(e)
+            and sparse is not None
+            and isinstance(data, sparse._coo.core.COO)
+        ):
+            warnings.warn(
+                "The current version of sparse does not support the 'casting' argument. It will be ignored in the call to astype().",
+                RuntimeWarning,
+                stacklevel=4,
+            )
+            kwargs.pop("casting")
+        else:
+            raise e
+        return data.astype(**kwargs)
+
+
 def asarray(data, xp=np):
     return (
         data
diff --git a/xarray/core/ops.py b/xarray/core/ops.py
index 36753179..dad37f3b 100644
--- a/xarray/core/ops.py
+++ b/xarray/core/ops.py
@@ -42,7 +42,7 @@ NUM_BINARY_OPS = [
 NUMPY_SAME_METHODS = ["item", "searchsorted"]
 # methods which don't modify the data shape, so the result should still be
 # wrapped in an Variable/DataArray
-NUMPY_UNARY_METHODS = ["astype", "argsort", "clip", "conj", "conjugate"]
+NUMPY_UNARY_METHODS = ["argsort", "clip", "conj", "conjugate"]
 PANDAS_UNARY_FUNCTIONS = ["isnull", "notnull"]
 # methods which remove an axis
 REDUCE_METHODS = ["all", "any"]
diff --git a/xarray/core/variable.py b/xarray/core/variable.py
index 1f86a403..6d53a4fe 100644
--- a/xarray/core/variable.py
+++ b/xarray/core/variable.py
@@ -360,6 +360,47 @@ class Variable(
             )
         self._data = data
 
+    def astype(self, dtype, casting="unsafe", copy=True, keep_attrs=True):
+        """
+        Copy of the Variable object, with data cast to a specified type.
+
+        Parameters
+        ----------
+        dtype : str or dtype
+             Typecode or data-type to which the array is cast.
+        casting : {'no', 'equiv', 'safe', 'same_kind', 'unsafe'}, optional
+             Controls what kind of data casting may occur. Defaults to 'unsafe'
+             for backwards compatibility.
+
+             * 'no' means the data types should not be cast at all.
+             * 'equiv' means only byte-order changes are allowed.
+             * 'safe' means only casts which can preserve values are allowed.
+             * 'same_kind' means only safe casts or casts within a kind,
+                 like float64 to float32, are allowed.
+             * 'unsafe' means any data conversions may be done.
+        copy : bool, optional
+             By default, astype always returns a newly allocated array. If this
+             is set to False and the `dtype` requirement is satisfied, the input
+             array is returned instead of a copy.
+        keep_attrs : bool, optional
+            By default, astype keeps attributes. Set to False to remove
+            attributes in the returned object.
+
+        See also
+        --------
+        np.ndarray.astype
+        dask.array.Array.astype
+        """
+        from .computation import apply_ufunc
+
+        return apply_ufunc(
+            duck_array_ops.astype,
+            self,
+            kwargs=dict(dtype=dtype, casting=casting, copy=copy),
+            keep_attrs=keep_attrs,
+            dask="allowed",
+        )
+
     def load(self, **kwargs):
         """Manually trigger loading of this variable's data from disk or a
         remote source into memory and return this variable.
diff --git a/xarray/tests/test_dataarray.py b/xarray/tests/test_dataarray.py
index f10404d7..77969e5c 100644
--- a/xarray/tests/test_dataarray.py
+++ b/xarray/tests/test_dataarray.py
@@ -1874,6 +1874,19 @@ class TestDataArray:
         bar = Variable(["x", "y"], np.zeros((10, 20)))
         assert_equal(self.dv, np.maximum(self.dv, bar))
 
+    def test_astype_attrs(self):
+        for v in [self.va.copy(), self.mda.copy(), self.ds.copy()]:
+            v.attrs["foo"] = "bar"
+            assert list(v.attrs.items()) == list(v.astype(float).attrs.items())
+            assert [] == list(v.astype(float, keep_attrs=False).attrs.items())
+
+    def test_astype_dtype(self):
+        original = DataArray([-1, 1, 2, 3, 1000])
+        converted = original.astype(float)
+        assert_array_equal(original, converted)
+        assert np.issubdtype(original.dtype, np.integer)
+        assert np.issubdtype(converted.dtype, np.floating)
+
     def test_is_null(self):
         x = np.random.RandomState(42).randn(5, 6)
         x[x < 0] = np.nan
diff --git a/xarray/tests/test_dataset.py b/xarray/tests/test_dataset.py
index da7621dc..9a95eb24 100644
--- a/xarray/tests/test_dataset.py
+++ b/xarray/tests/test_dataset.py
@@ -5607,6 +5607,17 @@ class TestDataset:
         np.testing.assert_equal(padded["var1"].isel(dim2=[0, -1]).data, 42)
         np.testing.assert_equal(padded["dim2"][[0, -1]].data, np.nan)
 
+    def test_astype_attrs(self):
+        data = create_test_data(seed=123)
+        data.attrs["foo"] = "bar"
+
+        assert list(data.attrs.items()) == list(data.astype(float).attrs.items())
+        assert list(data.var1.attrs.items()) == list(
+            data.astype(float).var1.attrs.items()
+        )
+        assert [] == list(data.astype(float, keep_attrs=False).attrs.items())
+        assert [] == list(data.astype(float, keep_attrs=False).var1.attrs.items())
+
 
 # Py.test tests
 

@max-sixty
Collaborator

There are some duplicated commits — 3x Recreate gaj... — though still weird that GH and git disagree.

I'd suggest the squash and push -f...

*   e23529e0 - (HEAD -> astypekeepattrs, dnowacki-usgs/astypekeepattrs, pr/dnowacki-usgs/4314) Merge branch 'astypekeepattrs' of https://github.com/dnowacki-usgs/xarray into astypekeepattrs (3 hours ago) <Dan Nowacki>
|\
| * ac42be80 - improve error handling and check if sparse array (6 days ago) <Dan Nowacki>
| * e2d38550 - fix dtype test (7 days ago) <Dan Nowacki>
| *   e079cf86 - Merge branch 'astypekeepattrs' of https://github.com/dnowacki-usgs/xarray into astypekeepattrs (7 days ago) <Dan Nowacki>
| |\
| | *   1296f590 - Merge branch 'master' into pr/dnowacki-usgs/4314 (8 days ago) <Maximilian Roos>
| | |\
| | * | 5c101d85 - Accept keep_attrs flag, use apply_ufunc in variable.py, expand tests (8 days ago) <Dan Nowacki>
| | * | d75e1fb9 - Recreate @gajomi's #2070 to keep attrs when calling astype() (8 days ago) <Dan Nowacki>
| * | | eb86c8a6 - add to whats new (7 days ago) <Dan Nowacki>
| * | | db014baf - Ignore casting if we get a TypeError (indicates sparse <= 0.10.0) (7 days ago) <Dan Nowacki>
| * | | 01d1b378 - Accept keep_attrs flag, use apply_ufunc in variable.py, expand tests (7 days ago) <Dan Nowacki>
| * | | 00bfeaa3 - Recreate @gajomi's #2070 to keep attrs when calling astype() (7 days ago) <Dan Nowacki>
* | | | dba93b18 - improve error handling and check if sparse array (3 hours ago) <Dan Nowacki>
* | | | 4856eec5 - fix dtype test (3 hours ago) <Dan Nowacki>
* | | | 983f17d5 - add to whats new (3 hours ago) <Dan Nowacki>
* | | | 9f9b06c6 - Ignore casting if we get a TypeError (indicates sparse <= 0.10.0) (3 hours ago) <Dan Nowacki>
* | | | a277c2a4 - Accept keep_attrs flag, use apply_ufunc in variable.py, expand tests (3 hours ago) <Dan Nowacki>
* | | | 833e444f - Recreate @gajomi's #2070 to keep attrs when calling astype() (3 hours ago) <Dan Nowacki>
* | | | 7ee97989 - Accept keep_attrs flag, use apply_ufunc in variable.py, expand tests (4 hours ago) <Dan Nowacki>
* | | | dd4f5494 - Recreate @gajomi's #2070 to keep attrs when calling astype() (4 hours ago) <Dan Nowacki>
* | | | 7daad4fc - (upstream/master, upstream/HEAD, origin/master, master) Implement interp for interpolating between chunks of data (dask) (#4155) (2 days ago) <Alexandre Poux>
* | | | d514b120 - Add @mathause to current core developers. (#4335) (2 days ago) <Deepak Cherian>
* | | | ceadecd5 - install sphinx-autosummary-accessors from conda-forge (#4332) (3 days ago) <keewis>
* | | | 1791c3b6 - Use sphinx-accessors-autosummary (#4323) (4 days ago) <keewis>
* | | | df7b2eae - ndrolling fixes (#4329) (4 days ago) <Keisuke Fujii>
* | | | f02ca537 - DOC: fix typo argmin -> argmax in DataArray.argmax docstring (#4327) (5 days ago) <Sander>
* | | | 6e9edff5 - pin sphinx to 3.1(#4326) (5 days ago) <keewis>
* | | | 1d3dee08 - nd-rolling (#4219) (6 days ago) <Keisuke Fujii>
|/ / /
* | | e04e21d6 - Implicit dask import 4164 (#4318) (7 days ago) <inakleinbottle>
* | | 7ba19e12 - allow customizing the inline repr of a duck array (#4248) (7 days ago) <keewis>
| |/
|/|
* | 98dc1f4e - silence the known docs CI issues (#4316) (8 days ago) <keewis>
* | ed6fff2c - enh: fixed #4302 (#4315) (8 days ago) <Nick R. Papior>
* | 1101eca6 - Remove all unused and warn-raising methods from AbstractDataStore (#4310) (8 days ago) <alexamici>
|/
* e1dafe67 - (dnowacki-usgs/master) Fix map_blocks example (#4305) (10 days ago) <Tom Augspurger>

@keewis
Collaborator

keewis commented Aug 13, 2020

I think that was due to a rebase merge (?). I fixed the commit history using cherry-picking and force-pushed to your branch, but there's a backup of the old branch in case something went wrong.

@dnowacki-usgs, if you want to add new commits you'll first have to force-pull (or delete your local branch and check out the remote branch – just make sure that you don't lose any unpushed commits).
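For reference, one way to do the force-pull mentioned above (assuming the fork remote is named origin; this discards local-only commits, hence the note about unpushed work):

git fetch origin
git checkout <feature-branch>
git reset --hard origin/<feature-branch>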

Contributor

@dcherian left a comment


Thanks @dnowacki-usgs. This generally looks great. I've added some minor comments below.

@dnowacki-usgs
Contributor Author

Thanks for the review @dcherian. I modified the tests as suggested and added some docstring updates. Let me know what you think.

Contributor

@dcherian left a comment


Thanks @dnowacki-usgs. This LGTM.

The backcompat code is a little not-nice but I can't think of a better solution. It would be nice to not break things overnight... so thanks!

@dcherian requested a review from max-sixty August 15, 2020 16:44
Comment on lines 153 to 155
try:
return data.astype(**kwargs)
except TypeError as e:
Collaborator

@keewis commented Aug 19, 2020


would it be better to do something like

        try:
            import sparse
        except ImportError:
            sparse = None

        if (
            sparse is not None
            and isinstance(data, sparse.COO)
            and LooseVersion(sparse.__version__) < LooseVersion("0.11.0")
            and "casting" in kwargs
        ):
            # warn
            kwargs.pop("casting")
        return data.astype(**kwargs)

?

That way, we wouldn't call data.astype(**kwargs) twice.

@max-sixty
Collaborator

I just saw I was up for reviewing this — sorry for missing that.

Looks good. @keewis's suggestion for the back-compat code looks like a nice refinement — it includes a version check, which will make it easier to disentangle later. Could we add that and then merge?

@dnowacki-usgs
Contributor Author

dnowacki-usgs commented Aug 19, 2020

Updated the try/except following @keewis's lead. Also, sparse 0.11.0 has been released, so I was able to check that the warning is correctly issued for 0.10.0; after upgrading, the warning is no longer issued (and sparse doesn't choke on the casting kwarg).
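A hedged sketch of how that warning check can be written as an assertion (sparse_da is a stand-in for any sparse-backed DataArray; this only triggers with sparse < 0.11.0 installed):

import pytest

# Illustrative only: on old sparse versions the workaround should emit the warning.
with pytest.warns(RuntimeWarning, match="does not support the 'casting' argument"):
    sparse_da.astype(float)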

@keewis
Collaborator

keewis commented Aug 19, 2020

Could you merge in master? I think that would fix the failing build.

@keewis
Collaborator

keewis commented Aug 19, 2020

Which command did you use to create the merge? I think it was some sort of rebase merge, which means that all merged commits are included in the commit history, and GitHub doesn't seem to handle this kind of merge well. You also have duplicated commits in the history, which confuses me.

Edit: it seems you somehow merged / rebased the branch onto itself. Just as reference, I usually use

git checkout master
git pull
git checkout <feature-branch>
git merge master
git push

for merging

@dnowacki-usgs
Contributor Author

dnowacki-usgs commented Aug 19, 2020

Ugh, sorry about that. I was following the instructions in the contributing to xarray guide; it's a fetch/rebase. Happy to do it another way if there is a better practice. Thanks.

@keewis
Collaborator

keewis commented Aug 19, 2020

ah, now I understand where the duplication is coming from. I'm guessing you did that and then tried to push, but it complained about different commits, so you pulled and then pushed.

I fixed it for you.

I think we should update that guide; rebasing is usually not a good idea because then you'd have to force-push, too (and GitHub doesn't handle force-pushing that well).

@keewis
Collaborator

keewis commented Aug 19, 2020

I think it should be

git checkout master
git pull
git checkout <feature-branch>
git merge master
git push

or

git fetch upstream
git merge upstream/master
git push

instead

@dnowacki-usgs
Contributor Author

dnowacki-usgs commented Aug 19, 2020

Ah, I see, merge instead of rebase. I'll try that next time.

@keewis
Collaborator

keewis commented Aug 19, 2020

I've opened a PR to update the guide; hopefully that helps.

@dcherian
Contributor

Thanks @dnowacki-usgs

Successfully merging this pull request may close these issues.

Keeping attributes when using DataArray.astype
6 participants