In pyarrow we differentiate between missing (`null`) values, which we define with a validity bitmask, and `NaN` float values.

From the dataframe interchange protocol specification we understand that a library can use `NaN` to indicate missing values, but that it does not have to: `NaN` can also be a valid (non-missing) value.
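
For concreteness, here is a minimal sketch of that distinction in pyarrow (the array contents are illustrative only):

```python
import pyarrow as pa
import pyarrow.compute as pc

# One bitmask null (None) and one NaN, which is an ordinary IEEE 754 float value.
arr = pa.array([1.0, None, float("nan"), 4.0], type=pa.float64())

print(arr.null_count)              # 1 -> only the bitmask null is counted
print(pc.is_nan(arr).to_pylist())  # [False, None, True, False]
```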
dataframe-api/protocol/dataframe_protocol.py, lines 195 to 213 in 4f7c1e0:
```python
@property
def describe_null(self) -> Tuple[int, Any]:
    """
    Return the missing value (or "null") representation the column dtype
    uses, as a tuple ``(kind, value)``.

    Kind:

        - 0 : non-nullable
        - 1 : NaN/NaT
        - 2 : sentinel value
        - 3 : bit mask
        - 4 : byte mask

    Value : if kind is "sentinel value", the actual value. If kind is a bit
    mask or a byte mask, the value (0 or 1) indicating a missing value. None
    otherwise.
    """
    pass
```
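
As an illustration of how this plays out (not taken from any existing implementation; the class names are hypothetical), an Arrow-backed float column would naturally report its validity bitmap here, while a NumPy-backed float column that uses `NaN` as its missing-value marker would report kind 1:

```python
from typing import Any, Tuple

class ArrowBackedColumn:
    """Hypothetical column backed by a pyarrow float array."""

    @property
    def describe_null(self) -> Tuple[int, Any]:
        # Kind 3 = bit mask; in Arrow's validity bitmap a 0 bit marks a
        # missing value, so the value element is 0.
        return (3, 0)

class NumpyBackedFloatColumn:
    """Hypothetical column backed by a NumPy float64 array that uses NaN as
    its missing-value marker (as classic pandas float columns do)."""

    @property
    def describe_null(self) -> Tuple[int, Any]:
        # Kind 1 = NaN/NaT is the missing-value representation; no extra
        # value is needed.
        return (1, None)
```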
There will be a discrepancy between pyarrow and, for example, pandas, where `NaN` is turned into a missing value. However, we do not think it would be correct for pyarrow to change its `null_count` property, as the information about the distinction would be lost for the libraries that could benefit from it. The bitmask and the `null_count` would also need to be made consistent with each other.
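
To make the concern concrete, here is a purely illustrative sketch of the information that would be lost if NaNs were folded into the null count:

```python
import math

# Values as a producer sees them: None stands for a bitmask null,
# float("nan") for a valid-but-NaN entry.
values = [1.0, None, float("nan"), 4.0]

bitmask_nulls = sum(v is None for v in values)               # 1
nans = sum(v is not None and math.isnan(v) for v in values)  # 1

# If NaNs were also reported as missing, only the combined figure would be
# visible to consumers, and the split above could not be reconstructed.
print(bitmask_nulls, nans, bitmask_nulls + nans)             # 1 1 2
```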
Is there a way a library could keep the behaviour of not treating NaNs as nulls?
(Connected issue in the arrow repo apache/arrow#34774)