In pyarrow we differentiate between missing (`null`) values, which we define with a validity bitmask, and `NaN` float values.

From the dataframe interchange protocol specification we understand that a library can use `NaN` to indicate missing values, but that it does not have to: `NaN` can also be a valid (non-missing) value.
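
For concreteness, here is a minimal sketch of that distinction in pyarrow (the array contents are illustrative only):

```python
import pyarrow as pa
import pyarrow.compute as pc

# One bitmask null (None) and one NaN, which is an ordinary IEEE 754 float value.
arr = pa.array([1.0, None, float("nan"), 4.0], type=pa.float64())

print(arr.null_count)              # 1 -> only the bitmask null is counted
print(pc.is_nan(arr).to_pylist())  # [False, None, True, False]
```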
dataframe-api/protocol/dataframe_protocol.py, lines 195 to 213 in 4f7c1e0:
```python
@property
def describe_null(self) -> Tuple[int, Any]:
    """
    Return the missing value (or "null") representation the column dtype
    uses, as a tuple ``(kind, value)``.

    Kind:

        - 0 : non-nullable
        - 1 : NaN/NaT
        - 2 : sentinel value
        - 3 : bit mask
        - 4 : byte mask

    Value : if kind is "sentinel value", the actual value. If kind is a bit
    mask or a byte mask, the value (0 or 1) indicating a missing value. None
    otherwise.
    """
    pass
```
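
As an illustration of how this plays out (not taken from any existing implementation; the class names are hypothetical), an Arrow-backed float column would naturally report its validity bitmap here, while a NumPy-backed float column that uses `NaN` as its missing-value marker would report kind 1:

```python
from typing import Any, Tuple

class ArrowBackedColumn:
    """Hypothetical column backed by a pyarrow float array."""

    @property
    def describe_null(self) -> Tuple[int, Any]:
        # Kind 3 = bit mask; in Arrow's validity bitmap a 0 bit marks a
        # missing value, so the value element is 0.
        return (3, 0)

class NumpyBackedFloatColumn:
    """Hypothetical column backed by a NumPy float64 array that uses NaN as
    its missing-value marker (as classic pandas float columns do)."""

    @property
    def describe_null(self) -> Tuple[int, Any]:
        # Kind 1 = NaN/NaT is the missing-value representation; no extra
        # value is needed.
        return (1, None)
```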
There will be a discrepancy between pyarrow and, for example, pandas, where `NaN` is turned into a missing value. However, we do not think it would be correct for pyarrow to change its `null_count` property, as the information about the distinction would be lost for the libraries that could benefit from it. The bitmask and the `null_count` would also need to be made consistent with each other.
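
To make the concern concrete, here is a purely illustrative sketch of the information that would be lost if NaNs were folded into the null count:

```python
import math

# Values as a producer sees them: None stands for a bitmask null,
# float("nan") for a valid-but-NaN entry.
values = [1.0, None, float("nan"), 4.0]

bitmask_nulls = sum(v is None for v in values)               # 1
nans = sum(v is not None and math.isnan(v) for v in values)  # 1

# If NaNs were also reported as missing, only the combined figure would be
# visible to consumers, and the split above could not be reconstructed.
print(bitmask_nulls, nans, bitmask_nulls + nans)             # 1 1 2
```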
Is there a way a library could keep the behaviour of not treating NaNs as nulls?
(Connected issue in the arrow repo apache/arrow#34774)