Labels: Categorical, Enhancement, Needs Discussion, Strings
Description
This came up in the PR implementing .str/.dt on Categoricals, #11582.
The following is perfectly reasonable: we perform the string op on the uniques, and since this routine produces a boolean result, we return a boolean Series.
In [2]: s = pd.Series(list('aabb')).astype('category')
In [3]: s
Out[3]:
0 a
1 a
2 b
3 b
dtype: category
Categories (2, object): [a, b]
In [4]: s.str.contains("a")
Out[4]:
0 True
1 True
2 False
3 False
dtype: bool
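As an illustration of what "perform the string op on the uniques" means here, a minimal sketch (not necessarily how pandas implements it internally): evaluate the predicate once per category, then broadcast the result to the rows through the integer codes. It assumes the Series has no missing values, since a code of -1 would need separate handling.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("aabb")).astype("category")

# Evaluate the predicate once per category instead of once per row.
per_category = np.asarray(s.cat.categories.str.contains("a"), dtype=bool)

# Broadcast the per-category answers to the rows via the integer codes.
# Assumption: no missing values, i.e. no code equal to -1.
codes = s.cat.codes.to_numpy()
result = pd.Series(per_category[codes], index=s.index)
```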
However, I don't recall the rationale for performing the op on the uniques (as it's a categorical) but then returning an object dtype.
In [5]: s.str.upper()
Out[5]:
0 A
1 A
2 B
3 B
dtype: object
These are by definition pure transforms, so returning a new categorical makes sense. For example, in this case:
In [6]: pd.Series(pd.Categorical.from_codes(s.cat.codes, s.cat.categories.str.upper()))
Out[6]:
0 A
1 A
2 B
3 B
dtype: category
Categories (2, object): [A, B]
This would be far more efficient than actually converting to object dtype.
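One wrinkle with the from_codes approach is that a transform can make two categories collide (e.g. 'a' and 'A' both upper-case to 'A'), and categories must be unique. A rough sketch of how that could be handled follows; the helper name `str_transform_categorical` is hypothetical and not an existing pandas function, and it assumes the categories are all strings.

```python
import numpy as np
import pandas as pd


def str_transform_categorical(s, func):
    """Apply a string transform to the categories only and rebuild a categorical.

    Hypothetical helper for illustration. `func` receives the `.str`
    accessor of the categories, e.g. `lambda a: a.upper()`.
    """
    new_cats = pd.Index(func(s.cat.categories.str))

    # Transformed categories may collide, so factorize them:
    # `remap[i]` is the new code for old category i.
    remap, uniques = pd.factorize(new_cats)

    # Translate the old codes to the new ones, leaving missing values (-1) alone.
    codes = s.cat.codes.to_numpy()
    new_codes = np.full(len(codes), -1, dtype=np.intp)
    present = codes >= 0
    new_codes[present] = remap[codes[present]]

    return pd.Series(pd.Categorical.from_codes(new_codes, uniques), index=s.index)


s = pd.Series(list("aabb")).astype("category")
upper = str_transform_categorical(s, lambda a: a.upper())

# 'a' and 'A' collapse into the single category 'A'.
mixed = pd.Series(["a", "A", "b", None]).astype("category")
collapsed = str_transform_categorical(mixed, lambda a: a.upper())
```

The transform itself only touches the categories; the per-row work is a single integer remapping of the codes.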