gh-133546: Make `re.Match` a well-rounded `Sequence` type #133549

vberlier · 2025-05-07T02:55:53Z

Issue: Make re.Match a well-rounded Sequence type #133546

📚 Documentation preview 📚: https://cpython-previews--133549.org.readthedocs.build/

JelleZijlstra · 2025-05-07T03:59:33Z

Doc/library/re.rst

@@ -1378,6 +1378,27 @@ when there is no match, you can test whether there was a match with a simple
   if match:
       process(match)

+Match objects are proper :class:`~collections.abc.Sequence` types. You can access


This is not true with this PR, Sequence has a number of other requirements (e.g. an index and count method).

Nice catch, thanks! I added index and count by adapting the implementation of tuple.index and tuple.count. I also updated the unit test to cover all Sequence mixin methods.

Modules/_sre/sre.c

picnixz

Test coverage is sometimes insufficient and sometimes redundant with existing ones.

Doc/library/re.rst

Lib/test/test_re.py

picnixz

Some additional feedback. For the test failures, let's add some assertWarns. I don't think we should make some additional checks in C for the type of the operands (the issue is when you compare the input (which is a bytes) with the matched item (which are strings), hence a BytesWarning). So, instead, let's just expect the warning.

By the way, could you avoid using @picnixz in your commits as I'm getting pinged? TiA.

Lib/test/test_re.py

Modules/_sre/sre.c

vberlier · 2025-05-16T00:19:30Z

Oh, sorry for the ping. Is it okay if I just leave out the cases with bytes as input? I can't seem to trigger the warning locally, even with ./python -Werror -m test.

Doc/library/re.rst

Modules/_sre/sre.c

Lib/test/test_re.py

AA-Turner · 2025-06-02T22:43:36Z

I'm not sure consensus exists for this on either the issue or the linked Discourse thread...

A

vberlier · 2025-06-03T00:50:50Z

It's a bit quiet, but I guess there's not much to talk about. Serhiy Storchaka brought up a valuable point about the fact that back in the day there was some debate over the possible semantics of len() and unpacking. But with the __getitem__ implementation that was introduced 10 years ago, there's actually not much wiggle room for creative interpretations of what it would mean for re.Match to be a proper Sequence type.

This PR is nothing more than painting by numbers. The API was just waiting for someone to spend a couple of hours to fill in the blanks. So it's not particularly exciting, and I doubt the topic will attract a lot of attention on Discourse, but personally I think this is the kind of polish and attention to detail that makes Python feel like home. And it looks like some other people and a core dev are on board with it too. That's enough for me to get started. Plus, working code can help ground discussions.

That said, I'm still new to CPython's contributing process, so let me know if I should've done something differently.

picnixz

After staring at this PR once again, I think we don't need the full-fledged sequence API but we could have some of it:

It's useful to have case support.
It's useful to have deconstruction via _, a, b = match
It may be useful to have len as otherwise we need to extract the groups and do len(m.groups()).
I can't find use cases of .index. While it could give back the index of the group that matches what we pass, it could be more useful if we had the name of the group when possible. IOW, .index is a bit ambiguous as we can access groups by name (if they exist) or by index.

So, maybe for a first iteration, we can drop .index and .count? they don't seem that useful to me. We wouldn't have a real sequence, but at least we would have a more useful interface.

Finally, the docs for .group() (which is meant to be equivalent to __getitem__) say:

If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised

IOW, we don't support negative groups but we support negative access and this is inconsistent with the sequence contract I think.

What I find useful though is the possibility to use match m: and deconstruction.

vberlier · 2025-06-23T01:33:55Z

I agree I don't think that .index() and .count() bring a lot of value on their own. The main motivation is primarily pattern-matching and sequence unpacking.

But in generic contexts, a lot of code out there will assert that the object is a proper Sequence before making use of those features, either through type annotations or dynamically with isinstance. The problem is that this also technically encodes a requirement for mixin methods that are not needed.

I often wonder if Python could have originally introduced collections.abc with an alternative to Sequence that doesn't advertise as many capabilities, basically just Collection + __getitem__. But it's a bit too late now, users rely on Sequence everywhere, and being a good citizen of the data model means abiding by the rules and in some cases filling out pointless methods. In pure Python, the fact that inheriting from Sequence generates the mixin methods automatically tends to hide the fact that most code using collections.abc at API boundaries is prone to overly strict interface requirements. In C the problem becomes much more noticeable because the necessary boilerplate is no longer invisible, you have to implement it manually.

As you're suggesting, the alternative could be to only add len() and Py_TPFLAGS_SEQUENCE, which should be enough for sequence unpacking and pattern-matching, without actually making re.Match a real Sequence type. But we have to consider that from a user's perspective, having to get familiar with a bespoke subset of the Sequence protocol will incur more cognitive overhead than if the contract is fulfilled in its entirety, even though some parts may end up being less applicable. Here providing full Sequence compatibility is not actually much work so I think it's worth it.

Note that for .index() since it's there for compatibility for users supplying a re.Match object to a piece of code that accepts a Sequence, we shouldn't diverge from the expected semantics.

Regarding __getitem__ and .group() having different behavior for negative indices, it's a bit unfortunate but we don't really have any other option. .group() needs to keep raising the IndexError otherwise existing code may break. But __getitem__ needs to handle negative indices to fulfill the requirements for Sequence. I'll update the docs to explain the inconsistency.

picnixz · 2025-06-23T07:44:09Z

Regarding getitem and .group() having different behavior for negative indices, it's a bit unfortunate but we don't really have any other option. .group() needs to keep raising the IndexError otherwise existing code may break. But getitem needs to handle negative indices to fulfill the requirements for Sequence. I'll update the docs to explain the inconsistency.

In this case, I think it's better not change it and abandon the proposal. Isn't it possible to just implement the destructuring & match support without the sequence? I still don't think it's useful to have more methods than necessary.

Note that for .index() since it's there for compatibility for users supplying a re.Match object to a piece of code that accepts a Sequence, we shouldn't diverge from the expected semantics.

The question we should ask ourselves is rather: why would they use match as a regular sequence? it's not really a sequence IMO. the current expected semantics are "not a sequence" or am I wrong?

I would hold this until some other core dev chimes in to agree with me that we shouldn't add useless methods, reject what I said, or propose something else. I don't know if it's possible to only support destructing and/or matches.

vberlier · 2025-06-25T18:22:30Z

Isn't it possible to just implement the destructuring & match support without the sequence?

Yes, adding len() and Py_TPFLAGS_SEQUENCE should be enough to support sequence unpacking and pattern-matching, without actually making re.Match a real Sequence type.

why would they use match as a regular sequence? it's not really a sequence IMO. the current expected semantics are "not a sequence" or am I wrong?

It's been teetering on the edge of being a sequence for a decade with the introduction of the __getitem__ implementation. And we both agree that making it behave even more like a sequence by adding unpacking and pattern-matching would be beneficial. The rest falls out of Python's natural inclination towards duck-typing: the more re.Match will look like a sequence, the more people will want to use it like a sequence. So the most straightforward way to legitimize this improvement is to finally make the API coherent with the standard Python data model and implement all the remaining bits of the Sequence protocol.

When it comes to mixin methods, they're a pain point that's recurrent throughout all Python code. Even objects like sys.version_info are forced to have dubious-looking .index() and .count() methods, because people use named tuples as Sequence all the time in spite of Sequence making overly generous interface promises.

And even then, while I don't see .index() and .count() being used often on re.Match objects directly, they still make perfect sense. The intrinsic piece of data held within re.Match is a non-empty tuple[str | None, ...] listing the substrings extracted when matching the pattern (it just happens to not actually be stored as a regular tuple internally to make the allocation of the substrings lazy).

So at its core it's a perfectly legitimate Sequence type. Granted, the historical re.Match API doesn't do a great job at making this transparent. But as long as there's a way to fix it with backward-compatibility in mind, I think we should strive to close the gap with other languages, for example like in JavaScript where a match object is a fully-fledged builtin Array:

> let m = "abc".match(/(a)(b)(c)/)
> console.log(m instanceof Array)
true
> m.concat(['hello'])
[ 'abc', 'a', 'b', 'c', 'hello' ]

Regarding __getitem__ and .group() having different behavior for negative indices, it's a bit unfortunate but we don't really have any other option

In this case, I think it's better not change it and abandon the proposal

This is a trade-off, we add new features and tackle some long overdue API polish, but this comes at the cost of a slight inconsistency between __getitem__ and the old .group() accessor. Here it's going to come down to personal preference and individual sensibilities. If the inconsistency is a deal-breaker for the majority, we can still do as you suggested and only make the necessary changes to support unpacking and pattern-matching. This wouldn't require updating __getitem__ to conform to the Sequence protocol and therefore wouldn't introduce an inconsistency with .group(). But some people may point out the missed opportunity for cleaning things up.

vberlier · 2025-06-25T18:44:34Z

Actually, m[-1] now returning the last matched substring instead of raising IndexError is technically not backward-compatible. And this change would be required for full Sequence compatibility. So maybe the re.Match API is beyond saving already. We'd need to survey the ecosystem. If there are existing usages that depend on __getitem__ raising IndexError for negative values, we might have to fall back to ad-hoc support for unpacking and pattern-matching regardless.

AA-Turner · 2025-06-25T21:41:10Z

all observable behaviors of your system / will be depended on by somebody

I think with the current opposition already noted and the compatability break here, better to close this PR and consider a new approach of just adding pattern matching support. Before you spend time working on an implementation, though, please check in the issue or on Discourse that there is support for this idea.

A

bedevere-app bot mentioned this pull request May 7, 2025

Make re.Match a well-rounded Sequence type #133546

Open

bedevere-app bot added the awaiting review label May 7, 2025

Make re.Match a well-rounded Sequence type

74480a7

vberlier force-pushed the gh-133546 branch from ac95ba2 to 74480a7 Compare May 7, 2025 03:02

JelleZijlstra reviewed May 7, 2025

View reviewed changes

vberlier force-pushed the gh-133546 branch from 01288ce to 497c42a Compare May 7, 2025 05:23

Implement missing index and count methods

a3de846

vberlier force-pushed the gh-133546 branch from 497c42a to a3de846 Compare May 8, 2025 11:08

ZeroIntensity reviewed May 11, 2025

View reviewed changes

Modules/_sre/sre.c Outdated Show resolved Hide resolved

Modules/_sre/sre.c Show resolved Hide resolved

Modules/_sre/sre.c Outdated Show resolved Hide resolved

Modules/_sre/sre.c Show resolved Hide resolved

vberlier added 6 commits May 11, 2025 20:15

Don't check for PyErr_Occurred

603b1d1

Add missing braces

70b73e4

Fix link to ValueError

f51ef45

Rewrite match_subscript to support negative indexing and slicing

f218828

Update ACKS

5272141

Add news entry

d0aa6fa

picnixz reviewed May 15, 2025

View reviewed changes

Lib/test/test_re.py Outdated Show resolved Hide resolved

Modules/_sre/sre.c Outdated Show resolved Hide resolved

Modules/_sre/sre.c Outdated Show resolved Hide resolved

Address feedback from @picnixz

5f67be0

vberlier force-pushed the gh-133546 branch from ccdbbbb to 5f67be0 Compare May 15, 2025 23:57

vberlier added 3 commits May 16, 2025 02:21

Move slicing test cases into its own test function

17feaa6

Fix error checking

fe709f8

Use else if

4095b52

vberlier requested a review from picnixz May 20, 2025 17:29

picnixz reviewed Jun 2, 2025

View reviewed changes

Address next round of comments

51d918d

vberlier requested a review from picnixz June 11, 2025 12:39

picnixz reviewed Jun 22, 2025

View reviewed changes

AA-Turner closed this Jun 25, 2025

Uh oh!

gh-133546: Make re.Match a well-rounded Sequence type #133549

gh-133546: Make re.Match a well-rounded Sequence type #133549

Uh oh!

Conversation

vberlier commented May 7, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JelleZijlstra May 7, 2025

Choose a reason for hiding this comment

Uh oh!

vberlier May 7, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

picnixz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

picnixz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vberlier commented May 16, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

AA-Turner commented Jun 2, 2025

Uh oh!

vberlier commented Jun 3, 2025

Uh oh!

picnixz left a comment

Choose a reason for hiding this comment

Uh oh!

vberlier commented Jun 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

picnixz commented Jun 23, 2025

Uh oh!

vberlier commented Jun 25, 2025

Uh oh!

vberlier commented Jun 25, 2025

Uh oh!

AA-Turner commented Jun 25, 2025

Uh oh!

Uh oh!

gh-133546: Make `re.Match` a well-rounded `Sequence` type #133549

gh-133546: Make `re.Match` a well-rounded `Sequence` type #133549

vberlier commented May 7, 2025 •

edited by github-actions bot

Loading

vberlier commented Jun 23, 2025 •

edited

Loading