Skip to content

gh-133546: Make re.Match a well-rounded Sequence type #133549

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from

Conversation

vberlier
Copy link
Contributor

@vberlier vberlier commented May 7, 2025

@@ -1378,6 +1378,27 @@ when there is no match, you can test whether there was a match with a simple
if match:
process(match)

Match objects are proper :class:`~collections.abc.Sequence` types. You can access
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not true with this PR, Sequence has a number of other requirements (e.g. an index and count method).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thanks! I added index and count by adapting the implementation of tuple.index and tuple.count. I also updated the unit test to cover all Sequence mixin methods.

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test coverage is sometimes insufficient and sometimes redundant with existing ones.

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional feedback. For the test failures, let's add some assertWarns. I don't think we should make some additional checks in C for the type of the operands (the issue is when you compare the input (which is a bytes) with the matched item (which are strings), hence a BytesWarning). So, instead, let's just expect the warning.

By the way, could you avoid using @picnixz in your commits as I'm getting pinged? TiA.

@vberlier
Copy link
Contributor Author

Oh, sorry for the ping. Is it okay if I just leave out the cases with bytes as input? I can't seem to trigger the warning locally, even with ./python -Werror -m test.

@vberlier vberlier requested a review from picnixz May 20, 2025 17:29
@AA-Turner
Copy link
Member

I'm not sure consensus exists for this on either the issue or the linked Discourse thread...

A

@vberlier
Copy link
Contributor Author

vberlier commented Jun 3, 2025

It's a bit quiet, but I guess there's not much to talk about. Serhiy Storchaka brought up a valuable point about the fact that back in the day there was some debate over the possible semantics of len() and unpacking. But with the __getitem__ implementation that was introduced 10 years ago, there's actually not much wiggle room for creative interpretations of what it would mean for re.Match to be a proper Sequence type.

This PR is nothing more than painting by numbers. The API was just waiting for someone to spend a couple of hours to fill in the blanks. So it's not particularly exciting, and I doubt the topic will attract a lot of attention on Discourse, but personally I think this is the kind of polish and attention to detail that makes Python feel like home. And it looks like some other people and a core dev are on board with it too. That's enough for me to get started. Plus, working code can help ground discussions.

That said, I'm still new to CPython's contributing process, so let me know if I should've done something differently.

@vberlier vberlier requested a review from picnixz June 11, 2025 12:39
Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After staring at this PR once again, I think we don't need the full-fledged sequence API but we could have some of it:

  • It's useful to have case support.
  • It's useful to have deconstruction via _, a, b = match
  • It may be useful to have len as otherwise we need to extract the groups and do len(m.groups()).
  • I can't find use cases of .index. While it could give back the index of the group that matches what we pass, it could be more useful if we had the name of the group when possible. IOW, .index is a bit ambiguous as we can access groups by name (if they exist) or by index.

So, maybe for a first iteration, we can drop .index and .count? they don't seem that useful to me. We wouldn't have a real sequence, but at least we would have a more useful interface.

Finally, the docs for .group() (which is meant to be equivalent to __getitem__) say:

If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised

IOW, we don't support negative groups but we support negative access and this is inconsistent with the sequence contract I think.


What I find useful though is the possibility to use match m: and deconstruction.

@vberlier
Copy link
Contributor Author

vberlier commented Jun 23, 2025

I agree I don't think that .index() and .count() bring a lot of value on their own. The main motivation is primarily pattern-matching and sequence unpacking.

But in generic contexts, a lot of code out there will assert that the object is a proper Sequence before making use of those features, either through type annotations or dynamically with isinstance. The problem is that this also technically encodes a requirement for mixin methods that are not needed.

I often wonder if Python could have originally introduced collections.abc with an alternative to Sequence that doesn't advertise as many capabilities, basically just Collection + __getitem__. But it's a bit too late now, users rely on Sequence everywhere, and being a good citizen of the data model means abiding by the rules and in some cases filling out pointless methods. In pure Python, the fact that inheriting from Sequence generates the mixin methods automatically tends to hide the fact that most code using collections.abc at API boundaries is prone to overly strict interface requirements. In C the problem becomes much more noticeable because the necessary boilerplate is no longer invisible, you have to implement it manually.

As you're suggesting, the alternative could be to only add len() and Py_TPFLAGS_SEQUENCE, which should be enough for sequence unpacking and pattern-matching, without actually making re.Match a real Sequence type. But we have to consider that from a user's perspective, having to get familiar with a bespoke subset of the Sequence protocol will incur more cognitive overhead than if the contract is fulfilled in its entirety, even though some parts may end up being less applicable. Here providing full Sequence compatibility is not actually much work so I think it's worth it.

Note that for .index() since it's there for compatibility for users supplying a re.Match object to a piece of code that accepts a Sequence, we shouldn't diverge from the expected semantics.

Regarding __getitem__ and .group() having different behavior for negative indices, it's a bit unfortunate but we don't really have any other option. .group() needs to keep raising the IndexError otherwise existing code may break. But __getitem__ needs to handle negative indices to fulfill the requirements for Sequence. I'll update the docs to explain the inconsistency.

@picnixz
Copy link
Member

picnixz commented Jun 23, 2025

Regarding getitem and .group() having different behavior for negative indices, it's a bit unfortunate but we don't really have any other option. .group() needs to keep raising the IndexError otherwise existing code may break. But getitem needs to handle negative indices to fulfill the requirements for Sequence. I'll update the docs to explain the inconsistency.

In this case, I think it's better not change it and abandon the proposal. Isn't it possible to just implement the destructuring & match support without the sequence? I still don't think it's useful to have more methods than necessary.

Note that for .index() since it's there for compatibility for users supplying a re.Match object to a piece of code that accepts a Sequence, we shouldn't diverge from the expected semantics.

The question we should ask ourselves is rather: why would they use match as a regular sequence? it's not really a sequence IMO. the current expected semantics are "not a sequence" or am I wrong?

I would hold this until some other core dev chimes in to agree with me that we shouldn't add useless methods, reject what I said, or propose something else. I don't know if it's possible to only support destructing and/or matches.

@vberlier
Copy link
Contributor Author

Isn't it possible to just implement the destructuring & match support without the sequence?

Yes, adding len() and Py_TPFLAGS_SEQUENCE should be enough to support sequence unpacking and pattern-matching, without actually making re.Match a real Sequence type.

why would they use match as a regular sequence? it's not really a sequence IMO. the current expected semantics are "not a sequence" or am I wrong?

It's been teetering on the edge of being a sequence for a decade with the introduction of the __getitem__ implementation. And we both agree that making it behave even more like a sequence by adding unpacking and pattern-matching would be beneficial. The rest falls out of Python's natural inclination towards duck-typing: the more re.Match will look like a sequence, the more people will want to use it like a sequence. So the most straightforward way to legitimize this improvement is to finally make the API coherent with the standard Python data model and implement all the remaining bits of the Sequence protocol.

When it comes to mixin methods, they're a pain point that's recurrent throughout all Python code. Even objects like sys.version_info are forced to have dubious-looking .index() and .count() methods, because people use named tuples as Sequence all the time in spite of Sequence making overly generous interface promises.

And even then, while I don't see .index() and .count() being used often on re.Match objects directly, they still make perfect sense. The intrinsic piece of data held within re.Match is a non-empty tuple[str | None, ...] listing the substrings extracted when matching the pattern (it just happens to not actually be stored as a regular tuple internally to make the allocation of the substrings lazy).

So at its core it's a perfectly legitimate Sequence type. Granted, the historical re.Match API doesn't do a great job at making this transparent. But as long as there's a way to fix it with backward-compatibility in mind, I think we should strive to close the gap with other languages, for example like in JavaScript where a match object is a fully-fledged builtin Array:

> let m = "abc".match(/(a)(b)(c)/)
> console.log(m instanceof Array)
true
> m.concat(['hello'])
[ 'abc', 'a', 'b', 'c', 'hello' ]

Regarding __getitem__ and .group() having different behavior for negative indices, it's a bit unfortunate but we don't really have any other option

In this case, I think it's better not change it and abandon the proposal

This is a trade-off, we add new features and tackle some long overdue API polish, but this comes at the cost of a slight inconsistency between __getitem__ and the old .group() accessor. Here it's going to come down to personal preference and individual sensibilities. If the inconsistency is a deal-breaker for the majority, we can still do as you suggested and only make the necessary changes to support unpacking and pattern-matching. This wouldn't require updating __getitem__ to conform to the Sequence protocol and therefore wouldn't introduce an inconsistency with .group(). But some people may point out the missed opportunity for cleaning things up.

@vberlier
Copy link
Contributor Author

Actually, m[-1] now returning the last matched substring instead of raising IndexError is technically not backward-compatible. And this change would be required for full Sequence compatibility. So maybe the re.Match API is beyond saving already. We'd need to survey the ecosystem. If there are existing usages that depend on __getitem__ raising IndexError for negative values, we might have to fall back to ad-hoc support for unpacking and pattern-matching regardless.

@AA-Turner
Copy link
Member

all observable behaviors of your system / will be depended on by somebody

I think with the current opposition already noted and the compatability break here, better to close this PR and consider a new approach of just adding pattern matching support. Before you spend time working on an implementation, though, please check in the issue or on Discourse that there is support for this idea.

A

@AA-Turner AA-Turner closed this Jun 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants