Commit 3557b05

bpo-31690: Allow the inline flags "a", "L", and "u" to be used as group flags for RE. (#3885)
1 parent fdd9b21 commit 3557b05

11 files changed

+300
-140
lines changed

Doc/library/re.rst

Lines changed: 30 additions & 28 deletions
@@ -245,16 +245,32 @@ The special characters are:
    *cannot* be retrieved after performing a match or referenced later in the
    pattern.
 
-``(?imsx-imsx:...)``
-   (Zero or more letters from the set ``'i'``, ``'m'``, ``'s'``, ``'x'``,
-   optionally followed by ``'-'`` followed by one or more letters from the
-   same set.) The letters set or removes the corresponding flags:
-   :const:`re.I` (ignore case), :const:`re.M` (multi-line), :const:`re.S`
-   (dot matches all), and :const:`re.X` (verbose), for the part of the
-   expression. (The flags are described in :ref:`contents-of-module-re`.)
+``(?aiLmsux-imsx:...)``
+   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
+   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
+   one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
+   The letters set or remove the corresponding flags:
+   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
+   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
+   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
+   and :const:`re.X` (verbose), for the part of the expression.
+   (The flags are described in :ref:`contents-of-module-re`.)
+
+   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
+   as inline flags, so they can't be combined or follow ``'-'``. Instead,
+   when one of them appears in an inline group, it overrides the matching mode
+   in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
+   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
+   (default). In byte pattern ``(?L:...)`` switches to locale depending
+   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
+   This override is only in effect for the narrow inline group, and the
+   original matching mode is restored outside of the group.
 
    .. versionadded:: 3.6
 
+   .. versionchanged:: 3.7
+      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.
+
 ``(?P<name>...)``
    Similar to regular parentheses, but the substring matched by the group is
    accessible via the symbolic group name *name*. Group names must be valid
@@ -384,29 +400,23 @@ character ``'$'``.
       Matches any Unicode decimal digit (that is, any character in
       Unicode character category [Nd]). This includes ``[0-9]``, and
       also many other digit characters. If the :const:`ASCII` flag is
-      used only ``[0-9]`` is matched (but the flag affects the entire
-      regular expression, so in such cases using an explicit ``[0-9]``
-      may be a better choice).
+      used only ``[0-9]`` is matched.
 
    For 8-bit (bytes) patterns:
       Matches any decimal digit; this is equivalent to ``[0-9]``.
 
 ``\D``
    Matches any character which is not a decimal digit. This is
    the opposite of ``\d``. If the :const:`ASCII` flag is used this
-   becomes the equivalent of ``[^0-9]`` (but the flag affects the entire
-   regular expression, so in such cases using an explicit ``[^0-9]`` may
-   be a better choice).
+   becomes the equivalent of ``[^0-9]``.
 
 ``\s``
    For Unicode (str) patterns:
       Matches Unicode whitespace characters (which includes
       ``[ \t\n\r\f\v]``, and also many other characters, for example the
       non-breaking spaces mandated by typography rules in many
       languages). If the :const:`ASCII` flag is used, only
-      ``[ \t\n\r\f\v]`` is matched (but the flag affects the entire
-      regular expression, so in such cases using an explicit
-      ``[ \t\n\r\f\v]`` may be a better choice).
+      ``[ \t\n\r\f\v]`` is matched.
 
    For 8-bit (bytes) patterns:
       Matches characters considered whitespace in the ASCII character set;
@@ -415,18 +425,14 @@ character ``'$'``.
 ``\S``
    Matches any character which is not a whitespace character. This is
    the opposite of ``\s``. If the :const:`ASCII` flag is used this
-   becomes the equivalent of ``[^ \t\n\r\f\v]`` (but the flag affects the entire
-   regular expression, so in such cases using an explicit ``[^ \t\n\r\f\v]`` may
-   be a better choice).
+   becomes the equivalent of ``[^ \t\n\r\f\v]``.
 
 ``\w``
    For Unicode (str) patterns:
       Matches Unicode word characters; this includes most characters
       that can be part of a word in any language, as well as numbers and
       the underscore. If the :const:`ASCII` flag is used, only
-      ``[a-zA-Z0-9_]`` is matched (but the flag affects the entire
-      regular expression, so in such cases using an explicit
-      ``[a-zA-Z0-9_]`` may be a better choice).
+      ``[a-zA-Z0-9_]`` is matched.
 
    For 8-bit (bytes) patterns:
       Matches characters considered alphanumeric in the ASCII character set;
@@ -437,9 +443,7 @@ character ``'$'``.
 ``\W``
    Matches any character which is not a word character. This is
    the opposite of ``\w``. If the :const:`ASCII` flag is used this
-   becomes the equivalent of ``[^a-zA-Z0-9_]`` (but the flag affects the
-   entire regular expression, so in such cases using an explicit
-   ``[^a-zA-Z0-9_]`` may be a better choice). If the :const:`LOCALE` flag is
+   becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
    used, matches characters considered alphanumeric in the current locale
    and the underscore.
 
@@ -563,9 +567,7 @@ form.
    letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
    'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
    If the :const:`ASCII` flag is used, only letters 'a' to 'z'
-   and 'A' to 'Z' are matched (but the flag affects the entire regular
-   expression, so in such cases using an explicit ``(?-i:[a-zA-Z])`` may be
-   a better choice).
+   and 'A' to 'Z' are matched.
 
 .. data:: L
           LOCALE
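
The scoped matching modes described in the Doc/library/re.rst changes can be tried out with a short script. This is a minimal sketch that assumes Python 3.7 or later; the helper names `scoped_flag_demo` and `is_rejected` and the sample character are illustrative and not part of the commit.

```python
import re

# Illustrative sample: U+0663 is ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit.
ARABIC_INDIC_THREE = "\u0663"


def scoped_flag_demo():
    """Return booleans showing how (?a:...) narrows matching inside one group."""
    return {
        # Default str-pattern behaviour: \d matches any Unicode decimal digit.
        "unicode_digit": re.fullmatch(r"\d", ARABIC_INDIC_THREE) is not None,
        # Inside (?a:...), \d is restricted to [0-9].
        "ascii_group": re.fullmatch(r"(?a:\d)", ARABIC_INDIC_THREE) is not None,
        # The override ends with the group: the second \d is Unicode-aware again.
        "ascii_then_unicode": re.fullmatch(
            r"(?a:\d)\d", "3" + ARABIC_INDIC_THREE
        ) is not None,
    }


def is_rejected(pattern):
    """Return True if re.compile() refuses the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return True
    return False


if __name__ == "__main__":
    print(scoped_flag_demo())
    # 'a', 'L' and 'u' cannot be combined, and cannot follow '-'.
    print(is_rejected(r"(?au:x)"), is_rejected(r"(?-a:x)"))
```

The two `is_rejected` calls exercise the "mutually exclusive" rule from the new documentation text: combining type flags or trying to turn one off is expected to raise `re.error` at compile time.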

Doc/whatsnew/3.7.rst

Lines changed: 7 additions & 0 deletions
@@ -296,6 +296,13 @@ pdb
 argument. If given, this is printed to the console just before debugging
 begins.
 
+re
+--
+
+The flags :const:`re.ASCII`, :const:`re.LOCALE` and :const:`re.UNICODE`
+can be set within the scope of a group.
+(Contributed by Serhiy Storchaka in :issue:`31690`.)
+
 string
 ------

Lib/sre_compile.py

Lines changed: 38 additions & 21 deletions
@@ -62,6 +62,12 @@
 _ignorecase_fixes = {i: tuple(j for j in t if i != j)
                      for t in _equivalences for i in t}
 
+def _combine_flags(flags, add_flags, del_flags,
+                   TYPE_FLAGS=sre_parse.TYPE_FLAGS):
+    if add_flags & TYPE_FLAGS:
+        flags &= ~TYPE_FLAGS
+    return (flags | add_flags) & ~del_flags
+
 def _compile(code, pattern, flags):
     # internal: compile a (sub)pattern
     emit = code.append
@@ -87,15 +93,21 @@ def _compile(code, pattern, flags):
                 emit(op)
                 emit(av)
             elif flags & SRE_FLAG_LOCALE:
-                emit(OP_LOC_IGNORE[op])
+                emit(OP_LOCALE_IGNORE[op])
                 emit(av)
             elif not iscased(av):
                 emit(op)
                 emit(av)
             else:
                 lo = tolower(av)
-                if fixes and lo in fixes:
-                    emit(IN_IGNORE)
+                if not fixes:  # ascii
+                    emit(OP_IGNORE[op])
+                    emit(lo)
+                elif lo not in fixes:
+                    emit(OP_UNICODE_IGNORE[op])
+                    emit(lo)
+                else:
+                    emit(IN_UNI_IGNORE)
                     skip = _len(code); emit(0)
                     if op is NOT_LITERAL:
                         emit(NEGATE)
@@ -104,17 +116,16 @@ def _compile(code, pattern, flags):
                         emit(k)
                     emit(FAILURE)
                     code[skip] = _len(code) - skip
-                else:
-                    emit(OP_IGNORE[op])
-                    emit(lo)
         elif op is IN:
             charset, hascased = _optimize_charset(av, iscased, tolower, fixes)
             if flags & SRE_FLAG_IGNORECASE and flags & SRE_FLAG_LOCALE:
                 emit(IN_LOC_IGNORE)
-            elif hascased:
+            elif not hascased:
+                emit(IN)
+            elif not fixes:  # ascii
                 emit(IN_IGNORE)
             else:
-                emit(IN)
+                emit(IN_UNI_IGNORE)
             skip = _len(code); emit(0)
             _compile_charset(charset, flags, code)
             code[skip] = _len(code) - skip
@@ -153,8 +164,8 @@ def _compile(code, pattern, flags):
             if group:
                 emit(MARK)
                 emit((group-1)*2)
-            # _compile_info(code, p, (flags | add_flags) & ~del_flags)
-            _compile(code, p, (flags | add_flags) & ~del_flags)
+            # _compile_info(code, p, _combine_flags(flags, add_flags, del_flags))
+            _compile(code, p, _combine_flags(flags, add_flags, del_flags))
             if group:
                 emit(MARK)
                 emit((group-1)*2+1)
@@ -210,10 +221,14 @@ def _compile(code, pattern, flags):
                 av = CH_UNICODE[av]
             emit(av)
         elif op is GROUPREF:
-            if flags & SRE_FLAG_IGNORECASE:
-                emit(OP_IGNORE[op])
-            else:
+            if not flags & SRE_FLAG_IGNORECASE:
                 emit(op)
+            elif flags & SRE_FLAG_LOCALE:
+                emit(GROUPREF_LOC_IGNORE)
+            elif not fixes:  # ascii
+                emit(GROUPREF_IGNORE)
+            else:
+                emit(GROUPREF_UNI_IGNORE)
             emit(av-1)
         elif op is GROUPREF_EXISTS:
             emit(op)
@@ -240,7 +255,7 @@ def _compile_charset(charset, flags, code):
             pass
         elif op is LITERAL:
             emit(av)
-        elif op is RANGE or op is RANGE_IGNORE:
+        elif op is RANGE or op is RANGE_UNI_IGNORE:
             emit(av[0])
             emit(av[1])
         elif op is CHARSET:
@@ -309,9 +324,9 @@ def _optimize_charset(charset, iscased=None, fixup=None, fixes=None):
                     hascased = True
                     # There are only two ranges of cased non-BMP characters:
                     # 10400-1044F (Deseret) and 118A0-118DF (Warang Citi),
-                    # and for both ranges RANGE_IGNORE works.
+                    # and for both ranges RANGE_UNI_IGNORE works.
                     if op is RANGE:
-                        op = RANGE_IGNORE
+                        op = RANGE_UNI_IGNORE
                 tail.append((op, av))
             break
 
@@ -456,7 +471,7 @@ def _get_literal_prefix(pattern, flags):
             prefixappend(av)
         elif op is SUBPATTERN:
             group, add_flags, del_flags, p = av
-            flags1 = (flags | add_flags) & ~del_flags
+            flags1 = _combine_flags(flags, add_flags, del_flags)
             if flags1 & SRE_FLAG_IGNORECASE and flags1 & SRE_FLAG_LOCALE:
                 break
             prefix1, prefix_skip1, got_all = _get_literal_prefix(p, flags1)
@@ -482,7 +497,7 @@ def _get_charset_prefix(pattern, flags):
         if op is not SUBPATTERN:
             break
         group, add_flags, del_flags, pattern = av
-        flags = (flags | add_flags) & ~del_flags
+        flags = _combine_flags(flags, add_flags, del_flags)
         if flags & SRE_FLAG_IGNORECASE and flags & SRE_FLAG_LOCALE:
             return None
 
@@ -631,6 +646,7 @@ def print_2(*args):
                 print_(op)
             elif op in (LITERAL, NOT_LITERAL,
                         LITERAL_IGNORE, NOT_LITERAL_IGNORE,
+                        LITERAL_UNI_IGNORE, NOT_LITERAL_UNI_IGNORE,
                         LITERAL_LOC_IGNORE, NOT_LITERAL_LOC_IGNORE):
                 arg = code[i]
                 i += 1
@@ -647,12 +663,12 @@ def print_2(*args):
                 arg = str(CHCODES[arg])
                 assert arg[:9] == 'CATEGORY_'
                 print_(op, arg[9:])
-            elif op in (IN, IN_IGNORE, IN_LOC_IGNORE):
+            elif op in (IN, IN_IGNORE, IN_UNI_IGNORE, IN_LOC_IGNORE):
                 skip = code[i]
                 print_(op, skip, to=i+skip)
                 dis_(i+1, i+skip)
                 i += skip
-            elif op in (RANGE, RANGE_IGNORE):
+            elif op in (RANGE, RANGE_UNI_IGNORE):
                 lo, hi = code[i: i+2]
                 i += 2
                 print_(op, '%#02x %#02x (%r-%r)' % (lo, hi, chr(lo), chr(hi)))
@@ -671,7 +687,8 @@ def print_2(*args):
                     print_2(_hex_code(code[i: i + 256//_CODEBITS]))
                     i += 256//_CODEBITS
                 level -= 1
-            elif op in (MARK, GROUPREF, GROUPREF_IGNORE):
+            elif op in (MARK, GROUPREF, GROUPREF_IGNORE, GROUPREF_UNI_IGNORE,
+                        GROUPREF_LOC_IGNORE):
                 arg = code[i]
                 i += 1
                 print_(op, arg)
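
The effect of the new `_combine_flags` helper, compared with the `(flags | add_flags) & ~del_flags` expression it replaces, can be illustrated outside the compiler. This is a standalone sketch, not the module's code: it assumes `sre_parse.TYPE_FLAGS` is the union of the ASCII, LOCALE and UNICODE flags and rebuilds that value from the public `re` constants; the names `combine_flags`, `old_combine`, `outer` and `inner` are illustrative.

```python
import re

# Assumption: stands in for sre_parse.TYPE_FLAGS, built from public re constants.
TYPE_FLAGS = int(re.ASCII | re.LOCALE | re.UNICODE)


def combine_flags(flags, add_flags, del_flags, type_flags=TYPE_FLAGS):
    """Standalone copy of the _combine_flags logic shown in the diff."""
    if add_flags & type_flags:
        # A new matching mode replaces the inherited one instead of being OR-ed in.
        flags &= ~type_flags
    return (flags | add_flags) & ~del_flags


def old_combine(flags, add_flags, del_flags):
    """The expression used before this commit, kept for comparison."""
    return (flags | add_flags) & ~del_flags


# Enclosing group: Unicode matching, case-insensitive.
outer = int(re.UNICODE | re.IGNORECASE)
# Inner group written as (?a:...): ASCII replaces UNICODE, IGNORECASE is inherited.
inner = combine_flags(outer, int(re.ASCII), 0)

if __name__ == "__main__":
    print(re.RegexFlag(inner))
    # With the old expression both mode bits would be set at once.
    print(re.RegexFlag(old_combine(outer, int(re.ASCII), 0)))
```

Clearing the inherited mode bits first is what lets an inner group switch modes; with the old expression the nested group would carry two conflicting mode flags.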

Lib/sre_constants.py

Lines changed: 29 additions & 11 deletions
@@ -13,7 +13,7 @@
 
 # update when constants are added or removed
 
-MAGIC = 20170530
+MAGIC = 20171005
 
 from _sre import MAXREPEAT, MAXGROUPS
 
@@ -84,25 +84,37 @@ def _makecodes(names):
     CALL
     CATEGORY
     CHARSET BIGCHARSET
-    GROUPREF GROUPREF_EXISTS GROUPREF_IGNORE
-    IN IN_IGNORE
+    GROUPREF GROUPREF_EXISTS
+    IN
     INFO
     JUMP
-    LITERAL LITERAL_IGNORE
+    LITERAL
     MARK
     MAX_UNTIL
     MIN_UNTIL
-    NOT_LITERAL NOT_LITERAL_IGNORE
+    NOT_LITERAL
     NEGATE
     RANGE
     REPEAT
     REPEAT_ONE
     SUBPATTERN
     MIN_REPEAT_ONE
-    RANGE_IGNORE
+
+    GROUPREF_IGNORE
+    IN_IGNORE
+    LITERAL_IGNORE
+    NOT_LITERAL_IGNORE
+
+    GROUPREF_LOC_IGNORE
+    IN_LOC_IGNORE
     LITERAL_LOC_IGNORE
     NOT_LITERAL_LOC_IGNORE
-    IN_LOC_IGNORE
+
+    GROUPREF_UNI_IGNORE
+    IN_UNI_IGNORE
+    LITERAL_UNI_IGNORE
+    NOT_LITERAL_UNI_IGNORE
+    RANGE_UNI_IGNORE
 
     MIN_REPEAT MAX_REPEAT
 """)
@@ -113,7 +125,9 @@ def _makecodes(names):
     AT_BEGINNING AT_BEGINNING_LINE AT_BEGINNING_STRING
     AT_BOUNDARY AT_NON_BOUNDARY
     AT_END AT_END_LINE AT_END_STRING
+
     AT_LOC_BOUNDARY AT_LOC_NON_BOUNDARY
+
     AT_UNI_BOUNDARY AT_UNI_NON_BOUNDARY
 """)
 
@@ -123,7 +137,9 @@ def _makecodes(names):
     CATEGORY_SPACE CATEGORY_NOT_SPACE
     CATEGORY_WORD CATEGORY_NOT_WORD
     CATEGORY_LINEBREAK CATEGORY_NOT_LINEBREAK
+
     CATEGORY_LOC_WORD CATEGORY_LOC_NOT_WORD
+
     CATEGORY_UNI_DIGIT CATEGORY_UNI_NOT_DIGIT
     CATEGORY_UNI_SPACE CATEGORY_UNI_NOT_SPACE
     CATEGORY_UNI_WORD CATEGORY_UNI_NOT_WORD
@@ -133,18 +149,20 @@ def _makecodes(names):
 
 # replacement operations for "ignore case" mode
 OP_IGNORE = {
-    GROUPREF: GROUPREF_IGNORE,
-    IN: IN_IGNORE,
     LITERAL: LITERAL_IGNORE,
     NOT_LITERAL: NOT_LITERAL_IGNORE,
-    RANGE: RANGE_IGNORE,
 }
 
-OP_LOC_IGNORE = {
+OP_LOCALE_IGNORE = {
     LITERAL: LITERAL_LOC_IGNORE,
     NOT_LITERAL: NOT_LITERAL_LOC_IGNORE,
 }
 
+OP_UNICODE_IGNORE = {
+    LITERAL: LITERAL_UNI_IGNORE,
+    NOT_LITERAL: NOT_LITERAL_UNI_IGNORE,
+}
+
 AT_MULTILINE = {
     AT_BEGINNING: AT_BEGINNING_LINE,
     AT_END: AT_END_LINE
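
The split of the case-insensitive opcodes into separate `*_IGNORE` (ASCII) and `*_UNI_IGNORE` (Unicode) families should show up as a difference in behaviour between scoped ASCII and Unicode case folding. The sketch below only probes that behaviour through the public `re` API on Python 3.7 or later; it does not inspect opcodes, and the helper name `matches_kelvin` is illustrative.

```python
import re

# U+212A KELVIN SIGN lower-cases to ASCII 'k' under Unicode rules, but is not
# an ASCII letter, so ASCII-only case folding leaves it unchanged.
KELVIN_SIGN = "\u212a"


def matches_kelvin(pattern):
    """Return True if the pattern matches the Kelvin sign in full."""
    return re.fullmatch(pattern, KELVIN_SIGN) is not None


if __name__ == "__main__":
    # Unicode case-insensitive matching: 'k' matches the Kelvin sign.
    print(matches_kelvin(r"(?i:k)"))
    # ASCII case-insensitive matching in the same group: no match.
    print(matches_kelvin(r"(?ai:k)"))
    # ASCII mode set in a nested group still inherits 'i' from the outer group.
    print(matches_kelvin(r"(?i:(?a:k))"))
```

The third pattern is the case that needs per-group mode tracking: the `i` flag comes from the outer group while the matching mode comes from the inner one.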
