Skip to content

Commit 0e57a63

Browse files
committed
automata: make PikeVM and backtracker work without capture states
Previously, construction of these engines checked to make sure the NFA given had some capture states in it. If the NFA didn't, construction failed with an error. To support the case where the NFA has no capture states at all (to avoid gratuitous memory allocation), we remove this restriction and tweak the engine implementations to stop assuming that the NFA has capture states. This turned out to not be too hard, as we only assumed as much in a few places. The main reason why this restriction existed in the first place was semantics. Namely, it's important that the PikeVM remain infallible. But what happens when you ask for match offsets in a search with an NFA that has no capture states? The PikeVM just doesn't support that. Previously it would panic (and thus the reason construction would fail). But now instead it will just report "no match." It's a little hokey, but we justify it to ourselves because "simplicity" and "avoids footguns" are non-goals of this crate.
1 parent c4bf634 commit 0e57a63

File tree

5 files changed

+72
-61
lines changed

5 files changed

+72
-61
lines changed

regex-automata/src/meta/regex.rs

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2635,8 +2635,10 @@ impl Config {
26352635
/// states. For example, using `WhichCaptures::Implicit` will behave as if
26362636
/// all explicit capturing groups in the pattern were non-capturing.
26372637
///
2638-
/// Setting this to `WhichCaptures::None` may result in an error when
2639-
/// building a meta regex.
2638+
/// Setting this to `WhichCaptures::None` is usually not the right thing to
2639+
/// do. When no capture states are compiled, some regex engines (such as
2640+
/// the `PikeVM`) won't be able to report match offsets. This will manifest
2641+
/// as no match being found.
26402642
///
26412643
/// # Example
26422644
///

regex-automata/src/nfa/thompson/backtrack.rs

Lines changed: 16 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -300,15 +300,6 @@ impl Builder {
300300
&self,
301301
nfa: NFA,
302302
) -> Result<BoundedBacktracker, BuildError> {
303-
// If the NFA has no captures, then the backtracker doesn't work since
304-
// it relies on them in order to report match locations. However, in
305-
// the special case of an NFA with no patterns, it is allowed, since
306-
// no matches can ever be produced. And importantly, an NFA with no
307-
// patterns has no capturing groups anyway, so this is necessary to
308-
// permit the backtracker to work with regexes with zero patterns.
309-
if !nfa.has_capture() && nfa.pattern_len() > 0 {
310-
return Err(BuildError::missing_captures());
311-
}
312303
nfa.look_set_any().available().map_err(BuildError::word)?;
313304
Ok(BoundedBacktracker { config: self.config.clone(), nfa })
314305
}
@@ -954,8 +945,14 @@ impl BoundedBacktracker {
954945
None => return Ok(None),
955946
Some(pid) => pid,
956947
};
957-
let start = slots[0].unwrap().get();
958-
let end = slots[1].unwrap().get();
948+
let start = match slots[0] {
949+
None => return Ok(None),
950+
Some(s) => s.get(),
951+
};
952+
let end = match slots[1] {
953+
None => return Ok(None),
954+
Some(s) => s.get(),
955+
};
959956
return Ok(Some(Match::new(pid, Span { start, end })));
960957
}
961958
let ginfo = self.get_nfa().group_info();
@@ -965,8 +962,14 @@ impl BoundedBacktracker {
965962
None => return Ok(None),
966963
Some(pid) => pid,
967964
};
968-
let start = slots[pid.as_usize() * 2].unwrap().get();
969-
let end = slots[pid.as_usize() * 2 + 1].unwrap().get();
965+
let start = match slots[pid.as_usize() * 2] {
966+
None => return Ok(None),
967+
Some(s) => s.get(),
968+
};
969+
let end = match slots[pid.as_usize() * 2 + 1] {
970+
None => return Ok(None),
971+
Some(s) => s.get(),
972+
};
970973
Ok(Some(Match::new(pid, Span { start, end })))
971974
}
972975

regex-automata/src/nfa/thompson/compiler.rs

Lines changed: 36 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -316,8 +316,8 @@ impl Config {
316316
/// # Example
317317
///
318318
/// This example demonstrates that some regex engines, like the Pike VM,
319-
/// require capturing groups to be present in the NFA. Building a Pike VM
320-
/// with an NFA without capturing groups will result in an error.
319+
/// require capturing states to be present in the NFA to report match
320+
/// offsets.
321321
///
322322
/// (Note that since this method is deprecated, the example below uses
323323
/// [`Config::which_captures`] to disable capture states.)
@@ -329,10 +329,13 @@ impl Config {
329329
/// WhichCaptures,
330330
/// };
331331
///
332-
/// let nfa = NFA::compiler()
333-
/// .configure(NFA::config().which_captures(WhichCaptures::None))
332+
/// let re = PikeVM::builder()
333+
/// .thompson(NFA::config().which_captures(WhichCaptures::None))
334334
/// .build(r"[a-z]+")?;
335-
/// assert!(PikeVM::new_from_nfa(nfa).is_err());
335+
/// let mut cache = re.create_cache();
336+
///
337+
/// assert!(re.is_match(&mut cache, "abc"));
338+
/// assert_eq!(None, re.find(&mut cache, "abc"));
336339
///
337340
/// # Ok::<(), Box<dyn std::error::Error>>(())
338341
/// ```
@@ -363,8 +366,8 @@ impl Config {
363366
/// # Example
364367
///
365368
/// This example demonstrates that some regex engines, like the Pike VM,
366-
/// require capturing groups to be present in the NFA. Building a Pike VM
367-
/// with an NFA without capturing groups will result in an error.
369+
/// require capturing states to be present in the NFA to report match
370+
/// offsets.
368371
///
369372
/// ```
370373
/// use regex_automata::nfa::thompson::{
@@ -373,10 +376,33 @@ impl Config {
373376
/// WhichCaptures,
374377
/// };
375378
///
376-
/// let nfa = NFA::compiler()
377-
/// .configure(NFA::config().which_captures(WhichCaptures::None))
379+
/// let re = PikeVM::builder()
380+
/// .thompson(NFA::config().which_captures(WhichCaptures::None))
381+
/// .build(r"[a-z]+")?;
382+
/// let mut cache = re.create_cache();
383+
///
384+
/// assert!(re.is_match(&mut cache, "abc"));
385+
/// assert_eq!(None, re.find(&mut cache, "abc"));
386+
///
387+
/// # Ok::<(), Box<dyn std::error::Error>>(())
388+
/// ```
389+
///
390+
/// The same applies to the bounded backtracker:
391+
///
392+
/// ```
393+
/// use regex_automata::nfa::thompson::{
394+
/// backtrack::BoundedBacktracker,
395+
/// NFA,
396+
/// WhichCaptures,
397+
/// };
398+
///
399+
/// let re = BoundedBacktracker::builder()
400+
/// .thompson(NFA::config().which_captures(WhichCaptures::None))
378401
/// .build(r"[a-z]+")?;
379-
/// assert!(PikeVM::new_from_nfa(nfa).is_err());
402+
/// let mut cache = re.create_cache();
403+
///
404+
/// assert!(re.try_is_match(&mut cache, "abc")?);
405+
/// assert_eq!(None, re.try_find(&mut cache, "abc")?);
380406
///
381407
/// # Ok::<(), Box<dyn std::error::Error>>(())
382408
/// ```

regex-automata/src/nfa/thompson/error.rs

Lines changed: 0 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -68,9 +68,6 @@ enum BuildErrorKind {
6868
/// The invalid index that was given.
6969
index: u32,
7070
},
71-
/// An error that occurs when one tries to build an NFA simulation (such as
72-
/// the PikeVM) without any capturing groups.
73-
MissingCaptures,
7471
/// An error that occurs when one tries to build a reverse NFA with
7572
/// captures enabled. Currently, this isn't supported, but we probably
7673
/// should support it at some point.
@@ -126,10 +123,6 @@ impl BuildError {
126123
BuildError { kind: BuildErrorKind::InvalidCaptureIndex { index } }
127124
}
128125

129-
pub(crate) fn missing_captures() -> BuildError {
130-
BuildError { kind: BuildErrorKind::MissingCaptures }
131-
}
132-
133126
#[cfg(feature = "syntax")]
134127
pub(crate) fn unsupported_captures() -> BuildError {
135128
BuildError { kind: BuildErrorKind::UnsupportedCaptures }
@@ -181,11 +174,6 @@ impl core::fmt::Display for BuildError {
181174
"capture group index {} is invalid (too big or discontinuous)",
182175
index,
183176
),
184-
BuildErrorKind::MissingCaptures => write!(
185-
f,
186-
"operation requires the NFA to have capturing groups, \
187-
but the NFA given contains none",
188-
),
189177
#[cfg(feature = "syntax")]
190178
BuildErrorKind::UnsupportedCaptures => write!(
191179
f,

regex-automata/src/nfa/thompson/pikevm.rs

Lines changed: 16 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -275,15 +275,6 @@ impl Builder {
275275
/// construction of the NFA itself will of course be ignored, since the NFA
276276
/// given here is already built.
277277
pub fn build_from_nfa(&self, nfa: NFA) -> Result<PikeVM, BuildError> {
278-
// If the NFA has no captures, then the PikeVM doesn't work since it
279-
// relies on them in order to report match locations. However, in
280-
// the special case of an NFA with no patterns, it is allowed, since
281-
// no matches can ever be produced. And importantly, an NFA with no
282-
// patterns has no capturing groups anyway, so this is necessary to
283-
// permit the PikeVM to work with regexes with zero patterns.
284-
if !nfa.has_capture() && nfa.pattern_len() > 0 {
285-
return Err(BuildError::missing_captures());
286-
}
287278
nfa.look_set_any().available().map_err(BuildError::word)?;
288279
Ok(PikeVM { config: self.config.clone(), nfa })
289280
}
@@ -828,16 +819,16 @@ impl PikeVM {
828819
if self.get_nfa().pattern_len() == 1 {
829820
let mut slots = [None, None];
830821
let pid = self.search_slots(cache, &input, &mut slots)?;
831-
let start = slots[0].unwrap().get();
832-
let end = slots[1].unwrap().get();
822+
let start = slots[0]?.get();
823+
let end = slots[1]?.get();
833824
return Some(Match::new(pid, Span { start, end }));
834825
}
835826
let ginfo = self.get_nfa().group_info();
836827
let slots_len = ginfo.implicit_slot_len();
837828
let mut slots = vec![None; slots_len];
838829
let pid = self.search_slots(cache, &input, &mut slots)?;
839-
let start = slots[pid.as_usize() * 2].unwrap().get();
840-
let end = slots[pid.as_usize() * 2 + 1].unwrap().get();
830+
let start = slots[pid.as_usize() * 2]?.get();
831+
let end = slots[pid.as_usize() * 2 + 1]?.get();
841832
Some(Match::new(pid, Span { start, end }))
842833
}
843834

@@ -1123,15 +1114,15 @@ impl PikeVM {
11231114
if self.get_nfa().pattern_len() == 1 {
11241115
let mut enough = [None, None];
11251116
let got = self.search_slots_imp(cache, input, &mut enough);
1126-
// This is OK because we know `enough_slots` is strictly bigger
1127-
// than `slots`, otherwise this special case isn't reached.
1117+
// This is OK because we know `enough` is strictly bigger than
1118+
// `slots`, otherwise this special case isn't reached.
11281119
slots.copy_from_slice(&enough[..slots.len()]);
11291120
return got;
11301121
}
11311122
let mut enough = vec![None; min];
11321123
let got = self.search_slots_imp(cache, input, &mut enough);
1133-
// This is OK because we know `enough_slots` is strictly bigger than
1134-
// `slots`, otherwise this special case isn't reached.
1124+
// This is OK because we know `enough` is strictly bigger than `slots`,
1125+
// otherwise this special case isn't reached.
11351126
slots.copy_from_slice(&enough[..slots.len()]);
11361127
got
11371128
}
@@ -2108,15 +2099,16 @@ impl SlotTable {
21082099
// if a 'Captures' has fewer slots, e.g., none at all or only slots
21092100
// for tracking the overall match instead of all slots for every
21102101
// group.
2111-
self.slots_for_captures = nfa.group_info().slot_len();
2102+
self.slots_for_captures = core::cmp::max(
2103+
self.slots_per_state,
2104+
nfa.pattern_len().checked_mul(2).unwrap(),
2105+
);
21122106
let len = nfa
21132107
.states()
21142108
.len()
2115-
// We add 1 so that our last row is always empty. We use it as
2116-
// "scratch" space for computing the epsilon closure off of the
2117-
// starting state.
2118-
.checked_add(1)
2119-
.and_then(|x| x.checked_mul(self.slots_per_state))
2109+
.checked_mul(self.slots_per_state)
2110+
// Add space to account for scratch space used during a search.
2111+
.and_then(|x| x.checked_add(self.slots_for_captures))
21202112
// It seems like this could actually panic on legitimate inputs on
21212113
// 32-bit targets, and very likely to panic on 16-bit. Should we
21222114
// somehow convert this to an error? What about something similar
@@ -2170,7 +2162,7 @@ impl SlotTable {
21702162
/// compute an epsilon closure outside of the user supplied regex, and thus
21712163
/// never want it to have any capturing slots set.
21722164
fn all_absent(&mut self) -> &mut [Option<NonMaxUsize>] {
2173-
let i = self.table.len() - self.slots_per_state;
2165+
let i = self.table.len() - self.slots_for_captures;
21742166
&mut self.table[i..i + self.slots_for_captures]
21752167
}
21762168
}

0 commit comments

Comments
 (0)