std::ascii reform #19350

Closed
@SimonSapin

Following up on #19194 and discussion with @aturon, I took a look at how things in the std::ascii module are used in the Rust repository and in Servo.

The std::ascii::Ascii type is a newtype of u8 that enforces (unless unsafe code is used) that the value is in the ASCII range, similar to char with u32 and the range of Unicode scalar values. [Ascii] is naturally a string of bytes entirely in the ASCII range.
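To make the invariant concrete, here is a minimal sketch of such a newtype, written in present-day Rust syntax. It is not the actual std::ascii::Ascii definition; the constructor name `new` and the accessor `to_byte` are assumptions chosen for illustration.

```rust
/// Hypothetical stand-in for the `Ascii` newtype; not the real std definition.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Ascii(u8);

impl Ascii {
    /// Returns `None` when the byte is outside the ASCII range (0x00..=0x7F),
    /// so safe code can never hold an out-of-range `Ascii`.
    pub fn new(byte: u8) -> Option<Ascii> {
        if byte < 0x80 {
            Some(Ascii(byte))
        } else {
            None
        }
    }

    /// Converts back to the underlying byte (assumed accessor name).
    pub fn to_byte(self) -> u8 {
        self.0
    }

    /// Maps b'a'..=b'z' to b'A'..=b'Z' and leaves every other value alone.
    pub fn to_uppercase(self) -> Ascii {
        if self.0 >= b'a' && self.0 <= b'z' {
            Ascii(self.0 - 32)
        } else {
            self
        }
    }
}
```

Because the inner field is private, the only safe way to build a value is through the checked constructor, which mirrors how char restricts u32 to Unicode scalar values.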

Using the type system like this to enforce data invariants is interesting, but in practice [Ascii] is not that useful. Data (such as from the network) is rarely guaranteed to be ASCII-only, nor is it desirable to remove or replace non-ASCII bytes, even when only ASCII-range operations are needed. (E.g. “ASCII case-insensitivity” is common in HTML and CSS.)
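As an illustration of an ASCII-range-only operation on data that may contain non-ASCII bytes, the sketch below compares two byte strings ASCII case-insensitively without rejecting or altering anything outside the ASCII range. The function names `ascii_lower` and `bytes_eq_ignore_ascii_case` are hypothetical helpers, not part of std::ascii.

```rust
/// Hypothetical helper: lowercases only bytes in b'A'..=b'Z'.
/// Every other byte, including non-ASCII ones, is returned unchanged.
fn ascii_lower(byte: u8) -> u8 {
    if byte >= b'A' && byte <= b'Z' {
        byte + 32
    } else {
        byte
    }
}

/// Compares two byte strings, ignoring case differences in ASCII letters only.
/// Non-ASCII bytes must match exactly.
fn bytes_eq_ignore_ascii_case(a: &[u8], b: &[u8]) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b.iter())
            .all(|(x, y)| ascii_lower(*x) == ascii_lower(*y))
}
```

Note that no [Ascii] conversion is needed: the input stays a plain [u8], and UTF-8 sequences for non-ASCII characters pass through untouched.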

Every use of the Ascii type that I’ve found exists only to call the to_lowercase or to_uppercase method and then immediately convert back to u8 or char.

Therefore, I suggest:

  • Moving the Ascii type as well as the AsciiCast, OwnedAsciiCast, AsciiStr, and IntoBytes traits into a new ascii Cargo package on crates.io
  • Marking them as deprecated in std::ascii, and removing them at some point before 1.0
  • Reworking the rest of the module to provide the functionality on u8, char, [u8] and str. Specifically:
    • Keep the AsciiExt and OwnedAsciiExt traits. (Maybe rename them?)
    • Implement AsciiExt on char and u8 (in addition to the existing impls for str and [u8])
    • Add is_ascii() -> bool. Maybe on AsciiExt? It’s mostly used on u8 and char, but it also makes sense on str and [u8].
    • Maybe is_ascii_lowercase, is_ascii_uppercase, is_ascii_alphabetic, or is_ascii_alphanumeric could be useful, but I’d be fine with dropping them and reconsidering if someone asks for them. The same result can be achieved by combining .is_ascii() && with the corresponding UnicodeChar method, which in most cases has an ASCII fast path.
    • I don’t think the remaining Ascii methods are valuable.
      • is_digit and is_hex are identical to Char::is_digit(10) and Char::is_digit(16).
      • is_blank, is_control, is_graph, is_print, and is_punctuation are never used.
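The reworked module proposed in the list above could look roughly like the following sketch, which provides the operations directly on u8 and char with no intermediate Ascii value. The trait name `AsciiOps` and its method names are hypothetical; they are chosen so as not to clash with methods that already exist on these types in current Rust, and they are not the proposed AsciiExt signatures.

```rust
/// Hypothetical trait, loosely modelled on the proposed `AsciiExt`.
pub trait AsciiOps: Sized {
    /// True when the value is in the ASCII range (0x00..=0x7F).
    fn in_ascii_range(&self) -> bool;
    /// Lowercases ASCII letters; all other values are returned unchanged.
    fn ascii_lower(&self) -> Self;
    /// Uppercases ASCII letters; all other values are returned unchanged.
    fn ascii_upper(&self) -> Self;
}

impl AsciiOps for u8 {
    fn in_ascii_range(&self) -> bool {
        *self < 0x80
    }
    fn ascii_lower(&self) -> u8 {
        if *self >= b'A' && *self <= b'Z' { *self + 32 } else { *self }
    }
    fn ascii_upper(&self) -> u8 {
        if *self >= b'a' && *self <= b'z' { *self - 32 } else { *self }
    }
}

impl AsciiOps for char {
    fn in_ascii_range(&self) -> bool {
        (*self as u32) < 0x80
    }
    fn ascii_lower(&self) -> char {
        // Only ASCII chars fit in a u8, so check the range before casting.
        if self.in_ascii_range() { (*self as u8).ascii_lower() as char } else { *self }
    }
    fn ascii_upper(&self) -> char {
        if self.in_ascii_range() { (*self as u8).ascii_upper() as char } else { *self }
    }
}
```

With something like this in scope, a caller writes `b'Q'.ascii_lower()` or `'q'.ascii_upper()` directly, which covers the to_lowercase and to_uppercase round trips that the Ascii type is currently used for.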

How does this sound? I can help with the implementation work. Should this go through the RFC process?
