Counting bytes with high bit set optimizes badly for x86_64 #72355

@hsivonen

I tried this code:

pub fn count_non_ascii(buffer: &[u8]) -> u64 {
    let mut count = 0;
    for &b in buffer {
        if b >= 0x80 {
            count += 1;
        }
    }
    count
}

Godbolt link

I expected to see this happen: the compiler would autovectorize the loop along the lines of this manual SSE2 version:

pub fn count_non_ascii_sse2(buffer: &[u8]) -> u64 {
    let mut count = 0;
    let (prefix, simd, suffix) = unsafe { buffer.align_to::<core::arch::x86_64::__m128i>() };
    for &b in prefix {
        if b >= 0x80 {
            count += 1;
        }
    }
    for &s in simd {
        count += unsafe {core::arch::x86_64::_mm_movemask_epi8(s)}.count_ones() as u64;
    }
    for &b in suffix {
        if b >= 0x80 {
            count += 1;
        }
    }
    count
}

Instead, this happened: The loop is autovectorized into something more complex and considerably slower than the manual vectorization given above. (The manual vectorization becomes even faster when compiled with a target_cpu that supports the POPCNT instruction.)
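
To compare the two versions it helps to first confirm that they return the same counts. The following is a minimal sketch, not part of the original report. It restates both functions from above; the `versions_agree` helper, its test data, and the non-x86_64 fallback are hypothetical additions for illustration. The manual version relies on `_mm_movemask_epi8` collecting the high bit of each of the 16 bytes into a mask, so `count_ones` on that mask is the number of bytes >= 0x80 in the vector.

```rust
/// Scalar version, as in the report.
pub fn count_non_ascii(buffer: &[u8]) -> u64 {
    let mut count = 0;
    for &b in buffer {
        if b >= 0x80 {
            count += 1;
        }
    }
    count
}

/// Manual SSE2 version, as in the report, with the prefix and suffix
/// loops delegated to the scalar function.
#[cfg(target_arch = "x86_64")]
pub fn count_non_ascii_sse2(buffer: &[u8]) -> u64 {
    use core::arch::x86_64::{__m128i, _mm_movemask_epi8};
    // SAFETY: every bit pattern is a valid __m128i, so viewing the
    // aligned middle of a byte slice as vectors is sound.
    let (prefix, simd, suffix) = unsafe { buffer.align_to::<__m128i>() };
    let mut count = count_non_ascii(prefix);
    for &s in simd {
        // One mask bit per byte, taken from that byte's high bit.
        // SSE2 is part of the x86_64 baseline.
        count += unsafe { _mm_movemask_epi8(s) }.count_ones() as u64;
    }
    count + count_non_ascii(suffix)
}

/// Assumption: on other targets, fall back to the scalar version so
/// the sketch still compiles.
#[cfg(not(target_arch = "x86_64"))]
pub fn count_non_ascii_sse2(buffer: &[u8]) -> u64 {
    count_non_ascii(buffer)
}

/// Hypothetical helper: checks both versions on many sub-slices so that
/// different prefix/suffix lengths around the aligned middle are covered.
pub fn versions_agree() -> bool {
    // Deterministic pseudo-random bytes (multiplicative hash of the index).
    let data: Vec<u8> = (0..1024u32)
        .map(|i| (i.wrapping_mul(2654435761) >> 24) as u8)
        .collect();
    (0..32).all(|start| {
        (start..data.len()).step_by(37).all(|end| {
            let s = &data[start..end];
            count_non_ascii(s) == count_non_ascii_sse2(s)
        })
    })
}
```

Calling `versions_agree()` from a test or from `main` should return `true`. To look at the POPCNT variant mentioned above, the same source can be built with `-C target-cpu=native` or `-C target-feature=+popcnt`.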

Meta

rustc --version --verbose:

rustc 1.45.0-nightly (a74d1862d 2020-05-14)

    Labels

    A-LLVM - Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
    C-bug - Category: This is a bug.
    I-slow - Issue: Problems and improvements with respect to performance of generated code.
    O-x86_64 - Target: x86-64 processors (like x86_64-*) (also known as amd64 and x64)
    T-compiler - Relevant to the compiler team, which will review and decide on the PR/issue.
