-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Add new lint: Mixed locale ident #7376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
16 commits
Select commit
Hold shift + click to select a range
a30a38a
Implement mixed locale ident lint
popzxc 2c4968d
Add test for mixed locale ident lint
popzxc 2ffbc1d
Rename mixed-locale-ident to mixed-locale-idents
popzxc 46b16c9
Fix failing tests
popzxc 2b9fa0a
'cargo dev' chore
popzxc 213c948
Add ticks around identifier name in lint message
popzxc ee81a3d
Improve wording
popzxc b474a08
Add german example to UI test
popzxc 0b2bd3e
Update clippy_lints/Cargo.toml
popzxc c9a42b3
Update implementation to check for identifier parts only
popzxc 9bed7d5
Add test cases for the new lint behavior
popzxc 7a86665
Add more test cases
popzxc c426816
Update stderr file
popzxc d2449e5
Add complex example
popzxc 6984e1c
Fix clippy issues
popzxc 61d4792
Add a few more tests
popzxc File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,231 @@ | ||
use clippy_utils::diagnostics::span_lint; | ||
use rustc_data_structures::fx::FxHashMap; | ||
use rustc_lint::{LateContext, LateLintPass}; | ||
use rustc_session::{declare_tool_lint, impl_lint_pass}; | ||
use rustc_span::{Span, Symbol}; | ||
use unicode_script::{Script, UnicodeScript}; | ||
use unicode_security::is_potential_mixed_script_confusable_char; | ||
|
||
declare_clippy_lint! { | ||
/// **What it does:** Checks for usage of mixed locales in a single identifier part | ||
/// with respect to the used case. | ||
/// | ||
/// **Why is this bad?** While it's OK for identifier to have different parts in different | ||
/// locales, it may be confusing if multiple locales are used in a single word. | ||
/// | ||
/// **Known problems:** None. | ||
/// | ||
/// **Example:** | ||
/// | ||
/// The following code compiles without any warnings: | ||
/// | ||
/// ```rust | ||
/// // In the `Block` identifier part, it's not a common latin `o` used, but rather Russian `о`. | ||
/// // It's hard to see visually and the `mixed_script_confusables` warning is not spawned | ||
/// // because of the `Блок` identifier part. | ||
/// struct BlоckБлок; | ||
/// | ||
/// fn main() { | ||
/// let _block = BlоckБлок; | ||
/// } | ||
/// ``` | ||
/// | ||
/// The same example, but with `Block` (english `o`) used instead of `Blоck` (russian `о`). | ||
/// It will not compile. | ||
/// | ||
/// ```compile_fail | ||
/// struct BlоckБлок; | ||
/// | ||
/// fn main() { | ||
/// let _block = BlockБлок; | ||
/// } | ||
/// | ||
/// // Compile output: | ||
/// // | ||
/// // error[E0425]: cannot find value `BlockБлок` in this scope | ||
/// // --> src/main.rs:4:18 | ||
/// // | | ||
/// // 1 | struct BlоckБлок; | ||
/// // | ----------------- similarly named unit struct `BlоckБлок` defined here | ||
/// // ... | ||
/// // 4 | let _block = BlockБлок; | ||
/// // | ^^^^^^^^^ help: a unit struct with a similar name exists: `BlоckБлок` | ||
/// ``` | ||
pub MIXED_LOCALE_IDENTS, | ||
style, | ||
"multiple locales used in a single identifier part" | ||
} | ||
|
||
#[derive(Clone, Debug)] | ||
pub struct MixedLocaleIdents; | ||
|
||
impl_lint_pass!(MixedLocaleIdents => [MIXED_LOCALE_IDENTS]); | ||
|
||
/// Cases that normally can be met in the Rust identifiers. | ||
#[derive(Debug)] | ||
enum Case { | ||
/// E.g. `SomeStruct`, delimiter is uppercase letter. | ||
Camel, | ||
/// E.g. `some_var` or `SOME_VAR, delimiter is `_`. | ||
Snake, | ||
/// E.g. `WHY_AmIWritingThisWay`, delimiters are both `_` and uppercase letter. | ||
Mixed, | ||
} | ||
|
||
fn detect_case(ident: &str) -> Case { | ||
let mut has_underscore = false; | ||
let mut has_upper = false; | ||
let mut has_lower = false; | ||
for c in ident.chars() { | ||
has_underscore |= c == '_'; | ||
has_upper |= c.is_uppercase(); | ||
has_lower |= c.is_lowercase(); | ||
} | ||
|
||
if has_upper && has_lower && has_underscore { | ||
Case::Mixed | ||
} else if !has_upper || !has_lower { | ||
// Because of the statement above we are sure | ||
// that we can't have mixed case already. | ||
// If we have only one case, it's certainly snake case (either lowercase or uppercase). | ||
Case::Snake | ||
} else { | ||
// The only option left. | ||
Case::Camel | ||
} | ||
} | ||
|
||
/// Flag for having confusables in the identifier. | ||
#[derive(Debug, Clone, Copy)] | ||
enum ConfusablesState { | ||
/// Identifier part currently only has confusables. | ||
OnlyConfusables, | ||
/// Identifier part has not only confusables. | ||
NotConfusable, | ||
} | ||
|
||
impl ConfusablesState { | ||
fn from_char(c: char) -> Self { | ||
if Self::is_confusable(c) { | ||
Self::OnlyConfusables | ||
} else { | ||
Self::NotConfusable | ||
} | ||
} | ||
|
||
fn update(&mut self, c: char) { | ||
let is_confusable = Self::is_confusable(c); | ||
*self = match (*self, is_confusable) { | ||
(Self::OnlyConfusables, true) => Self::OnlyConfusables, | ||
_ => Self::NotConfusable, | ||
}; | ||
} | ||
|
||
fn is_confusable(c: char) -> bool { | ||
match c.script() { | ||
Script::Common | Script::Unknown | Script::Latin => { | ||
// `Common` and `Unkown` are not categories we're interested in. | ||
// `Latin` is the primary locale, we don't consider it to be confusable. | ||
false | ||
}, | ||
_ => { | ||
// Otherwise, actually check the symbol. | ||
is_potential_mixed_script_confusable_char(c) | ||
}, | ||
} | ||
} | ||
} | ||
|
||
fn warning_required(locales: &FxHashMap<Script, ConfusablesState>) -> bool { | ||
// If there is a single locale, we don't need a warning (e.g. `око` is a valid Russian | ||
// word that consists of the consufables only). | ||
if locales.len() <= 1 { | ||
return false; | ||
} | ||
|
||
// Otherwise, check whether at least one of locales consists of confusables only. | ||
locales | ||
.iter() | ||
.any(|(_loc, state)| matches!(state, ConfusablesState::OnlyConfusables)) | ||
} | ||
|
||
impl<'tcx> LateLintPass<'tcx> for MixedLocaleIdents { | ||
fn check_name(&mut self, cx: &LateContext<'tcx>, span: Span, ident: Symbol) { | ||
let ident_name = ident.to_string(); | ||
|
||
// First pass just to check whether identifier is fully ASCII, | ||
// without any expensive actions. | ||
// Most of identifiers are *expected* to be ASCII to it's better | ||
// to return early if possible. | ||
if ident_name.is_ascii() { | ||
return; | ||
} | ||
|
||
// Iterate through identifier with respect to its case | ||
// and checks whether it contains mixed locales in a single identifier. | ||
let ident_case = detect_case(&ident_name); | ||
// Positions of the identifier part being checked. | ||
// Used in report to render only relevant part. | ||
let (mut ident_part_start, mut ident_part_end): (usize, usize) = (0, 0); | ||
// List of locales used in the *identifier part* | ||
let mut used_locales: FxHashMap<Script, ConfusablesState> = FxHashMap::default(); | ||
for (id, symbol) in ident_name.chars().enumerate() { | ||
// Check whether we've reached the next identifier part. | ||
let is_next_identifier = match ident_case { | ||
Case::Camel => symbol.is_uppercase(), | ||
Case::Snake => symbol == '_', | ||
Case::Mixed => symbol.is_uppercase() || symbol == '_', | ||
}; | ||
|
||
// If the new identifier started, we should not include its part | ||
// into the report. | ||
if !is_next_identifier { | ||
ident_part_end = id; | ||
} | ||
|
||
if is_next_identifier { | ||
if warning_required(&used_locales) { | ||
// We've found the part that has multiple locales, | ||
// no further analysis is required. | ||
break; | ||
} | ||
// Clear everything and start checking the next identifier. | ||
ident_part_start = id; | ||
used_locales.clear(); | ||
} | ||
|
||
let script = symbol.script(); | ||
if script != Script::Common && script != Script::Unknown { | ||
used_locales | ||
.entry(script) | ||
.and_modify(|e| e.update(symbol)) | ||
.or_insert_with(|| ConfusablesState::from_char(symbol)); | ||
} | ||
} | ||
|
||
if warning_required(&used_locales) { | ||
let ident_len = ident_part_end - ident_part_start + 1; | ||
let ident_part: String = ident_name.chars().skip(ident_part_start).take(ident_len).collect(); | ||
let locales: Vec<&'static str> = used_locales | ||
.iter() | ||
.filter_map(|(loc, state)| match state { | ||
ConfusablesState::OnlyConfusables => Some(loc.full_name()), | ||
ConfusablesState::NotConfusable => None, | ||
}) | ||
.collect(); | ||
|
||
assert!( | ||
!locales.is_empty(), | ||
"Warning is required; having no locales to report is a bug" | ||
); | ||
|
||
let message = format!( | ||
"identifier part `{}` contains confusables-only symbols from the following locales: {}", | ||
ident_part.trim_start_matches('_'), | ||
locales.join(", "), | ||
); | ||
|
||
span_lint(cx, MIXED_LOCALE_IDENTS, span, &message); | ||
} | ||
} | ||
} |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
#![warn(clippy::mixed_locale_idents)] | ||
#![allow(dead_code, non_camel_case_types, confusable_idents, non_upper_case_globals)] | ||
|
||
mod should_spawn_warnings { | ||
// In the examples, cyrillic `о` is used in `Blоck`. | ||
|
||
pub struct Blоck; | ||
pub const BLОCK: u8 = 42; | ||
pub fn blоck() { | ||
let blоck = 42u8; | ||
} | ||
|
||
// Identifiers that consist of multiple parts. | ||
pub struct SomeBlоckIdentifier; | ||
pub const SOME_BLОCK_IDENT: u8 = 42; | ||
pub fn some_blоck_fn() { | ||
let some_blоck_var = 42u8; | ||
} | ||
pub struct Some_BlоckIdent; // Mixed case | ||
|
||
// Identifiers that have multiple matches. | ||
// Only the first match should be reported. | ||
pub struct BlоckClоck; | ||
pub const BLОCK_CLОCK: u8 = 42; | ||
pub fn blоck_clоck() { | ||
let blоck_clоck = 42u8; | ||
} | ||
|
||
// Identifiers that have both latin & non-latin word, and | ||
// mixed case word. | ||
pub struct SomeБлокBlоck; | ||
|
||
// Identifier that has 3 locales, one of which is not confusable, and one is. | ||
// It must not complain about Chinese (as it's not confusable), but report | ||
// Cyrillic instead. | ||
pub struct Blоck看; | ||
} | ||
|
||
mod should_not_spawn_warnings { | ||
// In all the examples, `блок` is fully cyrillic. | ||
|
||
pub struct TryБлок; | ||
pub const TRY_БЛОК: u8 = 42; | ||
pub fn try_блок() { | ||
let try_блок_var = 42u8; | ||
} | ||
// Mixed case | ||
pub struct Some_БлокIdent; | ||
|
||
// Using non-confusables in ident together with confusables. | ||
fn fnъуъ() { | ||
let try看 = 42u8; | ||
} | ||
|
||
// Using only lating confusables (`o` is latin). | ||
fn ooo_блок() {} | ||
|
||
// One-word non-latin identifiers that contain non-confusables. | ||
fn блок() {} | ||
} | ||
|
||
// Checks to see that some edge cases do not cause panics. | ||
mod render_tests { | ||
// One-letter Latin identifier. Should not trigger the warning. | ||
struct O; | ||
|
||
// One-letter Cyrillic identifier. | ||
struct О; | ||
|
||
// Multiple underscores (`O` is cyrillic; `o` is latin). | ||
const __ZZZ___О__: u8 = 42; | ||
|
||
const __ZZZ___Оo__: u8 = 42; | ||
} | ||
|
||
fn main() { | ||
// Additional examples that should *not* spawn the warning. | ||
|
||
// Should not spawn the lint, as it represents valid sequence in a single locale. | ||
let nutzer_zähler = "user counter"; | ||
} |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so a problem with this is that not all writing systems have uppercase. More thought needs to be put into how that will work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But perhaps using only mixed script confusables without underscores in a character is enough for us for now. Idk.