Change significance to be determined by IQR fencing #996

Merged
merged 5 commits on Sep 9, 2021
43 changes: 26 additions & 17 deletions docs/comparison-analysis.md
@@ -1,6 +1,6 @@
# Comparison Analysis

The following is a detailed explanation of the process undertaken to automate the analysis of test results for two artifacts of interest (artifact A and B).

This analysis can be done by hand, by using the [comparison page](https://perf.rust-lang.org/compare.html) and entering the two artifacts of interest in the form at the top.

@@ -18,11 +18,11 @@ At the core of comparison analysis are the collection of test results for the tw
100 * ((statisticForArtifactB - statisticForArtifactA) / statisticForArtifactA)
```
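
As a quick worked example (the numbers here are made up), the same formula in Rust:

```rust
fn main() {
    // Hypothetical statistics for one test case on each artifact.
    let statistic_for_artifact_a = 200_000_000.0_f64;
    let statistic_for_artifact_b = 210_000_000.0_f64;

    // Same formula as above; a positive value means the statistic grew from A to B.
    let relative_change = 100.0
        * ((statistic_for_artifact_b - statistic_for_artifact_a) / statistic_for_artifact_a);
    println!("relative change: {relative_change:.1}%"); // prints 5.0%
}
```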

## High-level analysis description

Analysis of the changes is performed to determine whether artifact B represents a performance change over artifact A. At a high level, the analysis takes the following form:

How many _significant_ test results indicate performance changes and what is the magnitude of the changes (i.e., how large are the changes regardless of the direction of change)?

* If there are both improvements and regressions with a magnitude of medium or above, then the comparison is mixed.
* If there are only improvements or only regressions, then the comparison is labeled with that kind.
@@ -37,29 +37,38 @@ Whether we actually _report_ an analysis or not depends on the context and how _

### What makes a test result significant?

A test result is significant if the relative change percentage meets some threshold. What the threshold is depends on whether the test case is "dodgy" or not (see below for an examination of "dodginess"). For dodgy test cases, the threshold is set at 1%. For non-dodgy test cases, the threshold is set to 0.1%.

### What makes a test case "dodgy"?

A test case is "dodgy" if it shows signs of either being noisy or highly variable.

To determine noise and high variability, the previous 100 test results for the test case in question are examined by calculating relative delta changes between adjacent test results. This is done with the following formula (where `testResult1` is the test result immediately preceding `testResult2`):
A test result is significant if the relative change percentage is considered an outlier against historical data. Determining whether a value is an outlier is done through interquartile range "fencing" (i.e., whether a value exceeds a threshold equal to the third quartile plus 1.5 times the interquartile range):

```
(testResult2 - testResult1) / testResult1
interquartile_range = Q3 - Q1
result > Q3 + (interquartile_range * 1.5)
```

Any relative delta change that is above a threshold (currently 0.1) is considered "significant" for the purposes of dodginess detection.
(Assuming the data is ordered, Q3 is the median of the upper half of the data while Q1 is the median of the lower half.)

A highly variable test case is one where a certain percentage (currently 5%) of relative delta changes are significant. The logic is that test cases should only display significant relative delta changes a small percentage of the time.
We ignore the lower fence, because result data is bounded by 0.

A noisy test case is one where, among all the non-significant relative delta changes, the average delta change is still above some threshold (0.001). The logic is that non-significant changes should, on average, be very close to 0. If they are not close to zero, then they are noisy.
This upper fence is often called the "significance threshold".
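
A minimal sketch of the fence calculation in Rust (illustrative only: quartile conventions vary slightly between implementations, and the site's own `upper_fence` uses its own index arithmetic):

```rust
/// Median of an already-sorted slice.
fn median(sorted: &[f64]) -> f64 {
    let n = sorted.len();
    if n % 2 == 0 {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    } else {
        sorted[n / 2]
    }
}

/// Upper IQR fence over sorted historical relative changes.
/// Assumes `sorted` has at least four elements.
fn upper_fence(sorted: &[f64]) -> f64 {
    let q1 = median(&sorted[..sorted.len() / 2]); // median of the lower half
    let q3 = median(&sorted[(sorted.len() + 1) / 2..]); // median of the upper half
    let iqr = q3 - q1;
    q3 + iqr * 1.5
}

fn main() {
    // Hypothetical historical relative changes for one test case.
    let mut history = vec![0.001, 0.0005, 0.002, 0.0008, 0.0015, 0.0007];
    history.sort_by(|a, b| a.partial_cmp(b).unwrap());
    println!("significance threshold: {:.4}", upper_fence(&history));
}
```

Any result whose relative change exceeds this value is treated as significant.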

### How is confidence in whether a test analysis is "relevant" determined?

The confidence in whether a test analysis is relevant depends on the number of significant test results and their magnitude (how large a change is regardless of the direction of the change).
The confidence in whether a test analysis is relevant depends on the number of significant test results and their magnitude.

#### Magnitude

Magnitude is a combination of two factors:
* how large a change is regardless of the direction of the change
* how much that change went over the significance threshold

If a large change only happens to go over the significance threshold by a small factor, then the overall magnitude of the change is considered small.
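
To make the combination concrete, here is a sketch using the bucket boundaries that appear in this PR's `TestResultComparison::magnitude`, which combines the two factors by averaging their buckets:

```rust
/// Bucket a change by how far it exceeds the significance threshold
/// (1 = very small .. 5 = very large). Boundaries taken from this PR.
fn threshold_bucket(change: f64, threshold: f64) -> u8 {
    if change < threshold * 1.5 {
        1
    } else if change < threshold * 3.0 {
        2
    } else if change < threshold * 10.0 {
        3
    } else if change < threshold * 25.0 {
        4
    } else {
        5
    }
}

/// Bucket a change by its absolute size, independent of the threshold.
fn size_bucket(change: f64) -> u8 {
    if change < 0.002 {
        1
    } else if change < 0.01 {
        2
    } else if change < 0.02 {
        3
    } else if change < 0.05 {
        4
    } else {
        5
    }
}

fn main() {
    // A 2.5% change over a 2% significance threshold: large in absolute terms
    // but only barely over the threshold, so the combined magnitude is small.
    let (change, threshold) = (0.025_f64, 0.02_f64);
    let combined = (threshold_bucket(change, threshold) + size_bucket(change)) / 2;
    println!("combined magnitude bucket: {combined}"); // on the 1..=5 scale
}
```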

#### Confidence algorithm

The actual algorithm for determining confidence may change, but in general the following rules apply (a hypothetical sketch follows the list):
* Definitely relevant: any number of very large changes, a small number of large and/or medium changes, or a large number of small changes.
* Probably relevant: any number of large changes, more than one medium change, or a smaller but still substantial number of small changes.
* Definitely relevant: any number of very large or large changes, a small number of medium changes, or a large number of small or very small changes.
* Probably relevant: any number of very large or large changes, any medium change, or a smaller but still substantial number of small or very small changes.
* Maybe relevant: if it doesn't fit into the above two categories, it falls into this category.
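
Purely to illustrate the shape of these rules, a hypothetical sketch (the count cutoffs below are invented for the example and are not the values the site uses):

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Magnitude {
    VerySmall,
    Small,
    Medium,
    Large,
    VeryLarge,
}

enum Relevance {
    Maybe,
    Probably,
    Definitely,
}

// Hypothetical cutoffs, chosen only to mirror the bullet points above.
fn relevance(significant_changes: &[Magnitude]) -> Relevance {
    let at_least = |m| significant_changes.iter().filter(|&&c| c >= m).count();
    if at_least(Magnitude::Large) >= 1
        || at_least(Magnitude::Medium) >= 3
        || significant_changes.len() >= 10
    {
        Relevance::Definitely
    } else if at_least(Magnitude::Medium) >= 1 || significant_changes.len() >= 5 {
        Relevance::Probably
    } else {
        Relevance::Maybe
    }
}

fn main() {
    use Magnitude::*;
    let verdict = match relevance(&[Medium, Small, VerySmall]) {
        Relevance::Definitely => "definitely relevant",
        Relevance::Probably => "probably relevant",
        Relevance::Maybe => "maybe relevant",
    };
    println!("{verdict}");
}
```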

### "Dodgy" Test Cases

"Dodgy" test cases are test cases that tend to produce unreliable results (i.e., noise). A test case is considered "dodgy" if its significance threshold is sufficiently far enough away from 0.
4 changes: 4 additions & 0 deletions site/src/api.rs
@@ -147,10 +147,12 @@ pub mod comparison {
use std::collections::HashMap;

#[derive(Debug, PartialEq, Clone, Serialize, Deserialize)]
#[allow(non_snake_case)]
pub struct Request {
pub start: Bound,
pub end: Bound,
pub stat: String,
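/// When true, use the new IQR-fence based significance calculation.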
pub calcNewSig: Option<bool>,
}

#[derive(Debug, Clone, Serialize)]
@@ -358,9 +360,11 @@ pub mod triage {
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
#[allow(non_snake_case)]
pub struct Request {
pub start: Bound,
pub end: Option<Bound>,
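/// When true, use the new IQR-fence based significance calculation.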
pub calcNewSig: Option<bool>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
164 changes: 125 additions & 39 deletions site/src/comparison.rs
@@ -45,6 +45,7 @@ pub async fn handle_triage(
"instructions:u".to_owned(),
ctxt,
&master_commits,
body.calcNewSig.unwrap_or(false),
)
.await?
{
@@ -89,10 +90,16 @@ pub async fn handle_compare(
) -> Result<api::comparison::Response, BoxedError> {
let master_commits = collector::master_commits().await?;
let end = body.end;
let comparison =
compare_given_commits(body.start, end.clone(), body.stat, ctxt, &master_commits)
.await?
.ok_or_else(|| format!("could not find end commit for bound {:?}", end))?;
let comparison = compare_given_commits(
body.start,
end.clone(),
body.stat,
ctxt,
&master_commits,
body.calcNewSig.unwrap_or(false),
)
.await?
.ok_or_else(|| format!("could not find end commit for bound {:?}", end))?;

let conn = ctxt.conn().await;
let prev = comparison.prev(&master_commits);
@@ -338,7 +345,7 @@ pub async fn compare(
ctxt: &SiteCtxt,
) -> Result<Option<Comparison>, BoxedError> {
let master_commits = collector::master_commits().await?;
compare_given_commits(start, end, stat, ctxt, &master_commits).await
compare_given_commits(start, end, stat, ctxt, &master_commits, false).await
}

/// Compare two bounds on a given stat
@@ -348,6 +355,7 @@ async fn compare_given_commits(
stat: String,
ctxt: &SiteCtxt,
master_commits: &[collector::MasterCommit],
calc_new_sig: bool,
) -> Result<Option<Comparison>, BoxedError> {
let a = ctxt
.artifact_id_for_bound(start.clone(), true)
@@ -387,6 +395,7 @@ .as_ref()
.as_ref()
.and_then(|v| v.data.get(&test_case).cloned()),
results: (a, b),
calc_new_sig,
})
})
.collect();
@@ -635,28 +644,53 @@ impl BenchmarkVariance {
self.data.push(value);
}

fn mean(&self) -> f64 {
self.data.iter().sum::<f64>() / self.data.len() as f64
/// The percent change of the deltas sorted from smallest delta to largest
fn percent_changes(&self) -> Vec<f64> {
let mut deltas = self
.deltas()
.zip(self.data.iter())
.map(|(d, &r)| d / r)
.collect::<Vec<_>>();
deltas.sort_by(|d1, d2| d1.partial_cmp(d2).unwrap_or(std::cmp::Ordering::Equal));
deltas
}

fn upper_fence(&self) -> f64 {
let pcs = self.percent_changes();
fn median(data: &[f64]) -> f64 {
if data.len() % 2 == 0 {
(data[(data.len() - 1) / 2] + data[data.len() / 2]) / 2.0
} else {
data[data.len() / 2]
}
}

let len = pcs.len();
let (h1_end, h2_begin) = if len % 2 == 0 {
(len / 2 - 2, len / 2 + 1)
} else {
(len / 2 - 1, len / 2 + 1)
};
// significance is determined by the upper
// interquartile range fence
let q1 = median(&pcs[..=h1_end]);
let q3 = median(&pcs[h2_begin..]);
let iqr = q3 - q1;
q3 + (iqr * 1.5)
}

fn calculate_description(&mut self) {
self.description = BenchmarkVarianceDescription::Normal;

let results_mean = self.mean();
let mut deltas = self
.data
.windows(2)
.map(|window| (window[0] - window[1]).abs())
.collect::<Vec<_>>();
deltas.sort_by(|d1, d2| d1.partial_cmp(d2).unwrap_or(std::cmp::Ordering::Equal));
let non_significant = deltas
let percent_changes = self.percent_changes();
let non_significant = percent_changes
.iter()
.zip(self.data.iter())
.take_while(|(&d, &r)| d / r < Self::SIGNFICANT_DELTA_THRESHOLD)
.take_while(|&&c| c < Self::SIGNFICANT_DELTA_THRESHOLD)
.collect::<Vec<_>>();

let percent_significant_changes =
((deltas.len() - non_significant.len()) as f64 / deltas.len() as f64) * 100.0;
let percent_significant_changes = ((percent_changes.len() - non_significant.len()) as f64
/ percent_changes.len() as f64)
* 100.0;
debug!(
"Percent significant changes: {:.1}%",
percent_significant_changes
@@ -668,20 +702,30 @@ impl BenchmarkVariance {
}

let delta_mean =
non_significant.iter().map(|(&d, _)| d).sum::<f64>() / (non_significant.len() as f64);
let ratio_change = delta_mean / results_mean;
debug!("Ratio change: {:.3}", ratio_change);
if ratio_change > Self::NOISE_THRESHOLD {
non_significant.iter().cloned().sum::<f64>() / (non_significant.len() as f64);
debug!("Ratio change: {:.4}", delta_mean);
if delta_mean > Self::NOISE_THRESHOLD {
self.description = BenchmarkVarianceDescription::Noisy;
}
}

// Absolute deltas between adjacent results
fn deltas(&self) -> impl Iterator<Item = f64> + '_ {
self.data
.windows(2)
.map(|window| (window[0] - window[1]).abs())
}

/// Whether we can trust this benchmark or not
fn is_dodgy(&self) -> bool {
matches!(
self.description,
BenchmarkVarianceDescription::Noisy | BenchmarkVarianceDescription::HighlyVariable
)
fn is_dodgy(&self, calc_new_sig: bool) -> bool {
if !calc_new_sig {
matches!(
self.description,
BenchmarkVarianceDescription::Noisy | BenchmarkVarianceDescription::HighlyVariable
)
} else {
self.upper_fence() > 0.002
}
}
}

@@ -736,11 +780,12 @@ pub struct TestResultComparison {
scenario: Scenario,
variance: Option<BenchmarkVariance>,
results: (f64, f64),
calc_new_sig: bool,
}

impl TestResultComparison {
/// The amount of relative change considered significant when
/// the test case is not dodgy
/// we cannot determine from historical data
const SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD: f64 = 0.002;

/// The amount of relative change considered significant when
@@ -761,33 +806,74 @@ impl TestResultComparison {
}

fn signifcance_threshold(&self) -> f64 {
if self.is_dodgy() {
Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD_DODGY
if !self.calc_new_sig {
if self.is_dodgy() {
Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD_DODGY
} else {
Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD
}
} else {
Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD
self.variance
.as_ref()
.map(|s| s.upper_fence())
.unwrap_or(Self::SIGNIFICANT_RELATIVE_CHANGE_THRESHOLD)
}
}

fn magnitude(&self) -> Magnitude {
let mag = self.relative_change().abs();
let change = self.relative_change().abs();
let threshold = self.signifcance_threshold();
if mag < threshold * 1.5 {
let over_threshold = if change < threshold * 1.5 {
Magnitude::VerySmall
} else if mag < threshold * 3.0 {
} else if change < threshold * 3.0 {
Magnitude::Small
} else if mag < threshold * 10.0 {
} else if change < threshold * 10.0 {
Magnitude::Medium
} else if mag < threshold * 25.0 {
} else if change < threshold * 25.0 {
Magnitude::Large
} else {
Magnitude::VeryLarge
};
if !self.calc_new_sig {
return over_threshold;
}
let change_magnitude = if change < 0.002 {
Magnitude::VerySmall
} else if change < 0.01 {
Magnitude::Small
} else if change < 0.02 {
Magnitude::Medium
} else if change < 0.05 {
Magnitude::Large
} else {
Magnitude::VeryLarge
};
fn as_u8(m: Magnitude) -> u8 {
match m {
Magnitude::VerySmall => 1,
Magnitude::Small => 2,
Magnitude::Medium => 3,
Magnitude::Large => 4,
Magnitude::VeryLarge => 5,
}
}
fn from_u8(m: u8) -> Magnitude {
match m {
1 => Magnitude::VerySmall,
2 => Magnitude::Small,
3 => Magnitude::Medium,
4 => Magnitude::Large,
_ => Magnitude::VeryLarge,
}
}

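// The final magnitude is the rounded-down average of the two buckets:
// how far the change is over the threshold, and how large it is in absolute terms.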
from_u8((as_u8(over_threshold) + as_u8(change_magnitude)) / 2)
}

fn is_dodgy(&self) -> bool {
self.variance
.as_ref()
.map(|v| v.is_dodgy())
.map(|v| v.is_dodgy(self.calc_new_sig))
.unwrap_or(false)
}

@@ -867,7 +953,7 @@ impl std::fmt::Display for Direction {
}

/// The relative size of a performance change
#[derive(Debug, PartialOrd, PartialEq, Ord, Eq)]
#[derive(Clone, Copy, Debug, PartialOrd, PartialEq, Ord, Eq)]
pub enum Magnitude {
VerySmall,
Small,
1 change: 1 addition & 0 deletions site/static/compare.html
@@ -766,6 +766,7 @@ <h2>Comparing <span id="stat-header">{{stat}}</span> between <span id="before">{
end: "",
stat: "instructions:u",
}, state);
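// calcNewSig may arrive as the string 'true' from the page state; coerce it to a real boolean.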
values["calcNewSig"] = values.calcNewSig === 'true';
makeRequest("/get", values).then(function (data) {
app.data = data;
});