[Backend Tester] Add backend test suite skeleton #11960
Conversation
**Dr. CI:** 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11960. Note: links to docs will display an error until the docs builds have completed. ❌ 1 new failure as of commit 2eb2407 with merge base fee2bd9.
What will happen for the following two scenarios?
In both cases, it should be fine. These tests exercise the operators end to end, in the same form they appear in the model. If an operator is decomposed internally, the test still validates correctness. If it isn't partitioned, that's also fine; the tests will still pass.
What about the case where ops should be partitioned and delegated, but the partitioner misses some cases, so they aren't partitioned and fall back to CPU? Also, how do we measure the closeness between the baseline and the backend op?
I might not be fully understanding the question, but this should also be fine. The purpose of these tests is twofold: (1) does the partitioner/lowering error out, rather than simply not partitioning things it can't handle, and (2) if something is partitioned, are the outputs reasonably close to the eager values? On numerical accuracy, I've set very loose tolerances, as my intent is mainly to catch things that are blatantly incorrect rather than to police specific numerical deviations. The check looks at the full graph output, so if something is decomposed and/or partially lowered, it considers the net output regardless of how the graph is handled internally.
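As a rough illustration of that comparison, it amounts to something like the following (a minimal sketch; the actual tolerance values and helper names in the suite may differ):

```python
import torch

# Loose tolerances, intended to catch blatantly wrong outputs rather
# than to police small numerical deviations (values are illustrative).
ATOL = 1e-1
RTOL = 4e-2

def outputs_match(eager_out: torch.Tensor, backend_out: torch.Tensor) -> bool:
    # Compare the full graph output against eager mode. It doesn't matter
    # how the graph was decomposed or partially lowered internally; only
    # the net output is checked.
    return torch.allclose(backend_out, eager_out, atol=ATOL, rtol=RTOL)
```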
I think the following will help
Too many comments, so I will try to keep them structured.
- I would like to better understand and document what gets in/out of this suite, but I like this idea.
- I hope we can figure out the reporting for this.
- Another consideration is versioning, i.e. XNNPACK is compliant to v1.2 vs. CoreML to v1.1, etc. We can tie it to ET versions.
- FC/BC is also a pain with this, so we have to think it through a bit more.
- Lastly, what's the status of the suite itself: is it ready or WIP, what is the update cadence, etc.?
```
@@ -0,0 +1,15 @@
# Operator Compliance Test Suite

This directory contains operator tests that all backends are expected to pass. While not every backend will implement every operator or permutation, the expectation is that backend partitioners will only partition nodes that the backend can support. The partitioner should never error out due to not supporting an input node.
```
Say why, i.e. what makes these tests a must-pass? I know it, but document it.
I imagine we will build up "qualifying criteria" for adding new things to or removing things from this compliance suite, and define what it means for a backend to be compliant when a user reads it. Also indicate that this is purely functional, with no bar on performance. We also need to distinguish between compliant-and-partitioned vs. not-partitioned when we present this, but that's for later.
I've rewritten this section to word it a bit more generally while we align on the exact expectations for backends.
```python
ALL_TEST_FLOWS = []

if is_backend_enabled("xnnpack"):
```
The reverse of this would be something like is_backend_ready_for_compliance_testing(), where someone has to explicitly list their backend for the method to return no, and it defaults to yes. And of course you can override it for testing, etc.
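For reference, a minimal sketch of how an environment-variable-based gate like `is_backend_enabled` could work (assuming the `ET_TEST_ENABLED_BACKENDS` variable mentioned in the summary below; the real implementation may differ):

```python
import os

def is_backend_enabled(backend: str) -> bool:
    # If ET_TEST_ENABLED_BACKENDS is unset, all backends are enabled;
    # otherwise it is treated as a comma-separated allowlist.
    enabled = os.environ.get("ET_TEST_ENABLED_BACKENDS")
    if enabled is None:
        return True
    return backend in enabled.split(",")
```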
```python
    torch.float64,
]

FLOAT_DTYPES = [
```
Missing bfloat16?
I guess there are no complex types or fp8 cases yet.
ET seems to have a bit of difficulty (outside of the backends) with bf16, so I excluded it for now. I can add a note. For example, I can't lower a graph that adds two bf16 tensors with no delegation.
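For context, a minimal repro of the kind of graph being described (a sketch; assumes the standard `torch.export` / `to_edge` / `to_executorch` flow with no delegation):

```python
import torch
from executorch.exir import to_edge

class AddModule(torch.nn.Module):
    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return x + y

# Adding two bf16 tensors; lowering this without any delegation is the
# case reported to fail above.
inputs = (
    torch.randn(4, dtype=torch.bfloat16),
    torch.randn(4, dtype=torch.bfloat16),
)
exported = torch.export.export(AddModule(), inputs)
program = to_edge(exported).to_executorch()
```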
What's the issue with lowering add(bf16, bf16)? Portable op support, export, or an ET issue?
```python
class Add(OperatorTest):
    @dtype_test
    def test_add_dtype(self, dtype, tester_factory: Callable) -> None:
        self._test_op(
```
Instead of calling this method here, I was wondering why we don't just get an instance of a "TestInputs" class that gives you the logical name, module definition, test inputs, etc. This would disconnect the inputs from the code consuming them, and allow us to repurpose them if we want to do different things with them. What do you think? A sketch of the idea is below.
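A hypothetical sketch of that suggestion (names are illustrative, not part of the PR):

```python
from dataclasses import dataclass
from typing import Any

import torch

@dataclass
class TestInputs:
    # A logical name for the case, e.g. "add_f32".
    logical_name: str
    # The module under test, defined by the suite rather than the backend.
    module: torch.nn.Module
    # Example inputs to run the module with.
    inputs: tuple[Any, ...]
```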
The whole idea is that backends can't choose the test module; the suite will. The backend can use the ET APIs. If they partition it, it is expected to work correctly. If they don't partition it, that is also OK.
The idea is that if I have an op A, and the backend partitioner makes a mistake, doesn't partition it, and falls back to portable, the test will still pass, but it won't detect the mistake. Also, say the backend supports an ATen op (not portable); how do we include that here? My goal is to make the test suite more generic.
@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D77967739.
Because it is not a mistake: from ET's point of view this is still an OK outcome, since we could generate a valid PTE. From the backend's point of view, there should be a report that says this backend didn't partition op A with such-and-such args. You, or someone looking at the report from a backend-partition-behavior point of view, should still be able to "find" it. Making a test fail requires knowing what is expected to work, and that, I feel, is the backend's job. I.e., testing a given backend partition for "this is expected to work" is a job for the individual backend's tests. I hope this helps.
Gregory had a simple FACTO configuration for the Linear op. In that case, I would imagine that if one doesn't partition, we just stop the test there. I am not in favor of listing every single "do-not-decompose" op in the FACTO config, just the common ones; the burden of testing all of them falls on the backend tests.
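For instance, the report entry for an unpartitioned op could carry something like the following (a hypothetical sketch; field names are illustrative, not from the PR):

```python
from dataclasses import dataclass

@dataclass
class PartitionReportEntry:
    backend: str       # e.g. "xnnpack"
    op: str            # e.g. "aten.add.Tensor"
    args_summary: str  # dtypes/shapes of the inputs that were seen
    partitioned: bool  # False => fell back to portable/CPU
```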
I think I understand. But I don't want this to be seen as a substitute for individual backend tests. If you say "let's add enough configurability so we can repurpose this framework to test my backend," I think that can happen organically, like the XNNPACK Tester, but I don't want that to be the goal for this effort, especially not for GA.
Looks good. Left some comments.
I am glad we dropped the word "compliance" from this. :)
```python
# Generate test cases for each backend flow.
def _create_tests(cls):
    for key in dir(cls):
```
Assert the cls type? E.g., something like the sketch below.
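A sketch of what that could look like (assumes `OperatorTest` and `ALL_TEST_FLOWS` from the snippets above; `_make_variant` is a hypothetical helper, and the name construction is illustrative):

```python
def _create_tests(cls):
    # Guard against applying the generator to an unrelated class
    # (addresses the "assert cls type?" question above).
    assert issubclass(cls, OperatorTest), f"expected an OperatorTest subclass, got {cls!r}"
    for key in dir(cls):
        if not key.startswith("test_"):
            continue
        base_test = getattr(cls, key)
        for flow in ALL_TEST_FLOWS:
            # Register one variant of the test per backend flow,
            # e.g. test_add_dtype_xnnpack.
            setattr(cls, f"{key}_{flow.name}", _make_variant(base_test, flow))
    return cls
```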
### Summary

Add the initial skeleton of reporting code for the backend tester. This PR is primarily focused on putting the hooks and runner structure in place. Follow-up work will expand the scope of collection and reporting outputs.

This PR adds the following:
- CLI runner for the test suite.
- Basic test result breakdown by success/fail and cause (failing in lowering vs. output mismatch, for example).
- Refactoring of test suite logic to clean things up.

Next steps:
- Aggregate results by flow (backend).
- Add additional CLI flags to allow filtering backends and dtypes.
- Land more of the operator test suite.
- Wire up flows for quantized operators.

Note that this PR is stacked on (and thus includes) #11960. I accidentally broke my ghstack, so I'm converting this to a normal PR.

Sample output (XNNPACK):
```
Test Session Summary:
84 Passed / 95
11 Failed / 95
0 Skipped / 95

[Success]
	66 Delegated
	18 Undelegated

[Failure]
	4 Lowering Fail
	0 PTE Load Fail
	0 PTE Run Fail
	6 Output Mismatch Fail
```

Reproduce with `ET_TEST_ENABLED_BACKENDS=xnnpack python -m executorch.backends.test.suite.runner executorch.backends.test.suite`. I've temporarily commented out non-f32 dtypes to work around some crashes in XNNPACK, which are non-recoverable from Python.
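The failure causes in that breakdown suggest a small result taxonomy. A hedged sketch of what such categories might look like in code (names are illustrative, not necessarily the suite's actual enum):

```python
from enum import Enum, auto

class TestResult(Enum):
    SUCCESS_DELEGATED = auto()    # ran via the backend delegate
    SUCCESS_UNDELEGATED = auto()  # fell back entirely to portable ops
    LOWERING_FAIL = auto()        # partitioner/lowering errored out
    PTE_LOAD_FAIL = auto()        # generated PTE failed to load
    PTE_RUN_FAIL = auto()         # PTE loaded but execution failed
    OUTPUT_MISMATCH = auto()      # output diverged from eager beyond tolerance
```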
This PR lays the foundation for the new backend test suite. It includes a skeleton of the runner (iterated on in this stack) and basic tests for a few elementwise operators.