Skip to content

[Backend Tester] Add backend test suite skeleton #11960

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 14, 2025

Conversation

GregoryComer
Copy link
Member

@GregoryComer GregoryComer commented Jun 25, 2025

This PR lays the foundation for the new backend test suite. It includes a skeleton of the runner (iterated on in the stack) and some basic tests for some elementwise operators.

@GregoryComer GregoryComer requested a review from cccclai as a code owner June 25, 2025 06:36
@GregoryComer
Copy link
Member Author

GregoryComer commented Jun 25, 2025

Copy link

pytorch-bot bot commented Jun 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11960

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 2eb2407 with merge base fee2bd9 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 25, 2025
@GregoryComer GregoryComer changed the title Add compliance suite skeleton and operator tests [[Backend Tester] Add compliance suite skeleton and operator tests Jun 25, 2025
GregoryComer added a commit that referenced this pull request Jun 25, 2025
ghstack-source-id: a21d704
ghstack-comment-id: 3003538522
Pull-Request: #11960
@GregoryComer GregoryComer changed the title [[Backend Tester] Add compliance suite skeleton and operator tests [Backend Tester] Add compliance suite skeleton and operator tests Jun 25, 2025
@cccclai
Copy link
Contributor

cccclai commented Jun 25, 2025

What will happen for the following two scenarios?

  • the backend doesn't support this particular operator, but support the decomposed version instead?
  • the backend supports a particular aten op, that will be decomposed later?

@GregoryComer
Copy link
Member Author

What will happen for the following two scenarios?

  • the backend doesn't support this particular operator, but support the decomposed version instead?
  • the backend supports a particular aten op, that will be decomposed later?

In both cases, it should be fine. These tests focus on the operators end to end essentially, in the same way they appear in the model. If they are decomposed internally, it will still validate the correctness. If they're not partitioned, it's also fine, as the tests will pass.

@cccclai
Copy link
Contributor

cccclai commented Jun 25, 2025

In

What will happen for the following two scenarios?

  • the backend doesn't support this particular operator, but support the decomposed version instead?
  • the backend supports a particular aten op, that will be decomposed later?

In both cases, it should be fine. These tests focus on the operators end to end essentially, in the same way they appear in the model. If they are decomposed internally, it will still validate the correctness. If they're not partitioned, it's also fine, as the tests will pass.

How about the case that they should partition and delegated, but it was missing some cases, hence not partition and fall back to cpu? Also, how do we measure the closeness between the baseline and the backend op?

@GregoryComer GregoryComer requested a review from digantdesai June 25, 2025 20:42
@GregoryComer
Copy link
Member Author

In

What will happen for the following two scenarios?

  • the backend doesn't support this particular operator, but support the decomposed version instead?
  • the backend supports a particular aten op, that will be decomposed later?

In both cases, it should be fine. These tests focus on the operators end to end essentially, in the same way they appear in the model. If they are decomposed internally, it will still validate the correctness. If they're not partitioned, it's also fine, as the tests will pass.

How about the case that they should partition and delegated, but it was missing some cases, hence not partition and fall back to cpu? Also, how do we measure the closeness between the baseline and the backend op?

I might not be fully understanding the question, but this should also be fine. The purpose of these tests are twofold: (1) does the partitioner/lowering error out instead of not partitioning things it can't handle, and (2) if something is partitioned, are the outputs reasonably close to the eager value?

On numerical accuracy, I've set very loose tolerances, as my intent is mainly to catch things that are blatantly incorrect, rather than get into specific numerical deviation. It looks at the full graph output, so if something is decomposed and/or partially lowered, it will consider the net output regardless of how it's handled internally.

@cccclai
Copy link
Contributor

cccclai commented Jun 26, 2025

I think the following will help

  1. make it configurable, meaning that backend can note what op is supported and what op is not, and being able to inject their own ops
  2. check the graph (or the node) after partition, and make sure the op is lowered. This will help us knowing if the backend is doing the right thing.

Copy link
Contributor

@digantdesai digantdesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many comments, I will try to keep them structured.

  • I would like to better understand and document what gets in/out of this suite but I like this idea.
  • I hope we can figure out the reporting for this.
  • Another consideration is versioning, i.e. XNNPACK is complient to v1.2 vs. coreml to v1.1 etc. We can tie to to ET versions.
  • FC/BC is also a pain with this. So we have to think this through a bit more.
  • Lastly, what's the status of the suite itself, is it ready or WIP, what is the update cadence etc.

@@ -0,0 +1,15 @@
# Operator Compliance Test Suite

This directory contains operator tests that all backends are expected to pass. While not every backend will implement every operator or permutation, the expectation is that backend partitioners will only partition nodes that the backend can support. The partitioner should never error out due to not supporting an input node.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say why? I.e. what makes these tests a must pass. I know it but document it.

I imagine we will build up a "qualifying criteria" for adding new things or removing things from this compliance suite. And what does it mean for a backend to be compliant, when a user reads it. Also indicate this is a purely functional and no bar on performance. Also we need to distinguish between compliant and partitioned vs. not-partitioned, when we present this, but that's for later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've re-written this section to word it a bit more generally while we align on the exact expectations for backends.


ALL_TEST_FLOWS = []

if is_backend_enabled("xnnpack"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverse of this would be something like is_backend_ready_for_complience_testing() and someone has to whitelist their backend for this method to return no, and defaults to yes. And ofc you can override for testing etc.

torch.float64,
]

FLOAT_DTYPES = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing bloat16?

I guess no complex types or fp8 cases yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ET seems to have a bit of difficulty (outside of the backends) with bf16, so I excluded it for now. I can add a note. For example, I can't lower a graph adding 2 bf16 tensors with no delegation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the issue with lowering add(bf16, bf16)? Portable op support or export or et issue?

class Add(OperatorTest):
@dtype_test
def test_add_dtype(self, dtype, tester_factory: Callable) -> None:
self._test_op(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering instead of calling this method here, why don't we just get an instance of "TestInputs" class which gives you logical_name, module definition, test inputs, etc. This is to disconnect the inputs from consuming those. This will allow us to repurpose these if we want to do different things with them. What do you think?

@digantdesai
Copy link
Contributor

I think the following will help

  1. make it configurable, meaning that backend can note what op is supported and what op is not, and being able to inject their own ops

The whole idea is backends can't choose the test module, the suite will. Backend can use the ET APIs, if they partition it - it is expected to work correctly. If they don't partition is that is also OK.

@cccclai
Copy link
Contributor

cccclai commented Jul 2, 2025

I think the following will help

  1. make it configurable, meaning that backend can note what op is supported and what op is not, and being able to inject their own ops

The whole idea is backends can't choose the test module, the suite will. Backend can use the ET APIs, if they partition it - it is expected to work correctly. If they don't partition is that is also OK.

The idea is that, if I have an op A, and the backend partitioner makes a mistake and doesn't partition, and fall back to portable, the test will still pass, but it won't detect the mistakes.

And also, say the backend supports an aten op (not portable), how do we include them here.

My goal is to make the test suit more generic.

@GregoryComer GregoryComer force-pushed the gh/GregoryComer/69/head branch from a931676 to 4408e9a Compare July 8, 2025 17:23
@GregoryComer GregoryComer force-pushed the gh/GregoryComer/69/head branch from 4408e9a to d5041c9 Compare July 8, 2025 17:42
Base automatically changed from gh/GregoryComer/69/head to main July 8, 2025 18:01
@GregoryComer GregoryComer force-pushed the gh/GregoryComer/70/head branch 2 times, most recently from 8cbc094 to 15c73c6 Compare July 8, 2025 21:38
@GregoryComer GregoryComer changed the title [Backend Tester] Add compliance suite skeleton and operator tests [Backend Tester] Add backend test suite skeleton Jul 8, 2025
@GregoryComer GregoryComer force-pushed the gh/GregoryComer/70/head branch from 15c73c6 to 096347f Compare July 8, 2025 22:27
@GregoryComer GregoryComer force-pushed the gh/GregoryComer/70/head branch from 096347f to 164ec42 Compare July 8, 2025 22:30
@facebook-github-bot
Copy link
Contributor

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D77967739.

@GregoryComer GregoryComer force-pushed the gh/GregoryComer/70/head branch from 164ec42 to ea773b1 Compare July 8, 2025 23:06
@GregoryComer GregoryComer added the release notes: none Do not include this in the release notes label Jul 8, 2025
@digantdesai
Copy link
Contributor

I think the following will help

  1. make it configurable, meaning that backend can note what op is supported and what op is not, and being able to inject their own ops

The whole idea is backends can't choose the test module, the suite will. Backend can use the ET APIs, if they partition it - it is expected to work correctly. If they don't partition is that is also OK.

The idea is that, if I have an op A, and the backend partitioner makes a mistake and doesn't partition, and fall back to portable, the test will still pass, but it won't detect the mistakes.

And also, say the backend supports an aten op (not portable), how do we include them here.

My goal is to make the test suit more generic.

it won't detect the mistakes.

Because it is not a mistake, from ET point of view this is still an OK outcome, we could generate a valid PTE. From backend point of view, there should be a report which says this backend didn't partition op A with such and such args. You or someone looking at the report from a backend partition behavior point of view should still be able to "find" it. Making a test fail requires you to know what is expected to work and that I feel is backend's job. I.e. to test a given backend partition for a "this is expected to work" is a job for individual backend's tests. I hope this helps.

And also, say the backend supports an aten op (not portable), how do we include them here.

Gregory had a simple FACTO configuration for Linear op. In that case I would imagine if one doesn't partition we just stop the test then. I am not in favor of listing every single "do-no-decompose" op in FACTO config just a common ones. And the burden for testing all of them falls on the backend tests.

My goal is to make the test suit more generic.

I think I understand. But I don't want this to be seen as a substitute for individual backend tests. If you say, let's add enough configurability so we can repurpose this framework to test my backend. I think that can happen, organically like XNNPACK Tester but I don't want that to be the goal for this effort esp not for GA.

Copy link
Contributor

@digantdesai digantdesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Left some comments.

I am glad we droped "compliance" word from this. :)


# Generate test cases for each backend flow.
def _create_tests(cls):
for key in dir(cls):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert cls type?

@GregoryComer GregoryComer force-pushed the gh/GregoryComer/70/head branch from ea773b1 to 0cea25b Compare July 10, 2025 21:12
@facebook-github-bot
Copy link
Contributor

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D77967739.

@GregoryComer GregoryComer force-pushed the gh/GregoryComer/70/head branch from 0cea25b to ced51a3 Compare July 10, 2025 23:53
@facebook-github-bot
Copy link
Contributor

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D77967739.

@GregoryComer GregoryComer force-pushed the gh/GregoryComer/70/head branch 2 times, most recently from 1d09ca2 to 1adb38f Compare July 11, 2025 00:28
@facebook-github-bot
Copy link
Contributor

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D77967739.

@GregoryComer GregoryComer force-pushed the gh/GregoryComer/70/head branch from 1adb38f to 8eab1d9 Compare July 11, 2025 02:00
@facebook-github-bot
Copy link
Contributor

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D77967739.

@GregoryComer GregoryComer force-pushed the gh/GregoryComer/70/head branch from 8eab1d9 to 3e9930e Compare July 13, 2025 02:45
@GregoryComer GregoryComer force-pushed the gh/GregoryComer/70/head branch from 3e9930e to 2eb2407 Compare July 13, 2025 05:48
@facebook-github-bot
Copy link
Contributor

@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D77967739.

@GregoryComer GregoryComer merged commit effbb0e into main Jul 14, 2025
197 of 200 checks passed
@GregoryComer GregoryComer deleted the gh/GregoryComer/70/head branch July 14, 2025 18:23
GregoryComer added a commit that referenced this pull request Jul 15, 2025
### Summary
Add the initial skeleton of reporting code for the backend tester. This
PR is primarily focused on putting the hooks and runner structure in
place. Follow-up work will expand the scope of collection and reporting
outputs.

This PR adds the following:
- CLI runner for the test suite.
- Basic test result breakdown by success / fail and cause (failing in
lowering vs output mismatch, for example).
- Refactoring of test suite logic to clean things up.

Next steps:
- Aggregate results by flow (backend).
- Add additional CLI flags to allow filtering backends and dtypes.
- Land more of the operator test suite.
- Wire up flows for quantized operators.

Note that this PR is stacked on (and thus includes)
#11960. I accidently broke my
ghstack, so I'm converting this to a normal PR.

Sample output (XNNPACK):
```
Test Session Summary:

   84 Passed / 95
   11 Failed / 95
    0 Skipped / 95

[Success]
   66 Delegated
   18 Undelegated

[Failure]
    4 Lowering Fail
    0 PTE Load Fail
    0 PTE Run Fail
    6 Output Mismatch Fail
```
Reproduce with `ET_TEST_ENABLED_BACKENDS=xnnpack python -m
executorch.backends.test.suite.runner.executorch.backends.test.suite`.
I've temporarily commented out non-f32 dtypes to work around some
crashes in XNNPACK, which are non-recoverable from Python.
lucylq pushed a commit that referenced this pull request Jul 17, 2025
### Summary
Add the initial skeleton of reporting code for the backend tester. This
PR is primarily focused on putting the hooks and runner structure in
place. Follow-up work will expand the scope of collection and reporting
outputs.

This PR adds the following:
- CLI runner for the test suite.
- Basic test result breakdown by success / fail and cause (failing in
lowering vs output mismatch, for example).
- Refactoring of test suite logic to clean things up.

Next steps:
- Aggregate results by flow (backend).
- Add additional CLI flags to allow filtering backends and dtypes.
- Land more of the operator test suite.
- Wire up flows for quantized operators.

Note that this PR is stacked on (and thus includes)
#11960. I accidently broke my
ghstack, so I'm converting this to a normal PR.

Sample output (XNNPACK):
```
Test Session Summary:

   84 Passed / 95
   11 Failed / 95
    0 Skipped / 95

[Success]
   66 Delegated
   18 Undelegated

[Failure]
    4 Lowering Fail
    0 PTE Load Fail
    0 PTE Run Fail
    6 Output Mismatch Fail
```
Reproduce with `ET_TEST_ENABLED_BACKENDS=xnnpack python -m
executorch.backends.test.suite.runner.executorch.backends.test.suite`.
I've temporarily commented out non-f32 dtypes to work around some
crashes in XNNPACK, which are non-recoverable from Python.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. release notes: none Do not include this in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants