Skip to content

[aie][aie2p] Global Isel : trunc vector when size(src)/size(dst) = 4 #496

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: aie-public
Choose a base branch
from

Conversation

newling
Copy link
Contributor

@newling newling commented Jun 20, 2025

Add instruction selection tablegen pattern for

 %1 = trunc <32 x i32> %0 to <32 x i8>

It is based on existing patterns for

 %1 = trunc <32 x i32> %0 to <32 x i16>

and

%1 = trunc <32 x i16> %0 to <32 x i8>

The final asm with this new lowering for trunc <32 x i32> %0 to <32 x i8> is

     vlda     x0, [p0, #0];          vldb     x1, [p0, #64]
     [...]
     ret     lr;             vshuffle        x2, x0, x1, r0
     mova    r1, #0;         vshuffle        x0, x0, x1, r0  //  Delay Slot 5
     vshuffle        x0, x0, x2, r1                  //  Delay Slot 4
     [...]

Questions for reviewers:

  • I couldn't find a way in tablegen to do this with just 2 shuffles, and I'm not sure if it's possible with tablegen. The first 2 vshuffles compute the same thing (question: is CSE after instruction selection possible?). Maybe if I wrote the pattern in C++ I could do it with 2 shuffles and avoid the 'rematerialization', but will the final code be more efficient?

  • I see in the tests before this PR, tests of accregbank and vecregbank. Do I need to have the same in my tests?

  • Do I need end-to-end tests? Full llc run, some kind of numerical test?

@newling newling changed the title [aie][aie2p] Global Isel : trunc where dst is 1/4 bitsize of src [aie][aie2p] Global Isel : trunc vector when size(src)/size(dst) = 4 Jun 20, 2025
Copy link
Collaborator

@konstantinschwarz konstantinschwarz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks fine to me.

Ideally we would be able to do this "legalization" in the Legalizer, then we don't have to include this complexity into every AIE instruction selector.
However multiple G_TRUNC steps are combined in the artifact combiner during legalization, creating an infinite loop in the legalizer :/

@niwinanto
Copy link
Collaborator

The approach looks fine to me.

Ideally we would be able to do this "legalization" in the Legalizer, then we don't have to include this complexity into every AIE instruction selector. However multiple G_TRUNC steps are combined in the artifact combiner during legalization, creating an infinite loop in the legalizer :/

I agree with @konstantinschwarz Moving this complex ISel pattern to legalizer makes much more sense. However, did you try vshuffle with mode 36, this might help you to trunc vi32 to vi8 directly.


const auto SrcElmBits = SrcTy.getElementType().getSizeInBits();
if (SrcElmBits != 64 && SrcElmBits != 32 && SrcElmBits != 16)
return false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps assert, or add the assumptions on DstElemBits as a comment. I assume we know it is <= SrcElemBits because we have a G_TRUNC and all truncations are legal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this logic with additional comments and asserts

@newling
Copy link
Contributor Author

newling commented Jun 24, 2025

My initial approach was via legalizing (i32->i8) --> (i32->i16->i8) but as predicted @konstantinschwarz this does get undone by a combiner. Maybe it's possible to adjust the list of combines used here but that might prevent a good optimization.

@niwinanto your mode 36 suggestion looks very promising, I'm investigating now!

@newling
Copy link
Contributor Author

newling commented Jun 24, 2025

@niwinanto the problem I am facing using shuffle's mode 36 truncation is that it can do v16i32 -> v16i8 but the result there (v16i8) is 128-bits and I can't find an equivalent of EXTRACT_SUBREG [...] sub_256_lo for 128-bits. Is that because the underlying HW registers are 256-bits?

@niwinanto
Copy link
Collaborator

@niwinanto the problem I am facing using shuffle's mode 36 truncation is that it can do v16i32 -> v16i8 but the result there (v16i8) is 128-bits and I can't find an equivalent of EXTRACT_SUBREG [...] sub_256_lo for 128-bits. Is that because the underlying HW registers are 256-bits?

@newling Yes, you are right. Basic HW register for vector registers are 256-bit wide, so we cannot extract a subregister smaller than than 256-bit. However, we have 128-bit q registers and there is a vector move operation which moves from 256 bit register into 128 bit register with truncation of upper 128 bits. Please take a look at VMOV_alu_mv_mv_w_to_q, this instruction can help here.

@niwinanto
Copy link
Collaborator

@newling change looks good to me. Just one feed back to include more tests in the legalizer. Also, would be nice if you can rephrase the commit message and squash the clang format commit to respective commit before the merge.

@newling
Copy link
Contributor Author

newling commented Jun 26, 2025

Also, would be nice if you can rephrase the commit message and squash the clang format commit to respective commit before the merge.

Will do. I'm still unsure what the recommended workflow is here, I've only worked on projects where all commits get squashed into a single commit when the PR is merged into main, as outlined here/ I guess in this project you prefer large 'feature' PRs where each commit is a single step towards the feature. As opposed to a sequence of small PRs? Anyway, I'm happy to adapt whatever you all use as a workflow :)

@newling newling force-pushed the trunc_v32i32_v32i8 branch 2 times, most recently from 3676e33 to 9324566 Compare June 26, 2025 17:06
niwinanto
niwinanto previously approved these changes Jun 26, 2025
Copy link
Collaborator

@niwinanto niwinanto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the feedback. LGTM!

@newling newling force-pushed the trunc_v32i32_v32i8 branch 2 times, most recently from b5a23e9 to 14f4523 Compare June 27, 2025 01:18
@andcarminati
Copy link
Collaborator

Also, would be nice if you can rephrase the commit message and squash the clang format commit to respective commit before the merge.

Will do. I'm still unsure what the recommended workflow is here, I've only worked on projects where all commits get squashed into a single commit when the PR is merged into main, as outlined here/ I guess in this project you prefer large 'feature' PRs where each commit is a single step towards the feature. As opposed to a sequence of small PRs? Anyway, I'm happy to adapt whatever you all use as a workflow :)

I think this PR is in a nice shape now, only a small suggestion: append [AIE2p] in front of each commit message.

@newling newling force-pushed the trunc_v32i32_v32i8 branch 2 times, most recently from 92354f8 to a815f2b Compare June 27, 2025 18:09
@newling
Copy link
Contributor Author

newling commented Jul 7, 2025

Ping

Copy link
Collaborator

@konstantinschwarz konstantinschwarz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. It would be nice to add LLVM IR -> assembly tests for all types that we expect to work, similar to https://github.com/Xilinx/llvm-aie/blob/aie-public/llvm/test/CodeGen/AIE/extractelement.ll

@newling
Copy link
Contributor Author

newling commented Jul 14, 2025

TODO(newling) don't land before I've verified numerical correctness of this transform.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants