Skip to content

Sycl graph poc #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 20 commits into from
Closed

Sycl graph poc #2

wants to merge 20 commits into from

Conversation

reble
Copy link
Owner

@reble reble commented Aug 11, 2022

No description provided.

Developing doc in separate branch
@reble reble temporarily deployed to aws October 1, 2022 19:17 Inactive
@@ -1488,6 +1489,8 @@ __SYCL_EXPORT pi_result piSamplerRelease(pi_sampler sampler);
//
// Queue Commands
//
__SYCL_EXPORT pi_result piKernelLaunch(pi_queue queue);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an idea for an alternative approach here that doesn't involve adding a new PI entry-point, we use piQueueFlush instead. As it already exists in PI with the semantics of starting execution of work lazily scheduled, assuming it works the same as clFlush. I'd then expect a flush to happen on event wait as well as queue wait.

piQueueFlush is also more generic than just for kernel execution commands. So if in the future we wanted more than kernel launch commands to be lazily executed, we wouldn't need to add another entry-point. e.g. could lazily enqueue piEnqueueMemBufferCopy commands and flush them. Rather than having to add a piMemBufferCopy entry-point to match the piEnqueueMemBufferCopy like we've done here with piEnqueueKernelLaunch.

@reble
Copy link
Owner Author

reble commented Oct 28, 2022

Closing that PR, rework in #5

@reble reble closed this Oct 28, 2022
reble pushed a commit that referenced this pull request May 15, 2023
…callback

The `TypeSystemMap::m_mutex` guards against concurrent modifications
of members of `TypeSystemMap`. In particular, `m_map`.

`TypeSystemMap::ForEach` iterates through the entire `m_map` calling
a user-specified callback for each entry. This is all done while
`m_mutex` is locked. However, there's nothing that guarantees that
the callback itself won't call back into `TypeSystemMap` APIs on the
same thread. This lead to double-locking `m_mutex`, which is undefined
behaviour. We've seen this cause a deadlock in the swift plugin with
following backtrace:

```

int main() {
    std::unique_ptr<int> up = std::make_unique<int>(5);

    volatile int val = *up;
    return val;
}

clang++ -std=c++2a -g -O1 main.cpp

./bin/lldb -o “br se -p return” -o run -o “v *up” -o “expr *up” -b
```

```
frame #4: std::lock_guard<std::mutex>::lock_guard
frame #5: lldb_private::TypeSystemMap::GetTypeSystemForLanguage <<<< Lock #2
frame #6: lldb_private::TypeSystemMap::GetTypeSystemForLanguage
frame #7: lldb_private::Target::GetScratchTypeSystemForLanguage
...
frame #26: lldb_private::SwiftASTContext::LoadLibraryUsingPaths
frame #27: lldb_private::SwiftASTContext::LoadModule
frame #30: swift::ModuleDecl::collectLinkLibraries
frame #31: lldb_private::SwiftASTContext::LoadModule
frame #34: lldb_private::SwiftASTContext::GetCompileUnitImportsImpl
frame #35: lldb_private::SwiftASTContext::PerformCompileUnitImports
frame #36: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetSwiftASTContext
frame #37: lldb_private::TypeSystemSwiftTypeRefForExpressions::GetPersistentExpressionState
frame #38: lldb_private::Target::GetPersistentSymbol
frame #41: lldb_private::TypeSystemMap::ForEach                 <<<< Lock #1
frame #42: lldb_private::Target::GetPersistentSymbol
frame #43: lldb_private::IRExecutionUnit::FindInUserDefinedSymbols
frame #44: lldb_private::IRExecutionUnit::FindSymbol
frame #45: lldb_private::IRExecutionUnit::MemoryManager::GetSymbolAddressAndPresence
frame #46: lldb_private::IRExecutionUnit::MemoryManager::findSymbol
frame #47: non-virtual thunk to lldb_private::IRExecutionUnit::MemoryManager::findSymbol
frame #48: llvm::LinkingSymbolResolver::findSymbol
frame #49: llvm::LegacyJITSymbolResolver::lookup
frame #50: llvm::RuntimeDyldImpl::resolveExternalSymbols
frame #51: llvm::RuntimeDyldImpl::resolveRelocations
frame #52: llvm::MCJIT::finalizeLoadedModules
frame #53: llvm::MCJIT::finalizeObject
frame #54: lldb_private::IRExecutionUnit::ReportAllocations
frame #55: lldb_private::IRExecutionUnit::GetRunnableInfo
frame #56: lldb_private::ClangExpressionParser::PrepareForExecution
frame #57: lldb_private::ClangUserExpression::TryParse
frame #58: lldb_private::ClangUserExpression::Parse
```

Our solution is to simply iterate over a local copy of `m_map`.

**Testing**

* Confirmed on manual reproducer (would reproduce 100% of the time
  before the patch)

Differential Revision: https://reviews.llvm.org/D149949
Bensuo pushed a commit that referenced this pull request May 29, 2023
…est unittest

Need to finalize the DIBuilder to avoid leak sanitizer errors
like this:

Direct leak of 48 byte(s) in 1 object(s) allocated from:
    #0 0x55c99ea1761d in operator new(unsigned long)
    #1 0x55c9a518ae49 in operator new
    #2 0x55c9a518ae49 in llvm::MDTuple::getImpl(...)
    #3 0x55c9a4f1b1ec in getTemporary
    #4 0x55c9a4f1b1ec in llvm::DIBuilder::createFunction(...)
reble pushed a commit that referenced this pull request Jun 9, 2023
The motivation for this change is a workload generated by the XLA compiler
targeting nvidia GPUs.

This kernel has a few hundred i8 loads and stores.  Merging is critical for
performance.

The current LSV doesn't merge these well because it only considers instructions
within a block of 64 loads+stores.  This limit is necessary to contain the
O(n^2) behavior of the pass.  I'm hesitant to increase the limit, because this
pass is already one of the slowest parts of compiling an XLA program.

So we rewrite basically the whole thing to use a new algorithm.  Before, we
compared every load/store to every other to see if they're consecutive.  The
insight (from tra@) is that this is redundant.  If we know the offset from PtrA
to PtrB, then we don't need to compare PtrC to both of them in order to tell
whether C may be adjacent to A or B.

So that's what we do.  When scanning a basic block, we maintain a list of
chains, where we know the offset from every element in the chain to the first
element in the chain.  Each instruction gets compared only to the leaders of
all the chains.

In the worst case, this is still O(n^2), because all chains might be of length
1.  To prevent compile time blowup, we only consider the 64 most recently used
chains.  Thus we do no more comparisons than before, but we have the potential
to make much longer chains.

This rewrite affects many tests.  The changes to tests fall into two
categories.

1. The old code had what appears to be a bug when deciding whether a misaligned
   vectorized load is fast.  Suppose TTI reports that load <i32 x 4> align 4
   has relative speed 1, and suppose that load i32 align 4 has relative speed
   32.

   The intent of the code seems to be that we prefer the scalar load, because
   it's faster.  But the old code would choose the vectorized load.
   accessIsMisaligned would set RelativeSpeed to 0 for the scalar load (and not
   even call into TTI to get the relative speed), because the scalar load is
   aligned.

   After this patch, we will prefer the scalar load if it's faster.

2. This patch changes the logic for how we vectorize.  Usually this results in
   vectorizing more.

Explanation of changes to tests:

 - AMDGPU/adjust-alloca-alignment.ll: #1
 - AMDGPU/flat_atomic.ll: #2, we vectorize more.
 - AMDGPU/int_sideeffect.ll: #2, there are two possible locations for the call to @foo, and the pass is brittle to this.  Before, we'd vectorize in case 1 and not case 2.  Now we vectorize in case 2 and not case 1.  So we just move the call.
 - AMDGPU/adjust-alloca-alignment.ll: #2, we vectorize more
 - AMDGPU/insertion-point.ll: #2 we vectorize more
 - AMDGPU/merge-stores-private.ll: #1 (undoes changes from git rev 86f9117, which appear to have hit the bug from #1)
 - AMDGPU/multiple_tails.ll: #1
 - AMDGPU/vect-ptr-ptr-size-mismatch.ll: Fix alignment (I think related to #1 above).
 - AMDGPU CodeGen: I have difficulty commenting on these changes, but many of them look like #2, we vectorize more.
 - NVPTX/4x2xhalf.ll: Fix alignment (I think related to #1 above).
 - NVPTX/vectorize_i8.ll: We don't generate <3 x i8> vectors on NVPTX because they're not legal (and eventually get split)
 - X86/correct-order.ll: #2, we vectorize more, probably because of changes to the chain-splitting logic.
 - X86/subchain-interleaved.ll: #2, we vectorize more
 - X86/vector-scalar.ll: #2, we can now vectorize scalar float + <1 x float>
 - X86/vectorize-i8-nested-add-inseltpoison.ll: Deleted the nuw test because it was nonsensical.  It was doing `add nuw %v0, -1`, but this is equivalent to `add nuw %v0, 0xffff'ffff`, which is equivalent to asserting that %v0 == 0.
 - X86/vectorize-i8-nested-add.ll: Same as nested-add-inseltpoison.ll

Differential Revision: https://reviews.llvm.org/D149893
reble pushed a commit that referenced this pull request May 17, 2024
…ication as used during partial ordering (#91534)

We do not deduce template arguments from the exception specification
when determining the primary template of a function template
specialization or when taking the address of a function template.
Therefore, this patch changes `isAtLeastAsSpecializedAs` such that we do
not mark template parameters in the exception specification as 'used'
during partial ordering (per [temp.deduct.partial]
p12) to prevent the following from being ambiguous:

```
template<typename T, typename U>
void f(U) noexcept(noexcept(T())); // #1

template<typename T>
void f(T*) noexcept; // #2

template<>
void f<int>(int*) noexcept; // currently ambiguous, selects #2 with this patch applied 
```

Although there is no corresponding wording in the standard (see core issue filed here
cplusplus/CWG#537), this seems
to be the intended behavior given the definition of _deduction
substitution loci_ in [temp.deduct.general] p7 (and EDG does the same thing).
EwanC pushed a commit that referenced this pull request May 30, 2024
…erSize (#67657)"

This reverts commit f0b3654.

This commit triggers UB by reading an uninitialized variable.

`UP.PartialThreshold` is used uninitialized in `getUnrollingPreferences()` when
it is called from `LoopVectorizationPlanner::executePlan()`. In this case the
`UP` variable is created on the stack and its fields are not initialized.

```
==8802==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x557c0b081b99 in llvm::BasicTTIImplBase<llvm::X86TTIImpl>::getUnrollingPreferences(llvm::Loop*, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter*) llvm-project/llvm/include/llvm/CodeGen/BasicTTIImpl.h
    #1 0x557c0b07a40c in llvm::TargetTransformInfo::Model<llvm::X86TTIImpl>::getUnrollingPreferences(llvm::Loop*, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter*) llvm-project/llvm/include/llvm/Analysis/TargetTransformInfo.h:2277:17
    #2 0x557c0f5d69ee in llvm::TargetTransformInfo::getUnrollingPreferences(llvm::Loop*, llvm::ScalarEvolution&, llvm::TargetTransformInfo::UnrollingPreferences&, llvm::OptimizationRemarkEmitter*) const llvm-project/llvm/lib/Analysis/TargetTransformInfo.cpp:387:19
    #3 0x557c0e6b96a0 in llvm::LoopVectorizationPlanner::executePlan(llvm::ElementCount, unsigned int, llvm::VPlan&, llvm::InnerLoopVectorizer&, llvm::DominatorTree*, bool, llvm::DenseMap<llvm::SCEV const*, llvm::Value*, llvm::DenseMapInfo<llvm::SCEV const*, void>, llvm::detail::DenseMapPair<llvm::SCEV const*, llvm::Value*>> const*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7624:7
    #4 0x557c0e6e4b63 in llvm::LoopVectorizePass::processLoop(llvm::Loop*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10253:13
    #5 0x557c0e6f2429 in llvm::LoopVectorizePass::runImpl(llvm::Function&, llvm::ScalarEvolution&, llvm::LoopInfo&, llvm::TargetTransformInfo&, llvm::DominatorTree&, llvm::BlockFrequencyInfo*, llvm::TargetLibraryInfo*, llvm::DemandedBits&, llvm::AssumptionCache&, llvm::LoopAccessInfoManager&, llvm::OptimizationRemarkEmitter&, llvm::ProfileSummaryInfo*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10344:30
    #6 0x557c0e6f2f97 in llvm::LoopVectorizePass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:10383:9

[...]

  Uninitialized value was created by an allocation of 'UP' in the stack frame
    #0 0x557c0e6b961e in llvm::LoopVectorizationPlanner::executePlan(llvm::ElementCount, unsigned int, llvm::VPlan&, llvm::InnerLoopVectorizer&, llvm::DominatorTree*, bool, llvm::DenseMap<llvm::SCEV const*, llvm::Value*, llvm::DenseMapInfo<llvm::SCEV const*, void>, llvm::detail::DenseMapPair<llvm::SCEV const*, llvm::Value*>> const*) llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7623:3
```
EwanC pushed a commit that referenced this pull request May 30, 2024
…0820)

This solves some ambuguity introduced in P0522 regarding how
template template parameters are partially ordered, and should reduce
the negative impact of enabling `-frelaxed-template-template-args`
by default.

When performing template argument deduction, a template template
parameter
containing no packs should be more specialized than one that does.

Given the following example:
```C++
template<class T2> struct A;
template<template<class ...T3s> class TT1, class T4> struct A<TT1<T4>>; // #1
template<template<class    T5 > class TT2, class T6> struct A<TT2<T6>>; // #2

template<class T1> struct B;
template struct A<B<char>>;
```

Prior to P0522, candidate `#2` would be more specialized.
After P0522, neither is more specialized, so this becomes ambiguous.
With this change, `#2` becomes more specialized again,
maintaining compatibility with pre-P0522 implementations.

The problem is that in P0522, candidates are at least as specialized
when matching packs to fixed-size lists both ways, whereas before,
a fixed-size list is more specialized.

This patch keeps the original behavior when checking template arguments
outside deduction, but restores this aspect of pre-P0522 matching
during deduction.

---

Since this changes provisional implementation of CWG2398 which has
not been released yet, and already contains a changelog entry,
we don't provide a changelog entry here.
EwanC pushed a commit that referenced this pull request May 30, 2024
'reduction' has a few restrictions over normal 'var-list' clauses:

1- On parallel, a num_gangs can only have 1 argument when combined with
reduction. These two aren't able to be combined on any other of the
compute constructs however.

2- The vars all must be 'numerical data types' types of some sort, or a
'composite of numerical data types'. A list of types is given in the
standard as a minimum, so we choose 'isScalar', which covers all of
these types and keeps types that are actually numeric. Other compilers
don't seem to implement the 'composite of numerical data types', though
we do.

3- Because of the above restrictions, member-of-composite is not
allowed, so any access via a memberexpr is disallowed. Array-element and
sub-arrays (aka array sections) are both permitted, so long as they meet
the requirements of #2.

This patch implements all of these for compute constructs.
EwanC pushed a commit that referenced this pull request May 30, 2024
… (#92855)

This solves some ambuguity introduced in P0522 regarding how template
template parameters are partially ordered, and should reduce the
negative impact of enabling `-frelaxed-template-template-args` by
default.

When performing template argument deduction, we extend the provisional
wording introduced in llvm/llvm-project#89807 so
it also covers deduction of class templates.

Given the following example:
```C++
template <class T1, class T2 = float> struct A;
template <class T3> struct B;

template <template <class T4> class TT1, class T5> struct B<TT1<T5>>;   // #1
template <class T6, class T7>                      struct B<A<T6, T7>>; // #2

template struct B<A<int>>;
```
Prior to P0522, `#2` was picked. Afterwards, this became ambiguous. This
patch restores the pre-P0522 behavior, `#2` is picked again.

This has the beneficial side effect of making the following code valid:
```C++
template<class T, class U> struct A {};
A<int, float> v;
template<template<class> class TT> void f(TT<int>);

// OK: TT picks 'float' as the default argument for the second parameter.
void g() { f(v); }
```

---

Since this changes provisional implementation of CWG2398 which has not
been released yet, and already contains a changelog entry, we don't
provide a changelog entry here.
kbenzie pushed a commit that referenced this pull request Jul 8, 2024
….5)

The Python interpreter in Xcode cannot be copied because of a relative
RPATH. Our workaround would just use that Python interpreter directly
when it detects this. For the reasons explained in my previous commit,
that doesn't work in a virtual environment. Address this case by
creating a symlink to the "real" interpreter in the virtual environment.
kbenzie pushed a commit that referenced this pull request Jul 8, 2024
…on (#94752)

Fixes #62925.

The following code:
```cpp
#include <map>

int main() {
   std::map m1 = {std::pair{"foo", 2}, {"bar", 3}}; // guide #2
   std::map m2(m1.begin(), m1.end()); // guide #1
}
```
Is rejected by clang, but accepted by both gcc and msvc:
https://godbolt.org/z/6v4fvabb5 .

So basically CTAD with copy-list-initialization is rejected.

Note that this exact code is also used in a cppreference article:
https://en.cppreference.com/w/cpp/container/map/deduction_guides

I checked the C++11 and C++20 standard drafts to see whether suppressing
user conversion is the correct thing to do for user conversions. Based
on the standard I don't think that it is correct.

```
13.3.1.4 Copy-initialization of class by user-defined conversion [over.match.copy]
Under the conditions specified in 8.5, as part of a copy-initialization of an object of class type, a user-defined
conversion can be invoked to convert an initializer expression to the type of the object being initialized.
Overload resolution is used to select the user-defined conversion to be invoked
```
So we could use user defined conversions according to the standard.

```
If a narrowing conversion is required to initialize any of the elements, the
program is ill-formed.
```
We should not do narrowing.

```
In copy-list-initialization, if an explicit constructor is chosen, the initialization is ill-formed.
```
We should not use explicit constructors.
Bensuo pushed a commit that referenced this pull request Aug 13, 2024
This patch adds a frame recognizer for Clang's
`__builtin_verbose_trap`, which behaves like a
`__builtin_trap`, but emits a failure-reason string into debug-info in
order for debuggers to display
it to a user.

The frame recognizer triggers when we encounter
a frame with a function name that begins with
`__clang_trap_msg`, which is the magic prefix
Clang emits into debug-info for verbose traps.
Once such frame is encountered we display the
frame function name as the `Stop Reason` and display that frame to the
user.

Example output:
```
(lldb) run
warning: a.out was compiled with optimization - stepping may behave oddly; variables may not be available.
Process 35942 launched: 'a.out' (arm64)
Process 35942 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = Misc.: Function is not implemented
    frame #1: 0x0000000100003fa4 a.out`main [inlined] Dummy::func(this=<unavailable>) at verbose_trap.cpp:3:5 [opt]
   1    struct Dummy {
   2      void func() {
-> 3        __builtin_verbose_trap("Misc.", "Function is not implemented");
   4      }
   5    };
   6
   7    int main() {
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = Misc.: Function is not implemented
    frame #0: 0x0000000100003fa4 a.out`main [inlined] __clang_trap_msg$Misc.$Function is not implemented$ at verbose_trap.cpp:0 [opt]
  * frame #1: 0x0000000100003fa4 a.out`main [inlined] Dummy::func(this=<unavailable>) at verbose_trap.cpp:3:5 [opt]
    frame #2: 0x0000000100003fa4 a.out`main at verbose_trap.cpp:8:13 [opt]
    frame #3: 0x0000000189d518b4 dyld`start + 1988
```
Bensuo pushed a commit that referenced this pull request Aug 13, 2024
It is named var instead of number now, update the regex.

After
llvm/llvm-project@9ad72df
From:

```
  %0 = load i13, ptr addrspace(4) %a.addr.ascast, align 2, !tbaa !7
  %call = call spir_func signext i5 @_Z22__spirv_FixedSqrtINTELILi13ELi5EEDBT0__DBT__biiii(i13 signext %0, i1 zeroext false, i32 2, i32 2, i32 0, i32 0) #2
```

to:

```
  %0 = load i16, ptr addrspace(4) %a.addr.ascast, align 2, !tbaa !7
  %loadedv = trunc i16 %0 to i13
  %call = call spir_func signext i5 @_Z22__spirv_FixedSqrtINTELILi13ELi5EEDBT0__DBT__biiii(i13 signext %loadedv, i1 zeroext false, i32 2, i32 2, i32 0, i32 0) #2

```
EwanC pushed a commit that referenced this pull request Sep 2, 2024
…linux (#99613)

Examples of the output:

ARM:
```
# ./a.out 
AddressSanitizer:DEADLYSIGNAL
=================================================================
==122==ERROR: AddressSanitizer: SEGV on unknown address 0x0000007a (pc 0x76e13ac0 bp 0x7eb7fd00 sp 0x7eb7fcc8 T0)
==122==The signal is caused by a READ memory access.
==122==Hint: address points to the zero page.
    #0 0x76e13ac0  (/lib/libc.so.6+0x7cac0)
    #1 0x76dce680 in gsignal (/lib/libc.so.6+0x37680)
    #2 0x005c2250  (/root/a.out+0x145250)
    #3 0x76db982c  (/lib/libc.so.6+0x2282c)
    #4 0x76db9918 in __libc_start_main (/lib/libc.so.6+0x22918)

==122==Register values:
 r0 = 0x00000000   r1 = 0x0000007a   r2 = 0x0000000b   r3 = 0x76d95020  
 r4 = 0x0000007a   r5 = 0x00000001   r6 = 0x005dcc5c   r7 = 0x0000010c  
 r8 = 0x0000000b   r9 = 0x76f9ece0  r10 = 0x00000000  r11 = 0x7eb7fd00  
r12 = 0x76dce670   sp = 0x7eb7fcc8   lr = 0x76e13ab4   pc = 0x76e13ac0  
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/libc.so.6+0x7cac0) 
==122==ABORTING
```

AArch64:
```
# ./a.out 
UndefinedBehaviorSanitizer:DEADLYSIGNAL
==99==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000063 (pc 0x007fbbbc5860 bp 0x007fcfdcb700 sp 0x007fcfdcb700 T99)
==99==The signal is caused by a UNKNOWN memory access.
==99==Hint: address points to the zero page.
    #0 0x007fbbbc5860  (/lib64/libc.so.6+0x82860)
    #1 0x007fbbb81578  (/lib64/libc.so.6+0x3e578)
    #2 0x00556051152c  (/root/a.out+0x3152c)
    #3 0x007fbbb6e268  (/lib64/libc.so.6+0x2b268)
    #4 0x007fbbb6e344  (/lib64/libc.so.6+0x2b344)
    #5 0x0055604e45ec  (/root/a.out+0x45ec)

==99==Register values:
 x0 = 0x0000000000000000   x1 = 0x0000000000000063   x2 = 0x000000000000000b   x3 = 0x0000007fbbb41440  
 x4 = 0x0000007fbbb41580   x5 = 0x3669288942d44cce   x6 = 0x0000000000000000   x7 = 0x00000055605110b0  
 x8 = 0x0000000000000083   x9 = 0x0000000000000000  x10 = 0x0000000000000000  x11 = 0x0000000000000000  
x12 = 0x0000007fbbdb3360  x13 = 0x0000000000010000  x14 = 0x0000000000000039  x15 = 0x00000000004113a0  
x16 = 0x0000007fbbb81560  x17 = 0x0000005560540138  x18 = 0x000000006474e552  x19 = 0x0000000000000063  
x20 = 0x0000000000000001  x21 = 0x000000000000000b  x22 = 0x0000005560511510  x23 = 0x0000007fcfdcb918  
x24 = 0x0000007fbbdb1b50  x25 = 0x0000000000000000  x26 = 0x0000007fbbdb2000  x27 = 0x000000556053f858  
x28 = 0x0000000000000000   fp = 0x0000007fcfdcb700   lr = 0x0000007fbbbc584c   sp = 0x0000007fcfdcb700  
UndefinedBehaviorSanitizer can not provide additional info.
SUMMARY: UndefinedBehaviorSanitizer: SEGV (/lib64/libc.so.6+0x82860) 
==99==ABORTING
```
EwanC pushed a commit that referenced this pull request Sep 17, 2024
```
  UBSan-Standalone-sparc :: TestCases/Misc/Linux/diag-stacktrace.cpp
```
`FAIL`s on 32 and 64-bit Linux/sparc64 (and on Solaris/sparcv9, too: the
test isn't Linux-specific at all). With
`UBSAN_OPTIONS=fast_unwind_on_fatal=1`, the stack trace shows a
duplicate innermost frame:
```
compiler-rt/test/ubsan/TestCases/Misc/Linux/diag-stacktrace.cpp:14:31: runtime error: execution reached the end of a value-returning function without returning a value
    #0 0x7003a708 in f() compiler-rt/test/ubsan/TestCases/Misc/Linux/diag-stacktrace.cpp:14:35
    #1 0x7003a708 in f() compiler-rt/test/ubsan/TestCases/Misc/Linux/diag-stacktrace.cpp:14:35
    #2 0x7003a714 in g() compiler-rt/test/ubsan/TestCases/Misc/Linux/diag-stacktrace.cpp:17:38
```
which isn't seen with `fast_unwind_on_fatal=0`.

This turns out to be another fallout from fixing
`__builtin_return_address`/`__builtin_extract_return_addr` on SPARC. In
`sanitizer_stacktrace_sparc.cpp` (`BufferedStackTrace::UnwindFast`) the
`pc` arg is the return address, while `pc1` from the stack frame
(`fr_savpc`) is the address of the `call` insn, leading to a double
entry for the innermost frame in `trace_buffer[]`.

This patch fixes this by moving the adjustment before all uses.

Tested on `sparc64-unknown-linux-gnu` and `sparcv9-sun-solaris2.11`
(with the `ubsan/TestCases/Misc/Linux` tests enabled).
aarongreig pushed a commit that referenced this pull request Sep 24, 2024
Currently, process of replacing bitwise operations consisting of
`LSR`/`LSL` with `And` is performed by `DAGCombiner`.

However, in certain cases, the `AND` generated by this process
can be removed.

Consider following case:
```
        lsr x8, x8, #56
        and x8, x8, #0xfc
        ldr w0, [x2, x8]
        ret
```

In this case, we can remove the `AND` by changing the target of `LDR`
to `[X2, X8, LSL #2]` and right-shifting amount change to 56 to 58.

after changed:
```
        lsr x8, x8, #58
        ldr w0, [x2, x8, lsl #2]
        ret
```

This patch checks to see if the `SHIFTING` + `AND` operation on load
target can be optimized and optimizes it if it can.
aarongreig pushed a commit that referenced this pull request Sep 24, 2024
`JITDylibSearchOrderResolver` local variable can be destroyed before
completion of all callbacks. Capture it together with `Deps` in
`OnEmitted` callback.

Original error:

```
==2035==ERROR: AddressSanitizer: stack-use-after-return on address 0x7bebfa155b70 at pc 0x7ff2a9a88b4a bp 0x7bec08d51980 sp 0x7bec08d51978
READ of size 8 at 0x7bebfa155b70 thread T87 (tf_xla-cpu-llvm)
    #0 0x7ff2a9a88b49 in operator() llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:55:58
    #1 0x7ff2a9a88b49 in __invoke<(lambda at llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:55:9) &, const llvm::DenseMap<llvm::orc::JITDylib *, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> >, llvm::DenseMapInfo<llvm::orc::JITDylib *, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib *, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> > > > &> libcxx/include/__type_traits/invoke.h:149:25
    #2 0x7ff2a9a88b49 in __call<(lambda at llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:55:9) &, const llvm::DenseMap<llvm::orc::JITDylib *, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> >, llvm::DenseMapInfo<llvm::orc::JITDylib *, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib *, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> > > > &> libcxx/include/__type_traits/invoke.h:224:5
    #3 0x7ff2a9a88b49 in operator() libcxx/include/__functional/function.h:210:12
    #4 0x7ff2a9a88b49 in void std::__u::__function::__policy_invoker<void (llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr,
```
aarongreig pushed a commit that referenced this pull request Sep 24, 2024
Static destructor can race with calls to notify and trigger tsan
warning.

```
WARNING: ThreadSanitizer: data race (pid=5787)
  Write of size 1 at 0x55bec9df8de8 by thread T23:
    #0 pthread_mutex_destroy [third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1344](third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp?l=1344&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x1b12affb) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    #1 __libcpp_recursive_mutex_destroy [third_party/crosstool/v18/stable/src/libcxx/include/__thread/support/pthread.h:91](third_party/crosstool/v18/stable/src/libcxx/include/__thread/support/pthread.h?l=91&cl=669089572):10 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x4523d4e9) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    #2 std::__tsan::recursive_mutex::~recursive_mutex() [third_party/crosstool/v18/stable/src/libcxx/src/mutex.cpp:52](third_party/crosstool/v18/stable/src/libcxx/src/mutex.cpp?l=52&cl=669089572):11 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x4523d4e9)
    #3 ~SmartMutex [third_party/llvm/llvm-project/llvm/include/llvm/Support/Mutex.h:28](third_party/llvm/llvm-project/llvm/include/llvm/Support/Mutex.h?l=28&cl=669089572):11 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaedfe) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    #4 (anonymous namespace)::PerfJITEventListener::~PerfJITEventListener() [third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp:65](third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp?l=65&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaedfe)
    #5 cxa_at_exit_callback_installed_at(void*) [third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:437](third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp?l=437&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x1b172cb9) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    #6 llvm::JITEventListener::createPerfJITEventListener() [third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp:496](third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp?l=496&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcad8f5) (BuildId: ff25ace8b17d9863348bb1759c47246c)
```
```
Previous atomic read of size 1 at 0x55bec9df8de8 by thread T192 (mutexes: write M0, write M1):
    #0 pthread_mutex_unlock [third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1387](third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp?l=1387&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x1b12b6bb) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    #1 __libcpp_recursive_mutex_unlock [third_party/crosstool/v18/stable/src/libcxx/include/__thread/support/pthread.h:87](third_party/crosstool/v18/stable/src/libcxx/include/__thread/support/pthread.h?l=87&cl=669089572):10 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x4523d589) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    #2 std::__tsan::recursive_mutex::unlock() [third_party/crosstool/v18/stable/src/libcxx/src/mutex.cpp:64](third_party/crosstool/v18/stable/src/libcxx/src/mutex.cpp?l=64&cl=669089572):11 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x4523d589)
    #3 unlock [third_party/llvm/llvm-project/llvm/include/llvm/Support/Mutex.h:47](third_party/llvm/llvm-project/llvm/include/llvm/Support/Mutex.h?l=47&cl=669089572):16 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaf968) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    #4 ~lock_guard [third_party/crosstool/v18/stable/src/libcxx/include/__mutex/lock_guard.h:39](third_party/crosstool/v18/stable/src/libcxx/include/__mutex/lock_guard.h?l=39&cl=669089572):101 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaf968)
    #5 (anonymous namespace)::PerfJITEventListener::notifyObjectLoaded(unsigned long, llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo const&) [third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp:290](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp?l=290&cl=669089572):1 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaf968)
    #6 llvm::orc::RTDyldObjectLinkingLayer::onObjEmit(llvm::orc::MaterializationResponsibility&, llvm::object::OwningBinary<llvm::object::ObjectFile>, std::__tsan::unique_ptr<llvm::RuntimeDyld::MemoryManager, std::__tsan::default_delete<llvm::RuntimeDyld::MemoryManager>>, std::__tsan::unique_ptr<llvm::RuntimeDyld::LoadedObjectInfo, std::__tsan::default_delete<llvm::RuntimeDyld::LoadedObjectInfo>>, std::__tsan::unique_ptr<llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void>>, llvm::DenseMapInfo<llvm::orc::JITDylib*, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void>>>>, std::__tsan::default_delete<llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void>>, llvm::DenseMapInfo<llvm::orc::JITDylib*, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void>>>>>>, llvm::Error) [third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:386](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp?l=386&cl=669089572):10 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bc404a8) (BuildId: ff25ace8b17d9863348bb1759c47246c)
```
reble pushed a commit that referenced this pull request Oct 17, 2024
When SPARC Asan testing is enabled by PR #107405, many Linux/sparc64
tests just hang like
```
#0  0xf7ae8e90 in syscall () from /usr/lib32/libc.so.6
#1  0x701065e8 in __sanitizer::FutexWait(__sanitizer::atomic_uint32_t*, unsigned int) ()
    at compiler-rt/lib/sanitizer_common/sanitizer_linux.cpp:766
#2  0x70107c90 in Wait ()
    at compiler-rt/lib/sanitizer_common/sanitizer_mutex.cpp:35
#3  0x700f7cac in Lock ()
    at compiler-rt/lib/asan/../sanitizer_common/sanitizer_mutex.h:196
#4  Lock ()
    at compiler-rt/lib/asan/../sanitizer_common/sanitizer_thread_registry.h:98
#5  LockThreads ()
    at compiler-rt/lib/asan/asan_thread.cpp:489
#6  0x700e9c8c in __asan::BeforeFork() ()
    at compiler-rt/lib/asan/asan_posix.cpp:157
#7  0xf7ac83f4 in ?? () from /usr/lib32/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
```
It turns out that this happens in tests using `internal_fork` (e.g.
invoking `llvm-symbolizer`): unlike most other Linux targets, which use
`clone`, Linux/sparc64 has to use `__fork` instead. While `clone`
doesn't trigger `pthread_atfork` handlers, `__fork` obviously does,
causing the hang.

To avoid this, this patch disables `InstallAtForkHandler` and lets the
ASan tests run to completion.

Tested on `sparc64-unknown-linux-gnu`.
reble pushed a commit that referenced this pull request Oct 17, 2024
…ap (#108825)

This attempts to improve user-experience when LLDB stops on a
verbose_trap. Currently if a `__builtin_verbose_trap` triggers, we
display the first frame above the call to the verbose_trap. So in the
newly added test case, we would've previously stopped here:
```
(lldb) run
Process 28095 launched: '/Users/michaelbuch/a.out' (arm64)
Process 28095 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = Bounds error: out-of-bounds access
    frame #1: 0x0000000100003f5c a.out`std::__1::vector<int>::operator[](this=0x000000016fdfebef size=0, (null)=10) at verbose_trap.cpp:6:9
   3    template <typename T>
   4    struct vector {
   5        void operator[](unsigned) {
-> 6            __builtin_verbose_trap("Bounds error", "out-of-bounds access");
   7        }
   8    };
```

After this patch, we would stop in the first non-`std` frame:
```
(lldb) run
Process 27843 launched: '/Users/michaelbuch/a.out' (arm64)
Process 27843 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = Bounds error: out-of-bounds access
    frame #2: 0x0000000100003f44 a.out`g() at verbose_trap.cpp:14:5
   11  
   12   void g() {
   13       std::vector<int> v;
-> 14       v[10];
   15   }
   16  
```

rdar://134490328
EwanC pushed a commit that referenced this pull request Dec 2, 2024
This reverts commit a89e016.

This is being reverted because it broke the test:

Unwind/trap_frame_sym_ctx.test

/Users/ec2-user/jenkins/workspace/llvm.org/lldb-cmake/llvm-project/lldb/test/Shell/Unwind/trap_frame_sym_ctx.test:21:10: error: CHECK: expected string not found in input
 CHECK: frame #2: {{.*}}`main
Bensuo pushed a commit that referenced this pull request Dec 3, 2024
…ates explicitly specialized for an implicitly instantiated class template specialization (#113464)

Consider the following:
```
template<typename T>
struct A {
  template<typename U>
  struct B {
    static constexpr int x = 0; // #1
  };

  template<typename U>
  struct B<U*> {
    static constexpr int x = 1; // #2
  };
};

template<>
template<typename U>
struct A<long>::B {
  static constexpr int x = 2; // #3
};

static_assert(A<short>::B<int>::y == 0); // uses #1
static_assert(A<short>::B<int*>::y == 1); // uses #2

static_assert(A<long>::B<int>::y == 2); // uses #3
static_assert(A<long>::B<int*>::y == 2); // uses #3
```

According to [temp.spec.partial.member] p2:
> If the primary member template is explicitly specialized for a given
(implicit) specialization of the enclosing class template, the partial
specializations of the member template are ignored for this
specialization of the enclosing class template.
If a partial specialization of the member template is explicitly
specialized for a given (implicit) specialization of the enclosing class
template, the primary member template and its other partial
specializations are still considered for this specialization of the
enclosing class template.

The example above fails to compile because we currently don't implement
[temp.spec.partial.member] p2. This patch implements the wording, fixing #51051.
kbenzie pushed a commit that referenced this pull request Jan 17, 2025
…6386)

The overloads for single_task and parallel_for in the
sycl_ext_oneapi_kernel_properties extension are being deprecated as
mentioned in intel#14785. So I'm rewriting
tests containg such overloads so that they can still run after the
deprecation.

---------

Signed-off-by: Hu, Peisen <[email protected]>
EwanC pushed a commit that referenced this pull request Feb 12, 2025
…" (#123877)

Reverts llvm/llvm-project#122811 due to buildbot breakage e.g.,
https://lab.llvm.org/buildbot/#/builders/52/builds/5421/steps/11/logs/stdio

ASan output from local re-run:
```
==2780289==ERROR: AddressSanitizer: use-after-poison on address 0x7e0b87e28d28 at pc 0x55a979a99e7e bp 0x7ffe4b18f0b0 sp 0x7ffe4b18f0a8
READ of size 1 at 0x7e0b87e28d28 thread T0
    #0 0x55a979a99e7d in getStorageClass /usr/local/google/home/thurston/buildbot_repro/llvm-project/llvm/include/llvm/Object/COFF.h:344
    #1 0x55a979a99e7d in isSectionDefinition /usr/local/google/home/thurston/buildbot_repro/llvm-project/llvm/include/llvm/Object/COFF.h:429:9
    #2 0x55a979a99e7d in getSymbols /usr/local/google/home/thurston/buildbot_repro/llvm-project/lld/COFF/LLDMapFile.cpp:54:42
    #3 0x55a979a99e7d in lld::coff::writeLLDMapFile(lld::coff::COFFLinkerContext const&) /usr/local/google/home/thurston/buildbot_repro/llvm-project/lld/COFF/LLDMapFile.cpp:103:40
    #4 0x55a979a16879 in (anonymous namespace)::Writer::run() /usr/local/google/home/thurston/buildbot_repro/llvm-project/lld/COFF/Writer.cpp:810:3
    #5 0x55a979a00aac in lld::coff::writeResult(lld::coff::COFFLinkerContext&) /usr/local/google/home/thurston/buildbot_repro/llvm-project/lld/COFF/Writer.cpp:354:15
    #6 0x55a97985f7ed in lld::coff::LinkerDriver::linkerMain(llvm::ArrayRef<char const*>) /usr/local/google/home/thurston/buildbot_repro/llvm-project/lld/COFF/Driver.cpp:2826:3
    #7 0x55a97984cdd3 in lld::coff::link(llvm::ArrayRef<char const*>, llvm::raw_ostream&, llvm::raw_ostream&, bool, bool) /usr/local/google/home/thurston/buildbot_repro/llvm-project/lld/COFF/Driver.cpp:97:15
    #8 0x55a9797f9793 in lld::unsafeLldMain(llvm::ArrayRef<char const*>, llvm::raw_ostream&, llvm::raw_ostream&, llvm::ArrayRef<lld::DriverDef>, bool) /usr/local/google/home/thurston/buildbot_repro/llvm-project/lld/Common/DriverDispatcher.cpp:163:12
    #9 0x55a9797fa3b6 in operator() /usr/local/google/home/thurston/buildbot_repro/llvm-project/lld/Common/DriverDispatcher.cpp:188:15
    #10 0x55a9797fa3b6 in void llvm::function_ref<void ()>::callback_fn<lld::lldMain(llvm::ArrayRef<char const*>, llvm::raw_ostream&, llvm::raw_ostream&, llvm::ArrayRef<lld::DriverDef>)::$_0>(long) /usr/local/google/home/thurston/buildbot_repro/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:46:12
    #11 0x55a97966cb93 in operator() /usr/local/google/home/thurston/buildbot_repro/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:69:12
    #12 0x55a97966cb93 in llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) /usr/local/google/home/thurston/buildbot_repro/llvm-project/llvm/lib/Support/CrashRecoveryContext.cpp:426:3
    #13 0x55a9797f9dc3 in lld::lldMain(llvm::ArrayRef<char const*>, llvm::raw_ostream&, llvm::raw_ostream&, llvm::ArrayRef<lld::DriverDef>) /usr/local/google/home/thurston/buildbot_repro/llvm-project/lld/Common/DriverDispatcher.cpp:187:14
    #14 0x55a979627512 in lld_main(int, char**, llvm::ToolContext const&) /usr/local/google/home/thurston/buildbot_repro/llvm-project/lld/tools/lld/lld.cpp:103:14
    #15 0x55a979628731 in main /usr/local/google/home/thurston/buildbot_repro/llvm_build_asan/tools/lld/tools/lld/lld-driver.cpp:17:10
    #16 0x7ffb8b202c89 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #17 0x7ffb8b202d44 in __libc_start_main csu/../csu/libc-start.c:360:3
    #18 0x55a97953ef60 in _start (/usr/local/google/home/thurston/buildbot_repro/llvm_build_asan/bin/lld+0x8fd1f60)
```
Bensuo pushed a commit that referenced this pull request Apr 10, 2025
No codegen regression on either target. The two builtin_ffs implied on
nvptx CSE away.

```
define internal i64 @__gpu_read_first_lane_u64(i64 noundef %__lane_mask, i64 noundef %__x) #2 {
entry:
  %shr = lshr i64 %__x, 32
  %conv = trunc nuw i64 %shr to i32
  %conv1 = trunc i64 %__x to i32
  %conv2 = trunc i64 %__lane_mask to i32
  %0 = tail call range(i32 0, 33) i32 @llvm.cttz.i32(i32 %conv2, i1 true)
  %iszero = icmp eq i32 %conv2, 0
  %sub = select i1 %iszero, i32 -1, i32 %0
  %1 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 %conv2, i32 %conv, i32 %sub, i32 31)
  %conv4 = sext i32 %1 to i64
  %shl = shl nsw i64 %conv4, 32
  %2 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 %conv2, i32 %conv1, i32 %sub, i32 31)
  %conv7 = zext i32 %2 to i64
  %or = or disjoint i64 %shl, %conv7
  ret i64 %or
}
; becomes

define internal i64 @__gpu_competing_read_first_lane_u64(i64 noundef %__lane_mask, i64 noundef %__x) #2 {
entry:
  %shr = lshr i64 %__x, 32
  %conv = trunc nuw i64 %shr to i32
  %conv1 = trunc i64 %__x to i32
  %conv.i = trunc i64 %__lane_mask to i32
  %0 = tail call range(i32 0, 33) i32 @llvm.cttz.i32(i32 %conv.i, i1 true)
  %iszero = icmp eq i32 %conv.i, 0
  %sub.i = select i1 %iszero, i32 -1, i32 %0
  %1 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 %conv.i, i32 %conv, i32 %sub.i, i32 31)
  %conv4 = zext i32 %1 to i64
  %shl = shl nuw i64 %conv4, 32
  %2 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 %conv.i, i32 %conv1, i32 %sub.i, i32 31)
  %conv7 = zext i32 %2 to i64
  %or = or disjoint i64 %shl, %conv7
  ret i64 %or
}
```

The sext vs zext difference is vaguely interesting but since the bits
are immediately discarded in either case it make no odds. The amdgcn one
doesn't need CSE, the readfirstlane function is a single call to an
intrinsic.

Drive by fix to __gpu_match_all_u32, it was calling first_lane_u64 and
could use first_lane_u32 instead. Added the missing call to gpuintrin.c
test case and a stray missing static as well.
reble pushed a commit that referenced this pull request May 1, 2025
…130)

This should fix failures caused by
llvm/llvm-project#133967
Attn: @sarnex
Thanks

Signed-off-by: Arvind Sudarsanam <[email protected]>
reble pushed a commit that referenced this pull request May 1, 2025
…d A520 (#132246)

Inefficient SVE codegen occurs on at least two in-order cores,
those being Cortex-A510 and Cortex-A520. For example a simple vector
add

```
void foo(float a, float b, float dst, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Vectorizes the inner loop into the following interleaved sequence
of instructions.

```
        add     x12, x1, x10
        ld1b    { z0.b }, p0/z, [x1, x10]
        add     x13, x2, x10
        ld1b    { z1.b }, p0/z, [x2, x10]
        ldr     z2, [x12, #1, mul vl]
        ldr     z3, [x13, #1, mul vl]
        dech    x11
        add     x12, x0, x10
        fadd    z0.s, z1.s, z0.s
        fadd    z1.s, z3.s, z2.s
        st1b    { z0.b }, p0, [x0, x10]
        addvl   x10, x10, #2
        str     z1, [x12, #1, mul vl]
```

By adjusting the target features to prefer fixed over scalable if the
cost is equal we get the following vectorized loop.

```
         ldp q0, q3, [x11, #-16]
         subs    x13, x13, #8
         ldp q1, q2, [x10, #-16]
         add x10, x10, #32
         add x11, x11, #32
         fadd    v0.4s, v1.4s, v0.4s
         fadd    v1.4s, v2.4s, v3.4s
         stp q0, q1, [x12, #-16]
         add x12, x12, #32
```

Which is more efficient.
reble pushed a commit that referenced this pull request May 1, 2025
… A510/A520 (#134606)

Recommit. This work was done by #132246 but failed buildbots due to the
test introduced needing updates

Inefficient SVE codegen occurs on at least two in-order cores, those
being Cortex-A510 and Cortex-A520. For example a simple vector add

```
void foo(float a, float b, float dst, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Vectorizes the inner loop into the following interleaved sequence of
instructions.

```
        add     x12, x1, x10
        ld1b    { z0.b }, p0/z, [x1, x10]
        add     x13, x2, x10
        ld1b    { z1.b }, p0/z, [x2, x10]
        ldr     z2, [x12, #1, mul vl]
        ldr     z3, [x13, #1, mul vl]
        dech    x11
        add     x12, x0, x10
        fadd    z0.s, z1.s, z0.s
        fadd    z1.s, z3.s, z2.s
        st1b    { z0.b }, p0, [x0, x10]
        addvl   x10, x10, #2
        str     z1, [x12, #1, mul vl]
```

By adjusting the target features to prefer fixed over scalable if the
cost is equal we get the following vectorized loop.

```
         ldp q0, q3, [x11, #-16]
         subs    x13, x13, #8
         ldp q1, q2, [x10, #-16]
         add x10, x10, #32
         add x11, x11, #32
         fadd    v0.4s, v1.4s, v0.4s
         fadd    v1.4s, v2.4s, v3.4s
         stp q0, q1, [x12, #-16]
         add x12, x12, #32
```

Which is more efficient.
reble pushed a commit that referenced this pull request May 19, 2025
…e (#138091)

Check this error for more context
(https://github.com/compiler-research/CppInterOp/actions/runs/14749797085/job/41407625681?pr=491#step:10:531)

This fails with 
```
* thread #1, name = 'CppInterOpTests', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x55500356d6d3)
  * frame #0: 0x00007fffee41cfe3 libclangCppInterOp.so.21.0gitclang::PragmaNamespace::~PragmaNamespace() + 99
    frame #1: 0x00007fffee435666 libclangCppInterOp.so.21.0gitclang::Preprocessor::~Preprocessor() + 3830
    frame #2: 0x00007fffee20917a libclangCppInterOp.so.21.0gitstd::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 58
    frame #3: 0x00007fffee224796 libclangCppInterOp.so.21.0gitclang::CompilerInstance::~CompilerInstance() + 838
    frame #4: 0x00007fffee22494d libclangCppInterOp.so.21.0gitclang::CompilerInstance::~CompilerInstance() + 13
    frame #5: 0x00007fffed95ec62 libclangCppInterOp.so.21.0gitclang::IncrementalCUDADeviceParser::~IncrementalCUDADeviceParser() + 98
    frame #6: 0x00007fffed9551b6 libclangCppInterOp.so.21.0gitclang::Interpreter::~Interpreter() + 102
    frame #7: 0x00007fffed95598d libclangCppInterOp.so.21.0gitclang::Interpreter::~Interpreter() + 13
    frame #8: 0x00007fffed9181e7 libclangCppInterOp.so.21.0gitcompat::createClangInterpreter(std::vector<char const*, std::allocator<char const*>>&) + 2919
```

Problem : 

1) The destructor currently handles no clearance for the DeviceParser
and the DeviceAct. We currently only have this

https://github.com/llvm/llvm-project/blob/976493822443c52a71ed3c67aaca9a555b20c55d/clang/lib/Interpreter/Interpreter.cpp#L416-L419

2) The ownership for DeviceCI currently is present in
IncrementalCudaDeviceParser. But this should be similar to how the
combination for hostCI, hostAction and hostParser are managed by the
Interpreter. As on master the DeviceAct and DeviceParser are managed by
the Interpreter but not DeviceCI. This is problematic because :
IncrementalParser holds a Sema& which points into the DeviceCI. On
master, DeviceCI is destroyed before the base class ~IncrementalParser()
runs, causing Parser::reset() to access a dangling Sema (and as Sema
holds a reference to Preprocessor which owns PragmaNamespace) we see
this
```
  * frame #0: 0x00007fffee41cfe3 libclangCppInterOp.so.21.0gitclang::PragmaNamespace::~PragmaNamespace() + 99
    frame #1: 0x00007fffee435666 libclangCppInterOp.so.21.0gitclang::Preprocessor::~Preprocessor() + 3830
    
```
EwanC pushed a commit that referenced this pull request Jun 4, 2025
Fix for:
`Assertion failed: (false && "Architecture or OS not supported"),
function CreateRegisterContextForFrame, file
/usr/src/contrib/llvm-project/lldb/source/Plugins/Process/elf-core/ThreadElfCore.cpp,
line 182.
PLEASE submit a bug report to https://bugs.freebsd.org/submit/ and
include the crash backtrace.
#0 0x000000080cd857c8 llvm::sys::PrintStackTrace(llvm::raw_ostream&,
int)
/usr/src/contrib/llvm-project/llvm/lib/Support/Unix/Signals.inc:723:13
#1 0x000000080cd85ed4
/usr/src/contrib/llvm-project/llvm/lib/Support/Unix/Signals.inc:797:3
#2 0x000000080cd82ae8 llvm::sys::RunSignalHandlers()
/usr/src/contrib/llvm-project/llvm/lib/Support/Signals.cpp:104:5
#3 0x000000080cd861f0 SignalHandler
/usr/src/contrib/llvm-project/llvm/lib/Support/Unix/Signals.inc:403:3 #4
0x000000080f159644 handle_signal
/usr/src/lib/libthr/thread/thr_sig.c:298:3
`
EwanC pushed a commit that referenced this pull request Jun 4, 2025
The mcmodel=tiny memory model is only valid on ARM targets. While trying
this on X86 compiler throws an internal error along with stack dump.
#125641
This patch resolves the issue.
Reduced test case:
```
#include <stdio.h>
int main( void )
{
printf( "Hello, World!\n" ); 
return 0; 
}
```
```
0.	Program arguments: /opt/compiler-explorer/clang-trunk/bin/clang++ -gdwarf-4 -g -o /app/output.s -fno-verbose-asm -S --gcc-toolchain=/opt/compiler-explorer/gcc-snapshot -fcolor-diagnostics -fno-crash-diagnostics -mcmodel=tiny <source>
1.	<eof> parser at end of file
 #0 0x0000000003b10218 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x3b10218)
 #1 0x0000000003b0e35c llvm::sys::CleanupOnSignal(unsigned long) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x3b0e35c)
 #2 0x0000000003a5dbc3 llvm::CrashRecoveryContext::HandleExit(int) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x3a5dbc3)
 #3 0x0000000003b05cfe llvm::sys::Process::Exit(int, bool) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x3b05cfe)
 #4 0x0000000000d4e3eb LLVMErrorHandler(void*, char const*, bool) cc1_main.cpp:0:0
 #5 0x0000000003a67c93 llvm::report_fatal_error(llvm::Twine const&, bool) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x3a67c93)
 #6 0x0000000003a67df8 (/opt/compiler-explorer/clang-trunk/bin/clang+++0x3a67df8)
 #7 0x0000000002549148 llvm::X86TargetMachine::X86TargetMachine(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x2549148)
 #8 0x00000000025491fc llvm::RegisterTargetMachine<llvm::X86TargetMachine>::Allocator(llvm::Target const&, llvm::Triple const&, llvm::StringRef, llvm::StringRef, llvm::TargetOptions const&, std::optional<llvm::Reloc::Model>, std::optional<llvm::CodeModel::Model>, llvm::CodeGenOptLevel, bool) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x25491fc)
 #9 0x0000000003db74cc clang::emitBackendOutput(clang::CompilerInstance&, clang::CodeGenOptions&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>, clang::BackendConsumer*) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x3db74cc)
#10 0x0000000004460d95 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x4460d95)
#11 0x00000000060005ec clang::ParseAST(clang::Sema&, bool, bool) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x60005ec)
#12 0x00000000044614b5 clang::CodeGenAction::ExecuteAction() (/opt/compiler-explorer/clang-trunk/bin/clang+++0x44614b5)
#13 0x0000000004737121 clang::FrontendAction::Execute() (/opt/compiler-explorer/clang-trunk/bin/clang+++0x4737121)
#14 0x00000000046b777b clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x46b777b)
#15 0x00000000048229e3 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x48229e3)
#16 0x0000000000d50621 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/opt/compiler-explorer/clang-trunk/bin/clang+++0xd50621)
#17 0x0000000000d48e2d ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&, llvm::ToolContext const&) driver.cpp:0:0
#18 0x00000000044acc99 void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const::'lambda'()>(long) Job.cpp:0:0
#19 0x0000000003a5dac3 llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x3a5dac3)
#20 0x00000000044aceb9 clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const (.part.0) Job.cpp:0:0
#21 0x00000000044710dd clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&, bool) const (/opt/compiler-explorer/clang-trunk/bin/clang+++0x44710dd)
#22 0x0000000004472071 clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&, bool) const (/opt/compiler-explorer/clang-trunk/bin/clang+++0x4472071)
#23 0x000000000447c3fc clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&) (/opt/compiler-explorer/clang-trunk/bin/clang+++0x447c3fc)
#24 0x0000000000d4d2b1 clang_main(int, char**, llvm::ToolContext const&) (/opt/compiler-explorer/clang-trunk/bin/clang+++0xd4d2b1)
#25 0x0000000000c12464 main (/opt/compiler-explorer/clang-trunk/bin/clang+++0xc12464)
#26 0x00007ae43b029d90 (/lib/x86_64-linux-gnu/libc.so.6+0x29d90)
#27 0x00007ae43b029e40 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e40)
#28 0x0000000000d488c5 _start (/opt/compiler-explorer/clang-trunk/bin/clang+++0xd488c5)
```

---------

Co-authored-by: Shashwathi N <[email protected]>
EwanC pushed a commit that referenced this pull request Jun 4, 2025
… `getForwardSlice` matchers (#115670)

Improve mlir-query tool by implementing `getBackwardSlice` and
`getForwardSlice` matchers. As an addition `SetQuery` also needed to be
added to enable custom configuration for each query. e.g: `inclusive`,
`omitUsesFromAbove`, `omitBlockArguments`.

Note: backwardSlice and forwardSlice algoritms are the same as the ones
in `mlir/lib/Analysis/SliceAnalysis.cpp`
Example of current matcher. The query was made to the file:
`mlir/test/mlir-query/complex-test.mlir`

```mlir
./mlir-query /home/dbudii/personal/llvm-project/mlir/test/mlir-query/complex-test.mlir -c "match getDefinitions(hasOpName(\"arith.add
f\"),2)"

Match #1:

/home/dbudii/personal/llvm-project/mlir/test/mlir-query/complex-test.mlir:5:8:
  %0 = linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel"]} ins(%arg0 : tensor<5x5xf32>) outs(%arg1 : tensor<5x5xf32>) {
       ^
/home/dbudii/personal/llvm-project/mlir/test/mlir-query/complex-test.mlir:7:10: note: "root" binds here
    %2 = arith.addf %in, %in : f32
         ^
Match #2:

/home/dbudii/personal/llvm-project/mlir/test/mlir-query/complex-test.mlir:10:16:
  %collapsed = tensor.collapse_shape %0 [[0, 1]] : tensor<5x5xf32> into tensor<25xf32>
               ^
/home/dbudii/personal/llvm-project/mlir/test/mlir-query/complex-test.mlir:13:11:
    %c2 = arith.constant 2 : index
          ^
/home/dbudii/personal/llvm-project/mlir/test/mlir-query/complex-test.mlir:14:18:
    %extracted = tensor.extract %collapsed[%c2] : tensor<25xf32>
                 ^
/home/dbudii/personal/llvm-project/mlir/test/mlir-query/complex-test.mlir:15:10: note: "root" binds here
    %2 = arith.addf %extracted, %extracted : f32
         ^
2 matches.
```
reble pushed a commit that referenced this pull request Jun 23, 2025
Fixes #123300

What is seen 
```
clang-repl> int x = 42;
clang-repl> auto capture = [&]() { return x * 2; };
In file included from <<< inputs >>>:1:
input_line_4:1:17: error: non-local lambda expression cannot have a capture-default
    1 | auto capture = [&]() { return x * 2; };
      |                 ^
zsh: segmentation fault  clang-repl --Xcc="-v"

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x0000000107b4f8b8 libclang-cpp.19.1.dylib`clang::IncrementalParser::CleanUpPTU(clang::PartialTranslationUnit&) + 988
    frame #1: 0x0000000107b4f1b4 libclang-cpp.19.1.dylib`clang::IncrementalParser::ParseOrWrapTopLevelDecl() + 416
    frame #2: 0x0000000107b4fb94 libclang-cpp.19.1.dylib`clang::IncrementalParser::Parse(llvm::StringRef) + 612
    frame #3: 0x0000000107b52fec libclang-cpp.19.1.dylib`clang::Interpreter::ParseAndExecute(llvm::StringRef, clang::Value*) + 180
    frame #4: 0x0000000100003498 clang-repl`main + 3560
    frame #5: 0x000000018d39a0e0 dyld`start + 2360
```

Though the error is justified, we shouldn't be interested in exiting
through a segfault in such cases.

The issue is that empty named decls weren't being taken care of
resulting into this assert


https://github.com/llvm/llvm-project/blob/c1a229252617ed58f943bf3f4698bd8204ee0f04/clang/include/clang/AST/DeclarationName.h#L503

Can also be seen when the example is attempted through xeus-cpp-lite.


![image](https://github.com/user-attachments/assets/9b0e6ead-138e-4b06-9ad9-fcb9f8d5bf6e)
Bensuo pushed a commit that referenced this pull request Jun 26, 2025
With non -O0, the call stack is not preserved, like malloc_shared will
be inlined, the call stack would be like
```
 #0 in int* sycl::_V1::malloc_host<int>(unsigned long, sycl::_V1::context const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /tmp/syclws/include/sycl/usm.hpp:215:27
 #1 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0x757867a2a1c9)
 #2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x757867a2a28a)
```
instead of
```
 #0 in int* sycl::_V1::malloc_host<int>(unsigned long, sycl::_V1::context const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /tmp/syclws/include/sycl/usm.hpp:215:27
 #1 in int* sycl::_V1::malloc_host<int>(unsigned long, sycl::_V1::queue const&, sycl::_V1::property_list const&, sycl::_V1::detail::code_location const&) /tmp/syclws/include/sycl/usm.hpp:223:10
 #2 in main /tmp/syclws/llvm/sycl/test-e2e/MemorySanitizer/track-origins/check_host_usm_initialized_on_host.cpp:15:17
 #3 in ?? (/lib/x86_64-linux-gnu/libc.so.6+0x7a67f842a1c9)
 #4 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x7a67f842a28a)
```

Also, add env to every %{run} directive to make sure they are not
affected by system env.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants