
TestGlobalModuleCache.py test is flaky on 32 bit Arm Linux #76057

Open
@DavidSpickett

Description


This test was added by #74894 and was flaky on Arm and AArch64, though I have only since seen it fail on Arm. I assume it's just a lot more common there.

The failure looks like:

intern-state     ^^^^^^^^ Thread::ShouldStop Begin ^^^^^^^^
intern-state     Plan stack initial state:
  thread #1: tid = 0x1213e8:
    Active plan stack:
      Element 0: Base thread plan.
      Element 1: Single stepping past breakpoint site 11 at 0xf7fc0c14

python3.10       Discarding thread plans for thread (tid = 0x1213e8, force 1)
intern-state     Plan Step over breakpoint trap should stop: 0.
intern-state     Completed step over breakpoint plan.
intern-state     Plan Step over breakpoint trap auto-continue: true.
intern-state     ^^^^^^^^ Thread::ShouldStop plan stack before PopPlan ^^^^^^^^
intern-state       thread #1: tid = 0x1213e8:
    Active plan stack:
      Element 0: Base thread plan.
    Discarded plan stack:
      Element 0: Single stepping past breakpoint site 11 at 0xf7fc0c14

python3.10: /home/david.spickett/llvm-project/lldb/source/Target/ThreadPlanStack.cpp:151: lldb::ThreadPlanSP lldb_private::ThreadPlanStack::PopPlan(): Assertion `m_plans.size() > 1 && "Can't pop the base thread plan"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
#0 0xf174a0c0 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) Signals.cpp:0:0
#1 0xf1747b24 llvm::sys::RunSignalHandlers() Signals.cpp:0:0
#2 0xf174a968 SignalHandler(int) Signals.cpp:0:0
#3 0xf774d6e0 __default_sa_restorer ./signal/../sysdeps/unix/sysv/linux/arm/sigrestorer.S:67:0
#4 0xf773db06 ./csu/../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47:0
#5 0xf777d2ca __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
#6 0xf774c840 gsignal ./signal/../sysdeps/posix/raise.c:27:6

Note that this build included the logging I added in d14d521, after narrowing down the crash to one particular call to PopPlan.

Initially I could not reproduce it but have now been able to do so, though it takes a long time to fail.

The basic problem is that, somehow, Thread::ShouldStop asks for the current plan while the stack consists of the base plan and the single step plan. Before it can decide that the step has finished and the plan should be popped, the test makes a call that destroys the debugger.

Destruction follows this call chain:

  • Debugger::Destroy
  • Debugger::Clear
  • Process::Finalize
  • Process::DestroyImpl
  • ThreadList::DiscardThreadPlans
  • Thread::DiscardThreadPlans

Which is why we see Discarding thread plans for thread (tid = 0x1213e8, force 1) in the log output.

Discarding the thread plans leaves only the base plan on the stack (the stack always contains the base plan, no matter what). So by the time Thread::ShouldStop decides to pop the single step plan, that plan is no longer on the stack. It's a time-of-check/time-of-use issue, except that ShouldStop was never written with the possibility in mind that the plan stack could change underneath it.

This means that PopPlan tries to pop the base plan, which asserts to tell us we can't do that.

intern-state     ^^^^^^^^ Thread::ShouldStop Begin ^^^^^^^^
intern-state     Plan stack initial state:
  thread #1: tid = 0x1213e8:
    Active plan stack:
      Element 0: Base thread plan.
      Element 1: Single stepping past breakpoint site 11 at 0xf7fc0c14

Thread::ShouldStop sees that the current plan is the single step plan.

python3.10       Discarding thread plans for thread (tid = 0x1213e8, force 1)

Destroying the debugger discards the single step plan.

intern-state     Completed step over breakpoint plan.

Thread::ShouldStop works out that the single step is finished and so the plan should be discarded.

intern-state     ^^^^^^^^ Thread::ShouldStop plan stack before PopPlan ^^^^^^^^
intern-state       thread #1: tid = 0x1213e8:
    Active plan stack:
      Element 0: Base thread plan.
    Discarded plan stack:
      Element 0: Single stepping past breakpoint site 11 at 0xf7fc0c14

Whoops, it has already been discarded, and we certainly can't pop the base plan!
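The sequence above can be sketched as a small standalone model. This is hypothetical code, not LLDB source: the struct, member names, and helper functions below only mimic the shape of lldb_private::ThreadPlanStack to show how reading the current plan, then having the plans discarded, then popping, hits the "Can't pop the base thread plan" condition.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified model of a thread plan stack; names echo LLDB's
// but this is a sketch of the race, not the real implementation.
struct ThreadPlan { std::string name; };
using ThreadPlanSP = std::shared_ptr<ThreadPlan>;

struct ThreadPlanStackModel {
  std::vector<ThreadPlanSP> m_plans;
  std::vector<ThreadPlanSP> m_discarded;

  ThreadPlanStackModel() {
    // The base plan is always element 0 and must never be popped.
    m_plans.push_back(std::make_shared<ThreadPlan>(ThreadPlan{"Base thread plan"}));
  }

  // What Debugger::Destroy eventually reaches: discard everything
  // above the base plan.
  void DiscardThreadPlans() {
    while (m_plans.size() > 1) {
      m_discarded.push_back(m_plans.back());
      m_plans.pop_back();
    }
  }

  // The real PopPlan asserts `m_plans.size() > 1`; here we expose the
  // check so the race can be demonstrated without aborting.
  bool CanPopPlan() const { return m_plans.size() > 1; }
};

// Replays the sequence from the log: ShouldStop reads the current plan,
// the debugger is destroyed in between, then ShouldStop tries to pop.
bool PopWouldAssert() {
  ThreadPlanStackModel stack;
  stack.m_plans.push_back(std::make_shared<ThreadPlan>(
      ThreadPlan{"Single stepping past breakpoint site"}));

  ThreadPlanSP current = stack.m_plans.back();  // ShouldStop sees the step plan
  stack.DiscardThreadPlans();                   // debugger teardown discards it meanwhile
  return !stack.CanPopPlan();                   // the later PopPlan would hit the assert
}
```

In this model, PopWouldAssert() returns true: the discard leaves only the base plan, so the pop that ShouldStop later attempts is exactly the illegal one.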

The overall issue seems to be one of destruction order, or at least, something isn't telling the threads to stop before we start destroying the process. The threads, I think, are destroyed later in Process::Finalize, and I think there's a potential bug there too.

  m_thread_plans.Clear();
  m_thread_list_real.Destroy();
  m_thread_list.Destroy();
  m_extended_thread_list.Destroy();

The thread plans are cleared before the threads that would be looking at them are destroyed. That's not the cause of this particular assert, but it's suspicious to me at least.
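To make that suspicion concrete, here is a hypothetical sketch (again not LLDB source; all names below are invented for illustration) of why clearing the plan container before destroying the thread lists is risky: any thread torn down afterwards that still consults its plans finds them already gone.

```cpp
#include <map>
#include <vector>

// Hypothetical sketch of the ordering concern in a Finalize-like teardown.
struct PlanMap {
  std::map<int, std::vector<const char *>> m_plans_by_tid;
  void Clear() { m_plans_by_tid.clear(); }
};

struct ThreadListModel {
  std::vector<int> m_tids;
  PlanMap *m_plans = nullptr;

  // Teardown that still looks at the plan map, as a thread being
  // destroyed might; returns how many threads still found their plans.
  int Destroy() {
    int stacks_seen = 0;
    for (int tid : m_tids)
      if (m_plans && m_plans->m_plans_by_tid.count(tid))
        ++stacks_seen;
    m_tids.clear();
    return stacks_seen;
  }
};

// clear_plans_first = true mirrors the quoted order: m_thread_plans.Clear()
// runs before the thread lists are destroyed.
int FinalizeOrderDemo(bool clear_plans_first) {
  PlanMap plans;
  plans.m_plans_by_tid[0x1213e8] = {"Base thread plan"};
  ThreadListModel threads{{0x1213e8}, &plans};

  if (clear_plans_first)
    plans.Clear();
  int seen = threads.Destroy();
  if (!clear_plans_first)
    plans.Clear();
  return seen;  // 0 means the thread's plans vanished before it was destroyed
}
```

With the quoted order (plans cleared first) the model's thread finds no plans during its own teardown; destroying the threads first avoids that. None of this proves the real Finalize is wrong, it only shows why the ordering looks suspicious.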

The underlying problem is probably whatever is letting Thread::ShouldStop run even though we're in the middle of destroying the Process that owns the thread.
