Description
This test was added by #74894 and was flaky on Arm and AArch64, though since then I have only seen it fail on Arm. I assume the failure is just a lot more common there.
The failure looks like:
intern-state ^^^^^^^^ Thread::ShouldStop Begin ^^^^^^^^
intern-state Plan stack initial state:
thread #1: tid = 0x1213e8:
Active plan stack:
Element 0: Base thread plan.
Element 1: Single stepping past breakpoint site 11 at 0xf7fc0c14
python3.10 Discarding thread plans for thread (tid = 0x1213e8, force 1)
intern-state Plan Step over breakpoint trap should stop: 0.
intern-state Completed step over breakpoint plan.
intern-state Plan Step over breakpoint trap auto-continue: true.
intern-state ^^^^^^^^ Thread::ShouldStop plan stack before PopPlan ^^^^^^^^
intern-state thread #1: tid = 0x1213e8:
Active plan stack:
Element 0: Base thread plan.
Discarded plan stack:
Element 0: Single stepping past breakpoint site 11 at 0xf7fc0c14
python3.10: /home/david.spickett/llvm-project/lldb/source/Target/ThreadPlanStack.cpp:151: lldb::ThreadPlanSP lldb_private::ThreadPlanStack::PopPlan(): Assertion `m_plans.size() > 1 && "Can't pop the base thread plan"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
#0 0xf174a0c0 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) Signals.cpp:0:0
#1 0xf1747b24 llvm::sys::RunSignalHandlers() Signals.cpp:0:0
#2 0xf174a968 SignalHandler(int) Signals.cpp:0:0
#3 0xf774d6e0 __default_sa_restorer ./signal/../sysdeps/unix/sysv/linux/arm/sigrestorer.S:67:0
#4 0xf773db06 ./csu/../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47:0
#5 0xf777d2ca __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
#6 0xf774c840 gsignal ./signal/../sysdeps/posix/raise.c:27:6
Note that this build included the logging I added in d14d521, after narrowing down the crash to one particular call to PopPlan.
Initially I could not reproduce it but have now been able to do so, though it takes a long time to fail.
The basic problem is that somehow Thread::ShouldStop asks for the current plan when the stack consists of the base plan and the single step plan. Before it can decide that the step has finished and should be popped, a call is made in the test to destroy the debugger.
This follows this call chain:
Debugger::Destroy
Debugger::Clear
Process::Finalize
Process::DestroyImpl
ThreadList::DiscardThreadPlans
Thread::DiscardThreadPlans
Which is why we see "Discarding thread plans for thread (tid = 0x1213e8, force 1)" in the log output.
Discarding the thread plans leaves only the base plan on the stack (the stack always has the base plan no matter what). So when Thread::ShouldStop decides to pop the single step plan, it's no longer on the stack. It's a time of check/time of use issue, except ShouldStop was never written with the expectation that the plan stack could change while it is running.
This means that PopPlan tries to pop the base plan, which asserts to tell us we can't do that.
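The ordering described above can be replayed deterministically in a small toy model. This is only a sketch to make the sequence concrete: ToyPlan, ToyPlanStack, RaceResult and ReplayRace are hypothetical names invented for illustration and are not LLDB's classes, and the real failure involves two threads rather than the single-threaded replay shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a thread plan. Hypothetical, not LLDB's ThreadPlan.
struct ToyPlan {
  std::string name;
};
using ToyPlanSP = std::shared_ptr<ToyPlan>;

// Toy stand-in for the plan stack. Element 0 is always the base plan.
class ToyPlanStack {
public:
  ToyPlanStack() {
    m_plans.push_back(std::make_shared<ToyPlan>(ToyPlan{"base"}));
  }

  void Push(ToyPlanSP plan) { m_plans.push_back(std::move(plan)); }

  ToyPlanSP Current() const {
    assert(!m_plans.empty());
    return m_plans.back();
  }

  std::size_t Size() const { return m_plans.size(); }

  // Models the discard step: everything except the base plan is moved
  // to the discarded list.
  void DiscardAll() {
    while (m_plans.size() > 1) {
      m_discarded.push_back(m_plans.back());
      m_plans.pop_back();
    }
  }

  bool IsDiscarded(const ToyPlanSP &plan) const {
    return std::find(m_discarded.begin(), m_discarded.end(), plan) !=
           m_discarded.end();
  }

  // True when a pop at this point would have to remove the base plan,
  // which is the condition the real assert guards against.
  bool PopWouldHitBasePlan() const { return m_plans.size() <= 1; }

private:
  std::vector<ToyPlanSP> m_plans;
  std::vector<ToyPlanSP> m_discarded;
};

struct RaceResult {
  bool plan_was_current_at_start;
  bool plan_discarded_before_pop;
  bool pop_would_hit_base_plan;
};

// Replays the three steps in the order the log shows them.
RaceResult ReplayRace() {
  ToyPlanStack stack;
  auto step = std::make_shared<ToyPlan>(ToyPlan{"step over breakpoint"});
  stack.Push(step);

  // 1. The "ShouldStop" side reads the current plan.
  ToyPlanSP current = stack.Current();
  bool was_current = (current == step);

  // 2. The teardown side discards every plan except the base plan.
  stack.DiscardAll();

  // 3. The "ShouldStop" side now wants to pop the plan it read in step 1.
  return {was_current, stack.IsDiscarded(current),
          stack.PopWouldHitBasePlan()};
}
```

In this model all three fields of the result come back true: the step plan was current when it was read, it had already been discarded by the time of the pop, and the pop would therefore land on the base plan.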
intern-state ^^^^^^^^ Thread::ShouldStop Begin ^^^^^^^^
intern-state Plan stack initial state:
thread #1: tid = 0x1213e8:
Active plan stack:
Element 0: Base thread plan.
Element 1: Single stepping past breakpoint site 11 at 0xf7fc0c14
Thread::ShouldStop sees that the current plan is the single step plan.
python3.10 Discarding thread plans for thread (tid = 0x1213e8, force 1)
Destroying the debugger discards the single step plan.
intern-state Completed step over breakpoint plan.
Thread::ShouldStop works out that the single step is finished and so the plan should be discarded.
intern-state ^^^^^^^^ Thread::ShouldStop plan stack before PopPlan ^^^^^^^^
intern-state thread #1: tid = 0x1213e8:
Active plan stack:
Element 0: Base thread plan.
Discarded plan stack:
Element 0: Single stepping past breakpoint site 11 at 0xf7fc0c14
Whoops, it already was and we certainly can't pop the base plan!
The overall issue seems to be one of destruction order. Or at least, something isn't telling the threads to stop before we start destroying the process. I think the threads are destroyed later in Process::Finalize, and I think there's a potential bug there too.
m_thread_plans.Clear();
m_thread_list_real.Destroy();
m_thread_list.Destroy();
m_extended_thread_list.Destroy();
The thread plans are cleared before the threads that would be looking at them are destroyed. That's not the cause of this particular assert, but it's suspicious to me at least.
The underlying problem is probably whatever is letting Thread::ShouldStop run, even though we're in the process of destroying the Process that contains the threads.
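To illustrate what "telling the threads to stop" before teardown could mean, here is a minimal sketch in which teardown and plan completion share one lock and a finalizing flag. It is an assumption about one possible shape, not a proposed patch and not how LLDB is implemented: ToyProcess, Finalize, CompleteCurrentPlan and PlanCount are hypothetical names, and plans are modelled as plain strings.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy process that owns a plan stack. Element 0 is the base plan.
class ToyProcess {
public:
  ToyProcess() : m_plans{"base"} {}

  void PushPlan(std::string name) {
    std::lock_guard<std::mutex> guard(m_mutex);
    m_plans.push_back(std::move(name));
  }

  // Teardown: mark the process as finalizing, then discard every plan
  // except the base plan, all under the same lock.
  void Finalize() {
    std::lock_guard<std::mutex> guard(m_mutex);
    m_finalizing = true;
    m_plans.resize(1);
  }

  // Stand-in for the "plan finished, pop it" step. Returns true only if
  // a non-base plan was actually popped.
  bool CompleteCurrentPlan() {
    std::lock_guard<std::mutex> guard(m_mutex);
    assert(!m_plans.empty());
    if (m_finalizing || m_plans.size() <= 1)
      return false; // Teardown has started or only the base plan is left.
    m_plans.pop_back();
    return true;
  }

  std::size_t PlanCount() {
    std::lock_guard<std::mutex> guard(m_mutex);
    return m_plans.size();
  }

private:
  std::mutex m_mutex;
  bool m_finalizing = false; // Guarded by m_mutex.
  std::vector<std::string> m_plans;
};
```

With this shape, a completion attempt that arrives after Finalize becomes a no-op instead of reaching the pop. Whether something like this fits the real code depends on locking and ordering constraints that this sketch does not model.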