Skip to content

Crash on greenlet.switch #261

@avtikhon

Description

@avtikhon

Crash on greenlet.switch:
https://gitlab.com/tarantool/tarantool/-/jobs/932191839#L4908
https://gitlab.com/tarantool/tarantool/-/jobs/922096265#L4649
https://gitlab.com/tarantool/tarantool/-/jobs/923636066#L4187

[073] vinyl/errinj_ddl.test.lua                                       
[073] [Instance "vinyl"] Tarantool server failed to start
[073] 
[073] Last 15 lines of Tarantool Log file [Instance "vinyl"][/builds/nZUxDh2c/0/tarantool/tarantool/test/var/073_vinyl/vinyl.log]:
[073] tcp_server: remove dead UNIX socket: /builds/nZUxDh2c/0/tarantool/tarantool/test/var/073_vinyl/vinyl.control
[073] started
[073] 2020-12-18 23:06:29.624 [22666] main/103/vinyl C> Tarantool 2.7.0-126-g720772b98
[073] 2020-12-18 23:06:29.624 [22666] main/103/vinyl C> log level 5
[073] 2020-12-18 23:06:29.626 [22666] main/103/vinyl I> mapping 536870912 bytes for memtx tuple arena...
[073] 2020-12-18 23:06:29.626 [22666] main/103/vinyl I> mapping 536870912 bytes for vinyl tuple arena...
[073] 2020-12-18 23:06:29.709 [22666] main/103/vinyl I> instance uuid 526cc434-698b-45cf-9193-44dd6aee8372
[073] 2020-12-18 23:06:29.710 [22666] main/103/vinyl I> instance vclock {1: 21001}
[073] 2020-12-18 23:06:29.710 [22666] iproto/101/main I> binary: bound to unix/:(socket)
[073] 2020-12-18 23:06:29.717 [22666] coio vy_log.c:2202 E> failed to process vylog record: delete_slice{slice_id=341, }
[073] 2020-12-18 23:06:29.717 [22666] main/103/vinyl vy_log.c:2068 E> ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Slice 341 deleted but not registered
[073] 2020-12-18 23:06:29.717 [22666] main/103/vinyl vy_log.c:2420 E> failed to load `/builds/nZUxDh2c/0/tarantool/tarantool/test/var/073_vinyl/vinyl/00000000000000020789.vylog'
[073] 2020-12-18 23:06:29.718 [22666] main/103/vinyl vy_log.c:2068 E> ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Slice 341 deleted but not registered
[073] 2020-12-18 23:06:29.718 [22666] main/103/vinyl F> can't initialize storage: Invalid VYLOG file: Slice 341 deleted but not registered
[073] 2020-12-18 23:06:29.718 [22666] main/103/vinyl F> can't initialize storage: Invalid VYLOG file: Slice 341 deleted but not registered
[061] sql/savepoints.test.lua                         vinyl           [ pass ]
[073] 
[073] Worker "073_vinyl" received the following error; stopping...
[073] Traceback (most recent call last):
[073]   File "/builds/nZUxDh2c/0/tarantool/tarantool/test-run/lib/worker.py", line 316, in run_task
[073]     task, self.server, self.inspector)
[073]   File "/builds/nZUxDh2c/0/tarantool/tarantool/test-run/lib/test_suite.py", line 261, in run_test
[073]     short_status, result_checksum = test.run(server)
[073]   File "/builds/nZUxDh2c/0/tarantool/tarantool/test-run/lib/test.py", line 182, in run
[073]     self.execute(server)
[073]   File "/builds/nZUxDh2c/0/tarantool/tarantool/test-run/lib/tarantool_server.py", line 391, in execute
[073]     ts.stop_nondefault(signal=signal.SIGKILL)
[073]   File "/builds/nZUxDh2c/0/tarantool/tarantool/test-run/lib/preprocessor.py", line 444, in stop_nondefault
[073]     v.stop(silent=True, signal=signal)
[073]   File "/builds/nZUxDh2c/0/tarantool/tarantool/test-run/lib/tarantool_server.py", line 1059, in stop
[073]     save_join(self.crash_detector)
[073]   File "/builds/nZUxDh2c/0/tarantool/tarantool/test-run/lib/tarantool_server.py", line 52, in save_join
[073]     green_obj.get(timeout=timeout)
[073]   File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 463, in get
[073]     result = self.parent.switch()
[073]   File "/usr/lib/python2.7/dist-packages/gevent/hub.py", line 609, in switch
[073]     return greenlet.switch(self)
[073] TarantoolStartError
[073] Exception: 

Investigating the issue once got the following:

#0  0x55b5cadb2d38 in crash_collect+c6
#1  0x55b5cadb4d2f in crash_signal_cb+55
#2  0x7fafa26603c0 in funlockfile+60
#3  0x55b5cafc76c5 in histogram_lookup_bucket+1c
#4  0x55b5cafc7855 in histogram_discard+27
#5  0x55b5cad18392 in vy_lsm_unacct_range+34
#6  0x55b5cad194d0 in vy_lsm_split_range+4b7
#7  0x55b5cad49bd0 in vy_task_compaction_new+a9
[001] vinyl/gh.test.lua                                               
[001] 
[001] [Instance "vinyl" killed by signal: 6 (SIGABRT)]
#8  0x55b5cad4a65f in vy_scheduler_peek_compaction+9a
#9  0x55b5cad4a702 in vy_schedule+56
#10 0x55b5cad4ab99 in vy_scheduler_f+164
#11 0x55b5cabaa837 in _ZL16fiber_cxx_invokePFiP13__va_list_tagES0_+22
#12 0x55b5cadbccb7 in fiber_loop+88
#13 0x55b5cb0ac1f3 in coro_init+50

As it seen on 08/06/2021 traceback changed:
also it happened on Tarantool default server restart which was temporary added to the test:
test changed with:

diff --git a/test/vinyl/errinj_stat.test.lua b/test/vinyl/errinj_stat.test.lua
index 0d1e4af8e..d9ea347b2 100644
--- a/test/vinyl/errinj_stat.test.lua
+++ b/test/vinyl/errinj_stat.test.lua
@@ -1,5 +1,7 @@
 test_run = require('test_run').new()
 
+test_run:cmd('restart server default')
+
 -- Since we store LSNs in data files, the data size may differ
 -- from run to run. Deploy a new server to make sure it will be
 -- the same so that we can check it.

test result log output:

test_run = require('test_run').new()
---
...
test_run:cmd('restart server default')

It means that to reproduce the issue, need to run 2 tests on single worker. The first one need to be vinyl/gh.test.lua and the second test needs to try to restart default tarantool server on the very start, it will cause the issue, check the reproducer:

cat >a.rep <<EOF
- [vinyl/gh.test.lua, null]
- [vinyl/restart_server.test.lua, null]
EOF

cat >vinyl/restart_server.test.lua <<EOF
test_run = require('test_run').new()
test_run:cmd('restart server default')
EOF

./test-run.py vinyl/restart_server.test.lua --update-result

c=0 ; while ./test-run.py --reproduce a.rep --force ; do c=$(($c+1)) ; echo "======================================= $c" | tee c.count ; done ; echo "======================================= $c" | tee c.count

Found that the following block of code avoids of the current crash:

diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index 5aca3cc..3891b34 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -1082,8 +1082,8 @@ class TarantoolServer(Server):
 
             timer = Timer(timeout, kill)
             timer.start()
-            if self.crash_detector is not None:
-                save_join(self.crash_detector)
+            #if self.crash_detector is not None:
+            #    save_join(self.crash_detector)
             self.wait_stop()
             timer.cancel()
 

It means that we can't use crash detector in such default tarantool server restart due to it catches crash on server stop after the previous test:

diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index 5aca3cc..fc45dd1 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -675,7 +675,7 @@ class TarantoolServer(Server):
         self.crash_enabled = False
 
         # set in from a test let test-run ignore server's crashes
-        self.crash_expected = False
+        self.crash_expected = True
 
         # filled in {Test,FuncTest,LuaTest,PythonTest}.execute()
         # or passed through execfile() for PythonTest

Also found that the additional cause of the issue that main worker log file var/log/008_vinyl.log was some how rewritten ! and also it happened even after the fail !!! (check the comments below):

cat var/log/008_vinyl.log 
[2021-06-08 13:20:49.002871] 
[2021-06-08 13:20:49.002871] * [QA Notice]
[2021-06-08 13:20:49.002871] *
[2021-06-08 13:20:49.002871] * The server 'proxy' does not stop during 5 seconds after the 9 (SIGKILL) signal.
[2021-06-08 13:20:49.002871] * Info: process 22375 [unknown; unknown]
[2021-06-08 13:20:49.002871] * Sending SIGKILL...
[2021-06-08 13:20:49.002871] *

while log file /home/ubuntu/tarantool/test/var/008_vinyl/vinyl.log inside worker workspace was written 30 secs before !:

Starting instance vinyl...
Run console at unix/:/home/ubuntu/tarantool/test/var/008_vinyl/vinyl.control
started
mkdir /homlib/tarantool_server.pye/ubuntu/tarantool/test/var/008_vinyl/vinyl
2021-06-08 13:20:20.906 [20512] main/103/vinyl C> Tarantool 2.9.0-57-gea0b126ff

Issue output:

[008] vinyl/errinj_stat.test.lua                                      Failed vinyl
[008] Last 15 lines of Tarantool Log file [Instance "vinyl"][/home/ubuntu/tarantool/test/var/008_vinyl/vinyl.log]:
[008] 2021-06-08 13:20:43.903 [22398] main/103/vinyl C> Tarantool 2.9.0-57-gea0b126ff
[008] 2021-06-08 13:20:43.904 [22398] main/103/vinyl C> log level 5
[008] 2021-06-08 13:20:43.904 [22398] main/103/vinyl I> wal/engine cleanup is paused
[008] 2021-06-08 13:20:43.904 [22398] main/103/vinyl I> mapping 536870912 bytes for memtx tuple arena...
[008] 2021-06-08 13:20:43.904 [22398] main/103/vinyl I> Actual slab_alloc_factor calculated on the basis of desired slab_alloc_factor = 1.044274
[008] 2021-06-08 13:20:43.904 [22398] main/103/vinyl I> mapping 536870912 bytes for vinyl tuple arena...
[008] 2021-06-08 13:20:43.954 [22398] main/103/vinyl I> instance uuid 1b85ab4d-e9a1-477e-8b65-fccf41f1022f
[008] 2021-06-08 13:20:43.954 [22398] main/103/vinyl I> instance vclock {1: 23605}
[008] 2021-06-08 13:20:43.954 [22398] main/103/vinyl I> : bound to unix/:(socket)
[008] 2021-06-08 13:20:43.958 [22398] coio vy_log.c:2276 E> failed to process vylog record: delete_slice{slice_id=474, }
[008] 2021-06-08 13:20:43.958 [22398] main/103/vinyl vy_log.c:2141 E> ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Slice 474 deleted but not registered
[008] 2021-06-08 13:20:43.958 [22398] main/103/vinyl vy_log.c:2495 E> failed to load `/home/ubuntu/tarantool/test/var/008_vinyl/vinyl/00000000000000023562.vylog'
[008] 2021-06-08 13:20:43.958 [22398] main/103/vinyl vy_log.c:2141 E> ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Slice 474 deleted but not registered
[008] 2021-06-08 13:20:43.958 [22398] main/103/vinyl F> can't initialize storage: Invalid VYLOG file: Slice 474 deleted but not registered
[008] 2021-06-08 13:20:43.958 [22398] main/103/vinyl F> can't initialize storage: Invalid VYLOG file: Slice 474 deleted but not registered
[008] 
[008] Worker "008_vinyl" received the following error; stopping...
[008] Traceback (most recent call last):
[008]   File "/home/ubuntu/tarantool/test-run/lib/worker.py", line 320, in run_task
[008]     short_status, result_checksum, duration = self.suite.run_test(
[008]   File "/home/ubuntu/tarantool/test-run/lib/test_suite.py", line 274, in run_test
[008]     short_status, result_checksum = test.run(server)
[008]   File "/home/ubuntu/tarantool/test-run/lib/test.py", line 190, in run
[008]     self.execute(server)
[008]   File "/home/ubuntu/tarantool/test-run/lib/tarantool_server.py", line 399, in execute
[008]     ts.stop_nondefault(signal=signal.SIGKILL)
[008]   File "/home/ubuntu/tarantool/test-run/lib/preprocessor.py", line 444, in stop_nondefault
[008]     v.stop(silent=True, signal=signal)
[008]   File "/home/ubuntu/tarantool/test-run/lib/tarantool_server.py", line 1086, in stop
[008]     save_join(self.crash_detector)
[008]   File "/home/ubuntu/tarantool/test-run/lib/tarantool_server.py", line 56, in save_join
[008]     green_obj.get(timeout=timeout)
[008]   File "src/gevent/greenlet.py", line 824, in gevent._gevent_cgreenlet.Greenlet.get
[008]   File "src/gevent/greenlet.py", line 812, in gevent._gevent_cgreenlet.Greenlet.get
[008]   File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
[008]   File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
[008]   File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
[008]   File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
[008]   File "src/gevent/greenlet.py", line 906, in gevent._gevent_cgreenlet.Greenlet.run
[008]   File "/home/ubuntu/.local/lib/python3.8/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
[008]     return handle(*args_tuple)
[008]   File "/home/ubuntu/tarantool/test-run/lib/inspector.py", line 103, in handle
[008]     result = self.parser.parse_preprocessor(line)
[008]   File "/home/ubuntu/tarantool/test-run/lib/preprocessor.py", line 123, in parse_preprocessor
[008]     return self.server(stype, sname, options)
[008]   File "/home/ubuntu/tarantool/test-run/lib/preprocessor.py", line 350, in server
[008]     return getattr(self, attr)(ctype, sname, opts)
[008]   File "/home/ubuntu/tarantool/test-run/lib/preprocessor.py", line 341, in server_restart
[008]     self.server_start(ctype, sname, opts)
[008]   File "/home/ubuntu/tarantool/test-run/lib/preprocessor.py", line 209, in server_start
[008]     self.servers[sname].start(
[008]   File "/home/ubuntu/tarantool/test-run/lib/tarantool_server.py", line 909, in start
[008]     self.wait_until_started(wait_load, deadline)
[008]   File "/home/ubuntu/tarantool/test-run/lib/tarantool_server.py", line 1142, in wait_until_started
[008]     self.wait_load(deadline)
[008]   File "/home/ubuntu/tarantool/test-run/lib/tarantool_server.py", line 1126, in wait_load
[008]     if not self.logfile_pos.seek_wait(msg, p, self.name, deadline):
[008]   File "/home/ubuntu/tarantool/test-run/lib/tarantool_server.py", line 483, in seek_wait
[008]     raise TarantoolStartError(name)
[008] lib.tarantool_server.TarantoolStartError: Failed vinyl
[008] 
[008] Exception: Failed vinyl

As found above [QA Notice] message was the only message in the log file of the worker and I've tried to temporary block it to check if it could help to avoid of crash in test-run:

diff --git a/lib/colorer.py b/lib/colorer.py
index 48ab560..06000db 100644
--- a/lib/colorer.py
+++ b/lib/colorer.py
@@ -43,7 +43,7 @@ def qa_notice(*args, **kwargs):
     data = '\n* [QA Notice]\n*\n{}*\n'.format(data)
 
     # Write out.
-    color_stdout(data, **kwargs)
+    #color_stdout(data, **kwargs)
 
 
 class CSchema(object):

the same issue happens now w/o crash:

[007] vinyl/upsert.test.lua                                           Failed vinyl
[007] Last 15 lines of Tarantool Log file [Instance "vinyl"][/home/ubuntu/tarantool/test/var/007_vinyl/vinyl.log]:
[007] 2021-06-08 14:15:40.269 [39454] main/103/vinyl C> Tarantool 2.9.0-57-gea0b126ff
[007] 2021-06-08 14:15:40.269 [39454] main/103/vinyl C> log level 5
[007] 2021-06-08 14:15:40.269 [39454] main/103/vinyl I> wal/engine cleanup is paused
[007] 2021-06-08 14:15:40.269 [39454] main/103/vinyl I> mapping 536870912 bytes for memtx tuple arena...
[007] 2021-06-08 14:15:40.269 [39454] main/103/vinyl I> Actual slab_alloc_factor calculated on the basis of desired slab_alloc_factor = 1.044274
[007] 2021-06-08 14:15:40.270 [39454] main/103/vinyl I> mapping 536870912 bytes for vinyl tuple arena...
[007] 2021-06-08 14:15:40.297 [39454] main/103/vinyl I> instance uuid 09b69427-ae86-4a2f-a3be-95bc67f00dae
[007] 2021-06-08 14:15:40.297 [39454] main/103/vinyl I> instance vclock {1: 20503}
[007] 2021-06-08 14:15:40.297 [39454] main/103/vinyl I> : bound to unix/:(socket)
[007] 2021-06-08 14:15:40.300 [39454] coio vy_log.c:2276 E> failed to process vylog record: delete_slice{slice_id=115, }
[007] 2021-06-08 14:15:40.300 [39454] main/103/vinyl vy_log.c:2141 E> ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Slice 115 deleted but not registered
[007] 2021-06-08 14:15:40.300 [39454] main/103/vinyl vy_log.c:2495 E> failed to load `/home/ubuntu/tarantool/test/var/007_vinyl/vinyl/00000000000000020419.vylog'
[007] 2021-06-08 14:15:40.300 [39454] main/103/vinyl vy_log.c:2141 E> ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Slice 115 deleted but not registered
[007] 2021-06-08 14:15:40.300 [39454] main/103/vinyl F> can't initialize storage: Invalid VYLOG file: Slice 115 deleted but not registered
[007] 2021-06-08 14:15:40.300 [39454] main/103/vinyl F> can't initialize storage: Invalid VYLOG file: Slice 115 deleted but not registered
[007] 
[007] [Instance "vinyl"] Failed to start tarantool instance "vinyl"
[007] [ fail ]
[007] Worker "007_vinyl" got failed test; restarted the server
[007] vinyl/dump_stress.test.lua                                      [ disabled ]

Also found that this error happens only in the tests run after vinyl/gh.test.lua on restart server default command, when some inconsistent status of the Tarantool worker run server stopping and the error happens. So gh.test.lua test should be checked for the root cause of the issue. Also found the other issues after gh.test.lua test tarantool/tarantool-qa#126

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions