Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error stack can be broken after several minutes under heavy workload [CORE4553] #4871

Closed
firebird-automations opened this issue Sep 14, 2014 · 6 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: @pavel-zotov

Consider the following DDL:

recreate table ttt(id bigint primary key, x int); commit;
insert into ttt select i, 1 from (select row_number()over() i from rdb$types rows 100) order by rand();
commit;
create descending index ttt_id_desc on ttt(id);
commit;

set term ^;

create or alter procedure p_15(a_rnd double precision) returns (dbkey dm_dbkey) as
declare c cursor for (select rdb$db_key dbkey from ttt where id >= :a_rnd order by id rows 1 FOR UPDATE WITH LOCK);
begin
execute statement ('select rdb$db_key dbkey from ttt where id >= :r order by id rows 1 FOR UPDATE WITH LOCK')
( r := a_rnd)
into dbkey;
suspend;
end
^
create or alter procedure p_14(a_rnd double precision) returns (dbkey dm_dbkey) as
begin
execute statement ('select p.dbkey from p_15( :r ) p')
( r := a_rnd)
into dbkey;
suspend;
end
^

--- . . . so on downto p_00 . . .

create or alter procedure p_00(a_rnd double precision) returns (dbkey dm_dbkey) as
begin
execute statement ('select p.dbkey from p_01( :r ) p')
( r := a_rnd)
into dbkey; suspend;
end
^
set term ;^
commit;

(full text see in attach, file broken_stack_ddl.sql ).

So, we can call any of SPs from: { p_00, p_01, ... p_14 } - and they all ultimately will call p_15 (most deepest) which attempts to lock first row with ID >= :some_selected_id in table TTT.

There is huge .sql which is run by ~50...70 attaches and does the follow:

set transaction no wait no auto undo;
set term ^;
execute block returns(dts char(12), id bigint, elapsed_ms int) as
declare id_min double precision;
declare id_max double precision;
declare id_rnd double precision;
declare dbkey char(8) character set octets;
declare t0 timestamp;
begin
select min(id)-0.5 from ttt into id_min;
select max(id)+0.5 from ttt into id_max;
id_rnd = id_min+rand()*(id_max-id_min);
t0='now';
select dbkey from p_03( :id_rnd ) into dbkey; -- #⁠#⁠#⁠ NB #⁠#⁠#⁠

update ttt t set t.x = rand()*1000 -- (select count(*) from rdb$types,rdb$types)
where t.rdb$db_key = :dbkey
returning id into id;

dts=substring(cast(cast('now' as timestamp) as varchar(25)) from 12 for 12);
elapsed_ms = datediff(millisecond from t0 to cast('now' as timestamp));
suspend;
end
^
set term ;^
commit;

-- and this EB repeats many times (full test see in attach, file broken_stack_run.sql ).

We can change the name of called SP from p_01 upto p_14.
But error can be reproduced only if 'p_00' . . . 'p_03' is specified in EB (at least in my env.).

Batch file to launch isqls:

setlocal enabledelayedexpansion enableextensions
set winq=%1
for /l %%i in (1, 1, %winq%) do (
set str=isql 192.168.0.220/3333:oltp30 -n -i broken_stack_run.sql 1^>nul 2^>nul
start /min cmd /c !str!
)

If we run this batch like this: broken_stack_run.bat 70 - it will launch 70 isql windows.

Trace config:

enabled = true
log_errors = true
time_threshold = 0
print_plan = true
explain_plan = true
print_perf = true
max_sql_length = 16384
max_log_size = 9999999999

After run trace and batch: broken_stack_run.bat 70 - we need to wait a few minutes.

Then we can periodically check log of trace like this (file `get-broken-lines.sh` in attach) :

grep " : " $1 | grep -v "deadlock\|concurrent\|At" | head -20

(where $1 - the name of trace log).

The following 'broken' messages will be displayed after some minutes:

335544842 : 1
335544842 : 7
335544842 : 5
335544842 : 7
335544842 : 5
335544842 : 5
335544842 : t procedure 'P_09' line: 4, col: 21
335544842 : procedure 'P_04' line: 4, col: 11
335544842 : procedure 'P_04' line: 4, col: 11
335544842 : procedure 'P_05' line: 4, col: 13
335544842 : procedure 'P_11' line: 4, col: 25
335544842 : procedure 'P_11' line: 4, col: 25
335544842 : procedure 'P_11' line: 4, col: 25
335544842 : 5
335544842 : 5
335544842 : procedure 'P_03' line: 4, col: 9
335544842 : t procedure 'P_08' line: 4, col: 19
335544842 : t procedure 'P_08' line: 4, col: 19
335544842 : t procedure 'P_05' line: 4, col: 13
335544842 : t procedure 'P_08' line: 4, col: 19

The full test case please see in attach.
Correction in `broken_stack_run.bat` is required for host name + port + database alias.

====== Test Details ======

Reproduced only under high concurrent load (OLTP-EMUL).
Letters to / from Alex: 15-oct-2014, subj: "better fix for damaged strings in error messages".
Issue in trunk/ChangeLog:
2014-10-13 09:30 alexpeshkoff
M src/yvalve/MasterImplementation.cpp
Fixed search for free circullar buffer when using LWP number as thread id

@firebird-automations
Copy link
Collaborator Author

Commented by: @pavel-zotov

Sorry, couldn`t attach .zip

Please download it from here: https://yadi.sk/d/P1UVFnIdbTyMa

(link is permanent)

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

reporter: Pavel Zotov [ tabloid ] => Dmitry Yemanov [ dimitr ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

reporter: Dmitry Yemanov [ dimitr ] => Pavel Zotov [ tabloid ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 3.0 Beta 1 [ 10332 ]

assignee: Alexander Peshkov [ alexpeshkoff ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Resolved [ 5 ]

QA Status: Cannot be tested

Test Details: Reproduced only under high concurrent load (OLTP-EMUL).
Letters to / from Alex: 15-oct-2014, subj: "better fix for damaged strings in error messages".
Issue in trunk/ChangeLog:
2014-10-13 09:30 alexpeshkoff
M src/yvalve/MasterImplementation.cpp
Fixed search for free circullar buffer when using LWP number as thread id

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

status: Resolved [ 5 ] => Closed [ 6 ]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants