Moving database to 'full shutdown' state can corrupt index (at least) on long-key text field. Shutdown process can hang w/o returning control to outer environment. [CORE4914] #5206

firebird-automations · 2015-08-24T20:44:03Z

Is related to CORE4904

Attachments:
fb3ss_result.zip
dbshut-fb30ss-15attaches-missing-some-errors-in-firebird-log.zip

I've encountered with lot of errors which can be produced when database is moved to shutdown state during intensive DML workload.
It seems that almost all of these errors are caused by presence of long-key indexed field, especially when this field is filled with the same text value (i.e. 100% duplicates). Errors can happen even on 5-7 attachments when test running on machine with slow CPU and/or IO subsystem. All tests was run on Windows.

To reproduce:
1) Download following two files using your svn client:
1.1) http://svn.code.sf.net/p/firebird/code/qa/fbt-repository/trunk/tests/functional/tabloid/batches/dbshut.bat.txt
1.2) http://svn.code.sf.net/p/firebird/code/qa/fbt-repository/trunk/tests/functional/tabloid/batches/dbshut.conf

2) Rename 'dbshut.bat.tx' to 'dbshut.bat' (i.e. remove last part of its 'long' extension).
3) Open 'dbshut.conf' and correct settings related to your FB instance(s) (hereafter: 'Cs' = Classic; 'sC' = SuperClassic; 'sS' = SuperServer):

fb25Cshome=<path with FB binaries of 2.5 Classic>
fb25Csport =<port which is listening by FB 2.5 Classic instance>

fb25sChome=<path with FB binaries of 2.5 SuperClassic>
fb25sCport=<port which is listening by FB 2.5 SuperClassic instance>

fb25sShome=...
fb25sSport=...

fb30Cshome=...
fb30Csport=...

fb30sChome=...
fb30sCport=...

fb30sShome=...
fb30sSport=...

4) Download MTEE utility (Commandline Standard Stream Splitter) from http://www.commandline.co.uk/mtee/

Syntax for usage: dbshut.bat <FB> <NN> 2>&1 | mtee /t <log>

where:
<FB> is case-insensitive abbrev. of architecture ( SS | SC |CS) concatenated with FB version ( 25 | 30 );
<NN> is number of opening ISQL sessions;
<log> is name of log file that should store console output.

Usage samples:

1) test FB 2.5 Classic with opening 20 ISQL sessions:

dbshut.bat cs25 20 2>&1 | mtee /t dbshut_cs25_att20.log

2) test FB 3.0 SuperClassic with opening 35 ISQL sessions:

dbshut.bat ss30 35 2>&1 | mtee /t dbshut_sc30_att35.log

3) test FB 3.0 SuperServer with opening 40 ISQL sessions:

dbshut.bat ss30 40 2>&1 | mtee /t dbshut_ss30_att40.log

This test does following:

1) reads config 'dbshut.conf' and determines:
1.1) required table DDL (parameter 'loading_mode'; can be: tiny | small | middle | strong | heavy | ultra; value of this parameter defines ratio of index-key to page_size);
1.2) database attributes: page_size and FW;
1.3) how long should all ISQL sessions do their DML work (inserts into table with indexed field) - see parameters dml_work_time_*;
2) created temp database with test table;
3) prepares config for trace and launches 'FBSVCMGR action_trace_start ...';
4) starts in separate windows required number of ISQL sessions (their number is taken from command-line argument %2 to this batch);
5) forces each ISQL session periodically check until all other ISQLs become ready to perform DML. All subsequent DML will be start only when ALL required attachments are established - this process not instant and can take several seconds. When all ISQLs are ready to perform DML, this action (INSERT INTO ...) will be auto-started by them and will be done during dml_work_time_* milliseconds.
6) After DML was done during 'dml_work_time_*', batch initiates SHUTDOWN process (FBSVCMGR ... prp_shutdown_mode prp_sm_full prp_force_shutdown 0)
7) After control from FBSVCMGR which makes shutdown will return batch will stop TRACE session which was launched on step "3)" and wait several seconds until IO subsystem will finish flushing trace log on disk.
8) After that batch will parse trace log and try to find there 1st occurence of word "shutdown". This step was added at the beginning of this test implementation: its purpose was to check that no actions can be done by common ISQL attachments after database will be moved to 'shutdown' state. Currently this step can be considered as excessive.
9) Time difference between moments of start and finish shutdown process is evaluated and logged. Also, timestamps for each ISQL session when it finishes with DML due to db-shutdown are logged.
10) Move database back in ONLINE state and run validation (FBSVCMGR action_validate dbname ...)

After this batch will finish, one may see its log created by MTEE (e.g. "dbshut_ss30_att10.log" for command like: dbshut.bat ss30 40 2>&1 | mtee /t dbshut_ss30_att40.log ).

On WI-V3.0.0.31981 (07-aug-2015) following messages appeared:

Page 0 wrong type (expected 7 encountered 1)
page NNN is of wrong type (expected 7, found 32)
(or: "(expected 7, found -67)", "(expected 7, found -68)", "(expected 7, found 45)")
internal Firebird consistency check (error during savepoint backout (290), file: exe.cpp line: 4097)
internal Firebird consistency check (index inconsistent (204), file: btr.cpp line: 4987)
Index N has orphan child page at page MMMM in table FIX (129)
Index N is corrupt on page MMMM level 2 at offset KKKK. File: ..\..\..\src\jrd\validation.cpp, line: 2136
Index N has inconsistent left sibling pointer, page MMMM level 2 in table FIX (129)
Index N misses node on page MMMM level 2 in table FIX (129)

On WI-V3.0.0.32008 (23-aug-2015) following messages appeares:

Error: Page NNN wrong type (expected data encountered unknown (210))
Error: Data page NNN {sequence MMM} is confused
Warning: Pointer page NNN {sequence 0} bits {0x01 full} are not consistent with data page MMM {sequence 111} state {0x00 }
Warning: Index N misses node on page NNN level 2 at offset MMMM
Error: Index N is corrupt on page MMMM level 2 at offset KKKK. File: ..\..\..\src\jrd\validation.cpp, line: 2055
Warning: Index N misses node on page MMMM level 2 at offset KKKK
Error: Page NNN wrong type (expected index B-tree encountered unknown (203))
Warning: Index N has inconsistent left sibling pointer, page MMMM level 2 at offset KKKK
Error: Index N is corrupt on page MMMM level 2 at offset KKKK. File: ..\..\..\src\jrd\validation.cpp, line: 2258
Error: Page N wrong type (expected data encountered purposely undefined)
Error: Index N is corrupt on page MMMM level 1 at offset KKKK. File: ..\..\..\src\jrd\validation.cpp, line: 2258
Error: Page N wrong type (expected data encountered purposely undefined)
Warning: Pointer page N {sequence 0} bits {0x00 } are not consistent with data page M {sequence 1183} state {0x10 empty}

Also, there were strange hangs of FBSVCMGR where batch could not receive control from "prp_shutdown_mode prp_sm_full prp_force_shutdown 0" command. In such cases FB service most likely should be restarted because Ctrl-Break does not work at all.

PS. As far as I can see after dozens of this test run, most affect on database shutdown result have following values:
1) length of indexed field;
2) durability of DML.

PPS. Errors can occur not only on poor hardware. I have several results obtained by Simonov Denis which he got on power server also, but with increased number of attachments (200...300).

Commits: dd0a2dd FirebirdSQL/fbt-repository@3f1880c FirebirdSQL/fbt-repository@459fdfc

The text was updated successfully, but these errors were encountered:

firebird-automations · 2015-08-26T09:47:03Z

Commented by: @sim1984

I conducted a series of tests. Sometimes FB falls before the start of the main execute block, sometimes after. The minimum number of connections, which managed to reproduce the problem 17. But it is a floating number, sometimes the test could successfully withstood 25 connections. 50 connections error occurs already stable.

Firebird V3.0.0.32008 x64

firebird-automations · 2015-08-26T09:47:03Z

Modified by: @sim1984

Attachment: fb3ss_result.zip [ 12801 ]

firebird-automations · 2015-08-27T07:54:15Z

Commented by: @pavel-zotov

One more issue: it was encountered that some message(s) about database corruption can appear only in ISQL log and absent in firebird.log. Currently I can see this one:

Statement failed, SQLSTATE = XX001
Error during savepoint backout - transaction invalidated
-database file appears corrupt (C:\FBTESTING\DBSHUT\DBSHUT_TEST_FDB.TMP)
-wrong page type
-page 207 is of wrong type (expected index B-tree, found unknown (52))

This error does appear in 10 logs of 15 total, also it is in trace.log; but I can`t see this in firebird.log.
Got this on FB 3.0 SuperServer with 15 ISQL sessions, FW = OFF, page=8192, indexed_key_length = 1024.

Command to run batch: dbshut.bat ss30 15 2>&1 | mtee /t dbshut_ss30_att15.log
See attached file "dbshut-fb30ss-15attaches-missing-some-errors-in-firebird-log.zip"

firebird-automations · 2015-08-27T07:54:26Z

Modified by: @pavel-zotov

Attachment: dbshut-fb30ss-15attaches-missing-some-errors-in-firebird-log.zip [ 12802 ]

firebird-automations · 2015-09-02T17:18:31Z

Modified by: @hvlad

assignee: Vlad Khorsun [ hvlad ]

firebird-automations · 2015-09-03T20:53:45Z

Modified by: @hvlad

Link: This issue is related to CORE4904 [ CORE4904 ]

firebird-automations · 2015-09-03T20:53:45Z

Commented by: @hvlad

This ticket also related to the wrong handling of very long index keys (as CORE4904).
This time incomplete logic is when index bucket about to split - it is often overfilled and corrupts database.
Investigation is in progress.

firebird-automations · 2015-09-29T13:05:37Z

Commented by: @hvlad

Patch is committed (into trunk). Please verify.

firebird-automations · 2015-09-30T10:19:42Z

Commented by: @sim1984

I launched Pavel's test on 400 connections in architecture SuperServer.
At this time there were no errors.
Thank You.

firebird-automations · 2015-10-02T12:53:02Z

Commented by: @sim1984

In configurations strong, heavy no errors were found. The ultra configuration error occurred

10: 26: 20.344 Statement failed, SQLSTATE = 54000
10: 26: 20.344 Implementation limit exceeded
10: 26: 20.344 -Maximum index level reached

but validation database was successful. The number of connections at the time of shutdown were only 66 out of 400. It looks like there was a mistake when the depth of the index reached a value of 17. If I understand this is a normal situation.

All tests run on Firebird 3.0 with 400 connections.

:)

firebird-automations · 2015-10-03T14:41:48Z

Commented by: @pavel-zotov

The problem with SHUTDOWN (as it is signed in subj of this ticket) still DOES exist.

I run test with following config params:

page_size=8192
fw_on=1
chk_start_delay=1000
loading_mode=ultra
dml_work_time_ultra=30000
flush_log_delay=5000
limit_dml_rows=2000000

Number of ISQL sessions: 35. Checked FB: WI-V3.0.0.32060, arch = SS.

After all ISQL was loaded and had performed DML work, command to shutdown database was issued:

13:59:19.954 13:59:19.95 - start shutdown, point just before run it:
13:59:19.954 run_cmd="C:\1INSTALL\firebird\fb30sS\fbsvcmgr localhost/3333:service_mgr user SYSDBA password masterkey action_properties dbname C:\FBTESTING\dbshut\dbshut_test_fdb.tmp prp_shutdown_mode prp_sm_full prp_force_shutdown 0"
13:59:19.954
13:59:19.954 C:\FBTESTING\dbshut>cmd /c "C:\1INSTALL\firebird\fb30sS\fbsvcmgr localhost/3333:service_mgr user SYSDBA password masterkey action_properties dbname C:\FBTESTING\dbshut\dbshut_test_fdb.tmp prp_shutdown_mode prp_sm_full prp_force_shutdown 0"

This command did not finish: firebird.exe process did not load CPU but fbsvcmgr hanged infinite time. Connections to this database were enable, but I needed to kill FB service and fbsvcmgr - no reaction on Ctrl-C or Ctrl-Break.

Before this kill, I created full dums of:
1) firebird.exe
2) fbsvcmgr which did "prp_shutdown_mode prp_sm_full prp_force_shutdown 0"
3) fbsvcmgr which did "action_trace_start"
4) one again dump for firebird.exe

All these dumps are here:

https://yadi.sk/d/gl1URFdojVAbW

My environment:
Windows XP, ram = 1Gb, cpu=2.4Ghz.

firebird-automations · 2015-10-03T14:50:29Z

Commented by: @hvlad

Denis, big thanks for testing !

Pavel,
you (as usual) mixed different issues on this ticket.

Let this ticket be about index corruptions. Hung of client utilities is another issue (for example CORE4938) and should be discussed and fixed separately

firebird-automations · 2015-10-03T15:31:45Z

Commented by: @pavel-zotov

Well, for *very* small workload level (only 5-6 attachments) I can see final result of this test when config loading_mode=ultra.
It''s OK (load time was 30 seconds):

===
Comparison of files dbshut.fb-log-before-test.log и DBSHUT.FB-LOG-AFTER-TEST.LOG
*** dbshut.fb-log-before-test.log
* DBSHUT.FB-LOG-AFTER-TEST.LOG
1357:
1358: BALAHA Sat Oct 03 15:27:20 2015
1359: Database: C:\FBTESTING\DBSHUT\DBSHUT_TEST_FDB.TMP
1360: Validation started
1361:
1362:
1363: BALAHA Sat Oct 03 15:27:20 2015
1364: Database: C:\FBTESTING\DBSHUT\DBSHUT_TEST_FDB.TMP
1365: Validation finished: 0 errors, 0 warnings, 0 fixed
1366:
***

Unfortunately, I haven't powerful machine with Windows installed, so we have to rely on Denis' report. Or may be someone else who will be able to verify on his Windows host.

firebird-automations · 2015-10-03T15:34:21Z

Commented by: @pavel-zotov

> (for example CORE4938)

This ticket is related to FB embedded.
Should I create new ticket for describing almost(?) the same troubles in SS/SC/CS which were encountered during test from *this* ticket ?

firebird-automations · 2015-10-03T16:10:02Z

Commented by: @hvlad

The latest dump you provided is related to the CORE4952

firebird-automations · 2015-11-09T23:49:36Z

Modified by: @hvlad

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 3.0 RC 1 [ 10584 ]

firebird-automations closed this as completed Nov 9, 2015

firebird-automations added fix-version: 3.0 RC1 resolution: fixed priority: major component: engine component: svcmgr type: bug labels Apr 25, 2021

firebird-automations assigned hvlad Apr 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Moving database to 'full shutdown' state can corrupt index (at least) on long-key text field. Shutdown process can hang w/o returning control to outer environment. [CORE4914] #5206

Moving database to 'full shutdown' state can corrupt index (at least) on long-key text field. Shutdown process can hang w/o returning control to outer environment. [CORE4914] #5206

firebird-automations commented Aug 24, 2015

firebird-automations commented Aug 26, 2015

firebird-automations commented Aug 26, 2015

firebird-automations commented Aug 27, 2015

firebird-automations commented Aug 27, 2015

firebird-automations commented Sep 2, 2015

firebird-automations commented Sep 3, 2015

firebird-automations commented Sep 3, 2015

firebird-automations commented Sep 29, 2015

firebird-automations commented Sep 30, 2015

firebird-automations commented Oct 2, 2015

firebird-automations commented Oct 3, 2015

firebird-automations commented Oct 3, 2015

firebird-automations commented Oct 3, 2015

firebird-automations commented Oct 3, 2015

firebird-automations commented Oct 3, 2015

firebird-automations commented Nov 9, 2015

Moving database to 'full shutdown' state can corrupt index (at least) on long-key text field. Shutdown process can hang w/o returning control to outer environment. [CORE4914] #5206

Moving database to 'full shutdown' state can corrupt index (at least) on long-key text field. Shutdown process can hang w/o returning control to outer environment. [CORE4914] #5206

Comments

firebird-automations commented Aug 24, 2015

2) Rename 'dbshut.bat.tx' to 'dbshut.bat' (i.e. remove last part of its 'long' extension). 3) Open 'dbshut.conf' and correct settings related to your FB instance(s) (hereafter: 'Cs' = Classic; 'sC' = SuperClassic; 'sS' = SuperServer):

fb30sShome=... fb30sSport=...

This test does following:

On WI-V3.0.0.31981 (07-aug-2015) following messages appeared:

On WI-V3.0.0.32008 (23-aug-2015) following messages appeares:

firebird-automations commented Aug 26, 2015

firebird-automations commented Aug 26, 2015

firebird-automations commented Aug 27, 2015

One more issue: it was encountered that some message(s) about database corruption can appear only in ISQL log and absent in firebird.log. Currently I can see this one:

Statement failed, SQLSTATE = XX001 Error during savepoint backout - transaction invalidated -database file appears corrupt (C:\FBTESTING\DBSHUT\DBSHUT_TEST_FDB.TMP) -wrong page type -page 207 is of wrong type (expected index B-tree, found unknown (52))

firebird-automations commented Aug 27, 2015

firebird-automations commented Sep 2, 2015

firebird-automations commented Sep 3, 2015

firebird-automations commented Sep 3, 2015

firebird-automations commented Sep 29, 2015

firebird-automations commented Sep 30, 2015

firebird-automations commented Oct 2, 2015

firebird-automations commented Oct 3, 2015

After all ISQL was loaded and had performed DML work, command to shutdown database was issued:

firebird-automations commented Oct 3, 2015

firebird-automations commented Oct 3, 2015

firebird-automations commented Oct 3, 2015

firebird-automations commented Oct 3, 2015

firebird-automations commented Nov 9, 2015

2) Rename 'dbshut.bat.tx' to 'dbshut.bat' (i.e. remove last part of its 'long' extension).
3) Open 'dbshut.conf' and correct settings related to your FB instance(s) (hereafter: 'Cs' = Classic; 'sC' = SuperClassic; 'sS' = SuperServer):

fb30sShome=...
fb30sSport=...

Statement failed, SQLSTATE = XX001
Error during savepoint backout - transaction invalidated
-database file appears corrupt (C:\FBTESTING\DBSHUT\DBSHUT_TEST_FDB.TMP)
-wrong page type
-page 207 is of wrong type (expected index B-tree, found unknown (52))