Issue Details (XML | Word | Printable)

Key: CORE-3927
Type: Bug Bug
Status: Open Open
Priority: Major Major
Assignee: Unassigned
Reporter: Viktor Belzetskiy
Votes: 0
Watchers: 9
Operations

If you were logged in you would be able to see more operations.
Firebird Core

Restore's hung for big database.

Created: 19/Sep/12 09:48 AM   Updated: 20/Sep/12 04:59 PM
Component/s: GBAK
Affects Version/s: 2.5.0
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments: 1. File db_create.7z (26 kB)
2. File db_hang.7z (353 kB)

Environment: Firebird CS 2.5.1 x86, Firebird SS 2.5.2.26538 x64, Windows2008 x86, Windows 2008R2 x64

Planning Status: Unspecified


 Description  « Hide
The DB volume is about 2 bln records, 1800 tables with 5400 indexes.

Restore's hung is happening while "gbak.exe -r -user SYSDBA -pass masterkey
-p..." executing if the key (-o "restore one table at a time") is not used.

We made a number of tries and found out that restore's hung is not depends
on:
- Firefird's architecture
- Database page size
- Windows architecture (x64, x86)
- Protocol (INET, XNET, service_mgr)

For same bug's playback you should act according to following sequence:
1. Decompressing db_hang.7z.
2. Execute script from file db_script.sql (execute time* ~16h, final
database size ~600 Gb, 7575522329 records)
3. Run Backup database "gbak -b..." (execute time* 6h, size fbk ~400 Gb)
4. Run Restore "gbak -r..." without key -o "restore one table at a time".

In our case the hung took place on 330 Gb file for "TEMP_TBL_1087"-table
(gbak: 176110000 records restored), after about 3 hours* of Restore's
executing.
* The executing time is specified for my hardware

 All   Comments   Work Log   Change History   Version Control   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Sean Leyne added a comment - 19/Sep/12 05:00 PM
Given the size of the source database, I doubt that the restore is hung, 3 hours is not a lot of time for a restore.

I suspect that the restore is struggling with reactivating the indexes.

Try the restore with the "-v" (verbose) switch, it should provide more insight into where the restore is "hanging".

Viktor Belzetskiy added a comment - 20/Sep/12 09:54 AM - edited
>I suspect that the restore is struggling with reactivating the indexes.
There is no activation indexes

>Try the restore with the "-v" (verbose) switch, it should provide more insight into where the restore is "hanging"
A lot of tests with different parameters (keys) and on different hardware, using different Windows and Firebird versions was performed. Restore process didn't hang ONLY with -o "restore one table at time" option. "Hang" mean 100% of CPU Core with no disk activity for a long time (24 hours). See PE_info.png.
Test example will show this bug.

Sean Leyne added a comment - 20/Sep/12 02:20 PM
Could you extract the db metadata and post it, I am running v2.1 and would like to test. Am unable to open the attached database.

Sean Leyne added a comment - 20/Sep/12 02:22 PM
The "-v" option would show where the restore was failing (ie. "activating and building index XXXX").

The PE_info.png actually shows that there is some disk IO taking place, which is suggestive of the sort stage required for rebuilding indexes on table data.

Viktor Belzetskiy added a comment - 20/Sep/12 03:04 PM - edited
>The "-v" option would show where the restore was failing (ie. "activating and building index XXXX").
I used this option ...
C:\Firebird\bin\gbak.exe -r -se localhost:service_mgr -user SYSDBA -pass masterkey -p 16384 -v -y F:\test\db_hang.log E:\test\db_hang.fbk F:\test\db_hang.fdb

Last lines from db_hang.log
===========
gbak: 176090000 records restored
gbak: 176100000 records restored
gbak: 176110000 records restored
gbak:
===========

>Could you extract the db metadata and post it, I am running v2.1 and would like to test. Am unable to open the attached database.
The script is for 2.5 only because Autonomous Transactions are used to speed up data insertion. If using FB2.5 is not appropriate for you, you may modify the procedure TEMP_GENERATE_ALL_DATA to make it work in one transaction as FB2.1 usually does, but it can significally slow down the insertion process. The script used for database creation will be attached to this ticket.

>The PE_info.png actually shows that there is some disk IO taking place, which is suggestive of the sort stage required for rebuilding indexes on table data.
those blue peaks from the PNG image mean that disk IO = 16 bytes (yes, bytes !!) per seconds :)))

Vlad Khorsun added a comment - 20/Sep/12 03:32 PM
There is a guess that internal savepoint number (signed 32 bit int) is overflowed and something weird happens in DFW structures.
It is not easy to confirm it by myself due to test nature...

Sean Leyne added a comment - 20/Sep/12 03:48 PM
Vlad, I didn't think that gbak used savepoints.

Vlad Khorsun added a comment - 20/Sep/12 04:04 PM
Sean,
engine does.
Every single insert (as any single statement) is wrapped (automatically) by savepoint.

Viktor kindly provided me with a few memory dumps taken from hung process, so guess above is a bit educated...

Sean Leyne added a comment - 20/Sep/12 04:21 PM - edited
Well given that the database has over 7billion rows (well beyond the capacity/size of a signed 32bit int) it seems that the size savepoint number is the problem. (Vlad it would have been nice to mention that you had looked at the memory dumps, which led to your guess/supposition ;-])

As such, I suggest that the case subject be changed to read "Unable to restore database with more than 2^31-1 rows"

Viktor Belzetskiy added a comment - 20/Sep/12 04:59 PM
Sean, the subject you've suggested is not exact for this bug,
at least because the database with such a big number of rows can be restored without any problems if there is no indexes in it.
Looks like the reason is the superposition of few factors: number of records, number of tables and number of indexes (even deactivated).