Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore's hung for big database. [CORE3927] #4262

Open
firebird-automations opened this issue Sep 19, 2012 · 13 comments
Open

Restore's hung for big database. [CORE3927] #4262

firebird-automations opened this issue Sep 19, 2012 · 13 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: Viktor Belzetskiy (viktor_bs)

Attachments:
db_hang.7z
db_create.7z

The DB volume is about 2 bln records, 1800 tables with 5400 indexes.

Restore's hung is happening while "gbak.exe -r -user SYSDBA -pass masterkey
-p..." executing if the key (-o "restore one table at a time") is not used.

We made a number of tries and found out that restore's hung is not depends
on:
- Firefird's architecture
- Database page size
- Windows architecture (x64, x86)
- Protocol (INET, XNET, service_mgr)

For same bug's playback you should act according to following sequence:
1. Decompressing db_hang.7z.
2. Execute script from file db_script.sql (execute time* ~16h, final
database size ~600 Gb, 7575522329 records)
3. Run Backup database "gbak -b..." (execute time* 6h, size fbk ~400 Gb)
4. Run Restore "gbak -r..." without key -o "restore one table at a time".

In our case the hung took place on 330 Gb file for "TEMP_TBL_1087"-table
(gbak: 176110000 records restored), after about 3 hours* of Restore's
executing.
* The executing time is specified for my hardware

@firebird-automations
Copy link
Collaborator Author

Modified by: Viktor Belzetskiy (viktor_bs)

Attachment: db_hang.7z [ 12211 ]

@firebird-automations
Copy link
Collaborator Author

Modified by: Viktor Belzetskiy (viktor_bs)

description: The DB volume is about 2 bln records, 1800 tables with 5400 indexes.

Restore's hung is happening while "gbak.exe -r -user SYSDBA -pass masterkey
-p..." executing if the key (-o "restore one table at a time") is not used.

We made a number of tries and found out that restore's hung is not depends
on:
- Firefird's architecture
- Database page size
- Windows architecture (x64, x86)
- Protocol (INET, XNET, service_mgr)

For same bug's playback you should act according to following sequence:
1. Decompressing db_hang.7z.
2. Execute script from file db_script.sql (execute time* ~16h, final
database size ~600Г, 7575522329 records)
3. Run Backup database "gbak -b..." (execute time* 6h, size fbk ~400Г)
4. Run Restore "gbak -r..." without key -o "restore one table at a time".

In our case the hung took place on 330 Gb file for "TEMP_TBL_1087"-table
(gbak: 176110000 records restored), after about 3 hours* of Restore's
executing.
* The executing time is specified for my hardware

=>

The DB volume is about 2 bln records, 1800 tables with 5400 indexes.

Restore's hung is happening while "gbak.exe -r -user SYSDBA -pass masterkey
-p..." executing if the key (-o "restore one table at a time") is not used.

We made a number of tries and found out that restore's hung is not depends
on:
- Firefird's architecture
- Database page size
- Windows architecture (x64, x86)
- Protocol (INET, XNET, service_mgr)

For same bug's playback you should act according to following sequence:
1. Decompressing db_hang.7z.
2. Execute script from file db_script.sql (execute time* ~16h, final
database size ~600 Gb, 7575522329 records)
3. Run Backup database "gbak -b..." (execute time* 6h, size fbk ~400 Gb)
4. Run Restore "gbak -r..." without key -o "restore one table at a time".

In our case the hung took place on 330 Gb file for "TEMP_TBL_1087"-table
(gbak: 176110000 records restored), after about 3 hours* of Restore's
executing.
* The executing time is specified for my hardware

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Given the size of the source database, I doubt that the restore is hung, 3 hours is not a lot of time for a restore.

I suspect that the restore is struggling with reactivating the indexes.

Try the restore with the "-v" (verbose) switch, it should provide more insight into where the restore is "hanging".

@firebird-automations
Copy link
Collaborator Author

Commented by: Viktor Belzetskiy (viktor_bs)

>I suspect that the restore is struggling with reactivating the indexes.
There is no activation indexes

>Try the restore with the "-v" (verbose) switch, it should provide more insight into where the restore is "hanging"
A lot of tests with different parameters (keys) and on different hardware, using different Windows and Firebird versions was performed. Restore process didn't hang ONLY with -o "restore one table at time" option. "Hang" mean 100% of CPU Core with no disk activity for a long time (24 hours). See PE_info.png.
Test example will show this bug.

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Could you extract the db metadata and post it, I am running v2.1 and would like to test. Am unable to open the attached database.

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

The "-v" option would show where the restore was failing (ie. "activating and building index XXXX").

The PE_info.png actually shows that there is some disk IO taking place, which is suggestive of the sort stage required for rebuilding indexes on table data.

@firebird-automations
Copy link
Collaborator Author

Commented by: Viktor Belzetskiy (viktor_bs)

>The "-v" option would show where the restore was failing (ie. "activating and building index XXXX").
I used this option ...
C:\Firebird\bin\gbak.exe -r -se localhost:service_mgr -user SYSDBA -pass masterkey -p 16384 -v -y F:\test\db_hang.log E:\test\db_hang.fbk F:\test\db_hang.fdb

Last lines from db_hang.log

gbak: 176090000 records restored
gbak: 176100000 records restored
gbak: 176110000 records restored
gbak:

>Could you extract the db metadata and post it, I am running v2.1 and would like to test. Am unable to open the attached database.
The script is for 2.5 only because Autonomous Transactions are used to speed up data insertion. If using FB2.5 is not appropriate for you, you may modify the procedure TEMP_GENERATE_ALL_DATA to make it work in one transaction as FB2.1 usually does, but it can significally slow down the insertion process. The script used for database creation will be attached to this ticket.

>The PE_info.png actually shows that there is some disk IO taking place, which is suggestive of the sort stage required for rebuilding indexes on table data.
those blue peaks from the PNG image mean that disk IO = 16 bytes (yes, bytes !!) per seconds :)))

@firebird-automations
Copy link
Collaborator Author

Modified by: Viktor Belzetskiy (viktor_bs)

Attachment: db_create.7z [ 12220 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

There is a guess that internal savepoint number (signed 32 bit int) is overflowed and something weird happens in DFW structures.
It is not easy to confirm it by myself due to test nature...

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Vlad, I didn't think that gbak used savepoints.

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

Sean,
engine does.
Every single insert (as any single statement) is wrapped (automatically) by savepoint.

Viktor kindly provided me with a few memory dumps taken from hung process, so guess above is a bit educated...

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Well given that the database has over 7billion rows (well beyond the capacity/size of a signed 32bit int) it seems that the size savepoint number is the problem. (Vlad it would have been nice to mention that you had looked at the memory dumps, which led to your guess/supposition ;-])

As such, I suggest that the case subject be changed to read "Unable to restore database with more than 2^31-1 rows"

@firebird-automations
Copy link
Collaborator Author

Commented by: Viktor Belzetskiy (viktor_bs)

Sean, the subject you've suggested is not exact for this bug,
at least because the database with such a big number of rows can be restored without any problems if there is no indexes in it.
Looks like the reason is the superposition of few factors: number of records, number of tables and number of indexes (even deactivated).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant