
Index garbage collection on varchar column causes server to hang [CORE5419] #5692

Closed
firebird-automations opened this issue Dec 13, 2016 · 29 comments

Comments

@firebird-automations

firebird-automations commented Dec 13, 2016

Submitted by: Thomas Kragh (realic)

Attachments:
c5419-trace-and-batch-output.7z

Steps to reproduce:
1) Create a table (TEST) with a single varchar(36) column named ID (not null).
2) Load the table with 4,000,000 rows of data using this: (select lower(uuid_to_char(gen_uuid())) from rdb$database);

3) Create a primary key on the ID column.
4) Delete all rows in "TEST" with delete from test, and commit.
5) From another connection, issue "select count(*) from test".

This will cause the database server to become unstable and hang for minutes at a time (1-5 minutes), during which no new connections are accepted and no other queries get processed. Other databases on the server are affected in the same way as well!
To observe this behavior, I wrote a small program that selects the current timestamp from the server (select current_timestamp from mon$database) and prints to the console how long it took, then sleeps for 500ms and selects the time again. This loop ran during steps 4 and 5.
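A minimal sketch of such a latency probe (a hypothetical reconstruction; the real program would run `select current_timestamp from mon$database` through a Firebird driver, which is stubbed out here):

```python
import time

def probe_latency(probe, interval=0.5, samples=5):
    """Call `probe` repeatedly, print and collect how long each call took."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        probe()  # real test: execute "select current_timestamp from mon$database"
        elapsed = time.perf_counter() - start
        timings.append(elapsed)
        print(f"probe took {elapsed * 1000:.1f} ms")
        time.sleep(interval)  # the original program slept 500 ms between probes
    return timings

# Stand-in for the server round trip; during the hang described above,
# this call would block for minutes instead of milliseconds.
timings = probe_latency(lambda: time.sleep(0.001), interval=0.01, samples=3)
```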

If step 5 is omitted and a backup/sweep is executed instead, the same problem occurs when garbage collection starts on the TEST table.

The server mode is SuperClassic with page buffers set to 256; I also tested with 1024, without any luck.

On the mailing list, Ann Harrison gave this explanation:
https://groups.yahoo.com/neo/groups/firebird-support/conversations/topics/129936 (archive) (the first 15 messages are about whether a virtual server might be the cause - skip those)
"Interesting. GUIDs produce really fat indexes. Firebird uses prefix compression and GUIDs vary in the first bytes so they don't compress well. Keys generated sequentially will be on different index leaf pages. Since records are deleted and garbage collected in the order they were stored, index garbage collection could easily touch a different page for each key. With only 256 buffers, index garbage collection is running pages through the cache wildly - with luck they're cached by the file system."
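Ann's point can be illustrated with a toy measurement (only a sketch of the idea, not Firebird's actual index compression): adjacent keys in a sorted set of random GUIDs share only a few leading characters, so prefix compression drops almost nothing, while sequentially generated keys share nearly their entire prefix:

```python
import uuid

def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading substring of two keys."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def avg_compressible(keys):
    """Average number of leading characters a prefix-compressing index
    could drop per key, relative to the previous key in sorted order."""
    keys = sorted(keys)
    saved = [common_prefix_len(keys[i - 1], keys[i]) for i in range(1, len(keys))]
    return sum(saved) / len(saved)

random_keys = [str(uuid.uuid4()) for _ in range(10_000)]
sequential_keys = [f"{i:032x}" for i in range(10_000)]  # monotonically increasing hex

print("random GUIDs:    ", avg_compressible(random_keys))
print("sequential keys: ", avg_compressible(sequential_keys))
```

With random GUIDs, almost the full 36 characters must be stored per key, which is what makes the index "really fat".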

The workaround that made me file this bug was to use sequentially generated GUIDs. However, this is not always possible: in the company I work at, we use varchar indexes for social security numbers, login tokens, OAuth access/refresh tokens, GUIDs from third-party applications, and so on.
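A common technique for this (a hypothetical sketch of the "COMB GUID" idea, not a built-in Firebird feature) is to put a timestamp in the leading bytes, so that keys generated close in time sort close together and index maintenance touches nearby leaf pages:

```python
import os
import time
import uuid

def comb_guid() -> uuid.UUID:
    """COMB-style GUID: 6 leading bytes of a millisecond timestamp followed
    by 10 random bytes.  Note: the RFC 4122 version bits are not set;
    this is only a sketch of the layout."""
    millis = int(time.time() * 1000)  # fits comfortably in 48 bits
    return uuid.UUID(bytes=millis.to_bytes(6, "big") + os.urandom(10))

a = comb_guid()
time.sleep(0.005)
b = comb_guid()
print(a, "<", b)
```

Because the timestamp prefix is monotonically non-decreasing, GUIDs generated later sort after earlier ones, mimicking the behavior of a sequential key.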

Best regards
Thomas Kragh

Commits: 2d26794

====== Test Details ======

Can't reproduce, wait for suggestions how to do that - see post 16/Feb/17 05:27 AM

@firebird-automations

Commented by: Thomas Kragh (realic)

Added Component: core to this issue

@firebird-automations

Modified by: Thomas Kragh (realic)

Component: Engine [ 10000 ]

@firebird-automations

Commented by: @dyemanov

Out of curiosity, why do you prefer to store (and index) GUIDs as VARCHAR(36) rather than CHAR(16) OCTETS? IMHO, this way you just make the problem even worse.

@firebird-automations

Commented by: Thomas Kragh (realic)

The main reason is readability. GUIDs are always formatted this way: "f5da8ac7-60b1-4f38-a631-11a34d2f7c8c" in all of our applications and websites, and it is easy to copy this value if you need to query a table.

To my understanding, storing a GUID in CHAR(16) would result in the same problem? Perhaps I would just have noticed it later, as the number of rows to be deleted increased.
Question: Is there a way to create sequentially generated GUIDs in Firebird?

@firebird-automations

Commented by: @dyemanov

CHAR(16) would also suffer from this issue, but to a smaller extent. Intentionally slowing things down to please readability does not seem like the correct choice, IMHO. The functions UUID_TO_CHAR and CHAR_TO_UUID exist exactly to provide readability while still storing efficiently.

Anyway, hanging for minutes without allowing concurrent activity is not expected and should be investigated.
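The storage difference Dmitry refers to can be seen with Python's `uuid` module standing in for the UUID_TO_CHAR/CHAR_TO_UUID pair (a sketch, not Firebird code):

```python
import uuid

g = uuid.uuid4()
text_form = str(g)      # what a VARCHAR(36) column stores: 36 characters
octet_form = g.bytes    # what a CHAR(16) OCTETS column stores: 16 raw bytes

print(len(text_form), "characters vs", len(octet_form), "bytes")

# Round-tripping mirrors CHAR_TO_UUID / UUID_TO_CHAR: no information is lost.
assert uuid.UUID(bytes=octet_form) == uuid.UUID(text_form) == g
```

Besides the bytes saved per value, shorter keys mean more keys per index page, so fewer pages have to be visited during index garbage collection.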

@firebird-automations

Commented by: @AlexPeshkoff

Deleted 4,000,000 records + cooperative GC in SuperClassic + a tiny page cache...
I agree that hanging a connection for minutes is not good, but under such circumstances it looks unavoidable - it's that 'select ...' which actually does all the work related to the deleted records.

@firebird-automations

Commented by: Thomas Kragh (realic)

Hey Alexander
In your comment it sounds like there are circumstances/settings under which this can be avoided?
I tried setting page buffers to 4096, but this had no effect.

I think the fact that this action affects all other databases and attachments on the same server makes this a very critical bug.

@firebird-automations

Commented by: Sean Leyne (seanleyne)

Thomas,

I am not sure this is a bug; it seems more a by-product of the nature of the operation you are trying to perform.

Could you please post the execution stats for the DELETE and SELECT COUNT(*) statements? I am interested in the "writes" values.

Also, please advise on the disk subsystem/performance (Crystal Disk Mark is a good benchmark) of the machine you are running against.

@firebird-automations

Commented by: @AlexPeshkoff

First of all, I wanted to point out that such a delay is not something extraordinary... But I'm almost sure that setting the page cache to a very big value (up to half of the DB size) will increase performance in this particular case. Unfortunately this is not good in many other respects, but as an experiment it seems interesting.

The next question is: how does GC time depend on the number of deleted records? If the dependency is linear or almost linear, it is also not an awful problem. The fact that this affects all other databases and attachments on the same server is almost certainly explained by HDD load.

@firebird-automations

Commented by: Thomas Kragh (realic)

Hey Sean
Sorry for the long wait. I have run the scenario again and the stats can be found here:

Delete statistics
https://dl.dropboxusercontent.com/u/937321/fb_delete.PNG

Count statistics
https://dl.dropboxusercontent.com/u/937321/fb_count.PNG

Crystal disk mark statistics:
https://dl.dropboxusercontent.com/u/937321/disk_mark.PNG

@firebird-automations

Commented by: @hvlad

Thomas,

please specify the exact OS and kernel/glibc version

@firebird-automations

Commented by: Thomas Kragh (realic)

90% of my testing has been on Windows (8.1), and the two screenshots were made using my laptop (8.1). I have been able to reproduce the problem on Windows Server 2008 R2 and CentOS - regarding CentOS, I will ask the system administrator for the specific version.

@firebird-automations

Commented by: Thomas Kragh (realic)

We use CentOS 7 with kernel 3.10.0-327.36.1 and glibc-2.17-106 / glib2-2.42.2-5.

@firebird-automations

Commented by: Thomas Kragh (realic)

Hey Alexander
Here are the times taken for the delete/count of 5/20 million rows, with page buffers set to 4000:

Scenario 1 - 5.000.000 rows
delete: 10s 0ms
count(*): 6m 54s 15ms

Scenario 1 - 20.000.000 rows
delete: 50s 515ms
count(*): 42m 9s 672ms

Garbage collecting the 20 million deleted rows (the count) seems to take about 6 times longer, even though the number of rows increased only 4 times.

If I increase page buffers to 400,000 and run scenario 1 with 5 million rows, it looks like this:
delete: 12s 390ms
count(*): 1m 57s 703ms

Regarding the effect this has on other attachments on the server: I tried running the disk benchmark while "pinging" the server using the program I mentioned in my original message (select current_timestamp from mon$database). The benchmark did not cause the server to become unresponsive.

From my point of view this sounds more like a locking issue or starvation - however, I have no experience with the Firebird codebase.

Just to be thorough, I tried switching to Classic and ran scenario 1; this did not change anything.
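A quick check of the arithmetic behind the figures above (timings transcribed from this comment):

```python
# Reported timings, converted to seconds (page buffers = 4000).
five_m   = {"rows": 5_000_000,  "delete": 10.0,   "count": 6 * 60 + 54.015}
twenty_m = {"rows": 20_000_000, "delete": 50.515, "count": 42 * 60 + 9.672}

row_ratio    = twenty_m["rows"] / five_m["rows"]
delete_ratio = twenty_m["delete"] / five_m["delete"]
count_ratio  = twenty_m["count"] / five_m["count"]

print(f"rows: {row_ratio:.1f}x, delete: {delete_ratio:.1f}x, count/GC: {count_ratio:.1f}x")
```

So for 4x the rows, the delete itself grows about 5x and the count (which performs the cooperative garbage collection) about 6.1x - superlinear in both cases.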

@firebird-automations

Commented by: @hvlad

Thomas,

I can prepare a Windows build with the patch; would you like to test it?

@firebird-automations

Commented by: Thomas Kragh (realic)

Vlad
Yes, of course, that would be really nice :)

@firebird-automations

Commented by: Thomas Kragh (realic)

Hey Vlad
Thanks for the build, it works perfectly - no server unresponsiveness whatsoever :)

What was the problem/solution? And is there an ETA for 2.5.7?

@firebird-automations

Commented by: @hvlad

Thomas,

thanks for the info.

Some details about the issue can be found here: https://sourceforge.net/p/firebird/mailman/message/35597059/
The patch is currently committed into a separate branch to let other developers review and/or test it: https://github.com/FirebirdSQL/firebird/tree/core-5419

@firebird-automations

@firebird-automations

Commented by: Thomas Kragh (realic)

Hey Dmitry
Unfortunately I do not have a Linux server for testing at my disposal. :(

@firebird-automations

Commented by: Thomas Kragh (realic)

Hey Vlad
In your link to the mailing list you write the following: "- for Classic and SuperClassic problem is confirmed of course, operations with another DB's are not affected".
However, in my test all databases on the same server were affected by the GC.

@firebird-automations

Commented by: @hvlad

Thomas,

I just tried your test against a release build of Firebird 2.5.6 CS and SC - no luck, other databases are not affected.

@firebird-automations

Modified by: @hvlad

assignee: Vlad Khorsun [ hvlad ]

@firebird-automations

Modified by: @hvlad

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 2.5.7 [ 10770 ]

@firebird-automations

Commented by: @pavel-zotov

I can NOT reproduce this ticket's issue.

Quote of the start message:

Steps to reproduce:
1) Create a table (TEST) with a single varchar(36) column named ID (not null).
2) Load the table with data 4000000 rows using this: (select lower( uuid_to_char(gen_uuid())) from rdb$database);

3) Create a primary key on the ID column.
4) Delete all rows in "TEST" with delete from test and commit
5) from another connection issue "select count(*) from test"

===

Prepare DB:
-----------------
-- create new DB, leave default page_size and FW
set bail on;
recreate table test(id int, s varchar(36));
commit;
set stat on;
set term ^;
execute block as
declare n int = 4000000;
begin
while (n>0) do insert into test(s) values( uuid_to_char(gen_uuid()) ) returning :n-1 into n;
end^
set term ;^
set stat off;
commit;

set stat on;
create index test_s on test(s);
set stat off;
commit;
show version;
quit;

Prepare aux. batch ('c5419-run-check.bat'):
-------------------------
@echo off
@Rem (update path to ISQL and name of test DB alias):
echo %date% %time%
echo set list on; select current_timestamp as dts_get_inside_isql from rdb$database; ^
set count on; ^
select current_timestamp as dts_finish, a.*, s.* ^
from mon$attachments a join mon$statements s using(mon$attachment_id) ^
where mon$attachment_id!=current_connection; ^
quit; ^
| C:\FB\25Cs\bin\isql /:e25 -n
@Rem C:\FB\old.256\bin\isql /:e25 -n

NOTE: this batch shows the OS time before launching ISQL ("echo %date% %time%") and also forces ISQL itself to show the time ("dts_get_inside_isql"), so we can estimate how long it took to establish the connection.

Prepare trace config and run trace:
-----------------------------------------------
<database (%[\\/](%).fdb)|(%)>
enabled true
log_errors true
time_threshold 0
log_connections true
log_transactions true
log_statement_prepare true
log_statement_start true
log_statement_finish true
print_perf true
max_sql_length 4096
max_log_size 999
</database>

Session #1
----------------
Launch ISQL, do there:

set count on;
delete from test;
Records affected: 4000000
commit;
----- do NOT quit from ISQL, stay at its prompt -----

Session #2
----------------
Launch ISQL, do there:
set stat on;
select count(*) from test;
-- this makes FB do the "huge work";
-- now we can check whether one more attachment can be established and do some work in it

WAIT for 5-10 seconds. Then launch another window and:

Session #3
----------------
Run the batch c5419-run-check.bat twice: first 5...10 seconds, then 1...2 minutes after session #2 started its 'SELECT COUNT(*) FROM TEST'.

=========================================================

Result of running batch c5419-run-check.bat:

1) for WI-V2.5.6.27020 (official 2.5.6 release, 01-JUL-2016):

1.1) run-1:
~~~~~~~~

C:\FBTESTING\qa\fbt-repo\tmp>c5419-run-check.bat
16.02.2017 7:33:06.40
Database: /:e25
SQL>
DTS_GET_INSIDE_ISQL 2017-02-16 07:33:06.6290

DTS_FINISH 2017-02-16 07:33:06.6320
MON$ATTACHMENT_ID 11
....
MON$STATE 1
MON$TIMESTAMP 2017-02-16 07:32:56.1330
MON$SQL_TEXT 0:4
select count(*) from test
MON$STAT_ID 8

1.2) run-2:
~~~~~~~~

C:\FBTESTING\qa\fbt-repo\tmp>c5419-run-check.bat
16.02.2017 7:35:43.41
Database: /:e25
SQL>
DTS_GET_INSIDE_ISQL 2017-02-16 07:35:43.7200

DTS_FINISH 2017-02-16 07:35:43.7240
MON$ATTACHMENT_ID 11
....
MON$SQL_TEXT 0:4
select count(*) from test
MON$STAT_ID 8

So, the connection was established instantly. The query to the MON$ tables was slow, but it also _did_ finish. FB was available.

2) for WI-V2.5.7.27033 (29-dec-2016):

2.1) run-1:
~~~~~~~~

C:\FBTESTING\qa\fbt-repo\tmp>c5419-run-check.bat
16.02.2017 7:48:23.05
Database: /:e25
SQL>
DTS_GET_INSIDE_ISQL 2017-02-16 07:48:28.7000

DTS_FINISH 2017-02-16 07:48:28.7050
MON$ATTACHMENT_ID 6
....
MON$STATE 1
MON$TIMESTAMP 2017-02-16 07:47:58.0220
MON$SQL_TEXT 0:4
select count(*) from test
MON$STAT_ID 8

2.2) run-2:
~~~~~~~~

C:\FBTESTING\qa\fbt-repo\tmp>c5419-run-check.bat
16.02.2017 7:50:55.57
Database: /:e25
SQL>
DTS_GET_INSIDE_ISQL 2017-02-16 07:50:55.8360

DTS_FINISH 2017-02-16 07:50:55.8400
...
MON$STATE 1
MON$TIMESTAMP 2017-02-16 07:47:58.0220
MON$SQL_TEXT 0:4
select count(*) from test
MON$STAT_ID 8

The result is the same as for the 2.5.6 release.
For the trace log, batch, etc., see the attached file "c5419-trace-and-batch-output.7z".

@firebird-automations

Modified by: @pavel-zotov

Attachment: c5419-trace-and-batch-output.7z [ 13068 ]

@firebird-automations

Commented by: @pavel-zotov

PS.

Windows 8.1 Pro 64-bit OS, x64-based processor

C:\>ver
Microsoft Windows [Version 6.3.9600]

@firebird-automations

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Resolved [ 5 ]

QA Status: No test => Deferred

Test Details: Can't reproduce, wait for suggestions how to do that - see post 16/Feb/17 05:27 AM
