Index garbage collection on varchar column causes server to hang [CORE5419] #5692
Comments
Commented by: Thomas Kragh (realic) Added Component: core to this issue
Modified by: Thomas Kragh (realic) Component: Engine [ 10000 ]
Commented by: @dyemanov Out of curiosity, why do you prefer to store (and index) GUIDs as VARCHAR(36) rather than CHAR(16) OCTETS? IMHO, this way you just make the problem even worse.
Commented by: Thomas Kragh (realic) The main reason is readability. GUIDs are always formatted this way: "f5da8ac7-60b1-4f38-a631-11a34d2f7c8c" in all of our applications and websites, and it is easy to copy this value if you need to query a table. To my understanding, storing a GUID in CHAR(16) would result in the same problem? Perhaps I would have noticed it later, as the number of rows to be deleted increased.
Commented by: @dyemanov CHAR(16) would also suffer from this issue, but to a smaller extent. Intentionally slowing things down to please readability does not seem a correct choice, IMHO. Functions UUID_TO_CHAR and CHAR_TO_UUID exist exactly to provide readability while still storing efficiently. Anyway, hanging for minutes without allowing concurrent activity is not expected and should be investigated.
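The storage difference being discussed can be illustrated outside Firebird. A minimal Python sketch (using the stdlib uuid module, not Firebird's built-in functions) of what UUID_TO_CHAR/CHAR_TO_UUID conceptually do:

```python
import uuid

# A GUID has a fixed 16-byte binary form; its textual form is 36 characters.
guid = uuid.uuid4()

binary_form = guid.bytes  # roughly what CHAR(16) OCTETS would store
text_form = str(guid)     # what VARCHAR(36) stores, e.g. "f5da8ac7-60b1-..."

print(len(binary_form))   # 16
print(len(text_form))     # 36

# The conversion is lossless in both directions, which is why readable
# text can be produced at query time while the table (and its index)
# stores only the 16-byte form.
assert uuid.UUID(bytes=binary_form) == uuid.UUID(text_form)
```

So the readable 36-character form more than doubles the key size compared with the binary form, which is Dmitry's point about VARCHAR(36) making the index problem worse.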
Commented by: @AlexPeshkoff Deleted 4 000 000 records + cooperative GC in SC + tiny pages cache...
Commented by: Thomas Kragh (realic) Hey Alexander, I think the fact that this action affects all other databases and attachments on the same server makes this a very critical bug.
Commented by: Sean Leyne (seanleyne) Thomas, I am not sure this is a bug; it seems more a by-product of the nature of the operation you are trying to perform. Could you please post the execution stats for the DELETE and SELECT COUNT(*) statements? I am interested in the "writes" values. Also, please advise of the disk subsystem/performance (Crystal Disk Mark is a good benchmark) which you are running against.
Commented by: @AlexPeshkoff First of all, I wanted to point out that such a delay is not something outstanding... But I'm almost sure that setting the page cache to a very big value (up to half of the DB size) will increase performance in this particular case. Unfortunately this is not good in many other aspects, but as an experiment it seems interesting. The next question is: how does GC time depend upon the number of deleted records? If the dependency is linear or almost linear, it is also not an awful problem. The fact that this affects all other databases and attachments on the same server is almost certainly explained by HDD load.
Commented by: Thomas Kragh (realic) Hey Sean
Delete statistics: [screenshot]
Count statistics: [screenshot]
Crystal Disk Mark statistics: [screenshot]
Commented by: @hvlad Thomas, please specify exact OS and kernel\glibc version
Commented by: Thomas Kragh (realic) 90% of my testing has been on Windows (8.1); the two screenshots were made using my laptop (8.1). I have been able to reproduce the problem on Windows Server 2008 R2 and CentOS. Regarding CentOS, I will ask the system administrator for the specific version.
Commented by: Thomas Kragh (realic) We use CentOS 7 with kernel 3.10.0-327.36.1 and glibc-2.17-106/glib2-2.42.2-5
Commented by: Thomas Kragh (realic) Hey Alexander
Scenario 1 - 5.000.000 rows: [screenshot]
Scenario 1 - 20.000.000 rows: [screenshot]
The deletion of the 20 million rows seems to take about 6 times longer, even though the number of rows is only increased 4 times. If I increase page buffers to 400.000 and run scenario 1 with 5 million rows, it looks like this: [screenshot]
Regarding the effect this has on other attachments on the server: I tried running the disk benchmark while "pinging" the server using the program I mentioned in my original message (select current_timestamp from mon$database). The benchmark did not cause the server to become unresponsive. From my point of view this sounds more like a locking issue or starvation; however, I have no experience with the Firebird codebase.
Just to be thorough, I tried switching to Classic and ran scenario 1; this did not change anything.
Commented by: @hvlad Thomas, I can prepare a Windows build with the patch. Would you like to test it?
Commented by: Thomas Kragh (realic) Vlad
Commented by: Thomas Kragh (realic) Hey Vlad, what was the problem/solution? And is there an ETA for 2.5.7?
Commented by: @hvlad Thomas, thanks for the info. Some details about the issue can be found here: https://sourceforge.net/p/firebird/mailman/message/35597059/
Commented by: @dyemanov If possible, please test also the Linux build: http://web.firebirdsql.org/downloads/rabbits/dimitr/FirebirdCS-2.5.7.27038-0.amd64.tar.gz
Commented by: Thomas Kragh (realic) Hey Dmitry
Commented by: Thomas Kragh (realic) Hey Vlad
Commented by: @hvlad Thomas, I just tried your test against the release build of Firebird 2.5.6 CS and SC - no luck, other databases are not affected
Modified by: @hvlad assignee: Vlad Khorsun [ hvlad ]
Modified by: @hvlad status: Open [ 1 ] => Resolved [ 5 ], resolution: Fixed [ 1 ], Fix Version: 2.5.7 [ 10770 ]
Commented by: @pavel-zotov I can NOT reproduce this ticket's issue. Quote of start message:
=== Prepare DB:
Prepare aux. batch ('c5419-run-check.bat'): NOTE: this batch shows the OS time before launching ISQL ("echo %date% %time%") and forces ISQL itself to also show the time ("dts_get_inside_isql"), so we can estimate how long it took to establish the connection. Prepare trace config and run trace: Session #1
Session #2. WAIT for 5-10 seconds. Then launch another window and: Session #3
=========================================================
Result of running batch c5419-run-check.bat:
1) for WI-V2.5.6.27020 (official 2.5.6 release, 01-JUL-2016):
1.1) run-1: C:\FBTESTING\qa\fbt-repo\tmp>c5419-run-check.bat DTS_FINISH 2017-02-16 07:33:06.6320
1.2) run-2: C:\FBTESTING\qa\fbt-repo\tmp>c5419-run-check.bat DTS_FINISH 2017-02-16 07:35:43.7240
So, the connection was established instantly. The query to MON$ tables was slow, but it also _was_ finished. FB was available.
2) for WI-V2.5.7.27033 (29-Dec-2016):
2.1) run-1: C:\FBTESTING\qa\fbt-repo\tmp>c5419-run-check.bat DTS_FINISH 2017-02-16 07:48:28.7050
2.2) run-2: C:\FBTESTING\qa\fbt-repo\tmp>c5419-run-check.bat DTS_FINISH 2017-02-16 07:50:55.8400
The result is the same as for the 2.5.6 release.
Modified by: @pavel-zotov Attachment: c5419-trace-and-batch-output.7z [ 13068 ]
Commented by: @pavel-zotov PS. Windows 8.1 Pro, 64-bit OS, x64-based processor (C:\>ver)
Modified by: @pavel-zotov status: Resolved [ 5 ] => Resolved [ 5 ], QA Status: No test => Deferred, Test Details: Can't reproduce, wait for suggestions how to do that - see post 16/Feb/17 05:27 AM
Submitted by: Thomas Kragh (realic)
Attachments:
c5419-trace-and-batch-output.7z
Steps to reproduce:
1) Create a table (TEST) with a single varchar(36) column named ID (not null).
2) Load the table with 4,000,000 rows of data using this: (select lower(uuid_to_char(gen_uuid())) from rdb$database);
3) Create a primary key on the ID column.
4) Delete all rows in "TEST" with "delete from test" and commit.
5) From another connection, issue "select count(*) from test".
This will cause the database server to become unstable and hang for minutes at a time (1-5 minutes), during which no new connections are made and no other queries get processed. Other databases on the server are affected in the same way as well!
To observe this behavior, I wrote a small program that selects current_timestamp from the server (select current_timestamp from mon$database), prints to the console how long it took, sleeps for 500 ms, and then selects the time again. This loop ran during steps 4 and 5.
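The original monitoring program is not attached to the ticket; a minimal Python stand-in for the loop it describes might look like the following. The `run_query` callable is a hypothetical placeholder for executing "select current_timestamp from mon$database" over a real Firebird connection:

```python
import time

def ping_loop(run_query, iterations=10, interval=0.5):
    """Repeatedly time a round trip to the server.

    run_query is a stand-in for executing
    'select current_timestamp from mon$database' on a real connection.
    A spike in the reported latency indicates the server is stalled.
    """
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()                  # the actual server round trip
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        print(f"round trip took {elapsed * 1000:.1f} ms")
        time.sleep(interval)         # 500 ms pause, as in the report
    return latencies
```

Running such a loop during steps 4 and 5 is what exposed the multi-minute stalls described above.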
If step 5 is omitted and a backup/sweep is executed, the same problem occurs when garbage collection starts on the TEST table.
The server mode is SuperClassic with page buffers set to 256; also tested with 1024, without any luck.
On the mailing list Ann Harrison gave this explanation:
https://groups.yahoo.com/neo/groups/firebird-support/conversations/topics/129936 (archive) (the first 15 messages are about whether a virtual server might be the cause - skip those)
"Interesting. GUIDs produce really fat indexes. Firebird uses prefix compression and GUIDs vary in the first bytes so they don't compress well. Keys generated sequentially will be on different index leaf pages. Since records are deleted and garbage collected in the order they were stored, index garbage collection could easily touch a different page for each key. With only 256 buffers, index garbage collection is running pages through the cache wildly - with luck they're cached by the file system."
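Ann Harrison's point about prefix compression can be illustrated numerically. This is a rough Python sketch of the general idea (not Firebird's actual key-compression code): it measures how many leading characters adjacent keys share in sorted order, which is what a prefix-compressed index can avoid storing:

```python
import uuid

def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading prefix of two keys."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

N = 10_000

# Random GUID keys: even adjacent keys in sorted order share only a
# few leading characters, so prefix compression saves almost nothing.
random_keys = sorted(str(uuid.uuid4()) for _ in range(N))
random_avg = sum(common_prefix_len(a, b)
                 for a, b in zip(random_keys, random_keys[1:])) / (N - 1)

# Sequential keys of the same length (zero-padded counters here):
# adjacent keys share long prefixes and compress very well.
seq_keys = [f"{i:036d}" for i in range(N)]
seq_avg = sum(common_prefix_len(a, b)
              for a, b in zip(seq_keys, seq_keys[1:])) / (N - 1)

print(f"avg shared prefix, random GUIDs:  {random_avg:.1f} chars")
print(f"avg shared prefix, sequential:    {seq_avg:.1f} chars")
```

Random GUID keys share only a handful of leading characters on average, while sequential keys of the same length share over 30, which matches the explanation of why GUID indexes come out "really fat".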
The solution that let me work around the problem, and made me file this bug, was to use sequentially generated GUIDs. However, this is not always possible: in the company I work at, we use varchar indexes for social security numbers, login tokens, OAuth access/refresh tokens, GUIDs from third-party applications, and so on.
Best regards
Thomas Kragh
Commits: 2d26794
====== Test Details ======
Can't reproduce, wait for suggestions how to do that - see post 16/Feb/17 05:27 AM