New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using COLLATE UNICODE_CI_AI in WHERE clause (not indexed) is extremely slow [CORE4125] #4453
Comments
Commented by: Sean Leyne (seanleyne) Since the case is about "slower" performance, some mention of the performance of non-collation based query vs. collation query would be helpful. |
Commented by: Toni Martir (toni_martir) I'm sorry this is not a issue about performance for non-collation based or collation based, the performance with collation or non-collation is near the same. As I included in the original information COLLATE UNICODE_CI is as fast as not using COLLATE. This is an issue about UNICODE_CI_AI only (or maybe relative to any _AI collation), that is about Accent Insensitive collation. |
Commented by: Toni Martir (toni_martir) I debugged the firebird server, and found the bottleneck. I know nothing about unicode implementation or icu library, but it seems the way Firebird is calling the library is slowing things a lot. The bottleneck is in unicode_util.cpp near line 1253 (2.5 branch): It seems that utransOpen is somewhat slow, so to solve the problem this "handle" to transliterator should be created only once in some way. Adding this lines to create the transliterator only once speed up things: and remove the line closing the transliterator icu->utransClose(trans) and the handle should be closed on cleanup. After this fix, the UNICODE_CI and UNICODE_CI_AI have near the same performance (a query that longs for 10 seconds now takes only 300 milliseconds) |
Modified by: @asfernandesassignee: Adriano dos Santos Fernandes [ asfernandes ] |
Modified by: @asfernandesstatus: Open [ 1 ] => Resolved [ 5 ] resolution: Fixed [ 1 ] Fix Version: 3.0 Alpha 1 [ 10331 ] Fix Version: 2.5.3 [ 10461 ] |
Commented by: @asfernandes Thanks Toni for debug it. A transliteration could not be shared between threads, so I used to clone it, which seems to not degrade the performance. |
Commented by: Toni Martir (toni_martir) Thanks for the fast fix. Thanks again. |
Commented by: @asfernandes This is not so simple. It would require a "TLS cleanup" routine too, to not leak memory of destroyed threads. We currently does not have this in place in the code, so the clone approach seems doable and faster enough. |
Commented by: Toni Martir (toni_martir) I'm sorry, I compiled and debugged with the updated unicode_util.cpp and tested the sample I provide, and now UNICODE_CI_AI does not work correctly, it returns all the rows, so it's now broken. Thanks |
Commented by: @asfernandes The problem happens with the embedded ICU 3.0, but not with newer versions. I'm investigating why. |
Commented by: @asfernandes I committed a different fix. Please test it. |
Commented by: Toni Martir (toni_martir) Now works ok and witthout any performance penalty. |
Modified by: @pavel-zotovstatus: Resolved [ 5 ] => Resolved [ 5 ] QA Status: Done successfully Test Details: Checked on 2.5.1: ratio = ~105, on 2.5.2: ~132, since 2.5.3: ~1. |
Modified by: @pcisarstatus: Resolved [ 5 ] => Closed [ 6 ] |
Submitted by: Toni Martir (toni_martir)
Votes: 1
Executing a select with UNICODE_CI_AI collation is very slow (4000 rows table ):
SELECT UTF8CL FROM TEST_UTF8 WHERE UTF8CL COLLATE UNICODE_CI_AI LIKE '%8%'
- Using LIKE or CONTAINING have the same effect.
Sorting is fast: SELECT UTF8CL FROM TEST_UTF8 ORDER BY UTF8CL COLLATE UNICODE_CI_AI (30 miliseconds)
Using UNICODE_CI is also fast SELECT UTF8CL FROM TEST_UTF8 WHERE UTF8CL COLLATE UNICODE_CI LIKE '%8%'
So accent insensitive Unicode comparisons are extermely slow. Defining the column as UNICODE_CI_AI has the same problem (but there is no need to specify COLLATE in WHERE clause).
I don't know if it's related, but creating foreign keys with domains defined as UNICODE_CI_AI fails (incompatible field types).
I provide script to easily reproduce the bug (create empty database first)
CREATE TABLE TEST_UTF8(UTF8CL VARCHAR(20) CHARACTER SET UTF8 COLLATE UNICODE);
CREATE PROCEDURE FILLTEST_UTF8
AS
DECLARE VARIABLE COUNTER INTEGER;
BEGIN
COUNTER=4000;
WHILE (COUNTER>0) DO
BEGIN
INSERT INTO TEST_UTF8 (UTF8CL) VALUES (CAST(:COUNTER AS VARCHAR(10)));
COUNTER=COUNTER-1;
END
END
EXECUTE PROCEDURE FILLTEST_UTF8;
COMMIT;
SELECT UTF8CL FROM TEST_UTF8
WHERE UTF8CL COLLATE UNICODE_CI_AI CONTAINING '8'
Commits: 6dedc68 7cb608d 6514d6a FirebirdSQL/fbt-repository@5264556 FirebirdSQL/fbt-repository@a240fa2 FirebirdSQL/fbt-repository@96bc825
====== Test Details ======
Checked on 2.5.1: ratio = ~105, on 2.5.2: ~132, since 2.5.3: ~1.
The text was updated successfully, but these errors were encountered: