Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using COLLATE UNICODE_CI_AI in WHERE clause (not indexed) is extremely slow [CORE4125] #4453

Closed
firebird-automations opened this issue Jun 19, 2013 · 14 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: Toni Martir (toni_martir)

Votes: 1

Executing a select with UNICODE_CI_AI collation is very slow (4000 rows table ):
SELECT UTF8CL FROM TEST_UTF8 WHERE UTF8CL COLLATE UNICODE_CI_AI LIKE '%8%'
- Using LIKE or CONTAINING have the same effect.

Sorting is fast: SELECT UTF8CL FROM TEST_UTF8 ORDER BY UTF8CL COLLATE UNICODE_CI_AI (30 miliseconds)
Using UNICODE_CI is also fast SELECT UTF8CL FROM TEST_UTF8 WHERE UTF8CL COLLATE UNICODE_CI LIKE '%8%'

So accent insensitive Unicode comparisons are extermely slow. Defining the column as UNICODE_CI_AI has the same problem (but there is no need to specify COLLATE in WHERE clause).

I don't know if it's related, but creating foreign keys with domains defined as UNICODE_CI_AI fails (incompatible field types).

I provide script to easily reproduce the bug (create empty database first)

CREATE TABLE TEST_UTF8(UTF8CL VARCHAR(20) CHARACTER SET UTF8 COLLATE UNICODE);

CREATE PROCEDURE FILLTEST_UTF8
AS
DECLARE VARIABLE COUNTER INTEGER;
BEGIN
COUNTER=4000;
WHILE (COUNTER>0) DO
BEGIN
INSERT INTO TEST_UTF8 (UTF8CL) VALUES (CAST(:COUNTER AS VARCHAR(10)));
COUNTER=COUNTER-1;
END
END

EXECUTE PROCEDURE FILLTEST_UTF8;

COMMIT;

SELECT UTF8CL FROM TEST_UTF8
WHERE UTF8CL COLLATE UNICODE_CI_AI CONTAINING '8'

Commits: 6dedc68 7cb608d 6514d6a FirebirdSQL/fbt-repository@5264556 FirebirdSQL/fbt-repository@a240fa2 FirebirdSQL/fbt-repository@96bc825

====== Test Details ======

Checked on 2.5.1: ratio = ~105, on 2.5.2: ~132, since 2.5.3: ~1.

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Since the case is about "slower" performance, some mention of the performance of non-collation based query vs. collation query would be helpful.

@firebird-automations
Copy link
Collaborator Author

Commented by: Toni Martir (toni_martir)

I'm sorry this is not a issue about performance for non-collation based or collation based, the performance with collation or non-collation is near the same. As I included in the original information COLLATE UNICODE_CI is as fast as not using COLLATE. This is an issue about UNICODE_CI_AI only (or maybe relative to any _AI collation), that is about Accent Insensitive collation.

@firebird-automations
Copy link
Collaborator Author

Commented by: Toni Martir (toni_martir)

I debugged the firebird server, and found the bottleneck. I know nothing about unicode implementation or icu library, but it seems the way Firebird is calling the library is slowing things a lot.
I have found a workaround but because my knowledgment is very limited someone else should validate (or fix the bug in other way).

The bottleneck is in unicode_util.cpp near line 1253 (2.5 branch):
UTransliterator *trans = icu->utransOpen("Any-Upper; NFD; [:Nonspacing Mark:] Remove; NFC",
UTRANS_FORWARD, NULL, 0, NULL, &errorCode);

It seems that utransOpen is somewhat slow, so to solve the problem this "handle" to transliterator should be created only once in some way.

Adding this lines to create the transliterator only once speed up things:
// Add variable to store the transliterator
UTransliterator* globaltrans = NULL;
.....
if (globaltrans == NULL)
globaltrans = icu->utransOpen("Any-Upper; NFD; [:Nonspacing Mark:] Remove; NFC",
UTRANS_FORWARD, NULL, 0, NULL, &errorCode);
UTransliterator *trans = globaltrans;

and remove the line closing the transliterator icu->utransClose(trans) and the handle should be closed on cleanup.

After this fix, the UNICODE_CI and UNICODE_CI_AI have near the same performance (a query that longs for 10 seconds now takes only 300 milliseconds)

@firebird-automations
Copy link
Collaborator Author

Modified by: @asfernandes

assignee: Adriano dos Santos Fernandes [ asfernandes ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @asfernandes

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 3.0 Alpha 1 [ 10331 ]

Fix Version: 2.5.3 [ 10461 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @asfernandes

Thanks Toni for debug it.

A transliteration could not be shared between threads, so I used to clone it, which seems to not degrade the performance.

@firebird-automations
Copy link
Collaborator Author

Commented by: Toni Martir (toni_martir)

Thanks for the fast fix.
Just to comment that In C#⁠ exists the [ThreadStatic] attribute so one variable per thread is initialized, I can remember something in C like "static _thread", but don't know if it's ok for all C compilers. Anyway I usually try to avoid this variables when multitheading by using an object instance for each thread to store thread object references, I don't know how this "problems" are solved in Firebird code.

Thanks again.

@firebird-automations
Copy link
Collaborator Author

Commented by: @asfernandes

This is not so simple. It would require a "TLS cleanup" routine too, to not leak memory of destroyed threads.

We currently does not have this in place in the code, so the clone approach seems doable and faster enough.

@firebird-automations
Copy link
Collaborator Author

Commented by: Toni Martir (toni_martir)

I'm sorry, I compiled and debugged with the updated unicode_util.cpp and tested the sample I provide, and now UNICODE_CI_AI does not work correctly, it returns all the rows, so it's now broken.
With the changes I proposed works ok (but it's not the good fix because it's not thread safe).
I also measured the penalty for using clone, and it's about 50% time for comparisons than using UNICODE_CI, using Open only once has no penalty, just the same time UNICODE_CI and UNICODE_CI_AI.
Note: A 50% may be acceptable (if it works).

Thanks

@firebird-automations
Copy link
Collaborator Author

Commented by: @asfernandes

The problem happens with the embedded ICU 3.0, but not with newer versions. I'm investigating why.

@firebird-automations
Copy link
Collaborator Author

Commented by: @asfernandes

I committed a different fix. Please test it.

@firebird-automations
Copy link
Collaborator Author

Commented by: Toni Martir (toni_martir)

Now works ok and witthout any performance penalty.
Thanks

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Resolved [ 5 ]

QA Status: Done successfully

Test Details: Checked on 2.5.1: ratio = ~105, on 2.5.2: ~132, since 2.5.3: ~1.

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

status: Resolved [ 5 ] => Closed [ 6 ]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants