Improve SIMILAR TO performance [CORE3919] #4254
Comments
Modified by: @asfernandes
status: Open [ 1 ] => Resolved [ 5 ]; resolution: Fixed [ 1 ]; Fix Version: 3.0 Alpha 1 [ 10331 ]; assignee: Adriano dos Santos Fernandes [ asfernandes ]
Commented by: @pavel-zotov
Adriano, can you backport this solution to the 2.5.x branch?
Commented by: @asfernandes
I would not backport it. Changes to SIMILAR TO are very risky, because the code is complex and there are very few tests.
Modified by: @pcisar
status: Resolved [ 5 ] => Closed [ 6 ]
Commented by: @pavel-zotov
> With test from CORE3858, time decreases from 165 seconds to 101 seconds.

Adriano, can you please explain why such a small text takes so long to parse with SIMILAR TO?

SQL> select iif('sdf--asdf_sd.dsfsfh-.sdjskdjfh-_.' SIMILAR TO '(([[:ALNUM:]\_\-])+.?)+([[:ALNUM:]\_\-])+' escape '\', 1, 0) from rdb$database;
Select Expression
============
Current memory = 917736
TIA.
Commented by: @asfernandes
Search for "regular expression catastrophic backtracking".
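The blow-up asked about above can be reproduced with any backtracking regex engine. A minimal sketch in Python (assuming a rough translation of the SIMILAR TO pattern, with `[[:ALNUM:]]` rendered as `[A-Za-z0-9]` and the dot kept as a literal separator; this is not Firebird's code):

```python
import re
import time

# Rough translation (an assumption) of the ticket's pattern
# '(([[:ALNUM:]\_\-])+.?)+([[:ALNUM:]\_\-])+'.
rx = re.compile(r'^(([A-Za-z0-9_\-])+\.?)+([A-Za-z0-9_\-])+$')

# The input ends with '.', but the pattern requires a trailing run of
# word characters, so no match exists. Before reporting failure, a
# backtracking engine tries every way of splitting each word-character
# run between the inner '+' and the outer '(...)+': roughly
# 2^(run length) combinations per run.
s = 'sdf--asdf_sd.dsfsfh-.'  # truncated; the full 33-char input is far slower
start = time.perf_counter()
print(rx.match(s), f'{time.perf_counter() - start:.4f}s')
```

Each extra word character before the final dot roughly doubles the failure time, which is why the full input appears to hang.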
Commented by: @pavel-zotov
Adriano, why can't SIMILAR TO stop its work immediately when it finds the first match? (Here I mean only the simplest case of usage.) For instance:

SQL> recreate table tb(b blob); commit;
SQL> select 1000000/(251*251) from rdb$database;
-- now insert a single record with a blob of length ~1'000'000: CHAR_LENGTH
SQL> commit;
Current memory = 2945864
Current memory = 3994776

PS-1. It seems to me that the elapsed time for SIMILAR TO has a non-linear relation to the length of the input (lengths tried: 500000, 750000, 1000000, 1250000).
PS-2.
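The non-linear relation noted in PS-1 is characteristic of backtracking on a failing match. A sketch (using Python's re engine and a toy pattern, not Firebird's engine or the blob data) showing the failure time roughly doubling per extra character:

```python
import re
import time

# Nested quantifiers: the classic catastrophic-backtracking shape.
rx = re.compile(r'^(a+b?)+a+$')

for n in (14, 17, 20):
    s = 'a' * n + 'b'  # ends in 'b', so the required trailing 'a+' fails
    start = time.perf_counter()
    assert rx.match(s) is None
    print(f'input length {n + 1}: {time.perf_counter() - start:.4f}s')
```

A successful match can return quickly; the expensive case is proving that no match exists, which is also why the engine cannot simply stop early here.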
Modified by: @pavel-zotov
Attachment: poor_similar_to_performance_when_scanning_blobs.fbk.7z [ 13047 ]
Commented by: @pavel-zotov
I've found one more sample:

set stat on;
select count(*) as like_count, sum(char_length(b)) as like_sum_len
select count(*) as similar_to_count, sum(char_length(b)) as similar_to_sum_len

Output on my machine:
LIKE_COUNT 86
Current memory = 37747632
....
SIMILAR_TO_COUNT 86
Current memory = 37852976

So blobs with a total length of ~2.3 MB take about 40 seconds to process with SIMILAR TO.
Modified by: @pavel-zotov
status: Closed [ 6 ] => Closed [ 6 ]; QA Status: No test => Done successfully
Modified by: @pavel-zotov
status: Closed [ 6 ] => Closed [ 6 ]; Test Details: Done test that was described in message of 10/Dec/16 08:13 PM (see in this ticket).
Submitted by: @asfernandes
Attachments:
poor_similar_to_performance_when_scanning_blobs.fbk.7z
There are two tickets (CORE3773 and CORE3858) about SIMILAR TO performance.
Both are related to the algorithm used (NFA-based, without heuristics), which may cause exponential backtracking for certain patterns and/or long texts.
The performance improvement in this ticket comes from a re-implementation of the current algorithm.
With test from CORE3858, time decreases from 165 seconds to 101 seconds.
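For contrast, a set-of-states (Thompson-style) NFA simulation of the same language runs in linear time, because it advances all live NFA states together instead of exploring one path at a time. A hand-built sketch for the benchmark pattern (a+b?)+a+ (an illustration of the technique, not the engine's actual implementation):

```python
def nfa_match(s: str) -> bool:
    """Set-of-states simulation of an NFA for (a+b?)+a+.

    States: 'S' start, 'G' inside the group's a+, 'B' just after the
    optional 'b', 'F' inside the trailing a+ (the only accepting state).
    """
    delta = {
        ('S', 'a'): {'G'},
        ('G', 'a'): {'G', 'F'},  # continue the group's a+, or start the final a+
        ('G', 'b'): {'B'},       # take the optional b
        ('B', 'a'): {'G', 'F'},  # next group iteration, or the final a+
        ('F', 'a'): {'F'},
    }
    states = {'S'}
    for ch in s:
        states = set().union(*(delta.get((q, ch), set()) for q in states))
        if not states:
            return False  # no live state: no extension can match
    return 'F' in states

# The benchmark input ends in 'b', so it cannot match; the simulation
# still needs only a single pass over the string.
print(nfa_match('aaaaaaaaaaabaaaaaabaaaaaaaaaab'))  # -> False
print(nfa_match('aaba'))                            # -> True
```

The state set never exceeds the (fixed) number of NFA states, so the running time is O(length of input) regardless of the pattern's ambiguity.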
Times in seconds for the test: echo "set stats on; select iif('aaaaaaaaaaabaaaaaabaaaaaaaaaab' SIMILAR TO '(a+b?)+a+' escape '\', 1, 0) from rdb\$database;" | isql -q t.fdb
(the iterative version is the one used; the recursive version is deactivated due to stack overflow problems)
Before:
gcc iterative: 11.83
gcc recursive: 4.74
clang iterative: 9.73
clang recursive: 3.45
After:
gcc iterative: 5.54
gcc recursive: 3.97
clang iterative: 3.85
clang recursive: 3.69
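The same blow-up can be mirrored outside isql with any backtracking engine. A sketch using Python's re on a scaled-down copy of the benchmark string (the full 30-character input has roughly 2^24 split combinations to exhaust and takes far longer):

```python
import re
import time

rx = re.compile(r'^(a+b?)+a+$')  # the benchmark pattern

# Scaled-down version of 'aaaaaaaaaaabaaaaaabaaaaaaaaaab': shorter 'a'
# runs, same shape. It ends in 'b', so it cannot match, and the engine
# must exhaust about 2^7 * 2^4 * 2^6 split combinations to prove it.
s = 'a' * 8 + 'b' + 'a' * 5 + 'b' + 'a' * 7 + 'b'
start = time.perf_counter()
print(rx.match(s), f'{time.perf_counter() - start:.4f}s')
```

Each additional 'a' in any run roughly doubles the failure time, which matches the seconds-long figures measured above for the full string.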
Commits: 6820b50
====== Test Details ======
Done test that was described in the message of 10/Dec/16 08:13 PM (see in this ticket).