Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve SIMILAR TO performance [CORE3919] #4254

Closed
firebird-automations opened this issue Sep 9, 2012 · 12 comments
Closed

Improve SIMILAR TO performance [CORE3919] #4254

firebird-automations opened this issue Sep 9, 2012 · 12 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: @asfernandes

Attachments:
poor_similar_to_performance_when_scanning_blobs.fbk.7z

There is two tickets (CORE3773 and CORE3858) about SIMILAR TO performance.

Both are related to the algorithm (NFA-based, without heuristics) used, which may cause exponential backtracking for special patterns and/or long texts.

The improvement performance of this ticket is about re-implementation of the current algorithm.

With test from CORE3858, time decreases from 165 seconds to 101 seconds.

Times in seconds for test: echo "set stats on; select iif('aaaaaaaaaaabaaaaaabaaaaaaaaaab' SIMILAR TO '(a+b?)+a+' escape '\', 1, 0) from rdb\$database;" | isql -q t.fdb

(iterative version is the one used, and recursive version is deactivated due to stack overflow problems)

Before:
gcc iterative: 11,83
gcc recursive: 4,74
clang iterative: 9,73
clang recursive: 3,45

After:
gcc iterative: 5,54
gcc recursive: 3,97
clang iterative: 3,85
clang recursive: 3,69

Commits: 6820b50

====== Test Details ======

Done test that was described in message of 10/Dec/16 08:13 PM (see in this ticket).

@firebird-automations
Copy link
Collaborator Author

Modified by: @asfernandes

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 3.0 Alpha 1 [ 10331 ]

assignee: Adriano dos Santos Fernandes [ asfernandes ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @pavel-zotov

Adriano,

can you backport this solution to 2.5.x branch ?

@firebird-automations
Copy link
Collaborator Author

Commented by: @asfernandes

I would not backport it. Changes to SIMILAR TO are very risk, because the code is complex and there are very few tests.

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

status: Resolved [ 5 ] => Closed [ 6 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @pavel-zotov

> With test from CORE3858, time decreases from 165 seconds to 101 seconds.

Adriano,

can you please explain how can this small text be parsed so long with using similar to ?
I've tested again this sample on LI-T3.0.0.30556 and have no clue how it can be done in ~ 2 minutes.

SQL> select iif('sdf--asdf_sd.dsfsfh-.sdjskdjfh-_.' SIMILAR TO '(([[:ALNUM:]\_\-])+.?)+([[:ALNUM:]\_\-])+' escape '\', 1, 0) from rdb$database;

Select Expression
-> Table "RDB$DATABASE" Full Scan

    CASE

============
0

Current memory = 917736
Delta memory = -390408
Max memory = 1605960
Elapsed time= 117.89 sec
Buffers = 2048
Reads = 0
Writes 2
Fetches = 47
SQL>

TIA.

@firebird-automations
Copy link
Collaborator Author

Commented by: @asfernandes

Search for regular expression catastrophic backtracking.

@firebird-automations
Copy link
Collaborator Author

Commented by: @pavel-zotov

Adriano,

why SIMILAR TO can't stop its work immediatelly when it found 1st matching ? (now I mean only simplest case of using!)

For instance:

SQL> recreate table tb(b blob); commit;
-- simple calculations to estimate arguments for INSERT and LIST():
SQL> select count(*) from rdb$types;

            COUNT

=====================
251

SQL> select 1000000/(251*251) from rdb$database;

           DIVIDE

=====================
15

-- now insert single record with blob of length ~1'000'000:
SQL> insert into tb select list(rpad('',15,'a'),'#⁠') from (select 1 x from rdb$types,rdb$types rows (251*251));
SQL> select char_length(b) from tb;

CHAR_LENGTH

 1008015

SQL> commit;
SQL> set stat on;
SQL> select count(*) from tb where b like '%aaa#⁠aaa%';

            COUNT

=====================
1

Current memory = 2945864
Delta memory = 21616
Max memory = 3008088
Elapsed time= 0.05 sec -- <<<<<<<<<<<<<<<<<<<<<<< YES, operator "LIKE" stops immediatelly
Buffers = 256
Reads = 2
Writes 0
Fetches = 18
SQL> select count(*) from tb where b similar to '%aaa#⁠aaa%';

            COUNT

=====================
1

Current memory = 3994776
Delta memory = 1048912
Max memory = 5008264
Elapsed time= 113.75 sec -- <<<<<<<<<<<<<<<<<<<<<<< ???
Buffers = 256
Reads = 311
Writes 0
Fetches = 319

PS-1. It seems to me that elapsed time for SIMILAR TO has non-linear relation with the length of input:
length=250000:
Current memory = 3545360
Delta memory = 268536
Max memory = 3801936
Elapsed time= 6.39 sec
Buffers = 256
Reads = 103
Writes 0
Fetches = 141

..........................

length=500000:
Current memory = 3842032
Delta memory = 545648
Max memory = 4351016
Elapsed time= 25.97 sec
Buffers = 256
Reads = 0
Writes 0
Fetches = 195

..........................

length=750000:
Current memory = 4366320
Delta memory = 1055272
Max memory = 5127560
Elapsed time= 58.36 sec
Buffers = 256
Reads = 248
Writes 0
Fetches = 257

..........................

length=1000000:
Current memory = 4366320
Delta memory = 1055272
Max memory = 5379808
Elapsed time= 104.11 sec
Buffers = 256
Reads = 312
Writes 0
Fetches = 319

..........................

length=1250000:
Current memory = 5427368
Delta memory = 2118512
Max memory = 6693112
Elapsed time= 170.47 sec
Buffers = 256
Reads = 374
Writes 0
Fetches = 381

PS-2.
SQL> show version;
ISQL Version: WI-V2.5.3.26661 Firebird 2.5
Server version:
Firebird/x86/Windows NT (access method), version "WI-T3.0.0.30603 Firebird 3.0 Alpha 1"
Firebird/x86/Windows NT (remote server), version "WI-T3.0.0.30603 Firebird 3.0 Alpha 1/tcp (CSMIRROR)/P12"
Firebird/x86/Windows NT (remote interface), version "WI-V2.5.3.26661 Firebird 2.5/tcp (csprog)/P12"
on disk structure version 12.0

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Closed [ 6 ] => Closed [ 6 ]

QA Status: No test

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

Attachment: poor_similar_to_performance_when_scanning_blobs.fbk.7z [ 13047 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @pavel-zotov

I've found one more sample.
Unpack and restore DB from attached file ( "poor_similar_to_performance_when_scanning_blobs.fbk.7z" ).
The run following script:

set stat on;
set list on;
set echo on;

select count(*) as like_count, sum(char_length(b)) as like_sum_len
from test t
where
t.b like '%a%' or
t.b like '%b%' or
t.b like '%c%' or
t.b like '%d%' or
t.b like '%e%' or
t.b like '%f%' or
t.b like '%g%' or
t.b like '%h%' or
t.b like '%i%' or
t.b like '%j%' or
t.b like '%k%' or
t.b like '%l%' or
t.b like '%m%' or
t.b like '%n%' or
t.b like '%o%' or
t.b like '%p%' or
t.b like '%q%' or
t.b like '%r%' or
t.b like '%s%' or
t.b like '%t%' or
t.b like '%u%' or
t.b like '%v%' or
t.b like '%w%' or
t.b like '%x%' or
t.b like '%y%' or
t.b like '%z%'
;

select count(*) as similar_to_count, sum(char_length(b)) as similar_to_sum_len
from test t
where t.b similar to '%[a-z]%'
;

Output on my machine:

LIKE_COUNT 86
LIKE_SUM_LEN 2306757

Current memory = 37747632
Delta memory = 391952
Max memory = 37793576
Elapsed time= 0.036 sec
Buffers = 2048
Reads = 80
Writes = 0
Fetches = 1671

....

SIMILAR_TO_COUNT 86
SIMILAR_TO_SUM_LEN 2306757

Current memory = 37852976
Delta memory = 105344
Max memory = 38084408
Elapsed time= 41.475 sec
Buffers = 2048
Reads = 91
Writes = 0
Fetches = 640

So, blobs with total length ~2.3 Mb are processed using 'similar to' about 40 seconds.
How can it be ?

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Closed [ 6 ] => Closed [ 6 ]

QA Status: No test => Done successfully

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Closed [ 6 ] => Closed [ 6 ]

Test Details: Done test that was described in message of 10/Dec/16 08:13 PM (see in this ticket).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants