Improve SIMILAR TO performance [CORE3919] #4254
Comments
Modified by: @asfernandes
status: Open [ 1 ] => Resolved [ 5 ]; resolution: Fixed [ 1 ]; Fix Version: 3.0 Alpha 1 [ 10331 ]; assignee: Adriano dos Santos Fernandes [ asfernandes ]
Commented by: @pavel-zotov
Adriano, can you backport this solution to the 2.5.x branch?
Commented by: @asfernandes
I would not backport it. Changes to SIMILAR TO are very risky, because the code is complex and there are very few tests.
Modified by: @pcisar
status: Resolved [ 5 ] => Closed [ 6 ]
Commented by: @pavel-zotov
> With test from CORE3858, time decreases from 165 seconds to 101 seconds.

Adriano, can you please explain why such a small text takes so long to parse with SIMILAR TO?

SQL> select iif('sdf--asdf_sd.dsfsfh-.sdjskdjfh-_.' SIMILAR TO '(([[:ALNUM:]\_\-])+.?)+([[:ALNUM:]\_\-])+' escape '\', 1, 0) from rdb$database;
Select Expression
============
Current memory = 917736
TIA.
Commented by: @asfernandes
Search for "regular expression catastrophic backtracking".
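The blow-up asked about above can be reproduced with any backtracking regex engine. A minimal sketch in Python (assuming a rough translation of the SIMILAR TO pattern, with `[[:ALNUM:]]` rendered as `[A-Za-z0-9]` and the dot kept as a literal separator; this is not Firebird's code):

```python
import re
import time

# Rough translation (an assumption) of the ticket's pattern
# '(([[:ALNUM:]\_\-])+.?)+([[:ALNUM:]\_\-])+'.
rx = re.compile(r'^(([A-Za-z0-9_\-])+\.?)+([A-Za-z0-9_\-])+$')

# The input ends with '.', but the pattern requires a trailing run of
# word characters, so no match exists. Before reporting failure, a
# backtracking engine tries every way of splitting each word-character
# run between the inner '+' and the outer '(...)+': roughly
# 2^(run length) combinations per run.
s = 'sdf--asdf_sd.dsfsfh-.'  # truncated; the full 33-char input is far slower
start = time.perf_counter()
print(rx.match(s), f'{time.perf_counter() - start:.4f}s')
```

Each extra word character before the final dot roughly doubles the failure time, which is why the full input appears to hang.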
Commented by: @pavel-zotov
Adriano, why can't SIMILAR TO stop its work immediately when it finds the first match? (Here I mean only the simplest case of usage.) For instance:

SQL> recreate table tb(b blob); commit;
SQL> select 1000000/(251*251) from rdb$database;
-- now insert a single record with a blob of length ~1'000'000: CHAR_LENGTH
SQL> commit;
Current memory = 2945864
Current memory = 3994776

PS-1. It seems to me that the elapsed time for SIMILAR TO has a non-linear relation to the length of the input (lengths tried: 500000, 750000, 1000000, 1250000).
PS-2.
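The non-linear relation noted in PS-1 is characteristic of backtracking on a failing match. A sketch (using Python's re engine and a toy pattern, not Firebird's engine or the blob data) showing the failure time roughly doubling per extra character:

```python
import re
import time

# Nested quantifiers: the classic catastrophic-backtracking shape.
rx = re.compile(r'^(a+b?)+a+$')

for n in (14, 17, 20):
    s = 'a' * n + 'b'  # ends in 'b', so the required trailing 'a+' fails
    start = time.perf_counter()
    assert rx.match(s) is None
    print(f'input length {n + 1}: {time.perf_counter() - start:.4f}s')
```

A successful match can return quickly; the expensive case is proving that no match exists, which is also why the engine cannot simply stop early here.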
Modified by: @pavel-zotov
Attachment: poor_similar_to_performance_when_scanning_blobs.fbk.7z [ 13047 ]
Commented by: @pavel-zotov
I've found one more sample:

set stat on;
select count(*) as like_count, sum(char_length(b)) as like_sum_len
select count(*) as similar_to_count, sum(char_length(b)) as similar_to_sum_len

Output on my machine:
LIKE_COUNT 86
Current memory = 37747632
....
SIMILAR_TO_COUNT 86
Current memory = 37852976

So blobs with a total length of ~2.3 MB take about 40 seconds to process with SIMILAR TO.
Modified by: @pavel-zotov
status: Closed [ 6 ] => Closed [ 6 ]; QA Status: No test => Done successfully
Modified by: @pavel-zotov
status: Closed [ 6 ] => Closed [ 6 ]; Test Details: Done test that was described in message of 10/Dec/16 08:13 PM (see in this ticket).
Submitted by: @asfernandes
Attachments:
poor_similar_to_performance_when_scanning_blobs.fbk.7z
There are two tickets (CORE3773 and CORE3858) about SIMILAR TO performance.
Both are related to the algorithm used (NFA-based, without heuristics), which may cause exponential backtracking for certain patterns and/or long texts.
The performance improvement in this ticket comes from a re-implementation of the current algorithm.
With test from CORE3858, time decreases from 165 seconds to 101 seconds.
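For contrast, a set-of-states (Thompson-style) NFA simulation of the same language runs in linear time, because it advances all live NFA states together instead of exploring one path at a time. A hand-built sketch for the benchmark pattern (a+b?)+a+ (an illustration of the technique, not the engine's actual implementation):

```python
def nfa_match(s: str) -> bool:
    """Set-of-states simulation of an NFA for (a+b?)+a+.

    States: 'S' start, 'G' inside the group's a+, 'B' just after the
    optional 'b', 'F' inside the trailing a+ (the only accepting state).
    """
    delta = {
        ('S', 'a'): {'G'},
        ('G', 'a'): {'G', 'F'},  # continue the group's a+, or start the final a+
        ('G', 'b'): {'B'},       # take the optional b
        ('B', 'a'): {'G', 'F'},  # next group iteration, or the final a+
        ('F', 'a'): {'F'},
    }
    states = {'S'}
    for ch in s:
        states = set().union(*(delta.get((q, ch), set()) for q in states))
        if not states:
            return False  # no live state: no extension can match
    return 'F' in states

# The benchmark input ends in 'b', so it cannot match; the simulation
# still needs only a single pass over the string.
print(nfa_match('aaaaaaaaaaabaaaaaabaaaaaaaaaab'))  # -> False
print(nfa_match('aaba'))                            # -> True
```

The state set never exceeds the (fixed) number of NFA states, so the running time is O(length of input) regardless of the pattern's ambiguity.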
Times in seconds for the test: echo "set stats on; select iif('aaaaaaaaaaabaaaaaabaaaaaaaaaab' SIMILAR TO '(a+b?)+a+' escape '\', 1, 0) from rdb\$database;" | isql -q t.fdb
(the iterative version is the one used; the recursive version is deactivated due to stack overflow problems)
Before:
gcc iterative: 11.83
gcc recursive: 4.74
clang iterative: 9.73
clang recursive: 3.45
After:
gcc iterative: 5.54
gcc recursive: 3.97
clang iterative: 3.85
clang recursive: 3.69
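The same blow-up can be mirrored outside isql with any backtracking engine. A sketch using Python's re on a scaled-down copy of the benchmark string (the full 30-character input has roughly 2^24 split combinations to exhaust and takes far longer):

```python
import re
import time

rx = re.compile(r'^(a+b?)+a+$')  # the benchmark pattern

# Scaled-down version of 'aaaaaaaaaaabaaaaaabaaaaaaaaaab': shorter 'a'
# runs, same shape. It ends in 'b', so it cannot match, and the engine
# must exhaust about 2^7 * 2^4 * 2^6 split combinations to prove it.
s = 'a' * 8 + 'b' + 'a' * 5 + 'b' + 'a' * 7 + 'b'
start = time.perf_counter()
print(rx.match(s), f'{time.perf_counter() - start:.4f}s')
```

Each additional 'a' in any run roughly doubles the failure time, which matches the seconds-long figures measured above for the full string.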
Commits: 6820b50
====== Test Details ======
Done test that was described in the message of 10/Dec/16 08:13 PM (see in this ticket).