Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LIKE operator: make its performance independent from the string length when pattern is immediatelly matched [CORE4306] #4629

Open
firebird-automations opened this issue Dec 28, 2013 · 5 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: @pavel-zotov

It was found that performance of LIKE operator becomes very poor in case when string have big length - e.g. 32K - though the character that matches to pattern is just at the beginning of this string.

TEST:

set stat on;
set term ^;
execute block returns(cnt int) as
declare n int = 10000000;
declare s varchar(32760);
begin
cnt=0;
s=rpad('q',32760,' '); -- string has huge length but BEGINS with character `q`
while (n>0) do begin
cnt = cnt + iif( s like '%q%', 1, 0); -- such pattern should be found immediatelly
n=n-1;
end
suspend;
end^
set term ;^
set stat off;

output:

Current memory = 2450596208
Delta memory = 390320
Max memory = 2450628912
Elapsed time= 34.18 sec -- <<<<<<<<<<<<<<<<<<<<< :(((
Cpu = 0.00 sec
Buffers = 524288
Reads = 16
Writes = 0
Fetches = 37
SQL> set term ;^
SQL> set stat off;

Compare with:

set stat on;
set term ^;
execute block returns(cnt int) as
declare n int = 10000000;
declare s varchar(50);
begin
cnt=0;
s=rpad('q',50,' '); -- same string, but of small length
while (n>0) do begin
cnt = cnt + iif( s like '%q%', 1, 0);
n=n-1;
end
suspend;
end^
set term ;^
set stat off;

output:

Current memory = 2450530424
Delta memory = -65784
Max memory = 2450628912
Elapsed time= 6.68 sec -- <<<<<<<<<<<<<<<<< much better
Cpu = 0.00 sec
Buffers = 524288
Reads = 0
Writes = 0
Fetches = 2

PS. compare with Java, (only FYI):

import static java.lang.System.*;
import java.util.regex.*;
class MatchesCount {
public static void main(String[] args) {
int pad = args.length > 0 ? Integer.parseInt(args[0]) : 32760;
String s = String.format("%1$-" + pad + "s", "q");
<//out.println>("s=>"+s+"<");
Pattern p = Pattern.compile(".*?q.*?");
Matcher m;
int cnt=0;
long t0 = currentTimeMillis();
for(int i=0; i < 10000000; i++) {
m=p.matcher(s);
cnt += m.find() ? 1 : 0;
}
out.println("s.length()="+s.length()+", cnt="+cnt+", done at "+( currentTimeMillis() - t0 )+" ms");
}
}

====
D:\JAVA>java MatchesCount 10
s.length()=10, cnt=10000000, done at 5562 ms

D:\JAVA>java MatchesCount
s.length()=32760, cnt=10000000, done at 5860 ms

D:\JAVA>java MatchesCount 256000
s.length()=256000, cnt=10000000, done at 5594 ms

D:\JAVA>java MatchesCount 512000
s.length()=512000, cnt=10000000, done at 5735 ms

As we can see, Java has regexp engine that is NO sensitive to the length of source string (in case when pattern is founded immediatelly).

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

The issue is not with the LIKE matching algorithm per se (which does stop as soon as match is found), but with the surrounding code. In particular, we have unnecessary alloc / memcpy / free chain per every LIKE comparison which cost obviously depends on the string length. I suppose that if the given collation does not support a canonical transformation, then CanonicalConverter::ctor should act as no-op, leaving the original string as is. This optimization makes the performance being the same regardless of the string length.

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

summary: LIKE operator: make its performance insensitive to the length of string when pattern is immediatelly founded => LIKE operator: make its performance independent from the string length when pattern is immediatelly matched

Version: 3.0 Alpha 1 [ 10331 ] =>

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

assignee: Adriano dos Santos Fernandes [ asfernandes ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @pavel-zotov

> This optimization makes the performance being the same regardless of the string length.

Is it possible to apply such optimization also to:
1) operator CONTAINING
2) SIMILAR TO
- ?

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

Cannot say about SIMILAR TO, but I doubt this is possible for CONTAINING, as its arguments are implicitly uppercased thus requiring a string copy operation anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants