More efficient implementation of SUBSTRING for UTF8 character set [CORE6542] #6769

firebird-automations · 2021-04-16T13:57:31Z

Submitted by: @hvlad

The case below shows bad performance of SUBSTRING for UTF8 comparing with (legacy) UNICODE_FSS

a) UNICODE_FSS

execute block
as
declare str1 varchar(8000) character set unicode_fss;
declare str2 varchar(10) character set unicode_fss;
declare n int = 100000;
begin
str1 = LPAD('abcd', 8000, '--');
while (n > 0) do
begin
str2 = SUBSTRING(str1 from 1 FOR 10);
n = n - 1;
end
end

Execute time = 62ms

b) UTF8

execute block
as
declare str1 varchar(8000) character set utf8;
declare str2 varchar(10) character set utf8;
declare n int = 100000;
begin
str1 = LPAD('abcd', 8000, '--');
while (n > 0) do
begin
str2 = SUBSTRING(str1 from 1 FOR 10);
n = n - 1;
end
end

Execute time = 983ms

The case is simplified and based on end-user report. In user case the same query on the system tables run much longer with FB4 than with FB3
(test database was restored from the same backup). Origin of the problem is that FB4 uses UTF8 for metadata while FB3 uses UNICODE_FSS.

The SUBSTRING implementation for UNICODE_FSS (internal_fss_substring()) is straigthforward and logical - it skips POSITION characters
from the start of the source string first and then copy LENGTH chars into dest string.

The UTF8 implementation (MultiByteCharSet::substring()) convert whole source string into UTF16 and only then get substring of UTF16 string.
This is simple but very inefficient especially for a long strings and small POSITION values.

Commits: 9c566c0 3334128 f1fe0ee ac2532f b5407c3

The text was updated successfully, but these errors were encountered:

firebird-automations · 2021-04-16T14:14:40Z

Modified by: @asfernandes

assignee: Adriano dos Santos Fernandes [ asfernandes ]

firebird-automations · 2021-04-16T17:25:12Z

Modified by: @asfernandes

summary: Implementation of SUBSTRING for multi-byte character sets is inefficient => Implementation of SUBSTRING for UTF8 character set is inefficient

firebird-automations · 2021-04-16T17:38:40Z

Modified by: @asfernandes

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 4.0.0 [ 10931 ]

Fix Version: 3.0.8 [ 10960 ]

firebird-automations · 2021-04-17T10:42:09Z

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Resolved [ 5 ]

QA Status: No test => Done successfully

firebird-automations · 2021-04-17T10:42:16Z

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Closed [ 6 ]

firebird-automations closed this as completed Apr 17, 2021

firebird-automations added affect-version: 4.0 RC 1 affect-version: 3.0.7 fix-version: 4.0.0 fix-version: 3.0.8 resolution: fixed priority: major component: engine type: improvement qa: done successfully labels Apr 25, 2021

firebird-automations assigned asfernandes Apr 25, 2021

dyemanov changed the title ~~Implementation of SUBSTRING for UTF8 character set is inefficient [CORE6542]~~ More efficient implementation of SUBSTRING for UTF8 character set [CORE6542] Nov 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More efficient implementation of SUBSTRING for UTF8 character set [CORE6542] #6769

More efficient implementation of SUBSTRING for UTF8 character set [CORE6542] #6769

firebird-automations commented Apr 16, 2021

firebird-automations commented Apr 16, 2021

firebird-automations commented Apr 16, 2021

firebird-automations commented Apr 16, 2021

firebird-automations commented Apr 17, 2021

firebird-automations commented Apr 17, 2021

More efficient implementation of SUBSTRING for UTF8 character set [CORE6542] #6769

More efficient implementation of SUBSTRING for UTF8 character set [CORE6542] #6769

Comments

firebird-automations commented Apr 16, 2021

firebird-automations commented Apr 16, 2021

firebird-automations commented Apr 16, 2021

firebird-automations commented Apr 16, 2021

firebird-automations commented Apr 17, 2021

firebird-automations commented Apr 17, 2021