Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with SIMILAR TO and UTF8 on 2.5 Beta 2 (and 1) [CORE2721] #3117

Closed
firebird-automations opened this issue Oct 31, 2009 · 6 comments
Closed

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: @asfernandes

Reported in fb-devel:

I've encountered a problem using SIMILAR TO syntax on UTF-8 fields. I
searched through the tracker but couldn't find this issue.

What seems to be happening is that when using SIMILAR TO on UTF-8 fields
doesn't get the right pattern length.
I've made a testcase with a bit of data in it that you can get from:
http://www.fantasycomic.com/t/testsimilarto.zip (both 2.5beta2 database
as gbak backup are in there).

In that testcase you'll find a table created with:

CREATE TABLE TEST (
ANSIFIELD Varchar(100),
UTF8FIELD Varchar(100) CHARACTER SET UTF8,
UNICODECIFIELD Varchar(100) CHARACTER SET UTF8 COLLATE UNICODE_CI
);

This table was then filled with a number of records, with the same value
going into each three fields.

The default character set for the database is ISO8859_1. I have done all
my tests using flamerobin 0.8.6.1652 Unicode, connection charset was
ISO8859_1.

Now on that table if you do

SELECT * FROM Test WHERE Utf8Field SIMILAR TO 'DELL %' ;

You'll get 16 results. Amongst those results are incorrect matches like
for example 'DE SINGEL'. All 16 matches start with 'DE'

Change the query to:
SELECT * FROM Test WHERE Utf8Field SIMILAR TO 'DE %' ;

and you'll get 39 results, all starting with "D" but some incorrect
matches like 'DHR. F. BERENDE'.

Now, these same queries on AnsieField (which has the same data but
different character set) will give 3 and 6 results respectively and all
correct.
If you change the query to use LIKE instead of SIMILAR TO on Utf8Field
then you get the correct results.

To me this looks like some of the substring matching in the regexp
engine is using the wrong character set to calculate string sizes, so
instead of 'DELL %' it sees 'DE%' (because the 5 character 'DELL '
literal string is seen as a 2-char one).

I've tested this issue in Firebird 2.5 Beta 1 and Beta 2 running as
superserver on windows vista, 32bit. If you wish I can run the same
tests on xp32 bit and xp 64 bit.

Commits: ba9231d c89d55b

====== Test Details ======

See also test for CORE2006 and list of other related tickets there.

@firebird-automations
Copy link
Collaborator Author

Modified by: @asfernandes

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 2.5 RC2 [ 10372 ]

assignee: Adriano dos Santos Fernandes [ asfernandes ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @asfernandes

Fix Version: 3.0 Alpha 1 [ 10331 ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

Fix Version: 2.5 RC1 [ 10362 ]

Fix Version: 2.5 RC2 [ 10372 ] =>

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

status: Resolved [ 5 ] => Closed [ 6 ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Closed [ 6 ] => Closed [ 6 ]

QA Status: Done successfully

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Closed [ 6 ] => Closed [ 6 ]

Test Details: See also test for CORE2006 and list of other related tickets there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment