Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sharp-S character treated incorrectly in UNICODE_CI_AI collation [CORE4136] #4463

Closed
firebird-automations opened this issue Jul 1, 2013 · 12 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: Stefan Heymann (stefanheymann)

Is related to QA529

Votes: 1

The UNICODE_CI_AI collation treats the Sharp-s character (U+00DF) incorrectly.
This character (used in German language text) is special in that there is only a lower-case form, no upper-case (having derived from a ligature between a long and a round lowercase "s". Forget about U+1E9E, which is an abstract invention by the Unicode consortium that has no practical use in German language).

To reproduce the bug, try this on a UTF8 database:

select
case when 'Übergeek' collate unicode_ci_ai like 'ÜB%' collate unicode_ci_ai
then '=' else '<>' end as test_1,
case when 'Übergeek' collate unicode_ci_ai like 'üb%' collate unicode_ci_ai
then '=' else '<>' end as test_2,
case when 'Fußball' collate unicode_ci_ai like 'fu%' collate unicode_ci_ai
then '=' else '<>' end as test_3,
case when 'Fußball' collate unicode_ci_ai like 'fuß%' collate unicode_ci_ai
then '=' else '<>' end as test_4,
case when upper ('Fußball') like upper ('fuß%')
then '=' else '<>' end as test_5
from rdb$database

TEST_4 will show a mismatch where it should show a match.

Commits: 0e7302f fb41d66 FirebirdSQL/fbt-repository@9b97921 FirebirdSQL/fbt-repository@41851cd

====== Test Details ======

Have question about different results of comparison, see issue 29-may-2015.
Perhaps, it also related to CORE4739.
See also sample in CORE857 ( 19/Apr/15 08:56 AM )

@firebird-automations
Copy link
Collaborator Author

Commented by: Stefan Heymann (stefanheymann)

I should add that UNICODE_CI (wihout AI) works correctly on this character:

select
case when 'Übergeek' collate unicode_ci like 'ÜB%' collate unicode_ci_ai
then '=' else '<>' end as test_1,
case when 'Übergeek' collate unicode_ci like 'üb%' collate unicode_ci_ai
then '=' else '<>' end as test_2,
case when 'Fußball' collate unicode_ci like 'fu%' collate unicode_ci_ai
then '=' else '<>' end as test_3,
case when 'Fußball' collate unicode_ci like 'fuß%' collate unicode_ci_ai
then '=' else '<>' end as test_4,
case when upper ('Fußball') like upper ('fuß%')
then '=' else '<>' end as test_5
from rdb$database

will show a match for TEST_4, which is correct.

@firebird-automations
Copy link
Collaborator Author

Modified by: @asfernandes

assignee: Adriano dos Santos Fernandes [ asfernandes ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @asfernandes

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 2.5.3 [ 10461 ]

Fix Version: 3.0 Alpha 2 [ 10560 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: Stefan Heymann (stefanheymann)

It's working now (tested 2013-07-08 with Firebird 2.5.3.26671). Thanks!

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

Fix Version: 3.0 Alpha 1 [ 10331 ]

Fix Version: 3.0 Alpha 2 [ 10560 ] =>

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

Link: This issue is related to QA529 [ QA529 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @pavel-zotov

Can anyone who knows German language explain following results:

set names utf8;
create database 'localhost/3333:C:\MIX\firebird\QA\fbt-repo\tmp\c4136_a.fdb' default character set utf8;
commit;

recreate table test(text varchar(10) collate unicode_ci_ai, patt varchar(10) collate unicode_ci_ai);
commit;
show table test;

insert into test values('ß','s');
insert into test values('ß','ss');
insert into test values('ss','ß');
commit;

set list on;
select
text
,patt
,text = patt as "equal result"
,text is NOT distinct from patt as "is NOT_distinct result"
,text starting with patt as "starting_with result"
,text like patt as "like result"
,text containing patt as "containing result"
,text similar to patt as "similar_to result"
from test
;

Output:

TEXT ß
PATT s
equal result <false>
is NOT_distinct result <false>
starting_with result <false> ------------ WHY ?
like result <false>
containing result <false> ------------- WHY ?
similar_to result <false>

TEXT ß
PATT ss
equal result <true>
is NOT_distinct result <true>
starting_with result <false> -------- WHY ??? (this and subsequent three)
like result <false>
containing result <false>
similar_to result <false>

TEXT ss
PATT ß
equal result <true>
is NOT_distinct result <true>
starting_with result <false> -------- WHY ??? (this and subsequent three)
like result <false>
containing result <false>
similar_to result <false>

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Resolved [ 5 ]

QA Status: Done successfully

Test Details: Have question about different results of comparison, see issue 29-may-2015.
Perhaps, it also related to CORE4739.

@firebird-automations
Copy link
Collaborator Author

Commented by: Stefan Heymann (stefanheymann)

Historically, the sharp-s character (ß) is a typographical ligature of a "long" and a "round" lowercase s. Today it is a complete character of its own, so "ß" and "ss" are not the same (you can, however, treat them as the same in sorting). The same is true for "ß" and "s". There is no such thing as an uppercase sharp-S, so the Unicode code point U+1E9E LATIN CAPITAL LETTER SHARP S is completely pointless.
In uppercase writing, a sharp-s is replaced by "SS" or (to avoid amiguities) by "SZ" (so Masse gets MASSE, Maße gets MASSE or MASZE, Fuß gets FUSS).
A "ß" is not starting with "s" or "ss" and it doesn't contain them.

@firebird-automations
Copy link
Collaborator Author

Commented by: @pcisar

Test created.

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

status: Resolved [ 5 ] => Closed [ 6 ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Closed [ 6 ] => Closed [ 6 ]

Test Details: Have question about different results of comparison, see issue 29-may-2015.
Perhaps, it also related to CORE4739.

=>

Have question about different results of comparison, see issue 29-may-2015.
Perhaps, it also related to CORE4739.
See also sample in CORE857 ( 19/Apr/15 08:56 AM )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment