Accent-insensitive comparison of cyrillic 'И' (U+0418) and 'Й' (U+0419) should be the same as 'Е' (U+0415) and 'Ё' (U+0401) [CORE4803] #5101

Open
firebird-automations opened this issue May 19, 2015 · 6 comments

Comments

@firebird-automations
Collaborator

Submitted by: @pavel-zotov

SQL> select _utf8 'и' collate unicode_ci_ai = _utf8 'й' collate unicode_ci_ai as "equal?" from rdb$database
CON> union all
CON> select _utf8 'е' collate unicode_ci_ai = _utf8 'ё' collate unicode_ci_ai from rdb$database;

equal?

<false>
<true>

SQL> select _utf8 'и' collate unicode_ci_ai = _utf8 'Й' collate unicode_ci_ai as "equal?" from rdb$database
CON> union all
CON> select _utf8 'Е' collate unicode_ci_ai = _utf8 'ё' collate unicode_ci_ai as "equal?" from rdb$database
CON> ;

equal?

<false>
<true>

@firebird-automations
Collaborator Author

Commented by: @asfernandes

Why? Sources?

For example, Chrome's Ctrl+F (text search) searches in a case- and accent-insensitive way.

And it goes against your ticket: U+0418 is considered different than U+0419.

@firebird-automations
Collaborator Author

Commented by: @pavel-zotov

http://en.wikipedia.org/wiki/Breve

The letter 'Й' is the ordinary Cyrillic letter 'И' with a diacritical mark over it (a semicircle) called a 'breve'.

If an accent-insensitive collation is one that ignores ACCENTS, then why should it behave differently for this letter?
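The breve argument can be checked against Unicode's canonical decomposition data. The following is only a sketch using Python's standard `unicodedata` module; the helper names `decompose` and `strip_marks` are made up for illustration, and mark stripping is a naive stand-in for accent insensitivity, not what Firebird or ICU actually do when collating.

```python
import unicodedata

def decompose(ch):
    """Return the Unicode names of the code points in the NFD form of ch."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]

def strip_marks(text):
    """Naive accent removal: decompose (NFD), then drop nonspacing marks (Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(decompose("Й"))  # ['CYRILLIC CAPITAL LETTER I', 'COMBINING BREVE']
print(decompose("Ё"))  # ['CYRILLIC CAPITAL LETTER IE', 'COMBINING DIAERESIS']
print(strip_marks("Й") == "И")  # True
print(strip_marks("Ё") == "Е")  # True
```

At the level of canonical decomposition both letters are a base letter plus a combining mark, so any difference between the two pairs has to come from the collation rules themselves.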

Also: the sound of these letters ('И' (U+0418) and 'Й' (U+0419)) differs when pronounced, but only when 'Й' is at the **last** position of the word. There is not much of a problem when this letter occurs in the middle of a word, if the word is a COMMON noun.

A problem can arise when doing a 'sound-based search' for words that contain 'И' or 'Й' in the middle and are NOT common nouns, i.e. names of persons, cities, rivers, etc.

@firebird-automations
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Adriano,

"...U+0418 is considered different than U+0419."

To an outsider, if 'Е' (U+0415) and 'Ё' (U+0401) are the same it would seem reasonable that they could be the same. Why do you believe that they are different?

What I find interesting/odd is that 'Е' and 'ё' are the same, they are different cases.

@firebird-automations
Collaborator Author

Commented by: @pavel-zotov

> To an outsider, if 'Е' (U+0415) and 'Ё' (U+0401) are the same it would seem reasonable that they could be the same. Why do you believe that they are different?

I am currently talking about ANOTHER pair of letters: 'И' (U+0418) and 'Й' (U+0419). IMHO they should compare the same way 'Е' (U+0415) and 'Ё' (U+0401) do under an accent-insensitive collation.

> What I find interesting/odd is that 'Е' and 'ё' are the same, they are different cases.

This is because I've also applied a CASE-insensitive collation, not only AI.
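To illustrate how case and accent insensitivity combine, here is a rough Python sketch. `fold_ci_ai` is a hypothetical helper, and case folding plus mark stripping is only an approximation of a CI/AI comparison, not the `unicode_ci_ai` implementation.

```python
import unicodedata

def fold_ci_ai(text):
    """Hypothetical CI+AI key: case-fold, decompose (NFD), drop nonspacing marks."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

# 'Е' vs 'ё': differs in both case and accent, yet folds to the same key.
print(fold_ci_ai("Е") == fold_ci_ai("ё"))  # True
# 'и' vs 'Й': this naive folding also gives True, unlike the <false> in the ticket.
print(fold_ci_ai("и") == fold_ci_ai("Й"))  # True
```

The second result is exactly the behavior this ticket asks for. The fact that Firebird reports `<false>` there shows that the collation is not doing a plain mark strip.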

@firebird-automations
Collaborator Author

Commented by: @asfernandes

These are the letter names:

cyrillic capital letter short i (U+0419)
cyrillic small letter short i (U+0439)

cyrillic capital letter i (U+0418)
cyrillic small letter i (U+0438)

cyrillic capital letter io (U+0401)
cyrillic capital letter ie (U+0415)

So, for me, even the U+0401/U+0415 case is not that simple. It's not just an accent.
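For anyone who wants to double-check the names listed above, they can be looked up with Python's standard `unicodedata` module. This is just a small sketch; `describe` and `CODE_POINTS` are names chosen for illustration.

```python
import unicodedata

# The six code points discussed in this ticket.
CODE_POINTS = [0x0419, 0x0439, 0x0418, 0x0438, 0x0401, 0x0415]

def describe(cp):
    """Format a code point as 'U+XXXX NAME' using the Unicode character database."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"

for cp in CODE_POINTS:
    print(describe(cp))
```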

Since Chrome behaves the same way as Firebird, I believe this should be reported to the ICU tracker instead of Firebird.

@firebird-automations
Collaborator Author

Commented by: @dyemanov

IMHO, our behavior is correct. 'Е' (U+0415) and 'Ё' (U+0401) can really be considered equal if compared without accents: it has recently become quite common in Russia to write 'Е' instead of 'Ё' in PC-typed texts, which is formally wrong but absolutely understandable. But 'И' (U+0418) and 'Й' (U+0419) are really different: the former is a clear vowel, while the latter is more a consonant than a vowel.
