Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Malformed string when trying to cast text containing national (cyryllic) characters to UTF8 if field was created in charset = NONE [CORE3830] #4172

Closed
firebird-automations opened this issue Apr 24, 2012 · 10 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: @pavel-zotov

C:\MIX\firebird\misc>isql
Use CONNECT or CREATE DATABASE to specify a database
SQL> create database 'tcs2.fdb'; commit;
SQL> show database;
Database: tcs2.fdb
Owner: SYSDBA
PAGE_SIZE 4096
Number of DB pages allocated = 196
Sweep interval = 20000
Forced Writes are ON
Transaction - oldest = 3
Transaction - oldest active = 4
Transaction - oldest snapshot = 4
Transaction - Next = 5
ODS = 11.2
Default Character set: NONE
SQL> select a.mon$character_set_id cset_id,c.rdb$character_set_name cset_name
CON> from mon$attachments a
CON> left join rdb$character_sets c on a.mon$character_set_id=c.rdb$character_set_id
CON> where a.mon$attachment_id=current_connection;

CSET_ID CSET_NAME
======= ===============================================================================
0 NONE

SQL> recreate table ttt_none(msg varchar(60)); commit;
SQL> insert into ttt_none values('some message'); commit;
SQL> select cast(msg as varchar(60) character set utf8) msg from ttt_none;

MSG

===============================================================================
some message

SQL> insert into ttt_none values('некоторое сообщение'); commit; -- some text in russian here
SQL> select cast(msg as varchar(60) character set utf8) msg from ttt_none;

MSG

===============================================================================
some message

Statement failed, SQLSTATE = 22000
Malformed string
SQL>

But:
-----
SQL> select cast(msg as varchar(60) character set win1251) msg from ttt_none;

MSG

some message
некоторое сообщение -- is extracted OK and seems readable.

------------
PS.

SQL> show version;
ISQL Version: WI-V2.5.1.26351 Firebird 2.5
Server version:
Firebird/x86/Windows NT (access method), version "WI-V2.5.1.26351 Firebird 2.5"
Firebird/x86/Windows NT (remote server), version "WI-V2.5.1.26351 Firebird 2.5/XNet (CSPROG)/P12"
Firebird/x86/Windows NT (remote interface), version "WI-V2.5.1.26351 Firebird 2.5/XNet (CSPROG)/P12"
on disk structure version 11.2

PPS.
The RDB$EXCEPTION table contains field RDB$MESSAGE that is created in charset = NONE.
Is it correct ?

@firebird-automations
Copy link
Collaborator Author

Commented by: @asfernandes

You want to insert invalid data and Firebird told it to you. What's wrong?

@firebird-automations
Copy link
Collaborator Author

Commented by: @pavel-zotov

First question appeared due to my misunderstanding.
Why FB *can* return data that was stored in charset NONE if we cast them to win1251 and can *NOT* do it when we cast to UTF8 ?

And what about RDB$EXCEPTION.RDB$MESSAGE ?

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

The only valid point here is whether RDB$MESSAGE should be Unicode in ODS12 and if not, why. The rest of the ticket is useless.

@firebird-automations
Copy link
Collaborator Author

Commented by: @asfernandes

And the RDB$MESSAGE was discussed already, but no agreement so far AFAIR. Not a matter for the tracker.

@firebird-automations
Copy link
Collaborator Author

Modified by: @asfernandes

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Won't Fix [ 2 ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

status: Resolved [ 5 ] => Closed [ 6 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @mrotteveel

Firebird can convert the NONE to WIN1251 because the text in you query was most likely a WIN1251 text and was inserted as is. When you then try to convert NONE to UTF8 it fails because there are invalid byte combinations for UTF8 encoding (ie first byte for a character is > 127).

@firebird-automations
Copy link
Collaborator Author

Commented by: Paul Gardner (cantak)

Why? The É character is 144 or C9. If I have CHARACTER SET NONE and I cast as UTF8 it throws an error. Why can't this be converted to UTF8 as C3 89? How can someone perform a cast on data?

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

It *can* be converted if the source value is defined as e.g. WIN1250 (or whatever charset defining É as 0xC9). But it *cannot* be converted if the source value is defined as NONE, because the engine has no clue what real (OS level, application level, etc) charset was used when populating that value. It's simply impossible to convert from "unknown" to anything else. The bytes are treated as UTF8 and validation fails because the input is not a well-formed UTF8 string.

@firebird-automations
Copy link
Collaborator Author

Commented by: Paul Gardner (cantak)

I understand about the NONE character set. However I tried an ASCII field and it also fails to convert to UTF8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant