Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_UTF8 prefix for string values does not work with WIN1252 charset [DNET933] #856

Closed
firebird-automations opened this issue Mar 13, 2020 · 6 comments

Comments

@firebird-automations
Copy link

Submitted by: Daniel Richter (drichter)

For string values in queries, the generated SQL results in inline values with the prefix "_UTF8". E.g,
dataContext.Articles.Where(a => a.Name=="Mülltüte")
generates SQL such as
SELECT * FROM "ARTICLE" WHERE "NAME"=_UTF8'Mülltüte'

That works as long as the connection charset is set to "UTF8". In case the connection string defines a charset such as "WIN1252" the execution throws an error "Dynamic SQL Error, SQL error code = -104, Malformed string" (when e.g. the value contains an umlaut).
After removing the "_UTF8" prefix everything works fine (both WIN1252 and UTF8 charset):
SELECT * FROM "ARTICLE" WHERE "NAME"='Mülltüte'

Because of the following code in SqlGenerator.cs, Visit(DbConstantExpression)
// constant is always considered Unicode
isUnicode = true;
The "_UTF8" prefix will always be generated - no matter which charset was used to connect.

In my opinion it is not possible to use string values in queries with charsets other than UTF8. The "_UTF8" prefix should only be added in case the database connection uses UTF8, too.

Otherwise: Is it safe to connect via UTF8 to a database with WIN1252 as charset for columns or are there any side effects?

@firebird-automations
Copy link
Author

Commented by: @cincuranet

That's weird, the `Mülltüte` does not contain any character that can't be represented in WIN1252, AFAIK. I can't immediately understand why Firebird does not handle it properly.

> Otherwise: Is it safe to connect via UTF8 to a database with WIN1252 as charset for columns or are there any side effects?

Yes, it's safe. Obviously you shouldn't pass into database values that can't be represented in WIN1252.

@firebird-automations
Copy link
Author

Modified by: @cincuranet

status: Open [ 1 ] => In Progress [ 3 ]

@firebird-automations
Copy link
Author

Commented by: Daniel Richter (drichter)

Works:
UTF8 charset - "NAME"=_UTF8'Test'
WIN1252 charset - "NAME"=_UTF8'Test'
UTF8 charset - "NAME"=_UTF8'Mülltüte'
WIN1252 charset - "NAME"='Mülltüte'

Does not work:
WIN1252 charset - "NAME"=_UTF8'Mülltüte'

Is it possible to omit the _UTF8-prefix in cases where the connection string charset is not UTF8?

@firebird-automations
Copy link
Author

Commented by: @cincuranet

> Is it possible to omit the _UTF8-prefix in cases where the connection string charset is not UTF8?

Thinking about it more, I think the problem is because the command text is sent as WIN1252 (the connection charset) and then Firebird tries to consider the piece as UTF8 and the bytes do not align. But I don't think removing the prefix is a good idea. I'm thinking in a way of: column charset X, connection charset Y, C#⁠ obviously Unicode. And then the X and Y codepages might provide different characters.

IMO the correct solution would be to use (and maybe force) usage of UTF8. At least for EF Core that's what I'm now really thinking to do.

Or maybe sending it as HEX literal. Which would solve it, but also the query would not be nice.

@firebird-automations
Copy link
Author

Commented by: Daniel Richter (drichter)

> Or maybe sending it as HEX literal. Which would solve it, but also the query would not be nice.
Can confirm. That works. (SqlGenerator.cs/FormatString)
#⁠if false
result.Append("'");
result.Append(value.Replace("'", "''"));
#⁠else
result.Append(" x'");
result.Append(BitConverter.ToString(Encoding.UTF8.GetBytes(value)).Replace("-",""));
#⁠endif
Does not generate
_UTF8'Mülltüte'
but
_UTF8 x'4DC3BC6C6C74C3BC7465'

> Thinking about it more, I think the problem is because the command text is sent as WIN1252 (the connection charset) and then Firebird tries to consider the piece as UTF8 and the bytes do not align. But I don't think removing the prefix is a good idea. I'm thinking in a way of: column charset X, connection charset Y, C#⁠ obviously Unicode. And then the X and Y codepages might provide different characters.

Hm, from a database clients view:
I decided to use WIN1252 as connection charset, so its my responsibility to only send valid WIN1252 data. I do not want any middleware to interpret my data as UTF8 (thats what the always generated _UTF8 prefix does) - because I didn't told sol! If there is any data that does not match the charset, I will get an error ("malformed string" maybe) - and that is completely my fault, because I said I want to use WIN1252. (Reminder: Umlauts and WIN1252 do work fine - without the _UTF8-prefix).
If I would decide to connect to a WIN1252 database with UTF8 and send valid UTF8 but non-WIN1252 data, the would be an error nevertheless ("arithmetic exception, numeric overflow, or string truncation" maybe).
So from this point of view it is completely unneccessary to interpret some non-UTF8 data as UTF8. (And maybe in case of a UTF8 connection charset the prefix is unnecessary too because the data already is UTF8). Using UTF8 data in non-UTF8 environments would break anyway; interpreting non-UTF8 data as UTF8 in non-UTF8 environments causes errors that are avoidable (from my point of view, I don't know the internals).

> IMO the correct solution would be to use (and maybe force) usage of UTF8. At least for EF Core that's what I'm now really thinking to do.

Than please throw an exception in case UTF8 is _not_ used.

@cincuranet
Copy link
Member

EF6 provider here is in maintenance/legacy mode here (see https://groups.google.com/g/firebird-net-provider/c/_0Hwoh5XG_I). If somebody feels like tackling this with good PR, it will be reopened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants