Issue Details (XML | Word | Printable)

Key: DNET-933
Type: Bug Bug
Status: In Progress In Progress
Priority: Major Major
Assignee: Jiri Cincura
Reporter: Daniel Richter
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
.NET Data provider

_UTF8 prefix for string values does not work with WIN1252 charset

Created: 13/Mar/20 05:39 PM   Updated: 16/Mar/20 03:34 PM
Component/s: Entity Framework
Affects Version/s: 7.5.0.0
Fix Version/s: None


 Description  « Hide
For string values in queries, the generated SQL results in inline values with the prefix "_UTF8". E.g,
   dataContext.Articles.Where(a => a.Name=="Mülltüte")
generates SQL such as
   SELECT * FROM "ARTICLE" WHERE "NAME"=_UTF8'Mülltüte'
   
That works as long as the connection charset is set to "UTF8". In case the connection string defines a charset such as "WIN1252" the execution throws an error "Dynamic SQL Error, SQL error code = -104, Malformed string" (when e.g. the value contains an umlaut).
After removing the "_UTF8" prefix everything works fine (both WIN1252 and UTF8 charset):
   SELECT * FROM "ARTICLE" WHERE "NAME"='Mülltüte'

Because of the following code in SqlGenerator.cs, Visit(DbConstantExpression)
// constant is always considered Unicode
isUnicode = true;
The "_UTF8" prefix will always be generated - no matter which charset was used to connect.

In my opinion it is not possible to use string values in queries with charsets other than UTF8. The "_UTF8" prefix should only be added in case the database connection uses UTF8, too.

Otherwise: Is it safe to connect via UTF8 to a database with WIN1252 as charset for columns or are there any side effects?

 All   Comments   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Jiri Cincura added a comment - 16/Mar/20 11:40 AM
That's weird, the `Mülltüte` does not contain any character that can't be represented in WIN1252, AFAIK. I can't immediately understand why Firebird does not handle it properly.

> Otherwise: Is it safe to connect via UTF8 to a database with WIN1252 as charset for columns or are there any side effects?

Yes, it's safe. Obviously you shouldn't pass into database values that can't be represented in WIN1252.

Daniel Richter added a comment - 16/Mar/20 12:17 PM
Works:
UTF8 charset - "NAME"=_UTF8'Test'
WIN1252 charset - "NAME"=_UTF8'Test'
UTF8 charset - "NAME"=_UTF8'Mülltüte'
WIN1252 charset - "NAME"='Mülltüte'

Does not work:
WIN1252 charset - "NAME"=_UTF8'Mülltüte'

Is it possible to omit the _UTF8-prefix in cases where the connection string charset is not UTF8?

Jiri Cincura added a comment - 16/Mar/20 01:29 PM
> Is it possible to omit the _UTF8-prefix in cases where the connection string charset is not UTF8?

Thinking about it more, I think the problem is because the command text is sent as WIN1252 (the connection charset) and then Firebird tries to consider the piece as UTF8 and the bytes do not align. But I don't think removing the prefix is a good idea. I'm thinking in a way of: column charset X, connection charset Y, C# obviously Unicode. And then the X and Y codepages might provide different characters.

IMO the correct solution would be to use (and maybe force) usage of UTF8. At least for EF Core that's what I'm now really thinking to do.

Or maybe sending it as HEX literal. Which would solve it, but also the query would not be nice.

Daniel Richter added a comment - 16/Mar/20 03:34 PM
> Or maybe sending it as HEX literal. Which would solve it, but also the query would not be nice.
Can confirm. That works. (SqlGenerator.cs/FormatString)
  #if false
      result.Append("'");
      result.Append(value.Replace("'", "''"));
   #else
      result.Append(" x'");
      result.Append(BitConverter.ToString(Encoding.UTF8.GetBytes(value)).Replace("-",""));
   #endif
Does not generate
   _UTF8'Mülltüte'
but
   _UTF8 x'4DC3BC6C6C74C3BC7465'


> Thinking about it more, I think the problem is because the command text is sent as WIN1252 (the connection charset) and then Firebird tries to consider the piece as UTF8 and the bytes do not align. But I don't think removing the prefix is a good idea. I'm thinking in a way of: column charset X, connection charset Y, C# obviously Unicode. And then the X and Y codepages might provide different characters.

Hm, from a database clients view:
I decided to use WIN1252 as connection charset, so its my responsibility to only send valid WIN1252 data. I do not want any middleware to interpret my data as UTF8 (thats what the always generated _UTF8 prefix does) - because I didn't told sol! If there is any data that does not match the charset, I will get an error ("malformed string" maybe) - and that is completely my fault, because I said I want to use WIN1252. (Reminder: Umlauts and WIN1252 do work fine - without the _UTF8-prefix).
If I would decide to connect to a WIN1252 database with UTF8 and send valid UTF8 but non-WIN1252 data, the would be an error nevertheless ("arithmetic exception, numeric overflow, or string truncation" maybe).
So from this point of view it is completely unneccessary to interpret some non-UTF8 data as UTF8. (And maybe in case of a UTF8 connection charset the prefix is unnecessary too because the data already is UTF8). Using UTF8 data in non-UTF8 environments would break anyway; interpreting non-UTF8 data as UTF8 in non-UTF8 environments causes errors that are avoidable (from my point of view, I don't know the internals).


> IMO the correct solution would be to use (and maybe force) usage of UTF8. At least for EF Core that's what I'm now really thinking to do.

Than please throw an exception in case UTF8 is _not_ used.