Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outer join tables, on text columns with different collates (perhaps charsets too), may propagate wrong LIKE predicate [CORE6094] #6343

Open
firebird-automations opened this issue Jul 3, 2019 · 5 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: Arioch (arioch)

Initial report: https://stackoverflow.com/questions/56858452/why-did-my-where-clause-affect-my-left-join
Initial discussion: https://www.sql.ru/forum/1314381-a/fb3-join-vs-like-vs-case-sensitivity

The example is based upon mismatched collates but probably implementation behavior is the same with regard to charsets too.

DDL:

~~~~~~~~~~~~~~~~~
create table t_CI ( -- case-INSENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE_CI,
payload integer
);

create table t_CS ( -- case-SENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE,
payload integer
);

insert into t_CS values ( 'ABCD', -100 );

insert into t_CI values ( 'ABCD', +22 );
~~~~~~~~~~~~~~

We now have two tables with ID fields having different collates. However the ID value is the same within both tables and is within both collates.

When joining those two tables - the results should be either data from both tables or no data at all.

Situation, when upon join by ID data is fetched from only one of the two tables but not from another would be abnormal !!!

Abnormal query returning only half of the data, most probably due to propagation of `LIKE` predicate with wrong collation:
~~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%bc%'
;

/*
Plan:
PLAN JOIN (T_CI NATURAL, T_CS INDEX (RDB$PRIMARY3))

Adapted plan:
PLAN JOIN (T_CI NATURAL, T_CS INDEX (INTEG_21))
*/
~~~~~~~~~~~

For comparison:

Correct queries returning data from BOTH tables:
~~~~~~~~
select * from T_CS, T_CI where T_CS.id = T_CI.id
;

select * from T_CS
left join T_CI on T_CS.id = T_CI.id
;

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
;

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%BC%'
;

select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%BC%'
;

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode_ci like '%bc%'
;
~~~~~~~~~~~~~~~

Correct queries returning no data at all:
~~~~~~~
select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%bc%'
;

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode like '%bc%'
;
~~~~~~~

One more query, somewhat similar, different in engaging master-table index thus probably preventing LIKE-propagation
~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id starting with 'abc' ;
;

/*
Plan:
PLAN JOIN (T_CI INDEX (RDB$PRIMARY2), T_CS INDEX (RDB$PRIMARY3))
*/
~~~~~~~~~~~~~~~~~~~~~

@firebird-automations
Copy link
Collaborator Author

Modified by: Arioch (arioch)

description: Initial report: https://stackoverflow.com/questions/56858452/why-did-my-where-clause-affect-my-left-join
Initial discussion: https://www.sql.ru/forum/1314381-a/fb3-join-vs-like-vs-case-sensitivity

The example is based upon mismatched collates but probably implementation behavior is the same with regard to charsets too.

DML:

~~~~~~~~~~~~~~~~~
create table t_CI ( -- case-INSENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE_CI,
payload integer
);

create table t_CS ( -- case-SENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE,
payload integer
);

insert into t_CS values ( 'ABCD', -100 );

insert into t_CI values ( 'ABCD', +22 );
~~~~~~~~~~~~~~

We now have two tables with ID fields having different collates. However the ID value is the same within both tables and is within both collates.

When joining those two tables - the results should be either data from both tables or no data at all.

Situation, when upon join by ID data is fetched from only one of the two tables but not from another would be abnormal !!!

Abnormal query returning only half of the data, most probably due to propagation of `LIKE` predicate with wrong collation:
~~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%bc%'
PLAN JOIN (T_CI NATURAL, T_CS INDEX (RDB$PRIMARY3))

Adapted plan:
PLAN JOIN (T_CI NATURAL, T_CS INDEX (INTEG_21))
~~~~~~~~~~~

For comparison:

Correct queries returning data from BOTH tables:
~~~~~~~~
select * from T_CS, T_CI where T_CS.id = T_CI.id

select * from T_CS
left join T_CI on T_CS.id = T_CI.id

select * from T_CI
left join T_CS on T_CS.id = T_CI.id

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%BC%'

select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%BC%'

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode_ci like '%bc%'
~~~~~~~~~~~~~~~

Correct queries returning no data at all:
~~~~~~~
select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%bc%'

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode like '%bc%'
~~~~~~~

One more query, somewhat similar, different is engaging master-table index thus probably preventing LIKE-propagation
~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id starting with 'abc'

-- OK: data ~ data

Plan:
PLAN JOIN (T_CI INDEX (RDB$PRIMARY2), T_CS INDEX (RDB$PRIMARY3))
~~~~~~~~~~~~~~~~~~~~~

=>

Initial report: https://stackoverflow.com/questions/56858452/why-did-my-where-clause-affect-my-left-join
Initial discussion: https://www.sql.ru/forum/1314381-a/fb3-join-vs-like-vs-case-sensitivity

The example is based upon mismatched collates but probably implementation behavior is the same with regard to charsets too.

DВL:

~~~~~~~~~~~~~~~~~
create table t_CI ( -- case-INSENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE_CI,
payload integer
);

create table t_CS ( -- case-SENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE,
payload integer
);

insert into t_CS values ( 'ABCD', -100 );

insert into t_CI values ( 'ABCD', +22 );
~~~~~~~~~~~~~~

We now have two tables with ID fields having different collates. However the ID value is the same within both tables and is within both collates.

When joining those two tables - the results should be either data from both tables or no data at all.

Situation, when upon join by ID data is fetched from only one of the two tables but not from another would be abnormal !!!

Abnormal query returning only half of the data, most probably due to propagation of `LIKE` predicate with wrong collation:
~~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%bc%'
PLAN JOIN (T_CI NATURAL, T_CS INDEX (RDB$PRIMARY3))

Adapted plan:
PLAN JOIN (T_CI NATURAL, T_CS INDEX (INTEG_21))
~~~~~~~~~~~

For comparison:

Correct queries returning data from BOTH tables:
~~~~~~~~
select * from T_CS, T_CI where T_CS.id = T_CI.id

select * from T_CS
left join T_CI on T_CS.id = T_CI.id

select * from T_CI
left join T_CS on T_CS.id = T_CI.id

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%BC%'

select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%BC%'

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode_ci like '%bc%'
~~~~~~~~~~~~~~~

Correct queries returning no data at all:
~~~~~~~
select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%bc%'

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode like '%bc%'
~~~~~~~

One more query, somewhat similar, different is engaging master-table index thus probably preventing LIKE-propagation
~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id starting with 'abc'

-- OK: data ~ data

Plan:
PLAN JOIN (T_CI INDEX (RDB$PRIMARY2), T_CS INDEX (RDB$PRIMARY3))
~~~~~~~~~~~~~~~~~~~~~

@firebird-automations
Copy link
Collaborator Author

Modified by: Arioch (arioch)

description: Initial report: https://stackoverflow.com/questions/56858452/why-did-my-where-clause-affect-my-left-join
Initial discussion: https://www.sql.ru/forum/1314381-a/fb3-join-vs-like-vs-case-sensitivity

The example is based upon mismatched collates but probably implementation behavior is the same with regard to charsets too.

DВL:

~~~~~~~~~~~~~~~~~
create table t_CI ( -- case-INSENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE_CI,
payload integer
);

create table t_CS ( -- case-SENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE,
payload integer
);

insert into t_CS values ( 'ABCD', -100 );

insert into t_CI values ( 'ABCD', +22 );
~~~~~~~~~~~~~~

We now have two tables with ID fields having different collates. However the ID value is the same within both tables and is within both collates.

When joining those two tables - the results should be either data from both tables or no data at all.

Situation, when upon join by ID data is fetched from only one of the two tables but not from another would be abnormal !!!

Abnormal query returning only half of the data, most probably due to propagation of `LIKE` predicate with wrong collation:
~~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%bc%'
PLAN JOIN (T_CI NATURAL, T_CS INDEX (RDB$PRIMARY3))

Adapted plan:
PLAN JOIN (T_CI NATURAL, T_CS INDEX (INTEG_21))
~~~~~~~~~~~

For comparison:

Correct queries returning data from BOTH tables:
~~~~~~~~
select * from T_CS, T_CI where T_CS.id = T_CI.id

select * from T_CS
left join T_CI on T_CS.id = T_CI.id

select * from T_CI
left join T_CS on T_CS.id = T_CI.id

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%BC%'

select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%BC%'

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode_ci like '%bc%'
~~~~~~~~~~~~~~~

Correct queries returning no data at all:
~~~~~~~
select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%bc%'

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode like '%bc%'
~~~~~~~

One more query, somewhat similar, different is engaging master-table index thus probably preventing LIKE-propagation
~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id starting with 'abc'

-- OK: data ~ data

Plan:
PLAN JOIN (T_CI INDEX (RDB$PRIMARY2), T_CS INDEX (RDB$PRIMARY3))
~~~~~~~~~~~~~~~~~~~~~

=>

Initial report: https://stackoverflow.com/questions/56858452/why-did-my-where-clause-affect-my-left-join
Initial discussion: https://www.sql.ru/forum/1314381-a/fb3-join-vs-like-vs-case-sensitivity

The example is based upon mismatched collates but probably implementation behavior is the same with regard to charsets too.

DDL:

~~~~~~~~~~~~~~~~~
create table t_CI ( -- case-INSENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE_CI,
payload integer
);

create table t_CS ( -- case-SENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE,
payload integer
);

insert into t_CS values ( 'ABCD', -100 );

insert into t_CI values ( 'ABCD', +22 );
~~~~~~~~~~~~~~

We now have two tables with ID fields having different collates. However the ID value is the same within both tables and is within both collates.

When joining those two tables - the results should be either data from both tables or no data at all.

Situation, when upon join by ID data is fetched from only one of the two tables but not from another would be abnormal !!!

Abnormal query returning only half of the data, most probably due to propagation of `LIKE` predicate with wrong collation:
~~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%bc%'
PLAN JOIN (T_CI NATURAL, T_CS INDEX (RDB$PRIMARY3))

Adapted plan:
PLAN JOIN (T_CI NATURAL, T_CS INDEX (INTEG_21))
~~~~~~~~~~~

For comparison:

Correct queries returning data from BOTH tables:
~~~~~~~~
select * from T_CS, T_CI where T_CS.id = T_CI.id

select * from T_CS
left join T_CI on T_CS.id = T_CI.id

select * from T_CI
left join T_CS on T_CS.id = T_CI.id

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%BC%'

select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%BC%'

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode_ci like '%bc%'
~~~~~~~~~~~~~~~

Correct queries returning no data at all:
~~~~~~~
select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%bc%'

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode like '%bc%'
~~~~~~~

One more query, somewhat similar, different is engaging master-table index thus probably preventing LIKE-propagation
~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id starting with 'abc'

-- OK: data ~ data

Plan:
PLAN JOIN (T_CI INDEX (RDB$PRIMARY2), T_CS INDEX (RDB$PRIMARY3))
~~~~~~~~~~~~~~~~~~~~~

@firebird-automations
Copy link
Collaborator Author

Modified by: Arioch (arioch)

description: Initial report: https://stackoverflow.com/questions/56858452/why-did-my-where-clause-affect-my-left-join
Initial discussion: https://www.sql.ru/forum/1314381-a/fb3-join-vs-like-vs-case-sensitivity

The example is based upon mismatched collates but probably implementation behavior is the same with regard to charsets too.

DDL:

~~~~~~~~~~~~~~~~~
create table t_CI ( -- case-INSENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE_CI,
payload integer
);

create table t_CS ( -- case-SENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE,
payload integer
);

insert into t_CS values ( 'ABCD', -100 );

insert into t_CI values ( 'ABCD', +22 );
~~~~~~~~~~~~~~

We now have two tables with ID fields having different collates. However the ID value is the same within both tables and is within both collates.

When joining those two tables - the results should be either data from both tables or no data at all.

Situation, when upon join by ID data is fetched from only one of the two tables but not from another would be abnormal !!!

Abnormal query returning only half of the data, most probably due to propagation of `LIKE` predicate with wrong collation:
~~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%bc%'
PLAN JOIN (T_CI NATURAL, T_CS INDEX (RDB$PRIMARY3))

Adapted plan:
PLAN JOIN (T_CI NATURAL, T_CS INDEX (INTEG_21))
~~~~~~~~~~~

For comparison:

Correct queries returning data from BOTH tables:
~~~~~~~~
select * from T_CS, T_CI where T_CS.id = T_CI.id

select * from T_CS
left join T_CI on T_CS.id = T_CI.id

select * from T_CI
left join T_CS on T_CS.id = T_CI.id

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%BC%'

select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%BC%'

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode_ci like '%bc%'
~~~~~~~~~~~~~~~

Correct queries returning no data at all:
~~~~~~~
select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%bc%'

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode like '%bc%'
~~~~~~~

One more query, somewhat similar, different is engaging master-table index thus probably preventing LIKE-propagation
~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id starting with 'abc'

-- OK: data ~ data

Plan:
PLAN JOIN (T_CI INDEX (RDB$PRIMARY2), T_CS INDEX (RDB$PRIMARY3))
~~~~~~~~~~~~~~~~~~~~~

=>

Initial report: https://stackoverflow.com/questions/56858452/why-did-my-where-clause-affect-my-left-join
Initial discussion: https://www.sql.ru/forum/1314381-a/fb3-join-vs-like-vs-case-sensitivity

The example is based upon mismatched collates but probably implementation behavior is the same with regard to charsets too.

DDL:

~~~~~~~~~~~~~~~~~
create table t_CI ( -- case-INSENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE_CI,
payload integer
);

create table t_CS ( -- case-SENSITIVE
id VarChar(10) character set UTF8 primary key collate UNICODE,
payload integer
);

insert into t_CS values ( 'ABCD', -100 );

insert into t_CI values ( 'ABCD', +22 );
~~~~~~~~~~~~~~

We now have two tables with ID fields having different collates. However the ID value is the same within both tables and is within both collates.

When joining those two tables - the results should be either data from both tables or no data at all.

Situation, when upon join by ID data is fetched from only one of the two tables but not from another would be abnormal !!!

Abnormal query returning only half of the data, most probably due to propagation of `LIKE` predicate with wrong collation:
~~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%bc%'
;

/*
Plan:
PLAN JOIN (T_CI NATURAL, T_CS INDEX (RDB$PRIMARY3))

Adapted plan:
PLAN JOIN (T_CI NATURAL, T_CS INDEX (INTEG_21))
*/
~~~~~~~~~~~

For comparison:

Correct queries returning data from BOTH tables:
~~~~~~~~
select * from T_CS, T_CI where T_CS.id = T_CI.id
;

select * from T_CS
left join T_CI on T_CS.id = T_CI.id
;

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
;

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id like '%BC%'
;

select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%BC%'
;

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode_ci like '%bc%'
;
~~~~~~~~~~~~~~~

Correct queries returning no data at all:
~~~~~~~
select * from T_CS
left join T_CI on T_CS.id = T_CI.id
where T_CS.id like '%bc%'
;

select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode like '%bc%'
;
~~~~~~~

One more query, somewhat similar, different in engaging master-table index thus probably preventing LIKE-propagation
~~~~~~~~~~
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id starting with 'abc' ;
;

/*
Plan:
PLAN JOIN (T_CI INDEX (RDB$PRIMARY2), T_CS INDEX (RDB$PRIMARY3))
*/
~~~~~~~~~~~~~~~~~~~~~

@firebird-automations
Copy link
Collaborator Author

Commented by: Arioch (arioch)

One more jinxed example.

---------------
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id collate unicode like _UTF8 '%bc%' --collate unicode
---------------

Like we saw above, "nailing down" the case-sensitive collation makes the query correctly return zero rows.
Now we move the collation enforcing to another side of the LIKE

---------------------
select * from T_CI
left join T_CS on T_CS.id = T_CI.id
where T_CI.id /* collate unicode */ like _UTF8 '%bc%' collate unicode
--------------------

Suddenly the query outputs the data row, from both tables.

I can not explain this sudden deviation from the test results above, at all.

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

assignee: Dmitry Yemanov [ dimitr ]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants