Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DISTINCT and UNION no longer work for BLOB fields [CORE1345] #1764

Open
firebird-automations opened this issue Jul 5, 2007 · 16 comments
Open

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: Rene Pijnacker (rpijnacker)

Votes: 1

I tried to run our application through BDE using firebird 2.1 beta
on opening of some screen I get error message No support for blob sorting.
This works on all other versions of firebird

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

This is an intentional change. Blobs cannot be sorted by, including the GROUP BY and DISTINCT operations. Prior versions allowed that but produced unexpected results as in fact BLOB IDs were sorted.

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Won't Fix [ 2 ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

status: Resolved [ 5 ] => Closed [ 6 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: Bruno Depero (bdepero)

I tried to restore few databases from 2.0.2 to 2.1 but gbak finished before completing due to blob field that with 2.1.x cannot be sorted.
When 2.1.x will be released I wish to migrate from 2.0.x to 2.1.
7 databases, very large ddl: there is a fast way to find what items do a sort on a blob field (view, stored, triggers, ...) ?

Thank you in advance !
Bruno

@firebird-automations
Copy link
Collaborator Author

Commented by: @samofatov

We have encountered the issue with with pretty much all commercial Firebird applications that we tried to port to 2.1, and the needed fixes for applications are sometimes non-trivial or even impossible, and very difficult to explain to the customer. The issue had to be ranked as critical and the release blocker for us internally. CORE1250 is another report on the same issue.

The access frameworks routinely generate queries like "select distinct * from <sometable>" and

"select ...many fields... from <sometable>
union
select ...many fields... from <sometable>"

The previous engine used to treat two non-null BLOBs as DISTINCT if they came from different tables or the different fields of the same table, which was correct behavior (meaning it was consistent and acceptable).

Changing DISTINCT and UNION behavior with BLOBs breaks a big percentage of applications for no logical reason.

We may forbid using BLOBs in ORDER BY clause, because it produces incorrect results, but breaking DISTINCT and UNION cannot be justified.

@firebird-automations
Copy link
Collaborator Author

Modified by: @samofatov

status: Closed [ 6 ] => Reopened [ 4 ]

resolution: Won't Fix [ 2 ] =>

@firebird-automations
Copy link
Collaborator Author

Modified by: @samofatov

priority: Minor [ 4 ] => Major [ 3 ]

summary: No support for BLOB sorting => DISTINCT and UNION no longer work for BLOB fields

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

create table blb1 (id int, col blob);
create table blb2 (id int, col blob);
commit;

insert into blb1 values (1, 'asd');
insert into blb1 values (1, 'asd');
insert into blb2 values (1, 'asd');
commit;

select distinct * from blb1;
-- 2 rows returned

select * from blb1
union
select * from blb2
-- 3 rows returned

Do you call it a correct result?

@firebird-automations
Copy link
Collaborator Author

Commented by: @samofatov

And the result is:

>isql -i a_test.sql

      ID               COL

============ =================

       1              80:0
       1              80:2
       1              81:0

This is not too bad result, IMO. Much better than returning a error and breaking existing programs.

Not many programs rely on ORDER BY for blob fields, but using DISTINCT and UNION with blobs is the standard practice, you like it or not.

Consider this example:

create database 'aa.fdb' user 'sysdba' password 'masterkey';

create table products (
id integer not null primary key,
dsc blob
);

create table orders (
id integer not null primary key,
product_id integer references products,
qty integer
);

insert into products values(1, 'Red Database');
insert into orders values(1, 1, 1);

select distinct http://products.id, products.dsc
from products, orders
where http://products.id = orders.product_id;

How can you justify to the user that this query which has produced expected and consistent results before has to be fixed?

Old style and analytical queries very commonly join a whole bunch of tables (leveling many-many relationships), filter it in a WHERE clause and than use DISTINCT to figure out the rows that they needed. The change breaks this perfectly legal approach.

@firebird-automations
Copy link
Collaborator Author

Commented by: @asfernandes

As we discussed in fb-devel, others DBMS doesn't allow this query either.

But, we can make it work based on primary key/unique, i.e., if any primary key/unique of the same table of a blob is present (as in your example), this record will be always distinct.

@firebird-automations
Copy link
Collaborator Author

Commented by: @AlexPeshkoff

Nickolay, your sample with:
>isql -i a_test.sql
shows that we have badly designed client tool (our isql), not that such queries are correct.

I agree with Adriano's suggestion - let's use PK (or any other unique constraint if present), this will let most of old queries run. But reanimating old _bugs_ (like one mentioned in your sample) for backward-compatibility reasons is not a good way to go.

@firebird-automations
Copy link
Collaborator Author

Commented by: @samofatov

Guys, are we making the best engine in the world or a patchy mess?

1. Absence/presence/active status of constraints or indexes should not change SELECT statement validity or results (remember DDL dependencies and other stuff?)

2. Compatibility with existing installed base is hugely important, and if we are going to inflict migration costs, that has to be justified with something other than calling old predictable and consistent behavior "bugs". Either change the functionality properly or don't touch it. Breaking applications for the sake of breakage is not the way forward.

3. If we want to attack the issue seriously so it compares BLOB values in generic scenario correctly we have to implement SORT callbacks or something similar to compare BLOB values and a little bunch of optimizations to avoid performance regressions with old queries.

4. Regarding the other servers not supporting the feature, please come on, don't be silly:

mysql> create table a (b longtext);
Query OK, 0 rows affected (0.30 sec)

mysql> insert into a values('test');
Query OK, 1 row affected (0.44 sec)

mysql> insert into a values('test');
Query OK, 1 row affected (0.44 sec)

mysql> select distinct * from a;
+------+
| b |
+------+
| test |
+------+
1 row in set (0.20 sec)

========

postgres=#⁠ create table aa (b text);
CREATE TABLE
postgres=#⁠ insert into aa values('test');
INSERT 0 1
postgres=#⁠ insert into aa values('test');
INSERT 0 1
postgres=#⁠ select distinct * from aa;
b
------
test
(1 row)

@firebird-automations
Copy link
Collaborator Author

Commented by: @AlexPeshkoff

If we can provide postgres-like behavior, I see no problems. But certainly this will affect performance - at least by reading of blob pages from disk. BTW, presence of active unique constraint my become a serious optimization here.

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

Workflow: jira [ 12510 ] => Firebird [ 14867 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: Philip Williams (unordained)

Blobs aren't always stored on separate pages, if room exists to fit them on the main record page, right? I, for one, often use blobs when I'm not sure a large varchar will handle outlier cases, despite knowing that 99% of the time, the value will be very small and even a small varchar would be sufficient. Memo fields, for example, need to be large for rare cases, but most of the time users won't take the time to even write complete sentences, let alone 4kb of text. So there *may* be a performance cost in looking at the blob value, but you shouldn't assume it'll be catastrophic.

And for those few users who do want to run a distinct on blobs because they're trying to eliminate duplicate photos in their collections (!), I expect they know who they are and will be okay with exchanging speed for features.

@firebird-automations
Copy link
Collaborator Author

Commented by: @livius2

If I were you i will implement it as:
1. Compare all normal fields
2. If result is still true compare blobs.

This will in most of cases do not tought blob data at all even if no unique constraint is specified in the query.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant