Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"SuperIndex" for string serching [CORE4687] #4995

Open
firebird-automations opened this issue Feb 10, 2015 · 9 comments
Open

"SuperIndex" for string serching [CORE4687] #4995

firebird-automations opened this issue Feb 10, 2015 · 9 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: @livius2

Votes: 1

This is big task but can optimize search operation on big tables and answer quality will be 100% of searched records for query "like '%something%'"

Context:

Last year some customer with very big Firebird database near 1 bilion of records in biggest table ask for optimising search on field VarChar(127)

They used full text search engine and it work ok as fulltext but, as you know, how it work then answer quality is not 100%.

Benefit was that it answer in few seconds instead of many seconds if we do search like UPPER(SMS_BODY COLLATE PXW_PLK) FROM 1) LIKE 'SOMETHING%' full table scan

FullText indexing words and you can ask like "starting with" but not "containing" It find record with text like "case studies" if you search for "studi" but not find if you search for "ies" because in this case there was only one field and short VarChar(127),

Then i decided to rethink the problem and I got idea to create a series of expression indexes like:

CREATE ASCENDING INDEX IXAE_SMS__BODY1 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 1));
CREATE ASCENDING INDEX IXAE_SMS__BODY2 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 2));
CREATE ASCENDING INDEX IXAE_SMS__BODY3 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 3));
CREATE ASCENDING INDEX IXAE_SMS__BODY4 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 4));
CREATE ASCENDING INDEX IXAE_SMS__BODY5 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 5));
... more

and when we search then we do

SELECT
...
from
...
WHERE
(
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 1) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 2) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 3) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 4) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 5) LIKE 'SOMETHING%' OR
...
)

This query ran in few milliseconds with 100% of records :)

Of course it is storage expensive and any change to table is also expensive because of many indexes. But benefit is much more bigger for this customer and he was very surprised when he saw answer in few ms - and got 100% quality answer.

Will be good to see build in Firebird creation of something what I call "SuperIndex"

This can be created poorly internally with many virtual indexes like above - and visible outside as single index but mayby someone have concept how to store index info optimal without redundant information (i suppose that this can be patented then :)

and then we can:

CREATE SuperIndex IXAS_SMS__BODY ON SMS(BODY);

and then we can do simple search:

SELECT
...
from
...
WHERE
SMS_BODY LIKE '%SOMETHING%'

and then engine do for us work like above

I do not know whether sufficient human resources are in FB team and of course if the same concept will be of interest

@firebird-automations
Copy link
Collaborator Author

Commented by: @sim1984

It's too wasteful. Such an index would be very slow INSERT / UPDATE / DELETE. This wish can be reduced to one of two points:
1. Only index scan
2. Full-text search

@firebird-automations
Copy link
Collaborator Author

Commented by: @livius2

>>>It's too wasteful. Such an index would be very slow INSERT / UPDATE / DELETE.

yes i know i write about this "change to table is also expensive because of many indexes" but as any index people should know when use it and when not in provided case it bring new life to query on this big database

>>This wish can be reduced to one of two points:
>>1. Only index scan

but how if not in this direction like above? If you have concept i'am very interested :)

>> 2. Full-text search
This was used at start and answer quality is was not 100%

@firebird-automations
Copy link
Collaborator Author

Commented by: @romansimakov

I see the following:
1) The problem exists,
2) One possible solution. Certainly if start to think we may find out more efficient solution.

Does anybody know how such problems solved in other DBMS? Oracle?

@firebird-automations
Copy link
Collaborator Author

Commented by: Omacht András (aomacht)

Not a DBMS, but a working solution from Apache.
http://lucene.apache.org/core/

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Edited "Description" for brevity and readability.

@firebird-automations
Copy link
Collaborator Author

Modified by: Sean Leyne (seanleyne)

description: This is big task but can optimize search operation on big tables and
answer quality will be 100% of searched records for query "like '%something%'"

case studies:

last year some customer with very big Firebird database near 1 bilion of records in biggest table ask for optimising search on field VarChar(127)
They used full text search engine and it work ok as fulltext but as you know how it work then answer quality is not 100%.
Benefit was that it answer in few seconds instead of many seconds if we do search like
UPPER(SMS_BODY COLLATE PXW_PLK) FROM 1) LIKE 'SOMETHING%'
full table scan

FullText indexing words and you can ask like "starting with" but not "containing"
It find record with text like "case studies" if you search for "studi"
but not find if you search for "ies"

because in this case there was only one field and short VarChar(127)
then i decided to rethink all and i got concept

create expression indexes like

CREATE ASCENDING INDEX IXAE_SMS__BODY1 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 1));
CREATE ASCENDING INDEX IXAE_SMS__BODY2 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 2));
CREATE ASCENDING INDEX IXAE_SMS__BODY3 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 3));
CREATE ASCENDING INDEX IXAE_SMS__BODY4 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 4));
CREATE ASCENDING INDEX IXAE_SMS__BODY5 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 5));
CREATE ASCENDING INDEX IXAE_SMS__BODY6 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 6));
CREATE ASCENDING INDEX IXAE_SMS__BODY7 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 7));
CREATE ASCENDING INDEX IXAE_SMS__BODY8 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 8));
CREATE ASCENDING INDEX IXAE_SMS__BODY9 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 9));
... more

and when we search then we do

SELECT
...
from
...
WHERE
(
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 1) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 2) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 3) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 4) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 5) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 6) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 7) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 8) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 9) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 10) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 11) LIKE 'SOMETHING%' OR
..
)

and this query answer in few milliseconds with 100% of records :)
Of course it is storage expensive and any change to table is also expensive because of many indexes
But benefit is much more bigger for this customer and he was very surprised when he saw answer in few ms - and got 100% quality answer.

Will be good to see build in Firebird creation of something what i call SuperIndex
This can be created poorly internally with many virtual indexes like above - and visible outside as single index
but mayby someone have concept how to store index info optimal without redundant information (i suppose that this can be patented then :)

and then we can
CREATE SuperIndex IXAS_SMS__BODY ON SMS(BODY);

and then we can do simple search:

SELECT
...
from
...
WHERE
SMS_BODY LIKE '%SOMETHING%'

and then engine do for us work like above

I do not know whether sufficient human resources are in FB team
and of course if the same concept will be of interest

=>

This is big task but can optimize search operation on big tables and answer quality will be 100% of searched records for query "like '%something%'"

Context:

Last year some customer with very big Firebird database near 1 bilion of records in biggest table ask for optimising search on field VarChar(127)

They used full text search engine and it work ok as fulltext but, as you know, how it work then answer quality is not 100%.

Benefit was that it answer in few seconds instead of many seconds if we do search like UPPER(SMS_BODY COLLATE PXW_PLK) FROM 1) LIKE 'SOMETHING%' full table scan

FullText indexing words and you can ask like "starting with" but not "containing" It find record with text like "case studies" if you search for "studi" but not find if you search for "ies" because in this case there was only one field and short VarChar(127),

Then i decided to rethink the problem and I got idea to create a series of expression indexes like:

CREATE ASCENDING INDEX IXAE_SMS__BODY1 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 1));
CREATE ASCENDING INDEX IXAE_SMS__BODY2 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 2));
CREATE ASCENDING INDEX IXAE_SMS__BODY3 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 3));
CREATE ASCENDING INDEX IXAE_SMS__BODY4 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 4));
CREATE ASCENDING INDEX IXAE_SMS__BODY5 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 5));
... more

and when we search then we do

SELECT
...
from
...
WHERE
(
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 1) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 2) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 3) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 4) LIKE 'SOMETHING%' OR
SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 5) LIKE 'SOMETHING%' OR
...
)

This query ran in few milliseconds with 100% of records :)

Of course it is storage expensive and any change to table is also expensive because of many indexes. But benefit is much more bigger for this customer and he was very surprised when he saw answer in few ms - and got 100% quality answer.

Will be good to see build in Firebird creation of something what I call "SuperIndex"

This can be created poorly internally with many virtual indexes like above - and visible outside as single index but mayby someone have concept how to store index info optimal without redundant information (i suppose that this can be patented then :)

and then we can:

CREATE SuperIndex IXAS_SMS__BODY ON SMS(BODY);

and then we can do simple search:

SELECT
...
from
...
WHERE
SMS_BODY LIKE '%SOMETHING%'

and then engine do for us work like above

I do not know whether sufficient human resources are in FB team and of course if the same concept will be of interest

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Karol,

It seems that you are describing a "word" index, where each word would be indexed and the LIKE and CONTAINING operator would use the word index. Am I correct?

@firebird-automations
Copy link
Collaborator Author

Commented by: @livius2

not exactly
e.g. record with word FIREBIRD in record with text

"the best FIREBIRD"

with only word index it can find record in case like

"FIR%
"FIRE%"
....

but not find any record if you looking for

"%BIRD"
"%BIRD%"

It should find also ending text in word also inner text in word

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Karol,

The issue of finding inner and ending text is the result of the current "index prefix compression" logic used by the engine.

This results in indexed values not stored as full values/strings so that the index values cannot be "walked"/scanned to determine which values meet the criteria.

The index storage structure is a by-product of a long ago decision which focused on minimizing storage footprint at the "cost" of limited search functionality. Arguably with current abundance of storage space, it might be appropriate to add option to CREATE INDEX syntax to disable the compression and enable index use for all CONTAINING, LIKE and ENDING operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant