Issue Details (XML | Word | Printable)

Key: CORE-4687
Type: New Feature New Feature
Status: Open Open
Priority: Major Major
Assignee: Unassigned
Reporter: Karol Bieniaszewski
Votes: 1
Watchers: 7
Operations

If you were logged in you would be able to see more operations.
Firebird Core

"SuperIndex" for string serching

Created: 10/Feb/15 10:23 AM   Updated: 28/Feb/15 06:17 PM
Component/s: Engine
Affects Version/s: None
Fix Version/s: None


 Description  « Hide
This is big task but can optimize search operation on big tables and answer quality will be 100% of searched records for query "like '%something%'"

Context:

  Last year some customer with very big Firebird database near 1 bilion of records in biggest table ask for optimising search on field VarChar(127)

  They used full text search engine and it work ok as fulltext but, as you know, how it work then answer quality is not 100%.

Benefit was that it answer in few seconds instead of many seconds if we do search like UPPER(SMS_BODY COLLATE PXW_PLK) FROM 1) LIKE 'SOMETHING%' full table scan

FullText indexing words and you can ask like "starting with" but not "containing" It find record with text like "case studies" if you search for "studi" but not find if you search for "ies" because in this case there was only one field and short VarChar(127),

Then i decided to rethink the problem and I got idea to create a series of expression indexes like:

  CREATE ASCENDING INDEX IXAE_SMS__BODY1 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 1));
  CREATE ASCENDING INDEX IXAE_SMS__BODY2 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 2));
  CREATE ASCENDING INDEX IXAE_SMS__BODY3 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 3));
  CREATE ASCENDING INDEX IXAE_SMS__BODY4 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 4));
  CREATE ASCENDING INDEX IXAE_SMS__BODY5 ON SMS COMPUTED BY(SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 5));
  ... more

and when we search then we do

  SELECT
    ...
  from
    ...
  WHERE
   (
     SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 1) LIKE 'SOMETHING%' OR
     SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 2) LIKE 'SOMETHING%' OR
     SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 3) LIKE 'SOMETHING%' OR
     SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 4) LIKE 'SOMETHING%' OR
     SUBSTRING(UPPER(SMS_BODY COLLATE PXW_PLK) FROM 5) LIKE 'SOMETHING%' OR
     ...
   )
 
This query ran in few milliseconds with 100% of records :)

Of course it is storage expensive and any change to table is also expensive because of many indexes. But benefit is much more bigger for this customer and he was very surprised when he saw answer in few ms - and got 100% quality answer.

Will be good to see build in Firebird creation of something what I call "SuperIndex"

This can be created poorly internally with many virtual indexes like above - and visible outside as single index but mayby someone have concept how to store index info optimal without redundant information (i suppose that this can be patented then :)

and then we can:

  CREATE SuperIndex IXAS_SMS__BODY ON SMS(BODY);

and then we can do simple search:

  SELECT
     ...
  from
    ...
  WHERE
    SMS_BODY LIKE '%SOMETHING%'

and then engine do for us work like above

I do not know whether sufficient human resources are in FB team and of course if the same concept will be of interest

 All   Comments   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Simonov Denis added a comment - 10/Feb/15 10:50 AM
It's too wasteful. Such an index would be very slow INSERT / UPDATE / DELETE. This wish can be reduced to one of two points:
1. Only index scan
2. Full-text search

Karol Bieniaszewski added a comment - 10/Feb/15 11:13 AM - edited
>>>It's too wasteful. Such an index would be very slow INSERT / UPDATE / DELETE.

yes i know i write about this "change to table is also expensive because of many indexes" but as any index people should know when use it and when not in provided case it bring new life to query on this big database

>>This wish can be reduced to one of two points:
>>1. Only index scan

but how if not in this direction like above? If you have concept i'am very interested :)

>> 2. Full-text search
This was used at start and answer quality is was not 100%

Roman Simakov added a comment - 10/Feb/15 12:58 PM
I see the following:
1) The problem exists,
2) One possible solution. Certainly if start to think we may find out more efficient solution.

Does anybody know how such problems solved in other DBMS? Oracle?

Omacht András added a comment - 10/Feb/15 01:53 PM
Not a DBMS, but a working solution from Apache.
http://lucene.apache.org/core/

Sean Leyne added a comment - 10/Feb/15 05:02 PM
Edited "Description" for brevity and readability.

Sean Leyne added a comment - 10/Feb/15 05:04 PM - edited
Karol,

It seems that you are describing a "word" index, where each word would be indexed and the LIKE and CONTAINING operator would use the word index. Am I correct?

Karol Bieniaszewski added a comment - 10/Feb/15 05:37 PM
not exactly
e.g. record with word FIREBIRD in record with text

"the best FIREBIRD"

with only word index it can find record in case like

"FIR%
"FIRE%"
....

but not find any record if you looking for

"%BIRD"
"%BIRD%"

It should find also ending text in word also inner text in word

Sean Leyne added a comment - 28/Feb/15 06:17 PM
Karol,

The issue of finding inner and ending text is the result of the current "index prefix compression" logic used by the engine.

This results in indexed values not stored as full values/strings so that the index values cannot be "walked"/scanned to determine which values meet the criteria.

The index storage structure is a by-product of a long ago decision which focused on minimizing storage footprint at the "cost" of limited search functionality. Arguably with current abundance of storage space, it might be appropriate to add option to CREATE INDEX syntax to disable the compression and enable index use for all CONTAINING, LIKE and ENDING operations.