Issue Details (XML | Word | Printable)

Key: CORE-2006
Type: Improvement Improvement
Status: Resolved Resolved
Resolution: Fixed
Priority: Major Major
Assignee: Adriano dos Santos Fernandes
Reporter: Dany Marmur
Votes: 2
Watchers: 6

If you were logged in you would be able to see more operations.
Firebird Core

SUBSTRING with regular expression (SIMILAR TO) capability

Created: 21/Jul/08 02:03 AM   Updated: 28/Sep/15 01:12 PM
Component/s: Engine
Affects Version/s: 2.5 Beta 1
Fix Version/s: 3.0 Alpha 1

Issue Links:

QA Status: Done successfully
Test Details:
Some expressions from test relates to: CORE-2389, CORE-2756, CORE-2780, CORE-3523, CORE-3754, CORE-769
See also: CORE-4740

Sub-Tasks  All   Open   

 Description  « Hide
The build-in function SUBSTRING should be extended to handle matching patterns.
The current implementation of the SIMILAR TO predicate only allows for selection and validation based on pattern matching (regular expressions).
String manipulation based on pattern matching is impossible without extending the SUBSTRING function.


SUBSTRING ( <value> FROM <similar pattern> FOR <escape character>)

the function should return NULL for no match or the matching substring for a match.

Further control is achieved with an additional character combination specified in <similar pattern> to bracket the portion of the match that should be returned.

 All   Comments   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Nickolay Samofatov added a comment - 21/Jul/08 02:34 PM
BroadView uses the following UDFs for RegExp functionality, and they are proven very convenient.

-- Takes String, Pattern, Options

-- Takes String, Pattern, Match_Number, Occurence, Options
    ENTRY_POINT 'F_RegExp_Substr' MODULE_NAME 'aaUDFLib';

These UDFs mimic Oracle's REGEXP_LIKE, REGEXP_SUBSTR and REGEXP_REPLACE functions.
If there is significant interest we can probably ask Sean nicely, and he may authorize the Open Source release of (Free Pascal) code for these functions, but they are just simple wrappers around regexp library.

In general it is probably better to use somebody else' syntax rather than invent new one.
BTW, SQL 2008 defines SUBSTRING_REGEX function.

Dany Marmur added a comment - 21/Jul/08 06:33 PM
I was told that SUBSTRING( <value> FROM <similar pattern> FOR <escape character>) is "SQL standard functionality".

When considering UDFs - CSTRING is limited in size and with multi character sets it can be very limiting. LIKE and CONTAINING can even be used with BLOBS. This is one reason why build-in functions are preferrable. UDF needs to be passed the "locale" to. The built-in functions and predicates should handle multi-byte character sets properly from the start. A wrapper for regexp library could behave differently on different platforms. UDFs must be considered from an cross-platform perspective too. In PostgreSQL the built-in function behaves as SIMILAR TO while the wrapper for regexp library functions behaves differently (this is documented but it can be considered un-optimal from a standard compliance pow).

Any other SQL standard like SUBSTRING_REGEX is of course fine too.

Adriano dos Santos Fernandes added a comment - 21/Feb/10 01:48 AM
Implemented SUBSTRING(<value> SIMILAR <pattern> ESCAPE <escape>) as per the SQL standard.

Vlad Khorsun added a comment - 26/Sep/10 09:16 PM
Unfortunately, current implementation doesn't conforms to SQL standard.

=============== SQL standard ===============

6.29 <string value function>

General Rules

6) If <regular expression substring function> is specified, then:

    a) Let C be the result of the first <character value expression>, let R be the result of the second
        <character value expression>, and let E be the result of the <escape character>.

    b) If one or more of C, R, or E is the null value, then the result of the <regular expression substring
        function> is the null value.

    c) If the length in characters of E is not equal to 1 (one), then an exception condition is raised: data
        exception - invalid escape character.

    d) If R does not contain exactly two occurrences of the two-character sequence consisting of E,
        each immediately followed by <double quote>, then an exception condition is raised: data
        exception - invalid use of escape character.

    e) Let R1, R2, and R3 be the substrings of R, such that
            'R' = 'R1' || 'E' || '"' || 'R2' || 'E' || '"' || 'R3'
        is True.

    f) If any one of R1, R2, or R3 is not a zero-length string and does not have the format of a <regular
        expression>, then an exception condition is raised: data exception - invalid regular expression.

    g) If the predicate
        'C' SIMILAR TO 'R1' || 'R2' || 'R3' ESCAPE 'E'
        is not True, then the result of the <regular expression substring function> is the null value.

    h) Otherwise, the result S of the <regular expression substring function> is computed as follows:
        i) Let S1 be the shortest initial substring of C such that there is a substring S23 of C such that
            the value of the following <search condition> is True:
                'C' = 'S1' || 'S23' AND
                'S1' SIMILAR TO 'R1' ESCAPE 'E' AND
                'S23' SIMILAR TO '(R2R3)' ESCAPE 'E'

        ii) Let S3 be the shortest final substring of S23 such that there is a substring S2 of S23 such
            that the value of the following <search condition> is True:
                'S23' = 'S2' || 'S3' AND
                'S2' SIMILAR TO 'R2' ESCAPE 'E' AND
                'S3' SIMILAR TO 'R3' ESCAPE 'E'

        iii) The result of the <regular expression substring function> is S2.

=============== SQL standard ===============

SELECT SUBSTRING('asds 12.34 asd' SIMILAR '% \"[\+\-]?[0-9]*([0-9].|.[0-9])?[0-9]*\" %' ESCAPE '\')

returns '.34'. It is abviously not correct as in this case we have

R1 = '% '
R2 = '[\+\-]?[0-9]*([0-9].|.[0-9])?[0-9]*'
R3 = ' %'

E = '\'


S1 = 'asds 12.'
S2 = '.34'
S3 = ' asd'


S1 || S2 || S3 similar to R1 || R2 || R3 escape E == true
S1 similar to R1 escape E == false !

Correct answer should be

S1 = 'asds '
S2 = '12.34'
S3 = ' asd'

I think this ticket should be reopened (no sence to create new ticket with bug description as current implementation was never released).

Pavel Zotov added a comment - 13/Apr/15 11:52 PM
Release notes specify ability to apply 'NOT' in this statement:

=== page 72 ===
A substring search can now use a regular expression.
Search Pattern
SUBSTRING(<string> [NOT] SIMILAR TO <pattern> ESCAPE <char>)


SQL> select substring('abcqwe' similar '%\"q\"%' escape '\') from rdb$database;


SQL> select substring('abcqwe'
CON> NOT --------------------------------------------------------- <<< CAN be here ? <<<<
CON> similar '%\"q\"%' escape '\') from rdb$database;


Statement failed, SQLSTATE = 42000
Dynamic SQL Error
-SQL error code = -104
-Token unknown - line 1, column 39

So, how to apply 'NOT' in this case of SUBSTRING function ?

Adriano dos Santos Fernandes added a comment - 14/Apr/15 12:40 AM
Pavel, release notes is wrong. That's not present in README.substring_similar.txt;