Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SUBSTRING with regular expression (SIMILAR TO) capability [CORE2006] #2443

Closed
firebird-automations opened this issue Jul 21, 2008 · 13 comments
Closed

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: Dany Marmur (dany)

Jira_subtask_outward CORE3147
Is duplicated by CORE3584
Is related to QA632

Votes: 2

The build-in function SUBSTRING should be extended to handle matching patterns.
The current implementation of the SIMILAR TO predicate only allows for selection and validation based on pattern matching (regular expressions).
String manipulation based on pattern matching is impossible without extending the SUBSTRING function.

writing

SUBSTRING ( <value> FROM <similar pattern> FOR <escape character>)

the function should return NULL for no match or the matching substring for a match.

Further control is achieved with an additional character combination specified in <similar pattern> to bracket the portion of the match that should be returned.

Commits: 5aac863 4f9b916 FirebirdSQL/fbt-repository@57ef2ba

====== Test Details ======

Some expressions from test relates to: CORE2389, CORE2756, CORE2780, CORE3523, CORE3754, CORE769
See also: CORE4740

@firebird-automations
Copy link
Collaborator Author

Commented by: @samofatov

BroadView uses the following UDFs for RegExp functionality, and they are proven very convenient.

-- Takes String, Pattern, Options
DECLARE EXTERNAL FUNCTION F_RegExp_Like
CSTRING(2048),
CSTRING(254),
CSTRING(10)
RETURNS INTEGER BY VALUE
ENTRY_POINT 'F_RegExp_Like' MODULE_NAME 'aaUDFLib';

-- Takes String, Pattern, Match_Number, Occurence, Options
DECLARE EXTERNAL FUNCTION F_RegExp_Substr
CSTRING(2048),
CSTRING(254),
INTEGER,
INTEGER,
CSTRING(10)
RETURNS CSTRING(2048)
ENTRY_POINT 'F_RegExp_Substr' MODULE_NAME 'aaUDFLib';

These UDFs mimic Oracle's REGEXP_LIKE, REGEXP_SUBSTR and REGEXP_REPLACE functions.
If there is significant interest we can probably ask Sean nicely, and he may authorize the Open Source release of (Free Pascal) code for these functions, but they are just simple wrappers around regexp library.

In general it is probably better to use somebody else' syntax rather than invent new one.
BTW, SQL 2008 defines SUBSTRING_REGEX function.

@firebird-automations
Copy link
Collaborator Author

Commented by: Dany Marmur (dany)

I was told that SUBSTRING( <value> FROM <similar pattern> FOR <escape character>) is "SQL standard functionality".

When considering UDFs - CSTRING is limited in size and with multi character sets it can be very limiting. LIKE and CONTAINING can even be used with BLOBS. This is one reason why build-in functions are preferrable. UDF needs to be passed the "locale" to. The built-in functions and predicates should handle multi-byte character sets properly from the start. A wrapper for regexp library could behave differently on different platforms. UDFs must be considered from an cross-platform perspective too. In PostgreSQL the built-in function behaves as SIMILAR TO while the wrapper for regexp library functions behaves differently (this is documented but it can be considered un-optimal from a standard compliance pow).

Any other SQL standard like SUBSTRING_REGEX is of course fine too.

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

assignee: Adriano dos Santos Fernandes [ asfernandes ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @asfernandes

Implemented SUBSTRING(<value> SIMILAR <pattern> ESCAPE <escape>) as per the SQL standard.

@firebird-automations
Copy link
Collaborator Author

Modified by: @asfernandes

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 3.0 Alpha 1 [ 10331 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

Unfortunately, current implementation doesn't conforms to SQL standard.

=============== SQL standard ===============

6.29 <string value function>

General Rules

6) If <regular expression substring function> is specified, then:

a\) Let C be the result of the first <character value expression\>, let R be the result of the second 
    <character value expression\>, and let E be the result of the <escape character\>\.

b\) If one or more of C, R, or E is the null value, then the result of the <regular expression substring 
    function\> is the null value\.

c\) If the length in characters of E is not equal to 1 \(one\), then an exception condition is raised: data
    exception \- invalid escape character\.

d\) If R does not contain exactly two occurrences of the two\-character sequence consisting of E, 
    each immediately followed by <double quote\>, then an exception condition is raised: data 
    exception \- invalid use of escape character\.

e\) Let R1, R2, and R3 be the substrings of R, such that
        'R' = 'R1' \|\| 'E' \|\| '"' \|\| 'R2' \|\| 'E' \|\| '"' \|\| 'R3'
    is True\.

f\) If any one of R1, R2, or R3 is not a zero\-length string and does not have the format of a <regular
    expression\>, then an exception condition is raised: data exception \- invalid regular expression\.

g\) If the predicate
    'C' SIMILAR TO 'R1' \|\| 'R2' \|\| 'R3' ESCAPE 'E'
    is not True, then the result of the <regular expression substring function\> is the null value\.

h\) Otherwise, the result S of the <regular expression substring function\> is computed as follows:
    i\) Let S1 be the shortest initial substring of C such that there is a substring S23 of C such that 
        the value of the following <search condition\> is True:
            'C' = 'S1' \|\| 'S23' AND
            'S1' SIMILAR TO 'R1' ESCAPE 'E' AND
            'S23' SIMILAR TO '\(R2R3\)' ESCAPE 'E'

    ii\) Let S3 be the shortest final substring of S23 such that there is a substring S2 of S23 such 
        that the value of the following <search condition\> is True:
            'S23' = 'S2' \|\| 'S3' AND
            'S2' SIMILAR TO 'R2' ESCAPE 'E' AND
            'S3' SIMILAR TO 'R3' ESCAPE 'E'

    iii\) The result of the <regular expression substring function\> is S2\.

=============== SQL standard ===============

SELECT SUBSTRING('asds 12.34 asd' SIMILAR '% \"[\+\-]?[0-9]*([0-9].|.[0-9])?[0-9]*\" %' ESCAPE '\')
FROM RDB$DATABASE

returns '.34'. It is abviously not correct as in this case we have

R1 = '% '
R2 = '[\+\-]?[0-9]*([0-9].|.[0-9])?[0-9]*'
R3 = ' %'

E = '\'

and

S1 = 'asds 12.'
S2 = '.34'
S3 = ' asd'

and

S1 || S2 || S3 similar to R1 || R2 || R3 escape E == true
S1 similar to R1 escape E == false !

Correct answer should be

S1 = 'asds '
S2 = '12.34'
S3 = ' asd'

I think this ticket should be reopened (no sence to create new ticket with bug description as current implementation was never released).

@firebird-automations
Copy link
Collaborator Author

Modified by: @asfernandes

Link: This issue is duplicated by CORE3584 [ CORE3584 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @pavel-zotov

Release notes specify ability to apply 'NOT' in this statement:

=== page 72 ===
A substring search can now use a regular expression.
Search Pattern
SUBSTRING(<string> [NOT] SIMILAR TO <pattern> ESCAPE <char>)

But:

SQL> select substring('abcqwe' similar '%\"q\"%' escape '\') from rdb$database;

SUBSTRING

q

SQL> select substring('abcqwe'
CON> NOT --------------------------------------------------------- <<< CAN be here ? <<<<
CON> similar '%\"q\"%' escape '\') from rdb$database;

Output:

Statement failed, SQLSTATE = 42000
Dynamic SQL Error
-SQL error code = -104
-Token unknown - line 1, column 39
-'%\"q\"%'

So, how to apply 'NOT' in this case of SUBSTRING function ?

@firebird-automations
Copy link
Collaborator Author

Commented by: @asfernandes

Pavel, release notes is wrong. That's not present in README.substring_similar.txt;

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Resolved [ 5 ]

QA Status: Done successfully

Test Details: Some expressions from test relates to: CORE2389, CORE2756, CORE2780, CORE3523, CORE3754, CORE769

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Resolved [ 5 ]

Test Details: Some expressions from test relates to: CORE2389, CORE2756, CORE2780, CORE3523, CORE3754, CORE769 => Some expressions from test relates to: CORE2389, CORE2756, CORE2780, CORE3523, CORE3754, CORE769
See also: core-4740

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Resolved [ 5 ]

Test Details: Some expressions from test relates to: CORE2389, CORE2756, CORE2780, CORE3523, CORE3754, CORE769
See also: core-4740

=>

Some expressions from test relates to: CORE2389, CORE2756, CORE2780, CORE3523, CORE3754, CORE769
See also: CORE4740

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

Link: This issue is related to QA632 [ QA632 ]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants