Issue Details (XML | Word | Printable)

Key: CORE-4436
Type: Improvement Improvement
Status: Resolved Resolved
Resolution: Fixed
Priority: Major Major
Assignee: Adriano dos Santos Fernandes
Reporter: Joaquim Pais
Votes: 3
Watchers: 6
Operations

If you were logged in you would be able to see more operations.
Firebird Core

Support for different hash algorithms in HASH system function

Created: 26/May/14 10:51 AM   Updated: 07/Aug/17 04:26 PM
Component/s: Engine
Affects Version/s: 2.1.5 Update 1
Fix Version/s: 4.0 Alpha 1

Environment: Windows Xp, 7, 2008 Server

QA Status: Done with caveats
Test Details:
Test verifies only:
1) ability to use syntax: hash(<string> using <algo>)
2) non-equality of hash results for sha1, sha256 and sha512 using _TRIVIAL_ sample from ticket.


 Description  « Hide

Hi,

Different string's give the same hash result ist's a bug?

Witch max length to parameter hash function?

Example

select hash('4-20100433-01775-LOTES')
from rdb$database
union
select hash('1-20100433-01765-LOTES')
from rdb$database


 All   Comments   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Dimitry Sibiryakov added a comment - 26/May/14 10:56 AM
Any hash function has collisions. It is unavoidable when dimension of values is shrinked.

Paul Reeves added a comment - 26/May/14 11:08 AM
Interesting. This seems to produce the correct result:

select 1, hash('1-20100433-01765-LOTES')
from rdb$database
union
select 2, hash('2-20100433-01765-LOTES')
from rdb$database
union
select 3, hash('3-20100433-01765-LOTES')
from rdb$database
union
select 4, hash('4-20100433-01765-LOTES')
from rdb$database
union
select 5, hash('5-20100433-01775-LOTES')
from rdb$database;


but this doesn't:

select 1,hash( '1-20100433-01765-LOTES')
from rdb$database
union
select 4, hash('4-20100433-01775-LOTES')
from rdb$database;

Omacht András added a comment - 26/May/14 11:20 AM - edited
Your selects are different:
First:
...
select 4, hash('4-20100433-01765-LOTES')
from rdb$database
...

vs. secound:
...
select 4, hash('4-20100433-01775-LOTES')
from rdb$database;


01765 vs 01775

If you try with 01775 in the first select the result will be the same.

(It is possible to two different string have the same hash code. See: http://en.wikipedia.org/wiki/Hash_function)

Paul Reeves added a comment - 26/May/14 11:34 AM - edited
Ah, yes. In fact this does demonstrate it correctly:

select 1, hash('1-20100433-01765-LOTES')
from rdb$database
union
select 2, hash('2-20100433-01765-LOTES')
from rdb$database
union
select 3, hash('3-20100433-01765-LOTES')
from rdb$database
union
select 4, hash('4-20100433-01775-LOTES')
from rdb$database
union
select 5, hash('5-20100433-01765-LOTES')
from rdb$database;


    CONSTANT
============ =====================
           1 533117221586865043
           2 533117221318429587
           3 533117222123735955
           4 533117221586865043
           5 533117221872077715




Paul Reeves added a comment - 26/May/14 12:31 PM
The documentation for hash() just says "Returns a hash value for the input string. This function fully supports text BLOBs of any length and character set."

I suppose one resolution of this bug report could be to amend the documentation to indicate that collisions are possible and must be tested for. No assumption can be made about the uniqueness of a string based on its hash value. It is then left to the user to compare the original strings when two identical hash values are returned.

Perhaps the real question is "Does that make this implementation of hash() useless?".

Adriano dos Santos Fernandes added a comment - 26/May/14 03:08 PM
FB's HASH is a function (and algorithm) ported from Yaffil. The algorithm is not a good hash function, but collisions are possible in any algorithm.

This function should be extended to support standard and well known algorithms (MD5, SHA-*) via a second parameter.

Sean Leyne added a comment - 11/Jun/14 02:41 PM
@Dimitry Sibiryakov,

While collisions have been found in MD5 and SHA-1 hash functions, making them unsuitable for the current purposes.

SHA-1 ad SHA-2 hash functions only have a **theoretical** chance of hash collision.

Further, SHA-3 hash functions don't even have a theoretical hash collision. (http://en.wikipedia.org/wiki/Cryptographic_hash_function)

So, there is a modern hash algorithm that would address this issue.

Dmitry Yemanov added a comment - 11/Jun/14 03:57 PM
Sean, every hash function (SHA-3 included) has collisions, this is by design. It's impossible to map longer string into a shorter one without collisions.

Evandro Amparo added a comment - 19/May/17 02:13 PM - edited
I deduced Firebird hash algorithm while playing with some examples:

- Reverse the string
- Consider the first position p = 1
- For each char, shift it's ASCII code to the left by 4 * (p - 1) bits and add up with the previous
- The resulting sum is the hash

That's why longer strings have longer hashes, and that's why it's so weak!

It would be a great if other algorithms were added.

Pavel Zotov added a comment - 29/Jul/17 07:17 AM - edited
After reading https://stackoverflow.com/questions/3475648/sha1-collision-demo-example I decided to make trivial test:
===
set list on;
set blob all;

recreate table test(s1 blob, s2 blob);
commit;
--insert into test(s) values('4-20100433-01775-LOTES');
--insert into test(s) values('1-20100433-01765-LOTES');

-- https://stackoverflow.com/questions/3475648/sha1-collision-demo-example
insert into test(s1, s2) values('20177527e04e05d5e7b448c1ab2b872f86831d0b', '20177527e04e05d5e7b448c1ab2b872f86831d0b');
commit;

select
    s1,s2
    ,hash(s1 using sha1) s1_hash_using_sha1
    ,hash(s2 using sha1) s2_hash_using_sha1
    ,iif( hash(s1 using sha1) = hash(s2 using sha1), '## ACHTUNG ##', 'Passed, all fine.') as collision_sha1_check
from test;

select
    s1,s2
    ,hash(s1 using sha256) s1_hash_using_sha256
    ,hash(s2 using sha256) s2_hash_using_sha256
    ,iif( hash(s1 using sha256) = hash(s2 using sha256), '## ACHTUNG ##', 'Passed, all fine.') as collision_sha256_check
from test;

select
    s1,s2
    ,hash(s1 using sha512) s1_hash_using_sha512
    ,hash(s2 using sha512) s2_hash_using_sha512
    ,iif( hash(s1 using sha512) = hash(s2 using sha512), '## ACHTUNG ##', 'Passed, all fine.') as collision_sha512_check
from test;

quit;
===

Its output on build 713 is:
===

S1 9b:0
20177527e04e05d5e7b448c1ab2b872f86831d0b
S2 9b:1
20177527e04e05d5e7b448c1ab2b872f86831d0b
S1_HASH_USING_SHA1 8C4E783C453F8059ADCF8B52C4E57FAE93F75861
S2_HASH_USING_SHA1 8C4E783C453F8059ADCF8B52C4E57FAE93F75861
COLLISION_SHA1_CHECK ## ACHTUNG ##



S1 9b:0
20177527e04e05d5e7b448c1ab2b872f86831d0b
S2 9b:1
20177527e04e05d5e7b448c1ab2b872f86831d0b
S1_HASH_USING_SHA256 2C15B17DF7EA260B5654E85A2BFF7CF3AF5E40C0D9FC1FD758EFBA61BB88CC53
S2_HASH_USING_SHA256 2C15B17DF7EA260B5654E85A2BFF7CF3AF5E40C0D9FC1FD758EFBA61BB88CC53
COLLISION_SHA256_CHECK ## ACHTUNG ##



S1 9b:0
20177527e04e05d5e7b448c1ab2b872f86831d0b
S2 9b:1
20177527e04e05d5e7b448c1ab2b872f86831d0b
S1_HASH_USING_SHA512 C0CB5278996574D9CCBF952AF67B5DFF38CDAC394AE713928A759A884222985DDBC92C15A74278236EB9599E8603EDFC9A7613AD475085D5ADF93174DC63536A
S2_HASH_USING_SHA512 C0CB5278996574D9CCBF952AF67B5DFF38CDAC394AE713928A759A884222985DDBC92C15A74278236EB9599E8603EDFC9A7613AD475085D5ADF93174DC63536A
COLLISION_SHA512_CHECK ## ACHTUNG ##
===

Is it expected that current implementation of SHA256 and SHA512 in FB 4.0.0 can fail as SHA1 ?



Adriano dos Santos Fernandes added a comment - 29/Jul/17 01:22 PM - edited
> Is it expected that current implementation of SHA256 and SHA512 in FB 4.0.0 can fail as SHA1 ?

Can fail as what? What do you mean?

Pavel Zotov added a comment - 29/Jul/17 02:18 PM
I mean that two different strings with not big length give the same hash on SHA-256 and even in SHA-512.
It this the feature of SHA algorithm per se or this is because of concrete implementation in FB ?

Adriano dos Santos Fernandes added a comment - 07/Aug/17 04:26 PM
A SHA-256 / SHA-512 will return the same hash for the same string in every implementation. If not, the code is bugged.

It's out of our scope to judge quality of these algorithms and minimal length for clashes.