Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement new agg. function for statistic purpoces: MEDIAN() & MEDIAN()OVER() [CORE5584] #5851

Open
firebird-automations opened this issue Jul 17, 2017 · 3 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: @pavel-zotov

Attachments:
find-median-usin-quickselect_-_blum-floyd-pratt-rivest-tarjan.py.7z

Votes: 3

Please consider to implement aggregate function MEDIAN() which can be very useful when we want to evaluate 'most typical' value from set which does contain several values with big deviation: { 12, 13, 11, 14, 11, 11, 11, 12, 9999, 1, 12,13, 13 } ==> median = 12 (compare with average = 779.46).

Explanation: https://en.wikipedia.org/wiki/Median
Online eval.: http://www.alcula.com/calculators/statistics/median/

PS. Oracle has this function since 10g.

@firebird-automations
Copy link
Collaborator Author

Commented by: Timothy Weber (teejaydub)

Apart from a convenience, I wonder if this could be a substantial speedup - currently I SELECT COUNT (*), then figure out whether I need one or two rows and from where, then SELECT AVG(x) FROM (SELECT x ORDER BY x ROWS a TO b). Any speed improvement for that would be welcome.

@firebird-automations
Copy link
Collaborator Author

Commented by: @pavel-zotov

I doubt that finding median can be efficiently done using current SQL / PSQL features in FB.
Effective methods known today solve this task using intensive permutations of array elements.
This obviously can not be implmented with normal performance if we will operate with *tables* (cost of disk i/o will kill all such attempts).

There is 'quickselect' method (by Tony Hoare) for selecting Kth largest element and it can be adapted for searching median.
It provides performanse O(n) but it does this only in *average*. If pivot element (that we select for partition) will be largest on each step - we will face O(n^2).
Because of this, one may to consider more effective method developed by Blum-Floyd-Pratt-Rivest-Tarjan
( http://people.csail.mit.edu/rivest/pubs/BFPRT73.pdf )

Example of its implementation can be found here:
https://rcoh.me/posts/linear-time-median-finding/

See in attachment small Python script from this article (if it will occasinally become "404" in the future).
Usage:
python find-median-usin-quickselect_-_blum-floyd-pratt-rivest-tarjan.py

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

Attachment: find-median-usin-quickselect_-_blum-floyd-pratt-rivest-tarjan.py.7z [ 13552 ]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant