Allow at backup time possibility to order data in tables [CORE3668] #4018
Comments
Commented by: @asfernandes It doesn't matter that the data is ordered at backup time, as the table changes and will not stay ordered, so the engine must read the same set of pages anyway.
Commented by: @livius2 No, you are wrong - maybe because of my poor English. E.g. "client 1" has 1000 invoices and we do: SELECT * FROM INVOICES WHERE CLIENT='client 1'. Before reordering, the server must read 1000 pages; after reordering, the server and hard drive must read only 20 pages, not 1000.
Commented by: @asfernandes > No, you are wrong

No. You're saying to back up the data in order, and I'm saying the order will not be preserved after any insert/update, so the engine must not limit the visited pages.
Commented by: @livius2 I do not understand you. After the backup/restore process you have pages where data for client 1 comes before the pages with data for client 2 - before reordering, of course, you do not. If you then insert new data into this freshly restored database, "fragmentation" will appear in the new pages and will remain until the next backup/restore process takes place.
Commented by: @livius2 One sample shows more than a billion words ;-) Look at this real sample.

Create a database with a 16K page size:

```
SET TERM ^ ;
SET TERM ^ ;
CREATE INDEX IXA_T_AFTER__CLIENT_ID ON T_AFTER (CLIENT_ID);
```

Now fill the tables:

```
execute procedure FILL_T_BEFORE;
execute procedure FILL_T_AFTER;
```

Then:

```
SELECT COUNT(*) FROM T_BEFORE
```

Disconnect. After connecting with FlameRobin, FBServer uses 17 608 K in the Windows process manager.

```
select COUNT(*) FROM T_BEFORE TB WHERE TB.CLIENT_ID=1
```

Done. FBServer memory: 82 236 K.

Disconnect. Done. FBServer uses 18 424 K. As you can see, with ordered data FBServer reads only a few pages and uses only 18 424 K.
Commented by: @livius2 Look also at the read stats: before, 4032 reads.
Commented by: Sean Leyne (seanleyne) Karol, you are describing a feature similar to a "clustered" index, where the data in a table is stored such that entries which share a common criterion are kept on data pages together with other entries matching the same criterion. Your request has only a marginal benefit for some specific use cases; largely ones where the table data is historical in nature with only occasional updates to the existing rows, and/or where the data is accessed/filtered using limited criteria where the index would help limit the number of data pages to be read. For (almost) all other use cases, there is no benefit to storage order. When you look at the execution time of the 2 queries, the difference is only 0.078s! Finally, you have not considered the cost of the SORT which would be added to the backup time. The SORT time cost can be estimated as the difference between the following statements.
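The statements Sean refers to did not survive the import of this issue; a plausible reconstruction (an assumption, not the original text) is a plain table scan compared against the same scan with the proposed ordering applied:

```sql
-- Hypothetical reconstruction of the comparison Sean describes:
-- the SORT cost added to the backup is roughly the runtime difference
-- between the unordered scan gbak performs today...
SELECT * FROM INVOICES;
-- ...and the same scan with the requested clustering order applied.
SELECT * FROM INVOICES ORDER BY CLIENT_ID;
```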
Commented by: @dyemanov I'd say this smells more like a table partitioning. |
Commented by: Sean Leyne (seanleyne) Dmitry, It seems that both "Table Partitioning" and "Clustered Index" could apply to/describe Karol's limited issue. If the case were taken simply as written, then a "clustered index" would seem to meet the strict requirements. My reading on "Table Partitioning" found that it (in DB2, MSSQL and Oracle) is included as part of a much wider scope of functionality including tablespace/database partition management, with many more implementation complexities.
Commented by: @livius2 Yes, you are right, this is something like a clustered index. I suppose it is simpler to implement this at backup time.
Commented by: Sean Leyne (seanleyne) In thinking about this, there is a relatively simple approach to implement the request as written. Allow for an index with a special name ("Backup$Order", "GBAK$Order"...) to be defined, which GBAK would look for when creating backups. If the index is present (even if the index is not active), GBAK would add the appropriate ORDER BY to the backup SELECT statement. That way the data would be backed up in the defined order, so it would be restored in that order and thus provide "data clustering". This "index name" approach would allow the ordering preference to be incorporated into the database metadata and not rely on command line/runtime switches.
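A minimal sketch of how the special-name convention might look (the exact name `GBAK$ORDER_INVOICES` is an assumption; no such convention exists in Firebird):

```sql
-- Hypothetical: an index whose reserved name tells GBAK to back up
-- INVOICES ordered by its key, i.e. to run
--   SELECT ... FROM INVOICES ORDER BY CLIENT_ID
-- instead of a natural-order scan. Per Sean's note, it could even be
-- deactivated so it costs nothing during normal operation.
CREATE INDEX "GBAK$ORDER_INVOICES" ON INVOICES (CLIENT_ID);
ALTER INDEX "GBAK$ORDER_INVOICES" INACTIVE;
```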
Commented by: @asfernandes > Allow for the an Index with a special name ("Backup$Order", "GBAK$Order"...) This is *hack*, and a very ugly one. |
Commented by: @dyemanov I'd rather spend time on a proper solution rather than on an easy one. GBAK is not a tool to customize the data layout or optimize the I/O performance. Let's leave it for its intended purpose. |
Commented by: Sean Leyne (seanleyne) Yes, my proposal is a hack, though I don't think it *that* ugly. But it can be implemented for all FB versions with less than *1* day of effort. It can also be thought of as a bridge until a more complete solution is available. And, once that solution is available, developers can drop the indexes with no significant effort and/or harm. Adriano, it is easy to complain about the ideas of others. How about a counter-proposal?
Commented by: @livius2 This is not ugly - this is a nice hack - nothing to change in the command line and params of gbak. I would never need to drop it :) - because I would really just "rename" my already existing index to this name. And I think this feature should still exist once clustered indexes and table partitioning are coded.
Commented by: Yurij (yurij) Sean Leyne

> Your request has only a marginal benefit for some specific use cases; largely ones where the table data is historical in nature with only occasional updates to the existing rows, and/or where the data is accessed/filtered using limited criteria where the index would help limit the number of data pages to be read.

Not so marginal, and "largely ones where the table data is historical in nature with only occasional updates to the existing rows" means "most financial/accounting/data-logging applications". I saw at least a 20x difference in query execution times and page reads after reordering from the natural input order (0.6-1 records per datasource per page) to "40 records per datasource per page on average". Yes, a clustered index or table partitioning may be a solution for this case.
Commented by: @asfernandes > But it can be implemented for all FB versions with less than *1* days of effort.

That was the Interbase development philosophy. The one-day-hack SS SMP implementation has now taken 10+ years to correct. So has the DSQL architecture.
Commented by: @asfernandes I doubt this feature has much relation to partitioning. Partitioning would only divide the problem into a number of buckets.
Commented by: @livius2 >> This was Interbase development philosophy. The 1 day hack SS SMP implementation is taken now 10+ years to be corrected.

But is that the situation here? Of course it would be better to have a real fix like a clustered index and partitioning, but this feature we can get, as Sean Leyne says, in one day - and the work will not go to hell after the real feature arrives, I think.
Commented by: John Franck (bozzy) Hi. I was just searching for the exact same feature and found someone else had already requested it. I'd like to give my opinion.

I'm in a situation similar to Karol's. There are times when I have to deal with a set (sometimes a quite big set) of records from these tables, so I have to SELECT them using a partial index. I've noticed, as Karol did, that the engine performs a lot of reads, apparently many more than it should - or rather, many more than I expected it to. It also takes quite long to fetch all the records, yet there's not much CPU usage; instead there's a lot of disk activity.

On these tables I do batch inserts and batch deletes, so some amount of "fragmentation" occurs, I think both at the filesystem level and at the DB level, but the FB engine reuses space freed by deleted records when new inserts occur, so at the logical (index-wise) level the table ends up both fragmented and unordered. Leaving filesystem fragmentation aside, the pure DB fragmentation problem can be solved with a backup-restore. After that I notice some speed-up, but the number of reads remains high, and this is due to the logical (index-wise) disorder.

What Karol suggests can IMHO effectively reduce this kind of "fragmentation" (or disorganization) of tables at the DB-file level, resulting in actual DB efficiency optimization (fewer reads, less memory usage, more speed). My suggestion, in order to avoid hacks, is to let GBAK use the PRIMARY KEY, if present. This feature should be activated by a switch at backup time, and the time cost during backup is something up to the user - I mean, if you want your tables ordered, you know your backup will likely take longer.
Commented by: @dyemanov Using an index during backup may lead to lost data (if the index is corrupted). This is unacceptable. I'm wondering why you guys speak about backup/restore (which has a totally different purpose) rather than asking for something like a REORDER statement? Also, it could be achieved with a combination of CREATE TEMPORARY TABLE AS SELECT followed by TRUNCATE and then by INSERT.
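Dmitry's last suggestion could be sketched roughly as follows (an illustration only: Firebird has no CREATE TABLE AS SELECT or TRUNCATE, so a pre-created scratch table and DELETE stand in for them, and `INVOICES_TMP` is a hypothetical name):

```sql
-- Assumes INVOICES_TMP was created beforehand with the same
-- structure as INVOICES.
INSERT INTO INVOICES_TMP SELECT * FROM INVOICES;
DELETE FROM INVOICES;                              -- stands in for TRUNCATE
INSERT INTO INVOICES
  SELECT * FROM INVOICES_TMP ORDER BY CLIENT_ID;   -- rewrite in clustered order
```

Note this only works when no foreign keys reference INVOICES, which is exactly the limitation Karol points out below.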
Commented by: @livius2 >> Using an index during backup may lead to the lost data (if the index is corrupted). This is unacceptable.

Because of that, the restore process would be the better place, since as far as I know the table data is restored first and the indexes are created afterwards.

>> I'm wondering why you guys speak about backup/restore (which has a totally different purpose) rather than asking for something like REORDER statement?

But I know that this has a bigger development time cost...
Commented by: @livius2 >> Dmitry Yemanov: this can only work for a simple table without relations :/ Can this feature (the hack, or a real fix like partitioning/clustered index) be included in the 3.0 roadmap, or is it too late?
Commented by: @dyemanov We don't include hacks, sorry. And clustered indices are not in the short term goals either. |
Submitted by: @livius2
Votes: 2
Hi,
is it possible to add a feature to order data during the backup or restore process
(I suppose at backup time this would be simpler to implement, but I do not know FB's internal workings),
ordering the data in a table's pages in some way?
Example: invoices.
In normal work you store new invoices for many clients,
and after 4 years you have e.g. 10 000 000 invoices.
Data stored in the table INVOICES looks like:
database page 1
client 1 ....
client 2 ..
client 3 ..
..
database page 2
client 4
client 5
...
database page X
client 1 ..
client 3 ..
client 2 ..
...
And now we select invoices for "client 1".
What happens?
The server locates data for this one client across many pages - and disk usage is very, very big.
I think all the records are on different pages, so for e.g. "client 1" the hard drive must move its head as many times as the client has invoices.
It would be good to see something like this (or another solution - the best would be ordering data in pages of a running database),
where we can pass some option at backup time,
like: gbak 127.0.0.1:d:\baza\bazaX.fdb -order_table "INVOICES" -order_by "CLIENT_ID ASC"
After that, the data in the table INVOICES would look like:
database page 1
client 1 ....
client 1 ..
client 1 ..
..
database page 2
client 1
client 1
client 1
...
database page X
client 2 ..
client 2 ..
client 2 ..
...
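Whether a restore actually clustered the rows this way can be checked with a selective query and the per-query statistics (using isql's SET STATS ON here is my assumption about the tooling; any monitoring of page reads would do):

```sql
-- After an ordered restore, this should touch only the few pages
-- holding client 1's invoices instead of pages spread over the table.
SET STATS ON;
SELECT * FROM INVOICES WHERE CLIENT_ID = 1;
```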
// this is correlated with, but not the same as, my CORE2708