
Allow to query only those records INSERTed or UPDATEd since last time/sequence [CORE6199] #6444

Open
firebird-automations opened this issue Nov 29, 2019 · 19 comments


Submitted by: PizzaProgram Ltd. (szakilaci)

A typical request is to pull from a Firebird server only THOSE records WHICH CHANGED since the last query.
With the current engine model this is a very complex thing to do.
See: https://stackoverflow.com/questions/59071948/how-to-read-all-last-changed-records-from-firebird-db

It would make programmers' lives much easier if there were an "auto-sequencer field", like RDB$LAST_MODIFIED, attached to every table by default.
It would be auto-updated on every insert and update by a global system generator AFTER COMMIT. (NOT during the transaction, as triggers do!)

SELECT o.* FROM orders o WHERE o.RDB$LAST_MODIFIED > 5489852;

Or could be the new replication system's values used for this?


Commented by: @livius2

You can use RDB$RECORD_VERSION for this (with limited use; I write all of this from memory, so some mistakes may exist).
But you must first remember the OAT (oldest active transaction).

Consider this example (see the sketch below):
1. Remember the OAT, e.g. as interestingOAT.
2. Run your select "SELECT o.* FROM orders o" - cache the result or whatever.
3. Some time elapses, e.g. 5 minutes.
4. Remember the OAT again, e.g. as currentOAT.
5. Run "SELECT o.* FROM orders o WHERE o.RDB$RECORD_VERSION > :interestingOAT".
6. Store current into interesting: interestingOAT := currentOAT.
7. Repeat from point 3.
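
A minimal SQL sketch of steps 1, 2, 4 and 5, assuming Firebird 3.0+ (for the RDB$RECORD_VERSION pseudo-column) and the MON$ monitoring tables from 2.1+; "orders" and the host variables are placeholders:

```sql
-- Steps 1 and 4: read the current OAT from the monitoring tables
SELECT MON$OLDEST_ACTIVE FROM MON$DATABASE;
-- remember this value client-side as interestingOAT (later: currentOAT)

-- Step 2: the initial full pull
SELECT o.* FROM orders o;

-- Step 5: fetch only rows whose writing transaction is newer than the
-- remembered OAT (this is the "ExtendedDelta" described in the warnings)
SELECT o.*
  FROM orders o
 WHERE o.RDB$RECORD_VERSION > :interestingOAT;
```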

Warning.
1. Remember that at point 5 you get not only the delta but also some records already retrieved at point 2 - call it an "ExtendedDelta".
2. If the OAT is stuck for a longer time, you will get the same ExtendedDelta every time.
3. Also note that with this approach you do not get deleted records.
4. Such a query does not use an index if your WHERE clause contains only "o.RDB$RECORD_VERSION > :interestingOAT" and no other conditions.

To fix point 3, the engine would need to support two new features:
1. The ability to mark a transaction as still interesting to the engine. E.g. the user could say "I am still interested in record versions from transaction id XXX", and Firebird would then not remove record versions from that transaction id and newer.
2. SELECT * FROM TABLE_XXX INCLUDE DELETED

Look at the new transaction CN (commit number) in Firebird 4; it can bring some improvements to the above scenario.

You can also look in the firebird-devel archive for related discussion, e.g. around 24.07.2018 16:05.


Commented by: Ann Harrison (awharrison)

"It would make programmers life much easier if there would be an "auto-sequencer field", like: RDB$LAST_MODIFIED attached to every table by default.
That would be auto-updated on every insert and update by a global system generator AFTER_COMMIT. (NOT during transaction, like triggers do!)"

Interesting thought. I've got some questions.

Is LAST_MODIFIED table specific or database wide?

If the "RDB$LAST_MODIFIED" column is updated after commit, will Firebird need to revisit and change every record that the transaction altered after the commit? Would you expect that to cause a performance degradation? Why do you want the change made after the commit (when it's expensive) rather at the time the record is stored or modified?

If the after-commit action fails for some reason, what is the state of the record? Committed but without that field? Would it show up if you searched based on an earlier value of LAST_MODIFIED... Probably not, since that column has to be indexed...

How does your connection get the right starting value for LAST_MODIFIED? It can't be passed to the transaction, since it's gone. Does it somehow become connection state? Suppose you want to use it after a reconnect?


Commented by: @livius2

LAST_MODIFIED as a timestamp is not good at all; it must be some kind of counter.
See the discussion on devel ("24.07.2018 16:05") and the modified example using CN in cooperation with RDB$RECORD_VERSION instead of the current OAT with RDB$RECORD_VERSION.

>>"rather at the time the record is stored or modified?"

I can answer this. Consider this situation:

1. Transaction 1 starts.
2. Transaction 1 updates some records.
3. Before transaction 2 starts, you store the current time as "interestingTime".
4. Transaction 2 starts and you retrieve SELECT o.* FROM orders o.
5. Transaction 2 commits.
6. Transaction 1 commits.
7. Transaction 3 starts and you run SELECT o.* FROM orders o WHERE o.RDB$LAST_MODIFIED > :interestingTime.

At point 7 you miss the record changes committed by transaction 1, which was active before your transactions 2 and 3 started but committed later.


Commented by: PizzaProgram Ltd. (szakilaci)

@ann: Thanks for taking an interest in this... :)
Answers to your questions:

0.) RDB$LAST_MODIFIED is maybe not the best name.
- RDB$LAST_SEQUENCE or
- RDB$GLOBAL_MODIFICATION_INC ...
- (Maybe RDB$RECORD_VERSION already fulfills that purpose?)
It would be an INT64, NOT NULL value.

1.) Yes, LAST_MODIFIED is table-related. It would be added automatically. (Maybe only on user request, globally for the whole DB:)
ALTER DATABASE [GLOBAL_SEQUENCING] ...

2.) The generator for it can be (should be?) global.

3.) >> will Firebird need to revisit and change every record that the transaction altered after the commit?
YES. That's why it would be great to be able to turn this feature on only when needed. (For replication or for "last changed" queries.)

4.) >> Would you expect that to cause a performance degradation?
YES I would, but compared to
- the ease of the programmer's job,
- the reduced client-side processing,
- the tiny amount of data transferred, etc.,
it is a very small price.
Pulling only 0-2 rows of a select instead of 100 each time over 2G/3G, DSL internet or VPN makes a HUGE difference.

5.) >> Why do you want the change made after the commit (when it's expensive) rather than at the time the record is stored or modified?
- Because we tried that. Our application became totally unusable! We had to revert all changes made during the last 2 months :-(
- The problem with that approach is explained in the stackoverflow link above.
(Timestamp gaps occur between 2 downloads! Those records will never be updated!)

6.1.) >> If the after-commit action fails for some reason, what is the state of the record?
That's why it should be implemented in the engine, so that this never happens.
Basically it would happen DURING (at the end of) the commit, not really "after", so it could revoke the commit if something went wrong (e.g. a disk write error).
The only thing the engine would have to ensure is that the sequence stays "straight": a commit that finishes before another commit must always have smaller generated values. For that, a multi-thread protection probably needs to be added to the global sequence generator so that multiple commits cannot "switch execution order".

Now that I think about it, it becomes clear to me: having one generator per table would increase performance, instead of waiting for the global generator, if lots of records need to be updated in multiple tables.

6.2)
- At insert: LAST_MODIFIED always gets filled with a value. (Otherwise the insert would fail.)
- At update: if it fails, the previous value stays.

7.) >> How does your connection get the right starting value for LAST_MODIFIED? It can't be passed to the transaction, since it's gone. Does it somehow become connection state? Suppose you want to use it after a reconnect?

NO, nothing like that. The application programmer would need to keep track of it (search for the biggest value),
or query the current generator value (see below).
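
For illustration, reading a generator's current value without incrementing it is done like this in Firebird; the generator name here is hypothetical, standing in for whatever would back the proposed column:

```sql
-- GEN_ID with an increment of 0 returns the current value without changing it
SELECT GEN_ID(GEN_GLOBAL_MODIFICATION, 0) FROM RDB$DATABASE;
```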
## First query: SELECT o.* FROM orders o WHERE o.closed=0; ... would automatically return this built-in field too:

ID  fk_customer_id  closed  date   RDB$LAST_MODIFIED
11  34              0       12/03  5489851
12  23              0       12/04  5489852

## Next query:
SELECT o.* FROM orders o WHERE o.RDB$LAST_MODIFIED > 5489852;

ID  fk_customer_id  closed  date   RDB$LAST_MODIFIED
11  34              1       12/03  5489853   <<< (the order got closed, so it changed; that's why LAST_MODIFIED increased)
13  963             0       12/04  5489854


Commented by: PizzaProgram Ltd. (szakilaci)

@Karol:
"WHERE o.RDB$RECORD_VERSION > :interestingOAT"
I've just looked into this new feature, which I didn't know about. Too bad it's only available since FB 3.0, not 2.5.
https://www.ibphoenix.com/resources/documents/contributed/doc_393

But it seems very similar to what I need. :)
I will examine and compare it further.


Modified by: @mrotteveel

Version: 4.0 Beta 1 [ 10750 ] =>

Version: 4.0 Beta 2 [ 10888 ] =>


Commented by: Ann Harrison (awharrison)

At the risk of sounding a bit judgemental, the cost of updating every record inserted or updated is enormous - unsupportable. The problem is not just doubling the number of writes, but of finding space for a new, longer version of the record and probably needing to move the old version to a new page 15 or 20 percent of the time, causing extra page writes.

If the goal is replication - fine, then that's something that should be done at the system level, not at the application level. Even if your application never deletes records, others do, so maintaining some history of deleted records is essential for consistency.

If there is some other goal - e.g. reading everything new into memory and working with it there, well, that needs to be explained more clearly to me. Surely there's some old data that's interesting. Can you explain use cases other than replication?

I don't see why a timestamp is worse than a generated value. Essentially, a timestamp is a value generated by the clock rather than a generator. Yes, you might get duplicates, but I think duplicates don't matter for this use case.

Doing this in a Read Committed transaction invites another kind of inconsistency. Consider this case.

Transaction 1 starts
Transaction 2 starts
Transaction 1 reads records A, B, C, D
Transaction 2 updates records A, B, C, D, E, F, G
Transaction 2 commits
Transaction 1 reads records E, F, G
Transaction 1 commits

Transaction 1 has read the most recent versions of E, F, G, but not the most recent versions of A, B, C, D. Should the changes made by Transaction 2 be considered read or unread by Transaction 1's successor?


Commented by: PizzaProgram Ltd. (szakilaci)

Dear Ann,

>> "Transaction 1 has read the most recent versions of E,F,G, but not the most recent versions of A,B,C,D"
EXACTLY this is the problem can NOT be solved with simple triggers by us, programmers.
It has to be supported by the engine itself.
______________________________________________
>> " Can you explain use cases other than replication? "
YES, of course! We are developing POS systems for Pizzerias since 17 years. 1 restaurant is usually using 2-5 PC + 2-3 Tablet + 1-2 remote (boss) Laptops.
The problem is: that ALL PCs and Tablets need immediate updates of each orders+details+sub-details+customers data
+ sometimes the prices\unites\foods\exchange rates etc. changes after a looooong (12 hours) transaction.

- So typically there are 1,000,000 records from the last 10 years on a simple $100 PC acting as the "server".
- Of those, the usually ACTIVE records are:
+ orders = 300
+ detail records = 3000
+ sub records = 200
+ customer data = 300
+ 5000 ware details (foods, units, groups, stock, rates, etc. = 30 other tables)
- ALL users must see ALL changes immediately (<1 sec), in REALTIME!
- And sometimes the restaurant owner changes things from his laptop all day long, committing only at night. (Editing new prices or the daily menu.)
- (And of course we would need replication too, in case the "main PC" suddenly dies.)

Now imagine how stressed that "main PC" is if it needs to download EVERY piece of data again and again to every other device every second (or even faster... whenever an event arrives).
With this technique it would be possible to download ONLY the last 1-2 records, but also to re-download ALL records changed "back in time" whenever the boss commits new prices.
There are also: pre-orders + complicated material management + realtime incoming phone-number display + a webserver component - and ALL these functions work at the same time in the background, reading, writing and LOCKING! the database all the time.

Summary: there is a huge difference between an "office environment" and a "POS environment".
The program needs to work very fast \\ very safely \\ on cheap hardware \\ with many records \\ with multiple stupid users making all kinds of errors all the time in overheated, stressful kitchens.


Commented by: @livius2

Ann, to support this feature updating all records is not necessary and should be avoided.

>> Doing this in a Read Committed transaction invites another kind of inconsistency. Consider this case.

No, because you look at the OAT first when reading records. If a transaction is not committed, you get everything from the time of the last committed transactions. No inconsistency at all - only an "ExtendedDelta" every time.
But to avoid being blocked by a stuck OAT, it is better to use the new feature from FB4 - CN - combined with RDB$RECORD_VERSION.

PS: if the team thinks about supporting this feature, you must consider that InterBase now has a patent granted for "change views".
I do not know whether this creates problems or not, but it must be taken into account.


Commented by: PizzaProgram Ltd. (szakilaci)

Yes, it seems EVERY other famous RDBMS has this feature under different names.

- InterBase's model is "subscription based", which is a great idea, making the life of users extremely easy.
https://delphiaball.co.uk/2015/02/06/interbase-change-views-part-1/

- My solution is a bit of a "do it yourself" way, since users have to keep track of the "LAST_CHANGED" sequence themselves, and deleted records are not supported out of the box. But it's not that hard to remember one INT64 variable per table and select from that last number.

Currently I'm trying to make this work in FB 2.5 by creating a "PAST_TIMES" table that is updated by an ON TRANSACTION COMMIT trigger (see the sketch below):

TableNumber  TriggerStart  AfterCommitTime  AfterSequence
23           12:03:25.215  12:11:13.546     4654456
12           12:02:59.236  NULL             NULL   << this line will be updated after commit

using SELECT ... WITH LOCK to update each record WHERE AfterSequence IS NULL.
That's how I would be informed that new data has just been saved into a table "in the past".
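
A minimal sketch of that setup, assuming Firebird 2.1+ database triggers; the column types and the generator name are my assumptions:

```sql
CREATE TABLE PAST_TIMES (
  TABLE_NUMBER      INTEGER NOT NULL,
  TRIGGER_START     TIMESTAMP NOT NULL,
  AFTER_COMMIT_TIME TIMESTAMP,
  AFTER_SEQUENCE    BIGINT
);

CREATE SEQUENCE GEN_AFTER_SEQ;

SET TERM ^ ;
-- Database trigger: fires once per transaction, at commit time
CREATE TRIGGER TRG_TX_COMMIT ON TRANSACTION COMMIT
AS
BEGIN
  -- Stamp the rows this transaction left open during its lifetime
  UPDATE PAST_TIMES
     SET AFTER_COMMIT_TIME = CURRENT_TIMESTAMP,
         AFTER_SEQUENCE    = NEXT VALUE FOR GEN_AFTER_SEQ
   WHERE AFTER_SEQUENCE IS NULL;
END^
SET TERM ; ^
```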


Commented by: @livius2

In Firebird 4 it is quite simply possible (without tracking deleted records). You are not blocked by the OAT here, as the CN always progresses :-)

You can query Firebird using (bookkeeping sketch at the end of this comment):

SELECT o.* FROM orders o WHERE RDB$GET_TRANSACTION_CN(o.RDB$RECORD_VERSION) > your_last_queried_CN;

Still, two problems remain:
1. Querying for deleted records - impossible now.
2. The above query does not use an index - whole table scan.

To support point 1, we would need something like
BLOCK_RECORD_VERSIONS(CN_NUMBER) - after which the engine treats these record versions (from the specified CN number and all newer) as untouchable by the garbage collector.
And we would need a way to query deleted records from these blocked record versions.

How to support point 2?
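
A sketch of the client-side bookkeeping this implies; RDB$GET_TRANSACTION_CN() is the FB4 function used above, while the 'GLOBAL_CN' context variable name is my assumption for reading the current global commit number:

```sql
-- After each sync, remember the current global commit number...
SELECT RDB$GET_CONTEXT('SYSTEM', 'GLOBAL_CN') FROM RDB$DATABASE;

-- ...and on the next sync, pull only rows committed after it
-- (no index can be used here, so this is a full table scan)
SELECT o.*
  FROM orders o
 WHERE RDB$GET_TRANSACTION_CN(o.RDB$RECORD_VERSION) > :your_last_queried_CN;
```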


Commented by: @sim1984

Why do this with SQL queries? Why not use the replication system implemented in 4.0? If you need more fine-grained control, a CDC API for obtaining the change log has been promised for Beta 2.
The problem with your approach is that even if you get changes in one table, they may not take into account related changes in other tables. What is the point of obtaining partially inconsistent information?
In addition, I am categorically against introducing into the engine functions that drastically slow it down and block garbage collection.


Commented by: @livius2

>> Why not use the replication system implemented in 4.0?

Isn't it page-oriented rather than record-oriented?

>> The problem with your approach is that even if you get changes in one table, they may not take into account related changes in other tables. What is the point of obtaining partially inconsistent information?

Can you explain that in more detail? You can do the same for the other tables; nothing prevents you from doing so.
You can also do a simple join, or add an OR to the WHERE clause.

>> In addition, I am categorically against introducing into the engine functions that drastically slow it down and block garbage collection.
This is not permanent blocking; it only lasts until you increase the value with the next call to BLOCK_RECORD_VERSION. Of course it can be used wrongly - but that is already possible today: you can start a snapshot transaction and leave it active for months.


Commented by: @sim1984

>> Isn't it page oriented not record oriented?

You are wrong. Firebird 4.0 implements purely logical replication. There is no page-level physical replication. There is additional nbackup functionality that allows increments to be applied directly to a database.
The latter tool can be used to implement something like a physical standby, but this is not physical replication as done in a DBMS with a transaction log.

>> Can you explain that in more detail? You can do the same for the other tables; nothing prevents you from doing so.
>> You can also do a simple join, or add an OR to the WHERE clause.

Querying older versions of a record cannot be done quickly even for a single table, not to mention joining multiple tables. This is not index access.

>> This is not permanent blocking; it only lasts until you increase the value with the next call to BLOCK_RECORD_VERSION. Of course it can be used wrongly - but that is already possible today: you can start a snapshot transaction and leave it active for months.

A new snapshot will only hold the versions it needs, not the entire version chain.


Commented by: PizzaProgram Ltd. (szakilaci)

I was able to solve it with an AFTER COMMIT trigger under FB 2.5 too.
See: https://stackoverflow.com/a/59160005/2110862

1.) One way is to record ALL changes of ALL records of ALL tables. (Every insert, update or delete.)

2.) But I've chosen a less disk-space-consuming method:

- recording ONLY those changes where [CURRENT_TIMESTAMP = beginning of the transaction] happened more than 60 seconds before that transaction committed;

- always pulling data from the tables with:
'where LAST_MODIFIED > ' + #39 + FormatDateTime('yyyy-mm-dd hh:nn:ss', my_last_pull_time - 60*OneSecond) + #39;
(Of course that "60 sec" limit can be reduced individually for frequently changing tables.)

- or, even better, having the time in the WHERE clause determined by a stored procedure that searches the logging table for "past time" occurrences and "goes back further" (see the sketch below).
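
A minimal PSQL sketch of that stored-procedure idea; the PAST_TIMES table and column names are the ones from my earlier comment, and the procedure name is hypothetical:

```sql
SET TERM ^ ;
CREATE PROCEDURE GET_PULL_FROM (LAST_PULL TIMESTAMP)
RETURNS (PULL_FROM TIMESTAMP)
AS
BEGIN
  -- Default: re-read the last 60 seconds
  PULL_FROM = DATEADD(-60 SECOND TO LAST_PULL);

  -- If a long-running transaction committed since the last pull,
  -- go back to its start time instead
  SELECT COALESCE(MIN(P.TRIGGER_START), :PULL_FROM)
    FROM PAST_TIMES P
   WHERE P.AFTER_COMMIT_TIME > :LAST_PULL
     AND P.TRIGGER_START < :PULL_FROM
    INTO :PULL_FROM;

  SUSPEND;
END^
SET TERM ; ^

-- usage: SELECT PULL_FROM FROM GET_PULL_FROM(:my_last_pull_time);
```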


Commented by: @livius2

Just curious: why do you use a method which can miss some data changes?
Especially this part: "60 seconds before that transaction committed".


Commented by: PizzaProgram Ltd. (szakilaci)

Conclusion:
- It is possible to "do it yourself",

but it would be better if
- there were a "ready to use" solution to better support two-tier architectures.

(I started learning InterBase, and later Firebird, 20 years ago. Still, it took me more than a week to find a solution + write those 70 triggers + test them. I cannot imagine how a newbie could do that...)


Commented by: PizzaProgram Ltd. (szakilaci)

Dear Karol,

I'm NOT missing any data. That's the point of the whole topic here!

Obviously you have misunderstood something.
Please look at the code at https://stackoverflow.com/a/59160005/2110862

- This method logs those transactions that started and committed more than 60 sec apart.
- By default, every select query pulls data from now - 60 sec.
(That means the data changed during the last minute is downloaded again and again...)
EXCEPT:
- if there is a LOG entry notifying us of a special case, where we have to extend that 60 sec and re-read, say, the last 231 sec again.


Commented by: @livius2

>> Obviously you have misunderstood something

Possibly. But do you have some additional fields inside the tables? How do you retrieve only the changed records based on this stackoverflow solution?
As far as I can see, it only:
1. stores the last (not the first, but the last ;-) ) modification time of the table as TRG_START;
2. stores the commit time as SYNC_TIME.
