
SuperServer could hang when changing physical backup state under high load [CORE5613] #5879

Closed
firebird-automations opened this issue Sep 16, 2017 · 4 comments

Comments

@firebird-automations
Collaborator

Submitted by: @hvlad

The issue was detected while testing nbackup during a TPCC run with 64 concurrent connections.
The engine could hang immediately after begin/end backup, i.e. after a physical state change.
A few threads wait infinitely in RWLock::beginRead() for BackupManager::localStateLock.
The wait can never succeed because localStateLock has no owner that could release it.
Also, the lock value is -1, which should never happen.
All other threads wait for bdb latches already acquired by the threads above.

The problem is caused by a race condition:

- the backup thread acquires localStateLock in Write mode (see BackupManager::StateWriteGuard) and sets the TDBB_backup_write_locked flag (see BackupManager::lockStateWrite),
then it marks the header page and sets the BDB_nbak_state_lock flag on its BufferDesc;
note that this mark does not acquire localStateLock in Read mode because of BDB_nbak_state_lock (see CCH\set_diff_page() and BackupManager::lockStateRead);
then the backup thread releases the header page (it does not release localStateLock)

- another thread commits and flushes dirty pages; it writes the dirty header page and releases localStateLock (see CCH\clear_dirty_flag_and_nbak_state),
because the BufferDesc has the BDB_nbak_state_lock flag set and tdbb is not marked with the TDBB_backup_write_locked flag

- the backup thread releases localStateLock in Write mode (see ~StateWriteGuard)

I.e. we have an excess RWLock::endRead call, which breaks the lock state and leads to the hang.

For the problem to occur, there must be very short transactions that fit (from start to finish) into the small time window
between the release of the header page and the release of localStateLock by the backup thread.
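
The interleaving above can be replayed deterministically with a toy model. This is only a sketch under stated assumptions, not the engine code: ToyRWLock, ToyThread, ToyPage, writePage, replayRace and the WRITER_OFFSET constant are all hypothetical names invented for illustration, and the counter layout (0 = free, positive = reader count, negative = writer present) is an assumed scheme rather than the actual RWLock implementation. Under that assumed scheme, one unmatched endRead leaves the counter at -1 with no owner, which matches the symptom described.

```cpp
#include <cassert>

// Toy counter-based reader/writer lock. NOT Firebird's RWLock: the layout and
// the WRITER_OFFSET constant are assumptions made for illustration only.
struct ToyRWLock {
    static constexpr long WRITER_OFFSET = 50000;
    long value = 0;  // 0: free, >0: reader count, <0: a writer is present

    void beginWrite() { value -= WRITER_OFFSET; }
    void endWrite()   { value += WRITER_OFFSET; }
    void endRead()    { --value; }
    // A real beginRead() would block while this returns false.
    bool readerCanEnter() const { return value >= 0; }
};

struct ToyThread { bool backupWriteLocked = false; };  // stands in for TDBB_backup_write_locked
struct ToyPage   { bool nbakStateLock = false; };      // stands in for BDB_nbak_state_lock

// Models the page-write path as described in the issue: the state lock is
// released in read mode when the page carries the flag and the writing
// thread is not the one holding the lock in write mode.
void writePage(ToyRWLock& lock, const ToyThread& writer, ToyPage& page) {
    if (page.nbakStateLock && !writer.backupWriteLocked)
        lock.endRead();
    page.nbakStateLock = false;
}

// Replays the three steps of the race in order and returns the final counter.
long replayRace() {
    ToyRWLock lock;
    ToyThread backup, committer;
    ToyPage header;
    assert(lock.value == 0);  // the lock starts out free

    // Step 1: backup thread takes the lock in write mode and marks the header
    // page. Per the description, this mark sets the flag on the page without
    // taking the lock in read mode. The header page is then released while
    // the write lock is still held.
    lock.beginWrite();
    backup.backupWriteLocked = true;
    header.nbakStateLock = true;

    // Step 2: another thread flushes the dirty header page inside the window.
    // It sees the flag, is not write-locked itself, and calls endRead()
    // although no matching read acquisition ever happened.
    writePage(lock, committer, header);

    // Step 3: backup thread releases the write lock.
    backup.backupWriteLocked = false;
    lock.endWrite();

    return lock.value;
}
```

Calling replayRace() in this model returns -1: no thread owns the lock any more, yet readerCanEnter() stays false, so every later reader would wait forever.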

Commits: a60b19f bb5a3b0

@firebird-automations
Collaborator Author

Modified by: @hvlad

assignee: Vlad Khorsun [ hvlad ]

@firebird-automations
Collaborator Author

Modified by: @hvlad

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 3.0.3 [ 10810 ]

Fix Version: 4.0 Beta 1 [ 10750 ]

@firebird-automations
Collaborator Author

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Resolved [ 5 ]

QA Status: No test => Cannot be tested

@firebird-automations
Collaborator Author

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Closed [ 6 ]
