deadlock with events [CORE5757] #6020

Closed
firebird-automations opened this issue Feb 22, 2018 · 22 comments
Comments

@firebird-automations
Collaborator

Submitted by: Hamish Moffatt (hmoffatt)

Jira_subtask_outward CORE5772

Attachments:
gdb.txt
event_loop.py
PORT_connecting.patch
event_loop.py
after-patch.txt
after-patch2.txt

My Firebird server deadlocks often. I am using 2.5.8 on Linux in a mix of superserver, superclassic, 32-bit and 64-bit. All are affected.

When this happens I cannot make any new connections or run any queries on existing connections.

This looks just like CORE4680 which was meant to be fixed in 2.5.5.

I created a Python program which connects to the server, registers an event listener, then disconnects. It runs 5 threads at once. The server deadlocked after about 300 connects (64 connections on each thread). When I killed the Python program, the server resumed. The test database can be any empty database.

I have attached a back trace from the server while it's in this state.
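
The reproducer described above can be sketched roughly as follows. This is not the attached event_loop.py; it is a minimal sketch that assumes the fdb driver (event_conduit(), begin() and close() are its event API as far as I know). The names event_cycle, run_threads and fdb_connect, the event name, the DSN and the credentials are all made up for illustration.

```python
import threading


def event_cycle(connect, event_names, iterations):
    """Connect, register an event listener, then disconnect, repeatedly."""
    for _ in range(iterations):
        con = connect()
        try:
            # fdb: event_conduit() creates the listener, begin() starts it,
            # close() unregisters it again.
            conduit = con.event_conduit(list(event_names))
            conduit.begin()
            conduit.close()
        finally:
            con.close()
    return iterations


def run_threads(connect, event_names=("test_event",), threads=5, iterations=1000):
    """Run event_cycle() in several threads; return the total connect count."""
    totals = []
    lock = threading.Lock()

    def worker():
        done = event_cycle(connect, event_names, iterations)
        with lock:
            totals.append(done)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()  # never returns if the server deadlocks
    return sum(totals)


def fdb_connect(dsn="localhost:/tmp/empty.fdb", user="SYSDBA", password="masterkey"):
    """Hypothetical connection factory; DSN and credentials are placeholders."""
    import fdb  # third-party driver, imported lazily so the sketch loads without it
    return fdb.connect(dsn=dsn, user=user, password=password)
```

To run it against a real server, call run_threads(fdb_connect, threads=5, iterations=1000). If the server hangs, the call simply never returns.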

Commits: e4f8a9f 4023436 1888eab

====== Test Details ======

Stored as an ordinary Python script, for use only in a separate POSIX environment.
Must NOT be launched together with other tests from fbt-repo!

See: fbt-repo/files/core_5757.py.txt

@firebird-automations
Collaborator Author

Modified by: Hamish Moffatt (hmoffatt)

Version: 2.5.8 [ 10809 ]

Attachment: gdb.txt [ 13212 ]

Attachment: event_loop.py [ 13213 ]

description: Approximately once a month I see a "firebird superserver" that has stalled.
I have to restart the server with the init script, because connecting to the server becomes impossible.
Firebird writes the following messages to the log file:

Koenig_DB_server (Server) Tue Oct 21 10:47:50 2014
Shutting down the server with 26 active connection(s) to 1 database(s), 0 active service(s)

Koenig_DB_server (Server) Thu Nov 13 07:50:54 2014
Shutting down the server with 16 active connection(s) to 1 database(s), 0 active service(s)

Koenig_DB_server (Server) Thu Jan 15 09:42:31 2015
Shutting down the server with 30 active connection(s) to 1 database(s), 0 active service(s)

Koenig_DB_server (Server) Fri Jan 30 10:24:50 2015
Shutting down the server with 27 active connection(s) to 1 database(s), 0 active service(s)

Since this has been going on for years, I have written a small program with which I was able to reproduce the problem.

1. A connection is established to the database server.
2. 9 events are registered.
3. Wait a short time (milliseconds).
4. Unregister the events.
5. Connection closed.
6. Start all over again.

If it is possible to kill the process that caused the deadlock, the server continues to run normally.
If no events were registered it does not happen.

I was able to reproduce the problem under "super server firebird 2.5 / 3beta2 linux / windows".

Here's a video that shows that problem.
https://datiscum.com/FB_Test_x264.mp4

The program can be downloaded here.
https://datiscum.com/FirebirdTest.7z

The downloads are available only for a few days.

I hope this was helpful and the problem can be eliminated.

Regards,
Sascha Michel

=>

My Firebird server deadlocks often. I am using 2.5.8 on Linux in a mix of superserver, superclassic, 32-bit and 64-bit. All are affected.

When this happens I cannot make any new connections or run any queries on existing connections.

This looks just like CORE4680 which was meant to be fixed in 2.5.5.

I created a Python program which connects to the server, registers an event listener, then disconnects. It runs 5 threads at once. The server deadlocked after about 300 connects (64 connections on each thread). When I killed the Python program, the server resumed. The test database can be any empty database.

I have attached a back trace from the server while it's in this state.

environment: Linux / Windows => Linux

@firebird-automations
Collaborator Author

Commented by: Hamish Moffatt (hmoffatt)

Two observations:

1. I can't reproduce this on a Windows super server.

2. On Linux, I sometimes get this error:

DatabaseError: ('Error while waiting for events:\n- SQLCODE: 0\n- unknown ISC error 0', 0, 0)

and the server is logging

quokka Thu Feb 22 14:53:15 2018
INET/inet_error: invalid socket in packet_receive errno = 22

quokka Thu Feb 22 14:53:15 2018
INET/inet_error: read errno = 104
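
For readers decoding the errno values in the log lines above: errno numbers are platform-specific, and on Linux 22 is EINVAL and 104 is ECONNRESET. A small sketch using only the Python standard library to look them up on the machine where the log was written (describe_errno is a name made up here):

```python
import errno
import os


def describe_errno(code):
    """Return 'number NAME: message' for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return "%d %s: %s" % (code, name, os.strerror(code))


# The two values from the firebird.log excerpt.
for code in (22, 104):
    print(describe_errno(code))
```

Run it on the server host itself, since the same number can mean something different on another operating system.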

@firebird-automations
Collaborator Author

Modified by: @AlexPeshkoff

assignee: Alexander Peshkov [ alexpeshkoff ]

@firebird-automations
Collaborator Author

Commented by: @AlexPeshkoff

I've got:
# ./event_loop.py
File "./event_loop.py", line 14
print "Thread %d connection %d" % (self._number, i)
^
SyntaxError: invalid syntax

Take into account that I'm not familiar with Python. FYI:

python
Python 3.4.5 (default, Jan 18 2018, 13:54:16)
[GCC 6.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

If needed I can quickly switch to 2.7 or 3.6
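
The SyntaxError above comes from print being a statement in Python 2 and a function in Python 3. The %-formatting itself is unchanged, so wrapping the argument in parentheses is enough, and that form also runs on Python 2.7. A minimal sketch (progress_line is a made-up helper, not taken from the attached script):

```python
def progress_line(thread_no, connection_no):
    """Build the progress message printed by the failing line."""
    return "Thread %d connection %d" % (thread_no, connection_no)


# Python 2 only:   print "Thread %d connection %d" % (n, i)
# Python 2 and 3:  print("Thread %d connection %d" % (n, i))
print(progress_line(1, 42))
```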

@firebird-automations
Collaborator Author

Commented by: @AlexPeshkoff

Hamish, can you try with this patch?

@firebird-automations
Collaborator Author

Modified by: @AlexPeshkoff

Attachment: PORT_connecting.patch [ 13214 ]

@firebird-automations
Collaborator Author

Commented by: Hamish Moffatt (hmoffatt)

Here is an updated version that works in Python 3; sorry about that.

@firebird-automations
Collaborator Author

Modified by: Hamish Moffatt (hmoffatt)

Attachment: event_loop.py [ 13215 ]

@firebird-automations
Collaborator Author

Modified by: Hamish Moffatt (hmoffatt)

Attachment: after-patch.txt [ 13216 ]

Attachment: after-patch2.txt [ 13217 ]

@firebird-automations
Collaborator Author

Commented by: Hamish Moffatt (hmoffatt)

Thanks Alexander. It seems better with the patch: once it ran to 1000 connects with 5 threads. But then I tried more threads and it failed again, and I have even seen it fail after just 30 connects each on 5 threads. I attached two new back traces.

I tested superclassic on 64-bit.

@firebird-automations
Collaborator Author

Commented by: Hamish Moffatt (hmoffatt)

I've been able to run the original test (5 threads) to completion (1000 connects each) several times. This is a huge improvement.

However, if I bump it to 20 threads, it still fails after a while.

I am not seeing any
INET/inet_error: invalid socket in packet_receive errno = 22

errors in the log any more (no errors at all actually).

@firebird-automations
Collaborator Author

Commented by: @AlexPeshkoff

Reproduced the failure with 30 threads, on both 2.5 and 3.0. This looks like a previously unknown issue.

@firebird-automations
Collaborator Author

Modified by: @pavel-zotov

status: Open [ 1 ] => Open [ 1 ]

QA Status: Cannot be tested =>

Test Details: Decided skip implementation after letter from hvlad, 28.12.2017 12:33. =>

@firebird-automations
Collaborator Author

Modified by: @pavel-zotov

status: Open [ 1 ] => Open [ 1 ]

QA Status: No test

@firebird-automations
Collaborator Author

Modified by: @AlexPeshkoff

Version: 3.0.3 [ 10810 ]

Version: 4.0 Alpha 1 [ 10731 ]

Version: 3.0.2 [ 10785 ]

Version: 2.5.7 [ 10770 ]

Version: 3.0.1 [ 10730 ]

Version: 2.5.6 [ 10721 ]

Version: 3.0.0 [ 10740 ]

Version: 4.0 Initial [ 10621 ]

Fix Version: 2.5.5 [ 10670 ] =>

@firebird-automations
Collaborator Author

Modified by: @AlexPeshkoff

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 4.0 Beta 1 [ 10750 ]

Fix Version: 3.0.4 [ 10863 ]

Fix Version: 2.5.9 [ 10862 ]

@firebird-automations
Collaborator Author

Commented by: @AlexPeshkoff

With a release (not debug) build and glibc 2.25, the test from this ticket hangs when working with shared memory (except on 2.5 SS). However, this appears to be unrelated to events. Moreover, with an older glibc (for example 2.11.2) there is no hang.
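
Anyone re-running the test may want to record which glibc the test host uses, given the difference reported above. A minimal sketch using the standard library's platform.libc_ver(), which reports the libc that the Python executable is linked against. The helper names are invented here, and 2.25 and 2.11.2 are only the two data points from this comment, not a known boundary.

```python
import platform


def version_tuple(text):
    """Turn '2.11.2' into (2, 11, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_version():
    """Return the glibc version as a tuple, or None if glibc is not detected."""
    lib, ver = platform.libc_ver()
    if lib == "glibc" and ver:
        return version_tuple(ver)
    return None


print("glibc:", glibc_version())
```

Comparing tuples rather than strings avoids ordering mistakes such as "2.9" sorting after "2.25".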

@firebird-automations
Collaborator Author

Commented by: Hamish Moffatt (hmoffatt)

I built the latest R2_5 from git and can't reproduce the failure now. I have deployed to my servers. Thanks.

@firebird-automations
Collaborator Author

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Resolved [ 5 ]

QA Status: No test => Deferred

Test Details: sent letter to dimitr & alex, 09.03.18 12:32. Waiting for reply.

@firebird-automations
Collaborator Author

Commented by: @pavel-zotov

The client (Python) still hangs when running the script on build 4.0.0.920.
There are no messages in firebird.log, but execution of the script simply pauses indefinitely.
I've sent a full memory dump of the Python process to Alex, letter 11-mar-18 20:08.

@firebird-automations
Collaborator Author

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Resolved [ 5 ]

QA Status: Deferred => Done with caveats

Test Details: sent letter to dimitr & alex, 09.03.18 12:32. Waiting for reply. => Stored as an ordinary Python script, for use only in a separate POSIX environment.
Must NOT be launched together with other tests from fbt-repo!

See: fbt-repo/files/core_5757.py.txt

@firebird-automations
Collaborator Author

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Closed [ 6 ]
