Key: CORE-5757
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Critical
Assignee: Alexander Peshkov
Reporter: Hamish Moffatt
Votes: 0
Watchers: 3
Firebird Core

deadlock with events

Created: 22/Feb/18 03:18 AM   Updated: 25/Mar/18 07:40 AM
Component/s: Engine
Affects Version/s: 4.0 Initial, 3.0.0, 2.5.6, 3.0.1, 2.5.7, 3.0.2, 4.0 Alpha 1, 2.5.8, 3.0.3
Fix Version/s: 3.0.4, 4.0 Beta 1, 2.5.9

File Attachments: 1. Text File after-patch.txt (7 kB)
2. Text File after-patch2.txt (21 kB)
3. File event_loop.py (0.8 kB)
4. File event_loop.py (0.5 kB)
5. Text File gdb.txt (5 kB)
6. Text File PORT_connecting.patch (8 kB)

Environment: Linux

QA Status: Done with caveats
Test Details:
Stored as a regular Python script, for use only in a separate POSIX environment.
Must NOT be launched together with other tests from fbt-repo!

See: fbt-repo/files/core_5757.py.txt


Description
My Firebird server deadlocks often. I am using 2.5.8 on Linux in a mix of superserver, superclassic, 32-bit and 64-bit. All are affected.

When this happens I cannot make any new connections or run any queries on existing connections.

This looks just like CORE-4680 which was meant to be fixed in 2.5.5.

I created a Python program that connects to the server, registers an event listener, then disconnects. It runs 5 threads at once. The server deadlocked after about 300 connects (64 connections on each thread). When I killed the Python program, the server resumed. Any empty database works as the test database.

I have attached a backtrace from the server while it's in this state.
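
The attached event_loop.py is not reproduced in this thread. The following is only a minimal sketch of the loop described above (connect, register an event listener, disconnect, repeated on 5 threads). The names run_stress, worker, FakeConnection, FakeConduit and the event name "test_event" are hypothetical, and the event_conduit()/begin()/close() calls are modeled on the fdb driver's event API as an assumption. The fake classes let the sketch run without a server.

```python
import threading


class FakeConduit(object):
    """Stand-in for a driver's event listener object (illustrative only)."""

    def begin(self):
        pass

    def close(self):
        pass


class FakeConnection(object):
    """Stand-in for a database connection (illustrative only)."""

    def event_conduit(self, event_names):
        return FakeConduit()

    def close(self):
        pass


def worker(connect, event_names, iterations, done, lock):
    # One thread: connect, register a listener, unregister, disconnect, repeat.
    for _ in range(iterations):
        con = connect()                            # hypothetical connection factory
        conduit = con.event_conduit(event_names)   # assumed fdb-style call
        conduit.begin()
        conduit.close()
        con.close()
        with lock:
            done[0] += 1


def run_stress(connect, threads=5, iterations=64, event_names=("test_event",)):
    # Start all threads at once and return the number of completed connects.
    done = [0]
    lock = threading.Lock()
    pool = [
        threading.Thread(target=worker,
                         args=(connect, event_names, iterations, done, lock))
        for _ in range(threads)
    ]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return done[0]


print(run_stress(FakeConnection))
```

To exercise a real server, one would pass a factory that calls the driver's own connect function with a real DSN and credentials instead of FakeConnection. On an affected server the join() calls would be expected to block once the deadlock occurs.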

Hamish Moffatt made changes - 22/Feb/18 03:37 AM
Field Original Value New Value
Affects Version/s 2.5.8 [ 10809 ]
Environment Linux / Windows Linux
Description Approximately every month I see a Firebird superserver stall.
I must restart the server with the init script; until then, no connection to the server is possible.
Firebird writes the following messages to the log file:

Koenig_DB_server (Server) Tue Oct 21 10:47:50 2014
        Shutting down the server with 26 active connection(s) to 1 database(s), 0 active service(s)

Koenig_DB_server (Server) Thu Nov 13 07:50:54 2014
        Shutting down the server with 16 active connection(s) to 1 database(s), 0 active service(s)

Koenig_DB_server (Server) Thu Jan 15 09:42:31 2015
        Shutting down the server with 30 active connection(s) to 1 database(s), 0 active service(s)

Koenig_DB_server (Server) Fri Jan 30 10:24:50 2015
        Shutting down the server with 27 active connection(s) to 1 database(s), 0 active service(s)

Because this has been going on for years, I have written a small program that reproduces the problem.

1. A connection is established to the database server.
2. 9 events are registered.
3. The program waits a short time (milliseconds).
4. The events are unregistered.
5. The connection is closed.
6. It starts all over again.

If the process that caused the deadlock is killed, the server continues to run normally.
If no events are registered, the deadlock does not happen.

I was able to reproduce the problem with Firebird superserver 2.5 and 3.0 Beta 2, on Linux and Windows.


Here's a video that shows the problem.
https://datiscum.com/FB_Test_x264.mp4

The program can be downloaded here.
https://datiscum.com/FirebirdTest.7z

The downloads are available only for a few days.

I hope this was helpful and the problem can be eliminated.

Regards,
  Sascha Michel
Attachment gdb.txt [ 13212 ]
Attachment event_loop.py [ 13213 ]
Hamish Moffatt added a comment - 22/Feb/18 04:00 AM - edited
Two observations:

1. I can't reproduce this on a Windows super server.

2. On Linux, I sometimes get this error:

DatabaseError: ('Error while waiting for events:\n- SQLCODE: 0\n- unknown ISC error 0', 0, 0)

and the server logs:

quokka Thu Feb 22 14:53:15 2018
INET/inet_error: invalid socket in packet_receive errno = 22


quokka Thu Feb 22 14:53:15 2018
INET/inet_error: read errno = 104
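
For readers decoding the two errno values in the log lines above: on Linux, 22 is EINVAL (invalid argument) and 104 is ECONNRESET (connection reset by peer). A small sketch that looks the numbers up on the current platform follows; the helper name describe_errno is hypothetical, and the numeric values differ on other operating systems.

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME: message' for a numeric errno on the current platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return "%s: %s" % (name, os.strerror(code))


# The two values reported in the server log.
for code in (22, 104):
    print(code, describe_errno(code))
```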


Alexander Peshkov made changes - 22/Feb/18 12:42 PM
Assignee Alexander Peshkov [ alexpeshkoff ]
Alexander Peshkov added a comment - 22/Feb/18 12:43 PM - edited
I've got:
# ./event_loop.py
  File "./event_loop.py", line 14
    print "Thread %d connection %d" % (self._number, i)
                                  ^
SyntaxError: invalid syntax

Take into account that I'm not familiar with Python. FYI:

python
Python 3.4.5 (default, Jan 18 2018, 13:54:16)
[GCC 6.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

If needed I can quickly switch to 2.7 or 3.6

Alexander Peshkov added a comment - 22/Feb/18 03:57 PM
Hamish, can you try with this patch?

Alexander Peshkov made changes - 22/Feb/18 03:57 PM
Attachment PORT_connecting.patch [ 13214 ]
Hamish Moffatt added a comment - 22/Feb/18 09:59 PM
Here is an updated version that works in Python 3. Sorry about that.
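
The updated attachment is not reproduced in this thread. As a minimal sketch of the kind of change involved, the Python 2 print statement from the traceback above can be written as a call with a single parenthesized argument, which both Python 2.7 and Python 3 accept. The helper name progress_line is hypothetical.

```python
def progress_line(thread_number, connection_index):
    # Hypothetical helper: builds the message the original print statement produced.
    return "Thread %d connection %d" % (thread_number, connection_index)


# A single parenthesized argument is valid under both Python 2.7 and Python 3.
print(progress_line(1, 0))
```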

Hamish Moffatt made changes - 22/Feb/18 09:59 PM
Attachment event_loop.py [ 13215 ]
Hamish Moffatt made changes - 22/Feb/18 10:38 PM
Attachment after-patch.txt [ 13216 ]
Attachment after-patch2.txt [ 13217 ]
Hamish Moffatt added a comment - 22/Feb/18 10:39 PM
Thanks Alexander. It seems better with the patch: it once ran to 1000 connects with 5 threads, but then I tried more threads and it failed again. I have even seen it fail again after just 30 connects each on 5 threads. I attached two new backtraces.

I tested superclassic on 64-bit.

Hamish Moffatt added a comment - 22/Feb/18 11:42 PM - edited
I've been able to run the original test (5 threads) to completion (1000 connects each) several times. This is a huge improvement.

However if I bump it to 20 threads it still fails after a while.

I am not seeing any
         INET/inet_error: invalid socket in packet_receive errno = 22

errors in the log any more (no errors at all actually).

Alexander Peshkov added a comment - 24/Feb/18 10:12 AM
Reproduced the failure with 30 threads, on both 2.5 and 3.0. It looks like a previously unknown issue.

Pavel Zotov made changes - 25/Feb/18 07:45 AM
Status Open [ 1 ] Open [ 1 ]
Test Details Decided skip implementation after letter from hvlad, 28.12.2017 12:33.
QA Status Cannot be tested
Pavel Zotov made changes - 25/Feb/18 07:45 AM
Status Open [ 1 ] Open [ 1 ]
QA Status No test
Alexander Peshkov made changes - 25/Feb/18 05:13 PM
Affects Version/s 3.0.3 [ 10810 ]
Affects Version/s 4.0 Alpha 1 [ 10731 ]
Affects Version/s 3.0.2 [ 10785 ]
Affects Version/s 2.5.7 [ 10770 ]
Affects Version/s 3.0.1 [ 10730 ]
Affects Version/s 2.5.6 [ 10721 ]
Affects Version/s 3.0.0 [ 10740 ]
Affects Version/s 4.0 Initial [ 10621 ]
Fix Version/s 2.5.5 [ 10670 ]
Alexander Peshkov made changes - 25/Feb/18 05:14 PM
Status Open [ 1 ] Resolved [ 5 ]
Fix Version/s 4.0 Beta 1 [ 10750 ]
Fix Version/s 3.0.4 [ 10863 ]
Fix Version/s 2.5.9 [ 10862 ]
Resolution Fixed [ 1 ]
Alexander Peshkov added a comment - 25/Feb/18 05:18 PM
With a release (not debug) build and glibc 2.25, the test from this ticket hangs when working with shared memory (except on 2.5 SS). But this appears to be unrelated to events. Moreover, with an older glibc (for example 2.11.2) there is no hang.

Hamish Moffatt added a comment - 27/Feb/18 10:20 PM
I built the latest R2_5 from git and can't reproduce the failure now. I have deployed to my servers. Thanks.

Pavel Zotov made changes - 09/Mar/18 09:25 AM
Status Resolved [ 5 ] Resolved [ 5 ]
Test Details sent letter to dimitr & alex, 09.03.18 12:32. Waiting for reply.
QA Status No test Deferred
Pavel Zotov added a comment - 12/Mar/18 12:20 PM
The client (Python) still hangs when running the script on build 4.0.0.920.
There are no messages in firebird.log, but execution of the script pauses indefinitely.
I've sent a full memory dump of the Python process to Alex (letter of 11-mar-18 20:08).
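
Because the symptom here is a client that never returns, a test harness needs some way to notice the hang instead of waiting forever. The following is a rough sketch of one way to do that with a timeout; it is not taken from the fbt-repo test, and the helper name finishes_within is hypothetical.

```python
import threading


def finishes_within(target, timeout_s):
    """Run target() in a daemon thread; report whether it ended within timeout_s."""
    runner = threading.Thread(target=target)
    runner.daemon = True      # a hung run must not keep the interpreter alive
    runner.start()
    runner.join(timeout_s)
    return not runner.is_alive()


# A call that returns immediately counts as finished.
print(finishes_within(lambda: None, 1.0))
```

The hung thread is only abandoned, not killed, so a check like this fits best in a separate process that exits once the result has been reported.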

Pavel Zotov made changes - 25/Mar/18 07:17 AM
Status Resolved [ 5 ] Resolved [ 5 ]
Test Details sent letter to dimitr & alex, 09.03.18 12:32. Waiting for reply. Stored as usual Python script, for usage only in separate POSIX environment.
Must NOT be launched together with other tests from fbt-repo!

See: fbt-repo/files/core_5757.py.txt

QA Status Deferred Done with caveats
Pavel Zotov made changes - 25/Mar/18 07:40 AM
Status Resolved [ 5 ] Closed [ 6 ]