
Key: CORE-5757
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Critical
Assignee: Alexander Peshkov
Reporter: Hamish Moffatt
Votes: 0
Watchers: 3
Firebird Core

deadlock with events

Created: 22/Feb/18 03:18 AM   Updated: 25/Mar/18 07:40 AM
Component/s: Engine
Affects Version/s: 4.0 Initial, 3.0.0, 2.5.6, 3.0.1, 2.5.7, 3.0.2, 4.0 Alpha 1, 2.5.8, 3.0.3
Fix Version/s: 3.0.4, 4.0 Beta 1, 2.5.9

File Attachments: 1. after-patch.txt (7 kB)
2. after-patch2.txt (21 kB)
3. event_loop.py (0.8 kB)
4. event_loop.py (0.5 kB)
5. gdb.txt (5 kB)
6. PORT_connecting.patch (8 kB)

Environment: Linux

QA Status: Done with caveats
Test Details:
Stored as a regular Python script, for use only in a separate POSIX environment.
Must NOT be launched together with other tests from fbt-repo!

See: fbt-repo/files/core_5757.py.txt



Description
My Firebird server deadlocks often. I am using 2.5.8 on Linux in a mix of superserver, superclassic, 32-bit and 64-bit. All are affected.

When this happens I cannot make any new connections or run any queries on existing connections.

This looks just like CORE-4680 which was meant to be fixed in 2.5.5.

I created a Python program which connects to the server, registers an event listener, then disconnects. It runs 5 threads at once. The server deadlocked after about 300 connects (64 connections on each thread). When I killed the Python program, the server resumed. The test database is any empty database.
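
The attached event_loop.py is not reproduced in this ticket. A minimal sketch of the kind of loop described above might look like the following. It assumes the fdb driver (`fdb.connect`, `Connection.event_conduit`, and an event conduit that needs `begin()`, as in recent fdb releases). The DSN, credentials, and event name are placeholders, and the `connect` parameter is a hypothetical hook so the loop can be exercised without a server.

```python
import threading


def fdb_connect():
    # Assumption: the fdb driver is installed; DSN and credentials are placeholders.
    import fdb
    return fdb.connect(dsn="localhost:/tmp/test.fdb",
                       user="SYSDBA", password="masterkey")


class EventLoop(threading.Thread):
    """Repeatedly connect, register an event listener, then disconnect."""

    def __init__(self, number, connects, connect=fdb_connect):
        super().__init__()
        self._number = number
        self._connects = connects
        self._connect = connect
        self.completed = 0

    def run(self):
        for i in range(self._connects):
            print("Thread %d connection %d" % (self._number, i))
            con = self._connect()
            try:
                # "test_event" is a hypothetical event name.
                events = con.event_conduit(["test_event"])
                events.begin()   # start listening
                events.close()   # stop listening without waiting for a post
            finally:
                con.close()
            self.completed += 1


def run_threads(threads=5, connects=1000, connect=fdb_connect):
    """Start the worker threads, wait for them, return per-thread counts."""
    workers = [EventLoop(n, connects, connect) for n in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return [w.completed for w in workers]
```

Calling `run_threads()` against a real server corresponds to the 5-thread run described above. If the server deadlocks, the call simply never returns.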

I have attached a back trace from the server while it's in this state.

 All   Comments   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Hamish Moffatt added a comment - 22/Feb/18 04:00 AM - edited
Two observations:

1. I can't reproduce this on a Windows super server.

2. On Linux, I sometimes get this error:

DatabaseError: ('Error while waiting for events:\n- SQLCODE: 0\n- unknown ISC error 0', 0, 0)

and the server is logging

quokka Thu Feb 22 14:53:15 2018
INET/inet_error: invalid socket in packet_receive errno = 22


quokka Thu Feb 22 14:53:15 2018
INET/inet_error: read errno = 104
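
For reference, on Linux errno 22 is EINVAL (invalid argument) and errno 104 is ECONNRESET (connection reset by peer). A small sketch for decoding such values from the log on the machine where they were produced (the helper name `describe_errno` is only illustrative, and the numbering is platform-specific):

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


# The two values seen in firebird.log above.
for code in (22, 104):
    print(code, *describe_errno(code))
```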


Alexander Peshkov added a comment - 22/Feb/18 12:43 PM - edited
I've got:
# ./event_loop.py
  File "./event_loop.py", line 14
    print "Thread %d connection %d" % (self._number, i)
                                  ^
SyntaxError: invalid syntax

Take into account that I'm not familiar with Python. FYI:

python
Python 3.4.5 (default, Jan 18 2018, 13:54:16)
[GCC 6.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

If needed I can quickly switch to 2.7 or 3.6

Alexander Peshkov added a comment - 22/Feb/18 03:57 PM
Hamish, can you try with this patch?

Hamish Moffatt added a comment - 22/Feb/18 09:59 PM
Here is an updated version that works in Python 3, sorry about that.
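
The updated attachment is not reproduced here. The SyntaxError reported earlier comes from the Python 2 `print` statement, which Python 3 rejects. A sketch of the same line written as a function call, which runs on both 2.7 and 3.x because a single parenthesized argument is valid in either (`number` and `i` stand in for the script's `self._number` and loop counter):

```python
def progress_line(number, i):
    """Format the per-connection progress message."""
    return "Thread %d connection %d" % (number, i)


# print(...) with one argument works on Python 2.7 and Python 3.
print(progress_line(0, 1))
```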

Hamish Moffatt added a comment - 22/Feb/18 10:39 PM
Thanks Alexander. It seems better with the patch: once it ran to 1000 connects with 5 threads. But then I tried more threads and it failed again, and I have even seen it fail after just 30 connects each on 5 threads. I attached two new backtraces.

I tested superclassic on 64-bit.

Hamish Moffatt added a comment - 22/Feb/18 11:42 PM - edited
I've been able to run the original test (5 threads) to completion (1000 connects each) several times. This is a huge improvement.

However if I bump it to 20 threads it still fails after a while.

I am not seeing any
         INET/inet_error: invalid socket in packet_receive errno = 22

errors in the log any more (no errors at all actually).

Alexander Peshkov added a comment - 24/Feb/18 10:12 AM
Reproduced the failure with 30 threads, on both 2.5 and 3.0. Looks like a previously unknown issue.

Alexander Peshkov added a comment - 25/Feb/18 05:18 PM
When used with a release (not debug) build and glibc 2.25, the test from this ticket hangs when working with shared memory (except on 2.5 SS). But this appears to be unrelated to events. Moreover, with an older glibc (for example 2.11.2) there is no hang.

Hamish Moffatt added a comment - 27/Feb/18 10:20 PM
I built the latest R2_5 from git and can't reproduce the failure now. I have deployed to my servers. Thanks.

Pavel Zotov added a comment - 12/Mar/18 12:20 PM
Client (Python) still hangs when running the script on build 4.0.0.920.
No messages in firebird.log, but execution of the script just pauses indefinitely.
I've sent a full memory dump of the Python process to Alex, letter 11-mar-18 20:08.