
Key: CORE-1807
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Critical
Assignee: Alexander Peshkov
Reporter: Smarts Broadcast Systems
Votes: 0
Watchers: 0
Firebird Core

fbserver assigned to non-canonical port after abnormal termination

Created: 27/Mar/08 08:06 PM   Updated: 18/Nov/08 12:28 PM
Component/s: Engine
Affects Version/s: 2.0.3
Fix Version/s: 2.5 Alpha 1, 2.1.1, 2.0.5

Time Tracking:
Not Specified

Environment:
The host3 database server box is running Firebird SS version 2.0.3.12981
Red Hat version 9
Linux kernel 2.4.27-5
Intel(R) Celeron(TM) CPU 1200MHz
256576 KB RAM
522104 KB Swap

Planning Status: Unspecified


Description
A synopsis of this matter can be found here:

http://tech.groups.yahoo.com/group/firebird-support/message/93230

host3 is our Firebird database server. It had been running without fault since 7 MAR 2008. Typically it runs for far longer periods than that, but a recent UPS replacement accounts for the relatively short uptime.

A total of 25 clients, 22 from other Linux machines and 3 from Windows XP were connected to the main channel (port 3050). Of these connections, 4 were also using the event channel. At 10:33 AM this date, one of the Windows XP clients (which did have an event alert channel open) exited uncleanly (connection reset by peer) as ``firebird.log'' shows:

host3.xxxx.com (Server) Thu Mar 27 10:33:45 2008
        INET/inet_error: read errno = 104

host3.xxxx.com (Client) Thu Mar 27 10:33:46 2008
        /opt/firebird/bin/fbguard: bin/fbserver terminated abnormally (-1)

host3.xxxx.com (Client) Thu Mar 27 10:33:46 2008
        /opt/firebird/bin/fbguard: guardian starting bin/fbserver

The guardian detected the fault and started the ``fbserver'' process. However, it started listening on port 58798 (suspiciously close to ports where event channels typically are found) as shown by netstat:

tcp 0 0 0.0.0.0:58798 0.0.0.0:* LISTEN 16941/fbserver

It was confirmed (through ``isql'') that normal Firebird (main channel) data connections could be made through this port. Meanwhile, the standard port 3050 was *still* listening, but all attempts to connect via that port hung until TCP timed out about 5-10 minutes later. Here's the ``wedged'' instance on the standard port 3050:

tcp 0 0 0.0.0.0:3050 0.0.0.0:* LISTEN 1812/fbserver
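
A check for this state can be scripted. The following is only a sketch: it assumes netstat lines laid out like the two shown above (protocol, queues, local address, foreign address, state, PID/program), and the function names and the `expected_port` parameter are illustrative, not part of Firebird.

```python
def fbserver_listeners(netstat_text, program="fbserver"):
    """Return (port, pid) pairs for LISTEN sockets owned by `program`."""
    found = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed layout: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or fields[5] != "LISTEN":
            continue
        pid, _, name = fields[6].partition("/")
        if name != program:
            continue  # also skips "-" entries, where the owner is not visible
        port = int(fields[3].rsplit(":", 1)[1])
        found.append((port, int(pid)))
    return found


def unexpected_listeners(netstat_text, expected_port=3050):
    """Listeners owned by fbserver on any port other than expected_port."""
    return [(port, pid)
            for port, pid in fbserver_listeners(netstat_text)
            if port != expected_port]
```

Fed the two netstat lines from this report, `unexpected_listeners` would flag only the listener on port 58798 (PID 16941).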

The measures taken to resolve this were as follows (and at no point was the Firebird server box rebooted):

1). Cleanly exit the server listening on port 58798 with ``kill 16941''.

2). The wedged listener on port 3050 could not be cleanly terminated with a standard ``kill'' so ``kill -9 1812'' was the only alternative.

3). Then ``service firebird start'' was issued. It was confirmed with netstat that the main channel was listening on the standard 3050 port. Subsequent connections were successful, both on the main channel and the event alert channel.

4). Attempts to replicate the problem by forcing unclean termination (connection reset by peer) of WinXP applications did yield the expected 104 errno diagnostics. But none of these caused the abnormal termination of ``fbserver''.
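
For anyone repeating step 4 without a WinXP client at hand, a connection reset can also be forced from a script. This is a sketch under assumptions: it runs against a throwaway loopback listener rather than a Firebird server, it assumes Linux socket semantics (where ECONNRESET is errno 104), and the function name is made up for illustration.

```python
import errno
import socket
import struct


def reset_by_peer_errno():
    """Open a loopback connection, abort it from the client side, and
    return the errno the accepting side sees on its next read."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # any free port, deliberately not 3050
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.settimeout(5)         # do not hang if no reset arrives

    # SO_LINGER with l_onoff=1, l_linger=0 makes close() send RST, not FIN.
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                      struct.pack("ii", 1, 0))
    client.close()

    try:
        server_side.recv(1024)
        return None                   # orderly shutdown, no reset seen
    except OSError as exc:
        return exc.errno
    finally:
        server_side.close()
        listener.close()
```

Comparing the return value with `errno.ECONNRESET` confirms that the accepting side observed the same "connection reset by peer" condition that appears in ``firebird.log'' as read errno = 104.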

If other information is needed, I can provide it. I chronicled all state data to help resolve this.

Comments
Alexander Peshkov added a comment - 28/Mar/08 05:00 AM
If the bug is not reproducible, I have to say that the only chance to know _exactly_ what happened is gone: it was to get core dumps of both instances and of the guardian before killing them.

I see 3 problems in this bug.

First, how could the guardian detect the death of fbserver (and even know its exit status!) when the process continues to work? The code in the guardian that waits for a child to terminate is trivial. I suppose this is more likely a problem in the kernel than in FB.

Next, I confirm that when a second instance of the Firebird server is started (with the primary port busy), it starts listening on a random port instead of giving up. IMO this should be fixed.

And last: why did it die at all? Unfortunately, there may be many reasons, and we can only guess now. Without a core dump it's impossible to say anything useful.
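
To illustrate the first point, here is a minimal sketch of the kind of wait-and-restart loop a guardian runs. It is not the fbguard source; the function name, the `max_restarts` parameter and the restart policy are assumptions made for the example.

```python
import subprocess
import sys


def guard(command, max_restarts=1):
    """Run `command`, wait for it to exit, and restart it after an
    abnormal (non-zero) exit. Returns the exit statuses observed."""
    statuses = []
    for _ in range(max_restarts + 1):
        child = subprocess.Popen(command)
        status = child.wait()      # returns only once this child has terminated
        statuses.append(status)
        if status == 0:
            break                  # clean exit, nothing to restart
    return statuses
```

For example, `guard([sys.executable, "-c", "import sys; sys.exit(3)"])` starts the child twice and records its exit status each time. Since the wait returns only after the child has terminated, a guardian reporting termination while the old process still holds its port is the surprising part.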

Alexander Peshkov added a comment - 03/Apr/08 11:58 AM
After a few attempts to bind the socket to the gds_db port, the error from bind() was ignored, and the following listen() bound it to a random port.
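
The mechanism can be reproduced outside Firebird. The sketch below is not the Firebird code: both function names are hypothetical, and it assumes Linux behaviour, where listen() on a socket that was never successfully bound assigns it an ephemeral port.

```python
import socket


def listen_ignoring_bind_error(port):
    """The faulty pattern: a bind() failure is swallowed and listen() still runs."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("0.0.0.0", port))
    except OSError:
        pass                 # the bug: the error from bind() is ignored
    sock.listen(5)           # assumption: Linux auto-binds to an ephemeral port
    return sock


def listen_checked(port):
    """The safer pattern: give up if the requested port cannot be bound."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("0.0.0.0", port))
        sock.listen(5)
    except OSError:
        sock.close()
        raise
    return sock
```

When the requested port is already held by another listener, the first function returns a socket listening on some other port (visible through `getsockname()`), while the second raises `OSError` and leaves nothing listening.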