If the bug is not reproducible, I have to say that the only chance to learn what _exactly_ happened is gone: that chance was to get core dumps of both server instances and the guardian before killing them.
I see 3 problems in this bug.
First, how could the guardian detect the death of fbserver (and even know its exit status!) while the process continued to work? The guardian code that waits for a child to terminate is trivial. I suppose this is more likely a kernel problem than a Firebird one.
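For reference, the kind of wait logic described here can be sketched in a few lines. This is a minimal Python illustration and not fbguard's actual code; the function name `run_and_report` and the message wording are hypothetical.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Start a child process, block until it ends, and describe how it ended."""
    proc = subprocess.Popen(cmd)
    code = proc.wait()  # returns only after the child has terminated
    if code == 0:
        return "exited normally"
    if code < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        return f"killed by signal {-code}"
    return f"terminated abnormally ({code})"


if __name__ == "__main__":
    # Hypothetical child that simply exits with status 3.
    child = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_and_report(child))
```

A wait of this kind returns only once the operating system reports the child as terminated, which is why a status reported for a still-running process is hard to attribute to the waiting code alone.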
Next, I confirm that when a second instance of the Firebird server is started while the primary port is busy, it starts listening on a random port instead of giving up. IMO this should be fixed.
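The "give up" behaviour suggested here could look roughly like the sketch below. It is a generic Python socket example, not Firebird source code; `listen_or_give_up` is a hypothetical name, and the loopback address is used only to keep the example self-contained.

```python
import errno
import socket


def listen_or_give_up(port, host="127.0.0.1"):
    """Listen on exactly the requested port, or fail loudly.

    Hypothetical helper: it never falls back to another port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(5)
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            # Another process already owns the port: stop here instead of
            # listening somewhere clients will not look.
            raise RuntimeError(
                f"port {port} is already in use; not starting"
            ) from exc
        raise
    return sock
```

With a check like this, a second server instance would fail at startup with a clear error rather than end up on an unexpected port such as 58798.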
And last: why did it die at all? Unfortunately, there may be many reasons, and we can only guess now. Without a core dump it's impossible to say anything useful.
Submitted by: Smarts Broadcast Systems (smartsbroadcast)
A synopsis of this matter can be found here:
http://tech.groups.yahoo.com/group/firebird-support/message/93230 (archive)
host3 is our Firebird database server. It had been running without fault since 7 MAR 2008. Typically it runs for far longer periods than that but a recent change of a UPS accounts for the relatively short uptime.
A total of 25 clients (22 on other Linux machines and 3 on Windows XP) were connected to the main channel (port 3050). Of these connections, 4 were also using the event channel. At 10:33 AM on this date, one of the Windows XP clients (which did have an event alert channel open) exited uncleanly (connection reset by peer), as ``firebird.log'' shows:
host3.xxxx.com (Server) Thu Mar 27 10:33:45 2008
INET/inet_error: read errno = 104
host3.xxxx.com (Client) Thu Mar 27 10:33:46 2008
/opt/firebird/bin/fbguard: bin/fbserver terminated abnormally (-1)
host3.xxxx.com (Client) Thu Mar 27 10:33:46 2008
/opt/firebird/bin/fbguard: guardian starting bin/fbserver
The guardian detected the fault and started the ``fbserver'' process. However, it started listening on port 58798 (suspiciously close to ports where event channels typically are found) as shown by netstat:
tcp 0 0 0.0.0.0:58798 0.0.0.0:* LISTEN 16941/fbserver
It was confirmed (through ``isql'') that normal Firebird (main channel) data connections could be made through THIS port. Meanwhile, the standard port 3050 was *still* listening, but all attempts to connect via that port hung until TCP timed out about 5-10 minutes later. Here's the ``wedged'' instance on the standard port of 3050:
tcp 0 0 0.0.0.0:3050 0.0.0.0:* LISTEN 1812/fbserver
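A state like this, with two ``fbserver'' listeners at once, can be spotted mechanically from netstat output. The following is a minimal Python sketch that assumes the seven-column line layout shown in the two netstat lines above; the function names are hypothetical.

```python
STANDARD_PORT = 3050


def fbserver_listeners(netstat_output):
    """Return (port, pid) pairs for every fbserver socket in LISTEN state."""
    found = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected layout: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or fields[5] != "LISTEN":
            continue
        pid, _, name = fields[6].partition("/")
        if name != "fbserver":
            continue
        port = int(fields[3].rsplit(":", 1)[1])
        found.append((port, int(pid)))
    return found


def unexpected_listeners(netstat_output):
    """Keep only fbserver listeners that are not on the standard port."""
    return [entry for entry in fbserver_listeners(netstat_output)
            if entry[0] != STANDARD_PORT]


if __name__ == "__main__":
    sample = (
        "tcp 0 0 0.0.0.0:58798 0.0.0.0:* LISTEN 16941/fbserver\n"
        "tcp 0 0 0.0.0.0:3050 0.0.0.0:* LISTEN 1812/fbserver\n"
    )
    print(unexpected_listeners(sample))
```

Note that this only reports where the server is listening. It cannot tell whether a listener is wedged, since port 3050 still appeared in LISTEN state here.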
The measures taken to resolve this were as follows (and at no point was the Firebird server box rebooted):
1). Cleanly exit the server listening on port 58798 with ``kill 16941''.
2). The wedged listener on port 3050 could not be cleanly terminated with a standard ``kill'' so ``kill -9 1812'' was the only alternative.
3). Then ``service firebird start'' was issued. It was confirmed with netstat that the main channel was listening on the standard 3050 port. Subsequent connections were successful, both on the main channel and the event alert channel.
4). Attempts to replicate the problem by forcing unclean termination (connection reset by peer) of WinXP applications did yield the expected 104 errno diagnostics. But none of these caused the abnormal termination of ``fbserver''.
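For anyone wanting to repeat step 4 without a Windows client, a ``connection reset by peer'' can be provoked on a local socket pair. This is a minimal Python sketch, not the procedure used in the report: it assumes a POSIX system where the linger option is two C ints, it talks to a throwaway local listener rather than to Firebird, and `provoke_reset` is a hypothetical name.

```python
import errno
import socket
import struct


def provoke_reset():
    """Make a client close with RST and return the errno the server side sees."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # throwaway listener on a free local port
    server.listen(1)
    client = socket.create_connection(server.getsockname())
    conn, _ = server.accept()
    conn.settimeout(5)  # avoid blocking forever if no reset arrives
    # SO_LINGER enabled with a zero timeout: close() aborts the connection
    # with RST instead of the normal FIN handshake.
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                      struct.pack("ii", 1, 0))
    client.close()
    try:
        conn.recv(1024)
        return None  # no error was raised
    except OSError as exc:
        return exc.errno
    finally:
        conn.close()
        server.close()


if __name__ == "__main__":
    print(provoke_reset() == errno.ECONNRESET)
```

On Linux, ECONNRESET has the value 104, matching the ``read errno = 104'' entry in the log.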
If other information is needed, I can provide it. I chronicled all state data to help resolve this.
Commits: ecff2eb 9a3adc9 6699ab0