Issue Details (XML | Word | Printable)

Key: CORE-3387
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Dmitry Yemanov
Reporter: vander clock stephane
Votes: 0
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
Firebird Core

Client library could hang infinitely waiting for a reply packet on a forcibly disconnected server socket

Created: 15/Mar/11 04:47 PM   Updated: 23/Apr/13 11:46 AM
Component/s: Engine
Affects Version/s: 2.5.0
Fix Version/s: 2.5.1, 3.0 Alpha 1

Time Tracking:
Not Specified

Environment: windows 2008 R2 64 bit, FBserver version 26230
Issue Links:
Relate
 

Planning Status: Unspecified


 Description  « Hide
hello,

i have several client that are running for some days now waiting that the server answer to them.
so at the end they look like "frozen"!

in the server side, in the $MON table i can see that the client are well attached from the 3/11/2011 (so 5 days) but the server don't answer them ! client are waiting for 5 days now one answer from the server that never arrive ...

the server version is 26230. the client use the original 2.5 client DLL

i still not kill the client (nor the server too) so i can still do some stuff on it if you need (like full dump?)

thank by advance


 All   Comments   Work Log   Change History   Version Control   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Dmitry Yemanov added a comment - 15/Mar/11 04:59 PM
This is the second report of such a problem already. It seems we have a regression.

Dmitry Yemanov added a comment - 15/Mar/11 05:15 PM
What v2.5.1 build have you used priorly, that didn't have this issue?

vander clock stephane added a comment - 15/Mar/11 05:54 PM
Hard to say, in the 2.5 original i also have this kind of issue (but much much more often, but not sure it's the same). I try too little time the beta release between to 2.5 original and the 2.5 version 2630 to say with don't have this issue.
what i can say is that this issue in the 2630 don't appear so often as in the 2.5 original (but on the other way the FB 2.5 original crash close to every days and the 26230 never crash so hard to know)

i don't know, if you want i can try to attach a debuger on the fb_inet_server.exe to see what's happen ?

vander clock stephane added a comment - 15/Mar/11 10:31 PM
I forget, just say me if i can kill the "frozen" clients or if you need them to do some tests ...

vander clock stephane added a comment - 23/Mar/11 07:58 AM
the same error again ....

today the behavior is little different

one client was running several day with 2 connections on the FB server
1 waiting some EVENT;
1 doing some queries;

yesterday the client become frozen. when i look in the $mon table in the server side i see only the EVENT connection.
so it's mean that the client was disconnected (but why? as the event is still connected it's not look like a network problem)
but the client not detect it and continue to wait undefinitivelly the answer of the server

this bug become very important because it's block all the time our application :(
do you think i can take a newer version than the FB 26230 ?

vander clock stephane added a comment - 24/Mar/11 11:39 AM
today a little different,

i have on the server

Thu Mar 24 11:07:35 2011
INET/inet_error: read errno = 10054

and on the client

Unable to complete network request to host "xxx".
Error reading data from the connection.
Unsuccessful execution caused by a system error that precludes successful execution of subsequent statements
Error Code: 401

BUT i have several other application connecting to the FB server on the same client machine that are still working correctly !

this error (10054 = A connection was forcibly closed by a peer.) look like morea bug than a network problem

vander clock stephane added a comment - 06/Apr/11 09:52 AM
What the status of this bug ? it's appear in average one time a week but it's very strong because when it's happen it's "froze" the client Application :(

more info (probably it's repeat but can not be bad in case i forget something) :

We have one client application that run at full time in background.

This client open 2 connections on the database, one for query and one for select/update

Every second this client application do a query on the database (select + insert/update). so the connection is never "idle" more than 1 second. but after days of work (1, 2, 7, etc..) then the connection freeze waiting the answer of the server that never arrive.

on the server side (when it's freeze), i can only see (in the $MON table) the connection of the EVENT, but not the second connection that do Select/update.
The $Mon table say that the connection of the event it's open from the 3/29/2011

in the log of the client application i see that the second connection is frozen from the 04-04-11 14:43:19
in the Server side, in the firebird.log i can see this :

FBSERVER Mon Apr 04 14:43:48 2011
INET/inet_error: read errno = 10054

any idea ? did need to update the current version of the FBServer and FB client DLL ? actually client = legacy 2.5 and server = FB 26230

Dmitry Yemanov added a comment - 06/Apr/11 09:59 AM
Different solutions are being tested on the customer site who has it more or less reproducible. I will report here as soon as we have something concrete found.

vander clock stephane added a comment - 06/Apr/11 10:08 AM
thanks dmitry, as soon you have something, you can come also to me to test it ...

Vlad Khorsun added a comment - 15/Apr/11 12:03 PM
The fix is committed (into v2.5 source tree), please try the next snapshot build

vander clock stephane added a comment - 15/Apr/11 05:22 PM
thanks Vlad !

but in the meantime Dmitry send me 3 differents versions to test. I actually test them now, and the 1rt version still have the bug, i m actually testing the second... did i meed to stop my test and use instead the next snapshot build ?

stephane

Dmitry Yemanov added a comment - 15/Apr/11 05:31 PM
The issue should be resolved in the snapshot build, so it's worth trying it now. If it still locks up, you can return to my builds :-)

vander clock stephane added a comment - 25/Apr/11 06:00 PM
bad new, unfortunatly it's seam to not be resolved, i just have now a client that is waiting for already 2 hours the answer of the server and by this way enter in a "frozen" state :( on the server side i can see one EVENT connection with the client, but no any other connection (i look in the $Mon table).

on the server i just have something like this at the time the client start to wait the server answer :
Mon Apr 25 17:04:25 2011
INET/inet_error: read errno = 10054

FB version 26263 ... any idea ?

vander clock stephane added a comment - 26/Apr/11 07:41 AM
i confirm, now 24 hours passed the the client is still frozen :(

Dmitry Yemanov added a comment - 26/Apr/11 10:44 AM
What do you mean by the "EVENT connection"? Do you set up a separate connection to wait for events and this connection is not used for any client queries?

vander clock stephane added a comment - 26/Apr/11 11:01 AM
at the begining, when the client is launch 2 connections are open to the server :
1 Event connection to listen to any database server event (mostly not used, not too much event are raised) ;
1 Query connection to select, update, etc... ;

here we see that the query connection is lost (not present anymore in the $Mon table) BUT the Event connection is still show in the $mon table and still aleave ! this mostly show that the probleme look like not a "hardware network" probleme.

it's not really hard to reproduce this bug (on our database), it's appear after 1/2 days of running

if you want i can make a dump of the client to know what he is waiting for ?

Dmitry Yemanov added a comment - 26/Apr/11 01:31 PM
I'm not sure the dump would tell us something useful, but if it's possible for you then please proceed and send it to me by email (or make available for download and send me the link).
BTW, have you looked what netstat reports on both endpoints of the connection? Are the network connections alive or not?

vander clock stephane added a comment - 26/Apr/11 06:34 PM - edited
I make a little mistake, when i try to launch the debuger to make the dump i close the client by mistake :(
doesn't matter, i run it now again and wait for the next mistake and i will see the netstat at this time

vander clock stephane added a comment - 08/May/11 11:35 AM
Hello everyone,

Now i have the confirmation that this bug is still present :(
FB version 26263

One of my client application if waiting the serverr answer for now 2 days. this time i setup more log in my client application and i can say that

* The client open 2 connections, one for event and one for queries
* the client work now for several days (between 1 and 7) without any problem
* and The client Start a new transaction (OK)
* The client call just after DSQLExecuteImmediate <= here the problem is, server never answer, client wait for unlimited time

the exact SQL Query was
[05-07-11 18:47:42][ThreadId: 2136] Insert Into transac_comment (id_transac, Comment_Type, Language, Comment) Values ('VBOQ-T1885',3,'FRA','Ce bel appartement de 2 chambres se trouve sur la première Étage d''un immeuble bien entretenu avec un total de 3 unités d''habitation et la maison idéale pour une seule nouvelle ou une belle paire de tous les âges. <br /><br /> L''appartement ensoleillé est réparti sur 65 mètres carrés et fonctionne grâce à la nouvelle de nombreux ...')

NB: this SQL is valid, just to show to you in case

on the server side i can see in the $Mon table the Event connection, but not the Queries connection


doing netstat on the client computer show this :

Active Connections

  Proto Local Address Foreign Address State PID
 [WinRETranslator.exe]
  TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132
 [WinRETranslator.exe]
  TCP 94.210.76.118:49166 94.210.76.120:49261 ESTABLISHED 2132
 [WinRETranslator.exe]
  TCP 94.210.76.118:49167 94.210.76.120:3050 ESTABLISHED 2132
 [WinRETranslator.exe]
  TCP 94.210.76.118:50791 94.210.76.120:3050 ESTABLISHED 3588

NB: i remove here the connection not related to WinRETranslator.exe (pur client application) in this report


doint netstat on the server , i can found only thes row connected to our client application:

Active Connections

  Proto Local Address Foreign Address State
 TCP 94.210.76.120:49261 94.210.76.118:49166 ESTABLISHED
 TCP 94.210.76.120:3050 94.210.76.118:49167 ESTABLISHED
 TCP 94.210.76.120:3050 94.210.76.118:50791 ESTABLISHED

so i can not found on the server the line for
 
 TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132


thanks for you help !

vander clock stephane added a comment - 08/May/11 11:52 AM
sorry i make a litte error in pasting the netstat result


doing netstat on the client computer show this :

Active Connections

  Proto Local Address Foreign Address State PID
  TCP 94.210.76.118:49164 94.210.76.120:3050 ESTABLISHED 2132
  TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132
  TCP 94.210.76.118:49166 94.210.76.120:49261 ESTABLISHED 2132
  TCP 94.210.76.118:49167 94.210.76.120:3050 ESTABLISHED 2132

NB: i remove here the connection not related to WinRETranslator.exe (pur client application) in this report


doint netstat on the server , i can found these row connected to our client application:

Active Connections

  Proto Local Address Foreign Address State
  TCP 94.210.76.120:3050 94.210.76.118:49164 ESTABLISHED
  TCP 94.210.76.120:49261 94.210.76.118:49166 ESTABLISHED
  TCP 94.210.76.120:3050 94.210.76.118:49167 ESTABLISHED


so i can not found on the server the line for
  
 TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132


stephane

Vlad Khorsun added a comment - 09/May/11 01:22 PM
Stephane, please explain :
- you said "The client open 2 connections, one for event and one for queries". It means client should have 3 sockets opened : 2 with remote port 3050 for database connection and one with different port - for events processing
- but you show 4 sokets opened by client :

  TCP 94.210.76.118:49164 94.210.76.120:3050 ESTABLISHED 2132
  TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132
  TCP 94.210.76.118:49166 94.210.76.120:49261 ESTABLISHED 2132
  TCP 94.210.76.118:49167 94.210.76.120:3050 ESTABLISHED 2132

and 3 of them is for regular database connection.

vander clock stephane added a comment - 09/May/11 02:24 PM - edited
hmm, i look the code and it's possible that more than 2 connections are open for the queries. in fact another connection can be open for "log" that simply store in a table all the queries executed by the first connection.

vander clock stephane added a comment - 11/May/11 09:03 AM
ok, 5 days passed and the client is still waiting the server answer :(

net stat still show this in the client :

  Proto Local Address Foreign Address State PID
  TCP 94.210.76.118:49164 94.210.76.120:3050 ESTABLISHED 2132
  TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132
  TCP 94.210.76.118:49166 94.210.76.120:49261 ESTABLISHED 2132
  TCP 94.210.76.118:49167 94.210.76.120:3050 ESTABLISHED 2132

and in the server still the same

 Proto Local Address Foreign Address State
  TCP 94.210.76.120:3050 94.210.76.118:49164 ESTABLISHED
  TCP 94.210.76.120:49261 94.210.76.118:49166 ESTABLISHED
  TCP 94.210.76.120:3050 94.210.76.118:49167 ESTABLISHED

so it's really strange that the client show
TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132
as established but on the server side their is nothing about it :(

i was waiting few days hopping a time out or something like this will close the connection but no :(

Dmitry Yemanov added a comment - 11/May/11 10:12 AM
what about the memory dump for the client app? Also, does number of fb_inet_server processes correspond to select count(*) from mon$attachments?

vander clock stephane added a comment - 11/May/11 10:34 AM
dammed ... i just close today the client app before reading your message :(
but i thing i have another client application that is in the same situation, i will try to see if i can do a memory dump of it

> Also, does number of fb_inet_server processes correspond to select count(*) from mon$attachments?

i m in Firebird super classic, how to see that ?

Dmitry Yemanov added a comment - 11/May/11 10:40 AM
Ah, sorry, I thought you were using Classic.

vander clock stephane added a comment - 11/May/11 01:14 PM
ok, i have another client application that is frozen for 8 days too in the same situation.
so i just create a dump, and put it in the same server i send you the last time in private message. if you don't find it, i can send the login password again to you..

thanks !

vander clock stephane added a comment - 14/May/11 07:23 PM
Actually i m testing the new version of FBClient, and it's look fine for now (but need one or two week to know for sure)

but i just want to say also that 4 days ago, i was force to hardly kill some clients (because of server overload).
i just today see that their connection is still present in the $MON table but the client is dead for several days now !
Some of this connection have even an active status...

in the $MON table i can see 221 connections, and doing NetStat i can see around 286 established connections

So just to say that the trouble seam to not be only in the fbclient.dll side but also in the server side ...

vander clock stephane added a comment - 29/May/11 09:43 PM
The new version of FBclient that dmitry send me look like it's correct the bug !
from the time i install them (more than 2 weeks ago now) i don't have any more any frozen client !

congratulation dmitry and many thanks !

i thing we can close this bug as resolved now !
also when the modification will be commited in the daily snapshot ?

thanks again
stéphane

Dmitry Yemanov added a comment - 30/May/11 09:34 AM
Thanks for the good news. The fix will be committed today and thus will be available in the tomorrow's snapshot build.

Dmitry Yemanov added a comment - 30/May/11 09:51 AM
While fixing CORE-1763, it was decided to not set the SO_KEEPALIVE flag for the client socket. But the lack of keep-alive packets is now proved causing problems in this ticket, supposedly due to some networking (or firewall related) troubles in the production environment. The solution is to turn keep-alive packets on.

vander clock stephane added a comment - 01/Jun/11 06:18 AM
one thing, is the Gbak use the Fbclient.dll or it's own connection protocol ? because i also have one backup that is frozen for several days ... but i guess that gbak not use fbclient.dll but that the bug in gbak is simillar to the bug in the fbclient.dll ?

Dmitry Yemanov added a comment - 01/Jun/11 09:10 AM
GBAK is a regular client application, i.e. it uses fbclient.dll. Perhaps this is a different issue.

vander clock stephane added a comment - 05/Jun/11 05:43 PM
sorry for late answer, was in travel..
hmm, that not sound good because it's pretty like the same issue :(

on the client where gtat run net stat show :
   TCP 94.210.76.118:53154 94.210.76.120:3050 ESTABLISHED [gbak.exe]

on the server nada :(
i can not find any connection on the port 53154 :(
in the $Mon table also, nothing about the gbak

the gbak is frozen from the 24 may ...
when it froze it's was exactly at this state :

gbak:25900000 records written
gbak:25920000 records written

i just make a dump of the gbak and put it in the ftp in case you need...

thanks !

vander clock stephane added a comment - 06/Jun/11 07:32 AM
another remark:

on the firebird server i can see 4 EVENT connections alive that was open by client that are gone serveral week ago !
i see theses connections in the $Mon table, i don't know how to recognise them in the net stat table

These connections was not open with client that have the new FBClient.dll (don't know if matter, because here it's on the server side)

the server version is still the 26230

vander clock stephane added a comment - 25/Aug/11 07:46 PM
hello,

I have now often the GBAK that freeze (one time a week). but except the GBAK i have much much (much) lower connection frozen bug than before !!
do you thing that using the fb release canditate 1 can resolve this issue ? i still use the the server version is 2.5.1.26263

gbak is still connected now (from several days) if you want i can also do a memory dump (or something else?)

thanks !

Dmitry Yemanov added a comment - 26/Aug/11 04:24 PM
It's worth trying v2.5.1 RC1 regardless. A memory dump (ideally, for both frozen GBAK and the server it's supposedly connected to) is appreciated, as well as the network stats (also, for both server and client) for the moment.

vander clock stephane added a comment - 28/Aug/11 08:36 PM - edited
ok, in more details :

on the Gbak computer the netstat show :

  TCP 93.122.12.118:52939 93.122.12.120:3050 ESTABLISHED
  [GBAK.EXE]

on the firebird server computer ... nothing ! the gbak client is not in the $MON table nor in the netstat !

so it's simply look like that GBAK don't detect the deconnection from the server and stay connected forever :(

i put the copy of the Gbak Dump file n the ftp server with i send you the credential by email the last time ...

thanks !