Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client library could hang infinitely waiting for a reply packet on a forcibly disconnected server socket [CORE3387] #3753

Closed
firebird-automations opened this issue Mar 15, 2011 · 46 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: vander clock stephane (arkadia)

Is related to CORE1763

hello,

i have several client that are running for some days now waiting that the server answer to them.
so at the end they look like "frozen"!

in the server side, in the $MON table i can see that the client are well attached from the 3/11/2011 (so 5 days) but the server don't answer them ! client are waiting for 5 days now one answer from the server that never arrive ...

the server version is 26230. the client use the original 2.5 client DLL

i still not kill the client (nor the server too) so i can still do some stuff on it if you need (like full dump?)

thank by advance

Commits: 070718b e04bf21 9087961

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

This is the second report of such a problem already. It seems we have a regression.

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

What v2.5.1 build have you used priorly, that didn't have this issue?

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

Hard to say, in the 2.5 original i also have this kind of issue (but much much more often, but not sure it's the same). I try too little time the beta release between to 2.5 original and the 2.5 version 2630 to say with don't have this issue.
what i can say is that this issue in the 2630 don't appear so often as in the 2.5 original (but on the other way the FB 2.5 original crash close to every days and the 26230 never crash so hard to know)

i don't know, if you want i can try to attach a debuger on the fb_inet_server.exe to see what's happen ?

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

I forget, just say me if i can kill the "frozen" clients or if you need them to do some tests ...

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

the same error again ....

today the behavior is little different

one client was running several day with 2 connections on the FB server
1 waiting some EVENT;
1 doing some queries;

yesterday the client become frozen. when i look in the $mon table in the server side i see only the EVENT connection.
so it's mean that the client was disconnected (but why? as the event is still connected it's not look like a network problem)
but the client not detect it and continue to wait undefinitivelly the answer of the server

this bug become very important because it's block all the time our application :(
do you think i can take a newer version than the FB 26230 ?

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

today a little different,

i have on the server

Thu Mar 24 11:07:35 2011
INET/inet_error: read errno = 10054

and on the client

Unable to complete network request to host "xxx".
Error reading data from the connection.
Unsuccessful execution caused by a system error that precludes successful execution of subsequent statements
Error Code: 401

BUT i have several other application connecting to the FB server on the same client machine that are still working correctly !

this error (10054 = A connection was forcibly closed by a peer.) look like morea bug than a network problem

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

What the status of this bug ? it's appear in average one time a week but it's very strong because when it's happen it's "froze" the client Application :(

more info (probably it's repeat but can not be bad in case i forget something) :

We have one client application that run at full time in background.

This client open 2 connections on the database, one for query and one for select/update

Every second this client application do a query on the database (select + insert/update). so the connection is never "idle" more than 1 second. but after days of work (1, 2, 7, etc..) then the connection freeze waiting the answer of the server that never arrive.

on the server side (when it's freeze), i can only see (in the $MON table) the connection of the EVENT, but not the second connection that do Select/update.
The $Mon table say that the connection of the event it's open from the 3/29/2011

in the log of the client application i see that the second connection is frozen from the 04-04-11 14:43:19
in the Server side, in the firebird.log i can see this :

FBSERVER Mon Apr 04 14:43:48 2011
INET/inet_error: read errno = 10054

any idea ? did need to update the current version of the FBServer and FB client DLL ? actually client = legacy 2.5 and server = FB 26230

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

Different solutions are being tested on the customer site who has it more or less reproducible. I will report here as soon as we have something concrete found.

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

thanks dmitry, as soon you have something, you can come also to me to test it ...

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

The fix is committed (into v2.5 source tree), please try the next snapshot build

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

thanks Vlad !

but in the meantime Dmitry send me 3 differents versions to test. I actually test them now, and the 1rt version still have the bug, i m actually testing the second... did i meed to stop my test and use instead the next snapshot build ?

stephane

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

The issue should be resolved in the snapshot build, so it's worth trying it now. If it still locks up, you can return to my builds :-)

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

bad new, unfortunatly it's seam to not be resolved, i just have now a client that is waiting for already 2 hours the answer of the server and by this way enter in a "frozen" state :( on the server side i can see one EVENT connection with the client, but no any other connection (i look in the $Mon table).

on the server i just have something like this at the time the client start to wait the server answer :
Mon Apr 25 17:04:25 2011
INET/inet_error: read errno = 10054

FB version 26263 ... any idea ?

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

i confirm, now 24 hours passed the the client is still frozen :(

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

What do you mean by the "EVENT connection"? Do you set up a separate connection to wait for events and this connection is not used for any client queries?

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

at the begining, when the client is launch 2 connections are open to the server :
1 Event connection to listen to any database server event (mostly not used, not too much event are raised) ;
1 Query connection to select, update, etc... ;

here we see that the query connection is lost (not present anymore in the $Mon table) BUT the Event connection is still show in the $mon table and still aleave ! this mostly show that the probleme look like not a "hardware network" probleme.

it's not really hard to reproduce this bug (on our database), it's appear after 1/2 days of running

if you want i can make a dump of the client to know what he is waiting for ?

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

I'm not sure the dump would tell us something useful, but if it's possible for you then please proceed and send it to me by email (or make available for download and send me the link).
BTW, have you looked what netstat reports on both endpoints of the connection? Are the network connections alive or not?

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

I make a little mistake, when i try to launch the debuger to make the dump i close the client by mistake :(
doesn't matter, i run it now again and wait for the next mistake and i will see the netstat at this time

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

Hello everyone,

Now i have the confirmation that this bug is still present :(
FB version 26263

One of my client application if waiting the serverr answer for now 2 days. this time i setup more log in my client application and i can say that

* The client open 2 connections, one for event and one for queries
* the client work now for several days (between 1 and 7) without any problem
* and The client Start a new transaction (OK)
* The client call just after DSQLExecuteImmediate <= here the problem is, server never answer, client wait for unlimited time

the exact SQL Query was
[05-07-11 18:47:42][ThreadId: 2136] Insert Into transac_comment (id_transac, Comment_Type, Language, Comment) Values ('VBOQ-T1885',3,'FRA','Ce bel appartement de 2 chambres se trouve sur la première Étage d''un immeuble bien entretenu avec un total de 3 unités d''habitation et la maison idéale pour une seule nouvelle ou une belle paire de tous les âges. <br /><br /> L''appartement ensoleillé est réparti sur 65 mètres carrés et fonctionne grâce à la nouvelle de nombreux ...')

NB: this SQL is valid, just to show to you in case

on the server side i can see in the $Mon table the Event connection, but not the Queries connection

doing netstat on the client computer show this :

Active Connections

Proto Local Address Foreign Address State PID
[WinRETranslator.exe]
TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132
[WinRETranslator.exe]
TCP 94.210.76.118:49166 94.210.76.120:49261 ESTABLISHED 2132
[WinRETranslator.exe]
TCP 94.210.76.118:49167 94.210.76.120:3050 ESTABLISHED 2132
[WinRETranslator.exe]
TCP 94.210.76.118:50791 94.210.76.120:3050 ESTABLISHED 3588

NB: i remove here the connection not related to WinRETranslator.exe (pur client application) in this report

doint netstat on the server , i can found only thes row connected to our client application:

Active Connections

Proto Local Address Foreign Address State
TCP 94.210.76.120:49261 94.210.76.118:49166 ESTABLISHED
TCP 94.210.76.120:3050 94.210.76.118:49167 ESTABLISHED
TCP 94.210.76.120:3050 94.210.76.118:50791 ESTABLISHED

so i can not found on the server the line for

TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132

thanks for you help !

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

sorry i make a litte error in pasting the netstat result

doing netstat on the client computer show this :

Active Connections

Proto Local Address Foreign Address State PID
TCP 94.210.76.118:49164 94.210.76.120:3050 ESTABLISHED 2132
TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132
TCP 94.210.76.118:49166 94.210.76.120:49261 ESTABLISHED 2132
TCP 94.210.76.118:49167 94.210.76.120:3050 ESTABLISHED 2132

NB: i remove here the connection not related to WinRETranslator.exe (pur client application) in this report

doint netstat on the server , i can found these row connected to our client application:

Active Connections

Proto Local Address Foreign Address State
TCP 94.210.76.120:3050 94.210.76.118:49164 ESTABLISHED
TCP 94.210.76.120:49261 94.210.76.118:49166 ESTABLISHED
TCP 94.210.76.120:3050 94.210.76.118:49167 ESTABLISHED

so i can not found on the server the line for

TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132

stephane

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

Stephane, please explain :
- you said "The client open 2 connections, one for event and one for queries". It means client should have 3 sockets opened : 2 with remote port 3050 for database connection and one with different port - for events processing
- but you show 4 sokets opened by client :

TCP 94.210.76.118:49164 94.210.76.120:3050 ESTABLISHED 2132
TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132
TCP 94.210.76.118:49166 94.210.76.120:49261 ESTABLISHED 2132
TCP 94.210.76.118:49167 94.210.76.120:3050 ESTABLISHED 2132

and 3 of them is for regular database connection.

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

hmm, i look the code and it's possible that more than 2 connections are open for the queries. in fact another connection can be open for "log" that simply store in a table all the queries executed by the first connection.

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

ok, 5 days passed and the client is still waiting the server answer :(

net stat still show this in the client :

Proto Local Address Foreign Address State PID
TCP 94.210.76.118:49164 94.210.76.120:3050 ESTABLISHED 2132
TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132
TCP 94.210.76.118:49166 94.210.76.120:49261 ESTABLISHED 2132
TCP 94.210.76.118:49167 94.210.76.120:3050 ESTABLISHED 2132

and in the server still the same

Proto Local Address Foreign Address State
TCP 94.210.76.120:3050 94.210.76.118:49164 ESTABLISHED
TCP 94.210.76.120:49261 94.210.76.118:49166 ESTABLISHED
TCP 94.210.76.120:3050 94.210.76.118:49167 ESTABLISHED

so it's really strange that the client show
TCP 94.210.76.118:49165 94.210.76.120:3050 ESTABLISHED 2132
as established but on the server side their is nothing about it :(

i was waiting few days hopping a time out or something like this will close the connection but no :(

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

what about the memory dump for the client app? Also, does number of fb_inet_server processes correspond to select count(*) from mon$attachments?

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

dammed ... i just close today the client app before reading your message :(
but i thing i have another client application that is in the same situation, i will try to see if i can do a memory dump of it

> Also, does number of fb_inet_server processes correspond to select count(*) from mon$attachments?

i m in Firebird super classic, how to see that ?

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

Ah, sorry, I thought you were using Classic.

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

ok, i have another client application that is frozen for 8 days too in the same situation.
so i just create a dump, and put it in the same server i send you the last time in private message. if you don't find it, i can send the login password again to you..

thanks !

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

Actually i m testing the new version of FBClient, and it's look fine for now (but need one or two week to know for sure)

but i just want to say also that 4 days ago, i was force to hardly kill some clients (because of server overload).
i just today see that their connection is still present in the $MON table but the client is dead for several days now !
Some of this connection have even an active status...

in the $MON table i can see 221 connections, and doing NetStat i can see around 286 established connections

So just to say that the trouble seam to not be only in the fbclient.dll side but also in the server side ...

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

The new version of FBclient that dmitry send me look like it's correct the bug !
from the time i install them (more than 2 weeks ago now) i don't have any more any frozen client !

congratulation dmitry and many thanks !

i thing we can close this bug as resolved now !
also when the modification will be commited in the daily snapshot ?

thanks again
stéphane

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

Thanks for the good news. The fix will be committed today and thus will be available in the tomorrow's snapshot build.

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

assignee: Dmitry Yemanov [ dimitr ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

summary: Server and client are connected, but server not anwser to client and client is waiting indefinitively server answer ! => Client library could hang infinitely waiting for a reply packet on a forcibly disconnected server socket

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

Link: This issue is related to CORE1763 [ CORE1763 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

While fixing CORE1763, it was decided to not set the SO_KEEPALIVE flag for the client socket. But the lack of keep-alive packets is now proved causing problems in this ticket, supposedly due to some networking (or firewall related) troubles in the production environment. The solution is to turn keep-alive packets on.

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

Version: 2.5.0 [ 10221 ]

Version: 2.5.1 [ 10333 ] =>

@firebird-automations
Copy link
Collaborator Author

Modified by: @dyemanov

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 2.5.1 [ 10333 ]

Fix Version: 3.0 Alpha 1 [ 10331 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

one thing, is the Gbak use the Fbclient.dll or it's own connection protocol ? because i also have one backup that is frozen for several days ... but i guess that gbak not use fbclient.dll but that the bug in gbak is simillar to the bug in the fbclient.dll ?

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

GBAK is a regular client application, i.e. it uses fbclient.dll. Perhaps this is a different issue.

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

sorry for late answer, was in travel..
hmm, that not sound good because it's pretty like the same issue :(

on the client where gtat run net stat show :
TCP 94.210.76.118:53154 94.210.76.120:3050 ESTABLISHED [gbak.exe]

on the server nada :(
i can not find any connection on the port 53154 :(
in the $Mon table also, nothing about the gbak

the gbak is frozen from the 24 may ...
when it froze it's was exactly at this state :

gbak:25900000 records written
gbak:25920000 records written

i just make a dump of the gbak and put it in the ftp in case you need...

thanks !

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

another remark:

on the firebird server i can see 4 EVENT connections alive that was open by client that are gone serveral week ago !
i see theses connections in the $Mon table, i don't know how to recognise them in the net stat table

These connections was not open with client that have the new FBClient.dll (don't know if matter, because here it's on the server side)

the server version is still the 26230

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

hello,

I have now often the GBAK that freeze (one time a week). but except the GBAK i have much much (much) lower connection frozen bug than before !!
do you thing that using the fb release canditate 1 can resolve this issue ? i still use the the server version is 2.5.1.26263

gbak is still connected now (from several days) if you want i can also do a memory dump (or something else?)

thanks !

@firebird-automations
Copy link
Collaborator Author

Commented by: @dyemanov

It's worth trying v2.5.1 RC1 regardless. A memory dump (ideally, for both frozen GBAK and the server it's supposedly connected to) is appreciated, as well as the network stats (also, for both server and client) for the moment.

@firebird-automations
Copy link
Collaborator Author

Commented by: vander clock stephane (arkadia)

ok, in more details :

on the Gbak computer the netstat show :

TCP 93.122.12.118:52939 93.122.12.120:3050 ESTABLISHED
[GBAK.EXE]

on the firebird server computer ... nothing ! the gbak client is not in the $MON table nor in the netstat !

so it's simply look like that GBAK don't detect the deconnection from the server and stay connected forever :(

i put the copy of the Gbak Dump file n the ftp server with i send you the credential by email the last time ...

thanks !

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

status: Resolved [ 5 ] => Closed [ 6 ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

QA Status: No test

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

status: Closed [ 6 ] => Closed [ 6 ]

QA Status: No test => Cannot be tested

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants