Firebird Server hangs [CORE3603] #3957

Closed
firebird-automations opened this issue Sep 20, 2011 · 31 comments

Comments

@firebird-automations
Collaborator

Submitted by: Christian Masberg (cubism)

Attachments:
drwtsn32.log
fb_inet_server.exe.mdmp
fb_inet_server.exe.hdmp
drwtsn32_20111024.log
fb_inet_server.exe 20111024.hdmp

Votes: 1

Hi there!
We have been using the Firebird server software for 5 years and have experienced only a few problems. For approximately 3 years the server hung from time to time (about once every half year), which was bearable.
In the last half year the deadlocks of the whole db have become more frequent. In the last few weeks the server has hung once a week, making it a real problem.
The majority of the incidents happened during a backup process handled by the FIBS service (Firebird/Interbase Backup Scheduler), which uses gbak.exe for remote and scheduled backups, but there were also deadlocks during normal business and while executing rather large queries.
In these cases neither a db shutdown followed by bringing it back online nor a restart of the Firebird service solves the problem. Only a restart of the physical server brings the db back to normal life.
I have now read a lot of issues and done some web research, and I want to set up Dr. Watson on our server to create a hopefully significant dump file after the next crash. As a following step I would then update the server to 2.1.4, but beforehand I would like to invest some time to maybe discover the cause of the problem.
Concerning the setup I have some questions.
After some searching I found the debug version corresponding to our server, which is 2.1.1 build 17910. How does the debug version differ from the standard one? Is it safe to use in production or is it just for testing environments? Does it slow down performance in any way? If we update to version 2.1.4, is it advisable to install the debug version by default in case the problem is not resolved? Are there any further tips or hints for setting up the debug environment?

I would be very grateful if any of you could share some experience with this issue! Thank you very much in advance.
Christian

@firebird-automations
Collaborator Author

Commented by: @mrotteveel

This sounds more like a support question than a bug report. Support questions should be directed to the firebird-support mailing list.

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

Hi Mark,
thank you for your comment. Basically I do agree with you.
I want to attach the dump file to this issue later on, when the next deadlock happens, and I wanted to clarify some open questions in this ticket as well.
But the basic intent is to get info about the deadlocks, which I hope to get from the dump file.

Kind regards
Christian

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

I set up the debug version on the server. I am waiting for the next crash to happen and will post the crash dump file then.

Kind regards
Christian

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

Finally the server crashed again. I attached the Dr. Watson Logfile and the windows mdmp file. Hope someone can give me a clue!!

@firebird-automations
Collaborator Author

Modified by: Christian Masberg (cubism)

Attachment: drwtsn32.log [ 12022 ]

Attachment: fb_inet_server.exe.mdmp [ 12023 ]

@firebird-automations
Collaborator Author

Commented by: Arioch (arioch)

Classic server should have a separate lock manager that does not belong to any particular per-connection SQL worker process.
If the lock manager gets overloaded somehow, that may cause the "Only a restart of the physical server" behaviour.

Maybe you can dig further into how the lock manager is implemented and how to diagnose or reset its state.
Is this the only FB database on the server? Resetting the lock manager would probably require stopping all the databases on the server beforehand.

@firebird-automations
Collaborator Author

Commented by: Arioch (arioch)

Maybe some points in the comments on issue CORE3473 apply here.

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

hopefully full Dump File.

@firebird-automations
Collaborator Author

Modified by: Christian Masberg (cubism)

Attachment: fb_inet_server.exe.hdmp [ 12030 ]

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

Hi Dimitry,
thank you for your response! Yes, there are some similarities between this issue and #3473. Coincidentally, at the minute you were writing, a new deadlock occurred and we were in the process of restarting the DB server. If I had read it beforehand I would have made a printout of the lock manager. :-(
There are a lot of processes in task manager, but only some are crucial for the evaluation; the rest are just unsuccessful attempts of users to connect to the db. This can be seen because their process creation time lies after the time the freeze happened (timestamp of the db) and they all have the same size (approx. 4 MB).
Generally, from the time the db freezes, all user processes are stuck and it is not possible to make new connections to the db. Furthermore, administrative tools (e.g. IBExpert) are not able to connect to the database. A shutdown followed by bringing the db back online does not solve the problem. So far we restart the whole server. A shutdown of all host processes and finally the parent process might do the trick as well, I don't know.
So far, in the case of a deadlock, I kill the parent process (Firebird default instance) using Dr. Watson, and after a restart of the physical server Windows presents system dump files. The ones relating to the first deadlock are already attached to this issue, and I will attach the new ones relating to today's deadlock. I hope that these are the full dumps (*.hdmp).
Sysinternals Process Explorer may be able to export more info, I will give it a try.
How relevant may the lock manager be in this respect? I found the option to print the content of the lock manager during a freeze with "fb_lock_print -a". What do you think?
In the respective issue I found the hint to set "LockMemSize" in firebird.conf to 100MB. As far as I understand, this will only help with problems when using SuperServer. We are using Classic Server, where the file size should be dynamic, and we experienced neither unusual memory usage nor unusual CPU usage.
Is there generally a way of monitoring the lock manager activity?
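One way to have the lock table printout ready for the next freeze is to wrap the "fb_lock_print -a" call mentioned above in a small script that stores the output under a timestamped name. The following is a minimal sketch, not a tested procedure: the function names, the output directory and the 60-second timeout are assumptions for illustration, and it assumes fb_lock_print is reachable on PATH (otherwise pass the full path to the binary in the Firebird bin directory).

```python
import datetime
import subprocess
from pathlib import Path


def snapshot_name(now: datetime.datetime) -> str:
    """Build a timestamped file name for one lock table printout."""
    return f"lock_print_{now:%Y%m%d_%H%M%S}.txt"


def capture_lock_table(out_dir: Path, tool: str = "fb_lock_print") -> Path:
    """Run `fb_lock_print -a` and save whatever it prints.

    `tool` defaults to the bare command name (assumption: it is on PATH).
    The timeout is there because the tool itself might block while the
    server is frozen; subprocess.TimeoutExpired is raised in that case.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / snapshot_name(datetime.datetime.now())
    result = subprocess.run(
        [tool, "-a"], capture_output=True, text=True, timeout=60
    )
    # Keep stderr as well, since an error message is also diagnostic.
    target.write_text(result.stdout + result.stderr)
    return target


# Example call (hypothetical directory):
# capture_lock_table(Path("C:/fb_dumps"))
```

Running this before killing any process keeps the lock table state together with the process dumps taken for the same freeze.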
I also looked into the Firebird server log and found the following entries around the time of the freeze, which happened at 10:34:

CITRIX-SERVER Mon Oct 24 10:39:44 2011
INET/inet_error: read errno = 10054

CITRIX-SERVER Mon Oct 24 10:48:10 2011
XNET error: Server initialization failed

CITRIX-SERVER Mon Oct 24 10:48:10 2011
Database:

CITRIX-SERVER Mon Oct 24 10:49:00 2011
INET/inet_error: bind errno = 10048

CITRIX-SERVER Mon Oct 24 10:49:00 2011
Database:
Unable to complete network request to host "citrix-server".
Error while listening for an incoming connection.
Only one usage of each socket address (protocol/network address/port) is normally permitted.

CITRIX-SERVER Mon Oct 24 10:58:37 2011
XNET error: Server initialization failed

CITRIX-SERVER Mon Oct 24 10:58:37 2011
Database:

CITRIX-SERVER Mon Oct 24 10:59:27 2011
INET/inet_error: bind errno = 10048

CITRIX-SERVER Mon Oct 24 10:59:27 2011
Database:
Unable to complete network request to host "citrix-server".
Error while listening for an incoming connection.
Only one usage of each socket address (protocol/network address/port) is normally permitted.
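The errno values in these entries are Winsock error codes: 10054 is WSAECONNRESET (connection reset by the peer) and 10048 is WSAEADDRINUSE (address already in use). The latter on bind would be consistent with the old listener still holding the port when the service was started again, although the log alone does not prove that. A minimal sketch for pulling such events out of a firebird.log, assuming the entry layout shown above (host and timestamp on one line, message lines below); the function names are illustrative:

```python
import re
from datetime import datetime

# Winsock error codes that appear in the log excerpt above.
WINSOCK_ERRORS = {
    10048: "WSAEADDRINUSE: address already in use",
    10054: "WSAECONNRESET: connection reset by peer",
}

HEADER = re.compile(
    r"^(\S+)\s+(\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s*$"
)
ERRNO = re.compile(r"errno = (\d+)")

SAMPLE = """\
CITRIX-SERVER Mon Oct 24 10:39:44 2011
INET/inet_error: read errno = 10054

CITRIX-SERVER Mon Oct 24 10:49:00 2011
INET/inet_error: bind errno = 10048
"""


def parse_log(text):
    """Split log text into (host, timestamp, message) entries."""
    entries = []
    for line in text.splitlines():
        stripped = line.strip()
        match = HEADER.match(stripped)
        if match:
            stamp = datetime.strptime(match.group(2), "%a %b %d %H:%M:%S %Y")
            entries.append([match.group(1), stamp, []])
        elif stripped and entries:
            entries[-1][2].append(stripped)
    return [(host, stamp, " ".join(msg)) for host, stamp, msg in entries]


def winsock_events(entries):
    """Return (timestamp, code, meaning) for entries that carry an errno."""
    events = []
    for _host, stamp, message in entries:
        found = ERRNO.search(message)
        if found:
            code = int(found.group(1))
            events.append((stamp, code, WINSOCK_ERRORS.get(code, "unknown")))
    return events


for stamp, code, meaning in winsock_events(parse_log(SAMPLE)):
    print(stamp.isoformat(), code, meaning)
```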

I'd be very happy if anyone could look at the problem. Is there any way that I could debug the process myself?
Thanks in advance and kind regards
Christian

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

Dump Files of today. Full crash dump by operating system and Dr. Watson Logfile.

@firebird-automations
Collaborator Author

Modified by: Christian Masberg (cubism)

Attachment: drwtsn32_20111024.log [ 12031 ]

Attachment: fb_inet_server.exe 20111024.hdmp [ 12032 ]

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

Since the last post there have been some more deadlocks.
Today another one occurred.
This time we created full dump files of all connected and relevant processes (startup time prior to the time of the deadlock) using Process Hacker (former MS Process Explorer). These files take up 2 GB. I could upload them to an FTP location. Please advise.

We urgently need help in debugging this error! If nobody can take the time to do this themselves, we need advice on how to set up a debugging environment.

Furthermore, if someone has further advice or can point to more relevant data, please do so.

Thanks in advance and kind regards
Christian

@firebird-automations
Collaborator Author

Commented by: @hvlad

Please describe what files (dumps, something else?) you have, how many there are, their sizes, and how you produced them.

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

Hi!
Thank you very much for your answer.
We have 22 dump files of all processes that were connected to the server at the time the deadlock happened. They were created using a software called "Process Hacker", which was formerly known as the Sysinternals Process Explorer. They have the file ending *.dmp and are hopefully full crash dumps. Altogether they make up 1.87 GB. The debug version of the Firebird server is installed.

What else do you need, or what else might be helpful? It seems that deadlocks are becoming more frequent, so that one occurs every two weeks, sometimes every week. Please advise so that we can act differently the next time a crash happens.
Thanks in advance!!
Christian

@firebird-automations
Collaborator Author

Commented by: @hvlad

Put the dumps on the FTP and compress each file separately.
Personally, I have little free time right now and can't promise a fast answer or solution, sorry. I'll look at it when time permits.
If this is really urgent - you can try commercial support...

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

Hi Vlad,
Thank you for the advice. I will upload the files asap. Can you tell me the details of the FTP?

I thought about asking for commercial support myself, too.
Can you recommend someone? Maybe someone here in Germany?
Kind regards
Christian

@firebird-automations
Collaborator Author

Commented by: @pmakowski

You have a list here: http://www.firebirdsql.org/en/support/
The companies listed there support Firebird development as sponsors of the Firebird Foundation.

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

Hi there,
I have not commented on this issue for quite some time, because we have reworked our client software and were looking for issues that might be related to this problem. This led to a rebuild of the transaction and connection handling. At the same time we updated the Firebird software to version 2.1.4.

Unfortunately our efforts were not rewarded, because today the firebird server software hung another time.

As usual the server software and all client processes came to a sudden standstill and new processes couldn't connect to the database.
Using Classic Server we have an initial size of approx. 45 MB for each client process that successfully connects to the DB (page size: 4096 bytes, 10,000 page buffers of 4 KB each).
In Process Hacker you can trace the process of a user connecting to the db. When it first establishes the connection it has an initial size of approx. 2.06 MB. After a short time it grows to approx. 2.5 MB in the next step, until the connection is fully established at approx. 45 MB. We have approx. 25 clients connecting to the db, each starting with 45 MB of cache; after doing some work in the client the cache can expand, usually to not more than 80-90 MB.
The processes with unsuccessful connections keep their initial size of approx. 2.06 MB.
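The per-process size can be cross-checked against the cache settings: in Classic Server each connection has its own page cache of page buffers times page size. A back-of-the-envelope sketch using the figures from this thread (4096-byte pages, 10,000 buffers, about 25 clients); it only counts the page cache and ignores every other per-process allocation, which is an assumption:

```python
PAGE_SIZE = 4096        # bytes, from the database header
PAGE_BUFFERS = 10_000   # pages per connection cache
CLIENTS = 25            # approximate number of connected clients


def cache_bytes(page_size: int, buffers: int) -> int:
    """Size of one connection's page cache in bytes."""
    return page_size * buffers


def mib(n_bytes: int) -> float:
    """Convert bytes to mebibytes."""
    return n_bytes / (1024 * 1024)


per_connection = cache_bytes(PAGE_SIZE, PAGE_BUFFERS)
total = per_connection * CLIENTS

print(f"per connection: {mib(per_connection):.1f} MiB")
print(f"{CLIENTS} connections: {mib(total):.1f} MiB")
```

The page cache alone comes to roughly 39 MiB per connection, which is in the same range as the approx. 45 MB observed per fully established process, and to just under 1 GiB across 25 connections.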

We now created a second db on the same physical server.
During the deadlock neither database was accessible through the clients or the database maintenance software (IBExpert). Normal work could only be continued after the client processes had been terminated in Process Hacker.
We did not try to connect after each client process we terminated; we will do this next time, which may expose the faulty client process.
Does the fact that both databases on the same physical machine were affected by this deadlock point to a particular part of the server, maybe the lock manager? Are there separate lock managers, each handling the transactions for one db?
We are thinking of implementing SQL monitor software which acts as a kind of proxy: it receives network traffic, redirects it to the server, and writes transaction info to a separate database. Do you think that this is a reasonable approach?
I would very much appreciate an answer. Thanks in advance and kind regards
Christian

@firebird-automations
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Christian,

Have you looked at the Sweep Interval setting of the database?

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

Hi Sean,
thanks for the answer. We set the sweep interval to '0' and are doing manual sweeps. At the end of the day all clients disconnect from the db and close all transactions.

A DB statistic at the end of the day looks like this:

Database header page information:
Flags 0
Checksum 12345
Generation 3532843
Page size 4096
ODS version 11.1
Oldest transaction 26296
Oldest active 3106886
Oldest snapshot 3106886
Next transaction 3106887
Bumped transaction 1
Sequence number 0
Next attachment ID 425949
Implementation ID 16
Shadow count 0
Page buffers 10000
Next header page 0
Database dialect 3
Creation date Sep 12, 2011 18:34:17
Attributes force write

Variable header data: 
    Sweep interval:         0 

The oldest transaction shows that we have not made a sweep for some time. Is this a relevant factor, or are only OAT and OST of any relevance?
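The question can be made concrete by computing the gaps between the counters in the header page. A minimal sketch that parses the four transaction counters from the statistics above and prints the distances; the dictionary keys and function names are illustrative, not part of any Firebird tool:

```python
import re

# Counters copied from the header page statistics above.
HEADER_TEXT = """\
Oldest transaction 26296
Oldest active 3106886
Oldest snapshot 3106886
Next transaction 3106887
"""

FIELDS = (
    "Oldest transaction",
    "Oldest active",
    "Oldest snapshot",
    "Next transaction",
)


def parse_counters(text):
    """Extract the transaction counters from gstat header output."""
    counters = {}
    for line in text.splitlines():
        for name in FIELDS:
            match = re.match(rf"\s*{re.escape(name)}\s+(\d+)\s*$", line)
            if match:
                counters[name] = int(match.group(1))
    return counters


def gaps(counters):
    """Distance of the oldest (interesting) and oldest active from next."""
    nxt = counters["Next transaction"]
    return {
        "next - OIT": nxt - counters["Oldest transaction"],
        "next - OAT": nxt - counters["Oldest active"],
    }


for label, value in gaps(parse_counters(HEADER_TEXT)).items():
    print(f"{label}: {value:,}")
```

With these numbers the oldest active transaction is only 1 behind the next transaction, so no long-running transaction is holding things back at the end of the day, while the oldest transaction is more than 3 million behind. With the sweep interval set to 0, automatic sweeps are disabled, so that gap only closes when a manual sweep (or a backup and restore) is run.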

Kind regards
Christian

@firebird-automations
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Christian,

1 - If you are running Classic server you should significantly reduce the size of the page cache/buffers. Depending on the number of simultaneous connections I would recommend a value no larger than 500. With Classic server, unlike SuperServer or SuperClassic, performance drops as the size of the cache increases, due to the time required to synchronize the cache across all engine instances.

2 - I believe that the database has a lot of old record versions which can be contributing to the "deadlocks" you are seeing. A database sweep is recommended, it does not appear that the database has had a sweep in some time (since Sept 13-14, based on the age of the database and the avg number of transactions per day).

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

Hi Sean,
thank you very much for your very helpful comments.

At the moment we have the following db settings:

Page Size: 4096
Buffers:
Pages: 10000
KB: 4000

Based on your recommendations I would suggest the following settings:

Page Size: 4096 (seems to be sufficient, we don't have a table where the number of records exceeds the number of pages)
Pages: 500
KB: 4000

At the same time a sweep will be executed daily at night.
I would like to know your feedback on this.
What do you say?
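For reference, both changes map onto gfix switches: -buffers sets the default page buffer count stored in the database, and -sweep runs a manual sweep. A minimal sketch that only builds the command lines for a nightly job; the database path and user name are hypothetical, and the password is assumed to come from the ISC_PASSWORD environment variable rather than the command line:

```python
# Hypothetical connection details, for illustration only.
DATABASE = "localhost:C:/data/example.fdb"
USER = "SYSDBA"


def gfix_buffers_command(database: str, user: str, pages: int) -> list:
    """Command that sets the default number of page buffers."""
    return ["gfix", "-user", user, "-buffers", str(pages), database]


def gfix_sweep_command(database: str, user: str) -> list:
    """Command that runs a manual sweep of the database."""
    return ["gfix", "-user", user, "-sweep", database]


# Print the commands instead of running them, so the sketch has no
# side effects; a scheduler could pass the same lists to subprocess.run.
print(" ".join(gfix_buffers_command(DATABASE, USER, 500)))
print(" ".join(gfix_sweep_command(DATABASE, USER)))
```

The buffers change is a one-off; only the sweep command would go into the nightly schedule, ideally after all clients have disconnected.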

Kind regards
Christian

@firebird-automations
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Christian,

Your cache size and nightly sweep are reasonable.

I would increase the page size to 8KB/8192, which is the typical disk block size used by many file systems.

By matching the page size to a full multiple (ie. 1x) of the disk block size, you get the most "bang" for each disk IO. It could also improve index performance, if you have very deep indexes (you need to run gstat and look at the "depth" values, value > 3 is not good).
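The depth check can be scripted against the index section of the gstat output. The sketch below assumes the usual layout of an "Index NAME (n)" line followed by a "Depth: d, ..." line; the table, index names and numbers in the sample are made up for illustration and are not from this database:

```python
import re

# Made-up sample in the assumed gstat index statistics layout.
SAMPLE = """\
ORDERS (130)

    Index ORDERS_PK (0)
        Depth: 2, leaf buckets: 310, nodes: 250000
    Index ORDERS_CUSTOMER_IDX (1)
        Depth: 4, leaf buckets: 2900, nodes: 250000
"""

INDEX_LINE = re.compile(r"^\s*Index\s+(\S+)\s+\(\d+\)")
DEPTH_LINE = re.compile(r"^\s*Depth:\s*(\d+)")


def deep_indexes(text, limit=3):
    """Return (index name, depth) for every index deeper than `limit`."""
    current = None
    result = []
    for line in text.splitlines():
        index_match = INDEX_LINE.match(line)
        if index_match:
            current = index_match.group(1)
            continue
        depth_match = DEPTH_LINE.match(line)
        if depth_match and current is not None:
            depth = int(depth_match.group(1))
            if depth > limit:
                result.append((current, depth))
    return result


print(deep_indexes(SAMPLE))
```

Any index reported here is a candidate that might benefit from the larger page size.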

One thing, what is the "KB" you are referring to?

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

Hi Sean,
forget about the KB reference. It was just extra info from the IB admin tool that I use. It is the number of pages times the page size, which is the amount of cache per user in RAM (using Classic Server).

Thanks for the page size hint. Is 16KB not recommendable?

Kind regards
Christian

@firebird-automations
Collaborator Author

Commented by: Sean Leyne (seanleyne)

I have found that 8KB is a "sweet spot" for active databases (our databases range up to 50GB).

A 16KB page would require 2 disk block IOs, which due to disk fragmentation could be at different disk locations, making the IO much more expensive.

@firebird-automations
Collaborator Author

Commented by: Paul Read (nsolve)

We have a very similar situation occurring - where all clients freeze and no new connections can be made until FB server is restarted.
Did the tweaking of the DB configuration resolve the issue?

@firebird-automations
Collaborator Author

Commented by: Christian Masberg (cubism)

Hi Paul,
after 2.5 years without any freezes it's a clear yes. After the introduction of the nightly sweep and the tweaking of the page size and the initial number of page buffers, the problems disappeared.

Kind regards
Christian

@firebird-automations
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Based on the comments posted, it seems that the issue was related to server configuration, not a specific issue which requires engine developer action. As such, this case is closed.

@firebird-automations
Collaborator Author

Modified by: Sean Leyne (seanleyne)

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Cannot Reproduce [ 5 ]

@firebird-automations
Collaborator Author

Modified by: @pcisar

status: Resolved [ 5 ] => Closed [ 6 ]
