fbserver terminated abnormally when thread start failed [CORE2306] #2730
Comments
Modified by: @AlexPeshkoff assignee: Alexander Peshkov [ alexpeshkoff ] |
Commented by: @AlexPeshkoff First of all read this. Once you have a core file, contact me here. BTW, if you can give me remote access to your server (after getting the core file), that would seriously help in solving your problem. |
Commented by: Andreas Kallenbach (andreas_kc) Thanks, I have set BugcheckAbort=1 and will follow up once I have a core dump. |
Commented by: Andreas Kallenbach (andreas_kc) Ok, I do have a core.22213 file that is 3GB in size. A little large to attach here, how else can I send this to you? |
Commented by: Andreas Kallenbach (andreas_kc) I have downloaded the debug package FirebirdSS-debuginfo-2.1.1.17910-0.nptl.i686.tar.gz and extracted them into the appropriate directory. I assume this is the expected output? [root@localhost tmp]# gdb /opt/firebird/bin/.debug/fbserver.debug core.22213 warning: core file may not match specified executable file. warning: Can't read pathname for load map: Input/output error. done. |
Commented by: Andreas Kallenbach (andreas_kc) Had a second crash of the day with the following output... (same as before) Program terminated with signal 11, Segmentation fault. |
Commented by: Andreas Kallenbach (andreas_kc) I don't know if anything can be told from the file dates on libpthread.so, but they are as follows... [root@localhost lib]# ls -l libpthread* |
Commented by: @AlexPeshkoff Andreas, this is almost the output that is needed. Please run in gdb the following command - As for the size of the core file, I'm pretty sure bzip2 can compress it at least 10 times. Certainly, if you can then put it in any accessible place (http, ftp, scp, etc.), that would be ideal. Though it's quite possible that the backtrace (bt) of the stacks can already help. Most likely we have some OOM problem. PS. The version of libpthread is OK. |
Commented by: Andreas Kallenbach (andreas_kc) [root@localhost tmp]# gdb /opt/firebird/bin/.debug/fbserver.debug core.22213 warning: core file may not match specified executable file. warning: Can't read pathname for load map: Input/output error. done. |
Commented by: Andreas Kallenbach (andreas_kc) There were 266 threads in the core listing and I only enumerated the first and last 5 plus some in between. The core dump can be downloaded at http://www.infinityins.com/core.22213.tar.bz2 it was compressed using.... [root@localhost tmp]# tar -cjf core.22213.tar.bz2 core.22213 [root@localhost tmp]# ls -l core* |
Modified by: Sean Leyne (seanleyne) Component: Engine [ 10000 ] |
Commented by: Andreas Kallenbach (andreas_kc) Maybe I am grasping at straws here... I'm not versed in Linux programming or its tools, but in the first thread I find this curious... #4 0x082f9f5a in ~system_call_failed (this=0x9799194c) libstdc++.so.5 is a dependency that I have always had to install on many of the Fedora/CentOS/Redhat Servers we use before installing firebird*.rpm. I resolve the dependency by the following.... [root@localhost Desktop]# yum install libstdc++.so.5 Dependencies Resolved =============================================================================
|
Commented by: @AlexPeshkoff Unfortunately I can't work with your core file: a lot of library version mismatches make it unusable. Therefore I'll have to ask you to provide some answers. I see 2 main problems in the provided backtraces. One (which directly caused the AV) is an error in gds__thread_start. It would be interesting to try the following in gdb: But most likely the failure happened because too many threads were started. Maybe you should fix the limits on your box (don't ask how, this is a distro-dependent thing). What is also strange is that a lot of threads are waiting for the SS_MUTEX lock in the detach database code, and this can be a reason for too many threads. Do you have any problems when detaching from the server? Can you attach to the server without problems (that mutex blocks attaches too)? It would be very interesting to know how many of the 260 threads are waiting in this place (like thread 100): |
Commented by: Andreas Kallenbach (andreas_kc) (gdb) t 1 We have not had an issue with attaching/detaching from the server. This particular machine has been in use for almost 2 years and until this started has run very well. It was using 2.0.1 for most of that time until it was upgraded to the current version last month. It was upgraded after these issues started happening. I wanted to see if it would improve with a newer version. I am looking to install this on a newer machine this weekend. Now, what may be interesting is that before the server hits this error, crashes, and restarts.... users report they can't connect to the database, the cpu shoots up to 100% utilization, stays at 100% for 5 minutes, and then crashes. Near the end of this, attempts to connect to the database start responding with various errors, "Can't find system environment variables, path issues, etc". Other than the times when the database crashes, connections to the database are pretty much instant. When I have a little time later today.... I will do some more thread enumerating... my human "for" loop skills are not as efficient as the machine. |
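The manual thread enumeration mentioned above could be automated. What follows is a minimal sketch, not part of the original exchange: it assumes the backtraces of all threads were saved to a text file (for example from gdb's `thread apply all bt` with logging enabled) and that each thread block starts with a line of the form `Thread N (...)`. The frame name `hypothetical_detach` is a placeholder, because the thread does not give the real function name of the detach code.

```python
import re

# Assumption: each thread block in the saved gdb output starts with "Thread N (".
THREAD_HEADER = re.compile(r"^Thread (\d+) \(")


def split_threads(bt_text):
    """Map gdb thread number -> list of non-blank lines that follow its header."""
    threads = {}
    current = None
    for line in bt_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            current = int(match.group(1))
            threads[current] = []
        elif current is not None and line.strip():
            threads[current].append(line)
    return threads


def threads_containing(bt_text, needle):
    """Return the sorted thread numbers whose backtrace mentions `needle`."""
    return sorted(
        number
        for number, lines in split_threads(bt_text).items()
        if any(needle in line for line in lines)
    )
```

Calling `len(threads_containing(text, "hypothetical_detach"))` on the saved output would then give the number of threads waiting in that frame, once the placeholder is replaced with the name seen in a thread such as thread 100.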
Commented by: Andreas Kallenbach (andreas_kc) I did switch our database to a similar spec'd machine and Operating System to check for hardware issues, but we are still receiving troublesome restarts on the database under load. Now, I have noticed that the firebird.log contains a number of these errors right before it restarts... INET/inet_error: read errno = 104 For instance this Monday, we had this error repeat 292 times localhost.localdomain (Server) Mon Mar 16 18:37:31 2009 All occurred at 18:37:31 and was promptly followed with... localhost.localdomain (Client) Mon Mar 16 18:37:31 2009 |
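Counting repeated entries such as the 292 occurrences above does not have to be done by hand. The sketch below is an illustration only. It assumes each firebird.log entry is a header line of the form `host (Server|Client) <timestamp>` followed by the message on the next non-blank line, which matches the excerpts quoted in this issue but may not cover every log variant. As a side note, errno 104 on Linux is ECONNRESET (connection reset by peer).

```python
import re
from collections import Counter

# Assumed header shape: "localhost.localdomain (Server) Mon Mar 16 18:37:31 2009"
HEADER = re.compile(
    r"^(\S+)\s+\((Server|Client)\)\s+"
    r"(\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\s*$"
)


def count_messages(log_text):
    """Count (timestamp, first message line) pairs in firebird.log text."""
    counts = Counter()
    stamp = None
    for raw in log_text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            stamp = match.group(3)
        elif stamp is not None and line:
            counts[(stamp, line)] += 1
            stamp = None  # only the first message line of an entry is counted
    return counts
```

`count_messages(text).most_common(5)` would then list the most frequent entries per timestamp, which makes bursts like the one at 18:37:31 easy to spot.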
Commented by: @AlexPeshkoff Andreas, I'm sure this is not a HW issue. The final error in pthread_mutex_lock() is definitely "too many threads per process" (or maybe per user, it does not really matter). But this is not the actual reason for your problems; something goes wrong earlier. Looking at your stacks (for all the working threads that were shown) I see they all perform a simultaneous disconnect. I do not think that all of your clients exited the client software simultaneously :-) I.e. for some reason all network connections of that particular process were dropped. Apart from that I do not see any problems with fbserver at this moment. Certainly, exception handling in working thread startup will be fixed. Maybe you should try CS instead of SS? |
Commented by: @AlexPeshkoff Changed the name after final analysis of the causes |
Modified by: @AlexPeshkoff summary: fbguard/fbserver terminated abnormally kernel segfault => fbserver terminated abnormally when thread start failed |
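To see how close a running process gets to the thread limit before it fails, the thread count can be sampled from `/proc`. This is a rough sketch under the assumption of a Linux host, where `/proc/<pid>/status` contains a `Threads:` line and `RLIMIT_NPROC` is the per-user limit on processes and threads. The helper names are illustrative and the limit that actually applied on the reporter's machine is not known from this thread.

```python
def parse_thread_count(status_text):
    """Extract the thread count from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1])
    raise ValueError("no 'Threads:' line found")


def thread_count(pid):
    """Read the current number of threads of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_thread_count(handle.read())


def nproc_soft_limit():
    """Return the soft RLIMIT_NPROC of the calling user (Linux/Unix only)."""
    import resource  # imported lazily: the module is not available everywhere

    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return soft  # equals resource.RLIM_INFINITY when unlimited
```

Logging `thread_count(pid)` for the fbserver process at intervals would show whether the count climbs steadily or jumps shortly before a crash.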
Modified by: @AlexPeshkoff |
Commented by: @AlexPeshkoff Everything said about CORE2415 applies here. fbserver fails due to a gcc bug; it will be fixed in the next point release. |
Commented by: @AlexPeshkoff The segfault can be avoided with older versions if you build them with a fresh (> 3.2.X) gcc. Additionally fixed a client hang in the case when fbserver can't start even the first worker thread. |
Modified by: @AlexPeshkoff status: Open [ 1 ] => Resolved [ 5 ] resolution: Fixed [ 1 ] Fix Version: 2.5 RC1 [ 10300 ] Fix Version: 2.1.3 [ 10302 ] Fix Version: 2.0.6 [ 10303 ] |
Modified by: @pcisar status: Resolved [ 5 ] => Closed [ 6 ] |
Modified by: @pavel-zotov QA Status: No test |
Modified by: @pavel-zotov status: Closed [ 6 ] => Closed [ 6 ] QA Status: No test => Cannot be tested |
Submitted by: Andreas Kallenbach (andreas_kc)
Is related to CORE2415
Relate to CORE2702
For the last three weeks our database server has been resetting around 2-3 times a week. I have been doing backup/restores on the weekends to keep potential database corruptions down, but we end up with almost 30 hours of weekly downtime to backup/restore a 45GB database.
Today, we had a server crash and messages contained the following line...
Feb 4 10:08:23 localhost kernel: fbserver[15932]: segfault at 00000058 eip 006c6370 esp bfdc5c44 error 4
firebird.log contained the following line...
localhost.localdomain (Client) Wed Feb 4 10:08:23 2009
/opt/firebird/bin/fbguard: /opt/firebird/bin/fbserver terminated abnormally (-1)
What can I do to track this issue down and make it stop?
Commits: eaa740f 7a5070b 9b0950d