SQLCODE: -902 - internal Firebird consistency check (can't continue after bugcheck) [CORE6378] #6617
Comments
Commented by: @dyemanov "can't continue after bugcheck" is a consequence; there should be another bugcheck (with a proper description) before this one in the log.
Commented by: Russell Stuart (rstuart) Thanks for the fast reply. I didn't know it created logs. Yes, you are right, there are errors before it:
That is puzzling. I presume "File too large" is emitted in response to EFBIG. The server mode is "Classic"; the engine is started by xinetd. Xinetd's ulimit for file size is unlimited, and as I type this every running instance of the engine also has a file size ulimit of unlimited. The file system is ext4, and the file size is always under 8GB. Not that it's relevant, but the file system is at 29% of capacity, and after that error the database file operations.gdb continues to grow happily. Is there some way the ulimit could be getting set?
Commented by: @aafemt http://ibphoenix.com/resources/documents/search/doc_36 It is better to discuss the problem on the Firebird support mailing list before posting a ticket in the tracker.
Commented by: Russell Stuart (rstuart) OK, I'll try that. Thanks for your help.
Commented by: @dyemanov This looks like memory corruption followed by an unexpected (and possibly random) runtime error. I'd check the memory chips first. Then it could be worth building Firebird with Valgrind support and checking for memory access errors.
Commented by: Russell Stuart (rstuart) Wow! Thanks for thinking about it.

Regarding memory corruption: it's an enterprise server. It has ECC. It has a current uptime of 237 days. Hardware issues like that would strike randomly across the system. That isn't happening. And it's always the same Firebird table. Random ECC errors would strike random tables, preferring large tables that see plenty of writes; this is a small table that sees a tiny amount of writes compared to the others. I'm going to discard random hardware errors for now.

I pulled down the source and looked at Firebird's code last night. I see it uses AIO. Some words in the aio_write() man page describing when EFBIG happens triggered a thought: if the offset is too large it may generate the same error. So I tried running "dd if=/dev/zero of=x seek=17592186044416 count=1 bs=1" (the significance of the big number is that it's 16TB, which is ext4's limit), and indeed dd does return a "File too large" error. So my working theory for now is that an invalid file offset is being passed to aio_write(). My current plan is to add a check for that in pwrite(), force a core dump if it happens, and re-compile. I'm not sure what I will do with the core dump if/when I get one; I'll cross that bridge then. I'll probably need your help.

PS: xinetd is a replacement for inetd. It's the same thing, but it comes with a few more features in the box. One of those features is limiting the interfaces and IP addresses that can connect to the engine, which is handy. It also lets you store the configuration for each service in a separate file in the /etc/xinetd.d directory, which makes automated provisioning of servers easier. I very much doubt it's a problem in general, and in this particular case, the fact that the inetd/xinetd we're using has been running for 237 days guarantees the ulimit isn't being changed, which helps eliminate some causes.
Commented by: @hvlad The custom implementation of pread/pwrite (which really uses AIO) is wrong! I think this bug was not detected and fixed earlier because this code is usually unused. So I don't think the problem is in AIO, unless you built your Firebird binaries yourself with that code enabled. You could check the result of pwrite() at PIO_write() and force a core dump, or just log the offset value somewhere.
Commented by: Russell Stuart (rstuart) Oh, OK, I was looking at the wrong code. Thanks for letting me know. I'll add the offset check to PIO_write().
Commented by: @hvlad Any news on this?
Commented by: Russell Stuart (rstuart) No news. It takes months to happen. I've patched firebird3.0-3.0.5.33100 and we are using it as our live version. This is the patch; because it takes months, it will be a real bugger if I've muffed it:

```diff
--- a/src/jrd/os/posix/unix.cpp
+#define FILE_TOO_BIG_ASSERT
@@ -878,6 +880,15 @@
+#ifdef FILE_TOO_BIG_ASSERT
```
Commented by: Russell Stuart (rstuart) TL;DR: I've got more failures, but none was the one the code I added was targeting.

I got a core file. It was for a SIGSEGV. I have no idea what is going on here, but it doesn't seem related to my problem, so I'm going to ignore it. Even better, it did not cause database corruption.

(gdb) backtrace

I got the error I am chasing this morning, but no core file or email. I haven't figured out why:

titan Mon Sep 14 07:00:11 2020

It was not followed by an inconsistency check failure. Looking back, it happens most days, always at the same time of day, and never seems to generate later inconsistency checks. I can't see anything special that happens at 07:00. My best guess is that it is caused by the server being run with a file size ulimit (i.e., it's not a Firebird bug, and not what I'm chasing), but how that could happen given it's always run the same way from inetd is a bit of a mystery. Can libfbclient.so.2 write large files?

Finally, this happened on Saturday, which resulted in a corrupted table. Dropping the table and reloading it made the error go away. The previous error was logged 2 days prior; as the database is discarded and rebuilt overnight, it effectively was for a different database, so there were no prior errors.

titan Sat Sep 12 10:01:25 2020
titan Sat Sep 12 10:01:26 2020
titan Sat Sep 12 10:01:26 2020
titan Sat Sep 12 10:01:31 2020

Looking back, this has happened 10 times in the past 12 months. That exact error line ("cannot find record back version") is logged in every case. It is always the first error logged for the new database. It is a serious error, as the table is unusable from then on, but it is not the error my patch is attempting to trap the cause of. Worse, I have no idea how I would go about tracking down the cause of it.
Submitted by: Russell Stuart (rstuart)
Every few months we get this error:
You get it when you do a select on a table; it's almost always the same table (the database has about 200 tables). The problem can usually be worked around by dropping that table and reloading old data from a backup.
Background information: it's been happening for literally years, probably going back to version 2.5 or possibly even earlier. ServerMode is "Classic".
The access pattern of the database is unusual. It is a replica of a (non-SQL) database. Updates from the replica happen every few minutes. These updates are large, and happen in a single transaction. Those updates are the only way the database is changed, and there is only ever one happening at any one time, so there is only ever one writer. There are up to 200 people reading from the database simultaneously while the updates are happening. Overnight the entire database (the .fdb is 7GB .. 8GB) is discarded and rebuilt from scratch. There are several other Firebird databases in use simultaneously by the same application that see similar loads, but have a more typical usage pattern. They don't have a problem.
I am an experienced C programmer (and know some C++), and am happy to instrument and recompile the server as required.
The database is hosted on a dual Xeon Dell 740, and lives on a mirrored SSD.