SQLCODE: -902 - internal Firebird consistency check (can't continue after bugcheck) [CORE6378] #6617

Open
firebird-automations opened this issue Aug 4, 2020 · 11 comments

Comments


Submitted by: Russell Stuart (rstuart)

Every few months we get this error:

Error while rolling back transaction:
- SQLCODE: -902
- internal Firebird consistency check (can't continue after bugcheck)

You get it when you do a select on a table; it's almost always the same table (the database has about 200 tables). The problem can usually be worked around by dropping that table and reloading old data from a backup.

Background information: it's been happening for literally years, probably going back to version 2.5 or possibly even earlier. ServerMode is "Classic".

The access pattern of the database is unusual. It is a replica of a (non-SQL) database. Updates to the replica happen every few minutes. These updates are large and happen in a single transaction. Those updates are the only way the database is changed, and only one is ever in progress at any time, so there is only ever one writer. There are up to 200 people reading from the database simultaneously while the updates are happening. Overnight the entire database (the .fdb is 7 GB .. 8 GB) is discarded and rebuilt from scratch. Several other Firebird databases are in use simultaneously by the same application and see similar loads, but have a more typical usage pattern. They don't have the problem.

I am an experienced C programmer (with some C++), and am happy to instrument and recompile the server as required.

The database is hosted on a dual Xeon Dell 740, and lives on a mirrored SSD.


Commented by: @dyemanov

"can't continue after bugcheck" is a consequence, there should be another bugcheck (with proper description) before this one in the log.


Commented by: Russell Stuart (rstuart)

Thanks for the fast reply. I didn't know it created logs. Yes, you are right, there are errors before it:

titan	Tue Aug  4 07:00:04 2020
    I/O error during "write" operation for file "/ssd/operations.gdb"
    Error while trying to write to file
    File too large

titan	Tue Aug  4 07:00:10 2020
    I/O error during "write" operation for file "/ssd/operations.gdb"
    Error while trying to write to file
    File too large

titan	Tue Aug  4 08:11:08 2020
Database: /ssd/operations.gdb
    internal Firebird consistency check (invalid block type encountered (147), file: cch.cpp line: 1836)

That is puzzling. I presume "File too large" is emitted in response to EFBIG. The server mode is "Classic"; the engine is started by xinetd. Xinetd's ulimit for file size is unlimited, and as I type this every running instance of the engine also has a file size ulimit of unlimited. The file system is ext4, and the file size is always under 8 GB. Not that it's relevant, but the file system is at 29% of capacity, and after that error the database file operations.gdb continues to grow happily.

Is there some way the ulimit could be set?
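
For reference, a minimal standalone sketch (not Firebird code; the output wording is just illustrative) that prints a process's own file-size limit. Running something like it from the same xinetd service would confirm whether RLIMIT_FSIZE really is unlimited for the engine:

// Print this process's RLIMIT_FSIZE so it can be compared against the
// "File too large" errors in firebird.log. Illustrative only.
#include <sys/resource.h>
#include <cstdio>

int main()
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_FSIZE, &rl) != 0)
    {
        perror("getrlimit");
        return 1;
    }

    if (rl.rlim_cur == RLIM_INFINITY)
        printf("file size ulimit: unlimited\n");
    else
        printf("file size ulimit: %llu bytes\n", (unsigned long long) rl.rlim_cur);

    return 0;
}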


Commented by: @aafemt

http://ibphoenix.com/resources/documents/search/doc_36

It is better to discuss the problem on the Firebird support mailing list before posting a ticket in the tracker.
Firebird 3 doesn't use xinetd, so something may be wrong with your setup.


Commented by: Russell Stuart (rstuart)

OK, I'll try that. Thanks for your help.


Commented by: @dyemanov

This looks like memory corruption followed by an unexpected (and possibly random) runtime error. I'd check the memory chips first. Then it could be worth building Firebird with Valgrind support and checking for memory access errors.


Commented by: Russell Stuart (rstuart)

Wow! Thanks for thinking about it.

Regarding memory corruption: it's an enterprise server. It has ECC. It has a current uptime of 237 days. Hardware issues like that would strike randomly across the system. That isn't happening. And it's always the same Firebird table. Random ECC errors would strike random tables, but prefer large tables that see plenty of writes. This is a small table that sees a tiny number of writes compared to the others. I'm going to discard random hardware errors for now.

I pulled down the source and looked at Firebird's code last night. I see it uses AIO. Some words in the aio_write() man page describing when EFBIG happens triggered a thought: if the offset is too large it may generate the same error. So I tried running "dd if=/dev/zero of=x seek=17592186044416 count=1 bs=1" (the significance of the big number is that it's 16 TiB, which is ext4's file-size limit), and indeed dd does return a "File too large" error.
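
The same thing can be reproduced in a few lines of C; this is just a throwaway test program, nothing from the server:

// Throwaway test: pwrite() at an offset at ext4's 16 TiB file-size limit
// should fail with EFBIG, matching the "File too large" entries in the log.
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cerrno>
#include <cstring>

int main()
{
    const off_t huge = 17592186044416LL;   // 16 TiB
    int fd = open("x", O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    char byte = 0;
    if (pwrite(fd, &byte, 1, huge) < 0)
        printf("pwrite failed: %s\n", strerror(errno));   // expect EFBIG here

    close(fd);
    return 0;
}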

So my working theory for now is that an invalid file offset is being passed to aio_write(). My current plan is to add a check for that in pwrite(), force a core dump if it happens, and recompile. I'm not sure what I will do with the core dump if I get one; I'll cross that bridge if/when it happens. I'll probably need your help.

PS: xinetd is a replacement for inetd. It's the same thing, but it comes with a few more features in the box. One of those features is limiting the interfaces and IP addresses that can connect to the engine, which is handy. It also lets you store the configuration for each service in a separate file in the /etc/xinetd.d directory, which makes automated provisioning of servers easier. I very much doubt it's a problem in general, and in this particular case, the fact we're using an inetd/xinetd that's been running for 237 days guarantees the ulimit isn't being set, which helps eliminate some causes.


Commented by: @hvlad

The custom implementation of pread/pwrite (which really uses AIO) is wrong!
The last parameter, "offset", should be of type off_t, not SLONG!

I think this bug was not detected and fixed earlier because this code is usually unused:
it is only compiled if the build host has no pread/pwrite.

So I don't think the problem is in AIO, unless you built your Firebird binaries yourself with
HAVE_PREAD and HAVE_PWRITE undefined for some reason.

You could check the result of pwrite() in PIO_write() and force a core dump, or just log the offset value somewhere,
say in firebird.log (use gds__log() to do it), in case of EFBIG.
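
Roughly, the check could look something like the sketch below. This is illustrative only: the real check would live inside PIO_write() in src/jrd/os/posix/unix.cpp and use gds__log() rather than fprintf(), and the helper name here is made up.

// Sketch: wrap pwrite(), and on EFBIG log the offending offset and force
// a core dump so the caller that computed the offset is captured.
// fprintf() stands in for gds__log() so the sketch compiles on its own.
#define _FILE_OFFSET_BITS 64
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cerrno>

ssize_t checked_pwrite(int fd, const void* buf, size_t len, off_t offset)
{
    const ssize_t written = pwrite(fd, buf, len, offset);
    if (written < 0 && errno == EFBIG)
    {
        fprintf(stderr, "pwrite EFBIG: fd=%d len=%zu offset=%lld\n",
                fd, len, (long long) offset);
        abort();   // leave a core dump behind
    }
    return written;
}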


Commented by: Russell Stuart (rstuart)

Oh, OK, I was looking at the wrong code. Thanks for letting me know.

I'll add the offset check to PIO_write().


Commented by: @hvlad

Any news on this?


Commented by: Russell Stuart (rstuart)

No news. It takes months to happen. I've patched firebird3.0-3.0.5.33100 and we are using it as our live version. This is the patch; because it takes months to trigger, it will be a real bugger if I've muffed it:

--- a/src/jrd/os/posix/unix.cpp
+++ b/src/jrd/os/posix/unix.cpp
@@ -77,6 +77,8 @@
 #include "../common/classes/init.h"
 #include "../common/os/os_utils.h"
 
+#define FILE_TOO_BIG_ASSERT
+
 using namespace Jrd;
 using namespace Firebird;
 
@@ -878,6 +880,15 @@
 	return 0;
 }
 
+#ifdef FILE_TOO_BIG_ASSERT
+	if (lseek_offset > 1ULL * 1024 * 1024 * 1024 * 1024) {
+		unix_error("lseek-offset", file, isc_io_32bit_exceeded_err, status_vector);
+		system("echo 'firebird lseek-offset failed.' | mail -s 'firebird abort()' rstuart@lubemobile.com.au");
+		abort();
+	}
+#endif
+
+
 	*offset = lseek_offset;
 
 	return file;


Commented by: Russell Stuart (rstuart)

TL;DR: I've got more failures, but none was the one the code I added was targeting.

I got a core file. It was for a SIGSEGV. I have no idea what is going on here, but it doesn't seem related to my problem so I'm going to ignore it. Even better, it did not cause database corruption.

(gdb) backtrace
#0  0x0000000000000000 in ?? ()
#1  0x0000560111db36e8 in Firebird::RefPtr<rem_port>::~RefPtr (this=0x7ffc138ada90,
    __in_chrg=<optimized out>) at ./src/include/../common/classes/RefCounted.h:122
#2  main (argc=<optimized out>, argv=<optimized out>)
    at ./src/remote/server/os/posix/inet_server.cpp:170

I got the error I am chasing this morning, but no core file or email. I haven't figured out why:

titan Mon Sep 14 07:00:11 2020
I/O error during "write" operation for file "/lube/ssd/lubenet/sqlstuff/operations.gdb"
Error while trying to write to file
File too large

It was not followed by a consistency check failure. Looking back, it happens most days, always at the same time of day, and never seems to generate later consistency check failures. I can't see anything special that happens at 07:00. My best guess is that it is caused by the server being run with a file size ulimit (i.e. it's not a Firebird bug, and not what I'm chasing), but how that could happen, given it's always run the same way from inetd, is a bit of a mystery. Can libfbclient.so.2 write large files?

Finally, this happened on Saturday, which resulted in a corrupted table. Dropping the table and reloading it made the error go away. The previous error was logged two days prior; as the database is discarded and rebuilt overnight, that was effectively a different database, so there were no prior errors for this one.

titan Sat Sep 12 10:01:25 2020
Database: /lube/ssd/lubenet/sqlstuff/operations.gdb
internal Firebird consistency check (cannot find record back version (291), file: dpm.cpp line: 1376)

titan Sat Sep 12 10:01:26 2020
I/O error during "read" operation for file "/lube/ssd/lubenet/sqlstuff/operations.gdb"
Error while trying to read from file
Resource temporarily unavailable

titan Sat Sep 12 10:01:26 2020
Database: /lube/ssd/lubenet/sqlstuff/operations.gdb
I/O error during "read" operation for file "/lube/ssd/lubenet/sqlstuff/operations.gdb"
Error while trying to read from file
Resource temporarily unavailable

titan Sat Sep 12 10:01:31 2020
Database: /lube/ssd/lubenet/./sqlstuff/operations.gdb
internal Firebird consistency check (cannot find record back version (291), file: vio.cpp line: 4705)

Looking back, this has happened 10 times in the past 12 months. That exact error line ("cannot find record back version") is logged in every case. It is always the first error logged for the new database. It is a serious error, as the table is unusable from then on, but it is not the error whose cause my patch is attempting to trap. Worse, I have no idea how I would go about tracking down the cause of it.
