Firebird stops responding, required restart to get going again. Related to process shutdown code [CORE3053] #3433

Open
firebird-automations opened this issue Jun 17, 2010 · 25 comments

Comments

@firebird-automations

Submitted by: Neil Pickles (npickles)

Attachments:
leedscrashdump.7z
benton crash logs.7z
leeds drwtsn32.log

Firebird just stops responding and requires a restart to get it going again. This was initially thought to relate to CORE2900, but I was subsequently told it is a separate issue.

@firebird-automations

Commented by: Neil Pickles (npickles)

The attached crashdump & Dr Watson log files illustrate the issue.

@firebird-automations

Modified by: Neil Pickles (npickles)

Attachment: leedscrashdump.7z [ 11650 ]

@firebird-automations

Modified by: @hvlad

assignee: Vlad Khorsun [ hvlad ]

@firebird-automations

Commented by: @AlexPeshkoff

Neil, can you try the latest snapshot of the 2.1 branch?

@firebird-automations

Commented by: @hvlad

It seems that this was actually fixed by Alex; see CORE2865.

I also suggest you try the latest snapshot of the (not yet released) 2.1.4, as it probably has the fix for CORE3050 too.

@firebird-automations

Commented by: Neil Pickles (npickles)

Attached are the Dr Watson logs & Firebird logs from another site that has fallen over during the weekend.

This site was using the latest build of FB 2.1.4, v2.1.4.18314 but is still having a problem.

I'll post details of where the crashdump file can be downloaded from once I have it back from site, as it will be around 100 MB and too large to upload to the tracker directly.

@firebird-automations

Modified by: Neil Pickles (npickles)

Attachment: benton crash logs.7z [ 11652 ]

@firebird-automations

Commented by: Neil Pickles (npickles)

Here is another Dr Watson log file (leeds drwtsn32.log) from another site. This time there is nothing in the Firebird log to speak of; there is this log file and a crashdump file. I'll get the crashdump file back from site and post details of where it can be downloaded from, as it will be around 100 MB zipped up.

Again, this site was using the latest v2.1.4 build I could find last week, 18314.

@firebird-automations

Modified by: Neil Pickles (npickles)

Attachment: leeds drwtsn32.log [ 11653 ]

@firebird-automations

Commented by: Neil Pickles (npickles)

The two crashdumps for the two sites can be downloaded from http://news.csy.co.uk/leedscrashdump3.7z & http://news.csy.co.uk/bentoncrashdump.7z

Any help would be very much appreciated as to what's going on here.

@firebird-automations

Commented by: @hvlad

About leedscrashdump3.

The call stack is :

> fbserver.exe!looper(Jrd::thread_db * tdbb=0x044ef988, Jrd::jrd_req * request=0x0538f6f8, Jrd::jrd_nod * in_node=0x00d39d34) Line 1863 C++
fbserver.exe!execute_looper(Jrd::thread_db * tdbb=0x00000000, Jrd::jrd_req * request=0x00000000, Jrd::jrd_tra * transaction=0x04f393b8, Jrd::jrd_req::req_s next_state=req_proceed) Line 1461 + 0x1f bytes C++
fbserver.exe!EXE_send(Jrd::thread_db * tdbb=0x00000004, Jrd::jrd_req * request=0x0538f6f8, unsigned short msg=32696, unsigned short length=32, const unsigned char * buffer=0x0534d760) Line 1003 + 0xf bytes C++
fbserver.exe!jrd8_start_and_send(int * user_status=0x044efd8c, Jrd::jrd_req * * req_handle=0x00acab64, Jrd::jrd_tra * * tra_handle=0x04851280, unsigned short msg_type=0, unsigned short msg_length=32, char * msg=0x0534d760, short level=0) Line 3790 + 0x19 bytes C++
fbserver.exe!isc_start_and_send(int * user_status=0x044efd8c, void * * req_handle=0x0534df7c, void * * tra_handle=0x0534df68, unsigned short msg_type=0, unsigned short msg_length=32, const char * msg=0x0534d760, short level=0) Line 4942 + 0x2f bytes C++
fbserver.exe!execute_request(dsql_req * request=0x0534df24, void * * trans_handle=0x044efde4, unsigned short in_blr_length=16, const unsigned char * in_blr=0x00acaac8, unsigned short in_msg_length=26, unsigned char * in_msg=0x00ac8588, unsigned short out_blr_length=0, unsigned char * out_blr=0x00000000, unsigned short out_msg_length=0, unsigned char * out_msg=0x00000000, bool singleton=false) Line 3429 + 0x26 bytes C++
fbserver.exe!GDS_DSQL_EXECUTE_CPP(int * user_status=0x044efd00, void * * trans_handle=0x044efde4, dsql_req * * req_handle=0x00acaa84, unsigned short in_blr_length=16, const unsigned char * in_blr=0x00acaac8, unsigned short in_msg_type=0, unsigned short in_msg_length=26, unsigned char * in_msg=0x00ac8588, unsigned short out_blr_length=0, unsigned char * out_blr=0x00000000, unsigned short out_msg_type=64804, unsigned short out_msg_length=0, unsigned char * out_msg=0x00000000) Line 570 + 0x26 bytes C++
fbserver.exe!dsql8_execute(int * user_status=0x044efd8c, void * * trans_handle=0x044efde4, dsql_req * * req_handle=0x00acaa84, unsigned short in_blr_length=16, const char * in_blr=0x00acaac8, unsigned short in_msg_type=0, unsigned short in_msg_length=26, char * in_msg=0x00ac8588, unsigned short out_blr_length=0, char * out_blr=0x00000000, unsigned short out_msg_type=0, unsigned short out_msg_length=0, char * out_msg=0x00000000) Line 296 + 0x41 bytes C++
fbserver.exe!isc_dsql_execute2_m(int * user_status=0x00000000, void * * tra_handle=0x044efde4, void * * stmt_handle=0x00a42c4c, unsigned short in_blr_length=16, const char * in_blr=0x00acaac8, unsigned short in_msg_type=0, unsigned short in_msg_length=26, char * in_msg=0x00ac8588, unsigned short out_blr_length=0, char * out_blr=0x00000000, unsigned short out_msg_type=0, unsigned short out_msg_length=0, char * out_msg=0x00000000) Line 2531 + 0x36 bytes C++
fbserver.exe!rem_port::execute_statement(P_OP op=op_execute, p_sqldata * sqldata=0x00003bdb, packet * sendL=0x00acaf10) Line 2172 C++
fbserver.exe!process_packet2(rem_port * port=0x01704abc, packet * sendL=0x00acaf10, packet * receive=0x00acb1c4, rem_port * * result=0x044eff44) Line 3622 C++
fbserver.exe!process_packet(rem_port * port=0x01704abc, packet * sendL=0x00acaf10, packet * receive=0x00acb1c4, rem_port * * result=0x044eff44) Line 3372 + 0x22 bytes C++

and AV is at exe.cpp, line 1863 :

static jrd_nod* looper(thread_db* tdbb, jrd_req* request, jrd_nod* in_node)
{
...
#if defined(DEBUG_GDS_ALLOC) && FALSE
	int node_type = node->nod_type;
#endif

	switch (node->nod_type) {						<--- HERE
	case nod_asn_list:
		if (request->req_operation == jrd_req::req_evaluate) {

The local variable "node" contains garbage bytes.

tdbb, database, attachment and request all seem OK and valid.

The SQL text of the request is

UPDATE VETRANS SET CREATESYNCID=? WHERE TRANSID=?

So far I have no idea what is happening :(

@firebird-automations

Commented by: @hvlad

About bentoncrashdump.

The call stack is :

> msvcr80.dll!7814537a()
[Frames below may be incorrect and/or missing, no symbols loaded for msvcr80.dll]
fbserver.exe!Jrd::LocksCache<Jrd::CachedLock>::get(Jrd::thread_db * tdbb=0x07f9f988, const unsigned char * key=0x07f9b46c) Line 180 C++
fbserver.exe!Jrd::BtrPageGCLock::disablePageGC(Jrd::thread_db * tdbb=0x07f9f988, const Jrd::PageNumber & page={...}) Line 262 + 0xd bytes C++
fbserver.exe!add_node(Jrd::thread_db * tdbb=0x00000000, Jrd::win * window=0x07f9b5f4, Jrd::index_insertion * insertion=0x07f9e73c, Jrd::temporary_key * new_key=0x07f9b690, RecordNumber * new_record_number=0x07f9b610, long * original_page=0x07f9b53c, long * sibling_page=0x07f9b55c) Line 2385 C++
fbserver.exe!add_node(Jrd::thread_db * tdbb=0x04d4317c, Jrd::win * window=0x07f9b5f4, Jrd::index_insertion * insertion=0x07f9e73c, Jrd::temporary_key * new_key=0x07f9b690, RecordNumber * new_record_number=0x07f9b610, long * original_page=0x00000000, long * sibling_page=0x00000000) Line 2393 + 0x35 bytes C++
fbserver.exe!BTR_insert(Jrd::thread_db * tdbb=0x07f9f988, Jrd::win * root_window=0x07f9e724, Jrd::index_insertion * insertion=0x07f9e73c) Line 1031 + 0x29 bytes C++
fbserver.exe!insert_key(Jrd::thread_db * tdbb=0x07f9f988, Jrd::jrd_rel * relation=0x0370e544, Jrd::Record * record=0x04fcd4e4, Jrd::jrd_tra * transaction=0x00000000, Jrd::win * window_ptr=0x00000000, Jrd::index_insertion * insertion=0x07f9e73c, Jrd::jrd_rel * * bad_relation=0x07f9f860, unsigned short * bad_index=0x07f9f86c) Line 1603 + 0x4a bytes C++
fbserver.exe!IDX_store(Jrd::thread_db * tdbb=0x07f9f988, Jrd::record_param * rpb=0x04e81508, Jrd::jrd_tra * transaction=0x078e55b4, Jrd::jrd_rel * * bad_relation=0x07f9f860, unsigned short * bad_index=0x07f9f86c) Line 998 + 0x22 bytes C++

AV is at memmove, which was called from Array::remove(size_t index) :

template <class LockClass>
GlobalRWLock* LocksCache<LockClass>::get(thread_db *tdbb, const UCHAR* key)
{
...
			que_inst = que_inst->que_backward;
			QUE_DELETE(lock->m_lru);
			m_sortedLocks.remove(pos);		<--- HERE

			if (lock->setLockKey(tdbb, key))
				break;

It seems that que_inst at line 171 is wrong and points to the head of the queue (this->m_lru):

			lock = (LockClass*) ((SCHAR*) que_inst - OFFSET (LockClass*, m_lru));

Therefore "lock" is also invalid and its key can't be found in m_sortedLocks. So "pos" has a wrong value and memmove crashes.
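A minimal sketch of how a wrong "pos" can turn into a memmove crash. It assumes that the array's remove shifts the tail left with memmove and computes the length in unsigned arithmetic; that is an assumption about the implementation, not something read from the dump. The names unchecked_tail_bytes and checked_remove are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: number of bytes an unchecked remove(index) would pass
// to memmove when shifting the tail of an array of `count` elements left.
std::size_t unchecked_tail_bytes(std::size_t count, std::size_t index, std::size_t elem_size)
{
    // Unsigned arithmetic: if index >= count, this wraps around to a huge value.
    return elem_size * (count - 1 - index);
}

// Hypothetical checked variant: rejects an out-of-range index instead of
// moving memory that does not belong to the array.
bool checked_remove(std::vector<int>& items, std::size_t index)
{
    if (index >= items.size())
        return false;

    // Shift the elements after `index` one slot to the left, then drop the last slot.
    std::memmove(items.data() + index, items.data() + index + 1,
                 sizeof(int) * (items.size() - 1 - index));
    items.pop_back();
    return true;
}
```

With three elements, a "not found" position of 3 makes the unchecked length wrap to nearly the whole address space, which is consistent with an access violation inside memmove.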

Unfortunately, because of inlined code and heavy register usage by the MSVC optimizer, I can't verify this guess even with a crash dump containing full process memory.

To fix this I propose the following patch:

Index: jrd/LocksCache.h

RCS file: /cvsroot/firebird/firebird2/src/jrd/Attic/LocksCache.h,v
retrieving revision 1.1.2.3
diff -u -w -b -r1.1.2.3 LocksCache.h
--- jrd/LocksCache.h 27 Oct 2009 09:16:27 -0000 1.1.2.3
+++ jrd/LocksCache.h 21 Jun 2010 21:59:57 -0000
@@ -174,6 +174,9 @@
fb_assert(found);

			que_inst = que_inst->que_backward;

+ if (que_inst == &m_lru) {
+ que_inst = que_inst->que_backward;
+ }
QUE_DELETE(lock->m_lru);
m_sortedLocks.remove(pos);
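The guard in the patch above can be illustrated with a self-contained sketch of an intrusive circular queue. This is not the engine code: Link, Lock, insert_front, lock_from_link and step_back_skipping_head are hypothetical stand-ins for que, the cached lock class, and the QUE macros. It only shows why stepping backward can land on the head sentinel, which is not embedded in any lock, and how skipping it keeps the pointer on a real element.

```cpp
#include <cassert>
#include <cstddef>

// Minimal intrusive circular queue link; stands in for the engine's "que".
struct Link {
    Link* forward;
    Link* backward;
};

// Hypothetical cached lock that embeds its LRU link, like lock->m_lru.
struct Lock {
    int key;
    Link lru;
};

// An empty queue is a head sentinel that points to itself in both directions.
void init_head(Link& head) { head.forward = head.backward = &head; }

// Insert node right after the head (the most recently used position).
void insert_front(Link& head, Link& node)
{
    node.forward = head.forward;
    node.backward = &head;
    head.forward->backward = &node;
    head.forward = &node;
}

// Recover the owning Lock from its embedded link. Only valid for links that
// really live inside a Lock, never for the head sentinel.
Lock* lock_from_link(Link* link)
{
    return reinterpret_cast<Lock*>(reinterpret_cast<char*>(link) - offsetof(Lock, lru));
}

// Step backward one element and skip the head sentinel, as the patch does.
Link* step_back_skipping_head(Link& head, Link* link)
{
    link = link->backward;
    if (link == &head)
        link = link->backward;
    return link;
}
```

Note that on an empty queue the skip still returns the head itself, so a caller in this sketch would need to check for emptiness separately.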

@firebird-automations

Commented by: Neil Pickles (npickles)

I have another instance of an access violation occurring within Firebird; the associated crashdump and Dr Watson log files can be downloaded from http://news.csy.co.uk/christchurch.7z.

Again this is using the same v2.1.4 build as the other sites.

@firebird-automations

Commented by: @hvlad

christchurch is the same case as above with LocksCache.

@firebird-automations

Commented by: @hvlad

Neil, please try

http://www.firebirdsql.org/download/rabbits/hvlad/fbserver-2.1.4.18314-1_Win32.7z

This is fbserver.exe, based on the current 2.1.4 branch with the patch above.

It should fix the issues with LocksCache, I hope.

@firebird-automations

Commented by: Neil Pickles (npickles)

I'll try that at the 6 or so sites that are running v2.1.4.

I also have another instance, not sure if it is the same as before, that can be downloaded from http://news.csy.co.uk/brackmills.7z

@firebird-automations

Commented by: @hvlad

"brackmills" is also the same issue with LocksCache, on FB 2.1.3 this time.

@firebird-automations

Commented by: Neil Pickles (npickles)

I updated the 10 test sites to the patched version of v2.1.4 overnight.

Two have since produced crashdump files again.

These can be downloaded from http://news.csy.co.uk/stirling_v214_patch.7z & http://news.csy.co.uk/brackmills_v214_patch.7z

@firebird-automations

Commented by: Neil Pickles (npickles)

I've since got another three crashdumps from the 10 test sites; I'll let you know where they can be downloaded from when I have them back from site. Two are from the same sites as the previous two and one is from a different site.

@firebird-automations

Commented by: Neil Pickles (npickles)

The three new crashdumps can be downloaded from http://news.csy.co.uk/stirling_v214_patched_2.7z , http://news.csy.co.uk/bedminster_v214_patched.7z & http://news.csy.co.uk/brackmills_v214_patched_2.7z

@firebird-automations

Commented by: @hvlad

stirling_v214_patch is the same issue with LocksCache. I will look at the other dumps shortly.

I prepared a new build with a new patch:

http://www.firebirdsql.org/download/rabbits/hvlad/fbserver-2.1.4.18314-2_Win32.7z

@firebird-automations

Commented by: Neil Pickles (npickles)

I'll get that new version installed and see how I get on with that. I'll try to do it today.

There is another crashdump available for download, from the first patched version: http://news.csy.co.uk/benton_v214_patched.7z . Could you confirm whether this site is experiencing the same LocksCache issue as the others?

@firebird-automations

Commented by: Neil Pickles (npickles)

I updated the 10 test sites last night and so far today (it's 3.30pm in the UK now) none have fallen over.

I'll monitor things over the weekend and update you on Monday but it's looking good so far.

Thanks for all your help.

Any idea when v2.1.4 is going to be released? Is there much still to be fixed in v2.1.4 that you know about?

Cheers,

Neil Pickles

@firebird-automations

Commented by: Neil Pickles (npickles)

I've been monitoring this since last week and it now appears to be sorted out.

Thanks for your prompt help with this. I look forward to the official release of v2.1.4, hopefully soon.

Cheers,

@firebird-automations

Commented by: @hvlad

Neil, thanks for the reports and your patience.

I'll commit the fix for the "LocksCache" issue under your CORE3050, as this ticket (CORE3053) points to another issue, related to process shutdown and most likely fixed by Alex in CORE2865. I'll change the description of CORE3050 to better reflect the nature of the bug.
