Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FB2.5.7 and Firebird-2.5.8.27070-0_x64 - periodically frozen under load [CORE5604] #5870

Closed
firebird-automations opened this issue Sep 5, 2017 · 14 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: Slavomir Skopalik (skopaliks)

Is replaced by CORE5617

Approximately every one or two days first firebird will consume one CPU, but no SQL command in monitoring tables.
After some time, about 6 hours, firebird stops executing statements and receiving new connection.
Some notes about environment:
* About 50 connections
* About 5 transactions per second
* There are high number of rollbacked transactions (making update and rollback)

From log (firebird was hard killed and start at Tue Sep 05 07:14:23 2017):
CEECZPROP05 Tue Sep 05 07:03:05 2017
INET/inet_error: read errno = 10054

CEECZPROP05 Tue Sep 05 07:14:24 2017
Sweep is started by SWEEPER
Database "D:\FBDATA\KINGSPAN_CZ.FDB"
OIT 93640260, OAT 93935017, OST 93935017, Next 93935034

CEECZPROP05 Tue Sep 05 07:18:30 2017
Sweep is finished
Database "D:\FBDATA\KINGSPAN_CZ.FDB"
OIT 93935016, OAT 93935359, OST 93935359, Next 93936545

CEECZPROP05 Tue Sep 05 07:30:28 2017
INET/inet_error: read errno = 10054

Process dump file is here: http://www.elektlabs.cz/20170905-FB2.5.8.7z

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

It appears that you problem is automatic database sweep. Try disabling the sweep, by setting "Sweep Interval" to 0/zero.

@firebird-automations
Copy link
Collaborator Author

Commented by: Slavomir Skopalik (skopaliks)

I will try, but I'm worry about performance in long term.

Also I was checked log file and I was found that previous sweep started after another frozen.

CEECZPROP05 Mon Sep 04 02:55:40 2017
Sweep is started by SWEEPER
Database "D:\FBDATA\KINGSPAN_CZ.FDB"
OIT 93549286, OAT 93591888, OST 93591886, Next 93591904

CEECZPROP05 Mon Sep 04 02:56:23 2017
Shutting down the server with 24 active connection(s) to 1 database(s), 0 active service(s)

CEECZPROP05 Mon Sep 04 02:56:23 2017
Error during sweep:
connection shutdown

CEECZPROP05 Mon Sep 04 02:56:34 2017
Timeout expired during remote server shutdown

CEECZPROP05 Mon Sep 04 02:56:34 2017
INET/inet_error: read errno = 10093

CEECZPROP05 Mon Sep 04 02:58:09 2017
Sweep is started by SWEEPER
Database "D:\FBDATA\KINGSPAN_CZ.FDB"
OIT 93549286, OAT 93592057, OST 93592057, Next 93592060

CEECZPROP05 Mon Sep 04 03:02:01 2017
Sweep is finished
Database "D:\FBDATA\KINGSPAN_CZ.FDB"
OIT 93592058, OAT 93592112, OST 93592112, Next 93592571

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

I think you are reading too much from the Logs.

I don't see anything that suggests that the first sweep was "frozen", it was stopped because of a server shutdown. (there wasn't enough time between the start of the sweep and the shutdown to say that the sweep was "frozen")

If you are concerned about the long term performance, create a scheduled job which performs a sweep (using gfix) on a weekly basis. For the most part, if you have active tables, the accumulated "garbage" will be cleaned-up thru normal/day-to-day table/row operations.

@firebird-automations
Copy link
Collaborator Author

Commented by: Slavomir Skopalik (skopaliks)

About log.
There is some funny thing. When we are try to stops, this request failed (from instsvc.exe and also by windows services).
Next by ProcessExplorer we killed process and process automatically restarted.
Then we can stops by instsvc.exe.

More info abut current state:
* We are switched during last frozen to pure classic
* Windows firewall is fully off
* We are using events
Now log contain this:
CEECZPROP05 Tue Sep 05 16:58:23 2017
Sweep is started by SWEEPER
Database "D:\FBDATA\KINGSPAN_CZ.FDB"
OIT 94177230, OAT 94207593, OST 94207592, Next 94207596

CEECZPROP05 Tue Sep 05 17:00:32 2017
INET/inet_error: read errno = 10054

CEECZPROP05 Tue Sep 05 17:00:32 2017
INET/inet_error: read errno = 10054

CEECZPROP05 Tue Sep 05 17:00:32 2017
Unable to complete network request to host "ceeczprop05".
Error reading data from the connection.

CEECZPROP05 Tue Sep 05 17:00:32 2017
Unable to complete network request to host "ceeczprop05".
Error reading data from the connection.

CEECZPROP05 Tue Sep 05 17:01:00 2017
INET/inet_error: read errno = 10054

CEECZPROP05 Tue Sep 05 17:01:00 2017
INET/inet_error: read errno = 10054

CEECZPROP05 Tue Sep 05 17:01:00 2017
Unable to complete network request to host "ceeczprop05".
Error reading data from the connection.

CEECZPROP05 Tue Sep 05 17:01:00 2017
Unable to complete network request to host "ceeczprop05".
Error reading data from the connection.

CEECZPROP05 Tue Sep 05 17:02:36 2017
Sweep is finished
Database "D:\FBDATA\KINGSPAN_CZ.FDB"
OIT 94207591, OAT 94207677, OST 94207677, Next 94208845

CEECZPROP05 Tue Sep 05 17:06:04 2017
SERVER/process_packet: broken port, server exiting

CEECZPROP05 Tue Sep 05 17:29:37 2017
INET/inet_error: read errno = 10054

CEECZPROP05 Tue Sep 05 17:29:37 2017
Unable to complete network request to host "ceeczprop05".
Error reading data from the connection.

CEECZPROP05 Tue Sep 05 17:44:33 2017
INET/inet_error: read errno = 10054

CEECZPROP05 Tue Sep 05 17:44:33 2017
Unable to complete network request to host "ceeczprop05".
Error reading data from the connection.

CEECZPROP05 Tue Sep 05 18:06:42 2017
INET/inet_error: read errno = 10054

CEECZPROP05 Tue Sep 05 18:06:43 2017
Unable to complete network request to host "ceeczprop05".
Error reading data from the connection.

CEECZPROP05 Tue Sep 05 18:53:17 2017
SERVER/process_packet: broken port, server exiting

CEECZPROP05 Tue Sep 05 18:59:54 2017
INET/inet_error: read errno = 10054

CEECZPROP05 Tue Sep 05 18:59:54 2017
INET/inet_error: read errno = 10054

CEECZPROP05 Tue Sep 05 18:59:54 2017
Unable to complete network request to host "ceeczprop05".
Error reading data from the connection.

CEECZPROP05 Tue Sep 05 18:59:54 2017
Unable to complete network request to host "ceeczprop05".
Error reading data from the connection.

CEECZPROP05 Tue Sep 05 19:00:10 2017
INET/inet_error: read errno = 10054

CEECZPROP05 Tue Sep 05 19:00:10 2017
Unable to complete network request to host "ceeczprop05".
Error reading data from the connection.

CEECZPROP05 Tue Sep 05 19:12:16 2017
INET/inet_error: read errno = 10054

CEECZPROP05 Tue Sep 05 19:12:16 2017
Unable to complete network request to host "ceeczprop05".
Error reading data from the connection.

CEECZPROP05 Tue Sep 05 20:44:11 2017
INET/inet_error: read errno = 10054

CEECZPROP05 Tue Sep 05 20:44:11 2017
Unable to complete network request to host "ceeczprop05".
Error reading data from the connection.

CEECZPROP05 Tue Sep 05 22:47:54 2017
SERVER/process_packet: broken port, server exiting

CEECZPROP05 Tue Sep 05 23:14:50 2017
SERVER/process_packet: broken port, server exiting

@firebird-automations
Copy link
Collaborator Author

Commented by: Sean Leyne (seanleyne)

Sweep is a very intensive process, so not being able to shutdown the engine is entirely expected.

You need to disable auto-sweep as soon as possible.

Most of the errors noted are due to your killing of the Firebird engine/instances/service.

@firebird-automations
Copy link
Collaborator Author

Commented by: Slavomir Skopalik (skopaliks)

We kill FB last time at Tue Sep 05 16:58:23 2017.
Everything after this timestamp is FB self related.
No human action.

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

The issue is a bit complex and seems to be related with events.

Intersting thread is buzy detaching from database and freeing registered events

#⁠ Call Site
00 fb_inet_server!Jrd::EventManager::free_global+0x3c
01 fb_inet_server!Jrd::EventManager::delete_session+0x75
02 fb_inet_server!Jrd::EventManager::deleteSession+0x1e
03 fb_inet_server!release_attachment+0x5f
04 fb_inet_server!purge_attachment+0x3bb
05 fb_inet_server!jrd8_detach_database+0x270
06 fb_inet_server!detach_or_drop_database+0x119
07 fb_inet_server!rem_port::end_database+0x48
08 fb_inet_server!process_packet+0x319
09 fb_inet_server!loopThread+0x19d
0a fb_inet_server!`anonymous namespace'::threadStart+0x4a
0b msvcr80!endthreadex+0x47
0c msvcr80!endthreadex+0x104
0d kernel32!BaseThreadInitThunk+0xd
0e ntdll!RtlUserThreadStart+0x21

Some threads waits for events table mutex.

One thread acquires monitoring lock in EX mode to produce monitoring snapshot.
It ask all connections to release its instances of monitoring lock.
First connection can't release own instances of monitoring lock as it is busy in EventManager::free_global()
thus "monitoring" conneciton waits for "events" connection.

Other connections already released its monitoring lock instances, dumped monitoring data, acquires monitoring
lock in SH mode again and wait for the "monitoring" conneciton.

Unfortunately, process memory dump contains no image of events data (which reside in shared memory)
and i can't find exact reason of freeze. It could be huge number of data to process (could be detected by
file size of events file in shared memory fb_events_XXX) or some kind of corruption of events data.

Two quick questions
a) could you monitor size of fb_events_XXX file (should be at C:\ProgramData\firebird) ?
b) does you changed EventMemSize setting at firebird.conf ?

@firebird-automations
Copy link
Collaborator Author

Commented by: Slavomir Skopalik (skopaliks)

Current situation, pure classic, problem already happens:
fb_event_87955d300000da022a000000 166 MB (174,325,760 bytes)

#⁠ ----------------------------
#⁠
#⁠ Bytes of shared memory allocated for event manager.
#⁠
#⁠ Type: integer
#⁠
#⁠EventMemSize = 65536

Dump file related to this situation (frozen classic) is here:
http://www.elektlabs.cz/Firebird-2.5.8.27070-0_Clasic_x64.7z

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

> fb_event_87955d300000da022a000000 166 MB (174,325,760 bytes)
Looks like you should seriously reconsider how your application works with events...

@firebird-automations
Copy link
Collaborator Author

Commented by: Slavomir Skopalik (skopaliks)

One more info, that can be important:
1. We starts using FB3.0.3 client library because of problem with event registration in FB2.5.7 library see CORE5597
2. When is happen, one of FB instances consume 100% of CPU core

Now, there are 14 connection, 9 can register Events (up to 10 events) and file size is: 171 MB (179,437,568 bytes)

@firebird-automations
Copy link
Collaborator Author

Commented by: @hvlad

Could you provide us with reproducible test case ?

@firebird-automations
Copy link
Collaborator Author

Commented by: @pcisar

Badly written application registered interest into too many events in uncontrolled way which exploits problem in engine that doesn't have a limit for event table size. Issue converted to CORE5617.

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

status: Open [ 1 ] => Closed [ 6 ]

resolution: Won't Fix [ 2 ]

@firebird-automations
Copy link
Collaborator Author

Modified by: @pcisar

Link: This issue is replaced by CORE5617 [ CORE5617 ]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant