Issue Details (XML | Word | Printable)

Key: CORE-6296
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Critical Critical
Assignee: Vlad Khorsun
Reporter: Thomas Kragh
Votes: 0
Watchers: 10
Operations

If you were logged in you would be able to see more operations.
Firebird Core

Many connections results in poor performance

Created: 01/May/20 10:06 PM   Updated: 02/Jul/20 03:33 PM
Component/s: Engine
Affects Version/s: 3.0.5
Fix Version/s: 3.0.6, 4.0 RC 1

File Attachments: None
Image Attachments:

1. 2020_05_02---fb_test--1.png
(472 kB)

2. 2020_05_02---fb_test--2.png
(399 kB)

3. prod-db4.PNG
(86 kB)

4. uvd-db2.PNG
(87 kB)
Environment: Windows/Linux
Issue Links:
Relate
 

QA Status: Cannot be tested


 Description  « Hide
------------UPDATE------------
1) The ping statement have been updated to select from RDB$database instead of mon$database.
2) The parameters of the console applications have been ajusted for the new ping statement
3) This bug report stated that Classic architecture was also affected, this looks not to be the case, after updating the ping statement.


Hey
Where I work, we run a fairly large education platform, at peek hours the database servers execute approximately 1200 to 1400 query's/second. Over the last 4-5 weeks I have been working with Alexey from IBSurgeon to resolve a performance issue that we have been having.

The problem occurs when the system is under high load, and over a period of 5-10 seconds the number of attachments rises from eg. 125 to 800 attachments. 800 is the max limit allowed by the connection pools, eg. 4 web servers with a 200 max pool size. Under high load with 125 attachments the web application is spending 13 milliseconds on average waiting for the database to execute query's, when we reach 800 attachments this number i 10-20 times greater.

I believe that the quick rise in attachments can be correlated with garbage collection in .Net, where the application is suspended for eg. 100ms, but the webserver still receives requests, when GC is done, the queued requests are executed in parallel, causing perhaps 100 new attachments to the database. This again causes the database server to respond slower that the average 13ms, when this happens the 3 other webservers will need to create more attachments to the database, causing a chain reaction.
After hitting 800 attachments the average query time is 10-20 times slower, this can last for hours if the web application is not restarted, after witch everything returns to normal.

To recreate what I think is the problem, I have created a sample application(see download link in bottom), that shows that overall query speed slows down proportional to the number of attachments.

Reproduce steps:
1) Ajust the connectionstring found in App.config
2) Start one instance of the console application with the command "FbPerformance.exe --testcase ping --sleep 500", --testcase describes the "mode", and --sleep is the time between ping. Leave the application running. Ping query is a simple "select current_timestamp from rdb$database".
3) Start another instance of the application with the command "FbPerformance.exe --testcase ambient --count 1200" testcase is "ambient" and creates 1200 connections/transactions to the database. No statements will be executed on these connections!
4) While the second application creates attachments please observe the "execute statement" time that the ping application prints. This will go from 2-4ms to 20-40ms.

Step 1-4 is the steps to reprocuce. 5-6 is bonus information

5) Stop the ping application and change --sleep to 20, and observe the CPU usage of the Firebird process, witch i find quite high compared to the small amount of work.
6) Stop the ambient instance from step 3, and observe that start transaction/execute statement in the ping application can take quite some time - I have observed cases of 600-1400ms, after witch executing statement is back to about 2-4ms.

This behavior has been observed on Firebird 3.0.5 running Centos7, in SuperServer mode, where the test application was runnning on Windows Server 2012R2.
And on Firebird 3.0.5 running Windows 10 in SuperClassic and SuperServer mode, where the test application was run on the same machine.

Also note that if I Attache a trace session(FBTraceManager3) it will report that the ping statement takes less that 1ms to execute, even though the ping application reports 40ms.

sample application
https://mublobstorageprod.blob.core.windows.net/mudrift/fbperformance/sampleapplication.zip?st=2020-05-01T21%3A04%3A08Z&se=2025-05-02T21%3A04%3A00Z&sp=rl&sv=2018-03-28&sr=b&sig=SiwOnTZ15E4HQsY3aNtZnkFmTlseiwN8xt5tG%2BmJ3qY%3D

sample application source
https://mublobstorageprod.blob.core.windows.net/mudrift/fbperformance/sampleapplication_src.zip?st=2020-05-01T21%3A04%3A51Z&se=2025-05-02T21%3A04%3A00Z&sp=rl&sv=2018-03-28&sr=b&sig=3qt59y1TcbFkt12%2FUWm1dTPHooo1ELvE77X5tXpHLB0%3D

I hope this is enough information for you guys to track down this issue, and thank for a great database server!
Best regards Thomas

 All   Comments   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Karol Bieniaszewski added a comment - 01/May/20 11:07 PM
>> Ping query is a simple "select current_timestamp from mon$database"

Sorry byt this is not a simple select. It is very costly select which take snapshot of database state.
Simple select will be from RDB$database but not from MON$

Thomas Kragh added a comment - 02/May/20 05:45 AM - edited
Hey Karol
Thank you for the response, the query was chosen so that the program could be run against all databases.
Of coures using RDB$database is better.

Anyway, this can also be recreated whit a primery key query in the employee database, in the .config file you can replace the pingstatement config line with this:
<add key="pingstatement" value="select e.hire_date as &quot;current_timestamp&quot; from employee e where e.emp_no = 20" />

It might be necessary to increase the --count parameter to perhaps 1200 to see big changes using this query. When i ran the test using Classic, SuperClassic and SuperServet, it looks like Classic is less affected when using this new ping query.

Kovalenko Dmitry added a comment - 02/May/20 06:37 AM - edited
I executed your test on my workstation [Win10, FB3.0.6 SS x64, DB on RAM Disk] :)

select current_timestamp from rdb$database

FbPerformance.exe --testcase ping --sleep 2
FbPerformance.exe --testcase ambient --count 1800

Two different providers:
- FirebirdSql.Data.FirebirdClient
- lcpi.data.oledb + LPCI.IBProvider.5 + LCPI OLE DB Services (I little adapted your code)

Not any visible problems. Execution time increased, but not critical.
- FirebirdSql.Data.FirebirdClient - to 9-11 ms
- lcpi.data.oledb - to 6-8ms

See attached screenshot - 2020_05_02---fb_test--1.png.

(About slow time of connection through lcpi.data.oledb. I think, it is result of "right" verification of connection before reusing. Provider make ping of server).

Kovalenko Dmitry added a comment - 02/May/20 07:05 AM
2020_05_02---fb_test--2.png

Work after stoping of "ambient 1800" process.

Time of execute statement: 1-2ms for both providers.

Thomas Kragh added a comment - 02/May/20 11:35 AM
Hey All
So I ran the application against two servers instead of my laptop - see attached pictures

prod-db4 is Firebird 3 Superserver running on Windows Server, here the overall increase is about 120%
and
uvd-db2 is Firebird 3 Superserver running on CentOs, here the overall increase is about 1000%

As updated in the discription, it looks line classic doesn't have the problem as originally described.

Sean Leyne added a comment - 11/Jun/20 02:53 PM
Vlad,

Please describe the nature of the "fix"

Paul Reeves added a comment - 11/Jun/20 03:20 PM
I ran the performance test with the new code and the initial results look good.

Vlad Khorsun added a comment - 11/Jun/20 05:45 PM - edited
It is all about our TCP listener.

The fix uses sorted array of pollfd structures to avoid few places in our code with O(N^2) complexity.
Note, poll() is not used on Windows, so this problem is not present there.

Also when old code searched for the signalled port (after call of select\poll), it started from the head
of the ports list every time, now such search starts from last checked port in list.

--
PS I restored my comment as it is and was much more correct.
Please, avoid such "correcting" in the future.

Sean Leyne added a comment - 11/Jun/20 06:18 PM
The "Environment" refers to Linux *and Windows*, and the details make same references. Is there a problem with Windows or not?

Vlad Khorsun added a comment - 11/Jun/20 07:29 PM
Sean,

the Windows is affected too but in a much less degree.