Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3.x SuperServer: poor performance when the same (trivial) query is executed by more than two sessions [CORE5962] #6216

Open
firebird-automations opened this issue Nov 12, 2018 · 6 comments

Comments

@firebird-automations
Copy link
Collaborator

Submitted by: @pavel-zotov

Attachments:
FB_3x-SuperServer-parallel-execution-poor-performance.xlsx
parallel-executing-performance-SS-vs-SC-benchmark.7z

Votes: 1

It was ecountered (suddenly) that performance of any trivial SQL statement will be *very* poor when testing against SuperServer. I have lot of results for fresh 3.0.4, but the same result is for 3.0.1 (sep-2016).
This is reproduced in 100% of runs when statement does something with DB tables (rather than "no-DB" statement that evaluates something, e.g. check pattern matching of some text or calls deterministic PSQL-function in loop, etc).
Table can be either fixed or GTT - it does not matter; but performance on GTT goes down slower. Checked table always contained only 1 record and single field: ID int.

Steps to reproduce.
0) create empty database (see below: "E:\FB30.TMPINSTANCE\examples\empbuild\tmptest.fdb"), prepare FB instance for work as SuperServer;
1) create batch scenario and change its variables to be match to your enviroment ("fbhome", "fbport", "dbname", ISC_USER & ISC_PASSWORD).

@echo off
setlocal enabledelayedexpansion enableextensions
set fbhome=E:\FB30.TMPINSTANCE
set fbhost=localhost
set fbport=3030
set dbname=E:\FB30.TMPINSTANCE\examples\empbuild\tmptest.fdb
set sql=%~dpn0.sql

set ISC_USER=SYSDBA
set ISC_PASSWORD=masterkey

set att_cnt=%1
if .!att_cnt!.==.. (
echo Must specify number of establishing attachments.
exit
)

md %~dp0logs 2>nul

if .1.==.0. (
!fbhome!\gfix.exe -mode read_write !fbhost!/!fbport!:!dbname!
(
echo set bail on;
echo recreate table fix_test(id int^);
echo insert into fix_test(id^) values(1^);
echo commit;
echo recreate global temporary table gtt_test(id int^) on commit delete rows;
echo insert into gtt_test(id^) values(1^);
echo commit;
) > %~dpn0.tmp

\!fbhome\!\\isql\.exe \!fbhost\!/\!fbport\!:\!dbname\! \-i \!%\~dpn0\.tmp\!
if errorlevel 1 \(
    echo Error while preparing test DB\.
    exit
\)
del %\~dpn0\.tmp

\!fbhome\!\\gfix\.exe \-mode read\_only \!fbhost\!/\!fbport\!:\!dbname\!

)

for /l %%i in (1,1,!att_cnt!) do (
set run_isql=!fbhome!\isql.exe !fbhost!/!fbport!:!dbname! -i !sql!
set /a k=1000+%%i
set log=%~dp0logs\tmp.!k:~1,3!.log
echo "!run_isql! 1^>!log! 2>&1"
start /min cmd /c "!run_isql! 1>!log! 2>&1"
)

2) create SQL script in the same folder and with name = name of batch (i.e., "ss_test.bat" --> "ss_test.sql"):

set autoddl off;
commit;
set transaction read only;
set bail on;
set plan on ;
set explain on;
set stat on;
set list on;
set term ^;
execute block returns(dts_before timestamp, dts_after timestamp) as
declare c int = 1373078;
declare i int;
begin
dts_before = 'now';
while (c>0) do
begin
select 1 from rdb$database into i; --------------------------------- [ 1 ]
c = c-1;
end
dts_after= 'now';
suspend;
end
^
set term ;^
quit;

Please look in this SQL: note that statement marked as " [ 1 ] " will be executed in PSQL loop about 1.3E+6 times, but it is absolutely trivial.

3) write somewhere number of physical cores of your CPU:

WMIC.exe /locale:ms_409 cpu get NumberOfCores /format:list > c:\temp\my_cpu_phys_cores.txt

4) launch trace with following config:

database = \(%\[\\\\/\]\(security3\)\.fdb\|\(security\.db\)\)
\{
  enabled = false
\}

database =
{
enabled = true
log_errors = true
log_initfini = false
time_threshold = 0
log_statement_finish = true
print_plan = true
}

5) run batch with specifying as 1st (and single) argument number of ISQL sessions that should be launched in asynchronous mode (i.e. using START command; on POSIX this id sone by specifying "&" after command line).

Please start with ss_test.bat 1, and repeat it 2-3 times -- this will give somewhat like etalone results.

Stop trace and make filter of its data (we are interested only about FETCHES):

findstr /c:"fetch(es)" trace.log | findstr /v "mark(s)" > filtered_trace.log

NOTE! After each start you have to wait until ALL child windows will be closed (and no further activity in trace can be seen).

After run "ss_test.bat 1" - repeat with "ss_test.bat 2" and then with "ss_test.bat 4", ""ss_test.bat 8" and so on (and WAIT until all sessions will complete!)
Also, do restart of trace before each launch and stop it, plus filter it into separate text file (see above).

You will see that:
1) elapsed time grows approx. linearly - EVEN WHEN NUMBER OF ATTACHES LESS THAN PHYSICAL CPU CORES
2) elapsed time of different sessions becomes VALUABLE DIFFER when their number greater than physical CPU cores. Some of attaches (usually 2-3) can finish their job MUCH FASTER that others. Difference can be up to 2...3 times. Workload balance is poor here (unfair).
3) Growth of elapsed time remains linear with increasing number of attachments, but coefficient between Y/X is much greater after exceeding number of physical cores.

If you look again into batch then you can see there: "if .1.==.0. ("
Change it to: .1.==.1. -- and try with other types of DB objects: GTT, selectable procedures, views etc. Of course, after this you have to change SQL script that will handle these (new) objects.

I tried with following:
* "dummy loop" with only "i = i+1", no query to any DB object; result: elapsed time did not increased for 1,2,4... attaches until all physical cores will be involved in work (so this is 100% scalability);
* loop with evaluating: bool_var = <LONG_STRING> LIKE '%qwerty%'; result: good scalability until all physical cores were involved; for 8 and 16 attachments there was valuable difference in elapsed time (unfair balancing of cpu time);
* loop with query to GTT instead of fixed table RDB$DATABASE. Result: POOR, no scalability. Linear growth of elapsed time; and again valuable difference of speed between attachments (unfair balancing). Good news: avg time for GTT is about 3-4 times less than for RDB$DATABASE
* loop with query to FIXED USER-DEFINED table (i.e. not from DB dictionary). Result: POOR, the same as for RDB$DATABASE.

Test was also repeated:
* for different values of DB cache pages;
* for database access = read_only;
* for FW = OFF
* for different transactions isolation level (including snapshot reserving table)

Results were the same.

::::::::::::::::::::::::::::::::::: NOTE :::::::::::::::::::::::::::::::::::::::
NO such troubles in Classic and SuperClassic!
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Please look into attached .xlsx: there are data of measures on Windows host.
Sheets:
* Cs vs sC vs SS -- results for Classic, SuperClassic and SuperServer, for "select 1 from rdb$database", for attaches count: 1,2,4,8 and 16
* benchmark query to FIX table -- benchmark of fixed table vs GTT; also - results for old FB 3.0.1.32610, 27.09.2016
* benchmark query to GTT table -- miscelan results for GTT when we query it directly or via VIEW or via SP or function; results for deterministic PSQL function; results for DIFFERENT GTTs
* benchmark string`LIKE` pattern -- single measurement for loop with <LONG_STRING> LIKE '%qwerty%' statement;
* .bat, .sql -- content of batch and SQL
* sys info -- firebird.conf for 3.0.x that was used; also - miscelan system info about machine hardware

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

summary: 3.x SuperServer: quick performance when the same (trivial) query is executed by more than two sessions => 3.x SuperServer: poor performance when the same (trivial) query is executed by more than two sessions

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

Attachment: FB_3x-SuperServer-parallel-execution-poor-performance.xlsx [ 13309 ]

@firebird-automations
Copy link
Collaborator Author

Commented by: @pavel-zotov

PS.
Also, attempt to improve balancing was done on POSIX by tuning NUMA (Non-Unfirmed Memory Access) with the help of numactl package.

It was machine with 24 cores ( server with 2 CPU sockets x 6 physical cores x 2 HT ) and following was shown there :

#⁠ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 16254 MB
node 0 free: 6388 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 16384 MB
node 1 free: 4344 MB
node distances:
node 0 1
0: 10 21
1: 21 10

I decided to do following:
1) numactl --cpunodebind=0 --membind=0 -- /opt/firebird/bin/firebird -p 3333
2) numactl cpunodebind=0 --localalloc -- /opt/firebird/bin/firebird -p 3333
3) numactl --physcpubind=+0-5 --localalloc -- /opt/firebird/bin/firebird -p 3333

-- with launching 1,2,4 and 8 ISQL sessions each time.

Result: NO improvements was noticed, both times result was almost the same.
This is elapsed time for 1,2,4 and 8 working attachments:

1715 ms, 6865390 fetch(es)

4102 ms, 6865390 fetch(es)
4186 ms, 6865390 fetch(es)

7554 ms, 6865390 fetch(es)
7555 ms, 6865390 fetch(es)
7592 ms, 6865390 fetch(es)
6819 ms, 6865390 fetch(es)

13516 ms, 6865390 fetch(es)
13662 ms, 6865390 fetch(es)
13743 ms, 6865390 fetch(es)
13819 ms, 6865390 fetch(es)
13800 ms, 6865390 fetch(es)
13844 ms, 6865390 fetch(es)
13819 ms, 6865390 fetch(es)
13001 ms, 6865390 fetch(es)

-- i.e. linear growth.

Compare with SuperClassic:

2014 ms, 6865390 fetch(es)

2065 ms, 6865390 fetch(es)
2086 ms, 6865390 fetch(es)

2137 ms, 6865390 fetch(es)
2185 ms, 6865390 fetch(es)
2184 ms, 6865390 fetch(es)
2238 ms, 6865390 fetch(es)

2763 ms, 6865390 fetch(es)
2881 ms, 6865390 fetch(es)
2838 ms, 6865390 fetch(es)
2873 ms, 6865390 fetch(es)
2908 ms, 6865390 fetch(es)
3062 ms, 6865390 fetch(es)
3020 ms, 6865390 fetch(es)
3090 ms, 6865390 fetch(es)

And finally i turned NUMA off. Result was completely poor:

1947 ms, 6865390 fetch(es)

8473 ms, 6865390 fetch(es)
8518 ms, 6865390 fetch(es)

14164 ms, 6865390 fetch(es)
14295 ms, 6865390 fetch(es)
14811 ms, 6865390 fetch(es)
14836 ms, 6865390 fetch(es)

27132 ms, 6865390 fetch(es)
27227 ms, 6865390 fetch(es)
27367 ms, 6865390 fetch(es)
27450 ms, 6865390 fetch(es)
27982 ms, 6865390 fetch(es)
28092 ms, 6865390 fetch(es)
28071 ms, 6865390 fetch(es)
28082 ms, 6865390 fetch(es)

@firebird-automations
Copy link
Collaborator Author

Commented by: @pavel-zotov

Denis Simonov had guess that performance can also depend on number (count) of records that are processed (we discussed this ticket privately).

I decided to check this and made one more case of test (it will be added here as .7z attach).
Table has two integer fields: fix_test_1( id int, f1 int ) ;

Test adds 1'000'000 rows into this table:

insert into fix_test_1(id, f1 )
select row_number()over(), mod( row_number()over(), 100 )
from (select 1 i from rdb$types rows 100) a, (select 1 i from rdb$types rows 100) b, (select 1 i from rdb$types rows 100) c;

After this, descending index is created: create descending index fix_test_1_f1 on fix_test_1( f1 );

Then 4 ISQL sessions are launched at the same time and each of them does following script:

set autoddl off;
commit;
set transaction read only;
set stat on;
set list on;
set term ^;
execute block returns(dts_before timestamp, dts_after timestamp) as
declare c int = 1;
declare i int;
begin
dts_before = 'now';
c = 1 * ( select f1 from fix_test_1 order by f1 desc rows 1 );
while (c > 0) do
begin
select count(*) from fix_test_1 where f1 = :c into i; ---------------------------- [ 1 ]
--select count(*) from fix_test_1 where f1 + 0 = :c into i; ---------------------------- [ 2 ]
c = c-1;
end
dts_after= 'now';
suspend;
end
^
set term ;^
quit;

NOTE: statement [ 1 ] will use index, [ 2 ] will force engine to use natural reads.

Following cache was used for SS and SC:
1) SS: DefaultDbCachePages = 768K
2) SC: DefaultDbCachePages = 1024

Trace was started every time before all 4 ISQL sessions began process .sql.

Result: SuperServer has *excellent* scalability when INDEX is used, but POOR when engine does NATURAL READS.
This is result of filtering trace logs by command "findstr /i /c:"fetch(es)" <trace_log>" (SS = SuperServer; SC = SuperClassic):

E:\FB30.TMPINSTANCE>findstr /i /c:"fetch(es)" idx--SS-2.log
2788 ms, 1973 read(s), 1646548 fetch(es)
2823 ms, 1485 read(s), 1646548 fetch(es)
2823 ms, 1567 read(s), 1646548 fetch(es)
2829 ms, 2191 read(s), 1646548 fetch(es)

E:\FB30.TMPINSTANCE>findstr /i /c:"fetch(es)" idx--SC-2.log
13394 ms, 657036 read(s), 1646548 fetch(es)
13433 ms, 657036 read(s), 1646548 fetch(es)
13439 ms, 657036 read(s), 1646548 fetch(es)
13680 ms, 657036 read(s), 1646548 fetch(es)

E:\FB30.TMPINSTANCE>findstr /i /c:"fetch(es)" nat--SS-2.log
188357 ms, 2835 read(s), 101623207 fetch(es)
189731 ms, 608 read(s), 101623207 fetch(es)
190828 ms, 2278 read(s), 101623207 fetch(es)
191148 ms, 903 read(s), 101623207 fetch(es)

E:\FB30.TMPINSTANCE>findstr /i /c:"fetch(es)" nat--SC-2.log
95732 ms, 656172 read(s), 101623207 fetch(es)
95734 ms, 656172 read(s), 101623207 fetch(es)
96156 ms, 656172 read(s), 101623207 fetch(es)
96734 ms, 656172 read(s), 101623207 fetch(es)

@firebird-automations
Copy link
Collaborator Author

Commented by: @pavel-zotov

parallel-executing-performance-SS-vs-SC-benchmark.7z -- batch and used firebird.conf (SS and SC) when every ISQL session processed 1'000'000 rows, INDEXED vs NATURAL reads.

@firebird-automations
Copy link
Collaborator Author

Modified by: @pavel-zotov

Attachment: parallel-executing-performance-SS-vs-SC-benchmark.7z [ 13312 ]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant