Throttling Nbackup on large databases [CORE2316] #2740

Closed
firebird-automations opened this issue Feb 10, 2009 · 26 comments

Comments

@firebird-automations
Collaborator

Submitted by: Andreas Kallenbach (andreas_kc)

Attachments:
graph_image.php.png
Load Average.php.png
iostat.log
mpstat.log
Top Command during nbackup
Query Execution Comparison
Memory Usage during incremental backup.png
CPU Usage during incremental backup.png
Load during incremental backup.png
iostat2.log
iostat3.log

We have a 45GB database that runs on a fairly well-equipped machine, and we run a full nbackup once a month and an incremental once a day at 3am. The incremental takes around 12 minutes to complete, but during that time our regular use of the database crawls to an almost imperceptible stop.

Can nbackup be throttled to only consume 20%-50% of the system resources? While the backup itself may take longer, the users on the database could still continue to use the database. As it is, anyone using the database during an incremental backup will just take a break for 10 minutes because the backup is so intensive.

Are there other solutions to making nbackup more efficient and allowing other users to continue working on a database this size?

I'm not interested in setting up replication for a variety of reasons, but some sort of log-shipping type method has become increasingly attractive to keep a warm-backup ready in the event that the nasty(tm) occurs. Nbackup seems to be the only close solution but it has the drawback of almost locking the use of a 45GB database file.

Commits: c1593c1 cdd4323 a6e0665

@firebird-automations
Collaborator Author

Commented by: @hvlad

What do you mean by "regular use of the database crawls to an almost imperceptible stop"?
I guess queries run slower because of heavy IO pressure from nbackup? Or something else?

If my guess is correct, I see two ways to make things better:

a) nbackup reads the whole database every time it creates a backup. It is possible to introduce some kind of bitmap (or array) of pages which have changed since the last backup. It is not an easy task, taking into account multiple levels of backup and some other internals, but it is possible and looks like a must-have solution.

b) To reduce IO pressure we could limit nbackup's IO activity, for example by inserting a small delay between reads/writes or by limiting the number of IO requests per second issued by nbackup. I am not sure if it is OK and how it may impact the whole picture, though.

Any other ideas?
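
To make option (a) a bit more concrete, here is a minimal sketch of a changed-page map that handles several backup levels. It is only an illustration, not how nbackup works today: the class and method names are hypothetical, it assumes that a level-N backup contains the pages changed since the last level-(N-1) backup, and it ignores writes that happen while a backup is running.

```python
class ChangedPageMap:
    """Hypothetical tracker of pages written since each backup level."""

    def __init__(self, max_level):
        # changed_since[k] holds page numbers written since the last level-k backup.
        self.changed_since = [set() for _ in range(max_level + 1)]

    def mark_written(self, page_no):
        # A page write is "new" relative to the latest backup of every level.
        for pages in self.changed_since:
            pages.add(page_no)

    def pages_to_read(self, level):
        # Level 0 is a full backup: None means "read every page".
        if level == 0:
            return None
        # A level-N backup only needs pages changed since the last level N-1 backup.
        return sorted(self.changed_since[level - 1])

    def backup_finished(self, level):
        # From now on, changes are counted relative to this backup.
        self.changed_since[level].clear()
```

With such a map an incremental backup would read only the listed pages instead of scanning the whole file; the hard part mentioned above (keeping the map consistent and persistent inside the engine) is not addressed here.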

@firebird-automations
Collaborator Author

Commented by: Sean Leyne (seanleyne)

As Vlad points out, in order to reduce the amount of resources that NBackup would "consume", it would be necessary to add a delay to the IO logic.

As for the best 'measurement' for the throttle control, the choices are a delay between disk IOs (in sec/msec) or a maximum number of IOs per second.

I think that the Max IO is the better approach.
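
As a rough illustration of the "Max IO per second" idea, the sketch below copies a file while pacing block reads to a configured rate. It is not nbackup code: the function name, the parameters and the default block size are assumptions, and the clock and sleep functions are injectable only to make the pacing easy to check.

```python
import time


def throttled_copy(src, dst, max_ios_per_sec, block_size=64 * 1024,
                   clock=time.monotonic, sleep=time.sleep):
    """Copy src to dst, issuing at most max_ios_per_sec block reads per second."""
    interval = 1.0 / max_ios_per_sec
    next_allowed = clock()
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            now = clock()
            if now < next_allowed:
                # We are ahead of schedule: wait for the next slot.
                sleep(next_allowed - now)
            # Pace from the later of "now" and the planned slot, so time spent
            # idle or on slow IO is not saved up and released as a burst.
            next_allowed = max(now, next_allowed) + interval
            block = fin.read(block_size)
            if not block:
                break
            fout.write(block)
            copied += len(block)
    return copied
```

A fixed delay between IOs slows the copy down even when the disk is otherwise idle and the reads are fast; pacing against a target rate only sleeps when the copy is running ahead of that rate.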

P.S.

I find it interesting that you ran into this problem running Linux. We had seen the "consuming all resources" problem in some testing which was done during the development of the NBackup utility, but it had only occurred under Windows (we had thought that it would be a problem which would only affect large Windows databases).

Nikolay had confidently asserted that the problem would not occur under Linux, due to better IO scheduling. It is nice to see that he is human, and not correct all of the time. ;-]

Because we were in development and wanted to see what "real world" experience would bring, we had only noted this issue on the "to be fixed in the future" list. Thanks for opening the case.

@firebird-automations
Collaborator Author

Commented by: @romansimakov

Your users should not have to wait until the backup is finished. User work is stopped only while the backup state is being changed, which takes time because the cache has to be flushed. After that your users should be able to use the database as before. Currently, merging pages from the delta file into the main database file takes more time than it does in the new nbackup. There is a patch that improves nbackup, but it has known issues on SuperServer, and these prevent it from being applied to Firebird. At least the merging process should become faster than it is now. I can make a special build for you on which you can test how nbackup works. Of course, you must be careful when using it ;)

Vlad, a note about your case b: we must sleep for some time if there is another job. Maybe it's enough to reduce the priority of nbackup.exe?

@firebird-automations
Collaborator Author

Commented by: Andreas Kallenbach (andreas_kc)

Attached Cacti server graph; notice that our full nbackup starts at 4am. A full nbackup to local disk takes 18 minutes on a 45GB file. The load on the graph for the two hours after the nbackup is from the off-site file copy.

@firebird-automations
Collaborator Author

Modified by: Andreas Kallenbach (andreas_kc)

Attachment: graph_image.php.png [ 11340 ]

@firebird-automations
Collaborator Author

Commented by: @hvlad

Roman> Vlad, about your case b. I'd like to note. We must sleep some time if there is another job. Maybe it's enough to reduce a priority of nbackup.exe?

If this issue is really about IO pressure then I doubt that lowering CPU priority (which is the only priority I know of that can be set via an OS API) could help. And this is very easy to check without recompiling nbackup ;)
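
One way to run that check, sketched in Python under the assumption of a POSIX system: start the unmodified nbackup binary with a raised niceness and see whether query times improve. The helper name and the default niceness are made up for illustration.

```python
import os
import subprocess


def run_with_lower_priority(cmd, niceness=10):
    """Run cmd with its CPU niceness raised by `niceness` (POSIX only)."""
    def lower_priority():
        # Executed in the child process just before the command is exec'ed.
        os.nice(niceness)

    return subprocess.run(cmd, preexec_fn=lower_priority,
                          capture_output=True, text=True, check=True)


# Hypothetical usage, reusing the command line from the scripts in this thread:
# run_with_lower_priority(["/opt/firebird/bin/nbackup", "-U", "sysdba",
#                          "-P", "SEKRITPWD", "-B", "1",
#                          "/opt/firebird/net_data/audit.gdb", "audit.nbk"])
```

If the slowdown is unchanged with the backup running at low CPU priority, that points towards IO rather than CPU as the bottleneck.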

There is another possible reason for the slowdown during nbackup to investigate:
c) reading a file as huge as 45GB may evict from the OS file cache pages that are useful for Firebird (so-called "hot" pages).
In this case we could try the POSIX fadvise API to avoid caching pages read by nbackup.
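
A minimal sketch of the fadvise idea, written in Python rather than in nbackup's own code: read a file sequentially and, after each chunk, hint to the kernel that the range is no longer needed. The function name and chunk size are assumptions. Note that `POSIX_FADV_DONTNEED` applies to the file's cached pages regardless of which process loaded them, so on a live database file it might also drop pages the server still wants; this only shows the shape of the call.

```python
import os


def read_without_caching(path, chunk_size=1024 * 1024):
    """Yield the file's chunks, hinting the kernel to drop each one from its cache."""
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            yield chunk
            # The hint is advisory; it is skipped where the call is unavailable.
            if hasattr(os, "posix_fadvise"):
                os.posix_fadvise(fd, offset, len(chunk), os.POSIX_FADV_DONTNEED)
            offset += len(chunk)
    finally:
        os.close(fd)
```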

@firebird-automations
Collaborator Author

Commented by: Andreas Kallenbach (andreas_kc)

I will pick up more statistics on the server during the nbackup. My recollection is that during the nbackup process, the wa parameter shoots through the roof to between 60-90% of CPU time. Load average also goes over 2.0. I will record the CPU load during the process tomorrow for information that is more reliable than my recollection ;)

Roman, I have tried to encourage my users to continue using the database during the incremental backup, but they are an impatient, stubborn lot. For almost the entirety of the nbackup (10-12 minutes) the query response is just slow. Queries that normally open in less than a second tend to take 20 seconds or more. While queries do finish, they do so at a pace that encourages my users to mumble dirty words when they think of their sysdba during backup time. Thankfully, most will just pull out their smokes and take a break rather than dream up ideas on how to strangle me.

Vlad, in a day or so I will post some more information that should highlight which process is clogging the server and whether it is CPU, IO, etc.

@firebird-automations
Collaborator Author

Commented by: Andreas Kallenbach (andreas_kc)

Attached Load Average for nbackup process at 4am.

@firebird-automations
Collaborator Author

Modified by: Andreas Kallenbach (andreas_kc)

Attachment: Load Average.php.png [ 11341 ]

@firebird-automations
Collaborator Author

Commented by: Andreas Kallenbach (andreas_kc)

Ok, so the two graphs I attached actually show the full backup (nbackup -B 0) that now happens at 4:00am every day. Originally, I had a full backup plus incrementals at 10am, 2pm, 6pm, and 10pm. Then, because of complaints, I moved to a 2pm/6pm schedule, then down to a 2pm schedule, and then just settled on a daily full nbackup at 4am when there are only 3-4 users on. I would like to get back to using incrementals during the day, so they can be copied off to a remote server for a standby server.

Again, I will follow this with additional stats on the nbackup incremental graphs and top statistics when more users stop working for today.

The following scripts are how the nbackup full/incrementals are being initiated...

//FULL NBACKUP Script
filedate=`date +%Y-%m-%d_%H-%M-%S`
mount.cifs //10.0.x.x/dbbackup $mountpath -o username=dbbkp,pass=SEKRITPWD
cd /opt/firebird/net_data/temp/
mkdir -p $filedate

echo "Subject: DB NBackup Log" > $filedate/ib_email
echo "Backup started at "`date` >> $filedate/ib_email
hostname >> $filedate/ib_email
echo "Starting Full Nbackup" >> $filedate/ib_email
/opt/firebird/bin/nbackup -U sysdba -P SEKRITPWD -B 0 /opt/firebird/net_data/audit.gdb $filedate/audit.nbk >> $filedate/ib_email

echo "NBackup Finished at "`date` >> $filedate/ib_email
sendmail admin@infinityins.com < $filedate/ib_email

# Originally, this was set to nbackup directly to CIFS mount. Changed to this because if the disk mount ran out of diskspace, nbackup would stall and leave the database locked
mv -f /opt/firebird/net_data/temp/* $mountpath/nbackup/

//Incremental NBACKUP Script
filedate=`date +%Y-%m-%d_%H-%M-%S`
mount.cifs //10.0.x.x/dbbackup $mountpath -o username=dbbkp,pass=SEKRITPWD
cd /opt/firebird/net_data/temp/
mkdir -p $filedate

echo "Subject: DB Nbackup Hourly Log" > $filedate/ib_email
echo "Backup started at "`date` >> $filedate/ib_email
hostname >> $filedate/ib_email
echo "Starting Hourly NBackup" >> $filedate/ib_email
/opt/firebird/bin/nbackup -U sysdba -P SEKRITPWD -B 1 /opt/firebird/net_data/audit.gdb $filedate/audit.nbk >> $filedate/ib_email

echo "Hourly NBackup Finished at " `date` >> $filedate/ib_email
sendmail admin@infinityins.com < $filedate/ib_email
mv -f /opt/firebird/net_data/temp/* $mountpath/nbackup/
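
The stall on a full disk mentioned in the full-backup script could be guarded against with a free-space check before nbackup is started. The following is only a sketch under assumptions: the helper name and the safety factor are invented, and it uses the size of the database file as a worst-case estimate for the backup size.

```python
import os
import shutil


def enough_space_for_backup(db_path, target_dir, safety_factor=1.1):
    """True if target_dir has room for a worst-case (full-size) backup of db_path."""
    # Worst case: the backup is as large as the database file, plus some margin.
    needed = os.path.getsize(db_path) * safety_factor
    free = shutil.disk_usage(target_dir).free
    return free >= needed
```

A wrapper script could call this first and skip the backup (and send a warning mail) when it returns False, instead of letting nbackup run into a full target.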

@firebird-automations
Collaborator Author

Commented by: Andreas Kallenbach (andreas_kc)

I started the incremental nbackup -B 1 process and attached iostat/mpstat and top logs.

//iostat.log and mpstat.log taken during nbackup -B 1
iostat -xtc 5 240 > iostat.log //io stats for a count of 240 at 5 second intervals (20 minutes)
mpstat 5 240 > mpstat.log //cpu, process stats for a count of 240 at 5 second intervals (20 minutes)

The nbackup e-mail with start and end times for the nbackup process...
Backup started at Tue Feb 10 21:13:05 CST 2009
Starting Hourly NBackup
Hourly NBackup Finished at Tue Feb 10 21:19:06 CST 2009

This took around 6 minutes during our non-peak time. During the day under load an incremental backup will take anywhere from 10-15 minutes.
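
For reference, the elapsed time and the implied read rate can be worked out from the two timestamps above. This is a back-of-the-envelope sketch: it assumes that "45GB" means 45e9 bytes and, following the earlier comment that nbackup reads the whole database on every run, that the full file was read during the incremental.

```python
from datetime import datetime


def parse_date_output(line):
    # "Tue Feb 10 21:13:05 CST 2009" -> naive datetime; the time zone token is dropped.
    parts = line.split()
    del parts[4]
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")


start = parse_date_output("Tue Feb 10 21:13:05 CST 2009")
end = parse_date_output("Tue Feb 10 21:19:06 CST 2009")
elapsed = (end - start).total_seconds()  # 361 seconds

db_bytes = 45 * 10**9  # assumption: 45GB taken as 45e9 bytes
rate_mb_s = db_bytes / elapsed / 1e6     # roughly 125 MB/s if the whole file was read
print(f"{elapsed:.0f} s, about {rate_mb_s:.0f} MB/s")
```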

Also, I attached a "top" listing at start, during, and after the nbackup process.

@firebird-automations
Collaborator Author

Modified by: Andreas Kallenbach (andreas_kc)

Attachment: iostat.log [ 11342 ]

Attachment: mpstat.log [ 11343 ]

Attachment: Top Command during nbackup [ 11344 ]

@firebird-automations
Collaborator Author

Commented by: Andreas Kallenbach (andreas_kc)

I have also attached a simple query execution comparison during nbackup load and afterwards.

@firebird-automations
Collaborator Author

Modified by: Andreas Kallenbach (andreas_kc)

Attachment: Query Execution Comparison [ 11345 ]

@firebird-automations
Collaborator Author

Commented by: Andreas Kallenbach (andreas_kc)

I added Cacti Memory/CPU/Load Graphs during the nbackup -B 1 (Incremental) process

It looks like the graph server time is off from the database server time. The process starts right around 20:45 on the graph.

@firebird-automations
Collaborator Author

Modified by: Andreas Kallenbach (andreas_kc)

Attachment: Memory Usage during incremental backup.png [ 11346 ]

Attachment: CPU Usage during incremental backup.png [ 11347 ]

Attachment: Load during incremental backup.png [ 11348 ]

@firebird-automations
Collaborator Author

Modified by: @AlexPeshkoff

assignee: Alexander Peshkov [ alexpeshkoff ]

@firebird-automations
Collaborator Author

Commented by: @AlexPeshkoff

Andreas, what do the devices dm-0, dm-1 and dm-2 mean here? Knowing that, I should be able to understand why the sdb and dm-2 loads correlate so well. The output of 'mount' on your system would also be useful.

But actually it does not matter. It's clear that we have a typical disk overload case. I will try to reproduce it on a smaller database (10GB) and take a look at what can be done to balance the IO load.

@firebird-automations
Collaborator Author

Commented by: @AlexPeshkoff

Well, I've read that:

Originally, this was set to nbackup directly to CIFS mount. Changed to this because if the disk mount ran out of diskspace, nbackup would stall and leave the database locked.

But it's not a good idea anyway to write the backup to the same disk where the database is located, particularly when you have IO performance problems. Can you change it, at least for the incremental backup? This should provide a clear picture of what is happening with a big database when nbackup runs.
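
A quick way to check whether a backup target shares a filesystem device with the database is to compare `st_dev` for the two paths, as in this small sketch (the function name is made up). With LVM this has a limit: two logical volumes have different device numbers even when they sit on the same physical disk, so a False result does not prove that the spindles are separate.

```python
import os


def on_same_device(path_a, path_b):
    """True if both paths live on the same filesystem device (same st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```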

@firebird-automations
Collaborator Author

Commented by: Andreas Kallenbach (andreas_kc)

I have attached two logs backing up to a different disk than the database is located on.

//nbackup setup to backup directly to cifs.mount
iostat -xtc 5 240 > iostat2.log

//nbackup setup to backup to Scsi channel A instead of channel B where database is stored
iostat -xtc 5 240 > iostat3.log

Also, dm-0 and dm-1 are logical volumes created by LVM. One volume on scsi-a contains the OS and the Firebird binaries. A second volume contains the database on scsi-b.
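
For anyone reading iostat output with dm-N devices, the mapping to LVM names and underlying disks can usually be read from sysfs on reasonably recent Linux kernels. The sketch below assumes the `dm/name` and `slaves` entries under `/sys/block/dm-N`; the function name is hypothetical and the sysfs root is a parameter so it can be pointed at a test directory.

```python
import os


def describe_dm_devices(sysfs_block="/sys/block"):
    """Map each dm-N device to (device-mapper name, underlying block devices)."""
    result = {}
    for dev in sorted(os.listdir(sysfs_block)):
        if not dev.startswith("dm-"):
            continue
        base = os.path.join(sysfs_block, dev)
        # dm/name holds the device-mapper name, e.g. "<volume group>-<volume>".
        with open(os.path.join(base, "dm", "name")) as f:
            name = f.read().strip()
        # slaves/ contains one entry per underlying device, e.g. a partition of sdb.
        slaves_dir = os.path.join(base, "slaves")
        slaves = sorted(os.listdir(slaves_dir)) if os.path.isdir(slaves_dir) else []
        result[dev] = (name, slaves)
    return result
```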

With the time I had today, I was not able to figure out how to get iostats on the cifs.mount so I ran the nbackup over a separate scsi channel instead which is shown in iostat3.log.

@firebird-automations
Collaborator Author

Modified by: Andreas Kallenbach (andreas_kc)

Attachment: iostat2.log [ 11349 ]

Attachment: iostat3.log [ 11350 ]

@firebird-automations
Collaborator Author

Commented by: @AlexPeshkoff

First, I've reproduced it on HEAD (2.5).
And it looks like setting the O_DIRECT flag when opening the database helps.
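
To illustrate what reading with O_DIRECT involves (this is a Python sketch, not the actual nbackup change): the file is opened with the flag so that reads bypass the OS page cache, and the read buffer has to be suitably aligned, which an anonymous mmap provides. The function name and chunk size are assumptions, and the sketch falls back to a normal open if the filesystem rejects O_DIRECT.

```python
import errno
import mmap
import os


def read_file_direct(path, chunk_size=1024 * 1024):
    """Yield the file's contents, bypassing the page cache where O_DIRECT works."""
    flags = os.O_RDONLY | getattr(os, "O_DIRECT", 0)
    try:
        fd = os.open(path, flags)
    except OSError as exc:
        if exc.errno != errno.EINVAL:
            raise
        # Some filesystems reject O_DIRECT at open time: fall back to buffered IO.
        fd = os.open(path, os.O_RDONLY)
    # An anonymous mapping is page-aligned, which satisfies O_DIRECT's buffer rules.
    buf = mmap.mmap(-1, chunk_size)
    try:
        with os.fdopen(fd, "rb", buffering=0) as f:
            while True:
                n = f.readinto(buf)
                if not n:
                    break
                yield buf[:n]  # slicing copies the bytes out of the shared buffer
    finally:
        buf.close()
```

Because the data read this way never enters the page cache, a large sequential scan should not push the server's "hot" pages out of it, which matches the cache-eviction theory raised earlier in the thread.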

@firebird-automations
Collaborator Author

Commented by: @AlexPeshkoff

Use O_DIRECT for database in nbackup

@firebird-automations
Collaborator Author

Modified by: @AlexPeshkoff

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 2.5 Beta 1 [ 10251 ]

Fix Version: 2.1.3 [ 10302 ]

Fix Version: 2.0.6 [ 10303 ]

@firebird-automations
Collaborator Author

Modified by: @pcisar

status: Resolved [ 5 ] => Closed [ 6 ]

@firebird-automations
Collaborator Author

Modified by: @pavel-zotov

QA Status: No test
