
Key: CORE-2316
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Alexander Peshkov
Reporter: Andreas Kallenbach
Votes: 0
Watchers: 4

Firebird Core

Throttling Nbackup on large databases

Created: 10/Feb/09 03:55 PM   Updated: 08/Nov/09 09:45 PM
Component/s: NBACKUP
Affects Version/s: 2.1.1
Fix Version/s: 2.5 Beta 1, 2.1.3, 2.0.6

Time Tracking:
Not Specified

File Attachments: 1. PNG File CPU Usage during incremental backup.png (32 kB)
2. PNG File graph_image.php.png (39 kB)
3. Text File iostat.log (180 kB)
4. Text File iostat2.log (180 kB)
5. Text File iostat3.log (180 kB)
6. PNG File Load Average.php.png (34 kB)
7. PNG File Load during incremental backup.png (29 kB)
8. PNG File Memory Usage during incremental backup.png (30 kB)
9. Text File mpstat.log (22 kB)
10. Text File Query Execution Comparison (0.9 kB)
11. Text File Top Command during nbackup (9 kB)

Environment:
Fedora release 7 (Moonshine) Redhat Tikanga 5
FirebirdSS-2.1.1.17910-0.nptl.i686.rpm
Dell Poweredge Server 1750, P4-Dual Core-Xeon 2.8GHZ, 3GB Ram, Direct Attached SCSI RAID10 14-Drive 500GB Total

Planning Status: Unspecified


Description
We have a 45GB database that runs on a fairly well equipped machine and run a full nbackup once a month and an incremental once a day at 3am. The incremental takes around 12 minutes to complete, but during that time our regular use of the database crawls to an almost imperceptible stop.

Can nbackup be throttled to only consume 20%-50% of the system resources? While the backup itself may take longer, the users on the database could still continue to use the database. As it is, anyone using the database during an incremental backup will just take a break for 10 minutes because the backup is so intensive.

Are there other solutions to making nbackup more efficient and allowing other users to continue working on a database this size?

I'm not interested in setting up replication for a variety of reasons, but some sort of log-shipping type method has become increasingly attractive to keep a warm-backup ready in the event that the nasty(tm) occurs. Nbackup seems to be the only close solution but it has the drawback of almost locking the use of a 45GB database file.

Vlad Khorsun added a comment - 10/Feb/09 04:21 PM
What do you mean by "regular use of the database crawls to an almost imperceptible stop"?
I guess queries run slower because of heavy IO pressure from nbackup? Or is it something else?

If my guess is correct, I see two ways to make things better:

a) nbackup reads the whole database every time it creates a backup. It is possible to introduce some kind of bitmap (or array) of pages that were changed since the last backup. It is not an easy task, taking into account multiple backup levels and some other internals, but it is possible and looks like a must-have solution.

b) To reduce IO pressure we could limit nbackup's IO activity, for example by inserting a small delay between reads/writes or by limiting the number of IO requests per second issued by nbackup. I am not sure if it is OK and how it may impact the whole picture, though.
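
Idea (a) could look roughly like the sketch below. It is only an illustration of the bookkeeping, not Firebird code: the class name, the mark() hook and the incremental_backup() helper are all hypothetical, and the sketch ignores the multiple backup levels mentioned above.

```python
class ChangedPageMap:
    """One bit per database page; a set bit means "changed since the last backup"."""

    def __init__(self, page_count):
        self.page_count = page_count
        self._bits = bytearray((page_count + 7) // 8)

    def mark(self, page_no):
        # Hypothetical hook: would be called whenever the engine writes a page.
        self._bits[page_no >> 3] |= 1 << (page_no & 7)

    def changed_pages(self):
        # Yield the numbers of all flagged pages in ascending order.
        for page_no in range(self.page_count):
            if self._bits[page_no >> 3] & (1 << (page_no & 7)):
                yield page_no

    def clear(self):
        # Reset every bit, e.g. once a backup has completed.
        self._bits = bytearray(len(self._bits))


def incremental_backup(db_path, page_size, page_map):
    """Read only the flagged pages instead of scanning the whole file."""
    pages = []
    with open(db_path, "rb") as db:
        for page_no in page_map.changed_pages():
            db.seek(page_no * page_size)
            pages.append((page_no, db.read(page_size)))
    page_map.clear()
    return pages
```

With such a map, the read volume of an incremental backup would scale with the number of changed pages rather than with the size of the database file.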

Any other ideas?

Sean Leyne added a comment - 10/Feb/09 05:52 PM
As Vlad points out, in order to reduce the amount of resources that NBackup would "consume", it would be necessary to add a delay to the IO logic.

As for the best 'measurement' for the throttle control, the options are a delay between disk IOs (in sec/msec) or a maximum number of IOs per second.

I think that the Max IO is the better approach.
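
A "Max IO per second" limit can be sketched as a simple pacing loop. This is a minimal illustration under stated assumptions, not how nbackup is implemented: throttled_copy and its parameters are hypothetical names, only read requests are paced, and the sleep/clock arguments exist just so the pacing can be checked without real waiting.

```python
import time


def throttled_copy(src_path, dst_path, max_iops, block_size=16384,
                   sleep=time.sleep, clock=time.monotonic):
    """Copy a file while issuing at most max_iops read requests per second."""
    min_interval = 1.0 / max_iops
    next_slot = clock()  # earliest time the next read may be issued
    copied = 0
    # buffering=0 so that each read() call maps to one read request.
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        while True:
            delay = next_slot - clock()
            if delay > 0:
                sleep(delay)
            # Schedule the following read one interval after this one.
            next_slot = max(next_slot, clock()) + min_interval
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
            copied += len(block)
    return copied
```

Compared with a fixed delay after every IO, this form only sleeps when the copy is running ahead of its budget, so a disk that is already slow is not slowed down further.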


P.S.

I find it interesting that you ran into this problem running Linux. We had seen the "consuming all resources" problem in some testing which was done during the development of the NBackup utility, but it had only occurred under Windows (we had thought that it would be a problem which would only affect large Windows databases).

Nikolay had confidently asserted that the problem would not occur under Linux, due to better IO scheduling. It is nice to see that he is human, and not correct all of the time. ;-]

Because we were in development and wanted to see what "real world" experience would bring, we had only noted this issue on the "to be fixed in the future" list. Thanks for opening the case.

Roman Simakov added a comment - 10/Feb/09 05:59 PM
Your users should not have to wait until the backup is finished. User work is stopped only while the backup state is being changed, which takes some time because the cache has to be flushed. After that your users should be able to use the database as before. Currently, merging pages from the delta into the main database file takes more time than it does in the new nbackup. There is a patch to improve nbackup, but it has known issues on SuperServer, which prevent it from being applied to Firebird. With it, the merging process should at least be faster than it is now. I can make a special build for you on which you can test how nbackup works. Of course, you must be careful when using it ;)

Vlad, a note about your case b): we must sleep for some time if there is another job. Maybe it's enough to reduce the priority of nbackup.exe?

Andreas Kallenbach added a comment - 10/Feb/09 06:10 PM - edited
Attached Cacti Server Graph, notice our full nbackup starts at 4am. A full nbackup to local disk will take 18 minutes on a 45GB file. The graph load for the two hours after the nbackup is for the off-site file copy.

Vlad Khorsun added a comment - 10/Feb/09 06:14 PM - edited
Roman> Vlad, a note about your case b): we must sleep for some time if there is another job. Maybe it's enough to reduce the priority of nbackup.exe?

If this issue is really about IO pressure, then I doubt that lowering CPU priority (the only priority I know of that can be set via an OS API) would help. And this is very easy to check without recompiling nbackup ;)


There is another possible reason for the slowdown during nbackup that is worth investigating:
c) reading a file as huge as 45GB may evict from the OS file cache pages that are useful to Firebird (so-called "hot" pages).
In this case we could try the POSIX fadvise API so that pages read by nbackup are not cached.
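
The fadvise idea in (c) might be sketched as follows. This is an assumption-laden illustration, not the nbackup code: copy_without_caching is a hypothetical name, and the sketch relies on os.posix_fadvise, which Python exposes only on platforms that provide it (the call is skipped elsewhere). Note also that POSIX_FADV_DONTNEED is only a hint, and on a file shared with the server it may drop pages the server itself wanted to keep cached.

```python
import os


def copy_without_caching(src_path, dst_path, block_size=1 << 20):
    """Sequentially copy a file, hinting the OS not to keep the read pages cached."""
    have_fadvise = hasattr(os, "posix_fadvise")
    fd = os.open(src_path, os.O_RDONLY)
    offset = 0
    try:
        if have_fadvise:
            # Announce a sequential scan over the whole file (length 0 = to EOF).
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        with open(dst_path, "wb") as dst:
            while True:
                block = os.read(fd, block_size)
                if not block:
                    break
                dst.write(block)
                if have_fadvise:
                    # Hint that the range just read will not be needed again.
                    os.posix_fadvise(fd, offset, len(block),
                                     os.POSIX_FADV_DONTNEED)
                offset += len(block)
    finally:
        os.close(fd)
    return offset
```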

Andreas Kallenbach added a comment - 10/Feb/09 06:33 PM
I will pick up more statistics on the server during the nbackup. My recollection is that during the nbackup process the wa (I/O wait) value shoots through the roof, to between 60% and 90% of CPU time. Load average also goes over 2.0. I will record the CPU load during the process tomorrow, for information that is more reliable than my recollection ;)

Roman, I have tried to encourage my users to continue using the database during the incremental backup, but they are an impatient, stubborn lot. For almost the entirety of the nbackup (10-12 minutes) the query response is just slow. Queries that normally open in less than a second tend to take 20 seconds or more. While queries do finish, they do so at a pace that encourages my users to mumble dirty words when they think of their sysdba during backup time. Thankfully, most will just pull out their smokes and take a break rather than dream up ideas on how to strangle me.

Vlad, in a day or so I will post some more information that should highlight which process is clogging the server and whether it is CPU, IO, etc.

Andreas Kallenbach added a comment - 10/Feb/09 07:17 PM - edited
Attached Load Average for nbackup process at 4am.

Andreas Kallenbach added a comment - 10/Feb/09 09:07 PM
Ok, so the two graphs I attached actually show the full backup (nbackup -B 0) that now happens at 4:00am every day. Originally, I had a full backup plus incrementals at 10am, 2pm, 6pm, and 10pm. Then, because of complaints, I moved to a 2pm/6pm schedule, then down to a 2pm schedule, and finally settled on a daily full nbackup at 4am, when there are only 3-4 users on. I would like to get back to using incrementals during the day, so they can be copied off to a remote machine that serves as a standby server.

Again, I will follow this with additional stats on the nbackup incremental graphs and top statistics when more users stop working for today.

The following scripts are how the nbackup full/incrementals are being initiated...

# Full nbackup script
    filedate=`date +%Y-%m-%d_%H-%M-%S`
    mount.cifs //10.0.x.x/dbbackup $mountpath -o username=dbbkp,pass=SEKRITPWD
    cd /opt/firebird/net_data/temp/
    mkdir -p $filedate

    echo "Subject: DB NBackup Log" > $filedate/ib_email
    echo "Backup started at "`date`>> $filedate/ib_email
    hostname >> $filedate/ib_email
    echo "Starting Full Nbackup" >> $filedate/ib_email
    /opt/firebird/bin/nbackup -U sysdba -P SEKRITPWD -B 0 /opt/firebird/net_data/audit.gdb $filedate/audit.nbk >> $filedate/ib_email

    echo "NBackup Finished at "`date` >> $filedate/ib_email
    sendmail admin@infinityins.com < $filedate/ib_email
    
    # Originally, this was set to nbackup directly to CIFS mount. Changed to this because if the disk mount ran out of diskspace, nbackup would stall and leave the database locked
    mv -f /opt/firebird/net_data/temp/* $mountpath/nbackup/

# Incremental nbackup script
    filedate=`date +%Y-%m-%d_%H-%M-%S`
    mount.cifs //10.0.x.x/dbbackup $mountpath -o username=dbbkp,pass=SEKRITPWD
    cd /opt/firebird/net_data/temp/
    mkdir -p $filedate

    echo "Subject: DB Nbackup Hourly Log" > $filedate/ib_email
    echo "Backup started at "`date` >> $filedate/ib_email
    hostname >> $filedate/ib_email
    echo "Starting Hourly NBackup" >> $filedate/ib_email
    /opt/firebird/bin/nbackup -U sysdba -P SEKRITPWD -B 1 /opt/firebird/net_data/audit.gdb $filedate/audit.nbk >> $filedate/ib_email

    echo "Hourly NBackup Finished at " `date` >> $filedate/ib_email
    sendmail admin@infinityins.com < $filedate/ib_email
    mv -f /opt/firebird/net_data/temp/* $mountpath/nbackup/

Andreas Kallenbach added a comment - 11/Feb/09 12:46 AM - edited
I started the incremental nbackup -B 1 process and attached iostat/mpstat and top logs

# iostat.log and mpstat.log taken during nbackup -B 1
iostat -xtc 5 240 > iostat.log   # I/O stats: 240 samples at 5-second intervals (20 minutes)
mpstat 5 240 > mpstat.log        # CPU stats: 240 samples at 5-second intervals (20 minutes)

The nbackup e-mail with start and end times for the nbackup process...
Backup started at Tue Feb 10 21:13:05 CST 2009
Starting Hourly NBackup
Hourly NBackup Finished at Tue Feb 10 21:19:06 CST 2009

This took around 6 minutes during our non-peak time. During the day under load an incremental backup will take anywhere from 10-15 minutes.

Also, I attached a "top" listing at start, during, and after the nbackup process.

Andreas Kallenbach added a comment - 11/Feb/09 12:47 AM
I have also attached a simple query execution comparison during nbackup load and afterwards.

Andreas Kallenbach added a comment - 11/Feb/09 12:58 AM
I added Cacti Memory/CPU/Load Graphs during the nbackup -B 1 (Incremental) process

It looks like the graph server time is off from the database server time. The process starts right around 20:45 on the graph.

Alexander Peshkov added a comment - 11/Feb/09 05:01 AM
Andreas, what do the devices dm-0, dm-1 and dm-2 mean here? Knowing that, I should be able to understand why the sdb and dm-2 loads correlate so well. The output of 'mount' on your system would also be useful.

But actually it does not matter. It's clear that we have a typical disk overload case. I will try to reproduce it on a smaller database (10GB) and take a look at what can be done to balance the IO load.

Alexander Peshkov added a comment - 11/Feb/09 05:22 AM
Well, I've read that:

Originally, this was set to nbackup directly to CIFS mount. Changed to this because if the disk mount ran out of diskspace, nbackup would stall and leave the database locked.

But it's not a good idea anyway to write the backup to the same disk where the database is located, particularly when you have IO performance problems. Can you change it, at least for the incremental backup? This should provide a clear picture of what is happening with a big database when nbackup runs.

Andreas Kallenbach added a comment - 11/Feb/09 11:07 PM
I have attached two logs backing up to a different disk than the database is located on.

# nbackup set up to back up directly to cifs.mount
iostat -xtc 5 240 > iostat2.log

# nbackup set up to back up to SCSI channel A instead of channel B, where the database is stored
iostat -xtc 5 240 > iostat3.log

Also, the dm-0, dm-1 are logical volumes created by the lvm manager. One volume on scsi-a contains the OS-Firebird binary. A second volume contains the database on scsi-b.

With the time I had today, I was not able to figure out how to get iostats on the cifs.mount so I ran the nbackup over a separate scsi channel instead which is shown in iostat3.log.

Alexander Peshkov added a comment - 06/Mar/09 12:36 PM
First, I've reproduced it on HEAD (2.5).
And it looks like setting the O_DIRECT flag when opening the database helps.

Alexander Peshkov added a comment - 06/Mar/09 01:29 PM
Use O_DIRECT for database in nbackup
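
For readers unfamiliar with O_DIRECT, the sketch below shows the general technique of reading a file while bypassing the OS page cache. It is not the committed Firebird change: copy_direct and _copy are hypothetical names, the block size is an assumption, and the code falls back to ordinary buffered reads where O_DIRECT is unavailable or rejected with EINVAL (as some filesystems do).

```python
import errno
import mmap
import os


def _copy(src_path, dst_path, block_size, extra_flags):
    fd = os.open(src_path, os.O_RDONLY | extra_flags)
    # Anonymous mapping: page-aligned memory, which O_DIRECT reads require.
    buf = mmap.mmap(-1, block_size)
    copied = 0
    try:
        with open(dst_path, "wb") as dst:
            while True:
                n = os.readv(fd, [buf])
                dst.write(buf[:n])
                copied += n
                if n < block_size:  # short read: end of a regular file
                    break
    finally:
        buf.close()
        os.close(fd)
    return copied


def copy_direct(src_path, dst_path, block_size=1 << 20):
    """Copy a file, reading the source with O_DIRECT when possible."""
    direct = getattr(os, "O_DIRECT", 0)  # 0 where the flag does not exist
    try:
        return _copy(src_path, dst_path, block_size, direct)
    except OSError as exc:
        if exc.errno != errno.EINVAL or not direct:
            raise
        # O_DIRECT not supported here: retry with normal buffered reads.
        return _copy(src_path, dst_path, block_size, 0)
```

The block size should be a multiple of the device's logical block size; stopping at the first short read keeps every read offset aligned.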