As Vlad points out, in order to reduce the amount of resources that NBackup would "consume", it would be necessary to add a delay to the IO logic.
In terms of the best 'measurement' for the throttle control, the choices are a delay between disk IOs (in sec/msec) or a maximum number of IOs per second. I think that Max IO is the better approach.

P.S. I find it interesting that you ran into this problem running Linux. We had seen the "consuming all resources" problem in some testing done during the development of the NBackup utility, but it had only occurred under Windows (we had thought that it would only affect large Windows databases). Nikolay had confidently asserted that the problem would not occur under Linux, due to better IO scheduling. It is nice to see that he is human, and not correct all of the time. ;-] Because we were in development and wanted to see what "real world" experience would bring, we had only noted this issue on the "to be fixed in the future" list. Thanks for opening the case.

Your users should not have to wait until the backup is finished. User work is stopped only while the backup state is being changed, which takes time because the cache has to be flushed. After that your users should be able to use the database as before. Currently, merging pages from the delta into the main database file takes more time than it does in the new nbackup. There is a patch to improve nbackup, but it has known issues on SuperServer, which prevent it from being applied in Firebird. At the least, the merging process must become faster than it is now. I can make a special build for you on which you can test how nbackup works. Of course, you must be careful when using it ;)
Vlad, about your case b), I'd like to note: we must sleep for some time if there is another job. Maybe it's enough to reduce the priority of nbackup.exe?

Attached is the Cacti server graph; notice that our full nbackup starts at 4am. A full nbackup to local disk takes 18 minutes on a 45GB file. The load on the graph for the two hours after the nbackup comes from the off-site file copy.
Roman> Vlad, about your case b. I'd like to note. We must sleep some time if there is another job. Maybe it's enough to reduce a priority of nbackup.exe?
If this issue is really about IO pressure, then I doubt that lowering the CPU priority (which is the only priority settable via an OS API that I know of) could help. And this is very easy to check without recompiling nbackup ;)

There is another possible reason for the slowdown during nbackup to investigate:
c) reading a file as huge as 45GB may evict pages useful for Firebird (so-called "hot" pages) from the OS file cache. In this case we could try the POSIX fadvise API (posix_fadvise) so that pages read by nbackup are not cached.

I will pick up more statistics on the server during the nbackup. My recollection is that during the nbackup process, the wa (IO wait) value shoots through the roof to between 60% and 90% of CPU time. Load average also goes over 2.0. I will record the CPU load during the process tomorrow for information that is more reliable than my recollection ;)
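To make case c) a bit more concrete, here is a minimal Python sketch of the idea, not nbackup's actual code. It reads a file sequentially and, after each chunk, advises the kernel that the range is no longer needed. The function name `read_without_caching` and the 1 MiB chunk size are made up for illustration. One caveat, as far as I understand the API: the advice applies to the file range and not to the reader, so on the database file itself it may also drop pages that Firebird would still like to keep cached. This would need measuring before drawing conclusions.

```python
import os


def read_without_caching(path, chunk_size=1 << 20):
    """Read a file sequentially and hint the kernel to drop each chunk from
    the page cache once it has been consumed. Returns the number of bytes read."""
    # posix_fadvise is not available on every platform; skip the hint if missing.
    can_advise = hasattr(os, "posix_fadvise")
    total = 0
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk_size)
            if not data:
                break
            # A real backup tool would write `data` to the backup file here.
            total += len(data)
            if can_advise:
                # Advisory only: the kernel is free to ignore it.
                os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
            offset += len(data)
    finally:
        os.close(fd)
    return total
```

Calling `read_without_caching("/path/to/some/file")` on a test file while watching the cached memory figure in `top` or `free` should show whether the hint has any effect on a given kernel.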
Roman, I have tried to encourage my users to continue using the database during the incremental backup, but they are an impatient, stubborn lot. For almost the entirety of the nbackup (10-12 minutes) the query response is just slow. Queries that normally open in less than a second tend to take 20 seconds or more. While queries do finish, they do so at a pace that encourages my users to mumble dirty words when they think of their sysdba during backup time. Thankfully, most will just pull out their smokes and take a break rather than dream up ideas on how to strangle me.

Vlad, in a day or so I will post some more information that should highlight which process is clogging the server and whether it is CPU, IO, etc.

Attached: load average for the nbackup process at 4am.
Ok, so the two graphs I attached actually show the full backup (nbackup -B 0) that now happens at 4:00am every day. Originally, I had a full backup plus incrementals at 10am, 2pm, 6pm, and 10pm. Then, because of complaints, I moved to a 2pm/6pm schedule, then down to a 2pm schedule, and finally settled on just a daily full nbackup at 4am, when there are only 3-4 users on. I would like to get back to using incrementals during the day, so they can be copied off to a remote server for a standby server.
Again, I will follow this with additional stats on the nbackup incremental graphs and top statistics when more users stop working for today. The following scripts are how the nbackup full/incrementals are being initiated...

# FULL NBACKUP script
filedate=`date +%Y-%m-%d_%H-%M-%S`
mount.cifs //10.0.x.x/dbbackup $mountpath -o username=dbbkp,pass=SEKRITPWD
cd /opt/firebird/net_data/temp/
mkdir -p $filedate
echo "Subject: DB NBackup Log" > $filedate/ib_email
echo "Backup started at "`date` >> $filedate/ib_email
hostname >> $filedate/ib_email
echo "Starting Full Nbackup" >> $filedate/ib_email
/opt/firebird/bin/nbackup -U sysdba -P SEKRITPWD -B 0 /opt/firebird/net_data/audit.gdb $filedate/audit.nbk >> $filedate/ib_email
echo "NBackup Finished at "`date` >> $filedate/ib_email
sendmail admin@infinityins.com < $filedate/ib_email
mv -f /opt/firebird/net_data/temp/* $mountpath/nbackup/
# Originally, this was set to nbackup directly to CIFS mount. Changed to this because if the
# disk mount ran out of diskspace, nbackup would stall and leave the database locked.

# Incremental NBACKUP script
filedate=`date +%Y-%m-%d_%H-%M-%S`
mount.cifs //10.0.x.x/dbbackup $mountpath -o username=dbbkp,pass=SEKRITPWD
cd /opt/firebird/net_data/temp/
mkdir -p $filedate
echo "Subject: DB Nbackup Hourly Log" > $filedate/ib_email
echo "Backup started at "`date` >> $filedate/ib_email
hostname >> $filedate/ib_email
echo "Starting Hourly NBackup" >> $filedate/ib_email
/opt/firebird/bin/nbackup -U sysdba -P SEKRITPWD -B 1 /opt/firebird/net_data/audit.gdb $filedate/audit.nbk >> $filedate/ib_email
echo "Hourly NBackup Finished at " `date` >> $filedate/ib_email
sendmail admin@infinityins.com < $filedate/ib_email
mv -f /opt/firebird/net_data/temp/* $mountpath/nbackup/

I started the incremental nbackup -B 1 process and attached iostat/mpstat and top logs.
# iostat.log and mpstat.log taken during nbackup -B 1
iostat -xtc 5 240 > iostat.log   # IO stats for a count of 240 at 5-second intervals (20 minutes)
mpstat 5 240 > mpstat.log        # CPU, process stats for a count of 240 at 5-second intervals (20 minutes)

The nbackup e-mail with start and end times for the nbackup process:

Backup started at Tue Feb 10 21:13:05 CST 2009
Starting Hourly NBackup
Hourly NBackup Finished at Tue Feb 10 21:19:06 CST 2009

This took around 6 minutes during our non-peak time. During the day under load an incremental backup will take anywhere from 10-15 minutes. Also, I attached a "top" listing at start, during, and after the nbackup process. I have also attached a simple query execution comparison during nbackup load and afterwards.
I added Cacti Memory/CPU/Load graphs taken during the nbackup -B 1 (incremental) process.
It looks like the graph server time is off from the database server time. The process starts right around 20:45 on the graph.

Andreas, what do the devices dm-0, dm-1 and dm-2 mean here? It looks like knowing that will let me understand why the sdb and dm-2 loads correlate so well. The output of 'mount' on your system would also be useful.
But actually it does not matter. It's clear that we have a typical disk overload case. I will try to reproduce it on a smaller database (10GB) and take a look at what can be done to balance the IO load. Well, I've read that:
Originally, this was set to nbackup directly to CIFS mount. Changed to this because if the disk mount ran out of diskspace, nbackup would stall and leave the database locked.

But it's not a good idea anyway to write the backup to the same disk where the database is located, particularly when you have IO performance problems. Can you change it, at least for the incremental backup? This should provide a clear picture of what is happening with a big database when nbackup runs.

I have attached two logs from backing up to a different disk than the one the database is located on.
# nbackup set up to back up directly to the CIFS mount
iostat -xtc 5 240 > iostat2.log
# nbackup set up to back up to SCSI channel A instead of channel B, where the database is stored
iostat -xtc 5 240 > iostat3.log

Also, dm-0 and dm-1 are logical volumes created by the LVM manager. One volume on scsi-a contains the OS and the Firebird binary. A second volume contains the database on scsi-b. With the time I had today, I was not able to figure out how to get iostat figures for the CIFS mount, so I ran the nbackup over a separate SCSI channel instead, which is shown in iostat3.log.

First, I've reproduced it on HEAD (2.5).
And it looks like setting the O_DIRECT flag when opening the database helps.

Use O_DIRECT for database in nbackup
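For readers unfamiliar with O_DIRECT, the following is a rough Python sketch of what reading a file with that flag involves on Linux. It is an illustration only and not the change that went into nbackup. The names `read_file`, `BLOCK` and `CHUNK` are hypothetical, and the 4096-byte alignment is an assumption that depends on the device and filesystem. O_DIRECT bypasses the OS file cache, which is why it avoids pushing other data out of it, but it requires aligned buffers and read sizes, and some filesystems reject it, so the sketch falls back to a normal buffered read in that case.

```python
import mmap
import os

BLOCK = 4096          # assumed alignment; the real requirement is device-specific
CHUNK = 64 * BLOCK    # read size, kept a multiple of BLOCK


def read_file(path, chunk=CHUNK, want_direct=True):
    """Read a whole file, using O_DIRECT when the platform and filesystem
    allow it. Returns (data, used_direct)."""
    direct = want_direct and hasattr(os, "O_DIRECT")
    flags = os.O_RDONLY | (os.O_DIRECT if direct else 0)
    try:
        fd = os.open(path, flags)
    except OSError:
        if not direct:
            raise
        # Filesystem refused O_DIRECT: retry as an ordinary buffered read.
        return read_file(path, chunk, want_direct=False)

    buf = mmap.mmap(-1, chunk)  # anonymous mapping, so the buffer is page-aligned
    out = bytearray()
    try:
        while True:
            n = os.readv(fd, [buf])
            out += buf[:n]
            if n < chunk:       # short read on a regular file means end of file
                break
    except OSError:
        if not direct:
            raise
        # Direct read rejected (e.g. alignment mismatch): retry buffered.
        return read_file(path, chunk, want_direct=False)
    finally:
        buf.close()
        os.close(fd)
    return bytes(out), direct
```

The second element of the result tells whether the direct path was actually used, which is handy when comparing cache behaviour between the two modes on a test file.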
I guess queries run slower because of heavy IO pressure from nbackup? Or something else?
If my guess is correct, I see two ways to make things better:
a) nbackup reads the whole database every time it creates a backup. It is possible to introduce some kind of bitmap (or array) of the pages which have changed since the last backup. It is not an easy task, taking into account multiple levels of backup and some other internals, but it is possible and looks like a must-have solution.
b) To reduce IO pressure we could limit nbackup's IO activity, for example by inserting a small delay between reads/writes or by limiting the number of IO requests per second issued by nbackup. I am not sure if it is OK and how it may impact the whole picture, though.
Any other ideas?
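As an illustration of case b), here is a minimal Python sketch of the "maximum IOs per second" variant. It is only a sketch under assumptions, not a proposal for the actual nbackup code: the function name `throttled_copy`, the 16384-byte page size and the 200 IOPS default are invented for the example, and the `clock`/`sleep` parameters exist only so the pacing can be tested without waiting.

```python
import time


def throttled_copy(src, dst, page_size=16384, max_iops=200,
                   clock=time.monotonic, sleep=time.sleep):
    """Copy src to dst one page at a time, issuing at most max_iops reads
    per second. Returns the number of pages copied."""
    interval = 1.0 / max_iops
    next_slot = clock()          # earliest time the next read may be issued
    pages = 0
    while True:
        now = clock()
        if now < next_slot:
            sleep(next_slot - now)
        # If we are running late, restart from "now" instead of bursting
        # to catch up, so the cap is never exceeded.
        next_slot = max(next_slot, now) + interval
        page = src.read(page_size)
        if not page:
            break
        dst.write(page)
        pages += 1
    return pages
```

Compared with a fixed delay after every IO, a per-second cap adapts to how long each read actually takes: when the disk is already slow, little or no extra sleeping is added.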