Backup Management

Backups are the critical service that Acorn provides. Using a combination of RSync and proprietary management software, we can quickly and easily verify that backups complete and take action when they don't. For details on the implementation, keep reading.

Customer Server Configuration

RSync is installed on all customer servers that require backup. We have created a batch file named backup.bat which runs RSync and then uploads the stats.out file to our reporting server. Below is a sample backup.bat file.

set home=e:\atc
taskkill /f /im ssh.exe
taskkill /f /im rsync.exe
echo y | c:\progra~2\cwRsync\bin\rsync.exe --delete --ignore-errors --chmod=u=rwx -vtrR -e "c:\progra~2\cwRsync\bin\ssh.exe -i e:\atc\id_dsa.txt" --exclude "*.mp3" "/cygdrive/E/shares" "logon@server.location.com:mirror/servername" > c:\stats.out 2>&1
type c:\stats.out | c:\progra~2\cwrsync\bin\ssh.exe -i e:\atc\id_dsa.txt login@server.location.com "cat > /reports/customer/servername.out" > c:\upload.txt 2>&1

The RSync log is dumped to C:\stats.out. Below is a sample stats.out from an Exchange server.

cygwin warning:
  MS-DOS style path detected: c:\progra~1\cwRsync\bin\plink.exe -i d:\atc\atc.ppk
  Preferred POSIX equivalent is: /cygdrive/c/progra~1/cwRsync/bin/plink.exe -i d:/atc/atc.ppk
  CYGWIN environment variable option "nodosfilewarning" turns off this warning.
  Consult the user's guide for more details about POSIX paths:
    http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
building file list ... done
/cygdrive/D/
/cygdrive/D/atc/exchbak/
/cygdrive/D/atc/exchbak/publicfolders.bkf

sent 23990367 bytes  received 120237 bytes  1176127.02 bytes/sec
total size is 294748160  speedup is 12.22
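
The two summary lines at the end of the sample above are easy to pull apart with regular expressions. The following is a minimal sketch, not the production parser: the function name parse_summary and the dictionary keys are hypothetical, and the patterns assume the summary format shown in the sample (commas are tolerated in case a newer rsync prints digit separators).

```python
import re

# Patterns for the two summary lines rsync prints at the end of a run,
# based on the sample stats.out above (an assumption, not a spec).
SENT_RE = re.compile(
    r"sent ([\d,]+) bytes\s+received ([\d,]+) bytes\s+([\d.,]+) bytes/sec"
)
TOTAL_RE = re.compile(r"total size is ([\d,]+)\s+speedup is ([\d.,]+)")


def _number(text, kind):
    """Convert a captured number, dropping any thousands separators."""
    return kind(text.replace(",", ""))


def parse_summary(text):
    """Return the transfer statistics from stats.out, or None if absent."""
    sent = SENT_RE.search(text)
    total = TOTAL_RE.search(text)
    if not (sent and total):
        return None  # no summary lines found in the log
    return {
        "sent_bytes": _number(sent.group(1), int),
        "received_bytes": _number(sent.group(2), int),
        "bytes_per_sec": _number(sent.group(3), float),
        "total_size": _number(total.group(1), int),
        "speedup": _number(total.group(2), float),
    }


sample = (
    "sent 23990367 bytes  received 120237 bytes  1176127.02 bytes/sec\n"
    "total size is 294748160  speedup is 12.22\n"
)
stats = parse_summary(sample)
print(stats["total_size"], stats["speedup"])  # prints: 294748160 12.22
```

A log without these lines returns None, which is a hint that the run did not finish cleanly.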

Once the backup completes, the stats.out file will be uploaded to the backup report server for parsing. This is done via SSH in the last command in the backup.bat file.

The backup.bat file is set up to run as a daily scheduled task after hours.

Backup Report Server Configuration

The server is a standard LAMP configuration.

Database Configuration

Data is stored in a database named backup_serverlist. The following tables are in backup_serverlist:
+-----------------------------+
| Tables_in_backup_serverlist |
+-----------------------------+
| csrs                        |
| customers                   |
| errors                      |
| exceptions                  |
| failures                    |
| results                     |
| servers                     |
| tsrs                        |
| updates                     |
+-----------------------------+

A full dump of the database schema can be found here.

Parsing the stats.out File

To parse the stats.out file, we need to know what each error code means. I couldn't find a good centralized, plain-English reference for the RSync error codes, so I made one.

Code | Technical Jargon | Translation | Resolution
-2 | Backup did not complete | The backup either 1) is still running, 2) was running and was killed, or 3) the frontend server was not responsive. There was no error code written by rsync, so the backup report parser adds -2 as the error code. | If 1, monitor the backup throughout the day. If it finishes with a completion code (0/23/24), you will want to re-run the Upload Stats scheduled task. You will then clear the database and re-run the parsers. If this is a recurring issue, we need to determine if they have too much data. If they have too much constantly changing data, we can look at trimming their PSTs (for Exchange customers) or finding an alternative backup method. If 2, check to make sure that the setting for "Run only when on AC power" is not checked for the Backup to Acorn scheduled task. If 3, you would see this happening to many servers that are on the same custbak. This needs to be addressed by a TSR3.
0 | Success | The backup completed successfully. | This is a good thing, so there is nothing to fix.
1 | Syntax or usage error | We didn't write the rsync script correctly and need to fix it. | See Translation.
2 | Protocol incompatibility | Garbage data/data from another script is being sent via rsync and is confusing the rsync connection. | There most likely is an issue with faulty hardware which is causing corruption in the data. This could be on their LAN or Internet connection.
3 | Errors selecting input/output files, dirs | Rsync couldn't find the files that were specified to back up. | Check the RSync script to make sure that the files that it specifies to back up actually exist.
4 | Requested action not supported: an attempt was made to manipulate 64-bit files on a platform that cannot support them; or an option was specified that is supported by the client and not by the server. | Rsync cannot copy 64-bit files on an operating system that doesn't support 64-bit files. | We need to evaluate what operating system we are on and what sort of files are on it. There is a fundamental incompatibility between the operating system and the files that reside on it.
5 | Error starting client-server protocol | Rsync had trouble initiating the connection to our backup servers. We will need to review the logs to determine the source of failure. | See Translation.
6 | Daemon unable to append to log-file | Rsync couldn't update its log file. | Check permissions on where RSync is writing its log file.
10 | Error in socket I/O | The remote backup server is not accepting the Rsync connection. | A TSR3 will need to troubleshoot why the backup servers are not allowing incoming connections.
11 | Error in file I/O | The server had a problem deleting a file. We ignore this and continue onward. | See Translation.
12 | Connection Died | The connection to the backup server died. | The customer's Internet connectivity likely dropped. Restart the Backup to Acorn scheduled task. If it finishes, you will then clear the database and re-run the parsers.
13 | Errors with program diagnostics | This error is typically caused by connection interrupts or dropped idle connections (for any number of reasons: ISP problems, firewall, power outage, etc.), permission issues, or errors with program diagnostics. More troubleshooting is required. | See Translation. We should investigate whether there are problems with the customer's Internet connection as well as make sure that the server has a stable network connection.
14 | Error in IPC code | This error is usually thrown when there is an issue making a connection to the Rsync server during the backup. | A TSR3 will need to troubleshoot why the backup servers are not allowing incoming connections.
20 | Received SIGUSR1 or SIGINT | This message may come when one process on the remote backup server side interrupts the other because it is about to die. In this event, one of the custbak or frontend servers is going down. Restart the Backup to Acorn scheduled task. If it finishes, you will then clear the database and re-run the parsers. | See Translation.
21 | Some error returned by waitpid() | Rsync failed because a process was unresponsive. | See Translation.
22 | Error allocating core memory buffers | This issue was fixed with newer versions of Rsync, so we shouldn't see it. | See Translation.
23 | Partial transfer due to error | This code indicates that errors regarding specific files were found in the stats.out file. Some of those errors we don't care about, such as when a customer deletes a file and Rsync notes that it's missing. Other errors are bad, but we have implemented an alternative method of backing up the affected files. In this case we add an exception. When the parser finds an error, it cross-references it with the exceptions to see if it's an error that's no longer an issue. Lastly, there are actual errors that we care about that are not exceptions; these report as a bad backup. | If a file that we care about is failing, we need to set up an alternative method of backing up the file (e.g. NT Backup, SQL dump). Once we have verified that the alternative method is working, add the file as an exception within the backupreport system. If we don't care about the file and it's throwing an error, add it as an exception.
24 | Partial transfer due to vanished source files | 'Partial transfer' usually means that a file was changed while rsyncing (so a retry should fix it). Vanished files are okay to ignore. | This is a good conclusion. There is nothing to fix.
25 | The --max-delete limit stopped deletions | We don't set the max delete flag, so there is no reason why we would ever see this error. | See Translation.
30 | Timeout in data send/receive | The idle timeout that is set in Rsync was reached. We should look at increasing it and troubleshooting whether the customer is having any Internet connectivity issues. | See Translation.
35 | Timeout waiting for daemon connection | The remote backup server was unresponsive. | A TSR3 will need to troubleshoot why the backup servers are not allowing incoming connections.
127 | Connection unexpectedly closed | The connection between the customer server and the remote backup server dropped, and thus the backup failed. | There most likely is an issue with faulty hardware which is causing corruption in the data. This could be on their LAN or Internet connection.
255 | Connection unexpectedly closed | The network connection dropped and thus killed Rsync. | There most likely is an issue with the customer's Internet dropping. This could be on their LAN or Internet connection. Restart the Backup to Acorn scheduled task. If it finishes, you will then clear the database and re-run the parsers.

Code -2 was made up by me. RSync will never return this code.
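
For illustration, the short descriptions from the table can be kept in a lookup so that a report can show plain English next to each code. This is only a sketch: the names RSYNC_CODES, COMPLETION_CODES, and describe are hypothetical, and treating 0/23/24 as "completed" follows the completion codes named in the -2 row.

```python
# Short descriptions taken from the table above. -2 is the parser's own
# code for "no completion code found"; rsync itself never returns it.
RSYNC_CODES = {
    -2: "Backup did not complete",
    0: "Success",
    1: "Syntax or usage error",
    2: "Protocol incompatibility",
    3: "Errors selecting input/output files, dirs",
    4: "Requested action not supported",
    5: "Error starting client-server protocol",
    6: "Daemon unable to append to log-file",
    10: "Error in socket I/O",
    11: "Error in file I/O",
    12: "Connection died",
    13: "Errors with program diagnostics",
    14: "Error in IPC code",
    20: "Received SIGUSR1 or SIGINT",
    21: "Some error returned by waitpid()",
    22: "Error allocating core memory buffers",
    23: "Partial transfer due to error",
    24: "Partial transfer due to vanished source files",
    25: "The --max-delete limit stopped deletions",
    30: "Timeout in data send/receive",
    35: "Timeout waiting for daemon connection",
    127: "Connection unexpectedly closed",
    255: "Connection unexpectedly closed",
}

# Codes the table treats as a finished backup (0/23/24).
COMPLETION_CODES = {0, 23, 24}


def describe(code):
    """Return a one-line, human-readable summary for a result code."""
    text = RSYNC_CODES.get(code, "Unknown code")
    state = "completed" if code in COMPLETION_CODES else "failed"
    return f"{code}: {text} ({state})"


print(describe(24))
print(describe(-2))
```

Note that a "completed" code 23 can still count as a bad backup, depending on which files failed.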

There are a few things that we want to look for when parsing the stats.out file.

  • Any file errors
  • The server completion code or lack thereof

A rough overview of the parser is as follows:

  • Verify that the stats.out file exists. If it does not, throw an error and die.
  • Determine the identity of the customer and server based on the path to the file.
  • Check the modification date of the stats.out file. If the file is more than 24 hours old, we have a problem, as the stats.out file is not current. Log error code -2.
  • Go through line by line searching for file errors or a finish code. When a file error has been found, check to see if it is an exception (explained later). If it is a finish code, enter the value into the results database.
  • If the end of the file is reached and no code has been found, check for a perfect completion, in which case the last line will contain "total size is ". If it is not a perfect completion, log error code -2; otherwise log code 0.
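
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual parser: the function name parse_stats is hypothetical, the customer/server layout is inferred from the upload path in backup.bat (/reports/customer/servername.out), the two regular expressions assume the usual shape of rsync's error lines, and the exceptions are passed in as a plain set rather than read from the database.

```python
import os
import re
import time

# Assumed line formats (not confirmed by the article):
#   rsync error: ... (code 23) at main.c(1040) [sender=3.0.6]
#   rsync: send_files failed to open "/cygdrive/E/shares/a.pst": ... (16)
CODE_RE = re.compile(r"^rsync (?:error|warning): .*\(code (\d+)\)")
FILE_ERROR_RE = re.compile(r'^rsync: .*"([^"]+)"')
MAX_AGE_SECONDS = 24 * 60 * 60


def parse_stats(path, exceptions=frozenset(), now=None):
    """Return (customer, server, code, file_errors) for one stats.out file."""
    # Step 1: the file must exist, otherwise give up with an error.
    if not os.path.isfile(path):
        raise FileNotFoundError(path)

    # Step 2: identity comes from the path, e.g. .../customer/servername.out
    customer = os.path.basename(os.path.dirname(path))
    server = os.path.splitext(os.path.basename(path))[0]

    # Step 3: a file older than 24 hours is stale, so log -2.
    now = time.time() if now is None else now
    if now - os.path.getmtime(path) > MAX_AGE_SECONDS:
        return customer, server, -2, []

    # Step 4: scan line by line for file errors and a finish code.
    code, file_errors, last_line = None, [], ""
    with open(path, errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            last_line = line
            finish = CODE_RE.match(line)
            if finish:
                code = int(finish.group(1))
                continue
            failed = FILE_ERROR_RE.match(line)
            if failed and failed.group(1) not in exceptions:
                file_errors.append(failed.group(1))

    # Step 5: no finish code means either a perfect run or no completion.
    if code is None:
        code = 0 if "total size is " in last_line else -2
    return customer, server, code, file_errors
```

Calling parse_stats("/reports/customer/servername.out") on a clean log would return code 0 with an empty error list; in a real system the returned tuple would then be written to the results table.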

Exceptions

Since you will run into files that are constantly in use (e.g. PST, QuickBooks), you will have to set up alternative backup processes to capture this data. We have various methods for achieving this depending on the server's operating system. For Windows Server 2003 we typically use NTBackup. For Windows Server 2008, Hobocopy is our tool of choice. Basically, anything that uses Volume Shadow Copy should do the trick. Once we have set up the alternative backup process, the file is added to the exceptions table. This tells the parser that if the file is locked, the error can be ignored because it is being backed up elsewhere.
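
To make the effect of exceptions concrete, here is a small sketch of how a night's result might be classified. The function name backup_status, the "good"/"bad" labels, and the use of an in-memory set in place of the exceptions table are all assumptions for illustration; the rule itself follows the descriptions of codes 0, 23, and 24 given earlier.

```python
# Codes that are always a good result: success, or only vanished files.
GOOD_CODES = {0, 24}


def backup_status(code, failed_files, exceptions):
    """Classify one night's result as "good" or "bad"."""
    if code in GOOD_CODES:
        return "good"
    if code == 23:
        # Partial transfer: bad only if a failing file is NOT covered
        # by an alternative backup (i.e. not listed as an exception).
        uncovered = [f for f in failed_files if f not in exceptions]
        return "bad" if uncovered else "good"
    return "bad"  # every other code, including -2, is a failure


# Hypothetical paths, for illustration only.
exceptions = {"/cygdrive/E/shares/mail/archive.pst"}
print(backup_status(23, ["/cygdrive/E/shares/mail/archive.pst"], exceptions))
print(backup_status(23, ["/cygdrive/E/shares/books/company.qbw"], exceptions))
```

The first call reports a good backup because the locked PST is covered by an exception; the second reports a bad one because the QuickBooks file is not.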

Reports

There are several reports that run daily. In addition to being accessible from within a web browser, the daily and three-day failure reports are sent out by email.

Nightly Backup Report

This report gives us a list of all servers along with their status for backups performed within the last 24 hours.

Historical Report

The historical report allows you to view the backup status for all servers since the beginning of monitoring.

Notes

In addition to what I've described here, there is a management interface for adding servers, customers, and much more.

Conclusion

With the new backup monitoring system there is a level of transparency that was previously missing. Customer service representatives can see exactly which servers are backing up and which are having issues, which has helped tremendously. Additionally, we are much better able to proactively manage the backup process. If you are looking at implementing a similar system and have questions, feel free to email me.