Backups are the critical service that Acorn provides. Using a combination of RSync and proprietary management software, we can quickly and easily verify that backups complete and take action when they don't. For details on the implementation, keep reading.
RSync is installed on all customer servers that require backup. We have created a batch file named backup.bat which initiates the RSync and later uploads the stats.out file to our reporting server. Below is a sample backup.bat file.
    set home=e:\atc

    rem Kill any ssh or rsync processes left over from a previous run
    taskkill /f /im ssh.exe
    taskkill /f /im rsync.exe

    rem Run the backup; the RSync log (stdout and stderr) goes to c:\stats.out
    echo y | c:\progra~2\cwRsync\bin\rsync.exe --delete --ignore-errors --chmod=u=rwx -vtrR -e "c:\progra~2\cwRsync\bin\ssh.exe -i e:\atc\id_dsa.txt" --exclude "*.mp3" "/cygdrive/E/shares" "logon@server.location.com:mirror/servername" > c:\stats.out 2>&1

    rem Upload stats.out to the reporting server
    type c:\stats.out | c:\progra~2\cwrsync\bin\ssh.exe -i e:\atc\id_dsa.txt login@server.location.com "cat > /reports/customer/servername.out" > c:\upload.txt 2>&1
The RSync log is dumped to C:\stats.out. Below is a sample stats.out from an Exchange server.
    cygwin warning:
      MS-DOS style path detected: c:\progra~1\cwRsync\bin\plink.exe -i d:\atc\atc.ppk
      Preferred POSIX equivalent is: /cygdrive/c/progra~1/cwRsync/bin/plink.exe -i d:/atc/atc.ppk
      CYGWIN environment variable option "nodosfilewarning" turns off this warning.
      Consult the user's guide for more details about POSIX paths:
        http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
    building file list ... done
    /cygdrive/D/
    /cygdrive/D/atc/exchbak/
    /cygdrive/D/atc/exchbak/publicfolders.bkf

    sent 23990367 bytes received 120237 bytes 1176127.02 bytes/sec
    total size is 294748160 speedup is 12.22
Once the backup completes, the stats.out file will be uploaded to the backup report server for parsing. This is done via SSH in the last command in the backup.bat file.
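As an illustration of the parsing step, here is a minimal Python sketch that pulls the transfer totals out of a stats.out body like the sample above. The post does not show the real parser, so the function name `parse_summary` and the returned field names are hypothetical. The patterns also tolerate comma-grouped numbers, which some rsync versions print.

```python
import re

# The two summary lines rsync prints at the end of a run that got far
# enough to report totals. Commas are allowed inside the numbers.
SENT_RE = re.compile(
    r"sent ([\d,]+) bytes\s+received ([\d,]+) bytes\s+([\d,.]+) bytes/sec"
)
TOTAL_RE = re.compile(r"total size is ([\d,]+)\s+speedup is ([\d,.]+)")


def _num(text, cast):
    """Convert a possibly comma-grouped number string with the given cast."""
    return cast(text.replace(",", ""))


def parse_summary(text):
    """Return the transfer totals from a stats.out body, or None if absent."""
    sent = SENT_RE.search(text)
    total = TOTAL_RE.search(text)
    if not (sent and total):
        return None  # no summary lines found in this file
    return {
        "sent_bytes": _num(sent.group(1), int),
        "received_bytes": _num(sent.group(2), int),
        "bytes_per_sec": _num(sent.group(3), float),
        "total_size": _num(total.group(1), int),
        "speedup": _num(total.group(2), float),
    }


# Tail of the sample stats.out shown earlier in this post.
sample = (
    "building file list ... done\n"
    "/cygdrive/D/atc/exchbak/publicfolders.bkf\n"
    "sent 23990367 bytes received 120237 bytes 1176127.02 bytes/sec\n"
    "total size is 294748160 speedup is 12.22\n"
)
summary = parse_summary(sample)
```

A missing summary is returned as `None` rather than raised as an error, since an unfinished backup is an expected condition that the report needs to show.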
The backup.bat file is set up to run as a daily scheduled task, after hours.
    +-----------------------------+
    | Tables_in_backup_serverlist |
    +-----------------------------+
    | csrs                        |
    | customers                   |
    | errors                      |
    | exceptions                  |
    | failures                    |
    | results                     |
    | servers                     |
    | tsrs                        |
    | updates                     |
    +-----------------------------+
A full dump of the database schema can be found here.
In order to parse the stats.out file, we need to know what each error code means. I couldn't find a good centralized repository of plain-English definitions for the RSync error codes, so I made one.
Code | Technical Jargon | Translation | Resolution |
---|---|---|---|
-2 | Backup did not complete | Either 1) the backup is still running, 2) the backup was running and was killed, or 3) the frontend server was not responsive. There was no error code written by rsync, so the backup report parser adds -2 as the error code. | If 1), monitor the backup throughout the day. If it finishes with a completion code (0/23/24), you will want to re-run the Upload Stats scheduled task. You will then clear the database and re-run the parsers. If this is a recurring issue, we need to determine if they have too much data. If they have too much constantly changing data, we can look at trimming their PSTs (for Exchange customers) or finding an alternative backup method. If 2), check to make sure that the setting for "Run only when on AC power" is not checked for the Backup to Acorn scheduled task. If 3), you would see this happening to many servers that are on the same custbak. This needs to be addressed by a TSR3. |
0 | Success | The backup completed successfully. | This is a good thing, so there is nothing to fix. |
1 | Syntax or usage error | We didn't write the rsync script correctly and need to fix it. | See Translation. |
2 | Protocol incompatibility | Garbage data/data from another script is being sent via rsync and is confusing the rsync connection. | There most likely is an issue with faulty hardware which is causing corruption in the data. This could be on their LAN or Internet connection. |
3 | Errors selecting input/output files, dirs | Rsync couldn't find the files that were specified to back up. | Check the RSync script to make sure that the files that it specifies to back up actually exist. |
4 | Requested action not supported: an attempt was made to manipulate 64-bit files on a platform that cannot support them; or an option was specified that is supported by the client and not by the server. | Rsync cannot copy 64-bit files on an operating system that doesn't support 64-bit files. | We need to evaluate what operating system we are on and what sort of files are on it. There is a fundamental incompatibility between the operating system and the files that reside on it. |
5 | Error starting client-server protocol | Rsync had trouble initiating the connection to our backup servers. We will need to review the logs to determine the source of failure. | See Translation. |
6 | Daemon unable to append to log-file | Rsync couldn't update its log file. | Check permissions on where RSync is writing its log file. |
10 | Error in socket I/O | The remote backup server is not accepting the Rsync connection. | A TSR3 will need to troubleshoot why the backup servers are not allowing incoming connections. |
11 | Error in file I/O | The server had a problem deleting a file. We ignore this and continue onward. | See Translation. |
12 | Connection Died | The connection to the backup server died. | The customer's Internet connectivity likely dropped. Restart the Backup to Acorn scheduled task. If it finishes, you will then clear the database and re-run the parsers. |
13 | Errors with program diagnostics | This error is typically caused by interrupted or dropped idle connections (for example ISP problems, a firewall, or a power outage), permission issues, or errors with program diagnostics. More troubleshooting is required. | See Translation. We should investigate whether there are problems with the customer's Internet connection as well as make sure that the server has a stable network connection. |
14 | Error in IPC code | This error is usually thrown when there is an issue making a connection to the Rsync server during the backup. | A TSR3 will need to troubleshoot why the backup servers are not allowing incoming connections. |
20 | Received SIGUSR1 or SIGINT | This message may come when one process on the remote backup server side interrupts the other because it is about to die. In this event, one of the custbak or frontend servers is going down. | Restart the Backup to Acorn scheduled task. If it finishes, you will then clear the database and re-run the parsers. |
21 | Some error returned by waitpid() | Rsync failed because a process was unresponsive. | See Translation. |
22 | Error allocating core memory buffers | This issue was fixed with newer versions of Rsync, so we shouldn't see it. | See Translation. |
23 | Partial transfer due to error | This code indicates that errors regarding specific files were found in the stats.out file. Some of those errors we don't care about, such as when a customer deletes a file and Rsync notes that it's missing. Other errors are bad, but we have implemented an alternative method of backing up the affected files. In this case we add an exception. When the parser finds an error, it cross-references it with the exceptions to see if it's an error that's no longer an issue. Lastly, there are actual errors that we care about that are not exceptions; these report as a bad backup. | If a file that we care about is failing, we need to set up an alternative method of backing up the file (e.g. NT Backup, SQL dump). Once we have verified that the alternative method is working, add the file as an exception within the backupreport system. If we don't care about the file and it's throwing an error, add it as an exception. |
24 | Partial transfer due to vanished source files | 'Partial transfer' usually means that a file was changed while rsyncing (so a retry should fix it). Vanished files are okay to ignore. | This is a good conclusion. There is nothing to fix. |
25 | The --max-delete limit stopped deletions | We don't set the max delete flag, so there is no reason why we would ever see this error. | See Translation. |
30 | Timeout in data send/receive | The idle timeout that is set in Rsync was reached. We should look at increasing it and troubleshooting whether the customer is having any Internet connectivity issues. | See Translation. |
35 | Timeout waiting for daemon connection | The remote backup server was unresponsive. | A TSR3 will need to troubleshoot why the backup servers are not allowing incoming connections. |
127 | Connection unexpectedly closed | The connection between the customer server and the remote backup server dropped, and thus the backup failed. | There most likely is an issue with faulty hardware which is causing corruption in the data. This could be on their LAN or Internet connection. |
255 | Connection unexpectedly closed | The network connection dropped and thus killed Rsync. | The customer's connectivity most likely dropped. This could be on their LAN or Internet connection. Restart the Backup to Acorn scheduled task. If it finishes, you will then clear the database and re-run the parsers. |
Code -2 was made up by me. RSync will never return this code.
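To make the role of -2 concrete, here is a rough Python sketch of how a parser could derive a result code from a stats.out body. It relies on the closing line rsync writes when it exits non-zero, of the form `rsync error: ... (code N) at main.c(...)` (or `rsync warning:` for code 24). Everything else is an assumption for illustration: the function names, the status labels, taking the last reported code when several appear, and treating a summary with no error line as code 0.

```python
import re

# rsync ends a non-zero run with a line such as:
#   rsync error: some files could not be transferred (code 23) at main.c(1039)
CODE_RE = re.compile(r"rsync (?:error|warning): .*\(code (\d+)\)")
# A completed run prints this summary line.
SUMMARY_RE = re.compile(r"total size is [\d,]+\s+speedup is")

INCOMPLETE = -2  # parser-assigned; rsync itself never returns this


def result_code(text):
    """Derive a result code from the contents of stats.out."""
    codes = [int(c) for c in CODE_RE.findall(text)]
    if codes:
        return codes[-1]  # assumption: the last reported code wins
    if SUMMARY_RE.search(text):
        return 0  # summary present and no error line: success
    return INCOMPLETE  # nothing conclusive was written


def classify(code):
    """Map a code to a coarse status, following the table above."""
    if code in (0, 24):
        return "ok"
    if code == 23:
        return "check exceptions"
    if code == INCOMPLETE:
        return "incomplete"
    return "failed"


finished = "sent 10 bytes received 20 bytes 30.00 bytes/sec\ntotal size is 294748160 speedup is 12.22\n"
partial = "rsync error: some files could not be transferred (code 23) at main.c(1039)\n"
vanished = "rsync warning: some files vanished before they could be transferred (code 24) at main.c(1052)\n"
still_running = "building file list ... done\n/cygdrive/D/\n"
```

With these samples, `result_code(still_running)` falls through both checks and yields -2, which is the case the first row of the table describes.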
There are a few things that we want to look for when parsing the stats.out file.
Since you will run into files that are constantly in use (e.g. PST, QuickBooks), you will have to set up alternative backup processes in order to capture this data. We have various methods for achieving this depending on the server's operating system. For Windows Server 2003 we will typically use NTBackup. For Windows Server 2008, Hobocopy is our tool of choice. Basically anything that uses Volume Shadow Copy should do the trick. Once we have set up the alternative backup process, the file is added to the exceptions table. This tells the parser that if the file is locked, the error can be ignored because it is being backed up elsewhere.
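The exception check described above could look roughly like the following Python sketch. It assumes per-file failures appear in stats.out as lines of the form `rsync: <operation> "<path>": <reason> (<errno>)`, and that exceptions are stored as exact paths compared case-insensitively. The real exceptions table may match differently, and the function name and sample paths are hypothetical.

```python
import re

# Per-file failure line, for example:
#   rsync: send_files failed to open "/cygdrive/E/shares/mail.pst": Device or resource busy (16)
FILE_ERROR_RE = re.compile(r'^rsync: .*?"(?P<path>[^"]+)".*\((?P<errno>\d+)\)\s*$')


def unresolved_errors(text, exceptions):
    """Return failing paths that are not covered by an exception."""
    # Assumption: exact path match, ignoring case (Windows paths).
    covered = {path.lower() for path in exceptions}
    failures = []
    for line in text.splitlines():
        match = FILE_ERROR_RE.match(line.strip())
        if match and match.group("path").lower() not in covered:
            failures.append(match.group("path"))
    return failures


# Illustrative log excerpt and exception list, not taken from a real server.
log = (
    'rsync: send_files failed to open "/cygdrive/E/shares/mail.pst": Device or resource busy (16)\n'
    'rsync: send_files failed to open "/cygdrive/E/shares/books.qbw": Device or resource busy (16)\n'
    "rsync error: some files could not be transferred (code 23) at main.c(1039)\n"
)
exceptions = ["/cygdrive/e/shares/mail.pst"]  # backed up by another method
remaining = unresolved_errors(log, exceptions)
backup_ok = not remaining
```

Here the PST is covered by an exception, but the QuickBooks file is not, so the backup would still be reported as bad until that file gets its own alternative backup and exception.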
There are several reports that run daily. In addition to being accessible from within a web browser, the daily and three day failure reports are sent out as email.
This report gives us a list of all servers along with their status for backups performed within the last 24 hours.
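As a sketch of the logic behind such a report, the Python below picks each server's most recent result from the last 24 hours. The post does not describe the columns of the results table, so the `(server, finished_at, code)` tuples, the function name, and the use of `None` for a server with no recent result are all assumptions.

```python
from datetime import datetime, timedelta


def daily_report(servers, results, now):
    """Map each server to its latest result code from the last 24 hours.

    `results` is an iterable of (server, finished_at, code) tuples.
    Servers with no result in the window map to None.
    """
    cutoff = now - timedelta(hours=24)
    latest = {}
    for server, finished_at, code in results:
        if finished_at < cutoff:
            continue  # older than the reporting window
        if server not in latest or finished_at > latest[server][0]:
            latest[server] = (finished_at, code)
    return {s: (latest[s][1] if s in latest else None) for s in servers}


# Hypothetical data for illustration only.
now = datetime(2011, 3, 2, 8, 0)
servers = ["exch01", "files01", "sql01"]
results = [
    ("exch01", datetime(2011, 3, 2, 1, 30), 0),
    ("files01", datetime(2011, 3, 1, 23, 0), 12),
    ("files01", datetime(2011, 3, 2, 2, 0), 23),
    ("sql01", datetime(2011, 2, 28, 1, 0), 0),  # outside the window
]
report = daily_report(servers, results, now)
```

Iterating over the full server list, rather than only over the results, is what lets a server that never reported show up in the report at all.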
The historical report allows you to view the backup status for all servers since the beginning of monitoring.
In addition to what I've described here, there is a management interface for adding servers, customers, and much more.
With the new backup monitoring system, there is a level of transparency that was previously missing. Customer service representatives can see exactly which servers are backing up or having issues, which has helped out tremendously. Additionally, we are much better able to proactively manage the backup process. If you are looking at implementing a similar system and have questions, feel free to email me.