The problem comes when you need to decide how to do it. Do you use off-the-shelf backup software? Do you create your own solution?
When I set out to find a solution to our archival problem here, I wanted something easy to use that could be run from any computer. With those requirements, I was limited to what was built into Mac OS X. I have used TAR and the infamous ZIP formats. These both work well, but I wanted a better solution.
The problem I have with ZIP is that it is an aging format. On the Windows side it is being phased out in favor of the more popular RAR and the newer 7-Zip formats. ZIP also keeps its central directory at the very end of the file rather than the beginning, so every time you want to see what is in your ZIP file you still need to run an unzip command at some level to read out the listing of files.
TAR is much better, and is widely considered the standard in the UNIX and Linux communities. TAR can be used with multiple compression formats, is the commonly accepted format for most packages on those platforms, and has been around since 1979. TAR comes from the tape backup world; in fact, TAR stands for Tape ARchive. Being written for tape, TAR was originally designed for sequential I/O devices. This means that as you create the archive, the data is streamed in. That makes creating archives fast, but it makes retrieving data slow, because the TAR format does not keep a table of contents of the files in the archive. If you want to know what is in the archive, you must actually run the TAR command to stream out the list of files. Only after you do that can you extract the file you need.
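To see that streaming behavior in practice, here is a small throwaway demo (the folder and file names are just placeholders). Listing the archive with -t means reading through the entries one by one:

```shell
#!/bin/sh
# Build a toy archive in a temp folder (names are made up for the demo).
tmp=$(mktemp -d)
cd "$tmp"
mkdir Job_01
echo "hello" > Job_01/notes.txt
tar -cf jobs.tar Job_01

# There is no table of contents to consult: -t streams through the
# whole archive, printing each entry's header as it is read.
tar -tf jobs.tar
```

On a few kilobytes this is instant, but on a multi-gigabyte job folder that streaming pass is exactly the wait I wanted to avoid.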
Enter XAR. XAR is a relatively new archive format, and Apple has adopted it as of Mac OS X 10.5 for all installer packages. XAR is different in that it keeps a full table of contents of the files in the archive at the very beginning of the file, in XML format. This is handy because if you need a fast listing of the files in the archive, no matter how large the archive is, it's a simple matter of running the XAR command to read the table of contents. Light years faster than ZIP, and even faster than TAR.
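Here's the same demo as above, but with XAR (it skips quietly if xar isn't installed on your machine; the names are again just placeholders):

```shell
#!/bin/sh
# Sketch of reading a XAR table of contents; skips quietly if xar is absent.
command -v xar >/dev/null 2>&1 || { echo "xar not installed, skipping"; exit 0; }

tmp=$(mktemp -d)
cd "$tmp"
mkdir Job_01
echo "hello" > Job_01/notes.txt
xar -cf jobs.xar Job_01

# -t only parses the XML table of contents at the head of the file,
# so the listing comes back immediately no matter how big the archive is.
xar -tf jobs.xar
```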
So how do we use XAR to archive many files? Well, that depends on how you want to archive your data. We have our files organized in such a way that the top-level folders are named for the clients, with multiple "job" folders inside each one. See below.
MAIN FOLDER---
\------------Client_01
\----------- Job_01
\----------- Job_02
\----------- Job_03
\----------- Etc.......
\------------Client_02
\----------- Job_01
\----------- Job_02
\----------- Job_03
\----------- Etc.......
So let's say I want to turn only the "Job" folders into separate XAR archives. I could CD into each folder, run the XAR command to make an archive, wait for it to finish, then move to the next folder and wait again. While this would work fine, it would certainly take some time considering that there are normally around 1000 jobs in each client folder and over 1000 clients. I also want to verify the data and then delete the original folder once it's turned into an archive, thereby saving over half the space. Automation is the key here.
A shell script works great for this.
So the steps my script takes are:
1. Copy all the contents from one level deep to a new location using RSYNC. (I will talk about RSYNC in another post)
2. Change any names of folders removing any illegal characters.
3. Create a XAR archive of the folder.
4. Delete the original folder.
5. Move to the next folder in the list.
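The five steps above can be sketched as a plain per-folder loop. This is only a toy run under some stated substitutions: cp -R stands in for the rsync call so it runs anywhere, the illegal-character renaming (step 2) is left out, and it skips quietly if xar is not installed. The script I actually use generates one long command line per job instead, but the logic is the same:

```shell
#!/bin/sh
# Toy run of the archive steps on a throwaway tree (names are made up;
# cp -R stands in for rsync, and the renaming step is omitted).
command -v xar >/dev/null 2>&1 || { echo "xar not installed, skipping"; exit 0; }

src=$(mktemp -d)
dst=$(mktemp -d)
mkdir "$src/Job_01"
echo "data" > "$src/Job_01/file.txt"

for job in "$src"/*/; do
    name=$(basename "$job")
    cp -R "$job" "$dst/$name"                   # 1. copy the folder to the new location
    if ( cd "$dst" &&
         xar -cf "$name.xar" "$name" &&         # 3. turn the copy into a XAR archive
         xar -tf "$name.xar" >/dev/null ); then # quick check: the TOC reads back cleanly
        rm -R "$dst/$name"                      # 4. only then delete the copied folder
    fi
done                                            # 5. the loop moves on to the next job
ls "$dst"
```

Verifying before the rm is the important part: if the archive step fails, the copied folder is left alone.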
Here's how I did it.
The command is run with: $ ./xarchive.sh [Source] [Destination]
Note that Source and Destination can be the same if you want to XAR archive and delete in the same place, or you can use a new destination to empty the original location and archive everything to the new one. Both paths should end with a trailing slash, since the sed command below simply prepends them to each folder name.
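The script leans on find's -print0 and xargs -0 to survive folder names with spaces, so here's a tiny standalone demo of that plumbing first. Note that I use -mindepth 1 -maxdepth 1 in the demo, which both BSD and GNU find accept; the bare -depth 1 in the script itself is BSD (Mac OS X) syntax. ls -d stands in for the BSD-only stat -f "%N":

```shell
#!/bin/sh
# Standalone demo of NUL-separated find | xargs (folder names are made up).
tmp=$(mktemp -d)
cd "$tmp"
mkdir "Client 01" "Client 02"    # spaces: exactly the case -print0 exists for

# find emits each pathname followed by a NUL byte; xargs -0 splits on
# NULs instead of whitespace, so the embedded spaces survive intact.
find . -mindepth 1 -maxdepth 1 -type d -print0 | xargs -0 ls -d
```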
#!/bin/bash
#find all the folders that have illegal names and change them.
#this command calls a perl script to do the work: fix_names.pl
find "$1" -depth 1 -type d -print0 | xargs -0 ./fix_names.pl > renamed.text
(cd "$1"
find . -depth 1 -type d -print0 | xargs -0 stat -f "%N") > srcstats.txt
# this whole pipeline prints out the path of each directory one level deep inside the first argument.
# find . = Find everything in the current dir.
# -depth 1 = Look one level deep.
# -type d = Find directories only.
# -print0 = Print the pathname of each found item, terminated with a NUL character.
# xargs -0 = Tells xargs to split its input on those NUL characters instead of whitespace, so each pathname (spaces and all) is treated as a single item.
# stat = Displays the file information for each of xargs' items.
# -f "%N" = Displays just the name of each item.
echo srcstats.txt ..................... DONE!
mkdir -p "$2"
# this sed rewrites each "./Job_xx" line from srcstats.txt into three chained commands:
# rsync the job folder to the destination, xar up the copy, then remove the copied folder.
sed -e 's:[A-Z]*..\(.*\):rsync -avrhzE --stats --progress "'"$1"'\1/" "'"$2"'\1" ; xar -vcf "'"$2"'\1.xar" "'"$2"'\1" ; rm -R "'"$2"'\1":' srcstats.txt > output.txt
echo output.txt ..................... DONE!
cp output.txt xarfiles.sh
chmod +x xarfiles.sh
echo xarfiles.sh ................... DONE!
echo executing xarfiles.sh . . . . . NOW!
./xarfiles.sh
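One thing the generated xarfiles.sh does not do is the verification I mentioned: it deletes each copied folder as soon as its archive is written. Here is a sketch of a separate verify pass that simply checks that every archive's table of contents parses. In the real workflow you would point it at the destination folder; this self-contained version builds one throwaway archive, and it skips quietly if xar is absent:

```shell
#!/bin/sh
# Sketch: confirm every .xar in a folder has a readable table of contents.
command -v xar >/dev/null 2>&1 || { echo "xar not installed, skipping"; exit 0; }

dir=$(mktemp -d)
mkdir "$dir/Job_01"
echo "data" > "$dir/Job_01/file.txt"
( cd "$dir" && xar -cf Job_01.xar Job_01 )

bad=0
for archive in "$dir"/*.xar; do
    # -t parses the XML TOC; a corrupt or truncated archive fails here.
    if xar -tf "$archive" >/dev/null 2>&1; then
        echo "OK  $archive"
    else
        echo "BAD $archive"
        bad=1
    fi
done
echo "failed archives: $bad"
```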