Bringing your data home
(AKA The rsync fanboy page)
In which the author conveys a few facts and spouts a lot of opinions on
methods for getting your data home from MacCHESS.
Last update 2008-03
Modern equipment at high-powered synchrotrons can generate huge amounts of
data. With 1 second exposure times our Quantum 210 (A1, F2) and Quantum 270 (F1)
detectors can stream 11
frames/minute in binned mode (8 MB files). In unbinned mode (32 MB files), a set
of 180 images can take up over 5.7 GB. If you want to save the raw files
(*.imx_0, before dark current correction), you can double that number. (I
personally don't recommend this, but some people do it and they have their
reasons.) Many
of our users end up with 50 - 100 GB of data in a few days. You probably want
to take the images home with you in case there are uncertainties about the
indexing or you want to tweak the data processing. There are a variety of
methods for getting your data home.
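The arithmetic behind those numbers can be sketched in a few lines of shell. This is only a rough estimate using the approximate frame sizes quoted above; the function name dataset_mb is made up for illustration, and the sketch uses Bourne shell (sh) syntax rather than csh.

```shell
#!/bin/sh
# Rough data-volume estimate from the approximate frame sizes quoted above.

# dataset_mb FRAMES MB_PER_FRAME -> total size in MB (hypothetical helper)
dataset_mb() {
    echo $(( $1 * $2 ))
}

unbinned=$(dataset_mb 180 32)   # 180 unbinned frames at about 32 MB each
with_raw=$(( unbinned * 2 ))    # keeping the raw *.imx_0 frames doubles it

echo "unbinned set: ${unbinned} MB; with raw frames: ${with_raw} MB"
```

A handful of such data sets per day is how users reach 50 - 100 GB in a few days.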
Bring your own computer
By bringing your own computer, presumably a laptop, to MacCHESS, you can
exercise a lot of control over how you store your data. Recent laptops come with
bigger and bigger disks, and many have FireWire and USB2.0 connections suitable
for external disks.
- You can set up everything ahead of time and be sure that it works. If you
are using external disks, this means no worries about filesystem compatibility.
- You can configure your network connection to use DHCP for instant access, or
ask MacCHESS for a static address.
- If you opt for a static IP address, we can arrange to export the relevant
data collection disks to your machine via NFS.
- Each MacCHESS beamline has a 10/100/1000 Mbps switch, so if your computer
has a gigabit ethernet interface you can take full advantage of it. CHESS also
has wireless service, but at a lower bandwidth. Ask the CHESS operator for
instructions on wireless use.
- Being inside the CHESS firewall, you should encounter minimal hassles and
maximal flexibility with data transfer protocols.
- You may load your computer with your own assortment of data processing
software, e-mail clients, etc.
See the section below on network data transfer for
suggestions on copying the data to your own computer.
Portable disks
This seems to be the up-and-coming method for data transport. It is
great for you for a number of reasons:
- reasonable cost, and getting more reasonable all the time
- rewritable media
- random access
- convenient hot-swap connections
It's great for us too; PCI cards for USB2.0 and FireWire are a lot cheaper
than tape drives.
MacCHESS can handle a variety of device interfaces and filesystems, especially
since the introduction of Linux boxes at each of our beamlines:
- USB 2.0 and FireWire (IEEE1394 or IEEE1394a) on Linux, Mac and Windows
- FireWire 800 (IEEE1394b) on Mac and Linux
- Hook your devices up to your own laptop and plug it into our ethernet (see
above)
- Filesystems:
My recommendations:
Here's a Unix/Linux TAR command which will copy an entire directory from one
location to another, while preserving file modification times:
tar cBf - olddir | (cd newdir && tar xBf -)
Here's a comparable command using RSYNC which has some notable advantages:
rsync -av olddir newdir
See the network section for more details on the RSYNC
command.
A word of caution: many FireWire and USB2.0 disks are sold as 'external' devices
rather than 'portable', and may not be sufficiently ruggedized for portable use.
Since you're going to use them anyway, you can do a few things to help assure
the reliability of your data:
- Bring more than one disk for redundancy.
- Transport the disks in a padded carrying case.
- Don't delete your data from the data collection disks until you're home and
assured that your disks and data made it safely. This is not a problem for us
since we added the RAID servers at the beamlines.
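One way to become "assured" is to compare the copy against the original before anything is deleted. The following is a minimal sketch, assuming a Bourne-style shell and a standard diff; the function name verify_copy and the paths in the usage comment are only illustrative. Files you deliberately left out of the copy (for example with an rsync --exclude) will be reported as differences.

```shell
#!/bin/sh
# verify_copy SRC DST (hypothetical helper): succeeds only if the two
# directory trees contain the same files with identical contents.
verify_copy() {
    diff -r "$1" "$2" > /dev/null
}

# Usage (illustrative paths):
#   verify_copy /A1a/smith /media/disk1/chess_040812/smith && echo "copy OK"
```

Every byte is read on both sides, so expect this to take about as long as the copy itself.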
Network transfer
Some groups have taken to transferring their data home via the Internet. Your
contentment with this will probably depend on the speed of connections between
Cornell and your home lab. CHESS has an internal gigabit ethernet network, and
our connection to the Cornell backbone is also gigabit, but then there are all
those connections between here and your home lab, which may vary considerably in
speed. It takes longer than you think, so don't put it off until the last
hour or two.
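To get a feel for "longer than you think", here is a back-of-the-envelope estimate as a shell sketch. It assumes a Bourne-style shell with awk, decimal units (1 GB = 8000 megabits) and a sustained end-to-end rate that you supply; the function name transfer_hours and the example rates are illustrative, not measurements.

```shell
#!/bin/sh
# transfer_hours GB MBPS (hypothetical helper): hours needed to move GB
# gigabytes at a sustained rate of MBPS megabits per second.
transfer_hours() {
    LC_ALL=C awk -v gb="$1" -v mbps="$2" \
        'BEGIN { printf "%.1f\n", gb * 8000 / mbps / 3600 }'
}

# Example: 100 GB over three assumed sustained rates
for rate in 10 100 1000; do
    echo "100 GB at ${rate} Mbps: $(transfer_hours 100 "$rate") hours"
done
```

The slowest link between here and your home lab sets the rate, so the real figure is usually closer to the pessimistic end.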
Cornell is a member of the National LambdaRail
(NLR) consortium, which operates a nationwide high-speed optical network.
Cornell provides NLR connections to several New York and New England
institutions through NorthEast
LambdaRail (NeLR).
CHESS Firewall
CHESS uses a firewall for network security. For connections from CHESS,
any network protocol is OK. Consider requirements on the other end of your
connection, especially if a firewall is in use there as well.
For connection to CHESS, only the ssh protocol is allowed in, and
only from certain addresses. If you need to initiate connections from outside
the CHESS firewall, contact your MacCHESS staff scientist so the appropriate
arrangements can be made.
RSYNC
RSYNC is a Unix/Linux utility for copying and backing up directory trees. It is well
thought out and has a number of advantages over other options. If rsync is
available to you, I recommend using it.
- RSYNC is widely available for Linux, Mac OS X, Alpha Tru64 and most other
Unix platforms. RSYNC can be built on Windows if you have Cygwin installed;
that topic is beyond the scope of this web page.
- RSYNC can be used for copying a directory structure on a single host, or
between hosts on a network.
- RSYNC first compares files on the source and destination, and only copies
files which are new or updated. This means you don't have to wait until a data
set is completed before starting your backup; when you repeat the command later
you will not waste time overwriting files which have already been transferred.
- RSYNC can use several different network protocols. The default SSH protocol
(since rsync version 2.6.0) makes it excellent for use with firewalls.
Here's an example script with options I would recommend, to be run on the
MacCHESS machine:
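#!/bin/csh -f
#
# The trailing slash on the source copies the contents of xtal1 into macc/xtal1;
# without it rsync would create macc/xtal1/xtal1.
rsync -avz --delete --exclude "*.imx_0" /A1a/jones/xtal1/ jones@myhost.mynet.edu:macc/xtal1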
Alternatively, you can run RSYNC from your own host. This will be preferable if
your host uses DHCP and therefore does not have a static IP address. If either
the source or the destination is on a remote host, RSYNC will prompt for the
remote account password.
The options used in the script above:
- -a archive mode: recurse into directories, preserve symbolic links as links,
and preserve permissions, ownership and file modification times
- -v verbose
- -z compress for network transfer, decompress on the other end.
- -e ssh use ssh as the network transfer protocol. (This is
the default since rsync v2.6.0)
- --delete if any files disappear from the source, delete them
from the destination (this may be useful when copying processing directories)
- --exclude "*.imx_0" Do not transfer certain files, in this example
the raw (not corrected for dark current) data frames.
- source and destination can be specified in ssh (scp) style
format: user@host:path
- For more options, read the man page.
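Running RSYNC from your own host, as mentioned above, might look like the sketch below. It is written for a Bourne-style shell; the function name pull_data and the account, host and directory names in the usage comment are placeholders. It uses the options described in the list except --delete, which I have left out here so that a mistyped destination cannot remove anything.

```shell
#!/bin/sh
# pull_data SRC DST (hypothetical helper): copy SRC into DST, skipping
# the raw *.imx_0 frames. Either argument may be in user@host:path form.
pull_data() {
    rsync -avz --exclude "*.imx_0" "$1" "$2"
}

# Usage, run on your own machine (placeholder account and host names):
#   pull_data jones@beamline-host.example.edu:/A1a/jones/xtal1 ~/macc/
```

Repeating the same command later transfers only the frames that have appeared since the previous run.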
The FAT filesystem is limited in several respects: file modification times have
low resolution (2 seconds) and it is not possible to set owner and group ID.
Replacing -a with -rlt and adding --modify-window=1 on the rsync command can
forestall warning messages and needless re-copying. This script worked well on
Macintosh and Linux computers with a FAT32 disk:
#!/bin/csh -f
#
rsync -rltv --modify-window=1 --exclude "*.imx_0" /A1a/smith /media/disk1/chess_040812/.
The FAT filesystem has another limitation: it does not distinguish between
upper and lower case in file names. (The FAT filesystem dates back to
the days of DOS.) Short (8.3-style) file names are stored in upper case on the
FAT filesystem (and may be shown in lower case when read by Linux). This means
that RSYNC may not recognize your files as being the same on both disks. It
might then proceed to delete all affected files on the FAT disk and recopy
them. This defeats the benefit of incremental backup that makes rsync so great.
- If you are experiencing problems of this sort, do not use the --delete
option.
- Do not use file or directory names that are identical except for case, such
as fred, FRED and fReD; these will all be mapped to the same file name on the
FAT disk. You will not be happy with the results.
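Before copying to a FAT disk you can check for such name clashes ahead of time. The following is a small sketch for a Bourne-style shell; the function name case_collisions is invented for this example. It reads one path per line and prints, in lower case, each path that occurs more than once when case is ignored.

```shell
#!/bin/sh
# case_collisions (hypothetical helper): read paths on stdin, one per line,
# and print each lowercased path that appears more than once.
case_collisions() {
    tr '[:upper:]' '[:lower:]' | sort | uniq -d
}

# Usage (illustrative path):
#   find /A1a/smith | case_collisions
```

No output means no two names in the tree differ only by case.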
Here's a repeat of a typical rsync command to copy files from one place to
another on the same computer. You could do this if the data disk was mounted via
NFS:
rsync -av olddir newdir
!! CAUTION !! In the rsync command, the source comes first and the destination
last. If you confuse the two, you can wipe out your files before they have been
backed up. We cannot recover lost files. Our RAID systems use the XFS file
system.
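If you are unsure whether you have the order right, rsync's -n (--dry-run) option lists what would be transferred or deleted without touching either side. A minimal sketch for a Bourne-style shell follows; the function name preview_sync is made up for illustration.

```shell
#!/bin/sh
# preview_sync SRC DST (hypothetical helper): show what "rsync -av SRC DST"
# would do, without copying or deleting anything (-n is --dry-run).
preview_sync() {
    rsync -avn "$1" "$2"
}

# Usage:
#   preview_sync olddir newdir
```

Add any other options you plan to use (such as --delete) to the preview so that the listing reflects the real command, then rerun without -n once it looks right.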
FTP
FTP is available on most platforms, even Windows.
- FTP is faster than SCP by about 2x because SCP encrypts data during transfer.
- FTP commands mput and mget can transfer multiple files in
one command.
- The syntax uses wildcards, e.g.: mput xtal1*.img
- To avoid being asked permission to transfer each file, use the
prompt command to turn off prompting (the command is a toggle)
- Some FTP servers have an idle timeout, and for reasons I have never
comprehended, transferring lots of big files qualifies as idling: try idle
7200 to increase the timeout period to 2 hours. This option is not
available on some FTP servers.
- If you're still having trouble with the timeout, break the files into
smaller groups: mput xtal1_1_0*.img, mput xtal1_1_1*.img, etc
- FTP FROM CHESS is easy, if you have an FTP server at your home lab.
- FTP TO CHESS is not possible due to the CHESS firewall.
- Don't compress your files before transfer; compressing multi-gigabyte data
sets takes more time than you'll make up in transfer speed.
- Don't put all of your image files into one tar file; it will be too huge,
and if the connection gets dropped you'll have to start all over again. Tar
files are OK for processing files, which are mostly smaller.
- Transfer your files to/from the computer where the files reside. For example, if
your files are on disk /A1a, make your ftp connection to dateline rather
than one of the other computers. If you transfer from another machine which has
/A1a mounted via NFS, there are actually two network transfers taking place,
which can only slow you down.
- Alternate FTP clients, such as GFTP, may have a better user interface than
standard FTP. Availability of clients depends on your platform.
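The prompt, idle and grouped mput tips above can be combined into a command file for the standard command-line ftp client. This is a sketch for a Bourne-style shell; the function name ftp_batch is invented, the binary command is my own precaution to make sure image files are not sent in ASCII mode, and the usage line assumes that login to your home FTP server is handled by a ~/.netrc entry.

```shell
#!/bin/sh
# ftp_batch PREFIX (hypothetical helper): print ftp commands that send the
# frames in ten smaller groups, PREFIX_0*.img through PREFIX_9*.img.
ftp_batch() {
    echo "binary"        # make sure image files are sent unmodified
    echo "prompt"        # toggle per-file prompting (on by default) to off
    echo "idle 7200"     # ask for a 2 hour idle timeout; not all servers allow it
    for d in 0 1 2 3 4 5 6 7 8 9; do
        echo "mput ${1}_${d}*.img"
    done
    echo "bye"
}

# Usage, from the directory holding the frames (host name as in the rsync example):
#   ftp_batch xtal1_1 | ftp myhost.mynet.edu
```

Inspect the generated commands first by running ftp_batch on its own; if the server rejects the idle command, the remaining commands should still run.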
SCP
SCP is a file transfer program which uses the SSH network protocol. It is widely
available on Unix/Linux computers, and ports exist for Windows as well (e.g.
WinSCP). SCP does not have the sophisticated features of RSYNC, so investigate
that option first.
Tape
The section on tape backup has been moved to a separate
page to shorten this one. With the retirement of our Alpha workstations, we
have fewer tape drives plugged in, so if you want to use tape, please make a
request in advance.
Optical
MacCHESS has quite a few DVD burners on Macintosh and Linux machines. A
single-sided single-layer DVD holds less than 5 GB, so I have trouble taking optical
media seriously for synchrotron backup. K3b is a popular DVD-burning front-end
application on the Linux machines. Bring your own media or get directions to the
local shops; our stockroom does not carry them. If you encounter any
difficulties in burning DVDs consult your MacCHESS staff scientist.
Blu-ray has finally won out over HD-DVD as the higher density optical format for
pre-recorded media. Burners are still somewhat expensive and not common for data
backup. If you use Blu-ray and would like to see a drive available at MacCHESS,
let us know and we'll look into it.