Moving Large Amounts of Data
How to transfer large amounts of data via network.
by Harry Mangalam <harry.mangalam@uci.edu>
Taken from UCCSC List on 12/4/2008
http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html
We all need to transfer data, and the amount of that data is increasing as the world gets more digital. If it's not climate model data from the IPCC, it's high energy particle physics data from the LHC (or your MP3 collection).
The usual methods of transferring data (scp, and http and ftp utilities such as curl or wget) work fine when your data is in the MB range, but when you have very large collections of data there are some tricks that are worth mentioning.
1. Compression & Encryption
Whether to compress and/or encrypt your data in transit depends on the cost of doing so. On a modern desktop or laptop computer, the CPU(s) are usually mostly idle, so the cost of doing the compression/encryption is generally not even noticed. On an otherwise loaded machine, however, it can be significant, so it depends on what else has to be done at the same time.
Compression can considerably reduce the amount of data that needs to be transmitted if the data is of a compressible type (text, XML, uncompressed images and music). Increasingly, however, such data is already compressed on disk, and compressing already-compressed data yields little improvement. Most compression utilities try to detect already-compressed data and skip it, so there is usually no penalty in requesting compression, but some utilities will not detect it correctly and will waste a lot of time.
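If you are unsure whether a file is worth compressing in transit, a quick test on a sample of it can help. The following is a minimal sketch, not a definitive tool: the function name compress_pct and the 1 MB default sample size are made up for illustration, and it assumes gzip and a head that supports -c are installed.

```shell
# Hypothetical helper: gzip the first part of a file and print the
# compressed size as a percentage of the sample size.
# Usage: compress_pct FILE [SAMPLE_BYTES]
compress_pct() {
    file=$1
    sample_bytes=${2:-1048576}   # assumed default: sample the first 1 MB

    # tr strips the padding that some wc implementations add
    orig=$(head -c "$sample_bytes" "$file" | wc -c | tr -d ' ')
    comp=$(head -c "$sample_bytes" "$file" | gzip -c | wc -c | tr -d ' ')

    if [ "$orig" -eq 0 ]; then
        echo "compress_pct: $file is empty" >&2
        return 1
    fi
    echo $(( comp * 100 / orig ))
}
```

A result near 100 suggests the data is already compressed and that requesting compression will mostly burn CPU; a small number suggests compression is likely to pay off.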
Similarly, there is a computational cost to encrypting and decrypting data, but less so than with compression. scp uses ssh to do the underlying encryption and it does a very good job, but like other single-stream utilities such as curl and wget, it will only be able to push so much thru a connection.
2. Avoiding data transfer
The most efficient way to transfer data is not to transfer it at all. There are a number of utilities that can be used to assist in NOT transferring data. Some of them are listed below.
2.1. kdirstat
The elegant, open source kdirstat (and its ports to MacOSX, Disk Inventory X, and Windows, Windirstat) is a quick way to visualize what's taking up space on your disk, so you can either exclude unwanted data from what needs to be copied or delete it to make more space. All of these are fully native GUI applications that show disk space utilization by file type and directory structure.
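Where no GUI is available (for example on a remote server reached over ssh), a rough command-line approximation is possible with du. This sketch is not part of kdirstat; the function name biggest and its defaults are made up for illustration.

```shell
# Hypothetical helper: list the largest entries directly under a directory,
# biggest first, with sizes in KB.
# Usage: biggest [DIR] [COUNT]
biggest() {
    dir=${1:-.}       # default: current directory
    count=${2:-10}    # default: show the top 10 entries
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, biggest ~/FF 5 would show the five largest files or subdirectories in the FF directory.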

2.2. rsync
rsync, from the fertile mind of Andrew (samba) Tridgell, is a protocol that can recurse thru a directory tree and create rolling checksums for the data it encounters and send only changed data over the network to the remote rsync.
For example, if you had recently added a song to your 120 GB MP3 collection and you wanted to refresh the collection to your backup machine, instead of sending the entire collection over the network, rsync would detect and send only the new songs.
The first time rsync is used to transfer a directory tree, there will be no speedup:
$ rsync -av ~/FF moo:~
building file list ... done
FF/
FF/6vxd7_10_2.pdf
FF/Advanced_Networking_SDSC_Feb_1_minutes_HJM_fw.doc
FF/Amazon Logitech $30 MIR MX Revolution mouse.pdf
FF/Atbatt.com_receipt.gif
FF/BAG_bicycle_advisory_group.letter.doc
FF/BAG_bicycle_advisory_group.letter.odt
...
sent 355001628 bytes received 10070 bytes 11270212.63 bytes/sec
total size is 354923169 speedup is 1.00
But a few minutes later, after adding danish_wind_industry.html to the FF directory, the same command sends only the new file:
$ rsync -av ~/FF moo:~
building file list ... done
FF/
FF/danish_wind_industry.html
sent 63294 bytes received 48 bytes 126684.00 bytes/sec
total size is 354971578 speedup is 5604.05
So the second synchronization has a speedup of about 5600-fold.
Even more efficiently, if you had a huge database to back up and you had recently modified it so that most of the bits were identical, rsync would send only the blocks that contained the differences.
Here's a modest example using a small binary database file:
$ rsync -av mlocate.db moo:~
building file list ... done
mlocate.db
sent 13580195 bytes received 42 bytes 9053491.33 bytes/sec
total size is 13578416 speedup is 1.00
After the transfer, I update the database and rsync it again:
$ rsync -av mlocate.db moo:~
building file list ... done
mlocate.db
sent 632641 bytes received 22182 bytes 1309646.00 bytes/sec
total size is 13614982 speedup is 20.79
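The speedup figure that rsync reports is the total size of the data divided by the number of bytes that actually crossed the network (sent plus received). The numbers in the transcripts above can be checked with a little arithmetic; the function name rsync_speedup is made up for illustration.

```shell
# speedup = total size / (bytes sent + bytes received)
# Usage: rsync_speedup TOTAL_SIZE SENT RECEIVED
rsync_speedup() {
    awk -v total="$1" -v sent="$2" -v recv="$3" \
        'BEGIN { printf "%.2f\n", total / (sent + recv) }'
}

rsync_speedup 354971578 63294 48      # second FF run, prints 5604.05
rsync_speedup 13614982 632641 22182   # second mlocate.db run, prints 20.79
```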
There are many utilities based on rsync that are used to synchronize data on 2 sides of a connection by only transmitting the differences. The backup utility BackupPC is one.
The open source rsync is included by default with almost all Linux distributions as well as MacOSX. Versions of rsync exist for Windows as well, via Cygwin and DeltaCopy.
2.3. Unison
Unison is a slightly different take on transmitting only changes. It uses a bi-directional sync algorithm to unify filesystems across a network. Native versions exist for Windows as well as Linux/Unix and it is usually available from the standard Linux repositories.
To install it on an Ubuntu or Debian machine:
$ sudo apt-get install unison
3. Streamlining Data transfer
3.1. GridFTP
If you and your colleagues have to transfer data in the range of multiple GBs and you have to do it regularly, it's probably worth setting up a GridFTP site. Because it allows multipoint, multi-stream TCP connections, it can transfer data at multiple GB/s. However, it's beyond the scope of this simple doc to describe its setup and use, so if this sounds useful, bother your local network guru/sysadmin.
3.2. bbftp
bbftp is a modification of the FTP protocol that enables you to open multiple simultaneous TCP streams to transfer data. It therefore allows you to sometimes bypass per-TCP restrictions that result from badly configured intervening machines. Short of access to a GridFTP site, this appears to be the best single-node method for transferring data.
In order to use it, you'll need a bbftp client and server. Most places that receive large amounts of data (SDSC, NCAR, other supercomputer centers, teragrid nodes) will already have a bbftp server running, but you can also compile and run the server yourself.
The more usual case is to run only the client. It builds very easily on Linux with just the typical curl/untar, cd, ./configure, make, make install dance:
$ curl http://doc.in2p3.fr/bbftp/dist/bbftp-client-3.2.0.tar.gz |tar -xzvf -
$ cd bbftp-client-3.2.0/bbftpc/
$ ./configure --prefix=/usr/local
$ make -j3
$ sudo make install
Using bbftp is more complicated than the usual ftp client because it has its own syntax:
To send data to a server:
$ bbftp -s -e 'put file.154M /gpfs/mangalam/big.file' -u mangalam -p 10 -V tg-login1.sdsc.teragrid.org
Password:
>> COMMAND : put file.154M /gpfs/mangalam/big.file
<< OK
160923648 bytes send in 7.32 secs (2.15e+04 Kbytes/sec or 168 Mbits/s)
the arguments mean:
-s use ssh encryption
-e 'local command'
-E 'remote command' (not used above, but often used to cd on the remote system)
-u 'login'
-p # use # parallel TCP streams
-V be verbose
The data was sent at 21MB/s to SDSC thru 10 parallel TCP streams, which is still well below the peak bandwidth of about 90MB/s on a Gb network.
To get data from a server:
$ bbftp -s -e 'get /gpfs/mangalam/big.file from.sdsc' -u mangalam -p 10 -V tg-login1.sdsc.teragrid.org
Password:
>> COMMAND : get /gpfs/mangalam/big.file from.sdsc
<< OK
160923648 bytes got in 3.46 secs (4.54e+04 Kbytes/sec or 354 Mbits/s)
I was able to get the data at 45MB/s, about half of the theoretical maximum.
As a comparison, because the remote receiver is running an old (2.4) kernel which does not handle dynamic TCP window scaling, scp is only able to manage 2.2MB/s to this server:
$ scp file.154M mangalam@tg-login1.sdsc.teragrid.org:/gpfs/mangalam/junk
Password:
file.154M 100% 153MB 2.2MB/s 01:10
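The rates quoted above follow directly from the byte counts and elapsed times in the transcripts. A small sketch of the arithmetic, assuming 1 MB = 1048576 bytes (which matches the figures bbftp printed for the put); the function name throughput is made up for illustration.

```shell
# Convert a byte count and an elapsed time in seconds into MB/s and Mbit/s.
# Assumes binary units: 1 MB = 1048576 bytes.
# Usage: throughput BYTES SECONDS
throughput() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN {
        mb = bytes / secs / 1048576
        printf "%.1f MB/s, %.0f Mbit/s\n", mb, mb * 8
    }'
}

throughput 160923648 7.32   # bbftp put: prints 21.0 MB/s, 168 Mbit/s
throughput 160923648 70     # scp, 01:10 elapsed: prints 2.2 MB/s, 18 Mbit/s
```

The same file therefore moved roughly ten times faster thru 10 parallel bbftp streams than thru the single scp stream.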
3.3. netcat
netcat (aka nc) is installed by default on most Linux and MacOSX systems. It provides a way of opening TCP or UDP network connections between nodes, acting as an open pipe thru which you can send any data as fast as the connection will allow, imposing no additional protocol load on the transfer. Because of its widespread availability and its speed, it can be used to transmit data between 2 points relatively quickly, especially if the data doesn't need to be encrypted or compressed (or if it already is).
However, to use netcat, you have to have login privs on both ends of the connection and you need to explicitly set up a sender that waits for a connection request on a specific port from the receiver. This is less convenient to do than simply initiating an scp or rsync connection from one end, but may be worth the effort if the size of the data transfer is very large. To monitor the transfer, you also have to use something like pv (pipeviewer); netcat itself is quite laconic.
How it works: On the sending end, you need to set up a listening port:
[send_host]: $ pv -pet honkin.big.file | nc -q 1 -l -p 1234 <enter>
This sends the honkin.big.file thru pv -pet which will display progress, ETA, and time taken. The command will hang, listening (-l) for a connection from the other end. The -q 1 option tells the sender to wait 1s after getting the EOF and then quit.
On the receiving end, you connect to the nc listener
[receive_host]: $ nc bongo.nac.uci.edu 1234 | pv -b > honkin.big.file <enter>
(note: no -p to indicate port on the receiving side). The -b option to pv shows only bytes received.
Once the receive_host command is initiated, the transfer starts, as can be seen by the pv output on the sending side and the bytecount on the receiving side. When it finishes, both sides terminate the connection 1s after getting the EOF.
This arrangement is slightly arcane, but supports the unix tools philosophy which allows you to chain various small tools together to perform a task. While the above example shows the case for a single large file, it can also be modified only slightly to do recursive transfers, using tar, shown here recursively copying the local sge directory to the remote host.
[send_host]: $ tar -czvf - sge | nc -q 1 -l -p 1234
[receive_host]: $ nc bongo.nac.uci.edu 1234 | tar -xzvf -
In this case, I've added the verbose flag (-v) to the tar command so using pv is redundant. It also uses tar's built-in gzip compression flag (-z) to compress as it transmits.
You could also bundle the 2 together in a script, using ssh to execute the remote command.
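One way such a script might look is sketched below. It is an illustration under several assumptions, not a tested production tool: the function name nc_pull, the DRY_RUN switch and the 2 second pause are made up here, the nc flags are the ones used in the examples above (other netcat variants spell them differently), and ssh is assumed to log in without prompting for a password.

```shell
# Hypothetical helper: pull a directory from a remote host over netcat,
# using ssh only to start the listener on the sending side.
# Usage: nc_pull SEND_HOST REMOTE_DIR [PORT]
# Set DRY_RUN=1 to print the two commands instead of running them.
nc_pull() {
    send_host=$1
    remote_dir=$2
    port=${3:-1234}

    # Same pipelines as in the tar example, minus -v to keep ssh quiet.
    send_cmd="tar -czf - '$remote_dir' | nc -q 1 -l -p $port"
    recv_cmd="nc $send_host $port | tar -xzf -"

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "ssh -n $send_host \"$send_cmd\" &"
        echo "$recv_cmd"
        return 0
    fi

    # Start the listener on the sending host in the background.
    ssh -n "$send_host" "$send_cmd" &
    sleep 2   # crude assumption: the listener is up within 2 seconds

    # Connect to the listener and unpack into the current directory.
    nc "$send_host" "$port" | tar -xzf -
    wait      # collect the background ssh
}
```

Running DRY_RUN=1 nc_pull bongo.nac.uci.edu sge first shows exactly what would be executed on each side; dropping DRY_RUN performs the transfer into the current directory.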
4. Latest version of this Document
The latest version of this document should always be at http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html
Further information
http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html
Fast Data Transfer application (mentioned by IET's David Walker)
http://fasterdata.es.net/