Imagine that we have two very large directories (tens or hundreds of GB) that we want to keep identical on two different systems. For example, think of a directory of professional documents that we want to synchronize between our home and work PCs, or a directory of family photos that we want to synchronize between our home PC and our mother's PC.
The idea is that, after modifying the files on one of the PCs, we could copy only the changes to the other PC (which in principle would be a small volume of data) without having to copy all the files every time (remember that we are talking about many GB of data).
If both systems are connected by network (local or Internet), although there may be other solutions, my preferred one would be to use rsync (see Backups with rsync), available on Linux, Mac OS X and Windows/Cygwin.
We could also choose to store the data in "the cloud", with Dropbox, Google Drive, SkyDrive, iCloud, etc. Then, if we modify some files on one system, the modified files are updated in the cloud and, later, on the other systems. But let's say we handle 100 GB of data. Besides needing at least that much space contracted with our cloud storage provider, it turns out that the initial upload of 100 GB over a 1 Mbps upstream Internet connection would take several days:
100,000,000,000 bytes / (1,000,000 bits/s ÷ 8 bits/byte) = 800,000 s ≈ 9.26 days
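The arithmetic above can be checked quickly on the command line:

```shell
# 100 GB at 1 Mbps upstream: convert bytes to bits, divide by the
# line rate to get seconds, then convert seconds to days.
awk 'BEGIN {
    bytes = 100e9
    seconds = bytes * 8 / 1e6
    printf "%d s = %.2f days\n", seconds, seconds / 86400
}'
# prints: 800000 s = 9.26 days
```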
If we do not have rsync, or cloud storage, and perhaps no network access at all, some people would copy all the files to an external hard drive (hundreds of GB), or perhaps keep track of which files have changed in order to copy only those to a USB stick. And later they would still have to copy the files again on the target system...
Well, it turns out that rdiffdir is an excellent solution to this problem. rdiffdir is written in Python, is part of the duplicity application (for making directory backups), is based on rdiff and uses the librsync library. With rdiffdir we can easily create a file that contains only the changes, using the rsync algorithm, and apply those changes on the system that is out of date.
Let's take the example of the directory with professional documents that we want to synchronize between the home PC and the work PC to understand how we would do it with rdiffdir.
We start from a scenario in which the directories are perfectly synchronized on both systems. We have just arrived at work, and the directory we want to synchronize contains the following:
work:~$ find directory/
directory/
directory/subdirectory1
directory/subdirectory1/fileA.txt
directory/subdirectory2
directory/subdirectory2/fileB.txt
work:~$ cat directory/subdirectory1/fileA.txt
Test A
work:~$ cat directory/subdirectory2/fileB.txt
Test B
Before starting to work, we generate a file with the checksums of the blocks of the files and the directory information, using rdiffdir:
work:~$ rdiffdir signature directory signature_$(date +%y%m%d).rdiffdir
And we start working: editing files, adding files and directories, deleting files...
work:~$ echo "New line" >> directory/subdirectory1/fileA.txt
work:~$ rm -rf directory/subdirectory2/
work:~$ mkdir directory/subdirectory3
work:~$ echo "Test C" > directory/subdirectory3/fileC.txt
And at the end of the day, we take the signature file and generate a file with the changes:
work:~$ rdiffdir delta signature_130208.rdiffdir directory changes_$(date +%y%m%d).rdiffdir
We go home, copy the changes file over, and apply it to the directory:
home:~$ rdiffdir patch directory changes_130208.rdiffdir
And we verify that, indeed, we have the day's changes:
home:~$ find directory
directory
directory/subdirectory1
directory/subdirectory1/fileA.txt
directory/subdirectory3
directory/subdirectory3/fileC.txt
home:~$ cat directory/subdirectory1/fileA.txt
Test A
New line
home:~$ cat directory/subdirectory3/fileC.txt
Test C
If we plan to make changes at home, we would have to repeat the process: generate a signature file before starting, and a changes file at the end.
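To get an intuition for what the signature and delta capture, here is a toy, file-level version of the idea built from ordinary checksums; rdiffdir itself works at the block level inside each file (so its deltas are much smaller for large files that change slightly), and all paths below are throwaway examples:

```shell
# Throwaway example directory and two snapshot files.
dir=$(mktemp -d); sig_old=$(mktemp); sig_new=$(mktemp)
mkdir "$dir/subdirectory1"
echo "Test A" > "$dir/subdirectory1/fileA.txt"

# "Signature": one checksum per file, sorted so snapshots compare cleanly.
(cd "$dir" && find . -type f -exec sha256sum {} + | sort) > "$sig_old"

# Work happens: a file is modified.
echo "New line" >> "$dir/subdirectory1/fileA.txt"
(cd "$dir" && find . -type f -exec sha256sum {} + | sort) > "$sig_new"

# The "delta" at this toy level is simply the list of changed files.
diff "$sig_old" "$sig_new" | grep '^[<>]'
```

The real rdiffdir goes one step further: inside each changed file it uses the rsync rolling-checksum algorithm, so the delta contains only the modified blocks rather than whole files.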
But wait: could the overhead outweigh the benefit? How big is the signature file of a huge directory, and how long does it take to generate? I just tried it on a directory with 22,000 files, 1,400 directories and 10 GB: it took about 6 minutes and the signature takes up about 130 MB, a suitable size to carry on a USB stick.
Duplicity is available in some Linux distributions (Fedora, Ubuntu, Debian). In others, we can compile it. We can also use it under Windows with Cygwin. You have to install the required packages beforehand (librsync-devel, among others); then you just have to download the file with the sources, unzip it, enter the directory from the Cygwin shell and execute:
python setup.py install