Feb 27, 2008
bsdiff and bspatch are tools for building and applying patches to binary files. When looking for a solution to create a shared pool for squashfs images, I ran across bsdiff. Even if it's not going to work for Jigdo-style live image pools, it's still very interesting.
Upon reading Colin Percival's doctoral thesis I ran into a statement that I thoroughly enjoyed and it was this statement that convinced me to try using bsdiff to solve distribution concerns with tens, if not hundreds, of similar Live-Spins.
"If a mathematician is a machine for turning coffee into theorems, a computer scientist is a machine for converting caffeine into algorithms. As with mathematicians and theorems, the output of these machines may bear little resemblance to that which was originally sought, but I hope the reader will find this particular body of output to be both interesting and useful."- Colin Percival, Doctoral Thesis, http://www.daemonology.net/bsdiff/, 2006.
After doing some brief testing on how to slice up a squashfs image for sharing multiple live images using some base of binary data, I found this is no easy task. The Fedora Project has been planning on using BitTorrent to share custom live images with their Community. As a Fedora Unity member, I've been involved in trying to find [or create] a solution to efficiently share localized spins where the primary difference in the squashfs is just localization data. When gzipping (the compression for squashfs is gzip) even identical zeroed files, the end result is a different file:
[jon@damaestrojr ~]$ dd if=/dev/zero of=test.img bs=1024 count=100; cp test.img test2.img; gzip *.img; diff test.img.gz test2.img.gz 100+0 records in 100+0 records out 102400 bytes (102 kB) copied, 0.00317973 s, 32.2 MB/s Binary files test.img.gz and test2.img.gz differ [jon@damaestrojr ~]$ md5sum *.img.gz ffa2865fe4cce1abdd18ea62af86cd1f test2.img.gz 80a93fda63cb0908817d520db12cbc79 test.img.gz
After testing this very simple example, I knew we were in trouble.
SquashFS - What is it and why do we use it?
" Squashfs is a compressed read-only filesystem for Linux. Squashfs is intended for general read-only filesystem use, for archival use (i.e. in cases where a .tar.gz file may be used), and in constrained block device/memory systems (e.g. embedded systems) where low overhead is needed."- http://squashfs.sourceforge.net/
Most of the Live-Spins the Fedora Project, and most other Live-Spins, will be doing are ~700MB (the size of a CD.) Due to this size constraint, the Live-Spin rootfs is squashed. Live-Spins can be created without a compressed filesystem, but in most cases it is.
What about Jigdo?! Why wont it work?
Jigdo does a lot of things well and is a really neat concept. The only way jigdo [concepts] would be able to help us is if we recreated the squashfs with data that is downloaded from rpm packages and dumped into the squashfs. At this point of complexity, it's almost easier to just rebuild the Live-Spin from a definition (the kickstart) rather then trying to piece it back together. It's understandable that some people don't have the resources or desire to learn and utilize the live toolchain but recreation of [essentially] the same process is a mis-use of volunteer effort and the computational resources needed to achieve this process.
Okay, Okay, Where does bsdiff fit in?
bsdiff [concepts] could be used to isolate binary changes in the squashfs filesystem, for example. This would enable localized versions of Live-Spins to be distributed as a "patch" to the base Live-Spin. These patches would be trivial in size compared to an entirely additional Live-Spin. The one test that was done on a machine with 8 cores and 8GB of RAM, caused the entire system to crash. This is not a great sign, but oh well; it is fun to try things.
Will this ever work?
Maybe. Much more testing and input is needed. Please, ideas are welcome.
-  http://www.daemonology.net/papers/thesis.pdf