May 08, 2008
Similarity-Enhanced Transfer (SET) looks like it could prove very useful for efficiently sharing collections of Live-Spins without having to re-download an entire ISO image for every desired Live-Spin.
What is Similarity-Enhanced Transfer?
After a brief skim, it seems that SET is a concept similar to BitTorrent but without arbitrary chunking of data. By using handprinting, both similar and exact-match chunks can be identified and used in the download process. The concept looks very interesting and I'm hoping to set aside some time to work on proof-of-concept code in the near future. I would also like to extend an invitation to the community to help develop and prove the viability of such a solution for mass-hosting of Live-Spins and Live-Spin collections, such as localized spins based on the same package set. We could easily set up an upstream git repo (likely on fedorahosted) or we could just add a branch to the existing pyJigdo repo and get right to work.
Why should this Concept Even be Considered?
Well, I'll quote the abstract and hope it's enough to encourage reading the entire paper:
"Many contemporary approaches for speeding up large file transfers attempt to download chunks of a data object from multiple sources. Systems such as BitTorrent quickly locate sources that have an exact copy of the desired object, but they are unable to use sources that serve similar but non-identical objects. Other systems automatically exploit cross-file similarity by identifying sources for each chunk of the object. These systems, however, require a number of lookups proportional to the number of chunks in the object and a mapping for each unique chunk in every identical and similar object to its corresponding sources. Thus, the lookups and mappings in such a system can be quite large, limiting its scalability.
This paper presents a hybrid system that provides the best of both approaches, locating identical and similar sources for data objects using a constant number of lookups and inserting a constant number of mappings per object. We first demonstrate through extensive data analysis that similarity does exist among objects of popular file types, and that making use of it can sometimes substantially improve download times. Next, we describe handprinting, a technique that allows clients to locate similar sources using a constant number of lookups and mappings. Finally, we describe the design, implementation and evaluation of Similarity-Enhanced Transfer (SET), a system that uses this technique to download objects. Our experimental evaluation shows that by using sources of similar objects, SET is able to significantly out-perform an equivalently configured BitTorrent."
Himabindu Pucha, David G. Andersen, Michael Kaminsky
Purdue University, Carnegie Mellon University, Intel Research Pittsburgh
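To make the handprinting idea from the abstract a bit more concrete, here is a rough Python sketch. It is not the SET implementation: it assumes fixed-size chunks (a real system would more likely use content-defined chunk boundaries so that insertions don't shift every later chunk), and the function names, chunk size and k are made up for illustration. The idea shown is that each object is summarized by its k smallest chunk hashes, so two objects that share most of their chunks are very likely to share at least one handprint entry.

```python
import hashlib


def chunk_hashes(data, chunk_size=4096):
    """Hash each fixed-size chunk of data (fixed-size chunking is an assumption)."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def handprint(data, k=4, chunk_size=4096):
    """Summarize an object by its k smallest distinct chunk hashes."""
    return set(sorted(set(chunk_hashes(data, chunk_size)))[:k])


def looks_similar(a, b, k=4):
    """Two objects are candidate sources for each other if their handprints overlap."""
    return bool(handprint(a, k) & handprint(b, k))


# A "base spin" of 32 distinct chunks, a variant with one chunk changed,
# and an unrelated object that shares no chunks with the base.
base = b"".join(bytes([i]) * 4096 for i in range(32))
variant = base[:4096 * 5] + b"\xff" * 4096 + base[4096 * 6:]
unrelated = b"".join(bytes([i]) * 4096 for i in range(100, 132))

print(looks_similar(base, variant))
print(looks_similar(base, unrelated))
```

The point of the small, fixed-size summary is that a lookup service only has to store and answer k entries per object, no matter how many chunks the object has.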
Apr 25, 2008
pyJigdo version 0.3.0 has been sent off to the Fedora build system. There has been a lot of work put into this release to make it a stable starting ground for everything we want to achieve with pyJigdo.
No more jigdo-lite...
I'm delighted to inform the community that we now have a good alternative to jigdo-lite for downloading the upcoming Fedora [jigdo] release. A good amount of testing has gone into this release, and that testing, coupled with the complete rewrite, has proven quite fruitful. Two [important] things that remain to be tested are running this release on F7 and F9; I have done all my testing with F8. This release already has many more features than jigdo-lite and, in most cases, saves time. Some of the more notable new features include the ability to auto-mount an existing ISO image (via fuseiso), the ability to efficiently search directories for needed files, and the ability to use mirror lists. See pyjigdo --help for all of the currently available features.
Where to download?
I've just sent out the builds, so if you don't want to wait for them to hit updates-testing (and then updates), go to the pyjigdo koji page for builds. If you do end up testing it, please mark your comments via Bodhi for F7 and F8. This release should be yum installable soon enough, however.
Where do bugs go?
Please file bugs either in the Red Hat Bugzilla or, preferably, on the Fedora Hosted pyJigdo trac instance.
What does the future bring?
We have a lot planned for pyJigdo, so please keep an eye out for more releases (or send patches for features and fixes). Also, watch for changes to our roadmap, as I am going to try to get everything I plan on doing into trac before I start working on it.
Feb 27, 2008
bsdiff and bspatch are tools for building and applying patches to binary files. When looking for a solution to create a shared pool for squashfs images, I ran across bsdiff. Even if it's not going to work for Jigdo-style live image pools, it's still very interesting.
Upon reading Colin Percival's doctoral thesis, I ran into a statement that I thoroughly enjoyed, and it was this statement that convinced me to try using bsdiff to solve distribution concerns with tens, if not hundreds, of similar Live-Spins.
"If a mathematician is a machine for turning coffee into theorems, a computer scientist is a machine for converting caffeine into algorithms. As with mathematicians and theorems, the output of these machines may bear little resemblance to that which was originally sought, but I hope the reader will find this particular body of output to be both interesting and useful."- Colin Percival, Doctoral Thesis, http://www.daemonology.net/bsdiff/, 2006.
After doing some brief testing on how to slice up a squashfs image for sharing multiple live images using some base of binary data, I found this is no easy task. The Fedora Project has been planning on using BitTorrent to share custom live images with their Community. As a Fedora Unity member, I've been involved in trying to find [or create] a solution to efficiently share localized spins where the primary difference in the squashfs is just localization data. When gzipping even identical zeroed files (the compression for squashfs is gzip), the end result is a different file:
[jon@damaestrojr ~]$ dd if=/dev/zero of=test.img bs=1024 count=100; cp test.img test2.img; gzip *.img; diff test.img.gz test2.img.gz
100+0 records in
100+0 records out
102400 bytes (102 kB) copied, 0.00317973 s, 32.2 MB/s
Binary files test.img.gz and test2.img.gz differ
[jon@damaestrojr ~]$ md5sum *.img.gz
ffa2865fe4cce1abdd18ea62af86cd1f test2.img.gz
80a93fda63cb0908817d520db12cbc79 test.img.gz
After testing this very simple example, I knew we were in trouble.
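It may be worth separating two effects in the test above. A gzip file carries a header that, by default, records the original file name and modification time, so two .gz files made from identical data can differ even when the compressed payload is byte-for-byte the same. The Python sketch below illustrates this with the standard gzip module; the helper name gzip_bytes is made up for illustration, and the header layout it relies on (a 10-byte fixed header followed by a NUL-terminated file name) is the one the gzip format defines.

```python
import gzip
import io


def gzip_bytes(data, name, mtime=0):
    """Compress data in memory, storing the given name and mtime in the gzip header."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", fileobj=buf, mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


zeros = bytes(100 * 1024)  # same content as the dd example: 100 KiB of zeros

a = gzip_bytes(zeros, "test.img")
b = gzip_bytes(zeros, "test2.img")
c = gzip_bytes(zeros, "")  # no file name stored in the header

# Skip the fixed header (10 bytes) plus the stored name and its NUL terminator.
payload_a = a[10 + len("test.img") + 1:]
payload_b = b[10 + len("test2.img") + 1:]

print(a == b)                  # whole files differ because the stored names differ
print(payload_a == payload_b)  # the compressed stream and trailer are the same
print(gzip.decompress(a) == gzip.decompress(b))
```

On the command line, gzip -n leaves the name and timestamp out of the header. None of this settles whether squashfs images of similar spins share enough bytes to be pooled; that still needs real testing.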
SquashFS - What is it and why do we use it?
" Squashfs is a compressed read-only filesystem for Linux. Squashfs is intended for general read-only filesystem use, for archival use (i.e. in cases where a .tar.gz file may be used), and in constrained block device/memory systems (e.g. embedded systems) where low overhead is needed."- http://squashfs.sourceforge.net/
Most Live-Spins, from the Fedora Project and elsewhere, will be ~700MB (the size of a CD). Due to this size constraint, the Live-Spin rootfs is squashed. Live-Spins can be created without a compressed filesystem, but in most cases they are compressed.
What about Jigdo?! Why won't it work?
Jigdo does a lot of things well and is a really neat concept. The only way jigdo [concepts] would be able to help us is if we recreated the squashfs with data that is downloaded from rpm packages and dumped into the squashfs. At this level of complexity, it's almost easier to just rebuild the Live-Spin from a definition (the kickstart) rather than trying to piece it back together. It's understandable that some people don't have the resources or desire to learn and utilize the live toolchain, but recreating [essentially] the same process is a misuse of volunteer effort and of the computational resources needed to do it.
Okay, Okay, Where does bsdiff fit in?
bsdiff [concepts] could be used to isolate binary changes in the squashfs filesystem, for example. This would enable localized versions of Live-Spins to be distributed as a "patch" to the base Live-Spin. These patches would be trivial in size compared to an entirely additional Live-Spin. The one test that was done, on a machine with 8 cores and 8GB of RAM, caused the entire system to crash. This is not a great sign, but oh well; it is fun to try things.
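As a toy illustration of the "patch against a base" idea, the Python sketch below builds a copy/insert delta between two byte strings and applies it again. To be clear, this is not bsdiff's algorithm or patch format; it uses difflib.SequenceMatcher from the standard library as a stand-in, which would be far too slow for real images. The names make_patch and apply_patch and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def make_patch(old, new):
    """Describe new as a list of 'copy from old' and 'insert literal bytes' steps."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse bytes already in the base
        elif j2 > j1:
            ops.append(("insert", new[j1:j2]))  # bytes only the new file has
    return ops


def apply_patch(old, ops):
    """Rebuild the new file from the base plus the patch."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


# A stand-in for a base image and a "localized" variant of it.
base = b"menu=File;menu=Edit;lang=en_US;" * 20
localized = base.replace(b"en_US", b"de_DE")

patch = make_patch(base, localized)
inserted = sum(len(op[1]) for op in patch if op[0] == "insert")

print(apply_patch(base, patch) == localized)
print(inserted, "literal bytes needed out of", len(localized))
```

Only the literal "insert" bytes would need to travel with the patch; everything else comes from the base the user already has.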
Will this ever work?
Maybe. Much more testing and input is needed. Please, ideas are welcome.
-  http://www.daemonology.net/papers/thesis.pdf
Feb 20, 2008
Jigdo has been around for a while and has proven itself useful for Debian. I found that most of the resistance to using it in Fedora stems from the fact that the client is not amazing. Enter pyJigdo...
How is this Related to Fedora?
In the Fedora universe, many things have been done to open up the distribution and have made developing Fedora very interesting. One of the new concepts is Re-Spinning the distribution for specific use cases. Many companies don't have the resources to compose their own in-house distribution and share it network wide, but they do have a use case that warrants a "corporate standard" desktop that is maintainable by as few as one person. Fedora users are now able to take the published packages and Re-Mix, so to speak, the package universe and create something specific to their use cases. This includes customizing runtime settings and available packages, and even making their desktops (even servers) stateless. A single system administrator can easily create their own flavor of Fedora (or even a derivative) with a few simple clicks of a mouse or minor adjustments to kickstart definitions, which can then be shared company wide. Not only will this increase the likelihood of more people trying Fedora, it will create a more stable and thus more productive environment. It will also lead to IT staff having more time to focus on business applications of their technology rather than running Spyware, Malware and Anti-Virus software, wasting time and computational resources. There is so much more to say about Re-Spins and Re-Mixes, but I have to stop as it is outside the scope of this particular blog posting.
What is Jigdo?
"Jigsaw Download, or short jigdo, is a tool designed to ease the distribution of very large files over the internet, for example CD or DVD images. Its aim is to make downloading the images as easy for users as a click on a direct download link in a browser, while avoiding all the problems that server administrators have with hosting such large files."
How does this help?
When having to download a large file, such as a full ISO image, from a single mirror, users can run into slow link speeds, corrupted downloads and wasted time. Worse, if a given image or image set needs to be shared to many locations, the time to transfer from a single location increases greatly as more locations request data. Jigdo provides a mechanism to create a "definition" of a given image. This definition can easily be shared and is trivial in size in comparison to a full image. The jigdo definition enables consumers of the image (for lack of a better term) to put the image(s) back together easily and efficiently. Most of the efficiency comes from the ability to use multiple sources to fetch data, including local data sources or existing images. In the case of a Re-Spin, a jigdo definition can be used to "patch" a past Spin, resulting in a fully updated image. In the case of a Re-Mix, packages that make up the image can be fetched from many sources, including an on-site install tree (normally used to do PXE or network based installations) or even a system such as cobbler. Also, all files/data that make up the image are hashed, which eliminates corrupt images that waste time and bandwidth. The inherent format of the definition also provides a healthy layer of transparency as to the contents of the resulting image. In my humble opinion, there are many more benefits, but in the interest of being terse I will move on.
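The "use local data sources first" part of this can be sketched in a few lines of Python. This is only an illustration of the idea, not the jigdo file format or pyJigdo's code: the definition is reduced to a set of checksums, SHA-256 hex digests stand in for whatever checksum encoding a real definition uses, and the names sha256_of_file and find_local_parts are made up.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, block_size=1 << 20):
    """Hash a file in blocks so large packages don't have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def find_local_parts(needed, search_dirs):
    """Return {checksum: path} for every needed part already present locally."""
    found = {}
    for top in search_dirs:
        for root, _dirs, files in os.walk(top):
            for name in files:
                path = os.path.join(root, name)
                digest = sha256_of_file(path)
                if digest in needed:
                    found.setdefault(digest, path)
    return found


# Demo: a pretend install tree holding one of the two parts the image needs.
with tempfile.TemporaryDirectory() as tree:
    with open(os.path.join(tree, "foo-1.0.rpm"), "wb") as f:
        f.write(b"package foo")
    with open(os.path.join(tree, "unrelated.txt"), "wb") as f:
        f.write(b"not part of the image")

    needed = {
        hashlib.sha256(b"package foo").hexdigest(),
        hashlib.sha256(b"package bar").hexdigest(),
    }
    found = find_local_parts(needed, [tree])
    to_download = needed - set(found)

print(len(found), "part(s) found locally,", len(to_download), "left to fetch from mirrors")
```

Because every part is checked against a checksum before it is used, a corrupted local file or download simply fails to match and gets fetched again from another source.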
What about BitTorrent?
There is nothing wrong with BitTorrent and it provides many of the same benefits as jigdo does. One of the major complaints about BitTorrent is the inability to use it on some network infrastructure. Not only does one need to run a "tracker" to keep tabs on peers, but "seeds" have to run BitTorrent software. It's not every day that mirror administrators (those with serious servers and serious bandwidth) are willing to fire up a torrent client for a customized flavor of a distribution they already mirror, or even for a full/official release. BitTorrent has many viable use cases and I concede that there are good arguments for using BitTorrent in the use cases outlined above. However, I don't believe it is the best solution.
Where to Next?
As Jigdo is almost 100% client side, we need to make a better client. jigdo-lite (a shell script) has served its purpose, but we need to create a more extensible and maintainable client. As a result of these needs, the pyJigdo code base has been created. We need interested Python developers to help with the effort of both creating a fast and efficient implementation of jigdo in Python and creating an interface (both CLI and GUI) that enables users to create, host, assemble, verify, [insert your feature here] and inspect jigdo definitions and templates. Development efforts will continue, but to succeed we will need developers passionate about the concepts Jigdo presents.
Okay, so How do I Help?
- Join the effort: http://pyjigdo.org
- Read the code: https://fedorahosted.org/pyjigdo
- Test how Jigdo works and give feedback: http://spins.fedoraunity.org
- Read more: http://fedoraproject.org/wiki/Features/JigdoRelease