Hadoop File Crusher

Why do you need it?

It is well documented that Hadoop has a small files problem. The entire filesystem namespace must fit in NameNode memory, so a large number of files (and files with long names) is a problem. Additionally, MapReduce jobs incur significant overhead spawning a map task for each small file. Jobs with multiple reducers again produce more than one output file, and re-combining that data by feeding it through an IdentityMapper and IdentityReducer is expensive.
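To see why the NameNode becomes the bottleneck, a rough back-of-envelope helps. A commonly cited rule of thumb is that each namespace object (file, directory, or block) costs the NameNode on the order of 150 bytes of heap; the exact figure varies by version, so treat the constant below as an assumption, not a specification.

```java
// Back-of-envelope NameNode heap estimate. BYTES_PER_OBJECT is a
// commonly cited rule of thumb (~150 bytes per namespace object),
// not an exact figure for any particular Hadoop version.
public class NameNodeHeapEstimate {
    static final long BYTES_PER_OBJECT = 150;

    // Each file contributes one file object plus one object per block.
    static long estimateHeapBytes(long files, long blocksPerFile) {
        return files * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 10 million small files, one block each:
        long bytes = estimateHeapBytes(10_000_000L, 1);
        System.out.println(bytes / (1024 * 1024) + " MB of NameNode heap");
    }
}
```

Ten million one-block files already consume roughly 3 GB of heap, regardless of how little data the files actually hold.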

But what can be done?

Enter the hadoop filecrush tool. It can run as a map reduce job or as a standalone program. The tool navigates an entire file tree (or just a single folder), determines which files fall below a size threshold, and combines those into bigger files.
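The core idea can be sketched in a few lines. The following is an illustrative sketch, not the tool's actual code: given a map of file names to sizes (standing in for a real HDFS listing), it selects files below a threshold and greedily packs them into groups whose combined size stays under a target output size. The method and parameter names are mine, not filecrush's.

```java
import java.util.*;

// Illustrative sketch (not filecrush's actual code) of the crush idea:
// pick files below a size threshold and greedily pack them into groups
// no larger than a target output size (e.g. one HDFS block).
public class CrushSketch {
    static List<List<String>> plan(Map<String, Long> sizes,
                                   long threshold, long target) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentSize = 0;
        for (Map.Entry<String, Long> e : sizes.entrySet()) {
            long size = e.getValue();
            if (size >= threshold) continue;       // big enough already, leave it
            if (currentSize + size > target && !current.isEmpty()) {
                groups.add(current);               // close the full group
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(e.getKey());
            currentSize += size;
        }
        if (!current.isEmpty()) groups.add(current);
        return groups;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("a.txt", 10L);
        sizes.put("b.txt", 20L);
        sizes.put("big.txt", 500L);                // above threshold, untouched
        sizes.put("c.txt", 30L);
        System.out.println(plan(sizes, 100, 40));  // groups of small files
    }
}
```

In the real tool each resulting group is concatenated into one output file by a map reduce job; the sketch only shows the planning step.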

The file crush tool works with sequence files or text files. It can handle any sequence file regardless of its Key or Value type.


The filecrush tool is Apache v2 licensed and was open sourced by my employer, www.media6degrees.com

V2, the current version of filecrush with all the awesome new features, was the work of David Ha.

SVN: http://www.jointhegrid.com/svn/filecrush

Download Snapshot: filecrush-1.0-SNAPSHOT.jar
Download Snapshot: filecrush-2.0-SNAPSHOT.jar