This is the fourth article about Jumbo.
From the moment I started working on Jumbo, I knew I wanted two things: I wanted it to be more flexible than plain MapReduce, and I still wanted it to be easy to use.
While I didn't want to be limited to MapReduce, I also didn't want to just allow arbitrary task graphs, because I felt that was too complex for what I wanted to do. By abstracting MapReduce into a linear sequence of "stages," of which you could have an arbitrary number, I feel like I struck a good balance, allowing alternative job structures while still not getting too complicated.
Add to that things like making channel operations (such as sorting) optional, and suddenly you could use hash-table aggregation, do in one job things that would've required several jobs in MapReduce, and, thanks to multiple input stages, do things that were traditionally hard to emulate in MapReduce, such as joins.
Still, this added flexibility did come with some complexity. In Hadoop, you write a map function, a reduce function, set up a simple configuration specifying what they are, and you're done. With Jumbo, you had to put a little more thought into what kind of structure was right for your job, and creating a JobConfiguration was a bit more complex.
From the earliest versions of Jumbo Jet I wrote, the job configuration was always an XML file, and
it was always a serialization of the JobConfiguration
class. However, the structure of that format changed quite a bit. Originally, you needed to add each
task individually, which meant that if you had an input file with 1,000 blocks, you needed to add
1,000 TaskConfiguration elements to the configuration.
Not exactly scalable.
Eventually, I switched this to having a StageConfiguration
instead, and letting the JobServer
figure things out from there. But you still needed to essentially build the job structure manually.
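To make the difference concrete, here's a highly simplified sketch of the two shapes. These are not the real Jumbo classes (the actual TaskConfiguration and StageConfiguration had different members), but they show why the per-stage form scales and the per-task form doesn't:

class TaskConfiguration
{
    // Illustrative sketch only, not the real Jumbo class: one of these per
    // task meant an input file with 1,000 blocks needed 1,000 elements.
    public string TaskId { get; set; }
    public string TypeName { get; set; }
    public string InputBlock { get; set; }
}

class StageConfiguration
{
    // Illustrative sketch only, not the real Jumbo class: one of these per
    // stage; the JobServer expands it into one task per input block.
    public string StageId { get; set; }
    public string TaskTypeName { get; set; }
    public string InputPath { get; set; }
}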
At first, this still meant manually creating each stage and channel. Later, I added helpers, which
meant that creating WordCount's configuration would look something like this:
var config = new JobConfiguration(new[] { Assembly.GetExecutingAssembly() });
var input = dfsClient.NameServer.GetFileSystemEntryInfo("/input");
var firstStage = config.AddInputStage("WordCount", input, typeof(WordCountTask), typeof(LineRecordReader));
var localAggregation = config.AddPointToPointStage("LocalAggregation", firstStage, typeof(WordCountAggregationTask), ChannelType.Pipeline, null, null);
var info = new InputStageInfo(localAggregation)
{
    ChannelType = ChannelType.File,
    PartitionerType = typeof(HashPartition<Pair<Utf8String, int>>)
};
config.AddStage("WordCountAggregation", localAggregation, typeof(WordCountAggregationTask), taskCount, info, "/wcoutput", typeof(TextRecordWriter));
var job = jetClient.JobServer.CreateJob();
jetClient.RunJob(job, config, dfsClient, new[] { Assembly.GetExecutingAssembly().Location });
Maybe it's not terrible, but it's not super friendly (and this was already better than it started out). I tried to improve things by creating helpers for specific job types: a job like WordCount was an AccumulatorJob; a two-stage job with optional sorting (like MapReduce) was a BasicJob.
This worked, but kind of left you hanging if you wanted a different job structure.
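Just to give a sense of why the helpers were appealing: with one of them, WordCount could shrink to something roughly like the snippet below. The constructor and parameter names here are my reconstruction for illustration only; they are not the actual AccumulatorJob API.

// Illustration only: this signature is an assumption, not the real
// AccumulatorJob class; submitting the job then worked as shown earlier.
var job = new AccumulatorJob(
    inputPath: "/input",
    outputPath: "/wcoutput",
    taskType: typeof(WordCountTask),
    accumulatorTaskType: typeof(WordCountAggregationTask),
    recordReaderType: typeof(LineRecordReader),
    recordWriterType: typeof(TextRecordWriter));
JobConfiguration config = job.CreateConfiguration();  // hypothetical helper method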
Some things got easier quickly; instead of manually using the JetClient
class, I introduced the
concept of a job runner, a special class used by JetShell. But defining the structure of the job
stayed like this for a long time, unless you could use a helper for a predefined job structure.
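As an aside, the job runner idea looked conceptually something like the sketch below: you implement a class, and JetShell finds it in your assembly, fills in its arguments, and drives the JetClient calls for you. The base class name and members here are guesses for illustration, not the actual Jumbo Jet types.

// Conceptual sketch only; the real job runner base class and its members
// in Jumbo Jet are different from this.
public abstract class JobRunnerSketchBase
{
    // JetShell discovers a runner in the job's assembly, fills its properties
    // from the command line, and calls this to get the job's structure.
    public abstract JobConfiguration BuildJobConfiguration();
}

public class WordCountJobRunner : JobRunnerSketchBase
{
    public override JobConfiguration BuildJobConfiguration()
    {
        // The stage/channel construction shown earlier would go here; JetShell
        // then takes care of creating the job and submitting it to the JobServer.
        throw new NotImplementedException();
    }
}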
One of my first ideas for how to make creating jobs easier was to adapt LINQ (which was pretty new at the time). I even thought that something like that could have potential for publication. Unfortunately, Microsoft itself beat me to the punch by publishing a paper on DryadLINQ, so that was no longer an option.
After that, I no longer saw this as something I could publish a paper about, but just for my own gratification I still wanted a better way to create jobs.
I thought about alternative approaches. I could still do LINQ, but it would be complicated, and without a research purpose I didn't want to invest that kind of time. Hadoop had its own methods; while single jobs were easy, complex processing that required multiple jobs or joins was still hard even there, and the best solution at the time was Pig, which used a custom programming language, Pig Latin, for creating jobs. I didn't much like that either, since you'd often have to combine both Pig Latin and Java, and that didn't seem like a good option to me.
No, I wanted something that kept you in C#, and the solution I came up with was the JobBuilder, which we'll discuss next time.