Jumbo's JobBuilder: part 1

This is fifth article about Jumbo.

From the moment I started working on Jumbo, I knew I wanted two things: I wanted it to be more flexible than plain MapReduce, and I still wanted it to be easy to use.

While I didn't want to be limited to MapReduce, I also didn't just want to allow arbitrary task graphs, because I felt that was too complex for what I wanted to do. By abstracting MapReduce into a linear sequence of "stages," of which you could have an arbitrary number, I feel like I struck a good balance between allowing alternative job structures, while still not getting too complicated.

Add to that things like making channel operations (such as sorting) optional, and suddenly you could use hash-table aggregation, you could do in one job things that in MapReduce would've required several jobs, and by allowing multiple input stages, you could do things that were traditionally hard to emulate in MapReduce, such as joins.

Still, this added flexibility did come with some complexity. In Hadoop, you write a map function, a reduce function, set up a simple configuration specifying what they are, and you're done. With Jumbo, you had to put a little more thought into what kind of structure is right for your job, and creating a JobConfiguration was a bit more complex.

From the earliest versions of Jumbo Jet I wrote, the job configuration was always an XML file, and it was always a serialization of the JobConfiguration class. However, the structure of that format changed quite a bit. Originally, you needed to add each task individually, which meant that if you had an input file with a 1000 blocks, you needed to add a 1000 TaskConfiguration elements to the configuration. Not exactly scalable.

Eventually, I switched this to having a StageConfiguration instead, and letting the JobServer figure things out from there. But, you still needed to essentially build the job structure manually. At first, this still meant manually creating each stage and channel. Later, I added helpers, which meant that creating WordCount's configuration would look something like this:

var config = new JobConfiguration(new[] { Assembly.GetExecutingAssembly() });
var input = dfsClient.NameServer.GetFileSystemEntryInfo("/input");
var firstStage = config.AddInputStage("WordCount", input, typeof(WordCountTask), typeof(LineRecordReader));
var localAggregation = config.AddPointToPointStage("LocalAggregation", firstStage, typeof(WordCountAggregationTask), ChannelType.Pipeline, null, null);
var info = new InputStageInfo(localAggregation)
    ChannelType = ChannelType.File,
    PartitionerType = typeof(HashPartition<Pair<Utf8String, int>>);

config.AddStage("WordCountAggregation", localAggregation, typeof(WordCountAggregationTask), taskCount, info, "/wcoutput", typeof(TextRecordWriter));
var job = jetClient.JobServer.CreateJob();
jetClient.RunJob(job, config, dfsClient, new[] { Assembly.GetExecutingAssembly().Location });

Maybe it's not terrible, but it's not super friendly (and this was already better than it started out at). I tried to improve things by creating helpers for specific job types; a job like WordCount was an AccumulatorJob; a two-stage job with optional sorting (like MapReduce) was a BasicJob. This worked, but kind of left you hanging if you wanted a different job structure.

Some things got easier quickly; instead of manually using the JetClient class, I introduced the concept of a job runner, a special class use by JetShell. But defining the structure of the job stayed like this, unless you could use a helper for a predefined job structure, for a long time.

One of my first ideas for how to make creating jobs easier was to adapt LINQ (which was pretty new at the time). I even thought that something like that could have potential for publication. Unfortunately, Microsoft itself beat me to the punch by publishing a paper on DryadLINQ, so that was no longer an option.

After that, I no longer saw this as something I could publish a paper about, but just for my own gratification I still wanted a better way to create jobs.

I thought about alternative approaches. I could still do LINQ, but it would be complicated, and without a research purpose I didn't want to invest that kind of time. Hadoop had its own methods; while single jobs were easy, complex processing that required multiple jobs or joins was still hard even there, and the best solution at the time was Pig, which used a custom programming language, Pig Latin, for creating jobs. I didn't much like that either, since you'd often have to combine both Pig Latin and Java, and that didn't seem like a good option to me.

No, I wanted something that kept you in C#, and the solution I came up with was the JobBuilder, which we'll discuss next time.

Categories: Software, Programming
Posted on: 2023-01-26 00:13 UTC. Show comments (0)

Ookii.CommandLine 3.0

When I released Ookii.CommandLine for C++, I realized I had quite a backlog of things I wanted to update in Ookii.CommandLine, not just for the C++ version, but for the .Net version as well.

The result is the release of Ookii.CommandLine 3.0 for .Net. This is the biggest release of Ookii.CommandLine yet, with many new features, including support for an additional, more POSIX-like argument syntax, argument validation and dependencies, automatic name transformations, an updated subcommand API, usage help color output, more powerful customization, and more.

Seriously, there's a lot. Even version 2.0, which was a substantial rewrite from the original, wasn't anywhere near as big. Unfortunately, that does mean there's some breaking changes, but I expect most users won't need to make too many changes.

One question you might have is, when is all this new stuff coming to the C++ version? Unfortunately, I don't have a good answer. I'll probably add at least some of the new features to the C++ version, but it probably won't be all at once, and I'm not going to give a timeline either. If there's any feature you want in particular, you should file an issue for it.

Get it on NuGet or GitHub. You can also try it out on .Net Fiddle, or try out subcommands.

Categories: Software, Programming
Posted on: 2022-12-01 22:28 UTC. Show comments (0)

Jumbo and .Net Remoting

This is the fourth article about Jumbo.

One of the things I knew I would need for Jumbo was a way for clients and servers to communicate. Clients would have to communicate with the servers, but servers also had to communicate with each other, mostly through heartbeats.

In some cases, writing a custom TCP server and protocol made sense, like client communication with the data server where large amounts of data had to be processed. But for most things, it was just a matter of invoking some function on the target server, like when creating a file on the NameServer or submitting a job to the JobServer. I figured it would be overkill to create completely custom protocols for this. In fact, being able to easily do this was one of the reasons I picked .Net.

WCF existed at the time, but I wasn't very familiar with it, and I'm not sure if Mono supported it back then. So, I decided to use good, old-fashioned .Net Remoting. All I had to do was make some interfaces, like INameServerClientProtocol, inherit from MarshalByRefObject, set up some minimal configuration, and I was in business. This seemed to work great, both on .Net and Mono.

All was well, until I started scaling up on Linux clusters. Suddenly, I was faced with extremely long delays in remoting calls. This was particularly noticeable with heartbeats, which sometimes ended up taking tens of seconds to complete. Now, I wasn't running thousands of nodes, where scaling issues might be expected; I was running maybe forty at the most. A remoting call that takes a few milliseconds to complete shouldn't suddenly take 10+ seconds in that environment.

It took a lot of digging, but I eventually found the cause: it was a bug in Mono. Mono's remoting TCP server channel implementation used a thread pool to dispatch requests. Now, the .Net BCL has a built-in thread pool class, but remoting didn't use that. It used its own RemotingThreadPool, which was used nowhere else.

This thread pool had a fatal flaw. Once it reached a specific number of active threads, on the next request it would wait 500ms for an existing thread to become available, before creating a new one.

That's already not great. Worse, it did this synchronously on the thread that accepts new connections! Which meant that these delays would stack! If a connection attempt comes in, and one is already waiting, it can't accept this connection until the waiting one is done. And if that one then also hits the limit...

Basically, if you get 10 new connection attempts when the pool is already at the limit, the last one will end up waiting 5 seconds, instead of 500ms. This was the cause of my scalability problems, with a few dozen nodes all trying to send heartbeats to the single NameServer and JobServer.

I reported this bug, but it never got fixed. In fact, this problem is still in the code today, at least as of this writing.

I created my own patched version of Mono, which worked around the issue (I think I just removed the delay). But, needing a custom-built Mono wasn't great for the usability of the project. I eventually ended up writing my own RPC implementation, using Begin/End asynchronous methods, which performed better. Still, I refused to merge this into the main branch (trunk, since this was SVN), still waiting for a Mono fix that would never come.

Eventually, I did switch to using the custom RPC implementation permanently, because if I ever did want to release Jumbo, requiring users to patch and compile Mono wasn't really an option. And, it was probably for the better, since .Net Core no longer has remoting, so I would've needed a different solution anyway when making the port.

And yes, just like .Net Remoting, my custom RPC mechanism depends on BinaryFormatter, because it works much the same. And BinaryFormatter is deprecated and insecure. Since I have no interest in further developing Jumbo, that will remain that way, so do consider that if you run Jumbo on any of your machines.

Categories: Software, Programming
Posted on: 2022-11-24 01:11 UTC. Show comments (0)

Ookii.CommandLine for C++

Today I'm releasing a library that I've had for a long time, but never released: Ookii.CommandLine is now available for C++. It offers the same functionality as its .Net counterpart, but with an API that's suitable for C++.

I've had this for a while, actually, but it was interwoven with another project that contains a lot of pieces that aren't really meant for release. I recently updated it to match Ookii.CommandLine 2.4 functionality, and figured, why not go the extra mile and move it into its own repository.

So I did, and that repository is now available on GitHub. The library is header only, so it can be easily included in any project, as long as you use a compiler that supports C++20.

Categories: Software, Programming
Posted on: 2022-10-23 23:33 UTC. Show comments (0)

A tale of two Jumbos

This is the second article about Jumbo.

As I said previously, Jumbo originally ran on Mono and the .Net Framework. Part of my hesitation in releasing this project was because of how complicated it could be to get it running on that environment.

When .Net Core started getting more mature with the release of .Net Core 3, I was interested in learning more about it. While I don't use .Net in my day-to-day anymore, I still like it, and .Net Core's promise of cross-platform .Net would've saved me so much hassle back in the day. So, I was curious to see whether I could get Jumbo running on it, just for fun.

Porting was, fortunately, not that difficult. To make things easier on myself, I started in a clean workspace, and copied over Jumbo's components one at a time, created SDK-style projects for them, and fixing build errors until they compiled. Some changes were necessary, but it wasn't too bad.

Changes made for .Net Core

One early difference I ran into was how sockets worked. On Linux, if you listen on an IPv6 address, you automatically also listen on the corresponding IPv4 address. If you subsequently also tried to bind to the IPv4 address, it would fail. Jumbo handled this by only listening on IPv6 on Linux, but on both IPv6 and IPv4 on Windows. This worked great for Mono.

But, I guess .Net Core handled this differently, since it no longer listens on IPv4 automatically on Linux, so I had to change how Jumbo behaved. Fortunately, that only meant changing the default for the associated setting in Jumbo's configuration files.

One of the biggest issues I ran into was the lack of the AssemblyBuilder.Save method. This method, needed to save dynamically generated assemblies to disk (which Jumbo's JobBuilder depends on), doesn't exist in .Net Core (and still doesn't in .Net 6). Fortunately, I was able to find a NuGet package that provided the same functionality, Lokad.ILPack, which saved the day. Unfortunately, it had a bug that prevented it from working for my scenario, but I was able to find and contribute a fix.

Another minor issue was the lack of AppDomains. Jumbo's TaskServer normally runs tasks in an external process, but it could run tasks in an AppDomain if requested, which was used to ease debugging and testing. Since AppDomains don't exist anymore, I had to cut this feature. But, since it isn't used when running Jumbo normally, it wasn't a big loss. The biggest difference is that it does make running the unit tests a bit more fragile, since they now depend on the external TaskHost process.

Then, I discovered that .Net Core can't serialize delegates, but I was able to work around that too. With that, I finally got all the tests working. Not only that; I could run them on Linux now, which was never possible before (old Jumbo's tests depended on NUnit and only ran on Windows).

Another interesting part was the CRC library. Jumbo's DFS calculates checksums a lot, so original Jumbo used a native library, for a faster implementation than the managed version. That library was built as part of Jumbo, either with Visual C++ (on Windows), or g++ (on Linux). Integrating that native build into the .Net Core build process turned out to be a hassle.

I did some speed tests to find out if this would still matter, and tried to optimize the managed fallback algorithm a bit more. And then, I found a CRC32 library for .Net Core that was even faster than my original native implementation, so that solved that problem.

Oh, and Ookii.CommandLine 2.3's inclusion of a .Net Standard 2.0 binary? That's because I needed it for this Jumbo port! Otherwise, I'm not sure when I would've gotten around to that.

Finally, I just had to remove some of the Mono-specific parts from the code (there are probably still some leftovers), and it was good to go. Except...

DfsWeb and JetWeb

The least straight-forward part of the port was, without a doubt, the two administration websites, DfsWeb and JetWeb. These were written using ASP.NET Web Forms (with code-behind files and everything), which just wasn't supported in .Net Core. Not wanting to give up now, I persevered, and rewrote both projects to use ASP.NET Core, replacing the Web Forms with Razor pages. Fortunately, neither site was particularly big, so there were only a few pages to convert.

This also had a big advantage: to use the websites, previously you needed IIS Express (on Windows) or Mono XSP. Now, they could just run using Kestrel, the web server included with ASP.NET Core.


Jumbo was meant to run on a Linux cluster, and used a bunch of Bash scripts for deployment and running. These scripts used password-less SSH to start everything on all the nodes in the cluster, and were tailored only for the Linux environment. Running it on Windows, only used for testing purposes, was a mostly manual affair.

When I originally started planning to release Jumbo, back in 2013, I realized that maybe some people would want to be able to try out Jumbo, if only in a one-node configuration, on Windows, without having to manually launch multiple executables. I wrote some PowerShell scripts to facilitate that, but of course, those were for Windows only, so now there were two sets of scripts for basically the same thing.

With the port to .Net Core, there was something else interesting I could use: PowerShell Core! Now, I could adapt the PowerShell scripts to work on both Windows and Linux, and get rid of the old Bash scripts, rather than having to maintain two separate, functionally identical sets of scripts. PowerShell Core even supported remote sessions over SSH now, so could be used to launch Jumbo on multiple nodes in a Linux cluster.

Come to think of it, there's really nothing to stop you from running a hybrid cluster with both Windows and Linux nodes. I've never tried doing that. It could be interesting; it would probably be somewhat difficult to get the configuration right, since e.g. the scripts assume Jumbo is in the same location on every node, and the Windows paths need a drive letter. But it's probably possible to hack something together.

Anyway, that's the story of how Jumbo was brought into the modern age, and is no longer a pain to run on either Windows or Linux. I eventually brought it forward to .Net 6 as well (which was very straight-forward at that point), which is what's available now.

Check out both Jumbo's .Net Core port and the original .Net/Mono version on GitHub, and try to spot the differences!

Categories: Software, Programming
Posted on: 2022-10-17 22:54 UTC. Show comments (0)

Latest posts




RSS Subscribe