This is the third article about Jumbo.
One of the things I knew I would need for Jumbo was a way for clients and servers to communicate. Clients would have to communicate with the servers, but servers also had to communicate with each other, mostly through heartbeats.
In some cases, writing a custom TCP server and protocol made sense, like client communication with the data server where large amounts of data had to be processed. But for most things, it was just a matter of invoking some function on the target server, like when creating a file on the NameServer or submitting a job to the JobServer. I figured it would be overkill to create completely custom protocols for this. In fact, being able to easily do this was one of the reasons I picked .Net.
WCF existed at the time, but I wasn't very familiar with it, and I'm not sure if Mono supported it
back then. So, I decided to use good, old-fashioned .Net Remoting. All I had to do was make some interfaces,
like INameServerClientProtocol
,
inherit from MarshalByRefObject
, set up some minimal configuration, and I was in business. This
seemed to work great, both on .Net and Mono.
All was well, until I started scaling up on Linux clusters. Suddenly, I was faced with extremely long delays in remoting calls. This was particularly noticeable with heartbeats, which sometimes ended up taking tens of seconds to complete. Now, I wasn't running thousands of nodes, where scaling issues might be expected; I was running maybe forty at the most. A remoting call that takes a few milliseconds to complete shouldn't suddenly take 10+ seconds in that environment.
It took a lot of digging, but I eventually found the cause: it was a bug in Mono. Mono's remoting TCP server
channel implementation used a thread pool to dispatch requests. Now, the .Net BCL has a built-in thread pool class,
but remoting didn't use that. It used its own RemotingThreadPool
, which was used nowhere else.
This thread pool had a fatal flaw. Once it reached a specific number of active threads, on the next request it would wait 500ms for an existing thread to become available, before creating a new one.
That's already not great. Worse, it did this synchronously on the thread that accepts new connections! Which meant that these delays would stack! If a connection attempt comes in, and one is already waiting, it can't accept this connection until the waiting one is done. And if that one then also hits the limit...
Basically, if you get 10 new connection attempts when the pool is already at the limit, the last one will end up waiting 5 seconds, instead of 500ms. This was the cause of my scalability problems, with a few dozen nodes all trying to send heartbeats to the single NameServer and JobServer.
I reported this bug, but it never got fixed. In fact, this problem is still in the code today, at least as of this writing.
I created my own patched version of Mono, which worked around the issue (I think I just removed the delay). But, needing a custom-built Mono wasn't great for the usability of the project. I eventually ended up writing my own RPC implementation, using Begin/End asynchronous methods, which performed better. Still, I refused to merge this into the main branch (trunk, since this was SVN), still waiting for a Mono fix that would never come.
Eventually, I did switch to using the custom RPC implementation permanently, because if I ever did want to release Jumbo, requiring users to patch and compile Mono wasn't really an option. And, it was probably for the better, since .Net Core no longer has remoting, so I would've needed a different solution anyway when making the port.
And yes, just like .Net Remoting, my custom RPC mechanism depends on BinaryFormatter, because it works much the same. And BinaryFormatter is deprecated and insecure. Since I have no interest in further developing Jumbo, that will remain that way, so do consider that if you run Jumbo on any of your machines.
Today I'm releasing a library that I've had for a long time, but never released: Ookii.CommandLine is now available for C++. It offers the same functionality as its .Net counterpart, but with an API that's suitable for C++.
I've had this for a while, actually, but it was interwoven with another project that contains a lot of pieces that aren't really meant for release. I recently updated it to match Ookii.CommandLine 2.4 functionality, and figured, why not go the extra mile and move it into its own repository.
So I did, and that repository is now available on GitHub. The library is header only, so it can be easily included in any project, as long as you use a compiler that supports C++20.
This is the second article about Jumbo.
As I said previously, Jumbo originally ran on Mono and the .Net Framework. Part of my hesitation in releasing this project was because of how complicated it could be to get it running on that environment.
When .Net Core started getting more mature with the release of .Net Core 3, I was interested in learning more about it. While I don't use .Net in my day-to-day anymore, I still like it, and .Net Core's promise of cross-platform .Net would've saved me so much hassle back in the day. So, I was curious to see whether I could get Jumbo running on it, just for fun.
Porting was, fortunately, not that difficult. To make things easier on myself, I started in a clean workspace, and copied over Jumbo's components one at a time, created SDK-style projects for them, and fixing build errors until they compiled. Some changes were necessary, but it wasn't too bad.
One early difference I ran into was how sockets worked. On Linux, if you listen on an IPv6 address, you automatically also listen on the corresponding IPv4 address. If you subsequently also tried to bind to the IPv4 address, it would fail. Jumbo handled this by only listening on IPv6 on Linux, but on both IPv6 and IPv4 on Windows. This worked great for Mono.
But, I guess .Net Core handled this differently, since it no longer listens on IPv4 automatically on Linux, so I had to change how Jumbo behaved. Fortunately, that only meant changing the default for the associated setting in Jumbo's configuration files.
One of the biggest issues I ran into was the lack of the AssemblyBuilder.Save
method. This method,
needed to save dynamically generated assemblies to disk (which Jumbo's JobBuilder
depends on),
doesn't exist in .Net Core (and still doesn't in .Net 6). Fortunately, I was able to find a NuGet
package that provided the same functionality, Lokad.ILPack,
which saved the day. Unfortunately, it had a bug
that prevented it from working for my scenario, but I was able to find and contribute a fix.
Another minor issue was the lack of AppDomains. Jumbo's TaskServer normally runs tasks in an external process, but it could run tasks in an AppDomain if requested, which was used to ease debugging and testing. Since AppDomains don't exist anymore, I had to cut this feature. But, since it isn't used when running Jumbo normally, it wasn't a big loss. The biggest difference is that it does make running the unit tests a bit more fragile, since they now depend on the external TaskHost process.
Then, I discovered that .Net Core can't serialize delegates, but I was able to work around that too. With that, I finally got all the tests working. Not only that; I could run them on Linux now, which was never possible before (old Jumbo's tests depended on NUnit and only ran on Windows).
Another interesting part was the CRC library. Jumbo's DFS calculates checksums a lot, so original Jumbo used a native library, for a faster implementation than the managed version. That library was built as part of Jumbo, either with Visual C++ (on Windows), or g++ (on Linux). Integrating that native build into the .Net Core build process turned out to be a hassle.
I did some speed tests to find out if this would still matter, and tried to optimize the managed fallback algorithm a bit more. And then, I found a CRC32 library for .Net Core that was even faster than my original native implementation, so that solved that problem.
Oh, and Ookii.CommandLine 2.3's inclusion of a .Net Standard 2.0 binary? That's because I needed it for this Jumbo port! Otherwise, I'm not sure when I would've gotten around to that.
Finally, I just had to remove some of the Mono-specific parts from the code (there are probably still some leftovers), and it was good to go. Except...
The least straight-forward part of the port was, without a doubt, the two administration websites, DfsWeb and JetWeb. These were written using ASP.NET Web Forms (with code-behind files and everything), which just wasn't supported in .Net Core. Not wanting to give up now, I persevered, and rewrote both projects to use ASP.NET Core, replacing the Web Forms with Razor pages. Fortunately, neither site was particularly big, so there were only a few pages to convert.
This also had a big advantage: to use the websites, previously you needed IIS Express (on Windows) or Mono XSP. Now, they could just run using Kestrel, the web server included with ASP.NET Core.
Jumbo was meant to run on a Linux cluster, and used a bunch of Bash scripts for deployment and running. These scripts used password-less SSH to start everything on all the nodes in the cluster, and were tailored only for the Linux environment. Running it on Windows, only used for testing purposes, was a mostly manual affair.
When I originally started planning to release Jumbo, back in 2013, I realized that maybe some people would want to be able to try out Jumbo, if only in a one-node configuration, on Windows, without having to manually launch multiple executables. I wrote some PowerShell scripts to facilitate that, but of course, those were for Windows only, so now there were two sets of scripts for basically the same thing.
With the port to .Net Core, there was something else interesting I could use: PowerShell Core! Now, I could adapt the PowerShell scripts to work on both Windows and Linux, and get rid of the old Bash scripts, rather than having to maintain two separate, functionally identical sets of scripts. PowerShell Core even supported remote sessions over SSH now, so could be used to launch Jumbo on multiple nodes in a Linux cluster.
Come to think of it, there's really nothing to stop you from running a hybrid cluster with both Windows and Linux nodes. I've never tried doing that. It could be interesting; it would probably be somewhat difficult to get the configuration right, since e.g. the scripts assume Jumbo is in the same location on every node, and the Windows paths need a drive letter. But it's probably possible to hack something together.
Anyway, that's the story of how Jumbo was brought into the modern age, and is no longer a pain to run on either Windows or Linux. I eventually brought it forward to .Net 6 as well (which was very straight-forward at that point), which is what's available now.
Check out both Jumbo's .Net Core port and the original .Net/Mono version on GitHub, and try to spot the differences!
This is the first in a series of articles about Jumbo.
The recently released Jumbo comes in two versions: a modern version, and the original. That original version was written in C#, primarily targeting Mono, supporting the official Microsoft .Net Framework mainly for development and testing.
Because it was dependent on rather old versions of Mono (I never tested it on newer versions), and support for running on Windows was limited, it became harder to run it over time. Which is partially why I eventually decided to port it to .Net Core.
But, why is Jumbo written in .Net anyway? Hadoop is written in Java, so if I wanted to learn more about Hadoop, wouldn't it make sense for me to use Java too? Perhaps, but I wasn't super familiar with Java and its tooling, and I also felt it would steer me too much to be exactly like Hadoop, which I didn't want either.
So, I felt I had basically two choices given my skills at the time: C++ or .Net, both of which I was very proficient in.
C++ would've given me performance. But, it also would've increased the amount of work. I knew there were a bunch of things I would need to solve that aren't straight-forward with C++: reading configuration, RPC across a network, dynamically loading code to run jobs, text processing, robust networking, and more. All of those would require me to either find libraries, or roll my own. Possible, but time consuming.
Also keep in mind that this was 2008. C++11 was still called C++0x and not finished yet, with basically no compiler support. A lot of the niceties of modern C++ weren't there yet. And the ecosystem for finding, building, and using libraries in C++ wasn't exactly friendly either, especially if you wanted to work cross-platform.
And I did want to be cross-platform: I had to run this thing on university-owned clusters, which ran Linux, but I was doing my development on a Windows machine. And this was long before WSL made such a thing easy to do; my best options at the time were VMs or Cygwin. The idea of using Visual Studio for Linux development would've sounded ludicrous at the time, and VSCode wasn't even a flicker in anyone's imagination yet.
.Net would be slower, but, I reasoned, probably not any worse than Java. And running it on Linux would be possible with Mono, I believed. It would give me most of what I needed for free (like remoting and reflection), and I was kind of interested to see how some of .Net's features over Java (such as real generics) would alter the design.
My next choice would've been to use Rust, but unfortunately that hadn't been invented yet.
Yeah, I guess part of the point of this post is that this whole endeavour would've been much easier with modern tools. WSL, .Net Core, VSCode, maybe even Rust... it would've been interesting, to be sure. I also used Subversion for source control; git was around, but not so prevalent yet. Only much later did I move the repository over to git.
Anyway, given my choices at the time, I picked .Net. I think it was a good choice, as it allowed me to develop something working reasonably quickly. It also brought its challenges, in particular when dealing with Mono.
At first, I just started in Visual Studio and tried to get some basic DFS parts running before worrying too much about Linux. When I did try to run in on Linux with Mono, it worked surprisingly well.
Building it on Mono was another matter, however. I'm not 100% sure, but I seem to recall that originally, the Mono C# compiler was lacking some features that prevented it. Even after it became possible, I had to keep a separate build system; the Visual Studio (MSBuild) project files were used on Windows, and for Mono I settled on NAnt. I had to keep the two in sync manually. Before that, I would just build on Windows, and run those binaries on Linux.
Mono also limited me in what features I could use. .Net Framework 4.0 was in beta at the time, but I couldn't adopt any fancy new features (like LINQ) until those were supported on Mono, which took a while. Until then, I was basically stuck with mostly .Net Framework 2.0, as most of the 3.x features weren't super relevant for what I was doing.
One of the biggest hurdles with Mono turned out to be Garbage Collection, as Mono's GC at the time was much less advanced than the .Net Framework (and presumably Java's) one. Mono's GC was non-generational, and used a stop-the-world approach (all threads are paused during collection). It did eventually get a better GC, called sgen, but it wasn't stable in time for me to benefit. Meanwhile, the old GC was often taking more than 10% of task execution times, which necessitated the development of record reuse.
Another big challenge was some problems with .Net Remoting, which I'll cover separately in a future post.
Still, Mono was what made this whole thing possible. Without it, I wouldn't have been able to use .Net at all, and any other choice for me at the time would've made development more complicated, slower, and probably less fun. Big thanks to Miguel de Icaza and others who contributed to Mono.
Today, I'm releasing something that I've wanted to release for a very long time. It's a project that I worked on during my Ph.D., and while I don't think it'll be terribly useful to anyone, a lot of work went into it that I want to preserve, even if just for myself.
That project is Jumbo, and it's now availabe on GitHub in two flavors: Jumbo for .Net 6+, and the original for .Net Framework and Mono. If you want to play around with it or learn more about it, you probably want the former.
Jumbo is an experimental large-scale distributed data processing system, inspired by MapReduce and in particular Hadoop 1.0. Jumbo was created as a way for me to learn about these systems, and should be treated as such. It's not production quality code, and you probably shouldn't entrust important data to it.
Basically, back when I was getting started with my Ph.D. in 2008, I found myself staring at the code of Hadoop (which wasn't even at version 1.0 yet at the time), and finding I wasn't really getting a good feel of how the whole thing fit together, and what really goes into designing a system like that.
So, some people at my lab suggested I should try building something for myself, which I did. I built, from the ground up, a distributed file system and data processing system, which is Jumbo. It was heavily inspired by Hadoop, and definitely borrows from its design (although no actual code was borrowed). In some aspects, I deviate from Hadoop quite a lot (especially since Jumbo isn't constrained to only using MapReduce).
Building Jumbo taught me a lot: about software design, about distributed processing, about decisions that affect scalability, and more. It's my hope that maybe, someone else interested in these topics might want to look at it and find what I did interesting. If nothing else, I just want to preserve this massive project that I did (still the biggest project I've done where I'm the sole contributor), and have its history available.
I did end up using Jumbo for some research efforts, which you can read about in a few papers as well as my dissertation under the University section of my site.
Jumbo is also the origin of one of my most widely used libraries, Ookii.CommandLine, so it's significant in that respect as well.
Like I said, I've wanted to release Jumbo for a long time. If you look through the original project's commit history you can see a bunch of work done in early 2013 (as I was nearing the end of my Ph.D.) like cleaning stuff up and adding documentation, but I never quite reached a level where I was comfortable doing so. The project, which primarily targeted Mono to run on Linux, wasn't that easy to set up and run.
In 2019, I ported the project to .Net Core, just to see if I could. That version was easier to play around with, and I wanted to release it then too, but I never quite got around to finishing it, until now.
So now, you can look at Jumbo and play around with it on .Net 6+, thanks to this new version. I've also expanded the documentation significantly, so it should be easy to get started and to learn more about how it works. The original Jumbo project for Mono and .Net Framework is only provided to preserve the original history of the project (the new repository only contains the history of the port). You probably shouldn't try and run it (though I obviously can't stop you).
If you want to comment on Jumbo or ask any questions, please use the discussions page on GitHub.