Jumbo and .Net Remoting

This is the third article about Jumbo.

One of the things I knew I would need for Jumbo was a way for clients and servers to communicate. Clients would have to communicate with the servers, but servers also had to communicate with each other, mostly through heartbeats.

In some cases, writing a custom TCP server and protocol made sense, like client communication with the data server where large amounts of data had to be processed. But for most things, it was just a matter of invoking some function on the target server, like when creating a file on the NameServer or submitting a job to the JobServer. I figured it would be overkill to create completely custom protocols for this. In fact, being able to easily do this was one of the reasons I picked .Net.

WCF existed at the time, but I wasn't very familiar with it, and I'm not sure if Mono supported it back then. So, I decided to use good, old-fashioned .Net Remoting. All I had to do was define some interfaces, like INameServerClientProtocol, make the classes implementing them inherit from MarshalByRefObject, set up some minimal configuration, and I was in business. This seemed to work great, both on .Net and Mono.

All was well, until I started scaling up on Linux clusters. Suddenly, I was faced with extremely long delays in remoting calls. This was particularly noticeable with heartbeats, which sometimes ended up taking tens of seconds to complete. Now, I wasn't running thousands of nodes, where scaling issues might be expected; I was running maybe forty at the most. A remoting call that takes a few milliseconds to complete shouldn't suddenly take 10+ seconds in that environment.

It took a lot of digging, but I eventually found the cause: it was a bug in Mono. Mono's remoting TCP server channel implementation used a thread pool to dispatch requests. Now, the .Net BCL has a built-in thread pool class, but remoting didn't use that. It used its own RemotingThreadPool, which was used nowhere else.

This thread pool had a fatal flaw. Once it reached a specific number of active threads, on the next request it would wait 500ms for an existing thread to become available, before creating a new one.

That's already not great. Worse, it did this synchronously on the thread that accepts new connections, which meant that these delays would stack! If a connection attempt comes in while an earlier one is still waiting, the server can't accept the new connection until the waiting one is done. And if that one then also hits the limit...

Basically, if you get 10 new connection attempts when the pool is already at the limit, the last one will end up waiting 5 seconds, instead of 500ms. This was the cause of my scalability problems, with a few dozen nodes all trying to send heartbeats to the single NameServer and JobServer.
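
To make that arithmetic concrete, here is a small Python model of the accept loop. It is only a sketch, not Mono's actual code: the function name dispatch_delays and its parameters are made up for illustration, and it assumes the pool is already at its limit and that no worker thread frees up during the wait, so every accept blocks for the full 500ms.

```python
def dispatch_delays(arrivals, wait=0.5):
    """Return how long each connection waits between arriving and being dispatched.

    arrivals: arrival times in seconds, in the order the connections are accepted.
    wait:     how long the accept thread blocks per connection (hypothetical
              parameter; 0.5 models the 500ms wait described above).

    Assumption: the pool is at its thread limit and stays there, so the
    accept thread always waits the full `wait` before creating a new thread.
    """
    delays = []
    acceptor_free_at = 0.0  # time at which the accept thread can take the next connection
    for arrival in arrivals:
        # The accept thread handles one connection at a time, so a new
        # connection cannot even start waiting until the previous one is done.
        start = max(arrival, acceptor_free_at)
        dispatched = start + wait
        acceptor_free_at = dispatched
        delays.append(dispatched - arrival)
    return delays


# Ten connection attempts arriving at the same moment.
burst = [0.0] * 10
print(dispatch_delays(burst)[-1])            # 5.0
# The same burst with the wait removed.
print(dispatch_delays(burst, wait=0.0)[-1])  # 0.0
```

Under the same assumptions, a burst of forty simultaneous heartbeats would leave the last one waiting 20 seconds, which is in the same range as the delays I was seeing.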

I reported this bug, but it never got fixed. In fact, this problem is still in the code today, at least as of this writing.

I created my own patched version of Mono, which worked around the issue (I think I just removed the delay). But, needing a custom-built Mono wasn't great for the usability of the project. I eventually ended up writing my own RPC implementation, using Begin/End asynchronous methods, which performed better. Even so, I refused to merge this into the main branch (trunk, since this was SVN), because I was still waiting for a Mono fix that would never come.

Eventually, I did switch to using the custom RPC implementation permanently, because if I ever did want to release Jumbo, requiring users to patch and compile Mono wasn't really an option. And, it was probably for the better, since .Net Core no longer has remoting, so I would've needed a different solution anyway when making the port.

And yes, just like .Net Remoting, my custom RPC mechanism depends on BinaryFormatter, because it works much the same. And BinaryFormatter is deprecated and insecure. Since I have no interest in further developing Jumbo, it will remain that way, so do consider that if you run Jumbo on any of your machines.

Categories: Software, Programming
Posted on: 2022-11-24 01:11 UTC.

