Wednesday, September 14, 2016

How Open Source Devolves

You know what I'm talking about. Why are you forced to use the build flag "utf8strings" to generate correct Python Thrift code? Why is the default behavior of MySQL to truncate data (among a million other things)? Why, over time, do so many projects/libraries/services become abstruse and require a wealth of knowledge to use correctly? Why do so many open-source things come with absolutely bonkers default behavior?

Let me show you an example.

https://issues.apache.org/jira/browse/THRIFT-395

Let's systematically break this down. At the time, the Thrift compiler's Python output was completely unaware of unicode strings. It was essentially broken, especially when talking to other thrift code. Thrift contains two string-like types: string and binary. Binary is for raw bytes, while string is for utf8-encoded strings. Python at the time wasn't correctly encoding unicode strings as utf8, so it was broken. Essentially every other thrift target language was doing the right thing.
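
To make the string/binary distinction concrete, here is a minimal sketch in plain Python 3. It is not Thrift's generated code; encode_thrift_string and decode_thrift_string are hypothetical helpers that only illustrate what "string means utf8 on the wire" implies.

```python
def encode_thrift_string(value):
    # Hypothetical helper: a Thrift `string` field is text, so it is
    # turned into UTF-8 bytes before it goes on the wire.
    return value.encode("utf-8")


def decode_thrift_string(raw):
    # The reader reverses the step: UTF-8 bytes back into text.
    return raw.decode("utf-8")


name = "caf\u00e9"                 # four characters of text
wire = encode_thrift_string(name)  # five bytes: the accented e takes two

# A `binary` field, by contrast, is passed through untouched as raw bytes.
blob = b"\x00\xff\x10"
```

Any peer that decodes the string field as utf8 gets the same text back, which is the interoperability the other target languages already had.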

Now if you notice in that thread, a tortured programmer soul was disturbed by this change, because it would break his existing code. This argument is the cancer of the open-source world: if the world is broken, it must remain broken because fixing it will break my thing.

But this isn't true. This person's code would only be broken if 1) this code change landed, 2) they upgraded their thrift libraries to the new version containing the change, and 3) they refused to go through their broken codebase and change "string" to "binary". This person is willing to upgrade versions of thrift, but unwilling to run sed. Maybe they're on a Mac, and BSD sed can be tricky? I don't know. But this person could also just NOT upgrade thrift, and everything they've written will continue to work. Or they could both upgrade thrift and use some sed.

Yet, because of this one person, the ENTIRE world gets to add "utf8strings" to their python thrift builds.

Look, this is like if Ford made a truck and accidentally forgot one of the wheels. Then one person figures out how to load the truck bed so that it drives (albeit shakily) on 3 wheels. Then Ford issues a recall and this person protests, so they cancel THE ENTIRE RECALL and EVERY truck continues to be shipped with 3 wheels. The 4th wheel is included in the truck bed when you drive it off the lot, in case you want a truck with 4 wheels instead of 3.

And if you go looking, you will find exactly this, over and over and over. This is literally how open source development works. You can't fix the world, you have to keep it broken.

This is how open source sucks.

Don't even get me started on committee governance models. Let's go ahead and dilute any individual expertise on the committee by giving everyone an equal vote.

Friday, July 22, 2016

Python, the web, and snake oil - part 3

Here we are, over a year since I ranted about the goofiness of (most of) the python web ecosystem and later put together some coherent thoughts. So how did it go? Where have I ended up since? In a word...

Twisted

It's old; it has funny-looking style; it's not the new, cool whizzbang fresh off of that tech-news-source-that-shall-not-be-named.

But it is absolutely fantastic and you should use it.

There are few software projects in the world that will, given some time, practically bring you to tears of joy. The API is divine. It's been running in production environments for over 15 years. You can imagine the rock-solid stability of a library that began development before the current generation of python programmers learned how to use a toilet. Oh, and that's why it "looks funny"; twisted style was very carefully designed to be consistent and informative, before the python world even proposed pep8. Think about that: Twisted predates pep8.

Every single long-running Python application at Oscar speaks to the world using Twisted. This has expanded beyond just web applications to services. Over the past year, Twisted has become the substrate for anything written in Python.

Using Twisted with Blocking Code

While unsettling to some diehard Twisted users, we tend to hide the fact that our infrastructure is running with Twisted by extensive use of deferToThread. Twisted's WSGI container already does this, and I do so in our RPC infrastructure as well. This is totally ok, and still provides some benefits of an async networking stack while providing compatibility with more general, blocking code.

Since we perform all IO via Twisted, and defer to a threadpool to do work, we immediately gain the ability to concurrently hold thousands of mostly idle connections. This allows connections (and their associated handshakes, e.g. SASL) to remain open beyond a single request/response. The benefit of reusing an authenticated TCP channel is significant. Some refer to this kind of architecture as "half-sync", where IO is done asynchronously and work is done synchronously in a thread pool. In addition, many workloads may currently be better suited to threading (contrary to popular belief, most RDBMS access is CPU bound, not IO bound).

Growing with Twisted

As time goes on, we have found ourselves relying more and more on the Twisted stack. LoopingCall has started to spread through the codebase (even around mostly blocking code as mentioned above). On one occasion, to debug a particularly nasty bug, I simply added a "manhole"--the ability to ssh into a running process and drop into a REPL. Usage of Twisted endpoints allows a service to be brought up listening in a variety of ways simply by configuration (from a TCP port to a unix domain socket to an inherited file descriptor, TLS or plain, etc).

With services, we have written our own protocol and transport stack for Thrift, which provides us with the same half-sync characteristics as our web containers.

At the same time, we use Twisted in a fully asynchronous manner where we can. Twisted itself provides the building blocks to talk to just about anything on the internet, and third-party projects built on Twisted provide the rest. For example, the treq project offers an API modeled on the popular requests package, built on Twisted's HTTP client.

Interpreter Environment

As mentioned previously (in parts 1 and 2), I was searching for a sane interpreter environment where development and production would be as close as possible. Every application and service is built into a pex using pants and is simply started with command-line/environment/configuration flags (using our published oscar.flag package). The process is the same in both development and production, and our python applications are just that - python applications. Since then we've had absolutely no errors due to differences in interpreter environment. This shouldn't be something to write home about, but in the current state of Python web deployment, it unfortunately is. Twisted is fully available as a set of python modules, and it will offer no surprises in your interpreter environment.