Today’s interview is with Lobsters user Sean T. Allen. You can follow him on Twitter at twitter.com/SeanTAllen.
What do you do for work? How long have you been at it?
I’m currently working as Principal Architect at TheLadders. I lead the platform team, which is focused on moving us from being a job search to a mobile career network. Our belief is that we should be able to use machine learning and matching to get better results to job seekers than can be done using traditional job board search mechanisms. As part of that, we’ve spent the last year slowly building out our infrastructure, allowing us to do more sophisticated matching. We’re getting ready to launch a new infrastructure soon that should allow us far more algorithmic flexibility and, over the course of the year, result in a vastly improved experience for our customers as their matches become more and more targeted.
What software are you most often using?
TheLadders historically was primarily a Java shop, but for the work we do on the platform team, more functional languages such as Scala and Clojure suit us better. Our one real constraint at this point is the JVM; we looked quite seriously at introducing Erlang into our stack. We learned a lot from the experience and could see that we would gain a great deal. However, the operational costs [of Erlang] simply outweighed those gains. We have extensive experience with the JVM and a large amount of our monitoring and other tooling is locked into the JVM. We rely extensively on a number of 3rd party libraries, including Hystrix and metrics, that would be hard to give up without a massive gain, and we just wouldn’t have that gain right now with Erlang.
On a personal level, I contributed to Redline Smalltalk - a Smalltalk implementation that runs on the JVM - for a while. And I’m currently rediscovering my love for Dylan via Open Dylan and also investigating Idris. I’ve come to believe over my time programming that you should never create invalid objects, and that you need to model state transitions explicitly. Dependent typing looks like a really interesting tool for doing that. I’ve had little time to play with either right now though as I’m a co-author of a forthcoming book from Manning, called Storm Applied. Pretty much all my free/project time is spent writing the book.
I spend a large amount of time using Storm at work. We leverage it heavily along with RabbitMQ, so I have a couple of nice large clusters to play around with for both work and book ideas. It’s been an interesting cross pollination: projects at work have led to ideas for the book and things we had to learn about for the book have in turn been leveraged for projects at TheLadders.
What is your work/computing environment like?
Here are a few shots of my work environment. We try to keep things light and have fun; they probably say all that needs to be said on this one:
And a couple “work” videos that I thoroughly enjoy:
I love working at TheLadders. It’s a great group of people with tons of engineering/product/ux chops. If anyone ever wants to talk about moving to NYC to work with us, they should drop me a line. We’re always on the lookout for great people.
What’s an interesting project you’ve been working on recently?
Pretty much all my time right now goes towards the book or work. I was asked to do this interview about 3 weeks ago and finally got past a deadline for the book and was able to spend a little time on this. At work, the most interesting project I’ve been working on lately is our new matching infrastructure that will be detailed in an upcoming blog post (probably May or June) on our engineering blog. The tools (Riak, Storm, RabbitMQ, Scala) have been a joy to work with, but what has really made it fun is how hard it is to not end up in an inconsistent state when you are trying to build a fast, highly available, event-driven architecture. It’s been a serious brainfuck trying to keep on top of all the ways that we could screw up and end up in an inconsistent state that, when presented to a customer, would make them think we have no idea what we are doing. I’ve gravitated to distributed systems problems over the last few years precisely because they are hard and I love the challenge.
What is something new you’ve used or integrated into your work that has made a positive impact?
Storm and RabbitMQ have been a godsend to us. There’s no way we would have been able to do so many of the things that we’ve managed to accomplish as a team over the last couple of years without being able to leverage those 2 core pieces of tech. We’ve bent Storm in a variety of odd directions because, in addition to the Twitter-type workloads that people usually think about with it, it can be a great platform for “little data” projects that require fault tolerance. We have a number of Storm topologies that could easily be a single application running on a machine somewhere, but if that application crashed, we’d need something to alert us, restart it, etc. As we are already using Storm, we now put a lot of those jobs on our Storm cluster and if a worker goes down, Storm moves the work. Combine that with persistent queues in RabbitMQ and our failure scenarios have gotten much easier. The combo of the two has made a number of large new projects easy to design with good failure characteristics, and has saved us a ton of time by improving the failure characteristics of those smaller jobs. There is some wonkiness in Storm, but take that failure handling, combine it with guaranteed message processing and the ability to easily scale out jobs from small prototypes to large production systems, and you have a tool that I think a lot more people should be examining to see how they could leverage it.
What’s the most difficult bug you ever tracked down?
Last fall, for a few days, we were getting results out of our new ElasticSearch jobs index (we’re switching from Solr to ElasticSearch) that people characterized as “not quite right”. Nothing was screamingly wrong but things just didn’t feel right when looking at the results. Eventually, we found some results that just made no sense at all. That led us down a path where we discovered that we had misconfigured tokenization in ElasticSearch. Without thinking, we had copied the tokenization string from Solr to ElasticSearch, not accounting for a switch from XML to JSON. The tokenization characters contained “&amp;” in the ElasticSearch config JSON, so we were tokenizing on “&”, “a”, “m”, “p”, and “;”, which, remarkably, returns good results for the vast majority of cases and then just completely awful ones for a few. We ended up finding the problem while staring at some results that made no sense. I made the flippant and slightly trollish comment “the only way these would make any sense is if we are tokenizing on the letter p”. What do you know, turns out we were. It took about 5 weeks before we found a case of “not quite right” that couldn’t be explained by other means and finally exposed the “bug”.
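To make the failure mode above concrete, here is a minimal sketch in Python. It is a toy, not ElasticSearch's tokenizer or configuration format: the tokenize function, the sample job title, and the case-sensitive matching are all assumptions made for illustration. It assumes the string copied from the Solr XML was the entity for an ampersand, which JSON does not decode, so each of its characters becomes a split character.

```python
def tokenize(text, split_chars):
    """Toy tokenizer (hypothetical): split on any single character in split_chars."""
    tokens, current = [], []
    for ch in text:
        if ch in split_chars:
            # A split character ends the current token, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


title = "R&D program manager"

# Intended behaviour: split on whitespace and the ampersand.
intended = tokenize(title, " &")

# Misconfigured behaviour: the XML entity is copied verbatim, so the
# split set becomes ' ', '&', 'a', 'm', 'p' and ';'.
misconfigured = tokenize(title, " &amp;")

print(intended)
print(misconfigured)
```

Text that happens to avoid the letters a, m and p tokenizes identically under both settings, which is one plausible reason results looked only “not quite right” rather than obviously broken.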
Working on distributed systems, it’s quite easy to get in over your head very quickly - you can operate on data twice, create inconsistent data, or other bad things. What advice do you have for people getting started with distributed systems? What tips have helped you build more reliable, fault tolerant systems?
I think the 2nd part of that is the most important part. Most programmers I know write their code and their tests for the happy path scenario. Things will fail. Things will blow up. You need to think through how they will fail. And when you fix something, you need to think of the ramifications. If you fix an upstream bottleneck, is your downstream system going to hold up against the additional load? Does your retry logic that works in normal scenarios have degenerate cases where you can bring down your entire system?
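One common way to keep retry logic from turning into a self-inflicted outage is to bound the number of attempts and back off with jitter, so that many failing clients do not hammer a struggling dependency in lockstep. The sketch below is an illustration under assumptions, not anything described in the interview: call_with_retries, its parameters, and the flaky operation are hypothetical names.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on failure with capped exponential backoff.

    The delay before each retry is a random value between 0 and the current
    backoff ceiling ("full jitter"). After max_attempts failures the last
    exception is re-raised instead of retrying forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: unbounded retries amplify an outage
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Usage: an operation that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append)
print(result, attempts["count"], delays)
```

Passing the sleep function as a parameter is a design choice that makes the failure path itself easy to test, which is the point of the advice above.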
Read. A lot. Here are a few good things to start with:
And get out there and make mistakes. We all do. It’s going to happen. Hopefully it won’t be catastrophic. And while you are reading and making mistakes, I would strongly suggest learning Erlang. Even if you never do any serious projects with Erlang, there is a ton you can learn about building reliable software from it.
One last bit of advice: start following the members of the Basho team list on Twitter, as well as the people they are talking to.
This is the second time in as many weeks I’ve heard about someone struggling with ElasticSearch heisenbugs. Makes one cling to clustered Solr for its predictability.
Excellent interview, btw. And as a newbie lobster, I enjoy this richer experience, both in content and presentation, compared to the /other/ news site.
That wasn’t an ES bug. It was entirely us not paying attention to what we were doing when we configured tokenization.
FWIW, we’re working on a “Search QA” feature at Elasticsearch. It’s a long way from being done, but on the roadmap for sure.
The “these results don’t look quite right” problem happens a lot, and I’d say 99% of companies don’t have a systematic way to tune/test their queries. Most people just fiddle until they feel happy with some arbitrary query, without fully understanding the impact on the entire system.
Basically, you provide training data in the form of <input query, [matching docs]> tuples, and then one or more Elasticsearch queries that are evaluated against your training data.
Elasticsearch will then tell you which query gives you the best results (using standard metrics like recall, precision, MAP, ROC, etc). It’s basically just simple, applied ML, but in the context of result relevance. We’re hoping that it will help companies tune their search results more systematically, and find problems (like the tokenization problem Sean had) faster, since the QA will surface problems immediately.
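As a rough illustration of the evaluation idea described above, the sketch below scores two candidate queries against hand-labelled judgments using mean precision and recall. This is not the Elasticsearch feature or its API: the function names, the document ids, and the canned result lists standing in for real query output are all hypothetical.

```python
def precision_recall(returned, relevant):
    """Precision and recall of one result list against the judged docs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(results_by_query, judgments):
    """Mean precision and mean recall over every judged input query.

    Assumes judgments is non-empty.
    """
    pairs = [precision_recall(results_by_query.get(query, []), docs)
             for query, docs in judgments.items()]
    count = len(pairs)
    return (sum(p for p, _ in pairs) / count,
            sum(r for _, r in pairs) / count)


# Training data: input query -> docs a human judged to be matches.
judgments = {
    "java developer": ["d1", "d2", "d3"],
    "product manager": ["d4", "d5"],
}

# Canned results standing in for what each candidate query returned.
candidates = {
    "match_title": {
        "java developer": ["d1", "d2"],
        "product manager": ["d4", "d9"],
    },
    "match_all_fields": {
        "java developer": ["d1", "d2", "d3", "d7", "d8", "d9"],
        "product manager": ["d4", "d5", "d6", "d7"],
    },
}

scores = {name: evaluate(results, judgments)
          for name, results in candidates.items()}
for name, (precision, recall) in scores.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

With this made-up data the title-only query has higher mean precision (0.75 against 0.50) while the all-fields query has higher mean recall (1.00 against roughly 0.58). A tokenization mistake like the one in the interview would tend to show up as an unexplained drop in one of these numbers for specific input queries.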
Now that is a project I can’t wait to see come to fruition.