[Update 14/08/2014: If you're wondering what ever became of Tyk the ESB, it eventually got rebuilt in Golang as an API Gateway and released into the wilds yesterday.]

This is a reflection on our experiences with trying to build a micro-service architecture for Loadzen – our load testing platform. We started on this path about a year ago, when there was even less documentation about how to go about things than there is today. Please note this post isn’t a how-to, and it doesn’t assume we got anything right; it’s simply a report on our experiences and what we learnt.

When we decided to re-develop the Loadzen service, we switched modes: we moved towards an SOA-based design, influenced heavily by Erlang’s actor model and the emerging popularity of microservices, to build scalability and specificity into the heart of the webapp.

The problem with this approach was that the term “microservices” is quite poorly defined, and a service-orientated architecture is very much an enterprise-level integration model, and not necessarily an application design model.

So how did we start? Well, we became very well versed in how RabbitMQ worked – we evaluated a few technologies, but our existing experience with RMQ was good and we couldn’t fault it. For those who want to get more involved with the ins and outs of message brokers and AMQP, I can recommend RabbitMQ in Action as a great way to get acquainted with the more detailed elements of how to manage nodes.

An odd beginning

Well before we started building the new Loadzen, we were keen on building an Enterprise Integration App – something we could use ourselves to string together our own services (and at some point maybe move into the IoT). This project was pretty ambitious, and was called project Tyk.

ESB Inception

So we went off and built a pluggable integration platform, top to toe: it was completely dynamic, used connectors and components, and could be configured with JSON to easily set up protocol translators, bots, messaging patterns (fan-out, round-robin, broadcast) and custom integration components, making it easier to integrate multiple platforms into a single Enterprise Service Bus.

What we basically did was take the book Enterprise Integration Patterns and work through each use case to see if we could deploy it with Tyk. It was very ambitious, and it was almost finished.

Except that Loadzen was looking extremely dated. And when we had started building Tyk, instead of starting with a straight-to-market prototype approach, we had put in place a bunch of unit tests, integration tests and rigour to make sure it was a solid framework.

Which made us feel a little bad about Loadzen – I mean, like the forgotten big brother – the forgotten, revenue-generating big brother. It was the typical shiny toys over tried-and-tested veteran dilemma.

So we stopped. We downed tools and evaluated where we were, and our sanity.

It was good – we were in a great place. We had just built ourselves a micro-service framework without even knowing it.

You see, Tyk itself was highly modular – it was engineered to be built with itself – and the core components of Tyk basically did a few things well:

  • Handled connection, heartbeat and pass-through setup for integration with RMQ
  • Gave us a repeatable, component-based service model
  • Was designed to always work as a ‘shell’ around core components, ensuring that what we wrote was highly testable

So we thought – the best way to really test this thing is to build something with it, and Loadzen was right there.

Our approach

So how did we approach micro-services? The new Loadzen would split out from being a monolithic, Django-based application into a modular, specialised, service-based architecture.

1. The message is key!

The message was key – every service would essentially react to a payload and produce some kind of output message. This message needed to be validated before it entered the system, and to share some basic structural traits.

The message format should be independent: we wanted to be able to switch between encoding formats, so the encoding element was left to the messaging architecture. That way we could work with pure Python objects at the higher levels, while the messaging between services would transparently serialise and de-serialise data (we opted for JSON, with the possibility of switching to msgpack).
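As a rough illustration (the function names and structure here are invented for this post, not our actual code), the serialisation boundary can be as thin as a pair of functions, so that swapping JSON for msgpack would only touch one place:

```python
import json

def encode_message(payload: dict) -> bytes:
    """Serialise a plain Python dict just before it is published to the broker."""
    # Swapping to msgpack would mean changing only this function and decode_message.
    return json.dumps(payload).encode("utf-8")

def decode_message(body: bytes) -> dict:
    """Deserialise an incoming message before handing it to the service logic."""
    return json.loads(body.decode("utf-8"))
```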

Finally, since large chunks of the system interacted with the core elements of the message, using a document model where we could embed key content, validate it and re-use it across services was also important. Validation across the system was handled using JSONSchema.
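A minimal sketch of that validation step, assuming a cut-down envelope schema (the field names below are illustrative, not Loadzen’s real message format):

```python
import jsonschema

MESSAGE_SCHEMA = {
    "type": "object",
    "properties": {
        "message_id": {"type": "string"},
        "type": {"type": "string"},
        "payload": {"type": "object"},
    },
    "required": ["message_id", "type", "payload"],
}

def validate_message(message: dict) -> None:
    # Raises jsonschema.ValidationError before the message is allowed into the system.
    jsonschema.validate(instance=message, schema=MESSAGE_SCHEMA)
```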

2. Detail your service map – complexity hides in every cranny

What components did we think were core to the new solution? This was a ground-up rewrite so having a service map was really important.

We knew there would be quirks and dependencies we couldn’t foresee, so we tried to map out the data flow for each component as much as possible beforehand, so we had a clear idea of what we should build.

Service Mapping

This took the form of some lovely hand-drawn flows sketched out in a notebook, which were eventually transcribed into multiple Trello boards so we could easily manage the development process.

The key thing was to make sure we had a good overall view of how the services spoke to one another – what did this service need as input, and what did this component output? Where did it go? Which services cared about its output?

This can get complicated very quickly – although having a high-level view of your service map means that when you concentrate on how a service itself performs, you don’t need to worry about interaction effects. This made development easier, as no one had to keep the whole model in their mind.

3. Be single-minded

It was important to ensure that we didn’t add non-domain things to our services. Sometimes it felt easier to hook some functionality into one component over another – but if it affected the status of another service, surely it didn’t belong there?

This is where the power of a message broker really comes in handy. For example, Loadzen needs to know when a test stage has completed in order to run the next stage; it also needs to know when those results are available for processing and temporary storage. This splits into two service domains: the Test Manager and the Results Processor. These services both require the same information to do two completely different tasks.

It would be easiest to get the results processor to tell the test manager that a stage is done, but that really isn’t its job – the test manager should handle this independently.

So to get the test manager to listen for when a stage of a test is done, we just hook one of its inputs into the same exchange that receives the test results from our load generators – now whenever a result set comes in, our test manager is notified and can take appropriate action, since its resulting action will depend on what kind of test is being run.
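In RabbitMQ terms this is just a fan-out exchange with two bound queues. The sketch below (using pika; the exchange and queue names are made up for illustration) shows the general shape:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# One fan-out exchange that the load generators publish result sets to.
channel.exchange_declare(exchange="test_results", exchange_type="fanout")

# Each interested service binds its own queue and receives a copy of every result set.
for queue_name in ("results_processor_in", "test_manager_in"):
    channel.queue_declare(queue=queue_name, durable=True)
    channel.queue_bind(exchange="test_results", queue=queue_name)

# Publishers never need to know who is listening.
channel.basic_publish(exchange="test_results", routing_key="",
                      body=b'{"test_id": "abc", "stage": 1, "done": true}')
```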

It’s a small point, but can be easily missed – now we can use mock output from the load generator to test how the various test managers work with different types / scenarios of data. All wrapped up in a single domain.
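To make that concrete, here is a deliberately toy version of the idea – the TestManager below is a hypothetical stand-in, not our real service – showing how a handler can be driven by mock result payloads in a plain unit test, with no broker involved:

```python
class TestManager:
    """Hypothetical, stripped-down stand-in for the real test manager service."""

    def __init__(self):
        self.stages = {}

    def handle_result(self, message: dict) -> None:
        # React to a result-set message: if the stage is done, advance the test.
        if message.get("done"):
            self.stages[message["test_id"]] = message["stage"] + 1

def test_stage_completion_advances_test():
    manager = TestManager()
    manager.handle_result({"test_id": "abc", "stage": 1, "done": True})
    assert manager.stages["abc"] == 2
```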

4. Develop defensively

Because a broker-based system is, by its very nature, asynchronous – with certain components doing things in different orders – it’s important to be defensive in how you develop your interactions: ensure you are reacting to the right data, and do not make assumptions about input (this goes beyond just validation).

Some messages will appear out of order – we found ourselves making heavy use of hash tables to keep track of internal state and message order (where it was needed).
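A rough sketch of that bookkeeping, assuming (purely for illustration) that each message carries a test_id and a stage number:

```python
# Per-test hash table recording the last stage we acted on, so duplicates and
# out-of-order messages can be ignored rather than acted on.
last_seen_stage: dict[str, int] = {}

def handle_stage_message(message: dict) -> None:
    test_id, stage = message["test_id"], message["stage"]

    # Ignore anything we have already processed or that arrives out of order.
    if stage <= last_seen_stage.get(test_id, -1):
        return

    last_seen_stage[test_id] = stage
    process(message)

def process(message: dict) -> None:
    # Stand-in for the service's real domain logic.
    print("processing stage", message["stage"], "of test", message["test_id"])
```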

5. State is a necessary evil

We found that developing the services did mean keeping some kind of internal state in place so we could react to cumulative events when necessary. All of our micro-services hold this state in memory only, which meant that so long as the services kept running, everything would be fine. However, if a service restarted, it might need the cumulative data it had accrued – so some form of caching was necessary.

We use a very heavy-handed caching approach (everything is cached); this was a decision made to ensure we could offer continuous deployment without service interruption, but it did take some of the ‘lightweight’ nature out of our microservices.
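In outline the approach looks something like this (the key names and structure are assumptions for the sketch): every piece of accumulated state is mirrored into the cache, and a restarted or newly deployed node rebuilds its in-memory state before it starts consuming messages.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def save_state(service_name: str, state: dict) -> None:
    # Mirror the service's in-memory state into the cache on every change.
    r.set(f"state:{service_name}", json.dumps(state))

def load_state(service_name: str) -> dict:
    # On startup, restore whatever the previous node (or previous deploy) left behind.
    raw = r.get(f"state:{service_name}")
    return json.loads(raw) if raw else {}

state = load_state("test_manager")
```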

And what did we learn?

The key takeaways from the experience were overall positive – writing Loadzen as a microservice architecture means a few things:

  • We can develop different parts of the system and update them without downtime or massive regressions
  • Testability has gone up.
  • We are much more flexible when launching features (for example, we moved our websocket & email notification infrastructure to a dedicated event service; this used to sit within the webapp because we thought it ‘belonged there’)
  • Concentrating on bugs within a service is much easier as there is less code to review

However – there are some key downsides that are worth mentioning before anyone decides to jump head-first into microservices, and I’ll try to cover them here.

1. Complexity exists between systems

If you have independent actors that react to messages at different rates and are in some way co-dependent then their interactions are harder to test. Most functional tests will focus on the single service in question, and simulating interactions is difficult and can create complexities that need to be eradicated if you want to retain any sanity.

2. Integration is always great in theory

Again, if components are co-dependent (e.g. they both act on a similar part of a message block), then when the component that generates the message changes, there is an inter-service regression. These can be tough to track and painful to fix. We use JSONSchema to validate our messages, and advocate defensive development to ensure that you are covering error states that could interrupt services.

3. Because your comms are now over the wire, expect things to slow down

Unless your broker lives on the same machine as the rest of your applications, you will have latency issues as messages travel between machines. That means that if you are looking to deal with load, make sure that all of your components can be parallelised, and that the business-service ends (e.g. the website, the core REST API) are performant.

Hopefully, because you are using micro-services, this should be easy – but remember how state can be a real nightmare (if one service node is not aware of a previous state, its response will be inaccurate, causing duplication of events).

We cache everything in Redis, so when we parallelise a service, the round-robin nature of how queues work in RMQ means that we can run multiple service nodes and not worry about locks, while ensuring service continuity, as they essentially operate as a single brain.
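The worker side of that setup might look roughly like this (a sketch with pika; the queue name and handler are illustrative): several identical processes consume the same queue, RabbitMQ round-robins messages between them, and the shared state lives in Redis as above.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="results_processor_in", durable=True)

def on_message(ch, method, properties, body):
    # ... process the result set, reading/writing shared state in Redis ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

# Hand each worker one unacknowledged message at a time, so slow workers don't hoard the queue.
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="results_processor_in", on_message_callback=on_message)
channel.start_consuming()
```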

Conclusion

That was a really long post, congratulations for getting to the bottom!

The basic takeaway for us was that the micro-service approach was right for the application we were building, and definitely helped us improve our code and our product.

However, with the increase in moving parts within the system, locking down expected behaviour is difficult and requires some forethought, and state is always an issue that needs to be solved up front rather than later.

Good luck developing your micro-services – please remember this post is just a reflection of what we did while developing Loadzen with a microservice architecture and that as stated elsewhere, there isn’t a de-facto right or wrong approach (yet).