Infrastructure development of a product often goes undiscussed – the thing just works, or sometimes it just works faster. But there’s plenty of important work going on behind closed doors. It’s not sexy but it is important; the long-term stability of a company’s product will depend on it.
Formisimo Senior PHP engineer Doug Read and his team have done an outstanding job of turning their Minimum Viable Infrastructure into something faster and stronger (back in early 2013 Formisimo was running just one web server – a year later this had risen to over a hundred servers!). As the business grew it was important to grow its infrastructure too; it needed to be both robust and infinitely scalable. In this article Doug shares how he ensured Formisimo’s infrastructure was best-placed for long-term growth.
Formisimo is a powerful online-form analytics tool, providing advanced data and insight to help increase your bottom line. This article was first published on Formisimo’s blog.
In its minimum-viable-product phase Formisimo was one web-server. This web server handled everything except the database – it was the don. It was a monolith.
There is nothing wrong with a monolith. It’s a great way to get stuff done quickly and in one place; however, for many reasons it’s not suited to applications that grow above a certain size (size of features and/or size of load). So as Formisimo approached the end of the MVP stage I thought about the architecture of the application and what shape it needed to be to allow Formisimo to be robust and infinitely scalable. The architecture I planned was the complete opposite of a monolith – a service orientated architecture. There are many branches to this planned architecture, but it wasn’t long before a strain bore down upon the monolith, helping us decide which bit to do first.
This project ended up being called Project XFactor – I think it was because we used the word refactor a lot. It focussed on the tracking side of Formisimo; receiving and storing all the tracking data we record on behalf of our clients.
XFactor’s goal was not only to improve the architecture and future scalability but also to bring immediate performance improvements. Clients had expressed concerns that the response times in their customers’ browsers were on the slow side, and we were aware of some short outages – we needed it faster and sturdier!
XFactor had 3 main tactics:
- Remove the need to write to the database before we respond 200 OK to tracking HTTP requests. Accessing the busy database added too many milliseconds to the response.
- Extract the tracking elements of the monolith into a separate web service. Code this web service so that it can be load balanced and scaled out horizontally – this allows scaling specifically to the needs of the tracking load and also isolates it from errors that may occur in the rest of the monolith (separation of concerns).
- Serve static content from its own web service behind a CDN.
None of these ideas are crazy or weird – plenty of people before us have done just this. They are proven strategies, and they’ve proved to work for us.
Decoupling the database
For the database decoupling I wanted to use a message or job queue. I have used these before and know them to be a great way of achieving this decoupling. As we’re currently hosted on AWS, SQS was a very quick and easy option and we went for this. We’re not staying on SQS long however – update on this coming soon!
Refactoring the tracking code
The tracking service was coded to be as dumb as possible – it does very little thinking because it has one job, which it does as quickly as it can. It accepts whatever gets thrown at it, puts it in the queue and says thank you very much. I like to picture it as someone saying “thank you” a millisecond before you finish your sentence. This led to amazing speed increases in our clients’ customers’ browsers.
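To make the "accept, enqueue, say thank you" idea concrete, here is a minimal sketch in Python. It is not our production code: the in-memory `queue.Queue` is only a stand-in for the real message queue, and the names `tracking_queue` and `handle_tracking_request` are made up for illustration.

```python
import queue

# In-memory stand-in for the real message queue (an assumption for this sketch).
tracking_queue = queue.Queue()


def handle_tracking_request(raw_body):
    """Accept a tracking payload, enqueue it untouched, and acknowledge.

    No parsing, validation, or database access happens here; all of that
    is deferred to whatever consumes the queue later.
    """
    tracking_queue.put(raw_body)
    return 200, "OK"
```

Calling `handle_tracking_request(b'{"event": "focus"}')` returns `(200, "OK")` straight away and leaves the raw bytes waiting in the queue. The handler never looks inside the payload, which is what keeps the response time so low.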
We coded the tracking service to work in any number of parallel running servers to allow us to scale outwards. With this in place we had the opportunity to determine an optimum balance between the number of servers and the power of those servers. At the time of writing there are between three and seven medium EC2 instances behind an AWS load balancer.
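The reason this scales outwards is that each tracking server keeps no request state of its own, so any server can take any request. The sketch below illustrates that with a simple round-robin dispatcher in Python. The round-robin policy, the `TrackingServer` class and the `handled` counter are assumptions for the demo; a real load balancer may distribute traffic differently.

```python
import itertools
import queue


class TrackingServer:
    """A stateless handler: everything it receives goes to the shared queue."""

    def __init__(self, name, shared_queue):
        self.name = name
        self.shared_queue = shared_queue
        self.handled = 0  # kept only so the demo can show the distribution

    def handle(self, payload):
        self.shared_queue.put(payload)
        self.handled += 1
        return 200


def make_round_robin(servers):
    """Return a dispatch function that rotates through the servers in order."""
    rotation = itertools.cycle(servers)

    def dispatch(payload):
        return next(rotation).handle(payload)

    return dispatch
```

Adding capacity here just means putting another `TrackingServer` in the list; nothing else has to change, because the only shared thing is the queue.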
Service Orientated Architecture
Traffic from the cloud hits a load balancer, then a Tracking server, then a Queue, then finally a Queue-Consumer which crunches the data. You may be able to guess that this is a diagram I drew by hand; ooohhhhhhh!
Of course if we have a service populating a queue we need a service consuming the other side of it – our Qconsumers. The tracking servers have the tiniest workload per request; in comparison the Qconsumers have at least ten times more work to do per request. They have to parse and understand every bit of the tracking data to write it to the database ready for data crunching. We have between 50 and 100 servers, each running 3 parallel processes which poll the queue.
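As a rough illustration of the consuming side, here is a minimal Python sketch of parallel workers draining a queue. It simplifies a lot: threads stand in for our separate processes, a plain list stands in for the database, JSON is an assumed payload format, and `consume` and `run_consumers` are hypothetical names.

```python
import json
import queue
import threading

WORKERS_PER_SERVER = 3  # mirrors the 3 parallel processes per server


def consume(jobs, database, lock):
    """Pull raw payloads off the queue, parse them, and store the result."""
    while True:
        raw = jobs.get()
        if raw is None:  # sentinel: no more work for this worker
            return
        record = json.loads(raw)  # the heavy parsing lives here, not in the tracker
        with lock:
            database.append(record)  # stand-in for a database insert


def run_consumers(jobs, database, workers=WORKERS_PER_SERVER):
    """Start the workers, then stop each one with a sentinel once the queue drains."""
    lock = threading.Lock()
    threads = [
        threading.Thread(target=consume, args=(jobs, database, lock))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    for _ in threads:
        jobs.put(None)  # queued behind the real jobs, so they are processed first
    for thread in threads:
        thread.join()
```

The sentinel shutdown is only there so the sketch terminates; a long-running consumer would keep polling the queue instead.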
The queue-decoupling gives a buffer between the realtime demand on incoming tracking data and the processing and insertion into the database – like suspension in a car. But we try to achieve what I call (does anyone else?) “almost realtime” data insertion. This essentially means that although we have the buffer in place, which allows the queue to grow to any size as tracking data spikes, our aim is that a request gets processed as soon as it lands in the queue.
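The suspension analogy can be shown with a little arithmetic. The Python sketch below tracks how deep the queue gets when arrivals briefly exceed what the consumers can process. The numbers are invented for illustration and `simulate_queue_depth` is a hypothetical helper, not something we run in production.

```python
def simulate_queue_depth(arrivals_per_tick, capacity_per_tick):
    """Return the queue depth at the end of each tick.

    Each tick, new requests arrive and the consumers remove up to
    capacity_per_tick of whatever is waiting.
    """
    depth = 0
    history = []
    for arrivals in arrivals_per_tick:
        depth += arrivals
        depth -= min(depth, capacity_per_tick)
        history.append(depth)
    return history


# A spike of 200 requests per tick against a capacity of 100 per tick.
print(simulate_queue_depth([50, 50, 200, 200, 50, 50, 50, 50], 100))
# prints [0, 0, 100, 200, 150, 100, 50, 0]
```

While arrivals stay under capacity the depth stays at zero, which is the "almost realtime" case. During the spike the queue absorbs the excess, then drains once traffic falls back below capacity.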
Unfortunately this doesn’t directly benefit our clients yet as there is still some data crunching that has to happen before the data is readily available – this can take up to 24 hours. In the future we intend to pass the almost realtime benefits through to the reports the clients see.
Choosing a CDN
The CDN was another easy option as AWS CloudFront was right there. We created a new web service just for static content and placed CloudFront in front of it and, as you’d expect, delivered content even faster to our clients’ customers’ browsers.
That sums up XFactor but for a couple of technology changes. The first was moving from PHP to Ruby. PHP is a fine language for a web service, but when not all of your code sits behind a web server you’re much better off picking a more general-purpose language… and no, I don’t mean Rails. The second was moving from Apache to Nginx, simply because Nginx has better performance.
With this article I’ve aimed for brevity, but if you have any questions or want more details on our technical implementations then please feel free to email me or leave a comment; I love talking about this stuff!