To the Cloud! Loupe Service is Moving to Azure
Historically we’ve hosted all of our sites and applications on our own hardware in data centers where we rent space. We started with just 4U of space in Baltimore and expanded from there as we grew. Every year when we’d buy new hardware (to handle growth or to replace hardware that was aging out) we’d do the legwork to decide whether to continue as we were, make the jump to the cloud, or do something in between.
Two years ago we took on a particularly large customer - one that wanted to start with 500GB of log data per day and be able to grow up to several terabytes (which is where they are now). We knew this wasn’t going to be affordable to host on our own iron because we’d have to dramatically over-purchase capacity in case they rapidly expanded, not to mention the bandwidth into our data center. For this customer, Azure appeared to be a great fit: we could expand as needed, shrink back if the client discontinued the service without being left with piles of unused hardware, and take advantage of Azure not charging for inbound bandwidth.
We’ve operated this customer in Azure ever since and learned a lot about the pluses and minuses of cloud hosting. On the plus side, our costs are more directly coupled to our revenue and we can grow quickly to meet demand. We can afford system recovery strategies (like replicating to multiple geodiverse data centers) we couldn’t before. We can also offload more of our IT administration to keep our staff focused on our products.
The cloud hasn’t been without its problems. First, we’ve yet to see VM performance that is as good as our hardware. Perhaps we just bought nicer host machines or configured them more conservatively, but each VM configuration we observed was about half as capable as the matching config on our hardware. Performance also hasn’t been as consistent as we believe it should be. Second, Virtual Machines aren’t a perfect fit with Azure’s system management strategy - they will get restarted periodically with little or no warning and be completely unavailable for around 30 minutes. Even with clustering this is disconcerting when you start seeing machines turn off at midday. It’s particularly problematic with SQL Server since we opted not to incur the thousands of dollars of additional costs and operational complexity it takes to set up SQL Server Always On.
Planning for the Future
Our mixed-bag experience led us to cool our jets a bit on charging headlong into the cloud. We probably would have continued this way for another year or two if it weren’t for the significant, broad growth of Loupe Service in 2015. As the year went on we saw that we couldn’t just stay where we were: we needed to either purchase new networking infrastructure and a notable amount of new hardware or shift to the cloud to meet our growth.
We did a financial analysis of three approaches, looking just at the first year costs:
- Purchase new hardware to meet the projected demand for 2016 as well as additional network capacity and rack space from our data center provider
- Transition all services to Azure / Office 365 in 2016
- Transition all services to AWS / Office 365 in 2016
The three options weren’t necessarily like-for-like: with Azure and AWS we got some redundancy options we wouldn’t have in our own data center, and with Azure we spec’d out SQL Azure instead of a SQL Server VM. When comparing costs I like to use three-year total costs - that tends to even out the difference between options with big up-front costs (purchasing hardware) and services. We assumed we’d continue our roughly 30% year-on-year growth and that we wouldn’t keep any hardware longer than four years. Using this approach, the three options show a significant spread.
Not surprisingly, the cheapest option is to buy the hardware ourselves and just pay for a colocation facility. Using that as a baseline, Azure comes in at 50% higher and AWS 75% higher. The major drivers of AWS’s higher cost are the differences in bandwidth charging (we move a tremendous amount of data, inbound only) and SQL Server.
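To make the shape of that comparison concrete, here is a minimal Python sketch of a three-year total cost calculation. Everything specific in it is an assumption for illustration only: the dollar figures, the option names, and the `three_year_total` helper are made up and are not the numbers from the analysis above. It also simplifies by applying the 30% growth only to recurring costs and treating hardware as a single up-front purchase.

```python
def three_year_total(upfront, first_year_recurring, growth=0.30, years=3):
    """Up-front cost plus recurring costs that grow by `growth` each year."""
    recurring = sum(
        first_year_recurring * (1 + growth) ** year for year in range(years)
    )
    return upfront + recurring


# Hypothetical inputs (NOT the real figures): a hardware-heavy option with a
# large up-front purchase versus a pay-as-you-go option with none.
options = {
    "own_hardware": three_year_total(upfront=60_000, first_year_recurring=10_000),
    "cloud_service": three_year_total(upfront=0, first_year_recurring=37_000),
}

# Express each option relative to the cheapest one, as a fraction above baseline.
baseline = min(options.values())
relative = {name: cost / baseline - 1 for name, cost in options.items()}

for name, extra in relative.items():
    print(f"{name}: {extra:.0%} above baseline")
```

Summing over several years is what keeps a big first-year purchase from making the hardware option look worse than it is, and keeps a low first-year service bill from hiding how recurring charges compound with growth.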
So, it’s an easy call then, right? Stay where we are, since it’s the lowest cost and conceptually easiest option. Well, no.
It’s Not Always About the Money
What’s left out of a simple financial analysis are a few other factors.
First, we’re a distributed company and avoid anything that ties us to a specific geography. We have permanent staff in various parts of the US and Europe, we’ve engaged for periods of time with contractors from around the globe, and we want everyone to be on a level playing field. Initially that meant we didn’t let hardware live in the office space we got when we were just a few folks in Baltimore, so that we would force ourselves to do everything remotely. Then came the day we had a major problem and all of our Baltimore folks were unavailable. While our data center has remote hands that can help out, it’s a lot slower than being directly on site. What could have been a 10-minute glitch became a 3-hour outage. Not cool. Discussing it internally we realized that our data center really tied us to staff near it, and that’s just not what we’re up for.
Second, we’ve grown a lot in the last few years and we want to keep growing where our customers are. We push a lot of traffic into Europe and want to set up a presence there as well as in the US. It’s simpler for us to do that if we’re already running in a large world-wide network of data centers. We’re not keen to have to purchase and maintain our own hardware in another country so this is another driver towards Azure or AWS.
Looking for Blue Skies
In the final analysis, Azure comes up the clear winner. Everything we’re looking to do over the next few years is available today and at a better total cost (based on published pay-as-you-go costs) than AWS. We’re already operating in two Azure data centers and we’ve gained a fair bit of operational experience with it. We’re going all in - our goal is to shut down our data center by June of 2016. We’ve got a lot of services to move and we want it to be as seamless as possible for our customers. Our goal is that the only thing you’ll notice as we move services is that they’re better. We’re making a service-by-service plan for migration: some things will be easy (like the My.GibraltarSoftware.Com web site) and others more complicated, like our internal build & validation environments and our WWW site. We want to run as natively in Azure as we can - not just lift and shift virtual machines but use Azure Web Apps instead. This way we maximize the opportunities for improving availability and performance and reducing our IT burden. Because we want every minute of our day to benefit you, our customer. If it doesn’t, it’s waste and we want it gone.
We’ll post more as we go along - what we’re running into, how we’ve addressed it, really anything behind the scenes that might be interesting to someone else looking to move everything to the cloud!