How to Measure Cost Optimization on AWS

Customers pay more and more attention to their cloud bill and usage these days. Today one can launch thousands of instances on AWS to complete a computing job that took years in the past. But one can also launch hundreds of instances sitting in idle. Most of our customers “Lift and Shift” their IT infrastructure to AWS at beginning. They choose gradually to not only optimize the cost but also elasticity in the cloud later. They may have done right-size their EC2, purchased reserve instances or moved away some computing power to spot instances. Perhaps they have replaced existing service with AWS managed service or dedicated dev-ops effort to write infrastructure as code. However, they didn’t have a way to quantify or measure cost optimization on AWS whether the approach taken was successful or not. How to measure cost optimization is critical part in the chain.

We will talk a few key metrics to measure cost optimization effort for our customer.  We thought it’d be helpful when you decide to optimize your cloud spending, which should be pretty soon.

Cost Optimization

First thing first, we need to put together a plan for cost optimization. We have talked about various way of optimizing cost on AWS. Here are some practical cases.

Scenario 1: Customer only use EC2

It’s a phenomenon that enterprise”Lift and Shift” their IT infrastructure to AWS and getting a large sum of EC2 bill every month. When we worked for these type of customers, we usually found lots of cost optimization space for them.

  • Spot – We recommend off-load some computing to spot instances.
  • Lambda – Some schedule job can be handled by lambda.
  • Managed Services – We also find the possibility of moving services to AWS managed services.

These are pretty generic recommendation. For instances, we found lambda is especially cost effective to complete certain job if we can break them into pieces and false tolerant. It’s serverless and we don’t have to worry about the availability at all. For instance, we can replace memory cache or Redis with ElastiCache service with only a few configuration code change. We not only improve elasticity but also reduced operation cost and EC2 cost. Same for own database to RDS, CI servers to Codepipeline etc.

Scenario 2: Customer can’t effective manage multiple environment.

They have already used various service in AWS. This type of customer usually have lots of environments for each product and very hard to oversee all. We help them implement infrastructure as code to effective manage those environments such as using cloudformation to automate their entire stack on AWS, using puppet to manage OS level provision. By doing that, anybody in the team can switch on/off any environment with just a click. It’s even possible to spin up extra environment within 30 minutes for a particular sales call. It helped customer eliminate idle computing power on cloud.

Scenario 3: Customer is looking to optimize architecture

This is more custom topic for specific customer. We have talked about turning entire stack into micro service dockerize container, we have also talked about how to optimize specific AWS service like S3 storage.

Measure Cost Optimization

After all cost optimization effort, it’s important to know if it has really worked. By working with several customers, we found a few key metrics are essential for them to measure cost optimization.

Computing Hrs/Cost Composite Metrics

It is sometime rather difficult to measure cost optimization effort by cost index.  Let’s image a company with IT infrastructure growing each month, we can’t measure just by cost alone. Because finance guy will soon find out no matter how hard you work, the cost would have stayed same or outnumber last month. We found a slightly different approach to tell the story, which computing hours/cost composite metrics.

Measure Cost Optimization-Computing Unit/Cost composite index

It fits perfectly to most generic scenarios within a quickly IT expansion enterprise. This is how we read the metrics for above 3 scenarios

  1. For Scenario 1 – Even thought the cost has stayed the same or slightly up month after month, but the computing hours has increase 20-30% from cost optimization effort. We have essentially saved 20-30% cost on EC2.
  2. For Scenario 2 – Customer should see computing hours and cost drop from time to time, because they turn on environments only when they are needed.
  3. For Scenario 3 – Customer should see cost drop but computing hours almost static.

Unused Reserved Instance Metrics

In an ideal situation, customer should see their unused reserved instance go down month over month. Someone has to be notified vice versa. We suggest to revisit this metrics each month or bi-weekly.

Measure Cost Optimization-Unused Reserved instance usage trend

On-Demand Usage Trend

On-demand computing hours shouldn’t spike in an ideal world, it should be flat most of time. Same as unused RI metrics, we recommend to revisit this metrics each month or bi-weekly.

Measure Cost Optimization-Overall Utilization trend for on-demand instances in AWS account

Reserved Instance Recommendation Tool

From 2017 Mar, Linux reserved instance rule has changed again. Why do we use the word “again”… AWS is enthusiastic about getting best value to their customers. All of our customers has previously opt-in regional RI for flexibility. Now they no longer have to worry about instance type because RI will automatically be applied by family and computing unit. Long story short, if you have one m4.2xlarge instance , it can cover one m4.2xlarge instance or two m4.xlarge instance running in that region and so forth 4 m4.large … You can refer to the AWS post for more information.

The new rule gives lots of room to improve reserved instance purchase process. In the past, it’s one to one map purchase in a sense customer purchase a m4.2xlarge RI for one running m4.2xlarge instance. As we know, lots of advance AWS customer has dynamic running instances in their account each month. Sometime 2xlarge instance will be stop, and then xlarge and large instance will be launched instead. Other time a 4xlarge instance will be launched to handle intense job. The average running computing unit is probably a combination of one 2xlarge,xlarge,medium plus nano instance rather than one 4xlarge instance.

The better approach is to tally that particular instance family usage in last period(month) and figure out optimum purchase plan for current month RI purchase. Of course, it will be more bullet proof if you use Unused Reserved Instance Metrics,On-Demand Usage Trend to measure after purchasing.

Reserved Instance Purchase Plan
FrugalBot auto-generated RI purchase plan

 

Continuous Cost Optimization

We are essentially renting service from AWS, who constantly Re:Invent a new way to host IT infrastructure. This is the best part of  “pay you go” model, we don’t have to commit one architecture all the time. Sometime there are better way and technology to distribute application on the cloud. Sometime we have to terminate servers are running idle. There are so many ways to optimize your stack on cloud.

For example, we have evolved a service from static server to stateless in autoscaling group, then to container service on spot instances. Now we are working on serverless proof of concept. We enabled measurement in each optimization cycle. Frankly, this is a lot easier for our client to gain confidence and for us to conduct work. They can tell they cut 50% cost from metrics immediately.

This is why FrugalOps team has released FrugalBot version 2 to AWSMarket place. It helps you measure your cost optimization effort with above critical metrics and other more. We purpose a healthy loop for every team who would like drive cloud spending low, optimize->measure->optimize, which is our concept of Continuous Cost Optimization mentioned in the very first blog. With the help of key-metrics insight, you will be amazed how much you have done throughout journey!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.