Most of us know that Amazon Spot EC2 instances are usually good choice for Time-flexible and interruption-tolerant tasks. These instances gets traded frequently on a Spot market price and you can fix your Bid Price using AWS API’s or AWS Console. Once free Spot EC2 instances are available for your Bid Price, AWS will allot them for use in your account. Spot instances are usually available way cheaper than On-Demand EC2 instances most of the times. Example: On-Demand m1.xlarge per hour price is 0.48 USD and on spot market you can find them sometimes @ 0.052 per hour. This is ~9 times cheaper than the on-demand price; imagine if you can bid competitively and get hold of spot EC2 even around 0.24 USD most of the times, you are saving 50% from the on-demand price straight away. In Big data use cases usually you might need lots of EC2 nodes for processing, adopting such techniques can vastly make difference in your infra cost and operations in long term. I am sharing my experience on this subject as tips and techniques you can adopt to save costs while using EMR clusters in Amazon for big data problems.
Note : While dealing with spot you can be sure that you will never pay more than your maximum bid price per hour.
Tip 1: Make right choice (Spot vs On-Demand) for the cluster components
Data Critical workloads: For workloads which cannot afford to lose data you can have the Master + Core on Amazon On-Demand EC2 and your task nodes on Spot EC2. This is the most common pattern while combining Spot and On-Demand on Amazon EMR cluster. Since task nodes are operating on spot prices depending upon your bidding strategy you can save ~50% costs from running your task nodes using On-Demand EC2. You can further save(if you are lucky) by reserving your Core and Master Nodes , but you will be tied to an AZ. According to me this is not a good or common technique, because some AZ’s can be very noisy with high spot prices.
Cost Driven workloads: When solving big data problems, sometimes you might have to face scenarios where cost is very important than time. Example: You are processing archives of old logs as low priority jobs, where cost of processing is very important and usually with abundant time left. Such cases you can have all the Master+Core+Task run on Spot EC2 to get further savings from the data critical workloads approach. Since all the nodes are operating on spot prices depending upon your bidding strategy you can save ~60% or more costs from running your nodes using On-Demand EC2. The below mentioned table published by AWS gives an indication of the Amazon EMR + Spot combinations that are widely used:
Tip 2: There is free lunch sometimes
Spot Instances can be interrupted by AWS when the spot price reaches your bidding price. What interruption means is that, AWS can pull out the Spot EC2’s assigned to your account when the price matches/exceeds. If your Spot Task Nodes are interrupted you will not be charged for any partial hour of usage by AWS i.e. if you have started the instance @ 10:05 am and if your instances are interrupted by spot price fluctuations @ 10:45 am you will not be charged for the partial hour of usage. If your processing exercise is totally time insensitive, you can keep your bidding price at closer level to spot price which are easily interrupt-able by AWS and exploit this partial hours concept. Theoretically you can get most of the processing done through your task nodes for free* exploiting this strategy.
Tip 3: Use the AZ wisely when it comes to spot
Different AZ’s inside an Amazon EC2 region has different spot prices for the same Instance type. Observe this pattern for a while, build some intelligence around the price data collected and rebuild your cluster in the AZ with lowest price. Since the Master+Core+Task need to run on the same AZ for better latency, it is advisable to architect your EMR clusters in such a way they can be switched(i.e.recreate) to different AZ’s according to spot prices. If you can build this flexibility in your architecture you can save costs by leveraging the Inter AZ price fluctuations. Refer the below images for Spot Price variations in 2 AZ’s inside the same Region for same time period. “Make your choice wisely time to time”
Tip 4: Keep your Job logic small and store intermediate outputs in S3
Breakdown your complex processing logic into small jobs and design your jobs and tasks in EMR cluster in such a way that they run for very small period of time (example few minutes). Store all the intermediate job outputs in Amazon S3. This approach is helpful in EMR world and gives you following benefits:
When your Core+ Task nodes are interrupted frequently, you can still continue from the intermediate points. Data accessed from S3.
You now have the flexibility to recreate the EMR clusters in multiple AZ depending upon the Spot price fluctuations
You can decide the number of nodes needed for your EMR cluster(even every hour) depending upon the data volume, density and velocity
All the above 3 points when implemented contribute to elasticity in your architecture and there by helps you save costs in Amazon cloud. The above recommendation is not suitable for all Jobs, it has to be carefully mapped with right use cases by the architects.