Comparison Analysis: Amazon ELB vs HAProxy EC2

In this article I have analysed Amazon Elastic Load Balancer (ELB) and HAProxy (a popular load balancer in AWS infrastructure) against the following production scenarios and fitment aspects:

Algorithms: In terms of algorithms, ELB provides Round Robin and sticky sessions, routed based on EC2 instance health status. HAProxy provides a variety of algorithms like Round Robin, Static-RR, Least Connection, Source, URI, url_param etc. For most production cases Round Robin plus sticky sessions is more than enough, but if you require an algorithm like Least Connection you currently have to lean towards HAProxy. AWS might add such algorithms to its load balancer in future.
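To make the contrast concrete, here is a toy Python sketch (an illustration only, not HAProxy internals; the backend addresses are made up) of the two selection strategies:

```python
# Toy contrast of the two selection strategies discussed above.
from itertools import cycle

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical app servers
open_conns = {b: 0 for b in backends}             # open connections per backend
rr = cycle(backends)

def pick_round_robin():
    # Round Robin: rotate through backends regardless of their current load.
    return next(rr)

def pick_least_conn():
    # Least Connection: choose the backend with the fewest open connections,
    # which behaves better when request durations vary widely.
    return min(open_conns, key=open_conns.get)
```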

Spiky or Flash Traffic: Amazon ELB is designed to handle unlimited concurrent requests per second under a “gradually increasing” load pattern; it is not designed to absorb a heavy sudden spike of load or flash traffic. For example, imagine an e-commerce website whose traffic increases gradually to thousands of concurrent requests/sec over hours, versus use cases like a mass online exam, a GILT-style flash sale or a 3-hour sales/launch campaign expecting a sudden jump to 20K+ concurrent requests/sec within a few minutes; Amazon ELB will struggle with the latter load volatility pattern. If such a sudden spike is not a frequent occurrence we can pre-warm the ELB (a request made through AWS support), else we need to look at alternative load balancers like HAProxy in AWS infrastructure. If you expect a sudden surge of traffic, you can keep X HAProxy EC2 instances provisioned and in a running state.

Gradually Increasing Traffic: Both Amazon ELB and HAProxy can handle gradually increasing traffic. But when your needs become elastic and traffic grows within a day, you either need to automate the addition of new HAProxy EC2 instances when a threshold is breached or add them manually, and when the load decreases you may need to remove HAProxy EC2 instances from the load balancing tier manually as well. Avoiding this manual effort means engineering your own automation scripts and programs. Amazon has intelligently automated this elasticity problem in its ELB tier; we just need to configure and use it, that’s all.

Protocols: Currently Amazon ELB supports only the following protocols: HTTP, HTTPS (secure HTTP), SSL (secure TCP) and TCP. ELB supports load balancing for the following TCP ports: 25, 80, 443 and 1024-65535. If RTMP or an HTTP streaming protocol is needed, we need to use the Amazon CloudFront CDN in the architecture. HAProxy supports both TCP and HTTP. When a HAProxy EC2 instance works in pure TCP mode, a full-duplex connection is established between clients and servers and no layer 7 examination is performed; this is the default mode and can be used for SSL, SSH, SMTP etc. The current 1.4 version of HAProxy does not support the HTTPS protocol natively, so you need Stunnel, Stud or Nginx in front of HAProxy to do the SSL termination. HAProxy 1.5-dev12 comes with SSL support and should become production ready soon.

Timeouts: Amazon ELB currently times out persistent socket connections at 60 seconds if they are kept idle. This is a problem for use cases that generate large files (PDFs, reports etc.) on the backend EC2 instance and keep the connection idle during the entire generation process before sending the response back. To avoid this you have to send something on the socket every 40 or so seconds to keep the connection active through Amazon ELB. In HAProxy you can simply configure very large socket timeout values to avoid this problem.
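A rough client-side sketch of that workaround, under stated assumptions (the host name is a placeholder, and what counts as harmless filler depends on your protocol):

```python
import socket
import threading
import time

def keep_alive(sock, interval=40):
    """Write filler on the socket every `interval` seconds so ELB's
    60-second idle timeout never fires while the backend is busy."""
    def _loop():
        while True:
            time.sleep(interval)
            try:
                sock.sendall(b" ")   # harmless padding; protocol-dependent
            except OSError:
                return               # connection closed, stop pinging

    threading.Thread(target=_loop, daemon=True).start()

# Hypothetical usage while a large report is generated server-side:
conn = socket.create_connection(("elb-fronted-app.example.com", 80))
keep_alive(conn)
```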

White Listing IPs: Some enterprises want to white list a third-party load balancer’s IP range in their firewalls. If the third-party service is hosted behind Amazon ELB this becomes a problem: currently Amazon ELB does not provide fixed or permanent IP addresses for the load balancing instances launched in its tier. This is a bottleneck for enterprises that are obliged to white list the load balancer IPs in external firewalls/gateways. For such use cases we can currently use HAProxy EC2 instances with Elastic IPs attached as the load balancers in AWS infrastructure and white list the Elastic IPs.
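A minimal sketch of the Elastic IP approach, using boto3 for illustration (the instance ID is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Allocate a static public IP owned by your account.
allocation = ec2.allocate_address(Domain="vpc")

# Bind it to the HAProxy EC2 instance (hypothetical instance ID).
ec2.associate_address(
    InstanceId="i-0123456789abcdef0",
    AllocationId=allocation["AllocationId"],
)

print("White list this IP:", allocation["PublicIp"])
```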

Amazon VPC / Non-VPC: Both Amazon ELB and HAProxy EC2 can work inside the VPC (Virtual Private Cloud) and non-VPC environments of AWS.

Internal Load Balancing: Both Amazon ELB and HAProxy can be used for internal load balancing inside a VPC. You might provide a service that is consumed internally by other applications and needs load balancing; both ELB and HAProxy fit here. If internal load balancing is required in an Amazon non-VPC environment, ELB is currently not capable and HAProxy can be deployed instead.

URI/URL Based Load Balancing: Amazon ELB cannot load balance based on URL patterns the way other reverse proxies can. For example, Amazon ELB cannot direct and balance requests between the URLs www.xyz.com/URL1 and www.xyz.com/URL2. Currently, for such use cases you can use HAProxy on EC2 (with its acl and use_backend rules).

Sticky Problem: This point comes as a surprise to many users of Amazon ELB. ELB behaves a little strangely when incoming traffic originates from a single or specific IP range: it does not round robin efficiently and sticks the requests to only some of the EC2 instances. Since I do not know the ELB internals, I assume ELB might be falling back to a “source”-style algorithm under such conditions (I will have to check this with the AWS team). No such cases were observed with HAProxy EC2 in AWS unless the balance algorithm is “source”. In HAProxy you can also combine “source” and Round Robin efficiently: if the HTTP request does not carry a cookie it uses the source algorithm, but if the request has a cookie HAProxy automatically shifts to Round Robin or Weighted.
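A toy Python sketch of why source hashing pins a single IP to one backend (an illustration of the general technique, not ELB’s actual internals; names are made up):

```python
backends = ["web-1", "web-2", "web-3"]          # hypothetical backend pool

def pick_by_source(client_ip):
    # Hashing the client address always yields the same backend index,
    # so one IP (or a NAT-ed office full of users) sticks to one server.
    return backends[hash(client_ip) % len(backends)]

# Every request from 203.0.113.7 lands on the same backend:
assert pick_by_source("203.0.113.7") == pick_by_source("203.0.113.7")
```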

Logging: Amazon ELB currently does not provide access to its log files for analysis; we can only monitor some essential metrics using CloudWatch. Because we have no access to the ELB logs, we cannot debug load balancing problems, analyze traffic and access patterns, or categorize bots versus visitors. This is also a bottleneck for organizations that have strong audit/compliance requirements to meet at all layers of their infrastructure. If very strict or specific log requirements exist, you may need to use HAProxy on EC2, provided it satisfies the need.

Monitoring: Amazon ELB can be monitored using Amazon CloudWatch; refer to this URL for the ELB metrics that can currently be monitored: http://harish11g.blogspot.in/2012/02/cloudwatch-elastic-load-balancing.html. CloudWatch + ELB is detailed enough for most use cases and provides a consolidated view of the entire ELB tier in the console/API. On the other hand, HAProxy provides a user interface and stats for monitoring its instances, but if you have farms (20+) of HAProxy EC2 instances it becomes complex to manage this monitoring efficiently. You can use tools like Server Density to monitor such HAProxy farms, though inside Amazon VPC deployments they have a heavy dependency on NAT instance availability.
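For example, a minimal boto3 sketch pulling one ELB metric (RequestCount) from CloudWatch; the load balancer name is a placeholder:

```python
import boto3
from datetime import datetime, timedelta

cw = boto3.client("cloudwatch", region_name="us-east-1")

stats = cw.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="RequestCount",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-elb"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,                 # 5-minute buckets
    Statistics=["Sum"],
)
for point in stats["Datapoints"]:
    print(point["Timestamp"], point["Sum"])
```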

SSL Termination and Compliance requirements:
SSL termination can be done at two levels using Amazon ELB in your application architecture:
SSL termination at the Amazon ELB tier: the connection is encrypted between the client (browser etc.) and Amazon ELB, but the connection between ELB and the web/app EC2 instances is in the clear. This configuration may not be acceptable in strictly secure environments and will not pass compliance requirements.
SSL termination at the backend with end-to-end encryption: the connection is encrypted between the client and Amazon ELB, and the connection between ELB and the web/app EC2 backend is also encrypted. This is the recommended ELB configuration for meeting compliance requirements at the LB level (a sketch follows below).
HAProxy 1.4 does not support SSL termination directly; it has to be done in a Stunnel, Stud or Nginx layer in front of HAProxy. HAProxy 1.5-dev12 comes with SSL support and should become production ready soon; I have not yet analyzed/tested the backend encryption support in this version.
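A hedged boto3 sketch of the second (end-to-end) ELB configuration; the name, zones and certificate ARN are placeholders, and backend authentication policies are omitted for brevity:

```python
import boto3

elb = boto3.client("elb", region_name="us-east-1")

elb.create_load_balancer(
    LoadBalancerName="secure-elb",
    Listeners=[{
        "Protocol": "HTTPS",            # client -> ELB is encrypted
        "LoadBalancerPort": 443,
        "InstanceProtocol": "HTTPS",    # ELB -> backend is also encrypted
        "InstancePort": 443,
        "SSLCertificateId": "arn:aws:iam::123456789012:server-certificate/my-cert",
    }],
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)
```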

Scalability and Elasticity: The most important architectural requirements of web-scale systems are scalability and elasticity. Amazon ELB is designed for this and handles these requirements with ease. Elastic Load Balancer does not cap the number of connections that it can attempt to establish with the load balanced Amazon EC2 instances, and it is designed to handle unlimited concurrent requests per second. ELB is inherently scalable and can elastically increase/decrease its capacity depending upon the traffic. In a benchmark done by RightScale, Amazon ELB was easily able to scale out and handle 20K+ concurrent requests/sec. Refer URL: http://blog.rightscale.com/2010/04/01/benchmarking-load-balancers-in-the-cloud/
Note: RightScale stopped the load testing after 20K req/sec because ELB kept expanding its capacity. Considerable DevOps engineering is needed to replicate this functionality with HAProxy.

High Availability: Amazon ELB is inherently fault tolerant and a highly available service. Since it is a managed service, unhealthy load balancer instances are automatically replaced in the ELB tier. With HAProxy you need to do this work yourself and build HA on your own; refer to http://harish11g.blogspot.in/2012/10/high-availability-haproxy-amazon-ec2.html to understand more about high availability at the load balancing layer using HAProxy.

Integration with Other Services: Amazon ELB can be configured to work seamlessly with Amazon Auto Scaling, Amazon CloudWatch and the Route 53 DNS service. New web EC2 instances launched by Amazon Auto Scaling are added to the Amazon ELB for load balancing automatically, and whenever load drops, Amazon Auto Scaling can remove existing EC2 instances from the ELB. Amazon Auto Scaling and CloudWatch cannot be integrated this seamlessly with HAProxy EC2, but HAProxy can easily be combined with Route 53 for DNS RR/weighted algorithms.
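A minimal boto3 sketch of that integration (all names are placeholders; the launch configuration is assumed to exist already):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-launch-config",   # assumed to exist
    MinSize=2,
    MaxSize=10,
    LoadBalancerNames=["my-elb"],   # instances register/deregister automatically
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)
```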

Cost: If you run an ELB in the US-East Amazon EC2 region for a month (744 hrs) processing close to 1 TB of data, it will cost around ~26 USD (ELB usage + data charge). If you instead use HAProxy (2 x m1.large EC2 instances for HAProxy, S3-backed AMI, Linux instances, no EBS attached) as base capacity and add up to 4 or more m1.large EC2 instances depending upon traffic, it will cost a minimum of 387 USD for EC2 compute plus data charges just to start with. It is clear and evident that larger deployments can save a lot of cost and benefit immensely from Amazon ELB compared to HAProxy on EC2.
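The back-of-the-envelope arithmetic, assuming the US-East prices in effect at the time of writing (ELB ~0.025 USD/hr plus ~0.008 USD per GB processed; m1.large ~0.26 USD/hr):

```python
HOURS = 744                       # one month
GB_PROCESSED = 1024               # ~1 TB through the load balancer

elb_cost = HOURS * 0.025 + GB_PROCESSED * 0.008
print(f"ELB: ~{elb_cost:.0f} USD/month")               # ~27 USD, the ~26 USD above

haproxy_base = 2 * HOURS * 0.26                        # two m1.large, base capacity
print(f"HAProxy base: ~{haproxy_base:.0f} USD/month")  # ~387 USD before data charges
```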

Use Amazon S3 Object Expiration for Cost Savings

Amazon S3 is one of the earliest and most popular services in AWS infrastructure for storing files and documents. Customers store a variety of files in Amazon S3, including logs, documents, images, videos, dumps etc. We all understand that different files have different lifetimes and use cases in any production application: some documents are frequently accessed for a limited period of time, after which you might not need real-time access to them and they become candidates for deletion or archival.
For example:
Log files have a limited lifetime and can either be parsed into a data store or archived every few months
Database and data store dumps also have a retention period and hence a limited lifetime
Files related to campaigns are, most of the time, not needed once the sales promotion is over
Customer documents depend on the customer usage life cycle and have to be retained only while the customer is active in the application
Digital media archives, financial and healthcare records must be retained for regulatory compliance

Usually IT teams have to build some sort of mechanism or automated in-house programs to track these document ages and initiate a deletion process (individual or bulk) from time to time. In my customer consulting experience, I have often observed that the above mechanism is not adequately in place, for the following reasons:
Not all IT teams are efficient in their development and operations
No mechanism/automation is in place to manage the retention period efficiently
IT staff are not fully equipped with AWS cloud knowledge
IT teams are usually occupied with the solutions/products catering to their business and hence do not have time to keep track of the rapid pace of AWS feature roll-outs

Imagine your application stores ~5 TB of documents every month; in a year this aggregates to ~60 TB of documents in Amazon S3 Standard storage. In Amazon S3 Standard in the US-East region, ~60 TB of storage aggregated over the year will cost ~30,000 USD. Out of this, imagine ~20 TB of the documents aggregated during the year have a limited lifetime and could be deleted or archived a month after they stop being useful; holding them that extra month equates to ~1650 USD of cost leakage a year. This can be avoided if a proper mechanism or automation is put in place by the respective teams.
Note: Current charges for Amazon S3 Standard storage in US-East are 0.095 USD per GB for the first 1 TB and 0.080 USD per GB for the next 49 TB.
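A rough sketch of the arithmetic behind these figures, assuming the bulk-tier price from the note and reading the leakage as ~20 TB held one month past its useful life:

```python
GB_PER_TB = 1024
PRICE = 0.080                                  # USD per GB-month (bulk tier)

# 5 TB added each month -> month-end storage of 5, 10, ..., 60 TB.
tb_months = sum(5 * m for m in range(1, 13))   # 390 TB-months over the year
print(f"Yearly storage bill: ~{tb_months * GB_PER_TB * PRICE:,.0f} USD")
# ~31,950 USD, roughly the ~30,000 USD above (exact bill depends on tiering)

# ~20 TB retained one unnecessary extra month leaks roughly:
print(f"Leakage: ~{20 * GB_PER_TB * PRICE:,.0f} USD")   # ~1,640 USD, the ~1650 above
```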
But is there a simpler way for IT teams to cut this leakage and save costs in Amazon S3? Yes: use the Amazon S3 Object Expiration feature.

What is Amazon S3 Object expiration?
Amazon S3 introduced a feature called Object Expiration (in late 2011) to ease the above automation mechanism. This is a very helpful feature for customers who want their data in S3 for a limited period of time, after which the files should be deleted automatically by Amazon S3. Earlier, as a customer you were responsible for deleting those files manually when they stopped being useful; now you do not have to worry about it, just use Amazon S3 Object Expiration.
The leakage of ~1650 USD you saw in the above scenario can be saved by implementing the Amazon S3 Object Expiration feature in your system. Since it involves no automation effort, no compute hours for an automation program to run and no manual labor, it offers invisible savings in addition to the direct savings.

Overall Savings = ~1650 USD (scenario) + Cost of compute hrs (for deletion program) + Automation engineering effort (or) Manual deletion effort

How does it work?
The Amazon S3 Object Expiration feature allows you to define rules to schedule the removal of your objects after a pre-defined time period. The rules are specified in the Lifecycle Configuration policy of an Amazon S3 bucket and can be updated either through the AWS Management Console or the S3 APIs.
Once a rule is set, Amazon S3 calculates the Object Expiration time by adding the expiration lifetime to the file creation time and rounding the result up to the next midnight GMT. For example, if a file was created on 11/12/2012 at 11:00 UTC and the expiration period was specified as 3 days, Amazon S3 would calculate the expiration date-time of the file as 15/12/2012 00:00 UTC. Once objects are past their expiration date, they are queued for deletion. You can use Object Expiration rules on objects stored in both the Standard and Reduced Redundancy storage classes of Amazon S3.
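A minimal boto3 sketch of such a rule (the bucket name and prefix are hypothetical), expiring log objects 90 days after creation:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-app-logs",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-old-logs",
            "Filter": {"Prefix": "logs/"},   # only objects under logs/
            "Status": "Enabled",
            "Expiration": {"Days": 90},      # queued for deletion after 90 days
        }]
    },
)
```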

Add Spot Instances with Amazon EMR

Most of us know that Amazon Spot EC2 instances are usually a good choice for time-flexible and interruption-tolerant tasks. These instances are traded frequently at a fluctuating spot market price, and you can set your bid price using the AWS APIs or AWS Console. Once free spot EC2 capacity is available at your bid price, AWS allots the instances for use in your account. Spot instances are usually available far cheaper than On-Demand EC2 instances. Example: the On-Demand m1.xlarge price is 0.48 USD per hour, and on the spot market you can sometimes find it at 0.052 USD per hour, roughly 9 times cheaper than on-demand; even if you simply bid competitively and get hold of spot EC2 at around 0.24 USD most of the time, you are saving 50% off the on-demand price straight away. Big data use cases usually need lots of EC2 nodes for processing, so adopting such techniques can make a vast difference to your infra cost and operations in the long term. I am sharing my experience on this subject as tips and techniques you can adopt to save costs while using EMR clusters in Amazon for big data problems.
Note: While dealing with spot, you can be sure that you will never pay more than your maximum bid price per hour.
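A hedged boto3 sketch of a one-time spot request; the AMI, instance type and bid price are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.request_spot_instances(
    SpotPrice="0.24",                    # your maximum bid, USD/hr
    InstanceCount=4,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-12345678",       # hypothetical AMI
        "InstanceType": "m1.xlarge",
    },
)
```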

Tip 1: Make the right choice (Spot vs On-Demand) for the cluster components
Data Critical workloads: For workloads that cannot afford to lose data, you can run the Master + Core nodes on Amazon On-Demand EC2 and your Task nodes on Spot EC2. This is the most common pattern when combining Spot and On-Demand in an Amazon EMR cluster (sketched below). Since the task nodes operate at spot prices, depending upon your bidding strategy you can save ~50% of the cost of running those task nodes On-Demand. You can save further (if you are lucky) by reserving your Core and Master nodes, but you will then be tied to an AZ; in my view this is not a good or common technique, because some AZs can be very noisy with high spot prices.
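A hedged boto3 sketch of this data critical split (the cluster name, instance types and bid price are placeholders; AMI/release version and other required cluster settings are omitted for brevity):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="spot-task-cluster",
    Instances={
        "InstanceGroups": [
            # Master and Core on On-Demand so the cluster and HDFS survive
            # spot interruptions.
            {"Name": "master", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND", "InstanceType": "m1.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "Market": "ON_DEMAND", "InstanceType": "m1.xlarge", "InstanceCount": 2},
            # Task nodes on Spot: pure compute, safe to lose.
            {"Name": "task", "InstanceRole": "TASK",
             "Market": "SPOT", "BidPrice": "0.24",
             "InstanceType": "m1.xlarge", "InstanceCount": 8},
        ],
    },
)
```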
Cost Driven workloads: When solving big data problems, you sometimes face scenarios where cost is far more important than time. Example: you are processing archives of old logs as low-priority jobs, where the cost of processing matters most and time is usually abundant. In such cases you can run Master + Core + Task all on Spot EC2 to squeeze further savings out of the data critical workloads approach; since all the nodes operate at spot prices, depending upon your bidding strategy you can save ~60% or more versus running the nodes On-Demand. A table published by AWS gives an indication of the Amazon EMR + Spot combinations that are widely used.

Tip 2: There is a free lunch sometimes
Spot instances can be interrupted by AWS when the spot market price reaches your bid price; interruption means AWS can pull back the spot EC2 instances assigned to your account when the market price matches or exceeds your bid. If your spot task nodes are interrupted, you will not be charged for any partial hour of usage: if you started an instance at 10:05 am and your instances are interrupted by spot price fluctuations at 10:45 am, you will not be charged for that partial hour. If your processing exercise is totally time insensitive, you can keep your bid price close to the spot price, where instances are easily interrupted by AWS, and exploit this partial-hours behaviour. Theoretically you can get most of the processing done by your task nodes for free* by exploiting this strategy.

Tip 3: Use the AZ wisely when it comes to spot
Different AZs inside an Amazon EC2 region have different spot prices for the same instance type. Observe this pattern for a while, build some intelligence around the collected price data and rebuild your cluster in the AZ with the lowest price. Since Master + Core + Task need to run in the same AZ for better latency, it is advisable to architect your EMR clusters in such a way that they can be switched (i.e. recreated) to a different AZ according to spot prices. If you can build this flexibility into your architecture, you can save costs by leveraging the inter-AZ price fluctuations; spot price charts for two AZs in the same region over the same time period show how large the variation can be. Make your choice wisely from time to time.
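A minimal boto3 sketch that compares recent spot prices across AZs for one instance type before (re)creating the cluster:

```python
import boto3
from datetime import datetime, timedelta

ec2 = boto3.client("ec2", region_name="us-east-1")

history = ec2.describe_spot_price_history(
    InstanceTypes=["m1.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
)

# Print per-AZ prices so the cheapest AZ can be chosen for the cluster.
for h in history["SpotPriceHistory"]:
    print(h["AvailabilityZone"], h["SpotPrice"], h["Timestamp"])
```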

Tip 4: Keep your Job logic small and store intermediate outputs in S3
Break your complex processing logic down into small jobs, and design the jobs and tasks in your EMR cluster in such a way that each runs for a very short period of time (a few minutes, for example). Store all the intermediate job outputs in Amazon S3 (a sketch of such a step follows the closing note below). This approach is helpful in the EMR world and gives you the following benefits:

When your Core + Task nodes are interrupted frequently, you can still continue from the intermediate checkpoints, with the data read back from S3
You have the flexibility to recreate the EMR cluster in a different AZ depending upon the spot price fluctuations
You can decide the number of nodes needed for your EMR cluster (even hour by hour) depending upon the data volume, density and velocity

All three points above, when implemented, contribute to elasticity in your architecture and thereby help you save costs in the Amazon cloud. This recommendation is not suitable for all jobs; it has to be carefully mapped to the right use cases by the architects.
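A hedged boto3 sketch of such an S3-backed step (the bucket, jar and cluster ID are placeholders): each step reads its input from S3 and writes its output back to S3, so an interrupted cluster, or one rebuilt in another AZ, can resume from the last completed step.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",
    Steps=[{
        "Name": "aggregate-hour-01",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://my-bucket/jobs/aggregate.jar",   # hypothetical job jar
            "Args": [
                "s3://my-bucket/input/hour-01/",          # read input from S3
                "s3://my-bucket/intermediate/hour-01/",   # intermediate output to S3
            ],
        },
    }],
)
```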