25 Best Practice Tips for architecting your Amazon VPC

In my opinion, Amazon VPC is one of the most important features introduced by AWS. We have been using AWS since 2008 and Amazon VPC from the day it was introduced, and I strongly feel that customer adoption of the AWS cloud gained real momentum only after VPC came to the market.
Amazon VPC comes with lots of advantages over the limitations faced in the Amazon Classic cloud: static private IP addresses, Elastic Network Interfaces (it is possible to bind multiple ENIs to a single instance), internal Elastic Load Balancers, advanced network access control, the ability to set up a secure bastion host, DHCP options, predictable internal IP ranges, moving NICs and internal IPs between instances, VPN connectivity, heightened security, and more. Each of these is an interesting topic on its own and I will discuss them in detail in future posts.
Today I am sharing some of our implementation experience from working with hundreds of Amazon VPC deployments, as best practice tips for the AWS user community. You can apply the relevant ones in your existing VPC or use these points as part of your migration approach to Amazon VPC.

Practice 1) Get your Amazon VPC combination right: Select the right Amazon VPC architecture first. You need to decide the right Amazon VPC and VPN setup combination based on your current and future requirements. It is tough to modify/re-design an Amazon VPC at a later stage, so it is better to design it taking into consideration your network and expansion needs for the next ~2 years. Different types of Amazon VPC setups are currently available, such as: a public-facing VPC, a VPC with public and private subnets, a VPC with public and private subnets and hardware VPN access, a VPC with private subnets only and hardware VPN access, software-based VPN access, etc. Choose the one that fits where you expect to be in the next 1-2 years.

Practice 2) Choose your CIDR Blocks: While designing your Amazon VPC, the CIDR block should be chosen in consideration of the number of IP addresses needed and whether you are going to establish connectivity with your data center. The allowed block size is between a /28 netmask and a /16 netmask, so an Amazon VPC can contain from 16 to 65,536 IP addresses. Currently the CIDR block of an Amazon VPC cannot be modified once created, so it is usually best to choose a CIDR block which has more IP addresses than you need today. Also, when you design the Amazon VPC architecture to communicate with an on-premise/data center network, ensure the CIDR range used in the Amazon VPC does not overlap or conflict with the CIDR blocks in your on-premise/data center. Note: if you use overlapping CIDR blocks while configuring the customer gateway, it will conflict.
E.g., if your VPC CIDR block is 10.0.0.0/16 and you have a 10.0.25.0/24 subnet in a data center, communication from instances in the VPC to the data center will not happen, since that subnet is part of the VPC CIDR. To avoid these consequences, keep the IP ranges in different blocks, for example the Amazon VPC in 10.0.0.0/16 and the data center in the 172.16.0.0/24 range.
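Before creating the VPC, it is worth sanity-checking that none of your on-premise ranges fall inside the proposed VPC CIDR. The minimal sketch below uses Python's ipaddress module with the example ranges from above; substitute your own blocks.

import ipaddress

# CIDR blocks from the example above (replace with your own environment)
vpc_cidr = ipaddress.ip_network("10.0.0.0/16")
datacenter_cidrs = [
    ipaddress.ip_network("10.0.25.0/24"),   # overlaps with the VPC CIDR
    ipaddress.ip_network("172.16.0.0/24"),  # different range, safe to use
]

for dc_cidr in datacenter_cidrs:
    if vpc_cidr.overlaps(dc_cidr):
        print("CONFLICT: data center block %s overlaps VPC block %s" % (dc_cidr, vpc_cidr))
    else:
        print("OK: %s does not overlap %s" % (dc_cidr, vpc_cidr))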

Practice 3) Isolate according to your Use case: Create separate Amazon VPCs for the development, staging and production environments, or create one Amazon VPC with separate subnets/security groups/isolated network groups for production, staging and development. We have observed around 60% of customers preferring the second option. Choose the right one according to your use case.

Practice 4) Securing Amazon VPC: If you are running a mission critical workload demanding complex security needs, you can secure the Amazon VPC like your on-premise data center, or sometimes even more. Some tips to secure your VPC are:

  • Secure your Amazon VPC using a firewall virtual appliance or web application firewall available from the AWS Marketplace. You can use Check Point, Sophos, etc. for this.
  • You can configure Intrusion Prevention or Intrusion Detection virtual appliances to secure the protocols and take preventive/corrective actions in your VPC.
  • Configure VM encryption tools which encrypt your root and additional EBS volumes. The key can be stored inside AWS or in your data center outside Amazon Web Services, depending on your compliance needs. http://harish11g.blogspot.in/2013/04/understanding-Amazon-Elastic-Block-Store-Securing-EBS-TrendMicro-SecureCloud.html
  • Configure a Privileged Identity and Access Management solution on your Amazon VPC to monitor and audit administrator access to your VPC.
  • Enable AWS CloudTrail to audit changes such as ACL policies in your VPC environments: http://harish11g.blogspot.in/2014/01/Integrating-AWS-CloudTrail-with-Splunk-for-managed-services-monitoring-audit-compliance.html
  • Apply antivirus for cleansing specific EC2 instances inside the VPC. Trend Micro has a very good product for this.
  • Configure site-to-site VPN for securely transferring information between Amazon VPCs in different regions or between an Amazon VPC and your on-premise data center.
  • Follow the security group and network ACL best practices listed below.

Practice 5) Understand Amazon VPC Limits: Always design the VPC subnets in consideration of future expansion. Also understand the Amazon VPC limits before using it. AWS has various limits on VPC components such as rules per security group, number of route tables, subnets, etc. Some of them can be raised by submitting a request to the AWS support team, while a few cannot be increased. Ensure these limits do not affect your overall design. Refer URL:
http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_Limits.html

Practice 6) IAM your Amazon VPC: When you are going to assign people to maintain your Amazon VPC, you can create Amazon IAM accounts with fine-grained permissions, or use sophisticated Privileged Identity Management solutions available on the AWS Marketplace, to control access to your VPC.

Practice 7) Disaster Recovery or Geo-Distributed Amazon VPC Setup: When you are designing a disaster recovery setup using VPC, or expanding to another AWS region, you can follow these simple rules. Create your production site VPC with CIDR 10.0.0.0/16 and your DR region VPC with CIDR 172.16.0.0/16. Make sure they do not conflict with on-premises subnet CIDR blocks in case both need to be integrated with the on-premise DC as well. After creating the CIDR blocks, set up a VPN tunnel between the regions and to your on-premise DC. This will help you replicate your data using private IPs.

Practice 8) Use security groups and Network ACLs wisely: It is advisable to prefer security groups over network ACLs inside Amazon VPC wherever applicable, for better control. Security groups are applied at the EC2 instance level, while network ACLs are applied at the subnet level. Security groups are mostly used for whitelisting; to blacklist IPs, you can use network ACLs.

Practice 9) Tier your Security Groups: Create different security groups for different tiers of your infrastructure architecture inside your VPC. If you have web, app and DB tiers, create a different security group for each of them. Creating tier-wise security groups increases the infrastructure security inside Amazon VPC. EC2 instances in each tier can then talk only on the application-specified ports, not on all ports. If you create Amazon VPC security groups for each and every tier/service separately, it will be easier to open a port to a particular service. Don't use the same security group for multiple tiers of instances; this is a bad practice.
Example: open ports to a security group instead of IP ranges. People have a tendency to open port 8080 to the 10.10.0.0/24 (web layer) range. Instead, open port 8080 to web-security-group. This makes sure only instances belonging to the web security group can connect on port 8080. If someone launches a NAT instance with NAT-Security-Group in 10.10.0.0/24, it won't be able to connect on port 8080, since access is allowed only from the web security group. A hedged automation sketch follows.
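The sketch below shows how such a rule could be created with boto3, the AWS SDK for Python. The region, security group IDs and port are illustrative assumptions; only the API call itself is a real boto3 operation.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Hypothetical security group IDs for the web and app tiers
web_sg_id = "sg-0aaa1111bbbb22223"   # web-security-group
app_sg_id = "sg-0ccc3333dddd44445"   # app-security-group

# Allow port 8080 on the app tier only from members of the web security group,
# instead of opening it to an entire IP range like 10.10.0.0/24
ec2.authorize_security_group_ingress(
    GroupId=app_sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8080,
        "ToPort": 8080,
        "UserIdGroupPairs": [{"GroupId": web_sg_id}],
    }],
)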
Practice 10) Standardize your Security Group naming conventions: Following a security group naming convention inside Amazon VPC improves operations/management for large scale deployments. It also avoids manual errors and leaks, and saves cost and time overall.
For example, simple ones like Prod_DMZ_Web_SG or Dev_MGMT_Utility_SG, or complex coded ones for large scale deployments like
USVA5LXWEBP001- US East Virginia AZ 5 Linux Web Server Production 001
This helps in better management of security groups.
Practice 11) ELB on Amazon VPC: When using Amazon ELB for web applications, put all other EC2 instances (tiers like app, cache, DB, background workers, etc.) in private subnets as much as possible. Unless there is a specific requirement where instances need outside-world access and an EIP attached, put all instances in private subnets only. Only ELBs should be provisioned in the public subnet as a secure practice in an Amazon VPC environment.
Practice 12) Control your outgoing traffic in Amazon VPC: If you are looking for better security, use software like Squid or Sophos for the traffic going to the internet gateway to restrict ports, URLs, domains, etc., so that all traffic goes through a controlled proxy tier and also gets logged. Using these proxy/security systems we can also block unwanted ports; by doing so, if there is any security compromise of the application running inside the Amazon VPC, it can be detected by auditing the restricted connections captured in the logs. This helps with corrective security measures.
Practice 13) Plan your NAT Instance Type: Whenever your application EC2 instances residing inside the private subnet of an Amazon VPC make Web Service/HTTP/S3/SQS calls, they go through the NAT instance. If you have designed auto scaling for your application tier and there is a chance that tens of app EC2 instances will make lots of web calls concurrently, the NAT instance will become a performance bottleneck at this juncture. Size your NAT instance capacity depending upon application needs to avoid performance bottlenecks. Using NAT instances gives us the advantage of saving the cost of Elastic IPs and provides extra security by not exposing the instances to the outside world for internet access.
Practice 14) Spread your NAT instances across multiple subnets: What if you have hundreds of EC2 instances inside your Amazon VPC and they are making lots of heavy web service/HTTP calls concurrently? A single NAT instance, even at the largest EC2 size, sometimes cannot handle that bandwidth and may become a performance bottleneck. In such scenarios, span your EC2 instances across multiple subnets and create a NAT for each subnet. This way you can spread your outgoing bandwidth and improve the performance of your VPC-based deployments.
Practice 15) Use EIP when needed: At times you may need to keep part of your application services in a public subnet for external communication. It is a recommended practice to associate them with an Amazon Elastic IP and whitelist these IP addresses in the target services they use.
Practice 16) NAT instance practices: If needed, enable multi-factor authentication on the NAT instance. SSH and RDP ports should be open only to specific source and destination IPs, not the global network (0.0.0.0/0), and only on static exit IPs, not dynamic exit IPs.
Practice 17) Plan your tunnel between your on-premise DC and Amazon VPC:
Select the right mechanism to connect your on-premises DC to Amazon VPC. This will help you connect to EC2 instances via private IPs in a secure manner.
  • Option 1: Secure IPSec tunnel to connect a corporate network with Amazon VPC (http://aws.amazon.com/articles/8800869755706543)
  • Option 2 : Secure communication between sites using the AWS VPN CloudHub (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPN_CloudHub.html)
  • Option 3: Use AWS Direct Connect between Amazon VPC and on-premise when you have lots of data to be transferred with reduced latency, or when you have spread your mission critical workloads across the cloud and on-premise. Example: Oracle RAC in your DC and the web/app tier in your Amazon VPC. Contact us if you need help setting up Direct Connect between Amazon VPC and your DC.
Practice 18) Always span your Amazon VPC across multiple subnets in multiple availability zones inside a region. This helps in properly architecting high availability inside your Amazon VPC. Example classification of VPC subnets: web tier subnet: 10.0.10.0/24 in AZ1 and 10.0.11.0/24 in AZ2; application tier subnet: 10.0.12.0/24 and 10.0.13.0/24; DB tier subnet: 10.0.14.0/24 and 10.0.15.0/24; cache tier subnet: 10.0.16.0/24 and 10.0.17.0/24; etc.
Practice 19) A good security practice is to associate only the public subnet with a route table that carries a route to the internet gateway. Apply this wherever applicable. A hedged sketch follows.
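The sketch below shows one way this could be done with boto3. The VPC, internet gateway, subnet and region identifiers are hypothetical placeholders; only the API calls are real.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # illustrative region

vpc_id = "vpc-0123456789abcdef0"              # hypothetical IDs
igw_id = "igw-0123456789abcdef0"
public_subnet_id = "subnet-0aaa1111bbbb22223"

# Dedicated route table for the public subnet only
rt = ec2.create_route_table(VpcId=vpc_id)
public_rt_id = rt["RouteTable"]["RouteTableId"]

# The default route to the internet gateway lives only in this route table;
# private subnets keep route tables without any internet gateway route
ec2.create_route(
    RouteTableId=public_rt_id,
    DestinationCidrBlock="0.0.0.0/0",
    GatewayId=igw_id,
)
ec2.associate_route_table(RouteTableId=public_rt_id, SubnetId=public_subnet_id)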
Practice 20) Keep your data closer: For small scale deployments in AWS where cost is more critical than high availability, it is better to keep the web/app tier in the same availability zone as ElastiCache, RDS, etc. inside your Amazon VPC. Design your subnets accordingly to suit this. This is not a recommended architecture for applications demanding high availability.
Practice 21) Allow and Deny Network ACLs: Create internet-outbound allow and deny network ACLs in your VPC.
First network ACL: allow all HTTP and HTTPS outbound traffic on the public, internet-facing subnet.
Second network ACL: deny all HTTP/HTTPS traffic. Allow all traffic to the Squid proxy server or other virtual appliance.
Practice 22) Restricting Network ACLs: Block all inbound and outbound ports. Only allow the ports the application requires. Network ACLs are stateless traffic filters that apply to all traffic inbound to or outbound from a subnet within the VPC. AWS recommended outbound rules: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_NACLs.html A hedged sketch follows.
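As an illustration, the sketch below adds a pair of network ACL entries with boto3: one egress rule allowing only HTTPS out of the subnet and one ingress rule for the ephemeral return ports. The ACL ID, rule numbers and region are assumptions; consult the AWS recommended rules above for a complete rule set.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
nacl_id = "acl-0123456789abcdef0"   # hypothetical network ACL ID

# Allow only HTTPS (443) outbound from the subnet; everything else stays blocked
ec2.create_network_acl_entry(
    NetworkAclId=nacl_id,
    RuleNumber=100,
    Protocol="6",              # TCP
    RuleAction="allow",
    Egress=True,
    CidrBlock="0.0.0.0/0",
    PortRange={"From": 443, "To": 443},
)

# Allow the ephemeral return ports back in for those outbound connections
ec2.create_network_acl_entry(
    NetworkAclId=nacl_id,
    RuleNumber=100,
    Protocol="6",
    RuleAction="allow",
    Egress=False,
    CidrBlock="0.0.0.0/0",
    PortRange={"From": 1024, "To": 65535},
)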
Practice 23) Create route tables only when needed, and use the association option to map subnets to the route table in your Amazon VPC.
Practice 24) Use Amazon VPC Peering (new): Amazon Web Services has introduced the VPC peering feature, which is quite useful. An AWS VPC peering connection is a networking connection between two Amazon VPCs that enables you to route traffic between them using private IP addresses. Currently both VPCs must be in the same AWS region; instances in either VPC can communicate with each other as if they were within the same network. Since AWS uses the existing infrastructure of a VPC to create a VPC peering connection, it is neither a gateway nor a VPN connection, and it does not rely on a separate piece of physical hardware (which essentially means there is no single point of failure for communication or a bandwidth bottleneck).

We have seen it be useful in the following scenarios (a hedged setup sketch follows the list):
  1. Large enterprises usually run multiple Amazon VPCs in a single region, and some of their applications are so interconnected that they may need to access one another privately and securely inside AWS. Examples: Active Directory, Exchange and common business services are usually interconnected.
  2. Large enterprises have different AWS accounts for different business units/teams/departments; at times, systems deployed by some business units in different AWS accounts need to be shared or need to consume a shared resource privately. Examples: CRM, HRMS, file sharing, etc. can be internal and shared. In such scenarios VPC peering is very useful.
  3. Customers can peer their VPC with their core suppliers to have more tightly integrated access to their systems.
  4. Companies offering infrastructure/application managed services on AWS can now safely peer into customer Amazon VPCs and provide monitoring and management of AWS resources.
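A minimal boto3 sketch of setting up a peering connection is shown below. The VPC IDs, route table ID and CIDR are hypothetical; cross-account peering would additionally need the peer account ID, and the accept call must be issued by the owner of the accepter VPC.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

requester_vpc = "vpc-0aaa1111bbbb22223"   # hypothetical VPC IDs
accepter_vpc = "vpc-0ccc3333dddd44445"

# Request the peering connection (same region)
peering = ec2.create_vpc_peering_connection(
    VpcId=requester_vpc,
    PeerVpcId=accepter_vpc,
)
pcx_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# The owner of the accepter VPC accepts the request
ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)

# Each side then adds a route to the other VPC's CIDR via the peering connection
ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",   # requester's route table (assumption)
    DestinationCidrBlock="172.16.0.0/16",   # accepter VPC CIDR (assumption)
    VpcPeeringConnectionId=pcx_id,
)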

Practice 25) Use Amazon VPC: It is highly recommended to migrate all your new workloads into Amazon VPC rather than the Amazon Classic cloud. I also strongly recommend migrating your existing workloads from the Amazon Classic cloud to Amazon VPC, in phases or in one shot, whichever is feasible. In addition to the benefits of VPC detailed at the start of the article, AWS has started introducing many features which are available only inside VPC, and in the AWS Marketplace there are also many products which are compatible only with Amazon VPC. So make sure you leverage this strength of VPC. If you require any help with this migration, please contact me.

Readers, feel free to suggest more; I will link relevant ones in this article.

Load Testing tool comparison – JMeter on its own vs JMeter & BlazeMeter together

Load testing is an important aspect of a web application's life cycle on the Amazon cloud. Some of our customers ask us to generate 50,000+ RPS to load test the scalability of their applications deployed on the Amazon cloud. Whenever we help such customers migrate their applications to the Amazon cloud for scalability, the load testing phase itself becomes a pain. Setting up the load testing infrastructure, writing automation around it, and managing, maintaining and monitoring the load test infrastructure is a headache. Our load testers and infrastructure teams were spending considerable time and effort on the above, instead of focusing only on load testing. We usually work with a variety of tools, from Grinder, JMeter and HP LoadRunner to custom engineered load testing tools, during the load testing phase. Some time back, our team started playing around with a SaaS-based load testing tool called BlazeMeter. In this article I am going to share our experience in the form of a comparison between BlazeMeter and JMeter, and why BlazeMeter has a bright future.
BlazeMeter is a SaaS-based, highly scalable load testing tool that can handle up to 300,000+ concurrent users. Its load test infrastructure is spread across major AWS regions. Since most of us have been using JMeter for years, the 100% compatibility it provides with existing JMeter scripts is a good feature. BlazeMeter also provides a Chrome extension which can record browser actions and convert them to a .jmx file.

10 Things I like about BlazeMeter

Point 1) A load test becomes effective only when the load comes from different IP addresses, similar to a real world scenario, and not from a single source IP. When the load of multiple virtual users is generated from the same IP, the router as well as the server often tries to cache information and optimize throughput. Hence, by using multiple IP addresses for the host, the EC2 server gets the illusion of receiving requests from multiple source IPs. It is also better for the load to come from multiple IPs so that Amazon ELB distributes it evenly. Refer URL. BlazeMeter has the capability to generate load from multiple IPs, which is very important when load testing cloud applications.
Point 2) Customizing the network emulation: Usually online applications are accessed from multiple devices like PCs, laptops and mobiles. These devices use multiple network types such as 3G, broadband, etc. Also, at times our online application will be accessed from locations which have poor network bandwidth. Both these parameters play an important role in capacity planning and load testing. We can choose the bandwidth and network type emulation while doing the load test using BlazeMeter. For example, we can configure the network type such as unlimited internet, 3G, cable, WiFi, etc., and the bandwidth download limit per device can also be set.
Point 3) Controlling the throughput: Target throughput is a parameter of Apache JMeter that can be used to achieve a required throughput value for the application. A server's performance need not always satisfy the target throughput value mentioned in JMeter; it could provide more throughput or less. The target throughput parameter can be controlled at run time in BlazeMeter. Live server monitoring can help us identify whether our servers are performing well for, say, 5000 hits/sec, and change the throughput value at run time to a higher or lower value based on the server's performance.
Point 4) Controlling the agents: Apache JMeter works on a master-agent architecture where the master controls multiple agents generating the load. With JMeter-based load testing on the Amazon cloud, the number of agents usually has to be decided before the start of the test. The option to change this dynamically is a very good feature to have while load testing a cloud application requiring thousands of requests per second. BlazeMeter enables us to add or remove agent instances while a test is running. Any instance can be marked as master or slave (agent) while the test is running.
Point 5) Controlling the number of simulated users on slaves (agents): A load test strategy is mainly determined by parameters like the number of concurrent users, ramp-up time, number of test engines, test iterations and test duration. Apache JMeter allows us to manually configure these values before the test is started. New EC2 instances have to be provisioned for the agents, and the IP addresses (usually Elastic IPs) of the slaves/agents have to be manually added to the master. The entire setup has to be maintained, managed and monitored during the test cycles. This is OK for a load testing environment with a few load test agents and low RPS, but imagine an environment where you have to generate thousands of RPS with 50+ agents running. This process of managing the EC2 load test infrastructure becomes tedious overall for the load testing teams. In BlazeMeter, once the number of concurrent users is given, the number of test engines, number of threads and engine capacity are chosen automatically. This can be made semi-automatic, where the number of engines and the number of threads are selected by the user and only the engine capacity is chosen by BlazeMeter. Since it is a managed load test infrastructure, the load testers can concentrate on testing and not on managing hundreds of EC2 load agents.
Point 6) Integrated Monitoring:
BlazeMeter offers live monitoring of essential parameters of the test servers while the test is running, which enables us to decide on the number and instance type for the test. In a conventional Apache JMeter load test setup on Amazon EC2 we have to observe the key parameters using AWS CloudWatch.
BlazeMeter provides AWS CloudWatch integration. An account with IAM access has to be created, and the AWS Access Key and Secret Key values have to be configured so that the metrics are available in BlazeMeter's dashboard. This feature helps us understand how the assets in the cloud are reacting to our load tests and helps us tune the infrastructure accordingly.
While performing load testing, it is important to monitor not only your web servers and databases but also the agents from which the load is generated. The New Relic plugin gives us the front-end and back-end KPIs.
BlazeMeter's front-end KPIs provide insight into how many users are actually trying to access your website, mobile site or mobile apps.
BlazeMeter's back-end KPIs show how many users are getting through to your applications.
Point 7) BlazeMeter allows us to have a different CSV file per load test engine. Though this is possible in Apache JMeter, it has to be done manually by copying the files onto the JMeter agent EC2 instances with the same filename, since the agents refer to the master's properties. BlazeMeter allows us to parameterize even the filenames and have different CSV files in each engine, without the trouble of copying files onto specific EC2 instances; it holds the files in a common repository from which they are distributed to each agent.
Point 8) Run the load test using older versions of JMeter scripts: Old scripts can be reused with this feature of BlazeMeter, which lets us run the test using any version of Apache JMeter from 2.3.2 to 2.10. Complex scripts prepared months or years ago can still be used and need not be redone. This saves effort and cost.
Point 9) Schedule the test and stay relaxed: BlazeMeter, as well as JMeter, lets you schedule your test duration and test time so that you can run longevity tests at any time of the day. Weekly scheduling is also possible in BlazeMeter, which is an added advantage, though it is not widely used.
Point 10) Interesting Plug-ins provided by Blazemeter :
Integration with Google Analytics: At the time of scripting, it is enough to select the Google Analytics option and provide the Google Analytics account details. BlazeMeter obtains the last 12 months of data, creates a test with the 5 most visited pages and sets the number of concurrent users based on that record.
Integration with WordPress: BlazeMeter provides integration with WordPress where WordPress users can test their App by using the BlazeMeter plug-in without any scripting.
Integration with Drupal & Jenkins: Plugins are available to load test Drupal & Jenkins servers as well.

Post co-authored with Harine, 8KMiles.

Architecting Highly Available ElastiCache Redis replication cluster in AWS VPC

In this post let's explore how to architect and create a highly available and scalable Redis cache cluster for your web application in AWS VPC. The following is the architecture in which the ElastiCache Redis cluster is assembled:

  • Redis Cache Cluster inside Amazon VPC for better control and security
  • Master Redis Node 1 will be created in AZ-1 of US-West
  • Redis Read Replica Node 2 will be created in AZ-2 of US-West
  • Redis Read Replica Node 3 will be created in AZ-3 of US-West

You can position all 3 Redis nodes in different availability zones for high availability, or you can position the master + Read Replica 1 in AZ1 and Read Replica 2 in AZ2. The latter reduces inter-AZ latency and might give better performance for heavily used clusters.
Step 1: Creating Cache Subnet groups:
To create a cache subnet group, navigate to the ElastiCache dashboard, select Cache Subnet Groups and then click "Create Cache Subnet Group". Add the subnet IDs and the availability zones you need to use for the ElastiCache cluster.
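The same step could be scripted with boto3, as sketched below. The subnet group name, description and subnet IDs are illustrative assumptions; the API call itself is the real create_cache_subnet_group operation.

import boto3

elasticache = boto3.client("elasticache", region_name="us-west-2")

# Hypothetical private subnet IDs spread across three availability zones
elasticache.create_cache_subnet_group(
    CacheSubnetGroupName="redis-private-subnets",
    CacheSubnetGroupDescription="Private subnets for the Redis cluster",
    SubnetIds=[
        "subnet-0aaa1111bbbb22223",   # us-west-2a
        "subnet-0ccc3333dddd44445",   # us-west-2b
        "subnet-0eee5555ffff66667",   # us-west-2c
    ],
)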

 

We have created an Amazon VPC spread across 3 availability zones. In this post we are going to place the Redis master and the 2 Redis replica slaves in these 3 availability zones. Since Redis will mostly be accessed by your application tier, it is better to place the nodes in the private subnets of your VPC.
Step 2: Creating Redis Cache Cluster: 
To create the cache cluster, navigate to the ElastiCache dashboard, select Launch Cache Cluster and provide the necessary details. We are launching it inside Amazon VPC, so we have to select the cache subnet group.
Note: It is mandatory to create the cache subnet group before launch if you need the ElastiCache Redis cluster inside Amazon VPC.
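For reference, the equivalent boto3 call might look like the sketch below. The cluster ID matches the one used later in this walkthrough; the node type, subnet group name and AZ are assumptions.

import boto3

elasticache = boto3.client("elasticache", region_name="us-west-2")

# Launch the primary Redis node inside the VPC via the cache subnet group
elasticache.create_cache_cluster(
    CacheClusterId="redisinsidevpc",               # name used in this walkthrough
    Engine="redis",
    CacheNodeType="cache.m1.small",                # small node, test purposes only
    NumCacheNodes=1,
    CacheSubnetGroupName="redis-private-subnets",  # subnet group from Step 1
    PreferredAvailabilityZone="us-west-2a",
)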

 

For test purposes I have used an m1.small instance for Redis. Since this is a fresh Redis installation, I have not mentioned an S3 bucket from which a persistent Redis snapshot would be used as input. On successful creation of the cache cluster you can see the details in the dashboard.
Step 3: Replication Group Creation:
To create the replication group, select the Replication Groups option from the dashboard and then select "Create Replication Group".

Select the master Redis node "redisinsidevpc" created previously as the primary cluster ID of the cache cluster. Give the replication group ID and description as illustrated below.
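A boto3 sketch of the same step is shown below; the replication group ID and description are assumptions, while the primary cluster ID is the node created in Step 2.

import boto3

elasticache = boto3.client("elasticache", region_name="us-west-2")

# Wrap the existing primary node in a replication group so replicas can be added
elasticache.create_replication_group(
    ReplicationGroupId="redis-replication",
    ReplicationGroupDescription="Redis replication group inside the VPC",
    PrimaryClusterId="redisinsidevpc",   # primary cache cluster created in Step 2
)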

Note: The replication group should be created only after the primary cache cluster node is up and running; otherwise you will get the error shown below.

On successful creation of the replication group you can see the following details. You can observe from the screenshot below that there is only one primary node in US-WEST-2A and zero Redis read replicas attached to it.

Step 4: Adding Read Replica Nodes:
When you select the replication group, you can see the option to add a Redis read replica. We are adding 2 Redis read replicas named Redis-RR1 (in US-WEST-2B) and Redis-RR2 (in US-WEST-2C). Both read replicas point to the master node "redisinsidevpc". Currently we can add up to 5 read replica nodes per Redis master node. This is more than enough to handle thousands of messages per second; if you combine it with Redis pipelining, handling 100K messages per second from a node is a cake walk. A hedged API sketch follows the screenshots.
Adding Read Replica 1 in US-WEST-2B

Adding Read Replica 2 in US-WEST-2C
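With boto3, each read replica could be created as a cache cluster attached to the replication group, roughly as sketched below; the replica IDs and AZs mirror the ones used in this walkthrough.

import boto3

elasticache = boto3.client("elasticache", region_name="us-west-2")

# Each read replica is a cache cluster attached to the replication group
for replica_id, az in [("redis-rr1", "us-west-2b"), ("redis-rr2", "us-west-2c")]:
    elasticache.create_cache_cluster(
        CacheClusterId=replica_id,
        ReplicationGroupId="redis-replication",   # replicate from the primary node
        PreferredAvailabilityZone=az,
    )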

On successful creation you can see the following details of the replication group in the dashboard. Now you can see 3 Redis nodes listed, with the number of read replicas shown as 2. Placing the read replicas and the master node in multiple AZs increases high availability and protects you from node and AZ level failures. In our sample tests, inter-AZ replication deployments had <2 seconds of replication lag for massive writes on the master, and <1 second of replication lag between master and slave inside the same AZ. We pumped ~100K messages per second for a few minutes on an m1.large Redis instance cluster.
In the event you need additional read scalability, I recommend adding more read replica slaves to the master.
In your application tier you need to use the primary endpoint "redis-replication.qcdze2.0001.usw2.cache.amazonaws.com:6379" shown below to connect to Redis.
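For example, with the redis-py client (an assumption; any Redis client library will do) the application would connect to the primary endpoint like this:

import redis  # assumes the redis-py client library is installed

# Primary (master) endpoint of the replication group; replace with your own
r = redis.StrictRedis(
    host="redis-replication.qcdze2.0001.usw2.cache.amazonaws.com",
    port=6379,
)

r.set("session:1001", "user-data")
print(r.get("session:1001"))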

If you need to delete/reboot/modify the cluster, you can do so through the options available here.

Step 5: Promoting the Read replica:

You can also promote any node to be the primary cluster using the Promote/Demote option. There will be only one primary node.
Note: This step is not part of the cluster creation process.

This promotion has to be carried out with caution and a proper understanding of data consistency.

Post was co-authored with Senthil, 8KMiles.

Billion Messages – Art of Architecting scalable ElastiCache Redis tier

Whenever we design highly scalable architectures on AWS running thousands of application servers and supporting millions of requests, the usage of NoSQL solutions has become an inevitable part. One such solution we have been using for years on AWS is Redis. We love Redis.
AWS introduced ElastiCache Redis in 2013 and we started using it since it eased the management and operational effort. In this article I am going to share my experience designing large scale Redis tiers supporting billions of messages per day on AWS: a step by step guide on how to deploy one, the implications you face at scale, and best practices to be adopted while designing sharded + replicated Redis tiers.

Since we needed to support billions of message requests per day, and the volume was growing:

  • the ElastiCache Redis tier was designed with partitions (shards) to scale out as the customer grows
  • the ElastiCache Redis tier was designed with replica slaves for HA and read scaling as the read volumes grow

When your application is growing at a rapid pace and lots of data is created every day, you cannot keep increasing (scaling up) the size of the ElastiCache node. At some point you will hit the maximum memory capacity of your EC2 instance and you will be forced to partition. Partitioning is the process of splitting your key-value data across multiple ElastiCache Redis instances, so that every instance contains only a subset of your key-value pairs. It allows for much larger ElastiCache Redis data stores, using the sum of the memory of many ElastiCache Redis nodes. It also allows you to scale computational power across multiple cores and multiple EC2 instances, and network bandwidth across multiple EC2 network adapters. There are two widely used partition/shard implementation techniques available for an ElastiCache Redis tier:
Technique 1) Client-side partitioning means that the Redis clients directly select the right ElastiCache Redis node to write or read a given key. Many Redis clients implement client-side partitioning; choose the right one wisely. (A minimal sketch of the idea follows Technique 2.)
Technique 2) Proxy-assisted partitioning means that your clients send requests to a proxy that is able to speak the Redis protocol, which in turn sends requests directly to the right ElastiCache Redis instance. The proxy makes sure to forward each request to the right Redis instance according to the configured partitioning schema. Currently the most widely used proxy-assisted partitioning tool is Twemproxy, written by Manju Raj of Twitter (GitHub link: https://github.com/twitter/twemproxy). Twemproxy is a proxy developed at Twitter for the Memcached ASCII and Redis protocols. It supports automatic partitioning among multiple Redis instances and is currently the suggested way to handle partitioning with Redis.
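To make Technique 1 concrete, here is a minimal client-side partitioning sketch using simple modulo hashing with the redis-py client. The shard endpoints are placeholders, and real implementations usually prefer consistent hashing (see the Ketama discussion later) over plain modulo.

import binascii
import redis  # assumes the redis-py client library

# ElastiCache Redis node endpoints acting as partitions (illustrative values)
shards = [
    redis.StrictRedis(host="redis-shard-1.example.cache.amazonaws.com", port=6379),
    redis.StrictRedis(host="redis-shard-2.example.cache.amazonaws.com", port=6379),
]

def node_for(key):
    # Every client must use the exact same hashing scheme for keys to resolve
    # to the same shard
    return shards[binascii.crc32(key.encode()) % len(shards)]

node_for("user:42").set("user:42", "some-value")
print(node_for("user:42").get("user:42"))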

In this article we are going to explore the proxy-assisted partitioning technique in detail, for a highly scalable and available Redis tier.

Welcome to Twemproxy

Twemproxy (nutcracker) is a fast, single-threaded proxy supporting the Memcached ASCII protocol and, more recently, the Redis protocol.

Installing Twemproxy:

Download the Twemproxy package.
wget http://twemproxy.googlecode.com/files/nutcracker-0.3.0.tar.gz
tar -xf nutcracker-0.3.0.tar.gz
cd nutcracker-0.3.0
./configure
make
make install

Configuration:

Twemproxy (nutcracker) can be configured through a YAML file specified by the -c or --conf-file command-line argument at process start. The configuration file is used to specify the server pools and the servers within each pool that nutcracker manages. The configuration file supports the following keys:

• listen: The listening address and port (name:port or ip:port) for this server pool.
• hash: The name of the hash function.
• hash_tag: A two character string that specifies the part of the key used for hashing, e.g. "{}" or "$$". A hash tag enables mapping different keys to the same server as long as the part of the key within the tag is the same.
• distribution: The key distribution mode.
• timeout: The timeout value in msec that we wait to establish a connection to the server or to receive a response from a server. By default, we wait indefinitely.
• backlog: The TCP backlog argument. Defaults to 512.
• preconnect: A boolean value that controls if nutcracker should preconnect to all the servers in this pool on process start. Defaults to false.
• redis: A boolean value that controls if a server pool speaks redis or memcached protocol. Defaults to false.
• server_connections: The maximum number of connections that can be opened to each server. By default, we open at most 1 server connection.
• auto_eject_hosts: A boolean value that controls if server should be ejected temporarily when it fails consecutively server_failure_limit times. See liveness recommendations for information. Defaults to false.
• server_retry_timeout: The timeout value in msec to wait for before retrying on a temporarily ejected server, when auto_eject_host is set to true. Defaults to 30000 msec.
• server_failure_limit: The number of consecutive failures on a server that would lead to it being temporarily ejected when auto_eject_host is set to true. Defaults to 2.
• servers: A list of server address, port and weight (name:port:weight or ip:port:weight) for this server pool.

For More details Refer: https://github.com/twitter/twemproxy

Running and Accessing Twemproxy 

To start the proxy, just run the command "nutcracker" with the configuration file path specified, or with the file in its default path (conf/nutcracker.yml).
Based on the configuration, Twemproxy will run and listen on the configured address. Configure your application to point to that address and port instead of the Redis cluster.
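For example, with the redis-py client (an assumption; any Redis protocol client works) the application would simply point at the local Twemproxy listener, using the listen address from the reference configuration later in this article:

import redis  # assumes the redis-py client library

# The application talks to the Twemproxy listener instead of individual
# ElastiCache Redis nodes (127.0.0.1:22122 in the reference configuration)
r = redis.StrictRedis(host="127.0.0.1", port=22122)
r.set("greeting", "hello-via-twemproxy")
print(r.get("greeting"))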

Twemproxy Deployment models:

We usually deploy Twemproxy in one of the following models in AWS:

Model 1: Twemproxy as a separate proxy tier: In this model Twemproxies are deployed on separate EC2 instances, and the application tier is configured to point to the Twemproxies. The Twemproxy tier in turn maintains the mappings to the ElastiCache Redis nodes. It is better to use instances with very good I/O bandwidth for the Twemproxy tier in AWS. In case you feel the instance CPU is underutilized, you can launch multiple Twemproxy processes inside the same EC2 instance as well.

Though the above model looks clean and efficient, there are optimizations that can be applied to this architecture:
What happens when twemproxy01 fails? How will the application server instances know about it?
Why should I pay extra for Twemproxy EC2 instances? Can this be minimized?

Model 2 : Twemproxy bundled with application tier EC2’s: 

In this model Twemproxies are bundled into the same box as the application server EC2 itself. Since two Twemproxies are not aware of each other's existence, it is easy to architect this model even with the app tier in Auto Scaling mode. Every application server talks to the local Twemproxy deployed in the same box; this saves cost and avoids the complexity of managing an additional tier as well.

Reference ElastiCache Redis + Twemproxy  deployment:

(This is a reference deployment; the same can be scaled out to hundreds of nodes depending upon the need. It is a Redis partitioned + replicated setup.)
1. Two ElastiCache Redis nodes in AWS (twem01 and twem02)
2. A replication group for each ElastiCache Redis node (twem01-rg and twem02-rg, with one read replica each)
3. Two Twemproxy servers running on separate EC2 instances (twemproxy01 and twemproxy02)
Once the above setup is done, note down the endpoints. We will be using the replication group endpoints as the ElastiCache Redis endpoints for Twemproxy.

ElastiCache Redis Endpoints:

twem01-twem01.qcdze2.0001.usw2.cache.amazonaws.com:6379
twem02-twem02.qcdze2.0001.usw2.cache.amazonaws.com:6379
ElastiCache Redis Replication endpoints:

twem01-rg.qcdze2.ng.0001.usw2.cache.amazonaws.com:6379
twem02-rg.qcdze2.ng.0001.usw2.cache.amazonaws.com:6379

To test Twemproxy we pumped the following keys:
KV data through twemproxy01 (keys 1-2000)
KV data through twemproxy02 (keys 2001-4000)
A hedged sketch of the key-pumping script is shown below.
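The following sketch (using the redis-py client, an assumption) generates keys with values in the same "<key>-data" form used in the tests below; run it once per proxy instance with the appropriate key range.

import redis  # assumes the redis-py client library

# Run with keys 1-2000 on twemproxy01 and keys 2001-4000 on twemproxy02
r = redis.StrictRedis(host="127.0.0.1", port=22122)  # local Twemproxy listener

start, end = 1, 2000   # change to 2001, 4000 on the second proxy instance
for key in range(start, end + 1):
    r.set(str(key), "{0}-data".format(key))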

Configuration:
beta:
  listen: 127.0.0.1:22122
  hash: fnv1a_64
  hash_tag: "{}"
  distribution: ketama   # Consistent Hashing
  auto_eject_hosts: false
  timeout: 5000
  redis: true
  servers:
   - twem01-rg.qcdze2.ng.0001.usw2.cache.amazonaws.com:6379:1 server1
   - twem02-rg.qcdze2.ng.0001.usw2.cache.amazonaws.com:6379:1 server2

Test 1: Testing key accessibility. Testing the "GET" operation across both Twemproxy instances for a few sample keys.

Fetch 4 Keys spread across 4000 KV data from Twemproxy01  EC2 instance:
[root@twemproxy01 redish]# src/redis-cli -h 127.0.0.1 -p 22122
redis 127.0.0.1:22122> get 1000
"1000-data"
redis 127.0.0.1:22122> get 2000
"2000-data"
redis 127.0.0.1:22122> get 3000
"3000-data"
redis 127.0.0.1:22122> get 4000
"4000-data"
Fetch 4 Keys spread across 4000 KV data from Twemproxy02  EC2 instance:
[root@twemproxy02 redish]# src/redis-cli -h 127.0.0.1 -p 22122
redis 127.0.0.1:22122> get 1000
"1000-data"
redis 127.0.0.1:22122> get 2000
"2000-data"
redis 127.0.0.1:22122> get 3000
"3000-data"
redis 127.0.0.1:22122> get 4000
"4000-data"

From the above test it is evident that all 4000 KV entries inserted through both Twemproxies are accessible from both Twemproxies (testing a sample), even though the proxies are not aware of each other. This is because the same hashing and key-mapping translation is done at the Twemproxy level.

Test 2: Testing the ElastiCache Redis Availability and Fail over mechanism:

We are going to promote the read replica of the twem01-rg replication group to be the primary Redis node. After the promotion we are going to test:

 

  1. Whether the Twemproxy is able to recognize the newly promoted master
  2. Whether the sample KV data was safely replicated and is still accessible, to ensure the failover was successful.

To promote an ElastiCache Redis slave, just click the Promote action and confirm, or automate it using the API. During the promotion of the read replica to master we observed that the transition happens very quickly and there is no timeout, but the response time for queries is about 4-5 seconds for about 3-4 minutes during the switch over. In the Twemproxy configuration we can set the timeout value; it needs to be set appropriately so that there are no refused connections during the switch over. For this sample test we have set it to 5000.

Repeat Test 1:

[root@twemproxy01 redish]# src/redis-cli -h 127.0.0.1 -p 22122
redis 127.0.0.1:22122> get 1000
"1000-data"
redis 127.0.0.1:22122> get 2000
"2000-data"
redis 127.0.0.1:22122> get 3000
"3000-data"
redis 127.0.0.1:22122> get 4000
"4000-data"
Fetch 4 Keys spread across 4000 KV data from Twemproxy02  EC2 instance:
[root@twemproxy02 redish]# src/redis-cli -h 127.0.0.1 -p 22122
redis 127.0.0.1:22122> get 1000
"1000-data"
redis 127.0.0.1:22122> get 2000
"2000-data"
redis 127.0.0.1:22122> get 3000
"3000-data"
redis 127.0.0.1:22122> get 4000
"4000-data"

From the above test it is evident that all 4000 KV entries were replicated properly between the master and slave nodes, and the transition from slave to master happened successfully with all the data.
Reporting

Nutcracker exposes stats at the granularity of server pools and servers per pool through the stats monitoring port. The stats are essentially JSON-formatted key-value pairs, with the keys corresponding to counter names. By default, stats are exposed on port 22222 and aggregated every 30 seconds.
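A minimal sketch for pulling these stats, assuming the default stats address and port and that the daemon writes a single JSON document per connection:

import json
import socket

def nutcracker_stats(host="127.0.0.1", port=22222):
    # Read everything the stats listener sends until it closes the connection
    data = b""
    with socket.create_connection((host, port)) as s:
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
    return json.loads(data)

stats = nutcracker_stats()
print(stats)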

Some best practices while designing a highly scalable + available ElastiCache Redis tier:

Practice 1 : Reduce the Number of Connections and pipeline messages:

Whenever an application instance gets a request to get/put a value to the ElastiCache Redis node, the client makes a connection to the Redis tier. Imagine a heavy-traffic site: thousands of incoming requests translate to thousands of connections from the application instances to the Redis tier. Now, when you add auto scaling to your application tier and have a few hundred servers scaled out, imagine the connection complexity and overhead this architecture brings to the ElastiCache Redis tier.

The best practice is to minimize the number of connections made from your application instance to your ElastiCache Redis node. Use Twemproxy in bundled mode with the application EC2 instance; this keeps the process in close proximity and reduces the connection overhead. Secondly, Twemproxy internally uses minimal connections to the ElastiCache Redis instance by proxying multiple client connections onto one or a few server connections.
Redis also supports pipelining, where multiple requests can be pipelined and sent on a single connection. In a simple test using large application and ElastiCache nodes we were able to process 125K messages/sec in pipeline mode; now imagine what you could achieve on bigger instance types on AWS. The connection-minimization architecture of Twemproxy makes it ideal for pipelining requests and responses, and hence saving on round trip time. For example, if Twemproxy is proxying three client connections onto a single server and we get the requests 'get key\r\n', 'set key 0 0 3\r\nval\r\n' and 'delete key\r\n' on these three connections respectively, Twemproxy will try to batch these requests and send them as a single message onto the server connection. A minimal pipelining sketch follows.
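The sketch below shows client-side pipelining with the redis-py client (an assumption) against the local Twemproxy listener; transactions are disabled because Twemproxy does not support MULTI/EXEC.

import redis  # assumes the redis-py client library

r = redis.StrictRedis(host="127.0.0.1", port=22122)  # Twemproxy listener

# Batch many writes into a few round trips instead of one round trip per request
pipe = r.pipeline(transaction=False)  # no MULTI/EXEC through Twemproxy
for i in range(1000):
    pipe.set("msg:%d" % i, "payload-%d" % i)
pipe.execute()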

Note: It is important to note that the "read my last write" constraint doesn't necessarily hold true when Twemproxy is configured with server_connections: > 1. Let us consider a scenario where Twemproxy is configured with server_connections: 2. If a client makes pipelined requests with the first request in the pipeline being set foo 0 0 3\r\nbar\r\n (write) and the second request being get foo\r\n (read), the expectation is that the read of key foo would return the value bar. However, with a configuration of two server connections it is possible that the write and the read are sent on different server connections, which means their completion could race with one another. In summary, if the client expects the "read my last write" constraint, either configure Twemproxy to use server_connections: 1 or use clients that only make synchronous requests to Twemproxy.

Practice 2:  Configure Auto Ejection and Hashing combination properly

Design for failure is the mantra of cloud architecture. Failures are common when things are distributed at scale. Though partitioning when using ElastiCache Redis as a data store or as a cache is conceptually the same in broad terms, there is a huge operational difference in large scale systems. When you are using ElastiCache Redis as a data store you need to be sure that a given key always maps to the same instance, whereas if you are using ElastiCache Redis as a cache and a given node is not available, you can always start afresh using a different node in the hash ring with consistent hashing implementations.
To be resilient against failures, it is recommended that you configure auto_eject_hosts to false when you treat Redis as a data store and to true when you treat Redis as a cache.
resilient_pool:
  auto_eject_hosts: true
  server_retry_timeout: 30000
  server_failure_limit: 3
Enabling auto_eject_hosts ensures that a dead ElastiCache Redis node can be ejected from the hash ring after server_failure_limit consecutive failures have been encountered on that node. A non-zero server_retry_timeout ensures that we don't incorrectly mark a node as dead forever, especially when the failures were really transient. The combination of server_retry_timeout and server_failure_limit controls the tradeoff between resiliency to permanent and transient failures.
Note that an ejected node will not be included in the hash ring for any requests until the retry timeout passes. This will lead to data partitioning, as keys originally on the ejected node will now be written to another node still in the pool. If ElastiCache Redis is used as a cache (in memory), then in the event of a Redis node going down the cache data will be lost. This cache miss can cascade performance problems to other tiers and altogether bring down your system on the cloud. To minimize KV cache misses, you can design your hash ring with Ketama hashing on the Redis proxy. This will minimize cache misses in the event of a cache node failure, and it also decreases the overall rebalancing needed in your Redis tier. In addition to helping with availability problems, Redis proxy + Ketama can also help your Redis farm scale out and scale down easily with minimal cache misses. To know more about Ketama on ElastiCache refer to http://harish11g.blogspot.com/2013/01/amazon-elasticache-memcached-internals_8.html.
The diagram below illustrates an ElastiCache Redis cache farm with a consistent hash ring.
In short, to minimize cache misses when auto-eject is set to true, it is recommended to use Ketama hashing (a consistent hashing algorithm) in your Twemproxy configuration.
ElastiCache Redis as a Data Store:

What if the data stored in your cache is important and needs to be persisted across node failures and launches? What if the data stored in your cache cannot be lost and needs to be replicated and promoted during failures?
Welcome to ElastiCache Redis as a data store. ElastiCache Redis offers features to persist the in-memory cache data to disk and also to replicate it to a slave for high availability. If ElastiCache Redis is used as a store (persistent), you need to keep the map between keys and nodes fixed, and a fixed number of nodes. Since the stored data is important when you treat ElastiCache Redis as a data store, in the event one Redis node goes down you should have an immediate standby up and running in minutes. You can architect an ElastiCache Redis master with one or more replication slaves launched in a different AZ from the master for high availability in AWS. In the event of a master node failure or master AZ failure, a slave Redis node can be promoted in minutes to act as the master. This whole high availability design keeps the number of nodes on the hash ring stable and simple; otherwise, you will end up building a system to rebalance the keys between nodes (which is not easy) whenever nodes are added or removed during outages. In addition to the above, ElastiCache Redis supports partial resynchronization with slaves: if the connection between a master node and a slave node is momentarily broken, the master accumulates data destined for the slave in a backlog buffer. If the connection is restored before the buffer becomes full, a quick partial resync is done instead of a potentially longer full resync. This really saves network bandwidth during momentary failures.
In large scale systems you will often find that some partitions are more heavily used than others. If the usage is read-heavy in nature, you can add up to 5 read replicas to the ElastiCache Redis master of that partition. Since these replicas are used only for reads, they do not affect the hash ring structure. But Twemproxy lacks support for read scaling with Redis replicas, so when you face this problem you will have to scale up the capacity (instance/node type) of the master and slave of that partition alone.

If you are using ElastiCache Redis as a data store, it is recommended to keep the "auto_eject_hosts" property false in Twemproxy, so that in the event of a Redis node failure it is not ejected from the hash ring. The hash ring can be built with either the ketama or modula hash algorithm, since in the event of a primary node failure the slave is going to be promoted and the ring structure is always maintained. But if you feel there is a strong possibility that the number of primary node partitions will grow, or that major failures will occur, it is better to choose the ketama hash ring from the beginning. The diagram below illustrates the architecture.

Practice 3: Configure the Buffer properly:

All memory for incoming requests and outgoing responses is allocated in mbufs in Twemproxy. Mbufs enable zero copy for requests and responses flowing through the proxy. By default an mbuf is 16KB in size, and this value can be tuned between 512 bytes and 16MB using the -m or --mbuf-size=N argument. Every connection has at least one mbuf allocated to it. This means that the number of concurrent connections Twemproxy can support is dependent on the mbuf size: a small mbuf allows us to handle more connections, while a large mbuf allows us to read and write more data to and from kernel socket buffers. Large scale web/mobile applications involving millions of hits might have small request/response sizes and lots of concurrent connections to handle in their backend. In such scenarios, when Twemproxy is meant to handle a large number of concurrent client connections, you should set the chunk size to a small value like 512 bytes to 1KB using the -m or --mbuf-size=N argument.

Practice 4: Configure proper timeouts
It is always a good idea to configure the Twemproxy timeout: for every server pool, rather than relying purely on client-side timeouts. E.g.:

resilient_pool_with_timeout:
  auto_eject_hosts: true
  server_retry_timeout: 30000
  server_failure_limit: 3
  timeout: 400
Relying only on client-side timeouts has the adverse effect that the original request may have timed out on the client-to-proxy connection but still be pending and outstanding on the proxy-to-server connection. This gets further exacerbated when the client retries the original request.

Benefits of using Twemproxy for Redis Scaling

  • Avoids reinventing the wheel. Thanks, Manju Raj (Twitter).
  • Reduces the number of connections to your cache servers by acting as a proxy
  • Shards data automatically between multiple cache servers
  • Supports consistent hashing with different strategies and hashing functions
  • Can be configured to disable nodes on failure
  • Can run in multiple instances, allowing clients to connect to the first available proxy server
  • Pipelines and batches requests, and hence saves on round trips

Disadvantages of Partitioning Model:

Point 1) Operations involving multiple keys are usually not supported. For instance, you can't perform the intersection between two sets if they are stored in keys that are mapped to different Redis instances (actually there are ways to do this, but not directly). Redis transactions involving multiple keys cannot be used.
Point 2) The partitioning granularity is the key, so it is not possible to shard a dataset with a single huge key, like a very big sorted set. Ideally in such cases you should scale up the particular Redis master-slave pair to a larger EC2 instance, or programmatically split up the sorted set.
Point 3) When partitioning is used, data handling is more complex: for instance, you have to handle multiple RDB/AOF files, and to make a backup of your data you need to aggregate the persistence files/snapshots from multiple EC2 Redis slaves.
Point 4) Architecting a partitioned + replicated ElastiCache Redis tier is not complex. What is more complex is supporting transparent rebalancing of data, with the ability to add and remove nodes at runtime. Systems like client-side partitioning and proxies don't support this feature. However, a technique called presharding helps in this regard, with limitations. Presharding technique: since Redis is lightweight, you can start with a lot of EC2 instances from the beginning itself. For example, if you start with 32 or 64 EC2 instances (micro or small cache node instance types) as your node capacity, it will provide enough room to keep scaling the capacity as your data storage needs increase. It is not a highly recommended technique, but it can still be used in production if your growth pattern is very predictable.

Future of highly scalable + available Redis tiers -> Redis Cluster

Redis Cluster is the preferred way to get automatic sharding and high availability. It is currently not production ready. Once Redis Cluster and its clients are available on Amazon ElastiCache, it will be the de facto standard for Redis partitioning. It uses a mix of query routing and client-side partitioning.

References:
http://redis.io/documentation
https://github.com/twitter/twemproxy

This article was co-authored with Senthil, 8KMiles.

451 Research Report: 8KMiles crosses the chasm in cloud-based identity federation

Analyst: Wendy Nather 22 Nov, 2013

Original Report URL from 451 Research website : https://451research.com/report-short?entityId=79384

The full report is published below.

8KMiles has been heavily invested in cloud integration. As one of Amazon Web Services’ Premier Consulting Partners for 2013, it has helped customers stand up everything from Amazon’s Elastic Block Store to its S3 and Relational Database services. So it made sense to continue to add cloud integration services in the identity and access management (IAM) space. To this end, the company acquired Sunnyvale, California-based FuGen Solutions in May to obtain its Cloud Identity Broker and Multi-Domain Identity Services Platform.

The 451 Take
A combination of design and operations support helps 8KMiles, and its subsidiary FuGen makes on-ramping of federated identity partners easier, particularly for enterprises that don’t have the infrastructure or expertise to figure it out themselves. A migration opportunity can become a hosting opportunity, while a hosting opportunity could turn into the kind of identity and attribute exchange that is still needed. Other efforts are underway to build such an exchange, but 8KMiles and FuGen could get out ahead of it – although it might help if they settled on one company name to promote the unity they’re offering.

Context
Did identity federation get any easier when the execution venue moved from legacy systems to the cloud? Actually, that’s a trick question, because most of it hasn’t moved – it’s just been stretched. Even without the dynamism and scale requirements of the cloud, an enterprise’s federation efforts with its partners suffer from complexity that many organizations aren’t equipped to handle.
There are many types of federation, and only some of them are binary: that is, one organization completely trusts the other one, so that it accepts any identity offered. A common example is federation between a health insurance provider and a partner that provides pharmacy benefits: there can be a
one-to-one acceptance because it’s the same business case (benefits for an insured client) and the same level of security risk. Because it’s the same business case, both sides can validate the user in the same way and no additional validation is needed. A user can be passed through single sign-on from one site to another in a fairly seamless fashion.

However, not all federation is binary. Take the example of a state education agency: it has thousands of school district employees that need to use the agency’s applications. The agency would like to have the districts set up access for those users, but it is still legally on the hook to approve every access. This means that the agency has to rely on some assertions by the district, but must take an additional step of its own for validation and approval before it can fully accept that user into its systems. These validation workflows often use attributes of the user’s identity: whether the user is an employee of the district (which only the district is authoritative about), which roles the user is assigned (which might be determined by the agency), or whether the user is also a member of a different organization (such as working for a second district).

Attributes may sound complicated, and the business rules behind them can be. But an attribute is really the reason why you’re allowing access to that user. You’re allowing access because the insurance provider says this is a registered subscriber; you’re allowing it because the Department of Motor Vehicles (DMV) asserts that this is a licensed driver; you’re allowing it because the user is a registered PayPal customer. And you can only rely on that attribute when it comes from the right authoritative party: only PayPal can say with certainty who its current customers are.

The ecosystem of attributes has yet to be addressed in a coherent way. Many websites and applications will be happy to accept the credentials of a Facebook user, because they only care that someone at Facebook (presumably) validated the user account. That’s all the validation they need. But that’s not enough for many other organizations, especially where legal and regulatory issues are on the line. But if you could get all these authoritative parties in one place…

This is where 8KMiles and FuGen come in.
Founded in 2007, 8KMiles is led by Suresh Venkatachari, its chairman and CEO, who also founded consulting firm SolutionNET. The company has 140 employees among its locations in California, Virginia, Canada and India. In May, 8KMiles acquired FuGen Solutions for $7.5m, with the target becoming a subsidiary.

Products and services
8KMiles offers both consulting services (cloud migration, engineering and application development) and frameworks for assembling secure cloud systems. The company provides a turnkey architecture for implementing a secure private cloud, including firewall and DDoS protection services, secure remote access, system administrator access and monitoring, and disk encryption. This can be deployed either as an Amazon virtual private cloud or in an organization’s own datacenter. 8KMiles similarly offers a secure enterprise collaboration implementation that combines Alfresco’s content management and Amazon’s RDS. An AWS Direct Connect package contains both design and management of the network, points of presence, and security.

When 8KMiles bought FuGen, it obtained both a cloud identity brokerage and the target’s Multi-Domain Identity Services Platform (MISP). The platform supports the partner onboarding and federation management activities, as well as what the vendor calls last-mile single-sign-on integration to a centralized hub for smaller customers that don’t have legacy IAM systems to connect, or who don’t have the expertise to put everything together. The platform is vendor-agnostic in that it can be used with any IAM provider’s systems to connect and federate partners. The authentication protocols supported include SAML 1.1, SAML 2.0, WS-Federation, WS-Trust, OpenID and OAuth. MISP comes with rules-based validation and reporting, criteria certification, monitoring and logging, and storage of scenarios, data messages, templates and certification reports.

One of the strengths of the broker and platform offerings is that FuGen and 8KMiles staff can duplicate the customer’s complex federation requirements in their virtualized environment. The vendor can build the hub and test all of the integrations with the partners’ systems in a lab setting. Once it’s been assembled and shown to work properly, the company can walk the customer through implementing the working version on its own systems, providing instructions down to the level of the configuration file changes. In cases where the customer does not have specialized IAM expertise or a test network, FuGen can provide both.

These services are available for community providers, SaaS application firms, identity and attribute vendors, and many others. FuGen’s customers range from one of the largest financial services institutions to media providers, large IT suppliers and defense contractors (Amazon AWS customers use FuGen’s federated identity features).

The idea of creating a vendor-agnostic federation space is a good one – as the number of partners grows with which FuGen has already built integrations, the onboarding for future customers goes more quickly. For example, if FuGen has already done the hard work of figuring out connectors for a large payment provider that happens to use Oracle for an IAM system and Ping Identity for cloud-based SSO, then any other partners that want to federate with that large payment provider using the same products will have most of the work already done. The network effect comes into play here: the more partners FuGen integrates, the stronger its offerings grow as a cloud-based ID federation service.

For the reasons described above, many enterprises end up relying on a varied set of IDs and attributes, all coming from different partners. Building a central ID and attribute exchange could speed federation projects for government, healthcare, finance and other verticals if FuGen can pre-integrate those providers. When businesses can join a virtual marketplace where they can get the attributes they need from their state DMV, PayPal and business process outsourcer, and all of the integration work is done for them, then the community has a good chance of growth. Many identity and attribute exchange projects are already underway (and FuGen is already part of some of these open initiatives) – the one advantage is that the company helps facilitate the plumbing, not just the framework. Also, this isn’t just about the cloud: enterprises can still federate with one another using their own systems, with FuGen’s services to set it up. The one hitch is that this is a potential that hasn’t been fully realized. 8KMiles and FuGen would have to figure out how to charge for this service, since charging by ID or partner account might be too dynamic to support a licensing structure. (This isn’t to say that a cloud provider can’t charge dynamically, it’s just that determining how many IDs are in use at any given time is a tricky proposition.) The vendor could charge an onboarding project fee, but services after that – such as monitoring, support, troubleshooting and integration tweaks – would need a different incremental pricing structure. If a large provider is hooked into the hub, and new partners join it, does the provider get charged more, or just the new partners? Identity and attribute management are both still developing areas of technology, and with the cloud as a delivery method, many aspects have to be reconsidered.

Competition
The term ‘identity broker’ is unintentionally confusing, since it is most often used to describe technology that helps intermediate an enterprise’s portfolio of ID stores and services, usually to provide single sign-on for that enterprise’s users or its customers. This is not the same as a third-party identity exchange, such as the kind envisioned by the Identity Ecosystem Steering Group (whose website, incidentally, is powered by Ping). There is also a lot of discussion in the IAM community about who can and should act as identity providers, and the candidates include social media such as Facebook or Google, financial institutions and telcos, since all of these appear to have the largest user bases.

However, none of these identity providers in and of themselves can supply all of the assurance and validation that different business cases require. It doesn’t matter whether Verizon has verified a user for phone service if a relying party has to figure out whether the user is really the same one who walked into the emergency room last night. Some organizations have much stronger requirements for identity assurance, and will have to assemble their own validation lists from multiple ID providers.

Not only does the ID and attribute exchange need to be vendor-agnostic, it also needs to be easy to join. This is where the pre-integration and onboarding services are crucial. Customers don’t have to let FuGen host the hub, but it helps with the kind of complex troubleshooting that federated IAM can sometimes require. The opportunity for FuGen is that it can be a broker for the brokers, so to speak: each enterprise in an ideal world would have just one interface to expose to the world, but those interfaces still need to be matched up with the other ones.

The term ‘broker’ is confusing, but if we focus on ‘exchange,’ we get closer to our original meaning and can consider the competition. SecureKey Technologies was recently awarded a contract by the US Postal Service to create the Federal Cloud Credential Exchange. Criterion Systems was one of the National Strategy for Trusted Identities in Cyberspace pilot grant recipients in 2012, and is building its ID DataWeb Attribute Exchange Network, with an ecosystem of technology partners and relying parties such as Ping, CA Technologies, Fixmo, Verizon, Experian and Wave Systems. If firms like these manage to build a working exchange, it could rival what 8KMiles and FuGen can do. Again, the latter are helping customers set up the integration, not just acting as a provider, so the operational features of their offering set it apart from these exchange projects. The race will be to see who can collect the largest amount of trusted resources and participants in a broadly working exchange. Vendor neutrality and open standards will play a role, but so will user-friendliness. If FuGen can offer both the onramp services and the day-to-day operation in a way that preserves trust, it could have the magic formula.

SWOT Analysis

Strengths
As a cloud broker, 8KMiles expanded its repertoire with the acquisition of FuGen. Identity management is certainly a key part of cloud migration and operation, and FuGen’s virtualized lab environment helps it work out all of the bugs in a complex identity federation system without impacting the customer.

Weakness
FuGen may be known in the IAM industry, particularly due to its participation in public initiatives, but customers may find the name too confusing alongside 8KMiles (neither name really says what the company does). It also has a lot of potential in supporting an identity and attribute exchange, but that potential needs to be realized.

Opportunities
Nobody has really figured out federation yet. Even though some straightforward, homogeneous business use cases are working fine, the more complicated ecosystems are still in the committee/framework/pilot stages. If 8KMiles/FuGen can onramp enough critical-mass partners, it could become a de facto hub before these committees can turn around.

Threats
Vendors such as SecureKey and Criterion are building exchanges too, although they’re in the early stages.
8KMiles/FuGen will also be confused with many other cloud IAM technology vendors due to the misuse of the term broker.

Analyst(s): Wendy Nather, 451 Research

Comparison Analysis: Amazon ELB vs HAProxy EC2

In this article I have analysed Amazon Elastic Load Balancer (ELB) and HAProxy (a popular load balancer on AWS infrastructure) across the following production scenarios and fitment aspects:

Algorithms: ELB provides Round Robin and Session Sticky algorithms, routing only to healthy EC2 instances. HAProxy offers a variety of algorithms such as Round Robin, Static-RR, Least Connection, source, uri and url_param. For most production cases Round Robin and Session Sticky are more than enough, but if you require an algorithm like Least Connection you currently have to lean towards HAProxy. AWS might add such algorithms to its load balancer in the future.
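For illustration, here is a minimal sketch using the boto3 SDK of how duration-based session stickiness can be enabled on a Classic ELB; the load balancer name and cookie period are placeholders, not values from this article.

# Minimal sketch: enable duration-based session stickiness on a Classic ELB.
# "my-elb" and the cookie period are placeholders.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

# Create a cookie-based stickiness policy that expires after 5 minutes.
elb.create_lb_cookie_stickiness_policy(
    LoadBalancerName="my-elb",
    PolicyName="sticky-5min",
    CookieExpirationPeriod=300,
)

# Attach the policy to the HTTP listener on port 80.
elb.set_load_balancer_policies_of_listener(
    LoadBalancerName="my-elb",
    LoadBalancerPort=80,
    PolicyNames=["sticky-5min"],
)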

Spiky or Flash Traffic: Amazon ELB is designed to handle virtually unlimited concurrent requests per second under a gradually increasing load pattern; it is not designed to absorb a sudden heavy spike or flash traffic. For example, imagine an e-commerce website whose traffic climbs gradually to thousands of concurrent requests/sec over several hours, versus use cases like a mass online exam, a GILT-style flash sale or a 3-hour launch campaign that expect a 20K+ concurrent requests/sec spike within minutes; Amazon ELB will struggle with the latter load-volatility pattern. If such sudden spikes are infrequent you can pre-warm the ELB; otherwise you need to look at alternative load balancers like HAProxy on AWS infrastructure. If you expect a sudden surge of traffic, you can keep X HAProxy EC2 instances provisioned and running in advance.

Gradually Increasing Traffic: Both Amazon ELB and HAProxy can handle gradually increasing traffic. But when your needs become elastic and traffic grows within a day, you must add new HAProxy EC2 instances, either manually or through automation, whenever a threshold is breached, and remove them from the load balancing tier when load drops. Avoiding this manual effort means engineering your own automation scripts and programs. Amazon has already automated this elasticity problem inside the ELB tier; you only need to configure and use it.

Protocols: Currently Amazon ELB supports only the HTTP, HTTPS (secure HTTP), SSL (secure TCP) and TCP protocols, on TCP ports 25, 80, 443 and 1024-65535. If RTMP or an HTTP streaming protocol is needed, you have to bring Amazon CloudFront CDN into your architecture. HAProxy supports both TCP and HTTP modes. When a HAProxy EC2 instance works in pure TCP mode, a full-duplex connection is established between clients and servers and no layer-7 inspection is performed; this is the default mode and can be used for SSL, SSH, SMTP etc. The current HAProxy 1.4 release does not support HTTPS natively, so you need Stunnel, Stud or Nginx in front of HAProxy to do the SSL termination. HAProxy 1.5-dev12 ships with SSL support and should become production ready soon.
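As a rough illustration of the listener protocols discussed above, the following boto3 sketch creates a Classic ELB with one layer-7 HTTP listener and one plain TCP listener; the name, availability zones and ports are placeholders.

# Minimal sketch: a Classic ELB with an HTTP listener and a plain TCP listener.
# The name, AZs and ports are placeholders.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

elb.create_load_balancer(
    LoadBalancerName="web-elb",
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    Listeners=[
        # Layer-7 HTTP load balancing on port 80.
        {"Protocol": "HTTP", "LoadBalancerPort": 80,
         "InstanceProtocol": "HTTP", "InstancePort": 80},
        # Layer-4 TCP pass-through, e.g. for SMTP on port 25.
        {"Protocol": "TCP", "LoadBalancerPort": 25,
         "InstanceProtocol": "TCP", "InstancePort": 25},
    ],
)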

Timeouts: Amazon ELB currently times out persistent socket connections at 60 seconds if they are kept idle. This is a problem for use cases that generate large files (PDFs, reports etc.) on the backend EC2 instance, send them back as the response, and keep the connection idle during the entire generation process. To work around it you have to send something on the socket roughly every 40 seconds to keep the connection active through Amazon ELB. In HAProxy you can simply configure very large socket timeout values to avoid this problem.

Whitelisting IPs: Some enterprises want to whitelist a third-party load balancer's IP range in their firewalls. If the third-party service is hosted behind Amazon ELB this becomes a problem, because ELB currently does not provide fixed or permanent IP addresses for the load balancing instances launched in its tier. This is a bottleneck for enterprises that must whitelist load balancer IPs in external firewalls/gateways. For such use cases you can currently use HAProxy EC2 instances attached to Elastic IPs as the load balancers on AWS infrastructure and whitelist those Elastic IPs.
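A minimal boto3 sketch of the Elastic IP approach described above; the instance ID is hypothetical.

# Minimal sketch: attach a fixed Elastic IP to a HAProxy EC2 instance so the
# address can be whitelisted in external firewalls.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Allocate a new Elastic IP in the VPC scope.
eip = ec2.allocate_address(Domain="vpc")

# Bind it to the HAProxy instance; the public IP now stays constant.
ec2.associate_address(
    InstanceId="i-0123456789abcdef0",   # hypothetical HAProxy instance ID
    AllocationId=eip["AllocationId"],
)

print("Whitelist this address:", eip["PublicIp"])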

Amazon VPC / Non-VPC: Both Amazon ELB and HAProxy EC2 can work inside the VPC and non-VPC (classic) environments of AWS.

Internal Load Balancing: Both Amazon ELB and HAProxy can be used for internal load balancing inside a VPC. If you provide a service that is consumed internally by other applications and needs load balancing, either ELB or HAProxy fits. If internal load balancing is required in a non-VPC environment, ELB currently cannot do it and HAProxy has to be deployed.
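A minimal boto3 sketch of an internal-facing Classic ELB inside a VPC; the subnet, security group and port values are placeholders.

# Minimal sketch: an internal-facing Classic ELB inside a VPC, reachable only
# from within the VPC. Subnet and security group IDs are placeholders.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

elb.create_load_balancer(
    LoadBalancerName="internal-api-elb",
    Scheme="internal",                    # no public address; VPC-internal only
    Subnets=["subnet-0aaa1111", "subnet-0bbb2222"],
    SecurityGroups=["sg-0ccc3333"],
    Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80,
                "InstanceProtocol": "HTTP", "InstancePort": 8080}],
)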

URI/URL-based Load Balancing: Amazon ELB cannot load balance on URL patterns the way other reverse proxies can; for example, it cannot direct and balance requests between www.xyz.com/URL1 and www.xyz.com/URL2. For such use cases you currently have to use HAProxy on EC2.

Sticky problem: This point surprises many Amazon ELB users. ELB behaves a little strangely when incoming traffic originates from a single or narrow range of IPs: it does not round-robin efficiently and tends to stick requests to only some EC2 instances. Since I do not know the ELB internals, I assume ELB may default to a "source"-style algorithm under such conditions (I will have to check this with the AWS team). No such behaviour was observed with HAProxy on EC2 unless the balance algorithm is explicitly "source". In HAProxy you can combine "source" and Round Robin efficiently: if an HTTP request carries no cookie it is balanced by source, but once the request has a cookie HAProxy automatically shifts to Round Robin or weighted balancing.

Logging: Amazon ELB currently does not expose its log files for analysis; we can only monitor some essential metrics through CloudWatch. Because we have no access to the ELB logs, we cannot debug load balancing problems, analyse traffic and access patterns, or categorise bots and visitors. This is also a bottleneck for organisations with strong audit/compliance requirements at every layer of their infrastructure. If very strict or specific logging requirements exist, you may need to use HAProxy on EC2, provided it meets the need.

Monitoring: Amazon ELB can be monitored using Amazon CloudWatch; refer to this URL for the ELB metrics currently available: http://harish11g.blogspot.in/2012/02/cloudwatch-elastic-load-balancing.html. CloudWatch plus ELB is detailed enough for most use cases and gives a consolidated view of the entire ELB tier in the console/API. HAProxy, on the other hand, provides its own stats user interface per instance, but with a farm of 20+ HAProxy EC2 instances this monitoring becomes complex to manage efficiently. Tools like Server Density can monitor such HAProxy farms, but for deployments inside Amazon VPC they depend heavily on NAT instance availability.
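For example, here is a minimal boto3 sketch that pulls one ELB metric (RequestCount) from CloudWatch for the last hour; the load balancer name is a placeholder.

# Minimal sketch: fetch the last hour of RequestCount for a Classic ELB
# from CloudWatch in 5-minute buckets.
from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.utcnow()

stats = cw.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="RequestCount",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-elb"}],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=300,                # 5-minute buckets
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])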

SSL Termination and Compliance Requirements:
SSL termination can be done at two levels with Amazon ELB in your application architecture:
SSL termination at the Amazon ELB tier: the connection is encrypted between the client (browser etc.) and Amazon ELB, but the connection between ELB and the web/app EC2 instances is in the clear. This configuration may not be acceptable in strictly secure environments and will not pass compliance requirements.
SSL termination at the backend with end-to-end encryption: the connection is encrypted between the client and Amazon ELB, and the connection between ELB and the web/app EC2 backends is also encrypted. This is the recommended ELB configuration for meeting compliance requirements at the load balancer level.
HAProxy 1.4 does not support SSL termination directly; it has to be done in a Stunnel, Stud or Nginx layer in front of HAProxy. HAProxy 1.5-dev12 ships with SSL support and should become production ready soon; I have not yet analysed/tested the backend encryption support in that version.
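For reference, a minimal boto3 sketch of the end-to-end encryption configuration on a Classic ELB, where the listener speaks HTTPS both to the client and to the backend; the ELB name and certificate ARN are placeholders.

# Minimal sketch: an HTTPS listener with end-to-end encryption -- TLS is
# terminated at the ELB and re-encrypted to the backend on port 443.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

elb.create_load_balancer_listeners(
    LoadBalancerName="secure-elb",
    Listeners=[{
        "Protocol": "HTTPS",
        "LoadBalancerPort": 443,
        "InstanceProtocol": "HTTPS",   # backend connection is also encrypted
        "InstancePort": 443,
        "SSLCertificateId": "arn:aws:iam::123456789012:server-certificate/my-cert",
    }],
)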

Scalability and Elasticity: The most important architectural requirements of web-scale systems are scalability and elasticity, and Amazon ELB is designed to handle them with ease. Elastic Load Balancer does not cap the number of connections it can attempt to establish with the load-balanced Amazon EC2 instances, and it is designed to handle effectively unlimited concurrent requests per second. ELB is inherently scalable and can elastically increase or decrease its capacity with traffic. In a benchmark done by RightScale, Amazon ELB easily scaled out to handle 20K+ concurrent requests/sec. Refer to http://blog.rightscale.com/2010/04/01/benchmarking-load-balancers-in-the-cloud/
Note: RightScale stopped the load test after 20K req/sec because ELB kept expanding its capacity. Considerable DevOps engineering is needed to automate this functionality with HAProxy.

High Availability: Amazon ELB is inherently fault tolerant and highly available. Since it is a managed service, unhealthy load balancer instances are replaced automatically within the ELB tier. With HAProxy you need to do this work yourself and build HA on your own; refer to http://harish11g.blogspot.in/2012/10/high-availability-haproxy-amazon-ec2.html to understand more about high availability at the load balancing layer using HAProxy.

Integration with other services: Amazon ELB can be configured to work seamlessly with Amazon Auto Scaling, Amazon CloudWatch and Route 53 DNS. New web EC2 instances launched by Auto Scaling are added to the Amazon ELB automatically, and whenever load drops Auto Scaling removes existing EC2 instances from the ELB. Auto Scaling and CloudWatch cannot be integrated this seamlessly with HAProxy EC2, but HAProxy can easily be combined with Route 53 for DNS round-robin/weighted routing.
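A minimal boto3 sketch of the Auto Scaling + ELB integration described above; the launch configuration, group and ELB names are placeholders, and the launch configuration is assumed to exist already.

# Minimal sketch: an Auto Scaling group that registers new web instances with
# a Classic ELB automatically and removes them when it scales in.
import boto3

asg = boto3.client("autoscaling", region_name="us-east-1")

asg.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-launch-config",   # assumed to exist already
    MinSize=2,
    MaxSize=10,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    LoadBalancerNames=["web-elb"],   # instances are added to / removed from this ELB
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)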

Cost: Running an ELB in the US-East region for a month (744 hrs) and processing close to 1 TB of data costs around ~26 USD (ELB usage + data charges). Using HAProxy instead (2 x m1.large EC2 for HAProxy, S3-backed AMI, Linux instances, no EBS attached) as base capacity, growing to 4 or more m1.large EC2 instances depending on traffic, costs a minimum of ~387 USD for EC2 compute plus data charges just to start with. It is clear that larger deployments can save a lot of cost and benefit immensely from Amazon ELB compared to HAProxy on EC2.

Use Amazon S3 Object Expiration for Cost Savings

Amazon S3 is one of the earliest and most popular services in AWS infrastructure for storing files and documents. Customers store a variety of files in S3, including logs, documents, images, videos and dumps. Different files have different lifetimes and use cases in any production application: some documents are frequently accessed for a limited period and after that, when you no longer need real-time access, they become candidates for deletion or archival.
For example:
Log files have a limited lifetime and can be either parsed into a data store or archived every few months
Database and data store dumps also have a retention period and hence a limited lifetime
Files related to campaigns are mostly not needed once the sales promotion is over
Customer documents depend on the customer usage life cycle and have to be retained while the customer is active in the application
Digital media archives, financial and healthcare records must be retained for regulatory compliance

Usually IT teams have to build some mechanism or automated programs in-house to track document age and periodically run a deletion process (individual or bulk). In my customer consulting experience, I have often observed that this mechanism is not adequately in place for the following reasons:
Not all IT teams are efficient in their development and operations
No mechanism or automation in place to manage the retention period efficiently
IT staff not fully equipped with AWS cloud knowledge
IT teams are usually occupied with the solutions/products serving their business and do not have time to keep track of the rapid AWS feature roll-out pace

Imagine your application stores ~5 TB of documents every month; in a year that aggregates to ~60 TB of documents in Amazon S3 Standard storage. In the US-East region, ~60 TB of aggregated Standard storage for the year costs ~30,000 USD. Now imagine ~20 TB of the documents aggregated over the year have a limited lifetime and could be deleted or archived every month; that equates to ~1,650 USD of cost leakage a year, which can be avoided if a proper mechanism or automation is put in place by the respective teams.
Note: Current charges for Amazon S3 Standard storage in US-East are 0.095 USD per GB for the first 1 TB and 0.080 USD per GB for the next 49 TB.
But is there a simpler way for IT teams to cut this leakage and save costs in Amazon S3? Yes: use the Amazon S3 Object Expiration feature.

What is Amazon S3 Object expiration?
Amazon S3 introduced a feature called Object Expiration (in late 2011) to ease the automation mechanism described above. It is very helpful for customers who want data in S3 only for a limited period of time and, after that, want the files deleted automatically by Amazon S3. Earlier you were responsible for deleting those files manually once they stopped being useful; now you no longer have to worry about it, just use Amazon S3 Object Expiration.
The ~1,650 USD leakage in the scenario above can be saved by implementing the Amazon S3 Object Expiration feature in your system. Since it involves no automation effort, no compute hours for a deletion program to run and no manual labour, it brings invisible savings on top of the direct ones.

Overall Savings = ~1650 USD (scenario) + Cost of compute hrs (for deletion program) + Automation engineering effort (or) Manual deletion effort

How does it work?
The Amazon S3 Object Expiration feature lets you define rules to schedule the removal of your objects after a pre-defined time period. The rules are specified in the Lifecycle Configuration policy of an Amazon S3 bucket and can be updated either through the AWS Management Console or the S3 APIs.
Once a rule is set, Amazon S3 calculates the object expiration time by adding the expiration lifetime to the object creation time and rounding the result up to midnight GMT of the next day. For example, if a file was created on 11/12/2012 at 11:00 UTC and the expiration period was 3 days, Amazon S3 would calculate the expiration date-time of the file as 15/12/2012 00:00 UTC. Once objects are past their expiration date, they are queued for deletion. You can use Object Expiration rules on objects stored in both Standard and Reduced Redundancy storage classes of Amazon S3.
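A minimal boto3 sketch of such a lifecycle rule, expiring everything under a hypothetical logs/ prefix 90 days after creation; the bucket name, prefix and age are placeholders.

# Minimal sketch: an S3 lifecycle rule that expires objects under "logs/"
# 90 days after creation. Bucket name and prefix are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-app-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-old-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Expiration": {"Days": 90},   # queued for deletion ~90 days after creation
        }],
    },
)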

Add Spot Instances with Amazon EMR

Most of us know that Amazon Spot EC2 instances are usually a good choice for time-flexible and interruption-tolerant tasks. These instances are traded on a fluctuating spot market price, and you fix your bid price using the AWS APIs or the AWS Console. When spot EC2 capacity is available at your bid price, AWS allots it to your account. Spot instances are usually much cheaper than On-Demand EC2 instances: for example, the On-Demand m1.xlarge price is 0.48 USD per hour, while on the spot market you can sometimes find it at 0.052 USD per hour, roughly 9 times cheaper. Even if you bid competitively and get hold of spot EC2 at around 0.24 USD most of the time, you are still saving 50% off the On-Demand price. Big data use cases typically need lots of EC2 nodes for processing, so adopting such techniques can make a big difference to your infrastructure cost and operations in the long term. I am sharing my experience on this subject as tips and techniques you can adopt to save costs while using EMR clusters in Amazon for big data problems.
Note: With spot you can be sure that you will never pay more than your maximum bid price per hour.

Tip 1: Make the right choice (Spot vs On-Demand) for each cluster component
Data-critical workloads: For workloads that cannot afford to lose data, keep the Master + Core nodes on On-Demand EC2 and run the Task nodes on Spot EC2. This is the most common pattern when combining Spot and On-Demand in an Amazon EMR cluster. Since the task nodes run at spot prices, depending on your bidding strategy you can save ~50% compared with running them on On-Demand EC2. You can save further (if you are lucky) by reserving your Core and Master nodes, but that ties you to an AZ; in my view this is not a good or common technique, because some AZs can be very noisy with high spot prices.
Cost-driven workloads: When solving big data problems you sometimes face scenarios where cost matters far more than time, for example processing archives of old logs as low-priority jobs with abundant time left. In such cases you can run Master + Core + Task all on Spot EC2 for savings beyond the data-critical approach; with all nodes at spot prices, depending on your bidding strategy you can save ~60% or more compared with On-Demand EC2. The table published by AWS below gives an indication of the Amazon EMR + Spot combinations that are widely used:
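A minimal boto3 sketch of the data-critical pattern above (Master + Core on On-Demand, Task nodes on Spot); the instance types, counts, IAM roles, release label and bid price are placeholders, not recommendations.

# Minimal sketch: an EMR cluster with Master + Core on On-Demand and Task
# nodes on Spot. All names, types, roles and the bid price are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="spot-task-cluster",
    ReleaseLabel="emr-5.36.0",          # assumed release label
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND", "InstanceType": "m4.large", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "Market": "ON_DEMAND", "InstanceType": "m4.large", "InstanceCount": 2},
            # Task nodes hold no HDFS data, so losing them to a spot
            # interruption does not lose data.
            {"Name": "task", "InstanceRole": "TASK",
             "Market": "SPOT", "BidPrice": "0.10",
             "InstanceType": "m4.large", "InstanceCount": 4},
        ],
    },
)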

Tip 2: There is a free lunch sometimes
Spot instances can be interrupted by AWS when the spot price rises to your bid price, meaning AWS can pull back the Spot EC2 instances assigned to your account when the price matches or exceeds it. If your spot task nodes are interrupted, AWS does not charge you for the partial hour of usage: if you started the instance at 10:05 am and spot price fluctuations interrupt it at 10:45 am, you are not charged for that partial hour. If your processing is totally time-insensitive, you can keep your bid close to the spot price, where interruptions are likely, and exploit this partial-hour behaviour. Theoretically you can get much of your task-node processing done for free* with this strategy.

Tip 3: Use AZs wisely when it comes to spot
Different AZs inside an Amazon EC2 region have different spot prices for the same instance type. Observe this pattern for a while, build some intelligence around the price data collected, and rebuild your cluster in the AZ with the lowest price. Since the Master + Core + Task nodes need to run in the same AZ for better latency, it is advisable to architect your EMR clusters so they can be switched (i.e. recreated) in a different AZ according to spot prices. If you build this flexibility into your architecture you can save costs by exploiting inter-AZ price fluctuations. Refer to the images below for spot price variation across two AZs in the same region over the same time period. Make your choice wisely from time to time.
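A minimal boto3 sketch of collecting that price data: it pulls recent spot price history for one instance type and prints the latest observed price per AZ; the instance type and look-back window are placeholders.

# Minimal sketch: compare recent spot prices for one instance type across the
# AZs of a region before deciding where to (re)build the EMR cluster.
from datetime import datetime, timedelta
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

history = ec2.describe_spot_price_history(
    InstanceTypes=["m4.large"],                # placeholder instance type
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
)

# Keep the latest observed price per Availability Zone.
latest = {}
for p in sorted(history["SpotPriceHistory"], key=lambda p: p["Timestamp"]):
    latest[p["AvailabilityZone"]] = float(p["SpotPrice"])

for az, price in sorted(latest.items(), key=lambda kv: kv[1]):
    print(az, price)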

Tip 4: Keep your job logic small and store intermediate outputs in S3
Break your complex processing logic down into small jobs, and design your EMR jobs and tasks so that they run for only a short period of time (a few minutes, for example). Store all intermediate job outputs in Amazon S3. This approach is helpful in the EMR world and gives you the following benefits:

When your Core + Task nodes are interrupted frequently, you can still continue from the intermediate checkpoints, with the data read back from S3
You gain the flexibility to recreate the EMR cluster in another AZ depending on spot price fluctuations
You can decide the number of nodes needed for your EMR cluster (even hour by hour) depending on data volume, density and velocity

All three points above, when implemented, contribute to elasticity in your architecture and thereby help you save costs in the Amazon cloud. This recommendation is not suitable for all jobs; it has to be mapped carefully to the right use cases by the architects.

The AdWantageS

Every customer has a reason to move into the cloud. Be it cost, scalability or on-demand provisioning, there are plenty of reasons why one moves into the cloud. The whitepaper "The Total Cost of (Non) Ownership of Web Applications in the Cloud" by Jinesh Varia, technical evangelist at Amazon Web Services, provides a good comparison between hosting infrastructure in-house and in the cloud (AWS). There are plenty of pricing models currently available with AWS which can provide cost benefits ranging from 30% to 80% compared with hosting the servers in-house.

On-Demand Instances – this is where everyone starts. You simply add your credit card to your AWS account and start spinning up instances. You provision them on demand and pay for how long you run them, with the option of stopping and starting them whenever needed; you are charged for every hour an instance runs. For example, for a Large instance (2 CPUs, 7.5 GB memory, Linux) you will pay $0.32/hr (US-East).

Reserved Instances – say you migrate/host your web application on AWS and run multiple web servers and DB servers there. After a couple of months you may notice that some of your servers run 24 hours a day: you may spin up additional web servers during peak load, but you always run at least two of them, plus a DB server. For cases where you know you will always run the instances, AWS provides an option to reduce your cost – Reserved Instances. This is purely a pricing model: you purchase the required instance type (say Large/X-Large) in a region for an upfront amount, and you are then charged substantially less for the on-demand usage of that instance. This gives a potential saving of 30% over a period of one year compared with On-Demand instances. The following illustrates the cost savings of purchasing an m1.large Reserved Instance versus using it on demand over a year.

Cost comparison between On-Demand and Reserved Instance for m1.large Linux Instance in US-East

Be careful that,

Reserved Instances are purchased against an instance type. If you purchase an m1.large Reserved Instance, then when your bill is generated at the end of the month, only m1.large usage is billed at the reduced hourly charge. So if, during that month, you find m1.large insufficient and move up to m1.xlarge, you will not be billed at the reduced hourly rate and may end up paying AWS more on a yearly basis. Examine your usage pattern, fix your instance type, and then purchase a Reserved Instance.
Reserved Instances are for a region – if you buy one in the US-East region and later decide to move your instances to US-West, the cost reduction will not apply to instances running in the US-West region
Of course, you have the benefits of,

Reduced cost – a minimum of 30-40% cost savings which increases if you purchase a 3-year term
Guaranteed Capacity – AWS will always provide you the number of Reserved Instances you have purchased (you will not get an error saying “Insufficient Capacity”)
Spot Instances – in this model you bid against the market price of an instance, and if yours is the highest bid you get an instance at your bid price. The spot market price is available through an API which can be queried regularly; you can write code that checks the spot market price and keeps placing bids with the maximum price you are willing to pay against the current market price (see the sketch after this list). If your bid exceeds the market price, an instance is provisioned for you. The spot price is usually substantially lower than on-demand pricing: at the time of writing, the spot price for an m1.large Linux instance in US-East was $0.026/hr against $0.32/hr on demand, about 90% cheaper on an hourly basis. But the catch is,

The spot market price keeps changing as other users place bids and AWS's excess capacity varies
If your maximum bid price falls below the spot market price, AWS may terminate your instance. Abruptly.
Hence you may lose your data, or your code may terminate unfinished
Jobs that you anticipate completing in one hour may take a few more hours to complete
Hence Spot Instances are not suitable for all kinds of workloads. Certain workloads like log processing or encoding can exploit Spot Instances, but that requires writing careful algorithms and a deeper understanding. Here are some of the use cases for Spot Instances.
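A minimal boto3 sketch of the bid-placement idea referenced above: query the current spot price and place a one-time bid only if your maximum price is higher; the AMI ID, instance type and bid value are placeholders.

# Minimal sketch: check the current spot price and place a one-time bid only
# if our maximum price is above it. AMI, type and prices are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
MAX_BID = 0.10   # the most we are willing to pay per hour

history = ec2.describe_spot_price_history(
    InstanceTypes=["m1.large"],
    ProductDescriptions=["Linux/UNIX"],
    MaxResults=1,
)
current = float(history["SpotPriceHistory"][0]["SpotPrice"])

if MAX_BID > current:
    ec2.request_spot_instances(
        SpotPrice=str(MAX_BID),
        InstanceCount=1,
        Type="one-time",
        LaunchSpecification={
            "ImageId": "ami-12345678",       # hypothetical AMI
            "InstanceType": "m1.large",
        },
    )
else:
    print("Current spot price %.3f is above our bid" % current)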

Now, with that basic understanding, let's examine the whitepaper a little more carefully, and not just from a cost point of view. Web applications can be classified into three types based on the nature of their traffic – steady traffic, periodic burst and always unpredictable. Here is a summary of the benefits of hosting each of them on AWS.

AWS benefits for different types of web applications

Steady Traffic

The website has steady traffic. You are running a couple of servers on-premise and consider moving them to AWS, or you are hosting a new web application on AWS. Here is the cost comparison from the whitepaper.

Source: AWS Whitepaper quoted above

You will most likely start by spinning up On-Demand Instances: a couple of them for web servers and a couple for your database (for HA)
Over the long run (3 years), if you only use On-Demand Instances, you may end up paying more than hosting on-premise. Do NOT just run On-Demand Instances if your usage is steady
If you have been on AWS for a couple of months and are happy with the performance of your setup, you should definitely consider purchasing Reserved Instances. You will end up with a minimum of 40% savings against on-premise infrastructure and about 30% against running On-Demand Instances
You will still benefit from spinning up infrastructure on demand. Unlike on-premise, where you need to plan and purchase ahead, here you can provision on demand, just in time
And in case you grow and your traffic increases, you have the option to add capacity to your fleet and remove it later. You can change server sizes on demand and pay only for what you use. This goes a long way in terms of business continuity, user experience and more sales
You can always mix and match Reserved and On-Demand Instances and reduce your cost whenever required; Reserved Instances can be purchased at any time
Periodic Burst

In this scenario you have constant traffic to your website, but periodically there are spikes: for example, a sales promotion every quarter, or more traffic during Thanksgiving or Christmas, with lower traffic during the other months. Here is the cost comparison for such a scenario.

Source: AWS Whitepaper quoted above

You will spin up On-Demand Instances to start with: a couple of them for web servers and a couple for the database
During the burst period you need additional capacity: spin up additional instances for your web and application tiers to meet the demand
This is where you enjoy the benefits of on-demand provisioning, something that is not possible with on-premise hosting, where you purchase the required excess capacity well ahead and keep running it even when traffic is low. With on-demand provisioning you only provision during the burst period, and once the promotion is over you terminate the extra capacity
For the capacity you always run as a baseline, you can purchase Reserved Instances and reduce the cost by up to 75%
Even if you do not purchase Reserved Instances, running On-Demand Instances saves around 40% against on-premise infrastructure, because for the periodic burst you pay only during the burst period and turn the capacity off later, which is not possible in an on-premise setup where you would buy it ahead of time anyway
Always Unpredictable

In this case you have an application whose traffic you cannot predict, for example a social application in an experimental stage that you expect to go viral. If it goes viral and gains popularity you need to expand the infrastructure quickly; if it doesn't, you do not want to risk heavy cap-ex. Here is the cost comparison for such a scenario.

Source: AWS Whitepaper quoted above

You will spin up On-Demand Instances and scale them according to the traffic
You will use automation tools such as Auto Scaling to scale the infrastructure on demand and keep your setup aligned with the traffic
Over a 3-year period there will be some initial steady growth of the application; as it goes viral you will need to add capacity, and beyond its lifetime of, say, 18 months to 2 years, the traffic may start to fall
Through monitoring tools such as CloudWatch you can constantly tweak the infrastructure and arrive at a baseline. You will find that during the initial growth and "viral" periods you need a certain set of baseline servers; purchase Reserved Instances for those and mix them with On-Demand Instances when you scale. You will enjoy a cost saving of around 70% against an on-premise setup
It is not advisable to plan for full capacity, run at full capacity, or purchase Reserved Instances for the full capacity. If the application does not do as well as anticipated, you may end up paying AWS more than necessary
As you can see, whether you run a steady-state application or are trying out a new idea, AWS proves advantageous from different perspectives for different requirements. Cost, on-demand provisioning, scalability, flexibility and automation tools are things that even a startup can think of and get on board with quickly. One question you need to ask yourself is "Why am I moving into AWS?". Ask it during the early stages and spend considerable time on the design and architecture of your infrastructure setup; otherwise you may end up asking yourself "Why did I come into AWS?".


UIGestureRecognizer for iOS

Introduction:

UIGestureRecognizer is an abstract base class for concrete gesture-recognizer classes.

If you need to detect gestures in your app, such as taps, pinches, pans, or rotations, it’s extremely easy with the built-in UIGestureRecognizer classes.

In the old days before UIGestureRecognizers, if you wanted to detect a gesture such as a swipe, you had to track every touch within a UIView yourself – overriding methods such as touchesBegan, touchesMoved, and touchesEnded.

The concrete subclasses of UIGestureRecognizer are the following:

UITapGestureRecognizer
UIPinchGestureRecognizer
UIRotationGestureRecognizer
UISwipeGestureRecognizer
UIPanGestureRecognizer
UILongPressGestureRecognizer

A gesture recognizer has one or more target-action pairs associated with it. If there are multiple target-action pairs, they are discrete, and not cumulative. Recognition of a gesture results in the dispatch of an action message to a target for each of those pairs. The action methods invoked must conform to one of the following signatures:

- (void)handleGesture;

- (void)handleGesture:(UIGestureRecognizer *)gestureRecognizer;

Solution:

Step 1: Create a ViewController subclass; its view is the UIView we will attach gestures to

Step 2: In viewDidLoad, create a concrete UIGestureRecognizer subclass and attach it to the view

e.g., UITapGestureRecognizer

- (void)viewDidLoad {
    [super viewDidLoad];

    // Create a tap recognizer that calls handleTap: on this view controller.
    UITapGestureRecognizer *recognizer = [[UITapGestureRecognizer alloc] initWithTarget:self action:@selector(handleTap:)];

    // Requires the view controller to conform to UIGestureRecognizerDelegate.
    recognizer.delegate = self;

    // Attach the recognizer to the view whose taps we want to detect.
    [self.view addGestureRecognizer:recognizer];
}

Step 3: Implement the gesture handle

- (void)handleTap:(UITapGestureRecognizer *)recognizer {
    // Your tap-handling implementation goes here.
    NSLog(@"your implementation here");
}

Conclusion:

The UIGestureRecognizer classes provide a default implementation for detecting common gestures such as taps, pinches, rotations, swipes, pans, and long presses. Using them not only reduces your code length, it is also much easier.