7 Common AWS Cost Issues and How You Can Fix Them

Cloud solutions offer significant business benefits to startups and established enterprises alike. Amazon Web Services (AWS) delivers robust cloud infrastructure on a pay-per-use model, letting you scale computing resources with the growing needs of your business. Yet even with all these benefits, cost-related issues remain: although the AWS model saves the cost of building and maintaining your own infrastructure, users still encounter cost management problems in the cloud. So, keeping cost management in mind, here are 7 common AWS cost issues and how you can fix them.

Resource Purchase

Remember to check your resource utilization before purchasing. Reserved, on-demand and spot instances should be purchased according to usage patterns and risk tolerance: spot instances carry termination risk, and reserved instances can sit inactive if they are not mapped to actual workloads. For long-term, steady usage, use reserved EC2 instances.
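As a rough sanity check before committing to a reservation, you can compare the effective hourly rate of a reserved instance against your expected utilization. The rates in this sketch are hypothetical placeholders, not current AWS pricing:

```python
def reserved_is_cheaper(on_demand_hourly, ri_effective_hourly, expected_utilization):
    """Return True if a reserved instance beats on-demand, given the
    fraction of hours (0.0-1.0) the instance will actually run.
    A reservation bills for every hour of the term; on-demand bills
    only for the hours you run."""
    on_demand_cost_per_hour = on_demand_hourly * expected_utilization
    return ri_effective_hourly < on_demand_cost_per_hour

# Hypothetical rates: $0.10/hr on-demand vs $0.06/hr effective reserved rate.
print(reserved_is_cheaper(0.10, 0.06, 0.50))  # runs half the time -> False
print(reserved_is_cheaper(0.10, 0.06, 0.90))  # runs 90% of the time -> True
```

The break-even point is simply where expected utilization times the on-demand rate crosses the effective reserved rate.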

Instance Size

Remember to analyze your needs and choose the appropriate instance size rather than sizing for peak demand. Sizes vary (small, medium, large and beyond), so do not simply accept the defaults. You can use Auto Scaling to handle high load during limited periods. Also, save cache and other non-critical application data to non-persistent storage instead of increasing the size of your Elastic Block Store (EBS) volumes.

EC2 Utilization

Elastic Compute Cloud (EC2) instances are charged for the time they run, even when an instance uses far less than its provisioned capacity or sits idle. Identify idle and underutilized instances by analyzing CPU utilization and network activity; if these metrics stay low, flag the instance, then contact the instance owner to verify whether it is needed and whether it is correctly sized. Shut down instances when they are not needed; this directly reduces cost. Also find ways to reuse unused reserved instances.
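One way to implement this flagging is to pull the average CPUUtilization metric from CloudWatch for each running instance and apply a threshold. A minimal boto3 sketch follows; the 5% threshold and 14-day window are arbitrary choices to adapt, and the network activity check is omitted for brevity:

```python
import datetime


def is_idle(datapoints, cpu_threshold=5.0):
    """Consider an instance idle if every sampled average CPU reading
    is below the threshold (and we actually have samples)."""
    return bool(datapoints) and all(dp['Average'] < cpu_threshold for dp in datapoints)


def flag_idle_instances(region='us-east-1', days=14):
    import boto3  # deferred so is_idle() stays testable without AWS access
    ec2 = boto3.client('ec2', region_name=region)
    cw = boto3.client('cloudwatch', region_name=region)
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(days=days)
    flagged = []
    reservations = ec2.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}])
    for res in reservations['Reservations']:
        for inst in res['Instances']:
            stats = cw.get_metric_statistics(
                Namespace='AWS/EC2', MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': inst['InstanceId']}],
                StartTime=start, EndTime=end, Period=86400, Statistics=['Average'])
            if is_idle(stats['Datapoints']):
                flagged.append(inst['InstanceId'])
    return flagged
```

Flagged instance IDs can then be routed to the instance owners for the shutdown/right-sizing conversation described above.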

Using ElastiCache to store cached data from your application and database reduces instance CPU utilization and bandwidth consumption, which in turn reduces cost. You can also run ECS tasks on underutilized EC2 instances to increase the workload and efficiency of those instances, instead of launching new resources.

S3 Lifecycle

Keep an eye on your object storage and regularly track what you store, where it is stored and how. Using Simple Storage Service (S3) lifecycle policies you can control and reduce storage costs: by expiring objects or transitioning them to Reduced Redundancy Storage (RRS) or Glacier, you lower your S3 and overall storage bill. Data that is no longer needed, or that does not need to be highly available, can be deleted or moved to Glacier storage using an S3 lifecycle policy.

Use Glacier to archive data for long periods, and plan your Glacier retrieval process carefully; do not retrieve data frequently.
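Such a lifecycle can be applied programmatically. A sketch using boto3's `put_bucket_lifecycle_configuration`; the bucket name, the `logs/` prefix and the 90/365-day values are illustrative assumptions, not recommendations:

```python
def build_lifecycle_rule(prefix, glacier_after_days, expire_after_days):
    """Build an S3 lifecycle rule that transitions objects to Glacier
    and later expires them."""
    return {
        'ID': 'archive-then-expire-' + prefix.strip('/'),
        'Filter': {'Prefix': prefix},
        'Status': 'Enabled',
        'Transitions': [{'Days': glacier_after_days, 'StorageClass': 'GLACIER'}],
        'Expiration': {'Days': expire_after_days},
    }


def apply_lifecycle(bucket, rule):
    import boto3  # deferred so the rule builder stays testable offline
    s3 = boto3.client('s3')
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration={'Rules': [rule]})


rule = build_lifecycle_rule('logs/', 90, 365)
# apply_lifecycle('my-example-bucket', rule)  # hypothetical bucket name
```

The rule moves objects under the prefix to Glacier after 90 days and deletes them after a year, matching the archive-then-expire pattern described above.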

Data Transfer Charges

It is important to constantly track data transfer charges, as they can cause unnecessary expense. Maintaining a precise inventory of what data is transferred and where (i.e. to which region) prevents money being wasted on data transfer.

AWS Support Services

For many users, the EC2 hourly charges are greater than pay-as-you-go usage charges. Using AWS services like ELB on a pay-as-you-go basis can therefore help reduce cost. Analyze your costs and check whether these services are effective for your usage.

Remove Resources

Detach Elastic IPs from instances in the stopped state and release any other unattached IPs. Also, delete old and unwanted AMIs and snapshots, including snapshots of deleted AMIs. Track these resources regularly so they do not get missed among your many other resources: individually these items cost little, but together they create a large expense. Meanwhile, in an AWS environment, every resource is billed even when inactive, so it is important to turn off the unused ones. Take a snapshot of an unused RDS instance if needed, then terminate the RDS instance, and keep track of and remove all unwanted RDS manual snapshots and other unused resources.
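The Elastic IP cleanup step can be scripted. A boto3 sketch that lists addresses and releases the ones with no association; `dry_run` defaults to True so it only reports until you deliberately flip it:

```python
def unattached_allocation_ids(addresses):
    """Pick out VPC Elastic IPs with no association, i.e. not attached
    to any instance or network interface."""
    return [a['AllocationId'] for a in addresses
            if 'AssociationId' not in a and 'AllocationId' in a]


def release_unattached_eips(region='us-east-1', dry_run=True):
    import boto3  # deferred so the filter above stays testable offline
    ec2 = boto3.client('ec2', region_name=region)
    addresses = ec2.describe_addresses()['Addresses']
    for alloc_id in unattached_allocation_ids(addresses):
        print('Unattached Elastic IP allocation:', alloc_id)
        if not dry_run:
            ec2.release_address(AllocationId=alloc_id)
```

A similar describe-filter-delete loop works for old snapshots and deregistered AMIs.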

Though AWS is a dynamic and effective cloud service, it is important to regularly check your usage, manually or via automated reports, to avoid mistakes. The 7 cost issues above can be avoided by regularly monitoring your AWS services, which will go a long way toward reducing your costs.

CloudWatch + Lambda Case 4: Controlling the launch of specific “C” type EC2 instances post office hours to save costs

We have a customer with predictable load volatility between 9 AM and 6 PM who uses specific large EC2 instances (“c4.8xlarge”) during office hours for analysis. Their IT team wanted to control the launch of such a large instance class post office hours and during the night to control costs, but currently there is no way to restrict or control this action using AWS IAM. In short, we cannot create a complex IAM policy with conditions such as “user A belonging to group A cannot launch instance type C every day between hour X and hour Y”.

One stop-gap approach is to have a scheduled job that removes the relevant policy from an IAM user when certain time conditions are met: the job calls an API that removes the policy restricting an IAM user or group from launching instances. However, this makes IAM policy management complex and makes it tough to assess and govern drift between policy versions.

After the introduction of CloudWatch Events, our cloud operations team started controlling this with Lambda functions. Whenever an instance is launched, it triggers a Lambda function; the function checks whether the instance is of the specific “C” type and looks at the current time, and if the launch falls outside office hours, it immediately terminates the EC2 instance.

As a first step, we will create a rule in the Amazon CloudWatch Events dashboard. We have chosen AWS API Call (via CloudTrail) as the event to be processed, with a Lambda function as the target.


The next step is to configure the rule details with a rule definition.


Finally, we will review the Rules Summary


Amazon Lambda Function Code Snippet (Python)
import json
import boto3

def lambda_handler(event, context):
    # print("Received event: " + json.dumps(event, indent=2))

    ec2_client = boto3.client("ec2")

    print("Event Region:", event['region'])

    event_time = event['detail']['eventTime']  # e.g. "2016-05-10T19:20:30Z"
    print("Event Time:", event_time)

    # Extract the hour (UTC) from the ISO-8601 event timestamp
    hour = int(event_time.split('T')[1].split(':')[0])

    instance_type = event['detail']['requestParameters']['instanceType']
    print("Instance Type:", instance_type)

    instance_id = event['detail']['responseElements']['instancesSet']['items'][0]['instanceId']
    print("Instance Id:", instance_id)

    # Terminate "c" family instances launched outside office hours (9 AM - 6 PM UTC)
    if instance_type.startswith('c') and (hour >= 18 or hour < 9):
        print(ec2_client.terminate_instances(InstanceIds=[instance_id]))

GitHub Gist URL:  https://github.com/cloud-automaton/automaton/blob/master/aws/events/TerminateAWSEC2.py

This post was co-authored with Priya and Ramprasad of 8KMiles.

This article was originally published in: http://harish11g.blogspot.in/

CloudWatch + Lambda Case 3: Controlling cross-region EBS/RDS snapshot copies for regulated industries

If you are part of a regulated industry like pharmaceuticals, life sciences or BFSI and run mission-critical applications on AWS, compliance requirements will at times require you to restrict or control data movement to a particular geographic region in the cloud. This can be complex to enforce. Let us explore in detail:

We all know there are a variety of ways to move data from one AWS region to another, but one commonly used method is copying snapshots across regions. You can usually restrict the snapshot copy permission in an IAM policy, but what if you need that permission enabled for moving data between AWS accounts inside a region, while still controlling EBS/RDS snapshot copy actions across regions? This can only be mitigated by automatically deleting the snapshot in the destination AWS region whenever a cross-region snapshot copy occurs.

For customers with multiple accounts who needed this permission but still wanted control, our cloud operations team used to either remove the permission in IAM altogether or monitor the activity with polling scripts. Now, after the introduction of CloudWatch Events, we have configured a rule pointing to an AWS Lambda function, which is triggered in near real time when a snapshot is copied to a destination AWS region. The Lambda function immediately initiates deletion. Though this is reactive, it is incomparably faster than manual intervention.

In this use case, an Amazon CloudWatch Events rule identifies EBS snapshot copies across regions so they can be deleted.

As a first step, we will create a rule in the Amazon CloudWatch Events dashboard. We have chosen AWS API Call (via CloudTrail) as the event to be processed, with a Lambda function as the target.


The next step is to configure the rule details with a rule definition.


Finally, we will review the Rules Summary


Amazon Lambda Function Code Snippet (Python)

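The original snippet was published as an image; based on the description above, a minimal reconstruction might look like the following. The CloudTrail field names and the allowed-region list are assumptions to verify against your own events:

```python
ALLOWED_REGIONS = {'us-east-1'}  # assumption: compliance-approved regions


def snapshot_to_delete(event, allowed_regions=ALLOWED_REGIONS):
    """Return the copied snapshot id if a CopySnapshot landed in a
    non-approved region, else None. Field names follow the CloudTrail
    CopySnapshot event shape (an assumption worth verifying)."""
    detail = event.get('detail', {})
    if detail.get('eventName') != 'CopySnapshot':
        return None
    if event.get('region') in allowed_regions:
        return None
    return detail.get('responseElements', {}).get('snapshotId')


def lambda_handler(event, context):
    import boto3  # deferred so the decision logic stays testable offline
    snapshot_id = snapshot_to_delete(event)
    if snapshot_id:
        ec2 = boto3.client('ec2', region_name=event['region'])
        ec2.delete_snapshot(SnapshotId=snapshot_id)
        print('Deleted non-compliant snapshot copy:', snapshot_id)
```

See the gist below for the version used in production.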

GitHub Gist URL: https://github.com/cloud-automaton/automaton/blob/master/aws/events/AWSSnapShotCopy.py


This post was co-authored with Muthukumar and Ramprasad of 8KMiles


CloudWatch + Lambda Case 2: Keeping watch on whether AWS root user activity is normal or an anomaly

As a best practice, you should never use your AWS root account credentials to access AWS. Instead, create individual IAM users for anyone who needs access to your AWS account. This allows you to give each IAM user a unique set of security credentials and grant different permissions to each user. For example, create an IAM user for yourself as well, give that user administrative privileges, use that IAM user for all your work, and never share your credentials with anyone else.

Root usually has full access, and it is not practical to restrict this in AWS IAM. Now imagine you suspect some anomalous or suspicious activity performed as the root user in your logs (using EC2 APIs, etc.) beyond normal IAM user provisioning; this could mean the root account is compromised or coerced, and either way it is a deviation from best practice.

In the past we used to poll the CloudTrail logs with programs, differentiating between “root” and “Root”, and our cloud operations team would react to these anomalous behaviors. Now we can inform cloud operations and customer stakeholders in near real time using CloudWatch Events.

In this use case, an Amazon CloudWatch Events rule identifies any activity performed by the AWS root user, and notifications are sent to SNS through AWS Lambda.

As a first step, we will create a rule in the Amazon CloudWatch Events dashboard. We have chosen AWS API Call (via CloudTrail) as the event to be processed, with a Lambda function as the target. The Lambda function detects whether the event was triggered by the root user and notifies through SNS.


The next step is to configure the rule details with a rule definition.


Finally, we will review the Rules Summary


Amazon Lambda Function Code Snippet (Python)

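The original snippet was published as an image; a minimal reconstruction based on the description above might look like this. The SNS topic ARN is a hypothetical placeholder, and the `userIdentity.type == 'Root'` check follows the documented CloudTrail record format:

```python
def is_root_activity(event):
    """CloudTrail marks calls made with root credentials with
    userIdentity.type == 'Root'."""
    return event.get('detail', {}).get('userIdentity', {}).get('type') == 'Root'


def lambda_handler(event, context):
    import boto3  # deferred so the check above stays testable offline
    if is_root_activity(event):
        sns = boto3.client('sns')
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:123456789012:root-activity',  # hypothetical ARN
            Subject='Root user activity detected',
            Message=str(event.get('detail', {}).get('eventName')))
```

See the gist below for the version used in production.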

GitHub Gist URL:

https://github.com/cloud-automaton/automaton/blob/master/aws/events/TrackAWSRootActivity.py

This post was co-authored with Saravanan and Ramprasad of 8KMiles


CloudWatch + Lambda Case 1: Avoiding malicious CloudTrail actions in your AWS account

As many of you know, AWS CloudTrail provides visibility into API activity in your AWS account. CloudTrail logging lets you see which actions users have taken and which resources have been used, along with details such as the date and time of actions and which actions failed because of inadequate permissions. It enables you to answer important questions, such as which user made an API call or which resources were acted upon in a call. If a user disables CloudTrail logging, accidentally or with malicious intent, audit events are no longer captured and you no longer have proper governance in place. The situation gets more complex if the user disables and then re-enables CloudTrail for a brief period, during which important activities can go unlogged and unaudited. In short, once CloudTrail logging is enabled it should not be disabled, and this action needs to be defended in depth.

Our cloud operations team had earlier written a program that periodically scans the CloudTrail log entries; if log activity was missing for more than X period of time, it alerted operations. Overall, it took our cloud operations more than 15-20 minutes to mitigate a CloudTrail disable action.

Now, after the introduction of CloudWatch Events, we have configured a rule that points to an AWS Lambda function as the target. This function is triggered in near real time when CloudTrail logging is disabled and automatically re-enables it without any manual interaction from cloud operations. An advanced version of the program also triggers a workflow which logs entries into the ticketing system. This event model has helped us reduce mitigation time to less than a minute.
We have illustrated below the detailed steps to configure this event, along with a link to a GitHub repository with basic AWS Lambda Python code that can be used by your cloud operations team.

In this use case, an Amazon CloudWatch Events rule identifies whether CloudTrail logging has been stopped in the AWS account and, if so, takes corrective action by re-enabling it.

As a first step, we will create a rule in the Amazon CloudWatch Events dashboard. We have chosen AWS API Call (via CloudTrail) as the event to be processed, with a Lambda function as the target.


The next step is to configure the rule details with a rule definition.


Finally, we will review the Rules Summary


Amazon Lambda Function Code Snippet (Python)
import boto3

print('Loading function')

def lambda_handler(event, context):
    """Re-enable CloudTrail logging if a StopLogging call is detected."""
    client = boto3.client('cloudtrail')
    if event['detail']['eventName'] == 'StopLogging':
        trail_name = event['detail']['requestParameters']['name']
        client.start_logging(Name=trail_name)
        print('Re-enabled CloudTrail logging for trail:', trail_name)

GitHub Gist URL:

This post was co-authored with Mohan and Ramprasad of 8KMiles


27 Best Practice Tips on Amazon Web Services Security Groups

AWS security groups are one of the most used, and most abused, configurations in an AWS environment for anyone who has been on the cloud for long. Since security groups are simple to configure, users often underestimate their importance and do not follow best practices. In reality, operating on security groups every day is much more intensive and complex than configuring them once, and hardly anybody talks about it. So in this article I am going to share our experience dealing with AWS security groups since 2008, as a set of best-practice pointers covering both configuration and day-to-day operations.
In the world of security, proactive and reactive speed determines the winner, so in practice many of these best practices should be automated. If your organization's Dev/Ops/DevOps teams need help automating security group best practices, feel free to contact me.

AWS has released so many security-related features in the last few years that we should no longer view security groups in isolation; it just does not make sense anymore. A security group should always be seen in the overall security context. With that, I start the pointers.

Practice 1: Enable AWS VPC Flow Logs at the VPC, subnet or ENI level. VPC Flow Logs can be configured to capture both accept and reject entries flowing through the ENIs and security groups of EC2 instances, ELBs and some other services. These flow log entries can be scanned to detect attack patterns, alert on abnormal activity and information flow inside the VPC, and provide valuable insights to the SOC/MS team operations.

Practice 2: Use AWS Identity and Access Management (IAM) to control who in your organization has permission to create and manage security groups and network ACLs (NACL). Isolate the responsibilities and roles for better defense. For example, you can give only your network administrators or security admin the permission to manage the security groups and restrict other roles.

Practice 3: Enable AWS CloudTrail logs for your account. CloudTrail will log all security group events, and it is needed for the management and operation of security groups. Event streams can be created from CloudTrail logs and processed using AWS Lambda. For example, whenever a security group is deleted, this event is captured with details in the CloudTrail logs; a Lambda function can process the change and alert the MS/SOC on a dashboard or by email, as per your workflow. This is a very powerful way of reacting to events within <7 minutes. Alternatively, you can process the CloudTrail logs stored in S3 in batches at some frequency X and achieve the same, but your operations team's reaction time will then depend on the generation and polling frequency of the CloudTrail logs. This activity is a must for your operations team.
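A sketch of such a Lambda target, assuming a hypothetical SNS topic ARN; the event-name filter covers the common EC2 security group API calls:

```python
SG_EVENTS = {
    'CreateSecurityGroup', 'DeleteSecurityGroup',
    'AuthorizeSecurityGroupIngress', 'RevokeSecurityGroupIngress',
    'AuthorizeSecurityGroupEgress', 'RevokeSecurityGroupEgress',
}


def security_group_change(event):
    """Return the event name if this CloudTrail event is a security
    group change, else None."""
    name = event.get('detail', {}).get('eventName')
    return name if name in SG_EVENTS else None


def lambda_handler(event, context):
    import boto3  # deferred so the filter stays testable offline
    change = security_group_change(event)
    if change:
        boto3.client('sns').publish(
            TopicArn='arn:aws:sns:us-east-1:123456789012:sg-changes',  # hypothetical
            Subject='Security group change: ' + change,
            Message=str(event['detail']))
```

The same filter function can be reused in a batch job that scans CloudTrail logs from S3.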
Practice 4: Enable AWS Config for your AWS account. AWS Config records all events related to your security group changes and can even send notification emails.

Practice 5: Have a proper naming convention for your Amazon Web Services security groups. The naming convention should follow an enterprise standard. For example, it can follow the notation: “AWS Region + Environment Code + OS Type + Tier + Application Code”
Security Group Name – EU-P-LWA001
AWS Region ( 2 char ) = EU, VA, CA etc
Environment Code (1 Char)  = P-Production , Q-QA, T-testing, D-Development etc
OS Type (1 Char)= L -Linux, W-Windows etc
Tier (1 Char)= W-Web, A-App, C-Cache, D-DB etc
Application Code ( 4 Chars) = A001
We have been using Amazon Web Services since 2008 and have found over the years that managing security groups across multiple environments is itself a huge task. A proper naming convention from the beginning is a simple practice, but it will make your AWS journey manageable.

Practice 6: For defense in depth, make sure your Amazon Web Services security group naming convention is not self-explanatory, and make sure your naming standards stay internal. Example: an AWS security group named UbuntuWebCRMProd makes it self-evident to hackers that it is a production CRM web tier running on the Ubuntu OS. Have an automated program periodically scan your security group names with regex patterns for information-revealing names and alert the SOC/managed services teams.

Practice 7: Periodically detect, alert on or delete AWS security groups that do not strictly follow your organization's naming standards. Have an automated program do this as part of your SOC/managed services operations. Once this stricter control is implemented, things will fall in line automatically.
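Following the example notation from Practice 5, the detection can be as simple as a regex check; the pattern below encodes that example and should be adjusted to your own standard:

```python
import re

# Region (2 letters) - Environment (P/Q/T/D) - OS (L/W) + Tier (W/A/C/D) + App code (e.g. A001)
SG_NAME_PATTERN = re.compile(r'^[A-Z]{2}-[PQTD]-[LW][WACD][A-Z]\d{3}$')


def follows_naming_standard(name):
    """True if a security group name matches the enterprise convention."""
    return bool(SG_NAME_PATTERN.match(name))


print(follows_naming_standard('EU-P-LWA001'))      # True
print(follows_naming_standard('UbuntuWebCRMProd')) # False
```

Run this over the names returned by `describe_security_groups` and route the failures to your SOC/MS alerting.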

Practice 8: Have automation in place to detect all EC2 instances, ELBs and other AWS assets associated with each security group. This automation helps you periodically detect security groups lying idle with no associations, alert the MS team and cleanse them. Unwanted security groups accumulated over time only create confusion.
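A sketch of the idle-group detection using boto3: a group referenced by no network interface is a deletion candidate. This heuristic misses groups referenced only from other groups' rules, so treat the output as candidates to review, not to auto-delete:

```python
def unused_group_ids(all_groups, interfaces):
    """Security groups referenced by no network interface. all_groups is
    describe_security_groups()['SecurityGroups']; interfaces is
    describe_network_interfaces()['NetworkInterfaces']."""
    in_use = {g['GroupId'] for eni in interfaces for g in eni.get('Groups', [])}
    return sorted({g['GroupId'] for g in all_groups} - in_use)


def find_unused_security_groups(region='us-east-1'):
    import boto3  # deferred so the set logic stays testable offline
    ec2 = boto3.client('ec2', region_name=region)
    groups = ec2.describe_security_groups()['SecurityGroups']
    enis = ec2.describe_network_interfaces()['NetworkInterfaces']
    return unused_group_ids(groups, enis)
```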

Practice 9: When you create a VPC in your AWS account, AWS automatically creates a default security group for it. If you don't specify a different security group when you launch an instance, the instance is automatically associated with this default security group, which allows inbound traffic only from other instances associated with the default security group and allows all outbound traffic. The default security group specifies itself as a source security group in its inbound rules; this is what allows instances associated with it to communicate with each other. This is not a good security practice. If you don't want all your instances to use the default security group, create your own security groups and specify them when you launch your instances. This applies to EC2, RDS, ElastiCache and some other AWS services. So detect use of “default” security groups periodically and alert the SOC/MS.

Practice 10: Email and cloud-management-dashboard alerts should be triggered whenever critical security groups or rules are added, modified or deleted in production. This is important both for reactive action by your managed services/security operations team and for audit purposes.

Practice 11: When you associate multiple security groups with an Amazon EC2 instance, the rules from each security group are effectively aggregated into one set of rules, and AWS uses this set to determine whether to allow access. If there is more than one rule for a specific port, AWS applies the most permissive one. For example, if you have one rule that allows access to TCP port 22 (SSH) from IP address 203.0.113.10 and another rule that allows access to TCP port 22 for everyone, then everyone will have access to TCP port 22, because the most permissive rule takes precedence.
Practice 11.1: Have automated programs detect EC2 instances associated with multiple security groups/rules and alert the SOC/MS periodically. Condense them manually to at most 1-3 rules as part of your operations.

Practice 11.2: Have automated programs detect conflicting security group rules, such as restrictive and permissive rules applied together, and alert the SOC/MS periodically.

Practice 12: Do not create wide-open security group rules like 0.0.0.0/0, which is open to everyone.
Since web servers must receive HTTP and HTTPS traffic from the Internet, only their security groups may be permissive in this way:
0.0.0.0/0, TCP, 80: allow inbound HTTP access from anywhere
0.0.0.0/0, TCP, 443: allow inbound HTTPS access from anywhere
Any other least-restrictive rule created in your account should be alerted to the SOC/MS teams immediately.
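Detection of such rules can be automated by scanning each group's inbound permissions for 0.0.0.0/0 on anything other than the web ports. A sketch of the core check; pair it with `describe_security_groups` and your alerting pipeline:

```python
WEB_PORTS = {80, 443}  # the only ports we allow to be world-open, per Practice 12


def world_open_violations(ip_permissions):
    """Find (port, cidr) pairs open to 0.0.0.0/0 on non-web ports.
    ip_permissions is a security group's 'IpPermissions' list from
    describe_security_groups()."""
    violations = []
    for perm in ip_permissions:
        port = perm.get('FromPort')
        for ip_range in perm.get('IpRanges', []):
            if ip_range.get('CidrIp') == '0.0.0.0/0' and port not in WEB_PORTS:
                violations.append((port, '0.0.0.0/0'))
    return violations
```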

Practice 13: Have a security policy not to launch servers listening on default ports like 3306, 1630, 1433, 11211, 6379, etc. If the policy is accepted, security groups are then created on the new, non-default listening ports instead of the defaults. This provides a small extra layer of defense, since one cannot infer from the security group port which service the EC2 instance is running. Automated detection and alerts should be created for the SOC/MS if security groups are created with default ports.

Practice 14: Applications that must meet stricter compliance requirements such as HIPAA or PCI need end-to-end transport encryption implemented on the server back end in AWS. Communication from the ELB through the web, app, DB and other tiers needs to be encrypted using SSL/HTTPS; this means only secured ports like 443, 465 and 22 are permitted in the corresponding EC2 security groups. Automated detection and alerts should be created for the SOC/MS if security groups for regulated applications are created on ports other than these secure ones.

Practice 15: Detection, alerts and actions can be driven by parsing the AWS CloudTrail logs for deviations from the usual patterns observed in your production environment.
Examples:
15.1: A port opened and closed within <30 (or X) minutes in production is a candidate for suspicious activity, if that is not a normal pattern for your production.
15.2: A permissive security group created and removed within <30 (or X) minutes is a candidate for suspicious activity, if that is not a normal pattern for your production.
In general, detect anomalies in how long a security group change stayed in effect before being reverted in production.

Practice 16: When ports have to be opened in Amazon Web Services security groups, or a permissive security group needs to be applied, automate the entire process as part of your operations so that the security group is open for X agreed minutes and is then automatically closed, in line with your change management. Reducing manual intervention avoids operational errors and adds security.

Practice 17: Make sure SSH/RDP access is open in your security groups only from the jump boxes/bastion hosts of your VPC/subnets. Have stricter controls and policies to avoid opening SSH/RDP to other instances in the production environment. Periodically check for, alert on and close this loophole as part of your operations.

Practice 18: It is a bad practice to leave SSH open to the entire Internet for emergency or remote support. By allowing the entire Internet access to your SSH port, there is nothing stopping an attacker from exploiting your EC2 instance. The best practice is to allow only very specific IP addresses in your security groups; this restriction improves protection. These could be the addresses of your office, on-premise network or data center, through which you connect to your jump box.

Practice 19: Too many or too few? How many security groups a typical multi-tiered web app should have is a frequently asked question.
Option 1: One security group cutting across multiple tiers is easy to configure, but it is not recommended for secure production applications.
Option 2: One security group per instance is too much protection and tough to manage operationally in the longer term.
Option 3: An individual security group for each tier of the application; for example, separate security groups for the ELB, web, app, DB and cache tiers of your application stack.
Periodically check whether Option 1-style groups are being created in your production environment and alert the SOC/MS.

Practice 20: Avoid allowing UDP or ICMP to private instances in security groups unless specifically needed.

Practice 21: Open only specific ports; opening a range of ports in a security group is not a good practice. You can add many inbound ingress rules to a security group, and it is always advisable to open specific ports like 80 and 443 rather than a range of ports like 200-300.

 


Practice 22: Private subnet instances can be accessed only from within the VPC CIDR IP range. Opening them to public IP ranges is possible, but it makes no sense; e.g., opening HTTP to 0.0.0.0/0 in the security group of a private subnet instance achieves nothing. Detect and cleanse such rules.

 

Practice 23: AWS CloudTrail logs capture security-related events. AWS Lambda events or automated programs should trigger alerts to operations when abnormal activities are detected. For example:
23.1: Alert when X security groups are added/deleted at hour “Y” or on day “Y” by an IAM user/account.
23.2: Alert when X security group rules are added/deleted at hour “Y” or on day “Y” by an IAM user/account.

Practice 24: If you are an enterprise, make sure all security group activity in your production environment is part of your change management process; security group actions can be manual or automated within enterprise change management.
If you are an agile startup or SMB without a complicated change management process, automate most security-group-related tasks and events as illustrated in the practices above. This will bring immense efficiency to your operations.

Practice 25: Use outbound/egress security group rules wherever applicable within your VPC. For example, restrict FTP connections from your VPC to arbitrary servers on the Internet; this way you can prevent data dumps and important files from being transferred out of your VPC. Defend harder and make it tougher!

Practice 26: For some tiers of your application, use an ELB in front of your instances as a security proxy with restrictive security groups (restrictive ports and IP ranges). This doubles your defense, but increases latency.

Practice 27: Some of the tools we use to automate and meet the above best practices are ServiceNow, AWS CloudFormation templates, the AWS APIs, Rundeck, Puppet, Chef, and automated programs written in Python, .NET and Java.

Note: If your organization's Dev/Ops/DevOps teams need help automating the security group best practices listed above, feel free to contact me at harish11g.aws@gmail.com.

 


 

About the Author

Harish Ganesan is the Chief Technology Officer (CTO) of 8K Miles and is responsible for the overall technology direction of the 8K Miles products and services. He has around two decades of experience in architecting and developing Cloud Computing, E-commerce and Mobile application systems. He has also built large internet banking solutions that catered to the needs of millions of users, where security and authentication were critical factors. He is also a prolific blogger and frequent speaker at popular cloud conferences.

 

Loading Big Index Data into newly launched Amazon CloudSearch engine

The search tier is the most critical section of many online verticals like travel, e-commerce and classifieds. If users cannot search products efficiently, they will not make their buying decisions properly, which in turn massively affects these companies' revenues. Search tiers are usually powered by Apache Solr, FAST, Autonomy, Elasticsearch and the like. AWS also has a search offering, CloudSearch: a fully managed service in the cloud that makes it easy to set up, manage and scale a search solution for your website. Amazon CloudSearch relieves you of worrying about hardware provisioning, setup and maintenance, and as your volume of data and traffic fluctuates, it automatically scales to meet your needs.

In AWS infrastructure, Apache Solr has been the king and the software to beat until now; recently it has gained a heavy competitor in the form of Amazon CloudSearch (API version 2013-01-01).

API version 2013-01-01 of Amazon CloudSearch is internally powered by a customized version of the Apache Solr engine, and it is specifically designed for running highly scalable and available search on the Amazon Web Services cloud. This 2013 CloudSearch API has a lot in common with Apache Solr, so customers can easily migrate to it and leverage the benefits of Amazon's cloud infrastructure. We already hear that many AWS customers are planning migrations from FAST, Solr and the A9 engine to the Amazon CloudSearch 2013-01-01 API engine.

My team is already migrating a couple of customers to the Amazon CloudSearch 2013-01-01 API, and I have shared our experience of this process for the benefit of the AWS community.

Reference Migration Architecture and requirements:


In this article I am going to explore how to migrate a large index to Amazon CloudSearch in a highly scalable, parallel manner on AWS infrastructure:

  • The index is 300+ GB, containing close to 247+ million records distributed across 105 searchable fields.
  • The 300+ GB index file is stored in Amazon S3.
  • A custom data loader program built on Amazon Elastic MapReduce (EMR) is used for parallel loading.
  • Around ~6 search.m2.2xlarge instances are created, with 2 partitions and a replication count of 5.
  • Around 10+ m1.large EMR core nodes handle the data loading. The loader can be scaled to hundreds of nodes depending on the volume and velocity of data pump required.
  • Amazon CloudSearch infrastructure provisioning, automated partitioning and replication count are handled by AWS.

Let's get into the details below:

Step 1) Create a new Amazon CloudSearch domain: We named the search domain “bigdatasearch” and chose the search instance type search.m2.2xlarge. Since we are planning to pump and query a 300 GB index with millions of documents, it did not make sense for us to choose a smaller Amazon CloudSearch instance type. Usually the base instance type can be selected based on the number and size of the documents you are planning to maintain in Amazon CloudSearch.
Note: Here we have chosen a replication count of 5. This is a little strange in a distributed architecture, because usually a higher replication count for the master decreases the speed of document upload. But when we were playing with Amazon CloudSearch we observed that it increased the speed of uploads. In addition we also observed the following:

  • If we keep the replication count at 0 or low, use a smaller search instance type, and pump documents in parallel from multiple nodes, either the Amazon CloudSearch server fails sometimes or error rates are high.
  • If we keep the replication count at 0 or low, use a larger search instance type, and pump documents in parallel from multiple nodes, Amazon CloudSearch internally creates 3-5 nodes itself, and this shows up in the replication count. We are waiting to discuss this behavior with AWS SA folks.

We will be utilizing a distributed uploading technique which we custom built using Amazon Elastic MapReduce to pump data to the Amazon CloudSearch server. This technique enables us to write more index data in parallel.
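To give a feel for what each parallel loader task produces, here is a minimal sketch of building a CloudSearch 2013-01-01 document batch (a JSON array of "add" operations, which each loader node POSTs to the domain's document endpoint). The record IDs and field names below are illustrative placeholders, not the actual 105-field schema.

```python
import json

def build_batch(records):
    """Convert (id, fields) tuples into a CloudSearch JSON document batch."""
    batch = [{"type": "add", "id": doc_id, "fields": fields}
             for doc_id, fields in records]
    return json.dumps(batch)

# Hypothetical records standing in for one slice of the 300+ GB dump.
records = [
    ("prod-001", {"title": "travel bag", "price": 49.0}),
    ("prod-002", {"title": "trolley", "price": 120.0}),
]
payload = build_batch(records)
# Each EMR loader node posts payloads like this in parallel as an HTTP POST
# with Content-Type: application/json (endpoint and HTTP client omitted here).
```

Because the batches are independent, the EMR core nodes can emit them concurrently without coordinating with each other, which is what makes the upload scale linearly with node count.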

Step 2) Select how you would like to create the Amazon CloudSearch schema: Here we chose Manual setup, since we already have a schema to be migrated to Amazon CloudSearch.

The next step is to add index fields to create your Amazon CloudSearch schema configuration.

Step 3) Adding Amazon CloudSearch index fields: Once all the fields have been configured in the schema, click on the continue button. In the schema file used, we have 100+ fields to be indexed for this particular search domain.
Step 4) Review the setup configurations and launch:
We have 100+ index fields, with scaling options instance type m2.2xlarge and replication count 5, in the “bigdatasearch” domain.
Step 5) Wait till the Amazon CloudSearch infrastructure is provisioned for you in the background. Usually it takes around 10 minutes; any errors encountered while creating the index fields will also be listed.
Once the Amazon CloudSearch infrastructure is provisioned at the back end, you should notice the “bigdatasearch” domain is “Active”. The search and document endpoints are published, and currently the number of searchable documents is “0”. There is only 1 CloudSearch index partition (shard) and 5 search.m2.2xlarge instances.
Step 6) Configuring synonyms: We have 2+ MB of synonyms which need to be configured in the Amazon CloudSearch domain. For this, we used the CloudSearch cli-toolkit to upload synonyms to CloudSearch.
cs-configure-analysis-scheme -d bigdatasearch --name customanalysisscheme --lang en -e cloudsearch.ap-southeast-1.amazonaws.com --synonyms customsynonyms.txt
Since the volume of index data is huge (300+ GB), we created a custom data loader built on Amazon Elastic MapReduce to pump the data in parallel into Amazon CloudSearch. Since it is built on Amazon Elastic MapReduce, we can use the same program without modification at scale, uploading TBs of index into the search system with hundreds of data loader EMR core/task nodes.
Step 7) Create Amazon Elastic MapReduce Data Loader Cluster Configuration:
Step 8) Configure the Elastic MapReduce (EMR) capacity: We are using 10 m1.large core node instances for uploading the data from inside the AWS VPC. Depending upon the data size (GB -> TB) and upload hours, we can increase the EMR core nodes' capacity and count to speed up the data pump (upload) process.

To know more about how Spot instances can save cost on Amazon EMR, refer to AWS Cost Saving Tip 12: Add Spot Instances with Amazon EMR.

Step 9) Add the custom data loader program JAR to EMR:
We exported the data from an MS SQL Server as a flat UTF-8 dump file and stored it in Amazon S3. We are giving the 300+ GB dump file as the input for the Amazon EMR CloudSearch data loader program to upload into Amazon CloudSearch in parallel. Bucket configurations for the data loader JAR, input, output, and log files are configured on this screen.

Step 10) Configure Amazon CloudSearch access policies: We need to open the CloudSearch security group access policies to accept upload requests from the EMR cluster inside the VPC. Configure the static IPs of all the instances or the IP range of the data loader clients.
Step 11)Run the Amazon Elastic MapReduce Data loader job :
Step 12) Analyzing the Amazon EMR Data loader Job Output:
Output of the job can be seen in the AWS EMR job logs. Here are a few details:
  • “Map output records” in the log tells how many records were inserted into Amazon CloudSearch; we can observe that close to 247,681,520 documents (247+ million) were pumped.
  • “Bytes Read” in the output tells the size of the data set the job has read. We can observe 322,387,978,332 bytes, which is equivalent to the 300+ GB of index in Amazon CloudSearch.
  • The entire pumping process took ~30 hours with 10 m1.large core nodes for us. We observed that increasing the number of data loader EMR nodes or their capacity improves the upload speed drastically.
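As a quick sanity check on the job counters above, the “Bytes Read” figure does indeed work out to roughly 300 binary gigabytes, and dividing it by the record count gives a rough average record size in the dump:

```python
# Arithmetic check on the EMR job counters reported above.
bytes_read = 322_387_978_332   # "Bytes Read" counter
docs = 247_681_520             # "Map output records" counter

gib = bytes_read / 1024**3          # binary gigabytes (GiB)
avg_doc_bytes = bytes_read / docs   # rough average record size in the dump

print(round(gib, 1))                # -> 300.2
print(round(avg_doc_bytes))         # -> 1302
```

So the dump averages about 1.3 KB per record, which is consistent with a 105-field product document.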
Step 13) Clean up: Reset the replication count to the level of HA needed, ideally 1-2 nodes. Once the job is completed, revert the security access policies in Amazon CloudSearch. Terminate the EMR cluster and clean up any leftover resources.

Step 14) Analyzing the CloudSearch dashboard:
We observed that it takes some time for CloudSearch to reflect the actual count of the indexed documents.

After pumping the 300+ GB index, you can observe that currently 2 Amazon CloudSearch partitions (shards) are used to distribute the 247+ million documents with 100+ index fields. This is a tremendous cost saving compared to the A9-powered Amazon CloudSearch. Amazon CloudSearch has automatically created shards based on the volume of data pumped into the system. This is cool; it reduces the maintenance headache for infra admins. If the Amazon CloudSearch team makes this partition concept a configurable parameter in the future, it will be even more useful.
Step 15) Executing sample search queries: We executed some sample product search queries on the “bigdatasearch” domain to check whether everything was fine. A distributed query was fired, and results came back sub-second from one of the partitions.
In short: it is cost effective compared to the old A9-powered CloudSearch; automated scaling of replication counts for request scalability and automated scaling of partitions for data scalability relieve infra admin headaches; and its strong Apache Solr pedigree and the long list of feature additions in the coming months will make it even more interesting.
After working with this service for a few weeks, we feel it is going to become the major search service on AWS in the coming years, giving a tough fight to Apache Solr and Elasticsearch deployments on EC2.
This article was co-authored with Ankit @8Kmiles.

25 Best Practice Tips for architecting your Amazon VPC

In my opinion, Amazon VPC is one of the most important features introduced by AWS. We have been using AWS since 2008 and Amazon VPC from the day it was introduced, and I strongly feel that customer adoption of the AWS cloud gained real momentum only after the introduction of VPC into the market.
Amazon VPC comes with lots of advantages over the limitations of the Amazon Classic cloud: static private IP addresses; Elastic Network Interfaces (it is possible to bind multiple Elastic Network Interfaces to a single instance); internal Elastic Load Balancers; advanced network access control; the ability to set up a secure bastion host; DHCP options; predictable internal IP ranges; moving NICs and internal IPs between instances; VPN connectivity; heightened security; etc. Each of these is an interesting topic on its own, and I will be discussing them in detail in the future.
Today I am sharing some of our implementation experience from working with hundreds of Amazon VPC deployments, as best practice tips for the AWS user community. You can apply the relevant ones in your existing VPC, or use these points as part of your migration approach to Amazon VPC.

Practice 1) Get your Amazon VPC combination right: Select the right Amazon VPC architecture first. You need to decide on the right Amazon VPC & VPN setup combination based on your current and future requirements. It is tough to modify/re-design an Amazon VPC at a later stage, so it is better to design it taking into consideration your network and expansion needs for the next ~2 years. Currently, different types of Amazon VPC setups are available: public-facing VPC, VPC with public and private subnets, VPC with public and private subnets and hardware VPN access, VPC with private subnets and hardware VPN access, software-based VPN access, etc. Choose the one which you feel will fit you for the next 1-2 years.

Practice 2) Choose your CIDR blocks: While designing your Amazon VPC, the CIDR block should be chosen in consideration of the number of IP addresses needed and whether you are going to establish connectivity with your data center. The allowed block size is between a /28 netmask and a /16 netmask, so an Amazon VPC can contain from 16 to 65,536 IP addresses. Currently, an Amazon VPC once created cannot be modified, so it is usually best to choose a CIDR block with more IP addresses. Also, when you design the Amazon VPC architecture to communicate with your on-premise/data center, ensure that the CIDR range used in the Amazon VPC does not overlap or conflict with the CIDR blocks in your on-premise/data center. Note: if overlapping CIDR blocks are used while configuring the customer gateway, they will conflict.
E.g., if your VPC CIDR block is 10.0.0.0/16 and you have a 10.0.25.0/24 subnet in the data center, communication from instances in the VPC to the data center will not happen, since that subnet is part of the VPC CIDR. To avoid these consequences, it is good to keep the IP ranges in different blocks. For example: Amazon VPC in 10.0.0.0/16 and the data center in the 172.16.0.0/24 range.
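The overlap check described above is easy to automate before the VPC is created. A minimal sketch using Python's standard `ipaddress` module (the CIDR values are the ones from the example):

```python
import ipaddress

def overlapping(vpc_cidr, dc_cidrs):
    """Return the on-premise CIDRs that overlap the proposed VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    return [c for c in dc_cidrs if ipaddress.ip_network(c).overlaps(vpc)]

# The conflicting layout from the example: 10.0.25.0/24 sits inside 10.0.0.0/16.
print(overlapping("10.0.0.0/16", ["10.0.25.0/24", "172.16.0.0/24"]))
# -> ['10.0.25.0/24']

# A safe layout keeps the ranges disjoint.
print(overlapping("10.0.0.0/16", ["172.16.0.0/24"]))
# -> []
```

Running a check like this against every data center subnet before committing the VPC CIDR avoids the painful re-design the practice warns about.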

Practice 3) Isolate according to your use case: Create separate Amazon VPCs for the development, staging, and production environments, or create one Amazon VPC with separate subnets/security/isolated network groups for production, staging, and development. We have observed 60% of customers preferring the second choice. Choose the right one according to your use case.

Practice 4) Securing Amazon VPC: If you are running a mission-critical workload with complex security needs, you can secure the Amazon VPC like your on-premise data center, or sometimes even more. Some of the tips to secure your VPC are:

  • Secure your Amazon VPC using a firewall virtual appliance or web application firewall available from the Amazon Web Services Marketplace. You can use Check Point, Sophos, etc. for this.
  • You can configure Intrusion Prevention or Intrusion Detection virtual appliances to secure the protocols and take preventive/corrective actions in your VPC.
  • Configure VM encryption tools which encrypt your root and additional EBS volumes. The key can be stored inside AWS or in your data center outside Amazon Web Services, depending on your compliance needs. http://harish11g.blogspot.in/2013/04/understanding-Amazon-Elastic-Block-Store-Securing-EBS-TrendMicro-SecureCloud.html
  • Configure Privileged Identity access management solutions on your Amazon VPC to monitor and audit the access of the administrators of your VPC.
  • Enable CloudTrail to audit ACL policy changes in the VPC environments: http://harish11g.blogspot.in/2014/01/Integrating-AWS-CloudTrail-with-Splunk-for-managed-services-monitoring-audit-compliance.html
  • Apply antivirus for cleansing specific EC2 instances inside the VPC. Trend Micro has a very good product for this.
  • Configure site-to-site VPN for securely transferring information between Amazon VPCs in different regions, or between an Amazon VPC and your on-premise data center.
  • Follow the security group and network ACL best practices listed below.

Practice 5) Understand Amazon VPC limits: Always design the VPC subnets with future expansion in mind. Also understand the Amazon VPC limits before using them. AWS has various limitations on VPC components, like rules per security group, number of route tables and subnets, etc. Some of them may be increased by raising a request to the Amazon support team, while a few components cannot be increased. Ensure the limitations do not affect your overall design. Refer to:
http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_Limits.html

Practice 6) IAM your Amazon VPC: When you assign people to maintain your Amazon VPC, you can create Amazon IAM accounts with fine-grained permissions, or use sophisticated Privileged Identity Management solutions available on the AWS Marketplace to IAM your VPC.

Practice 7) Disaster Recovery or geo-distributed Amazon VPC setup: When you are designing a Disaster Recovery setup using VPC, or expanding to another Amazon VPC region, you can follow these simple rules. Create your production site VPC with CIDR 10.0.0.0/16 and your DR region VPC with CIDR 172.16.0.0/16. Make sure they do not conflict with the on-premises subnet CIDR block, in the event both need to be integrated with the on-premise DC as well. After creating the CIDR blocks, set up a VPC tunnel between the regions and to your on-premise DC. This will help you replicate your data using private IPs.

Practice 8) Use security groups and network ACLs wisely: It is advisable to use security groups over network ACLs inside Amazon VPC wherever applicable, for better control. Security groups are applicable at the EC2 instance level, while network ACLs are applicable at the subnet level. Security groups are mostly used for whitelisting; to blacklist IPs, use network ACLs.

Practice 9) Tier your security groups: Create different security groups for the different tiers of your infrastructure architecture inside your VPC. If you have web, app, and DB tiers, create a different security group for each of them. Creating tier-wise security groups increases infrastructure security inside the Amazon VPC: EC2 instances in each tier can talk only on the application-specified ports, not on all ports. If you create Amazon VPC security groups for each and every tier/service separately, it will be easier to open a port to a particular service. Don't use the same security group for multiple tiers of instances; this is a bad practice.
Example: open ports to a security group instead of an IP range. People have a tendency to open port 8080 to the 10.10.0.0/24 (web layer) range. Instead of that, open port 8080 to web-security-group. This will make sure only instances in the web security group can connect on port 8080. If someone launches a NAT instance with NAT-Security-Group in 10.10.0.0/24, it won't be able to connect on port 8080, as access is allowed only from the web security group.
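The security-group-as-source rule above can be expressed with the AWS CLI; this is a non-runnable configuration sketch, and the two group IDs are placeholders for your app tier and web tier security groups.

```shell
# Allow port 8080 on the app tier only from members of the web tier's
# security group (placeholder IDs), instead of from a CIDR range.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0a1b2c3d4e5f6a7b8 \
    --protocol tcp \
    --port 8080 \
    --source-group sg-0f9e8d7c6b5a4f3e2
```

Because membership in the web security group, not an IP range, grants access, instances launched into other groups in the same subnet gain nothing.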
Practice 10) Standardize your security group naming conventions: Following a security group naming convention inside Amazon VPC will improve operations/management for large-scale deployments inside the VPC. It also avoids manual errors and leaks, and saves cost and time overall.
For example: simple ones like Prod_DMZ_Web_SG or Dev_MGMT_Utility_SG, or complex coded ones for large-scale deployments, like
USVA5LXWEBP001 (US East Virginia, AZ 5, Linux, Web Server, Production, 001).
This helps in better management of security groups.
Practice 11) ELB on Amazon VPC: When using Amazon ELB for web applications, put all other EC2 instances (tiers like app, cache, DB, background, etc.) in private subnets as much as possible. Unless there is a specific requirement where instances need outside-world access and an EIP attached, put all instances in private subnets only. As a secure practice in an Amazon VPC environment, only ELBs should be provisioned in the public subnet.
Practice 12) Control your outgoing traffic in Amazon VPC: If you are looking for better security for the traffic going to the internet gateway, use software like Squid or Sophos to restrict the ports, URLs, domains, etc., so that all traffic goes through a controlled proxy tier and also gets logged. Using these proxy/security systems we can also restrict unwanted ports; by doing so, if there is any security compromise of the application running inside the Amazon VPC, it can be detected by auditing the restricted connections captured in the logs. This helps as a corrective security measure.
Practice 13) Plan your NAT instance type: Whenever your application EC2 instances residing inside a private subnet of an Amazon VPC make web service/HTTP/S3/SQS calls, they go through the NAT instance. If you have designed auto scaling for your application tier and there are chances that tens of app EC2 instances will make lots of web calls concurrently, the NAT instance will become a performance bottleneck at this juncture. Size your NAT instance capacity depending upon application needs to avoid performance bottlenecks. Using NAT instances gives us the advantage of saving the cost of Elastic IPs, and provides extra security by not exposing the instances to the outside world for internet access.
Practice 14) Spread your NAT instances across multiple subnets: What if you have hundreds of EC2 instances inside your Amazon VPC making lots of heavy web service/HTTP calls concurrently? A single NAT instance, even of the largest EC2 size, sometimes cannot handle that bandwidth and may become a performance bottleneck. In such scenarios, span your EC2 instances across multiple subnets and create a NAT instance for each subnet. This way you can spread your outgoing bandwidth and improve the performance of your VPC-based deployments.
Practice 15) Use EIPs when needed: At times you may need to keep part of your application services in a public subnet for external communication. It is recommended practice to associate them with an Amazon Elastic IP and whitelist these IP addresses in the target services used by them.
Practice 16) NAT instance practices: If needed, enable multi-factor authentication on the NAT instance. SSH and RDP ports should be open only to specific source and destination IPs, not the global network (0.0.0.0/0), and only to static exit IPs, not dynamic exit IPs.
Practice 17) Plan your tunnel between your on-premise DC and Amazon VPC:
Select the right mechanism to connect your on-premises DC to Amazon VPC. This will help you connect to the EC2 instances via private IPs in a secure manner.
  • Option 1: Secure IPSec tunnel to connect a corporate network with Amazon VPC (http://aws.amazon.com/articles/8800869755706543)
  • Option 2 : Secure communication between sites using the AWS VPN CloudHub (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPN_CloudHub.html)
  • Option 3: Use Direct connect between Amazon VPC and on premise when you have lots of data to be transferred with reduced latency (or) you have spread your mission critical workloads across cloud and on premise. Example: Oracle RAC in your DC and Web/App tier in your Amazon VPC. Contact us if you need help on setting up direct connect between Amazon VPC and DC.
Practice 18) Always span your Amazon VPC across multiple subnets in multiple Availability Zones inside a region. This helps in architecting high availability inside your Amazon VPC properly. Example classification of VPC subnets: web tier subnet: 10.0.10.0/24 in AZ1 and 10.0.11.0/24 in AZ2; application tier subnet: 10.0.12.0/24 and 10.0.13.0/24; DB tier subnet: 10.0.14.0/24 and 10.0.15.0/24; cache tier subnet: 10.0.16.0/24 and 10.0.17.0/24; etc.
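A subnet plan like the one above (one /24 per tier per AZ) can be validated programmatically before any subnet is created. A small sketch with the standard `ipaddress` module, using the exact CIDRs from the example:

```python
import ipaddress

# Each tier gets one /24 per AZ; every /24 must fall inside the VPC's
# 10.0.0.0/16 block without overlapping any other subnet.
vpc = ipaddress.ip_network("10.0.0.0/16")
plan = {
    "web":   ["10.0.10.0/24", "10.0.11.0/24"],  # AZ1, AZ2
    "app":   ["10.0.12.0/24", "10.0.13.0/24"],
    "db":    ["10.0.14.0/24", "10.0.15.0/24"],
    "cache": ["10.0.16.0/24", "10.0.17.0/24"],
}
subnets = [ipaddress.ip_network(c) for tier in plan.values() for c in tier]

assert all(s.subnet_of(vpc) for s in subnets)    # all inside the VPC
assert all(not a.overlaps(b)                     # no two subnets collide
           for i, a in enumerate(subnets) for b in subnets[i + 1:])
print(len(subnets), sum(s.num_addresses for s in subnets))  # -> 8 2048
```

Eight /24 subnets consume only 2,048 of the /16's 65,536 addresses, leaving plenty of room for the ~2-year expansion Practice 1 recommends planning for.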
Practice 19) A good security practice is to have only the public subnet carry a route table with a route to the internet gateway. Apply this wherever applicable.
Practice 20) Keep your data closer: For small-scale deployments in AWS where cost is more critical than high availability, it is better to keep the web/app tier in the same Availability Zone as ElastiCache, RDS, etc. inside your Amazon VPC. Design your subnets accordingly to suit this. This is not a recommended architecture for applications demanding high availability.
Practice 21) Allow and deny network ACLs: Create internet-outbound allow and deny network ACLs in your VPC.
First network ACL: allow all HTTP and HTTPS outbound traffic on the public, internet-facing subnet.
Second network ACL: deny all HTTP/HTTPS traffic; allow all traffic to the Squid proxy server or other virtual appliance.
Practice 22) Restrictive network ACLs: Block all inbound and outbound ports; only allow the application's request ports. Network ACLs are stateless traffic filters that apply to all traffic inbound to or outbound from a subnet within the VPC. AWS-recommended outbound rules: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_NACLs.html
Practice 23) Create route tables only when needed, and use the Associations option to map subnets to the route tables in your Amazon VPC.
Practice 24) Use Amazon VPC peering (new): Amazon Web Services has introduced the VPC peering feature, which is quite a useful one. An AWS VPC peering connection is a networking connection between two Amazon VPCs that enables you to route traffic between them using private IP addresses. Currently both VPCs must be in the same AWS region; instances in either VPC can communicate with each other as if they were within the same network. Since AWS uses the existing infrastructure of a VPC to create a VPC peering connection, it is neither a gateway nor a VPN connection and does not rely on a separate piece of physical hardware (which essentially means there is no single point of failure for communication and no bandwidth bottleneck).

We have seen it is useful in following scenarios :
  1. Large enterprises usually run multiple Amazon VPCs in a single region, and some of their applications are so interconnected that they may need to access each other privately and securely inside AWS. Example: Active Directory, Exchange, and common business services will usually be interconnected.
  2. Large enterprises have different AWS accounts for different business units/teams/departments; at times, systems deployed by some business units in different AWS accounts need to be shared, or need to consume a shared resource privately. Example: CRM, HRMS, file sharing, etc. can be internal and shared. In such scenarios VPC peering comes in very useful.
  3. Customers can peer their VPC with their core suppliers to have more tightly integrated access to their systems.
  4. Companies offering infrastructure/application managed services on AWS can now safely peer into customer Amazon VPCs and provide monitoring and management of AWS resources.

Practice 25) Use Amazon VPC: It is highly recommended that you migrate all your new workloads into Amazon VPC rather than the Amazon Classic cloud. I also strongly recommend migrating your existing workloads from the Amazon Classic cloud to Amazon VPC, in phases or in one shot, whichever is feasible. In addition to the benefits of VPC detailed at the start of this article, AWS has started introducing lots of features which are compatible only with VPC, and in the AWS Marketplace as well there are lots of products which are compatible only with Amazon VPC. So make sure you leverage this strength of VPC. If you require any help with this migration, please contact me.

Readers, feel free to suggest more; I will link the relevant ones in this article.

Architecting Highly Available ElastiCache Redis replication cluster in AWS VPC

In this post let's explore how to architect and create a highly available and scalable Redis cache cluster for your web application in an AWS VPC. The following is the architecture in which the ElastiCache Redis cluster is assembled:

  • Redis Cache Cluster inside Amazon VPC for better control and security
  • Master Redis Node 1 will be created in AZ-1 of US-West
  • Redis Read Replica Node 2 will be created in AZ-2 of US-West
  • Redis Read Replica Node 3 will be created in AZ-3 of US-West

You can position all 3 Redis nodes in different Availability Zones for high availability, or you can position the Master + RR1 in AZ1 and RR2 in AZ2. The latter reduces inter-AZ latency and might give better performance for heavily used clusters.
Step 1: Creating Cache Subnet groups:
To create a cache subnet group, navigate to the ElastiCache dashboard, select Cache Subnet Groups, and then click “Create Cache Subnet Group”. Add the subnet ID and the Availability Zone you need to use for the ElastiCache cluster.

 

We have created an Amazon VPC spreading across 3 Availability Zones. In this post we are going to place the Redis master and 2 Redis replica slaves in these 3 Availability Zones. Since Redis will most of the time be accessed by your application tier, it is better to place the nodes in the private subnets of your VPC.
Step 2: Creating Redis Cache Cluster: 
To create the cache cluster, navigate to the ElastiCache dashboard, select Launch Cache Cluster, and provide the necessary details. We are launching it inside an Amazon VPC, so we have to select the cache subnet group.
Note: It is mandatory to create the cache subnet group before launch if you need the ElastiCache Redis cluster in an Amazon VPC.

 

For test purposes I have used an m1.small EC2 instance for Redis. Since this is a fresh Redis installation, I have not mentioned an S3 bucket from which a persistent Redis snapshot would be used as input. On successful creation of the cache cluster you can see the details in the dashboard.
Step 3: Replication Group Creation:
To create the replication group, select the Replication Groups option from the dashboard and then select “Create Replication Group”.

Select the master Redis node “redisinsidevpc” created previously as the primary cluster ID of the cache cluster. Give the replication group ID and description as illustrated below.

Note: The replication group should be created only after the primary cache cluster node is up and running; otherwise you will get the error shown below.

On the successful creation of the replication group you can see the following details. You can observe from the screenshot below that there is only one primary node, in US-WEST-2A, and zero Redis read replicas are attached to it.

Step 4: Adding Read Replica Nodes:
When you select the replication group, you can see the option to add a Redis read replica. We are adding 2 Redis read replicas, named Redis-RR1 (in US-WEST-2B) and Redis-RR2 (in US-WEST-2C). Both read replicas point to the master node “redisinsidevpc”. Currently we can add up to 5 read replica nodes per Redis master node. This is more than enough to handle thousands of messages per second; if you combine it with Redis pipelining, handling 100K messages per second from a node is a cakewalk.
Adding Read Replica 1 in US-WEST-2B

Adding Read Replica 2 in US-WEST-2C

On successful creation you can see the following details of the replication group in the dashboard. Now you can see there are 3 Redis nodes listed, with the number of read replicas as 2. Placing the read replicas and master node in multiple AZs increases high availability and protects you from node- and AZ-level failures. In our sample tests, inter-AZ replication deployments had a <2 second replication lag for massive writes on the master, and <1 second replication lag between master and slave for same-AZ deployments. We pumped ~100K messages per second for a few minutes on an m1.large Redis instance cluster.
If you need additional read scalability, I recommend adding more read replica slaves to the master.
In your application tier you need to use the primary endpoint “redis-replication.qcdze2.0001.usw2.cache.amazonaws.com:6379” shown below to connect to Redis.
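To actually benefit from the read replicas, the application tier has to split traffic itself: writes go to the primary endpoint, reads can be spread across the replica endpoints. Here is a hedged, minimal sketch of such a splitter; the router class is hypothetical (not an ElastiCache or redis-py API), the replica hostnames are placeholders, and the write-command set is deliberately tiny.

```python
import itertools

class RedisEndpointRouter:
    """Toy read/write splitter: writes to the primary, reads round-robin."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Cycle through replicas so read load is spread evenly.
        self._reads = itertools.cycle(replicas or [primary])

    def endpoint_for(self, command):
        writes = {"SET", "DEL", "INCR", "EXPIRE", "LPUSH"}  # illustrative subset
        return self.primary if command.upper() in writes else next(self._reads)

router = RedisEndpointRouter(
    primary="redis-replication.qcdze2.0001.usw2.cache.amazonaws.com:6379",
    replicas=["redis-rr1.example:6379", "redis-rr2.example:6379"],  # placeholders
)
print(router.endpoint_for("SET"))   # always the primary
print(router.endpoint_for("GET"))   # alternates between the two replicas
```

In a real application you would hold one Redis client connection per endpoint and pick the connection with logic like this; keep in mind replication is asynchronous, so a read routed to a replica may briefly lag the write (the <1-2 second lag measured above).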

If you need to delete, reboot, or modify the nodes, you can do so through the options available here.

Step 5: Promoting the Read replica:

You can also promote any node as the Primary cluster using the Promote/Demote option. There will be only one Primary Node.
Note: This step is not part of the cluster creation process.

This promotion has to be carried out with caution and a proper understanding, to maintain data consistency.

This post was co-authored with Senthil @8KMiles.