CloudWatch + Lambda Case 3 – Controlling cross-region EBS/RDS snapshot copies for regulated industries

If you are part of a regulated industry such as pharmaceuticals, life sciences, or BFSI and run mission-critical applications on AWS, compliance requirements will at times force you to restrict or control data movement to a particular geographic region in the cloud. This can be surprisingly complex to enforce. Let us explore it in detail.

There are a variety of ways to move data from one AWS region to another, but one commonly used method is copying snapshots across regions. You can usually restrict the snapshot copy permission in an IAM policy, but what if you need that permission enabled for moving data between AWS accounts inside a region, yet still want to control EBS/RDS snapshot copies across regions? In that case the risk can only be mitigated by automatically deleting the snapshot in the destination region whenever a cross-region copy occurs.

Our cloud operations team used to either remove this permission in IAM altogether or monitor the activity with polling scripts for customers who run multiple accounts, need the permission, and still need control. Since the introduction of CloudWatch Events, we have configured a rule that points to an AWS Lambda function, which is triggered in near real time when a snapshot is copied to the destination region. The Lambda function initiates the deletion immediately. Though this approach is reactive, it is incomparably faster than manual intervention.

In this use case, an Amazon CloudWatch Events rule identifies EBS snapshot copies across regions and deletes them.

As a first step, we create a rule in the Amazon CloudWatch Events dashboard. We have chosen an AWS API Call (via CloudTrail) as the event source and a Lambda function as the target.


The next step would be configuring rule details with Rule definition


Finally, we will review the Rules Summary


Amazon Lambda Function Code Snippet (Python)

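Since the original screenshot of the code is not reproduced here, below is a minimal sketch of what such a function could look like, assuming the CloudWatch Events rule matches CopySnapshot API calls delivered via CloudTrail; the exact event field names (for example responseElements.snapshotId) should be verified against your own CloudTrail records. The complete function is available at the GitHub link below.

import boto3

def lambda_handler(event, context):
    """Delete an EBS snapshot that was copied into this (restricted) region."""
    detail = event.get('detail', {})
    if detail.get('eventName') != 'CopySnapshot':
        return
    # Assumption: the CopySnapshot response recorded by CloudTrail carries the
    # ID of the newly created snapshot in responseElements.snapshotId.
    snapshot_id = (detail.get('responseElements') or {}).get('snapshotId')
    if snapshot_id:
        region = detail.get('awsRegion')  # region where the copy landed
        ec2 = boto3.client('ec2', region_name=region)
        ec2.delete_snapshot(SnapshotId=snapshot_id)
        print('Deleted cross-region snapshot copy: %s' % snapshot_id)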

GitHub Gist URL: https://github.com/cloud-automaton/automaton/blob/master/aws/events/AWSSnapShotCopy.py


This post was co-authored with Muthukumar and Ramprasad of 8KMiles

This article was originally published in: http://harish11g.blogspot.in/

CloudWatch + Lambda Case 2 – Keeping watch on AWS root user activity: normal or anomaly?

As a best practice, you should never use your AWS root account credentials to access AWS. Instead, create individual IAM users for anyone who needs access to your AWS account. This allows you to give each IAM user a unique set of security credentials and grant different permissions to each user. For example: create an IAM user for yourself as well, give that user administrative privileges, use that IAM user for all your work, and never share your credentials with anyone else.

The root user has full access, and it is not practical to restrict it in AWS IAM. Now imagine you suspect anomalous or suspicious activity performed as the root user in your logs (for example EC2 API calls) beyond normal IAM user provisioning; this could mean the root credentials are compromised or being misused, but in any case it is a deviation from best practice.

In the past we used to poll the CloudTrail logs with custom programs, differentiate between "root" and "Root" entries, and have cloud operations react to these anomalous behaviors. Now we can inform cloud operations and customer stakeholders in near real time using CloudWatch Events.

In this use case, an Amazon CloudWatch Events rule identifies any activity performed by the AWS root user, and notifications are sent to SNS through AWS Lambda.

As a first step, we create a rule in the Amazon CloudWatch Events dashboard. We have chosen an AWS API Call (via CloudTrail) as the event source and a Lambda function as the target. The Lambda function detects whether the event was triggered by the root user and notifies through SNS.


The next step would be configuring rule details with Rule definition


Finally, we will review the Rules Summary


Amazon Lambda Function Code Snippet (Python)

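As the original screenshot of the snippet is not reproduced here, below is a minimal sketch of such a function, assuming a hypothetical SNS topic ARN; CloudTrail marks activity performed with the account's root credentials with userIdentity.type = "Root". The complete function is available at the GitHub link below.

import json
import boto3

SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:root-activity-alerts'  # hypothetical topic

def lambda_handler(event, context):
    """Notify operations via SNS when the root user performs any API activity."""
    detail = event.get('detail', {})
    if (detail.get('userIdentity') or {}).get('type') == 'Root':
        sns = boto3.client('sns')
        sns.publish(
            TopicArn=SNS_TOPIC_ARN,
            Subject='AWS root user activity detected',
            Message=json.dumps(detail, indent=2),
        )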

GitHub Gist URL:

https://github.com/cloud-automaton/automaton/blob/master/aws/events/TrackAWSRootActivity.py

This post was co-authored with Saravanan and Ramprasad of 8KMiles

This article was originally published in: http://harish11g.blogspot.in/

CloudWatch + Lambda Case 1 – Avoid malicious CloudTrail actions in your AWS account

As many of you know, AWS CloudTrail provides visibility into API activity in your AWS account. CloudTrail logging lets you see which actions users have taken and which resources have been used, along with details such as the time and date of actions and actions that failed because of inadequate permissions. It enables you to answer important questions such as which user made an API call or which resources were acted upon in an API call. If a user disables CloudTrail logging, accidentally or with malicious intent, audit events will not be captured and you lose proper governance. The situation gets more complex if the user disables and then re-enables CloudTrail for a brief period, during which important activities can go unlogged and unaudited. In short, once CloudTrail logging is enabled it should not be disabled, and this control needs to be defended in depth.

Our cloud operations team had earlier written a program that periodically scans the CloudTrail log entries; if log activity was missing for a period of X minutes, it alerted operations. The overall reaction time for our cloud operations was more than 15–20 minutes to mitigate a CloudTrail disable action.

Now, after the introduction of CloudWatch Events, we have configured a rule that points to an AWS Lambda function as the target. The function is triggered in near real time when CloudTrail logging is disabled and automatically re-enables it without any manual intervention from cloud operations. An advanced version of the program triggers a workflow that logs entries into the ticketing system as well. This event model has helped us reduce the mitigation time to less than a minute.
We have illustrated below the detailed steps on how to configure this event. We have also given the GitHub link with basic AWS Lambda Python code that can be used by your cloud operations team.

In this use case, an Amazon CloudWatch Events rule identifies when CloudTrail logging is disabled in an AWS account and takes corrective action by re-enabling it.

As a first step, we create a rule in the Amazon CloudWatch Events dashboard. We have chosen an AWS API Call (via CloudTrail) as the event source and a Lambda function as the target.


The next step would be configuring rule details with Rule definition


Finally, we will review the Rules Summary


Amazon Lambda Function Code Snippet (Python)
import json
import boto3

print('Loading function')


def lambda_handler(event, context):
    """Re-enable CloudTrail logging when a StopLogging API call is detected."""
    try:
        client = boto3.client('cloudtrail')
        if event['detail']['eventName'] == 'StopLogging':
            trail_name = event['detail']['requestParameters']['name']
            response = client.start_logging(Name=trail_name)
            print('Re-enabled logging for trail: %s' % trail_name)
            return response
    except Exception as e:
        print('Failed to re-enable CloudTrail logging: %s' % e)
        raise
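The CloudWatch Events rule itself can also be created programmatically instead of through the console. The following is a rough boto3 sketch of an equivalent rule, assuming a hypothetical rule name and Lambda function ARN (the Lambda resource policy that allows CloudWatch Events to invoke the function is omitted):

import json
import boto3

events = boto3.client('events')

# Match CloudTrail-delivered API calls that stop CloudTrail logging.
pattern = {
    'detail-type': ['AWS API Call via CloudTrail'],
    'detail': {
        'eventSource': ['cloudtrail.amazonaws.com'],
        'eventName': ['StopLogging'],
    },
}

events.put_rule(
    Name='cloudtrail-stoplogging-rule',  # hypothetical rule name
    EventPattern=json.dumps(pattern),
    State='ENABLED',
)
events.put_targets(
    Rule='cloudtrail-stoplogging-rule',
    Targets=[{
        'Id': 'restart-cloudtrail-lambda',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:RestartCloudTrail',  # hypothetical ARN
    }],
)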

 

GitHub Gist URL:

This post was co-authored with Mohan and Ramprasad of 8KMiles

This article was originally published in: http://harish11g.blogspot.in/

Managing User Identity across Cloud-Based Applications with SCIM

Simple Cloud Identity Management (SCIM) is a standards-based protocol for provisioning and de-provisioning user identities to cloud-based SaaS applications. SCIM's pragmatic approach is designed to make it quick and easy to move user identities across cloud applications. Its main intent is to reduce the cost and complexity of user management operations by providing a common user schema and extension model, as well as binding documents that provide patterns for exchanging this schema using standard protocols.

SCIM is built on a model where a resource is the common denominator and all SCIM objects are derived from it. SCIM has three objects that derive directly from the Resource object: ServiceProviderConfiguration and Schema are used to discover the service provider configuration, while the core Resource object specifies the endpoint resources User, Group and Organization.
The SCIM protocol exchanges user identities between two applications over HTTP using a REST (Representational State Transfer) API. SCIM exposes a common user schema and resource objects expressed in JSON or XML format. SCIM requests are made via HTTP requests with different HTTP methods, and responses are returned in the body of the HTTP response, formatted as JSON or XML depending on the request.

Following are the SCIM endpoints for standards-based user identity provisioning and de-provisioning across cloud-based applications.

# SCIM provides two endpoints to discover supported features and specific attribute details.
• GET /ServiceProviderConfigs – This endpoint specifies the service provider's specification compliance, authentication schemes and data models.
• GET /Schemas – This endpoint specifies the service provider's resources and attribute extensions.
# SCIM provides a REST API with a simple set of HTTP/HTTPS operations for CRUD (a sample provisioning request is sketched after this list).

• POST – https://endpoint.com/{v}/{resource} – Creates a new resource or bulk resources.
• GET – https://endpoint.com/{v}/{resource}/{id} – Retrieves a particular resource.
• GET – https://endpoint.com/{v}/{resource}?filter={attribute}{op}{value}&sortBy={attributeName}&sortOrder={ascending|descending} – Retrieves resources with filter parameters.
• PUT – https://endpoint.com/{v}/{resource}/{id} – Modifies a resource with a complete, consumer-specified resource (replaces the full resource).
• PATCH – https://endpoint.com/{v}/{resource}/{id} – Modifies a resource with a set of consumer-specified changes (updates part of the resource).
• DELETE – https://endpoint.com/{v}/{resource}/{id} – Deletes a particular resource.
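As a concrete illustration of the CRUD operations above, here is a minimal sketch of a SCIM user-provisioning request in Python. The endpoint, bearer token and schema URN are hypothetical placeholders; the exact schema version and authentication mechanism depend on the service provider.

import json
import requests

SCIM_BASE = 'https://endpoint.com/v1'                # hypothetical service provider endpoint
ACCESS_TOKEN = 'replace-with-oauth2-bearer-token'    # hypothetical OAuth 2.0 token

new_user = {
    'schemas': ['urn:scim:schemas:core:1.0'],        # assumed SCIM 1.x core schema URN
    'userName': 'jdoe',
    'name': {'givenName': 'Jane', 'familyName': 'Doe'},
    'emails': [{'value': 'jdoe@example.com', 'primary': True}],
}

# POST /{resource} creates a new resource; a successful call typically returns
# 201 Created with the stored representation, including the provider-assigned id.
response = requests.post(
    SCIM_BASE + '/Users',
    headers={'Authorization': 'Bearer ' + ACCESS_TOKEN,
             'Content-Type': 'application/json'},
    data=json.dumps(new_user),
)
print(response.status_code, response.json())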

How the SCIM protocol provisions user identities into a cloud (SaaS) application:

[Diagram: SCIM provisioning flow]

The SCIM protocol does not define a scheme for authentication and authorization, so service providers are free to choose mechanisms appropriate to their use cases. Most SaaS applications (service providers) support OAuth 2.0 for authentication and authorization, while some provide their own authentication mechanisms. Nowadays most IDM vendors (CA, SailPoint, PingIdentity, etc.) support the SCIM protocol.

If you need more information on the SCIM protocol, refer to http://www.simplecloud.info/

Keys to Ensure a Smooth and Successful Go-Live


The day is finally here. The months of planning, build, testing and training have culminated in go-live day. Will this be a successful and relatively smooth day? If you have followed the best-practice tips below, then the day will go without any major issues.

  • Early End-User Involvement
    Often the biggest complaint from end-users at go-live is the feeling that the new system was forced on them without being able to provide input. One way to help alleviate this is to engage the users early with information and question and answer sessions, where the users are able to ask questions and express concerns. These sessions along with usability labs and training provide continued involvement throughout the process and allow the end-users to take some ownership of the success of the project and go-live.
  • Document Decisions
    Throughout the implementation process, many decisions will be made. One thing that can slow the process and cause a rush close to go-live is poor documentation of decisions and revisiting decisions. Documenting what decisions have been made and why can help alleviate back-tracking. Yes, sometimes a decision will need to be revisited based on new information, but keeping revisits to a minimum is important to keep the project on track and allow go-live day to be a success.
  • Testing, Testing, Testing
    This may be the most obvious point, but robust testing is extremely important to a smooth go-live. Identifying and correcting problems before go live will make the day better for everyone.
  • Be Nice
    Go-live is a big day and it will be stressful. The project team and support staff will need to keep a calm demeanor throughout the day. Listening to the users and providing answers with a smile will help the end-users stay calm as well.
  • Issue tracking and resolution
    Problems will arise at go-live and having a maintained issues log allows for the identification of trends in issues. Discussing the issues during pre-shift meetings with all project team members allows for everyone to be on the same page and helps prevent duplicate issues from being tracked. Along with tracking the issues, having a ‘fast-track’ process for fixing any issues in the system allows the end-users to see progress and move forward.

Go-live day will come with some stress and challenges, but if you follow the steps listed, it will be a smooth process and a positive experience for everyone.

Community Connect/New Practice Team Model

One effect of the new and upcoming Affordable Care Act laws is that healthcare organizations are acquiring smaller private practices and growing their organizations. Another is that smaller practices are contracting with larger organizations to utilize their Electronic Medical Records (EMRs). EMRs can be prohibitively expensive for smaller practices, as can the penalties for not using them, so contracting with larger organizations allows them to keep autonomy and use an EMR within a large network.
The benefits to the large organizations in both cases are clear with the increase in patient base and potential revenue. However this growth, often rapid, can present significant and numerous challenges to the administration.
EMR Project Teams
One area where challenges are realized is with the EMR project teams. The teams are often 'bare-bones', so an aggressive timeline can stretch them too far, and existing projects (upgrades, optimization, maintenance, etc.) are often neglected by necessity.
One method to help alleviate this strain is to bring in a team dedicated to the build and roll out of the newly acquired and community connect clinics. This allows the organization analysts to focus on the existing projects, and the dedicated team to build and roll out the clinics quickly and consistently.
Success Story
Utilizing this ‘team’ approach has been shown to be successful at a recent site which uses the Epic EMR. Creating a team with 1 Project Manager, 1 Ambulatory Analyst, and 1 Cadence Analyst was shown to be an ideal team structure. The addition of a second ‘team’ of 1 Ambulatory and 1 Cadence Analyst when the timeline was very tight proved to be beneficial.
The first clinic to go up was a collaborative approach between the added team and the existing staff. Embedding the team allowed them to learn the build conventions, documentation, change management, and various organization experts for different build aspects. The go-live was also collaborative to ensure the proper processes were being used.
The second clinic was more autonomous for the team, but a point-person was used from the staff for questions and the build was reviewed for accuracy.
The following clinics were autonomous which allowed the staff to work on a major upgrade. The team was able to ask questions as needed, but were able to build and roll-out clinics with limited time required from the staff.
If you would like any more information on this community connect model and how it could work for your organization, then please contact info@serj.com for more information.

Open Sourcing S2C Tool

This is a follow up post to our previous article ‘Migrating Solr to Amazon CloudSearch using the S2C tool’. Last month, we released an open source tool ‘S2C’, a Linux console based utility that helps developers to migrate search index from Apache Solr to Amazon CloudSearch.

In this article, we share the source code of the S2C tool which will allow developers to customize and extend the S2C tool to suit their requirements. The source code can be downloaded in the below link.

https://github.com/8KMiles/s2c/

Further, we discuss step-by-step instructions on how to build the source code of S2C tool.

Pre-Requisites

In this section, we detail the pre-requisites for building the S2C tool.
1. The application is developed using Java. Download and install Java 8. Validate the JDK path and ensure that environment variables such as JAVA_HOME, classpath and path are set correctly.

2. We will use Gradle to build the S2C tool. Download and Install Gradle.
Please read getting-started.html (inside Gradle base folder) in setting up Gradle. Gradle is an open source build tool which does not require any pre-requisites like Maven or Ant.

3. Download the source code directly from the link https://github.com/8KMiles/s2c/
or alternatively
Download and install Git – http://git-scm.com/book/en/v2/Getting-Started-Installing-Git and then use the following command to clone the source from Git

git clone https://github.com/8KMiles/s2c.git (requires Git installation)

Note: The source code is available in public and does not require any credentials to access the source.

Build process

We will use Gradle, an open source build tool to build the S2C migration utility.
1. Verify the path, classpath, environment variables of Java, Gradle. Example: JAVA_HOME, GRADLE_HOME

2. Unzip the downloaded S2C source code and run the following command from the main directory. Example: E:/s2ctool/s2c-master or /opt/s2ctool/s2c-master

 ./gradlew -PexportPath=/tmp :s2c-cli:exportTarGz

The above command will create a .tgz archive in the '/tmp' directory. The directory path can be changed if required,
or

gradle exportTarGz

The above command will create a .tgz in the 's2c-master/s2c-cli/tmp' directory.

or

gradle exportZip

The above command will create a .zip in the 's2c-master/s2c-cli/tmp' directory.

Build output file: s2c-cli-1.0.zip or s2c-cli-1.0.tgz

3. The build output is the final product that is deployed to perform the migration from Solr to Amazon CloudSearch. The deployment steps are discussed in detail in the original blog post, 'Migrating Solr to Amazon CloudSearch using the S2C tool'.

Please do write your feedback and suggestions in the comments section below to help improve this tool.

About the Authors

 Dhamodharan P is a Senior Cloud Architect at 8KMiles.

 

 

 

 Dwarakanath R is a Principal Architect at 8KMiles.

 

 

 

27 Best Practice Tips on Amazon Web Services Security Groups

AWS security groups are one of the most used and abused configurations in an AWS environment, especially if you have been operating in the cloud for a long time. Because security groups are simple to configure, users often ignore their importance and do not follow best practices relating to them. In reality, operating AWS security groups day to day is much more intensive and complex than configuring them once, and hardly anybody talks about it. So in this article, I am going to share our experience dealing with AWS security groups since 2008, as a set of best-practice pointers covering both configuration and day-to-day operations.
In the world of security, proactive and reactive speed determines the winner, so many of these best practices should be automated. In case your organization's Dev/Ops/DevOps teams need help with security group best-practice automation, feel free to contact me.

AWS has released so many security-related features in the last few years that we should no longer look at security groups in isolation; it simply does not make sense anymore. Security groups should always be seen in the overall security context. With that, let us start the pointers.

Practice 1: Enable AWS VPC Flow Logs at the VPC, subnet or ENI level. VPC Flow Logs can be configured to capture both accept and reject entries flowing through the ENIs and security groups of EC2, ELB and some more services. These flow log entries can be scanned to detect attack patterns, alert on abnormal activities and information flow inside the VPC, and provide valuable insights to the SOC/MS team operations.
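A rough boto3 sketch of enabling VPC Flow Logs at the VPC level follows; the VPC ID, CloudWatch Logs group name and IAM role ARN are hypothetical placeholders.

import boto3

ec2 = boto3.client('ec2')

# Capture both accepted and rejected traffic for the whole VPC.
response = ec2.create_flow_logs(
    ResourceIds=['vpc-0123456789abcdef0'],           # hypothetical VPC ID
    ResourceType='VPC',
    TrafficType='ALL',
    LogGroupName='vpc-flow-logs',                    # hypothetical log group
    DeliverLogsPermissionArn='arn:aws:iam::123456789012:role/flow-logs-role',  # hypothetical role
)
print(response['FlowLogIds'])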

Practice 2: Use AWS Identity and Access Management (IAM) to control who in your organization has permission to create and manage security groups and network ACLs (NACL). Isolate the responsibilities and roles for better defense. For example, you can give only your network administrators or security admin the permission to manage the security groups and restrict other roles.

Practice 3: Enable AWS CloudTrail logs for your account. CloudTrail will log all security group events, and it is needed for management and operations of security groups. Event streams can be created from CloudTrail logs and processed using AWS Lambda. For example, whenever a security group is deleted, this event is captured with details in the CloudTrail logs; an AWS Lambda function can be triggered to process this change and alert the MS/SOC on the dashboard or by email as per your workflow. This is a very powerful way of reacting to events within a span of less than 7 minutes. Alternatively, you can process the CloudTrail logs stored in your S3 bucket as a batch at some frequency X and achieve the same, but the operations team's reaction time will then vary depending on the generation and polling frequency of the CloudTrail logs. This activity is a must for your operations team.
Practice 4: Enable AWS Config for your AWS account. AWS Config records all configuration changes related to your security groups and can even send notifications.

Practice 5: Have a proper naming convention for Amazon Web Services security groups. The naming convention should follow an enterprise standard. For example, it can follow the notation: "AWS Region + Environment Code + OS Type + Tier + Application Code"
Security Group Name – EU-P-LWA001
AWS Region ( 2 char ) = EU, VA, CA etc
Environment Code (1 Char)  = P-Production , Q-QA, T-testing, D-Development etc
OS Type (1 Char)= L -Linux, W-Windows etc
Tier (1 Char)= W-Web, A-App, C-Cache, D-DB etc
Application Code ( 4 Chars) = A001
We have been using Amazon Web Services since 2008 and have found over the years that managing security groups across multiple environments is itself a huge task. A proper naming convention from the beginning is a simple practice, but it will make your AWS journey much more manageable.

Practice 6: For defense in depth, make sure your Amazon Web Services security group naming convention is not self-explanatory, and make sure your naming standards stay internal. Example: an AWS security group named UbuntuWebCRMProd makes it obvious to attackers that it is a production CRM web tier running on the Ubuntu OS. Have an automated program that periodically scans AWS security group names with regex patterns for information-revealing names and alerts the SOC/managed services teams (a minimal sketch follows below).
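A minimal sketch of such a detection script is shown below; the list of revealing keywords is a hypothetical example, and the alerting action is left as a simple print.

import re
import boto3

# Hypothetical keywords that reveal OS, tier or environment details.
REVEALING = re.compile(r'(ubuntu|windows|prod|crm|db|oracle|mysql)', re.IGNORECASE)

ec2 = boto3.client('ec2')
for sg in ec2.describe_security_groups()['SecurityGroups']:
    if REVEALING.search(sg['GroupName']):
        # In practice, raise an alert to the SOC/MS team instead of printing.
        print('Revealing security group name: %s (%s)' % (sg['GroupName'], sg['GroupId']))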

Practice 7: Periodically detect, alert on, or delete AWS security groups that do not strictly follow the organization's naming standards. Have an automated program doing this as part of your SOC/managed services operations. Once you have this stricter control implemented, things will fall in line automatically.

Practice 8: Have automation in place to detect all EC2, ELB and other AWS assets associated with security groups. This automation helps to periodically detect Amazon Web Services security groups lying idle with no associations, alert the MS team and clean them up; unwanted security groups accumulated over time only create confusion (a minimal detection sketch follows below).
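A minimal detection sketch is shown below; it treats a security group as idle when no elastic network interface (and therefore no EC2/ELB/RDS resource) references it, which is one reasonable way to approximate "no associations".

import boto3

ec2 = boto3.client('ec2')

# Security groups currently referenced by at least one network interface.
in_use = set()
for eni in ec2.describe_network_interfaces()['NetworkInterfaces']:
    for group in eni['Groups']:
        in_use.add(group['GroupId'])

for sg in ec2.describe_security_groups()['SecurityGroups']:
    if sg['GroupId'] not in in_use and sg['GroupName'] != 'default':
        # In practice, alert the MS team (and optionally delete after review).
        print('Idle security group: %s (%s)' % (sg['GroupName'], sg['GroupId']))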

Practice 9: In your AWS account, when you create a VPC, AWS automatically creates a default security group for it. If you don't specify a different security group when you launch an instance, the instance is automatically associated with the appropriate default security group. That group allows inbound traffic only from other instances associated with the "default" security group and allows all outbound traffic from the instance. The default security group specifies itself as a source security group in its inbound rules; this is what allows instances associated with the default security group to communicate with each other. This is not a good security practice. If you don't want all your instances to use the default security group, create your own security groups and specify them when you launch your instances. This applies to EC2, RDS, ElastiCache and some more services in AWS. So detect "default" security groups periodically and alert the SOC/MS.

Practice 10: Alerts by email and on the cloud management dashboard should be triggered whenever critical security groups or rules are added/modified/deleted in production. This is important for reactive action by your managed services/security operations team and for audit purposes.

Practice 11: When you associate multiple security groups with an Amazon EC2 instance, the rules from each security group are effectively aggregated into one set of rules, and AWS uses this set to determine whether to allow access. If there is more than one rule for a specific port, AWS applies the most permissive rule. For example, if you have a rule that allows access to TCP port 22 (SSH) from IP address 203.0.113.10 and another rule that allows access to TCP port 22 from everyone, then everyone will have access to TCP port 22, because the permissive rule takes precedence.
Practice 11.1: Have automated programs that detect EC2 instances associated with many security groups/rules and alert the SOC/MS periodically. Manually condense them to 1–3 rules at most as part of your operations.

Practice 11.2: Have automated programs that detect conflicting security groups/rules (for example, restrictive and permissive rules together) and alert the SOC/MS periodically.

Practice 12: Do not create least-restrictive security group rules like 0.0.0.0/0, which are open to everyone.
Since web servers must receive HTTP and HTTPS traffic from the Internet, only their security group may be permissive in that way:
0.0.0.0/0, TCP, 80 – allow inbound HTTP access from anywhere
0.0.0.0/0, TCP, 443 – allow inbound HTTPS access from anywhere
Any other least-restrictive security group created in your account should be alerted to the SOC/MS teams immediately.

Practice 13: Have a security policy not to launch servers listening on default ports like 3306, 1630, 1433, 11211, 6379, etc. If this policy is adopted, security groups also have to be created for the new, non-default listening ports instead of the default ones. This provides a small layer of defense, since one cannot infer from the security group port which service the EC2 instance is running. Automated detection and alerts should be created for the SOC/MS if security groups are created with default ports.

Practice 14: Applications that must meet stricter compliance requirements such as HIPAA or PCI need end-to-end transport encryption implemented on the server back end in AWS. The communication from the ELB to the Web -> App -> DB -> other tiers needs to be encrypted using SSL or HTTPS. This means only secure ports like 443, 465 and 22 should be permitted in the corresponding EC2 security groups. Automated detection and alerts should be created for the SOC/MS if security groups are created with non-secure ports for regulated applications.

Practice 15: Detection, alerting and action can be performed by parsing the AWS CloudTrail logs based on the usual patterns observed in your production environment.
Examples:
15.1: A port that was opened and closed within 30 (or X) minutes in production can be a candidate for suspicious activity if that is not a normal pattern for your production environment.
15.2: A permissive security group that was created and removed within 30 (or X) minutes can be a candidate for suspicious activity if that is not a normal pattern for your production environment.
Detect anomalies in how long a security group change stayed in effect before it was reverted in production.

Practice 16: In case ports have to be opened in Amazon Web Services security groups, or a permissive security group needs to be applied, automate the entire process as part of your operations so that the security group is open only for the X agreed minutes and is automatically closed afterwards, in line with your change management (a rough sketch follows below). Reducing manual intervention avoids operational errors and adds security.
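A rough sketch of such a time-boxed change is shown below, assuming a hypothetical security group ID and source CIDR; in production you would schedule the revoke step through your change-management tooling rather than sleeping inside a script.

import time
import boto3

ec2 = boto3.client('ec2')

GROUP_ID = 'sg-0123456789abcdef0'    # hypothetical security group
WINDOW_MINUTES = 30                  # agreed change window

rule = [{
    'IpProtocol': 'tcp',
    'FromPort': 22,
    'ToPort': 22,
    'IpRanges': [{'CidrIp': '203.0.113.10/32'}],   # temporary support access
}]

ec2.authorize_security_group_ingress(GroupId=GROUP_ID, IpPermissions=rule)
try:
    time.sleep(WINDOW_MINUTES * 60)
finally:
    # Close the port again once the agreed window has elapsed.
    ec2.revoke_security_group_ingress(GroupId=GROUP_ID, IpPermissions=rule)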

Practice 17: Make sure SSH/RDP access is open in AWS security groups only for the jump box/bastion hosts of your VPC/subnets. Have stricter controls/policies to avoid opening SSH/RDP to other instances of the production environment. Periodically check, alert on and close this loophole as part of your operations.

Practice 18: It is a bad practice to have SSH open to the entire Internet for emergency or remote support. By allowing the entire Internet access to your SSH port, there is nothing stopping an attacker from exploiting your EC2 instance. The best practice is to allow only very specific IP addresses in your security groups; this restriction improves the protection. These could be the addresses of your office, on-premise network or DC through which you connect to your jump box.

Practice 19: Too many or too few: how many security groups are preferred for a typical multi-tiered web application is a frequently asked question.
Option 1: One security group cutting across multiple tiers is easy to configure, but it is not recommended for secure production applications.
Option 2: One security group for every instance is too much protection and tough to manage operationally over the long term.
Option 3: An individual security group for each tier of the application; for example, separate security groups for the ELB, Web, App, DB and Cache tiers of your application stack.
Periodically check whether Option 1 style rules are being created in your production environment and alert the SOC/MS.

Practice 20: Avoid allowing UDP or ICMP for private instances in security groups; it is not a good practice unless specifically needed.

Practice 21: Open only specific ports; opening a range of ports in a security group is not a good practice. You can add many inbound ingress rules to a security group, and it is always advisable to open specific ports like 80 or 443 rather than a range of ports like 200–300.

 


Practice 22: Private subnet instances can be accessed only from within the VPC CIDR IP range. Opening such instances to public IP ranges is possible, but it does not make any sense; for example, opening HTTP to 0.0.0.0/0 in the security group of a private subnet instance serves no purpose. Detect and clean up such rules.

 

Practice 23: AWS CloudTrail logs capture security-related events. AWS Lambda functions or automated programs should trigger alerts to operations when abnormal activities are detected. For example:
23.1: Alert when X security groups were added/deleted at "Y" hours or day by an IAM user/account.
23.2: Alert when X security group rules were added/deleted at "Y" hours or day by an IAM user/account.

Practice 24: If you are an enterprise, make sure all security-group-related activities in your production environment are part of your change management process. Security group actions can be manual or automated within your change management in an enterprise.
If you are an agile startup or SMB and do not have a complicated change management process, then automate most of the security-group-related tasks and events as illustrated in the various best practices above. This will bring immense efficiency to your operations.

Practice 25: Use outbound/egress rules in security groups wherever applicable within your VPC. For example, restrict FTP connections to any server on the Internet from your VPC; this way you can avoid data dumps and important files being transferred out of your VPC. Defend harder and make it tougher!

Practice 26: For some tiers of your application, use an ELB in front of your instances as a security proxy with restrictive security groups (restrictive ports and IP ranges). This doubles your defense, but increases latency.

Practice 27: Some of the tools we use in conjunction to automate and meet the above best practices are ServiceNow, Amazon CFT, AWS APIs, Rundeck, Puppet, Chef, and automated programs written in Python, .NET and Java.

Note: In case your organization's Dev/Ops/DevOps teams need help with automating the security group best practices listed above, feel free to contact me at harish11g.aws@gmail.com

 


 

About the Author

Harish Ganesan is the Chief Technology Officer (CTO) of 8K Miles and is responsible for the overall technology direction of the 8K Miles products and services. He has around two decades of experience in architecting and developing Cloud Computing, E-commerce and Mobile application systems. He has also built large internet banking solutions that catered to the needs of millions of users, where security and authentication were critical factors. He is also a prolific blogger and frequent speaker at popular cloud conferences.

 

Apache Solr to Amazon CloudSearch Migration Tool

In this post, we are introducing a new tool called S2C – Apache Solr to Amazon CloudSearch Migration Tool. S2C is a Linux console based utility that helps developers / engineers to migrate search index from Apache Solr to Amazon CloudSearch.

Very often customers initially build search for their website or application on top of Solr, but later run into challenges like elastic scaling and managing the Solr servers. This is a typical scenario we have observed in our years of search implementation experience. For such use cases, Amazon CloudSearch is a good choice. Amazon CloudSearch is a fully-managed service in the cloud that makes it easy to set up, manage, and scale a search solution for your website. To know more, please read the Amazon CloudSearch documentation.

We are seeing growing trend every year, organizations of various sizes are migrating their workloads to Amazon CloudSearch and leveraging the benefits of fully managed service. For example, Measured Search, an analytics and e-Commerce platform vendor, found it easier to migrate to Amazon CloudSearch rather than scale Solr themselves (see article for details).

Since Amazon CloudSearch is built on top of Solr, it exposes all the key features of Solr while providing the benefits of a fully managed service in the cloud such as auto-scaling, self-healing clusters, high availability, data durability, security and monitoring.

In this post, we provide step-by-step instructions on how to use the Apache Solr to Amazon CloudSearch Migration (S2C) tool to migrate from Apache Solr to Amazon CloudSearch.

Before we get into detail, you can download the S2C tool in the below link.
Download Link: https://s3-us-west-2.amazonaws.com/s2c-tool/s2c-cli.zip

Pre-Requisites

Before starting the migration, the following pre-requisites have to be met. The pre-requisites include installations and configuration on the migration server. The migration server could be the same Solr server or independent server that sits between your Solr server and Amazon CloudSearch instance.

Note: We recommend running the migration from the Solr server instead of independent server as it can save time and bandwidth. It is much better if the Solr server is hosted on EC2 as the latency between EC2 and CloudSearch is relatively less.

The following installations and configuration should be done on the migration server (i.e. your Solr server or any new independent server that connects between your Solr machine and Amazon CloudSearch).

  1. The application is developed using Java. Download and install Java 8. Validate the JDK path and ensure that environment variables such as JAVA_HOME, classpath and path are set correctly.
  2. We assume you have already set up an AWS IAM account. Please ensure the IAM user has the right permissions to access AWS services like CloudSearch.
    Note: If you do not have an AWS IAM account with above mentioned permissions, you cannot proceed further.
  3. The IAM user should have AWS Access key and Secret key. In the application hosting server, set up the Amazon environment variables for access key and secret key. It is important that the application runs using the AWS environment variables.
    To set up the AWS environment variables, please read the links below. It is important that the tool is run using the AWS environment variables.
    http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/credentials.html
    http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/java-dg-roles.html
    Alternatively, you can set the following AWS environment variables by running the commands below from Linux console.
    export AWS_ACCESS_KEY=Access Key
    export AWS_SECRET_KEY=Secret Key
  4. Note: This step is applicable only if migration server is hosted on Amazon EC2.
    If you do not have an AWS Access key and Secret key, you can opt for IAM role attached to an EC2 instance. A new IAM role can be created and attached to EC2 during the instance launch. The IAM role should have access to Amazon CloudSearch.
    For more information, read the below link
    http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html
  5. Download the migration utility 'S2C' (you would have completed this step earlier), unzip the tool and copy it to your working directory. Download Link: https://s3-us-west-2.amazonaws.com/s2c-tool/s2c-cli.zip

S2C Utility File
The downloaded ‘S2C’ migration utility should have the following sub directories and files.

Folder / File – Description
bin – Binaries of the migration tool
lib – Libraries required for migration
application.conf – Configuration file that allows end users to input parameters (requires end-user input)
logback.xml – Log file configuration (optional; does not require end-user/developer input)
s2c – Script file that executes the migration process

Configure only application.conf and logback.xml.  Do not modify any other file.
Application.conf: The application.conf file has the configuration related to the new Amazon CloudSearch domain that will be created. The parameters configured  in the application.conf file are explained in the table below.

The relevant portion of application.conf looks like the following; descriptions of each parameter are given below it.

s2c {
  api {
    SchemaParser = "s2c.impl.solr.DefaultSchemaParser"
    SchemaConverter = "s2c.impl.cs.DefaultSchemaConverter"
    DataFetcher = "s2c.impl.solr.DefaultDataFetcher"
    DataPusher = "s2c.impl.cs.DefaultDataPusher"
  }
  solr {
    dir = "files"
    server-url = "http://localhost:8983/solr/collection1"
    fetch-limit = 100
  }
  cs {
    domain = "collection1"
    region = "us-east-1"
    instance-type = "search.m3.xlarge"
    partition-count = 1
    replication-count = 1
  }
  wd = "/tmp"
}

api – List of APIs that are executed step by step during the migration. Do not change this.

solr.dir – The base directory path of Solr. Ensure the directory is present and valid. E.g. /opt/solr/example/solr/collection1/conf

solr.server-url – Server host, port and collection path; this is the endpoint that will be used to fetch the data. If the utility is run from a different server, ensure the IP address and port have firewall access.

solr.fetch-limit – Number of Solr documents fetched in each batch call. This number should be set carefully by the developer. The fetch limit depends on the following factors:

  1. Record size of a Solr record (1 KB or 2 KB)
  2. Latency between the migration server and Amazon CloudSearch
  3. Current request load on the Solr server

E.g.: If there are 100,000 Solr documents in total and the fetch limit is 100, it would take 100,000 / 100 = 1,000 batch calls to complete the fetch. If the size of each Solr record is 2 KB, then 100,000 * 2 KB = 200 MB of data is migrated.

cs.domain – CloudSearch domain name. Ensure that the domain name does not already exist.

cs.region – AWS region for the new CloudSearch domain.

cs.instance-type – Desired instance type for the CloudSearch nodes. Choose the instance type based on the volume of data and the expected query volume.

cs.partition-count – Number of partitions required for CloudSearch.

cs.replication-count – Replication count for CloudSearch.

wd – Temporary file path to store intermediate data files and migration log files.

Running the migration

Before launching the S2C migration tool, verify the following:

    • Solr directory path – Make sure that the Solr directory path is valid and available. The tool cannot read the configuration if the path or directory is invalid.
    • Solr configuration contents – Validate that the Solr configuration contents are correctly set inside the directory. Example: solrconfig.xml, schema.xml, stopwords.txt, etc.
    • Make sure that the working directory is present in the file system and has write permissions for the current user. It can be an existing directory or a new directory. The working directory stores the fetched data from Solr and migration logs.
    • Validate the disk size before starting the migration. If the available free disk space is lesser than the size of the Solr index, the fetch operations will fail.

For example, if the Solr index size is 7 GB, make sure that the disk has at least 10 GB to 20 GB of free space.
Note: The tool reads the data from Solr and stores in a temporary directory (Please see configuration wd = /tmp in the above table).

  • Verify that the AWS environment variables are set correctly. The AWS environment variables are mentioned in the pre-requisites section above.
  • Validate the firewall rules for IP address and ports if the migration tool is run from a different server or instance. Example: Solr default port 8983 should be opened to the EC2 instance executing this tool.

Run the following command from the directory '{S2C filepath}'.
Example: /build/install/s2c-cli

./s2c
or, with an explicit heap size:
JVM_OPTS="-Xms2048m -Xmx2048m" ./s2c

The above will invoke the shell ‘s2c’ script that starts the search migration process. The migration process is a series of steps that require user inputs as shown in the screen shots below.
Step 1: Parse the Solr schema. The first step of the migration prompts for confirmation to parse the Solr schema and Solr configuration file. During this step, the application generates a 'Run Id' folder inside the working directory.
  Example: /tmp/s2c/m1416220194655

The Run Id is a unique identifier for each migration. Note down the Run Id as you will need it to resume the migration in case of any failures.

Step 2: Schema conversion from Solr to CloudSearch. The second step prompts for confirmation to convert the Solr schema to a CloudSearch schema. Press any key to proceed further.

The second step will also list all the converted fields which are ready to be migrated from Solr to CloudSearch. If any fields are left out, this step allows you to correct the original schema: the user can abort the migration, identify the ignored fields, rectify the schema and re-run the migration. The screenshot below shows the fields ready for CloudSearch migration.


Step 3: Data Fetch: The third step prompts for confirmation to fetch the search index data from the Solr server. Press any key to proceed. This step will generate a temporary file which will be stored in the working directory. This temporary file will have all the fetched documents from the Solr index.


There is also option to skip the fetch process if all the Solr data is already stored in the temporary file. If this is the case, the prompt will look like the screenshot below.

Step 4: Data push to CloudSearch. The final step prompts for confirmation to push the search data from the temporary file store to Amazon CloudSearch. This step also creates the CloudSearch domain with the configuration specified in application.conf, including the desired instance type, replication count, and multi-AZ options.

If the domain is already created, the utility will prompt to use the existing domain. If you do not wish to use an existing domain, you can create a new CloudSearch domain using the same prompt.
Note: The console does not prompt for any ‘CloudSearch domain name’ but instead it uses the domain name configured in the application.conf file.

Step 5: Resume (optional). If there is any failure during the fetch operation, the migration can be resumed from where it left off. This is illustrated in the screenshot below.

Step 6: Verification. Log in to the AWS CloudSearch management console to verify the domain and index fields.

Amazon CloudSearch allows running test queries to validate the migration and as well the functionality of your application.

Supported features and limitations

  • Support for other non-Linux environments is not available for now.
  • Support for Solr Shards is not available for now. The Solr shard needs to be migrated separately.
  • The install commands may vary for different Linux flavors. For example, the commands for installing software, editing files and setting permissions can differ for each Linux flavor. It is left to the engineering team to choose the right commands during the installation and execution of this migration tool.
  • Only fields configured as ‘stored’ in Solr schema.xml are supported. The non-stored fields are ignored during schema parsing.
  • The document id (unique key) is required to have following attributes:
    1. Document ID should be 128 characters or less in size.
    2. Document ID can contain any letter, any number, and any of the following characters:      _ – = # ; : / ? @ &
    3. The below link will help you to understand in data  preparation before migrating to CloudSearch http://docs.aws.amazon.com/cloudsearch/latest/developerguide/preparing-data.html
  • If the conditions are not met in a document, it will be skipped during migration. Skipped records are shown in the log file.
  • If a field type (mapped to fields) is not stored, the stopwords mapped to that particular field type are ignored.

Example 1:

<field name="description" type="text_general" indexed="true" stored="true" />

Note: The above field 'description' will be considered for stopwords.

Example 2:

<field name="fileName" type="string" />

Note: The above field 'fileName' will not be migrated and will be ignored for stopwords.

Please do write your feedback and suggestions in the below comments section to improve this tool. The source code of the tool can be downloaded at https://github.com/8KMiles/s2c/. We have written a follow-up post in regard to that.

About the Authors
 Dhamodharan P is a Senior Cloud Architect at 8KMiles.

 

 

 

 Dwarakanath R is a Principal Architect at 8KMiles.

 

 

EzIAM – Moving your Identities to the Cloud – An Analysis

Before an enterprise implements an on-premise IDM (Identity Management) solution, there are a lot of factors to consider. These considerations multiply if the enterprise were to implement a new cloud IDM solution (i.e. decide to move their identities partially or fully to a cloud like AWS, Azure or Google and manage those identities using a cloud IDM solution such as EzIAMTM). I will touch upon these items.

There could be 3 types of movers to the cloud.

  • New enterprise (or a start-up) that is planning to start their operations with a cloud IDM itself straightaway. These enterprises may not have an on-premise presence at all (Neo IDM Movers).
  • Some other enterprises might be planning to move only some of their existing IDM parts to the cloud and keep the rest of them on-premise (they are generally called the Hybrid IDM Movers).
  • While a few others could try to move their entire on-premise IDM operations to the cloud (Total IDM Movers). Although there will be some common considerations for these 3 categories of movers, before they decide to move to Cloud IDM, they individually will have some unique issues to deal with.

New Movers to a cloud IDM Infrastructure – companies starting their operations in the Cloud & hence want to have all their identities in the new cloud IDM infrastructure from day 1 of their operations:

These are the companies that start their identity management in the cloud itself straightaway. The number of questions that these enterprises would want answered is far less compared to the other two categories of enterprises. Prime considerations for these types of organizations would be:

1. Will the cloud IDM solution be safe to implement (i.e safe to have my corporate users & identities exist in there) ?
2. Will the cloud IDM solution be able to address the day-to-day IDM operations/workflows that each user is going to go through?
3. Will the cloud IDM solution be able to scale for the number of users ?
4. What are the connectivity options (from a provisioning standpoint) that the cloud-idm system provides ? (i.e connecting to their applications/db’s/directories that are existing on the cloud, assuming they are a complete cloud organization)?
5. How robust these connections are (i.e in terms of number of concurrent users, data transport safety) ?
6. What are the Single Sign-On connectivity options that the solution provides ?
7. What are the advanced authentication mechanisms that the solution provides ?
8. What are the compliance and regulatory mechanisms in place ?
9. What are the data backup and recovery technologies in place ?
10. What are the log and audit mechanisms in place ?

If the organization can get convincing answers to the above questions, I think it is prudent for them to move their identities to the cloud. EzIAMTM, a cloud IDM solution from 8KMiles Inc., has the best possible answers to the above questions in the market today. It is definitely an identity-safe, data-safe and transport-safe solution, meaning identities stored within EzIAMTM directories and databases stay there in a secure manner, and when transported either within the cloud or outside they always go through a TLS tunnel. Each component of EzIAMTM (there are 7 components/servers) is load-balanced and tuned for high-scale performance.

There are more than 30 out-of-the-box provisioning connectors available to connect to various directories, databases and software applications. The Single Sign-On connectivity options are numerous, with support for SAML 2.0, OpenID 2.0 and OAuth 2.0. A variety of advanced authentication mechanisms are supported, ranging from X.509 certificate/smart-card based tokens to OTP/mobile-based authentication. Being in the AWS cloud, the backup and recovery process is as efficient as any backup process can be: daily backups of snapshots and data are taken, with the ability to recover within minutes.

Hybrid Movers to a cloud IDM infrastructure – companies moving their on-premise identities & applications to the cloud but not fully yet :

Most of the companies would fall into this category. These kind of movers, move only a few parts of their IDM infrastructure to the cloud. They would initially move their applications to the cloud to start with. Then they would probably move their user stores/directories and along with that their identities to the cloud. They would still have some applications on-premise, which they would need to connect from the cloud IDM solution. They would also want to perform the daily identity workflow process from the cloud IDM solution. This way they can streamline their operations especially if they have offices in multiple locations, with users in multiple Organizational Units (OUs), accessing multiple on-premise and cloud applications.

Hybrid movers would have the maximum expectations from their cloud IDM solution, as the solution needs to address both their on-premise and cloud assets. Generally if these movers can get answers to the following tough questions, they will be much satisfied, before they move their IDM assets to the cloud.

1. Will the cloud IDM solution enable me to have a single primary Corporate Directory in the cloud? How will it enable the move of my current on-premise primary directory/user database to the cloud?
2. Will the solution allow me to provision users from our existing on-premise endpoints to the cloud?
3. Will the solution help me keep my on-premise endpoints (that contain user identities) intact and move these endpoints in stages to the cloud?
4. I have applications, on-premise whose access is controlled by on-premise Access Control software. How can I continue to have these applications on-premise and enable access control to them via the cloud IDM solution?
5. How will the solution provide access control to the applications that I am going to move to the cloud?
6. Will the cloud IDM solution help me chalk out a new administrator/group/role/user base structure?
7. Will the solution help me control my entire IDM life-cycle management (from the day a user joins the org to the time any user leaves the org) through the cloud IDM ?
8. How exhaustive will the cloud IDM solution allow my access permission levels to be?
9. How often would the cloud IDM solution allow me to do a bulk-load of users from an on-premise directory or db?
10. What will the performance of the system when I perform other IDM operations with the system, during this bulk-load of users?
11. Will the solution allow us to have a separate HR application which we would want to be connected and synched up with the cloud IDM Corporate Directory?
12. What are the security benefits in connectivity, transport, access control, IDM life cycle operations, provisioning, admin-access etc. that the solution offers?
13. What are the connectivity options (i.e connecting to other enterprise applications across that enterprise’s firewall’s?)
14. What SaaS applications that the solution would allow the users to connect to in the future? How would the solution control those connections through a standard universal access administration for my company?

Total Movers to a cloud IDM infrastructure – companies that move 100% of their identity infrastructure to the cloud from an on-premise datacenter :

The primary motivation behind the “Total Movers” of IDM to the cloud would be the following:

1. How can I move my entire IDM infrastructure without losing data, application access control, identity workflows, endpoint identity data or connectors?
2. How long would it take for my move ?
3. Would I be able to setup a QA environment and test the system thoroughly before moving to production in the cloud?
4. How can I transition from my on-premise IDM software to a different cloud IDM software like EzIAMTM?
5. What is the learning curve for my users to use this system?
6. How can I customize the cloud IDM user interface, so it depicts my organizations profile & IDM goals/strategies ?
7. How much can I save in trained IDM skilled personnel and on-premise infrastructure costs when I move my IDM to the cloud in its entirety?

For all the 3 kinds of cloud movers described above, EzIAMTM would be a perfect solution. Pretty much all the questions posted above for all the types of movers, can be answered by the deployment of EzIAMTM. The solution is very versatile, customizable and has great connectivity options to all types of endpoints that an enterprise can have. The learning curve to get used to the screens is very minimal, as the screens are intuitive. Mobile access is enabled. The feature of integrating EzIAMTM with a cloud Governance Service solution is an added incentive for the movers, as this option would be extremely helpful to govern their identity environment efficiently.

Database Scalability simplified with Microsoft Azure Elastic Scale (Preview) for SQL Azure

Database multi-tenancy is less talked about in enterprises, but it is a must and a very familiar topic among SaaS application developers. It is also very hard to develop and manage multi-tenant application and database infrastructure.

SaaS developers usually build custom sharding architectures because of the deep limitations and inflexibility of SQL Server Federation services. Custom sharding and database scalability architectures are largely manual and consist of lots of moving parts. One such manual architecture commonly suggested by application architects is to create a fixed set of databases with an identical schema and maintain a master connection-string table that maps the sharding ID/customer ID to the right shard (refer to the illustration below).

As such, there is nothing wrong with the architecture above, but it imposes a number of challenges, such as:

1. Infrastructure Challenges

  • a) Maintaining and managing the shard meta DB infrastructure
  • b) Splitting one noisy customer's data from one shard to another
  • c) Scaling up a particular shard on an as-needed basis
  • d) Querying data from multiple shard databases
  • e) Merging shards to cut down costs

2. OLAP Challenges (Database Analysis & Warehousing)

  • a) Developers will not be able to issue a single query to fetch data from two different shards
  • b) Conducting data analysis is hard

Introducing Azure Elastic Scale

Azure introduced the "SQL Azure Elastic Scale SDK" to overcome these challenges. To get started with Azure Elastic Scale, you can download the Sample App, which is a good starting point to understand the SDK inside out. Azure Elastic Scale has four key capabilities that make sharding simple:

  1. Shard Map Manager (SMM)
  2. Data Dependent Routing (DDR)
  3. Multi Shard Query(MSQ)
  4. Split & Merge

Shard Map Manager

This is the key part of the SQL Azure Elastic Scale sharding function. The Shard Map Manager is essentially a master database that holds the shard mapping details along with the shard range key (Customer ID/Dept. ID/Product ID) (refer to the screenshot below). When you add or remove a shard, an entry is created or removed in the SMM database.
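For illustration only (this sketch is not from the article; the database names and key ranges are made up), creating a shard map manager and registering shards with the Elastic Scale client library looks roughly like this:

using Microsoft.Azure.SqlDatabase.ElasticScale.ShardManagement;

public static class ShardMapSetup
{
    // Rough sketch: create (or attach to) the Shard Map Manager database and
    // register two shards keyed on a CustomerId range.
    public static RangeShardMap<int> CreateCustomerShardMap(string smmConnectionString, string shardServer)
    {
        ShardMapManager smm = ShardMapManagerFactory.CreateSqlShardMapManager(smmConnectionString);

        RangeShardMap<int> shardMap = smm.CreateRangeShardMap<int>("CustomerShardMap");

        Shard shard0 = shardMap.CreateShard(new ShardLocation(shardServer, "CustomerShard0"));
        Shard shard1 = shardMap.CreateShard(new ShardLocation(shardServer, "CustomerShard1"));

        // Customers 0-99 go to shard0, 100-199 to shard1; adjust the ranges to your data.
        shardMap.CreateRangeMapping(new Range<int>(0, 100), shard0);
        shardMap.CreateRangeMapping(new Range<int>(100, 200), shard1);

        return shardMap;
    }
}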

 

Data Dependent Routing

After the shards have been defined in the Shard Map Manager DB, the Data Dependent Routing API takes care of routing each customer request to the appropriate shard. Data Dependent Routing uses the sharding key to identify the right shard for the specific customer. Shards can also span Azure regions, which helps keep a shard close to the customer, reducing network latency and helping with compliance requirements.

DDR also caches the shard details received from the SMM database to avoid an unwanted round trip on each request. However, this cache is invalidated whenever you change the shard details.
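As a hedged sketch (the repository and table names below are placeholders, not from the article), the DDR call looks like this: the sharding key goes into OpenConnectionForKey, which resolves the owning shard from the cached map and opens a connection to it.

using System.Data.SqlClient;
using Microsoft.Azure.SqlDatabase.ElasticScale.ShardManagement;

public static class OrderRepository
{
    public static int GetOrderCount(RangeShardMap<int> shardMap, int customerId, string shardUserCredentials)
    {
        // ConnectionOptions.Validate re-checks the mapping on the shard itself,
        // which guards against a stale routing cache after a split or merge.
        using (SqlConnection conn = shardMap.OpenConnectionForKey(customerId, shardUserCredentials, ConnectionOptions.Validate))
        using (SqlCommand cmd = conn.CreateCommand())
        {
            cmd.CommandText = "SELECT COUNT(*) FROM dbo.Orders WHERE CustomerID = @cid";
            cmd.Parameters.AddWithValue("@cid", customerId);
            return (int)cmd.ExecuteScalar();
        }
    }
}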

 

Multi Shard Query

Multi Shard Query is the key API that allows querying data from multiple shards and helps in combining the result sets. Under the hood, it actually queries the data individually from the different shards and merges the received result sets to return unified data. Multi Shard Query is only ideal when the shards share an identical schema and is not suitable for shards with custom schemas. Click here to view the complete list of APIs that are part of the Microsoft.Azure.SqlDatabase.ElasticScale namespace.

Example

public static void ExecuteMultiShardQuery(RangeShardMap<int> shardMap, string credentialsConnectionString)
        {
            // Get the shards to connect to
            IEnumerable<Shard> shards = shardMap.GetShards();

            // Create the multi-shard connection
            using (MultiShardConnection conn = new MultiShardConnection(shards, credentialsConnectionString))
            {
                // Create a simple command
                using (MultiShardCommand cmd = conn.CreateCommand())
                {
                    // Because this query is grouped by CustomerID, which is sharded,
                    // we will not get duplicate rows.
                    cmd.CommandText = @"
                        SELECT 
                            c.CustomerId, 
                            c.Name AS CustomerName, 
                            COUNT(o.OrderID) AS OrderCount
                        FROM 
                            dbo.Customers AS c INNER JOIN 
                            dbo.Orders AS o
                            ON c.CustomerID = o.CustomerID
                        GROUP BY 
                            c.CustomerId, 
                            c.Name
                        ORDER BY 
                            OrderCount";

                    // Append a column with the shard name where the row came from
                    cmd.ExecutionOptions = MultiShardExecutionOptions.IncludeShardNameColumn;

                    // Allow for partial results in case some shards do not respond in time
                    cmd.ExecutionPolicy = MultiShardExecutionPolicy.PartialResults;

                    // Allow the entire command to take up to 30 seconds
                    cmd.CommandTimeout = 30;

                    // Execute the command. 
                    // We do not need to specify retry logic because MultiShardDataReader will internally retry until the CommandTimeout expires.
                    using (MultiShardDataReader reader = cmd.ExecuteReader())
                    {
                        // Get the column names
                        TableFormatter formatter = new TableFormatter(GetColumnNames(reader).ToArray());

                        int rows = 0;
                        while (reader.Read())
                        {
                            // Read the values using standard DbDataReader methods
                            object[] values = new object[reader.FieldCount];
                            reader.GetValues(values);
                            // Extract just the database name from the $ShardLocation pseudocolumn to make the output formatter cleaner.
                            // Note that the $ShardLocation pseudocolumn is always the last column
                            int shardLocationOrdinal = values.Length - 1;
                            values[shardLocationOrdinal] = ExtractDatabaseName(values[shardLocationOrdinal].ToString());
                            // Add values to output formatter
                            formatter.AddRow(values);
                            rows++;
                        }

                        Console.WriteLine(formatter.ToString());
                        Console.WriteLine("({0} rows returned)", rows);
                    }
                }
            }
        }

Download the sample application from here.

Elastic Scale Split & Merge

As the name suggests, this helps developers split or merge DB shards based on need. There are two key scenarios where you would need the Split & Merge functionality:

  1. Moving data from a heavily growing hotspot database to a new shard
  2. Merging two databases when the provisioned size is under-used, to reduce database costs

Split & Merge is delivered as a combination of .NET APIs, a Web API and a PowerShell package. Refer to the links below for an introduction and a step-by-step implementation guide.

 

Introduction : http://azure.microsoft.com/en-us/documentation/articles/sql-database-elastic-scale-overview-split-and-merge/

Step by Step Guide: http://azure.microsoft.com/en-us/documentation/articles/sql-database-elastic-scale-configure-deploy-split-and-merge/

About the Author

 Ilyas is a Cloud Solution Architect at 8K Miles specializing in Microsoft Azure and AWS clouds. He is also passionate about Big Data, Analytics and Machine Learning technologies.

LinkedIn || Twitter 

8K Miles Blood Donation Drive: a CSR Initiative

As part of its Corporate Social Responsibility initiatives, 8KMiles Software Services, in association with Jeevan Blood Bank and Research Centre (a public charitable trust), conducted a blood donation camp at our Chennai premises on March 31, 2015. Many of our employees eagerly volunteered and donated blood in the interest of saving people's lives.

 

"I have been a blood donor ever since my college days; I truly enjoy the satisfaction that I get every time I donate blood," says Vinoth, Human Resources, 8KMiles Software Services.

 

Our Blood Donors from 8K Miles Office Chennai, India

 

 

Want to become a game changer with a sense of social responsibilities? Head to our careers page.

EzIAM – On-premise and Cloud Connectivity Options

For an enterprise, a key decision factor in selecting a good Cloud Identity Management service is the ability of the service to connect to on-premise and cloud endpoints. Enterprises normally have user data stored in their endpoints. This user data ranges from the groups (a segregation of users into a classification needed for the endpoint) the user belongs to, to the roles (a classification that allows the user to perform a particular function in that endpoint when the user is part of it) of the user and the access privileges (permission levels to access resources) for that particular endpoint.

The same user in an enterprise can have different types of access, perform different roles and be part of different groups in different endpoints. We normally witness that once the number of endpoints grows in an enterprise, this user data and the related objects like groups and roles become unmanageable and untraceable. Enterprises want this problem fixed.

Any enterprise would love to know what types of access each user has in each of the endpoints at a given point in time. It would be key for them to have this information in one central place. Having this data in a single location would help enterprise managers look out for improper access and redundant roles in each of the endpoints on a periodic basis.

EzIAMTM (a Cloud Identity Management Service from 8KMiles Inc.) offers something called a provisioning directory, where relevant data (users, groups, roles) from the endpoints can be stored and accessed by the enterprise administrator. (Please refer to my previous blog – EzIAM FAQ – to know about the origins and capabilities of EzIAMTM.) This data can then be imported into an Identity Access Governance Service, which analyzes the roles, groups and access permissions during periodic certification campaigns conducted by the business managers. That can be a topic for a future blog. Now, let us delve into the various facets of endpoint connectivity options available within EzIAMTM.

EndPoints:

Endpoints are directories, databases, LDAPs, applications, OS user stores etc. Almost any endpoint in an enterprise has a data store where the users of that particular endpoint are stored. Sometimes the endpoints themselves could be applications, in which case there would be an application database where the user information is stored. Almost any system that contains user information could act as an endpoint for EzIAMTM. The endpoints could reside either on-premise or in a cloud.

Typically, an endpoint is a specific installation of a platform or application, such as Active Directory or Microsoft Exchange, which communicates with Identity Management to synchronize information (primarily attributes of a user stored in the endpoint). An endpoint can be:

■ An operating system (such as Windows)
■ A security product that protects an operating system (such as CA Top Secret and CA ACF2)
■ An authentication server that creates, supplies, and manages user credentials (such as CA Arcot)
■ A business application (such as SAP, Oracle Applications, and PeopleSoft)
■ A cloud application (such as Salesforce and Google Apps)

Connectors:

A connector is the software that enables communication between EzIAMTM and an endpoint system. A connector server (an EzIAMTM Server Component) uses a connector to manage an endpoint. One can generate a dynamic connector using Connector Xpress (an EzIAM Tool), or one can develop a custom static connector in Java. For each endpoint that you want to manage, you must have a connector. Connectors are responsible for representing each of the managed objects in the endpoint in a consistent manner. Connectors translate add, modify, delete, rename, and search LDAP operations on those objects into corresponding actions against the endpoint system. A connector acts as a gateway to a native endpoint type system technology. For example, to manage computers running Active Directory Services (ADS) install the ADS connector on a connector server.

Three Types of Connectors:

EzIAMTM has a rich set of on-premise connectivity options. There are three primary ways of connecting to endpoints:

C++ Connectors (managed by C++ Connector Server (CCS))
Java Connectors (managed by CA IAM Connector Server (CA IAM CS)).
Provisioning Server Plugins

The endpoints (in the diagram, courtesy: CA) that the connectors connect to range from PeopleSoft and Salesforce (CA IAM CS) to AD and DB2 (C++ Connector) and RACF (Provisioning Server Plugin). These are just examples of connectors. A list of out-of-the-box connectors is given in the "Connecting to Endpoints" sub-section below.

One cannot use both CA IAM CS and CCS to manage the same endpoint type.

What Connectors Can Do:

EzIAMTM has a number of out-of-the-box connectors that help to connect to popular endpoints. Each connector lets Identity Management within EzIAMTM perform the following operations on managed objects on the endpoint:
■ Add
■ Modify—Changes the value of attributes, including modifying associations between them (for example, changing which accounts belong to a group).
■ Delete
■ Rename
■ Search—Queries the values of the attributes that are stored for an endpoint system or the managed objects that it contains.
For most endpoint types, all of these operations can be performed on accounts. These operations can also be performed on other managed objects if the endpoint permits it.

Connecting to Endpoints:

Popular out-of-the-box Connectors in EzIAMTM:
CA Access Control Connector
CA ACF2 v2 Connector
CA Arcot Connector
CA DLP Connector
CA SSO Connector for Advanced Policy Server
CA Top Secret Connector
IBM DB2 UDB for z/OS Connector
Google Apps Connector
IBM DB2 UDB Connector
IBM RACF v2 Connector
Kerberos Connector
Lotus Domino Connector
Microsoft Active Directory Services Connector
Microsoft Exchange Connector
Microsoft Office 365 Connector
Microsoft SQL Server Connector
Microsoft Windows Connector
Oracle Applications Connector
Oracle Connector
IBM i5/OS (OS/400) Connector
PeopleSoft Connector
RSA ACE (SecurID) Connector
RSA Authentication Manager SecurID 7 Connector
Salesforce.com Connector
SAP R/3 Connector
SAP UME Connector
Siebel Connector
UNIX ETC and NIS Connector

Ways to Create a New Connector:

One can also connect to an endpoint that is not supported out-of-the-box in EzIAMTM. To do this, an enterprise needs to create its own connector in one of these ways:

■ Use Connector Xpress to create the connector.
■ Use the CA IAM CS SDK to create the connector.
■ Ask 8KMiles to create a connector.

Set Up Identity Management Provisioning with Active Directory:

One can use Active Directory Server (ADS) to synchronize attribute data to supported endpoints. This could be done by configuring CA IAM CS to propagate local changes in Active Directory to a cloud-based identity store using a connector. For example, assume that you have a GoogleApps installation in the cloud. You could create an ADS group named “GoogleApps” and then configure the CA IAM CS to monitor that group. CA IAM CS synchronizes any changes to the GoogleApps environment in the cloud. If you add a user to the ADS GoogleApps group, CA IAM CS uses the GoogleApps connector to trigger a “Create User” action in the GoogleApps environment proper.

To set up directory synchronization:
1. Install CA IAM CS in your environment.
2. Acquire the endpoints that you want to synchronize with. You must acquire endpoints in order to create templates in step 4.
3. Create one or more directory monitors. Monitors capture changes that you make in your local Active Directory, and report them for the synchronization.
4. Create one or more synchronization templates. Templates control settings for the directory synchronization.

Custom Connectors:

Custom connectors are connectors that can be programmed (mostly from pre-available template structures) to enable an enterprise to connect to custom endpoints (i.e., endpoints that are not supported out-of-the-box in EzIAMTM).

Custom Connector Implementation Guidelines:

It would help the developers to consider the following guidelines when designing and implementing a connector:

■ Drive as much of the connector implementation logic as possible using metadata.
■ Write code that takes advantage of the service provided by the CA IAM CS framework, like pluggable validators and converters, and connection pooling support classes.
■ Write custom connector code to address any additional specific coding requirements.

In summary, connection to endpoints is a critical aspect of modern Cloud Identity Management systems. The crucial connector properties to look for in your Cloud Identity Management system would be:

  • the efficiency of the connectors that would dictate the speed of data transfer between the endpoint and the Corporate user store
  • the synchronization of attributes between the endpoint and the store (strong synchronization vs weak synchronization)
  • the customization aspects of the connector (connector pool size, reverse synchronization from the endpoint to the Corporate Store etc.)
  • the Validators and Convertors of datatypes (from endpoint to Directory) that the connectors offer
  • the range of endpoints that the connectors could connect to ranging from AD, LDAP, DBs, Web Services (SOAP and REST-based) to custom endpoints with custom schema & metadata

EzIAMTM is an ideal candidate in this regard as it has a rich set of on-premise and cloud connectivity options. It has all the ideal connector properties that an enterprise would need to connect to their favourite endpoints.

Top 10 Azure Glossary: Demystified

1. Affinity Group

An "affinity group", a.k.a. scale unit, helps co-locate related resources in close proximity to reduce network latency. For example, when you launch a multi-tiered web application with a front-end tier, a business-logic tier and a database server, you don't want to place these resources in different parts of the datacenter; instead, you want to group them together for better network performance. Azure highly recommends affinity groups for grouping related resources, but doesn't mandate them.

Azure datacenters consist of multiple affinity groups, and not all affinity groups contain all the services of Azure; for example, new high-power VM families, internal load balancers and reserved IPs may not be available in all affinity groups.

2. Regional VNet

Regional VNet is the enhanced version of VNet. Until 2014, a VNet was bound to an affinity group, which is just a subsection of an Azure datacenter. An affinity group has a limited set of resources and doesn't contain all the services offered within a region. As of this writing, Azure has 17 regions spread globally and is planning to power up many more datacenters. When you create a regional virtual network, it can span the entire region, and thus you can avail yourself of all the services available within the region instead of being limited to an affinity group.

3. Availability Set

Azure's main promise is high availability. To achieve HA for your applications, it is always recommended to run at least two instances of your solution to qualify for HA and the 99.95% Azure SLA.

Availability Set has two main concepts called Fault Domain & Upgrade Domain.

As the name suggests, a Fault Domain is an individual or group of containers/racks inside the Azure datacenter that shares power and network switches. Two virtual machines placed in an availability set will be deployed in two different fault domains, so that a problem occurring in one fault domain will not affect the other.

Upgrade Domains are a categorization of resources used to manage host operating system updates and patches. This helps avoid both VMs being updated or patched at the same time.

4. Resource Group

A Resource Group helps you group all related services together for better resource management, tagging and billing. It is not to be confused with affinity groups, which keep virtual resources in close proximity.

For example, if you manage two different projects, say an internal SharePoint portal and a public-facing corporate website built on PHP, each solution has a different set of resources, and hence you may want to group each project's resources together in its own resource group.

Key pointers about Resource Groups at this moment are:

  1. The default and maximum number of resource groups that you can create within a subscription is 500.
  2. Resource Groups should not be confused with Active Directory groups functionally; they are two different services.
  3. Linking of shared resources between groups is not fully functional yet.
  4. A Resource Group can span regions.

5. Endpoint

By default, VMs launched within a virtual network can communicate with each other using their private addresses, but if you want VMs placed in different networks to communicate, irrespective of whether they are within Azure, on-premise or in another cloud, you need public IPs, otherwise called endpoints. When you create VMs, ports like Remote Desktop, Windows PowerShell Remoting and Secure Shell (SSH) are automatically opened, but you can also open other ports like FTP, SMTP, DNS, HTTP, POP3, IMAP, LDAP, HTTPS, SMTPS, IMAPS, POP3S, MSSQL and MySQL as required.

Each endpoint on the VM has two ports, i.e. a public port and a private port. The public port is used for incoming traffic from the internet, and the private port is used for internal communication with other services within the virtual network.

6. Public Virtual IP Address/Dynamic IP Address

When you first create a Cloud Service in Azure, you will be assigned a virtual public IP address (VIP). This VIP will not be released until all the VMs placed inside the Cloud Service are successfully deleted or stopped (de-allocated).

Dynamic IP addresses (DIPs) are nothing but private IP addresses allocated by DHCP (Dynamic Host Configuration Protocol); also note that they are bound to the VNet CIDR block defined by the user. Similar to the public IP, DIPs are also not released until all the VMs placed inside the Cloud Service are successfully deleted or stopped (de-allocated).

Reserved Virtual IP Address

Users can reserve IP addresses for their subscription. This gives them a predictable IP address that can be associated with their Cloud Services and Virtual Machines. By default, when you delete or stop (de-allocate) your instances, the VIPs are released back to the Azure IP address pool, but when you reserve IPs they remain in your subscription until you remove the reserved IPs from your subscription.

 

7. Instance Level Public IP Address

An instance-level IP address is associated directly with a Virtual Machine instance rather than with the Cloud Service that wraps all the Virtual Machines within it. Currently you can only allocate one PIP to a VM instance, and it is not currently supported on multi-NIC VMs.

Instance-level IP addresses can be used when you simply want to connect to your VM with an IP instead of using Cloud Service endpoints opened individually for each port, like http://mytestvm.cloudapp.net:8080. Other benefits include receiving traffic on any port instead of selected ports, which is best suited for passive FTP where the selection of ports is completely dynamic in nature; similarly, outbound traffic from the VM can be routed via the PIP.

At this moment, requesting instance-level IPs as well as allocation of IPs can only be done using Windows PowerShell and the REST APIs.

8. X-PLAT CLI

It's a command-line interface for the Windows, Linux and OS X platforms. You might be familiar with the Windows PowerShell CLI, the favorite scripting utility of IT pros used to automate and execute remote commands, but it's meant only for Windows. The X-PLAT CLI, built using JavaScript/Node.js, is an alternate solution which brings the same power to non-Microsoft platforms. You can download the Windows installer here and the OS X installer here, and find Linux installation instructions here.

9. Cloud Service

Out of all the naming conventions of Microsoft Azure, Cloud Service is the single most confusing and ubiquitous term. Cloud Service is a very broad term used by everyone, everywhere, basically for one reason: anything hosted off-premise is generally called a cloud service.

A Cloud Service in Azure is nothing but a DNS name, e.g. http://<<contoso>>.cloudapp.net or http://<<contoso>>.azurewebsites.net, which can be mapped to a custom domain. Creating a cloud service is the first step in creating public interfaces like a Web App, Mobile Services or an Azure VM.

 

10. App Services

Azure App Service is a new term coined by Microsoft recently, which consolidates Websites (Web Roles/Worker Roles), WebJobs, Mobile Services and API services together and offers them as a package. As of writing this article, it is available only in the preview portal. There was a lot of confusion within the developer community about when to choose Web Roles, Websites, Mobile Services etc. because of their close resemblance to each other. In fact, you can create a mobile service using a Worker Role or a Web Role.

Now let’s look at what these individual services can do

Web App

This is nothing but Azure Websites, which helps developers quickly build websites using a variety of programming languages and host and scale them seamlessly using the Azure PaaS offering.

Mobile App

The Azure Mobile App service is purpose-built for three key scenarios: 1. enterprise SSO with AD, 2. push messaging and 3. social integration. The Mobile service is completely platform and technology agnostic, meaning you can build mobile services for a variety of platforms like iOS, Android and Windows with either .NET or JavaScript as the back end.

Logic App

It's a new breed of service targeted at developers and technical business users to orchestrate and create API workflows. APIs are found everywhere; almost all services expose APIs. Logic Apps helps you connect various APIs together in a secured and organized manner. Logic App provides out-of-the-box social media connectors for Twitter, Facebook and Yammer, enterprise connectors for SAP, Marketo and Salesforce, and Azure data service connectors for SharePoint, Mobile Services, Storage etc. If you don't find a connector for your favorite service, you can build one yourself using the API App service.

API App

It's an API hosting service where you can build APIs using various programming languages, including C#, Java, Python, Node.js and PHP, and host them with the Azure App Service. An API App connects seamlessly with Azure Web Apps, Mobile Apps and Logic Apps. The two major benefits of API Apps are 1. simplification of security using AD/SSO and OAuth, and 2. quick API deployments and automated versioning support.

About the Author

 Ilyas is a Cloud Solution Architect at 8K Miles specializing in Microsoft Azure and AWS clouds. He is also passionate about Big Data, Analytics and Machine Learning technologies.

LinkedIn || Twitter 

Azure Virtual Network vs AWS Virtual Private Cloud

Statutory Warning: This content is neutral towards all cloud providers. The content is strictly time-bound, and we request you to read the respective cloud provider's documentation and refer to the current status on the respective service updates portals of Azure and AWS.

Virtual Network vs AWS VPC

Amazon has been a forerunner in the cloud computing arena and pioneered many industry-revolutionizing services like EC2, VPC etc. AWS's initial offering, the EC2-Classic platform, allowed customers to run EC2 instances on a flat global network shared by all customers; other attributes, including shared tenancy, restrictions on Security Groups and the lack of Network Access Control Lists, concerned security-minded customers. AWS then introduced EC2-VPC, an advanced platform which provisions a logically isolated section of the AWS cloud. AWS EC2-VPC supports shared/dedicated tenancy, improved Network Security Groups, Network Access Control and more. Enterprise and SMB customers gained more confidence with the VPC architecture and started adopting AWS more than before.

In 2013, Azure turned its focus from being just a PaaS provider into a full-fledged IaaS provider to counter the competition and avoid losing market share. In order to compete with the early starter AWS, Azure introduced many new services, most importantly Virtual Networks, "a logically isolated network", the Azure equivalent of VPC within its datacenters. Azure's Virtual Network resembles VPC in many aspects and in fact behaves similarly in many cases, but there are a few differences as well.

In this blog, we'll see those differences in detail and, of course, the similarities as well. It's all about networking, so let's begin with:

Subnet

Subnets are the building blocks of private networks. Subnets are a great way to divide a bigger network into many smaller networks and place workloads depending on the nature of the data they deal with. AWS, being a mature IaaS provider, has mature tools like its management portal, CloudFormation templates, CLIs and programmable APIs to launch subnets. AWS also provides wizards to automate the common VPC architectures, such as:

  • VPC with a Single Public Subnet
  • VPC with Public and Private Subnets
  • VPC with Public and Private Subnets and Hardware VPN Access
  • VPC with a Private Subnet Only and Hardware VPN Access

This helps users greatly reduce VPC setup time and simplifies the entire process. The wizard makes creating complex networks child's play; anyone can create and provision a multi-tiered web application, or any other workload in public/private subnets, in minutes.

Azure Virtual Network also allows us to create any number of subnets using the management portal, PowerShell or the CLI. Unlike AWS, Azure doesn't currently have wizards to create common architectures like the ones mentioned above.

Security

Security is the primary reason why a virtual network is preferred over public-facing endpoints. AWS provides various virtual security services to provide maximum security at the virtual instance level, the subnet level and the overall network level.

Security Group

AWS "Security Groups" help protect instances by configuring inbound and outbound rules. Users can configure which ports to open to accept traffic from which sources, and similarly configure outbound ports from EC2 instances.

Image Source : MSDN

Azure's equivalent, the "Network Security Group" (NSG), is currently available only for regional virtual networks (read what a regional VNet is above) and not available for VNets that have an affinity group associated. You can have a maximum of 100 NSGs per subscription (hopefully this is the hard limit enforced; MSDN doesn't explain it further).

AWS allows us to create 200 Security Groups per VPC; for example, if you have 5 VPCs you can create 200 * 5 = 1000 Security Groups in total. However, Security Groups in both clouds cannot span regions.

Unlike AWS, a Network Security Group in Azure can be associated with a VM instance, a subnet, or both (subnet and VM); this is a powerful multi-layer protection that a VM can get, click here to read more. Azure currently doesn't offer a user interface to add/edit security groups, so users must use PowerShell and the REST APIs to set them up (refer to the PowerShell workflow below).

PowerShell cmdlet to create an Azure Network Security Group

Network ACLS

Azure and AWS both support Network Access Control Lists. ACLs allow users to selectively permit or deny traffic to your networks. Both clouds position them as an enhancement or optional security mechanism on top of security groups and other security mechanisms. ACLs in Azure are currently limited to securing endpoints (see "Endpoint" above) and don't offer the same flexibility and control as AWS provides; as of writing this article, you can only create Azure Network ACLs using PowerShell and REST API commands. An ACL in AWS allows us to set access control at the subnet level, i.e. if you allow HTTP traffic to a subnet, all the EC2 instances inside the subnet can receive HTTP traffic; however, if certain EC2 instances are configured not to allow HTTP traffic, that traffic will be filtered by their Security Groups. Azure's Network ACLs behave almost the same, except that they work at the endpoint level.

Note: Azure recommends using either a Network Access Control List or a Network Security Group, not both at the same time, because functionally they do the same thing. If you have configured Network ACLs and want to switch to Security Groups, you must first remove the endpoint ACLs and then configure the Security Group.

Custom Routing Tables

Custom routing tables contain a list of routing rules that determine how traffic should flow inside the subnet.

Image Source : MSDN

In AWS, each subnet must be associated with a route table, which controls the routing for the subnet. If you don't explicitly associate a subnet with a particular route table, the subnet uses the main route table of the VPC.

Windows Azure provides default routing across subnets within a single virtual network, but does not provide any type of network ACL capability with respect to internal IP addresses.  So in order to restrict access to machines within a single virtual network, those machines must leverage Windows Firewall with Advanced Security (Refer the diagram).

Microsoft must be cooking this feature in their kitchens. We can expect this delicious feature in Azure restaurant soon.

(Image source: Technet Blog)

Dedicated Instances

Amazon provides Dedicated EC2 Instances that run in a VPC on hardware that is dedicated to a single customer. Dedicated Instances are physically isolated at the host hardware level from instances belonging to other customer accounts. Although dedicated instances in a VPC currently don't work with many mainstream services, including EBS block storage, there are certain cases where customers prefer dedicated instances.

Azure doesn't offer dedicated instances at this moment; however, customers have raised requests with Microsoft for such an offering, and it is expected that Microsoft will consider the request and bring in support for dedicated instances.

Virtual Network Interfaces

A virtual network interface card (NIC) is a virtual appliance that can be plugged into and unplugged from VMs. It provides full-time connectivity with the network and helps route certain networks to certain NICs.
AWS allows you to attach multiple Elastic Network Interfaces to EC2, however AWS restricts this capability to certain EC2 families and not all. As of writing this article, the C3/C4/CC2/CG1/CR1/HI1/HS1/I2/M2/R3 large families are allowed to attach a maximum of 8 network interfaces and 30 private IP addresses.

Azure also supports this feature; however, just like AWS, Azure restricts multiple virtual NICs to only certain large machines. Azure lets you create multiple NICs on the following VM categories:

  • Large (A3) and A6: 2
  • ExtraLarge (A4) and A7: 4
  • A9: 2
  • D3: 2
  • D4: 4
  • D13: 4

Azure has enabled this feature only in its IaaS offering and not in PaaS. There are some more limitations: only a public-facing virtual IP address is supported on the default NIC, and adding or removing IPs is not allowed once the VM is created. Users cannot apply network security or forced tunneling to the non-default NICs. However, we can expect Microsoft's network team to remove some of the current limitations in the upcoming months. Click here to read more.

DNS Service

DNS is a very crucial part of networking, and it is essential to avoid latency and unnecessary network hops. AWS Route 53 provides a highly available and redundant DNS service that connects user requests to various services of AWS such as EC2, ELB or S3, and it can also be used to route users to infrastructure outside of AWS.

Currently, Azure doesn't offer a DNS service and asks users to add DNS redirects to the cloudapp.net URL given to all services of the Azure cloud. However, there are loads of requests from Azure customers to build a DNS system to get around the redirection issue.

Connectivity

Interconnectivity lets different networks connect to each other. Cloud providers provide three basic interconnectivity options:

Direct Internet Connectivity

AWS allows users to associate public IPs with EC2 instances, thereby allowing internet connectivity to those machines; similarly, VMs in a private subnet gain internet access by routing through NAT instances in the public subnet.

Azure lets users configure public endpoints, a.k.a. public IP addresses, for VMs inside a subnet, so that the VMs can be connected with other systems.

VPN over IPsec

VPN over IPsec is an IP-based connection methodology to interconnect two different networks, irrespective of whether the networks are within the cloud or outside, or between the cloud and an on-premise network. Broadly, there are two types of VPN routing protocols used: 1. static routing and 2. dynamic routing.

Azure and AWS both provide support for static and dynamic routing; however, Azure at this moment doesn't support active routing (BGP), though Azure has published a huge list of VPN device manufacturers that support BGP routing.

Private Connectivity using Exchange Provider

The private connectivity option is mainly focused towards enterprise customers who have bandwidth-heavy workloads. A private connection provided by ISPs can offer much better performance than the internet. Both AWS and Azure have partnered with major telecom providers and ISVs to offer private connectivity between their clouds and a customer's on-premise infrastructure. Azure supports most of its features through ExpressRoute, except certain features like Service Bus, CDN, RemoteApp, Push Notifications etc. (click here to read more). Similarly, all AWS services, including Amazon Elastic Compute Cloud (EC2), Amazon Virtual Private Cloud (VPC), Amazon Simple Storage Service (S3) and Amazon DynamoDB, can be used with AWS Direct Connect. As far as the SLA is concerned, AWS doesn't provide an SLA for this service, but Azure on the other hand promises a 99.9% SLA, failing which the customer can claim service credits.

SDK & Tools

Azure and AWS provide programmable SDKs and APIs to deal with the various networking options provided by these clouds. Developers can create a virtual network using Azure's PowerShell, CLI or the management dashboard; similarly, AWS allows users to configure a VPC using CloudFormation templates, REST APIs and CLIs.

Summary

The intention of this article is to highlight certain intricate differences; it is not an in-depth comparison guide. AWS, being the pioneer in the IaaS space, has a lot of mature options and tool sets to offer, while Azure on the other hand is currently building and maturing its IaaS offering. Azure, being a conventional software provider, focused mainly on enabling its Windows environment to suit and operate within an IaaS offering, hence all the newly launched services and services in preview seem to be more Windows-focused. Microsoft welcomes partners and vendors to build the providers/adapters/connectors/APIs for open-source programming languages like Python or Ruby on Rails. Azure from its inception has focused on enterprise customers and goes with a hybrid story; AWS, on the other end, tasted its success with startups and SMB customers and is now trying to build an enterprise storyline to take AWS to the next level.

About the Author

 Ilyas is a Cloud Solution Architect at 8K Miles specializing in Microsoft Azure and AWS clouds. He is also passionate about Big Data, Analytics and Machine Learning technologies.

LinkedIn || Twitter 

GHost Vulnerability and its mitigation using RunDeck

8KMiles always strives to simplify complex processes and procedures; likewise, we have come up with a simple solution to fix the GHost vulnerability, which has affected millions of Linux systems across the globe. Applying a patch to a single server is a cakewalk, but consider patching hundreds or thousands of servers.

 

Synopsis

 

Vulnerability: GHost
CVE ID: CVE-2015-0235
Operating Systems Affected: Debian 7 (wheezy), Red Hat Enterprise Linux 6 & 7, CentOS 6 & 7, Ubuntu 12.04
Documented Operating System: RHEL (v5)
Vulnerable Software: glibc-2.2 (released on November 10, 2000) and nscd
Fixed Software Version: glibc-2.5 and latest nscd

 

Summary

A GNU C Library (glibc) vulnerability (CVE-2015-0235), referred to as the GHOST vulnerability, was announced to the general public. In summary, the vulnerability allows remote attackers to take complete control of a system by exploiting a buffer overflow bug in glibc's gethostbyname functions (hence the name).

Procedure (Single Server)

 

The following procedure was performed on RHEL/CentOS (v5) Operating Systems

Step 1:

 

Check for the glibc version

#rpm -q glibc

If the version of glibc matches, or is more recent than, the ones listed here, you are safe from the GHOST vulnerability:

CentOS 6: glibc-2.12-1.149.el6_6.5

CentOS 7: glibc-2.17-55.el7_0.5

RHEL 5: glibc-2.5-123.el5_11.1

RHEL 6: glibc-2.12-1.149.el6_6.5

RHEL 7: glibc-2.17-55.el7_0.5

 

If the version of glibc is older than the ones listed here, your system is vulnerable to GHOST and should be updated.

 

Step 2: (as root user)

 

# yum update glibc nscd

Or (sudo user)

# sudo yum update glibc nscd

 

 

Step 3:

 

#reboot

Or

#sudo reboot

 

 

 

Procedure (Multiple Servers with RunDeck)

 

Step 1:

Execute the update command on the ad-hoc tab and choose all the Linux servers; refer to the screenshot below.

 

Step 2:

Once the above activity is completed, execute the reboot command on the ad-hoc tab; refer to the screenshot below.

As simple as that! Regardless of the number of servers you have, whether 100 or 1000, RunDeck will execute the commands with ease and provide real-time activity updates and logs for auditing.

* RunDeck's public keys should be set up for the privileged user on each server so that it can execute commands.

 

Please Contact 8KMiles to make things simple and experience our Operations Automation expertise.


EzIAM FAQ

 

EzIAMTM Identity-as-a-Service was recently launched by 8kMiles in AWS. We get a lot of queries from customers about the technical & functional capabilities of EzIAM. I am planning to write a series of blog posts that would help understand this service better. The following FAQ would help us get introduced to the service.

1. What is EzIAMTM?

EzIAM is a cloud-based Identity Management solution that can be configured to accomplish 3 important Identity and Access Management functions:

  • Identity Management
  • Advanced Authentication
  • Single Sign-On

2. How is EzIAM different from an On-premise Identity Management Solution?

With EzIAM one can completely outsource the management of their identities to a secure cloud. For a company, especially small and medium businesses, this could be a great option as they can save on:

  • The setup costs of IAM infrastructure
  • Skill and knowledge required to drive the IDM systems
  • Day to Day running & operations of their IDM systems

3. Is EzIAM secure?

All communications from, to and within EzIAM (be it http, ldap, database operations, reading configuration files, user data inputs into html forms of EzIAM, email notifications) happen via Secure Socket Layer/TLS with AES ciphers aided by 2048 bit key certificates.

4. What are the technology benefits offered by EzIAM?

EzIAM offers a lot of technology benefits for an enterprise:

  • SSL/TLS Communications
  • IAM Hosted in a secure AWS (Amazon Web Services) Virtual Private Cloud (VPC)
  • A Multi-tenant environment where each customer’s data is logically and physically segregated from another customer’s data
  • Advanced & Multi-factor Authentication features that can be leveraged to control access to high valued assets/resources
  • Identity Federation infrastructure that would help companies to access other SaaS Services & expose their own SaaS services to other companies
  • Synchronization with on-premise Active Directory & other on-premise endpoints
  • Out-of-the box SSO connectors to common SSO endpoints
  • Out-of-the box provisioning connectors to common provisioning endpoints
  • Option to have custom connectors to custom endpoints (both SSO and provisioning)
  • Simple and Complex IDM workflows
  • Email Notifications

5. Is EzIAM a multi-tenant solution?

Yes, EzIAM is a multi-tenant solution.   Each company’s identity data is logically and physically segregated from another company that subscribes to this solution.    Designated Tenant Administrators are assigned for each tenant/company who can basically control the identity and access management objects of their own company only.  No asset of one company can be accessed by a user or admin of another company.

6. How is EzIAM managed?

There are 3 sets of administrators to functionally manage EzIAM.

  • MSP Administrators
  • CSP Administrators
  • Tenant Administrators

The 8kmiles team manages the EzIAM infrastructure with strict SLAs.

7. Is EzIAM managed 24×7?

Yes.  EzIAM is run and operated by 8kMiles with strict SLAs on a 24×7 basis.    8kMiles team is responsible for fixing any operational or functional issue related to EzIAM.     8kMiles has deployed multiple layers of support and help desk, to troubleshoot any issues.

8. What is the role of administrators (of a company) who signs up for EzIAM?

The Tenant Administrator role in EzIAM is assigned to a person (of a company that signs up for EzIAM) who is currently responsible for maintaining the IAM infrastructure of that particular tenant/company on-premise.

9. Can EzIAM be used to “Request Access” by users to applications?

Yes. EzIAM has a “Request Access” feature by which users can request access to applications. The request will be assessed and granted permission by the administrators (who will be part of the Request Access Workflow)

10. Does EzIAM have email notifications as part of its workflows?

Yes.  EzIAM has secure configurable email servers that make sure that email notifications are sent and received by identities within EzIAM in a secure manner.

11. Does EzIAM support federated access to other SaaS providers and third party applications?

Yes. EzIAM supports federated access to other SaaS providers and third party applications.   A Federated partnership can be setup between EzIAM and the external party wherein EzIAM can act as either the IDP (Identity Provider) or the SP (Service Provider).

12. How is advanced authentication implemented in EzIAM?

Advanced or Strong authentication schemes can be used by tenant or CSP administrators to protect high valued resources within the IAM infrastructure of the tenant deployment.  It is implemented in an easily configurable manner.   The Advanced authentication scheme can easily be configured to be part of a multi-factor authentication also.

13. What are the primary advanced/strong authentication mechanisms supported by EzIAM?

The primary strong authentication mechanisms supported by EzIAM are:

  • ArcotID PKI
  • ArcotID OTP

14. What is ArcotID PKI?

ArcotID PKI is a patented Cryptographic key concealment technology from CA.  It can be used to authenticate to a website or other online resource, through a web browser.

15. What are the features of ArcotID PKI?

The important features of the ArcotID PKI credential are as follows:

  • An ArcotID PKI can be accessed only with the correct password
  • ArcotID PKI authentication uses a challenge-response authentication protocol. During authentication, a client application on the end user’s device signs the challenge with the end user’s private key.  The signed challenge is then sent to the Advanced Authentication Server for verification
  • A plausible response is generated for every password that is entered, even if the password is incorrect
  • The validity period for the ArcotID PKI credential is configurable

16. What is ArcotID OTP?

ArcotID OTP is a secure software authentication mechanism that allows the use of mobile phones, iPads, and other PDAs as convenient authentication devices. The ArcotID OTP credential is used for primary authentication, and it supports the Open Authentication (OATH) standard. Similar to the ArcotID PKI credential, ArcotID OTP also uses CA Arcot’s patented Cryptographic Camouflage technology to protect credentials from brute force attacks.

17. What are the Risk evaluation and Fraud detection features enabled in EzIAM?

EzIAM’s Advanced Authentication service provides real-time protection against fraud in online transactions.  This is made possible by the following features:

  • End-User Device Identification Data and Location Data
  • Risk Score and advice
  • Risk Evaluation Rules
  • User Device Association

18. What are the secondary authentication mechanisms supported by EzIAM?

Secondary authentication refers to the additional authentication that is performed in the following cases:

  • An end user has either forgotten or wants to reset the password or PIN
  • An end user’s ArcotID PKI or ArcotID OTP credential has expired
  • A roaming end user is trying to authenticate from a device that is different from the one used to enrol with the system, or one that is already marked trusted during a previous roaming attempt
  • Risk evaluation is enabled, and it generates an advice to increase authentication for the transaction that the end user is trying to perform

Secondary authentication methods supported by EzIAM are:

  • Question and Answer pairs
  • Security Code (which is similar to a one time password)

19. What is a two-step authentication?

When a two-step authentication is enabled, the end user is authenticated consecutively using two different authentication methods.

20. What are the Advanced Authentication flows?

The Advanced Authentication service of CA CloudMinder provides various advanced authentication flows that cater to a tenant’s business requirements. Each flow is used to secure access to a tenant’s resource and define the authentication steps that take place when end users try to access the resource.

The Advanced Authentication service offers ArcotID PKI, ArcotID OTP, Security Code, and Risk Evaluation as primary credential types that can be used to secure access to a resource. An advanced authentication flow is based on either a single credential type or a combination of these credential types.

21. What are the Advanced Authentication flows supported by EzIAM?

The Advanced Authentication service offers the following advanced authentication flows for the supported credential types :

  • ArcotID PKI Only
  • ArcotID PKI with Risk
  • ArcotID OTP Only
  • ArcotID OTP with Risk

22. What are the ArcotID OTP flows supported by EzIAM?

  • ArcotID OTP Only flow
  • ArcotID OTP Roaming Download flow
  • ArcotID OTP New Device Activation flow
  • Forgot my PIN flow

23. What are the primary Identity Management features supported by EzIAM?

  • User Management
  • Password Management including Synchronizing Passwords on Endpoints
  • Role Management (including Admin & Provisioning Roles)
  • Access Requests
  • Integrating Managed Endpoints
  • On-premise Provisioning
  • Provisioning with Active Directory
  • Synchronization
  • Identity Policies
  • Reporting
  • Workflow
  • Email Notifications
  • Task Persistence
  • System Tasks
  • Custom Connectors

24. What are the primary SSO features supported by EzIAM?

  • SSO Applications configured for your business portal
  • Authentication Methods for SSO Applications
  • Federated Partnerships to enable SSO
  • SSO using a Third-party IDP
  • Secure Token Service (STS)
  • WS-Trust claims transformation
  • Self-registration services for SSO
  • User validation for sensitive applications
  • Attribute Query Support
  • Proxied Attribute Query Support

25. Is EzIAM highly available and load-balanced?

Yes, each component server of EzIAM is load-balanced and is made highly available in an AWS (Amazon Web Services) cloud environment.

26. What are the specific benefits offered by EzIAM to companies, especially SMBs, from a cost standpoint?

  • Companies do not have to invest in an IAM infrastructure
  • Companies do not have to hire or train staff to manage IAM infrastructure
  • IAM consultants need not be hired to perform domain specific complex IAM tasks for IDM setup, federation, SSO or Advanced Authentication
  • The EzIAM infra is available 24×7 with help desk and support. So companies can save on these

27. Can EzIAM support directory synchronization with on-premise Active Directories?

Yes it does.   EzIAM can synchronize with an on-premise Active Directory.

28. Can EzIAM support SSO with on-premise applications?

Yes. EzIAM supports SSO to on-premise applications.   EzIAM can also protect applications to be accessed by external users through an SSO process i.e. it can act as an SP too.

29. Can EzIAM support advanced authentication and/or multi-factor authentication as part of SSO process?

EzIAM supports advanced authentication and/or multi-factor authentication as part of the SSO process.

30. Can EzIAM UI be customized?

EzIAM UI can be customized to reflect the tenant environment’s look and feel.

 

Loading Big Index Data into newly launched Amazon CloudSearch engine

The search tier is the most critical section of many online verticals like travel, e-commerce, classifieds etc. If users cannot search products efficiently they will not make their buying decisions properly, which in turn massively affects the revenues of these companies. Most of these search tiers are usually powered by Apache Solr, FAST, Autonomy, Elasticsearch etc. AWS also has a search service called CloudSearch, which is a fully managed service in the cloud that makes it easy to set up, manage, and scale a search solution for your website. Amazon CloudSearch relieves you from the worry of hardware provisioning, setup, and maintenance. As your volume of data and traffic fluctuates, Amazon CloudSearch automatically scales to meet your needs.

In AWS infrastructure, Apache Solr has been the king and the software to beat till now; recently it got a heavy competitor in the form of Amazon CloudSearch API 2013-01-01.

API version 2013-01-01 of Amazon CloudSearch is internally powered by a customized version of the Apache Solr engine, and it is specifically designed for running highly scalable and available search on the Amazon Web Services cloud. This 2013 CloudSearch API has lots of similarities with Apache Solr, and customers can easily migrate to this version and leverage the benefits of the Amazon cloud infrastructure. We are already hearing that many AWS customers are planning their migration from FAST, Solr and the A9 engine to the Amazon CloudSearch 2013-01-01 API engine.

My team is already migrating a couple of customers to this Amazon CloudSearch 2013-01-01 API, and I have shared our experience with this process for the benefit of the AWS community.

Reference Migration Architecture and requirements:


In this article I am going to explore how to:

  • Migrate a 300+ GB index containing close to 247+ million records distributed across 105 searchable fields in a highly scalable/parallel manner in AWS infrastructure.
  • The 300+ GB index file is stored in Amazon S3.
  • A custom data loader program built on Amazon Elastic MapReduce is used for parallel loading.
  • Around 6 search.m2.2xlarge instances are created, with 2 partitions and a replication count of 5.
  • Around 10+ m1.large EMR core nodes are used for data loading. This loader can be scaled to hundreds of nodes depending upon the volume and velocity of the data pump required.
  • Amazon CloudSearch infrastructure provisioning, automated partitioning and replication count are handled by AWS.

Lets get into the details below:

Step 1) Create a new Amazon CloudSearch domain: We have named the search domain "bigdatasearch" and chose the search instance type search.m2.2xlarge. Since we are planning to pump and query a 300 GB index with millions of documents, it did not make sense for us to choose a smaller instance type of Amazon CloudSearch. Usually the base instance type can be selected based on the number and size of the documents you are planning to maintain in Amazon CloudSearch.
Note: Here we have chosen a replication count of 5. This is a little strange in a distributed architecture, because usually a higher replication count for the master decreases the speed of document upload. But when we were playing with Amazon CloudSearch we observed that it increased the speed of uploads. In addition we also observed the following:

  • If we keep the replication count at 0 or less and use a smaller search instance type while pumping documents in parallel from multiple nodes, either the Amazon CloudSearch server fails at times or error rates are high.
  • If we keep the replication count at 0 or less and use a larger search instance type while pumping documents in parallel from multiple nodes, Amazon CloudSearch internally creates 3-5 nodes by itself and this shows up in the replication count. Waiting to discuss this behavior with AWS SA folks.

We will be utilizing a distributed uploading technique which we custom built using Amazon Elastic MapReduce to pump data to the Amazon CloudSearch server. This technique enables us to write more index data in parallel.

Step 2) Select how you would like to create the Amazon CloudSearch schema: Here we have chosen Manual setup, since we already have a schema to be migrated to Amazon CloudSearch.

The next step is to add index fields to create your Amazon CloudSearch schema configuration.

Step 3) Adding Amazon CloudSearch Index Fields: Once all the fields have been configured in the schema, click on the Continue button. In the schema file used, we have 100+ fields to be indexed for this particular search domain (a scripted sketch of defining fields is shown below).
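If you are scripting the schema creation instead of clicking through the console, the same fields can be defined through the CloudSearch configuration API. A minimal boto3 sketch, where the field list is only a placeholder subset of our 100+ field schema:

import boto3

cs = boto3.client('cloudsearch', region_name='ap-southeast-1')

# Placeholder subset of the schema; repeat for all 100+ fields
fields = [
    {'IndexFieldName': 'product_title', 'IndexFieldType': 'text'},
    {'IndexFieldName': 'price', 'IndexFieldType': 'double'},
    {'IndexFieldName': 'category', 'IndexFieldType': 'literal'},
]

for field in fields:
    cs.define_index_field(DomainName='bigdatasearch', IndexField=field)

# Ask CloudSearch to re-index so the new fields become searchable
cs.index_documents(DomainName='bigdatasearch')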
Step 4) Review the setup configurations and launch:
We have 100+ index fields, with the scaling option instance type as search.m2.2xlarge and a replication count of 5, in the "bigdatasearch" domain.
Step 5) Wait till the Amazon CloudSearch infrastructure is provisioned for you in the background. It usually takes around 10 minutes; the console will also list any errors encountered while creating the index fields.
Once the Amazon CloudSearch infrastructure is provisioned at the back end, you should notice the "bigdatasearch" domain is "Active". The search and document endpoints are published, and the number of searchable documents is currently "0". There is only 1 CloudSearch index partition (shard) and 5 search.m2.2xlarge instances.
Step 6) Configuring Synonyms: We have 2+ MB of synonyms which need to be configured in the Amazon CloudSearch domain. For this, we used the CloudSearch cli-toolkit to upload the synonyms to CloudSearch.
cs-configure-analysis-scheme -d bigdatasearch --name customanalysisscheme --lang en -e cloudsearch.ap-southeast-1.amazonaws.com --synonyms customsynonyms.txt
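If you prefer the configuration API over the cli-toolkit, the same analysis scheme can be defined with boto3 as well. A hedged sketch; the file name is a placeholder and its contents must be the synonyms JSON document that CloudSearch expects:

import boto3

cs = boto3.client('cloudsearch', region_name='ap-southeast-1')

# customsynonyms.json is a placeholder for your own (potentially multi-MB) synonyms JSON
with open('customsynonyms.json') as f:
    synonyms_json = f.read()

cs.define_analysis_scheme(
    DomainName='bigdatasearch',
    AnalysisScheme={
        'AnalysisSchemeName': 'customanalysisscheme',
        'AnalysisSchemeLanguage': 'en',
        'AnalysisOptions': {'Synonyms': synonyms_json}
    }
)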
Since the volume of index data is huge (300+ GB), we have created a custom data loader built on Amazon Elastic MapReduce to pump the data in parallel into Amazon CloudSearch. Since it is built on Amazon Elastic MapReduce, we can use the same program without modification to upload TBs of index into the search system with hundreds of data loader EMR core/task nodes.
Step 7) Create Amazon Elastic MapReduce Data Loader Cluster Configuration:
Step 8) Configure the Elastic MapReduce (EMR) Capacity: We are using 10 m1.large core node instances for uploading the data from inside the AWS VPC. Depending upon the data size (GB to TB) and upload hours, we can increase the EMR core node capacity and count to speed up the data pump (upload) process.

To know more about how Spot instances can save cost on Amazon EMR, refer to AWS Cost Saving Tip 12: Add Spot Instances with Amazon EMR.

Step 9) Add the custom data loader program JAR to EMR:
We have exported the data from an MS SQL Server as a flat UTF-8 dump file and stored it in Amazon S3. We are giving the 300+ GB dump file as the input for the Amazon EMR CloudSearch data loader program to upload into Amazon CloudSearch in parallel. The bucket configurations of the data loader JAR, input, output and log files are configured on this screen.

Step 10) Configure Amazon CloudSearch Access Policies: We need to open the CloudSearch access policies to accept upload requests from the EMR cluster inside the VPC. Configure the static IPs of all the instances or the IP range of the data loader clients.
Step 11) Run the Amazon Elastic MapReduce data loader job:
Step 12) Analyzing the Amazon EMR data loader job output:
The output of the job can be seen in the AWS EMR job logs. Here are a few details:
  • "Map output records" in the log tells how many records were inserted into Amazon CloudSearch; we can observe that close to 247,681,520 documents (247+ million) were pumped.
  • "Bytes Read" in the output tells the size of the data set the job has read. We can observe 322,387,978,332 bytes, which is equivalent to the 300+ GB of index in Amazon CloudSearch.
  • The entire pumping process took ~30 hours with 10 m1.large core nodes for us. We observed that increasing the number of data loader EMR nodes or their capacity improves the upload speed drastically.
Step 13) Clean up: Reset the replication count to the level of HA needed, ideally 1-2 nodes. Once the job is completed, revert the security access policies in Amazon CloudSearch. Terminate the EMR cluster and clean up any leftover resources.

Step 14) Analyzing the CloudSearch Dashboard :
We observed that it takes some time for CloudSearch to reflect the actual count of the indexed documents.

After pumping the 300+ GB index, you can observe that currently 2 Amazon CloudSearch partitions (shards) are used to distribute 247+ million documents with 100+ index fields. This is a tremendous cost saving compared to the A9-powered Amazon CloudSearch. Amazon CloudSearch has automatically created shards based on the volume of data pumped into the system. This is cool: it reduces the maintenance headache of the infra admins. If the Amazon CloudSearch team can make this partition concept a configurable parameter in the future, it will be useful.
Step 15) Executing sample search queries: We executed some sample product search queries on the "bigdatasearch" domain to check whether everything is fine. A distributed query was fired and results came back sub-second from one of the partitions.
In short, it is cost effective compared to the old A9-powered CloudSearch; automated scaling of replication counts for request scalability and automated scaling of partitions for data scalability relieve the infra admin headaches; and its strong Apache Solr pedigree and the long list of feature additions in the coming months will make it even more interesting.
After working with this service for a few weeks, we feel it is going to become the major search service on AWS in the coming years, giving a tough fight to Apache Solr and Elasticsearch deployments on EC2.
This article was co-authored with Ankit of 8KMiles.

25 Best Practice Tips for architecting your Amazon VPC

In my view, Amazon VPC is one of the most important features introduced by AWS. We have been using AWS since 2008 and Amazon VPC from the day it was introduced, and I strongly feel that customer adoption of the AWS cloud gained real momentum only after the introduction of VPC into the market.
Amazon VPC comes with lots of advantages over the limitations faced in Amazon Classic cloud, like: static private IP addresses, Elastic Network Interfaces (it is possible to bind multiple Elastic Network Interfaces to a single instance), internal Elastic Load Balancers, advanced network access control, the ability to set up a secure bastion host, DHCP options, predictable internal IP ranges, moving NICs and internal IPs between instances, VPN connectivity, heightened security etc. Each of these is an interesting topic on its own and I will be discussing them in detail in future.
Today I am sharing some of our implementation experience from working with hundreds of Amazon VPC deployments as best practice tips for the AWS user community. You can apply the relevant ones to your existing VPC or use these points as part of your migration approach to Amazon VPC.

Practice 1) Get your Amazon VPC combination right: Select the right Amazon VPC architecture first. You need to decide the right Amazon VPC & VPN setup combination based on your current and future requirements. It is tough to modify/redesign an Amazon VPC at a later stage, so it is better to design it taking into consideration your network and expansion needs for the next ~2 years. Currently different types of Amazon VPC setups are available, like: public-facing VPC, public and private subnet VPC, Amazon VPC with public and private subnets and hardware VPN access, Amazon VPC with private subnets and hardware VPN access, software-based VPN access etc. Choose the one which fits where you feel you will be in the next 1-2 years.

Practice 2) Choose your CIDR Blocks: While designing your Amazon VPC, the CIDR block should be chosen considering the number of IP addresses needed and whether you are going to establish connectivity with your data center. The allowed block size is between a /28 netmask and a /16 netmask, so an Amazon VPC can contain from 16 to 65,536 IP addresses. Currently an Amazon VPC's CIDR cannot be modified once created, so it is usually best to choose a CIDR block with more IP addresses. Also, when you design the Amazon VPC architecture to communicate with your on-premise/data center, ensure the CIDR range used in the Amazon VPC does not overlap or conflict with the CIDR blocks in your on-premise/data center. Note: if you use the same CIDR blocks while configuring the customer gateway, they will conflict.
E.g., if your VPC CIDR block is 10.0.0.0/16 and you have a 10.0.25.0/24 subnet in a data center, communication from instances in the VPC to the data center will not happen, since that subnet is part of the VPC CIDR. In order to avoid these consequences it is good to keep the IP ranges in different blocks; for example, the Amazon VPC in 10.0.0.0/16 and the data center in the 172.16.0.0/24 series. A quick way to check this programmatically is shown below.
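A small sketch using Python's ipaddress module to validate CIDR choices before you create the VPC; the blocks are the ones from the example above:

import ipaddress

vpc_cidr = ipaddress.ip_network('10.0.0.0/16')
candidates = [ipaddress.ip_network('10.0.25.0/24'),    # data center subnet that conflicts
              ipaddress.ip_network('172.16.0.0/24')]   # data center subnet that is safe

for block in candidates:
    if vpc_cidr.overlaps(block):
        print(block, 'overlaps the VPC CIDR - routing to the data center will break')
    else:
        print(block, 'does not overlap the VPC CIDR - safe to use')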

Practice 3) Isolate according to your use case: Create separate Amazon VPCs for the development, staging and production environments, or create one Amazon VPC with separate subnets/security/isolated network groups for production, staging and development. We have observed 60% of customers preferring the second choice. Choose the right one according to your use case.

Practice 4) Securing Amazon VPC: If you are running a mission-critical workload demanding complex security, you can secure the Amazon VPC like your on-premise data center, or sometimes even more. Some tips to secure your VPC are:

  • Secure your Amazon VPC using firewall virtual appliances and web application firewalls available from the AWS Marketplace. You can use Check Point, Sophos etc. for this.
  • You can configure Intrusion Prevention or Intrusion Detection virtual appliances to secure the protocols and take preventive/corrective actions in your VPC.
  • Configure VM encryption tools which encrypt your root and additional EBS volumes. The key can be stored inside AWS or in your data center outside Amazon Web Services, depending on your compliance needs. http://harish11g.blogspot.in/2013/04/understanding-Amazon-Elastic-Block-Store-Securing-EBS-TrendMicro-SecureCloud.html
  • Configure privileged identity access management solutions in your Amazon VPC to monitor and audit the access of the administrators of your VPC.
  • Enable CloudTrail to audit ACL policy changes in the VPC environments. Enable CloudTrail: http://harish11g.blogspot.in/2014/01/Integrating-AWS-CloudTrail-with-Splunk-for-managed-services-monitoring-audit-compliance.html
  • Apply antivirus for cleansing specific EC2 instances inside the VPC. Trend Micro has a very good product for this.
  • Configure site-to-site VPN for securely transferring information between Amazon VPCs in different regions or between an Amazon VPC and your on-premise data center.
  • Follow the security group and network ACL best practices listed below.

Practice 5) Understand Amazon VPC Limits: Always design the VPC subnets with future expansion in mind. Also understand the Amazon VPC limits before using the service. AWS has various limits on VPC components, like rules per security group, number of route tables, number of subnets etc. Some of them may be increased by raising a request with the Amazon support team, while a few cannot be increased. Ensure these limits do not affect your overall design. Refer URL:
http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_Limits.html

Practice 6) IAM your Amazon VPC: When you assign people to maintain your Amazon VPC, you can create Amazon IAM accounts with fine-grained permissions, or use sophisticated privileged identity management solutions available on the AWS Marketplace, to control access to your VPC.

Practice 7) Disaster Recovery or Geo Distributed Amazon VPC Setup: When you are designing a disaster recovery setup using VPC, or expanding to another AWS region, you can follow these simple rules. Create your production site VPC CIDR as 10.0.0.0/16 and your DR region VPC CIDR as 172.16.0.0/16. Make sure they do not conflict with the on-premise subnet CIDR blocks in case both need to be integrated with the on-premise DC as well. After the CIDR blocks are created, set up a VPN tunnel between the regions and to your on-premise DC. This will help you replicate your data using private IPs.

Practice 8) Use security groups and Network ACLs wisely: It is advisable to use security groups over network ACLs inside Amazon VPC wherever applicable for better control. Security groups apply at the EC2 instance level, while network ACLs apply at the subnet level. Security groups are mostly used for whitelisting; to blacklist IPs, one can use network ACLs.

Practice 9) Tier your Security Groups: Create different security groups for different tiers of your infrastructure architecture inside your VPC. If you have web, app and DB tiers, create a different security group for each of them. Creating tier-wise security groups increases the infrastructure security inside Amazon VPC. EC2 instances in each tier can then talk only on application-specified ports and not on all ports. If you create Amazon VPC security groups for each and every tier/service separately, it will be easier to open a port to a particular service. Don't use the same security group for multiple tiers of instances; this is a bad practice.
Example: open ports to a security group instead of to IP ranges. People have a tendency to open port 8080 to the 10.10.0.0/24 (web layer) range. Instead of that, open port 8080 to the web-security-group. This makes sure only web security group instances will be able to connect on port 8080. If someone launches a NAT instance with NAT-Security-Group in 10.10.0.0/24, it won't be able to connect on port 8080, as access is allowed only from the web security group. A sketch of such a rule is shown below.
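A minimal boto3 sketch of such a rule, referencing the web tier security group instead of an IP range; the group IDs are placeholders:

import boto3

ec2 = boto3.client('ec2')

ec2.authorize_security_group_ingress(
    GroupId='sg-apptier0000000',      # app tier security group (placeholder ID)
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 8080,
        'ToPort': 8080,
        # Allow 8080 only from members of the web tier security group,
        # not from the whole 10.10.0.0/24 range
        'UserIdGroupPairs': [{'GroupId': 'sg-webtier0000000'}]
    }]
)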
Practice 10) Standardize your Security Group Naming conventions: Following a security group naming convention inside Amazon VPC will improve operations/management for large-scale deployments inside the VPC. It also avoids manual errors and leaks, and saves cost and time overall.
For example, simple ones like Prod_DMZ_Web_SG or Dev_MGMT_Utility_SG, or complex coded ones for large-scale deployments like:
USVA5LXWEBP001- US East Virginia AZ 5 Linux Web Server Production 001
This helps in better management of security groups.
Practice 11) ELB on Amazon VPC: When using Amazon ELB for web applications, put all other EC2 instances (tiers like app, cache, DB, BG etc.) in private subnets as much as possible. Unless there is a specific requirement where instances need outside-world access and an EIP attached, put all instances in private subnets only. As a secure practice in an Amazon VPC environment, only ELBs should be provisioned in the public subnet.
Practice 12) Control your outgoing traffic in Amazon VPC: If you are looking for better security, for the traffic going to the internet gateway use software like Squid or Sophos to restrict the ports, URLs, domains etc., so that all traffic goes through a controlled proxy tier and also gets logged. Using these proxy/security systems we can also restrict unwanted ports; by doing so, if there is any security compromise of the application running inside the Amazon VPC, it can be detected by auditing the restricted connections captured in the logs. This helps as a corrective security measure.
Practice 13) Plan your NAT Instance Type: Whenever your application EC2 instances residing inside the private subnets of an Amazon VPC make web service/HTTP/S3/SQS calls, they go through the NAT instance. If you have designed Auto Scaling for your application tier and there are chances that tens of app EC2 instances will make lots of web calls concurrently, the NAT instance will become a performance bottleneck at this juncture. Size your NAT instance capacity according to application needs to avoid performance bottlenecks. Using NAT instances provides the advantage of saving the cost of Elastic IPs and provides extra security by not exposing the instances to the outside world for internet access.
Practice 14) Spread your NAT instances across multiple subnets: What if you have hundreds of EC2 instances inside your Amazon VPC and they are making lots of heavy web service/HTTP calls concurrently? A single NAT instance, even of the largest EC2 size, sometimes cannot handle that bandwidth and may become a performance bottleneck. In such scenarios, span your EC2 instances across multiple subnets and create a NAT for each subnet. This way you can spread your outgoing bandwidth and improve performance in your VPC-based deployments.
Practice 15) Use EIP when needed: At times you may need to keep a part of your application services in the public subnet for external communication. It is a recommended practice to associate them with Amazon Elastic IPs and whitelist these IP addresses in the target services used by them.
Practice 16) NAT instance practices: If needed, enable multi-factor authentication on the NAT instance. SSH and RDP ports should be open only to specific source and destination IPs, not to the global network (0.0.0.0/0), and only to static exit IPs, not dynamic exit IPs.
Practice 17) Plan your Tunnel between On-Premise DC to Amazon VPC: 
Select the right mechanism to connect your on-premise DC to the Amazon VPC. This will help you connect to the EC2 instances via private IPs in a secure manner.
  • Option 1: Secure IPSec tunnel to connect a corporate network with Amazon VPC (http://aws.amazon.com/articles/8800869755706543)
  • Option 2 : Secure communication between sites using the AWS VPN CloudHub (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPN_CloudHub.html)
  • Option 3: Use AWS Direct Connect between the Amazon VPC and on-premise when you have lots of data to be transferred with reduced latency, or when you have spread your mission-critical workloads across cloud and on-premise. Example: Oracle RAC in your DC and the web/app tier in your Amazon VPC. Contact us if you need help setting up Direct Connect between an Amazon VPC and your DC.
Practice 18) Always span your Amazon VPC across multiple subnets in multiple Availability Zones inside a region. This helps in properly architecting high availability inside your Amazon VPC. Example classification of the VPC subnets: web tier subnets 10.0.10.0/24 in AZ1 and 10.0.11.0/24 in AZ2, application tier subnets 10.0.12.0/24 and 10.0.13.0/24, DB tier subnets 10.0.14.0/24 and 10.0.15.0/24, cache tier subnets 10.0.16.0/24 and 10.0.17.0/24, etc. A sketch of scripting this layout is shown below.
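A hedged boto3 sketch of carving the tier subnets above across two Availability Zones; the VPC ID and AZ names are placeholders:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
vpc_id = 'vpc-00000000'   # placeholder

tier_subnets = [
    ('10.0.10.0/24', 'us-east-1a'), ('10.0.11.0/24', 'us-east-1b'),  # Web tier
    ('10.0.12.0/24', 'us-east-1a'), ('10.0.13.0/24', 'us-east-1b'),  # App tier
    ('10.0.14.0/24', 'us-east-1a'), ('10.0.15.0/24', 'us-east-1b'),  # DB tier
    ('10.0.16.0/24', 'us-east-1a'), ('10.0.17.0/24', 'us-east-1b'),  # Cache tier
]

for cidr, az in tier_subnets:
    ec2.create_subnet(VpcId=vpc_id, CidrBlock=cidr, AvailabilityZone=az)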
Practice 19) A good security practice is to ensure that only the public subnets have a route table which carries a route to the internet gateway. Apply this wherever applicable.
Practice 20) Keep your data closer: For small-scale deployments in AWS where cost is more critical than high availability, it is better to keep the web/app tiers in the same Availability Zone as ElastiCache, RDS etc. inside your Amazon VPC. Design your subnets accordingly to suit this. This is not a recommended architecture for applications demanding high availability.
Practice 21) Allow and deny network ACLs: Create internet outbound allow and deny network ACLs in your VPC.
First network ACL: allow all HTTP and HTTPS outbound traffic on the public internet-facing subnet.
Second network ACL: deny all HTTP/HTTPS traffic. Allow all traffic to the Squid proxy server or other virtual appliance.
Practice 22) Restricting network ACLs: Block all inbound and outbound ports and only allow the application request ports (see the sketch after this practice). Network ACLs are stateless traffic filters that apply to all traffic inbound to or outbound from a subnet within the VPC. AWS recommended outbound rules: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_NACLs.html
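A hedged boto3 sketch of the restrictive approach: since a network ACL denies anything without an explicit allow, you only add entries for the application ports. The ACL ID and ports are placeholders:

import boto3

ec2 = boto3.client('ec2')
acl_id = 'acl-00000000'   # placeholder

# Allow inbound HTTPS into the subnet
ec2.create_network_acl_entry(
    NetworkAclId=acl_id, RuleNumber=100, Protocol='6',   # protocol 6 = TCP
    RuleAction='allow', Egress=False,
    CidrBlock='0.0.0.0/0', PortRange={'From': 443, 'To': 443}
)

# Because NACLs are stateless, allow outbound ephemeral ports for the return traffic
ec2.create_network_acl_entry(
    NetworkAclId=acl_id, RuleNumber=100, Protocol='6',
    RuleAction='allow', Egress=True,
    CidrBlock='0.0.0.0/0', PortRange={'From': 1024, 'To': 65535}
)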
Practice 23) Create route tables only when needed, and use the subnet associations option to map subnets to the route tables in your Amazon VPC.
Practice 24) Use Amazon VPC Peering (new): Amazon Web Services has introduced the VPC peering feature, which is quite a useful one. An AWS VPC peering connection is a networking connection between two Amazon VPCs that enables you to route traffic between them using private IP addresses. Currently both VPCs must be in the same AWS region; instances in either VPC can communicate with each other as if they are within the same network. AWS uses the existing infrastructure of a VPC to create a VPC peering connection; it is neither a gateway nor a VPN connection, and does not rely on a separate piece of physical hardware, which essentially means there is no single point of failure for communication and no bandwidth bottleneck. A sketch of setting up a peering connection follows the scenarios below.

We have seen it be useful in the following scenarios:
  1. Large enterprises usually run multiple Amazon VPCs in a single region, and some of their applications are so interconnected that they may need to access them privately and securely inside AWS. For example, Active Directory, Exchange and common business services will usually be interconnected.
  2. Large enterprises have different AWS accounts for different business units/teams/departments; at times, systems deployed by some business units in different AWS accounts need to be shared, or need to consume a shared resource privately. Example: CRM, HRMS, file sharing etc. can be internal and shared. In such scenarios VPC peering is very useful.
  3. Customers can peer their VPC with their core suppliers to have tightly integrated access to their systems.
  4. Companies offering infra/application managed services on AWS can now safely peer into customer Amazon VPCs and provide monitoring and management of AWS resources.
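A minimal boto3 sketch of requesting and accepting a peering connection and adding the return route; all IDs and the peer account number are placeholders:

import boto3

ec2 = boto3.client('ec2')

response = ec2.create_vpc_peering_connection(
    VpcId='vpc-0aaaaaaa',            # requester VPC (placeholder)
    PeerVpcId='vpc-0bbbbbbb',        # accepter VPC, can live in another AWS account
    PeerOwnerId='111122223333'       # placeholder account ID
)
pcx_id = response['VpcPeeringConnection']['VpcPeeringConnectionId']

# The owner of the peer VPC accepts the request
ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)

# Each side still needs a route to the other VPC's CIDR via the peering connection
ec2.create_route(RouteTableId='rtb-00000000',
                 DestinationCidrBlock='172.16.0.0/16',
                 VpcPeeringConnectionId=pcx_id)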

Practice 25) Use Amazon VPC: It is highly recommended that you deploy all your new workloads inside Amazon VPC rather than Amazon Classic cloud. I also strongly recommend migrating your existing workloads from Amazon Classic cloud to Amazon VPC, in phases or in one shot, whichever is feasible. In addition to the benefits of VPC detailed at the start of the article, AWS has started introducing lots of features which are compatible only inside a VPC, and in the AWS Marketplace as well there are lots of products which are compatible only with Amazon VPC. So make sure you leverage this strength of VPC. If you require any help with this migration please contact me.

Readers, feel free to suggest more; I will link relevant ones in this article.

Load Testing tool comparison – JMeter on its own vs JMeter & BlazeMeter together

Load testing is an important aspect of the web application life cycle on the Amazon cloud. Some of our customers ask us to generate 50,000+ RPS to load test the scalability of their application deployed on the Amazon cloud. Whenever we helped such customers migrate their applications to the Amazon cloud for scalability, the load testing phase itself became a pain. Setting up the load testing infrastructure, writing automation around it, and managing, maintaining and monitoring the load test infrastructure is a headache. Our load testers and infrastructure teams were spending considerable time and effort on the above, instead of focusing only on load testing. We usually work with a variety of tools, from Grinder, JMeter and HP LoadRunner to custom-engineered load testing tools, during the load testing phase. Some time back, our team started playing around with a SaaS-based load testing tool called BlazeMeter. In this article I am going to share our experience in the form of a comparison between BlazeMeter and JMeter, and why BlazeMeter has a bright future.
BlazeMeter is a SaaS-based, highly scalable load testing tool that handles up to 300,000+ concurrent users. Its load test infrastructure is spread across major AWS regions. Since most of us have been using JMeter for years, the 100% compatibility it provides with existing JMeter scripts is a good feature. BlazeMeter also provides a Chrome extension which can record browser actions and convert them to a .jmx file.

10 Things I like about BlazeMeter

Point 1) A load test becomes effective only when the load comes from different IP addresses, similar to a real-world scenario, and not from a single source IP. When load from multiple virtual users is generated from the same IP, the router as well as the server often tries to cache information and optimize throughput. Hence, by using multiple IP addresses on the load generator host, the EC2 server gets the illusion of receiving requests from multiple source IPs. It is also better that load is generated from multiple IPs so that Amazon ELB distributes it evenly. Refer URL. BlazeMeter has the capability to generate load from multiple IPs, which is very important when load testing cloud applications.
Point 2) Customizing the Network Emulation: Usually online applications will be accessed from multiple devices like PCs, laptops and mobiles. These devices use multiple network types such as 3G, broadband etc. Also, at times our online application will be accessed from locations which have poor network bandwidth. Both these parameters play an important role in capacity planning and load testing. We can choose the bandwidth and network type emulation while doing the load test using BlazeMeter. For example, we can configure the network type such as unlimited internet, 3G, cable, WiFi etc., and the bandwidth download limit per device can also be set.
Point 3) Controlling the Throughput: Target throughput is a parameter of Apache JMeter that can be used to achieve a required throughput value for the application. A server's performance need not always match the target throughput value mentioned in JMeter; it could provide more throughput or less. The target throughput parameter can be controlled at run time in BlazeMeter. Live server monitoring can help us identify whether our servers are performing well at, say, 5000 hits/sec, and change the throughput value at run time to a higher or lower value based on the server's performance.
Point 4) Controlling the Agents: Apache JMeter works on a master-agent architecture where the master controls multiple agents generating the load. The number of agents usually has to be decided before starting the test when using JMeter-based load testing on the Amazon cloud. The option to change this dynamically is a very good feature to have while load testing a cloud application requiring thousands of requests per second. BlazeMeter enables us to add or remove agent instances while a test is running. Any instance can be marked as master or slave (agent) while the test is running.
Point 5) Controlling No. of Simulated Users on Slaves (Agents): A load test strategy is mainly determined by parameters like the number of concurrent users, ramp-up time, number of test engines, test iterations and test duration. Apache JMeter allows us to manually configure these values before the test is started. New EC2 instances have to be provisioned for the agents, and the IP addresses (usually Elastic IPs) of the slaves/agents have to be manually added to the master. The entire setup has to be maintained, managed and monitored during the test cycles. This is OK for a load testing environment with a few load test agents and low RPS; imagine an environment where you have to generate thousands of RPS with 50+ agents running. This process of managing the EC2 load test infrastructure becomes tedious overall for the load testing teams. In BlazeMeter, once the number of concurrent users is given, the number of test engines, number of threads and engine capacity are chosen automatically. This can be made semi-automatic, where the number of engines and number of threads can also be selected by the user and only the engine capacity is chosen by BlazeMeter. Since it is a managed load test infrastructure, the load testers can concentrate on the testing and not on managing 100's of EC2 load agents.
Point 6) Integrated Monitoring:
BlazeMeter offers live monitoring of essential parameters of test servers when the test is running which enables us to decide on the number & instance type for the test. In the conventional Apache JMeter load test setup in Amazon EC2 we have to observe the Key parameters using AWS Cloudwatch.
BlazeMeter provides AWS CloudWatch integration. An account with IAM access has to be created, and the AWS Access Key and Secret Key values have to be configured so that the metrics are available in BlazeMeter's dashboard. This feature helps us understand how the assets in the cloud are reacting to our load tests and helps us tune the infrastructure accordingly.
While performing load testing, it is important not only to monitor your web servers and databases but also the agents from which the load is generated. The New Relic plugin gives us both front-end and back-end KPIs.
BlazeMeter's front-end KPIs provide insight into how many users are actually trying to access your website, mobile site or mobile apps.
BlazeMeter's back-end KPIs show how many users are getting through to your applications.
Point 7) BlazeMeter allows us to have a different CSV file per load test engine. Though this is possible in Apache JMeter, it has to be done manually by copying the files onto the JMeter agent EC2 instances with the same filename, since the agents refer to the master's properties. BlazeMeter allows us to parameterize even the filenames and have different CSV files in each engine, without the trouble of copying files onto specific EC2 instances; it holds the files in a common repository from which they are distributed to each agent.
Point 8) Run the load test using older versions of JMeter scripts: Old scripts are reusable with this feature of BlazeMeter, which lets us run the test using any version of Apache JMeter right from version 2.3.2 to 2.10. Complex scripts prepared months or years ago can still be used and need not be redone. This saves effort and cost.
Point 9) Schedule the Test & Stay Relaxed: BlazeMeter, as well as JMeter, lets you schedule your test duration and test time so that we can run longevity tests at any time of the day. Even weekly scheduling is possible in BlazeMeter, which is an added advantage, though it is not widely used.
Point 10) Interesting Plug-ins provided by Blazemeter :
Integration with Google Analytics: At the time of scripting, it is enough if we select the Google Analytics Option & provide account details of Google Analytics. BlazeMeter obtains the last 12 months of data and creates a test with 5 most visited pages and sets up the number of concurrent users based on that record.
Integration with WordPress: BlazeMeter provides integration with WordPress where WordPress users can test their App by using the BlazeMeter plug-in without any scripting.
Integration with Drupal & Jenkins: Plugins are available to load test Drupal & Jenkins servers as well.

This post was co-authored with Harine of 8KMiles.

Architecting Highly Available ElastiCache Redis replication cluster in AWS VPC

In this post let's explore how to architect and create a highly available and scalable Redis cache cluster for your web application in AWS VPC. The following is the architecture in which the ElastiCache Redis cluster is assembled:

  • Redis Cache Cluster inside Amazon VPC for better control and security
  • Master Redis Node 1 will be created in AZ-1 of US-West
  • Redis Read Replica Node 2 will be created in AZ-2 of US-West
  • Redis Read Replica Node 3 will be created in AZ-3 of US-West

You can position all 3 Redis nodes in different Availability Zones for achieving high availability, or you can position the master + RR 1 in AZ1 and RR 2 in AZ2. The latter reduces the inter-AZ latency and might give better performance for heavily used clusters.
Step 1: Creating Cache Subnet groups:
To create a cache subnet group, navigate to the ElastiCache dashboard, select Cache Subnet Groups and then click "Create Cache Subnet Group". Add the subnet ID and the Availability Zone you need to use for the ElastiCache cluster.

 

We have created an Amazon VPC spreading across 3 Availability Zones. In this post we are going to place the Redis master and 2 Redis replica slaves in these 3 Availability Zones. Since Redis will be accessed most of the time by your application tier, it is better to place the nodes in the private subnets of your VPC.
Step 2: Creating Redis Cache Cluster: 
To create the cache cluster, navigate to the ElastiCache dashboard, select Launch Cache Cluster and provide the necessary details. We are launching it inside Amazon VPC, so we have to select the cache subnet group.
Note: It is mandatory to create the cache subnet group before launch if you need the ElastiCache Redis cluster in Amazon VPC. A boto3 sketch of scripting this setup is shown below.
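If you prefer to script this instead of using the console, here is a hedged boto3 sketch. Note that the current API lets you create the subnet group and a primary-plus-replicas replication group in one pass, whereas the console flow in this post creates the primary cache cluster first; the subnet IDs and names below are placeholders:

import boto3

ec = boto3.client('elasticache', region_name='us-west-2')

ec.create_cache_subnet_group(
    CacheSubnetGroupName='redis-private-subnets',
    CacheSubnetGroupDescription='Private subnets across 3 AZs',
    SubnetIds=['subnet-aaaa1111', 'subnet-bbbb2222', 'subnet-cccc3333']
)

ec.create_replication_group(
    ReplicationGroupId='redis-replication',
    ReplicationGroupDescription='Redis master with 2 read replicas',
    Engine='redis',
    CacheNodeType='cache.m1.small',
    NumCacheClusters=3,                  # 1 primary + 2 read replicas
    CacheSubnetGroupName='redis-private-subnets'
)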

 

For test purposes I have used an m1.small cache node for Redis. Since this is a fresh Redis installation, I have not specified an S3 bucket from which a persistent Redis snapshot would be loaded as input. On successful creation of the cache cluster you can see the details in the dashboard.
Step 3: Replication Group Creation:
To create the replication group, select the Replication Groups option from the dashboard and then select "Create Replication Group".

Select the master Redis node “redisinsidevpc” created previously as the primary cluster id of the Cache cluster.  Give the Replication group id and description as illustrated below.

Note: Replication Group should be created only after the Primary Cache Cluster node is UP and running, else you will get the error as shown below.

On successful creation of the replication group you can see the following details. You can observe from the screenshot below that there is only one primary node in us-west-2a and zero Redis read replicas are attached to it.

Step 4: Adding Read Replica Nodes:
When you select the replication group, you can see the option to add a Redis read replica. We are adding 2 Redis read replicas named Redis-RR1 (in us-west-2b) and Redis-RR2 (in us-west-2c). Both read replicas point to the master node "redisinsidevpc". Currently we can add up to 5 read replica nodes for a Redis master node. This is more than enough to handle thousands of messages per second; if you combine it with Redis pipelining, handling 100K messages per second from a node is like a cake walk.
Adding Read Replica 1 in Us-West -2B

Adding Read Replica 2 in US-West-2c

On successful creation you can see the following details of the replication group in the dashboard. Now you can see there are 3 Redis nodes listed, with the number of read replicas as 2. Placing the read replicas and master node in multiple AZs increases the high availability and protects you from node and AZ level failures. In our sample tests, inter-AZ replication deployments had <2 seconds of replication lag for massive writes on the master, and <1 second of replication lag between master and slave for same-AZ deployments. We pumped ~100K messages per second for a few minutes on an m1.large Redis instance cluster.
In the event you need additional read scalability, I recommend adding more read replica slaves to the master.
In your application tier you need to use the primary endpoint "redis-replication.qcdze2.0001.usw2.cache.amazon.aws.com:6379" shown below to connect to Redis.
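For example, with redis-py the application would simply point at that primary endpoint (a minimal sketch; the endpoint is the one shown above, and redis-py is an assumption about your client library):

import redis

r = redis.Redis(host='redis-replication.qcdze2.0001.usw2.cache.amazon.aws.com',
                port=6379)

r.set('session:1001', 'user-data')   # writes always go to the primary endpoint
print(r.get('session:1001'))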

If you need to delete/reboot/modify the nodes, you can do it through the options available here.

Step 5: Promoting the Read replica:

You can also promote any node as the Primary cluster using the Promote/Demote option. There will be only one Primary Node.
Note: This step is not part of the cluster creation process.

This promotion has to be carried out with caution and proper understanding for maintaining data consistency.

This post was co-authored with Senthil of 8KMiles.

Billion Messages – Art of Architecting scalable ElastiCache Redis tier

Whenever we are designing highly scalable architectures on AWS running thousands of application servers and supporting millions of requests, the usage of NoSQL solutions has become an inevitable part. One such solution we have been using for years on AWS is Redis. We love Redis.
AWS introduced ElastiCache Redis in 2013 and we started using it since it eased the management and operational efforts. In this article I am going to share my experience designing large-scale Redis tiers supporting billions of messages per day on AWS: a step-by-step guide on how to deploy the same, the implications you face at scale, and the best practices to be adopted while designing sharded + replicated Redis tiers.

Since we need to support billions of message requests per day and it was growing:

  • the ElastiCache Redis tier was designed with Partitions( shards) to scale out as the customer grows
  • the ElastiCache Redis tier was designed with Replica Slaves for HA and read scaling as the read volumes grow

When your application is growing at a rapid pace and lots of data is created every day, you cannot keep increasing (scaling up) the size of the ElastiCache node. At one point you will hit the maximum memory capacity of your EC2 instance and you will be forced to partition. Partitioning is the process of splitting your key-value data across multiple ElastiCache Redis instances, so that every instance contains only a subset of your key-value pairs. It allows for much larger ElastiCache Redis data stores, using the sum of the memory of many ElastiCache Redis nodes. It also allows you to scale the computational power across multiple cores and multiple EC2 instances, and the network bandwidth across multiple EC2 network adapters. There are two widely used partition/shard implementation techniques available for an ElastiCache Redis tier:
Technique 1) Client side partitioning means that the Redis clients directly select the right ElastiCache Redis node where a given key should be written or read. Many Redis clients implement client side partitioning; choose the right one wisely.
Technique 2) Proxy assisted partitioning means that your clients send requests to a proxy that is able to speak the Redis protocol, which in turn sends requests directly to the right ElastiCache Redis instance. The proxy makes sure to forward our request to the right Redis instance according to the configured partitioning schema. Currently the most widely used proxy-assisted partitioning tool is Twemproxy, written by Manju Raj of Twitter (GitHub link: https://github.com/twitter/twemproxy). Twemproxy is a proxy developed at Twitter for the Memcached ASCII and Redis protocols. Twemproxy supports automatic partitioning among multiple Redis instances and is currently the suggested way to handle partitioning with Redis.

In this article we are going to explore in detail about Proxy assisted partitioning technique for highly scalable and available Redis tier.

Welcome to Twemproxy

Twemproxy (nutcracker) is a fast, single-threaded proxy supporting the Memcached ASCII protocol and, more recently, the Redis protocol.

Installing Twemproxy:

Download the Twemproxy package.
wget http://twemproxy.googlecode.com/files/nutcracker-0.3.0.tar.gz
tar -xf nutcracker-0.3.0.tar.gz
cd nutcracker-0.3.0
./configure
make
make install

Configuration:

Twemproxy (nutcracker) can be configured through a YAML file specified by the -c or --conf-file command-line argument on process start. The configuration file is used to specify the server pools and the servers within each pool that nutcracker manages. The configuration file supports the following keys:

• listen: The listening address and port (name:port or ip:port) for this server pool.
• hash: The name of the hash function.
• hash_tag: A two character string that specifies the part of the key used for hashing. Eg “{}” or “$$”. Hash tag enable mapping different keys to the same server as long as the part of the key within the tag is the same.
• distribution: The key distribution mode.
• timeout: The timeout value in msec that we wait for to establish a connection to the server or receive a response from a server. By default, we wait indefinitely.
• backlog: The TCP backlog argument. Defaults to 512.
• preconnect: A boolean value that controls if nutcracker should preconnect to all the servers in this pool on process start. Defaults to false.
• redis: A boolean value that controls if a server pool speaks redis or memcached protocol. Defaults to false.
• server_connections: The maximum number of connections that can be opened to each server. By default, we open at most 1 server connection.
• auto_eject_hosts: A boolean value that controls if server should be ejected temporarily when it fails consecutively server_failure_limit times. See liveness recommendations for information. Defaults to false.
• server_retry_timeout: The timeout value in msec to wait for before retrying on a temporarily ejected server, when auto_eject_host is set to true. Defaults to 30000 msec.
• server_failure_limit: The number of consecutive failures on a server that would lead to it being temporarily ejected when auto_eject_host is set to true. Defaults to 2.
• servers: A list of server address, port and weight (name:port:weight or ip:port:weight) for this server pool.

For More details Refer: https://github.com/twitter/twemproxy

Running and Accessing Twemproxy 

To start the proxy, just run the "nutcracker" command with the configuration file path specified, or let it use the default path (conf/nutcracker.yml).
Based on the configuration, twemproxy will be running and listening. Configure your application to point to this address and port instead of the Redis nodes directly.

Twemproxy Deployment models:

We usually deploy Twemproxy in one of the following models in AWS:

Model 1: Twemproxy as a separate proxy tier: In this model, Twemproxies are deployed on separate EC2 instances and the application tier is configured to point to them. The Twemproxy tier in turn maintains the mappings to the ElastiCache Redis nodes. It is better to use instances with very good I/O bandwidth for the twemproxy tier in AWS. In case you feel the instance CPU is underutilized, you can launch multiple Twemproxy processes inside the same single EC2 instance as well.

Though the above model looks clean and efficient, there are optimizations that can be applied to this architecture:
What happens when twemproxy01 fails? How will the application server instances know about it?
Why should I pay extra for twemproxy EC2 instances? Can this be minimized?

Model 2 : Twemproxy bundled with application tier EC2’s: 

In this model, twemproxies are bundled in the same box as the application server EC2 itself. Since two twemproxies are not aware of each other's existence, it is easy to architect this model even with the application tier in Auto Scaling mode. Every application server talks to the local twemproxy deployed in the same box; this saves cost and avoids the complexity of managing an additional tier as well.

Reference ElastiCache Redis + Twemproxy  deployment:

(This is a Reference deployment, the same can be scaled out to hundreds depending upon the need. It is a Redis Partitioned + replicated setup )
1. Two ElastiCache Redis nodes in AWS (twem01 and twem02)
2. Replication group for each ElastiCache redis nodes (twem01-rg and twem02-rg with one Read Replica each)
3. Two twemproxy servers running in separate EC2. (twemproxy01 and twemproxy02)
Once the above setup is done please note down the endpoints. We will be using the Replication group endpoint as the ElastiCache Redis endpoint for the twemproxy.

ElastiCache Redis Endpoints:

twem01-twem01.qcdze2.0001.usw2.cache.amazonaws.com:6379
twem02-twem02.qcdze2.0001.usw2.cache.amazonaws.com:6379
ElastiCache Redis Replication endpoints:

twem01-rg.qcdze2.ng.0001.usw2.cache.amazonaws.com:6379
twem02-rg.qcdze2.ng.0001.usw2.cache.amazonaws.com:6379

To test the Twemproxy we pumped following keys:
Pump KV data through the Twemproxy01 (1-2000 keys)
Pump KV data through the Twemproxy02(2001-4000 keys).

Configuration:
beta:
  listen: 127.0.0.1:22122
  hash: fnv1a_64
  hash_tag: "{}"
  distribution: ketama   # consistent hashing
  auto_eject_hosts: false
  timeout: 5000
  redis: true
  servers:
   - twem01-rg.qcdze2.ng.0001.usw2.cache.amazonaws.com:6379:1 server1
   - twem02-rg.qcdze2.ng.0001.usw2.cache.amazonaws.com:6379:1 server2

Test 1: Testing key accessibility. Testing the "GET" operation across both the Twemproxy instances for a few sample keys.

Fetch 4 keys spread across the 4000 KV data from the Twemproxy01 EC2 instance:
[root@twemproxy01 redish]# src/redis-cli -h 127.0.0.1 -p 22122
redis 127.0.0.1:22122> get 1000
"1000-data"
redis 127.0.0.1:22122> get 2000
"2000-data"
redis 127.0.0.1:22122> get 3000
"3000-data"
redis 127.0.0.1:22122> get 4000
"4000-data"
Fetch 4 keys spread across the 4000 KV data from the Twemproxy02 EC2 instance:
[root@twemproxy02 redish]# src/redis-cli -h 127.0.0.1 -p 22122
redis 127.0.0.1:22122> get 1000
"1000-data"
redis 127.0.0.1:22122> get 2000
"2000-data"
redis 127.0.0.1:22122> get 3000
"3000-data"
redis 127.0.0.1:22122> get 4000
"4000-data"

From the above test it is evident that all 4000 KV pairs inserted through both Twemproxies are accessible from both Twemproxies (testing a sample), even though they are not aware of each other. This is because the same hashing and key-mapping translation is done at the Twemproxy level.

Test 2: Testing the ElastiCache Redis Availability and Fail over mechanism:

We are going to promote the twem01-rg replication group read replica to be the Primary Redis Node. After promotion we are going to test:

 

  1. Whether the Twemproxy is able to recognize the newly promoted master
  2. Whether the sample KV data is safely replicated and still accessible , to ensure failover is successful.

To promote an ElastiCache Redis slave, just click the Promote action and confirm, or automate it using the API. During the promotion of the read replica to master we observed that the transition happens very quickly and there are no timeouts, but the response time for queries is about 4-5 seconds for about 3-4 minutes during the switchover. In the Twemproxy configuration we can set the timeout value; it needs to be set appropriately so that no connections are refused during the switchover. For the sample test we set it to 5000.

Repeat Test 1:

[root@twemproxy01 redish]# src/redis-cli -h 127.0.0.1 -p 22122
redis 127.0.0.1:22122> get 1000
"1000-data"
redis 127.0.0.1:22122> get 2000
"2000-data"
redis 127.0.0.1:22122> get 3000
"3000-data"
redis 127.0.0.1:22122> get 4000
"4000-data"
Fetch 4 keys spread across the 4000 KV data from the Twemproxy02 EC2 instance:
[root@twemproxy02 redish]# src/redis-cli -h 127.0.0.1 -p 22122
redis 127.0.0.1:22122> get 1000
"1000-data"
redis 127.0.0.1:22122> get 2000
"2000-data"
redis 127.0.0.1:22122> get 3000
"3000-data"
redis 127.0.0.1:22122> get 4000
"4000-data"

From the above test it is evident that all 4000 KV pairs were replicated properly between the master and slave nodes, and the transition from slave to master happened successfully with all the data.
Reporting

Nutcracker exposes stats at the granularity of server pool and servers per pool through the stats monitoring port. The stats are essentially JSON formatted key-value pairs, with the keys corresponding to counter names. By default stats are exposed on port 22222 and aggregated every 30 seconds.
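Connecting to the stats port simply returns that JSON document and closes the connection; a small Python sketch of polling it locally (default port 22222 assumed, as noted above):

import json
import socket

sock = socket.create_connection(('127.0.0.1', 22222))
raw = b''
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    raw += chunk
sock.close()

stats = json.loads(raw.decode('utf-8'))
print(sorted(stats.keys()))   # top-level counters plus one entry per server pool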

Some best practices while designing highly scalable+available ElastiCache Redis Tier :

Practice 1 : Reduce the Number of Connections and pipeline messages:

Whenever the application instance gets a request to get/put a value from/to the ElastiCache Redis node, the client makes a connection to the Redis tier. Imagine it is a heavy-traffic site: thousands of incoming requests translate into thousands of connections from the application instance to the Redis tier. Now when you add Auto Scaling to your application tier and you have a few hundred servers scaled out, imagine the connection complexity and overhead this architecture brings to the ElastiCache Redis tier.

The best practice is to minimize the number of connections made from your application instance to your ElastiCache Redis node. Use Twemproxy in bundled mode with the application EC2 instance; this keeps the process in close proximity and reduces the connection overhead. Secondly, Twemproxy internally uses minimal connections to the ElastiCache Redis instance by proxying multiple client connections onto one or a few server connections.
Redis also supports pipelines, where multiple requests can be pipelined and sent on a single connection. In a simple test using a large application node and a large ElastiCache node we were able to process 125K messages/sec in pipeline mode; now imagine what you could achieve on bigger instance types on AWS. The connection-minimization architecture of twemproxy makes it ideal for pipelining requests and responses, and hence saving on round-trip time. For example, if twemproxy is proxying three client connections onto a single server and we get the requests 'get key\r\n', 'set key 0 0 3\r\nval\r\n' and 'delete key\r\n' on these three connections respectively, twemproxy would try to batch these requests and send them as a single message onto the server connection.
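A minimal redis-py sketch of pipelining through the local twemproxy listener (127.0.0.1:22122 as configured earlier; twemproxy does not support MULTI/EXEC, hence transaction=False):

import redis

r = redis.Redis(host='127.0.0.1', port=22122)

pipe = r.pipeline(transaction=False)
for i in range(1, 1001):
    pipe.set('key-%d' % i, 'value-%d' % i)
pipe.execute()   # the whole batch is sent in far fewer round trips than 1000 separate calls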

Note: It is important to note that the "read my last write" constraint doesn't necessarily hold true when twemproxy is configured with server_connections: > 1. Let us consider a scenario where twemproxy is configured with server_connections: 2. If a client makes pipelined requests, with the first request in the pipeline being set foo 0 0 3\r\nbar\r\n (write) and the second request being get foo\r\n (read), the expectation is that the read of key foo would return the value bar. However, with a configuration of two server connections it is possible that the write and read requests are sent on different server connections, which would mean that their completion could race with one another. In summary, if the client expects "read my last write", either configure twemproxy to use server_connections: 1 or use clients that only make synchronous requests to twemproxy.

Practice 2:  Configure Auto Ejection and Hashing combination properly

Design for failure is the mantra of cloud architecture. Failures are common when things are distributed at scale. Though partitioning when using ElastiCache Redis as a data store or as a cache is conceptually the same on broad lines, there is a huge operational difference in large-scale systems. When you are using ElastiCache Redis as a data store, you need to be sure that a given key always maps to the same instance; whereas if you are using ElastiCache Redis as a cache and a given node is not available, you can always start afresh using a different node in the hash ring with consistent hashing implementations.
To be resilient against failures, it is recommended that you configure auto_eject_hosts as false when you treat Redis as a data store and true when you treat Redis as a cache. For example:
resilient_pool:
  auto_eject_hosts: true
  server_retry_timeout: 30000
  server_failure_limit: 3
Enabling auto_eject_hosts: This property ensures that a dead ElastiCache redis Node can be ejected out of the hash ring after server_failure_limit: consecutive failures have been encountered on that node. A non-zero server_retry_timeout: ensures that we don’t incorrectly mark a node as dead forever especially when the failures were really transient. The combination of server_retry_timeout: and server_failure_limit: controls the tradeoff between resiliency to permanent and transient failures.
Note that an ejected node will not be included in the hash ring for any requests until the retry timeout passes. This will lead to data partitioning, as keys originally on the ejected node will now be written to another node still in the pool. If ElastiCache Redis is used as a cache (in memory), then in the event of a Redis node going down the cache data will be lost. This cache miss can cascade performance problems to other tiers and altogether bring down your system on the cloud. To minimize KV cache misses, you can design your hash ring with Ketama hashing on the Redis proxy. This will minimize cache misses in the event of a cache node failure, and it also decreases the overall rebalancing needed in your Redis tier. In addition to helping with availability problems, Redis proxy + Ketama can also help your Redis farm scale out and scale down easily with minimal cache misses. To know more about Ketama on ElastiCache refer to http://harish11g.blogspot.com/2013/01/amazon-elasticache-memcached-internals_8.html .
The below diagram illustrates an ElastiCache Redis cache farm with a consistent hash ring.
In short, to minimize cache misses when using auto_eject_hosts: true, it is recommended to use Ketama hashing (a consistent hashing algorithm) in your Twemproxy configuration.
ElastiCache Redis as a Data Store:

What if the data stored in your cache is important and needs to be persisted across node failures and relaunches? What if the data stored in your cache cannot be lost and needs to be replicated and promoted during failures?
Welcome to ElastiCache Redis as a data store. ElastiCache Redis offers features to persist the in-memory cache data to disk and also replicate it to a slave for high availability. If ElastiCache Redis is used as a store (persistent), you need to keep the map between keys and nodes fixed, and a fixed number of nodes. Since the data stored is important when you treat ElastiCache Redis as a data store, in the event one Redis node goes down you should have an immediate standby up and running in minutes. You can architect an ElastiCache Redis master with one or more replication slaves launched in different AZs from the master for high availability in AWS. In the event of a master node failure or master AZ failure, a slave Redis node can be promoted in minutes to act as the master. This whole high availability design keeps the number of nodes on the hash ring stable and simple; otherwise, you will end up building a system to rebalance the keys between nodes (which is not easy) whenever nodes are added or removed during outages. In addition to the above, ElastiCache Redis supports partial resynchronization with slaves: if the connection between a master node and a slave node is momentarily broken, the master accumulates data destined for the slave in a backlog buffer. If the connection is restored before the buffer becomes full, a quick partial resync will be done instead of a potentially longer full resync. This really saves network bandwidth during momentary failures.
In large-scale systems you will often find some partitions are more heavily used than others. If the usage is read-heavy in nature, you can add up to 5 read replicas to the ElastiCache Redis master partition. Since these replicas are used only for reads, they do not affect the hash ring structure. But Twemproxy lacks support for read scaling with Redis replicas, so when you face this problem you will have to scale up the capacity (instance/node type) of the master and slave of that partition alone.

If you are using ElastiCache Redis as a data store, it is recommended to keep the "auto_eject_hosts" property false in Twemproxy so that in the event of a Redis node failure it is not ejected from the hash ring. The hash ring can be built with either the ketama or modula hash algorithm, since in the event of a primary node failure the slave is going to be promoted and the ring structure is always maintained. But if you feel there is a strong possibility that the number of primary node partitions will grow, or that major failures will occur, it is better to choose the ketama hash ring from the beginning. The below diagram illustrates the architecture.

Practice 3: Configure the Buffer properly:

All memory for incoming requests and outgoing responses is allocated in mbufs in Twemproxy. Mbufs enable zero copy for requests and responses flowing through the proxy. By default an mbuf is 16K bytes in size, and this value can be tuned between 512 bytes and 16M bytes using the -m or --mbuf-size=N argument. Every connection has at least one mbuf allocated to it, which means the number of concurrent connections twemproxy can support depends on the mbuf size. A small mbuf allows us to handle more connections, while a large mbuf allows us to read and write more data to and from the kernel socket buffers. Large-scale web/mobile applications involving millions of hits might have small request/response sizes and lots of concurrent connections to handle in their backend. In such scenarios, when Twemproxy is meant to handle a large number of concurrent client connections, you should set the chunk size to a small value like 512 bytes to 1K bytes using the -m or --mbuf-size=N argument.

Practice 4: Configure proper Timeouts
It is always a good idea to configure Twemproxy timeout: for every server pool, rather than purely relying on client-side timeouts. Eg:

resilient_pool_with_timeout:
  auto_eject_hosts: true
  server_retry_timeout: 30000
  server_failure_limit: 3
  timeout: 400
Relying only on client-side timeouts has the adverse effect of the original request having timed out on the client-to-proxy connection while still pending and outstanding on the proxy-to-server connection. This further gets exacerbated when the client retries the original request.

Benefits of using Twemproxy for Redis Scaling

  • Avoids reinventing the wheel (thanks, Manju Raj of Twitter).
  • Reduces the number of connections to your cache server by acting as a proxy.
  • Shards data automatically between multiple cache servers.
  • Supports consistent hashing with different strategies and hashing functions.
  • Can be configured to disable nodes on failure.
  • Can run in multiple instances, allowing clients to connect to the first available proxy server.
  • Pipelining and batching of requests, and hence saving of round trips.

Disadvantages of Partitioning Model:

Point 1) Operations involving multiple keys are usually not supported. For instance, you can’t perform the intersection between two sets if they are stored in keys that are mapped to different Redis instances (actually there are ways to do this, but not directly; see the short sketch after this list). Redis transactions involving multiple keys cannot be used.
Point 2) The partitioning granularity is the key, so it is not possible to shard a dataset with a single huge key, like a very big sorted set. Ideally in such cases you should scale up that particular Redis master-slave pair to a larger EC2 instance or programmatically stitch up the sorted set.
Point 3) When partitioning is used, data handling is more complex: for instance, you have to handle multiple RDB/AOF files, and to make a backup of your data you need to aggregate the persistence files/snapshots from multiple EC2 Redis slaves.
Point 4) Architecting a partitioned + replicated ElastiCache Redis tier is not complex. What is more complex is supporting transparent rebalancing of data, with the ability to add and remove nodes at runtime. Systems like client-side partitioning and proxies don’t support this feature. However, a technique called presharding helps in this regard, with limitations. Presharding technique: since Redis is lightweight, you can start with a lot of EC2 instances from the beginning itself. For example, if you start with 32 or 64 EC2 instances (micro or small cache node instance types) as your node capacity, it will provide enough room to keep scaling up the capacity as your data storage needs increase. It is not a highly recommended technique, but it can still be used in production if your growth pattern is very predictable.
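To make Point 1 concrete, here is a minimal sketch (assuming the Python redis-py client and two hypothetical partition endpoints) of computing a set intersection on the client side when the two keys live on different partitions:

import redis

# Hypothetical endpoints of two ElastiCache Redis partitions.
shard_a = redis.StrictRedis(host="redis-partition-1.example.cache.amazonaws.com", port=6379)
shard_b = redis.StrictRedis(host="redis-partition-2.example.cache.amazonaws.com", port=6379)

# SINTER only works when both keys are on the same node; across partitions
# you have to fetch both sets and intersect them in application code.
members_a = shard_a.smembers("recent:visitors")
members_b = shard_b.smembers("paid:customers")
common = members_a & members_b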

Future of highly scalable + available Redis tiers -> Redis Cluster

Redis Cluster is the preferred way to get automatic sharding and high availability, but it is currently not production ready. Once Redis Cluster and cluster-aware clients are available on Amazon ElastiCache, it will become the de facto standard for Redis partitioning. It uses a mix of query routing and client-side partitioning.

References:
http://redis.io/documentation
https://github.com/twitter/twemproxy

This article was co-authored with Senthil

451 Research Report: 8KMiles crosses the chasm in cloud-based identity federation

Analyst: Wendy Nather 22 Nov, 2013

Original Report URL from 451 Research website : https://451research.com/report-short?entityId=79384

The full report is published below.

8KMiles has been heavily invested in cloud integration. As one of Amazon Web Services’ Premier Consulting Partners for 2013, it has helped customers stand up everything from Amazon’s Elastic Block Store to its S3 and Relational Database services. So it made sense to continue to add cloud integration services in the identity and access management (IAM) space. To this end, the company acquired Sunnyvale, California-based FuGen Solutions in May to obtain its Cloud Identity Broker and Multi-Domain Identity Services Platform.

The 451 Take
A combination of design and operations support helps 8KMiles and its subsidiary FuGen make on-ramping of federated identity partners easier, particularly for enterprises that don’t have the infrastructure or expertise to figure it out themselves. A migration opportunity can become a hosting opportunity, while a hosting opportunity could turn into the kind of identity and attribute exchange that is still needed. Other efforts are underway to build such an exchange, but 8KMiles and FuGen could get out ahead of it – although it might help if they settled on one company name to promote the unity they’re offering.

Context
Did identity federation get any easier when the execution venue moved from legacy systems to the cloud? Actually, that’s a trick question, because most of it hasn’t moved – it’s just been stretched. Even without the dynamism and scale requirements of the cloud, an enterprise’s federation efforts with its partners suffer from complexity that many organizations aren’t equipped to handle.
There are many types of federation, and only some of them are binary: that is, one organization completely trusts the other one, so that it accepts any identity offered. A common example is federation between a health insurance provider and a partner that provides pharmacy benefits: there can be a one-to-one acceptance because it’s the same business case (benefits for an insured client) and the same level of security risk. Because it’s the same business case, both sides can validate the user in the same way and no additional validation is needed. A user can be passed through single sign-on from one site to another in a fairly seamless fashion.

However, not all federation is binary. Take the example of a state education agency: it has thousands of school district employees that need to use the agency’s applications. The agency would like to have the districts set up access for those users, but it is still legally on the hook to approve every access. This means that the agency has to rely on some assertions by the district, but must take an additional step of its own for validation and approval before it can fully accept that user into its systems. These validation workflows often use attributes of the user’s identity: whether the user is an employee of the district (which only the district is authoritative about), which roles the user is assigned (which might be determined by the agency), or whether the user is also a member of a different organization (such as working for a second district).

Attributes may sound complicated, and the business rules behind them can be. But an attribute is really the reason why you’re allowing access to that user. You’re allowing access because the insurance provider says this is a registered subscriber; you’re allowing it because the Department of Motor Vehicles (DMV) asserts that this is a licensed driver; you’re allowing it because the user is a registered PayPal customer. And you can only rely on that attribute when it comes from the right authoritative party: only PayPal can say with certainty who its current customers are.

The ecosystem of attributes has yet to be addressed in a coherent way. Many websites and applications will be happy to accept the credentials of a Facebook user, because they only care that someone at Facebook (presumably) validated the user account. That’s all the validation they need. But that’s not enough for many other organizations, especially where legal and regulatory issues are on the line. But if you could get all these authoritative parties in one place…

This is where 8KMiles and FuGen come in.
Founded in 2007, 8KMiles is led by Suresh Venkatachari, its chairman and CEO, who also founded consulting firm SolutionNET. The company has 140 employees among its locations in California, Virginia, Canada and India. In May, 8KMiles acquired FuGen Solutions for $7.5m, with the target becoming a subsidiary.

Products and services
8KMiles offers both consulting services (cloud migration, engineering and application development) and frameworks for assembling secure cloud systems. The company provides a turnkey architecture for implementing a secure private cloud, including firewall and DDoS protection services, secure remote access, system administrator access and monitoring, and disk encryption. This can be deployed either as an Amazon virtual private cloud or in an organization’s own datacenter. 8KMiles similarly offers a secure enterprise collaboration implementation that combines Alfresco’s content management and Amazon’s RDS. An AWS Direct Connect package contains both design and management of the network, points of presence, and security.

When 8KMiles bought FuGen, it obtained both a cloud identity brokerage and the target’s Multi-Domain Identity Services Platform (MISP). The platform supports the partner onboarding and federation management activities, as well as what the vendor calls last-mile single-sign-on integration to a centralized hub for smaller customers that don’t have legacy IAM systems to connect, or who don’t have the expertise to put everything together. The platform is vendor-agnostic in that it can be used with any IAM provider’s systems to connect and federate partners. The authentication protocols supported include SAML 1.1, SAML 2.0, WS-Federation, WS-Trust, OpenID and OAuth. MISP comes with rules-based validation and reporting, criteria certification, monitoring and logging, and storage of scenarios, data messages, templates and certification reports.

One of the strengths of the broker and platform offerings is that FuGen and 8KMiles staff can duplicate the customer’s complex federation requirements in their virtualized environment. The vendor can build the hub and test all of the integrations with the partners’ systems in a lab setting. Once it’s been assembled and shown to work properly, the company can walk the customer through implementing the working version on its own systems, providing instructions down to the level of the configuration file changes. In cases where the customer does not have specialized IAM expertise or a test network, FuGen can provide both.

These services are available for community providers, SaaS application firms, identity and attribute vendors, and many others. FuGen’s customers range from one of the largest financial services institutions to media providers, large IT suppliers and defense contractors (Amazon AWS customers use FuGen’s federated identity features).

The idea of creating a vendor-agnostic federation space is a good one – as the number of partners grows with which FuGen has already built integrations, the onboarding for future customers goes more quickly. For example, if FuGen has already done the hard work of figuring out connectors for a large payment provider that happens to use Oracle for an IAM system and Ping Identity for cloud-based SSO, then any other partners that want to federate with that large payment provider using the same products will have most of the work already done. The network effect comes into play here: the more partners FuGen integrates, the stronger its offerings grow as a cloud-based ID federation service.

For the reasons described above, many enterprises end up relying on a varied set of IDs and attributes, all coming from different partners. Building a central ID and attribute exchange could speed federation projects for government, healthcare, finance and other verticals if FuGen can pre-integrate those providers. When businesses can join a virtual marketplace where they can get the attributes they need from their state DMV, PayPal and business process outsourcer, and all of the integration work is done for them, then the community has a good chance of growth. Many identity and attribute exchange projects are already underway (and FuGen is already part of some of these open initiatives) – the one advantage is that the company helps facilitate the plumbing, not just the framework. Also, this isn’t just about the cloud: enterprises can still federate with one another using their own systems, with FuGen’s services to set it up. The one hitch is that this is a potential that hasn’t been fully realized. 8KMiles and FuGen would have to figure out how to charge for this service, since charging by ID or partner account might be too dynamic to support a licensing structure. (This isn’t to say that a cloud provider can’t charge dynamically, it’s just that determining how many IDs are in use at any given time is a tricky proposition.) The vendor could charge an onboarding project fee, but services after that – such as monitoring, support, troubleshooting and integration tweaks – would need a different incremental pricing structure. If a large provider is hooked into the hub, and new partners join it, does the provider get charged more, or just the new partners? Identity and attribute management are both still developing areas of technology, and with the cloud as a delivery method, many aspects have to be reconsidered.

Competition
The term ‘identity broker’ is unintentionally confusing, since it is most often used to describe technology that helps intermediate an enterprise’s portfolio of ID stores and services, usually to provide single sign-on for that enterprise’s users or its customers. This is not the same as a third-party identity exchange, such as the kind envisioned by the Identity Ecosystem Steering Group (whose website, incidentally, is powered by Ping). There is also a lot of discussion in the IAM community about who can and should act as identity providers, and the candidates include social media such as Facebook or Google, financial institutions and telcos, since all of these appear to have the largest user bases.

However, none of these identity providers in and of themselves can supply all of the assurance and validation that different business cases require. It doesn’t matter whether Verizon has verified a user for phone service if a relying party has to figure out whether the user is really the same one who walked into the emergency room last night. Some organizations have much stronger requirements for identity assurance, and will have to assemble their own validation lists from multiple ID providers.

Not only does the ID and attribute exchange need to be vendor-agnostic, it also needs to be easy to join. This is where the pre-integration and onboarding services are crucial. Customers don’t have to let FuGen host the hub, but it helps with the kind of complex troubleshooting that federated IAM can sometimes require. The opportunity for FuGen is that it can be a broker for the brokers, so to speak: each enterprise in an ideal world would have just one interface to expose to the world, but those interfaces still need to be matched up with the other ones.

The term ‘broker’ is confusing, but if we focus on ‘exchange,’ we get closer to our original meaning and can consider the competition. SecureKey Technologies was recently awarded a contract by the US Postal Service to create the Federal Cloud Credential Exchange. Criterion Systems was one of the National Strategy for Trusted Identities in Cyberspace pilot grant recipients in 2012, and is building its ID DataWeb Attribute Exchange Network, with an ecosystem of technology partners and relying parties such as Ping, CA Technologies, Fixmo, Verizon, Experian and Wave Systems. If firms like these manage to build a working exchange, it could rival what 8KMiles and FuGen can do. Again, the latter are helping customers set up the integration, not just acting as a provider, so the operational features of their offering set it apart from these exchange projects. The race will be to see who can collect the largest amount of trusted resources and participants in a broadly working exchange. Vendor neutrality and open standards will play a role, but so will user-friendliness. If FuGen can offer both the onramp services and the day-to-day operation in a way that preserves trust, it could have the magic formula.

SWOT Analysis

Strengths
As a cloud broker, 8KMiles expanded its repertoire with the acquisition of FuGen. Identity management is certainly a key part of cloud migration and operation, and FuGen’s virtualized lab environment helps it work out all of the bugs in a complex identity federation system without impacting the customer.

Weakness
FuGen may be known in the IAM industry, particularly due to its participation in public initiatives, but customers may find the name too confusing alongside 8KMiles (neither name really says what the company does). It also has a lot of potential in supporting an identity and attribute exchange, but that potential needs to be realized.

Opportunities
Nobody has really figured out federation yet. Even though some straightforward, homogeneous business use cases are working fine, the more complicated ecosystems are still in the committee/framework/pilot stages. If 8KMiles/FuGen can onramp enough critical-mass partners, it could become a de facto hub before these committees can turn around.

Threats
Vendors such as SecureKey and Criterion are building exchanges too, although they’re in the early stages.
8KMiles/FuGen will also be confused with many other cloud IAM technology vendors due to the misuse of the term broker.

Analyst(s): Wendy Nather , 451 Research

Comparison Analysis: Amazon ELB vs HAProxy EC2

In this article I have analysed Amazon Elastic Load Balancer (ELB) and HAProxy (a popular LB in AWS infra) on the following production scenario aspects and fitment:

Algorithms: In terms of algorithms, ELB provides Round Robin and Session Sticky algorithms based on EC2 instance health status. HAProxy provides a variety of algorithms like Round Robin, Static-RR, Least Connection, source, uri, url_param etc. For most production cases Round Robin and Session Sticky are more than enough, but in case you require algorithms like Least Connection you might have to lean towards HAProxy currently. In future AWS might add such algorithms to their Load Balancer.

Spiky or Flash Traffic: Amazon ELB is designed to handle unlimited concurrent requests per second with a “gradually increasing” load pattern. It is not designed to handle a heavy sudden spike of load or flash traffic. For example, imagine an e-commerce website whose traffic increases gradually to thousands of concurrent requests/sec over hours; in contrast, use cases like a mass online exam, a GILT-style flash sale or a 3-hour sales/launch campaign site may see a sudden spike of 20K+ concurrent requests/sec within a few minutes, and Amazon ELB will struggle to handle this load volatility pattern. If this sudden spike pattern is not a frequent occurrence we can pre-warm the ELB, else we need to look for alternative load balancers like HAProxy in AWS infrastructure. If you expect a sudden surge of traffic you can provision X number of HAProxy EC2 instances in running state.

Gradually Increasing Traffic: Both Amazon ELB and HAProxy can handle gradually increasing traffic. But when your needs become elastic and traffic increases during the day, you either need to automate or manually add new HAProxy EC2 instances when the threshold is breached. Also, when the load decreases you may need to manually remove HAProxy EC2 instances from the load balancing tier. If you want to avoid this manual effort you will need to engineer it using automation scripts and programs. Amazon has intelligently automated this elasticity problem in their ELB tier; we just need to configure and use it, that’s all.

Protocols: Currently Amazon ELB only supports the following protocols: HTTP, HTTPS (Secure HTTP), SSL (Secure TCP) and TCP. ELB supports load balancing for the following TCP ports: 25, 80, 443, and 1024-65535. In case RTMP or HTTP streaming protocols are needed, we need to use Amazon CloudFront CDN in the architecture. HAProxy can support both TCP and HTTP protocols. When an HAProxy EC2 instance is working in pure TCP mode, a full-duplex connection is established between clients and servers, and no layer 7 examination is performed. This is the default mode and can be used for SSL, SSH, SMTP etc. The current 1.4 version of HAProxy does not support the HTTPS protocol natively; you may need to use Stunnel or Stud or Nginx in front of HAProxy to do the SSL termination. HAProxy 1.5 dev-12 comes with SSL support and will become production ready soon.

Timeouts: Amazon ELB currently times out persistent socket connections at 60 seconds if they are kept idle. This will be a problem for use cases which generate large files (PDFs, reports etc.) at the backend EC2, send them back as the response and keep the connection idle during the entire generation process. To avoid this you will have to send something on the socket every 40 or so seconds to keep the connection active in Amazon ELB. In HAProxy you can configure very large socket timeout values to avoid this problem.

Whitelisting IPs: Some enterprises might want to whitelist a 3rd-party load balancer IP range in their firewalls. If the 3rd-party service is hosted using Amazon ELB this becomes a problem: currently Amazon ELB does not provide fixed or permanent IP addresses for the load balancing instances that are launched in its tier. This will be a bottleneck for enterprises which are compelled to whitelist the load balancer IPs in external firewalls/gateways. For such use cases, we can currently use HAProxy EC2 instances attached with Elastic IPs as load balancers in AWS infrastructure and whitelist the Elastic IPs.

Amazon VPC / Non-VPC: Both Amazon ELB and HAProxy EC2 can work inside the VPC (Virtual Private Cloud) and non-VPC environments of AWS.

Internal Load Balancing: Both Amazon ELB and HAProxy can be used for internal load balancing inside a VPC. You might provide a service that is consumed internally by other applications and needs load balancing; both ELB and HAProxy fit this case. In case internal load balancing is required in Amazon non-VPC environments, ELB is not currently capable and HAProxy can be deployed.

URI/URL based Load Balancing: Amazon ELB cannot load balance based on URL patterns like other reverse proxies can. For example, Amazon ELB cannot direct and load balance between the request URLs www.xyz.com/URL1 and www.xyz.com/URL2. Currently for such use cases you can use HAProxy on EC2.

Sticky problem: This point comes as a surprise to many users of Amazon ELB. Amazon ELB behaves a little strangely when incoming traffic originates from a single or specific IP range: it does not do round robin efficiently and sticks the requests to only some EC2 instances. Since I do not know the ELB internals, I assume ELB might be using the “Source” algorithm as the default in such conditions. No such cases were observed with HAProxy EC2 in AWS unless the balance algorithm is “Source”. In HAProxy you can combine “Source” and “Round Robin” efficiently: if the HTTP request does not have a cookie it uses the Source algorithm, but if the HTTP request has a cookie HAProxy automatically shifts to RR or Weighted. (I will have to check this with the AWS team.)

Logging: Amazon ELB currently does not provide access to its log files for analysis. We can only monitor some essential metrics using CloudWatch for ELB. We cannot debug load balancing problems, analyze traffic and access patterns, or categorize bots/visitors etc. currently, because we do not have access to the ELB logs. This will also be a bottleneck for organizations which have strong audit/compliance requirements to be met at all layers of their infrastructure. In case very strict/specific log requirements exist, you might need to use HAProxy on EC2, provided it suffices the need.

Monitoring: Amazon ELB can be monitored using Amazon CloudWatch. Refer to this URL for the ELB metrics that can currently be monitored: http://harish11g.blogspot.in/2012/02/cloudwatch-elastic-load-balancing.html. CloudWatch + ELB is detailed enough for most use cases and provides a consolidated result for the entire ELB tier in the console/API. On the other hand, HAProxy provides a user interface and stats for monitoring its instances. But if you have farms (20+) of HAProxy EC2 instances it becomes complex to manage this monitoring part efficiently. You can use tools like ServerDensity to monitor such HAProxy farms, but it has a strong dependency on NAT instance availability for deployments inside Amazon VPC.
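For instance, pulling an ELB metric out of CloudWatch programmatically could look roughly like this (a sketch using the current boto3 SDK; the load balancer name is hypothetical):

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Average backend latency of a hypothetical classic ELB over the last hour.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="Latency",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-web-elb"}],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
    EndTime=datetime.datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in stats["Datapoints"]:
    print(point["Timestamp"], point["Average"])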

SSL Termination and Compliance requirements:
SSL termination can be done at 2 levels using Amazon ELB in your application architecture. They are:
SSL termination at the Amazon ELB tier, which means the connection is encrypted between the client (browser etc.) and Amazon ELB, but the connection between ELB and the Web/App EC2 instances is in the clear. This configuration may not be acceptable in strictly secure environments and will not pass compliance requirements.
SSL termination at the backend with end-to-end encryption, which means the connection is encrypted between the client and Amazon ELB, and the connection between ELB and the Web/App EC2 backend is also encrypted. This is the recommended ELB configuration for meeting compliance requirements at the LB level.
HAProxy 1.4 does not support SSL termination directly, so it has to be done in a Stunnel or Stud or Nginx layer in front of HAProxy. HAProxy 1.5 dev-12 comes with SSL support and will become production ready soon; I have not yet analyzed/tested the backend encryption support in this version.

Scalability and Elasticity: The most important architectural requirements of web-scale systems are scalability and elasticity. Amazon ELB is designed for this and handles these requirements with ease. Elastic Load Balancer does not cap the number of connections that it can attempt to establish with the load-balanced Amazon EC2 instances, and Amazon ELB is designed to handle unlimited concurrent requests per second. ELB is inherently scalable and can elastically increase/decrease its capacity depending upon the traffic. According to a benchmark done by RightScale, Amazon ELB was easily able to scale out and handle 20K+ concurrent requests/sec. Refer to URL: http://blog.rightscale.com/2010/04/01/benchmarking-load-balancers-in-the-cloud/
Note: The load testing was stopped after 20K req/sec by RightScale because ELB kept expanding its capacity. Considerable DevOps engineering is needed to automate this functionality with HAProxy.

High Availability: Amazon ELB is inherently fault tolerant and a highly available service. Since it is a managed service, unhealthy load balancer instances are automatically replaced in the ELB tier. In the case of HAProxy, you need to do this work yourself and build HA on your own. Refer to URL http://harish11g.blogspot.in/2012/10/high-availability-haproxy-amazon-ec2.html to understand more about high availability at the load balancing layer using HAProxy.

Integration with Other Services: Amazon ELB can be configured to work seamlessly with Amazon Auto Scaling, Amazon CloudWatch and Route 53 DNS services. New web EC2 instances launched by Amazon Auto Scaling are added to the Amazon ELB for load balancing automatically, and whenever load drops, existing EC2 instances can be removed from the ELB by Amazon Auto Scaling. Amazon Auto Scaling and CloudWatch cannot be integrated seamlessly with HAProxy EC2 for this functionality, but HAProxy can be integrated with Route 53 easily for DNS RR/Weighted algorithms.
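As a rough illustration of that integration, here is a minimal sketch using the current boto3 SDK that registers a classic ELB with an Auto Scaling group (the group, launch configuration and load balancer names are hypothetical):

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Hypothetical Auto Scaling group registered with a classic ELB; instances it
# launches are added to the ELB automatically and removed when scaled in.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-launch-config",
    MinSize=2,
    MaxSize=10,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    LoadBalancerNames=["my-web-elb"],
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)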

Cost: If you run an ELB in the US-East Amazon EC2 region for a month (744 hrs) processing close to 1 TB of data, it will cost around ~26 USD (ELB usage + data charge). In case you use HAProxy (2 x m1.large EC2 for HAProxy, S3-backed AMI, Linux instances, no EBS attached) as base capacity and add up to 4 or more m1.large EC2 instances depending upon traffic, it will cost a minimum of 387 USD for EC2 compute plus data charges to start with. It is very clear and evident that larger deployments can save a lot of cost and benefit immensely from using Amazon ELB compared to HAProxy on EC2.
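A rough back-of-envelope sketch of these two figures, assuming the approximate US-East rates at the time of writing (about 0.025 USD per ELB-hour, 0.008 USD per GB processed and 0.26 USD per m1.large hour; these rates are assumptions, so check current pricing):

# Back-of-envelope estimate, not an official price calculation.
hours_per_month = 744

# ELB: hourly charge plus per-GB data processing charge for ~1 TB.
elb_cost = hours_per_month * 0.025 + 1024 * 0.008   # ~27 USD, close to the ~26 USD above

# HAProxy base capacity: 2 x m1.large running the whole month.
haproxy_cost = 2 * hours_per_month * 0.26            # ~387 USD

print(round(elb_cost), round(haproxy_cost))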

Use Amazon S3 Object Expiration for Cost Savings

Amazon S3 is one of the earliest and most popular services in AWS infra for storing files & documents. Customers usually store a variety of files including their logs, documents, images, videos, dumps etc. in Amazon S3. We all understand that different files have different lifetimes and use cases in any production application. Some documents are frequently accessed for a limited period of time and after that you might not need real-time access to those objects; they become candidates for deletion or archival.
For example:
  • Log files have a limited lifetime and can either be parsed into a data store or archived every few months
  • Database and data store dumps also have a retention period and hence a limited lifetime
  • Files related to campaigns are most of the time not needed once the sales promotion is over
  • Customer documents depend upon the customer usage life cycle and have to be retained as long as the customer is active in the application
  • Digital media archives, financial and healthcare records must be retained for regulatory compliance

Usually IT teams have to build some sort of mechanism or automated program in-house to track these document ages and initiate a deletion process (individual or bulk) from time to time. In my customer consulting experience, I have often observed that the above mechanism is not adequately in place for the following reasons:
  • Not all IT teams are efficient in their development and operations
  • No mechanism/automation is in place to manage the retention period efficiently
  • IT staff are not fully equipped with AWS cloud knowledge
  • IT teams are usually occupied with the solutions/products catering to their business and hence do not have time to keep track of the rapid AWS feature roll-out pace

Imagine your application stores ~5 TB of documents every month. In a year this aggregates to ~60 TB of documents in Standard storage of Amazon S3. In Amazon S3 Standard in the US-East region, ~60 TB of aggregated storage for the year will cost ~30,000 USD. Out of this, imagine ~20 TB of the documents aggregated for the year have a limited lifetime and can be deleted or archived periodically every month. This equates to ~1650 USD of cost leakage a year, which can be avoided if a proper mechanism or automation is put in place by the respective teams.
Note: Current charges for Amazon S3 Standard storage in US-East per GB-month are 0.095 USD for the first 1 TB and 0.080 USD for the next 49 TB.
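A rough back-of-envelope sketch of how these figures add up, assuming a blended rate of about 0.08 USD per GB-month and 1 TB = 1024 GB (an approximation of the tiered pricing above, not the author's exact calculation):

# Cumulative Standard storage cost when ~5 TB is added every month.
rate_per_gb_month = 0.08
tb_months = sum(5 * month for month in range(1, 13))   # 5, 10, ... 60 TB held each month
annual_storage_cost = tb_months * 1024 * rate_per_gb_month
print(round(annual_storage_cost))   # roughly 32,000 USD, near the ~30,000 USD above

# Cost of keeping ~20 TB of stale, limited-lifetime documents around.
stale_cost = 20 * 1024 * rate_per_gb_month
print(round(stale_cost))            # roughly 1,638 USD, in line with the ~1650 USD leakage figure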
But is there a simpler way for IT teams to cut this leakage and save costs in Amazon S3? Yes: use the Amazon S3 Object Expiration feature.

What is Amazon S3 Object expiration?
Amazon S3 introduced a feature called Object Expiration (in late 2011) to ease the above automation mechanism. This is a very helpful feature for customers who want their data in S3 for a limited period of time, after which the files no longer need to be kept and should be deleted automatically by Amazon S3. Earlier, as a customer you were responsible for deleting those files manually when they were no longer useful; now you do not have to worry about it, just use Amazon S3 Object Expiration.
The leakage of ~1650 USD you saw in the above scenario can be saved by implementing the Amazon S3 Object Expiration feature in your system. Since it does not involve automation effort, compute hours for an automation program to run, or manual labor, it offers invisible savings in addition to the direct savings.

Overall Savings = ~1650 USD (scenario) + Cost of compute hrs (for deletion program) + Automation engineering effort (or) Manual deletion effort

How does it work?
The Amazon S3 Object Expiration feature allows you to define rules to schedule the removal of your objects after a pre-defined time period. The rules are specified in the Lifecycle Configuration policy of an Amazon S3 bucket and can be updated either through the AWS Management Console or the S3 APIs.
Once the rule is set, the object expiration time is calculated by Amazon S3 by adding the expiration lifetime to the file creation time and then rounding up the resulting time to midnight GMT of the next day. For example: if a file was created on 11/12/2012 11:00 am UTC and the expiration period was specified as 3 days, then Amazon S3 would calculate the expiration date-time of the file as 15/12/2012 00:00 UTC. Once objects are past their expiration date, they are queued for deletion. You can use Object Expiration rules on objects stored in both Standard and Reduced Redundancy storage of Amazon S3.
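As an illustration, a lifecycle expiration rule could be set through the API roughly like this (a sketch using the current boto3 SDK; the bucket name, prefix and 90-day lifetime are hypothetical):

import boto3

s3 = boto3.client("s3")

# Expire objects under the hypothetical "logs/" prefix 90 days after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-application-documents",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            }
        ]
    },
)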