Apache Solr to Amazon CloudSearch Migration Tool

In this post, we are introducing a new tool called S2C – Apache Solr to Amazon CloudSearch Migration Tool. S2C is a Linux console based utility that helps developers / engineers to migrate search index from Apache Solr to Amazon CloudSearch.

Very often customers initially build search for their website or application on top of Solr, but later run into challenges like elastic scaling and managing the Solr servers. This is a typical scenario we have observed in our years of search implementation experience. For such use cases, Amazon CloudSearch is a good choice. Amazon CloudSearch is a fully-managed service in the cloud that makes it easy to set up, manage, and scale a search solution for your website. To know more, please read the Amazon CloudSearch documentation.

We are seeing growing trend every year, organizations of various sizes are migrating their workloads to Amazon CloudSearch and leveraging the benefits of fully managed service. For example, Measured Search, an analytics and e-Commerce platform vendor, found it easier to migrate to Amazon CloudSearch rather than scale Solr themselves (see article for details).

Since Amazon CloudSearch is built on top of Solr, it exposes all the key features of Solr while providing the benefits of a fully managed service in the cloud such as auto-scaling, self-healing clusters, high availability, data durability, security and monitoring.

In this post, we provide step-by-step instructions on how to use the Apache Solr to Amazon CloudSearch Migration (S2C) tool to migrate from Apache Solr to Amazon CloudSearch.

Before we get into detail, you can download the S2C tool in the below link.
Download Link: https://s3-us-west-2.amazonaws.com/s2c-tool/s2c-cli.zip

Pre-Requisites

Before starting the migration, the following pre-requisites have to be met. The pre-requisites include installations and configuration on the migration server. The migration server could be the same Solr server or independent server that sits between your Solr server and Amazon CloudSearch instance.

Note: We recommend running the migration from the Solr server instead of independent server as it can save time and bandwidth. It is much better if the Solr server is hosted on EC2 as the latency between EC2 and CloudSearch is relatively less.

The following installations and configuration should be done on the migration server (i.e. your Solr server or any new independent server that connects between your Solr machine and Amazon CloudSearch).

  1. The application is developed using Java. Download and Install Java 8 .Validate the JDK path and ensure the environment variables like JAVA_HOME, classpath, path is set correctly.
  2. We assume you already have setup Amazon Web services IAM account. Please ensure the IAM user has right permissions to access AWS services like CloudSearch.
    Note: If you do not have an AWS IAM account with above mentioned permissions, you cannot proceed further.
  3. The IAM user should have AWS Access key and Secret key. In the application hosting server, set up the Amazon environment variables for access key and secret key. It is important that the application runs using the AWS environment variables.
    To setup AWS environment variables, please read the below link. It is important that the tool is run using AWS environment variables.http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/credentials.htmlhttp://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/java-dg-roles.html
    Alternatively, you can set the following AWS environment variables by running the commands below from Linux console.
    export AWS_ACCESS_KEY=Access Key
    export AWS_SECRET_KEY=Secret Key
  4. Note: This step is applicable only if migration server is hosted on Amazon EC2.
    If you do not have an AWS Access key and Secret key, you can opt for IAM role attached to an EC2 instance. A new IAM role can be created and attached to EC2 during the instance launch. The IAM role should have access to Amazon CloudSearch.
    For more information, read the below link
    http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html
  5. Download the migration utility ‘S2C’ (You would have completed this step earlier), unzip the tool and copy it in your working directory.Download Link: https://s3-us-west-2.amazonaws.com/s2c-tool/s2c-cli.zip

S2C Utility File
The downloaded ‘S2C’ migration utility should have the following sub directories and files.

Folder / Files Description  
 
bin Binaries of the migration tool
 
lib Libraries required for migration
 
application.conf Configuration file that allows end users to input parameters Require end-user’s input.
 
logback.xml Log file configuration Optional. Does not require end-user  / developer input
 
s2c script file that executes the migration process

Configure only application.conf and logback.xml.  Do not modify any other file.
Application.conf: The application.conf file has the configuration related to the new Amazon CloudSearch domain that will be created. The parameters configured  in the application.conf file are explained in the table below.

s2c {api {SchemaParser = “s2c.impl.solr.DefaultSchemaParser”SchemaConverter = “s2c.impl.cs.DefaultSchemaConverter”DataFetcher = “s2c.impl.solr.DefaultDataFetcher”DataPusher = “s2c.impl.cs.DefaultDataPusher”  } List of API that is executed step by step during the migration.Do not change this.
solr {dir = “files”
server-url = “http://localhost:8983/solr/collection1”
fetch-limit = 100}
dirThe base directory path of Solr.Ensure the directory is present and also its validity.Eg:/opt/solr/example/solr/collection1/conf
server-url– Server host, port and collection path.The endpoint which will be used to fetch the data.If the utility is run from a different server, ensure the IP address and port has firewall access.
fetch-limit– number of solr documents that can be fetched for each batch call. This configuration number should be carefully set by the developer.The fetch limit depends on the following factors:

  1. Record size of a Solr record(1KB or 2KB)
  2. Latency between migration server and Amazon CloudSearch
  3. Current Request Load on the Solr Server

E.g.: If the total Solr documents is 100000 and fetch limit is 100, then it would take 100000 / 10 = 10000 batch calls to complete the fetch.If size of each Solr record is 2KB, then 100000 * 2KB = 200MB data is migrated.

cs {domain = “collection1”
region = “us-east-1″
instance-type = ” search.m3.xlarge”
partition-count = 1
replication-count = 1}
domain – CloudSearch domain name. Ensure that the domain name does not already exist.
Region – AWS region for the new CloudSearch domain
Instance type – Desired instance type for CloudSearch nodes. Choose the instance type based on the volume of data and the expected query volume. 
Partition count – Number of partitions required for CloudSearch
replication-count – Replication count for CloudSearch
wd = “/tmp” Temporary file path to store intermediate data files and migration log files

Running the migration

Before launching the S2C migration tool, verify the following:

    • Solr directory path – Make sure that the Solr directory path is valid and available. The tool cannot read the configuration if the path or directory is invalid.
    • Solr configuration contents – Validate that the Solr configuration contents are correctly set inside the directory. Example: solrconfig.xml, schema.xml, stopwords.txt, etc.
    • Make sure that the working directory is present in the file system and has write permissions for the current user. It can be an existing directory or a new directory. The working directory stores the fetched data from Solr and migration logs.
    • Validate the disk size before starting the migration. If the available free disk space is lesser than the size of the Solr index, the fetch operations will fail.

For example, if the Solr index size is 7 GB, make sure that the disk has at least 10 GB to 20 GB of free space.
Note: The tool reads the data from Solr and stores in a temporary directory (Please see configuration wd = /tmp in the above table).

  • Verify that the AWS environment variables are set correctly. The AWS environment variables are mentioned in the pre-requisites section above.
  • Validate the firewall rules for IP address and ports if the migration tool is run from a different server or instance. Example: Solr default port 8983 should be opened to the EC2 instance executing this tool.

Run the following command from directory ‘{S2C filepath}’
Example: /build/install/s2c-cli

/s2c or JVM_OPTS=”-Xms2048m -Xmx2048m” ./s2c (With heap size)

The above will invoke the shell ‘s2c’ script that starts the search migration process. The migration process is a series of steps that require user inputs as shown in the screen shots below.
Step 1: Parse the Solr schema The first step of migration prompts for a confirmation to parse the Solr schema and Solr configuration file. During this step, the application generates a ‘Run Id’ folder inside the working directory.
  Example: /tmp/s2c/m1416220194655

The Run Id is a unique identifier for each migration. Note down the Run Id as you will need it to resume the migration in case of any failures.

Step 2: Schema conversion from Solr to CloudSearch.The second step prompts confirmation to convert Solr schema to CloudSearch schema. Press any key to proceed further.

The second step will also list all the converted fields which are ready to be migrated from Solr to CloudSearch. If any fields are left out, this step will allow you to correct the original schema. User can abort the migration and identify the ignored fields, rectify the schema and re-run the migration again.The below screen shot shows the fields ready for CloudSearch migration.


Step 3: Data Fetch: The third step prompts for confirmation to fetch the search index data from the Solr server. Press any key to proceed. This step will generate a temporary file which will be stored in the working directory. This temporary file will have all the fetched documents from the Solr index.


There is also option to skip the fetch process if all the Solr data is already stored in the temporary file. If this is the case, the prompt will look like the screenshot below.

Step 4: Data push to CloudSearchThe last and final step prompts for confirmation to push the search data from the temporary file store to Amazon CloudSearch. This step also creates the CloudSearch domain with the configuration specified in application.conf including desired instance type, replication count, and multi-AZ options.

If the domain is already created, the utility will prompt to use the existing domain. If you do not wish to use an existing domain, you can create a new CloudSearch domain using the same prompt.
Note: The console does not prompt for any ‘CloudSearch domain name’ but instead it uses the domain name configured in the application.conf file.

Step 5: Resume (Optional) During the migration steps, if there is any failure during the fetch operation, it can be resumed. This is illustrated in the screen shot below.

Step 6: Verification Log into AWS CloudSearch management console to verify that the domain and index fields.

Amazon CloudSearch allows running test queries to validate the migration and as well the functionality of your application.

Features supported

  • Support for other non-Linux environments is not available for now.
  • Support for Solr Shards is not available for now. The Solr shard needs to be migrated separately.
  • The install commands may vary for different Linux flavors. Example installing software, file editor command, permission set commands can be different for every Linux flavors. It is left to engineering team to choose the right commands during the installation and execution of this migration tool.
  • Only fields configured as ‘stored’ in Solr schema.xml are supported. The non-stored fields are ignored during schema parsing.
  • The document id (unique key) is required to have following attributes:
    1. Document ID should be 128 characters or less in size.
    2. Document ID can contain any letter, any number, and any of the following characters:      _ – = # ; : / ? @ &
    3. The below link will help you to understand in data  preparation before migrating to CloudSearch http://docs.aws.amazon.com/cloudsearch/latest/developerguide/preparing-data.html
  • If the conditions are not met in a document, it will be skipped during migration. Skipped records are shown in the log file.
  • If a field type (mapped to fields) is not stored, the stopwords mapped to that particular field type are ignored.

Example 1:

<field name=”description” type=”text_general” indexed=”true” stored=”true” />   

Note: The above field ‘description’ will be considered for stopwords.Example 2:

<field name=”fileName” type=”string” />     

Note: The above field ‘fileName’ will not be migrated and ignored in the stopwords.

Please do write your feedback and suggestions in the below comments section to improve this tool. The source code of the tool can be downloaded at https://github.com/8KMiles/s2c/. We have written a follow-up post in regard to that.

About the Authors
 Dhamodharan P is a Senior Cloud Architect at 8KMiles.

 

 

 

 Dwarakanath R is a Principal Architect at 8KMiles.

 

 

25 Best Practice Tips for architecting your Amazon VPC

According to me Amazon VPC is one of the most important feature introduced by AWS. We have been using AWS from 2008 and Amazon VPC from the day it was introduced and i strongly feel the customer adoption towards AWS cloud gained real momentum only after the introduction of VPC into the market.
Amazon VPC comes with lots of advantages over the limitations faced in Amazon Classic cloud like: Static private IP address , Elastic Network Interfaces :  possible to bind multiple Elastic Network Interfaces to a single instance, Internal Elastic Load Balancers, Advanced Network Access Control ,Setup a secure bastion host , DHCP options , Predictable internal IP ranges , Moving NICs and internal IPs between instances, VPN connectivity, Heightened security etc. Each and everything is a interesting topic on its own and i will be discussing them in detail in future.
Today i am sharing some of our implementation experience on working with hundreds of Amazon VPC deployments as best practice tips for the AWS user community. You can apply some of the relevant ones in your existing VPC or use these points as part of your migration approach to Amazon VPC.

Practice 1) Get your Amazon VPC combination right: Select the right Amazon VPC architecture first.  You need to decide the right Amazon VPC & VPN setup combination based on your current and future requirements. It is tough to modify/re-design the Amazon VPC at later stage, so it is better to design it taking into consideration your NW and expansion needs for next ~2 years. Currently different types of Amazon VPC setups are available; Like Public facing VPC, Public and Private setup VPC, Amazon VPC with Public and Private Subnets and Hardware VPN Access, Amazon VPC with Private Subnets and Hardware VPN Access, Software based VPN access etc. Choose the one which you feel you will be in next 1-2 years.

Practice 2) Choose your CIDR Blocks: While designing your Amazon VPC, the CIDR block should be chosen in consideration with the number of IP addresses needed and whether we are going to establish connectivity with our data center. The allowed block size is between a /28 netmask and /16 netmask. Amazon VPC can have contain from 16 to 65536 IP addresses. Currently Amazon VPC once created can’t be modified, so it is best to choose the CIDR block which has more IP addresses usually. Also when you design the Amazon VPC architecture to communicate with the on premise/data center ensure your CIDR range used in Amazon VPC does not overlaps or conflicts with the CIDR blocks in your On premise/Data center. Note: If you are using same CIDR blocks while configuring the customer gateway it may conflict.
E.g., Your VPC CIDR block is 10.0.0.0/16 and if you have 10.0.25.0/24 subnet in a data center the communication from instances in VPC to data center will not happen since the subnet is the part of the VPC CIDR. In order to avoid these consequences it is good to have the IP ranges in different class. Example., Amazon VPC is in 10.0.0.0/16 and data center is in 172.16.0.0/24 series.

Practice 3) Isolate according to your Use case: Create separate Amazon VPC for Development , Staging and Production environment (or) Create one Amazon VPC with Separate Subnets/Security/isolated NW groups for Production , Staging and development. We have observed 60% of the customer preferring the second choice. You chose the right one according to your use case.

Practice 4) Securing Amazon VPC : If you are running a machine critical workload demanding complex security needs you can secure the Amazon VPC like your on-premise data center or more sometimes. Some of the tips to secure your VPC are:

  • Secure your Amazon VPC using Firewall virtual appliance, Web application firewall available from Amazon Web Services Marketplace. You can use check point, Sophos etc for this
  • You can configure Intrusion Prevention or Intrusion Detection virtual appliances and secure the protocols and take preventive/corrective actions in your VPC
  • Configure VM encryption tools which encrypts your root and additional EBS volumes. The Key can be stored inside AWS (or) in your Data center outside Amazon Web Services depending on your compliance needs. http://harish11g.blogspot.in/2013/04/understanding-Amazon-Elastic-Block-Store-Securing-EBS-TrendMicro-SecureCloud.html
  • Configure Privileged Identity access management solutions on your Amazon VPC to monitor and audit the access of Administrators of your VPC.
  • Enable the cloud trail to audit in the VPC environments  ACL policy’s. Enable cloud trail :http://harish11g.blogspot.in/2014/01/Integrating-AWS-CloudTrail-with-Splunk-for-managed-services-monitoring-audit-compliance.html
  • Apply anti virus for cleansing specific EC2 instances inside VPC. Trend micro has very good product for this.
  • Configure Site to Site VPN for securely transferring information between Amazon VPC in different regions or between Amazon VPC to your On premise Data center
  • Follow the Security Groups and NW ACL’s best practices listed below

Practice 5) Understand Amazon VPC Limits: Always design the VPC subnets in consideration with the expansion in the future. Also understand the Amazon VPC’s limits before using the same. AWS has various limitations on the VPC components like Rules per security group, No of route tables and Subnets etc. Some of them may be increased after providing the request to the Amazon support team while few components cannot be increased. Ensure the limitations are not affecting your overall design. Refer URL:
http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_Limits.html

Practice 6) IAM your Amazon VPC: When you are going to assign people to maintain your Amazon VPC you can create Amazon IAM account with the fine grained permissions (or) use Sophisticated Privileged identity Management solutions available on AWS marketplace to IAM your VPC.

Practice 7) Disaster Recovery or Geo Distributed Amazon VPC Setup : When you are designing a Disaster Recovery Setup plan using VPC or expanding to another Amazon VPC region you can follow these simple rules. Create your Production site VPC CIDR : 10.0.0.0/16 and your DR region VPC CIDR:  172.16.0.0/16. Make sure they do not conflict with on premises subnet CIDR block in event both needs to be integrated to on premise DC as well. After CIDR blocks creation , setup a VPC tunnel between regions and to your on premise DC. This will help to replicate your data using private IP’s.

Practice 8) Use security groups and Network ACLs wisely:  It is advisable to use security groups over Network ACLs inside Amazon VPC wherever applicable for better control. Security groups are applicable on EC2 instance level while network ACL is applicable on Subnet level.  Security groups are used for White list mostly. To blacklist IPs, one can use Network ACLs.

Practice 9) Tier your Security Groups : Create different security groups for different tiers of your infrastructure architecture inside your VPC. If you have Web, App, DB tiers create different security group for each of them. Creating tier wise security groups will increase the infrastructure security inside Amazon VPC.  EC2 instances in each tier can talk only on application specified ports and not at all ports. If you create Amazon VPC security groups for each and every tier/service separately it will be easier to open a port to a particular service. Don’t use same security group for multiple tiers of instances, this is a bad practice.
Example: Open ports for security group instead of IP ranges : For example : People have tendency to open for port 8080 to 10.10.0.0/24 (web layer) range. Instead of that, open port 8080 to web-security-group. This will make sure only web security group instances will be able to contact on port 8080. If someone launches NAT instance with NAT-Security-Group in 10.10.0.0/24, he won’t be able to contact on port 8080 as it allows access from only web security group.
Practice 10 ) Standardize your Security Group Naming conventions : Following a security group naming conventions inside Amazon VPC will improve operations/management for large scale deployments inside VPC. It also avoids manual errors, leaks and saves cost and time overall.
For example: Simple ones like Prod_DMZ_Web_SG or Dev_MGMT_Utility_SG (or) complex coded ones for large scale deployments like
USVA5LXWEBP001- US East Virginia AZ 5 Linux Web Server Production 001
This helps in better management of security groups.
Practice 11) ELB on Amazon VPC:  When using Amazon ELB for Web Applications, put all other EC2 instances( Tiers like App,cache,DB,BG etc)  in private subnets as much possible. Unless there is a specific requirement where instances need outside world access and EIP attached, put all instances in private subnet only. Only ELBs should be provisioned in Public Subnet as secure practice in Amazon VPC environment.
Practice 12) Control your outgoing traffic in Amazon VPC: If you are looking for better security, for the traffic going to internet gateway use Software’s like Squid or Sophos to restrict the ports,URL,Domains etc so that all traffic go through the proxy tier controlled and it also gets logged. Using these proxy/security systems we can also restrict the unwanted ports, by doing so,  if there is any security compromise to the application running inside Amazon VPC they can be detected by auditing the restricted connections captured from the logs. This helps in corrective security measure.
Practice 13) Plan your NAT Instance Type: Whenever your Application EC2 instances residing inside private subnet of Amazon VPC are making Web Service/HTTP/S3/SQS calls they go through NAT instance. If you have designed Auto scaling for your application tier and there are chances ten’s of app EC2 instances are going to make lots of web calls concurrently, NAT instance will become a performance bottleneck at this juncture. Size your NAT instance capacity depending upon application needs for avoiding performance bottlenecks. Using the NAT instances provides us with advantages of saving cost of Elastic IP and provides extra security by not exposing the instances to outside world for accessing the internet.
Practice 14) Spread your NAT instance with Multiple Subnets: What if you have hundreds of EC2 instances inside your Amazon VPC and they are making lots of heavy web service/HTTP calls concurrently. A single NAT instance with even largest EC2 size cannot handle that bandwidth sometimes and may become performance bottleneck. In Such scenarios, span your EC2 across multiple subnets and create NAT’s for each subnet. This way you can spread your out going bandwidth and improve the performance in your VPC based deployments.
Practice 15) Use EIP when needed: At times you may need to keep a part of your application services to be kept in Public subnet for external communication. It is recommended practice to associate them with Amazon Elastic IP and white list these IP address in the target services used by them
Practice 16) NAT instance practices : If needed, enable Multi factor authentication on NAT instance. SSH and RDP ports are open only on sources and destination IP’s, not global network (0.0.0.0/0). SSH / RDP ports are opened only on static exit IP’s not dynamic exit IP’s.
Practice 17) Plan your Tunnel between On-Premise DC to Amazon VPC: 
Select the right mechanism to connect your on premises DC to Amazon VPC. This will help you to connect the EC2 instance via private IP’s in a secure manner.
  • Option 1: Secure IPSec tunnel to connect a corporate network with Amazon VPC (http://aws.amazon.com/articles/8800869755706543)
  • Option 2 : Secure communication between sites using the AWS VPN CloudHub (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPN_CloudHub.html)
  • Option 3: Use Direct connect between Amazon VPC and on premise when you have lots of data to be transferred with reduced latency (or) you have spread your mission critical workloads across cloud and on premise. Example: Oracle RAC in your DC and Web/App tier in your Amazon VPC. Contact us if you need help on setting up direct connect between Amazon VPC and DC.
Practice 18) Always span your Amazon VPC across multiple subnets in Multiple Availability zones inside a Region. This helps is architecting high availability inside your Amazon VPC properly. Example: Classification of the VPC subnet : WEB Tier Subnet : 10.0.10.0/24 in Az1 and 10.0.11.0/24 in Az2, Application Tier Subnet :  10.0.12.0/24 and 10.0.13.0/24, DB Tier Subnet :  10.0.14.0/24 and 10.0.15.0/24, Cache Tier Subnet : 10.0.16.0/24 and 10.0.17.0/24 etc
Practice 19) Good security practice is that to have only public subnet with route table which carries route to internet gateway. Apply this wherever applicable.
Practice 20) Keep your Data closer : For small scale deployments in AWS where cost is critical than high availability, It is better to keep the Web/App in same availability zone as of ElastiCache , RDS etc inside your Amazon VPC. Design your subnets accordingly to suit this. This is not a recommended architecture for applications demanding High Availability.
Practice 21) Allow and Deny Network ACL : Create Internet outbound allow and deny network ACL in your VPC.
First network ACL: Allow all the HTTP and HTTPS outbound traffic on public internet facing subnet.
Second network ACL: Deny all the HTTP/HTTPS traffic. Allow all the traffic to Squid proxy server or any virtual appliance.
Practice 22 ) Restricting Network ACL : Block all the inbound and outbound ports. Only allow application request ports. These are stateless traffic filters that apply to all traffic inbound or outbound from a Subnet within VPC. AWS recommended Outbound rules : http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_NACLs.html
Practice 23) Create route tables only when needed and use the Associations option to map subnets to the route table in your Amazon VPC
Practice 24) Use Amazon VPC Peering (new) : Amazon Web Services has introduced VPC peering feature which is quite useful one. AWS VPC peering connection is a networking connection between two Amazon VPCs that enables you to route traffic between them using private IP addresses. Currently it can be in same AWS region, Instances in either VPC can communicate with each other as if they are within the same network. Since AWS uses the existing infrastructure of a VPC to create a VPC peering connection; it is neither a gateway nor a VPN connection, and does not rely on a separate piece of physical hardware (which essentially means there is no single point of failure for communication or a bandwidth bottleneck).

We have seen it is useful in following scenarios :
  1. Large Enterprises usually run Multiple Amazon VPC in single region and some of their applications are so interconnected that they may need to access them privately + securely inside AWS. Example Active Directory, Exchange, Common business services will be usually interconnected.
  2. Large Enterprise have different AWS accounts for different business units/teams/departments , at times systems deployed by some business units in different AWS accounts need to be shared or need to consume a shared resource privately. Example: CRM , HRMS ,File Sharing etc can be internal and shared. In such scenarios VPC peering comes very useful.
  3. Customer can peer their VPC with their core suppliers to have tighter integrated access of their systems.
  4. Companies offering Infra/Application Managed Services on AWS can now safely peer into customer Amazon VPC and provide monitoring and management of AWS resources.

Practice 25) Use Amazon VPC: It is highly recommended that migrate all your new workloads inside Amazon VPC rather than Amazon Classic Cloud. I also strongly recommend to migrate your existing workloads from Amazon Classic cloud to Amazon VPC in phases or one shot which ever is feasible. In addition to the benefits of the VPC that is detailed in the start of the article, AWS has started introducing lots of features which are compatible only inside VPC and in the AWS marketplace as well there are lots of products which are compatible only with Amazon VPC.  So make sure you leverage this strength of VPC. If you require any help for this migration please contact me.

readers feel free to suggest more.. I will link relevant ones in this article