Apache Solr to Amazon CloudSearch Migration Tool

In this post, we introduce S2C, the Apache Solr to Amazon CloudSearch Migration Tool. S2C is a Linux console utility that helps developers and engineers migrate a search index from Apache Solr to Amazon CloudSearch.

Very often customers initially build search for their website or application on top of Solr, but later run into challenges like elastic scaling and managing the Solr servers. This is a typical scenario we have observed in our years of search implementation experience. For such use cases, Amazon CloudSearch is a good choice. Amazon CloudSearch is a fully-managed service in the cloud that makes it easy to set up, manage, and scale a search solution for your website. To know more, please read the Amazon CloudSearch documentation.

We see this trend growing every year: organizations of various sizes are migrating their workloads to Amazon CloudSearch and leveraging the benefits of a fully managed service. For example, Measured Search, an analytics and e-Commerce platform vendor, found it easier to migrate to Amazon CloudSearch than to scale Solr themselves (see article for details).

Since Amazon CloudSearch is built on top of Solr, it exposes all the key features of Solr while providing the benefits of a fully managed service in the cloud such as auto-scaling, self-healing clusters, high availability, data durability, security and monitoring.

In this post, we provide step-by-step instructions on how to use the Apache Solr to Amazon CloudSearch Migration (S2C) tool to migrate from Apache Solr to Amazon CloudSearch.

Before we get into the details, you can download the S2C tool from the link below.
Download link: https://s3-us-west-2.amazonaws.com/s2c-tool/s2c-cli.zip

Pre-Requisites

Before starting the migration, the following pre-requisites must be met. They cover installation and configuration on the migration server. The migration server can be the Solr server itself or an independent server that sits between your Solr server and the Amazon CloudSearch instance.

Note: We recommend running the migration from the Solr server rather than from an independent server, as this saves time and bandwidth. It is better still if the Solr server is hosted on EC2, because the latency between EC2 and CloudSearch is relatively low.

Complete the following installation and configuration steps on the migration server (i.e. your Solr server or a new independent server that connects your Solr machine and Amazon CloudSearch).

  1. The application is developed in Java. Download and install Java 8. Validate the JDK path and ensure that environment variables such as JAVA_HOME, CLASSPATH, and PATH are set correctly.
  2. We assume you have already set up an Amazon Web Services IAM account. Ensure that the IAM user has the right permissions to access AWS services such as CloudSearch.
    Note: If you do not have an AWS IAM account with the permissions mentioned above, you cannot proceed further.
  3. The IAM user should have an AWS access key and secret key. On the server that hosts the application, set up the AWS environment variables for the access key and secret key. It is important that the tool runs using the AWS environment variables.
    To set up the AWS environment variables, see the links below.
    http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/credentials.html
    http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/java-dg-roles.html
    Alternatively, you can set the following AWS environment variables by running the commands below from the Linux console, substituting your own keys.
    export AWS_ACCESS_KEY="your-access-key"
    export AWS_SECRET_KEY="your-secret-key"
  4. Note: This step applies only if the migration server is hosted on Amazon EC2.
    If you do not have an AWS access key and secret key, you can use an IAM role attached to the EC2 instance instead. A new IAM role can be created and attached to the EC2 instance at launch. The IAM role should have access to Amazon CloudSearch.
    For more information, see the link below.
    http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html
  5. Download the migration utility ‘S2C’ (you may have completed this step earlier), unzip it, and copy it to your working directory.
    Download link: https://s3-us-west-2.amazonaws.com/s2c-tool/s2c-cli.zip

S2C Utility File
The downloaded ‘S2C’ migration utility should have the following sub directories and files.

Folder / File | Description | Notes
bin | Binaries of the migration tool |
lib | Libraries required for migration |
application.conf | Configuration file that allows end users to input parameters | Requires end-user input
logback.xml | Log file configuration | Optional. Does not require end-user or developer input
s2c | Script file that executes the migration process |

Configure only application.conf and logback.xml. Do not modify any other file.

application.conf: The application.conf file holds the configuration for the new Amazon CloudSearch domain that will be created. The parameters configured in application.conf are explained below.

    s2c {
      api {
        SchemaParser = "s2c.impl.solr.DefaultSchemaParser"
        SchemaConverter = "s2c.impl.cs.DefaultSchemaConverter"
        DataFetcher = "s2c.impl.solr.DefaultDataFetcher"
        DataPusher = "s2c.impl.cs.DefaultDataPusher"
      }

api – List of APIs that are executed step by step during the migration. Do not change this.

    solr {
      dir = "files"
      server-url = "http://localhost:8983/solr/collection1"
      fetch-limit = 100
    }

dir – The base directory path of Solr. Ensure that the directory is present and valid. E.g.: /opt/solr/example/solr/collection1/conf
server-url – Server host, port and collection path. This is the endpoint used to fetch the data. If the utility is run from a different server, ensure that the firewall allows access to the IP address and port.
fetch-limit – Number of Solr documents fetched in each batch call. The developer should set this number carefully. The fetch limit depends on the following factors:

  1. Record size of a Solr record(1KB or 2KB)
  2. Latency between migration server and Amazon CloudSearch
  3. Current Request Load on the Solr Server

E.g.: If the total number of Solr documents is 100000 and the fetch limit is 100, it takes 100000 / 100 = 1000 batch calls to complete the fetch. If each Solr record is 2KB, then 100000 * 2KB = 200MB of data is migrated.
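The arithmetic in this example can be checked with a few lines of Python. This is only a sketch for sizing a fetch, not part of S2C; the function name `fetch_plan` and its parameters are made up for illustration.

```python
import math


def fetch_plan(total_docs: int, fetch_limit: int, avg_doc_kb: float):
    """Return (number of batch calls, approximate data volume in MB)."""
    # A final partial batch still needs its own call, hence the ceiling.
    batch_calls = math.ceil(total_docs / fetch_limit)
    # Decimal units (1 MB = 1000 KB), matching the rough estimate in the text.
    data_mb = total_docs * avg_doc_kb / 1000
    return batch_calls, data_mb


if __name__ == "__main__":
    calls, megabytes = fetch_plan(total_docs=100_000, fetch_limit=100, avg_doc_kb=2)
    print(f"{calls} batch calls, about {megabytes:.0f} MB")
```

With 100000 documents, a fetch limit of 100 and 2KB records, this gives 1000 batch calls and about 200 MB.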

    cs {
      domain = "collection1"
      region = "us-east-1"
      instance-type = "search.m3.xlarge"
      partition-count = 1
      replication-count = 1
    }

domain – CloudSearch domain name. Ensure that the domain name does not already exist.
region – AWS region for the new CloudSearch domain.
instance-type – Desired instance type for CloudSearch nodes. Choose the instance type based on the volume of data and the expected query volume.
partition-count – Number of partitions required for CloudSearch.
replication-count – Replication count for CloudSearch.

    wd = "/tmp"

wd – Temporary file path used to store intermediate data files and migration log files.

Running the migration

Before launching the S2C migration tool, verify the following:

    • Solr directory path – Make sure that the Solr directory path is valid and available. The tool cannot read the configuration if the path or directory is invalid.
    • Solr configuration contents – Validate that the Solr configuration files are present inside the directory. Example: solrconfig.xml, schema.xml, stopwords.txt, etc.
    • Working directory – Make sure that the working directory is present in the file system and is writable by the current user. It can be an existing directory or a new one. The working directory stores the data fetched from Solr and the migration logs.
    • Disk size – Validate the free disk space before starting the migration. If the available free disk space is less than the size of the Solr index, the fetch operations will fail. For example, if the Solr index size is 7 GB, make sure that the disk has at least 10 GB to 20 GB of free space.
      Note: The tool reads the data from Solr and stores it in a temporary directory (see the configuration wd = /tmp described above).
    • AWS environment variables – Verify that the AWS environment variables are set correctly. They are described in the pre-requisites section above.
    • Firewall rules – Validate the firewall rules for IP addresses and ports if the migration tool is run from a different server or instance. Example: the default Solr port 8983 should be open to the EC2 instance executing this tool.
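Most of the checks above can be scripted before launching the tool. The following is a minimal pre-flight sketch, not part of S2C: the function name `preflight` is hypothetical, the file names checked are the ones mentioned in this post, and the environment variable check can be dropped if the migration server relies on an EC2 IAM role instead of keys.

```python
import os
import shutil


def preflight(solr_conf_dir, work_dir, index_size_bytes, env=os.environ):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # Solr directory path and configuration contents.
    if not os.path.isdir(solr_conf_dir):
        problems.append(f"Solr directory not found: {solr_conf_dir}")
    else:
        for name in ("solrconfig.xml", "schema.xml"):
            if not os.path.isfile(os.path.join(solr_conf_dir, name)):
                problems.append(f"missing {name} in {solr_conf_dir}")

    # Working directory must exist, be writable, and have room for the index.
    if not os.path.isdir(work_dir) or not os.access(work_dir, os.W_OK):
        problems.append(f"working directory missing or not writable: {work_dir}")
    else:
        free = shutil.disk_usage(work_dir).free
        if free < index_size_bytes:
            problems.append(
                f"only {free} bytes free in {work_dir}, index needs {index_size_bytes}"
            )

    # AWS credentials as environment variables (skip when using an IAM role).
    for var in ("AWS_ACCESS_KEY", "AWS_SECRET_KEY"):
        if not env.get(var):
            problems.append(f"environment variable {var} is not set")

    return problems


if __name__ == "__main__":
    # Paths and index size below are the examples used in this post.
    for problem in preflight("/opt/solr/example/solr/collection1/conf", "/tmp", 7 * 1024**3):
        print("WARNING:", problem)
```

The disk check here only compares free space with the index size; leaving extra headroom, as the example above suggests, is still advisable.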

Run the following command from the directory ‘{S2C filepath}’.
Example: /build/install/s2c-cli

./s2c

or, to set the JVM heap size:

JVM_OPTS="-Xms2048m -Xmx2048m" ./s2c

The command above invokes the ‘s2c’ shell script, which starts the search migration process. The migration process is a series of steps that require user input, as shown in the screenshots below.
Step 1: Parse the Solr schema. The first step of the migration prompts for confirmation to parse the Solr schema and the Solr configuration file. During this step, the application generates a ‘Run Id’ folder inside the working directory.
  Example: /tmp/s2c/m1416220194655

The Run Id is a unique identifier for each migration. Note down the Run Id, as you will need it to resume the migration in case of any failure.

Step 2: Schema conversion from Solr to CloudSearch. The second step prompts for confirmation to convert the Solr schema to a CloudSearch schema. Press any key to proceed.

The second step also lists all the converted fields that are ready to be migrated from Solr to CloudSearch. If any fields are left out, this is the point at which to correct the original schema: abort the migration, identify the ignored fields, rectify the schema and re-run the migration. The screenshot below shows the fields ready for CloudSearch migration.
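To anticipate which fields may be left out, you can scan schema.xml yourself before running the tool. The sketch below is not S2C's schema parser. It assumes, following the notes near the end of this post, that only fields explicitly marked stored="true" are migrated and that a field without a stored attribute counts as not stored. The function name and the sample schema are illustrative.

```python
import xml.etree.ElementTree as ET


def split_fields_by_stored(schema_xml: str):
    """Return (stored, not_stored) lists of field names from a Solr schema.xml string."""
    root = ET.fromstring(schema_xml)
    stored, not_stored = [], []
    for field in root.iter("field"):
        name = field.get("name")
        # Assumption: a missing stored attribute is treated as "not stored".
        if field.get("stored", "false").lower() == "true":
            stored.append(name)
        else:
            not_stored.append(name)
    return stored, not_stored


SAMPLE_SCHEMA = """<schema name="example" version="1.5">
  <fields>
    <field name="description" type="text_general" indexed="true" stored="true" />
    <field name="fileName" type="string" />
  </fields>
</schema>"""

if __name__ == "__main__":
    migrated, ignored = split_fields_by_stored(SAMPLE_SCHEMA)
    print("likely migrated:", migrated)
    print("likely ignored:", ignored)
```

For the sample schema, `description` lands in the first list and `fileName` in the second.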


Step 3: Data fetch. The third step prompts for confirmation to fetch the search index data from the Solr server. Press any key to proceed. This step generates a temporary file in the working directory that holds all the documents fetched from the Solr index.


There is also an option to skip the fetch if all the Solr data is already stored in the temporary file. In that case, the prompt looks like the screenshot below.

Step 4: Data push to CloudSearch. The final step prompts for confirmation to push the search data from the temporary file store to Amazon CloudSearch. This step also creates the CloudSearch domain with the configuration specified in application.conf, including the desired instance type, replication count, and multi-AZ options.

If the domain already exists, the utility prompts you to use the existing domain. If you do not wish to use an existing domain, you can create a new CloudSearch domain from the same prompt.
Note: The console does not prompt for a ‘CloudSearch domain name’; it uses the domain name configured in the application.conf file.

Step 5: Resume (optional). If the fetch operation fails during the migration, it can be resumed. This is illustrated in the screenshot below.

Step 6: Verification. Log in to the AWS CloudSearch management console to verify the domain and its index fields.

Amazon CloudSearch lets you run test queries to validate the migration as well as the functionality of your application.
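Test queries can also be sent to the domain's search endpoint over HTTP. As a small sketch, the helper below builds a search URL for the 2013-01-01 API; the endpoint host in the example is a placeholder, and `search_url` is an illustrative name rather than anything provided by S2C or AWS.

```python
from urllib.parse import urlencode


def search_url(search_endpoint: str, query: str, size: int = 10) -> str:
    """Build a CloudSearch 2013-01-01 search URL for a simple text query."""
    params = urlencode({"q": query, "size": size})
    return f"https://{search_endpoint}/2013-01-01/search?{params}"


if __name__ == "__main__":
    # Placeholder endpoint; copy the real search endpoint from the CloudSearch console.
    print(search_url("search-collection1-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com", "star wars", size=5))
```

The resulting URL can be opened with curl or a browser, provided the domain's access policy allows search requests from your IP address.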

Features supported

  • Support for other non-Linux environments is not available for now.
  • Support for Solr Shards is not available for now. The Solr shard needs to be migrated separately.
  • The install commands may vary between Linux flavors. For example, the commands for installing software, editing files and setting permissions can differ from one flavor to another. It is left to the engineering team to choose the right commands when installing and running this migration tool.
  • Only fields configured as ‘stored’ in Solr schema.xml are supported. Non-stored fields are ignored during schema parsing.
  • The document id (unique key) is required to have the following attributes:
    1. Document ID should be 128 characters or less in size.
    2. Document ID can contain any letter, any number, and any of the following characters:      _ - = # ; : / ? @ &
    3. The link below explains how to prepare data before migrating to CloudSearch: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/preparing-data.html
  • If the conditions are not met in a document, it will be skipped during migration. Skipped records are shown in the log file.
  • If a field type (mapped to fields) is not stored, the stopwords mapped to that particular field type are ignored.

Example 1:

<field name="description" type="text_general" indexed="true" stored="true" />

Note: The above field ‘description’ will be considered for stopwords.

Example 2:

<field name="fileName" type="string" />

Note: The above field ‘fileName’ will not be migrated and will be ignored for stopwords.
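The document ID rules listed above are easy to check before a migration, so that skipped records do not come as a surprise. The sketch below reads the dash in the allowed-character list as a plain hyphen and assumes "any letter, any number" means ASCII letters and digits. Both are assumptions, and the function name is illustrative.

```python
import re

# Letters, digits, and _ - = # ; : / ? @ & with a length of 1 to 128 characters.
# Assumption: only ASCII letters and digits are accepted.
_DOC_ID = re.compile(r"[A-Za-z0-9_\-=#;:/?@&]{1,128}")


def is_valid_document_id(doc_id: str) -> bool:
    """Return True if doc_id satisfies the document ID rules described in this post."""
    return _DOC_ID.fullmatch(doc_id) is not None


if __name__ == "__main__":
    for candidate in ("product/12345", "has space", "x" * 129):
        print(repr(candidate[:20]), is_valid_document_id(candidate))
```

Running a check like this against the unique key values in Solr gives an early estimate of how many documents would be skipped.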

Please share your feedback and suggestions for improving this tool in the comments section below. The source code of the tool can be downloaded at https://github.com/8KMiles/s2c/. We have written a follow-up post about it.

About the Authors
 Dhamodharan P is a Senior Cloud Architect at 8KMiles.

 

 

 

 Dwarakanath R is a Principal Architect at 8KMiles.

 

 

Loading Big Index Data into newly launched Amazon CloudSearch engine

The search tier is the most critical part of many online verticals such as travel, e-commerce and classifieds. If users cannot search products efficiently, they cannot make their buying decisions properly, which in turn massively affects the revenues of these companies. Most of them are powered by Apache Solr, FAST, Autonomy, Elasticsearch etc. AWS also has a search service called CloudSearch, a fully-managed service in the cloud that makes it easy to set up, manage, and scale a search solution for your website. Amazon CloudSearch relieves you of the worry of hardware provisioning, setup, and maintenance. As your volume of data and traffic fluctuates, Amazon CloudSearch automatically scales to meet your needs.

On AWS infrastructure, Apache Solr has been the king and the software to beat until now. Recently it has gained a heavyweight competitor in the form of Amazon CloudSearch – API 2013-01-01.

API version 2013-01-01 of Amazon CloudSearch is internally powered by a customized version of the Apache Solr engine, and it is specifically designed for running highly scalable and available search on the Amazon Web Services cloud. The 2013 CloudSearch API has many similarities with Apache Solr, so customers can easily migrate to this version and leverage the benefits of the Amazon cloud infrastructure. We are already hearing that many AWS customers are planning to migrate from FAST, Solr and the A9 engine to the Amazon CloudSearch 2013-01-01 API engine.

My team is already migrating a couple of customers to the Amazon CloudSearch 2013-01-01 API, and I am sharing our experience of this process for the benefit of the AWS community.

Reference Migration Architecture and requirements:


In this article I am going to explore how to migrate a large index into Amazon CloudSearch. The setup is as follows:

  • A 300+ GB index containing 247+ million records spread across 105 searchable fields is migrated in a highly scalable, parallel manner on AWS infrastructure.
  • The 300+ GB index file is stored in Amazon S3.
  • A custom data loader program built on Amazon Elastic MapReduce is used for parallel loading.
  • Around 6 search.m2.2xlarge instances are created, with 2 partitions and a replication count of 5.
  • Around 10+ m1.large EMR core nodes are used for data loading. The loader can be scaled to hundreds of nodes depending on the volume and velocity of the data pump required.
  • Amazon CloudSearch infrastructure provisioning, automated partitioning and replication count are handled by AWS.

Let's get into the details below:

Step 1) Create a new Amazon CloudSearch domain: We named the search domain “bigdatasearch” and chose search.m2.2xlarge as the search instance type. Since we are planning to pump and query a 300 GB index with millions of documents, it did not make sense for us to choose a smaller Amazon CloudSearch instance type. Usually the base instance type can be selected based on the number and size of the documents you plan to maintain in Amazon CloudSearch.
Note: Here we have chosen a replication count of 5. This is a little strange in a distributed architecture, because a higher replication count for the master usually decreases the speed of document upload. But when we were experimenting with Amazon CloudSearch, we observed that it increased the speed of uploads. In addition, we observed the following:

  • If we keep the replication count at 0 or less, use a smaller search instance type and pump documents in parallel from multiple nodes, either the Amazon CloudSearch server sometimes fails or the error rates are high.
  • If we keep the replication count at 0 or less, use a larger search instance type and pump documents in parallel from multiple nodes, Amazon CloudSearch itself internally creates 3-5 nodes, and this shows up in the replication count. We are waiting to discuss this behavior with the AWS SA folks.

We will use a distributed uploading technique, which we custom built on Amazon Elastic MapReduce, to pump data to the Amazon CloudSearch server. This technique lets us write more index data in parallel.

Step 2) Select how you would like to create the Amazon CloudSearch schema: Here we chose manual setup, since we already have a schema to be migrated to Amazon CloudSearch.

The next step is to add index fields to create your Amazon CloudSearch schema configuration.

Step 3) Adding Amazon CloudSearch index fields: Once all the fields have been configured in the schema, click the Continue button. The schema file we used has 100+ fields to be indexed for this particular search domain.
Step 4) Review the setup configuration and launch:
We have 100+ index fields, with scaling options set to instance type m2.2xlarge and replication count 5, in the “bigdatasearch” domain.
Step 5) Wait until the Amazon CloudSearch infrastructure is provisioned for you in the background. This usually takes about 10 minutes. Any errors encountered while creating the index fields are also listed.
Once the Amazon CloudSearch infrastructure is provisioned at the back end, you should see that the “bigdatasearch” domain is “Active”. The search and document endpoints are published, and the number of searchable documents is currently “0”. There is only 1 CloudSearch index partition (shard) and 5 search.m2.2xlarge instances.
Step 6) Configuring synonyms: We have 2+ MB of synonyms that need to be configured in the Amazon CloudSearch domain. For this, we used the CloudSearch cli-toolkit to upload the synonyms to CloudSearch.
cs-configure-analysis-scheme -d bigdatasearch --name customanalysisscheme --lang en -e cloudsearch.ap-southeast-1.amazonaws.com --synonyms customsynonyms.txt
Since the volume of index data is huge (300+ GB), we created a custom data loader built on Amazon Elastic MapReduce to pump the data into Amazon CloudSearch in parallel. Because it is built on Amazon Elastic MapReduce, we can use the same program without modification to upload TBs of index data into the search system, using hundreds of data loader EMR core/task nodes.
Step 7) Create the Amazon Elastic MapReduce data loader cluster configuration:
Step 8) Configure the Elastic MapReduce (EMR) capacity: We are using 10 m1.large core node instances to upload the data from inside an AWS VPC. Depending on the data size (GB to TB) and the upload hours available, we can increase the capacity and number of EMR core nodes to speed up the data pump (upload) process.

To learn how Spot Instances can reduce cost on Amazon EMR, see AWS Cost Saving Tip 12: Add Spot Instances with Amazon EMR.

Step 9) Add the custom data loader program JAR to EMR:
We exported the data from an MSSQL server as a flat UTF-8 dump file and stored it in Amazon S3. We give the 300+ GB dump file as the input to the Amazon EMR CloudSearch data loader program, which uploads it into Amazon CloudSearch in parallel. The buckets for the data loader JAR, input, output and log files are configured on this screen.
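The loader itself is not shown in this post. As a rough illustration of one piece that any such loader needs, the sketch below groups documents into JSON batches of "add" operations in the CloudSearch document batch format, keeping each batch under a size cap. It is not the authors' EMR program: the function name, the 5 MB cap (the per-batch limit as we understand it, to be verified against current service limits) and the sample fields are assumptions.

```python
import json

# Assumed per-batch size cap; verify against the current CloudSearch limits.
MAX_BATCH_BYTES = 5 * 1024 * 1024


def make_batches(docs, max_bytes=MAX_BATCH_BYTES):
    """Yield JSON-encoded batches (bytes) of "add" operations.

    docs is an iterable of (doc_id, fields_dict) pairs. Each yielded batch is
    no larger than max_bytes, unless a single document alone exceeds it.
    """
    ops, size = [], 2  # 2 bytes for the surrounding "[" and "]"
    for doc_id, fields in docs:
        op = json.dumps({"type": "add", "id": doc_id, "fields": fields}).encode("utf-8")
        extra = len(op) + (1 if ops else 0)  # +1 for the comma separator
        if ops and size + extra > max_bytes:
            yield b"[" + b",".join(ops) + b"]"
            ops, size = [], 2
            extra = len(op)
        ops.append(op)
        size += extra
    if ops:
        yield b"[" + b",".join(ops) + b"]"


if __name__ == "__main__":
    sample = [(f"product-{i}", {"title": f"Sample product {i}"}) for i in range(1000)]
    for number, batch in enumerate(make_batches(sample, max_bytes=20_000), start=1):
        print(f"batch {number}: {len(batch)} bytes")
```

In a MapReduce setting, each mapper could run this over its input split and POST every batch to the domain's document endpoint (`/2013-01-01/documents/batch`), which is what allows the upload to proceed in parallel.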

Step 10) Configure Amazon CloudSearch access policies: We need to open the CloudSearch access policies to accept upload requests from the EMR cluster inside the VPC. Configure the static IPs of all the instances, or the IP range of the data loader clients.
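For reference, an IP-based access policy for the 2013-01-01 API is an IAM-style JSON document. The helper below generates one that allows document uploads from a list of addresses. Treat it as a sketch under assumptions: the policy shape and the `cloudsearch:document` action reflect our understanding of the CloudSearch access policy format and should be checked against the current documentation, and the CIDR block in the example is a documentation-only placeholder.

```python
import json


def ip_access_policy(cidr_blocks, actions=("cloudsearch:document",)):
    """Return a JSON access policy allowing the given actions from the given IP ranges."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": list(actions),
                # Restrict access to the loader nodes' public IPs or IP range.
                "Condition": {"IpAddress": {"aws:SourceIp": list(cidr_blocks)}},
            }
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    # 203.0.113.0/24 is a reserved documentation range; substitute your own addresses.
    print(ip_access_policy(["203.0.113.0/24"]))
```

The generated JSON can be pasted into the access policy editor of the CloudSearch console and replaced with a stricter policy once the load is complete.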
Step 11) Run the Amazon Elastic MapReduce data loader job:
Step 12) Analyzing the Amazon EMR data loader job output:
The output of the job can be seen in the AWS EMR job logs. Here are a few details:
  • “Map output records” in the log tells how many records were inserted into Amazon CloudSearch. We can see that 247,681,520 documents (247+ million) were pumped.
  • “Bytes Read” in the output tells the size of the data set that the job read. We can see 322387978332 bytes, which is equivalent to 300+ GB of index in Amazon CloudSearch.
  • The entire pumping process took ~30 hours with 10 m1.large core nodes. We observed that increasing the number of data loader EMR nodes, or their capacity, improves the upload speed drastically.
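The job counters above translate into throughput figures with simple arithmetic. The snippet below does that conversion using only the numbers reported in this post; the function name is illustrative, and the 30 hours is the approximate duration quoted above, so the results are estimates.

```python
DOCS = 247_681_520          # "Map output records"
BYTES_READ = 322_387_978_332  # "Bytes Read"
HOURS = 30                  # approximate job duration
NODES = 10                  # m1.large core nodes


def throughput(docs, bytes_read, hours, nodes):
    """Return aggregate and per-node upload rates derived from the job counters."""
    seconds = hours * 3600
    return {
        "docs_per_sec": docs / seconds,
        "docs_per_sec_per_node": docs / seconds / nodes,
        "mib_per_sec": bytes_read / seconds / 2**20,
        "gib_total": bytes_read / 2**30,
    }


if __name__ == "__main__":
    for name, value in throughput(DOCS, BYTES_READ, HOURS, NODES).items():
        print(f"{name}: {value:,.2f}")
```

This works out to roughly 2,300 documents per second overall, or about 230 per node, which gives a baseline for estimating how many nodes a shorter upload window would need.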
Step 13) Clean up: Reset the replication count to the level of HA needed, ideally 1-2 nodes. Once the job is completed, revert the access policies in Amazon CloudSearch. Terminate the EMR cluster and clean up any leftover resources.

Step 14) Analyzing the CloudSearch dashboard:
We observed that it takes some time for CloudSearch to reflect the actual count of indexed documents.

After pumping the 300+ GB index, you can see that 2 Amazon CloudSearch partitions (shards) are currently used to distribute 247+ million documents with 100+ index fields. This is a tremendous cost saving compared to the A9-powered Amazon CloudSearch. Amazon CloudSearch automatically created the shards based on the volume of data pumped into the system. This is cool! It reduces the maintenance headache for infra admins. It would be useful if the Amazon CloudSearch team made the partition count a configurable parameter in the future.
Step 15) Executing sample search queries: We executed some sample product search queries on the “bigdatasearch” domain to check that everything is fine. A distributed query was fired, and results came back in under a second from one of the partitions.
In short, it is cost effective compared to the old A9-powered CloudSearch. Automated scaling of the replication count for request scalability and automated scaling of partitions for data scalability relieve infra admins of headaches, and its strong Apache Solr pedigree and the long list of feature additions planned for the coming months will make it more interesting.
After working with this service for a few weeks, we feel it is going to become the major search service on AWS in the coming years, giving a tough fight to Apache Solr and Elasticsearch deployments on EC2.
This article was co-authored with Ankit @8Kmiles.