Securing Cassandra

Data security is a major concern and is given top priority in every organization. Securing sensitive data and keeping it out of the hands of those who should not have access is challenging even in traditional database environments, let alone in a cloud-hosted database. Data should be secured both in flight and at rest. In this blog, we will talk about securing data in a Cassandra database in a cloud environment, specifically on AWS. We will divide the blog into two parts:

  1. Secure Cassandra on AWS
  2. Cassandra data access security

Secure Cassandra on AWS

Cassandra is at its best when hosted across multiple datacenters. Hosting it on the cloud across multiple datacenters reduces cost considerably and gives you the peace of mind of knowing that you can survive regional outages. However, securing the cloud infrastructure is the most fundamental activity that needs to be carried out when hosting on the cloud.

Securing Ports

Securing ports and preventing unknown host access is the foremost thing to do when hosting on the cloud. Cassandra needs the following ports to be opened on your firewall for a multi-node cluster; otherwise, each node will act as a standalone cluster.

Public ports

Port Number   Description
22            SSH port

 

Create a Security Group with a default rule that allows SSH traffic on port 22 (both inbound and outbound).

  1. Click ‘ADD RULE’ (both inbound and outbound)
  2. Choose ‘SSH’ from the ‘Type’ dropdown
  3. Enter only the allowed IPs in the ‘Source’ (inbound) / ‘Destination’ (outbound) field.

Private – Cassandra inter node ports

Ports used by the Cassandra cluster for inter-node communication must be restricted so that they communicate only within the cluster, blocking traffic to and from external resources.

Port Number   Description
7000          Inter-node communication without SSL encryption enabled
7001          Inter-node communication with SSL encryption enabled
7199          Cassandra JMX monitoring port
5599          Private port for DSEFS inter-node communication

 

To configure inter-node communication ports in a Security Group:

  1. Click ‘ADD RULE’.
  2. Choose ‘Custom TCP Rule’ from the ‘Type’ dropdown.
  3. Enter the port number in the ‘Port Range’ column.
  4. Choose ‘Custom’ from the ‘Source’ (inbound) / ‘Destination’ (outbound) dropdown and enter the same Security Group ID as the value. This allows communication over the configured port only within the cluster, provided this Security Group is attached to all the nodes in the Cassandra cluster (an equivalent AWS CLI command is shown below).
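For reference, the same kind of self-referencing inbound rule can also be created with the AWS CLI. This is only an illustrative sketch; the Security Group ID below is a placeholder, and the command would be repeated for each inter-node port (7000, 7001, 7199, 5599):

$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 7000 --source-group sg-0123456789abcdef0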

Public – Cassandra client ports

The following ports need to be secured and opened only for the clients that will be connecting to our cluster.

Port Number   Description
9042          Client port (CQL native transport) without SSL encryption enabled
9142          Client port (CQL native transport) with SSL; should be open when both encrypted and unencrypted connections are required
9160          DSE client (Thrift) port

 

To configure public ports in a Security Group:

  1. Click ‘ADD RULE’.
  2. Choose ‘Custom TCP Rule’ from the ‘Type’ dropdown.
  3. Enter the port number in the ‘Port Range’ column.
  4. Choose ‘Anywhere’ from the ‘Source’ (inbound) / ‘Destination’ (outbound).

To restrict the public ports to a known IP or IP range, replace step 4 with:

  4. Choose ‘Custom’ from the ‘Source’ (inbound) / ‘Destination’ (outbound) dropdown and provide the IP value or the CIDR block corresponding to the IP range.
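As an illustrative sketch (the Security Group ID and CIDR block below are placeholders), the equivalent AWS CLI call for the client port 9042 would look like this:

$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 9042 --cidr 203.0.113.0/24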

Now that we have configured the firewall, our VMs are secured against unknown access. It is recommended to create Cassandra clusters in a private subnet within your VPC which does not have Internet access.

Create a NAT instance in a public subnet, or configure a NAT Gateway, that can route traffic from the Cassandra cluster in the private subnet for software updates.

Cassandra Data Access Security

Securing data involves the following security aspects:

  1. Node to node communication
  2. Client to node communication
  3. Encryption at rest
  4. Authentication and authorization

Node to Node and Client to Node Communication Encryption

Cassandra is a master-less database. The master-less design offers no single point of failure for any database process or function. Every node in Cassandra is the same, and reads and writes can be served by any node for any query on the database. So, there is a lot of data transfer between the nodes in the cluster. When the database is hosted on a public cloud network, this communication needs to be secured. Likewise, the data transferred between the database and clients over the public network is always at risk. To secure the data in flight in these scenarios, encrypting the data and sending it over SSL/TLS is the widely preferred approach.

Most developers are not exposed to encryption in their day-to-day work, and setting up an encryption layer is always a tedious process. Cassandra helps here by providing this as a built-in feature. All we need to do is enable the server_encryption_options: and client_encryption_options: sections in the cassandra.yaml file and provide the required certificates and keys. Cassandra takes care of encrypting the data during node-to-node and client-to-server communication.

Additionally, Cassandra supports client certificate authentication. Imagine that, without authentication, the cluster only expects an SSL key to prove that we are another Cassandra node: we could write programs that attach to the cluster and execute arbitrary commands, listen to writes on arbitrary token ranges, or even create an admin account in the system_auth table.

To avoid this, Cassandra uses client certificate authentication. With this approach, Cassandra takes the extra step of verifying the client against a local trust store. If it does not recognize the client’s certificate, it will not accept the connection. This additional verification can be enabled by setting require_client_auth: true in the cassandra.yaml configuration file.

In the rest of the blog we will see the step-by-step process of enabling and configuring the cluster for SSL connections. If you already have a certificate, you can skip ‘Generating Certificates using OpenSSL’.

Generating Certificates using OpenSSL

Most UNIX systems should have the OpenSSL tool installed. If it is not available, install OpenSSL before proceeding further.

Steps:

  1. Create a configuration file gen_ca_cert.conf with the below configurations.

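The original configuration listing is not available here; the following is a representative gen_ca_cert.conf, where the organization fields and the output_password are placeholders you should replace with your own values:

[ req ]
distinguished_name = req_distinguished_name
prompt             = no
output_password    = mypass
default_bits       = 2048

[ req_distinguished_name ]
C  = US
O  = 8KMiles
OU = TestCluster
CN = rootCa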

  2. Run the following OpenSSL command to create the CA:
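A typical command for this step looks like the one below; the ca-key and ca-cert file names are illustrative and are reused in the later steps:

$ openssl req -config gen_ca_cert.conf -new -x509 -keyout ca-key -out ca-cert -days 365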

  3. You can verify the contents of the certificate you just created with the following command:
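For example, assuming the CA certificate was written to ca-cert as above:

$ openssl x509 -in ca-cert -text -noout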

You can generate a certificate for each node if required, but doing so is not recommended, because it is very tough to maintain a separate key for every node. Imagine that when a new node is added to the cluster, its certificate needs to be added to all other nodes, which is a tedious process. So, we recommend using the same certificate for all the nodes. The following steps will help you use the same certificate for all the nodes.

Building Keystore

I will explain building the keystore for a 3-node cluster. The same steps can be followed for an n-node cluster.

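The keystore for each node can be generated with keytool. The command below is an illustrative sketch: the alias, keystore file name, passwords and the -dname fields are placeholders, and the command is repeated for node2 and node3:

$ keytool -genkeypair -keyalg RSA -alias node1 -keystore node1-server-keystore.jks -storepass keystorepass -keypass keystorepass -validity 365 -keysize 2048 -dname "CN=node1, OU=TestCluster, O=8KMiles, C=US"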

To verify that the keystore was generated with the correct key pair information and is accessible, execute the below command:

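For example, using the placeholder keystore name and password from the previous step:

$ keytool -list -keystore node1-server-keystore.jks -storepass keystorepass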

With our key stores created and populated, we now need to export a certificate from each node’s key store as a “Signing Request” for our CA:

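An illustrative command for one node (the file names and passwords are the same placeholders as before; repeat per node):

$ keytool -keystore node1-server-keystore.jks -alias node1 -certreq -file node1_cert_sr -keypass keystorepass -storepass keystorepass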

With the certificate signing requests ready to go, it’s now time to sign each with our CA’s public key via OpenSSL:

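A representative signing command, assuming the CA files (ca-cert, ca-key) and placeholder names used so far; the -passin value is the CA key password set in gen_ca_cert.conf:

$ openssl x509 -req -CA ca-cert -CAkey ca-key -in node1_cert_sr -out node1_cert_signed -days 365 -CAcreateserial -passin pass:mypass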

Add the CA certificate, followed by the signed node certificate, into each node’s keystore via the -import subcommand of keytool.

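Illustrative commands for one node, first importing the CA root certificate and then the signed node certificate (placeholder names and passwords again):

$ keytool -keystore node1-server-keystore.jks -alias CARoot -import -file ca-cert -noprompt -keypass keystorepass -storepass keystorepass
$ keytool -keystore node1-server-keystore.jks -alias node1 -import -file node1_cert_signed -keypass keystorepass -storepass keystorepass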

Building Trust Store

Since Cassandra uses Client Certificate Authentication, we need to add a trust store to each node. This is how each node will verify incoming connections from the rest of the cluster.

We need to create the trust store by importing the CA root certificate’s public key:

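For example (the truststore file name and password are placeholders):

$ keytool -keystore server-truststore.jks -alias CARoot -importcert -file ca-cert -noprompt -storepass truststorepass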

Since all our instance-specific keys have now been signed by the CA, we can share this trust store instance across the cluster.

Configuring the Cluster

After creating all the required files, you can keep the keystore and truststore files in /usr/local/lib/cassandra/conf/ or any directory of your choice, but make sure that the Cassandra daemon has access to that directory. By making the below configuration changes in the cassandra.yaml file, the inbound and outbound requests will be encrypted.

Enable Node to Node Encryption

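The exact values depend on your environment; the snippet below is a representative sketch of the server_encryption_options section, using the placeholder paths and passwords from the earlier steps (each node would point to its own keystore file):

server_encryption_options:
    internode_encryption: all
    keystore: /usr/local/lib/cassandra/conf/node1-server-keystore.jks
    keystore_password: keystorepass
    truststore: /usr/local/lib/cassandra/conf/server-truststore.jks
    truststore_password: truststorepass
    require_client_auth: true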

Enable Client to Node Encryption

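Likewise, a representative sketch of the client_encryption_options section with the same placeholder values:

client_encryption_options:
    enabled: true
    keystore: /usr/local/lib/cassandra/conf/node1-server-keystore.jks
    keystore_password: keystorepass
    require_client_auth: true
    truststore: /usr/local/lib/cassandra/conf/server-truststore.jks
    truststore_password: truststorepass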

Repeat the above process on all the nodes in the cluster, and your cluster data is secured in flight and from unknown access.

Author Credits: This article was written by Bharathiraja S, Senior Data Engineer at 8KMiles Software Services.

Cassandra Backup and Restore Methods

Cassandra is a distributed database management system. In Cassandra, data is replicated among multiple nodes across multiple data centers. Cassandra can survive without any interruption in service when one or more nodes are down. It keeps its data in SSTable files. SSTables are stored in the keyspace directory within the data directory path specified by the ‘data_file_directories’ parameter in the cassandra.yaml file. By default, the SSTable directory path is /var/lib/cassandra/data/<keyspace_name>. However, Cassandra backups are still necessary to recover from the following scenarios:

  1. Errors made in data by client applications
  2. Accidental deletions
  3. Catastrophic failures that require you to rebuild your entire cluster
  4. Data corruption
  5. The need to roll back the cluster to a known good state
  6. Disk failure

Cassandra Backup Methods

Cassandra provides two types of backup: snapshot-based backup and incremental backup.

Snapshot Based Backup

Cassandra provides the nodetool utility, a command line interface for managing a cluster. The nodetool utility offers a useful command for creating snapshots of the data. The nodetool snapshot command flushes memtables to disk and creates a snapshot by creating hard links to the SSTables (SSTables are immutable). The nodetool snapshot command takes snapshots on a per-node basis. To take an entire cluster snapshot, the nodetool snapshot command should be run using a parallel ssh utility, such as pssh. Alternatively, the snapshot of each node can be taken one by one.

It is possible to take a snapshot of all keyspaces in a cluster, or certain selected keyspaces, or a single table in a keyspace. Note that you must have enough free disk space on the node for taking the snapshot of your data files.

The schema does not get backed up in this method. This must be done manually and separately.
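One simple way to do that is to export the schema with cqlsh; for example, for a keyspace named university (the sample keyspace used later in this post):

$ cqlsh -e "DESCRIBE KEYSPACE university" > university_schema.cql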

Example:

a. All keyspaces snapshot

If you want to take a snapshot of all keyspaces on the node, then run the below command.

$ nodetool snapshot

The following message appears:

Requested creating snapshot(s) for [all keyspaces] with snapshot name [1496225100] Snapshot directory: 1496225100

The snapshot directory is /var/lib/cassandra/data/keyspace_name/table_name-UUID/snapshots/1496225100

b. Single keyspace snapshot

Assume you have created the keyspace university. To take a snapshot of the keyspace with a custom snapshot name, run the below command.

$ nodetool snapshot -t 2017.05.31 university

The following output appears:

Requested creating snapshot(s) for [university] with snapshot name [2017.05.31]

Snapshot directory: 2017.05.31

c. Single table snapshot

If you want to take a snapshot of only the student table in the university keyspace, then run the below command.

$ nodetool snapshot --table student university

The following message appears:

Requested creating snapshot(s) for [university] with snapshot name [1496228400]

Snapshot directory: 1496228400

After completing the snapshot, you can move the snapshot files to another location such as AWS S3, Google Cloud, or MS Azure. You must also back up the schema, because Cassandra can only restore data from a snapshot when the table schema exists.

Advantages:

  1. Snapshot-based backup is simple and much easier to manage.
  2. The Cassandra nodetool utility provides the nodetool clearsnapshot command, which removes the snapshot files.

Disadvantages:

  1. For large datasets, it may be hard to take a daily backup of the entire keyspace.
  2. It is expensive to transfer large snapshot data to a safe location like AWS S3

Incremental Backup

Cassandra also provides incremental backups. By default, incremental backup is disabled. It can be enabled by changing the value of “incremental_backups” to “true” in the cassandra.yaml file.

Once enabled, every time a memtable is flushed to an SSTable, Cassandra creates a hard link to the flushed SSTable in a backups directory under the keyspace data directory. In Cassandra, incremental backups contain only new SSTable files; they are dependent on the last snapshot created.

In the case of incremental backup, less disk space is required because it only contains links to new SSTable files generated since the last full snapshot.

Advantages:

  1. The incremental backup reduces disk space requirements.
  2. Reduces the transfer cost.

Disadvantages:

  1. Cassandra does not automatically clear incremental backup files, and there is no built-in tool to do so. If you want to remove the hard-link files, you have to write your own script for that.
  2. It creates lots of small files in the backup; file management and recovery are not trivial tasks.
  3. It is not possible to select a subset of column families for incremental backup.

Cassandra Restore Methods

Backups are meaningful only when they are restorable, for example when a keyspace gets deleted, a new cluster gets launched from the backup data, or a node gets replaced. Restoring backed-up data is possible from snapshots, and if you are using incremental backups then you also need all incremental backup files created after the snapshot. There are mainly two ways to restore data from backup: one is using nodetool refresh and the other is using sstableloader.

Restore using nodetool refresh:

The nodetool refresh command loads newly placed SSTables onto the system without a restart. This method is used when a new node replaces a node which is not recoverable. Restoring data from a snapshot is possible only if the table schema exists. Assuming you have created a new node, follow the steps below:

  1. Create the schema if it is not created already.
  2. Truncate the table, if necessary.
  3. Locate the snapshot folder (/var/lib/cassandra/data/keyspace_name/table_name-UUID/snapshots/snapshot_name) and copy the snapshot SSTable files into the /var/lib/cassandra/data/keyspace_name/table_name-UUID directory.
  4. Run nodetool refresh (illustrative commands are shown below).
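As an illustrative sketch of steps 3 and 4, using the sample university.student table and the snapshot name from earlier (the UUID in the directory name is environment-specific and left as a placeholder):

$ cp /var/lib/cassandra/data/university/student-<UUID>/snapshots/2017.05.31/* /var/lib/cassandra/data/university/student-<UUID>/
$ nodetool refresh university student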

Restore using sstableloader:

The sstableloader loads a set of SSTable files into a Cassandra cluster. The sstableloader supports the following use cases:

  1. Loading external data
  2. Loading existing SSTables
  3. Restore snapshots

The sstableloader does not simply copy the SSTables to every node; it transfers the relevant part of the data to each node and also maintains the replication factor. Here, sstableloader is used to restore snapshots. Follow the steps below to restore using sstableloader:

  1. Create the schema if it does not exist.
  2. Truncate the table, if necessary.
  3. Bring your backup data to a node from AWS S3, Google Cloud, or MS Azure. For example, download your backup data to /home/data, keeping the keyspace_name/table_name directory structure that sstableloader expects.
  4. Run the below command:
    sstableloader -d <node_ip> /home/data/keyspace_name/table_name

 

Author Credits: This article was written by Sebabrata Ghosh, Data Engineer at 8KMiles Software Services; you can reach him here.

 

8KMiles strikes the right balance between Cloud Security and Performance

The healthcare industry is one of the sectors that face major challenges when it comes to embracing cloud transformation. Regulation-specific security and huge amounts of sensitive data are the major reasons, and there is a constant need for technology and information heads in healthcare organizations to maintain the right equilibrium between security and privacy without compromising on IT infrastructure budgets and performance. In this context, a Capability Maturity (CMM) Level 5 healthcare prospect approached 8KMiles with a specific set of requirements. The prospect company was using the CPSI application and had an enormous number of rectifications that needed to be either made or migrated. The prospect chose 8KMiles as its preferred development partner because 8KMiles is a state-of-the-art solution provider with an agile team of experts who practice Scrum and are ready to take up ad-hoc requirements with a 24/7 development support system.

8KMiles worked extensively and collaborated with the prospect company to:

1) Establish formal Business Relationship with the prospect.

2) Understand the Business needs and requirements:

a. User Interface – The interface involves multiple billing screens, complicating navigation and unduly delaying task completion.

b. E-mail Messaging limitation – It is not possible to send messages to more than one person.

c. Compliance with HL7 requirements – Integration with other HL7-compliant systems is minimal; it is unable to interface with the radiologist even with an HL7 interface through a fairly basic MS SQL Server database.

d. Workflow Management – The workflow management does not capture many areas of healthcare thus missing out on benefits

e.  Interoperability – CPSI does not allow FHIR (Fast Healthcare Interoperability Resources Specification) compatible APIs for open access to patient data

f.  Security – Multi-layered approach to security is not provided for limiting employee education

g.  Medical Records Synch – Updated information of patients treated at different facilities is not available on the fly

h.  Lack of standardized terminology, system architecture and indexing – The system is inflexible and incapable of capturing the diverse requirements of the different healthcare disciplines

i.   Integration Issues – Integration of the hospital EMR with the Physician office EMR in a seamless fashion is not happening

j.   There were too many switches in the role hierarchy, which were not recorded properly

3) 8KMiles studied the requirement systematically and came up with solutions based on Agile methodology for the above pain-points.

a. User Interface – The interface involves SSO, which allows the user to provide a one-time credential and open any number of applications/resources with a single click.

b. E-mail Messaging limitation :

i. The 8K Miles Access Governance & Identity Management Solution allows sending multiple mails based on approvals, rejections, attestations, and re-certifications.

ii. Multi-level approval and messaging is possible.

c. Compliance with HL7 requirements:

i. The 8K Miles Access Governance & Identity Management Solution allows the customer to integrate any database, such as MS SQL Server, Oracle, or IBM DB2.

ii. We enable our customers/employees to integrate with various portals through SSO – Single Sign-On (e.g., a radiologist can log in to multiple portals with a single credential).

d. Workflow Management – The workflow management and policy compliance capabilities help in capturing areas of restriction in healthcare, such as providing the right access to the right resource for the right user at the right time.

e. Interoperability – The 8KMiles Access Governance & Identity Management Solution and SSO help in providing fast access to FHIR-related applications.

f. Security – 8K Miles Access Governance & Identity Management Solution Provides Multilevel Approval & Parallel Approval

g. Medical Records Synch – The 8K Miles Access Governance & Identity Management Solution will be integrated and synced with different databases, so updated information on patients treated at different facilities will be available on the fly at any point in time.

h. Lack of Standardized terminology, system architecture and indexing – Highly customizable, flexible to handle any requirement based on Health Care Needs related to Identity & Access Governance.

i. Integration Issues – Integration of the hospital EMR with the Physician office EMR in a seamless fashion is provided using SSO.

j. Switch – 8K Miles Access Governance & Identity Management Solution helps in providing distribution of switches & roles to multiple users on a daily basis.

If you are experiencing similar problems in your healthcare business, please write to sales@8kmiles.com.

Cost Optimization Tips for Azure Cloud-Part III

In continuation of my previous blog, I am going to jot down more tips on how to optimize cost while moving into the Azure public cloud.

1. UPGRADE INSTANCES TO THE LATEST GENERATION-

With Microsoft introducing the next generation of Azure deployment via Azure Resource Manager (ARM), we can gain significant performance improvements just by upgrading VMs to the latest versions (from Azure V1 to Azure V2). In all cases the price would be the same or nearly the same.
For example, if you upgrade a DV1-series VM to a DV2-series VM, you get 35-40% faster processing for the same price point.

2. TERMINATE ZOMBIE ASSETS –

It is not enough to shut down VMs from within the instance to avoid being billed, because Azure continues to reserve the compute resources for the VM, including a reserved public IP. Unless you need VMs to be up and running all the time, shut down and deallocate them to save on cost. This can be done from the Azure Management portal or Windows PowerShell.

3. DELETING A VM-

If you delete a VM, the VHDs are not deleted. That means you can safely delete the VM without losing data. However, you will still be charged for storage. To delete the VHD, delete the file from Blob storage.

  •  When an end-user’s PC makes a DNS query, it doesn’t contact the Traffic Manager Name servers directly. Instead, these queries are sent via “recursive” DNS servers run by enterprises and ISPs. These servers cache the DNS responses, so that other users’ queries can be processed more quickly. Since these cached responses don’t reach the Traffic Manager Name servers, they don’t incur a charge.

The caching duration is determined by the “TTL” parameter in the original DNS response. This parameter is configurable in Traffic Manager; the default is 300 seconds, and the minimum is 30 seconds.

By using a larger TTL, you can increase the amount of caching done by recursive DNS servers and thereby reduce your DNS query charges. However, increased caching will also impact how quickly changes in endpoint status are picked up by end users, i.e. your end-user failover times in the event of an endpoint failure will become longer. For this reason, we don’t recommend using very large TTL values.

Likewise, a shorter TTL gives more rapid failover times, but since caching is reduced, the query counts against the Traffic Manager name servers will be higher.

By allowing you to configure the TTL value, Traffic Manager enables you to make the best choice of TTL based on your application’s business needs.

  • If you provide write access to a blob, a user may choose to upload a 200GB blob. If you’ve given them read access as well, they may choose to download it 10 times, incurring 2TB in egress costs for you. Again, provide limited permissions to help mitigate the potential of malicious users. Use short-lived Shared Access Signatures (SAS) to reduce this threat (but be mindful of clock skew on the end time).
  • Azure App Service charges are applied to apps in stopped state. Please delete apps that are not in use or update tier to Free to avoid charges.
  • In Azure Search, the stop button is meant to stop traffic to your service instance. As a result, your service is still running and will continue to be charged the hourly rate.
  • Use Blob storage to store Images, Videos and Text files instead of storing in SQL Database. The cost of the Blob storage is much less than SQL database. A 100GB SQL Database costs $175 per month, but the Blob storage costs only $7 per month. To reduce the cost and increase the performance, put the large items in the blob storage and store the Blob Record key in SQL database.
  • Cycle out old records and tables in your database. This saves money, and knowing what you can or cannot delete is important if you hit your database Max Size and you need to quickly delete records to make space for new data.
  • If you intend to use substantial amount of Azure resources for your application, you can choose to use volume purchase plan. These plans allow you to save 20 to 30 % of your Data Centre cost for your larger applications.
  • Use a strategy for removing old backups such that you maintain history but reduce storage needs. If you maintain backups for last hour, day, week, month and year, you have good backup coverage while not incurring more than 25% of your database costs for backup. If you have 1GB database, your cost would be $9.99 per month for the database and only $0.10 per month for the backup space.
  • An advantage of Azure DocumentDB stored procedures is that they enable applications to perform complex batches and sequences of operations directly inside the database engine, closer to the data. So, the network traffic latency cost for batching and sequencing operations can be completely avoided. Another advantage of using stored procedures is that they get implicitly pre-compiled to the byte code format upon registration, avoiding script compilation costs at the time of each invocation.
  • The default of a cloud service size is ‘small’. You can change it to extra small in your cloud service – properties – settings. This will reduce your costs from $90 to $30 a month at the time of writing. The difference between ‘extra small’ and ‘small’ is that the virtual machine memory is 780 MB instead of 1780 MB.
  • Windows Azure Diagnostics may burst your bill on storage transactions if you do not control it properly.

We need to define what kind of logs (IIS logs, crash dumps, FREB logs, arbitrary log files, performance counters, event logs, etc.) are to be collected and sent to Windows Azure Storage, either on a scheduled basis or on demand.

However, if you do not carefully define what you really need for the diagnostic info, you might end up paying an unexpected bill.

Assuming the following figures:

  • You run a few applications that require the high processing power of 100 instances
  • You apply 5 performance counter logs (Processor % Processor Time, Memory Available Bytes, Physical Disk % Disk Time, Network Interface Connection: Bytes Total/sec, Processor Interrupts/sec)
  • You perform a scheduled transfer every 5 seconds
  • The instances will run 24 hours per day, 30 days per month

How much does it cost for storage transactions per month?

5 counters X 12 times X 60 min X 24 hours X 30 days X 100 instances = 259,200,000 transactions

$ 0.01 per 10,000 transactions X 259,200,000 transactions = $ 259.2 per month

To bring it down, ask whether you really need to monitor all 5 performance counters every 5 seconds. What if you reduce them to 3 counters and monitor every 20 seconds?

3 counters X 3 times X 60 min X 24 hours X 30 days X 100 instances = 38,880,000 transactions

$ 0.01 per 10,000 transactions X 38,880,000 transactions = $ 38.88 per month

You can see how much you save with these numbers. Windows Azure Diagnostics is really needed, but using it improperly may cause you to pay unnecessary money.

  • An application organizes its blobs in a different container per user. It also allows users to check the size of each container. For that, a function is created to loop through all the files inside the container and return the size in decimal. This functionality is exposed on a UI screen, and an admin can typically call this function a few times a day.

Assuming the following figures for illustration:

  • I have 1,000 users.
  • I have 10,000 files on average in each container.
  • The admin calls this function 5 times a day on average.
  • How much does it cost for storage transactions per month?

Remember: a single Get Blob request is considered 1 transaction!

1,000 users X 10,000 files X 5 queries X 30 days = 1,500,000,000 transactions

$ 0.01 per 10,000 transactions X 1,500,000,000 transactions = $ 1,500 per month

Well, that’s not cheap at all, so let’s bring it down.

Do not expose this functionality as a real-time query to the admin. Consider automatically running this function once a day and saving the size somewhere, then just let the admin view the daily result (day by day). By limiting the admin to viewing it only once a day, the monthly cost will look like this:

1,000 users X 10,000 files X 1 query X 30 days = 300,000,000 transactions

$ 0.01 per 10,000 transactions X 300,000,000 transactions = $ 300 per month

Author Credits: This article was written by Utkarsh Pandey, Azure Solution Architect at 8KMiles Software Services and originally published here