Securing Cassandra

Data security is a major concern and is given top priority in every organization. Securing sensitive data and keeping it out of the hands of those who should not have access is challenging even in traditional database environments, let alone in a cloud-hosted database. Data should be secured in flight and at rest. In this blog, we will talk about securing data in a Cassandra database in a cloud environment, specifically on AWS. We will divide the blog into two parts:

  1. Secure Cassandra on AWS
  2. Cassandra data access security

Secure Cassandra on AWS

Cassandra works best when hosted across multiple datacenters. Hosting it in the cloud across multiple datacenters reduces cost considerably and gives you peace of mind, knowing that you can survive regional outages. However, securing the cloud infrastructure is the most fundamental activity that needs to be carried out when hosting there.

Securing Ports

Securing ports and preventing unknown host access is the foremost task when hosting in the cloud. Cassandra needs the following ports to be opened on your firewall for a multi-node cluster; otherwise each node will act as a standalone cluster.

Public ports

Port Number   Description
22            SSH port

 

Create a Security Group with a default rule that allows SSH traffic on port 22 (both inbound and outbound).

  1. Click ‘ADD RULE’ (both inbound and outbound)
  2. Choose ‘SSH’ from the ‘Type’ dropdown
  3. Enter only the allowed IPs in the ‘Source’ (inbound) / ‘Destination’ (outbound) field.
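If you prefer to script this with the AWS CLI instead of the console, a rough equivalent looks like the sketch below; the group name, VPC ID, security group ID, and CIDR block are placeholders to replace with your own values.

$ aws ec2 create-security-group --group-name cassandra-sg \
    --description "Cassandra cluster security group" --vpc-id vpc-0123456789abcdef0
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 --cidr 203.0.113.0/24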

Private – Cassandra inter node ports

Ports used by the Cassandra cluster for inter-node communication must be restricted so that the nodes can talk only to each other, blocking traffic to and from external resources.

Port Number   Description
7000          Inter-node communication without SSL encryption enabled
7001          Inter-node communication with SSL encryption enabled
7199          Cassandra JMX monitoring port
5599          Private port for DSEFS inter-node communication

 

To configure inter-node communication ports in a Security Group:

  1. Click ‘ADD RULE’.
  2. Choose ‘Custom TCP Rule’ from the ‘Type’ dropdown.
  3. Enter the port number in the ‘Port Range’ column.
  4. Choose ‘Custom’ from the ‘Source’ (inbound) / ‘Destination’ (outbound) dropdown and enter this Security Group’s own ID as the value. When this Security Group is attached to all the nodes in the Cassandra cluster, this allows communication over the configured port only within the cluster.
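The same self-referencing rule can be scripted with the AWS CLI; the security group ID below is a placeholder, and the command is repeated for each inter-node port or port range.

$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 7000-7001 --source-group sg-0123456789abcdef0
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 7199 --source-group sg-0123456789abcdef0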

Public – Cassandra client ports

The following ports need to be secured and opened only for the clients that will connect to our cluster.

Port Number   Description
9042          Client port without SSL encryption enabled
9142          Client port with SSL encryption enabled; should be open when both encrypted and unencrypted connections are required
9160          DSE client port (Thrift)

 

To configure public ports in a Security Group:

  1. Click ‘ADD RULE’.
  2. Choose ‘Custom TCP Rule’ from the ‘Type’ dropdown.
  3. Enter the port number in the ‘Port Range’ column.
  4. Choose ‘Anywhere’ from the ‘Source’ (inbound) / ‘Destination’ (outbound).

To restrict the public ports to a known IP or IP range, replace step 4 with:

  4. Choose ‘Custom’ from the ‘Source’ (inbound) / ‘Destination’ (outbound) dropdown and provide the IP value or the CIDR block corresponding to the IP range.
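With the AWS CLI, the same restriction on the native client port might look like this; the security group ID and CIDR block are placeholders.

$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 9042 --cidr 10.0.2.0/24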

Now that we have configured the firewall, our VMs are secured against unknown access. It is recommended to create Cassandra clusters in a private subnet within your VPC that does not have Internet access.

Create a NAT instance in a public subnet or configure a NAT Gateway to route traffic from the Cassandra cluster in the private subnet for software updates.

Cassandra Data Access Security

Securing data access involves the following areas:

  1. Node to node communication
  2. Client to node communication
  3. Encryption at rest
  4. Authentication and authorization

Node to Node and Client to Node Communication Encryption

Cassandra is a masterless database. The masterless design offers no single point of failure for any database process or function. Every node in Cassandra is the same, and reads and writes can be served by any node for any query on the database. As a result, there is a lot of data transfer between the nodes of the cluster. When the database is hosted on a public cloud network, this communication needs to be secured. Likewise, the data transferred between the database and clients over the public network is always at risk. To secure data in flight in these scenarios, the widely preferred approach is to encrypt the data by sending it over SSL.

Most developers are not exposed to encryption in their day-to-day work, and setting up an encryption layer is usually a tedious process. Cassandra helps here with a built-in feature: all we need to do is enable the server_encryption_options: and client_encryption_options: settings in the cassandra.yaml file and provide the required certificates and keys. Cassandra then takes care of encrypting data during node-to-node and client-to-server communication.

Encryption alone is not enough, though. Imagine that, without authentication, the cluster only expects a valid SSL connection when we claim to be another Cassandra node: we could write programs that attach to the cluster and execute arbitrary commands, listen to writes on arbitrary token ranges, or even create an admin account in the system_auth table.

To avoid this, Cassandra supports Client Certificate Authentication. With this approach, Cassandra takes the extra step of verifying the client against a local trust store; if it does not recognize the client’s certificate, it will not accept the connection. This additional verification can be enabled by setting require_client_auth: true in the cassandra.yaml configuration file.

In the rest of the blog we will walk through the step-by-step process of enabling and configuring the cluster for SSL connections. If you already have a certificate, you can skip Generating Certificates using OpenSSL.

Generating Certificates using OpenSSL

Most UNIX systems have the OpenSSL tool installed. If it is not available, install OpenSSL before proceeding further.

Steps:

  1. Create a configuration file gen_ca_cert.conf with the below configurations.

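A minimal gen_ca_cert.conf might look like the following; the distinguished-name fields and the output password are illustrative and should be replaced with your own values.

[ req ]
distinguished_name = req_distinguished_name
prompt             = no
output_password    = cassandra
default_bits       = 2048

[ req_distinguished_name ]
C  = US
ST = TX
L  = Dallas
O  = YourCompany
OU = YourTeam
CN = CassandraClusterCA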

2. Run the following OpenSSL command to create the CA:
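A typical invocation, with illustrative file names and a 365-day validity, is:

$ openssl req -config gen_ca_cert.conf -new -x509 -keyout ca-key -out ca-cert -days 365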

3. You can verify the contents of the certificate you just created with the following command:
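Assuming the CA certificate was written to ca-cert as above:

$ openssl x509 -in ca-cert -text -noout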

You can generate a certificate for each node if required, but doing so is not recommended, because it is very hard to maintain a separate key for every node: whenever a new node is added to the cluster, its certificate has to be added to all other nodes, which is a tedious process. So, we recommend using the same CA certificate for all the nodes. The following steps will help you do that.

Building Keystore

I will explain keystore building for a 3-node cluster; the same steps can be followed for an n-node cluster.

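Each node gets its own keystore, generated with keytool. The aliases, file names, passwords, and distinguished-name values below are illustrative; repeat the command for node2 and node3 with their own aliases and keystore files.

$ keytool -genkeypair -keyalg RSA -alias node1 -keystore node1-keystore.jks \
    -storepass cassandra -keypass cassandra -validity 365 -keysize 2048 \
    -dname "CN=node1.example.com, OU=MyTeam, O=YourCompany, C=US"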

To verify that the keystore has been generated with the correct key pair information and is accessible, execute the command below:

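For example, for node1 (file name and password as above):

$ keytool -list -keystore node1-keystore.jks -storepass cassandra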

With our key stores created and populated, we now need to export a certificate from each node’s key store as a “Signing Request” for our CA:

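A typical export for node1 (file names and passwords are illustrative):

$ keytool -certreq -keystore node1-keystore.jks -alias node1 -file node1_cert_sr \
    -keypass cassandra -storepass cassandra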

With the certificate signing requests ready to go, it’s now time to sign each of them with our CA’s private key via OpenSSL:

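For node1, assuming the CA key password set in gen_ca_cert.conf above:

$ openssl x509 -req -CA ca-cert -CAkey ca-key -in node1_cert_sr -out node1_cert_signed \
    -days 365 -CAcreateserial -passin pass:cassandra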

Add the CA certificate, and then the node’s own signed certificate, into each node’s keystore via the -importcert subcommand of keytool.

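For node1, with the illustrative names used so far:

$ keytool -importcert -keystore node1-keystore.jks -alias CARoot -file ca-cert \
    -noprompt -storepass cassandra
$ keytool -importcert -keystore node1-keystore.jks -alias node1 -file node1_cert_signed \
    -keypass cassandra -storepass cassandra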

Building Trust Store

Since Cassandra uses Client Certificate Authentication, we need to add a trust store to each node. This is how each node will verify incoming connections from the rest of the cluster.

We need to create the trust store by importing the CA root certificate’s public key:

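A typical command (file names and password are illustrative):

$ keytool -importcert -keystore server-truststore.jks -alias CARoot -file ca-cert \
    -noprompt -storepass truststorepass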

Since all our instance-specific keys have now been signed by the CA, we can share this trust store instance across the cluster.

Configuring the Cluster

After creating all the required files, you can keep the keystore and truststore files in /usr/local/lib/cassandra/conf/ or any directory of your choice, but make sure that the Cassandra daemon has access to that directory. With the configuration below in the cassandra.yaml file, inbound and outbound requests will be encrypted.

Enable Node to Node Encryption

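A typical server_encryption_options block looks like the following; the paths and passwords are illustrative and must match the keystore and truststore created above.

server_encryption_options:
    internode_encryption: all
    keystore: /usr/local/lib/cassandra/conf/node1-keystore.jks
    keystore_password: cassandra
    truststore: /usr/local/lib/cassandra/conf/server-truststore.jks
    truststore_password: truststorepass
    require_client_auth: true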

Enable Client to Node Encryption

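A typical client_encryption_options block, again with illustrative paths and passwords:

client_encryption_options:
    enabled: true
    optional: false
    keystore: /usr/local/lib/cassandra/conf/node1-keystore.jks
    keystore_password: cassandra
    require_client_auth: true
    truststore: /usr/local/lib/cassandra/conf/server-truststore.jks
    truststore_password: truststorepass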

Repeat the above process on all the nodes in the cluster, and your cluster data is secured in flight and protected from unknown clients.

Author Credits: This article was written by Bharathiraja S, Senior Data Engineer at 8KMiles Software Services.

Cassandra Backup and Restore Methods


Cassandra is a distributed database management system. In Cassandra, data is replicated among multiple nodes across multiple data centers, so Cassandra can survive without any interruption in service when one or more nodes are down. It keeps its data in SSTable files. SSTables are stored in the keyspace directory within the data directory path specified by the ‘data_file_directories’ parameter in the cassandra.yaml file. By default, the SSTable directory path is /var/lib/cassandra/data/<keyspace_name>. However, Cassandra backups are still necessary to recover from the following scenarios:

  1. Any errors made in data by client applications
  2. Accidental deletions
  3. Catastrophic failure that will require you to rebuild your entire cluster
  4. Data corruption
  5. The need to roll back the cluster to a known good state
  6. Disk failure

Cassandra Backup Methods

Cassandra provides two types of backup: one is snapshot-based backup and the other is incremental backup.

Snapshot Based Backup

Cassandra provides the nodetool utility, a command line interface for managing a cluster. The nodetool utility offers a useful command for creating snapshots of the data. The nodetool snapshot command flushes memtables to disk and creates a snapshot by creating hard links to the SSTables (SSTables are immutable). The nodetool snapshot command takes snapshots on a per-node basis. To take an entire cluster snapshot, the nodetool snapshot command should be run using a parallel ssh utility, such as pssh. Alternatively, the snapshot of each node can be taken one by one.

It is possible to take a snapshot of all keyspaces in a cluster, or certain selected keyspaces, or a single table in a keyspace. Note that you must have enough free disk space on the node for taking the snapshot of your data files.

The schema does not get backed up in this method. This must be done manually and separately.

Example:

a. All keyspaces snapshot

If you want to take a snapshot of all keyspaces on the node, run the command below.

$ nodetool snapshot

The following message appears:

Requested creating snapshot(s) for [all keyspaces] with snapshot name [1496225100]

Snapshot directory: 1496225100

The snapshot directory is /var/lib/cassandra/data/<keyspace_name>/<table_name>-<UUID>/snapshots/1496225100

b. Single keyspace snapshot

Assume you have created the keyspace university. To take a snapshot of that keyspace with a snapshot name of your choice, run the command below.

$ nodetool snapshot -t 2017.05.31 university

The following output appears:

Requested creating snapshot(s) for [university] with snapshot name [2017.05.31]

Snapshot directory: 2017.05.31

c. Single table snapshot

If you want to take a snapshot of only the student table in the university keyspace, run the command below.

$ nodetool snapshot --table student university

The following message appears:

Requested creating snapshot(s) for [university] with snapshot name [1496228400]

Snapshot directory: 1496228400

After completing the snapshot, you can move the snapshot files to another location such as AWS S3, Google Cloud, or MS Azure. You must also back up the schema, because Cassandra can only restore data from a snapshot when the table schema exists.
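One simple way to back up the schema, assuming cqlsh access to a node and the university keyspace from the examples above (the node address is a placeholder):

$ cqlsh <node_ip> -e "DESCRIBE KEYSPACE university" > university_schema.cql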

Advantages:

  1. Snapshot-based backup is simple and much easier to manage.
  2. The Cassandra nodetool utility provides the nodetool clearsnapshot command, which removes the snapshot files.

Disadvantages:

  1. For large datasets, it may be hard to take a daily backup of the entire keyspace.
  2. It is expensive to transfer large snapshot data to a safe location like AWS S3.

Incremental Backup

Cassandra also provides incremental backups. By default incremental backup is disabled. This can be enabled by changing the value of “incremental_backups” to “true” in the cassandra.yaml file.
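For example, the setting is a single line in cassandra.yaml; recent Cassandra versions (assumed here) also let you toggle and check incremental backups at runtime with nodetool:

incremental_backups: true

$ nodetool enablebackup
$ nodetool statusbackup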

Once enabled, Cassandra hard-links each SSTable flushed from a memtable into a backups directory under the keyspace data directory. In Cassandra, incremental backups contain only new SSTable files; they are dependent on the last snapshot created.

In the case of incremental backup, less disk space is required because it only contains links to new SSTable files generated since the last full snapshot.

Advantages:

  1. The incremental backup reduces disk space requirements.
  2. Reduces the transfer cost.

Disadvantages:

  1. Cassandra does not automatically clear incremental backup files. If you want to remove the hard-linked files, you must write your own script for that (see the sketch after this list); there is no built-in tool to clear them.
  2. Creates lots of small files in the backup directories; file management and recovery are not trivial tasks.
  3. It is not possible to select a subset of column families for incremental backup.
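A minimal cleanup sketch, assuming snapshots covering these files have already been taken and copied off the node; the data path and the three-day retention are illustrative.

$ find /var/lib/cassandra/data/*/*/backups/ -type f -mtime +3 -delete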

Cassandra Restore Methods

Backups are only meaningful if they can be restored, for example when a keyspace gets deleted, when a new cluster is launched from the backup data, or when a node is replaced. Restoring backed-up data is possible from snapshots; if you are using incremental backups, you also need all incremental backup files created after the snapshot. There are mainly two ways to restore data from a backup: one uses nodetool refresh and the other uses sstableloader.

Restore using nodetool refresh:

The nodetool refresh command loads newly placed SSTables onto the system without a restart. This method is used when a new node replaces a node that is not recoverable. Restoring data from a snapshot is possible only if the table schema exists. Assuming you have created a new node, follow the steps below.

  1. Create the schema if it has not been created already.
  2. Truncate the table, if necessary.
  3. Locate the snapshot folder (/var/lib/cassandra/data/<keyspace_name>/<table_name>-<UUID>/snapshots/<snapshot_name>) and copy the snapshot SSTable files into the /var/lib/cassandra/data/<keyspace_name>/<table_name>-<UUID> directory.
  4. Run nodetool refresh, as shown in the example after this list.
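For example, to restore the student table of the university keyspace used earlier (the UUID suffix and the snapshot name are illustrative):

$ cp /var/lib/cassandra/data/university/student-<UUID>/snapshots/2017.05.31/* \
     /var/lib/cassandra/data/university/student-<UUID>/
$ nodetool refresh -- university student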

Restore using sstableloader:

The sstableloader loads a set of SSTable files into a Cassandra cluster. The sstableloader supports the following use cases:

  1. Loading external data
  2. Loading existing SSTables
  3. Restore snapshots

The sstableloader does not simply copy the SSTables to every node; it transfers the relevant part of the data to each node while respecting the replication factor. Here, sstableloader is used to restore snapshots. Follow the steps below to restore using sstableloader.

  1. Create the schema if it does not exist.
  2. Truncate the table, if necessary.
  3. Bring your backup data to a node from AWS S3, Google Cloud, or MS Azure. Example: download your backup data to /home/data.
  4. Run the command below, where ip is the address of a node in the target cluster.
    sstableloader -d ip /home/data
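Note that sstableloader expects the final two path components to be the keyspace and table directories. A sketch, assuming the downloaded SSTables for the student table sit under /home/data/university/student and 10.0.1.15 is a reachable node (both are illustrative):

$ sstableloader -d 10.0.1.15 /home/data/university/student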

 

Author Credits: This article was written by Sebabrata Ghosh, Data Engineer at 8KMiles Software Services; you can reach him here.