Cassandra Backup and Restore Methods
Cassandra is a distributed database management system. Data is replicated among multiple nodes across multiple data centers, so Cassandra can keep serving requests without interruption when one or more nodes are down. It keeps its data in SSTable files. SSTables are stored in the keyspace directory within the data directory path specified by the ‘data_file_directories’ parameter in the cassandra.yaml file. By default, the SSTable directory path is /var/lib/cassandra/data/<keyspace_name>. However, Cassandra backups are still necessary to recover from the following scenarios:
- Any errors made in data by client applications
- Accidental deletions
- Catastrophic failure that requires you to rebuild your entire cluster
- Data corruption
- A need to roll back the cluster to a known good state
- Disk failure
Cassandra Backup Methods
Cassandra provides two types of backup. One is snapshot based backup and the other is incremental backup.
Snapshot Based Backup
Cassandra provides the nodetool utility, a command line interface for managing a cluster. Its nodetool snapshot command flushes memtables to disk and creates a snapshot by hard-linking the SSTable files. Because SSTables are immutable, the hard links preserve the data as it was when the snapshot was taken. The nodetool snapshot command works on a per-node basis. To snapshot an entire cluster, run nodetool snapshot on all nodes with a parallel ssh utility such as pssh. Alternatively, take the snapshot of each node one by one.
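As a minimal sketch of the one-by-one alternative, the function below runs nodetool snapshot on each node over ssh with the same snapshot name. The function name snapshot_cluster and the host names in the usage line are hypothetical, and the sketch assumes passwordless ssh access to every node.

```sh
# Hypothetical helper: take a snapshot with the same name on every node.
# Usage: snapshot_cluster SNAPSHOT_NAME HOST [HOST...]
snapshot_cluster() {
    snapshot_name=$1
    shift
    failed=0
    for host in "$@"; do
        # -n keeps ssh from reading stdin, so the loop is not disturbed.
        if ssh -n "$host" nodetool snapshot -t "$snapshot_name"; then
            echo "snapshot $snapshot_name taken on $host"
        else
            echo "snapshot failed on $host" >&2
            failed=1
        fi
    done
    # Non-zero if at least one node could not be snapshotted.
    return "$failed"
}
```

For example, snapshot_cluster 2017.05.31 node1 node2 node3 would try all three nodes and report a failure if any of them did not succeed.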
It is possible to take a snapshot of all keyspaces in a cluster, or certain selected keyspaces, or a single table in a keyspace. Note that you must have enough free disk space on the node for taking the snapshot of your data files.
The schema does not get backed up in this method. It must be backed up manually and separately.
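One possible way to back up the schema is to dump it with cqlsh. The sketch below is an assumption about how you might do this, not part of the snapshot mechanism itself; the function name backup_schema and the default host localhost are hypothetical.

```sh
# Hypothetical helper: dump the CQL schema of one keyspace to a file.
# Usage: backup_schema KEYSPACE OUTPUT_FILE [CQLSH_HOST]
backup_schema() {
    keyspace=$1
    output_file=$2
    cqlsh_host=${3:-localhost}   # assumed default; pass your node address
    # DESCRIBE KEYSPACE prints the CREATE statements for the keyspace
    # and its tables, which can be replayed before a restore.
    cqlsh "$cqlsh_host" -e "DESCRIBE KEYSPACE $keyspace" > "$output_file"
}
```

For example, backup_schema university university_schema.cql writes the schema to a file that you can store next to the snapshot files.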
a. All keyspaces snapshot
To take a snapshot of all keyspaces on the node, run the below command:
$ nodetool snapshot
The following message appears:
Requested creating snapshot(s) for [all keyspaces] with snapshot name [1496225100]
Snapshot directory: 1496225100
The snapshot directory is /var/lib/cassandra/data/keyspace_name/table_name-UUID/snapshots/1496225100
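Since every table gets its own snapshot directory, it can help to list them all at once. The following is a small sketch that assumes the default data directory; the function name find_snapshot_dirs is hypothetical.

```sh
# Hypothetical helper: print every directory that holds the named snapshot.
# Usage: find_snapshot_dirs SNAPSHOT_NAME [DATA_DIR]
find_snapshot_dirs() {
    snapshot_name=$1
    data_dir=${2:-/var/lib/cassandra/data}   # default path from cassandra.yaml
    # Each table directory has its own snapshots/<name> subdirectory.
    find "$data_dir" -type d -path "*/snapshots/$snapshot_name" | sort
}
```

For example, find_snapshot_dirs 1496225100 prints one line per table that is part of the snapshot.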
b. Single keyspace snapshot
Assume you have created the keyspace university. To take a snapshot of this keyspace with a snapshot name of your choice, run the below command:
$ nodetool snapshot -t 2017.05.31 university
The following output appears:
Requested creating snapshot(s) for [university] with snapshot name [2017.05.31]
Snapshot directory: 2017.05.31
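If you take such a snapshot regularly, the date in the name can be generated instead of typed. This is only a sketch; the function name snapshot_keyspace_today is hypothetical.

```sh
# Hypothetical helper: snapshot one keyspace, named after today's date.
# Usage: snapshot_keyspace_today KEYSPACE
snapshot_keyspace_today() {
    keyspace=$1
    # Same year.month.day format as the example above, e.g. 2017.05.31.
    snapshot_name=$(date +%Y.%m.%d)
    nodetool snapshot -t "$snapshot_name" "$keyspace"
}
```

For example, snapshot_keyspace_today university could be called from a daily cron job on each node.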
c. Single table snapshot
To take a snapshot of only the student table in the university keyspace, run the below command:
$ nodetool snapshot --table student university
The following message appears:
Requested creating snapshot(s) for [university] with snapshot name [1496228400]
Snapshot directory: 1496228400
After completing the snapshot, you can move the snapshot files to another location such as AWS S3, Google Cloud, or MS Azure. You must also back up the schema, because Cassandra can only restore data from a snapshot when the table schema exists.
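Before moving a snapshot off the node, it is often convenient to pack it into a single archive. The sketch below assumes the default data directory and a tar that supports -z and -C; the function name archive_snapshot and the bucket name in the comment are hypothetical.

```sh
# Hypothetical helper: pack every directory of one snapshot into a tarball.
# Usage: archive_snapshot SNAPSHOT_NAME ARCHIVE_FILE [DATA_DIR]
archive_snapshot() {
    snapshot_name=$1
    archive_file=$2
    data_dir=${3:-/var/lib/cassandra/data}
    # Relative paths keep the keyspace/table-UUID layout inside the archive.
    snapshot_dirs=$(cd "$data_dir" && find . -type d -path "*/snapshots/$snapshot_name")
    if [ -z "$snapshot_dirs" ]; then
        echo "no snapshot named $snapshot_name under $data_dir" >&2
        return 1
    fi
    # Unquoted on purpose: one argument per directory. Keyspace and table
    # names contain no whitespace, so word splitting is safe here.
    tar -czf "$archive_file" -C "$data_dir" $snapshot_dirs
}
# The archive can then be copied away, for example with the AWS CLI:
#   aws s3 cp /tmp/2017.05.31.tar.gz s3://my-backup-bucket/   (bucket name is made up)
```

Keeping the relative keyspace/table layout in the archive makes it easier to see which table each SSTable file belongs to when you restore.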
- Snapshot based backup is simple and much easier to manage.
- The nodetool utility provides the nodetool clearsnapshot command, which removes the snapshot files.
- For large datasets, it may be hard to take a daily backup of the entire keyspace.
- It is expensive to transfer large snapshot data to a safe location like AWS S3.
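As a sketch of how nodetool clearsnapshot might be wrapped, the function below removes one named snapshot of one keyspace and refuses to run when either argument is missing, so that it never clears more than intended. The function name clear_snapshot is hypothetical.

```sh
# Hypothetical helper: remove one named snapshot of one keyspace on this node.
# Usage: clear_snapshot SNAPSHOT_NAME KEYSPACE
clear_snapshot() {
    snapshot_name=$1
    keyspace=$2
    if [ -z "$snapshot_name" ] || [ -z "$keyspace" ]; then
        echo "usage: clear_snapshot SNAPSHOT_NAME KEYSPACE" >&2
        return 2
    fi
    # -t selects the snapshot by name; -- separates options from keyspaces.
    nodetool clearsnapshot -t "$snapshot_name" -- "$keyspace"
}
```

For example, clear_snapshot 2017.05.31 university removes that snapshot once it has been copied to a safe location.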
Incremental Backup
Cassandra also provides incremental backups. By default, incremental backup is disabled. It can be enabled by changing the value of “incremental_backups” to “true” in the cassandra.yaml file. Once enabled, Cassandra creates a hard link to each SSTable flushed from a memtable in a backups directory under the keyspace data directory. Incremental backups contain only new SSTable files, so they depend on the last snapshot created.
In the case of incremental backup, less disk space is required because it only contains links to new SSTable files generated since the last full snapshot.
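To check how a node is configured, you can read the setting back from cassandra.yaml. This is a small sketch; the function name incremental_backups_setting and the default file location /etc/cassandra/cassandra.yaml are assumptions, so pass the path used by your installation.

```sh
# Hypothetical helper: print the incremental_backups value from cassandra.yaml.
# Usage: incremental_backups_setting [CASSANDRA_YAML]
incremental_backups_setting() {
    yaml_file=${1:-/etc/cassandra/cassandra.yaml}   # assumed location
    # Take the first uncommented "incremental_backups:" line, strip the key
    # and any trailing comment, and print just the value.
    sed -n 's/^incremental_backups:[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$yaml_file" | head -n 1
}
```

On a running node, nodetool statusbackup reports whether incremental backups are currently active, and nodetool enablebackup turns them on until the next restart.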
- Incremental backup reduces disk space requirements.
- It reduces the transfer cost.
- Cassandra does not automatically clear incremental backup files. If you want to remove the hard-link files, you have to write your own script. There is no built-in tool to clear them.
- It creates lots of small files in the backup, so file management and recovery are not trivial tasks.
- It is not possible to select a subset of column families for incremental backup.
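Because the cleanup of incremental backup files is left to you, a script along the following lines is one option. It is a sketch that assumes the default data directory and that files older than the given number of days have already been copied off the node; the function name purge_old_backups is hypothetical.

```sh
# Hypothetical helper: delete incremental backup files older than N days.
# Usage: purge_old_backups DAYS [DATA_DIR]
purge_old_backups() {
    days=$1
    data_dir=${2:-/var/lib/cassandra/data}
    # Match only regular files inside a backups directory, so live SSTables
    # and snapshots are left alone.
    find "$data_dir" -type f -path "*/backups/*" -mtime "+$days" -exec rm -f {} +
}
```

For example, purge_old_backups 7 removes incremental backup files that were last modified more than seven days ago.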
Cassandra Restore Methods
Backups are meaningful only when they are restorable, for example when a keyspace gets deleted, a new cluster gets launched from the backup data, or a node gets replaced. Data can be restored from snapshots; if you are using incremental backups, you also need all incremental backup files created after the snapshot. There are mainly two ways to restore data from a backup: nodetool refresh and sstableloader.
Restore using nodetool refresh:
The nodetool refresh command loads newly placed SSTables onto the system without a restart. This method is used when a new node replaces a node that is not recoverable. Restoring data from a snapshot is possible only if the table schema exists. Assuming you have created a new node, follow the below steps:
- Create the schema if it has not been created already.
- Truncate the table, if necessary.
- Locate the snapshot folder (/var/lib/cassandra/data/keyspace_name/table_name-UUID/snapshots/snapshot_name) and copy the SSTable files in it to the /var/lib/cassandra/data/keyspace_name/table_name-UUID directory.
- Run nodetool refresh for the keyspace and table (nodetool refresh keyspace_name table_name).
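The copy and refresh steps above can be sketched as one function. It assumes the schema already exists, so that the table directory (table_name-UUID) is present, and it assumes the default data directory; the function name restore_snapshot is hypothetical.

```sh
# Hypothetical helper: copy snapshot SSTables into the live table directory
# and ask Cassandra to load them.
# Usage: restore_snapshot KEYSPACE TABLE SNAPSHOT_DIR [DATA_DIR]
restore_snapshot() {
    keyspace=$1
    table=$2
    snapshot_dir=$3
    data_dir=${4:-/var/lib/cassandra/data}
    table_dir=
    # The table directory is named table_name-UUID; expect exactly one match.
    for candidate in "$data_dir/$keyspace/$table"-*; do
        [ -d "$candidate" ] || continue
        if [ -n "$table_dir" ]; then
            echo "more than one directory matches $keyspace/$table-*" >&2
            return 1
        fi
        table_dir=$candidate
    done
    if [ -z "$table_dir" ] || [ ! -d "$snapshot_dir" ]; then
        echo "table directory or snapshot directory not found" >&2
        return 1
    fi
    # Copy the SSTable files next to the live data, then load them
    # without restarting the node.
    cp "$snapshot_dir"/* "$table_dir"/ && nodetool refresh "$keyspace" "$table"
}
```

The function stops when more than one table directory matches, because a dropped and recreated table can leave an older directory behind and picking the wrong one would load the data nowhere useful.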
Restore using sstableloader:
The sstableloader tool loads a set of SSTable files into a Cassandra cluster. It can be used for the following:
- Loading external data
- Loading existing SSTables
- Restore snapshots
The sstableloader does not simply copy the SSTables to every node. It transfers the relevant part of the data to each node and maintains the replication factor. Here, sstableloader is used to restore snapshots. Follow the below steps to restore using sstableloader:
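- Create the schema if it does not exist.
- Truncate the table if necessary.
- Bring your backup data to a node from AWS S3, Google Cloud, or MS Azure. Example: download your backup data to /home/data/keyspace_name/table_name. The sstableloader takes the target keyspace and table names from the last two directories of the path.
- Run the below command, where ip is the address of a node in the cluster:
sstableloader -d ip /home/data/keyspace_name/table_name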
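As a sketch of the last two steps, the function below stages the downloaded SSTable files in a directory whose last two components are the keyspace and table names, which is where sstableloader reads the target keyspace and table from, and then streams them to the cluster. The function name load_sstables and the host addresses in the usage example are hypothetical.

```sh
# Hypothetical helper: stage SSTables under KEYSPACE/TABLE and stream them.
# Usage: load_sstables HOSTS KEYSPACE TABLE SOURCE_DIR [STAGING_ROOT]
load_sstables() {
    hosts=$1            # comma-separated list of nodes to contact first
    keyspace=$2
    table=$3
    source_dir=$4       # where the backup files were downloaded
    staging_root=${5:-/home/data}
    # sstableloader derives keyspace and table from this directory path.
    staging_dir=$staging_root/$keyspace/$table
    mkdir -p "$staging_dir" || return 1
    cp "$source_dir"/* "$staging_dir"/ || return 1
    sstableloader -d "$hosts" "$staging_dir"
}
```

For example, load_sstables 10.0.0.1,10.0.0.2 university student /tmp/download would stage the files in /home/data/university/student and load them into the student table.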
Author Credits: This article was written by Sebabrata Ghosh, Data Engineer at 8KMiles Software Services.