
Terminology

Cassandra's Glossary

  • Partitioner: Distributes data across a cluster.
  • Replica: A copy of a portion of the whole database. Each node holds some replicas.
  • Replication Factor: The total number of replicas across the cluster.
  • Replication Strategy: Determines on which nodes the replicas of each piece of data are placed (e.g., SimpleStrategy or NetworkTopologyStrategy; see Data replication below).
  • Gossip: A peer-to-peer communication protocol for exchanging location and state information between nodes.
  • Snitch: The mapping from the IP addresses of nodes to physical and virtual locations, such as racks and datacenters.
  • Consistency: The synchronization of data on replicas in a cluster.
  • Consistency level: A setting that defines a successful write or read by the number of cluster replicas that acknowledge the write or respond to the read request, respectively.
  • Keyspace: A namespace container that defines how data is replicated on nodes in each datacenter.
  • Partition: Collection of data addressable by a key.
  • Primary Key: One or more columns that uniquely identify a row in a table. Formed by (partition_key, clustering_key).
  • Partition Key: A logical entity which helps a Cassandra cluster know on which node some requested data resides.
  • Clustering Key: The column(s) that determine the order in which rows are stored within a partition.
  • Memtables: A database table-specific, in-memory data structure that resembles a write-back cache.
  • Sstables: A sorted string table (SSTable) is an immutable data file to which the database writes memtables periodically.
  • Compaction: The process of consolidating SSTables, discarding tombstones, and regenerating the SSTable index.
  • Compression: The encoding of SSTable data on disk to reduce its size (e.g., LZ4, Zstd, Snappy, Deflate; see Compression below).
  • Garbage Collector: A Java background process that frees heap memory when it is no longer in use.
  • Token: Determines the node’s position on the ring and the portion of data for which it is responsible.
  • Tombstone: A marker in a row that indicates a column was deleted. During compaction, marked columns are deleted.
  • Bootstrap: The process by which new nodes join the cluster, transparently streaming the data they need from existing nodes.
  • Seed node: Used to bootstrap the gossip process for new nodes joining a cluster. Any node can act as a seed; seeds have no special role beyond bootstrapping, but they must be listed in cassandra.yaml.
  • Coordinator node: The node that receives the request from the client. It determines which nodes in the ring should get the request, based on the cluster's configured snitch.
  • Primary Range: The token range for which a node is the first replica.
  • Cardinality: The number of unique values in a column.
  • Ring: The logical circle formed by the cluster's token space; each node owns one or more ranges of tokens on the ring.
  • Cluster: Two or more database instances that exchange messages using the gossip protocol.
  • Datacenter: A group of related nodes that are configured together within a cluster for replication and workload segregation purposes.
  • Denormalization: Process of optimizing the read performance of a database by adding redundant data or by grouping data.
  • Eventual consistency: The database ensures eventual data consistency by updating all replicas during read operations and periodically checking and updating any replicas not directly accessed.
  • Materialized view: A table with data that is automatically inserted and updated from another base table.
  • Node: An instance of Cassandra.
  • Zombie: A row or cell that reappears in a database table after deletion. This can happen if a node goes down for a long period of time and is then restored without being repaired.
  • Lightweight transaction (LWT): A way to perform a conditional update with strong consistency guarantees. Also known as CAS, for Compare And Set. This operation comes at a significant latency cost, as it requires several round trips (Paxos).
  • Hint: One of the three ways, in addition to read-repair and full/incremental anti-entropy repair, that Cassandra implements the eventual consistency guarantee that all updates are eventually received by all replicas.

Consistency

Cassandra is eventually consistent: in the absence of new writes, all replicas converge to the same value over time. Each read and write request carries a consistency level that defines how many replicas must acknowledge the write or respond to the read for the operation to succeed:

  • ONE: Only a single replica must respond.
  • TWO: Two replicas must respond.
  • THREE: Three replicas must respond.
  • ALL: All of the replicas must respond.
  • QUORUM: A majority (n/2 + 1) of all the replicas must respond.
  • LOCAL_QUORUM: A majority of the replicas in the local datacenter (in the datacenter the coordinator is in) must respond.
  • EACH_QUORUM: A majority of the replicas in each datacenter must respond.
  • LOCAL_ONE: Only the closest replica in the local datacenter must respond. In a multi-datacenter cluster, this also guarantees that read requests are not sent to replicas in a remote datacenter.
  • ANY: Closest replica, as determined by the snitch. A single replica may respond, or the coordinator may store a hint. If a hint is stored, the coordinator will later attempt to replay the hint and deliver the mutation to the replicas. This consistency level is only accepted for write operations.
  • SERIAL: This consistency level is only for use with lightweight transactions. Equivalent to QUORUM. (Reads only)
  • LOCAL_SERIAL: Same as SERIAL, but used to maintain consistency locally (within a single datacenter). Equivalent to LOCAL_QUORUM. (Reads only)
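
In cqlsh, the consistency level is set per session with the CONSISTENCY command; a minimal sketch, assuming a hypothetical ks.users table:

    CONSISTENCY QUORUM;               -- applies to subsequent requests in this session
    SELECT * FROM ks.users WHERE id = 42;

    SERIAL CONSISTENCY LOCAL_SERIAL;  -- used by the Paxos phase of LWTs
    INSERT INTO ks.users (id, name) VALUES (42, 'Ana') IF NOT EXISTS;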

Eventual consistency

Hinted Handoff

Hinted handoff provides eventual consistency on writes. Hints are best effort, not a guaranteed way to repair inconsistencies.

If a node becomes unresponsive, the coordinator stores a hint containing the data that could not be written. If a node is down for longer than max_hint_window_in_ms (3 hours by default), the coordinator stops writing new hints. At that point, the node must be repaired once it is brought back up.

Hints are also stored for write requests that time out after write_request_timeout_in_ms (2 seconds by default). The coordinator returns a TimeoutException and the write fails from the client's perspective, but a hint is still stored.
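
The relevant settings live in cassandra.yaml; a sketch showing the defaults described above (newer releases rename these options, e.g. max_hint_window):

    # cassandra.yaml
    hinted_handoff_enabled: true
    max_hint_window_in_ms: 10800000    # 3 hours
    write_request_timeout_in_ms: 2000  # 2 seconds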

Read Repair

Eventual consistency on reads.

Read repair is the process of repairing data replicas during a read request. If all replicas involved in a read request at the given consistency level are consistent, the data is returned to the client and no read repair is needed. If they are not consistent, a read repair is performed to bring the replicas involved in the read back in sync.

The most up-to-date data is returned to the client. The read repair runs in the foreground and is blocking in that a response is not returned to the client until the read repair has completed and up-to-date data is constructed.
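
Since Cassandra 4.0, blocking read repair can be configured per table; a sketch, assuming a hypothetical ks.events table:

    ALTER TABLE ks.events WITH read_repair = 'BLOCKING';  -- default: block until repaired data is written
    ALTER TABLE ks.events WITH read_repair = 'NONE';      -- skip blocking read repair for this table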

Anti-Entropy Repair

Anti-entropy repair must be run manually, as it generates a lot of disk and network I/O.

There are two types of repair:

  • Full repairs operate over all of the data in the token range being repaired.
  • Incremental repairs only repair data that's been written since the previous incremental repair.

  • Incremental repair is the default and is run with the following command nodetool repair
  • A full repair can be run with the following command nodetool repair --full
  • Additionally, repair can be run on a single keyspace nodetool repair [options] <keyspace_name>
  • Or even on specific tables nodetool repair [options] <keyspace_name> <table1> <table2>

The repair command repairs token ranges only on the node being repaired; it does not repair the whole cluster. By default, repair operates on all token ranges replicated by the node on which repair is run, causing duplicate work when running it on every node. Avoid duplicate work by using the -pr flag to repair only the "primary" ranges on a node. Do a full cluster repair by running the nodetool repair -pr command on each node in each datacenter in the cluster, until all of the nodes and datacenters are repaired.

Best practices:

  • Repair should be run often enough that the gc grace period never expires on unrepaired data. Otherwise, deleted data could reappear (zombies). With a default gc grace period of 10 days, repairing every node in your cluster at least once every 7 days will prevent this, while providing enough slack to allow for delays.
  • Running an incremental repair every 1-3 days and a full repair every 1-3 weeks is probably reasonable (see the sketch below). If you don't want to run incremental repairs, a full repair every 5 days is a good place to start.
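
A sketch of such a schedule as a crontab entry on each node (hypothetical times; stagger the start time per node so repairs do not overlap):

    # incremental repair of the node's primary ranges, every third day at 02:00
    0 2 */3 * * nodetool repair -pr
    # full repair twice a month
    0 4 1,15 * * nodetool repair -pr --full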

When increasing the replication factor, you need to run a full (--full) repair to distribute the data to the new replicas.
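
For example (hypothetical keyspace and datacenter names):

    ALTER KEYSPACE app WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
    -- then, on every node:
    --   nodetool repair --full app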

Partitioner

The mechanism that Cassandra uses to distribute data across the cluster: each partition key is hashed into a token, and every node is responsible for one or more ranges of tokens.

There are three types of partitioners that assign tokens in different ways:

  • Murmur3Partitioner (default): uniformly distributes data across the cluster based on MurmurHash hash values.
  • RandomPartitioner (OLD): uniformly distributes data across the cluster based on MD5 hash values.
  • ByteOrderedPartitioner: keeps an ordered distribution of data lexically by key bytes.

Statements against ByteOrderedPartitioner:

  • Can cause hot spots. For example, more people's names start with A than with X.
  • Uneven data distribution across multiple tables.

Statements in favor of ByteOrderedPartitioner:

  • You can scan for all names starting with "J" and the data returned will be sorted.
  • Hotspots should not be a problem as long as the partition key has enough cardinality.

Considerations:

  • The partitioner is changed in the cassandra.yaml configuration file.
  • It must be the same for all the nodes in the cluster.
  • Changing it in a production cluster is very hard.
  • When adding a node to the cluster, the tokens get redistributed across the cluster and a repair is needed to remove data that no longer belongs to the nodes.
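
The setting looks like this (the default shown; it must be identical on every node):

    # cassandra.yaml
    partitioner: org.apache.cassandra.dht.Murmur3Partitioner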

Data replication

A replication factor of 1 means that there is only one copy of each row in the cluster. If the node containing the row goes down, the row cannot be retrieved. A replication factor of 2 means two copies of each row, where each copy is on a different node. All replicas are equally important; there is no primary or master replica. As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase the replication factor and then add the desired number of nodes later.

There are two strategies:

SimpleStrategy: Use only for a single datacenter and one rack. If you ever intend to use more than one datacenter, use NetworkTopologyStrategy. Places the first replica on a node determined by the partitioner; additional replicas are placed on the next nodes clockwise in the ring, without considering topology.

NetworkTopologyStrategy: Highly recommended for most deployments because it is much easier to expand to multiple datacenters when required by future expansion. Can specify a different replication factor for each data center. Within a data center, it allocates replicas to nodes in different racks in order to maximize availability.

The NetworkTopologyStrategy is the recommended strategy for keyspaces in production deployments, regardless of whether it's a single data center or multiple data center deployment.
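
A minimal sketch, assuming two datacenters named dc1 and dc2 as reported by the snitch:

    CREATE KEYSPACE app WITH replication = {
      'class': 'NetworkTopologyStrategy',
      'dc1': 3,
      'dc2': 3
    };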

Disadvantages of NetworkTopologyStrategy:

  • Harder to guarantee an even load balance for the data.
  • Harder to guarantee consistency of data.

Compaction

Compaction merges multiple SSTables, eliminating tombstones. As SSTables accumulate, the distribution of data can require accessing more and more SSTables to retrieve a complete row.

Compaction collects all versions of each unique row and assembles one complete row, using the most up-to-date version (by timestamp) of each of the row's columns. The merge process is performant because rows are sorted by partition key within each SSTable and the merge does not use random I/O. The new version of each row is written to a new SSTable. The old versions, along with any rows that are ready for deletion, are left in the old SSTables, which are deleted as soon as pending reads complete.

Best practices:

  • Pause automatic compactions during maintenance windows.
  • It is OK to have multiple pending compactions on a node under high load.
  • Increase the compaction throughput if the disk can handle it (see the sketch below).
  • Change the strategy if needed.
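
For example, compaction throughput can be inspected and adjusted at runtime (value in MB/s; 0 disables throttling):

    nodetool getcompactionthroughput
    nodetool setcompactionthroughput 64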

Commands:

  • Manually compact a table: nodetool compact <keyspace> <table>
  • nodetool compactionstats
  • nodetool proxyhistograms
  • nodetool tablestats <keyspace>.<table>
  • nodetool tpstats

Recommendations:

  • Monitor tombstone counts and reads.
  • Monitor resource usage.
  • Enable concurrent compactors if the hardware is capable.
  • Monitor disk I/O.

Compaction Strategies

A compaction strategy defines how and when SSTables are merged.

  • Unified Compaction Strategy (UCS): It is designed to be able to handle both immutable time-series data and workloads with lots of updates and deletes. It is also designed to be able to handle both spinning disks and SSDs.
  • Size Tiered Compaction Strategy (STCS) (default): Most useful for non-time-series workloads with spinning disks, or when the I/O from LCS is too high. Gradually reduces the number of SSTables. Optimized for a high volume of inserted data without frequent updates. Reads are slower because data is fragmented across SSTables.
  • Leveled Compaction Strategy (LCS): optimized for read heavy workloads, or workloads with lots of updates and deletes. It is not a good choice for immutable time-series data.
  • Time Window Compaction Strategy (TWCS): designed for TTL’ed, mostly immutable time-series data.
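
The strategy is set per table; a sketch for time-series data, assuming a hypothetical ks.metrics table:

    ALTER TABLE ks.metrics WITH compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': 1
    };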

Signs of a wrong compaction strategy:

  • Latency.
  • Always compacting.
  • Compactions that never end.
  • A constantly rising estimated time to finish compactions.
  • Errors such as backpressure or the compactor being overwhelmed.

Tombstones

Cassandra treats a deletion as an insertion: it inserts a time-stamped deletion marker called a tombstone. A tombstone has a built-in expiration; once its grace period has elapsed, it is deleted as part of Cassandra's compaction process. This approach is used instead of removing values outright. Once an object is marked with a tombstone, queries ignore all values time-stamped before the tombstone's insertion.

Zombies

If one replica node is unresponsive at deletion time, it does not receive the tombstone immediately, so it still contains the pre-delete version of the object. If the tombstoned object has already been deleted from the rest of the cluster before that node recovers, Cassandra treats the object on the recovered node as new data, and propagates it to the rest of the cluster.

Grace Period

To prevent zombies, Cassandra gives each tombstone a grace period, by default 864000 seconds (ten days), after which a tombstone expires and can be deleted during compaction. Prior to the grace period expiring, Cassandra will retain a tombstone through compaction events. Each table can have its own value for this property.
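
The grace period is set per table via gc_grace_seconds; for example, on a hypothetical ks.metrics table:

    ALTER TABLE ks.metrics WITH gc_grace_seconds = 432000;  -- 5 days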

Compression

  • LZ4Compressor: lower compression ratio but the fastest.
  • ZstdCompressor: good balance; recommended for a lower disk footprint.
  • SnappyCompressor: balanced.
  • DeflateCompressor: higher compression ratio, but slower.

The compressor is set per table. Newer Cassandra versions use the 'class' option (older releases used 'sstable_compression'):

    ALTER TABLE ... WITH compression = {'class': 'LZ4Compressor'};

Force a re-compression: nodetool scrub or nodetool upgradesstables -a

Snapshots

A snapshot is a point-in-time, per-node copy of a table's SSTables, implemented as hard links and created with nodetool snapshot. Because hard links are cheap, snapshots are fast to create and serve as the basis for backups.

Sources

  • https://cassandra.apache.org/doc/5.0/cassandra/
  • https://docs.datastax.com/en/cassandra-oss/
  • https://www.yugabyte.com/blog/apache-cassandra-lightweight-transactions-secondary-indexes-tunable-consistency
  • https://www.beyondthelines.net/databases/cassandra-lightweight-transactions/