citrusleaf the noSQL database for the enterprise

Scaling

There are two noteworthy aspects of Citrusleaf scaling:

  • Linear scaling - cluster capacity increases linearly as nodes are added, with per-node throughput staying well over 200,000 TPS at response times under 1 millisecond in real-world configurations.
  • Auto scaling - the system requires no operational intervention to add new nodes to the cluster.

Linear Scaling

Efficient distributed consensus

All of the nodes in the Citrusleaf system participate in a Paxos distributed consensus algorithm, which is used to ensure agreement on a minimal amount of critical shared state.  The most critical part of this shared state is the list of nodes that are participating in the cluster.  Consequently, every time a node arrives or departs, the consensus algorithm runs to ensure that agreement is reached.  This process takes a fraction of a second.  Once consensus is achieved, every node agrees on both the participants and their order within the cluster.  Using this information, the master node and the replica nodes for any transaction can be computed.  Since the essential information about any transaction can be computed, transactions can be simpler and use proven database algorithms.  The result is low-latency transactions that involve only a minimal subset of nodes.
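As a rough illustration of how a master and its replicas can be computed from nothing more than the agreed, ordered node list, consider the minimal sketch below. It is not the Citrusleaf algorithm: the hash function, the partition count `N_PARTITIONS`, the modulo placement, and the names `partition_id` and `owners` are all assumptions made for the example.

```python
import hashlib
from typing import List

N_PARTITIONS = 4096  # assumed value, chosen only for illustration


def partition_id(key: bytes) -> int:
    """Map a key to a partition with a stable (non-random) hash."""
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % N_PARTITIONS


def owners(key: bytes, nodes: List[str], replication_factor: int = 2) -> List[str]:
    """Return [master, replica, ...] for a key, given the agreed node order.

    Every node that holds the same ordered list computes the same answer,
    so no extra coordination is needed per transaction.
    """
    if not nodes:
        raise ValueError("cluster has no nodes")
    start = partition_id(key) % len(nodes)
    copies = min(replication_factor, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]
```

Calling `owners(b"user:42", ["A", "B", "C"])` on any node yields the same master and replica, because the result depends only on the key and the agreed list. Simple modulo placement like this would move many partitions whenever the list changes, so a production system would likely use a more careful assignment.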

Balanced data partitioning

In a Citrusleaf database cluster, the contents of a namespace are partitioned by key value and the associated data is spread across every node in the cluster.  When a node is added to or removed from a cluster, data rebalancing starts as soon as the cluster change is detected. The automatic data rebalancing is conducted across the internal cluster interconnect. Balanced data ensures that query volume is distributed evenly across all nodes, and the cluster remains robust even if a node fails during rebalancing itself.  This provides Citrusleaf's most powerful feature: scalability in both performance and capacity can be achieved entirely horizontally. Furthermore, the system is designed to be continuously available, so data rebalancing doesn't impact cluster behavior.

If a cluster node receives a request for a piece of data that it does not have locally, it satisfies the request by creating an internal proxy for this request, fetching the data from the real owner using the internal cluster interconnect (see figure), and subsequently replying to the client directly.

[Figure: cluster interconnect]
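The proxy behavior described above can be sketched in a few lines. This is a toy model under stated assumptions, not Citrusleaf code: the `Node` class, the `make_cluster` helper, the byte-sum ownership rule, and the use of direct method calls in place of the cluster interconnect are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple


class Node:
    """Toy cluster node; `peers` stands in for the cluster interconnect."""

    def __init__(self, name: str, owner_of: Callable[[str], str]):
        self.name = name
        self.owner_of = owner_of            # shared rule: key -> owning node name
        self.store: Dict[str, str] = {}
        self.peers: Dict[str, "Node"] = {}
        self.proxied = 0                    # requests that needed an extra hop

    def read(self, key: str) -> Optional[str]:
        owner = self.owner_of(key)
        if owner == self.name:
            return self.store.get(key)      # local data, no extra hop
        # Not the owner: fetch from the real owner over the interconnect,
        # then answer the client from this node.
        self.proxied += 1
        return self.peers[owner].read(key)


def make_cluster(names: List[str]) -> Tuple[Dict[str, Node], Callable[[str], str]]:
    """Build nodes that all share one (assumed) ownership rule."""
    def owner_of(key: str) -> str:
        return names[sum(key.encode()) % len(names)]

    nodes = {n: Node(n, owner_of) for n in names}
    for node in nodes.values():
        node.peers = {n: p for n, p in nodes.items() if n != node.name}
    return nodes, owner_of
```

A read sent to the owning node is served locally, while the same read sent to any other node increments that node's `proxied` counter and costs one extra hop. That extra hop is what client-side routing tries to avoid.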

Efficient transaction routing

Without efficient transaction routing, a request would always require an extra network hop. Either a proxy would be placed in the middle of the transaction, increasing latency and decreasing throughput, or transactions would flow over the cluster interconnect. Therefore, the Citrusleaf client dynamically discovers the cluster's current partition state and routes transactions to the correct node in the cluster. As nodes are added to the cluster and the data is automatically rebalanced, the cluster's capacity to handle client throughput continues to increase linearly. Specifically, the cluster interconnect introduces no added overhead. This is borne out in our benchmarks, which demonstrate that the maximum number of transactions supported per node stays constant as we add nodes to the cluster.
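To make the idea of client-side routing concrete, here is a minimal sketch of a client that caches a partition map and picks the target node itself. It does not reproduce the Citrusleaf client API: `RoutingClient`, `partition_of`, the versioned `fetch_map` callback, and the CRC32 hash are assumptions for the example.

```python
import zlib
from typing import Callable, Dict, Tuple

PartitionMap = Dict[int, str]  # partition id -> node name


def partition_of(key: bytes, n_partitions: int) -> int:
    """Stable key-to-partition mapping (CRC32 is an assumed choice)."""
    return zlib.crc32(key) % n_partitions


class RoutingClient:
    """Caches the partition map so requests go straight to the owning node."""

    def __init__(self, fetch_map: Callable[[], Tuple[int, PartitionMap]]):
        self._fetch_map = fetch_map
        self.version, self.partition_map = fetch_map()

    def refresh(self) -> bool:
        """Re-read the map; return True if the cluster layout changed."""
        version, new_map = self._fetch_map()
        changed = version != self.version
        self.version, self.partition_map = version, new_map
        return changed

    def node_for(self, key: bytes) -> str:
        pid = partition_of(key, len(self.partition_map))
        return self.partition_map[pid]
```

A client built this way sends each request directly to `node_for(key)` and calls `refresh()` periodically, or after a redirect, to pick up a new layout. Between a cluster change and the next refresh, requests may still land on the old owner, which is where the internal proxy path covers the gap.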

Auto Scaling

The Citrusleaf database platform is self-organizing and scales elastically to fit your business needs in a "just in time" manner. Distributed consensus algorithms for automatic node addition and removal combined with automatic data rebalancing algorithms provide robust self-management of the system during node arrival and departure.

Automatic node addition and removal

The Citrusleaf algorithms for detecting node arrival and departure are robust.

  • We use multiple independent paths for nodes to discover each other. Nodes can be discovered via an explicit heartbeat message and/or via other kinds of traffic sent to each other using the internal cluster interconnects.

  • The algorithms to discover node departure need to avoid mistaken removal of nodes during temporary congestion.  We again use failures along multiple independent paths to ensure high confidence in the event.

  • Sometimes nodes can depart and then join again in a relatively short amount of time (for example, because of router glitches).  The system therefore avoids race conditions by enforcing the order of arrival and departure events.
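The points above can be illustrated with a small sketch: a node is suspected only when every observation path has gone silent, and arrival and departure events are applied strictly in order. The class `MembershipTracker`, the path names, the timeout, and the per-node sequence numbers are all assumptions for the example and are not taken from Citrusleaf.

```python
from typing import Dict, Set, Tuple


class MembershipTracker:
    def __init__(self, paths: Tuple[str, ...] = ("heartbeat", "fabric"),
                 timeout: float = 1.0):
        self.paths = paths
        self.timeout = timeout
        self.last_seen: Dict[Tuple[str, str], float] = {}
        self.last_seq: Dict[str, int] = {}
        self.members: Set[str] = set()

    def observe(self, node: str, path: str, now: float) -> None:
        """Record that `node` was heard from on `path` at time `now`."""
        self.last_seen[(node, path)] = now

    def is_suspect(self, node: str, now: float) -> bool:
        """True only if the node has been silent on every path."""
        return all(
            now - self.last_seen.get((node, p), float("-inf")) > self.timeout
            for p in self.paths
        )

    def apply_event(self, node: str, kind: str, seq: int) -> bool:
        """Apply an arrive/depart event unless a newer one was already seen."""
        if kind not in ("arrive", "depart"):
            raise ValueError(f"unknown event kind: {kind}")
        if seq <= self.last_seq.get(node, -1):
            return False                    # stale or duplicate event, ignore
        self.last_seq[node] = seq
        if kind == "arrive":
            self.members.add(node)
        else:
            self.members.discard(node)
        return True
```

With this shape, a congested heartbeat path alone does not trigger removal as long as other traffic from the node is still arriving, and a late "arrive" that predates a "depart" cannot resurrect a node that has already left.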

The Citrusleaf consensus algorithm for admitting and removing nodes from the cluster is unique in that consistency votes are taken only during cluster reorganization.  Once the cluster membership list is agreed upon, the rest of the data routing tables can be independently generated by the cluster members very quickly.  Unlike many other clustered solutions, Citrusleaf clusters do not have a “master” during normal operation.  All nodes are treated equally and data is distributed equitably among all nodes of the cluster.

Automatic data rebalancing

Citrusleaf’s data rebalancing mechanism ensures that query volume is distributed evenly across all nodes, and is robust in the event of node failure happening during rebalancing itself.  The system is designed to be continuously available, so data rebalancing doesn't impact cluster behavior.  The transaction algorithms are integrated with the data distribution system, and there is only one consensus vote to coordinate a cluster change. With only one vote, there is only a short period when the cluster internal redirection mechanisms are used while clients discover the new cluster configuration.  Thus, this mechanism optimizes transactional simplicity in a scalable shared-nothing environment while maintaining ACID characteristics.

Citrusleaf allows configuration options to specify how much available operating overhead should be used for administrative tasks like rebalancing data between nodes as compared to running client transactions. In cases where slowing transactions temporarily is preferred, the cluster will heal more quickly. In cases where transactional speed and volume must be maintained, the cluster will rebalance more slowly.
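The trade-off can be shown with simple arithmetic. The sketch below assumes a hypothetical model in which a node performs a fixed number of operations per tick and a single `rebalance_share` setting divides them; neither the setting name nor the model comes from the Citrusleaf configuration.

```python
import math
from typing import Tuple


def split_budget(ops_per_tick: int, rebalance_share: float) -> Tuple[int, int]:
    """Return (client_ops, rebalance_ops) for one tick."""
    if not 0.0 <= rebalance_share <= 1.0:
        raise ValueError("rebalance_share must be between 0 and 1")
    rebalance_ops = int(ops_per_tick * rebalance_share)
    return ops_per_tick - rebalance_ops, rebalance_ops


def ticks_to_rebalance(records_to_move: int, ops_per_tick: int,
                       rebalance_share: float) -> int:
    """How many ticks the migration takes at the given share."""
    _, rebalance_ops = split_budget(ops_per_tick, rebalance_share)
    if rebalance_ops == 0:
        raise ValueError("rebalancing never completes with a zero share")
    return math.ceil(records_to_move / rebalance_ops)
```

Under this model, doubling the share given to rebalancing halves the healing time, at the cost of the same number of operations taken away from client transactions in each tick.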

In some cases, the replication factor can't be satisfied. The cluster can be configured to either decrease the replication factor and retain all data, or begin evicting the oldest data that is marked as disposable. If the cluster can't accept any more data, it will begin operating in a read-only mode until new capacity becomes available - at which point it will automatically begin accepting application writes.
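The decision described above can be sketched as a small planning function. The capacity model (capacity counted in record copies), the policy names `"reduce_replication"` and `"evict_disposable"`, and the `Plan` type are assumptions for illustration, not Citrusleaf configuration options.

```python
import math
from dataclasses import dataclass


@dataclass
class Plan:
    replication_factor: int
    evict_records: int
    read_only: bool


def plan_for_shortfall(records: int, capacity: int, replication_factor: int,
                       policy: str, disposable: int = 0) -> Plan:
    """Decide how to react when `records * replication_factor` copies don't fit."""
    if policy not in ("reduce_replication", "evict_disposable"):
        raise ValueError(f"unknown policy: {policy}")
    rf, evict = replication_factor, 0
    if records * rf > capacity:
        if policy == "reduce_replication":
            # Keep every record; drop copies until the data fits (never below 1).
            while rf > 1 and records * rf > capacity:
                rf -= 1
        else:
            # Keep the replication factor; evict only records marked disposable.
            overflow = records * rf - capacity
            evict = min(disposable, math.ceil(overflow / rf))
    # With no free space left, refuse writes until capacity is added.
    read_only = (records - evict) * rf >= capacity
    return Plan(rf, evict, read_only)
```

For example, with 100 records, room for 250 copies, and a replication factor of 3, the first policy settles on two copies of everything, while the second evicts just enough disposable records to keep three copies of the rest. If neither action frees space, the plan falls back to read-only.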

By not requiring operator intervention, the cluster will self-heal even at demanding times. In one customer deployment, a rack circuit breaker blew, taking out one node of an 8-node cluster. No operator intervention was required. Even though the outage was at peak time for the data center, transactions continued with full ACID fidelity. Several hours later, when the fault was corrected and the troublesome rack was brought back online, operators did not need to take special steps to maintain the Citrusleaf cluster.