DISCLAIMER: This is a very early version of the Cassandra-on-Mesos framework. This document, the code's behavior, and anything else may change without notice and/or break older installations.

This document is a technical description of what happens inside the Cassandra-on-Mesos framework - more precisely, inside the scheduler.

Reminder: the single instance of the Cassandra-on-Mesos scheduler submits tasks (and status requests) to its Cassandra-on-Mesos executors. Configuration and status information is stored as state objects in ZooKeeper, and the scheduler is the only process that updates that information. Cassandra-on-Mesos executors are relatively dumb processes that just do what the scheduler tells them to do.
This document assumes that the reader is familiar with Apache Mesos and (to some extent) with Apache Cassandra.
To better understand what is going on in the scheduler, it is essential to know the structures defined in model.proto.

The framework stores its state objects in ZooKeeper at
zk://<server-list>/cassandra-mesos/<framework-name> (let's call this the ZooKeeper base URL).
Since the framework name is encoded in the ZK URL you can spin off multiple framework instances with different names that can run concurrently on the same cluster.
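The base-URL scheme above can be sketched as a small helper. This is an illustration only, not the framework's actual code; the class and method names are assumptions:

```java
// Minimal sketch of how the ZooKeeper base URL is composed from the
// server list and the framework name (names here are assumptions).
class ZkBaseUrl {
    static String baseUrl(String serverList, String frameworkName) {
        return "zk://" + serverList + "/cassandra-mesos/" + frameworkName;
    }

    public static void main(String[] args) {
        // Two frameworks with different names can share the same ZK ensemble,
        // because the framework name is part of the path.
        System.out.println(baseUrl("zk1:2181,zk2:2181", "cluster-a"));
        System.out.println(baseUrl("zk1:2181,zk2:2181", "cluster-b"));
    }
}
```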
There are some objects stored directly beneath the ZooKeeper base URL. All of these objects are defined in model.proto and are stored in ZK using their message name.

- CassandraFrameworkConfiguration: base configuration object including all configuration role objects.
- CassandraClusterState: current status of the cluster.
- CassandraClusterHealthCheckHistory: contains a history of the last health checks that were received from all nodes.
- CassandraClusterJobs: contains the current cluster-wide job and the last job status (one per job type).
These objects are initially persisted when you start a Cassandra-on-Mesos framework for the first time.
Mesos periodically offers available resources to the Cassandra-on-Mesos scheduler via a call to
Scheduler.resourceOffers(). Within this call, the Cassandra-on-Mesos framework checks whether new nodes need to be "occupied" as Cassandra nodes, or whether tasks have to be submitted to already acquired nodes.
The framework currently only allows one Cassandra instance per node (based on IP address).
If CassandraClusterState.nodesToAcquire is greater than 0 and an offer is received for a node that does not already host a Cassandra instance, a Cassandra-on-Mesos executor is launched on that node. If CassandraClusterState.seedsToAcquire is also greater than 0, the new Cassandra instance will be a seed node.
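The acquisition decision described above can be sketched as follows. This is a simplification under assumed names; the real scheduler works against protobuf state objects and Mesos Offer messages:

```java
import java.util.Set;

// Sketch of the acquisition decision made during offer handling, based on
// CassandraClusterState.nodesToAcquire and seedsToAcquire. Class, method,
// and enum names are assumptions for illustration.
class OfferDecision {
    enum Action { LAUNCH_SEED_EXECUTOR, LAUNCH_EXECUTOR, NONE }

    static Action decide(String offeredIp, Set<String> occupiedIps,
                         int nodesToAcquire, int seedsToAcquire) {
        // Only one Cassandra instance per node, keyed by IP address.
        if (occupiedIps.contains(offeredIp) || nodesToAcquire <= 0) {
            return Action.NONE;
        }
        // While seeds are still to be acquired, the new instance is a seed node.
        return seedsToAcquire > 0 ? Action.LAUNCH_SEED_EXECUTOR
                                  : Action.LAUNCH_EXECUTOR;
    }
}
```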
If an offer is received for a node that already hosts a Cassandra-on-Mesos executor, offer handling checks whether tasks need to be launched via that executor, or whether a health check or a status report for a cluster-wide job is required.
Tasks are launched via
SchedulerDriver.launchTasks(); the possible task kinds are described below.

Messages are submitted via
SchedulerDriver.sendFrameworkMessage(); the possible message kinds are described below.
A CassandraNode is created with
targetRunState=RUN. There are four run states defined. It is possible to change the
targetRunState from any state to any state - except that it cannot be changed once it is set to TERMINATE.

- RUN: the normal and initial state. It means that the scheduler will do its best to keep the Cassandra process running. The process will be started if necessary.
- STOP: means that the Cassandra process should not run. It will be stopped if necessary.
- RESTART: basically an automatic transition via STOP back to RUN. After the Cassandra process has been stopped and is running again, the targetRunState is set to RUN.
- TERMINATE: a "dead end" state. It means to stop the Cassandra process and the executor. It is not possible to set the targetRunState for this node to another value afterwards. All you can do is initiate a node-replace for this node.
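The transition rule above (any state to any state, except out of TERMINATE) can be expressed as a tiny predicate. The enum values come from the document; the validation method itself is an illustration, not the framework's API:

```java
// Sketch of the targetRunState transition rule: transitions between all
// states are allowed, except that TERMINATE is a dead end.
class RunStates {
    enum TargetRunState { RUN, STOP, RESTART, TERMINATE }

    static boolean isTransitionAllowed(TargetRunState from, TargetRunState to) {
        // Once TERMINATE is set, the only recourse is a node-replace;
        // the target state is irrelevant.
        return from != TargetRunState.TERMINATE;
    }
}
```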
The Cassandra-on-Mesos framework uses so-called cluster jobs to ensure that certain tasks like repair, cleanup, and restart are never executed against more than one node at a time. Some job types (repair and cleanup) involve activity performed against the actual Cassandra process. Other job types (restart) are just a kind of synchronization object.

Only one cluster job can be active at any given time. A cluster job can be aborted - effectively after the current node has finished its part. At the moment there is no way to interrupt a node's part of a cluster job.

A node's part of a cluster job is effectively cancelled when the Cassandra process, its executor, or its slave dies.
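The cluster-job constraints (at most one active job, aborts honored only between nodes) can be sketched like this. Class and method names are assumptions, not the framework's actual code:

```java
// Illustration of the cluster-job constraints: only one job can be active,
// and an abort only takes effect once the node currently working on the
// job has finished its part.
class ClusterJobGuard {
    private String activeJobType = null;
    private boolean abortRequested = false;

    boolean tryStart(String jobType) {
        if (activeJobType != null) {
            return false; // only one cluster job can be active at a time
        }
        activeJobType = jobType;
        abortRequested = false;
        return true;
    }

    void requestAbort() {
        abortRequested = true; // honored only between nodes, never mid-node
    }

    /** Called when a node reports its part of the job as finished. */
    boolean proceedToNextNode() {
        if (abortRequested || activeJobType == null) {
            activeJobType = null;
            return false;
        }
        return true;
    }
}
```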
A CassandraNode keeps track of all tasks that have been launched via the executor. Each started task is added to
CassandraNode.tasks. A task is only removed from that list when the scheduler receives
the appropriate status update from Mesos via
Scheduler.statusUpdate(). This means that the
list contains only running tasks.

There are four kinds of tasks that can be launched:
- METADATA: used to inquire some information about the node the executor runs on. This task does not spawn a process.
- SERVER: the task that writes the Cassandra configuration files and starts the Cassandra process. Task status changes from executor to scheduler represent whether the Cassandra process is running.
- CLUSTER_JOB: the task that represents the node's part of a cluster-wide job like repair or cleanup. This task does not spawn a process.
- CONFIG_UPDATE: the task that just updates the Cassandra configuration files. This task does not spawn a process.
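The CassandraNode.tasks bookkeeping described above can be sketched as a simple set. This is a simplification under assumed names; the real list lives in the protobuf state stored in ZooKeeper:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the task bookkeeping: a task is added when it is launched and
// removed only when a terminal status update arrives via
// Scheduler.statusUpdate(), so the set contains only tasks believed to be
// running.
class NodeTasks {
    private final Set<String> runningTaskIds = new HashSet<>();

    void taskLaunched(String taskId) {
        runningTaskIds.add(taskId);
    }

    /** Invoked when Mesos reports a terminal state for the task. */
    void terminalStatusUpdate(String taskId) {
        runningTaskIds.remove(taskId);
    }

    boolean isRunning(String taskId) {
        return runningTaskIds.contains(taskId);
    }
}
```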
The lifecycle of a task is communicated via the standard Mesos messages.
The scheduler requests status information from executors periodically via so-called Mesos framework messages. Framework messages are not guaranteed to arrive in order, or even to be delivered at all.

Cassandra-on-Mesos uses two kinds of status request/response framework messages:

- HEALTH_CHECK_DETAILS: to get information about the status of the Cassandra process.
- NODE_JOB_STATUS: to get information about the status and progress of a node's part of a cluster-wide job like repair or cleanup.
The executor can submit framework messages on its own as part of another task - for example when detecting that the Cassandra process died or to inform the scheduler about a status change.
Sometimes the framework has to ensure that a Cassandra node can be considered "live". A Cassandra node is considered "live" when all of the following criteria match:

- The CassandraNode has a task of type SERVER (which means that the Cassandra process is running).
- The CassandraNode has a recent HealthCheckHistoryEntry, which is healthy and has a NodeInfo.
- The NodeInfo in the recent HealthCheckHistoryEntry indicates that both native transport and thrift transport are running.
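The liveness criteria can be expressed as a predicate. The boolean parameters stand in for the checks against the CassandraNode, the recent HealthCheckHistoryEntry, and its NodeInfo; the method shape is an assumption, not the framework's API:

```java
// Sketch of the "live" check: every criterion must hold.
class Liveness {
    static boolean isLive(boolean hasServerTask,
                          boolean recentHealthCheckHealthy,
                          boolean nativeTransportRunning,
                          boolean thriftTransportRunning) {
        return hasServerTask
            && recentHealthCheckHealthy
            && nativeTransportRunning
            && thriftTransportRunning;
    }
}
```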
The Cassandra-on-Mesos framework currently defines the following limitation before a Cassandra process can be started:

- A new Cassandra process may only be started after bootstrapGraceTimeSeconds have elapsed since the previous Cassandra process was started. This is to allow a previously started node to successfully bootstrap.
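The grace-time check above amounts to a simple time comparison. The method and parameter names are assumptions; only bootstrapGraceTimeSeconds comes from the framework's configuration:

```java
// Sketch of the bootstrap grace-time limitation: a new Cassandra process may
// only be started once bootstrapGraceTimeSeconds have elapsed since the
// previous server launch.
class BootstrapGrace {
    static boolean mayStartServer(long nowMillis, long lastLaunchMillis,
                                  long bootstrapGraceTimeSeconds) {
        return nowMillis - lastLaunchMillis >= bootstrapGraceTimeSeconds * 1000L;
    }
}
```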
The initial state (
CassandraClusterState) is basically empty and has the field
nodesToAcquire set to the
configured number of nodes and
seedsToAcquire set to the configured number of seeds.

At first, the scheduler just acquires the seed nodes by allocating the required resources and launching the Cassandra-on-Mesos executor. No Cassandra process will be started until there are enough seed nodes to fulfil the initially configured number of seeds.