DC/OS Software Development Kit Status: Alpha

Building Multi-Service Schedulers

This is a “getting started” guide for running multiple services from a single SDK-based Scheduler process/Framework, with support for dynamically adding or removing those services from the Scheduler without needing to restart it.

Readers of this document should already have some experience writing SDK-based services, including familiarity with high-level SDK concepts such as ServiceSpecs and AbstractSchedulers.

Everything here is subject to change. There may be bugs or deficiencies in the current implementation as described here, and API changes may be needed before this feature is ready for production use. But please send feedback! And patches!

If you’re looking for example usage, just take a look at the reference implementation in hello-world. In particular, the ‘Dynamic Multi-Service’ example should be applicable to most people. It also has integration tests. In addition to that, most of the SDK code involved in this change resides in the SDK’s scheduler.multi package.

Overview

A Multi-Service Scheduler is effectively a single Mesos Framework/Scheduler process, which manages multiple underlying Services. A “Service” is represented by a single ServiceSpec (and associated Plans), which are wrapped in an AbstractScheduler object, potentially with other customizations provided by the developer.

Terminology

Existing Data Services

In practice, existing data services (Kafka, Cassandra, etc.) will continue to use the Mono-Scheduler structure for the foreseeable future, for the following reasons:

Limitations

The following are known limitations of Multi-Service Schedulers:

Requirements

In order to build a Multi-Service Scheduler, the developer needs to implement a few things themselves:

  1. A serialized format for the per-service config. This config must have the necessary information to rebuild your AbstractScheduler objects if/when the Scheduler is restarted.
  2. A ServiceFactory callback which will use that serialized config to build services via SchedulerBuilder.
  3. When building the service using SchedulerBuilder, call enableMultiService(String frameworkName).
  4. If services are supposed to be added/removed by end users, any HTTP endpoints or similar functionality needed to support those calls must be implemented by the developer.

Keep reading for more information on each of those points, with links to examples.

Config serialization and ServiceFactory implementation

If the Scheduler process is restarted, any previously added services must be re-added so that they can resume running. In order for the SDK to do this, the developer needs to provide a ServiceFactory callback which will recreate the AbstractScheduler object when invoked. This callback is provided with a byte[] context field, where any application-specific information needed to reconstruct a given service can be stored. To use this context field, the service developer must implement a serialization format for the config. For example, the context could hold a small JSON blob with application-level information about the service’s configuration.
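As a rough illustration of such a serialization format, the sketch below round-trips a per-service config through a byte[] context. Everything in it is an assumption made for the example: the ServiceConfig class, its three fields, and the "svc.yml" filename are hypothetical, and java.util.Properties is used in place of JSON only to keep the example dependency-free.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative only: ServiceContextExample and ServiceConfig are hypothetical
// names, not SDK classes.
final class ServiceContextExample {

    /** Hypothetical per-service config: whatever is needed to rebuild the service. */
    static final class ServiceConfig {
        final String serviceName;
        final String yamlTemplate;
        final int nodeCount;

        ServiceConfig(String serviceName, String yamlTemplate, int nodeCount) {
            this.serviceName = serviceName;
            this.yamlTemplate = yamlTemplate;
            this.nodeCount = nodeCount;
        }

        /** Serializes this config into an opaque byte[] context. */
        byte[] toContext() {
            Properties props = new Properties();
            props.setProperty("serviceName", serviceName);
            props.setProperty("yamlTemplate", yamlTemplate);
            props.setProperty("nodeCount", Integer.toString(nodeCount));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try {
                props.store(out, null);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return out.toByteArray();
        }

        /** Rebuilds the config from a context produced by toContext(). */
        static ServiceConfig fromContext(byte[] context) {
            Properties props = new Properties();
            try {
                props.load(new ByteArrayInputStream(context));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return new ServiceConfig(
                    props.getProperty("serviceName"),
                    props.getProperty("yamlTemplate"),
                    Integer.parseInt(props.getProperty("nodeCount")));
        }
    }

    public static void main(String[] args) {
        ServiceConfig original = new ServiceConfig("hello-1", "svc.yml", 3);
        ServiceConfig restored = ServiceConfig.fromContext(original.toContext());
        System.out.println(restored.serviceName + " " + restored.yamlTemplate + " " + restored.nodeCount);
    }
}
```

Whatever format is chosen, the important property is that the context alone carries enough information to rebuild the service after a restart.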

Call SchedulerBuilder.enableMultiService()

Within that ServiceFactory callback, the developer should use a SchedulerBuilder to build the AbstractScheduler object, as they would do with single-service schedulers today. However, the developer must also be careful to invoke SchedulerBuilder.enableMultiService(String frameworkName) to enable multi-service support within the service being built. Hello-world has an example of a ServiceFactory implementation.
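The shape of such a callback might look like the following sketch. It does not use the SDK: StubScheduler, StubSchedulerBuilder, and ContextFactory are local stand-ins invented for this example, and their signatures are assumptions. Only the method name enableMultiService(String frameworkName) is taken from the text above. Refer to the hello-world implementation for the real types.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: every type here is a local stand-in, not an SDK class.
final class FactorySketch {

    /** Stand-in for the built scheduler object; NOT the SDK's AbstractScheduler. */
    static final class StubScheduler {
        final String serviceName;
        final String frameworkName;
        final boolean multiService;

        StubScheduler(String serviceName, String frameworkName, boolean multiService) {
            this.serviceName = serviceName;
            this.frameworkName = frameworkName;
            this.multiService = multiService;
        }
    }

    /** Stand-in for the SDK's SchedulerBuilder; only enableMultiService is named in this guide. */
    static final class StubSchedulerBuilder {
        private final String serviceName;
        private String frameworkName; // stays null unless multi-service support is enabled

        StubSchedulerBuilder(String serviceName) {
            this.serviceName = serviceName;
        }

        StubSchedulerBuilder enableMultiService(String frameworkName) {
            this.frameworkName = frameworkName;
            return this;
        }

        StubScheduler build() {
            return new StubScheduler(serviceName, frameworkName, frameworkName != null);
        }
    }

    /** Hypothetical callback shape: context bytes in, rebuilt scheduler out. */
    interface ContextFactory {
        StubScheduler buildService(byte[] context);
    }

    /** Returns a factory that rebuilds services under the given parent framework. */
    static ContextFactory factoryFor(String frameworkName) {
        return context -> {
            // Assumption for this sketch: the context is just the service name in UTF-8.
            String serviceName = new String(context, StandardCharsets.UTF_8);
            return new StubSchedulerBuilder(serviceName)
                    .enableMultiService(frameworkName) // the step that is easy to forget
                    .build();
        };
    }

    public static void main(String[] args) {
        ContextFactory factory = factoryFor("hello-world");
        StubScheduler rebuilt = factory.buildService("hello-1".getBytes(StandardCharsets.UTF_8));
        System.out.println(rebuilt.serviceName + " in " + rebuilt.frameworkName
                + ", multiService=" + rebuilt.multiService);
    }
}
```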

HTTP Endpoint(s) (optional)

If the developer intends for end-users to add/remove services from the Scheduler, the developer must implement their own HTTP endpoint(s) which do this. The exact functionality of these endpoints depends on the specific service being implemented. For example, a Spark Dispatcher implementation could include an endpoint that emulates the spark-submit endpoint, which internally adds the submitted jobs as new services. Similarly, this endpoint in the hello-world reference implementation accepts the filename of an example YAML template to run, along with any parameters to use with it.

Implementation

Here are the main components to know about when building a Multi-Service Scheduler.

There are four new classes in the SDK to be aware of:

[Figures: Single-service structure; Multi-service structure]

Given these classes, there are three things you’d want to implement:

  1. Your main() function, which creates/initializes the above three classes, and then invokes MultiServiceRunner.run() to start the Framework thread.
  2. A callback which will rebuild any active services which were previously added via MultiServiceManager.putService(). If you have a fixed/static list of services to add, then this is very simple, since you’d just re-add the same list of services every time the scheduler starts. Otherwise, you would use a ServiceStore to handle persisting the services while they’re active, along with a ServiceFactory callback which would be used by the SDK to rebuild the services upon restart.
  3. Any application-specific logic for dynamically adding/updating/removing services in the list. For example, this could be HTTP endpoints which would result in calling MultiServiceManager.putService()/uninstallService(), after having updated the list of active services in the developer’s persisted storage (see previous requirement). This logic is only needed if the list of services can change on the fly; a fixed set of services would not need this.

Example flows

The following sections describe the steps to perform common operations in a Multi-Scheduler. To see a full reference implementation supporting all of these operations, take a look at the additions to hello-world in this dcos-commons PR.

Adding a Service

Adding a service to a Multi-Scheduler works as follows:

Reconfiguring a Service

Updating the configuration of existing services which were previously added is also supported. The flow for doing this in a Multi-Scheduler works as follows:

Restarting the Scheduler

In the event of a Scheduler process restart, the Scheduler will automatically reconstruct the active services as it’s initialized. This reconstruction is done using the ServiceFactory provided by the developer.
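The restart behavior can be modeled with a small simulation, shown below as a sketch rather than as the SDK's implementation. ContextStore stands in for persistent storage of per-service contexts, SchedulerProcess stands in for one lifetime of the Scheduler process, and a "service" is reduced to a descriptive string. All of these names and shapes are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: a toy model of "persist contexts, rebuild on startup".
final class RestartSketch {

    /** Stand-in for persistent storage; it outlives any single SchedulerProcess. */
    static final class ContextStore {
        private final Map<String, byte[]> contexts = new LinkedHashMap<>();

        void put(String serviceId, byte[] context) {
            contexts.put(serviceId, context.clone());
        }

        void remove(String serviceId) {
            contexts.remove(serviceId);
        }

        Map<String, byte[]> all() {
            return new LinkedHashMap<>(contexts);
        }
    }

    /** Stand-in for one process lifetime: active services exist only in memory. */
    static final class SchedulerProcess {
        final Map<String, String> activeServices = new LinkedHashMap<>();

        SchedulerProcess(ContextStore store, Function<byte[], String> factory) {
            // On startup, rebuild every persisted service through the factory.
            for (Map.Entry<String, byte[]> entry : store.all().entrySet()) {
                activeServices.put(entry.getKey(), factory.apply(entry.getValue()));
            }
        }
    }

    /** Hypothetical factory: turns a stored context back into a running "service". */
    static String rebuild(byte[] context) {
        return "service built from " + new String(context, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ContextStore store = new ContextStore();
        store.put("hello-1", "svc.yml".getBytes(StandardCharsets.UTF_8));
        store.put("hello-2", "simple.yml".getBytes(StandardCharsets.UTF_8));
        store.remove("hello-2"); // removed services are not rebuilt

        // Simulate a restart: a fresh process sees only what the store persisted.
        SchedulerProcess restarted = new SchedulerProcess(store, RestartSketch::rebuild);
        System.out.println(restarted.activeServices);
    }
}
```

The point of the model is that the in-memory service list is disposable: only the persisted contexts and the factory are needed to get back to the same set of active services.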

Removing a Service

Service removal is handled asynchronously. The developer requests that a service be removed, and it gets removed in the background (killing that service’s tasks and unreserving its resources). The developer is notified via a callback when the removal is complete:
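The request-then-callback pattern described above can be sketched as follows. This is a stand-alone model, not the SDK's MultiServiceManager: ServiceRegistry, its method signatures, and the onRemoved callback are assumptions for illustration, and the actual teardown work (killing tasks, unreserving resources) is reduced to a placeholder.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Illustrative only: ServiceRegistry is a hypothetical stand-in, not an SDK class.
final class RemovalSketch {

    static final class ServiceRegistry {
        private final Map<String, String> services = new ConcurrentHashMap<>();
        private final Consumer<String> onRemoved;
        // Daemon thread so that the background worker never blocks JVM exit.
        private final ExecutorService background = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "uninstall-worker");
            thread.setDaemon(true);
            return thread;
        });

        ServiceRegistry(Consumer<String> onRemoved) {
            this.onRemoved = onRemoved;
        }

        void putService(String name) {
            services.put(name, "RUNNING");
        }

        boolean contains(String name) {
            return services.containsKey(name);
        }

        /** Returns immediately; teardown and the callback run on the background thread. */
        CompletableFuture<Void> uninstallService(String name) {
            services.replace(name, "UNINSTALLING");
            return CompletableFuture.runAsync(() -> {
                // Placeholder for killing the service's tasks and unreserving its resources.
                services.remove(name);
                onRemoved.accept(name); // notify the developer that removal finished
            }, background);
        }
    }

    public static void main(String[] args) {
        List<String> removed = new CopyOnWriteArrayList<>();
        ServiceRegistry registry = new ServiceRegistry(removed::add);
        registry.putService("hello-1");

        // join() is used here only so the demo can print a final state.
        registry.uninstallService("hello-1").join();
        System.out.println("removed=" + removed + ", stillPresent=" + registry.contains("hello-1"));
    }
}
```

In a real scheduler the callback is the natural place to delete the service's entry from persisted storage, so that it is not rebuilt on the next restart.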

Uninstalling the Scheduler

Uninstalling the scheduler (i.e. via dcos package uninstall ...) unwinds all previously added services, then removes the parent framework and main Scheduler process once all services have been torn down: