Start by reading this documentation: http://mesos.apache.org/documentation/latest/maintenance/#how-does-it-work

Tested with DC/OS 1.11.4 (Mesos 1.5.1, Marathon 1.6.535), in strict mode.

Basically, there are three modes for a node in Mesos: UP, DRAIN, and DOWN.

You can move from UP to DRAIN by scheduling maintenance, and you can move back to UP from DRAIN by unscheduling maintenance.

The only way to get to DOWN is to put a node in DRAIN first (you can do both in quick succession by sending the two queries back to back).

Once a node is DOWN, you can only move it back to UP.

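These are the four valid transitions:

- UP to DRAIN: schedule maintenance
- DRAIN to UP: unschedule maintenance
- DRAIN to DOWN: start maintenance
- DOWN to UP: stop maintenance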
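
The transition rules described above can be sketched as a small shell check. This is only an illustration of the state machine, not something Mesos provides; the function name `valid_transition` is hypothetical.

```bash
# Hypothetical helper: succeeds (exit 0) only for the four transitions
# described in this document, and fails (exit 1) for anything else.
valid_transition() {
    case "$1->$2" in
        "UP->DRAIN")   return 0 ;;  # schedule maintenance
        "DRAIN->UP")   return 0 ;;  # unschedule maintenance
        "DRAIN->DOWN") return 0 ;;  # start maintenance
        "DOWN->UP")    return 0 ;;  # stop maintenance
        *)             return 1 ;;  # e.g. UP->DOWN or DOWN->DRAIN
    esac
}

# Going straight from UP to DOWN is rejected; DRAIN must come first.
valid_transition UP DOWN || echo "UP -> DOWN is not allowed"
```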

Respecting maintenance:

A framework must respect maintenance mode in order for DRAIN to do anything (putting a node in DOWN / starting maintenance will always evict running tasks, regardless of support for maintenance).

Marathon 1.6.535 (in DC/OS 1.11.4) supports respecting maintenance mode, but it must be turned on via a flag. To enable it, do the following:

sudo mkdir -p /var/lib/dcos/marathon
sudo tee /var/lib/dcos/marathon/environment <<-'EOF'
MARATHON_ENABLE_FEATURES=vips,task_killing,external_volumes,secrets,gpu_resources,maintenance_mode
EOF

This adds maintenance_mode to the list of enabled features. Note that this overrides the default enabled feature list in DC/OS 1.11.4, which is vips,task_killing,external_volumes,secrets,gpu_resources.
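
Because the file overrides the default list rather than extending it, the new value has to carry the defaults along. A minimal sketch of building that value, assuming the hypothetical names `DEFAULT_FEATURES` and `add_feature` (neither is part of DC/OS or Marathon):

```bash
# Default feature list for DC/OS 1.11.4, as stated in this document.
DEFAULT_FEATURES="vips,task_killing,external_volumes,secrets,gpu_resources"

# Hypothetical helper: append a feature to a comma-separated list
# unless the list already contains it.
add_feature() {
    case ",$1," in
        *",$2,"*) echo "$1" ;;      # already present, leave unchanged
        *)        echo "$1,$2" ;;   # append to the end
    esac
}

# Print the line that goes into /var/lib/dcos/marathon/environment.
echo "MARATHON_ENABLE_FEATURES=$(add_feature "$DEFAULT_FEATURES" maintenance_mode)"
```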

You can create this file prior to installing DC/OS (I think permissions will just kind of work, not entirely sure though).

This should not be necessary in Marathon 1.7 (DC/OS 1.12.x).

API Calls

Here are some example calls to perform various actions:

All of these go through the DC/OS Master IP (in this case, 172.31.47.190), using HTTPS and the mesos/api/v1 endpoint (which is the Mesos “V1 Operator” API endpoint).

Get the current state of maintenance (will indicate which nodes are currently in drain and which are down):

curl \
    -H "authorization: token=$(dcos config show core.dcos_acs_token)" \
    -kL \
    -X POST \
    -H "content-type: application/json" \
    https://172.31.47.190/mesos/api/v1 \
    -d '{ "type": "GET_MAINTENANCE_STATUS" }'

Get the current list of maintenance schedules:

curl \
    -H "authorization: token=$(dcos config show core.dcos_acs_token)" \
    -kL \
    -X POST \
    -H "content-type: application/json" \
    https://172.31.47.190/mesos/api/v1 \
    -d '{ "type": "GET_MAINTENANCE_SCHEDULE" }'
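
The two read-only calls above share the same curl invocation and differ only in the type field. A minimal sketch of a wrapper that avoids the repetition, assuming the hypothetical names `MASTER_IP`, `simple_payload`, and `mesos_call` (none of them come from Mesos or DC/OS):

```bash
# Assumption: MASTER_IP is a hypothetical variable; it defaults to the
# master address used throughout this document.
MASTER_IP="${MASTER_IP:-172.31.47.190}"

# Hypothetical helper: build the body for calls that need only a "type" field.
simple_payload() {
    printf '{ "type": "%s" }\n' "$1"
}

# Hypothetical wrapper: POST a JSON body to the V1 Operator API endpoint
# using the same curl flags as the examples above.
mesos_call() {
    curl \
        -H "authorization: token=$(dcos config show core.dcos_acs_token)" \
        -kL \
        -X POST \
        -H "content-type: application/json" \
        "https://${MASTER_IP}/mesos/api/v1" \
        -d "$1"
}

# Usage (needs a reachable cluster and a logged-in dcos CLI):
# mesos_call "$(simple_payload GET_MAINTENANCE_STATUS)"
# mesos_call "$(simple_payload GET_MAINTENANCE_SCHEDULE)"
```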

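Update the current list of maintenance schedules. Note that this call replaces the entire schedule (it does not append to it), and that the start time and duration are given in nanoseconds: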

curl \
    -H "authorization: token=$(dcos config show core.dcos_acs_token)" \
    -kL \
    -X POST \
    -H "content-type: application/json" \
    https://172.31.47.190/mesos/api/v1 \
    -d '
    {
        "type": "UPDATE_MAINTENANCE_SCHEDULE",
        "update_maintenance_schedule": {
            "schedule": {
                "windows": [
                    {
                        "machine_ids": [
                            {
                                "hostname": "172.31.18.234",
                                "ip": "172.31.18.234"
                            }
                        ],
                        "unavailability": {
                            "start": { "nanoseconds": 1554905650000000000 },
                            "duration": { "nanoseconds": 3600000000000 }
                        }
                    }
                ]
            }
        }
    }'
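
The start and duration values in the request above are nanosecond counts, which are awkward to write by hand. A minimal sketch of computing them from seconds, assuming the hypothetical helper names `seconds_to_ns` and `unavailability_json`:

```bash
# Hypothetical helper: convert whole seconds to nanoseconds.
seconds_to_ns() {
    echo "$(( $1 * 1000000000 ))"
}

# Hypothetical helper: print the "unavailability" fragment for a window.
# Arguments: start time in epoch seconds, duration in seconds.
unavailability_json() {
    printf '"unavailability": { "start": { "nanoseconds": %s }, "duration": { "nanoseconds": %s } }\n' \
        "$(seconds_to_ns "$1")" "$(seconds_to_ns "$2")"
}

# Example: a one-hour window that starts ten minutes from now.
start_seconds=$(( $(date +%s) + 600 ))
unavailability_json "$start_seconds" 3600
```

Called with 1554905650 and 3600, `unavailability_json` reproduces the values used in the request above.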

For example, to unschedule all schedules, post an empty list of windows:

curl \
    -H "authorization: token=$(dcos config show core.dcos_acs_token)" \
    -kL \
    -X POST \
    -H "content-type: application/json" \
    https://172.31.47.190/mesos/api/v1 \
    -d '
    {
        "type": "UPDATE_MAINTENANCE_SCHEDULE",
        "update_maintenance_schedule": {
            "schedule": {
                "windows": [
                ]
            }
        }
    }'

This starts maintenance (puts nodes in DOWN) for one or more nodes:

curl \
    -H "authorization: token=$(dcos config show core.dcos_acs_token)" \
    -kL \
    -X POST \
    -H "content-type: application/json" \
    https://172.31.47.190/mesos/api/v1 \
    -d '
    {
        "type": "START_MAINTENANCE",
        "start_maintenance": {
            "machines": [
                {
                    "hostname": "172.31.18.234",
                    "ip": "172.31.18.234"
                }
            ]
        }
    }'
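
The START_MAINTENANCE body above and the STOP_MAINTENANCE body differ only in the action name, so the payload can be generated for any number of nodes. A minimal sketch, assuming a hypothetical helper `maintenance_payload` and assuming each node uses its IP address as its hostname, as in the examples in this document:

```bash
# Hypothetical helper: build a START_MAINTENANCE or STOP_MAINTENANCE body.
# Arguments: "start" or "stop", followed by one or more node IPs.
# Assumption: hostname and ip are identical for every node.
maintenance_payload() {
    local action="$1"
    shift
    local machines="" ip upper
    for ip in "$@"; do
        # Join machine entries with ", " (no separator before the first).
        machines="${machines:+${machines}, }{ \"hostname\": \"${ip}\", \"ip\": \"${ip}\" }"
    done
    upper=$(printf '%s' "$action" | tr '[:lower:]' '[:upper:]')
    printf '{ "type": "%s_MAINTENANCE", "%s_maintenance": { "machines": [ %s ] } }\n' \
        "$upper" "$action" "$machines"
}

# Print the body for starting maintenance on one node.
maintenance_payload start 172.31.18.234
```

The output can be passed to the same curl command via `-d "$(maintenance_payload start 172.31.18.234)"`.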

This stops maintenance (brings nodes back to UP) for one or more nodes:

```bash
curl \
    -H "authorization: token=$(dcos config show core.dcos_acs_token)" \
    -kL \
    -X POST \
    -H "content-type: application/json" \
    https://172.31.47.190/mesos/api/v1 \
    -d '
    {
        "type": "STOP_MAINTENANCE",
        "stop_maintenance": {
            "machines": [
                {
                    "hostname": "172.31.18.234",
                    "ip": "172.31.18.234"
                }
            ]
        }
    }'
```