If your deployment is not finishing and/or your instances appear to start very slowly, your service may be delayed.
To protect the cluster from being overloaded with launch requests, Marathon slows down the launch rate of services that crash loop or fail frequently via a mechanism known as "backoff delay". Were it not for this protective measure, Marathon could cause cluster-wide outages as disks on Mesos agents fill up with the sandboxes created for the frequently launched and failing tasks.
There are a number of reasons why application instances might not be started. To find out whether the backoff delay is causing your instances not to be launched, query the /v2/queue endpoint and find the element whose service id equals the id of your application or pod, then check delay.timeLeftSeconds; if it is higher than 0, your service is delayed.
The backoff delay length and growth rate can be configured per service. For apps, the Application definition has the properties backoffSeconds, backoffFactor, and maxLaunchDelaySeconds. For pods, the Pod definition has a backoff object with the properties backoff, backoffFactor, and maxLaunchDelay.
backoffSeconds - The initial delay applied to a service that has failed for the first time.
backoffFactor - Controls the rate at which the backoff delay grows; a value of 1.05 would result in a 5% slower launch rate after each failure.
maxLaunchDelaySeconds - The largest delay allowed (default: 5 minutes).

When deploying a new service or a new version of an existing service, the delay value for that service is reset to backoffSeconds.
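As an illustration, these properties can be set in the app definition when creating or updating the service (a minimal sketch; the Marathon URL and the example values are assumptions, and only the backoff-related fields matter here):

```python
import requests

MARATHON_URL = "http://marathon.example.com:8080"  # assumed Marathon endpoint

# Example application definition with explicit backoff settings.
app_definition = {
    "id": "/my-app",                # assumed service id
    "cmd": "sleep 3600",
    "cpus": 0.1,
    "mem": 32,
    "instances": 1,
    "backoffSeconds": 1,            # initial delay after the first failure
    "backoffFactor": 1.15,          # each failure makes the delay 15% longer
    "maxLaunchDelaySeconds": 300,   # never wait more than 5 minutes
}

response = requests.post(f"{MARATHON_URL}/v2/apps", json=app_definition)
response.raise_for_status()
```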
Every time an instance of the service fails, the current value of the delay is multiplied by backoffFactor, up until maxLaunchDelaySeconds is reached.
The delay is also increased when a task fails or exits with exit code 0 (TASK_FAILED and TASK_FINISHED in Mesos, respectively).
The delay is NOT increased when a task is killed (TASK_KILLED in Mesos).
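The growth of the delay can be pictured with a small calculation (an illustration of the relationship described above, not Marathon's internal code; the example values are assumptions):

```python
# Illustrative only: how the delay grows with consecutive failures,
# assuming backoffSeconds=1, backoffFactor=1.15, maxLaunchDelaySeconds=300.
backoff_seconds = 1
backoff_factor = 1.15
max_launch_delay_seconds = 300

delay = backoff_seconds
for failure in range(1, 11):
    delay = min(delay * backoff_factor, max_launch_delay_seconds)
    print(f"after failure {failure}: delay ~ {delay:.2f}s")
```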
The current delay can be reset before it expires by issuing a DELETE /v2/queue/{service.id}/delay HTTP API request.
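For instance (a minimal sketch; the Marathon URL and service id are assumptions):

```python
import requests

MARATHON_URL = "http://marathon.example.com:8080"  # assumed Marathon endpoint
SERVICE_ID = "my-app"                              # assumed service id, written as it appears in the URL path

# Reset the backoff delay so Marathon retries launching the service immediately.
response = requests.delete(f"{MARATHON_URL}/v2/queue/{SERVICE_ID}/delay")
response.raise_for_status()
```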