This is meant to be a rolling FAQ for DC/OS users. This is very incomplete.
This simple ip-detect script that should almost always work:
#!/bin/bash
ip route get 8.8.8.8 | awk 'NR==1{print $NF}'
Things to try/check when standing up or troubleshooting a DC/OS cluster:
-
Make sure that selinux is in permissive or disabled mode:
getenforce
$ getenforce Permissive
-
Make sure that your ip-detect script returns the expected IP address (and that this IP address actually exists as an IP address on the system)
bash /opt/mesosphere/bin/detect_ip
ip a | grep $(/opt/mesosphere/bin/detect_ip)
$ bash /opt/mesosphere/bin/detect_ip 192.168.10.31 $ ip a | grep $(/opt/mesosphere/bin/detect_ip) inet 192.168.10.31/24 brd 192.168.10.255 scope global eth0
-
Check that time is reporting healthy:
ENABLE_CHECK_TIME=true /opt/mesosphere/bin/check-time
$ ENABLE_CHECK_TIME=true /opt/mesosphere/bin/check-time Checking whether time is synchronized using the kernel adjtimex API. Time can be synchronized via most popular mechanisms (ntpd, chrony, systemd-timesyncd, etc.) Time is in sync!
-
Check that exhibitor is healthy (should return with JSON indicating all of your masters, all with
"description":"serving"
and"code":3
curl -s $(ip route get 8.8.8.8 | awk 'NR==1{print $NF}'):8181/exhibitor/v1/cluster/status
$ curl -s $(ip route get 8.8.8.8 | awk 'NR==1{print $NF}'):8181/exhibitor/v1/cluster/status [{"code":3,"description":"serving","hostname":"192.168.10.31","isLeader":false},{"code":3,"description":"serving","hostname":"192.168.10.32","isLeader":true},{"code":3,"description":"serving","hostname":"192.168.10.33","isLeader":false}]
-
If you’re on Enterprise DC/OS, check that cockroachdb is healthy (look for ranges_underreplicated to be 0)
curl -skL 127.0.0.1:8090/_status/vars | grep ranges_underreplicated
$ curl -skL 127.0.0.1:8090/_status/vars | grep ranges_underreplicated # HELP ranges_underreplicated Number of ranges with fewer live replicas than the replication target # TYPE ranges_underreplicated gauge ranges_underreplicated{store="2"} 0
-
Verify that spartan is healthy
ping -c2 -W1 ready.spartan
$ ping -c2 -W1 ready.spartan PING ready.spartan (127.0.0.1) 56(84) bytes of data. 64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.031 ms 64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.030 ms --- ready.spartan ping statistics --- 2 packets transmitted, 2 received, 0% packet loss, time 1000ms rtt min/avg/max/mdev = 0.030/0.030/0.031/0.005 ms
-
Verify that Mesos is healthy (look for
overlay/log/recovered
andregistrar/log/recovered
to both be 1):curl -s $(ip route get 8.8.8.8 | awk 'NR==1{print $NF}'):5050/metrics/snapshot | tr ',' '\n' | grep recovered
$ curl -s $(ip route get 8.8.8.8 | awk 'NR==1{print $NF}'):5050/metrics/snapshot | tr ',' '\n' | grep recovered "overlay\/log\/recovered":1.0 "registrar\/log\/recovered":1.0
-
Verify that mesos-dns is healthy
ping -c2 -W1 leader.mesos
$ ping -c2 -W1 leader.mesos PING leader.mesos (192.168.10.32) 56(84) bytes of data. 64 bytes from master-asus-32 (192.168.10.32): icmp_seq=1 ttl=64 time=3.64 ms 64 bytes from master-asus-32 (192.168.10.32): icmp_seq=2 ttl=64 time=2.12 ms --- leader.mesos ping statistics --- 2 packets transmitted, 2 received, 0% packet loss, time 1000ms rtt min/avg/max/mdev = 2.124/2.886/3.648/0.762 ms
-
Verify that Marathon is healthy
curl -skL $(ip route get 8.8.8.8 | awk 'NR==1{print $NF}'):8080/v2/leader
$ curl -skL $(ip route get 8.8.8.8 | awk 'NR==1{print $NF}'):8080/v2/leader 3{"leader":"192.168.10.33:8080"}
-
See if the UI responds to curl:
curl $(ip route get 8.8.8.8 | awk 'NR==1{print $NF}') -kL | head -c300
$ curl $(ip route get 8.8.8.8 | awk 'NR==1{print $NF}') -skL | head -c300 <!DOCTYPE html> <html lang=en class=no-js> <head> <title> Mesosphere DC/OS </title> <meta charset=utf-8> <meta http-equiv=X-UA-Compatible content="IE=edge,chrome=1"> <meta name=title content="Mesosphere DC/OS"><!--[if lt IE 9]> <script src="http://html5shim.googlecode.com/svn/trunk/html5.js">
Q: I’m having issues installing DC/OS. Can you help me troubleshoot?
A1: Things to check with every installation:
- Did you verify that your ntp is properly synchronized and working?
- Can you verify that if you copy the ip-detect script to all of your nodes, and run it on each node, it properly responds with an IP address that is reachable from all nodes?
- Can you verify that the output of ip-detect run on your masters matches the IP address(es) given in your master_list in your config.yaml
- For troubleshooting, it is always helpful to provide the config.yaml that you’re using
- Look at logs, specifically journalctl logs. Look at the output of
sudo systemctl list-units dcos-*
(don’t forget the asterisk) on each node to see what systemd units are failing or not failing. You should see between 15 and 25 systemd units on all nodes. - If you identify that a lot of systemd units are failing, look at the journal logs for these units first:
- journalctl -fu dcos-spartan (this has been replaced by
dcos-net
in DC/OS 1.11.x and above) - journalctl -fu dcos-exhibitor
- journalctl -fu dcos-mesos-master (on masters)
- journalctl -fu dcos-spartan (this has been replaced by
A2: What installer did you use? Try using the advanced installer. It’s a bit easier to use and a lot easier to troubleshoot and understand than the CLI and GUI installers. A quick walkthrough of a very basic installation is available here: Distributed DC/OS Cluster Setup