What do folks generally do for monitoring, alerting and logging in the brave new world of containers? Do you now have to monitor both the host machine and the inside of each container? How do you deal with the added complexity?
It’s generally a good idea to err on the side of over-monitoring, as you never know which random question will turn out to be useful to answer while handling an incident. Monitor your host system metrics, monitor your Mesos metrics (https://github.com/rayrod2030/collectd-mesos is a dead simple metric collection example for this), and monitor your workloads. None of this is new: it is existing best practice, and there are many existing tools for it.
When I was an SRE at Tumblr I gained a lot of respect for host-local metric and log aggregators that forward to clusters and buffer to disk when downstream failures occur - these are super useful in the context of Mesos as well. This way your tasks can hit a local endpoint, and the local aggregator worries about remote failure handling or downstream reconfiguration in a uniform way.
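To make the buffer-to-disk idea concrete, here is a minimal sketch of what such a host-local forwarder does on the inside. It is not how any particular aggregator is implemented: the class name `BufferingForwarder`, the `send` callable, and the JSON-lines spool file are all assumptions made up for illustration.

```python
import json
import os


class BufferingForwarder:
    """Forward metrics downstream; spool them to disk while downstream fails.

    Hypothetical sketch: `send` is any callable that takes a dict and raises
    OSError (e.g. ConnectionError) when the downstream cluster is unreachable.
    """

    def __init__(self, send, spool_path):
        self.send = send
        self.spool_path = spool_path  # file holding not-yet-delivered metrics

    def submit(self, metric):
        # Drain older buffered metrics first so delivery order is preserved.
        if self.flush() and self._try_send(metric):
            return
        with open(self.spool_path, "a", encoding="utf-8") as spool:
            spool.write(json.dumps(metric) + "\n")

    def flush(self):
        """Resend spooled metrics; return True if the spool is now empty."""
        if not os.path.exists(self.spool_path):
            return True
        with open(self.spool_path, encoding="utf-8") as spool:
            pending = [json.loads(line) for line in spool if line.strip()]
        while pending and self._try_send(pending[0]):
            pending.pop(0)
        if not pending:
            os.remove(self.spool_path)
            return True
        # Keep whatever is still undelivered, replacing the spool atomically.
        tmp_path = self.spool_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as spool:
            for metric in pending:
                spool.write(json.dumps(metric) + "\n")
        os.replace(tmp_path, self.spool_path)
        return False

    def _try_send(self, metric):
        try:
            self.send(metric)
            return True
        except OSError:
            return False
```

Tasks only ever call `submit`, so they never need to know whether the downstream cluster is up, down, or being reconfigured. A real aggregator would also cap the spool size and flush on a timer, which this sketch leaves out.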
One thing I’ll warn about in a more dynamic environment is that you should test anything that you rely on to re-resolve DNS. The JVM, for instance, can cache successful lookups indefinitely unless you explicitly tell it not to on initialization (the `networkaddress.cache.ttl` security property controls this). While DNS is a nice, universally half-implemented solution, a ton of stuff will never re-resolve after timeouts or connection failures, no matter how many of them pile up.
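For clients you write yourself, the behavior to aim for is a fresh lookup before every reconnect attempt. Below is a rough sketch of that, not a drop-in library: `connect_with_reresolve` and its `resolver` and `connector` parameters are hypothetical names, injected so that the re-resolution can be tested without touching a real network.

```python
import socket


def resolve(host, port):
    """Ask the system resolver for the current addresses of host:port."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [info[4][:2] for info in infos]  # (ip, port) pairs


def connect_with_reresolve(host, port, attempts=3, resolver=resolve,
                           connector=socket.create_connection):
    """Connect to host:port, looking the name up again before each attempt."""
    last_error = None
    for _ in range(attempts):
        try:
            # Fresh lookup every time; never reuse an IP from an earlier attempt.
            addresses = resolver(host, port)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            last_error = exc
            continue
        for address in addresses:
            try:
                return connector(address, timeout=2.0)
            except OSError as exc:
                last_error = exc
    raise ConnectionError(f"could not reach {host}:{port}") from last_error
```

Note that the operating system or a local caching resolver may still serve stale answers underneath this, so it is worth testing end to end by actually moving the service and watching whether clients follow.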
We use one monitoring system for Mesos itself (soon we’ll be open-sourcing it!), and have applications & containers self-report metrics & alerts to hosted instances of Riemann. This effectively lets us split ownership between the applications running on the cluster and the cluster infrastructure itself.
Reading that post made me realize how long I’ve been out of the startup devops world. I’d never even heard of Mesos!