The hardware required for this meager amount of monitoring makes me want to cry. Why do we keep making things worse?
ServiceNow, using Xymon, has aggregated all their monitoring data into one system that monitors:
Hosts : 569869
Pages : 60421
Status messages : 740185
That’s a lot of servers and almost 750,000 services being checked. I remember single server setups on old HP boxes with 8GB of RAM doing 50,000 hosts back in 2008.
In the monitoring space we’re building new tools which are considerably less efficient but nobody seems to care.
I’m not sure that 31,000 samples a second is meager, but this server is clearly larger than we need for ingesting samples (as is obvious from the usage level). It needs RAM (and CPU) primarily for complex or long time range queries over the raw metrics data. Such queries involve two tradeoffs: one between server resources and response time (Prometheus loads data into memory for query speed), and one between the flexibility of querying raw data and pre-computing a limited set of measures.
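To put the ingestion rate in perspective, here is a rough back-of-the-envelope sketch in Python. The function names are made up for illustration, and the bytes-per-sample figure is an assumption based on the 1 to 2 bytes per sample rule of thumb in the Prometheus storage documentation, not something measured on this server.

```python
# Back-of-the-envelope arithmetic for a steady sample ingestion rate.
# All names here are illustrative; only the 31,000 samples/second figure
# comes from the discussion above.

SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(samples_per_second):
    """Total samples ingested in one day at a constant rate."""
    return samples_per_second * SECONDS_PER_DAY


def disk_bytes(retention_seconds, samples_per_second, bytes_per_sample):
    """Rough disk estimate: retention * ingestion rate * bytes per sample.

    bytes_per_sample is an assumption (roughly 1 to 2 after compression).
    """
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    rate = 31_000  # samples per second, from the comment above
    print("samples per day:", samples_per_day(rate))
    for bps in (1, 2):  # assumed bytes per sample
        gb = disk_bytes(SECONDS_PER_DAY, rate, bps) / 1e9
        print(f"approx. disk per day at {bps} B/sample: {gb:.2f} GB")
```

This only covers ingestion and on-disk volume. It says nothing about query-time memory, which is the part that actually drives the RAM sizing.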
The server itself is a basic Dell 1U server (an R230). The actual hardware cost involved is modest and wouldn’t be reduced significantly by making it smaller. Also, system metrics suggest that we rarely go above 6 or 7 GB of RAM these days in active use, so loading it up to 32 GB is probably overkill with current versions of Prometheus. But better safe than sorry with an OOM explosion due to an unexpectedly complex and memory-intensive query.
(I’m the author of the linked-to entry.)
I’ve never heard of Xymon, but I suppose most people are coming from Nagios/Icinga, and I didn’t notice Prometheus being any worse than those setups in terms of hardware requirements. Not that this is saying a lot, but I see that as the baseline. I heard check_mk was better, but I never used it myself.
Oh, and one thing: again, I have no idea what Xymon did in that setup, but Prometheus is not primarily or solely used for monitoring; it is used for metrics. Those will always use a lot more processing power on the receiving end than a simple online/offline check. Not everything can be shoehorned into rrdtool.