1. 48

I was wondering what others are doing to monitor their server infrastructure for small(er) scale projects. I am currently looking for something that basically does the following for me:

On 3 VPS, monitor CPU usage, disk space usage and network traffic and send me alerts if any of these values reach a defined threshold or have very unusual spikes.

All of the solutions I could find out there, both hosted and self-hosted, seem overly complex for my needs: they require a lot of configuration wrangling and come with a lot of bells and whistles I do not need. I’ve got pretty decent metrics on application performance already, so this would really be about monitoring infrastructure to raise an alarm before systems go into an unhealthy state and bring down the services.

  2. 23

    Prometheus/Grafana? The Prometheus node exporter should have everything you need out of the box.

    1. 5

      This is what I’m using to monitor my two servers. I set up Prometheus with static scrape targets for the node exporters on my two servers, as well as the application metrics on one of them. The config uses the DigitalOcean private network IPs, so the scrape traffic is unencrypted. Then I set up Caddy as a reverse proxy with HTTPS. Finally I created a Grafana config that uses SQLite for storage, uses GitLab OAuth to log in, points at the local Prometheus interface, and has a Slack auth token for notifications.

      This all lives in the NixOS expression for my server, so it’s actually pretty easy to maintain. The part that took the longest was setting up the GitLab auth. I can share the config if you’d like.

      1. 5

        Please do, if you could; that would be helpful. I’m always interested in how other people are setting up NixOS.

        1. 7

          Here’s what it looks like: https://gitlab.com/-/snippets/2102573

          The network.nix is the config for morph to deploy the server. The configuration.nix contains all the config for the monitoring server. I just run nix-shell --run 'morph deploy network.nix switch --upload-secrets --on "monitoring*"' when I want to update the server.

          1. 7

            I followed https://christine.website/blog/prometheus-grafana-loki-nixos-2020-11-20 when setting it up on my infrastructure and it was really helpful

        2. 3

          I’d second this. It does require more moving parts than stuff like monit, et al, but it’s simple to set up and flexible enough that you can add in additional data as you want.

          1. 1

            Third vote for this! I use it at Afterburst (~20 nodes) and at work (>1k nodes), and I cannot recommend the combination enough.

          2. 14

            netdata might be a bit overkill, but it is easy to set up and has a cloud offering.

            OT but for cron I use healthchecks

            1. 8

              I’m a big fan of monit. It has some bells and whistles, but it will let you check all sorts of metrics and send an email, or do whatever else you like, if they get past a specified threshold. That’s very configurable: you could, for example, have one check that alarms on sustained > 20 Mbps over an hour, and another on a short burst of 100 Mbps over 2 minutes. Another feature I like is that it can also check services, either via a command or by checking a certain port, and, if desired, automatically restart them. It’s very easy to configure, and there are a ton of examples on their wiki.
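
To make the "sustained vs. short burst" idea concrete, here is a minimal Python sketch. It is not monit's configuration syntax; the class name `SustainedAlarm` and the sample values are made up for illustration. An alarm fires only once a value has stayed above its limit for the whole window, so two instances with different limits and windows cover both cases.

```python
class SustainedAlarm:
    """Fires once a value has stayed above `limit` for `window_s` seconds.

    Hypothetical helper for illustration; not part of monit.
    """

    def __init__(self, window_s, limit):
        self.window_s = window_s
        self.limit = limit
        self.breach_start = None  # timestamp of the first sample above the limit

    def update(self, ts, value):
        """Feed one (timestamp, value) sample; return True if the alarm fires."""
        if value <= self.limit:
            # Back under the limit: forget the breach and start over.
            self.breach_start = None
            return False
        if self.breach_start is None:
            self.breach_start = ts
        return ts - self.breach_start >= self.window_s


# Two rules over the same traffic samples (values in Mbps, assumed):
sustained = SustainedAlarm(window_s=3600, limit=20.0)  # > 20 Mbps for an hour
burst = SustainedAlarm(window_s=120, limit=100.0)      # > 100 Mbps for 2 minutes
```

Each new sample would be fed to both alarms, and a notification sent when either `update` call returns `True`.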

              1. 5

                I can recommend https://uptimerobot.com/ (free version).

                It’s independent of your infrastructure and can notify via mail (or Twitter DM, if you run your own mail server on the servers you want to monitor).

                1. 1

                  Doesn’t seem to monitor CPU usage, disk space usage and network traffic

                2. 5

                  Good, old Munin.

                  1. 5

                    Who is your VPS provider? Most of them already provide those kinds of metrics.

                    Otherwise you can check out things like https://www.monitorix.org (I don’t use it because I don’t need to but I have bookmarked it in case I need to).

                    1. 1

                      In this particular case it’s Hetzner Cloud, but it comes with the same problem that everyone else has: they cannot monitor your disk space usage as they do not have OS level access.

                      I thought about using the Hetzner Cloud API too, but it would only tell me about CPU and network.

                      1. 8

                        If you’re just concerned about disk space and can do everything else with your VPS provider, then I’d just write a small shell script to check the df output, email me if it’s >80%, and run it through cron/periodic.
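
A minimal sketch of that check, written in Python rather than shell and using `shutil.disk_usage` instead of parsing `df`. The 80% threshold comes from the comment above; the function names and the choice of `/` as the path are assumptions. Note that `df` computes its percentage against the space available to unprivileged users, so the number here can differ slightly.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return used space on the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(percent, threshold=80.0):
    """True if usage is strictly above the threshold."""
    return percent > threshold


if __name__ == "__main__":
    pct = disk_usage_percent("/")
    if over_threshold(pct):
        # Only print when there is a problem, so a cron job stays silent otherwise.
        print(f"WARNING: / is {pct:.1f}% full")
```

Run from cron, the script prints only when the threshold is exceeded, and cron can mail that output if the host has working mail delivery configured.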

                    2. 4

                      I guess I am old or grumpy, but I find all the solutions posted here overkill.

                      I don’t believe any of those will require less time to set up and maintain than writing a simple shell script or Python script that does whatever you want. If alerts on your phone are what you want, that’s not a problem at all. Sign up for an SMS gateway, use Telegram, SQS or whatever you fancy. Reaching these from shell scripts is usually trivial.

                      If you want historical data, just output to a CSV or similar. Creating a chart is also a matter of one command (literally hundreds of options for this).
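
As a rough illustration of the CSV approach, here is a small Python sketch. The function name `append_sample`, the column names, and the file name in the usage note are all hypothetical. It appends one timestamped row per call and writes the header only when the file is new or empty.

```python
import csv
import os
import time


def append_sample(csv_path, values, now=None):
    """Append one timestamped row of metrics to a CSV file.

    `values` maps metric name to value; keys must stay the same across calls
    so that the columns line up.
    """
    fieldnames = ["timestamp"] + sorted(values)
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    row = dict(values)
    row["timestamp"] = int(time.time() if now is None else now)
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

A cron job could call something like `append_sample("metrics.csv", {"load1": os.getloadavg()[0]})` every few minutes, and the resulting file can be fed to whichever charting tool you prefer.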

                      1. 1

                        RRDTools and a shell script is still my favorite go-to for this sort of thing. I prefer the look of RRDTools graphs, and the format can store years of data at various resolutions using surprisingly little disk space.

                        1. 1

                          What a blast from the past. I had forgotten that existed. I think the last time I visited the RRDTools website was some 15 years ago.

                          Here’s a more modern take on the same problem: http://traildb.io/

                          I like the idea, but it is semi-abandoned.

                      2. 3

                        U.S.E.R.S: the universal system error recognition service.

                        Turns out that people text or email me when things go wrong, if it’s a thing they use.

                        I also keep a stats(8) window up for the servers I run, and the activity shows when stuff is working.

                        I should really improve the setup, but my system generally Just Works, and I don’t have an SLA I need to comply with. The worst downtime has been bad flash on colo switches.

                        1. 3

                          Depends on the amount of data:

                          • relatively small - send it to a NUC at home running grafana + influxdb which collects all sorts of stuff

                          • larger data streams - install grafana + influxdb on one of the VPSes of the service (one with lower load)

                          (This is a setup for small / very non-critical stuff only.) You can run that setup with minimal resource usage and next to no infrastructure preparation. You can set up everything via a short docker-compose file: https://github.com/jkehres/docker-compose-influxdb-grafana/blob/master/docker-compose.yml

                          For submission, you can either use a custom cron-scheduled script that submits the information or use telegraf (https://www.influxdata.com/time-series-platform/telegraf/), which has plugins for almost anything. Grafana does have lots of bells and whistles, but you can ignore them all and just use alerts.

                          1. 3

                            I found https://nixstats.com/ to be affordable for my personal needs. It has per-server metrics gathering and built-in application/URL monitors, and supports custom collectors if that’s your thing.

                            Prior to that I had written a shell script executed via a 5 minute cron task that would gather various metrics, store them and produce various graphs using RRDTools. I had a static html page with js that would reload the images every 10 minutes so I could have an almost up to date dashboard of the server if I so chose.

                            To be honest I still prefer the graphs produced by RRDTools to any of these fancy pants services.

                            1. 3

                              It’s been a long time since I monitored a small group of physical machines, but back in the day Monit gave me stats on each individual process on each machine, Munin aggregated all the physical measurements of the machines into one dashboard, and an external service like Pingdom let me know the machines were up.

                              It was a sweet little set-up I was always proud of.

                              1. 3

                                Two things:

                                Use an external service to check reachability (ping, https, ssl cert validity, ssh, dns, ntp). This will be the minimal plan for any real service – free or $5/month.
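
For one item on that list, SSL cert validity, a minimal Python sketch using only the standard library is below. The function names are hypothetical, and a real external checker does much more (retries, multiple probe locations, the other protocols mentioned).

```python
import socket
import ssl
import time


def days_left(not_after, now=None):
    """Days until a certificate's notAfter date, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def fetch_not_after(host, port=443, timeout=10):
    """Connect over TLS and return the peer certificate's notAfter string.

    The default context verifies the certificate chain and hostname, so an
    already-invalid certificate raises ssl.SSLError instead of returning.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_left(fetch_not_after("example.org"))` and alert when the result drops below, say, 14 days; both the host and the 14-day margin are placeholders.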

                                Use mon, icinga or shinken on one of your VPS to monitor itself and the other two for things like disk space, processes running, network and cpu load, and anything else you can monitor via SNMP or a quick ssh command.

                                1. 3

                                  I’m an old sysadmin, so yeah, Nagios Core is my go-to. I’ve tried so many times to move away from it, but with NRPE, ansible, and Nagios Core I can get useful monitoring up with email in like 3ish hours.

                                  Then it just sits there and I add things to it when needed.

                                  1. 7

                                    “I can get useful monitoring up with email in like 3ish hours.”

                                    I’m kind of sad that this seems normal. I really expect there to be some out-of-the-box solution covering 90% of use cases that would take five minutes to set up with automation, or just an hour if doing it “by hand”.

                                    1. 3

                                      Yeah, it’s a problem that tons of companies have tried to fix. I really tip my hat to my friends at Sensu; they have been fighting the good fight for years.

                                      They have some great technology, example, youtube here, but I default to what I know for my personal stuff. Been Nagios for going on over a decade, still gonna be Nagios in a decade in the future.

                                      Edit: Nagios is so mainstream that Grammarly knows it, but not Kubernetes?

                                      1. 2

                                        I think most of the companies that have succeeded with making that easy end up charging more than you’d like to pay for a small setup. Datadog* seems to be that easy for basic use cases, but it costs more than you’d want for a side project.

                                        * I’m a former Datadog employee, FYI.
                                    2. 2

                                      I used https://okmeter.io/ when my infrastructure was 2-10 servers.

                                      1. 2

                                        zabbix

                                        1. 2

                                          I think netdata (federated or in cloud mode) might be worth a look. Or glances: https://nicolargo.github.io/glances/

                                          https://linoxide.com/install-use-netdata-monitoring-tool-linux/

                                          https://tech.davidfield.co.uk/netdata-monitoring-your-servers/

                                          Netdata should be light on resources, and both should be somewhat simpler and more ready out of the box than Grafana + Prometheus.

                                          But I’m interested to see what you land on. I’m not aware of a “perfect” opinionated, zero-setup monitoring system.

                                          I second the recommendation for Uptimerobot for… Monitoring (external services) uptime.

                                          1. 3

                                            Glances - which I’ve never heard of until now - looks a whole lot like what I was looking for. I will definitely try setting it up. Thanks for the tip.

                                            1. 1

                                              Netdata isn’t particularly light on resources: it takes between 100MB and 150MB RSS on each of my various VPS and dedicated servers, according to htop. However, I believe it can be tuned to use less memory, for example by collecting fewer metrics.

                                              1. 1

                                                Do you have examples of agents that collect similar amount of data using less resources?

                                                1. 2

                                                  At work we use Telegraf with Prometheus and Grafana. Telegraf takes between 30MB and 45MB RSS in htop. However, with this kind of solution you have to configure and maintain a central collector (Prometheus), whereas each Netdata instance on my own infrastructure is completely autonomous.

                                                  The tradeoff is that I use more resources per monitored node to protect against failures of a central collector, but this was a personal choice which I’m ready to reverse if I can find something like a SaaS collector where someone else is in charge of availability.

                                                  1. 1

                                                    Thanks for the two datapoints - I would probably have guessed the relationship would be the reverse. But maybe I’m mixing up netdata and a different, lightweight collector.. Although I can’t think of which that would likely be.

                                            2. 1

                                              Netxms

                                              1. 1

                                                I use the dashboard provided by the cloud service provider and have set some alerts to get notified when CPU/disk usage reaches maximum usage

                                                1. 1

                                                  Install the monit client and have the system itself email / slack you alerts when it is out of disk space for example.

                                                  If you can afford a RPi or any other cheap server at home, use a prometheus node exporter on each of the machines and scrape the data with that local machine. You’ll have the joy of learning the basics of prometheus and alertmanager.

                                                  You can make it a learning experience:

                                                  • install prometheus and its exporters
                                                  • setup alertmanager and send alerts somewhere
                                                  • learn about webhooks and how to trigger different actions when you receive them
                                                  • make it so that only your machine can scrape data
                                                  • make it so that you can save data older than 15 days to a local InfluxDB (or other)
                                                  • use the RIPE Atlas software probe to measure availability (if the situation allows you to run it alongside anything else on your machine).
                                                  • the journey can continue
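
As a starting point for the webhook step in the list above, here is a small Python sketch that turns an incoming webhook body into one line per alert. The field names (`alerts`, `status`, `labels`, `annotations`, `alertname`, `instance`, `summary`) reflect my understanding of Alertmanager's webhook JSON and should be checked against its documentation; the function names are hypothetical.

```python
import json


def summarize_alerts(payload):
    """Return one human-readable line per alert in a parsed webhook payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} on {instance}: {summary}".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "unnamed"),
                instance=labels.get("instance", "?"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


def handle_webhook_body(body):
    """Parse a raw JSON request body and summarize the alerts it carries."""
    return summarize_alerts(json.loads(body))
```

From there, each line could be forwarded to chat or email, or the `status` field could be used to trigger different actions for firing and resolved alerts.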

                                                  Because it is a small scale infrastructure you have the wonderful opportunity to expose yourself to too many technologies, one step at a time, without feeling their weight pressed upon you.

                                                  1. 1

                                                    Different ways:

                                                    • Sometimes the company hosting the server provides means for basic information, like the things you describe
                                                    • Prometheus (sometimes with Grafana, depending on how small)
                                                    • I’ve been using netdata, but it’s a resource hog and there was something else I was unhappy about. Forgot what though.
                                                    • Self-made tools, some for availability checks, some sending to Prometheus
                                                    • Some log parsing (ELK stack, but now looking more and more at alternatives) and reading from commands and/or files, sending out alerts under certain circumstances.
                                                    1. 1

                                                      VictoriaMetrics + telegraf + grafana. Prometheus is pretty heavy for my small-scale setup. I used to have Graphite as the TSDB, and it works fine at small scale as well.

                                                      1. 1

                                                        I am researching this right now too. I am trying influxdb2 with telegraf atm. Not sure if this is what I will settle on.