1. 28
  1. 10

    I’m baffled why people prefer going through the pains of setting up Prometheus + Grafana over adding a few lines of code in their app to send metrics to InfluxDB 2.0 and having all the charting and alerting capabilities in a smaller footprint and a simpler service topology. Additionally, the push model is safer under heavy load, as it’s usually a better choice to drop metrics on the floor rather than choke the app by means of repeated requests from Prometheus (the push gateway isn’t recommended as the primary reporting method, per the docs). InfluxDB has a snappy UI, a pretty clean query language, retention policies with downsampling, HTTP API, dashboards with configuration storable in version control systems. It doesn’t have counters/gauges/histograms; there are just measurements. The only thing I miss is metrics reporting over UDP.
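
    To make “a few lines of code” concrete, here’s a minimal sketch of pushing one point over InfluxDB 2.0’s v2 write API using line protocol; the URL, org, bucket, token, and metric names are placeholders, and a real app would batch points rather than doing one HTTP request per write.

    ```go
    // Minimal sketch: push one point to InfluxDB 2.0 via the /api/v2/write
    // endpoint using line protocol. URL, org, bucket, token, and names are
    // placeholders; a real app would batch points instead of writing one at a time.
    package main

    import (
        "fmt"
        "log"
        "net/http"
        "strings"
        "time"
    )

    func main() {
        // One point: measurement "http_requests", tag "service", integer field "count".
        point := fmt.Sprintf("http_requests,service=checkout count=42i %d", time.Now().UnixNano())

        req, err := http.NewRequest("POST",
            "http://localhost:8086/api/v2/write?org=my-org&bucket=my-bucket&precision=ns",
            strings.NewReader(point))
        if err != nil {
            log.Fatal(err)
        }
        req.Header.Set("Authorization", "Token my-token")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusNoContent {
            log.Printf("write failed: %s", resp.Status)
        }
    }
    ```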

    1. 17

      I’m baffled why people prefer going through the pains of setting up Prometheus + Grafana over adding a few lines of code in their app to send metrics to InfluxDB 2.0 […]

      The community around Prometheus and Grafana is huge. The kube-prometheus project alone can take a Kubernetes cluster’s observability from zero to hero quickly. I admit there is a learning curve, but the value proposition is amazing, and so is the community. There are many online collections of ready-made rules that teams can use.

      Even famous SaaS monitoring solutions offer poorer defaults than kube-prometheus’s collection of 80+ Kubernetes alerts. Prometheus is cheap: you just throw RAM and gigabytes of storage at it and it does all the heavy lifting. Good luck enabling an OpenMetrics integration that pushes billions of metrics into a SaaS monitoring system.

      Assuming I install InfluxDB, how do I get cert-manager’s metrics into it? cert-manager is an operator that mostly just works, but it needs monitoring in case SSL certificate issuance fails for whatever reason. For most solutions, the infra team would have to build that monitoring themselves. But cert-manager (like many others) exposes Prometheus metrics (OpenMetrics-compatible), and as a bonus there’s a good community Grafana dashboard ready to be used.

      Additionally, the push model is safer under heavy load as it’s usually a better choice to drop metrics on the floor rather than choke the app by means of repeated requests from Prometheus […]

      In my experience it’s the other way around: if the exporter is built as it should be, non-blocking, then serving it is as light as serving a static page with a few KB of text. I’ve seen applications slowed down by third-party push libraries or by the agent misbehaving (e.g. not being able to push metrics), leading to small but visible performance degradation. Again, one could argue about implementation quality, but the push footprint is visibly bigger in all respects.
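
      For what it’s worth, this is roughly what a non-blocking exporter looks like with the official Go client (metric and endpoint names are made up): the hot path only bumps an in-memory counter, and /metrics renders the current values as a few KB of text when scraped.

      ```go
      // Sketch of a non-blocking exporter with the official Go client
      // (github.com/prometheus/client_golang); metric names are made up.
      package main

      import (
          "log"
          "net/http"

          "github.com/prometheus/client_golang/prometheus"
          "github.com/prometheus/client_golang/prometheus/promhttp"
      )

      var requestsTotal = prometheus.NewCounterVec(
          prometheus.CounterOpts{
              Name: "app_requests_total",
              Help: "Total requests handled, by status.",
          },
          []string{"status"},
      )

      func main() {
          prometheus.MustRegister(requestsTotal)

          http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
              // Hot path: an in-memory increment, no network I/O.
              requestsTotal.WithLabelValues("ok").Inc()
              w.Write([]byte("done"))
          })

          // Scrape target: a few KB of text, served like a static page.
          http.Handle("/metrics", promhttp.Handler())
          log.Fatal(http.ListenAndServe(":8080", nil))
      }
      ```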

      Pull-based metrics come into play when you have hundreds if not thousands of services running and you don’t need detailed metrics (e.g. I don’t want metrics every second). The default scrape interval is 30s. You can use histogram buckets to capture how values fall into predefined ranges (and derive percentiles from that), but sub-30s spikes cannot be debugged. Prometheus works like this for a reason: modern monitoring systems are metric data lakes, and Prometheus strikes a perfect balance between cost (storage, processing, etc.) and visibility.
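
      And a sketch of the bucket idea (again with made-up names): a histogram folds every observation into predefined buckets, so the latency distribution between two 30s scrapes is preserved even though individual sub-interval spikes are not.

      ```go
      // Made-up example: a histogram folds each observation into predefined
      // buckets, so percentile estimates survive a 30s scrape interval even
      // though individual spikes between scrapes are not visible.
      package metrics

      import (
          "time"

          "github.com/prometheus/client_golang/prometheus"
      )

      var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
          Name:    "app_request_duration_seconds",
          Help:    "Request latency distribution.",
          Buckets: []float64{0.005, 0.025, 0.1, 0.5, 1, 5}, // seconds
      })

      func init() { prometheus.MustRegister(requestDuration) }

      // TimeRequest wraps a unit of work and records how long it took.
      func TimeRequest(work func()) {
          start := time.Now()
          work()
          requestDuration.Observe(time.Since(start).Seconds())
      }
      ```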

      InfluxDB has a snappy UI, a pretty clean query language, retention policies with downsampling, HTTP API, dashboards with configuration storable in version control systems.

      There are ready-made Prometheus rules and Grafana dashboards for just about anything. Granted, many are poorly constructed or report the wrong thing (I’ve had poor experiences with several community dashboards), but they usually work out of the box for most people.

      1. 2

        In my experience, the times when a system is lagging or unresponsive are exactly the times when I want to capture as many metrics as possible. In a push model I will still get metrics leading up to the malfunction; in a pull model I may miss my window.

        As for push slowing down an application, I agree that can happen, but it can also happen with pull (consider a naive application that serves metrics over HTTP and does blocking writes to the socket). We have chosen to push metrics using shared memory, so the cost of writing is at most a few hundred nanoseconds. A separate process can then transfer the metrics out of the server via push or pull, whichever is appropriate.
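
        As a rough sketch of the general shape (simplified, not our exact implementation), something like this is enough to get a shared-memory counter that costs one atomic add per update:

        ```go
        // Sketch only (not the actual implementation): a counter in shared
        // memory. The application bumps it with one atomic add, no syscalls
        // on the hot path; a separate process maps the same file and ships
        // the value out via push or pull. Unix-specific.
        package shmctr

        import (
            "os"
            "sync/atomic"
            "syscall"
            "unsafe"
        )

        // Open maps an 8-byte file and returns a pointer usable with atomic ops.
        func Open(path string) (*uint64, error) {
            f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0644)
            if err != nil {
                return nil, err
            }
            defer f.Close()
            if err := f.Truncate(8); err != nil {
                return nil, err
            }
            mem, err := syscall.Mmap(int(f.Fd()), 0, 8,
                syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
            if err != nil {
                return nil, err
            }
            return (*uint64)(unsafe.Pointer(&mem[0])), nil
        }

        // Inc is the hot path: a single atomic read-modify-write.
        func Inc(ctr *uint64) { atomic.AddUint64(ctr, 1) }
        ```

        Mapping a file under /dev/shm (tmpfs on Linux) keeps the counter off disk while still letting a second process open and map it.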

        1. 2

          In a push model I will still get metrics leading up to the malfunction;

          In modern observability infrastructure, the idea is to combine tracing, metrics, and logs. What you’re describing is done a lot better by an APM/tracer than by metrics.

          I’ve seen metrics being used to measure time spent in functions, but I think that’s abusing the pattern. Of course, if there’s no tracer/APM, then it’s fine.

          Pushing metrics for every call leading up to the malfunction is usually dismissed because it’s a high cost, low value proposition.

          1. 1

            I didn’t say anything about metrics for every call; as you point out, that would be done better with tracing or logging. We do that too, but it serves a different purpose. A single process might handle hundreds of thousands of messages in a single second, and that granularity is too fine for humans to handle. Aggregating data is crucial, either into uniform time series (e.g. 30-second buckets so all the timestamps line up) or non-uniform time series (e.g. emitting a metric when a threshold is crossed). We keep counters in shared memory and scrape them with a separate process, resulting in a uniform time series. This is no different from what you would get scraping /proc on Linux and sending it to a TSDB, but it is all done in userspace.
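
            The scraping side then looks roughly like this (simplified; the real scraper handles many counters and ships the samples to a TSDB):

            ```go
            // Simplified reconstruction: the sidecar maps the same counter file
            // and samples it on a fixed tick, emitting a uniform time series,
            // much like scraping /proc but entirely in userspace.
            package sidecar

            import (
                "fmt"
                "sync/atomic"
                "time"
            )

            // Sample prints one "name value timestamp" line per interval.
            func Sample(ctr *uint64, interval time.Duration) {
                for t := range time.Tick(interval) {
                    fmt.Printf("app_requests_total %d %d\n",
                        atomic.LoadUint64(ctr), t.Unix())
                }
            }
            ```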

            As for push vs. pull, consider this case: a machine becomes unresponsive because it is swapping to disk. In a push model with a short interval you might see metrics showing which process was allocating memory before metrics are simply cut off. In a pull model the machine might become unresponsive before the metrics are collected, and at that point it is too late.

            If you have a lot of machines to monitor but limited resources for collecting metrics, a pull model makes sense. In our case we have relatively few machines so the push model is a better tradeoff to avoid losing metrics.

            In my ideal world, metrics would be pushed on a coarse-grained timer, or earlier when a threshold is reached. I think Brendan Gregg has written about doing something like this, though I do not have the link handy.
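
            Concretely, I mean something like this (a hypothetical sketch, not a reference implementation): flush on a coarse timer, or immediately once a pending-count threshold is crossed, so quiet periods stay cheap and bursts surface early.

            ```go
            // Hypothetical sketch of "flush on a coarse timer, or sooner when a
            // threshold is crossed".
            package push

            import (
                "sync/atomic"
                "time"
            )

            type Pusher struct {
                pending   uint64
                threshold uint64
                kick      chan struct{}
            }

            func New(threshold uint64) *Pusher {
                return &Pusher{threshold: threshold, kick: make(chan struct{}, 1)}
            }

            // Add is the hot path: an atomic add plus a non-blocking nudge when
            // the pending count crosses the threshold.
            func (p *Pusher) Add(n uint64) {
                if atomic.AddUint64(&p.pending, n) >= p.threshold {
                    select {
                    case p.kick <- struct{}{}:
                    default:
                    }
                }
            }

            // Run flushes every interval, or earlier when kicked.
            func (p *Pusher) Run(interval time.Duration, flush func(n uint64)) {
                t := time.NewTicker(interval)
                defer t.Stop()
                for {
                    select {
                    case <-t.C:
                    case <-p.kick:
                    }
                    flush(atomic.SwapUint64(&p.pending, 0))
                }
            }
            ```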

      2. 14

        Years ago at GitLab we started with InfluxDB. It was nothing short of a nightmare, and we eventually migrated to Prometheus. I distinctly remember two issues:

        One: each row (I forget the exact terminology used by Influx) is essentially uniquely identified by the pair (tags + values, timestamp). If the tags and their values are the same, and so is the timestamp, InfluxDB just straight up (and silently) overwrites the data. This resulted in us having far less data than we’d expect, until we 1) recorded timestamps in microseconds/nanoseconds (don’t remember which) and 2) added a slight random value to them. Even then it was still guesswork as to how much data would be overwritten. You can find some old discussion on this here: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/86
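
        For anyone who hasn’t hit this: a point in line protocol is keyed by measurement + tag set + timestamp, and a second write with the same key replaces the first. Our workaround boiled down to something like this (from memory, with placeholder names, not the actual code):

        ```go
        // From memory, not the actual code, and with placeholder names: two
        // points with identical measurement, tags, and timestamp silently
        // overwrite each other, so write nanosecond timestamps with a little
        // jitter to break collisions.
        package main

        import (
            "fmt"
            "math/rand"
            "time"
        )

        func line(measurement, tags string, value float64) string {
            ts := time.Now().UnixNano() + rand.Int63n(1000) // jitter
            return fmt.Sprintf("%s,%s value=%f %d", measurement, tags, value, ts)
        }

        func main() {
            // Before the fix, these two could end up stored as a single point.
            fmt.Println(line("events", "host=web1", 1))
            fmt.Println(line("events", "host=web1", 1))
        }
        ```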

        Two: IIRC at some point they removed clustering support from the FOSS offering and made it enterprise only. This left a bad taste in our mouth, and IIRC a lot of people were unhappy with it at the time.

        Setting that aside, we had a lot of trouble scaling InfluxDB. It seemed that at the time it simply couldn’t handle the workload we were throwing at it. I’m not sure if it has improved since, but frankly I would stay away from push based models for large environments.

        1. 7

          I honestly don’t know whether push behaves better than pull during incidents; I prefer to drop application traffic before dropping metrics. But either way, I think a better argument for pull-based metrics is that they are more composable. With Prometheus as a concrete example, it’s possible to forward metrics from one service to another, from one scraper to another, and from one database to another, by only changing one scraper and not the target service/scraper/database.

          1. 4

            I’m using Influx myself, but the advantage of Prometheus is that you don’t have to write your own dashboards for common services. The Prometheus metric naming scheme allows folks to write and share dashboards that are actually reusable. I haven’t had that experience with InfluxDB (or Graphite before it).

            (Plus, you know, licensing.)

            1. 2

              One reason to use Grafana is if you have data that you do not want to bring into InfluxDB. Both Influx and Prometheus have their own graphing capabilities and can be used on their own or with Grafana. We use InfluxDB with Grafana as a frontend so we can visualize both the metrics stored in Influx and other sources (such as Kudu) on the same dashboard.

              1. 2

                I tend to agree, but after using InfluxDB for quite some time I find its semantics rather awful and the company extremely difficult to deal with.

                I’m currently sizing up a migration to Prometheus, despite strongly disliking the polling semantics and the fact that it intentionally doesn’t scale, so you end up with many sources of truth for your data.

                Oh well.

                1. 2

                  I’ve had circular arguments about handling some monitoring while still complying with the strongly discouraged push gateway method. I think pull makes sense once you start using Alertmanager and alerting when a metric fails. I like the whole setup except Grafana’s log solution: Loki and fluent have been a long-running source of frustration, and there seem to be very limited resources on the internet about them. It sounds good, but it’s extremely hard to tame compared to Prometheus. The real win with Prometheus and Grafana over previous solutions was alerting rules; I find it easier to express complex monitoring conditions quickly.

                2. 7

                  Based on this experiment, I submitted a pull request to nixpkgs. Users will be able to choose to build a smaller Prometheus.

                  1. 3

                    I’ve been using rrdtool and collectd and liking them a lot as a super lightweight alternative. I even wrote a server and a Grafana plugin to let me view the graphs in Grafana.

                    1. 1

                      I wonder if cgo with -flto would have a massive impact on the size too. Since the discovery code should use just a tiny fragment of the AWS/Azure/GCloud packages, maybe it could strip those too?

                      1. 3

                        AFAIK the problem is reflection in the Go packages: as soon as reflection is used, LTO cannot strip the unused parts because it does not know they are unused.