1. 25

I’m interested in building a simple in-memory time series database, but I’m not really sure what the state of the art is here. I’m looking at having some basic aggregates over my series - sums, min/max, mean, etc. I’m also interested in having different retention periods and granularities. Does anyone have suggestions for papers to read or recommended data structures before I dive in?

  1.  

  2. 20

    The state of the art, apparently, is total shit.

    I’m super fucking disappointed in the time-series databases I’ve tried using so far. Part of the problem is that my application is financial and all the free-software TSDBs are oriented toward network management. All of these numbers are from a 3.2-million-bar dataset on an SSD with 16GiB of RAM:

    • KairosDB has a nice UI but, on H2, it froze regularly during ingestion (unfreezing when I ran queries), ran three times slower than iterating over gzipped JSON files in PyPy, sporadically generated random deadlock exceptions, and used double the space of gzipped JSON;
    • KairosDB on Cassandra, although not hitting the constant-time performance you’d hope for, is slightly faster and more space-efficient than iterating over gzipped JSON files in PyPy (just as fast as doing queries on one big SQLite table), but it rounds your fucking data points to 32 fucking bits, so it’s fucking useless;
    • OpenTSDB’s timestamp resolution is limited to one fucking millisecond, so I didn’t try it;
    • RRDTool’s timestamp resolution is limited to one fucking second and it unavoidably automatically deletes historical data (indeed, that’s kind of the point of RRDTool), so I didn’t try it;
    • Graphite Whisper is limited to one-fucking-second timestamp resolution and unavoidably automatically deletes historical data, so I didn’t try it;
    • InfluxDB is unstable, a resource hog, and far too slow — importing our sample data set took 4 hours (as compared to 47 seconds for KairosDB on Cassandra), and executing a sample query on it that takes 10 seconds on KairosDB/Cassandra or 30 seconds on gzipped JSON files takes 1.2 hours; this while using 10GB of RAM for data that gzips to 170MB; so I really fucking wish I hadn’t tried it. This was the latest version in March 2014.

    I also considered evaluating Saturnalia, Blueflood, PostRR, TempoDB, HDF5 (!), Square’s Cube, Seriesly, Vena, Kairos (without the DB), Mike Stirling Timestore, and NREL Energy DataBus, but instead I wasted fucking weeks evaluating the pieces of shit I gave details for in the list above, because they looked more promising.

    It’s pretty ungracious of me to be complaining about free software that people are sharing out of solidarity, especially when my purpose is to build a prop trading app that, by its nature, I won’t be releasing at all. But hopefully this comment can keep other people from wasting massive amounts of time like I did.

    1. 5

      InfluxDB is unstable, a resource hog, and far too slow — importing our sample data set took 4 hours (as compared to 47 seconds for KairosDB on Cassandra), and executing a sample …

      While our dataset (and use case) is completely different, and we’re not focused on sub-millisecond time storage, we’ve found InfluxDB to meet our needs (and it doesn’t seem to hog resources) for the first version of our Metrics product. We get acceptable query times querying against what you might consider “materialized views” (continuous queries) – acceptable enough that we do no client-side caching on the dashboard that shows the data.

      I’m guessing a lot has changed since March of this year – we deployed this in late July.

      1. 2

        I’m glad to hear that. Would you be willing to share some idea of how much disk and RAM it needs to insert a few million data points?

        1. 1

          Ack. Meant to reply to this a long time ago, but I was on vacation. Unfortunately, we didn’t do good load planning and as a result likely over-provisioned by a lot, which means any numbers I give you are likely to be a poor representation of what it’s really capable of.

          Also, since we’re using a pre-0.8 release, I have no idea what any of the newer releases can do, and I read a blog post just yesterday saying they’re rebuilding a bunch of stuff for the 0.9 series. They’re aiming to support 1-2M points/second. Not sure how far off they are at the moment.

          1. 1

            Thanks! Was it less than 4 hours? It took 4 hours when I tried inserting 3.2 million data points into it.

      2. 5

        My favorite part about researching time series databases is this very popular lie: “Cassandra is great for time series”.

        It seems like the Cassandra definition of “time series” is, “I have some data points in time order”. When I think time series database, I think “can rollup by metadata, re-sample by date, and calculate aggregates on the fly.” Unfortunately, no database can do this at the scale typical time series / sensor systems require. Cassandra definitely can’t.

        I work on a large time series system at Parse.ly – it currently supports over 10 billion events per month and needs to support 100 billion+ by 2015 – and I have prototyped time series storage on a lot of data stores (Redis, MongoDB, Cassandra, Elasticsearch, to name a few). In the end, they all have relative trade-offs, but the worst pit I ever entered was being convinced by marketing that Cassandra was “good at time series”.

        It’s true that if you go for a highly denormalized data model and pick fixed time intervals (e.g. daily, weekly, monthly), similar to rrdtool, then you can store all that data in Cassandra and query it with low latency. This is how DataStax built OpsCenter, for example. It’s basically distributed rrdtool. I guess that works for some people, but not for our use case, where we need to be able to roll up and aggregate across millions of events and produce summary views on the fly, often with sub-second latency.
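
        To make that concrete, here is a rough Python sketch of the pre-aggregate-on-write idea (the in-memory dict below is only a stand-in for the denormalized Cassandra rows, and the metric name and intervals are invented):

          from collections import defaultdict
          from datetime import datetime, timezone

          # One pre-aggregated bucket per (metric, interval, bucket-start).
          rollups = defaultdict(lambda: {"count": 0, "sum": 0.0, "min": None, "max": None})

          def record(metric, ts, value):
              for interval, fmt in (("hour", "%Y-%m-%dT%H"), ("day", "%Y-%m-%d")):
                  b = rollups[(metric, interval, ts.strftime(fmt))]
                  b["count"] += 1
                  b["sum"] += value
                  b["min"] = value if b["min"] is None else min(b["min"], value)
                  b["max"] = value if b["max"] is None else max(b["max"], value)

          record("api.latency", datetime(2014, 9, 1, 12, 30, tzinfo=timezone.utc), 42.0)
          # "What was yesterday's max?" is now a single key lookup, but an ad-hoc
          # rollup across millions of raw events still means scanning all of them.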

        So, for us, the best wins started to come when we realized no database was going to completely solve this problem for us, and we applied our own ingenuity to it. For me, the best tricks available to the time series practitioner are column-stride formats, probabilistic data structures, and inverted indices. Study Apache Lucene a bit – the tricks that are used in a search context to make massive data sets instantly searchable can be cross-applied to the time series use case.
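
        As a toy example of the inverted-index trick applied to series metadata (the series IDs and tags below are invented), deciding which series to aggregate becomes a set intersection rather than a scan:

          from collections import defaultdict

          # Map each (tag, value) pair to the set of series ids carrying it.
          index = defaultdict(set)
          series_tags = {
              1: {"metric": "pageviews", "site": "example.com", "country": "US"},
              2: {"metric": "pageviews", "site": "example.com", "country": "DE"},
              3: {"metric": "clicks",    "site": "example.com", "country": "US"},
          }
          for sid, tags in series_tags.items():
              for pair in tags.items():
                  index[pair].add(sid)

          # A metadata rollup intersects the posting sets, Lucene-style.
          matching = index[("metric", "pageviews")] & index[("country", "US")]
          print(matching)  # {1} -- only these series need their points aggregated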

        1. 3

          Kind of wonder what you think about Teafiles.

          1. 3

            Teafile looks like it might be interesting. But let’s avoid confusion: it’s a storage format, not a time-series database. It doesn’t claim to have any functionality beyond that offered by, say, SQLite or Cassandra or a directory full of CSV files. It just promises to provide that functionality at higher performance. Which it might — I haven’t tried it.

            See @amontalenti’s definition in the comment below: “can rollup by metadata, re-sample by date, and calculate aggregates on the fly,” to which I would add “and ideally can materialize rollup views so that the time to evaluate your query depends on the number of resulting points, not the number of input points”, which is basically what rrdtool does, although in a limited way.

            Teafile, according to the web page, doesn’t try to offer any of this functionality. It just stores sequences of data points with associated metadata. A binary file format is indeed the right way to do this, but I haven’t evaluated how well they do it. The blithe way they talk about memory-mapping for faster writes gives me pause. I felt the same way about memory-mapping for writes before I realized that error recovery in case of a crash pretty much requires some kind of journaling, which means that you lose the advantage of memory-mapping for writes. They don’t actually mention issues of reliability at all in the readme, which leads me to believe they’re unaware of them.

            Unfortunately, they’ve invented their own free-software license (GPLv3 plus an additional original-BSD-advertising-clause-like arbitrary restriction) which is incompatible with the GPL and any other similar copyleft licenses, so we’ll probably never find out if they perform as well as they claim, or if they really corrupt the database when you lose power, because nobody will bother to find out.

        2. 7

          Check out column-oriented databases. One of the strengths of column stores is their ability to efficiently perform these types of aggregations. C-Store and the related papers are a good place to start. The Design and Implementation of Modern Column-Oriented Database Systems is good too. kdb+ is a very highly regarded in-memory column store used in the financial sector; there is a free version you can play around with.
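
          A toy illustration of why the columnar layout helps with these aggregations, using numpy arrays as stand-in columns (the bar data is random and the field names are invented, so this is not kdb+ or C-Store itself):

            import numpy as np

            n = 3_200_000  # roughly the size of the dataset discussed above
            # Row store: a list of dicts drags every field through memory on
            # each aggregate. Column store: one contiguous array per field,
            # so an aggregate touches only the columns it needs.
            close = np.random.rand(n)
            volume = np.random.randint(1, 1000, n).astype(np.float64)

            vwap = (close * volume).sum() / volume.sum()  # scans two columns only
            print(close.min(), close.max(), vwap)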

          1. 6

            elasticsearch. I store 2k events per second, do string search on it, and so on. I tuned everything to use as little RAM, CPU and disk as possible, and wrote my own log parser (logstash is very slow). I have two instances, one for indexing and one for searching. I periodically move indices from the index node to the search node. Then I optimize the indices on the search node.

            That’s because for us it’s more important to store events fast, we don’t care if it takes more time to search. The search is for internal usage.

            To store that much data I create hourly indices of about 7 million events each.
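
            Roughly, the per-hour routing can be as simple as deriving the index name from the event timestamp (the name pattern below is only illustrative):

              from datetime import datetime, timezone

              def index_for(ts):
                  # One index per hour; full hours can then be moved to the
                  # search node and optimized as a unit.
                  return ts.strftime("events-%Y.%m.%d.%H")

              print(index_for(datetime(2014, 9, 18, 14, 5, tzinfo=timezone.utc)))
              # events-2014.09.18.14  (~7M docs per index at ~2k events/sec)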

            Didn’t like kibana; tried to hack on it etc., but it was too limiting for my purposes. So I wrote my own sql-to-elasticsearch language with a nice web interface. Once it’s finished I hope my employer lets me publish it.

            We plan to store even more data in it.

            I’ve tried InfluxDB and it’s… don’t know, 100x slower than elasticsearch?

            1. 2

              I’ve seen your Nix posts - are you a Haskell user? If so, are you using Bloodhound?

              1. 2

                Not a Haskell user, no. Bloodhound seems interesting; the demo does not work at the moment though :(

                1. 1

                  Could you help me out here, what demo doesn’t work? Possibly I needed to update the docs and missed something.

                  1. 2

                    The one on the Apache Bloodhound homepage: http://goo.gl/9NqjB I cannot open it, error 502 proxy error. But I’m indeed not a good Haskell user :P Now I understand what you mean, the Haskell package, sorry.

                    Edit: So it’s a very good idea to have a typed dsl for elasticsearch. The sql->elasticsearch is mostly for querying outside of a programming environment. E.g. I use it for plotting in the web interface.

                    1. 1

                      Phew, I was worried there for a moment! Apache Bloodhound ain’t me, sorry.

                      This is me: https://github.com/bitemyapp/bloodhound/

                      www.stackage.org/package/bloodhound

                      So it’s a very good idea to have a typed dsl for elasticsearch.

                      You’re telling me! Elasticsearch’s API drives me crazy.

                      1. 2

                        For documentation I think elasticsearch would benefit from something similar to http://api.highcharts.com/highcharts#navigation - that is, display all the possible options as an abstract JSON that shows the whole structure.

                        1. 1

                          I agree, although I’m rather fond of how Haskell’s data types are more expressive than JSON and can express things like sum types, which is really a way of saying “or”.

              2. 1

                When did you try InfluxDB? @apg above claims it’s gotten a lot better since I tried it. I still have a hard time imagining it will have gotten that much better.

              3. 5

                Before you do, make sure it actually makes sense to. Can you get away with just using rrdtool instead?

                There aren’t really a ton of useful papers out there. I’d suggest looking at OpenTSDB’s stuff, then talking to Benoit and asking him what he thinks his mistakes were. If you have friends at big companies, go ask them what they did. Make sure you understand what’s going to make your space explode, and try to design the API so that it’s hard to do that. Arbitrary tagging sounds like a good idea, until you realize that adding another tag to your key makes your storage for that key combinatorially more expensive.
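
                A quick back-of-the-envelope on that tag explosion (the cardinalities below are invented): the number of distinct series is the product of the per-tag cardinalities, so every additional tag multiplies the key space.

                  hosts, endpoints, status_codes, regions = 200, 50, 8, 4
                  base = hosts * endpoints * status_codes  # 80,000 distinct series
                  with_region = base * regions              # 320,000 distinct series
                  print(base, with_region)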

                If your stuff is totally in memory, I wouldn’t worry too much about how you’re going to compute your results. If you decide to make a distributed one, you’ll probably eventually want to support materializing views on write as well as on read–you shouldn’t need to do this for in-memory time series unless you start doing really sophisticated aggregation.

                1. 3

                  Make sure it actually makes sense.

                  I learn through building, and in this area I’m interested in building up my knowledge of TSDBs. I’m probably not looking at anything that I’ll even put into production for my own purposes - just a bit of hobby work.

                  Thanks for the reply - all good leads!

                  1. 1

                    Do you think Benoit has maybe written about what he thinks his mistakes were somewhere?

                    1. 2

                      No idea; my guess would be that if you could see what he broke from OpenTSDB 1.x when switching to OpenTSDB 2.x, that would probably be a good first-order heuristic.

                  2. 4

                    One interesting data structure for analyzing high-volume event distributions, when you can tolerate a small precision loss, is a logarithmically bucketed histogram. A major mistaken trade-off taken by many metric libraries is to use reservoir samples to generate percentiles. Instead, have a function that maps a number to a bucket within a small fraction of its true value, and aggregate a count for each time this occurs. This way you can generate arbitrary percentiles without sacrificing any accuracy - just a small amount of precision. The compression is quite extreme for tightly clustered types of events, typical of things like latency. And you never lose your outliers as you do with reservoir samples.
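
                    A minimal Python sketch of the idea (the ~1% bucket ratio is an arbitrary choice and it only handles positive values; the implementation linked below is the real thing):

                      import math
                      from collections import Counter

                      class LogHistogram:
                          def __init__(self, ratio=1.01):
                              # Buckets grow geometrically, so every value lands
                              # within ~1% of its bucket's representative value.
                              self.log_ratio = math.log(ratio)
                              self.buckets = Counter()
                              self.n = 0

                          def record(self, value):
                              # Keep only a count per bucket, never the raw sample.
                              self.buckets[int(math.log(value) / self.log_ratio)] += 1
                              self.n += 1

                          def percentile(self, p):
                              # Walk buckets in order until p% of samples are seen.
                              target, seen = self.n * p / 100.0, 0
                              for idx in sorted(self.buckets):
                                  seen += self.buckets[idx]
                                  if seen >= target:
                                      return math.exp(idx * self.log_ratio)

                      h = LogHistogram()
                      for latency_us in (120, 135, 140, 900, 15000):
                          h.record(latency_us)
                      print(h.percentile(99))  # the 15000us outlier is never lost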

                    Here’s an implementation I wrote (this implementation isn’t great for numbers between 0 and 1): https://github.com/cockroachdb/cockroach/blob/master/util/metrics/metrics.go

                    OpenTSDB and KairosDB are interesting, with their own sets of flaws. At work, OpenTSDB is throwing a few hundred thousand requests per second at a 20-node HBase cluster, but we occasionally have issues with compactions causing read degradation. This is why I’m looking into KairosDB on Cassandra - no regions. Looking forward to finding its negative trade-offs.

                    1. 2

                      A few other interesting histogram implementations: HdrHistogram and ApproximateHistogram.

                      They’re pretty similar, but HdrHistogram requires that you pre-size it, while ApproximateHistogram can be sized dynamically.

                    2. 2

                      I think a lot depends on what you want the bounds of your problem to be, both algorithmically and practically. In particular, it would be helpful to get a rough estimate of your data rate, your minimum retention period, and the interactivity/immediacy of your analytic requirement, because all of those cause hard rule-outs in the solution search tree.

                      For example, I was recently faced with a problem in which data was arriving at a rate of about 100 events/sec, peaking at 250/sec, with microsecond resolution, and had to be buffered until some downstream systems could be, ah, optimized. The client additionally wanted to be able to query non-interactively on some event parameters (sums, min/max), with query response times under a minute. In this case the aggregate data size was estimated at around a terabyte a day.

                      That set of parameters was well within the SQL use case, so I was able to solve it with straight SQL tables with appropriate indices, slather some Erlang on top so I could sleep at night, and answer the various analytic questions via group-by. If I wanted to drop old data periodically when it exceeded a size, or re-bucket into historical tables, that would be cake in Erlang or cron. But this was a time series buffering problem, so I don’t think I need that yet.
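
                      Roughly the shape of that solution, sketched here with sqlite3 standing in for the real database (the schema and numbers are made up, and the Erlang layer is omitted):

                        import sqlite3

                        db = sqlite3.connect(":memory:")
                        db.execute("CREATE TABLE events (ts_us INTEGER, source TEXT, value REAL)")
                        db.execute("CREATE INDEX ev_src_ts ON events (source, ts_us)")

                        # ~100 events/sec: one synthetic event every 10,000 microseconds.
                        db.executemany(
                            "INSERT INTO events VALUES (?, ?, ?)",
                            [(1_400_000_000_000_000 + i * 10_000, "sensor-a", i * 0.5)
                             for i in range(1000)])

                        # Re-bucket into one-second bins and aggregate via GROUP BY.
                        sql = ("SELECT ts_us / 1000000 AS sec, COUNT(*), "
                               "SUM(value), MIN(value), MAX(value) "
                               "FROM events WHERE source = ? GROUP BY sec")
                        for row in db.execute(sql, ("sensor-a",)):
                            print(row)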

                      1. 1

                        Something something indexable skiplist..

                        1. -2

                          Have a look at InfluxDB first? :)