1. 64

A few colleagues of mine are working on a new company. Without giving it away, part of the application is a network of people, products and the relationship between those.

The tech lead of the current MVP is using TigerGraph as their database! I have watched a few videos about it and built my own little proof-of-concept, and I am amazed that this hadn’t crossed my mind sooner!

Does anyone here have experience with graph databases, and why did or didn’t you choose them?

  2. 60

    Access Patterns (imo). Very few applications I’ve run into have graph hops and pattern matching from a known starting node as their primary access pattern. It’s usually supplemental. Even if you can conceptually think of your data model as a graph (which is something you can commonly do) when your app actually uses your data the main access patterns tend to be selecting by attributes (ie: find account by email address), doing relational joins of 1 or maybe 2 hops, or aggregating summary stats for display to users or exploratory analysis.

    Compared to traditional relational databases, graph databases are terrible at all of those. Most queries involve a massive scan across all the nodes to find an entry point for a specific pattern, and then it matches…but that scan usually isn’t as fast as it is in relational databases, and index structures on top tend to be more expensive space-wise. Every time I’ve looked into graph databases (Neo4j, TigerGraph, and Neptune primarily, for different projects) we ran benchmarks and things fell over for our use cases very fast.

    Don’t take this as saying graph databases are bad. What they are good at is fast hops and pattern matching across long chains, for example if the main way you used a relational database was doing 5-10 table joins per query, but you don’t want the heavy memory usage, latency, and IO that come with that. I can see a giant knowledge graph style expert system or some interesting weighted path stuff being a good fit. But at the same time, I feel like if you were doing something like that at scale then you’d probably want control over what’s in memory to avoid a bunch of disk access…but I’m probably biased due to the stuff I’ve worked on.

    Speaking of the stuff I’ve worked on, one project ended up settling for custom graph application logic on top of a KV store and it worked out well. The other project used Cayley (an open-source graph DB) as an in-memory index for a while. We could rebuild it from our DB if it went down, but if it was up then we could quickly use Gremlin to do some queries and then flesh out the results by querying Postgres for the rest of the record attributes. It worked fine. But in the end we dropped it and just used two Postgres tables with Subject, Verb, and Object columns. Subject and Object are foreign keys to normal relational tables. There was some tuning involved, but in the end it worked just as well and we didn’t have to know/run two services. This is what I’d reach for first for any “graphy” project I come up against, and if it fails I’ll at least know what the bottleneck is so I can optimize for it next.
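A minimal sketch of the Subject/Verb/Object idea described above. It is an illustration under stated assumptions, not the commenter's actual schema: it uses SQLite instead of Postgres so it runs anywhere, a single `edge` table rather than two, and the table names, sample rows, and the `reachable` helper are all hypothetical. A recursive CTE does the multi-hop walk.

```python
import sqlite3


def build_db():
    """Create a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE edge (
            subject INTEGER NOT NULL REFERENCES entity(id),
            verb    TEXT    NOT NULL,
            object  INTEGER NOT NULL REFERENCES entity(id),
            PRIMARY KEY (subject, verb, object)
        );
        -- Reverse index so walks that start from the object side stay cheap.
        CREATE INDEX edge_by_object ON edge (object, verb, subject);
    """)
    conn.executemany(
        "INSERT INTO entity VALUES (?, ?)",
        [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")],
    )
    conn.executemany(
        "INSERT INTO edge VALUES (?, ?, ?)",
        [(1, "follows", 2), (2, "follows", 3), (3, "follows", 4)],
    )
    return conn


def reachable(conn, start_name, verb, max_hops):
    """Return (name, hops) for entities within max_hops of start_name."""
    return conn.execute(
        """
        WITH RECURSIVE walk(id, hops) AS (
            SELECT id, 0 FROM entity WHERE name = ?
            UNION
            SELECT edge.object, walk.hops + 1
            FROM walk JOIN edge ON edge.subject = walk.id
            WHERE edge.verb = ? AND walk.hops < ?
        )
        SELECT entity.name, MIN(walk.hops)
        FROM walk JOIN entity ON entity.id = walk.id
        WHERE walk.hops > 0
        GROUP BY entity.name
        ORDER BY MIN(walk.hops), entity.name
        """,
        (start_name, verb, max_hops),
    ).fetchall()
```

The hop limit bounds the recursion, so a cycle in the data cannot make the query run forever. The entry point is found through an ordinary attribute lookup (`name = ?`), which is the access pattern relational databases are already good at.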

    Massive disclaimer: My opinions are colored by the domains I’ve worked in (large scale web services, automated supply chain logistics and tracking, and general user facing web apps).

    1. 40

      Graph database author here.

      It’s a very interesting question, and you’ve hit the nerve of it. The long and the short of it is that, much like lambda calculus can represent any program, relational algebra can represent pretty much all database queries. The question comes to what you optimize for.

      And so, unlike a table-centric view, which has benefits that are much better-known, what happens when you optimize for joins? Because that’s the graph game – deep joins. A graph is one huge set of join tables. So it’s really hard to shard a graph – this is connected to why NoSQL folks are always anti-join. It’s a pain in the rear. Similarly, it’s really easy to write a very expensive graph query, where starting from the other side is much cheaper.

      So then we get to the bigger point; in a world where joins are the norm, what the heck are your join tables, ie, schema? And it’s super fluid – which has benefits! – but it’s also very particular. That tends to be the first major hurdle against graph databases: defining your schema/predicates/edge-types and what they mean to you. You’re given a paintbrush and a blank canvas and have to define the world, one edge at a time. And $DEITY help you if you want to share a common schema with others! This is what schema.org is chasing, choosing some bare minimum.

      This is followed on by the fact that most of the standards in the graph database world are academic in nature. If I have one regret, it’s trying to follow the W3C with RDF. RDF is fine for import/export but it’s not a great data model. I wanted to standardize. I wanted to do right by the community. But, jeez, it’s just so abstract as to be useless. OWL goes another meta-level and defines properties about properties, and there’s simpler versions of OWL, and there’s RDFS/RDF* which is RDF about RDF and on and on…. it’s super cool that triples alone can represent pretty much anything, but that doesn’t help you much when you’re trying to be efficient or define your schema. Example: There’s a direct connection to the difference between a vector and a linked list – they both represent an ordered set. You can’t do a vector in triples, but you can do a linked list.
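The vector-versus-linked-list point can be sketched in a few lines. RDF's own collection vocabulary (`rdf:first`, `rdf:rest`, `rdf:nil`) encodes an ordered list as a chain of cells, so reaching element n means following n edges. The helper names and the `_:cell` blank-node labels below are hypothetical, and plain Python tuples stand in for a real triple store.

```python
NIL = "rdf:nil"


def list_to_triples(items, prefix="_:cell"):
    """Encode an ordered list as (subject, predicate, object) triples."""
    triples = []
    if not items:
        return NIL, triples
    for i, item in enumerate(items):
        cell = f"{prefix}{i}"
        # The last cell points at rdf:nil to terminate the list.
        nxt = f"{prefix}{i + 1}" if i + 1 < len(items) else NIL
        triples.append((cell, "rdf:first", item))
        triples.append((cell, "rdf:rest", nxt))
    return f"{prefix}0", triples


def nth(triples, head, n):
    """Fetch element n by walking n 'rdf:rest' edges from the head cell."""
    index = {(s, p): o for s, p, o in triples}
    cell = head
    for _ in range(n):
        cell = index[(cell, "rdf:rest")]
    return index[(cell, "rdf:first")]
```

Three items cost six triples, and there is no way to jump straight to position n as there would be with an array.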

      I know I’m rambling a little, but now I’ll get to the turn; I still think there’s gold in them hills. The reason it’s not popular is all of the above and more, but it can be really useful! Especially when your problem is graph-shaped! I’ve implemented this a few times, and things like mapping, and social networks, and data networks, and document-origin-tracing – generally anything that would take a lot of joins – turn out swimmingly. Things that look more like tables (my example is always the back of a baseball card) look kind of nuts in the graph world, and things that look like a graph are wild in third normal form.

      So I think there’s a time and a place for graph databases. I just think that, between the counter-arguments above and how rarely the underlying need comes up, it’s under-explored and over-politicized. They work great in isolation, ironically enough.

      I’m happy to chat more, but that’s my quick take. Right tool, right job. It’s a shame about how that part of database theory has gone.

      1. 10

        Full disclosure: I work for Neo4j, a graph database vendor.

        Very well said.

        I’d add that most of the conversation in response to OP assumes “transactional” workloads. Graph databases for analytic workloads are a whole other topic to explore. Folks should check out Stanford Prof. Jure Leskovec’s research in the space…and a lot of his lectures about graphs for machine learning are online.

        1. 2

          The long and the short of it is that, much like lambda calculus can represent any program, relational algebra can represent pretty much all database queries.

          When faced with an unknown data problem, I always choose an RDBMS. It is a known quantity. I suspect I’d choose differently if I understood graph DBs better.

          I would love to see more articles here on practical uses for graph DBs. In particular, I’d love to know if they are best deployed as the primary datastore or maybe just for the subset of data that you’re interested in querying (e.g., perhaps just the products table in an ecommerce app).

          this is connected to why NoSQL folks are always anti-join. It’s a pain in the rear.

          Interesting. People use NoSQL a lot. They simply do joins in the application. Maybe that’s the practical solution when it comes to graph dbs? Then again, the point of graph solutions is generally to search for connections (joins). I’d love to hear more on this aspect.

          Thank you and the OP. I wish I could upvote this more. :)

          1. 1

            Yeah, you’re entirely right that the joins happen in the application as a result. The reason they’re a pain is that they represent a coordination point — a sort of for-all versus for-each. Think of how you’d do a join in a traditional MapReduce setting; it requires a shuffle! That’s not a coincidence. A lot of the CALM stuff from Cal in ~2011 is related here and def. worth a read. That’s what I meant by a pain. It’s also why it’s really hard to shard a graph.
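To make the "a join requires a shuffle" remark concrete, here is a toy reduce-side join in plain Python. It is only a sketch of the general MapReduce pattern, not any particular framework's API; the function names, the `users`/`orders` inputs, and the reducer count are all made up for illustration.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Tag each record with its source so the reducer can tell them apart."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, item in orders:
        yield user_id, ("order", item)


def shuffle(pairs, num_reducers):
    """The coordination point: every record with the same key must be
    routed to the same reducer before any join output can be produced."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_phase(partition):
    """Pair every user record with every order record sharing its key."""
    for key, values in partition.items():
        names = [v for tag, v in values if tag == "user"]
        items = [v for tag, v in values if tag == "order"]
        for name in names:
            for item in items:
                yield key, name, item


def join(users, orders, num_reducers=2):
    out = []
    for partition in shuffle(map_phase(users, orders), num_reducers):
        out.extend(reduce_phase(partition))
    return sorted(out)
```

The map and reduce steps work record by record or key by key, but `shuffle` has to see the whole input first. A multi-hop graph traversal repeats that step once per hop, which is one way to read the sharding difficulty mentioned above.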

            I think there’s something to having a graph be a secondary, problem-space-only engine, at least for OLTP. But again, lack of well-known engines, guides, schema, etc — it’d be lovely to have more resources and folks to explore various architectures further.

          2. 2

            You’re given a paintbrush and a blank canvas and have to define the world, one edge at a time.

            That’s such a great way to put it :)

            Especially when your problem is graph-shaped!

            I think we need collective experience and training in the industry to recognize problem shapes. We’re often barely able to precisely formulate our problems/requirements in the first place.

            Which database have you authored?

            1. 5

              Cayley. Happy to see it already mentioned, though I handed off maintainership a long while ago.

              (Burnout is real, kids)

            2. 2

              Thanks for Cayley! It’s refreshing to have such a direct and clean implementation of the concept. I too think there’s a lot of promise in the area.

              Since you’re here, I was wondering (no obligation!) if you had any ideas around enforcing schemas at the actual database level? As you mentioned, things can grow hairy really quickly, and once they are in such a state, the exploration to know what needs to be fixed and the requisite migrations are daunting.

              Lately I’ve been playing with an idea for a graph DB that is by default a triplestore under the hood, but with a (required!) schema that would look something like a commutative diagram. This would allow for discipline and validation of data, but also allow you to recognize multiple edge hops that are always there, so for some things you could move them out of the triplestore into a quad- or 5-store to produce more compact disk representations, yielding faster scans with fewer indexes and giving the query planner a bit of extra choice. I haven’t thought it through too much, so I might be missing something or it might just not be worth it.

              Anyway, restriction and grokkability of the underlying schema/ontology does seem like the fundamental limiter to me in a lot of cases, and I was curious whether, as someone who has a lot of experience in the area, you had thoughts on how to improve the situation?

              1. 1

                If you don’t mind me joining in, have you heard of https://age.incubator.apache.org/ ? I’m curious to hear your opinion about whether it can be an effective solution to this problem.

                1. 1

                  If I have one regret, it’s trying to follow the W3C with RDF. RDF is fine for import/export but it’s not a great data model. […] it’s super cool that triples alone can represent pretty much anything, but that doesn’t help you much when you’re trying to be efficient

                  I’ve been using SPARQL a little recently to get things out of Wikidata, and it definitely seems to have pain points around that. I’m not sure at exactly what level the fault lies (SPARQL as a query language, Wikidata’s engine, etc.), but things that seem to me like they should be particularly easy in a graph DB, like “is there a path from ?x and ?y to a common node, and if yes, give me the path?” end up both hard to write and especially hard to write efficiently.
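For comparison, the "path from x and y to a common node" question is short to write against a graph held in memory. This is a sketch over a plain adjacency dict, not SPARQL and not Wikidata's data model; the function names are hypothetical, and it picks the common node with the smallest combined path length.

```python
from collections import deque


def bfs_parents(graph, start):
    """Breadth-first search that records how each node was first reached."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return parents


def path_to(parents, node):
    """Rebuild the path from the BFS start to node."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path[::-1]


def paths_to_common_node(graph, x, y):
    """Return (path from x, path from y) to a shared node, or None."""
    px, py = bfs_parents(graph, x), bfs_parents(graph, y)
    common = [n for n in px if n in py]
    if not common:
        return None
    best = min(
        common,
        key=lambda n: (len(path_to(px, n)) + len(path_to(py, n)), str(n)),
    )
    return path_to(px, best), path_to(py, best)
```

This explores everything reachable from both starting nodes, so it only makes sense when the graph fits comfortably in memory.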

                  1. 2

                    This goes a bit to the one reply separating graphs-as-analytics and graphs-as-real-time-query-stores.

                    SPARQL is the standard (once again, looking at you W3C) but it’s trying to be SQL’s kid brother (and SQL has its own baggage IMO) instead of trying to build for the problem space. Say what you will about Cypher, Dude, at least it’s an ethos. Similarly Gremlin, which I liked because it’s easy to embed in other languages. I think there’s something in the spectrum between PathQuery (recently from a Google paper; I remember the early versions of it and some various arguments) and SPARQL that would target writing more functional paths, but that’s another ramble entirely :)

                2. 23

                  Oh boy. It’s a meme amongst my friends at Google that half the software engineers want to build you a graph database because they are such a cool problem to solve (especially for people who have been drilled by algorithmic coding interview prep).

                  But in all seriousness:

                  1. I’ve seen them used on flashy data science projects, and the way the data was modeled conceptually was very hard to understand.
                  2. Related to that: people (including me) already have trouble with relational data modeling. Nobody wants to admit this, but there are so many choices and so many ways to create a bit of a mess! At least the relational model gives me guarantees, and document (JSON) stores give me maximum flexibility. Graph models feel like the worst of both worlds.
                  3. I don’t know of an established one that has the same reputation for maturity as for example Postgres. Neo4j maybe comes close.
                  4. They are just newer, so they lack those decades of common knowledge about performance characteristics, libraries, cloud offerings, integrations with other tools, etc.

                  So in short, they rarely warrant the additional complexity, and unless you really know what you’re doing, they are a risky bet. 99% of all (non-blob) data in the industry is just simple data with a clear schema and static relations, or unstructured data best modeled as JSON. And in both cases, data “in the enterprise” (which is the market TigerGraph wants) is usually a huge fucking mess, and I’m not sure a graph model would change that.

                  But I agree otherwise that graph models are really cool and appealing, and I would love to have a good use case for one.

                  1. 2

                    It’s a meme amongst my friends at Google that half the software engineers want to build you a graph database because they are such a cool problem to solve (especially for people who have been drilled by algorithmic coding interview prep).

                    It’s admittedly off-topic, but would you mind expanding on this? Do you see a lot of newer engineers who come from that leetcodish/algos background doing suboptimal things?

                  2. 5

                    Set theory is hard

                    Graph theory is very hard

                    1. 3

                      I’m going to come at this from another angle: past the initial reasons for the success of SQL (and relational DBs), they owe much of their success to the availability of learning materials and environments for people. It’s easy for people to learn SQL by looking up any SQLite tutorial and running it right then and there on their laptop.

                      Where’s the SQLite for graph databases? Where can a beginner get practice without arduous setup? There may be solutions that come close, but not close enough.

                      1. 3

                        There are a ton of tutorials on how to make Postgres perform, and if I do nothing it is still fine-ish at moderate sizes. And usually only the tables you use in a query affect its performance. With graph databases… I once had a small piece perform well enough, but not when stored in the same graph as the other data with completely different vertex and edge namespaces (and the purpose was for this small piece to encode filtering logic, with yet another kind of edges used to select the actual data afterwards).

                        Oh well, so back to Postgres it is for the project, and messy filtering can continue living in application code…

                        I guess if people want to build them, at some point the accumulation of knowledge on how to build them and how to use them will make it easier not to make the stupid mistakes I made in that experiment. Or maybe new architectures will make those ways of slapping things together acceptable!

                        1. 3

                          Disclaimer: It’s been around 10 years since I was involved in creating a social network.

                          If I remember correctly we had a few discussions, because social networks are kind of a poster example for graphs, but one point was sharding. There were known-to-work solutions if you were using relational DBs but we didn’t know of any proven thing for the graph databases around. Also we weren’t a startup so no “revolutionizing the world by inventing the best new graph db”, and in the end a lot of it was trying to not spend the innovation budget on something like this, boring is better, as we were supposed to hand off the thing to the company we were building it for, so operationally it should be easy to run. Of course easy is relative, but MySQL (or Postgres?) was a known quantity. Oh, and not to forget: in the end you can model the graph relations quite easily with an RDBMS, so why bother?

                          1. 5

                            Adding on this, to the best of my knowledge there are still no great ways to partition graphs across nodes and it’s been shown to be a hard problem (NP hard or complete, depending on some factors). Intuitively this should make sense: social graphs (for example) have very small diameters and you’re likely to have a lot of edge crossings between different compute nodes.
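A small sketch of what "edge crossings between compute nodes" means in practice. It only measures the cut for two hand-picked placements on a toy ring graph; it does not attempt to find a good partition, which is the hard part. The names and the graph are made up for illustration.

```python
def edge_cut(edges, partition_of):
    """Count edges whose endpoints live on different partitions."""
    return sum(1 for u, v in edges if partition_of(u) != partition_of(v))


# A ring of six vertices: 0-1-2-3-4-5-0.
ring = [(i, (i + 1) % 6) for i in range(6)]

# Modulo (hash-style) placement ignores structure: every edge crosses.
cut_modulo = edge_cut(ring, lambda v: v % 2)

# Contiguous placement keeps neighbours together: only two edges cross.
cut_contiguous = edge_cut(ring, lambda v: v // 3)
```

Every crossing edge is a network round trip during a traversal. On a ring the good split is obvious; on a small-diameter social graph there may be no split that keeps the count low.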

                            That said, I don’t think that distributing a graph DB is a big deal. You really don’t need it: (almost) any graph you would work with will fit in memory (it might be a big machine but it’ll still fit!) and replication is an easier problem.

                            Disclosure: I worked for TigerGraph. Not sure disclosure is even necessary—I left in 2015. But I have a financial interest in graph databases.

                            1. 3

                              social networks are kind of a poster example for graphs, but one point was sharding.

                              Yup. Even with a relational database, it’s hard. To exaggerate only slightly, this is one of the major reasons LiveJournal (the first real social network) failed. Performance was killing them, and they had to throttle growth via invite-codes for a few years while they worked out how to do clustering with MySQL, and cache as much as possible in RAM. This was c.2001, before any of this was common, and some of the tools BradFitz invented, like memcached, are still in use today. End result was they couldn’t grow fast enough, and they didn’t have resources to evolve the UX or feature set enough. By the time Facebook caught on, they were doomed (sob!)

                              1. 1

                                It’s a bit unfair to blame everything on performance though, because I remember LiveJournal in its heyday, but there were a few reasons I never signed up. I hated the design, it didn’t look like a social network, it looked like a collection of blogs to me (without support to bring your own domain) and also 90% of where I ended up by clicking random links to LJ, it was only fanfic. I never noticed anything slow, but I can’t exactly tell you if this was indeed more 2001 or more 2007.

                                1. 3

                                  Yeah, as I said, the constant firefighting to keep the servers from overloading meant they couldn’t evolve the UI and feature set. Another big reason was that, after Six Apart (the makers of Movable Type) bought them, they made the fatal mistake of building an all-new service, which looked very nice but flopped, instead of improving LJ.

                                  1. 1

                                    Social networks can exist and even thrive without frills: viz. Hacker News.

                                    1. 1

                                      Sure, and maybe I misunderstood the point, but I wanted to talk about “social networks” as they are kinda clearly defined to the general populace, like Facebook and not any community of online people. Might be narrow, might be wrong, but the frills were not the point, but that “era” of consolidation towards single closed mass networks and not anything open or small.

                              2. 2

                                Too expensive (Neo4j licensing, Neptune instances) or not fast enough even when expensive (Neptune).

                                We have > 20TB of data and expire it after N months, so we are slightly delete-heavy. We also take advantage of the fact it’s a graph database to spot where we made connections immediately on adding nodes - for a good user experience - so bulk loads aren’t an option for making things cheaper / faster.

                                1. 2

                                  Good question, I’ve asked myself the same thing a few times. As others have noted a lot of issues seem to be around scaling, and expensive (or even crazy-expensive) hosted services. I think there’s also an extent to which they’re perceived as a very specific tool for a very specific task, but in my experience they’re usable in a much more general context than just a friends graph or a recommendation engine.

                                  I co-run a niche event listing site & app with a comparatively small (30k verified users and 130k historical events) dataset that uses Neo4j community edition as the main/general-purpose database. We came for the deep relationship traversal and pre-provided graph algorithms to help with recommendations, stayed for the natural-feeling data modelling in Cypher (the Neo4j declarative query language), and over time became more dependent on APOC, the (genuinely) “Awesome” procedure library.

                                  Getting good query performance can be unexpectedly complex or simple depending on what you try to do and which end of the relationship path you start from conceptually as well as practically, but with good support for subqueries and conditional query execution you can pretty much always find a way to do what you need.

                                  It was operationally quite painful when we started using it in 2014, but we don’t find it so now, and for people like us who don’t have to scale it to the moon, in many cases I can think of I’d honestly prefer to use Cypher to build out the world edge by edge rather than SQL and schemas, not only because of the flexibility but because it just feels more natural to model relationships that way. It also used to be pretty fiddly to model dates and times or spatial elements in the graph because there weren’t any native types to support it, so you had to do stuff like creating a “time tree” of days/months/years as nodes and relationships between them - but date/time/instant/durations are all built-in now, along with some lat-long and geo stuff that doesn’t need extra plugins, and it’s pretty much all there.

                                  I wonder if there’s still a perception that some of the stuff we take for granted in an RDBMS isn’t there, and whether that affects people’s decisions only to use graph stores for traditionally graph-specific domains? Not saying everyone should use graphs for everything, at all, but in a lot of situations modelling real-world relationships in data, I think that perception is definitely out of date, and a lot of people might benefit from trying it in a graph.

                                  If I had to do it all in Gremlin or TinkerPop or similar I’d bail pretty fast, but match (person)-[:likes]->(thing) return person.id is so much more intuitive than select id from person, thing, person_thing pt where thing.id = pt.t_id and person.id = pt.p_id or whatever that I find myself thinking model questions through in Cypher really intuitively, or at least by seeing the concept of the query build up sort-of-visually in front of me as I write it. Labels like match (p:Person) (or multiple labels, as in match (p:Person:Admin)) give you easy indexes and constraints as well as a flexible kind of grouping mechanism.
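For anyone who wants to run the relational side of that comparison, here is a small sketch using SQLite. The table and column names follow the comment (`person`, `thing`, `person_thing`, `p_id`, `t_id`); the sample rows and the function name are hypothetical. The bare `id` in the comment's SQL would be ambiguous across `person` and `thing`, so it is written as `person.id` here. The Cypher is kept only as a string and is not executed.

```python
import sqlite3

# The Cypher pattern from the comment, kept as a string for comparison only.
CYPHER = "MATCH (person)-[:likes]->(thing) RETURN person.id"

# The same question asked through a link table, with explicit joins.
SQL = """
    SELECT person.id
    FROM person
    JOIN person_thing AS pt ON pt.p_id = person.id
    JOIN thing ON thing.id = pt.t_id
    ORDER BY person.id
"""


def people_who_like_something():
    """Build a throwaway database and run the join (sample data is made up)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (id INTEGER PRIMARY KEY);
        CREATE TABLE thing  (id INTEGER PRIMARY KEY);
        CREATE TABLE person_thing (
            p_id INTEGER NOT NULL REFERENCES person(id),
            t_id INTEGER NOT NULL REFERENCES thing(id)
        );
        INSERT INTO person VALUES (1), (2), (3);
        INSERT INTO thing  VALUES (10), (11);
        INSERT INTO person_thing VALUES (1, 10), (1, 11), (2, 10);
    """)
    return [row[0] for row in conn.execute(SQL)]
```

The join returns one row per (person, thing) pair, so person 1 appears twice and person 3, who likes nothing, does not appear at all.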

                                  I’d love it if more graph DBs supported (open)Cypher, but even then, unless they implemented a lot of APOC, I’d find myself missing that. AgensGraph supports SQL and Cypher. A while back it was a pgsql fork, which seemed like too much trouble to think about, but it seems it’s become a pgsql extension since then, so it might be a lot easier to deal with operationally now.

                                  There are other interesting multi-model systems like OrientDB but it’s yet another query language and how much time does anyone really have?

                                  1. 2

                                    (person)-[:likes]->(thing) return person.id is so much more intuitive than select id from person, thing, person_thing pt where thing.id = pt.t_id and person.id = pt.p_id

                                    For one-to-many relationships I’ve found jOOQ’s implicit join feature a pretty pleasant happy medium between completely manually-constructed SQL and a full-on ORM. In jOOQ code, you imply the join by referencing the foreign key relationship from the child table, like select(THING.person().ID).from(THING).

                                    It does nothing for many-to-many relationships or for queries where you need to start at the parent side of a parent-child relationship, though. Not suggesting it’s a replacement for a graph database or anything, but it definitely reduces some boilerplate.

                                    1. 1

                                      Neat :-)

                                  2. 2

                                    Flexibility, I think. Relational databases are hard to beat in the general case. Perhaps specialised databases are better for some specific cases, but relational databases can offer an acceptable solution to those problems and handle everything else too. There are also consistency benefits to keeping everything in the same database rather than splitting between multiple ones.

                                    1. 2

                                      Others have mentioned the issue with determining boundaries when spreading a graph over multiple DB nodes. This means many people end up using a single DB instance for the graph. But at that point—for many use cases—loading the entire graph into memory in your process that’s doing some analysis/processing is simpler and more efficient than sending queries to a database.

                                      1. 1

                                        Like any “why isn’t ___ more popular” question I’m sure this will attract a lot of plausible-sounding anecdotes but, like all such questions, the answer is almost certainly that tech fashion just didn’t go that way, it went some other way for no reason at all except happenstance. Why didn’t OS/2 become more popular? You can describe the history there but there’s no why, it just did. An identical world with one fewer butterfly probably went another way instead and they’re just as sure about why it went that way. If graph databases did become more popular you’d have the same folk explaining why it’s obvious that it happened that way.

                                        But for a baseless anecdote of my own: SQL can model graphs, if poorly, and there are some great SQL databases that already exist, so most people with a graph problem just write it as a SQL problem instead because they learnt how to use it in school and already know it and their friends and coworkers know it. Writing a database is difficult, hard work. Especially if you care about correctness. The number of people in the world that can do it is finite and for whatever historical set of reasonless circumstances they’re concentrated in the SQL space instead and that feeds itself.

                                        1. 1

                                          Apart from being difficult to implement efficiently, most services don’t have that kind of data to make use of graph databases.