1. 24
    1. 4

      Hi Denis, what a great little project! I was wondering if you had any benchmarks, such as nodes/edges inserted per second or querying speed based on graph size and type of query. I remember trying something similar with SQLite years ago but stopped because of the low performance.

      1. 2

        Hi Sébastien, thank you for taking a look!

        The goal of this first release was to get it functional, with a minimal set of table and SQL definitions.

        I would guess it would perform poorly at scale, since each operation requires opening the file, creating a connection, executing the cursor, and then shutting everything down again.

        Still, it is more than adequate for my current use case of several thousand nodes.

        There are definitely opportunities to introduce bulk functions, so that multiple node inserts and connections, as well as path traversals, can happen within a single atomic operation; a rough sketch of the idea is below.
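
        A minimal sketch of what such a bulk function could look like, assuming a nodes table with a JSON body column as in the examples in this thread (the helper name is made up):

        import sqlite3

        # Hypothetical bulk helper: one connection and one transaction for
        # many inserts, instead of a connect/execute/close cycle per node.
        def add_nodes_bulk(conn, bodies):
            with conn:  # one transaction; commits on success, rolls back on error
                conn.executemany(
                    "INSERT INTO nodes (body) VALUES (json(?))",
                    ((b,) for b in bodies),
                )

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
        add_nodes_bulk(conn, ['{"name": "a"}', '{"name": "b"}', '{"name": "c"}'])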

        Pull requests are welcome! :)

    2. 3

      Why generate the id from the body instead of just having an INTEGER PRIMARY KEY? You effectively have two indexes on nodes now: the implicit index on rowid, and id_idx. From your examples, you are only using numeric ids (and none of them are part of the original JSON). In addition, SQLite can automatically generate values for rowid (or rowid alias) columns. And what do you do if two nodes have duplicate ids? Shouldn’t id_idx be UNIQUE? I would also suggest giving edges a composite primary key and making it WITHOUT ROWID; a sketch of what I mean is below.
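
      A minimal sketch of that suggestion, reusing the table names from the project (this is not its actual schema):

      import sqlite3

      conn = sqlite3.connect("graph.db")
      # id is a rowid alias: SQLite assigns it automatically on insert,
      # so the only index on nodes is the implicit one on the rowid.
      conn.executescript("""
      CREATE TABLE IF NOT EXISTS nodes (
          id   INTEGER PRIMARY KEY,
          body TEXT NOT NULL
      );
      CREATE TABLE IF NOT EXISTS edges (
          source INTEGER NOT NULL REFERENCES nodes (id),
          target INTEGER NOT NULL REFERENCES nodes (id),
          PRIMARY KEY (source, target)
      ) WITHOUT ROWID;
      """)
      # id can be omitted on insert; SQLite assigns the next rowid.
      conn.execute("INSERT INTO nodes (body) VALUES (json(?))", ('{"name": "a"}',))
      conn.commit()
      conn.close()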

      1. 1

        Agree that WITHOUT ROWID sounds like a good idea for the edge table, but perhaps the schema allows for multiple edges between the same nodes? E.g. one edge is “parent - child”, another edge is “taxpayer - dependent”, versus forcing a single large row for all the relationships between two nodes.

        As for IDs, it’s really useful to allow clients to come up with their own, using e.g. UUIDs. Centralizing ID allocation leads to headaches where code needs temporary IDs or some equivalent surrogate (like a unique ActiveRecord in-memory object) when first creating data, or to anti-patterns like persisting data in multiple phases just to create relationships. For example, if I want to create an edge between two nodes I have yet to create, instead of a few simple, independent insert statements, I need a more complex sequence: write node 1, write node 2, read back the row IDs, then create the edge. Forget it when it comes to distributed / peer-to-peer, etc.
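
        A minimal sketch of the client-generated-ID approach (assuming a TEXT id column on nodes; uuid4 stands in for whatever scheme the client uses):

        import sqlite3
        import uuid

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        CREATE TABLE nodes (id TEXT PRIMARY KEY, body TEXT NOT NULL);
        CREATE TABLE edges (source TEXT NOT NULL, target TEXT NOT NULL);
        """)
        # The client mints the IDs, so the edge can be written without
        # reading anything back: three simple, independent inserts.
        bill, msft = str(uuid.uuid4()), str(uuid.uuid4())
        conn.execute("INSERT INTO nodes (id, body) VALUES (?, json(?))",
                     (bill, '{"name": "Bill Gates"}'))
        conn.execute("INSERT INTO nodes (id, body) VALUES (?, json(?))",
                     (msft, '{"name": "Microsoft"}'))
        conn.execute("INSERT INTO edges (source, target) VALUES (?, ?)", (bill, msft))
        conn.commit()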

        1. 2

          Agree that WITHOUT ROWID sounds like a good idea for the edge table, but perhaps the schema allows for multiple edges between the same nodes? E.g. one edge is “parent - child”, another edge is “taxpayer - dependent”, versus forcing a single large row for all the relationships between two nodes.

          Well, since SQLite is still a relational database and not a graph database, I think you would have to add a column for that. For example, the schema could look like

          CREATE TABLE IF NOT EXISTS edges (
              source INT NOT NULL REFERENCES nodes (id),
              target INT NOT NULL REFERENCES nodes (id),
              action TEXT NOT NULL GENERATED ALWAYS AS (json_extract(properties, '$.action')) STORED,
              properties TEXT NOT NULL CHECK (properties = json(properties)),
              PRIMARY KEY (source, target, action)
          ) WITHOUT ROWID;
          

          This is a pretty good use for a generated column (unlike what was shown in the OP).

          edit:

          Generated columns may not be used as part of the PRIMARY KEY. (Future versions of SQLite might relax this constraint for STORED columns.)

          Looks like I overlooked this. In that case, I would just add a plain action column and not store it in the JSON at all. If you already know that all your rows will need an action, there is no point in keeping that data in JSON; see the corrected sketch below.
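
          A sketch of that corrected schema (same assumed tables as above):

          import sqlite3

          conn = sqlite3.connect(":memory:")
          # action is now an ordinary column, which the PRIMARY KEY can use
          # without running into the generated-column restriction quoted above.
          conn.executescript("""
          CREATE TABLE IF NOT EXISTS nodes (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
          CREATE TABLE IF NOT EXISTS edges (
              source     INT  NOT NULL REFERENCES nodes (id),
              target     INT  NOT NULL REFERENCES nodes (id),
              action     TEXT NOT NULL,
              properties TEXT NOT NULL CHECK (properties = json(properties)),
              PRIMARY KEY (source, target, action)
          ) WITHOUT ROWID;
          """)
          conn.close()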

          For example, if I want to create an edge between two nodes I have yet to create, instead of a few simple, independent insert statements, I need a more complex sequence: write node 1, write node 2, read back the row IDs, then create the edge.

          The easiest way is to have some kind of identifying information about the data already. For example, if names are unique, then you could do

          INSERT INTO nodes (body) VALUES (json('{"name": "Bill Gates"}')), (json('{"name": "Microsoft"}'));
          INSERT INTO edges (source, target, properties) VALUES (
              (SELECT id FROM nodes WHERE json_extract(body, '$.name') = 'Bill Gates'),
              (SELECT id FROM nodes WHERE json_extract(body, '$.name') = 'Microsoft'),
              json('{"action": "founded"}')
          );
          

          Now this looks bad because of the two sub-selects. However, because you just inserted those very values, it’s likely that the necessary rows are still in cache. If you have the appropriate indexes (e.g. CREATE INDEX nodes_name_idx ON nodes (json_extract(body, '$.name'));) this can be pretty fast.

          Of course, if you have no such identifying information, then you will need to fall back on multiple inserts. All is not lost, though: you can use last_insert_rowid() (exposed as lastrowid on Python cursors) to get the rowid (or rowid alias) of the last inserted row without an additional query. For example, if you are using Python (like OP), then you could do

          import sqlite3

          conn = sqlite3.connect("graph.db")
          cur = conn.cursor()
          cur.execute("INSERT INTO nodes (body) VALUES (json(?));", ('{"name": "Bill Gates"}',))
          source = cur.lastrowid  # rowid of the row we just inserted, no extra query
          cur.execute("INSERT INTO nodes (body) VALUES (json(?));", ('{"name": "Microsoft"}',))
          target = cur.lastrowid
          cur.execute("INSERT INTO edges (source, target, properties) VALUES (?, ?, json(?));",
                      (source, target, '{"action": "founded"}'))
          conn.commit()
          

          This is only one query more than the default implementation. Unfortunately, SQLite has no RETURNING clause like PostgreSQL, so you need to be careful with operations like INSERT OR IGNORE: if nothing ends up being inserted because of a UNIQUE constraint, lastrowid will still hold the value of the last successful insert. However, given that you have a UNIQUE column, you can hopefully fall back on that column and use the first strategy; a sketch of the pitfall is below.
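
          A minimal sketch of that pitfall, assuming a UNIQUE expression index on the extracted name:

          import sqlite3

          conn = sqlite3.connect(":memory:")
          conn.executescript("""
          CREATE TABLE nodes (body TEXT NOT NULL);
          CREATE UNIQUE INDEX nodes_name_idx ON nodes (json_extract(body, '$.name'));
          """)
          cur = conn.cursor()
          cur.execute("INSERT INTO nodes (body) VALUES (json(?));", ('{"name": "Bill Gates"}',))
          first = cur.lastrowid
          # Duplicate name: nothing is inserted, but lastrowid is not reset --
          # it still points at the earlier, successful insert.
          cur.execute("INSERT OR IGNORE INTO nodes (body) VALUES (json(?));",
                      ('{"name": "Bill Gates"}',))
          print(cur.lastrowid == first)  # True
          # The safe fallback: look the row up via the UNIQUE column instead.
          row_id = cur.execute(
              "SELECT rowid FROM nodes WHERE json_extract(body, '$.name') = ?",
              ("Bill Gates",),
          ).fetchone()[0]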

          Forget it when it comes to distributed / peer-to-peer, etc.

          Fortunately this is SQLite. If you need distributed anything, you should really be looking at a Real Database (TM).

    3. 1

      I’ve developed exactly this type of schema for use in a small project. When I hit several million nodes, performance was abysmal, so I just switched to Neo4j and never looked back.