1. 12
  1. 4

    I’m a fan of separating your operational and analytics data on large databases. A lot of times, the operational system only needs a subset of the overall data, usually based on some time frame, i.e. only the last two years’ worth. As data comes in, it’s replicated to the analytics DB, and then a scheduled job truncates the out-of-date data in the operational DB.

    By splitting it out, we can be better about keeping the operational DB smaller. That mitigates a lot of the issues described here. We also still get to query the entire data set in the analytics DB, and you’re no longer worried about long-running analytics queries taking down production.
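    Roughly, that retention job is just a scheduled delete against a cutoff. A minimal sketch, assuming SQLite and made-up orders/created_at names (the real thing would be whatever your DB and scheduler are):

      import sqlite3

      RETENTION = "-2 years"  # keep only the operational window

      def truncate_old_rows(conn: sqlite3.Connection) -> int:
          # Assumes these rows have already been replicated to the analytics DB.
          cur = conn.execute(
              "DELETE FROM orders WHERE created_at < datetime('now', ?)",
              (RETENTION,),
          )
          conn.commit()
          return cur.rowcount  # rows purged from the operational DB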

    1. 1

      Replication is easy if the analytics db is 100% identical (and still worthwhile, some analytics queries can be super heavy), but are there any good tools for replicating while ignoring deletions caused by such scheduled truncations?

      1. 1

        That really depends on the adapter/replication tool you’re using. A lot of times what you get is a deleted field: the analytics copy still has the deleted data, but it’s marked with something like ADAPTER_DELETED = TRUE|FALSE. That way you keep all your data and can query against the history knowing whether a row still exists on the operational side.
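        As a toy sketch of what that looks like on the analytics side (event shape and names are invented for illustration; real tools do this from the log with their own config, not application code):

          import sqlite3

          def apply_event(conn: sqlite3.Connection, event: dict) -> None:
              # Replay one change event into the analytics copy.
              row = event["row"]
              if event["op"] in ("insert", "update"):
                  conn.execute(
                      "INSERT OR REPLACE INTO orders "
                      "(id, total, created_at, adapter_deleted) "
                      "VALUES (:id, :total, :created_at, 0)",
                      row,
                  )
              elif event["op"] == "delete":
                  # Keep the row; just record that the source dropped it.
                  conn.execute(
                      "UPDATE orders SET adapter_deleted = 1 WHERE id = ?",
                      (row["id"],),
                  )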

        1. 2

          Can you give a more concrete example? I’m not sure what you mean by “adapter”.

          1. 5

            Many enterprise databases come with adjacent (but sometimes separately priced) replication tools (e.g. [1], [2]).

            Most of those replication tools are based on the redo logs (rather than triggers), so they preserve the notion of an ACID transaction when replicating. At the same time, they often allow very sophisticated filtering to avoid replicating data that isn’t needed.
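            Conceptually that filtering is a predicate applied to each captured change before it ships; a hypothetical sketch (not any vendor’s actual config syntax):

              # Toy filter over captured change events (shape is invented):
              # skip tables the analytics side never needs, and only ship
              # the order rows that match a predicate.
              SKIP_TABLES = {"audit_log", "session_tokens"}

              def should_replicate(event: dict) -> bool:
                  if event["table"] in SKIP_TABLES:
                      return False
                  if event["table"] == "orders":
                      return event["row"]["region"] == "EU"
                  return True

              def ship(events, sink: list) -> None:
                  for e in events:
                      if should_replicate(e):
                          sink.append(e)  # stand-in for writing to the target DB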

            Replication into an analytics database vs. ETL into an analytics database is not a simple decision.

            Normally, operational databases need schemas or document-oriented structures that perform well for updates, while analytics databases need to be organized for reads (hence denormalization, column-oriented storage tech, etc.).
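            A tiny made-up illustration of that split, with the SQL kept in Python strings: the operational side stays normalized for writes, while the analytics side flattens the join into one wide, read-friendly table:

              # Hypothetical schemas, purely for illustration.
              OPERATIONAL_SCHEMA = """
              CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
              CREATE TABLE orders (
                  id INTEGER PRIMARY KEY,
                  customer_id INTEGER REFERENCES customers(id),
                  total REAL,
                  created_at TEXT
              );
              """

              ANALYTICS_LOAD = """
              -- Denormalized: customer_name is copied in, so reads need no join.
              CREATE TABLE order_facts AS
              SELECT o.id AS order_id, o.created_at, o.total, c.name AS customer_name
              FROM orders o JOIN customers c ON c.id = o.customer_id;
              """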

            Some vendors (e.g. Oracle, SAP HANA) have implemented complex optimizations (relying on FPGAs and vector-oriented CPU instructions) to achieve what they claim is a hybrid row+column-oriented storage. All that effort is justified so that ETL from operations into analytics can be avoided (ETL is a source of huge expense and a huge number of data quality issues), and operational-DB-oriented schemas can instead be used for analytics.

            There are also approaches, of course, that sit on top of the databases, helped by custom application layers (these are mostly described as lambda or kappa architectures).

            Examples of database replication tools that allow filtering during replication:

            [1] https://docs.oracle.com/goldengate/c1230/gg-winux/GGCON/introduction-oracle-goldengate.htm#GGCON-GUID-EF513E68-4237-4CB3-98B3-2E203A68CBD4

            [2] https://www.marklogic.com/product/marklogic-database-overview/database-features/flexible-replication/

            1. 1

              Fivetran is one example I know for sure that does what I mentioned. It costs $, but I’m sure there are more out there.