1. 0

    by replicating only files we can easily achieve a simple synchronization solution without tampering with the databases themselves

    This makes it immediately useless with most databases (MySQL, PostgreSQL, SQLite, Oracle), as a snapshot of disk state is not a recoverable consistent point.

    They seem to be aware of that:

    Although the base idea is very simple, it can only be adapted to databases that meet certain criteria. Firstly, data updates must be operated atomically on a single file, so that the peer-to-peer system can propagate individual updates. Secondly, most database systems use memory-mapped files and mutate data in unpredictable positions all over the files of their transaction logs and mutable indices.

    They open up with

    For simplicity we leave the conflict resolution for another time.

    That is one of the main hard problems to solve with database replication. If they don’t even approach this, then what does it have to offer?

    1. 1

      It’s just a proof of concept used to explain the general idea of replication. The point is to explain the idea with a simple example.

    1. 7

      The big bad boy servers like postgres have quite a setup cost…

      i.e. it will take you a fair amount of time and effort before you can even say ‘select “hello world”;’

      Can I recommend sqlite…?

      Incredibly small, simple and easy, fast and rock solid.

      It’s utterly ubiquitous (odds on it is in your browser AND your phone!) Not just one instance either.

      The documentation is excellent https://sqlite.org/lang.html

      And the gui tools are lovely. https://sqlitebrowser.org/dl/

      Or you can get a Firefox addon that allows you to go poking about in all the sqlite db’s in your browser (or elsewhere) http://add0n.com/sqlite-manager.html

      Lastly I cannot help but have my favourite whinge.

      Database professionals have lost their way hopelessly.

      Their core ideas come from relational algebra…. and they have forgotten them or never quite understood them.

      Once you have understood the basics of SQL and want to level up…..

      Read “SQL and Relational Theory: How to Write Accurate SQL Code” by C.J. Date and take him deadly seriously.

      Suddenly your SQL becomes much much smaller and more reliable.

      Of course, I suggested you move away from Postgres unless you need it……

      …but when you do need a serious SQL server, I strongly recommend it.

      1. 2

        The big bad boy servers like postgres have quite a setup cost…

        i.e. it will take you a fair amount of time and effort before you can even say ‘select “hello world”;’

        in what context? $package_manager install, add a user/db, then open the command line client and start writing SQL? roughly the same for MariaDB and postgres

        1. 1

          Modern package systems have made it hugely easier than it was… but still, it’s a fair bit to read and understand (what’s a user, what’s a db, what’s a role, oh dear, the DB notion of users and grants is utterly unrelated to the unix notion… but I need to have both) and then some stuff to do before you’re ready to go, compared to sqlite.

          If the aim is just learning sql, sqlite is just so easy.
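          To make that concrete, a minimal sketch (assuming only the sqlite3 command-line shell is installed; the file name is a placeholder). There is no server, user, role or grant step at all:

          -- $ sqlite3 hello.db
          select 'hello world';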

          1. 1

            I thought that too. The main counterpoint I can think of is the backup strategy of “copy the file” vs pg_dump, but tbh even that isn’t a huge amount of extra work.

            1. 1

              Main counterpoint I can think of is the backup strategy of “copy the file” vs pg_dump

              “copy the file” is not a safe backup strategy for sqlite ;) Regardless of the tech you choose, you will have some homework to do: https://www.sqlite.org/backup.html

              1. 2

                “Copy the file” will lose your most recent writes, but I thought it always yields a transactionally-consistent view of the database (it has to, as a power failure could stop writes at any time).

                I take daily backups, and I don’t care if they are actually backing up data from 10 minutes earlier than the scheduled backup time.

                1. 1

                  “Copy the file” will lose your most recent writes, but I thought it always yields a transactionally-consistent view of the database (it has to, as a power failure could stop writes at any time).

                  You are mixing up two failure cases. The sqlite DB itself is transactionally-consistent and will recover on power loss during writes.

                  A ‘dumb copy’ of a consistent sqlite file that fails mid-write may end up corrupt. You’re at the mercy of the filesystem to recover it, and even then the resulting file may not be usable by sqlite itself, as there is no write-ahead log for the copy operation that the db could recover from.

                  The docs I linked above mention exactly that scenario:

                  if a power failure or operating system failure occurs while copying the database file the backup database may be corrupted following system recovery.

                  1. 1

                    I’d be astonished if a power failure mid-backup ever resulted in a working backup.

                    That said, it looks like VACUUM INTO is probably what I actually want.
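                    For the record, a rough sketch of that (VACUUM INTO needs SQLite 3.27 or newer; the path is made up):

                    -- writes a transactionally-consistent copy of the live db to a new file,
                    -- avoiding the mid-copy corruption risk discussed above
                    vacuum into '/backups/app-backup.sqlite3';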

          2. 2

            Database professionals have lost their way hopelessly.

            Their core ideas come from relational algebra…. and they have forgotten them or never quite understood them.

            How so? What do you base this assertion on? I will agree that many people that work with databases lack basic relational theory knowledge and this is a problem. But what is specific about “Database Professionals”? How does it differ from [insert your programming language] professionals?

            Once you have understood the basics of SQL and want to level up….. Read “SQL and Relational Theory: How to Write Accurate SQL Code” by C.J. Date and take him deadly seriously.

            This is certainly good advice. Note the word basics. If one doesn’t dig into relational theory fundamentals, then only the very basics of SQL can really be understood.

            I would also recommend the original paper by Codd, where it all started (so to speak): https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

            1. 2

              Reliance on autoincrement keys is a clue. It immediately tells you they don’t actually know what their primary key is.
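              To make that concrete, a sketch with hypothetical tables (names made up): the smell is not the surrogate key itself, it’s that nothing else in the table is declared unique.

              -- the autoincrement surrogate is the only declared key: nothing says what identifies a row
              create table users_smell (id serial primary key, email text);

              -- same surrogate, but the natural key is known and enforced
              create table users_better (id serial primary key, email text not null unique);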

            2. 2

              I should definitely mention sqlite. Even if, as some sibling comments mention, installing MySQL or Postgres is pretty easy with a package manager, new developers are far less likely to be stymied by bizarre admin or install errors with sqlite.

            1. 3

              w3schools uuughhhhh

              https://meta.stackoverflow.com/questions/280478/why-not-w3schools-com

              I have reached a point with them where I literally filter them out of google searches.

              1. 2

                It’s funny to see a discussion on Stack Overflow bashing the quality of another resource while it itself is known for highly misleading and often damaging answers.

                1. 1

                  Pot calling kettle of course, but w3schools is historically pretty rough. They are simply saying avoid using it as a citation because it is often incorrect. Now I hear they’ve tried to clean up their act and all that but there’s so many better resources. Even just getting a decent book on the subject would do you far better than trying to read w3schools.

              1. 1

                So what can you do? It turns out that in some databases like PostgreSQL, there is an easy solution for that:

                create index concurrently index_name on table(column);

                Yes, that’s it. This will switch the default table-locking behavior and will create the index in parallel with the standard operations that your app is doing on the DB.

                This is dangerous advice. Before blindly following it, do read the documentation and understand the trade-offs. It’s not the default behavior for a reason:

                There are several caveats to be aware of when using this option — see Building Indexes Concurrently.

                source: https://www.postgresql.org/docs/11/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY
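                One caveat from those docs worth spelling out (a sketch, with a made-up index name): if a concurrent build fails or is cancelled, it leaves an INVALID index behind that still adds write overhead and has to be dropped manually.

                -- list indexes left invalid by a failed concurrent build
                select indexrelid::regclass from pg_index where not indisvalid;

                -- drop the leftover, then retry the build
                drop index concurrently index_name;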

                1. 1

                  Good point. It made sense in our project, but as with everything there are drawbacks, and it is a bad idea to copy-paste from a blog into a project without doing your own research.

                1. 2

                  Zero Zero Zero by Roberto Saviano, describing the global cocaine market. Famously, the book was found in one of El Chapo’s hideouts.

                  Had they asked, authorities might have learned that Guzmán, leader of the notorious Sinaloa Cartel, would have preferred Zero Zero Zero, the 2013 book by Roberto Saviano that explored the world of drugs and dirty money. That was the book he left on his bedside table last October when he was forced to flee his hideout in Las Piedrosas, in the heart of Sinaloa Cartel country in the Golden Triangle.

                  source: https://mexiconewsdaily.com/news/a-depressed-el-chapo-gets-to-enjoy-a-book/

                  I also started reading on the bus to work. Should have done that ages ago. I didn’t before, as the trip is very short (roughly 8 minutes), but it’s still enough to make progress and spend the time better, and sometimes, with increased traffic, it’s a quality-of-life change :)

                  1. 2

                    Slow count(*) in Postgres is actually a bit of a FAQ

                    1. 7

                      This is unrelated to the actual article and what it explores.

                      The blog talks about the difference between count(1) and count(*). The former was historically considered faster by many, as it was often assumed to be optimized by the database into a constant not-null literal that doesn’t require re-evaluation, whereas count(*) would require evaluating the whole row.

                      The blog debunks this and shows that what is considered faster is actually slower in PostgreSQL. What you are pointing out is actual counting performance, which is not measured in this post.
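                      For anyone wanting to reproduce that kind of comparison, a minimal sketch (table name and row count made up, not taken from the post):

                      create table t as select generate_series(1, 1000000) as x;

                      explain analyze select count(*) from t;
                      explain analyze select count(1) from t;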

                      1. 1

                        Right, sorry for the confusion. It has not been my day!

                    1. 1

                      Nice article and a good general overview.

                      The first linked youtube video is dead, removed. There is no way to check what it was linking to.

                      There is no mention of jsonpath: https://www.youtube.com/watch?v=70dBszaO67A (PDF: https://www.pgcon.org/2019/schedule/attachments/539_jsonpath-pgcon-2019.pdf)

                      Which is very relevant, considering that a large portion of the article compares query syntax and indexing. This is relevant even now if you happen to use jsquery.

                      edit: strikeout for the jsonpath nit, I see it is indeed mentioned in the conclusion.

                      1. 9

                        I recently finished Gomorra by Roberto Saviano, a very emotional, engaging and enraging testimonial of devastation and corruption happening right under our noses. Read it if you think organized crime is distant, with no direct impact on your life.

                        Right now I’m going through The Art of Strategy.

                        1. 4

                          I read Gomorra a couple of years ago. It’s indeed quite terrifying.

                        1. 1

                          Erm… I didn’t read the entire article, but all I do is createdb (and createuser) for each project. Is there anything wrong or worse with this approach vs. what is described in the OP?
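                          For context, this is roughly the SQL those two wrappers run under the hood (project name is a placeholder):

                          -- createuser myproj && createdb -O myproj myproj
                          create role myproj login;
                          create database myproj owner myproj;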

                          1. 2

                            Nothing wrong. This however is worthwhile because it makes it easier to kick-start people not used to Postgres. I recall an episode of @garbage.fm where our own @jcs, when asked why lobste.rs was using MySQL, said something to the effect that it was the database he was already used to, and that he never remembered how to quickly set up Postgres DB access for the app for local development, while MySQL just worked out of the box.

                            1. 2

                              Not at all, same as what I do. If one of the main issues is getting new developers up and running, I’ve found https://postgresapp.com/ super useful, as you can have multiple postgres versions running and it handles updates and all that for you. It’s Mac only though, but I don’t think it would be too hard to make something similar for Linux and Windows as well.

                              One of the first things I do in a db-backed project as well is setup db migration and db instantiation code so that I can get projects up and running with minimal fuss moving forward. It’s too easy to get into a kludge of SQL and other scripts peppered all over and lost one-off statements modifying your database schema/data while you’re in early/heavy development. Getting db migration stuff working right away means new developers on the project don’t have to worry about that stuff too much if it isn’t in their domain.

                              1. 2

                                The difference is that your approach changes the global state of the system, including for example having a daemon running and potentially listening on the network, as mentioned in the article.

                                With the article’s approach, you don’t have to worry about name collisions between database names, or about leaving behind junk in the system after getting rid of a project.

                              1. 3

                                A solution to avoid tracking would be for webmail to pre-download all the images and serve those to the clients. That way the tracking data would be completely bogus but clients can still see the images.

                                1. 1

                                  The tracking data would not be completely useless because the URL to the image can contain a unique identifier.

                                  1. 4

                                    If GMail always fetches all URLs in the mail before you even open it, regardless of whether you log in to your account, then it renders the tracking feature useless, as that’s no longer the user reading your mail - in fact the user may never do that.

                                    1. 1

                                      I’m assuming you mean the image resources, because a lot of email I receive contains one-time-use links that expire the second they get requested. Any client prefetching those would render the mechanism broken.

                                      1. 1

                                        Spilled.ink purports to do this (listed under “Server asset fetch”)

                                      2. 1

                                        Since the mail server would pre-fetch all the images, the sender would see somebody loading the tracking pixel from the mail server IP at the time the email is received.

                                        Gmail already does something similar where it proxies all image requests through their servers to hide the reader’s location. This would be going one step further and even hide open times.

                                    1. 6

                                      This article has some interesting, insightful points to make, but the title is really, REALLY unfortunate.

                                      You could have picked something much more on point about the state of Hadoop’s fortunes rather than falling for the clickbait title.

                                      Seriously, folks, can we all agree that “X is Dead” SHOULD BE dead? This meme is so old it’s had its misspent youth, gone through its midlife crisis and is now pondering incontinence options at its local pharmacy.

                                      It’s not funny, it says nothing in and of itself, and detracts from the laudable analysis the article contains.

                                      1. 6

                                        So what you’re saying is… ‘“X is dead” is dead’?

                                        1. 3

                                          What is dead may never die :)

                                          Yes, yes I am!

                                          1. 4

                                            That is not dead which can eternal lie. And with strange aeons even death may die.

                                            H. P. Lovecraft

                                        2. 3

                                          Part of that means exerting editorial control on those titles so as to nullify the clickbait.

                                          1. 2

                                            If by that you mean that writing good, worth reading articles nullifies the clickbait, I disagree.

                                            If that’s not what you mean, I don’t know what you mean :)

                                            1. 5

                                              I mean we should edit titles to nullify clickbait, possibly penalizing them for using it as well.

                                              1. 4

                                                That kind of editorializing is not allowed on this site.

                                                Even adding a text comment saying “this title is clickbaity, but the content is good” is not permitted.

                                                I believe @jcs had bad experiences with mods on HN changing titles silently, which may be the reason.

                                                1. 2

                                                  Oh! Yes I wholeheartedly agree!

                                          1. 2

                                            That is neat, the kind of thing you are really thankful to the blog’s author for having written, as it’s a rather obscure edge-case (at least from my point of view).

                                            For those (like me) wondering why there’s no such mechanism in MySQL: https://stackoverflow.com/questions/25153532/why-is-it-a-vacuum-not-needed-with-mysql-compared-to-the-postgresql

                                            PS: I’d suggest removing the [ruby] tag, and going with a [postgre] one.

                                            1. 2

                                              Just one clarification, as PostgreSQL vacuum does a lot of things, and I think this behavior could also have happened with MySQL.

                                              In this specific blog post, the problem was caused by outdated table statistics.

                                              If you look at the query plan: https://explain.depesz.com/s/7qJI

                                              You can see that the planner wrongly estimated the number of rows the query would return (the “rows x” column and the up/down arrows, compared to the “rows” column).

                                              Vacuum in PostgreSQL is a process for reclaiming space from removed rows, preventing transaction ID wraparound, updating statistics, dumping the pending list of a GIN index into the main tree, and other maintenance tasks. This is a costly process and takes time.

                                              There is also a VACUUM FULL which does full table rewrites. That’s something you almost never run or even want to run - it takes A LOT of time.

                                              There are also autovacuum processes that are automatically run to do the above in smaller incremental steps.

                                              Now, in this particular case, in my humble opinion a vacuum was not needed. The author could have just run ANALYZE order_items, or even ANALYZE order_items (purchase_order_number), which would have had the same result in a fraction of the time.
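                                              For the record, a minimal sketch of that cheaper alternative (names taken from the post above):

                                              -- refresh planner statistics without a full vacuum
                                              analyze order_items;
                                              -- or just the one column the plan depends on
                                              analyze order_items (purchase_order_number);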

                                              I’m not a MySQL expert, but that database also has statistics (https://dev.mysql.com/doc/refman/8.0/en/analyze-table.html), and a step like this might also help the planner after large imports or after adding new columns to big tables. I don’t see anything in the linked stackoverflow post indicating that MySQL would be immune to outdated statistics leading to an inefficient plan.

                                            1. 21

                                              I would have fallen for this.

                                                The gist is that the personal accounts of some university staff with @~.ac.uk email addresses and access to personal homepages on the university’s domain were compromised. Thus, the emails seemed far more legitimate than they otherwise would have.

                                              The real problem is that those university homepages were then hosting a Firefox 0-day. That is a separate issue from the email phishing problem, people could have gotten to those pages via a DDG search.

                                              The whole issue goes to show that browser security is paramount. Should we start considering running our web browsers in containers? Full VMs? Separate user accounts? I really don’t know how far is too far.

                                              1. 15

                                                  Yeah, it’s pretty hard to adjust your threat model to include a Firefox 0day :| I suppose you could look into virtualization or containers. But then, the Firefox 0day would still get access to all of your web sessions.

                                                I think it’s more worthwhile auditing Firefox’s sandbox, so that future 0days will be limited to the content process. But that’s my job anyway, so you all don’t have to (and makes my thinking biased :-))

                                                1. 3

                                                    Using NoScript with its default (deny by default) policy would have prevented the attack.

                                                  1. 1

                                                    In Guix, I’ve been running IceCat (a FF derivative) in a container for some time now, but it needs a bit of work:

                                                    https://lists.gnu.org/archive/html/help-guix/2019-06/msg00292.html

                                                    If there are any people at Mozilla who find this problem interesting and would like to offer advice on some of the issues I outlined there, it’d be great if one day a containerized Firefox (or IceCat, in Guix’s case) could be the norm.

                                                  2. 2

                                                    I was also thinking this.

                                                    Would there be any way to add another level of security around, say, Firefox without breaking any of its functionality? I’d love that!

                                                    1. 7

                                                      You might want to have a look at https://www.qubes-os.org/

                                                      (I never tried it, though)

                                                      1. 3

                                                          There was once Chrome with capsicum extensions, but it was never accepted by upstream. The idea of Capsicum (iirc) is to add capabilities to FreeBSD to restrict applications from doing things that weren’t expected (thereby “sandboxing” them). You could restrict which directories it could write to and which syscalls it was allowed to make. In practice, I think it was a bit like pledge on OpenBSD.

                                                        1. 7

                                                            You could restrict which directories it could write to and which syscalls it was allowed to make. In practice, I think it was a bit like pledge on

                                                          Chrome is sandboxed (unveil) and pledged on OpenBSD. It can only access files within your ~/Downloads directory.

                                                          1. 1

                                                            That’s neat. You happen to know if Firefox has something similar on OpenBSD?

                                                            1. 2

                                                              Last time I checked it only had W^X support (which Chrome doesn’t) but no pledge/unveil - I haven’t checked in a while though.

                                                              1. 2

                                                                  pledge sandboxing was merged into Firefox some time ago https://bugzilla.mozilla.org/show_bug.cgi?id=1457092

                                                          2. 4

                                                            iirc, pledge is for syscalls, unveil is for directories.

                                                            1. 2

                                                              Thanks for the correction. I suppose Capsicum is a bit like a combination of both then.

                                                              1. 2

                                                                  Capsicum is more general-purpose and more flexible, but with this comes complexity in implementing it. OpenBSD’s sandboxing primitives are far more primitive and sweeping, but the hope is that, because they’re easy to implement once you know what subset you need, they can both be adopted easily and keep narrowing privileges after privileged init operations take place, even if the result isn’t 100%.

                                                                1. 2

                                                                    I wouldn’t describe it as flexible, really. It’s very rigid yet elegant. Capsicum is a fundamental change to the way apps work: file descriptors are treated as capabilities — you can’t conjure them out of thin air (i.e. from global namespaces) once in capability mode; you have to derive new ones from existing ones: openat() a file under a directory you already have a descriptor to, accept() on a socket you already have, get one passed in over a unix socket you already have, and so on.

                                                                  With LD_PRELOAD hacks, you can use it to sandbox existing, unaware apps too :)

                                                                  BTW, since this is a browser thread: I might look into adding Capsicum support to Firefox “soon”. The content processes are already designed around fd passing, but I think they let GTK and other libs open whatever they want..

                                                          3. 2

                                                            Would there be any way to add another level of security around, say, Firefox without breaking any of its functionality? I’d love that!

                                                            Besides what others suggested, you may want to use uMatrix and block 1st and 3rd party JavaScript by default. The whole idea of running untrusted code on your machine is kinda insane, despite how impressive browser sandboxes are nowadays.

                                                            1. 1

                                                              Already using uMatrix, uBlock, NanoDefender and Firefox Custom blocking for a while now.

                                                              It’s just really hard to get anyone else to change their workflow as much as I did.

                                                            2. 1

                                                              I haven’t tried it [yet], but there’s https://github.com/netblue30/firejail .

                                                              Firejail is a SUID sandbox program that reduces the risk of security breaches by restricting the running environment of untrusted applications

                                                              (No connection to Firefox, though the morpheme “fire” in the name might suggest that.)

                                                              1. 1

                                                                Only for Linux, right? Too bad.

                                                          1. 3

                                                            It’s better to be good at something, instead of mediocre at everything.

                                                              As the saying goes:

                                                            Jack of all trades, master of none, but oftentimes better than master of one

                                                              That said, I went down the too-spread-out route through the early years of my career. It’s still hard to tell if that was a net gain or a net loss overall. I did everything from Common Lisp, concatenative languages like forth, prolog, smalltalk and erlang, up to backend web development in perl, python and ruby, frontend javascript, and embedded development on ARM boards in C, Ada and C++.

                                                              The only time I had a chance to really specialize was with databases, while working at a corporation (7 years). I’m tired of switching tech, and I’ve realized I can’t get a deep understanding of everything, so narrowing down the list of stuff I touch has been my goal for the past several years.

                                                            I’m now only doing C, Shell, Python, PostgreSQL on BSD in netsec setting. Nothing more, nothing less.

                                                            1. 0

                                                                How did you find the bug? Why haven’t your automated unit tests caught it?

                                                              1. 8

                                                                  Not the author, but how would you test against that? I assume that if he knew the bug was possible, he would just create a constraint preventing experience without an associated account, or a check constraint preventing anyone from creating a user with the cpu/guest names. You have to know that there is a vector for the bug before testing for it - do you usually write tests which would find data inconsistency issues?

                                                                1. 3

                                                                  On the one hand, reserving certain usernames so that nobody can use them to create an account is a good practice for systems like this, and thinking through the set of names to reserve should be part of the development process. On the other hand, very few people actually do that.

                                                                2. 7

                                                                  It wasn’t a bug before Rogan found it.

                                                                    What makes code buggy? Its inconsistency with the spec. Without a spec, code is neither correct nor incorrect. In many cases the spec exists only in one or more heads. When nobody has considered some case, then there is no spec. This implies it was not a bug. The code just worked as coded. It was only when Rogan considered the emergent behavior that he worked out that detail of the spec. Now the code was inconsistent with the spec and thus buggy. Before that moment the code was fine, so of course nobody wrote a test for it.

                                                                  Ok, maybe Rogan had a spec. I just assumed that he did not.

                                                                1. 2
                                                                  • Resting
                                                                  • Writing a PGCon report
                                                                  • Thinking about cleaning up my github repos, might just do that
                                                                  • Nuking an old article on my blog
                                                                  1. 3

                                                                      Considering that WhatsApp uses the same protocol as Signal, I wonder if the issue is in this specific implementation or in the protocol. What is the likelihood of the same or a similar issue existing in Signal?

                                                                      Is there any info on whether this bug was limited to iOS, or does it also affect the Android WhatsApp app?

                                                                    1. 2

                                                                      According to the original Financial Times report, it affects both.

                                                                      1. 2

                                                                        1: You are not correct, the article says nothing of the kind.

                                                                        2: According to an article by WIRED magazine on the topic, Signal is not vulnerable, just the WhatsApp implementation of the Signal protocol. Makes you wonder what other code is in there…

                                                                    1. 1

                                                                          On the other hand, the OP learnt how to configure thousands of machines, no matter the distribution, with a degree of management and orchestration that won’t be matched by someone knowing only how to write Unix commands.

                                                                      I mean, yes, you could probably get away with Ansible and shell commands, but let’s face it, knowing puppet is probably more valuable…

                                                                      1. 3

                                                                            On the other hand, the OP learnt how to configure thousands of machines, no matter the distribution, with a degree of management and orchestration that won’t be matched by someone knowing only how to write Unix commands.

                                                                            I think the key takeaway here is lack of access to the tool set he built at work. Having access to, and knowledge of how to use, an internal company inventory of ansible playbooks/roles (or whatever the chef/puppet equivalent is) doesn’t mean you are able to re-create them on your own.

                                                                        1. 1

                                                                              Definitely true, and it’s the same with development. Once you’re used to your environment, it can be a pain when you change…

                                                                        2. 2

                                                                          Nobody’s denying that these skills are a huge win for everyone concerned - the challenge is how to retain the ability to work in the small when you spend your workday in the large.