Threads for ankush

    1. 4

      select count(*) where … is an O(N) operation which essentially gets converted to a for-loop iterating over all rows in the table

      They never specify what database they’re working with, but this shouldn’t be true for any DB I’m familiar with. For instance, PostgreSQL will use an index for this if you have e.g. a primary key. (Yes it is slightly more complicated than that, but even with the visibility map I believe a table count should be substantially faster than a regular scan.)

      1. 4

        I think you’re both wrong and right.

        You’re right because, since the postgres team made the push towards index-only scans, you can get much faster count(*) performance[1]. It’s still upper-bound at O(n), but average case is O(nm), where m is the fraction of pages that have been recently changed, which is usually low enough to make it a big win.

        However, I think you’re wrong because all of that above requires a covering index (“covering” in this case meaning, roughly, “usable as the sole filter for this WHERE clause”), and I don’t think that’s a valid assumption based on their article. Specifically, they seem to be talking about filterable list views (see the available options in the left sidebar and above the grid in that first screenshot, for example). Those are really hard to have covering indexes for, because the user has a high likelihood of filtering on multiple fields. And in those cases, you are definitely doing a table scan, regardless of RDBMS.

        So I suspect they’re being accurate in the description of the problem, but not clearly communicating why it’s a problem for them when it won’t necessarily be for many people.

        [1]: Amusingly enough, historically PostgreSQL was bad at this use case, and famously so. It figured big in the “MySQL vs PostgreSQL” wars of the aughts. Specifically, at the time, MySQL using MyISAM tables could quickly retrieve count(*) (don’t recall if it was cached per index or rapidly calculated based on space use, but it was much faster than O(n)), while PostgreSQL did require a full table scan every time.

        1. 2

          Clarification: We’re using MariaDB with InnoDB storage engine. I’ll update the post.

          So I’m kinda sticking to the definition of big-oh here. Index only scans are faster by only a constant factor, even though it’s a huge factor… Correct?

          I tried to clarify it in next paragraph but yeah that for-loop explanation is clearly misleading. That only happens for full table scans.

          You’re right about filterable list views. We allow arbitrary filters and even adding new custom columns on the fly, so it’s quite hard for us to ensure every column relevant for filtering will be indexed. That’s another much harder problem that I’ve been trying to automate without much success.

          1. 1

            Imagine if the database can do O(1) on select count(*) from table, yeah, i mean myisam.

            1. 1

              So I’m kinda sticking to the definition of big-oh here. Index only scans are faster by only a constant factor, even though it’s a huge factor… Correct?

              Technically, as colonelpanic alludes to, not exactly. Since indexes are sorted, getting the result set is O(log n) (or a more complex, but still well-sub-O(n), performance if you’re using multiple indexes in the query). Checking visibility would be O(m), where m is the size of the result set.

              even adding new custom columns on the fly …. trying to automate without much success

              Yeah, that’s a tricky problem. I’m unfamiliar with your software setup, and likewise unfamiliar with MariaDB’s index offerings, but I’d be happy to brainstorm with you if you’d like.

            2. 1

              You’re correct that where would prevent an index scan most likely, but in that case wouldn’t it be counting the result set? Still not a full table scan.

              1. 2

                True, I misspoke. There are three scenarios: covering index, where you get the speedup, no relevant index, which will definitely be a full table scan, and the in-between, where you may have enough index coverage to find your limiting set, but no individual covering index. That last scenario may or may not result in a full table scan, but definitely will require the full visibility scan, which makes it much slower than the first, fast case.
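
                A minimal sketch of those three scenarios, using Python's built-in sqlite3 module as a stand-in. SQLite is an assumption here, not the database from the article (that was MariaDB with InnoDB), and it has no visibility map, so this only shows how the planner's choice changes with index coverage. The items table, its columns, and idx_status are hypothetical names.

```python
import sqlite3


def count_plan(conn, where_sql):
    """Return SQLite's query-plan text for a filtered count(*)."""
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT count(*) FROM items WHERE {where_sql}"
    ).fetchall()
    # Each plan row is (id, parent, notused, detail); keep the detail text.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, status TEXT, owner TEXT, note TEXT)"
)
# Only one column is indexed, as in a list view where users filter on anything.
conn.execute("CREATE INDEX idx_status ON items (status)")

# 1. Covering index: the count can be answered from the index alone.
covered = count_plan(conn, "status = 'open'")
# 2. In-between: the index narrows the set, but the table rows are still read.
partial = count_plan(conn, "status = 'open' AND owner = 'alice'")
# 3. No relevant index: the planner falls back to scanning the table.
unindexed = count_plan(conn, "note = 'x'")

for label, plan in [("covered", covered), ("partial", partial), ("unindexed", unindexed)]:
    print(f"{label}: {plan}")
```

                The exact plan wording varies between SQLite versions, so treat the printed text as illustrative rather than as a stable format.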

          2. 30

            this is silly…CI isn’t having it run on the cloud via some rented server.

            CI is making sure that env issues / local dev issues / whatever don’t break the build. It’s the idea of dual-entry accounting, i.e. having a second path to check your work.

            “signing off” work locally defeats that entire purpose

            1. 3

              If the signing runs the exact same thing as the CI (as a pre-requisite) I don’t see the issue here?

              Is it the element of trust that people are scared of?

              1. 3

                I don’t think it does here though, does it? The signing is just a unilateral declaration that everything has passed — it doesn’t require that the tests have been run at all, nor that they were run in an environment anything like that of CI, or indeed production.

                This seems an important distinction to me because I know that I am very fallible, and often forget things, ranging from big things like just forgetting to run certain tests, through to much more subtle things like having an environment that diverges from CI/production in such a way that tests will succeed on my machine but not in production.

                It’s not about trust so much as consistency. I trust myself and my coworkers to write good code and to do things correctly. I don’t necessarily think we’re so consistent, though. But what’s incredibly good at being consistent? A script that runs in exactly the same environment every time, runs exactly the same operations every time, and never changes… which sounds a lot like CI.

                That said, I do like the idea of utilising dev machines for CI workers more. I just think there’s probably a better implementation of that than this.

                1. 2

                  The signing is just a unilateral declaration that everything has passed — it doesn’t require that the tests have been run at all

                  I imagine the intention is to enforce this by convention.

                  But what’s incredibly good at being consistent? A script that runs in exactly the same environment every time, runs exactly the same operations every time, and never changes.

                  And at the end of said script you’d include a step to execute the signing.
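
                  A rough sketch of that idea, assuming the sign-off is just a callable that runs after every check exits with status 0. run_checks_then_sign and the sign callback are hypothetical names, not part of the tool under discussion, and the check commands are placeholders for a real test suite.

```python
import subprocess
import sys


def run_checks_then_sign(checks, sign):
    """Run each check command in order; call sign() only if all of them exit 0."""
    for cmd in checks:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Stop at the first failure so a red run can never be signed.
            return False
    sign()
    return True


# Placeholder commands standing in for real test and lint steps.
passing = [[sys.executable, "-c", "raise SystemExit(0)"]]
failing = [[sys.executable, "-c", "raise SystemExit(1)"]]

signatures = []
signed_ok = run_checks_then_sign(passing, lambda: signatures.append("ok"))
signed_bad = run_checks_then_sign(passing + failing, lambda: signatures.append("bad"))
print(signed_ok, signed_bad, signatures)
```

                  This ties the signature to the script's exit status, but it still says nothing about whether the local environment matches CI or production, which is the parent comment's concern.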

              2. 2

                Who said their build is breaking though?

                1. 1

                  You can run something like this: https://github.com/nektos/act

                  Assuming you still test it in isolation before the final deployment, local testing can speed up small changes.

                2. 59

                  I don’t pretend that posts are evergreen by hiding their dates.

                  Dan Luu, I’m looking at you.

                  1. 41

                    Someone, please donate him some CSS for max-width.

                    1. 16

                      This made me realise that with almost every single thing I read, or even watch on TV, I want to know when it was published and I immediately check for it.

                      1. 10

                        That line made me add dates to my blog because honestly I forgot about it and felt personally attacked.

                          1. 2

                            Which is completely useless for the 99% case where you reach an article from the web or an aggregator and have no idea that the index page has anything.

                            1. 1

                              Maybe I just exposed a personal quirk, then. When I visit a site from a link, if it’s of any interest whatsoever to me, practically the first thing I do is look for the index page.

                          2. 3

                            I’ve seen people argue that they have the date in the URL.

                            1. 2

                              It’s not ideal but it’s fair, the date is here even if it’s a lot more annoying to find than necessary.

                              Below URL is putting that in a <meta>.

                              But afaik dan luu puts it nowhere, to get it you have to know that it’s on the index, go back there, and look for the title of the article.

                              1. 6

                                yep, and the “reverse” thing is just a checkbox to check that updates in realtime - I’d never have thought it would be worth a whole article about this feature as it’s just readily available to use in Hotspot (and Heaptrack, which does the same but for memory allocations)

                                1. 1

                                  I’ve used flamegraphs but never heard of hotspot, so…

                                  1. 1

                                    I tried Hotspot. It segfaulted on the first use, so I didn’t investigate it further. But it looks enticing, I’ll give it a shot again.

                                    1. 3

                                      strange, I’ve used it for.. like 7ish years at this point and don’t really recall it segfaulting

                                      1. 1

                                        I also recently tried Hotspot and when I installed it from AUR on Arch Linux, it wouldn’t read perf.data files — it just kept on consuming memory until exhausting my machine.

                                        When I tried it on an Ubuntu 24.04 machine, hotspot worked fine with the same perf.data file.

                                        So double-check whether you might have a broken version / broken installation :)

                                        BTW: I ended up not using hotspot, as for the purpose of annotating source code files with counter values, it did not seem any better than perf report. Would be curious what others are using hotspot for primarily…?

                                        1. 2

                                          For flame graphs.

                                          I also prefer perf report for annotating source.

                                          I would like some tool to detect register / stack thrashing, though.

                                  2. 4

                                    Would Mypy have been of any use with some type stubs? E.g., that qb was meant to be of Type[SomeQueryBuilderInterface] which shouldn’t have an engine field?

                                    1. 5

                                      I was thinking the same: Any statically typed language would disallow modifying types at runtime. By definition. But does Mypy care? Let’s try:

                                      class QueryBuilder:
                                          pass
                                      
                                      class QueryBuilderMysqlEngine:
                                          pass
                                      
                                      def main():
                                          qb = QueryBuilder()
                                          qb.engine = QueryBuilderMysqlEngine()
                                      
                                      • [ ] mypy: Success: no issues found in 1 source file
                                      • [ ] ruff:
                                      • [x] pylint | grep -Ev 'docstring|snake_case|Too few public methods': Attribute ‘engine’ defined outside __init__ (attribute-defined-outside-init)

                                      Thankfully, pylint cares, but it also cares about things I find rather normal.

                                      My takeaway from the article: As someone not used to thinking in dynamic types and concurrency in the same world: Oh boy, this is also a thing to check for.

                                      1. 5

                                        Mypy does have an error code for this: attr-defined

                                        easiest way to get it to show up is passing --strict to mypy

                                        1. 3

                                          This is because mypy doesn’t check functions that don’t have type annotations on them by default. You can change that via various flags, which I think should be included in the --strict mentioned in a sibling comment.

                                          1. 2

                                            That’s just not true, the “by definition” – even Java and C++ allow modifying types at run-time.

                                            1. 2
                                              1. 2

                                                Classes in both Java and C++ support run-time mutable static fields

                                                1. 2

                                                  That only allows you to modify a value at runtime, not modify a type. You can’t add a previously non existing static field to a type at runtime.

                                                  1. 1

                                                    Oh, I understand where you are drawing the line now. You mean something like adding new behavior

                                                2. 1

                                                  I was about to ask the same. Maybe some would count casting and polymorphism, but I wouldn’t.

                                                  That’s a question of definition, but my answer is that you can’t change types at runtime, because types don’t exist at runtime. Wikipedia is foggy as usual on what static typing precisely is, but I think the meaningful distinction is that the type of each variable is static. That’s what it must mean if C++ is statically typed. In other words, it can’t be relevant that you can violate typing as much as you want at runtime by pointing at or interposing objects of different types, and that the language even helps you do that in safe forms (OOP).

                                              2. 1

                                                qb = QueryBuilder()

                                                I don’t think this is an accurate representation of the original code. This is constructing a new object of type QueryBuilder. If a new object was being constructed each time then there would not have been an issue. But the code was doing the equivalent of qb = QueryBuilder, not qb = QueryBuilder(). So instead of creating a new object and then mutating that new object, we’re instead mutating the QueryBuilder class itself.

                                                If I use this code:

                                                def main() -> None:
                                                    qb = QueryBuilder
                                                    qb.engine = QueryBuilderMysqlEngine()
                                                

                                                Then mypy complains with:

                                                error: "type[QueryBuilder]" has no attribute "engine"  [attr-defined]
                                                

                                                which is what we want to happen.

                                                You need either the -> None annotation or the --strict command line option to get mypy to kick in here. Otherwise it will treat the entire main function as something that’s not yet meant to have type checking applied.
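
                                                A small runtime sketch of the difference, reusing the class names from the snippet above. QueryBuilderPostgresEngine, per_instance, and on_class are hypothetical additions for illustration, and this is only my reading of the original bug, not the article's actual code.

```python
class QueryBuilder:
    pass


class QueryBuilderMysqlEngine:
    pass


class QueryBuilderPostgresEngine:  # hypothetical second engine
    pass


def per_instance():
    qb = QueryBuilder()  # constructs a fresh object
    qb.engine = QueryBuilderMysqlEngine()  # attribute lives on that object only
    return qb


def on_class(engine):
    qb = QueryBuilder  # no parentheses: qb is the class itself
    qb.engine = engine  # attribute lives on the class, shared by every user
    return qb


instance = per_instance()
# Mutating an instance leaves the class untouched.
class_untouched = not hasattr(QueryBuilder, "engine")

first_caller = on_class(QueryBuilderMysqlEngine())
second_caller = on_class(QueryBuilderPostgresEngine())
# The first caller now sees the second caller's engine, because both names
# refer to the same class object.
clobbered = isinstance(first_caller.engine, QueryBuilderPostgresEngine)
print(class_untouched, clobbered, first_caller is second_caller)
```

                                                With concurrent callers, the second assignment can land between the first caller's assignment and its use of qb.engine, which is the kind of shared-state race described here.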

                                              3. 3

                                                Probably, but whoever wrote it probably wrote it this way intentionally. I don’t know why, but I am guessing it was because qb.engine was a better developer-facing API than qb_engine separately attached to the local namespace. Alas, better design was to not have this “global state” at all, which we eventually removed.

                                              4. 9

                                                Nice article, if grandiosely titled. For its length, I think it could do more than just outline the definition of time.sleep, though.

                                                The only takeaway right now is that CPython’s time.sleep makes a UNIX syscall.

                                                1. 1

                                                  I didn’t read the time.sleep source before writing this. But it looks like it’s just a wrapper around the syscall on Linux, with some more code for handling the lack of this syscall and for other platforms like Windows.

                                                  Here is the relevant code: https://github.com/python/cpython/blob/08e65430aafa1047029e6f132a5f748c415bda14/Modules/timemodule.c#L2206