1. 29

More discussion about Doug McIlroy’s critique of Donald Knuth’s word-count program than in the recent Haskell post on the same topic.

  1.  

  2. 4

    McIlroy’s critique seems intellectually dishonest. One would not accept his solution in a take-home coding test – it’s like when we ask someone to implement line splitting and they use .split().

    Of course what he does is shorter, but not all programs can be that short merely because this toy example can be. Such a program would be a poor example of literate programming, but it is also a poor example of how to handle complexity in general. When you actually write complicated things in shell you quickly find that these components don’t always reuse cleanly, and the need to author, explain and organize your own components rapidly outstrips the facilities the shell provides.
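
    For reference, McIlroy’s six-stage pipeline, as it is usually reproduced (assuming it lives in a small shell script whose first argument is the number of words to report), is roughly:

    tr -cs A-Za-z '\n' |
    tr A-Z a-z |
    sort |
    uniq -c |
    sort -rn |
    sed ${1}q

    Every stage is a pre-existing tool; the only thing authored here is the plumbing.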

    1. 5

      “Intellectually dishonest”? I’d give points for using .split unless the assignment spelled out what could be used.

      1. 7

        That is another way of clarifying the distinction. Knuth didn’t set out to write the shortest or most production-worthy program, but rather to demo Web on a simple example program. McIlroy’s answer is to a different question, and his sleight of hand is in equating the two.

      2. 3

        not all programs can be that short

        Why not?

        I’ve seen small databases, small web servers, small programming languages…

        1. 3

          Just because one iteration of an idea can be small, doesn’t mean all useful iterations will be.

          1. 1

            Of course not, but that’s stupid. Who cares whether “all useful iterations” will be small: one could iterate on an idea and produce a massive steaming pile of dogshit simply because they’re a shit programmer.

            What I really care about is whether all business problems can be solved with small programs, and given that the smallest database is also the fastest and most featureful, I’m inclined to believe they can.

            1. 3

              I obviously meant that not all business problems can be solved with small programs.

              What database is the smallest, fastest, and most full featured?

              1. 2

                I obviously meant that not all business problems can be solved with small programs.

                Right. I understand this is a prevailing thought, but I’m not convinced.

                What database is the smallest, fastest, and most full featured?

                kdb

                1. 1

                  Sigh. I thought you were going to say kdb, but I asked anyway in case you had a novel or interesting answer. It’s incredibly niche; anyone familiar with its feature set would plainly know that it’s not general purpose.

                  1. 0

                    I’m using Kdb as a (CRM) database.

                    I’m also using it for time series (yes), and as an application server for a real-time bidding system.

                    I’ve got an unstructured data ingestion system on Kdb.

                    I’ve even got a full text search running on Kdb.

                    I know people doing GIS with Kdb.

                    Not sure what your definition of “general purpose” is, but it certainly meets mine.

                    1. 3

                      For the most part discussions I’ve had about kdb have been overly religious for my taste, and I’m not about to get in another holy war. I’m glad kdb works for you, but implementing features like FTS and GIS on top of kdb yourself doesn’t mean kdb has those features.

                      1. 1

                        I don’t know about that. I think that the ability to implement them in KDB does – I’m just writing queries here, and that’s important.

                        You can call this a potato if you want, but I’d say postgresql can do GIS queries as well even though someone had to write them in C and link them in as an externally shipped tool.

                        Today I wanted to index a table by a cuckoo hash. I can’t imagine the mess of SQL needed to do such a thing, and it’s about five lines long in kdb. Doing that in postgresql would be very invasive.

                        1. 2

                          That’s neat. Would you care to share those 5 lines?

                          1. 2

                            Sure. Unoptimised version follows:

                            / pos: candidate bucket positions for byte n, taken from its low/high nibbles, one set in each half (size k) of table t
                            pos:{[t;n] k:count[t] div 2;raze (0,k)+\:((n mod 16),(n div 16)) mod k}
                            / hash: reduce text to a single non-zero byte via md5
                            hash:{last (md5 "c"$ -18!x) except 0x00}
                            / add: hash the text and insert it; if table is passed by name (symbol), update it in place
                            add:{ [table;text] n:hash text; if[-11h=type table; :table set ins[get table;n]; ]; :ins[table;n]; };
                            / ins: cuckoo insert - use an empty candidate slot if one exists, otherwise evict a random occupant to one of its alternate slots
                            ins:{ [t;n] i:pos[t;n]; j:i rand where t[i]=0x00; if[j<>0N; :@[t;j;:;n]; ]; j:i@rand where t[i]<>n; if[j<>0N; m:t j; p:pos[t;m]; q:p@rand where t[p]<>m; if[q<>0N; :@[t;j,q;:;n,m]; ]; :@[t;j;:;n]; ]; :t; };
                            / check: does any of n's candidate slots hold n?
                            check:{ [table;text] n:hash text; t:$[-11h=type table;get table;table]; :any t[pos[t;n]]=n; };
                            

                            … but it’s still fast enough to check things out.

                            1. 2

                              I think I’ve got you beat on brevity:

                              create table "table" ( "text" varchar);
                              create index hash_index on "table" using hash("text");
                              

                              It may not be a cuckoo hash, but I’m not convinced that matters, except possibly in a highly specialized, niche application. I doubt you wrote this for your CRM, for example.

                              1. 2

                                I think I’ve got you beat on brevity:

                                Oh if I just want a regular hash/index I can use the g or s properties (sorted is fine). Indeed that’s what I benchmark this problem with. The exact syntax would be:

                                table:([] id:`s#`sym$(); date:`s#`date$(); acct:`acct$())
                                

                                I doubt you wrote this for your CRM, for example.

                                A CRM is a component of this application.

                                So there’s an attribution component where I’ve got ~2m accounts that I want to connect to some set of ~1m ids. You might imagine it’s:

                                id date -> account[]
                                

                                and indeed the trivial

                                create table attribution (id varchar, date date, acct varchar);
                                

                                is fine, but it’s a chunky index – around 10-16GB per day – that I’d have to build. For this use case I don’t need exact answers: it’s okay to select a few acct that don’t have the id. Given the number of processes that need this data built, something that uses around 20MB might be worth experimenting with.

                                1. 2

                                  I don’t think I have adequate information about the problem to discuss it in more depth. Only this: I could store a year of that index (6T) on the cloud for less than 2 engineer-weeks of USD. So if your optimization took longer than 2 cumulative weeks per year to design, implement, test, and be used effectively by other engineers, then it’d be a waste. Moreover, I suspect there is a trivial strategy to represent that index in a smaller way. You have string identifiers for a few million rows; you could easily use an int32 id and save space. Or maybe you can’t. I don’t know enough about your application. I also don’t know what you mean by the number of processes that need the data built. If a distinct 16GB index is built for each running process, I can see how that would explode the size. But I don’t see anything else in what you said to indicate whether that’s the case. Again, not enough info.

                                  I don’t really think it’s productive to continue this discussion. You like kdb and I’m happy for you. But not many people can just implement, e.g., a reasonable GIS system for their application. And though I could, I don’t want to. I’ll use PostGIS or something else off the shelf unless I have a compelling reason to go out of my way to make a custom solution. Because all that bloat you talk about, that’s functionality I don’t even know I need yet, but will need a year down the line when the scope of my application expands. It’s subtle logic handling edge cases that might have otherwise wasted a lot of my time.

                                  That’s why I say kdb is niche, it’s for people who actually will derive financial value out of doing stuff like that themselves. That’s tremendously uncommon.

                                  1. 1

                                    That’s why I say kdb is niche, it’s for people who actually will derive financial value out of doing stuff like that themselves. That’s tremendously uncommon.

                                    Okay, but I’m not arguing it’s not niche. I said that “all business problems can be solved with small programs” and you said they can’t.

                                    That people are okay with big programs is a (possibly) unrelated issue.

                                    1. 2

                                      Solving a problem with limited time and developer resources is a business problem.

                                      1. -1

                                        I don’t agree “limited time and developer resources” is a business problem that could be solved better with a big program than a small program.

                                        That actually sounds absolutely absurd to me, so I assume you must mean something else, but I can’t imagine what it might be.

                                        1. 3

                                          I do mean that. Using software that supports doing what you need saves time and developer resources. Suppose a team needs a few different unrelated features. In the database world, there are many large general purpose databases that support a ton of features, including the features the hypothetical team needs. But they aren’t terribly likely to find a database that supports ONLY those few unrelated features. And in the interest of saving time and developer resources, they could reasonably choose that large general purpose database over implementing those features themselves on top of a small but highly customizable database.

                                          And on a meta level, building a large general purpose database program solves a business problem: build a database that a ton of different teams can use, so you can sell it a ton of times. The fact that you technically can implement your own GIS on kdb isn’t all that compelling if I’m looking to buy a GIS database. I can implement my own GIS on a lot of things, the point is I don’t want to.

                                          I said that “all business problems can be solved with small programs” and you said they can’t.

                                          Perhaps “can’t in practice with realistic constraints” is more accurate. Can all business problems be solved with small programs given unlimited resources and top developers? Who fucking cares? That’s not how real life works. No one chooses Walgreens vs CVS based on how many lines of code those companies execute to conduct business. If refining, specializing, and minimizing their code size made them more money somehow, then maybe their business problems would be better solved with small programs. But they probably get more value per dollar out of mixing and matching large generic programs. More value per dollar is better in a business context.

                                          There is a point where specializing becomes more effective than mixing and matching large general purpose code, but that point isn’t “always every time for any business problem.”

                                          1. 1

                                            This is all over the place and I’m not sure how to respond. I’m not even really sure what you’re saying.

                                            Why exactly do you think that a problem like GIS requires a large program, when we can clearly see a solution with a small one?

                                            there are many large general purpose databases that support a ton of features

                                            There are also small general purpose databases that support a ton of features.

                                            Not sure what your point is.

                                            Can all business problems be solved with small programs … Who fucking cares?

                                            I do. There is significant value in small programs: they have fewer bugs, they are easier to read and write, and they run faster. I find programs that are correct and fast to be more valuable than programs that aren’t, and I can’t imagine a business that thinks otherwise lasting very long.

                                            There are other things in your post where I don’t really understand your point. It’s not clear whether you disagree with me or where you disagree. It almost seems like you’re angry about something – maybe this religious point you mentioned earlier – that has nothing to do with me.

                                            1. 2

                                              Why exactly do you think that a problem like GIS requires a large program, when we can clearly see a solution with a small one?

                                              That’s not what I’m saying. I’m saying a program that supports GIS, and a bunch of other unrelated features so as to be general purpose, will be large. One such program is PostgreSQL.

                                              It’s not clear if you disagree with me or where you disagree.

                                              I agree small programs are good. I disagree that every problem can be solved with small programs.

                                              It almost seems like you’re angry about something

                                              No, just frustrated that this discussion is going exactly the way I expected, and that I should have known better and not gotten involved.

                                              that has nothing to do with me.

                                              It has everything to do with you. I feel I have dangled my point in front of your face, and you are perfectly capable of understanding but have refused to do so.

                                              So here it is laid out:

                                              Thesis: not all problems can be solved with small programs.

                                              Example: I do not believe the problem solved by PostgreSQL could be solved by a small program.

                                              Problem solved by PostgreSQL: saving time and developer resources by providing a general purpose, many featured solution usable immediately by a wide variety of teams. Contrast with kdb, which requires implementing desired features on top of it.

                                              Make sense?

                                              1. 1

                                                I’m saying a program that supports GIS, and a bunch of other unrelated features so as to be general purpose, will be large. One such program is PostgreSQL.

                                                PostgreSQL ships GIS as an add-on.

                                                Same as with kdb.

                                                Thesis: not all problems can be solved with small programs.

                                                “All problems” isn’t important.

                                                You can always invent a problem that cannot be solved by a small program, such as “needs to be a big program.”

                                                “All business problems” is a little better, and while still open to a certain amount of shenanigans, if you’re not intellectually dishonest you’ll get something out of the argument.

                                                Shit like this:

                                                Example … Problem solved by PostgreSQL: saving time and developer resources by providing a general purpose, many featured solution usable immediately by a wide variety of teams. Contrast with kdb, which requires implementing desired features on top of it.

                                                is counterproductive. “General purpose” is met:

                                                • having a range of potential uses or functions; not specialized in design.

                                                however:

                                                • many featured solution usable immediately by a wide variety of teams

                                                is weasel words. Define exactly what you mean by this. How many varieties of team is “wide enough”? How many features is “many-featured enough”? I’m certain that whatever number you choose, we can simply implement that many with kdb and close this point off.

                                                finally:

                                                • Contrast with kdb, which requires implementing desired features on top of it.

                                                … like GIS using PostGIS.

                                                1. 1

                                                  PostgreSQL ships GIS as an add-on.

                                                  Same as with kdb.

                                                  I was not aware; you made it sound like your friends implemented GIS on kdb. If kdb has a fully featured GIS plugin that ships with the distribution, then I stand corrected – for this feature. To define “fully featured” for you, let’s go with this, in particular sections 8.8. Operators, 8.9. Spatial Relationships and Measurements, and 8.11. Geometry Processing.

                                                  many featured solution usable immediately by a wide variety of teams

                                                  Define exactly what you mean by this.

                                                  It defines itself.

                                                  • has many features
                                                  • usable immediately by a wide variety of teams

                                                  And I was implying that it’s usable immediately by a wide variety of teams because it has many features, GIS being one example. For another example, generalized inverted indexes on hierarchical document values like JSON. Although perhaps there is also a kdb plugin that ships with the distribution and provides generalized inverted indexes?

                                                  I’m certain whatever number you choose we can simply implement that many with kdb and close this point off.

                                                  But they aren’t there already, which is the entire point.

                                                  Contrast with kdb, which requires implementing desired features on top of it.

                                                  … like GIS using PostGIS.

                                                  PostGIS ships with PostgreSQL in nearly every distribution channel. And the end user of the database certainly does not have to implement PostGIS, since it’s already written. Which is my entire point.

                                                  1. 2

                                                    I just got to this thread, and it’s immensely interesting to me in spite of occasionally falling off the tightrope into flames.

                                                    Excluding specifics of different tools, there are two aesthetics at war here between you and @geocar:

                                                    • Using conventional tools provided by others, because they have an incentive to serve as many users as possible, and so could potentially anticipate features you may need in the future. This is a really nice steel-man of the conventional approach that most people cargo-cult as “reuse”.

                                                    • Using small programs, minimalist tools and as few dependencies as possible, because every new dependency introduces new degrees of freedom where things can go wrong, where somebody else’s agenda may conflict with your own, where you pay for complexity that you don’t need.

                                                    If only we could magically unbundle the benefits of other people’s code from their limitations, have features magically appear when we need them, and be magically robust to security holes in features we don’t use.

                                                    The synthesis I’ve come up with to these two poles is to use libraries by copying, and then gradually rip out what I don’t need. This obviously makes things more minimal, so moves me closer to @geocar (whose biases I share). But it also moves me closer to your side, because when I need a new feature from upstream next year I know enough about the internals of a library to actually bring it back into the fold.

                                                    It’s hard to imagine a better synthesis than this. The only way to get the benefits without the limitations is to get on a path to understanding your dependencies more deeply.

                                                    Edit: http://arclanguage.org/item?id=20221 provides deeplinks inside the evolution of a project of mine, where you can see a library being ingested and assimilated, periodically exchanging DNA with “upstream”.

                                                    1. 1

                                                      I agree that minimal software is generally better too. I just don’t think it’s practical or valuable to make all software minimal. Using an HTTP wrapper library for literally one request in an app? Kill that dependency. But wait, the app isn’t consumer facing, just needs to get done in as little time as possible, and probably won’t be substantially extended? Screw it, who cares? Adding a dependency in a situation that matters so little is totally worth it if the wrapper library saves the developer 20 minutes of learning a more low-level API.

                                                      1. 1

                                                        But you haven’t addressed my comment at all. Copying the HTTP wrapper library is a reasonable option, right? At worst, it adds minimal overhead for upgrading and so on. At best, it reduces your exposure to a fracas like the one that befell left-pad.

                                                        1. 2

                                                          If the app matters then copying the HTTP wrapper, or any other library, could be valuable. If the app doesn’t matter, it’s still a waste of time. It’s all about tradeoffs.

                                                          Something like an HTTP wrapper, I might just drop it entirely. A lot of those libraries are just reinterpretations of how the author feels APIs should look. Something like ncurses though? I’m not touching it, no way. Or postgres? Forking a database is a huge commitment. But a json parser with a few hokey features I’ll never need, that slow down the parser? I’ve forked that. A password hashing library that bizarrely had waaaay more functions than hash, and check_hash? Forked.

                                                          For C++ it’s especially valuable to fork and strip, because monster headers increase compile times. In big projects, adding a header that increases compile time by 200ms per translation unit can add minutes to the build (200ms across 1,500 translation units is five minutes). Yikes.

                                                          So yeah, I agree with you that forking and stripping is a good strategy. It doesn’t apply to everything, but in situations where it’s the best choice, I find it’s usually the best choice by a long shot.

                                                          1. 2

                                                            It sounds like you’re already practicing what I struggled to figure out. That’s great! I’ll suggest that your narrative of “big programs” is too blunt, and doesn’t adequately emphasize the challenges of dealing with their fallout.

                                                            Forking a database is a huge commitment.

                                                            All you’re doing is copying it. How is that a commitment?

                                                            There’s a certain amount of learned helplessness that rears its head whenever the word “fork” comes up. Let’s just say “copy” to get past that. That’ll help us realize that there’s no dependency we can’t copy into our project, just to allow for future opportunities to rip out code. Start with the damn OS! OpenBSD has a userland you can keep on your system and recompile with a single command. Why can’t everyone do this?

                                                            1. 3

                                                              It’s not learned helplessness, it’s that maintaining database software is actually hard. If you copy it but don’t change it, you’re pretty much just taking the peripheral burden upon yourself. Now if you want to deploy it you’re on the hook for builds, packaging, package testing, patches for your distro and so on. All this stuff normally done by actual domain experts. Not only is it a huge waste of time, it’s something you’re really likely to screw up at least once.

                                                              I work on database engines and I don’t even host my own databases when I can afford it. Setting up replication, failover, backups, etc., that’s a ton of work, especially since you have to test all of it thoroughly and regularly. If it were for a business application, I’d happily pay for Heroku Postgres all the way up to premium-8 tier ($8500 / month). At $102,000 / year, that’s still lower than the salary I’d pay for an engineer I’d actually trust to manage a HA postgres setup.

        2. 2

          On the other hand, with just a moment’s experimentation at the command line, McIlroy’s version will quickly show problems with the definition of “word”: you end up with “isn”, “wouldn” and “t” as “words”, among other problems. McIlroy can then spend time replacing the first line with a more specialized program to break words out of a text stream. Knuth can do the same, but how much time has been spent writing the rest of the code to deal with counting and sorting words?
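
          For example, feeding a pair of contractions through the first stage of McIlroy’s pipeline (assuming GNU tr) splits them at the apostrophe:

          $ printf "isn't wouldn't\n" | tr -cs A-Za-z '\n'
          isn
          t
          wouldn
          t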

          1. 4

            I’ve way more often been in the position of replacing huge shell/Python/&c agglomerations with a single well-defined and modular program than the opposite. Perhaps there is ultimately good reason that people build large systems in languages with module systems and interfaces, instead of in shell.

            Most languages have libraries for the kind of stuff you’re talking about — modules don’t have to be literally separate programs.

            1. 1

              Also busted: words with accents, like café, Montréal, née, Québec, and résumé. He even used the word “Fabergé” in his review, which would become “faberg” in the output!
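
              A quick check (assuming a UTF-8 locale and the usual byte-oriented tr) shows the truncation:

              $ printf 'Fabergé\n' | tr -cs A-Za-z '\n' | tr A-Z a-z
              faberg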

            2. 2

              Why would you not accept his solution? He doesn’t use a ready-made frequency algorithm, but shows his knowledge of the problem and the tools at hand to implement exactly the algorithm required. Exactly what I want a candidate to do.

              1. 2
                1. 1

                  If we imagine a second student, who is Knuth, who gives the expected answer, then McIlroy is like the clever student, calling the other student’s answer unimaginative or dull — but to be especially imaginative was never the purpose to begin with.

                  1. 2

                    Indeed, you are right about that.