1. 1

    It makes me think of my Lotus Notes days… There used to be a website reporting all the cryptic errors returned by Notes.

    1. 2

      To write good docs, you need to walk a fine line between conciseness, completeness and clarity. Often, only one of these factors prevails, at the expense of the others.

      • Docs that are too long (or include too many details) = bad for conciseness and clarity.
      • Docs that are too short (or don’t include enough details) = bad for completeness and clarity.
      • Docs that are written poorly = bad for clarity.
      1. 3

        Overall, I find that the premise of the post is true.

        Programs can be unreadable for several reasons (lousy formatting, bad or no naming conventions, etc.). This makes introducing bugs very easy.

        Over-engineered code is also a problem. The code might seem well designed at first sight, but you quickly realize that it’s a huge mess and that it’s super hard to understand. Again, introducing a bug in this case is easy.

        1. 1

          Great article. This is something I think way too few people know. I used to be in that business myself, though it was very local, so only national law applied (I don’t live in the US). It was really insightful.

          Funny side note: robots.txt was not binding, but calls or emails from people telling you to stop were. That’s how weird laws can be. Again, not the US. ;)

          Of course one should still respect it to not get a call. And another hint to stay out of trouble: Sell the fact that you are crawling them! Backlinks, free visitors, etc.
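Respecting robots.txt is cheap to automate. Here is a minimal sketch using Python's standard `urllib.robotparser`; the bot name, the rules and the URLs are made up for illustration, and a real crawler would download the file from the target site rather than hard-code it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hard-coded so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "examplebot" is an invented user agent name.
allowed_public = parser.can_fetch("examplebot", "https://example.com/public/index.html")
allowed_private = parser.can_fetch("examplebot", "https://example.com/private/data.html")
```

This only answers whether the site asks you to stay away from a path. As said above, it says nothing about whether the file is legally binding where you live.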

          And one more thing. Crawling, scraping, filling out forms, sending POST vs. GET requests, etc. can all be viewed differently by the law, and a lot of common terms can mean different things. So ignore their technical meanings when dealing with the law. Make sure to learn what those things mean to lawyers. The legal meanings can be odd and may be things that you never even considered. That’s, by the way, something one should do in general: question each and every technical term!

          Here is the only thing I am not so sure about:

          “It’s the same as what my browser already does! Scraping a site is not technically different from using a web browser. I could gather data manually, anyway!”

          False. Terms of Service (ToS) often contain clauses that prohibit crawling/scraping/harvesting and automated uses of their associated services. You’re legally bound by those terms; it doesn’t matter that you could get that data manually.

          AFAIK ToS only apply if you register/explicitly agree. Is that true?

          Else it would be really weird. One would essentially create a law, rather than terms. Or in other words you could make a link and there having a ToS saying you are not allowed to visit or that you will have to pay now.

          1. 1

            Hey, thanks for the great feedback! :)

            Sell the fact that you are crawling them! Backlinks, free visitors, etc.

            You’re absolutely right, and I should have mentioned it in my post!

            AFAIK ToS only apply if you register/explicitly agree. Is that true?

            In some cases, courts ruled that since the defendants were logically aware of the ToS (even though they hadn’t explicitly agreed to them), they were enforceable. Take a look at bullets #7 and #8 of the section “The typical counterarguments brought by people” in my post. Whether or not ToS are enforceable seems to depend on the context.

            Or in other words you could make a link and there having a ToS saying you are not allowed to visit or that you will have to pay now.

            True, and some people successfully did it. Take a look at Internet Archive v. Suzanne Shell.

          1. 1

            Great write up! I, for one, can’t wait (sense my sarcasm here) until we have cookie/evercookie/IP-based TOS CAPTCHAs to prove the TOS was agreed to before proceeding. And not one of these JavaScript-based things… No! That’d be too easy to ignore or work around. I’m talking about a kill-bots-style intrusion that happens way down low in the stack, that can’t be circumvented, and that must be completed to unlock the content. At the first sign of bot-like behavior at a given IP, it’s another TOS CAPTCHA.

            1. 1

              Thanks for your kind remarks.

              And yep, what you suggest would be super effective! hehe :)

            1. 4

              I’m curious if anyone is knowledgeable about how ad blockers and software like Brave browser fit in with the terms of use situation. Seems like if you are crawling and scraping for your own personal use, and not re-publishing, you might be able to craft your crawler/scraper to adhere as closely to TOU as ad blocking does.

              1. 3

                Brave is in a very precarious spot, I think, because they’re taking the content, remixing it, and showing it. That’s close to what Aereo was doing. Actually, it’s probably more infringing than Aereo. Maybe you can do it for yourself, personally, but it’s treacherous ground for a business model.

                1. 2

                  I’m curious if anyone is knowledgeable about how ad blockers and software like Brave browser fit in with the terms of use situation.

                  Personally, I don’t know. It’s a different topic.

                  But if you consider that there are still a lot of grey areas in law about scraping/crawling, there are probably also a lot of grey areas about Ad Blockers. I’ve just googled it and I found that some German publishers sued Adblock Plus in the past. Not sure what happened to the other ad blockers.

                  Seems like if you are crawling and scraping for your own personal use, and not re-publishing, you might be able to craft your crawler/scraper to adhere as closely to TOU as ad blocking does.

                  I don’t think so, because ToS/ToU often prohibit automated data collection.

                  1. 2

                    But if you consider that there are still a lot of grey areas in law about scraping/crawling, there are probably also a lot of grey areas about Ad Blockers. I’ve just googled it and I found that some German publishers sued Adblock Plus in the past. Not sure what happened to the other ad blockers.

                    AdBlock Plus has an “acceptable ads” product, which charges larger publishers a fee to be included on that list.


                    Springer sued AdBlock Plus and ad blocking itself was deemed legal, “acceptable ads” not.

                  2. 1

                    There have been plenty of attempts by publishers to sue adblockers with arguments along those lines. As far as I’m aware, they always lost.

                    1. 1

                      Apparently, Google and other big names attempted to sue Adblock Plus. But I don’t know how it turned out either.

                      It would be interesting to do a bit more research on this topic. What we’d find out would probably be super interesting :)

                  1. 1

                    Python might not be the fastest, I recognize. But at the time, I learned this language for very specific reasons:

                    • It has a very simple syntax. Everything is stripped down, especially when compared to Java or C# (e.g. no need for braces, interfaces, etc.).
                    • It’s elegant (e.g. comprehensions).
                    • It has a lot of libraries/packages, so you can do pretty much everything you want.
                    • It’s truly cross-platform.

                    So for me, the “speed” factor was not important at all in my decision.

                    And I tend to agree with the point of the author; most projects are not performance critical, so arguing that Python is slow is completely irrelevant in those situations.

                    1. 0

                      The author says that one way to convince people to try Erlang is by hiding its weaknesses (e.g. its syntax) and focusing on its strengths (e.g. its performance). He gives MongoDB as an example, essentially saying that the MongoDB folks originally went to meetups and showed benchmarks to people, instead of focusing on its high instability.

                      I’d say that I have mixed feelings about this. The problem is that despite its weaknesses, MongoDB presented far more concrete advantages to people, right from the start. Those advantages were obvious. But the same may not apply to Erlang.

                      1. 1

                        I’m not an astrophysicist, but I would imagine any civilization capable of generating the amount of energy needed for the observed power of FRBs would also have come up with something better than solar sails….unless there really isn’t anything better than solar sails, which, while they’re awesome, would make me a little sad.

                        1. 2

                          Alternatively, they prefer to keep control of their probes centralised rather than allowing them to be independent; that is, it could be a political rather than technical design choice.

                          1. 2

                            I agree. I’d also expect advanced alien civilizations to use… advanced technology.

                            But what if other alien civilizations aren’t much more advanced than us, or are just a little bit more advanced than us? Maybe, just like us, they’re trying things out, and they don’t really know what they’re doing.

                            1. 3

                              Maybe they’ve got much more advanced technology, but this is their equivalent of yacht racing.

                          1. 4

                            I’m not a fan of “open floor spaces” either. I agree that it’s a productivity killer, most of the time.

                            I remember reading an article many years ago about the private offices at Fog Creek, and in my mind, it made complete sense. But not everybody (read: not every company) sees it that way, unfortunately.

                            1. 2

                              I’m mainly a Python programmer these days. I know that there are a few irritating things about it, and this might be why some people decided to switch.

                              I’ve never worked seriously with Go. I’ve just created a few simple programs here and there. So I can’t really speak about it.

                              I’m curious; for those of you who code in Go daily, and who have already coded in Python for several years, what are the real advantages of Go that justify a switch?

                              1. 2

                                off the top of my head; type safety, real concurrency, some machine density / performance gains.

                                more than anything i just really like the choices the language designers have made, it leads to clearer programs that are easier to maintain.

                                it did not replace python as my “batteries included” scripting language. go is great for medium to large projects but struggles a bit for very quick tasks still.

                              1. 2

                                The article talks about Python crashing with an out-of-memory error while crawling a web page. The author presents various fixes to his/her Python code.

                                I think really, though, that those aren’t fixes. Those are workarounds to the fundamentally broken nature of memory allocation on Linux. The OOM killer idea is just…I mean, I know that I love having random processes killed at random intervals because some other process was told there was more memory available than there was.

                                (Okay, so yeah, the author of TFA shouldn’t have relied on an out-of-memory condition to signal when to stop crawling, but saying that wouldn’t have given me an opportunity to bitch about Linux’s allocator…)

                                1. 2

                                  Thanks for your feedback :)

                                  I think really, though, that those aren’t fixes.

                                  An out-of-memory error generally means that there’s either:

                                  1. Something broken in your code (or in something that your code depends on).
                                  2. A lack of resources on the system.

                                  In either case, as the programmer, you’re at fault. And fixes or workarounds will be needed.

                                  I’m not sure that I agree with you on the premise that memory allocation is broken on Linux. No matter which OS you use, your system resources are limited in some way. Aren’t they?

                                  But you know what? I’m fairly new to Linux, so I’m really open to different ideas and solutions. What do you think would be a better solution than OOM killer?

                                  Finally, I agree with you that crashes are not 100% fun to deal with.

                                  1. 1

                                    An out-of-memory error generally means that there’s either:

                                    Something broken in your code (or in something that your code depends on).
                                    A lack of resources on the system.

                                    In either case, as the programmer, you’re at fault. And fixes or workarounds will be needed.

                                    Also, as an additional reply: the way Linux does it, sometimes it’s not your fault. Linux’s default allocation strategy lies to you: it tells you that resources you reserved were in fact successfully reserved for you when they’re not.

                                    1. 1

                                      I’ve read your reply below and given that OOM killer can kill the wrong process sometimes, you’re right in a way.

                                      However, could we say that it’s the programmer’s fault for not planning enough resources on the system? I tend to think so.

                                    2. 1


                                      That sheds some light on how Linux caters to memory-greedy processes.

                                      @lorddimwit, is this what you called a hassle? It can certainly be, but it’s not impossibly difficult to enforce a hard limit.

                                      In practice there seem to be enough crappy apps out there that the overcommit system was developed for good reason, but YMMV as always. I have had to tune this for servers, but not so often and never for a desktop.

                                      1. 1

                                        I know, that’s why I parenthetically clarified that really I was bitching about the OOM killer, and not something else. :)

                                        Linux should, IMHO, simply fail at allocation time if memory is exhausted, returning NULL from malloc. Right now, what happens is that memory allocation essentially never fails and then, at some random point in the future if resources are actually exhausted, a random program is killed.

                                        (Okay, so it’s not random, it’s the one with the highest OOM score, but still.)

                                        The problem is that the process that’s killed is decoupled from the most recent allocation. This means that long-running processes with no bugs can just be killed at random times because some other program allocated too much memory. You can fiddle with OOM score weights and stuff, but at the end of the day, the consequences are the same: a random process is going to get killed on memory exhaustion, rather than just have the allocation fail.

                                        The most logical solution, to me, is simply return NULL on allocation failure and let the program deal with it in a way that makes sense (try again with a smaller allocation, report to the user that memory’s exhausted, whatever). Instead, it’s impossible to detect when a memory allocation from malloc isn’t really going to be available.

                                        It’s possible to disable the OOM killer (or at least it used to be), but it’s a hassle.

                                        1. 2

                                          Okay, I see what you mean.

                                          I don’t claim to know all the details of OOM killer. But by reading OOM killer’s source code, I understand that it will not randomly kill processes.

                                          Instead, it seems to calculate an “OOM badness score” mainly based on the total amount of memory used by processes. So any process that has the highest score (i.e. takes the most memory) might be killed first, but not necessarily. It depends on other factors as well.
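On Linux, that score is exposed per process under `/proc/<pid>/oom_score`, so you can look at who would probably go first. This is a rough sketch, assuming a Linux `/proc` layout. `read_oom_scores` and `likely_victim` are names I made up, and the kernel's real selection logic considers more than this single number.

```python
import os


def read_oom_scores(proc_root="/proc"):
    """Return a dict mapping PID -> oom_score for every readable process."""
    scores = {}
    if not os.path.isdir(proc_root):
        return scores  # not Linux, or /proc is not mounted
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only the numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "oom_score")) as fh:
                scores[int(entry)] = int(fh.read().strip())
        except (OSError, ValueError):
            continue  # the process exited or the file is not readable
    return scores


def likely_victim(scores):
    """Return the PID with the highest score, or None if there are none."""
    if not scores:
        return None
    return max(scores, key=scores.get)
```

Calling `likely_victim(read_oom_scores())` on a busy box gives a hint about which process is most exposed at that moment.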

                                          In my specific scenario, it killed the right process. But you may be right; there are probably other situations where the wrong process will be killed.

                                          Have you ever experienced it?

                                          1. 4

                                            Oh yeah, all the time. There was a period of time where the OOM killer was sarcastically called “the Postgres killer”. Because of the way PostgreSQL managed buffers, it would almost always have the highest OOM score. They fixed it by allocating buffers differently, but it sucks when your production DB is randomly killed out from underneath you when it’s doing nothing wrong.

                                            Again, you can adjust weights and such so that different processes would be more likely to be selected based on their weighted OOM score, but it’s an imperfect solution.

                                            1. 4

                                              Or the X killer. If you had 100 windows running, the X server was probably using the most memory. Kernel kills that, and suddenly lots of memory is free…

                                              1. 3

                                                There was a certain era of the ‘90s where people ran dual-purpose Unix server/workstations, where that might not even have been the wrong choice. If you’ve got an X session running on the same machine that runs the ecommerce database and website, better to take down the X session…

                                                1. 3

                                                  Says the guy who was never running his company’s vital app in an xterm. :)

                                          2. 1

                                            The reason Linux does that is because of an idiom of Unix programming—the program will allocate a huge block of memory but only use a portion of it. Because Unix in general has used paging systems for … oh … 30 years or so, a large block will take up address space but no actual RAM until it’s actually used. Try running this:

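                                            for (size_t i = 0 ; i < 1000000; i++) {
                                              void *p = malloc(1024 * 1024);  /* reserve only: address space, no RAM yet */
                                            }

                                            for (size_t i = 0 ; i < 1000000; i++) {
                                              void *p = malloc(1024 * 1024);
                                              if (p) memset(p, 0, 1024 * 1024);  /* assumed difference: touch the pages so RAM is really used */
                                            }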

                                            One will run rather quickly, and one will probably bring your system to a crawl (especially if it’s a 32-bit system).

                                            You can change the behavior of Linux (the term you want is “overcommit”) but you might be surprised at what fails when disabled.
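To see which policy a given machine is using, you can read the `vm.overcommit_memory` setting. This is a small sketch, assuming a Linux system that exposes it at `/proc/sys/vm/overcommit_memory`. The function names and the wording of the descriptions are mine.

```python
# The three documented values of vm.overcommit_memory.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (the default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def describe_overcommit(raw):
    """Turn the raw file content (e.g. '0\\n') into a readable label."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, "unknown mode %d" % mode)


def current_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Return the label for this machine, or None if it cannot be read."""
    try:
        with open(path) as fh:
            return describe_overcommit(fh.read())
    except (OSError, ValueError):
        return None
```

Switching to mode 2 is what makes allocations fail up front, which is also where the surprises mentioned above tend to show up.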

                                            1. 3

                                              I understand the reasoning behind it, and I still think it’s problematic. It’s a distinct issue from leaving pages unmapped until they’re used. I prefer determinism to convenience. :)

                                              Many Unices, both old (e.g. Solaris) and new (e.g. FreeBSD), keep a count of the number of pages of swap that would be needed if everyone suddenly called in their loans, and would return a failed allocation if the number of pages requested would cause the total outstanding page debt to exceed that number. That’s the way I’d prefer it, and it still works well with virtual memory and is performant and all that good stuff. Memory is still left unmapped until it’s touched, just as before. All that’s different is a counter is incremented.
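The bookkeeping described above can be modeled in a few lines. This is only a toy illustration of the idea, not how any kernel implements it. `CommitLedger` and its methods are invented names, and the page counts are arbitrary.

```python
class CommitLedger:
    """Toy model of strict commit accounting (not a real kernel API)."""

    def __init__(self, total_pages):
        self.total_pages = total_pages  # pages that RAM plus swap could back
        self.committed = 0              # pages promised to processes so far

    def reserve(self, pages):
        """Promise pages, refusing up front if they could not all be backed."""
        if pages < 0:
            raise ValueError("pages must be non-negative")
        if self.committed + pages > self.total_pages:
            return False  # the equivalent of malloc returning NULL
        self.committed += pages  # only the counter changes; nothing is mapped
        return True

    def release(self, pages):
        """Give back a previous reservation."""
        self.committed = max(0, self.committed - pages)


ledger = CommitLedger(total_pages=100)
first = ledger.reserve(60)   # fits
second = ledger.reserve(60)  # would exceed the total, so it is refused
third = ledger.reserve(40)   # still fits in what is left
```

The failure lands on the caller that asked for too much, at the moment it asked, which is the determinism argued for here.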

                                              The problem of course is that if every last page of memory is used, it wouldn’t be possible to start up a shell and fix things, in theory. Linux “solved” this by killing a random process. Some of the Unices solved it the right way, by keeping a small amount of memory free and exclusively for use by the superuser, so that root could log in and fix things.

                                              (Of course that fails if the runaway process is running as root, but that’s a failure of system administration, not memory allocation. ;) )

                                              I know that Solaris would continue running with 100% memory utilization and things would fail the right way (that is, by returning an error code/NULL) when called, rather than killing off some random, possibly important, process.

                                              EDIT: FreeBSD does support memory overcommit now, too, optionally, enabled via sysctl.

                                              1. 3

                                                I’m always amazed by the level of expertise and knowledge that people have online.

                                                Thanks for sharing your input, lorddimwit! :)

                                      1. 2

                                        I appreciate mbenbernard posting here. This is a small example of a really broad problem. If you do networking with untrusted servers (or even trusted servers) you have to ensure your memory usage is bounded and assume the thing on the other side can misbehave.

                                        This reminds me of someone seeing an OOM error because random HTTP was interpreted as a 32-bit number indicating an allocation size: https://rachelbythebay.com/w/2016/02/21/malloc/

                                        There are multiple approaches to getting a good level of reliability. The dead simple one is using processes and relying on Linux. It would have been trivial to write an app that does this:
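To make that failure mode concrete: if a protocol expects a 4-byte length prefix and the peer sends the text `HTTP` instead, those four bytes decode to a size of over a gigabyte. This sketch assumes a big-endian prefix and a made-up 1 MiB cap. `parse_length_prefix` and `OversizedMessage` are illustrative names, not from any library.

```python
import struct

MAX_MESSAGE_BYTES = 1 << 20  # hypothetical cap of 1 MiB per message


class OversizedMessage(Exception):
    """Raised when the peer announces more data than we are willing to buffer."""


def parse_length_prefix(header, limit=MAX_MESSAGE_BYTES):
    """Decode a 4-byte big-endian length and reject it if it exceeds the cap."""
    if len(header) != 4:
        raise ValueError("expected exactly 4 header bytes")
    (size,) = struct.unpack(">I", header)
    if size > limit:
        raise OversizedMessage("peer announced %d bytes, limit is %d" % (size, limit))
    return size


# What the bytes "HTTP" look like when read as an unsigned 32-bit length.
bogus_size = struct.unpack(">I", b"HTTP")[0]
```

Checking the announced size before allocating anything keeps memory usage bounded even when the other side misbehaves.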

                                        timeout 30s curl foo.com | readlimit 200kb | ....

                                        While that sort of setup is pretty fun, it’s high overhead. In practice I take the careful coding approach. In golang I generally use the streaming approach you mentioned (the default in go), combined with a hard limit on I/O https://golang.org/pkg/io/#LimitReader and with deadlines on read and write operations https://golang.org/pkg/net/#TCPConn.SetDeadline


                                        I made up the readlimit program. Writing a program which copies up to N bytes from stdin and writes to stdout would be pretty straightforward.
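For illustration, the core of such a tool could look roughly like this in Python. It is a sketch, not an existing program. `copy_limited` is an invented name, the limit is given in plain bytes rather than a suffix like `200kb`, and in-memory streams stand in for stdin and stdout.

```python
import io


def copy_limited(src, dst, limit, chunk_size=64 * 1024):
    """Copy at most `limit` bytes from src to dst and return the bytes copied."""
    copied = 0
    while copied < limit:
        # Never ask for more than what is left of the budget.
        chunk = src.read(min(chunk_size, limit - copied))
        if not chunk:
            break  # the source ran out before the limit was reached
        dst.write(chunk)
        copied += len(chunk)
    return copied


# Stand-ins for stdin/stdout so the example runs anywhere.
source = io.BytesIO(b"x" * 1000)
sink = io.BytesIO()
copied = copy_limited(source, sink, 200)
```

Wiring it to `sys.stdin.buffer` and `sys.stdout.buffer` would give the pipeline stage from the shell example above.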

                                        1. 2

                                          Hey thanks for your comments, Shane!

                                          One must always assume that APIs might break; and I discovered it the hard way with requests.

                                          I’m curious; when you say that the streaming approach is “the default in go”, do you mean that the default HTTP library of Go automatically handles streaming (transparently)?

                                        1. 1

                                          For my web apps, I’ve always stored times as UTC timestamps on the server, and then let the client (JavaScript) convert it to local time. For a simple web app, I guess that this is perfectly acceptable.

                                          However, I didn’t know about the Olson database thing. The author is probably right; for any time sensitive application (i.e. where time is critical), it probably makes sense to perform all of those additional steps on the server. However, I’ll certainly not take that advice for granted; this would need heavy testing.
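For reference, here is a small sketch of the "store UTC, convert late" approach, done on the server side in Python. The helper names are mine, and the viewer's zone is a hypothetical fixed offset to keep the example self-contained. A real application would more likely use a named zone from the Olson database via the standard `zoneinfo` module, which also handles daylight saving rules.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def to_storage(moment: datetime) -> str:
    """Normalize an aware datetime to a UTC ISO 8601 string for storage."""
    if moment.tzinfo is None:
        raise ValueError("refusing a naive datetime: its zone is ambiguous")
    return moment.astimezone(timezone.utc).isoformat()


def to_local(stored: str, tz: tzinfo) -> datetime:
    """Convert a stored UTC string into the viewer's zone for display."""
    return datetime.fromisoformat(stored).astimezone(tz)


# Hypothetical viewer five hours behind UTC (fixed offset, no DST rules).
viewer_tz = timezone(timedelta(hours=-5))

stored = to_storage(datetime(2017, 3, 1, 12, 0, tzinfo=timezone.utc))
local = to_local(stored, viewer_tz)
```

Whether the conversion happens here or in the browser, the stored value stays unambiguous.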

                                          1. 3

                                            I find that posting stuff on Reddit is pretty unpredictable. You never really know when people will like it, or hate it.

                                            But this was a super interesting post. I knew that time had some influence on a post’s ranking, but I didn’t know by how much.

                                            1. 3

                                              This is totally unexpected, but really great news. I’m curious to see what will happen in the next few months. I personally trust Mozilla more than any other organization.

                                              1. 6

                                                This is really a shame. Mozilla invests tons of money in a company producing closed source software while there’s this free software alternative that has been well established against that quasi-monopoly. They should support Wallabag instead of Pocket… Sometimes I just don’t understand the Foundation’s decisions… (And I know they need money, but this seems like a decision contrary to their principles.)

                                                1. 19

                                                  Pocket has a business model and a larger audience. If they adopted Wallabag, they would be competing with Pocket; through the acquisition, they also get rid of their largest competitor. C'est la vie.

                                                  I too wish they supported OSS like Wallabag, but I likewise wish more OSS products had robust business models.

                                                  1. 7

                                                    Presumably Pocket will become licensed as Open Source?

                                                    I do agree with you here, though.

                                                    1. 25


                                                      (I have nothing to do with any of this, even though I work at Mozilla)

                                                      1. 2

                                                        Super cool! :)

                                                      2. 2

                                                        Yes, that’s why I created an account yesterday and transitioned many of my open tabs into it.

                                                      3. 6

                                                        The Mozilla foundation is about OSS. Pocket will become OSS. Mark my words.

                                                        1. 6

                                                          “Mozilla invests tons of money” – I haven’t seen any reports of how much Mozilla spent on the acquisition. The numbers may have worked out to be something reasonable for Mozilla, on a per-user basis. It really depends on the company’s trajectory and the investors' view of the company.

                                                          Acquiring a service that has lots of users and some penetration on mobile devices might not be a bad investment. I guess we’ll know in a few years.

                                                          1. 5

                                                            A good way of describing Mozilla is “principled but pragmatic”. Basically, be principled, but not to such a degree that you shoot yourself in the foot in the process. A textbook example of this was the decision to adopt EME.

                                                            I’m not very familiar with either Pocket or Wallabag, but I’d guess Pocket has better market penetration and recommendation algorithms which would support existing Mozilla initiatives like context graph.

                                                          1. 5

                                                            It’s quite interesting that they can finally confirm it.

                                                            However, aren’t all skills learned anyway?

                                                            Let’s say that you would perform this quite troubling scientific experiment; you would leave a toddler alone for 10 years, completely isolated, and nobody would talk to him during all those years. It’s pretty obvious that he wouldn’t be able to speak at all.

                                                            1. 9

                                                              Good question.

                                                              Chomsky’s famous “deep structures” idea (I hope I’m not mangling it too badly) stated that language, in particular, involved innate skills at some level of abstraction. So, no, there was not agreement among linguists that everything about language-use had to be learned.

                                                              Though obviously any specific language has to be learned, the notion was that the structure of the human brain is such that certain basics are common to all languages. As a non-linguist, I think this would be things like the existence of nouns and verbs (but not of other parts of speech), and other general patterns in the structures of languages.

                                                              Of course, the existence of those commonalities does not itself mean that there’s any innate knowledge of them. Linguistics has done better than many other sciences at looking at diverse cultural backgrounds rather than guessing about universal truth from an unintentionally narrow sample, but it’s still possible that the commonalities emerge from social factors. It’s also possible that they emerge directly from the problem domain - that they represent the best way to describe the world, if you happen to be a physical object.

                                                              1. 9

                                                                The poverty of the stimulus debate is an interesting aspect of the argument. Chomskyians argue that young kids learn grammar too rapidly and from too few examples to support a view that language learning is blank-slate learning using some kind of general information-processing/induction capability. Instead people in this camp think it must be a kind of parameter learning, where kids are not really learning the grammars per se, but have the grammar building blocks “built in” and are learning parameters on them. Which, the argument goes, explains why they can rapidly generate complex structures after “training” on only small numbers of examples.

                                                                Counter-arguments are of various kinds, which I haven’t kept up with enough to summarize accurately. But one broad class is to agree that there is some kind of inductive bias (essentially all successful learning algorithms have some kind of inductive bias), but disagree with the Chomskyian parameter-fitting model and/or the Chomskyian hypothesis that it’s language-specific.

                                                              2. 6

                                                                Let’s say that you would perform this quite troubling scientific experiment; you would leave a toddler alone for 10 years, completely isolated, and nobody would talk to him during all those years.

                                                                Unfortunately, this has happened before. Not as a science experiment, but through parental abuse and neglect.

                                                                1. 5

                                                                  Frederick II didn’t exactly do it as a science experiment, in the sense that he wasn’t controlling variables, etc., but he tried to do the same thing on purpose.

                                                                2. 3

                                                                  However, aren’t all skills learned anyway?

                                                                  Not true with animals, anyway.

                                                                  This happens all the time on a farm. An animal is born and never sees its parent, siblings, or other animals of the same species (the parent dies, or whatever). The animal will still have many behaviors that are exactly as if it had been raised with animals of its species.

                                                                1. 4

                                                                  Simpler is better. Here’s what I use:

                                                                  • ! Some description of a bug fix
                                                                  • * Some description of a change or new feature
                                                                  • Y Some description of a merge.
                                                                  • + Some description of a newly added file
                                                                  • - Some description of a newly deleted file

                                                                  I reserve the last two for very important files (added or deleted). It’s certainly not perfect, but it works for me.
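As a rough illustration, the prefix convention above could be checked mechanically. This is only a sketch: the `PREFIXES` table and the `classify` helper are hypothetical names made up for this example, and it assumes each subject line is the marker, one space, then the description.

```python
# Hypothetical mapping of the one-character markers listed above.
PREFIXES = {
    "!": "bug fix",
    "*": "change or new feature",
    "Y": "merge",
    "+": "added file",
    "-": "deleted file",
}


def classify(subject):
    """Return (kind, description) for a subject line, or None if it
    does not start with a known marker followed by a description."""
    marker, sep, rest = subject.partition(" ")
    description = rest.strip()
    if sep and marker in PREFIXES and description:
        return PREFIXES[marker], description
    return None


if __name__ == "__main__":
    for line in ["! Handle empty config file", "Handle empty config file"]:
        print(repr(line), "->", classify(line))
```

A subject without a recognised marker comes back as `None`, so a check like this could flag messages that skip the convention.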

                                                                  1. 3

                                                                    That’s only simpler if you measure complexity in characters. It would make no sense if not accompanied by this glossary.

                                                                    1. 1

                                                                      I just read it once and I think I’d be unlikely to need the glossary again.

                                                                  1. 7

                                                                    Most commit messages focus on the “what” of the commit, but I wish they focused more on the “why”. It’s nigh-impossible to fit a “why” into the first 80-character line, though, and it’s just so easy to stop writing a message after that line.

                                                                    1. 4

                                                                      I always try to put the “why” in the long-form comment after the commit message. It takes some discipline and it’s not fun, but it’s always worth it.

                                                                      1. 3

                                                                        Totally. I mean when you read a commit message “Removed therubyracer”, you can see the commit and be like “obviously, I can see that - but why? because I want to add it back in”.

                                                                        If the “Why” is because you think it’s unused, that means it’s safe for me to re-add it (probably).

                                                                        If the “Why” is because there was some critical flaw with using therubyracer and there is now some better way, then knowing that would help me tons so I don’t hit the same flaw.

                                                                        I just picked therubyracer out of nowhere… well, recent history on a real project, but it’s totally arbitrary.

                                                                      2. 1

                                                                        The same generally applies to comments in code. Most code comments that I come across are generally useless or too cryptic.