1. 3

    Unfortunately, this heap must be able to grow dynamically because there exists no meaningful upper bound for allocations on behalf of the C++ runtime. E.g., in a multi-threaded application, multiple exceptions may be in flight at the same time.

    I’m not entirely convinced by this. Obviously true in theory, unclear in practice. A scheme which would work for - say - a max of 100 simultaneous exceptions would be to allocate a (fairly small) buffer and take free slots in that for the exception header. What %age of use cases exceed 100? How about 1000? Does a working assumption of “no more than X OS threads per physical processor” help set a meaningful bound?

    This has the disadvantage that it fails (how? likely a hard stop with an informative error) if you violate the bound, but the advantage that you get to carry on doing what you’re doing code-wise and not worry about it.

    It used to be common practice to have fixed bounds in the kernel, and recompile with different bounds for different workloads. This isn’t ideal, but it can be pragmatic, if it is easy to put in place a bound which is unlikely to be hit.

    1. 3

      Part of the problem is that this is per thread, so it can add up to quite a large amount of space. The other part of the problem is that C++ allows you to throw value types, which can be arbitrarily large.

      With the Windows unwinder, this has very little overhead because exceptions are allocated on the stack and the unwinder then runs as ‘funclets’ (functions that have access to the stack frame of another function) on top of the stack. This means that all cleanup code, all nested exceptions, and all unwinder state are on the stack. In the Itanium / DWARF ABI, most of the state is heap allocated and so the exception object is a variable-sized allocation that holds the unwinder state and the thrown object.

      Exception objects are usually small, but in theory you could throw a 1 MiB object, by value, and have several of them in flight at a time. C++11 made this slightly worse by providing an accessor to get the current in-flight exception so that you can forward it to another thread, which means that you have to be able to allocate it on the heap, even if your fast path stores it only on the stack.

    1. 7

      From experience, you cannot rely on url discovery to constrain operations by a client.

      If you return urls like /deposit/account_id/amount in a particular response, you can expect developers to construct those, even if your documentation explicitly says you have to chase through previous steps, have them returned to you and then follow them.

      i.e. developers don’t like walking multiple requests to discover which operations are currently permissible and their associated urls, they destructure them to find semantic meaning and then use them on the fly.

      I think the only way around this would be to have opaque urls (/opaque/some-uuid) which must be discovered by a client, and are mapped into the semantic spaces by some other layer. This seems like it doesn’t add value.

      1. 2

        I once had to write an API for consumption where constructing URLs wasn’t allowed, but obviously people did it anyway.

        So I wrote a middleware which, for every URL in the application, would rewrite responses to use GUIDs/hashes/random data for parts of the URL. It kept a cache in memory so you could hit the same nonsense URL repeatedly, but the cache was cleared nightly and on app restart.

        As long as consumers followed from the root document, they had no problems. And the ones which complained were politely pointed to the contract they signed about how the API was to be used.
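
        A minimal sketch of the idea in Python (hypothetical names; the real middleware sat in front of every response and rewrote the URLs it contained):

        import uuid

        class OpaqueUrlMapper:
            # Map real URLs to opaque tokens so clients cannot construct them.
            def __init__(self):
                self._by_url = {}
                self._by_token = {}

            def opaque(self, real_url):
                # Hand out the same nonsense URL for repeated hits within one run.
                if real_url not in self._by_url:
                    token = "/opaque/" + uuid.uuid4().hex
                    self._by_url[real_url] = token
                    self._by_token[token] = real_url
                return self._by_url[real_url]

            def resolve(self, token):
                return self._by_token[token]

            def clear(self):
                # Cleared nightly / on restart, so constructed URLs stop working
                # and clients have to start from the root document again.
                self._by_url.clear()
                self._by_token.clear()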

      1. 1

        I did something similar once with perl (actually mod_perl in an apache, with the process backend). perl supports signal handlers, so I installed a SIGUSR handler in the perl code which used caller to walk the stack and append the trace to a file.

        This (and a script to hit a pid with the signal once a second for a short while) was enough to get a good idea of what was burning time on the prod system.

        Being able to “Profile in prod” is massively useful, and you can do it fairly non-invasively with surprisingly little work. You can completely avoid the difficult problem of having to replicate live load on your test system.

        (golang makes this very easy+safe with net/http/pprof)
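
        For reference, a rough sketch of the same trick in Python rather than perl (hypothetical output path; only the standard signal and traceback modules):

        import signal
        import traceback

        TRACE_FILE = "/tmp/stack-samples.log"   # hypothetical output path

        def dump_stack(signum, frame):
            # Append one sample of the interrupted stack to the trace file.
            with open(TRACE_FILE, "a") as f:
                f.write("--- sample ---\n")
                f.write("".join(traceback.format_stack(frame)))

        # Install once at startup; `kill -USR1 <pid>` in a shell loop then takes
        # a sample per signal, cheaply enough to do on a prod system.
        signal.signal(signal.SIGUSR1, dump_stack)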

          1. 6

            Comparison timings of dd if=/dev/zero bs=X | pv -v > /dev/null on my system (ryzen 5 3600)

            bs      rate
            1       2.4MiB/s
            16      37MiB/s
            256     565MiB/s
            4096    5.6GiB/s
            65536   7.8GiB/s
            524288  3.8GiB/s
            
            1. 2

              The FizzBuzz code avoided the heap, kept everything in L2 cache, minimised syscalls, and picked some seldom-used ones which brought better performance at the cost that the output had to be piped to another application.

              It’s worth running perf and strace with your above dd command. I suspect all three of the above optimisations are missed out on.

            1. 3

              Using an in-process L1 cache is likely only useful at scale if:

              • your application makes multiple lookups of a cached object from L2 during handling one request
              • you have some server affinity, routing multiple successive requests from the same client to the application server

              Without one of these two, the chances of finding your entry in L1 decrease as you scale.

              (The first case seems odd, but was actually pretty common in one codebase I’m familiar with. In this case, though, the number of items you need to L1 cache is really small - perhaps a user object or similar).
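
              As a sketch of that first case: a per-request memo in front of the shared cache (Python; l2_get is a stand-in for whatever fetches from L2):

              def make_request_cache(l2_get):
                  # Per-request L1: repeated lookups of the same key during one request
                  # hit the local dict instead of going back to the shared L2 cache.
                  local = {}
                  def get(key):
                      if key not in local:
                          local[key] = l2_get(key)
                      return local[key]
                  return get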

              Another honourable mention in this space is groupcache: https://github.com/golang/groupcache (also written by the inestimable bradfitz, the person behind livejournal and the original perl memcached).

              Groupcache is purely client-side, with application nodes co-operating to cache data and avoid a thundering herd on cache load. I’ve not used it, but the approach is really interesting.

              1. 2

                I now wonder how clever you can reasonably be here.

                Rather than having “safe but optimised for integers” add/sub/mul/div operations: if you have a function which performs a number of numeric operations on its arguments, and you can demonstrate that the types of those arguments don’t change within the function, then you could guard once at the beginning of the func and use “unsafe integer add with no type check” throughout that func.

                You could then hoist the integer values into int64s (rather than python integers), and even into registers, for the duration of the function, so you are a lot closer to jitting.

                Lastly, one could have this type information - discovered at runtime - spill out to the caller, and be inferred further and further.
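
                A toy sketch of that guard-then-specialise idea in Python (purely illustrative; a real implementation would generate the unchecked fast path rather than take it as an argument):

                def specialise(fast_int_version, generic_version):
                    # Guard once on entry: if every argument is a plain int, take the fast
                    # path that assumes ints throughout; otherwise fall back to the safe one.
                    def dispatch(*args):
                        if all(type(a) is int for a in args):
                            return fast_int_version(*args)
                        return generic_version(*args)
                    return dispatch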

                1. 5

                  Aren’t you just describing type specialization? The Psyco Python compiler did this years ago, in 2004 or earlier (and it was sort of popular at the time – people used it). The author Armin Rigo then worked on the more ambitious PyPy project.

                  I think Figure 1 is pretty much what you’re talking about. In the related work section they cite “Self”, etc.

                  http://psyco.sourceforge.net/theory_psyco.pdf

                  https://en.wikipedia.org/wiki/Psyco

                  1. 5

                    I know that WebKit’s JavaScriptCore does exactly that optimisation and I believe other implementations do too. In the CFG JIT (tier 3 in JSC) it does type inference, based on type guards. If you know the inputs to a trace (not necessarily a function; the CFG JIT works on a continuation-passing style IR) are of a particular concrete type (in JavaScript, typically double, though in some cases int32), you can then infer the output types and dispatch to the specialised version of the trace as long as the guards pass.

                    These optimisations are based on ideas from StrongTalk (Anamorphic Smalltalk) and Self, back in the ’80s and ’90s.

                  1. 5

                    I think the thing which end to end tests cover is whether the assembly of correctly-functioning components gives a correctly functioning system. i.e. it is integration testing, not unit testing. It is testing that the assembly works, not just the components.

                    1. 11

                      Happy to see Serenity getting this kind of exposure! Very well-deserved.

                      1. 3

                        But what’s the benefit, really? Using only one CPU core is a huge handicap, it won’t run most software, and the browser sounds limited too. And AFAIK it doesn’t offer anything technically innovative under the hood like a microkernel. (Plus that UI … I know it’s a matter of taste, but to me Windows 95 was a gray-and-pus colored eyesore.)

                        (I don’t mean to flame, I’m just wondering what the appeal is beyond the wow factor of “someone built an OS from scratch”.)

                        1. 17

                          It’s in the article, right? The guy’s in recovery and needed, or needs, a substitute activity.

                          I know a bit of how people can do in such a situation, and I’m sure wrangling Linus or anyone else in a bigger project would lead to a relapse. No joke.

                          Stuff like single-core isn’t a handicap, it’s a good start. Hoping of course the SMP support / scheduler will look like Haiku’s so the desktop remains responsive ;)

                          Any wow factor beyond the backstory is personal to whoever is into these projects.

                          1. 17

                            Plus, it’s fun. I poked around SerenityOS and tried my hand at fixing a bug or two (no PRs yet, I’m not in the best spot, either, but soon…). It’s just a light-hearted hobby project that can brighten a nerd’s evening like no other.

                            Its community is really nice, too, nobody well ackshuallies you because your program has so many options it’s intimidating for new users, reviews don’t bring up things like what particular method you use to build your Docker container or whether the way you’ve done something is fashionable in the latest C++ standard.

                            It may not look like much, but lots of things that start out as fun eventually turn out to be useful through sheer inertia, precisely because people end up doing things that are unthinkable in the real world, like, I dunno, interfaces that you can use on a small screen :-P.

                          2. 15

                            I will almost certainly never run SerenityOS, ReactOS, or Haiku but I’m still happy that the projects exist and I like to see news from them.

                            1. 10

                              There was a period where reasonable people could (and did) make similar criticisms of Linux as an OS. It couldn’t run a lot of the software you needed. To “get things done”, you needed windows. Only enthusiasts could make it work.

                              I think most people are aware of the dizzying tower of abstraction on which modern development sits. There are pros (scale from a toaster to the cloud! use this library!) and cons (performance! complexity!) to this, but - for some problem domains - cutting that away for a fresh start is a feature not a bug.

                              If a new OS is going to ever grow up under the heavy canopy of the existing ecosystem, it will look like this at first.

                              1. 5

                                Speaking as someone who wrote a little bit of the libc, I share the frustration it’s not at all original. However, what I do like about it is the heart and spirit of it. Andreas’ enthusiasm is very infectious.

                                1. 4

                                  Personally I think the independent browser engine alone makes SerenityOS worthwhile. Creating one is a Herculean task that even enormous corporations will not attempt, but having multiple browser engines is vital for the health of the open web. There are so few browser engines left besides Blink+WebKit that I consider it to be an emergency. SerenityOS’s browser may not be very impressive at the moment, but it could potentially be a starting point for a new competitive browser engine the way that KHTML was the starting point for the current dominant browser engines.

                                  If Servo can somehow survive, I’d consider that to be a better basis for a competitive browser, but if SerenityOS runs well on older / cheaper machines, perhaps a browser based on it could find a niche on low-powered hardware.

                                  And if having an alternate browser engine implementation is worth getting excited about, what other independent implementations of things exist in SerenityOS that merit attention? Maybe I’ll install SerenityOS and investigate.

                                  1. 3

                                    Eh. It’s not the choices I would have made for a personal project, but given that it started as one dude scratching a personal itch … I’m happy to see it. My own nostalgic itch-scratching would end up looking a lot more like classic Finder on top of Symbolics, but with two kids and such it’ll always be an unrealized daydream. Good on this dude for pushing through and Doing It.

                                1. 11

                                  While this does take a stab at refuting some criticisms of Perl I don’t think it achieves its goals.

                                  One of the core issues with Perl is that it’s such a flexible language. This is what its adherents (myself included!) love. But it’s also a net negative when trying to use Perl for large projects where interpersonal communication with regards to coding standards and what’s “idiomatic” is important. What if your resident rock star just can’t stand Moose, and insists you use their own home-grown object system? With Perl, it’s entirely plausible whatever they’ve cooked up will in fact do the job - at least for this project.

                                  I return to my oft-repeated observation: Perl hackers revel in TMTOWTDI, Python hackers fret over what’s idiomatic. It’s a culture thing, there’s nothing wrong with it, but time has shown that bondage and discipline languages have an edge for large projects, and maybe for knowledge diffusion in general.

                                  1. 5

                                    Similar arguments have been made about why Lisps never became as widespread as many people feel they should be.

                                    Edit: typo

                                    1. 3

                                      Yep, I agree. I guess I’d like perl criticism to be channelled into “the language is too big / too many ways to do things” (a la C++) and “the language is too flexible/powerful” (a la lisp/scheme), rather than “line noise, write only, yadda yadda”.

                                      A corollary of that, is that places which manage to make good on C++ or lisps could also make good on perl with similar external disciplines (style guides etc).

                                      I agree the trend is more towards “simple” + “one way to do it” (I like golang very much).

                                    1. 2

                                      In a world where many/most people used this, there would be a lot of redundant computation occurring.

                                      I wonder if such a system could/should be built in a way that this could be exploited - effectively memoising it across all users. (In the same way that cloud storage exploits redundancy between users by using content addressable storage on the backend so they only need to store one copy of $POPULAR_FILE)

                                      e.g. for compilation you’d probably want:

                                      1. content-based-addressing of the data input (I am compiling something equivalent to foo.c)
                                      2. content-based-addressing of the executable input (I am running gcc 2.6.1 on x86)
                                      3. some kind of hash of the execution environment (cpuid flags, env vars, container OS?)

                                      and probably some other bits and pieces (does the executable access system calls like local “current time”?). It could probably be made to work if the executables agreed to play nice and/or ran inside a decent sandbox.

                                      It would be challenging, but also very cool, to get this right.
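
                                      A minimal sketch of such a cache key (hypothetical helper; it assumes the input and the tool are both local files and that the interesting environment variables are listed explicitly):

                                      import hashlib
                                      import json
                                      import os
                                      import platform

                                      def cache_key(source_path, tool_path, env_vars=()):
                                          h = hashlib.sha256()
                                          with open(source_path, "rb") as f:   # 1. input content
                                              h.update(f.read())
                                          with open(tool_path, "rb") as f:     # 2. executable content
                                              h.update(f.read())
                                          env = {k: os.environ.get(k, "") for k in env_vars}
                                          fingerprint = {"machine": platform.machine(), "env": env}
                                          h.update(json.dumps(fingerprint, sort_keys=True).encode())  # 3. environment
                                          return h.hexdigest()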

                                      1. 1

                                        Gradle has a distributed cache that can probably do that.

                                        1. 1

                                          That’s interesting, thank you.

                                          I was using the example of compilation, but I think the general question is interesting. “We are performing this computation, with this executable code, on this input data, in this runtime environment (which is a special case of input data)”

                                          If we can determine that (all essential) characteristics of these are the same as a previous run, then we can lookup the result.

                                          I think there are interesting questions as to what constitutes inputs here (e.g. non-pure things like ‘time’ and ‘network’) and - moreso - what makes the executable code “the same” for this purpose. (What level do you work at - source code, binary etc).

                                          1. 1

                                            Gradle has a very flexible task system - you can completely define the relevant inputs and dependencies of tasks by yourself. Often that will just be files, and some dependencies on the outputs of other tasks. The tricky part is usually defining all of them correctly. But once you do that, magic happens - tasks can be cached, and if you set up a distributed cache, it may even be shared amongst multiple machines (devs, CI, etc). A task doesn’t have to be compilation; it can be anything that takes inputs and produces outputs - maybe you want to do some code generation, or whatever. I’m sure there are other build systems that are backed by a similarly flexible high-level task system; Gradle is just the one that I happen to know.

                                        2. 1

                                          Llama (https://github.com/nelhage/llama) does a few of these (like content addressing). I have played around with it a bit and it’s been a joy.

                                        1. 6

                                          There aren’t a lot of phone numbers, so you’ll want to at least salt your hash so that a rainbow table of pre-computed sha256 values isn’t useful.

                                          But, someone with your db and salt can likely still usefully brute force things unless you’re using a deliberately slow hash.

                                           So I’d guess that the same advice applies to phone numbers as to passwords (i.e. use something like scrypt or argon2); I think the search space is similar to weak passwords?

                                          log2(10^10) ~= 33bits

                                          https://en.wikipedia.org/wiki/Password_strength

                                          According to one study involving half a million users, the average password entropy was estimated at 40.54 bits

                                          Oh - possibly quite a bit smaller. Maybe too small to do anything useful?
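
                                           A minimal sketch of the salted, deliberately slow hashing (Python’s standard hashlib.scrypt; the cost parameters are illustrative only, and at ~33 bits of input entropy brute force may still be feasible):

                                           import hashlib
                                           import os

                                           def hash_phone(phone, salt=None):
                                               # Salted, memory-hard hash; store the salt alongside the digest.
                                               salt = salt if salt is not None else os.urandom(16)
                                               digest = hashlib.scrypt(phone.encode(), salt=salt,
                                                                       n=2**14, r=8, p=1)
                                               return salt, digest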

                                          1. 2

                                            It’s even more restricted than that - only the last 4 digits, in North America, are free to use all 10 possible values. The others are subject to various forms of restrictions, assignments and rules. See the “Modern Plan” section here: https://en.wikipedia.org/wiki/North_American_Numbering_Plan#Numbering_plan

                                          1. 6

                                            None of these are particularly convincing to me, as someone who has seen some Perl code and heard more complaints but hasn’t written any.

                                            1. Perl regexes that I’ve seen are not easily readable. If they are readable when done well but nobody does them well, then the problem is that it’s too easy to use them badly, which is a language design problem. It’s true that regexes in other languages aren’t especially readable either, but as he says, regexes are much more deeply embedded in Perl than other languages, so it’s a bigger problem.

                                            2. Sigils seem pointless and at the wrong level of abstraction to me. If it’s a simple data structure, I’m not going to forget what it is. If it’s a complex data structure, I don’t want to constantly think of it as a hash or an array, I want to think of it as the data structure it actually represents. Ruby’s use of sigils to differentiate between instance variables, class variables, and so on makes more sense.

                                            3. I don’t have a problem with the dereferencing syntax mentioned, but it brings up a different question. Let’s say I have an array @array. I want the third element, and I want to store it in a variable. What sigil do I use, if I don’t know the structure of the array? I assume there’s a solution but it seems needlessly complicated that you have to worry about it.

                                            4. The argument here is “if it bothers you you can use English;” except most of the complaints around perl have to do with readability, and if you’re reading the program of someone who didn’t use English; it doesn’t help that the option is available.

                                            1. 5

                                              Sigils seem pointless and at the wrong level of abstraction to me. If it’s a simple data structure, I’m not going to forget what it is. If it’s a complex data structure, I don’t want to constantly think of it as a hash or an array, I want to think of it as the data structure it actually represents. Ruby’s use of sigils to differentiate between instance variables, class variables, and so on makes more sense.

                                              Perl has roots in linguistic thought [*], so $ and @ are more reasonably thought of as “singular” and “plural” rather than as data types. These provide a ‘context’ to the surrounding code, which perl (uniquely) leverages. Having the context be explicit in the sigils helps understand this process.

                                              https://www.perlmonks.org/?node_id=738558

                                              Complex data structures are captured in rich dynamic types with fields and methods, just as in other languages. The same plural and singular context applies to them too.

                                              [*] e.g. many methods will operate without an argument, and use the default argument $_ which I pronounce “it”.

                                              my @a = (1,2,3);
                                              say for @a;
                                              

                                              Here “for” is looping over @a. The for loop aliases “it” to each value in turn. “say” is invoked on each iteration without an argument, so it uses “it” and prints it.

                                              1. 2

                                                scalars are singular and arrays are plural, and the sigil is meant to provide context

                                                That certainly is a much better explanation for it than the article presented. Why are hashes given a different sigil, though? And why does the other responder say they never use arrays, if they’re one of the two main contexts?

                                                1. 6

                                                  Why are hashes given a different sigil, though?

                                                  That’s a good question. I find that in perl code, you rarely see a hash sigil except as a declaration. It provides a list context to the RHS of an assignment. Assigning a list to a hash takes alternating elements as key/value.

                                                  And why does the other responder say they never use arrays, if they’re one of the two main contexts?

                                                  Plural context is more properly called “list context” (https://perldoc.perl.org/perldata#Context, https://docstore.mik.ua/orelly/perl4/prog/ch02_07.htm, https://docstore.mik.ua/orelly/perl3/lperl/ch03_08.htm) and is more general than arrays.

                                                   Although, to be honest, I used to use perl arrays all the time. With push/pop and shift/unshift they filled the role of lists, queues, deques and vectors in terms of ease of use (I don’t think you can properly fulfill the algorithmic complexity of all of those with one data structure, but specialist needs can pull in other modules with those characteristics).

                                                  e.g.

                                                  # Seed the work list
                                                  my @todo = ($path);
                                                  # Take one item from the front of stuff to do, finish if no more work
                                                  while (my $next = shift @todo) {
                                                    #  Get zero or more work items from doing this one
                                                    my @additional_work = process_file($next);
                                                    # append new work items
                                                    push @todo, @additional_work;
                                                  }
                                                  

                                                   Perl’s philosophy is at odds with programming language development since the early 2000s: “There is more than one way to do it” (TIMTOWTDI), and also “Easy things should be easy, hard things should be possible”, a.k.a. Huffman coding for the programmer. Python had broadly the opposite approach (“one way to do things”, “explicit is better than implicit”) and was more in tune with the way the world went.

                                                  This leads to a language with a very different approach and culture to others. One can rightly criticise perl for having many ways of doing things (just as one can rightly criticise C++ for the same reason). That is a strength and a weakness, mostly a weakness at scale, where a new dev might have to understand a large amount of the language to read code. But to a newbie, learning to write code, it can be an advantage.

                                                  List context and abbreviation also enabled powerful map/filter tools some time before python got list comprehensions, and when such things were more a province of the lispers. (perl also had proper closures before many other languages - https://www.amazon.co.uk/Higher-Order-Perl-Mark-Dominus/dp/1558607013). Perl hashes (a.k.a. dicts, maps) “auto-vivify” so you don’t need to handle the uninitialised case. Some code to calculate distribution of word lengths:

                                                  #!/usr/bin/perl
                                                  use Modern::Perl;
                                                  
                                                  my @words = qw/the quick brown fox should be nicer to the lazy dog/;
                                                  my %lengths;
                                                  $lengths{$_}++ for map length, @words; # length takes $_ as arg, provided by map
                                                  foreach my $k (sort keys %lengths) { # list context operators (keys, sort - compose naturally)
                                                      say "$k => $lengths{$k}";
                                                  }
                                                  

                                                  One might think that one can control the breadth of the language in large projects with coding standards, but both perl and C++ arguably fail here.

                                                  But perl was (and is):

                                                  • concise
                                                  • expressive
                                                  • fast (compared to other languages in its niche - i.e. python + ruby. JS got a big speed bump later)
                                                  • well designed (although one may disagree with the design goals)
                                                  • fun
                                              2. 1

                                                Re point 3

                                                my @array = (1,2,3,4);
                                                my $third = $array[2]; # 0-based indexing, referencing a single entry returns a $calar
                                                my @first_last = @array[0,3]; # slice returns an @rray 
                                                

                                                To be honest, outside of introductory programming and (in my case) coding contests, arrays are rarely used other than indirectly. The true workhorse of Perl is the hash(reference).

                                                1. 1

                                                  It’d work for a hash as well, but my question is a bit different: If you have a data structure you fill at runtime, so sometimes it could be (1,{'a' => 'b'},2,3) but other times it could be (1,2,{'a' => 'b'},3). Or any other access pattern where you have a heterogeneous array, or a hash where the values are of different types. And you access $array[2]. Since you don’t know if it’s a scalar or a hash, what sigil do you use for the variable?

                                                  1. 1

                                                    Well to be pedantic, {'a' => 'b'} is a reference to a hash, which is a scalar, so it would be $third in either case. Of course it would sometimes be an integer, and sometimes it would be a hash reference, but that’s a separate (more serious) issue ;)

                                              1. 7

                                                I hate Perl, but none of these are in the top three reasons.

                                                 Fragile datastructure creation and dereferencing syntax is #4. I don’t know why anyone would defend it, because it serves no useful purpose for the user. It might be okay if the compiler told you that you are using the wrong dereference incantation for the value. However, that brings us to #1.

                                                Perl is essentially untyped. Even with strict and warnings there’re lots of errors that will pass undetected and will only manifest in incorrect behavior. That’s my reason #1.

                                                 The second reason is lack of good ways to express abstraction through either types or objects. Hashes that pretend they are objects don’t count, and third-party bolt-on solutions aren’t that much better.

                                                The third one is lack of a good module system.

                                                 For the record, I led a big rewrite from Perl to Python and it saved the project from crumbling under its own weight. I’m not a casual Perl hater. ;)

                                                1. 3

                                                  Hashes that pretend they are objects don’t count

                                                  I’d like to understand more about this. Do you have the same reaction to javascript, which I think has a similar model? Could you elaborate on why they don’t count?

                                                  and third-party bolt-on solutions aren’t that much better.

                                                  I’ve not looked for a while, but things like Moose (https://metacpan.org/dist/Moose/view/lib/Moose/Cookbook.pod#Basic-Moose) seem to meet a lot of needs - could you give an idea what is lacking?

                                                  The third one is lack of a good module system.

                                                  I’d appreciate understanding more about this. CPAN was pretty much setting the standard for module creation and distribution before any modern alternative.

                                                  There are lots of opinionated aspects to perl which people can reasonably dislike/disagree on - I’m just a bit surprised by some of the above.

                                                  1. 4

                                                    Do you have the same reaction to javascript, which I think has a similar model?

                                                    Yes. JS and Perl have one thing in common: they were designed (or thrown together) as languages for short scripts, and then people started using them for large applications, but, as Perlis said, a language never leaves its embryonal sac. All you can do is work around those issues, not fix them.

                                                     Why doesn’t it count? Because it doesn’t really provide you with a way to either express the data model or enforce it.

                                                    things like Moose seem to meet a lot of needs

                                                    At the very least, you have to force them on people. Since it’s not a standard, and not even very widespread, that will be writing in “Perl with Moose”, not in Perl.

                                                    CPAN was pretty much setting the standard for module creation and distribution before any modern alternative.

                                                    I mean a module system at the language level, not a way to download chunks of code. By that logic, JS has a good module system because there’s NPM. It doesn’t. If you haven’t seen a good one, look at Ada or ML, or at least at Java or Python.

                                                    1. 5

                                                      I mean a module system at the language level

                                                       perl has package scoping, and language constructs to import sets of symbols from one package into another (or to refer to them with a package prefix).

                                                      Because it doesn’t really provide you with a way to either express the data model or enforce it.

                                                      Are you objecting to the lack of enforcement of public/private (the perl convention is leading underscore for private, but it isn’t enforced. Although C++ doesn’t enforce it either if you #define private public before including the header file…)? Because you can have types with fields, methods on objects, class methods, constructors, destructors.

                                                      Is it the lack of static typing you dislike? Does that also apply to python without type hinting? (Sorry for the many questions. Many people dislike perl for different reasons - and that is fine - but your reasons are different to most :-) )

                                                      At the very least, you have to force them on people. Since it’s not a standard, and not even very widespread, that will be writing in “Perl with Moose”, not in Perl.

                                                      No? You don’t need to force them on anyone. Like common lisp, the underpinnings of perl (at the language level) are quite powerful and allow people to build powerful extension modules. This can lead to some fragmentation, but the community made reasonable efforts to promulgate good practice, e.g. https://en.wikipedia.org/wiki/Perl_Best_Practices

                                                      I think there are reasons to dislike perl, but I don’t think lack of power in the type system is one of them (unless you are just against dynamic typing in general, which is perhaps a different position than “perl is bad because it doesn’t have real types”).

                                                1. 1

                                                   While I definitely applaud the initiative (and speed for its own sake), is it necessary to generate this more than once (i.e. does speed matter)? I guess the data could be customer-specific in some way, but that seems not needed for a test db.

                                                  1. 13

                                                     Easily one of the best blog entries I’ve ever read, and the title is perfect: like a mnemonic, it makes the principle easy to remember. Although the code examples are in Haskell, the content is so well written that it becomes approachable no matter what your background is.

                                                    1. 5

                                                      Is this fundamentally the same idea as “make invalid states unrepresentable”?

                                                      i.e. don’t have a corner case of your type where you could have a value which isn’t valid (like the empty list), instead tighten your type so that every possible value of the type is valid.

                                                      Looked at this way, any time you have a gap (where your type is too loose and allows invalid states) the consumer of the value of that types needs to check them again. This burden is felt everywhere the type is used.

                                                      If you do the check to ensure you haven’t fallen into the gap, take the opportunity to tighten the type to get rid of the gap. i.e. make the invalid state unrepresentable.

                                                      All sounds good - but I wonder how practical this is in practice? In a mainstream language, this kind of work is often done in constructors. If I have a ‘User’ class with a ‘name’ string, and my User constructor requires a non-empty name, the rest of the code in the user class can assume it is non-empty (if the only way of making a User is via the ctor)

                                                      Is that the same thing as what we’re doing here, or is there a material difference?

                                                      1. 4

                                                        I think it is related to “make invalid states unrepresentable”, but goes further to explain how to use types to achieve this. The crucial part is to understand the difference between validation and parsing as explained in the article. Applying it to your example validating a username to be non-empty is fine as long as the Constructor is the only way to construct that object, but parsing it would mean narrowing it to a different type, say a NonEmptyString. Or perhaps a Username type now makes sense. Passing that type around means you don’t need to re-validate your assumptions because now indeed you have made invalid states unrepresentable.

                                                        1. 4

                                                          parsing it would mean narrowing it to a different type, say a NonEmptyString. Or perhaps a Username type now makes sense. Passing that type around means you don’t need to re-validate your assumptions because now indeed you have made invalid states unrepresentable.

                                                          As far as I understand, it is not practical with Python to apply that advice. Any ideas how to do it?

                                                          1. 3

                                                            Throw an exception in the constructor of Username if the passed-in string doesn’t meet the restrictions for a username.
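
                                                             A minimal sketch of that in Python (hypothetical Username type; once constructed, the non-empty invariant holds wherever the value is passed):

                                                             class Username:
                                                                 def __init__(self, raw):
                                                                     # Parse at the boundary: reject invalid input once, here.
                                                                     if not raw or not raw.strip():
                                                                         raise ValueError("username must be non-empty")
                                                                     self.value = raw.strip()

                                                             def register(user):
                                                                 # No re-validation needed: holding a Username proves it was parsed.
                                                                 print("registering", user.value)

                                                             register(Username("alice"))    # fine
                                                             # Username("   ")              # raises ValueError at the boundary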

                                                            1. 2

                                                              I was thinking today that it should be perfectly possible to apply the “parse, don’t validate” approach in Go, and I think it is a viable thing to do in Python too.

                                                              If I understand correctly, the blog post is advocating to parse a logical representation of your domain model out of the primitive data that you get from other systems (strings, lists, json, etc). There is a benefit in pulling all that in a separate step that is independent of the real logic. This gets you several benefits:

                                                              • Fewer bugs, as you do not work with primitive objects that can have different meanings and representations.
                                                              • A central place to handle all your external data transformations that is easy to reason about.
                                                              • No half-complete operations. Say a 10-write process completes 5 writes, and the 6-th before-write validation fails - how do you handle that?

                                                              Doing all that in a language like Haskell that has an advanced type system is great, but I think we can do it in other languages even if we don’t have that 100% guarantee that the compiler is watching our back. For example, a parsed OO structure in a dynamic language composed of classes with a custom, domain-specific API is a lot better than the usual lists-in-dicts-in-more-lists-and-tons-of-strings spaghetti.

                                                              1. 1

                                                                I don’t have any experience with doing that in Python. You can technically subclass str, but failing in a constructor is not so nice and creating a new method to parse a string opens the door to construct the object with an invalid value.

                                                        1. 10

                                                          Back when the internet was quite a bit smaller, I considered (but never did) asking the honduras domain registrars to add an MX record and forward ‘j’ to me, so I could be j@hn.

                                                          1. 9

                                                            Can someone sum up for me why one might like QUIC?

                                                            1. 13

                                                               Imagine you are visiting a website and you try to fetch these files:

                                                              • example.com/foo
                                                              • example.com/bar
                                                              • example.com/baz

                                                               In HTTP 1.1 you needed a separate TCP connection for each of them to be able to fetch them in parallel. IIRC browsers allowed about 4 connections per host in most cases, which meant that if you tried to fetch example.com/qux and example.com/quux in addition to the above, then one of the resources would wait. It didn’t matter that the 4 in flight could take a lot of time and block their pipes; nothing else could be fetched on a pipe until its resource had been fully fetched. So if by chance a slow resource was requested before the fast ones, it could slow down the whole page.

                                                               HTTP 2 fixed that by allowing multiplexing: fetching several files using the same pipe. That means you no longer need multiple connections. However there is still a problem. As TCP is a stream of data, all packets before the current one need to be received before a given frame can be processed. That means a single missing packet can stall the processing of resources that have already been received, because we have to wait for the straggler, which may need to be retransmitted over and over again.

                                                               HTTP 3 (aka HTTP over QUIC, with a few bells and whistles) is based on UDP, and the streaming is built on top of that. That means each “logical” stream within a single “connection” can be processed independently. It also adds a few other things:

                                                              • always encrypted communication
                                                              • multi homing (which is useful for example for mobile devices which can “reuse” connection when switching between carriers, for example switching from WiFi to cellular)
                                                              • reduced handshake for encryption
                                                              1. 9

                                                                 Afaik, multihoming is proposed but not yet standardized. I know of no implementation that supports it.

                                                                QUIC does have some other nice features though

                                                                • QUIC connections are independent of IP addresses. I.e. they survive IP address changes
                                                                 • Fully encrypted headers: Added privacy and also flexibility. Makes it easier to experiment in the Internet without middleboxes interfering
                                                                • Loss recovery is better than TCP’s
                                                                1. 4

                                                                  Afaik, multihoming is proposed but not yet standardized

                                                                  That is true, however it should be clarified that only applies to using multiple network paths simultaneously. As you mentioned, QUIC does fix the layering violation of TCP connections being identified partially by their IP address. So what OP described (reusing connections when switching from WiFi to cellular) already works. What doesn’t work yet is having both WiFi and cellular on at the same time.

                                                                  1. 3

                                                                    Fully encrypted headers

                                                                    Aren’t headers already encrypted in HTTPS?

                                                                    1. 8

                                                                      HTTP headers, yes. TCP packet headers, no. HTTPS is HTTP over TLS over TCP. Anything at the TCP layer is unencrypted. In some scenarios, you start with HTTP before you redirect to HTTPS, so the initial HTTP request is unencrypted.

                                                                      1. 1

                                                                        They are. If they weren’t, it’d be substantially less useful considering that’s where cookies are sent.

                                                                        e: Though I think QUIC encrypts some stuff that HTTPS doesn’t.

                                                                    2. 3

                                                                      Why is this better than just making multiple tcp connections?

                                                                      1. 5

                                                                         TCP connections are not free: they require handshakes at both the TCP and SSL levels. They also consume resources at the OS level, which can be significant for servers.

                                                                        1. 4

                                                                          Significantly more resources than managing quic connections?

                                                                          1. 4

                                                                             Yes, QUIC uses UDP “under the table”, so creating a new stream within an existing connection is 100% free: all you need is to generate a new stream ID (no communication between participants is needed when creating a new stream). So from the network stack’s viewpoint it is “free”.

                                                                            1. 3

                                                                              Note that this is true for current userspace implementations, but may not be true in the long term. For example, on FreeBSD you can do sendfile over a TLS connection and avoid a copy to userspace. With a userspace QUIC connection, that’s not possible. It’s going to end up needing at least some of the state to be moved into the kernel.

                                                                        2. 5

                                                                          There are also some headaches it causes around network congestion negotiation.

                                                                          Say I have 4 HTTP/1.1 connections instead of 1 HTTP/2 or HTTP/3 connection.

                                                                          Stateful firewalls use 4 entries instead of 1.

                                                                          All 4 connections independently ramp their speed up and down as their independent estimates of available throughput change. I suppose in theory a TCP stack could use congestion information from one to inform behaviour on the other 3, but in practice I believe they don’t.

                                                                           HTTP/1.1 requires single-duplex transfer on each connection (don’t send the second request until the entirety of the first response arrives; can’t start sending the second response before the entirety of the second request arrives). This makes it hard for individual requests to get up to max throughput, except when the bodies are very large, because the data flows in each direction keep slamming shut then opening all the way back up.

                                                                          AIUI having 4 times as many connections is a bit like executing a tiny Sybil attack, in the context of multiple applications competing for bandwidth over a contended link. You show up acting like 4 people who are bad at using TCP instead of 1 person who is good at using TCP. ;)

                                                                          On Windows the number of TCP connections you can open at once by default is surprisingly low for some reason. ;p

                                                                          HTTP/2 and so on are really not meant to make an individual server be able to serve more clients. They deliberately spend more server CPU on each client in order to give each client a better experience.

                                                                        3. 2

                                                                         In theory, HTTP 1.1 allowed pipelining requests: https://en.wikipedia.org/wiki/HTTP_pipelining which allowed multiple, simultaneous fetches over a single TCP connection.

                                                                          I’m not sure how broadly it was used.

                                                                          1. 4

                                                                           Pipelining still requires each document to be sent in order, so a single slow request clogs the pipeline. Also, from Wikipedia, it appears not to have been broadly used due to buggy implementations and limited proxy support.

                                                                            1. 3

                                                                              QUIC avoids head-of-line blocking. You do one handshake to get an encrypted connection but after that the packet delivery for each stream is independent. If one packet is dropped then it delays the remaining packets in that stream but not others. This significantly improves latency compared to HTTP pipelining.

                                                                          2. 5

                                                                         A non-HTTP-oriented answer: It gives you multiple independent data streams over a single connection using a single port, without needing to write your own framing/multiplexing protocol. Streams are lightweight, so you can basically create as many of them as you desire and they will all be multiplexed over the same port. Whether streams are ordered or sent in unordered chunks is up to you. You can also choose to transmit data unreliably; this appears to be a slightly secondary functionality, but at least the implementation I looked at (quinn) provides operations like “find my maximum MTU size” and “estimate RTT” that you will need anyway if you want to use UDP for low-latency unreliable stuff such as sending media or game data.

                                                                          1. 2

                                                                            Any code that “feels synchronous” necessarily makes you pay for that feeling by stealing true concurrency away from you. It may seem more convenient at first, but the second you need more control you have to use much clunkier abstractions.

                                                                            Yeah, you’ve got to choose a default for what you mean when you write straight-line code without explicit concurrency markup.

                                                                            By default, do you want:

                                                                               a = func_a()
                                                                               b = func_b()
                                                                            

                                                                            to mean: i) run func_a, then run func_b or ii) run func_a and func_b concurrently (and plough on until an explicit join/await of some kind).

                                                                            I want (i), because I think it is easier to reason about. And - unless there is I/O going on - I think (ii) would be no faster in general (because you’re either in a single OS thread or trying to do some implicit locking of every data structure if you are defaulting to everything being concurrent).
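
                                                                             For what it’s worth, here is how the two defaults look with explicit markup in Python’s asyncio (illustrative only; func_a and func_b are stand-in coroutines):

                                                                             import asyncio

                                                                             async def func_a():
                                                                                 return "a"    # stand-ins for the functions in the example above

                                                                             async def func_b():
                                                                                 return "b"

                                                                             async def sequential():
                                                                                 a = await func_a()    # (i): func_a finishes before func_b starts
                                                                                 b = await func_b()
                                                                                 return a, b

                                                                             async def concurrent():
                                                                                 # (ii): start both, plough on until the explicit await/join point
                                                                                 a, b = await asyncio.gather(func_a(), func_b())
                                                                                 return a, b

                                                                             print(asyncio.run(sequential()))
                                                                             print(asyncio.run(concurrent()))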

                                                                            1. 4

                                                                              I dunno, compilers are already allowed to reorder, intermix and generally mush around the ordering of func_a() and func_b() whenever they can prove it doesn’t violate any invariants. For example, they can interleave instructions for better instruction-level parallelism. If func_a() and func_b() are truly independent then there’s a cost tradeoff to automatically parallelizing them: running them on different cores has overhead, so you have to be able to know that the gain is worth the overhead. Generally we leave that decision to humans, but I don’t think there’s any fundamental reason we need to.

                                                                              1. 1

                                                                                Generally we leave that decision to humans, but I don’t think there’s any fundamental reason we need to.

                                                                                I think this is mostly because the state space you have to explore to arrive at a good decision is still very broad. It often depends on what the initial core-delegation overhead is, on how many independent cores are available on a machine, on degenerate cache-coherency cases, and other complicated things. Humans are often equally unaware, but they can run load tests to gain a working understanding of the system and optimize accordingly.

                                                                            1. 11

                                                                              BackBlaze acknowledged this and pushed out a fix. Facebook’s SDKs are notorious for recording far more data than necessary, as noted here, so I don’t feel BackBlaze was shipping off data intentionally; rather, they were blindsided by Facebook changing things under them.

                                                                              1. 35

                                                                                BackBlaze is responsible for the code on their website. If they ship code in their web app which sends all the names of the user’s files to Facebook, that’s on them. This is a huge violation of trust from BackBlaze. “A library did it” isn’t an excuse.

                                                                                1. 20

                                                                                  I completely agree, it is certainly a grave mistake on their part. What I meant was that this incident appears to be a result of carelessness rather than malice.

                                                                                  1. 6

                                                                                    Ah, makes sense. That is indeed an important thing to point out.

                                                                                    1. 2

                                                                                      A case of “Never attribute to malice that which can be adequately explained by stupidity.”?

                                                                                      1. 2

                                                                                        Never attribute to malice that which can be adequately explained by passing the buck to a library ♥

                                                                                    2. 9

                                                                                      Absolutely. I mean, what did they expect would happen when they include some tracking garbage from Facebook? I evaluated them and eventually planned to use them as a block storage provider, but I canceled my account with them today after I read about the tracking pixel. There’s absolutely zero reason for including this tracking stuff in the admin part of the website.

                                                                                      1. 2

                                                                                        The only mitigation I can think of is to code-review (at some level) all diffs of all dependencies (transitively), when any first-level dependency changes.

                                                                                        It’s even worse if some libraries are loaded from a third party, which could change them at any time.

                                                                                        I think that is a lot of difficult work.

                                                                                        Is there a better idea than the one above? Or is that just the cost of doing business, and the best approach would be for us to somehow distribute the load (e.g. a third-party, curated, checked, trusted JS stack which covers a common set of modules)?

                                                                                        1. 16

                                                                                          The mitigation here is substantially simpler: don’t include code that loads from or sends data to third parties on pages that contain sensitive business and personal information that you are obligated to protect. Especially when that’s your core business.

                                                                                          People would be much more understanding of this issue if it were a supply chain attack; it wasn’t. They intentionally included scripts from third parties where there shouldn’t have been any. That the scripts were extracting slightly more data than they thought… really isn’t the issue.

                                                                                          1. 4

                                                                                            But why would you want to integrate your customers’ admin panel with Facebook? It compromises their privacy and your company secrets. The only reason I can imagine is measuring conversions, but again, is it worth the risk?

                                                                                            1. 3

                                                                                              Well, it’s a trade-off, isn’t it? In theory, code reviewing (and self-hosting!) every dependency could provide the best security. That’s feasible if you’re comfortable with using few dependencies, but it might not always be possible.

                                                                                              If you’re not going to review your dependencies, though, the very least you should do is reflect on whether the dependency is managed by someone you have reason to believe isn’t going to do anything creepy. I would, for example, probably trust jQuery, because they don’t (AFAIK) have a history of being creepy. Do we have a reason to trust Facebook not to be creepy? Absolutely not. So maybe don’t use their tracking library.

                                                                                              Above all that though, host your code on your own damn servers. There’s no good reason to give a library vendor (or an attacker with access to your library vendor’s web server) the technical ability to inject arbitrary code into your app just by changing a file on their end. This should be an obvious thing just from a reliability perspective too. Thanks to Hyrum’s law, every change is a potential breaking change, so it seems ridiculous to effectively push new versions of dependencies to customers with no testing.