1. 33
  1.  

  2. 17

    TLDR:

    1. Start with big IDs so you don’t enter in PHP weird behaviors.
    2. Use UTF-8, always.
    3. Allow-list system with deny by default for anything related to security.
    4. Monitor your SQL queries.
    1. 5

      The one time I read the article before the comments…

      1. 2

        Monitor your SQL queries.

        The idea that SQL queries are a special kind of error is really weird. You should be reporting all your errors that aren’t caused by bad user input or network hiccups.

        The allow-list bit is a good call but the rest seems oddly language-specific. Starting with big IDs implies you’re using integers, which is not a great idea to begin with; stick with UUIDs and the whole problem isn’t even possible.

        1. 3

          The idea that SQL queries are a special kind of error is really weird.

          Because SQL syntax errors indicate injection bugs, and SQL injection bugs can exploited to accomplish literally anything. The OP isn’t talking about queries that are supposed to return exactly one result returning more than one, which are also bugs, but probably not exploitable to completely pwn the database.

          1. 1

            The OP isn’t talking about queries that are supposed to return exactly one result returning more than one, which are also bugs […]

            If it’s a bug … why are you ignoring it? You should … fix it instead?

            1. 2

              Because I need to fix the highest severity bugs first, and SQL injections are much higher severity.

              1. 2

                If a bug exists but nobody discovers it, is it truly a bug?

                1. 2

                  If a bug exists and all your users find out about it before you do, are you really a professional?

                  1. 1

                    Precisely. I’m sure I don’t need to tell a fellow professional how this kind of disparate-parity bug can quietly develop over time.

            2. 2

              What’s wrong with using integers as keys? If you want to make sure that two integers with different semantic meanings are never confused for each other, you should be relying on a type system, not the improbability of UUID collisions. (Unless I’ve mistaken your point?)

              1. 2

                Integers are enumerable, while (some kinds of) UUIDs are not.

                1. 1

                  What’s the problem with enumerability—that someone might be able to guess a valid ID?

                  Which kinds of UUIDs aren’t enumerable?

                  1. 3

                    You can technically enumerate UUIDs, but considering how many there (quite a few) are and how they are generated (randomly for version 4 UUIDs), it may take you some time.

          2. 7

            If you’re using integer IDs to identify data or objects, don’t start your IDs at 1. Start them at a huge number (e.g., 2^33) so that no object ID will ever appear in any other role in your application (like a count, a natural index, a byte size, a timestamp, etc).

            Passing around random integers and logic like “well it’s somewhere in the order of eight and a half billion, so it must be a user” sounds like a really fucking shitty way to write most programs - both in terms of making assumptions, and in terms of developer productivity.

            Ok, very memory sensitive, massively concurrent systems will see a noticeable operational benefit to passing around an integer, rather than an Object, but I’d wager that 99.9% of people will never work on such a project. Even if you don’t want to go full on Model Objects, at least use wrapped integers for your IDs (e.g. class UserID { public int $id } - and then your methods (or your global functions if that’s your kink) can at least typehint to require a UserID, so a function hypothetical get_friends(UserID $id): array will throw immediately if you pass in say a PhotoID, or a GroupID, or any other random integer.

            1. 8

              well it’s somewhere in the order of eight and a half billion, so it must be a user

              That’s not what they’re saying: What the article states is “if you put your user IDs outside the accidentally reachable number space, accidentally trying to parse a small number won’t hand out some user data”. At no point is that arrangement supposed to mean if x >= large_num { x_is_user = true; }

              1. 4

                Like I said: relying on an arbitrary integer being “outside accidentally reachable space” sounds like a fucking terrible idea, rather than just, you know, using the type system available to you, to say “hey we need a fucking User ID, not just any random integer”.

                You (the proverbial you not you specifically) may as well also propose using ranges of integers starting at each billion, for different object types, so you can do away with foreign keys in your RDBMS.

                Let me put this another way: if your codebase is written in such a way that you’re relying on user ID’s being some magically unique number, not appearing in any other form, to provide any semblance of security or privacy, you’ve already failed.

                1. 3

                  It’s very poorly explained, but the idea is that you have a bunch of things indexed by some kind of numerical ID. If your language doesn’t give you nice unit types, it’s very easy to confuse integer-representing-a-Foo-ID and integer-representing-a-Bar-ID (and loop induction variable that was supposed to be an index into an array of Foo IDs, and many other things). If you start everything at 0, then an accidental type confusion in your program will probably still find a valid thing. If you start both at different, moderately large, random indexes, then type confusion will probably trigger some kind of thing-not-found error. This is much easier to find in testing: the observable failure is close to the bug.

                  It’s not about segregating the types and guaranteeing that different numerical ranges refer to different types, it’s about finding the errors where you make the confusion.

                  If you were doing this in a language like C++, you’d have a separate type for each of these IDs and mark the casts to and from ints as explicit, so you’d have to explicitly write something like: ProductID id(sessionid.as_integer()) and that would be likely to be picked up in code review. PHP doesn’t really help you here.

                  1. 3

                    PHP doesn’t really help you here.

                    Like I said, if your scale is such that passing around actual model instances isn’t feasible (so already a pretty slim minority of the software world), a wrapper class with a single integer property is going to use minimal memory, and still lets you use types to only accept/return an object that is “known” to be a User ID, or Product ID or whatever.

                    I’m not sure why you would think this won’t work in PHP, or pretty much any language that has even the most basic concept of classes.

                    1. 3

                      So you use this nice user type in your code but at some point your code is supposed to present a list of users (e.g. their facebook friends) to the user on the other side of the HTTP connection. For that, you read out the user data (name, picture, …) and present them, but you also need some identifier to put into the URL that is opened when they click on the friend’s link. Now what?

                      Of course that data gets sanitized on input (and could be wrapped into a User ID object at that point) again, but still: there’s this number floating around in some shape or form. Doesn’t hurt to keep it outside the “normal” number space to avoid running into funny issues down the road (because, you know, coders make mistakes).

                      I’ve seen similar advice to start such numbers at 2^53 if there’s a chance that double-by-default languages such as Javascript (or JSON parsers that try to be compliant) mess with them, just so developers see resulting issues immediately rather than at some distant point in time when nobody remembers what’s going on.

                      It’s simply a very cheap defensive programming technique when you deal with something that carries an ID somewhere.

                      1. 5

                        Now what? Now you have some actual security. You lookup a user in a table, you check if the active session user has permissions to do whatever the link is supposed to do - neither of those is different logic if the underlying integer is 8 or 8 billion.

                        Doesn’t hurt to keep it outside the normal search space to avoid running into funny issues down the road (because, you know, coders make mistakes).

                        What the fuck is “normal search space”?

                        No user input should be trusted. Your argument is proof of why this ridiculous theory is bad security theatre - there’s nothing to stop a client sending a request with 8 rather than 2^33 in the parameter that identifies the user. If your application is written well, that shouldn’t matter: regular security/privacy should prevent them from seeing/doing things they shouldn’t. At worst they should get an error message.

                        Your suggestion implies that they might be able to do/see something they shouldn’t be able to, because e.g. the number they send happens to be the ID of a non-user object.

                        If that is the case, you’re basically arguing in favour of security by obscurity. If that is not the case, then you’re arguing in favour of security theatre.

                        So which is it?

                        1. 4

                          Your suggestion implies that they might be able to do/see something they shouldn’t be able to, because e.g. the number they send happens to be the ID of a non-user object.

                          I’m arguing that some coder might factor out some user ID handling code into a piece of work that translates it into a plain integer type and write their functions around that. And then have some coder (maybe the same, but just as clueless) cast values when using those function instead of fixing the mess, and again by mistake, they happen to cast an enum type (which typically cast into the 0..n range for small n). And this is all in a strongly-typed environment: The Phabricator folks use PHP and therefore I’d assume that they write their blog posts for a PHP-using audience, and PHP’s type system provides fewer guarantees (although they’re cleaning up their act. slowly.)

                          I’d rather have it explode on them then, than give reasonably-looking-at-a-glance data because the CEO thought it’s cool to have UID 1.

                          As I wrote, coders make mistake.

                          Counter question: what irritates you so much about simply starting a counter at a large value that you exploded like that (see the expletives in the first post)? It’s a no-cost guard rail that is ideally never needed, but as it costs nothing, and might protect against stupid mistakes (even though it’s pretty weak), why bother?

                          1. 2

                            I would rather have it explode at them as soon as possible rather than when you reach 2^n users, by which time a fix might be much more difficult both to do and to trace.

                            1. 2

                              “It’s a no-cost guard rail that is ideally never needed, but as it costs nothing, and might protect against stupid mistakes (even though it’s pretty weak), why bother?”

                              I think the problem is that, as stephenr says in his reply, it’s akin security by obscurity. It’s convincing yourself - or your future self, or whoever looks at this system later, that all’s fine because these numbers are big and therefore we’ve solved the problem. By doing something that ‘might protect’ rather than something that will protect, the problem gets worse, as now we’re lulled into a false sense of security.

                              Yes, a system shouldn’t explode because someone wanted UID 1, but the system shouldn’t act like it’s fine for 4 years and then explode because we’ve been comparing UIDs and creation timestamps like that joke signpost that seems to be all over the world (population + height above sea level … total = …) and we’ve only just hit the timestamp and the UID where that mattered.

                              Typing is important and what the article advocates is an easy ‘solution’ that’s dangerous and is something software development should have moved past by now. UUIDs/GUIDs are now commonplace and absolutely appropriate for, well, unique identifiers. Namespacing prefixes work reasonably well (UID-123) but are prone to humans making up rules (‘I’ve only ever seen UIDs with 3 digits so therefore if someone tells me they have UID 1 that means I should write UID-001 and an ID of UID-1000 is invalid’).

                              1. 2

                                as it costs nothing, and might protect against stupid mistakes, why bother

                                Lots of things cost nothing and someone claims “might” do something. I’d rather just do something that does protect against the issue, and ignore the security theatre.

                                Edited, @vakradrz makes a good point.

                                1. 3

                                  I agree with your argument (and have made my own reply to the parent comment) but please try to keep it civil, as while it’s an important topic and I’m sure we’ve seen disasters due to such designs, there’s still good intention here and education is more difficult with this kind of tone.

                                  1. 2

                                    You make a good point.

                  2. 4

                    For me, the most compelling reason to assign IDs as described in the article is so that you can grep your log files for those IDs and probably not get false positives. Even just starting your IDs at 256 means that you won’t get collisions with the components of IPv4 addresses. Starting at 10,000 means you won’t get collisions with the components of IPv6 addresses, starting at 32,769 means you won’t collide with PIDs (depending on how your system is configured), and so on. You can go whole-hog with this and use UUIDs for everything, and then you’re even less likely to have collisions, but that has its own drawbacks.

                    That being said, I think it’s important to view this as a minor developer affordance and not as a substitute for a type system. If you find yourself doing this because you’re getting loop counters confused with entity IDs… I don’t think making the entity IDs larger is the right solution.

                    1. 2

                      … is it really that hard to have your logs reflect the type as well as the ID? How do you grep for anything that isn’t using an artificially large PK?. I’d have thought user:123 was easier to get a valid result against than just 9993939191939393

                      1. 2

                        Sure, of course your logs should be clear about the meaning of each piece of information. But if you’re grepping through logs multiple times per day, every day, then not having to type user: each time starts to give you a nontrivial time savings—and, more significantly, it feels like there’s less friction in the process. I’m assuming that the numbers are going to be copied and pasted anyway, which means there isn’t a time difference between grepping for a shorter number or a longer one. And your approach depends on a higher level of consistency in writing log messages than I think is common at most places—if someone leaves out the user: in one particular message, you’re back at grep -w.

                        1. 3

                          This logic doesn’t make any sense to me, because in any non-trivial application, users are just one type of thing that you’d want to be able to identify.

                          I don’t buy the idea that users are some unique thing you’d want to search for, but products, sales, payments, groups, etc etc - whatever the actual business items of the application are - are not equally as important. This is why none of the arguments presented make any sense to me: offsetting one type of object by 2^33 doesn’t solve the same supposed issue for all the other object types you have, so unless your application is so trivial you only have two object types: users and… something else, the offset mechanism is not a useful solution to any of the problems presented, IMO.

                          1. 2

                            I think the idea is that you would offset users by 2^33 (for example), products by 2^34, sales by 2^35, and so on. Of course there are obvious problems with this scheme: if you have more than 2^34 products, the IDs for those are going to start overlapping with the IDs for sales. If you have more than around 30 entities in your system, you won’t be able to offset all of them like this and stay within 64 bits.

                            That’s why I think it’s vital to treat this scheme as just a developer affordance and not any kind of data integrity or type safety feature. Like I said, I think this is only really helpful for grepping logs… I think I may disagree with some of the other commenters on that point.

                            1. 1

                              Your comment actually made me go back and re-read the article. I think you’re right that they are talking about all objects, not just users specifically, but their reasoning seems to be essentially, what you alluded to before:

                              If you find yourself doing this because you’re getting loop counters confused with entity IDs… I don’t think making the entity IDs larger is the right solution.

                              In the example given, getting a list of users returns an associative array using integer IDs as the key, and boolean true as the value. They then proceed to call array_slice without setting the preserve_keys flag to true, and get back a 0-indexed array of boolean true.

                              I wouldn’t be surprised if this is a real world example from Facebook, given some of the absolutely garbage examples shown in the leaked dumps of the codebase from several years ago - but to use this ridiculous pattern as a reason for starting your object IDs at 34 billion, is beyond stupid.

                              1. 2

                                Agreed. I ended up re-reading the article too, and apparently the “larger IDs make searching logs easier” point was something I made up; the article didn’t say that. None of the reasons they give for making the IDs larger seem like sound engineering to me.

                    2. 3

                      This isn’t meant to be a primary way to distinguish valid IDs from random numbers, but as a defense in depth in case you screw up your code.

                      Most likely you’re going to need to work with SQL, JSON, URLs, and other places where you’ll have to put an untyped number. Newtype in the language doesn’t help in cases like this:

                      get_friends(new UserID($_GET['photo_id']))
                      
                      1. 2

                        So what happens when someone changes your URL from ?uid=9809890809809890 to ?uid=9.

                        SQL of all places is a ridiculous example. Are you searching every table looking for a PK match?

                        1. 3

                          I think you’re still reframing this as if it was meant to be a security measure or some kind of bullet-proof protection. It’s not. It’s a “lint” that may help catch a programmer’s error. It is not intended to catch nor detect any outside interference.

                          I’ve chosen SQL, because SQL doesn’t accept PHP types as arguments (unless you implement a very fancy type-safe ORM, I guess?). There could be mistakes like misaligning ? placeholders and their values, or selecting columns in a wrong order. Even when you only use named placeholders and fetch rows as assoc arrays, if you join multiple tables with an id column, you might accidentally pick the wrong one. Bugs can happen. The trick is about making such bugs fail louder, sooner.

                          1. 2

                            The trick is about making such bugs fail louder, sooner.

                            If your developers don’t notice that their piece of code is returning the wrong user, I honestly don’t think they’ll notice that it’s not returning any user, because they’re clearly not testing what they write, even in the most basic of “I tried this once on my local machine” sense.

                      2. 1

                        Yeah this is weird. If am doing ‘get friends’ I want a list of friends, not integers.

                      3. 3

                        About the users, why not prefix all user IDs with user_ or something if you really want to identify it. Making the numbers large doesn’t seem like a good solution at all.

                        1. 3

                          Another tip for catching invalid uses of IDs is to use a different autoincrement value for each type (ideally different primes), so increase user IDs by 3, page IDs by 7, widget IDs by 11, and so on. This way they won’t overlap with each other. If you ever mix them up, you’re more likely to notice when program fails due to “missing” data than if it continued with incorrect data.

                          1. 4

                            ideally different primes

                            Note that the increments must be relatively prime to each other or else you will get collisions. Making them all prime is the easiest way to ensure this.

                            1. 1

                              Actually, disregard what I said, it’s nonsense. If you offset user IDs by 3 and page IDs by 5 then 15, 30, 45, … are still going to appear in both sequences.

                              If you want non-overlapping sequences of IDs one way to get them is to use powers of distinct primes, so for example the user IDs would be 3¹, 3², 3³, etc., the page IDs would be 5¹, 5², 5³, and so on. The downside is that you’ll exhaust your storage space quickly: there are only 63 powers of 2 that fit into 64 bits (I’m excluding 2⁰, since 2⁰ = 3⁰ = 5⁰ = …), only 40 powers of 3, and only 27 powers of 5.

                              All in all, I heartily recommend solving this problem in a different way.

                            2. 1

                              This will get more complex if you need to do replication with active:active(:active…), - which I’d argue is any site that has an uptime goal of anything above DILIGAF - as offsets are used to avoid simple conflicts in auto increments.

                            3. 4

                              Is passlist and blocklist usage really so hard?

                              1. 24

                                Is it really so hard to realize many people are simply unaware of the controversy?

                                Is it really so hard to realize people are creatures of habit and do not even realize they are using non-preferred nomenclature?

                                Is it really so hard to realize that scolding them like this is counterproductive, because they will automatically become defensive?

                                I could have phrased this nicer and more constructive. Could you?

                                1. 16

                                  100% – I’m leaving my comment up as it was a silly and unconstructive knee jerk reaction. Thank you.

                                2. 4

                                  This page is at least 9 years old; it was submitted to HN in 2011.

                                3. 2

                                  In one of the systems I was working on I’ve used a whitelist-based approach rather than blacklist one. I agree this maybe reduced one category of bugs, but at the same time it introduced another, where functionality often didn’t work until I’ve updated the whitelist. So I guess this whitelist-first approach instead of blacklist-first approach is really a trade; an exchange of one category of bugs to another. Which in case of trading security bugs to functional bugs could be a net positive, but it’s a good idea to think about what category of bugs are actually being traded.

                                  1. 5

                                    I agree this maybe reduced one category of bugs, but at the same time it introduced another, where functionality often didn’t work until I’ve updated the whitelist.

                                    That’s the goal. Deny-list approaches fail open, allow-list approaches fail closed. If you forget to put something that you need on your allow list, some feature that your users care about will stop working and you will get useful bug reports. If you fail to put something that you don’t need into a deny-list then an attacker can exploit it and you don’t get any bug reports.

                                    1. 3

                                      100% this. If the default-deny causes you to think through a new code path during development, then you’re paying the lowest possivle cognitive price you’ll ever get. If you notice something is off during development and testing then you’re right there.

                                      Imagine you had a bug report that sometimes works most of the time, except once or twice. Good luck investigating, when you got a whole stack of technology to wade through. It’s more time-consuming, more frustrating for the users and maybe even pissing off your ops team.

                                      Bonus: once in this mood, you are also less likely to come up with a thorough solution that catches all of the possible facets of “badness” and you’ll end up in the sad place again and again.

                                  2. 1

                                    I think SQL injection a solve problem with prepared statement or store procedure?

                                    I have been using Golang and pretty much hand write my SQL query and prepared statement. In some place I even adopted Store Procedure. It feels nice because you write SQL once and any service can call it without worrying about how that particular service deal with SQL injection.

                                    1. 1

                                      For prepared statement, yes usually. Sometimes the thing you want to parameterise is not supported. For example table name is usually not supported.

                                      For stored procs: not unless you use a prepared statement or escape correctly. Always possible to convince your code to pass “^C ROLLBACK; – do something” or similar as an arg to a stored proc.

                                      Personally I don’t see people using prepared statements often in real life because a) it requires a new roundtrip b) their framework or driver can do the job.

                                    2. 1

                                      One thing I find useful is to make sure that the hash map that I use is not deterministic - each time program is executed the order of iteration should be different, like for example the default rust hashmap.