1. 33
  1.  

  2. 13

    I was speaking with a client this week who was struggling to test the membership of an IP in a large list of IP ranges within a sub-millisecond deadline. The root cause, I felt, is that most developers only turn to use two data structures, lists and hashes, and when those fail them they change the tooling to something like SQL, Spark, ….or maybe decide “PHP is slow, I need C”.

    It made me recall a library I wrote and use with a few clients that solves just this problem, I thought I would share it, not so much for “shiny shiny”, to remind everyone that even in a slow language, if you pick and use your data structure correctly (eg. gb_trees[1]) then it rarely matters what programming language you use.

    I would be really interested to hear stories where swapping out for a more appropriate data structure helped save the day for others.

    [1] though I really wanted a (radix) trie but you work with what you have baked in the core language offering

    1. 4

      I was speaking with a client this week who was struggling to test the membership of an IP in a large list of IP ranges within a sub-millisecond deadline.

      I would be really interested to hear stories where swapping out for a more appropriate data structure helped save the day for others

      I had very good results using a simple bisection method. In my case, I used the bisect module in Python.

      I used a fairly big list of IPv4 ranges: all the public ranges assigned by the five RIRs. I wrote a blog post about it.

      Hope it’ll be useful!

      1. 2

        q/k has a bin operator which implements binary search.

        q)a:asc update b:floor a+2 xexp 32-b from flip `a`b!"II"$flip "/"vs'1_ read0 hsym `$"fullbogons-ipv4.txt"
        q){x within value a a.a bin x} "I"$"1.2.3.4"
        0b
        q){x within value a a.a bin x} "I"$"0.0.0.2"
        1b
        

        To see why this works, note my table a is pre-sorted, so a.a bin x will find the index of the row which has a value greater-or-equal to the target. Because application and indexing are the same, I can then get the row as a pair with value and do a quick check to make sure the input is within that range.

        How fast is it?

        q)\t:1000 {x within value a a.a bin x} each a.a
        4929
        

        That’s 5 milliseconds for all of the input IP addresses.

        q)\t:1000 {x within value a a.a bin x} each 1000?0i
        1495
        

        That’s 1.5 µseconds per lookup! Wow!

        I tried to do the same thing with Erlang, but building a vector in erlang is tricky – there are no standard library calls for it, so I went for tuples and rolling bin myself:

        bin(A, Key) -> bin(A, Key, 1, tuple_size(A)).
        bin(_, _, X, Y) when X > Y -> Y;
        bin(A, K, X, Y) -> I = (Y + X) div 2, V = element(I, A), if K > V -> bin(A, K, I + 1, Y); K < V -> bin(A, K, X, I - 1); true -> I end.
        

        This can probably be improved, but timing it:

        23> timer:tc(fun () -> lists:map(fun (X) -> a:check(X,D) end, tuple_to_list(hd(D))) end).
        {11543,
         [true,true,true,true,true,true,true,true,true,true,true,
          true,true,true,true,true,true,true,true,true,true,true,true,
          true,true,true,true|...]}
        

        Wow! Am I reading that right?

        26> timer:tc(fun () -> a:check(rand:uniform(4294967296), D) end).                        
        {15,false}
        

        Checking with some random values:

        31> M = lists:map(fun (_) -> rand:uniform(4294967296) end, lists:seq(1,1000)).
        [1625206192,2869803430,2103437497,2086713235,3940882389,
         2296258966,3322323448,1396553267,1794578915,2860418586,
         1281050487,444562501,3797348793,3992939745,2859767047,
         145867709,1976629736,956575317,1351635532,1056482660,
         1171921685,2805441948,3489185570,4248238815,2260232200,
         1053952541,746961818,2043346634,230923220|...]
        32> timer:tc(fun () -> lists:map(fun(I) -> a:check(I, D) end, M) end).               
        {3723,
         [false,false,false,false,false,false,false,false,false,
          false,false,false,false,false,false,false,false,false,false,
          false,false,false,false,false,false,false,false|...]}
        

        Looks like we’re only 2x-ish slower than kdb+! Not bad for a “slow” language :)

        1. 2

          Can you share the file you used to test the performance of ranges?

          I’d like to try it against my own IP address library (for Racket).

          1. 4

            I have updated the performance section to use publicly available larger lists.

            1. 3

              Thanks! Out of curiosity, I tried a binary search. That gets me these results

              min: 0.057861328125 max: 2.095947265625 avg: 0.063962158203125
              

              on my i7-6820HQ. It’s pretty fast once the JIT has a chance to warm up! I might give the trie-based approach a try (pun not intended) if I have time next week. It would be nice to see how much of a difference it makes.

              1. 3

                Compare it to a list iteration implementation and then pour yourself a drink to celebrate how substantially better you pushed the envelope in two hours :)