1. 34
  1.  

  2. 25

    This seems a good time to promote a paper our team published last year (sorry to blow my own trumpet :P ): http://soft-dev.org/pubs/html/barrett_bolz-tereick_killick_mount_tratt__virtual_machine_warmup_blows_hot_and_cold_v6/

    We measured not only the warmup, but also the startup of lots of contemporary JIT compilers.

    On a quad-core i7-4790 @ 3.6GHz with 32GB of RAM, running Debian 8:

    • C was the fastest to start up at 0.00075 secs (+/- 0.000029) – surprise!
    • LuaJIT was the next fastest to start up at 0.00389 secs (+/- 0.000442).
    • V8 was in 3rd at 0.08727 secs (+/- 0.000239).
    • The second slowest to start up was HHVM at 0.75270 secs (+/- 0.002056).
    • The slowest overall to start up was JRubyTruffle (now called TruffleRuby) at 2.66179 secs (+/- 0.011864). This is a Ruby implementation built on GraalVM (plain Java on GraalVM did much better in terms of startup).

    Table 3 in the linked paper has a full breakdown.

    The main outcome of the paper was that few of the VMs we benchmarked reliably achieved a steady state of peak performance after 2000 benchmark iterations, and some slowed down over time.

    1. 1

      I saw a talk about this. Very cool stuff! It is a good antidote to the thrall of benchmarks.

      1. 1

        Cool work! You should make that a submission on its own in the morning in case someone misses it due to a filter: for instance, people who filter out the Python tag that the main post carries. Just the programming, performance, and compiler tags should do. The good news is that a lot of people still saw and enjoyed it, per the votes. You definitely deserve an “authored by” submission, though. :)

        1. 3

          It was on the lobsters front page about six months ago. https://lobste.rs/s/njsxtv/virtual_machine_warmup_blows_hot_cold

          It was a very good paper and I personally wouldn’t mind seeing it reposted, but I don’t actually know what the etiquette for that is here.

          1. 1

            I forgot. My bad. I should probably do a search next time.

      2. 5

        I’ve seen a number of tricks to speed up start time over the years, and am not surprised to see suggestions for the same in that thread. My most personally memorable one was a time I was loading some dozens of GB of data into a process before analyzing it. My startup time was tens of minutes to load and link this data, and I had a lot of show-stopping bugs that were crashing my data analysis after that data was loaded. Waiting for that data to reload got old before noon.

        I arranged to load my data once and fork + import my data analysis code so I only had to pay startup costs once. I easily had a dozen iterations before my analysis was stable enough. I saved myself a lot of time skipping over reloading that data. The quality of the code I wrote for this fork+import was really only good enough to be scaffolding: I threw it all away once that pipeline had stabilized. But I got my result that day.

        1. 2

          “I arranged to load my data once and fork + import my data analysis code so I only had to pay startup costs once. I easily had a dozen iterations before my analysis was stable enough. I saved myself a lot of time skipping over reloading that data.”

          Cool trick. As I read it, I remembered that the ability to easily do this is a prime selling point of using LISP environments. Your code and data go into the running image. You can branch off with new stuff that just reads and works on it. Kill that off. Try something else. Since you can change running code, you don’t need to reload anything for new experiments if your code already separates the data container and the new stuff. Memory safety should reduce crashes to exceptions.

          All languages, even C, need the ability to do this, either as the default way they work or in a runtime environment. It’s just too useful. It might have saved you some effort.

          1. 2

            My own experience there was moving from C to Python and being blown away at the quality of the debugger and the exception handling environment. I’ve never quite got the hang of living in a “notebook” (interactive Python [or otherwise] prompt), but I have trouble doing without the ability to drop into a debugger / repl when I want to poke around. I’ve gone so far as to write up the C equivalent of Python’s pdb.set_trace.

            1. 2

              The notebook/repl concept was weird for me too when I tried LISP. I didn’t get used to it. The cool thing was that the editor had an alternative where I typed code in a source file which could be compiled individually. The one thing that was unique about that experience, versus interactive or full compiles, was the ability to compile individual functions in a text file. They called it incremental compilation. I type a function stub, compile (fraction of a second), fill it in a bit, compile, test it in the REPL, fix or move on. So, I was still using an editor with the normal feel of that (well, Emacs’s take on it…) with the advantages of a REPL plus a bit faster execution than interpreted.

              So, let me modify my previous statement to say that languages like C need livecoding in images that are hard to totally trash, plus incremental compilation. :)

          2. 2

            Is any of that code available online? I have been dealing with similar problems recently. My solution has been to use Jupyter notebook to keep the raw data live and swap out the code as necessary. Definitely helps speed things up. (Especially if you have some modifications that you need to make on the data beforehand.)

            1. 7

              Alas, that trick never even made it into the proprietary repository I was working on. It was straightforward enough:

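              import os

              # Slow part: load and link the data once, in the parent process.
              data = be_back_in_an_hour()
              keep_going = True
              while keep_going:
                  pid = os.fork()
                  if pid == 0:
                      # Child: sees a copy-on-write view of data. Importing here,
                      # not in the parent, picks up the latest one_more_time.py.
                      import one_more_time
                      one_more_time.for_the_world(data)
                      os._exit(0)
                  else:
                      # Parent: wait for the child, then ask whether to go again.
                      os.waitpid(pid, 0)
                      keep_going = iowait_ask_try_again()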
              

              Here be_back_in_an_hour runs, and data contains effectively all of the memory our application is going to be using. After os.fork a copy [on write] of data is available in both our child and parent process. In the child process, we import one_more_time and invoke one_more_time.for_the_world on data. That’s the code I was actively debugging, and I wasn’t willing to wait for be_back_in_an_hour each time. If I found a bug I’d modify one_more_time.py and run another loop.

              There is a lot to be desired in the above code. I had to be careful not to send SIGINT to my parent process while debugging and if I happened to find bugs in data I had to both patch it at runtime in one_more_time.for_the_world and correct the bug in be_back_in_an_hour. It got me through an otherwise unacceptably long debug cycle, though.
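A self-contained variant of the loop above might look like the sketch below. It is an illustration under assumptions, not the original code: `run_in_child`, the `analysis` module name, and its `run(data)` entry point are all hypothetical, and it is Unix-only because it relies on `os.fork`. It also catches exceptions in the child so the parent can see from the exit code whether the run succeeded.

```python
import importlib
import os
import sys
import traceback


def run_in_child(data, module_dir, module_name="analysis"):
    """Fork, import module_name fresh in the child, and call its run(data).

    Returns the child's exit code: 0 on success, 1 if the import or run()
    raised, and -1 if the child was killed by a signal.
    The parent must never import module_name itself, otherwise the child
    inherits the already-loaded (stale) module through sys.modules.
    """
    pid = os.fork()  # Unix only
    if pid == 0:
        # Child process: has a copy-on-write view of `data`.
        status = 1
        try:
            sys.dont_write_bytecode = True  # do not leave .pyc files behind
            sys.path.insert(0, module_dir)
            module = importlib.import_module(module_name)
            module.run(data)
            status = 0
        except BaseException:
            traceback.print_exc()
        finally:
            # os._exit skips normal cleanup, so flush buffered output first.
            sys.stdout.flush()
            sys.stderr.flush()
            os._exit(status)
    # Parent process: wait for the child and translate its wait status.
    _, wait_status = os.waitpid(pid, 0)
    if os.WIFEXITED(wait_status):
        return os.WEXITSTATUS(wait_status)
    return -1
```

The intended use is to load the data once, then call `run_in_child(data, "/path/to/code")` in a loop, editing the (hypothetical) `analysis.py` between calls until it returns 0.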

              For most problems of this sort, the better option is to serialize data into a pickle or some other format that can be read quickly from disk, either all at once or in pieces. It was only this one time that, for whatever reason, the return value from be_back_in_an_hour was only available hot.

              1. 2

                Man, that’s a great trick! It’s a great example of how knowing your environment allows you to do complex work simply. Thanks for sharing.

                1. 2

                  I usually do this by checkpointing intermediate results to disk as a cache. But your solution definitely works if caching data to disk won’t work for some reason.
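For reference, the disk-cache approach could be sketched roughly as follows. This is a minimal illustration assuming the intermediate result can be pickled; `load_or_compute` and its parameters are hypothetical names, not anything from the thread.

```python
import os
import pickle


def load_or_compute(cache_path, compute):
    """Return the cached result at cache_path, or call compute() and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    data = compute()
    # Write to a temporary file first so a crash mid-write cannot leave
    # a truncated cache file behind under the real name.
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "wb") as fh:
        pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, cache_path)
    return data
```

Only the first call pays for `compute()`; later runs just read the file. Deleting the cache file forces a recompute, and pickles should only be loaded from files you trust.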

                  1. 1

                    I like this because sometimes you want to deal with data that takes 30 seconds to load and only 5 seconds to process. (Or some other lopsided ratio.) It’s nice to have to worry only about the 5 seconds of execution time.

                    1. 1

                      If you can, saving your work to an intermediate file is vastly superior to the throw-away thing I did above, agreed.

              2. 3

                The offhand ‘even perl’ in there struck me as unfair. It reminds me that perl is actually pretty fast (specifically at startup, but my recollection was also that it runs quickly):

                $ time for i in `seq 1 1000`; do perl < /dev/null; done
                
                real    0m2.786s
                user    0m1.337s
                sys     0m0.686s
                
                $ time for i in `seq 1 1000`; do python < /dev/null; done
                
                real    0m19.245s
                user    0m9.329s
                sys     0m4.860s
                
                $ time for i in `seq 1 1000`; do python3 < /dev/null; done
                
                real    0m48.840s
                user    0m30.672s
                sys     0m7.130s
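The same kind of loop can be written in Python if per-launch numbers are wanted rather than one total. This is a rough sketch: `time_launches` and `summarize` are hypothetical helper names, and like the shell loops it measures the whole process lifetime (startup plus teardown), along with the cost of spawning from Python.

```python
import statistics
import subprocess
import sys
import time


def time_launches(argv=(sys.executable, "-c", "pass"), runs=20):
    """Run argv `runs` times with stdin from /dev/null; return seconds per launch."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            list(argv),
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (mean, sample standard deviation) of the launch times."""
    return statistics.mean(samples), statistics.stdev(samples)
```

For example, `summarize(time_launches(["perl", "-e", "1"]))` would give a mean and spread for Perl, assuming `perl` is on the PATH.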
                
                
                1. 1

                  I can’t comment on how fast Perl is, but you are measuring the time taken to tear down here too.

                  The correct way would be to take the raw monotonic time immediately before invoking the VM, then, inside the guest language, immediately print the monotonic time again and take the difference.

                  P.S. Wow Python3 is slower.
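A sketch of that measurement, with Python as both the harness and the guest, might look like this. It assumes a Unix system where `CLOCK_MONOTONIC` is shared across processes (true on Linux); `measure_startup` is a hypothetical name, and the guest's own `import time` is counted as part of startup.

```python
import subprocess
import sys
import time

# Guest program: print the monotonic clock as soon as the interpreter is up.
GUEST = "import time; print(repr(time.clock_gettime(time.CLOCK_MONOTONIC)))"


def measure_startup(interpreter=sys.executable):
    """Return (startup, rest) in seconds for one launch of a Python interpreter.

    startup: from just before spawning until the guest reads the clock.
    rest:    from that point until the parent sees the process finish,
             which covers teardown and reaping the child.
    """
    before = time.clock_gettime(time.CLOCK_MONOTONIC)
    result = subprocess.run(
        [interpreter, "-c", GUEST],
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        check=True,
    )
    after = time.clock_gettime(time.CLOCK_MONOTONIC)
    inside = float(result.stdout.strip())
    return inside - before, after - inside
```

Reporting the two numbers separately keeps teardown out of the startup figure, which is the distinction being drawn here. A single launch is noisy, so in practice this would be repeated many times.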

                  1. 2

                    but you are measuring the time taken to tear down here too.

                    I guess so? I’m not sure that’s a useful distinction.

                    The people wanting “faster startup” are also wanting “fast teardown”, because otherwise you’re running in some kind of daemon-mode and both times are moot.

                    1. 1

                      The people wanting “faster startup” are also wanting “fast teardown”

                      Yeah, I guess I agree that they should both be fast, but if we were measuring for real, I’d measure them separately.

                      1. 1

                        I’m not sure that’s a useful distinction.

                        If latency matters then it could be. If you’re spawning a process to handle network requests for example then the startup time affects latency but the teardown time doesn’t, unless the load gets too high.

                    2. 1

                      Hah before I read the comments I did the same thing! My results on a 2015 MBP - with only startup and teardown on an empty script, and I included node and ruby also:

                      ~/temp:$ time python2 empty.txt 
                      real    0m0.028s
                      user    0m0.016s
                      sys     0m0.008s
                      
                      ~/temp:$ time python3 empty.txt 
                      real    0m0.042s
                      user    0m0.030s
                      sys     0m0.009s
                      
                      ~/temp:$ time node empty.txt 
                      real    0m0.079s
                      user    0m0.059s
                      sys     0m0.018s
                      
                      ~/temp:$ time perl empty.txt 
                      real    0m0.011s
                      user    0m0.004s
                      sys     0m0.002s
                      
                      ~/temp:$ time ruby empty.txt 
                      real    0m0.096s
                      user    0m0.027s
                      sys     0m0.044s
                      
                      1. 2

                        Ruby can do a bit better if you don’t need gems (and it’s Python 3 here):

                        $ time for i in $(seq 1 1000); do ruby </dev/null; done
                        
                        real	0m31.612s
                        user	0m27.910s
                        sys	0m3.622s
                        
                        $ time for i in $(seq 1 1000); do ruby --disable-gems </dev/null; done
                        
                        real	0m4.117s
                        user	0m2.848s
                        sys	0m1.271s
                        
                        $ time for i in $(seq 1 1000); do perl </dev/null; done
                        
                        real	0m1.225s
                        user	0m0.920s
                        sys	0m0.294s
                        
                        $ time for i in $(seq 1 1000); do python </dev/null; done
                        
                        real	0m13.216s
                        user	0m10.916s
                        sys	0m2.275s
                        
                        1. 1

                          So as long as python3 is faster than ruby/node, we are ok…?