1. 2

    Found at FrugalMap.cs:

    // If we’re unsorted and we have entries to sort, do a simple // bubble sort. Sort the pairs, 0..5, and then again until we no // longer do any swapping.

    Dying to know what the rationale was here. Does Bubblesort have really good performance on really small lists?

    1. 3

      Bubblesort is in place and requires no setup as most other potentially faster sorting algorithms does.

      You’re also bounded to at most 15 comparisons and an early bailout if nothing was swapped.

      So it’s probably good enough that the main time is spent elsewhere

      1. 2

        Yup! Bubblesort is much faster at small lists because the constant factor overhead is almost negligible (for exactly the reasons that @jeb mentioned). Usually you see insertion sort instead of bubble sort, which has the same O(N^2)-but-almost-zero-overhead time complexity, because insertion sort ends up doing fewer swaps, but the general idea (solve small sorting problems with O(N^2) algorithms) is the same. This is a weird property of asymptotic complexity analysis: the idea behind expressing complexity as O(f(n)) is to give you an idea of how some kind of resource usage (time, memory) grows as the problem grows. It really tells you very little about how the algorithm will perform on smaller instances of the problem.

        In fact, one of the more popular sorting algorithms today, Timsort, often boils down to “split into small lists of length N, run insertion sort on those, and then do a merge-sort-like merging until the whole thing is sorted”. Most production-grade quicksort implementations also do this: break the list down into small chunks, insertion-sort those, and then do the quicksort magic back up the stack. Many sorting algorithms do exactly what FrugalMap is doing as well: you can unroll the loop in the insertion sort step, since insertion sort on 5 elements ends up being a relatively small chunk of code, and you avoid conditional jumps in the common case that your short array is exactly the maximum length allowed by your algorithm.

      1. 2

        Hey Andy.

        For your SQL query, have you looked into the with statement?

        Then you could rewrite the query to something along these lines

        WITH total (hits)
        AS
        (
            SELECT SUM(num_hits)
            FROM traffic
        )
        
        SELECT url,
               total.hits * 100.0 / total.hits
               AS percentage
        FROM traffic
        GROUP BY url
        ORDER BY percentage DESC;
        

        Keep up the interesting blogs posts :)

        1. 2

          Thanks, I didn’t realize that was SQL 99, and sqlite supports it. I will edit the post!

          1. 2

            Hm I was able to do this in sqlite, but it doesn’t look much better. It seems like you need still need a SELECT and you can’t use a syntax like total.hits.

            WITH total (hits) AS (SELECT SUM(num_hits) FROM traffic)       
            SELECT url, SUM(num_hits) * 100.0 / (SELECT hits FROM total) AS percentage 
            ...
            

            Thanks for the suggestion though. I will update the blog post but I think it still illustrates the awkwardness of SQL.

            1. 4

              Try this first because the “awkwardness” of SQL (which is really just Dataframes + simplified constraint solver) is the same awkwardness as any other dataframe. The only difference is that we don’t have to reinvent aggregation or deal with the fragility of implementation across M different scientists (and all the various edge cases)

              -- Create a single row, single column table with the sum of the hits:
              WITH total (hits)
              AS
              (
                  SELECT SUM(hits)
                  FROM traffic
              )
              -- And then use it as a cross join 
              -- (1 row from total * N rows from traffic) = N total rows
              SELECT url,
                     traffic.hits * 100.0 / total.hits
                     AS percentage
              FROM
                  traffic,
                  total
              GROUP BY url
              ORDER BY percentage DESC
              ;
              

              No inner query necessary. And looks just as nice as the Dataframe examples, if not better.

              Can we have a redemption arc for SQL because this is what JOINs are good at?

              Edit: Proof of work:

              $ sqlite3
              SQLite version 3.19.3 2017-06-27 16:48:08
              Enter ".help" for usage hints.
              Connected to a transient in-memory database.
              Use ".open FILENAME" to reopen on a persistent database.
              sqlite> CREATE TABLE traffic (
                 ...>      date DATETIME NOT NULL,
                 ...>      URL TEXT NOT NULL,
                 ...>      hits INTEGER NOT NULL DEFAULT 0,
                 ...>      PRIMARY KEY(date, URL)
                 ...>  );
              sqlite>
              sqlite> INSERT INTO traffic
                 ...>  VALUES
                 ...>      ("2018-11-30",  "/site.html",  300),
                 ...>      ("2018-11-30",  '/blog/'  , 1000),
                 ...>      ("2018-12-01",  '/site.html',  320),
                 ...>      ("2018-12-01",  '/blog/',  1300),
                 ...>      ("2018-12-02",  '/site.html',  310),
                 ...>      ("2018-12-02",  '/blog/',  1800),
                 ...>      ("2018-12-02",  '/data-frames.html',   7500)
                 ...> ;
              sqlite> -- Create a single row, single column table with the sum of the hits:
              sqlite> WITH total (hits)
                 ...> AS
                 ...> (
                 ...>     SELECT SUM(hits)
                 ...>     FROM traffic
                 ...> )
                 ...> -- And then use it as a cross join (1 row * N rows) = N rows
                 ...> SELECT url,
                 ...>        traffic.hits * 100.0 / total.hits
                 ...>        AS percentage
                 ...> FROM
                 ...>     traffic, total
                 ...> GROUP BY url
                 ...> ORDER BY percentage DESC
                 ...> ;
              /data-frames.html|59.8563447725459
              /blog/|14.365522745411
              /site.html|2.47406225059856
              sqlite>
              
              1. 2

                Thanks, that is better. I updated the code here and gave you credit:

                https://github.com/oilshell/blog-code/blob/master/data-frames/run.sh#L70

                For this case, SQL works OK. I think the data frame syntax is nicer, but you could argue it the other way.

                I still stand by my claim that SQL gets unwieldy for complex analysis. I don’t think this is controversial – most people including myself have dealt with 200 line SQL queries with no abstraction, that take forever to run.

                Here are some lengthier dplyr pipelines – I shudder to think how they would be expressed in SQL :)

                https://github.com/oilshell/oil/blob/master/benchmarks/report.R


                Also, data manipulation is only part of what R does. As mentioned, it is really meant for statistics, visualization, and mathematical modelling, and SQL isn’t appropriate in those domains. There’s a fuzzy boundary in the middle of a data science workflow where you could use either.

                My general guideline is to cut down the data with SQL – from terabytes to gigabytes – and then do analysis in R on a single machine. SQL is for extraction and R is for analysis. This is a very nice workflow IMO.


                Also note that sqlite supports direct CSV import: https://github.com/oilshell/blog-code/blob/master/data-frames/run.sh#L65)

                Thanks for the suggsestion (and testing it! )

                1. 2

                  Usually when your SQL statement gets too unwieldy its due to lack of CTEs or trying to do too much in one query.

                  Perhaps the issue with SQL is that people compare a (usually) singly scoped statement against a script file of successive statements. Most people in that situation will opt for few giant, ugly SQL statements instead of using simpler successive imperative SQL statements that pass state via session variables.

          1. 9

            …Where can I learn more about this experimental rilkef?

            1. 3

              Rilkef? That was so November. You need to keep up to date with the industry.

              1. 2

                This is the real question

              1. 4

                The idea is good but… It’s almost like the whole point of spreadsheets are to have a powerful way of manipulating data that doesn’t require much programming skills but still can do a whole lot of value adding.

                Of course you can do all the summations using a range of command line tools… But does it really but pressing ALT and =?

                I remain a bit doubtful for the approach :)

                1. 3

                  For doing calculations my approach is to import the CSV/TSV into SQLite and query it there. It’s remarkably faster than importing into a spreadsheet program and doesn’t have arbitrary row count limitations.

                  1. 3

                    I didn’t know that was possible… Seems like a nice feature to know about. For future reference 1 is a link to the documentation about importing and exporting CSV files. For TSV files, see 2

                    1. 2

                      Years and years ago, I was cooking up a Django project that would read in downloadable CSV bank statements from my online bank.

                      The purpose was this, but as the statement downloads can’t be automated and banks have been improving their reporting features, the project dwindled off.

                      Being able to query by SQL or Django admin is a superpower, though.

                      This Vim TSV business might be better suited for eg. how translators sometimes use CSV/TSV as an interchange format.

                  1. 2

                    Going out with friends tonight (It’s the release of the Christmas brew today… I intend on being one of the few sober people in the bar) Tomorrow it’s the yearly Christmas dinner with my old friends from school Sunday it’s boardgames with my friends from Uni.

                    And some relaxing in between =)

                    1. 3

                      Hopefully we’re not bored by rust is not other languages stories yet. Not sure if I learned more about rust or ruby here.

                      Rust is telling me that iter() yielded references to integers, but my code expected an actual integer, not a reference to an integer.

                      Now the array was mutated! It turns out Ruby passed integers to the closure by value, but strings by reference. Updating each string inside the loop also updated that string inside the array.

                      I managed to be surprised by both those statements.

                      1. 1

                        ruby passes everything by reference; the issue in the code is that i = i + 1 rebinds i as a reference to a new integer object. the other half of the puzzle is that integers are immutable, and strings are not. the writer managed to confuse the issue by saying

                        str = str << "-mutated"
                        

                        instead of

                        str = str + "-mutated"
                        

                        in both cases, str is rebound to the result of the RHS, but the + operator adds two strings and returns a new string, whereas the << operator modifies a string object in place and returns the same object.

                        1. 1

                          Ok, using << seems pretty sneaky here. Thanks for pointing that out.

                        2. 1

                          Have you ever changed the value of 4?

                          In a language other than FORTRAN?

                          (In a more modern take on that old story, I have done it in Python…)

                          1. 3

                            Yes. To great amusement I have played around with reflection in Java and changed the values of signed bytes.

                            For those who haven’t played around with it previously, the JVM has an internal array that caches all signed byte values1.

                            While both private, internal and static it’s still possible to access the field using reflection.

                            If you swap references in that array, sudden 5 becomes 4.

                            1. 1

                              I’ve done the “redefining the value of true” gag in squeak smalltalk. It ends about as poorly as you’d expect.

                          1. 3

                            I’m going to look a bit at my little Java functional utils project. Basically I’m doing a Try, Either, Tuple (0 to 8 items), and possibly yet another Optional - is there something obvious I’ve missed that you guys could see as a nice addition?

                            And then I’m going to the big city to attend a concert with Shania Twain on Sunday.

                            1. 10

                              I do photography, board games, floorball and hiking.

                              Few things are as refreshing as just walking a few kilometers after work :-)

                              1. 4

                                Nothing on the plan. And I will enjoy it.

                                1. 8

                                  As much as I’d love to see people actually pay for tools, I don’t quite get why they’re trying this with the JDK.

                                  1. 25

                                    Because people will fall into the trap, and they will get a call from Oracle’s compliance department, and Oracle will make money, and that will buy enough fuel to power the Larry Ellison for another day or so.

                                    1. 10

                                      Wanted to mention the following comment on the same note:

                                      https://palisadecompliance.com/oracle-org-chart/

                                      Granted the article is opionated and a bit dated (2013) but this shows what Oracle is capable of.

                                      1. 5

                                        I don’t know if it’s deliberate or by accident, but I like the idea of fueling a Larry Ellison.

                                        But it will be a drag to ensure proper compliance with these new rules…

                                        1. 30

                                          “…You need to think of Larry Ellison the way you think of a lawnmower. You don’t anthropomorphize your lawnmower, the lawnmower just mows the lawn, you stick your hand in there and it’ll chop it off, the end. You don’t think ‘oh, the lawnmower hates me’ – lawnmower doesn’t give a shit about you, lawnmower can’t hate you. Don’t anthropomorphize the lawnmower. Don’t fall into that trap about Oracle.” – Bryan Cantrill

                                          https://www.youtube.com/watch?v=-zRN7XLCRhc

                                    1. 27

                                      I use https://protonmail.com. I wanted a Gmail alternative that was private and fully encrypted. I pay for the plus model so I can use my domain, I did not want the hassle or expense of a self-hosted model. I have been completely happy with Protonmail. I have used them since they were in beta.

                                      1. 7

                                        Yes, +1 for ProtonMail. From the small research I’ve done, they’re the most secure email provider. I also use my own domain.

                                        1. 5

                                          ProtonMail is great. The search function is a little bit slow, but since its encrypted at rest it kind of has to be.

                                          There are a couple of features that are great. The one I get the most use out of is having multiple address connect to the same email account. I have several email addresses, one for personal use, one used for signing up accounts, one for newsletters (or other noisy notifications), one scoped to projects, etc.

                                          There is also ProtonMail’s Bridge that gets around some of the security issues with IMAP/POP creating a connection over TLS, which then locally runs a IMAP/POP server on your machine.

                                          They have also had their OpenPGPjs (A opensource PGP impl in JS) library audited.(1)

                                          2 major caveats for anyone who is considering an encrypted email service is that

                                          1. Email is inherently insecure. It is hitting protonmails server in plain text possibly without StartTLS.
                                          2. You are probably going to forfeit some functionality for the this feature.

                                          1: It wasn’t directly them, more the community around OpenPGPjs, which they are part of. I’m also unsure of the original ownership of this project, but that can get muddied with opensource sometimes.

                                          1. 3

                                            I also use protonmail, no particular complaints about it.

                                            1. 3

                                              I have the Visionary plan and seamlessly migrated my email to them - including my whole archive which goes back about 13 years or so, once the bridge was out.

                                              It’s a very nice and simple web client, and the apps are good enough that they just work for my parents.

                                              Overall, I like it very much.

                                            1. 3

                                              Finishing up a small parser for map files to try and figure out if there’s a correlation between function addresses and apparent performance issues due to caching in an embedded device.

                                              1. 1

                                                If you’re doing java or c#, I’ve used Quartz(.NET) with success. It’s XML configured but it’s bearable once you learn it, and it has a bunch of different failure handling modes which comes in handy when doing real things :-)

                                                1. 1

                                                  A bit of a mess, although I’ve still to try to work with something as good as WPF, it has some magic but MVVM seems to be the less worst of the UI patterns.

                                                  1. 3

                                                    Enjoying the final weekend of my holidays for this time. Going to go to Guggenheimer and MoMA tomorrow and doing a bit of build your own adventure on Sunday before I jump on the plane home to Denmark.

                                                    Any cool tech museums in NYC that isn’t Intrepid? :-)

                                                    1. 2

                                                      I’m just about to fly out to Calgary to visit a friend.

                                                      So continuing my Canada adventures, which so far has been a few days in Toronto and then on the Canadian to Vancouver… I’ll have to get back at some point :-)

                                                      1. 3

                                                        I’m traveling Canada this weekend. On vacation, and going to jump on the train to Vancouver late Saturday evening. I’m checking out Toronto until then.

                                                        1. 2

                                                          This is one of my go-to sites: https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed and they also have courses.

                                                          Obviously, UI/UX/HCI is a huge area, so if there’s a particular focus (what kind of UI? for what purpose? etc.) that you have, I can make more specific suggestions.

                                                          1. 1

                                                            For now I’m just trying to get a general idea and to get to know what to ask about. Thank you for the link.

                                                          1. 5

                                                            For all command line needs, go with Gow, Gnu On Windows.

                                                            For the very limited exposure I’ve had with Ubuntu on Windows, it was pretty good.

                                                            Is it a requirement that the IRC client is open source? If not, have a look at mIRC.

                                                            I think that Windows 10 comes with OpenSSH by default now, see this https://www.bleepingcomputer.com/news/microsoft/heres-how-to-enable-the-built-in-windows-10-openssh-client/

                                                            1. 3

                                                              Thanks very much I hadn’t heard of GOW. I’ll check out mIRC.

                                                            1. 1

                                                              I’m being a good son and took my mum to London with me, as I was going over to attend the friday nights Taylor Swift concert.

                                                              So no programming this weekend.