1. 31
  1.  

  2. 6

    Posting because this is the not-awk I’ve been looking to write, and it’s already written; hopefully others will also find it a useful tool

    1. 6

      I’d be interested in a blog post that compares programs in Tab and Awk… I’ve seen Tab, but so far I’ve stuck with Awk because it’s there, and it works, and other people know it.

      IMO if it doesn’t have a compelling explanation, it might not be worth the time to implement… I have debugged my language by explaining it to other people.

      edit: I guess this is the closest page, and it seems interesting, although not that different than the normal Unix ways of doing things? I think it’s possibly more consistent, but it doesn’t seem different enough to be compelling.

      https://tkatchev.bitbucket.io/tab/examples.html

    2. 5

      What does O(n) memory usage mean? Every language has O(n) memory usage, where n is the number of allocated objects.

      1. 3

        Give that this is intended to work on a text file, I assume n is the size of this input.

      2. 2

        Some other awk-like languages from the Oil wiki [1]

        (1) Miller: https://miller.readthedocs.io/en/latest/reference.html

        Impressive in many ways, but it looks like it started as “cut” for CSV/TSV, and then grew a programming language …

        (2) TXR, a Lisp-based language, although it also has relatively confusing syntax: http://nongnu.org/txr/

        I get the feeling that these languages are “too different to be worth it”, and tend to grow beyond their domain of applicability.

        More specifically, I would draw a distinction between streaming lines of text and operating on tables. The former feeds into the latter. In other words, clean your data with awk, then analyze it with SQL or R.

        SQL and R are both extremely powerful and can do basically anything these languages can. But sometimes you do want a small sharp tool to cut down a big data set, since it’s easy to use both SQL and R in extremely slow ways.

        But it also means that you don’t want these languages to grow too many features, like mathematical libraries, etc. because that is better done with the “real thing” as a separate step.

        What is a Data Frame? (In Python, R, and SQL)

        [1] https://github.com/oilshell/oil/wiki/ExternalResources

        1. 6

          Perl is also a pretty good replacement for awk. It’s installed everywhere, runs on everything.

          1. 7

            And it behaves the same on Linux and BSD ;)

            1. 5

              You reminded me of a2p!

          2. 1

            Re: What is a Data Frame? (In Python, R, and SQL), your SQL is awkward because it’s approximating an imperative style.

            It would be nice if SQL had Python-like scalar variables (i.e. variables that aren’t tables).

            You can avoid this using window functions:

            with total_hits_per_url as (
              select url, sum(hits) as hits
              from traffic
              group by url
            )
            select url, hits * 100.0 / sum(hits) over () as percentage
            from total_hits_per_url
            order by percentage desc;
            

            Of course this style is more verbose in this simple case, but it preserves the sequential pipeline of transformations that makes dplyr code so readable.

            1. 1

              Yes that’s basically what the linked run.sh code does. The comment there shows that update came from the original lobste.rs discussion!

              https://github.com/oilshell/blog-code/blob/master/data-frames/run.sh#L79

              I didn’t change the blog post code since it is less familiar – I just mentioned the possibility. Although I should probably have both styles in run.sh for comparison.

              1. 1

                Not quite: run.sh approximates a total_hits variable definition by joining with a 1 row 1 column CTE. Logically speaking, a join has nothing to do with dividing the total hits per url by the sum of all hits. Using a window function more closely matches the logical intent.

                In this simple example there isn’t much difference. But for larger queries, particularly those already involving many joins, using joins to implement window functions leads to confusing SQL. At every join you must ask, are these tables intermediate result sets, or are they implementation details?

                1. 1

                  Hm I tried to dig deeper into this and am now very confused… I haven’t really used window functions or CTE before.

                  https://github.com/oilshell/blog-code/tree/master/data-frames

                  1. I get a syntax error in your example. Try this:

                    ./run.sh csv2sqlite ./run.sh with-window

                    Error: near line 1: near “(”: syntax error

                  Is that because sqlite uses a different syntax? They seem to have a WINDOW function.

                  https://sqlite.org/windowfunctions.html

                  I would appreciate a patch to make it work; I couldn’t do it.

                  1. I noticed the with-cte version is actually broken!

                  The correct answer is done by with-sql:

                  Popular Pages:
                  /data-frames.html|59.8563447725459
                  /blog/|32.7214684756584
                  /site.html|7.42218675179569
                  

                  The with-cte version prints:

                  /data-frames.html|59.8563447725459
                  /blog/|14.365522745411
                  /site.html|2.47406225059856
                  

                  I don’t know why that is! This is why I don’t use SQL for analytics :-/

                  1. 2

                    OK I fixed the CTE query – it was missing a SUM()!

                    https://github.com/oilshell/blog-code/commit/75d4f86fc7b6ee8d32236089e9b147375cb5f46a

                    Looks like the bug was present in the original suggestion!

                    https://lobste.rs/s/hnfc6a/what_is_data_frame_python_r_sql#c_0os5e8

                    This is kinda why I like having intermediate results to “debug” … I think modern SQL engines should be good at optimizing those?

                    I get that you want to do things “in one step” with SQL, but I think that sequence of transformations is easier to debug, and the whole point of the optimizer should be to “erase” those boundaries.

                  2. 1

                    OK it looks like the problem is that sqlite3 from 2016 doesn’t support window functions!

                    I get the same syntax error on the example here with row_number(), etc.

                    $ sqlite3 --version
                    3.11.0 2016-02-15 17:29:24 3d862f207e3adc00f78066799ac5a8c282430a5f
                    

                    Looks like they were only added in 9/2018 as of version 3.25 !

                    https://sqlite.org/windowfunctions.html


                    edit: I got the window function working. I agree this is a nicer style (especially since the other one had a bug!)

                    https://github.com/oilshell/blog-code/blob/master/data-frames/run.sh#L136

                    I’m still a bit confused about OVER (), but it doesn’t work without that. To me there is no “Windowing” going on here? It just looks like a Common Table Expression?

            2. 0

              Given the name, I was disappointed it wasn’t a Whitespace variant. Can people please give names to things that are at least vaguely unique?

              1. 4

                Sorry it’s not a meme language but a practical tool, I assume it’s named tab because it’s a string processing language named in the Unix tradition of having a short name