1. 10
  1.  

  2. 5

    Or you can use q:

    .j.k[.Q.hg`:https://www.reddit.com/r/pics/.json][`data;`children;;`data;`url]
    

    Let’s break down what’s going on:

    .Q.hg`:https://www.reddit.com/r/pics/.json
    

    calls .Q.hg with the URL, which returns the content.

    .j.k[.Q.hg`:https://www.reddit.com/r/pics/.json]
    

    calls .j.k on that content, which parses the JSON into a q object. It looks kind of like this:

    kind| "Listing"
    data| `modhash`children`after`before!("";+`kind`data!(("t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3";"t3"..
    

    Now we can explore this interactively with:

    x:.j.k[.Q.hg`:https://www.reddit.com/r/pics/.json]
    x[`data]
    

    Note that function application and indexing use the same syntax in q. The second line returns:

    modhash | ""
    children| +`kind`data!(("t3";"t3";"..
    after   | "t3_6j94fw"
    before  | 0n
    
    x[`data;`children]
    kind data                                                                                                                                                                ..
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------..
    "t3" `contest_mode`subreddit_name_prefixed`banned_by`media_embed`thumbnail_width`subreddit`selftext_html`selftext`likes`suggested_sort`user_reports`secure_media`link_fla..
    "t3" `contest_mode`subreddit_name_prefixed`banned_by`media_embed`thumbnail_width`subreddit`selftext_html`selftext`likes`suggested_sort`user_reports`secure_media`link_fla..
    "t3" `contest_mode`subreddit_name_prefixed`banned_by`media_embed`thumbnail_width`subreddit`selftext_html`selftext`likes`suggested_sort`user_reports`secure_media`link_fla..
    "t3" `contest_mode`subreddit_name_prefixed`banned_by`media_embed`thumbnail_width`subreddit`selftext_html`selftext`likes`suggested_sort`user_reports`secure_media`link_fla..
    "t3" `contest_mode`subreddit_name_prefixed`banned_by`media_embed`thumbnail_width`subreddit`selftext_html`selftext`likes`suggested_sort`user_reports`secure_media`link_fla..
    ..
    

    and so on, until we see the data we’re interested in.
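    The same kind of exploration can be tried offline on a small hand-made structure. This is only a sketch: m, rows, f and the sample values are made up for illustration and are not what Reddit actually returns. It shows that a skipped index, like the empty slot in [`data;`children;;...], means “every item at this level”.

    ```q
    / hypothetical miniature of the parsed response; the values are made up
    rows:(`url`id!("u1";1);`url`score!("u2";7))
    m:`kind`data!("Listing";`after`children!("t3_abc";rows))

    m[`kind]                 / "Listing"
    m[`data;`after]          / "t3_abc"
    m[`data;`children;;`url] / ("u1";"u2"): the skipped index means every child

    f:{x*2}                  / a function is applied with the same brackets
    f[3]                     / 6
    ```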

    Let’s make this a function:

    g:{m:.j.k[.Q.hg`$":https://www.reddit.com/r/pics/.json",first x];t:m[`data;`after];($[0n~t;first x;"?after=",t];m[`data;`children;;`data;`url])};
    

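    This “gets” the two pieces of information we’re interested in: the data.after path and the data.children..data.url path. The implicit argument x is a pair whose first item is the query string for the page to fetch. You can see that lambdas are written with curly braces and assignment is done with a colon.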

    `$":https://..."
    

    casts a string to a symbol, and a symbol that starts with a colon is a handle. Handles are special symbols that we can perform IO on. Having a separate type for this prevents accidental injection attacks, because joining two symbols gives a list of symbols rather than one longer symbol; to build a new handle you have to (deliberately) work with character arrays and cast the result. The other thing is:

    $[0n~t;TRUE;FALSE]
    

    is how we write “if t is null then TRUE, else FALSE” in q: ~ is match, and 0n is what a JSON null parses to. Oh, and g can be made shorter, but I would have to discuss more q to do so.
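    Both pieces can be tried in isolation. The following is a minimal sketch; s, h and nxt are hypothetical names introduced here, and nxt only mimics the branch inside g without fetching anything.

    ```q
    s:":https://www.reddit.com/r/pics/.json"
    h:`$s                    / cast the string to a symbol, giving a handle
    type s                   / 10h, a character list
    type h                   / -11h, a symbol atom
    count `a,`b              / 2: joining symbols gives a list of two symbols
    `$"a","b"                / `ab: join the strings first, then cast

    / t stands in for data.after
    nxt:{[t] $[0n~t;"last page";"?after=",t]}
    nxt 0n                   / "last page"
    nxt "t3_6j94fw"          / "?after=t3_6j94fw"
    ```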

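    However, with that one function we can use scan to collect all of our lists. Given a one-argument function, scan keeps applying it to its own result, collecting every intermediate value, until the result stops changing: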

    raze last flip g scan ("";())
    

    Done!

    • raze (flatten)
    • last (take the final item)
    • flip (transpose; swaps the top two dimensions)
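    To see how the pieces fit together without hitting the network, here is a sketch with a hypothetical in-memory stand-in for the paginated API. The names pages, step and r are made up for illustration; step plays the role of g, returning a (next cursor; items) pair, and the last page points back at itself so that scan converges.

    ```q
    / hypothetical pages: each cursor maps to (next cursor; items)
    pages:`p0`p1`p2!((`p1;1 2);(`p2;3 4);(`p2;5 6))

    / look up the cursor held in the first item of the pair
    step:{pages first x}

    / scan stops once a result matches the previous one
    r:step scan (`p0;())
    count r                  / 4: the seed plus one pair per page
    last flip r              / (();1 2;3 4;5 6): the items of every pair
    raze last flip r         / 1 2 3 4 5 6
    ```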
    1. 3

      …though it costs ~$100k/core to run q which seems like a lot of money to part with for a screenscraper.

      1. 1

        I don’t think it’s that expensive unless you get consulting and support as well.

    2. 2

      This is neat. I really enjoy seeing how short the experts can make one-liners.

      When I made F5Bot public, the Reddit scraper was only a handful of lines. It worked for a while, but today it’s thousands of lines. It’s amazing how many edge cases surface. Also, the Reddit API is really quirky.