Threads for apas

  1. 2

    I used to build computational biology pipelines and do a substantial amount of data processing with AWK, and it is one of the best languages I’ve used.

    1. 1

      Hi all! My site’s at https://apas.gr and the blog at https://apas.gr/posts - would love to get some feedback. Thanks!

      1. 2

        It looks quite nice; I guess it’s heavily inspired by the Tufte layout? For the posts I would suggest showing only a summary of each in order to give a better overview of the individual articles. Currently it is a single page that displays all your articles in full length.

        1. 2

          Thanks. Yes, indeed it’s based on the Tufte layout. In fact, I built a minimalist static blog generator based on Tufte, and this is a pertinent WordPress theme. Will look into the posts layout – thanks!

      1. 1

        https://apas.gr; I write about technology, policy, and finance. (Lately, I mostly write about the same topics in my newsletter.)

        1. 2
          • ɛts
          • lɪb
          • car
          • effsock
          • skimə
          1. 1

            I like how thorough your build script is.

            1. 1

              Just the path (\w), and when I ssh it’s the host with the path: [host] \w. Running bash with colors enabled.
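
              In .bashrc terms that’s roughly (an illustrative sketch, not the exact config):

              # plain prompt: just the working directory
              PS1='\w \$ '
              # over ssh: prepend the host
              [ -n "$SSH_CONNECTION" ] && PS1='[\h] \w \$ '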

              1. 9

                I’m glad somebody wrote this up, because I feel like everybody who works with big data learns these lessons independently (sometimes multiple times within the same organization). If you teach CS, please make your students read this before they graduate.

                Understanding these lessons is basically why the office I work at is so much more productive than the rest of the company we’re a part of: there’s been an effort to get incoming devs to understand that, in most cases, it’s faster, cheaper, & easier to use unix shell tools to process large data sets than to use fancy hypebeast toolchains like hadoop.

                There are a couple of things this essay doesn’t mention that would probably speed up processing substantially. One is using LC_ALL=C – if you force the locale to C, no locale processing occurs during piping, which speeds everything up a lot. Another is that if you are using GNU awk, there’s support for running commands and piping to them internally, which means that downloads and posts can actually be done inside AWK – which allows you to open multiple input and output streams and switch between them in a single batch, avoiding some merge steps. Also, one might want to use xargs instead of GNU parallel, because xargs is a more mature tool & one that’s available on basically all Unix machines out of the box.
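
                For instance, the LC_ALL=C and xargs points look like this in practice (file names and field numbers are made up):

                # force the C locale so sort/grep/awk skip locale-aware collation
                LC_ALL=C sort -k1,1 chunks.tsv > sorted.tsv
                # poor man's parallelism with xargs instead of GNU parallel:
                # run up to 8 awk jobs at a time, one per chunk
                ls chunk_*.tsv | xargs -P8 -I{} awk -F'\t' '{ sum += $3 } END { print FILENAME, sum }' {}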

                1. 4

                  One thing I found particularly useful about this post (not evident from the title, but constitutes the first half) is specifics about how the Big Data Science Toolchains can fail, in this case Apache Spark, even when the author tried a bunch of the obvious and less-obvious fixes.

                  The biggest win here seems to be not necessarily the raw processing time due to low-level optimizations in awk, but more big-picture algorithmic wins from “manually” controlling data locality, where Spark didn’t do the right thing automatically, and couldn’t be persuaded to do the right thing less automatically.

                  1. 3

                    Have them read “The Treacherous Optimization” which is all about how GNU grep is so fast: grep is important for its own sake, of course, but the point is that these tools have had decades of work poured into them, even the relatively new GNU tools which postdate the Classic Unix codebases.

                    It’s also an interesting introduction to code optimization and engineering tradeoffs, or tradeoffs where multiple decisions are defensible because none of them are absolutely perfect.

                    1. 3

                      Yeah I’ve personally run into exactly this kind of slowness with R (and Python to a lesser extent), and fixed it with shell. I love R but it can be very slow.

                      That’s part of the reason I’m working on Oil. Shell is still useful and relevant but a lot of people are reluctant to learn it.

                      I posted this in another thread, but it is good to eyeball your computations with “numbers every programmer should know”:

                      https://gist.github.com/jboner/2841832

                      https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html

                      In particular most “parsing” is linear time, so my rule is that you want to be within 2x-10x of the hardware’s theoretical speed. With certain tools you will be more in the 100-1000x range, and then it’s time to use a different tool, probably to cut down the data first. Hardware is cheap but waiting for data pipelines costs human time.
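
                      A quick way to eyeball that on your own data is to compare the pipeline against a trivial linear pass over the same file (illustrative commands):

                      # baseline: how fast can this machine simply stream the file?
                      time wc -l data.tsv
                      time LC_ALL=C awk -F'\t' '{ n += $2 } END { print n }' data.tsv
                      # if the real pipeline is 100-1000x slower than this, reach for a different tool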

                      1. 4

                        When I was working in my first lab I did exactly the same – moved an existing computational biology pipeline off of R to AWK, lots of shell plumbing, GNU Parallel, and a Flask front-end server which submitted jobs to GridEngine. That brought runtime down from 40 minutes to about 30 seconds for one genome. R is nice but can be slow (also, it was just a prototype.)

                        The pivotal lesson I learned was to embrace the battle-tested technologies of the shell stack and everything Unix instead of fancy-pants modern stacks and tools on top of Hadoop, Spark, and others. “Someone probably solved your problem in the 80s” from the author rings absolutely true.

                        Others call it Taco Bell programming.

                        1. 2

                          Taco Bell programming is amazing; I’m saddened by the fact that it has become quite esoteric. I wish this knowledge were more widespread in the industry.

                      2. 1

                        You must be very glad. 3 identical comments 😅

                        1. 1

                          Just a glitch. My mouse’s debounce doesn’t work properly, and lobste.rs doesn’t properly deduplicate requests, so when I click post it sometimes emits several duplicate requests which the server treats as duplicate comments (even though they come from the same form).

                          There was a patch applied for this a year ago, but either it didn’t work or it never migrated from the git repo to the live version of the site.

                      1. 1

                        I remember reading about zsh’s startup and prompt latencies being considerably worse than bash but can’t recall where (either Github Issues comments or a Dan Luu-like post.) In any case:

                        echo "Changing shell to brew's bash. . ."
                        brew install bash
                        echo "$(brew --prefix)/bin/bash" | sudo tee -a /etc/shells > /dev/null
                        chsh -s $(brew --prefix)/bin/bash
                        
                        1. 3

                          Funny, I thought the performance issues were the other way around.

                          I certainly notice that bash’s completion is considerably slower than zsh’s.

                          1. 2

                            I’ve been a zsh user since the early 1990s, maybe 1992 or so, and never really used bash, as I was a ksh user previously, so perhaps my observations can be taken with a grain of salt - but I think the problem here is that a lot of people turn on oh-my-zsh (or similar), enable every plugin and option, and then wonder why the shell takes forever to start up.

                            I do use oh-my-zsh, and for zsh 5.3.1 on a Raspberry Pi 3B the average startup time (over 10 startups) is 0.714s. This compares to 0.204s for bash (uncustomized beyond Debian defaults), 0.176s for tcsh, and 0.008s for mksh. Uncustomized zsh actually launches slightly faster than uncustomized bash.

                            Also, once the shell is running, I find zsh performance (in areas like globbing and completion) to be quite speedy, but I’ve never benchmarked against bash.

                            Edit: After upgrading to zsh 5.7.1 (from 5.3.1), startup time is, on average, 215ms faster - startup now takes ~500ms on average.

                            Profiling shows the speed improvement comes from faster completion initialization.

                            1. 1

                              My 85-line .zshrc, with some fancy-ish prompt stuff and some functions and aliases, but no framework like OMZ, has an average start time of 0.058s. On the same system, bash in its default configuration (as provided by the FreeBSD package) is about 0.008s.

                              I’ve never understood the need for frameworks for my shell. I can implement the same functionality with much less overhead.

                              1. 1

                                The framework isn’t very heavy unless you enable all sorts of things you likely won’t use.

                                Using the zsh profiler (add zmodload zsh/zprof as the first line of your .zshrc, and zprof as the last), I can see that ~50% of my startup overhead is related to completion, and another 30% is related to zsh-syntax-highlighting (which can be slow in usage, but not annoyingly so).
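
                                i.e., something like this:

                                zmodload zsh/zprof    # first line of ~/.zshrc: start collecting profiling data
                                # ... plugins, completion setup, prompt, etc. ...
                                zprof                 # last line: print where startup time went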

                                I don’t see a need to try to optimize the remaining 20% by profiling parts of oh-my-zsh or trying to avoid it. (I do use some of its plugins and then unset or reset a few of the aliases, however, so there is some low-hanging fruit in that 20%.)

                                As this is a comparatively slow CPU, whereas your numbers are probably from a modern desktop, you’d be even less likely to notice the overhead from the framework. It’s not all that heavy.

                            2. 1

                              To be honest, I don’t know much about bash completion because I don’t use it. Standard tab for directory autocomplete with the following Readline settings is all I need.

                              set show-all-if-ambiguous on
                              set completion-ignore-case on
                              set mark-symlinked-directories on
                              set colored-stats on
                              set completion-prefix-display-length 3
                              
                            3. 1

                              Ha! I just had to do this last week when something broke my completions and I realized it was easier to install and use bash from Homebrew than it was to find a workaround for my actual problem. My overengineered solution is here.

                              1. 1

                                I do something similar in my dotfiles setup script.

                            1. 3

                              Brew bundle is indeed very nice, albeit occasionally slow. I’ve generated two Brewfiles (one for Darwin and one for Linux kernels) and integrated them into my dotfiles in order to maintain consistency across different machines. I can think of a few ways to improve it (for instance, pinning brew bundle to a file and updating it automatically whenever software is installed or uninstalled), but all told, I’m quite happy with it.
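
                              For anyone curious, brew bundle makes this a two-command affair (the file name here is illustrative):

                              # snapshot everything currently installed into a Brewfile
                              brew bundle dump --force --file=Brewfile.darwin
                              # on a new machine, install everything listed in it
                              brew bundle --file=Brewfile.darwin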

                              1. 1

                                I’ve done the same thing. A Brewfile in my dotfiles makes it really easy to keep my machines in sync.

                              1. 2

                                It’s interesting you’re using ConTeXt. Here’s my typesetting script (also based on pandoc) with which I typeset everything from plain documents to academic papers with a bibliography to letters. I can also call it from within Vim and render a preview without leaving the editor. For blogging, I’ve built athena around the same pandoc setup.
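
                                The pandoc call at the heart of such a setup looks roughly like this (illustrative flags and file names, not the script verbatim):

                                # bibliography + cross-references, rendered through ConTeXt
                                pandoc draft.md \
                                    --filter pandoc-crossref \
                                    --citeproc --bibliography refs.bib \
                                    --pdf-engine=context \
                                    -o draft.pdf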

                                1. 1

                                  Looks like your script gets the job done! There are some minor improvements that could be made (such as changing #!/bin/bash to the more portable #!/usr/bin/env bash and eliminating some of the duplication), but probably not worth the effort. Good to see crossref being used. Athena’s really cool–have you considered adding an RSS feed to it?

                                  1. 1

                                    Thanks! WRT /usr/bin/env bash and the blatant duplication you’re correct; I’ve thought about both but I also don’t think they’re worth the effort. athena does indeed generate an Atom feed!

                                1. 2

                                    I’m looking for new cool ideas to experiment with on my server after building a Flask server exposing my custom Pandoc environment to typeset from the iPad and an endpoint to transcribe podcast segments and append the text to Evernote, both triggered with iOS Shortcuts. Please, feel free to suggest some.

                                  1. 6

                                      Personal projects I worked on over the weekend: I built a Flask server exposing my custom Pandoc environment to typeset from the iPad with Shortcuts. Pull up the share sheet, run the shortcut, get a PDF. I also wrote a simple endpoint to transcribe podcast segments. In the same vein, pull up the share sheet in Overcast, run the shortcut, and the podcast URL and transcription are appended to an Evernote note.

                                    1. 3

                                      I am going to steal this. Thank you so much. I’m also an iPad main.

                                      1. 3

                                        You’re welcome. Please, feel free to clone.

                                      2. 3

                                        This is so cool! Are you using Pythonista or Editorial on the iPad? Or something else?

                                          OK, I get zero points for reading comprehension :) You’re not using iPad-side Python at all and are doing Shortcuts with small server apps running on your LAN.

                                        Now for a less stupid question: Have you ever considered using Editorial or Pythonista for any of these workflows?

                                        1. 3

                                            Thanks! So here’s what I use: on the iPad I write Markdown in Drafts and code via ssh and iSH on my server, a Digital Ocean Ubuntu droplet. iSH is very capable; it even runs tmux and supports Smart Keyboard combinations such as cmd + . (the classic Mac escape equivalent) as Esc in Vim out of the box – it’s exactly like my desktop, and I’m very happy with it. On the desktop I do both in Vim. The best part about the Digital Ocean server is that it’s only $5 a month and comes with a free domain and free SSL; it’s very easy to just spin one up and host a bunch of personal projects.

                                            Both the Flask server and the podcast transcription endpoint are on the server, not in my LAN. As I wrote in my initial comment, I access both via two shortcuts which essentially POST and GET the input files via curl. The Shortcut workflows are accessible from any app that can export a plain text (or sound) file from its Share Sheet. They’re universal, so to speak.
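
                                            Concretely, the shortcuts reduce to simple HTTP calls along these lines (the endpoint and file names are placeholders):

                                            # typeset: POST a Markdown file, get the rendered PDF back
                                            curl -s -F "file=@draft.md" https://example.com/typeset -o draft.pdf
                                            # transcribe: POST a podcast segment URL; the server appends the text to Evernote
                                            curl -s -X POST -d "url=https://example.com/episode.mp3" https://example.com/transcribe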

                                            I thought of trying Editorial but so far I’m happy with what Drafts offers for free on the iPad. With regards to Pythonista, I don’t need to run Python locally and so far my automation needs are being met by Shortcuts.

                                          1. 2

                                            Glad to hear things are working well for you! Both Pythonista and Editorial are pretty amazing and worth a look IMO.

                                              Also, yeah, iSH is SUPER impressive! Writing a full-on x86 emulator for ARM so he could run a Linux kernel under iOS is no mean feat! And as you say it works like a champ! I hope Apple blesses it officially and doesn’t kill it off.

                                      1. 2

                                          I recently switched to the Dracula theme - it’s available for all the tools I use (VSCode, Vim, iTerm2) and tons more. Every now and then I want a light theme; Solarized is still the go-to light color scheme for me. I just need some workflow to automate the light/dark switching.

                                          1. 2

                                            awesome, thank you!

                                        1. 4

                                            I use Solarized Light during the day and Dark during the night; I’ve written on my blog about how I automated switching between the two. If I had to choose just one, I’d choose Light.

                                          1. 2

                                              If you’re using macOS and iTerm, I wrote about a similar approach: a vanilla bash function that automatically (or manually [see just function i in the diff]) switches between Solarized Light and Dark in both iTerm and Vim.

                                              (If you want to go one step further, I wrote about how to automate the whole Mojave Dark - Light dingus on my blog.)

                                            1. 2

                                                I really liked the -u option and went to check my scripts, only to find out that I’ve been using it for quite a while now in set -euo pipefail. -E, OTOH, I had no idea about, and it is very interesting.
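
                                                For reference, -E (errtrace) matters when you combine -e with an ERR trap; a small sketch:

                                                #!/usr/bin/env bash
                                                set -Eeuo pipefail
                                                # without -E the ERR trap would not fire for failures inside functions;
                                                # -E makes functions, command substitutions and subshells inherit it
                                                trap 'echo "error on line $LINENO" >&2' ERR
                                                step() {
                                                    false   # simulated failure inside a function
                                                }
                                                step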

                                              1. 3

                                                  Thanks for this; however, I still can’t understand how and why GNU Stow (or other similar dotfile managers, Ansible playbooks, etc.) is better than a simple shell script. Precisely because I share my dotfiles across multiple devices, platforms, and operating systems, I want them to be as platform-agnostic and minimal as possible, and without any external dependencies. My script simply installs/symlinks everything, and later I use git pull to sync changes across machines.
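
                                                  The core of such a script is tiny; a minimal sketch with illustrative paths:

                                                  #!/bin/sh
                                                  # symlink every top-level dotfile in the repo into $HOME
                                                  dotfiles="$HOME/dotfiles"
                                                  for f in "$dotfiles"/.*rc "$dotfiles"/.gitconfig; do
                                                      [ -e "$f" ] && ln -sf "$f" "$HOME/$(basename "$f")"
                                                  done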

                                                1. 2

                                                  Yep I also use an install script (although mine is in Python), the reasons being:

                                                  • I need to support 3 platforms (Ubuntu/Mint, macOS, Windows), with different things to install in different ways (or not at all) depending on the platform.
                                                  • In some cases, I find it easier to procedurally generate a dotfile that will point to a resource located in your dotfiles repo, instead of symlinking everything into a fixed/hard-coded location.
                                                  • A script can also manage your sub-repos (pulling/cloning), so that everything is done in one command.
                                                  • A script can optionally do a subset of something, so that if you want to just update, say, your tmux config, you can just run that and it will just pull your tmux plugins’ repos, redo the symlinks, etc., and nothing else.
                                                  1. 1

                                                    It’s certainly not for everyone. I use it because its default behaviour matches my workflow perfectly, and there’s no need for a shell script. The only thing simpler than a simple shell script is no shell script at all!

                                                    1. 1

                                                        Unless I’m missing something, you are creating an extra dependency before you can install (and manage) your dotfiles: Stow. I.e., you have to download, compile, and install it from source or via a package manager. What if you have to use systems you’re not the admin for and have no sudo rights? Or Perl is too old (a Stow requirement)? Or, or, or…

                                                  1. 15

                                                    Between OMZ making the shell take 3s to start on an SSD, and asking whether it can check for updates, I’m not sure it lasted more than 3 days on my machine. :)

                                                    More to the point: rather than starting with OMZ and asking, “what can I get rid of?” I’d encourage the opposite view of asking “what do I need?” and adding that to a simple zsh setup. Take ownership of your workspace and add only what you fully understand. I really like a simple prompt with CWD, current git branch (if any), and dirty status. My shell needs are also very simple, and I’m still learning a lot about what plain zsh can do. Add aliases as you need them instead of using giant community-driven plugins.

                                                    I really feel like our workspaces should be slightly idiosyncratic, if only to surface better workflows over time. Frameworks cut this impulse off at the knees.

                                                    The benefit of this approach:

                                                       ❯ time zsh -i -c exit
                                                           0.05 real         0.04 user         0.01 sys
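
                                                      And for anyone starting from scratch, the kind of prompt described above fits in a few lines of plain zsh; a rough sketch (details are a matter of taste):

                                                         setopt prompt_subst
                                                         git_prompt() {
                                                             local branch
                                                             branch=$(git symbolic-ref --short HEAD 2>/dev/null) || return
                                                             [[ -n $(git status --porcelain 2>/dev/null) ]] && branch+='*'   # dirty marker
                                                             print -n " ${branch}"
                                                         }
                                                         PROMPT='%~$(git_prompt) ❯ '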
                                                    
                                                    1. 4

                                                      I really like a simple prompt with CWD, current git branch (if any), and dirty status.

                                                      In that case perhaps you’ll like my prompt. It relies on and sources these two simple functions.

                                                      Time varies from 0m0.006s to 0m0.009s.

                                                      1. 1

                                                        Yeah, I’ve always felt OMZ (and similar things for emacs, etc.) kind of miss the point.

                                                        Dot files and tool configuration allow everybody to optimize their workflow so that it works best for their needs. Maybe using a collection of other people’s customizations really works best for some people, but it seems sub-optimal in general.

                                                      1. 3

                                                          Maybe @steveno or someone else can ELI5 to me why this is advantageous over traditional, platform-agnostic, and dependency-less symlinking in a bash script? Cf. my dotfiles and the install script.

                                                        1. 3

                                                          Salt’s declarative nature means that you’re mostly describing the end state of a system, not how to get there.

                                                          So instead of saying “copy this stuff to this directory and then chmod” you say “I want this other directory to look like this”. Instead of saying “install these packages” you say “I want this to be installed”. You also get dependency management so if you (say) just want to install your SSH setup on a machine you can say to do that (and ignore your window manager conf).

                                                            If your files are grouped and organized well enough, you can apply targeted subsets of your setup on many machines based on what you want. “I want to use FF on this machine so pull in that + all the dependencies on that that I need”. “Install everything but leave out the driver conf I need for this one specific machine”

                                                            This means that if you update these scripts, you can re-run salt and it will just run what needs to run to hit the target state! So you get recovery from partial setups, checking for divergences between setups, etc. for free! There are dry-run capabilities too, so you can easily see what would need to change.
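
                                                            For example (the state name here is hypothetical), previewing and then applying just a subset looks like:

                                                            # dry run: show what would change for the ssh states only
                                                            salt-call --local state.apply ssh test=True
                                                            # then actually apply it
                                                            salt-call --local state.apply ssh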

                                                            This is a wonderful way of keeping machines in sync.

                                                          1. 2

                                                              Looking at my repository right now, there isn’t any advantage. You could do everything I’ve done with a bash script. The beauty of this setup for me, and I really should have stated this initially, is that I can have multiple machines all share this configuration really easily. For example, my plan is to buy a Raspberry Pi and set up an encrypted DNS server. All I need to do is install salt on the Pi and it gets all of this set up just like my NUC currently has. I can then use salt to target specific machines and have it set up a lot of this for me.

                                                            1. 2

                                                              The beauty of this setup for me, and I really should have stated this initially, is that I can have multiple machines all share this configuration really easily

                                                              You can also do this with a shell script.

                                                              All I need to do is install salt

                                                              With shell scripts you don’t need to install anything.

                                                              1. 3

                                                                As I previously stated, given what’s currently in this repository, there isn’t anything here that you couldn’t do with a shell script. That’s missing the point though. Salt, or ansible, or chef, provide you with a way to manage complex setups on multiple systems. Salt specifically (because I’m not very familiar with ansible or chef) provides a lot of other convenient tools like salt-ssh or reactor as well.

                                                                1. 2

                                                                    I feel like your point is just that shell scripting is Turing-complete. OK. The interesting questions are about which approach is better/easier/faster/safer/more powerful.

                                                                  1. 2

                                                                    If you’re targeting different distributions of linux or different operating systems entirely, the complexity of a bash script will start to ramp up pretty quickly.

                                                                    1. 2

                                                                        I disagree; I use a shell script simply because I use a vast array of Unix operating systems, many of which don’t even support tools like salt, or simply do not have package management at all.

                                                                      1. 1

                                                                        I have a POSIX sh script that I use to manage my dotfiles. Instead of it trying to actually install system packages for me, I have a ./configctl check command that just checks if certain binaries are available in the environment. I’ve found that this approach hits the sweet spot since I still get a consistent environment across machines but I don’t need to do any hairy cross-distro stuff. And I get looped in to decide what’s right for the particular machine since I’m the one actually going and installing stuff.
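
                                                                          For illustration, that kind of check can be as small as this (the binary list is illustrative):

                                                                          #!/bin/sh
                                                                          # check that the binaries the dotfiles expect are present in the environment
                                                                          missing=0
                                                                          for bin in git tmux vim fzf; do
                                                                              command -v "$bin" >/dev/null 2>&1 || { echo "missing: $bin"; missing=1; }
                                                                          done
                                                                          exit "$missing"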

                                                                    2. 1

                                                                      The beauty of this setup for me, and I really should have stated this initially, is that I can have multiple machines all share this configuration really easily.

                                                                        Have to agree with @4ad on this one. I have to use remote machines where I don’t have sudo rights and/or which are often completely bare-bones (e.g., not even git preinstalled). My goal, in essence, is a standardized, reproducible, platform-agnostic, dependency-less dotfile environment which I can install with as few commands as possible and use as fast as possible. I don’t see how adding such a dependency benefits me in this scenario. I’m not against Ansible-like dotfile systems, but, in my opinion, using such systems for this task seems like overkill. Happy to hear otherwise, though.

                                                                    1. 2

                                                                      Awesome link. I didn’t know xargs could do that.