My goodness - I just had an idea for a blog post about modeling real-world processes with TLA+. I had a couple of situations at work recently that weren’t code-related that I immediately mapped to a TLA+-like model in my head, and that made me understand them better.
It turns out state machines very naturally model a lot of business processes too. I guess that’s not surprising, seeing as with programming we’re also just modeling business processes, but it’s been surprising for me to be thinking about TLA+ outside of a verification context.
Hah, that’s Leslie Lamport’s open-secret mission: he wants to change how software engineers think using TLA+ as a sort of Trojan Horse!
Yeah, TLA+ was primarily created as a notation for thinking; the model checker can’t even check all the things you can write.
I don’t get what the hardware is doing differently from FPGAs or PLAs or “normal” CPU clusters. As such, I don’t know what I could use this for.
Here’s the spec for the F18A chip: https://www.greenarraychips.com/home/documents/greg/DB001-110412-F18A.pdf
I’d be curious if you tried polars, my sense is that it’s very good / fast at these sorts of things. (But I don’t have a lot of experience with it.)
https://duckdb.org/docs/extensions/json.html is worth checking out
Cool, thanks! I wasn’t familiar with DuckDB! I almost wrote a coda mentioning Postgres, which would probably have worked as well.
When you’re iteratively building jq commands by trial and error as I do, you’ll quickly grow tired of having to wait about a minute for your command to succeed, only to find out that it didn’t in fact return what you were looking for.
I feel like the article misses step 0: build and test your tools on a minimal representative sample. There’s no point running the first few jq attempts on the full dataset rather than “head 100” of it. Then potentially some better selected sample where you know the right answers. Then step 1: think how you’re going to apply your ready and tested process to gigabytes of json.
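One way to do that step 0 in Python, as a rough sketch (it assumes the data is one big top-level JSON array and uses the third-party ijson streaming parser; the file names are made up): pull the first 100 records into a small sample file and iterate your jq command against that.

import json
from itertools import islice
import ijson

# Hypothetical file names; assumes a single top-level JSON array.
with open("big.json", "rb") as f:
    sample = list(islice(ijson.items(f, "item"), 100))  # first 100 elements only
with open("sample.json", "w") as out:
    json.dump(sample, out, indent=2)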
Also, choosing a better language is an automatic massive gain sometimes. I wrote https://github.com/viraptor/heapleak in Crystal because it’s ~10x faster than heapy in Ruby.
If I can do it in 50 seconds on the full dataset in the way I’m used to, I’ll do it. If I couldn’t have done that, I probably would have used jq to sample, but thanks to parallel, that wasn’t necessary. Hooray!
There are also ijson and json-stream (and many other libraries) that can turn JSON files into Python generators, which are really fast and memory efficient.
It does feel like as memory and connection speeds have grown, streaming parsing strategies have been forgotten somewhat.
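For instance, a minimal sketch with ijson, where only one record is in memory at a time (the file name and the "status" field are made-up examples; it assumes a top-level JSON array):

import ijson

errors = 0
with open("big.json", "rb") as f:
    # ijson.items yields one decoded array element at a time.
    for record in ijson.items(f, "item"):
        if record.get("status") == "error":  # hypothetical filter
            errors += 1
print(errors)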
I’ve tried working with the chunksize parameter for pandas.read_csv, but I ran out of memory very quickly again when trying to reduce my JSON data. Maybe pandas doesn’t use one of these libraries, but that discouraged me from trying any other non-parallelized approach.
Chunksize should work in this case unless you store the result somewhere instead of just printing it on screen and letting the garbage collector do its thing. Though afaik pandas has nothing for chunking/streaming JSON.
In general, if you can take advantage of the garbage collector and generators, you can process files of any size with very little memory (you just need enough to load one chunk/item). Though working with generators is much more hands-on than just loading everything into memory, so unsurprisingly it’s less approachable.
The code was something like
result = pd.DataFrame()
for chunk in pd.read_csv(...):
    df = pd.DataFrame(chunk)[["column1", "column2", "column3"]]
    result = pd.concat([result, df])
Used all the memory. Probably creating full dataframes and throwing them away causes a lot of memory pressure. But Dask took care of all that for me, so it gets a mention.
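For what it’s worth, here’s a sketch of a variant that usually keeps memory down without Dask (path, chunk size and column names are placeholders): select the columns at read time and concatenate once at the end, instead of growing a DataFrame inside the loop.

import pandas as pd

parts = []
# usecols drops unwanted columns while parsing each chunk.
for chunk in pd.read_csv("data.csv", chunksize=100_000,
                         usecols=["column1", "column2", "column3"]):
    parts.append(chunk)
result = pd.concat(parts, ignore_index=True)  # one concat instead of one per chunk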
If you were to use the ijson basic_parse interface you’d have to write a state machine to keep track of where you are in the JSON file and of the opening and closing braces/brackets with the start/end_map/array events, which might be a bit tedious but is easily abstracted away with some functions. When you use it like that you’d basically only have to pay the memory cost of one JSON token at a time. I’m guessing the reason why the 10GB file takes so long to process in jq is that it tries to parse the file into memory in its entirety before doing any work, so any solution that doesn’t involve keeping the entire parsed file in memory is probably gonna be a lot faster.
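A bare-bones sketch of that kind of state tracking with ijson.basic_parse (the file name is a placeholder, and what you do at each depth or key is left as a stub):

import ijson

depth = 0
with open("big.json", "rb") as f:
    for event, value in ijson.basic_parse(f):
        if event in ("start_map", "start_array"):
            depth += 1
        elif event in ("end_map", "end_array"):
            depth -= 1
        elif event == "map_key":
            pass  # react to the keys at the depth you care about
        # scalar values (strings, numbers, booleans, nulls) arrive as their own events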
By the way @jeeger, I looked at jq’s documentation and saw that it has a --stream flag that does exactly that and claims to be faster on very large files, but it requires you to use a slightly different querying syntax; have you tried it?
One of my favorite edge cases of this type is how holidays and observances keyed to sunrise or sunset work in places where the sun does not rise/set once per 24-hour period and may not rise/set for months at a time (authorities differ on how to handle that!).
Another is the question of how one would face Mecca to pray while in orbit around the Earth.
The latter was publicly discussed a while ago: https://en.wikipedia.org/wiki/Qibla#Outer_space.
I love how ultimately these rules devolve into “whatever, do the best you can under the circumstances”, which I wish were a maxim more people followed.
That doesn’t work inside a deleted directory, since the . and .. directory entries have been removed.
How would “cd $(pwd)” work inside a deleted directory?
[jeeger@dumper /tmp] $ mkdir -p /tmp/test/test
[jeeger@dumper /tmp] $ cd /tmp/test/test
[jeeger@dumper /tmp/test/test] $ rm -rf /tmp/test
[jeeger@dumper /tmp/test/test] $ mkdir -p /tmp/test/test
[jeeger@dumper /tmp/test/test] $ cd .
[jeeger@dumper /tmp/test/test] $ ls
[jeeger@dumper /tmp/test/test] $
That’s a good point. I ran some tests of my own:

1. recreate the deleted-directory situation from above in /tmp/test/test
2. /bin/pwd to verify the problem is detected
3. cd . or cd $(pwd)
4. /bin/pwd again to verify the problem is fixed

It turns out cd . and cd $(pwd) work equally well in posh, mksh, dash, zsh and bash, in the default configuration. However, for historical reasons I have set -P in my .bashrc file, which (as an unintended side-effect) turns off the magic that allows cd . to work.
Very interesting, it’s only after bash resolves /tmp/test/test/. to /tmp/test/test that they become equivalent.
I’m not sure what you’re demonstrating, but please be advised that ls does not error in a deleted cwd anyway; it simply exits.
Done. With just a little bit of work, we can now bypass all pattern-list-based ad blockers. As long as all proxied requests happen through HTTPS, third-party cookie goodies will be passed along to the tracker as well.
Cool, you’ve also created an open proxy for people to use.
Yeah, well, if your site doesn’t respect user defaults, I’m not going to respect your site. Load you once, shame on you.. load you twice, shame on me.
Ever notice how you can pi-hole www.reddit.com to 0.0.0.0, but Google Chrome will silently ignore that and somehow do its own DNS request.. and it loads? After I’ve gone through the trouble of setting the OS default DNS (to try to stay off reddit)? Not very respectful, Chrome. That seems like something malware would do, and frankly it’s getting harder and harder to tell the difference.
Firefox respects user defaults. Safari, too. What’s the story, they’re “protecting my privacy”? Please no.
Ah, I’m wrong. It doesn’t index it, it scans it. Sorry about the Vice link, I hate that about as much as I hate medium: https://www.vice.com/en_us/article/wj7x9w/google-chrome-scans-files-on-your-windows-computer-chrome-cleanup-tool
Preliminary tests with fewer than 100 satellites up showed approximately 600 Mbps available, as tested on an aircraft in flight connected to Starlink. For reference, my last flat in Berlin (the capital of the largest economy in the EU), on the ground on a main street in the city center, was serviced with approximately 14 Mbps ADSL, and this was the fastest offering available from any vendor.
I mean, duh. It’s probably 1 user on Starlink vs. 500 on the local DSLAM?
Alternately: this potentially makes full-time, international waters seasteading practically viable.
Because Internet access is the problem with seasteading.
The makefile should probably be updated to use pattern rules instead of outdated suffix rules.
The benchmarks look impressive; for the full changelog, see the release announcement mail.
I’m salivating after those ligatures. If only ligature support was available in a sane manner in Emacs.
I use vim on the terminal and ligatures in Iosevka work just fine. But it does require a terminal emulator with ligature support.
I’ve heard some people raving about TiddlyWiki for this, which seems like it has a lot of flexible interfaces. I don’t share my notes publicly as often, but I take tons of notes, and I almost always have this TUI mind map that I wrote open in a tmux pane.
Why don’t you share your notes publicly? I think there is so much value to be had if everyone shared their notes in public.
I have a lot of personal info in there that I don’t need to share. But I am often thinking “I should be writing more for public consumption”, and I also started making a gitbook, with a long-term plan of turning it into a sort of “Designing Data-Intensive Applications”, but with an eye toward actually building the underlying infrastructure that supports data-intensive applications. Sort of a missing guide for modern database and distributed systems implementation. But it’s been hard for me to overcome the friction of really building a durable habit of writing for others.
That’s the cool thing about a wiki. You write the notes for yourself. Just share it with others.
Would love to read your book. I’d release it gradually too as the topic of building the infrastructure for data intensive apps is vast.
Not to discourage you from writing, because I do believe that there’s always room for new takes on a topic, but Martin Kleppman’s Designing Data-Intensive Applications may be of interest to you, if you haven’t seen it. I’m not sure that it goes into “underlying infrastructure”, as you’re thinking, though.
I love that book! I want to do something similar but targeting people building infrastructure instead of the applications on top.
Gotcha. That sounds useful, though I’d think it would be less evergreen, because the technology on which the infrastructure is built will keep changing.
I’ve seen a lot about the virtues of event sourcing, but a lot less about how one implements event sourcing at different scales. Am I correct that’s the kind of thing you’d dig in to?
No software I tried supported tagging things as public or private. So I am forced to make everything private.
@icefall This void reminds me a lot of maxthink, an old DOS personal organizer (http://maxthink.com). I’ll take a deep look, thanks for sharing.
@nikivi, your gitbook is a prime example, very well crafted. I also try to keep notes in a similar structure, but yours are way more structured.
The simplicity that was TiddlyWiki has unfortunately been mostly lost in the name of security. It used to be a single file that you could open and edit, and everything would save automatically and transparently.
Now you have to install a browser extension for it to save correctly, which makes TiddlyWiki much harder to transfer between browsers.
Edit: I’m not taking sides on the security-functionality tradeoff, to forestall any off-topic discussion.
However, TiddlyWiki has been around a looooong time (at least 10 years) now, and I assume that this would mean the Wiki portion of the software is superb.
Documentation for the syntax is at http://mandoc.bsd.lv/man/mdoc.7.html.
https://manpages.bsd.lv/ is also a good resource if one has never written [gt]roff and/or manpages before.
Oof, that looks like a much better resource. The syntax still needs some getting used to, but being shown the benefits of semantic markup (i.e. how does the PDF conversion look? Does it work well?) might persuade me to write some mandoc.
The syntax still needs some getting used to
It becomes fairly pleasant. I’ve converted a few pages from man to mdoc for the suckless project (although my patches haven’t been accepted), and all you really need is the manpage you linked as a reference. I’ve also written a few myself, and once you understand the boilerplate + semantics, the rest is easy.
how does the PDF conversion look? Does it work well?
It’s really easy: just run
$ mandoc -Tpdf ed.1 > ed.pdf
I am curious about one thing. The image model of Smalltalk, which is considered one of its strengths, is similar to the notebook state in Jupyter notebooks, right? How does Smalltalk get away from all the associated disadvantages?
What am I missing when people talk about image based development?
The main thing I think you’re missing is that the actual Smalltalk development experience looks nothing like the experience of using Jupyter.
First, Smalltalk, unlike every other language I’m aware of, and unlike Jupyter notebooks, is not written as files. Instead, you work entirely in a tool called the browser, where you can see a list of classes, those classes’ definitions, and a list of all methods within a class. The order in which these classes and methods were defined is irrelevant and invisible. This alone eliminates a huge swath of the notebook complaints.
Second, with few exceptions, the version of objects you’re seeing is always live; while you might have a Workspace (REPL) open with a test and output, that’s visibly very distinct from an Inspector, which shows you the current state of any given variable. The Browser likewise only shows you the current definitions of a class or method. (Previous versions are usually available through what you can think of as a very powerful undo tool.) You thus don’t end up with the current v. previous value skew issues with notebooks.
Notebooks are suffering because they’re trying to bring a kind of Smalltalky experience to a world where code is written linearly in an editor. I agree, that’s tough, and that’s really the underlying issue that the post you linked is hitting. Smalltalk (and Common Lisp environments) instead focus on saying that, if you’re gonna have an image-style situation, you need tooling that focuses on the dynamic nature of the preserved data.
[Edit: all that said, by the way, I think you can make a case that some specifics of how Smalltalk’s image works in relation to its dev environment are a bit off. One way to fix that is the direction Self went with Morphic, where truly everything is live, and direct object manipulation rules the day. The other direction you can go is what Common Lisp and Factor do, which is to say that the image contains code, but not data; in other words, making images more like Java .jars/.NET assemblies/Python .pyc files. The purist in me prefers Self’s approach, the practical part of me prefers Factor/Common Lisp’s.]
Thank you for your response.
I am not sure I understand. Say I have an object A that contains another object B, constructed in a particular way, as a member. Now, if I edit the constructor of object B so that the variables inside B are initialized slightly differently, would that be reflected in the already-constructed object inside A?
Not in that specific case, no. For other things (e.g., changing variable names, class shape, changing a method definition, etc.), the answer is yes. The best way to think of it: the Smalltalk image is a database, and like most databases, if you change the schema or a stored procedure, that takes effect immediately, but existing data would need to be backfilled or changed. This is exactly the thing I was flagging on why Common Lisp and Factor do not do this: their images are code only, no data. (Or Self, for which the answer to your question is, at least in some cases, actually yes.)
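As a rough analogy in Python rather than Smalltalk, just to illustrate the behaviour-vs-data distinction (the class and values are made up):

class B:
    def __init__(self):
        self.x = 1              # original "constructor"

    def describe(self):
        return f"x is {self.x}"

b = B()                         # an already-constructed object, like B inside A

# Changing a method definition affects existing instances immediately,
# because method lookup goes through the class:
B.describe = lambda self: f"x is still {self.x}"
print(b.describe())             # -> "x is still 1"

# Changing the constructor does NOT rewrite already-constructed objects:
def new_init(self):
    self.x = 2
B.__init__ = new_init
print(b.x)                      # -> 1; existing data would need to be backfilled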
This is different than the main thing highlighted in the article, though. There, the concern wasn’t as much about state existing at all, as much as the fact that the presentation made it look as if the output accurately reflected the input, which wasn’t guaranteed. Thus, in a notebook,
In[123]: x = 5
Out[123]: 5
is potentially not accurately reflecting the real value of x, whereas in Smalltalk, if you went behind the scenes and changed x to, say, 10, then any Inspector on x would also change immediately to 10.
Thanks again for the detailed response!
I understood the rest of the post, but I don’t get it when you say that in Common Lisp (and Factor), the image is code only, and not data. How is that different from storing everything in a file?
Eh, I probably confused things. The images for Factor and Lisp are just compiled code, effectively, albeit stored in a high-level format that’s designed to make debugging a running app a lot easier. That’s why I said they’re equivalent to Java .class files or the like. I was more mentioning that because the word “images” means something very different in Smalltalk/Self v. Common Lisp/Factor-like systems.
So, what does it mean when Factor says it has an image based model? Is it simply saying that it is compiled?
Going back to the original question; My experience is with R, where the image is frequently not a boon, unless one is very careful not to reuse variable names during data cleaning. And the linearity of notebooks isn’t a factor when you work in the command line. R does let you see the current definitions of everything but that is not as helpful as it first seemed to be. Hence my question.
Interesting! So if I define a class, instantiate it, change a method definition, and call that method on the class, the new definition is called?
That would indeed be much more useful than a notebook-like approach.
That’s completely correct, and also applies to certain other refactorings on the class (e.g., renaming instance variables or the like).
One problem I could see with image files is servicing them; how do you patch several image files, or worse, deal with divergences in each due to patching of behaviour?
Regarding the security of unikernels, I believe that there is a reasonable rebuttal from Bryan Cantrill: https://www.joyent.com/blog/unikernels-are-unfit-for-production
He might be biased (joyent/containers vs unikernels) but @bcantrill’s opinions are usually driven more by wanting stronger engineering practices than by marketing.
Why, oh why did they have to define a new syntax?
Goblins is a Scheme library so the code you see is regular Scheme code.
The whitepaper uses a whitespace-based syntax, which I find absurd.
I’m partial to s-expressions, too. The Wisp syntax was chosen in an attempt to not scare off people who find s-expressions offputting. Whether it was the right call or not is up for debate. Spritely didn’t invent the syntax, though. Wisp is an alternative Scheme syntax and it is an official SRFI. This paper is an evolving document, so it may be switched back to good ol’ s-expressions if the Wisp experiment doesn’t work out.
I would definitely love a version of the whitepaper without Wisp. If people find s-expressions off-putting, they’ll use another language with another framework, so I don’t buy the argument. The concepts used by Spritely are complex enough without forcing people to learn another syntax.