function f() { return 42; }
function g() { return 42 + (Math.random() < 0.001); }
Any testing-based approach will very likely report that f and g are equivalent.
With a code-coverage tool that measures branch coverage, you could know that your tests didn’t exercise all branches of this code.
So, to answer the OP: if you can find a code-coverage tool that measures a sufficiently strong coverage metric (e.g. multiple-condition/decision coverage), you can create a test set that covers the code you intend to replace, in the sure knowledge that all cases are exercised. Then assert that your replacement passes the same test set. It’s not a proof, but perhaps it’s enough for your needs.
And if that doesn’t fool it, this almost certainly will:
function f() { return 42; }
function g() { return Math.random() < 0.001 ? 1000 : 42; }
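The odds here can be quantified with a quick sketch (not tied to any particular test framework): if each call to g takes the rare branch with probability 0.001, then a suite of N independent random calls misses that branch with probability (1 − 0.001)^N.

```python
# Probability that N random invocations of g never hit the rare branch
p = 0.001  # per-call probability of the divergent branch

for n in (100, 1_000, 10_000):
    miss = (1 - p) ** n
    print(f"{n:>6} calls: miss probability {miss:.4f}")
```

Even a thousand random tests still miss the differing branch more than a third of the time, which is why a coverage metric, rather than test volume, is the relevant knob.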
Google writes papers about AI, presumably to incentivize their academics with recognition for their work. This is great until you notice that the crucial piece, the trained model, is always absent.
While it is true that the trained model is often (or always) absent, the model is not the crucial piece. The AlphaGo papers contain only a high-level outline, yet that was enough to cause a dramatic improvement in every other Go bot. Leela Zero is now at top professional level, which was completely unthinkable less than three years ago.
You had a problem with the output format of du, so you rewrote the whole thing in a different language, and your version is 4x slower.
Free software exists for the exact reason that people can improve upon it and change the bits they don’t like.
Don’t get me wrong, I like the final product; it looks very nice and human-friendly.
It just feels like you wrote it in Rust for the sake of it, instead of trying to achieve a better design or fix some flaw deeply rooted in the fundamental design of the existing implementation.
By doing so you threw away all the work that others made available for free, and ended up with something with sub-par efficiency.
I did write it in Rust for the sake of it. I like working in Rust; I dislike working in C.
I also contributed several fixes to the Rust port of du in coreutils. The Rust port of du is also a lot slower than the original C du. I do not know why. This is something I want to look into without starting a C-vs-Rust speed flamewar.
I guess to me du and dust are different tools, and adding the dust features to du wouldn’t be accepted (the point of dust is to remove the need for the -h flag and to autosort for you).
Not to mention that large, mature projects like coreutils are not super easy to augment with functionality like what dust provides, to say nothing of the social challenges of actually contributing a change to such a project.
The whole assumption of @izabera’s comment, that people need to not write whatever software they want and instead only contribute to existing projects, is incredibly stuck up its own ass.
I don’t think that’s what izabera is saying. What is argued is that it is better to contribute to du than to write dust. It is like saying helping the poor is better than enjoying the beach; that does not mean people shouldn’t enjoy the beach.
When you hear such things, replying that people can do whatever they want, while true, is unproductive. My reply is more akin to pointing out why donating to certain charity will not actually end up helping the poor.
When you hear such things, replying that people can do whatever they want, while true, is unproductive.
izabera’s comment in the first place is unproductive. I had some asshole say the same thing to me in the past. They can bugger off. Free software very clearly does not exist just for the sake of small iterative improvements. There are many reasons for, and benefits of, free software, and picking a single one and lambasting folks over it is… well… @nebkor got it right: stuck up its own ass.
Yeah… I disagree with the lambasting, but I still think it is fair game to discuss the strategic aspect. Say, if your goal is to help users, is it better to contribute to the existing popular C implementation, or to start a Rust one? As bootandy already said, the goal was to write Rust, and since that goal can’t be fulfilled by contributing to a C implementation, the question is moot. But if the goal were to help users, it would be relevant.
As another example, the goal of Mozilla is much closer to “help users” than to “write Rust”. So the strategic question of the best way to achieve that goal, whether it is to work on the existing C++ codebase or to rewrite it in Rust, is a very real one. For another, if someone’s goal is to, say, improve the security of the world’s computational infrastructure, it is a very real question whether it is best to improve C++ or to work on Rust. And if, and only if, two people agree on the goal, it is legitimate for one to argue to the other that “you are wasting your time with Rust”. (I believe this is roughly why Daniel Micay is now working on Copperhead OS instead of Rust. I tend to agree that if one’s goal is to improve security, Rust is currently not the best bet. I work on Rust because my goal is not to improve security.)
I am being downvoted, and flagged as incorrect, troll, and off-topic, for pointing out that parsing a flat stream of text is easier than monitoring a directory tree with new files spawning asynchronously.
I get that you guys like suckless, but be realistic. The whole point of the article by sircmpwn was that IRC is super easy to interact with. A tree-like view of IRC may look cool, but ii takes the simplicity of the protocol away, and it’s a terrible tool for building bots.
Pretty sure it is actually because you called the idea retarded. Even if you are right, it certainly isn’t retarded.
Your comments are getting downvoted for being rude, flippant, and having little information. You made more technical points in this comment complaining about downvotes than in both of your previous comments combined.
If you had said this project made a really strange choice about its approach to IRC that seems bad for irc bots, you might have started a great conversation about the use cases they wanted to support, unix philosophy, suckless’s take on unix philosophy, parsing, separation of concerns, any number of things. If you start out rudely dismissive, people won’t see it as worth their time to share their different perspectives that maybe make this a good idea, and if you’re right that it’s a bad idea you won’t convince them.
I suggest you explain why, in your opinion, this isn’t a good idea. Use rational arguments and provide examples.
People have the right to disagree, but if they do not offer constructive criticism, then they can keep their opinion to themselves.
It’s even easier when using ii from suckless. We have a bot on #openbsd-gaming now that reports how many people are currently playing. It just runs qstat every 5 minutes, massages the output and spits it out to the input file tied to our channel. It’s hard to beat echo "hello world" > irc/chat.freenode.net/#openbsd-gaming for scripting bots.
This is like some weird Portlandia thing…“The dream of Plan9 is alive in suckless”, etc.
Still, neat though! :)
ii sounds great in theory, but try to answer new incoming queries. Now, instead of parsing a single stream of text, you have to monitor an entire directory tree with files appearing out of nowhere at any time.
parsing irc is super simple and ii is a retarded idea for a bot
parsing irc is super simple and ii is a retarded idea for a bot
This isn’t constructive. If anything the toxicity detracts from your argument above.
ii sounds great in theory, but try to answer new incoming queries. Now, instead of parsing a single stream of text, you have to monitor an entire directory tree with files appearing out of nowhere at any time.
Sure, if your bot needs to respond to private queries. The one on our channel doesn’t parse any input at all. Including the channel itself - it’s a notification bot.
parsing irc is super simple and ii is a retarded idea for a bot
You’re telling me it was retarded to output the result of qstat every 5 minutes to a file? It took 5 minutes to write the notification using ii - it serves its purpose.
Does ii fit every use case of writing a bot for IRC? Nope. However it does make it dead easy to have various tools output content to a file and get it delivered on an IRC channel that way.
To be fair, even for notifications, it seems overkill to pull in ii and play with files when you can simply write to the socket:
NICK bot
JOIN #foo
PRIVMSG #foo :My text message
All you need is echo and nc, and IRC is yours. ii looks to me like an overkill solution for simple problems and a limited solution for complex ones. But then I might simply be missing some complexity about writing bots, managing server configuration, connection throttling, etc.
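The same one-shot approach can be sketched in Python (the nick and channel here are hypothetical; the host/port are the plaintext IRC defaults from the example above, and a real bot would also need to answer server PINGs):

```python
import socket

def notification_lines(nick, channel, message):
    """Build the raw IRC commands for a fire-and-forget notification."""
    return (
        f"NICK {nick}\r\n"
        f"USER {nick} 0 * :{nick}\r\n"   # most servers also require USER
        f"JOIN {channel}\r\n"
        f"PRIVMSG {channel} :{message}\r\n"
        f"QUIT\r\n"
    ).encode()

payload = notification_lines("bot", "#foo", "My text message")
print(payload.decode())

# To actually send it (untested sketch; reconnects on every notification):
# with socket.create_connection(("chat.freenode.net", 6667)) as s:
#     s.sendall(payload)
```

As the sibling comment notes, this reconnects for every message; ii amortizes the connection instead.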
Sure, but you will either connect to the network each time you invoke that notification or will need to maintain the connection up, respond to keepalive pings from the server etc. It is really easier to just dump a notification to a file every 5 minutes and have ii handle the connection.
It’s not perfect for all use cases, but it does simplify this specific one we had :)
It’s funny how on this website we routinely see people arguing about the inefficient design of fork(), and people who would rather have an empty /bin/true (which would spawn an entire new shell when executed) than implement it in C.
Stick to pre-ANSI C instead of C++17; this is the only way to make sure that your code will run on ancient Unix platforms that died well before you were born.
GCC has a cleanup attribute that may be useful.
https://gist.github.com/e160fdaffe84aede805f054a56466aa3
Related: this is an implementation of Andrei Alexandrescu’s SCOPE_EXIT (pretty much the same as defer in Go) using GCC’s cleanup attribute and Clang blocks.
GCC’s version is cleaner and has no runtime overhead; GCC is able to optimize all of the complexity away in that simple case (and usually in more complex cases too).
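For comparison, the same scope-exit/defer idiom can be sketched in Python with the standard library’s contextlib.ExitStack (this is just an illustration of the pattern, not related to the gist above):

```python
import contextlib

events = []
with contextlib.ExitStack() as stack:
    # callbacks registered on the stack run at scope exit,
    # in reverse registration order, like defer / SCOPE_EXIT
    stack.callback(events.append, "cleanup")
    events.append("body")

print(events)  # the cleanup ran after the body
```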
The problem with using vim to support such claims is that it doesn’t actually work: the HJKL keys were not researched in extreme depth and proven to be the best arrow keys on QWERTY keyboards; they just happened to be the arrow keys on the terminal that the person who wrote vi was using.
Yeah, I don’t get why we keep preaching HJKL in vim. I use JKL; instead (down up left right respectively) and I don’t have to lift my fingers from home row and left / right are on the “weaker” fingers, because I don’t typically use them as often as up / down.
I agree with the parent comment: HJKL was not researched, and might not be the absolute best. At this point it would be less useful for me to use something other than HJKL. Muscle memory is too ingrained.
And even if I took the time really learn another key combination there would be tons of programs that still assume HJKL.
The foundation might be wonky, but we are not going to tear down the house and rebuild it.
To satisfy requirement (4), I need the hash to be fairly wide. 128 bits gives an extremely low probability of collision for hash-tables of enormous size.
This only works if you use an array with 2^128 elements.
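For scale, the birthday approximation puts numbers on the quoted claim (a sketch; whether the table’s slot count also matters is exactly the point under dispute here): with a b-bit hash and n keys, the probability of any collision is roughly n²/2^(b+1).

```python
# Birthday approximation: P(any collision) ~ n^2 / 2^(bits + 1)
def collision_prob(n, bits=128):
    return n * n / 2 ** (bits + 1)

for n in (10**6, 10**9, 10**12):
    print(f"{n:>14} keys: ~{collision_prob(n):.2e}")
```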
[Title] /proc/<pid>/stat is broken
This sounds serious! Is the content of the pseudo-file associating incorrect PIDs or parent PIDs to processes?
Let’s continue…
Documentation (as in, man proc) tells us to parse this file using the scanf family, even providing the proper escape codes - which are subtly wrong.
So it’s a documentation issue…
When including a space character in the executable name, the %s escape will not read all of the executable name, breaking all subsequent reads
I have literally never encountered an executable with a space in the name, although it’s perfectly legal from a file name perspective. (I’ve been a Linux user since 1998).
The only reasonable way to do this with the current layout of the stats file would be to read all of the file and scan it from the end […]
So… let’s do this instead?
The proper fix (aside from introducing the above function) however should probably be to either sanitize the executable name before exposing it to /proc/<pid>/stat […]
Sounds reasonable to me.
[…], or move it to be the last parameter in the file.
Thus breaking all existing implementations that rely on the documentation in man proc. But I guess it can be done in some backwardly compatible way?
This problem could potentially be used to feed process-controlled data to all tools relying on reading /proc/<pid>/stat
I can’t really parse this. Do you mean “affect” instead of “used”?
In conclusion: I can’t see any evidence of the functionality of this proc pseudo-file being “broken”. You have encountered an edge case (an executable name with a whitespace character in it). You’ve even suggested a workaround (scan from the end). If you had formulated this post as “here’s a workaround for this edge case” I believe you would have made a stronger case.
I have literally never encountered an executable with a space in the name
Well, tmux does this, for example. But my primary concern is not “has it ever happened to me?” but “if it happens, what will my code do?”. As this is a silent failure (as in, the recommended method fails in a non-obvious way without indicating failure), no action is taken by most implementations to guard against this. That, in my mind, counts as broken, and the least we can do is fix the documentation. Or expose single parameters in files instead of a huge conglomeration with parsing issues. Or… see above.
So… let’s do this instead?
I do, but only after I got sceptical while reading the documentation, ran some tests and had my hunch confirmed. Then I checked to see others making that very mistake.
Thus breaking all existing implementations that rely on the documentation in man proc. But I guess it can be done in some backwardly compatible way?
No, I don’t think so - except for introducing single-value files (and leaving /proc/<pid>/stat as it is).
This problem could potentially be used to feed process-controlled data to all tools relying on reading /proc/<pid>/stat
I can’t really parse this. Do you mean “affect” instead of “used”?
Admittedly, English is not my first language, I do however think that sentence parses just fine. The discussed problem (which is present in several implementations based on the documentation), can potentially be used to inject data (controlled by the process, instead of the kernel) into third-party software.
In conclusion: I can’t see any evidence of the functionality of this proc pseudo-file being “broken”.
That depends on your view of “broken”. If erroneous documentation, affecting close to all software relying on it, with a silent failure, does not sound broken to you, I guess it is not.
You have encountered an edge case (an executable name with a whitespace character in it).
I actually did not encounter it per se, I just noticed the possibility for it. But it is an undocumented edge case.
You’ve even suggested a workaround (scan from the end).
I believe that is good form.
If you had formulated this post as “here’s a workaround for this edge case” I believe you would have made a stronger case.
Maybe, but as we can see by the examples of recent vulnerabilities, you’ll need a catchy name and a logo to really get attention, so in my book I’m OK.
Thanks for taking the time to answer the questions I have raised.
The discussed problem (which is present in several implementations based on the documentation), can potentially be used to inject data (controlled by the process, instead of the kernel) into third-party software.
Much clearer, thanks.
On the use of “broken”: I’m maybe extra sensitive to this, as I work in support for a commercial software application. For both legal and SLA[1] reasons, we require our customers to be precise in their communication about the issues they face.
[1] Service level agreement
Followup: can you give a specific example of how tmux does this? I checked the running instances of that application on my machine and only found the single word tmux in the output of stat files of the PIDs returned by pgrep.
On my Debian 9 machine, when starting a tmux host session, the corresponding /proc/<pid>/stat file contains:
2972 (tmux: client) S 2964 2972 2964 […]
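The scan-from-the-end fix suggested earlier is easy to sketch (in Python here for brevity): comm is bounded by the first ‘(’ and the last ‘)’, so everything after the last ‘)’ splits cleanly on whitespace.

```python
def parse_stat(data):
    # comm may contain spaces and parentheses, so locate the first '('
    # and the *last* ')' instead of splitting the line on whitespace
    lpar = data.index("(")
    rpar = data.rindex(")")
    pid = int(data[:lpar])
    comm = data[lpar + 1:rpar]
    rest = data[rpar + 1:].split()   # state, ppid, pgrp, session, ...
    return pid, comm, rest

print(parse_stat("2972 (tmux: client) S 2964 2972 2964"))
```

This stays correct for any comm the kernel can emit, since comm itself never contains ‘)’ followed by the rest of the well-formed field list before the true terminator.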
“Thus breaking all existing implementations that rely on the documentation in man proc. But I guess it can be done in some backwardly compatible way?”
I will never get the 100ms it took to read this sentence back….
I dunno, maybe just duplicate the information at the end of the current format, in the author’s preferred format, and delimited by some character not otherwise part of the spec.
It’s not trivial, though.
That was my point.
This was clearly overlooked when the API was designed; nobody is parsing that file from the end, and nobody is supposed to.
What was overlooked? That executables can have whitespace in their names?
I can agree that this section of the manpage can be wrong (http://man7.org/linux/man-pages/man5/proc.5.html, search for stat):
(2) comm %s
The filename of the executable, in parentheses.
This is visible whether or not the executable is
swapped out.
From the manpage of scanf:
s: Matches a sequence of non-white-space characters; the next
pointer must be a pointer to the initial element of a
character array that is long enough to hold the input sequence
and the terminating null byte ('\0'), which is added
automatically. The input string stops at white space or at
the maximum field width, whichever occurs first.
So it’s clear no provision was made for executables having whitespace in them.
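The failure mode is visible without any C at all; a naive whitespace split (which is what scanf’s %s amounts to) breaks comm into pieces and shifts every later field:

```python
line = "2972 (tmux: client) S 2964 2972 2964"
fields = line.split()
# comm is broken into two tokens, so the field that "should" be the
# state letter S is now occupied by the tail of the executable name
print(fields)
```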
This issue can be simply avoided by not allowing whitespace in executable names, and by reporting such occurrences as a bug.
This issue can be simply avoided by not allowing whitespace in executable names, and by reporting such occurrences as a bug
Ahhh, the Systemd approach to input validation!
Seriously, if the system allows running executables with whitespace in their names, and your program is meant to work with such a system, then it needs to work with executables with whitespace in their names.
I agree somewhat with the OP - the interface is badly thought out. But it’s a general problem: trying to pass structured data between kernel and userspace in a plain-text format is, IMO, a bad idea. (I’d rather have a binary format. You have the length of the string encoded in 4 bytes, then the string itself. Simple, easy to deal with. No weird corner cases.)
I agree it’s a bug.
However, there’s a strong convention that executables do not have whitespace in them, at least in Linux/Unix.[1]
If you don’t adhere to this convention, and you stumble across a consequence to this, does this mean that a format that’s been around as long as the proc system is literally broken? That’s where I reacted.
As far as I know, nothing crashes when you start an executable with whitespace in it. The proc filesystem isn’t corrupted.
One part of it is slightly harder to parse using C.
That’s my take, I’m happy to be enlightened further.
I also agree that exposing these kind of structures as plain text is arguably … optimistic, and prone to edge cases. (By the way, isn’t one of the criticisms of systemd that it has an internal binary format?).
[1] note I’m just going from personal observation here, it’s possible there’s a subset of Linux applications that are perfectly fine with whitespace in the executable name.
I agree with most of what you just said, but I myself didn’t take “broken” to mean anything beyond “has a problem due to lack of forethought”. Maybe I’m just getting used to people exaggerating complaints (heck I’m surely guilty of it myself from time to time).
It’s true that we basically never see executables with a space (or various other characters) in their names, but it can be pretty frustrating when tools stop working or don’t work properly when something slightly unusual happens. I could easily see a new-to-linux person creating just such an executable because they “didn’t know better” and suffering as a result because other programs on their system don’t correctly handle it. In the worst case, this sort of problem (though not necessarily this exact problem) can lead to security issues.
Yes, it’s possible to correctly handle /proc/xxx/stat in the presence of executables with spaces in the name, but it’s almost certain that some programs are going to come into existence which don’t do so correctly. The format actually lends itself to this mistake - and that’s what’s “broken” about it. That’s my take, anyway.
Thanks for this thoughtful response. I believe you and I are in agreement.
Looking at this from a slightly more unusual perspective, how does Linux handle executables with (non-whitespace) Unicode characters in their names?
Well, I’m no expert on unicode, but I believe for the most part Linux (the kernel) treats filenames as strings of bytes, not strings of characters. The difference is subtle - unless you happen to be writing text in a language that uses characters not found in the ASCII range. However, UTF-8 encoding will (I think) never cause any bytes in the ASCII range (0-127) to appear as part of a multi-byte encoded character, so you can’t get spurious spaces or newlines or other control characters even if you treat UTF-8 encoded text as ASCII. For that reason, it poses less of a problem for things like /proc/xxx/stat and the like.
Of course, filenames being byte sequences comes with its own set of problems, including that it’s hard to know which encoding should be used to display filenames (I believe many command-line tools use the locale’s default encoding, and that’s nearly always UTF-8 these days) and that a filename can potentially contain an invalid encoding. Then of course there’s the fact that Unicode has multiple ways of encoding the exact same text, so in theory you could get two “identical” filenames in one directory (different byte sequences, same character sequence, or at least the same visible representation). Unicode seems like a big mess to me, but I guess the problem it’s trying to solve is not an easy one.
(Minor edit: UTF-8 doesn’t allow bytes 0-127 as part of a multi-byte encoded character. Of course they can appear as regular characters, equivalent to their ASCII meanings.)
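That property is easy to sanity-check (a spot check on a few characters, not a proof): every byte of a multi-byte UTF-8 sequence has the high bit set, so no ASCII-range byte like a space or newline can appear inside one.

```python
for ch in "ą€好\U0001F600":
    encoded = ch.encode("utf-8")
    # lead and continuation bytes of multi-byte sequences are all >= 0x80
    assert all(b >= 0x80 for b in encoded), (ch, encoded)
print("no ASCII-range bytes inside multi-byte UTF-8 sequences")
```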
~ ❯ cd .local/bin
~/.l/bin ❯ cat > ą << EOF
> #!/usr/bin/env sh
> echo ą
> EOF
~/.l/bin ❯ chmod +x ą
~/.l/bin ❯ ./ą
ą
If you don’t adhere to this convention, and you stumble across a consequence to this, does this mean that a format that’s been around as long as the proc system is literally broken?
Yes; the proc system’s format has been broken (well, misleadingly-documented) the whole time.
As you note, using pure text to represent this is a problem. I don’t recommend an internal, poorly-documented binary format either: canonical S-expressions have a textual representation but can still contain binary data:
(this is a canonical s-expression)
(so "is this")
(and so |aXMgdGhpcw==|)
An example stat might be:
(stat
(pid 123456)
(command "evil\nls")
(state running)
(ppid 123455)
(pgrp 6)
(session 1)
(tty 2 3)
(flags 4567)
(min-fault 16)
…)
Or, if you really cared about concision:
(12345 "evil\nls" R 123455 6 1 16361 4567 16 …)
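A canonical encoder for the format is only a few lines (an illustrative sketch of canonical s-expressions, not a library recommendation): atoms are length-prefixed byte strings, lists are parenthesized, and the canonical form contains no whitespace at all, so embedded newlines in a command name are harmless.

```python
def csexp(node):
    # canonical s-expression: atom b"evil\nls" -> 7:evil\nls ; lists -> (...)
    if isinstance(node, bytes):
        return str(len(node)).encode() + b":" + node
    return b"(" + b"".join(csexp(child) for child in node) + b")"

print(csexp([b"stat", [b"pid", b"123456"], [b"command", b"evil\nls"]]))
```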
nobody is parsing that file from the end
As an example the Python Prometheus client library uses this file, and allows for this.
First, to call itself a process could [simply] execute /proc/self/exe, which is an in-memory representation of the process.
There’s no such representation available as a file. /proc/self/exe is just a symlink to the executable that was used to create the process.
Because of that, it’s OK to overwrite the command’s arguments, including os.Args[0]. No harm will be made, as the executable is not read from the disk.
You can always call a process with whatever args[0] you like. No harm would be done.
Although /proc/self/exe looks like a symbolic link, it behaves differently if you open it. It’s actually more like a hard link to the original file. You can rename or delete the original file, and still open it via /proc/self/exe.
No harm will be made, as the executable is not read from the disk.
the executable is definitely read from the disk
Again, this was only possible because we are executing /proc/self/exe instead of loading the executable from disk again.
no
The kernel already has open file descriptors for all running processes, so the child process will be based on the in-memory representation of the parent.
no that’s not how it works, and file descriptors aren’t magic objects that cache all the data in memory
The executable could even be removed from the disk and the child would still be executed.
that’s because it won’t actually be removed if it’s still used, not because there’s a copy in memory
<3 systems engineering blog posts written by people who didn’t take unix101
Instead of telling people they are idiots, please use this opportunity to correct the mistakes that others made. It’ll make you feel good, and not make the others feel bad. Let’s prop everyone up, and not just sit there flexing muscles.
Sorry for disappointing you :)
I got that (wrongly) from a code comment in Moby (please check my comment above) and didn’t check the facts.
I’m not saying that the OP was correct, I’m just saying that:
/proc/self/exe is just a symlink to the executable
… is also not completely correct.
Thanks for pointing out my mistakes! I just fixed the text.
I made some bad assumptions when I read this comment [1] from Docker and failed to validate it. Sorry.
By the way, is it just my bad English, or is that comment actually wrong as well?
that comment is actually wrong as well?
I don’t think it’s strictly correct, but for the purpose of the code in question it is accurate. That is, /proc/self/exe points to the executable file that was used to launch “this” process - even if it has moved or been deleted - and this most likely matches the “in memory” image of the program executable; but I don’t believe that’s guaranteed.
If you want to test and make sure, try a program which opens its own executable for writing and trashes the contents, and then execute /proc/self/exe. I’m pretty sure you’ll find it crashes.
but I don’t believe that’s guaranteed.
I think it’s guaranteed on local file systems as a consequence of other behavior. I don’t think you can open a file for writing when it’s executing – you should get ETXTBSY when you try to do that. That means that as long as you’re pointing at the original binary, nobody has modified it.
I don’t think that holds on NFS, though.
If you want to test and make sure, try a program which opens its own executable for writing and trashes the contents, and then execute /proc/self/exe. I’m pretty sure you’ll find it crashes
Actually, scratch that. You won’t be able to write to the executable, since you’ll get ETXTBSY when you try to open it. So, for pretty much all intents and purposes, the comment is correct.
Interesting. Thank you for your insights.
In order to satisfy my curiosity, I created this small program [1] that re-executes /proc/self/exe in a loop and prints the result of readlink.
When I run the program and then delete its binary (i.e., the binary that /proc/self/exe points to), the program keeps successfully calling itself. The only difference is that now /proc/self/exe points to /my/path/proc (deleted).
[1] https://gist.github.com/bertinatto/5769867b5e838a773b38e57d2fd5ce13
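The readlink half of that experiment fits in a few lines of Python (Linux-only; on systems without procfs the path simply doesn’t exist):

```python
import os

# /proc/self/exe is a "magic" link to the file backing this process;
# readlink() reports the original path, with " (deleted)" appended
# once that path has been unlinked
if os.path.exists("/proc/self/exe"):
    print(os.readlink("/proc/self/exe"))
else:
    print("no procfs here")
```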
Lots of things can already be done with portable sh code. typeset, for example, is quite powerful but is mentioned nowhere in the repo:
In fact, this builtin is often wrapped by the shell itself, at least in OpenBSD’s ksh:
typeset is not posix
typeset -f does
typeset -fu does in your ksh
typeset -i doesn’t really have the same effect in bash and ksh
tr or awk
local, but it’s not posix either
“portable sh code” != “it works on my openbsd”
Note how I did not speak of POSIX; you mistake the ksh example for a general assumption. My reply to pl’s comment tells you what I meant by “portable” (admittedly, the wording was a bit misleading).
what’s your definition of portable?
What do you mean by portable sh code? I just tried it with dash, and it has no typeset builtin.

At least available in Bourne and Korn shell derivatives; (Debian’s) Almquist shell does not implement that particular builtin.