While setting up my first NFS cluster, it occurred to me that it would be useful if my NFS server could mount its own exported filesystem; it didn’t take long to discover this would hang the server in short order.
If memory serves, there was (and for all I know still is) a deadlock between memory allocation and disk buffering triggered by loopback NFS mounts.
Setting aside the “well, stop doing that” resolution to this issue: does NFS count or not count as ‘modern computing’ here?
Loopback NFS should probably work, but I wouldn’t be surprised if it doesn’t.
But NFS could count. Things like ls used to just work. Stir some NFS into the picture, and you never know if it will work or not.
Now that you mention it, this story is a parallel with ps instead of ls. ps used to just work, but we found a way to add features to the system until we could introduce a failure case where none existed before.
Stir some NFS into the picture, and you never know if it will work or not.
containers : NFS :: processes : files
Select comments from the qmail source code:
/* if it was error_exist, almost certainly successful; i hate NFS */
install.c: if (close(fdout) == -1) /* NFS silliness */
qmail-local.c: if (close(fd) == -1) goto fail; /* NFS dorks */
qmail-recipients.c: if (close(fdtemp) == -1) die_write(); /* NFS stupidity */
According to that codebase, NFS is not exclusively silly, dorky, or stupid–but certainly enough to count.
A bit off-topic, but then again maybe not: the solution for locally mounting exported filesystems without hangs, deadlocks, or speed penalties is to use bind mounts. I use the following piece of bash prose on all hosts in my (home) network:
#!/bin/bash
#
# universal access to all nfs-exported files based on hostname
# without speed penalty when using local files by using bind
# mounts for locally exported directories
#
# hackish but hey, it works...

# autofs passes the key being looked up as $1
key=$1
server=$(echo $key | cut -d '/' -f 1)

# the local machine's short name, FQDN and IP address
hostname=$(hostname)
host=$(host $hostname | head -1)
fqdn=$(echo $host | cut -d ' ' -f 1)
hip=$(echo $host | cut -d ' ' -f 4)

# the NFSv4 pseudo-root: the export marked fsid=0 in /etc/exports
nfs_root=$(egrep '^[^#]*fsid=0' /etc/exports | awk '{print $1}')

case $server in
    $hostname|$fqdn|$hip|localhost|127.0.0.1)
        # local: emit bind-mount entries, one per exported directory
        mntstr="--rbind "
        for dir in $nfs_root/*; do
            mntpnt=$(echo $dir | sed -e "s#^$nfs_root##")
            mntstr="$mntstr $mntpnt :$dir "
        done
        ;;
    *)
        # remote: plain NFSv4 mount of the server's pseudo-root
        mntstr="-fstype=nfs4,noatime,async,proto=tcp,retry=60,hard,intr $server:/"
        ;;
esac

echo $mntstr
This script is ‘mounted’ under /net and allows all hosts to access exported filesystems on all other hosts. When accessing locally exported filesystems, these get bind-mounted instead of NFS-mounted; problem solved.
Why does this work? It is a mapper script for autofs which produces the mount parameters to be used for a certain path. This mapper interprets paths which start with a hostname or IP address, e.g.:
/net/my.host.name/home -> host=my.host.name, path to be mounted on that host=/home
The script is called with the path to be mounted as its argument. It extracts the host name (or IP address), determines whether the filesystem to be mounted is local or remote, and produces a corresponding string of mount parameters as output. This output is used by autofs to mount the filesystem. To use the script, add it to auto.master as the mapper script for a certain path (as stated, I use /net; the script itself is saved as /etc/auto.ufs):
/net /etc/auto.ufs --timeout=3600
As said, it uses bind mounts (remounts of existing directories at secondary locations in the filesystem hierarchy) to mount NFS-exported directories locally, i.e. it bypasses NFS when those directories are mounted locally.
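For illustration only (the hostnames, the /srv/nfs pseudo-root, and the home and media exports are made up), running the mapper by hand with a key as its argument produces roughly this, assuming this.host is the local machine and /etc/exports marks /srv/nfs with fsid=0:

$ /etc/auto.ufs other.host
-fstype=nfs4,noatime,async,proto=tcp,retry=60,hard,intr other.host:/

$ /etc/auto.ufs this.host
--rbind /home :/srv/nfs/home /media :/srv/nfs/media

The first form tells autofs to NFS-mount the remote server’s pseudo-root; the second binds the local exports into place with no NFS round trip.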
If they grow too big, the kernel’s OOM killer will fire inside the container, something will die, and life goes on
Well, presumably not forever.
This is fucking terrible. I wrote about this before in a container rant, but basically a container with memory limits still sees the entire machine’s memory. If the machine has 12 GB of RAM, the process in the container will still see 12 GB of RAM even if it has a 2 GB memory limit. If it caches a bunch of stuff and has a garbage collector that only empties out once it uses a certain percentage of RAM, it will be killed when it hits that 2 GB limit. No out-of-memory error; just straight up killed. It breaks everything about memory management, and the underlying VM (JVM, Python interpreter, Ruby, etc.) now needs to check whether it’s running on Linux, and in a cgroup, to get the “real” memory limit.
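To make the complaint concrete: /proc/meminfo is not namespaced, so anything that reads it (free, top, most runtimes’ default sizing heuristics) sees the host’s total RAM, while the real ceiling lives in the cgroup filesystem. From inside a container limited to 2 GB on a 12 GB host, and assuming cgroup v2 mounted at /sys/fs/cgroup, the mismatch looks roughly like this; it is exactly the extra cgroup check the comment above complains that runtimes now have to do:

free -g                                             # reads /proc/meminfo: still reports the host's 12 GB
cat /sys/fs/cgroup/memory.max                       # cgroup v2: the actual 2 GB limit, in bytes
cat /sys/fs/cgroup/memory/memory.limit_in_bytes     # the equivalent file under cgroup v1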
Some process managers have decided they would rather be the ones who keep tabs on memory size and do the killing themselves
Is that to ensure the offending / biggest process gets killed, not a random one?
All it will do now is stop accesses to that memory space until you do something about it
Stop accesses?! o_0
On FreeBSD, you’d just tell the kernel to specifically deny the malloc that pushes the jail over the limit: rctl -a jail:whatever:vmemoryuse:deny=1g — or to KILL the process that does that (sigkill instead of deny).
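As a sketch only (the jail name whatever and the 1 GB figure are just the placeholders from the comment above), the two flavours of rule look like this:

rctl -a jail:whatever:vmemoryuse:deny=1g      # fail allocations once the jail reaches 1 GB of virtual memory
rctl -a jail:whatever:vmemoryuse:sigkill=1g   # or send SIGKILL to the process that crosses the line
rctl -r jail:whatever                         # remove the jail's rules again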
Is that to ensure the offending / biggest process gets killed, not a random one?
Pretty much. With memcg, there’s no reason we have to defer to the kernel for a hasty decision; the system has spare memory to communicate with a userspace process which could implement some kind of grace policy based on existing resource usage and importance to the overall goal. For example, if this is a Very Important HPC job that’s been running for hours, maybe don’t kill it. Or if it’s a process serving customer traffic, maybe put it into drain mode and kill it off gracefully. Or if it’s a backend task that’s subject to speculative execution, kill off the task and let one of the other backends pick up the slack.
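A minimal sketch of that kind of userspace grace policy, assuming cgroup v2, a numeric limit in memory.max, and entirely made-up choices for the cgroup path, the 90% threshold, and the 30-second grace period:

#!/bin/sh
# Watch one cgroup and apply a grace policy instead of letting the kernel
# OOM-kill whichever task it happens to pick.
CG=/sys/fs/cgroup/myjob                 # hypothetical cgroup path
LIMIT=$(cat "$CG/memory.max")           # assumes a numeric limit, not "max"

while sleep 5; do
    CUR=$(cat "$CG/memory.current")
    [ "$CUR" -lt $((LIMIT * 9 / 10)) ] && continue    # below 90%: keep waiting

    # Over the threshold: this is where a real policy would decide what to do
    # (spare the long-running HPC job, drain the customer-facing server, shed
    # the speculative backend task). The sketch just terminates everything in
    # the cgroup with a grace period.
    for pid in $(cat "$CG/cgroup.procs"); do
        kill -TERM "$pid"
    done
    sleep 30                                          # grace period
    for pid in $(cat "$CG/cgroup.procs"); do
        kill -KILL "$pid"                             # anything still hanging around
    done
    break
done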
I feel like half of modern computing is taking a thing that works and then making it more fragile.
The secret to making open source work as a business model.