1. 20
  1.  

  2. 9

    On the topic of swap, I’d like to encourage people to use zram (or zswap, if you also want disk-backed swap space) - it’s like swap space, but instead of writing pages to disk it compresses them in RAM. IME it’s fast (unlike disk swap, which slows the computer down to a crawl under heavy use), reliable, and neatly handles the “make things slow down instead of crash” problem. Here’s my script for turning it on:

    #!/bin/sh
    set -ex
    modprobe zram                                 # creates /dev/zram0
    echo lz4 > /sys/block/zram0/comp_algorithm    # must be set before disksize
    echo 16G > /sys/block/zram0/disksize          # uncompressed capacity of the device
    mkswap --label zram0 /dev/zram0
    swapon --priority 100 /dev/zram0              # higher priority than any disk swap
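
    To check it took effect, and to see how well the pages are compressing, something like this should do:

    swapon --show    # zram0 should be listed with priority 100
    zramctl          # from util-linux; shows stored vs. compressed data size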
    
    1. 1

      zswap

      Interesting.

      Does it survive the “memory leak turns a swappy system into treacle” test?

      https://lobste.rs/s/arrx3j/linux_performance_why_you_should_almost#c_rjj34s

      Actually the example there is possibly the best case scenario for zswap.

      A better test would be….

      ruby -e 'IO.read( "/dev/urandom")'
      
    2. 8

      Another advantage is that swap gives admins time to react to low memory issues. We will often notice the server acting slowly and upon login will notice heavy swapping. Without swap (as described in the next section) running out of memory can create much more sudden and severe chain reactions.

      As usual with such advice, it really depends on the situation. We use large-memory machines as compute nodes (for machine learning). Having swap (especially in larger amounts) is typically bad. Inevitably, there are OOM situations, either because someone introduces a memory leak (or writes an inefficient program) or because several people run processes simultaneously without keeping an eye on memory.

      With swap, the systems typically become unresponsive to the point where you can’t even SSH into them anymore. No one can access their processes, everybody loses. With no or only a little swap, the OOM killer kills some process, typically one of the compute jobs. Everyone can still log in, you can see in the kernel messages which process was killed, and the processes that were not OOM-killed continue happily. Since some processes run for several days, this is a lot better than swap hell.

      1. 1

        Why not just limit each process/compute group to X amount of memory and avoid the OOM killer almost entirely? Something like HashiCorp Nomad as a job runner will even do this for you.
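
        For a single command, a transient systemd scope can impose the same kind of cap without a full job runner; a rough sketch, where the 64G value and the command are just placeholders:

        # hard memory cap via cgroups; the job gets OOM-killed inside its group if it exceeds the cap
        systemd-run --scope -p MemoryMax=64G -p MemorySwapMax=0 python train.py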

        1. 3

          Because it would complicate things for us. We don’t want to use job runners; one of the nice things about such machines is that (in contrast to typical HPC machines) you can just use them interactively, without jobs, batching, etc. Also, many such tools (I don’t know about Nomad) have other problems, such as counting mmap’ed files towards the memory of a process. A process using 4GB of memory that mmaps a 500GB file then suddenly looks very large.

          Having a job OOM-killed once or twice a month is far less of a nuisance than extra abstraction layers.

      2. 4

        Another advantage is that swap gives admins time to react to low memory issues. We will often notice the server acting slowly and upon login will notice heavy swapping. Without swap (as described in the next section) running out of memory can create much more sudden and severe chain reactions. So usually I would advise to set swap space to about the size of your largest process. For example, MySQL’s configured memory in my.cnf. It can even be smaller. Especially if you have monitoring and/or alerting in place.

        This has occasionally been the difference between “the database seems slow” and “the database is down” for me in the past. Definitely a fan of having swap there to buy a little time for debugging or triage before a process just goes OOM and gets killed.
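
        The triage that swap buys time for is usually nothing fancy, e.g.:

        free -h                                   # how deep into swap are we?
        ps -eo pid,rss,comm --sort=-rss | head    # the biggest resident-memory consumers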

        1. 8

          Maybe my experiences are outliers, but for me, when a system I maintain is swapping, it may as well be down, because the (remote) symptoms make it much more painful to ‘debug’. SSH is super unreliable/slow, and ultimately unusable for any meaningful debugging other than “yep, system is swapping, something is using up all available memory, time to go into the lab”. If the system is down: “yep, system is down, time to go into the lab”. The latter saves me time, I guess, because I don’t have to deal with more steps.

          1. 2

            …until you want to understand what is causing the massive memory usage. In that case, being able to get monitoring metrics during the incident or slowly SSH in is invaluable.
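
            For example, a rough per-process view of who is sitting in swap can be pulled from /proc (kernel threads print no number and sort first):

            for f in /proc/[0-9]*/status; do
              awk '/^Name:|^VmSwap:/ {printf "%s ", $2} END {print ""}' "$f"
            done | sort -k2 -n | tail     # largest VmSwap users last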

        2. 4

          One consideration is that secrets held in memory might get swapped out to disk. YMMV

          1. 4

            IIRC Linux has an option that lets you mark memory as “do not swap this”. Password managers and such use it for precisely this reason.

            1. 6

              Yes, it’s mlock(2). Docker containers need the extra --cap-add=IPC_LOCK flag to use that feature.
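
              For example (the image name is just a placeholder), a container whose process wants to mlock its secrets needs something like:

              # allow mlock(2) in the container and raise the locked-memory limit
              docker run --cap-add=IPC_LOCK --ulimit memlock=-1:-1 my-secret-manager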

          2. 3

            I have to disagree.

            The ability of the operator to reason about and predict system performance and reliability is crucial for most systems. I prefer a system operating at not-quite-peak performance that defines an out-of-memory behaviour I can rely on, vs. a system that delivers marginally better performance combined with occasionally terrible performance.

            1. 2

              A good rule of thumb, unless you’re suspending the whole of your laptop’s RAM into swap, is to keep the amount proportional to disk speed. And make sure you don’t have too much of it.

              1. 2

                Let’s say you’re a reckless dude who thought a swap partition was unnecessary, or just forgot about it during installation. Worry not, at least if you’re on Linux:

                # cd /
                # dd if=/dev/zero of=/pagefile.sys bs=1G count=8   # write an 8 GiB file; sparse/holey files won't work for swap
                # chmod 600 /pagefile.sys                          # swap files must not be world-readable
                # mkswap /pagefile.sys
                # swapon /pagefile.sys
                # $EDITOR /etc/fstab   # optionally persist it, e.g.: /pagefile.sys none swap defaults 0 0
                
                1. 4

                  Hmm, I wonder where I have seen that filename before :)

                  This approach is quite nice because it makes it easy to resize the swap. One downside is that suspend-to-disk won’t work if the disk is encrypted.
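
                  Resizing is just the same steps again at the new size; a minimal sketch (16G is an example):

                  # swapoff /pagefile.sys
                  # dd if=/dev/zero of=/pagefile.sys bs=1G count=16   # rewrite at the new size, no holes
                  # mkswap /pagefile.sys
                  # swapon /pagefile.sys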

                  1. 1

                    This is extra fun especially on CoW filesystems like ZFS: you’re out of memory and need to swap… and you reach into the filesystem to write… whoops the filesystem needs more memory for itself…

                    1. 1

                      Apparently it’s (probably) reliable if (and only if) you turn all the ZFS features off and make sure to pre-allocate the file on disk. So no compression, no checksums, no data cache. Turn on any of those and you’re going to get deadlocks under memory pressure, though.

                      (The FreeBSD people have run into exactly the same problem for the same reasons.)

                      1. 1

                        GP has an interesting point re: CoW filesystems - you probably shouldn’t use a ZFS file for swap. OTOH, ZFS is also a volume manager - using a zvol should be fine. However, it appears Solaris special-cased swap (avoiding deadlock, trading for the possibility of swap bit rot…). So it seems one would be better off with a separate partition for now:

                        https://github.com/zfsonlinux/zfs/issues/7734
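
                        For reference, the recipe the ZFS-on-Linux FAQ suggests for a swap zvol looks roughly like this (pool name and size are placeholders); whether it actually avoids the deadlocks discussed above is exactly what’s in question:

                        zfs create -V 8G -b $(getconf PAGESIZE) \
                            -o compression=zle -o logbias=throughput -o sync=always \
                            -o primarycache=metadata -o secondarycache=none rpool/swap
                        mkswap -f /dev/zvol/rpool/swap
                        swapon /dev/zvol/rpool/swap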

                        1. 2

                          zvol swap on FreeBSD is not fine, I have deadlocked that :D

                    2. 1

                      This is basically the article that I’ve been meaning to write for a while. I have seen a lot of people making a lot of uninformed statements about Linux swap, based either on a deep misunderstanding of what it’s for or (more generally) lack of Linux admin experience.

                      The short version is:

                      1. Swap is useful for moving idle pages of virtual memory to disk instead of keeping them in RAM, so that your RAM can be put to use for better things, like I/O cache.
                      2. Swap is not, and has never been, some kind of canary for an OOM situation. If you’re using it as such, please don’t. If you’re using this as a strawman in your crusade against swap, cut it out. I don’t know who started this myth but I would seriously like to meet them so I can punch them in the face.
                      3. When deciding whether or not to use swap, you have to consider your workload. On a typical hacker’s laptop, you probably don’t need or want swap. If your infrastructure already has some provision for a low-latency I/O cache, like a hypervisor with plenty of “reserved” RAM, you don’t need swap in your VMs. But if you have a host with multiple services that stays up for a long time with ordinary storage attached to it, you want swap.
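
                      A quick way to see point 1 on a long-running box, and to nudge the trade-off one way or the other (as root; 10 is just an example value):

                      free -h                       # “buff/cache” is the RAM being used as I/O cache
                      sysctl vm.swappiness          # default is usually 60
                      sysctl -w vm.swappiness=10    # example: prefer reclaiming cache over swapping anon pages
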
                      1. 1

                        — Even if there is still available RAM, the Linux Kernel will move memory pages which are hardly ever used into swap space.

                        — It’s better to swap out memory pages that have been inactive for a while, keeping often-used data in cache and this should happen when the server is most idle, which is the aim of the Kernel.

                        I would love to see a proper explanation of why this helps. To a layperson, it would seem that with abundant RAM, it makes little difference whether a seldom used page languishes in RAM or on disk.

                        1. 2

                          It’s not the seldom-used pages you care about, it’s the memory you free up to use for fast caching of disk instead.

                          An example: I admin a bunch of hosts with a few dozen docker containers on each. All of these have 8 GB of RAM and 8 GB of swap. Their swap is almost always about half-full (4 GB). Without swap, half of the machine’s physical memory would be needlessly consumed by idle pages and I/O performance would be lower since there wouldn’t be much left over for disk cache.

                          Sure, I could adopt a hard-line stance against swap, as many do these days, and add more RAM to compensate, but (for this workload, at least) the net result is just paying more to get the same performance.
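
                          One way to confirm that this half-full swap is idle pages rather than active thrashing is to watch the swap-in/out rates:

                          vmstat 5     # si/so (memory swapped in/out per second) should stay near zero on a healthy host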

                        2. 0

                          Try this….

                          ruby -e 'IO.read( "/dev/zero")'
                          

                          If your system turns into treacle for several minutes….. turn swap off and try again.

                          1. 4

                            What point are you trying to make with this?

                            1. 3

                              I’m a Linux desktop user. With RAM being plentiful these days, a process using up all of it is overwhelmingly likely to be one that’s running out of control. Rather than spending 10 minutes bringing up top, I’d prefer to let it crash into the OOM-killer. ulimit might be a better way of going about that tho.
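
                              Something along the lines of the following, where the 8 GiB value is just an example:

                              ulimit -v $((8 * 1024 * 1024))    # cap this shell and its children at 8 GiB of address space (value is in KiB)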

                              1. 2

                                This is what I do on Linux too. As you say, it’s just a case of how long you have to wait until you kill off a runaway process. Do I want it killed automatically or do I want to struggle for several minutes trying to find and kill it manually?

                                1. 1

                                  Sadly ulimit is a borked design and is only partially implemented on Linux anyway.

                                  You will note that no distro I know of sets ulimits for all users, because they cannot know the amount of RAM and the load characteristics.

                                  cgroups are possibly the technology to use these days.
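
                                  A minimal cgroup v2 sketch, assuming a unified hierarchy at /sys/fs/cgroup with the memory controller enabled (run as root; group name and limit are placeholders):

                                  mkdir /sys/fs/cgroup/capped
                                  echo 16G > /sys/fs/cgroup/capped/memory.max    # hard cap for the group
                                  echo $$ > /sys/fs/cgroup/capped/cgroup.procs   # move the current shell in; children inherit it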

                                2. 1

                                  Users should learn about ulimits I suppose.

                                  1. 3

                                    I think it’s been replaced with k8s… :(

                                    1. 1

                                      Sadly ulimits are a broken design that is only partially implemented and requires tweaking for every different system and load.

                                    2. 1

                                      I.e., if you do the experiment, you will find that if you have swap, your system will start swapping.

                                      Since the cache hierarchy is so steep these days, your system will be utterly unusable, sometimes for as much as ten minutes or more.

                                      If you don’t have swap, you may notice a small slowdown in your system, and then the OOM killer wakes up, kills the guilty process, and you can continue without impairment.

                                    3. 1

                                      … why?

                                      1. 2

                                        The article said “Use Swap”; my one-liner demonstrates that using swap enables any user to bring any system to its knees with a one-liner.

                                        Try my one-liner without swap and the OOM killer just kills the culprit and nothing bad happens.

                                        1. 1

                                          Thanks for the explanation! It was really unclear what point you were making without actually going ahead and turning a system into treacle, which I wasn’t up for.