You don’t need to put random files somewhere; the ext filesystems have had a “reserved percentage”, set to 5% by default, for exactly this reason. If you ever run out of space on a partition, just
tune2fs -m 0 /dev/sdXY
clean up fs
tune2fs -m 5 /dev/sdXY
and you’re done.
I can already see myself forgetting to do the second part.
Yeah, I prefer the “8GB empty file” approach more. If I’m in a panic trying to fix a server which fell over, I’m much more likely to remember how rm works than how tune2fs works.
alias uhoh='tune2fs -m 0'
alias phew='tune2fs -m 5'
You’re almost never going to use those aliases. If you’re lucky, you’ll still have them in your shell rc file in 5 years when you run out of space. You’ll certainly have forgotten about them by then.
I was being somewhat facetious :) my point is that now you know about the trick, there are ways to make it easier to remember and use if you ever need it.
I think the trick is knowing that the fs saves 5% of the drive for exactly this situation. The exact command can be googled. I’ll admit I looked at the man page before typing in the original comment. I know disk space is cheap these days, but losing another 8GB on top of the 5% you’ve already lost seems really wasteful to me.
Wouldn’t you also be just as likely to forget to recreate the file if you were using the strategy proposed by the article?
I’d probably have them set up in an ansible config. Don’t need to remember, the computer remembers for you.
you can set up tune2fs -m 5 in an ansible config as well.
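For example, a minimal Ansible task for this might look something like the following sketch (the device path is an assumption, and since there’s no dedicated tune2fs module, a plain command task does the job):

```yaml
# hypothetical task: keep the default 5% reservation on the root device
- name: Ensure reserved blocks are set to 5%
  ansible.builtin.command: tune2fs -m 5 /dev/sda1
  become: true
  changed_when: false  # tune2fs gives no idempotence signal by itself
```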
You can tune a file system, but you can’t tune a fish.
This is great information, but I can see it not finding its way to everybody who needs to know it, like someone spending their time mainly on building the application they then serve. When / how did you learn this?
This is something that’s been in UNIX since at least the ‘80s, so any basic intro to UNIX course ought to cover it. I came across it in an undergrad UNIX course. Per-user filesystem quotas have also been around for a long time, this is just a degenerate special case (it’s traditionally implemented by setting root’s quota to 105% and underreporting the ‘capacity’ of the disk).
Note that this is far more reliable than creating an empty file. Most *NIX filesystems support ‘holes’, so unless you actually write data everywhere, you’re not actually going to consume disk space. Worse, on CoW filesystems, it’s possible that deleting a file can require you to allocate space on the disk, which may not be possible if you let the filesystem get 100% full. I believe this is the case for ZFS if the zpool gets completely full.
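To illustrate the ‘holes’ point, here’s a quick sketch with GNU coreutils (file names are arbitrary) showing that a naively created 8GB file may occupy no disk space at all:

```shell
# truncate creates a sparse file: the apparent size is 8G, but no blocks
# are allocated, so it reserves nothing and is useless as ballast
truncate -s 8G sparse-ballast
du -h --apparent-size sparse-ballast   # reports 8.0G
du -h sparse-ballast                   # reports (close to) 0 actually used

# fallocate (or dd from /dev/zero) actually allocates the blocks
fallocate -l 8G real-ballast
rm sparse-ballast real-ballast
```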
Thanks! In the early years after university (think mid-2000s), I often wished my school had given some more practical knowledge. They were very much on the theory side, so I learned a lot of OS concepts and was ready when functional programming really landed, but only my internships stooped to discuss pragmatic things like version control. If you didn’t get this knowledge from academia or if you didn’t go through academia in the first place, where would you look for it?
It’s been over 20 years since I read a Linux book, but I’m pretty sure that the last one I read covered it. It’s the sort of thing I’d expect to see in anything aimed at sysadmins.
The output of df and fsck will tell you about this. I probably learned about it when I noticed the filesystem was one size but df reported a slightly smaller one.
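A quick way to see it for yourself (the device path in the comment is just an example):

```shell
# On ext2/3/4, Used + Avail comes up short of Size in df's output,
# because Avail excludes the reserved blocks:
df -h /
# The exact figure lives in the superblock (needs root; device is an example):
#   tune2fs -l /dev/sda1 | grep -i 'reserved block count'
```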
This is bad. Bad advice and bad practice overall.
Why not just properly partition the disk and keep things separated? An 8GB empty file won’t help if you can’t even log into the machine to delete it. Besides, deleting the file doesn’t guarantee that the space becomes immediately available.
Other posts in this series:
You joke, but:
You can rely on Go picking up stupid ideas and doing them for real ..
Thanks for sharing – that was a fun read! “Ballast” is such a perfect name for this.
I have mixed feelings about this one. It’s a good trick if you don’t have good enough skills to be effective at recovery without it. But once you go past one or two servers that you’re running in production long term I think you should be able to solve this in a better way. Or at least be really aware you’re using weird tricks rather than proper solutions.
“A better way” being: separate users for separate purposes, separate partitions for system / runtime data / logs, block reservations for root where needed, log rotation, monitoring / alerting, knowing how to quickly find the real cause using du and deal with that.
“A better way” should be alerting: something that sends you a text or an e-mail, or that you check daily, as you approach 90% or whatever threshold you set.
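A bare-bones version of such a check might look like this (the 90% threshold and the notification hook are placeholders; wire it to cron and the mail/pager of your choice):

```shell
#!/bin/sh
# Alert when the root filesystem crosses the threshold.
threshold=90
usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$usage" -ge "$threshold" ]; then
    # replace echo with your mail/SMS/pager integration
    echo "WARNING: / is at ${usage}% on $(hostname)"
fi
```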
That’s only one piece of the puzzle though. Alerting is great, but if something is spamming content, by the time you log in, you may be already 100% full and failing. Containing that problem to one service and keeping the system running needs to happen as well.
I thought it was a solved problem – make different users for services and put disk quotas on their directories
Seems like it would be more useful to just poll regularly and notify on space running out.
Otherwise you’ll just have a server that’s completely filled up, ALSO has an 8GB blank file on it, and that you still can’t SSH into.
If your distro doesn’t let you SSH in when the disk is full, you need to get another distro. I’ve filled up my disk on Ubuntu and Debian many times, and have never had trouble logging in with SSH.
I do this with ZFS reservations for personal and small stuff. For prod I do FS monitoring.
I noticed an internal tool at my work would drop a 5G file called “paperweight” in its data directory when deployed. I was confused until I found out it was exactly this pattern: big file that operators could delete to aid recovery when the disk filled up. A smart trick!
This reminds me of a character from the early days of my career, working at a large energy company in the late 1980s. We ran MVS/XA and IMS DB/DC as the database and transaction monitor system, and I worked in the database support group, which looked after the IMS system itself and the databases of the applications that ran on it.
He told us that when they first installed one of the large systems using IMS, they’d put in some statements (this was 370 assembler) that introduced some quite small arbitrary delays, in various places. Because he knew that at some stage management would come and ask if they could improve performance. And when they did, they could remove some of the delays. I never saw the actual delay code, so I don’t know if it was true or not, but I believe him, and the idea has stuck with me ever since.
No way! I have heard of people putting other shoddy things in their code (including hidden switches that put devices into a “demo mode” so that features that aren’t actually working yet can be simulated), but that’s on another level. As a young software engineer, I always love hearing stories from veterans.
What if, instead of creating a file that does nothing besides take up space, you used it as an additional swap file? That way the space can be used by the kernel to cache things. In an emergency situation you can delete this extra swap file (and thus free up disk space). I would call that a win-win situation.
Can you guarantee that you will have enough space in RAM and other swap devices at the time of your disk space emergency? If you can’t, you won’t be able to swap off your swapfile, so it doesn’t work as reserved space that can be freed in emergencies. If you can, you probably aren’t getting any actual benefit from the extra swap space (The only exception I can think of is if your dual-use-reserved-space-and-swapfile is on a fast disk, meaning it’s preferable to swap to it over a large, slower swap device, but you always have enough space to turn it off).
Systemd has this built in, apparently:
sudo journalctl --vacuum-size=1G
Vacuuming done, freed 2.9G of archived journals from /var/log/journal/
As you can see, I could still remove another gigabyte even after cutting away 2.9 gigabytes.
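And rather than vacuuming by hand, the cap can be made permanent in journald’s configuration (the 1G figure just mirrors the example above):

```ini
# /etc/systemd/journald.conf
[Journal]
SystemMaxUse=1G
```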
I used to do something similar with LVM on Linux servers. Leave a few gigs unallocated to a volume so you can dip into it in an emergency.
I wonder: if this technique were referred to as ‘canarying’ or some other polished devops term-du-jour, would people view it as less of a kludge and more of an acceptable solution?