Touhou-Project.com

Greasing the wheels

Added 2024-08-13 12:53:20 +0000 UTC

Hey all, hope you’re doing well. You may have noticed that THP’s own Matrix server recently had an outage that lasted about two days. So I thought it would be a good opportunity to talk about some of the practical aspects of administering systems.

For the most part, it’s a job that demands a very uneven amount of time and effort. Installing, launching, configuring, tweaking, troubleshooting, and other similar activities can take many long hours and involve much gnashing of teeth. The day-to-day maintenance, by contrast, is typically low-intensity: the occasional look at system statuses and logs, applying software updates, or looking into optimizing or automating common tasks. In my practical experience, maintenance only takes maybe an hour or two a month, while setup and troubleshooting can mean long nights and even longer days when necessary. It is often critical to have systems up and running well as soon as possible, as you want to minimize the disruption to your userbase.

Maintenance is important and any system admin worth their salt keeps abreast of the news and release cycles for important software. Any system that’s exposed to the internet is susceptible to attacks by bad actors and no technology is 100% secure. This is mitigated in part by hardening software configuration to present a smaller attack surface, but sometimes there’s a lot that’s outside of a sysadmin’s control. So yes, it’s nice to get new features or performance improvements (like, say, with PHP) but far more important is the patching of bugs and vulnerabilities in server software like nginx or the kernel which, if exploited, can mean intrusion, data theft or loss, or other bad things. It’s a never-ending arms race: all servers are constantly probed for vulnerabilities by automated tools and bots, so slacking off on security updates even for a little bit is a recipe for disaster. (As an aside, about half of internet traffic comes from bots. Not all of them are malicious, as some are indexers or perform other relatively benign services, but figures for malicious bots range between 30% and 70% of total bot traffic.)
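To put that aside in terms of total traffic, here is a quick back-of-the-envelope sketch. It only multiplies the rough figures quoted above (bots at about half of all traffic, malicious bots at 30% to 70% of bot traffic); the function name is made up for illustration and the numbers are estimates, not measurements from THP's servers.

```python
def malicious_share_of_total(bot_share, malicious_fraction_of_bots):
    """Return malicious bot traffic as a fraction of ALL traffic.

    Hypothetical helper: it simply multiplies the two rough figures.
    """
    return bot_share * malicious_fraction_of_bots


# Rough figures from the aside above (assumptions, not measurements).
BOT_SHARE = 0.5                      # "about half" of internet traffic
LOW_ESTIMATE, HIGH_ESTIMATE = 0.30, 0.70  # malicious share of bot traffic

low = malicious_share_of_total(BOT_SHARE, LOW_ESTIMATE)
high = malicious_share_of_total(BOT_SHARE, HIGH_ESTIMATE)

print(f"Malicious bots: roughly {low:.0%} to {high:.0%} of all traffic")
# prints "Malicious bots: roughly 15% to 35% of all traffic"
```

In other words, if those estimates hold, somewhere between one in seven and one in three requests hitting a public server may come from something up to no good.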

In practical terms for THP this means that I update the servers and do other maintenance tasks about every 10 days on average, usually at times of low traffic like late at night in the western hemisphere (being a chronic insomniac helps a little in that regard 😭). If I catch wind of a relevant exploit or security issue, of course, I’ll update as soon as fixes are pushed out. Most software updates are minor and don’t require much in the way of intervention or downtime (kernel updates require a reboot, so I only update the kernel maybe once a month, or even less often, if there are no critical issues). Sometimes, though, major versions come out and require updating configuration files or redoing parts of an existing setup. A somewhat recent example is the mail server deprecating some of the ways its internal components interact with other bits of software, which required me to read not-so-great documentation in order to figure out how to make everything work together without errors. In those instances the average of an hour or two of maintenance per month doesn’t hold, but it’s more than offset by the months when things are operating smoothly and it only takes a few minutes to grease the figurative wheels.

The server that hosts the Matrix instance is a little more high-maintenance, as I’m keener on keeping things on the latest software available: each release of the relevant software tends to bring a lot of improvements in features, performance, and implementation of the Matrix specifications. The more important bits of software are on a roughly two-week (rarely three-week) release schedule, with versions typically released on Tuesdays, so I usually update the day after, having waited a little to see if any showstopping bugs turn up. The software that I use to manage those particular components (which are mostly containerized) does a fair deal of the work for me in terms of error checking and setting things up. Its only real downside is its slow execution time, which can easily be 10-20 minutes depending on how many tasks it is performing. If anything goes wrong during the process, it can take much more time to debug but, for the most part, it isn’t too much of a pain. Issues with specific minor components that interact with Matrix do pop up frequently enough to be annoying and do bump up that average maintenance time. As of the time of writing, there’s a head-scratching issue that makes one of the bots I use no longer work, which I’m trying to figure out but which may be an upstream bug, and another issue with the internal mailing software. The last few months have seen me spend a lot of time managing these things compared to the same period last year.

But that’s not all. That 1-2 hour figure is only the active time spent interacting with the servers. I alluded to as much earlier when I brought up keeping abreast of news and important updates, but a lot of sysadmin work is keeping yourself in the loop. One aspect of this is reading tech news or subscribing to mailing lists and checking them at least once a day. It means understanding issues and how they may practically affect your services and servers. To have a handle on that sort of thing, you need to comprehend a great many dry and esoterically worded posts and release notes, often filled with jargon and written on the assumption that the reader is as familiar as the developer with the project and its inner workings. (Read enough of these and you’ll never doubt the importance of the humanities in education, as many STEM people or nerds don’t know how to write and communicate well.) And, well, another aspect is checking in generally with projects and their communities: combing through issue trackers, wikis, documents, forums, roadmaps, blog posts, articles on tech sites, etc. Knowing what changes are incoming, the direction of the project, how active things are, and what others are saying can save a lot of headaches down the line and avoid desperate scrambles to get things working. Not only that, but being familiar with all of it makes troubleshooting easier when (not if) something goes wrong, as useful information and support are easier to find.

All of that said, I don’t wish to overstate the work here. It is a fair chunk of time but it’s not like I spend the equivalent of several workdays in a month enmeshed in all that. It’s not like I go out of my way to schedule time for reading up on things. It’s much more organic, usually prompted by a random thought about a piece of software or a glance at my to-do list; it’s not something I set quotas for and I do it in bursts whenever the mood strikes me. While it’s important to never neglect doing this sort of thing regularly, in my experience it also works best if it’s not forced. Quick research or scanning of information will do just fine most of the time. If a particular topic is interesting or it’s something essential to the working of the site, then I might spend more time reading up on things. So I can’t really say for sure how much time I spend doing this sort of thing in a month, but I will say it’s best done in a relaxed frame of mind, with a nice cup of tea and enjoyable music playing in the background.

As I planned this post, I thought I’d also write about those intense hours of activity that are so unlike routine maintenance, tying it into the recent issues with the server that runs Matrix. When I began writing it, however, I realized that there was a lot to say about both aspects of sysadmin work and that it might be best to split up the posts so it doesn’t become a huge wall of text that most people won’t bother to read. Instead, I’m going the route of shorter posts that most people won’t bother to read ;)

That’s a long-winded way of telling you to expect another post in the near future. I’ll be getting around to the specific issues we experienced and other aspects of sysadmin work. Until then, be of good cheer and take it easy!