Touhou-Project.com

File this under...

Added 2021-08-16 13:45:24 +0000 UTC

Hey all, hope you’ve been well. Sadly, I can’t really say I’ve been doing that well myself, to the extent that I was thinking of pausing the Patreon campaign while my personal life remained a mess. I managed to use my willpower to power through and find the time, however, so I do have progress to report.

Last time I posted, I talked about the overhaul to some of the user-facing stuff as well as issues I had in truly transforming things to an easy-to-maintain form. After struggling to find an acceptable compromise with regards to the archives, I decided to go ahead and completely change how things worked. I aimed to institute some sort of flexible system where I could easily change or update the archived threads while keeping it independent of “current” posts and that database.

Even just conceptualizing what I needed to do was complex. An obvious starting point was to create as complete a post database as I could. This meant taking various data dumps I have as backups and stitching them together. A fair amount has changed about how data are stored, with some entries being no longer relevant and dropped (like additional info needed for Kusaba X’s half-baked “embed” boards) and with others (like tracking if a post is an update or a spoiler) that have been added by me over the years.
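To give a rough idea of what stitching dumps together involves, here is a minimal sketch in Python. It is not the code used for the site: the column names (`id`, `message`, `is_spoiler`, `embed`), the use of `id` alone as the key, and the "first seen wins, flag the rest" policy are all assumptions made for illustration.

```python
def merge_dumps(dumps, defaults):
    """Merge rows from several dumps (oldest first) into one table.

    `defaults` lists every column of the target schema with its default
    value. Columns not in `defaults` are dropped; missing ones are filled.
    """
    merged = {}
    conflicts = []
    for dump in dumps:
        for raw in dump:
            # Keep only columns the target schema knows about.
            known = {k: v for k, v in raw.items() if k in defaults}
            row = {**defaults, **known}
            key = row["id"]
            if key in merged and merged[key] != row:
                # Never overwrite silently: keep the first version
                # and record the id for manual review.
                conflicts.append(key)
                continue
            merged[key] = row
    return merged, conflicts


# Hypothetical target schema and two small dumps.
DEFAULTS = {"id": None, "message": "", "is_spoiler": 0}
old_dump = [{"id": 1, "message": "a", "embed": "x"}]
new_dump = [
    {"id": 1, "message": "a", "is_spoiler": 0},
    {"id": 2, "message": "b", "is_spoiler": 1},
]
merged, conflicts = merge_dumps([old_dump, new_dump], DEFAULTS)
```

Flagging conflicts instead of resolving them automatically is slower, but it means nothing gets overwritten without someone looking at it.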

Reconciling that sort of thing can be tricky as it is altogether too easy to overwrite or lose data if you’re not careful. The process is further complicated by changes in encoding and storage itself; the data has always used UTF-8 but specific implementations of it in the database were flawed and were largely superseded by newer ones. I modernized the “current” database some time ago, which was not without some headache as some loss happened with certain glyphs. Smooth sailing for the site since. But, as you might imagine, doing this process for older database dumps and stitching them together needs a lot of care and time to do properly.
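One way to see where glyph loss can come from is to look for characters that need more bytes than a given storage format allows. The sketch below assumes a three-byte-per-character limit, which is how some older database character sets handled UTF-8; whether that was the exact flaw in this case is an assumption, and the function name is made up for the example.

```python
def glyphs_at_risk(text, max_bytes=3):
    """Return the characters whose UTF-8 encoding exceeds max_bytes."""
    return [ch for ch in text if len(ch.encode("utf-8")) > max_bytes]


# Kanji fit in three bytes; most emoji need four.
risky = glyphs_at_risk("東方 🍵")
```

Running a check like this over a dump before converting it at least tells you which posts deserve a second look afterwards.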

After that, posts had to be manipulated to be brought up to newer standards. For example, the way words in italics/underline/strike-through are stored now is different (and more HTML5 friendly), so really old posts needed to be updated. Basically this was find-and-replace with regular expressions, callbacks and SQL queries, always mindful that it’s important to strike a balance between accuracy and time-saving. It’s just not feasible to manually check thousands upon thousands of entries, so testing specific patterns on selected posts first was the only pragmatic way of doing things. You may not think about it much, but a clickable link or formatted text is stored in its own particular way within the whole of the message area of a single post. Having consistency there is good, as it allows finer control of how things are displayed and avoids errors down the line.
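As an illustration of the regular-expression-plus-callback approach, here is a small Python sketch. The old and new tag conventions shown (`<i>` to `<em>`, `<strike>` to `<s>`, `<u>` to a `span` with an `underline` class) are hypothetical stand-ins, not the site's actual storage format.

```python
import re

# Hypothetical mapping from legacy tags to their assumed replacements.
SIMPLE_TAGS = {"i": "em", "b": "strong", "strike": "s"}

LEGACY_TAG = re.compile(r"<(/?)(i|b|u|strike)>", re.IGNORECASE)


def _swap(match):
    """Callback: decide the replacement for one matched legacy tag."""
    closing, name = match.group(1), match.group(2).lower()
    if name == "u":
        # Underline has no one-to-one tag here, so it becomes a span.
        return "</span>" if closing else '<span class="underline">'
    return f"<{closing}{SIMPLE_TAGS[name]}>"


def modernize(message):
    """Rewrite legacy formatting tags in a single post's message."""
    return LEGACY_TAG.sub(_swap, message)


sample = modernize("<i>a</i> <STRIKE>b</STRIKE> <u>c</u>")
```

Trying a pattern like this on a handful of selected posts first, then widening it, is the kind of compromise between accuracy and time described above.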

In the meanwhile, I was writing tools both for testing and to prototype other things. Obviously, being able to generate an arbitrary thread or post on demand was a final goal. Parsing data and checking it against the content of extant HTML files was an important safeguard; I took several precautions against data loss and tried to make things redundant/easy to roll back if I messed up (I messed up twice when merging databases and once when adding in some data from a thread, for example). In the end I had about seven new “utilities” that I used in various stages to parse things or to check. You might not expect it, but even something as trivial as the size of a thumbnail might get garbled for whatever reason upon copy, so it might be necessary to regenerate the data from the original image.

Merging the databases was not the end of it, as there were also several threads (100+) that were never in the database to begin with. These were largely pre-2009, from before the time that I was involved with running the site. Archived threads had been made by someone else and had been more or less immutable (save for my user scripts and appearance tweaks) since then. So it was important to first find every one of these threads, then pass each on to a parser that would correctly populate the fields that I then needed to insert into the database for storage. Which, not that difficult if you’ve got a regular format, but these threads did not.

I don’t want to bore you too much with the details but picture something like the identifying information for a reply being a hidden element. Which, fine, you can look for that, except that when it’s the first post in the thread, the format is different so you have to look for something else. All those gotchas, including how file information was presented, how names are treated and so on, took a lot of patience and thorough testing to be reasonably sure that I got it all. Not to mention, I also had to “clean” the data, which not only means updating (often mixed and even older) conventions about text formatting but also eliminating a lot of junk information. In-line JavaScript (e.g. on click, highlight post) is not only currently pointless but takes up space and can mess up the formatting of the post in a modern context.
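For the cleaning step, one possible way to drop in-line event handlers is to walk the markup with Python's standard `html.parser` and re-emit every tag without its `on...` attributes. This is only a sketch with made-up names, not the parser actually used for the site; it also discards HTML comments and does not deal with `javascript:` links.

```python
from html import escape
from html.parser import HTMLParser


class InlineJsStripper(HTMLParser):
    """Re-emit HTML while dropping on* attributes (onclick and so on)."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit(self, tag, attrs, close=""):
        parts = [tag]
        for name, value in attrs:
            if name.startswith("on"):
                continue  # event handler attribute: drop it
            if value is None:
                parts.append(name)
            else:
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + close + ">")

    def handle_starttag(self, tag, attrs):
        self._emit(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit(tag, attrs, " /")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_inline_js(markup):
    parser = InlineJsStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


cleaned = strip_inline_js('<a href="#p1" onclick="highlight(1)">&gt;&gt;1</a>')
```

A real parser is more forgiving of irregular old markup than a regular expression would be, at the cost of a bit more code.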

As if that wasn’t enough, I also took the time to transform threads from before THP existed into a format that I could host. It never sat that well with me that entries for the oldest threads were hosted on various 4chan archivers. For one reason, you never know when they might be down. For another, I think it’s poor user experience to have to leave the site you are on to continue to read content. Every thread (save one, which is a special case) was thus converted and now “looks” like a contemporary site thread, with a link to the 4chan archives at the top. In for a penny, in for a pound in terms of effort. But I think the results are worth it even if I had to spend an extra day or two figuring out the details.

Once all that was done, I had to turn all that new data into useful permanent output. With another one of those utilities I basically ran everything through a large check that tested database and threads against one another to make sure I had the right number of posts per thread, all the images and whatever else. The proper, slower method I devised for regenerating archived threads takes a long time (at least forty minutes), as it checks against old files even as it takes from the database to reconstruct threads according to the latest template.
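The post-count part of that check can be pictured with a few lines of Python. The sketch assumes the counts have already been gathered into two dictionaries keyed by thread ID (one from the database, one from parsing the old HTML files); how they are gathered, and the function name, are left as assumptions.

```python
def find_mismatches(db_counts, file_counts):
    """Return {thread_id: (count_in_db, count_in_files)} where they differ.

    A thread missing from one side shows up with None for that side.
    """
    problems = {}
    for thread_id in sorted(set(db_counts) | set(file_counts)):
        in_db = db_counts.get(thread_id)
        in_files = file_counts.get(thread_id)
        if in_db != in_files:
            problems[thread_id] = (in_db, in_files)
    return problems


# Thread 2 lost a post somewhere; thread 7 never made it into the database.
report = find_mismatches({1: 3, 2: 5}, {1: 3, 2: 4, 7: 1})
```

Anything that turns up in the report still has to be looked at by hand, but it narrows down where to look.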

All I have to do from now on is pretty much just alter the archived thread template file if I want to adjust things and I can just run the regeneration script and let it do the rest. If I wanted to do it fast, without checking against files, it would take far less time. This is likely a reasonable option if I decide down the line to create an interface for specifying a single thread to be rebuilt. Or a range of them.

I’m more or less satisfied with how things turned out. Things aren’t perfect: there are still some threads that will need manual fixing and there are bound to be subtle errors here and there. I already had to fix a couple and, since threads have to be checked pretty much manually and that’s labor-intensive, I’ll probably continue to fix them over the coming weeks and months whenever I can be bothered with that thankless and menial work.

There are a few new complexities, like keeping track in the main database of when a thread is archived; individual posts are marked as archived, which I can then copy to the other database with considerably less effort. And I know that I have to figure out how to optimize a lot of this process so that the site can continue to run with minimal intervention on my part.

Still, exciting to have everything in a far easier format to manipulate. It should be way easier to provide a uniform experience in terms of features and UI to users across all threads and already many of the various settings that were previously for the “live” threads only now work without any extra work on the archived threads. I was also able to get rid of a script that was for the oldest archived threads only and any relief of technical debt is a good thing.

I hope to get to all those new features and reworks of existing things at some point in the future. I don’t think it will be this month, however. All that’s been mentioned (and some stuff that I’ve omitted for the sake of brevity) took several actual real days of dedicated work, especially the testing. It’s already more time than I usually spend working on THP and I don’t really have the time, resources and mental capacity to do much else. Without getting into details, life has been a bitch.

That said, I’ll still likely do a couple of things by the end of the month. I have to renew the SSL certificate, sort out some stuff with the domain, maintenance on the server itself and the like. Maybe I’ll be able to squeeze in some more bugfixes and small enhancements but no promises. I’ll see about writing that stuff up when appropriate.

Until then, be well and take it easy. And enjoy rereading old archived stories if you feel like it. Should be a better experience!