Touhou-Project.com

And just exactly what are you watching?

Added 2018-10-08 07:23:05 +0000 UTC

Hey guys, you may have noticed the new features on the site by now and I hope you’re enjoying them. I wanted to talk about one of the things that took the most work to implement: the watched threads feature.

But first, it’s worth bringing up why this feature went away in the first place.

The original Kusaba X implementation of the watched threads feature was not that great. For starters, it limited the list of threads to a single board, which is okay enough when you’ve got a single board or two you frequent, but kind of pointless when you’re following a dozen stories across six boards. If you have to open a board page to see what’s new, why not just look at the story thread while you’re at it? Even in its heyday there weren’t very many new threads being created on THP every week and the number of active stories on most boards was just a handful at a time.

Implementation-wise, quite a bit of it depended on the inline scripts that used to be on every page of the site. It would have been a pain to update and modernize like the other script-dependent things I ended up redoing. That said, that wouldn’t have been an insurmountable roadblock. The real problem was related to how it got post data.

Without boring you to tears, basically there was a special table in the SQL database that mainly held IP addresses and thread numbers. Multiple entries per IP, with no real expiry or context to know if the information was still important to retain. So the site scripts in the watcher would tell the PHP intermediary to run a few functions to retrieve data, some from this table and some from the posts table, to get an updated count of total posts and so forth.

A lot of this checking used variables that were board-specific and, honestly, pretty inefficient. At the time I axed the system I was still less knowledgeable about the inner workings of THP so I wasn’t sure I could do a good job tweaking it or replacing it properly. But, more importantly, I recognized that the amount of work that I would have to do would be perhaps no less time-consuming than doing something up from scratch. So I decided to ditch the system when I updated the user scripts as it’d no longer work. I didn’t know when I would get around to it but I was confident that I’d find a suitable solution.

There were two systems I had as reference that gave me food for thought when deciding how to approach the issue:

1) A similar system to the initial implementation whereby the database is queried and those results parsed and then spat out at the user.

There are some advantages to this, mostly in that it’s less error-prone as the results are “live” from the site and guaranteed to be the latest data. But I thought that whatever advantages it might have had were offset by the fact that you were still executing database queries every time the thread watcher checked for new posts. It’s a lot of data to keep track of and, even with safeguards to limit how often someone could check, it can still be open to abuse or simply problems of scale. THP doesn’t use all of the resources it has available to run and I’d like to keep it that way. Not only does it mean we have room to grow and deal with spikes but it also keeps costs down.

2) A system like 4chan X or other similar scripts where threads are queried directly, posts counted and quantities set client-side.

This would have been the most direct way to do it. It doesn’t really need any board software modifications except for the actual script that users run. The disadvantage, though, is that in order to build a database of posts, you have to query threads directly. This means loading up pages, counting how many posts they have and then adding that to the localstorage. Not that great for your bandwidth when you’re watching more than a handful of threads.

Ultimately, I went with something closer to 2) than 1), but one that tries to minimize the downsides of both. There is no SQL database with the watched threads and IPs etc. What happens instead is that every time a post is submitted, after it goes through all the checks to determine whether it’s valid and is eventually inserted into the database, a series of functions run. These functions identify which thread the post belongs to (or if it’s a new thread), check if there are updates in the thread or if the post was an update itself, filter that data, add a timestamp, and then write to a small JSON file.
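To make the flow above a bit more concrete, here is a minimal sketch of the "filter, timestamp, write" step. It is not the actual THP code (the real thing lives in the PHP board software), and every name in it (`ThreadEntry`, `InsertedPost`, `recordPost`, the "board/threadId" key format, the field names) is made up for illustration.

```typescript
// Hypothetical shape of one entry in the JSON file; the real field names are not given.
interface ThreadEntry {
  posts: number;    // total posts in the thread at the time of writing
  updates: number;  // how many of those are story updates
  stamp: number;    // Unix time (seconds) when this entry was written
}

// Assumed layout: one entry per thread, keyed by "board/threadId".
type WatchIndex = Record<string, ThreadEntry>;

// What the board software might know right after inserting a valid post (assumed).
interface InsertedPost {
  board: string;
  threadId: number;      // for a new thread, this is the post's own number
  threadTotal: number;   // post count of the thread, including this post
  threadUpdates: number; // update count of the thread, including this post
  ip: string;            // deliberately never copied into the file
}

// Keep only the fields the watcher needs, stamp them, and return the new index.
function recordPost(index: WatchIndex, post: InsertedPost, now: number): WatchIndex {
  const key = post.board + "/" + post.threadId;
  const entry: ThreadEntry = {
    posts: post.threadTotal,
    updates: post.threadUpdates,
    stamp: now,
  };
  return { ...index, [key]: entry };
}

// The string that would be written out as the small JSON file.
const exampleFile = JSON.stringify(
  recordPost({}, { board: "a", threadId: 100, threadTotal: 41, threadUpdates: 6, ip: "203.0.113.9" }, 1538983385)
);
```

The point of the sketch is that nothing per-user ends up in the file: the IP is dropped and only per-thread totals survive.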

The result is a small database that gets updated every once in a while (as in, it’s not triggered by some IP address added as an entry) and doesn’t require any special processing by the board software beyond this. The client-side script asks for this database every once in a while and then compares its own localstorage information against the file. The processing is done client-side, so no additional server resources are used. It’s quick and painless and the file is unlikely to ever get too large, unlike, say, a whole thread being counted as in example 2). Just a quick comparison: the current JSON file is about a single KB in size, whereas a single big thread on THP can easily be 400KB. So if you had multiple threads you needed to check against… well, you get the picture.
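The client-side comparison described above can be sketched roughly like this. Again, this is an illustration under assumptions rather than the site's actual script: `ThreadEntry`, `SeenCounts`, `countNewPosts`, the field names and the thread keys are all hypothetical. In a browser, the stored counts would presumably come from `localStorage` and the file from a `fetch` call, but the sketch takes both as plain objects so the logic stands on its own.

```typescript
// Hypothetical entry shape; the post does not give the real field names.
interface ThreadEntry {
  posts: number;    // total posts in the thread when the entry was written
  updates: number;  // how many of those are story updates
  stamp: number;    // Unix time (seconds) of that write
}
type WatchIndex = Record<string, ThreadEntry>; // keyed by "board/threadId" (assumed)
type SeenCounts = Record<string, number>;      // post count the browser stored per watched thread

// For each watched thread, work out how many posts are new since the stored count.
function countNewPosts(seen: SeenCounts, remote: WatchIndex): Record<string, number> {
  const fresh: Record<string, number> = {};
  for (const key of Object.keys(seen)) {
    const entry = remote[key];
    // A thread missing from the file has no data, so nothing is reported as new.
    const total = entry ? entry.posts : seen[key];
    // Clamp at zero so a deleted post never shows up as a negative count.
    fresh[key] = Math.max(0, total - seen[key]);
  }
  return fresh;
}

const freshCounts = countNewPosts(
  { "a/100": 40, "b/7": 12 },
  { "a/100": { posts: 43, updates: 6, stamp: 1538983385 } }
);
// freshCounts is { "a/100": 3, "b/7": 0 }
```

For scale, using the numbers above: polling a dozen 400KB threads directly would pull roughly 4.8MB each time, against about 1KB for the file.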

There are some disadvantages from both worlds as well, as it’s not as accurate and things like cleanup are only really done sporadically by the board software. Specifically, when a file is deleted on a board, it checks for dead threads and prunes the JSON file accordingly. It also can’t really deal that well with posts being deleted and it may under- or over-report new posts in very specific circumstances. But they’re generally unlikely and, at the very least, will let you know that something happened in that thread. All in all, I think it’s the best solution given our reality and a good compromise to boot. Plus, JSON files are more or less human-readable and thus easy to be used by third parties if someone ever wanted to do something cool with that data.
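The pruning step mentioned above could look something like the following. This is a guess at the general shape, not the real cleanup code: `pruneDeadThreads`, the entry fields and the idea of passing in a list of live thread keys are all assumptions for the sake of the example.

```typescript
// Hypothetical entry shape; the real field names are not given in the post.
interface ThreadEntry {
  posts: number;
  updates: number;
  stamp: number;
}
type WatchIndex = Record<string, ThreadEntry>; // keyed by "board/threadId" (assumed)

// Keep only entries whose thread still exists; liveKeys would come from the board software.
function pruneDeadThreads(index: WatchIndex, liveKeys: string[]): WatchIndex {
  const kept: WatchIndex = {};
  for (const key of Object.keys(index)) {
    if (liveKeys.indexOf(key) !== -1) {
      kept[key] = index[key];
    }
  }
  return kept;
}

const prunedIndex = pruneDeadThreads(
  {
    "a/100": { posts: 43, updates: 6, stamp: 1538983385 },
    "a/55": { posts: 9, updates: 1, stamp: 1538000000 },
  },
  ["a/100"]
);
// prunedIndex only contains "a/100"
```

Because this only runs when a deletion happens, a dead thread can linger in the file for a while, which matches the "sporadic" cleanup described above.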

The JSON-creation part has been running automatically for a few weeks now while I finished up all the user-facing bits of code. That said, there’s a related quirk that may be annoying to some. Since the watcher queries the file for its initial localstorage info whenever you add a thread, threads that haven’t gotten any posts since the code started running report zero posts total. This means that whenever they get their next post, if someone is watching the thread, the thread watcher will report the total number of posts in the thread as new (because obviously anything that is not zero is new). I figured it wasn’t a big deal as it probably won’t be very common for someone to be watching an older thread that’s been dead for months when someone posts in it.
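Here is a small sketch of how that quirk plays out, assuming the watcher falls back to a baseline of zero when a thread has no entry in the file. The function names (`baselineFor`, `newSince`), the field names and the thread key are hypothetical; only the behaviour is taken from the description above.

```typescript
// Hypothetical entry shape; the real field names are not given in the post.
interface ThreadEntry {
  posts: number;
  updates: number;
  stamp: number;
}
type WatchIndex = Record<string, ThreadEntry>; // keyed by "board/threadId" (assumed)

// Baseline stored when a thread is first watched: zero if the file has never seen it.
function baselineFor(remote: WatchIndex, key: string): number {
  const entry = remote[key];
  return entry ? entry.posts : 0;
}

// New posts relative to the stored baseline.
function newSince(baseline: number, remote: WatchIndex, key: string): number {
  const entry = remote[key];
  return entry ? Math.max(0, entry.posts - baseline) : 0;
}

// An old thread with 250 posts that has been quiet since the JSON code went live:
const beforePost: WatchIndex = {};
const storedBaseline = baselineFor(beforePost, "a/55"); // 0, not 250

// Someone finally posts, so the file now records the thread's real total.
const afterPost: WatchIndex = { "a/55": { posts: 251, updates: 30, stamp: 1538983385 } };
const reported = newSince(storedBaseline, afterPost, "a/55"); // 251, although only 1 post is new
```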

So yeah, I’m fairly satisfied with how it all turned out, even if it took a boatload of hours of thinking about it, looking at code, prototyping and learning from mistakes. To put a little perspective to this: it easily took me at least 20 hours in the last two weeks to get just the thread watcher working as I wanted it to. And that was after all the other work, cleaning up bits of old code on the site and laying the foundations for places where the scripts could “hook” into the data. There’s a lot of testing that had to be done and even then a last-minute bug seemed to make its way in: the thread watcher is counting your own posts in a watched thread as “new” when it shouldn’t. I’ll fix that in the next few days, given enough free time.

All that said, I hope you enjoyed a peek into the process. Time permitting, I’ll follow up next week with a similar post talking about the other big time sink: the post preview function. I’m assuming I’m not being boring and tried to ease up on the tech jargon but do let me know in the comments if you want me to be more detailed or less!

Until next time, take it easy.