Bringing structure to an organized mess

I don't have good file hygiene. I suffer from a tendency to hoard data. And when I say hoard, I mean hoard. All of my computers and laptops are a complete and utter mess of files, with stuff often appearing in duplicate or left forgotten in some file structure. And really... it's a shame! Like, really. I should be able to do better. So let's try and bring organization to this mess of data and get STRUCTURED.

I mainly write this post as a sort of “collaborative journey”. You basically get to see my direct thoughts on organizing and what systems I'll be using.

Step 0: Defining categories

A very important step indeed. When starting off with organizing data, we first need to look at the kind of data that I have obtained over the years.

In general, I am capable of pointing out these “big” categories:

With these big categories sorted out, let's find ways to tackle each.

Step 1: Existing organization software

One category that I previously just rolled into “media” is images. I save a lot of images. At a rough estimate, I download anywhere between 50 and 1,000 images per day. The overwhelming majority of this is fanart. I suspect that in total, the images I've saved come to around 75GB, and this number is increasing.

Luckily for me, there is a very easy solution for this. Conventional photo management applications mostly suck, and the manual labor involved in tagging this many images is probably not feasible, but there exists software built specifically to solve this issue.

I'm of course talking about booru software. The name “booru” comes from the Danbooru project, whose name is in turn Japanese for “cardboard box”. A number of implementations exist, and I've actually been experimenting on and off with these for a few years now.

That said, I think I have finally found the booru software that I want to work with. I've been using it for a little over a year now and it's called szurubooru. The alternatives all fell short: Danbooru (and its direct derivative Moebooru) is a Ruby project that is... difficult to deploy; Myimouto is a PHP project that has lain abandoned for several years (I attempted a short-lived fork to implement some minor things, but then gave up since it's fucking PHP and I have better things to do with my time); and Gelbooru 0.1.x is not only PHP, but also fundamentally broken and extremely limited in features. Szurubooru, by contrast, hits pretty much all the essential marks for an existing organization system that fits my needs.

To make it clear, when I use existing systems, I typically look at the following “main” concepts:

Szurubooru hits all of these for me. There's no approval system, usage is as easy as uploading the images (and with a few scripts I use, I can automate that to make it comfortable from my phone and my computer) and the only wrinkle is tagging, which I solved using a Python library and a small webapp that can reverse search images.

I can also limit signups, and pulling out is as easy as transferring stuff through the API or, failing that, just moving the images folder on my hard disk to my new system.

Oh, and it deploys in less than 5 minutes and updating is just as easy, since it's all done with Docker.

Images and short movies: SOLVED.

What about comics, though? Comics are another category I kind of have issues with. I collect a lot of them, mainly doujinshi, and most are simply stored as zip files until I want to view them. Luckily for me, again a tool exists that meets the previous needs: Lanraragi. It's Perl, but thanks to how well it uses Docker, I never have to deal with that. Data importing is so easy it's practically not a thing: I just have to put all my doujins in one folder. Pulling out is just as easy, since the zips are never modified while it runs.

Doujinshi: Solved

Step 2: Custom organization software

Okay, this tackles a few things. It's not nearly everything, though. So how do we fix the rest?

Well, for this I turn to the promising Johnny Decimal system. Johnny Decimal is a system based on the Dewey Decimal Classification. Dewey Decimal is a system designed for libraries to organize books. Luckily for me, my dataset greatly resembles that of a library.

Dewey Decimal uses a simple system, but there are some flaws that require modification to make it work with an individual blip of data.

The idea behind Dewey is that everything exists within a category. For example, books about religion have the super category 200. That means that if I pick a book with Dewey classification 232, I would know that it is going to be about religion. This method continues downwards. So in our previous number, the category 230 is about Christianity specifically (Dewey is an American system, so 200 is mostly about Christian subjects with other religions being a footnote in category 290, which is unfortunate but fuck it, this is an example). Then, for a more specific subject, the classification 232 is about Jesus Christ & his family.

Again, sorry for the religious stuff, but it works well enough to illustrate the core principle: Every subcategory relates to the previous categories in a well established system.
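To make that nesting concrete, here's a tiny shell sketch that expands a three-digit Dewey number into the chain of classes it sits in. The function name dewey_chain is my own invention for illustration; it isn't part of any Dewey tooling, and it only handles plain three-digit numbers.

```shell
# dewey_chain is a hypothetical helper: print the super category, the
# division and the specific class for a three-digit Dewey number.
dewey_chain() {
    num=$1
    hundreds=${num%??}   # strip the last two characters: 232 -> 2
    tens=${num%?}        # strip the last character:      232 -> 23
    printf '%s00 %s0 %s\n' "$hundreds" "$tens" "$num"
}

dewey_chain 232   # prints: 200 230 232
```

So 232 lives inside 230, which lives inside 200, and each step to the right only ever narrows the subject.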

Dewey Decimal then also permits you to go even further and declare specific subcategories on these systems for living or dead authors, for specific fields of a science and so on and so forth.

So should we just copy Dewey and call it a day? I mean, we could, and Dewey's system would certainly bring structure to that data, but it wouldn't be a structure that I'd be comfortable navigating per se.

For example, I download a lot of anime. According to Dewey, all of this would go under 741. But should I really categorize Itadaki Seieki (NSFW!) in the same category as JoJo's Bizarre Adventure? Probably not, unless I want to embarrass myself when navigating my file system.

That's where Johnny Decimal comes in. You see, Johnny also examined the Dewey Decimal system and thought it was a good idea. But he changed one crucial element: He doesn't rely on Dewey's specific implementation of the system.

Instead, he encourages you to construct a smaller system (or several of them) that you can navigate through, with all categories and definitions set up by you.

Johnny's system is particularly appealing if you're a designer or an artist who often works on projects related to their job. His system is great at bringing order to that chaos. But it does have its limitations that make it somewhat useless to my interests.

So we must break these limitations. Specifically, for me, the biggest issue is that Johnny encourages you to limit your system to a maximum of 100 categories, with no more than 10 root categories. Johnny does have an answer if you have more than 10 root categories, but it's just not really adequate: It amounts to “have more than one system” or “you haven't properly split out your categories”.

So I'll be doing something a bit closer to the Dewey Decimal system and move away a little bit from Johnny's system: I have 100 root categories and 999 counter categories. Unlike Dewey, there is no obligation for root category 020 (on my laptop, where I do this already, 020 is Multimedia) to relate in any form to category 030 (Projects), but category 021 (Anime) is related to category 023 (Movies).
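If it helps, the "related" rule boils down to "same number once you round down to the nearest ten". Here's a minimal sketch of that; jd_root and jd_related are hypothetical helper names I made up for this post, not something from Johnny's site.

```shell
# jd_root is a hypothetical helper: round a three-digit category down to
# its root category, e.g. 021 -> 020.
jd_root() {
    printf '%s0\n' "${1%?}"   # drop the last digit, append a zero
}

# jd_related is a hypothetical helper: succeed when two categories share
# the same root category.
jd_related() {
    [ "$(jd_root "$1")" = "$(jd_root "$2")" ]
}

jd_root 021                                  # prints: 020
jd_related 021 023 && echo "same root"       # prints: same root
jd_related 021 034 || echo "different root"  # prints: different root
```

It works on the digits as text on purpose: doing arithmetic on something like 021 can trip over the shell treating a leading zero as octal.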

The actual files themselves are then stored in a subfolder of that. For example, the Neon Genesis Evangelion anime is stored in category 021.001. 001 is just a counter here, it doesn't have any actual bearing on the rest of the anime in that folder (exemplified by 002 being Hellsing Ultimate, a show in a markedly different genre).
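Since the counter carries no meaning, picking the next one can be automated. The sketch below is an assumption-heavy example: jd_next_counter is a hypothetical name, and it assumes folders are named like "021.001 Some Title" (ID, a space, then the title) directly inside the category folder.

```shell
# jd_next_counter is a hypothetical helper: print the next unused ID for a
# category folder whose entries are named like "021.001 Some Title".
#   $1 = path to the category folder, $2 = category number (e.g. 021)
jd_next_counter() {
    dir=$1
    category=$2
    # Pull the three-digit counter out of every matching name, keep the highest.
    highest=$(ls "$dir" 2>/dev/null \
        | sed -n "s/^$category\.\([0-9][0-9][0-9]\) .*/\1/p" \
        | sort -n | tail -n 1)
    # Strip zero padding so the shell does not try to read 008 as octal.
    highest=$(printf '%s' "${highest:-0}" | sed 's/^0*//')
    printf '%s.%03d\n' "$category" $(( ${highest:-0} + 1 ))
}

# Usage idea (the path is just an example layout):
#   jd_next_counter ~/"johnny-decimal/020 Multimedia/021 Anime" 021
```

With only 021.001 and 021.002 present it prints 021.003; for an empty or missing folder it starts at 001.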

This is also where I break Johnny's rule intentionally: Johnny says you should stop at this. Once you reach the counter, everything below that must be a flat structure. That is where I disagree. Consider for a second category 034.001. 034 is my Uni work, and 001 refers to the course Data Structures & Algorithms.

Except here I hit an issue. For category 034.001, my goal is to store both the practical assignments and the college assignments. Johnny Decimal would say that I have to split out 034.001 into two subcategories. This however is weird. After all, 034.001 should be about DS&A, and splitting that out means that I have stuff that is about DS&A but is stored in a different folder, even if what is stored there is tangentially related. To solve this, I simply extend the system with another dot.

To understand this better, here's how I would visit a lesson on Recursion:

cd ~/"johnny-decimal/030 Projects/034 University/034.001 Data Structures & Algorithms/034.001.10 Colleges/034.001.11 Recursion"

Oh boi. So to unpack this:

As you can see, here I opted to go for a smaller subset. This is because subcategories just shouldn't reach more than 10 entries (at that point you can just increase the digits anyway, but I would also ask myself whether the contents couldn't be split up better).

Subcategories are optional for me. Not every dataset gains anything from them, and some are entirely incompatible with them and require their own structure (for example, a programming project can't be categorized any deeper than this, because those projects have their own structures).

And that's pretty much it. I'm currently looking for ways to improve it further, but right now, I use the following two zsh functions to get the most out of this system (borrowed from Johnny):

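# Jump to a category by its ID prefix, e.g. `jd 034`.
access_jd_root_function () {
        cd ~/johnny-decimal/*/${1}*
}

# Jump to a counter folder by its ID prefix, e.g. `cjd 034.001`.
access_jd_specific_function () {
        cd ~/johnny-decimal/*/*/${1}*
}

# No export lines: in zsh, `export` only applies to variables, not functions.

alias cjd='access_jd_specific_function'
alias jd='access_jd_root_function'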

cjd allows me to enter a specific directory from wherever I am, i.e. 034.001 would take me into the Data Structures category. jd allows me to access 034 (University) just as easily. Syntax is cjd 034.001 and jd 034. Easy as that.
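One thing these two don't cover is the extra dot level (something like 034.001.11), since each glob is fixed to one depth. A possible way around that, purely as a sketch: let find look for the ID at any depth. jd_find and cjd_any are hypothetical names, and unlike the globs above they expect the full ID followed by a space in the folder name, not just a prefix.

```shell
# jd_find is a hypothetical helper: print the first directory whose name
# starts with the given ID followed by a space, at any depth.
#   $1 = full ID (e.g. 034.001.11), $2 = optional root (default ~/johnny-decimal)
jd_find() {
    find "${2:-$HOME/johnny-decimal}" -type d -name "$1 *" -print | head -n 1
}

# cjd_any is a hypothetical helper: cd into whatever jd_find returns,
# and fail quietly when nothing matches.
cjd_any() {
    target=$(jd_find "$1" "$2")
    [ -n "$target" ] && cd "$target"
}

# Usage idea:
#   cjd_any 034.001.11
```

The trade-off is speed: find walks the whole tree, where the globs only look at one level, so on a big collection the original functions will feel snappier.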

One final thought on mapping out the structure: Easy, yet so hard. I could just use tree, but that would get messy. Perhaps a database system? But those are clunky. I tried Airtable as suggested by Johnny, but I didn't like it.

A readme is an option, certainly, but markdown table syntax is a pain to use.
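One low-tech middle ground I could imagine, sketched here and not something I've settled on: generate a plain text index that only lists the numbered folders, so the noise inside them stays out. jd_index is a hypothetical name, and it assumes every folder that belongs to the system starts with three digits.

```shell
# jd_index is a hypothetical helper: list every folder whose name starts
# with three digits, as paths relative to the root, sorted.
#   $1 = optional root (default ~/johnny-decimal)
jd_index() {
    # Run in a subshell so the caller's working directory is untouched.
    (
        cd "${1:-$HOME/johnny-decimal}" || exit 1
        find . -type d -name '[0-9][0-9][0-9]*' | sed 's|^\./||' | LC_ALL=C sort
    )
}

# Usage idea: jd_index > ~/johnny-decimal/INDEX.txt
```

Because find doesn't stop descending at unnumbered folders, a numbered folder buried inside, say, a programming project would still show up, so this is a starting point at best.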

I don't plan to use this on emails and the like; those are and always will be a mess because E-Mail sucks.

And... I think that's it. That's the system I'll be using to organize my data in a general sense.

Got any suggestions, questions or even improvements to this system? Please let me know! You can comment on this blog, assuming you're viewing it directly and not from a federated site (if you are, just uh visit the original page and you'll find it).