Bookwards: Rationale

In this post, I'm going to describe the rationale behind my recently published computer program for learning and maintaining vocabulary.

In a certain sense, this post is a continuation of my previous one, but it can be read independently. To be honest, I didn't really like how the last text turned out, and I hope this one turns out much better. Concretely, I hope it's readable and understandable for most (if not all!) Journaly users. I hope you find it informative and, as always, I'd welcome your constructive feedback. If you don't have any, you could always say hi or paste a funny raccoon GIF or a YouTube video about cats. There are never enough of them.

Please fasten your seat belts and get ready for an amazing technical round trip.

A computer program for improving vocabulary

Similar to existing products like LingQ, LWT, FLTR or Readlang, the computer program I've written should help its users to improve their vocabulary. In contrast to the aforementioned existing products, mine really is a very minimal and simple product, which admittedly still has a couple of rough edges. Nevertheless, I hope it's usable in its current form. At least I certainly use it regularly with all my foreign and native languages. It looks like this:

In case you're by now curious about it, please take a look at https://hub.docker.com/r/edufuga/edwords. There you'll find instructions on how to run the program using Docker. I've not written the instructions on how to install Docker itself, but that's a very easy installation procedure under Linux and macOS. If you're using Windows, take a look at these instructions: https://docs.docker.com/desktop/windows/install/.

For the rest of this post, I'm going to describe the main technical details of the program and, as the title promises, the rationale behind it.

Note for techies: The source code of the program can be found here, in case you're curious. Please bear in mind that the current state of several of its parts is mostly a working prototype, which means that it works but the code is definitely not how I ideally would have liked to do it. I plan to improve on those parts in the near future, so that I'm finally content with it.

A bottom-up approach

The program has been written using a bottom-up approach. The most central part of the program is the "word-status-date" structure (what I call a "record"). Each record contains the user's self-assessed state of knowledge of a certain word in the form "word", "state of knowledge", "date of the self-assessment". An example could be: "cat, KNOWN, 2021-12-17 19:19:19".

The main data structure of the program is thus the tuple (word, state, date). Everything else follows logically from this fact or was constructed with this in mind.
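To make the record a bit more tangible, here is a minimal sketch in Python. It is not taken from the program's source code: the names Record, State, to_line and from_line are made up for illustration, and of the state names only KNOWN appears in this post. The line format follows the "cat, KNOWN, 2021-12-17 19:19:19" example above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


class State(Enum):
    # Only KNOWN is mentioned in the post; the other names are assumptions.
    UNKNOWN = "UNKNOWN"
    LEARNING = "LEARNING"
    KNOWN = "KNOWN"


@dataclass(frozen=True)
class Record:
    """The (word, state, date) tuple as an immutable value."""

    word: str
    state: State
    date: datetime

    def to_line(self) -> str:
        # Serialize as "word, STATE, YYYY-MM-DD HH:MM:SS".
        return f"{self.word}, {self.state.value}, {self.date.strftime(DATE_FORMAT)}"

    @classmethod
    def from_line(cls, line: str) -> "Record":
        # Split from the right so that only the last two commas act as separators.
        word, state, date = (part.strip() for part in line.rsplit(",", 2))
        return cls(word, State(state), datetime.strptime(date, DATE_FORMAT))


record = Record.from_line("cat, KNOWN, 2021-12-17 19:19:19")
print(record.to_line())
```

Making the record immutable fits the idea that every self-assessment is a snapshot in time: a new assessment is a new record, not a modified old one.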

Once the program "knows" how to represent the state of knowledge of words in general, the next logical step is to split a given text document into sentences and words. For this, I've written two regular expressions that work well for the languages that I speak and learn. I suppose that languages like Chinese, which are written without spaces between words, would currently not work well with the program. So far, the program has been tailored to my specific language needs and ideas, but the charm of writing your own program is that you can always improve on that :). One of my long-term goals is to develop a generally useful program.
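As a rough illustration of the splitting step, here is a sketch in Python with two regular expressions. These are not the expressions used in the program; both patterns and the function names split_sentences and split_words are assumptions made up for this example.

```python
import re

# Hypothetical patterns for illustration, not the ones used in the program.
# A sentence ends after ".", "!" or "?" followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
# A word is a run of letters, optionally joined by apostrophes or hyphens.
WORD = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def split_sentences(text: str) -> list:
    """Split a text into sentences at the sentence boundaries."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def split_words(sentence: str) -> list:
    """Return the words of a sentence, without punctuation."""
    return WORD.findall(sentence)


text = "Die Katze schläft. It's a well-known fact!"
for sentence in split_sentences(text):
    print(sentence, "->", split_words(sentence))
```

A pattern like this also shows why Chinese is a problem: a sentence written without spaces comes back as one single "word", because nothing in the text marks where one word ends and the next begins.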

Everything is a file

The information (word, state of knowledge, date of the self-assessment) is saved as files on the hard drive. I assume this is quite a strange idea for some computer users, but I personally find it absolutely charming. It's inspired by UNIX and Linux, operating systems in which the idea that "everything is a file" is central.

I wanted something as close to the operating system as possible. This excluded things like an SQL database for persisting (saving) the data. Furthermore, I wanted an easily navigable and "hackable" data structure, which means that the data itself should be usable without my program or other software. All one needs is a file browser and a text viewer or text editor: your vocabulary is directly visible and inspectable without third-party tools. In my view, this provides a level of simplicity and transparency that a database can't match. But it has other issues. In the end, everything is a trade-off.

It's not only the vocabulary knowledge that is saved as files. Absolutely everything the program uses is a file: the book (or more generally, a text document) you read with it is a plain text file, the sentences and words in the text document are saved as files, the user-defined connections between concepts are saved as files, etc.

Atomicity of operations

Everything is saved atomically. This means that every new or reviewed word is a completely new file, which is directly written to the hard disk. File IO (input/output) is thus an essential part of the program. I didn't like the idea of writing a single file containing several words. Instead, I found the idea of "one word = one file" to be much better. That's what the wording "atomicity of operations" refers to. Every word file is an atomic part of the current state of knowledge of the vocabulary in a certain language.

Unicode support

Thanks to the Unicode support under Linux and macOS (and by now theoretically under Windows 10+ as well), it's possible to save files with (almost) arbitrary file names. This makes it viable to use the words themselves as file names. The result is a very simple representation of the vocabulary on the hard drive of the user.
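The "one word = one file" idea, with the word itself as the file name, can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name save_record and the directory layout "vocabulary/german" are made up, and the real program may organize its folders differently.

```python
import tempfile
from pathlib import Path


def save_record(vocabulary_dir: Path, word: str, state: str, date: str) -> Path:
    """Write one record to its own file, named after the word itself."""
    vocabulary_dir.mkdir(parents=True, exist_ok=True)
    # The word is the file name, so non-ASCII words need Unicode file names.
    # A word containing a path separator such as "/" would not work here.
    word_file = vocabulary_dir / word
    word_file.write_text(f"{word}, {state}, {date}\n", encoding="utf-8")
    return word_file


# A temporary directory stands in for the user's vocabulary folder.
with tempfile.TemporaryDirectory() as tmp:
    german = Path(tmp) / "vocabulary" / "german"  # hypothetical layout
    path = save_record(german, "Käse", "KNOWN", "2021-12-17 19:19:19")
    print(path.name, "->", path.read_text(encoding="utf-8").strip())
```

Reviewing a word again simply means writing its file again with the new state and date. No other word file is touched.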

Fast and direct information retrieval

The fact that the vocabulary is saved as files with every word as a file name (I call these "word files") is important for yet another reason: fast word state retrieval.

What the heck are you talking about?

Well, I mean how the program obtains the information "how well does the user know this concrete word?". From the point of view of the program, this is an almost trivial operation: open the file with the word in its filename and read its contents. Every word file contains a single line, which lists the word, its state of knowledge according to the user's self-assessment, and the date of the assessment. Or stated differently: it contains the information on how well the user knew a certain word at a given moment in time. Because there are no indirections, the lookup is really fast. Furthermore, the program uses an internal caching mechanism that keeps loaded words in memory, so the same file doesn't have to be read again and again. And even without that cache, modern operating systems cache frequently accessed files anyway.
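The lookup described above could look roughly like this in Python. It's a sketch, not the program's actual code: the class name Vocabulary, the method lookup and the plain dictionary used as a cache are all assumptions for illustration.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional, Tuple

Record = Tuple[str, str, str]  # (word, state, date)


class Vocabulary:
    """Looks up word files in one directory and remembers what it has read."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self._cache: Dict[str, Record] = {}

    def lookup(self, word: str) -> Optional[Record]:
        # Serve repeated lookups from memory instead of reading the file again.
        if word in self._cache:
            return self._cache[word]
        word_file = self.directory / word
        if not word_file.is_file():
            return None  # The user has not assessed this word yet.
        line = word_file.read_text(encoding="utf-8").strip()
        parts = [part.strip() for part in line.rsplit(",", 2)]
        record = (parts[0], parts[1], parts[2])
        self._cache[word] = record
        return record


with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    (directory / "cat").write_text("cat, KNOWN, 2021-12-17 19:19:19\n", encoding="utf-8")
    vocabulary = Vocabulary(directory)
    print(vocabulary.lookup("cat"))
    print(vocabulary.lookup("dog"))
```

The price of such a cache is that it has to be updated whenever a word is assessed again. Otherwise the program would keep showing the old state.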

This is another reason for not wanting to have a database. A database is not only an opaque and possibly vendor-specific data structure, but it also saves the vocabulary knowledge in a single place, a table. Imagine that I speak several languages and want to practise all of them (go figure); by using a database I'd be forced to save all of that in a single table, so the languages would not be physically separated at all. I personally find this a no-go. The way I imagine things, my vocabulary knowledge should be neatly separated into different parts of my (computer's) memory.

So no databases.

Addendum for techies: Yes, I know there are other alternatives such as document-oriented databases, but I still find the documents (the text files) themselves to be the essential data abstraction.

Git

No, I'm not insulting you.

The last thing I want to mention in this by now too long post is Git. One of the outcomes of using files to save the user's vocabulary knowledge (or actually, one of the reasons behind it) is that the files can be saved and versioned using Git, which provides a completely different paradigm for handling data than a database does. I could, for example, easily make my vocabulary knowledge in several languages openly accessible to anyone with an internet browser. In my eyes, this approach to open data is something appealing and worth striving for.

That's it for now

The post turned out to be a bit longer than I expected, but I hope it has been an interesting read. Please leave any comments you may have on it :)
