The War Machine 💣

In this post, I'm going to talk about my experience reading Seta in Italian. I won't say much about the story or about whether I liked the book in the end. Instead, this text will be quite data-centric. It's about hard facts and reliable numbers, so at least I know Linda will read it. And hopefully more people as well. To all of you: welcome!

https://www.youtube.com/watch?v=sm8TtOnTLdg

Okay, let's get started. Incidentally, this is the first time I'm writing a post directly in the Journaly text editor. Until now, I wrote all of them in a plain text editor (I love Geany) and pasted the result in right before publishing. But never mind.

So, the book Seta. If you watched (and paid attention to) the last livestream of the amazing Multilanguage Book Club, you'll already know that I read the book during my holidays, in two days. It had been 22 months since I was last in Catalonia (if you're going to call that Spain, please leave this text 🤭), so it was definitely a good thing to reset that counter to zero. Again, never mind.

So, the book Seta. I read the whole book (okay, it's not that long) using a very simple book reader programmed by yours truly. Instead of just describing the poor thing, let me show you what it looks like. Let's see if I can find it... Ah, here it is: https://asciinema.org/a/9bITXnQ4vF9GM5lk5E1kge8lr. Have fun.

Now that you're back from watching that tiny "video" recording, you have a better idea of what my reading experience looked like (and also of my horrible taste in color schemes for highlighting words on the screen). The essence of the computer program is this: it highlights words according to how well I know them. If you know about LingQ or similar software, you'll immediately understand what that means. As I didn't c̶a̶r̶e̶ know about the various existing software products, I just wrote my own little baby, tailored to my specific needs, wishes and bad ideas. So I'm using that for reading e-books, which I first convert (using Calibre) from the original format (.epub, say) to a plain text document, which is what the program uses.
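For the curious, here's a minimal sketch of what "highlighting words according to how well I know them" could look like in Python. It is not the actual program: the function name `highlight`, the `states` dictionary and most of the colors are assumptions made up for illustration (only "unknown words are red" comes from this post). The state names are the categories used later in this text.

```python
import re

# Hypothetical ANSI colors per knowledge state.
# Only "UNKNOWN = red" is taken from the post; the rest is assumed.
RESET = "\033[0m"
COLORS = {
    "UNKNOWN": "\033[31m",       # red
    "GUESSED": "\033[33m",       # yellow (assumed)
    "COMPREHENDED": "\033[35m",  # magenta (assumed)
    "KNOWN": "\033[34m",         # blue (assumed)
    "IGNORED": "",               # no highlighting (assumed)
}

# A "word" is a run of letters (accented letters included, digits excluded).
WORD = re.compile(r"[^\W\d_]+")


def highlight(text, states, default="UNKNOWN"):
    """Wrap every word of `text` in the color of its knowledge state.

    `states` maps lowercase words to a state name; words that are
    missing from it are treated as `default`.
    """
    def paint(match):
        word = match.group(0)
        color = COLORS[states.get(word.lower(), default)]
        return f"{color}{word}{RESET}" if color else word

    return WORD.sub(paint, text)
```

Calling `print(highlight(line, states))` for every line of the plain text file would give a very bare-bones reader in a terminal that understands ANSI colors. A real reader also needs paging and a way to change a word's state while reading, which this sketch leaves out.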

So, the book Seta. The repetition of sentences reminds me of something I've read recently...

Before actually reading Seta, I went through parts of the vocabulary (as you can see here: https://asciinema.org/a/23zWlUx1FkRSABXdoGG7JBVAY). There, I basically rushed through the most frequent (and thus most important) words in the story and marked them according to my estimated state of knowledge. Since I officially don't really speak Italian, there are a lot of unknown words (the red ones). Part of the fun is exactly this: reading in a foreign language that feels quite familiar. As I wrote elsewhere (https://journaly.com/post/11675), I also read the previous book, for the second round of the Book Club, in Italian (it was just wonderful to be able to choose something of your own liking!). I liked the experience so much that I'm repeating it this time.

After having entered a substantial number of new words and having read the book, I plan to train (and hopefully improve) my vocabulary from the book during the remaining time of the Book Club. The reviewing process looks like this: https://asciinema.org/a/DZ09Gi96NyyvUb35811lN8VxU (the screen size is smaller because I used my smartphone for that).

Okay, I promised there'd be some numbers in this text, so let's get started. I hope you're ready.

https://www.youtube.com/watch?v=TtEvE1-cD1E (If you aren't an adult, please don't watch that)

As I said, I don't really speak Italian. By that, I mean that... well, that I don't speak it. More concretely, I haven't yet started to learn the language (i.e. its grammar) consciously and actively. All I've done so far is read books and train vocabulary. Or rather than "training" it, I've mostly just been entering it into the aforementioned computer program for the first time.

Until now, I've entered exactly 12,300 words. Read that again, or to get a feeling for it, watch this: https://asciinema.org/a/A7GX4GIxTJV6NiER8q2yBIc7M. To be precise, my complete current vocabulary knowledge of Italian can be found here: https://zerobin.net/?9e264f2132dc6f4f#go6Pa9X3zeqCspKaCqXSoXZqJD5csVuvz6U9+xTyg1E= (the password is "italian", very original, I know). The nitty-gritty details of this huge text file are, let's be honest, boring, so I've prepared a more digestible form of presentation: a picture. There you go: https://i.imgur.com/YlHj9bE.png.

The graph https://i.imgur.com/YlHj9bE.png shows the evolution of my Italian vocabulary since I started my romantic journey with that Romance language some months ago. As you can see, I started taking my experiment of "reading without knowing" more seriously in June. The majority of words have the state "comprehended", which means exactly that: they are part of my passive knowledge. My (estimated) active knowledge is still quite poor (as the blue line shows) and I plan to improve that in the upcoming weeks. This process of internalizing words shows up as a decrease in the purple graph, as you can see in the picture towards the end of the timeline (in September). This is just a cryptic way of saying that I reviewed comprehended words and rated some of them as known, thereby activating part of my passive knowledge.

The previous data is of a global nature; it's not context-specific. It tells you all sorts of things about my knowledge of Italian vocabulary in general. In other words: so far, there's no trace of the concrete book I've read, that is, Seta.

So, the book Seta. In order to visualize my knowledge of the vocabulary in the book Seta, I came up with the following form of graphical representation: https://i.imgur.com/xia81F4.png.

The horizontal axis shows the frequency of words in the book (note that the axis isn't linear, because not every frequency actually occurs: for example, there's no word that appears exactly 50 times). The vertical axis shows the number of words in each category (these are: IGNORED, UNKNOWN, GUESSED, COMPREHENDED, KNOWN) divided by the total number of words at that frequency, or in other words, the percentage of unknown words, etc., at every frequency.
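The computation behind that picture can be sketched in a few lines of Python. This is a sketch under assumptions, not my actual code: the function name `shares_by_frequency` and the two input dictionaries (`word_counts`, mapping each word to its number of occurrences in the book, and `states`, mapping each word to its category) are hypothetical names chosen for this example.

```python
from collections import Counter, defaultdict

STATES = ("IGNORED", "UNKNOWN", "GUESSED", "COMPREHENDED", "KNOWN")


def shares_by_frequency(word_counts, states, default="UNKNOWN"):
    """For every frequency that actually occurs, return the share of each state.

    word_counts: word -> number of occurrences in the book
    states:      word -> one of STATES (missing words count as `default`)
    """
    # frequency -> how many words with that frequency are in each state
    buckets = defaultdict(Counter)
    for word, freq in word_counts.items():
        buckets[freq][states.get(word, default)] += 1

    result = {}
    for freq in sorted(buckets):
        total = sum(buckets[freq].values())
        # A Counter returns 0 for absent states, so every state gets a share.
        result[freq] = {state: buckets[freq][state] / total for state in STATES}
    return result
```

Only frequencies that really occur end up as keys of the result, which is exactly why the horizontal axis has gaps. For every frequency, the five shares add up to 1, so they can be drawn as stacked bars.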

The previous picture is the main result of my data analysis and its graphical representation. In case you're curious about the raw data (i.e. the vocabulary of the book), here it is: https://pastebin.com/Pc93JkpF. These are all the (unique) words in Seta. A similar list, with the words sorted by decreasing frequency, is here: https://pastebin.com/hNwtBepk. This last word list is what my program uses for training vocabulary.
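In case you want to build such a frequency list for your own book, here is a minimal sketch in Python. It assumes the book is already available as plain text, and its idea of a "word" (a run of letters, lowercased) is a simplification of my own for this example; the function name `frequency_list` is made up as well.

```python
import re
from collections import Counter

# A "word" is a run of letters; apostrophes and hyphens split words (simplification).
WORD = re.compile(r"[^\W\d_]+")


def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter(word.lower() for word in WORD.findall(text))
    # Sort by decreasing count, then alphabetically to keep ties stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

For example, `frequency_list("La seta, la seta e la notte.")` puts "la" (3 occurrences) first and "seta" (2) second, followed by the words that appear only once.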

I hope this text has been informative and not too dull to read. I'm curious about your opinion and value any (constructive) feedback about the text, data, graphics or program. Furthermore, raccoon GIFs are always welcome. So here is one: https://media.giphy.com/media/RxVNyswc0Igj6/giphy-downsized-large.gif.
