The Language of Numbers 🔢

Tags: literature, language learning, science, linguistics, programming

A few months ago I read this post on reddit and another one I can no longer recall. This Journaly post is of a similar spirit and deals with the exact same topic: The vocabulary introduced in the Harry Potter series.

More concretely, the topic here is the distribution of new words across all the Harry Potter books. Some days ago I was curious about something and literally wanted to see the numbers. The graphics included in this text (in the links that follow) and the text itself are the outcome of my musings. I hope you'll find this text entertaining as well as informative. Welcome!

https://www.youtube.com/watch?v=8Ijk3nepXmM

As I was saying, some time ago I got lost on reddit and, at some point, found myself staring at a curious graph made by an avid Harry Potter reader. That reader had collected the data and crafted the plot in question by hand. I, for my part, am not particularly fond of spending that much time, so I wrote a small computer program to automate the process of extracting the data and doing the analysis I wanted. But I haven't yet told you what the main question and its goal are, so first things first.

Let's start with a bit of background information. A bit of context never hurt anyone.

I train my vocabulary in a given language using a self-made computer program. This program is essentially a poor man's book reader and vocabulary trainer. I've been using it for some time now, so I've accumulated a considerable amount of input, especially for languages such as Dutch and German. Recently, my focus has shifted from "using the program" (i.e. accumulating data) to "extracting information" (i.e. analyzing the data). This post is not directly related to the data of my program, but it certainly is one of the outcomes of having used it. So much for the context.

Let's finally start with the vocabulary in Harry Potter.

Why Harry Potter? Well, because it's what absolutely everyone out there reads in their target language. Just kidding. I simply happen to like Harry Potter, obviously, so why not use it for the data analysis?

What data analysis?

I was (for some weird reason) interested in the following question: How many new words appear while reading Harry Potter, assuming the reader knows nothing at the beginning but understands everything immediately (after having looked each word up once, for example)?

Of course, this is a simplification of reality in several different aspects. First of all, the concept of "word" is grossly oversimplified: In my very crude data analysis, a "word" is simply a "sequence of characters" like "Dobby" or "Winky" or "asdf". A more professional analysis would include a lemmatization step to only consider the lemmas (the headwords, the dictionary forms of the words) and possibly exclude the names of invented characters or places. Secondly, the assumption that we start by "knowing nothing" and nevertheless immediately "remember everything" is not just crude, but simply wrong: People don't usually start reading Harry Potter (or any other book) with no vocabulary in the chosen language, and they surely don't remember absolutely everything at once.
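To make the simplification concrete, here is a minimal sketch of what "splitting into words" means in this post. It's Python, and it's an illustration rather than my actual program:

```python
import re

def tokenize(text: str) -> list[str]:
    """Split a text into "words": maximal runs of letters
    (apostrophes included), lowercased. No lemmatization, no
    filtering of invented names -- "Dobby", "Winky" and "asdf"
    all count as perfectly good words.
    """
    return re.findall(r"[a-z']+", text.lower())
```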

Now that I've described the "problem" itself and its "model", let's present the results. Let's give some numbers.

All seven Harry Potter books together contain over a million words. Concretely, splitting the books into words gave 1,062,740 words. That's every word in all seven HP books, repetitions included. The number of unique words is obviously much lower (concretely: 24,275), but it's not of further interest in this post.
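For what it's worth, those two numbers come from a computation of roughly this shape (the file name is hypothetical, and tokenize is the sketch from above):

```python
# Hypothetical file containing all seven books concatenated.
with open("harry_potter_1_to_7.txt", encoding="utf-8") as f:
    words = tokenize(f.read())

print(len(words))       # total words, repetitions included: 1,062,740
print(len(set(words)))  # unique words: 24,275
```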

The computer program I wrote counted the number of new words in every group of 100 words. Since every group contains exactly 100 words, this count is at the same time the percentage of new words encountered. The sample size is, for sure, arbitrary. Why 100 and not 10, 50 or 12,345, for example? A hundred words simply seemed a reasonable size for the bins or buckets (yes, that's a technical term) of the range of values.
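Here is a sketch of that counting step (again Python, again an illustration rather than my exact code). One modeling choice worth making explicit: a word that appears twice within the same group is only counted as new once:

```python
def new_words_per_bin(words: list[str], bin_size: int = 100) -> list[int]:
    """For each consecutive group of `bin_size` words, count how
    many of them have never appeared in any earlier group. With
    bin_size=100 the count doubles as a percentage.
    """
    seen: set[str] = set()
    counts = []
    for start in range(0, len(words), bin_size):
        group = set(words[start:start + bin_size])
        counts.append(len(group - seen))
        seen |= group
    return counts
```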

After letting the program count the number of new words in every group of a hundred words, I made a scatter plot of the data to visualize the results. The result is https://i.imgur.com/tiJYyfe.png.
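For the record, the plot itself takes only a few lines with matplotlib. A minimal sketch, assuming the counts from the function above:

```python
import matplotlib.pyplot as plt

counts = new_words_per_bin(words)

# One dot per group of 100 words; small, translucent markers,
# since there are over ten thousand groups.
plt.scatter(range(len(counts)), counts, s=1, alpha=0.3)
plt.xlabel("Position in the series (groups of 100 words)")
plt.ylabel("New words per 100 words (%)")
plt.title("New words in the Harry Potter series")
plt.show()
```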

Nice, right?

Well, not really. The scatter plot does indeed show how the density of new words evolves, but (especially along the lines of 1% and 2%) it's quite difficult to tell the individual points of the graphic apart.

The next step was to add a tiny bit of randomness to the data. Or, to be more blunt: I faked the results. Instead of exact values like "one new word" or "two new words" in a given group, every value was shifted slightly. The result of this change is https://i.imgur.com/1g9lzP5.png.
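This trick is usually called jitter: a little uniform noise spreads out points that would otherwise sit exactly on top of each other. Something along these lines (the exact spread doesn't matter much, as long as neighboring integer values stay distinguishable):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Shift every count by a small uniform offset, purely for
# visualization; the underlying data stays untouched.
jittered = np.asarray(counts, dtype=float) + rng.uniform(-0.4, 0.4, size=len(counts))
```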

If you're interested in the data, here it is: https://pastebin.com/6XJMj81F. The faked numbers are here: https://pastebin.com/U0BGmNPP.

The scatter plot https://i.imgur.com/1g9lzP5.png is the main result of the data analysis, and to be honest with you, I think it makes a nice wallpaper. It's a graphic with an amazing amount of information. For example: As one would expect, the number of groups with "zero new words" (i.e. only known words) increases considerably over the course of the books. What I nevertheless found surprising is that even towards the end of the last book there is still a large number of passages containing one or even two percent of new words.

Another curiosity is the beginning of the line of "zero new words": The first point along that line appears quite late in the first book. In other words, the first time that no new words appear is exactly in this fragment of the first Harry Potter book:

... but it was a narrow corridor and if they came much nearer they’d knock right into him — the Cloak didn’t stop him from being solid.
He backed away as quietly as he could. A door stood ajar to his left. It was his only hope. He squeezed through it, holding his breath, trying not to move it, and to his relief he managed to get inside the room without their noticing anything. They walked straight past, and Harry leaned against the wall, breathing deeply, listening to their footsteps dying away. That had been close, very close. It was a few seconds before he noticed ...

This quotation gives a taste of the kind of information that the analysis provides. The quoted fragment corresponds to the information "hidden" in a single point of the scatter plot. Every single dot represents one such fragment of the book, containing a certain number of new words.

It's also interesting to compare the same analysis for the complete Harry Potter series (i.e. a comparable and representative amount of data) across different languages. In German it looks like this: https://imgur.com/7ygwALL.png, and for Italian it's https://imgur.com/CHlygsX.png.

The comparison of the "regions of high density" is of considerable interest: Italian and German show a higher rate of new words towards the "end" (i.e. the reader keeps constantly encountering more new words). I also found it striking that the first group of 100 words containing no new words in the Italian and German translations falls within the second book of the series.

Finally, to get a sense of the "long-term behavior" of the graphic, I made a linear and an exponential fit of the data. The result is https://i.imgur.com/fpq4mPR.png. The linear fit (linear regression) doesn't adequately represent the experimental data (the dots), but the exponential decay arguably does. If this curve is of any significance, it shows that the rate of new words levels off at approximately 1.77% (roughly constant from the second book to the last).
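A fit like that can be done with scipy. The sketch below assumes a decay towards an asymptote, y = a * exp(-b * x) + c, which matches the description above; the starting values are just rough guesses:

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(x, a, b, c):
    # Exponential decay towards the asymptote c: for large x,
    # the predicted rate of new words approaches c.
    return a * np.exp(-b * x) + c

x = np.arange(len(counts), dtype=float)
(a, b, c), _ = curve_fit(exp_decay, x, counts, p0=(10.0, 1e-3, 2.0))
print(f"Asymptotic rate of new words: {c:.2f}%")  # about 1.77% here
```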

That's it for now. Ideally, I'd like to run another analysis taking my own vocabulary into account, and yet another one considering only the lemmas.
