About my projects

In this section I publish reports on my hobby projects. These are mostly programming-related, but I occasionally try to do something ex silico as well. I like to write programs that model and simulate physical phenomena or serve as solvers for various puzzles and problems. I try not to rely too much on pre-existing tools; instead, I prefer to write my own and to learn from the mistakes that I make. Like my interests, my project topics cover a broad range of scientific disciplines. I wish to demonstrate that I am capable of learning new subjects independently and in detail and, better yet, that I am adept at putting such knowledge into practice.


Vocabulary (7.12.2016)

In this report I present my new program for expanding one's English vocabulary. This project differs from most of my previous ones in that it involves little mathematics and no physics. Instead, I demonstrate text string handling and a simple text-based user interface to a vocabulary database. Furthermore, unlike in many of my previous projects, I have no results to present, just the working program, so this report will be a relatively short one.

The program

I find that the biggest problem in learning new words is that they are scattered in a sea of familiar words in both spoken and written language. Would it not be easier if one could just filter out the familiar words, then look up the remaining new words in a dictionary or online, and finally save and label them? Well, this is exactly what my new program does: it reads text files (just copy text from a web page or a PDF file) and displays the words not yet saved in a vocabulary database. The user can then look these words up and save them along with optional flags or a comment. The user can later revisit these recent new words by querying the vocabulary database for commented and/or low-frequency words. Once certain of having learned a word, the user can remove its comment and thereby avoid unnecessary repetition. Furthermore, the user can, for example, study the frequencies of different words in the texts read, or find and print the database entries of words matching different criteria. The program should also be applicable to other languages written with the Latin letters a–z, but since it ignores diacritics and the other languages that I know use umlauts and Scandinavian characters, I am focusing on English. As further limitations, neither compounds (whether hyphenated or written with a space) nor contractions are supported, for simplicity. For the same reason I have also chosen not to handle line breaks within words.
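The filtering step described above could be sketched roughly as follows. This is not the program's actual code: the class name UnknownWordFilter, its method names and the use of a plain set for the known words are all assumptions made for illustration. The sketch also assumes that words are case-folded and that every character outside a–z simply separates words, which means hyphenated compounds and contractions fall apart into pieces.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: find the words of a text that are not yet in the vocabulary. */
final class UnknownWordFilter {

    /** Splits text into lower-case a-z words; any other character acts as a separator. */
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) { // split() yields an empty token for a leading separator
                words.add(token);
            }
        }
        return words;
    }

    /** Returns the words missing from the known set, in order of first appearance. */
    static Set<String> unknownWords(String text, Set<String> known) {
        Set<String> unknown = new LinkedHashSet<>(); // keeps order, drops repeats
        for (String word : extractWords(text)) {
            if (!known.contains(word)) {
                unknown.add(word);
            }
        }
        return unknown;
    }
}
```

With the known words "the", "cat", "sat" and "on", the text "The cat sat on the ottoman." would leave only "ottoman" to be shown to the user.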

The program is written in Java and is available as a runnable jar file. It has a simple text-based user interface that is used from the command line. On a Unix-like operating system with Java installed, the program can be run by typing the command java -jar vocabulary.jar. The code consists of two classes: one for the user interface and one implementing the vocabulary database. For efficient access, the database has a tree structure, which is illustrated in Figure 1. The tree has a root node (denoted here with an underscore) that can have a child node (the first letter of a word) for each letter a–z (in this example the vocabulary consists of just a few words, so most letters are missing). These can in turn have child nodes (the second letter of a word) for each letter, and so on. The frequency of a word and its optional flags and/or comment (blue bubbles) are stored in the node corresponding to the last letter of the word. The database depicted here could result from an input text file such as: "a an a an a Andes ATM atom". Accessing and manipulating the database is implemented efficiently using recursive algorithms. However, I must stress that this is by no means high-performance code; I have cut quite a few corners here.

Figure 1. An example of a possible vocabulary database. Words are stored letter-by-letter in branched chains of nodes (black boxes) and their frequency and optional flags and/or comment (blue bubbles) are stored in the last node of the word. The flags A, F and P correspond to "abbreviation or acronym", "foreign" and "proper noun", respectively.
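A tree of this kind (commonly called a trie) could be sketched in Java along the following lines. This is a minimal illustration under stated assumptions rather than the program's actual database class: the names VocabularyTrie, Node, add, frequency, annotate and entries are hypothetical, and the sketch assumes that words are case-folded and restricted to the letters a–z.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a letter-by-letter vocabulary tree (a trie). */
final class VocabularyTrie {

    /** One node per letter; frequency > 0 marks the last letter of a stored word. */
    private static final class Node {
        final Node[] children = new Node[26]; // one slot per letter a-z
        int frequency;                        // occurrences of the word ending here
        String flags = "";                    // e.g. "A", "F" or "P"
        String comment;                       // null when the word has no comment
    }

    private final Node root = new Node();

    /** Records one occurrence of the word, creating nodes as needed. */
    void add(String word) {
        add(root, normalize(word), 0);
    }

    private void add(Node node, String word, int depth) {
        if (depth == word.length()) { // reached the node of the last letter
            node.frequency++;
            return;
        }
        int i = word.charAt(depth) - 'a';
        if (node.children[i] == null) {
            node.children[i] = new Node();
        }
        add(node.children[i], word, depth + 1);
    }

    /** Returns how often the word has been added (0 if it is not stored). */
    int frequency(String word) {
        Node node = find(root, normalize(word), 0);
        return node == null ? 0 : node.frequency;
    }

    /** Sets flags and comment of a stored word; returns false if it is not stored. */
    boolean annotate(String word, String flags, String comment) {
        Node node = find(root, normalize(word), 0);
        if (node == null || node.frequency == 0) {
            return false;
        }
        node.flags = flags;
        node.comment = comment;
        return true;
    }

    private Node find(Node node, String word, int depth) {
        if (node == null || depth == word.length()) {
            return node;
        }
        return find(node.children[word.charAt(depth) - 'a'], word, depth + 1);
    }

    /** Lists every stored word with its frequency, in alphabetical order. */
    Map<String, Integer> entries() {
        Map<String, Integer> out = new LinkedHashMap<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node node, StringBuilder prefix, Map<String, Integer> out) {
        if (node.frequency > 0) {
            out.put(prefix.toString(), node.frequency);
        }
        for (int i = 0; i < 26; i++) {
            if (node.children[i] != null) {
                prefix.append((char) ('a' + i));
                collect(node.children[i], prefix, out);
                prefix.setLength(prefix.length() - 1); // backtrack one letter
            }
        }
    }

    /** Assumption of this sketch: words are case-folded and must consist of a-z only. */
    private static String normalize(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (!w.matches("[a-z]+")) {
            throw new IllegalArgumentException("only letters a-z are supported: " + word);
        }
        return w;
    }
}
```

Adding the words of the example input "a an a an a Andes ATM atom" one by one gives the frequencies a=3, an=2, andes=1, atm=1 and atom=1. The words "atm" and "atom" share the nodes for "a" and "t", which is what makes lookups proportional to word length rather than to vocabulary size.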

Typing the help command (h or H) makes the program print out instructions. Nevertheless, I will briefly go through the main features of the program here as well. The program starts in the normal mode, where the user can manipulate or query the vocabulary database. For example, the user can add, edit or remove words, print words' entries (including their optional flags and/or comment) and find words matching different criteria. The optional criteria include a filter expression, a frequency range, the word's flags and whether it has a comment. Alternatively, the user can switch on the reading mode, where text files are read and the words missing from the vocabulary are displayed to the user. The user can accept these words and add them to the vocabulary simply by pressing the enter key. The vocabulary can also be manipulated in this mode. Lastly, the user can save his or her vocabularies and load them back later.
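Searching by such optional criteria could look roughly like the sketch below. It is only an illustration: the class VocabularyQuery, the Entry type and the find signature are hypothetical, and the filter expression is assumed here to be a Java regular expression, which may differ from the syntax the program actually uses. Passing null for a criterion skips it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of querying vocabulary entries by optional criteria. */
final class VocabularyQuery {

    /** A flattened database entry; the field names are illustrative. */
    static final class Entry {
        final String word;
        final int frequency;
        final String flags;   // any combination of "A", "F", "P"; empty if none
        final String comment; // null when the word has no comment

        Entry(String word, int frequency, String flags, String comment) {
            this.word = word;
            this.frequency = frequency;
            this.flags = flags;
            this.comment = comment;
        }
    }

    /**
     * Returns the words that satisfy every given criterion.
     * Pass null for regex, requiredFlags or hasComment to skip that criterion.
     */
    static List<String> find(List<Entry> entries, String regex,
                             int minFrequency, int maxFrequency,
                             String requiredFlags, Boolean hasComment) {
        Pattern pattern = regex == null ? null : Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (Entry e : entries) {
            if (pattern != null && !pattern.matcher(e.word).matches()) continue;
            if (e.frequency < minFrequency || e.frequency > maxFrequency) continue;
            if (requiredFlags != null && !hasAllFlags(e.flags, requiredFlags)) continue;
            if (hasComment != null && hasComment.booleanValue() != (e.comment != null)) continue;
            matches.add(e.word);
        }
        return matches;
    }

    /** True if every flag character in required also occurs in flags. */
    private static boolean hasAllFlags(String flags, String required) {
        for (char c : required.toCharArray()) {
            if (flags.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

For instance, calling find with a frequency range of 1 to 1 and hasComment set to true would list the commented words seen only once, which matches the kind of query one would use to revisit recently learned words.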


The program seems to work as intended. Furthermore, loading and saving vocabularies, accessing their entries and reading text all execute virtually instantaneously, which suggests that the tree structure and the recursive algorithms of the vocabulary database, as well as the string handling, are efficient. The text-based user interface is a decent one, considering it is my first one ever.

For someone with a broad vocabulary to begin with, going through the thousands of familiar words in English texts seems to take several days, so this program might be better suited for someone with a more limited vocabulary. On the plus side, it is easy to comment new or difficult words and to query the vocabulary for them later. Furthermore, one can do some analysis of the English language, for example by studying the frequencies of different words.


Juggling (1.1.2017)

Vocabulary (7.12.2016)

High-performance phase field crystal (6.8.2016)

Phase field crystal (17.9.2015)

Image stacking (23.7.2015)

Wind tunnel (13.12.2014)

Nebulae (8.11.2014)

Optical lithography (22.7.2014)

Evolution (22.5.2014)

Covalent molecules (23.11.2013)

Sudoku (18.7.2013)

Hama beads (4.6.2013)