People generally fail as sources of randomness, and I seem to be no exception (see for example this). So, I decided that what I really needed was a program that would generate random words for me; then I could pick a name from those. But these words can't be just any random combination of letters: I want them to vaguely resemble actual words, so that they will be pronounceable. To accomplish this, I am going to try to randomly generate n-grams rather than random letters. An n-gram is simply a sequence of n consecutive letters (or words) in a text. For example, the n-grams of length 2 (or 2-grams) in the word ‘hello’ are: ‘he’, ‘el’, ‘ll’, and ‘lo’.
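To make the definition concrete, here is a minimal sketch of n-gram extraction (in Python, which is an assumption on my part; the `ngrams` helper is hypothetical, not from my actual code):

```python
def ngrams(text, n):
    """Return all length-n substrings of a text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("hello", 2))  # ['he', 'el', 'll', 'lo']
```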

For my first attempt, I am going to try a model where the probability of picking an n-gram depends on its position in a word. That way, I will be more likely to produce prefixes at the beginnings of my words and suffixes at the ends. To do this, I use a data structure called a trie. A trie is a particular kind of data structure that is organized as a tree and is used to store and look up words by their prefix. Each node in the trie corresponds to a prefix of a word, and every step you take down the trie adds another letter to this prefix. My thinking was that I would construct a trie from every word in the dictionary and walk down it at random. The probability for each decision I take while walking down the trie corresponds to the number of words in the dictionary that continue that way through the trie. Unfortunately, the end result is that every word I produce in this manner must already be a word that has been added to the trie. This is no good! I need a word that is not in the dictionary.
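A sketch of that first attempt might look like the following (again assuming Python; the node layout and helper names are my own illustration, not necessarily how the original code is structured). Each node counts how many dictionary words pass through it, and the walk picks each branch with probability proportional to that count:

```python
import random

def build_trie(words):
    """Build a trie where each node stores how many words pass through it.
    A child keyed '$' marks the end of a word."""
    root = {"count": 0, "children": {}}
    for word in words:
        node = root
        node["count"] += 1
        for ch in word + "$":  # '$' is an end-of-word sentinel
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1
    return root

def random_walk(root):
    """Walk down the trie, weighting each branch by how many words continue that way."""
    node, out = root, []
    while True:
        letters = list(node["children"])
        weights = [node["children"][c]["count"] for c in letters]
        ch = random.choices(letters, weights=weights)[0]
        if ch == "$":          # reached an end-of-word marker
            return "".join(out)
        out.append(ch)
        node = node["children"][ch]
```

Running `random_walk` on a trie built from the dictionary demonstrates the problem: every path to an end-of-word marker traces an actual dictionary word, so nothing new ever comes out.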

To fix this, I abandoned information about the current position in the word. This time I will use a Markov chain over n-grams of characters.

A Markov chain is a random process that lacks any memory of where it has been. For example, suppose that I were walking through New York in the style of a Markov chain. Every time I came to an intersection, I would randomly decide to either turn left, turn right, go straight, or even go backwards without any regard for where I had already been. In my Markov chain, each state corresponds to a particular n-gram and the probability of picking a new n-gram to follow it is based on the fraction of times that I saw the second n-gram in the dictionary following the first (sliding one letter). For example, suppose that I was using 3-grams and my list of known words was as follows:

- hello
- helicopter
- helios

Then 1/3 of the time I would follow ‘hel’ with ‘ell’, and 2/3 of the time I would follow it with ‘eli’. What results is something much more like what I was looking for. Here are some words that my code produced:
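(Before the list of results, here is a sketch of the Markov chain approach, under the same Python assumption; the function names are hypothetical. Transition counts double as probabilities when sampled uniformly from the list of observed successors:)

```python
import random
from collections import defaultdict

def build_transitions(words, n=3):
    """For each n-gram, record which n-gram followed it (sliding one letter).
    None marks the end of a word."""
    starts, trans = [], defaultdict(list)
    for word in words:
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        if not grams:
            continue
        starts.append(grams[0])
        for a, b in zip(grams, grams[1:]):
            trans[a].append(b)
        trans[grams[-1]].append(None)
    return starts, trans

def generate(starts, trans):
    """Start from a random initial n-gram and follow the chain until a word ends."""
    gram = random.choice(starts)
    out = gram
    while True:
        nxt = random.choice(trans[gram])
        if nxt is None:
            return out
        out += nxt[-1]  # successive n-grams overlap in all but the last letter
        gram = nxt
```

With the three-word dictionary above, `trans['hel']` holds one ‘ell’ (from ‘hello’) and two ‘eli’ (from ‘helicopter’ and ‘helios’), giving exactly the 1/3 and 2/3 split described.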

- ertrutheodontitize
- yrphorous
- myst
- bedalie
- ilziesis
- agapessnessneel
- xes
- hubbrosaccee
- dna
- ivzokoulture
- apsidae
- sover
- raasi
- arseturbound
- yammeter
- tophite
- hromiously
- vutziacercoopic
- uoauteless
- ihl

Thanks to this, I’ve got my new screen name, ‘pejoculant’! I’ve posted my code for both attempts on github; feel free to download it and generate your own names.
