Anagrams – in O(N)

If you understand “big O” notation, skip on down to the “Solving Week 5 Problem 2” heading.

I’m going to discuss this using Perl 6, but I think an intermediate Perl 5 user will be able to figure out what the code does and how it would be able to be implemented in Perl 5.

“Big O” Notation

Most students of computer science have been exposed to “big O” notation.  Essentially, algorithms can be classified by performance.  N represents the size of the data being processed, while O() is shorthand for “execution time on the order of …”

What this means in practice:

Some processing has a constant execution time, no matter how much data is involved.  For instance, getting the first element of a linked list doesn’t depend on the size of the linked list. Likewise, accessing a random sector on an SSD drive doesn’t (well, mostly) depend on how much data is on the SSD – it takes a certain amount of time to fetch any sector, no matter how many sectors are in use (yes, I’m fibbing just a hair here).  These types of things have execution time of O(1).  Now, some constant speed algorithms are faster than other constant speed algorithms, but what we know is that on a truly huge data set, they won’t work worse than small data sets.  You can think of an execution time formula of these might be t = C where C is a constant.

Then there are O(N) algorithms – these are algorithms that execution time could be written in a formula such as t = A × N + C.  In this case, C is again a constant (think of it like startup time, but it’s whatever constant time the alogrithm always requires) and A is some scaling factor applied to this algorithm.  Again, N represents the size of the data.  An example of this type of algorithm is determining the largest value in a list of numbers.  If the numbers aren’t sorted, you have to read every number in the list to find the max.  So if there is a list of 100,000 numbers, you can expect finding the biggest number in that list to complete, roughly, in 10% of the time a list of 1,000,000 numbers. Generally, in computing, O(N) algorithms are considered very good.

Other algorithms might be O(Log N).  That is, they get slower as data size grows, but execution time isn’t twice as slow for twice as many elements. Rather, if you double the number of elements, rather than doubling the time (like O(N) does), it will only increase the execution time by 50%.  You can think of this like a “guess the number game.”  If I tell you that I’m thinking of a number between 1 in 64, and I’ll tell you whether your guess was bigger or smaller than my number, you will be able to figure out my number in no more than 6 guesses (Log2 64 = 6).  Example:

If I’m thinking of 63, your guaranteed way to find it in 6 guesses:

  • Start from the middle – 32.  I tell you higher.
  • Okay, what’s in the middle of 33 and 64?  Let’s say 48. I tell you higher.
  • Okay, what’s in the middle of 48 and 64?  Let’s say 56. Again, I tell you higher.
  • Okay, what’s in the middle of 56 and 64?  Let’s say 60. Again, I tell you higher.
  • Okay, what’s in the middle of 60 and 64?  Let’s say 62.  Again, I tell you higher.
  • Now you know it’s either 63 or 64. To figure out which it is, you guess 63. That’s right.  If you guessed 64, you’d also know it was 63.

In this case, exactly 6 questions are needed. Now if the number I picked was 32, and you started guessing at 32, it would only take one guess – but the idea with O() is to tend towards worse case (it’s actually more complex than that, but that’s good for this purpose).

If I doubled the size of the space – a number between 1 and 128, you would need not 12 guesses but just one more.  In 7 guesses, you could find any number I picked between 1 and 128.  As you can see, that’s a desirable feature – the time to search only takes 1/6th longer, not 2x longer.

Of course, there’s another way of guessing – you could simply start at 1 and go to 64.  The typical time this would take would be 32 guesses on average (worse case is 64, best case is 1).  If the problem was to guess a number between 1 and 128, you would need on average 64 guesses – and it would take 2x as long as finding the number between 1 and 64.  The sequential guessing algorithm is an O(N) algorithm.

Some algorithms, like efficient sorting algorithms are O(N × Log N).  Twice as much data takes more than twice as long to sort. But, in computing, this too is generally considered a good algorithm for most tasks. For sorting, it’s as good as it gets. But there are plenty of bad sorting algorithms that do much worse, O(N²), for instance  Let’s say we sort by finding the smallest number in a list, and putting that as the first number in a new list. Then we find the next smallest number in the source list and make that the second number in the new list.  Basically, we’re selecting the next number N times, and each time we do that we have to go through a list that is on average ½N in length.  So we get a formula for execution time that is something like t = ½N×N, or just t = ½N². We drop the constants (the ½) and say this is O(N²).  If 64 elements took 10ms to sort with this bad algorithm, then 128 elements would take not twice as long but FOUR times as long to execute!  As the data grew, it would eventually take more time than you have.  O(N²) algorithms are generally undesirable.

Solving Week 5 Problem 2

For the Perl Weekly Challenge week 5 problem 2, participants were asked to solve:

Write a program to find the sequence of characters that has the most anagrams.

An anagram is another word that can be made from the rearrangement of numbers in a word.  For instance, “ear” as an anagram of “are”.

I made a few assumptions – first that characters were essentially things like letters.  If there was a letter with an accent or other modifier, it would be treated like a unique characters.  digits would also be treated as if they were individual characters. I also assumed there was a list of valid words to use.  Thus, the sequence of characters that had the most anagrams would be the characters that formed the most valid words in this file.

So…how do you solve this?

One approach would be to read in the list of all the words in the dictionary. Then go through each word and see if every other word in the dictionary was an anagram.  Then take the words that made the most anagrams and output that.

This could be written as, in Perl-ish pseudocode:

my $most-anagram-sequence = '';
my $most-anagram-length   = 0;
for @words -> $word {
  my $tmp = 0;
  for @words -> $possible-anagram {
    $tmp += 1 if is-anagram($word, $possible-anagram);
  }
  if $tmp > $most-anagram-length {
    $most-anagram-length = $tmp;
    $most-anagram-sequence = $word;
  }
}
say $word;

See the nested loop? Nested loops generally mean O(N²) – or worse.  This is O(N²).

I didn’t define the is-anagram() function, but let’s just assume we have one that works well.

In this code, we use $most-anagram-sequence to represent a word that has the most anagrams (I’m ignoring that two different sets of characters might have the same number of anagrams).  I’ll also note that while I’m using the word sequence, really what we’re looking for is a set – but if that doesn’t make sense to you, don’t worry about it.

We use $most-anagram-length to say how many anagrams the word in the variable above has.  I.E. if the word with the most anagrams has 3 anagrams (including itself), this variable would contain 3.

We then assume we have an array of words, and iterate over that array for every word.  The inner loop, using the temp variable, loops over that array again and counts how many anagrams are found.

If that word has more anagrams than the previously (up to that point) found word, we update the most-anagram-sequence and -length variables.

Finally, outside of the loop, we print the word.

I didn’t solve it this way, because I immediately recognized that this approach would take close to forever to run – it’s O(N²).  And we know O(N²) is bad!

But to illustrate this example, this is the code I came up with to do this:

sub MAIN(Str:D $filename = '/usr/share/dict/words') {
    my $most-anagram-sequence = '';
    my $most-anagram-length   = 0;

    my @words = ($filename.IO.lines».fc).unique;

    for @words -> $word {
        my $tmp = 0;
        for @words -> $possible-anagram {
            my @word-chars     = $word.comb.sort.join('');
            my @possible-chars = $possible-anagram.comb.sort.join('');

            if @word-chars ~~ @possible-chars {
                $tmp++;
            }
        }

        if $tmp > $most-anagram-length {
            $most-anagram-length   = $tmp;
            $most-anagram-sequence = $word;
        }
    }

    say "Best sequence ({$most-anagram-length} anagrams): {$most-anagram-sequence}";
}

This should look rather similar to the pseudocode provided before. The big difference is the use of a main subroutine, to allow a user to call the program with a different list of words (by default, if no argument is provided on the command line, it uses the system dictionary, which has one English word per line).

The line that may be a bit unfamiliar is the line starting with @words. Basically, I take the file, read all the lines into a list ($filename.IO.lines), and then use the “hyper” operator (which can also be written “>>.“) to take each element and fold it’s case (for the characters used in English, that just means to make it lower case, but there are some advantages to doing it this way in other languages). I then pull out the unique words.  I.E. if a word exists in the dictionary but differs only in case, I do not want to include it twice.

We also have a line that does this:

my @word-chars = $word.comb.sort.join('');

In this line, we take a word and use comb to split it into a list of characters.  We then sort that list and join that list together (with nothing between the elements).  This basically sorts the letters in the word.  So a word like “are” becomes “aer”.  The benefit of this is that it is then easy to compare (we use the smart match, but an “eq” would work too) two words with their characters sorted.  If the sorted characters are the same in both words, they are an anagram.

Beyond that, everything is self-explanatory.

There are roughly 100,000 words in the system dictionary. I tried running this program, but performance was way too poor – after 10 minutes, I got bored and killed it. Running it on a much smaller dictionary, a 1,000 word dictionary, it took 50 seconds.  On a 500 word dictionary it took 11 seconds.  That’s a bit less time than 25% of the time as we might have predicted, but it’s in the ballpark (if the algorithm was a purely O(N²) algorithm, it would have taken exactly 25% less time).  But if we just assume it is perfectly exponential, based on 1,000 words taking 50 seconds, we could expect this would have taken roughly 75,000 seconds on the 100,000 word dictionary (about a day). I’ll leave proof of that to the reader.

So what do we do?  We need a better algorithm.  What if we had a way to do this in O(N)?  That would be way better!

We’re going to use a Perl hash to do this, because we know hash lookups are basically constant in time, no matter how much data is in the hash (not quite true, but true enough for our purposes).  So we’ll do it this way:

  1. Declare a new hash
  2. For each dictionary word:
    1. Take the word and sort the characters.
    2. If the sorted characters have an entry in the hash, just add one to that entry.  If the sorted characters don’t have an entry in the hash, add an entry with a value of 1.
    3. If the hash element we just updated has a value bigger than the running count of “most” anagrams found thus far for a sequence, then keep track of this word.
  3. Print the word that has the most anagrams found.

The important thing here is that we only look at a given word exactly once – and then move on to the next word.

If the Perl code might be more clear to you, here’s what I came up with:

sub MAIN(Str:D $filename = '/usr/share/dict/words') {
    my %wordseen;
    my %wordcache;
    my $maxcnt = 0;
    my $maxword = '';

    for $filename.IO.lines -> $word {
        next if %wordseen{$word.fc}:exists;  # We do not consider a change in case an anagram
        %wordseen{$word.fc} = 1;

        my $matchkey = $word.fc.comb.sort.join('');
        my $cnt = (%wordcache{$matchkey} // 0) + 1;
        %wordcache{$matchkey} = $cnt;

        if $cnt > $maxcnt {
            $maxcnt = $cnt;
            $maxword = $word;
        } elsif $cnt == $maxcnt {
            $maxword = "$maxword | $word";  # For where TWO OR MORE words have same # of anagrams
        }
    }

    say "Best sequence ($maxcnt anagrams): $maxword";
}

The %wordseen hash is just used to make sure we don’t count a word twice if it differs only in case.

We use the %wordseen hash to keep track of how many anagrams we found for a given sequence of characters, thus far in the processing.

We then use $maxcnt and $maxword to track the best anagram we’ve found thus far.  Note that we also allow TWO different sequences to have the same number of anagrams, and print both if applicable.

The run time for this? 5 seconds.  That’s almost a day quicker than the not-so-good algorithm.  Consider for a minute this: let’s say I gave you a language that is 1000X faster than Perl 6, but you implement your algorithm such that it is O(N²) and I use Perl 6 to implement an algorithm that is O(N).  Which one will be quicker?  Mine takes 5 seconds. Yours will be 75 seconds.  It’s hard to imagine finding a language 1000X faster in the general case. Better programming will always be more important than a faster language.

5 thoughts on “Anagrams – in O(N)

  1. Hi Joelle,

    congratulation for this is a very nice explanation of Big O notation. Let me just point out to one very small limitation:

    > “But if we just assume it is perfectly exponential, …”

    Wrong word here. O(N²) algorithms are not exponential, but quadratic.

    Cheers, Laurent.

    Like

  2. Hi Joelle, just thought you might like to know that the solution you came up with at the end is described in Jon Bentley’s Programming Pearls, and he calls what you call the “matchkey” the SIGNATURE.

    One other thing, I don’t know Perl 6 – only Perl 5 – but can’t you still use autovivification to increment wordcache{$matchkey} into existence as $wordcache{$matchkey}++ does in Perl 5?

    cheers
    duncan

    Like

    1. I’ve never been a fan of autovivification, for illogical reasons – just my style and background I suppose (there is nothing wrong with it). But yes autovivification works in Perl 6 too.

      Like

Leave a comment