4 Vigenère cipher

4.1 Introduction

Both the Caesar cipher and the monoalphabetic substitution cipher have a single, fixed mapping from input letters to ciphertext letters, making them vulnerable to letter frequency analysis. It would be more secure if we could create a polyalphabetic cipher - i.e. the mapping from plaintext letters to ciphertext letters changes after every input character. One example of this is an extension of the Caesar cipher, called a “Vigenère cipher” (named after the 16th-century French diplomat Blaise de Vigenère).

In this cipher, we define the key as a word or phrase that is repeated as many times as necessary to cover the length of the message. Each letter in this key is then used as the key for a Caesar cipher to encrypt the corresponding letter in the message.

For example, if the message is “Cryptography is fun” and the key is “SECRETKEY”, we would write the message and the repeated key together:

CRYPTOGRAPHYISFUN
SECRETKEYSECRETKE

Since the first letter of the key is “S”, which is the 19th letter of the alphabet, we shift the first letter of our message (“C”) by 18 (since we start counting with “A” as zero, meaning “no shift”), to give us “U”.

There is a handy way to visualise this process, called a Vigenère square:

The cipher text is the column on the left, while the encryption letter is the top row. (In fact, doing it the other way around works too: why?)

Exercise: Using pen and paper, go ahead and encrypt the rest of that message.
CRYPTOGRAPHYISFUN  <- message
SECRETKEYSECRETKE  <- key

4.2 Cracking the Vigenère cipher

Luckily, the Vigenère cipher still contains a point of weakness that we can use to attack it: the periodicity of the key, or in other words, the fact that it repeats itself every \(n\) characters.

This means that, if the key length is 3, then the 1st, 4th, 7th, … characters are encoded in the same way, and we can tackle it just like we did for the simple ciphers seen already.

But, to do this, we first need to know the key length!

4.2.1 Index of Coincidence

To find the key length, we need to understand another fundamental property of English text: the index of coincidence (IoC). If we have a piece of text and randomly pull two letters out of it, the IoC tells us how likely it is that these two letters are the same.

Mathematically, we can define the IoC as:

\[\text{IoC} = 26 \cdot \frac{n_\text{A}(n_\text{A} - 1) + n_\text{B}(n_\text{B} - 1) + \cdots + n_\text{Z}(n_\text{Z} - 1)}{N(N - 1)},\]

where \(n_i\) is the number of times the letter \(i\) appears in the text, and \(N\) is the length of the text.

Question: Show that if all letters have the same probability of occurring, then the IoC is equal to 1. (Assume that the text is long enough, such that the \(n_i\)’s are much larger than 1.)

Because letters are not uniformly distributed in English (recall that ’E’s were particularly common), this quantity is, on average, greater than 1 for typical English texts.

Have a play around with this plot, and find out what the IoC is for some typical text. The same samples as before are included, plus a random sequence of letters:

Samples:

4.2.2 Using IoC to determine Key Length

We can exploit the fact that the IoC has different values for English text and for random sequences of letters, to identify the key length. If the key had a length of e.g. 4, we would know that the 1st, 5th, 9th, … letters were encrypted with the same letter of the key, and we would expect the IoC of this collection of letters (and also the 2nd, 6th, 10th, …) to be higher than what we would see for random. If on the other hand, the key had length 5, we would expect the same to be true if we looked at the 1st, 6th, 11th, 16th, … letters. (This periodic property is going to be looked at in more detail in “Modular Arithmetic”, which we’ll cover this afternoon.)

Exercise: Decrypt the following ciphertext.

Part 1: First, look at the Index of Coincidence and Letter Frequency graphs for the subsets of the ciphertext that we’d expect to see if the key length were 2, 3, or 4.

Choose a key length to try:

Proposed key length:

Part 2: Now we have identified the most likely key length, we should be able to look at the individual letter frequency graphs, and work out how much we’d need to shift them (like a Caesar cipher) to identify each letter of the key.

Hint 1: Remember, the most common letter in English text is “E”.

Hint 2: When we encrypt a letter using the Vigenère cipher, the letter “A” corresponds to a shift of 0. So, if the most common letter in the first of the graphs above was “F”, this would imply that we had a Caeser cipher of shift=1 (to move “E”s to “F”s), corresponding to the first letter of the key being “B”.

Proposed Key:

4.3 Aside: One-time pads - the unbreakable cipher

Since the way that we break the Vigenère cipher depends on the repetition of the key, it follows that if the key is (at least) as long as the message, we wouldn’t be able to crack the cipher this way. In fact, these so-called “one-time pads”, where the key is a long sequence of randomly generated letters, are thought to be unbreakable (as long as the sequence is truly random, and is not reused - hence the name). These were used for some of the most secret communications in World War II. However, of course, we still have the problem of how to securely share the key between the sender and the recipient!