
Solving cryptograms with genetic algorithms

Introduction

Encryption plays a vital role in the digital world. One of the safest ways to protect important data from prying eyes is to encrypt it, transforming each character into a seemingly random one so that the data is unreadable during transmission. While encryption is a complex and important aspect of cybersecurity, I would like to delve into a specific and not so complex type: monoalphabetic substitution ciphers, better known as cryptograms. This exploration will allow us to understand the foundations of encryption and open the door to other fascinating concepts, such as genetic algorithms, a small subset of the field of artificial intelligence. So let me break down these topics for you!

Cryptogram

Let's start by defining what a cryptogram is. It is basically a puzzle in which an initially readable text (plaintext) is encrypted by replacing each letter with a different one from the same alphabet, which is why it is also called a substitution cipher. While cryptograms may have been used in the past to hide personal secrets or for military purposes, today they are so easy to decipher that they are mainly printed in newspapers and magazines just for fun.

Cipher Key

The cipher key is the most crucial concept in encryption: it is what gives us access to encrypt or decrypt the cryptogram. For simple substitution ciphers, the cipher key is a text string that indicates which letter each plaintext letter will be replaced with to encrypt it, or which letter each ciphertext letter will be replaced with to decrypt it. In this case, the cipher key contains all the letters of the alphabet, but in random order.

To encrypt plaintext, each letter is looked up in the original alphabet and replaced with the letter at the same position in the cipher key.

Encryption
ABCDEFGHIJKLMNOPQRSTUVWXYZ -> original alphabet
DJTGIBOKWHMFYACSVRQNUZLXEP -> cipher key (alphabet in random order)

Hello World -> plaintext
Kiffc Lcrfg -> ciphertext
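
To make this concrete, here is a minimal Python sketch of the encryption step using the example key above; decryption would simply swap the two alphabets in `str.maketrans`:

```python
import string

ALPHABET = string.ascii_uppercase
CIPHER_KEY = "DJTGIBOKWHMFYACSVRQNUZLXEP"  # the example key from above

def encrypt(plaintext: str, key: str = CIPHER_KEY) -> str:
    """Replace each letter with the letter at the same position in the cipher key."""
    table = str.maketrans(ALPHABET, key)
    return plaintext.upper().translate(table)

print(encrypt("Hello World"))  # -> KIFFC LCRFG
```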

As I mentioned before, these types of cryptograms are quite simple to solve; you can just sit for a few minutes and start exchanging letters until you find meaningful words. But the fun part, especially for those of us who enjoy programming, is automating this search for solutions. This is where a very interesting type of algorithm comes into play: genetic algorithms.

Genetic Algorithm

Genetic algorithms are a small branch of artificial intelligence. Today we are quite familiar with AI through its generative, classification and prediction models, but AI also encompasses algorithms that simulate biological behavior. In this case, a genetic algorithm is a model, or abstraction, of biological evolution based on Charles Darwin's theory of natural selection.

It was originally developed by John Holland and his collaborators, who were probably the first to use crossover, recombination, mutation and selection in the study of adaptive and artificial systems.

Genetic algorithms mainly deal with optimization problems, where the objective is to minimize or maximize an objective function (the fitness). This is achieved through iteration, starting with an initial population of potential solutions. Parents are then randomly chosen for crossover, and their offspring may undergo mutations. During this process, the most suitable offspring are selected by evaluating the quality of the solutions they represent.

But who would be our parents and offspring in the case of cryptograms? Well, as we've seen before, it's the cipher key: the key that unlocks the solution and turns gibberish into readable text.

Fitness

Let's talk about the importance of fitness in these algorithms. Without fitness, we'd just be iterating aimlessly, not knowing when our solution is optimal enough. For cryptograms, we first need a way to evaluate this fitness. We have to figure out how the algorithm can know on its own whether replacing letters in a text makes sense (i.e., is a possible solution).

How do we achieve this? A simple way is to consider all the n-letter sequences (n-grams) in the resulting text and multiply together the probabilities that each n-gram appears in the chosen language (e.g., English).

n-grams

An n-gram is a group of n consecutive letters found in the text. For example, "Hello World" has the quadgrams (4-grams):

HELL, ELLO, LLOW, LOWO, OWOR, WORL, ORLD

The probability of this text would be calculated as follows:

P(Hello\ World) = P(HELL) \cdot P(ELLO) \cdot P(LLOW) \cdot P(LOWO) \cdot P(OWOR) \cdot P(WORL) \cdot P(ORLD)

Experimentally, using quadgrams to calculate fitness leads to better solutions with the genetic algorithm in most cases, except when the text is too short. In such cases, it is better to use smaller n-grams.

Taking the calculated frequency of these n-grams extracted from Tolstoy's "War and Peace", which I found on this site, you simply have to divide the frequency of each n-gram by the sum of all n-gram frequencies to find its probability. But there are so many n-grams that each individual probability turns out to be very small, and multiplying thousands of such small numbers produces values so tiny that they quickly underflow toward zero. The solution is to use the logarithmic function and convert these multiplications into additions:

\log{(P(Hello\ World))} = \log{(P(HELL))} + \log{(P(ELLO))} + \log{(P(LLOW))} + \log{(P(LOWO))} + \log{(P(OWOR))} + \log{(P(WORL))} + \log{(P(ORLD))}

Remembering the shape of the logarithm curve: since the probabilities are numbers less than one, their logarithms will be negative, and the rarer the n-gram, the more negative its logarithm will be.
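
As a rough sketch, and assuming a quadgram file with one `NGRAM count` pair per line (like the "War and Peace" counts mentioned above), the fitness of a candidate decryption could be computed like this; the floor value used for unseen n-grams is an arbitrary choice:

```python
import math

def load_ngram_log_probs(path: str) -> dict[str, float]:
    """Read 'NGRAM count' lines and turn the counts into log10 probabilities."""
    counts: dict[str, int] = {}
    with open(path) as file:
        for line in file:
            gram, count = line.split()
            counts[gram] = int(count)
    total = sum(counts.values())
    return {gram: math.log10(count / total) for gram, count in counts.items()}

def fitness(text: str, log_probs: dict[str, float], n: int = 4) -> float:
    """Sum the log probabilities of every n-gram in the text (higher is better)."""
    letters = "".join(c for c in text.upper() if c.isalpha())
    unseen = -10.0  # arbitrary floor for n-grams missing from the reference corpus
    return sum(log_probs.get(letters[i:i + n], unseen)
               for i in range(len(letters) - n + 1))
```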

Crossover

Now, let's go to the first operation of the genetic algorithm. Crossover or recombination is the operation where we combine the genetic information of two parents to create an offspring.

In this way, the genetic algorithm is all about iterating these operations over an initial population. This population consists of different cipher keys that are possible solutions to the encrypted text, so each one can have its fitness calculated. With this fitness we can select our parents. Just as in natural selection the fittest individuals are the most likely to mate, fitness lets us choose parents at random but with weights, giving priority to those with better scores. We select two parents, obtain an offspring, and repeat this process until the number of offspring equals the number of individuals in the previous population.
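
A minimal sketch of this weighted selection, assuming the fitness scores are the log probabilities described earlier (all negative, higher is better):

```python
import random

def select_parents(population: list[str], fitnesses: list[float]) -> tuple[str, str]:
    """Pick two parents at random, weighted toward better (less negative) fitness."""
    # Shift all scores so even the worst individual keeps a tiny positive weight.
    worst = min(fitnesses)
    weights = [f - worst + 1e-6 for f in fitnesses]
    first, second = random.choices(population, weights=weights, k=2)
    return first, second
```

Note that `random.choices` samples with replacement, so the same key may occasionally be drawn twice; whether to allow that is a design choice.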

There are many types of crossover algorithms; let's discuss some of them.

Order One

This is a very simple operation that is based on preserving the relative order in which elements occur. Let's go with a very good explanation from Noureddin Sadawi.

  1. Select a random segment from parent 1 and copy it to the child.

  2. Copy the elements that are not yet in the child from parent 2, starting from the position right after the copied segment and wrapping around the end.
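
Treating each key as a string permutation of the alphabet, a sketch of these two steps could look like this:

```python
import random

def order_one_crossover(p1: str, p2: str) -> str:
    """Order-one (OX1) crossover for permutation-encoded keys."""
    size = len(p1)
    start, end = sorted(random.sample(range(size), 2))
    child = [None] * size
    child[start:end + 1] = p1[start:end + 1]   # step 1: copy a random segment from parent 1
    used = set(child[start:end + 1])
    # Step 2: remaining letters in the order they appear in parent 2,
    # starting right after the copied segment and wrapping around.
    fill = [c for c in p2[end + 1:] + p2[:end + 1] if c not in used]
    positions = list(range(end + 1, size)) + list(range(start))
    for pos, letter in zip(positions, fill):
        child[pos] = letter
    return "".join(child)
```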

Cycle

This operation builds the child from cycles of positions between the two parents, so that each element keeps a position it already had in one of them.

  1. Call parent 1 p1 and parent 2 p2.
  2. Start a new cycle:
    1. Select an element from p1 whose position is still empty in the child.
    2. Look at the element at the same position in p2.
    3. Go to the position of that element in p1.
    4. Place that element in the child at this position.
    5. Starting from this last position in p1, repeat steps 2.2 to 2.4 until you reach the first selected element.
  3. Swap parents (whoever was p1 is now p2, and vice versa) and repeat the cycle from step 2.
  4. Repeat step 3 until the child has all its positions filled.
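
Here is a compact sketch of the same procedure; instead of literally relabelling the parents, it alternates which parent each completed cycle is copied from:

```python
def cycle_crossover(p1: str, p2: str) -> str:
    """Cycle (CX) crossover: each cycle of positions is copied from one parent, alternating."""
    size = len(p1)
    child = [None] * size
    take_from_p1 = True
    for start in range(size):
        if child[start] is not None:
            continue  # this position already belongs to an earlier cycle
        pos = start
        while child[pos] is None:
            child[pos] = p1[pos] if take_from_p1 else p2[pos]
            pos = p1.index(p2[pos])   # follow parent 2's element back to its position in parent 1
        take_from_p1 = not take_from_p1   # the next cycle comes from the other parent
    return "".join(child)
```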

Partially Mapped

With this operation, the child is created by taking a subsequence of parent 1 and trying to preserve the order and position of the elements of parent 2.

  1. Randomly select a segment of elements from parent 1 and copy them directly to the child, remembering the positions of the segment.
  2. Look at the same segment positions in parent 2 and select each value that has not already been copied to the child.
  3. For each of these values:
    1. Find the element of parent 1 at the same position.
    2. Find that same element in parent 2 and note its position.
    3. If that position lies inside the original segment, go back to step 3.1 starting from it.
    4. If the position lies outside the original segment, insert the value from step 3 into the child at that position.
  4. Copy any remaining elements from parent 2 to the child.
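
A possible sketch of these steps:

```python
import random

def pmx_crossover(p1: str, p2: str) -> str:
    """Partially mapped (PMX) crossover for permutation-encoded keys."""
    size = len(p1)
    start, end = sorted(random.sample(range(size), 2))
    child = [None] * size
    child[start:end + 1] = p1[start:end + 1]    # step 1: copy the segment from parent 1
    for pos in range(start, end + 1):           # steps 2-3: place parent 2's displaced values
        value = p2[pos]
        if value in child[start:end + 1]:
            continue
        place = pos
        while start <= place <= end:            # follow the mapping until we leave the segment
            place = p2.index(p1[place])
        child[place] = value
    for pos in range(size):                     # step 4: fill the rest straight from parent 2
        if child[pos] is None:
            child[pos] = p2[pos]
    return "".join(child)
```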

Hill Climbing

Hill climbing is an optimization algorithm that belongs to the local search family. The main idea is to make small changes to a candidate solution and only keep a change if it improves the solution. Changes keep being applied to the same solution until it can no longer be improved.

To start, select two parents, compare their fitness and choose the better one; this will be our parent 1 and the base for our child.

  1. Go through both parents, comparing element by element from left to right.
  2. If the element at position n of parent 2 is different from the element at position n of the child, replace the child's element at position n with parent 2's element at position n.
  3. This makes that element appear twice in the child. Find the duplicate that is not at position n and replace it with the element that was originally at position n, so the two positions are effectively swapped.
  4. Calculate the fitness of this new solution. If it is not greater than or equal to the previous fitness, undo the change.
  5. Repeat this process until you have gone through all N positions.
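
A sketch of this procedure, where `fitness` is assumed to be a callable that scores a candidate key (for example by decrypting the ciphertext with it and summing quadgram log probabilities):

```python
from typing import Callable

def hill_climb_crossover(p1: str, p2: str, fitness: Callable[[str], float]) -> str:
    """Greedily move the fitter parent toward the other, keeping only swaps that don't hurt fitness."""
    if fitness(p2) > fitness(p1):
        p1, p2 = p2, p1                     # the fitter parent becomes the base for the child
    child = list(p1)
    best = fitness("".join(child))
    for n in range(len(child)):
        if child[n] == p2[n]:
            continue
        j = child.index(p2[n])              # swap so position n matches parent 2, keeping a valid permutation
        child[n], child[j] = child[j], child[n]
        score = fitness("".join(child))
        if score >= best:
            best = score                    # keep the improvement
        else:
            child[n], child[j] = child[j], child[n]   # undo the swap
    return "".join(child)
```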

Mutation

Mutation is the final operation of the genetic algorithm, and it occurs once the parents have been crossed. It involves a small change to the offspring itself, typically altering only one or a few elements. Biologically, mutation is what drives evolution; some offspring may undergo mutations that make them slightly different from the rest of the population, and perhaps this change gives them a slight survival advantage over others. As a result, these mutated offspring are more likely to mate and pass their genes on to new generations. The same principle applies in the world of algorithms: mutations are necessary to prevent the algorithm from getting stuck in local optima. If the algorithm can't seem to improve on its current path, a mutation could put it on a different course that allows further improvement.

Insert

This is a fairly simple type of mutation: we take one element and insert it next to another, preserving most of the order and adjacency information.

  1. Choose two elements randomly.
  2. Move the second element and insert it to the right of the first element, causing the other elements to shift to the right.
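
A minimal sketch of this mutation on a key string:

```python
import random

def insert_mutation(key: str) -> str:
    """Move one randomly chosen element so it sits right after another."""
    elements = list(key)
    i, j = sorted(random.sample(range(len(elements)), 2))
    elements.insert(i + 1, elements.pop(j))   # re-insert element j just after element i
    return "".join(elements)
```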

Swap

With this type of mutation, we can preserve most of the adjacency information.

  1. Pick two random elements.
  2. Swap their positions.
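
In code, this is just a position swap:

```python
import random

def swap_mutation(key: str) -> str:
    """Exchange two randomly chosen elements."""
    elements = list(key)
    i, j = random.sample(range(len(elements)), 2)
    elements[i], elements[j] = elements[j], elements[i]
    return "".join(elements)
```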

Inversion

Next, we have inversion mutation. This mutation preserves most of the adjacency information while altering the order information.

  1. Pick two random elements.
  2. Invert the segment between them.
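
A quick sketch:

```python
import random

def inversion_mutation(key: str) -> str:
    """Reverse the segment between two randomly chosen positions."""
    elements = list(key)
    i, j = sorted(random.sample(range(len(elements)), 2))
    elements[i:j + 1] = reversed(elements[i:j + 1])
    return "".join(elements)
```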

Scramble

Unlike the previous mutations, this one alters not only the order information but also most of the adjacency information within the chosen segment, which makes it a more disruptive change.

  1. Pick two random elements.
  2. Randomly rearrange the segment between them.
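
And the corresponding sketch:

```python
import random

def scramble_mutation(key: str) -> str:
    """Shuffle the segment between two randomly chosen positions."""
    elements = list(key)
    i, j = sorted(random.sample(range(len(elements)), 2))
    segment = elements[i:j + 1]
    random.shuffle(segment)
    elements[i:j + 1] = segment
    return "".join(elements)
```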

Genetic Decipher: a Python implementation

Based on everything explained in this post, I've put together a Python package, which you can check out on this GitHub Repository.

References

For more information about these topics you can visit: