- Published on
Solving cryptograms with genetic algorithms
- Authors

- Name
- Alejandro Perdomo
- @alejandropr05
Introduction
Encryption plays a vital role in the digital world. One of the safest ways to protect important data from prying eyes is to encrypt it. This involves transforming each character into a completely random one, making the data unreadable during transmission. While encryption is a complex and important aspect of cybersecurity, I would like to delve into a specific but not so complex type; monoalphabetic substitution ciphers or better known as cryptograms. This exploration will allow us to understand the foundations of encryption and open the door to other fascinating concepts, such as genetic algorithms, a small subset of the field of artificial intelligence. So let me break down these topics for you!
Cryptogram
Let's start by defining what a cryptogram is. It is basically a puzzle in which an initially readable text (plaintext) is encrypted by replacing each letter with a different one from the same alphabet, which is why it is also called substitution cipher. While cryptograms may have been used in the past to hide personal secrets or for military purposes, today they are often so easy to decipher that they are mainly printed in newspapers and magazines just for fun.
Cipher Key
The cipher key is the most crucial concept in encryption, it is the key that gives us access to encrypt or decrypt the cryptogram. For simple substitution ciphers, the cipher key is a text string that indicates what letter our plaintext will be replaced with to encrypt it, or what letter our ciphertext will be replaced with to decrypt it. In this case, the cipher key contains all the letters of the alphabet, but in random order.
To encrypt plaintext, each letter is replaced with the corresponding letter of the cipher key in the original alphabet.
ABCDEFGHIJKLMNOPQRSTUVWXYZ -> original alphabet
DJTGIBOKWHMFYACSVRQNUZLXEP -> cipherkey (random order alphabet)
Hello World -> plaintext
Wiffc Lcryg -> ciphertext
As I mentioned before, these types of cryptograms are quite simple to solve; You can just sit for a few minutes and start exchanging letters until you find meaningful answers. But the fun here, especially for those of us who enjoy programming, is automating this search for solutions. This is where a very interesting type of algorithm comes into play; genetic algorithms.
Genetic Algorithm
They are a small branch of artificial intelligence. Today we are quite familiar with AI, with its generative, classification and prediction models. But AI encompasses all algorithms that simulate biological behavior. In this case, a genetic algorithm is a model or abstraction of biological evolution based on Charles Darwin's theory of natural selection.
It was originally developed by John Holland and his collaborators, who were probably the first to use crossover, recombination, mutation and selection in the study of adaptive and artificial systems.
Genetic algorithms mainly deal with optimization problems, where the objective is to minimize or maximize the objective function (fitness). This is achieved through iteration, starting with an initial population of potential solutions. Parents are then arbitrarily chosen for crossover, and their offspring (or child) are then affected by possible mutations. During this process, the most suitable offspring are selected by evaluating the quality of the solution they give.
But who would be our parents and offspring in the case of cryptograms? Well, as we've seen before, it's the cipher key; the key that unlocks the solution and turns gibberish into readable text.
Fitness
Let's talk about the importance of fitness in these algorithms. Without fitness, we'd just be iterating aimlessly, not knowing when our solution is optimal enough. For cryptograms, we first need a way to evaluate this fitness. We have to figure out how the algorithm can know on its own whether replacing letters in a text makes sense (i.e., is a possible solution).
How do we achieve this? A simple way is to consider all sequences of n-letters (n-grams) in the resulting text and multiply them by the probability that each n-gram appears in the chosen language (e.g., English).
n-grams
A n-gram consists of groups of n consecutive letters found in the text. For example, "Hello World" has the quadgrams (4-grams):
HELL, ELLO, LLOW, LOWO, OWOR, WORL, ORLD
The probability of this text would be calculated as follows:
Experimentally, using quadgrams to calculate fitness leads to better solutions with the genetic algorithm in most cases, except when the text is too short. In such cases, it is better to use smaller n-grams.
Taking the calculated frequency of these n-grams extracted from Tolstoy's "War and Peace", which I found on this site, you simply have to Divide the frequency of each n-gram by the sum total of all n-gram frequencies to find its probability. But the n-grams found are so many that they make the total frequency of all n-grams too high, resulting in very small probabilities, and this makes the multiplication of these probabilities much, much smaller. Thus, the solution is to use the logarithmic function and convert these multiplications into additions:
Remembering the behavior of the logarithmic function curve, since the probabilities are numbers less than one, the logarithms will be negative numbers, and the rarer the n-gram, the more negative the logarithm will be.
Crossover
Now, let's go to the first operation of the genetic algorithm. Crossover or recombination is the operation where we combine the genetic information of two parents to create an offspring.
In this way, the genetic algorithm is all about iterating operations based on an initial population. This population consists of different cipherkeys that can be possible solutions to the encrypted text, so each one can have its respective fitness calculated. With this fitness we can select our parents. Just as in natural selection the fittest individuals are the most likely to mate, fitness allows us to randomly, but weightedly, choose our parents, giving priority to those with better fitness. So, we select two parents and obtain an offspring, then repeat this process until our number of offspring equals the number of individuals in the previous population.
There are many types of crossover algorithms, let's discuss some of them.
Order One
This is a very simple operation that is based on preserving the relative order in which elements occur. Let's go with a very good explanation from Noureddin Sadawi.
Select a random set from
parent 1and copy this part to thechild.Copy the elements that are not in the
child, fromparent 2. Starting from the position of the first element that follows the copied set, and wrapping around the end.
Cycle
This operation consists of cycles between the two elements of the parent, this way each element comes together with its position.
- Make parent 1
p1and parent 2p2. - Start a new cycle:
- Select an element from
p1whose position is empty in thechildelement. - Look at the element at the same position in
p2. - Go to the position of the same element but in
p1. - Place this item on the
childin the same position. - Starting from this last element in
p1, repeat steps 2.2 to 2.4 until you reach the first selected element.
- Select an element from
- Swap parents (who was
p1is nowp2, and vice versa) and repeat the cycle from step 2. - Repeat step 3 until the children has all its elements filled.
Partially Mapped
With this operation, the child is created by taking a subsequence of parent 1 and trying to preserve the order and position of the elements of parent 2.
- Randomly select a segment of elements from
parent 1and copy them directly to thechild. Remember the positions of the segment elements. - Search the same segment positions in
parent 2and select each value that has not already been copied to thechild. - For each of these elements:
- Find the element of
parent 1at the same position. - Find this same element in
parent 2. - If the position of this element in
parent 2is within the original segment, go to step 3.1 using this value. - If the position is not inside the original segment, insert the element from step 3 into the
childat this position.
- Find the element of
- Copy any remaining elements from
parent 2tochild.
Hill Climbing
Hill Climb is an optimization algorithm that belongs more to the local search family. The main idea is to make small changes to a possible solution and only keep the change if it improves the solution. In this way, changes continue to be made to the same solution until it can no longer be improved.
To start, select two parents, compare their fitness and choose the best one, this will be our parent 1 and the base for our child.
- Go through both parents, comparing element by element from left to right.
- If the element at position
nofparent 2is different from the element at positionnofparent 1, we replace the element at positionnofparent 1with the element at position n ofparent 2. - This will cause the
nelement ofparent 2to be repeated twice inparent 1. Then we look for that repeated element inparent 1that is not in positionnand replace it with the original element that was in positionnof theparent 1. - Calculate the fitness of this new solution. If it is not greater than or equal to the previous fitness of
parent 1, undo the change. - Repeat this process until you have gone through all the
Nparent positions.
Mutation
The mutation is our final operation of the genetic algorithm, which occurs once the parents have crossed. This mutation involves changes in the same offspring, altering only one element. Biologically, mutation is what drives evolution; some offspring may undergo mutations that make them slightly different from the population, and perhaps this change gives them a slight survival advantage over others. As a result, these new mutated offspring are more likely to mate and pass their genetics on to new generations. The same principle applies in the world of algorithms; Mutations are necessary to prevent the algorithm from getting stuck in local minima. If the algorithm can't seem to improve on its current path, a mutation could put it on a different course that allows for further improvement.
Insert
This is a fairly simple type of mutation, where we insert one element next to another and thus preserve the order and adjacency information.
- Choose two elements randomly.
- Move the second element and insert it to the right of the first element, causing the other elements to shift to the right.
Swap
With this type of mutation, we can preserve most of the adjacency information.
- Pick two random elements
- Swap their positions.
Inversion
Next, we have inversion mutation. This mutation preserves most of the adjacency information while altering the order information.
- Pick two random elements.
- Invert the segment between them.
Scramble
This mutation, as the previous one, preserves most of the adjacency information while altering the order information.
- Pick two random elements.
- Randomly rearrange the segment between them.
Genetic Decipher: a python implementation
Based on everything explained in this post, I've put together a Python package, which you can check out on this GitHub Repository.
References
For more information about these topics you can visit:
- Chapter 6 - Genetic Algorithms, by Xin-She Yang, in Nature-Inspired Optimization Algorithms (Second Edition), 2021.
- A Practical Introduction to Genetic Algorithms - a YouTube playlist by Noureddin Sadawi.
- Quadgram Statistics as a Fitness Measure, from Practical Cryptography.
- Online Calculator: Text Fitness , from Planetcalc.
- Genetic Algorithm Tutorial, from Rubicite Interactive.