Genome Toolkit. Part 2.1: Identifying, fixing, and testing a bug

Identifying the bug

In the previous article (Part 2 here), we wrote our first Genome Toolkit algorithm. Even though it was a very simple algorithm to help us search for repeating patterns, (k-mers in DNA/Genome sequences), and it seemed to work correctly, we actually had a bug in it. Let’s take a look at it, and how we can fix it. We will also discuss how bugs like this might end up accumulating a lot of errors during the usage of our algorithm, and result in incorrect calculations.

The bug: our for loop is always using the length of the sequence, and ignores the length of the k-mer we are searching for.

Logic with a bug: does not take into account the length of the k-mer:

for position in range( len(sequence) - 1 ):
    if sequence[position:position + len(kmer)] == kmer:

The above loop logic will always loop the same amount of times, which is a problem as our k-mers may be a different length, while the sequence stays the same length.

Let’s look at three examples, two of which will reproduce the bug.
Below we have a sequence and a k-mer we used in Part 2. We will reuse the same data, but extend the k-mer length.

Example 1: Data that will work correctly, and we will not see the bug:

seq = AATTAAAC | Length 8
kmer = AA      | Length 2

And if we now visualize every step of our loop logic, this is how it scans the sequence, 2 k-mers at a time:

0. [AA]TTAAAC
1. A[AT]TAAAC
2. AA[TT]AAAC
3. AAT[TA]AAC
4. AATT[AA]AC
5. AATTA[AA]C
6. AATTAA[AC]

So far, so good. Loop stops at the last k-mer, which is AC.

Example 2: Now we will change the k-mer length to three, by adding another nucleotide to it. Sequence length is unchanged:

seq = AATTAAAC | Length 8
kmer = AAA     | Length 3

Let’s again visualize every step of our current loop logic:

0. [AAT]TAAAC
1. A[ATT]AAAC
2. AA[TTA]AAC
3. AAT[TAA]AC
4. AATT[AAA]C
5. AATTA[AAC]
6. AATTAA[AC?]

Look at the step 6. Because we are now scanning the sequence for k-mers of length 3, and not 2, and our loop still iterates 7 times, on the last iteration we compare AAA with AC (? – is an empty memory location, that does not contain a nucleotide). We can call it “over scanning of the sequence”. This is incorrect. In some other, more strict languages, we would see an error or a warning even before we compile/build/run our code.

Example 3: And now we add another nucleotide to our k-mer, to make it of length 4:

seq = AATTAAAC | Length 8
kmer = AAAA    | Length 4

In this last example, we can see that we now have two incorrect/bad if statements, when we compare our 4-length k-mer with a part of our sequence:

0. [AATT]AAAC
1. A[ATTA]AAC
2. AA[TTAA]AC
3. AAT[TAAA]C
4. AATT[AAAC]
5. AATTA[AAC?]
6. AATTAA[AC??]

Steps 5 and 6 both have “gaps”, when we try comparing AAAA with AAC and AC. Our for loop has “over scanned” again. Twice.

Fixing the bug

Now let’s see how we can fix this bug. Let’s correct the logic of our for loop, so it uses not only the length of the sequence, but also the length of the k-mer, and dynamically adjusts, based on both values.

Correct logic, that uses k-mer length:

for position in range( len(sequence) - (len(kmer) - 1) ):
    if sequence[position:position + len(kmer)] == kmer:

Here is the change we have done in this for loop:
From len(sequence) - 1 To len(sequence) - (len(kmer) - 1)

Now, let’s use the above fixed for loop logic as a formula, to predict the outcome of how many iterations the loop will run. Remember, the count starts from 0, so if it loops 7 times, it is from 0 to 6.

Example 1: We are using the original sequence and k-mer length, but also manually calculating how many times the loop should iterate:

seq = AATTAAAC | Length 8
kmer = AA      | Length 2
Loops: [8 - (2 - 1)] = 7 times

Now, if we visualize it again, we can see, that our loop iteration prediction was correct. Loop had 7 (0-6) iterations:

0. [AA]TTAAAC
1. A[AT]TAAAC
2. AA[TT]AAAC
3. AAT[TA]AAC
4. AATT[AA]AC
5. AATTA[AA]C
6. AATTAA[AC]

This is good. Now let’s see what happens in the other two examples.

Example 2: We perform the same test with a k-mer of length 3, and we also calculate loop iterations. Remember, that this is now done on a fixed version of our `foor loop`:

seq = AATTAAAC | Length 8
kmer = AAA     | Length 3
Loops: [8 - (3 - 1)] = 6 times

Now we visualize every step yet again:

0. [AAT]TAAAC
1. A[ATT]AAAC
2. AA[TTA]AAC
3. AAT[TAA]AC
4. AATT[AAA]C
5. AATTA[AAC]

Amazing! Our new version of the for loop seems to be producing correct results now. We can see that our loop iteration prediction was 6, and it indeed iterated 6 times (0-5). On step 5 we can clearly see that we are not “over scanning” our sequence anymore.

Now let’s run one more test.

Example 3: We repeat our test one last time. k-mer is now of length 4, and our loop iteration calculation predicts 5 iterations:

seq = AATTAAAC | Length 8
kmer = AAAA    | Length 4
Loops: [8 - (4 - 1)] = 5 times

0. [AATT]AAAC
1. A[ATTA]AAC
2. AA[TTAA]AC
3. AAT[TAAA]C
4. AATT[AAAC]

So now, that we are sure that our updated for loop does not “over scan” the sequence, we can add this change to our Genome Toolkit class.

Conclusion

Bugs like this might be hard to find, if the algorithm is much longer and more complex. In the case of our algorithm, this bug did not produce a bad outcome. It did not break anything, and would still produce correct results. In some other case, where we would need to manipulate the data, like a list of k-mers, that has bad data like [AT ], [AATT], [GCA ] in it (gaps in k-mers), it would break the program execution or accumulate a lot of incorrect data, and produce incorrect results.

When implementing your algorithms, try using the Python debugger on every step of your code. Try visualizing what is happening in your variables, by printing out what is stored in them. Use some simple sample data, like what we used in here. Don’t write the whole algorithm, and test it after. Write a line of code, and test it, add another line of code, and test that, etc.

How can you test your code, to make sure it is bug-free? You can use a paper and a pencil, and write/draw a few simple, short examples of your code, specifying the input and the expected output. After that, try following every step of your algorithm on the paper first. You can come up with a few tests like that, and include them in your project. The test coverage you can come up with, will most likely depend on your level of experience, not only with Python, but also with logical thinking. Python also has testing functionality built-in, the unittest module, but this is beyond the scope of this article.

This is it for this article. See you in the next one.

Video version:

Genome Toolkit. Part 2.1: Identifying, fixing, and testing a bug

Identifying the bug

Example 1: Data that will work correctly, and we will not see the bug:

Example 2: Now we will change the k-mer length to three, by adding another nucleotide to it. Sequence length is unchanged:

Example 3: And now we add another nucleotide to our k-mer, to make it of length 4:

Fixing the bug

Example 1: We are using the original sequence and k-mer length, but also manually calculating how many times the loop should iterate:

Example 2: We perform the same test with a k-mer of length 3, and we also calculate loop iterations. Remember, that this is now done on a fixed version of our foor loop:

Example 3: We repeat our test one last time. k-mer is now of length 4, and our loop iteration calculation predicts 5 iterations:

Conclusion

Related Posts

Genome Toolkit. Part 3: building statistical data (k-mer frequency)

Genome Toolkit. Part 2: in search of patterns

Leave a Reply Cancel reply

Example 2: We perform the same test with a k-mer of length 3, and we also calculate loop iterations. Remember, that this is now done on a fixed version of our `foor loop`: