DNA Toolkit Part 1: Validating and counting nucleotides

In this article, we will start working on a DNA Toolkit. We will define a DNA nucleotide list and write our first two functions. The first couple of sections will deal with some very basic string/dictionary/lists processing as we build our biological sequence class and a set of functions to work with that class. We will build a powerful set of tools to analyze Genes and Genomes, to visualize and analyze the data.

By programming all of this ‘by hand’ in Python (or any other language of your choice), instead of using existing tools/services, will make sure that we have a good understanding of the bioinformatics fundamentals.


Tools I use for my work and research:

  • Linux Manjaro/Debian 10.
  • VSCodium (Free/Libre Open Source Software Binaries of VSCode).
  • Python 3.7+, Bash.

I suggest using a good code editor like VSCode as it has a built-in, one-click Python debugger which will be super important for figuring out more complex algorithms we will tackle in later articles. A good code editor also has code completion, code tips, and auto-formatting functionality.

If you need more help setting up your programming environment in Windows or Mac, take a look at Cory Shafer’s amazing videos:


We are going to start with four files:

[jurisl@JurisLinuxPC dna-toolset]$ tree -l
.
├── dna_toolkit.py
├── main.py
├── structures.py
└── utilities.py
0 directories, 4 files
  • dna_toolkit.py: Set of functions to process DNA/RNA/Codon data. It includes structures.py.
  • structures.py: Set of structures like DNA, RNA, Codons, Proteins.
  • main.py: File for testing our code. It includes dna_toolkit.py.
  • utilities.py: Will contain helper functions like reading/writing files, accessing data bases etc.

We start by defining a Python List to store DNA nucleotides in strucutres.py file. That way, we can reuse that list in all of our functions:

DNA_Nucleotides = ['A', 'C', 'G', 'T']

Now we include everything from structures.py into dna_toolkit.py and our first function does two things:

  • Makes sure that the sequence (seq) passed to it is all upper case. In ASCII table, ‘a’ = 97 and ‘A’ = 65. So 97 != 65 and ‘a’ != ‘A’. ASCII Table: Link.
  • Checks if every character in that sequence is one of the characters in DNA_Nucleotides list and return all upper case sequence if it is.
from structures import *

validate_seq(seq):
    """
    Check the sequence to make sure
    it is a valid DNA string
    """
    tmpseq = seq.upper()
    for nuc in tmpseq:
        if nuc not in DNA_Nucleotides:
            return False
    return tmpseq

The second function just counts individual nucleotides and stores the result in a Python Dictionary and returns that dictionary. This is probably the easiest, regular loop approach.

def nucleotide_frequency(seq):
    """
    Count nucleotides in a given sequence.
    Return a dictionary
    """
    tmpFreqDict = {"A": 0, "C": 0, "G": 0, "T": 0}
    for nuc in seq:
        tmpFreqDict[nuc] += 1
    return tmpFreqDict

Below we use a more Pythonic way by importing Count method from the collections module.

from collections import Counter

def nucleotide_frequency(seq):
    """
    Count nucleotides in a given sequence.
    Return a dictionary
    """
    # More Pythonic, using Counter
    return dict(Counter(seq))

Now our for loop turns into one line of code, which is probably faster than a for loop. At this stage, it is not all that important. Use whatever you are more comfortable with at your stage of learning Python.

Let’s include our dna_toolkit.py into our main.py, create a random DNA sequence (randDNAStr), validate it, print out the validated version, it’s length and pass it on to nucleotide frequencies function and use f-strings to display results in a nicer, more structured way.

# DNA Toolset/Code testing file
from dna_toolkit import *

# Creating a random DNA sequence for testing:
randDNAStr = "ATAGCTGACTAT"

DNAStr = validate_seq(randDNAStr)

print(f'\nSequence: {DNAStr}\n')
print(f'[1] + Sequence Length: {len(DNAStr)}\n')
print(f'[2] + Nucleotide Frequency: {nucleotide_frequency(DNAStr)}\n')

So now we can run our code to see the output (lines 9, 10 and 11):

Sequence: ATAGCTGACTAT

[1] + Sequence Length: 12

[2] + Nucleotide Frequency: {'A': 4, 'T': 4, 'G': 2, 'C': 2}

We can also add a random DNA sequence generation to our main.py file. We do that by importing random module and using random.choice method. Basically, we give it our nucleotide list and ask it to randomly pick one of the nucleotides, 50 times in our example below. We do so by using Python’s list comprehension. It returns a list, so we need to convert it to a string. We use the join method to ‘glue’ all list elements into one string.

Here is how it looks:

import random

# Creating a random DNA sequence for testing:
randDNAStr = ''.join([random.choice(DNA_Nucleotides)
                      for nuc in range(50)])

This is it for now. Now we have an entry point into a set of tools we will be writing in this series of articles.

You can access the code for this article via GitLab Link. Code in the repository might differ from this article as I am refactoring and updating it as this series progresses, but core functionality stays the same.

A video version of this article is also available here:

Until next time, rebelCoder, signing out.

Related Posts

2 thoughts on “DNA Toolkit Part 1: Validating and counting nucleotides

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.