Greedy motif search

I like learning new things. This is why I’ve joined Finding Hidden Messages in DNA (Bioinformatics I) at Coursera. The pace of this course is really good, however I needed some time figuring out the solution for the “Greedy Motif Search” algorithm — or at least how to implement it in Python.

Source code

Sorry, because it is a graded problem I do not provide the source code. But I try to describe the solution as clear as possible to help you implementing the pseudo code. If you have problems with the understanding or questions to the pseudo code feel free to contact me.

Pseudo code

This is the code which is visible at the course’s interactive text book:

GREEDYMOTIFSEARCH(Dna, k, t)
    BestMotifs ← motif matrix formed by first k-mers in each string from Dna
        for each k-mer Motif in the first string from Dna
            Motif_1 ← Motif
            for i = 2 to t
                form Profile from motifs Motif_1, …, Motif_i - 1
                Motif_i ← Profile-most probable k-mer in the i-th string in Dna
            Motifs ← (Motif_1, …, Motif_t)
            if Score(Motifs) < Score(BestMotifs)
                BestMotifs ← Motifs
    output BestMotifs

If you find the pseudo code clear feel free to skip this whole article. If not keep on reading.

Explanation

Well, the pseudo code reads itself a bit hard — or at least for me. So here is a little explanation.

The input fields are the following:

  • Dna is the list of the DNA strings to find the motifs in it
  • k is the k-mer size to find the best scoring motifs
  • t is the length of the Dna list

The BestMotifs in the beginning contains the first k-mers of each string of the provided Dna list.

Then the first for-loop iterates over the k-mers in the first string of the Dna list and adds this k-mer (called Motif) to the list of Motifs.

The second for loop forms a profile from the motifs contained in the list of Motifs, then it selects the most probable k-mer from the actual string taken from the Dna list and adds it to the list of Motifs.

After the second for loop the score of the new Motifs list is compared to the score of the BestMotifs. If the score is lower than from the actual best (lower score means less distance) then the current Motifs will become the BestMotifs.

Pseudo code for Score(Motifs)

Score(Motifs)
    consensus = Consensus(Motifs)
    score = 0
    for motif in motifs:
        score = score + HammingDistance(consensus, motif)
    return score

Motifs is the list of Motifs to calculate the score for.

Pseudo code for Consensus(Motifs)

Consensus(Motifs)
    consensus = ""
    for i in 1..|Motifs[1]|
        count[A] = 0
        count[C] = 0
        count[G] = 0
        count[T] = 0
        for motif in Motifs
            count[motif[i]] = count[motif[i]] + 1
        consensus = consensus + key(count) where max(value)

Motifs is the list of Motifs to find the consensus for.

As you can see, a consensus is created from the most popular letter in each column of the motif matrix. If you encounter a tie you should take the lexicographically first one. For example if there are 4 motifs and each motif contains at position i a different nucleotide then the consensus would contain A at the position i.

The pseudo code seems a bit big but in Python for example it can be solved under 10 lines of code.

Pseudo code for HammingDistance(genome_a, genome_b)


HammingDistance(genome_a, genome_b)
    if genome_a == genome_b
        return 0
    distance = 0
    for i = 1..min(|genome_a|,|genome_b|)
        if genome_a[i] != genome_b[i]
            distance = distance + 1
    distance = distance + abs(|genome_a| - |genome_b|)
    return distance

genome_a and genome_b are the two genomes we compare. |genome_a| denotes the length of the genome. abs is the absolute value of the difference of the two lengths.

As you can see, this pseudo code works for genomes of different lengths too.

An example

The example is based on the sample input provided with the code challenge:

k = 3
t = 5
Dna:
GGCGTTCAGGCA
AAGAATCAGTCA
CAAGGAGTTCGC
CACGTCAATCAC
CAATAATATTCG

The first step is to calculate the list of BestMotifs initially. This means we take the first 3-mer of each element of the Dna list.

GGC
AAG
CAA
CAC
CAA

Now we have our BestMotifs (with the score of 6) and we can start the loop.

The first for loop iterates over the 3-mers in the genome GGCGTTCAGGCA. These 3-mers are: GGC GCG CGT GTT TTC TCA CAG AGG GGC GCA.

For every loop the first element of the Motifs list is the current 3-mer from the list above. Then for each remaining genome in the Dna list we select the most probable 3-mer for the profile generated from Motifs (for the first time it is only GGC). For the genome AAGAATCAGTCA the most probable 3-mer is AAG so we add it to the Motifs list. Now we take the third genome from the Dna list and select the most probable 3-mer against the profile formed from the Motifs list. And so on… After the iterations we end up with the following Motifs list: GCG AAG AAG CAC CAA. The score of this list is 7 which is higher than the score of the current BestMotifs which is 6 so the BestMotifs stays as it is.

I will skip the walkthrough for each iteration, however I will show the Motifs and their score after each iteration and indicate whether they switch to be the best motifs or not:

GCG AAG AAG ACG CAA, Score: 5 --> gets BestMotifs
CGT AAG AAG AAT AAT, Score: 4 --> gets BestMotifs
GTT AAG AAG AAT AAT, Score: 4 (not better)
TTC AAG AAG ATC TTC, Score: 6 (not better)
TCA TCA CAA TCA TAA, Score: 3 --> gets BestMotifs
CAG CAG CAA CAA CAA, Score: 2 --> gets BestMotifs
AGG AAG AAG CAC CAA, Score: 5 (not better)
GGC AAG AAG CAC CAA, Score: 7 (not better)
GCA AAG AAG ACG CAA, Score: 6 (not better)

As you can see, the best motifs at the end are CAG CAG CAA CAA CAA with the score of 2. And this is the result of the GreedyMotifSearch algorithm — and this is the same than the example shows at the code challenge.

Advertisements

13 thoughts on “Greedy motif search

  1. I guess I don’t understand how the profile is generated from the Motifs list. For example, why is the 3-mer ‘AAG’ the most probable from the genome ‘AAGAATCAGTCA’ when the profile is generated from only ‘GGC’? It seems to me that the most probable 3-mer from this genome based on that profile would be ‘GTC’ – what am I missing?

    • In this algorithm the best probable k-mer is calculated based on the profile which you generate from the already available motifs. In this case the algorithm chooses AAG as the most probable based on the current motif list.

      • In your example in the first loop there is only one k-mer in the motif list: ‘ GGC’. Based on the profile built from this one-element list, how can the most probable k-mer in ‘AAGAATCAGTCA’ be ‘AAG’? That’s what I find hard to understand…

      • OK, in this case I will either update the article (and will notify you in a comment) or write another article about how to calculate the profile and profile-based k-mer (and I will use the example above as the example there).

      • Thanks, I figured out the problem – I was calculating the probability of a k-mer by adding the profile matrix probabilities when I should have been multiplying – forgot my basic probability facts for some reason…

  2. First, to construct the ‘BestMotifs’ list we just take the first k-mer of each string. Because k = 3, the first 3 letters of our first string are GGC, the first 3 letters of our second string are AAG, etc. Then we construct our first Motif list….

    We start with the second k-mer in our first string – ‘GCG’. The profile for ‘GCG’ is just:
    A: 0 0 0
    C: 0 1 0
    G: 1 0 1
    T: 0 0 0
    The consensus for this profile is just ‘GCG’, so we can also assume that the probability for any motif except GCG is zero. We now check for the k-mer with the highest probability within our second string. Our only option for this profile – ‘GCG’ does not exist in our second string. Because each k-mer in our second string has a probability of 0, we return the first k-mer with a probability of 0 in our string – ‘AAG’.

    For our next profile we use ‘GCG’ and ‘AAG’:
    A: .5 .5 0
    C: 0 .5 .5
    G: .5 0 .5
    T: 0 0 0
    Focusing on the third string now, we search for the k-mer with the highest probability relating to this profile within our third string. ‘AAG’ has the highest probability in our third string. Then we repeat this process. Let me know if this does not answer your question or if I made an error.

    * As a side note, when determining the hamming distance score, we want to make sure to take the consensus of our motif list and then compare that to each motif in our list, adding up a total score. So for our first motif list, looking at a profile of all 5 k-mers, we can tell the consensus is AAG. So comparing AAG to GCG we get a hamming distance of 2. Comparing AAG to AAG and AAG we have a H. distance of 0. So our total is still 2. Then comparing AAG to ACG we get a H. distance of 1, so our total is now 3. Comparing AAC to CAA we get a H. distance of 2, so our total hamming distance score is 5.

  3. Pingback: Creating a Profile Matrix | JaPy Software

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s