
Codon usage

Level: Intermediate (score: 3)

The genetic code of all organisms uses a three-base (codon), four-letter encoding (A, G, C and T/U) to represent the 20* amino acids used in proteins. This yields 4^3 = 4*4*4 = 64 possible three-base codons. Of these, one serves as the initiator (the "start codon") and three signal the end of a protein (the "stop codons" (*)). The remaining 60 codons plus the start/methionine codon encode the 20 proteinogenic amino acids. Some amino acids are encoded by up to 6 different codons, whereas others are encoded by a single codon. This redundancy is known as the degeneracy of the genetic code and is often visualized with a codon wheel. Every organism has its own set of preferred codons, which helps to optimize and balance protein production.
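The counting argument can be checked with a few lines of Python. This is only an illustrative sketch: the names `BASES`, `STOP_CODONS` and `START_CODON` are made up for this example, and it assumes the standard genetic code in RNA notation, where the stop codons are UAA, UAG and UGA and the start codon is AUG.

```python
from itertools import product

BASES = "ACGU"  # RNA alphabet; use "ACGT" for DNA notation

# Every ordered triplet of the four bases: 4 * 4 * 4 combinations.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Assumption: standard genetic code.
STOP_CODONS = {"UAA", "UAG", "UGA"}
START_CODON = "AUG"

# Codons that encode an amino acid (everything except the stop codons).
sense_codons = [codon for codon in codons if codon not in STOP_CODONS]

print(len(codons))                  # 64
print(len(sense_codons))            # 61 = 60 + the start/methionine codon
print(START_CODON in sense_codons)  # True
```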

In this bite you are provided with a list of all coding sequences of the bacterium Staphylococcus aureus.

Calculate the average codon usage table for all sequences using the supplied translation table. Please note that the coding sequences are supplied as RNA sequences, whereas the codon usage table is provided in DNA notation. To convert a DNA sequence to an RNA sequence, replace every T with a U. Disregard sequences that are not valid coding sequences.
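One possible way to approach the counting step is sketched below. It is not the reference solution: the function names are hypothetical, and the bite does not spell out what makes a coding sequence "valid", so the check here is an assumption (only A/C/G/U, length a multiple of three, starts with AUG, ends with a stop codon). Reporting usage as a relative frequency is likewise an assumption about the expected output.

```python
from collections import Counter

STOP_CODONS = {"UAA", "UAG", "UGA"}  # standard genetic code, RNA notation


def dna_to_rna(sequence):
    """Convert DNA notation to RNA notation by replacing every T with a U."""
    return sequence.upper().replace("T", "U")


def is_valid_cds(rna):
    """Assumed validity rule; the bite itself does not define it."""
    return (
        len(rna) >= 6
        and len(rna) % 3 == 0
        and set(rna) <= set("ACGU")
        and rna.startswith("AUG")
        and rna[-3:] in STOP_CODONS
    )


def count_codons(sequences):
    """Count codons over all valid coding sequences, skipping invalid ones."""
    counts = Counter()
    for sequence in sequences:
        rna = dna_to_rna(sequence)
        if not is_valid_cds(rna):
            continue
        counts.update(rna[i:i + 3] for i in range(0, len(rna), 3))
    return counts


def relative_usage(counts):
    """Turn raw codon counts into fractions of all counted codons."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Toy input: the second sequence is skipped (length is not a multiple of 3).
example = ["AUGAAAUAA", "AUGAA", "ATGAAATAG"]
counts = count_codons(example)
print(counts["AUG"], counts["AAA"], counts["UAA"], counts["UAG"])  # 2 2 1 1
```

Keeping the validity check in its own function makes it easy to swap in whatever rule the bite's tests actually expect.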


There you go, our first Bioinformatics Bite. Keep calm and code in Python!