Scoring matrices
Level: Advanced (score: 4)
Proteins fulfill important functions in all organisms and consist of amino acids linked together in a specific order.
Although single changes of these amino acids can have devastating effects on the protein function, not all changes carry the same severity. The impact is largely influenced by the chemical and physical properties of the substituted amino acid.
But how can differences in proteins be quantified? An efficient way to calculate similarity is the use of scoring matrices.
In this Bite you will use BLOSUM and PAM matrices to calculate protein similarity scores.
The score is calculated by looking up each aligned pair of amino acids in the scoring matrix and summing the values. Each amino acid is represented by a one-letter code (e.g. A stands for alanine, R for aRginine, ...).
Consider the following scoring matrix:
BLOSUM62 (excerpt):
  |  A  R  N  D  C  Q [...]
--+------------------------
A |  4 -1 -2 -2  0 -1 [...]
R | -1  5  0 -2 -3  1 [...]
[...]
To calculate the BLOSUM62 similarity score between two sequences Seq1 and Seq2, the sequences are aligned and the individual scores for each pair of amino acids are added up as follows:
Seq1   A  R  R  N  C  Q  A
Seq2   A  A  R  R  A  A  A
--------------------------
Score  4 -1  5  0  0 -1  4  --> SUM == 11
matrix_score() therefore returns 11 in this example.
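The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the Bite's reference solution: the name BLOSUM62_EXCERPT, its nested-dict layout and the helper pair_score are assumptions made for this example, and the matrices provided with the Bite may be structured differently. Only the two rows shown in the excerpt are stored, so the lookup falls back to the swapped pair, which works because BLOSUM62 is symmetric.

```python
# Values copied from the BLOSUM62 excerpt above; the nested-dict layout is an
# assumption made for this sketch, not the format of the provided matrices.
BLOSUM62_EXCERPT = {
    "A": {"A": 4, "R": -1, "N": -2, "D": -2, "C": 0, "Q": -1},
    "R": {"A": -1, "R": 5, "N": 0, "D": -2, "C": -3, "Q": 1},
}


def pair_score(matrix, a, b):
    """Look up one pair, trying both orders because BLOSUM62 is symmetric."""
    if a in matrix and b in matrix[a]:
        return matrix[a][b]
    return matrix[b][a]


def matrix_score(seq1, seq2, matrix):
    """Sum the matrix values of each aligned amino acid pair."""
    return sum(pair_score(matrix, a, b) for a, b in zip(seq1, seq2))


print(matrix_score("ARRNCQA", "AARRAAA", BLOSUM62_EXCERPT))  # 11
```

With a complete matrix the symmetric fallback would not be needed; it only compensates for the rows missing from the excerpt.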
Todo:
- Implement a general matrix score calculator for amino acid sequences using the provided matrices.
- Implement a custom error class AminoAcidNotFoundError and raise it in the matrix_score function for pairs that do not exist in the matrix (see tests for more information).
- Write a function that returns the amino acid sequence(s) with the highest score, i.e. the most closely related sequence(s).
Note: For this Bite you can assume that all sequences are already properly aligned.
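One possible shape for these tasks is sketched below. Apart from AminoAcidNotFoundError and matrix_score, every name is hypothetical: the function closest_match, its reference and candidates parameters, and the tuple-keyed EXCERPT dictionary are assumptions for illustration only. The tests that accompany the Bite define the actual expected signatures and behaviour.

```python
class AminoAcidNotFoundError(Exception):
    """Raised when an amino acid pair is missing from the scoring matrix."""


def matrix_score(seq1, seq2, matrix):
    """Sum the scores of aligned pairs; `matrix` maps (a, b) tuples to scores."""
    total = 0
    for pair in zip(seq1, seq2):
        try:
            total += matrix[pair]
        except KeyError:
            raise AminoAcidNotFoundError(f"Pair {pair} not in matrix") from None
    return total


def closest_match(reference, candidates, matrix):
    """Return every candidate that shares the highest score against `reference`."""
    scores = {seq: matrix_score(reference, seq, matrix) for seq in candidates}
    best = max(scores.values())
    return [seq for seq, score in scores.items() if score == best]


# Rows A and R of the BLOSUM62 excerpt, flattened to (row, column) keys.
# The tuple-key layout is an assumption, not the format of the provided matrices.
EXCERPT = {
    ("A", "A"): 4, ("A", "R"): -1, ("A", "N"): -2,
    ("A", "D"): -2, ("A", "C"): 0, ("A", "Q"): -1,
    ("R", "A"): -1, ("R", "R"): 5, ("R", "N"): 0,
    ("R", "D"): -2, ("R", "C"): -3, ("R", "Q"): 1,
}

# "AAA" and "ARN" both score 7 against "ARA", so both are returned.
print(closest_match("ARA", ["AAA", "ARN", "RRR"], EXCERPT))
```

Returning a list keeps ties visible, which matches the "sequence(s)" wording of the task. Translating the KeyError into AminoAcidNotFoundError lets callers distinguish an unknown amino acid from other lookup bugs.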