FASTA to 2-line FASTA

Level: Intermediate (score: 3)

A very simple format to store biological sequence data is the (multi-)FASTA format.

The first line of each record starts with a > character, followed by a name. The following lines contain the sequence information. A record ends when the next > character or the end of the file is encountered.
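The record structure described above can be sketched as a small parser. This is a minimal illustration, not part of the Bite: the function name `iter_fasta_records` is hypothetical, and it assumes that blank lines can be ignored and that any text before the first > line is not part of a record.

```python
from typing import Iterable, Iterator, Tuple


def iter_fasta_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from the lines of a (multi-)FASTA file."""
    name = None   # name of the record currently being read
    chunks = []   # sequence lines collected for that record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:]
            chunks = []
        elif name is not None:
            chunks.append(line)
    # The end of the input closes the last record.
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical, shortened example data
example = """>Sequence 1:
ATGTCG
GAAAAA
>Sequence 2:
ATGATG
""".splitlines()

records = list(iter_fasta_records(example))
```

Here `records` holds one `(name, sequence)` tuple per record, with the sequence lines of each record joined into a single string.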

FASTA files downloaded from public databases such as the National Center for Biotechnology Information (NCBI) often contain a line break every 60-80 characters, which ensures sequences are not truncated in text editors.

However, in many cases (think *nix command-line tools such as grep and wc), it is better if each sequence is exactly one line long.

Your job is to convert a multiline FASTA file to a 2-Line FASTA file.

Multiline FASTA format:

>Sequence 1:
ATGTCGGAAAAAGAAATTTGGGAAAAAGTGCTTGA
AATTGCTCAAGAAAAATTATCAGCTGTAAGTTACT
[...]
>Sequence 2:
ATGATGGAATTCACTATTAAAAGAGATTATTTTAT
TACACAATTAAATGACACATTAAAAGCTATTTCAC
[...]

2-Line FASTA format:

>Sequence 1:
ATGTCGGAAAAAGAAATTTGGGAAAAAGTGCTTGAAATTGCTCAAGAAAAATTATCAGCTGTAAGTTACT[...]
>Sequence 2:
ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCAC[...]

This Bite has biopython enabled (check out module Bio.SeqIO's convert function), but it can also be solved without this module.
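For the route without Biopython, one possible approach is to stream the input line by line and write each sequence without internal line breaks. The sketch below is only one way to do it, not the Bite's reference solution: the function name `fasta_to_2line`, its signature, and the file names are assumptions made for illustration, and blank lines in the input are assumed to be ignorable.

```python
import tempfile
from pathlib import Path


def fasta_to_2line(in_path: str, out_path: str) -> int:
    """Write a 2-line copy of the FASTA file in_path to out_path.

    Returns the number of records written.
    """
    count = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for raw in src:
            line = raw.strip()
            if not line:
                continue  # assumption: blank lines can be skipped
            if line.startswith(">"):
                if count:
                    dst.write("\n")  # end the previous record's sequence line
                dst.write(line + "\n")
                count += 1
            elif count:
                dst.write(line)  # append to the current sequence, no newline
        if count:
            dst.write("\n")  # end the last record's sequence line
    return count


# Usage with a small, hypothetical input file
with tempfile.TemporaryDirectory() as tmp:
    src_file = Path(tmp) / "multi.fasta"
    dst_file = Path(tmp) / "two_line.fasta"
    src_file.write_text(
        ">Sequence 1:\nATGTCG\nGAAAAA\n>Sequence 2:\nATGATG\nGAATTC\n"
    )
    n_records = fasta_to_2line(str(src_file), str(dst_file))
    result = dst_file.read_text()
```

Because the function never holds more than one line in memory, it should also cope with large files. With Biopython, calling `Bio.SeqIO.convert` with `"fasta"` as the input format and `"fasta-2line"` as the output format should give an equivalent result, assuming the installed version supports that output format.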