Create file pairs

Level: Advanced (score: 4)

In this bite, you will write a function that pairs filenames with each other. This is useful in bioinformatics, because some current sequencing technologies produce two paired files per sample that need to be processed together.

The filenames have the following naming scheme:

File 1: SampleName_S1_L001_R1_001.fastq.gz

File 2: SampleName_S1_L001_R2_001.fastq.gz

A pair always consists of an R1 and R2 file.

The SampleName and all numbers are variable but the overall structure is always the same:

- SampleName can contain letters, numbers, or special characters (including _)
- The number following S runs from 1 to 99
- The number following L runs from 001 to 999
- R1 stands for file 1 and R2 for file 2 of a pair (no other numbers are allowed)
- The last number block runs from 001 to 999
- The filename must end in fastq.gz (no extra extensions such as fastq.gz.md5)
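
The rules above can be expressed as a single regular expression. The following is a minimal sketch, not the bite's reference solution: the names FASTQ_NAME and is_valid_name are hypothetical, and it assumes that the S number is written without a leading zero and that 000 is not a valid value for either three-digit block.

```python
import re

# Hypothetical pattern name; the bite does not prescribe one.
FASTQ_NAME = re.compile(
    r"""
    (?P<sample>.+)                # SampleName: any characters, including underscores
    _S(?P<snum>[1-9][0-9]?)       # S number 1-99 (assumption: no leading zero)
    _L(?P<lane>(?!000)[0-9]{3})   # L number 001-999 (assumption: 000 is rejected)
    _R(?P<read>[12])              # R1 or R2 only
    _(?P<block>(?!000)[0-9]{3})   # last number block 001-999
    \.fastq\.gz                   # extension; fullmatch() forbids anything after it
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_valid_name(filename):
    """Return True if the whole filename follows the naming scheme."""
    return FASTQ_NAME.fullmatch(filename) is not None
```

Using fullmatch() rather than search() is what rejects names such as Sample1_S1_L001_R1_001.fastq.gz.md5, and re.IGNORECASE covers the case-insensitivity requirement without altering the original string.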

Your task

- Write a function pair_files(filenames) that receives a list of filenames and returns a list of tuples, where each tuple contains one pair of filenames in the following order: (filename1-R1, filename2-R2)
- Ignore filenames that do not match the naming scheme (even if they contain R1 and R2)
- Matching the filenames should be case insensitive, but the function should return the filenames in their original case
- For the tests presented here, you can assume that each file can be paired with at most one other file (no triplets or larger groups)

Example

# Two complete pairs, one file without partner
>>> filenames = [
...     "Sample1_S1_L001_R1_001.FASTQ.GZ", "Sample1_S1_L001_R2_001.fastq.gz",
...     "Sample2_S2_L001_R1_001.fastq.gz", "sample2_s2_l001_r2_001.fastq.gz",
...     "Sample3_S3_L001_R1_001.fastq.gz",
... ]
>>> pair_files(filenames)
[('Sample1_S1_L001_R1_001.FASTQ.GZ', 'Sample1_S1_L001_R2_001.fastq.gz'),
 ('Sample2_S2_L001_R1_001.fastq.gz', 'sample2_s2_l001_r2_001.fastq.gz')]
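
One possible way to approach the task is to group valid filenames under a lowercased key that leaves out the read number, then keep only the groups that contain both an R1 and an R2 file. The sketch below is an illustration under assumptions, not the official solution: the pattern name _PAIR_PATTERN and its group names are hypothetical, the S number is assumed to have no leading zero, 000 is assumed invalid for the three-digit blocks, and pairs are returned in the order in which their first file appears in the input.

```python
import re

# Hypothetical pattern: "head" and "tail" surround the single read digit.
_PAIR_PATTERN = re.compile(
    r"(?P<head>.+_S[1-9][0-9]?_L(?!000)[0-9]{3}_R)"
    r"(?P<read>[12])"
    r"(?P<tail>_(?!000)[0-9]{3}\.fastq\.gz)",
    re.IGNORECASE,
)


def pair_files(filenames):
    """Return (R1, R2) tuples for filenames that form a complete pair."""
    groups = {}  # key without read number -> {"1": R1 name, "2": R2 name}
    for name in filenames:
        match = _PAIR_PATTERN.fullmatch(name)
        if match is None:
            continue  # ignore names that do not follow the scheme
        # Lowercase only the key so the original case is kept in the result.
        key = (match["head"].lower(), match["tail"].lower())
        groups.setdefault(key, {}).setdefault(match["read"], name)
    return [
        (reads["1"], reads["2"])
        for reads in groups.values()
        if "1" in reads and "2" in reads
    ]
```

Because only the grouping key is lowercased, Sample2_S2_L001_R1_001.fastq.gz and sample2_s2_l001_r2_001.fastq.gz end up in the same group while both names are returned unchanged.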