In this post, I explain how I created FASTQ files from a BAM file using a utility called Picard (no relation, although I pronounce my name the same way).
In 2014, my wife and I “got genomed” through Illumina’s Understand Your Genome (UYG) program, now managed by Genome Medical. Subsequently, I crowdsourced the sequencing of our kids’ genomes and presented family trio findings about our adult daughter’s autism in 2015.
One of the limitations of the family trio work was that the bioinformatics pipelines were different between our samples and our kids’ samples. To fix this limitation, I had to “reconstitute” the original FASTQ files from the BAM file provided by Illumina and then re-run all our data through the same pipeline. (Note: To my knowledge, UYG no longer provides BAM files as part of this program.)
Fortunately, bioinformatics wizard Mike Lin was also in my UYG class and wrote a blog series explaining how to extract FASTQ files from a BAM file. (Thank you, Mike!)
Using AWS to run samtools and Picard
You can create FASTQ files from your BAM file by using Picard, a set of Java-based command line tools for manipulating high-throughput sequencing (HTS) data in formats such as SAM/BAM/CRAM and VCF.
For reasons that escape me now, I first ran Picard using an AWS t1.micro instance.
After 3 attempts, each of which ran for 3 days before Picard failed and left thousands of temp files behind, I learned the hard way that Picard requires more than the 613 MBytes of memory a t1.micro provides. On the next attempt, I used a c4.2xlarge instance (4 cores, 16 GBytes of memory), which worked. It appears that 16 GBytes is about the minimum amount of memory to get the job done.
Step 1. Is your BAM file sorted?
Before creating FASTQ files, make sure your BAM file is sorted so that your genome coordinates are in order. One of the ways to do this is with samtools, a suite of programs for interacting with HTS data. Here are the commands I used to install it. You can check whether or not your BAM file is sorted by using this command:
samtools stats YourFile.bam | grep "is sorted:"
# "is sorted: 1" = Yes, your BAM file is sorted.
# "is sorted: 0" = No, your BAM file is not sorted.
If your BAM file requires sorting, use this command (or something close to it):
# Type "samtools sort --help" for a description of this command
# Note: -n sorts by read name; omit -n to sort by genome coordinate instead
samtools sort -n -@ 2 -m 2560M InputFile.bam -o ./OutputFile.sorted.bam

# Check for existence of Read Groups (@RG)
samtools view -H InputFile.bam | grep '^@RG'
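Another way to inspect sort order and read groups is to read the BAM header directly. The sketch below is only an illustration: `header_sort_order` and `count_read_groups` are hypothetical helper names, and both expect SAM header text on standard input (for example, the output of `samtools view -H`). The `SO:` field on the `@HD` line is defined by the SAM specification and takes the values `unknown`, `unsorted`, `queryname`, or `coordinate`.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original post). Both read SAM header
# text on stdin, e.g. the output of: samtools view -H InputFile.bam

# Print the SO: value from the @HD line, or "unknown" if none is present.
header_sort_order() {
    awk -F'\t' '
        /^@HD/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
            }
        }
        END { if (!found) print "unknown" }
    '
}

# Print the number of @RG (read group) lines in the header.
count_read_groups() {
    # grep -c exits non-zero when the count is 0; keep the exit status clean.
    grep -c '^@RG' || true
}

# Example usage (requires samtools and a real BAM file):
#   samtools view -H InputFile.bam | header_sort_order
#   samtools view -H InputFile.bam | count_read_groups
```

A file sorted with `samtools sort -n` should report `queryname` here, while a coordinate-sorted file should report `coordinate`. The read-group count indicates how many sets of output files to expect when Picard is run with OUTPUT_PER_RG=true.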
Step 2. Run Picard
Get Java and the picard.jar file. Run this command, but keep in mind that the options below are for a BAM file created on an Illumina HiSeq sequencer:
java -jar ~/picard.jar SamToFastq INPUT=InputFile.bam RE_REVERSE=true INCLUDE_NON_PF_READS=true OUTPUT_PER_RG=true OUTPUT_DIR=OutputDirectoryName
Alternatively, you can use GATK4 (version 4.0 or later) to accomplish the same task:
gatk SamToFastq --INPUT=InputFile.bam --RE_REVERSE=true --INCLUDE_NON_PF_READS=true --OUTPUT_PER_RG=true --OUTPUT_DIR=OutputDirectoryName
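Since memory turned out to be the limiting factor, it can help to make the Java heap size explicit when running the Picard variant of the command. The sketch below wraps that command in a hypothetical shell function, `run_sam_to_fastq`. The 12g default heap, the `PICARD_JAR` and `DRY_RUN` variables, and the `tmp` subdirectory are assumptions for illustration, not settings from the original run. `-Xmx` is the standard JVM maximum-heap flag, and `TMP_DIR` is a standard Picard option for temporary files.

```shell
#!/bin/sh
# Hypothetical wrapper (not from the original post).
# Usage: run_sam_to_fastq INPUT.bam OUTPUT_DIR [HEAP]
#   HEAP       JVM maximum heap, e.g. 12g (assumed default; tune to your instance)
#   PICARD_JAR path to picard.jar (defaults to ~/picard.jar, as in the post)
#   DRY_RUN    if set and non-empty, print the command instead of running it
run_sam_to_fastq() (
    bam="$1"
    outdir="$2"
    heap="${3:-12g}"
    jar="${PICARD_JAR:-$HOME/picard.jar}"

    # Build the command as positional parameters so quoting is preserved.
    set -- java "-Xmx${heap}" -jar "$jar" SamToFastq \
        "INPUT=${bam}" RE_REVERSE=true INCLUDE_NON_PF_READS=true \
        OUTPUT_PER_RG=true "OUTPUT_DIR=${outdir}" "TMP_DIR=${outdir}/tmp"

    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"
    else
        # Create the output and temp directories before starting.
        mkdir -p "${outdir}/tmp" && "$@"
    fi
)

# Example usage:
#   DRY_RUN=1; run_sam_to_fastq InputFile.bam OutputDirectoryName 12g
```

Running it first with `DRY_RUN` set is a cheap way to confirm the command line before committing to a multi-hour job.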
Using the c4.2xlarge instance, Picard took 3 hours to create the FASTQ files shown below. Creating compressed (gzip) versions of the files required another 8.5 hours of compute time. At an on-demand price of about $0.40 per hour, the 11.5 hours needed to create compressed FASTQ files cost approximately $4.60 on AWS.
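The cost figure is simple arithmetic: total compute hours multiplied by the hourly rate. As a minimal sketch, the hypothetical `estimate_cost` function below reproduces it. The function name and argument order are assumptions for illustration, and the rate is the approximate figure quoted above, not a current AWS price.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post).
# Usage: estimate_cost RATE_PER_HOUR HOURS [HOURS...]
# Prints the estimated cost, rounded to two decimal places.
estimate_cost() (
    rate="$1"
    shift
    # With only a BEGIN block, awk reads no input; the remaining
    # arguments are available as ARGV[1] .. ARGV[ARGC-1].
    awk -v rate="$rate" 'BEGIN {
        total = 0
        for (i = 1; i < ARGC; i++) total += ARGV[i]
        printf "%.2f\n", total * rate
    }' "$@"
)

# 3 hours for Picard plus 8.5 hours for gzip, at about $0.40 per hour:
estimate_cost 0.40 3 8.5
```

Swapping in a different instance's hourly rate gives a quick estimate before launching a long job.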