[Update: 2021-01-10: Thank you for your interest in our book club. We are currently closed to new members, but you can watch and subscribe to our meetings on the Genomics in the Cloud Book Club channel on YouTube.]
Introducing the Genomics in the Cloud Book Club, an online discussion group. Our 30+ members across 10 time zones are covering one chapter each week, and we expect to complete the book in March 2021.
Taking a page from the R for Data Science Online Learning Community, we created a Slack account for discussions and a Zoom account for meetings. Last week, we had lively online conversations about reference genome diversity, workflow language selection, personal whole genome sequencing, reproducibility tips, and more.
After each meeting, we post the video to this GITC Book Club channel on YouTube so you can follow us anytime. Our members’ tweets are also available here. Thank you for tuning in!
One of the limitations of the family trio work was that the bioinformatics pipelines were different between our samples and our kids’ samples. To fix this limitation, I had to “reconstitute” the original FASTQ files from the BAM file provided by Illumina and then re-run all our data through the same pipeline. (Note: To my knowledge, Illumina’s Understand Your Genome (UYG) program no longer provides BAM files.)
You can create FASTQ files from your BAM file by using Picard, a set of Java-based command line tools for manipulating high-throughput sequencing (HTS) data in formats such as SAM/BAM/CRAM and VCF.
For reasons that escape me now, I first ran Picard using an AWS t1.micro instance.
After 3 attempts–watching Picard fail after running for 3 days each time–and creating thousands of temp files in the process, I learned the hard way that Picard requires more than 613 MBytes of memory. This time, I used a c4.2xlarge instance (4 cores, 16 GBytes of memory), which worked. It appears that 16 GBytes is about the minimum amount of memory to get the job done.
Step 1. Is your BAM file sorted?
Before creating FASTQ files, make sure your BAM file is sorted so that your genome coordinates are in order. One of the ways to do this is with samtools, a suite of programs for interacting with HTS data. Here are the commands I used to install it. You can check whether or not your BAM file is sorted by using this command:
samtools stats YourFile.bam | grep "is sorted:"
# "is sorted: 1" = Yes, your BAM file is sorted.
# "is sorted: 0" = No, your BAM file is not sorted.
If your BAM file requires sorting, use this command (or something close to it):
# Type "samtools sort --help" for a description of this command
# Note: -n sorts by read name; omit -n to sort by genome coordinate
samtools sort -n -@ 2 -m 2560M InputFile.bam -o ./OutputFile.sorted.bam
# Check for existence of Read Groups (@RG)
samtools view -H InputFile.bam | grep '^@RG'
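If you want to script the check in Step 1, the flag can be pulled out of the stats output directly. This is a minimal sketch, assuming samtools is installed and that its summary lines are tab-separated; `bam_sort_flag` is a made-up helper name for illustration, not a samtools command.

```shell
#!/usr/bin/env bash
# Sketch: read `samtools stats` text on stdin and print the "is sorted" flag.
# bam_sort_flag is a hypothetical helper name, not part of samtools.
bam_sort_flag() {
    # Assumed summary line layout: SN<TAB>is sorted:<TAB>1
    awk -F'\t' '$2 ~ /^is sorted:/ { print $3; exit }'
}

# Usage (needs samtools and a real BAM file):
#   flag=$(samtools stats YourFile.bam | bam_sort_flag)
#   if [ "$flag" != "1" ]; then
#       echo "YourFile.bam is not sorted; run the samtools sort step first"
#   fi
```

Keeping the parsing in its own function means you can try it on saved stats output before pointing it at a 70+ GByte BAM file.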
Step 2. Run Picard
Get Java and the picard.jar file. Run this command, but keep in mind that the options below are for a BAM file created on an Illumina HiSeq sequencer:
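For orientation, here is a minimal sketch of what a Picard SamToFastq invocation can look like. It assumes paired-end reads and is not necessarily the exact command or option set used for this post; the file names, the 12g Java heap size, and the helper name `samtofastq_cmd` are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: prints a Picard SamToFastq command for review.
# Assumptions: paired-end reads, picard.jar in the current directory,
# and a 12g heap to stay under the 16 GBytes on the instance.
samtofastq_cmd() {
    bam="$1"      # input BAM file
    prefix="$2"   # prefix for the output FASTQ files
    echo java -Xmx12g -jar picard.jar SamToFastq \
        "I=${bam}" \
        "FASTQ=${prefix}_R1.fastq" \
        "SECOND_END_FASTQ=${prefix}_R2.fastq" \
        "UNPAIRED_FASTQ=${prefix}_unpaired.fastq"
}

# Print the command; copy and run it once the file names look right.
samtofastq_cmd YourFile.sorted.bam YourFile
```

Printing the command first is a cheap safeguard on a job that runs for hours.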
Using the c4.2xlarge instance, I ran Picard in 3 hours to create the FASTQ files shown below. In addition, creating compressed (gzip) versions of the files required another 8.5 hours of compute time. With an on-demand price of about $0.40 per hour, creating compressed FASTQ files cost approximately $4.60 USD on AWS.
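The cost figure is just hours multiplied by the hourly price. A quick sketch using the numbers quoted above (`ec2_cost` is a made-up helper name, and the $0.40 rate is the approximate on-demand price mentioned in the text):

```shell
#!/usr/bin/env bash
# Sketch: on-demand compute cost = hours * price per hour.
ec2_cost() {  # args: hours, USD per hour
    awk -v h="$1" -v p="$2" 'BEGIN { printf "%.2f\n", h * p }'
}

# 3 hours of Picard + 8.5 hours of gzip = 11.5 hours
ec2_cost 11.5 0.40   # prints 4.60
```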
I launched my first cloud server literally while in the clouds in May 2014. Cloud computing has changed so much, it’s unbelievable. Back then, I had to patch the Linux kernel by hand so that the ftp server would work on AWS. Today, uploading your genome using Amazon’s command line interface (CLI) to an AWS S3 storage bucket is relatively easy. Understandably, Amazon makes it challenging (but doable) to make your storage publicly available. I used the Apache Web Server and s3fs to share this information.
s3fs allows you to mount an S3 bucket via FUSE. s3fs preserves the native object format for files, allowing use of other tools like AWS CLI. Again, your commands may vary depending on your flavor of Linux. Here are the commands I used to install s3fs.
$1,500. That’s the amount of money I have spent over the past 5 years to store our family’s whole genome sequence (WGS) data. For $299 per person in 2020, I could sequence all of us again at 30x coverage, get the same data files, and spend less money. In 2015, I wrote about posting my WGS data to DNAnexus. Last month (July 2020), I moved all of our data to Amazon (AWS) S3 storage. In this post, I explain why.
Nevertheless, I recently moved my WGS data to Amazon S3 due to storage costs and a lack of price transparency.
I’ve learned that most of the work that I want to do can be done with VCF files. Yes, there are times when I want to look at BAM files, but moving those files to lower-cost storage makes sense. DNAnexus introduced a Glacier-based archiving service in 2019 to support those operations, although I did not use it. My BAM file is 73 GBytes, which represents about 93% of the 79 GBytes for my WGS data (no FASTQ data). If I deeply archive BAM and FASTQ data (329 GBytes total), I can reduce the amount of higher-cost storage by 98%. The cost comparison for a single genome with FASTQ files looks roughly like this:
Storage cost on DNAnexus: (329 GBytes * $0.03 per GB-month [everything]) = $9.87 per month
Storage cost on AWS: (7 GBytes * $0.0125 per GB-month [VCF]) + (322 GBytes * $0.00099 per GB-month [everything else]) = $0.41 per month
Overall, I can reduce my monthly storage costs by over 95% by using lower-cost storage tiers on AWS (see Table 1 below). Again, the comparison is apples-to-oranges because I did not use DNAnexus’ archiving service, mostly because it required a separate license to activate. Using Amazon S3, our monthly WGS storage costs will decrease from $24 per month to less than $1 per month.
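To make the single-genome comparison easy to rerun with your own file sizes, here is a small sketch that reproduces the two cost lines above. The sizes and per-GB-month rates are the ones quoted in this post; `storage_summary` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Sketch: monthly storage cost for one genome (329 GBytes with FASTQ files).
storage_summary() {
    awk 'BEGIN {
        dnanexus = 329 * 0.03                    # everything at one rate
        aws = 7 * 0.0125 + 322 * 0.00099         # VCF tier + deep archive tier
        printf "DNAnexus %.2f, AWS %.2f, savings %.1f%%\n",
               dnanexus, aws, (1 - aws / dnanexus) * 100
    }'
}

storage_summary   # prints: DNAnexus 9.87, AWS 0.41, savings 95.9%
```

Swap in your own sizes and current list prices, since storage rates change over time.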
Lack of price transparency
If we compare AWS’ S3 storage price from 5 years ago to DNAnexus’, we find that the storage markup was 35% over the S3 list price. It turns out that Amazon decreased its S3 storage price over the past 5 years, which led DNAnexus to drop their storage price to the current $0.03 per GB-month, still at a 35% markup. (For comparison, on-demand GPU- or FPGA-based compute cycles on Amazon EC2 are marked up over 100%.)
I do not fault DNAnexus for marking-up AWS pricing–they are a business and provide value beyond storage and compute cycles. However, you will not find any pricing information on the DNAnexus website. In addition to storage costs, add-ons like archiving and GxP regulatory compliance require separate licenses that are not disclosed when signing-up. Presumably, the company’s professional services team assists with these onboarding activities.
How to move your data from DNAnexus to AWS
So, having made the decision to move my WGS data to AWS, how did I do it?
On the DNAnexus platform, I used AWS S3 Exporter, a company-provided tool to upload data to an AWS S3 bucket (DNAnexus account required). You can invoke the exporter using either their SDK (dx-toolkit) or an online wizard–both methods work great. The DNAnexus documentation for the exporter tool is a little out-of-date, so here is the updated AWS IAM policy file to make your transfers work with verification:
Another improvement: You can transfer your data from one S3 instance to another (DNAnexus to AWS) at the rate of 250 GBytes per hour, including verification. Five years ago, the transfer speed was 10 GBytes per hour!
One final gotcha
One thing that has not changed in 5 years is the “data transfer out” fee. Amazon’s fee is $0.09 per GByte and DNAnexus’ fee is $0.13 per GByte. This fee is an understandable disincentive to moving your data around too much. In my case, moving our family’s WGS data to AWS will add over $100 to the final bill. It’s a little like losing all your money at baccarat and then finding out that you still owe the banque a commission before you leave the table. Not a big deal when you are a family, but when you are the UK Biobank expecting to grow to 15 petabytes over the next 5 years…well, you get the idea.
For the money, take a look at upstart competitors like Basepair or ixLayer.
[Update 2021-01-10: Do not forget to remove the DNAnexus API, called dx-toolkit!]
You may know this painting, one of over two dozen haystack paintings (actually, stacks of wheat) that Claude Monet produced at various times and seasons. What if we hid a needle in one of them? Could we find it? In genetics, we often say that finding a genetic variant is like finding a needle in a haystack. But how big is the needle? How big is the haystack? Who believes this stuff?
Francis Collins, NIH director and leader of the Human Genome Project, is a clear believer. In a recent PBS series by Ken Burns, The Gene: An Intimate History, Dr Collins said that finding a misspelling in a gene that causes a particular disease is “a needle in a haystack.”
So, in honor of April 25th, National DNA Day, which commemorates the discovery of DNA’s double helix in 1953 and the completion of the Human Genome Project in 2003, I embarked on a curious journey to answer these questions. My approach would be to determine the volume of a needle and then figure out how large to make the haystack to see if the analogy holds.
What is the volume of a needle?
How do you calculate the volume of a needle? It is useful to start with some real needles, and it turned out that I had a small box of 30 assorted needles–here’s a photo of them:
Needles come in all kinds of lengths, but I was looking for an “average” needle, so I made a rough list of their lengths and came up with this:
Average needle length: sum(5 x 30 mm, 10 x 35 mm, 11 x 40 mm, 1 x 45 mm, 1 x 50 mm, 1 x 55 mm, 1 x 60 mm) / 30 = 37 mm or 1.4 inches
Volume of a needle: displacement method
Next, I had to figure out the volume (V) of a needle. My first approach was to drop the needles into a graduated cylinder filled with water and then measure the displacement.
It turns out that 30 needles displaced 0.5 ml of water, so we divide by 30 to get the average volume of the needles in my sample.
V = 0.5 ml / 30 = 0.017 ml or about 0.02 ml per needle
Can we find a different method to measure the volume of the needle and compare results?
Volume of a needle: small cylinder
Another way to approximate the volume of the needle is to view it as a small cylinder using the formula: V = pi * r^2 * height. Since we are looking for volume (V), we want to express our measurements in centimeters to end up with cubic centimeters (cc).
V = pi * (0.075 cm / 2)^2 * 3.7 cm = 0.016 cc, or about 0.02 cc
Since 1 ml = 1 cc, we can say that the volume of our needle is 0.02 ml, which is spot on with our previous measurement. Note: It is rare in science that your numbers match exactly, but we’ll take the win and start building our haystack.
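Both needle estimates are easy to recheck. The sketch below uses only the numbers given above (the needle counts and lengths, 0.5 ml of displaced water, a 0.075 cm diameter, and a 3.7 cm length); `needle_volumes` is a made-up helper name. One thing it shows: the weighted mean of the listed lengths comes out slightly above the 37 mm quoted, which does not change the rounded volume.

```shell
#!/usr/bin/env bash
# Sketch: recompute the average needle length and both volume estimates.
needle_volumes() {
    awk 'BEGIN {
        pi = atan2(0, -1)
        # Weighted mean of the 30 needle lengths, in mm
        mean_mm = (5*30 + 10*35 + 11*40 + 45 + 50 + 55 + 60) / 30
        # Displacement method: 0.5 ml shared by 30 needles
        displaced = 0.5 / 30
        # Cylinder method: pi * r^2 * height, in cm
        cylinder = pi * (0.075 / 2)^2 * 3.7
        printf "mean length %.1f mm, displacement %.3f ml, cylinder %.3f cc\n",
               mean_mm, displaced, cylinder
    }'
}

needle_volumes
# prints: mean length 38.3 mm, displacement 0.017 ml, cylinder 0.016 cc
```

Either way, both methods round to about 0.02 ml per needle.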
Building a genetic haystack
Our next task is to build a haystack that approximates the size of the human genome, where each A-T or C-G base pair in the human genome is the size of a needle. The Human Genome Project pegs that number around 3 billion base pairs within one copy of a single genome.
We build our genetic haystack by defining it as a volume having 3 billion opportunities to find a 0.02 ml object, or volume (V) equals 3,000,000,000 * 0.02 ml.
V = (3*10^9 * 0.02 ml) = 60*10^6 ml or 60,000,000 ml
Since 1 ml = 1 cubic centimeter (cc), the volume can also be stated as 60,000,000 cubic centimeters or 60*10^6 cc. Finally, 1 liter = 1000 ml, so the volume in liters equals 60,000,000 ml * (1 liter / 1000 ml).
V (haystack) = (60*10^6 ml) * (1 liter / 1000 ml) = 60*10^3 liters or 60,000 liters
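The same scaling can be written as a tiny function, which makes it easy to try a different genome size or needle volume. This is a sketch using the values above; `haystack_volume` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Sketch: haystack volume = number of base pairs * volume of one needle.
haystack_volume() {  # args: base pairs, needle volume in ml
    awk -v n="$1" -v v="$2" 'BEGIN {
        ml = n * v
        printf "%.0f ml = %.0f liters\n", ml, ml / 1000
    }'
}

haystack_volume 3000000000 0.02   # prints: 60000000 ml = 60000 liters
```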
We now know how much space it occupies, but what does it look like? For example, how tall is our haystack?
Shape of a genetic haystack: approximation with a hemisphere
Like the needle, we can approximate the shape of a haystack with something easy to calculate, like the volume of a hemisphere in cubic centimeters: V = 2/3 * pi * r^3.
V (hemisphere) = 60*10^6 cc = 2/3 * pi * r^3
Solving for the radius (r), we get r = 306 cm. Since 1 m = 100 cm, the height (radius) of the hemisphere equals 306 cm * (1 m / 100 cm).
r (hemisphere) = 306 cm * (1 m / 100 cm) = about 3 meters or 10 feet tall
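Solving V = 2/3 * pi * r^3 for r gives r = (3V / (2 * pi))^(1/3). Here is a sketch of that step; `hemisphere_radius_cm` is a made-up helper name, and the cube root is taken with exp and log because awk has no built-in cube root.

```shell
#!/usr/bin/env bash
# Sketch: radius of a hemisphere with a given volume.
hemisphere_radius_cm() {  # arg: volume in cubic centimeters
    awk -v v="$1" 'BEGIN {
        pi = atan2(0, -1)
        r = exp(log(3 * v / (2 * pi)) / 3)   # cube root via exp and log
        printf "%.0f\n", r
    }'
}

hemisphere_radius_cm 60000000   # prints 306
```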
So, we have a rough idea of what an idealized haystack looks like. It is 3 meters tall and it is shaped like a hemisphere. But how accurate is our measurement? Can we do better? Has anyone studied haystack modeling? Indeed, someone has and his name is W.H. Hosterman.
Shape of a genetic haystack: using haystack modeling
In 1931, W.H. Hosterman from the U.S. Department of Agriculture published an extensive technical bulletin “for the purpose of determining the volume and tonnage of hay.” Hosterman and his USDA colleagues measured over 2,600 haystacks across 10 states and presented results for both square and round haystacks. For round haystacks, Hosterman derived this formula: V = ((0.04 * Over) – (0.012 * C)) * C^2, where C is the circumference of the haystack, and Over is the measurement of the circumference of the stack over the top. We can compare results with the hemisphere by selecting equivalent values from Table 7 in Hosterman’s paper. (We will momentarily switch to imperial units to make use of the constants in the formula.)
Hosterman’s observed haystacks are slightly taller than they are wide, but their sloped tops make the volume of the “true” haystack (58,333 liters) a little less than our original estimate of a 3-meter-tall, 60,000-liter haystack. When we compare the volume using Hosterman’s formula to the volume of the hemisphere, we see that they differ by 1,667 liters, or about 3%. Close enough!
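Hosterman’s round-stack formula is simple enough to script. In the sketch below, the formula and the two liter figures come from the text above, but the example measurements (C = 60 ft, Over = 32 ft) are hypothetical inputs chosen for illustration, not values from Table 7. The helper names are made up, and the result is assumed to be in cubic feet because the measurements are in feet.

```shell
#!/usr/bin/env bash
# Sketch: Hosterman round-stack volume, V = (0.04 * Over - 0.012 * C) * C^2.
hosterman_round_cuft() {  # args: circumference C (ft), over-the-top Over (ft)
    awk -v c="$1" -v over="$2" 'BEGIN {
        printf "%.0f\n", (0.04 * over - 0.012 * c) * c * c
    }'
}

# Percent difference of b relative to a
percent_diff() {  # args: a, b
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

hosterman_round_cuft 60 32   # hypothetical stack: prints 2016 (cubic feet)
percent_diff 60000 58333     # hemisphere vs. Hosterman, liters: prints 2.8
```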
Genetics is like finding a needle in a haystack
We have two comparisons that match closely. So, what does that haystack look like?
It turns out that Romanian haystacks are about 3 meters tall, so finding a misspelling in a gene (a genetic variant) is indeed like finding a needle in a haystack.
Who looks for needles in haystacks?
If genetics truly is like finding a needle in a haystack, what kinds of people do this? Well, let’s go back to Francis Collins.
In 1993, Collins published a paper describing the genetics of cystic fibrosis, a disease that required finding “a needle in a haystack.” Today, we have found needles from thousands of genetic haystacks. Some of these needles lead to clues about rare diseases, which affect more than 400 million people worldwide.
In real life, you can find a needle in a haystack, too. In 2014, performance artist Sven Sachsalber found a needle in a haystack after hunting for it for 24 hours. If we could get computers to solve diseases at that speed, we would happily be out of work.
Today I joined All of Us, a research community of one million people to lead the way for individualized prevention, treatment, and care for, well, all of us. This project was previously known as the Precision Medicine Initiative.
Many of you know that our family has used whole genome sequencing to look for clues in our daughter’s autism. This blog shares that journey. I have also published peer-reviewed papers to explore the reasons why people share personal health information. Through this research, I am convinced that information sharing will contribute to a learning healthcare system to improve care and lower costs.
It just takes people like you and me to #JoinAllofUs and lead by example.
In 2006, a Scientific American article written by George Church, “Genomics for All,” rekindled my interest in genomics. I went back to school in 2009 to contemplate the business of genomic medicine, and celebrated my MBA by writing a Wikipedia entry for the word, “Exome.” I was hooked.
Along the way, I realized that medical imaging and genomics are highly complementary: genomics informs or identifies conditions, and radiology localizes them. Sarah-Jane Dawson pointed this out at a Future of Genomic Medicine conference in 2014.
I have been a long-time listener to the intelligent and informative podcasts on Mendelspod, a site that connects people and ideas in life sciences. (Most nights you can find me listening to Mendelspod while I do the dishes.) I tuned-in sometime in 2012 and created a mental map of the industry by listening to every podcast I could find. A steady diet of listening to the latest developments in the industry has allowed me to talk about genomics with ease at meetups, tweetups and conferences. (OK, going back to school helped, too.) Somewhere along the way I decided that I would do something worthy of being interviewed on the show.
Hosted by the Mind First Foundation, this conference enabled participants in the Personal Genome Project to hear first-hand how their health data could be used in research, especially mental health research. The second day of the conference, the “PGPalooza,” let PGP participants directly interact with researchers to select projects of interest and have their questions answered immediately.
James Tao graciously edited this 25-minute video of my talk about family trio sequencing and autism:
Also, special thanks to Alex Hoekstra, co-founder of Mind First, for the invitation to this event.
In this blog post, I look at whole genome sequence platforms for storage and discuss what might happen to “genomical” amounts of data.
When I uploaded my whole genome sequence in September 2014 (about 10 months ago), few options existed for sharing personal genomic data. The usual suspects (Dropbox, Evernote and Figshare) were prohibitively expensive for large amounts of data. I knew about DNAnexus, but I saw it as a platform for researchers, not consumers. Well, times have changed. Fast.
A Battle of Platforms?
In addition to my original “roll your own” approach, DNAnexus and Google Genomics have emerged as major players for end-to-end genomics workflow. In the table below, you can see that storage costs for AWS S3, DNAnexus and Google Genomics are roughly the same. Everyone provides free uploads (we want your data!), but the cost for transferring data out of the system varies. Google Genomics does not charge for this, but instead charges for API access. For my current AWS storage, I pay about $4 per month to store my genome.
Astronomical becomes Genomical: A Perspective on Storage
In this recent article about big data and genomics, the authors compare the field of genomics with three other Big Data applications: astronomy, YouTube and Twitter. In common with genomics, these domains: 1) generate large amounts of data, and 2) share similar data life cycles. The authors examine four areas–acquisition, storage, distribution, analysis–and conclude that genomics is “on par with or the most demanding” of these disciplines/applications. My previous experience in medical imaging (a field that arguably tackled the prior generation of “big data” issues) leads me to believe that genomics will come to epitomize Big Data to many more people before long.
If you look carefully at the projections in the figure above, we may run out of genomes to sequence (really?), which brings us back to storage. Where will we store all of this sequence data, especially as genomic medicine continues its inexorable move to the clinic?
Delete Nothing and Carry on
If the field of medical imaging is an indicator, deleting anything after it has been archived is the exception rather than the rule. The main reason for this is medicolegal — hospitals avoid the liability of not being able to recall an exam later by keeping everything. Although the incidence of requiring access to images after diagnosis is low, the consequence of not having access to the original diagnostic image is high. A former colleague suggested that about 5% of their medical archive customers use lifecycle management features to delete imaging exams. In medical imaging, customers more commonly use lifecycle management features to migrate images to less expensive storage devices over time. So, in genomics, you might migrate your sequence data stored on Amazon from solid state storage (most expensive) to S3 to Glacier (least expensive). But my best guess: we’ll delete nothing and carry on.