Is genetics like finding a needle in a haystack?

Claude Monet, Wheatstacks (End of Summer), 1890-91. The Art Institute of Chicago.

You may know this painting, one of over two dozen haystack paintings (actually, stacks of wheat) that Claude Monet produced at various times and seasons. What if we hid a needle in one of them? Could we find it? In genetics, we often say that finding a genetic variant is like finding a needle in a haystack. But how big is the needle? How big is the haystack? Who believes this stuff?

Francis Collins, NIH director and leader of the Human Genome Project, is a clear believer. In a recent PBS series by Ken Burns, The Gene: An Intimate History, Dr Collins said that finding a misspelling in a gene that causes a particular disease is “a needle in a haystack.”

Dr Francis Collins, NIH director

So, in honor of April 25th, National DNA Day, which commemorates the discovery of DNA’s double helix in 1953 and the completion of the Human Genome Project in 2003, I embarked on a curious journey to answer these questions. My approach would be to determine the volume of a needle and then figure out how large to make the haystack to see if the analogy holds.

What is the volume of a needle?

How do you calculate the volume of a needle? It is useful to start with some real needles, and it turned out that I had a small box of 30 assorted needles–here’s a photo of them:

30 needles ranging in size from 30 mm to 60 mm

Needles come in all kinds of lengths, but I was looking for an “average” needle, so I made a rough list of their lengths and came up with this:

Average needle length: sum(5 x 30 mm, 10 x 35 mm, 11 x 40 mm, 1 x 45 mm, 1 x 50 mm, 1 x 55 mm, 1 x 60 mm) / 30 = 1,150 mm / 30 = about 38 mm or 1.5 inches
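For anyone who wants to check the tally, here is a minimal sketch of the weighted average in Python. The counts and lengths are copied from the rough list above; the variable names are my own. As listed, the tally works out to a little over 38 mm, within a millimeter or so of the 37 mm needle used in the rest of this post.

```python
# Tally of needle lengths (mm) -> number of needles, from the rough list above.
length_counts = {30: 5, 35: 10, 40: 11, 45: 1, 50: 1, 55: 1, 60: 1}

total_needles = sum(length_counts.values())
total_length_mm = sum(length * count for length, count in length_counts.items())

# Weighted mean length, in millimeters and inches (25.4 mm per inch).
mean_length_mm = total_length_mm / total_needles
mean_length_in = mean_length_mm / 25.4

print(f"{total_needles} needles, mean length {mean_length_mm:.1f} mm ({mean_length_in:.1f} in)")
```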

A very average needle from my sample (length = 37mm, diameter = 0.75 mm)

Volume of a needle: displacement method

Next, I had to figure out the volume (V) of a needle. My first approach was to drop the needles into a graduated cylinder filled with water and then measure the displacement.

Graduated cylinder filled with 40 ml water (left). Needles in graduated cylinder (middle). Close-up photo showing 0.5 ml water displacement (right).

It turns out that 30 needles displaced 0.5 ml of water, so we divide by 30 to get the average volume of the needles in my sample.

V = 0.5 ml / 30 = 0.017 ml or about 0.02 ml per needle

Can we find a different method to measure the volume of the needle and compare results?

Volume of a needle: small cylinder

Another way to approximate the volume of the needle is to view it as a small cylinder using the formula: V = pi * r^2 * height. Since we are looking for volume (V), we want to express our measurements in centimeters to end up with cubic centimeters (cc).

V = pi * (0.075 cm / 2)^2 * 3.7 cm = 0.016 cc, or about 0.02 cc

Since 1 ml = 1 cc, we can say that the volume of our needle is 0.02 ml, which is spot on with our previous measurement. Note: It is rare in science that your numbers match exactly, but we’ll take the win and start building our haystack.
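The two estimates can be compared side by side with a short Python sketch. The measurements (0.5 ml displaced by 30 needles, a 0.75 mm diameter and a 37 mm length) come from this post; treating the needle as a perfect cylinder is the same simplification made above, and the variable names are mine.

```python
import math

# Displacement method: 30 needles displaced 0.5 ml of water.
displaced_ml = 0.5
needle_count = 30
v_displacement_ml = displaced_ml / needle_count

# Cylinder method: V = pi * r^2 * h, with dimensions in cm so V is in cc (= ml).
diameter_cm = 0.075
length_cm = 3.7
v_cylinder_ml = math.pi * (diameter_cm / 2) ** 2 * length_cm

print(f"displacement: {v_displacement_ml:.4f} ml")
print(f"cylinder:     {v_cylinder_ml:.4f} ml")
```

Both values round to 0.02 ml, which is the figure carried forward.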

Building a genetic haystack

Our next task is to build a haystack that approximates the size of the human genome, where each A-T or C-G base pair in the human genome is the size of a needle. The Human Genome Project pegs that number around 3 billion base pairs within one copy of a single genome.

We build our genetic haystack by defining it as a volume having 3 billion opportunities to find a 0.02 ml object, or volume (V) equals 3,000,000,000 * 0.02 ml.

V = (3*10^9 * 0.02 ml) = 60*10^6 ml or 60,000,000 ml

Since 1 ml = 1 cubic centimeter (cc), the volume can also be stated as 60,000,000 cubic centimeters or 60*10^6 cc. Finally, 1 liter = 1000 ml, so the volume in liters equals 60,000,000 ml * (1 liter / 1000 ml).

V (haystack) = (60*10^6 ml) * (1 liter / 1000 ml) = 60*10^3 liters or 60,000 liters
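The unit conversions above are easy to fumble, so here is the same arithmetic as a small Python sketch. The 3 billion base pairs and the 0.02 ml needle are the figures from this post; the constant names are my own.

```python
BASE_PAIRS = 3_000_000_000   # approximate size of one copy of the human genome
NEEDLE_VOLUME_ML = 0.02      # rounded needle volume from the measurements above

# One needle-sized slot per base pair.
haystack_ml = BASE_PAIRS * NEEDLE_VOLUME_ML

# 1 ml = 1 cc, and 1 liter = 1000 ml.
haystack_cc = haystack_ml
haystack_liters = haystack_ml / 1000

print(f"{haystack_ml:,.0f} ml = {haystack_cc:,.0f} cc = {haystack_liters:,.0f} liters")
```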

We now know how much space it occupies, but what does it look like? For example, how tall is our haystack?

Shape of a genetic haystack: approximation with a hemisphere

Like the needle, we can approximate the shape of a haystack to something easy to calculate, like the volume of a hemisphere in cubic centimeters: V = 2/3 * pi * r^3.

V (hemisphere) = 60*10^6 cc = 2/3 * pi * r^3

Solving for the radius (r), we get r = 306 cm. Since 1 m = 100 cm, the height (radius) of the hemisphere equals 306 cm * (1 m / 100 cm).

r (hemisphere) = 306 cm * (1 m / 100 cm) = about 3 meters or 10 feet tall
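Solving for the radius means rearranging the hemisphere formula to r = (3V / (2 * pi))^(1/3). A minimal Python sketch of that step, using the 60*10^6 cc volume from above (variable names are mine):

```python
import math

haystack_cc = 60e6  # haystack volume in cubic centimeters

# Hemisphere: V = (2/3) * pi * r^3  =>  r = (3V / (2 * pi))^(1/3)
radius_cm = (3 * haystack_cc / (2 * math.pi)) ** (1 / 3)

radius_m = radius_cm / 100     # 100 cm per meter
radius_ft = radius_cm / 30.48  # 30.48 cm per foot

print(f"r = {radius_cm:.0f} cm = {radius_m:.2f} m = {radius_ft:.1f} ft")
```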

So, we have a rough idea of what an idealized haystack looks like. It is 3 meters tall and it is shaped like a hemisphere. But how accurate is our measurement? Can we do better? Has anyone studied haystack modeling? Indeed, someone has and his name is W.H. Hosterman.

Shape of a genetic haystack: using haystack modeling

Outline drawings of square hay stacks of different shapes

In 1931, W.H. Hosterman from the U.S. Department of Agriculture published an extensive technical bulletin “for the purpose of determining the volume and tonnage of hay.” Hosterman and his USDA colleagues measured over 2,600 haystacks across 10 states and presented results for both square and round haystacks. For round haystacks, Hosterman derived this formula: V = ((0.04 * Over) - (0.012 * C)) * C^2, where C is the circumference of the haystack, and Over is the distance measured over the top of the stack, from the ground on one side to the ground on the other. We can compare results with the hemisphere by selecting equivalent values from Table 7 in Hosterman’s paper. (We will momentarily switch to imperial units to make use of the constants in the formula.)

V (Hosterman) = ((0.04 * 32 ft) - (0.012 * 62 ft)) * (62 ft)^2 = 2,060 cubic feet or 58,333 liters

Hosterman’s haystacks are slightly taller than they are wide, but their sloped tops make the volume of the “true” haystack (58,333 liters) a little less than our original estimate of a 3-meter-tall, 60,000-liter haystack. When we compare the volume using Hosterman’s formula to the volume of the hemisphere, we see that they differ by 1,667 liters, or about 3%. Close enough!
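Here is a small Python sketch of the Hosterman comparison, using the formula as quoted in this post and the same 32 ft and 62 ft inputs. The function name is my own, and the liters-per-cubic-foot constant is the standard conversion factor. Because the sketch carries the unrounded cubic-foot figure through the conversion, its liter value lands about 10 liters above the one in the text; the difference from the hemisphere still comes out at roughly 3%.

```python
CUBIC_FOOT_IN_LITERS = 28.316846592  # standard conversion factor


def hosterman_round_stack_volume(over_ft, circumference_ft):
    """Round-stack volume in cubic feet, per the formula quoted in this post."""
    return (0.04 * over_ft - 0.012 * circumference_ft) * circumference_ft ** 2


volume_ft3 = hosterman_round_stack_volume(over_ft=32, circumference_ft=62)
volume_liters = volume_ft3 * CUBIC_FOOT_IN_LITERS

# Compare against the 60,000-liter hemisphere estimate.
hemisphere_liters = 60_000
difference_liters = hemisphere_liters - volume_liters
percent_difference = 100 * difference_liters / hemisphere_liters

print(f"Hosterman: {volume_ft3:,.0f} cubic feet, {volume_liters:,.0f} liters")
print(f"Difference from hemisphere: {difference_liters:,.0f} liters ({percent_difference:.1f}%)")
```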

Genetics is like finding a needle in a haystack

We have two comparisons that match closely. So, what does that haystack look like?

Romanian haystacks are typically 3 to 4 meters tall.

It turns out that Romanian haystacks are about 3 meters tall, so finding a misspelling in a gene (a genetic variant) is indeed like finding a needle in a haystack.

Who looks for needles in haystacks?

If genetics truly is like finding a needle in a haystack, what kinds of people do this? Well, let’s go back to Francis Collins.

In 1993, Collins published a paper describing the genetics of cystic fibrosis, a disease that required finding “a needle in a haystack.” Today, we have found needles from thousands of genetic haystacks. Some of these needles lead to clues about rare diseases, which affect more than 400 million people worldwide.

In real life, you can find a needle in a haystack, too. In 2014, performance artist Sven Sachsalber found a needle in a haystack after hunting for it for 24 hours. If we could get computers to solve diseases at that speed, we would happily be out of work.

In 2014, performance artist Sven Sachsalber found a needle in a haystack at the Palais de Tokyo in Paris.

OK, back to finding more needles…

#JoinAllofUs

Today I joined All of Us, a research community of one million people to lead the way for individualized prevention, treatment, and care for, well, all of us. This project was previously known as the Precision Medicine Initiative.

Many of you know that our family has used whole genome sequencing to look for clues in our daughter’s autism. This blog shares that journey. I have also published peer-reviewed papers to explore the reasons why people share personal health information. Through this research, I am convinced that information sharing will contribute to a learning healthcare system to improve care and lower costs.

It just takes people like you and me to #JoinAllofUs and lead by example.

Big data: From medical imaging to genomics

KT & Kimberly Pickard

In 2006, a Scientific American article written by George Church, “Genomics for All,” rekindled my interest in genomics. I went back to school in 2009 to contemplate the business of genomic medicine, and celebrated my MBA by writing a Wikipedia entry for the word, “Exome.” I was hooked.

We started our odyssey by genotyping our family using 23andMe, and later my wife and I had our whole genomes sequenced. Realizing that genomics was starting to yield clinically useful information, we crowdsourced the sequencing of our kids’ genomes to look for genetic clues in their autism. We found interesting results, gave talks and wrote papers.


Along the way, I realized that medical imaging and genomics are highly complementary: genomics informs or identifies conditions, and radiology localizes them. Sarah-Jane Dawson pointed this out at a Future of Genomic Medicine conference in 2014.

DIY genomics, autism, and coffee on Mendelspod

I have been a long-time listener to the intelligent and informative podcasts on Mendelspod, a site that connects people and ideas in life sciences. (Most nights you can find me listening to Mendelspod while I do the dishes.) I tuned in sometime in 2012 and created a mental map of the industry by listening to every podcast I could find. A steady diet of listening to the latest developments in the industry has allowed me to talk about genomics with ease at meetups, tweetups and conferences. (OK, going back to school helped, too.) Somewhere along the way I decided that I would do something worthy of being interviewed on the show.

Well, last week I got my wish when my interview was posted on Mendelspod. I talked about our crowdfunded family trio sequencing project, autism, and even “coming out” of the research closet after being invited to speak at a conference in China last year. We explored parallels between my career in medical imaging and the future of genomic medicine (more in this blog post).

We concluded the interview by talking about Genomics Coffee, a discussion group that meets on the second and fourth Thursdays in San Francisco. Check it out!

Many thanks to Theral Timpson and Ayanna Monteverdi, co-producers of Mendelspod, for their great show.

DIY Genomics at MindEx 2015

I recently presented results from our DIY genomics project at MindEx 2015 held at Harvard’s very Hogwarts-looking Sanders Theatre.

Hosted by the Mind First Foundation, this conference enabled participants in the Personal Genome Project to hear first-hand how their health data could be used in research, especially mental health research. The second day of the conference, the “PGPalooza,” let PGP participants directly interact with researchers to select projects of interest and have their questions answered immediately.

James Tao graciously edited this 25-minute video of my talk about family trio sequencing and autism:

Also, special thanks to Alex Hoekstra, co-founder of Mind First, for the invitation to this event.

Additional resources: Video Slides

Why I uploaded my WGS data to DNAnexus

In this blog post, I look at whole genome sequence platforms for storage and discuss what might happen to “genomical” amounts of data.

Background

When I uploaded my whole genome sequence in September 2014 (about 10 months ago), few options existed for sharing personal genomic data. The usual suspects (Dropbox, Evernote and Figshare) were prohibitively expensive for large amounts of data. I knew about DNAnexus, but I saw it as a platform for researchers, not consumers. Well, times have changed. Fast.

A Battle of Platforms?

In addition to my original “roll your own” approach, DNAnexus and Google Genomics have emerged as major players for end-to-end genomics workflow. In the table below, you can see that storage costs for AWS S3, DNAnexus and Google Genomics are roughly the same. Everyone provides free uploads (we want your data!), but the cost for transferring data out of the system varies. Google Genomics does not charge for this, but instead charges for API access. For my current AWS storage, I pay about $4 per month to store my genome.

Table 1. Comparison of AWS, DNAnexus and Google Genomics storage costs. Your mileage may vary. Accessed July 7, 2015.

Ultimately, I selected DNAnexus over Google Genomics because their workflow API is well-developed and appealed to my roll-up-your-sleeves sensibility. (If you’re comfortable with command-line work, this platform is for you. BaseSpace, GenoSpace and Galaxy are other platforms to consider.) Google Ventures backed DNAnexus in 2011, so it’s difficult to predict what will happen in the long run. What we do know is that the value of their respective platforms will increase as more people join them (and add data). Google Genomics has partnerships with DNAstack, Autism Speaks and even DNAnexus. DNAnexus has partnerships with Baylor College of Medicine, WuXi NextCODE, and the ENCODE Project. The battle begins. If these two platforms can maintain standards-based interoperability, the competition is good for everyone.

Astronomical becomes Genomical: A Perspective on Storage

In this recent article about big data and genomics, the authors compare the field of genomics with three other Big Data applications: astronomy, YouTube and Twitter. In common with genomics, these domains: 1) generate large amounts of data, and 2) share similar data life cycles. The authors examine four areas–acquisition, storage, distribution, analysis–and conclude that genomics is “on par with or the most demanding” of these disciplines/applications. My previous experience in medical imaging (a field that arguably tackled the prior generation of “big data” issues) leads me to believe that genomics will come to epitomize Big Data to many more people before long.

Growth of DNA sequencing. Source: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

If you look carefully at the projections in the figure above, we may run out of genomes to sequence (really?), which brings us back to storage. Where will we store all of this sequence data, especially as genomic medicine continues its inexorable move to the clinic?


Delete Nothing and Carry on

If the field of medical imaging is an indicator, deleting anything after it has been archived is the exception rather than the rule. The main reason for this is medicolegal — hospitals avoid the liability of not being able to recall an exam later by keeping everything. Although the incidence of requiring access to images after diagnosis is low, the consequence of not having access to the original diagnostic image is high. A former colleague suggested that about 5% of their medical archive customers use lifecycle management features to delete imaging exams. In medical imaging, customers more commonly use lifecycle management features to migrate images to less expensive storage devices over time. So, in genomics, you might migrate your sequence data stored on Amazon from solid state storage (most expensive) to S3 to Glacier (least expensive). But my best guess: we’ll delete nothing and carry on.

Storage is one aspect of genome informatics that is undergoing rapid change. You can learn more at upcoming events like the HL7 2015 Genomics Policy Conference and CSHL’s 2015 Genome Informatics Conference in October.

Stay tuned!

Update: I moved my WGS data from DNAnexus to AWS.

Finding Genetic Clues in Autism with Family Trio Sequencing

Yesterday, I presented preliminary findings at the 2015 Clinical Genome Conference in San Francisco from our family trio sequencing project. In this crowdsourced project on experiment.com, I looked for genetic clues to autism in our adult-aged daughter. While the talk focused on the “DIY” aspects of how to accomplish WGS sequencing (see slides), this post focuses on genetic findings.

Overview

The project began with a crowdsourced effort to raise $1,750 to sequence our daughter’s genome, and took slightly more than two months to complete. After working with AllSeq and HudsonAlpha to obtain WGS data, we used VarSeq from Golden Helix to search for unique variants, as well as browse whole genome sequence data. After filtering our variant call data to focus on high quality exome variants, we examined 52 potentially damaging de novo and compound heterozygous changes suggested by VarSeq’s family trio analysis. Although this first approach did not yield clues specific to autism, it did suggest a number of secondary findings that are not addressed here. The second approach was to start with genes having known associations with autism and then look for them in our daughter’s DNA. Several curated databases have between 200 and 700 genes, but again, none produced meaningful results. The third method was to look at known “hot spots” in autism genetics, such as variants in the NRXN1 gene, as well as known structural variation on chromosome 16. Changes to NRXN1 and so-called “16p” changes are discussed below.
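To give a feel for what a de novo check in family trio analysis means, here is a toy Python sketch. It is not VarSeq’s algorithm, which I have not seen; it only illustrates the basic idea that a de novo variant is an allele present in the child but in neither parent. The site names and genotypes are made up for illustration, and a real pipeline would also weigh read depth, genotype quality and sequencing error before calling anything de novo.

```python
def is_de_novo(child, mother, father):
    """Return True if the child carries an allele seen in neither parent."""
    parental_alleles = set(mother) | set(father)
    return any(allele not in parental_alleles for allele in child)


# Hypothetical genotypes per site: (child, mother, father), each a pair of alleles.
trio_genotypes = {
    "chr1:1000": (("A", "G"), ("A", "A"), ("A", "A")),  # G appears only in the child
    "chr2:2000": (("C", "T"), ("C", "T"), ("C", "C")),  # T is inherited from the mother
    "chr3:3000": (("G", "G"), ("G", "G"), ("G", "G")),  # no variation at this site
}

de_novo_sites = [
    site
    for site, (child, mother, father) in trio_genotypes.items()
    if is_de_novo(child, mother, father)
]

print(de_novo_sites)
```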

Findings 

  • NRXN1 – Deletions in NRXN1 are associated with a wide spectrum of developmental disorders, including autism. Our daughter has a 10bp exonic deletion (-GT repeat) followed by what appears to be a 9bp compound heterozygous deletion in NRXN1. Both deletions are partially present in both parents and overlap; the deletions appear to have been cumulatively inherited. Due to the high number of sequence repeats, copy number variation (CNV) analysis should clarify the significance of this finding.
  • 16p deletions – Deletions and duplications in this 593-kilobase section of chromosome 16 are widely associated with developmental issues, including autism. Our daughter appears to have dozens of deletions in this region, some inherited and some not. However, since the variants in our daughter’s DNA were called using a different software pipeline, it is difficult to draw meaningful conclusions (see “Limitations,” below). For example, some variants in our daughter’s DNA were shown to map to multiple places on the genome, suggesting either large copy number variation or genomic regions that were difficult to map. Copy number variation (CNV) analysis will also elucidate this region. Once reprocessed, this region has the potential to provide genetic clues to our daughter’s condition.

Limitations

My wife and I received our WGS data in March 2014. Our samples were sequenced at 30x coverage using Illumina’s HiSeq platform and then aligned and called with Illumina’s pipeline, Isaac. Our daughter’s DNA was sequenced in May 2015 at 30x coverage, but on Illumina’s newest platform, the Illumina HiSeq X Ten. The difference is that our daughter’s DNA was aligned using BWA, followed by variant calling with GATK “best practice” workflow. To accurately compare genomes in family trio analysis, all samples must be processed using the same software pipeline. Otherwise, variants may be aligned and called differently. My wife and I must go back to the (almost) original FASTQ data and start over. Although Illumina did not provide these files with our results, Mike Lin from DNAnexus explains how to extract FASTQ files from Illumina data in this great blog series. Hint: it involves a utility called Picard (no relation). Until we reprocess our WGS data using the same bioinformatics pipeline, all results should be considered preliminary.

Conclusion

This project demonstrated that personal genomics is very real, and has the potential to answer complex medical questions. Today, answering those questions using whole genome data and family trio analysis requires a combination of genetic, bioinformatic and domain knowledge to reach meaningful conclusions. Validating those conclusions remains challenging, from rare diseases to complex conditions such as autism. Currently, personal genomics has a similar feel to “homebrew” computer clubs from the late ’70s–the community is still very small, collegial, and willing to share “tips and tricks” to advance the field.

Although we encountered some “dark alleys” during the analysis, our preliminary results suggest that family trio sequencing can indeed provide genetic clues to autism. We will continue to refine the analysis by reprocessing the data with the same pipeline, which should resolve questions in the 16p region, as well as the potential deletion in NRXN1. Further, CNV analysis should answer structural variation questions that are also known to be associated with autism spectrum conditions.

Acknowledgements

I would like to thank our backers and the team at experiment.com, as well as Gabe Rudy from Golden Helix. Gabe was very generous with his time, knowledge and insight. Finally, I would like to thank my wife, Kimberly, for her patience and fortitude. 

Additional resources: Slides

Searching for Genetic Clues in Autism with Family Trio Sequencing

This entry was cross-posted from DNAdigest on April 22, 2015.

Amazingly, the cost of whole genome sequencing is now 100,000 times less expensive than it was a dozen years ago. If the Tesla Model S followed this trajectory, you could buy one today for less than $1 USD. This super logarithmic decline puts genomics on par with desktop publishing or 3D printing—it has become something that you can affordably do yourself.
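To put that decline in perspective, here is a back-of-the-envelope Python sketch. The 100,000-fold drop over a dozen years is the figure quoted above; the $70,000 car price is my own round-number assumption for illustration, not an official list price.

```python
import math

fold_decrease = 100_000  # cost drop quoted above
years = 12               # "a dozen years"

# How many times did the cost halve, and how often on average?
halvings = math.log2(fold_decrease)
months_per_halving = 12 * years / halvings

# Assumed sticker price for the car comparison (illustrative only).
assumed_car_price_usd = 70_000
scaled_price_usd = assumed_car_price_usd / fold_decrease

print(f"{halvings:.1f} halvings, one every {months_per_halving:.1f} months")
print(f"A ${assumed_car_price_usd:,} car would cost ${scaled_price_usd:.2f}")
```

Under these assumptions the cost halved roughly every nine months, and the car comes in under a dollar.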

My wife, Kimberly, and I were excited about the prospect of having our genomes sequenced. Our daughter has autism, and like many parents of special needs children, we were eager to explore the underlying causes of her condition. We “got genomed” last year by enrolling in Illumina’s Understand Your Genome program. We received our whole genome sequencing (WGS) data, as well as limited predisposition and carrier screening for a number of Mendelian traits. As many DNAdigest readers know, the cost of WGS continues to drop, almost to the $1,000 genome that Illumina announced last year. Kimberly and I were intrigued to learn that we were both carriers of some rare genetic variants. Could our genetic idiosyncrasies be contributing to our daughter’s autism?

After being sequenced, I followed the lead of DNAdigest contributor Manuel Corpas and posted my whole genome sequence online. I decided to publish my genome without restrictions in an attempt to lead by example. In the future, platforms like Repositive will make it easier for consumers to share genomic information and maintain privacy.

Kimberly and I recently launched a project on experiment.com to crowdfund the whole genome sequencing of our adult-aged daughter. In this project, we will look for genetic clues to her autism using family trio sequencing. Family trio sequencing is a powerful technique that can explain genetic conditions by looking at differences in DNA between Mom, Dad and an affected child.

We were thrilled when the sequencing project was funded the first day. In the process, we received feedback from other parents who wanted to learn more about the technique, so we added a stretch goal to cover publishing costs in an open access journal. The research paper will document our findings, as well as explain how family trio sequencing can be used to search for answers to health conditions and rare diseases.

Information sharing can indeed be very personal, but we find the possibility of catalyzing new areas of health research compelling. With this project, we hope to find clues that will contribute, if only in a small way, to a growing body of genomics research that supports a broader explanation of autism.

Exploring Markets of Data for Personal Health Information

Consumers are willing to share health information with financial reward

Are some consumers willing to sell their personal health information? It looks like the answer is “yes.” This week, I presented a paper at the IEEE International Conference on Data Mining in Shenzhen, China. This paper summarized the results of an online survey about consumers’ willingness to share de-identified health information, and whether their attitudes would change if a financial reward was offered. Here’s the abstract:

To realize preventive and personalized medicine, large numbers of consumers must pool health information to create datasets that can be analyzed for wellness and disease trends. To date, consumers have been reluctant to share personal health information for a variety of reasons. To explore how financial rewards may influence data sharing, the concept of Markets of Data (MoDAT) is applied to health information. Results from a global online survey show that a previously uncovered group of consumers exists who are willing to sell their de-identified personal health information. Incorporating this information into existing health research databases has the potential to improve healthcare worldwide.

During the presentation, I argued that patient populations for both rare and common diseases can look similar, especially when looking at disease subtypes. When considering relatively common diseases such as diabetes, schizophrenia, and autism spectrum disorders, a single hospital in the U.S. will not see enough patients for a given disease subtype to make meaningful conclusions. On average, U.S.-based hospitals do not have enough patients to solve disease questions without sharing health information.

For this survey, a global panel of 400 participants was selected at random by AYTM, an online market research tool. Questions were based on a previous health information sharing survey, with additional questions about sharing with financial reward. I received 400 responses from 59 countries in less than two hours. U.S.-based respondents overwhelmingly believed that their health information was worth more than $1000, but the global average was around $250 when the U.S. was excluded. For these participants, both their motivation and the amount of data shared increased with financial reward. Keep in mind that these participants were paid to respond to the survey, so they represent a kind of self-selected group. Nevertheless, monetizing health information sharing produced a surprising result, demonstrating that an alternative source of health information may exist for research purposes.

Additional resources: Paper, Supplemental files, Slides

I uploaded my whole genome sequence data to the cloud

I got genomed by Illumina

In March 2014, my wife and I “got genomed” by enrolling in Illumina’s (now Genome Medical’s) Understand Your Genome (UYG) program. UYG requires participants to order this whole genome sequence (WGS) test from their physicians due to uncertainties surrounding the delivery of genomic results in the U.S. Illumina is careful to point out that the service “…has not been cleared or approved by the U.S. Food and Drug Administration” and “you will not receive medical results, or a diagnosis, or a recommendation for treatment.” Our family physician signed the request in November 2013, and we received our results in February. Fortunately, no surprises, but the UYG program only covers these Mendelian disorders for now. We flew to San Diego a few weeks later to listen to talks by genomic researchers and discuss our results with genetic counselors. As part of this one-day seminar, we each received an iPad Mini that was pre-loaded with our results, as well as a portable hard drive that contained our raw sequence data.

I received my WGS data on this encrypted hard drive (about 100GB).

After we arrived home, the next step was to find a public “home” for my sequence data (to share without restrictions). What I learned is that uploading your genome anywhere is a challenge, mostly because the dataset is so big.

I looked at Dropbox, Evernote and Figshare, but their storage models do not scale well for genomic data. I tried Sage Bionetworks, but the BAM file was too large to upload. I settled on Amazon Web Services (AWS) and created an anonymous FTP server using the Amazon Elastic Compute Cloud (EC2).

About my whole genome sequence data

My genome data and results are now in the public domain, freely available to download under a Creative Commons (CC0) license. Uploading the data took two days over a 3Mbps connection, so you may want to read the clinical report and sample report instead.

  • BAM file checksum: 2529521235 (78.1GB uncompressed)
  • VCF file checksum: 4165261022 (2.4GB gzip compressed)
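If you are wondering why the upload took days, a rough Python estimate makes it plain. The file sizes and the 3Mbps link speed are the ones listed above; the sketch assumes decimal gigabytes and a connection that stays fully saturated with no protocol overhead, so treat the result as a lower bound.

```python
bam_gb = 78.1   # BAM file size listed above
vcf_gb = 2.4    # VCF file size listed above
link_mbps = 3   # upload speed in megabits per second

# Assumes 1 GB = 10^9 bytes and 8 bits per byte.
total_bits = (bam_gb + vcf_gb) * 1e9 * 8
seconds = total_bits / (link_mbps * 1e6)

hours = seconds / 3600
days = hours / 24

print(f"Estimated transfer time: {hours:.0f} hours ({days:.1f} days)")
```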

Questions about FTP? See this FAQ.

Now that I have my genome in the cloud, I’ll start playing with analysis tools like STORMSeq. Stay tuned!