Citizen science: One family’s search for answers in their genes

This entry was cross-posted from Terra on April 28, 2021.

In April, we celebrate Citizen Science Month, World Autism Day, and National DNA Day. In this guest blog post, all three events come together as KT Pickard, father of a young woman with autism, shares his family’s story of personal genomics and citizen science. 


This past Sunday was National DNA Day, which commemorates the discovery of DNA’s double helix in 1953 and the publication of the first draft of the human genome in 2003. Events on National DNA Day celebrate the latest genomic research and explore how those advances might impact our lives. Last year, I wrote a playful article for DNA Day that investigated whether genetics is truly like finding a needle in a haystack. This year, our family is honored to share our story and ideas with you.

Our family’s DNA odyssey

My wife and I have a young adult-aged daughter who is on the autism spectrum. We first discovered that our daughter had autism when she was eight years old. As we struggled to understand autism and what it meant for our family, we learned that autism is uniquely expressed: Meeting one person with autism means that you have met one person with autism. 

Long fascinated with genomics, my wife and I wondered how our DNA may have contributed to her condition, and we set out to learn all that we could. It was the beginnings of this diagnostic odyssey that gave expression to my second career as a citizen scientist. My professional background in supercomputing, software engineering, and medical imaging were a good start to apply scientific principles and gain insights.

We began our journey by talking with our family doctor, then my wife and I had our whole genomes sequenced through the Understand Your Genome project. Later, we crowdsourced the sequencing of her genome and began looking for genetic clues. By applying trio analysis to our family data, we discovered some preliminary findings: Our daughter has deletions in the NRXN1 gene and in a large region of chromosome 16, which have been found to be widely associated with developmental issues including autism. It looks like my wife and I have each contributed some variant alleles, but we are being careful about interpreting these findings because our WGS data and our daughter’s were processed through different pipelines, which could lead to inconsistent results.

Trio analysis of the NRXN1 locus shows a compound heterozygous deletion, with each parent possibly contributing one allele (visualization by VarSeq from Golden Helix). 

To continue our journey, I want to reprocess our family’s WGS data with the latest GATK Best Practices, in the hope that this will give us a consistent baseline. I came across Terra through the book Genomics in the Cloud, which I picked up to help me learn more about GATK. I led an online book club in early 2021 based on the book, and subsequently moved our WGS data to the Terra platform. Now I am using the GATK Whole Genome Analysis Pipeline in Terra to reprocess our data. Working with Terra has been challenging, but highly satisfying because it provides access to industry standard genomics tools.

From personal genomics to citizen science

My family’s main goal with this project is to make meaningful discoveries about the genetic basis of our daughter’s autism. In 2015, genetics could explain the heritability of autism spectrum disorder in approximately 1 in 5 cases. Amazingly, that number has increased to 4 in 5 cases today. 

Our daughter (who drew this image) is on the left. At the time, she represented the 1 in 5 people whose autism could be explained by genetics.

Yet there is more to be gained. Although whole genome sequencing may not provide directly actionable results for autism itself, WGS can make a huge difference for parents who discover a comorbid, but treatable condition. By sharing our data and our findings with others, we can accelerate medical knowledge. 

A growing number of projects offer opportunities for non-scientists to contribute in various forms to the advancement of biomedical research. In U.S. healthcare, one of the largest citizen science projects—All of Us—seeks one million people to share their unique health data to speed up medical research. By creating a national resource that reflects and supports the broad diversity of the U.S., the goal of All of Us is to advance precision medicine for all. 

We have enrolled in the All of Us project and are looking forward to doing our part. I find it inspiring that this is something we can all contribute to, as citizens, even those of us who are not researchers. 

Looking to the future

At its core, citizen science is a collaboration between scientists and those who are curious and motivated to contribute to scientific knowledge. As our family’s odyssey unfolds, I like to reflect about what I see out here on the bleeding edge of research, and how it could be applied to improve outcomes for patients in the real-world. 

In community practice, many medical providers have limited knowledge of autism. Due to a lack of effective data sharing and awareness, an undiagnosed person with autism who walks through the door of a hospital may appear like a rare disease patient. A clinician evaluating them would miss out on a huge amount of valuable context. How could we improve the system so that clinicians could more effectively recognize the underlying context of that person’s condition? We can address some of these issues with machine learning, but that requires pooling together huge amounts of data, and much of that data is difficult to access.

As a citizen scientist, I see an enormous opportunity to combine research data with real-world data and evidence across healthcare delivery organizations. Common ontologies and interoperability standards are making it increasingly easy to pool de-identified datasets to test hypotheses on synthetic data—realistic-but-not-real data—to gain insights. A recent “call to action” encourages citizen scientists to evaluate the utility of this method precisely because data can be shared without disclosing the identities of anyone involved. Done ethically and responsibly, this synthetic DNA approach has the potential to accelerate autism research and deliver new benefits to patients.

This is the perspective I have gained from my journey so far. By asking questions and continuing to discover more about what our genomes contain, I have been fortunate to learn much about scientific principles, bioinformatics, and a bit about the genetic basis of autism. Although it is at times a challenging road, I have found that the path of personal genomics and citizen science is a satisfying way to find answers to the questions that my family faces. I hope this story will inspire others to explore, and perhaps let researchers and clinicians see patients and their families as potential collaborators in the quest to understand complex conditions like autism.

Genomics in the Cloud Book Club!

Genomics in the Cloud by Geraldine A. Van der Auwera and Brian D. O’Connor

[Update: 2021-01-10: Thank you for your interest in our book club. We are currently closed to new members, but you can watch and subscribe to our meetings on the Genomics in the Cloud Book Club channel on YouTube.]

Introducing the Genomics in the Cloud Book Club, an online discussion group. Our 30+ members across 10 time zones are covering one chapter each week, and we expect to complete the book in March 2021.

For a chapter synopsis, please read this Twitter thread from one of the authors, Geraldine A. Van der Auwera.

Taking a page from the R for Data Science Online Learning Community, we created a Slack account for discussions and a Zoom account for meetings. Last week, we had lively online conversations about reference genome diversity, workflow language selection, personal whole genome sequencing, reproducibility tips, and more.

After each meeting, we post the video to this GITC Book Club channel on YouTube so you can follow us anytime. Our member’s tweets are also available here. Thank you for tuning in!

Additional resources: Slides

Picard reruns: Creating FASTQ files from a BAM file

In this post, I explain how I created FASTQ files from a BAM file using a utility called Picard (no relation, although I pronounce my name the same way).

Background

In 2014, my wife and I “got genomed” through Illumina’s Understand Your Genome (UYG) program, now managed by Genome Medical. Subsequently, I crowdsourced the sequencing of our kids’ genomes and presented family trio findings about our adult daughter’s autism in 2015.

One of the limitations of the family trio work was that the bioinformatics pipelines were different between our samples and our kids’ samples. To fix this limitation, I had to “reconstitute” the original FASTQ files from the BAM file provided by Illumina and then re-run all our data through the same pipeline. (Note: To my knowledge, UYG no longer provides BAM files as part of this program.)

Fortunately, bioinformatics wizard Mike Lin was also in my UYG class and wrote a blog series explaining how to extract FASTQ files from a BAM file. (Thank you, Mike!)

Using AWS to run samtools and Picard

You can create FASTQ files from your BAM file by using Picard, a set of Java-based command line tools for manipulating high-throughput sequencing (HTS) data in formats such as SAM/BAM/CRAM and VCF.

Running Picard

For reasons that escape me now, I first ran Picard using an AWS t1.micro instance.

Facepalm: I attempted to run Picard using an AWS t1.micro instance. Source: Paramount

After 3 attempts–watching Picard fail after running for 3 days each time–and creating thousands of temp files in the process, I learned the hard way that Picard requires more than 613 MBytes of memory. This time, I used a c4.2xlarge instance (4 cores, 16 GBytes of memory), which worked. It appears that 16 GBytes is about the minimum amount of memory to get the job done.

Step 1. Is your BAM file sorted?

Before creating FASTQ files, make sure your BAM file is sorted so that your genome coordinates are in order. One of the ways to do this is with samtools, a suite of programs for interacting with HTS data. Here are the commands I used to install it. You can check whether or not your BAM file is sorted by using this command:

samtools stats YourFile.bam | grep "is sorted:"
# "is sorted: 1" = Yes, your BAM file is sorted.
# "is sorted: 0" = No, your BAM file is not sorted.

If your BAM file requires sorting, use this command (or something close to it):

# Type "samtools sort --help" for a description of this command
samtools sort -n -@ 2 -m 2560M InputFile.bam -o ./OutputFile.sorted.bam

# Check for existence of Read Groups (@RG)
samtools view -H InputFile.bam | grep '^@RG'

Step 2. Run Picard

Get Java and the picard.jar file. Run this command, but keep in mind that the options below are for a BAM file created on an Illumina HiSeq sequencer:

java -jar ~/picard.jar SamToFastq INPUT=InputFile.bam RE_REVERSE=true INCLUDE_NON_PF_READS=true OUTPUT_PER_RG=true OUTPUT_DIR=OutputDirectoryName

Alternatively, you can use GATK4 (version 4.0 and greater) to accomplish the same task:

gatk SamToFastq --INPUT=InputFile.bam --RE_REVERSE=true --INCLUDE_NON_PF_READS=true --OUTPUT_PER_RG=true --OUTPUT_DIR=OutputDirectoryName

Using the c4.2xlarge instance, I ran Picard in 3 hours to create the FASTQ files shown below. In addition, creating compressed (gzip) versions of the files required another 8.5 hours of compute time. With an on-demand price of about $0.40 per hour, creating compressed FASTQ files cost approximately $4.60 USD on AWS.

Next…the pipeline!

Source: strangeuniverse1

My WGS data is now available via Amazon S3

In 2014, I uploaded my WGS data to the cloud and made it publicly available. In a previous post, I explained why I moved my WGS data from DNAnexus to Amazon. In this post, I explain the final step: attaching the S3 bucket to a web server. The goal was to replace the ftp server with a web server and make it easier to download my whole genome sequence data.

TL;DR: My genome is now available at http://genome.startcodon.org

Background

I launched my first cloud server literally while in the clouds in May 2014. Cloud computing has changed so much, it’s unbelievable. Back then, I had to patch the Linux kernel by hand so that the ftp server would work on AWS. Today, uploading your genome using Amazon’s command line interface (CLI) to an AWS S3 storage bucket is relatively easy. Understandably, Amazon makes it challenging (but doable) to make your storage publicly available. I used the Apache Web Server and s3fs to share this information.

My first cloud server

Step 1. Install Apache

Depending on your flavor of Linux, your commands may vary. I am using Ubuntu 18.04 LTS running on a t2.micro EC2 server. Here are the commands I used to install the Apache HTTP Server.

Step 2. Install s3fs

s3fs allows allows you to mount an S3 bucket via FUSE. s3fs preserves the native object format for files, allowing use of other tools like AWS CLI. Again, your commands may vary depending on your flavor of Linux. Here are the commands I used to install s3fs.

About my whole genome sequence data

My genome data and results are now in the public domain, freely available to download under a Creative Commons (CC0) license with a HIPAA waiver. I have not converted my BAM files to CRAM yet, so you may want to read the clinical report and sample report to save bandwidth.

Download information

Note: I decommissioned the ftp server after 6 years of faithful service.

A tale of cancer and genetics: part 4 of 4

Summary: My wife had breast cancer. These posts describe: 1) finding out, 2) genetic testing, 3) radiation therapy, and 4) an incidental finding in the APC gene.

Incidental finding in the APC gene

Great news! Six months have passed since Kimberly finished radiation therapy for breast cancer. Today, she had a follow-up diagnostic mammogram that confirmed she is cancer-free! She will continue to be monitored over the next 5 years, but our big worries are behind us. Incidentally, we learned about a useful website during our journey, cancersurvivalrates.com that gave us a much better picture of survival rates.

Hereditary cancer screening

Let’s finish by returning to the variant in the APC gene that we found during expanded genetic testing and wrap-up this series.

During genetic testing, our genetic counselor ordered an additional gene panel to screen for other cancers due to Kimberly’s family history. As I mentioned earlier, our insurance company denied all of our genetic testing claims, saying that the expanded panel was not related to her breast cancer. Nevertheless, the information that we received was worth the $250 out-of-pocket expense. Given the lack of reimbursement, reasonable costs for clinical genetic testing will ultimately drive most of it to be physician-ordered but privately paid. Just be sure to get your data!

So, what did we learn?

As we know from autosomal dominant inheritance, a person affected by an autosomal dominant disorder has a 50 percent chance of passing the mutated gene to each child. And sure enough, we saw the APC gene variant in 1 of our 2 adult-aged children; the other child does not carry it. We know this because we have whole genome sequences for everyone in our family. Here’s what Kimberly’s genetic code looks like at this location:

APC variant T>A (rs1801155). Above: 30x WGS data visualized with IGV. More here: https://go.usa.gov/xGZmh

It turns out that this variant increases the risk of colorectal cancer from 5% (found in the general population) to 10% (in the population with this variant). So, the child with the variant should have a colonoscopy at age 40 (earlier than usual) and follow-up colonoscopies every 5 years after that. If you have a APC gene variant, talk to a genetic counselor–and show them some love! Note: This blog is not intended to replace advice from a medical professional.

Before publishing this story, we had a family meeting to discuss Mom’s cancer-free diagnosis, as well as the APC variant that one of them carries. All of us agreed to share this information with hopes that it will assist others.

Along the way, we learned that knowledge gave us the strength to move forward. I also have newfound appreciation for my wife, whose bravery knows no bounds.

/end

A tale of cancer and genetics: part 3 of 4

Summary: My wife had breast cancer. These posts describe: 1) finding out, 2) genetic testing, 3) radiation therapy, and 4) an incidental finding in the APC gene.

Radiation therapy

Kimberly’s radiation therapy tech Hannah standing in front of a Varian linear accelerator.

One month after surgery, Kimberly began radiation therapy, which is designed to reduce the recurrence of breast cancer after surgery by more than half. We met with a radiation oncologist and developed a 15-visit treatment plan. The cost of Kimberly’s radiation therapy was about $25,000, and fortunately our health insurance covered about 90%.

Radiation therapy and genetics have a curious relationship. The basic idea behind radiotherapy is to induce double-strand breaks in DNA with ionizing radiation. Although radiation damages both normal cells and cancer cells, most normal cells repair themselves, while cancer cells do not. Therapy is given in daily doses to allow the DNA in healthy cells to recover between visits.

External beam radiotherapy based on linear accelerators has been available since the early 1950s, and machines like the Varian Clinac above deliver a shaped beam of high-energy x-rays to a precisely targeted area. In Kimberly’s case, a surgeon had removed her tumor 1 month prior, so the target area was the breast where the surgery occurred–just in case a single errant cancer cell had wandered from the surgical site.

We made daily visits for several weeks and Kimberly tolerated the procedure well. On her right side she had what looked like a sunburn, a common side effect, that faded over the next month. We continued to have follow-up visits with both her medical and radiation oncologists.

A few days after finishing radiation therapy, we visited the Varian production plant in Palo Alto, California. It was fascinating to see the construction of these behemoth machines and learn more about their operation. (My favorite part was learning that the electron linear accelerator tube is tuned with a ball peen hammer.) As luck would have it, all of this activity occurred just 1 week before the COVID-19 shelter-in-place order hit the San Francisco Bay area in March 2020.

We spent the next 6 months not only sheltering-in-place, but also waiting for her follow-up mammogram to determine if radiation therapy was successful.

Interior view of the Varian Clinac linear accelerator. The cylindrical object on the left is a klystron tube, which was invented by the Varian brothers in 1937. The tube is the first part of a multi-stage process to create high-energy x-rays used in radiotherapy.

/3

A tale of cancer and genetics: part 2 of 4

Summary: My wife had breast cancer. These posts describe: 1) finding out, 2) genetic testing, 3) radiation therapy, and 4) an incidental finding in the APC gene.

An image of left-handed DNA, because our lives were twisted backwards at this point.

Genetic testing

The day after Kimberly received her breast cancer diagnosis, we met with a board-certified genetic nurse, Frank. Before our visit, we completed a form that Frank used to create a family pedigree. The list below is not a pedigree, but it shows the history of cancer in Kimberly’s family. We removed kinship for privacy:

  1. Breast cancer (maternal side: 1, paternal side: 1)
  2. Lung/bone cancer (paternal: 3)
  3. Cervical cancer (maternal: 1)
  4. Colon cancer (paternal: 1)
  5. Liver cancer (paternal: 2)
  6. Kidney/bone cancer (paternal: 1)
  7. Esophageal cancer (paternal: 1)
pedigree
Example hereditary breast and ovarian cancer in a pedigree chart

It turns out that 5-10% of cancer is inherited. People who carry hereditary mutations do not necessarily get cancer, but their lifetime risk is higher than average. Genetic counselors use pedigree charts to visualize family history and evaluate when genetic testing adds diagnostic value. Kimberly’s family history met lab guidelines for further evaluation, so Frank ordered a gene panel from a nearby lab, Invitae. The blood test was ordered stat, and we received our results six business days later. Treatment plans can change based on genetic results, so we were grateful to receive results before her surgery, which was now scheduled.

We returned to Frank’s office and first learned that she does not carry mutations in 9 genes known to influence the risk of breast cancer: ATM, BRCA1, BRCA2, CDH1, CHEK2, PALB2, PTEN, STK11, and ΤΡ53. Phew! Invitae also provides free hereditary cancer testing to breast cancer patients at no additional charge (as long as you order the expanded panel within 90 days of the original test), so Frank ordered the expanded panel. Given that Kimberly has a family history of other hereditary cancers, we welcomed a broader genetic search. The results could be meaningful not only to us, but also to other living relatives. Oddly, our insurance company rejected all of our genetic testing claims because the resubmission was not related to her breast cancer diagnosis. I discussed the situation with Invitae and they were very accommodating–our total out-of-pocket cost was $250. I am still mad at our insurance company, but that’s a rant for another day.

Although she did not have any mutations related to breast cancer, Kimberly’s expanded genetic testing revealed a point mutation in the APC gene, which is known to increase the risk of colorectal cancer. People with this variant are generally counseled to have their first colonoscopy at age 40 (she did that) and follow-up colonoscopies every 5 years (coming up). Since the APC I1307K variant is autosomal dominant, close relatives such as siblings and children have a 1 in 2 chance of inheriting an APC mutation. We called Kimberly’s sister and shared our findings, part of a cascade testing strategy. We also have our kid’s whole genome sequences, which will let us check for APC mutations directly. We will return to that search in part 4 of this series.

We left Frank’s office and developed a treatment plan with Kimberly’s surgeon and medical oncologist a few days later. The plan included surgery (lumpectomy) followed by radiation therapy. Surgery was successful, as you can see in the before and after images below. (Special thanks to the Horos Project for the open source DICOM viewer.)

Before surgery

pre-operative images
Pre-operative images. Left: Diagnostic mammogram. Lesion is visible in the upper right quadrant. Right: Ultrasound with tumor measurements (0.8 cm x 0.6 cm).

After surgery

Post-operative image. CT scan after lumpectomy (3 cm x 3cm).

Kimberly received her diagnosis the day after Thanksgiving. In the 18 days that followed, we had 16 medical appointments that took us from diagnostic mammogram to surgery. With surgery behind us, Christmas was now six days away. We spent a quiet holiday with the kids.

We began 2020 hopeful, knowing that her type 1A tumor had been successfully removed by her surgeon. We were also much more knowledgeable about hereditary cancer risks due to Frank’s counseling.

One month later, Kimberly would begin radiation therapy to dramatically decrease her chance of recurrence.

/2

A tale of cancer and genetics: part 1 of 4

Summary: My wife had breast cancer. These posts describe: 1) finding out, 2) genetic testing, 3) radiation therapy, and 4) an incidental finding in the APC gene.

Leavenworth National Cemetary, November 29, 2019 (photo credit: Hannah Pickard Photography)

Finding out

It was the day after Thanksgiving. My wife Kimberly was talking with a nurse about the results from a biopsy performed 2 days earlier. She hung up her mobile phone and burst into tears. Kimberly received the call while we were exiting the gates of Leavenworth National Cemetary in Kansas, where we had just laid my mother-in-law Barbara to rest with her husband, Gilbert. Our kids were in the back seat and did not really know what was going on, but they guessed that mom had cancer.

The week prior, Kimberly had a diagnostic mammogram, and the radiologist told us in person that Kimberly had a suspicious lesion in her right breast (larger than a peppercorn, smaller than a pea) and recommended a biopsy. Luckily, a biopsy appointment was available the day before Thanksgiving, and we took it even though we were flying to Kansas City the next day. We asked the care coordinator to call us as soon as she had preliminary pathology results, and she did. Our family flew home to the San Francisco Bay area on Sunday.

On Monday, Kimberly and I visited the medical oncology department of a nearby clinic. The nurse said that Kimberly had invasive ductal carcinoma. Surprisingly, the rest of the visit did not turn into that dull surreal buzz that often accompanies bad news and drowns out everything else. In our case, years of being in rooms like this one discussing the needs of our exceptional children proved immensely useful. I took notes and Kimberly asked incisive questions about treatment options, radiation therapy, and genetic counseling. The nurse patched-in our long time family physician over the phone, and his presence was very assuring. It was a brief respite from what would become an overwhelming 3 month journey–the first 2-3 weeks especially so. We learned about a bewildering array of cancer treatment options, visited competing medical facilities, and evaluated new doctors.

We drove home and I read the Wikipedia entry for invasive ductal carcinoma. It was the prognosis section that caught me completely off guard:

Overall, the five-year survival rate of invasive ductal carcinoma was approximately 85% in 2003.

Reference: https://doi.org/10.1186/bcr767

Those odds were not good, and I had multiple panic attacks over the next few weeks at the thought of losing my wife. “Hang in there. Moment by moment,” a friend texted to me. I read that message over and over, hanging on.

/1

Why I moved our WGS data from DNAnexus to Amazon S3

$1,500. That’s the amount of money I have spent over the past 5 years to store our family’s whole genome sequence (WGS) data. For $299 per person in 2020, I could sequence all of us again at 30x coverage, get the same data files, and spend less money. In 2015, I wrote about posting my WGS data to DNAnexus. Last month (July 2020), I moved all of our data to Amazon (AWS) S3 storage. In this post, I explain why.

Five years ago, my impression was that DNAnexus was a platform for researchers, not consumers. It turns out that my first impression was correct–DNAnexus is not a platform for consumers. To their credit, their platform-as-a-service model includes an extensive set of genomic analysis tools, an easy-to-use SDK, top-notch documentation. a way to run your own docker images using Workflow Description Language (WDL), and a professional services team. DNAnexus’ IT infrastructure and regulatory compliance make the platform valuable for over 100 enterprise customers, and their recent $100M funding round coupled with their UK Biobank/AWS announcement will enable the company to expand into new markets and let researchers find more actionable insights.

DNAnexus Platform-as-a-Service

Nevertheless, I recently moved my WGS data to Amazon S3 due to storage costs and a lack of price transparency.

Storage costs

I’ve learned that most of the work that I want to do can be done with VCF files. Yes, there are times when I want to look at BAM files, but moving those files to lower-cost storage makes sense. DNAnexus introduced a Glacier-based archiving service in 2019 to support those operations, although I did not use it. My BAM file is 73 GBytes, which represents about 93% of the 79 GBytes for my WGS data (no FASTQ data). If I deeply archive BAM and FASTQ data (329 GBytes total), I can reduce the amount of higher-cost storage by 98%. The cost comparison for a single genome with FASTQ files looks roughly like this:

  • Storage cost on DNAnexus: (329 GBytes * $0.03 per GB-month [everything]) = $9.87 per month
  • Storage cost on AWS: (7 GBytes * $0.0125 per GB-month [VCF]) + (322 GBytes * $0.00099 per GB-month [everything else]) = $0.41 per month

Overall, I can reduce my monthly storage costs by over 95% by using lower-cost storage tiers on AWS (see Table 1 below). Again, the comparison is apples-to-oranges because I did not use DNAnexus’ archiving service, mostly because it required a separate license to activate. Using Amazon S3, our monthly WGS storage costs will decrease from $24 per month to less than $1 per month.

Table 1. Comparison of AWS and DNAnexus storage pricing (accessed August 23, 2020).

Lack of price transparency

If we compare AWS’ S3 storage price from 5 years ago to DNAnexus’, we find that the storage markup was 35% over the S3 list price. It turns out that Amazon decreased its S3 storage price over the past 5 years, which led DNAnexus to drop their storage price to the current $0.03 per GB-month, still at a 35% markup. (For comparison, on demand GPU- or FPGA-based compute cycles (Amazon EC2) are marked-up over 100%.)

I do not fault DNAnexus for marking-up AWS pricing–they are a business and provide value beyond storage and compute cycles. However, you will not find any pricing information on the DNAnexus website. In addition to storage costs, add-ons like archiving and GxP regulatory compliance require separate licenses that are not disclosed when signing-up. Presumably, the company’s professional services team assists with these onboarding activities.

How to move your data from DNAnexus to AWS

So, having made the decision to move my WGS data to AWS, how did I do it?

On the DNAnexus platform, I used AWS S3 Exporter, a company-provided tool to upload data to an AWS S3 bucket (DNAnexus account required). You can invoke the exporter using either their SDK (dx-toolkit) or an online wizard–both methods work great. The DNAnexus documentation for the exporter tool is a little out-of-date, so here is the updated AWS IAM policy file to make your transfers work with verification:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::yourAccountNumber:root"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::yourBucketName",
            "Condition": {
                "StringLike": {
                    "aws:Referer": "https://platform.dnanexus.com/*"
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::yourAccountNumber:root"
            },
            "Action": [
                "s3:PutObject",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::yourBucketName",
                "arn:aws:s3:::yourBucketName/*"
            ],
            "Condition": {
                "StringLike": {
                    "aws:Referer": "https://platform.dnanexus.com/*"
                }
            }
        }
    ]
}

Another improvement: You can transfer your data from one S3 instance to another (DNAnexus to AWS) at the rate of 250 GBytes per hour, including verification. Five years ago, the transfer speed was 10 GBytes per hour!

One final gotcha

One thing that has not changed in 5 years is the “data transfer out” fee. Amazon’s fee is $0.09 per GByte and DNAnexus’ fee is $0.13 per GByte. This fee is an understandable disincentive to keep you from moving your data around too much. In my case, moving our family’s WGS data to AWS will add over $100 to the final bill. It’s a little like losing all your money at baccarat and then finding out that you still owe the banque a commission before you leave the table. Not a big deal when you are a family, but when you are the UK Biobank expecting to grow to 15 petabytes over the next 5 years…well, you get the idea.

For the money, take a look at upstart competitors like Basepair or ixLayer.

[Update 2021-01-10: Do not forget to remove the DNAnexus API, called dx-toolkit!]

sudo apt-get remove --purge dx
sudo apt autoremove
sudo rm /etc/apt/sources.list.d/dnanexus.list

My WGS data is now available on Amazon S3

Read the blog post

Is genetics like finding a needle in a haystack?

wheatstacks
Claude Monet, Wheatstacks (End of Summer), 1890-91. The Art Institute of Chicago.

You may know this painting, one of over two dozen haystack paintings (actually, stacks of wheat) that Claude Monet produced at various times and seasons. What if we hid a needle one of them? Could we find it? In genetics, we often say that finding a genetic variant is like finding a needle in a haystack. But how big is the needle? How big is the haystack? Who believes this stuff?

Francis Collins, NIH director and leader of the Human Genome Project, is a clear believer. In a recent PBS series by Ken Burns, The Gene: An Intimate History, Dr Collins said that finding a misspelling in a gene that causes a particular disease is “a needle in a haystack.

collins-quote
Dr Francis Collins, NIH director

So, in honor of April 25th, National DNA Day, which commemorates the discovery of DNA’s double helix in 1953 and the publication of the first draft of the human genome in 2003, I embarked on a curious journey to answer these questions. My approach would be to determine the volume of a needle and then figure out how large to make the haystack to see if the analogy holds.

What is the volume of a needle?

How do you calculate the volume of a needle? It is useful to start with some real needles, and it turned out that I had a small box of 30 assorted needles–here’s a photo of them:

needles
30 needles ranging is size from 30 mm to 60 mm

Needles come in all kinds of lengths, but I was looking for an “average” needle, so I made a rough list of their lengths and came up with this:

Average needle length: sum(5 x 30 mm, 10 x 35 mm, 11 x 40 mm, 1 x 45 mm, 1 x 50 mm, 1 x 55 mm, 1 x 60 mm) / 30 = 37 mm or 1.4 inches

needle1
A very average needle from my sample (length = 37mm, diameter = 0.75 mm)

Volume of a needle: displacement method

Next, I had to figure out the volume (V) of a needle. My first approach was to drop the needles into a graduated cylinder filled with water and then measure the displacement.

Graduated cylinder filled with 40 ml water (left). Needles in graduated cylinder (middle). Close-up photo showing 0.5 ml water displacement (right).

It turns out that 30 needles displaced 0.5 ml of water, so we divide by 30 to get the average volume of the needles in my sample.

V = 0.5 mm / 30 = 0.016 ml or about 0.02 ml per needle

Can we find a different method to measure the volume of the needle and compare results?

Volume of a needle: small cylinder

Another way to approximate the volume of the needle is to view it as a small cylinder using the formula: V = pi * r^2 * height. Since we are looking for volume (V), we want to express our measurements in centimeters to end up with cubic centimeters (cc).

V = pi * (0.075 cm / 2)^2 * 3.7 cm = 0.016 cc, or about 0.02 cc

Since 1 ml = 1 cc, we can say that the volume of our needle is 0.02 ml, which is spot on with our previous measurement. Note: It is rare in science that your numbers match exactly, but we’ll take the win and start building our haystack.

Building a genetic haystack

Our next task is to build a haystack that approximates the size of the human genome, where each A-T or C-G base pair in the human genome is the size of a needle. The Human Genome Project pegs that number around 3 billion base pairs within one copy of a single genome.

We build our genetic haystack by defining it as a volume having 3 billion opportunities to find a 0.02 ml object, or volume (V) equals 3,000,000,000 * 0.02 ml.

V = (3*10^9 * 0.02 ml) = 60*10^6 ml or 60,000,000 ml

Since 1 ml = 1 cubic centimeter (cc), the volume can also be stated as 60,000,000 cubic centimeters or 60*10^6 cc. Finally, 1 liter = 1000 ml, so the volume in liters equals 60,000,000 ml * (1 liter / 1000 ml).

V (haystack) = (60*10^6 ml) * (1 liter / 1000 ml) = 60*10^3 liters or 60,000 liters

We now know how much space it occupies, but what does it look like? For example, how tall is our haystack?

Shape of a genetic haystack: approximation with a hemisphere

Like the needle, we can approximate the shape of a haystack to something easy to calculate, like the volume of hemisphere in cubic centimeters: V = 2/3 * pi * r^3.

V (hemisphere) = 60*10^6 cc = 2/3 * pi * r^3

Solving for the radius (r), we get r = 306 cm. Since 1 m = 100 cm, the height (radius) of the hemisphere equals 306 cm * (1 m / 100 cm).

r (hemisphere) = 306 cm * (1 m / 100 cm) = about 3 meters or 10 feet tall

So, we have a rough idea of what an idealized haystack looks like. It is 3 meters tall and it is shaped like a hemisphere. But how accurate is our measurement? Can we do better? Has anyone studied haystack modeling? Indeed, someone has and his name is W.H. Hosterman.

Shape of a genetic haystack: using haystack modeling

hosterman-stacks
Outline drawings of square hay stacks of different shapes

In 1931, W.H. Hosterman from the U.S. Department of Agriculture published an extensive technical bulletin “for the purpose of determining the volume and tonnage of hay.” Hosterman and his USDA colleagues measured over 2,600 haystacks across 10 states and presented results for both square and round haystacks. For round haystacks, Hosterman derived this formula: V = ((0.04 * Over) – (0.012 * C)) * C^2, where C is the circumference of the haystack, and Over is the measurement of the circumference of the stack over the top. We can compare results with the hemisphere by selecting equivalent values from Table 7 in Hosterman’s paper.  (We will momentarily switch to imperial units to make use of the constants in the formula.)

V (Hosterman) = ((0.04 * 32 ft) – (0.012 * 62 ft)) * 62^2 = 2,060 cubic feet or 58,333 liters 

The observed height of Hosterman’s haystacks are slightly taller than wide, but their sloped tops make the volume of the “true” haystack (58,333 liters) a little less than our original estimate of a 3-meter-tall, 60,000-liter haystack. When we compare the volume using Hosterman’s formula to the volume of the hemisphere, we see that they differ by 1,667 liters, or about 3%. Close enough!

Genetics is like finding a needle in a haystack

We have two comparisons that match closely. So, what does that haystack look like?

romanian-haystack
Romanian haystacks are typically 3 to 4 meters tall.

It turns out that Romanian haystacks are about 3 meters tall, so finding a misspelling in a gene (a genetic variant) is indeed like finding a needle in a haystack.

Who looks for needles in haystacks?

If genetics truly is like finding a needle in a haystack, what kinds of people do this? Well, let’s go back to Francis Collins.

In 1993, Collins published a paper describing the genetics of cystic fibrosis, a disease that required finding “a needle in a haystack.” Today, we have found needles from thousands of genetic haystacks. Some of these needles lead to clues about rare diseases, which affect more than 400 million people worldwide.

In real life, you can find a needle in a haystack, too. In 2004, performance artist Sven Sachsalber found a needle in a haystack after hunting for it for 24 hours. If we could get computers to solve diseases at that speed, we would happily be out of work.

sachsalber-aiguille
In 2004, performance artist Sven Sachsalber found a needle in a haystack at the Palais de Tokyo in Paris.

OK, back to finding more needles…