7 things I learned while reprocessing my WGS data on Terra: part 1

After creating FASTQ files from my BAM data and learning how to use Terra, I was finally ready to run the Whole Genome Analysis Pipeline. This collection of workflows, called a “workspace,” contains the latest GATK Best Practices workflows for whole genome sequence (WGS) data, including pre-processing, germline short variant discovery, and joint variant calling. Although I am working with a single human genome (my own), this same production pipeline is routinely used on thousands of WGS samples every day.

Being a relative newcomer to GATK and a complete novice with Terra, I found the path to success a little bumpy. Before jumping into what I learned, I want to acknowledge the staff at the Broad, who were extraordinarily kind, starting with GATK’s Benevolent Dictator for Life, Geraldine Van der Auwera, who is coincidentally the co-author of a highly informative book, Genomics in the Cloud. This blog post would not be possible without the knowledge that I gleaned from those pages. The Terra support team has also been wonderfully responsive–I even received a call from a designer at the Broad asking how they could improve Terra’s user experience!

Below, I describe the reprocessing of my WGS data. The goal is to have a consistent baseline as we continue to search for answers in our genes.

Note: Terra is evolving rapidly, and you may find that some links have changed. These tips were current as of this writing (June 2021). Drop me a line on Twitter if you see an improvement that I can add.

1. Creating unmapped BAM (uBAM) files from paired end files

Our family’s WGS data was processed on Illumina sequencers, albeit on different machines at different times. To get started, the first processing step is to create unmapped BAM (uBAM) files from raw FASTQ data. GATK’s use of uBAM files is an acknowledged “off label” use of the BAM file format, but it provides an opportunity to insert details (metadata) that would otherwise be absent. Given Illumina’s 75% market share, chances are high that you will be creating uBAM files using the “Paired FASTQ to unmapped BAM” workflow located in the Sequence-Format-Conversion workspace (or something similar).

2. Read Groups (@RG) in the uBAM file

After creating uBAM files, my first run of the 1-WholeGenomeGermlineSingleSample workflow ended with an error (after three days of processing):

Task UnmappedBamToAlignedBam.CheckContamination:NA:1 failed. Job exit code 255. Check gs://my-terra-bucket/.../call-CheckContamination/stderr for more information. PAPI error code 9. Please check the log file for more details: gs://my-terra-bucket/.../call-CheckContamination/CheckContamination.log.

To start debugging the CheckContamination subtask, I fired up the cloud-based Jupyter notebook within Terra (very cool), attempted to copy the sorted BAM file to the notebook environment, and promptly ran out of disk space. To create enough disk space for your BAM file, go to settings (look for the big gear in the upper right corner) and change the persistent disk size to 100 GB.

The cause of this error turned out to be a misunderstanding about read groups. In the BAM file, you can see reads tagged with two different read groups (RG): Pickard-K-Thomas_C and Pickard-K-Thomas_A. Distinct read group IDs are expected, but here each read group also carried its own sample name (SM). The sample names have to be the same; otherwise, CheckContamination thinks your BAM file has been “contaminated” with multiple samples.

!samtools view sample.sorted.bam | head -n 2

C2L88ACXX_0:5:1303:576005:0	113	chr1	10000	28	30S70M	chr18	3702590	0	CTATGCAGCACACCCAACCAAACCCCATCCATAACCCTAACCCTAACCCTAACCCTAACCCTAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA	0''''0'00''0'0'0'0''7'<'<'0'''''B7<<B'''B<7'7'B<00'0F<BBFBFFFBB'FBFFFFIFFFFBFB<BFFFFB<B<FFFFB<FBFBBB	MC:Z:100M	RG:Z:Pickard-K-Thomas_C	MQ:i:60	AS:i:65
C2L88ACXX_0:3:1101:1473452:0	99	chr1	10001	0	100M	=	10242	288	TAACCCTAACCCTAACCCTAACCCTTACCCTTACCCTTACCCTTACCCTTACCCTTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC	BBBFFFFFFFFFFIIIIIIIIIFIIFBFIIBFFIIIIIIIIIIFFIIIIFBFFIIFFIIFI<BFIFF0BFFFBFFFFBB<<BFFFBBF7BBBBBBB77BB	MC:Z:53S47M	RG:Z:Pickard-K-Thomas_A	MQ:i:0	AS:i:70

Searching for read group (@RG) in the BAM file makes the problem even more visible:

!samtools view -H sample.sorted.bam | grep '^@RG'
@RG	ID:Pickard-K-Thomas_A	SM:Pickard-K-Thomas_A	LB:Illumina-PG0001189-BLD	PL:ILLUMINA	PU:C2L88ACXX.0.3	CN:Illumina	DT:2013-12-08T07:00:00+0000
@RG	ID:Pickard-K-Thomas_B	SM:Pickard-K-Thomas_B	LB:Illumina-PG0001189-BLD	PL:ILLUMINA	PU:C2L88ACXX.0.4	CN:Illumina	DT:2013-12-08T07:00:00+0000
@RG	ID:Pickard-K-Thomas_C	SM:Pickard-K-Thomas_C	LB:Illumina-PG0001189-BLD	PL:ILLUMINA	PU:C2L88ACXX.0.5	CN:Illumina	DT:2013-12-08T07:00:00+0000
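The same check can be scripted. Below is a minimal sketch in Python, assuming the header text has already been captured (for example, from the samtools command above); the function names are hypothetical and the header is abbreviated to the ID, SM, and PU fields shown above.

```python
def parse_rg_lines(header_text):
    """Return one dict per @RG header line, keyed by the two-letter tag."""
    read_groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Skip the leading "@RG" token; every other field is TAG:value.
        fields = line.split("\t")[1:]
        read_groups.append(dict(field.split(":", 1) for field in fields))
    return read_groups


def check_read_groups(read_groups):
    """List problems: IDs must be unique, SM must be identical everywhere."""
    problems = []
    ids = [rg.get("ID", "") for rg in read_groups]
    if len(ids) != len(set(ids)):
        problems.append("duplicate ID values")
    samples = sorted({rg.get("SM", "") for rg in read_groups})
    if len(samples) > 1:
        problems.append("multiple SM values: " + ", ".join(samples))
    return problems


# Abbreviated copy of the three @RG lines from the failing BAM file.
header = "\n".join(
    "\t".join(["@RG", f"ID:Pickard-K-Thomas_{s}", f"SM:Pickard-K-Thomas_{s}",
               f"PU:C2L88ACXX.0.{lane}"])
    for s, lane in [("A", 3), ("B", 4), ("C", 5)]
)

print(check_read_groups(parse_rg_lines(header)))
```

For this header, the check reports the three different SM values and no ID problem, which is the situation described above.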

The fix for read groups in the uBAM file

The fix was to go back to Sequence-Format-Conversion and change three values in WORKFLOWS>INPUTS, which in turn inserts the correct metadata in your uBAM files–many thanks to Geraldine for pointing this out:

  1. Change readgroup_name from this.read_group to this.read_group_id
  2. Change sample_name from this.sample_id to this.sample
  3. Change additional_disk_space_gb to 100

Other notes:

  1. This article was invaluable to understand how read groups (@RG) work.
  2. ID (Read Group IDentifier) field: Each ID value must be unique.
  3. SM (SaMple) field: Unlike ID, the sample name must be the same in all SM fields.
  4. LB (LiBrary) field: I referenced my unique Illumina ID for DNA prep library traceability.
  5. PL (PLatform) = ILLUMINA (all caps)…I read an official list of sequencers in the documentation and “ILLUMINA” is on that list.
  6. PU (Platform Unit) field: The convention is to use periods as the delimiter in the lane identifier, not underbars as used in the FASTQ filename.
  7. CN (Sequencing CeNter) field: I used “Illumina” because they processed this sample.
  8. DT (DaTe) field: Using the ISO 8601 combined date/time standard worked for me. Interestingly, Terra converted my local time to UTC time inside the BAM file (which makes sense given that genomes can be processed across multiple timezones).
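The local-to-UTC conversion mentioned in the DT note can be reproduced with Python's standard library. This is only a sketch of the arithmetic, not a description of how Terra or Picard performs it internally; the function name is hypothetical.

```python
from datetime import datetime, timezone


def to_utc_bam_date(local_iso):
    """Convert an ISO 8601 timestamp with a UTC offset to UTC."""
    local = datetime.fromisoformat(local_iso)
    if local.tzinfo is None:
        raise ValueError("timestamp needs an explicit UTC offset")
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S%z")


# 11 pm Pacific on December 7 is 7 am UTC on December 8.
print(to_utc_bam_date("2013-12-07T23:00:00-08:00"))  # 2013-12-08T07:00:00+0000
```

The result matches the DT value that appears in the @RG header lines of the BAM file.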

From the Sequence-Format-Conversion workflow, here’s my successful DATA>TABLE>read_group page in tsv format:

entity:read_group_id	output_unmapped_bam	fastq1	fastq2	library_name	platform_name	platform_unit	run_date	sample	sequencing_center

Pickard-K-Thomas_A	gs://my-terra-bucket-id/.../Pickard-K-Thomas_A.unmapped.bam		gs://my-bucket/Pickard-K-Thomas/FASTQ/C2L88ACXX_0_3_none_1.fastq.gz	gs://my-bucket/Pickard-K-Thomas/FASTQ/C2L88ACXX_0_3_none_2.fastq.gz	Illumina-PG0001189-BLD	ILLUMINA	C2L88ACXX.0.3	2013-12-07T23:00:00-08:00	Pickard-K-Thomas	Illumina

Pickard-K-Thomas_B	gs://my-terra-bucket-id/.../Pickard-K-Thomas_B.unmapped.bam		gs://my-bucket/Pickard-K-Thomas/FASTQ/C2L88ACXX_0_4_none_1.fastq.gz	gs://my-bucket/Pickard-K-Thomas/FASTQ/C2L88ACXX_0_4_none_2.fastq.gz	Illumina-PG0001189-BLD	ILLUMINA	C2L88ACXX.0.4	2013-12-07T23:00:00-08:00	Pickard-K-Thomas	Illumina

Pickard-K-Thomas_C	gs://my-terra-bucket-id/.../Pickard-K-Thomas_C.unmapped.bam		gs://my-bucket/Pickard-K-Thomas/FASTQ/C2L88ACXX_0_5_none_1.fastq.gz	gs://my-bucket/Pickard-K-Thomas/FASTQ/C2L88ACXX_0_5_none_2.fastq.gz	Illumina-PG0001189-BLD	ILLUMINA	C2L88ACXX.0.5	2013-12-07T23:00:00-08:00	Pickard-K-Thomas	Illumina
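The platform_unit values in the table follow from the FASTQ filenames by swapping underbars for periods. Here is a small sketch of that mapping, assuming the filename layout seen in my files (the first three underbar-separated fields form the platform unit); other sequencing providers may name their files differently, and the function name is hypothetical.

```python
def platform_unit_from_fastq(path):
    """Build a PU value from the first three underbar-separated fields."""
    name = path.rsplit("/", 1)[-1]  # drop the bucket and directory prefix
    parts = name.split("_")
    if len(parts) < 3:
        raise ValueError(f"unexpected FASTQ filename: {name}")
    return ".".join(parts[:3])


fastq1 = "gs://my-bucket/Pickard-K-Thomas/FASTQ/C2L88ACXX_0_3_none_1.fastq.gz"
print(platform_unit_from_fastq(fastq1))  # C2L88ACXX.0.3
```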

Creating uBAM files took about four hours at a cost of $1.15. It was time for a second run of the 1-WholeGenomeGermlineSingleSample workflow.

3. CheckFingerprint issue #1

This time, the sticking point was at the end of the pipeline in a routine called CheckFingerprint, which is called as a subtask within AggregatedBamQC. Here’s the error (also found after three days of processing):

Job AggregatedBamQC.CheckFingerprint:NA:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.

I checked the CheckFingerprint.log and suspected the issue was related to the NA12878 dataset (what was that doing there???):

WARNING 2021-05-11 21:25:40 FingerprintChecker Couldn't find index for file /cromwell_root/dsde-data-na12878-public/NA12878.hg38.reference.fingerprint.vcf going to read through it all.
WARNING 2021-05-11 21:25:40 FingerprintChecker There was a genotyping error in File: file:///cromwell_root/dsde-data-na12878-public/NA12878.hg38.reference.fingerprint.vcf
Cannot find sample 1-WholeGenomeGermlineSingleSample_2021-05-09T05-11-16 in provided file.

The fix for CheckFingerprint issue #1

After some head scratching, I found the solution by scrolling to the bottom of WORKFLOWS>INPUTS. There, I found a field called fingerprint_genotypes_file, which had a value of: gs://dsde-data-na12878-public/NA12878.hg38.reference.fingerprint.vcf

Clearing this field fixed the issue, and I launched 1-WholeGenomeGermlineSingleSample for the third time.

Note: When debugging your problem, keep in mind that a search of the Terra knowledge base does not return results from the GATK documentation, which can be very useful for GATK- or Picard-related issues.

4. CheckFingerprint issue #2

The third run was also unsuccessful–this time the issue was a little trickier. Here’s the error message (also found after three days of processing!):

INFO 2021-05-11 21:27:00 CheckFingerprint Read Group: null / Pickard-K-Thomas vs. 1-WholeGenomeGermlineSingleSample_2021-05-09T05-11-16: LOD = 0.0
ERROR 2021-05-11 21:27:00 CheckFingerprint No non-zero results found. This is likely an error. Probable cause: EXPECTED_SAMPLE (if provided) or the sample name from INPUT (if EXPECTED_SAMPLE isn't provided) isn't a sample in GENOTYPES file.

The fix for CheckFingerprint issue #2

It turns out that the saved name of your WORKFLOWS>root entity>read_group_set must match the name of your VCF output (in my case, Pickard-K-Thomas). In the error message above, the default read_group_set name (1-WholeGenomeGermlineSingleSample_2021-05-09T05-11-16) does not match, but is stored as the value in read_group_set_id in DATA>read_group_set. Saving the read_group_set name as “Pickard-K-Thomas” fixed the issue. The alternative is to change the value of WORKFLOWS>INPUTS>sample_and_unmapped_bams, which uses read_group_set_id by default. Yikes!
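When a tool complains that a sample is not in a VCF, it can help to list the sample names the file actually contains. In the VCF format, they are the columns after the nine fixed columns of the #CHROM header line. The sketch below works on VCF text already loaded into a string; the toy record and the function name are made up for illustration.

```python
def vcf_sample_names(vcf_text):
    """Return the sample columns from the #CHROM header line of a VCF."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT, then samples.
            return line.split("\t")[9:]
    return []


toy_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "Pickard-K-Thomas"]),
    "\t".join(["chr1", "10001", ".", "T", "A", "50", "PASS", ".", "GT", "0/1"]),
])

names = vcf_sample_names(toy_vcf)
print(names)  # ['Pickard-K-Thomas']
print("1-WholeGenomeGermlineSingleSample_2021-05-09T05-11-16" in names)  # False
```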

Note: Understanding the standard data model is critical to your success. This article, chapters 11 and 13 in Genomics in the Cloud, and these videos will assist in wrapping your head around it. I found the data model to be the most challenging part of this process.

I launched 1-WholeGenomeGermlineSingleSample for the fourth time, but aborted it after four days of processing (thinking that the software was broken).

5. Improving process delays

If your job is taking longer than usual (say, an extra 12+ hours), take a look at the timing diagram in the Job Manager. If you see a bunch of pink boxes, it’s time to submit a request to Terra Support for more resources. To submit support requests, you must create a Zendesk account that is separate from your Terra account. The good news is that the support account that you create for Terra will also be valid for questions that you submit to the GATK Community Forum.

This article provides an excellent overview explaining how to request additional resources for your project. In my case, I wanted my jobs to run 30% faster, so I requested an increase for resources that were limited (IP addresses and CPUs). After I forwarded my request, the support team took care of it immediately and the issue completely disappeared. Here is the information that I provided for the request:

  1. Your Terra billing project: YOUR-BILLING-PROJECT-GOES-HERE
  2. Which quota(s) you want to increase: IP addresses and CPUs
  3. What you want your new quota(s) to be: 30% higher than what they are now
  4. Which regions you want the increase applied to, if applicable: us-central1
  5. Rationale for increase: Research purposes

6. Information to include when submitting a support request

If you followed the instructions in the previous step, you are ready to submit support requests. Providing these items in your request will speed up the process:

  1. Your Project ID
  2. Your workspace name
  3. Your Bucket ID, Submission ID, and Workflow ID
  4. Any useful log information

You may also be asked to share your workspace with the support team. To do this, add the email address GROUP_FireCloud-Support@firecloud.org to your workspace by clicking the Share button–the option is located in the three-dots menu at the top-right.

7. Cleaning up

My fifth run was successful! Now it was time to clean up.

After learning how to use this workflow and running it unsuccessfully a few times, I had amassed a significant amount of storage. To wit:

$ gsutil du -s gs://my-terra-bucket-id
3,994,577,017,810  gs://my-terra-bucket-id

Holy smokes–about 4 terabytes, which costs more than $50 USD per month using standard Google cloud storage. At runtime, you can automatically delete intermediate files with an option that removes files for workflows that complete successfully. Since I was learning, I kept them around and then used the Remove_Workflow_Intermediates notebook to remove them manually.
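To put a number on it, here is a back-of-the-envelope sketch. The price per gigabyte is an assumption for illustration only, not a quoted rate; check current Google Cloud Storage pricing for your region and storage class.

```python
BUCKET_BYTES = 3_994_577_017_810   # from the gsutil du output above
ASSUMED_USD_PER_GB_MONTH = 0.02    # assumption, not an official price


def monthly_storage_cost(size_bytes, usd_per_gb_month):
    """Estimate monthly cost, counting 1 GB as 10**9 bytes."""
    return size_bytes / 1e9 * usd_per_gb_month


cost = monthly_storage_cost(BUCKET_BYTES, ASSUMED_USD_PER_GB_MONTH)
print(f"{BUCKET_BYTES / 1e12:.2f} TB, about ${cost:.2f} per month")
```

At the assumed rate, this comes to roughly $80 per month, comfortably more than $50.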

To begin cleaning up, I removed all subdirectories with failed runs (but not the notebooks directory):

The spinning circles show the directories that I manually deleted. Be sure to keep the “notebooks” directory.

Next, I looked at the size of the directory from my successful run, about 864 gigabytes:

$ gsutil du -s gs://my-terra-bucket-id/my-submission-id
863,761,741,217  gs://my-terra-bucket-id/my-submission-id

To manually delete the remaining intermediate files, I copied this notebook to my workspace. Note: Before running it, I upgraded to the latest version of pip and google-cloud-bigquery with this command:

!/usr/local/bin/python3 -m pip install --upgrade pip
!pip install --upgrade google-cloud-bigquery

Within the notebook code, I also modified the pip command to upgrade to the latest library versions with this command:

!pip install --upgrade $install_cmd

The program found 463 intermediate files to delete (Note: 782.61 GiB = 840 gigabytes).

WARNING: Delete 463 files totaling 782.61 GiB in gs://my-terra-bucket-id (Whole-Genome-Analysis-Pipeline)
Are you sure? [y/yes (default: no)]: yes

After executing the cleanup code, I reduced total storage for the successful run by 97%, from about 864 to 23 gigabytes, which now costs less than $0.50 USD per month using standard Google cloud storage. The largest savings came from storing the uncompressed BAM file (previously 80 gigabytes) as a compressed CRAM file (16 gigabytes). My take-home: It pays to pay attention to unnecessary files!
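The unit conversion and the percentage quoted above can be double-checked with a few lines of Python. The only assumptions are the usual definitions: 1 GiB is 2**30 bytes and 1 gigabyte is 10**9 bytes.

```python
def gib_to_gb(gib):
    """Convert binary gibibytes to decimal gigabytes."""
    return gib * 2**30 / 1e9


def percent_reduction(before, after):
    """Percentage by which a quantity shrank."""
    return (before - after) / before * 100


print(round(gib_to_gb(782.61)))           # 840
print(round(percent_reduction(864, 23)))  # 97
```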

Conclusion

After building uBAM files correctly, reprocessing my genome would typically cost about $7 USD and three days of compute time. It took five runs to get it right, but Terra’s call caching magic–and perhaps the additional CPU power that I requested–brought the last runtime down to 14 hours. It has been a steep climb, but the views are great. Next up: reprocessing WGS data for the rest of our family, and then joint variant calling.

Citizen science: One family’s search for answers in their genes

This entry was cross-posted from Terra on April 28, 2021.

In April, we celebrate Citizen Science Month, World Autism Day, and National DNA Day. In this guest blog post, all three events come together as KT Pickard, father of a young woman with autism, shares his family’s story of personal genomics and citizen science. 


This past Sunday was National DNA Day, which commemorates the discovery of DNA’s double helix in 1953 and the completion of the Human Genome Project in 2003. Events on National DNA Day celebrate the latest genomic research and explore how those advances might impact our lives. Last year, I wrote a playful article for DNA Day that investigated whether genetics is truly like finding a needle in a haystack. This year, our family is honored to share our story and ideas with you.

Our family’s DNA odyssey

My wife and I have a young adult-aged daughter who is on the autism spectrum. We first discovered that our daughter had autism when she was eight years old. As we struggled to understand autism and what it meant for our family, we learned that autism is uniquely expressed: Meeting one person with autism means that you have met one person with autism. 

Long fascinated with genomics, my wife and I wondered how our DNA may have contributed to her condition, and we set out to learn all that we could. It was the beginning of this diagnostic odyssey that gave expression to my second career as a citizen scientist. My professional background in supercomputing, software engineering, and medical imaging was a good starting point for applying scientific principles and gaining insights.

We began our journey by talking with our family doctor, then my wife and I had our whole genomes sequenced through the Understand Your Genome project. Later, we crowdsourced the sequencing of our daughter’s genome and began looking for genetic clues. By applying trio analysis to our family data, we discovered some preliminary findings: Our daughter has deletions in the NRXN1 gene and in a large region of chromosome 16, which have been found to be widely associated with developmental issues including autism. It looks like my wife and I have each contributed some variant alleles, but we are being careful about interpreting these findings because our WGS data and our daughter’s were processed through different pipelines, which could lead to inconsistent results.

Trio analysis of the NRXN1 locus shows a compound heterozygous deletion, with each parent possibly contributing one allele (visualization by VarSeq from Golden Helix). 

To continue our journey, I want to reprocess our family’s WGS data with the latest GATK Best Practices, in the hope that this will give us a consistent baseline. I came across Terra through the book Genomics in the Cloud, which I picked up to help me learn more about GATK. I led an online book club in early 2021 based on the book, and subsequently moved our WGS data to the Terra platform. Now I am using the GATK Whole Genome Analysis Pipeline in Terra to reprocess our data. Working with Terra has been challenging, but highly satisfying because it provides access to industry standard genomics tools.

From personal genomics to citizen science

My family’s main goal with this project is to make meaningful discoveries about the genetic basis of our daughter’s autism. In 2015, genetics could explain the heritability of autism spectrum disorder in approximately 1 in 5 cases. Amazingly, that number has increased to 4 in 5 cases today. 

Our daughter (who drew this image) is on the left. At the time, she represented the 1 in 5 people whose autism could be explained by genetics.

Yet there is more to be gained. Although whole genome sequencing may not provide directly actionable results for autism itself, WGS can make a huge difference for parents who discover a comorbid, but treatable condition. By sharing our data and our findings with others, we can accelerate medical knowledge. 

A growing number of projects offer opportunities for non-scientists to contribute in various forms to the advancement of biomedical research. In U.S. healthcare, one of the largest citizen science projects—All of Us—seeks one million people to share their unique health data to speed up medical research. By creating a national resource that reflects and supports the broad diversity of the U.S., the goal of All of Us is to advance precision medicine for all. 

We have enrolled in the All of Us project and are looking forward to doing our part. I find it inspiring that this is something we can all contribute to, as citizens, even those of us who are not researchers. 

Looking to the future

At its core, citizen science is a collaboration between scientists and those who are curious and motivated to contribute to scientific knowledge. As our family’s odyssey unfolds, I like to reflect on what I see out here on the bleeding edge of research, and how it could be applied to improve outcomes for patients in the real world.

In community practice, many medical providers have limited knowledge of autism. Due to a lack of effective data sharing and awareness, an undiagnosed person with autism who walks through the door of a hospital may appear like a rare disease patient. A clinician evaluating them would miss out on a huge amount of valuable context. How could we improve the system so that clinicians could more effectively recognize the underlying context of that person’s condition? We can address some of these issues with machine learning, but that requires pooling together huge amounts of data, and much of that data is difficult to access.

As a citizen scientist, I see an enormous opportunity to combine research data with real-world data and evidence across healthcare delivery organizations. Common ontologies and interoperability standards are making it increasingly easy to pool de-identified datasets to test hypotheses on synthetic data—realistic-but-not-real data—to gain insights. A recent “call to action” encourages citizen scientists to evaluate the utility of this method precisely because data can be shared without disclosing the identities of anyone involved. Done ethically and responsibly, this synthetic DNA approach has the potential to accelerate autism research and deliver new benefits to patients.

This is the perspective I have gained from my journey so far. By asking questions and continuing to discover more about what our genomes contain, I have been fortunate to learn much about scientific principles, bioinformatics, and a bit about the genetic basis of autism. Although it is at times a challenging road, I have found that the path of personal genomics and citizen science is a satisfying way to find answers to the questions that my family faces. I hope this story will inspire others to explore, and perhaps let researchers and clinicians see patients and their families as potential collaborators in the quest to understand complex conditions like autism.

#JoinAllofUs

Today I joined All of Us, a research community of one million people to lead the way for individualized prevention, treatment, and care for, well, all of us. This project was previously known as the Precision Medicine Initiative.

Many of you know that our family has used whole genome sequencing to look for clues in our daughter’s autism. This blog shares that journey. I have also published peer-reviewed papers to explore the reasons why people share personal health information. Through this research, I am convinced that information sharing will contribute to a learning healthcare system to improve care and lower costs.

It just takes people like you and me to #JoinAllofUs and lead by example.

Big data: From medical imaging to genomics

KT & Kimberly Pickard

In 2006, a Scientific American article written by George Church, “Genomics for All,” rekindled my interest in genomics. I went back to school in 2009 to contemplate the business of genomic medicine, and celebrated my MBA by writing a Wikipedia entry for the word, “Exome.” I was hooked.

We started our odyssey by genotyping our family using 23andMe, and later my wife and I had our whole genomes sequenced. Realizing that genomics was starting to yield clinically useful information, we crowdsourced the sequencing of our kid’s genomes to look for genetic clues in their autism. We found interesting results, gave talks, and wrote papers.

Along the way, I realized that medical imaging and genomics are highly complementary: genomics informs or identifies conditions, and radiology localizes them. Sarah-Jane Dawson pointed this out at a Future of Genomic Medicine conference in 2014.

DIY genomics, autism, and coffee on Mendelspod

I have been a long-time listener to the intelligent and informative podcasts on Mendelspod, a site that connects people and ideas in life sciences. (Most nights you can find me listening to Mendelspod while I do the dishes.) I tuned in sometime in 2012 and created a mental map of the industry by listening to every podcast I could find. A steady diet of listening to the latest developments in the industry has allowed me to talk about genomics with ease at meetups, tweetups and conferences. (OK, going back to school helped, too.) Somewhere along the way I decided that I would do something worthy of being interviewed on the show.

Well, last week I got my wish when my interview was posted on Mendelspod. I talked about our crowdfunded family trio sequencing project, autism, and even “coming out” of the research closet after being invited to speak at a conference in China last year. We explored parallels between my career in medical imaging and the future of genomic medicine (more in this blog post).

We concluded the interview by talking about Genomics Coffee, a (now defunct) discussion group that met in San Francisco.

Many thanks to Theral Timpson and Ayanna Monteverdi, co-producers of Mendelspod, for their great show.

DIY Genomics at MindEx 2015

I recently presented results from our DIY genomics project at MindEx 2015 held at Harvard’s very Hogwarts-looking Sanders Theatre.

Hosted by the Mind First Foundation, this conference enabled participants in the Personal Genome Project to hear first-hand how their health data could be used in research, especially mental health research. The second day of the conference, the “PGPalooza,” let PGP participants directly interact with researchers to select projects of interest and have their questions answered immediately.

James Tao graciously edited this 25-minute video of my talk about family trio sequencing and autism:

Also, special thanks to Alex Hoekstra, co-founder of Mind First, for the invitation to this event.

Additional resources: Video, Slides

Finding Genetic Clues in Autism with Family Trio Sequencing

Yesterday, I presented preliminary findings at the 2015 Clinical Genome Conference in San Francisco from our family trio sequencing project. In this crowdsourced project on experiment.com, I looked for genetic clues to autism in our adult-aged daughter. While the talk focused on the “DIY” aspects of how to accomplish whole genome sequencing, this post focuses on genetic findings.

Overview

The project began with a crowdsourced effort to raise $1,750 to sequence our daughter’s genome, and took slightly more than two months to complete. After working with AllSeq and HudsonAlpha to obtain WGS data, we used VarSeq from Golden Helix to search for unique variants, as well as browse whole genome sequence data. After filtering our variant call data to focus on high quality exome variants, we examined 52 potentially damaging de novo and compound heterozygous changes suggested by VarSeq’s family trio analysis. Although this first approach did not yield clues specific to autism, it did suggest a number of secondary findings that are not addressed here. The second approach was to start with genes having known associations with autism and then look for them in our daughter’s DNA. Several curated databases have between 200 and 1200 genes, but again, none produced meaningful results. The third method was to look at known “hot spots” in autism genetics, such as variants in the NRXN1 gene, as well as known structural variation on chromosome 16. Changes to NRXN1 and so-called “16p” changes are discussed below.
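To make the two trio patterns mentioned above concrete, here is a toy sketch in Python. It is not VarSeq’s algorithm, and it ignores genotype quality, phasing, and multi-allelic sites; genotypes are simply pairs of allele indices (0 = reference, 1 = variant), and the function names are hypothetical.

```python
def is_de_novo(child, mother, father):
    """True if the child carries a variant allele seen in neither parent."""
    parental_alleles = set(mother) | set(father)
    return any(a != 0 and a not in parental_alleles for a in child)


def is_compound_het_candidate(variants):
    """True if, within one gene, the child is heterozygous at one site
    inherited only from the mother and another only from the father.

    variants: list of (child, mother, father) genotype tuples.
    """
    from_mother = from_father = False
    for child, mother, father in variants:
        if sorted(child) != [0, 1]:
            continue  # only heterozygous sites in the child count
        mother_carries = 1 in mother
        father_carries = 1 in father
        if mother_carries and not father_carries:
            from_mother = True
        elif father_carries and not mother_carries:
            from_father = True
    return from_mother and from_father


# Child is 0/1 while both parents are 0/0: a de novo candidate.
print(is_de_novo((0, 1), (0, 0), (0, 0)))  # True

# Two sites in one gene, one variant from each parent.
gene_sites = [((0, 1), (0, 1), (0, 0)), ((0, 1), (0, 0), (0, 1))]
print(is_compound_het_candidate(gene_sites))  # True
```

Real trio analysis layers quality filters and annotation on top of these inheritance checks, which is why the candidate list still needed manual review.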

Findings 

  • NRXN1 – Deletions in NRXN1 are associated with a wide spectrum of developmental disorders, including autism. Our daughter has a 10bp exonic deletion (-GT repeat) followed by what appears to be a 9bp compound heterozygous deletion in NRXN1. Both deletions are partially present in both parents and overlap; the deletions appear to have been cumulatively inherited. Due to the high number of sequence repeats, copy number variation (CNV) analysis should clarify the significance of this finding.
  • 16p deletions – Deletions and duplications in this 593-kilobase section of chromosome 16 are widely associated with developmental issues, including autism. Our daughter appears to have dozens of deletions in this region, some inherited and some not. However, since the variants in our daughter’s DNA were called using a different software pipeline, it is difficult to draw meaningful conclusions (see “Limitations,” below). For example, some variants in our daughter’s DNA were shown to map to multiple locations on the genome, suggesting either large copy number variation or genomic regions that were difficult to map. Copy number variation (CNV) analysis will also elucidate this region. Once reprocessed, these findings may provide potential genetic clues to our daughter’s condition.

Limitations

My wife and I received our WGS data in March 2014. Our samples were sequenced at 30x coverage using Illumina’s HiSeq platform and then aligned and called with Illumina’s pipeline, Isaac. Our daughter’s DNA was sequenced in May 2015 at 30x coverage, but on Illumina’s newest platform, the Illumina HiSeq X Ten. The difference is that our daughter’s DNA was aligned using BWA, followed by variant calling with the GATK “best practice” workflow. To accurately compare genomes in family trio analysis, all samples must be processed using the same software pipeline. Otherwise, variants may be aligned and called differently. My wife and I must go back to the (almost) original FASTQ data and start over. Although Illumina did not provide these files with our results, Mike Lin explains how to extract FASTQ files from Illumina data in this great blog series. Hint: it involves a utility called Picard (no relation). Until we reprocess our WGS data using the same bioinformatics pipeline, all results should be considered preliminary.

Conclusion

This project demonstrated that personal genomics is very real, and has the potential to answer complex medical questions. Today, answering those questions using whole genome data and family trio analysis requires a combination of genetic, bioinformatic and domain knowledge to reach meaningful conclusions. Validating those conclusions remains challenging, from rare diseases to complex conditions such as autism. Currently, personal genomics has a similar feel to “homebrew” computer clubs from the late ’70s–the community is still very small, collegial, and willing to share “tips and tricks” to advance the field.

Although we encountered some “dark alleys” during the analysis, our preliminary results suggest that family trio sequencing can indeed provide genetic clues to autism. We will continue to refine the analysis by reprocessing the data with the same pipeline, which should resolve questions in the 16p region, as well as the potential deletion in NRXN1. Further, CNV analysis should answer structural variation questions that are also known to be associated with autism spectrum conditions.

Acknowledgements

I would like to thank our backers and the team at experiment.com, as well as Gabe Rudy from Golden Helix. Gabe was very generous with his time, knowledge and insight. Finally, I would like to thank my wife, Kimberly, for her patience and fortitude. 

Additional resources: Slides

Searching for Genetic Clues in Autism with Family Trio Sequencing

This entry was cross-posted from DNAdigest on April 22, 2015.

Amazingly, the cost of whole genome sequencing is now 100,000 times less expensive than it was a dozen years ago. If the Tesla Model S followed this trajectory, you could buy one today for less than $1 USD. This super logarithmic decline puts genomics on par with desktop publishing or 3D printing—it has become something that you can affordably do yourself.

My wife, Kimberly, and I were excited about the prospect of having our genomes sequenced. Our daughter has autism, and like many parents of special needs children, we were eager to explore the underlying causes of her condition. We “got genomed” last year by enrolling in Illumina’s Understand Your Genome program. We received our whole genome sequencing (WGS) data, as well as limited predisposition and carrier screening for a number of Mendelian traits. As many DNAdigest readers know, the cost of WGS continues to drop, almost to the $1,000 genome that Illumina announced last year. Kimberly and I were intrigued to learn that we were both carriers of some rare genetic variants. Could our genetic idiosyncrasies be contributing to our daughter’s autism?

After being sequenced, I followed the lead of DNAdigest contributor Manuel Corpas and posted my whole genome sequence online. I decided to publish my genome without restrictions in an attempt to lead by example. In the future, platforms like Repositive will make it easier for consumers to share genomic information and maintain privacy.

Kimberly and I recently launched a project on experiment.com to crowd fund the whole genome sequencing of our adult-aged daughter. In this project, we will look for genetic clues to her autism using family trio sequencing. Family trio sequencing is a powerful technique that can explain genetic conditions by looking at differences in DNA between Mom, Dad and an affected child.
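To make the idea of comparing a child's DNA against both parents concrete, here is a toy sketch in Python. It is not the pipeline used in this project: the function name, site labels and genotypes are all made up for illustration, and real trio analysis works on VCF files with quality filters and inheritance models. The sketch only flags sites where the child carries an allele that neither parent has.

```python
# Toy trio comparison. All names and genotypes below are hypothetical.
def candidate_de_novo(child, mother, father):
    """Return sites where the child carries an allele seen in neither parent."""
    hits = {}
    for site, alleles in child.items():
        # Only compare sites that were genotyped in all three family members.
        if site not in mother or site not in father:
            continue
        parental = set(mother[site]) | set(father[site])
        novel = sorted(set(alleles) - parental)
        if novel:
            hits[site] = novel
    return hits


# Each site maps to the pair of alleles observed for that person.
mother = {"site1": ("A", "A"), "site2": ("C", "T"), "site3": ("G", "G")}
father = {"site1": ("A", "G"), "site2": ("C", "C"), "site3": ("G", "G")}
child = {"site1": ("A", "G"), "site2": ("T", "C"), "site3": ("G", "A")}

print(candidate_de_novo(child, mother, father))
```

In this made-up data, only the third site is flagged, because the child's "A" allele appears in neither parent. In practice, most such differences turn out to be sequencing or alignment errors, which is why real analyses add extensive filtering.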

We were thrilled when the sequencing project was funded on the first day. In the process, we received feedback from other parents who wanted to learn more about the technique, so we added a stretch goal to cover publishing costs in an open access journal. The research paper will document our findings, as well as explain how family trio sequencing can be used to search for answers to health conditions and rare diseases.

Information sharing can indeed be very personal, but we find the possibility of catalyzing new areas of health research compelling. With this project, we hope to find clues that will contribute, if only in a small way, to a growing body of genomics research that supports a broader explanation of autism.

I uploaded my whole genome sequence data to the cloud

I got genomed by Illumina

In March 2014, my wife and I “got genomed” by enrolling in Illumina’s (now Genome Medical’s) Understand Your Genome (UYG) program. UYG requires participants to order this whole genome sequence (WGS) test from their physicians due to uncertainties surrounding the delivery of genomic results in the U.S. Illumina is careful to point out that the service “…has not been cleared or approved by the U.S. Food and Drug Administration” and “you will not receive medical results, or a diagnosis, or a recommendation for treatment.” Our family physician signed the request in November 2013, and we received our results in February. Fortunately, no surprises, but the UYG program only covers these Mendelian disorders for now. We flew to San Diego a few weeks later to listen to talks by genomic researchers and discuss our results with genetic counselors. As part of this one-day seminar, we each received an iPad Mini that was pre-loaded with our results, as well as a portable hard drive that contained our raw sequence data.

I received my WGS data on this encrypted hard drive (about 100GB).

After we arrived home, the next step was to find a public “home” for my sequence data (to share without restrictions). What I learned is that uploading your genome anywhere is a challenge, mostly because the dataset is so big.

I looked at Dropbox, Evernote and Figshare, but their storage models do not scale well for genomic data. I tried Sage Bionetworks, but the BAM file was too large to upload. I settled on Amazon Web Services (AWS) and created an anonymous FTP server using the Amazon Elastic Compute Cloud (EC2).

About my whole genome sequence data

My genome data and results are now in the public domain, freely available to download under a Creative Commons (CC0) license. Uploading the data took two days over a 3Mbps connection, so you may want to read the clinical report and sample report instead.

  • ftp://ftp.startcodon.org <– I decommissioned the ftp server
  • username: anonymous
  • password: guest
  • BAM file checksum: 2529521235 (78.1GB uncompressed)
  • VCF file checksum: 4165261022 (2.4GB gzip compressed)
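
The upload time can be sanity-checked from the file sizes listed above and the 3Mbps connection speed. This is a back-of-the-envelope sketch, assuming decimal gigabytes, a fully saturated link and no protocol overhead (none of which I measured); the function name is only illustrative.

```python
def transfer_time_days(size_gb, link_mbps):
    """Estimate transfer time in days for size_gb gigabytes over a link_mbps link."""
    bits = size_gb * 1e9 * 8                # decimal gigabytes to bits
    seconds = bits / (link_mbps * 1e6)      # megabits per second to bits per second
    return seconds / 86400                  # seconds to days


# File sizes taken from the list above: BAM (78.1GB) plus VCF (2.4GB).
total_gb = 78.1 + 2.4
days = transfer_time_days(total_gb, 3)
print(f"{total_gb:.1f} GB at 3 Mbps: about {days:.1f} days")
```

Under these assumptions the estimate comes out at roughly two and a half days, which is in the same range as the two days the upload actually took.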

Questions about FTP? See this FAQ.

Now that I have my genome in the cloud, I’ll start playing with analysis tools like STORMSeq. Stay tuned!

My WGS data is now available on Amazon S3

Read the blog post

What is a Gene?

In an ongoing effort to unravel the mysteries of DNA, I recently completed a class at UC Berkeley, “Introduction to Genetic Analysis.” This essay, “What is a Gene?” was part of my final. Although the question could easily pass as a Zen koan, I gave it a shot.

What is a Gene?
 
To paraphrase Nature reporter Helen Pearson, ‘gene’ is not your typical four-letter word. Unlike most four-letter words whose definitions are well understood, the definition of a gene remains elusive. The more scientists learn about genes, the more the definition seems to fray around the edges. A question such as ‘How many genes are in this organism?’ is difficult to answer conclusively without a consistent description. In 2006, one research group examined the results of 77 experiments counting the number of genes in the human genome; none produced the same result (Liolios et al, 2006 doi:10.1093/nar/gkj145). From the smallest virus with three functional genes to humans with approximately 22,000, counting genes is challenging.

With roots in Mendel’s research on garden peas, the term “gene” has evolved from its original definition of a “unit of inheritance” to one that reflects advances in molecular biology. A commonly accepted definition is that a gene is a region of nucleic acid that specifies an RNA or protein. This definition encompasses both single- and double-stranded DNA and RNA. Exons, coding regions of DNA and RNA that are translated into protein sequences, are found in most, but not all genes. To incorporate a finding that proteins can be produced from non-coding exon regions, some geneticists have added “flanking regulatory elements” to this definition (Pesole, 2008 doi:10.1016/j.gene.2008.03.010). This addition incorporates genetic curiosities such as the lac operon, which allows bacteria to digest lactose. Newer definitions may emphasize functional products—counting proteins or RNA—rather than specific DNA loci. More precise definitions that apply to specific types of organisms, e.g., eukaryotes, seem inevitable.

After the discovery of the structure of DNA by Watson and Crick, mechanisms describing DNA replication, transcription and translation quickly followed. DNA, which functions as a “parts list” of molecular information, stores an organism’s functional repertoire. The addition of molecular information to biology provided a physical basis for understanding heredity, which in turn led to the surprising finding that organisms share many genes in common. This commonality has provided insight into the evolution of various species. Through the lens of evolution, genes exist to convert the molecular information stored in DNA into self-sustaining multicellular organisms. Organisms with adaptive genes transmit their genetic information to the next generation to ensure the successful propagation of the species.

In 1951, two years before Watson and Crick’s paper appeared, Einstein was asked to define “light quanta,” or what are now commonly called photons. His response was:

All these fifty years of conscious brooding have brought me no nearer to the answer to the question, ‘What are light quanta?’ Nowadays, every Tom, Dick and Harry thinks he knows it, but he is mistaken. (Born, 1971 The Born-Einstein Letters)

In the ensuing fifty years, particle physicists arrived at a consensus describing photons (the so-called Standard Model). In genetics, the results from the Human Genome Project in 2000 provide a foundation on which to build future results. A clearer answer to the question ‘What is a gene?’ is emerging. The answer to this question will provide more accurate interpretations of the similarities and differences between individuals and species.