You may know this painting, one of over two dozen haystack paintings (actually, stacks of wheat) that Claude Monet produced at various times and seasons. What if we hid a needle in one of them? Could we find it? In genetics, we often say that finding a genetic variant is like finding a needle in a haystack. But how big is the needle? How big is the haystack? Who believes this stuff?
Francis Collins, NIH director and leader of the Human Genome Project, is a clear believer. In a recent PBS series by Ken Burns, The Gene: An Intimate History, Dr. Collins said that finding a misspelling in a gene that causes a particular disease is “a needle in a haystack.”
So, in honor of April 25th, National DNA Day, which commemorates the discovery of DNA’s double helix in 1953 and the completion of the Human Genome Project in 2003, I embarked on a curious journey to answer these questions. My approach would be to determine the volume of a needle and then figure out how large to make the haystack to see if the analogy holds.
What is the volume of a needle?
How do you calculate the volume of a needle? It is useful to start with some real needles, and it turned out that I had a small box of 30 assorted needles. Here’s a photo of them:
Needles come in all kinds of lengths, but I was looking for an “average” needle, so I made a rough list of their lengths and came up with this:
Average needle length: sum(5 x 30 mm, 10 x 35 mm, 11 x 40 mm, 1 x 45 mm, 1 x 50 mm, 1 x 55 mm, 1 x 60 mm) / 30 = 37 mm or 1.4 inches
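For readers who want to check the averaging step, here is a minimal Python sketch of a count-weighted mean. The tally is copied from the list above; the names needle_tally and weighted_mean_length are my own, chosen for illustration. Taking the tallies exactly as listed, the mean comes out a little over 38 mm, within about a millimeter of the rounded figure quoted here, and the difference does not change the volume estimates that follow.

```python
# Tally of needle lengths from the rough list above: {length in mm: count}.
needle_tally = {30: 5, 35: 10, 40: 11, 45: 1, 50: 1, 55: 1, 60: 1}

MM_PER_INCH = 25.4


def weighted_mean_length(tally):
    """Return the count-weighted mean of the lengths in the tally."""
    total_count = sum(tally.values())
    total_length = sum(length * count for length, count in tally.items())
    return total_length / total_count


mean_mm = weighted_mean_length(needle_tally)
print(f"needles counted: {sum(needle_tally.values())}")
print(f"mean length: {mean_mm:.1f} mm ({mean_mm / MM_PER_INCH:.2f} in)")
```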
Volume of a needle: displacement method
Next, I had to figure out the volume (V) of a needle. My first approach was to drop the needles into a graduated cylinder filled with water and then measure the displacement.
It turns out that 30 needles displaced 0.5 ml of water, so we divide by 30 to get the average volume of the needles in my sample.
V = 0.5 ml / 30 = 0.017 ml or about 0.02 ml per needle
Can we find a different method to measure the volume of the needle and compare results?
Volume of a needle: small cylinder
Another way to approximate the volume of the needle is to treat it as a small cylinder and use the formula V = pi * r^2 * h, where r is the radius and h is the height (here, the needle’s length). Because we want the volume in cubic centimeters (cc), we express the measurements in centimeters: a diameter of 0.075 cm and the average length of 3.7 cm.
V = pi * (0.075 cm / 2)^2 * 3.7 cm = 0.016 cc, or about 0.02 cc
Since 1 ml = 1 cc, we can say that the volume of our needle is 0.02 ml, which is spot on with our previous measurement. Note: It is rare in science that your numbers match exactly, but we’ll take the win and start building our haystack.
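The two needle estimates can be reproduced side by side. This is only a sketch using the figures from the text (0.5 ml displaced by 30 needles, a 0.075 cm diameter, a 3.7 cm length); the function name cylinder_volume_cc is an illustrative choice of mine.

```python
import math


def cylinder_volume_cc(diameter_cm, length_cm):
    """Volume of a cylinder in cubic centimeters: pi * r^2 * h."""
    radius_cm = diameter_cm / 2
    return math.pi * radius_cm ** 2 * length_cm


# Method 1: water displacement, averaged over the whole box of needles.
displaced_ml = 0.5
needle_count = 30
volume_by_displacement_ml = displaced_ml / needle_count

# Method 2: idealize one needle as a small cylinder (1 cc equals 1 ml).
volume_by_cylinder_cc = cylinder_volume_cc(diameter_cm=0.075, length_cm=3.7)

print(f"displacement: {volume_by_displacement_ml:.4f} ml per needle")
print(f"cylinder:     {volume_by_cylinder_cc:.4f} cc per needle")
```

Both values round to 0.02 at two decimal places, which is the figure carried forward.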
Building a genetic haystack
Our next task is to build a haystack that approximates the size of the human genome, where each A-T or C-G base pair in the human genome is the size of a needle. The Human Genome Project pegs that number at around 3 billion base pairs within one copy of a single genome.
We build our genetic haystack by defining it as a volume having 3 billion opportunities to find a 0.02 ml object, or volume (V) equals 3,000,000,000 * 0.02 ml.
V = (3*10^9 * 0.02 ml) = 60*10^6 ml or 60,000,000 ml
Since 1 ml = 1 cubic centimeter (cc), the volume can also be stated as 60,000,000 cubic centimeters or 60*10^6 cc. Finally, 1 liter = 1000 ml, so the volume in liters equals 60,000,000 ml * (1 liter / 1000 ml).
V (haystack) = (60*10^6 ml) * (1 liter / 1000 ml) = 60*10^3 liters or 60,000 liters
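The scaling and unit conversion above fit in a few lines of Python. The constants are the article's own figures (3 billion base pairs, 0.02 ml per needle); the variable names are illustrative.

```python
BASE_PAIRS = 3_000_000_000   # approximate size of one copy of the human genome
NEEDLE_VOLUME_ML = 0.02      # rounded volume of one needle
ML_PER_LITER = 1000

# One needle-sized slot per base pair.
haystack_ml = BASE_PAIRS * NEEDLE_VOLUME_ML
haystack_liters = haystack_ml / ML_PER_LITER

print(f"haystack volume: {haystack_ml:,.0f} ml = {haystack_liters:,.0f} liters")
```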
We now know how much space it occupies, but what does it look like? For example, how tall is our haystack?
Shape of a genetic haystack: approximation with a hemisphere
Like the needle, we can approximate the haystack with a shape that is easy to calculate, such as a hemisphere, whose volume in cubic centimeters is V = 2/3 * pi * r^3.
V (hemisphere) = 60*10^6 cc = 2/3 * pi * r^3
Solving for the radius (r), we get r = 306 cm. Since 1 m = 100 cm, the height (radius) of the hemisphere equals 306 cm * (1 m / 100 cm).
r (hemisphere) = 306 cm * (1 m / 100 cm) = about 3 meters or 10 feet tall
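Solving for the radius means rearranging V = 2/3 * pi * r^3 into r = (3V / (2 * pi))^(1/3). A short Python sketch of that step follows; the function name hemisphere_radius_cm is mine, and the conversion factor of 0.3048 meters per foot is the standard definition.

```python
import math

METERS_PER_FOOT = 0.3048


def hemisphere_radius_cm(volume_cc):
    """Invert V = (2/3) * pi * r^3 to get the radius in centimeters."""
    return (3 * volume_cc / (2 * math.pi)) ** (1 / 3)


radius_cm = hemisphere_radius_cm(60e6)   # haystack volume of 60*10^6 cc
radius_m = radius_cm / 100
radius_ft = radius_m / METERS_PER_FOOT

print(f"radius: {radius_cm:.0f} cm = {radius_m:.2f} m = {radius_ft:.1f} ft")
```

For a hemisphere resting on its flat side, the radius is also the height, which is why the text treats the two as the same number.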
So, we have a rough idea of what an idealized haystack looks like. It is 3 meters tall and it is shaped like a hemisphere. But how accurate is our estimate? Can we do better? Has anyone studied haystack modeling? Indeed, someone has, and his name is W.H. Hosterman.
Shape of a genetic haystack: using haystack modeling
In 1931, W.H. Hosterman from the U.S. Department of Agriculture published an extensive technical bulletin “for the purpose of determining the volume and tonnage of hay.” Hosterman and his USDA colleagues measured over 2,600 haystacks across 10 states and presented results for both square and round haystacks. For round haystacks, Hosterman derived this formula: V = ((0.04 * Over) - (0.012 * C)) * C^2, where C is the circumference of the haystack at its base, and Over is the distance measured over the top of the stack, from the ground on one side to the ground on the other. Both are measured in feet, and V is in cubic feet. We can compare results with the hemisphere by selecting equivalent values from Table 7 in Hosterman’s paper. (We will momentarily switch to imperial units to make use of the constants in the formula.)
V (Hosterman) = ((0.04 * 32 ft) - (0.012 * 62 ft)) * (62 ft)^2 = 2,060 cubic feet or 58,333 liters
Hosterman’s haystacks were observed to be slightly taller than wide, but their sloped tops make the volume of the “true” haystack (58,333 liters) a little less than our original estimate of a 3-meter-tall, 60,000-liter haystack. Comparing the volume from Hosterman’s formula with the volume of the hemisphere, we see that they differ by 1,667 liters, or about 3%. Close enough!
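Here is a small Python sketch of the round-stack formula as quoted above, with the comparison against the hemisphere. The function and variable names are my own, and the conversion factor (about 28.3168 liters per cubic foot) is the standard one. Because the code does not round the cubic-feet figure before converting, its liter value lands a few liters away from the 58,333 quoted in the text; the gap to the hemisphere still comes out at about 3%.

```python
LITERS_PER_CUBIC_FOOT = 28.316846592


def round_stack_volume_cubic_feet(over_ft, circumference_ft):
    """Round-haystack volume from the formula quoted in the text.

    over_ft: distance over the top of the stack, in feet.
    circumference_ft: circumference of the stack, in feet.
    """
    return (0.04 * over_ft - 0.012 * circumference_ft) * circumference_ft ** 2


hosterman_cubic_feet = round_stack_volume_cubic_feet(over_ft=32, circumference_ft=62)
hosterman_liters = hosterman_cubic_feet * LITERS_PER_CUBIC_FOOT

hemisphere_liters = 60_000  # the idealized haystack from the earlier estimate
difference_liters = hemisphere_liters - hosterman_liters
percent_difference = 100 * difference_liters / hemisphere_liters

print(f"Hosterman:  {hosterman_cubic_feet:,.0f} cubic feet = {hosterman_liters:,.0f} liters")
print(f"difference: {difference_liters:,.0f} liters ({percent_difference:.1f}%)")
```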
Genetics is like finding a needle in a haystack
We have two comparisons that match closely. So, what does that haystack look like?
It turns out that Romanian haystacks are about 3 meters tall, so finding a misspelling in a gene (a genetic variant) is indeed like finding a needle in a haystack.
Who looks for needles in haystacks?
If genetics truly is like finding a needle in a haystack, what kinds of people do this? Well, let’s go back to Francis Collins.
In 1993, Collins published a paper describing the genetics of cystic fibrosis, a disease that required finding “a needle in a haystack.” Today, we have found needles from thousands of genetic haystacks. Some of these needles lead to clues about rare diseases, which affect more than 400 million people worldwide.
In real life, you can find a needle in a haystack, too. In 2014, performance artist Sven Sachsalber found a needle in a haystack after hunting for it for 24 hours. If we could get computers to solve diseases at that speed, we would happily be out of work.
OK, back to finding more needles…