When is a match a false positive?

“The best laid schemes o’ mice an’ men / Gang aft agley”, as the poet [Robert Burns] says.

In other words, herein lies what worked in my project to identify a good cM threshold for “false positives”… and what didn’t.

Just as many of our ancestors braved new frontiers, those of us who are trying to use DNA in our genealogy find ourselves in a new frontier too. There are some guidelines and best practices and methodologies that are widely recommended and followed, and some that are controversial or evolving. As citizen scientists, we are all welcome to conduct studies with our own family data and see how well it aligns with “what the experts say”.

One of the most basic questions in genetic genealogy is whether a matching DNA segment is “large enough” to investigate further. Small segments of matching DNA may be “false positives”. There are two primary reasons why an alleged match may be of no use:

  1. The segment could be passed down from an ancestor too far back to identify. Some refer to this as population genetics, i.e. a piece of DNA that everyone from a certain area may have inherited from a common ancestor several hundred years ago. AncestryDNA uses a proprietary algorithm to try to identify these “pileup regions” and remove them from our matches. The other companies at this time do not. This isn’t exactly an invalid match, but it’s considered a false positive because it’s not genealogically relevant.
  2. Genotyping errors may occur. DNA testing companies look at alleles at specific locations (markers) in our chromosomes and compare those values to another tester’s DNA to see whether enough consecutive identical alleles constitute a match. There are several ways this can go awry, [1] but the bigger the chromosome segment, the less likely it is that these errors produce a false match.
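
To make the idea of “enough consecutive identical alleles” concrete, here is a toy Python sketch. It is only an illustration under stated assumptions: the genotypes are invented, the function names are hypothetical, and real testing companies use their own proprietary thresholds measured in cM and SNP counts rather than a simple marker count. The sketch treats two testers as “half-identical” at a marker when they share at least one allele there, and reports the longest unbroken run of such markers.

```python
from typing import List, Tuple

# Two unordered alleles at one marker, e.g. ("A", "G"). Invented data.
Genotype = Tuple[str, str]


def half_identical(g1: Genotype, g2: Genotype) -> bool:
    """True if the two genotypes share at least one allele at this marker."""
    return bool(set(g1) & set(g2))


def longest_matching_run(a: List[Genotype], b: List[Genotype]) -> int:
    """Length of the longest run of consecutive half-identical markers."""
    best = current = 0
    for g1, g2 in zip(a, b):
        if half_identical(g1, g2):
            current += 1
            best = max(best, current)
        else:
            current = 0  # a mismatch breaks the run
    return best


# Hypothetical genotypes for two testers at five consecutive markers.
tester_a = [("A", "G"), ("C", "C"), ("T", "T"), ("A", "A"), ("G", "G")]
tester_b = [("G", "G"), ("C", "T"), ("A", "A"), ("A", "G"), ("G", "T")]

print(longest_matching_run(tester_a, tester_b))
```

Because a single shared allele is enough at each marker, short runs can line up by chance or through a misread marker, which is one way to picture why small segments are less trustworthy than large ones.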

Each testing company employs its own algorithms and thresholds to minimize false positives and yet not omit genuine matches. The International Society of Genetic Genealogy suggests that “In general, the larger the shared segments the more likely that the match is genuine. Half-identical matching segments of 15 cMs are mostly IBD [Identical by Descent, i.e. valid for genealogy], and the majority of matches between 10 cM and 15 cM are IBD. As the predicted matching segments get smaller the false positive rate increases.” [2]

One way genetic genealogists can check whether a match between A and B is genuine is to see if B also matches A’s mother or A’s father. (A doesn’t have any DNA that didn’t come from her mother or father, but a genotyping error can make it look as though she does.) If B does not match either of A’s parents, then the match may be a false positive. [3]
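
The parent check described above boils down to comparing lists of matches. Here is a minimal Python sketch of that comparison, assuming each match list is simply a collection of match identifiers; the function name and the sample IDs are hypothetical and not taken from any company’s export format.

```python
def split_by_parent_check(child_matches, mother_matches, father_matches):
    """Split a child's matches into those shared with a parent and those not.

    Returns (confirmed, unconfirmed) as sorted lists of match IDs.
    'Unconfirmed' matches are only *possible* false positives.
    """
    parent_matches = set(mother_matches) | set(father_matches)
    child = set(child_matches)
    confirmed = sorted(child & parent_matches)
    unconfirmed = sorted(child - parent_matches)
    return confirmed, unconfirmed


# Hypothetical match IDs for a two-parent-one-child trio.
child = ["m01", "m02", "m03", "m04"]
mother = ["m01", "m09"]
father = ["m03", "m07"]

confirmed, unconfirmed = split_by_parent_check(child, mother, father)
print("Shared with a parent:", confirmed)
print("Possible false positives:", unconfirmed)
```

A match that lands in the second list is a candidate false positive, not a proven one; it only shows that neither parent’s list contains that person.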

At the Salt Lake Institute of Genealogy in January 2016, Angie Bush reported the results of her study on the DNA matches of a two-parent-one-child trio. With FamilyTreeDNA data, she found that 94-95% of A’s matches over 10 cM were valid, i.e. they also matched one of A’s parents. However, of the matches where the largest segment was 7-10 cM, nearly a third of them did not match either parent: 32.4% were false positives. Unlike FamilyTreeDNA, AncestryDNA “phases” the data to improve the quality of the match list. At AncestryDNA, Angie’s test found that over 97% of the matches over 10 cM were shared with a parent, and over 91% of those between 7 cM and 10 cM matched a parent. At 6 to 7 cM, there was a serious drop-off: more than 80% were false positives. (Thanks, Angie, for permission to share!)

As a fledgling citizen scientist, I wanted to test this within my own family. Previously, I hadn’t seen any value in testing my kids, since they weren’t going to have any ancestors that their parents didn’t have. But now I was tempted to see what having a two-parent-one-child trio could tell me. I asked my son Dan if he would donate a DNA sample to the cause and he agreed. His father and I had already tested at AncestryDNA.

And here’s where my plans “gang agley”. AncestryDNA introduced their new V2 chip after I mailed in Dan’s DNA sample but before it was processed. In basic terms, this means that previously, with the V1 chip, AncestryDNA was looking at about 700,000 SNPs (Single Nucleotide Polymorphisms, i.e. markers or locations) on the chromosomes and comparing them to the same areas in the chromosomes of other testers. Now, though, AncestryDNA is looking at about 460,000 of the old SNPs and another 200,000+ new SNPs.

So it is hypothetically possible for Dan to have an IBD match (Identical by Descent, i.e. valid for genealogical purposes) with another recent tester who also tested on the V2 chip, even though that person does not match Dan’s parents, because the SNPs where they match were not processed for Dan’s mom or dad.

Using the DNAGedcom client, [4] I downloaded all of Dan’s matches. He had 4630 matches over 6 cM. Then I downloaded the parents’ matches and used Excel to strip out the duplicates, i.e. the matches of Dan’s that also matched one of his parents. The result?

2-Parent 1-Child DNA Analysis

Size of match   # of Dan’s matches   # that don’t match parents   % false
all             4630                 1635                         35.3 %
15+ cM          232                  4                            1.7 %
10-15 cM        690                  33                           4.8 %
7-10 cM         1846                 453                          24.5 %
6-7 cM          1862                 1145                         61.5 %
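
For anyone who wants to check the arithmetic, the “% false” column is just the number of matches that don’t match a parent divided by the number of Dan’s matches in that size range. The short Python sketch below recomputes it from the counts in the table; the variable and function names are illustrative only.

```python
# (number of Dan's matches, number that don't match either parent) per size range,
# copied from the table above.
bins = {
    "15+ cM": (232, 4),
    "10-15 cM": (690, 33),
    "7-10 cM": (1846, 453),
    "6-7 cM": (1862, 1145),
}


def false_rate(total: int, unmatched: int) -> float:
    """Percentage of matches in a size range that match neither parent."""
    return 100.0 * unmatched / total


# The "all" row is the sum of the four size ranges.
total_matches = sum(total for total, _ in bins.values())
total_unmatched = sum(unmatched for _, unmatched in bins.values())

for label, (total, unmatched) in bins.items():
    print(f"{label:>9}: {false_rate(total, unmatched):.1f} % false")
print(f"{'all':>9}: {false_rate(total_matches, total_unmatched):.1f} % false")
```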

It’s important to remember that some of those apparent false positives could hypothetically be genuine matches, if they tested on the V2 chip and matched Dan on SNPs that were not processed for Dan’s parents. I randomly messaged a dozen of these “false positive” matches over 10 cM. So far, only one has replied, and she confirmed that her DNA test was new, apparently with the V2 chip. It is premature to argue that this is an IBD match; more research is needed. But it may be possible. (ETA July 24, 2016: I heard from another match between 10-12 cM: she tested on the old V1 chip in January 2016 and matched Dan but neither parent. This is most likely a false positive.)

It’s not likely that most of the false positive matches in the chart are actually genuine because of the chip issue. I downloaded Dan’s data less than a month after AncestryDNA rolled out the V2 chip; only those testers who also had their DNA sampled that month on the new chip might be valid matches. In addition, those individuals would need to match Dan in the new SNP regions. Finally, the match would have to be free of genotyping errors, and at this time I can’t determine that without parent matches. Nevertheless, the chip change could mean that Dan’s false positive figures appear more pessimistic than actually warranted. This could be a reason why Angie Bush’s results with AncestryDNA data were more promising.

It’s also conceivable that some of Dan’s “false positives” are really “false negatives” in a parent, i.e. the parent really does match Dan’s match but some processing glitch on the parent’s kit makes it look like a mismatch. Since my goal is to identify a threshold where I am confident the match is reliable, I’m not going to worry about those.

Conclusions?

As noted at the beginning of this article, a false positive may be due to a DNA segment passed down from outside a genealogical timeframe, e.g. due to population genetics. AncestryDNA’s process attempts to mitigate this issue. The second problem is when genotyping errors make it appear that A and B match, when in fact, B doesn’t match either of A’s parents there, and therefore the match isn’t valid. Is there a certain match size threshold where we can rely on the match being IBD?

  • Like the ISOGG wiki, I’m very confident about matches over 15 cM. All of Dan’s matches over 15.5 cM matched a parent, regardless of the chip. The four matches between 15 and 15.5 cM that didn’t match a parent did log in the first week of June and may have matched Dan on the V2 new SNP regions. In any case, at least 98% of Dan’s matches over 15 cM matched a parent and appear genuine. This would be all projected relatives at AncestryDNA rated 4th-6th cousins and closer, and some distant cousins.
  • Between 10 and 15 cM, Dan’s AncestryDNA matches did match a parent over 95% of the time, regardless of chip. These are likewise worth my time to pursue, although the common ancestor may be further back than my tree goes.
  • Between 7 and 10 cM, about 75% of Dan’s matches appear genuine, and that figure could be low due to the chip change. In Angie Bush’s study, she found over 90% of the AncestryDNA matches in this range were genuine. At this time, I choose to pursue these only if they appear in a network (In Common With, Shared Matches, Triangulation Group, etc.) with other matches on a brick wall I’m currently investigating. With matches this size, I don’t assume it’s a reliable match, but it’s worth exploring further.
  • Between 6 and 7 cM, Dan’s results indicated these matches are more likely to be false than genuine. In Angie’s AncestryDNA results, fewer than 20% of them were genuine. Each genealogist has to decide for himself/herself whether these are worth chasing, when most of these predicted matches may actually be false; there is no one right answer for everyone. I don’t invest time on these.
  • These conclusions apply only to AncestryDNA results. While I have immediate family who have tested at FamilyTreeDNA and 23andMe, I don’t have two-parent-one-child trios at those companies.
  • It may be possible for someone to match a child but neither parent and still be a legitimate match—at least, if the child and the match tested on a different chip than the parents, and matched on the different SNP regions that weren’t used in the parents’ chip. There still may or may not be genotyping errors on these matches, esp. on the segment matches under 10 cM. More research is needed.
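
The working rules in the list above can be written down as a small decision function. The Python sketch below is just one way to express my own choices for AncestryDNA matches; the function name, the wording of the labels, and the handling of matches that fall exactly on a boundary (10 cM or 7 cM) are assumptions made for illustration.

```python
def triage(match_cm: float, in_network: bool = False) -> str:
    """Suggest what to do with an AncestryDNA match of the given size.

    in_network: True if the match appears in a network (e.g. Shared Matches)
    with other matches on a brick wall currently under investigation.
    """
    if match_cm >= 15:
        return "pursue: very likely genuine"
    if match_cm >= 10:
        return "pursue: likely genuine, common ancestor may be far back"
    if match_cm >= 7:
        if in_network:
            return "explore: not assumed reliable, but supported by a network"
        return "set aside for now"
    if match_cm >= 6:
        return "skip: more likely false than genuine"
    return "below the 6 cM range covered by this study"


for size, networked in [(22.0, False), (12.3, False), (8.1, True), (8.1, False), (6.4, False)]:
    print(f"{size:>5} cM, in network={networked}: {triage(size, networked)}")
```

Anyone with a different tolerance for chasing small matches could move the cut-offs; the point is only that the thresholds are explicit and easy to adjust.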

My family case study didn’t really change my thresholds for useful segments, but it gave me confidence that what some experts are recommending for reliable match size is in line with what I have seen in my own family. But it would have been more clear-cut if I could have had Dan’s DNA processed with the V1 chip! [mental hashtag #darn.that.V2.chip.timing!]

Citations

1. For example, see Ann Turner, “Satiable Curiosity: Identity Crisis: Identical by State or Identical by Descent?”, Journal of Genetic Genealogy, Fall 2011, vol. 7, (http://www.jogg.info/72/files/Turner.htm : accessed 7 July 2016).

2. See International Society of Genetic Genealogy Wiki, “Identical by Descent” (http://isogg.org/wiki/Identical_by_descent : accessed 25 Apr 2016) for more information.

3. See International Society of Genetic Genealogy Wiki, “False positive matches” (http://isogg.org/wiki/Identical_by_descent#False_positive_matches : accessed 25 Apr 2016) for more information.

4. Rob Warthen, DNAGedcom, “Welcome to the DNAGedcom Client” (https://www.dnagedcom.com/doc/welcome-to-the-dnagedcom-client/ : accessed 4 Jul 2016).

Ann Raymont © 2016
