Year-end is a great time to take a look at our DNA matches and our organization/analysis tools and maybe do a little cleanup. I’ve begun playing with a free Windows-based utility by Jonathan Brecher—the Shared Clustering tool.
Now, there are a lot of clustering tools out there for those who are ready to try that—and not everyone is ready to rumble, I mean cluster. What I really appreciate about Shared Clustering right now is the automated [Note Update] function, which you can do with or without clustering. I can make my Ancestry match notes more meaningful and consistent and then bulk upload those same notes to other kits I manage too. I’ll walk you through the steps I took, and maybe you’ll get some ideas of how to tweak the process to be helpful to you.
You don’t have to do the clustering part of the tool, but it’s so simple to do, you might as well consider that too.
What is clustering?
Different clustering tools may work with different companies’ data. The Shared Clustering tool is specific to AncestryDNA matches, and it offers an automated kind of grouping. You know how Ancestry lets you manually put your matches into color-coded groups? For example, you might assign a different color to each of your four great-grandparent couples. You mark a known cousin in your match list with a particular color. If you’ve put your second cousin Juanita in the blue group, you can then use Ancestry’s Shared Matches to find more distant cousins related to you and Juanita both and add those cousins to the blue group too. (NOTE: Groups, or clusters, help us create or support theories about how cousins may be related. They can be very useful to pursue further, but we shouldn’t take them as standalone proof of how strangers are related.)
Why bother to group or cluster your matches? Here are a couple of ideas:
- When you get a new match that looks interesting, you can quickly glance at the shared matches and see if those cousins belong to a common group. If so, you may want to add the new match to that group and decide if you want to contact him or her. When reaching out to a stranger, it’s often more effective if you can say something specific like “Do you have ancestors from Steuben County, Indiana, in the late 1800s? I do, and I think we may be related on that line. I’m happy to share photos and newspaper clippings about any ancestors we might share….”
- You may decide you want to work on a particular ‘brick wall’. You can filter your matches to find everyone you believe you’re related to along that line, and work on their trees or engage them in conversation to see if you can develop new leads or evidence to move closer to a solution.
The Shared Match grouping described above is a manual process. Clustering is an automated way to group your matches according to perceived ancestral lines.
See image 1 to squint (LOL!) at a cropped section of the spreadsheet output of Jonathan Brecher’s Shared Clustering tool.
Now, you don’t have to do the Clustering piece to do the Update Notes piece. But why not give it a try too? Here are the steps I took to apply meaningful-to-me and consistent notes to hundreds of my AncestryDNA matches.
Step 1. Read all about Shared Clustering and then download the tool.
Brecher has excellent documentation here: https://github.com/jonathanbrecher/sharedclustering/wiki. See image 2 for an excerpt of that page. Then, to download the tool, you can click the download link shown by the first red arrow or the link found in the panel shown in the red circle below. Likewise, you can read the introduction from either of the two Introduction links shown.
Before executing the program, I recommend reading all the content linked in this wiki entry. (You can skip around if you want.) Brecher explains everything very clearly. For example, “Interpreting Clusters” will tell you what’s in each of the columns in the resulting spreadsheet. “Breaking Brick Walls” walks through an example of how one could use these clusters to develop new leads to identify the unknown parents of a 3rd great-grandparent.
Step 2. Run the tool to create the cluster spreadsheet.
In my initial pass, I set the variables to limit processing to my matches of 20+ cM, and the program completed running in just a few minutes. (You can set the tool to use ALL your Ancestry matches, which includes matches down to 6 cM. That process may take hours, and you’ll want to read the documentation to understand the pros and cons. I decided to try that later.) At the 20 cM threshold, my report produced 796 matches, exactly the number of 4th cousins Ancestry said I had. (But we all know Ancestry means projected 4th-6th cousins when it says that.) Shared Clustering assembled those matches into 44 clusters, with a handful of matches plotted between clusters but not assigned to one cluster number. The “correlated” column may provide a hint for those. A few matches did not end up in any cluster.
Among other valuable data, the application captures your current notes for each match in one of the spreadsheet columns.
Step 3. Decide on your desired standard note format.
Maybe you’ve already come up with a note format that works for you. I made some tweaks to what I had been using. I want to keep track of the last date I did something with the match. If I’ve identified our Most Recent Common Ancestor (MRCA), I want the note to carry our relationship and who the MRCA is. (And maybe even the ahnentafel number of those ancestors too.) If I don’t know the exact relationship but have an idea about the ancestral line, I want to see those possible surnames in the note. And at the end of the note, I may put miscellaneous content, such as whether the match is on GEDmatch or if I have a file on my computer with more information and correspondence, etc.
Because I am trialing this process, I decided to save the cluster number in my notes. (I also add a leading zero to single-digit cluster numbers, e.g. cluster01.) I know that future runs (such as when including matches less than 20 cM) may generate different cluster numbers, so I may revisit that decision later. Meanwhile, I maintain a document with observations and theories about each of those cluster numbers.
Step 4. Try to identify an ancestral line for each cluster.
The Shared Clustering spreadsheet has a column for common ancestor – if the match had a leaf, the common ancestors’ names will be there. If cluster37 has some matches showing John Darcy-Margaret Gleason common ancestors, I may enter Darcy and Gleason in the notes column of the spreadsheet for everyone in cluster37. That doesn’t mean that they all descend from the Darcy-Gleason couple. It would mean that I have a hypothesis that they all descend from an ancestor of John Darcy or Margaret Gleason. What if one match in cluster 37 shows John Darcy-Margaret Gleason and another shows Michael Darcy-Mary Tynan (John’s parents)? Does this mean they all match on Darcy DNA and not on Gleason DNA? I can’t say that for sure. This is something to explore later. For now, I’ll put Darcy, Gleason, and Tynan surnames in all the notes in cluster37 that might have DNA from any of those lines.
Some clusters didn’t have data in anyone’s common ancestor column. I spent a little extra time trying to figure those out (and didn’t always succeed). The spreadsheet tells you if the match has a private, unlinked, or public linked tree, how many people are in the tree, and even includes a tree link. I spot-checked some trees to try to figure out a common ancestral line. Even if I can just narrow it down to which grandparent line – that’s a start. I also looked for some match names at other companies where I have DNA results (23andMe, GEDmatch, etc.) to see if any of those cousins—like me—had results at more than one place. Shared matches in those other companies might point me to a closer relative that would reveal the line we share.
I haven’t figured out ancestral lines for all my clusters yet, but maybe 75-80% of them.
Step 5. Update the notes in the spreadsheet.
So far, we’re just updating data in a spreadsheet. It’s really not hard to enter or change a note or phrase in that column and then copy/paste it to the other notes in that cluster. I updated the notes column of all 796 entries in my spreadsheet while watching TV.
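If copy/paste gets tedious, the same fill-down could be scripted. Here is a minimal sketch in Python, assuming the spreadsheet has been saved as a CSV file. The column names `Cluster` and `Note` and the sample rows are made up for illustration, so check the actual headers in your own file before trying anything like this.

```python
import csv
import io

# Hypothetical sample data; the real column names in your export may differ.
SAMPLE = """Name,Cluster,Note
Cousin A,37,
Cousin B,37,old note
Cousin C,5,
"""

def fill_cluster_notes(rows, cluster, text):
    """Append text to the note of every row in the given cluster."""
    for row in rows:
        if row["Cluster"] == cluster:
            # Keep whatever note was already there, then add the new text.
            row["Note"] = " ".join(part for part in (row["Note"], text) if part)
    return rows

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
fill_cluster_notes(rows, "37", "#cluster37 #Darcy #Gleason #Tynan")
for row in rows:
    print(row["Name"], "|", row["Note"])
```

The sketch appends rather than overwrites, so any note a match already had stays in place.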
You’ll notice that some of my notes (Image 3) contain a hashtag #. I have the MedBetter Chrome extension that allows me to search my matches for any data preceded by a hashtag in the note.
Image 3. Updated notes in cluster spreadsheet
For example, I can filter my matches to show me only those that have #Darcy in the note, or #cluster37. (I may even decide to filter by certain cluster numbers and then assign a colored AncestryDNA dot to those ‘groups’.) You can read more about MedBetter here: https://dnasleuth.wordpress.com/2018/07/01/organizing-my-ancestrydna-matches/. (Caveat: sometimes I get no results when I try to search with the filter. So far, it’s always been pilot error, e.g. accidentally having a trailing space at the end of my search text, like “#Flynn ” instead of “#Flynn”.)
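The trailing-space gotcha is easy to see in a small sketch. This is only an illustration in plain Python with made-up notes, not MedBetter’s actual code: it matches a hashtag against the words in each note and trims stray spaces from the search text first.

```python
def matches_with_tag(notes, search_text):
    """Return the notes containing the hashtag, ignoring stray spaces in the search text."""
    tag = search_text.strip().lower()
    # Compare against whole words so "#Darcy" does not match "#Darcyson".
    return [n for n in notes if tag in n.lower().split()]

# Made-up notes for illustration only.
notes = [
    "2019-12-28 #cluster37 #Darcy #Gleason",
    "2019-12-28 #cluster05 #Flynn on GEDmatch",
]

print(matches_with_tag(notes, "#Flynn "))    # trailing space is stripped first
print(matches_with_tag(notes, "#cluster37"))
```

Without the `strip()` call, a search for “#Flynn ” would look for a word that ends in a space and find nothing.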
Here’s a summary of my new note structure.
- I start with the date. (If a match had an earlier date, I left it – it was the date I last exchanged a message with the match. If it had no date, I put today’s date. If I work with the match later–maybe because we exchanged messages–I’ll update the date.)
- If I have established our Most Recent Common Ancestor (MRCA), I add our relationship (e.g. 4c1r for 4th cousin once removed), followed by MRCA= and then that ancestral couple. I include the ahnentafel number in the note somewhere too.
- I add the cluster number for each. I have a working document on my computer with more observations about these clusters. (I used #clusterNF for matches where a cluster number was Not Found.)
- Surnames associated with this match come next.
- Lastly, if I had notes previously or want to add some now, they go here.
But do whatever works for you!
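For anyone who likes to see a format spelled out, the note structure above could be sketched as a small Python function. The date style, the `ahn` label, the spacing, and the function and parameter names are all assumptions made for this illustration; only the order of the pieces and the #cluster and #clusterNF tags come from the list above.

```python
from datetime import date

def build_note(when, cluster=None, relationship=None, mrca=None,
               ahnentafel=None, surnames=(), extra=""):
    """Assemble a match note: date, relationship/MRCA, cluster tag, surnames, extras."""
    parts = [when.isoformat()]
    if relationship and mrca:
        parts.append(f"{relationship} MRCA={mrca}")
    if ahnentafel:
        parts.append(f"ahn {ahnentafel}")
    # Two-digit cluster tag (cluster01), or clusterNF when no cluster was found.
    parts.append(f"#cluster{cluster:02d}" if cluster is not None else "#clusterNF")
    parts.extend(f"#{s}" for s in surnames)
    if extra:
        parts.append(extra)
    return " ".join(parts)

print(build_note(date(2019, 12, 28), cluster=37, relationship="4c1r",
                 mrca="John Darcy-Margaret Gleason", surnames=["Darcy", "Gleason"]))
print(build_note(date(2019, 12, 28), surnames=["Flynn"], extra="on GEDmatch"))
```

Keeping the pieces in a fixed order makes the notes easier to scan in the match list and easier to search by hashtag later.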
Step 6. Bulk upload the notes to Ancestry.
I’m really excited about the Shared Clustering application’s ability to upload all the notes in our spreadsheet back into Ancestry. In fact, I can even update that same note content on the matches of other kits I manage. For example, after I update all my match notes, I can run the tool to apply the same notes to the matches on my sister Alice’s kit. It was super fast and easy!
You can find simple instructions by clicking the link labeled Upload Notes tab at https://github.com/jonathanbrecher/sharedclustering/wiki. The process will create a log of before-and-after notes, so if for any reason you want to restore the original notes on any or all, it is doable.
See Image 4 for a look at the main panel of the Shared Clustering tool.
Running the Shared Clustering tool:
- Review the Introduction.
- Download your matches. You’ll be prompted to log into Ancestry from this screen and select the kit you want to download. When it finishes, you’ll be prompted to continue in Cluster tab. See the red arrow.
- Click [continue in Cluster tab]. (Or the Cluster tab at the top.)
- If you want to upload notes from the spreadsheet, you can select this tab and then select the kit whose notes you want to update. (Read the wiki instructions described above if you just want to update one or more kits’ notes without doing the clustering steps.)
On occasion, I have gotten an error message when running the Update Notes function. I haven’t quite figured out why yet. (Maybe Ancestry was having intermittent server issues?) But I’ve been able to run the Update Notes process successfully later.
Thank you, Jonathan Brecher, for making this amazingly helpful utility available to genealogists using Windows! Now that I’ve got my data organized and cleaned up, I’m ready to face the New Year and see how the spreadsheet and these notes may help me uncover new leads or evidence!
© 2019 Ann Raymont, CG®