On Deriving the DNA Relationship Probability Table

Genetic genealogy research needs the distribution of shared segment lengths for each possible relationship type so that relationships can be predicted accurately from actual data. Two good examples are The Shared cM Project, created and maintained by Blaine Bettinger (updated on occasion), and the DNA Survey (live), which is derived from the DNA survey report files (dnasurvey-[timestamp].txt) that users submit through the DNA Survey Submission Form.

DNA Survey
The DNA Survey has been divided up into three tables for easier viewing.

Table 1 shows the ranges of DNA segment lengths for various calculated relationships as they compare to total chromosome lengths, aggregated from the DNA survey reports submitted by users. It includes a number of statistical fields: the minimum, median, and maximum values, the 25% and 75% quartiles, and the lower and upper cutoff values. The quartiles and cutoff values will be used when calculating standard deviations.
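The fields described for Table 1 can be sketched in a few lines of Python. This is only an illustration, not the code behind the DNA Survey: the function name `summarize`, the sample values, and the 1.5 multiplier on the quartile difference (the common Tukey convention) are all assumptions.

```python
import statistics

def summarize(values, k=1.5):
    """Return min, quartiles, median, max and quartile-based cutoffs.

    k is an assumed multiplier on the quartile difference; the source
    does not say which multiplier the survey uses.
    """
    data = sorted(values)
    # With n=4, quantiles() returns the 25%, 50% and 75% cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # difference between the quartiles
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "lower_cutoff": q1 - k * iqr,
        "upper_cutoff": q3 + k * iqr,
    }

# Hypothetical segment-length totals for one relationship (not survey data).
sample = [820, 845, 860, 875, 880, 890, 905, 920, 1400]
stats = summarize(sample)

# Entries outside the cutoffs are treated as outliers and dropped.
kept = [v for v in sample if stats["lower_cutoff"] <= v <= stats["upper_cutoff"]]
```

With these made-up values, the single entry of 1400 falls above the upper cutoff and is dropped, while the other eight entries are kept.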

Table 2 shows the calculated standard deviations for the same list of DNA relationships. It includes three different standard deviation ranges of DNA segment lengths for the various calculated relationships as they compare to total chromosome lengths. The delta (Δ), as I've defined it here, is the difference between the mean and the median. A large delta indicates a skewed distribution, caused by a large number of outliers falling to one side or the other of a normal distribution curve. The variance is also included and indicates how much outliers flatten a normal distribution curve. Sigma (σ) is the symbol for standard deviation and is calculated from the average and total values for each relationship. Because the sample sizes are still so small, significant delta and variance can be seen for some relationships.

Several techniques can be used to keep outliers out of the calculations. The first is to discard entries that fall several standard deviations from the average. This assumes that the data follow a normal distribution, that is, a symmetrical bell curve. That is a valid assumption, but only when the sample sizes are large; outliers in small samples skew the results significantly.

A second method is to extract the quartile values from the data, determine cutoff values based on their difference, and discard the entries falling outside the lower and upper limits.

A third method is to measure the variance in the data and, when it is above a threshold, use the median absolute deviation (MAD) instead of the standard deviation. The advantage of the MAD is that it reduces the effect of outliers on the calculated deviation, because in small samples outliers have significantly more influence on the average than they do on the median.
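A minimal Python sketch of the quantities and outlier techniques described for Table 2 follows. It is not the survey's actual code. The function names, the sign convention for delta (mean minus median), the use of the sample (n - 1) standard deviation, the clipping multiple, and the variance threshold are all assumptions made for illustration.

```python
import statistics

def describe(values):
    """Mean, median, delta, variance and sigma for one relationship."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    return {
        "mean": mean,
        "median": median,
        "delta": mean - median,                    # assumed sign convention
        "variance": statistics.variance(values),   # sample variance (assumed)
        "sigma": statistics.stdev(values),         # sample standard deviation
    }

def sigma_clip(values, n_sigma):
    """Discard entries more than n_sigma standard deviations from the mean."""
    mean = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= n_sigma * sigma]

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def robust_spread(values, variance_threshold):
    """Use the MAD when the variance exceeds a (hypothetical) threshold."""
    if statistics.variance(values) > variance_threshold:
        return median_absolute_deviation(values)
    return statistics.stdev(values)

# Hypothetical segment-length totals with one outlier (not survey data).
sample = [820, 845, 860, 875, 880, 890, 905, 920, 1400]
summary = describe(sample)
clipped_3 = sigma_clip(sample, 3.0)  # the outlier survives a 3-sigma cut
clipped_2 = sigma_clip(sample, 2.0)  # the outlier is removed at 2 sigma
mad = median_absolute_deviation(sample)
```

In this made-up sample of nine values, the single outlier inflates sigma so much that it still lies within three standard deviations of the mean, which illustrates why clipping by standard deviation is unreliable for small samples. The MAD is far less affected by the outlier than sigma is.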

Table 3 shows the ranges of actual shared DNA segment lengths that you can expect, based on the calculated standard deviations. A confidence level of 99.0% indicates that only 1 in 150 thousand DNA matches may fall outside the range; a confidence level of 99.9% indicates that only 1 in 1.75 million may fall outside it; and a confidence level of 100% indicates that only 1 in 10 million may fall outside it.
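The link between a standard deviation multiple and "1 in N" odds can be sketched as follows, assuming the segment lengths follow a normal distribution. The helper names are illustrative and not part of the DNA Survey tools. Under this assumption, odds of roughly 1 in 150 thousand and 1 in 1.75 million correspond to about 4.5 and 5 standard deviations; whether Table 3 uses exactly those multiples is not stated here.

```python
import math

def odds_outside(n_sigma):
    """Return N such that about 1 in N normally distributed values
    falls outside mean +/- n_sigma * sigma (both tails combined)."""
    return 1.0 / math.erfc(n_sigma / math.sqrt(2.0))

def expected_range(mean, sigma, n_sigma):
    """Lower and upper limits of the range mean +/- n_sigma * sigma."""
    return (mean - n_sigma * sigma, mean + n_sigma * sigma)

# Print the odds for a few multiples of sigma.
for k in (2.0, 3.0, 4.5, 5.0):
    print(f"{k} sigma: about 1 in {odds_outside(k):,.0f}")

# Hypothetical mean and sigma for one relationship (not survey data).
low, high = expected_range(880.0, 25.0, 2.0)
```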

Having summarized these methods, I should note that I am an engineer, not a statistician, and there may be other statistical tools that would better serve both small and large DNA samples. When enough data has been submitted by users, Gaussian distribution curves for each calculated DNA relationship can be determined and made available here along with the raw data (in real time), so that other software developers and I will be able to use them to create more accurate DNA relationship probability matrices. They should be more accurate because they will be based on actual user data that has been prevalidated before submission.