
Prediction Bias: Difference between revisions

From BIF Guidelines Wiki
 
(25 intermediate revisions by 3 users not shown)
=Bias=
[[Category: Genetic Evaluation]]
Generally, bias is the systematic under- or overestimation of what is being estimated or predicted. Bias in [[Expected Progeny Difference | EPDs]] and [[Accuracy | accuracy values]] can arise from many sources, including selective reporting, inaccurate measurements, approximation methods, incorrect models, incorrect variance components, and others.
==Estimating Bias==


Let u be the true progeny difference (TPD) and u^ be our estimate (EPD). From this we could estimate the degree of bias in our estimate by  determining the difference in the mean u and mean u^. However, we never observe the TPD. Instead we estimate it using pedigree, performance, and genomic data.
If we had both the true progeny difference (TPD) and an estimate (EPD) of the TPD, then we could calculate the degree of bias in our estimate as the difference between the mean EPD and the mean TPD. However, we never observe the TPD. Instead, we estimate it using pedigree, performance, and genomic data.


We can approximate the degree of bias and under- or over-dispersion of EPD by using regression techniques. One way to do this is to regress the EPD with more information (e.g., genomic EPD) on the EPD with less information (e.g., pedigree-based EPD). Our expectation is that the intercept from this regression is 0 (no bias) and the slope of the regression is 1 (no over- or under-dispersion). Our expectations come from the theory of BLUP where u^ is an unbiased estimator of u and that
We can approximate the degree of bias and under- or over-dispersion of EPD by using regression techniques<ref>Reverter, A., B. L. Golden, R. M. Bourdon, and J. S. Brinks. 1994. Technical Note: Detection of Bias in Genetic Predictions. J. Anim. Sci. 72:34-37.</ref>
<ref>Legarra, A., and A. Reverter. 2018. Semi-parametric estimates of population accuracy and bias of predictions of breeding values and future phenotypes using the LR method. Genetics Selection Evolution. 50:53.</ref>. One way to do this is to regress the EPD with more information (e.g., genomic EPD) on the EPD with less information (e.g., pedigree-based EPD). Given the properties of [[Best Linear Unbiased Prediction]], our expectation is that the intercept from this regression is 0 (no bias) and the slope is 1 (no over- or under-dispersion).


Cov(EPD, EPD) / Var(EPD) = Cov(1/2 a, 1/2 a) / Var(1/2 a) = Var(1/2 a) / Var(1/2 a) = 1
A fundamental assumption is that the ratios of variance components used to generate both sets of EPD are the same. If they are not, the expected regression coefficient of 1 no longer holds.


A fundamental assumption is that the ratios of variance components used to generate both sets of EPD are the same. If they are not, the expected regression coefficient of 1 no longer holds.
Another approach is to regress phenotypes, corrected for systematic effects, on EPD. Here the expected regression coefficient is 2, because an EPD is half of the estimated breeding value (EBV). If EBV were used instead of EPD, the expected regression coefficient would be 1.


Another approach is to regress phenotypes after being corrected for systematic effects on EPD. Here the expectation of the regression coefficient is 2.  
A key assumption is that the phenotype of the individual is not included in the EPD of that individual. Consequently, this approach lends itself to cross-validation or forward-in-time validation strategies whereby some set(s) of animals have their phenotypes masked in the genetic evaluation.


Cov(corrected phenotype, EPD) / Var(EPD) = Cov(a + e, 1/2 a) / Var(1/2 a) = [1/2 Var(a)] / [1/4 Var(a)] = 2
In a similar fashion, average progeny performance (corrected for systematic effects) can be regressed on parent (sire) EPD. This is done annually at the US Meat Animal Research Center as part of the process to update across-breed EPD adjustment factors. The expected regression coefficient is 1 in this case, assuming that the progeny information used is not part of the sire's EPD. A regression coefficient of less than 1 suggests that the EPD are over-dispersed, meaning that a one-unit change in EPD will generate less than a one-unit change in average progeny phenotypes.


If EBV were used instead of EPD the expectation of the regression coefficient would be 1.
==Sources of Bias==


A key assumption is that the phenotype of the individual is not included in the EPD of that individual. Consequently, this approach lends itself to cross-validation or forward in time validation strategies whereby some set(s) of animals have their phenotypes masked in the genetic evaluation.
Bias generally arises from incomplete information. For example, if selection takes place early in life (e.g., based on weaning weight) such that a non-random group of animals is culled, then subsequent weight trait EPD (e.g., yearling weight) could be biased. This issue can be accommodated through the use of [[Multiple Trait Evaluation | Multiple-Trait Evaluation]]. Another example is incomplete recording of animals within a contemporary group. If only the heaviest animals are reported, then their performance relative to their contemporaries (e.g., contemporary group deviations) is biased downward because the observed average for the group is artificially inflated.


In similar fashion, average progeny performance (corrected for systematic effects) can be regressed on parent (sire) EPD. This is done annually at the US Meat Animal Research Center as part of the process to update across-breed EPD adjustment factors. The expectation of the regression coefficient is 1 in this case and assumes that the progeny information used is not part of the sire's EPD. A regression coefficient of less than 1 suggests that the EPD are over-dispersed meaning that a one unit change in EPD will generate less than a one unit change in progeny phenotypes.
==References==

Latest revision as of 21:20, 29 May 2021

Generally, bias is the systematic under- or overestimation of what is being estimated or predicted. Bias in EPDs and accuracy values can arise from many sources, including selective reporting, inaccurate measurements, approximation methods, incorrect models, incorrect variance components, and others.

Estimating Bias

If we had both the true progeny difference (TPD) and an estimate (EPD) of the TPD, then we could calculate the degree of bias in our estimate as the difference between the mean EPD and the mean TPD. However, we never observe the TPD. Instead, we estimate it using pedigree, performance, and genomic data.

We can approximate the degree of bias and under- or over-dispersion of EPD by using regression techniques[1][2]. One way to do this is to regress the EPD with more information (e.g., genomic EPD) on the EPD with less information (e.g., pedigree-based EPD). Given the properties of Best Linear Unbiased Prediction, our expectation is that the intercept from this regression is 0 (no bias) and the slope is 1 (no over- or under-dispersion).

A fundamental assumption is that the ratios of variance components used to generate both sets of EPD are the same. If they are not, the expected regression coefficient of 1 no longer holds.
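The regression of the better-informed EPD on the less-informed EPD can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the guideline: unrelated animals, a known mean of zero, two independent records per animal, and hypothetical variances (`var_a = 0.4`, `var_e = 0.6`). The function name `slope_full_on_partial` and its parameters are made up for the illustration.

```python
import numpy as np

def slope_full_on_partial(h2_used_for_partial, n=20000, var_a=0.4, var_e=0.6, seed=1):
    """Regress the EPD with more information on the EPD with less information."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, np.sqrt(var_a), n)        # true breeding values
    y1 = a + rng.normal(0.0, np.sqrt(var_e), n)   # record seen by both evaluations
    y2 = a + rng.normal(0.0, np.sqrt(var_e), n)   # extra record, full evaluation only

    # Less information: one record, shrunk by the heritability the evaluation assumes.
    epd_partial = 0.5 * h2_used_for_partial * y1

    # More information: mean of two records, shrunk with the correct variance ratio.
    b_full = var_a / (var_a + var_e / 2.0)
    epd_full = 0.5 * b_full * (y1 + y2) / 2.0

    # np.polyfit returns [slope, intercept] for a degree-1 fit of y on x.
    slope, intercept = np.polyfit(epd_partial, epd_full, 1)
    return slope, intercept

# Same (correct) variance ratio in both evaluations: slope near 1, intercept near 0.
slope_ok, intercept_ok = slope_full_on_partial(h2_used_for_partial=0.4)

# Partial evaluation assumes h2 = 0.6 while the true value is 0.4: slope drops below 1.
slope_bad, _ = slope_full_on_partial(h2_used_for_partial=0.6)

print(f"matching variance ratios:    slope={slope_ok:.3f}, intercept={intercept_ok:.3f}")
print(f"mismatched variance ratios:  slope={slope_bad:.3f}")
```

In this toy setting the mismatched case has an expected slope of 0.4 / 0.6, about 0.67, which shows how a difference in variance ratios alone can move the slope away from 1.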

Another approach is to regress phenotypes, corrected for systematic effects, on EPD. Here the expected regression coefficient is 2, because an EPD is half of the estimated breeding value (EBV). If EBV were used instead of EPD, the expected regression coefficient would be 1.

A key assumption is that the phenotype of the individual is not included in the EPD of that individual. Consequently, this approach lends itself to cross-validation or forward-in-time validation strategies whereby some set(s) of animals have their phenotypes masked in the genetic evaluation.
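A minimal simulated sketch of this validation follows, assuming unrelated animals, a known mean of zero, and hypothetical variances. Each animal has one record that enters the evaluation (`y_eval`) and one record that is withheld (`y_masked`); these names and the two-record design are assumptions made for the illustration, not a description of any breed association's procedure.

```python
import numpy as np

rng = np.random.default_rng(7)
n, var_a, var_e = 50000, 0.4, 0.6
h2 = var_a / (var_a + var_e)

a = rng.normal(0.0, np.sqrt(var_a), n)              # true breeding values
y_eval = a + rng.normal(0.0, np.sqrt(var_e), n)     # record used by the evaluation
y_masked = a + rng.normal(0.0, np.sqrt(var_e), n)   # record withheld for validation

ebv = h2 * y_eval    # single-record prediction of the breeding value
epd = 0.5 * ebv      # EPD is half the EBV

# Masked phenotype regressed on EPD and on EBV (np.polyfit returns [slope, intercept]).
slope_epd = np.polyfit(epd, y_masked, 1)[0]
slope_ebv = np.polyfit(ebv, y_masked, 1)[0]

# Violating the key assumption: the phenotype being regressed is the one inside the EPD.
slope_leak = np.polyfit(epd, y_eval, 1)[0]

print(f"masked phenotype on EPD: {slope_epd:.2f}")
print(f"masked phenotype on EBV: {slope_ebv:.2f}")
print(f"unmasked phenotype on EPD: {slope_leak:.2f}")
```

With the masked record the slopes land near 2 and 1. When the record is not masked, the phenotype is an exact multiple of the EPD in this toy model, so the slope becomes 2 / h2 = 5 and says nothing about bias.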

In a similar fashion, average progeny performance (corrected for systematic effects) can be regressed on parent (sire) EPD. This is done annually at the US Meat Animal Research Center as part of the process to update across-breed EPD adjustment factors. The expected regression coefficient is 1 in this case, assuming that the progeny information used is not part of the sire's EPD. A regression coefficient of less than 1 suggests that the EPD are over-dispersed, meaning that a one-unit change in EPD will generate less than a one-unit change in average progeny phenotypes.
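The progeny-average regression can be sketched the same way. The simulation below is an assumption-laden toy, not the US Meat Animal Research Center procedure: sires are unrelated, each sire's EPD comes only from his own record, dams are random and unrelated, and the variances and progeny counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(11)
n_sires, n_prog, var_a, var_e = 5000, 20, 0.4, 0.6
h2 = var_a / (var_a + var_e)

a_sire = rng.normal(0.0, np.sqrt(var_a), n_sires)            # true sire breeding values
y_sire = a_sire + rng.normal(0.0, np.sqrt(var_e), n_sires)   # sire's own record
sire_epd = 0.5 * h2 * y_sire                                 # EPD excludes all progeny data

# Progeny phenotype = half the sire's breeding value + everything else:
# dam contribution (0.25 var_a), Mendelian sampling (0.5 var_a), and environment (var_e).
rest = rng.normal(0.0, np.sqrt(0.75 * var_a + var_e), (n_sires, n_prog))
progeny_mean = (0.5 * a_sire[:, None] + rest).mean(axis=1)

# Average progeny performance regressed on sire EPD.
slope = np.polyfit(sire_epd, progeny_mean, 1)[0]

# Over-dispersed EPD: the same EPD stretched by 25 percent.
slope_inflated = np.polyfit(1.25 * sire_epd, progeny_mean, 1)[0]

print(f"slope on sire EPD:            {slope:.2f}")
print(f"slope on over-dispersed EPD:  {slope_inflated:.2f}")
```

Stretching the EPD by a factor of 1.25 divides the slope by 1.25, so the second regression falls to roughly 0.8: a one-unit difference in the inflated EPD corresponds to less than one unit of average progeny performance.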

Sources of Bias

Bias generally arises from incomplete information. For example, if selection takes place early in life (e.g., based on weaning weight) such that a non-random group of animals is culled, then subsequent weight trait EPD (e.g., yearling weight) could be biased. This issue can be accommodated through the use of Multiple-Trait Evaluation. Another example is incomplete recording of animals within a contemporary group. If only the heaviest animals are reported, then their performance relative to their contemporaries (e.g., contemporary group deviations) is biased downward because the observed average for the group is artificially inflated.
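The selective-reporting example can be worked through with a handful of numbers. The weights below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical weights for one complete contemporary group.
weights = np.array([500.0, 520.0, 540.0, 560.0, 580.0, 600.0])

# Only the three heaviest animals are reported.
reported = weights[weights >= 560.0]

true_dev = reported - weights.mean()        # deviations from the full group mean (550)
observed_dev = reported - reported.mean()   # deviations from the reported mean (580)
bias = observed_dev - true_dev

print("true deviations:    ", true_dev)      # [10. 30. 50.]
print("observed deviations:", observed_dev)  # [-20.   0.  20.]
print("bias:               ", bias)          # [-30. -30. -30.]
```

Every reported animal loses 30 units of contemporary group deviation, which is exactly the amount by which the reported group mean exceeds the mean of the complete group.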

References

  1. Reverter, A., B. L. Golden, R. M. Bourdon, and J. S. Brinks. 1994. Technical Note: Detection of Bias in Genetic Predictions. J. Anim. Sci. 72:34-37.
  2. Legarra, A., and A. Reverter. 2018. Semi-parametric estimates of population accuracy and bias of predictions of breeding values and future phenotypes using the LR method. Genetics Selection Evolution. 50:53.