Content - 6b21af35cdae1923b7f085e13a3ae6d14b069130 - d3f9418/vignettes/HardyWeinberg.Snw

visit type:
Tip revision: 074be7ceb0152257617689ff9e44df8b5d70c754 authored by Jan Graffelman on 20 March 2009, 00:00:00 UTC
version 1.4.1
Tip revision: 074be7c
HardyWeinberg.Snw
\documentclass[a4paper]{article}

% \VignetteIndexEntry{A R package for performing Graphical tests for Hardy-Weinberg Equilibrium}
% \VignetteDepends{graphics,stats}
% \VignetteKeyword{aplot}

% Documentation for the HardyWeinberg package

\usepackage[english]{babel}
\usepackage{Sweave}
\usepackage{Rd}
\usepackage{url}
\usepackage{hyperref}
\usepackage[utf8]{inputenc}

\setlength{\parindent}{0cm}

\begin{document}

\begin{center}
\sf
{\sf \bf \Large The {\tt HardyWeinberg} Package}\\
\vspace{4mm}
{\sf \normalsize {\tt version 1.4.1}}\\
%\normalsize
\vspace{4mm}
{\bf \large Jan Graffelman}\\
\vspace{4mm} \rm \large
Department of Statistics and Operations Research\\
Universitat Polit\`ecnica de Catalunya\\
Avinguda Diagonal 647, 08028 Barcelona, Spain.\\
{\it email:} jan.graffelman@upc.edu\\
\vspace{4mm}
{\sc November 2009}
\end{center}

\section{Introduction}

This guide gives some instructions on how to perform graphical significance tests for Hardy-Weinberg
equilibrium (HWE) by depicting the acceptance region for HWE in a ternary plot with routines from
the package {\tt HardyWeinberg}. The outline of this guide is as follows. Section~2 
describes how the R package {\tt HardyWeinberg} can be installed. Section~3 shows 
how to perform some of the classical tests for Hardy-Weinberg equilibrium with routines from the package. 
Finally, Section~4 shows how to contruct ternary plots with the HW acceptance region 
and how to perform graphical tests for HWE. We refer to Graffelman \& Morales~(2008) 
for the theoretical foundation of the graphical tests. If you appreciate this software then please 
cite the following paper in your work:\\

Graffelman, J. \& Morales-Camarena, J. (2008) Graphical tests for Hardy-Weinberg equilibrium
based on the ternary plot. {\it Human Heredity} {\bf 65}(2): 77-84. 
\href{http://dx.doi.org/10.1159/000108939}{(clic here to access the paper)}

\section{Installation}
\label{sec:install}

Packages in R can be installed inside the program with the option "Packages"
in the main menu and then choosing "Install package" and picking the package
"HardyWeinberg". Typing:

<<fig=FALSE,echo=TRUE>>=
library(HardyWeinberg)
@

will make, among others, the functions {\tt HWChisq, HWData, HWExact, HWLratio} and {\tt HWTernaryPlot} available. 
Changes made over the different versions of {\tt HardyWeinberg} are detailed in an appendix below.

\section{Classical tests for Hardy-Weinberg equilibrium}
\label{sec:classical}

We show how to perform several classical tests for Hardy-Weinberg equilibrium. As an example we
use a sample of 1000 individuals genotyped for the MN blood group locus described by
Hedrick~(2005, Table 2.4). We store the genotypic counts (298, 489 and 213 for MM, MN and NN
respectively) in a vector {\tt x}:

<<fig=FALSE,echo=TRUE>>=
x <- c(298,489,213)
HW.test <- HWChisq(x,verbose=TRUE)
@

This shows that the $\chi^2$-statistic has value \Sexpr{round(HW.test$chisq,digits=4)}, and that the corresponding p-value for 
the test is \Sexpr{round(HW.test$pval,digits=4)}. Taking a significance level of $\alpha = 0.05$, we do not reject HWE for
the MN locus. When {\tt verbose} is set to {\tt FALSE} (default) the test is silent, and {\tt HW.test}
is a list containg the results of the test ($\chi^2$-statistic, the p-value of the test, half the deviation from 
HWE ({\tt D}) for the heterozygote ($D = \frac{1}{2} (f_{AB} - e_{AB}$)) 
and the allele frequency ({\tt p}) of M).

<<fig=FALSE,echo=TRUE>>=
HW.test <- HWChisq(x)
print(HW.test)
@

By default, {\tt HWChisq} applies a continuity correction, expect for very small minor allele frequencies. In
order to perform a $\chi^2$-test without Yates' continuity correction, it is necessary to set the {\tt cc} parameter to zero:

<<fig=FALSE,echo=TRUE>>=
HW.test <- HWChisq(x,cc=0,verbose=TRUE)
@

The test with correction gives a smaller $\chi^2$-statistic and a larger p-value in comparison with the 
ordinary $\chi^2$ test. The likelihood ratio test~(Weir, 1996, Chapter 3) for HWE can be performed by typing

<<fig=FALSE,echo=TRUE>>=
HW.lrtest <- HWLratio(x,verbose=TRUE)
@

Note that the $G^2$-statistic and the p-value obtained are very close to the $\chi^2$-statistic
and its p-value. An exact test for HWE can be performed by using routine {\tt HWExact}. 

<<fig=FALSE,echo=TRUE>>=
HW.exacttest <- HWExact(x,verbose=TRUE)
@

The exact test leads to the same conclusion, we do not reject HWE (\Sexpr{round(HW.exacttest$pval,digits=4)}) 
Both one-sided and two-sided exact tests are possible by using the argument {\tt alternative}, which can
be set to {\tt two.sided}, {\tt greater}, or {\tt less}. Two different ways of computing the
p-value of an exact test are implemented, and can be specified by the {\tt pvaluetype} argument, which
can be set to {\tt dost} (double one-sided tail probability) or {\tt selome} (sum equally likely or more
extreme).\\

All routines {\tt HWChisq, HWExact} and {\tt HWLratio} assume that the data are supplied as a vector of genotypic 
counts listed in order (AA,AB,BB).\\

Often many markers are tested for HWE. If the genotype counts AA, AB, BB are collected in a $m \times 3$
matrix, with each row representing a marker, then HWE tests can be run over each row in the matrix by
the routines {\tt HWChisqMat} and {\tt HWExactMat}. These routines return a list with the p-values and
test statistics for each marker.

\section{Graphical tests for Hardy-Weinberg equilibrium}
\label{sec:graphical}

This section shows how to create ternary plots for a database of marker data (e.g.\ SNPs) and shows
how the depict the acceptance region for HWE in the ternary plot, using different tests. 

\subsection{Simulated data}

An example with simulated data follows below. We obtain $m=100$ markers for $n=100$ individuals
by taking random samples from a multinomial distribution with $\theta_{AA} = p^2, \hspace{1mm} \theta_{AB} = 2pq, 
\hspace{1mm}$ and $\theta_{BB} = q^2$. This is done by routine {\tt HWData}, which can generate data sets
that are in Hardy-Weinberg equilibrium. Routine {\tt HWData} can generate data that are in exact
equilibrium ({\tt exactequilibrium = TRUE}) or that are generated from a multinomial distribution (default).
{\tt HWData} returns a list with both the matrix of genotypic counts {\tt Xt} and the matrix with genotypic 
compositions {\tt Xc} with the relative frequencies of AA, AB and BB.

<<fig=FALSE,echo=TRUE,results=hide>>=
set.seed(123)

m <- 100 # number of markers
n <- 100 # sample size
out <- HWData(n,m)
Xc <- out$Xc
Xt <- out$Xt
@

A ternary plot of the marker data can be obtained by:

\begin{figure}[htb]
\centering
<<fig=TRUE,echo=TRUE>>=
out <- HWTernaryPlot(Xt,region=0,hwcurve=FALSE,vbounds=FALSE,vertex.cex=1.25)
@
\caption{Ternary plot of 100 simulated SNPs for 100 individuals.}
\end{figure}
\clearpage

Next, we create four different ternary plots for the simulated marker data shown in Figure~\ref{fig:ternplots}. Panel
(a) depicts the 100 genotypic compositions in a ternary plot with the HWE curve. Note the marked curvature
in the cloud of points. Panel (b) shows a nicer ternary plot with the HWE curve and the acceptance region 
for HWE according to an ordinary $\chi^2$-test. Green markers are not significant, red markers significant 
($\alpha = 0.05$). Some markers show up significant. Panel (c) shows the same data, but the acceptance region 
represented corresponds
to a $\chi^2$-test with continuity correction ({\tt cc = 0.5}), with separate curves for $D>0$ and $D<0$. 
Some markers previously significant markers now turn
up insignificant. Panel (d) shows the acceptance region for the exact test. This option takes
more computer time. The acceptance region is bounded by a zig-zag line that connects those samples
that are just significant for the given allele frequency.

\begin{figure}[htb]
\centering
\includegraphics[width=12cm,height=12cm]{HWTernaryPlots.pdf}
\caption{Ternary plot of 100 simulated SNPs for 100 individuals. (a): ordinary ternary plot, (b): 
with $\chi^2$-acceptance region, (c) with acceptance region for $\chi^2$-test with continuity correction, 
(d): with acceptance region for two-tailed exact test.}\label{fig:ternplots}
\end{figure}

Routine {\tt HWData} can simulate genotypic counts under several conditions. A fixed allele frequency
can be specified by setting {\tt pfixed=TRUE}, and setting {\tt p} to the desired allele frequency. Sampling
is then according to Haldane's exact distribution. If {\tt pfixed} is {\tt FALSE}, the given {\tt p} will be used in 
sampling from the multinomial distribution. If {\tt p} is not specified, {\tt p} will be drawn from a uniform
distribution, and genotypes are drawn from a multinomial distribution with varying allele frequency. It is
also possible to generate data under inbreeding, by specifying the inbreeding coefficient {\tt f}. Inbreeding
with a fixed allele frequecy has not been implemented yet. The effects of the different options for {\tt HWData}
are shown in the ternary plots below.

<<fig=FALSE,echo=TRUE,results=hide>>=
n <- 100
m <- 100
X1 <- HWData(n,m,p=0.5,pfixed=TRUE)$Xt
X2 <- HWData(n,m,p=0.5)$Xt
X3 <- HWData(n,m)$Xt
X4 <- HWData(n,m,p=0.5,f=0.5,pfixed=TRUE)$Xt
X5 <- HWData(n,m,p=0.5,f=0.5,)$Xt
X6 <- HWData(n,m,f=0.5)$Xt
@  

\begin{figure}[htb]
\centering
<<fig=TRUE,echo=FALSE,results=hide>>=
opar <- par(mfrow=c(3,2),mar=c(3,5,3,1)+0.1,mex=0.75,oma=c(2,0,2,0),new=TRUE)
par(mfg=c(1,1))
HWTernaryPlot(X1,main="(a)")
par(mfg=c(2,1))
HWTernaryPlot(X2,main="(b)")
par(mfg=c(3,1))
HWTernaryPlot(X3,main="(c)")
par(mfg=c(2,2))
HWTernaryPlot(X5,main="(e)")
par(mfg=c(3,2))
HWTernaryPlot(X6,main="(f)")
par(opar)
@ 
\caption{Ternary plots for markers simulated under different conditions. (a) fixed allele frequency, using
Haldane's distribution. (b) multinomial sampling for p=0.5. (c) uniform random allele frequency. (d) not
implemented yet. (e) multinomial sampling with inbreeding. (f) random allele frequency with inbreeding.}
\end{figure}
\clearpage

\subsection{Empirical data}

Empirical data sets of genetic markers (e.g.\ SNPs) typically contain considerable amounts of
missing data. Consequently, the number of observations (sample size) varies from marker to marker.
This makes it more difficult to draw a reasonable acceptance region for HWE in the ternary plot, because
the region depends on sample size. Arguments {\tt n} and {\tt ssf} can be used to control the sample
size that is used for drawing the acceptance region. If a sample size {\tt n} is explicitly supplied
(e.g.\ {\tt n=100}) then that sample size will be used for drawing the region, disregarding the real
sample sizes from the data. If {\tt n} is not given, then the sample size is computed from the matrix of
counts by the function given by {\tt ssf} ({\tt ssf = "max"} by default). One can set {\tt ssf} to other functions 
such as {\tt min}, {\tt mean} or {\tt median} and then the minimum, mean or median of the sample sizes 
over the markers will be used to draw the acceptance region. An example with some SNP data follows below.
We have a vector of genotypics counts, giving the triples (AA,AB,BB), and turn these into a $m \times 3$
matrix, where each row represents a sample. The sample size vary from 20 (minimum) to 29 (maximum). We
may use the median sample size for drawing acceptance regions by specifying {\tt ssf="median"}.
More examples of ternary plots with human SNP data are given in Graffelman \& Morales~(2008).

\begin{figure}[htb]
\centering
<<fig=TRUE,echo=TRUE>>=
X <- c(21,12,3,7,7,0,7,0,0,0,10,4,16,12,4,0,2,4,11,0,6,18,15,
16,1,1,13,18,0,1,15,13,9,19,5,3,6,2,0,3,10,9,10,6,0,0,
0,24,21,17,21,1,0,0,0,25,3,14,0,0,14,16,22,0,0,17,16,17,2,
2,16,16,2,16,0,2,26,0,0,2,0,14,22,0,24,25,25,12,14,6,1,8,
11,13,14,14,5,8,4,6,5,13,11,4,11,12,3,12,15,11,6,11,8,10,12,
11,6,12,9,7,7,6,7,15,7,3,12,8,15,16,9,13,16,15,17,6,3,3,
4,8,10,8,9,7,5,7,4,7,11,4,4,12,10,7,4,5,10,10,12,9,10,
11,11,12,10,12,13,3,3,4,12,3,13,7,7,3,4,4,14,13,12,17,0,2,
13,8,8,23,5,23,23,24,6,13,0,2,12,26,15,10,3,22,12,2,4,0,16,
14,3,1,19,21,3,2,4,1,12,12,15,11,13,12,3,3,4,5,21,26,26,0,
0,1,0,17,22,23,20,0,15,3,24,23,3,3,0,21,22,0,2,0,16,16,2,
2,14,2,17,14,0,26,25,13,24,2,0,22,1,0,0,3,2,11,11)
X <- matrix(X,byrow=FALSE,ncol=3)
colnames(X) <- c("AA","AB","BB")
samplesizes <- apply(X,1,sum)
print(summary(samplesizes))
Res <- HWTernaryPlot(X,region=1,vbounds=FALSE,ssf="median")
@
\caption{Ternary plot with acceptance region based on median sample size.}
\end{figure}
\clearpage

For large databases of SNPs, drawing the ternary plot can be time consuming. Usually the matrix with genotypic
counts contains several rows with the same counts. The ternary plot can be constructed faster by plotting only
the unique rows of the count matrix. Function {\tt UniqueGenotypeCounts} extracts the unique rows of the count
matrix and also counts their frequency.

\section*{Acknowledgements}

This work was partially supported by grants SEJ2006-13537 and CODA-RSS MTM2009-13272 by the Spanish Ministry 
of Education and Science. This document was generated by Sweave~(Leisch, 2002).

\section{References}

Elston, R. C. \& Forthofer, R. (1977) Testing for Hardy-Weinberg equilibrium in small samples,
Biometrics 33(3) pp. 536-542.\\

Graffelman, J. \& Morales-Camarena, J. (2008) Graphical tests for Hardy-Weinberg equilibrium
based on the ternary plot. {\it Human Heredity} {\bf 65}(2): 77-84.\\

Hedrick, P. W. (2005) {\it Genetics of Populations}. Third edition. Jones and Bartlett Publishers,
Sudbury, Massachusetts.\\

Leisch, F. (2002) Sweave: Dynamic generation of statistical reports using literate data analysis.
{\it Compstat 2002, Proceedings in Computational Statistics}. pp. 575-580, Physica Verlag, Heidelberg.
ISBN 3-7908-1517-9 URL http:/www.ci.tuwien.ac.at/~leisch/Sweave.\\

Weir, B. S. (1996) {\it Genetic Data Analysis II}. Sinauer Associates, Massachusetts.\\

Wigginton, J. E., Cutler, D. J. and Abecasis, G. R. (2005) A note on exact tests of Hardy-Weinberg 
equilibrium. {\it American Journal of Human Genetics}, 76, pp. 887-893.\\

\section{Appendix: version history}

Version 1.2: Routine {\tt HWData} has been added.\\
Version 1.3:

\begin{itemize}
\item {\tt curtyp} argument was added to {\tt HWTernaryPlot}.
\item {\tt HWTernaryPlot} now also accepts a matrix of genotypic counts as input.
\item {\tt ssf} argument was added to {\tt HWTernaryPlot}.
\item Routine {\tt HWExact} has been added, substituting the
  previous {\tt HWFisher}. {\tt HWExact} is a better implementation of
  the exact test for HWE.
\end{itemize}

Version 1.4:

\begin{itemize}

\item {\tt HWChisq} and {\tt HWExact} prints informative test if verbose is set to true.

\item {\tt HWExact} now uses a faster algorithm, based on the recurrence equations described by Elston
  and Forthofer (1977) and Wigginton et al. (2005). Options for one or two-sided tests, and for the type of 
  p-value have been added. 

\item Routines {\tt af} and {\tt maf} have been added for the computation of (minor) allele frequencies.

\item {\tt HWTernaryPlot} allows for automatic colouring of significant and non-significant compositions via
   the {\tt signifcolour} argument. The {\tt Xa} argument has been removed. The drawing of acceptance regions
   for exact tests has been improved.
      
\end{itemize}

\section{Future versions}

A bayesian test for HWE will be included in the package in the near future.

\end{document}
Browse the archive

https://github.com/cran/HardyWeinberg