https://github.com/cran/RecordLinkage
Raw File
Tip revision: b32452149857b15849e9c82fb81e812df6e921fc authored by Murat Sariyar on 08 November 2022, 13:10:15 UTC
version 0.4-12.4
Tip revision: b324521
epiWeights.Rd
\name{epiWeights}
\Rdversion{1.1}
\alias{epiWeights}
\alias{epiWeights-methods}
\alias{epiWeights,RecLinkData-method}
\alias{epiWeights,RLBigData-method}


\title{
Calculate EpiLink weights
}
\description{
  Calculates weights for record pairs based on the EpiLink approach
  (see references).
}
\usage{
  epiWeights(rpairs, e = 0.01, f, ...)

  \S4method{epiWeights}{RecLinkData}(rpairs, e = 0.01, f = rpairs$frequencies)

  \S4method{epiWeights}{RLBigData}(rpairs, e = 0.01, f = getFrequencies(rpairs),
    withProgressBar = (sink.number()==0))
}

\arguments{
  \item{rpairs}{The record pairs for which to
    compute weights. See details.}
  \item{e}{
    Numeric vector. Estimated error rate(s).
  }
  \item{f}{
    Numeric vector. Average frequency of attribute values.
  }
  \item{withProgressBar}{Whether to display a progress bar}
  \item{...}{Placeholder for method-specific arguments.}
}

\details{
This function calculates weights for record pairs based on the approach used by Contiero et alia in the EpiLink record linkage software (see references).

Since package version 0.3, this is a generic function with methods for S3 objects of class \code{\link{RecLinkData}}
  as well as S4 objects of classes  \code{"\linkS4class{RLBigDataDedup}"} and
  \code{"\linkS4class{RLBigDataLinkage}"}.

The weight for a record pair \eqn{(x^{1},x^{2})}{(x1,x2)} is computed by the formula 
  \deqn{\frac{\sum_{i}w_{i}s(x^{1}_{i},x^{2}_{i})}{\sum_{i}w_{i}}}{sum_i (w_1 * s(x1_i, x2_i)) / sum_i w_i}
  where \eqn{s(x^{1}_{i},x^{2}_{i})}{s(x1_i, x2_i)} is the value of a string comparison of
  records \eqn{x^{1}}{x1} and \eqn{x^{2}}{x2} in the i-th field and \eqn{w_{i}}{w_i} is a weighting factor computed by 
  \deqn{w_{i}=\log_{2}(1-e_{i})/f_{i}}{w_i = log_2 (1-e_i) / f_i},  where \eqn{f_{i}}{f_i} denotes the average frequency of values and \eqn{e_{i}}{e_i} the estimated error rate for field \eqn{i}. 
  
String comparison values are taken from the record pairs as they were generated with \code{\link{compare.dedup}} or \code{\link{compare.linkage}}. The use of binary patterns is possible, but in general yields poor results.
  
The average frequency of values is by default taken from the object   \code{rpairs}. Both frequency and error rate \code{e} can be set to a single value, which will be recycled, or to a vector with distinct error rates for every field. 
  
The error rate(s) and frequencie(s) must satisfy 
\eqn{e_{i}\leq{}1-f_{i}}{e[i] <= 1-f[i]} for all \eqn{i}, otherwise the functions fails. Also, some other rare combinations can result in weights with illegal values (NaN, less than 0 or greater than 1). In this case a  warning is issued.
  
By default, the \code{"\linkS4class{RLBigDataDedup}"} method displays a progress bar unless output is diverted by \code{sink}, e.g. when processing a Sweave file.
}

\value{
  A copy of \code{rpairs} with the weights attached. See the class documentation
  (\code{\link{RecLinkData}}, \code{"\linkS4class{RLBigDataDedup}"} and
  \code{"\linkS4class{RLBigDataLinkage}"}) on how weights are stored.
  
  For the \code{"RLBigData"} method, the returned object is only a shallow
  copy in the sense that it links to the same ff data files as database file as
  \code{rpairs}.
}

\section{Side effects}{
  The \code{"RLBigData"} method creates a \code{"ffvector"} object,
  for which a disk file is created.
}

\references{
P. Contiero et al., The EpiLink record linkage software, in: Methods of 
Information in Medicine 2005, 44 (1), 66--71.
}
\author{
Andreas Borg, Murat Sariyar
}

\seealso{
  \code{\link{epiClassify}} for classification based on EpiLink weights.
  \code{\link{emWeights}} for a different approach for weight calculation.
}

\examples{
# generate record pairs
data(RLdata500)
p=compare.dedup(RLdata500,strcmp=TRUE ,strcmpfun=levenshteinSim,
  identity=identity.RLdata500, blockfld=list("by", "bm", "bd"))

# calculate weights
p=epiWeights(p)

# classify and show results
summary(epiClassify(p,0.6))
}
\keyword{classif}
back to top