Skip to main content
  • Home
  • Development
  • Documentation
  • Donate
  • Operational login
  • Browse the archive

swh logo
SoftwareHeritage
Software
Heritage
Archive
Features
  • Search

  • Downloads

  • Save code now

  • Add forge now

  • Help

https://github.com/cran/GenAlgo
01 January 2021, 04:13:10 UTC
  • Code
  • Branches (5)
  • Releases (0)
  • Visits
    • Branches
    • Releases
    • HEAD
    • refs/heads/master
    • refs/tags/2.1.3
    • refs/tags/2.1.4
    • refs/tags/2.1.5
    • refs/tags/2.2.0
    No releases to show
  • 91d7050
  • /
  • inst
  • /
  • doc
  • /
  • featureSelection.Rmd
Raw File Download Save again
Take a new snapshot of a software origin

If the archived software origin currently browsed is not synchronized with its upstream version (for instance when new commits have been issued), you can explicitly request Software Heritage to take a new snapshot of it.

Use the form below to proceed. Once a request has been submitted and accepted, it will be processed as soon as possible. You can then check its processing state by visiting this dedicated page.
swh spinner

Processing "take a new snapshot" request ...

To reference or cite the objects present in the Software Heritage archive, permalinks based on SoftWare Hash IDentifiers (SWHIDs) must be used.
Select below a type of object currently browsed in order to display its associated SWHID and permalink.

  • content
  • directory
  • revision
  • snapshot
origin badgecontent badge
swh:1:cnt:89ba16fc4cc2a0baff64841cf884ebbac662d7f1
origin badgedirectory badge
swh:1:dir:2a3460cb578bf477f37d944d0c767b9dae6350ec
origin badgerevision badge
swh:1:rev:35a421aca28768b31853a0b29cdbab60e40e7569
origin badgesnapshot badge
swh:1:snp:908bea40927100b6d5e6aad50b0ee1e6e85dd112

This interface enables to generate software citations, provided that the root directory of browsed objects contains a citation.cff or codemeta.json file.
Select below a type of object currently browsed in order to generate citations for them.

  • content
  • directory
  • revision
  • snapshot
Generate software citation in BibTex format (requires biblatex-software package)
Generating citation ...
Generate software citation in BibTex format (requires biblatex-software package)
Generating citation ...
Generate software citation in BibTex format (requires biblatex-software package)
Generating citation ...
Generate software citation in BibTex format (requires biblatex-software package)
Generating citation ...
Tip revision: 35a421aca28768b31853a0b29cdbab60e40e7569 authored by Kevin R. Coombes on 18 May 2018, 14:29:40 UTC
version 2.1.4
Tip revision: 35a421a
featureSelection.Rmd
---
title: "Using Genetic Algorithms for Feature Selection"
author: "Kevin R. Coombes"
data: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Genetic Algorithm and Feature Selection}
  %\VignetteKeywords{OOMPA,GenAlgo,Feature Selection}
  %\VignetteDepends{GenAlgo,Umpire,oompaBase}
  %\VignettePackage{GenAlgo}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r makeHappy,echo=FALSE}
if (!require(Umpire)) {
  knitr::opts_chunk$set(eval = FALSE)
}
```

In this vignette, we ilustrate how to apply the `GenAlgo` package
to the problem of feature selection in an "omics-scale" data set. We
start by loading the packages that we will need.
```{r lib}
library(GenAlgo)
library(Umpire)
library(oompaBase)
```

# Simulating a Data Set
We will use the `Umpire` package to simulate a comnplex enough data set
to stress our feature selection algorithm.  We begin by setting up a
time-to-event model, built on an exponential baseline.
```{r survmodel}
set.seed(391629)
sm <- SurvivalModel(baseHazard=1/5, accrual=5, followUp=1)
```
Next, we build a "cancer model" with six subtypes.
```{r cancModel}
nBlocks <- 20    # number of possible hits
cm <- CancerModel(name="cansim",
                  nPossible=nBlocks,
                  nPattern=6,
                  OUT = function(n) rnorm(n, 0, 1), 
                  SURV= function(n) rnorm(n, 0, 1),
                  survivalModel=sm)
### Include 100 blocks/pathways that are not hit by cancer
nTotalBlocks <- nBlocks + 100
```
Now define the hyperparameters for the models.
```{r hyper}
### block size
blockSize <- round(rnorm(nTotalBlocks, 100, 30))
### log normal mean hypers
mu0    <- 6
sigma0 <- 1.5
### log normal sigma hypers
rate   <- 28.11
shape  <- 44.25
### block corr
p <- 0.6
w <- 5
```

Now set up the baseline "Engine".
```{r engine}
rho <- rbeta(nTotalBlocks, p*w, (1-p)*w)
base <- lapply(1:nTotalBlocks,
               function(i) {
                 bs <- blockSize[i]
                 co <- matrix(rho[i], nrow=bs, ncol=bs)
                 diag(co) <- 1
                 mu <- rnorm(bs, mu0, sigma0)
                 sigma <- matrix(1/rgamma(bs, rate=rate, shape=shape), nrow=1)
                 covo <- co *(t(sigma) %*% sigma)
                 MVN(mu, covo)
               })
eng <- Engine(base)
```

We alter the means if there is a hit, or else build it using the original
engine components.
```{r alter}
altered <- alterMean(eng, normalOffset, delta=0, sigma=1)
object <- CancerEngine(cm, eng, altered)
summary(object)
```

```{r clean1}
rm(altered, base, blockSize, cm, eng, mu0, nBlocks, nTotalBlocks,
   p, rate, rho, shape, sigma0, sm, w)
```

Now we can use this elaborate setup to generate the simulated data.
```{r traind}
train <- rand(object, 198)
tdata <- train$data
pid <- paste("PID", sample(1001:9999, 198+93), sep='')
rownames(train$clinical) <- colnames(tdata) <- pid[1:198]
```
Of course, to make things harder, we will add noise to the simulated measurements.
```{r noise}
noise <- NoiseModel(3, 1, 1e-16)
train$data <- log2(blur(noise, 2^(tdata)))
sum(is.na(train$data))
rm(tdata)
summary(train$clinical)
summary(train$data[, 1:3])
```

Now we can also simualte a validation data set.
```{r validd}
valid <- rand(object, 93)
vdata <- valid$data
vdata <- log2(blur(noise, 2^(vdata))) # add noise
sum(is.na(vdata))
vdata[is.na(vdata)] <- 0.26347
valid$data <- vdata
colnames(valid$data) <- rownames(valid$clinical) <- pid[199:291]
rm(vdata, noise, object, pid)
summary(valid$clinical)
summary(valid$data[, 1:3])
```

# Setting up the Genetic Algorithm
Now we can start using the `GenAlgo` package. The key step is to define
sensible functions that can measure the "fitness" of a solution and to
introduce "mutations". When these functions are called, they are passed a
`context` argument that can be used to access extra information about
how to proceed. In this case, that context will be the `train` object,
which includes the clinical information about the samples.

## Fitness
Now we can define the fitness function. The idea is to compute the Mahalanobis
distance between the two groups (of "Good" or "Bad" outcome samples) in the
space defined by the selected features.
```{r measureFitness}
measureFitness <- function(arow, context) {
  predictors <- t(context$data[arow, ]) # space defined by features
  groups <- context$clinical$Outcome    # good or bad outcome
  maha(predictors, groups, method='var')
}
```

## Mutations
The mutation function randomly chooses any other feature/row to swap out
possible predictors of the outcome.
```{r mutator}
mutator <- function(allele, context) {
   sample(1:nrow(context$data),1)
}
```

## Initialization
We need to decide how many features to include in a potential predictor
(here we use ten). We also need to decide how big a population of feature-sets
(here we use 200) should be used in each generation of the genetic algorithm.
```{r initialize}
set.seed(821831)
n.individuals <- 200
n.features <- 10
y <- matrix(0, n.individuals, n.features)
for (i in 1:n.individuals) {
  y[i,] <- sample(1:nrow(train$data), n.features)
}
```
Having chosen the staring population, we can run the first step of the
genetic algorithm.
```{r round1}
my.ga <- GenAlg(y, measureFitness, mutator, context=train) # initialize
summary(my.ga)
```

To be able to evaluate things later, we save the starting generation
```{r save0}
recurse <- my.ga
pop0 <- sort(table(as.vector(my.ga@data)))
```

## Multiple Generations
Realistically, we probably want to run a couple of hundred or even a
couple thousand iterations of the algorithm. But, in interests of making
the vignette complete ins a reasonable amount of time, we are only going
to terate through 20 generations.
```{r recurse}
NGEN <- 20
diversity <- meanfit <- fitter <- rep(NA, NGEN)
for (i in 1:NGEN) {
  recurse <- newGeneration(recurse)
  fitter[i] <- recurse@best.fit
  meanfit[i] <- mean(recurse@fitness)
  diversity[i] <- popDiversity(recurse)
}
```

Plot max and mean fitness by generation. This figure shows that both the mean
and the maximum fitness are increasing.
```{r fig.cap="Fitness by generation."}
plot(fitter, type='l', ylim=c(0, 1.5), xlab="Generation", ylab="Fitness")
abline(h=max(fitter), col='gray', lty=2)
lines(fitter)
lines(meanfit, col='gray')
points(meanfit, pch=16, col=jetColors(NGEN))
legend("bottomleft", c("Maximum", "Mean"), col=c("black", "blue"), lwd=2)
```

Plot the diversity of the population, to see that it is deceasing.
```{r fig.cap="Diversity."}
plot(diversity, col='gray', type='l', ylim=c(0,10), xlab='', ylab='', yaxt='n')
points(diversity, pch=16, col=jetColors(NGEN))
```

See which predictors get selected most frequently in the latest
generation.
```{r div}
sort(table(as.vector(recurse@data)))
```

back to top

Software Heritage — Copyright (C) 2015–2026, The Software Heritage developers. License: GNU AGPLv3+.
The source code of Software Heritage itself is available on our development forge.
The source code files archived by Software Heritage are available under their own copyright and licenses.
Terms of use: Archive access, API— Content policy— Contact— JavaScript license information— Web API