1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
---
title: README
author: Jan-Michael Rye
---

# Synopsis

Generate molecular feature vectors for machine- and deep-learning models using cheminformatics software packages. A configuration file is used to select and configure different feature sets for inclusion in the full feature vector. The full list of available feature sets is available [here](https://jrye.gitlabpages.inria.fr/chemfeat/gen_features.html). The results of feature set calculations are cached in an [SQLite](https://www.sqlite.org/index.html) database to avoid the overhead of redundant calculations when feature sets are re-used.

The following packages are currently used to calculate chemical features:

* [RDKit](https://pypi.org/project/rdkit/)
* [PaDEL-pywrapper](https://pypi.org/project/PaDEL-pywrapper/)


# Links

* [GitLab](https://gitlab.inria.fr/jrye/chemfeat)
* [Documentation](https://jrye.gitlabpages.inria.fr/chemfeat/)

# Usage

## Command-Line

The package provides the `chemfeat` command-line tool to generated CSV files of feature vectors from lists of InChi strings. It can also be used to generate a template feature-set configuration file and a markdown document describing all of the feature sets. The command's various help messages can be found [here](https://jrye.gitlabpages.inria.fr/chemfeat/gen_command_help.html).

### Usage Example

Given a feature set configuration file ("feature_sets.yaml") and a CSV file with a column of InChi strings ("inchis.csv"), a CSV file out features ("features.csv") can be generated with the following command:

~~~sh
chemfeat calc feature_sets.yaml inchis.csv features.csv
~~~

The following sections contain example contents for the input files and the output file that they produce.

#### feature_sets.yaml

Example feature set configuration file. Note that the feature sets are specified as a list, which allows the same feature set to be use multiple times with different parameters. For the full list of features, see the [feature descriptions](https://jrye.gitlabpages.inria.fr/gen_features.html) and the [configuration file template](https://jrye.gitlabpages.inria.fr/chemfeat/gen_feature_set_configuration.html).

~~~yaml
# QED feature calculator.
- name: qed

# RDK descriptor feature calculator.
- name: rdkdesc
~~~

#### inchis.csv

Example CSV input file with a column containing InChi values. The name of the InChi column is configurable and defaults to "InChi".

~~~
InChi,name
"InChI=1S/C8H9NO2/c1-6(10)9-7-2-4-8(11)5-3-7/h2-5,11H,1H3,(H,9,10)","paracetamol"
"InChI=1S/C13H18O2/c1-9(2)8-11-4-6-12(7-5-11)10(3)13(14)15/h4-7,9-10H,8H2,1-3H3,(H,14,15)","ibuprofen"
~~~

#### featurs.csv

The CSV feature file that results from the example input files above.

~~~
InChi,qed__ALERTS,qed__ALOGP,qed__AROM,qed__HBA,qed__HBD,qed__MW,qed__PSA,qed__ROTB,rdkdesc__FpDensityMorgan1,rdkdesc__FpDensityMorgan2,rdkdesc__FpDensityMorgan3,rdkdesc__MaxAbsPartialCharge,rdkdesc__MaxPartialCharge,rdkdesc__MinAbsPartialCharge,rdkdesc__MinPartialCharge,rdkdesc__NumRadicalElectrons,rdkdesc__NumValenceElectrons
"InChI=1S/C8H9NO2/c1-6(10)9-7-2-4-8(11)5-3-7/h2-5,11H,1H3,(H,9,10)",2,2.0000999999999998,1,2,2,151.16500000000002,52.82000000000001,1,1.2727272727272727,1.8181818181818181,2.272727272727273,0.5079642937129114,0.18214293782620056,0.18214293782620056,-0.5079642937129114,0,58
"InChI=1S/C13H18O2/c1-9(2)8-11-4-6-12(7-5-11)10(3)13(14)15/h4-7,9-10H,8H2,1-3H3,(H,14,15)",0,3.073200000000001,1,2,1,206.28499999999997,37.3,4,1.2,1.7333333333333334,2.1333333333333333,0.4807885019257389,0.3101853515323108,0.3101853515323108,-0.4807885019257389,0,82
~~~


## Python API

~~~python
from chemfeat.database import FeatureDatabase
from chemfeat.features.manager import FeatureManager

# Here we assume that the following variables have already been defined:
# 
# feat_specs:
#   A list of feature specifications as returned by loading a YAML feature-set
#   configuration file.
#
# inchis:
#   An iterable of InChi strings representing the molecules for which the
#   features should be calculated.


# Create the database object.
feat_db = FeatureDatabase('features.sqlite')

# Create the feature manager object.
feat_man = FeatureManager(feat_db, feat_specs)

# Calculate the features and retrieve them as a Pandas dataframe.
feat_dataframe = feat_man.calculate_features(inchis, return_dataframe=True)
~~~