https://github.com/ENCODE-DCC/checkfiles
Raw File
Tip revision: af24fe99645445f57a036cd3ae000557192c3e38 authored by Idan Gabdank on 01 March 2022, 22:21:00 UTC
Merge pull request #61 from ENCODE-DCC/CHECK-13-python-path
Tip revision: af24fe9
README.md
Check Files
===========

Files are checked to see if the MD5 sum (both for gzipped and ungzipped) is identical to the submitted metadata, as well as run through
the validateFiles program from jksrc  (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/validateFiles).
It operates on files in the 'uploading' state (according to the encodeD database) in the encode-files S3 bucket.
Checkfiles is used by the ENCODE DCC to validate genomic datafiles submitted by labs.
The bucket itself is mounted using Goofys (https://github.com/kahing/goofys).
Errors are reported back to encodeD.

Setup
-----

Install pyenv environment(if not already installed from ENCODE-DCC/encoded) repo::

    brew install pyenv
    pyenv install 3.4.3
    pyenv install 2.7.13
    echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
    echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
    echo 'eval "$(pyenv init -)"' >> ~/.bash_profile
    echo 'eval "pyenv shell 2.7.13 3.4.3"' >> ~/.bash_profile
    source ~/.bash_profile

Install required packages for running deploy::

    pyvenv .
    bin/pip install -r requirements-deploy.txt

Deploy to AWS
-------------

Supply arguments for branch, name, image-id, instance-type and profile-name (if you want to change the defaults) before  ``--`` separator and the rest of the args after the separator:

    bin/python deploy.py -- --username ACCESS_KEY_ID --password SECRET_ACCESS_KEY --bot-token SLACK-BOT-TOKEN https://www.encodeproject.org

Run on DNAnexus
---------------

Prerequisites
* DNAnexus login
* dx toolkit
* Access to the DNAnexus checkfiles project (everyone in DNAnexus org-cherrylab should have this)
* A DNAnexus path of a file to check in the form project-name:/dir/subdir/filename

To run on a DNAnexus file with --dry-run::

    dx run -i dry_run=t -i dx_file="project-name:/dir/subdir/filename" --watch --yes checkfiles:checkfiles

 To capture output and error streams to a file that will be saved to the current DNAnexus project and send those to slack::
 
    dx run -i dx_file="project-name:/dir/subdir/filename" -i bot_token="mybot-token" -i out="myoutfile" -i err="myerrfile" --watch --yes checkfiles:checkfiles

To see full syntax and options::

    dx run checkfiles:checkfiles --help

NOTE: stdout and stderr are currently sent to the DNAnexus log.  Saving those streams to files it not yet supported.

Deploy to DNAnexus
------------------

You only need to do this if you have changed the code.

If you don't aleady have a DNAnexus login token::

    dx login

Select the checkfiles project on DNAnexus::

    dx select checkfiles

Only if you have changed the checkfiles asset Makefile, in the checkfiles repo root::

    dx build_asset checkfiles_asset

Build the new applet::

    dx build -a checkfiles

back to top