# REANA example - BSM search [![image](https://github.com/reanahub/reana-demo-bsm-search/workflows/CI/badge.svg)](https://github.com/reanahub/reana-demo-bsm-search/actions) [![image](https://img.shields.io/badge/discourse-forum-blue.svg)](https://forum.reana.io) [![image](https://img.shields.io/github/license/reanahub/reana-demo-bsm-search.svg)](https://raw.githubusercontent.com/reanahub/reana-demo-bsm-search/master/LICENSE) [![image](https://www.reana.io/static/img/badges/launch-on-reana-at-cern.svg)](https://reana.cern.ch/launch?url=https%3A%2F%2Fgithub.com%2Freanahub%2Freana-demo-bsm-search&name=reana-demo-bsm-search) ## About This [REANA](http://reanahub.io/) reproducible analysis example emulates a typical Beyond Standard Model (BSM) search as performed in collider particle physics. It involves processing three main groups of data: 1. The observed collision data as it was recorded from the detector 1. The Standard Model Backgrounds relevant for the search 1. The Beyond Standard Model signal sample. After processing, a statistical model involving both signal and control regions is built and the model is fitted against the observed data. In this emulation, the data is compatible with the Standard Model expectation, and thus an upper limit on the signal strength of the BSM process is computed, which is the main output of the workflow. This example uses the [ROOT](https://root.cern.ch/) data analysis framework and [Yadage](https://github.com/yadage) computational workflow engine. ## Analysis structure Making a research data analysis reproducible basically means to provide "runnable recipes" addressing (1) where is the input data, (2) what software was used to analyse the data, (3) which computing environments were used to run the software and (4) which computational workflow steps were taken to run the analysis. This will permit to instantiate the analysis on the computational cloud and run the analysis to obtain (5) output results. ### 1. Input data In this example, the input datasets representing the collision and simulated data will be generated on the fly in the first couple of workflow steps. Therefore there is no explicit input data to be taken care of. ### 2. Analysis code This example uses the [ROOT](https://root.cern.ch/) analysis framework with the custom user code located in the `code` directory. In order to execute the different stages of the analysis a number of scripts are needed. In a real analysis these scripts might be part of larger analysis frameworks developed using the experiment-internal software stack (e.g. CMSSW or the ATLAS Analysis Releases) and be based on C++ with many dependencies and require multiple container images. In this emulation we have two container images. 1. A pure ROOT6 container image `docker.io/reanahub/reana-demo-bsm-search` used for most steps (such as selection, merging etc) 1. An image based on ROOT6 which also has the `hftools` package installed. This image is used for the last steps dealing with fitting, plotting and exporting to HepData #### [generantuple.py](code/generantuple.py) - Generating Toy Data This script generates toy datasets needed for the analysis. The script has the command line interface `python /code/generantuple.py {type} {nevents} {outputfile}`, where `{type}` can be one of `[data, mc1, mc2, qcd, sig]` generating "observed data", two background processes "mc1" or "mc2", a "multijet-background" and finally the BSM signal process, respectively. The "data" is just a specific mix of the other three processes according to their respective cross sections. The dataset which is a collection of "events" (the number of events is controlled by the `{nevents}` parameters) which is stored in a ROOT TNtuple at the path indicated by `{outputfilename}`. Since dataset generation is easily parallelizable, ultimately we will run many of these jobs at the same time and merge the TNtuples via ROOT's `hadd` utility. #### [select.py](code/select.py) - Selecting Interesting Events This is the main "event selection" code that processes the datasets and selects "interesting events" and applies correction and systematic variations to the events. In a real analysis this would be the bulk of the analysis code implemented in a C++ experiment framework. In this example, the cli structure of the script is `python /code/select.py {inputfile} {outputfile} {region} var1,var2,...` where an input and output files are specified as well as the region (i.e. either signal or control region) and a number of comma-delimited systematic variations are specified. The code then applies cuts and variations to the events and writes the selected events into a new TNtuple which is saved to disk. In this case only variations that affect the event selection need to be specified. Variations that only affect the event weights are dealt with in the histogramming step (see below). #### [histogram.py](code/histogram.py) - Summarize Events in histograms This script reads in the TNtuple of the selected events and creates the required histograms for building the statistical model and weights them to a specific luminosity. The command structure is `python /code/histogram.py {inputfile} {outputfile} {name} {weight} var,var2,...` the variations in this case are weight-only variations. #### [makews.py](code/makews.py) - Building a Statistical Model This script creates a `RooWorkspace` using the HistFactory p.d.f template. The HistFactory configuration has a single channel and four samples (qcd, mc1, mc2 and signal). The parameter of interest in this model is the normalization of the signal sample (the signal strength). For fitting and plotting the resulting workspace we use an external package called `hftools` (HistFactory tools), which provides command line tools for these purposes and no additional code is needed from our side. The command line structure is `python /code/makews.py {data_bkg_hists} {workspace_prefix} {xml_dir}` The script expects all data and background histograms to be collected in a single ROOT file and writes the XML configuration and workspace to the paths specified on the command line. #### [hepdata_export.py](code/hepdata_export.py) - Preparing a HepData submission The final step of an analysis is often to prepare a HepData submission in order to archive measured distributions and results on the HepData archive. Here we use `hftools` as a python library in this script, which has some convenience functions to generated HepData tables from a `RooWorkspace`. The command line structure is: `python /code/hepdata_export.py {combined_model}` ### 3. Compute environment In order to be able to rerun the analysis even several years in the future, we need to "encapsulate the current compute environment", for example to freeze the ROOT version our analysis is using. We shall achieve this by preparing a [Docker](https://www.docker.com/) container image for our analysis steps. Some of the analysis steps will run in a pure [ROOT](https://root.cern.ch/) analysis environment. We can use an already existing container image, for example [reana-env-root6](https://github.com/reanahub/reana-env-root6), for these steps. Some of the other analysis tasks wil need `hftools` Python library installed that our Python code needs. We can extend the `reana-env-root6` image to install `hftools` and to include our own Python code. This can be achieved as follows: ```console $ less environments/reana-demo-bsm-search/Dockerfile ``` ```Dockerfile # Start from the ROOT6 base image: FROM docker.io/reanahub/reana-env-root6:6.18.04 # Install HFtools and its dependencies: RUN apt-get -y update && \ apt-get -y install \ libyaml-dev \ python-numpy \ zip && \ apt-get autoremove -y && \ apt-get clean -y RUN pip install hftools==0.0.6 # Mount our code: ADD code /code WORKDIR /code ``` We can build our analysis environment image and give it a name `docker.io/reanahub/reana-demo-bsm-search`: ```console $ docker build -f environment/Dockerfile -t docker.io/reanahub/reana-demo-bsm-search . ``` We can push the image to the DockerHub image registry: ```console $ docker push docker.io/reanahub/reana-demo-bsm-search ``` (Note that typically you would use your own username such as `johndoe` in place of `reanahub`.) ### 4. Analysis workflow This analysis example intends to emulate fully what is happening in a typical BSM search analysis. This means a lot of computational steps with parallel execution and merging of results. We shall use the [Yadage](https://github.com/yadage) workflow engine to express the computational steps in a declarative manner. The [databkgmc.yml](workflow/databkgmc.yml) workflow defines the full pipeline defining various data, signal, simulation, merging, fitting and plotting steps: ![](https://raw.githubusercontent.com/reanahub/reana-demo-bsm-search/master/docs/workflow.png) At a very high level the workflow is as follows 1. Generate and process "observed data" to produce observed data and a data-driven multijet estimate in the signal region. 1. For each non-multijet Standard Model process (MC1 and MC2), generate and process datasets including systematic variations 1. Generate and Process a signal dataset The three sub-workflows above can happen in parallel as they are independent of each other. Once they are done the remaining steps needed are 1. Merge Outputs from subworkflows and prepare a Statisical Model. 1. Perform Fits and produce Plots. 1. Prepare a HepData Submission ```console +---------------+ +--------------+ +------+ | Data & Mulijet| |SM Backgrounds| |Signal| +---------------+ +--------------+ +------+ | | | | | | +--------> v <----------+ +--+--+ |Merge| +--+--+ | v +----------+ | Workspace| +----------+ +-----------+ | | +------------------+ |Fit & Plots| <---+ +----> |HepData Submission| +-----------+ +------------------+ ``` #### The Data Workflow The subworkflow generating and processing the "observed data" goes through these high-level stages. 1. **Generating the Data** This stage generates data in a highly parallel fashion and then merges the files into a smaller number of files. We do not merge into a single file as this may end up being too large (currently merges happen in batches of six) 1. **Processing Data in Signal Region** This branch in the data workflow processes the data and selects and histograms events in the signal region. This will be the data the model is fitted against. 1. **Processing Data in Control Region for data-driven multijet estimate** This branch selects and histograms events in the control region to estimate the shape of the distribution and then uses a transfer factor which controls the normalization of the distribution in the signal region. This results in a so-called "data-driven" estimate the so-called "multijets" (or "qcd") background, since it would be unfeasible to estimate it using Monte-Carlo samples. 1. **Merge final results** Finally, the results are merged into a single file that holds all the resulting histograms from the data sub-workflow. #### The SM Background Workflow For each of the SM backgrounds that are not estimated directly from the data, we use generated Monte-Carlo samples. For the Standard Model backgrounds we generate and process these datasets including systematics variations. These systematic variations change the values of the variables that are used to select "interesting events" as well as the "weight" of the event that is used when filling the histograms. The SM Background sub-workflow splits into further sub-sub-workflows performed for each of the background processes. In this emulation we have two such processes. For each sample, we go through the following stages 1. Generate datasets for the background processes 1. Run Event selection for Signal region 1. Histogram Events (with correct luminosity weighting) As some systematics affect the variables that are cut on in the event selection ( so-called shape variatiosn), the event selection step needs to be performed multiple times (once for each shape variations). Therefore, there is an additional sub-workflow for processing shape variations. Systematics only affecting the weights can be implemented in one go at the histogramming stage. As we progress through these stages, we add merging steps to reduce the number of files that need to be handled. Finally, all histograms for a single Monte Carlo samples are collected before merging all Monte-Carlo samples into a single ROOT file. #### The Signal Workflow The Signal workflow is very similar to the SM Background workflow, but we do not consider any systematics. Therefore it is a simple workflow that selects and histograms events (with a couple of merge stages in between). #### Putting everything together Using these sub-workflows, we assemble a composed workflow. In this example, there are no externally settable parameters, as the parameters for the three sub-workflows (data, backgrounds, signal) are fixed in the workflow spec. The parameters for the subworkflows include information on how many events to generate and, in the case of signal and background, what the relative weight should be. ```console $ head -8 workflow/databkgmc.yml stages: - name: all_bkg_mc scheduler: scheduler_type: singlestep-stage parameters: mcname: [mc1,mc2] mcweight: [0.01875,0.0125] # [Ndata / Ngen * 0.2 * 0.15, Ndata / Ngen * 0.2 * 0.1] = [10/16*0.03, 1/16 * 0.02] nevents: [40000,40000,40000,40000] #160k events / mc sample ``` Please see the [databkgmc.yml](workflow/databkgmc.yml) workflow definition and related [Yadage documentation](http://yadage.readthedocs.io/). ### 5. Output results The interesting fragements generated by this result are the pre- and the post-fit distributions of the individual samples as well as the HepData submission in the form of a ZIP archive. Below we see the model at its pre-fit configuration at nominal signal strength mu=1. The signal distribution is shown in green. As we can see the nominal setting does not describe the data, which is shown in black dots, well. ![](https://raw.githubusercontent.com/reanahub/reana-demo-bsm-search/master/docs/prefit.png) Here we see the post-fit distribution. As we can see, the signal sample needed to be scale down significantly to fit the data, which is expected since we generated the data in accordance with a SM-only scenario. ![](https://raw.githubusercontent.com/reanahub/reana-demo-bsm-search/master/docs/postfit.png) ## Running the example on REANA cloud There are two ways to execute this analysis example on REANA. If you would like to simply launch this analysis example on the REANA instance at CERN and inspect its results using the web interface, please click on the following badge: [![image](https://www.reana.io/static/img/badges/launch-on-reana-at-cern.svg)](https://reana.cern.ch/launch?url=https%3A%2F%2Fgithub.com%2Freanahub%2Freana-demo-bsm-search&name=reana-demo-bsm-search) If you would like a step-by-step guide on how to use the REANA command-line client to launch this analysis example, please read on. We start by creating a [reana.yaml](reana.yaml) file describing the above analysis structure with its inputs, code, runtime environment, computational workflow steps and expected outputs: ```yaml version: 0.6.0 inputs: directories: - workflow workflow: type: yadage file: workflow/databkgmc.yml outputs: files: - plot/prefit.pdf - plot/postfit.pdf ``` We can now install the REANA command-line client, run the analysis and download the resulting plots: ```console $ # create new virtual environment $ virtualenv ~/.virtualenvs/myreana $ source ~/.virtualenvs/myreana/bin/activate $ # install REANA client $ pip install reana-client $ # connect to some REANA cloud instance $ export REANA_SERVER_URL=https://reana.cern.ch/ $ export REANA_ACCESS_TOKEN=XXXXXXX $ # create new workflow $ reana-client create -n my-analysis $ export REANA_WORKON=my-analysis $ # upload input code and data to the workspace $ reana-client upload $ # start computational workflow $ reana-client start $ # ... should be finished in about 15 minutes $ # check its status $ reana-client status $ # list workspace files $ reana-client ls $ # download generated plots $ reana-client download ``` Please see the [REANA-Client](https://reana-client.readthedocs.io/) documentation for more detailed explanation of typical `reana-client` usage scenarios.