https://github.com/shorvath/MammalianMethylationConsortium
Tip revision: 44f3329fd593b472c2b6d021d764c9d7e5681f88 authored by Amin Haghani on 24 July 2023, 22:43:25 UTC
Add files via upload
Add files via upload
Tip revision: 44f3329
annotateHg19Probes.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download the following utilities\n",
"- http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/bigWigToBedGraph\n",
"- http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/hgWiggle\n",
"- bedtools\n",
"- bedops\n",
"- pandas (Python package)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download the following annotations\n",
"- https://public.hoffman2.idre.ucla.edu/ernst/R0RG6/LECIF/hg19.LECIFv1.1.bw\n",
"- https://public.hoffman2.idre.ucla.edu/ernst/ZHYRB/CNEP/cnep.bw\n",
"- https://public.hoffman2.idre.ucla.edu/ernst/ZHYRB/CNEP/css_cnep.bw \n",
"- http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/hg19.100way.phastCons.bw \n",
"- http://hgdownload.soe.ucsc.edu/gbdb/hg19/multiz46way/phastCons46wayPlacental.wib\n",
"- http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phyloP100way/hg19.100way.phyloP100way.bw\n",
"- http://hgdownload.soe.ucsc.edu/gbdb//hg19/multiz46way/phyloP46wayPlacental.wib\n",
"- https://public.hoffman2.idre.ucla.edu/ernst/1G6UT/hg19_genome_100_segments.bed.gz\n",
"- https://github.com/shorvath/MammalianMethylationConsortium/blob/main/Annotations%2C%20Amin%20Haghani/Manifest%2C%20HorvathMammalMethylChip40.csv.zip"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Download Roadmap epigenome ID annotation files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"curl -s https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/imputed12marks/jointModel/final/EIDlegend.txt > EIDlegend_RoadmapChromHMM.txt\n",
"curl -s https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/core_K27ac/jointModel/analyses/src/pairs/k27ac.list > EID_RoadmapExpandedChromHMM.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Write a bed file with all hg19 coordinates"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"input_file_path = \"Manifest, HorvathMammalMethylChip40.csv\"\n",
"df = pd.read_csv(input_file_path,usecols=['IlmnID',\"Human.Hg19_CGstart\",\"Human.Hg19_CGend\",\"Human.Hg19_seqnames\"]).dropna().sort_values(by=['Human.Hg19_seqnames','Human.Hg19_CGstart'])\n",
"df['start'] = df['Human.Hg19_CGstart'].astype(int)\n",
"df['end'] = df['start']+1\n",
"df[['Human.Hg19_seqnames','start','end','IlmnID']].to_csv('hg19_coord.bed',sep='\\t',header=False,index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Convert bigWig files to gzipped bed files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"./bigWigToBedGraph hg19.mm10.LECIF.bw hg19.mm10.LECIF.bed\n",
"gzip hg19.mm10.LECIF.bed\n",
"\n",
"./bigWigToBedGraph cnep.bw hg19.CNEP.bed\n",
"gzip hg19.CNEP.bed\n",
"\n",
"./bigWigToBedGraph css_cnep.bw hg19.CSS-CNEP.bed\n",
"gzip hg19.CSS-CNEP.bed\n",
"\n",
"./bigWigToBedGraph hg19.100way.phastCons.bw hg19.100way.phastCons.bed\n",
"gzip hg19.100way.phastCons.bed\n",
"\n",
"./bigWigToBedGraph hg19.100way.phyloP100way.bw hg19.phyloP100way.bed\n",
"gzip hg19.phyloP100way.bed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Map LECIF score"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! bedtools map -a hg19_coord.bed -b hg19.mm10.LECIF.bed.gz -c 4 -o mean > hg19_coord.LECIF.bed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Map CNEP score"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! bedtools map -a hg19_coord.bed -b hg19.CNEP.bed.gz -c 4 -o mean > hg19_coord.CNEP.bed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Map CSS-CNEP score"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! bedtools map -a hg19_coord.bed -b hg19.CSS-CNEP.bed.gz -c 4 -o mean > hg19_coord.CSS-CNEP.bed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Map PhastCons score"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! bedtools map -a hg19_coord.bed -b hg19.100way.phastCons.bed.gz -c 4 -o mean > hg19_coord.PhastCons.bed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Map mammalian PhastCons score"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"./hgWiggle -db=hg19 -bedFile=hg19_coord.bed -lift=1 phastCons46wayPlacental > hg19.PhastConsPlacental.wig\n",
"wig2bed < hg19.PhastConsPlacental.wig > hg19.PhastConsPlacental.bed\n",
"bedtools map -a hg19_coord.bed -b hg19.PhastConsPlacental.bed -c 5 -o mean > hg19_coord.PhastConsPlacental.bed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Map PhyloP score"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! bedtools map -a hg19_coord.bed -b hg19.100way.phyloPScore.bed.gz -c 4 -o mean > hg19_coord.PhyloP.bed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Map mammalian PhyloP score"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"./hgWiggle -db=hg19 -bedFile=hg19_coord.bed -lift=1 phyloP46wayPlacental > hg19.PhyloPPlacental.wig\n",
"wig2bed < hg19.PhyloPPlacental.wig > hg19.PhyloPPlacental.bed\n",
"bedtools map -a hg19_coord.bed -b hg19.PhyloPPlacental.bed -c 5 -o mean > hg19_coord.PhyloPPlacental.bed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Map ConsHMM states"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"path_prefix=https://ernst.cass.idre.ucla.edu/public/ConsHMM/Segmentations/hg19_multiz100way/\n",
"\n",
"cp hg19_coord.bed hg19_coord.ConsHMM.bed\n",
"for i in {1..22} X Y; do\n",
" curl -s $path_prefix/chr\"$i\"/chr\"$i\"_segmentation.bed.gz | gzip -cd |\\\n",
" bedtools map -a hg19_coord.ConsHMM.bed -b - -c 4 -o first > tmp.bed\n",
" mv tmp.bed hg19_coord.ConsHMM.bed\n",
"done"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"awk -v OFS=\"\\t\" '{m=$5;for(i=5;i<=NF;i++)if($i>m)m=$i;print $1,$2,$3,$4,m}' hg19_coord.ConsHMM.bed > tmp.bed\n",
"mv tmp.bed hg19_coord.ConsHMM.bed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Map 100-state universal ChromHMM model states"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"gzip -cd hg19_genome_100_segments.bed.gz | sort -k1,1 -k2,2n |\\\n",
"bedtools map -a hg19_coord.bed -b - -c 4 -o first > hg19_coord.universalChromHMM.bed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Map core 15-state ChromHMM model states"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"path_prefix=https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/\n",
"output_filename=hg19_coord.coreChromHMM.bed\n",
"\n",
"cp hg19_coord.bed $output_filename\n",
"for i in `cat EIDlegend_RoadmapChromHMM.txt | cut -f 1`; do\n",
" curl -s $path_prefix/\"$i\"_15_coreMarks_mnemonics.bed.gz | gzip -cd | sort -k1,1 -k2,2n |\\\n",
" bedtools map -a $output_filename -b - -c 4 -o first > tmp.bed\n",
" mv tmp.bed $output_filename\n",
"done"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Map expanded 18-state ChromHMM model states"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"path_prefix=https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/core_K27ac/jointModel/final/\n",
"output_filename=hg19_coord.expandedChromHMM.bed\n",
"\n",
"cp hg19_coord.bed $output_filename\n",
"for i in `cat EID_RoadmapExpandedChromHMM.txt | cut -f 1`; do\n",
" curl -s $path_prefix/\"$i\"_18_core_K27ac_mnemonics.bed.gz | gzip -cd | sort -k1,1 -k2,2n |\\\n",
" bedtools map -a $output_filename -b - -c 4 -o first > tmp.bed\n",
" mv tmp.bed $output_filename\n",
"done"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Map imputed 25-state ChromHMM model states"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"path_prefix=https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/imputed12marks/jointModel/final/\n",
"output_filename=hg19_coord.imputedChromHMM.bed\n",
"\n",
"cp hg19_coord.bed $output_filename\n",
"for i in `cat EIDlegend_RoadmapChromHMM.txt | cut -f 1`; do\n",
" curl -s $path_prefix/\"$i\"_25_imputed12marks_mnemonics.bed.gz | gzip -cd | sort -k1,1 -k2,2n |\\\n",
" bedtools map -a $output_filename -b - -c 4 -o first > tmp.bed\n",
" mv tmp.bed $output_filename\n",
"done"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### Combine them into a formatted csv file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"coord_filename = 'hg19_coord.bed'\n",
"bed_filenames = ['hg19_coord.LECIF.bed',\n",
" 'hg19_coord.CNEP.bed',\n",
" 'hg19_coord.CSS-CNEP.bed',\n",
" 'hg19_coord.PhastCons.bed',\n",
" 'hg19_coord.PhastConsPlacental.bed',\n",
" 'hg19_coord.PhyloP.bed',\n",
" 'hg19_coord.PhyloPPlacental.bed',\n",
" 'hg19_coord.ConsHMM.bed',\n",
" 'hg19_coord.universalChromHMM.bed',\n",
" 'hg19_coord.coreChromHMM.bed',\n",
" 'hg19_coord.expandedChromHMM.bed',\n",
" 'hg19_coord.imputedChromHMM.bed']\n",
"\n",
"position_col_names = ['chrom','start','end','probeID']\n",
"chromhmm_eid_all = pd.read_table('EIDlegend_RoadmapChromHMM.txt',header=None).values\n",
"chromhmm_eid_all_dict = dict(chromhmm_eid_all)\n",
"chromhmm_eid_subset = pd.read_table('EID_RoadmapExpandedChromHMM.txt',header=None,squeeze=True).tolist()\n",
"\n",
"core_chromhmm_col_name_prefix = 'ChromHMMState_Roadmap5Mark127Epigenome15State_ErnstKellis2012NatureMethods_RoadmapEpigenomicsConsortium2015Nature_'\n",
"expanded_chromhmm_col_name_prefix = 'ChromHMMState_Roadmap6Mark98Epigenome18State_ErnstKellis2012NatureMethods_RoadmapEpigenomicsConsortium2015Nature_'\n",
"imputed_chromhmm_col_name_prefix = 'ChromHMMState_Roadmap12Mark127Epigenome25State_ErnstKellis2012NatureMethods_RoadmapEpigenomicsConsortium2015Nature_'\n",
"\n",
"core_chromhmm_col_names = [core_chromhmm_col_name_prefix+i[0]+'_'+i[1].replace(' ','') for i in chromhmm_eid_all]\n",
"expanded_chromhmm_col_names = [expanded_chromhmm_col_name_prefix+chromhmm_eid_subset[i]+'_'+chromhmm_eid_all_dict[chromhmm_eid_subset[i]].replace(' ','') for i in range(len(chromhmm_eid_subset))]\n",
"imputed_chromhmm_col_names = [imputed_chromhmm_col_name_prefix+i[0]+'_'+i[1].replace(' ','') for i in chromhmm_eid_all]\n",
"names = [position_col_names + n for n in [['LECIFScore_HumanMouse_KwonErnst2021NatureCommunications'],\n",
" ['CNEPScore_GrujicEtAl2020NatureCommunications'],\n",
" ['CSS-CNEPScore_GrujicEtAl2020NatureCommunications'], \n",
" ['PhastConsScore_SiepelEtAl2005GenomeResearch'],\n",
" ['PhastConsPlacentalScore_SiepelEtAl2005GenomeResearch'],\n",
" ['PhyloPScore_PollardEtAl2009GenomeResearch'],\n",
" ['PhyloPPlacentalScore_PollardEtAl2009GenomeResearch'],\n",
" ['ConsHMMState_Multiz100way_ArnesonErnst2019CommunicationsBiology'],\n",
" ['ChromHMMState_VuErnst2022GenomeBiology'],\n",
" core_chromhmm_col_names,\n",
" expanded_chromhmm_col_names,\n",
" imputed_chromhmm_col_names]]\n",
"\n",
"df = pd.read_table(coord_filename,header=None,names=position_col_names)\n",
"for i in range(len(bed_filenames)):\n",
" input_df = pd.read_table(bed_filenames[i],header=None,names=names[i])\n",
" df = df.merge(input_df,on=position_col_names,how='left')\n",
" \n",
"df.to_csv('hg19_annotErnstLab.csv',index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
