``````{
"cells": [
{
"cell_type": "markdown",
"id": "73c8c1a0",
"source": [
"## Ripley K statistic"
]
},
{
"cell_type": "markdown",
"id": "68520b2c",
"source": [
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "67e3ed6d",
"outputs": [],
"source": [
"# Required packages\n",
"import numpy as np\n",
"import csv\n",
"import pandas as pd\n",
"import ripleyk\n",
"import time\n",
"import matplotlib.pyplot as plt\n",
"import os\n",
"import glob"
]
},
{
"cell_type": "markdown",
"id": "6d3b2196",
"source": [
"Next, select a list of scales (radii of circles) at which to investigate clustering:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "139cf08f",
"outputs": [],
"source": [
"# Choose the radii for the Ripley function\n",
"radii = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1] #, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2]\n"
]
},
{
"cell_type": "markdown",
"id": "f56940e7",
"source": [
"### Complete Ripley function"
]
},
{
"cell_type": "markdown",
"id": "8398053e",
"source": [
"The function **ripley_pvalue** takes a pandas dataframe of (x,y,z) coordinates a input, together with the list of radii and some further additional data. It computes Ripley K's statistic, both on the original data and on many iterations of a uniformly distributed data set of the same size, and outputs a dataframe with summary measures for each radius.\n",
"\n",
"In particular, the output contains the Ripley K statistic value above which clustering is significant at the 2-sided p-value of 0.05. This value is computed using the Neyman-Pearson Lemma, and dediced by an appropriately biased coin toss in case of a tie. The function also outputs the value of the Ripley K statistic on the original data, as well as some other measures."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "5d05caba",
"outputs": [],
"source": [
"# Defines a function that computes the Ripley K statistic at a list of values of radii,\n",
"# and computes a p-value to assess significant clustering relative to a uniformly\n",
"# distributed data set of the same size.\n",
"\n",
"# The input to ripley_pvalue is the following:\n",
"\n",
"# 'df', pandas dataframe of (x,y,z) coordinates of spots in a nucleus\n",
"# 'n_data', number of draws from the null ---> Ideally, this is a multiple of 40\n",
"# 'bounding_radius', the (estimated) radius of the sphere bounding the sample (we have to choose this wisely!)\n",
"# 'boundary', either 'True' or 'False' indicating if a boundary correction is applied (still does not work ...)\n",
"# 'normalise', either 'True' or 'False' indication if normalisation takes place (should say 'False')\n",
"\n",
"# The output of this function is a pandas dataframe with columns\n",
"# 'Radii': value of the radius, i.e., the scale at which clustering is considered\n",
"# 'Mean':\n",
"# 'sigcutoff': value of the null Ripley K statistic at 2.5% from the largest (i.e.,\n",
"# cut-off for significance based on a 2-sided p-value of 0.05)\n",
"# 'Observed': observed value of the Ripley K statistic on the data\n",
"# 'pvals': corresponding p-value quantifying significant clustering relative to a\n",
"# uniformly distributed data set of the same size\n",
"\n",
"# For now, only works with a list of 10 radii\n",
"\n",
"    \n",
"    # Extract the number of spots from the nucleus df\n",
"    n_spot = df.shape[0]\n",
"    \n",
"    \n",
"    # Extract the max and min coordinate values of spots for the nucleus df\n",
"    xMin=(df[0].values.min(0));xMax=(df[0].values.max(0));\n",
"    yMin=(df[1].values.min(0));yMax=(df[1].values.max(0));\n",
"    zMin=(df[2].values.min(0));zMax=(df[2].values.max(0));\n",
"\n",
"    xDelta = xMax-xMin\n",
"    yDelta = yMax-yMin\n",
"    zDelta = zMax-zMin\n",
"    \n",
"    # Generate a null distribution of Ripley K's of uniformly distributed samples with n_spot spots\n",
"    null = []\n",
"    for i in range(n_data):\n",
"        # Obtain three np.arrays with x, y, z coordinates respectively\n",
"        xx = xDelta*np.random.uniform(0,1,n_spot)+xMin;#x coordinates of Poisson points\n",
"        yy = yDelta*np.random.uniform(0,1,n_spot)+yMin;#y coordinates of Poisson points\n",
"        zz = zDelta*np.random.uniform(0,1,n_spot)+zMin;#z coordinates of Poisson points\n",
"    \n",
"        null.append(ripley_null)\n",
"    \n",
"    # Collect results into a Pandas dataframe\n",
"    df_null = pd.DataFrame(null, columns=[str(x) for x in radii])\n",
"    \n",
"    # Sort each column in descending order (hence the [::-1] after np.sort())\n",
"    df_null_ordered = pd.DataFrame(np.sort(df_null.values, axis=0)[::-1], index=df_null.index, columns=df_null.columns)\n",
"    df_median = df_null_ordered.iloc[np.floor(n_data / 2).astype(int)]\n",
"\n",
"    # For each radius, find the value 2.5% away from the largest\n",
"    null_cutoffs = []\n",
"        significance_cut = np.floor(n_data / 40).astype(int)\n",
"        null_cutoffs.append(df_null_ordered.iloc[significance_cut][i])\n",
"    \n",
"    null_cuts = np.asarray(null_cutoffs)\n",
"    # Apply Ripley K to the data of the nucleus\n",
"    ripley_result = np.asarray(ripley_result)\n",
"    \n",
"    # Compute the p-values for each (nucleus,radius) pair\n",
"    # The choice of p-value is OPTIMISTIC, so it is biased towards significance\n",
"    # If still not significant, this strengthens the argument for true non-significance.\n",
"    pvals = []\n",
"        if ripley_result[i] == null_cuts[i]:\n",
"            idx = df_null_ordered.index[(df_null_ordered.iloc[:,i] == ripley_result[i])]\n",
"            n_cut = n_data / 40\n",
"            idx_cut = [x for x in idx if x < n_cut]\n",
"            \n",
"            above_cutoff = len(idx_cut)\n",
"            equal_to_cutoff = (df_null_ordered.iloc[:,i] == ripley_result[i]).sum()\n",
"            coin_prob = above_cutoff / equal_to_cutoff\n",
"            \n",
"            coin = np.random.binomial(1,coin_prob,size = None)\n",
"            if coin == 1:\n",
"                if np.min(idx) < n_data/2:\n",
"                    pvalue = (np.min(idx) / n_data) * 2\n",
"                else:\n",
"                    pvalue = ( (n_data - np.min(idx) ) / n_data) * 2\n",
"            else:\n",
"                if np.max(idx) < n_data/2:\n",
"                    pvalue = (np.max(idx) / n_data) * 2\n",
"                else:\n",
"                    # If ripley_result[i] == null_cuts[i] *and* we have lost the coin toss, then it's over\n",
"                    pvalue = 1\n",
"                \n",
"            pvals.append(pvalue)\n",
"        \n",
"        # Are these p-value computations correct?\n",
"        elif ripley_result[i] > null_cuts[i]:\n",
"            idx = df_null_ordered.index[df_null_ordered.iloc[:,i] >= ripley_result[i]]\n",
"            # Test if list is empty\n",
"            if idx.empty:\n",
"                pvalue = (1 / n_data) * 2\n",
"            else:\n",
"                pvalue = (np.min(idx) / n_data) * 2\n",
"            \n",
"            pvals.append(pvalue)\n",
"        else:\n",
"            idx = df_null_ordered.index[df_null_ordered.iloc[:,i] <= ripley_result[i]]\n",
"            if np.min(idx) < n_data/2:\n",
"                pvalue = (np.min(idx) / n_data) * 2\n",
"            else:\n",
"                pvalue = ( (n_data - np.min(idx) ) / n_data ) * 2\n",
"            \n",
"            pvals.append(pvalue)\n",
"    \n",
"    pvals = np.asarray(pvals)\n",
"    \n",
"    return pd.DataFrame(np.transpose(np.array([radii, df_median, null_cuts, ripley_result, pvals])), columns = ['Radii', 'Mean', 'sigcutoff', 'Observed', 'pvals'])\n"
]
},
{
"cell_type": "markdown",
"id": "77af930f",
"source": [
"One can also apply the function ripley_pvalue to a list of csv files of (x,y,z) coordinates, as suggested here:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5f06558",
"outputs": [],
"source": [
"# Local path\n",
"os.getcwd()"
]
},
{
"cell_type": "code",
"execution_count": 80,
"id": "a5f4b53b",
"outputs": [],
"source": [
"# Read in a list of csv files consisting of (x,y,z) coordinates\n",
"os.chdir('C:file_location')\n",
"extension = 'csv'\n",
"files = [i for i in glob.glob('*.{}'.format(extension))]"
]
},
{
"cell_type": "code",
"execution_count": 83,
"id": "e83264c9",
"scrolled": true
},
"outputs": [],
"source": [
"# Loop over a list of csv files, each containing 3D coordinates of spots on a single image\n",
"results = pd.DataFrame()\n",
"for i in files:\n",
"    output['filename'] = i\n",
"    results = pd.concat([results, output])"
]
},
{
"cell_type": "code",
"execution_count": 84,
"id": "ea01192a",
"outputs": [],
"source": [
"#results.to_csv('file_name.csv', index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b7e3915",
"outputs": [],
"source": [
"# Inspect results\n",
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3156e780",
"outputs": [],
"source": []
}
],
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
``````