https://github.com/Data2Semantics/powerlaws
Tip revision: 4aa1106b7bf128f38ebf0b4cfbfda8373dedb2ec authored by Peter Bloem on 13 October 2020, 12:27:30 UTC
Merge pull request #4 from Data2Semantics/dependabot/maven/powerlaws/junit-junit-4.13.1
Merge pull request #4 from Data2Semantics/dependabot/maven/powerlaws/junit-junit-4.13.1
Tip revision: 4aa1106
README.md
Power laws
=========
This is a small java library for the analysis of power law distributed data. The
methods implemented are all taken form the paper _Power-Law Distributions in
Empirical Data_ by Clauset, Shalizi and Newman (2007) and its reference
implementations available at http://tuvalu.santafe.edu/~aaronc/powerlaws/
It contains facilities for estimating parameters, uncertainty and significance
and for generating power law distributed data. All methods are implemented for
continuous data, discrete data and the approximation of discrete data with a
continuous distribution.
# Installation
You can use [jitpack.io](https://jitpack.io/#Data2Semantics/powerlaws) to add the powerlaws library to your project. Simply add
the following repository to your pom.xml:
```xml
<repositories>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
```
And the following dependency
```
<dependency>
<groupId>com.github.Data2Semantics</groupId>
<artifactId>powerlaws</artifactId>
<version>v0.1.0</version>
</dependency>
```
Check [the releases](https://github.com/Data2Semantics/powerlaws/releases) for the version number of the latest release.
# The Basics
The main thing you're likely to want to do is estimate the exponent for
some data. Say you have your data as a Collection of double values called 'data'.
You can estimate the exponent as follows:
```java
double exponent = Continuous.fit(data).fit().exponent();
```
The second fit() method above actually returns a representation of a whole
continuous power law distribution, which you can use to do various things:
```java
Continuous distribution = Continuous.fit(data).fit();
// * Retrieve the distribution's parameters
double exponent = distribution.exponent();
double xMin = distribution.xMin();
// * Generate some synthetic data
List<Double> generated = distribution.generate(1000);
```
Of course, if you have a specific distribution in mind, you can create the
Continuous object directly:
```java
// * Create a continuous power law distribution with xMin 340 and exponent 2.5
Continuous distribution = new Continuous(340, 2.5);
```
If your data is discrete (ie. integers) you can use the class Discrete in an
analogous way. Since the methods in Discrete are a bit slower than those in
Continuous, you can also use DiscreteApproximate, which approximates a discrete
power law with a continuous one. According to Clauset, this use is justified if
xMin is larger than 6. For xMin > 1000 you can safely use the approximate
version.
#The Details
## Uncertainty
The uncertainty measures, in a sense, the robustness of our estimates. If a
small change in the data leads to a massively different value, then we cannot
rely on it, or we must gather more data.
We estimate the uncertainty by _bootstrapping_. We repeat the experiment a large
number of times (1000 to 10000) with data sets that are sampled with replacement
from the original data set. For each we record the measured exponent, the xMin
parameter and the number of points in the tail. In each case the sample standard
deviation of the recorded values is the uncertainty.
## Varying xMin
The method of fitting described above uses a maximum likelihood estimator to
estimate the exponent for a given xMin parameter. It then chooses as xMin the
data point for which the resulting distribution has the smallest KS statistic to
the data. If you want to test different values of xMin, you can use the Fit
object which represents the intermediate stage of fitting a power law to data:
```java
// * Create a Fit object
Fit<Double, PowerLaw<Double>> fit = Continuous.fit(data);
// * Print out the distances for all of the data points
for(double datum : data)
System.out.println(datum + ": " + fit.fit(datum));
```
## Significance
Any data can have a power law fit to it. To ascertain whether the power law is
a proper model, we must calculate the significance. To do this, we generate
a large number of synthetic data sets like our original data set and fit a model
to it. We then calculate the KS statistic between the generated data and its
model. The ratio of generated data sets with a distance higher than that of our
fit is the probability of seeing a fit as 'bad' as ours given that the data comes
from a power law distribution.
```java
Continuous model = Continuous.fit(data).fit();
// * Calculate the significance based on 100 trials
double significance = model.significance(data, 1000);
// * Calculate the significance to an accuracy of 0.01
double significance = model.significance(data, 0.01);
```
The calculation of the significance is by far the most expensive process. For a
large data set, the process an easily take hours. Note that this is a
significance test for the the power law hypothesis (rather than the null
hypothesis) so that _high_ values mean that the power law is a good fit. Clauset
et al. suggest that for values below 0.01 the power law hypothesis should be
rejected.
## Randomness
All random numbers come from PowerLaws.random. By default this Random has a
fixed seed so that runs can be deterministically repeated. If this behaviour
is not desired, you can give the Random a random seed:
```java
PowerLaws.random = new Random();
```
## The KS Test
The Kolmogorov-Smirnov test is used to estimate the xMin parameter. There is a
small difference between the reference implementation of Clauset 2007 and the
official definition of the KS test. The change does not cause a great difference
in measured values (within uncertainties) but it is important when replicating
the results in Clauset 2007.
The constant PowerLaws.KS_CORRECT determines which version of the KS test is
used. If true, the theoretically correct version is used, if false, the version
based on the reference implementation is used.
# Contact
For bug fixes and suggestions, GitHub is preferred (send pbloem a message or
create a ticket). If you don't have a GitHub account you can email to
p & peterbloem & nl. (replacing the ampersands with an at symbol and a dot
respectively).
# License
This library is released under the MIT license. You are free to create derivative
works, including for commercial purposes, and to re-release under another license.
Attribution is not required, but it is always appreciated. See the file
LICENSE.txt for the full text of the license.
![githalytics.com alpha](https://cruel-carlota.pagodabox.com/fbbc7703db32ced91f2edd767774a2dc)