# GLFixedEffectModels.jl

![example branch parameter](https://github.com/jmboehm/GLFixedEffectModels.jl/actions/workflows/ci.yml/badge.svg?branch=master) [![codecov.io](http://codecov.io/github/jmboehm/GLFixedEffectModels.jl/coverage.svg?branch=master)](http://codecov.io/github/jmboehm/GLFixedEffectModels.jl?branch=master) [![DOI](https://zenodo.org/badge/164128032.svg)](https://zenodo.org/badge/latestdoi/164128032)

This package estimates generalized linear models with high dimensional categorical variables. It builds on Matthieu Gomez's [FixedEffects.jl](https://github.com/FixedEffects/FixedEffects.jl), Amrei Stammann's [Alpaca](https://github.com/amrei-stammann/alpaca), and Sergio Correia's [ppmlhdfe](https://github.com/sergiocorreia/ppmlhdfe).

## Installation

```
] add GLFixedEffectModels
```

## Example use

```julia
using GLFixedEffectModels, GLM, Distributions
using RDatasets

df = dataset("datasets", "iris")
df.binary = zeros(Float64, size(df,1))
df[df.SepalLength .> 5.0,:binary] .= 1.0
df.SpeciesStr = string.(df.Species)
idx = rand(1:3,size(df,1),1)
a = ["A","B","C"]
df.Random = vec([a[i] for i in idx])

m = @formula binary ~ SepalWidth + fe(Species)
x = nlreg(df, m, Binomial(), LogitLink(), start = [0.2] )

m = @formula binary ~ SepalWidth + PetalLength + fe(Species)
nlreg(df, m, Binomial(), LogitLink(), Vcov.cluster(:SpeciesStr,:Random) , start = [0.2, 0.2] )
```

## Documentation

The main function is `nlreg()`, which returns a `GLFixedEffectModel <: RegressionModel`.

```julia
nlreg(df, formula::FormulaTerm,
    distribution::Distribution,
    link::GLM.Link,
    vcov::CovarianceEstimator; ...)
```

The required arguments are:
* `df`: a Table
* `formula`: A formula created using `@formula`.
* `distribution`: A `Distribution`. See the documentation of [GLM.jl](https://juliastats.org/GLM.jl/stable/manual/#Fitting-GLM-models-1) for valid distributions.
* `link`: A `GLM.Link` function.
See the documentation of [GLM.jl](https://juliastats.org/GLM.jl/stable/manual/#Fitting-GLM-models-1) for valid link functions.
* `vcov`: A `CovarianceEstimator` to compute the variance-covariance matrix.

The optional arguments are:
* `save::Union{Bool, Symbol} = false`: Should residuals and estimated fixed effects be saved in a dataframe? Use `save = :residuals` to save only residuals. Use `save = :fe` to save only fixed effects.
* `method::Symbol`: A symbol for the method. Default is `:cpu`. Alternatively, `:gpu` requires `CuArrays`. In this case, use the option `double_precision = false` to use `Float32`. This option is the same as for the [FixedEffectModels.jl](https://github.com/FixedEffects/FixedEffectModels.jl) package.
* `double_precision::Bool = true`: Uses 64-bit floats (`Float64`) if `true`, otherwise 32-bit (`Float32`). This also applies to the demeaning operation.
* `drop_singletons = true`: Drop observations that are perfectly classified.
* `contrasts::Dict = Dict()`: An optional Dict of contrast codings for each categorical variable in the `formula`. Any unspecified variables will have `DummyCoding`.
* `maxiter::Integer = 1000`: Maximum number of iterations in the Newton-Raphson routine.
* `maxiter_center::Integer = 10000`: Maximum number of iterations for the centering procedure.
* `dev_tol::Real`: Tolerance level for the first stopping condition of the maximization routine.
* `rho_tol::Real`: Tolerance level for the step-halving in the maximization routine.
* `step_tol::Real`: Tolerance level that accounts for rounding errors inside the step-halving routine.
* `center_tol::Real`: Tolerance level for the stopping condition of the centering algorithm. Defaults to 1e-8 if `double_precision = true`, 1e-6 otherwise.
* `separation::Vector{Symbol} = Symbol[]`: Method to detect/deal with [separation](https://github.com/sergiocorreia/ppmlhdfe/blob/master/guides/separation_primer.md).
Supported elements are `:mu`, `:fe`, `:ReLU`, and in the future, `:simplex`. `:mu` truncates mu at `separation_mu_lbound` or `separation_mu_ubound`. `:fe` finds categories of the fixed effects that only exist when y is at the separation point. `:ReLU` detects separation using ReLU, with the maximum number of iterations given by `separation_ReLU_maxiter` and the tolerance by `separation_ReLU_tol`.
* `separation_mu_lbound::Real = -Inf`: Lower bound for the separation detection/correction heuristic (on mu). What a reasonable value would be depends on the model that you're trying to fit.
* `separation_mu_ubound::Real = Inf`: Upper bound for the separation detection/correction heuristic.
* `separation_ReLU_tol::Real = 1e-4`: Tolerance level for the ReLU algorithm.
* `separation_ReLU_maxiter::Integer = 1000`: Maximum number of iterations for the ReLU algorithm.
* `verbose::Bool = false`: If `true`, prints output on each iteration.

The function returns a `GLFixedEffectModel` object which supports the `StatsBase.RegressionModel` abstraction. It can be displayed in table form by using [RegressionTables.jl](https://github.com/jmboehm/RegressionTables.jl).
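
As a rough illustration of how the optional arguments fit together, the sketch below reuses the iris data from the example use section and passes a few of the documented options as keywords. The specific values chosen (`maxiter = 100`, and so on) are arbitrary, and the use of `coef` and `nobs` on the result is an assumption based on the statement that the returned object supports the `StatsBase.RegressionModel` abstraction.

```julia
using GLFixedEffectModels, GLM, Distributions
using RDatasets

# Same data preparation as in the example use section.
df = dataset("datasets", "iris")
df.binary = zeros(Float64, size(df, 1))
df[df.SepalLength .> 5.0, :binary] .= 1.0

m = @formula binary ~ SepalWidth + fe(Species)

# Optional arguments are passed as keywords after the positional ones.
# The values below are illustrative, not recommendations.
x = nlreg(df, m, Binomial(), LogitLink();
          start = [0.2],            # starting value, as in the example use section
          maxiter = 100,            # cap on Newton-Raphson iterations
          double_precision = true,  # use Float64 throughout
          verbose = false)          # no per-iteration output

# Assumption: the generic accessors re-exported by GLM work on the result.
println(coef(x))
println(nobs(x))
```

Because `vcov` is omitted here, as in the first call of the example use section, the default covariance estimator is used; pass e.g. `Vcov.cluster(...)` as the fifth positional argument to change it.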

## Bias correction methods

The package experimentally supports bias correction methods for the following models:
- Binomial regression, Logit link, Two-way, Classic (Fernández-Val and Weidner (2016, 2018))
- Binomial regression, Probit link, Two-way, Classic (Fernández-Val and Weidner (2016, 2018))
- Binomial regression, Logit link, Two-way, Network (Hinz, Stammann and Wanner (2020) & Fernández-Val and Weidner (2016))
- Binomial regression, Probit link, Two-way, Network (Hinz, Stammann and Wanner (2020) & Fernández-Val and Weidner (2016))
- Binomial regression, Logit link, Three-way, Network (Hinz, Stammann and Wanner (2020))
- Binomial regression, Probit link, Three-way, Network (Hinz, Stammann and Wanner (2020))
- Poisson regression, Log link, Three-way, Network (Weidner and Zylkin (2021))
- Poisson regression, Log link, Two-way, Network (Weidner and Zylkin (2021))

## Things that still need to be implemented

- Better default starting values
- Weights
- Better StatsBase interface & prediction
- Better benchmarking

## Related Julia packages

- [FixedEffectModels.jl](https://github.com/FixedEffects/FixedEffectModels.jl) estimates linear models with high dimensional categorical variables (and with or without endogenous regressors).
- [FixedEffects.jl](https://github.com/FixedEffects/FixedEffects.jl) is a package for fast pseudo-demeaning operations using LSMR. Both this package and [FixedEffectModels.jl](https://github.com/FixedEffects/FixedEffectModels.jl) build on this.
- [Alpaca.jl](https://github.com/jmboehm/Alpaca.jl) is a wrapper to the [Alpaca R package](https://github.com/amrei-stammann/alpaca), which solves the same tasks as this package.
- [GLM.jl](https://github.com/JuliaStats/GLM.jl) estimates generalized linear models, but without explicit support for categorical regressors.
- [Econometrics.jl](https://github.com/Nosferican/Econometrics.jl) provides routines to estimate multinomial logit and other models.
- [RegressionTables.jl](https://github.com/jmboehm/RegressionTables.jl) supports pretty printing of results from this package.

## References

Correia, S., Guimarães, P. and Zylkin, T., 2019. Verifying the existence of maximum likelihood estimates for generalized linear models. Working paper, https://arxiv.org/abs/1903.01633

Fernández-Val, I. and Weidner, M., 2016. Individual and time effects in nonlinear panel models with large N, T. Journal of Econometrics, 192(1), pp.291-312.

Fernández-Val, I. and Weidner, M., 2018. Fixed effects estimation of large-T panel data models. Annual Review of Economics, 10, pp.109-138.

Fong, D.C. and Saunders, M., 2011. *LSMR: An Iterative Algorithm for Sparse Least-Squares Problems*. SIAM Journal on Scientific Computing.

Hinz, J., Stammann, A. and Wanner, J., 2021. State dependence and unobserved heterogeneity in the extensive margin of trade.

Stammann, A., 2018. *Fast and Feasible Estimation of Generalized Linear Models with High-Dimensional k-way Fixed Effects*. Mimeo, Heinrich-Heine University Düsseldorf.

Weidner, M. and Zylkin, T., 2021. Bias and consistency in three-way gravity models. Journal of International Economics, 132, p.103513.