Revision 1db48f3a735eb0fba06a7d503f080a7ead512604 authored by Artem Artemev on 11 July 2018, 12:50:44 UTC, committed by GitHub on 11 July 2018, 12:50:44 UTC
1 parent 707b195
natural_gradients.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Natural gradients\n",
"\n",
"This shows some basic usage of the natural gradient optimizer, both on its own and in combination with other optimizers using the Actions class."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-13T13:31:56.149218Z",
"start_time": "2018-06-13T13:31:55.213980Z"
}
},
"outputs": [],
"source": [
"from matplotlib import pyplot as plt\n",
"import warnings\n",
"import numpy as np\n",
"import tensorflow as tf\n",
"import gpflow\n",
"from gpflow.test_util import notebook_niter, notebook_range\n",
"from gpflow.actions import Loop, Action\n",
"from gpflow.models import VGP, GPR, SGPR, SVGP\n",
"from gpflow.training import NatGradOptimizer, AdamOptimizer, XiSqrtMeanVar\n",
"\n",
"%matplotlib inline\n",
"%precision 4\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"np.random.seed(0)\n",
"\n",
"N, D = 100, 2\n",
"M = 10\n",
"X = np.random.uniform(size=(N, D))\n",
"Y = np.sin(10 * X)\n",
"Z = np.random.uniform(size=(M, D))\n",
"learning_rate = 0.01\n",
"iterations = 5\n",
" \n",
"\n",
"def make_matern_kernel():\n",
" return gpflow.kernels.Matern52(D)\n",
"\n",
" \n",
"class PrintAction(Action):\n",
" def __init__(self, model, text):\n",
" self.model = model\n",
" self.text = text\n",
" \n",
" def run(self, ctx):\n",
" likelihood = ctx.session.run(self.model.likelihood_tensor)\n",
" print('{}: iteration {} likelihood {:.4f}'.format(self.text, ctx.iteration, likelihood))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### \"VGP is a GPR\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Natural gradients turn VGP into GPR in a *single step, if the likelihood is Gaussian*."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T09:59:57.799754Z",
"start_time": "2018-04-30T09:59:57.547423Z"
}
},
"outputs": [],
"source": [
"vgp = VGP(X, Y, make_matern_kernel(), gpflow.likelihoods.Gaussian())\n",
"gpr = GPR(X, Y, make_matern_kernel())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Exact GP likelihood:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T09:59:57.826302Z",
"start_time": "2018-04-30T09:59:57.800812Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"-231.08993573018935"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gpr.compute_log_likelihood()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* VGP likelihood before natural gradient step:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T09:59:57.863144Z",
"start_time": "2018-04-30T09:59:57.827383Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"-328.84293853435463"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vgp.compute_log_likelihood()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* VGP likelihood after a single natural gradient step:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T09:59:58.254414Z",
"start_time": "2018-04-30T09:59:57.864617Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"-231.0899982510537"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"NatGradOptimizer(gamma=1.).minimize(vgp, maxiter=1, var_list=[[vgp.q_mu, vgp.q_sqrt]])\n",
"vgp.compute_log_likelihood()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interleaving an ordinary gradient step with a NatGrad's optimizer step\n",
"\n",
"In this case (Gaussian likelihood) it achieves optimization of hyperparameters as if the model were GPR."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Method for running Adam optimization on GPR:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T09:59:58.259075Z",
"start_time": "2018-04-30T09:59:58.255558Z"
}
},
"outputs": [],
"source": [
"def run_adam(model, lr, iterations, callback=None):\n",
" adam = AdamOptimizer(lr).make_optimize_action(model)\n",
" actions = [adam] if callback is None else [adam, callback]\n",
" loop = Loop(actions, stop=iterations)()\n",
" model.anchor(model.enquire_session())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Method for running Adam and Natural gradients optimization on VGP. The hyperparameters at the end should match the GPR model."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T09:59:58.270623Z",
"start_time": "2018-04-30T09:59:58.260311Z"
}
},
"outputs": [],
"source": [
"def run_nat_grads_with_adam(model, lr, gamma, iterations, var_list=None, callback=None):\n",
" # we'll make use of this later when we use a XiTransform\n",
" if var_list is None:\n",
" var_list = [(model.q_mu, model.q_sqrt)]\n",
"\n",
" # we don't want adam optimizing these\n",
" model.q_mu.set_trainable(False)\n",
" model.q_sqrt.set_trainable(False)\n",
"\n",
" adam = AdamOptimizer(lr).make_optimize_action(model)\n",
" natgrad = NatGradOptimizer(gamma).make_optimize_action(model, var_list=var_list)\n",
" \n",
" actions = [adam, natgrad]\n",
" actions = actions if callback is None else actions + [callback]\n",
"\n",
" Loop(actions, stop=iterations)()\n",
" model.anchor(model.enquire_session())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Optimize GPR with Adam:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T09:59:58.664235Z",
"start_time": "2018-04-30T09:59:58.271714Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GPR with Adam: iteration 0 likelihood -230.6706\n",
"GPR with Adam: iteration 1 likelihood -230.2508\n",
"GPR with Adam: iteration 2 likelihood -229.8303\n",
"GPR with Adam: iteration 3 likelihood -229.4093\n",
"GPR with Adam: iteration 4 likelihood -228.9876\n"
]
}
],
"source": [
"run_adam(gpr, learning_rate, iterations, callback=PrintAction(gpr, 'GPR with Adam'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Optimizer VGP with Adam and NatGrads:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T09:59:59.501684Z",
"start_time": "2018-04-30T09:59:58.665452Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"VGP with nat grads with Adam: iteration 0 likelihood -230.6707\n",
"VGP with nat grads with Adam: iteration 1 likelihood -230.2508\n",
"VGP with nat grads with Adam: iteration 2 likelihood -229.8304\n",
"VGP with nat grads with Adam: iteration 3 likelihood -229.4093\n",
"VGP with nat grads with Adam: iteration 4 likelihood -228.9877\n"
]
}
],
"source": [
"run_nat_grads_with_adam(vgp, learning_rate, 1., iterations, callback=PrintAction(vgp, 'VGP with nat grads with Adam'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare GPR and VGP lengthscales:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T09:59:59.505965Z",
"start_time": "2018-04-30T09:59:59.503129Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'GPR lengthscales = 0.9686, VGP lengthscales = 0.9686'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"GPR lengthscales = {:.4f}, VGP lengthscales = {:.4f}\".format(\n",
" gpr.kern.lengthscales.read_value()[()],\n",
" vgp.kern.lengthscales.read_value()[()])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### This also works for the sparse model\n",
"Nat grads turn SVGP into SGPR in the Gaussian likelihood case. We can apply the above with hyperparameters, too, though here we'll just do a single step."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T09:59:59.946016Z",
"start_time": "2018-04-30T09:59:59.507143Z"
}
},
"outputs": [],
"source": [
"svgp = SVGP(X, Y, make_matern_kernel(), gpflow.likelihoods.Gaussian(), Z=Z)\n",
"sgpr = SGPR(X, Y, make_matern_kernel(), Z=Z)\n",
"\n",
"for model in svgp, sgpr:\n",
" model.likelihood.variance = 0.1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analytically optimal sparse model likelihood:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T09:59:59.994121Z",
"start_time": "2018-04-30T09:59:59.947360Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"-281.56164436988774"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sgpr.compute_log_likelihood()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SVGP likelihood before natural gradient step:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T10:00:00.052847Z",
"start_time": "2018-04-30T09:59:59.995545Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"-1404.0805162757324"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"svgp.compute_log_likelihood()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SVGP likelihood after a single natural gradient optimization step:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T10:00:00.691979Z",
"start_time": "2018-04-30T10:00:00.053823Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"-281.56164470640283"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"NatGradOptimizer(1.0).minimize(svgp, maxiter=1, var_list=[(svgp.q_mu, svgp.q_sqrt)])\n",
"svgp.compute_log_likelihood()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Minibatches\n",
"A crucial property of the natural gradient method is that it still works with minibatches. We need to use a smaller gamma."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T10:00:01.667254Z",
"start_time": "2018-04-30T10:00:00.692973Z"
}
},
"outputs": [],
"source": [
"svgp = SVGP(X, Y, make_matern_kernel(), gpflow.likelihoods.Gaussian(), Z=Z, minibatch_size=50)\n",
"svgp.likelihood.variance = 0.1\n",
"\n",
"NatGradOptimizer(gamma=0.1).minimize(svgp, maxiter=notebook_niter(100), var_list=[(svgp.q_mu, svgp.q_sqrt)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Minibatch SVGP likelihood after natural gradient optimization:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T10:00:03.742541Z",
"start_time": "2018-04-30T10:00:01.668715Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"-282.1567900386383"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.average([svgp.compute_log_likelihood() for _ in notebook_range(1000)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comparison with ordinary gradients in the conjugate case\n",
"\n",
"#### (Take home message: natural gradients are always better)\n",
"\n",
"Compared with doing SVGP with ordinary gradients with minibatches, the natural gradient optimizer is much faster in the Gaussian case. \n",
"\n",
"Here we'll do hyperparameter learning together with optimization of the variational parameters, comparing the interleaved nat grad approach and using ordinary gradients for the hyperparameters and variational parameters jointly."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2018-04-30T10:00:07.102187Z",
"start_time": "2018-04-30T10:00:03.743831Z"
}
},
"outputs": [],
"source": [
"gpflow.reset_default_graph_and_session()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-13T13:32:06.105670Z",
"start_time": "2018-06-13T13:32:02.550268Z"
}
},
"outputs": [],
"source": [
"svgp_ordinary = SVGP(X, Y, make_matern_kernel(), gpflow.likelihoods.Gaussian(), Z=Z, minibatch_size=50)\n",
"svgp_nat = SVGP(X, Y, make_matern_kernel(), gpflow.likelihoods.Gaussian(), Z=Z, minibatch_size=50)\n",
"\n",
"# ordinary gradients and Adam\n",
"AdamOptimizer(learning_rate).minimize(svgp_ordinary, maxiter=iterations)\n",
"\n",
"# NatGrads with Adam\n",
"run_nat_grads_with_adam(svgp_nat, learning_rate, 0.1, iterations)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SVGP likelihood after ordinary _Adam optimization_:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-306.6432692277338"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.average([svgp_ordinary.compute_log_likelihood() for _ in notebook_range(1000)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SVGP likelihood after _NatGrad + Adam optimization_:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-233.91488549350396"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.average([svgp_nat.compute_log_likelihood() for _ in notebook_range(1000)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Comparison with ordinary gradients in the non-conjugate case\n",
"\n",
"#### (Take home message: natural gradients are usually better)\n",
"\n",
"We can use nat grads even when the likelihood isn't Gaussian. It isn't guaranteed to be better, but it usually is better in practical situations."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"gpflow.reset_default_graph_and_session()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-13T13:32:13.397570Z",
"start_time": "2018-06-13T13:32:10.157801Z"
}
},
"outputs": [],
"source": [
"Y_binary = np.random.choice([1., -1], size=X.shape)\n",
"\n",
"vgp_bernoulli = VGP(X, Y_binary, make_matern_kernel(), gpflow.likelihoods.Bernoulli())\n",
"vgp_bernoulli_natgrads = VGP(X, Y_binary, make_matern_kernel(), gpflow.likelihoods.Bernoulli())\n",
"\n",
"# ordinary gradients and Adam\n",
"AdamOptimizer(learning_rate).minimize(vgp_bernoulli, maxiter=iterations)\n",
"\n",
"# nat grads with Adam \n",
"run_nat_grads_with_adam(vgp_bernoulli_natgrads, learning_rate, 0.1, iterations)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"VGP likelihood after ordinary *Adam optimization*:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-186.44639583098405"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vgp_bernoulli.compute_log_likelihood()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"VGP likelihood after combination of optimizers, *NatGrad + Adam*:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-146.805910251198"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vgp_bernoulli_natgrads.compute_log_likelihood()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also choose to run natural gradients in another parameterization. The \n",
"sensible choice is the model parameters (q_mu, q_sqrt), which is already in gpflow."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"vgp_bernoulli_natgrads_xi = VGP(X, Y_binary, make_matern_kernel(), gpflow.likelihoods.Bernoulli())\n",
"\n",
"var_list = [(vgp_bernoulli_natgrads_xi.q_mu, vgp_bernoulli_natgrads_xi.q_sqrt, XiSqrtMeanVar())]\n",
"run_nat_grads_with_adam(vgp_bernoulli_natgrads_xi, learning_rate, 0.01, iterations, var_list=var_list)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"VGP likelihood after NatGrads with XiSqrtMeanVar + Adam optimization:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-13T13:32:15.641457Z",
"start_time": "2018-06-13T13:32:15.580550Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"-160.67030522678684"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vgp_bernoulli_natgrads_xi.compute_log_likelihood()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With sufficiently small steps, it shouldn't make a difference which transform is used, but for large \n",
"steps this can make a difference in practice."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

Computing file changes ...