Interior operators in AdS/CFT

In a previous post, I mentioned that the firewall paradox could be phrased as a question about the existence of interior operators that satisfy the correct thermal correlation functions, namely 

\displaystyle \langle\Psi|\mathcal{O}(t,\mathbf{x})\tilde{\mathcal{O}}(t',\mathbf{x}')|\Psi\rangle =Z^{-1}\mathrm{tr}\left[e^{-\beta H}\mathcal{O}(t,\mathbf{x})\mathcal{O}(t'+i\beta/2,\mathbf{x}')\right]~, \ \ \ \ \ (1)

where {\tilde{\mathcal{O}}} and {\mathcal{O}} denote operators inside and outside the black hole, respectively; cf. eqn. (2) here. In this short post, I’d like to review the basic argument leading up to this statement, following the original works [1,2].

Consider the eternal black hole in AdS as depicted in the following diagram, which I stole from [1]:


The blue line connecting the two asymptotic boundaries is the Cauchy slice on which we’ll construct our states, denoted {\Sigma_I} in exterior region {I} and {\Sigma_{III}} in exterior region {III}. Note that, modulo possible UV divergences at the origin, either half serves as a complete Cauchy slice if we restrict our inquiries to the associated exterior region. But if we wish to reconstruct states in the interior (henceforth just {II}, since we don’t care about {IV}), then we need the entire slice. Pictorially, one can see this from the fact that only the left-moving modes from region {I}, and the right-moving modes from region {III}, cross the horizon into region {II}, but we need both left- and right-movers to have a complete mode decomposition.

To expand on this, imagine we proceed with the quantization of a free scalar field in region {I}. We need to solve the Klein-Gordon equation,

\displaystyle \left(\square-m^2\right)\phi=\frac{1}{\sqrt{-g}}\,\partial_\mu\left( g^{\mu\nu}\sqrt{-g}\,\partial_\nu\phi\right)-m^2\phi=0 \ \ \ \ \ (2)

on the AdS black brane background,

\displaystyle \mathrm{d} s^2=\frac{1}{z^2}\left[-h(z)\mathrm{d} t^2+\frac{\mathrm{d} z^2}{h(z)}+\mathrm{d}\mathbf{x}^2\right]~, \quad\quad h(z)\equiv1-\left(\frac{z}{z_0}\right)^d~, \ \ \ \ \ (3)

where, in Poincaré coordinates, the asymptotic boundary is at {z\!=\!0}, and the horizon is at {z\!=\!z_0}. We work in {(d\!+\!1)} spacetime dimensions, so {\mathbf{x}} is a {(d\!-\!1)}-dimensional vector representing the transverse coordinates. Note that we’ve set the AdS radius to 1. Substituting the usual plane-wave ansatz

\displaystyle f_{\omega,\mathbf{k}}(t,\mathbf{x},z)=e^{-i\omega t+i\mathbf{k}\cdot\mathbf{x}}\psi_{\omega,\mathbf{k}}(z) \ \ \ \ \ (4)

into the Klein-Gordon equation results in a second order ordinary differential equation for the radial function {\psi_{\omega,\mathbf{k}}(z)}, and hence two linearly independent solutions. As usual, we then impose normalizable boundary conditions at infinity, which leaves us with a single linear combination for each {(\omega,\mathbf{k})}. Note that we do not impose boundary conditions at the horizon. Naïvely, one might have thought to impose ingoing boundary conditions there; however, as remarked in [1], this precludes the existence of real {\omega}. More intuitively, I think of this as simply the statement that the black hole is evaporating, so we should allow the possibility for outgoing modes as well. (That is, assuming a large black hole in AdS, the black hole is in thermal equilibrium with the surrounding environment, so the outgoing and ingoing fluxes are precisely matched, and it maintains constant size). The expression for {\psi_{\omega,\mathbf{k}}(z)} is not relevant here; see [1] for more details.

We thus arrive at the standard expression of the (bulk) field {\phi} in terms of creation and annihilation operators,

\displaystyle \phi_I(t,\mathbf{x},z)=\int_{\omega>0}\frac{\mathrm{d}\omega\mathrm{d}^{d-1}\mathbf{k}}{\sqrt{2\omega}(2\pi)^d}\,\bigg[ a_{\omega,\mathbf{k}}\,f_{\omega,\mathbf{k}}(t,\mathbf{x},z)+\mathrm{h.c.}\bigg]~, \ \ \ \ \ (5)

where the creation/annihilation operators for the modes may be normalized with respect to the Klein-Gordon norm, so that

\displaystyle [a_{\omega,\mathbf{k}},a^\dagger_{\omega',\mathbf{k}'}]=\delta(\omega-\omega')\delta^{d-1}(\mathbf{k}-\mathbf{k}')~. \ \ \ \ \ (6)

Of course, a similar expansion holds for region {III}:

\displaystyle \phi_{III}(t,\mathbf{x},z)=\int_{\omega>0}\frac{\mathrm{d}\omega\mathrm{d}^{d-1}\mathbf{k}}{\sqrt{2\omega}(2\pi)^d}\,\bigg[\tilde a_{\omega,\mathbf{k}}\,g_{\omega,\mathbf{k}}(t,\mathbf{x},z)+\mathrm{h.c.}\bigg]~, \ \ \ \ \ (7)

where the mode operators {\tilde a_{\omega,\mathbf{k}}} commute with all {a_{\omega,\mathbf{k}}} by construction.

Now, what of the future interior, region {II}? Unlike the exterior regions, we no longer have any boundary condition to impose, since every Cauchy slice which crosses this region is bounded on both sides by a future horizon. Consequently, we retain both the linear combinations obtained from the Klein-Gordon equation, and hence have twice as many modes as in either {I} or {III}—which makes sense, since the interior receives contributions from both exterior regions. Nonetheless, it may be a bit confusing from the bulk perspective, since any local observer would simply arrive at the usual mode expansion involving only a single set of creation/annihilation operators, and I don’t have an intuition as to how {a_{\omega,\mathbf{k}}} and {\tilde a_{\omega,\mathbf{k}}} relate vis-à-vis their commutation relations in this shared domain. However, the entire framework in which the interior is fed by two exterior regions is only properly formulated in AdS/CFT, in which — it is generally thought — the interior region emerges from the entanglement structure between the two boundaries, so I prefer to uplift this discussion to the CFT before discussing the interior region in detail. This avoids the commutation confusion above — since the operators live in different CFTs — and it was the next step in our analysis anyway. (Incidentally, appendix B of [1] performs the mode decomposition in all three regions explicitly for the case of Rindler space, which provides a nice concrete example in which one can get one’s hands dirty).

So, we want to discuss local bulk fields from the perspective of the boundary CFT. From the extrapolate dictionary, we know that local bulk operators become increasingly smeared over the boundary (in both space and time) the farther we move into the bulk. Thus in region {I}, we can construct the operator

\displaystyle \phi^I_{\mathrm{CFT}}(t,\mathbf{x},z)=\int_{\omega>0}\frac{\mathrm{d}\omega\mathrm{d}^{d-1}\mathbf{k}}{(2\pi)^d}\,\bigg[\mathcal{O}_{\omega,\mathbf{k}}\,f_{\omega,\mathbf{k}}(t,\mathbf{x},z)+\mathcal{O}^\dagger_{\omega,\mathbf{k}}f^*_{\omega,\mathbf{k}}(t,\mathbf{x},z)\bigg]~, \ \ \ \ \ (8)

which, while a non-local operator in the CFT (constructed from local CFT operators {\mathcal{O}_{\omega,\mathbf{k}}} which act as creation operators of light primary fields), behaves like a local operator in the bulk. Note that from the perspective of the CFT, {z} is an auxiliary coordinate that simply parametrizes how smeared-out this operator is on the boundary.

As an aside, the critical difference between (8) and the more familiar HKLL prescription [3-5] is that the former is formulated directly in momentum space, while the latter is defined in position space as

\displaystyle \phi_{\mathrm{CFT}}(t,\mathbf{x},z)=\int\!\mathrm{d} t'\mathrm{d}^{d-1}\mathbf{x}'\,K(t,\mathbf{x},z;t',\mathbf{x}')\mathcal{O}(t',\mathbf{x}')~, \ \ \ \ \ (9)

where the integration kernel {K} is known as the “smearing function”, and depends on the details of the spacetime. To solve for {K}, one performs a mode expansion of the local bulk field {\phi} and identifies the normalizable mode with the local boundary operator {\mathcal{O}} in the boundary limit. One then has to invert this relation to find the bulk mode operator, and then insert this into the original expansion of {\phi}. The problem now is that to identify {K}, one needs to swap the order of integration between position and momentum space, and the presence of the horizon results in a fatal divergence that obstructs this maneuver. As discussed in more detail in [1] however, working directly in momentum space avoids this technical issue. But the basic relation “smeared boundary operators {\longleftrightarrow} local bulk fields” is the same.

Continuing, we have a similar bulk-boundary relation in region {III}, in terms of operators {\tilde{\mathcal{O}}_{\omega,\mathbf{k}}} living in the left CFT:

\displaystyle \phi^{III}_{\mathrm{CFT}}(t,\mathbf{x},z)=\int_{\omega>0}\frac{\mathrm{d}\omega\mathrm{d}^{d-1}\mathbf{k}}{(2\pi)^d}\,\bigg[\tilde{\mathcal{O}}_{\omega,\mathbf{k}}\,g_{\omega,\mathbf{k}}(t,\mathbf{x},z)+\tilde{\mathcal{O}}^\dagger_{\omega,\mathbf{k}}g^*_{\omega,\mathbf{k}}(t,\mathbf{x},z)\bigg]~. \ \ \ \ \ (10)

Note that even though I’ve used the same coordinate labels, {t} runs backwards in the left wedge, so that {\tilde{\mathcal{O}}_{\omega,\mathbf{k}}} plays the role of the creation operator here. From the discussion above, the form of the field in the black hole interior is then

\displaystyle \phi^{II}_{\mathrm{CFT}}(t,\mathbf{x},z)=\int_{\omega>0}\frac{\mathrm{d}\omega\mathrm{d}^{d-1}\mathbf{k}}{(2\pi)^d}\,\bigg[\mathcal{O}_{\omega,\mathbf{k}}\,g^{(1)}_{\omega,\mathbf{k}}(t,\mathbf{x},z)+\tilde{\mathcal{O}}_{\omega,\mathbf{k}}g^{(2)}_{\omega,\mathbf{k}}(t,\mathbf{x},z)+\mathrm{h.c.}\bigg]~, \ \ \ \ \ (11)

where {\mathcal{O}_{\omega,\mathbf{k}}} and {\tilde{\mathcal{O}}_{\omega,\mathbf{k}}} are the (creation/annihilation operators for the) boundary modes in the right and left CFTs, respectively. The point is that in order to construct a local field operator behind the horizon, both sets of modes — the left-movers {\mathcal{O}_{\omega,\mathbf{k}}} from {I} and the right-movers {\tilde{\mathcal{O}}_{\omega,\mathbf{k}}} from {III} — are required. In the eternal black hole considered above, the latter originate in the second copy of the CFT. But in the one-sided case, we would seem to have only the left-movers {\mathcal{O}_{\omega,\mathbf{k}}}, hence we arrive at the crucial question: for a one-sided black hole — such as that formed from collapse in our universe — what are the interior modes {\tilde{\mathcal{O}}_{\omega,\mathbf{k}}}? Equivalently: how can we represent the black hole interior given access to only one copy of the CFT?

To answer this question, recall that the thermofield double state,

\displaystyle |\mathrm{TFD}\rangle=\frac{1}{\sqrt{Z_\beta}}\sum_ie^{-\beta E_i/2}|E_i\rangle\otimes|E_i\rangle~, \ \ \ \ \ (12)

is constructed so that either CFT appears exactly thermal when the other side is traced out, and this well-approximates the late-time thermodynamics of a large black hole formed from collapse. That is, the exterior region will be in the Hartle-Hawking vacuum (which is to Schwarzschild as Rindler is to Minkowski), with the temperature {\beta^{-1}} of the CFT set by the mass of the black hole. This implies that correlation functions of operators {\mathcal{O}} in the pure state {|\mathrm{TFD}\rangle} may be computed as thermal expectation values in their (mixed) half of the total Hilbert space, i.e.,

\displaystyle \langle\mathrm{TFD}|\mathcal{O}(t_1,\mathbf{x}_1)\ldots\mathcal{O}(t_n,\mathbf{x}_n)|\mathrm{TFD}\rangle =Z^{-1}_\beta\mathrm{tr}\left[e^{-\beta H}\mathcal{O}(t_1,\mathbf{x}_1)\ldots\mathcal{O}(t_n,\mathbf{x}_n)\right]~. \ \ \ \ \ (13)

The same fundamental relation remains true in the case of the one-sided black hole as well: given the Hartle-Hawking state representing the exterior region, we can always obtain a purification such that expectation values in the original, thermal state are equivalent to standard correlators in the “fictitious” pure state, by the same doubling formalism that yielded the TFD. (Of course, the purification of a given mixed state is not unique, but as pointed out in [2] “the correct way to pick it, assuming that expectation values [of the operators] are all the information we have, is to pick the density matrix which maximizes the entropy.” That is, we pick the purification such that the original mixed state is thermal, i.e., {\rho\simeq Z^{-1}_\beta e^{-\beta H}} up to {1/N^2} corrections. The reason this is the “correct” prescription is that it’s the only one which does not impose additional constraints.) Thus (13) can be generally thought of as the statement that operators in an arbitrary pure state have the correct thermal expectation values when restricted to some suitably mixed subsystem (e.g., the black hole exterior dual to a single CFT).
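The statement (13) is easy to verify numerically in a finite-dimensional toy model. The sketch below (all matrices and parameters are arbitrary illustrative choices, not tied to any particular CFT) builds the thermofield double for a diagonal Hamiltonian and checks that a one-sided expectation value in the pure state equals the thermal trace in a single copy:

```python
import numpy as np

# Toy check of eqn (13): one-sided expectation values in the pure TFD
# equal thermal traces in a single copy. H, O, beta are illustrative.

rng = np.random.default_rng(1)
dim, beta = 5, 0.8
E = rng.uniform(0.0, 3.0, dim)               # eigenvalues of H (work in its eigenbasis)
O = rng.normal(size=(dim, dim))
O = O + O.T                                  # real symmetric single-sided operator

Z = np.sum(np.exp(-beta * E))                # thermal partition function
tfd = np.zeros(dim * dim)                    # |TFD> = Z^{-1/2} sum_i e^{-beta E_i/2} |i>_L |i>_R
for i in range(dim):
    tfd[i * dim + i] = np.exp(-beta * E[i] / 2)
tfd /= np.sqrt(Z)

lhs = tfd @ np.kron(np.eye(dim), O) @ tfd    # <TFD| 1 (x) O |TFD>
rhs = np.sum(np.exp(-beta * E) * np.diag(O)) / Z   # Z^{-1} tr[e^{-beta H} O]

print(np.allclose(lhs, rhs))                 # True
```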

Now, what if we wish to compute a correlation function involving operators across the horizon, e.g., {\langle\mathcal{O}\tilde{\mathcal{O}}\rangle}? In the two-sided case, we can simply compute this correlator in the pure state {|\mathrm{TFD}\rangle}. But in the one-sided case, we only have access to the thermal state representing the exterior. Thus we’d like to know how to compute the correlator using only the available data in the CFT corresponding to region {I}. In order to do this, we re-express all operators {\tilde{\mathcal{O}}} appearing in the correlator in terms of analytically continued operators {\mathcal{O}} via the KMS condition, i.e., we make the replacement

\displaystyle \tilde{\mathcal{O}}(t,\mathbf{x}) \longrightarrow \mathcal{O}(t+i\beta/2,\mathbf{x})~. \ \ \ \ \ (14)

This is essentially the usual statement that thermal Green functions are periodic in imaginary time; see [1] for details. This relationship allows us to express the desired correlator as

\displaystyle \langle\mathrm{TFD}|\mathcal{O}(t_1,\mathbf{x}_1)\ldots\tilde{\mathcal{O}}(t_n,\mathbf{x}_n)|\mathrm{TFD}\rangle =Z^{-1}_\beta\mathrm{tr}\left[e^{-\beta H}\mathcal{O}(t_1,\mathbf{x}_1)\ldots\mathcal{O}(t_n+i\beta/2,\mathbf{x}_n)\right]~, \ \ \ \ \ (15)

which is precisely eqn. (2) in our earlier post, cf. the two-point function (1) above. Note the lack of tildes on the right-hand side: this thermal expectation value can be computed entirely in the right CFT.
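The identity (15) can also be checked in the same finite-dimensional toy setting: take the mirror of a real symmetric operator to act on the left factor with time running backwards, and compare the TFD two-sided correlator against the purely right-sided thermal trace with the {i\beta/2} shift (all parameters below are illustrative):

```python
import numpy as np

# Toy check of eqns (1)/(15): <TFD| O(t1) O~(t2) |TFD> equals the
# right-sided thermal trace Z^{-1} tr[e^{-bH} O(t1) O(t2 + i b/2)].
# H, O, beta, t1, t2 are arbitrary illustrative choices.

rng = np.random.default_rng(0)
dim, beta, t1, t2 = 4, 1.3, 0.7, -0.4
E = rng.uniform(0.0, 2.0, dim)                 # H = diag(E) in its eigenbasis
O = rng.normal(size=(dim, dim))
O = O + O.T                                    # real symmetric operator

def U(t):                                      # e^{iHt}; t may be complex
    return np.diag(np.exp(1j * E * complex(t)))

def Heis(t):                                   # O(t) = e^{iHt} O e^{-iHt}
    return U(t) @ O @ U(-t)

Z = np.sum(np.exp(-beta * E))
rho = np.diag(np.exp(-beta * E)) / Z           # thermal state of one copy

tfd = np.zeros(dim * dim, dtype=complex)       # |TFD> in the doubled Hilbert space
for i in range(dim):
    tfd[i * dim + i] = np.exp(-beta * E[i] / 2)
tfd /= np.sqrt(Z)

O_R = np.kron(np.eye(dim), Heis(t1))           # O(t1) on the right copy
O_L = np.kron(Heis(-t2), np.eye(dim))          # mirror O~(t2): time runs backwards on the left

lhs = tfd.conj() @ (O_R @ O_L @ tfd)           # two-sided correlator in the pure state
rhs = np.trace(rho @ Heis(t1) @ Heis(t2 + 1j * beta / 2))  # right-sided thermal trace

print(np.allclose(lhs, rhs))                   # True
```

The agreement is just the KMS periodicity in imaginary time at work: the {e^{-\beta H/2}} factors in the TFD are exactly reproduced by the {i\beta/2} shift of the argument.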

If the CFT did not admit operators which satisfy the correlation relation (15), it would imply a breakdown of effective field theory across the horizon. Alternatively, observing deviations from the correct thermal correlators would allow us to locally detect the horizon, in contradiction to the equivalence principle. In this sense, this expression may be summarized as the statement that the horizon is smooth. Thus, for the CFT to represent a black hole with no firewall, it must contain a representation of interior operators {\tilde{\mathcal{O}}} with the correct behaviour inside low-point correlators. This last qualifier hints at the state-dependent nature of these so-called “mirror operators”, which I’ve discussed at length elsewhere [6].


[1]  K. Papadodimas and S. Raju, “An Infalling Observer in AdS/CFT,” JHEP 10 (2013) 212, arXiv:1211.6767 [hep-th].

[2]  K. Papadodimas and S. Raju, “State-Dependent Bulk-Boundary Maps and Black Hole Complementarity,” Phys. Rev. D89 no. 8, (2014) 086010, arXiv:1310.6335 [hep-th].

[3]  A. Hamilton, D. N. Kabat, G. Lifschytz, and D. A. Lowe, “Holographic representation of local bulk operators,” Phys. Rev. D74 (2006) 066009, arXiv:hep-th/0606141 [hep-th].

[4]  A. Hamilton, D. N. Kabat, G. Lifschytz, and D. A. Lowe, “Local bulk operators in AdS/CFT: A Boundary view of horizons and locality,” Phys. Rev. D73 (2006) 086003, arXiv:hep-th/0506118 [hep-th].

[5]  A. Hamilton, D. N. Kabat, G. Lifschytz, and D. A. Lowe, “Local bulk operators in AdS/CFT: A Holographic description of the black hole interior,” Phys. Rev. D75 (2007) 106001, arXiv:hep-th/0612053 [hep-th]. [Erratum: Phys. Rev. D75, 129902 (2007)].

[6]  R. Jefferson, “Comments on black hole interiors and modular inclusions,” SciPost Phys. 6 no. 4, (2019) 042, arXiv:1811.08900 [hep-th].


Free energy, variational inference, and the brain

In several recent posts, I explored various ideas that lie at the interface of physics, information theory, and machine learning:

  • We’ve seen, à la Jaynes, how the concepts of entropy in statistical thermodynamics and information theory are unified, perhaps the quintessential manifestation of the intimate relationship between the two.
  • We applied information geometry to Boltzmann machines, which led us to the formalization of “learning” as a geodesic in the abstract space of machines.
  • In the course of introducing VAEs, we saw that the Bayesian inference procedure can be understood as a process which seeks to minimize the variational free energy, which encodes the divergence between the approximate and true probability distributions.
  • We examined how the (dimensionless) free energy serves as a generating function for the cumulants from probability theory, which manifest as the connected Green functions from quantum field theory.
  • We also showed how the cumulants from hidden priors control the higher-order interactions between visible units in an RBM, which underlies their representational power.
  • Lastly, we turned a critical eye towards the analogy between deep learning and the renormalization group, through a unifying Bayesian language in which UV degrees of freedom correspond to hidden variables over which a low-energy observer must marginalize.

Collectively, this led me to suspect that ideas along these lines — in particular, the link between variational Bayesian inference and free energy minimization in hierarchical models — might provide useful mathematical headway in our attempts to understand learning and intelligence in both minds and machines. Imagine my delight when I discovered that, at least in the context of biological brains, a neuroscientist named Karl Friston had already scooped me more than a decade ago!

The aptly-named free energy principle (for the brain) is elaborated upon in a series of about ten papers spanning as many years. I found [1-5] most helpful, but insofar as a great deal of text is copied verbatim (yes, really; never trust the h-index), it doesn’t really matter which one you read. I’m going to mostly draw from [3], because it seems the earliest in which the basic idea is fleshed-out completely. Be warned however that the notation varies slightly from paper to paper, and I find his distinction between states and parameters rather confusingly fuzzy; but we’ll make this precise below.

The basic idea is actually quite simple, and proceeds from the view of the brain as a Bayesian inference machine. In a nutshell, the job of the brain is to infer, as accurately as possible, the probability distribution representing the world (i.e., to build a model that best accords with sensory inputs). In a sense, the brain itself is a probabilistic model in this framework, so the goal is to bring this internal model of the world in line with the true, external one. But this is exactly the same inference procedure we’ve seen before in the context of VAEs! Thus the free energy principle is just the statement that the brain minimizes the variational free energy between itself (that is, its internal, approximate model) and its sensory inputs—or rather, the true distribution that generates them.

To elucidate the notation involved in formulating the principle, we can make an analogy with VAEs. In this sense, the goal of the brain is to construct a map between our observations (i.e., sensory inputs {x}) and the underlying causes (i.e., the environment state {z}). By Bayes’ theorem, the joint distribution describing the model can be decomposed as

\displaystyle p_\theta(x,z)=p_\theta(x|z)p(z)~. \ \ \ \ \ (1)

The first factor on the right-hand side is the likelihood of a particular sensory input {x} given the current state of the environment {z}, and plays the role of the decoder in this analogy, while the second factor is the prior distribution representing whatever foreknowledge the system has about the environment. The subscript {\theta} denotes the variational or “action parameters” of the model, so named because they parametrize the action of the brain on its substrate and surroundings. That is, the only way in which the system can change the distribution is by acting to change its sensory inputs. Friston denotes this dependency by {x(\theta)} (with different variables), but as alluded to above, I will keep to the present notation to avoid conflating state/parameter spaces.

Continuing this analogy, the encoder {p_\theta(z|x)} is then a map from the space of sensory inputs {X} to the space of environment states {Z} (as modelled by the brain). As in the case of VAEs however, this is incomputable in practice, since we (i.e., the brain) can’t evaluate the partition function {p(x)=\sum_zp_\theta(x|z)p(z)}. Instead, we construct a new distribution {q_\phi(z|x)} for the conditional probability of environment states {z} given a particular set of sensory inputs {x}. The variational parameters {\phi} for this ensemble control the precise hamiltonian that defines the distribution, i.e., the physical parameters of the brain itself. Depending on the level of resolution, these could represent, e.g., the firing status of all neurons, or the concentrations of neurotransmitters (or the set of all weights and biases in the case of artificial neural nets).

Obviously, the more closely {q_\phi(z|x)} approximates {p_\theta(z|x)}, the better our representation — and hence, the brain’s predictions — will be. As we saw before, we quantify this discrepancy by the Kullback-Leibler divergence

\displaystyle D_z(q_\phi(z|x)||p_\theta(z|x))=\sum_zq_\phi(z|x)\ln\frac{q_\phi(z|x)}{p_\theta(z|x)}~, \ \ \ \ \ (2)

which we can re-express in terms of the variational free energy

\displaystyle \begin{aligned} F_{q|}&=-\langle\ln p_\theta(x|z)\rangle_{q|}+D_z(q_\phi(z|x)||p(z))\\ &=-\sum_zq_\phi(z|x)\ln\frac{p_\theta(x,z)}{q_\phi(z|x)} =\langle E_{p|}\rangle_{q|}-S_{q|}~, \end{aligned} \ \ \ \ \ (3)

where the subscripts {p|,q|} denote the conditional distributions {p_\theta(z|x)}, {q_\phi(z|x)}. On the far right-hand side, {E_{p|}=-\ln p_\theta(x,z)} is the energy or hamiltonian for the ensemble {p_\theta(z|x)} (with partition function {Z=p(x)}), and {S_{q|}=-\sum_z q_\phi(z|x)\ln q_\phi(z|x)} is the entropy of {q_\phi(z|x)} (see the aforementioned post for details).
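The equalities in (3) are straightforward to verify on a discrete toy model. The sketch below (the probability tables are arbitrary illustrative numbers) computes the variational free energy three ways, as in (3), and checks that they agree:

```python
import numpy as np

# Toy check of eqn (3): for a discrete joint p(x,z) and an approximate
# posterior q(z|x), the three expressions for F coincide.
# All tables below are arbitrary illustrative numbers.

rng = np.random.default_rng(2)
nx, nz = 3, 4
p_joint = rng.random((nx, nz))
p_joint /= p_joint.sum()                      # joint p(x,z)
p_z = p_joint.sum(axis=0)                     # prior p(z)
p_x_given_z = p_joint / p_z                   # likelihood p(x|z)

x = 0                                         # fix one sensory input
q = rng.random(nz)
q /= q.sum()                                  # approximate posterior q(z|x)

# F = -<ln p(x|z)>_q + KL(q || p(z))          (first line of (3))
F1 = -np.sum(q * np.log(p_x_given_z[x])) + np.sum(q * np.log(q / p_z))
# F = -sum_z q ln[p(x,z)/q]                   (second line of (3))
F2 = -np.sum(q * np.log(p_joint[x] / q))
# F = <E>_q - S_q with E = -ln p(x,z)         (far right-hand side)
F3 = np.sum(q * (-np.log(p_joint[x]))) - (-np.sum(q * np.log(q)))

print(np.allclose([F1, F2], F3))              # True
```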

However, at this point we must diverge from our analogy with VAEs, since what we’re truly after is a model of the state of the world which is independent of our current sensory inputs. Consider that from a selectionist standpoint, a brain that changes its environmental model when a predator temporarily moves out of sight is less likely to pass on the genes for its construction! Said differently, a predictive model of reality will be more successful when it continues to include the moon, even when nobody looks at it. Thus instead of {q_\phi(z|x)}, we will compare {p_\theta(x|z)} with the ensemble density {q_\lambda(z)}, where — unlike in the case of {p(x)} or {p(z)} — we have denoted the variational parameters {\lambda} explicitly, since they will feature crucially below. Note that {\lambda} is not the same as {\theta} (and similarly, whatever parameters characterize the marginals {p(x)}, {p(z)} cannot be identified with {\theta}). One way to see this is by comparison with our example of renormalization in deep networks, where the couplings in the joint distribution (here, {\phi} in {q_\phi(x,z)}) get renormalized after marginalizing over some degrees of freedom (here, {\lambda} in {q_\lambda(z)}, after marginalizing over all possible sensory inputs {x}). Friston therefore defines the variational free energy as

\displaystyle \begin{aligned} \mathcal{F}_q&=-\langle\ln p_\theta(x|z)\rangle_q+D_z(q_\lambda(z)||p(z))\\ &=-\sum_zq_\lambda(z)\ln\frac{p_\theta(x,z)}{q_\lambda(z)} =\langle E_{p|}\rangle_{q}-S_{q}~, \end{aligned} \ \ \ \ \ (4)

where we have used a curly {\mathcal{F}} to distinguish this from {F} above, and note that the subscript {q} (no vertical bar) denotes that expectation values are computed with respect to the distribution {q_\lambda(z)}. The first equality expresses {\mathcal{F}_q} as the negative expected log-likelihood of sensory inputs given the state of the environment, plus an error term that quantifies how far the brain’s internal model of the world {q_\lambda(z)} is from the model consistent with our observations, {p(z)}, cf. (1). Equivalently, comparing with (2) (with {q_\lambda(z)} in place of {q_\phi(z|x)}), we’re interested in the Kullback-Leibler divergence between the brain’s model of the external world, {q_\lambda(z)}, and the conditional likelihood of a state therein given our sensory inputs, {p_\theta(z|x)}. Thus we arrive at the nutshell description we gave above, namely that the principle is to minimize the difference between what is and what we think there is. As alluded to above, there is a selectionist argument for this principle, namely that organisms whose beliefs accord poorly with reality tend not to pass on their genes.

As an aside, it is perhaps worth emphasizing that both of these variational free energies are perfectly valid: unlike the Helmholtz free energy, which is uniquely defined, one can define different variational free energies depending on which ensembles one wishes to compare, provided it admits an expression of the form {\langle E\rangle-S} for some energy {E} and entropy {S} (in case it wasn’t clear by now, we’re working with the dimensionless or reduced free energy, equivalent to setting {\beta=1}; the reason for this general form involves a digression on Legendre transforms). Comparing (4) and (3), one sees that the difference in this case is simply a difference in entropies and expectation values with respect to prior {q_\lambda(z)} vs. conditional distributions {q_\phi(z|x)} (which makes sense, since all we did was replace the latter by the former in our first definition).

Now, viewing the brain as an inference machine means that it seeks to optimize its predictions about the world, which, in this context, amounts to minimizing the free energy by varying the parameters {\theta,\,\lambda}. As explained above, {\theta} corresponds to the actions the system can take to alter its sensory inputs. From the first equality in (4), we see that the dependence on the action parameters is entirely contained in the log-likelihood of sensory inputs: the second, Kullback-Leibler term contains only priors (cf. our discussion of gradient descent in VAEs). Thus, optimizing the free energy with respect to {\theta} means that the system will act in such a way as to fulfill its expectations with regard to sensory inputs. Friston neatly summarizes this philosophy as the view that “we may not interact with the world to maximize our reward but simply to ensure it behaves as we think it should” [3]. While this might sound bizarre at first glance, the key fact to bear in mind is that the system is limited in the actions it can perform, i.e., in its ability to adapt. In other words, a system with low free energy is by definition adapting well to changes in its environment or its own internal needs, and is therefore positively selected for relative to systems whose ability to model and adapt to their environment is worse (higher free energy).

What about optimization with respect to the other set of variational parameters, {\lambda}? As mentioned above, these correspond to the physical parameters of the system itself, so optimization here amounts to adjusting the brain’s internal parameters — connection strengths, neurotransmitter levels, etc. — to ensure that our perceptions are as accurate as possible. By applying Bayes’ rule to the joint distribution {p_\theta(x,z)}, we can re-arrange the expression for the free energy to isolate this dependence in a single Kullback-Leibler term:

\displaystyle \mathcal{F}_q=-\ln p_\theta(x)+D_z\left( q_\lambda(z)||p_\theta(z|x)\right)~, \ \ \ \ \ (5)

where we have used the fact that {\langle \ln p_\theta(x)\rangle_q=\ln p_\theta(x)}. This form of the expression shows clearly that minimization with respect to {\lambda} directly corresponds to minimizing the Kullback-Leibler divergence between the brain’s internal model of the world, {q_\lambda(z)}, and the posterior probability of the state giving rise to its sensory inputs, {p_\theta(z|x)}. That is, in the limit where the second, Kullback-Leibler term vanishes, we are correctly modelling the causes of our sensory inputs. The selectionist interpretation is that systems which are less capable of accurately modelling their environment by correctly adjusting internal, “perception parameters” {\lambda} will have higher free energy, and hence will be less adept in bringing their perceptions in line with reality.
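Continuing the discrete toy model (again with arbitrary illustrative numbers), we can confirm the decomposition (5) directly, and see that the free energy bottoms out at the negative log-evidence exactly when {q} equals the true posterior:

```python
import numpy as np

# Toy check of eqn (5): F = -ln p(x) + KL(q || p(z|x)), minimized
# (with vanishing KL term) when q is the exact posterior.
# The joint table is an arbitrary illustrative choice.

rng = np.random.default_rng(3)
nx, nz = 3, 4
p_joint = rng.random((nx, nz))
p_joint /= p_joint.sum()                  # joint p(x,z)

x = 1                                     # fix one sensory input
p_x = p_joint[x].sum()                    # evidence p(x)
posterior = p_joint[x] / p_x              # true posterior p(z|x)

def free_energy(q):                       # F = -sum_z q ln[p(x,z)/q], cf. (4)
    return -np.sum(q * np.log(p_joint[x] / q))

q = rng.random(nz)
q /= q.sum()                              # some approximate model q(z)
kl = np.sum(q * np.log(q / posterior))    # KL(q || p(z|x))

print(np.allclose(free_energy(q), -np.log(p_x) + kl))    # eqn (5): True
print(np.isclose(free_energy(posterior), -np.log(p_x)))  # exact-inference limit: True
```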

Thus far everything is quite abstract and rather general. But things become really interesting when we apply this basic framework to hierarchical models with both forward and backwards connections — such as the cerebral cortex — which leads to “recurrent dynamics that self-organize to suppress free energy or prediction error, i.e., recognition dynamics” [3]. In fact, Friston makes the even stronger argument that it is precisely the inability to invert the recognition problem that necessitates backwards (as opposed to purely feed-forwards) connections. In other words, the selectionist pressure to accurately model the (highly non-linear) world requires that brains evolve top-down connections from higher to lower cortical layers. Let’s flesh this out in a bit more detail.

Recall that {Z} is the space of environmental states as modelled by the brain. Thus we can formally associate the encoder, {p_\theta(z|x)}, with forwards connections, which propagate sensory data up the cortical hierarchy; Friston refers to this portion as the recognition model. That is, the recognition model should take a given data point {x}, and return the likelihood of a particular cause (i.e., world-state) {z}. In general however, the map from causes to sensory inputs — captured by the so-called generative model {p_\theta(x|z)} — is highly non-linear, and the brain must essentially invert this map to find contextually invariant causes (e.g., the continued threat of a predator even when it’s no longer part of our immediate sensory input). This is the intractable problem of computing the partition function above, the workaround for which is to instead postulate an approximate recognition model {q_\lambda(z)}, whose parameters {\lambda} are encoded in the forwards connections. The role of the generative model {p_\theta(x|z)} is then to modulate sensory inputs (or their propagation and processing) based on the prevailing belief about the environment’s state, the idea being that these effects are represented in backwards (and lateral) connections. Therefore, the role of these backwards or top-down connections is to modulate forwards or bottom-up connections, thereby suppressing prediction error, which is how the brain operationally minimizes its free energy.

The punchline is that backwards connections are necessary for general perception and recognition in hierarchical models. As mentioned above, this is quite interesting insofar as it offers, on the one hand, a mathematical explanation for the cortical structure found in biological brains, and on the other, a potential guide to more powerful, neuroscience-inspired artificial intelligence.

There are however a couple technical exceptions to this claim of necessity worth mentioning, which is why I snuck in the qualifier “general” in the punchline above. If the abstract generative model can be inverted exactly, then there’s no need for (expensive and time-consuming) backwards connections, because one can obtain a perfectly suitable recognition model that reliably predicts the state of the world given sensory inputs, using a purely feed-forward network. Mathematically, this corresponds to simply taking {q_\lambda(z)=p_\theta(z|x)} in (4) (i.e., zero Kullback-Leibler divergence (2)), whereupon the free energy reduces to the negative log-likelihood of sensory inputs,

\displaystyle \mathcal{F}_{p}=-\ln p(x)~, \ \ \ \ \ (6)

where we have used the fact that {\langle\ln p(x)\rangle_{p_\theta(z|x)}=\ln p(x)}. Since real-world models are generally non-linear in their inputs however, invertibility is not something one expects to encounter in realistic inference machines (i.e., brains). Indeed, our brains evolved under strict energetic and spatial constraints; there simply isn’t enough processing power to brute-force the problem by using dedicated feed-forward networks for all our recognition needs. The other important exception is when the recognition process is purely deterministic. In this case one replaces {q_\lambda(z)} by a Kronecker delta function {\delta(z-z(x))}, so that upon performing the summation, the inferred state {z} becomes a deterministic function of the inputs {x}. Then the second expression for {\mathcal{F}} in (4) becomes the negative log-likelihood of the joint distribution

\displaystyle \mathcal{F}_\delta=-\ln p_\theta(x,z(x)) =-\ln p_\theta(x|z(x))-\ln p(z(x))~, \ \ \ \ \ (7)

where we have used the fact that {\ln\delta(0)=0}. Note that the invertible case, (6), corresponds to maximum likelihood estimation (MLE), while the deterministic case (7) corresponds to so-called maximum a posteriori estimation (MAP), the only difference being that the latter includes a weighting based on the prior distribution {p(z(x))}. Neither requires the conditional distribution {p_\theta(z|x)}, and so skirts the intractability of computing the partition function above. The reduction to these familiar machine learning metrics for such simple models is reasonable, since only in relatively contrived settings does one ever expect deterministic/invertible recognition.
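To see the MLE/MAP distinction in miniature, consider inferring the mean of a Gaussian likelihood with known variance — a hypothetical toy problem of my own, not one from Friston’s papers. The MLE is just the sample average, while adding a Gaussian prior {p(z)=\mathcal{N}(0,\tau^2)} shrinks the MAP estimate toward the prior mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=50)   # "sensory data" with true cause z = 3

# MLE: minimizes -ln p(x|z), cf. the invertible case (6)
z_mle = x.mean()

# MAP: minimizes -ln p(x|z) - ln p(z) with prior N(0, tau^2), cf. the deterministic case (7)
tau2, sigma2, n = 4.0, 1.0, len(x)
z_map = x.sum() / (n + sigma2 / tau2)         # posterior mode, shrunk toward 0

# the prior weighting pulls the MAP estimate toward the prior mean
assert 0 < z_map < z_mle
```

The only difference between the two objectives is the {-\ln p(z)} term, exactly as in the comparison of (6) and (7).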

In addition to motivating backwards connections, the hierarchical aspect is important because it allows the brain to learn its own priors through a form of empirical Bayes. In this sense, the free energy principle is essentially an elegant (re)formulation of predictive coding. Recall that when we introduced the generative model in the form of the decoder {p_\theta(x|z)} in (1), we also necessarily introduced the prior distribution {p(z)}: the likelihood of a particular sensory input {x} given (our internal model of) the state of the environment (i.e., the cause) {z} only makes sense in the context of the prior distribution of causes. Where does this prior distribution come from? In artificial models, we can simply postulate some (e.g., Gaussian or informative) prior distribution and proceed to train the model from there. But a hierarchical model like the brain enables a more natural option. To illustrate the basic idea, consider labelling the levels in such a cortical hierarchy by {i\in\{0,\ldots,n\}}, where 0 is the bottom-most layer and {n} is the top-most layer. Then {x_i} denotes sensory data at the corresponding layer; i.e., {x_0} corresponds to raw sensory inputs, while {x_n} corresponds to the propagated input signals after all previous levels of processing. Similarly, let {z_i} denote the internal model of the state of the world assembled at (or accessible to) the {i^\mathrm{th}} layer. Then

\displaystyle p(z_i)=\sum_{z_{i-1}}p(z_i|z_{i-1})p(z_{i-1})~, \ \ \ \ \ (8)

i.e., the prior distribution {p(z_i)} implicitly depends on the knowledge of the state at all previous levels, analogous to how the IR degrees of freedom implicitly depend on the marginalized UV variables. The above expression can be iterated recursively until we reach {p(z_0)}. For present purposes, this can be identified with {p(x_0)}, since at the bottom-most level of the hierarchy, there’s no difference between the raw sensory data and the inferred state of the world (ignoring whatever intralayer processing might take place). In this (empirical Bayesian) way, the brain self-consistently builds up higher priors from states at lower levels.
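In code, the recursion (8) is just repeated application of a conditional (transition) matrix; here is a minimal sketch with made-up four-state layers and random conditionals:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_conditional(k):
    """Random stochastic matrix: row l gives p(z_i | z_{i-1} = l)."""
    T = rng.random((k, k))
    return T / T.sum(axis=1, keepdims=True)

p = np.full(4, 0.25)        # p(z_0), identified with p(x_0) at the bottom of the hierarchy
for _ in range(3):          # iterate eq. (8) up the hierarchy
    T = random_conditional(4)
    p = T.T @ p             # p(z_i) = sum_{z_{i-1}} p(z_i | z_{i-1}) p(z_{i-1})

assert np.isclose(p.sum(), 1.0)   # still a normalized prior at every level
```

Each pass through the loop builds the prior at level {i} entirely out of the level below it, which is the empirical-Bayes point.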

The various works by Friston and collaborators go into vastly more detail, of course; I’ve made only the crudest sketch of the basic idea here. In particular, one can make things more concrete by examining the neural dynamics in such models, which is explored in some of these works via something akin to a mean field theory (MFT) approach. I’d originally hoped to have time to dive into this in detail, but a proper treatment will have to await another post. Suffice to say however that the free energy principle provides an elegant formulation which, as in the other topics mentioned at the beginning of this post, allows us to apply ideas from theoretical physics to understand the structure and dynamics of neural networks, and may even prove a fruitful mathematical framework for both theoretical neuroscience and (neuro-inspired) artificial intelligence.


[1] K. Friston, “Learning and inference in the brain,” Neural Networks (2003).

[2] K. Friston, “A theory of cortical responses,” Phil. Trans. R. Soc. B (2005).

[3] K. Friston, J. Kilner, and L. Harrison, “A free energy principle for the brain,” J. Physiology (Paris) (2006).

[4] K. J. Friston and K. E. Stephan, “Free-energy and the brain,” Synthese (2007).

[5] K. Friston, “The free-energy principle: a unified brain theory?,” Nature Rev. Neuro. (2010).

Posted in Minds & Machines

Deep learning and the renormalization group

In recent years, a number of works have pointed to similarities between deep learning (DL) and the renormalization group (RG) [1-7]. This connection was originally made in the context of certain lattice models, where decimation RG bears a superficial resemblance to the structure of deep networks in which one marginalizes over hidden degrees of freedom. However, the relation between DL and RG is more subtle than has been previously presented. The “exact mapping” put forth by [2], for example, is really just a formal analogy that holds for essentially any hierarchical model! That’s not to say there aren’t deeper connections between the two: in my earlier post on RBMs for example, I touched on how the cumulants encoding UV interactions appear in the renormalized couplings after marginalizing out hidden degrees of freedom, and we’ll go into this in much more detail below. But it’s obvious that DL and RG are functionally distinct: in the latter, the couplings (i.e., the connection or weight matrix) are fixed by the requirement that the partition function be preserved at each scale, while in the former, these connections are dynamically altered in the training process. There is, in other words, an important distinction between structure and dynamics which seems to have been overlooked. Understanding both these aspects is required to truly understand why deep learning “works”, but “learning” itself properly refers to the latter.

That said, structure is the first step to dynamics, so I wanted to see how far one could push the analogy. To that end, I started playing with simple Gaussian/Bernoulli RBMs, to see whether understanding the network structure — in particular, the appearance of hidden cumulants, hence the previous post in this two-part sequence — would shed light on, e.g., the hierarchical feature detection observed in certain image recognition tasks, the propagation of structured information more generally, or the relevance of criticality to both deep nets and biological brains. To really make the RG analogy precise, one would ideally like a beta function for the network, which requires a recursion relation for the couplings. So my initial hope was to derive an expression for this in terms of the cumulants of the marginalized neurons, and thereby gain some insight into how correlations behave in these sorts of hierarchical networks.

To start off, I wanted a simple model that would be analytically solvable while making the analogy with decimation RG completely transparent. So I began by considering a deep Boltzmann machine (DBM) with three layers: a visible layer of Bernoulli units {x_i}, and two hidden layers of Gaussian units {y_i,z_i}. The total energy function is

\displaystyle \begin{aligned} H(x,y,z)&=-\sum_{i=1}^na_ix_i+\frac{1}{2}\sum_{j=1}^my_j^2+\frac{1}{2}\sum_{k=1}^pz_k^2-\sum_{ij}A_{ij}x_iy_j-\sum_{jk}B_{jk}y_jz_k\\ &=-\mathbf{a}\,\mathbf{x}+\frac{1}{2}\left( \mathbf{y}^\mathrm{T}\mathbf{y}+\mathbf{z}^\mathrm{T}\mathbf{z}\right)-\mathbf{x}^\mathrm{T} A\,\mathbf{y}-\mathbf{y}^\mathrm{T} B\,\mathbf{z}~, \end{aligned} \ \ \ \ \ (1)

where on the second line I’ve switched to the more convenient vector notation; the dot product between vectors is implicit, i.e., {\mathbf{a}\,\mathbf{x}=\mathbf{a}\cdot\mathbf{x}}. Note that there are no intra-layer couplings, and that I’ve stacked the layers so that the visible layer {x} is connected only to the intermediate hidden layer {y}, which in turn is connected only to the final hidden layer {z}. The connection to RG will be made by performing sequential marginalizations over first {z}, and then {y}, so that the flow from UV to IR is {z\rightarrow y\rightarrow x}. There’s an obvious Bayesian parallel here: we low-energy beings don’t have access to complete information about the UV, so the visible units are naturally identified with IR degrees of freedom, and indeed I’ll use these terms interchangeably throughout.

The joint distribution function describing the state of the machine is

\displaystyle p(x,y,z)=Z^{-1}e^{-\beta H(x,y,z)}~, \quad\quad Z[\beta]=\prod_{i=1}^n\sum_{x_i=\pm1}\int\!\mathrm{d}^my\,\mathrm{d}^pz\,e^{-\beta H(x,y,z)}~, \ \ \ \ \ (2)

where {\int\!\mathrm{d}^my=\int\!\prod_{i=1}^m\mathrm{d} y_i}, and similarly for {z}. Let us now consider sequential marginalizations to obtain {p(x,y)} and {p(x)}. In Bayesian terms, these distributions characterize our knowledge about the theory at intermediate- and low-energy scales, respectively. The first of these is

\displaystyle p(x,y)=\int\!\mathrm{d}^pz\,p(x,y,z)=Z^{-1}e^{-\beta\,\left(-\mathbf{a}\,\mathbf{x}+\frac{1}{2}\mathbf{y}^\mathrm{T}\mathbf{y}-\mathbf{x}^\mathrm{T} A\,\mathbf{y}\right)}\int\!\mathrm{d}^pz\,e^{-\frac{\beta}{2}\mathbf{z}^\mathrm{T}\mathbf{z}+\beta\mathbf{y}^\mathrm{T} B\,\mathbf{z}}~. \ \ \ \ \ (3)

In order to preserve the partition function (see the discussion around (23) below), we then define the hamiltonian on the remaining, lower-energy degrees of freedom {H(x,y)} such that

\displaystyle p(x,y)=Z^{-1}e^{-\beta H(x,y)}~, \ \ \ \ \ (4)

which implies

\displaystyle H(x,y)=-\frac{1}{\beta}\ln\int\!\mathrm{d}^pz\,e^{-\beta H(x,y,z)} =-\mathbf{a}\,\mathbf{x}+\frac{1}{2}\mathbf{y}^\mathrm{T}\mathbf{y}-\mathbf{x}^\mathrm{T} A\,\mathbf{y} -\frac{1}{\beta}\ln\int\!\mathrm{d}^pz\,e^{-\frac{\beta}{2}\mathbf{z}^\mathrm{T}\mathbf{z}+\beta\mathbf{y}^\mathrm{T} B\,\mathbf{z}}~. \ \ \ \ \ (5)

This is a simple multidimensional Gaussian integral:

\displaystyle \int\!\mathrm{d}^pz\,\mathrm{exp}\left(-\frac{1}{2}\mathbf{z}^\mathrm{T} M\,\mathbf{z}+J^\mathrm{T}\mathbf{z}\right) =\sqrt{\frac{(2\pi)^p}{|M|}}\,\mathrm{exp}\left(\frac{1}{2}J^\mathrm{T} M^{-1} J\right)~, \ \ \ \ \ (6)

where in the present case {M=\beta\mathbf{1}} and {J=\beta B^\mathrm{T}\mathbf{y}}. We therefore obtain

\displaystyle \begin{aligned} H(x,y)&=-\mathbf{a}\,\mathbf{x}+\frac{1}{2}\mathbf{y}^\mathrm{T}\mathbf{y}-\mathbf{x}^\mathrm{T} A\,\mathbf{y} -\frac{1}{\beta}\ln\sqrt{\frac{(2\pi)^p}{\beta^p}}\exp\left(\frac{\beta}{2}\mathbf{y}^\mathrm{T} BB^\mathrm{T}\mathbf{y}\right)\\ &=f(\beta)-\mathbf{a}\,\mathbf{x}+\frac{1}{2}\mathbf{y}^\mathrm{T} \left(\mathbf{1}-B'\right)\mathbf{y}-\mathbf{x}^\mathrm{T} A\,\mathbf{y}~, \end{aligned} \ \ \ \ \ (7)

where we have defined

\displaystyle f(\beta)\equiv\frac{1}{2\beta}\ln\frac{\beta^p}{(2\pi)^p} \qquad\mathrm{and}\qquad B_{ij}'\equiv\sum_{\mu=1}^pB_{i\mu}B_{j\mu}~. \ \ \ \ \ (8)

The key point to note is that the interactions between the intermediate degrees of freedom {y} have been renormalized by an amount proportional to the coupling with the UV variables {z}. And indeed, in the context of deep neural nets, the advantage of hidden units is that they encode higher-order interactions through the cumulants of the associated prior. To make this connection explicit, consider the prior distribution of UV variables

\displaystyle q(z)=\sqrt{\frac{\beta^p}{(2\pi)^p}}\,\mathrm{exp}\left(-\frac{1}{2}\beta \,\mathbf{z}^\mathrm{T} \mathbf{z}\right)~. \ \ \ \ \ (9)

The cumulant generating function for {\mathbf{z}} with respect to this distribution is then

\displaystyle K_{z}(t)=\ln\langle e^{\mathbf{t}\mathbf{z}}\rangle =\ln\sqrt{\frac{\beta^p}{(2\pi)^p}}\int\!\mathrm{d}^pz\,\mathrm{exp}\left(-\frac{1}{2}\beta \mathbf{z}^\mathrm{T} \mathbf{z}+\mathbf{t}\mathbf{z}\right) =\frac{1}{2\beta}\mathbf{t}\mathbf{t}^\mathrm{T}~, \ \ \ \ \ (10)

cf. eqn. (4) in the previous post. So by choosing {\mathbf{t}=\beta\mathbf{y}^\mathrm{T} B}, we have

\displaystyle K_{z}\left(\beta\mathbf{y}^\mathrm{T} B\right)=\frac{\beta}{2}\mathbf{y}^\mathrm{T} BB^\mathrm{T}\mathbf{y} \ \ \ \ \ (11)

and may therefore express (7) as

\displaystyle H(x,y)=f(\beta)-\mathbf{a}\,\mathbf{x}+\frac{1}{2}\mathbf{y}^\mathrm{T}\mathbf{y}-\mathbf{x}^\mathrm{T} A\,\mathbf{y}-\frac{1}{\beta}K_{z}\left(\beta\mathbf{y}^\mathrm{T} B\right)~. \ \ \ \ \ (12)

From the cumulant expansion in the aforementioned eqn. (4), in which the {n^\mathrm{th}} cumulant is {\kappa_n=K_X^{(n)}(t)\big|_{t=0}}, we then see that the effect of the marginalizing out UV (i.e., hidden) degrees of freedom is to induce higher-order couplings between the IR (i.e., visible) units, with the coefficients of the interaction weighted by the associated cumulant:

\displaystyle \begin{aligned} H(x,y)&=f(\beta)-\mathbf{a}\,\mathbf{x}+\frac{1}{2}\mathbf{y}^\mathrm{T}\mathbf{y}-\mathbf{x}^\mathrm{T} A\,\mathbf{y}-\frac{1}{\beta}\left(\kappa_1\mathbf{t}+\frac{\kappa_2}{2}\mathbf{t}^2+\ldots\right)\\ &=f(\beta)-\mathbf{a}\,\mathbf{x}+\frac{1}{2}\mathbf{y}^\mathrm{T}\mathbf{y}-\mathbf{x}^\mathrm{T} A\,\mathbf{y}-\kappa_1\mathbf{y}^\mathrm{T} B -\frac{\kappa_2}{2}\beta\mathbf{y}^\mathrm{T} BB^\mathrm{T} \mathbf{y}-\ldots~. \end{aligned} \ \ \ \ \ (13)

For the Gaussian prior (9), one immediately sees from (10) that all cumulants except for {\kappa_2=1/\beta} (the variance) vanish, whereupon (13) reduces to (7) above.
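As an aside, the Gaussian integral (6) on which all of these marginalizations rest is easy to check numerically for small {p}; here is a quick brute-force sketch for {p=2}, with {M} and {J} chosen arbitrarily:

```python
import numpy as np

# check eq. (6) for p = 2 against a brute-force grid sum
M = np.array([[2.0, 0.5], [0.5, 1.5]])   # arbitrary positive-definite matrix
J = np.array([0.3, -0.7])                # arbitrary source vector

u = np.linspace(-8.0, 8.0, 801)          # grid comfortably covering the Gaussian support
du = u[1] - u[0]
Z1, Z2 = np.meshgrid(u, u, indexing="ij")
pts = np.stack([Z1, Z2], axis=-1)
quad = np.einsum("...i,ij,...j->...", pts, M, pts)   # z^T M z at every grid point
numeric = np.sum(np.exp(-0.5 * quad + pts @ J)) * du**2

exact = np.sqrt((2*np.pi)**2 / np.linalg.det(M)) \
        * np.exp(0.5 * J @ np.linalg.solve(M, J))
assert np.isclose(numeric, exact, rtol=1e-4)
```

The same identity, with {M=\beta\mathbf{1}} and {J=\beta B^\mathrm{T}\mathbf{y}}, is what produces the {BB^\mathrm{T}} renormalization above.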

Now, let’s repeat this process to obtain the marginalized distribution of purely visible units {p(x)}. In analogy with Wilsonian RG, this corresponds to further lowering the cutoff scale in order to obtain a description of the theory in terms of low-energy degrees of freedom that we can actually observe. Hence, tracing out {y}, we have

\displaystyle p(x)=\int\!\mathrm{d}\mathbf{y}\,p(x,y)=Z^{-1}e^{-\beta\,\left( f-\mathbf{a}\mathbf{x}\right)}\int\!\mathrm{d}\mathbf{y} \,\mathrm{exp}\left(-\frac{1}{2}\beta\mathbf{y}^\mathrm{T}\left(\mathbf{1}-B'\right)\mathbf{y}+\beta\mathbf{x}^\mathrm{T} A\,\mathbf{y}\right)~. \ \ \ \ \ (14)

Of course, this is just another edition of (6), but now with {M=\beta\left(\mathbf{1}-B'\right)} and {J=\beta A^\mathrm{T}\mathbf{x}}. We therefore obtain

\displaystyle \begin{aligned} p(x)&=Z^{-1}\sqrt{\frac{(2\pi)^m}{\beta^m\left|\mathbf{1}-B'\right|}}\,\mathrm{exp}\left[-\beta\left( f(\beta)-\mathbf{a}\mathbf{x}-\frac{1}{2}\mathbf{x}^\mathrm{T} A\left(\mathbf{1}-B'\right)^{-1}A^\mathrm{T}\mathbf{x}\right)\right]\\ &=Z^{-1}\sqrt{\frac{(2\pi)^m}{\beta^m\left|\mathbf{1}-B'\right|}}\,\mathrm{exp}\left[-\beta\left( f(\beta)-\mathbf{a}\mathbf{x}-\frac{1}{2}\mathbf{x}^\mathrm{T} A'\mathbf{x}\right)\right] \end{aligned} \ \ \ \ \ (15)

where we have defined {A'\equiv A\,\left(\mathbf{1}-B'\right)^{-1}\!A^\mathrm{T}}. As before, we then define {H(x)} such that

\displaystyle p(x)=Z^{-1}e^{-\beta H(x)}~, \ \ \ \ \ (16)

which implies

\displaystyle H(x)=g(\beta)-\mathbf{a}\mathbf{x}-\frac{1}{2}\mathbf{x}^\mathrm{T} A'\mathbf{x}~, \ \ \ \ \ (17)

with

\displaystyle g(\beta)\equiv f(\beta)+\frac{1}{2\beta}\ln\frac{\beta^m\left|\mathbf{1}-B'\right|}{(2\pi)^m} =\frac{1}{2\beta}\ln\frac{\beta^{p+m}\left|\mathbf{1}-B'\right|}{(2\pi)^{p+m}}~, \ \ \ \ \ (18)

where {f(\beta)} and {B'} are given in (8).

Again we see that marginalizing out UV information induces new couplings between IR degrees of freedom; in particular, the hamiltonian {H(x)} contains a quadratic interaction term between the visible units. And we can again write this directly in terms of a cumulant generating function for hidden degrees of freedom by defining a prior of the form (9), but with {z\rightarrow w\in\{y,z\}} and {p\rightarrow m\!+\!p}. This will be of the form (10), where in this case we need to choose {\mathbf{t}=\beta\mathbf{x}^\mathrm{T} A\left(\mathbf{1}-B'\right)^{-1/2}}, so that

\displaystyle K_z(t)=\frac{\beta}{2} \mathbf{x}^\mathrm{T} A\left(\mathbf{1}-B'\right)^{-1} A^\mathrm{T}\mathbf{x} \ \ \ \ \ (19)

(where, since {B'^\mathrm{T}=B'}, the inverse matrix {(\mathbf{1}-B')^{-1}} is also invariant under the transpose; at this stage of exploration, I’m being quite cavalier about questions of existence). Thus the hamiltonian of visible units may be written

\displaystyle H(x)=g(\beta)-\mathbf{a}\mathbf{x}-\frac{1}{\beta}K_z(t)~, \ \ \ \ \ (20)

with {t} and {g(\beta)} as above. Since the prior with which these cumulants are computed is again Gaussian, only the second cumulant survives in the expansion, and we indeed recover (17).

To summarize, the sequential flow from UV (hidden) to IR (visible) distributions is, from top to bottom,

\displaystyle \begin{aligned} p(x,y,z)&=Z^{-1}e^{-\beta H(x,y,z)}~,\qquad &H(x,y,z)&=-\mathbf{a}\,\mathbf{x}+\frac{1}{2}\left( \mathbf{y}^\mathrm{T}\mathbf{y}+\mathbf{z}^\mathrm{T}\mathbf{z}\right)-\mathbf{x}^\mathrm{T} A\,\mathbf{y}-\mathbf{y}^\mathrm{T} B\,\mathbf{z}~,\\ p(x,y)&=Z^{-1}e^{-\beta H(x,y)}~,\qquad &H(x,y)&=f(\beta)-\mathbf{a}\,\mathbf{x}+\frac{1}{2}\mathbf{y}^\mathrm{T}\left(\mathbf{1}-BB^\mathrm{T}\right)\mathbf{y}-\mathbf{x}^\mathrm{T} A\,\mathbf{y}~,\\ p(x)&=Z^{-1}e^{-\beta H(x)}~,\qquad &H(x)&=g(\beta)-\mathbf{a}\mathbf{x}-\frac{1}{2}\mathbf{x}^\mathrm{T} A\left(\mathbf{1}-BB^\mathrm{T}\right)^{-1}\!A^\mathrm{T}\mathbf{x}~, \end{aligned} \ \ \ \ \ (21)

where upon each marginalization, the new hamiltonian gains additional interaction terms/couplings governed by the cumulants of the UV prior (where “UV” is defined relative to the current cutoff scale, i.e., {q(z)} for {H(x,y)} and {q(w)} for {H(x)}), and the renormalization of the partition function is accounted for by

\displaystyle f(\beta)=\frac{1}{2\beta}\ln\frac{\beta^p}{(2\pi)^p}~, \quad\quad g(\beta)=\frac{1}{2\beta}\ln\frac{\beta^{p+m}\left|\mathbf{1}-BB^\mathrm{T}\right|}{(2\pi)^{p+m}}~. \ \ \ \ \ (22)
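Since signs and determinant factors are easy to fumble in these manipulations, here is a numerical sanity check of this summary (dimensions, couplings, and {\beta} chosen arbitrarily, with {B} small enough that {\mathbf{1}-BB^\mathrm{T}} stays positive definite): marginalizing both Gaussian layers in one shot via (6) should reproduce {e^{-\beta H(x)}} exactly, with the induced quadratic coupling lowering the energy of the visible units.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, n, m, p = 1.3, 2, 2, 2
a = rng.normal(size=n)
A = 0.3 * rng.normal(size=(n, m))
B = 0.2 * rng.normal(size=(m, p))      # small => 1 - BB^T positive definite

Bp = B @ B.T                                         # B' = BB^T
Ap = A @ np.linalg.inv(np.eye(m) - Bp) @ A.T         # A' = A(1-B')^{-1} A^T
g = np.log(beta**(m+p) * np.linalg.det(np.eye(m) - Bp) / (2*np.pi)**(m+p)) / (2*beta)

# quadratic form over the stacked hidden vector w = (y, z)
M = beta * np.block([[np.eye(m), -B], [-B.T, np.eye(p)]])

for bits in range(2**n):
    x = np.array([1.0 if bits >> i & 1 else -1.0 for i in range(n)])
    J = beta * np.concatenate([A.T @ x, np.zeros(p)])
    # direct marginalization over (y, z) via the Gaussian integral (6)
    direct = np.exp(beta * a @ x) \
             * np.sqrt((2*np.pi)**(m+p) / np.linalg.det(M)) \
             * np.exp(0.5 * J @ np.linalg.solve(M, J))
    # effective visible hamiltonian; the induced coupling enters with a minus sign
    H_x = g - a @ x - 0.5 * x @ Ap @ x
    assert np.isclose(direct, np.exp(-beta * H_x))
```

The agreement is exact (up to floating point), since every integral involved is Gaussian; the Schur-complement identity {|M|=\beta^{m+p}|\mathbf{1}-BB^\mathrm{T}|} is what ties the one-shot marginalization to the sequential one.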

As an aside, note that at each level, fixing the form (4), (16) is equivalent to imposing that the partition function remain unchanged. This is required in order to preserve low-energy correlation functions. The two-point correlator {\langle x_ix_j\rangle} between visible (low-energy) degrees of freedom, for example, does not depend on which distribution we use to compute the expectation value, so long as the energy scale thereof is at or above the scale set by the inverse lattice spacing of {\mathbf{x}}:

\displaystyle \begin{aligned} \langle x_ix_j\rangle_{p(x,y,z)}&=\prod_{k=1}^n\sum_{x_k=\pm1}\int\!\mathrm{d}^my\mathrm{d}^pz\,x_ix_j\,p(x,y,z)\\ &=\prod_{k=1}^n\sum_{x_k=\pm1}x_ix_j\int\!\mathrm{d}^my\,p(x,y) =\langle x_ix_j\rangle_{p(x,y)}\\ &=\prod_{k=1}^n\sum_{x_k=\pm1}x_ix_j\,p(x) =\langle x_ix_j\rangle_{p(x)}~. \end{aligned} \ \ \ \ \ (23)

In other words, had we not imposed the invariance of the partition function, we would be altering the theory at each energy scale, and there would be no renormalization group relating them. In information-theoretic terms, this would represent an incorrect Bayesian inference procedure.

Despite (or perhaps, because of) its simplicity, this toy model makes manifest the fact that the RG prescription is reflected in the structure of the network, not the dynamics of learning per se. Indeed, Gaussian units aside, the above is essentially nothing more than real-space decimation RG on a 1d lattice, with a particular choice of couplings between “spins” {\sigma\in\{x,y,z\}}. In this analogy, tracing out {z} and then {y} maps to a sequential marginalization over even spins in the 1d Ising model. “Dynamics” in this sense are determined by the hamiltonian {H(\sigma)}, which is again one-dimensional. When one speaks of deep “learning” however, one views the network as two-dimensional, and “dynamics” refers to the changing values of the couplings as the network attempts to minimize the cost function. In short, RG lies in the fact that the couplings at each level in (21) encode the cumulants from hidden units in such a way as to ensure the preservation of visible correlations, whereas deep learning then determines their precise values in such a way as to reproduce a particular distribution. To say that deep learning itself is an RG is to conflate structure with function.

Nonetheless, there’s clearly an intimate parallel between RG and hierarchical Bayesian modeling at play here. As mentioned above, I’d originally hoped to derive something like a beta function for the cumulants, to see what insights theoretical physics and machine learning might yield to one another at this information-theoretic interface. Unfortunately, while one can see how the higher UV cumulants from {q(z)} are encoded in those from {q(w)}, the appearance of the inverse matrix makes a recursion relation for the couplings in terms of the cumulants rather awkward, and the result would only hold for the simple Gaussian hidden units I’ve chosen for analytical tractability here.

Fortunately, after banging my head against this for a month, I learned of a recent paper [8] that derives exactly the sort of cumulant relation I was aiming for, at least in the case of generic lattice models. The key is to not assume a priori which degrees of freedom will be considered UV/hidden vs. IR/visible. That is, when I wrote down the joint distribution (2), I’d already distinguished which units would survive each marginalization. While this made the parallel with the familiar decimation RG immediate — and the form of (1) made the calculations simple to perform analytically — it’s actually a bit unnatural from both a field-theoretic and a Bayesian perspective: the degrees of freedom that characterize the theory in the UV may be very different from those that we observe in the IR (e.g., strings vs. quarks vs. hadrons), so we shouldn’t make the distinction {x,y,z} at this level. Accordingly, [8] instead replace (2) with

\displaystyle p(\chi)=\frac{1}{Z}\,e^{\mathcal{K}(\chi)} \ \ \ \ \ (24)

where {\mathcal{K}\equiv-\beta H} is the so-called reduced (i.e., dimensionless) hamiltonian, cf. the reduced/dimensionless free energy {\tilde F\equiv\beta F} in the previous post. Note that {\chi} runs over all the degrees of freedom in the theory, which are all UV/hidden variables at the present level.

Two words of notational warning ere we proceed: first, there is a sign error in eqn. (1) of [8] (in version 1; the negative in the exponent has been absorbed into {\mathcal{K}} already). More confusingly however, their use of the terminology “visible” and “hidden” is backwards with respect to the RG analogy here. In particular, they coarse-grain a block of “visible” units into a single “hidden” unit. For reasons which should by now be obvious, I will instead stick to the natural Bayesian identifications above, in order to preserve the analogy with standard coarse-graining in RG.

Let us now repeat the above analysis in this more general framework. The real-space RG prescription consists of coarse-graining {\chi\rightarrow\chi'}, and then writing the new distribution {p(\chi')} in the canonical form (24). In Bayesian terms, we need to marginalize over the information about {\chi} contained in the distribution of {\chi'}, except that unlike in my simple example above, we don’t want to make any assumptions about the form of {p(\chi,\chi')}. So we instead express the integral — or rather, the discrete sum over lattice sites — in terms of the conditional distribution {p(\chi'|\chi)}:

\displaystyle p(\chi')=\sum_\chi p(\chi,\chi') =\sum_\chi p(\chi'|\chi)\,p(\chi)~. \ \ \ \ \ (25)

where {\sum\nolimits_\chi=\prod_{i=1}^m\sum_{\chi_i}}, {\chi=\{\chi_1,\ldots,\chi_m\}}. Denoting the new dimensionless hamiltonian {\mathcal{K}'(\chi')}, with {\chi'=\{\chi_1',\ldots,\chi_n'\}}, {n<m}, we therefore have

\displaystyle e^{\mathcal{K}'(\chi')}=\sum_\chi p(\chi'|\chi)\,e^{\mathcal{K}(\chi)}~. \ \ \ \ \ (26)

So far, so familiar, but now comes the trick: [8] split the hamiltonian {\mathcal{K}(\chi)} into a piece containing only intra-block terms, {\mathcal{K}_0(\chi)} (that is, interactions solely among the set of hidden units which is to be coarse-grained into a single visible unit), and a piece containing the remaining, inter-block terms, {\mathcal{K}_1(\chi)} (that is, interactions between different aforementioned sets of hidden units).

Let us denote a block of hidden units by {\mathcal{H}_j\ni\chi_i}, such that {\chi=\bigcup_{j=1}^n\mathcal{H}_j} (note that since {\mathrm{dim}(\chi)=m}, this implies {\mathrm{dim}(\mathcal{H}_j)=m/n} degrees of freedom {\chi_i} per block). To each {\mathcal{H}_j}, we associate a visible unit {\mathcal{V}_j=\chi'_j}, into which the constituent UV variables {\chi_i} have been coarse-grained. (Note that, for the reasons explained above, we have swapped {\mathcal{H}\leftrightarrow\mathcal{V}} relative to [8]). Then translation invariance implies

\displaystyle p(\chi'|\chi)=\prod_{j=1}^np(\mathcal{V}_j|\mathcal{H}_j) \qquad\mathrm{and}\qquad \mathcal{K}_0(\chi)=\sum_{j=1}^n\mathcal{K}_b(\mathcal{H}_j)~, \ \ \ \ \ (27)

where {\mathcal{K}_b(\mathcal{H}_j)} denotes a single intra-block term of the hamiltonian. With this notation in hand, (26) becomes

\displaystyle e^{\mathcal{K}'(\chi')}=\sum_\chi e^{\mathcal{K}_1(\chi)}\prod_{j=1}^np(\mathcal{V}_j|\mathcal{H}_j)\,e^{\mathcal{K}_b(\mathcal{H}_j)}~. \ \ \ \ \ (28)

Now, getting from this to the first line of eqn. (13) in [8] is a bit of a notational hazard. We must suppose that for each block {\mathcal{H}_j}, we can define the block-distribution {p_j=Z_b^{-1}e^{\mathcal{K}_b(\mathcal{H}_j)}}, where {Z_b=\sum_{\chi_i\in\mathcal{H}_j}e^{\mathcal{K}_b(\mathcal{H}_j)}=\prod_{i=1}^{m/n}\sum_{\chi_i}e^{\mathcal{K}_b(\chi_i)}}. Given the underlying factorization of the total Hilbert space, we furthermore suppose that the distribution of all intra-block contributions can be written {p_0=Z_0^{-1}e^{\mathcal{K}_0(\chi)}}, so that {Z_0=\prod_{i=1}^m\sum_{\chi_i\in\chi}e^{\mathcal{K}_0(\chi)}}. This implies that

\displaystyle Z_0=\sum_{\chi_1}\ldots\sum_{\chi_m}e^{\mathcal{K}_b(\mathcal{H}_1)}\ldots\,e^{\mathcal{K}_b(\mathcal{H}_n)} =\prod_{j=1}^n\sum_{\chi_i\in\mathcal{H}_j}e^{\mathcal{K}_b(\mathcal{H}_j)} =\prod_{j=1}^nZ_b~. \ \ \ \ \ (29)

Thus we see that we can insert a factor of {1=Z_0\cdot\left(\prod\nolimits_jZ_b\right)^{-1}} into (28), from which the remaining manipulations are straightforward: we identify

\displaystyle p(\mathcal{V}_j|\mathcal{H}_j)\frac{1}{Z_b}\,e^{\mathcal{K}_b(\mathcal{H}_j)} =p(\mathcal{V}_j|\mathcal{H}_j)\,p(\mathcal{H}_j) =p(\mathcal{H}_j|\mathcal{V}_j)\,p(\mathcal{V}_j) \ \ \ \ \ (30)

and {p(\chi')=\prod_{j=1}^np(\mathcal{V}_j)} (again by translation invariance), whereupon we have

\displaystyle e^{\mathcal{K}'(\chi')}=Z_0\,p(\chi')\sum_\chi p(\chi|\chi')\,e^{\mathcal{K}_1(\chi)}=Z_0\,p(\chi')\langle e^{\mathcal{K}_1(\chi)}\rangle~, \ \ \ \ \ (31)

where the expectation value is defined with respect to the conditional distribution {p(\chi|\chi')}. Finally, by taking the log, we obtain

\displaystyle \mathcal{K}'(\chi')=\ln Z_0+\ln p(\chi')+\ln\langle e^{\mathcal{K}_1(\chi)}\rangle~, \ \ \ \ \ (32)

which one clearly recognizes as a generalization of (12): {\ln Z_0} accounts for the normalization factor {f(\beta)}; {\ln p(\chi')} gives the contribution from the un-marginalized variables {x,y}; and the log of the expectation value is the contribution from the UV cumulants, cf. eqn. (1) in the previous post. Note that this is not the cumulant generating function itself, but corresponds to setting {t\!=\!1} therein:

\displaystyle K_X(1)=\ln\langle e^{X}\rangle=\sum_{n=1}\frac{\kappa_n}{n!}~. \ \ \ \ \ (33)

Within expectation values, {\mathcal{K}_1} becomes the dimensionless energy {-\beta\langle E_1\rangle}, so the {n^\mathrm{th}} moment/cumulant picks up a factor of {(-\beta)^n} relative to the usual energetic moments in eqn. (11) of the previous post. Thus we may express (32) in terms of the cumulants of the dimensionless hamiltonian {\mathcal{K}_1} as

\displaystyle \mathcal{K}'(\chi')=\ln Z_0+\ln p(\chi')+\sum_{n=1}\frac{(-\beta)^n}{n!}\kappa_n~, \ \ \ \ \ (34)

where {\kappa_n=K_{E_1}^{(n)}(t)\big|_{t=0}}, and the expectation values in the generating functions are computed with respect to {p(\chi|\chi')}.
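As a quick numerical illustration of the identity (33) (a throwaway example of my own, not from [8]): for a Gaussian variable only the first two cumulants are nonzero, so the full sum collapses to {K_X(1)=\kappa_1+\kappa_2/2}.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.4, 0.5
X = rng.normal(mu, sigma, size=1_000_000)

lhs = np.log(np.mean(np.exp(X)))   # K_X(1) = ln <e^X>, estimated by Monte Carlo
rhs = mu + sigma**2 / 2            # kappa_1/1! + kappa_2/2!; all higher cumulants vanish

assert np.isclose(lhs, rhs, atol=5e-3)
```

For a non-Gaussian {\mathcal{K}_1}, the higher terms in (34) are exactly the corrections this truncation misses.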

This is great, but we’re not quite finished, since we’d still like to determine the renormalized couplings in terms of the cumulants, as I did in the simple Gaussian DBM above. This requires expressing the new hamiltonian in the same form as the old, which allows one to identify exactly which contributions from the UV degrees of freedom go where. (See for example chapter 13 of [9] for a pedagogical exposition of this decimation RG procedure for the 1d Ising model). For the class of lattice models considered in [8] — by which I mean, real-space decimation with the imposition of a buffer zone — one can write down a formal expression for the canonical form of the hamiltonian, but expressions for the renormalized couplings themselves remain model-specific.
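For the 1d Ising chain specifically, the decimation step can be carried out in closed form — this is the standard textbook result referenced above, sketched here as a check: summing out the middle spin between two neighbours yields a renormalized coupling {K'} satisfying {\tanh K'=\tanh^2K}, plus a spin-independent constant.

```python
import numpy as np

K = 0.7                                   # nearest-neighbour coupling (times beta)
Kp = np.arctanh(np.tanh(K)**2)            # renormalized coupling after decimation
C = 2 * np.sqrt(np.cosh(2*K))             # spin-independent normalization constant

for s1 in (-1, 1):
    for s3 in (-1, 1):
        # marginalize the even spin s2 between the odd spins s1, s3
        direct = sum(np.exp(K * (s1*s2 + s2*s3)) for s2 in (-1, 1))
        # the result is exactly of the original nearest-neighbour form
        assert np.isclose(direct, C * np.exp(Kp * s1 * s3))
```

It is precisely this rewriting of the marginalized Boltzmann weight back into the original functional form that defines the recursion {K\rightarrow K'}, and which remains model-specific in the general lattice setting of [8].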

There’s more cool stuff in the paper [8] that I won’t go into here, concerning the question of “optimality” and the behaviour of mutual information in these sorts of networks. Suffice to say that, as alluded in the previous post, the intersection of physics, information theory, and machine learning is potentially rich yet relatively unexplored territory. While the act of learning itself is not an RG in a literal sense, the two share a hierarchical Bayesian language that may yield insights in both directions, and I hope to investigate this more deeply (pun intended) soon.


[1] C. Beny, “Deep learning and the renormalization group,” arXiv:1301.3124.

[2] P. Mehta and D. J. Schwab, “An exact mapping between the Variational Renormalization Group and Deep Learning,” arXiv:1410.3831.

[3] H. W. Lin, M. Tegmark, and D. Rolnick, “Why Does Deep and Cheap Learning Work So Well?,” arXiv:1608.08225.

[4] S. Iso, S. Shiba, and S. Yokoo, “Scale-invariant Feature Extraction of Neural Network and Renormalization Group Flow,” arXiv:1801.07172.

[5] M. Koch-Janusz and Z. Ringel, “Mutual information, neural networks and the renormalization group,” arXiv:1704.06279.

[6] S. S. Funai and D. Giataganas, “Thermodynamics and Feature Extraction by Machine Learning,” arXiv:1810.08179.

[7] E. Mello de Koch, R. Mello de Koch, and L. Cheng, “Is Deep Learning an RG Flow?,” arXiv:1906.05212.

[8] P. M. Lenggenhager, Z. Ringel, S. D. Huber, and M. Koch-Janusz, “Optimal Renormalization Group Transformation from Information Theory,” arXiv:1809.09632.

[9] R. K. Pathria, Statistical Mechanics, Second edition. Butterworth-Heinemann (1996).

Posted in Minds & Machines, Physics

Cumulants, correlators, and connectivity

Lately, I’ve been spending a lot of time exploring the surprisingly rich mathematics at the intersection of physics, information theory, and machine learning. Among other things, this has led me to a new appreciation of cumulants. At face value, these are just an alternative to the moments that characterize a given probability distribution function, and aren’t particularly exciting. Except they show up all over statistical thermodynamics, quantum field theory, and the structure of deep neural networks, so of course I couldn’t resist trying to better understand the information-theoretic connections to which this seems to allude. In the first part of this two-post sequence, I’ll introduce them in the context of theoretical physics, and then turn to their appearance in deep learning in the next post, where I’ll dive into the parallel with the renormalization group.

The relation between these probabilistic notions and statistical physics is reasonably well-known, though the literature on this particular point unfortunately tends to be slightly sloppy. Loosely speaking, the partition function corresponds to the moment generating function, and the (Helmholtz) free energy corresponds to the cumulant generating function. By way of introduction, let’s make this identification precise.

The moment generating function for a random variable {X} is

\displaystyle M_X(t)\equiv \langle e^{tX}\rangle~,\quad\quad\forall t\in\mathbb{R}~, \ \ \ \ \ (1)

where {\langle\ldots\rangle} denotes the expectation value for the corresponding distribution. (As a technical caveat: in some cases, the moments — and correspondingly, {M_X} — may not exist, in which case one can resort to the characteristic function instead). By series expanding the exponential, we have

\displaystyle M_X(t)=1+t\langle X\rangle+\frac{t^2}{2}\langle X^2\rangle+\ldots\,=1+\sum_{n=1}m_n\frac{t^n}{n!}~, \ \ \ \ \ (2)

where {m_n} is the {n^\mathrm{th}} moment, which we can obtain by taking {n} derivatives and setting {t\!=\!0}, i.e.,

\displaystyle m_n=M_X^{(n)}(t)\Big|_{t=0}=\langle X^n\rangle~. \ \ \ \ \ (3)

However, it is often more convenient to work with cumulants instead of moments (e.g., for independent random variables, the cumulant of the sum is the sum of the cumulants, thanks to the log). These are uniquely specified by the moments, and vice versa—unsurprisingly, since the cumulant generating function is just the log of the moment generating function:

\displaystyle K_X(t)\equiv\ln M_X(t)=\ln\langle e^{tX}\rangle \equiv\sum_{n=1}\kappa_n\frac{t^n}{n!}~, \ \ \ \ \ (4)

where {\kappa_n} is the {n^\mathrm{th}} cumulant, which we again obtain by differentiating {n} times and setting {t=0}:

\displaystyle \kappa_n=K_X^{(n)}(t)\big|_{t=0}~. \ \ \ \ \ (5)

Note however that {\kappa_n} is not simply the log of {m_n}!
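
Since this is easy to check explicitly, here's a quick numerical aside (in Python, with a toy three-outcome distribution of my own choosing) verifying the first few moment/cumulant relations by finite-differencing {K_X(t)} at {t=0}:

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0])                  # toy outcomes (arbitrary choice)
p = np.array([0.5, 0.3, 0.2])                  # ... and their probabilities
m = lambda n: (p * x**n).sum()                 # raw moments m_n = <X^n>
K = lambda t: np.log((p * np.exp(t*x)).sum())  # cumulant generating function (4)

h = 1e-3
k2 = (K(h) - 2*K(0.0) + K(-h)) / h**2          # kappa_2 via central differences
k3 = (K(2*h) - 2*K(h) + 2*K(-h) - K(-2*h)) / (2*h**3)

assert np.isclose(k2, m(2) - m(1)**2, rtol=1e-3)                 # the variance
assert np.isclose(k3, m(3) - 3*m(1)*m(2) + 2*m(1)**3, rtol=1e-2)
```

In particular, the cumulants are polynomial combinations of the moments, not logs thereof.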

Now, to make contact with thermodynamics, consider the case in which {X} is the energy of the canonical ensemble. The probability of a given energy eigenstate {E_i} is

\displaystyle p_i\equiv p(E_i)=\frac{1}{Z[\beta]}e^{-\beta E_i}~, \quad\quad \sum\nolimits_ip_i=1~. \ \ \ \ \ (6)

The moment generating function for energy is then

\displaystyle M_E(t)=\langle e^{tE}\rangle=\sum_i p(E_i)e^{tE_i} =\frac{1}{Z[\beta]}\sum_ie^{-(\beta\!-\!t)E_i} =\frac{Z[\beta-t]}{Z[\beta]}~. \ \ \ \ \ (7)

Thus we see that the partition function {Z[\beta]} is not the moment generating function, but there’s clearly a close relationship between the two. Rather, the precise statement is that the moment generating function {M_E(t)} is the ratio of two partition functions at inverse temperatures {\beta-t} and {\beta}, respectively. We can gain further insight by considering the moments themselves, which are — by definition (3) — simply expectation values of powers of the energy:

\displaystyle \langle E^n\rangle=M^{(n)}(t)\Big|_{t=0} =\frac{1}{Z[\beta]}\frac{\partial^n}{\partial t^n}Z[\beta\!-\!t]\bigg|_{t=0} =(-1)^n\frac{Z^{(n)}[\beta\!-\!t]}{Z[\beta]}\bigg|_{t=0} =(-1)^n\frac{Z^{(n)}[\beta]}{Z[\beta]}~. \ \ \ \ \ (8)

Note that derivatives of the partition function with respect to {t} have, at {t=0}, become derivatives with respect to inverse temperature {\beta} (obviously, this little sleight of hand doesn’t work for all functions; a simple counterexample: {f(\beta-t)=(\beta-t)^2}). Of course, this is simply a more formal expression for the usual thermodynamic expectation values. The first moment of energy, for example, is

\displaystyle \langle E\rangle= -\frac{1}{Z[\beta]}\frac{\partial Z[\beta]}{\partial\beta} =\frac{1}{Z[\beta]}\sum_i E_ie^{-\beta E_i} =\sum_i E_i\,p_i~, \ \ \ \ \ (9)

which is the ensemble average. At a more abstract level however, (8) expresses the fact that the average energy — appropriately normalized — is canonically conjugate to {\beta}. That is, recall that derivatives of the action are conjugate variables to those with respect to which we differentiate. In classical mechanics for example, energy is conjugate to time. Upon Wick rotating to Euclidean signature, the trajectories become thermal circles with period {\beta}. Accordingly, the energetic moments can be thought of as characterizing the dynamics of the ensemble in imaginary time.
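
For the skeptical reader, here's a quick numerical check of (7) and (9) with a toy four-level spectrum (the numbers are my own, arbitrary choice):

```python
import numpy as np

E = np.array([0.0, 1.0, 2.5, 4.0])            # toy energy levels (assumption)
beta, t = 1.2, 0.3
Z = lambda b: np.exp(-b*E).sum()              # canonical partition function
p = np.exp(-beta*E) / Z(beta)                 # Boltzmann weights, eq. (6)

# eq. (7): the MGF of the energy is a ratio of partition functions
assert np.isclose((p*np.exp(t*E)).sum(), Z(beta - t)/Z(beta))

# eq. (9): <E> = -Z'[beta]/Z[beta], with Z' from a central difference
h = 1e-6
Zp = (Z(beta + h) - Z(beta - h)) / (2*h)
assert np.isclose(-Zp/Z(beta), (p*E).sum())
```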

Now, it follows from (7) that the cumulant generating function (4) is

\displaystyle K_E(t)=\ln\langle e^{tE}\rangle=\ln Z[\beta\!-\!t]-\ln Z[\beta]~. \ \ \ \ \ (10)

While the {n^\mathrm{th}} cumulant does not admit a nice post-derivative expression as in (8) (though I suppose one could write it in terms of Bell polynomials if we drop the adjective), it is simple enough to compute the first few and see that, as expected, the first cumulant is the mean, the second is the variance, and the third is the third central moment:

\displaystyle \begin{aligned} K^{(1)}(t)\big|_{t=0}&=-\frac{Z'[\beta]}{Z[\beta]}=\langle E\rangle~,\\ K^{(2)}(t)\big|_{t=0}&=\frac{Z''[\beta]}{Z[\beta]}-\left(\frac{Z'[\beta]}{Z[\beta]}\right)^2=\langle E^2\rangle-\langle E\rangle^2~,\\ K^{(3)}(t)\big|_{t=0}&=-2\left(\frac{Z'[\beta]}{Z[\beta]}\right)^3+3\frac{Z'[\beta]Z''[\beta]}{Z[\beta]^2}-\frac{Z^{(3)}[\beta]}{Z[\beta]}\\ &=2\langle E\rangle^3-3\langle E\rangle\langle E^2\rangle+\langle E^3\rangle =\left\langle\left( E-\langle E\rangle\right)^3\right\rangle~, \end{aligned} \ \ \ \ \ (11)

where the prime denotes the derivative with respect to {\beta}. Note that since the second term in the generating function (10) is independent of {t}, the normalization drops out when computing the cumulants, so we would have obtained the same results had we worked directly with the partition function {Z[\beta]} and taken derivatives with respect to {\beta}. That is, we could define

\displaystyle K_E(\beta)\equiv-\ln Z[\beta] \qquad\implies\qquad \kappa_n=(-1)^{n-1}K_E^{(n)}(\beta)~, \ \ \ \ \ (12)

where, in contrast to (5), we don’t need to set anything to zero after differentiating. This expression for the cumulant generating function will feature more prominently when we discuss correlation functions below.
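
Equation (12) is likewise easy to verify by finite-differencing {-\ln Z} with respect to {\beta} (toy spectrum again my own arbitrary choice):

```python
import numpy as np

E = np.array([0.0, 1.0, 2.5, 4.0])            # toy energy levels (assumption)
beta, h = 1.2, 1e-4
K = lambda b: -np.log(np.exp(-b*E).sum())     # K_E(beta) = -ln Z[beta], eq. (12)

p = np.exp(-beta*E); p /= p.sum()
mean = (p*E).sum()
var = (p*E**2).sum() - mean**2

K1 = (K(beta + h) - K(beta - h)) / (2*h)
K2 = (K(beta + h) - 2*K(beta) + K(beta - h)) / h**2
assert np.isclose(K1, mean)                   # kappa_1 = +K'(beta)
assert np.isclose(-K2, var, rtol=1e-4)        # kappa_2 = -K''(beta)
```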

So, what does the cumulant generating function have to do with the (Helmholtz) free energy, {F[\beta]=-\beta^{-1}\ln Z[\beta]}? Given the form (12), one sees that they’re essentially one and the same, up to a factor of {\beta}. And indeed the free energy is a sort of “generating function” in the sense that it allows one to compute any desired thermodynamic quantity of the system. The entropy, for example, is

\displaystyle S=-\frac{\partial F}{\partial T}=\beta^2\frac{\partial F}{\partial\beta} =\beta\langle E\rangle+\ln Z=-\langle\ln p\rangle~, \ \ \ \ \ (13)

where {p} is the Boltzmann distribution (6). However, the factor of {\beta^{-1}} in the definition of free energy technically prevents a direct identification with the cumulant generating function above. Thus it is really the log of the partition function itself — i.e., the dimensionless free energy {\beta F} — that serves as the cumulant generating function for the distribution. We’ll return to this idea momentarily, cf. (21) below.
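
Numerically, the chain of equalities in (13), and the dimensionless duality we'll meet in (21), can be confirmed in a few lines (same sort of toy spectrum, arbitrary numbers):

```python
import numpy as np

E = np.array([0.0, 1.0, 2.5, 4.0])            # toy energy levels (assumption)
beta = 1.2
Z = np.exp(-beta*E).sum()
p = np.exp(-beta*E) / Z                       # Boltzmann distribution, eq. (6)

S = beta*(p*E).sum() + np.log(Z)              # middle expression in eq. (13)
assert np.isclose(S, -(p*np.log(p)).sum())    # ... equals the Gibbs entropy
F = -np.log(Z) / beta                         # Helmholtz free energy
assert np.isclose(beta*F + S, beta*(p*E).sum())   # beta*F + S = beta*<E>
```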

So much for definitions; what does it all mean? It turns out that in addition to encoding correlations, cumulants are intimately related to connectedness (in the sense of connected graphs), which underlies their appearance in QFT. Consider, for concreteness, a real scalar field {\phi(x)} in {d} spacetime dimensions. As every student knows, the partition function

\displaystyle Z[J]=\mathcal{N}\int\mathcal{D}\phi\,\exp\left\{i\!\int\!\mathrm{d}^dx\left[\mathcal{L}(\phi,\partial\phi)+J(x)\phi(x)\right]\right\} \ \ \ \ \ (14)

is the generating function for the {n}-point correlator or Green function {G^{(n)}(x_1,\ldots,x_n)}:

\displaystyle G^{(n)}(x_1,\ldots,x_n)=\frac{1}{i^n}\frac{\delta^nZ[J]}{\delta J(x_1)\ldots\delta J(x_n)}\bigg|_{J=0}~, \ \ \ \ \ (15)

where the normalization {\mathcal{N}} is fixed by demanding that in the absence of sources, we should recover the vacuum expectation value, i.e., {Z[0]=\langle0|0\rangle=1}. In the language of Feynman diagrams, the Green function contains all possible graphs — both connected and disconnected — that contribute to the corresponding transition amplitude. For example, the 4-point correlator of {\phi^4} theory contains, at first order in the coupling, a disconnected graph consisting of two Feynman propagators, another disconnected graph consisting of a Feynman propagator and a 1-loop diagram, and an irreducible graph consisting of a single 4-point vertex. But only the last of these contributes to the scattering process, so it’s often more useful to work with the generating function for connected diagrams only,

\displaystyle W[J]=-i\ln Z[J]~, \ \ \ \ \ (16)

from which we obtain the connected Green function {G_c^{(n)}}:

\displaystyle G_c^{(n)}(x_1,\ldots,x_n)=\frac{1}{i^{n-1}}\frac{\delta^nW[J]}{\delta J(x_1)\ldots\delta J(x_n)}\bigg|_{J=0}~. \ \ \ \ \ (17)

The fact that the generating functions for connected vs. disconnected diagrams are related by an exponential, that is, {Z[J]=\exp{i W[J]}}, is not obvious at first glance, but it is a basic exercise in one’s first QFT course to show that the coefficients of various diagrams indeed work out correctly by simply Taylor expanding the exponential {e^X=\sum_n\tfrac{X^n}{n!}}. In the example of {\phi^4} theory above, the only first-order diagram that contributes to the connected correlator is the 4-point vertex. More generally, one can decompose {G^{(n)}} into {G_c^{(n)}} plus products of {G_c^{(m)}} with {m<n}. The factor of {-i} in (16) goes away in Euclidean signature, whereupon we see that {Z[J]} is analogous to {Z[\beta]} — and hence plays the role of the moment generating function — while {W[J]} is analogous to {\beta F[\beta]} — and hence plays the role of the cumulant generating function in the form (12).

Thus, the {n^\mathrm{th}} cumulant of the field {\phi} corresponds to the connected Green function {G_c^{(n)}}, i.e., the contribution from correlators of all {n} fields only, excluding contributions from lower-order correlators among them. For example, we know from Wick’s theorem that Gaussian correlators factorize, so the corresponding {4}-point correlator {G^{(4)}} becomes

\displaystyle \langle\phi_1\phi_2\phi_3\phi_4\rangle= \langle\phi_1\phi_2\rangle\langle\phi_3\phi_4\rangle +\langle\phi_1\phi_3\rangle\langle\phi_2\phi_4\rangle +\langle\phi_1\phi_4\rangle\langle\phi_2\phi_3\rangle~. \ \ \ \ \ (18)

What this means is that there are no interactions among all four fields that aren’t already explained by interactions among pairs thereof. The probabilistic version of this statement is that for the (zero-mean) normal distribution, all cumulants other than {\kappa_2} vanish. (For a probabilist’s exposition on the relationship between cumulants and connectivity, see the first of three lectures by Novak and LaCroix [1], which takes a more graph-theoretic approach).
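
Wick factorization is also easy to verify by brute force. Below is a small Monte Carlo check of (18) for a zero-mean Gaussian with a generic correlation matrix (random numbers mine); the sampled four-point function matches the sum over pairwise contractions up to statistical error:

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T                               # a generic covariance matrix
d = np.sqrt(np.diag(Sigma))
Sigma = Sigma / np.outer(d, d)                # normalize to unit variances
x = rng.multivariate_normal(np.zeros(4), Sigma, size=1_000_000)

emp = (x[:, 0]*x[:, 1]*x[:, 2]*x[:, 3]).mean()
wick = (Sigma[0, 1]*Sigma[2, 3] + Sigma[0, 2]*Sigma[1, 3]
        + Sigma[0, 3]*Sigma[1, 2])            # eq. (18): sum over pairings
assert abs(emp - wick) < 0.1                  # within Monte Carlo error
```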

There’s one more important function that deserves mention here: the final member of the triumvirate of generating functions in QFT, namely the effective action {\Gamma[\phi]}, defined as the Legendre transform of {W[J]}:

\displaystyle \Gamma[\phi]=W[J]-\int\!\mathrm{d} x\,J(x)\phi(x)~. \ \ \ \ \ (19)

The Legendre transform is typically first encountered in classical mechanics, where it relates the hamiltonian and lagrangian formulations. Geometrically, it translates between a function and its envelope of tangents. More abstractly, it provides a map between the configuration space (here, the sources {J}) and the dual vector space (here, the fields {\phi}). In other words, {\phi} and {J} are conjugate pairs in the sense that

\displaystyle \frac{\delta\Gamma}{\delta\phi}=-J \qquad\mathrm{and}\qquad \frac{\delta W}{\delta J}=\phi~. \ \ \ \ \ (20)

As an example that connects back to the thermodynamic quantities above: we already saw that {E} and {\beta} are conjugate variables by considering the partition function, but the Legendre transform reveals that the free energy and entropy are conjugate pairs as well. This is nicely explained in the lovely pedagogical treatment of the Legendre transform by Zia, Redish, and McKay [2], and also cleans up the disruptive factor of {\beta} that prevented the identification with the cumulant generating function above. The basic idea is that since we’re working in natural units (i.e., {k_B=1}), the thermodynamic relation in the form {\beta F+S=\beta E} (13) obscures the duality between the properly dimensionless quantities {\tilde F\equiv\beta F} and {\tilde S=S/k_B}. From this perspective, it is more natural to work with {\tilde F} instead, in which case we have both an elegant expression for the duality in terms of the Legendre transform, and a precise identification of the dimensionless free energy with the cumulant generating function (12):

\displaystyle \tilde F(\beta)+\tilde S(E)=\beta E~, \qquad\qquad K_E(\beta)=\tilde F=\beta F~. \ \ \ \ \ (21)

Now, back to QFT, in which {\Gamma[\phi]} generates one-particle irreducible (1PI) diagrams. A proper treatment of this would take us too far afield, but can be found in any introductory QFT book, e.g., [3]. The basic idea is that in order to be able to cut a reducible diagram, we need to work at the level of vertices rather than sources (e.g., stripping off external legs, and identifying the bare propagator between irreducible parts). The Legendre transform (19) thus removes the dependence on the sources {J}, and serves as the generator for the vertex functions of {\phi}, i.e., the fundamental interaction terms. The reason this is called the effective action is that in perturbation theory, {\Gamma[\phi]} contains the classical action as the leading saddle-point, as well as quantum corrections from the higher-order interactions in the coupling expansion.

In information-theoretic terms, the Legendre transform of the cumulant generating function is known as the rate function. This is a core concept in large deviations theory, and I won’t go into details here. Loosely speaking, it quantifies the exponential decay that characterizes rare events. Concretely, let {X_i} represent the outcome of some measurement or operation (e.g., a coin toss); then the mean after {N} independent trials is

\displaystyle M_N=\frac{1}{N}\sum_{i=1}^N X_i~. \ \ \ \ \ (22)

The probability that a given measurement deviates from this mean by some specified amount {x} is

\displaystyle P(M_N>x)\approx e^{-N I(x)}~, \ \ \ \ \ (23)

where {I(x)} is the aforementioned rate function. The formal similarity with the partition function in terms of the effective action, {Z=e^{-\Gamma}}, is obvious, though the precise dictionary between the two is not. I suspect that a precise translation between the two languages — physics and information theory — can be made here as well, in which the increasing rarity of events as one moves along the tail of the distribution corresponds to increasingly high-order corrections to the quantum effective action, but I haven’t worked this out in detail.
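
To make (23) concrete, here's a check for the coin-toss example using the exact binomial tail; for a fair coin with {X_i\in\{0,1\}}, the rate function works out to {I(x)=x\ln x+(1-x)\ln(1-x)+\ln 2} (the Legendre transform of {K(t)=\ln\tfrac{1+e^t}{2}}, a standard large-deviations exercise not derived here):

```python
import math

def log_tail(N, x):
    """Exact log P(M_N >= x) for N fair coin flips, X_i in {0, 1}."""
    logs = [math.lgamma(N+1) - math.lgamma(k+1) - math.lgamma(N-k+1)
            - N*math.log(2) for k in range(math.ceil(x*N), N+1)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

def I(x):
    """Rate function for the fair coin (Legendre transform of K)."""
    return x*math.log(x) + (1-x)*math.log(1-x) + math.log(2)

x = 0.6
# -(1/N) ln P(M_N >= x) approaches I(x) from above as N grows
assert abs(-log_tail(2000, x)/2000 - I(x)) < 0.01
assert abs(-log_tail(2000, x)/2000 - I(x)) < abs(-log_tail(200, x)/200 - I(x))
```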

Of course, the above is far from the only place in physics where cumulants are lurking behind the scenes, much less the end of the parallel with information theory more generally. In the next post, I’ll discuss the analogy between deep learning and the renormalization group, and see how Bayesian terminology can provide an underlying language for both.


[1] J. Novak and M. LaCroix, “Three lectures on free probability,” arXiv:1205.2097.

[2] R. K. P. Zia, E. F. Redish, and S. R. McKay, “Making sense of the Legendre transform,” arXiv:0806.1147.

[3] L. H. Ryder, Quantum Field Theory. Cambridge University Press, 2 ed., 1996.


Black hole interiors, state dependence, and all that

In the context of firewalls, the crux of the paradox boils down to whether black holes have smooth horizons (as required by the equivalence principle). It turns out that this is intimately related to the question of how the interior of the black hole can be reconstructed by an external observer. AdS/CFT is particularly useful in this regard, because it enables one to make such questions especially sharp. Specifically, one studies the eternal black hole dual to the thermofield double (TFD) state, which cleanly captures the relevant physics of real black holes formed from collapse.

To construct the TFD, we take two copies of a CFT and entangle them such that tracing out either results in a thermal state. Denoting the energy eigenstates of the left and right CFTs by {\tilde E_i} and {E_i}, respectively, the state is given by

\displaystyle |\Psi\rangle=\frac{1}{\sqrt{Z_\beta}}\sum_ie^{-\beta E_i/2}|E_i\rangle\otimes|\tilde E_i\rangle~, \ \ \ \ \ (1)

where {Z_\beta} is the partition function at inverse temperature {\beta}. The AdS dual of this state is the eternal black hole, the two sides of which join the left and right exterior bulk regions through the wormhole. Incidentally, one of the fascinating questions inherent to this construction is how the bulk spacetime emerges in a manner consistent with the tensor product of boundary CFTs. For our immediate purposes, the important fact is that operators in the left CFT source states localized behind the horizon from the perspective of the right (and vice-versa). The requirement from general relativity that the horizon be smooth then imposes conditions on the relationship between these operators.
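
As a sanity check on (1), a few lines of numpy confirm that tracing out one copy of a toy TFD (random spectrum of my own choosing) leaves an exactly thermal state:

```python
import numpy as np

rng = np.random.default_rng(7)
d, beta = 5, 0.8
En = np.sort(rng.uniform(0, 3, d))            # toy CFT spectrum (assumption)
Z = np.exp(-beta*En).sum()

# |Psi> = Z^{-1/2} sum_i e^{-beta E_i/2} |E_i> (x) |E_i~>, eq. (1)
M = np.zeros((d, d))
M[np.arange(d), np.arange(d)] = np.exp(-beta*En/2) / np.sqrt(Z)
psi = M.reshape(-1)

rho = np.outer(psi, psi).reshape(d, d, d, d)  # indices (a, i, b, j)
rho_R = np.einsum('aibi->ab', rho)            # trace out the second (left) copy
assert np.allclose(rho_R, np.diag(np.exp(-beta*En)) / Z)
```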

A noteworthy approach in this vein is the so-called “state-dependence” proposal developed by Kyriakos Papadodimas and Suvrat Raju over the course of several years [1,2,3,4,5] (referred to as PR henceforth). Their collective florilegium spans several hundred pages, jam-packed with physics, and any summary I could give here would be a gross injustice. As alluded above however, the salient aspect is that they phrased the smoothness requirement precisely in terms of a condition on correlation functions of CFT operators across the horizon. Focusing on the two-point function for simplicity, this condition reads:

\displaystyle \langle\Psi|\mathcal{O}(t,\mathbf{x})\tilde{\mathcal{O}}(t',\mathbf{x}')|\Psi\rangle =Z^{-1}\mathrm{tr}\left[e^{-\beta H}\mathcal{O}(t,\mathbf{x})\mathcal{O}(t'+i\beta/2,\mathbf{x}')\right]~. \ \ \ \ \ (2)

Here, {\mathcal{O}} is an exterior operator in the right CFT, while {\tilde{\mathcal{O}}} is an interior operator in the left—that is, it represents an excitation localized behind the horizon from the perspective of an observer in the right wedge (see the diagram above). The analytic continuation {\tilde{\mathcal{O}}(t,\mathbf{x})\rightarrow\mathcal{O}(t+i\beta/2,\mathbf{x})} arises from the KMS condition (i.e., the periodicity of thermal Green functions in imaginary time). Physically, this is essentially the statement that one should reproduce the correct thermal expectation values when restricted to a single copy of the CFT.
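
In finite dimensions one can check (2) directly, at least at {t=t'=0}. In the toy model below (my own construction, not PR's), the purification of a Hermitian {\mathcal{O}} on the opposite copy of the TFD is simply its complex conjugate in the energy eigenbasis, and the cross-correlator indeed reproduces the thermal trace with the {i\beta/2} shift:

```python
import numpy as np

rng = np.random.default_rng(0)
d, beta = 6, 1.3
X = rng.standard_normal((d, d)) + 1j*rng.standard_normal((d, d))
H = (X + X.conj().T) / 2                      # toy "CFT Hamiltonian"
Y = rng.standard_normal((d, d)) + 1j*rng.standard_normal((d, d))
O = (Y + Y.conj().T) / 2                      # Hermitian exterior operator

E, V = np.linalg.eigh(H)
Ov = V.conj().T @ O @ V                       # O in the energy eigenbasis
Z = np.exp(-beta*E).sum()
m = np.exp(-beta*E/2) / np.sqrt(Z)            # TFD coefficients

# <Psi| O (x) O~ |Psi>, with the mirror O~ = conj(O) on the other copy
lhs = np.einsum('i,j,ij,ij->', m, m, Ov, Ov.conj())
# Z^{-1} tr[e^{-bH} O e^{-bH/2} O e^{+bH/2}], i.e. the KMS-shifted trace
rhs = np.einsum('i,ij,j,ji,i->', np.exp(-beta*E), Ov,
                np.exp(-beta*E/2), Ov, np.exp(beta*E/2)) / Z
assert np.allclose(lhs, rhs)
```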

The question then becomes whether one can find such operators in the CFT that satisfy this constraint. That is, we want to effectively construct interior operators by acting only in the exterior CFT. PR achieve this through their so-called “mirror operators” {\tilde{\mathcal{O}}_n}, defined by

\displaystyle \tilde{\mathcal{O}}_n\mathcal{O}_m|\Psi\rangle=\mathcal{O}_me^{-\beta H/2}\mathcal{O}_n^\dagger e^{\beta H/2}|\Psi\rangle~. \ \ \ \ \ (3)

While appealingly compact, it’s more physically insightful to unpack this into the following two equations:

\displaystyle \tilde{\mathcal{O}}_n|\Psi\rangle=e^{-\beta H/2}\mathcal{O}_n^\dagger e^{\beta H/2}|\Psi\rangle~, \quad\quad \tilde{\mathcal{O}}_n\mathcal{O}_m|\Psi\rangle=\mathcal{O}_m\tilde{\mathcal{O}}_n|\Psi\rangle~. \ \ \ \ \ (4)

The key point is that these operators are defined via their action on the state {|\Psi\rangle}, i.e., they are state-dependent operators. For example, the second equation does not say that the operators commute; indeed, as operators, {[\tilde{\mathcal{O}}_n,\mathcal{O}_m]\neq0}. But the commutator does vanish in this particular state, {[\tilde{\mathcal{O}}_n,\mathcal{O}_m]|\Psi\rangle=0}. This may seem strange at first sight, but it’s really just a matter of carefully distinguishing between equations that hold as operator statements and those that hold only at the level of states. Indeed, this is precisely the same crucial distinction between localized states vs. localized operators that I’ve discussed before.

PR’s work created considerable backreaction, most of which centered on the nature of this “unusual” state dependence, and generated no small amount of confusion. Aspects of PR’s proposal were critiqued in a number of papers, particularly [6,7], which led many to claim that state dependence violates quantum mechanics. Coincidentally, I had the good fortune of being a visiting grad student at the KITP around this time, where these issues were hotly debated during a long-term workshop on quantum gravity. This was a very stimulating time, when the firewall paradox was still center-stage, and the collective confusion was almost palpable. Granted, I was a terribly confused student, but the fact that the experts couldn’t even agree on language — let alone physics — certainly didn’t do me any favours. Needless to say, the debate was never resolved, and the field’s collective attention span eventually drifted to other things. Yet somehow, the claim that state dependence violates quantum mechanics (or otherwise constitutes an unusual or potentially problematic modification thereof) has since risen to the level of dogma, and one finds it regurgitated again and again in papers published since.

Motivated in part by the desire to understand the precise nature of state dependence in this context (though really, it was the interior spacetime I was after), I wrote a paper [8] last year in an effort to elucidate and connect a number of interesting ideas in the emergent spacetime or “It from Qubit” paradigm. At a technical level, the only really novel bit was the application of modular inclusions, which provide a relatively precise framework for investigating the question of how one represents information in the black hole interior, and perhaps how the bulk spacetime emerges more generally. The relation between Tomita-Takesaki theory itself (a subset of algebraic quantum field theory) and state dependence was already pointed out by PR [3], and is highlighted most succinctly in Kyriakos’ later paper in 2017 [9], which was the main stimulus behind my previous post on the subject. However, whereas PR arrived at this connection from more physical arguments (over the course of hundreds of pages!), I took essentially the opposite approach: my aim was to distill the fundamental physics as cleanly as possible, to which end modular theory proves rather useful for demystifying issues which might otherwise remain obfuscated by details. The focus of my paper was consequently decidedly more conceptual, and represents a personal attempt to gain deeper physical insight into a number of tantalizing connections that have appeared in the literature in recent years (e.g., the relationship between geometry and entanglement represented by Ryu-Takayanagi, or the ontological basis for quantum error correction in holography).

I’ve little to add here that isn’t said better in [8] — and indeed, I’ve already written about various aspects on other occasions — so I invite you to simply read the paper if you’re interested. Personally, I think it’s rather well-written, though card-carrying members of the “shut up and calculate” camp may find it unpalatable. The paper touches on a relatively wide range of interrelated ideas in holography, rather than state dependence alone; but the upshot for the latter is that, far from being pathological, state dependence (precisely defined) is

  1. a natural part of standard quantum field theory, built-in to the algebraic framework at a fundamental level, and
  2. an inevitable feature of any attempt to represent information behind horizons.

I hasten to add that “information” is another one of those words that physicists love to abuse; here, I mean a state sourced by an operator whose domain of support is spacelike separated from the observer (e.g., excitations localized on the opposite side of a Rindler/black hole horizon). The second statement above is actually quite general, and results whenever one attempts to reconstruct an excitation outside its causal domain.

So why am I devoting an entire post to this, if I’ve already addressed it at length elsewhere? There are essentially two motivations for this. One is that I recently had the opportunity to give a talk about this at the YITP in Kyoto (the slides for which are available from the program website here), and I fell back down the rabbit hole in the course of reviewing. In particular, I wanted to better understand various statements in the literature to the effect that state dependence violates quantum mechanics. I won’t go into these in detail here — one can find a thorough treatment in PR’s later works — but suffice it to say the primary issue seems to lie more with language than physics: in the vast majority of cases, the authors simply weren’t precise about what they meant by “state dependence” (though in all fairness, PR weren’t totally clear on this either), and the rare exceptions to this had little to nothing to do with the unqualified use of the phrase here. I should add the disclaimer that I’m not necessarily vouching for every aspect of PR’s approach—they did a hell of a lot more than just write down (3), after all. My claim is simply that state dependence, in the fundamental sense I describe, is a feature, not a bug. Said differently, even if one rejects PR’s proposal as a whole, the state dependence that ultimately underlies it will continue to underlie any representation of the black hole interior. Indeed, I had hoped that my paper would help clarify things in this regard.

And this brings me to the second reason, namely: after my work appeared, a couple other papers [10,11] were written that continued the offense of conflating the unqualified phrase “state dependence” with different and not-entirely-clear things. Of course, there’s no monopoly on terminology: you can redefine terms however you like, as long as you’re clear. But conflating language leads to conflated concepts, and this is where we get into trouble. Case in point: both papers contain a number of statements which I would have liked to see phrased more carefully in light of my earlier work. Indeed, [11] goes so far as to write that “interior operators cannot be encoded in the CFT in a state-dependent way.” On the contrary, as I had explained the previous year, it’s actually the state-independent operators that lead to pathologies (specifically, violations of unitarity)! Clearly, whatever the author means by this, it is not the same state dependence at work here. So consider this a follow-up attempt to stop further terminological misuse and confusion.

As I’ll discuss below, both these works — and indeed most other proposals from quantum information — ultimately rely on the Hayden-Preskill protocol [12] (and variations thereof), so the real question is how the latter relates to state dependence in the unqualified use of the term (i.e., as defined via Tomita-Takesaki theory; I refer to this usage as “unqualified” because if you’re talking about firewalls and don’t specify otherwise, then this is the relevant definition, as it underlies PR’s introduction of the phrase). I’ll discuss this in the context of Beni’s work [10] first, since it’s the clearer of the two, and comment more briefly on Geof’s [11] below.

In a nutshell, the classic Hayden-Preskill result [12] is a statement about the ability to decode information given only partial access to the complete quantum state. In particular, one imagines that the proverbial Alice throws a message comprising {k} bits of information into a black hole of size {n\!-\!k\gg k}. The black hole will scramble this information very quickly — the details are not relevant here — such that the information is encoded in some complicated manner among the (new) total {n} bits of the black hole. For example, if we model the internal dynamics as a simple permutation of Alice’s {k}-bit message, it will be transformed into one of {2^k} possible {n}-bit strings—a huge number of possibilities!

Now suppose Bob wishes to reconstruct the message by collecting qubits from the subsequent Hawking radiation. Naïvely, one would expect him to need essentially all {n} bits (i.e., to wait until the black hole evaporates) in order to accurately determine among the {2^k} possibilities. The surprising result of Hayden-Preskill is that in fact he needs only slightly more than {k} bits. The time-scale for this depends somewhat on the encoding performed by the black hole, but in principle, this means that Bob can recover the message just after the scrambling time. However, a crucial aspect of this protocol is that Bob knows the initial microstate of the black hole (i.e., the original {(n\!-\!k)}-bit string). This is the source of the confusing use of the phrase “state dependence”, as we’ll see below.

Of course, as Hayden and Preskill acknowledge, this is a highly unrealistic model, and they didn’t make any claims about being able to reconstruct the black hole interior in this manner. Indeed, the basic physics involved has nothing to do with black holes per se, but is a generic feature of quantum error correcting codes, reminiscent of the question of how to share (or decode) a quantum “secret” [13]. The novel aspect of Beni’s recent work [10] is to try to apply this to resolving the firewall paradox, by explicitly reconstructing the interior of the black hole.

Beni translates the problem of black hole evaporation into the sort of circuit language that characterizes much of the quantum information literature. On the one hand, this is nice in that it enables him to make very precise statements in the context of a simple qubit model; and indeed, at the mathematical level, everything’s fine. The confusion arises when trying to lift this toy model back to the physical problem at hand. In particular, when Beni claims to reconstruct state-independent interior operators, he is — from the perspective espoused above — misusing the terms “state-independent”, “interior”, and “operator”.

Let’s first summarize the basic picture, and then try to elucidate this unfortunate linguistic hat-trick. The Hayden-Preskill protocol for recovering information from black holes is illustrated in the figure from Beni’s paper below. In this diagram, {B} is the black hole, which is maximally entangled (in the form of some number of EPR pairs) with the early radiation {R}. Alice’s message corresponds to the state {|\psi\rangle}, which we imagine tossing into the black hole as {A}. One then evolves the black hole (which now includes Alice’s message {A}) by some unitary operator {U}, which scrambles the information as above. Subsequently, {D} represents some later Hawking modes, with the remaining black hole denoted {C}. Bob’s task is to reconstruct the state {|\psi\rangle} by acting on {D} and {R} (since he only has access to the exterior) with some operator {V}.

Now, Beni’s “state dependence” refers to the fact that the technical aspects of this construction relied on putting the initial state of the black hole {+} radiation {|\Psi\rangle_{BR}} in the form of a collection of EPR pairs {|\Phi\rangle_{EPR}}. This can be done by finding some unitary operator {K}, such that

\displaystyle |\Psi\rangle_{BR}=(I\otimes K)|\Phi\rangle_{EPR}~, \ \ \ \ \ (5)

(Here, one imagines that {B} is further split into a Hawking mode and its partner just behind the horizon, so that {I} acts on the interior mode while {K} affects only the new Hawking mode and the early radiation; see [10] for details). This is useful because it enables the algorithm to work for arbitrary black holes: for some other initial state {|\Psi'\rangle_{BR}}, one can find some other {K'\neq K} which results in the same state {|\Phi\rangle_{EPR}}. The catch is that Bob’s reconstruction depends on {K}, and therefore, on the initial state {|\Psi\rangle_{BR}}. But this is to be expected: it’s none other than the Hayden-Preskill requirement above that Bob needs to know the exact microstate of the system in order for the decoding protocol to work. It is in this sense that the Hayden-Preskill protocol is “state-dependent”, which clearly references something different than what we mean here. The reason I go so far as to call this a misuse of terminology is that Beni explicitly conflates the two, and regurgitates the claim that these “state-dependent interior operators” lead to inconsistencies with quantum mechanics, referencing work above. Furthermore, as alluded above, there’s an additional discontinuity of concepts here, namely that the “state-dependent” operator {V} is obviously not the “interior operator” to which we’re referring: it’s support isn’t even restricted to the interior, nor does it source any particular state localized therein!

Needless to say, I was in a superposition of confused and unhappy with the terminology in this paper, until I managed to corner Beni at YITP for a couple hours at the aforementioned workshop, where he was gracious enough to clarify various aspects of his construction. It turns out that he actually has in mind something different when he refers to the interior operator. Ultimately, the identification still fails on these same counts, but it’s worth following the idea a bit further in order to see how he avoids the “state dependence” in the vanilla Hayden-Preskill set-up above. (By now I shouldn’t have to emphasize that this form of “state dependence” isn’t problematic in any fundamental sense, and I will continue to distinguish it from the latter, unqualified use of the phrase with quotation marks).

One can see from the above diagram that the state of the black hole {|\Psi\rangle} — before Alice & Bob start fiddling with it — can be represented by the following diagram, also from [10]:

where {R}, {D}, and {C} are again the early radiation, later Hawking mode, and remaining black hole, respectively. The problem Beni solves is finding the “partner” — by which he means, the purification — of {D} in {CR}. Explicitly, he wants to find the operator {\tilde{\mathcal{O}}_{CR}^T} such that

\displaystyle (\mathcal{O}_D\otimes I_{CR})|\Psi\rangle=(I_D\otimes\tilde{\mathcal{O}}_{CR}^T)|\Psi\rangle~. \ \ \ \ \ (6)
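
In fact, the simplest instance of this operator-pushing identity is easy to check numerically: for a maximally entangled (EPR) state, acting with any operator on one factor is equivalent to acting with its transpose on the other, which is eq. (6) with {|\Psi\rangle=|\Phi\rangle_{EPR}}. A quick numpy sketch (my own toy illustration; a random matrix stands in for {\mathcal{O}_D}):

```python
import numpy as np

# Eq. (6) for the maximally entangled state |Phi> = (1/sqrt(d)) sum_i |i>|i>:
# (O ⊗ I)|Phi> = (I ⊗ O^T)|Phi> for any operator O.
d = 4
phi = np.eye(d).reshape(d * d) / np.sqrt(d)    # |Phi> as a vector in C^{d^2}

rng = np.random.default_rng(42)
O = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))

lhs = np.kron(O, np.eye(d)) @ phi              # act on the first factor
rhs = np.kron(np.eye(d), O.T) @ phi            # act on the second with O^T

assert np.allclose(lhs, rhs)
```

For a general entangled state with invertible reduced density matrix, the pushed-through operator instead picks up factors of the state itself, which is (roughly speaking) one way of seeing where state dependence creeps in.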

Note that there’s yet another linguistic, and hence conceptual, discontinuity here, namely that Beni uses “qubit”, “mode”, and “operator” interchangeably (indeed, when I pressed him on this very point, he confirmed that he regards these as synonymous). These are very different beasts in the physical problem at hand; however, for the purposes of Beni’s model, the important fact is that one can push the operator {\mathcal{O}_D} (which one should think of as some operator that acts on {D}) through the unitary {U} to some other operator {\tilde{\mathcal{O}}_{CR}^T} that acts on both {C} and {R}:

He then goes on to show that one can reconstruct this operator {\tilde{\mathcal{O}}_{CR}} independently of the initial state of the black hole (i.e., the operator {K}) by coupling to an auxiliary system. Of course, I’m glossing over a great number of details here; in particular, Beni transmutes the outgoing mode {D} into a representation of the interior mode in his model, and calls whatever purifies it the “partner” {\tilde{\mathcal{O}}_{CR}^T}. Still, I personally find this a bit underwhelming; but then, from my perspective, the Hayden-Preskill “state dependence” wasn’t the issue to begin with; quantum information people may differ, and in any case Beni’s construction is still a neat toy model in its own domain.

However, the various conflations above are problematic when one attempts to map back to the fundamental physics we’re after: {\tilde{\mathcal{O}}_{CR}^T} is not the “partner” of the mode {D} in the relevant sense (namely, the pairwise entangled modes required for smoothness across the horizon), nor does it correspond to PR’s mirror operator (since its support actually straddles both sides of the horizon). Hence, while Beni’s construction does represent a non-trivial refinement of the original Hayden-Preskill protocol, I don’t think it solves the problem.

So if this model misses the point, what does Hayden-Preskill actually achieve in this context? Indeed, even in the original paper [12], they clearly showed that one can recover a message from inside the black hole. Doesn’t this mean we can reconstruct the interior in a state-independent manner, in the proper use of the term?

Well, not really. Essentially, Hayden-Preskill (in which I’m including Beni’s model as the current state-of-the-art) & PR (and I) are asking different questions: the former are asking whether it’s possible to decode messages to which one would not normally have access (answer: yes, if you know enough about the initial state and any auxiliary systems), while the latter are asking whether physics in the interior of the black hole can be represented in the exterior (answer: yes, if you use state-dependent operators). Reconstructing information about entangled qubits is not quite the same thing as reconstructing the state in the interior. Consider a single Bell pair for simplicity, consisting of an exterior qubit (say, {D} in Beni’s model) and the interior “partner” that purifies it. Obviously, this state isn’t localized to either side, and so does not correspond to an interior operator.

The distinction is perhaps a bit subtle, so let me try to clarify. Let us define the operator {\mathcal{O}_A} with support behind the horizon, whose action on the vacuum creates the state in which Alice’s message has been thrown into the black hole; i.e., let

\displaystyle |\Psi\rangle_A=(\mathcal{O}_A\otimes I_R)|\Psi\rangle_{EPR} \ \ \ \ \ (7)

denote the state of the black hole containing Alice’s message, where the identity factor acts on the early radiation. Now, the fundamental result of PR is that if Bob wishes to reconstruct the interior of the black hole (concretely, the excitation behind the horizon corresponding to Alice’s message), he can only do so using state-dependent operators. In other words, there is no operator with support localized to the exterior which precisely equals {\mathcal{O}_A}; but Bob can find a state that approximates {|\Psi\rangle_A} arbitrarily well. This is more than just an operational restriction: it stems from an interesting trade-off between locality and unitarity which seems built into the theory at a fundamental level; see [8] for details.

Alternatively, Bob might not care about directly reconstructing the black hole interior (since he’s not planning on following Alice in, he’s not concerned about verifying smoothness as we are). Instead he’s content to wait for the “information” in this state to be emitted in the Hawking radiation. In this scenario, Bob isn’t trying to reconstruct the black hole interior corresponding to (7)—indeed, by now this state has long since been scrambled. Rather, he’s only concerned with recovering the information content of Alice’s message—a subtly related but crucially distinct procedure from trying to reconstruct the corresponding state in the interior. And the fundamental result of Hayden-Preskill is that, given some admittedly idealistic assumptions (i.e., to the extent that the evaporating black hole can be viewed as a simple qubit model), this can also be done.

In the case of Geof’s paper [11], there’s a similar but more subtle language difference at play. Here the author uses “state dependence” to mean something different from both Beni and PR/myself; specifically, he means “state dependence” in the context of quantum error correction (QEC). This is more clearly explained in his earlier paper with Hayden [14], and refers to the fact that in general, a given boundary operator may only reconstruct a given bulk operator for a single black hole microstate. Conversely, a “state-independent” boundary operator, in their language, is one which approximately reconstructs a given bulk operator in a larger class of states—specifically, all states in the code subspace. Note that the qualifier “approximate” is crucial here. Otherwise, schematically, if {\epsilon} represents some small perturbation of the vacuum {\Omega} (where “small” means that the backreaction is insufficient to move us beyond the code subspace), then an exact reconstruction of the operator {\mathcal{O}} that sources the state {|\Psi\rangle=\mathcal{O}|\Omega\rangle} would instead produce some other state {|\Psi'\rangle=\mathcal{O}|\Omega+\epsilon\rangle}. So at the end of the day, I simply find the phrasing in [11] misleading; the lack of qualifiers makes many of his statements about “state-(in)dependence” technically erroneous, even though they’re perfectly correct in the context of approximate QEC.

At the end of the day however, these [10,11,14] are ultimately quantum information-theoretic models, in which the causal structure of the original problem plays no role. This is obvious in Beni’s case [10], in which Hayden-Preskill boils down to the statement that if one knows the exact quantum state of the system (or approximately so, given auxiliary qubits), then one can recover information encoded non-locally (e.g., Alice’s bit string) from substantially fewer qubits than one would naïvely expect. It’s more subtle in [11,14], since the authors work explicitly in the context of entanglement wedge reconstruction in AdS/CFT, which superficially would seem to include aspects of the spacetime structure. However, they take the black hole to be included in the entanglement wedge (i.e., code subspace) in question, and ask only whether an operator in the corresponding boundary region “works” for every state in this (enlarged) subspace, regardless of whether the bulk operator we’re trying to reconstruct is behind the horizon (i.e., ignoring the localization of states in this subspace). And this is where super-loading the terminology “state-(in)dependence” creates the most confusion. For example, when Geof writes that “boundary reconstructions are state independent if, and only if, the bulk operator is contained in the entanglement wedge” (emphasis added), he is making a general statement that holds only at the level of QEC codes. If the bulk operator lies behind the horizon however, then simply placing the black hole within the entanglement wedge does not alter the fact that a state-independent reconstruction, in the unqualified use of the phrase, does not exist.

Of course, as the authors of [14] point out in this work, there is a close relationship between state-dependence in QEC and PR’s use of the term. Indeed, one of the closing thoughts of my paper [8] was the idea that modular theory may provide an ontological basis for the epistemic utility of QEC in AdS/CFT. Hence I share the authors’ view that it would be very interesting to make the relation between QEC and (various forms of) state-dependence more precise.

I should add that in Geof’s work [11], he seems to skirt some of the interior/exterior objections above by identifying (part of) the black hole interior with the entanglement wedge of some auxiliary Hilbert space that acts as a reservoir for the Hawking radiation. Here I can only confess some skepticism as to various aspects of his construction (or rather, the legitimacy of his interpretation). In particular, the reservoir is artificially taken to lie outside the CFT, which would normally contain a complete representation of exterior states, including the radiation. Consequently, the question of whether it has a sensible bulk dual at all is not entirely clear, much less a geometric interpretation as the “entanglement wedge” behind the horizon, whose boundary is the origin rather than asymptotic infinity.

A related paper [15] by Almheiri, Engelhardt, Marolf, and Maxfield appeared on the arXiv simultaneously with Geof’s work. While these authors are not concerned with state-dependence per se, they do provide a more concrete account of the effects on the entanglement wedge in the context of a precise model for an evaporating black hole in AdS/CFT. The analogous confusion I have in this case is precisely how the Hawking radiation gets transferred to the left CFT, though this may eventually come down to language as well. In any case, this paper is more clearly written, and worth a read (happily, Henry Maxfield will speak about it during one of our group’s virtual seminars in August, so perhaps I’ll obtain greater enlightenment about both works then).

Having said all that, I believe all these works are helpful in strengthening our understanding, and exemplify the productive confluence of quantum information theory, holography, and black holes. A greater exchange of ideas from various perspectives can only lead to further progress, and I would like to see more work in all these directions.

I would like to thank Beni Yoshida, Geof Penington, and Henry Maxfield for patiently fielding my persistent questions about their work, and beg their pardon for the gross simplifications herein. I also thank the YITP in Kyoto for their hospitality during the Quantum Information and String Theory 2019 / It from Qubit workshop, where most of this post was written amidst a great deal of stimulating discussion.


  1. K. Papadodimas and S. Raju, “Remarks on the necessity and implications of state-dependence in the black hole interior,” arXiv:1503.08825
  2. K. Papadodimas and S. Raju, “Local Operators in the Eternal Black Hole,” arXiv:1502.06692
  3. K. Papadodimas and S. Raju, “State-Dependent Bulk-Boundary Maps and Black Hole Complementarity,” arXiv:1310.6335
  4. K. Papadodimas and S. Raju, “Black Hole Interior in the Holographic Correspondence and the Information Paradox,” arXiv:1310.6334
  5. K. Papadodimas and S. Raju, “An Infalling Observer in AdS/CFT,” arXiv:1211.6767
  6. D. Harlow, “Aspects of the Papadodimas-Raju Proposal for the Black Hole Interior,” arXiv:1405.1995
  7. D. Marolf and J. Polchinski, “Violations of the Born rule in cool state-dependent horizons,” arXiv:1506.01337
  8. R. Jefferson, “Comments on black hole interiors and modular inclusions,” arXiv:1811.08900
  9. K. Papadodimas, “A class of non-equilibrium states and the black hole interior,” arXiv:1708.06328
  10. B. Yoshida, “Firewalls vs. Scrambling,” arXiv:1902.09763
  11. G. Penington, “Entanglement Wedge Reconstruction and the Information Paradox,” arXiv:1905.08255
  12. P. Hayden and J. Preskill, “Black holes as mirrors: Quantum information in random subsystems,” arXiv:0708.4025
  13. R. Cleve, D. Gottesman, and H.-K. Lo, “How to share a quantum secret,” arXiv:quant-ph/9901025
  14. P. Hayden and G. Penington, “Learning the Alpha-bits of Black Holes,” arXiv:1807.06041
  15. A. Almheiri, N. Engelhardt, D. Marolf, and H. Maxfield, “The entropy of bulk quantum fields and the entanglement wedge of an evaporating black hole,” arXiv:1905.08762

Variational autoencoders

As part of one of my current research projects, I’ve been looking into variational autoencoders (VAEs) for the purpose of identifying and analyzing attractor solutions within higher-dimensional phase spaces. Of course, I couldn’t resist diving into the deeper mathematical theory underlying these generative models, beyond what was strictly necessary in order to implement one. As in the case of the restricted Boltzmann machines I’ve discussed before, there are fascinating relationships between physics, information theory, and machine learning at play here, in particular the intimate connection between (free) energy minimization and Bayesian inference. Insofar as I actually needed to learn how to build one of these networks however, I’ll start by introducing VAEs from a somewhat more implementation-oriented mindset, and discuss the deeper physics/information-theoretic aspects afterwards.

Mathematical formulation

An autoencoder is a type of neural network (NN) consisting of two feedforward networks: an encoder, which maps an input {X} onto a latent space {Z}, and a decoder, which maps the latent representation {Z} to the output {X'}. The idea is that {\mathrm{dim}(Z)<\mathrm{dim}(X)=\mathrm{dim}(X')}, so that information in the original data is compressed into a lower-dimensional “feature space”. For this reason, autoencoders are often used for dimensionality reduction, though their applicability to real-world problems seems rather limited. Training consists of minimizing the difference between {X} and {X'} according to some suitable loss function. They are a form of unsupervised (or rather, self-supervised) learning, in which the NN seeks to learn a highly compressed, discrete representation of the input.

VAEs inherit the network structure of autoencoders, but are fundamentally rather different in that they learn the parameters of a probability distribution that represents the data. This makes them much more powerful than their simpler precursors insofar as they are generative models (that is, they can generate new examples of the input type). Additionally, their statistical nature — in particular, learning a continuous probability distribution — makes them vastly superior in yielding meaningful results from new/test data that gets mapped to novel regions of the latent space. In a nutshell, the encoding {Z} is generated stochastically, using variational techniques—and we’ll have more to say on what precisely this means below.

Mathematically, a VAE is a latent-variable model {p_\theta(x,z)} with latent variables {z\in Z} and observed variables (i.e., data) {x\in X}, where {\theta} represents the parameters of the distribution. (For example, Gaussian distributions are uniquely characterized by their mean {\mu} and standard deviation {\sigma}, in which case {\theta\in\{\mu,\sigma\}}; more generally, {\theta} would parametrize the masses and couplings of whatever model we wish to construct. Note that we shall typically suppress the subscript {\theta} where doing so does not lead to ambiguity). This joint distribution can be written

\displaystyle p(x,z)=p(x|z)p(z)~. \ \ \ \ \ (1)

The first factor on the right-hand side is the decoder, i.e., the likelihood {p(x|z)} of observing {x} given {z}; this provides the map from {Z\rightarrow X'\simeq X}. This will typically be either a multivariate Gaussian or Bernoulli distribution, implemented by an RBM with as-yet unlearned weights and biases. The second factor is the prior distribution of latent variables {p(z)}, which will be related to observations {x} via the likelihood function (i.e., the decoder). This can be thought of as a statement about the variable {z} with the data {x} held fixed. In order to be computationally tractable, we want to make the simplest possible choice for this distribution; accordingly, one typically chooses a multivariate Gaussian,

\displaystyle p(z)=\mathcal{N}(0,1)~. \ \ \ \ \ (2)

In the context of Bayesian inference, this is technically what’s known as an informative prior, since it assumes that any other parameters in the model are sufficiently small that Gaussian sampling from {Z} does not miss any strongly relevant features. This is in contrast to the somewhat misleadingly named uninformative prior, which endeavors to place no subjective constraints on the variable; for this reason, the latter class are sometimes called objective priors, insofar as they represent the minimally biased choice. In any case, the reason such a simple choice (2) suffices for {p(z)} is that any distribution can be generated by applying a sufficiently complicated function to the normal distribution.
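
To unpack that last claim with a standard example (my own illustration, not from the original text): pushing a standard normal through its own CDF gives a uniform variable, which the inverse CDF of any target distribution then maps onto that target. Here the target is an exponential:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)                  # z ~ N(0,1)

# Push z through the normal CDF Phi to get u ~ Uniform(0,1)...
u = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])
u = np.clip(u, 1e-12, 1.0 - 1e-12)                # guard against rounding
# ...then through the inverse CDF of the target Exp(1) distribution:
x = -np.log1p(-u)

assert abs(x.mean() - 1.0) < 0.02                 # Exp(1) has mean 1...
assert abs(x.var() - 1.0) < 0.05                  # ...and variance 1
```

In a VAE, the “sufficiently complicated function” is of course the learned decoder, rather than an inverse CDF written down by hand.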

Meanwhile, the encoder is represented by the posterior probability {p(z|x)}, i.e., the probability of {z} given {x}; this provides the map from {X\rightarrow Z}. In principle, this is given by Bayes’ rule:

\displaystyle p(z|x)=\frac{p(x|z)p(z)}{p(x)}~, \ \ \ \ \ (3)

but this is virtually impossible to compute analytically, since the denominator amounts to evaluating the partition function over all possible configurations of latent variables, i.e.,

\displaystyle p(x)=\int\!\mathrm{d}z\,p(x|z)p(z)~. \ \ \ \ \ (4)

One solution is to compute {p(x)} approximately via Monte Carlo sampling; but the impression I’ve gained from my admittedly superficial foray into the literature is that such models are computationally expensive, noisy, difficult to train, and generally inferior to the more elegant solution offered by VAEs. The key idea is that for most {z}, {p(x|z)\approx0}, so instead of sampling over all possible {z}, we construct a new distribution {q(z|x)} representing the values of {z} which are most likely to have produced {x}, and sample over this new, smaller set of {z} values [2]. In other words, we seek a more tractable approximation {q_\phi(z|x)\approx p_\theta(z|x)}, characterized by some other, variational parameters {\phi}—so-called because we will eventually vary these parameters in order to ensure that {q} is as close to {p} as possible. As usual, the discrepancy between these distributions is quantified by the familiar Kullback-Leibler (KL) divergence:

\displaystyle D_z\left(q(z|x)\,||\,p(z|x)\right)=\sum_z q(z|x)\ln\frac{q(z|x)}{p(z|x)}~, \ \ \ \ \ (5)

where the subscript on the left-hand side denotes the variable over which we marginalize.
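
The point about sampling can be made concrete in a one-dimensional linear-Gaussian toy model (the parameters are my own arbitrary choices), where the marginal (4) is known in closed form: naive Monte Carlo over the prior is noisy precisely because {p(x|z)\approx0} for most {z}, while sampling from {q(z|x)} is dramatically better.

```python
import numpy as np

# Toy model: p(z) = N(0,1), p(x|z) = N(x; z, sigma^2), so the marginal
# p(x) = N(x; 0, 1 + sigma^2) is known exactly and we can test estimators.
sigma, x = 0.1, 2.0

def gauss(v, mu, s):
    return np.exp(-(v - mu) ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)

exact = gauss(x, 0.0, np.sqrt(1 + sigma ** 2))

rng = np.random.default_rng(1)
n = 400_000

# Naive Monte Carlo over the prior: most samples contribute almost nothing.
z = rng.standard_normal(n)
naive = gauss(x, z, sigma).mean()

# Importance sampling from q(z|x): here the exact Gaussian posterior
# N(x/(1+sigma^2), sigma^2/(1+sigma^2)) of this linear-Gaussian model.
mu_q = x / (1 + sigma ** 2)
s_q = np.sqrt(sigma ** 2 / (1 + sigma ** 2))
zq = mu_q + s_q * rng.standard_normal(n)
weights = gauss(x, zq, sigma) * gauss(zq, 0.0, 1.0) / gauss(zq, mu_q, s_q)
smart = weights.mean()

assert abs(naive - exact) / exact < 0.1       # noisy, but converging
assert abs(smart - exact) / exact < 1e-6      # weights are (here) constant
```

Because this {q} is the exact posterior, every importance weight equals {p(x)} identically and the variance vanishes; in a real VAE {q_\phi(z|x)} is only approximate, but the moral is the same.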

This divergence plays a central role in the variational inference procedure we’re trying to implement, and underlies the connection to the information-theoretic relations alluded to above. Observe that Bayes’ rule enables us to rewrite this expression as

\displaystyle D_z\left(q(z|x)\,||\,p(z|x)\right)= \langle \ln q(z|x)-\ln p(z)\rangle_q -\langle\ln p(x|z)\rangle_q +\ln p(x) \ \ \ \ \ (6)

where {\langle\ldots\rangle_q} denotes the expectation value with respect to {q(z|x)}, and we have used the fact that {\sum\nolimits_z q(z|x) \ln p(x)=\ln p(x)} (since probabilities are normalized to 1, and {p(x)} has no dependence on the latent variables {z}). Now observe that the first term on the right-hand side can be written as another KL divergence. Rearranging, we therefore have

\displaystyle \ln p(x)-D_z\left(q(z|x)\,||\,p(z|x)\right)=-F_q(x) \ \ \ \ \ (7)

where we have identified the (negative) variational free energy

\displaystyle -F_q(x)=\langle\ln p(x|z)\rangle_q-D_z\left(q(z|x)\,||\,p(z)\right)~. \ \ \ \ \ (8)

As the name suggests, this is closely related to the Helmholtz free energy from thermodynamics and statistical field theory; we’ll discuss this connection in more detail below, and in doing so provide a more intuitive definition: the form (8) is well-suited to the implementation-oriented interpretation we’re about to provide, but is a few manipulations removed from the underlying physical meaning.

The expressions (7) and (8) constitute the central equations of VAEs (and variational Bayesian methods more generally), and admit a particularly simple interpretation. First, observe that the left-hand side of (7) is the log-likelihood, minus an “error term” due to our use of an approximate distribution {q(z|x)}. Thus, it’s the left-hand side of (7) that we want our learning procedure to maximize. Here, the intuition underlying maximum likelihood estimation (MLE) is that we seek to maximize the probability of each {x\!\in\!X} under the generative process provided by the decoder {p(x|z)}. As we will see, the optimization process pulls {q(z|x)} towards {p(z|x)} via the KL term; ideally, this vanishes, whereupon we’re directly optimizing the log-likelihood {\ln p(x)}.

The variational free energy (8) consists of two terms: a reconstruction error given by the expectation value of {\ln p(x|z)} with respect to {q(z|x)}, and a so-called regulator given by the KL divergence. The reconstruction error arises from encoding {X} into {Z} using our approximate distribution {q(z|x)}, whereupon the log-likelihood of the original data given these inferred latent variables will be slightly off. The KL divergence, meanwhile, simply encourages the approximate posterior distribution {q(z|x)} to be close to {p(z)}, so that the encoding matches the latent distribution. Note that since the KL divergence is positive-definite, (7) implies that the negative variational free energy gives a lower-bound on the log-likelihood. For this reason, {-F_q(x)} is sometimes referred to as the Evidence Lower BOund (ELBO) by machine learners.
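
Both the identity (7)–(8) and the lower-bound property are easy to verify numerically in a small discrete model (random toy numbers of my own):

```python
import numpy as np

# Check of eqs. (7)-(8): for any q(z|x),
#   ln p(x) - D(q||p(z|x)) = <ln p(x|z)>_q - D(q||p(z)) = -F_q(x).
rng = np.random.default_rng(2)
joint = rng.random((3, 4))         # unnormalized p(x,z): 3 x-values, 4 z-values
joint /= joint.sum()

x = 0                              # condition on one observation
px = joint[x].sum()                # p(x)
pz = joint.sum(axis=0)             # prior p(z)
post = joint[x] / px               # true posterior p(z|x)
lik = joint[x] / pz                # likelihood p(x|z)

q = rng.random(4)
q /= q.sum()                       # an arbitrary approximate posterior q(z|x)

kl = lambda a, b: float(np.sum(a * np.log(a / b)))
lhs = np.log(px) - kl(q, post)                        # left-hand side of (7)
elbo = float(np.sum(q * np.log(lik))) - kl(q, pz)     # -F_q(x), eq. (8)

assert np.isclose(lhs, elbo)
assert elbo <= np.log(px)          # the ELBO bounds ln p(x) from below
```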

The appearance of the (variational) free energy (8) is not a mere mathematical coincidence, but stems from deeper physical aspects of inference learning in general. I’ll digress upon this below, as promised, but we’ve a bit more work to do first in order to be able to actually implement a VAE in code.

Computing the gradient of the cost function

Operationally, training a VAE consists of performing stochastic gradient descent (SGD) on (8) in order to minimize the variational free energy (equivalently, maximize the ELBO). In other words, this will provide the cost or loss function (9) for the model. Note that since {\ln p(x)} is constant with respect to {q(z|x)}, (7) implies that minimizing the variational energy indeed forces the approximate posterior towards the true posterior, as mentioned above.

In applying SGD to the cost function (8), we actually have two sets of parameters over which to optimize: the parameters {\theta} that define the VAE as a generative model {p_\theta(x,z)}, and the variational parameters {\phi} that define the approximate posterior {q_\phi(z|x)}. Accordingly, we shall write the cost function as

\displaystyle \mathcal{C}_{\theta,\phi}(X)=\sum_{x\in X}F_q(x) =-\sum_{x\in X}\left[\langle\ln p_\theta(x|z)\rangle_q-D_z\left(q_\phi(z|x)\,||\,p(z)\right) \right]~, \ \ \ \ \ (9)

where, to avoid a preponderance of subscripts, we shall continue to denote {F_q\equiv F_{q_\phi(z|x)}}, and similarly {\langle\ldots\rangle_q=\langle\ldots\rangle_{q_\phi(z|x)}}. Taking the gradient with respect to {\theta} is easy, since only the first term on the right-hand side has any dependence thereon. Hence, for a given datapoint {x\in X},

\displaystyle \nabla_\theta\mathcal{C}_{\theta,\phi}(x) =-\langle\nabla_\theta\ln p_\theta(x|z)\rangle_q \approx-\nabla_\theta\ln p_\theta(x|z)~, \ \ \ \ \ (10)

where in the second step we have replaced the expectation value with a single sample drawn from the latent space {Z}. This is a common method in SGD, in which we take this particular value of {z} to be a reasonable approximation for the average {\langle\ldots\rangle_q}. (Yet-more connections to mean field theory (MFT) we must of temporal necessity forgo; see Mehta et al. [1] for some discussion in this context, or Doersch [2] for further intuition). The resulting gradient can then be computed via backpropagation through the NN.

The gradient with respect to {\phi}, on the other hand, is slightly problematic, since the variational parameters also appear in the distribution with respect to which we compute expectation values. And the sampling trick we just employed means that in the implementation of this layer of the NN, the evaluation of the expectation value is a discrete operation: it has no gradient, and hence we can’t backpropagate through it. Fortunately, there’s a clever method called the reparametrization trick that circumvents this stumbling block. The basic idea is to change variables so that {\phi} no longer appears in the distribution with respect to which we compute expectation values. To do so, we express the latent variable {z} (which is ostensibly drawn from {q_\phi(z|x)}) as a differentiable and invertible transformation of some other, independent random variable {\epsilon}, i.e., {z=g(\epsilon; \phi, x)} (where here “independent” means that the distribution of {\epsilon} does not depend on either {x} or {\phi}; typically, one simply takes {\epsilon\sim\mathcal{N}(0,1)}). We can then replace {\langle\ldots\rangle_{q_\phi}\rightarrow\langle\ldots\rangle_{p_\epsilon}}, whereupon we can move the gradient inside the expectation value as before, i.e.,

\displaystyle -\nabla_\phi\langle\ln p_\theta(x|z)\rangle_{q_\phi} =-\langle\nabla_\phi\ln p_\theta(x|z)\rangle_{p_\epsilon}~. \ \ \ \ \ (11)

Note that in principle, this results in an additional term due to the Jacobian of the transformation. Explicitly, this equivalence between expectation values may be written

\displaystyle \begin{aligned} \langle f(z)\rangle_{q_\phi}&=\int\!\mathrm{d}z\,q_\phi(z|x)f(z) =\int\!\mathrm{d}\epsilon\left|\frac{\partial z}{\partial\epsilon}\right|\,q_\phi(z(\epsilon)|x)\,f(z(\epsilon))\\ &\equiv\int\!\mathrm{d}\epsilon \,p(\epsilon)\,f(z(\epsilon)) =\langle f(z)\rangle_{p_\epsilon} \end{aligned} \ \ \ \ \ (12)

where the Jacobian has been absorbed into the definition of {p(\epsilon)}:

\displaystyle p(\epsilon)\equiv J_\phi(x)\,q_\phi(z|x)~, \quad\quad J_\phi(x)\equiv\left|\frac{\partial z}{\partial\epsilon}\right|~. \ \ \ \ \ (13)

Consequently, the Jacobian would contribute to the second term of the KL divergence via

\displaystyle \ln q_\phi(z|x)=\ln p(\epsilon)-\ln J_\phi(x)~. \ \ \ \ \ (14)
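
For the Gaussian choice this is easy to make explicit (a worked special case, added here for illustration): with {z=g(\epsilon;\phi,x)=\mu_\phi(x)+\sigma_\phi(x)\,\epsilon} and {\epsilon\sim\mathcal{N}(0,1)}, the latent variable is distributed as {q_\phi(z|x)=\mathcal{N}(\mu_\phi,\sigma_\phi^2)}, and the Jacobian is simply {J_\phi(x)=\left|\partial z/\partial\epsilon\right|=\sigma_\phi(x)}. Then

\displaystyle \ln q_\phi(z|x)=-\frac{(z-\mu_\phi)^2}{2\sigma_\phi^2}-\ln\sigma_\phi-\frac{1}{2}\ln 2\pi =-\frac{\epsilon^2}{2}-\frac{1}{2}\ln 2\pi-\ln\sigma_\phi =\ln p(\epsilon)-\ln J_\phi(x)~,

in precise agreement with (14).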

Operationally however, the reparametrization trick simply amounts to performing the requisite sampling on an additional input layer for {\epsilon} instead of on {Z}; this is nicely illustrated in both fig. 74 of Mehta et al. [1] and fig. 4 of Doersch [2]. In practice, this means that the analytical tractability of the Jacobian is a non-issue, since the change of variables is performed downstream of the KL divergence layer—see the implementation details below. The upshot is that while the above may seem complicated, it makes the calculation of the gradient tractable via standard backpropagation.
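
In one dimension the whole trick fits in a few lines. Taking {f(z)=z^2} (an arbitrary test function of my own choosing), we know {\langle f\rangle=\mu^2+\sigma^2} exactly, so the reparametrized Monte Carlo gradients can be checked against the closed-form answers:

```python
import numpy as np

# Reparametrization in 1d: z = mu + sigma * eps with eps ~ N(0,1), so the
# gradient of <f(z)> w.r.t. (mu, sigma) moves inside the expectation over eps.
mu, sigma = 1.5, 0.7
rng = np.random.default_rng(3)
eps = rng.standard_normal(500_000)

z = mu + sigma * eps                    # differentiable in mu and sigma
grad_mu = (2 * z).mean()                # <f'(z) dz/dmu>    = <2z * 1>
grad_sigma = (2 * z * eps).mean()       # <f'(z) dz/dsigma> = <2z * eps>

assert abs(grad_mu - 2 * mu) < 0.05         # d(mu^2+sigma^2)/dmu    = 2 mu
assert abs(grad_sigma - 2 * sigma) < 0.05   # d(mu^2+sigma^2)/dsigma = 2 sigma
```

Had we instead tried to differentiate the sampling step {z\sim q_\phi} directly, there would be nothing to backpropagate through; this is exactly the problem the extra {\epsilon} input layer solves.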


Having fleshed out the mathematical framework underlying VAEs, how do we actually build one? Let’s summarize the necessary ingredients, layer-by-layer along the flow from observation space to latent space and back (that is, {X\rightarrow Z\rightarrow X'\!\simeq\!X}), with the Keras API in mind:

  • We need an input layer, representing the data {X}.
  • We connect this input layer to an encoder, {q_\phi(z|x)}, that maps data into the latent space {Z}. This will be a NN with an arbitrary number of layers, which outputs the parameters {\phi} of the distribution (e.g., the mean and standard deviation, {\phi\in\{\mu,\sigma\}} if {q_\phi} is Gaussian).
  • We need a special KL-divergence layer, to compute the second term in the cost function (8) and add this to the model’s loss function (e.g., the Keras loss). This takes as inputs the parameters {\phi} produced by the encoder, and our Gaussian ansatz (2) for the prior {p(z)}.
  • We need another input layer for the independent distribution {\epsilon}. This will be merged with the parameters {\phi} output by the encoder, and in this way automatically integrated into the model’s loss function.
  • Finally, we feed this merged layer into a decoder, {p_\theta(x|z)}, that maps the latent space back to {X}. This is generally another NN with as many layers as the encoder, which relies on the learned parameters {\theta} of the generative model.
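
Putting the pieces together, here is what a single (untrained) forward pass through this stack computes, sketched in bare numpy rather than actual Keras code; the layer sizes, random weights, and Gaussian-encoder/Bernoulli-decoder choices are placeholder assumptions:

```python
import numpy as np

# One untrained forward pass through the VAE pipeline described above.
rng = np.random.default_rng(4)
dim_x, dim_h, dim_z = 8, 16, 2
x = rng.integers(0, 2, dim_x).astype(float)        # one binary datapoint

# Encoder q_phi(z|x): one hidden layer, outputting (mu, log sigma^2).
W1 = rng.standard_normal((dim_h, dim_x))
W2 = rng.standard_normal((2 * dim_z, dim_h))
h = np.tanh(W1 @ x)
mu, log_var = np.split(W2 @ h, 2)

# KL-divergence layer: D(N(mu, sigma^2) || N(0,1)) in closed form.
kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# Reparametrization: eps arrives on its own input layer and is merged here.
eps = rng.standard_normal(dim_z)
z = mu + np.exp(0.5 * log_var) * eps

# Decoder p_theta(x|z): Bernoulli probabilities for each component of x.
W3 = rng.standard_normal((dim_x, dim_z))
p = 1.0 / (1.0 + np.exp(-(W3 @ z)))
recon = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))   # ln p_theta(x|z)

loss = -(recon - kl)     # minus the ELBO: the variational free energy (8)
assert kl >= 0.0 and loss > 0.0
```

Training then amounts to backpropagating this loss through {\theta} (the decoder weights) and {\phi} (the encoder weights); in Keras, the KL term is typically attached to the model via an add_loss call on the corresponding layer.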

At this stage of the aforementioned research project, it’s far too early to tell whether such a VAE will ultimately be useful for accomplishing our goal. If so, I’ll update this post with suitable links to paper(s), etc. But regardless, the variational inference procedure underlying VAEs is interesting in its own right, and I’d like to close by discussing some of the physical connections to which I alluded above in greater detail.

Deeper connections

The following was largely inspired by the exposition in Mehta et al. [1], though we have endeavored to modify the notation for clarity/consistency. In particular, be warned that what these authors call the “free energy” is actually a dimensionless free energy, which introduces an extra factor of {\beta} (cf. eq. (158) therein); we shall instead stick to standard conventions, in which the mass dimension is {[F]=[E]=[\beta^{-1}]=1}. Of course, we’re eventually going to set {\beta=1} anyway, but it’s good to set things straight.

Consider a system of interacting degrees of freedom {s\in\{x,z\}}, with parameters {\theta} (e.g., {\theta\in\{\mu,\sigma\}} for Gaussians, or the couplings {J_{ij}} between spins {s_i} in the Ising model). We may assign an energy {E(s;\theta)=E(x,z;\theta)} to each configuration, such that the probability {p(s;\theta)=p_\theta(x,z)} of finding the system in a given state at temperature {\beta^{-1}} is

\displaystyle p_\theta(x,z)=\frac{1}{Z[\theta]}e^{-\beta E(x,z;\theta)}~, \ \ \ \ \ (15)

where the partition function with respect to this ensemble is

\displaystyle Z[\theta]=\sum_se^{-\beta E(s;\theta)}~, \ \ \ \ \ (16)

where the sum runs over both {x} and {z}. As the notation suggests, we have in mind that {p_\theta(x,z)} will serve as our latent-variable model, in which {x,z} respectively take on the meanings of visible and latent degrees of freedom as above. Upon marginalizing over the latter, we recover the partition function (4) for {\mathrm{dim}(Z)} finite:

\displaystyle p_\theta(x)=\sum_z\,p_\theta(x,z)=\frac{1}{Z[\theta]}\sum_z e^{-\beta E(x,z;\theta)} \equiv\frac{1}{Z[\theta]}e^{-\beta E(x;\theta)}~, \ \ \ \ \ (17)

where in the last step, we have defined the marginalized energy function {E(x;\theta)} that encodes all interactions with the latent variables; cf. eq. (15) of our post on RBMs.
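As a quick numerical check of (15) and (17), one can tabulate a toy energy function on a small discrete state space and verify that marginalizing the Boltzmann weights over {z} reproduces {Z[\theta]^{-1}e^{-\beta E(x;\theta)}}. A minimal sketch (all sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 1.0
E = rng.normal(size=(5, 4))              # toy energies E(x,z): 5 visible by 4 latent states
Z = np.exp(-beta * E).sum()              # partition function, summed over (x,z)
p_joint = np.exp(-beta * E) / Z          # p(x,z), eq. (15)
p_x = p_joint.sum(axis=1)                # marginal p(x), eq. (17)

# marginalized energy E(x;theta) = -beta^{-1} ln sum_z e^{-beta E(x,z)}
E_x = -np.log(np.exp(-beta * E).sum(axis=1)) / beta
assert np.allclose(p_x, np.exp(-beta * E_x) / Z)
```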

The above implies that the posterior probability {p(z|x)} of finding a particular value of {z\in Z}, given the observed value {x\in X} (i.e., the encoder), can be written as

\displaystyle p_\theta(z|x) =\frac{p_\theta(x,z)}{p_\theta(x)} =e^{-\beta E(x,z;\theta)+\beta E(x;\theta)} \equiv e^{-\beta E(z|x;\theta)}~, \ \ \ \ \ (18)

where

\displaystyle E(z|x;\theta) \equiv E(x,z;\theta)-E(x;\theta) \ \ \ \ \ (19)

is the Hamiltonian that describes the interactions between {x} and {z}, in which the {z}-independent contributions have been subtracted off; cf. the difference between eq. (12) and (15) here. To elucidate the variational inference procedure, however, it will be convenient to re-express the conditional distribution as

\displaystyle p_\theta(z|x)=\frac{1}{Z_p}e^{-\beta E_p}~, \ \ \ \ \ (20)

where we have defined {Z_p} and {E_p} such that

\displaystyle p_\theta(x)=Z_p~, \qquad\mathrm{and}\qquad p_\theta(x,z)=e^{-\beta E_p}~, \ \ \ \ \ (21)

where the subscript {p=p_\theta(z|x)} will henceforth be used to refer to the posterior distribution, as opposed to either the joint {p(x,z)} or prior {p(x)} (this is to facilitate a more compact notation below). Here, {Z_p=p_\theta(x)} is precisely the partition function we encountered in (4), and is independent of the latent variable {z}. Statistically, this simply reflects the fact that in (20), we weight the joint probabilities {p(x,z)} by how likely the condition {x} is to occur. Meanwhile, one must be careful not to confuse {E_p} with {E(z|x;\theta)} above. Rather, comparing (21) with (15), we see that {E_p} represents a sort of renormalized energy, in which the partition function {Z[\theta]} has been absorbed.

Now, in thermodynamics, the Helmholtz free energy is defined as the difference between the energy and the entropy (with a factor of {\beta^{-1}} multiplying the entropy to fix dimensions) at constant temperature and volume, i.e., the work obtainable from the system. More fundamentally, it is the (negative) log of the partition function of the canonical ensemble. Hence for the encoder (18), we write

\displaystyle F_p[\theta]=-\beta^{-1}\ln Z_p[\theta]=\langle E_p\rangle_p-\beta^{-1} S_p~, \ \ \ \ \ (22)

where {\langle\ldots\rangle_p} is the expectation value with respect to {p_\theta(z|x)} and marginalization over {z} (think of these as internal degrees of freedom), and {S_p} is the corresponding entropy,

\displaystyle S_p=-\sum_zp_\theta(z|x)\ln p_\theta(z|x) =-\langle\ln p_\theta(z|x)\rangle_p~. \ \ \ \ \ (23)

Note that given the canonical form (20), the equivalence of these expressions for {F_p} — that is, the second equality in (22) — follows immediately from the definition of entropy:

\displaystyle S_p=\sum_z p_\theta(z|x)\left[\beta E_p+\ln Z_p\right] =\beta\langle E_p\rangle_p+\ln Z_p~, \ \ \ \ \ (24)

where, since {Z_p} has no explicit dependence on the latent variables, {\langle\ln Z_p\rangle_p=\langle1\rangle_p\ln Z_p=\ln Z_p}. As usual, this partition function is generally impossible to calculate. To circumvent this, we employ the strategy introduced above, namely, we approximate the true distribution {p_\theta(z|x)} by a so-called variational distribution {q(z|x;\phi)=q_\phi(z|x)}, where {\phi} are the variational (e.g., coupling) parameters that define our ansatz. The idea is of course that {q} should be computationally tractable while still capturing the essential features. As alluded to above, this is the reason these autoencoders are called “variational”: we’re eventually going to vary the parameters {\phi} in order to make {q} as close to {p} as possible.

To quantify this procedure, we define the variational free energy (not to be confused with the Helmholtz free energy (22)):

\displaystyle F_q[\theta,\phi]=\langle E_p\rangle_q-\beta^{-1} S_q~, \ \ \ \ \ (25)

where {\langle E_p\rangle_q} is the expectation value of the energy corresponding to the distribution {p_\theta(z|x)} with respect to {q_\phi(z|x)}. While the variational free energy {F_q} has the same form as the thermodynamic definition of the Helmholtz free energy {F_p}, it still seems odd at first glance, since it no longer enjoys the statistical connection to a canonical partition function. To gain some intuition for this quantity, suppose we express our variational distribution in the canonical form, i.e.,

\displaystyle q_\phi(z|x)=\frac{1}{Z_q}e^{-\beta E_q}~, \quad\quad Z_q[\phi]=\sum_ze^{-\beta E_q(x,z;\phi)}~, \ \ \ \ \ (26)

where we have denoted the energy of configurations in this ensemble by {E_q}, to avoid confusion with {E_p}, cf. (18). Then {F_q} may be written

\displaystyle \begin{aligned} F_q[\theta,\phi]&=\sum_z q_\phi(z|x)E_p-\beta^{-1}\sum_z q_\phi(z|x)\left[\beta E_q+\ln Z_q\right]\\ &=\langle E_p(\theta)-E_q(\phi)\rangle_q-\beta^{-1}\ln Z_q[\phi]~. \end{aligned} \ \ \ \ \ (27)

Thus we see that the variational energy is indeed formally akin to the Helmholtz energy, except that it encodes the difference in energy between the true and approximate configurations. We can rephrase this in information-theoretic language by expressing these energies in terms of their associated ensembles; that is, we write {E_p=-\beta^{-1}\left(\ln p+\ln Z_p\right)}, and similarly for {q}, whereupon we have

\displaystyle F_q[\theta,\phi]=\beta^{-1}\sum_z q_\phi(z|x)\ln\frac{q_\phi(z|x)}{p_\theta(z|x)}-\beta^{-1}\ln Z_p[\theta]~, \ \ \ \ \ (28)

where the {\ln Z_q} terms have canceled. Recognizing (5) and (22) on the right-hand side, we therefore find that the difference between the variational and Helmholtz free energies is none other than the KL divergence,

\displaystyle F_q[\theta,\phi]-F_p[\theta]=\beta^{-1}D_z\left(q_\phi(z|x)\,||\,p_\theta(z|x)\right)\geq0~, \ \ \ \ \ (29)

which is precisely (7)! (It is perhaps worth stressing that this follows directly from the definition (25), independently of whether {q(z|x)} takes canonical form).

As stated above, our goal in training the VAE is to make the variational distribution {q} as close to {p} as possible, i.e., minimizing the KL divergence between them. We now see that physically, this corresponds to a variational problem in which we seek to minimize {F_q} with respect to {\phi}. In the limit where we perfectly succeed in doing so, {F_q} has attained its global minimum {F_p}, whereupon the two distributions are identical.
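Since minimizing {F_q} over {\phi} is equivalent to minimizing the KL divergence, the procedure can be illustrated with plain gradient descent on the logits of a discrete variational distribution. A minimal numpy sketch (the six-state latent space, target distribution, learning rate, and iteration count are all arbitrary illustrative choices):

```python
import numpy as np

p = np.array([0.30, 0.05, 0.20, 0.10, 0.25, 0.10])  # "true" posterior p(z|x), 6 latent states
phi = np.zeros(6)                                    # variational parameters (logits)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

for _ in range(5000):                 # gradient descent on KL(q||p)
    q = softmax(phi)
    kl = np.sum(q * np.log(q / p))
    grad = q * (np.log(q / p) - kl)   # analytic gradient of the KL w.r.t. the logits
    phi -= 0.5 * grad

q = softmax(phi)
kl = float(np.sum(q * np.log(q / p)))  # should now be very close to zero, i.e. q ~ p
```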

Finally, it remains to clarify our implementation-based definition of {F_q} given in (8) (where {\beta=1}). Applying Bayes’ rule, we have

\displaystyle \begin{aligned} F_q&=-\langle\ln p(x|z)\rangle_q+D_z\left(q(z|x)\,||\,p(z)\right) =-\left<\ln\frac{p(z|x)p(x)}{p(z)}\right>_q+\langle\ln q(z|x)-\ln p(z)\rangle_q\\ &=-\langle\ln p(z|x)p(x)\rangle_q+\langle\ln q(z|x)\rangle_q =-\langle\ln p(x,z)\rangle_q-S_q~, \end{aligned} \ \ \ \ \ (30)

which is another definition of {F_q} sometimes found in the literature, e.g., as eq. (172) of Mehta et al. [1]. By expressing {p(x,z)} in terms of {E_p} via (21), we see that this is precisely equivalent to our more thermodynamical definition (25). Alternatively, we could have regrouped the posteriors to yield

\displaystyle F_q=\langle\ln q(z|x)-\ln p(z|x)\rangle_q-\langle\ln p(x)\rangle_q =D\left(q(z|x)\,||\,p(z|x)\right)+F_p~, \ \ \ \ \ (31)

where the identification of {F_p} follows from (21) and (22). Of course, this is just (28) again, which is a nice check on internal consistency.
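This internal consistency is easy to verify explicitly: on a small tabulated joint distribution (with {\beta=1}), the three expressions for {F_q} — the implementation-based form, the thermodynamic form, and the KL-plus-{F_p} form — all agree numerically. A toy sketch, with arbitrary state-space sizes and an arbitrary {q}:

```python
import numpy as np

rng = np.random.default_rng(2)
P = rng.random((3, 4)); P /= P.sum()   # joint p(x,z): 3 visible, 4 latent states
x0 = 1                                  # condition on one observed value of x
px = P.sum(axis=1)                      # p(x)
pz = P.sum(axis=0)                      # prior p(z)
pz_x = P[x0] / px[x0]                   # posterior p(z|x0)
px_z = P[x0] / pz                       # likelihood p(x0|z)
q = rng.random(4); q /= q.sum()         # arbitrary variational q(z|x0)

kl = lambda a, b: np.sum(a * np.log(a / b))
F1 = -np.sum(q * np.log(px_z)) + kl(q, pz)              # reconstruction + KL form
F2 = -np.sum(q * np.log(P[x0])) + np.sum(q * np.log(q))  # -<ln p(x,z)>_q - S_q
F3 = kl(q, pz_x) - np.log(px[x0])                        # D(q||p(z|x)) + F_p
assert np.allclose([F1, F2], F3)
```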


  1. The review by Mehta et al., A high-bias, low-variance introduction to Machine Learning for physicists is absolutely perfect for those with a physics background, and the accompanying Jupyter notebook on VAEs in Keras for the MNIST dataset was especially helpful for the implementation bits above. The latter is a more streamlined version of this blog post by Louis Tiao.
  2. Doersch has written a Tutorial on Variational Autoencoders, which I found helpful for gaining some further intuition for the mapping between theory and practice.

Black hole entropy: the heat kernel method

As part of my ongoing love affair with black holes, I’ve been digging more deeply into what it means for them to have entropy, which of course necessitates investigating how this is assigned in the first place. This is a notoriously confusing issue — indeed, one which lies at the very heart of the firewall paradox — which is further complicated by the fact that there are a priori three distinct physical entropies at play: thermodynamic, entanglement, and gravitational. (Incidentally, lest my previous post on entropy cause confusion, let me stress that said post dealt only with the relation between thermodynamic and information-theoretic a.k.a. Shannon entropy, at a purely classical level: neither entanglement nor gravity played any role there. I also didn’t include the Shannon entropy in the list above, because — as explained in the aforementioned post — this isn’t an objective/physical entropy in the sense of the other three; more on this below.)

My research led me to a review on black hole entanglement entropy by Solodukhin [1], which is primarily concerned with the use of the conical singularity method (read: replica trick) to isolate the divergences that arise whenever one attempts to compute entanglement entropy in quantum field theory. The structure of these divergences turns out to provide physical insight into the nature of this entropy, and sheds some light on the relation to thermodynamic/gravitational entropy as well, so these sorts of calculations are well worth understanding in detail.

While I’ve written about the replica trick at a rather abstract level before, for present purposes we must be substantially more concrete. To that end, the main technical objective of this post is to elucidate one of the central tools employed by these computations, known as the heat kernel method. This is a rather powerful method with applications scattered throughout theoretical physics, notably the calculation of 1-loop divergences and the study of anomalies. Our exposition will mostly follow the excellent pedagogical review by Vassilevich [2]. Before diving into the details however, let’s first review the replica trick à la [1], in order to see how the heat kernel arises in the present context.

Consider some quantum field {\psi(X)} in {d}-dimensional Euclidean spacetime, where {X^\mu=\{\tau,x,z^i\}} with {i=1,\ldots,d\!-\!2}. The Euclidean time {\tau} is related to Minkowski time {t=-i\tau} via the usual Wick rotation, and we have singled out one of the transverse coordinates {x} for reasons which will shortly become apparent. For simplicity, let us consider the wavefunction for the vacuum state, which we prepare by performing the path integral over the lower ({\tau\leq0}) half of Euclidean spacetime with the boundary condition {\psi(\tau=0,x,z)=\psi_0(x,z)}:

\displaystyle \Psi[\psi_0(x)]=\int_{\psi(0,x,z)=\psi_0(x,z)}\!\mathcal{D}\psi\,e^{-W[\psi]}~, \ \ \ \ \ (1)

where we have used {W} to denote the Euclidean action for the matter field, so as to reserve {S} for the entropy.

Now, since we’re interested in computing the entanglement entropy of a subregion, let us divide the {\tau=0} surface into two halves, {x\!<\!0} and {x\!>\!0}, by defining a codimension-2 surface {\Sigma} by the condition {x=0}, {\tau=0}. Correspondingly, let us denote the boundary data

\displaystyle \psi_0(x,z)\equiv \begin{cases} \psi_-(x,z) \;\; & x<0\\ \psi_+(x,z) \;\; & x>0~. \end{cases} \ \ \ \ \ (2)

The reduced density matrix that describes the {x\!>\!0} subregion of the vacuum state is then obtained by tracing over the complementary set of boundary fields {\psi_-}. In the Euclidean path integral, this corresponds to integrating out {\psi_-} over the entire spacetime, but with a cut from negative infinity to {\Sigma} along the {\tau\!=\!0} surface (i.e., along {x\!<\!0}). We must therefore impose boundary conditions for the remaining field {\psi_+} as this cut is approached from above ({\psi_+^1}) and below ({\psi_+^2}). Hence:

\displaystyle \rho(\psi_+^1,\psi_+^2)=\int\!\mathcal{D}\psi_-\Psi[\psi_+^1,\psi_-]\Psi[\psi_+^2,\psi_-]~. \ \ \ \ \ (3)

Formally, this object simply says that the transition elements {\langle\psi_+^2|\rho|\psi_+^1\rangle} are computed by performing the path integral with the specified boundary conditions along the cut.

Unfortunately, explicitly computing the von Neumann entropy {S=-\mathrm{tr}\rho\ln\rho} is an impossible task for all but the very simplest systems. Enter the replica trick. The basic idea is to consider the {n}-fold cover of the above geometry, introduce a conical deficit at the boundary of the cut {\Sigma}, and then differentiate with respect to the deficit angle, whereupon the von Neumann entropy is recovered in the appropriate limit. To see this in detail, it is convenient to represent the {(\tau,x)} subspace in polar coordinates {(r,\phi)}, where {\tau=r\sin\phi} and {x=r\cos\phi}, such that the cut corresponds to {\phi=2\pi k} for {k=1,\ldots,n}. In constructing the {n}-sheeted cover, we glue sheets along the cut such that the fields are smoothly continued from {\psi_+^{1,2}\big|_k} to {\psi_+^{1,2}\big|_{k+1}}. The resulting space is technically a cone, denoted {C_n}, with angular deficit {2\pi(1-n)} at {\Sigma}, on which the partition function for the fields is given by

\displaystyle Z[C_n]=\mathrm{tr}\rho^n~, \ \ \ \ \ (4)

where {\rho^n} is the {n^\mathrm{th}} power of the density matrix (3). At this point in our previous treatment of the replica trick, we introduced the {n^\mathrm{th}} Rényi entropy

\displaystyle S_n=\frac{1}{1-n}\ln\mathrm{tr}\rho^n~, \ \ \ \ \ (5)

which one can think of as the entropy carried by the {n} copies of the chosen subregion, and showed that the von Neumann entropy is recovered in the limit {n\rightarrow1}. Equivalently, we may express the von Neumann entropy directly as

\displaystyle S(\rho)=-(n\partial_n-1)\ln\mathrm{tr}\rho^n\big|_{n=1}~. \ \ \ \ \ (6)


since

\displaystyle -(n\partial_n-1)\ln\mathrm{tr}\,\rho^n\big|_{n=1} =-\left[\frac{n}{\mathrm{tr}\rho^n}\,\mathrm{tr}\!\left(\rho^n\ln\rho\right)-\ln\mathrm{tr}\rho^n\right]\bigg|_{n=1} =-\mathrm{tr}\left(\rho\ln\rho\right)~, \ \ \ \ \ (7)

where in the last step we took the path integral to be appropriately normalized such that {\mathrm{tr}\rho=1}, whereupon the second term vanishes. Et voilà! As I understand it, the aforementioned “conical singularity method” is essentially an abstraction of the replica trick to spacetimes with conical singularities. Hence, re-purposing our notation, consider the effective action {W[\alpha]=-\ln Z[C_\alpha]} for fields on a Euclidean spacetime with a conical singularity at {\Sigma}. The cone {C_\alpha} is defined, in polar coordinates, by making {\phi} periodic with period {2\pi\alpha}, and taking the limit in which the deficit {(1-\alpha)\ll1}. The entanglement entropy for fields on this background is then

\displaystyle S=(\alpha\partial_\alpha-1)W[\alpha]\big|_{\alpha=1}~. \ \ \ \ \ (8)
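Both (6) and (8) hinge on differentiating at replica value 1. As a quick numerical sanity check of (6), one can diagonalize a random density matrix, evaluate {\ln\mathrm{tr}\,\rho^n} as a function of {n}, and compare the finite-difference derivative at {n=1} against the von Neumann entropy (a toy sketch, not tied to any particular physical system):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
rho = A @ A.conj().T
rho /= np.trace(rho).real               # random density matrix: positive, unit trace
lam = np.linalg.eigvalsh(rho)           # eigenvalues of rho
S_vN = -np.sum(lam * np.log(lam))       # von Neumann entropy -tr(rho ln rho)

f = lambda n: np.log(np.sum(lam ** n))  # ln tr rho^n
h = 1e-5
# S = -(n d/dn - 1) ln tr rho^n at n=1; note f(1) = ln tr rho = 0
S_replica = -((f(1 + h) - f(1 - h)) / (2 * h) - f(1))
assert abs(S_replica - S_vN) < 1e-6
```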

As a technical aside: in both these cases, there is of course the subtlety of analytically continuing the parameter {n} resp. {\alpha} to non-integer values. We’ve discussed this issue before in the context of holography, where one surmounts this by instead performing the continuation in the bulk. We shall not digress upon this here, except to note that the construction relies on an abelian (rotational) symmetry {\phi\rightarrow\phi+w}, where {w} is an arbitrary constant. This is actually an important constraint to bear in mind when attempting to infer general physical lessons from our results, but we’ll address this caveat later. Suffice to say that given this assumption, the analytical continuation can be uniquely performed without obstruction; see in particular section 2.7 of [1] for details.

We have thus obtained, in (8), an expression for the entanglement entropy in terms of the effective action in the presence of a conical singularity. And while this is all well and pretty, in order for this expression to be of any practical use, we require a means of explicitly computing {W}. It is at this point that the heat kernel enters the game. The idea is to represent the Green function — or rather, the (connected) two-point correlator — as an integral over an auxiliary “proper time” {s} of a kernel satisfying the heat equation. This enables one to express the effective action as

\displaystyle W=-\frac{1}{2}\int_0^\infty\!\frac{\mathrm{d} s}{s}K(s,D)~, \ \ \ \ \ (9)

where {K(s,D)} is the (trace of the) heat kernel for the Laplacian operator {D}. We’ll now spend some time unpacking this statement, which first requires that we review some basic facts about Green functions, propagators, and all that.

Consider a linear differential operator {L_x=L(x)} acting on distributions with support on {\mathbb{R}^n}. If {L_x} admits a right inverse, then the latter defines the Green function {G(x,x')} as the solution to the inhomogeneous differential equation

\displaystyle L_xG(x,x')=\delta^n(x-x')~, \ \ \ \ \ (10)

where {x,\,x'} represent vectors in {\mathbb{R}^n}. We may also define the kernel of {L_x} as the solution to the homogeneous differential equation

\displaystyle L_xK(x,x')=0~. \ \ \ \ \ (11)

The Green function is especially useful for solving linear differential equations of the form

\displaystyle L_xu(x)=f(x)~. \ \ \ \ \ (12)

To see this, simply multiply both sides of (10) by {f(x')} and integrate w.r.t. {x'}; by virtue of the delta function (and the fact that {L_x} is independent of {x'}), one identifies

\displaystyle u(x)=\int\!\mathrm{d} x'G(x,x')f(x')~. \ \ \ \ \ (13)

The particular boundary conditions we impose on {u(x)} then determine the precise form of {G} (e.g., retarded vs. advanced Green functions in QFT).
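As a concrete illustration of (10)–(13), take {L_x=-\mathrm{d}^2/\mathrm{d} x^2} on the interval {[0,1]} with Dirichlet boundary conditions, for which the Green function is {G(x,x')=x_<(1-x_>)}; then (13) reproduces the known solution {u(x)=x(1-x)/2} of {-u''=1}. A quick numerical sketch (grid size arbitrary):

```python
import numpy as np

x = np.linspace(0, 1, 2001)
dx = x[1] - x[0]
w = np.full_like(x, dx); w[0] = w[-1] = dx / 2        # trapezoidal quadrature weights
# Green function G(x,x') = min(x,x') * (1 - max(x,x')) for -d^2/dx^2, Dirichlet BCs
Gmat = np.minimum.outer(x, x) * (1 - np.maximum.outer(x, x))
u = Gmat @ (np.ones_like(x) * w)                      # eq. (13) with source f(x') = 1
exact = x * (1 - x) / 2                               # analytic solution of -u'' = 1
assert np.max(np.abs(u - exact)) < 1e-8
```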

As nicely explained in this Stack Exchange answer, the precise relation to propagators is ambiguous, because physicists use this term to mean either the Green function or the kernel, depending on context. For example, the Feynman propagator {\Delta_F(x-x')} for a scalar field {\phi(x)} is a Green function for the Klein-Gordon operator, since it satisfies the equation

\displaystyle (\square_x+m^2)\Delta_F(x-x')=\delta^n(x-x')~. \ \ \ \ \ (14)

In contrast, the corresponding Wightman functions

\displaystyle G^+(x,x')=\langle0|\phi(x)\phi(x')|0\rangle~,\quad\quad G^-(x,x')=\langle0|\phi(x')\phi(x)|0\rangle~, \ \ \ \ \ (15)

are kernels for this operator, since they satisfy

\displaystyle (\square_x+m^2)G^{\pm}(x,x')=0~. \ \ \ \ \ (16)

The reader is warmly referred to section 2.7 of Birrell & Davies’ classic textbook [3] for a more thorough explanation of Green functions in this context.

In the present case, we shall be concerned with the heat kernel

\displaystyle K(s;x,y;D)=\langle x|e^{-sD}|y\rangle~, \ \ \ \ \ (17)

where {D} is some Laplacian operator, by which we mean that it admits a local expression of the form

\displaystyle D=-\left(g^{\mu\nu}\partial_\mu\partial_\nu+a^\mu\partial_\mu+b\right) =-\left(g^{\mu\nu}\nabla_\mu\nabla_\nu+E\right)~, \ \ \ \ \ (18)

for some matrix-valued functions {a^\mu,\,b}. In the second equality, we’ve written the operator in the so-called canonical form for a Laplacian operator on a vector bundle, where {E} is an endomorphism on the bundle over the manifold, and the covariant derivative {\nabla} includes both the familiar Riemann part as well as the contribution from the gauge (bundle) part; the field strength for the latter will be denoted {\Omega_{\mu\nu}}. Fibre bundles won’t be terribly important for our purposes, but we’ll need some of this notation later; see section 2.1 of [2] for details.

The parameter {s} in (17) is some auxiliary Euclidean time variable (note that if we were to take {s=it}, the right-hand side of (17) would correspond to the transition amplitude {\langle x|U|y\rangle} for some unitary operator {U}). {K} is so-named because it satisfies the heat equation,

\displaystyle (\partial_s+D_x)K(s;x,y;D)=0~, \ \ \ \ \ (19)

with the initial condition

\displaystyle K(0;x,y;D)=\delta(x,y)~, \ \ \ \ \ (20)

where the subscript on {D_x} is meant to emphasize the fact that the operator {D} acts only on the coordinates {x}, not on the auxiliary time {s}. Our earlier claim that the Green function can be expressed in terms of an integral over the latter is then based on the observation that one can invert (17) to obtain the propagator

\displaystyle D^{-1}=\int_0^\infty\!\mathrm{d} s\,K(s;x,y;D)~. \ \ \ \ \ (21)

Note that here, “propagator” indeed refers to the Green function of the field in the path integral representation, insofar as the latter serves as the generating functional for the former. Bear with me a bit longer as we elucidate this last claim, as this will finally bring us full circle to (9).
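The inversion formula (21) itself is easy to check numerically if one replaces the Laplacian by a random symmetric, positive-definite matrix, so that all quantities are finite (a toy sketch; the matrix size, eigenvalue shift, and integration grid are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))
D = A @ A.T + 4 * np.eye(4)        # symmetric, positive-definite stand-in for a Laplacian
lam, V = np.linalg.eigh(D)

def K(s):                          # heat kernel e^{-sD}, via the spectral decomposition
    return (V * np.exp(-s * lam)) @ V.T

# \int_0^\infty ds e^{-sD} = D^{-1}; truncate at s = 20 (tail is exponentially small)
s = np.linspace(0, 20, 40001)
ds = s[1] - s[0]
integral = sum(K(si) for si in s) * ds - (K(s[0]) + K(s[-1])) * ds / 2  # trapezoid rule
assert np.max(np.abs(integral - np.linalg.inv(D))) < 1e-4
```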

Denote the Euclidean path integral for the fields on some fixed background by

\displaystyle Z[J]=\int\!\mathcal{D}\psi\,e^{-W[\psi,J]}~, \ \ \ \ \ (22)

where {J} is the source for the matter field {\psi}. The heat kernel method applies to one-loop calculations, in which case it suffices to expand the action to quadratic order in fluctuations around the classical saddle-point {S_\mathrm{cl}}, whence the Gaussian integral may be written

\displaystyle Z[J]=e^{-S_\mathrm{cl}}\,\mathrm{det}^{-1/2}(D)\,\mathrm{exp}\left(\frac{1}{4}JD^{-1}J\right)~. \ \ \ \ \ (23)

(We’re glossing over some mathematical caveats/assumptions here, notably that {D} be self-adjoint w.r.t. the scalar product of the fields; see [2] for details). Thus we see that taking two functional derivatives w.r.t. the source {J} brings down the operator {D^{-1}}, thereby identifying it with the two-point correlator for {\psi},

\displaystyle G(x,y)=\langle\psi(x)\psi(y)\rangle=\frac{1}{Z[0]}\left.\frac{\delta^2 Z[J]}{\delta J(x)\delta J(y)}\right|_{J=0}=D^{-1}~, \ \ \ \ \ (24)

which is trivially (by virtue of the far r.h.s.) a Green function of {D} in the sense of (10).

We now understand how to express the Green function for the operator {D} in terms of the heat kernel. But we’re after the connected two-point correlator (sometimes called the connected Green function), which encapsulates the one-loop contributions. Recall that the connected Feynman diagrams are generated by the effective action {W} introduced above. After properly normalizing, we have only the piece which depends purely on {D}:

\displaystyle W=\frac{1}{2}\ln\mathrm{det}(D)~. \ \ \ \ \ (25)

Vassilevich [2] then provides a nice heuristic argument that relates this to the heat kernel (as well as a more rigorous treatment from spectral theory, for the less cavalier among you), which relies on the identity

\displaystyle \ln\lambda=-\int_0^\infty\!\frac{\mathrm{d} s}{s}e^{-s\lambda}~, \ \ \ \ \ (26)

for {\lambda>0} (strictly speaking, this identity is only correct up to a constant, but we may normalize away this inconvenience anyway; did I mention I’d be somewhat cavalier?). We then apply this identity to every (positive) eigenvalue {\lambda} of {D}, whence

\displaystyle W=\frac{1}{2}\ln\mathrm{det}(D)=\frac{1}{2}\mathrm{tr}\ln D =-\frac{1}{2}\int\!\frac{\mathrm{d} s}{s}\mathrm{tr}\left(e^{-sD}\right) =-\frac{1}{2}\int\!\frac{\mathrm{d} s}{s}K(s,D)~, \ \ \ \ \ (27)


where

\displaystyle K(s,D)\equiv \mathrm{tr}\left(e^{-sD}\right) =\int\!\mathrm{d}^dx\sqrt{g}\,\langle x|e^{-sD}|x\rangle =\int\!\mathrm{d}^dx\sqrt{g}\,K(s;x,x;D)~. \ \ \ \ \ (28)
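Modulo the constant mentioned below (26), the identity is straightforward to check numerically: the divergent constant cancels between any two eigenvalues, leaving the convergent Frullani-type integral {\ln(\lambda_2/\lambda_1)=\int_0^\infty\!\frac{\mathrm{d} s}{s}\left(e^{-s\lambda_1}-e^{-s\lambda_2}\right)}, which is exactly how differences of {\ln\det} are assembled in (27). A quick sketch with arbitrary eigenvalues:

```python
import numpy as np

lam1, lam2 = 1.0, 3.0
s = np.logspace(-8, 3, 20000)      # log-spaced grid for the proper-time integral
f = (np.exp(-lam1 * s) - np.exp(-lam2 * s)) / s
val = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(s))   # trapezoid rule
assert abs(val - np.log(lam2 / lam1)) < 1e-3
```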

Let us pause to take stock: so far, we’ve merely elucidated eq. (9). And while this expression itself is valid in general (not just for manifolds with conical singularities), the physical motivation for this post was the investigation of divergences in the entanglement entropy (8). And indeed, the expression for the effective action (27) is divergent at both limits! In the course of regulating this behaviour, we shall see that the UV divergences in the entropy (8) are captured by the so-called heat kernel coefficients.

To proceed, we shall need the fact that on manifolds without boundaries (or else with suitable local boundary conditions on the fields), {K(s,D)} — really, the self-adjoint operator {D} — admits an asymptotic expansion of the form

\displaystyle K(s,D)=\mathrm{tr}\left(e^{-sD}\right)\simeq (4\pi s)^{-d/2}\sum_{k\geq0}a_k(D)s^{k}~, \ \ \ \ \ (29)

cf. eq. (67) of [1]. A couple of technical remarks are in order. First, recall that in contrast to a convergent series — which gives finite results for arbitrary, fixed {s} in the limit {k\rightarrow\infty} — an asymptotic series gives finite results for fixed {k} in the limit {s^{-1}\rightarrow\infty}. Second, we are ignoring various subtleties regarding the rigorous definition of the trace, wherein both (29) and (28) are properly defined via the use of an auxiliary function; cf. eq. (2.21) of [2] (n.b., Vassilevich’s coefficients do not include the normalization {(4\pi)^{-d/2}} from the volume integrals; see below).
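For intuition, the leading term of (29) is simple to check for the free Laplacian {D=-\mathrm{d}^2/\mathrm{d} x^2} on a circle of circumference {L} (so {d=1} and {a_0=\int_M 1=L}), where the exact spectrum is {k_n=2\pi n/L} and the corrections to the leading behaviour are exponentially small in {L^2/4s}. A sketch with hypothetical numbers:

```python
import numpy as np

L_circ, s = 10.0, 0.01                    # circumference and (small) proper time
n = np.arange(-3000, 3001)
# K(s,D) = tr e^{-sD} = sum_n e^{-s k_n^2} with k_n = 2 pi n / L
trace = np.sum(np.exp(-s * (2 * np.pi * n / L_circ) ** 2))
leading = L_circ / np.sqrt(4 * np.pi * s)  # (4 pi s)^{-d/2} a_0 with a_0 = L, d = 1
assert abs(trace / leading - 1) < 1e-6
```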

The most important property of the heat kernel coefficients {a_k} is that they can be expressed as integrals of local invariants—tensor quantities which remain invariant under local diffeomorphisms; e.g., the Riemann curvature tensor and covariant derivatives thereof. Thus the first step in the procedure for calculating the heat kernel coefficients is to write down the integral over all such local invariants; for example, the first three coefficients are

\displaystyle \begin{aligned} a_0(D)=\int_M\!\mathrm{d}^dx\sqrt{g}&\,\alpha_0~,\\ a_1(D)=\frac{1}{6}\int_M\!\mathrm{d}^dx\sqrt{g}\,&\left(\alpha_1E+\alpha_2R\right)~,\\ a_2(D)=\frac{1}{360}\int_M\!\mathrm{d}^dx\sqrt{g}\,&\left(\alpha_3\nabla^2E+\alpha_4RE+\alpha_5E^2+\alpha_6\nabla^2R\right.\\ &\left.+\alpha_7R^2+\alpha_8R_{ij}^2+\alpha_9R_{ijkl}^2+\alpha_{10}\Omega_{ij}^2\right)~,\\ \end{aligned} \ \ \ \ \ (30)

where {E} and {\Omega_{ij}} were introduced in (18). A word of warning, for those of you cross-referencing with Solodukhin [1] and Vassilevich [2]: these coefficients correspond to eqs. (4.13) – (4.15) in [2], except that we have already included the normalization {(4\pi)^{-d/2}} in our expansion coefficients (29), consistent with [1]. Additionally, note that Vassilevich’s coefficients are labeled with even integers, while ours/Solodukhin’s include both even and odd—cf. (2.21) in [2]. The reason for this discrepancy is that all odd coefficients in Vassilevich’s original expansion vanish, as a consequence of the fact that there are no odd-dimensional invariants on manifolds without boundary; Solodukhin has simply relabeled the summation index for cleanliness, and we have followed his convention in (30).

It now remains to calculate the constants {\alpha_i}. This is a rather involved technical procedure, but is explained in detail in section 4.1 of [2]. One finds

\displaystyle \begin{aligned} \alpha_1=6~,\;\; \alpha_2=1~,\;\; \alpha_3=60~,\;\; \alpha_4=60~,\;\; \alpha_5=180~,\\ \alpha_6=12~,\;\; \alpha_7=5~,\;\; \alpha_8=-2~,\;\; \alpha_9=2~,\;\; \alpha_{10}=30~. \end{aligned} \ \ \ \ \ (31)

Substituting these into (30), and doing a bit of rearranging, we have

\displaystyle \begin{aligned} a_0(D)&=\int_M1~,\\ a_1(D)&=\int_M\left(E+\tfrac{1}{6}R\right)~,\\ a_2(D)&=\int_M\left[\tfrac{1}{180}R_{ijkl}^2-\tfrac{1}{180}R_{ij}^2+\tfrac{1}{6}\nabla^2\left(E+\tfrac{1}{5}R\right)+\tfrac{1}{2}\left(E+\tfrac{1}{6}R\right)^2\right]~, \end{aligned} \ \ \ \ \ (32)

where we have suppressed the integration measure for compactness, i.e., {\int_M=\int_M\mathrm{d}^dx\sqrt{g}\,}; we have also set the gauge field strength {\Omega_{ij}=0} for simplicity, since we will consider only free scalar fields below. These correspond to what Solodukhin refers to as regular coefficients, cf. his eq. (69). If one is working on a background with conical singularities, then there are additional contributions from the singular surface {\Sigma} [1]:

\displaystyle \begin{aligned} a_0^\Sigma(D)=&0~,\\ a_1^\Sigma(D)=&\frac{\pi}{3}\frac{(1-\alpha)(1+\alpha)}{\alpha}\int_\Sigma1~,\\ a_2^\Sigma(D)=&\frac{\pi}{3}\frac{(1-\alpha)(1+\alpha)}{\alpha}\int_\Sigma\left(E+\tfrac{1}{6}R\right)\\ &-\frac{\pi}{180}\frac{(1-\alpha)(1+\alpha)(1+\alpha^2)}{\alpha^3}\int_\Sigma\left(R_{ii}-2R_{ijij}\right)~, \end{aligned} \ \ \ \ \ (33)

where in the last expression {R_{ii}=R_{\mu\nu}n^\mu_in^\nu_i} and {R_{ijij}=R_{\mu\nu\rho\sigma}n^\mu_in^\nu_jn^\rho_in^\sigma_j}, where {n^k=n^\mu_k\partial_\mu} are orthonormal vectors orthogonal to {\Sigma}. In this case, if the manifold {M} in (32) is the cone {C_\alpha} constructed above, then the Riemannian curvature invariants are actually computed on the regular points {C_\alpha/\Sigma}, and are related to their flat counterparts by eq. (55) of [1]. Of course, here and in (33), {\alpha} refers to the conical deficit {2\pi(1-\alpha)}, not to be confused with the constants {\alpha_i} in (31).

Finally, we are in position to consider some of the physical applications of this method discussed in the introduction to this post. As a warm-up to the entanglement entropy of black holes, let’s first take the simpler case of flat space. Despite the lack of conical singularities in this completely regular, seemingly boring spacetime, the heat kernel method above can still be used to calculate the leading UV divergences in the entanglement entropy. While in this very simple case, there are integral identities that make the expansion into heat kernel coefficients unnecessary (specifically, the Sommerfeld formula employed in section 2.9 of [1]), the conical deficit method is more universal, and will greatly facilitate our treatment of the black hole below.

Consider a free scalar field with {D=-(\nabla^2+X)}, where {X} is some scalar function (e.g., for a massive non-interacting scalar, {X=-m^2}), on some background spacetime {E_\alpha} in {d>2} dimensions, with a conical deficit at the codimension-2 surface {\Sigma}. The leading UV divergence (we don’t care about the regular piece) in the entanglement entropy across this {(d\!-\!2)}-dimensional surface may be calculated directly from the coefficient {a_1^\Sigma} above. To this order in the expansion, the relevant part of {W} is

\displaystyle \begin{aligned} W[\alpha]&\simeq-\frac{1}{2}\int_{\epsilon^2}^\infty\!\frac{\mathrm{d} s}{s}(4\pi s)^{-d/2}\left(a_0^\Sigma+a_1^\Sigma s\right) =-\frac{\pi}{6}\frac{(1-\alpha)(1+\alpha)}{\alpha}\int_{\epsilon^2}^{\infty}\!\frac{\mathrm{d} s}{(4\pi s)^{d/2}}\int_\Sigma1\\ &=\frac{-1}{12(d\!-\!2)(4\pi)^{(d-2)/2}}\frac{A(\Sigma)}{\epsilon^{d-2}}\frac{(1-\alpha)(1+\alpha)}{\alpha}~, \end{aligned} \ \ \ \ \ (34)

where we have introduced the UV-cutoff {\epsilon} (which appears as {\epsilon^2} in the lower limit of integration, to make the dimensions of the auxiliary time variable work out; this is perhaps clearest by examining eq. (1.12) of [2]) and the area of the surface {A(\Sigma)=\int_\Sigma\,}. Substituting this expression for the effective action into eq. (8) for the entropy, we obtain

\displaystyle S_\mathrm{flat}=\frac{1}{6(d\!-\!2)(4\pi)^{(d-2)/2}}\frac{A(\Sigma)}{\epsilon^{d-2}}~, \ \ \ \ \ (35)

which is eq. (81) in [1]. As explained therein, the reason this matches the flat space result — that is, the case in which {\Sigma} is a flat plane — is because even in curved spacetime, any surface can be locally approximated by flat Minkowski space. In particular, we’ll see that this result remains the leading-order correction to the black hole entropy, because the near-horizon region is approximately Rindler (flat). In other words, this result is exact for flat space (hence the equality), but provides only the leading-order divergence for more general, curved geometries.

For concreteness, we’ll limit ourselves to four dimensions henceforth, in which the flat space result above is

\displaystyle S_\mathrm{flat}=\frac{A(\Sigma)}{48\pi\epsilon^2}~. \ \ \ \ \ (36)

In the presence of a black hole, there will be higher-order corrections to this expression. In particular, in {d\!=\!4} we also have a log divergence from the {a_2^\Sigma s^2} term:

\displaystyle \begin{aligned} W[\alpha]&=-\frac{1}{2}\int_{\epsilon^2}^\infty\!\frac{\mathrm{d} s}{s}(4\pi s)^{-d/2}\left(a_0^\Sigma+a_1^\Sigma s+a_2^\Sigma s^2+\ldots\right)\\ &=-\frac{1}{32\pi^2}\int_{\epsilon^2}^\infty\!\mathrm{d} s\left(\frac{a_1^\Sigma}{s^2}+\frac{a_2^\Sigma}{s}+O(s^0)\right) \simeq-\frac{1}{32\pi^2}\left(\frac{a_1^\Sigma}{\epsilon^2}-2a_2^\Sigma\ln\epsilon\right)~, \end{aligned} \ \ \ \ \ (37)

where in the last step, we’ve dropped higher-order terms as well as the log IR divergence, since here we’re only interested in the UV part. From (33), we then see that the {a_2^\Sigma} term results in an expression for the UV divergent part of the black hole entanglement entropy of the form

\displaystyle S_\mathrm{ent}=\frac{A(\Sigma)}{48\pi\epsilon^2} -\frac{1}{144\pi}\int_\Sigma\left[6E+R-\frac{1}{5}\left(R_{ii}-2R_{ijij}\right)\right]\ln\epsilon~, \ \ \ \ \ (38)

cf. eq. (82) in [1]. Specifying to a particular black hole solution then requires working out the projections of the Ricci and Riemann tensors onto the subspace orthogonal to {\Sigma} ({R_{ii}} and {R_{ijij}}, respectively). For the simplest case of a massless, minimally coupled scalar field ({X=0}) on a Schwarzschild black hole background, the above yields

\displaystyle S_\mathrm{ent}=\frac{A(\Sigma)}{48\pi\epsilon^2}+\frac{1}{45}\ln\frac{r_+}{\epsilon}~, \ \ \ \ \ (39)

where {r_+} is the horizon radius; see section 3.9.1 of [1] for more details. Note that since the Ricci scalar {R=0}, the logarithmic term represents a purely topological correction to the flat space entropy (36) (in contrast to flat space, the Euler number for a black hole geometry is non-zero). Curvature corrections can still show up in UV-finite terms, of course, but that’s not what we’re seeing here: in this sense the log term is universal.
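Two quick, illustrative checks on this stretch of the calculation: the small-{\epsilon} expansion in the last step of (37) (with {\Lambda} a hypothetical IR cutoff standing in for the upper limit of integration), and a rough estimate of the relative size of the two terms for a solar-mass black hole with the cutoff at the Planck length (ballpark numbers only, taking the scale of the logarithm to be set by the horizon-to-cutoff ratio):

```python
import math
import sympy as sp

# --- the small-epsilon expansion in eq. (37) ---
s, eps, Lam, a1, a2 = sp.symbols('s epsilon Lambda a_1 a_2', positive=True)
I = sp.integrate(a1/s**2 + a2/s, (s, eps**2, Lam))  # Lam: hypothetical IR cutoff
UV = a1/eps**2 - 2*a2*sp.log(eps)   # the UV piece kept in eq. (37)
IR = -a1/Lam + a2*sp.log(Lam)       # IR terms dropped in the text
assert sp.simplify(sp.expand_log(I - UV - IR)) == 0

# --- rough relative size of the area and log terms ---
r_plus = 3.0e3     # horizon radius in meters, roughly a solar-mass black hole
cutoff = 1.6e-35   # UV cutoff at the Planck length, in meters
area_term = 4*math.pi*r_plus**2/(48*math.pi*cutoff**2)  # ~ 1e75: utterly dominant
log_term = math.log(r_plus/cutoff)/45                   # ~ 2
```

The area term dwarfs the logarithmic correction by some 75 orders of magnitude; the interest in the log term is its universality, not its size.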

Note that we’ve specifically labeled the entropy in (39) with the subscript “ent” to denote that this is the entanglement entropy due to quantum fields on the classical background. We now come to the confusion alluded to in the opening paragraph of this post, namely: what is the relation, if any, between the entanglement entropy of the black hole and the thermodynamic or gravitational entropies?

Recall that in classical systems, the thermodynamic entropy and the information-theoretic entropy coincide: not merely formally, but ontologically as well. The reason is that the correct probability mass function will be as broadly distributed as possible subject to the physical constraints on the system (equivalently, in the case of statistical inference, whatever partial information we have available). If only the average energy is fixed, then this corresponds to the Boltzmann distribution. Note that this same logic extends to the entanglement entropy as well (modulo certain skeptical reservations to which I alluded before), which is the underlying physical reason why the Shannon and quantum-mechanical von Neumann entropies take the same form. In simple quantum systems, therefore, the entanglement entropy of the fields coincides with this thermodynamic (equivalently, information-theoretic/Shannon) entropy.
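The claim that the Boltzmann distribution is the broadest distribution compatible with a fixed average energy is easy to verify numerically: perturbing the Gibbs state along the (for three levels, unique up to scale) direction that preserves both normalization and the mean energy can only decrease the Shannon entropy. The levels and temperature below are arbitrary toy choices:

```python
import math

E = [0.0, 1.0, 2.0]   # toy energy levels (arbitrary choice)
beta = 1.0
w = [math.exp(-beta*e) for e in E]
Z = sum(w)
p = [x/Z for x in w]  # Boltzmann distribution at inverse temperature beta

def shannon(q):
    return -sum(x*math.log(x) for x in q if x > 0)

# direction preserving normalization (sum v = 0) and mean energy (v . E = 0)
v = [1.0, -2.0, 1.0]
H0 = shannon(p)
mean0 = sum(pi*e for pi, e in zip(p, E))
for t in (-0.05, -0.01, 0.01, 0.05):
    q = [pi + t*vi for pi, vi in zip(p, v)]
    assert min(q) > 0  # still a valid probability distribution
    assert abs(sum(qi*e for qi, e in zip(q, E)) - mean0) < 1e-12
    assert shannon(q) < H0  # entropy strictly decreases away from the Gibbs state
```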

More generally however, “thermodynamic entropy” is a statement about the internal microstates of the system, and is obtained from the free energy {F=-\beta^{-1}\ln Z[\beta,g]} by differentiating w.r.t. the temperature {\beta^{-1}}:

\displaystyle S_\mathrm{thermo}=-\frac{\mathrm{d} F}{\mathrm{d} T}=\beta^2\frac{\mathrm{d} F}{\mathrm{d} \beta} =\left(\beta\frac{\mathrm{d}}{\mathrm{d}\beta}-1\right)W_\mathrm{tot}[\beta,g]~, \ \ \ \ \ (40)

where the total effective action {W_\mathrm{tot}[\beta,g]=-\ln Z[\beta,g]}. Crucially, observe that here, in contrast to our partition function (22) for fields on a fixed background, the total Euclidean path integral that prepares the state also includes an integration over the metrics:

\displaystyle Z[\beta,g]=\int\!\mathcal{D} g_{\mu\nu}\,\mathcal{D}\psi\,e^{-W_\mathrm{gr}[g]-W_\mathrm{mat}[\psi,g]}~, \ \ \ \ \ (41)

where {W_\mathrm{gr}} represents the gravitational part of the action (e.g., the Einstein-Hilbert term), and {W_\mathrm{mat}} the contribution from the matter fields. (For reference, we’re following the reasoning and notation in section 4.1 of [1] here).

In the information-theoretic context above, introducing a black hole amounts to imposing additional conditions on (i.e., more information about) the system. Specifically, it imposes constraints on the class of metrics in the Euclidean path integral: the existence of a fixed point {\Sigma} of the isometry generated by the Killing vector {\partial_\tau} in the highly symmetric case of Schwarzschild, and suitable asymptotic behaviour at large radius. Hence in computing this path integral, one first performs the integration over matter fields {\psi} on backgrounds with a conical singularity at {\Sigma}:

\displaystyle \int\!\mathcal{D}\psi\,e^{-W_\mathrm{mat}[\psi,g]}=e^{-W[\beta,g]}~, \ \ \ \ \ (42)

where on the r.h.s., {W[\beta,g]} represents the effective action for the fields described above; note that the contribution from entanglement entropy is entirely encoded in this portion. The path integral (41) now looks like this:

\displaystyle Z[\beta,g]=\int\!\mathcal{D} g\,e^{-W_\mathrm{gr}[g]-W[\beta,g]} \simeq e^{-W_\mathrm{tot}[\beta,\,g(\beta)]}~, \ \ \ \ \ (43)

where {W_\mathrm{tot}} on the far right is the semiclassical effective action obtained from the saddle-point approximation. That is, the metric {g_{\mu\nu}(\beta)} is the solution to

\displaystyle \frac{\delta W_\mathrm{tot}[\beta,g]}{\delta g}=0~, \qquad\mathrm{with}\qquad W_\mathrm{tot}=W_\mathrm{gr}+W \ \ \ \ \ (44)

at fixed {\beta}. Since the saddle-point returns an on-shell action, {g_{\mu\nu}(\beta)} is a regular metric (i.e., without conical singularities). One can think of this as the equilibrium geometry at fixed {\beta} around which the metric {g} fluctuates due to the quantum corrections represented by the second term {W}. Note that the latter represents an off-shell contribution, since it is computed on singular backgrounds which do not satisfy the equations of motion for the metric (44).

To compute the thermodynamic entropy of the black hole, we now plug this total effective action into (40). A priori, this expression involves a total derivative w.r.t. {\beta}, so that we write

\displaystyle S_\mathrm{thermo}= \beta\left(\partial_\beta W_\mathrm{tot}[\beta,g]+\frac{\delta g_{\mu\nu}(\beta)}{\delta\beta}\frac{\delta W_\mathrm{tot}[\beta,g]}{\delta g_{\mu\nu}(\beta)}\right) -W_\mathrm{tot}[\beta,g]~, \ \ \ \ \ (45)

except that due to the equilibrium condition (44), the second term in parentheses vanishes anyway, and the thermodynamic entropy is given by

\displaystyle S_\mathrm{thermo}=\left(\beta\partial_\beta-1\right)W_\mathrm{tot}[\beta,g] =\left(\beta\partial_\beta-1\right)\left(W_\mathrm{gr}+W\right) =S_\mathrm{gr}+S_\mathrm{ent}~. \ \ \ \ \ (46)

This, then, is the precise relationship between the thermodynamic, gravitational, and entanglement entropies (at least for the Schwarzschild black hole). The thermodynamic entropy {S_\mathrm{thermo}} is a statement about the possible internal microstates at equilibrium — meaning, states which satisfy the quantum-corrected Einstein equations (44) — which therefore includes the entropy from possible (regular) metric configurations {S_\mathrm{gr}} as well as the quantum corrections {S_\mathrm{ent}} given by (39).
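The step from (45) to (46) is just the envelope theorem: on the saddle, the implicit {\beta}-dependence through {g_{\mu\nu}(\beta)} drops out. A one-variable toy model (the quadratic "action" below is entirely hypothetical) makes this concrete:

```python
import sympy as sp

beta, g = sp.symbols('beta g', positive=True)

# hypothetical toy effective action with a single metric-like variable g
W_tot = beta*g**2/2 - g + sp.log(beta)

# saddle point at fixed beta, cf. eq. (44): dW/dg = 0  =>  g*(beta) = 1/beta
g_star = sp.solve(sp.diff(W_tot, g), g)[0]

# entropy via the total beta-derivative of the on-shell action, cf. eq. (45)...
W_on = W_tot.subs(g, g_star)
S_total = beta*sp.diff(W_on, beta) - W_on

# ...equals the partial derivative at fixed g, evaluated on the saddle, cf. (46)
S_partial = (beta*sp.diff(W_tot, beta) - W_tot).subs(g, g_star)
assert sp.simplify(S_total - S_partial) == 0
```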

Note that the famous Bekenstein-Hawking entropy {S_\mathrm{BH}} refers only to the gravitational entropy, i.e., {S_\mathrm{BH}=S_\mathrm{gr}}. This is sometimes referred to as the “classical” part, because it represents the tree-level contribution to the path integral result. That is, if we were to restore Planck’s constant, we’d find that {S_\mathrm{gr}} comes with a {1/\hbar} prefactor, while {S_\mathrm{ent}} is order {\hbar^0}. Accordingly, {S_\mathrm{ent}} is often called the first/one-loop quantum correction to the Bekenstein-Hawking entropy. (Confusion warning: the heat kernel method allowed us to compute {S_\mathrm{ent}} itself to one-loop in the expansion of the matter action, i.e., quadratic order in the source {J}, but the entire contribution {S_\mathrm{ent}} appears at one-loop order in the {\hbar}-expansion.)

However, despite the long-standing effort to elucidate this entropy, it’s still woefully unclear what microscopic (read: quantum-gravitational) degrees of freedom are truly being counted. The path integral over metrics makes it tempting to interpret the gravitational entropy as accounting for all possible geometrical configurations that satisfy the prescribed boundary conditions. And indeed, as we have remarked before, this is how one would interpret the Bekenstein entropy in the absence of Hawking’s famous calculation. But insofar as entropy is a physical quantity, the interpretation of {S_\mathrm{gr}} as owing to the equivalence class of geometries is rather unsatisfactory, since in this case the entropy must be associated to a single black hole, whose geometry certainly does not appear to be in any sort of quantum superposition. The situation is even less clear once one takes normalization into account, whereby — in most situations at least — the gravitational and matter couplings are harmoniously renormalized in such a way that one cannot help but question the fundamental distinction between the two; see [1] for an overview of renormalization in this context.

Furthermore, this entire calculational method relies crucially on the presence of an abelian isometry w.r.t. the Killing vector {\partial_\tau} in Euclidean time, i.e., the rotational symmetry {\phi\rightarrow\phi+w} mentioned above. But the presence of such a Killing isometry is by no means a physical requisite for a given system/region to have a meaningful entropy; things are simply (fiendishly!) more difficult to calculate in systems without such high degrees of symmetry. However, this means that cases in which all three of these entropies can be so cleanly assigned to the black hole horizon may be quite rare. Evaporating black holes are perhaps the most topical example of a non-static geometry that causes headaches. As a specific example in this vein, Netta Engelhardt and collaborators have compellingly argued that it is the apparent horizon, rather than the event horizon, to which one can meaningfully associate a coarse-grained entropy à la Shannon [4,5]. Thus, while we’re making exciting progress, even the partial interpretation for which we’ve labored above should be taken with caution. There is more work to be done!


  1. S. N. Solodukhin, “Entanglement entropy of black holes,” arXiv:1104.3712.
  2. D. V. Vassilevich, “Heat kernel expansion: User’s manual,” arXiv:hep-th/0306138.
  3. N. D. Birrell and P. C. W. Davies, “Quantum Fields in Curved Space,” Cambridge Monographs on Mathematical Physics.
  4. N. Engelhardt, “Entropy of pure state black holes,” Talk given at the Quantum Gravity and Quantum Information workshop at CERN, March 2019.
  5. N. Engelhardt and A. C. Wall, “Decoding the Apparent Horizon: Coarse-Grained Holographic Entropy,” arXiv:1706.02038.