Inverse Bayesian Estimation of Gravitational Mass Density in Galaxies from Missing Kinematic Data

In this paper we focus on a type of inverse problem in which the data is expressed as an unknown function of the sought and unknown model function (or its discretised representation as a model parameter vector). In particular, we deal with situations in which training data is not available. Then we cannot model the unknown functional relationship between data and the unknown model function (or parameter vector) with a Gaussian Process of appropriate dimensionality. A Bayesian method based on state space modelling is advanced instead. Within this framework, the likelihood is expressed in terms of the probability density function ($pdf$) of the state space variable and the sought model parameter vector is embedded within the domain of this $pdf$. As the measurable vector lives only inside an identified sub-volume of the system state space, the $pdf$ of the state space variable is projected onto the space of the measurables, and it is in terms of the projected state space density that the likelihood is written; the final form of the likelihood is achieved after convolution with the distribution of measurement errors. Application motivated vague priors are invoked and the posterior probability density of the model parameter vectors, given the data is computed. Inference is performed by taking posterior samples with adaptive MCMC. The method is illustrated on synthetic as well as real galactic data.


Introduction
The method of science calls for the understanding of selected aspects of behaviour of a considered system, given available measurements and other relevant information. The measurements may be of the variable W (W ∈ W ⊆ R m ) while the parameters that define the selected system behaviour may be ρ, (ρ ∈ B ⊆ R d ) or the selected system behaviour can itself be an unknown and sought function ρ(·) of the known input variable vector X (X ∈ X ⊆ R d ), so that ρ : X −→ R ⊆ R. In either case, we relate the measurements with the model of the system behaviour as in the equation W = ξ(ρ) or W = ξ(ρ(X)) where the function ξ(·) is unknown. Alternatively, in either case the scientist aims to solve an inverse problem in which the operator ξ −1 , when operated upon the data, yields the unknown(s).
One problem that then immediately arises is the learning of the unknown function ξ(·). Indeed ξ(·) is often unknown though such is not the norm-for example in applications in which the data is generated by a known projection of the model function onto the space W of the measurables, ξ(·) is identified as this known projection. Thus, image inversion is an example of an inverse problem in which the data is a known function of the unknown model function or model parameter vector (Bertero and Boccacci, 1998;Bishop et al., 2007;Jugnon and Demanet, 2013;Kutchment, 2006;Qui, 2008, among others). On the other hand, there can arise a plethora of other situations in science in which a functional relationship between the measurable W and unknown ρ(X) (or ρ) is appreciated but the exact form of this functional relationship is not known (Bennett and McIntosh, 1982;Draper and Mendes, 2008;Gouveia and Scales, 1998;Parker, 1994;Stuart, 2010Stuart, , 2013Tarantola, 2005, to cite a few).
This situation allows for a (personal) classification of inverse problems such that • in inverse problems of Type I, ξ(·) is known where W = ξ(ρ) or W = ξ(ρ(X)), • in inverse problems of Type II, ξ(·) is unknown.
While inverse problems of Type I can be rendered difficult owing to these being ill-posed and/or ill-conditioned as well as in the quantification of the uncertainties in the estimation of the unknown(s), inverse problems of Type II appear to be entirely intractable in the current formulation of W = ξ(ρ) (or W = ξ(ρ(X))), where the aim is the learning of the unknown ρ (or ρ(X)), given the data. In fact, conventionally, this very general scientific problem would not even be treated as an inverse problem but rather as a modelling exercise specific to the relevant scientific discipline. From the point of view of inverse problems, these entail another layer of learning, namely, the learning of ξ(·) from the data-to be precise, from training data (Caers, 2001;Strebelle, 2002;Way et al., 2012). Here by training data we mean data that comprises values of W at chosen values of ρ(x) (or at chosen ρ). These chosen (and therefore known) values of ρ(x) (or ρ) are referred to as the design points, so that values of W generated for the whole design set comprise the training data.Having trained the model for ξ(·) using such training data, we then implement this learnt model on the available measurements-or test data-to learn that value of ρ (or ρ(x)) at which the measurements are realised. It is in principle possible to generate a training data set from surveys (as in selected social science applications) or generate synthetic training data sets using simulation models of the system (Henrion et al., 2013;Krasnopolsky, Fox-Rabinovitz and Chalikov, 2004;Liu et al., 2004). However, often the Physics of the situation is such that ξ(·) is rendered characteristic of the system at hand (as in complex physical and biological systems). Consequently, a simulation model of the considered system is only an approximation of the true underlying Physics and therefore risky in general; after all, the basic motivation behind the learning of the unknown ρ(x) (or ρ) is to learn the underlying system Physics, and pivoting such learning on a simulation model that is of unquantifiable crudeness, may not be useful.
Thus, in such cases, we need to develop an alternative way of learning ξ(·) or if possible, learn the unknown ρ(x) (or ρ) given the available measurements without needing to know ξ(·). It may appear that such is possible in the Bayesian approach in which we only need to write the posterior probability density of the unknown ρ(x) (or ρ), given the data. An added advantage of using the Bayesian framework is that extra information is brought into the model via the priors, thus reducing the quantity of data required to achieve inference of a given quality. Importantly, in this approach one can readily achieve estimation of uncertainties in the relevant parameters, as distinguished from point estimates of the same. In this paper we present the Bayesian learning of the unknown model parameters given the measurements but no training data, as no training data set is available. The presented methodology is inspired by state space modelling techniques and is elucidated using an application to astronomical data.
The advantages of the Bayesian framework notwithstanding, in systems in which training data is unavailable, fact remains that ξ(·) cannot be learnt. This implies that if learning of the unknown ρ(x) (or ρ) is attempted by modelling ξ(·) as a realisation from a stochastic process (such as a Gaussian Process (GP) or Ito Process or t-process, etc.), then the correlation structure that underlies this process is not known. However, in this learning approach, the posterior probability of the unknowns given the data invokes such a correlation structure. Only by using training data can we learn the covariance of the process that ξ(·) is sampled from, leading to our formulation of the posterior of the unknowns, given the measured data as well as the training data. To take the example of modelling ξ(·) using a high-dimensional GP, it might be possible of course to impose the form of the covariance by hand; for example, when it is safe to assume that ξ(·) is continuous, we could choose a stationary covariance function (Rasmussen and Williams, 2006), such as the popular square exponential covariance or the Matern class of covariance functions (Tilmann Gneiting and William Kleiber and Martin Schlather, 2010), though parameters of such a covariance (correlation length, smoothness parameter) being unknown, ad hoc values of these will then need to be imposed. In the presence of training data, the smoothness parameters can be learnt from the data.
For systems in which the continuous assumption is misplaced, choosing an appropriate covariance function and learning the relevant parameters from the measured data, in absence of training data, becomes even trickier. An example of this situation can arise in fact in an inverse problem of Type I-the unknown physical density of the system is projected onto the space of observables such that inversion of the available (noisy) image data will allow for the estimation of the unknown density, where the projection operator is known. Such a density function in real systems can often have disjoint support in its domain and can also be typically characterised by sharp density contrasts as in material density function of real-life material samples . Then, if we were to model this discontinuous and multimodal density function as a realisation from a GP, the covariance function of such a process will need to be non-stationary. It is possible to render a density function sampled from such a GP to be differently differentiable at different points, using for example prescriptions advanced in the literature (Paciorek and Schervish, 2006), but in lieu of training data it is not possible to parametrise covariance kernels to ensure the representative discontinuity and multimodality of the sampled (density) functions. Thus, the absence of training data leads to the inability to learn the correlation structure of the density function given the measured image data.
A way out this problem could be to make an attempt to construct a training data set by learning values of the unknown system behaviour function at those points in the domain of the density, at which measured data are available; effectively, we then have a set of data points, each generated at a learnt value of the function, i.e. this set comprises a training data. In this data set there are measurement uncertainties as well as uncertainty of estimation on each of the learnt values of the system function. Of course, learning the value of the function at identified points within the domain of the system function, is in itself a difficult task. Thus, in this paradigm, the domain X ⊆ R d of the unknown system function ρ(x) is discretised according to the set of values of X, {x 1 , x 2 , . . . , x n }, at which the n measurements are available. In other words, the discretisation of X is dictated by the data distribution. Over each X-bin, the function ρ(x) is held a constant such that for X in the i-th bin, the function takes the value ρ i , i = 1, 2, . . . , n; then we define ρ := (ρ 1 , ρ 2 , . . . , ρ n ) T and try to learn this vector, given the data. Unless otherwise motivated, in general applications, the probability distribution of ρ i is not imposed by hand. In the Bayesian framework this exercise translates to the computing of the joint posterior probability density of n distribution-free parameters ρ 1 , ρ 2 , . . . , ρ n given the data, where the correlation between ρ i and ρ j is not invoked, i, j = 1, 2, . . . , n; i = j. Of course, framed this way, we can only estimate the value of the sought function ρ(x) at identified values of X-unless interpolation is used-but once the training data, thus constructed, is subsequently implemented in the modelling of ξ(·) with a GP of appropriate dimensionality, statistical prediction at any value of X may be possible.
Above, we dealt schematically with the difficult case of lack of training data. However, even when a training data set is available, learning ξ(·) using such data can be hard. In principle, ξ(·) can be learnt using splines or wavelets. However, a fundamental shortcoming of this method is that splines and wavelets can fail to capture the correlation amongst the component functions of a high-dimensional ξ(·). Also, the numerical difficulty of the very task of learning ξ(·) using this technique, and particularly of inverting the learnt ξ(·), only increases with dimensionality. Thus it is an improvement to model such a ξ(·) with a high-dimensional GP. A high-dimensional ξ(·) can arise in a real-life inverse problem if the observed data is high-dimensional, eg. the data is matrix-variate (Chakrabarty, Biswas and Bhattacharya, 2013).
Measurement uncertainties or measurement noise is almost unavoidable in practical applications and therefore, any attempt at an inference on the unknown model parameter vector ρ (or the unknown model function ρ(x)) should be capable of folding in such noise in the data. In addition to this, there could be other worries stemming from inadequacies of the available measurements-the data could be "too small" to allow for any meaningful inference on the unknown(s) or "too big" to allow for processing within practical time frames; here the qualification of the size of the data is determined by the intended application as well as the constraints on the available computational resources. However, a general statement that is relevant here is the fact that in the Bayesian paradigm, less data is usually required than in the frequentists' approach, as motivated above. Lastly, data could also be missing; in particular, in this paper we discuss a case in which the measurable lives in a space U ⊂ W where W is the state space of the system at hand.
The paper is constructed as follows. In Section 2, we briefly discuss the outline of state space modelling. In the following Section 3, our new state space modelling based methodology is delineated; in particular, we explore alternatives to the suggested method in subsection 3.1. The astrophysical background to the application using which our methodology is elucidated, is motivated in Section 4 while the details of the modelling are presented in Section 5. We present details of our inference in Section 6 and applications to synthetic and real data are considered in Section 7 and Section 8 respectively. We round up the paper with some discussions about the ramifications of our results in Section 9.

State Space Modelling
Understanding the evolution of the probability density function of the state space of a dynamical system, given the available data, is of broad interest to practitioners across disciplines. Estimation of the parameters that affect such evolution can be performed within the framework of state space models or SSMs (Carlin, Polson and Stoffer, 1992;Harvey, Koopman and (eds.), 2012;Pole, West and Harrison, 1994;West and Harrison, 1997). Basically, an SSM comprises an observation structure and an evolution structure. Assuming the observations to be conditionally independent, the marginal distribution of any observation is dependent on a known or unknown stationary model parameter, at a given value of the state space parameter at the current time. Modelling of errors of such observations within the SSM framework is of interest in different disciplines (Knape et al., 200;Winship et al., 2012).
The evolution of the state space parameter is on the other hand given by another set of equations, in which the uncertainty of the evolved value of the parameter is acknowledged. A state space representation of complex systems will in general have to be designed to capacitate high-dimensional inference in which both the evolutionary as well as observation equations are in general non-linear and parameters and uncertainties are non-Gaussian.
In this paper we present a new methodology that offers a state space representation in a situation when data is collected at only one time point and the unknown state space parameter in this treatment is replaced by the discretised version of the multivariate probability density function (pdf ) of the state space variable. The focus is on the learning of the static unknown model parameter vector rather than on prediction of the state space parameter at a time point different to when the observations are made. In fact, the sought model parameter vector is treated as embedded within the definition of the pdf of the state space variable. In particular, the method that we present here pertains to a partially observed state space, i.e. the observations comprise measurements on only some-but not allof the components of the state space vector. Thus in this paradigm, probability of the observations conditional on the state space parameters reduces to the probability that the observed state space data have been sampled from the pdf of the full state space variable vector, marginalised over the unobserved components. Here this pdf includes the sought static model parameter vector in its definition. In addition to addressing missing data, the presented methodology is developed to acknowledge the measurement errors that may be non-Gaussian.
The presented method is applied to real and synthetic astronomical data with the aim of drawing inference on the distribution of the gravitational mass of all matter in a real and simulated galaxy, respectively. This gravitational mass density is projected to be useful in estimating the distribution of dark matter in the galactic system.

Method in general
Here we aim to learn the unknown model parameter vector ρ given the data, where data comprises N data measurements of some (h) components of the d-dimensional state space parameter vector X; thus, h < d. Here X = (X 1 , X 2 , . . . , X d ) T . In fact, the data set is Let the state space be W so that X ∈ W. Let the observable vector be U ∈ U ⊂ W. Let Pr(X ∈ [x, x + dx]) = f X (x, α)dx, i.e. the probability density function of the state parameter vector X is f (x, α), where the distribution is parametrised by the parameter α ∈ R j .
In light of this, we suggest that X ∼ f (x, α). Then had the observations lived in the state space W, we could have advanced the likelihood function in terms of f (·, ·). However, here we deal with missing data that we know lives in the sub-space U within W. Therefore, the data must be sampled from the density ν(u, α) that is obtained by marginalising the pdf f (x, α) over X h+1 , X h+2 , . . . , X d . In other words, the pdf f (x, α) is projected onto the space of the observables, i.e. onto U; the result is the projected or marginalised density ν(u, α) of the observables. Then under the assumption of the observed vectors being conditionally iid, the likelihood function is While the likelihood is thus defined, what this definition still does not include in it is the sought model parameter vector ρ. In this treatment, we invoke a secondary equation that allows for the model parameter vector ρ to be embedded into the definition of the likelihood. This can be accomplished by eliciting application specific details but in general, we suggest α = ξ(ρ) := (ξ 1 (ρ), ξ 2 (ρ), . . . ξ j (ρ)) T and construct the general model for the state space pdf to be where η(·) is a t-dimensional vector function of a vector. Given this rephrasing of the state space pdf , the projected density that the i-th measurement u (i) is sampled from, is re-written as so that plugging this in the RHS of Equation 3.1, the likelihood is However, it is appreciated that the pdf of the state space vector X may not be known, i.e. f (·, ·) is unknown. This motivates us to attempt to learn the state space pdf from the data, simultaneously with ρ. We consider the situation that training data is unavailable where training data would comprise a set of values of U generated at chosen values of ρ(X). However, since the very functional relationship (ξ(·) in the notation motivated above) between U and ρ(X) is not known, it is not possible to generate values of U at a chosen value of ρ(X), unless of course, an approximation of unquantifiable crudeness for this functional relationship is invoked. Here we attempt to improve upon the prospect of imposing an ad hoc model of ξ(·). Then in this paradigm, we discretise the function f (η(x), ξ(ρ)). This is done by placing the relevant ranges of the vectors η(x) and ξ(ρ) on a grid of chosen cell size. Thus, for η(·) and ξ(·) being discretised into t and j-dimensional vectors respectively, the discretised version of f (η(x), ξ(ρ)) is then represented as the t × j-dimensional vector f such that the p-th component of this vector is the value of f (η(x), ξ(ρ)) in the p-th "η − ξ-grid cell". Here, such a grid-cell is the p-th of the ones that the domain of f (·, ·) is discretised into, p = 1, 2, . . . , p max .
Given this discretisation of f (·, ·), the RHS of Equation 3.4 is reduced to a sum of integrals over the unobserved variable in each of the grid-cells. In other words, where y (p) (u (i) , ρ) is the value that the vector of the unobserved variables takes up in the p-th η − ξ-grid-cell. The integral on the RHS of Equation 3.6 represents the volume that the p-th η − ξ-grid-cell occupies in the space of the unobserved variable vector Y = (X h+1 , X h+2 , . . . , X d ) T . The value of Y in the p-th η − ξ-grid-cell is dependent in general on ρ for a given data vector u (i) ; hence the notation y (p) (u (i) , ρ).
In other words, to compute the integral for each p (on the RHS of Equation 3.6) we need to identify the bounds on the value of each component of Y imposed by the edges of the p-th η − ξ grid-cell. This effectively calls for identification of the mapping between the space of η(x) and ξ(ρ), and the space of the unobserved variables Y . Now the observation U ∈ U ⊂ W. Then Y ∈ Y, where Y ⊕ U = W. Indeed, this mapping will be understood using the physics of the system at hand. We will address this in detail in the context of the application that is considered in the paper.
The likelihood function is then again rephrased as using Equation 3.6. However, the observed data is likely to be noisy too. To incorporate the errors of measurement, the likelihood is refined by convolving ν(u (i) , ρ, f ) with the density of the error ε in the value of the observed vector U , where the error distribution is assumed known. Let the density of the error distribution be g(u; ε) where ε are the known parameters. Then the likelihood is finally advanced as In a Bayesian framework, inference is pursued thereafter by selecting priors for the unknowns ρ and f , and then using the selected priors in conjunction with the likelihood defined in Equation 3.8, in Bayes rule to give the posterior of the unknowns given the data, i.e π(ρ, f |{u (i) } N data i=1 ). In the context of the application at hand, we will discuss all this and in particular, advance the data-driven choice of the details of the discretisation of the f (η(x), ξ(ρ)) function. Posterior samples could be generated using a suitable version of Metropolis-Hastings and implemented to compute the 95% HPD credible regions on the learnt parameter values.

Alternative methods
We ask ourselves the question about alternative treatment of the data that could result in the estimation of the unknown model parameter vector ρ. Let the sought model parameter be s-dimensional while the observable U is an h-dimensional vector valued variable and there are N data number of measurements of this variable available. Then the pursuit of ρ can lead us to express the data as a function of the model parameter vector, i.e. write U = Ξ(ρ), where Ξ(·) is an unknown, h-dimensional vector valued function of an s-dimensional vector. In order to learn ρ, we will need to first learn Ξ(·) from the data, as was motivated in the introductory section.
As we saw in that section, the learning of this high-dimensional function from the data and its inversion are best tackled by modelling the unknown high-dimensional function with a Gaussian Process. (Chakrabarty, Biswas and Bhattacharya, 2013) present a generic Bayesian method that performs the learning and inversion of a high-dimensional function given matrix-variate data within a supervised learning paradigm; the (chosen) stationary covariance function implemented in this work is learnt using training data and is subsequently used in the computation of the posterior probability of the unknown model parameter vector given the measured or test data, as well as the training data. In the absence of available training data, such an implementation is not possible, i.e. such a method is not viable in the unsupervised learning paradigm. In the application we discuss below, training data is not available and therefore, the modelling of the functional relation between data and ρ, using Gaussian Processes appears to not be possible. This shortcoming can however be addressed if simulations of the system at hand can be undertaken to yield data at chosen values of ρ; however, the very physical mechanism that connects ρ with the data may be unknown (as in the considered application) and therefore, such a simulation model is missing. Alternatively, if independently learnt ρ, learnt with an independent data set, is available, the same can be used as training data to learn ρ given another data set. On such instances, the Gaussian Process approach is possible but in lieu of such training data becoming available, the learning of ρ given the matrix-valued data can be performed in the method presented above. On the other hand, a distinct advantage of the method presented below is that it allows for the learning of the state space density in addition to the unknown model parameter vector.
If the suggestion is to learn the unknown system function ρ(X) as itself a realisation of a GP, the question that then needs to be addressed is how to parametrise the covariance structure of GP in situations in which the data results from measurements of the variable U that shares an unknown functional relation with ρ(X). In other words, in such situations, the unknown system function ρ(X) has to be linked with the available data via a functional relation, which however is unknown, as motivated above; we are then back to the discussion in the previous paragraph.

Case study
Unravelling the nature of Dark Matter and Dark Energy is one of the major challenges of today's science. While such is pursued, the gathering of empirical evidence for/against Dark Matter (DM) in individual real-life observed astronomical systems is a related interesting exercise.
The fundamental problem in the quantification of dark matter in these systems is that direct observational evidence of DM remains elusive. In light of this, the quantification is pursued using information obtained from measurable physical manifestations of the gravitational field of all matter in an astronomical system, i.e. dark as well as self-luminous matter. Indeed, such measurements are difficult and physical properties that manifest the gravitational effect of the total gravitational field of the system would include the density of X-rays emitted by the hot gas in the system at a measured temperature (Pellegrini and Ciotti, 2006), velocities of individual particles that live in the system and play in its gravitational field (Chakrabarty, 2006;Chakrabarty and Raychaudhury, 2008;Coccato et al., 2009;Côté et al., 2003;Romanowsky et al., 2003) and the deviation in the path of a ray of light brought about by the gravitational field of the system acting as a gravitational lens (Koopmans, 2006).
The extraction of the density of DM from the learnt total gravitational mass density of all matter in the system, is performed by subtracting from the latter, the gravitational mass density of the self-luminous matter. The density of such luminous matter is typically modelled astronomically using measurements of the light that is observed from the system. A reliable functional relationship between the total gravitational mass density and such photometric measurements is not motivated by any physical theories though the literature includes such a relationship as obtained from a pattern recognition study performed with a chosen class of galaxies (Chakrabarty and Jackson, 2009).
In this work, we focus our attention to the learning of the total gravitational mass density in galaxies, the images of which resemble ellipses -as distinguished from disc-shaped galaxies for which the sought density is more easily learnt using measurement of rotational speed of resident particles. By a galactic "particle" we refer to resolved galactic objects such as stars. There could also be additional types of particles, such as planetary nebulae (PNe) which are an end state of certain kinds of stars; these bear signature marks in the emitted spectral data. Other examples of galactic particles could include old clusters of stars, referred to as globular clusters (GCs).

Data
As defined above, the space of all states that a dynamical system achieves is referred to as the system's state space W. Now, the state that a galaxy is in, is given by the location and velocity coordinates of all particles in the system. Here, the location coordinate is X ∈ R 3 as is the velocity coordinate vector V . Thus, in our treatment of the galaxy at hand, W is the space of the particle location and velocity vector i.e. the space of the vector W = (X T , V T ) T .
We model the galactic particles to be playing in the average (gravitational) force field that is given rise to by all the particles in this system. Under the influence of this mean field, we assume the system to have relaxed to a stationary state so that there is no time dependence in the distribution of the vector W Our aim is to learn the density function of gravitational mass of all matter in the galaxy, given the data D = The physical interpretation of these observables is that V 3 is the component of the velocity of a galactic particle that is aligned along the line-of-sight that joins the particle and the observer, i.e. we can measure how quickly the particle is coming towards the observer or receding away but cannot measure any of the other components of V . Similarly, we know the components X 1 and X 2 of the location X of a galactic particle in the galactic image but cannot observe how far orthogonal to the image plane the particle is, i.e. X 3 is unobservable.
It merits mention that in the available data, values of X 1 and X 2 appear in the form of x 2 1 + x 2 2 . Then the data Here N data is typically of the order of 10 2 . While for more distant galaxies, N data is lower, recent advancements is astronomical instrumentation allows for measurement of V 3 of around 750 planetary nebulae or PNe (as in the galaxy CenA, Woodley, & Chakrabarty, under preparation). Such high a sample size is however more of an exception than the rule -in fact, in the real application discussed below, the number of V 3 measurements of globular clusters (or GCs) available is only 29. In addition, the measurements of V 3 are typically highly noisy, the data would typically sample the sub-space U very sparsely and the data sets are typically one-time measurements. The proposed method will have to take this on board and incorporate the errors in the measurement of V 3 . Given such data, we aim to learn the gravitational mass density of all matter -dark as well as self-luminous -at any location X in the galaxy.

Modelling real data
In the Bayesian framework, we are essentially attempting to compute the posterior of the unknown gravitational mass density function ρ(X), given data D. Since gravitational mass density is non-negative, ρ : R 3 −→ R ≥0 . That we model the mass density to depend only on location X is a model assumption 1 .
From Bayes rule, the posterior probability density of ρ(X) given data D is given as proportional to the product of the prior and the likelihood function, i.e. the probability density of D given the model for the unknown mass density. Now, the probability density of the data vector U given the model parameters α is given by the probability density function ν(U , α) of the observable U , so that, assuming the N data data vectors to be conditionally independent, the likelihood function is the product of the pdf s of U obtained at the N data values of U : (5.1) This is Equation 3.7 written in the context of this application. Given that However, this formulation still does not include the gravitational mass density function ρ(x) in the definition of f (X, V ), we explore the Physics of the situation to find how to embed ρ(x) into the definition of the pdf of the state space variable W , and thereby into the likelihood. This is achieved by examining the time evolution of this pdf of the state space variable; we discuss this next.

Evolution of f (X, W ) and embedding ρ(X) in it
Here we invoke the secondary equation that tells of the evolution of f (X, V ). In general, the pdf of the state space variable is a function of X, V and time T . So the general state space pdf is expected to be written as It is interpreted as the following: at time t (t ∈ T ), the probability for However, we assume that the particles in a galaxy do not collide since the galactic particles inside it, (like stars), typically collide over timescales that are the age of galaxies (Binney and Tremaine, 1987). Given this assumption of collisionlessness, the pdf of W = (X T , V T ) T remains invariant. Thus, the evolution of f (x, v, t) must is guided by the Collisionless Boltzmann Equation (CBE): This equation suggests that when the state space distribution has attained stationarity, so that a constant ∀ x, v at a given time. This is referred to as Jeans theorem (Binney and Tremaine, 1987). In fact, the equation more correctly suggests that as long as the system has reached stationarity, at any given time, f (x, v) is a constant ∀ x, v inside a well-connected region⊆ W. Given this, the state space pdf can be written as a function of quantities that do not change with time 2 . Proof. The proof is simple; for the proof we assume X and V to take respective values of x and v inside a wellconnected sub-space of W. Let a function of the vectors x, v be I(x, v) such that it remains a constant w.r.t. time.

Then
dI ( In fact, any function of a time-invariant function of vectors X and V is also a solution to the CBE. Now, in our work we assume the system to have attained stationarity so that the pdf of the state space variable has no time dependence. Then the above theorem suggests that we can write f (x, v) = g(I 1 (x, v), I 2 (x, v), . . . , I n (x, v)) for any n ∈ Z + , where I i (·, ·) is any time-independent function of 2 vectors, for i = 1, 2, . . . , n. Now, upon eliciting from the literature in galactic dynamics (Binney, 1982;Contopoulos, 1963) we realise the following.
• The pdf of the state space variable has to include particle energy E(X, V ), (which is one constant of motion), in its domain. Thus, we can write f (X, V ) = f (E(X, V ), I 2 (X, V ), . . . , I 5 (X, V )).
• Energy E(X, V ) is given as the sum of potential energy Φ( X ) and kinetic energy V · V /2, i.e.
Here · is the Euclidean norm. That the potential is maintained as dependent only of the location vector X and not on V stems from our assumption that there is no dissipation of energy in this system, i.e. we model the galaxy at hand to be a Hamiltonian system. Here, a basic equation of Physics relates the potential of the galaxy to the gravitational mass density of the system, namely Poisson Equation: ∇ 2 is the Laplace operator (in the considered geometry of the galaxy) and G is a known constant (the Universal gravitational constant).
On the basis of the above, we can write At this point we recall the form of an isotropic function of 2 vectors (Liu, 2002;Truesdell, Noll and Antman, 2004;Wang, 1969). .
We also recall that in this application, X · V = 0 by construction. This leads us to identify any pdf of the state space variable W = (X T , V T ) T as isotropic if the pdf is expressed as a function of energy E(X, V ) alone. This follows from Equation 5.10 since f (X, which is compatible with the form of isotropic functions as per Remark 5.1. Thus, if the pdf of the state space variable is dependent on only 1 constant of motion-which by the literature in galactic dynamics has to be energy E(X, V )-then f (X, V ) is an isotropic function of X and V .
However, there is no prior reason to model a real galaxy as having an isotropic probability distribution of its state space. Instead, we attempt to • use as general a model for the state space distribution of the system as possible, • while ensuring that the degrees of freedom in the model are kept to a minimum to ease computational ease.
This leads us to include another time-invariant function L(X, V ) in the definition of the pdf of the state space variable in addition to E(X, V ), such that the dependence on X and V in L(·, ·) is not of the form that renders f (E, L) compatible with the definition of isotropic function, as per Remark 5.1, unlike f (E). This is so because L(X, V ) := X × V (5.12) where × represents the "cross-product" of the two 3-dimensional vectors X and V , i.e.
Then, we set f (E, L) ≡ f (X · X, V · V , X × V ) which is not compatible with the form of an isotropic function of the 2 vectors X and V . In other words, if the support of the pdf of X and V includes E(X, V ) and L(X, V ), then the state space distribution is no longer restricted to be isotropic. Such a general state space is indeed what we aimed to achieve with our model. At the same time, adhering to no more than 1 constant of motion in addition to energy E(X, V ) helps to keep the dimensionality of the domain of the pdf of the state space function to the minimum that it can be, given our demand that no stringent model-driven constraint be placed on the state space geometry. Thus, we use n=2 in our model. So now we are ready to express the unknown gravitational mass density function as embedded within the pdf of X and V as: using Equation 5.12. To cast this in the form of Equation 3.3, we realise that the unknown gravitational mass density function will need to be discretised; we would first discretise the range of values of R over which the gravitational mass density function ρ(R) is sought. Let R = r such that r ∈ [r min , r max ] and let the width of each R-bin be δ r . Then ρ(r) is discretised as the unknown model parameter vector ρ := (ρ 1 , ρ 2 , . . . , ρ Nx ) T , (5.16) where where N x := int r max − r min δ r +.
Then following on from Equation 5.15 we write Then to compute the likelihood and thereafter the posterior probability of ρ given data D, we will need to compute the integral in Equation 5.19. According to the general methodology discussed above in Section 3, this is performed by discretising the domain of the pdf of the state space variable, i.e. of f (E, L). In order to achieve this discretisation we will need to invoke the functional relationship between E(X, V ) and L(X, V ). Next we discuss this.

Relationship between E(X, V ) and L(X, V )
We recall the physical interpretation of L(X, V ) as the norm of the "angular momentum" vector, i.e. L 2 (X, V ) R 2 is the square of the speed V c (X, V ) of circular motion of a particle with location X and velocity V ; here, "circular motion" is motion orthogonal to the location vector X, distinguished from non-circular motion that is parallel to X and the speed of which is V nc (X, V ). Then as these two components of motion are mutually orthogonal, square of the particle's speed is where V nc is the magnitude of the component of V that is parallel to X, i.e.
But we recall that energy E(X, V ) = Φ(ρ(R)) + V · V /2. This implies that where in the last equation, we invoked the definition of L(X, V ) sing Equation 5.14. At this stage, to simplify things, we consciously choose to work in the coordinate system in which the vector X is rotated to vector S = (S 1 , S 2 , S 3 ) T by a rotation through angle θ := cos −1 X 2 Then by definition, S 1 =0, i.e. the projection of the (S 1 , S 2 , S 3 ) T vector on the S 3 =0 plane lies entirely along the S 2 -axis. This rotation does not affect the previous discussion since • the previous discussion invokes the location variable either via R = X 2 1 + X 2 2 + X 2 3 = S 2 1 + S 2 2 + S 2 3 , • or via x 2 1 + x 2 2 = s 2 1 + s 2 2 as within the data structure: Having undertaken the rotation, we refer to E(X, V ) and L(X, V ) as E(S, V ) and L(S, V ) respectively. This rotation renders the cross-product in the definition of L(·, ·) simpler; under this choice of the coordinate system, as S 1 = 0 , so that in this rotated coordinate system, from Equation 5.23 Also, the component of V along the location vector S is V nc = V · S/R = (V 2 S 2 + V 3 S 3 )/R. From Equation 5.23 it is evident that for a given value ǫ of E(S, V ), the highest value ℓ max (ǫ) of L(S, V ) is attained if V nc = 0 (all motion is circular motion). This is realised only when the radius R c of the circular path of the particle takes a value r c such that ℓ max (ǫ) 2 = 2r 2 c [ǫ − Φ(r c )] (5.28) The way to compute r c given ǫ is defined in the literature (Binney and Tremaine, 1987) as the positive definite solution for r in the equation We are now ready to discretise the domain of the pdf of the state space variable, i.e. of f (E, L) in line with the general methodology discussed above in Section 3 with the aim of computing the integral in Equation 5.19.
We then define the N ǫ × N ℓ -dimensional matrix In our model this is the discretised version of the pdf f (E, L) of the state space variable W = (S T , V T ) T . In this application, a particle with a positive value of energy is so energetic that it escapes from the galaxy. We are however concerned with particles that live inside the galaxy, i.e. are bound to the galaxy and therefore, the maximum energy that a galactic particle can attain is 0, i.e. ǫ max = 0. Given the definition of energy E(S, V ) = Φ(R) + V · V /2 we realise that the value of E(S, V ) is minimum, i.e. as negative as it can be, if V · V =0, (i.e. velocity is zero) and Φ(R) is minimum, which occurs at R = 0. In other words, the minimum value of E is Φ(0) which is negative. In our work we normalise the value ǫ of E by −Φ(0), so that ǫ ∈ [−1, 0]. In other words, the aforementioned ǫ min = −1 and ǫ max = 0.
We normalise the value ℓ of L(S, V ) with the maximal value ℓ max (ǫ) that ℓ can attain for a given value ǫ of E (Equation 5.28). The maximum value that can be attained by L is for ǫ = 0; having computed r c from Equation 5.29, ℓ max (0) is computed. Then, as normalised by ℓ max (0), the maximal value of L is 1. Also the lowest value of L is 0, i.e. ℓ min =0. In light of this, we rewrite Equation 5.30 as The E-binning and L-binning are kept uniform in the application we discuss below, i.e. δ ǫ and δ ℓ are constants.

Data-driven binning
There are N ℓ L-bins and N ǫ E-bins. Above we saw that as the range covered by normalised values of E is [−1, 0], the relationship between N ǫ and E-bin width δ ǫ is δ ǫ = 1/N ǫ . We make inference on N ℓ within our inference scheme while the Physics of the situation drives us to a value of N ǫ . It could have been possible to also learn N ǫ from the data within our inference scheme but that would have been tantamount to wastage of information that is available from the domain of application. We attempt to learn N ℓ from the data within our inference scheme; for a given N ℓ , N ǫ is fixed by the data at hand. To understand this, we recall the aforementioned relation ǫ = Φ(r) + v 2 nc /2 + ℓ 2 /2r 2 . Let in the available data set, -the minimum value of S 2 1 + S 2 2 be r min , -the maximum value of S 2 1 + S 2 2 be r max so that the value of Φ(·) is no less than Φ(r max ), -the maximum value of V 3 be v (max) 3 so that the unnormalised value of E is no less than -and the unnormalised ǫ is no more than Φ(0).
Thus, it is clear that the E-binning should cover the interval beginning at a normalised value of -1 and should at least extend to ǫ max /[−Φ(0)].
Then we set E-bin width δ ǫ = 1/N ǫ and learn number of L-bins, N ℓ , from the data within our inference scheme. Then at any iteration, for the current value of N ℓ and the current ρ (which leads to the current value of Φ(r) according to Equation 5.8), placing ǫ max /[−Φ(0)] at the centre of the N ǫ -th E-bin gives us Experiments suggest that for typical galactic data sets, N ℓ between 5 and 10 implies convergence in the learnt vectorised form of the gravitational mass density ρ. This leads us to choose a discrete uniform prior over the set {5, 6, . . . , 10}, for N ℓ : Again, the minimum and maximum values of S 2 1 + S 2 2 in the data fix r min and r max respectively, so that r max = r min + δ r (N x − 1). The radial bin width δ r is entirely dictated by the data distribution such that there is at least 1 data vector in each radial bin. Thus, N x and δ r are not parameters to be learnt within the inference scheme but are directly determined by the data.

Likelihood
Following Equation 3.7, we express the likelihood in this application in terms of the pdf of S and V , marginalised over all those variables that we do not have any observed information on. Then for the data vector (s 3 sin γ (k) )) 2 ], with [L(S, V )] 2 recalled from Equation 5.25, and we have used 2 ) 2 + s 2 3 (5.36) and cos γ (k) := s 3 r (k) . Then given that the range of values of E and L is discretised, we write  } are determined by the mapping between the space of E and L and the space of the unobservables, namely S 3 , V 1 , V 2 . This mapping involves the definition of E and L in terms of the state space coordinates (S T , V T ) T , which in turn depends upon the function ρ(r) or its discretised version, ρ. Hence the values taken by any of the 3 unobservables in the cd-th E − L-grid-cell depend on ρ. Here c = 1, 2, . . . , N ǫ and d = 1, 2, . . . , N ℓ .
We realise that the integral on the RHS of Equation 5.37 represents the volume occupied by the E − L-grid-cell inside the space of the unobserved variables. The computation of this volume is now discussed.

Volume of any E − L-grid-cell in terms of the unobservables
We begin by considering the volume of any E − L-grid-cell in the space of the 2 observables, V 1 and V 2 , at a given value of S 3 . Thereafter, we will consider the values of the 3rd unobservable, S 3 , in this grid-cell.
The definition E(s (k) , v (k) ) = Φ(r (k) ) + v (k) · v (k) /2 (Equation refeqn:ljhamela) implies that for the k-th data vector (s 3 ) T , all particles with S 3 = s 3 and energy E(s (k) , v (k) ) = ǫ c will obey the equation i.e. for S 3 = s 3 , all particles lying in the c-th E-bin will lie in the space of V 1 and V 2 , within a circular annulus that is centred at (0,0) and has radii lying in the interval [ε c+1 , ε c ] where provides a representation for all particles in the d-th L-bin with given observed values of S 1 , S 2 and V 3 .

It then follows from
3 sin γ} 2 , (Equation 5.25) that for the k-th data vector, all particles with S 3 = s 3 , and in the d-th L-bin (L(s (k) , v (k) ) = ℓ d ) will obey the equation where we have recalled r (k) from Equation 5.36. This implies that for S 3 = s 3 , all particles lying in the d-th L-bin, will lie in the space of V 1 and V 2 , along an ellipse centred at (0, v 3 tan γ (k) ) with semi-minor axis lying in the interval of [λ d+1 , λ d ] and semi-major axis lying in the interval λ d+1 cos γ (k) , λ d cos γ (k) . Here, Collating the implications of Equation 5.38 and Equation 5.40, we get that at a given value of S 3 , particles with observed data (s 3 ) T , (with energies) in the c-th E-bin and (momenta) in the d-th L-bin will lie in the space of V 1 and V 2 , within an area bound by the overlap of -the circular annular region centred at V 1 = 0, V 2 = 0, extending in radii between ε c+1 and ε c . -the elliptical annular region centred at V 1 = 0, V 2 = v (k) 3 tan γ, extending in semi-minor between λ d+1 and λ d and semi-major axis in [λ d / cos γ, λ d+1 /cosγ], where cos γ = s 3 (s The area of these overlapping annular regions represents the volume of the cd-th E − L-grid-cell in the space of V 1 and V 2 , at the value s 3 of S 3 . Thus, the first step towards writing the volume of the cd-th E − L-grid-cell in terms of the unobservables, is to compute the area of these overlapping annular regions in the space of V 1 and V 2 . Such an area of overlap is a function of s 3 . At the next step, we integrate such an area over all allowed s 3 , to recover the volume of the cd-th E − L-grid-cell in the space of V 1 , V 2 and S 3 , i.e. the integral on the RHS of Equation 5.37.
There can be multiple ways these annular regions overlap; three examples of these distinct overlapping geometries are displayed in Figure 1. In each such geometry, it is possible to compute the area of this region of overlap since we know the equations of the curves that bound the area. However, the number of possible geometries of overlap is in excess of 20 and identifying the particular geometry to then compute the area of overlap in each such case, is tedious to code. In place of this, we allow for a numerical computation of the area of overlap; this method works irrespective of the particulars of the geometry of overlap. We identify the maximum and minimum values of V 2 allowed at a given value of V 1 , having known the equations to the bounding curves, and compute the area of overlap in the plane of V 1 and V 2 using numerical integration.
This area of overlap in the plane defined by V 1 and V 2 is a function of S 3 since the equations of the bounding curves are expressed in terms of s 3 . The area of overlap is then integrated over all values that S 3 is permitted to take inside the cd-th E − L-grid-cell. For any E − L-grid-cell, the lowest value S 3 can take is zero. For ǫ ∈ [ǫ c+1 , ǫ c ], and ℓ ∈ [ℓ d , ℓ d+1 ], the maximum value of S 3 is realised (by recalling Equation 5 where v nc is the projection of v along the s vector (discussed in Section 5.2). Thus, v nc is given by the inner product of v and the unit vector parallel to s: Using this in Equation 5.42 we get This implies that given the observations represented by the k-th data vector (s The highest positive root for s 3 from Equation 5.46 as the highest value that S 3 can attain in the cd-th E − L-gridcell. Thus, for the cd-th cell, the limits on the integration over s 3 are 0 and the solution to Equation 5.46. So now we have the value of the integral over v 1 and v 2 and hereafter over s 3 , for the cd-th E −L-grid-cell. This triple integral gives the volume of the cd-th E−L-grid-cell in the space of the unobservables, i.e. of V 1 , V 2 , S 3 . This volume is multiplied by the value f c,d of the discretised pdf of the state space variable in this E − L cell and the resulting product is summed over all c and d, to give us the marginalised pdf ν(s 3 ) (see Equation 5.37). Once the marginalised pdf is known for a given k, the product over all ks contributes towards the likelihood.

Normalisation of the marginal pdf of the state space vector
As we see from Equation 5.37, the marginal pdf of S and V is dependent on ρ, so this normalisation will not cancel within the implementation of Metropolis-Hastings to perform posterior sampling. In other words, to ensure that the value of ν(·, ·, ·) -and therefore the likelihood -is not artificially enhanced by choosing a high ρ, we normalise ν(s 3 ) for each k, by the pdf integrated over all possible values of S 1 , S 2 and V 3 , i.e. by S1 S2 V3 ν(s 1 , s 2 , v 3 )ds 1 ds 2 dv 3 (5.47) where the possible values of V 3 are in the interval [− −2Φ(s 2 1 + s 2 2 ), −2Φ(s 2 1 + s 2 2 )], of S 2 in the interval [ s 2 1 − r 2 min , r 2 max − s 2 1 ] and of S 1 in [r min , r max ]. Hereafter, by ν(·, ·, ·) we will imply the normalised marginal pdf .

Incorporating measurement uncertainties
Following Equation 3.8 the likelihood is defined as the product over all data, of the convolution of the error distribution at the k-th datum and value of the marginalised pdf for this k (assuming the data to be conditionally iid). In this application the measurement of the location of the galactic particle projected onto the image plane of the galaxy, i.e. (S 1 , S 2 ), is constrained well enough to ignore measurement uncertainties in. However, the measurement errors in the line-of-sight component of the particle velocity, V 3 , can be large. This measurement error in V 3 is denoted as δV 3 . The distribution of this error is determined by the astronomical instrumentation relevant to the observations of the galaxy at hand and are usually known to the astronomer. In the implementation of the methodology to real and simulated data, as discussed below, we work with a Gaussian error distribution with a known variance σ 2 V3 . Thus, δV 3 ∼ N (0, σ 2 V3 ). For this particular error distribution, the likelihood is defined as For any other distribution of the uncertainties in the measurement of V 3 , the likelihood is to be rephrased as resulting from a convolution of ν(·, ·, ·) and that chosen error distribution.

Priors
In the existing astronomical literature, there is nothing to suggest the pdf of the state space variable in a real galaxy though there are theoretical models of the functional dependence between stellar energy (E) and angular momentum (L) and pdf of S and V (Binney and Tremaine, 1987). Given this, we opt for uniform priors on f c,d , c = 1, 2, . . . , N ǫ , d = 1, 2, . . . , N ℓ . However, in our inference, we will use the suggestion of monotonicity of the state space pdf , as given in the theoretical galactic dynamics literature. We also use the physically motivated constraint that f c,d ≥ 0, ∀ c, d. Thus, we use f c,d ∼ U(1, 0), where U(·, ·) denotes the uniform distribution over the interval [·, ·].
As far as priors on the gravitational mass density are concerned, astronomical models are available (Binney and Tremaine, 1987). All such models suggest that gravitational mass density is a monotonically decreasing function of R. A numerically motivated form that has been used in the astrophysical community is referred to as the NFW density (Navarro, Frenk and White, 1996), though criticism of predictions obtained with this form also exist (de Blok, Bosma and McGaugh, 2003, among others). For our purpose we suggest a uniform prior on ρ b such that is the gravitational mass density as given by the 2-parameter NFW form, for the particle radial location r ∈ [r min + (b − 1)δ r , r max + bδ r , b = 1, 2, . . . , N x . In fact, this location is summarised as r b , the mid-point of the b-th radial bin. R s and ρ 0 are the 2 parameters of the NFW density form. In our work these are hyperparameters and we place uniform priors on them: π 0 (R s ) = 1/(r max − r min ) and π 0 (ρ 0 ) = 1/(10 14 − 10 9 ), where these numbers are experimentally chosen.
We marginalise ρ 0 and R s out of π(ρ, F, R s , ρ 0 , N ℓ |u 1 , u 2 , . . . , u N data ) to achieve the joint posterior probability of ρ, F and N ℓ given the data.

The marginalisation involves only the term
(recalling Equation 5.49). Integrating this term over a fixed interval of values of R s and again over a fixed interval of ρ 0 , result in a constant that depends on N data , r min and δ r . Thus the marginalisation only results in a constant that can be subsumed within the unknown constant of proportionality that we do not require the exact computation of, given that posterior samples are generated using adaptive Metropolis-Hastings (Haario et al., 2006). Thus we can write down the joint posterior probability of ρ, F and N ℓ given the data as: We discuss the implemented inference next.

Inference
We intend to make inference on each component of the vector ρ and the matrix F, along with N ℓ . We do this under the constraints of a gravitational mass density function ρ(R) that is non-increasing with R and a pdf f (E, L) of the state space variable that is non-increasing with E. Motivation for these constraints is presented in Section 5.8. In other words, ρ b ≥ ρ b+1 and f c,d ≤ f c+1,d for b = 1, 2, . . . , N x and ρ Nx+1 := 0. Also, here c = 1, 2, . . . , N ǫ − 1 and d = 1, 2, . . . , N ℓ . First we discuss performing inference on ρ using adaptive Metropolis-Hastings (Haario et al., 2006), while maintaining this constraint of monotonicity. We define with ρ Nx+1 := 0.
( 6.1) It is on the parameters ∆ 1 , ∆ 2 , . . . , ∆ Nx−1 that we make inference. Let within our inference scheme, at the n-th iteration, the current value of ∆ b be δ where the choice of a folded normal (Leone, Nottingham and Nelson, 1961) or truncated normal proposal density is preferred over a density that achieves zero probability mass at the variable value of 0. This is because there is a non-zero probability for the gravitational mass density to be zero in a given radial bin. Here µ b and σ 2 b are the mean and variance of the proposal density that ∆ b is proposed from. We choose the current value of ∆ b as µ b and in this adaptive inference scheme, the variance is given by the empirical variance of the chain since the n 0 -th iteration, i.e.
We choose the folded normal proposal density given its ease of computation: It is evident that this is a symmetric proposal density. We discuss the acceptance criterion in this standard Metropolis-Hastings scheme, after discussing the proposal density of the components of the matrix F and the parameter N ℓ .
is accepted, then the updated b-th component of ρ in the n-th iteration is ρ b (n) = ρ b+1 (n) +δ b (n) . If the proposed candidate is rejected then ρ b (n) resorts back to ρ b (n) = ρ b+1 (n) + δ b (n) .
Along similar lines, we make inference directly on (6.5) Let in the n-th iteration, the current value of Γ c,d be γ c,d where the proposed candidate is sampled from the folded normal density N f olded (γ c,d ) 2 is again the empirical variance of the chain between the n / 0 -th and the n − 1-th iteration. Then the updated general element of the state space pdf matrix in this iteration is f d (n) . Thus, the proposal density that a component of the F matrix is proposed from is also symmetric.
We propose N ℓ from the discrete uniform distribution, i.e. the proposed value of N ℓ in the n-th iteration is where the bounds of the interval [z 1 , z 2 ] are found experimentally given the data at hand. Given that we are making inference on the {∆ b } Nx b=1 and {Γ c,d } Nǫ,N ℓ c=1,d=1 , we rephrase the posterior probability of the unknowns as π(∆ 1 , . . . , , ∆ Nx , Γ 1,1 , . . . , Γ Nǫ,N ℓ , N ℓ |u 1 , . . . , u N data ). This posterior density is proportional to the RHS of Equation 5.51.

Illustration on synthetic data
In this section we illustrate the methodology on synthetic data set simulated from a chosen models for the pdf of W = (S T , V T ) T . N data =198. The chosen models for this pdf are f W D (E(S, V ), L(S, V )) or f W D (E, L) and f Michie (E, L). These are given by: where E(S, V ) = Φ(ρ Model (R)) + V 2 /2 with ρ Model (R) chosen in both models for the state space pdf to be ρ Model (R) = 3M 4πa 3 1 + r 2 a 2 −5/2 . Here the model parameters r a > 0 and σ 0 are assigned realistic numerical values. From these 2 chosen pdf s, N data values of U were sampled; these 2 samples constituted the 2 synthetic data sets D W D and D Michie . The learnt gravitational mass density parameters and discretised version of the state space pdf are displayed in Figure 2. Some of the convergence characteristics of the chains are explored in Figure 3. The trace of the joint posterior probability of the unknown parameters given the data is shown along with histograms of ρ 2 learnt from 3 distinct parts of the chain that is run using data D W D .

Illustration on real data
In this section we present the gravitational mass density parameters and the state space pdf parameters learnt for the real galaxy NGC3379 using 2 data sets D P N e and D GC which respectively have sample size 164 (Douglas et al., 2007) and 29 (Bergond et al., 2006). An independent test of hypothesis exercise shows that there is relatively higher support in D GC for an isotropic pdf of the state space variable W = (S T , V T ) T than in D P N e . Given this, some runs were performed using an isotropic model of the state space pdf ; this was achieved by fixing the number N ℓ of L-bins to 1. Then L identically takes the value ℓ 1 and is rendered a constant. This effectively implies that the domain of f (E, L) is rendered uni-dimensional, i.e. the state space pdf is then rendered f (E). Recalling the definition of an isotropic function from Remark 5.1, we realise that the modelled state space pdf is then an isotropic function of S and V . Results from chains run with such an isotropic state space pdf were overplotted on results from chains run with the more relaxed version of the pdf that allows for incorporation of anisotropy; in such chain, N e ll is in fact learnt from the data.  )) ǫ, at two different ℓ, recovered from a chains that use data D P Ne . The modal value of the learnt number of L-bins is 7 for this run. The state space pdf parameters recovered using data D GC are shown in black. Middle: Gravitational mass density parameters ρ i estimated from a chain run with D P Ne are shown in magenta, over-plotted on the same obtained using the same data, from a chain in which the number of L-bins, N ℓ = 1. When N ℓ is fixed as 1, it implies that L(S, V ) is then no longer a variable and then f (E, L) is effectively univariate, depending on E(S, V ) alone. Such a state space pdf is an isotropic function of S and V (see Remark 5.1). The ρ i estimated from such an isotropic pdf of the state space variable is shown here in green. The mass density parameters learnt using the data D GC -again learnt from an isotropic state space pdf -are shown in black. Right: Figure showing estimates of M i = i j=1 4πρ j δ 2 r (j 2 − (j − 1) 2 ), against R. Here i = 1, 2, . . . , Nx. The parameters in magenta are obtained from the same chain that produce the ρ i parameters in the middle panel using D P ne while those in green and black are obtained using the ρ i that were represented in the middle panel in the corresponding colours.

Discussions
In this work we focused on an inverse problem in which noisy and partially missing data on the measurable U = (X 1 , X 2 , V 3 ) T is used to make inference on the model parameter vector ρ which is the discretisation of the unknown model function ρ( X ) ≡ ρ( S ), where S is an orthogonal transformation of X and X = (X 1 , X 2 , X 3 ) T . The measurable and the sought function are related via an unknown function. Given that the very Physics that connects U to ρ(R) is unknown-where R := S -we cannot construct training data, i.e. data comprising a set of computed u for a known ρ(r). In the absence of training data, we are unable to learn the unknown functional relationship between data and model function, either using splines/wavelets or by modelling this unknown function with a Gaussian Process. We then perform the estimation of ρ(R) at chosen values of R, i.e. discretise the range of values of R and estimate the vector ρ instead, where ρ i is the value of ρ(r) for r in the i-th R-bin. We aim to write the posterior of ρ given the data. The likelihood could be written as the product of the values of the pdf of the state space vector W = (X T , V T ) T achieved at each data point, but the data being missing, the pdf is projected onto the space of U and the likelihood is written in terms of these projections of the pdf . ρ is embedded within the definition of the domain of the pdf of W . The projection calls for identification of the mapping between this domain and the unobserved variables X 3 , V 1 , V 2 ; this is an application specific task. The likelihood is convolved with the error distribution and vague but proper priors are invoked, leading to the posterior probability of the unknowns given the data. Inference is performed using adaptive MCMC. The method is used to learn the gravitational mass density of a simulated galaxy using synthetic data, as well as that in the real galaxy NGC3379, using data of 2 different kinds of galactic particles. The gravitational mass density vector estimated from the 2 independent data sets are found to be distinct.
The distribution of the gravitational mass in the system is indicated by the function M (r) = r r ′ =0 4π(r ′ ) 2 ρ(r ′ )dr ′ .
the discretised form of this function defines the parameters M i , i = 1, 2, . . . , N x . These are computed using the learnt value of the ρ i parameters and plotted in Figure 4. We notice that the estimate of ρ i can depend on the model chosen for the state space pdf ; thus, the same galaxy can be inferred to be characterised by a higher gravitational mass distribution depending on whether an isotropic state space is invoked or not. Turning this result around, one can argue that in absence of priors on how isotropic the state space of a galaxy really is, the learnt gravitational mass density function might give an erroneous indication of how much gravitational mass there is in this galaxy and of corse how that mass is distributed. It may be remarked that in lieu of such prior knowledge about the topology of the system state space, it is best to consider the least constrained of models for the state space pdf , i.e. to consider this pdf to be dependent on both E(S, V ) and L(S, V ).
It is also to be noted that the estimate for the gravitational mass density in the real galaxy NGC3379 appears to depend crucially on which data set is being implemented in the estimation exercise. It is possible that the underlying pdf of the variable W = (S T , V T ) T is different for the sub-volume of state space that one set of data vectors are sampled from, compared to another. As these data vectors are components of S and V of different kinds of galactic particles, this implies that the state space pdf that the different kinds of galactic particles relax into, are different.