Optimizing the Matrix Element Method with Deep Neural Networks for High-Energy Physics Analyses
1. Introduction
Developments in experimental and theoretical high-energy physics (HEP) analyses increasingly employ modern machine learning methods, including deep learning, which are rapidly gaining applicability. Machine learning, a widely used automated inference procedure in HEP, is often associated with multivariate techniques such as Boosted Decision Trees (BDTs) [1]. In recent years, the set of methods and tools commonly employed in HEP has expanded significantly due to the deep learning revolution. To fully leverage the information contained in a given event, such as those produced at the Large Hadron Collider (LHC), statistical analysis calls for improved machine learning techniques [2]. Neural networks and BDTs are used in HEP to learn from simulations or from data. Nonetheless, difficulties include acquiring and generalizing from physical data, the need for control samples or regularization strategies, and the limited possibility of reverse-engineering what the model has learned because the training data are only indirectly related to the quantity of interest [3] [4].
In contrast, the Matrix Element Method (MEM) leverages our understanding of the Standard Model (SM), through its Lagrangian, to assess the compatibility of experimental events with a hypothesized process [2]. The built-in knowledge of the physics process and the detector response makes the internal dynamics of the MEM easier to interpret than those of machine learning techniques. Moreover, because no training step is involved, the MEM can be used effectively even when the dataset contains very few events, a scenario in which other methods struggle. The MEM, which originated at the Tevatron experiments, is widely used in particle physics analyses. At the LHC it is used for searches and for measurements of processes such as top quark and anti-quark pair production, Higgs boson decays, and spin correlations. However, its complexity and computational demands make it more CPU-intensive than other methods [5]. The result of the matrix element integration can be approximated with a Deep Neural Network (DNN), which makes it possible to use the MEM for parameter scans, searches for new physics, and other applications. In the absence of a closed or computationally feasible form, the probabilities given by MoMEMta can be treated as unknown functions of the particle 4-momenta. Given appropriate assumptions and a sufficiently large width distributed across several layers, a neural network can approximate any such function. While there is no guarantee that the MEM output satisfies all of these assumptions, the argument sufficiently motivated this study. Compared to matrix element evaluation based on direct integration, this approach can produce results much more quickly: DNNs often require far less computation time than classical MEM integrations, needing only a simulated sample and several hours of training. Advancements in the MEM field, such as GPU acceleration, parallel computing, BDTs, neural networks, and normalizing flows, can reduce computation times and avoid the need to optimize integration variables [2].
In this study, I would like to understand the details of the computations and probability distributions required for the MEM, how they can be fitted with a DNN, the topology used for the Drell-Yan weight model, and the application to real-world analyses of selected physics processes based on experimental data [2].
2. Methodology
1) Regression is a machine learning approach used to predict continuous outcomes of interactions and decays based on input variables. DNNs are simply neural networks with many layers between the input and output layers, which permits learning of complex patterns in the data [5] [6]. To train the neural network, input features derived from experimental data, such as particle momenta, energies, and other pertinent variables, are fed into the system. Using theoretical calculations or Monte Carlo simulations to obtain the true distribution, the network is trained to minimize the difference between its predicted probability distribution and the true one. To improve its predictions, the network iteratively adjusts its internal parameters (weights and biases) through techniques such as gradient descent and backpropagation [7]. First-principles computations for high-energy processes are carried out in a power-series expansion, order by order, which can be written schematically as:
$F = \sum_{\ell} c^{\ell} f_{\ell}$ (1)

where $c$ is a small expansion parameter. The functions $f_{\ell}$ are related to the real parts of products of complex matrix elements (a generic term is of the form $f_{\ell} \sim \operatorname{Re}\!\left(\mathcal{M}_i^{*}\mathcal{M}_j\right)$, where $\mathcal{M}$ is called the matrix element), and the index $\ell$ gives the power of the expansion coefficient. The subscript also corresponds to the number of closed loops contributing to the Feynman graphs entering the matrix element of a given scattering process. Here the second-order term $f_2$ (the order being given by the power of $c$) is the term of interest; $f_2$ is far more expensive in computing time and memory than $f_1$. This difference stems from the Feynman graphs: those contributing to $f_2$ contain two loops, while those contributing to $f_1$ contain one loop. To speed up the computation of these functions, a regressor can be built from a representative sample. Speeding up Monte Carlo simulations is highly relevant not only in particle physics but in a very large set of problems, addressed by all branches of physics, that rely on a perturbative expansion like the one in Equation (1) [8]. However, the regressors must deliver highly accurate predictions over the whole function domain in order to reach the precision required for matching with experimental data. The requirements for the regressors, in particular the definition of high precision, are as follows:
1) High precision: prediction error < 1% over more than 90% of the domain of the function.
2) Speed: prediction time per data point of $<10^{-4}$ seconds.
3) Lightweight: the disk size of the regressors should be a few megabytes at the most for portability.
Keeping the prediction error, whose exact definition is given in Equation (2), small compared to, for instance, the Monte Carlo statistical error guarantees that propagating the approximation error on $f_2$ to the entire function $F$ remains a sub-leading source of error:

$\delta(x) = \dfrac{\hat{f}(x) - f(x)}{f(x)}$ (2)

where $\delta$ gives the difference between the predicted value of the function, $\hat{f}$, and its true value, $f$, normalized by $f$ [8].
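As an illustration of how the precision requirement can be checked in practice, the sketch below computes the relative prediction error of Equation (2) and the fraction of validation points with error below 1%; the true and predicted values are synthetic stand-ins for the loop functions and a regressor's output, not actual data.

```python
import numpy as np

def prediction_error(f_pred, f_true):
    """Relative prediction error as in Equation (2):
    delta = (f_pred - f_true) / f_true."""
    f_pred = np.asarray(f_pred, dtype=float)
    f_true = np.asarray(f_true, dtype=float)
    return (f_pred - f_true) / f_true

def precision_coverage(f_pred, f_true, tolerance=0.01):
    """Fraction of validation points whose relative error is below
    the tolerance (1% for the 'high precision' target above)."""
    delta = prediction_error(f_pred, f_true)
    return np.mean(np.abs(delta) < tolerance)

# Toy usage with synthetic values standing in for the true and
# regressed loop functions (illustrative numbers only).
rng = np.random.default_rng(0)
f_true = rng.uniform(0.5, 2.0, size=10_000)
f_pred = f_true * (1.0 + rng.normal(0.0, 0.004, size=10_000))
print(f"coverage below 1%: {precision_coverage(f_pred, f_true):.3f}")
```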
The functions $f_{\ell}$ in Equation (1) represent the terms that arise in the production of four charged leptons in proton-proton collisions, $pp \to 4\ell$, where each lepton pair is mediated by an electrically neutral electroweak gauge boson, a $Z$ boson or a photon ($\gamma^{*}$). The leading-order function of this process, $f_0$, equals the square of the tree-level scattering amplitude. Higher-order terms contain graphs with internal legs forming closed loops, whereas graphs without such legs contribute to the tree-level amplitude. For instance, adding the second-order term improves the accuracy of the predicted four-electron production rate at the cost of a computing-time penalty of roughly a factor of 1500 [8]-[10].
2) Feature space and symmetries. The functions in question are maps from a $d$-dimensional feature space to the real numbers, with $d = 2$, 4, or 8. The feature space, i.e. the domain of the functions, is sampled from a uniform distribution and linearly mapped to the unit hypercube [10]. For instance, the physical phase-space coordinates in the 4-dimensional case are the total energy of the process, the scattering angle of the di-boson system, and the masses of the two bosons. Since the bosons are allowed to be off their mass shell, these masses are not fixed in general [8] [9]. The 8-dimensional case adds the angles of the leptons in the final state, while the 2-dimensional case has fixed masses. As a result, different physics analyses will call for different regressors. The 4D regressors, on the other hand, are more versatile because they have no fixed physical parameters [8]. The smaller number of necessary functions in the third column of Table 2 is obtained by leveraging the symmetry properties discussed below, which derive from physics domain knowledge; for the data used in Ref. [8], they stem from the symmetries manifest in the simulated scattering process. The last column of Table 2 indicates whether summing the functions has a physical meaning; in the cases where it does, i.e. 2D and 8D, only a single regressor for the sum of the functions is required, while the total number of functions for each dimensionality is given in the second column. Certain exchanges of the external particles specified by the method allow pairs of functions to be mapped into each other. In feature space, this corresponds to a linear transformation of the second coordinate, applied either on its own or in conjunction with the permutation of the third and fourth coordinates (the two boson masses). For instance, in 4D, the number of independent functions is reduced from 162 to 25 by two such permutations, as seen in Table 3, where $p_i$ represents a particle with label $i$; an illustrative sketch of how such a permutation can be exploited follows Table 3.
Table 1. The dimensionality determines how many of the resulting functions, technically known as helicity amplitudes, there are [8].

| Variable | 2D | 4D and 8D |
| --- | --- | --- |
|  |  |  |
|  |  |  |
|  |  |  |
Table 2. Symmetry characteristics reduce the number of necessary functions [8].

| Dimensionality | Total functions | Independent functions | Sum is physical? |
| --- | --- | --- | --- |
| 2D | 18 | 5 | Yes |
| 4D | 162 | 25 | No |
| 8D | 8 | 4 | Yes |
Table 3. Symmetry representation [8].

| Permutation | Particle Symmetry | Coordinate Symmetry |
| --- | --- | --- |
|  |  |  |
|  |  |  |
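To make the symmetry argument concrete, here is a minimal sketch in which the coordinate ordering, the choice of swapped indices, and the toy regressor are illustrative assumptions rather than the exact transformations of Ref. [8]. It shows how a permutation of two feature-space coordinates lets one trained regressor serve for a pair of functions that are mapped into each other.

```python
import numpy as np

def swap_masses(x):
    """Map a feature-space point onto its symmetry partner by
    permuting the third and fourth coordinates (assumed here to be
    the two boson masses of the 4D parameterization)."""
    x = np.array(x, dtype=float, copy=True)
    x[..., [2, 3]] = x[..., [3, 2]]
    return x

def evaluate_partner(regressor, x):
    """Evaluate the partner function of a trained regressor by
    transforming the input instead of training a second model."""
    return regressor(swap_masses(x))

# Toy usage: a stand-in 'regressor' that is symmetric by construction.
toy_regressor = lambda x: np.sin(x[..., 0]) + x[..., 2] * x[..., 3]
point = np.array([0.3, 0.7, 0.1, 0.9])
assert np.isclose(toy_regressor(point), evaluate_partner(toy_regressor, point))
```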
3) Describing the Matrix Element Method. The methodology can, in principle, be applied to any measurement; however, processes involving intermediate resonances and leading to many-particle final states are expected to yield the largest improvement compared with cut-based analysis techniques. Generally speaking, the MEM allows the simultaneous determination of multiple unknown parameters (both experimental and theoretical parameters characterizing the detector response and the physics processes being measured) in a single measurement, thereby reducing systematic uncertainties [11]. The foundation of the Matrix Element technique is the likelihood $\mathcal{L}$ to observe a sample of selected events in the detector. The likelihood is computed as a function of the assumed values of each of the parameters to be measured and is derived directly from the theoretical prediction for the detector resolution and the differential cross-sections of the relevant processes. The measurement of the parameters is obtained by minimizing $-\ln\mathcal{L}$, where the likelihood $\mathcal{L}$ for the total event sample is calculated by multiplying the likelihoods of observing each individual event [11] [12]. Most analysis techniques in experimental particle physics, in contrast, compare distributions from observed events in the detector with corresponding distributions from simulated events generated theoretically, passed through a detector simulation, and then reconstructed using the same event reconstruction software [13]. The sample likelihood $\mathcal{L}$ for $N$ measured events with measured properties $x_1, \ldots, x_N$ can be written as:

$\mathcal{L}\left(x_1, \ldots, x_N \mid \boldsymbol{\Omega}, \boldsymbol{\delta}\right) = \prod_{i=1}^{N} \mathcal{L}_{\text{evt}}\left(x_i \mid \boldsymbol{\Omega}, \boldsymbol{\delta}\right)$ (3)

where the symbol $\boldsymbol{\Omega}$ denotes the assumed values of the physics parameters to be measured, $\boldsymbol{\delta}$ stands for the parameters describing the detector response that are to be determined, and $\mathcal{L}_{\text{evt}}$ is defined below. The likelihood $\mathcal{L}_{\text{evt}}$ to observe event $x_i$ under the assumption of parameter values $\boldsymbol{\Omega}$ and $\boldsymbol{\delta}$ is given as the linear combination:

$\mathcal{L}_{\text{evt}}\left(x_i \mid \boldsymbol{\Omega}, \boldsymbol{\delta}\right) = \sum_{p} f_{p}\, \mathcal{L}_{p}\left(x_i \mid \boldsymbol{\Omega}, \boldsymbol{\delta}\right)$ (4)

where the sum runs over all individual processes $p$ that could have led to the observed event $x_i$, $\mathcal{L}_{p}$ is the likelihood to observe this event under the assumption that it was produced via process $p$, and $f_{p}$ denotes the fraction of events from process $p$ in the entire event sample, with $\sum_{p} f_{p} = 1$. In total, the physics parameters $\boldsymbol{\Omega}$, the detector response described by $\boldsymbol{\delta}$, and the event fractions $f_{p}$ are determined simultaneously from the minimization of $-\ln\mathcal{L}$ [12].
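As a minimal numerical sketch of Equations (3) and (4), assuming the per-event, per-process likelihood values $\mathcal{L}_p(x_i)$ are already available (here they are random stand-ins) and fitting only the process fractions $f_p$ (the parameters $\boldsymbol{\Omega}$ and $\boldsymbol{\delta}$ are left out for brevity), the sample negative log-likelihood can be built and minimized as follows.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(fractions, per_process_likelihoods):
    """-ln L of Eq. (3): product over events of the mixture of
    Eq. (4), sum_p f_p * L_p(event), with sum_p f_p = 1."""
    f = np.asarray(fractions)
    event_likelihoods = per_process_likelihoods @ f   # shape: (n_events,)
    return -np.sum(np.log(event_likelihoods))

# Toy per-event likelihoods for two processes
# (rows: events, columns: processes).
rng = np.random.default_rng(1)
L_p = rng.uniform(1e-3, 1.0, size=(500, 2))

# Minimize over the single free fraction f_1 (f_2 = 1 - f_1 enforces the sum rule).
result = minimize(
    lambda f1: neg_log_likelihood([f1[0], 1.0 - f1[0]], L_p),
    x0=[0.5], bounds=[(0.0, 1.0)],
)
print("fitted fraction f_1 =", result.x[0])
```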
The MEM is used to calculate $P(x \mid \alpha)$, the likelihood of observing an experimental event $x$ given a theoretical hypothesis $\alpha$. Here, $x$ stands for the 4-momenta of the particles detected in the final state. One must distinguish the parton-level particles $y$, produced at the interaction point before hadronization and detection, from the experimentally observed particles $x$. The hypothesis $\alpha$ can denote different models or any set of parameters (such as the mass of a resonance) [2]. For hadron colliders, the probability that a hard scattering results in a partonic final state $y$ is proportional to the differential cross section, which is calculated as

$\mathrm{d}\sigma_{\alpha}(y) \propto \dfrac{\left|\mathcal{M}_{\alpha}(y)\right|^{2}}{x_1 x_2 s}\, \mathrm{d}\Phi_n(y)$ (5)

where $x_1$ and $x_2$ are the initial-state parton momentum fractions and $\sqrt{s}$ is the center-of-mass energy. $\mathrm{d}\Phi_n(y)$ is the $n$-body phase space of the final state $y$, while $\mathcal{M}_{\alpha}$ denotes the matrix element for the given process $\alpha$.
The parton distribution functions (PDFs) $f_{i}(x_i)$ (one for each initial-state parton $i$ of a given flavor), the efficiency $\epsilon(y)$ to reconstruct and select the hadronic state $y$, and the transfer function $T(x \mid y)$, normalized with respect to $x$, all enter the relation between the parton-level 4-momenta $y$ and the experimentally observed ones $x$. The transfer function parameterizes the hadronization, the parton shower, and the detector response (which blurs the momenta of the detected particles because of the finite detector resolution) [2] [13]. A convolution of the differential cross section with the transfer function and a sum over the initial states yields the probability $P(x \mid \alpha)$:

$P(x \mid \alpha) = \dfrac{1}{\sigma_{\alpha}^{\text{vis}}} \sum_{i_1, i_2} \int \mathrm{d}x_1\, \mathrm{d}x_2\; f_{i_1}(x_1)\, f_{i_2}(x_2)\; \mathrm{d}\sigma_{\alpha}(y)\; T(x \mid y)\, \epsilon(y)$ (6)

where $\sigma_{\alpha}^{\text{vis}}$ stands for the visible (observed) cross-section and ensures that the probability is normalized. It is often computed a posteriori as $\sigma_{\alpha}^{\text{vis}} = \sigma_{\alpha}\,\langle\epsilon\rangle$, where $\sigma_{\alpha}$ is the total cross-section and $\langle\epsilon\rangle$ is the expectation value of the selection efficiency. The MEM weights are defined as $W_{\alpha}(x) \equiv P(x \mid \alpha)$. In addition, the weights can span several orders of magnitude, which is why most of the time we will use the event information, defined as $-\log W_{\alpha}(x)$, or a variant that differs from it only by a constant term. Several complex processes are involved in the transfer function, and in order to make its integration easier, a number of assumptions are typically made. To first order, each particle in the final state is detected and measured independently, so the transfer function can be factorized over the particles. This argument can be taken further by also factorizing the different components of the measured 4-momentum [2]. Thus, the transfer function is written as

$T(x \mid y) = \prod_{i} T_{i}\left(x_{i} \mid y_{i}\right)$ (7)

where the index $i$ refers to the final-state particles. The resolutions in $\eta$ and $\phi$ are typically very good, so the corresponding factors are taken as (very narrow) delta functions. Conversely, the energy resolution depends on the particle characteristics and must reproduce the behavior of the simulated detector effects, which are usually Gaussian in the case of fast simulations [2] [11] [13].
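A small sketch of the factorized transfer function of Equation (7), under the assumptions stated above: the angular resolutions are treated as exact (the delta functions simply set the parton-level angles to the reconstructed ones), and the energy resolution is a Gaussian whose width, here $0.15\sqrt{E}$, is a purely illustrative choice rather than a tuned detector model.

```python
import numpy as np

def energy_transfer(E_reco, E_parton, resolution=0.15):
    """Gaussian energy transfer function for one particle with an
    illustrative resolution sigma = resolution * sqrt(E_parton)."""
    sigma = resolution * np.sqrt(E_parton)
    return np.exp(-0.5 * ((E_reco - E_parton) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def transfer_function(reco_energies, parton_energies):
    """Factorized transfer function of Eq. (7): product over the
    final-state particles of the per-particle energy terms; the
    delta functions in the angles are handled by evaluating the
    parton-level angles at the reconstructed values."""
    terms = [energy_transfer(Er, Ep) for Er, Ep in zip(reco_energies, parton_energies)]
    return float(np.prod(terms))

# Two final-state particles with reconstructed vs. parton-level energies (GeV).
print(transfer_function([48.0, 95.0], [50.0, 100.0]))
```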
4) Drell-Yan process and weights. In this process, a virtual photon or Z boson is created when a quark from one hadron and an antiquark from another hadron annihilate. This virtual particle then decays into two oppositely charged leptons (e.g., electrons or muons). The process is important because it yields valuable information on the parton distribution functions (PDFs), which characterize how momentum is distributed among the partons (quarks and gluons) of an incoming high-energy nucleon. The Drell-Yan mechanism is closely related to deep inelastic scattering (DIS): the Feynman diagram for the Drell-Yan process can be obtained by rotating the DIS Feynman diagram by 90 degrees [14].
5) DNN decisions and fitting with the MEM. The DNN architectures used in Ref. [8] consist of fully connected layers, with linear activations for the output layer and Leaky ReLU activations for the hidden layers, see Figure 1. In a comparison of activation functions for deep neural networks with skip connections, the Leaky ReLU performed better than alternatives such as ReLU, softplus, and ELU [9] [15] [16], while remaining computationally affordable. To evaluate the progress of the training stage, the coefficient of determination, $R^2$, can be computed as

$R^{2} = 1 - \dfrac{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{f}(x_i) - f(x_i)\right)^{2}}{\sigma_{f}^{2}}$ (8)

where $\sigma_{f}^{2}$ is the true values' variance over the $N$ validation points identified by the index $i$. Keep in mind that the mean-squared error, i.e. the chosen objective function, is proportional to $1 - R^{2}$. While the mean-squared error alone does not allow us to compare different models self-consistently, the constant of proportionality, $1/\sigma_{f}^{2}$, standardizes it in a way that makes $R^2$ a meaningful figure of merit during the training phase [8] [10] [13].
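The figure of merit of Equation (8) can be computed in a few lines; the validation values below are synthetic stand-ins for the true function and a DNN prediction.

```python
import numpy as np

def r_squared(f_pred, f_true):
    """Coefficient of determination of Eq. (8): one minus the
    mean-squared error normalized by the variance of the truth."""
    f_pred = np.asarray(f_pred, dtype=float)
    f_true = np.asarray(f_true, dtype=float)
    mse = np.mean((f_pred - f_true) ** 2)
    return 1.0 - mse / np.var(f_true)

# Toy validation set standing in for the true function and a DNN prediction.
rng = np.random.default_rng(2)
f_true = rng.normal(0.0, 1.0, size=5_000)
f_pred = f_true + rng.normal(0.0, 0.1, size=5_000)
print(f"R^2 = {r_squared(f_pred, f_true):.4f}")
```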
Figure 1. The building block with skip connections in a DNN. The first two layers are fully connected with Leaky ReLU activations. The last layer has a linear activation; after being added to the input via the skip connection, the result is transformed by a non-linear Leaky ReLU function. If the input and output dimensions of the block differ, the skip connection (transformation gate) applies a trainable weight matrix. The neural network is built by stacking these blocks one after the other [8].
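A possible PyTorch rendering of the building block of Figure 1, assuming two fully connected Leaky ReLU layers followed by a linear layer, a skip connection with a trainable projection when the dimensions differ, and a final Leaky ReLU; the layer widths and stacking depth are illustrative, not those of Ref. [8].

```python
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    """Building block of Figure 1: two Leaky ReLU layers, a linear
    layer, a (possibly projected) skip connection, then Leaky ReLU."""

    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.LeakyReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.LeakyReLU(),
            nn.Linear(hidden_dim, out_dim),          # linear activation
        )
        # Trainable weight matrix on the skip path if dimensions differ.
        self.skip = nn.Identity() if in_dim == out_dim else nn.Linear(in_dim, out_dim, bias=False)
        self.activation = nn.LeakyReLU()

    def forward(self, x):
        return self.activation(self.body(x) + self.skip(x))

# The regressor is built by stacking blocks one after the other.
model = nn.Sequential(SkipBlock(4, 64, 64), SkipBlock(64, 64, 64), nn.Linear(64, 1))
print(model(torch.rand(8, 4)).shape)   # torch.Size([8, 1])
```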
Depending on how sophisticated the process is, calculating an event's weight with MoMEMta can take from a few seconds to several minutes. It quickly becomes prohibitive to repeat this for every hypothesis $\alpha$ and every event. In real-world analyses with several hypotheses, and possibly extra model parameters, this frequently complicates the use of the method. In brief, the approach of Ref. [2] processes simulated samples with MoMEMta and generates the event information for various hypotheses. The outcome is then used to train a DNN on the MoMEMta inputs, namely the 4-momenta of every observable particle. The 4-momenta of all the detected particles and the missing transverse energy (its magnitude and azimuthal angle) are the inputs of MoMEMta; as such, these are the inputs we wish to supply to the DNN [2]. Depending on the longitudinal momentum difference between the initial partons that collide in the detector, the produced particles may be boosted along the beam direction. The network's capacity to describe the interesting part of the matrix element would be hampered by having to learn this longitudinal boost in addition to the function in Equation (6). The situation is greatly improved by using the $p_T$, $\eta$, and $\phi$ of each particle as inputs, since the angular separation between two particles is, to a good approximation, independent of the longitudinal boost [2].
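A short sketch of this input preprocessing, converting Cartesian 4-momenta into ($p_T$, $\eta$, $\phi$, $E$) features; the exact feature set used in Ref. [2] may differ, so this only illustrates the longitudinal-boost argument.

```python
import numpy as np

def to_detector_coordinates(px, py, pz, energy):
    """Convert a Cartesian 4-momentum into (pT, eta, phi, E).
    pT and phi are invariant under longitudinal boosts, and
    differences in eta are approximately invariant as well."""
    pt = np.hypot(px, py)
    p = np.sqrt(px**2 + py**2 + pz**2)
    eta = 0.5 * np.log((p + pz) / (p - pz))
    phi = np.arctan2(py, px)
    return pt, eta, phi, energy

# Example: one lepton 4-momentum in GeV (illustrative values).
print(to_detector_coordinates(px=30.0, py=10.0, pz=120.0, energy=124.1))
```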
3. Proof of Concept and Application to $H \to ZA$ Production
As a proof of concept, let us apply the MEM to events with two opposite-sign leptons and two jets initiated by b quarks (b jets) as detected particles. What makes this topology intriguing is that its primary contributions, Drell-Yan production with additional jets and top quark pair production ($t\bar{t}$) with leptonic decays of the $W$ bosons from the top quarks, are treated very differently in the integration. While there are no missing particles in the former case, the latter involves undetected neutrinos whose degrees of freedom must be taken into account. We also consider the resonant $H \to ZA$ process that arises in the context of Two Higgs Doublet Models (2HDM) and has been studied by the ATLAS and CMS collaborations [17] [18]. The unknown masses of the $H$ and $A$ bosons will demonstrate the effectiveness of the method when the parameter space is multi-dimensional, which is precisely the situation where classical integration is impracticable. The multiple resonances in this process also pose an interesting challenge from the integration point of view [2] [16]. A summary of the events for which the weights have been calculated indicates that, in order to prevent the network from becoming unbalanced, only 500 K events have been employed in the training for this process [2]. For both $m_H$ and $m_A$, the signal samples and weights are divided into 23 mass configurations up to 1 TeV. For each event of the $H \to ZA$, Drell-Yan, and $t\bar{t}$ samples, the event information is evaluated 23 times using the mass parameters specific to the $H \to ZA$ case. The data given in Ref. [2] also include the number of faulty weights and the number that were recomputed with additional iterations. Figure 2 displays the event-information distributions for each of the three sample types. The weights from the MEM computed with MoMEMta and those from the DNN agree quite well. The model has six layers of 200 neurons, with ReLU and SELU [9] [15] [16] activation functions for the hidden and output layers, respectively. A change of variables that exploits the Breit-Wigner resonances keeps the computation time reasonable (approximately 3.2 times slower than for the Drell-Yan weights) [1] [8]. Figure 3 displays the corresponding event-information distributions. Although the weight distributions of the Drell-Yan and $t\bar{t}$ samples differ less, the Drell-Yan events have a longer tail. The distinct mass configurations $(m_H, m_A)$ that make up the signal sample are the source of the double peak observed in the $H \to ZA$ case: low weights are associated with high (pseudo)scalar masses, while lower masses yield weights closer to the bulk of the distribution. Overall, there is good agreement between the weights from the DNN and those obtained classically [2]. Turning to the application of the MEM at next-to-leading order (NLO) in single top-quark production, we can examine, as an example, the impact of missing higher-order effects in the likelihood computation. A single top-quark event sample is produced with POWHEG [19] [20] and showered with PYTHIA8 [21] [12]. By analyzing these events with the MEM based on fixed-order calculations, one can identify the specific ways in which the parton shower impacts the analysis. The extraction of the top-quark mass with the MEM from showered events produced with POWHEG and PYTHIA8 is depicted in the left plot of Figure 4. In the study displayed in Figure 4, the likelihood function is computed either in the Born approximation (blue) or with NLO corrections (red). A parabola is fitted to the negative logarithm of the likelihood function; the estimator for the top-quark mass is taken as the position of the minimum and the related statistical uncertainty as the width of the parabola. The theoretical uncertainty resulting from missing higher orders is assessed by repeating the likelihood calculation with two different choices of the factorization and renormalization scales. The missing higher-order corrections in the corresponding likelihood calculations are the reason for the biases of the extracted estimators, at LO and at NLO, with respect to the input value of the top-quark mass. The MEM is known to provide biased estimators whenever the probability density used in the likelihood differs from the actual distribution of the events. Hence, the method needs to be calibrated, which introduces associated uncertainties.
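The mass extraction described above can be sketched as a parabola fit to a scan of the negative log-likelihood; the scan values below are invented for illustration and are not taken from Ref. [12].

```python
import numpy as np

def parabola_minimum(mass_points, neg_log_likelihood):
    """Fit -ln L(m) with a parabola a*m^2 + b*m + c and return the
    position of the minimum and the 1-sigma statistical uncertainty
    (where -ln L rises by 0.5 above its minimum)."""
    a, b, _ = np.polyfit(mass_points, neg_log_likelihood, deg=2)
    m_hat = -b / (2.0 * a)
    sigma = np.sqrt(0.5 / a)      # Delta(-ln L) = 0.5 defines the 1-sigma interval
    return m_hat, sigma

# Invented likelihood scan around an assumed top-quark mass.
masses = np.linspace(168.0, 178.0, 11)
nll = 0.5 * ((masses - 172.6) / 0.8) ** 2 + 1234.0
print(parabola_minimum(masses, nll))   # -> (172.6, 0.8)
```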
Figure 2. Distributions of the event information from both MoMEMta and the DNN for the three samples: Drell-Yan (left), $t\bar{t}$ (middle), and $H \to ZA$ (right) events [2].
Figure 3. Distributions of the event information from both MoMEMta and the DNN for the three samples: Drell-Yan (left), $t\bar{t}$ (middle), and $H \to ZA$ (right) events [2].
Figure 4. Left: top-quark mass extraction via the MEM at leading order (LO, blue) and NLO (red) from single top-quark events generated with POWHEG + PYTHIA8. Right: the same, but with extended likelihood functions [12].
4. Conclusion
This study proposes the use of a Deep Neural Network (DNN) to approximate the integral of the Matrix Element Method (MEM) with the objective of speeding up MEM computations. From the few examined and documented example processes, it can be inferred that, with the assistance of specialized tools like MoMEMta, a DNN can be effectively trained to closely replicate the outcomes of the direct numerical integration of the matrix element. Although the weights obtained from this DNN-based regression inherently contain errors, these are unlikely to significantly impact performance across various applications. The adoption of accelerated weight computations opens up many opportunities, such as studies of systematic uncertainties, likelihood scans, parameter scans, and, more broadly, the application of the MEM to a wider array of physics analyses, including searches for novel physics phenomena.
Acknowledgements
I wish to express my sincere gratitude to Professor Elena Long for her invaluable introduction to “Special Topics in Particle Physics”. The research by Simone Alioli et al. served as a significant impetus for this study.