Estimating Distribution Distance From Data
Continuous Random Sequences
Alexander S. Poznyak , in Advanced Mathematical Tools for Automatic Control Engineers: Stochastic Techniques, Volume 2, 2009
6.4.3 Distributional convergence
Definition 6.6.
- (a) The variational distance between the distributions F and G is
(6.60) Var(F, G) := sup_{B ∈ ℬ(ℝ)} |F(B) − G(B)|
- (b) The distributional distance between two random variables ξ and η, both defined on a probability space (Ω, ℱ, P), is
(6.61) d(ξ, η) := sup_{B ∈ ℬ(ℝ)} |P(ξ ∈ B) − P(η ∈ B)|
- (c) If ξ and {ξ_n}_{n≥1} are random variables such that
(6.62) d(ξ_n, ξ) → 0 as n → ∞,
we say that ξ_n converges to ξ in total variation as n → ∞.
Lemma 6.5. If ξ_n converges to ξ in total variation, then ξ_n converges to ξ in distribution.
Proof. Indeed, for every x ∈ ℝ,
|F_{ξ_n}(x) − F_ξ(x)| = |P(ξ_n ∈ (−∞, x]) − P(ξ ∈ (−∞, x])| ≤ d(ξ_n, ξ) → 0,
which establishes the desired result.
□
The next result deals with convergence in total variation for absolutely continuous random sequences.
Proposition 6.1. (Scheffé's Lemma) Suppose ξ and {ξ_n}_{n≥1} are absolutely continuous random variables with density functions p_ξ and p_{ξ_n}. Then
(6.63) d(ξ_n, ξ) ≤ ∫ |p_{ξ_n}(x) − p_ξ(x)| dx
and hence, if p_{ξ_n} → p_ξ almost everywhere, then ξ_n converges to ξ in total variation, and in particular, ξ_n converges to ξ in distribution.
Proof. By the triangle inequality, for any Borel set B,
|P(ξ_n ∈ B) − P(ξ ∈ B)| = |∫_B (p_{ξ_n}(x) − p_ξ(x)) dx| ≤ ∫ |p_{ξ_n}(x) − p_ξ(x)| dx.
Applying the supremum over B to both sides of the last inequality proves (6.63). If p_{ξ_n} → p_ξ almost everywhere, the right-hand side of (6.63) tends to zero by dominated convergence, since (p_ξ − p_{ξ_n})⁺ ≤ p_ξ and both densities integrate to one. The convergence in distribution follows from Lemma 6.5. □
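As a quick numerical illustration of Scheffé-type convergence (not from the chapter), the sketch below discretizes the total variation distance between normal densities, using the standard identity that the distributional distance equals one half of the L1 distance between the densities; the grid and the sequence of means are arbitrary assumptions:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def tv_distance(mu1, sigma1, mu2, sigma2):
    """Total variation distance via (1/2) * integral |p - q| on a fine grid."""
    grid = np.linspace(-20.0, 20.0, 400001)
    dx = grid[1] - grid[0]
    p = normal_pdf(grid, mu1, sigma1)
    q = normal_pdf(grid, mu2, sigma2)
    return 0.5 * float(np.sum(np.abs(p - q)) * dx)

# As the densities p_{xi_n} = N(1/n, 1) converge pointwise to p_xi = N(0, 1),
# the total variation distance shrinks, as Scheffe's Lemma predicts.
dists = [tv_distance(1.0 / n, 1.0, 0.0, 1.0) for n in (1, 2, 4, 8)]
```

The monotone decay of `dists` is exactly the convergence in total variation asserted after (6.63).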
URL:
https://www.sciencedirect.com/science/article/pii/B9780080446738000109
Casting Signal Processing to Real-World Data
Pau Closas , ... Erik G. Ström , in Satellite and Terrestrial Radio Positioning Techniques, 2012
Distribution-Based Identification Approach
A different method, which jointly exploits the complete set of waveforms instead of taking a decision on each single waveform, has been investigated in Ref. [8]. The idea is to estimate the probability distribution of the parameter of interest and to compare it with the reference distributions corresponding to LOS and NLOS propagation. The decision is taken in favor of the hypothesis whose reference distribution is at minimum "distance" from the estimated one. The distance between distributions has to be defined according to a certain metric; examples of such metrics are the Euclidean distance and the relative entropy, or Kullback–Leibler distance. The decision criterion is then given by
(7.8)
where denotes the estimated joint distribution, while and are the reference distributions of the two hypotheses. For N = 1 and equal prior probabilities for the two channel states, we have
(7.9)
The experimental results relating to this identification method presented in the next section are obtained using the (squared) Euclidean distance given by
(7.10)
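A minimal sketch of this minimum-distance identification rule, assuming hypothetical reference histograms for the LOS and NLOS hypotheses (the bin values below are invented for illustration and are not from Ref. [8]):

```python
import numpy as np

def squared_euclidean(p, q):
    """Squared Euclidean distance between two discrete distributions, in the
    style of Eq. (7.10)."""
    return float(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def identify(est, ref_los, ref_nlos):
    """Minimum-distance decision rule in the style of Eq. (7.8):
    pick the hypothesis whose reference distribution is closest."""
    d_los = squared_euclidean(est, ref_los)
    d_nlos = squared_euclidean(est, ref_nlos)
    return "LOS" if d_los < d_nlos else "NLOS"

# Hypothetical reference histograms of a ranging-error statistic
ref_los = np.array([0.70, 0.20, 0.10])   # errors concentrated near zero
ref_nlos = np.array([0.20, 0.30, 0.50])  # heavy positive bias
decision = identify([0.65, 0.25, 0.10], ref_los, ref_nlos)
```

The same skeleton works for any other metric (e.g., KL divergence) by swapping out `squared_euclidean`.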
URL:
https://www.sciencedirect.com/science/article/pii/B9780123820846000076
Game Theoretic Learning
Ceyhun Eksin , ... Alejandro Ribeiro , in Cooperative and Graph Signal Processing, 2018
7.4.2 Network-Based Fictitious Play for Incomplete Information Games
Given the computational and memory demands mentioned above, the QNG filter, while being conceptually appealing and having desirable asymptotic properties for supermodular quadratic games, is not practical. In the following, we provide an adaptation of the network-based FP algorithm discussed in Section 7.3 to incomplete information games.
Assume each player i has a local belief π i, n that assigns probabilities to different realizations of the state of the world θ. We assume players have a separate state learning process that updates their local belief π i, n in addition to the steps of the network-based FP algorithm at stage n. We will be agnostic to the specifics of the state learning process as long as the players' local beliefs converge to a common belief π on the state of the world. In particular, we make the following assumption.
Assumption 7.5
For all players, the local beliefs converge to a common belief π sufficiently fast; in particular,
(7.33)
where TV(⋅) is the total variation distance between the distributions. ■
For example, the state learning process can involve information exchange among players, where they exchange their beliefs and implement a consensus-like updating procedure, or it can entail Bayesian updates based on repeated noisy private signal observations about the state of the world.
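The consensus-like belief update mentioned above can be sketched as follows; the three-player line network, the averaging weights, and the initial beliefs are all illustrative assumptions, not taken from [65]:

```python
import numpy as np

# Hypothetical 3-player line network: 0 -- 1 -- 2.
# Row i of W averages player i's belief with its neighbors' beliefs.
W = np.array([[0.5, 0.5, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 0.5, 0.5]])

# Initial local beliefs pi_{i,0} over three realizations of theta
beliefs = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.1, 0.1, 0.8]])

def tv(p, q):
    """Total variation distance between discrete distributions."""
    return 0.5 * float(np.sum(np.abs(p - q)))

# Consensus-like update: pi_{i,n+1} = sum_j W_ij pi_{j,n}
for _ in range(50):
    beliefs = W @ beliefs

max_disagreement = max(tv(beliefs[i], beliefs[j])
                       for i in range(3) for j in range(3))
```

Because W is row-stochastic and the network is connected, the local beliefs contract toward a common belief geometrically, which is consistent with the rate condition in (7.33).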
Given this assumption, we have the following convergence result for the modified network-based FP [65].
Theorem 7.8
Let be a potential game. Assume players follow the network-based fictitious play algorithm with individual state learning processes that satisfy the above assumption. Then, . ■
This theorem can also be seen as another application of the robustness result given in Section 7.3. In particular, the uncertainty about the state and the networked interactions perturb the best response actions of players, resulting in a weakened FP process.
The network-based FP for incomplete information is computationally less demanding than the QNG filter. However, in terms of information exchange, the QNG filter is less demanding. The QNG filter only requires the observation of neighbors' actions while step (iv) of the network-based FP necessitates exchanging entire estimates of other players' empirical distributions with neighbors. In addition, the state learning process might require additional information exchange or observations.
In the following, we present an implementation of the network-based FP for incomplete information games.
Simulation
We consider the same simulation setup as in the numerical example for the QNG filter. We discretize the action space in order to implement the network-based FP. In the setup, each robot receives an initial noisy signal related to the target direction θ, x_i = θ + ε_i, where ε_i is drawn from a zero-mean normal distribution with standard deviation equal to 1°. Robots learn about the state by averaging their neighbors' estimates, with initial estimates equal to the private signals x_i.
We consider two types of networks. The first is a geometric network, generated by randomly placing the robots on a 1 unit × 1 unit square and drawing an edge between pairs that are less than 0.25 units apart. The second is a random network. In Fig. 7.5 we observe that the convergence of the algorithm is faster in the random network (46 steps) than in the geometric network (113 steps).
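A minimal sketch of generating such a geometric network, assuming the 0.25-unit connection radius from the text; the robot count and random seed are arbitrary:

```python
import random

random.seed(0)
n, radius = 20, 0.25

# Place robots uniformly at random on the unit square
pos = [(random.random(), random.random()) for _ in range(n)]

# Geometric network: edge between every pair closer than `radius`
edges = [(i, j) for i in range(n) for j in range(i + 1, n)
         if (pos[i][0] - pos[j][0]) ** 2 + (pos[i][1] - pos[j][1]) ** 2 < radius ** 2]

degrees = [sum(1 for e in edges if i in e) for i in range(n)]
```

A random (Erdős–Rényi-style) network would instead include each possible edge independently with a fixed probability, which tends to shorten path lengths and, as the experiment above suggests, speed up convergence.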
Fig. 7.5. Robots' actions for geometric (left) and small-world (right) networks. Solid lines correspond to each robot's actions over time. The dash-dotted line marks the value of the state of the world θ = 5°, and the dashed line marks the optimal estimate of the state given all the signals, which is equal to 5.3°. Agents reach consensus on the movement direction 5° faster in the small-world network than in the geometric network.
In comparison to the QNG filter, the network-based FP for incomplete information takes longer in terms of the number of stages. As can be expected, however, each stage computes much faster for the network-based FP algorithm. In sum, the total simulation time until convergence is longer for the QNG filter.
URL:
https://www.sciencedirect.com/science/article/pii/B9780128136775000079
Deep learning and generative adversarial networks in oral and maxillofacial surgery
Antonio Pepe , ... Jan Egger , in Computer-Aided Oral and Maxillofacial Surgery, 2021
3.3 Wasserstein generative adversarial networks
To solve the problem of mode collapse, Arjovsky et al. [81] introduced the so-called Wasserstein generative adversarial networks (WGAN). In Section 3.1, we described how the adversarial training of a GAN implicitly tries to reduce the Jensen-Shannon divergence (Eq. 3.13) between the two distributions. For this to be possible, the Kullback-Leibler integral (Eq. 3.14) must be defined and finite. In their work, Arjovsky et al. [81] focus on minimizing, during training, the Wasserstein distance between the distribution of the generated samples and the distribution of the training samples. They show that under this consideration the function describing the WGAN is defined as
(3.15) min_G max_{f ∈ ℱ_K} 𝔼_{x∼ℙ_r}[f(x)] − 𝔼_{x̃∼ℙ_g}[f(x̃)]
where ℱ_K is the set of Lipschitz-continuous functions with Lipschitz constant K = 1. For deep neural networks, they indirectly enforced this condition by introducing weight clipping. Nonetheless, Gulrajani et al. [82] showed that the application of weight clipping can produce side effects like vanishing gradients. As an alternative, these authors replaced the weight clipping with a gradient penalty term that pushes each gradient norm to be less than or equal to one. This approach has been empirically shown to produce fewer side effects, although the penalty term alone is not enough to enforce the Lipschitz condition at the beginning of training. For this, Wei et al. [83] suggested the application of a consistency term. This term follows directly from the definition of Lipschitz continuity, whereby for each pair of samples x₁ and x₂ the following holds:
(3.16) d(f(x₁), f(x₂)) ≤ K · d(x₁, x₂)
where d(⋅, ⋅) is the Euclidean distance and K the Lipschitz constant. To penalize any violation of the Lipschitz continuity, the authors complemented the gradient penalty term with a so-called consistency term, defined as:
(3.17)
where K′ approximates K, as it is calculated over a discrete number of sample pairs. Including the consistency term in the training objective helps to approximate the Lipschitz condition already during the initial iterations of training, while still following the idea introduced with the gradient penalty to avoid the limitations of weight clipping.
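A toy illustration, not from [81], of how weight clipping bounds the Lipschitz constant: for a linear "critic" f(x) = w·x, the Lipschitz constant is exactly ‖w‖₂, so clipping every weight into [−c, c] caps it at c√dim. The clipping threshold c = 0.01 follows the value used in the original WGAN formulation; the dimension and inputs are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear "critic" f(x) = w . x has Lipschitz constant ||w||_2.
w = rng.normal(size=8)

def clip_weights(weights, c=0.01):
    """Weight clipping as in the original WGAN: force every weight into [-c, c]."""
    return np.clip(weights, -c, c)

w_clipped = clip_weights(w)
lip_bound = float(np.linalg.norm(w_clipped))  # Lipschitz constant after clipping

# Sanity check of the Lipschitz property (3.16) on two arbitrary inputs
x1, x2 = rng.normal(size=8), rng.normal(size=8)
gap = abs(float(w_clipped @ x1) - float(w_clipped @ x2))
bound_ok = gap <= lip_bound * float(np.linalg.norm(x1 - x2)) + 1e-12
```

The gradient penalty and consistency term pursue the same K ≤ 1 constraint without the crude truncation of the weights themselves.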
URL:
https://www.sciencedirect.com/science/article/pii/B9780128232996000031
Subject-Centric Group Feature for Person Reidentification
Li Wei , Shishir K. Shah , in Group and Crowd Behavior for Computer Vision, 2017
7.3.2.2 Metric of Person–Group Feature
Given person–group features, the distance measure between features is based on a linear combination of three terms: a group size score, an in-group position score, and a group baseline score. The distance between the person–group features of two persons takes the form
(7.3)
The first term is the group size score, which returns the size difference of the groups that include the two persons. The group size score is computed by
(7.4)
where the group size of a group G is its number of group members.
The second term is the in-group position score, which evaluates the difference between in-group position signatures. As described above, an in-group position signature is a set of distributions that encode the cotravelers' locations around the person; each element is a distribution in a metric space. The problem of computing the distance between two signatures therefore becomes one of computing the distance between two distributions. There are many metrics that define a distance between distributions; we found that the intuition behind the Earth Mover's Distance (EMD) [25] fits our problem best. EMD computes the distance between distributions by computing the minimum cost of turning one distribution into the other, where cost is the amount of mass moved times the distance over which it is moved. This minimum cost can be found by solving a linear programming problem. In our problem, we define the distance between in-group position signatures as the minimum amount of deformation that transfers one feature into the other. However, unlike in the original EMD algorithm, a person can only be moved as a complete unit, so integer programming is required to solve for the minimum deformation in our problem.
Let two in-group position signatures be given. As mentioned above, all possible angle distributions belong to a metric space M, whose distance function is simply defined as the distance between the distributions' mean angles.
Let d_ij be the difference between the ith element of one signature and the jth element of the other. We try to find a flow F = [f_ij], where each f_ij is a binary variable, with f_ij = 1 when the ith element of the first signature is moved to the location of the jth element of the second after the deformation. This optimization can be formulated as a binary integer programming problem
(7.5)
subject to the following constraints:
After we solve the above optimization problem, the in-group position signature distance is calculated using
(7.6)
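A brute-force sketch of the binary matching behind Eq. (7.5), for small, equal-size signatures where the feasible flows are exactly the one-to-one matchings; the cost matrix below is hypothetical, and realistic signature sizes would call for a proper integer programming solver:

```python
from itertools import permutations

def min_deformation(cost):
    """Brute-force the binary matching of Eq. (7.5): choose f_ij in {0, 1}
    forming a one-to-one matching that minimizes sum_ij f_ij * d_ij.
    `cost` is the matrix of pairwise differences d_ij."""
    n = len(cost)
    best = float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        best = min(best, total)
    return best

# Hypothetical pairwise mean-angle differences d_ij between two signatures
cost = [[0.1, 2.0, 3.0],
        [2.5, 0.2, 2.0],
        [3.0, 2.0, 0.3]]
dist = min_deformation(cost)
```

The enumeration over permutations is the integer-programming constraint in miniature: each person moves as a complete unit, so fractional flows are excluded.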
The final term is the group baseline score. It computes the aggregated differences of the cotravelers' baseline features, under the condition that the cotraveler correspondence is known from solving Eq. (7.5). Let the pairwise baseline score matrix be given, whose (i, j) entry denotes the baseline score between the ith element of one signature and the jth element of the other. The group baseline score then takes the form
(7.7)
When a person is traveling alone, the person–group feature is empty. In this case the distance to an empty person–group feature is set to zero, and only the group size score contributes to the person–group feature difference.
URL:
https://www.sciencedirect.com/science/article/pii/B9780128092767000084
A review of the techniques of images using GAN
Rituraj Soni , Tanvi Arora , in Generative Adversarial Networks for Image-to-Image Translation, 2021
5.2.7 Wasserstein GANs
The idea of the WGAN, or Wasserstein GAN, was introduced by Arjovsky et al. [30]. It can be described as an augmentation of the existing GAN architecture. The main aim of the WGAN is to improve the stability of model training and to provide a loss function that correlates with the quality of the images generated by the model.
The WGAN is designed to approximate the training data distribution more faithfully. It proposes to use a critic in place of the discriminator: rather than classifying a given image as real or fake, the critic assigns it a score. The whole theory of the WGAN rests on a mathematical treatment of distances: the generator must minimize the distance between the distribution of the data observed in the training dataset and the distribution of the generated examples.
In the paper by Arjovsky et al. [30], various distribution distance measures are discussed, such as the Jensen-Shannon (JS) divergence [31], the Kullback-Leibler (KL) divergence [32], and the Wasserstein distance (Earth-Mover [EM] distance). Each distance is judged by how it behaves on convergent sequences of probability distributions, and it is shown that the WGAN can train the generator more effectively using the properties of the Wasserstein distance than with the other distribution distances.
Fig. 5.19 depicts the simple WGAN architecture. The concept of the WGAN revolves around the fact that the Wasserstein distance is continuous and differentiable, which means that the critic can be trained to optimality. The longer the critic is trained, the more reliable the Wasserstein gradient it provides, thanks to the differentiability of the Wasserstein distance. With the JS divergence, in contrast, as the discriminator becomes more reliable the true gradient goes to zero: the JS divergence saturates locally and the gradients vanish. The critic in the WGAN does not saturate; it converges to a linear function that gives a clean gradient everywhere, whereas a discriminator may quickly learn to tell real from fake and then provide almost no reliable gradient information. The most crucial advantage of the WGAN is that it makes the training process stable and far less sensitive to the choice of hyperparameter configurations. The WGAN aims to decrease the critic's loss, and achieving this improves the quality of the generated images. The WGAN simply drives the generator's loss lower, whereas other GANs must reach an equilibrium between the generative and discriminative models. Applications of the WGAN include the simulation of isolated electromagnetic showers in a realistic setup of a multilayer sampling calorimeter [33]. Similarly, one critical step in the analysis of medical images is the structure-preserving denoising of 3D magnetic resonance imaging (MRI) images.
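A toy example in the spirit of the comparison in [30] (the point-mass setup below is a simplification for illustration, not the paper's own example): for two point masses with disjoint supports, the JS divergence saturates at log 2 no matter how far apart they are, while the Wasserstein (earth-mover) distance still reflects the separation and hence still provides a training signal:

```python
import math

# Distributions: a point mass at 0 versus a point mass at theta != 0.
# Their supports are disjoint for every nonzero theta.

def js_point_masses(theta):
    """JS divergence between delta_0 and delta_theta: 0 if they coincide,
    log 2 otherwise (saturated, independent of how far apart they are)."""
    return 0.0 if theta == 0 else math.log(2.0)

def wasserstein_point_masses(theta):
    """Earth-mover distance between delta_0 and delta_theta: move one unit
    of mass a distance |theta|."""
    return abs(theta)

js_near, js_far = js_point_masses(0.1), js_point_masses(10.0)
w_near, w_far = wasserstein_point_masses(0.1), wasserstein_point_masses(10.0)
```

The saturated JS value is exactly the "locally saturated" regime described above, in which gradients vanish; the Wasserstein distance degrades gracefully instead.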
Fig. 5.19. A simple WGAN architecture [34].
Ran et al. [35] presented a residual encoder-decoder Wasserstein generative adversarial network (RED-WGAN) as an MRI denoising method.
The next section discusses some of the open issues and research gaps in the domain of the applications related to the GANs.
URL:
https://www.sciencedirect.com/science/article/pii/B9780128235195000063
Robust Fault Detection and Isolation Based on the Kullback Divergence
Daniele Romano , Michel Kinnaert , in Fault Detection, Supervision and Safety of Technical Processes 2006, 2007
4 FAULT ISOLATION
4.1 Generation of a vector of isolation indicators
Once the chi-square test reaches its threshold, an alarm is generated indicating that a fault has occurred. The next step is to determine the most likely fault, namely to isolate the fault. To this end, a set of nf fault indicators that are specifically dedicated to fault isolation, and are thus different from the detection residuals, will be developed in this section. As for detection, each indicator is associated with a given fault. The structure of these indicators is chosen as in (3), namely
(14)
where λ and v are respectively m·s- and q·s-dimensional parameter vectors and ξ is a scalar parameter. The indicator associated with fault i, q_i(k), is designed so that the minimum distance between the indicator distribution upon occurrence of fault i and its distribution for all other faults (over all fault levels in the specified range) is maximized. The Kullback divergence is again used to measure the distance between the distributions of q_i(k) in the different faulty modes. By analogy with the design of structured residuals according to the incidence matrix of Table 1, one will require that the ith indicator have zero mean upon occurrence of a fault of type i in the specified interval. If the pdf of the fault level in that interval, f_ϑi(θ_i), is known, ξ_i is then
Table 1. Incidence matrix (a one in row i and column j indicates sensitivity of residual i to fault j, and a zero indicates insensitivity)
| | θ1 | θ2 | … | θnf |
|---|---|---|---|---|
| q1 | 0 | 1 | … | 1 |
| q2 | 1 | 0 | … | 1 |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
| qnf | 1 | 1 | … | 0 |
and its discrete approximation can be computed by
(15)
where p_j, j = 1 to L_f, are appropriate probability weights which sum up to 1. When no information is available on the probability distribution of the fault level, a practical setting in Equation (15) is the uniform choice p_j = 1/L_f.
In (15), the expectation under pdf f z/θij (z) can again be estimated by an empirical mean thanks to appropriately designed experiments. The design of the nf fault isolation indicators can thus be stated as follows:
For i= 1 …, nf , determine qi (k) of the form (14) by solving the following optimization problem
(16)
Then, the directional properties of the resulting vector will be exploited in order to isolate the fault.
4.2 Processing of the vector of isolation indicators
Notice that the mean and variance of q_isol(k) can be computed in a similar way as for r_det(k) (see (11)). The mean of the vector q_isol(k) upon occurrence of the ith fault within its interval is
and can be approximated by discretizing the computation of the integral as in Equation (15).
The result of the isolation procedure is then obtained as:
(17)
Thus the index i that maximizes the absolute value of the cosine of the angle v_i between the present fault indicator vector q_isol(k) and its average direction upon occurrence of fault i is taken to indicate the fault that has occurred.
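A minimal sketch of this cosine-based decision rule; the mean indicator directions below are hypothetical, chosen to mimic the zero/one pattern of the incidence matrix in Table 1:

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def isolate(q_isol, mean_directions):
    """Pick the fault i whose mean direction makes the largest |cos v_i|
    with the current indicator vector, as in the isolation rule (17)."""
    scores = [abs(cosine(q_isol, mu)) for mu in mean_directions]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical mean indicator directions for three faults (row i has a
# zero in position i, following the incidence-matrix structure)
means = [[0.0, 1.0, 1.0],
         [1.0, 0.0, 1.0],
         [1.0, 1.0, 0.0]]
fault = isolate([0.1, 0.9, 1.1], means)
```

The absolute value makes the rule insensitive to the sign of the fault level, which matters when faults of either polarity are possible.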
Such a strategy is proposed and motivated in Gustafsson (2002) to isolate faults by means of a set of parity relations designed for a process described by a stochastic linear system without parametric uncertainties. Its use in our context is relevant provided the variance of q_isol(k) does not change significantly in the different faulty modes. Should this hypothesis not be realistic, more sophisticated approaches based on the log-likelihood ratio between pairs of faulty modes should be considered (Basseville and Nikiforov, 1993, p. 129).
URL:
https://www.sciencedirect.com/science/article/pii/B9780080444857500725
An information geometric look at the valuing of information
Ira S. Moskowitz , ... William F. Lawless , in Human-Machine Shared Contexts, 2020
9.2 Information geometry background
The field of IG had its origins in Fisher information (Fisher, 1922) and the Cramér-Rao bound (Rao, 1945). In what follows, we form our Riemannian manifolds following the techniques of Amari (1985, 2001) and Suzuki (2014). We want to know: what is the "distance" between probability distributions? What makes one distribution close to or far from another? Certainly, there are techniques, such as the Kullback-Leibler divergence, that address this, but we will do it via Riemannian geometry to provide an inherent geometric setting for distance, the "natural" way to proceed. Furthermore, Riemannian geometry includes the concept of geodesic paths, the most efficient paths to travel across spaces.
Many applications use distance between distributions, for example, Carter, Raich, and Hero (2009) and Wang, Cheng, and Moran (2010); but that is not the focus of our chapter. We concentrate on a thorough mathematical discussion on the distances between distributions. We cite many references in our chapter, but our goal for this chapter is a concise and mathematically accurate description of the issue.
We start by looking at the normal distribution N(μ, σ 2) with probability density function
The reader should keep in mind that we express the density function f as a function of μ and σ, yet the classical notation N(μ, σ²) uses μ and σ². The classical notation is actually the most natural notation (natural parameters) once we consider covariance matrices.
In Fig. 9.1, consider the four normal distributions μ = 0 and σ = 1 or 10 (Fig. 9.1A); and μ = 10 and σ = 1 or 10 (Fig. 9.1B).
Fig. 9.1. Constant μ. Note, d F is independent of μ. (A) Constant μ = 0, d F = 3.26; (B) constant μ = 10, d F = 3.26.
Looking at Fig. 9.1 we see that, up to a translation in the graph, the difference plots are the same. We need a concept of distance that relies only on the σ_i when the means are equal.
In Fig. 9.2, we have the four normal distributions σ = 1 and μ = 10 or 20 (Fig. 9.2A); and σ = 1 and μ = 1 or 2 (Fig. 9.2B).
Fig. 9.2. Constant σ. Note, d F is not independent of σ. (A) Constant σ = 1, d F = 5.59; (B) constant σ = 1, d F = 0.98.
Here, the difference depends on not only σ but also μ in a nontrivial manner. We need a concept of distance that incorporates this property. We will show later in the chapter that the Fisher distance "d F " has the desired properties to deal with both Figs. 9.1 and 9.2.
In general, we will show later, from Costa, Santos, and Strapasso (2015), that
d_F(N(μ₁, σ₁²), N(μ₂, σ₂²)) = √2 · arccosh(1 + ((μ₁ − μ₂)² + 2(σ₁ − σ₂)²) / (4 σ₁ σ₂)).
Applying this to the distributions illustrated in Fig. 9.1, we see that d_F(N(0, 1), N(0, 100)) = d_F(N(10, 1), N(10, 100)) = √2 ln 10 ≈ 3.26. This captures well the fact that the difference between the distributions is, up to translation, the same. However, if we vary the means and keep the standard deviations the same, the behavior is very different. This is what we see in Fig. 9.2, where d_F(N(10, 1), N(20, 1)) = 5.59, but d_F(N(1, 1), N(2, 1)) = 0.98.
We note that neural net vision learning has the desirable quality of being translation invariant. Therefore, looking at Fig. 9.1, we would want a machine to see the difference functions as being the same in Fig. 9.1A and B. However, we would want a machine to distinguish between the difference functions in Fig. 9.2. The Fisher distance has the desirable quality of achieving this goal.
The reason for these differing behaviors can only be understood by studying the geodesics given by the Fisher-Rao metric in the upper (μ, σ) half plane. We will explain this later in the chapter.
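The closed-form Fisher distance between univariate normals given by Costa, Santos, and Strapasso (2015) can be checked numerically; the sketch below reproduces the values 3.26, 5.59, and 0.98 quoted for Figs. 9.1 and 9.2:

```python
import math

def fisher_distance(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2),
    via the closed form from Costa, Santos, and Strapasso (2015)."""
    arg = 1.0 + ((mu1 - mu2) ** 2 + 2.0 * (sigma1 - sigma2) ** 2) / (4.0 * sigma1 * sigma2)
    return math.sqrt(2.0) * math.acosh(arg)

d_fig1a = fisher_distance(0.0, 1.0, 0.0, 10.0)    # Fig. 9.1A: about 3.26
d_fig1b = fisher_distance(10.0, 1.0, 10.0, 10.0)  # Fig. 9.1B: identical (mu-invariant)
d_fig2a = fisher_distance(10.0, 1.0, 20.0, 1.0)   # Fig. 9.2A: about 5.59
d_fig2b = fisher_distance(1.0, 1.0, 2.0, 1.0)     # Fig. 9.2B: about 0.98
```

Note the translation invariance for equal means, and the nontrivial dependence on both μ and σ otherwise, exactly the two properties demanded of the distance above.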
URL:
https://www.sciencedirect.com/science/article/pii/B9780128205433000092
Thermoresponsive Core-Shell Nanoparticles and Their Potential Applications
Martina Schroffenegger , Erik Reimhult , in Comprehensive Nanoscience and Nanotechnology (Second Edition), 2019
2.08.4.4 Small Angle Scattering
Small angle neutron scattering (SANS) and small angle X-ray scattering (SAXS) are techniques often used in fundamental colloidal science [36]. Small angle scattering (SAS) techniques use scattering from atomic nuclei (using neutrons) and from electron shells (using X-rays), respectively, to provide structural information on objects smaller than 100 nm. SAXS uses X-rays with wavelengths in the range of 0.1–0.2 nm and is sensitive to electron density differences in the sample down to this size range. SANS uses neutrons with a wavelength around 0.5 nm to probe similarly small inhomogeneities based on nuclear scattering length density contrast [37].
The scattering spectrum of SANS or SAXS comprises two convoluted parts: a form factor that corresponds to the object size and shape, and a structure factor that corresponds to the distance distribution between the objects and therefore to aggregation. These features can overlap if they have similar dimensions, and the fitting of a model of the sample structure to the reciprocal space data then does not have a unique solution. This implies that multiple measurements under varied sample conditions are required to create confidence in the interpretation.
An example of SAXS measurements on thermoresponsive nanoparticles with a polystyrene core and a crosslinked PNiPAAm shell is given in Fig. 6 [38]. The particles were measured at different temperatures (Fig. 6(a)). The shift of the maxima to lower q values with decreasing temperature is a clear sign of the expansion of the shell: by cooling the sample below the CST, the shell is hydrated, which shifts the scattering intensity maxima to lower q and can be modeled as an increase in the size of the shell and of the overall core-shell particle. Fig. 6(b), compiled from data obtained in both SAXS and SANS studies, shows that the volume fraction of PNiPAAm in the shell shrinks with increasing temperature. This volume fraction is obtained from the known contrast of PNiPAAm against the surrounding water, the overall mass of the polymer in the shell, and the shell thickness. A phase transition can thus be observed directly from changes in core-shell nanoparticle structure, and it is visible that the swelling takes place in a continuous fashion. Naturally, in many systems the collapse and partial dehydration of the thermoresponsive shell of core-shell nanoparticles leads to net attractive particle interactions and aggregation, which can be monitored through the appearance or strengthening of the corresponding structure factor. Thus, thermal phase transitions can also be investigated through the structure factor for systems in which particle aggregation occurs.
Fig. 6. Polystyrene core with a radius of 53.5 nm coated with a crosslinked PNiPAAm shell. (a) SAXS intensities measured at different temperatures above and below the CST. With decreasing temperature, the maxima in the scattering curves shift to smaller q. (b) Phase diagram: the volume fraction of PNiPAAm in the shell.
Reprinted with permission from Seelenmeyer, S., Deike, I., Rosenfeldt, S., et al., 2001. Small-angle x-ray and neutron scattering studies of the volume phase transition in thermosensitive core-shell colloids. J. Chem. Phys. 114 (23), 10471–10478, doi:10.1063/1.1374633 (Fig. 5).
Due to the high structural spatial resolution of SANS and SAXS, and the relatively high contrast between (deuterated) water and organic molecules afforded by SANS, these methods can potentially be used to study not only average density changes of thermoresponsive shells but also structural transitions within a shell. A recent study made a first attempt at using SAXS to capture the internal structure of densely grafted PEG shell and iron oxide core nanoparticles as they underwent a reversible thermal transition in electrolyte. The authors showed that the segment density profile of a star polymer had to be assumed for the densely grafted shell to achieve a good fit of the data, and that the thermal collapse of the polymer above the CST led to an extension of the constant volume fraction part of the segment density profile of the polymer [13].
Comparing SAS with the three standard methods described above, it is clear that SAS requires more extensive data analysis. To reap the advantages of the detailed analysis of shell structure, it also requires data that can only be obtained at large-scale neutron and synchrotron facilities. These methods are therefore not likely to be broadly adopted in the development of applications of thermoresponsive nanoparticles, but rather used for fundamental research studies. However, although the interpretation will not be unique, they can be used both to observe colloidal phase transitions, in analogy with the information extracted from DLS, and to obtain structural information on the shell transition that is greatly complementary to DSC. They are therefore useful for building a more complete mechanistic picture of the interplay between molecular and colloidal transitions observed for core-shell nanoparticles.
URL:
https://www.sciencedirect.com/science/article/pii/B978012803581810431X
Generative adversarial networks and their application to 3D face generation: A survey
Mukhiddin Toshpulatov , ... Suan Lee Ph.D. , in Image and Vision Computing, 2021
2.1 Loss functions
GANs use loss functions that indicate the distance between the distribution of the data generated by the GANs and that of the real data. The objective function is designed to learn the generator G, which minimizes the difference between the real and fake (or generated) data. The discriminator is trained directly on real and generated images, and it is responsible for classifying the images. The generator is not trained directly; instead, it is trained via the discriminator.
Discriminator loss. The discriminator seeks to maximize the probability of correctly classifying real and fake instances. Mathematically, its objective can be written as
(2) max_D 𝔼_{x∼p_data}[log D(x)] + 𝔼_{z∼p_z}[log(1 − D(G(z)))]
where x is a sample from the real data, D(x) is the discriminator's estimate of the probability that the real data instance x is real, G(z) is the generator's output for given noise z, and D(G(z)) is the discriminator's estimate of the probability that a fake instance is classified as real.
Minimax loss. Loss functions such as the minimax loss [72,148] are used in minimax GANs. The minimax GAN loss refers to the simultaneous minimax optimization of the discriminator and the generator, where min and max denote minimization of the generator's loss and maximization of the discriminator's loss, respectively:
(3) min_G max_D V(D, G) = 𝔼_{x∼p_data}[log D(x)] + 𝔼_{z∼p_z}[log(1 − D(G(z)))]
As mentioned above, the discriminator seeks to maximize the average logarithmic probability of the real data; the discriminator objective is given by
(4) max_D 𝔼_{x∼p_data}[log D(x)] + 𝔼_{z∼p_z}[log(1 − D(G(z)))]
where log(1 − D(G(z))) is the logarithm of the inverted probability for fake data. The generator aims to minimize this value; the generator objective is
(5) min_G 𝔼_{z∼p_z}[log(1 − D(G(z)))]
The generator cannot directly affect the log(D(x)) term in the function; hence, for the generator, minimizing the loss is equivalent to minimizing log(1 − D(G(z))) or maximizing D(G(z)). Thus, the generator learns to generate samples that have a low probability of being fake [148].
(6)
Modified minimax loss. The minimax loss function above can cause GAN training to get stuck in the early stages, when the discriminator's task is very easy [148]. It has therefore been suggested that the generator loss be modified so that the generator tries to maximize log D(G(z)).
Wasserstein loss. This loss function depends on a modification of the GAN scheme in which the discriminator does not classify cases: for each case it outputs an unbounded score rather than a probability, so the output need not lie between 0 and 1, and we cannot use 0.5 as a threshold to determine whether a case is real or fake. Discriminator training only tries to make the output larger for real cases than for fake cases. Because it does not discriminate between real and fake in the classification sense, the discriminator is called a "critic" rather than a "discriminator". This distinction has theoretical importance; for practical purposes, we can simply note that the inputs to the loss functions do not have to be probabilities [121]. The critic tries to maximize D(x) − D(G(z)), i.e., the difference between its outputs on real and fake cases, while the generator tries to maximize D(G(z)), i.e., the critic's output on fake cases. The formulas are derived from the earth mover's distance between the real and generated distributions [10].
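The loss functions above can be sketched numerically; the sample score values below are arbitrary assumptions for illustration:

```python
import math

def d_loss_minimax(d_real, d_fake):
    """Discriminator objective (to maximize): log D(x) + log(1 - D(G(z)))."""
    return math.log(d_real) + math.log(1.0 - d_fake)

def g_loss_minimax(d_fake):
    """Generator objective (to minimize): log(1 - D(G(z)))."""
    return math.log(1.0 - d_fake)

def g_loss_modified(d_fake):
    """Non-saturating (modified minimax) variant: maximize log D(G(z))."""
    return math.log(d_fake)

def critic_objective(c_real, c_fake):
    """Wasserstein critic objective: D(x) - D(G(z)); the scores are unbounded."""
    return c_real - c_fake

# With a confident discriminator (D(x) = 0.9, D(G(z)) = 0.1) the minimax
# objective is near its maximum; the Wasserstein critic value is just a
# score gap and needs no probability interpretation.
d_val = d_loss_minimax(0.9, 0.1)
w_val = critic_objective(3.2, -1.5)
```

Note that the critic's inputs (3.2 and −1.5) fall outside [0, 1], which is precisely why no 0.5 threshold applies to the Wasserstein formulation.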
URL:
https://www.sciencedirect.com/science/article/pii/S026288562100024X
Source: https://www.sciencedirect.com/topics/engineering/distance-between-the-distribution