# Selecting pseudo-absences for species distribution models: how, where and how many?

For each one of these six questions, we further tested for an effect of the number of presences available and the choice of the modelling technique, using seven different SDM. To do so, we performed a comparative analysis based on virtual data. We thus knew the species’ true distribution and were able to simulate different realisations of this distribution that were either unbiased or purposely biased geographically or climatically. Geographically biased presence data could arise from sampling along main roads or railways, or within a subset of the countries where the species occurs ( Kadmon, Farber & Danin 2004 ; Albert et al. 2010 ). Geographical bias matches some large-scale surveys like the North American Breeding Bird Survey with sampling sites along the main roads or some common data sets used for species distribution modelling which follow political boundaries (e.g. European breeding birds, Huntley et al. 2008 ). Climatically biased presence data can result either from a spatially biased sampling design, that is, when data from an area with climatically different characteristics are missing ( Barbet-Massin, Thuiller & Jiguet 2010 ), or from sampling that was not carried out over the whole environmental range of a given species, which is often the case for species ranging from low to very high altitude, because the latter is usually less thoroughly surveyed.

The SDM widely used in these studies can be categorised in two groups: methods that only require presence data vs. those that require both presence and absence data ( Brotons et al. 2004 ). Contrary to popular belief, there are very few presence-only SDM, the most common being rectilinear envelope (e.g. BIOCLIM, Busby 1991 ) and distance-based envelope (e.g. Mahalanobis distance, Farber & Kadmon 2003 ). SDM such as Maxent or GARP, sometimes misleadingly referred to as presence-only methods, actually do require the use of background data or pseudo-absence data. As confirmed absences are very difficult to obtain, especially for mobile species, and require higher levels of sampling effort to ensure their reliability compared with presence data ( Mackenzie & Royle 2005 ), presence-only models have often been used to cope with the lack of absence data ( Graham et al. 2004b ). However, comparisons of various SDM show that presence–absence models tend to perform better than presence-only models ( Elith et al. 2006 ). Thus, presence–absence models are increasingly used when only presence data is available, by creating artificial absence data (usually called pseudo-absences or background data).

For each SDM, we used an ANOVA to test the effects of the number of pseudo-absences, the method used for the selection of pseudo-absences, and the weighting scheme for presences vs. absences on model quality, for each combination of virtual species, pool of presence data and number of presences. In each case, the relative contribution of each effect was estimated as the ratio between the explained and the null deviances. Using the same approach, we also considered SDM as an additional effect to compare variability between SDM, that is, variations in model accuracy owing to differences in the way each SDM handles pseudo-absences.

We tested for an effect of the number/weighting scheme of pseudo-absences on model accuracy via a likelihood ratio test. This test compared the likelihood of two linear models: one that included as covariates both the method of generating pseudo-absences and the number/weighting scheme of pseudo-absences, and one that included only the former. The number/weighting scheme covariate was coded as a 6-level factor (100, 1000 or 10 000 pseudo-absences, with either equal or unequal weighting of presences vs. absences).

To investigate this issue and the four that follow, we used three different numbers of presences (30, 100 and 300), three different numbers of pseudo-absences (100, 1000 and 10 000), four methods to generate them and two different weighting schemes for all seven SDM and all pools of presences (Fig. 2). For each combination of parameters, 20 replicates with different presence data selections were performed to account for the variability in model accuracy because of the random sampling of presence data (Fig. 2). For each presence data sample, several replicates with different pseudo-absences selections were performed to further account for the variability because of the random sampling of pseudo-absence data (Fig. 2).

To investigate the optimal trade-off between the number of replicates, the number of pseudo-absences and the predictive accuracy, we calculated mean predicted distributions (hereafter called mean predictions) resulting from several (2–20) replicates of pseudo-absences selection. To estimate the number of replicates of pseudo-absences above which model quality does not increase significantly, we compared mean TSS across the number of replicates for each combination of pools of presence data × number of presences × number of pseudo-absences (Fig. 2).

To investigate the effect of prevalence, we used four different numbers of presences (30, 100, 300 or 1000) and five different numbers of absences (100, 300, 1000, 3000 or 10 000). To make sure the results were not influenced by false positives or false negatives, presences were randomly selected from the ‘actual’ unbiased distribution and true absences were randomly selected as absence data. To account for the variability arising from the random selection of a set of presences, the models were fitted with 20 different random presence sets for each combination of sample size and each virtual species ( Fig. 1 ). For each random presence set, accuracy measures were then calculated by considering the mean of the 20 distributions obtained using different random replicates of true absences as the result distribution.

For any given set of presences and absences, we used seven SDM (to detect a potential effect of the choice of the modelling method) as found in the biomod package in R (see Thuiller et al. 2009 for further details on these modelling methods): three regression methods (GLM, GAM and MARS), two classification methods (MDA and CTA) and two machine-learning methods (BRT and RF). The models were fitted either by assigning an equal weight to each presence and absence point or by balancing the weight of presences vs. absences (question c), such that all presence data combined had the same weight as the total weight of the absence data (except for MARS and RF, which could not consider different weights for different data at the time of the analysis). Binary transformation was carried out using the threshold that maximised the true skill statistic (TSS; Allouche, Tsoar & Kadmon 2006). TSS corresponds to the sum of sensitivity and specificity minus one, where sensitivity is the proportion of presences correctly predicted and specificity is the proportion of absences correctly predicted. This threshold was shown to produce the most accurate predictions (Jimenez-Valverde & Lobo 2007). Models were evaluated using four different criteria: the area under the receiver operating characteristic (ROC) curve (AUC) (Fielding & Bell 1997), sensitivity, specificity and TSS. These four predictive accuracy measures were calculated in reference to the potential distribution only.
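The TSS definition and the threshold choice described above can be illustrated with a minimal Python sketch. It is not the biomod implementation: the function names are hypothetical, the candidate thresholds are simply the distinct predicted scores, and both presences and absences are assumed to occur in the evaluation data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(observed: Sequence[int],
                            predicted: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for 0/1 observed and predicted labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)  # presences predicted present
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)  # presences predicted absent
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)  # absences predicted absent
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)  # absences predicted present
    return tp / (tp + fn), tn / (tn + fp)


def tss(observed: Sequence[int], predicted: Sequence[int]) -> float:
    """True skill statistic: sensitivity + specificity - 1."""
    sens, spec = sensitivity_specificity(observed, predicted)
    return sens + spec - 1.0


def best_tss_threshold(observed: Sequence[int],
                       scores: Sequence[float]) -> Tuple[float, float]:
    """Scan the distinct scores as thresholds and keep the one maximising TSS.

    Ties are resolved in favour of the lowest threshold (an arbitrary choice).
    """
    best_threshold, best_value = None, -2.0  # TSS is always >= -1
    for threshold in sorted(set(scores)):
        predicted = [1 if s >= threshold else 0 for s in scores]
        value = tss(observed, predicted)
        if value > best_value:
            best_threshold, best_value = threshold, value
    return best_threshold, best_value
```

Because TSS combines the two error rates with equal weight, the selected threshold does not depend directly on how many absences were drawn, which is convenient when the number of pseudo-absences varies between runs.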

Five different sample sizes of absence data were considered: 100, 300, 1000, 3000 or 10 000 absences. Depending on the question under consideration, we used either true absences or pseudo-absences as absence data. We considered as true absences all points located outside the potential distribution of the species, whereas pseudo-absences were always generated without considering the species potential distribution. True absences were randomly sampled among all true absences available. We used four different methods to generate the pseudo-absences (using the biomod package in R, Thuiller et al. 2009 ): (i) random selection from all points within the studied area excluding available presence points (‘random’), (ii) random selection of points from all points outside of the suitable area estimated by a rectilinear surface envelope from the presence sample (surface range envelope model using only presence-only data, Thuiller et al. 2009 ; hereafter, the ‘SRE’ method), (iii) random selection of any point located at least one degree in latitude or longitude from any presence point (the ‘1°far’ method) and (iv) random selection of any available point located at least two degrees away from any presence point (the ‘2°far’ method). Note that pseudo-absences can be presences that were not retained within the presence sample used to build the models (i.e. false absences).
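The selection strategies listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the biomod code: cells are (longitude, latitude) tuples, the distance rule is read as "at least `min_deg` degrees away in latitude or in longitude from every presence", and the envelope is a plain min–max box over the environmental values of the presence sample. All function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[float, float]  # (longitude, latitude) in degrees


def random_pseudo_absences(cells: Sequence[Cell], presences: Sequence[Cell],
                           n: int, rng: random.Random) -> List[Cell]:
    """'random': draw from every cell that is not in the presence sample."""
    presence_set = set(presences)
    candidates = [c for c in cells if c not in presence_set]
    return rng.sample(candidates, min(n, len(candidates)))


def far_pseudo_absences(cells: Sequence[Cell], presences: Sequence[Cell],
                        n: int, min_deg: float, rng: random.Random) -> List[Cell]:
    """'1°far' / '2°far': keep cells at least min_deg away from all presences."""
    def far_enough(cell: Cell) -> bool:
        return all(abs(cell[0] - p[0]) >= min_deg or abs(cell[1] - p[1]) >= min_deg
                   for p in presences)

    candidates = [c for c in cells if far_enough(c)]
    return rng.sample(candidates, min(n, len(candidates)))


def sre_pseudo_absences(env: Dict[Cell, Tuple[float, ...]],
                        presences: Sequence[Cell],
                        n: int, rng: random.Random) -> List[Cell]:
    """'SRE': keep cells falling outside the rectilinear envelope of presences."""
    n_vars = len(env[presences[0]])
    lows = [min(env[p][k] for p in presences) for k in range(n_vars)]
    highs = [max(env[p][k] for p in presences) for k in range(n_vars)]

    def outside(cell: Cell) -> bool:
        # A cell is outside as soon as one variable leaves the presence range.
        return any(not (lows[k] <= env[cell][k] <= highs[k]) for k in range(n_vars))

    candidates = [c for c in env if outside(c)]
    return rng.sample(candidates, min(n, len(candidates)))
```

Only the 'random' strategy can return unsampled presences located next to known ones; the other two shrink the candidate pool, which is why they trade some specificity for fewer false absences.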

To investigate the effects of sampling bias in presence data on the models’ predictive accuracy ( question e ), we created three biased subdistributions from the actual species distributions. Firstly, we created a climatically biased distribution by considering a probability surface whose Gaussian response curve means were slightly different from the means of the potential distribution (Fig. S2). Presence points of the climatically biased distribution were then sampled from the actual distribution following a binomial distribution, the probability of success for each pixel being extracted from the biased probability surface. As a result, the presence points from this sample did not include the full extent of the fundamental climatic niche of the virtual species. Secondly, we created two spatially biased samples. One was made by removing presences from several countries on one side of the distribution, and the second by only selecting presences along transport routes (roads or railways) (Fig. S1). It should be noted that the first spatial bias considered can also be interpreted as a species that does not fully occupy its potential distribution because of dispersal limitations, historical legacies and exclusion through biotic interactions. Each one of the biased samples contained approximately 1000 presence points (Fig. S1).
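The binomial sampling step used for the climatically biased sample amounts to one Bernoulli trial per occupied pixel. A minimal sketch, assuming the biased probability surface is available as a mapping from pixel identifier to probability (the function and argument names are hypothetical):

```python
import random
from typing import Dict, Hashable, List


def sample_biased_presences(bias_prob: Dict[Hashable, float],
                            rng: random.Random) -> List[Hashable]:
    """Keep each occupied pixel with its own probability of success.

    bias_prob maps every pixel of the actual distribution to the value of the
    biased probability surface at that pixel (between 0 and 1).
    """
    return [pixel for pixel, p in bias_prob.items() if rng.random() < p]
```

Pixels whose biased probability is close to zero are almost never retained, which is how part of the climatic niche ends up missing from the sample.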

To make sure that our results were not influenced by the choice of a species and the peculiarities thereof, we created two geographically distinct virtual species (Fig. S1). To produce the simplest possible potential distributions based on uncorrelated variables, we constrained the distributions of these virtual species by two explanatory variables. To include realistic environmental conditions, we chose these two uncorrelated environmental variables as the first two axes of a principal component analysis (PCA) conducted on eight variables related to temperature and precipitation at European scale (from the Worldclim data base at a 10 arc-min resolution): (i) annual mean temperature, (ii) mean temperature of the warmest month, (iii) mean temperature of the coldest month, (iv) temperature seasonality, (v) annual precipitation, (vi) precipitation of the wettest month, (vii) precipitation of the driest month, (viii) precipitation seasonality. For each species, we assumed a bell-shaped relationship between the probability of occurrence and each composite environmental variable. Each fundamental niche is therefore an ellipsoid in the principal component space, as previously used by Godsoe (2010) and Soberon & Nakamura (2009) , although the geographical points falling within that environmental ellipsoid can result in a distorted ellipsoid, depending on its position in the environmental space cloud ( Soberon & Nakamura 2009 ) (Fig. S1). Although Gaussian response curves might seem unrealistic at a first glance, this is what is expected from a theoretical point of view ( Lawton 1999 ). Whilst the SDM accuracy (in absolute terms) may depend upon the response curves chosen to create the virtual species, this choice should not influence how different methods for generating pseudo-absences affect the quality of a given SDM (in relative terms). 
The virtual species reflect similar ecological constraints (same shape of response curves to the same environmental variables), to ensure our results reflect differences resulting from the methods used to generate pseudo-absences and not differences arising from species characteristics.
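A bell-shaped response along two uncorrelated axes can be sketched as follows. The text does not state how the two responses were combined; the product used here is an assumption, chosen because its contours of equal probability are ellipses in the principal component space. Means and standard deviations are hypothetical parameters.

```python
import math
from typing import Tuple


def gaussian_response(x: float, mean: float, sd: float) -> float:
    """Bell-shaped response scaled so that the optimum has value 1."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2)


def occurrence_probability(pc1: float, pc2: float,
                           means: Tuple[float, float],
                           sds: Tuple[float, float]) -> float:
    """Probability of occurrence from two composite variables (assumed product)."""
    return (gaussian_response(pc1, means[0], sds[0])
            * gaussian_response(pc2, means[1], sds[1]))
```

Shifting `means` slightly while keeping `sds` fixed gives a surface of the kind used to build a climatically biased sample.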

In addition, we found that AUC and TSS were highly correlated (using Pearson’s product-moment correlation, r = 0·82 ± 0·10 across all SDM). Therefore, the relative performance of the different methods used to select the pseudo-absences did not depend on the choice of the evaluation criterion. Although we presented results on the models’ predictive accuracy, the results and conclusions were the same for the models’ ability to correctly predict climatic suitability (assessed using a correlation test between the probability distribution obtained from the model and the probabilities of occurrence of the potential distribution chosen for a given species, Fig. S5).

The relative contribution of each methodological choice to variations in model quality depended on the SDM used. GLM and GAM methods responded similarly: when 30 presences were selected, variation in TSS among distributions obtained from all models was only partly explained by the number of pseudo-absences, the method used for selecting pseudo-absences, and the weighting of presences vs. absences ( Fig. 7 ). This pattern suggested that results were most influenced by the random set of presences from the actual species distribution. However, when the number of sampled presences increased, the contribution of the other factors to variability in TSS also increased: with 100 or 300 presences, the method used for selecting the pseudo-absences explained most of the variation in TSS for GLM and GAM. In contrast, for the five remaining SDM, the number of pseudo-absences selected for each run made the biggest contribution to the variability in TSS regardless of the number of presences sampled. The method used for selecting pseudo-absences also partly explained the variation in TSS and its influence increased with the number of presences sampled.

The predictive accuracy of the models in relation to the number and weighting scheme of pseudo-absences was not influenced by the sampling biases of presence data ( Fig. 5 ). Regarding the method used to generate pseudo-absences, the results obtained with spatially biased presences were similar to those obtained with unbiased presences ( Fig. 6 ), except for MDA for which ‘random’ yielded better models with spatially biased presences. With the three regression techniques (GLM, GAM and MARS), ‘random’ did not perform well with climatically biased presences, but ‘SRE’ yielded better results when few presences were available from the actual distribution and ‘2°far’ yielded better results when more presences were selected. For the other four SDM (MDA, CTA, BRT and RF), ‘2°far’ performed better when presences were climatically biased ( Fig. 6 ).

Model accuracy was affected by the method used to generate pseudo-absences for each SDM (Figs 5 and 6): likelihood ratio tests were significant in all cases except for CTA with spatially biased presences. For GLM, GAM and MARS, randomly selected pseudo-absences produced the most accurate models. For the other four SDM (MDA, BRT, CTA and RF), there was less variation in the results obtained for each different method used to select pseudo-absences, but pseudo-absences selected with geographical exclusion (‘2°far’) yielded significantly better models with few presences, whereas pseudo-absences selected with climatic exclusion (‘SRE’) yielded better models with more presences. Consistently across SDM and the number of presences, we found that pseudo-absences selected with geographical exclusion (‘2°far’ and ‘1°far’) yielded predictions with higher sensitivities, whereas randomly selected pseudo-absences yielded predictions with higher specificities (Figs S3 and S4).

Depending on the SDM used, the interaction between the number of pseudo-absences and weighting of presences vs. absences had different but significant effects on TSS. The models can be separated into three groups (Figs 5 and 6). Firstly, GLM and GAM showed little variation in predictive accuracy in response to the number of pseudo-absences, but the predictive accuracy increased when using pseudo-absences with equal weight for presences and absences. Secondly, for CTA, BRT and RF, predictive accuracy was highest when approximately the same number of pseudo-absences was used as the number of presences (Fig. 3). For CTA and BRT, when the number of pseudo-absences differed from the number of presences, an equal weight for presences and absences gave better model predictive quality. These results were mainly explained by the very low sensitivity of these two SDM when a large number of pseudo-absences were generated (Fig. S3). Lastly, when using MARS and MDA, model quality was highest when 100 pseudo-absences were generated in each run, with equal weight given to presences and absences.

Evaluation results (TSS) of the mean distribution according to the number of replicates with different pseudo-absences used to get that distribution. The different curves represent the results with 100, 1000, or 10 000 pseudo-absences selected in each replicate, as well as the weighting scheme. Red asterisks indicate that the TSS from the mean distribution with a larger number of replicates is not significantly better. These results were obtained with 100 climatically biased presences from the first virtual species (similar results were obtained with spatially biased presences and unbiased presences).

Model quality (i.e. TSS) increased with the number of replicates of pseudo-absences used to calculate the mean prediction until reaching an asymptote ( Fig. 4 ). The number of replicates to reach the asymptote decreased significantly with the number of pseudo-absences selected per replicate. When 10 000 pseudo-absences (i.e. 20% of the study area) were used in each replicate, there was no effect of the number of replicates on model quality (i.e. no need for repetition). When 1000 pseudo-absences (i.e. 2% of the study area) were generated in each replicate, five replicates were enough to reach the asymptote with respect to model quality (TSS) for the GAM and CTA models, whereas the number of replicates did not affect model quality for the other five SDM (i.e. no need for repetition). When 100 pseudo-absences were generated in each replicate, model quality reached an asymptote at 12 replicates for the GAM model, seven replicates for GLM, MARS, MDA, CTA and RF, and four replicates for the BRT model. However, we noticed that with 100 pseudo-absences, the variability in TSS was substantial across the replicates, such that it was difficult to reliably identify an asymptote below 20 replicates ( Fig. 4 ): even though accuracy was not significantly different between the mean prediction obtained with 15 replicates and the mean prediction obtained with 20 replicates, the former was lower than the latter. The use of the mean distribution obtained from 20 replicates of pseudo-absence selection for each selection of presences that was a priori chosen to reduce the variability resulting from pseudo-absence selection and answer all other questions was therefore conservative.
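The mean prediction referred to above is a pixel-wise average over the replicate models. A minimal sketch, assuming each replicate yields one predicted probability per pixel, in the same pixel order (the function name is hypothetical):

```python
from typing import List, Sequence


def mean_prediction(replicates: Sequence[Sequence[float]]) -> List[float]:
    """Average the predicted probabilities of several replicates, pixel by pixel."""
    if not replicates:
        raise ValueError("at least one replicate is required")
    n = len(replicates)
    # zip(*replicates) groups the predictions made for the same pixel.
    return [sum(values) / n for values in zip(*replicates)]
```

Averaging smooths out the noise introduced by any single draw of pseudo-absences, which is consistent with the asymptote being reached sooner when each replicate already contains many pseudo-absences.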

The models could be separated into three groups according to the effect of prevalence on their predictive accuracy ( Fig. 3 ). GAM behaved differently from the others given this technique was not influenced by prevalence. The accuracy of MARS and MDA increased with prevalence, whereas the accuracy increased until an asymptote when the number of presences reached one tenth of the number of absences for GLM, BRT and RF or reached the same amount as the number of absences for CTA. These trends were not influenced by the weighting scheme of presences vs. absences.

## Discussion

### Influence of the Modelling Technique

In general, our results showed that the behaviour of the different SDM varied widely depending on how, where and how many pseudo-absences were used. First of all, although the model accuracy of regression techniques GLM and GAM was not influenced as much as other SDM by the number of pseudo-absences used in each replicate, the best results were obtained by using a large number of pseudo-absences (e.g. 10 000) with presences and absences weighted equally. These results are consistent with those obtained with Maxent (Phillips & Dudik 2008) for which more accurate results were also obtained with 10 000 background points. Conversely, for classification and machine-learning techniques including MARS, the models’ predictive accuracy was greater when a moderate number of pseudo-absences per replicate were used (either few pseudo-absences or not more than the number of presences). For these models, the choice of the number of pseudo-absences used in each replicate was the main influence on model accuracy, making it a key decision when setting up a modelling exercise. This difference in terms of the optimal number of pseudo-absences to use in each replicate for different SDM could not be solely attributed to the poor performance of classification and machine-learning techniques when the number of false absences increases (which is automatically the case when the number of pseudo-absences increases), because the study regarding the influence of prevalence over model accuracy, performed with true absences only, led to the same conclusions. This difference could therefore be attributed to the intrinsic properties of the different SDM with regard to prevalence.

The different SDM investigated in this study also appeared to behave differently with regard to the method used to generate pseudo-absences. Indeed, regression techniques were more greatly influenced by the choice of the method than classification and machine-learning techniques, and different methods were found to optimise model accuracy. When using regression techniques (GLM, GAM and MARS), the best strategy was to randomly generate the pseudo-absences data, which supported results from Wisz & Guisan (2009). Indeed, their study using simulated data showed that randomly selected pseudo-absences yielded better results than pseudo-absences selected from low suitability areas predicted using ENFA or BIOCLIM (equivalent to SRE). For classification and machine-learning techniques, although the method used to generate pseudo-absences had little influence on the models’ predictive accuracy, ‘2°far’ yielded significantly better models with few presences, whereas ‘SRE’ yielded better models with more presences. We can assume the difference in the best method for generating pseudo-absences according to the number of available presences to be the consequence of different false negative rates. Indeed, with few available presences, it is very unlikely that these presences represent the full climatic niche of the species. Therefore, pseudo-absences selected with environmental exclusion (‘SRE’) may have a higher chance of being false absences than pseudo-absences selected with large geographical exclusion (‘2°far’). However, as the amount of available presences increases, the probability of pseudo-absences selected with environmental exclusion being false absences decreases. With large amounts of presence data, although pseudo-absences selected with large geographical exclusion still have a better chance of being true absences, they are probably too different from the presence data to be as informative as the pseudo-absences selected with environmental exclusion. 
This may also depend in part on the level of spatial aggregation in species presences. Such differences regarding the best method of generating pseudo-absences indicate that regression techniques were less sensitive to false absences than classification and machine-learning techniques.

Finally, the optimal number of pseudo-absence replicates also differed between the different SDM. Some of these differences could be explained by the intrinsic properties of the SDM. For example, BRT and RF were the SDM that needed the fewest replicates when 100 pseudo-absences were selected per replicate, perhaps because both have internal replication procedures.

### Ensemble Forecast Perspectives

As modelling a species distribution under current and future conditions can give different results according to the SDM used (Thuiller 2004; Elith 2006) and as none of the widely used techniques performs universally better than the others (Elith 2006), the use of an ensemble forecast framework has been recommended (Buisson 2010). The ensemble forecast framework aims to consider the central trend of several SDM, using different methods (Marmion 2009), and is now widely used amongst species distribution modellers, often with the same use of pseudo-absences across the different SDM used. However, we have shown here that the optimal way of creating and using pseudo-absences information differs widely across SDM. The best way of using pseudo-absences through an ensemble forecast technique could therefore be to use pseudo-absences differently for each SDM. However, most ensemble forecast techniques compare model accuracy either to select the best models or to weight their predictions differently, which can only be done in an unbiased way if the same data were used for all SDM. One way of overcoming this potential problem could be to group together SDM that share the same way of optimising the use of pseudo-absences (e.g. GLM and GAM; BRT and RF), compare their model accuracy, select the best one from each group and then obtain the median or mean distribution from all selected models.

### Spatial Extent of the Study Area

As well as being influenced by the number of pseudo-absences and the method used to generate them, model performance also relies on the spatial extent of the study. Indeed, model performance is lower when pseudo-absences are taken from either a restricted or particularly broad area (Van Der Wal 2009). Pseudo-absences are meant to be compared with the presence data and help differentiate the environmental conditions under which a species can occur or not. Therefore, pseudo-absences taken too far from the presence data in the environmental space would not be very informative. As pseudo-absences that are very distant from all presence points (from a geographical point of view) are more likely to exhibit environmental conditions that are very different from those for the presence data, a larger spatial extent of the study will lead to the selection of a higher proportion of less informative pseudo-absences. The optimal number of pseudo-absences to generate in each run is therefore likely to depend on the spatial extent of the study, which influences environmental variability. At a given spatial resolution, a higher number of pseudo-absences may be needed to optimise model performance for a larger spatial extent of the study, to ensure the selection of enough informative pseudo-absences.

### Maximising Sensitivity or Specificity

When the modelling goal is to identify potential presences of rare species for new survey efforts (Engler, Guisan & Rechsteiner 2004), high sensitivity is preferred, even if it generates overprediction. High sensitivity ensures that the percentage of true presences predicted as absences will be minimised. In such studies, the ‘SRE’, ‘1 and 2°far’ methods can be used as well as other methods for selecting pseudo-absences outside both spatially and climatically suitable areas (Hengl 2009; Lobo, Jimenez-Valverde & Hortal 2010). The selection of fewer pseudo-absences in each replicate also yielded better sensitivity (except for GLM and GAM, for which large amounts of pseudo-absences with an equal weighting of presences vs. absences still yielded better sensitivity). In contrast, other studies may wish to maximise specificity, so that the predicted distribution of a species would only be the area where the species is highly likely to be present. This is particularly true for studies on reserve planning (Marini 2009). High specificity ensures that the percentage of true absences predicted as presences will be minimised. In such cases, the random selection of pseudo-absences will maximise specificity. As for the number of pseudo-absences to generate in each replicate to maximise specificity, it depends on the number of presence points available, but overall a large number of pseudo-absences tends to yield better specificity for all SDM except GLM and GAM for which fewer pseudo-absences are better. All these results regarding sensitivity and specificity are dependent on the threshold used to produce binary distributions. The use of another commonly used threshold (minimising the difference between sensitivity and specificity) could yield slightly different results as it tends to favour specificity, whereas the threshold we used tends to favour sensitivity (Jimenez-Valverde & Lobo 2007).

## How random is pseudo-random? Testing pseudo-random number generators and measuring randomness

After introducing true and pseudo-random number generators, and presenting the methods used to measure randomness, this article details a number of common statistical tests used to evaluate the quality of random number generators.

## Introduction: true and pseudo-random number generators

Obtaining long sequences of random numbers, as required by some cryptographic algorithms, is a delicate problem. There are basically two types of random number generators: true random number generators, and pseudo-random number generators.

### True random number generators (TRNGs)

True random number generators rely on a physical source of randomness, and measure a quantity that is either theoretically unpredictable (quantum), or practically unpredictable (chaotic, that is, so hard to predict that it appears random).

I’ve compiled a list of sources of true randomness (capable of sufficiently fast output); please let me know in the comments if I forgot any.

• Thermal noise, due to the thermal agitation of charge carriers in an electronic component, and used by some Intel RNGs
• Imaging of random processes, used by LavaRnd
• Atmospheric noise, mostly due to thunderstorms happening around the globe, and used by random.org
• Cosmic noise, due to distant stars and background radiation
• Driver noise, often used by /dev/random in Unix-like operating systems
• Transmission by a semi-transparent mirror, used by randomnumbers.info
• Nuclear decay, used by HotBits
• Coupled inverters, used by Intel’s new random generators.

The coupled inverters technique devised by Greg Taylor and George Cox at Intel is truly the most fascinating one: it forces a pair of cross-coupled inverters into a metastable state and lets thermal noise decide which of the two stable states the circuit settles into.

### Pseudo-random number generators (PRNGs)

Pseudo-random number generators are very different: they act as a black box, which takes one number (called the seed) and produces a sequence of bits; this sequence is said to be pseudo-random if it passes a number of statistical tests, and thus appears random. These tests are discussed in the following section; simple examples include measuring the frequency of bits and bit sequences, evaluating entropy by trying to compress the sequence, or generating random matrices and computing their rank.
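The compression idea mentioned above can be tried with Python's standard `zlib` module. This is a rough sketch rather than a calibrated statistical test: a general-purpose compressor only detects fairly obvious structure, and the function name is hypothetical.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (close to 1 means incompressible)."""
    if not data:
        raise ValueError("data must not be empty")
    return len(zlib.compress(data, 9)) / len(data)


if __name__ == "__main__":
    print(compression_ratio(b"\x00" * 10_000))    # highly regular: tiny ratio
    print(compression_ratio(os.urandom(10_000)))  # OS randomness: ratio near 1
```

A low ratio is strong evidence of non-randomness, but a ratio near 1 proves little: many poor generators produce output that `zlib` cannot compress.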

#### Formal definition

A pseudo-random number generator is formally defined by an initialization function, a state (a sequence of bits of bounded length), a transition function, and an output function:

• The initialization function takes a number (the seed), and puts the generator in its initial state.
• The transition function transforms the state of the generator.
• The output function transforms the current state to produce a fixed number of bits (each a zero or a one).

A sequence of pseudo-random bits, called a run, is obtained by seeding the generator (that is, initializing it using a given seed to put it in its initial state), and repeatedly calling the transition function and the output function. This procedure is illustrated by the following diagram:

Schematic diagram of a pseudo-random number generator

In particular, this means that the sequence of bits produced by a PRNG is fully determined by the seed. This has a few unpleasant consequences, one of which is that a PRNG can only generate as many distinct sequences as it accepts distinct seeds. Since the state can usually take far more values than the seed, every re-seeding throws the generator back onto one of these comparatively few starting points, so it is generally best to reset (re-seed) the generator only rarely.
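The formal definition above can be sketched in a few lines of Python. This is only an illustration: the class name `ToyPRNG`, the helper `run`, and the choice of a 32-bit linear congruential transition with commonly quoted constants are assumptions made for the example, not something this article prescribes, and such a generator is far too weak for any serious use.

```python
# Illustrative constants of a classic 32-bit linear congruential generator
# (an assumption for this sketch, not taken from the article).
MULTIPLIER = 1664525
INCREMENT = 1013904223
MODULUS = 2 ** 32


class ToyPRNG:
    """Toy generator split into initialisation, transition and output."""

    def __init__(self):
        self.state = 0

    def initialise(self, seed):
        # Initialisation function: the seed fixes the initial state.
        self.state = seed % MODULUS

    def transition(self):
        # Transition function: maps the current state to the next one.
        self.state = (MULTIPLIER * self.state + INCREMENT) % MODULUS

    def output(self):
        # Output function: emits one bit (the top bit of the state).
        return (self.state >> 31) & 1


def run(seed, n_bits):
    """Seed the generator, then alternate transition and output."""
    generator = ToyPRNG()
    generator.initialise(seed)
    bits = []
    for _ in range(n_bits):
        generator.transition()
        bits.append(generator.output())
    return bits
```

Calling `run(42, 64)` twice returns the same 64 bits, which is exactly the determinism described above: the seed alone fixes the whole run.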

#### Cryptographically secure pseudo-random number generators

Since PRNGs are commonly used for cryptographic purposes, the transition and output functions are sometimes required to satisfy two additional properties:

1. Unpredictability: given a sequence of output bits, the preceding and following bits shouldn’t be predictable.
2. Non-reversibility: given the current state of the PRNG, the previous states shouldn’t be computable.

Rule 1 ensures that eavesdropping on the output of a PRNG doesn’t allow an attacker to learn more than the bits they overheard.
Rule 2 ensures that past communications wouldn’t be compromised should an attacker manage to gain knowledge of the current state of the PRNG (thereby compromising all future communications based on the same run).

When a PRNG satisfies these two properties, it is said to be cryptographically secure.
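As a practical illustration (in Python, which this article does not otherwise use), the standard library draws the same line: the `random` module is a seeded, reproducible generator documented as unsuitable for security purposes, while the `secrets` module is meant for keys and tokens. The seed value and variable names below are arbitrary choices for the sketch.

```python
import random
import secrets

# Seeded generator: reproducible, which is convenient for simulations.
simulation_rng = random.Random(1234)
first_run = [simulation_rng.getrandbits(1) for _ in range(16)]

# Anyone who learns the seed can replay the whole run bit for bit.
replay_rng = random.Random(1234)
replayed_run = [replay_rng.getrandbits(1) for _ in range(16)]

# Generator backed by the operating system, intended for secrets.
session_key = secrets.token_bytes(16)
```

Here `first_run == replayed_run` always holds, whereas `session_key` cannot be reproduced from any value visible to the program.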

## Testing pseudo-random number generators

There are a number of statistical tests devised to distinguish pseudo-random number generators from true ones; the more tests a PRNG passes, the closer it is considered to be to a true random source. This section presents the general methodology, and studies a few example tests.

### Testing methodology

#### Testing a single sequence

Almost all tests are built on the same structure: the pseudo-random stream of bits produced by a PRNG is transformed (bits are grouped or rearranged, arranged in matrices, some bits are removed, …), statistical measurements are made (number of zeros, of ones, matrix rank, …), and these measurements are compared to the values mathematically expected from a truly random sequence.

More precisely, assume that $$f$$ is a function taking any finite sequence of zeros and ones, and returning a non-negative real value. Then, given a sequence of independent and uniformly distributed random variables $$X_n$$, applying $$f$$ to the finite sequence of random variables $$(X_1, \ldots, X_n)$$ yields a new random variable, $$Y_n$$. This new variable has a certain cumulative probability distribution $$F_n(x) = \mathbb{P}(Y_n \leq x)$$, which in some cases approaches a function $$F$$ as $$n$$ grows large. This limit function $$F$$ can be seen as the cumulative probability distribution of a new random variable, $$Y$$, and in these cases $$Y_n$$ is said to converge in distribution to $$Y$$.

Randomness tests are generally based on carefully selecting such a function, and calculating the resulting limit probability distribution. Then, given a sequence $$\epsilon_1, \ldots, \epsilon_n$$ to test, it is possible to calculate the value of $$y = f(\epsilon_1, \ldots, \epsilon_n)$$, and assess how likely this value was by evaluating the probability $$\mathbb{P}(Y \geq y)$$, that is, the probability that a single draw of the limit random variable $$Y$$ yields a result greater than or equal to $$y$$. If this probability is lower than $$0.01$$, the sequence $$\epsilon_1, \ldots, \epsilon_n$$ is said to be non-random with confidence 99%. Otherwise, the sequence is considered random.

#### An example: the bit frequency test

Let $$(X_n)$$ be an infinite sequence of independent, identically distributed random variables. Take $$f$$ to be $$f(x_1, \ldots, x_n) = \frac{1}{\sqrt{n}} \sum_{k=1}^n \frac{x_k - \mu}{\sigma}$$ where $$\mu$$ is the mean, and $$\sigma$$ the standard deviation, of any $$X_k$$. Define $$Y_n = f(X_1, \ldots, X_n)$$. The central limit theorem states that the distribution of $$Y_n$$ tends to the standard normal distribution. This theorem is illustrated below.

Example of convergence in distribution to the standard normal distribution, after summing 1, 2, 3, 4, 5, 8, 10, and 50 independent random variables following the same distribution.

Consider the particular case where $$X_n$$ is uniformly distributed over the discrete set $$\{-1,1\}$$, so that $$\mu = 0$$ and $$\sigma = 1$$. The central limit theorem applies, and $$Y_n = f(X_1, \ldots, X_n)$$ converges in distribution to the standard normal law. For practical reasons, however, we choose to consider $$Y_n = |f(X_1, \ldots, X_n)|$$, which converges in distribution to a variable $$Y_\infty$$. $$Y_\infty$$ has the half-normal distribution (illustrated below), whose density has the nice property of being monotonically decreasing.

In plain colours, the half-normal distribution. In pale colours, the corresponding cumulative distribution function. If a pseudo-random sequence is picked in the red area, it is declared non-random with 99% certainty (the yellow and orange areas correspond to a certainty of 90% and 95% respectively).

Here is how the test proceeds: the sequence $$(\epsilon_1, \ldots, \epsilon_n)$$ given by the PRNG is converted to a sequence of elements of $$\{-1, 1\}$$ by changing each $$\epsilon_i$$ to $$x_i = 2 \cdot \epsilon_i - 1$$. This gives $$x_1, \ldots, x_n$$. Then, $$|f(x_1, \ldots, x_n)| = \left|\sum_{k=1}^n \frac{x_k}{\sqrt{n}}\right|$$ is numerically calculated, yielding $$y$$. Finally, the theoretical likelihood of a truly random sequence yielding a value equal to or greater than $$y$$ is evaluated using the cumulative distribution function of $$Y_\infty$$ (in pale colours).

If this probability is too low (red area, probability below $$0.01$$), the sequence is rejected as non-random, with 99% certainty. If it is greater than $$0.01$$, but less than $$0.05$$ (orange area) the sequence is sometimes rejected as non-random with 95% certainty. Similarly, if it is greater than $$0.05$$ but less than $$0.10$$ (yellow area), the sequence is sometimes rejected as non-random, with 90% certainty.

High probabilities (green area, probabilities greater than $$0.10$$), finally, do not make it possible to distinguish the pseudo-random sequence from a truly random one.
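The whole procedure fits in a few lines of Python. The sketch below is one possible implementation, with a hypothetical function name; it relies on the fact that, for a half-normal variable, $$\mathbb{P}(Y_\infty \geq y) = \operatorname{erfc}(y / \sqrt{2})$$, which the standard `math.erfc` function computes.

```python
import math


def monobit_p_value(bits):
    """Probability that a truly random sequence deviates at least as much."""
    n = len(bits)
    # Map each bit to -1 or +1, then sum.
    total = sum(2 * bit - 1 for bit in bits)
    # Normalised absolute sum: the value called y in the text.
    y = abs(total) / math.sqrt(n)
    # Tail probability of the half-normal distribution.
    return math.erfc(y / math.sqrt(2))
```

A sequence of a hundred ones gives a probability far below $$0.01$$ and is rejected. The strictly alternating sequence `[0, 1] * 50` gives a probability of exactly 1 and passes, even though it is obviously not random: a single test is never enough.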

#### Testing a pseudo-random number generator

To test whether a pseudo-random number generator is close to a true one, a sequence length is chosen, and $$m$$ pseudo-random sequences of that length are retrieved from the PRNG, then analysed according to the previous methodology. It is expected, if the significance level is 1%, that about 99% of the sequences pass, and 1% of the sequences fail; if the observed ratio significantly differs from 99 to 1, the PRNG is said to be non-random. More precisely, the confidence interval is generally chosen to be $$p \pm 3\frac{\sqrt{p(1-p)}}{\sqrt{m}}$$, where $$p$$ is the theoretical ratio (99% here); this takes the probability of incorrectly rejecting a good number generator down to 0.3%.

This confidence interval is obtained as follows. Assume that the $$m$$ sequences are independent, so that the probability of rejecting exactly $$r$$ of them is $$\binom{m}{r} p^{m-r} (1-p)^r$$ (here $$p = 99\%$$). Write $$\sigma_i$$ for the random variable denoting whether the $$i$$th sequence was rejected. Then all $$\sigma_i$$ are independent and take value $$0$$ with probability $$p$$, $$1$$ with probability $$1-p$$ (standard deviation: $$\sqrt{p(1-p)}$$). The central limit theorem states that, for large values of $$m$$, the observed pass frequency $$\hat{p} = 1 - \frac{1}{m}\sum_{i=1}^m \sigma_i$$ lies in the interval $$\left[p + a\frac{\sqrt{p(1-p)}}{\sqrt{m}}, p + b\frac{\sqrt{p(1-p)}}{\sqrt{m}}\right]$$ with probability close to $$\displaystyle \int_a^b \frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx$$. Finally, setting $$a = -b$$ and adjusting $$b$$ so that this probability is $$\approx 99.7\%$$ yields $$b \approx 3$$.
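As a rough sketch of this decision rule, the Python functions below compute the acceptance interval for the proportion of passing sequences and check an observed list of probabilities against it. The function names and the default significance level are choices made for this example.

```python
import math


def acceptance_interval(m, p=0.99, width=3.0):
    """Interval p +/- width * sqrt(p(1-p)/m) for the pass ratio."""
    half_width = width * math.sqrt(p * (1 - p) / m)
    return p - half_width, p + half_width


def generator_passes(p_values, alpha=0.01):
    """Check whether the observed pass ratio falls inside the interval."""
    m = len(p_values)
    # A sequence passes when its probability is at least alpha.
    pass_ratio = sum(1 for value in p_values if value >= alpha) / m
    low, high = acceptance_interval(m, p=1 - alpha)
    return low <= pass_ratio <= high
```

For $$m = 1000$$ the interval is roughly $$[0.9806, 0.9994]$$. Note that with this many sequences a generator whose sequences all pass falls outside the interval as well: a result that is too good is also suspicious.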

### Test suites

A number of test suites have been proposed as standards. The state-of-the-art test suite was for a long time the DieHard test suite (designed by George Marsaglia), though it was eventually superseded by the National Institute of Standards and Technology recommendations. Pierre L’Ecuyer and Richard Simard, from the Université de Montréal, have recently published a new collection of tests, TestU01, gathering and implementing an impressive number of previously published tests and adding new ones.

A subset of these tests is presented below.

### A list of common statistical tests used to evaluate randomness

#### Equidistribution (Monobit frequency test, discussed above)

Purpose: Evaluate whether the sequence has a systematic bias towards 0 or 1 (real-time example on random.org)
Description: Verify that the arithmetic mean of the sequence approaches $$0.5$$ (based on the law of large numbers). Alternatively, verify that normalized partial sums $$s_n = \frac{1}{\sqrt{n}} \sum_{k=1}^n (2\cdot\epsilon_k - 1)$$ approach a standard normal distribution (based on the central limit theorem).
Variants: Block frequency test (group bits in blocks of constant length and perform the same test), Frequency test in a block (group bits in blocks of constant length and perform the test on every block).

#### Runs test

Purpose: Evaluate whether runs of ones or zeros are too long (or too short) (real-time example on random.org)
Description: Count the number of same-digit blocks in the sequence. This number is one plus the sum of the $$\sigma_i$$, $$1 \leq i \leq n-1$$, where $$\sigma_i = 1$$ if $$\epsilon_i \neq \epsilon_{i+1}$$ and $$\sigma_i = 0$$ otherwise. For a truly random sequence this sum follows a binomial distribution $$\mathcal{B}(n-1, 1/2)$$, so the previous method can be applied.
Variants: Longest run of ones (divide the sequence in blocks and measure the longest run of ones in each block).
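A minimal Python sketch of the runs test follows. It uses the simple normal approximation outlined in the description (the number of changes between neighbouring bits is treated as binomial with $$n-1$$ trials and probability $$1/2$$); the function names are hypothetical, and the version specified by NIST differs in its details.

```python
import math


def count_runs(bits):
    """Number of maximal blocks of identical consecutive bits."""
    changes = sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    return changes + 1


def runs_p_value(bits):
    """Two-sided probability under the normal approximation."""
    n = len(bits)
    changes = count_runs(bits) - 1
    # Mean and standard deviation of a binomial B(n - 1, 1/2).
    mean = (n - 1) / 2
    std = math.sqrt((n - 1) / 4)
    z = (changes - mean) / std
    return math.erfc(abs(z) / math.sqrt(2))
```

Both a constant sequence (runs too long) and a strictly alternating one (runs too short) receive a probability far below $$0.01$$, while the alternating sequence would sail through the bit frequency test.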

#### Binary matrix rank

Purpose: Search for linear dependencies between successive blocks of the sequence
Description: Partition the sequence into blocks of length $$n \times p$$. Build one $$n \times p$$ binary matrix from each block, and compute its rank. The theoretical distribution of all ranks is however not easy to compute; this problem is discussed in detail in a paper by Ian F. Blake and Chris Studholme from the University of Toronto.
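The measurement step of this test (not the theoretical rank distribution) can be sketched in Python as follows. Ranks are computed over the two-element field, where addition is XOR; each matrix row is packed into an integer. The function names and the row-packing convention are choices made for this example.

```python
def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are packed into integers."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            # Lowest set bit of the pivot row serves as pivot column.
            low_bit = pivot & -pivot
            # Clear that column in every remaining row (XOR = addition).
            rows = [row ^ pivot if row & low_bit else row for row in rows]
    return rank


def block_ranks(bits, n_rows, n_cols):
    """Cut the sequence into n_rows x n_cols matrices and rank each one."""
    block_size = n_rows * n_cols
    ranks = []
    for start in range(0, len(bits) - block_size + 1, block_size):
        block = bits[start:start + block_size]
        rows = []
        for r in range(n_rows):
            row_bits = block[r * n_cols:(r + 1) * n_cols]
            rows.append(int("".join(str(bit) for bit in row_bits), 2))
        ranks.append(gf2_rank(rows))
    return ranks
```

For example, `block_ranks([1, 0, 0, 1, 1, 1, 1, 1], 2, 2)` yields one full-rank matrix and one matrix of rank 1. The observed counts of each rank would then be compared with their theoretical frequencies.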

#### Discrete Fourier transform

Purpose: Search for periodic components in the pseudo-random sequence
Description: Compute a discrete Fourier transform of the pseudo-random input $$2\epsilon_1 - 1, \ldots, 2\epsilon_n - 1$$, and take the modulus of each complex coefficient. Compute the 95% peak height threshold, defined as the theoretical value that no more than 5% of these moduli should exceed (NIST gives $$\sqrt{-n\log(0.05)}$$), and count the number of moduli that actually exceed it. Assess the likelihood of such a count.
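The counting step can be sketched in Python with a naive transform, which is quadratic in the sequence length and therefore only suitable for short inputs. Restricting the count to the first $$n/2$$ coefficients is an assumption of this sketch (for real input the remaining coefficients mirror them), and the function names are hypothetical.

```python
import cmath
import math


def dft_moduli(values, count):
    """Moduli of the first `count` discrete Fourier coefficients."""
    n = len(values)
    moduli = []
    for k in range(count):
        coefficient = sum(
            value * cmath.exp(-2j * math.pi * j * k / n)
            for j, value in enumerate(values)
        )
        moduli.append(abs(coefficient))
    return moduli


def count_peaks(bits, tail=0.05):
    """Number of moduli above the threshold sqrt(-n log(tail))."""
    n = len(bits)
    signal = [2 * bit - 1 for bit in bits]
    threshold = math.sqrt(-n * math.log(tail))
    return sum(1 for m in dft_moduli(signal, n // 2) if m > threshold)
```

The returned count is then compared with the expected proportion, about 5% of the coefficients examined. On the period-four input `[0, 0, 1, 1] * 16` all the energy sits in a single coefficient, so exactly one modulus exceeds the threshold.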

#### Compressibility (Maurer’s universal statistical test)

Purpose: Determine whether the sequence can be compressed without loss of information (evaluate information density). A random sequence should have a high information density.
Description: Partition the sequence into blocks of length $$L$$, and split these blocks into two groups: call the first $$Q$$ blocks the initialisation sequence, and the remaining $$K$$ blocks the test sequence. Then, give each block in the test sequence a score equal to the distance that separates it from the previous occurrence of a block with the same contents, and sum the binary logarithms of all scores. Divide by the number $$K$$ of test blocks to obtain the value of the test function, and verify that this value is close enough to the theoretically expected value.
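The test function itself is short to write down. The Python sketch below computes it under one stated assumption: blocks are numbered from 1, and a block that has never been seen before is scored with its own index, which is the convention used in the NIST description. The comparison with the theoretically expected value, which depends on $$L$$, is left out, and the function name is hypothetical.

```python
import math


def maurer_statistic(bits, block_length, init_blocks):
    """Average binary logarithm of the distances between repeated blocks."""
    n_blocks = len(bits) // block_length
    test_blocks = n_blocks - init_blocks
    last_seen = {}  # block contents -> index of its latest occurrence
    total = 0.0
    for index in range(1, n_blocks + 1):
        start = (index - 1) * block_length
        block = tuple(bits[start:start + block_length])
        if index > init_blocks:
            # Distance to the previous occurrence (or to index 0 if none).
            total += math.log2(index - last_seen.get(block, 0))
        last_seen[block] = index
    return total / test_blocks
```

A highly repetitive sequence produces small distances and hence a low value: with one-bit blocks, the alternating sequence `[0, 1, 0, 1, 0, 1, 0, 1]` and two initialisation blocks gives exactly 1, since every block reappears two positions after its previous occurrence.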

#### Maximum distance to zero, Average flight time, Random excursions

Purpose: Verify that the sequence has some of the properties of a truly random walk
Description: Consider the pseudo-random sequence to represent a random walk starting from zero and moving up (down) by one unit every time a zero (a one) occurs in the sequence. Measure the maximum height reached by the walk, as well as the average time and the number of states visited. Also, for each possible state (integer), measure in how many cycles it appears (a cycle being the portion of the walk between two consecutive visits to zero). Evaluate the theoretical probability of obtaining these values.
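The measurements on the walk are easy to collect in Python. The sketch below follows the convention of the description (a zero moves the walk up, a one moves it down) and returns a few raw quantities; the function name and the dictionary keys are choices made for this example, and the probabilities associated with these values are not computed here.

```python
def random_walk_summary(bits):
    """Raw measurements on the walk defined by a bit sequence."""
    position = 0
    max_distance = 0
    visited = {0}  # the walk starts at zero
    returns_to_zero = 0
    for bit in bits:
        # A zero moves the walk up, a one moves it down.
        position += 1 if bit == 0 else -1
        max_distance = max(max_distance, abs(position))
        visited.add(position)
        if position == 0:
            # Each return to zero closes one cycle.
            returns_to_zero += 1
    return {
        "max_distance": max_distance,
        "states_visited": len(visited),
        "cycles": returns_to_zero,
    }
```

For the sequence `[0, 0, 1, 1, 1, 0]` the walk visits the positions 1, 2, 1, 0, -1, 0: it reaches a maximum distance of 2 from zero, visits 4 distinct states, and completes 2 cycles.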

The sources previously cited (in particular the NIST recommendation paper) present mathematical background about these tests, as well as lots of other tests.

Did I miss your favourite randomness test? Were you ever confronted with obvious imperfections in a pseudo-random number generator? Tell us in the comments!
