Scipy stats =========== .. contents:: Scipy contains a library with statistical functions, distributions and tests, called ``scipy.stats``. In this example, we will explore some of the possibilities it offers, tackling the following problem from asteroseismology of sdB stars (see e.g. Kawaler 1988): Due to the interior structure of an sdB star, we expect long pulsation periods to follow an equidistant pattern (i.e., a period spacing). The value of the period spacing tells us more about the extent of the core and the envelope. A complication is that often, not only periods from a period spacing are observed, but also other frequencies, e.g. due to rotational splitting. So given a list of observed periods, we need to find the value of the period spacing, and perhaps test whether it is significant. We use two approaches to this problem: * Following Kawaler 1988, we perform the Kolmogorov-Smirnov test to see whether a certain period spacing is significant. Doing this for a lot of trial period spacings, we can make some kind of *periodogram*. * We run over a grid of possible periods, and compute the chi-square statistic for every trial period (and offset). The second strategy is incredibly crude, and, scientifically, not really the way to go. However, it allows us to use the chi-square distribution implemented in scipy to derive confidence intervals on the period spacing (and offset). Setup ----- To explore this problem, we will simulate an observed set of period spacings, and add some noise to them. First, we need to do the usual imports: .. sourcecode:: python import numpy as np import pylab as plt from numpy import pi import scipy.stats Now, generate 8 almost equidistant spacings, and add 3 random ones periods mixed with the equidistant structure. We fix the seed to be able to reproduce the results: .. sourcecode:: python np.random.seed(1111) periods = 1.23 + np.arange(8)*0.43 # the spacing and offset periods += scipy.stats.norm.rvs(size=len(periods),scale=0.005) # add normal noise periods = np.hstack([periods,scipy.stats.uniform.rvs(size=3,loc=1.23,scale=3)]) # mix with 3 others periods.sort() Note that we could have used the ``np.random`` module too to generate the random variables. However, the ``scipy.stats`` module contains much more distributions and many more options. For every statistical distribution, you can * generate random variables with ``.rvs()`` * compute the probability density function with ``.pdf()`` * compute the cumulative density function with ``.cdf()`` * ... and much more This is a list of spacings we want to test: .. sourcecode:: python test_spacings = np.linspace(0.05,1,1000) The Kolmogorov-Smirnov test --------------------------- The following few lines of codes closely follow Kawaler (1988). Since this is not a tutorial on statistics, we refer to that paper for more information. We choose the shortest period to compare with, and remove it from the list of periods. .. sourcecode:: python index_short = np.argmin(periods) p_short = periods[index_short] periods_del = np.delete(periods,index_short) Then, we calculate the distribution of the periods compared to the tested spacing: .. sourcecode:: python dp = [] for dp_test in test_spacings: ri = [] for pi in periods_del: ni = (pi-p_short)/dp_test ri.append(ni-np.floor(ni)) dp.append(ri) Next, we use the **Kolmogorov-Smirnov** test to derive whether the ``ri`` come from a uniform distribution. We do this for every test spacing: .. sourcecode:: python Q = np.zeros(len(dp)) for i,ri in enumerate(dp): ksstat,pvalue = scipy.stats.kstest(ri,'uniform') Q[i] = pvalue # we're only interested in the pvalue, not the KS statistic In principle, we should now correct for the number of trial periods (the Bonferroni correction, but we omit that step here. Finally, we can make a plot: .. sourcecode:: python plt.figure() plt.plot(test_spacings,Q) test_spacings[np.argmin(Q)] .. image:: ks_ks.png :scale: 50 A Chi-square approach --------------------- The chi-square approach requires us to check a set of predicted period spacings, which all have their own offset value and period spacing, against the observed set of period spacing. We will make the huge assumption that we already know that we have 8 periods within a period spacing. We have to cycle over all the possibilities, build the set of period spacing, and match them with the closest values in the set of observed periods (we do not care about the three random ones, but we do not know which ones the are). Let us first make a search grid: .. sourcecode:: python test_offset = np.linspace(1.2,1.25,500) test_spacings = np.linspace(0.3,0.5,500) And prepare for the computation of the chi-square statistic. We have 8 periods to match and free parameters, so we need to set ``k=8-2=6`` in the chi-square distribution: .. sourcecode:: python chi2 = np.zeros((len(test_offset),len(test_spacings))) k = 8-2 Next, we run over the grid and compute the chi-square statistic, which is defined as the sum of the residuals-squared divided by the error-squared: .. sourcecode:: python for i,to in enumerate(test_offset): for j,ts in enumerate(test_spacings): test_periods = to+np.arange(8)*ts periods_ = periods[:-1] + np.diff(periods)/2. closest_index = np.searchsorted(periods_,test_periods) chi2[i,j] = np.sum((periods[closest_index]-test_periods)**2/0.005**2) Using ``scipy.stats``, we can compute the cumulative density of the chi-square value, which corresponds to some kind of confidence rating: .. sourcecode:: python chi2_stat = scipy.stats.distributions.chi2.cdf(chi2,k) That's it! All is left is to make some plots: we choose to make a plot of the chi-square values themselves (left), and a plot of the confidence intervals (right). .. sourcecode:: python plt.figure() plt.subplot(121) extent = test_spacings.min(),test_spacings.max(),test_offset.max(),test_offset.min() plt.imshow(chi2,aspect='auto',extent=extent,vmin=10,vmax=3000,cmap=plt.cm.spectral) plt.colorbar() plt.subplot(122) plt.imshow(chi2_stat*100,aspect='auto',extent=extent,cmap=plt.cm.spectral) plt.plot(0.43,1.23,'x',ms=40,mew=4,color='1') plt.xlim(0.425,0.434) plt.ylim(1.24,1.212) cbar = plt.colorbar() cbar.set_label('Confidence interval (%)') .. image:: ks_chi2.png :scale: 50