
Last Updated: 2024-03-14 12:45:32 Thursday


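Statistics is a branch of applied mathematics with many connections to everyday life and to computer algorithms, and it relies heavily on common sense and logical thinking. When first learning statistics, you do not need to dig into proofs of the formulas; understanding and applying them is enough.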

Basic Concepts

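Qualitative variables resemble categories and are usually displayed with a bar chart or a pie chart. Quantitative variables are usually displayed with a histogram, which shows the data distribution.

The Difference Between a Bar Chart and a Histogram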

Remember, though, that different samples from the same population will produce different histograms, even if you use the same class boundaries. However, you can expect that the sample and population histograms will be similar. As you add more and more data to the sample, the two histograms become more and more alike. If you enlarge the sample to include the entire population, the two histograms will be identical!

Numerical Measures

Besides presenting data graphically (data visualization), we can use numerical measures to describe certain properties of the data. These measures are called parameters when associated with the population, and they are called statistics when calculated from sample measurements.

Measures of Center






Median is less sensitive to extreme values or outliers! If a distribution is strongly skewed by one or more extreme values, you should use the median rather than the mean as a measure of center.
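A small sketch of this effect, using Python's standard `statistics` module. The salary figures are made up for illustration and are not data from these notes:

```python
import statistics

# Hypothetical salaries (in thousands); 400 is an extreme value added later.
salaries = [30, 32, 35, 36, 40]
with_outlier = salaries + [400]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# The mean is pulled far toward the outlier; the median barely moves.
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```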



The mode is the category that occurs most frequently, or the most frequently occurring value of x. When measurements on a continuous variable have been grouped as frequency or relative frequency histogram, the class with the highest peak or frequency is called the modal class, and the midpoint of that class is taken to be the mode. It is possible for a distribution of measurements to have more than one mode.
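Since a data set can have more than one mode, a helper that returns all of them can be handy. The function name `modes` below is a hypothetical choice for this sketch:

```python
from collections import Counter

def modes(values):
    """Return every value that attains the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 3, 3]))     # a single mode
print(modes([1, 1, 2, 3, 3]))  # two modes (bimodal)
```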

A note on English: Think of modal as relating to some "mode" or form. The English word modal has long been used as a term in logic and statistics, such as "modal values". In general, each modal value corresponds to one distribution, and a distribution can be thought of as a mode or form.


>>> import numpy as np
>>> np.mean((1,2,3,3))    # arithmetic mean: 9 / 4 = 2.25
>>> np.median((1,2,3,3))  # average of the two middle values: 2.5

Measures of Variability


The range, R, of a set of n measurements is defined as the difference between the largest and smallest measurements.


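The deviations \(x_i - \bar{x}\) can be positive or negative, and they always sum to zero, which is why the variance is built from squared deviations.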



Standard Deviation




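>>> import numpy as np
>>> x = np.random.randn(100)       # 100 draws from the standard normal distribution
>>> np.var(x, ddof=1)              # sample variance (divides by n - 1)
>>> np.std(x, ddof=1)              # sample standard deviation
>>> np.sqrt(np.var(x, ddof=1))     # same value: the square root of the variance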

Tchebysheff's Theorem


Given a number k greater than or equal to 1 and a set of n measurements, at least \(1 - \cfrac{1}{k^2}\) of the measurements will lie within k standard deviations of their mean.

k \(1 - \cfrac{1}{k^2}\) Interval
1 0 \(\bar{x}\pm s\)
2 \(\frac{3}{4}\) \(\bar{x}\pm 2s\)
3 \(\frac{8}{9}\) \(\bar{x}\pm 3s\)
2.5 0.84 \(\bar{x}\pm 2.5\times s\)
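The bound holds for any data set, whatever its shape. The following is a minimal check on a synthetic, strongly skewed sample; the exponential data and the seed are illustrative assumptions, not part of the notes:

```python
import random
import statistics

random.seed(0)
# Exponential data is far from mound-shaped, yet the bound must still hold.
data = [random.expovariate(1.0) for _ in range(1000)]
mean = statistics.mean(data)
s = statistics.stdev(data)

def fraction_within(k):
    """Fraction of measurements within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * s for x in data) / len(data)

for k in (1, 2, 2.5, 3):
    bound = 1 - 1 / k**2
    print(k, round(bound, 2), round(fraction_within(k), 3))
```

The observed fractions are usually well above the bound, which is a reminder that Tchebysheff's Theorem gives a conservative lower limit.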

Empirical Rule

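If the data are mound-shaped, the Empirical Rule can be used to describe their distribution. The rule comes from the famous Gaussian (normal) distribution, which appears very frequently in nature. The results are stated here without derivation: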

Interval Approximate proportion of measurements
\(\bar{x}\pm s\) 68%
\(\bar{x}\pm 2s\) 95%
\(\bar{x}\pm 3s\) 99.7%
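These percentages can be checked on simulated mound-shaped data. The sketch below assumes normally distributed values with an arbitrary mean of 170 and standard deviation of 8 (hypothetical heights in cm):

```python
import random
import statistics

random.seed(1)
data = [random.gauss(170, 8) for _ in range(10_000)]
mean = statistics.mean(data)
s = statistics.stdev(data)

coverage = {}
for k in (1, 2, 3):
    inside = sum(mean - k * s <= x <= mean + k * s for x in data)
    coverage[k] = inside / len(data)
    print(f"within {k} s: {coverage[k]:.3f}")
```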

Range Approximation of s

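Both Tchebysheff's Theorem and the Empirical Rule show the same pattern: the vast majority of measurements fall within two standard deviations on either side of the mean, an interval about 4 standard deviations wide. The range R therefore gives a rough estimate of the standard deviation s:

\(s\approx\cfrac{R}{4}\)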



Number of Measurements Expected Ratio of Range to s
5 2.5
10 3
25 4



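As the formula \(z=\cfrac{x-\bar{x}}{s}\) shows, the z-score measures how many standard deviations a measurement lies from the mean. It can be positive or negative, meaning above the mean or below the mean respectively, so it expresses relative standing.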
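A short sketch of the z-score computation, \(z=(x-\bar{x})/s\), on made-up exam scores (the data and the helper name `z_score` are illustrative assumptions):

```python
import statistics

scores = [55, 60, 65, 70, 75, 80, 85]  # hypothetical exam scores
mean = statistics.mean(scores)
s = statistics.stdev(scores)           # sample standard deviation (n - 1)

def z_score(x):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / s

for x in scores:
    print(x, round(z_score(x), 2))
```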

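For mound-shaped data, the Empirical Rule implies that about 95% of the z-scores lie between -2 and 2 and about 99.7% lie between -3 and 3, so a z-score larger than 3 in absolute value suggests a possible outlier.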



A percentile is a value, and it is usually meaningful only for large data sets. For example, if the 60th percentile is x, then 60% of all measurements are less than x and the remaining (1-60%)=40% are greater than x. The median is the 50th percentile.


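Recall how the median is computed when the number of measurements n is even: average the two middle values. This is in fact linear interpolation, and Q1 and Q3 are computed in the same way when their positions fall between two measurements.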
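One common convention, assumed here because the notes' own formula is not preserved, places Q1 at position 0.25(n+1) and Q3 at position 0.75(n+1) in the sorted data and interpolates linearly between neighbours. The helper name `quantile_by_position` and the data are hypothetical:

```python
import statistics

def quantile_by_position(sorted_data, p):
    """Interpolate at the 1-based position p * (n + 1) in sorted data."""
    n = len(sorted_data)
    pos = p * (n + 1)
    if pos <= 1:
        return sorted_data[0]
    if pos >= n:
        return sorted_data[-1]
    lo = int(pos)        # 1-based index of the lower neighbour
    frac = pos - lo      # how far to move toward the upper neighbour
    return sorted_data[lo - 1] + frac * (sorted_data[lo] - sorted_data[lo - 1])

data = sorted([16, 25, 4, 18, 11, 13, 20, 8, 11, 9])
q1 = quantile_by_position(data, 0.25)
median = quantile_by_position(data, 0.50)
q3 = quantile_by_position(data, 0.75)
print(q1, median, q3, "IQR =", q3 - q1)

# The standard library's default ("exclusive") method follows the same convention.
print(statistics.quantiles(data, n=4))
```

Other software may use a different position rule, so quartiles computed elsewhere can differ slightly.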

InterQuartile Range (IQR)

\(IQR = Q3 - Q1\)



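A box plot is drawn from the five-number summary: min, Q1, median, Q3, and max. Draw a box between Q1 and Q3, and a line inside it at the median.

Data that do not lie between the lower fence and the upper fence are outliers, marked with an asterisk (*) in the box plot. After excluding the outliers, find the min and max of the remaining data and draw whiskers from Q1 to the min and from Q3 to the max.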
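The fences can be sketched as follows, assuming the common convention of placing them 1.5 IQR below Q1 and 1.5 IQR above Q3 (the notes' own fence formula is not preserved). The data are made up, with 60 as a deliberately suspect value:

```python
import statistics

data = sorted([16, 25, 4, 18, 11, 13, 20, 8, 11, 9, 60])
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Outliers lie outside the fences; whiskers end at the extreme non-outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
print("whiskers:", whisker_low, whisker_high)
```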



Bivariate Data

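When several different measurements are taken on the same experimental unit, pairing any two of them gives bivariate data. For example, if age, height, gender, and weight are recorded for each student, there are four measurements per student, and analyzing any two of them together gives bivariate data.

To graph bivariate data, you can draw side-by-side bar charts or a stacked bar chart. If both variables are quantitative, a scatter plot shows the relationship between them.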




Correlation Coefficient

\(r=\cfrac{s_{xy}}{s_x s_y}\)

Computing the correlation coefficient with NumPy

>>> a
(1360, 1940, 1750, 1550, 1790, 1750, 2230, 1600, 1450, 1870, 2210, 1480)
>>> b
(278.5, 375.7, 339.5, 329.8, 295.6, 310.3, 460.6, 305.2, 288.6, 365.7, 425.3, 268.8)
>>> np.corrcoef(a,b)
array([[1.        , 0.92410965],
       [0.92410965, 1.        ]])


Regression(Least-Squares Line)

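Linear regression: for bivariate data x and y, whatever the true relationship between x and y, a linear fit provides the simplest model. The name regression can be read as "falling back" to the simplest, linear relationship (historically, the term comes from Galton's "regression toward the mean"). The least-squares line can be computed directly from the formulas below: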

$$\begin{cases} b=r\cdot \left(\cfrac{s_y}{s_x}\right)=\cfrac{s_{xy}}{s_x^2} \\ a=\bar{y}-b\bar{x} \end{cases}$$

The least-squares regression line is: \(y=a+bx\)
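As a rough sketch, the formulas for r, b, and a can be applied to the two data tuples from the NumPy example, here renamed x and y. Only the standard library is used; the variable names are choices made for this illustration:

```python
import statistics

# Data copied from the correlation example in these notes.
x = (1360, 1940, 1750, 1550, 1790, 1750, 2230, 1600, 1450, 1870, 2210, 1480)
y = (278.5, 375.7, 339.5, 329.8, 295.6, 310.3, 460.6, 305.2, 288.6, 365.7, 425.3, 268.8)

n = len(x)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Sample covariance s_xy, with the same n - 1 divisor as the variances.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

r = s_xy / (s_x * s_y)   # correlation coefficient
b = r * s_y / s_x        # slope
a = y_bar - b * x_bar    # intercept

print(f"r = {r:.4f}, line: y = {a:.3f} + {b:.4f} x")
```

By construction the fitted line passes through the point \((\bar{x}, \bar{y})\).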



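Identify the sample space and the events to build a model for reasoning about probability.

A Venn diagram can be used to visualize the sample space and events. Some experiments can be generated in stages, and the sample space can be displayed in a tree diagram.

Probability as relative frequency: P(A) denotes the probability of event A, n is the number of times the experiment is repeated, and f is the frequency (count) of event A. Then:

\(P(A)=\lim_{n\to\infty}\cfrac{f}{n}\)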
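The relative-frequency view can be illustrated with a simulation. The sketch assumes a fair six-sided die and the event A = "roll a six" (both illustrative choices), so f/n should settle near 1/6 as n grows:

```python
import random

random.seed(42)

def relative_frequency(n):
    """Roll a fair die n times and return f / n for the event 'roll a six'."""
    f = sum(random.randint(1, 6) == 6 for _ in range(n))
    return f / n

for n in (100, 10_000, 100_000):
    print(n, relative_frequency(n))
```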


Event Relations and Probability Rules


Addition Rule

\(P(A\cup B)=P(A)+P(B)-P(A\cap B)\)

When two events A and B are mutually exclusive, then \(P(A\cap B)=0\).

Complement Rule


Multiplication Rule

\(P(A\cap B)=P(A)P(B|A)=P(B)P(A|B)\)

| can be read as given.

Independent Event


Two events, A and B, are said to be independent if and only if the probability of event B is not influenced or changed by the occurrence of event A, or vice versa.


\(P(A\cap B)=P(A)P(B)\)






\(P(A\cap B\cap C)=P(A)P(B)P(C)\)


Mutually Exclusive Events

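Mutually exclusive events cannot occur at the same time. Consequently, mutually exclusive events with nonzero probabilities are always dependent: if one occurs, the other cannot. When A and B are mutually exclusive: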

\(P(A\cap B)=0\)


\(P(A\cup B)=P(A) + P(B)\)

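As a loose intuition: if mutually exclusive events are pictured as non-overlapping regions within one two-dimensional space, then independent events are like events in a three-dimensional space, each taking place in a different two-dimensional plane.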

Conditional Probability

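The Multiplication Rule is really a statement about conditional probability. When events A and B are not independent, whether B occurs changes the probability of A. Rearranging the rule makes this meaning easier to see: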

\(P(A|B)=\cfrac{P(A\cap B)}{P(B)}, P(B)\neq 0\)
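The rules above can be checked by brute-force enumeration of a small sample space. The sketch below assumes two fair dice and two hypothetical events, A (the sum is 8) and B (the first die shows 6), and uses exact fractions:

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
space = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) as an exact fraction; an event is a predicate on outcomes."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

A = lambda o: o[0] + o[1] == 8   # the sum is 8
B = lambda o: o[0] == 6          # the first die shows 6
A_and_B = lambda o: A(o) and B(o)
A_or_B = lambda o: A(o) or B(o)

p_a_given_b = prob(A_and_B) / prob(B)   # conditional probability P(A|B)

print("P(A) =", prob(A), " P(B) =", prob(B), " P(A and B) =", prob(A_and_B))
print("P(A|B) =", p_a_given_b)
print("Addition Rule holds:", prob(A_or_B) == prob(A) + prob(B) - prob(A_and_B))
print("Independent:", prob(A_and_B) == prob(A) * prob(B))
```

Here P(A|B) differs from P(A), so knowing that B occurred changes the probability of A, and the two events are dependent.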

A plain P(A), the probability of event A over the whole sample space, is called the unconditional probability.

Law of Total Probability

Given a set of events \(S_1, S_2, S_3, ..., S_k\) that are mutually exclusive and exhaustive and an event A, the probability of the event A can be expressed as:

$$P(A)=\sum_{i=1}^kP(S_i)P(A|S_i)=\sum_{i=1}^kP(A\cap S_i)$$


\(P(S_i)\) is also called prior probability.

Bayes' Rule

\(P(S_i|A)=\cfrac{P(S_i)P(A|S_i)}{P(A)}=\cfrac{P(A\cap S_i)}{P(A)}\)

\(P(S_i|A)\) is also called posterior probability.

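Bayes' Rule is a conditional probability: given that some event has occurred, it gives the probability of a particular "puzzle piece" that the event falls in. (Picture the sample space as a jigsaw puzzle whose pieces are mutually exclusive and, put together, exhaustive.)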
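A worked sketch of the Law of Total Probability and Bayes' Rule, using made-up numbers: three machines \(S_1, S_2, S_3\) produce 50%, 30%, and 20% of all parts, with defect rates of 1%, 2%, and 3%, and A is the event that a part is defective.

```python
from fractions import Fraction

# Hypothetical prior probabilities P(S_i) and likelihoods P(A | S_i).
prior = [Fraction(50, 100), Fraction(30, 100), Fraction(20, 100)]
likelihood = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]

# Law of Total Probability: P(A) = sum of P(S_i) * P(A | S_i).
p_a = sum(p * l for p, l in zip(prior, likelihood))

# Bayes' Rule: P(S_i | A) = P(S_i) * P(A | S_i) / P(A).
posterior = [p * l / p_a for p, l in zip(prior, likelihood)]

print("P(A) =", p_a)
for i, post in enumerate(posterior, start=1):
    print(f"P(S_{i} | A) =", post)
```

Although machine 1 makes half of all parts, a defective part is slightly more likely to have come from machine 2 or machine 3, because their defect rates are higher.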

Discrete Random Variables


Mean(Expected Value)

Let x be a discrete random variable with probability distribution p(x). The mean or expected value of x is given as:

$$\mu=E(x)=\sum x\cdot p(x)$$


Variance and Standard deviation

Let x be a discrete random variable with probability distribution p(x) and mean \(\mu\). The variance of x is:

$$\sigma^2=E\left((x-\mu)^2\right)=\sum (x-\mu)^2\cdot p(x)$$
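Both sums are easy to evaluate for a small distribution. The sketch below assumes a fair six-sided die as the random variable (an illustrative choice) and keeps the arithmetic exact with fractions:

```python
from fractions import Fraction

# Probability distribution p(x) of a fair six-sided die.
p = {x: Fraction(1, 6) for x in range(1, 7)}

mu = sum(x * px for x, px in p.items())                # E(x)
var = sum((x - mu) ** 2 * px for x, px in p.items())   # E((x - mu)^2)
sigma = float(var) ** 0.5

print("mu =", mu, " variance =", var, " sigma =", round(sigma, 4))
```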


More about Expectation and Variance

Linearity of Expectation

The expectation of the sum of two random variables is the sum of their expectations.



\(\begin{aligned} \sum_{i,j}(x_i+y_j)\cdot p(x_i)p(y_j)&=\sum_{i,j}\left(x_ip(x_i)p(y_j)+y_jp(x_i)p(y_j)\right) \\ &=\sum_{i}x_ip(x_i)\cdot\sum_{j}p(y_j)+\sum_{j}y_jp(y_j)\cdot\sum_{i}p(x_i) \\ &=\sum_{i}x_ip(x_i)+\sum_{j}y_jp(y_j) \end{aligned}\)


\(E(ax)=a\cdot E(x)\)

So, \(E(x+x)=E(2x)=2E(x)\)

When two random variables are independent,

\(E(xy)=E(x)\cdot E(y)\)
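These identities can be verified by enumerating a joint distribution. The sketch assumes two independent fair dice x and y, so that the joint probability of each pair is \(p(x_i)p(y_j)\):

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)   # probability of each face of a fair die

e_x = sum(x * p for x in faces)   # E(x) = E(y) = 7/2

# Sum over all (x_i, y_j) pairs, weighting each by p(x_i) * p(y_j).
e_sum = sum((x + y) * p * p for x, y in product(faces, repeat=2))
e_prod = sum((x * y) * p * p for x, y in product(faces, repeat=2))

print("E(x + y) =", e_sum, " E(x) + E(y) =", e_x + e_x)
print("E(xy) =", e_prod, " E(x)E(y) =", e_x * e_x)
```

Note that \(E(x+y)=E(x)+E(y)\) holds even for dependent variables, while the product rule \(E(xy)=E(x)E(y)\) relies on independence.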





When two random variables are independent,

\(\sigma^2_{x+y}=\sigma^2_x+\sigma^2_y\)


Indicator Random Variable

An indicator random variable is tied to one specific event: it takes the value 1 when the event occurs and 0 when it does not. Indicator random variables provide a convenient method for converting between probabilities and expectations. Given a sample space S and an event A, the indicator random variable \(I(A)\) associated with event A is defined as:

$$I(A)=\begin{cases} 1, &\text{ if A occurs} \\ 0, &\text{ if A does not occur} \end{cases}$$

The probability of an event equals the expectation of its indicator random variable. Given a sample space S and an event A in the sample space S, let \(X_A=I(A)\). Then \(E(X_A)=P(A)\).
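One typical use, sketched here with a hypothetical example that is not from the notes: the expected number of fixed points of a random permutation. Each position i gets an indicator \(X_i\) for the event "element i stays in place", and linearity of expectation adds the indicators up:

```python
from fractions import Fraction
from itertools import permutations

n = 4
perms = list(permutations(range(n)))   # all n! equally likely orderings

# E(X_i) = P(position i is a fixed point), found by counting.
e_indicator = [
    Fraction(sum(1 for perm in perms if perm[i] == i), len(perms))
    for i in range(n)
]

# Direct expectation of the total number of fixed points X = X_0 + ... + X_{n-1}.
e_total = Fraction(
    sum(sum(1 for i in range(n) if perm[i] == i) for perm in perms),
    len(perms),
)

print("E(X_i) =", e_indicator)
print("E(X) =", e_total, "= sum of E(X_i) =", sum(e_indicator))
```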

Discrete Distributions

Binomial Distribution

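A Bernoulli trial is an experiment with only two possible outcomes: success, which occurs with probability p, and failure, which occurs with probability q=1-p. A sequence of such trials is called a binomial experiment, which has these five characteristics: (1) the experiment consists of n identical trials; (2) each trial results in one of two outcomes, success (S) or failure (F); (3) the probability of success p is the same on every trial; (4) the trials are independent; (5) the quantity of interest is x, the number of successes in n trials.

When sampling from a large population, each trial changes p very little, so the trials can be treated as independent with p unchanged. When the population is small, the trials are not independent and p changes noticeably from trial to trial.

Rule of thumb: let n be the sample size and N the population size. If n/N >= 0.05, the resulting experiment is not binomial. (The appropriate model is then the hypergeometric probability distribution; see below.)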


The Binomial Probability Distribution

A binomial experiment consists of n identical trials with probability of success p on each trial. The probability of k successes in n trials is

\(P(x=k)=C_k^n p^k q^{n-k}=\cfrac{n!}{k!(n-k)!}p^k q^{n-k}, \quad k=0,1,\dots,n\)
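A minimal sketch of the binomial probability using `math.comb` from the standard library. The function name `binomial_pmf` and the coin-flip numbers are illustrative assumptions:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(x = k) = C(n, k) * p**k * (1 - p)**(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Example: number of heads in 10 flips of a fair coin.
n, p = 10, 0.5
for k in range(n + 1):
    print(k, round(binomial_pmf(k, n, p), 4))
```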

Geometric Distribution

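The geometric distribution is a little simpler than the binomial distribution. The binomial concerns the probability of k successes in n trials, whereas the geometric concerns only the probability that the first success occurs on the k-th trial. How many trials does it take to get the first success?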
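For the first success to occur on trial k, the first k-1 trials must fail and trial k must succeed, which gives \(p(k)=q^{k-1}p\). A small sketch (the function name `geometric_pmf` is a choice made here):

```python
def geometric_pmf(k, p):
    """P(first success occurs on trial k) = (1 - p)**(k - 1) * p, k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p

p = 0.5   # e.g. waiting for the first head of a fair coin
for k in range(1, 6):
    print(k, geometric_pmf(k, p))

# The probabilities sum to 1, and the mean number of trials approaches 1 / p.
total = sum(geometric_pmf(k, p) for k in range(1, 200))
mean_trials = sum(k * geometric_pmf(k, p) for k in range(1, 200))
print(total, mean_trials)
```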

Poisson Distribution

The Poisson random variable counts the number of occurrences of a specified event in a given unit of time or space, where the events occur randomly and independently.
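With \(\mu\) as the average number of occurrences per unit, the standard Poisson probability is \(P(x=k)=\mu^k e^{-\mu}/k!\). The sketch below evaluates it and compares it with a binomial distribution that has large n and small p; the values n=1000 and p=0.002 are arbitrary illustrations:

```python
from math import comb, exp, factorial

def poisson_pmf(k, mu):
    """P(x = k) = mu**k * exp(-mu) / k!  for k = 0, 1, 2, ..."""
    return mu**k * exp(-mu) / factorial(k)

def binomial_pmf(k, n, p):
    """P(x = k) = C(n, k) * p**k * (1 - p)**(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 1000, 0.002   # large n, small p, so mu = n * p = 2
for k in range(5):
    print(k, round(poisson_pmf(k, n * p), 4), round(binomial_pmf(k, n, p), 4))
```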




Hypergeometric Distribution

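As noted in the binomial section, the binomial distribution applies when the sample size n is much smaller than the population size N. When n is large and p is small (so that \(\mu=np\) is small), the Poisson distribution can also be used to approximate the binomial distribution. The hypergeometric distribution is the one to use when n is not much smaller than N.

A population of size N contains M successes and N-M failures. The probability of exactly k successes in a random sample of size n is:

\(P(x=k)=\cfrac{C_k^M C_{n-k}^{N-M}}{C_n^N}\)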
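The hypergeometric probability counts the ways to choose k of the M successes and n-k of the N-M failures, divided by the ways to choose any n items from N. A sketch with made-up numbers (a lot of 20 items, 5 of them defective, sample of 4):

```python
from math import comb

def hypergeometric_pmf(k, N, M, n):
    """P(x = k) = C(M, k) * C(N - M, n - k) / C(N, n)."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

N, M, n = 20, 5, 4   # hypothetical population, successes, and sample size
for k in range(n + 1):
    print(k, round(hypergeometric_pmf(k, N, M, n), 4))
```

Here n/N = 0.2, well above the 0.05 rule of thumb, so the binomial model would not be appropriate.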


The Normal Probability Distribution

Continuous Random Variable

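Clearly, not every experiment produces discrete data. Height, weight, length, and width, for example, are continuous random variables. By collecting a large amount of experimental data and drawing a histogram, we get an approximation of a continuous random variable's probability distribution.

The probability distribution, or probability density function (pdf), \(f(x)\) describes how probability is distributed over the values of a continuous random variable.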

Uniform Random Variable


Exponential Random Variable


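A relative frequency histogram may offer clues about how the data are distributed, and we should choose the distribution that best fits the continuous random variable. Fortunately, in many situations a continuous random variable is well described by the normal distribution.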

Normal Probability Distribution

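Many continuous random variables encountered in practice are approximately normally distributed, which is presumably why the distribution is called "normal". For instance, neural network weights are often initialized with values drawn from a normal distribution.

\(f(x)=\cfrac{1}{\sigma\sqrt{2\pi}}\ e^{-(x-\mu)^2/(2\sigma^2)}, \sigma\gt 0, x\in R\)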


Standard Normal Random Variable



\(f(z)=\cfrac{1}{\sqrt{2\pi}}\ e^{-z^2/2}\)
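Probabilities for any normal variable can be found by standardizing with \(z=(x-\mu)/\sigma\), the usual definition (assumed here, since the formula itself is not shown above), and then using the standard normal cumulative distribution. The sketch below computes that cumulative probability from `math.erf`; the function names are hypothetical:

```python
from math import erf, sqrt

def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_prob_between(lo, hi, mu, sigma):
    """P(lo <= x <= hi) for a normal x, found by standardizing to z-scores."""
    z_lo = (lo - mu) / sigma
    z_hi = (hi - mu) / sigma
    return standard_normal_cdf(z_hi) - standard_normal_cdf(z_lo)

# Within 1, 2, and 3 standard deviations of the mean (mu = 100, sigma = 5).
for k in (1, 2, 3):
    print(k, round(normal_prob_between(100 - 5 * k, 100 + 5 * k, 100, 5), 4))
```

The printed values reproduce the Empirical Rule percentages.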


Sampling Distribution

Analyze sample data to make inferences about population parameters: from statistics to parameters...

Sampling Plan

The way a sample is selected is called the sampling plan or experimental design. Knowing the sampling plan used in a particular situation will often allow you to measure the reliability or goodness of your inference.

Simple Random Sampling


Stratified Random Sampling involves selecting a simple random sample from each of a given number of subpopulations, or strata.

Cluster Random Sampling

1-in-k Systematic Random Sampling


Sampling Distribution

The sampling distribution of a statistic is the probability distribution of the possible values of the statistic that results when random samples of size n are repeatedly drawn from the population.

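By repeatedly drawing random samples and computing a statistic (a numerical measure of the sample) each time, we obtain a set of values; the probability distribution of those values is the sampling distribution.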

Central Limit Theorem (CLT)

Under rather general conditions, sums and means computed from random samples are approximately normally distributed.

If random samples of n observations are drawn from a nonnormal population with finite mean \(\mu\) and standard deviation \(\sigma\), then, when n is large, the sampling distribution of the sample mean \(\bar{x}\) is approximately normally distributed, with mean \(\mu\) and standard deviation \(\cfrac{\sigma}{\sqrt{n}}\).

Tip: If x is normal, \(\bar{x}\) is normal for any n. If x is not normal, \(\bar{x}\) is approximately normal for large n.
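The theorem can be illustrated by simulation. The sketch below assumes an exponential population with \(\mu=1\) and \(\sigma=1\), which is clearly nonnormal; the sample size, number of trials, and seed are arbitrary choices:

```python
import random
import statistics

random.seed(7)
n = 50          # size of each sample
trials = 5000   # number of repeated samples

def sample_mean():
    """Mean of one random sample of n observations from the population."""
    return sum(random.expovariate(1.0) for _ in range(n)) / n

sample_means = [sample_mean() for _ in range(trials)]
mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print("mean of sample means:", round(mean_of_means, 4), "(theory: 1)")
print("sd of sample means:  ", round(sd_of_means, 4), "(theory:", round(1 / n**0.5, 4), ")")
```

A histogram of `sample_means` would look roughly mound-shaped even though the population itself is heavily skewed.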

Standard Error

The standard deviation of a statistic used as an estimator is also called the standard error of the estimator (abbreviated SE). For the sample mean \(\bar{x}\), the standard error \(\cfrac{\sigma}{\sqrt{n}}\) is called the standard error of the mean (abbreviated \(SE(\bar{x})\) or SEM).


Sampling Distribution of Sample Proportion

If a random sample of n observations is selected from a binomial population with parameter p, then the sampling distribution of the sample proportion \(\hat{p}=\cfrac{x}{n}\) will have a mean \(p\) and a standard deviation \(SE(\hat{p})=\sqrt{\cfrac{pq}{n}}\) where \(q=1-p\).

When the sample size n is large, the sampling distribution of \(\hat{p}\) can be approximated by a normal distribution. The approximation will be adequate if \(np>5\) and \(nq>5\).
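A brief sketch of the standard error formula and the adequacy check for the normal approximation. The function names and the example values of p and n are assumptions made for illustration:

```python
from math import sqrt

def proportion_se(p, n):
    """Standard error of the sample proportion: sqrt(p * q / n), with q = 1 - p."""
    return sqrt(p * (1 - p) / n)

def normal_approx_ok(p, n):
    """Rule of thumb from the text: both n*p and n*q must exceed 5."""
    return n * p > 5 and n * (1 - p) > 5

for p, n in ((0.5, 100), (0.1, 100), (0.01, 100)):
    print(p, n, round(proportion_se(p, n), 4), normal_approx_ok(p, n))
```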

The binomial distribution concerns the probability of x successes in n trials when p is known. The sample proportion concerns how to estimate p.

Large-Sample Estimation

Everything so far has been tools in the toolbox. Now we move on to statistical inference!

