# Statistics and Probability


## Basic Concepts

• A population is the entire set of interest. Statistical methods can rarely cover a whole population, so we draw a sample and try to describe or predict the behavior of the population on the basis of information obtained from a representative sample from that population.
• Descriptive statistics consists of procedures used to summarize and describe the important characteristics of a set of measurements.
• Inferential statistics consists of procedures used to make inferences about population characteristics from information contained in a sample drawn from this population.
• An experimental unit is the individual or object on which a variable is measured; the resulting data value is called a single measurement.
• Variable types: qualitative (describes a quality and is used for classification) and quantitative (describes an amount). Quantitative variables are further divided into discrete and continuous.

• frequency: the count or amount (in some unit) for a category or class; it can be read as the number of measurements.

• relative frequency: $$\cfrac{\text{frequency}}{n}$$, where n is the total number of measurements in the sample.
• percent: $$100\times \text{relative frequency}$$
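The three quantities above can be sketched with the standard library. The blood-type sample below is made-up data used only for illustration.

```python
from collections import Counter

# Hypothetical sample: blood types of n = 10 people (made-up data).
sample = ["A", "O", "B", "O", "A", "O", "AB", "O", "A", "B"]
n = len(sample)

# frequency: number of measurements in each category
frequency = Counter(sample)

# relative frequency: frequency / n
relative_frequency = {k: v / n for k, v in frequency.items()}

# percent: 100 * relative frequency
percent = {k: 100 * rf for k, rf in relative_frequency.items()}

for category in sorted(frequency):
    print(category, frequency[category],
          relative_frequency[category], percent[category])
```

The relative frequencies always sum to 1 (and the percents to 100), which is a quick sanity check on any frequency table.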

Qualitative data behave like categories and are usually drawn as a bar chart or a pie chart. Quantitative data are usually drawn as a histogram, which shows the data distribution.

Bar chart vs. histogram

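• A bar chart is used for qualitative data: one bar per category, with gaps separating the bars.
• A histogram is used for quantitative data: it also has bars, but with no space between them, and the width and number of the bars (classes) are chosen by the analyst.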
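Because the number and width of histogram classes are chosen by hand, the same data can produce different pictures. Below is a minimal sketch of equal-width binning; the helper `histogram` and the data are hypothetical, and the helper assumes every value lies between `low` and `high`.

```python
def histogram(values, low, high, n_classes):
    """Count how many values fall in each of n_classes equal-width classes.

    Assumes low <= v <= high for every value (illustrative helper only).
    """
    width = (high - low) / n_classes
    counts = [0] * n_classes
    for v in values:
        # The last class is closed on the right so that v == high is counted.
        index = min(int((v - low) / width), n_classes - 1)
        counts[index] += 1
    return counts

# Hypothetical measurements (made-up data).
data = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.0, 4.6, 5.0]

coarse = histogram(data, low=1.0, high=5.0, n_classes=2)  # 2 wide classes
fine = histogram(data, low=1.0, high=5.0, n_classes=4)    # 4 narrow classes

print(coarse)
print(fine)
```

Both results count the same ten measurements; only the class boundaries differ, which is why class choice should be reported alongside a histogram.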

Remember, though, that different samples from the same population will produce different histograms, even if you use the same class boundaries. However, you can expect that the sample and population histograms will be similar. As you add more and more data to the sample, the two histograms become more and more alike. If you enlarge the sample to include the entire population, the two histograms will be identical!
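This convergence can be illustrated with a small simulation. The sketch below assumes a hypothetical finite population of simulated die rolls and measures the largest gap between the sample and population relative frequencies; the names `relative_frequencies` and `max_gap` are made up for this example.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical finite population: 10,000 simulated rolls of a fair die.
population = [random.randint(1, 6) for _ in range(10_000)]

def relative_frequencies(values):
    """Relative frequency of each face 1..6 (bar heights of the histogram)."""
    counts = Counter(values)
    return {face: counts[face] / len(values) for face in range(1, 7)}

def max_gap(sample):
    """Largest difference between sample and population bar heights."""
    pop = relative_frequencies(population)
    smp = relative_frequencies(sample)
    return max(abs(pop[face] - smp[face]) for face in pop)

for size in (10, 100, 1_000, len(population)):
    sample = random.sample(population, size)  # sampling without replacement
    print(size, round(max_gap(sample), 4))
```

The gap for any particular small sample is random, so it need not shrink at every step, but when the sample is the entire population the gap is exactly zero.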

## Numerical Measures of Data

### Measures of Center

• sample mean: $$\bar x=\cfrac{\sum x_i}{n}$$
• population mean: $$\mu$$

• Median is less sensitive to extreme values or outliers! If a distribution is strongly skewed by one or more extreme values, you should use the median rather than the mean as a measure of center.
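A short sketch of why the median (the middle value of the sorted measurements) resists outliers. The salary figures are made-up data.

```python
import statistics

# Hypothetical salaries in thousands (made-up data).
typical = [30, 32, 35, 36, 40]
with_outlier = typical + [400]  # add one extreme value

mean_before = statistics.mean(typical)
median_before = statistics.median(typical)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

One extreme value pulls the mean far from the bulk of the data, while the median moves only slightly.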

Mode

The mode is the category that occurs most frequently, or the most frequently occurring value of x. When measurements on a continuous variable have been grouped as frequency or relative frequency histogram, the class with the highest peak or frequency is called the modal class, and the midpoint of that class is taken to be the mode. It is possible for a distribution of measurements to have more than one mode.
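A minimal sketch of both cases described above, using `statistics.multimode` (available in Python 3.8 and later). The data, class boundaries, and frequencies are made up for illustration.

```python
import statistics

# Made-up measurements with a single most frequent value.
single = statistics.multimode([1, 2, 3, 3])

# Made-up measurements where two values tie for most frequent (bimodal).
tied = statistics.multimode([1, 1, 2, 3, 3])

# Grouped data: hypothetical class boundaries and their frequencies.
classes = [(0, 10), (10, 20), (20, 30)]
frequencies = [3, 7, 5]

# The modal class is the class with the highest frequency;
# its midpoint is taken as the mode.
modal_class = classes[frequencies.index(max(frequencies))]
grouped_mode = (modal_class[0] + modal_class[1]) / 2

print(single, tied, modal_class, grouped_mode)
```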

>>> import numpy as np
>>> np.mean((1,2,3,3))
2.25
>>> np.median((1,2,3,3))
2.5


### Measures of Variability

The range, R, of a set of n measurements is defined as the difference between the largest and smallest measurements.
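A small sketch of the range, using two made-up data sets. The helper name `value_range` is hypothetical.

```python
# Two made-up data sets with the same smallest and largest values.
a = [1, 4, 4, 4, 7]
b = [1, 1, 4, 7, 7]

def value_range(measurements):
    """R = largest measurement minus smallest measurement."""
    return max(measurements) - min(measurements)

print(value_range(a), value_range(b))
```

Both sets have the same range even though `b` is more spread out around its center; R depends only on the two extreme values, so it cannot tell the two apart.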

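The deviations $$x_i - \bar{x}$$ can be positive or negative, and they always sum to zero, so they are squared before being averaged.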

• variance of population: $$\sigma^2=\cfrac{\sum(x_i-\mu)^2}{N}$$
• variance of sample: $$s^2=\cfrac{\sum(x_i-\bar{x})^2}{n-1}=\cfrac{\sum{x_i^2}-\cfrac{(\sum{x_i})^2}{n}}{n-1}$$
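The definitional and shortcut forms of $$s^2$$ can be checked against each other numerically. This is a sketch on made-up data; `statistics.variance` divides by n-1 and `statistics.pvariance` divides by N.

```python
import math
import statistics

x = [5, 7, 1, 2, 4]  # made-up measurements
n = len(x)
x_bar = sum(x) / n

# Definitional form: sum of squared deviations divided by n - 1.
definition = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Shortcut form: (sum of x_i^2 - (sum of x_i)^2 / n) divided by n - 1.
shortcut = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

# Treating the same numbers as a whole population divides by N instead.
as_population = sum((xi - x_bar) ** 2 for xi in x) / n

print(definition, shortcut, statistics.variance(x))
print(as_population, statistics.pvariance(x))
```

The shortcut form needs only the running totals of $$x_i$$ and $$x_i^2$$, which is convenient for hand calculation, though it can lose precision in floating point when the mean is large relative to the spread.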

The sample standard deviation is $$s=\sqrt{s^2}$$, the non-negative square root of the sample variance.

• The value of s is always greater than or equal to zero.
• The larger the value of $$s^2$$ or s, the greater the variability of the data set.
• If $$s^2$$ or s is equal to zero, all the measurements must have the same value.
• In order to measure the variability in the same units as the original observations, we compute the standard deviation s.
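The properties above can be illustrated with a short sketch. The lengths are made-up data, and the meter-to-centimeter conversion is only there to show how units behave.

```python
import math
import statistics

# If all measurements are equal, s is zero.
s_constant = statistics.stdev([4, 4, 4, 4])

# Made-up lengths in meters, and the same lengths in centimeters.
meters = [5, 7, 1, 2, 4]
centimeters = [100 * m for m in meters]

s_m = statistics.stdev(meters)
s_cm = statistics.stdev(centimeters)
var_m = statistics.variance(meters)
var_cm = statistics.variance(centimeters)

# s scales like the data (factor 100); s^2 scales like the data squared.
print(s_constant, s_m, s_cm, var_m, var_cm)
```

Changing units by a factor of 100 multiplies s by 100 but multiplies $$s^2$$ by 10,000, which is why s, not $$s^2$$, is expressed in the units of the original observations.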

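>>> import numpy as np
>>> x = [np.random.randn() for i in range(100)]  # 100 standard normal draws; values differ on each run
>>> np.var(x, ddof=1)  # ddof=1 divides by n-1, giving the sample variance
0.9932329373836302
>>> np.std(x, ddof=1)  # sample standard deviation s
0.99661072509964
>>> np.sqrt(np.var(x,ddof=1))  # same value: s is the square root of s^2
0.99661072509964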

