-- TOC --
统计是数学应用的一个分支,需要common sense和logical thinking。
Population
表示全体,通常统计手段是难以覆盖整个population的,因此需要一个Sample
,we try to describe or predict the behavior of the population on the basis of information obtained from a representative sample from that population.Descriptive
statistics consists of procedures used to summarize and describe the important characteristics of a set of measurements.Inferential
statistics consists of procedures used to make inferences about population characteristics from information contained in a sample drawn from this population.experimental unit
is the individual or object on which a variable
is measured, and the data value is called a single measurement
.Variable,变量的类型:quanlitative
(描述性质的,分类用的)和quantitative
(描述数量的),quantitative有分为discrete
(离散量)和continuous
(连续量)。
frequency
,频率,(某种单位下的)次数或数量,可解释为 number of measurements。
relative frequency
,\(\cfrac{frequency}{n}\),n是样本总数。percent
,\(100\times relative frequency\)Quanlitative的量类似于分类,图形化表示常用柱状图bar chart或饼图pie chart。而Quantitative的量,常常用直方图Histogram来表现数据的分布(data distribution)。
Bar Chart与Histogram的区别
Remember, though, that different samples from the same population will produce different histograms, even if you use the same class boundaries. However, you can expect that the sample and population histograms will be similar. As you add more and more data to the sample, the two histograms become more and more alike. If you enlarge the sample to include the entire population, the two histograms will be identical!
除了用图形来表达和展示数据(数据可视化),还可以用一些量化的方法,来表达数据的某些性质。These measures are called parameters
when associated with the population, and the are called statistics
when calculated from sample measurements.
描述一组数据的中心性质。
算数平均数Mean
关于各种平均数的总结
中位数Median
将一组数据进行排序后,位置在中间的那个数,或者中间的两个数的算数平均值。比如LeetCode的第1到Hard题,两有序数组的中间值。
Mode
The mode is the category that occurs most frequently, or the most frequently occurring value of x. When measurements on a continuous variable have been grouped as frequency or relative frequency histogram, the class with the highest peak or frequency is called the modal class, and the midpoint of that class is taken to be the mode. It is possible for a distribution of measurements to have more than one mode.
用Numpy计算mean和median
>>> np.mean((1,2,3,3))
2.25
>>> np.median((1,2,3,3))
2.5
范围Range
The range, R, of a set of n measurements is definied as the difference between the largest and smallest measurements.
离差Deviation
\(x_i - \bar{x}\),有正有负
差方Variance
为什么分母是n-1:简单理解,就是当分母是n-1时,能更好的估算population的差方值,比使用n更好。
标准差Standard Diviation
\(s=\sqrt{s^2}\),正值
用Numpy计算var和std
默认ddof=0,当ddof=1时,计算var的分母就是n-1。
>>> import numpy as np
>>> x = [np.random.randn() for i in range(100)]
>>> np.var(x, ddof=1)
0.9932329373836302
>>> np.std(x, ddof=1)
0.99661072509964
>>> np.sqrt(np.var(x,ddof=1))
0.99661072509964
本文链接:https://cs.pynote.net/math/202310071/
-- EOF --
-- MORE --