In most cases, the simplest way to check the relationship between two variables is a scatter plot. To get a better visual effect, we can use jittering to reverse the effect of rounding off. To handle the problem of too many overlapping points, we can set the alpha parameter to a value less than 1.0.
However, when the dataset gets larger, a hexbin plot may be a better choice to show the relationship.
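For instance, a minimal matplotlib sketch of all three ideas, using made-up `heights` and `weights` arrays (the jitter width of 0.5 is an arbitrary choice):

```py
import numpy as np
import matplotlib.pyplot as plt

# hypothetical rounded-off measurements
heights = np.random.normal(170, 8, 5000).round()
weights = np.random.normal(70, 10, 5000).round()

def jitter(values, width=0.5):
    """Add small uniform noise to reverse the effect of rounding off."""
    return values + np.random.uniform(-width, width, len(values))

# scatter plot with jittering and transparency to reduce overplotting
plt.scatter(jitter(heights), jitter(weights), alpha=0.2, s=10)
plt.show()

# for larger datasets, a hexbin plot shows density instead of individual points
plt.hexbin(heights, weights, gridsize=30, cmap='Blues')
plt.show()
```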
In addition to scatter plots, binning one variable and plotting percentiles of the other is also a good option. Here is the code for binning the data:
```py
import numpy as np

bins = np.arange(10, 100, 5)
# data is a DataFrame that contains a variable x;
# np.digitize maps each value of data.x to the index of its bin
indices = np.digitize(data.x, bins)
groups = data.groupby(indices)
```
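To finish the idea of plotting percentiles per bin, a possible continuation (a sketch only, assuming the DataFrame also has a second column `y`):

```py
import matplotlib.pyplot as plt

# mean x in each bin, used as the horizontal position
xs = groups.x.mean()

# plot the 25th, 50th and 75th percentiles of y (hypothetical column) within each bin
for p in [25, 50, 75]:
    ys = groups.y.quantile(p / 100)
    plt.plot(xs, ys, label='%dth percentile' % p)

plt.legend()
plt.show()
```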
### Correlation

A **correlation** is a statistic that quantifies the strength of the relationship between two variables. The challenges are:

1. the variables may not be in the same units
2. the variables may come from different distributions
There are two common solutions:

1. Transform each value to a **standard score**, for example $z_i = (x_i-\mu)/\sigma$, which leads to the "Pearson product-moment correlation coefficient".
2. Transform each value to its **rank**, which leads to the "Spearman rank correlation coefficient".
### Covariance

**Covariance** is a measure of the tendency of two variables to vary together.

$$Cov(X,Y) = E[(X-E(X))(Y-E(Y))] = E[XY]-E[X]E[Y]$$
### Pearson's correlation

$$\rho_{X,Y}=\frac{\operatorname{cov}(X,Y)}{\sigma_{X}\sigma_{Y}}$$

Pearson's correlation lies between -1 and +1. Dividing by the standard deviations makes the coefficient dimensionless. The sign of the value matches the direction of the relationship, and the magnitude indicates the strength of the correlation. If Pearson's correlation is near 0, the variables do not have much of a linear relationship; keep in mind that **Pearson's correlation only measures linear relationships**.
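A quick numpy check of the last two definitions on made-up arrays:

```py
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# covariance from the definition E[XY] - E[X]E[Y]
cov = np.mean(x * y) - np.mean(x) * np.mean(y)

# Pearson's correlation: covariance divided by both standard deviations
rho = cov / (np.std(x) * np.std(y))

print(rho)                      # close to +1 for this nearly linear data
print(np.corrcoef(x, y)[0, 1])  # numpy's built-in version should agree
```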
### Spearman's rank correlation

The computation of Spearman's rank correlation is similar to Pearson's correlation, except that the real values are replaced by their ranks. Compared with Pearson's correlation, Spearman's rank correlation has several advantages:

1. Pearson's correlation tends to underestimate the strength of the relationship when the relationship is **nonlinear**.
2. Spearman's rank correlation is more robust, while Pearson's can be affected if the distribution is **skewed** or contains outliers.
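A sketch of the rank idea, with `scipy.stats.spearmanr` as a cross-check (the arrays are made up; note how little the outlier matters):

```py
import numpy as np
from scipy.stats import spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # contains an outlier
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# rank transform (no ties here), then apply Pearson's formula to the ranks
xr = np.argsort(np.argsort(x)) + 1
yr = np.argsort(np.argsort(y)) + 1

print(np.corrcoef(xr, yr)[0, 1])  # 1.0: the monotonic relationship is perfect
print(spearmanr(x, y)[0])         # same value from scipy
print(np.corrcoef(x, y)[0, 1])    # Pearson's value is lower, since the data is not linear
```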
### Correlation and causation

Always remember: [**Correlation does not imply causation**](https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation)

There are two methods we can try to provide evidence of causation:

1. Use time.
2. Use randomness.
## Chap 8 Estimation
### Guess the variance

Consider a normal distribution. Given a sample $[x_1, x_2, \dots, x_n]$, to estimate $\sigma^2$, here is an estimator:

$$S^2 = \frac{1}{n}\sum(x_i-\bar{x})^2$$

When the sample is large, this estimator is adequate, but for small samples it tends to be too low. Since the estimator above is **biased**, there is another, **unbiased** estimator of $\sigma^2$:

$$S_{n-1}^2 = \frac{1}{n-1}\sum(x_i-\bar{x})^2$$
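A small simulation (with arbitrary parameters) showing the bias for small samples; `np.var` with `ddof=1` gives the unbiased version:

```py
import numpy as np

sigma, n, trials = 1.0, 5, 100000   # small samples from N(0, 1), true variance = 1

biased, unbiased = [], []
for _ in range(trials):
    xs = np.random.normal(0, sigma, n)
    biased.append(np.var(xs))            # divides by n
    unbiased.append(np.var(xs, ddof=1))  # divides by n - 1

print(np.mean(biased))    # noticeably below 1
print(np.mean(unbiased))  # close to 1
```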
### Sampling distributions

Variation in the estimate caused by random selection (instead of using the full data) is called **sampling error**. If we run simulations and choose n values to compute the estimate each time, we get a **sampling distribution**. There are two common ways to summarize the sampling distribution:

1. Standard error (SE). *Notice the difference between standard error and standard deviation.*
2. Confidence interval (CI), a range that includes a given fraction of the sampling distribution, e.g. a 90% CI of (86, 94).

SE and CI only quantify sampling error; the sampling distribution does not account for **sampling bias and measurement error**.
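A sketch of how such a sampling distribution could be simulated and summarized (the population parameters and sample size are made up):

```py
import numpy as np

mu, sigma, n, trials = 90, 7.5, 9, 1000

# sampling distribution of the mean: re-estimate from a fresh sample each time
means = [np.mean(np.random.normal(mu, sigma, n)) for _ in range(trials)]

se = np.std(means)                  # standard error of the estimate
ci = np.percentile(means, [5, 95])  # 90% confidence interval

print(se, ci)
```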
### Sampling bias

+ **Sampling bias** is the bias caused by sampling that is not uniform.
+ **Measurement error** is the error caused by inaccurate measurement.
### Exponential distributions

To estimate the parameter $\lambda$ of an exponential distribution, we could use the sample mean:

$$L=\frac{1}{\bar{x}}$$

or use the median to ensure robustness ($m$ is the sample median):

$$L_m=\ln(2)/m$$
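A quick numerical sketch of both estimators (the true parameter `lam` and the sample size are arbitrary):

```py
import numpy as np

lam, n = 2.0, 10
xs = np.random.exponential(1.0 / lam, n)  # numpy takes the scale, i.e. 1/lambda

L  = 1.0 / np.mean(xs)           # estimate based on the sample mean
Lm = np.log(2) / np.median(xs)   # estimate based on the sample median (more robust)

print(L, Lm)
```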
## Hypothesis testing
### Classical hypothesis testing

To answer the question:

> Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?

we follow these steps:

1. Quantify the size of the apparent effect by choosing a **test statistic**.
2. Define a **null hypothesis**.
3. Compute a p-value: the probability of seeing the apparent effect if the null hypothesis is true.
4. Interpret the result.

The logic of this process is similar to a proof by contradiction.
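As one illustration of these steps (a common choice, not the only one): a permutation test for a difference in means, where the test statistic is the absolute difference of the group means and the null hypothesis is simulated by shuffling the group labels.

```py
import numpy as np

def p_value_diff_means(group1, group2, iters=10000):
    """Permutation test: how often does shuffling the labels produce a
    difference in means at least as large as the observed one?"""
    observed = abs(np.mean(group1) - np.mean(group2))
    pooled = np.concatenate([group1, group2])
    n = len(group1)

    count = 0
    for _ in range(iters):
        np.random.shuffle(pooled)  # simulate the null hypothesis: no real difference
        diff = abs(np.mean(pooled[:n]) - np.mean(pooled[n:]))
        if diff >= observed:
            count += 1
    return count / iters
```

A small returned p-value means an effect this large rarely happens by chance under the null hypothesis.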
### Chi-squared tests

To test proportions, it is more common to use the **chi-squared statistic**:

$$\chi^2 = \sum_{i}\frac{(O_i-E_i)^2}{E_i}$$

where $O_i$ is the observed value and $E_i$ is the expected value.
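For example, testing whether a six-sided die is fair (the observed counts here are made up):

```py
import numpy as np

observed = np.array([8, 9, 19, 5, 8, 11])   # hypothetical counts for each face
expected = np.full(6, observed.sum() / 6)   # equal counts under the null hypothesis

chi2 = np.sum((observed - expected) ** 2 / expected)
print(chi2)
```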
## Linear least squares
### Why use squares?
+ It treats positive and negative residuals the same.
+ It gives more weight to large residuals, but not too much.
+ If the residuals are uncorrelated and normally distributed with mean 0 and constant variance, then the least squares fit is also the maximum likelihood estimator of the intercept and slope.
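Under those assumptions the fit reduces to two formulas; a minimal sketch, cross-checked against `np.polyfit`:

```py
import numpy as np

def least_squares(xs, ys):
    """Slope and intercept that minimize the sum of squared residuals."""
    slope = np.cov(xs, ys)[0, 1] / np.var(xs, ddof=1)
    inter = np.mean(ys) - slope * np.mean(xs)
    return inter, slope

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.0, 4.1, 5.9, 8.2, 10.1])

inter, slope = least_squares(xs, ys)
print(inter, slope)
print(np.polyfit(xs, ys, 1))  # returns [slope, intercept]; should agree
```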
## Time series analysis

### Moving averages

The three most common components in time series modeling:

+ Trend: a smooth function that captures persistent changes
+ Seasonality: periodic variation
+ Noise: random variation around the long-term trend
A moving average shows the trend. The **simplest moving average** is the rolling mean:

```py
import numpy as np

# rolling mean: average the k values in each sliding window
rollingMean = [np.mean(data[i:i+k]) for i in range(len(data) - k)]

# or, with pandas: data.rolling(k).mean()
# (older pandas versions spelled this pandas.rolling_mean(data, k))
```
EWMA (exponentially-weighted moving average): the most recent value has the highest weight, and the weights for previous values drop off exponentially.
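In current pandas this is available as `Series.ewm`; a minimal sketch on a made-up series:

```py
import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(100).cumsum())  # a made-up series with a drifting trend

ewma = data.ewm(span=30).mean()   # exponentially-weighted moving average
print(ewma.tail())
```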