江湖人称潇洒哥 (Data Expert Lv4)
Posted 2018-6-11 08:03
Original poster
Given a CSV file (anscombe.csv), complete the following two tasks:
The corresponding code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf

# Load the data: one row per point, with columns 'dataset', 'x' and 'y'
anascombe = pd.read_csv('anscombe.csv')

# Mean and variance of x and y for each dataset, then the x-y correlation
print(anascombe.groupby('dataset')['x'].mean())
print(anascombe.groupby('dataset')['y'].mean())
print(anascombe.groupby('dataset')['x'].var())
print(anascombe.groupby('dataset')['y'].var())
print(anascombe.groupby('dataset').corr())

# Fit y ~ x by ordinary least squares on a random ~70% subset of each dataset
dataset_names = ['I', 'II', 'III', 'IV']
for i in dataset_names:
    subset = anascombe[anascombe.dataset == i]
    n = len(subset)
    is_train = np.random.rand(n) < 0.7  # unseeded, so each run picks different rows
    train = subset[is_train].reset_index(drop=True)
    test = subset[~is_train].reset_index(drop=True)  # held out, not used below
    lin_model = smf.ols('y ~ x', train).fit()
    print(lin_model.summary())

# One scatter panel of y against x per dataset
g = sns.FacetGrid(anascombe, col='dataset')
g.map(plt.scatter, 'x', 'y')
plt.show()
Command-line output of the program:
dataset
I      9.0
II     9.0
III    9.0
IV     9.0
Name: x, dtype: float64
dataset
I      7.500909
II     7.500909
III    7.500000
IV     7.500909
Name: y, dtype: float64
dataset
I      11.0
II     11.0
III    11.0
IV     11.0
Name: x, dtype: float64
dataset
I      4.127269
II     4.127629
III    4.122620
IV     4.123249
Name: y, dtype: float64
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8
  "anyway, n=%i" % int(n))
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.650
Model:                            OLS   Adj. R-squared:                  0.592
Method:                 Least Squares   F-statistic:                     11.15
Date:                Sun, 10 Jun 2018   Prob (F-statistic):             0.0156
Time:                        12:18:34   Log-Likelihood:                -12.931
No. Observations:                   8   AIC:                             29.86
Df Residuals:                       6   BIC:                             30.02
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.4459      1.497      1.634      0.153      -1.216       6.108
x              0.5464      0.164      3.339      0.016       0.146       0.947
==============================================================================
Omnibus:                        0.157   Durbin-Watson:                   3.211
Prob(Omnibus):                  0.925   Jarque-Bera (JB):                0.343
Skew:                          -0.096   Prob(JB):                        0.842
Kurtosis:                       2.004   Cond. No.                         27.8
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
  "anyway, n=%i" % int(n))
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.654
Model:                            OLS   Adj. R-squared:                  0.610
Method:                 Least Squares   F-statistic:                     15.10
Date:                Sun, 10 Jun 2018   Prob (F-statistic):            0.00464
Time:                        12:18:34   Log-Likelihood:                -15.546
No. Observations:                  10   AIC:                             35.09
Df Residuals:                       8   BIC:                             35.70
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0642      1.169      2.621      0.031       0.369       5.760
x              0.4842      0.125      3.886      0.005       0.197       0.772
==============================================================================
Omnibus:                        1.436   Durbin-Watson:                   2.438
Prob(Omnibus):                  0.488   Jarque-Bera (JB):                0.889
Skew:                          -0.413   Prob(JB):                        0.641
Kurtosis:                       1.795   Cond. No.                         27.4
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
  "samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.699e+06
Date:                Sun, 10 Jun 2018   Prob (F-statistic):           2.08e-12
Time:                        12:18:34   Log-Likelihood:                 29.314
No. Observations:                   6   AIC:                            -54.63
Df Residuals:                       4   BIC:                            -55.04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.0098      0.003   1498.423      0.000       4.002       4.017
x              0.3451      0.000   1303.508      0.000       0.344       0.346
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.677
Prob(Omnibus):                    nan   Jarque-Bera (JB):                2.907
Skew:                           1.640   Prob(JB):                        0.234
Kurtosis:                       3.933   Cond. No.                         29.9
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=9
  "anyway, n=%i" % int(n))
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\regression\linear_model.py:1633: RuntimeWarning: divide by zero encountered in double_scalars
  return np.sqrt(eigvals[0]/eigvals[-1])
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\regression\linear_model.py:1554: RuntimeWarning: divide by zero encountered in double_scalars
  return self.ess/self.df_model
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                      -0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                      -inf
Date:                Sun, 10 Jun 2018   Prob (F-statistic):                nan
Time:                        12:18:34   Log-Likelihood:                -13.393
No. Observations:                   9   AIC:                             28.79
Df Residuals:                       8   BIC:                             28.98
Df Model:                           0
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.1107      0.006     18.991      0.000       0.097       0.124
x              0.8856      0.047     18.991      0.000       0.778       0.993
==============================================================================
Omnibus:                        0.591   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.744   Jarque-Bera (JB):                0.509
Skew:                          -0.052   Prob(JB):                        0.775
Kurtosis:                       1.840   Cond. No.                          inf
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
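Warning [2], together with the `Df Model: 0` and `Cond. No. inf` entries in the table just above, is what you would expect if the random split left out the one row of dataset IV whose x is not 8. That is an inference from the output, not something the post states. Below is a small sketch of why the fit breaks down in that case, assuming anscombe.csv holds the standard Anscombe values for dataset IV; `X4` and `train_x` are illustrative names, and the subset shown is hypothetical rather than the exact rows the run used.

```python
from statistics import pvariance

# x values of dataset IV in the standard Anscombe quartet (assumption:
# anscombe.csv holds the same values).
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

# Hypothetical training subset: every row except the single x = 19 one.
train_x = [x for x in X4 if x != 19]

# The least-squares slope is sxy / sxx. If every x in the training subset is
# identical, sxx (the spread of x) is zero and the slope is undefined.
spread_full = pvariance(X4)
spread_train = pvariance(train_x)

print("variance of x, all rows:      ", spread_full)
print("variance of x, without x = 19:", spread_train)
```

Once x has no spread, the x column is just a multiple of the intercept column, so the design matrix is singular and the reported coefficients should not be read as a fitted line.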
C:\Users\10617\Desktop\Python\statistics_exercise\cme193-ipython-notebooks-lecture-master\data1.py
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
  "samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.144
Model:                            OLS   Adj. R-squared:                 -0.070
Method:                 Least Squares   F-statistic:                    0.6714
Date:                Sun, 10 Jun 2018   Prob (F-statistic):              0.459
Time:                        12:20:16   Log-Likelihood:                -9.2736
No. Observations:                   6   AIC:                             22.55
Df Residuals:                       4   BIC:                             22.13
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.5660      3.535      1.575      0.190      -4.249      15.381
x              0.2723      0.332      0.819      0.459      -0.650       1.195
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.587
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.403
Skew:                           0.513   Prob(JB):                        0.818
Kurtosis:                       2.252   Cond. No.                         66.8
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
  "anyway, n=%i" % int(n))
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.696
Model:                            OLS   Adj. R-squared:                  0.658
Method:                 Least Squares   F-statistic:                     18.33
Date:                Sun, 10 Jun 2018   Prob (F-statistic):            0.00268
Time:                        12:20:16   Log-Likelihood:                -15.103
No. Observations:                  10   AIC:                             34.21
Df Residuals:                       8   BIC:                             34.81
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.8740      1.120      2.565      0.033       0.291       5.457
x              0.5000      0.117      4.281      0.003       0.231       0.769
==============================================================================
Omnibus:                        1.425   Durbin-Watson:                   2.338
Prob(Omnibus):                  0.490   Jarque-Bera (JB):                0.931
Skew:                          -0.471   Prob(JB):                        0.628
Kurtosis:                       1.840   Cond. No.                         28.0
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 7 samples were given.
  "samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 7.652e+05
Date:                Sun, 10 Jun 2018   Prob (F-statistic):           3.71e-14
Time:                        12:20:16   Log-Likelihood:                 31.802
No. Observations:                   7   AIC:                            -59.60
Df Residuals:                       5   BIC:                            -59.71
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.0036      0.004   1102.706      0.000       3.994       4.013
x              0.3456      0.000    874.754      0.000       0.345       0.347
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.583
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.574
Skew:                           0.284   Prob(JB):                        0.750
Kurtosis:                       1.717   Cond. No.                         29.3
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
  "samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.803
Model:                            OLS   Adj. R-squared:                  0.754
Method:                 Least Squares   F-statistic:                     16.34
Date:                Sun, 10 Jun 2018   Prob (F-statistic):             0.0156
Time:                        12:20:16   Log-Likelihood:                -8.3460
No. Observations:                   6   AIC:                             20.69
Df Residuals:                       4   BIC:                             20.28
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.3904      1.264      2.683      0.055      -0.118       6.899
x              0.4795      0.119      4.042      0.016       0.150       0.809
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.450
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.200
Skew:                           0.199   Prob(JB):                        0.905
Kurtosis:                       2.199   Cond. No.                         27.9
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
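The regression tables differ between the two runs because `np.random.rand(n) < 0.7` picks a new training subset each time, and with only 11 rows per dataset a few dropped points can move the line a lot. As a cross-check, here is a minimal sketch that fits each dataset on all 11 rows with the textbook least-squares formulas, using only the standard library. The values are the standard Anscombe quartet typed in by hand, on the assumption that anscombe.csv holds the same numbers (the per-dataset means printed by the script agree with them). `QUARTET`, `fit_line` and `full_fits` are names made up for this sketch and are not part of the original script.

```python
from statistics import mean

# Standard Anscombe quartet values, typed in by hand (assumption: anscombe.csv
# holds the same numbers). Datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# No random split: every dataset is fitted on all of its rows.
full_fits = {name: fit_line(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, (intercept, slope, r_squared) in full_fits.items():
    print(f"{name:>3}: y = {intercept:.3f} + {slope:.3f} x   R^2 = {r_squared:.3f}")
```

On all 11 rows the four fits come out nearly identical, close to y = 3 + 0.5x with R-squared near 0.67. The regression line cannot tell the datasets apart any more than the means and variances can; only the plots do.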
Plot output of the program:
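In the third panel, x and y follow a straight line most closely, and in the regression output the third dataset also has the smallest error (R-squared of 1.000 in both runs). This depends on the random split, though: the third dataset contains one outlier, and the fit can only be that tight when the outlier falls outside the training subset.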
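A quick way to see where the 1.000 comes from, sketched with the standard Anscombe values for dataset III typed in by hand (an assumption about what anscombe.csv contains): compare R-squared on all 11 rows with R-squared after dropping the single point that sits far above the others, (13, 12.74). `r_squared`, `X3`, `Y3` and `kept` are names invented for this example.

```python
# Dataset III of the standard Anscombe quartet, typed in by hand (assumption:
# anscombe.csv holds the same values).
X3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def r_squared(xs, ys):
    """R-squared of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


# Drop the one point that sits far above the rest of the data.
kept = [(x, y) for x, y in zip(X3, Y3) if (x, y) != (13, 12.74)]

r2_all = r_squared(X3, Y3)
r2_without_outlier = r_squared([x for x, _ in kept], [y for _, y in kept])

print(f"all 11 points:   R^2 = {r2_all:.3f}")
print(f"outlier dropped: R^2 = {r2_without_outlier:.6f}")
```

With the outlier included, R-squared is about 0.67, the same as for the other three datasets. Without it, the remaining ten points lie almost exactly on a line. The fits in the post reached 1.000 on 6 and 7 training rows, which suggests the outlier was left out of the training subset in both runs.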
Source: CSDN