Data processing exercise


江湖人称潇洒哥 (Data Expert Lv4), original poster

Posted 2018-6-11 08:03

Given a CSV file, complete the following two tasks:

The corresponding code is as follows:

import random

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load the data; the code below expects the columns dataset, x and y
anascombe = pd.read_csv('anscombe.csv')

# Per-dataset mean and sample variance of x and y, then the x-y correlation
print(anascombe.groupby('dataset')['x'].mean())
print(anascombe.groupby('dataset')['y'].mean())
print(anascombe.groupby('dataset')['x'].var())
print(anascombe.groupby('dataset')['y'].var())
print(anascombe.groupby('dataset').corr())

dataset_names = ['I', 'II', 'III', 'IV']
for i in dataset_names:
    # Random split: each row goes to the training set with probability 0.7
    subset = anascombe[anascombe.dataset == i]
    n = len(subset)
    is_train = np.random.rand(n) < 0.7
    train = subset[is_train].reset_index(drop=True)
    test = subset[~is_train].reset_index(drop=True)  # held out; not used below

    # Fit y ~ x by ordinary least squares on the training rows only
    lin_model = smf.ols('y ~ x', train).fit()
    print(lin_model.summary())

g = sns.FacetGrid(anascombe, col='dataset')
g.map(plt.scatter, 'x', 'y')
plt.show()
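
Because the mask in the loop comes from np.random.rand without a seed, every run trains on a different subset, so the fitted coefficients change from run to run. A minimal sketch of a reproducible split follows. It uses only the standard library so that it runs without anscombe.csv; the helper name split_indices and the seed value are hypothetical and not part of the original script:

```python
import random


def split_indices(n, train_fraction=0.7, seed=0):
    """Return (train, test) lists of row positions for n rows.

    Each row goes to the training set with probability train_fraction,
    mirroring `np.random.rand(n) < 0.7`, but with a seeded generator so
    the split is identical on every run.
    """
    rng = random.Random(seed)  # private generator; global state is untouched
    is_train = [rng.random() < train_fraction for _ in range(n)]
    train = [i for i, keep in enumerate(is_train) if keep]
    test = [i for i, keep in enumerate(is_train) if not keep]
    return train, test


train_idx, test_idx = split_indices(11, seed=0)
print(train_idx, test_idx)
```

In the script itself, the closest equivalent would be a single call to np.random.seed with a fixed value before the loop.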

Command-line output of the program:

dataset
I      9.0
II     9.0
III    9.0
IV     9.0
Name: x, dtype: float64
dataset
I      7.500909
II     7.500909
III    7.500000
IV     7.500909
Name: y, dtype: float64
dataset
I      11.0
II     11.0
III    11.0
IV     11.0
Name: x, dtype: float64
dataset
I      4.127269
II     4.127629
III    4.122620
IV     4.123249
Name: y, dtype: float64
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000
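
For reference, what the grouped mean(), var() and corr() calls compute for a single group can be sketched by hand. The values below are the dataset I rows of the standard Anscombe quartet, which is an assumption about the contents of anscombe.csv (the file itself is not shown in the post). Note that pandas var() is the sample variance, with n - 1 in the denominator:

```python
import math

# Dataset I of Anscombe's quartet (assumed contents of anscombe.csv)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def mean(values):
    return sum(values) / len(values)


def sample_var(values):
    # Sum of squared deviations divided by n - 1
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def pearson(a, b):
    # Covariance divided by the product of the standard deviations;
    # the 1 / (n - 1) factors cancel, so raw sums are enough
    ma, mb = mean(a), mean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


print(round(mean(x), 6), round(mean(y), 6))              # 9.0 7.500909
print(round(sample_var(x), 6), round(sample_var(y), 6))  # 11.0 4.127269
print(round(pearson(x, y), 6))                           # 0.816421
```

These agree with the dataset I rows of the output above.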
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=8
"anyway, n=%i" % int(n))
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.650
Model:                            OLS   Adj. R-squared:                  0.592
Method:                 Least Squares   F-statistic:                     11.15
Date:                Sun, 10 Jun 2018   Prob (F-statistic):             0.0156
Time:                        12:18:34   Log-Likelihood:                -12.931
No. Observations:                   8   AIC:                             29.86
Df Residuals:                       6   BIC:                             30.02
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.4459      1.497      1.634      0.153      -1.216       6.108
x              0.5464      0.164      3.339      0.016       0.146       0.947
==============================================================================
Omnibus:                        0.157   Durbin-Watson:                   3.211
Prob(Omnibus):                  0.925   Jarque-Bera (JB):                0.343
Skew:                          -0.096   Prob(JB):                        0.842
Kurtosis:                       2.004   Cond. No.                         27.8
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
"anyway, n=%i" % int(n))
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.654
Model:                            OLS   Adj. R-squared:                  0.610
Method:                 Least Squares   F-statistic:                     15.10
Date:                Sun, 10 Jun 2018   Prob (F-statistic):            0.00464
Time:                        12:18:34   Log-Likelihood:                -15.546
No. Observations:                  10   AIC:                             35.09
Df Residuals:                       8   BIC:                             35.70
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0642      1.169      2.621      0.031       0.369       5.760
x              0.4842      0.125      3.886      0.005       0.197       0.772
==============================================================================
Omnibus:                        1.436   Durbin-Watson:                   2.438
Prob(Omnibus):                  0.488   Jarque-Bera (JB):                0.889
Skew:                          -0.413   Prob(JB):                        0.641
Kurtosis:                       1.795   Cond. No.                         27.4
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
"samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.699e+06
Date:                Sun, 10 Jun 2018   Prob (F-statistic):           2.08e-12
Time:                        12:18:34   Log-Likelihood:                 29.314
No. Observations:                   6   AIC:                            -54.63
Df Residuals:                       4   BIC:                            -55.04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.0098      0.003   1498.423      0.000       4.002       4.017
x              0.3451      0.000   1303.508      0.000       0.344       0.346
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.677
Prob(Omnibus):                    nan   Jarque-Bera (JB):                2.907
Skew:                           1.640   Prob(JB):                        0.234
Kurtosis:                       3.933   Cond. No.                         29.9
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=9
"anyway, n=%i" % int(n))
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\regression\linear_model.py:1633: RuntimeWarning: divide by zero encountered in double_scalars
return np.sqrt(eigvals[0]/eigvals[-1])
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\regression\linear_model.py:1554: RuntimeWarning: divide by zero encountered in double_scalars
return self.ess/self.df_model
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                      -0.000
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                      -inf
Date:                Sun, 10 Jun 2018   Prob (F-statistic):                nan
Time:                        12:18:34   Log-Likelihood:                -13.393
No. Observations:                   9   AIC:                             28.79
Df Residuals:                       8   BIC:                             28.98
Df Model:                           0
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.1107      0.006     18.991      0.000       0.097       0.124
x              0.8856      0.047     18.991      0.000       0.778       0.993
==============================================================================
Omnibus:                        0.591   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.744   Jarque-Bera (JB):                0.509
Skew:                          -0.052   Prob(JB):                        0.775
Kurtosis:                       1.840   Cond. No.                          inf
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
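
The last summary of this run (Df Model: 0, Cond. No. inf, warning [2]) is what one would expect if the random split dropped the only dataset IV row whose x differs from the rest, leaving a constant x column. That reading is an inference from the output, not something the post states. A minimal sketch of a guard against this case, assuming the standard Anscombe values for dataset IV; fit_line is a hypothetical helper, not a statsmodels function:

```python
def fit_line(x, y):
    """Closed-form simple OLS.

    Returns (intercept, slope), or None when x has no variation and the
    slope therefore cannot be identified.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:  # constant x column: the design matrix is singular
        return None
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Dataset IV of Anscombe's quartet (assumed contents of anscombe.csv)
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# Keep only the rows with x == 8, as an unlucky split might
x_sub = [xi for xi in x4 if xi == 8]
y_sub = [yi for xi, yi in zip(x4, y4) if xi == 8]

print(fit_line(x_sub, y_sub))  # None: no slope can be estimated
print(fit_line(x4, y4))        # all 11 rows: a line can be fitted
```

Checking the training subset for variation in x before calling smf.ols would avoid reading meaning into coefficients like the ones in the summary above.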
Output of a second run of C:\Users\10617\Desktop\Python\statistics_exercise\cme193-ipython-notebooks-lecture-master\data1.py. The grouped means, variances and correlations are identical to the first run and are not repeated here; the regression summaries differ because the train/test split is random:
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
"samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.144
Model:                            OLS   Adj. R-squared:                 -0.070
Method:                 Least Squares   F-statistic:                    0.6714
Date:                Sun, 10 Jun 2018   Prob (F-statistic):              0.459
Time:                        12:20:16   Log-Likelihood:                -9.2736
No. Observations:                   6   AIC:                             22.55
Df Residuals:                       4   BIC:                             22.13
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.5660      3.535      1.575      0.190      -4.249      15.381
x              0.2723      0.332      0.819      0.459      -0.650       1.195
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.587
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.403
Skew:                           0.513   Prob(JB):                        0.818
Kurtosis:                       2.252   Cond. No.                         66.8
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
"anyway, n=%i" % int(n))
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.696
Model:                            OLS   Adj. R-squared:                  0.658
Method:                 Least Squares   F-statistic:                     18.33
Date:                Sun, 10 Jun 2018   Prob (F-statistic):            0.00268
Time:                        12:20:16   Log-Likelihood:                -15.103
No. Observations:                  10   AIC:                             34.21
Df Residuals:                       8   BIC:                             34.81
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.8740      1.120      2.565      0.033       0.291       5.457
x              0.5000      0.117      4.281      0.003       0.231       0.769
==============================================================================
Omnibus:                        1.425   Durbin-Watson:                   2.338
Prob(Omnibus):                  0.490   Jarque-Bera (JB):                0.931
Skew:                          -0.471   Prob(JB):                        0.628
Kurtosis:                       1.840   Cond. No.                         28.0
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 7 samples were given.
"samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 7.652e+05
Date:                Sun, 10 Jun 2018   Prob (F-statistic):           3.71e-14
Time:                        12:20:16   Log-Likelihood:                 31.802
No. Observations:                   7   AIC:                            -59.60
Df Residuals:                       5   BIC:                            -59.71
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.0036      0.004   1102.706      0.000       3.994       4.013
x              0.3456      0.000    874.754      0.000       0.345       0.347
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.583
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.574
Skew:                           0.284   Prob(JB):                        0.750
Kurtosis:                       1.717   Cond. No.                         29.3
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
C:\Users\10617\AppData\Local\Programs\Python\Python36\lib\site-packages\statsmodels\stats\stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
"samples were given." % int(n), ValueWarning)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.803
Model:                            OLS   Adj. R-squared:                  0.754
Method:                 Least Squares   F-statistic:                     16.34
Date:                Sun, 10 Jun 2018   Prob (F-statistic):             0.0156
Time:                        12:20:16   Log-Likelihood:                -8.3460
No. Observations:                   6   AIC:                             20.69
Df Residuals:                       4   BIC:                             20.28
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.3904      1.264      2.683      0.055      -0.118       6.899
x              0.4795      0.119      4.042      0.016       0.150       0.809
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.450
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.200
Skew:                           0.199   Prob(JB):                        0.905
Kurtosis:                       2.199   Cond. No.                         27.9
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
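
Each regression above is fitted on a random subset of a dataset that has only 11 rows, which goes some way to explaining why the coefficients swing so much between datasets and between runs. As a point of comparison, here is a sketch that fits all 11 rows of each dataset in closed form. It again assumes the standard Anscombe quartet values rather than the actual anscombe.csv, and the helper name ols is hypothetical:

```python
# Standard Anscombe quartet values (assumed contents of anscombe.csv)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'I': (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II': (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV': ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


fits = {name: ols(x, y) for name, (x, y) in quartet.items()}
for name, (a, b) in fits.items():
    print(f"{name}: y = {a:.2f} + {b:.3f} x")
```

On the full data all four fits come out close to y = 3.00 + 0.500 x, even though the scatter plots drawn by the FacetGrid call look very different from one another.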