Data Analysis with Python

Datasets

Understanding Datasets

Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/

Importing and exporting data in different formats in Python

Data Format   Read              Save
CSV           pd.read_csv()     df.to_csv()
JSON          pd.read_json()    df.to_json()
Excel         pd.read_excel()   df.to_excel()
SQL           pd.read_sql()     df.to_sql()
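
For example, a minimal sketch of reading and writing the automobile dataset with pandas (the file names are illustrative assumptions; the raw file ships without a header row):

import pandas as pd

# hypothetical path to the automobile dataset; adjust to your environment
path = "auto.csv"

# the raw file has no header row, so tell pandas not to treat the first line as one
df = pd.read_csv(path, header=None)

# write a copy back to disk without the row index
df.to_csv("auto_copy.csv", index=False)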

Basic insights from the data

Jupyter Notebook: Import data


↥ back to top


Preprocessing Data in Python

How to deal with missing data

df.dropna(subset=["price"], axis=0, inplace=True)

is equivalent to

df = df.dropna(subset=["price"], axis=0)
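
Another common option is to replace missing values instead of dropping the rows. A sketch, assuming the "?" placeholder and a "normalized-losses" column as in the automobile dataset:

import numpy as np

# turn the "?" placeholders into NaN so pandas recognizes them as missing
df.replace("?", np.nan, inplace=True)

# replace missing values with the mean of the column (ignoring NaN)
mean_losses = df["normalized-losses"].astype("float").mean()
df["normalized-losses"] = df["normalized-losses"].astype("float").fillna(mean_losses)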

Data Formatting in Python

Data formatting means bringing data into a common standard of expression, e.g. converting a non-formatted unit into a formatted one.
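
A sketch of such a conversion, assuming a "city-mpg" column from the automobile dataset (miles per gallon converted to litres per 100 km):

# convert mpg to L/100km (the conversion constant is approximately 235)
df["city-mpg"] = 235 / df["city-mpg"]

# rename the column so the name matches the new unit
df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)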

Correcting data types
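
A minimal sketch of checking and correcting data types (that the "price" column is read in as text is an assumption about the raw automobile data):

# inspect the data type of each column
print(df.dtypes)

# cast a column that was read as object (string) to a numeric type
df["price"] = df["price"].astype("float")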

Data Normalization in Python

Approaches for normalization:

- Simple feature scaling: x_new = x_old / x_max
- Min-max: x_new = (x_old - x_min) / (x_max - x_min)
- Z-score (standard score): x_new = (x_old - mu) / sigma
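
A sketch of the three approaches applied to a single column (the "length" column is an assumed example; in practice you would pick one method, not all three):

# simple feature scaling: divide by the maximum value
df["length"] = df["length"] / df["length"].max()

# min-max normalization: rescale to the range [0, 1]
df["length"] = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())

# z-score standardization: centre on the mean, scale by the standard deviation
df["length"] = (df["length"] - df["length"].mean()) / df["length"].std()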

Binning

import numpy as np

# 4 equally spaced edges give 3 bins, one per label
bins = np.linspace(min(df["price"]), max(df["price"]), 4)
group_names = ["Low", "Medium", "High"]
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)

Turning categorical variables into quantitative variables in Python
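
One-hot encoding with pandas is the usual way to do this; a sketch assuming a "fuel-type" column with categories such as "gas" and "diesel":

# create one indicator (dummy) column per category
dummies = pd.get_dummies(df["fuel-type"])

# attach the indicator columns and drop the original categorical column
df = pd.concat([df, dummies], axis=1)
df.drop("fuel-type", axis=1, inplace=True)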

Jupyter Notebook: Preprocessing data


↥ back to top


Exploratory Data Analysis (EDA)

Learning Objectives: descriptive statistics, grouping data (groupby and pivot tables), correlation, and the chi-square test of association.

Descriptive Statistics - Describe()
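
A minimal sketch (value_counts() is useful for categorical columns; the "drive-wheels" column is an assumed example):

# summary statistics for all columns, including object (categorical) ones
df.describe(include="all")

# counts of each category in a single categorical column
drive_wheels_counts = df["drive-wheels"].value_counts().to_frame()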

Grouping data

groupby

A table in this form isn't easy to read or to visualize.

To make it easier to understand, we can transform this table to a pivot table by using the pivot method.

pivot

The price data now becomes a rectangular grid, which is easier to visualize. This is similar to what is usually done in Excel spreadsheets. Another way to represent the pivot table is using a heat map plot.
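
A sketch of grouping and pivoting (the "drive-wheels" and "body-style" column names are assumptions from the automobile dataset):

# average price for each combination of drive-wheels and body-style
df_group = df[["drive-wheels", "body-style", "price"]].groupby(
    ["drive-wheels", "body-style"], as_index=False).mean()

# reshape into a rectangular grid: one variable on the rows, the other on the columns
df_pivot = df_group.pivot(index="drive-wheels", columns="body-style")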

Heatmap
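
For example, the pivot table above can be rendered as a heat map with matplotlib's pcolor (a sketch building on df_pivot):

import matplotlib.pyplot as plt

# colour each cell of the pivot table by the average price it contains
plt.pcolor(df_pivot, cmap="RdBu")
plt.colorbar()
plt.show()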

Correlation

Correlation - Statistics

Pearson Correlation

The correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y is zero.
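
The Pearson coefficient and its p-value can be computed with SciPy; a sketch assuming the "horsepower" and "price" columns:

from scipy import stats

# a coefficient near +1 or -1 indicates a strong linear relationship;
# a small p-value means the coefficient is statistically significant
pearson_coef, p_value = stats.pearsonr(df["horsepower"], df["price"])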

Correlation Heatmap


↥ back to top


Association between two categorical variables: Chi-Square

Categorical variables

Chi-Square Test of association

See also: Chi-Square Test of Independence
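
A sketch of the test with SciPy, starting from a contingency table of observed counts (the "fuel-type" and "aspiration" columns are assumed for illustration):

from scipy.stats import chi2_contingency

# cross-tabulate the two categorical variables into a contingency table
cont_table = pd.crosstab(df["fuel-type"], df["aspiration"])

# chi-square statistic, p-value, degrees of freedom, and expected counts
chi2, p, dof, expected = chi2_contingency(cont_table)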

Jupyter Notebook: Exploratory Data Analysis (EDA)


↥ back to top


Model Development

Linear Regression and Multiple Linear Regression
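
A minimal sketch of fitting simple and multiple linear regression with scikit-learn (the predictor columns are the ones used elsewhere in these notes):

from sklearn.linear_model import LinearRegression

lm = LinearRegression()

# simple linear regression: a single predictor
lm.fit(df[["highway-mpg"]], df["price"])

# multiple linear regression: several predictors
Z = df[["horsepower", "curb-weight", "engine-size", "highway-mpg"]]
lm.fit(Z, df["price"])
Yhat = lm.predict(Z)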

Model Evaluation using Visualization

Regression Plot

A regression plot gives us a good estimate of the relationship between two variables, the strength of the correlation, and the direction of the relationship (positive or negative).

It shows a combination of a scatter plot of the data points and the fitted linear regression line.

import seaborn as sns
import matplotlib.pyplot as plt

sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)

Residual Plot

We expect the residuals to have zero mean, distributed evenly around the x-axis with similar variance.

import seaborn as sns

sns.residplot(x=df["highway-mpg"], y=df["price"])  # keyword arguments are required in recent seaborn releases

Distribution Plots

A distribution plot compares the distribution of the predicted values with that of the actual values. These plots are extremely useful for visualizing models with more than one independent variable or feature.

import seaborn as sns

# distplot is deprecated in recent seaborn releases; kdeplot/histplot are the replacements
ax1 = sns.distplot(df["price"], hist=False, color="r", label="Actual Value")
sns.distplot(Yhat, hist=False, color="b", label="Fitted Value", ax=ax1)


↥ back to top


Polynomial Regression and Pipelines

NumPy's polyfit function cannot perform multivariate polynomial regression. We use the preprocessing module of scikit-learn to create a polynomial-feature object.

from sklearn.preprocessing import PolynomialFeatures

pr = PolynomialFeatures(degree=2, include_bias=False)
x_poly = pr.fit_transform(x[['horsepower', 'curb-weight']])

As the dimension of the data gets larger, we may want to normalize multiple features at once. The scikit-learn preprocessing module simplifies this: for example, we can standardize each feature simultaneously with StandardScaler.

from sklearn.preprocessing import StandardScaler
SCALE = StandardScaler()
SCALE.fit(x_data[['horsepower', 'highway-mpg']])
x_scale = SCALE.transform(x_data[['horsepower', 'highway-mpg']])

We can simplify our code by using a Pipeline, which chains the transformations and the model together.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),
         ('model', LinearRegression())]
pipe = Pipeline(Input)
pipe.fit(df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y)
yhat = pipe.predict(df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])


↥ back to top


Measures for In-Sample Evaluation

Mean Squared Error (MSE)

from sklearn.metrics import mean_squared_error

mean_squared_error(df['price'], Y_predict_simple_fit)

R-squared

R^2 = 1 - (MSE of the regression line) / (MSE of the average of the data)

X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)
lm.score(X, Y)  # 0.496591188

We can say that approximately 49.7% of the variation of price is explained by this simple linear model.

Jupyter Notebook: Model Development


↥ back to top


Model Evaluation and Refinement

Training/Testing Sets

Function cross_val_score()

One of the most common out-of-sample evaluation metrics is cross-validation.

The simplest way to apply cross-validation is to call the cross_val_score function, which performs multiple out-of-sample evaluations.

import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(lr, x_data, y_data, cv=3)
np.mean(scores)  # average the R^2 scores across the 3 folds

Function cross_val_predict()

from sklearn.model_selection import cross_val_predict

yhat = cross_val_predict(lr, x_data, y_data, cv=3)

Overfitting, Underfitting and Model Selection

Calculate different R-squared values as follows:

Rsqu_test = []
order = [1,2,3,4]

for n in order:
  pr = PolynomialFeatures(degree=n)
  x_train_pr = pr.fit_transform(x_train[['horsepower']])
  x_test_pr = pr.transform(x_test[['horsepower']])  # transform (not fit) the test set
  lr.fit(x_train_pr, y_train)
  Rsqu_test.append(lr.score(x_test_pr, y_test))


↥ back to top


Ridge Regression

Ridge regression is employed in a multiple regression model when multicollinearity occurs, i.e. when there is a strong relationship among the independent variables. Ridge regression is very common with polynomial regression.

The columns correspond to the different polynomial coefficients, and the rows correspond to the different values of alpha.
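
A minimal sketch of fitting ridge regression for a single value of alpha (x_poly and y are assumed to be the polynomial features and target from earlier):

from sklearn.linear_model import Ridge

# larger alpha means stronger regularization of the coefficients
RidgeModel = Ridge(alpha=0.1)
RidgeModel.fit(x_poly, y)
yhat = RidgeModel.predict(x_poly)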

Grid Search takes the model or objects you would like to train and different values of the hyperparameters. It then calculates the mean square error or R-squared for various hyperparameter values, allowing you to choose the best values.

Use the validation dataset to pick the best hyperparameters.

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

parameters1 = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000]}]

RR = Ridge()
Grid1 = GridSearchCV(RR, parameters1, cv=4)
Grid1.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_data)
Grid1.best_estimator_

scores = Grid1.cv_results_
scores['mean_test_score']

One advantage of Grid Search is how quickly we can test multiple parameters.

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# note: the 'normalize' option was removed from Ridge in recent scikit-learn releases;
# on newer versions, scale the features separately (e.g. with StandardScaler)
parameters2 = [{'alpha': [0.001, 0.1, 1, 10, 100], 'normalize': [True, False]}]

RR = Ridge()
Grid1 = GridSearchCV(RR, parameters2, cv=4, return_train_score=True)  # train scores are needed for the loop below
Grid1.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_data)
Grid1.best_estimator_

scores = Grid1.cv_results_

for param, mean_test, mean_train in zip(scores['params'], scores['mean_test_score'], scores['mean_train_score']):
  print(param, "R^2 on test data:", mean_test, "R^2 on train data:", mean_train)

Jupyter Notebook: Model Evaluation and Refinement

Jupyter Notebook: House Sales in King County, USA


↥ back to top