
Statistics for Data Science with Python

This course is part of the Data Science Fundamentals with Python and SQL Specialization. It is useful for data analysis but is not included in this Data Science Professional Certificate.

Introduction and Descriptive Statistics

Types of Data

Measure of Central Tendency

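# df is assumed to be a pandas DataFrame that has already been loaded
# get information about each variable (column names, non-null counts, dtypes)
df.info()

# summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
df.describe()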

Measure of Dispersion

Dispersion, which is also called variability, scatter or spread, is the extent to which the data distribution is stretched or squeezed. The common measures of dispersion are standard deviation and variance.
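As a minimal sketch of the two measures named above, the snippet below computes variance and standard deviation with NumPy. The `scores` values are made up for illustration and are not from the course data.

```python
import numpy as np

# Hypothetical sample values, made up for illustration
scores = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# ddof=1 gives the sample variance (divides by n - 1);
# ddof=0 (NumPy's default) gives the population variance (divides by n)
sample_var = scores.var(ddof=1)
sample_std = scores.std(ddof=1)
pop_var = scores.var(ddof=0)
pop_std = scores.std(ddof=0)

print(f"sample variance = {sample_var:.3f}, sample std = {sample_std:.3f}")
print(f"population variance = {pop_var:.3f}, population std = {pop_std:.3f}")
```

Note that pandas `std()` and `var()` default to ddof=1, while NumPy defaults to ddof=0, so the two libraries can give slightly different numbers on the same data.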

Reliability

Jupyter Notebook: Descriptive Statistics




Data Visualization

The Extreme Presentation Method:

Jupyter Notebook: Visualizing Data




Introduction to Probability Distribution

Hypothesis Test

To use the p-value and the significance level together, decide on a value for alpha after you state your hypotheses. Suppose alpha = 0.10 (10%). You then collect the data and calculate the p-value. If the p-value is less than alpha, you reject the null hypothesis; otherwise you fail to reject it.
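The comparison of a p-value against alpha can be sketched as follows. This is only an illustration: the `sample` values and the null-hypothesis mean of 50 are hypothetical, and a one-sample t-test from SciPy stands in for whatever test the analysis actually calls for.

```python
from scipy import stats

alpha = 0.10  # significance level, chosen before looking at the data

# Hypothetical sample; null hypothesis: the population mean equals 50
sample = [51.2, 49.8, 52.5, 53.1, 50.9, 52.2, 51.7, 50.4]
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

# Decision rule: compare the p-value with alpha
if p_value < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(f"t = {t_stat:.3f}, p = {p_value:.4f}: {decision}")
```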

Normal Distribution:

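import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# x values from -4 up to (but not including) 4, in steps of 0.1
x_axis = np.arange(-4, 4, 0.1)
# density of the standard normal distribution (mean 0, standard deviation 1)
plt.plot(x_axis, norm.pdf(x_axis, 0, 1))
plt.show()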

Jupyter Notebook: T Test

Z test or T test

Comparing means - 4 cases:

| Type of test | z or t statistic* | Expected p-value | Decision |
| --- | --- | --- | --- |
| Two-tailed test | The absolute value of the calculated z or t statistic is greater than 1.96 | Less than 0.05 | Reject the null hypothesis |
| One-tailed test | The absolute value of the calculated z or t statistic is greater than 1.64 | Less than 0.05 | Reject the null hypothesis |

* In large samples this rule of thumb also holds for the t-test, because the t-distribution approximates the normal distribution as the sample size grows.
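The 1.96 and 1.64 cut-offs can be checked with SciPy's inverse CDF (`ppf`). The sketch below assumes alpha = 0.05, and the degrees of freedom in the loop are arbitrary values chosen to show the t critical value approaching the z critical value.

```python
from scipy.stats import norm
from scipy.stats import t as t_dist

alpha = 0.05

# Critical values of the standard normal distribution
z_two_tailed = norm.ppf(1 - alpha / 2)  # two-tailed: alpha split across both tails
z_one_tailed = norm.ppf(1 - alpha)      # one-tailed: all of alpha in one tail
print(f"two-tailed z critical value: {z_two_tailed:.3f}")
print(f"one-tailed z critical value: {z_one_tailed:.3f}")

# Two-tailed t critical values for increasing degrees of freedom
for dof in (10, 30, 1000):
    t_crit = t_dist.ppf(1 - alpha / 2, dof)
    print(f"df = {dof}: two-tailed t critical value = {t_crit:.3f}")
```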

Levene’s Test

Levene’s test checks whether variances are equal across all samples, and it remains usable when your data comes from a non-normal distribution. You can use Levene’s test to check the assumption of equal variances before running a test like One-Way ANOVA.
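A minimal sketch of the test using `scipy.stats.levene` might look like this. The three groups are hypothetical values made up for illustration, and the 0.05 threshold is an assumed significance level.

```python
from scipy.stats import levene

# Hypothetical measurements for three groups, made up for illustration
group_a = [23.1, 25.4, 24.8, 26.0, 24.2, 25.1]
group_b = [22.0, 27.9, 21.5, 28.3, 24.6, 26.7]
group_c = [24.9, 25.2, 24.7, 25.5, 25.0, 24.8]

# center="median" is SciPy's default and is the variant recommended
# for skewed (non-normal) data
stat, p_value = levene(group_a, group_b, group_c, center="median")

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print(f"W = {stat:.3f}, p = {p_value:.4f}: variances appear unequal")
else:
    print(f"W = {stat:.3f}, p = {p_value:.4f}: no evidence of unequal variances")
```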

ANOVA

ANOVA - Comparing means of more than two groups

Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the “variation” among and between groups) used to analyze the differences among means.
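As a rough illustration, a one-way ANOVA across three groups can be run with `scipy.stats.f_oneway`. The group values below are hypothetical, and the test assumes roughly equal variances across groups.

```python
from scipy.stats import f_oneway

# Hypothetical scores for three groups, made up for illustration
group_1 = [20, 21, 19, 22, 20]
group_2 = [30, 31, 29, 32, 30]
group_3 = [40, 41, 39, 42, 40]

# Null hypothesis: all group means are equal
f_stat, p_value = f_oneway(group_1, group_2, group_3)

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print(f"F = {f_stat:.2f}, p = {p_value:.4g}: at least one group mean differs")
else:
    print(f"F = {f_stat:.2f}, p = {p_value:.4g}: no evidence that the means differ")
```

A significant result only says that at least one mean differs; it does not say which one.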

Correlation Test

A correlation test is used to evaluate the association between two or more variables. For instance, if we want to know whether there is a relationship between the heights of fathers and sons, a correlation coefficient can be calculated to answer this question.
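The father and son example can be sketched with `scipy.stats.pearsonr`. The heights below are hypothetical values in centimetres, and Pearson's coefficient is only one possible choice; it measures linear association.

```python
from scipy.stats import pearsonr

# Hypothetical heights in cm, made up for illustration
father_heights = [165.0, 170.0, 172.0, 175.0, 178.0, 180.0, 183.0, 188.0]
son_heights = [168.0, 171.0, 175.0, 174.0, 180.0, 179.0, 185.0, 186.0]

# Null hypothesis: there is no linear correlation between the two variables
r, p_value = pearsonr(father_heights, son_heights)

print(f"Pearson r = {r:.3f}, p = {p_value:.4g}")
```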


Jupyter Notebook: Hypothesis Testing




Regression Analysis

Linear regression models a linear relationship between a response variable and one or more predictor variables. It can be used to predict the value of a continuous variable based on the value of another continuous variable. The t-test statistic helps to determine whether the relationship between the response and a predictor variable is statistically significant. In linear regression, a one-sample t-test is used to test the null hypothesis that the slope (the coefficient) is equal to zero. In a multiple regression model, each predictor variable has its own null hypothesis that its coefficient is equal to zero.
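For the single-predictor case, the slope test described above can be sketched with `scipy.stats.linregress`, whose reported p-value is for the null hypothesis that the slope is zero. The `x` and `y` values are hypothetical, and this is a substitute for the fuller model summaries a dedicated regression library would provide.

```python
from scipy.stats import linregress

# Hypothetical predictor and response values, made up for illustration
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.2]

result = linregress(x, y)

# t statistic for the slope: the estimate divided by its standard error
t_stat = result.slope / result.stderr

print(f"slope = {result.slope:.3f}, intercept = {result.intercept:.3f}")
print(f"t = {t_stat:.2f}, p = {result.pvalue:.4g} (null hypothesis: slope = 0)")
```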

Regression in place of t-test

Regression in place of ANOVA

Regression in place of Correlation

Jupyter Notebook: Regression Analysis




Cheat Sheet for Statistical Analysis in Python

Descriptive Statistics

Here is a quick review of some popular functions:

Data Visualization

One of the most popular visualization tools is the seaborn library, a Python data visualization library based on matplotlib. To get access to the functions in seaborn, or in any library, you must first import it. To import the seaborn library: import seaborn.

Here is a quick summary for creating graphs and plots:

Hypothesis Testing

Jupyter Notebook: Final Project

