
Machine Learning with Python

Introduction to Machine Learning

Learning Objectives:

Machine learning is the subfield of computer science that gives “computers the ability to learn without being explicitly programmed.”

In essence, machine learning follows the same process that a 4-year-old child uses to learn, understand, and differentiate animals. So, machine learning algorithms, inspired by the human learning process, iteratively learn from data, and allow computers to find hidden insights.

Major machine learning techniques:

| Technique | Application |
| --- | --- |
| Regression/Estimation | Predicting continuous values |
| Classification | Predicting the item class/category of a case |
| Clustering | Finding the structure of data; summarization |
| Associations | Associating frequent co-occurring items/events |
| Anomaly detection | Discovering abnormal and unusual cases |
| Sequence mining | Predicting next events; click-stream (Markov Model, HMM) |
| Dimension Reduction | Reducing the size of data (PCA) |
| Recommendation systems | Recommending items |

“What is the difference between these buzzwords that we keep hearing these days, such as Artificial intelligence (or AI), Machine Learning and Deep Learning?”

Supervised vs. Unsupervised Learning


↥ back to top


Regression

Learning Objectives:

Regression algorithms:

Simple Linear Regression

How to find the best parameters for the line: (two options)

Model Evaluation in Regression Models

Evaluation Metrics

Evaluation metrics are used to explain the performance of a model.

Jupyter Notebook: Simple Linear Regression
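
A minimal sketch of the workflow the notebook walks through, assuming the same FuelConsumptionCo2.csv dataset and column names used in the polynomial regression example later in these notes. The printed MAE, MSE, and R² values correspond to the evaluation metrics described above.

import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import r2_score

# load data (same dataset as the polynomial regression example below)
df = pd.read_csv("FuelConsumptionCo2.csv")
cdf = df[['ENGINESIZE', 'CO2EMISSIONS']]

# train/test split (80/20 random mask)
msk = np.random.rand(len(cdf)) < 0.8
train, test = cdf[msk], cdf[~msk]

train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])

# fit a simple linear regression: CO2EMISSIONS ≈ intercept + coef * ENGINESIZE
regr = linear_model.LinearRegression()
regr.fit(train_x, train_y)
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)

# evaluate on the held-out test set
test_y_ = regr.predict(test_x)
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y, test_y_))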


↥ back to top


Multiple Linear Regression

Estimating multiple linear regression parameters:

Questions:

See also: How to Choose a Feature Selection Method For Machine Learning

Jupyter Notebook: Multiple Linear Regression
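
A minimal sketch of fitting a multiple linear regression with ordinary least squares in scikit-learn, again assuming the FuelConsumptionCo2.csv dataset and columns used elsewhere in these notes.

import numpy as np
import pandas as pd
from sklearn import linear_model

# load data (same dataset assumed as in the other regression examples)
df = pd.read_csv("FuelConsumptionCo2.csv")
cdf = df[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB', 'CO2EMISSIONS']]

# train/test split
msk = np.random.rand(len(cdf)) < 0.8
train, test = cdf[msk], cdf[~msk]

# ordinary least squares with several predictors
regr = linear_model.LinearRegression()
x = np.asanyarray(train[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']])
y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(x, y)
print('Coefficients: ', regr.coef_)

# out-of-sample evaluation
test_x = np.asanyarray(test[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
y_hat = regr.predict(test_x)
print("Mean squared error: %.2f" % np.mean((y_hat - test_y) ** 2))
print('Variance score (R2): %.2f' % regr.score(test_x, test_y))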


↥ back to top


Non-Linear Regression

How can I know if a problem is linear or non-linear in an easy way?

How should I model my data if it displays non-linear on a scatter plot?

See also: Data Analysis with Python

Jupyter Notebook: Polynomial Regression

import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
from sklearn.metrics import r2_score

# read dataset
df = pd.read_csv("FuelConsumptionCo2.csv")
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
# cdf.head()

# split dataset
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]

train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])

test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])

# polynomial regression (transform first)
poly = PolynomialFeatures(degree=2)
train_x_poly = poly.fit_transform(train_x)
# train_x_poly

# linear regression
clf = linear_model.LinearRegression()
clf.fit(train_x_poly, train_y)

# The coefficients
print ('Coefficients: ', clf.coef_)
print ('Intercept: ',clf.intercept_)

# plot
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS,  color='blue')
XX = np.arange(0.0, 10.0, 0.1)
yy = clf.intercept_[0]+ clf.coef_[0][1]*XX+ clf.coef_[0][2]*np.power(XX, 2)
plt.plot(XX, yy, '-r' )
plt.xlabel("Engine size")
plt.ylabel("Emission")

# test
test_x_poly = poly.transform(test_x)
test_y_ = clf.predict(test_x_poly)

print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y,test_y_ ) )
Coefficients:  [[ 0.         51.79906437 -1.70908836]]
Intercept:  [104.94682631]
Mean absolute error: 24.32
Residual sum of squares (MSE): 936.39
R2-score: 0.77

Jupyter Notebook: Non-Linear Regression

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score

df = pd.read_csv("china_gdp.csv")
# df.head(10)

# choose model
def sigmoid(x, Beta_1, Beta_2):
     y = 1 / (1 + np.exp(-Beta_1*(x-Beta_2)))
     return y

# normalize data
x_data, y_data = (df["Year"].values, df["Value"].values)
xdata =x_data/max(x_data)
ydata =y_data/max(y_data)


# build the model using train set
popt, pcov = curve_fit(sigmoid, xdata, ydata)

# plot
x = np.linspace(1960, 2015, 55)
x = x/max(x)
plt.figure(figsize=(8,5))
y = sigmoid(x, *popt)
plt.plot(xdata, ydata, 'ro', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.legend(loc='best')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()


↥ back to top


Classification

Learning Objectives:

Classification algorithms:

k-Nearest Neighbors algorithm

  1. Pick a value for k
  2. Calculate the distance of unknown case from all cases
  3. Select the k-observations in the training data that are “nearest” to the unknown data point
  4. Predict the response of the unknown data point using the most popular response value from the k-nearest neighbors

How can we find the best value for k?

Jupyter Notebook: k-Nearest Neighbors
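
A minimal sketch of k-NN classification, including the usual way to answer the question above: train a model for each candidate k and keep the one with the best out-of-sample accuracy. The Iris dataset is only a placeholder for whatever data the notebook uses.

import numpy as np
from sklearn.datasets import load_iris          # placeholder dataset for illustration
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn import preprocessing

# load and standardize a placeholder dataset (the notebook uses its own data)
X, y = load_iris(return_X_y=True)
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# try a range of k values and keep the accuracy of each model
Ks = 10
mean_acc = np.zeros(Ks - 1)
for k in range(1, Ks):
    neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    mean_acc[k - 1] = metrics.accuracy_score(y_test, yhat)

print("Best accuracy: %.3f with k = %d" % (mean_acc.max(), mean_acc.argmax() + 1))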


↥ back to top


Evaluation Metrics in Classification

Jaccard Index

F1 Score

Log Loss
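
A minimal sketch of computing the three metrics above with scikit-learn; the labels and probabilities are toy placeholders.

import numpy as np
from sklearn.metrics import jaccard_score, f1_score, log_loss

# toy placeholder labels and predictions for illustration
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
# predicted probability of class 1, needed for log loss
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3])

# Jaccard index: size of the intersection divided by the size of the union of predicted and true labels
print("Jaccard:", jaccard_score(y_true, y_pred))

# F1 score: harmonic mean of precision and recall
print("F1:", f1_score(y_true, y_pred))

# Log loss: penalizes confident but wrong probability estimates
print("Log loss:", log_loss(y_true, y_prob))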


↥ back to top


Decision Trees

Entropy

In which tree do we have less entropy after splitting than before? The answer is the tree with the higher information gain after splitting.

#### [Information Gain](https://en.wikipedia.org/wiki/Information_gain_(decision_tree))

Information gain is the reduction in uncertainty achieved by splitting: it is the entropy of the tree before the split minus the weighted entropy after the split by an attribute. We can think of information gain and entropy as opposites: as entropy (the amount of randomness) decreases, information gain (the amount of certainty) increases, and vice versa. So, constructing a decision tree is all about finding the attributes that return the highest information gain.
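
A small worked sketch of the two quantities just described, using hypothetical class counts for a parent node and its two children. The decision tree example below then uses scikit-learn's entropy criterion to do this search automatically.

import numpy as np

def entropy(p):
    """Entropy of a binary node with class proportions p and 1 - p."""
    if p in (0, 1):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# hypothetical parent node: 9 positive / 5 negative samples
parent = entropy(9 / 14)

# hypothetical split into two children: (6 pos, 2 neg) and (3 pos, 3 neg)
left, right = entropy(6 / 8), entropy(3 / 6)
weighted_children = (8 / 14) * left + (6 / 14) * right

# information gain = entropy before the split - weighted entropy after the split
info_gain = parent - weighted_children
print("Information gain: %.3f" % info_gain)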

import numpy as np 
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import sklearn.tree as tree
from sklearn import preprocessing

# load data
my_data = pd.read_csv("drug200.csv", delimiter=",")
X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
y = my_data["Drug"]

# preprocess data
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F','M'])
X[:,1] = le_sex.transform(X[:,1]) 

le_BP = preprocessing.LabelEncoder()
le_BP.fit([ 'LOW', 'NORMAL', 'HIGH'])
X[:,2] = le_BP.transform(X[:,2])

le_Chol = preprocessing.LabelEncoder()
le_Chol.fit([ 'NORMAL', 'HIGH'])
X[:,3] = le_Chol.transform(X[:,3]) 

# split dataset
from sklearn.model_selection import train_test_split
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)

# build model
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
drugTree.fit(X_trainset,y_trainset)

# predict
predTree = drugTree.predict(X_testset)

# evaluate
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, predTree))

# plot
tree.plot_tree(drugTree)
plt.show()
DecisionTrees's Accuracy:  0.9833333333333333

Jupyter Notebook: Decision Trees


↥ back to top


Logistic Regression

Logistic regression is a statistical and machine learning technique for classifying records of a dataset based on the values of the input fields. Logistic regression can be used for both binary classification and multi-class classification.

Logistic regression estimates the probability of an event occurring, such as voted or didn’t vote, based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1.

In logistic regression, a logit transformation is applied to the odds, that is, the probability of success divided by the probability of failure. This is also commonly known as the log odds, or the natural logarithm of the odds. The logistic (sigmoid) function and the logit are given by the following formulas:

sigmoid(t) = 1 / (1 + exp(-t))
logit(p_i) = ln(p_i / (1 - p_i)) = Beta_0 + Beta_1*X_1 + … + Beta_k*X_k

Logistic regression applications:

The training process:

  1. Initialize θ
  2. Calculate ŷ = σ(θᵀX) for a sample
  3. Compare the output of ŷ with actual output of sample, y, and record it as error
  4. Calculate the error for all samples
  5. Change the θ to reduce the cost
  6. Go back to step 2

Minimizing the cost function of the model

Gradient Descent

The gradient is the slope of the surface at every point, and the direction of the gradient is the direction of steepest ascent.

The magnitude of the gradient also indicates how big a step to take: if the slope is large, we are far from the minimum and take a large step; if the slope is small, we take a smaller step. Gradient descent therefore takes increasingly smaller steps towards the minimum with each iteration.

We also multiply the gradient by a constant µ, called the learning rate, which gives us additional control over how fast we move on the surface. In short, gradient descent is like taking steps in the current direction of the slope, and the learning rate is like the length of the step you take.
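
A minimal sketch of the training loop described above, implementing plain batch gradient descent on the logistic regression (log loss) cost. The data, the learning rate µ (written mu), and the iteration count are all illustrative choices.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent(X, y, mu=0.1, n_iter=1000):
    """Minimize the logistic regression cost with plain batch gradient descent."""
    theta = np.zeros(X.shape[1])           # step 1: initialize theta
    for _ in range(n_iter):
        y_hat = sigmoid(X @ theta)         # step 2: predictions for all samples
        grad = X.T @ (y_hat - y) / len(y)  # steps 3-4: gradient of the cost over all samples
        theta -= mu * grad                 # step 5: move against the gradient, scaled by the learning rate
    return theta

# toy data for illustration: a bias column plus one feature
X = np.c_[np.ones(6), np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])]
y = np.array([0, 0, 0, 1, 1, 1])
print(gradient_descent(X, y))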

Jupyter Notebook: Logistic Regression

import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
%matplotlib inline 
import matplotlib.pyplot as plt

# load data
churn_df = pd.read_csv("ChurnData.csv")
churn_df = churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip',   'callcard', 'wireless','churn']]
churn_df['churn'] = churn_df['churn'].astype('int')
X = np.asarray(churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])
y = np.asarray(churn_df['churn'])

# preprocess data
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)

# split dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

# build model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)

# predict
yhat = LR.predict(X_test)
yhat_prob = LR.predict_proba(X_test)

# evaluate
# jaccard index
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat,pos_label=0)

# confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
print(confusion_matrix(y_test, yhat, labels=[1,0]))

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])
np.set_printoptions(precision=2)


# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False,  title='Confusion matrix')

print (classification_report(y_test, yhat))

# log loss
from sklearn.metrics import log_loss
log_loss(y_test, yhat_prob)
Train set: (160, 7) (160,)
Test set: (40, 7) (40,)
[[ 6  9]
[ 1 24]]
Confusion matrix, without normalization
[[ 6  9]
[ 1 24]]
              precision    recall  f1-score   support

          0       0.73      0.96      0.83        25
          1       0.86      0.40      0.55        15

    accuracy                           0.75        40
  macro avg       0.79      0.68      0.69        40
weighted avg       0.78      0.75      0.72        40

0.6017092478101185


↥ back to top


Support Vector Machine

Kernel methods

The SVM algorithm offers a choice of kernel functions for performing its processing. Basically, mapping data into a higher dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:

  1. Linear
  2. Polynomial
  3. Radial basis function (RBF)
  4. Sigmoid

Pros and cons of SVM

SVM applications

Jupyter Notebook: SVM

import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline 
import matplotlib.pyplot as plt

# load data
cell_df = pd.read_csv("cell_samples.csv")
# BareNuc column includes some values that are not numerical. drop those rows
cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()] 
cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')

feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(feature_df)

cell_df['Class'] = cell_df['Class'].astype('int')
y = np.asarray(cell_df['Class'])

# split dataset
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

# build model
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train) 

# predict
yhat = clf.predict(X_test)

# evaluate
from sklearn.metrics import classification_report, confusion_matrix
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])
np.set_printoptions(precision=2)

print(classification_report(y_test, yhat))

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'],normalize= False,  title='Confusion matrix')

# f1_score
from sklearn.metrics import f1_score
print(f1_score(y_test, yhat, average='weighted'))

from sklearn.metrics import jaccard_score
print(jaccard_score(y_test, yhat,pos_label=2))
Train set: (546, 9) (546,)
Test set: (137, 9) (137,)
              precision    recall  f1-score   support

           2       1.00      0.94      0.97        90
           4       0.90      1.00      0.95        47

    accuracy                           0.96       137
   macro avg       0.95      0.97      0.96       137
weighted avg       0.97      0.96      0.96       137

Confusion matrix, without normalization
[[85  5]
 [ 0 47]]
0.9639038982104676
0.9444444444444444


↥ back to top


Clustering

Learning Objectives:

Why clustering?


↥ back to top


K-Means

The objective of K-Means is to form clusters so that similar samples fall into the same cluster and dissimilar samples fall into different clusters. In practice, instead of a similarity metric we use a dissimilarity metric: conventionally, the distance of samples from each other is used to shape the clusters.

So we can say K-Means tries to minimize the intra-cluster distances and maximize the inter-cluster distances.

How can we calculate the dissimilarity or distance of two cases such as two customers?

K-Means clustering algorithm

  1. Randomly place k centroids, one for each cluster
  2. Calculate the distance of each point from each centroid
    • Euclidean distance is commonly used to measure the distance from each object to the centroid, but other distance measures can also be used; Euclidean distance is simply the most popular choice.
  3. Assign each data point (object) to its closest centroid, creating a cluster
  4. Recalculate the positions of the k centroids
  5. Repeat steps 2-4 until the centroids no longer move

K-Means accuracy

Elbow point

K-Means recap

Jupyter Notebook: K-Means
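
A minimal sketch of K-Means with scikit-learn, including the elbow heuristic mentioned above; the synthetic blobs are only a placeholder for the notebook's dataset.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic placeholder data; the notebook uses its own dataset
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.9, random_state=0)

# fit K-Means for a range of k and record the inertia (sum of squared intra-cluster distances)
inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# the "elbow" in this curve is a common heuristic for choosing k
plt.plot(ks, inertias, 'o-')
plt.xlabel('k')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()

# final model with the chosen k
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.labels_[:10], km.cluster_centers_.shape)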


↥ back to top


Hierarchical clustering

Hierarchical clustering algorithms build a hierarchy of clusters in which each node is a cluster consisting of the clusters of its daughter nodes.

Strategies for hierarchical clustering generally fall into two types, divisive and agglomerative.

Hierarchical clustering is typically visualized as a dendrogram. Essentially, hierarchical clustering does not require a prespecified number of clusters.

Dendrogram source: Wikipedia

How can we calculate the distance between clusters when there are multiple points in each cluster?

Advantages vs. disadvantages of Hierarchical clustering

| Advantages | Disadvantages |
| --- | --- |
| Doesn't require the number of clusters to be specified | Can never undo any previous step throughout the algorithm |
| Easy to implement | Generally has long runtimes |
| Produces a dendrogram, which helps with understanding the data | Sometimes difficult to identify the number of clusters from the dendrogram |

Hierarchical clustering vs. K-means

| K-means | Hierarchical clustering |
| --- | --- |
| Much more efficient | Can be slow for large datasets |
| Requires the number of clusters to be specified | Doesn't require the number of clusters to run |
| Gives only one partitioning of the data based on the predefined number of clusters | Gives more than one partitioning depending on the resolution |
| Potentially returns different clusters each time it is run due to random initialization of centroids | Always generates the same clusters |

Jupyter Notebook: Hierarchical clustering
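
A minimal sketch of agglomerative clustering plus a dendrogram, using synthetic placeholder data rather than the notebook's dataset.

import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# synthetic placeholder data; the notebook uses its own dataset
X, _ = make_blobs(n_samples=50, centers=3, cluster_std=0.9, random_state=0)

# agglomerative (bottom-up) clustering with complete linkage
agglom = AgglomerativeClustering(n_clusters=3, linkage='complete')
labels = agglom.fit_predict(X)
print(labels[:10])

# build and plot the dendrogram from the same observations
Z = hierarchy.linkage(X, 'complete')
hierarchy.dendrogram(Z)
plt.show()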


↥ back to top


DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN)

When applied to tasks with arbitrarily shaped clusters, or clusters within clusters, traditional techniques might not achieve good results; that is, elements in the same cluster might not share enough similarity, or the performance may be poor.

Additionally, while partitioning-based algorithms such as K-Means may be easy to understand and implement in practice, the algorithm has no notion of outliers; that is, all points are assigned to a cluster even if they do not belong in any.

In the domain of anomaly detection, this causes problems, as anomalous points will be assigned to the same cluster as normal data points. The anomalous points pull the cluster centroid towards them, making it harder to classify them as anomalous.

In contrast, density-based clustering locates regions of high density that are separated from one another by regions of low density.

Density in this context is defined as the number of points within a specified radius. A specific and very popular type of density-based clustering is DBSCAN.

DBSCAN can be used, for example, to find groups of weather stations that show the same weather conditions. It not only finds arbitrarily shaped clusters, it also finds the denser parts of the data while ignoring less dense areas and noise.

DBSCAN algorithm

To see how DBSCAN works, we have to determine the type of points. Each point in our dataset can be either a core, border, or outlier point.

The whole idea behind the DBSCAN algorithm is to visit each point and find its type first, then we group points as clusters based on their types.

What is a core point? A data point is a core point if at least M points lie within its neighborhood, i.e., within the specified radius around it.

What is a border point? A data point is a border point if it has fewer than M points in its own neighborhood but is reachable from (that is, lies within the neighborhood of) a core point.

In other words, a point can lie within the neighborhood of a core point and still not be a core point itself, because it does not have at least M points in its own neighborhood.

What is an outlier? An outlier is a point that is not a core point and also is not close enough to be reachable from a core point.

The next step is to connect core points that are neighbors and put them in the same cluster.

So, a cluster is formed from at least one core point, plus all core points reachable from it, plus all of their border points. This simply shapes all the clusters and identifies the outliers as well.

Advantages of DBSCAN

Jupyter Notebook: DBSCAN
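
A minimal sketch of DBSCAN with scikit-learn on synthetic placeholder data; eps is the neighborhood radius and min_samples plays the role of M from the description above.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# synthetic placeholder data; the notebook clusters weather station data instead
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=0)
X = StandardScaler().fit_transform(X)

# eps is the neighborhood radius; min_samples is the minimum number of points (M) for a core point
db = DBSCAN(eps=0.3, min_samples=6).fit(X)
labels = db.labels_

# points labelled -1 are the outliers (noise); the other labels are cluster ids
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print("clusters:", n_clusters, "noise points:", n_noise)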


↥ back to top


Recommender Systems

Learning Objectives:

Advantages of recommender systems

Implementing recommender systems

See also: Recommendation Systems on Google Machine Learning Course

Content-based recommender systems

A content-based recommendation system tries to recommend items to users based on their profiles. The user's profile revolves around that user's preferences and tastes. It is shaped based on user ratings, including the number of times that user has clicked on different items or perhaps even liked those items.

The recommendation process is based on the similarity between those items. Similarity, or closeness of items, is measured based on the similarity in the content of those items. When we say content, we are talking about things like the item's category, tags, genre, and so on.

Advantages and Disadvantages of Content-Based Filtering

Jupyter Notebook: Content-based recommender systems
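
A minimal sketch of the weighted-feature approach described above: build a user profile from the user's ratings and the item features, then score unseen items against that profile. All item names and genres are made up for illustration.

import pandas as pd

# toy item-feature (genre) matrix; names and genres are made up for illustration
items = pd.DataFrame({
    'Action': [1, 0, 1, 0],
    'Comedy': [0, 1, 0, 1],
    'Drama':  [1, 1, 0, 0],
}, index=['Movie A', 'Movie B', 'Movie C', 'Movie D'])

# ratings the user has already given to some of the items
ratings = pd.Series({'Movie A': 5.0, 'Movie B': 2.0})

# user profile: weight each rated item's features by the user's rating and sum per feature
profile = items.loc[ratings.index].T.dot(ratings)

# score the remaining items: weighted average of their features against the profile
candidates = items.drop(ratings.index)
scores = candidates.dot(profile) / profile.sum()
print(scores.sort_values(ascending=False))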


↥ back to top


Collaborative Filtering

User-based vs. Item-based

Advantages and Disadvantages of Collaborative Filtering

Challenges of collaborative filtering

Collaborative filtering is a very effective recommendation system. However, there are some challenges with it as well.

There are some solutions for each of these challenges such as using hybrid based recommender systems.

Jupyter Notebook: Collaborative Filtering
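
A minimal sketch of user-based collaborative filtering with Pearson correlation as the similarity measure; the ratings matrix and user names are toy placeholders.

import numpy as np
import pandas as pd

# toy user-item ratings matrix (NaN = not rated); purely for illustration
ratings = pd.DataFrame({
    'Item 1': [5, 4, np.nan, 1],
    'Item 2': [4, np.nan, 2, 1],
    'Item 3': [np.nan, 3, 5, np.nan],
    'Item 4': [1, 1, 4, 5],
}, index=['User A', 'User B', 'User C', 'Target'])

target = ratings.loc['Target']
others = ratings.drop('Target')

# Pearson correlation between the target user and every other user (on co-rated items)
sims = others.apply(lambda row: row.corr(target), axis=1)

# predict the target's rating for an unseen item as a similarity-weighted average
item = 'Item 3'
rated = others[item].dropna()
# keep only positively correlated neighbours for simplicity
weights = sims[rated.index].clip(lower=0)
prediction = (rated * weights).sum() / weights.sum()
print("similarities:\n", sims)
print("predicted rating for", item, ":", round(prediction, 2))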

Jupyter Notebook: Final Project


↥ back to top