Tools for Data Science

Languages of Data Science

Python

  1. Python is a high-level general-purpose programming language that can be applied to many different classes of problems.
  2. It has a large standard library that provides tools suited to many different tasks, including but not limited to databases, automation, web scraping, text processing, image processing, machine learning, and data analytics.
  3. For data science, you can use Python’s scientific computing libraries such as Pandas, NumPy, SciPy, and Matplotlib.
  4. For artificial intelligence, it has TensorFlow, PyTorch, Keras, and Scikit-learn.
  5. Python can also be used for Natural Language Processing (NLP) using the Natural Language Toolkit (NLTK).
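As a small taste of the scientific stack mentioned above, here is a minimal NumPy sketch (the data values are invented for illustration); Pandas, SciPy, and Matplotlib build on the same array foundation.

```python
import numpy as np

# NumPy arrays support vectorized arithmetic and summary statistics
# without explicit Python loops.
temperatures = np.array([21.5, 22.0, 19.8, 23.1, 20.4])

mean_temp = temperatures.mean()
spread = temperatures.max() - temperatures.min()

print(f"mean={mean_temp:.2f}, range={spread:.2f}")  # mean=21.36, range=3.30
```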

R

Like Python, R is free to use, but it’s a GNU project – rather than open source, it’s free software.

So if Python is open source and R is free software, what’s the difference? Both make the source code available; open source emphasizes the practical benefits of shared development, while free software stresses the user’s freedom to run, study, modify, and redistribute the software.

SQL

The SQL language is subdivided into several language elements, including clauses, expressions, predicates, queries, and statements.
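The language elements above can be seen together in a single query. The following sketch uses Python’s built-in sqlite3 module with a made-up table: the whole statement is a query, SELECT/FROM/WHERE/ORDER BY are clauses, `salary * 12` is an expression, and `salary > 4000` is a predicate.

```python
import sqlite3

# In-memory database with a small, invented table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ada", 5000.0), ("Grace", 4500.0), ("Alan", 3900.0)],
)

# SELECT/FROM/WHERE/ORDER BY are clauses; salary * 12 is an
# expression; salary > 4000 is a predicate.
rows = conn.execute(
    "SELECT name, salary * 12 AS annual "
    "FROM employees "
    "WHERE salary > 4000 "
    "ORDER BY annual DESC"
).fetchall()

print(rows)  # [('Ada', 60000.0), ('Grace', 54000.0)]
conn.close()
```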

Java

Scala

C++

JavaScript

Julia


↥ back to top


Data Science Tools

Categories of Data Science Tools

Open Source Tools

Commercial Tools

Cloud Based Tools

Cloud products are a newer breed of tool; they follow the trend of integrating multiple tasks into a single environment.

Since operations and maintenance are handled by the cloud provider – as is the case with Watson Studio, Watson OpenScale, and Azure Machine Learning – this delivery model is a form of Platform or Software as a Service (PaaS or SaaS).


↥ back to top


Packages, APIs, Data Sets and Models

Packages

Python Libraries

Scala Libraries

R Libraries

APIs

An API is simply the interface. There are also multiple volunteer-developed APIs for TensorFlow – for example in Julia, MATLAB, R, Scala, and many more. REST APIs are another popular type of API.

They enable you to communicate through the internet, taking advantage of storage, greater data access, artificial intelligence algorithms, and many other resources. The “RE” stands for Representational, the “S” for State, and the “T” for Transfer. In REST APIs, your program is called the “client”; the API communicates with a web service that you call through the internet. A set of rules governs communication, input (request), and output (response).

HTTP methods are a way of transmitting data over the internet. We tell a REST API what to do by sending a request.

The request is usually communicated through an HTTP message. The HTTP message usually contains a JSON file, which contains instructions for the operation that we would like the service to perform. This operation is transmitted to the web service over the internet. The service performs the operation. Similarly, the web service returns a response through an HTTP message, where the information is usually returned using a JSON file.
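The request/response exchange described above can be sketched with Python’s standard library. The endpoint URL and JSON fields below are hypothetical, and no request is actually sent – the code only builds the HTTP message a REST client would transmit.

```python
import json
import urllib.request

# Hypothetical endpoint for illustration only.
url = "https://api.example.com/v1/predict"

# The JSON body carries the instructions for the operation we want
# the web service to perform.
payload = json.dumps({"operation": "classify", "text": "hello"}).encode("utf-8")

# Build (but do not send) the HTTP message for this request.
request = urllib.request.Request(
    url,
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(request.get_method(), request.full_url)
```

Sending it with `urllib.request.urlopen(request)` would return the response, whose body is typically another JSON document.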

Data Sets

Where to find open data

Community Data License Agreement

The Data Asset eXchange

Models

Supervised Learning

Regression

Classification

Unsupervised Learning

Reinforcement Learning

Deep Learning Models

The Model Asset Exchange


↥ back to top


RStudio IDE

Popular R Libraries for Data Science

library(ggplot2)
ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  ggtitle("Miles per gallon vs weight") +
  labs(x = "Miles per gallon", y = "Weight")

GGally is an extension of ggplot2.

library(datasets)
data(iris)
library(GGally)
ggpairs(iris, mapping=ggplot2::aes(colour = Species))

Git/GitHub

Basic Git Commands
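A minimal local workflow, as a sketch (the repository name, file, and identity values are placeholders):

```shell
mkdir demo-repo && cd demo-repo
git init                           # start a new local repository

git config user.name  "Your Name"  # identify yourself (placeholder values)
git config user.email "you@example.com"

echo "# Demo" > README.md
git status                         # show untracked/modified files
git add README.md                  # stage the file
git commit -m "Initial commit"     # record the staged snapshot

git log --oneline                  # view commit history
git branch feature                 # create a branch
git checkout feature               # switch to it
```

Commands such as `git clone`, `git push`, and `git pull` additionally involve a remote repository, for example one hosted on GitHub.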

To learn more, visit https://try.github.io/

Watson Studio

IBM Watson Knowledge Catalog

The catalog contains only metadata. The data itself can live in on-premises repositories, in other IBM Cloud services such as Cloudant or Db2 on Cloud, or in non-IBM cloud services such as Amazon or Azure.

Included in the metadata is how to access the data asset. In other words, the location and credentials. That means that anyone who is a member of the catalog and has sufficient permissions can get to the data without knowing the credentials or having to create their own connection to the data.

Data Refinery

Simplifying Data Preparation

Which features of Data Refinery help save hours and days of data preparation?

Modeler flows

XGBoost is a very popular model representing a gradient-boosted ensemble of decision trees. The algorithm was developed relatively recently and has been used in many solutions and winning data science competitions. In this case it produced the model with the highest accuracy, so it “won” here as well. “C&RT” stands for “Classification and Regression Tree,” a widely used decision tree algorithm – the same decision tree we saw earlier when we built it separately. “LE” is “Linear Engine,” an IBM implementation of a linear regression model that includes automatic interaction detection.

IBM SPSS Modeler and Watson Studio Modeler flows allow you to graphically create a stream or flow that includes data transformation steps and machine learning models. Such sequences of steps are called data pipelines or ML pipelines.
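The pipeline idea itself is simple: an ordered list of named steps, each applied to the output of the previous one. The following library-free Python sketch illustrates the concept (the step names, functions, and data are all invented; this is not the Modeler flows API):

```python
def drop_missing(rows):
    """Transformation step: remove records with missing values."""
    return [r for r in rows if None not in r]

def scale(rows):
    """Transformation step: divide every value by 10."""
    return [[v / 10 for v in r] for r in rows]

def mean_model(rows):
    """'Model' step: predict the overall mean of the cleaned data."""
    values = [v for r in rows for v in r]
    return sum(values) / len(values)

# The pipeline: named steps executed in order, like nodes in a flow.
pipeline = [("clean", drop_missing), ("scale", scale), ("model", mean_model)]

data = [[10, 20], [None, 30], [40, 50]]
result = data
for name, step in pipeline:
    result = step(result)

print(result)  # 3.0
```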

AutoAI

AutoAI provides automatic finding of optimal data preparation steps, model selection, and hyperparameter optimization.
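One of the simplest forms of hyperparameter optimization is an exhaustive grid search. The sketch below uses an invented quadratic “validation error” in place of actually training a model, purely to show the search mechanics (AutoAI itself uses more sophisticated strategies):

```python
from itertools import product

# Toy stand-in for "train a model and measure validation error":
# a quadratic whose minimum is at depth=3, rate=0.1.
def validation_error(depth, rate):
    return (depth - 3) ** 2 + (rate - 0.1) ** 2

# The grid of candidate hyperparameter values to try.
grid = {"depth": [1, 2, 3, 4], "rate": [0.01, 0.1, 1.0]}

# Evaluate every combination and keep the best one.
best = min(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=lambda params: validation_error(**params),
)
print(best)  # {'depth': 3, 'rate': 0.1}
```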

Model Deployment

Watson OpenScale

Insurance underwriters can use machine learning and OpenScale to more consistently and accurately assess claims risk, ensure fair outcomes for customers, and explain AI recommendations for regulatory and business intelligence purposes.

Before an AI model is put into production, it must prove it can make accurate predictions on test data, held out from the data used to train it. Over time, however, production data can begin to look different from training data, causing the model to make less accurate predictions. This is called drift.

IBM Watson OpenScale monitors a model’s accuracy on production data and compares it to accuracy on its training data. When the difference in accuracy exceeds a chosen threshold, OpenScale generates an alert. Watson OpenScale reveals which transactions caused drift and identifies the top transaction features responsible.

The transactions causing drift can be sent for manual labeling and used to retrain the model so that its predictive accuracy does not drop at run time.
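The core of the drift check described above can be sketched in a few lines: compare accuracy on production data with accuracy at training time, and raise an alert when the gap exceeds a chosen threshold (the function name, accuracies, and threshold below are all invented for illustration):

```python
def drift_alert(train_accuracy, production_accuracy, threshold=0.05):
    """Return True when accuracy has dropped by more than `threshold`."""
    return (train_accuracy - production_accuracy) > threshold

print(drift_alert(0.92, 0.90))  # small dip: no alert -> False
print(drift_alert(0.92, 0.80))  # large drop: alert  -> True
```

A monitoring service would evaluate this check on a rolling window of scored production transactions rather than on two fixed numbers.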


↥ back to top