Data Science Toolbox

Start doing data science in minutes

As a data scientist, you don't want to waste your time installing software. Our goal is to provide a virtual environment that will enable you to start doing data science in a matter of minutes.

As a teacher, author, or organization, making sure that your students, readers, or members have the same software installed is not straightforward. This open source project will enable you to easily create custom software and data bundles for the Data Science Toolbox.



A virtual environment for data science

The Data Science Toolbox is a virtual environment based on Ubuntu Linux that is specifically suited for doing data science. Its purpose is to get you started in a matter of minutes. You can run the Data Science Toolbox either locally (using VirtualBox and Vagrant) or in the cloud (using Amazon Web Services).

We aim to offer a virtual environment that contains the software that is most commonly used for data science while keeping it as lean as possible. After a fresh install, the Data Science Toolbox contains the following software:

Let us know if you want to see something added to the Data Science Toolbox.


Additional software and data bundles

The Data Science Toolbox has support for so-called bundles. A bundle is a collection of software or data that is specific to a certain book, course, or project. (In case you're interested, a bundle is essentially an Ansible playbook.) Once you're logged in to your Data Science Toolbox, you can install bundles with the dst command-line tool (we are currently developing a user-friendly web interface as well):

vagrant@data-science-toolbox:~$ dst add dsatcl

The bundle dsatcl belongs to the upcoming book "Data Science at the Command Line" by Jeroen Janssens. It contains, among other things, the command-line tools discussed in this blog post. Because the Data Science Toolbox is still very young, this is the only bundle currently available. If you're a teacher, author, or organization, and you're interested in creating a software or data bundle for your class, book, or project, let us know, and we will help you out.


Getting started with Data Science Toolbox 0.1.5

There are two ways to run the Data Science Toolbox: (1) locally using VirtualBox and Vagrant and (2) in the cloud using Amazon Web Services. Both ways result in exactly the same environment. The installation steps for each are given below, starting with the local setup.

Because the local version of the Data Science Toolbox runs on top of VirtualBox and Vagrant, it can be installed on Linux, Mac OS X, and Microsoft Windows.

Step 1: Download and install VirtualBox

Go to the VirtualBox download page and download the appropriate binary. Open the binary and follow the installation instructions.

Step 2: Download and install Vagrant

As in Step 1, go to the Vagrant download page and download the appropriate binary. Open the binary and follow the installation instructions.

Step 3: Download and start the Data Science Toolbox

Open a terminal (known as the command prompt in Microsoft Windows). Create a directory, for example "MyDataScienceToolbox", and navigate to it:

$ mkdir MyDataScienceToolbox
$ cd MyDataScienceToolbox

In order to download and start the Data Science Toolbox, run the following commands:

$ vagrant init data-science-toolbox/dst
$ vagrant up
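Before running these commands, it can help to confirm that both prerequisites are actually on your PATH. The following is a minimal sketch and not part of the Data Science Toolbox itself: missing_tools is a hypothetical helper, and it assumes that the VirtualBox installer made the VBoxManage command available on your PATH, which may not be the case on every system.

```shell
# Print the name of every given command that cannot be found on the PATH.
# (missing_tools is a hypothetical helper, not part of Vagrant or the toolbox.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running missing_tools VBoxManage vagrant prints nothing when both tools are found. Any name it does print points back to Step 1 or Step 2.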

Step 4: Log in (on Mac OS X and Linux)

If you are running Mac OS X or some other UNIX-like operating system, you can log in to the Data Science Toolbox by simply running the following command in a terminal:

$ vagrant ssh

Step 4: Log in (on Microsoft Windows)

If you are running Microsoft Windows, you need to use a third-party application in order to log in to the Data Science Toolbox. We recommend PuTTY for this. Go to its download page and download putty.exe. Run putty.exe and enter the following values:

Host Name (or IP address): 127.0.0.1
Port: 2222
Connection type: SSH

(If you want, you can save these values as a session by clicking the "Save" button, so that you do not need to enter these values again.) Click the "Open" button and enter "vagrant" for both the username and the password.

Step 5: Set up IPython Notebook (optional)

If you would like to run IPython Notebook on your Data Science Toolbox, invoke the following command to create a password-protected profile:

vagrant@data-science-toolbox:~$ dst setup base

(Note that vagrant@data-science-toolbox:~ indicates that this command should be run on the Data Science Toolbox.) Step 3 created a file named Vagrantfile, which is a configuration file used by Vagrant. Open the file in your favorite text editor and add the following text somewhere around line 22:

config.vm.network "forwarded_port", guest: 8888, host: 8888

This line instructs Vagrant to open up port 8888 so that the IPython Notebook server is accessible from your browser. Restart the Data Science Toolbox and log in again so that the changes take effect:

$ vagrant reload
$ vagrant ssh
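If you prefer not to edit the Vagrantfile by hand, the change can also be scripted and run before vagrant reload. The sketch below is an illustration only: add_port_forward is a hypothetical helper, and it assumes that the Vagrantfile is the one created in Step 3 and that its configuration block closes with a final line containing just "end". Instead of aiming for line 22, it places the setting directly before that closing line, and it leaves the file untouched if an uncommented setting for port 8888 is already present.

```shell
# Add the port 8888 forwarding rule to a Vagrantfile, just before the
# closing "end" of the configuration block.
# (add_port_forward is a hypothetical helper; $1 is the path to the Vagrantfile.)
add_port_forward() {
  vagrantfile=$1
  line='  config.vm.network "forwarded_port", guest: 8888, host: 8888'

  # Nothing to do if an uncommented rule for guest port 8888 already exists.
  if grep -q '^[[:space:]]*config\.vm\.network "forwarded_port", guest: 8888' "$vagrantfile"; then
    return 0
  fi

  # Line number of the last line that consists of "end" only.
  last_end=$(grep -n '^end[[:space:]]*$' "$vagrantfile" | tail -n 1 | cut -d: -f1)
  if [ -z "$last_end" ]; then
    echo "no closing 'end' found in $vagrantfile" >&2
    return 1
  fi

  # Print the new rule right before that line, then replace the file.
  awk -v n="$last_end" -v l="$line" 'NR == n { print l } { print }' \
    "$vagrantfile" > "$vagrantfile.tmp" && mv "$vagrantfile.tmp" "$vagrantfile"
}
```

From the directory created in Step 3, this would be called as add_port_forward Vagrantfile. Calling it a second time changes nothing.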

To start the IPython Notebook server, run:

vagrant@data-science-toolbox:~$ sudo ipython notebook --profile=dst

You can now access the IPython Notebook server at https://localhost:8888. Because the SSL certificate is self-signed, you may get a warning message from your browser. The image below shows how Chrome complains about this. Because you know what's on the server side, you can just click the "Proceed anyway" button.

Enter the same password as you entered when you created the profile.

Once you're logged in, you're greeted with the wonderful IPython Notebook.
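If the browser cannot reach the notebook, it may help to check from a terminal on your own machine whether anything answers on port 8888. The following is a rough sketch that assumes curl is installed; notebook_url and notebook_status are hypothetical helpers, not part of the Data Science Toolbox.

```shell
# Build the notebook address for a host; the port defaults to 8888.
# (notebook_url and notebook_status are hypothetical helpers.)
notebook_url() {
  printf 'https://%s:%s\n' "$1" "${2:-8888}"
}

# Print the HTTP status code that the server returns.
# -k accepts the self-signed certificate, -s silences progress output,
# and --max-time gives up after five seconds.
notebook_status() {
  curl -k -s --max-time 5 -o /dev/null -w '%{http_code}\n' "$(notebook_url "$@")"
}
```

For the local setup, run notebook_status localhost. A status of 000 usually means that no HTTP response was received at all, in which case the port forwarding and the notebook server are the first things to check.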

Step 6: Install additional software packages and bundles (optional)

It's unlikely that the Data Science Toolbox contains all the software you need for your data science project. Fortunately, you can always use apt-get and pip to install individual Ubuntu and Python packages, respectively. For example:

vagrant@data-science-toolbox:~$ sudo apt-get install cowsay
vagrant@data-science-toolbox:~$ sudo pip install networkx

Moreover, R packages can be installed from within R:

vagrant@data-science-toolbox:~$ R
> install.packages('stringr')

In the near future, we hope to have more software and data bundles. The bundle dsatcl, which belongs to the upcoming book "Data Science at the Command Line", is installed as follows:

vagrant@data-science-toolbox:~$ dst add dsatcl

Step 7: Start doing data science

Congratulations, you now have your own Data Science Toolbox. Enjoy! Let us know if you have any questions, comments, or feedback.

Amazon offers a free usage tier, which allows you to run the Data Science Toolbox in the cloud for 750 hours for free.

Step 1: Select region and open AWS launch wizard

Find the region below that is closest to you. Click the "Launch" button to open the AWS launch wizard in a new window.

Region Name                     Region Code      AMI            Launch
Asia Pacific (Singapore)        ap-southeast-1   ami-c2e4b590   Launch
Asia Pacific (Sydney)           ap-southeast-2   ami-bf62fb85   Launch
Asia Pacific (Tokyo)            ap-northeast-1   ami-37e39336   Launch
EU (Ireland)                    eu-west-1        ami-6314e814   Launch
South America (Sao Paulo)       sa-east-1        ami-db02a1c6   Launch
US East (Northern Virginia)     us-east-1        ami-d1737bb8   Launch
US West (Northern California)   us-west-1        ami-46af9003   Launch
US West (Oregon)                us-west-2        ami-8aa1cfba   Launch
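If you script your AWS setup rather than clicking through the wizard, the table above can be captured as a small lookup. This is a minimal sketch: dst_ami is a hypothetical helper, and the AMI IDs are simply copied from the table, so they apply to this release only and may differ for later ones.

```shell
# Look up the Data Science Toolbox AMI for an AWS region code.
# The IDs are copied from the table above; dst_ami is a hypothetical helper.
dst_ami() {
  case "$1" in
    ap-southeast-1) echo ami-c2e4b590 ;;
    ap-southeast-2) echo ami-bf62fb85 ;;
    ap-northeast-1) echo ami-37e39336 ;;
    eu-west-1)      echo ami-6314e814 ;;
    sa-east-1)      echo ami-db02a1c6 ;;
    us-east-1)      echo ami-d1737bb8 ;;
    us-west-1)      echo ami-46af9003 ;;
    us-west-2)      echo ami-8aa1cfba ;;
    *)              echo "unknown region: $1" >&2; return 1 ;;
  esac
}
```

For example, dst_ami eu-west-1 prints the AMI listed for EU (Ireland), and an unlisted region code makes the function fail with a message on standard error.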

Step 2: Configure EC2 instance

In order to launch an EC2 instance, you need to be logged in to AWS. If you do not yet have an AWS account, select "I am a new user" and click the "Sign in" button.

Once you're logged in to AWS, you can select the type of EC2 instance you want to run. Only the t1.micro type is eligible for the free usage tier.

Choose your preferred instance type and press the "Next" button. You may safely ignore the settings on the next two screens.

Giving your instance a name is useful for when you are running multiple instances, but it is not required.

The settings on the next screen ("Configure Security Group") determine through which ports you can access your Data Science Toolbox. Port 22 is open by default, which allows you to log in. If you would like to be able to use IPython Notebook, you need to click the "Add rule" button and add a "custom TCP rule" for port "8888" and source "Anywhere". The result should look like the screenshot below. These settings cannot be changed once the EC2 instance is running.

You can now review your settings and click the "Launch" button. Both the "dst" version and the "ami" ID shown at the top may be different.

Step 3: Create key pair

In order to log in to the EC2 instance, you need to have an AWS key pair. A screen will pop up where you can either use an existing key pair (if you already have one), or create a new one.

Give the key pair a name and press the "Download Key Pair" button. Remember the location where you save the file. If everything went well, you will see something like the following screen. Press the "View Instances" button.

You now see an overview of all your EC2 instances. It will take a few moments before your Data Science Toolbox is assigned a public DNS.

Step 4: Log in (on Mac OS X and Linux)

If you are running Mac OS X or some other UNIX-like operating system, you can log in to the Data Science Toolbox from the terminal. First, you need to make sure that the permissions on the key pair file you downloaded earlier are not too open:

$ chmod 400 MyKeyPair.pem

You can now log in using the following command:

$ ssh -i MyKeyPair.pem ubuntu@ec2-54-85-70-149.compute-1.amazonaws.com

The username is always "ubuntu". The hostname (the part after the "@") should be the public DNS of the EC2 instance. You can copy this from the Instances overview shown in the previous screenshot.
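The two commands above can be combined into one small routine that only tightens the permissions when needed. This is a sketch under stated assumptions: key_is_private and dst_login are hypothetical helpers, and the check simply reads the permission string printed by ls, treating a key as private when neither group nor others have any access to it.

```shell
# Succeed only if the file grants no permissions to group or others.
# (key_is_private and dst_login are hypothetical helpers.)
key_is_private() {
  [ "$(ls -ld "$1" | cut -c5-10)" = "------" ]
}

# Tighten the key's permissions if needed, then log in as "ubuntu".
# $1 is the key pair file, $2 is the public DNS of the EC2 instance.
dst_login() {
  key_is_private "$1" || chmod 400 "$1"
  ssh -i "$1" "ubuntu@$2"
}
```

With the key pair file from Step 3, this would be called as dst_login MyKeyPair.pem followed by the public DNS of your instance.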

Step 4: Log in (on Microsoft Windows)

Please follow these instructions if you're running Microsoft Windows.

Step 5: Set up IPython Notebook (optional)

If you would like to run IPython Notebook on your Data Science Toolbox, invoke the following command to create a password-protected profile:

ubuntu@ip-172-31-26-198:~$ dst setup base

(Note that ubuntu@ip-172-31-26-198:~ indicates that this command should be run on the Data Science Toolbox. Your IP address may be different.) To start the IPython Notebook server, run:

ubuntu@ip-172-31-26-198:~$ sudo ipython notebook --profile=dst

You can now access the IPython Notebook server at https://<public dns>:8888. Because the SSL certificate is self-signed, you may get a warning message from your browser. The image below shows how Chrome complains about this. Because you know what's on the server side, you can just click the "Proceed anyway" button.

Enter the same password as you entered when you created the profile.

Once you're logged in, you're greeted with the wonderful IPython Notebook.

Step 6: Install additional software packages and bundles (optional)

It's unlikely that the Data Science Toolbox contains all the software you need for your data science project. Fortunately, you can always use apt-get and pip to install individual Ubuntu and Python packages, respectively. For example:

ubuntu@ip-172-31-26-198:~$ sudo apt-get install cowsay
ubuntu@ip-172-31-26-198:~$ sudo pip install networkx

Moreover, R packages can be installed from within R:

ubuntu@ip-172-31-26-198:~$ R
> install.packages('stringr')

In the near future, we hope to have more software and data bundles. The bundle dsatcl, which belongs to the upcoming book "Data Science at the Command Line", is installed as follows:

ubuntu@ip-172-31-26-198:~$ dst add dsatcl

Step 7: Start doing data science

Congratulations, you now have your own Data Science Toolbox. Enjoy! Let us know if you have any questions, comments, or feedback.


Standing on the shoulders of giants

The Data Science Toolbox would not have been possible without these wonderful platforms and software.