Getting started with Amazon (AWS) & Google (GCE): GPU-powered deep learning

Step-by-step tutorial for setting up a remote instance to create and train a deep learning network using Tensorflow on two major cloud service providers - Amazon Web Services (AWS) & Google Compute Engine (GCE).

Implementing deep learning models has never been easier thanks to the release of high-level frameworks such as Tensorflow, Keras, PyTorch and many others. With such frameworks, anyone can create and run a deep learning model in a matter of minutes, even with little or no previous experience with the underlying theory.

However, if you have ever tried to get more serious about deep learning and implement larger networks to solve more complex tasks, you might have noticed that this simplicity is counterbalanced by the need for expensive computational power that is often beyond what most people have in their personal computers. This is actually one of the main reasons many people drop out at the very beginning of their deep learning adventure.

Fortunately, many companies nowadays offer cloud services for running deep learning algorithms on servers equipped with a selection of high-performance GPUs.

In this tutorial, I will walk you through the steps for setting up a remote instance to create and train a deep learning network using Tensorflow on two major cloud service providers: Amazon Web Services (AWS) and Google Compute Engine (GCE).

For the sake of this tutorial, we will run an image classification program on the CIFAR10 dataset using Tensorflow. You can find the code we will use in this GitHub repo. It contains the following main files:

  • train.py: This script downloads the CIFAR10 dataset. It also creates and trains a convolutional neural network using Tensorflow.
  • test.py: This script computes the classification accuracy of the trained network on the test set.
  • helper.py: This file contains some utility functions.
  • conda_env.yml: This is a conda environment file that contains all the necessary Python packages to run the train and test scripts.

This code is adapted from my work on the classification project of the Udacity Deep Learning Foundation Nanodegree. Since this tutorial is about using cloud services, I won't be explaining the classification code that we are going to run on the cloud instances.

 

Amazon Web Services (AWS)

Amazon's AWS is the leading cloud service provider in the market. It provides cloud compute services through Elastic Compute Cloud (EC2) instances, which are basically virtual machines in the cloud. In order to use these instances, you first need to create an account. Let's jump in.

Create an AWS account

Go to the Amazon AWS homepage https://aws.amazon.com and click on Create a Free Account as shown in the figure below.

Screenshot: AWS Homepage

This will lead you through a straightforward series of sign-up steps. At one point in the process, you will need to provide the following information:

  • Valid credit/debit card payment information.
  • A phone number to verify your identity.
  • Choice of a support plan: choose Basic.

When you finish the sign-up process successfully, you will see the following message:

Screenshot: AWS signup confirmation

Click on Launch Management Console to go to the AWS Management Console. Here is what it should look like. Also, do not forget to check your inbox for any registration confirmation emails you might receive.

Screenshot: AWS Management Console

Choose your region

EC2 instances are available in multiple geographic regions across the world. Each AWS account should be associated with one such region. For instance, since I am in France, I chose the closest region to me, which is EU (Ireland). Go ahead and choose the closest region to you in the dropdown menu at the top-right corner.

Screenshot: choose region

Go to the EC2 dashboard

In order to open the EC2 dashboard, where you will be able to create and manage your virtual instances, hover over the Services dropdown menu at the top-left corner of the Management Console page and click on EC2 under the Compute section. You might end up on a page informing you that your AWS services will be activated within 24 hours, in which case you will have to try again later.

Screenshot: AWS Services

If you close your browser, you can access your Management Console again by going to aws.amazon.com, hovering over the My Account tab, choosing AWS Management Console and signing in. If the dashboard is activated, you will see the following page after clicking on EC2:

Screenshot: EC2 Dashboard

Increase your instance limit

For our demonstration, we are going to use the g2.2xlarge instance type. This gives you access to 1 GPU, 8 vCPUs and 15 GB of memory per instance. However, if this is the first time you use a g2.2xlarge, you will most likely have no available instances by default; in other words, a 0 instance limit. So you need to increase your limit for this instance type.

To do that, go to the vertical pane at the left of the EC2 dashboard and click on Limits. Then find the line corresponding to g2.2xlarge instances and click on Request a limit increase.

Screenshot: request limit increase

This will open up a new page with a form that you need to fill with the following data:

  • *Region:* Choose the closest region that provides the g2.2xlarge instance. I chose EU (Ireland).
  • *Primary Instance Type:* g2.2xlarge.
  • *Limit:* Instance Limit.
  • *New Limit Value*: 1.
  • *Use Case Description:* EC2 for image classification tutorial.
  • *Contact Method*: Web (for email communication) or Phone.

Now click on the Submit button. This will create a new case, that is, a limit increase request that will be sent to the support team. Check your email, as the support team will contact you in the coming hours to inform you about the status of your request. The whole process until the limit increase takes effect might take from 24 to 48 hours, so you will need to wait again!

NOTE: Not all instances are available in all regions. If you do not find the g2.2xlarge instance in your region, change your region to the closest one that provides g2.2xlarge. Another option is to stay in your region and choose another instance type. The hourly fee differs among regions and instance types, so you need to be aware of that. Check the EC2 on-demand instance pricing page.

What I recommend at this point is to check the Limits page from time to time and see whether the 0 limit next to the g2.2xlarge instance has turned to 1 or more. In some cases, the limit might be increased before you get the email confirmation. It took about 48 hours for me to get my limit increased. Once you get yours increased, you will be ready to create the instance.

Creating an instance

In the left pane, click on EC2 Dashboard and then click on the Launch Instance button.

Screenshot: Create instance

The first step will be to choose an Amazon Machine Image (AMI). You can think of this as the software of your virtual machine. This includes an operating system and other pre-installed software packages that you will need in your project.

Choose AWS Marketplace on the left, and type deep learning ami ubuntu version in the search box. In the results, click on the Select button next to the deep learning ami ubuntu version AMI, as in the image below.

Screenshot: choose ami

In the next step, you need to choose an instance type. Select the checkbox corresponding to the g2.2xlarge instance and click on the Review and Launch button.

In the review page, you will be able to edit some features such as instance type, storage, security groups, etc. For what we need, we will only change the security groups in order to be able to use Jupyter Notebooks on port 8888. Click on Edit security groups.

  • Select Create a new security group.
  • Choose a security group name. For instance Jupyter notebooks.
  • Write whatever description you wish.
  • Click on the Add Rule button.
  • Choose Custom TCP rule.
  • Port range: 8888. This is the default port for Jupyter notebooks.
  • Source: Anywhere.

Screenshot: Security groups

Click on Review and Launch when you finish. Then click Launch in the next page.

You might be prompted to create a public/private key pair which will be useful to connect to your instance via SSH. Go ahead and create one:

  • Choose Create a new key pair in the dropdown.
  • Choose a name. For example 'AWS deep learning tutorial'.
  • Do not forget to download and save the .pem file in a secure place, by clicking on the Download Key Pair button.
  • Click on the Launch Instance button.

NOTE: If you are using Linux, you should change the access permissions of the .pem file so that it is readable only by you. If you do not do this, the Amazon server might refuse the SSH connection to the instance. To change the permissions, execute the following command in the terminal:

$ sudo chmod 400 path/to/key_file.pem

Screenshot: Keypair

If everything goes well, you should see a new page with a message on top telling you that your instance is launching. Click on the View Instances button at the bottom right corner of that page. You will see the status of your instance and its public IP address:

Screenshot: Instances

If you see '2/2 checks passed' under 'Status Checks', you are ready to use the instance. Notice that you can access this page at any time by clicking on the Instances tab at the left of your Management Console.

VERY IMPORTANT: Stop your instance when you are not using it, or terminate it when you no longer need your EC2 instance at all. Otherwise you will be charged usage fees, which can grow fast. You can do this by right-clicking on your instance's info bar, choosing Instance State, then clicking on Stop, or Terminate to delete the instance.

Screenshot: Stop instance
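By the way, if you have the AWS CLI installed and configured on your local machine, you can also stop or terminate an instance from the terminal. This is only a sketch; the instance ID below is a placeholder that you should replace with the one shown on your Instances page:

$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0

$ aws ec2 terminate-instances --instance-ids i-0123456789abcdef0

The first command stops the instance (you can start it again later), while the second deletes it for good.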

Connect to your instance

First, we are going to connect to our instance using SSH. In your terminal, type the following command:

$ ssh -i path/to/key_file.pem ubuntu@<ip address>

This will connect you to the Ubuntu OS on your instance, where you will be ready to run any command.

Screenshot ssh

Run the image classification code

In the instance terminal, go ahead and clone the Github repository containing the image classification code I prepared for this tutorial:

$ git clone https://github.com/ala-aboudib/tutorial_cifar10_classification.git

Then enter the project directory:

$ cd tutorial_cifar10_classification

Now you can run train.py, which will create and train the neural network with Tensorflow:

$ python3 train.py

Screenshot: Training

And finally, you can run the test script to check out how the network generalizes on the test set:

$ python3 test.py

 

Using Jupyter Notebooks

Before we can use Jupyter notebook, we should create a configuration file that allows Jupyter to listen on the external IP of the instance. Go ahead and create the configuration file by running the following command in the instance's terminal:

$ jupyter notebook --generate-config

This command will create a configuration file in the path ~/.jupyter/jupyter_notebook_config.py. Open that configuration file and replace the line:

#c.NotebookApp.ip = 'localhost'

with

c.NotebookApp.ip = '*'

Save and close the file, and run Jupyter Notebook:

$ jupyter notebook

This will output an address you can use to connect to Jupyter from your web browser. Copy that address into your web browser and replace 'localhost' with the instance's external IP.

Screenshot: AWS start jupyter
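As an alternative to opening port 8888 to the whole Internet in the security group, you can leave Jupyter listening on 'localhost' and forward the port over SSH instead. A minimal sketch, reusing the same key file and IP address as before:

$ ssh -i path/to/key_file.pem -L 8888:localhost:8888 ubuntu@<ip address>

With this tunnel open, start jupyter notebook on the instance and point your local browser at http://localhost:8888, using the token printed in the instance's terminal.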

 

Google Compute Engine (GCE)

The first thing you need to be able to use GCE is a Gmail account. If you do not have one, go ahead and create one for free by following this link.

Create a GCE account

If you have a Gmail account, go to cloud.google.com/compute and click on the TRY IT FREE button as shown in the figure. This will lead you to a login page where you need to enter your Gmail address and your password.

Screenshot: GCE start trial

After logging in, you will be redirected to a form asking you to enter your country and to agree to the terms of service. If you do, go ahead and click on the Agree and continue button.

Screenshot: GCP Terms

As mentioned in the box at the right-hand side of the above figure, you will be granted $300 of credit for free. However, you will be asked to enter your credit card information in order to verify that you are not a robot.

NOTE: If you are in Europe, you will only be able to use GCP if you are a business. GCE is not yet available in Europe for individuals with no commercial goal. Check the following screenshot from the support page of Google Cloud Platform.

Screenshot Europe

On the following page, you will be asked to enter some personal information:

  • User type: select Individual, or Business if you are one.
  • Address.
  • Contact information.
  • Payment method.

Fill in and submit the form. If everything goes well, you will be redirected to your GCE dashboard.

Screenshot: GCE Account creation

Create a GCE project

The first thing to do is to create a GCE project. In your dashboard, click on the Create button highlighted in the figure above. This will lead you to the project creation form shown below:

Screenshot: GCE create project

I called the project 'classification tutorial'. You can pick whatever name you wish and click on Create. Here is how your dashboard will look after the project is created.

Screenshot: GCE Dashboard

 

Request a GPU quota increase

Before creating a GCE instance, you will need to request a quota increase in order to use GPUs. In the menu on the left, hover over Compute Engine and then click on Quotas in the dropdown menu that appears.

Screenshot: Quotas

You will be prompted to enable billing before you can use Google compute services. This is necessary since GCE is a paid service.

Screenshot: VM instances

Click on Enable Billing. This will take a few minutes. When it is done, click on the menu symbol at the top-left corner of the page to make the main menu appear, then hover over IAM & Admin and click on Quotas in the dropdown that appears.

Screenshot: IAM admin

This will take you to a page with a long list of resources and services. Look for the NVIDIA K80 GPUs entry under one of the Google Compute Engine API list items. To find that entry more easily, I used the Service and Metric dropdown menus to filter the values in the list as shown in the figure. Select the entry for 'NVIDIA K80 GPUs' in the zone 'europe-west1' using the checkbox on the left and click on Edit Quotas.

Screenshot: Edit Quotas

This will open a panel on the right-hand side of the page asking you to enter your name, email and phone number. After you enter those, click Next. Type '1' in the 'change to' text field. This is the number of GPUs we are going to use. In the Justification text field, enter 'Classify images using a deep neural network'. Click on Submit Request.

Screenshot: Request

This might take up to 48 hours to get approved, so you will have to wait until you get informed of your quota increase via email.

 

Create an instance

In the left-side menu, go to Compute Engine > VM Instances, and click on Create on the page that follows:

Screenshot: create vm instance

A form will appear that allows you to specify options for creating your instance:

  • In Name type 'classification-tutorial'.
  • For Zone choose 'europe-west1-b' or a zone closer to you. Notice that GPUs might not be available in all zones, so you will need to choose one with an available GPU instance.
  • In Machine Type, adjust the sliders to select 2 CPU cores and 8 GB of memory. After that, click on GPUs. Then choose 'NVIDIA Tesla K80' in GPU Type and set the number of GPUs to '1'.

Screenshot: VM create step 1

  • In Boot Disk, click on Change. In OS Image, choose whichever OS you are comfortable with. I chose 'Debian GNU/Linux 9 (stretch)'. In Boot disk type, choose 'Standard persistent disk'. You can also choose an SSD disk, which is faster but more expensive. Finally, change the disk size to 20 GB and click Select.

Screenshot: GCE boot disk

  • Click on Management, disks, networking, SSH keys and select the Networking tab. Type 'jupyter-notebook' or any other name you wish in the Network tags field. We will use this later to create a firewall rule to allow the use of Jupyter notebooks.

Screenshot: network tag

  • Finally, click on Create at the bottom of the form. This will create and launch the instance for you. You should see the following page:

Screenshot: instance created
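For reference, roughly the same instance can also be created from a terminal with the gcloud tool (installed later in the 'Connect to your instance' section). This is only a sketch under a few assumptions: it uses the predefined n1-standard-2 machine type (2 vCPUs, 7.5 GB) instead of the custom 2-core/8 GB machine configured in the form, and it assumes your GPU quota in 'europe-west1' has already been approved:

$ gcloud compute instances create classification-tutorial \
      --zone europe-west1-b --machine-type n1-standard-2 \
      --accelerator type=nvidia-tesla-k80,count=1 --maintenance-policy TERMINATE \
      --image-family debian-9 --image-project debian-cloud \
      --boot-disk-size 20GB --tags jupyter-notebook

The --maintenance-policy TERMINATE flag is needed because GPU instances cannot be live-migrated during host maintenance.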

Create a firewall rule

In order to use Jupyter notebooks, we should create a firewall rule to allow access to port 8888. In the left-side menu, click on VPC network > Firewall rules, then click on CREATE FIREWALL RULE. This will open the following form:

Screenshot: GCE firewall rule

  • In Name type 'jupyter-rule'.
  • In Target tags type 'jupyter-notebook'. This is the same tag we set up when we created the VM instance above.
  • In Source IP ranges type '0.0.0.0/0'. This would allow access to your VM instance from any network.
  • In Protocols and ports choose 'Specified protocols and ports', and type 'tcp:8888' in the corresponding text field.
  • Click Create.
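If you prefer the command line, the same rule can be created with gcloud once the SDK is set up (the SDK installation is covered in the next section). A sketch equivalent to the form above:

$ gcloud compute firewall-rules create jupyter-rule --allow tcp:8888 --source-ranges 0.0.0.0/0 --target-tags jupyter-notebook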

 

Connect to your instance

You first need to download the Google Cloud SDK on your local machine. This is necessary in order to be able to use the gcloud command that will allow you to interact with your GCE instance. Check out this page for instructions on how to install the SDK depending on your OS type.

Once you get gcloud installed, you need to authenticate to your GCE account by typing the following in a terminal on your local machine:

$ gcloud auth login

This will open your browser and ask you to authenticate and to allow gcloud to access your GCE account. Click on Allow.
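Optionally, you can also set a default project and zone so that later gcloud commands do not need explicit flags each time. The project ID below is a placeholder; use the ID shown in your GCE dashboard (it may differ from the project name):

$ gcloud config set project <your-project-id>

$ gcloud config set compute/zone europe-west1-b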

Next, go to the instance status bar, click on 'SSH' and choose View gcloud command.

Screenshot: GCE cloud command

Copy the command that shows in the pop-up. It should look something like this:

gcloud compute --project "classification-tutorial" ssh --zone "europe-west1-b" "classification-tutorial"

Paste it to your terminal and press Enter.

Congrats!! You are now connected to your GCE instance.

Screenshot

Preparing the instance

Before you can run the classification code on the instance, you need to install some software packages. We start by upgrading the OS and installing some necessary packages. Type the following in your instance shell (without the dollar sign):

$ sudo apt-get update

$ sudo apt-get upgrade

$ sudo apt-get install build-essential

The next step is to install the NVIDIA CUDA Toolkit. So in the same instance terminal, type the following commands, and wait for each one to execute successfully:

$ sudo apt-get install linux-headers-$(uname -r)

$ curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb

$ sudo dpkg -i cuda-repo-ubuntu1604_9.0.176-1_amd64.deb

$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub

$ sudo apt-get update

$ sudo apt-get install cuda

At this point, CUDA should be installed along with all the necessary NVIDIA drivers for the GPU. Check that the drivers are installed and the GPU is recognized by running:

$ nvidia-smi

You should see the following output. Notice that the GPU is listed as 'Tesla K80' with 11439 MB of memory.

Screenshot: nvidia smi

We are now ready to install Miniconda, which will provide us with the appropriate virtual environment and packages for running the image classification code with Tensorflow. Run the following commands to download and install Miniconda:

$ curl -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

$ bash Miniconda3-latest-Linux-x86_64.sh

$ source ~/.bashrc

You can now check your Miniconda installation by running the command:

$ conda env list

You should see one environment called root.

 

Run your Tensorflow code

Congrats, you are now ready to run the image classification code I prepared for you. If you have your own code that you wish to run, go ahead and give it a try.

First, clone the github repo:

$ git clone https://github.com/ala-aboudib/tutorial_cifar10_classification.git

Then change the working directory:

$ cd tutorial_cifar10_classification/

I provided a Conda environment file that includes all the packages you will need to run my code, such as Tensorflow, Numpy, etc. Go ahead and create the environment:

$ conda env create -f conda_env.yml

Now, let's check whether Tensorflow is properly configured to use the GPU. In order to do that, you first need to activate the tensorflow conda environment we have just created:

$ source activate tensorflow

Then, enter the Python shell:

$ python

and run the following Python commands (without the >>> shell prompt):

>>> import tensorflow as tf

>>> tf.test.gpu_device_name()

If you see /gpu:0 in the output, you are good to go. Exit the Python shell:

>>> exit()

Now run the script that creates and trains the network:

$ python train.py

Wait for the training to finish. This should take a few minutes. Then you can test the classification accuracy of the trained network on the test set by running:

$ python test.py

 

Using Jupyter notebooks

Before you can connect to a Jupyter notebook, you will first have to configure it properly. By default, Jupyter notebooks listen on 'localhost', which is not accessible from an external network. You will need to change that so Jupyter listens on the instance's external IP.

From inside the 'tensorflow' conda environment, run the following command:

$ jupyter notebook --generate-config

This will create a configuration file for Jupyter notebook at ~/.jupyter/jupyter_notebook_config.py. Open that file with your favorite text editor and find the line:

#c.NotebookApp.ip = 'localhost'

replace with:

c.NotebookApp.ip = '*'

Save and exit the file, and then execute:

$ jupyter notebook

This will output a few log lines. Copy the line indicating the address of the Jupyter notebook, as shown in the image below, and paste it into your browser after replacing the 'localhost' string with the external IP address of your instance. The result should look something like this:

http://104.155.9.168:8888/tree?token=d88cf07ff58919cc5a07b3710d8ad03a5cc3f1703f68397d

Of course, use your own instance's IP address and the token generated for you, as copied from your instance's terminal. Now press Enter. This will open the Jupyter Notebook platform, where you can create a notebook and write whatever Tensorflow code you want.

Screenshot: Jupyter Notebook
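If you would rather not expose port 8888 to the Internet through the firewall rule, an alternative is to leave Jupyter listening on 'localhost' and forward the port over SSH. gcloud passes anything after '--' to the underlying ssh command, so a sketch would be:

$ gcloud compute ssh classification-tutorial --zone europe-west1-b -- -L 8888:localhost:8888

You can then browse to http://localhost:8888 on your local machine and paste in the token printed in the instance's terminal.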

VERY IMPORTANT: Do not forget to stop your instance whenever you finish running your algorithm, otherwise you will keep being charged for GPU use. If you no longer need it at all, go ahead and delete it. This will avoid the minor charges that still apply even when the instance is stopped.
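You can stop or delete the instance from the web console, or from your local terminal with gcloud, assuming the instance name and zone used earlier in this tutorial:

$ gcloud compute instances stop classification-tutorial --zone europe-west1-b

$ gcloud compute instances delete classification-tutorial --zone europe-west1-b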

 

Conclusion

In this tutorial, I showed you how to use AWS and GCE cloud compute instances to run a deep learning network using Tensorflow. These are very useful services when you are learning about deep learning and have not yet decided whether to upgrade your personal PC with a high-end GPU that might cost hundreds or even thousands of dollars.

However, it is important to note that using AWS and GCE the way I presented here makes sense for occasional use, meaning a few dozen hours per month. If you run deep learning code more frequently, you might end up paying Google or Amazon more money than it would cost to upgrade your personal PC with a good GPU.

Another option, if you expect heavy GPU usage, is to opt for GCE Committed Use discounts or EC2 Reserved Instances. Both provide significant discounts for such use cases.

Now, you might be wondering which service is better and which one you should choose to run your deep learning model. Comparing these two services has become increasingly subtle and difficult due to the heated competition in the cloud compute market. Both provide very similar services and prices, with some advantages here and there for specific use cases. For an advanced, full-fledged comparison between AWS and GCE, you can check out this recent article.

For a simple use case like running a deep learning model from time to time, both services are equivalent in terms of quality and price. A good practice is to use the price calculators for GCE and AWS before creating your instances. This will give you an estimate of what you will end up paying.

For users in Europe, I would recommend AWS since GCE is only available for businesses in this zone. If you are not a business, you will still be able to run the instance, but Google advises against that as we saw in the GCE section of this tutorial.

Finally, AWS and GCE are not alone in the market; you might also wish to take a look at other providers such as Microsoft Azure, Oracle Cloud and FloydHub, which also provide great services.

 

