Google Cloud Vision vs. AWS Rekognition - an image classification tutorial

If you’ve reached this article because you are building an app that needs some computer vision functionality, or because you need to decide which cloud service suits you, then you are in the right place. In this article, we explore both Amazon’s Rekognition and Google Cloud Vision. We will cover:
  • Why would you need a cloud computer vision service?
  • Why did I choose to explore Amazon’s and Google’s services?
  • Explore and compare both services in some depth


Why would you need a cloud computer vision service?

This is the most important question you need to ask yourself. Typically, you have three options to perform CV operations:

  1. Train your own models for your own needs
  2. Use free pre-trained models that will fit your needs
  3. Use cloud services that charge you per usage


1. Train your own models for your own needs:

Let’s say you want to build an app that has the function of detecting objects in an image or a video. The first option means that you will build and train your own model. Let’s just begin with the disadvantages of this approach.

Building and training a model requires a lot of work, memory and processing time. First, you need to collect a HUGE number of images of the objects you want to detect. (And yes, you cannot simply detect every kind of object in the universe: you need images for every object you want to label. At least, so far.) After that you train your classifier, which consumes a lot of time and processing power. Tools like OpenCV make it easier to start the training process (once you have collected and prepared the training data, of course). OpenCV relieves you of tedious work like feature extraction, algorithm choice and other lower-level details. This is especially helpful if you are new to the machine learning world.

But this approach has the advantage of being very flexible to your needs. You can train your model to predict exactly the types of objects you need to detect. You can modify it, tweak it, improve it as much as you want. It is all yours, do as you wish.


Of all the steps above, the most time-consuming is preparing your data. Some resources might be useful here, such as the Open Images dataset:

Created by Google, Open Images is a huge dataset which contains almost 9 million annotated images with thousands of classes of objects.


2. Use free pre-trained models that will fit your needs:

Instead of doing all the hard work yourself, you can just use some pre-trained model to detect the objects you need. This means that someone else already did the work we discussed in the first option and they made their trained model publicly available online so you can use it without reinventing the wheel.

The advantages here are clear: you save a lot of time and resources that would otherwise be spent training your own models. On the other hand, you are restricted to the pre-trained models that are out there. Let’s assume you use a model that cannot predict a stethoscope (it probably never saw one during training), and you need to predict it. This means we are back to square one: you have to train a new model (or retrain the existing one) yourself to fit your needs, or just go with the current model if the losses are acceptable to you.

If you go with option 1 or 2, you will come to the final step. You have a trained model that fits your needs. Where will you store it? The first option is that you store it locally with your application. This means that the model calculations for predicting a new image label will be done on the local processor. Also the model will need to take space on the local storage. Some models take a lot of space by the way.

The other option is to upload your model to a server. A user of your application then sends a request to that server to get the results from the model. This leads us to the third option:


3. Use cloud services that charge you per usage:

Here we free ourselves from the burden of local processing and storage by using a model stored elsewhere, reached through web requests to a cloud server. But we can go one step further and delegate all the previous tedious hard work to another trusted entity, such as Google or Amazon. These data giants have used their enormous resources to build models that can do most of what you would need in image labeling. You use not only their cloud service for processing but also their ability to build very strong models that can detect a large number of object types.

Here we will compare two services: AWS Rekognition and Google Cloud Vision services.


Amazon AWS Rekognition

As a developer, the first thing you check is whether the service is available in the language you use for your application. Rest assured that the Rekognition SDK is available for many languages (.NET, C++, Go, Java, JavaScript, PHP, Python and Ruby). There are command line tools for the service as well.

Examples for the rest of the article will be in Python as it is the most common language used in machine learning apps.


Functions - what does Rekognition have to provide for us?

Rekognition is a robust service with many useful functions. The main one is labelling. But is it really just labelling?

Of course it labels objects and entire scenes, but there is also a more restricted model for faces only (it was trained only on faces). There is also a face comparison feature, where you give it a face and ask it to search for that particular face in other pictures (a function called face recognition). There is also celebrity recognition, which can recognize thousands of celebrities in different fields.

Moreover, there is a model that classifies “Unsafe Content”. If you have a picture that contains pornographic material, for example, this model will classify it as unsafe. You can then blur it, put a warning on it or do whatever you want with it.

There is also an OCR model which can detect and read text in your image.


All these features work on both pictures and videos. Yes, you can send a video to the Rekognition service and use all these features on its frames. In addition, there is a person tracking feature, which tracks a person’s movement in a video, and activity labelling, which can identify the activity of “walking”, for example.



So, let’s try these features, shall we?

If you are familiar with AWS services and how they work you can skip the coming “setup” section and go directly to the rekognition code section (note: I’m working with Python 2.7, Ubuntu 16.04 environment).

Let’s set up our AWS account - open the following url:


  1. Create a new account (you get 12 months of free usage).
  2. If you are using Python, install the boto3 package, which is the SDK for AWS services: pip install boto3 (make sure you have at least version 1.4.8 of boto3 to get the video analysis services)
  3. Also install AWS CLI tools with pip install awscli
  4. Make note of the region you are using for your services (this is important) - it is in the url in your browser like this: (my region is “us-west-2”)
  5. Now we need to create an AWS Key to use with our application
  6. From the drop down menu that comes out of your name on the right top bar, choose “My security credentials”
  7. From the left pane, choose “Users” and create a new user. When you do you will get “Access Key ID” and “AWS Secret Access Key”
  8. Now in your terminal run “aws configure” and it will ask you for the credentials and the region. Then it will save these to files on your system.
  9. From the top bar of the console page you can find a “Services” menu which we will need.
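
Before moving on, it can help to confirm that “aws configure” really saved what you expect. The following is a minimal sketch, assuming the CLI wrote its usual INI-style files (~/.aws/credentials and ~/.aws/config); the function name read_aws_profile is my own and not part of any AWS tool. It is written for Python 3 and never returns the secret key itself.

```python
import configparser
import os


def read_aws_profile(credentials_path, config_path, profile="default"):
    """Return the access key ID and region saved for a profile (None if missing)."""
    credentials = configparser.ConfigParser()
    credentials.read(credentials_path)  # files that do not exist are skipped
    config = configparser.ConfigParser()
    config.read(config_path)

    # In the config file, non-default profiles are stored as "[profile name]".
    config_section = profile if profile == "default" else "profile " + profile
    return {
        "access_key_id": credentials.get(profile, "aws_access_key_id", fallback=None),
        "region": config.get(config_section, "region", fallback=None),
    }


# Usage with the default file locations:
#   home = os.path.expanduser("~")
#   read_aws_profile(os.path.join(home, ".aws", "credentials"),
#                    os.path.join(home, ".aws", "config"))
```

If either value comes back as None, running “aws configure” again is probably the quickest fix.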


We need to create what’s called "a bucket" that will contain our files.

  • First from the services look for “S3” service
  • Create a new bucket (with a unique name)
  • Upload the images and videos you need to work with

We will work with this image (which is public domain)

Blog - Image of a room

... and this Good Table manners short movie (which is also public domain):

I uploaded both to my bucket. Let’s test object detection on this image.


Now start a new python terminal - the first step, is to import boto3:

In [1]: import boto3
# Now we want to create a client for the rekognition service
In [2]: client = boto3.client('rekognition','us-west-2')
# create some variables containing your bucket data:
In [3]: bucket = 'adawod2-images'
In [4]: videoname = 'goodtable.mp4'
In [5]: imagename = 'indoor.jpg'
# Now to the labeling. It is as simple as one line of code:
In [6]: object_label_response = client.detect_labels(Image={'S3Object': {'Bucket':bucket,'Name':imagename}},MinConfidence=75)


Here we use the “detect_labels” method which takes the Image information as its bucket name and file name. And we can also provide a minimum confidence level. In this case any label that has a confidence less than 75% will not be returned. Now if we want to see the response we can view its contents:

In [9]: object_label_response
{u'Labels': [{u'Confidence': 98.1087646484375, u'Name': u'Electronics'},
{u'Confidence': 98.1087646484375, u'Name': u'Monitor'},
{u'Confidence': 98.1087646484375, u'Name': u'Screen'},
{u'Confidence': 98.1087646484375, u'Name': u'TV'},
{u'Confidence': 98.1087646484375, u'Name': u'Television'},
{u'Confidence': 97.6505126953125, u'Name': u'Indoors'},
{u'Confidence': 97.6505126953125, u'Name': u'Interior Design'},
{u'Confidence': 95.51614379882812, u'Name': u'Couch'},
{u'Confidence': 95.51614379882812, u'Name': u'Furniture'},
{u'Confidence': 93.28926086425781, u'Name': u'Curtain'},
{u'Confidence': 93.28926086425781, u'Name': u'Home Decor'},
{u'Confidence': 93.28926086425781, u'Name': u'Window'},
{u'Confidence': 93.28926086425781, u'Name': u'Window Shade'},
{u'Confidence': 76.6932373046875, u'Name': u'Hardwood'},
{u'Confidence': 76.6932373046875, u'Name': u'Wood'}],
u'OrientationCorrection': u'ROTATE_0',
'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
'content-length': '802',
'content-type': 'application/x-amz-json-1.1',
'date': 'Wed, 13 Dec 2017 17:47:06 GMT',
'x-amzn-requestid': 'a1ea8258-e02d-11e7-9222-b93dec7efd45'},
'HTTPStatusCode': 200,
'RequestId': 'a1ea8258-e02d-11e7-9222-b93dec7efd45',
'RetryAttempts': 0}}


OK, so it is a dictionary of results, with the main result under the key “Labels”. But what would happen if we didn’t set a minimum confidence level?

In [10]: object_label_response = client.detect_labels(Image={'S3Object': {'Bucket':bucket,'Name':imagename}})
In [11]: object_label_response
{u'Labels': [{u'Confidence': 98.10874938964844, u'Name': u'Electronics'},
{u'Confidence': 98.10874938964844, u'Name': u'Monitor'},
{u'Confidence': 98.10874938964844, u'Name': u'Screen'},
{u'Confidence': 98.10874938964844, u'Name': u'TV'},
{u'Confidence': 98.10874938964844, u'Name': u'Television'},
{u'Confidence': 97.65055084228516, u'Name': u'Indoors'},
{u'Confidence': 97.65055084228516, u'Name': u'Interior Design'},
{u'Confidence': 95.5163803100586, u'Name': u'Couch'},
{u'Confidence': 95.5163803100586, u'Name': u'Furniture'},
{u'Confidence': 93.28922271728516, u'Name': u'Curtain'},
{u'Confidence': 93.28922271728516, u'Name': u'Home Decor'},
{u'Confidence': 93.28922271728516, u'Name': u'Window'},
{u'Confidence': 93.28922271728516, u'Name': u'Window Shade'},
{u'Confidence': 76.69319915771484, u'Name': u'Hardwood'},
{u'Confidence': 76.69319915771484, u'Name': u'Wood'},
{u'Confidence': 73.91436767578125, u'Name': u'Dining Room'},
{u'Confidence': 73.91436767578125, u'Name': u'Room'},
{u'Confidence': 58.88100814819336, u'Name': u'Art'},
{u'Confidence': 58.88100814819336, u'Name': u'Modern Art'},
{u'Confidence': 58.2677116394043, u'Name': u'Dining Table'},
{u'Confidence': 58.2677116394043, u'Name': u'Table'},
{u'Confidence': 57.60408020019531, u'Name': u'Living Room'},
{u'Confidence': 57.53704833984375, u'Name': u'Coffee Table'},
{u'Confidence': 55.66920471191406, u'Name': u'Flooring'},
{u'Confidence': 54.053794860839844, u'Name': u'Apartment'},
{u'Confidence': 54.053794860839844, u'Name': u'Building'},
{u'Confidence': 54.053794860839844, u'Name': u'Housing'},
{u'Confidence': 52.69278335571289, u'Name': u'Chair'},
{u'Confidence': 50.761688232421875, u'Name': u'Lobby'},
{u'Confidence': 50.543827056884766, u'Name': u'Entertainment Center'}],
u'OrientationCorrection': u'ROTATE_0',
'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
'content-length': '1587',
'content-type': 'application/x-amz-json-1.1',
'date': 'Wed, 13 Dec 2017 17:48:58 GMT',
'x-amzn-requestid': 'e43e07ff-e02d-11e7-971c-e9fdf1d42f68'},
'HTTPStatusCode': 200,
'RequestId': 'e43e07ff-e02d-11e7-971c-e9fdf1d42f68',
'RetryAttempts': 0}}


Now we have more labels. How many?

In [12]: len(object_label_response['Labels'])
Out[12]: 30


We have 30 labels detected in the image, and the accuracy looks pretty good. The other features work the same way: you just call a different method, and they all take almost the same parameters. We have “detect_faces()”, “detect_moderation_labels()”, “detect_text()” and “recognize_celebrities()”.
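
Since the response is a plain dictionary, it is easy to post-process. You may have noticed that related labels (for example Electronics, Monitor, Screen, TV) come back with exactly the same confidence. The sketch below groups labels that share a confidence value and applies a threshold on the client side; the function name and the trimmed sample data are only illustrative.

```python
from collections import OrderedDict


def group_labels_by_confidence(response, min_confidence=0.0):
    """Group label names that share a confidence value, highest confidence first."""
    groups = OrderedDict()
    labels = sorted(response.get("Labels", []),
                    key=lambda label: label["Confidence"], reverse=True)
    for label in labels:
        if label["Confidence"] < min_confidence:
            continue
        groups.setdefault(label["Confidence"], []).append(label["Name"])
    return groups


# A trimmed copy of the detect_labels response shown earlier
sample = {"Labels": [
    {"Confidence": 98.1087646484375, "Name": "Electronics"},
    {"Confidence": 98.1087646484375, "Name": "Monitor"},
    {"Confidence": 95.51614379882812, "Name": "Couch"},
    {"Confidence": 57.60408020019531, "Name": "Living Room"},
]}

for confidence, names in group_labels_by_confidence(sample, min_confidence=75).items():
    print("{:.1f}%: {}".format(confidence, ", ".join(names)))
# 98.1%: Electronics, Monitor
# 95.5%: Couch
```

Filtering on the client like this lets you request everything once and try different thresholds without calling the service again.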

Let’s try one more. Obviously there are no faces or celebrities or text in our test image so we will try the moderation labeling.

In [13]: moderation_label_response = client.detect_moderation_labels(Image={'S3Object': {'Bucket':bucket,'Name':imagename}})


In [14]: moderation_label_response
{u'ModerationLabels': [],
'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
'content-length': '23',
'content-type': 'application/x-amz-json-1.1',
'date': 'Wed, 13 Dec 2017 17:54:51 GMT',
'x-amzn-requestid': 'b7317b18-e02e-11e7-9222-b93dec7efd45'},
'HTTPStatusCode': 200,
'RequestId': 'b7317b18-e02e-11e7-9222-b93dec7efd45',
'RetryAttempts': 0}}
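
A moderation response like the one above can be reduced to a simple decision in your app (blur, warn or show). Here is a minimal sketch, assuming each entry in “ModerationLabels” carries “Name” and “Confidence” keys; the helper names, the threshold of 50 and the label names in the sample are illustrative choices of mine, not values taken from the service.

```python
def flagged_categories(response, threshold=50.0):
    """Return the names of moderation labels at or above the confidence threshold."""
    return sorted(
        label["Name"]
        for label in response.get("ModerationLabels", [])
        if label["Confidence"] >= threshold
    )


def is_safe(response, threshold=50.0):
    """An image counts as safe when no moderation label reaches the threshold."""
    return not flagged_categories(response, threshold)


# An empty list, as returned for our room image, means nothing was flagged.
print(is_safe({"ModerationLabels": []}))

# Hypothetical response for a flagged image (label names are illustrative)
flagged = {"ModerationLabels": [
    {"Name": "Suggestive", "Confidence": 81.2},
    {"Name": "Explicit Nudity", "Confidence": 12.5},
]}
print(flagged_categories(flagged))
```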


We can see that the ModerationLabels array is empty, because the image contains no pornographic or violent content. Next, we want to know how fast that response was. We could use the timeit module in Python, but the result would vary with our internet speed at the moment of testing. So, let’s see how fast the server generated the response rather than how fast the response arrived at our end.

  • From the services menu choose “CloudWatch” which is a monitoring service for all your other services.
  • When you go there, click “Browse Metrics”
  • Now you will find a box with Rekognition Metrics
  • When you click it you will find two choices of metrics “Operation” and “Metrics with no dimensions”.
  • Choose Operation Metrics
  • Now scroll until you see the Metric “ResponseTime” for the operation “DetectLabels” and check that one


Screenshot AWS 01

In this graph, we find that we did two operations (one with the minimum confidence and one without). The first one took 1.99k milliseconds (almost 2 seconds) to generate. The second request took 1.51k milliseconds (1.5 seconds). One possible explanation is that the first request spends some time filtering the returned results to match our minimum confidence, although two requests are too few to be sure.
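
The same numbers can probably be fetched from code instead of the console. The sketch below assumes the metric is published in the “AWS/Rekognition” namespace with an “Operation” dimension, which matches what the console steps above show; the function name is my own. It takes the CloudWatch client as a parameter, so it runs under Python 3 without boto3 installed until you actually call it.

```python
from datetime import datetime, timedelta, timezone


def average_response_times(cloudwatch, operation="DetectLabels", hours=1, period=60):
    """Return (timestamp, average ResponseTime in ms) pairs, oldest first."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Rekognition",      # assumed namespace, see the note above
        MetricName="ResponseTime",
        Dimensions=[{"Name": "Operation", "Value": operation}],
        StartTime=start,
        EndTime=end,
        Period=period,                    # size of each time bucket, in seconds
        Statistics=["Average"],
    )
    points = sorted(stats["Datapoints"], key=lambda point: point["Timestamp"])
    return [(point["Timestamp"], point["Average"]) for point in points]


# Usage (needs boto3 and configured credentials):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch", "us-west-2")
#   for timestamp, millis in average_response_times(cloudwatch):
#       print(timestamp, millis)
```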

Now we move to the videos. How would we analyze a video with the Rekognition service? It is a little bit different here. See, the video is a bunch of images (frames). It is not like a single image that will be analyzed instantly. So, a video takes some time. Let’s see how this is done: When you submit a video job, you choose a topic to send notifications to on the AWS server. When the Rekognition service finishes processing the video, you get a notification and then you can get the results.

  • Go to the Services menu again and choose SNS (Simple Notification Service).
  • Choose Create Topic 
  • When the topic is created with the name you picked you will find a menu with topics registered on the server. Pick the one you created and you will find some information about it including its ARN name. It will look like this arn:aws:sns:region:somenumber:topicname
  • Now we need to create a role that has permissions to work with topics
  • Again from services go this time to IAM service
  • From the left pane choose “Roles”
  • Create a role and name it as you wish and choose the SNS service for the role
  • From the new role page you will find its ARN, which looks like this: arn:aws:iam::somenumber:role/rolename
  • Now in the code we want to detect objects AND ACTIVITIES in the video.

In [16]: video_object_response = client.start_label_detection(Video={'S3Object':
...: {'Bucket':bucket,'Name':videoname}}, NotificationChannel={
...: 'SNSTopicArn': 'arn:aws:sns:region:somenumber:topicname',
...: 'RoleArn': 'arn:aws:iam::somenumber:role/rolename'
...: })
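
If you start several video jobs, it may be convenient to assemble the request in one place and catch swapped ARNs early. This is only a sketch: the helper name and the prefix checks are my own additions, and the resulting dictionary is meant to be passed to start_label_detection as keyword arguments.

```python
def build_label_detection_request(bucket, video_name, topic_arn, role_arn,
                                  min_confidence=None):
    """Build the keyword arguments for client.start_label_detection()."""
    # Cheap sanity checks: the topic must be an SNS ARN, the role an IAM ARN.
    if not topic_arn.startswith("arn:aws:sns:"):
        raise ValueError("not an SNS topic ARN: {}".format(topic_arn))
    if not role_arn.startswith("arn:aws:iam::"):
        raise ValueError("not an IAM role ARN: {}".format(role_arn))

    request = {
        "Video": {"S3Object": {"Bucket": bucket, "Name": video_name}},
        "NotificationChannel": {"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    }
    if min_confidence is not None:
        request["MinConfidence"] = min_confidence
    return request


# Usage with the placeholders from the text:
#   request = build_label_detection_request(
#       bucket, videoname,
#       "arn:aws:sns:region:somenumber:topicname",
#       "arn:aws:iam::somenumber:role/rolename",
#       min_confidence=75)
#   video_object_response = client.start_label_detection(**request)
```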

Note that you can use MinConfidence here as well. Notice also that this time we passed the topic ARN and the role ARN to receive notifications for the topic. What if we viewed the object now?

In [17]: video_object_response
{u'JobId': u'long_hashed_id',
'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
'content-length': '76',
'content-type': 'application/x-amz-json-1.1',
'date': 'Wed, 13 Dec 2017 18:16:47 GMT',
'x-amzn-requestid': 'c8b2ce2f-e031-11e7-b5d4-172e74991468'},
'HTTPStatusCode': 200,
'RequestId': 'c8b2ce2f-e031-11e7-b5d4-172e74991468',
'RetryAttempts': 0}}
# It gives us the JobId on the server. But how would we use that?
In [12]: video_object_results = client.get_label_detection(JobId='The job id you got earlier')

# but we get
In [13]: video_object_results
{u'JobStatus': u'IN_PROGRESS',
u'Labels': [],
'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
'content-length': '39',
'content-type': 'application/x-amz-json-1.1',
'date': 'Wed, 13 Dec 2017 18:21:37 GMT',
'x-amzn-requestid': '7588693a-e032-11e7-a843-47bbd4ff47a9'},
'HTTPStatusCode': 200,
'RequestId': '7588693a-e032-11e7-a843-47bbd4ff47a9',
'RetryAttempts': 0}}


Which says that the job is still processing the video. But how will we get the notification? You can create a subscription to the topic we specified for the job.

  • Go to the SNS service again - From the left pane choose “Subscriptions”
  • Now you can create a new subscription
  • If you choose Protocol “Email” and specify your email, you will get an email notification for any message published in the topic you subscribe to. (First you get a confirmation email of course)
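
For quick experiments you may not want to wait for an email. As an alternative, the job status can simply be polled until it is no longer “IN_PROGRESS”. The helper below is a minimal sketch with names and default intervals of my own choosing; it takes the Rekognition client as a parameter and returns the first result whose status has changed (“SUCCEEDED” in our example).

```python
import time


def wait_for_labels(client, job_id, poll_seconds=10, timeout_seconds=600,
                    sleep=time.sleep):
    """Poll get_label_detection until the job leaves the IN_PROGRESS state."""
    waited = 0
    while True:
        result = client.get_label_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            return result
        if waited >= timeout_seconds:
            raise RuntimeError(
                "job {} still in progress after {}s".format(job_id, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Usage:
#   video_object_results = wait_for_labels(client, video_object_response['JobId'])
```

Polling is fine in a terminal session; for a real application the notification topic is the better fit, because nothing has to sit in a loop while the video is processed.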

Now what happens when you get a notification saying that the job is completed? It will take a couple of minutes to process. Then when we get the notification we do this:

In [20]: video_object_results = client.get_label_detection(JobId='c36e9ba9b77c607831d0b36c00d8082d0e436a25ab300272a82f74d5be0fc37c')

In [21]: video_object_results
{u'JobStatus': u'SUCCEEDED',
u'Labels': [{u'Label': {u'Confidence': 57.609397888183594, u'Name': u'Logo'},
u'Timestamp': 0},
{u'Label': {u'Confidence': 57.609397888183594, u'Name': u'Trademark'},
u'Timestamp': 0},
{u'Label': {u'Confidence': 62.13990020751953, u'Name': u'Logo'},
u'Timestamp': 166}, # …… and goes on for a long list


We will find it detected logos, trademarks and text in the beginning of the video. It also detected activities like “Person Eating” {u'Label': {u'Confidence': 96.9111557006836, u'Name': u'Person Eating'}, u'Timestamp': 18585},

It also detected objects like “Furniture” {u'Label': {u'Confidence': 56.38970184326172, u'Name': u'Furniture'}, u'Timestamp': 19786}, and so on.
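
Because the list is long, a short summary is often more useful than the raw output. The sketch below records, for each label, when it first appears and its highest confidence. It assumes “Timestamp” is measured in milliseconds from the start of the video, which is how I read the AWS documentation; the function name is my own, and the sample reuses entries from the output above.

```python
def summarize_video_labels(results):
    """Map each label name to its first timestamp (ms) and highest confidence."""
    summary = {}
    for item in results.get("Labels", []):
        name = item["Label"]["Name"]
        confidence = item["Label"]["Confidence"]
        entry = summary.setdefault(
            name, {"first_seen_ms": item["Timestamp"], "max_confidence": confidence})
        entry["first_seen_ms"] = min(entry["first_seen_ms"], item["Timestamp"])
        entry["max_confidence"] = max(entry["max_confidence"], confidence)
    return summary


# Entries copied from the get_label_detection output shown above
sample_results = {"JobStatus": "SUCCEEDED", "Labels": [
    {"Label": {"Confidence": 57.609397888183594, "Name": "Logo"}, "Timestamp": 0},
    {"Label": {"Confidence": 62.13990020751953, "Name": "Logo"}, "Timestamp": 166},
    {"Label": {"Confidence": 96.9111557006836, "Name": "Person Eating"}, "Timestamp": 18585},
    {"Label": {"Confidence": 56.38970184326172, "Name": "Furniture"}, "Timestamp": 19786},
]}

for name, info in sorted(summarize_video_labels(sample_results).items()):
    print("{}: first seen at {:.1f}s, max confidence {:.1f}%".format(
        name, info["first_seen_ms"] / 1000.0, info["max_confidence"]))
```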

And as before, we have other methods we can use with videos like “start_celebrity_recognition()”, “start_content_moderation()”, “start_face_detection()”, “start_face_search()”, “start_person_tracking()” each with its “get” method to get results using JobId.


Google Cloud Vision / VideoIntelligence

Now let’s move on to the Google Cloud Vision service - here the setup steps are almost the same:

You go to the console page and create an account to get a free 12 months subscription.

  • Go to this page to activate the vision API
  • From the top menu you will find a drop down menu with your Google cloud projects. Select or create a new one.
  • Then click “Continue” to enable the API for this project
  • Now from the side menu choose “Storage” to create a bucket for the project
  • Upload your files to the bucket
  • Now - as we did with AWS - we need account credentials to create a client in our code. In this page you can create a new service account and after you’re done, it will download a JSON file with your credentials.
  • Now, in our Ubuntu terminal we add the file location to an environment variable.
    export GOOGLE_APPLICATION_CREDENTIALS="/path/to/file"
  • Install the Python client libraries for Google Cloud: pip install google-cloud-vision google-cloud-videointelligence
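
A quick sanity check before creating a client can save some head-scratching. This sketch assumes the downloaded service account file is ordinary JSON with “type”, “project_id” and “client_email” fields; the function name is mine, and the private key is deliberately never returned or printed.

```python
import json
import os


def describe_service_account(path=None):
    """Return non-secret fields from the service account key file."""
    path = path or os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    with open(path) as handle:
        key = json.load(handle)
    return {
        "type": key.get("type"),
        "project_id": key.get("project_id"),
        "client_email": key.get("client_email"),
    }


# Usage, after the export command above:
#   print(describe_service_account())
```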

Now, we want to do the same stuff we did with our files with Rekognition.

Open a python terminal:

In [1]: from google.cloud.vision import types
In [2]: from google.cloud import vision

# Now create a client
In [3]: client = vision.ImageAnnotatorClient()

# Then, we label our image
In [4]: response = client.annotate_image({'image': {'source': {'image_uri': 'gs://bucket-name/file-name'}}, 'features': [{'type': vision.enums.Feature.Type.LABEL_DETECTION}]})

# Now to get the labels:
In [5]: labels = response.label_annotations
In [6]: for label in labels:
...: print(label.description, label.score)
(u'living room', 0.9214476346969604)
(u'interior design', 0.8677590489387512)
(u'room', 0.8587830066680908)
(u'ceiling', 0.7799193859100342)
(u'real estate', 0.6248593926429749)
(u'floor', 0.5974909663200378)
(u'flooring', 0.5898616909980774)
(u'interior designer', 0.5724017024040222)
(u'lobby', 0.5487929582595825)
(u'hardwood', 0.5441449284553528)


Here we have our labels with their confidence scores.

We can immediately see the difference. AWS Rekognition recognizes more individual objects in the scene, while Google Cloud Vision describes the scene as a whole rather than its smaller objects. It looks at the bigger picture here.
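
To make this comparison a little more concrete, the two label sets can be put on a common footing: lowercase names and scores between 0 and 1 (AWS reports confidence as a percentage, Google as a fraction). The helpers below are an illustrative sketch of my own, and the sample data is a small subset of the two outputs shown earlier.

```python
def normalize_aws(response):
    """Lowercase label names and scale AWS percentages to the 0-1 range."""
    return {label["Name"].lower(): label["Confidence"] / 100.0
            for label in response["Labels"]}


def normalize_google(pairs):
    """Lowercase label descriptions from (description, score) pairs."""
    return {description.lower(): score for description, score in pairs}


def compare_labels(aws, google):
    """Split label names into shared, AWS-only and Google-only lists."""
    aws_names, google_names = set(aws), set(google)
    return {
        "shared": sorted(aws_names & google_names),
        "aws_only": sorted(aws_names - google_names),
        "google_only": sorted(google_names - aws_names),
    }


# Subsets of the two responses shown above
aws_labels = normalize_aws({"Labels": [
    {"Confidence": 98.10874938964844, "Name": "TV"},
    {"Confidence": 97.65055084228516, "Name": "Interior Design"},
    {"Confidence": 95.5163803100586, "Name": "Couch"},
    {"Confidence": 57.60408020019531, "Name": "Living Room"},
]})
google_labels = normalize_google([
    ("living room", 0.9214476346969604),
    ("interior design", 0.8677590489387512),
    ("ceiling", 0.7799193859100342),
])

print(compare_labels(aws_labels, google_labels))
```

On this subset, “living room” and “interior design” are shared, the couch and TV appear only on the AWS side, and the ceiling only on the Google side, which matches the impression above.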


Now let’s try a video. The process is very similar, but first we need to enable the Video Intelligence API, as we did with the Vision API.


Then we can use it:


In [1]: from google.cloud import videointelligence
In [2]: video_client = videointelligence.VideoIntelligenceServiceClient()
# (assumed) the definition of `features` is missing above; here we request label detection only
In [3]: features = [videointelligence.enums.Feature.LABEL_DETECTION]
In [5]: operation = video_client.annotate_video('gs://adawod2-bucket/goodtable.mp4', features=features)
# Wait for the long-running operation to finish
In [11]: result = operation.result()
In [8]: segment_labels = result.annotation_results[0].segment_label_annotations
In [10]: for i, segment_label in enumerate(segment_labels):
    ...:     print('Video label description: {}'.format(
    ...:         segment_label.entity.description))
    ...:     for category_entity in segment_label.category_entities:
    ...:         print('\tLabel category description: {}'.format(
    ...:             category_entity.description))
    ...:     # each label can apply to several segments of the video
    ...:     for i, segment in enumerate(segment_label.segments):
    ...:         start_time = (segment.segment.start_time_offset.seconds +
    ...:                       segment.segment.start_time_offset.nanos / 1e9)
    ...:         end_time = (segment.segment.end_time_offset.seconds +
    ...:                     segment.segment.end_time_offset.nanos / 1e9)
    ...:         positions = '{}s to {}s'.format(start_time, end_time)
    ...:         confidence = segment.confidence
    ...:         print('\tSegment {}: {}'.format(i, positions))
    ...:         print('\tConfidence: {}'.format(confidence))
    ...:     print('\n')
[Out]:
Video label description: monochrome photography
	Label category description: photography
	Segment 0: 0.0s to 619.95264s
	Confidence: 0.514091789722

Video label description: black and white
	Label category description: style
	Segment 0: 0.0s to 619.95264s
	Confidence: 0.880857467651

Video label description: monochrome
	Label category description: style
	Segment 0: 0.0s to 619.95264s
	Confidence: 0.499802052975


The Google Vision service includes many features like: Face Detection, Logos, Object labelling, Landmarks labelling, Text Recognition, Safe Search detection and Extract Web Entities. 

The Google Video Intelligence includes Object labelling, Shot Change detection and content moderation.

Unfortunately, there is no way on the Google console to measure the server response time as we did with AWS.


Conclusion: What to use - AWS Rekognition or Google Cloud Vision?

What is best for your app?

I, personally, believe that this is the wrong question. In computer science in general, there is no concept of “Best”. There is no best sorting algorithm, there is no best network topology, there is no best database engine, there is no best API, … etc.

Each has its advantages and disadvantages. So, let’s compare:


AWS Rekognition vs. Google Cloud (Vision/Video), criterion by criterion:

Cost
  • AWS Rekognition: Free for the first 1,000 minutes of video and 5,000 images per month during the first year. Beyond that, Rekognition is relatively cheaper than Google Cloud Vision/Video.
  • Google Cloud (Vision/Video): The first 1,000 units per month are free (not just in the first year).

Speed
  • AWS Rekognition: Up to 2 seconds per image and 2 minutes per video.
  • Google Cloud (Vision/Video): Similar performance (measured by response time to the client).

Features
  • AWS Rekognition: Object detection, face detection and recognition, content moderation, celebrity recognition, activity recognition, person tracking, text recognition.
  • Google Cloud (Vision/Video): Object detection, face detection but NOT recognition, content moderation, text recognition.

Diversity of labels
  • AWS Rekognition: Can detect a greater variety of small details in the image.
  • Google Cloud (Vision/Video): Mostly looks at the bigger picture.

Code clarity
  • AWS Rekognition: Very clear and easy.
  • Google Cloud (Vision/Video): A little bit ambiguous.

Screenshot Google Cloud Api Backend

Screenshot AWS 02

So, again, what to choose?

It is all up to you. You know your financial status, your needs in your app and what each API can provide for you. 

For example, AWS might seem cheaper at first, but in the long run Google Cloud may save you more money, because its free tier does not expire after the first year. Another example is that the two services differ in accuracy from function to function. AWS beats Google in text detection, for instance: when you run our example image through Google Vision, it identifies some of the window blinds as the letters E and F, while AWS Rekognition returns nothing, which is correct since the image contains no text.


What has been your experience with these services? Which do you prefer?