Data Labeling

How to prepare your own dataset for training a model



In supervised learning, training a model requires lists of labeled images for training and testing. In object detection, a sample consists of an image together with its bounding-box coordinates. The exact label format depends on the ML framework. The training, validation (optional), and test lists are provided to an ML model training task.
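A minimal sketch of building those training/validation/test lists from a set of labeled image files; the file names and split ratios below are illustrative assumptions:

```python
import random

def split_dataset(samples, val_frac=0.1, test_frac=0.1, seed=42):
    """Split a list of labeled samples into train/validation/test lists."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # reproducible shuffle
    n_test = int(len(samples) * test_frac)
    n_val = int(len(samples) * val_frac)
    test = samples[:n_test]
    val = samples[n_test:n_test + n_val]
    train = samples[n_test + n_val:]
    return train, val, test

# Placeholder file names; in practice these come from your labeled dataset.
train, val, test = split_dataset([f"img_{i:03d}.jpg" for i in range(100)])
```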
In computer vision, common labeling categories are:

  • Image classification assigns a single class to the whole image.

  • Multilabel classification assigns multiple classes to one image.

  • Landmark/point annotation marks individual coordinates in the image.

  • Bounding box annotation encloses each object in a box described by its coordinates and dimensions.

  • Segmentation assigns a group of pixels in the image to a class.
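To make the categories concrete, here is a sketch of what a single labeled sample might look like in each case; the field names and values are illustrative assumptions, not a fixed standard:

```python
# Illustrative label records for the categories above; field names and
# values are assumptions, not a fixed standard.
classification = {"image": "cat_01.jpg", "label": "cat"}
multilabel = {"image": "street_01.jpg", "labels": ["car", "person", "bicycle"]}
landmarks = {"image": "face_01.jpg", "points": [(120, 85), (160, 84), (140, 120)]}
bounding_box = {
    "image": "birds_01.jpg",
    "boxes": [{"label": "bird", "left": 30, "top": 45, "width": 110, "height": 90}],
}
# Segmentation masks are usually stored as a separate image file.
segmentation = {"image": "street_01.jpg", "mask": "street_01_mask.png"}
```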

Data labeling can be performed manually or through third-party services:

  • AWS Ground Truth

  • Online tools

AWS Ground Truth

The AWS Ground Truth service provides data labeling. The labeling work can be performed by the Amazon Mechanical Turk workforce, by your own private team, or by third-party vendors. Labeling by Amazon Mechanical Turk and by vendors is a paid service; in these cases, the data is labeled by external workers.

Step-1: Upload data to S3 bucket

Upload the training data to an S3 bucket (How to create an S3 bucket - Amazon S3 Bucket).
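The upload can also be scripted with boto3; a sketch, where the bucket and prefix names are placeholders and the upload call assumes AWS credentials are configured locally:

```python
def s3_key_for(prefix, filename):
    # Join an S3 prefix and a file name into an object key.
    return f"{prefix.rstrip('/')}/{filename}"

def upload_images(bucket, prefix, local_paths):
    # Requires boto3 and configured AWS credentials; names are placeholders.
    import boto3
    s3 = boto3.client("s3")
    for path in local_paths:
        s3.upload_file(path, bucket, s3_key_for(prefix, path.split("/")[-1]))

# Example call (placeholder names):
# upload_images("my-labeling-bucket", "training-data", ["images/bird_01.jpg"])
```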

Step-2: Create labeling job

  • Create a labeling job in Amazon SageMaker.
    Amazon SageMaker → Ground Truth → Labeling Jobs → Create Labeling Job



  • Step 1: Specify job details



  1. Job name
Enter a name for the labeling job. The name must be unique.

  2. Label attribute name (optional)
For beginners, it is better to skip this option. It sets the attribute name used for the labels in the output manifest file.

  3. Input dataset setup
Select Automated data setup. Ground Truth automatically identifies the data in your S3 folder and creates a manifest file listing it.

  4. S3 location for input datasets
    The S3 folder that contains the input images. Click on Browse S3, choose the folder containing the images, then click Choose.

  5. S3 location for output datasets
    By default, the output data is stored in the input S3 bucket.

  6. Data type
    Select Image.

  7. IAM role
    Select the IAM role.
    The Identity and Access Management (IAM) service controls access to AWS services; permissions for different resources can be restricted through an IAM role (AWS IAM). The IAM role must have access to the S3 buckets defined in the input and output paths.


    Either select an existing role or create a new one. How to create a new role:


    If you would like to use Specific S3 buckets instead of Any S3 bucket, enter the names of the buckets you would like to access.

    Finally, click on Create.


  8. Click Complete Setup.

  9. Additional configuration (Skip this step if you would like to label all images)

    1. Dataset selection
      By default, the labeling job labels all the data in the manifest file, but a subset of the data, or samples selected by their properties, can be labeled instead. Amazon provides three options for creating the dataset: the full data, a random sample, or a filtered subset. The filtered subset requires SQL knowledge.

      1. Full dataset


      2. Random Sample


        Here, 20% of the total samples are used. Click on Create subset and, finally, Use this subset.

      3. Filtered Subset


        Enter a SQL query to select the data samples, then click Create subset.

    2. Encryption (optional)
      The encryption option encrypts the output data with your own key. If encryption is required, select the AWS KMS key by choosing its ID from the options.

  10. Task Category
    Select the task category; for this example, select the Bounding box option.


    Click on Next.
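The Automated data setup in step 3 above generates an input manifest in JSON Lines format, one object per image. A sketch of writing such a manifest yourself, with placeholder bucket and file names:

```python
import json

# Placeholder S3 URIs; in practice, list the images in your input bucket.
images = [
    "s3://my-labeling-bucket/training-data/bird_01.jpg",
    "s3://my-labeling-bucket/training-data/bird_02.jpg",
]
# One JSON object per line, each pointing at an image via "source-ref".
manifest = "\n".join(json.dumps({"source-ref": uri}) for uri in images)
```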

  • Step 2: Select workers and Configure tool
  1. Workforce
    Labeling workforce is a group of people working on data annotation. There are three types of workforces.
    Amazon Mechanical Turk scales the workforce on demand. Amazon Mechanical Turk is a good option for labeling a large set of data. Private workforce is created by an owner with their people. The persons included in the team can label the data. Vendor consists of people working for the third party group that has been approved and integrated into the AWS platform. AWS Marketplace provides information about available vendors. Worker types, Task Timeout, Price per task, Additional Configuration → Number of workers can be set in order according to your labeling task. The prices per task are provided in the option Price per task. As price increases quality of labeling increases. The three worker types are explained below. For our example, we used Private team.

    1. Mechanical Turk Team - sample


      Enable automated data labeling is a paid option that labels data automatically. It is worth enabling for a large dataset, since the labels are predicted from the annotation pattern.

    2. Private
      In this example, a private team created by you labels the samples. Create a private team if none exists.

      Skip Steps 1 - 8 below if you already have a private team. How to create a private team:
      1. Go to Amazon SageMaker → Ground Truth → Labeling workforces → Private

      2. Click on Create private team

      3. Then name the team and finally, click on Create private team.

      4. Now the team will be visible in Labeling workforces → Private.

      5. The next step is to invite workers to the team.

      The invited workers receive an invitation mail; they should open the link in it and set a password. This step can also be done later by the worker.

      7. Once the workers have been invited, select your team and add workers to the team.

      8. Now add workers from the list.

      Select Private and select your team from the option Private teams.


      Set the parameters such as time per task.
      Once a labeling job is created, we can see the jobs in the job portal.
      Amazon SageMaker → Ground Truth → Labeling Workforces → Private → portal


      Select the job and draw bounding boxes on the images.

    3. Vendor
      A third party can be assigned for labeling a dataset. Amazon Marketplace contains the approved vendor list.

  2. Existing labels display option
    This option is used for label verification tasks, in which workers review existing labels.

  3. Annotation
    The next step is the annotation setup. Here we describe how the labeling should be performed and provide a good example, a bad example, and the labels in the dataset.


    Enter the descriptions. The Add label option adds a label to the dataset; multiple labels can be added.

    Before creating the labeling job, we can test the labeling on a sample. The annotation format for the image can be viewed with the Preview option; a new window opens with the image.


    Select the label and draw the box around the object. Then Submit.


    Now the bounding box coordinates can be viewed. Close the window once the test is over.
    Now click on Create to create the labeling job.


The created labeling job appears in Amazon SageMaker → Ground Truth → Labeling Jobs with the current status.


Click on the labeling job.


The S3 output path contains the annotation files: **** → manifests → output → output.manifest.
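Each line of output.manifest is a JSON object holding the image reference and the annotations stored under the label attribute name. A sketch of reading the bounding boxes back; the attribute name "my-labeling-job" and the simplified schema are assumptions to check against your own output file:

```python
import json

# A simplified output.manifest line; the label attribute name matches the
# labeling job and is a placeholder here.
line = json.dumps({
    "source-ref": "s3://my-labeling-bucket/training-data/bird_01.jpg",
    "my-labeling-job": {
        "annotations": [
            {"class_id": 0, "left": 30, "top": 45, "width": 110, "height": 90}
        ]
    },
})

record = json.loads(line)
boxes = record["my-labeling-job"]["annotations"]
```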

Open Source Online Labeling Tools

Tool      | Website   | Supported labeling tools                      | Output file   | Account
MakeSense | MakeSense | Bounding box, Polygon, Points                 | zip, xml, csv | Not required
LabelMe   | LabelMe   | Bounding box, Points, Polygon, Area Selection | xml           | Required

Sample tool

  • Make Sense - Object detection example

    • Go to the MakeSense website and create a labeling job


    • Load the dataset. This example contains only one image.


    • Insert the labels. This can be done manually or by uploading a label file.


    • MakeSense provides automated labeling using pre-trained models. This example uses the COCO SSD model.


    • Next, a window appears for labeling.

      • Bounding box


        Draw bounding boxes around the objects. In the Bounding box option, the label bird was predicted by the SSD model; extra labels can be deleted.

      • Points


      • Polygon


    • Now export the annotations with EXPORT LABELS. Select the format and save the file.
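A sketch of reading a bounding-box CSV export back into Python; the column order below is an assumption to verify against your own export file:

```python
import csv
import io

# One exported row: label, box x/y/width/height, image name, image size.
# The column order is an assumption about the CSV export format.
export = "bird,30,45,110,90,bird_01.jpg,640,480\n"
fields = ["label", "x", "y", "w", "h", "image", "img_w", "img_h"]
rows = [dict(zip(fields, row)) for row in csv.reader(io.StringIO(export))]
```

Note that `csv.reader` yields strings, so numeric fields still need conversion with `int()` before use.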



