Data Labeling

How to prepare your own dataset for training a model

In supervised learning, training a model requires labeled data for training and testing. In object detection, a sample consists of an image together with the bounding-box coordinates of the objects it contains; the exact label format depends on the ML framework (a small illustrative example follows the list below). The training, validation (optional), and test lists are provided to the ML model training task.
In computer vision, common labeling categories are:

  • Image classification assigns a single class to the whole image.

  • Multi-label classification assigns multiple classes to the same image.

  • Landmarks/points annotate specific coordinates (keypoints) in the image.

  • Bounding boxes describe the coordinates and dimensions of boxes enclosing individual objects.

  • Segmentation assigns groups of pixels in the image to a class.
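
As a minimal, framework-agnostic illustration (all file names and values here are hypothetical), a single object-detection sample and the same box expressed in two common conventions could look like this:

    # Hypothetical object-detection sample: image path plus pixel-corner boxes (Pascal VOC style).
    sample = {
        "image": "images/bird_001.jpg",
        "width": 640,
        "height": 480,
        "annotations": [
            {"label": "bird", "xmin": 120, "ymin": 80, "xmax": 310, "ymax": 260},
        ],
    }

    # The same box in normalized center format (YOLO style): x_center, y_center, width, height in [0, 1].
    x_center = (120 + 310) / 2 / 640
    y_center = (80 + 260) / 2 / 480
    box_w = (310 - 120) / 640
    box_h = (260 - 80) / 480
    print(round(x_center, 3), round(y_center, 3), round(box_w, 3), round(box_h, 3))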

Data labeling can be performed manually or by third parties. This guide covers two options:

  • AWS Ground Truth

  • Online tools

AWS Ground Truth

The AWS Ground Truth service provides data labeling. Labeling can be performed by the Amazon Mechanical Turk workforce, by your own private team, or by third-party vendors. Labeling by the Amazon Mechanical Turk workforce and by vendors are paid services; in these cases, the data is labeled by external workers.

Step-1: Upload data to S3 bucket

Upload the training data to an S3 bucket (how to create an S3 bucket: Amazon S3 Bucket).
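
The upload can be done in the S3 console, with the AWS CLI (for example aws s3 sync), or with a few lines of boto3. A minimal sketch, assuming AWS credentials are already configured; the bucket name and folder are placeholders:

    # Upload a local folder of images to S3 with boto3 (bucket and prefix are placeholders).
    import os
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-labeling-bucket"        # replace with your bucket name
    prefix = "datasets/birds/images"     # target folder (prefix) inside the bucket
    local_dir = "images"

    for name in os.listdir(local_dir):
        if name.lower().endswith((".jpg", ".jpeg", ".png")):
            s3.upload_file(os.path.join(local_dir, name), bucket, f"{prefix}/{name}")
            print("uploaded", name)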

Step-2: Create labeling job

  • Create a labeling job in Amazon SageMaker.
    Amazon SageMaker → Ground Truth → Labeling Jobs → Create Labeling Job

labeling_jobs

createlabel_job

  • Step 1: Specify job details

jobdetails_part1

jobdetails_part2

  1. Job name
    Enter the name for the labeling job. As specified in the image, the name should be unique.

  2. Label attribute name (optional)
    This is the attribute name used for the labels in the output manifest file. Beginners can simply skip this option.

  3. Input dataset setup
    Select Automated data setup. Ground Truth automatically identifies the data in your S3 folder and creates a manifest file listing it (the manifest format is sketched after this list of steps).

  4. S3 location for input datasets
    The S3 folder containing the input data; the generated manifest file is stored here as well. Click on Browse S3, then choose the folder containing the images. An example is illustrated below.

    s3bucket1 s3bucket2

    Then click Choose

  5. S3 location for output datasets
    By default, the output data is stored in the input S3 bucket.

  6. Data type
    Select Image

  7. IAM role
    Select the IAM role.
    The Identity and Access Management (IAM) service controls access to AWS resources; permissions for different resources can be restricted with an IAM role (AWS IAM). The selected role must have access to the S3 buckets defined in the input and output paths.

    iamlist

    Either select an Existing role or Create a new role. How to create a new role:

    createnewrole

    info_16 If you would like to use Specific S3 buckets instead of Any S3 bucket, enter the names of the buckets you would like to access.

    Finally, click on Create.

    createrolefinal

  8. Click Complete Setup.

  9. Additional configuration (Skip this step if you would like to label all images)

    1. Dataset selection
      By default, the labeling job labels all the data in the manifest file. You can also label only a part of the data, or select samples based on their properties. Amazon provides three options for creating the dataset: the full dataset, a random sample, or a filtered subset. The filtered subset requires SQL knowledge.

      1. Full dataset

        fulldata

      2. Random Sample

        randomsample

        Here 20% of the total samples are used. Now, click on Create subset and finally, Use this subset.

      3. Filtered Subset

        filter_set

        Enter the SQL query to select the data samples and Create subset.

    2. Encryption (optional)
      The encryption option encrypts the output data with your own key. If encryption is required, select the AWS KMS key by choosing its ID from the options.

  10. Task Category
    Select the task category. For this example, select the Bounding box option.

    task_category
    task_category2

    Click on Next.
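
  The manifest file created by the Automated data setup (step 3 above) is a simple JSON Lines file with one entry per image. A minimal sketch of the format, and of how such a file could be generated with boto3 (bucket and prefix are hypothetical placeholders):

      # Build an input manifest: one {"source-ref": "s3://..."} line per image.
      # Ground Truth's Automated data setup creates this file for you; this only illustrates the format.
      import json
      import boto3

      bucket = "my-labeling-bucket"
      prefix = "datasets/birds/images"

      s3 = boto3.client("s3")
      with open("dataset.manifest", "w") as out:
          for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
              for obj in page.get("Contents", []):
                  if obj["Key"].lower().endswith((".jpg", ".jpeg", ".png")):
                      out.write(json.dumps({"source-ref": f"s3://{bucket}/{obj['Key']}"}) + "\n")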

  • Step 2: Select workers and Configure tool
  1. Workforce
    A labeling workforce is a group of people working on data annotation. There are three types of workforces:
    Amazon Mechanical Turk scales the workforce on demand and is a good option for labeling a large dataset. A Private workforce is created by the owner with their own people; only the persons included in the team can label the data. A Vendor workforce consists of people working for third-party companies that have been approved and integrated into the AWS platform; the AWS Marketplace provides information about the available vendors.
    Worker type, Task timeout, Price per task, and Additional configuration → Number of workers can be set according to your labeling task. The price is defined in the Price per task option; as the price increases, the quality of labeling increases. The three worker types are explained below. For our example, we used a Private team.

    1. Mechanical Turk Team - sample

      turkteam

      Enable automated data labeling is a paid option for automatic labeling. It is most useful when labeling a large dataset, since labels are predicted from the annotation patterns of already labeled samples.

    2. Private
      In this example, a private team created by you labels the samples. Create a private team if none exists yet.

      info Skip steps 1 - 8 below if you already have a private team. How to create a private team:
      1. Go to Amazon SageMaker → Ground Truth → Labeling workforces → Private
      private_team

      2. Click on Create private team
      create_privateteam1

      3. Then name the team and finally, click on Create private team.
      create_privateteam2

      4. Now the team will be visible in Labeling workforces → Private.
      labeling_workforces

      5. The next step is to invite workers to the team.
      invite_workers
      invite_newworker

      6. The invited workers receive an invitation mail in their mail account. They follow the link and change the password. This step can also be done later by the worker.
      loginportal

      7. Once the workers have been invited, select your team and add workers to the team.
      add_worker1

      8. Now add workers from the list.
      add_worker2

      Back in the labeling job configuration, select Private and choose your team from the Private teams option.

      privateteam

      Set the parameters such as the time per task.
      Once the labeling job is created, the jobs appear in the labeling portal:
      Amazon SageMaker → Ground Truth → Labeling Workforces → Private → portal

      loginportal

      Select the job and draw bounding boxes on the images.

    3. Vendor
      A third party can be assigned to label the dataset. The AWS Marketplace contains the list of approved vendors.

  2. Existing labels display option
    This option displays already existing labels and is used for the label verification task category.

  3. Annotation
    The next step is the annotation configuration. Here you describe how the labeling should be performed and provide a Good example, a Bad example, and the Labels in the dataset.

    bbox_label

    Enter the descriptions. The Add label option adds a label to the dataset; multiple labels can be added.

    check_mark Prior to creating the labeling job, the labeling can be tested on a sample. The annotation tool for the image can be previewed with the Preview option; a new window opens with the image.

    label_example

    Select the label and draw the box around the object. Then Submit.

    submit_example

    Now the bounding box coordinates can be viewed. Close the window once the test is over.
    Finally, click on Create to create the labeling job.

createjob_final

The created labeling job appears in Amazon SageMaker → Ground Truth → Labeling Jobs with the current status.

label_status
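
The job status can also be queried programmatically. A minimal sketch with boto3; the job name is a placeholder:

    # Check the status of a Ground Truth labeling job (job name is a placeholder).
    import boto3

    sm = boto3.client("sagemaker")
    resp = sm.describe_labeling_job(LabelingJobName="my-labeling-job")
    print(resp["LabelingJobStatus"])       # e.g. InProgress, Completed, Failed
    print(resp.get("LabelCounters", {}))   # counts of labeled / unlabeled objects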

Click on the labeling job.

label_output

The S3 output path contains the annotation files: **** → manifests → output → output.manifest.
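
The output manifest is a JSON Lines file with one entry per image. A minimal sketch for reading the bounding boxes from a downloaded copy; the label attribute key is assumed to match the job name here (adjust it if you entered a different Label attribute name in step 1):

    # Read bounding boxes from a Ground Truth output manifest (JSON Lines).
    import json

    label_attribute = "my-labeling-job"   # placeholder: use your label attribute name

    with open("output.manifest") as f:
        for line in f:
            record = json.loads(line)
            image_uri = record["source-ref"]                       # s3:// path of the image
            boxes = record.get(label_attribute, {}).get("annotations", [])
            for box in boxes:
                print(image_uri, box["class_id"],
                      box["left"], box["top"], box["width"], box["height"])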

Open Source Online Labeling Tools

Tool      | Website   | Supported labeling types                      | Output file   | Account
MakeSense | MakeSense | Bounding box, Polygon, Points                 | zip, xml, csv | Not required
LabelMe   | LabelMe   | Bounding box, Points, Polygon, Area Selection | xml           | Required

Sample tool

  • Make Sense - Object detection example

    • Go to the MakeSense website and create a labeling job

      MK-0

    • Load the dataset. This example contains only one image.

      MK-1

    • Insert the labels. This can be done manually or by uploading a label file.

      MK-2

    • MakeSense provides automated labeling using pre-trained models. This example uses the COCO SSD model.

      MK-3

    • Next, a window appears for labeling.

      • Bounding box

        bounding_box

        Draw the bounding boxes around the objects. In the Bounding box option, the label bird is predicted by the SSD model. Extra labels can be deleted.

      • Points

        points

      • Polygon

        polygon

    • Now export the annotations via EXPORT LABELS. Select the format and save the file (a sketch for reading the exported labels follows below).

      label_names
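
      A minimal sketch for reading the exported labels, assuming the YOLO text export option was chosen (one .txt file per image; the file name and class list below are placeholders):

          # Read a YOLO-format label file: class_index x_center y_center width height, normalized to [0, 1].
          classes = ["bird"]   # same order as the labels defined in MakeSense

          with open("bird_001.txt") as f:
              for line in f:
                  class_idx, x_c, y_c, w, h = line.split()
                  print(classes[int(class_idx)], float(x_c), float(y_c), float(w), float(h))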
