How to prepare your own dataset for training a model
In supervised learning, training a model requires labeled data for training and testing. In object detection, each sample consists of an image together with the bounding box coordinates of the objects in it. The exact label format depends on the ML framework. An ML training task is given a training list, an optional validation list, and a test list.
In computer vision, some of the labeling categories are:
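As a minimal sketch of how the three lists could be produced from one labeled collection, the snippet below shuffles the samples and cuts them into training, validation, and test parts. The function name `split_dataset`, the fractions, and the sample file names are illustrative assumptions, not something prescribed by a particular framework.

```python
import random


def split_dataset(samples, val_fraction=0.1, test_fraction=0.2, seed=42):
    """Shuffle labeled samples and split them into train/validation/test lists."""
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be in [0, 1)")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical samples: (image file, label) pairs.
samples = [(f"img_{i:03d}.jpg", "bird") for i in range(10)]
train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))
```

Setting `val_fraction=0` gives a plain train/test split, which matches the case where the validation list is omitted.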
Image classification assigns a class to the image
Multilabel classification labels data with multiple classes
Landmark/points annotation marks individual coordinates in the image.
Bounding box annotation marks the coordinates and dimensions of boxes enclosing different objects.
Segmentation assigns a group of pixels in the image to a class.
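To make the bounding box category concrete, here is a small sketch of one way a box label could be held in code. The `BoundingBox` class and its field names are assumptions chosen for illustration; real frameworks each define their own layout (corner coordinates, center plus size, pixel or normalized units).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """One labeled object: class name plus a box in pixel units."""
    label: str
    left: float
    top: float
    width: float
    height: float

    def corners(self):
        """Return the box as (xmin, ymin, xmax, ymax)."""
        return (self.left, self.top, self.left + self.width, self.top + self.height)

    def normalized(self, image_width, image_height):
        """Return a copy with coordinates scaled to the 0..1 range."""
        return BoundingBox(
            self.label,
            self.left / image_width,
            self.top / image_height,
            self.width / image_width,
            self.height / image_height,
        )


box = BoundingBox("bird", left=20, top=10, width=50, height=40)
print(box.corners())
print(box.normalized(image_width=500, image_height=400))
```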
Data labeling can be performed manually or using third parties.
AWS Ground Truth
The AWS Ground Truth service provides data labeling. Labeling can be performed by Amazon Mechanical Turk workers, by your own private team, or by third-party vendors. Labeling by Amazon Mechanical Turk workers and by vendors is a paid service; in these cases, the data is labeled by external workers.
Upload the training data to an S3 bucket (see How to create an S3 bucket - Amazon S3 Bucket).
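The upload can also be scripted. The sketch below assumes a boto3 S3 client is passed in and uses its `upload_file(Filename, Bucket, Key)` call; the helper name `upload_images`, the bucket name, and the prefix are hypothetical placeholders.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def upload_images(s3_client, local_dir, bucket, prefix):
    """Upload every image file in local_dir and return the resulting S3 URIs."""
    uploaded = []
    for path in sorted(Path(local_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            # Build the object key, tolerating an empty or slash-terminated prefix.
            key = "/".join(part for part in (prefix.strip("/"), path.name) if part)
            s3_client.upload_file(str(path), bucket, key)
            uploaded.append(f"s3://{bucket}/{key}")
    return uploaded


# Usage (requires boto3 and configured AWS credentials; names are placeholders):
# import boto3
# upload_images(boto3.client("s3"), "images", "my-hypothetical-bucket", "ground-truth/input")
```

Passing the client as an argument keeps the helper easy to try out with a stand-in object before touching a real bucket.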
Enter the name for the labeling job. As specified in the image, the name should be unique.
Label attribute name (optional)
For beginners, it is better to skip this option. This is the attribute name to be used for the labels in the output manifest file.
Input dataset setup
Select Automated data setup. Ground Truth automatically identifies data in your S3 folder and creates a manifest file with the list of data.
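For orientation, an image input manifest is a JSON Lines file in which each line points at one object through a `source-ref` key. The sketch below builds such a file by hand; the bucket and key names are made up, and the automated setup described above produces the file for you, so this is only needed if you want to inspect or write one yourself.

```python
import json


def build_manifest(bucket, keys):
    """Return manifest text: one JSON object per line, each with a source-ref."""
    lines = [json.dumps({"source-ref": f"s3://{bucket}/{key}"}) for key in keys]
    return "\n".join(lines) + "\n"


# Hypothetical bucket and object keys.
manifest_text = build_manifest(
    "my-hypothetical-bucket",
    ["ground-truth/input/img_001.jpg", "ground-truth/input/img_002.jpg"],
)
print(manifest_text, end="")
```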
S3 location for input datasets
The S3 folder that contains the images to label. Click on Browse S3, then choose the folder containing the images. An example is illustrated below.
Then click Choose
S3 location for output datasets
By default, the output data is stored in the input S3 bucket.
Data type
Select Image.
Select the IAM role.
The Identity and Access Management (IAM) service controls access to AWS services. Permissions on different resources can be restricted using an IAM role (AWS IAM). The IAM role should have access to the S3 buckets defined in the input and output paths.
Either select an Existing role or Create a new role. How to create a new role:
If you would like to use Specific S3 buckets instead of Any S3 bucket, enter the names of the buckets you would like to access.
Finally, click on Create.
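To illustrate what restricting a role to specific buckets means, the sketch below builds a policy document that allows listing, reading, and writing only in the named buckets. This is an assumption about the S3 part alone: the role created by the console carries further permissions, and the helper name and bucket names here are hypothetical.

```python
import json


def s3_access_policy(bucket_names):
    """Build an IAM policy document limited to the given S3 buckets."""
    bucket_arns = [f"arn:aws:s3:::{name}" for name in bucket_names]
    object_arns = [f"{arn}/*" for arn in bucket_arns]
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing applies to the bucket itself.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": bucket_arns},
            # Reading and writing apply to the objects inside the bucket.
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": object_arns,
            },
        ],
    }


policy = s3_access_policy(["my-input-bucket", "my-output-bucket"])
print(json.dumps(policy, indent=2))
```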
Click Complete Setup.
Additional configuration (Skip this step if you would like to label all images)
By default, a labeling job labels all the data in the manifest file. It is also possible to label only part of the data, or samples selected by their properties. Amazon provides three options for creating the dataset: the full set of samples, a random sample, or a filtered subset. Creating a filtered subset requires SQL knowledge.
Here 20% of the total samples are used. Now, click on Create subset and finally, Use this subset.
Enter the SQL query to select the data samples and Create subset.
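The two subset options can be pictured with a local analogue. The sketch below works on manifest lines in memory and is not the query mechanism Ground Truth itself runs; the function names and the file names in the example are illustrative assumptions.

```python
import json
import random


def load_manifest(text):
    """Parse JSON Lines manifest text into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def random_subset(entries, fraction, seed=0):
    """Pick a random share of the entries, e.g. fraction=0.2 for 20%."""
    count = round(len(entries) * fraction)
    return random.Random(seed).sample(entries, count)


def filtered_subset(entries, substring):
    """Keep entries whose source-ref contains the given substring."""
    return [entry for entry in entries if substring in entry["source-ref"]]


# Hypothetical manifest with ten images, half of them under a 'birds' folder.
manifest_text = "\n".join(
    json.dumps({"source-ref": f"s3://my-hypothetical-bucket/{folder}/img_{i}.jpg"})
    for i, folder in enumerate(["birds", "cats"] * 5)
)
entries = load_manifest(manifest_text)
print(len(random_subset(entries, 0.2)))
print(len(filtered_subset(entries, "/birds/")))
```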
The Encryption option encrypts the output data using your own key. If encryption is required, select the AWS KMS key by choosing its ID from the options.
Select the task category, then select the Bounding box option.
Click on Next.
Labeling workforce is a group of people working on data annotation. There are three types of workforces.
Amazon Mechanical Turk scales the workforce on demand and is a good option for labeling a large set of data.
Private workforce is created by an owner from their own people. The persons included in the team can label the data.
Vendor consists of people working for a third-party group that has been approved and integrated into the AWS platform. AWS Marketplace provides information about available vendors.
Worker types, Task timeout, Price per task, and Additional configuration → Number of workers can be set according to your labeling task. The prices per task are listed under the option Price per task. A higher price per task generally attracts better labeling quality. The three worker types are explained below. For our example, we used a Private team.
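Because the price per task and the number of workers multiply, a quick estimate before starting a paid job can be useful. The sketch below only multiplies the worker price; any additional service fees are not modeled, and the figures used are hypothetical rather than actual AWS prices.

```python
from decimal import Decimal


def estimate_worker_cost(num_images, workers_per_image, price_per_task):
    """Worker cost = images x workers per image x price per task."""
    # Decimal avoids binary floating-point rounding on currency amounts.
    return num_images * workers_per_image * Decimal(str(price_per_task))


# Hypothetical job: 1000 images, 3 workers per image, 0.036 per task.
cost = estimate_worker_cost(1000, 3, "0.036")
print(cost)
```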
Mechanical Turk Team - sample
Enable automated data labeling is a paid option for automatic labeling. It is most useful for large datasets, since labels for part of the data are predicted from the patterns in the human annotations.
In this example, a private team created by you labels the samples. Create a private team if none exists.
Skip steps 1 - 8 below if you already have a private team. How to create a private team:
1. Go to Amazon SageMaker → Ground Truth → Labeling workforces → Private
2. Click on Create private team
3. Then name the team and finally, click on Create private team.
4. Now the team will be visible in Labeling workforces → Private.
5. The next step is to invite workers to the team.
6. The invited workers should check their mail account for the invitation mail, open the link in it, and change the password. This step can be done later by the worker.
7. Once the workers have been invited, select your team and add workers to the team.
8. Now add workers from the list.
Select Private and select your team from the option Private teams.
Set the parameters such as time per task.
Once a labeling job is created, we can see the jobs in the job portal.
Amazon SageMaker → Ground Truth → Labeling Workforces → Private → portal
Select the job and draw bounding boxes on the images.
A third party can be assigned to label a dataset. AWS Marketplace contains the list of approved vendors.
Existing labels display option
This option applies to the label verification task category.
The next step is to configure the annotation task. Here we describe how the labeling should be performed, give a good example and a bad example, and define the labels in the dataset.
Enter the descriptions. Add label option adds a label to the dataset. Multiple labels can be added.
Prior to the creation of the labeling job, we can test the labeling on the sample. The annotation format for the image can be viewed from the Preview option. A new window opens with the image.
Select the label and draw the box around the object. Then Submit.
Now the bounding box coordinates can be viewed. Close the window once the test is over.
Now click on Create to create the labeling job.
The created labeling job appears in Amazon SageMaker → Ground Truth → Labeling Jobs with the current status.
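The same status can be read programmatically. This sketch assumes a boto3 SageMaker client and its `describe_labeling_job` call, reading the `LabelingJobStatus` and `LabelCounters` fields of the response; the helper name and the job name are placeholders.

```python
def labeling_job_progress(sagemaker_client, job_name):
    """Return the status and label counts of a Ground Truth labeling job."""
    response = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)
    counters = response["LabelCounters"]
    return {
        "status": response["LabelingJobStatus"],
        "labeled": counters["TotalLabeled"],
        "unlabeled": counters["Unlabeled"],
    }


# Usage (requires boto3 and configured AWS credentials; the job name is a placeholder):
# import boto3
# print(labeling_job_progress(boto3.client("sagemaker"), "my-hypothetical-job"))
```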
Click on the labeling job.
The S3 output path contains the annotation files: **** → manifests → output → output.manifest.
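Once downloaded, the output manifest can be turned into a flat list of boxes. The sketch assumes the bounding box output layout in which each line carries the annotations under the label attribute name and a class map under `<label attribute name>-metadata`; the attribute name `my-job` and the sample record are made up for illustration.

```python
import json


def parse_output_manifest(text, label_attribute):
    """Flatten a bounding box output manifest into one dict per box."""
    boxes = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # class-map translates numeric class ids (as strings) to label names.
        class_map = record[f"{label_attribute}-metadata"]["class-map"]
        for annotation in record[label_attribute]["annotations"]:
            boxes.append({
                "image": record["source-ref"],
                "label": class_map[str(annotation["class_id"])],
                "left": annotation["left"],
                "top": annotation["top"],
                "width": annotation["width"],
                "height": annotation["height"],
            })
    return boxes


# Made-up record following the assumed layout.
sample_line = json.dumps({
    "source-ref": "s3://my-hypothetical-bucket/input/img_001.jpg",
    "my-job": {
        "annotations": [{"class_id": 0, "left": 20, "top": 10, "width": 50, "height": 40}],
        "image_size": [{"width": 500, "height": 400, "depth": 3}],
    },
    "my-job-metadata": {"class-map": {"0": "bird"}},
})
parsed_boxes = parse_output_manifest(sample_line, "my-job")
print(parsed_boxes)
```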
|Tool||Website||Supported labeling tools||Output file||Account|
|MakeSense||MakeSense||Bounding box, Polygon, Points||zip, xml, csv||Not required|
|LabelMe||LabelMe||Bounding box, Points, Polygon, Area Selection||xml||Required|
Make Sense - Object detection example
Go to the MakeSense website and create a labeling job.
Load the dataset. This example contains only one image.
Insert the labels. This can be done manually or by uploading a label file.
MakeSense provides automated labeling by using pre-trained models. This example uses the COCO SSD model.
Next, a window appears for labeling.
Draw the bounding boxes around the objects. In the Bounding box option, the label bird is predicted by the SSD model. Extra labels can be deleted.
Now export the annotations using EXPORT LABELS. Select the format and save the file.
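The exported file can then be read back for training. The sketch below assumes the xml export follows the Pascal VOC layout (`object`, `name`, `bndbox` with `xmin`/`ymin`/`xmax`/`ymax`); check one exported file to confirm before relying on it. The sample annotation is invented for the example.

```python
import xml.etree.ElementTree as ET


def parse_voc_xml(xml_text):
    """Read boxes from one Pascal VOC style annotation document."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        boxes.append({
            "image": filename,
            "label": obj.findtext("name"),
            # Coordinates may be written as floats, so convert via float first.
            "xmin": int(float(bndbox.findtext("xmin"))),
            "ymin": int(float(bndbox.findtext("ymin"))),
            "xmax": int(float(bndbox.findtext("xmax"))),
            "ymax": int(float(bndbox.findtext("ymax"))),
        })
    return boxes


# Invented annotation in the assumed layout.
sample_xml = """
<annotation>
  <filename>bird.jpg</filename>
  <object>
    <name>bird</name>
    <bndbox><xmin>20</xmin><ymin>10</ymin><xmax>70</xmax><ymax>50</ymax></bndbox>
  </object>
</annotation>
"""
voc_boxes = parse_voc_xml(sample_xml)
print(voc_boxes)
```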