Machine Learning Inference/Predictions on Video Frames using AWS Batch and Amazon FSx for Lustre
Introduction
With the advancement of video and image capturing devices that can capture very fine details of our daily lives, we are looking for ways to organize these moments and make better use of them. Video inference is useful for self-driving cars, indexing videos with metadata to make online and learning content searchable, and finding relevant frames in security footage or investigations, to name a few use cases.
Image recognition requires compute-intensive machine learning (ML) algorithms to analyze millions of frames in a very short time. Running these ML workloads at scale requires managing a large compute environment and scaling it up and down with throughput, which demands broad skills, amounts to undifferentiated heavy lifting, and adds little value to the ML model or the mission at hand.
In this blog, you will see an example of how you can offload management of the ML compute environment to AWS Batch in combination with other AWS services such as AWS Lambda, Amazon S3, and Amazon FSx for Lustre. In this example we analyze videos frame by frame at scale: we take a set of large video files, split them into chunks, apply a machine learning object detection model to each frame, and store the resulting raw output in S3, where it can be compiled back into a video.
AWS Batch is a fully managed service, available at no additional charge, that abstracts away the setup, scaling logic, and networking of your compute clusters. AWS Batch manages all the infrastructure for you, avoiding the complexities of provisioning, managing, monitoring, and scaling your batch computing jobs.
Architecture
The architecture below shows how the different AWS services interact and what each is used for. This post works with stored videos; by using Amazon Kinesis Video Streams or an AWS media pipeline solution, you can modify it to work with streaming video.
The process involves the following steps.
1. Store all objects, images, and model files in an S3 bucket.
2. Use AWS Lambda to trigger jobs on upload or on a schedule.
3. Use FSx for Lustre to create a file system accessed by AWS Batch, so no API calls are needed to upload to or download from S3.
4. Model training is performed on AWS Batch whenever the model needs an update, as an independent process triggered on demand by the user. (This step is included only for illustration and is not covered in this blog.)
5. An AWS Batch job splits incoming videos into frames, performs inference and post-processing, and stores the results back in FSx.
6. A dependent AWS Batch job (it depends on step 5) then compiles the transformed frames from step 5 back into videos.
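The dependency between the last two steps maps directly onto AWS Batch's `dependsOn` mechanism. As a sketch (the job, queue, and definition names here are placeholders, not the ones from the repository), the compile job can be submitted so it only starts after the inference job succeeds:

```python
def dependent_job_request(frames_job_id, job_queue, job_definition):
    """Build a submit_job request for the frames-to-video job that waits
    for the inference job (step 5) to succeed before starting."""
    return {
        "jobName": "frames2vid-job",  # placeholder name
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Sequential dependency: run only after frames_job_id succeeds
        "dependsOn": [{"jobId": frames_job_id}],
    }

def submit_dependent_job(frames_job_id, job_queue, job_definition):
    import boto3  # lazy import; only needed when actually submitting
    batch = boto3.client("batch")
    return batch.submit_job(
        **dependent_job_request(frames_job_id, job_queue, job_definition)
    )
```
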
Code
Complete code for this blog can be found in this GitHub repository:
https://github.com/GavinatorK/AWSBatchMLVideo
Code Set Up
There are two folders in the GitHub repository above; each contains the code and artifacts to build one of two Docker images.
AWS Batch
fsxPredictDocker: Code and artifacts to create the Docker image that splits a video into frames, performs inference on the frames, draws bounding boxes, and saves the frames into a separate folder.
frames2vid: Code and artifacts to create the Docker image that compiles the inference frames back into a video.
Build and push the above two Docker images to Amazon Elastic Container Registry (ECR); refer to the AWS documentation. Once pushed, you will have an image URI for each image, which you will need when you set up the AWS Batch job definition.
Lambda
lambda_function.py: Reacts to file creation in a bucket and calls AWS Batch to process a single video file or a set of video files.
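A minimal sketch of such a handler might look like the following. The queue and job definition names are placeholders, and the helper functions are hypothetical; the real code lives in lambda_function.py in the repository.

```python
import json
import re

def parse_s3_event(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]

def job_name_for(key):
    """Batch job names allow only letters, digits, hyphens, and underscores."""
    return "video-inference-" + re.sub(r"[^A-Za-z0-9_-]", "-", key)

def lambda_handler(event, context):
    # boto3 ships with the Lambda runtime; imported lazily so this module
    # also loads in local environments without it installed.
    import boto3
    batch = boto3.client("batch")
    job_ids = []
    for bucket, key in parse_s3_event(event):
        resp = batch.submit_job(
            jobName=job_name_for(key),
            jobQueue="video-inference-queue",     # placeholder name
            jobDefinition="fsx-predict-job-def",  # placeholder name
        )
        job_ids.append(resp["jobId"])
    return {"statusCode": 200, "body": json.dumps(job_ids)}
```
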
Set Up
As outlined in the architecture, each of the services needs to be set up, starting with the IAM policies and roles that AWS Batch and Lambda need to execute the process and write logs to Amazon CloudWatch Logs. You can set up each service using the AWS console; advanced users can use services such as AWS CloudFormation or Terraform.
S3 Bucket Creation
Create an S3 bucket in us-east-1 (or another region, but stay consistent with your choice of region throughout).
Note: Ensure you have permission to create a bucket; if not, contact your IT admin or account owner.
You can accept all default settings from here on and complete the process.
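If you prefer boto3 over the console, the bucket can be created programmatically. One quirk worth noting: us-east-1 is the default region and must not be passed as a `LocationConstraint`. A small sketch (the bucket name is a placeholder):

```python
def bucket_create_kwargs(bucket_name, region):
    """Build the kwargs for s3.create_bucket. us-east-1 is the default
    region and must not be passed as a LocationConstraint."""
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs

def create_bucket(bucket_name, region="us-east-1"):
    import boto3  # lazy import keeps the helper above testable offline
    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**bucket_create_kwargs(bucket_name, region))
```
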
FSx for Lustre
FSx for Lustre allows AWS Batch to work with a file system instead of making read/write API calls to S3, which can create contention or throttling.
Navigate to FSx for Lustre page on AWS Console
So far, you have created an S3 bucket and connected it to FSx, which exposes the objects as a high-throughput file system to the instances and containers spun up by AWS Batch jobs. This reduces the number of S3 API calls when working with large video files with many frames.
When all the processing is complete, you can export the file system back into S3 for access and sharing. You can create an export task from the console, the AWS CLI, or the boto3 library.
Below, you can see how to initiate the export back to S3 from the FSx file system: go to FSx on the AWS console, click on your file system, and choose the export option, highlighted with an orange rectangle.
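The same export can be started with boto3 via a data repository task. A sketch, assuming a placeholder file system ID:

```python
def export_task_kwargs(file_system_id, paths=None):
    """Build the request for fsx.create_data_repository_task.
    EXPORT_TO_REPOSITORY copies new and changed files back to the
    linked S3 bucket."""
    kwargs = {
        "FileSystemId": file_system_id,
        "Type": "EXPORT_TO_REPOSITORY",
        # Set Enabled=True (with Path and Scope) to write a completion
        # report listing any files that failed to export.
        "Report": {"Enabled": False},
    }
    if paths:
        kwargs["Paths"] = paths  # export only these paths; omit for everything
    return kwargs

def start_export(file_system_id, paths=None):
    import boto3  # lazy import keeps the kwargs helper testable offline
    fsx = boto3.client("fsx")
    return fsx.create_data_repository_task(
        **export_task_kwargs(file_system_id, paths)
    )
```
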
Lambda
Navigate to Lambda on AWS Console
Create a Lambda function from scratch and fill out the fields as shown below.
Copy the code from lambda_function.py file into the lambda code console.
Add the S3 bucket created above as a trigger for this Lambda function. This allows any file upload to the bucket to trigger the Lambda function, which in turn kicks off an AWS Batch job.
AWS Batch
For a detailed explanation of how to set up an AWS Batch environment, please refer to the AWS Batch documentation. At a high level, the requirements for running a batch job can be summarized with the process flow below.
An AWS Batch job needs a job queue and a compute environment in which to run the containers. You can use the default environment, or set up a launch template using an ECS-enabled AMI if you need additional options; in this blog, we will use a launch template to mount FSx on the compute instance and subsequently on the running container. Jobs run in a job queue, which specifies the compute environment and job priority, among other things. A job needs a job definition that includes a container image stored in Amazon Elastic Container Registry, permissions and roles, volume mounts, and so on.
We will go through each of these sections and set them up, starting with the compute environment.
Compute environment
Setting up the compute environment involves creating an EC2 instance profile your containers can run under. Normally this would mean using the default environment and moving on to the sections below. However, we want to mount FSx for Lustre as file system storage instead of uploading and downloading objects from S3, so we will first create a launch template and choose an AMI to use in our compute environment.
Launch Template
The launch template below, in MIME multipart format, mounts FSx onto the instance.
Choose AMI
Because we are customizing the instance profile, we need to ensure that the AMI has the ECS components required to launch and run the job successfully, so we choose an ECS-optimized AMI available in us-east-1. If you are running this in another region, search for the ECS-optimized AMI for your region or use the link below.
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html
Storage
Let’s set storage amount as required.
Copy and paste the template below, ensuring the blank lines between sections are preserved as you paste. MIME requires this spacing; otherwise you will get an error with the launch template.
You can find your file system ID and mount name on your FSx console; see the FSx section above for a visual.
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/cloud-config; charset="us-ascii"

runcmd:
- file_system_id_01="your-file-system-id"
- region=us-east-1
- fsx_directory=/input
- fsx_mount_name="Find-yours-on-your-FSX-Console"
- amazon-linux-extras install -y lustre2.10
- mkdir -p ${fsx_directory}
- mount -t lustre -o noatime,flock ${file_system_id_01}.fsx.${region}.amazonaws.com@tcp:/${fsx_mount_name} ${fsx_directory}

--==MYBOUNDARY==--
Once the launch template is created successfully, navigate back to the AWS Batch console and set up the compute environment.
This completes the required setup for the compute environment. Once creation completes without errors, we can proceed to set up the job queue.
Job Queue
Next, we need to create a queue where our jobs can run, so all submitted jobs can be monitored and tracked.
Next we need to provide a compute environment for our queue to run its jobs in. Select the compute environment we created in the previous section and click create.
Job Definition
We have our compute environment and job queue; now we need a definition for our job. This is where we set the container image it uses, storage volume mounts, memory, and so on. Let's go through and set this up.
The container image you built and pushed to ECR will have a URI; enter it here.
Click on Additional configuration and expand the options
Finally add a log driver as awslogs as shown below.
Click create.
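The same job definition can also be registered programmatically. A sketch using boto3 (the name, resource sizes, and image URI below are illustrative placeholders; the mount paths assume the `/input` directory created by the launch template):

```python
def job_definition_request(name, image_uri, fsx_dir="/input",
                           vcpus=4, memory_mib=16384):
    """Build a register_job_definition request that exposes the host's
    FSx mount (created by the launch template) inside the container."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image_uri,
            "vcpus": vcpus,
            "memory": memory_mib,
            # Map the host directory where FSx is mounted...
            "volumes": [{"name": "fsx", "host": {"sourcePath": fsx_dir}}],
            # ...to the same path inside the container.
            "mountPoints": [{"sourceVolume": "fsx",
                             "containerPath": fsx_dir,
                             "readOnly": False}],
            "logConfiguration": {"logDriver": "awslogs"},
        },
    }

def register_definition(name, image_uri):
    import boto3  # lazy import keeps the builder above testable offline
    batch = boto3.client("batch")
    return batch.register_job_definition(**job_definition_request(name, image_uri))
```
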
IAM
Let's ensure we have assigned the ECSInstanceRole enough permissions to access the required AWS services: S3, FSx, ECS, and ECR at a minimum.
Lambda IAM
Lambda creates a new role when you create a Lambda function with default settings; you can add additional permissions to this role to trigger AWS Batch jobs. Add full access to S3 and Batch to run the examples here; in production, you can narrow this down to only what is needed.
Batch IAM
Provision a role called ecsInstanceRole in the AWS IAM console and add the following policies.
The batchwritelogs policy mentioned above contains the following. You can create it and attach it to the ECS instance role from the AWS IAM console (Create Policy, then Attach to role).
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams"
      ],
      "Resource": [
        "arn:aws:logs:*:*:*"
      ]
    }
  ]
}
Job
We now have everything we need to start and execute a job. The way we set up the container, we don't need to pass any parameters, because we mapped a directory inside the S3 bucket as the input directory. Once the files are processed, they are moved to a different directory inside FSx. This manual step is only needed to test that your AWS Batch setup works as intended; in normal operation, the job is triggered by uploading a file to the S3 bucket we created above, which triggers the Lambda function to create and run a Batch job for us.
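For the manual test, it can be handy to poll the job's status from a script instead of refreshing the console. A sketch using boto3 (the polling interval is arbitrary):

```python
import time

def extract_status(resp):
    """Pull the status string out of a batch.describe_jobs response."""
    jobs = resp.get("jobs", [])
    return jobs[0]["status"] if jobs else "UNKNOWN"

def wait_for_job(job_id, poll_seconds=30):
    """Poll until the job reaches a terminal state; returns that state."""
    import boto3  # lazy import keeps extract_status testable offline
    batch = boto3.client("batch")
    while True:
        status = extract_status(batch.describe_jobs(jobs=[job_id]))
        print(job_id, status)
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(poll_seconds)
```
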
You can leave the rest of the options alone unless you want to override what has been set in the job definition.
CloudWatch Monitoring
AWS Batch and AWS Lambda executions can be monitored using Amazon CloudWatch log groups. Any errors or warnings can be debugged by adding a logger to the code or via print statements.
Troubleshooting
If your Batch job hangs in the RUNNABLE state, check the IAM role permissions and ensure the right (ECS-optimized) AMI is selected if you are providing a launch template. If you are using the default environment, the cause is most likely permissions. Use CloudWatch and CloudTrail to check any errors that come up.
Setting up FSx for Lustre requires adding more permissions to the ECS instance role than are assigned by default.
Conclusion
You have seen how to efficiently apply machine learning to the frames of a video at scale. While this example focuses on a specific use case, AWS Batch can be used for any containerized ML or HPC workload. Try out AWS Batch for your ML or HPC use case using EC2 On-Demand or Spot Instances or AWS Fargate. AWS provides a number of services that enable end-to-end machine learning, from data labeling to managing and monitoring your ML models in production. Check out the blogs put together by our colleagues, listed below, for other use cases.
Note: I compiled this tutorial in early 2021 and am posting it only now. If any of the code here is borrowed from your posts/GitHub, let me know and I will add references to the rightful owners.
Author:
Raj Kadiyala is AI/ML Global Lead for AWS WWPS Partners. Raj brings a decade of ML experience to helping customers leverage AWS AI/ML services to solve their business problems.