Embracing Open Source Machine Learning — AutoML: Image Classification
Updated: 1/8/2023
Correction: SageMaker Studio needed updating; AutoGluon works just fine on the Base Python 2.0 kernel.
Why AutoML? Why this blog? We are seeing rising interest in the democratization of AI and ML. Cloud providers like AWS, GCP and Azure continue to invest in higher-level services that simplify machine learning for the masses, with click-and-drag UIs or APIs that cater to specific needs such as OCR, computer vision, translation and transcription, to name a few. In a world of cloud providers and enterprises like H2O offering such services, where does open source fit in? It fits in nicely: you can still use the resources of the cloud, and instead of just going with what is offered, you can compare what these companies offer against open source tools and pick the best option for your use case.
In this post, I will compare two open source AutoML packages, AutoKeras and AutoGluon, for the task of image classification. I want to focus on ease of use and on a dataset that is not MNIST or Iris, so I will try them on my own dataset of junction images that I downloaded from Kaggle a few years ago. I will not train the models to completion. I tested these on Amazon SageMaker Notebook instances with a GPU. What I’m looking for:
1. Can I just pip install and not muck around with settings or debug the install?
2. Can I use an organized folder structure to let the library load and label my data?
3. Can I just point to the directories for train, val and test data loading?
4. Can the library split my data if I don’t have it split already (one folder into train and val, for example)?
5. Can I load the images and let the library resize them for different models?
In essence, how easy is it for anyone to use?
Dataset
Source: https://www.kaggle.com/datasets/karthikaditya147/junctions-images
Training Data: a total of 1326 files containing images from the 3 classes shown below.
Test Data: 333 files from the same 3 classes
Note: The images are distributed more or less uniformly across the 3 classes, though not exactly equally. That should be fine for our test here.
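If you want to check the class balance yourself, here is a minimal sketch using only the Python standard library. It assumes the images sit in one sub-folder per class (as in ./data/train after extracting the archive); the function name and the list of image suffixes are my own choices for illustration.

```python
from collections import Counter
from pathlib import Path

# Assumption: only these file types count as images.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_images_per_class(root):
    """Return a Counter mapping each class folder name to its image count."""
    counts = Counter()
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # skip stray files next to the class folders
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

Calling count_images_per_class("./data/train") should return one entry per class folder, which makes it easy to see how far the classes are from an equal split.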
Both libraries below let developers do machine learning with a limited amount of code, without worrying about the underlying algorithms, about setting different parameters, or even about what those parameters are.
They do require a high-level understanding of what an epoch is and of whether you are doing image classification, regression, text classification, etc.
Notebook Set-up
Notebooks are how data scientists first prototype and test out sample code. This is a natural step and needs to be accounted for. If a library only works as a Docker container, that is good but not great. If it only works on an older CUDA version like 10.1 while most people are on 11.6, I would ignore that library unless you are head over heels in love with it.
For both tests I used an Amazon SageMaker Notebook instance with ml.g5.8xlarge. Why? I can’t afford to buy GPUs or an M2 MacBook. Cloud is cheaper for quick prototyping, and I work for AWS.
AutoKeras (8.7k stars on GitHub)
Keras is a high-level neural network library that runs on top of TensorFlow, with heavy contributions from Google. AutoKeras, as the name suggests, is the AutoML version of the Keras library.
I started with the conda_python3 kernel and ran the code below in separate notebook cells. I have pasted the cells just as I ran them so you can replicate what I did.
!pip install autokeras
#this is where my dataset is stored on S3. You can get it directly from Kaggle
# here: https://www.kaggle.com/datasets/karthikaditya147/junctions-images
!aws s3 cp s3://ml-materials/junctions-data.tar.gz .
!tar -xzf junctions-data.tar.gz . --no-same-owner
# This code produces the image I included in the Dataset section.
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
f, axs = plt.subplots(1,3,figsize=(15,15))
img1 = mpimg.imread('./data/train/Priority/12481.png')
img2 = mpimg.imread('./data/train/Roundabout/53408.png')
img3 = mpimg.imread('./data/train/Signal/27258.png')
axs[0].imshow(img1)
axs[0].set_title("Priority")
axs[1].imshow(img2)
axs[1].set_title("Roundabout")
axs[2].imshow(img3)
axs[2].set_title("Signal")
plt.show()
To check the image shape, which is (640, 640, 3):
img1.shape
#(640, 640, 3)
So far so good; AutoKeras hasn’t even been called yet.
import autokeras as ak
import os
# This will be your data directory if you followed my code; if not, adjust the path.
data_dir=os.getcwd()+"/data"
Let’s answer one of the questions we started with: can I use an organized folder structure to let the library load and label my data?
Yes, and it is a pass.
batch_size = 32
img_height = 640
img_width = 640
train_data = ak.image_dataset_from_directory(
    data_dir + "/train",
    # validation_split would hold out 20% of this folder for validation.
    # My data is already organized into train, val and test, so it is commented out.
    # validation_split=0.2,
    # subset only makes sense together with validation_split.
    subset="training",
    # A fixed seed ensures the same split when loading the validation subset.
    # I'm not splitting, so it is commented out.
    # seed=123,
    color_mode="rgb",
    image_size=(img_height, img_width),
    batch_size=batch_size,
)
test_data = ak.image_dataset_from_directory(
    data_dir + "/val",
    # validation_split=0.2,  # same comment as above
    subset="validation",
    # seed=123,  # same comment as above
    color_mode="rgb",
    image_size=(img_height, img_width),
    batch_size=batch_size,
)
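The validation_split, subset and seed arguments only matter when all your images are in a single folder. The idea behind them can be sketched in plain Python as below. This is an illustration of the technique, not AutoKeras's actual implementation, and split_paths is a hypothetical helper: shuffle the file list with a fixed seed, then hand back either the training slice or the validation slice.

```python
import random


def split_paths(paths, validation_split, subset, seed):
    """Shuffle deterministically, then return the 'training' or 'validation' slice."""
    if subset not in ("training", "validation"):
        raise ValueError("subset must be 'training' or 'validation'")
    shuffled = sorted(paths)  # start from a stable order
    random.Random(seed).shuffle(shuffled)  # same seed -> same shuffle
    n_val = int(len(shuffled) * validation_split)
    n_train = len(shuffled) - n_val
    if subset == "validation":
        return shuffled[n_train:]
    return shuffled[:n_train]
```

Because both calls shuffle with the same seed, split_paths(paths, 0.2, "training", seed=123) and split_paths(paths, 0.2, "validation", seed=123) never overlap. With different seeds the two slices could share images, which is why the seed has to match across the two loading calls.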
Next we will start our classification process and run for 1 epoch.
import time
start=time.perf_counter()
clf = ak.ImageClassifier(overwrite=True, max_trials=1)
clf.fit(train_data, epochs=1)
print(clf.evaluate(test_data))
end=time.perf_counter()
s = (end-start)
print(f"Elapsed {s:.03f} secs.")
This is where the issues start with the conda_python3 kernel on Amazon SageMaker Notebook: the cell above fails with an error.
So much for auto anything! The AutoKeras documentation says that data is accepted in (size, size, channels) or (channels, size, size) format, and this error clearly shows that it is expecting channels first, as it compares the minimum size of 32 with 3 and freaks out. How do I fix this? If I’m looking for auto anything, I will give up and move on to something that works out of the box. Let’s understand what is going on and what you can do anyway.
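To make the mismatch concrete, here is a small pure-Python sketch of how the same shape tuple is read under the two data formats. The function name and the MIN_SIZE constant are mine; the value 32 is simply the minimum size mentioned in the error.

```python
MIN_SIZE = 32  # minimum spatial size mentioned in the error


def spatial_size(shape, data_format):
    """Return (height, width) of one image shape under a Keras-style data format."""
    if len(shape) != 3:
        raise ValueError("expected a 3-D image shape")
    if data_format == "channels_last":
        height, width, _channels = shape
    elif data_format == "channels_first":
        _channels, height, width = shape
    else:
        raise ValueError(f"unknown data format: {data_format}")
    return height, width


image_shape = (640, 640, 3)  # what img1.shape reports
for fmt in ("channels_last", "channels_first"):
    h, w = spatial_size(image_shape, fmt)
    verdict = "ok" if min(h, w) >= MIN_SIZE else "too small"
    print(fmt, (h, w), verdict)
```

Read as channels_last, the image is 640 by 640. Read as channels_first, the trailing 3 is taken as the width, which is below the minimum of 32.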
With AutoKeras installed via pip, there is a folder called .keras in your home directory, and inside it is a file called keras.json.
In my case keras.json had image_data_format set to channels_first, hence the error. Change that to channels_last and restart your notebook kernel.
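If you would rather not edit the file by hand, the change can be scripted. This is a minimal sketch using only the standard library; set_image_data_format is a hypothetical helper name, and it assumes keras.json is a flat JSON object with an image_data_format key, as described above.

```python
import json
from pathlib import Path


def set_image_data_format(config_path, data_format="channels_last"):
    """Rewrite image_data_format in a keras.json file and return the previous value."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text())
    previous = config.get("image_data_format")
    config["image_data_format"] = data_format  # other keys are left untouched
    config_path.write_text(json.dumps(config, indent=4))
    return previous
```

In the notebook you would call set_image_data_format(Path.home() / ".keras" / "keras.json") and then restart the kernel so that Keras picks up the new setting.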
Once this is complete, you can run all the cells again, starting from import autokeras as ak.
You will then get either a cuDNN not found error or W tensorflow/core/framework/op_kernel.cc:1818] RESOURCE_EXHAUSTED: failed to allocate memory.
At this point I switched my kernel to conda_tensorflow2_p38 and it worked. It looks like AutoKeras needed a GPU-enabled TensorFlow installation and may not have installed it from the right wheel. I’m not sure; I didn’t bother to look too deep, as I was putting myself in the shoes of an AutoML user.
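You can generate predictions on new data by passing the images themselves to predict, as a NumPy array or a tf.data.Dataset rather than a file path:
clf.predict(new_images)  # new_images is a placeholder, e.g. an array of shape (n, 640, 640, 3)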
So, to answer one of our questions: can I just pip install and not muck around with settings or debug the install? No.
Once resolved, it was easy to use, but we will hold judgement until we have evaluated AutoGluon.
Time to evaluate AutoGluon.
AutoGluon (5.2k stars on GitHub)
Gluon is a high-level neural network library that runs on top of MXNet, with heavy contributions from Amazon. AutoGluon, as the name suggests, is the AutoML version of the Gluon library. I had to put aside my personal opinion of MXNet, namely that it is not as popular as TensorFlow, PyTorch, FastAI or anything else for that matter. It is hard to find answers on Stack Overflow when you hit a wall, unlike with the others, which have a larger active user base. That said, AutoGluon boasts a number of features that very few open source alternatives can match. For example, AutoKeras doesn’t support object detection and AutoGluon does, but that is for another post. So, moving on, let’s get to it.
AutoGluon installation instructions can be found here: https://auto.gluon.ai/stable/install.html.
On SageMaker Notebook Instance (needs AWS account)
I used an updated version of SageMaker Studio, which has a super cool UI compared to Google Colab.
I used an ml.g4dn.xlarge instance with the Base Python 2.0 image (Python 3.8).
Installation on SageMaker Studio
!pip3 install -U pip
!pip3 install -U setuptools wheel
# Install the proper version of PyTorch following https://pytorch.org/get-started/locally/
!pip3 install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchtext==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu113
!pip3 install autogluon
At this point, running the imports gave me a libgl1 error.
Fix
Run the following to fix it
!apt-get update && apt-get -y install libgl1
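To confirm that the fix took effect before re-running the imports, you can ask Python whether it can locate libGL, the shared library that the libgl1 package provides. This is only a rough diagnostic sketch: has_libgl is my own helper name, and ctypes.util.find_library relies on system tools, so it can miss libraries installed in unusual locations.

```python
import ctypes.util


def has_libgl():
    """Return True if the system can locate libGL (provided by the libgl1 package)."""
    return ctypes.util.find_library("GL") is not None


if has_libgl():
    print("libGL found")
else:
    print("libGL missing: install libgl1 before importing autogluon")
```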
Installation on Google Colab
I ran the following code on Google Colab for installation:
!pip3 install -U pip
!pip3 install -U setuptools wheel
# Install the proper version of PyTorch following https://pytorch.org/get-started/locally/
!pip3 install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchtext==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu113
!pip3 install autogluon
I downloaded the dataset as in the SageMaker Notebook case, but altered the download step so that it works on both Colab and SageMaker Studio.
!wget https://ml-materials.s3.amazonaws.com/junctions-data.tar.gz
!tar -xzf junctions-data.tar.gz . --no-same-owner
import autogluon as ag
from autogluon.vision import ImageDataset
import pandas as pd
from autogluon.vision import ImagePredictor
Can I use an organized folder structure to let the library load and label my data? Yes.
train_dataset, val_dataset, test_dataset = ImageDataset.from_folders('./data')
print(train_dataset)
Can we start training without much effort from here?
predictor = ImagePredictor()
predictor.fit(train_dataset, hyperparameters={'epochs': 1})
You can see in the output that ImagePredictor will be deprecated in favor of MultiModalPredictor, so I also tested MultiModalPredictor in the same Colab notebook.
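The MultiModalPredictor cell itself is not reproduced in this text. As a rough sketch of what the equivalent training call might look like, assuming the train_dataset loaded above behaves like a DataFrame with an image column of file paths and a label column: train_multimodal is a hypothetical wrapper, and I use time_limit (in seconds) rather than an epoch count, so this is not an exact equivalent of the one-epoch run above.

```python
def train_multimodal(train_df, label="label", time_limit=600):
    """Fit a MultiModalPredictor on a table of image paths and labels."""
    # Imported lazily so this cell can be defined before AutoGluon is installed.
    from autogluon.multimodal import MultiModalPredictor

    # Assumption: `label` names the column that holds the class of each image.
    predictor = MultiModalPredictor(label=label)
    predictor.fit(train_data=train_df, time_limit=time_limit)
    return predictor
```

In the notebook this would be called as predictor = train_multimodal(train_dataset), with a smaller time_limit if you only want a quick smoke test.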
You can generate predictions on new images with the trained MultiModalPredictor model using the code below:
proba = predictor.predict_proba({'image': [image_path]})
print(proba)
So, to answer one of our questions: can I just pip install and not muck around with settings or debug the install? No.
SageMaker Studio Lab (Free)
Install was quick, although I couldn’t get a GPU runtime as fast as I could on Google Colab. I settled for a CPU runtime, followed the installation instructions and successfully installed AutoGluon. However, the import results in an error.
In conclusion, neither of them worked with just pip install, but AutoKeras wins on ease of installation: it took relatively little effort on SageMaker Notebooks, whereas for AutoGluon I needed to switch to SageMaker Studio Lab (with no success) and to Google Colab (with success). Both modules deliver on the promise of easy dataset loading and of quickly starting a classifier with low effort. On top of local folders, AutoGluon supports S3 URLs like the one below for loading your data, which is very convenient for large datasets.
download_dir = 'https://ml-materials.s3.amazonaws.com/junctions-data.tar.gz'
train_data,_, test_data = ImageDataset.from_folders(download_dir)
As long as you are willing to suffer through some installation quirks, you can get by using either of these modules. As to the quality of the models themselves, I will update this post when I have trained these models long enough.