“From Image Recognition to Visual Question Answering: The Evolution of AI”
As we continue to explore the vast and ever-evolving landscape of artificial intelligence, it is important to take note of the subtle yet impactful trends that have the potential to revolutionize the field. One such trend, Visual Question Answering, may not receive the same level of attention as flashier developments such as DALL-E 2 or Midjourney, but it holds a wealth of potential for a wide range of applications. From image captioning and forensic analysis to educational tools and accessibility, Visual Question Answering has the power to change the way we interact with and understand visual information. In this piece, I will walk through a quick implementation of this powerful and practical advancement in AI and look at the ways in which it can be harnessed to improve our daily lives.
It was not too long ago that we were challenged to develop models that could accurately classify, detect, and count objects within an image. In those days, we were content with even marginal success. However, as time progressed, we became increasingly ambitious, using our creativity and imagination to push the boundaries of what was possible. Now, we find ourselves on the brink of an AI revolution, as evidenced by the words of one of our esteemed clients at Rekor.ai:
“If you want to identify a trend, get in front of it and let it roll over you.”
That is exactly what I have been up to: letting some trends roll over me and chasing after others. There are definitely a lot of them to keep up with, so I’m happy with that combination.
Visual Question Answering: What is it?
Simply put, given an image, you can pose questions about its content, the quantity of that content, the colors present in the image, or the actions being performed in it.
Let’s look at some example runs using Hugging Face Transformers.
Example Scenario 1:
Taking the role of an analyst working for a crime division, suppose we have the intersection below, covered by street cams or traffic cams, and we are looking for a red car and a green car passing at the same time. A traditional object detection model will give us frames of cars detected, or cars detected by make and model, and so on. Any further probing, such as whether the green car was ahead of the red car, could involve human or machine hours.
We can give this a try with a pretrained VQA model and see what happens.
Running this sample code inside a SageMaker Studio notebook takes just a couple of pip installs, and we are off to a start.
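For reference, the installs typically look something like the following from inside the notebook (exact packages may vary by environment; the pipeline needs a deep learning backend such as PyTorch, and PIL comes from the Pillow package):

!pip install transformers torch Pillow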
from PIL import Image
from transformers import pipeline

# Load the default visual-question-answering pipeline from Hugging Face
vqa_pipeline = pipeline("visual-question-answering")

# Open the aerial street image and ask a yes/no question about it
image = Image.open("./city-street-top-view-aerial-wallpaper.jpg")
question = "is there a green car?"

# top_k=1 returns only the single highest-scoring answer
vqa_pipeline(image, question, top_k=1)
You can change top_k=1 to top_k=n, where n decides how many answers you get back. If the question is yes/no, top_k=1 is enough; if the question is something like “how many colors are there in the image?”, use top_k=6 or more.
Output: Predicted answer: yes
Another text prompt, using the same code as above:
Text Prompt: is the red car between green car and yellow car?
Output: yes
Text Prompt: what colors are in the image?
Output
[{'score': 0.3806068003177643, 'answer': 'gray'},
{'score': 0.09111639112234116, 'answer': 'many'},
{'score': 0.05836665630340576, 'answer': 'white'},
{'score': 0.053519248962402344, 'answer': 'brown'},
{'score': 0.05041731894016266, 'answer': 'beige'},
{'score': 0.049074362963438034, 'answer': 'multiple'}]
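That list came from the same pipeline with a higher top_k. A minimal sketch, reusing the pipeline and image already loaded above:

# Ask an open-ended question and return the six highest-scoring answers
question = "what colors are in the image?"
vqa_pipeline(image, question, top_k=6)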
Image Captioning
Let’s look at a different paradigm of the same application: say we want to provide a caption for an image. With a single-modality model, you get a bunch of predictions but not necessarily the connections between them.
Using the same image of the city street above and an image captioning model from Hugging Face, we get the output description “a street scene with a car and a bus”.
from transformers import pipeline

# Load an image-to-text (captioning) pipeline backed by the ViT-GPT2 model
image_to_text = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
# Pass an image path; the pipeline returns a generated caption
image_to_text("city-street-top-view-aerial-wallpaper.jpg")
Output
[{'generated_text': 'a street scene with a car and a bus '}]
A different image, generated by DALL-E 2, produces the caption “a person on a horse in the desert”. It is actually an astronaut, but pretty close.
Output
[{'generated_text': 'a person on a horse in the desert '}]
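The same captioning pipeline can simply be pointed at that generated image. A minimal sketch, assuming the DALL-E 2 output was saved locally (the filename below is a placeholder):

# "dalle2-astronaut-horse.png" is a hypothetical filename for the locally saved DALL-E 2 image
image_to_text("dalle2-astronaut-horse.png")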
Conclusion
We took some pretrained models and a couple of sample images and were able to query an image about its content as well as generate captions for images.
We just delved into the world of multimodal models, which can seamlessly blend image and text inputs to generate captions and answer queries about an image’s content. The potential of these models is truly staggering, as they pave the way for even more advanced interactions and capabilities, such as incorporating audio inputs. The thought of combining the power of ChatGPT, DALL-E 2, and VALL-E into a single offering, eliminating the need for databases and indexing and reducing network traffic as we use these models at the edge, is nothing short of revolutionary. However, this powerful trend often goes unnoticed amid the flashier developments in AI. It’s time to look beyond the noise and embrace the future of multimodal models, as they have the potential to revolutionize the way we interact with and understand information.
A picture is worth a zillion words.