Unlocking the Power of Microsoft's Phi-3 Vision Multimodal Model for OCR

Welcome Fellow Learners

Microsoft has recently released a state-of-the-art, open-source model called Phi-3 Vision. This model, part of Microsoft's Phi-3 family, can handle both vision and text data thanks to its multimodal capabilities. It supports a context length of 128k tokens, and its compact size offers significant latency and compute benefits. You can find the model on Hugging Face, and it is designed for three primary use cases: general image understanding, OCR (Optical Character Recognition), and chart and table understanding.

Our use case will be OCR, specifically extracting text from an invoice image. Let's get started!

Steps to Implement the Phi-3 Vision Model

Step 1: Setting Up the Environment

First, we need to install all the required libraries. Open Visual Studio Code and create a new project folder. Inside this folder, create three files: requirements.txt, main.py, and .env.

requirements.txt

numpy
Pillow
requests
torch
torchvision
transformers
accelerate
python-dotenv

.env

API_KEY=your_hugging_face_api_key_here

You can get an API key from the Hugging Face website by creating an account and navigating to the Access Tokens section of your account settings.

main.py

import os
from io import BytesIO

import requests
from dotenv import load_dotenv
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load environment variables
load_dotenv()
API_KEY = os.getenv("API_KEY")

# Load the Phi-3 Vision model and its processor from Hugging Face
model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    token=API_KEY,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    _attn_implementation="eager",  # switch to "flash_attention_2" if flash-attn is installed
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, token=API_KEY)

# Define the OCR prompt; <|image_1|> is the placeholder for the attached image
messages = [
    {"role": "user", "content": "<|image_1|>\nProvide OCR for all the text in the given image in markdown format."}
]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Download the invoice image
image_url = "https://example.com/invoice.jpg"  # Replace with your image URL
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))

# Process the text and image into model inputs
inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

# Generate the output and decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=1000, eos_token_id=processor.tokenizer.eos_token_id)
outputs = outputs[:, inputs["input_ids"].shape[1]:]
result = processor.batch_decode(outputs, skip_special_tokens=True)[0]

print(result)

Step 2: Install the Libraries

Open your terminal in Visual Studio Code and run:

pip install -r requirements.txt
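
Before moving on, you may want to confirm that PyTorch installed correctly and can see a GPU, since running the model on CPU will be noticeably slower. Here is a quick, optional sanity check (the check_gpu.py file name is just for illustration):

# check_gpu.py -- optional sanity check using the torch package from requirements.txt
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))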

Step 3: Running the Script

After installing the required libraries, run your script using:

python main.py
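
If you want to keep the extracted text rather than just print it, one option is to append a couple of lines to the end of main.py that write the result to a file. This is a small sketch that assumes the result variable from the script above; the invoice_ocr.md file name is arbitrary:

# Optional: append to the end of main.py to save the OCR markdown to a file
with open("invoice_ocr.md", "w", encoding="utf-8") as f:
    f.write(result)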

Understanding the Implementation

  1. Environment Setup:

    • We first set up our environment by installing the necessary libraries listed in requirements.txt.
    • We then load our Hugging Face API key from the .env file.
  2. Model and Processor Initialization:

    • We use AutoModelForCausalLM to load the Phi-3 Vision model from Hugging Face (with trust_remote_code=True, since the model ships custom code).
    • The AutoProcessor is used to handle both text and image data.
  3. Prompt and Image:

    • We define a prompt that describes the OCR task, using the <|image_1|> placeholder to reference the attached image, and format it with the processor's chat template.
    • We load an image from a URL. You can replace the image URL with any publicly accessible URL, or load a local file instead (see the sketch after this list).
  4. Processing and Generating Output:

    • The input text and image are processed into a format acceptable by the model.
    • The model then generates the output, which is decoded and printed.
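
If your invoice is stored locally rather than at a public URL, you can swap the download step in main.py for a local file read. This is a minimal variant that assumes a file named invoice.jpg sits next to main.py:

# Replace the "Download the invoice image" block in main.py with a local read
from PIL import Image

image = Image.open("invoice.jpg").convert("RGB")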

Conclusion

In this blog post, we demonstrated how to implement Microsoft's Phi-3 Vision model for OCR tasks in a local Python environment using Visual Studio Code. We walked through setting up the environment, initializing the model, processing inputs, and generating outputs. This model is highly versatile and can be used for various tasks like image understanding, OCR, and chart/table understanding.

For more details on the Phi-3 Vision model, you can refer to its Hugging Face page.

Thanks for following along, and happy coding! Until next time!
