Unlocking the Power of Microsoft's Phi-3 Vision Model for OCR
Welcome Fellow Learners
Microsoft has recently released a state-of-the-art, open-source model called Phi-3 Vision. This model, part of Microsoft's Phi-3 family, can handle both vision and text data thanks to its multimodal capabilities. It supports a context length of 128K tokens and, being a relatively small model, offers significant latency and compute benefits. You can find it on popular model repositories such as Hugging Face, and it is designed for three primary use cases: general image understanding, OCR (Optical Character Recognition), and chart and table understanding.
Our use case will be OCR, specifically extracting text from an invoice image. Let's get started!
Steps to Implement the Phi-3 Vision Model
Step 1: Setting Up the Environment
First, we need to install all the required libraries. Open Visual Studio Code and create a new project folder. Inside this folder, create three files: requirements.txt, main.py, and .env.
requirements.txt
numpy
Pillow
requests
torch
torchvision
transformers
accelerate
python-dotenv
.env
API_KEY=your_hugging_face_api_key_here
You can get an API key by creating an account on the Hugging Face website and navigating to the Access Tokens section in your settings.
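Before moving on, it's worth confirming that the key is actually picked up. Here is a quick sanity-check sketch (it assumes the API_KEY variable name used in the .env file above):

from dotenv import load_dotenv
import os

load_dotenv()
assert os.getenv("API_KEY"), "API_KEY not found - check your .env file"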
main.py
import os
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
API_KEY = os.getenv('API_KEY')

# Set the model and processor
# trust_remote_code is required because Phi-3 Vision ships custom modelling code on the Hub
model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    token=API_KEY,
    torch_dtype="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Define the prompt and load the image
# <|image_1|> is the placeholder the model expects for the first image in the prompt
messages = [
    {"role": "user", "content": "<|image_1|>\nProvide OCR for all the text in the given image in markdown format."}
]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

image_url = "https://example.com/invoice.jpg"  # Replace with your image URL
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))

# Process the input text and image into model-ready tensors
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Generate the output and decode it back to text
outputs = model.generate(**inputs, max_new_tokens=500)
result = processor.decode(outputs[0], skip_special_tokens=True)
print(result)
Step 2: Install the Libraries
Open your terminal in Visual Studio Code and run:
pip install -r requirements.txt
Step 3: Running the Script
After installing the required libraries, run your script using:
python main.py
Understanding the Implementation
Environment Setup:
- We first set up our environment by installing the necessary libraries listed in requirements.txt.
- We then load our Hugging Face API key from the .env file.
Model and Processor Initialization:
- We use AutoModelForCausalLM to load the Phi-3 Vision model from Hugging Face; a GPU-loading variant is sketched below.
- The AutoProcessor is used to handle both text and image data.
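If you have a CUDA GPU, you can also place the model on it and load the weights in half precision to speed things up. This is a minimal sketch rather than part of the script above; device_map and torch_dtype are standard from_pretrained arguments, and device_map relies on the accelerate package already listed in requirements.txt:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",       # place the model on the GPU (requires accelerate)
    torch_dtype="auto",      # load weights in the dtype they were saved in
    trust_remote_code=True,  # Phi-3 Vision ships custom modelling code
)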
Prompt and Image:
- We define a prompt that describes the OCR task; the <|image_1|> placeholder tells the model where the image belongs.
- We load an image from a URL. You can replace the image URL with any publicly accessible URL, or open a local file as in the sketch below.
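If your invoice is stored locally rather than behind a URL, you can open it directly with Pillow instead of downloading it. A small sketch, where invoice.jpg is a hypothetical local file name:

from PIL import Image

# Open a local file instead of fetching it over HTTP
image = Image.open("invoice.jpg").convert("RGB")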
Processing and Generating Output:
- The input text and image are processed into tensors the model accepts.
- The model then generates the output, which is decoded and printed; see the note below on trimming the prompt tokens from the result.
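Note that generate() returns the prompt tokens followed by the newly generated tokens, so the decoded string will echo the prompt. A common refinement, shown here as a sketch rather than part of the original script, is to slice the prompt off before decoding:

# Keep only the tokens generated after the prompt, then decode them
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
result = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(result)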
Conclusion
In this blog post, we demonstrated how to use Microsoft's Phi-3 Vision model for an OCR task. We walked through setting up the environment, initializing the model, processing inputs, and generating output. The model is highly versatile and can also be used for general image understanding and chart/table understanding.
For more details on the Phi-3 Vision model, you can refer to its Hugging Face page.
Thanks for following along, and happy coding! Until next time!