Exploring Hugging Face and Microsoft’s Vision Models

Hello everyone! As many of you might already know, Hugging Face is one of the most significant platforms available today for AI enthusiasts and professionals. It hosts a vast collection of state-of-the-art models that can be leveraged to solve a wide range of AI use cases. In today's blog post, we will delve into two topics you should be familiar with if you are getting into the data science industry:

  1. How to call any open-source model from the Hugging Face Hub.
  2. Microsoft's Phi-3 Vision, a new multimodal model that processes both images and text.

Let's break down these topics step by step.
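As a quick taste of the first topic, here is a minimal sketch of calling an open-source model from the Hub with the transformers pipeline API. The sentiment-analysis task, the model ID, and the sample sentence are purely illustrative; the same pattern works for any model on the Hub with a supported task:

from transformers import pipeline

# Any model ID from the Hub can be passed here; this one is just an example
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# The first call downloads and caches the model weights locally
print(classifier("Hugging Face makes working with open-source models easy!"))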

Understanding Phi-3 Vision

Phi-3 Vision is a groundbreaking multimodal model by Microsoft. It has 4.2 billion parameters, combines language and vision capabilities, and supports a 128K-token context window, which means it can reason over images and text together. It is especially optimized for understanding charts and diagrams, generating insights, and answering questions grounded in visual data.

Features of Phi-3 Vision:

  • Multimodal Capabilities: Integrates text and image processing.
  • Compact but Capable: At 4.2 billion parameters, it is small enough to run on a single GPU yet equipped to handle complex tasks.
  • Optimized for Charts and Diagrams: Specifically designed for extracting and reasoning over information from visual data.

Setting Up Your Environment

To get started, you'll need to set up your environment. Here's a step-by-step guide on how to do this using Visual Studio Code (VS Code).

1. Install Required Libraries

First, create a requirements.txt file to list all the necessary libraries (note that flash-attn requires an NVIDIA GPU with CUDA support):

torch
transformers
accelerate
bitsandbytes
flash-attn
requests
Pillow
python-dotenv

2. Create a Python Script

Create a Python script, for example main.py, where you will write your code.

3. Obtain API Key

To access models from the Hugging Face Hub, you need an access token (API key). You can create one by signing up on the Hugging Face website and going to Settings → Access Tokens in your profile.
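If you prefer to authenticate programmatically rather than only through an environment variable, the huggingface_hub library (installed as a dependency of transformers) provides a login helper. This is a minimal sketch; it assumes your token lives in the HF_API_KEY variable defined in the .env file created in the next step:

import os
from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()                         # reads HF_API_KEY from the .env file
login(token=os.getenv("HF_API_KEY"))  # registers the token with the Hub client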

4. Create a .env File

Create a .env file to store your API key securely:

HF_API_KEY=your_api_key_here

5. Write the Code

Here's a step-by-step breakdown of the code:

Import Libraries

import os
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests
from dotenv import load_dotenv

load_dotenv()

# Set your API key
os.environ["HUGGING_FACE_HUB_TOKEN"] = os.getenv("HF_API_KEY")

Load the Model and Processor

model_id = "microsoft/Phi-3-vision-128k-instruct"
# trust_remote_code is required because the model ships custom code on the Hub
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

Define the Image and Text Prompts

# Download a sample chart image and open it with PIL
image_url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(requests.get(image_url, stream=True).raw)

# Define roles and prompts
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "The chart displays the percentage of respondents who agree with various statements about their preparedness for meetings. It shows five categories: 'Having clear and pre-defined goals for meetings', 'Knowing where to find the information I need for a meeting', 'Understanding my exact role and responsibilities when I'm invited', 'Having tools to manage admin tasks like note-taking or summarization', and 'Having more focus time to sufficiently prepare for meetings'. Each category has an associated bar indicating the level of agreement, measured on a scale from 0% to 100%."},
    {"role": "user", "content": "Provide insightful questions to spark discussion."}
]

Encode the Image and Text

# Build the chat prompt from the messages, then preprocess text and image together
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

Generate the Response

generation_args = {
    "max_new_tokens": 500,  # upper bound on the length of the generated answer
    "temperature": 0.0,     # has no effect while do_sample is False
    "do_sample": False,     # greedy decoding for deterministic output
}

# Generate the model's answer
generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# Remove input tokens
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(response)

6. Running the Code

To run your code, ensure you have your virtual environment activated and all dependencies installed:

pip install -r requirements.txt
python main.py
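
If you are starting from scratch, a typical end-to-end setup might look like the sketch below (the .venv folder name is just a convention, and the activation command differs on Windows):

python -m venv .venv                 # create an isolated environment
source .venv/bin/activate            # on Windows: .venv\Scripts\activate
pip install -r requirements.txt      # install the dependencies listed earlier
python main.py                       # run the script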

Conclusion

By following these steps, you can call any open-source model from the Hugging Face Hub and leverage powerful models like Microsoft’s Phi-3 Vision for your AI projects. Whether you are working with text or images, these models provide a solid foundation for building sophisticated AI applications.

Hugging Face offers extensive documentation and examples for various models, making it easier for developers to integrate cutting-edge AI into their projects. So, dive in, explore, and start building!

We hope you found this guide helpful. Stay tuned for more insights and tutorials. Happy coding!


Feel free to reach out with any questions or feedback. Until next time, take care and keep exploring the exciting world of AI!
