VisualBERT
This model was released on 2019-08-09 and added to Hugging Face Transformers on 2021-06-02.
VisualBERT is a vision-and-language model. It uses an approach called "early fusion": text and image inputs are fed together into a single Transformer stack initialized from BERT, and self-attention implicitly aligns words with their corresponding image regions. Instead of raw pixels, the model consumes visual features extracted from object-detector regions.
You can find all the original VisualBERT checkpoints under the UCLA NLP organization.
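As a quick illustration of the fused input format, the sketch below runs the bare `VisualBertModel` with randomly generated region features standing in for a real object detector. The region count of 36 and feature size of 2048 are assumptions matching the `uclanlp/visualbert-vqa-coco-pre` checkpoint's `visual_embedding_dim`.

```python
import torch
from transformers import AutoTokenizer, VisualBertModel

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# Text part of the fused sequence
inputs = tokenizer("A cat sits on a mat.", return_tensors="pt")

# Stand-in for detector region features: (batch, num_regions, visual_embedding_dim)
visual_embeds = torch.randn(1, 36, 2048)

outputs = model(
    **inputs,
    visual_embeds=visual_embeds,
    visual_token_type_ids=torch.ones(visual_embeds.shape[:-1], dtype=torch.long),
    visual_attention_mask=torch.ones(visual_embeds.shape[:-1], dtype=torch.float),
)

# Hidden states span the text tokens followed by the visual "tokens"
print(outputs.last_hidden_state.shape)  # (1, text_len + 36, hidden_size)
```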
The example below demonstrates how to answer a question about an image with `VisualBertForQuestionAnswering`, using a simple ResNet-50 feature extractor in place of a full object detector.
```python
import torch
import torchvision
import requests
from io import BytesIO
from PIL import Image
from transformers import AutoTokenizer, VisualBertForQuestionAnswering

def get_visual_embeddings_simple(image, device=None):
    # ResNet-50 with its classification head removed serves as a simple feature extractor
    model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    model = torch.nn.Sequential(*list(model.children())[:-1])
    model.to(device)
    model.eval()

    transform = torchvision.transforms.Compose([
        torchvision.transforms.Resize(256),
        torchvision.transforms.CenterCrop(224),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])

    if isinstance(image, str):
        image = Image.open(image).convert("RGB")
    elif isinstance(image, Image.Image):
        image = image.convert("RGB")
    else:
        raise ValueError("Image must be a PIL Image or a path to an image file")

    image_tensor = transform(image).unsqueeze(0).to(device)

    with torch.no_grad():
        features = model(image_tensor)

    # Tile the single pooled feature into a short sequence of "region" embeddings
    batch_size = features.shape[0]
    feature_dim = features.shape[1]
    visual_seq_length = 10

    visual_embeds = features.squeeze(-1).squeeze(-1).unsqueeze(1).expand(batch_size, visual_seq_length, feature_dim)

    return visual_embeds

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = VisualBertForQuestionAnswering.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

response = requests.get("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
image = Image.open(BytesIO(response.content))

visual_embeds = get_visual_embeddings_simple(image)

inputs = tokenizer("What is shown in this image?", return_tensors="pt")

# The visual part gets its own segment ids and attention mask
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

inputs.update({
    "visual_embeds": visual_embeds,
    "visual_token_type_ids": visual_token_type_ids,
    "visual_attention_mask": visual_attention_mask,
})

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_answer_idx = logits.argmax(-1).item()

print(f"Predicted answer: {predicted_answer_idx}")
```

- Use a fine-tuned checkpoint for downstream tasks, like `visualbert-vqa` for visual question answering. Otherwise, use one of the pretrained checkpoints.
- The fine-tuned detector and its weights aren't provided as part of the package, but they are available in the research projects, and the states can be loaded directly into the provided detector.
- The text input is concatenated in front of the visual embeddings in the embedding layer and is expected to be bound by `[CLS]` and `[SEP]` tokens.
- The segment ids must be set appropriately for the text and visual parts; the sketch after this list shows the convention.
- Use `BertTokenizer` to encode the text and implement a custom detector/image processor to get the visual embeddings.
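As a minimal sketch of those conventions: the tokenizer assigns segment id 0 to the bracketed text part, while the visual part is marked with ones. The random `visual_embeds` tensor below is a placeholder for real detector features, and its shape `(1, 36, 2048)` is an assumption matching the `visualbert-vqa-coco-pre` checkpoint.

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

# Text part: BertTokenizer adds [CLS] and [SEP] and assigns segment id 0
inputs = tokenizer("What is shown in this image?", return_tensors="pt")
print(inputs["token_type_ids"])  # tensor of zeros

# Visual part: segment id 1 for every region, with a matching attention mask
visual_embeds = torch.randn(1, 36, 2048)  # placeholder for detector features
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
```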
Resources
- Refer to this notebook for an example of using VisualBERT for visual question answering.
- Refer to this notebook for an example of how to generate visual embeddings.
VisualBertConfig
[[autodoc]] VisualBertConfig
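A brief sketch of the standard config-to-model pattern (the `visual_embedding_dim` value here is illustrative):

```python
from transformers import VisualBertConfig, VisualBertModel

# Configuration with an illustrative visual feature dimension
configuration = VisualBertConfig(visual_embedding_dim=512)

# Model with random weights built from that configuration
model = VisualBertModel(configuration)

# The configuration can be read back from the model
configuration = model.config
```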
VisualBertModel
[[autodoc]] VisualBertModel
    - forward
VisualBertForPreTraining
[[autodoc]] VisualBertForPreTraining
    - forward
VisualBertForQuestionAnswering
[[autodoc]] VisualBertForQuestionAnswering
    - forward
VisualBertForMultipleChoice
[[autodoc]] VisualBertForMultipleChoice
    - forward
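The multiple-choice head expects an extra choice dimension on every input. A rough sketch, assuming the `uclanlp/visualbert-vcr` checkpoint, 512-dimensional region features (an assumption matching that checkpoint's config), and random features in place of a detector:

```python
import torch
from transformers import AutoTokenizer, VisualBertForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = VisualBertForMultipleChoice.from_pretrained("uclanlp/visualbert-vcr")

prompt = "What is the person doing?"
choices = ["They are cooking.", "They are reading."]

# Encode the prompt paired with each choice, then add the batch dimension:
# tensors become (batch_size=1, num_choices=2, seq_len)
encoding = tokenizer([prompt] * len(choices), choices, return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}

# Random stand-in features, shaped (batch, num_choices, regions, visual_embedding_dim)
visual_embeds = torch.randn(1, len(choices), 36, 512)
inputs.update({
    "visual_embeds": visual_embeds,
    "visual_token_type_ids": torch.ones(visual_embeds.shape[:-1], dtype=torch.long),
    "visual_attention_mask": torch.ones(visual_embeds.shape[:-1], dtype=torch.float),
})

logits = model(**inputs).logits  # one score per choice
print(logits.argmax(-1))
```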
VisualBertForVisualReasoning
[[autodoc]] VisualBertForVisualReasoning
    - forward
VisualBertForRegionToPhraseAlignment
[[autodoc]] VisualBertForRegionToPhraseAlignment
    - forward