MLCD
This model was released on 2024-07-24 and added to Hugging Face Transformers on 2025-04-15.
Overview
The MLCD models were released by the DeepGlint-AI team in unicom, a project that focuses on building foundational visual models for large multimodal language models using large-scale datasets such as LAION400M and COYO700M, and that employs sample-to-cluster contrastive learning to optimize performance. MLCD models are primarily used as vision encoders for multimodal large language models such as LLaVA.
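At a high level, sample-to-cluster contrastive learning scores each image embedding against a bank of cluster centroids rather than against other individual samples. The following is a minimal, conceptual PyTorch sketch of that idea only; the function name, the single-label cluster assignment, and the temperature value are illustrative assumptions and do not reproduce the authors' multi-label, large-scale training implementation.

```python
import torch
import torch.nn.functional as F

def sample_to_cluster_loss(image_embeds, cluster_centers, cluster_ids, temperature=0.07):
    """Contrast each image embedding against a bank of cluster centroids.

    image_embeds:    (batch, dim) L2-normalized image features
    cluster_centers: (num_clusters, dim) L2-normalized cluster centroids
    cluster_ids:     (batch,) index of the cluster assigned to each image
    """
    # Cosine similarities between every sample and every cluster centroid.
    logits = image_embeds @ cluster_centers.t() / temperature
    # The assigned cluster acts as the positive; all other centroids are negatives.
    return F.cross_entropy(logits, cluster_ids)

# Toy example with random tensors, purely for illustration.
embeds = F.normalize(torch.randn(8, 512), dim=-1)
centers = F.normalize(torch.randn(1000, 512), dim=-1)
ids = torch.randint(0, 1000, (8,))
loss = sample_to_cluster_loss(embeds, centers, ids)
```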
The 🔥MLCD-ViT-bigG🔥 series is a state-of-the-art vision transformer enhanced with 2D Rotary Position Embedding (RoPE2D), achieving superior performance on document understanding and visual question answering tasks. Developed by DeepGlint AI, the model demonstrates exceptional capabilities in processing complex visual-language interactions.
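RoPE2D extends rotary position embeddings to the 2D patch grid of a vision transformer by rotating one subset of query/key channels with the patch row index and the other subset with the column index, so attention scores become sensitive to 2D relative offsets. The sketch below illustrates this idea for a single attention head; the function name, channel split, and frequency base are assumptions for illustration and are not taken from the MLCD implementation.

```python
import torch

def rope_2d(x, grid_h, grid_w, base=10000.0):
    """Rotate query/key channels by 2D patch coordinates (a minimal RoPE2D sketch).

    x: tensor of shape (num_patches, dim) for a single attention head, where
       num_patches == grid_h * grid_w and dim is divisible by 4.
    Half of the rotary pairs are driven by the patch row index, the other half
    by the column index.
    """
    num_patches, dim = x.shape
    assert num_patches == grid_h * grid_w and dim % 4 == 0
    half = dim // 2
    # One frequency per rotary pair inside each half of the channels.
    freqs = 1.0 / (base ** (torch.arange(0, half, 2, dtype=torch.float32) / half))
    rows = torch.arange(grid_h).repeat_interleave(grid_w).float()  # row index of each patch
    cols = torch.arange(grid_w).repeat(grid_h).float()             # column index of each patch
    angles = torch.cat([rows[:, None] * freqs, cols[:, None] * freqs], dim=-1)  # (num_patches, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]  # interleaved rotary pairs
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)

# A 448px input with 14px patches gives a 32 x 32 patch grid.
q = torch.randn(32 * 32, 64)
q_rotated = rope_2d(q, grid_h=32, grid_w=32)
```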
Tips:

- We adopted the official LLaVA-NeXT and the official training dataset LLaVA-NeXT-Data for evaluating the foundational visual models.
- The language model is Qwen2.5-7B.
Results:
| Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU |
|---|---|---|---|---|---|---|
| CLIP (ViT-L-14-336px) | × | 66.52 | 75.21 | 38.88 | 525.00 | 44.20 |
| SigLIP (ViT-SO400M-384px) | × | 69.28 | 76.71 | 41.38 | 554.00 | 46.78 |
| DFN5B (ViT-H-14-378px) | × | 64.36 | 70.87 | 38.59 | 473.00 | 48.00 |
| MLCD (ViT-L-14-336px) | × | 67.84 | 76.46 | 43.48 | 531.00 | 44.30 |
| MLCD (ViT-bigG-14-336px) | √ | 71.07 | 79.63 | 44.38 | 572.00 | 46.78 |
| MLCD (ViT-bigG-14-448px) | √ | 73.80 | 83.34 | 46.59 | 582.00 | 46.00 |
```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MLCDVisionModel

# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")

# Process a single image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Run the forward pass without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Get the patch-level visual features
features = outputs.last_hidden_state

print(f"Extracted features shape: {features.shape}")
```
MLCDVisionConfig

[[autodoc]] MLCDVisionConfig
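As with other Transformers configuration classes, `MLCDVisionConfig` can be instantiated directly to build a randomly initialized `MLCDVisionModel`, or loaded from a pretrained checkpoint; the short sketch below assumes the default configuration values.

```python
from transformers import MLCDVisionConfig, MLCDVisionModel

# Build a config with default hyperparameters and a randomly initialized model from it.
config = MLCDVisionConfig()
model = MLCDVisionModel(config)

# Or inspect the configuration of a pretrained checkpoint.
pretrained_config = MLCDVisionConfig.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")
print(pretrained_config)
```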
MLCDVisionModel
[[autodoc]] MLCDVisionModel
    - forward
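When MLCD serves as the vision tower of an LLaVA-style model, intermediate layers are often used instead of the final one. The sketch below requests all hidden states through the standard `output_hidden_states` argument; picking the penultimate layer and dropping a leading class token are common LLaVA-style conventions assumed here for illustration, not documented MLCD requirements.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MLCDVisionModel

model = MLCDVisionModel.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states holds the embedding output plus one tensor per encoder layer.
penultimate = outputs.hidden_states[-2]

# Drop the first token, assuming a CLIP-like class token precedes the patch tokens.
patch_features = penultimate[:, 1:, :]
print(patch_features.shape)
```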