LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Feng Han1,2 Zhixiong Zhang2,3 Zheming Liang2,4 Yibin Wang1,2 Jiaqi Wang2,5,*
1Fudan University 2Shanghai Innovation Institute 3Shanghai Jiao Tong University 4University of Science and Technology of China 5JD.COM

Toward deeper cross-carrier fusion in Vision-Language Models.

Carrier Sensitivity in VLMs

Replacing a textual question with its rendered-image counterpart should ideally leave VLM performance essentially unchanged, because the semantics are identical. In practice, this local modality substitution causes large accuracy drops, and samples with a larger cross-modal representation distance degrade more.

Carrier sensitivity across VLMs

(a) Rendering the same question as an image causes a clear accuracy drop across strong VLMs. (b) Samples with larger text-image representation distance suffer larger performance degradation. (c) LoMo shifts paired representations closer together, indicating stronger cross-carrier alignment.
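The relationship in panel (b) can be probed with a small script. The following is a minimal sketch, not the authors' evaluation code: it assumes each sample carries a pooled representation for the text carrier and for the rendered-image carrier, uses cosine distance as a stand-in for the distance measure (the page does not specify which one is used), and compares the accuracy drop in the nearer and farther halves of the samples. The dictionary keys and function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("a zero vector has no direction")
    return 1.0 - dot / (norm_u * norm_v)


def accuracy(group, key):
    """Fraction of samples in `group` whose boolean field `key` is True."""
    return sum(1 for s in group if s[key]) / len(group)


def carrier_gap_by_distance(samples):
    """Split samples at the median text-image distance and report the
    accuracy drop (text carrier minus rendered-image carrier) per half.

    Assumed (hypothetical) sample keys:
      text_repr, image_repr       -- pooled representations of each carrier
      text_correct, image_correct -- correctness under each carrier
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to form two groups")
    ranked = sorted(
        samples,
        key=lambda s: cosine_distance(s["text_repr"], s["image_repr"]),
    )
    half = len(ranked) // 2
    near, far = ranked[:half], ranked[half:]
    return {
        "near": accuracy(near, "text_correct") - accuracy(near, "image_correct"),
        "far": accuracy(far, "text_correct") - accuracy(far, "image_correct"),
    }
```

If the trend described in the caption holds, the "far" group should show a larger drop than the "near" group.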

LoMo Overview

LoMo transforms a single-modal training instance into a text-image-text interleaved instance through three components: structure-aware span localization selects a coherent middle span, visual rendering converts it into an image, and perceptual distortion makes the rendered carrier more robust while preserving semantics.
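The three components can be sketched end to end. This is an illustrative sketch under stated assumptions, not the released implementation: sentence boundaries stand in for structure-aware span localization, the `ratio` parameter and all function names are hypothetical, and rendering plus distortion are represented by a placeholder object rather than a real image so that the example stays dependency-free.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union


@dataclass
class RenderedSpan:
    """Placeholder for the rendered (and distorted) image of a text span."""
    source_text: str
    distorted: bool


Segment = Union[str, RenderedSpan]


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def locate_middle_span(sentences: List[str], ratio: float) -> Tuple[int, int]:
    """Pick a contiguous interior run of sentences, leaving text on both sides."""
    n = len(sentences)
    if n < 3:
        raise ValueError("need at least three sentences to keep text on both sides")
    length = max(1, min(n - 2, round(n * ratio)))
    start = 1 + (n - 2 - length) // 2  # centre the span among interior sentences
    return start, start + length


def lomo_transform(
    text: str,
    ratio: float = 0.4,
    render: Optional[Callable[[str], RenderedSpan]] = None,
) -> List[Segment]:
    """Turn a text-only instance into a text-image-text interleaved instance."""
    sentences = split_sentences(text)
    start, end = locate_middle_span(sentences, ratio)
    span = " ".join(sentences[start:end])
    if render is None:
        # Stand-in for visual rendering followed by perceptual distortion.
        render = lambda s: RenderedSpan(source_text=s, distorted=True)
    return [" ".join(sentences[:start]), render(span), " ".join(sentences[end:])]
```

A real pipeline would pass a `render` callable that draws the span onto an image and applies the distortion; the placeholder keeps the source text so the substitution is easy to inspect.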

Overview of LoMo
LoMo objective decomposition

Optimizing the substituted instance T(x) is equivalent to providing an extra cross-carrier alignment signal, encouraging equivalent text and rendered-image carriers to produce consistent predictions.
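One way to read this statement is through a simple identity. It is offered as an illustration only and is not taken from the paper's derivation; the notation is assumed here, with $\ell(y \mid z) = -\log p_\theta(y \mid z)$ the per-instance loss, $x$ the original instance, and $T(x)$ its substituted counterpart.

```latex
\[
\ell\bigl(y \mid T(x)\bigr)
  = \underbrace{\ell\bigl(y \mid x\bigr)}_{\text{standard SFT loss}}
  + \underbrace{\Bigl[\ell\bigl(y \mid T(x)\bigr) - \ell\bigl(y \mid x\bigr)\Bigr]}_{\text{cross-carrier gap}}
\]
```

The bracketed term vanishes when the text carrier and the rendered-image carrier yield the same prediction, so minimizing the left-hand side also pushes that gap toward zero.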

Benchmark Results

Across 13 multimodal benchmarks, LoMo consistently improves over Standard SFT on two backbones: +2.68 average accuracy on LLaVA-OneVision-1.5-8B and +2.82 on Qwen3.5-9B under standard evaluation.

Benchmark gains across two backbones

Performance improves consistently across both tested VLM backbones and most benchmark categories.

Main benchmark results table

LoMo improves standard evaluation and substantially narrows the gap under rendered-question evaluation.

Alignment Improves With Scale

LoMo improves average benchmark accuracy while reducing both the Modality Integration Rate (MIR) and the pairwise cross-modal distance, indicating stronger cross-modal fusion.

Data scale and cross-modal alignment

Accuracy rises with data scale while MIR and pairwise cross-modal distance decrease.

Ablation Studies

Component ablation
Rewrite ratio ablation
Rendering position ablation

Pure-Text Capability Check

LoMo preserves pure-text capability while improving multimodal performance, matching or slightly exceeding Standard SFT on five text-only benchmarks across the two tested backbones.

Pure-text capability analysis

The multimodal gains do not come at the cost of pure-text capability.
