LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Feng Han¹,² Zhixiong Zhang²,³ Zheming Liang²,⁴ Yibin Wang¹,² Jiaqi Wang²,⁵,*

¹Fudan University ²Shanghai Innovation Institute ³Shanghai Jiao Tong University ⁴University of Science and Technology of China ⁵JD.COM

Toward deeper cross-carrier fusion in Vision-Language Models.


Carrier Sensitivity in VLMs

Replacing a textual question with its rendered-image counterpart should ideally leave the performance of a vision-language model (VLM) essentially unchanged, because the semantics are identical and only the carrier (text tokens versus a rendered image) differs. In practice, this local modality substitution causes large accuracy drops, and samples whose text and image representations lie farther apart degrade more.

(a) Rendering the same question as an image causes a clear accuracy drop across strong VLMs. (b) Samples with larger text-image representation distance suffer larger performance degradation. (c) LoMo shifts paired representations closer together, indicating stronger cross-carrier alignment.
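The relationship in panel (b) can be checked with a simple split-by-distance analysis. The sketch below is only an illustration under stated assumptions: the record layout `(distance, correct_with_text, correct_with_rendered_image)`, the median split, and the function names are hypothetical and are not taken from the paper, and the toy records are made up for demonstration.

```python
from statistics import median


def accuracy(flags):
    """Fraction of True values in a non-empty list of booleans."""
    if not flags:
        raise ValueError("cannot compute accuracy of an empty group")
    return sum(flags) / len(flags)


def drop_by_distance(samples):
    """Split samples at the median distance and report the accuracy drop per half.

    Each sample is (distance, correct_with_text, correct_with_rendered_image).
    Returns {"near": drop, "far": drop}, where
    drop = text accuracy - rendered-image accuracy.
    """
    cut = median(d for d, _, _ in samples)
    near = [s for s in samples if s[0] <= cut]
    far = [s for s in samples if s[0] > cut]
    result = {}
    for name, group in (("near", near), ("far", far)):
        text_acc = accuracy([t for _, t, _ in group])
        image_acc = accuracy([r for _, _, r in group])
        result[name] = text_acc - image_acc
    return result


# Toy records for illustration only; these are NOT results from the paper.
toy = [
    (0.10, True, True),
    (0.15, True, True),
    (0.20, True, False),
    (0.60, True, False),
    (0.70, True, True),
    (0.90, True, False),
]
print(drop_by_distance(toy))
```

A larger drop in the "far" half than in the "near" half would be consistent with the trend described in panel (b); a finer analysis would use more than two buckets.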

LoMo Overview

LoMo transforms a single-modal training instance into a text-image-text interleaved instance through three components: structure-aware span localization selects a coherent middle span, visual rendering converts it into an image, and perceptual distortion makes the rendered carrier more robust while preserving semantics.
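The three components above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: sentence boundaries stand in for "structure-aware" localization, `RenderedSpan` is a hypothetical placeholder for an actual rendered image, and the distortion parameters (`blur_radius`, `rotation_deg`) and their ranges are assumptions chosen only to show where distortion would plug in.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class RenderedSpan:
    """Stand-in for an image; a real pipeline would hold pixels here."""
    source_text: str
    blur_radius: float   # hypothetical distortion parameter
    rotation_deg: float  # hypothetical distortion parameter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def locate_middle_span(sentences, rng):
    """Pick a contiguous run of sentences that leaves text on both sides."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences for a text-image-text split")
    start = rng.randint(1, len(sentences) - 2)
    last = rng.randint(start, len(sentences) - 2)
    return start, last + 1  # half-open interval [start, end)


def substitute(text, seed=0):
    """Return [("text", prefix), ("image", RenderedSpan), ("text", suffix)]."""
    rng = random.Random(seed)
    sentences = split_sentences(text)
    start, end = locate_middle_span(sentences, rng)
    span = " ".join(sentences[start:end])
    image = RenderedSpan(
        source_text=span,
        blur_radius=rng.uniform(0.0, 1.0),
        rotation_deg=rng.uniform(-2.0, 2.0),
    )
    return [
        ("text", " ".join(sentences[:start])),
        ("image", image),
        ("text", " ".join(sentences[end:])),
    ]


doc = ("A train leaves at noon. It travels 60 km per hour. "
       "The trip is 180 km long. When does it arrive?")
for kind, content in substitute(doc, seed=1):
    print(kind, "->", content)
```

Keeping the span strictly in the middle guarantees that text appears on both sides of the image, which matches the text-image-text layout described above. In a real pipeline the placeholder would be replaced by an image-drawing step followed by the sampled distortions.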

Training on the substituted instance $T(x)$, where $T$ denotes the substitution applied to an original instance $x$, is equivalent to providing an extra cross-carrier alignment signal: it encourages equivalent text and rendered-image carriers to produce consistent predictions.
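One way to read this claim is through an identity that holds for any model. The notation here is assumed rather than taken from the source: $y$ is the training target and $p_\theta$ is the model's predictive distribution.

$$
-\log p_\theta\bigl(y \mid T(x)\bigr)
= \underbrace{-\log p_\theta(y \mid x)}_{\text{loss on the original instance}}
+ \underbrace{\log \frac{p_\theta(y \mid x)}{p_\theta\bigl(y \mid T(x)\bigr)}}_{\text{carrier discrepancy}}
$$

The second term is zero exactly when the model assigns the same probability to the target under both carriers, so the loss on $T(x)$ can be read as the loss on $x$ plus a term measuring how far the two predictions disagree on the target. This is an illustrative decomposition only, and the paper's own derivation may differ.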

Benchmark Results

Across 13 multimodal benchmarks and under standard evaluation, LoMo consistently improves over Standard SFT on two backbones, raising average accuracy by +2.68 on LLaVA-OneVision-1.5-8B and by +2.82 on Qwen3.5-9B.

Performance improves consistently across both tested VLM backbones and most benchmark categories.

LoMo improves standard evaluation and substantially narrows the gap under rendered-question evaluation.

Alignment Improves With Scale

LoMo improves average benchmark accuracy while reducing both the Modality Integration Rate (MIR) and the pairwise cross-modal distance, indicating stronger cross-modal fusion.

Accuracy rises with data scale while MIR and pairwise cross-modal distance decrease.
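For readers who want a concrete handle on "pairwise cross-modal distance", the sketch below averages the cosine distance between each text representation and the representation of its rendered-image counterpart. The choice of cosine distance, the use of one pooled vector per sample, and the function names are all assumptions, since the exact metric is not specified here. MIR is a separate metric and is not reproduced.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(text_embs, image_embs):
    """Average distance over pairs (text_embs[i], image_embs[i])."""
    if not text_embs or len(text_embs) != len(image_embs):
        raise ValueError("need the same non-zero number of text and image vectors")
    total = sum(cosine_distance(t, i) for t, i in zip(text_embs, image_embs))
    return total / len(text_embs)


# Toy 2-D vectors for illustration only; real representations would come
# from a model's hidden states.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[1.0, 0.0], [1.0, 0.0]]
print(mean_pairwise_distance(text_embs, image_embs))
```

Under this definition, a lower value means that paired text and image representations point in more similar directions, which is the direction of change reported above.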

Ablation Studies

Pure-Text Capability Check

LoMo preserves pure-text capability while improving multimodal performance, matching or slightly exceeding Standard SFT on five text-only benchmarks across the two tested backbones.

The multimodal gains do not come at the cost of pure-text capability.

BibTeX

@article{han2026lomo,
  title={LoMo: Local Modality Substitution for Deeper Vision-Language Fusion},
  author={Han, Feng and Zhang, Zhixiong and Liang, Zheming and Wang, Yibin and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2605.30265},
  year={2026}
}