Toward deeper cross-carrier fusion in Vision-Language Models.
Replacing a textual question with its rendered-image counterpart should ideally leave VLM performance essentially unchanged because the semantics are the same. In practice, this local modality substitution causes large accuracy drops, and samples with larger cross-modal representation distances degrade more.
(a) Rendering the same question as an image causes a clear accuracy drop across strong VLMs. (b) Samples with larger text-image representation distance suffer larger performance degradation. (c) LoMo shifts paired representations closer together, indicating stronger cross-carrier alignment.
LoMo transforms a single-modal training instance into a text-image-text interleaved instance through three components: structure-aware span localization selects a coherent middle span, visual rendering converts it into an image, and perceptual distortion makes the rendered carrier more robust while preserving semantics.
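The three components can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`localize_middle_span`, `render_stub`, `distort`, `lomo_transform`), the sentence-boundary heuristic, and the bit-flip noise are all assumptions made for illustration. In particular, `render_stub` is a placeholder that encodes characters as pixel columns instead of drawing glyphs; a real pipeline would rasterize the span with an image library and apply distortions mild enough to keep the text legible.

```python
import random
import re


def localize_middle_span(text, rng):
    """Pick a contiguous middle span aligned to sentence (or word) boundaries.

    Hypothetical heuristic: the span never includes the first or last unit,
    so text remains on both sides of the rendered image.
    Returns (prefix, span, suffix), or None if the text is too short.
    """
    units = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(units) < 3:
        units = text.split()  # fall back to word-level units
    if len(units) < 3:
        return None
    start = rng.randint(1, len(units) - 2)
    end = rng.randint(start, len(units) - 2)
    prefix = " ".join(units[:start])
    span = " ".join(units[start:end + 1])
    suffix = " ".join(units[end + 1:])
    return prefix, span, suffix


def render_stub(span, height=8):
    """Placeholder renderer (assumption): one pixel column per character,
    rows hold the low bits of the character code. Not a real glyph render."""
    return [[(ord(ch) >> row) & 1 for ch in span] for row in range(height)]


def distort(pixels, rng, flip_prob=0.02):
    """Toy perceptual distortion (assumption): flip each pixel with small probability."""
    return [[1 - p if rng.random() < flip_prob else p for p in row]
            for row in pixels]


def lomo_transform(text, seed=0):
    """Turn a text-only input into a text-image-text interleaved sequence."""
    rng = random.Random(seed)
    parts = localize_middle_span(text, rng)
    if parts is None:
        return [("text", text)]  # too short to substitute; keep unchanged
    prefix, span, suffix = parts
    image = distort(render_stub(span), rng)
    return [("text", prefix), ("image", image), ("text", suffix)]


if __name__ == "__main__":
    question = ("The figure shows a triangle. Its base is 4 cm. "
                "Its height is 3 cm. What is the area?")
    for kind, payload in lomo_transform(question, seed=1):
        print(kind, payload if kind == "text" else f"<{len(payload[0])} px wide>")
```

Keeping the first and last units as text is one way to obtain the text-image-text layout; how the actual structure-aware localization chooses its span is not specified here.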
Optimizing the substituted instance T(x) is equivalent to providing an extra cross-carrier alignment signal, encouraging equivalent text and rendered-image carriers to produce consistent predictions.
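One way to read this equivalence, offered as an interpretive sketch and not necessarily the paper's own derivation, is to split the loss on the substituted instance into the standard loss plus a carrier-gap term. The symbols $p_\theta$ (the model's predictive distribution) and $y$ (the target response) are notation assumed here.

```latex
\begin{aligned}
\mathcal{L}\bigl(T(x), y\bigr)
  &= -\log p_\theta\bigl(y \mid T(x)\bigr) \\
  &= \underbrace{-\log p_\theta(y \mid x)}_{\text{standard SFT loss}}
   \;+\; \underbrace{\log \frac{p_\theta(y \mid x)}{p_\theta\bigl(y \mid T(x)\bigr)}}_{\text{cross-carrier gap}}
\end{aligned}
```

The second term vanishes when the model assigns the same likelihood to $y$ under the text carrier and the rendered-image carrier, so minimizing the loss on $T(x)$ also pushes the two carriers toward consistent predictions.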
Across 13 multimodal benchmarks, LoMo consistently improves over Standard SFT on two backbones: +2.68 average accuracy on LLaVA-OneVision-1.5-8B and +2.82 on Qwen3.5-9B under standard evaluation.
Performance improves consistently across both tested VLM backbones and most benchmark categories.
LoMo improves standard evaluation and substantially narrows the gap under rendered-question evaluation.
LoMo improves average benchmark accuracy while reducing both Modality Integration Rate and pairwise cross-modal distance, indicating stronger cross-modal fusion.
Accuracy rises with data scale while MIR and pairwise cross-modal distance decrease.
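As a rough illustration of how a pairwise cross-modal distance could be measured, the sketch below averages the cosine distance between paired text and rendered-image representations. The choice of cosine distance and of one pooled vector per sample are assumptions; the exact definition used here is not stated, and the Modality Integration Rate is a separate metric that this sketch does not compute.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(text_embs, image_embs):
    """Average distance over paired (text, rendered-image) representations.

    Element i of each list is assumed to encode the same content
    in the two carriers.
    """
    if not text_embs or len(text_embs) != len(image_embs):
        raise ValueError("need equally sized, non-empty lists of embeddings")
    total = sum(cosine_distance(t, i) for t, i in zip(text_embs, image_embs))
    return total / len(text_embs)


if __name__ == "__main__":
    # Hypothetical toy vectors, not real model representations.
    text = [[1.0, 0.0], [0.5, 0.5]]
    image = [[0.9, 0.1], [0.4, 0.6]]
    print(mean_pairwise_distance(text, image))
```

A lower value means the paired representations lie closer together, which is the direction the reported trend describes.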
LoMo preserves pure-text capability while improving multimodal performance, matching or slightly exceeding Standard SFT on five text-only benchmarks across the two tested backbones.
The multimodal gains do not come at the cost of pure-text capability.