Why Multimodal AI Is the Future of Web QA -- and What It Changes for Engineering Teams
For the past decade, automated web QA has been fundamentally text-based. Selenium locates elements through DOM selectors. Accessibility scanners read attribute values. Linters analyse code. These tools are useful, but they share a critical limitation: they cannot see.
What multimodal actually means in a QA context
Multimodal AI in a QA context means an agent that processes four kinds of input simultaneously -- visual (screenshots, video frames), textual (DOM, copy, code), auditory (captions, audio tracks, text-to-speech output), and interactive (browser automation, form submissions, navigation flows) -- and reasons across them, drawing cross-domain inferences that no single-channel tool can make.
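The idea can be sketched in a few lines. This is a minimal, hypothetical model -- the names `MultimodalSnapshot` and `cross_check` are illustrative, not from any real tool -- showing how bundling several modalities of page state lets a check reason across them, e.g. spotting a video in the DOM with no caption track.

```python
from dataclasses import dataclass, field

@dataclass
class MultimodalSnapshot:
    """One moment of page state, bundling the input kinds a multimodal QA agent consumes."""
    screenshot_png: bytes                                # visual: the rendered frame
    dom_html: str                                        # textual: markup the browser produced
    captions: list[str] = field(default_factory=list)    # auditory: caption/track text
    action_log: list[str] = field(default_factory=list)  # interactive: steps taken so far

def cross_check(snapshot: MultimodalSnapshot) -> list[str]:
    """Toy cross-domain inference: flag signals present in one modality but missing in another."""
    issues = []
    if "<video" in snapshot.dom_html and not snapshot.captions:
        issues.append("video present but no captions found")
    if snapshot.action_log and not snapshot.screenshot_png:
        issues.append("actions taken but no visual evidence captured")
    return issues

snap = MultimodalSnapshot(
    screenshot_png=b"\x89PNG...",
    dom_html="<main><video src='demo.mp4'></video></main>",
    action_log=["click #play"],
)
print(cross_check(snap))  # -> ['video present but no captions found']
```

A real agent would replace the string checks with vision and language models, but the shape is the same: one snapshot, several modalities, findings that require comparing them.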
What changes for engineering teams
The shift from text-based to multimodal QA changes three things. First, coverage expands dramatically: issues that only manifest visually or audibly -- overlapping elements, unreadable contrast, missing captions -- become testable. Second, maintenance cost collapses, because an agent that perceives the rendered page does not break every time a selector or class name changes. Third, the feedback loop closes -- auto-remediation means engineers receive not just issue reports but proposed fixes.
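The third point is the structural change: the unit of output shifts from a finding to a finding plus a patch. A minimal sketch, assuming a hypothetical `Finding` payload (the field names are illustrative, not any vendor's schema):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """A hypothetical auto-remediation payload: the issue and a proposed fix, not just a report."""
    selector: str        # where the problem was observed
    issue: str           # what the agent concluded across modalities
    proposed_patch: str  # a diff the engineer reviews and applies, closing the loop

finding = Finding(
    selector="img.hero",
    issue="image conveys content but has empty alt text",
    proposed_patch=(
        '- <img class="hero" alt="">\n'
        '+ <img class="hero" alt="Team photo at launch event">'
    ),
)
print(f"{finding.selector}: {finding.issue}")
```

The engineer's job becomes reviewing a concrete diff rather than reproducing and diagnosing the issue from scratch.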
The multimodal shift is not incremental. It is a different epistemological approach to what quality means in software. Quality was once "the code works as specified." Multimodal QA redefines quality as "a real user -- including a disabled user -- can accomplish their goal."