Using AI Vision Agents for visual DOM parsing when HTML is obfuscated

I’m dealing with a nightmare target site. They render their pricing tables inside a <canvas> element, and the checkout buttons are just empty tags with randomized, obfuscated CSS classes.

Standard DOM selectors are completely useless here. Can the AI just “look” at the page and click the right thing?

This is exactly why we built the Vision Agent! When the DOM is a hostile environment, we stop relying on the code and start looking at the page like a human.

Here is the magic: When you use the Vision Agent, RTILA X injects a visual annotation layer over the page. It calculates the bounding boxes of all visible elements and draws numbered boxes over them. It then takes a screenshot and sends it to your chosen LLM (either local or via OpenRouter).
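To make the annotation step concrete, here is a minimal Python sketch of the numbering logic only. It is not RTILA X's actual code: the `Box` type, the `annotate` function, the selectors, and the visibility rule (non-zero area and overlapping the viewport) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical record of one element's bounding box, in viewport pixels."""
    selector: str  # whatever internal selector the engine recorded (assumed)
    x: float
    y: float
    width: float
    height: float


def annotate(boxes, viewport_width, viewport_height):
    """Give each visible box a sequential integer ID, starting at 1.

    "Visible" here is an assumed rule: the box has non-zero area and
    overlaps the viewport rectangle.
    """
    labelled = {}
    next_id = 1
    for box in boxes:
        has_area = box.width > 0 and box.height > 0
        in_view = (
            box.x < viewport_width
            and box.y < viewport_height
            and box.x + box.width > 0
            and box.y + box.height > 0
        )
        if has_area and in_view:
            labelled[next_id] = box
            next_id += 1
    return labelled


# Made-up example elements, including two that should not get a label.
boxes = [
    Box("div.x9f2a", 40, 600, 120, 36),
    Box("div.q81zz", 0, 0, 0, 0),       # zero-size: skipped
    Box("canvas", 0, 80, 800, 400),
    Box("div.k3", 40, 5000, 120, 36),   # below the viewport: skipped
]
labels = annotate(boxes, viewport_width=1280, viewport_height=720)
for element_id, box in labels.items():
    print(element_id, box.selector)
```

The numbers drawn on the screenshot would be the keys of `labels`, so the model only ever has to name an integer, never a class name.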

The AI analyzes the image, identifies the page layout, and returns the exact integer IDs of the elements you want to interact with. Our engine maps those IDs back to the enriched selectors and executes the click. No CSS or XPath required.
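The mapping step can be sketched in Python as well. This assumes the model has been prompted to answer with JSON of the form `{"ids": [...]}`; that reply format, the `resolve_targets` name, and the example selectors are hypothetical and not taken from RTILA X.

```python
import json


def resolve_targets(reply, id_to_selector):
    """Turn the model's reply into the selectors recorded for those IDs.

    `reply` is assumed to be a JSON string like '{"ids": [2]}'.
    Unknown or non-integer IDs raise instead of being silently ignored,
    so a hallucinated label never turns into a click.
    """
    data = json.loads(reply)
    selectors = []
    for raw in data["ids"]:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(raw, bool) or not isinstance(raw, int):
            raise ValueError(f"not an integer ID: {raw!r}")
        if raw not in id_to_selector:
            raise KeyError(f"model returned unknown ID {raw}")
        selectors.append(id_to_selector[raw])
    return selectors


# IDs as they were drawn on the screenshot (made-up values).
id_to_selector = {1: "canvas", 2: "div.x9f2a"}
targets = resolve_targets('{"ids": [2]}', id_to_selector)
print(targets)
```

Validating the IDs against the set that was actually drawn is the important design choice here: the model sees only numbers, and the engine decides what each number means.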