Why Does Midscene Split Locate and Action into Two Steps?
The previous post, What Actually Happens Inside a Single Midscene aiAct Call?, walked through the plan-execute loop inside aiAct, but one stop was deliberately left unopened — “finding the element”.
That stop is arguably the most technically distinctive part of Midscene. Most vision Agents either trust the coordinates the AI gives them, or fire one more AI request to refine the location. Midscene takes a different path: separate the locate step out, and try four fallback layers in order from cheapest to most expensive.
This post is about that.
