Automation in the AI Era (updated Mar '24)
by Edwin Ong, Founder & CEO
TL;DR:
I would go straight to Levels 6 and 7 (video demos right below).

Overview:
Level | Automation Approach | Description | Tools Used | Benefits
---|---|---|---|---
1 | Non-visual GUI Automation | Automating web interactions without a graphical interface (terminal-based) | Playwright, Scrapy | Speed, accuracy, and ease of use
2 | Visual GUI Automation | Rendering pages as a user would see them, to handle dynamically loaded content or feed custom visual analysis | Selenium, Puppeteer | Better handling of JavaScript-heavy sites; supports custom downstream applications
3 | Hardcoding Images for Visual Search | Using image recognition to find and interact with on-screen elements | OpenCV, PyAutoGUI | Robust when the visual UI is stable, even if the page source changes frequently
4 | Using AI for Dynamic Searches | Locating elements with AI vision and reasoning, to better tolerate UI and page-structure changes | OpenAI vision API, LLaVA | Increased flexibility and efficiency
5 | AI Agent with Overarching Intent | An AI agent given a general intent (e.g. "register user then verify email") and control of the browser/machine, replicating a human exploring and using a website | Vimium, Self-Operating-Computer | Minimal scripting overhead per new website (just change the prompt)
6 | Augmenting AI Agent with External Tools | Giving the AI the ability to take further downstream actions based on intent (e.g. call an API, update a database) | LangChain, external APIs | Interoperability and extensibility
7 | Agentic Interaction and Copilot | Splitting control between user and AI, dynamically generating guidance to walk users through tasks | Custom scripts | Facilitates user learning and control
Level 1: The workhorse that still gets the job done
While relatively "basic", this level still covers the bulk of automated operations: it's quick, powerful, and easy to implement for your internal use cases.
Externally, a simple example is scraping data from a website: navigating quickly across different pages and extracting their content, without ever rendering the site.

Basic scraping application
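To make this concrete, here's a minimal Playwright sketch of Level 1. The URL and the `h2.product-title` selector are placeholders; swap in whatever site and elements you're scraping:

```python
# Minimal Level 1 sketch: headless scraping with Playwright (no visible GUI).
# The URL and CSS selector below are illustrative placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # runs without a window
    page = browser.new_page()
    page.goto("https://example.com/products")  # hypothetical listing page
    # Grab the text of every product title on the page
    titles = page.locator("h2.product-title").all_inner_texts()
    for title in titles:
        print(title)
    browser.close()
```

Because nothing is rendered, this runs fast and is easy to schedule on a server.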
Level 2: See it visually
Sometimes it's useful to actually replicate what a user would see when they visit a website. This can be for a variety of reasons, the most common being:
- Handling JavaScript-heavy websites
- Simulating user interactions (e.g. signing in, filling out forms, extracting brand colours and aesthetics)
▶️ Video Demo (with some bonus verbal overlay)
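As a rough illustration of this level, here's a minimal Selenium sketch that signs in through a visible browser and captures what the user would see. The URL, element IDs, and credentials are placeholders:

```python
# Minimal Level 2 sketch: driving a real browser with Selenium so JavaScript
# renders as a user would see it. URL, IDs, and credentials are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # opens a visible Chrome window
driver.implicitly_wait(5)    # wait up to 5s for elements to appear

driver.get("https://example.com/login")  # hypothetical login page
driver.find_element(By.ID, "email").send_keys("user@example.com")
driver.find_element(By.ID, "password").send_keys("s3cret")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

driver.save_screenshot("after_login.png")  # capture what the user would see
driver.quit()
```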
Levels 3 - 5
These levels are bundled because they share a similar approach, but they differ in sophistication and adaptability.
- In L3, you rigidly prescribe, step by step, the specific things to click (by providing a screenshot of each element). If any step or the UI changes, you have to update the code and images (see the sketch just after this list).
- In L4, you provide a general instruction and the AI agent figures its own way around, using screenshots and a means of controlling the browser. This way, even if the steps or UI change, the agent has a chance of getting it right (a vision-API sketch follows the demo below).
- L5 builds on the above: you're no longer restricted to a browser, and the AI agent can control the machine itself.
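Here's a minimal L3 sketch with PyAutoGUI, assuming you've saved a screenshot of the target button as submit_button.png beforehand (confidence matching also needs opencv-python installed):

```python
# Minimal Level 3 sketch: hard-coded image search with PyAutoGUI.
# "submit_button.png" is a screenshot you capture ahead of time; if the UI
# changes, the image (and the script) must be updated by hand.
import pyautogui

try:
    # Newer PyAutoGUI versions raise if the image isn't found on screen...
    location = pyautogui.locateCenterOnScreen("submit_button.png", confidence=0.9)
except pyautogui.ImageNotFoundException:
    location = None  # ...older versions return None instead

if location:
    pyautogui.click(location)  # move the mouse to the match and click it
else:
    print("Button not found: the UI may have changed.")
```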
▶️ Video Demo. This is where the AI agent starts to become more intelligent and adaptable, and can handle a wider range of websites, providing a strong foundation for complete automation (demonstrated through L6 and L7).
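And a minimal L4 sketch: instead of hard-coded element images, we ask a vision-capable model where to click. The model name, prompt, and screenshot file are illustrative:

```python
# Minimal Level 4 sketch: asking a multimodal model to locate an element,
# instead of hard-coding element images. Model and prompt are illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Locate the 'Sign up' button in this screenshot and "
                     "reply with its approximate x,y pixel coordinates."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)  # e.g. "x=412, y=288"
```

A full agent would parse those coordinates, feed them into mouse control (as in L3), then loop with a fresh screenshot.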

Here are some projects to watch in this space (the image above is from UFO, one of the repos listed below):
- VimGPT, which gives multimodal models a way to interact with the browser using the Vimium browser extension.
- Self-Operating-Computer, which does something similar but uses x/y coordinates of the screen and controls the mouse/keyboard instead.
- Open Interpreter, where LLMs can control your computer and automate folder management, file conversion, video editing, and more.
- UFO, aka UI-Focused agent. The paper and code were only published two weeks ago, so it's definitely not production-ready, but it taps into the native Windows UI Automation API and so might have greater long-term potential.
Level 6: Augmenting AI agent with external tools
▶️ Video Demo. This is where people get excited, because it opens up new possibilities for AI agents.

A simple example I use personally would be:
- My voice notes are automatically dictated, categorized and filed into respective folders.
- If there are follow-up actions to take, they are automatically added to my task management list.
- If I instruct in my voice note to send an email, it will do that without any further action from me.
With the above steps, plus conditional logic, nearly any application or workflow can be automated; a minimal tool-calling sketch follows.
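Here's a rough sketch of the pattern, using the raw OpenAI tools API (LangChain wraps the same idea at a higher level). The add_task() stub and the voice-note text are hypothetical; in practice the function would call your task manager's API:

```python
# Minimal Level 6 sketch: exposing an external action ("add a task") as a
# tool the model can decide to call. add_task() is a hypothetical stub.
import json
from openai import OpenAI

client = OpenAI()

def add_task(title: str) -> str:
    # In practice this would call your task manager's API (e.g. Todoist)
    print(f"Task added: {title}")
    return "ok"

tools = [{
    "type": "function",
    "function": {
        "name": "add_task",
        "description": "Add a follow-up action to the user's task list",
        "parameters": {
            "type": "object",
            "properties": {"title": {"type": "string"}},
            "required": ["title"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Voice note: remember to renew the domain tomorrow."}],
    tools=tools,
)

# If the model decided a tool call is needed, execute it
for call in response.choices[0].message.tool_calls or []:
    if call.function.name == "add_task":
        add_task(**json.loads(call.function.arguments))
```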
Level 7: Agentic interaction and Copilot
▶️ Video Demo, which includes a discussion and demo of potential applications for students using multiple apps.
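To show the shape of the split-control loop, here's a toy copilot sketch; propose_next_step() is a hypothetical stand-in for a real model call, and executing accepted steps is left out:

```python
# Toy Level 7 sketch: the AI proposes each step, the user stays in control.
# propose_next_step() is a hypothetical stub for a real model call.
def propose_next_step(task: str, done: list[str]) -> str:
    # A real version would send the task and the progress so far to a model
    return f"Step {len(done) + 1}: click 'Export' to continue '{task}'"

def copilot(task: str) -> None:
    done: list[str] = []
    while True:
        suggestion = propose_next_step(task, done)
        choice = input(f"{suggestion}\n[a]ccept / [s]kip / [q]uit: ").strip().lower()
        if choice == "q":
            break
        if choice == "a":
            done.append(suggestion)  # a real copilot would also execute the step
        # on "s" we simply loop and ask for another suggestion

copilot("generate the monthly usage report")
```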
