Automation in the AI Era (updated Mar '24)

by Edwin Ong, Founder & CEO

TL;DR:

I would go straight to levels 6 and 7 (video demos right below).

  • Level 6 — this demo gets people excited because it opens up a world of possibilities for automation that they previously thought was impossible.
  • Level 7 — builds on the above, showcasing an elegant approach to UX simplification for students (even when using 3rd party apps).

Overview:

| Level | Automation Approach | Description | Tools Used | Benefits |
|-------|---------------------|-------------|------------|----------|
| 1 | Non-visual GUI automation | Automating web interactions without a graphical interface (terminal-based) | Playwright, Scrapy | Improved speed, accuracy, and ease of use |
| 2 | Visual GUI automation | Rendering dynamic content and simulating user interactions for custom visual analysis | Selenium, Puppeteer | Better handling of JavaScript-heavy sites and custom downstream applications |
| 3 | Hardcoded images for visual search | Using image recognition to find and interact with on-screen elements | OpenCV, PyAutoGUI | Robust when the UI looks stable, even if the page source changes frequently |
| 4 | AI-driven dynamic search | Using AI vision and logic to locate elements, tolerating UI and page-structure changes | OpenAI vision API, LLaVA | Increased flexibility and efficiency |
| 5 | AI agent with overarching intent | An agent given a general intent (e.g. "register user then verify email") and control of the browser/machine, replicating a human exploring and using a website | Vimium, Self-Operating-Computer | Minimal scripting overhead per new website (just change the prompt) |
| 6 | AI agent augmented with external tools | Giving the AI the ability to take further downstream actions based on intent (e.g. call an API, update a database) | LangChain, external APIs | Interoperability and extensibility |
| 7 | Agentic interaction and Copilot | Splitting control between user and AI, dynamically generating guidance to walk users through tasks | Custom scripts | Facilitates user learning and control |

Level 1: The workhorse that still gets the job done

While relatively "basic", these approaches still cover the bulk of automated operations. They're quick, powerful, and easy to implement for your internal use cases.

Externally, a simple example is scraping data from a website: quickly navigating across different pages and scraping content from them, without ever actually seeing the website.
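
To make this concrete, here's a minimal sketch of Level 1 with Playwright. The URL and CSS selector are illustrative placeholders, not a real target:

```python
# Minimal Level 1 sketch: headless (non-visual) scraping with Playwright.
# The URL and selector below are illustrative placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # no window is ever shown
    page = browser.new_page()
    page.goto("https://example.com/articles")
    # Grab the text of every article title on the page
    titles = page.locator("h2.article-title").all_inner_texts()
    browser.close()

for title in titles:
    print(title)
```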

▶️ Video Demo

Basic scraping

Basic scraping application

Level 2: See it visually

Sometimes it's useful to replicate exactly what a user would see when they visit a website. This can be for a variety of reasons, the most common being the two below (a short sketch follows the list):

  • Handling JavaScript-heavy websites
  • Simulating user interactions (e.g. signing in, filling out forms, extracting brand colours and aesthetic)
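
Here's a minimal Level 2 sketch using Selenium; the URL, field names, and credentials are all illustrative placeholders. Because a real browser is driven, JavaScript executes exactly as it would for a human user:

```python
# Minimal Level 2 sketch: driving a visible, fully rendered browser with
# Selenium. The URL, field names, and credentials are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # opens an actual Chrome window
try:
    driver.get("https://example.com/login")
    # Simulate a user signing in
    driver.find_element(By.NAME, "email").send_keys("user@example.com")
    driver.find_element(By.NAME, "password").send_keys("correct-horse")
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
    # Capture exactly what the user would see after JavaScript has rendered
    driver.save_screenshot("after_login.png")
finally:
    driver.quit()
```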

▶️ Video Demo (with some bonus verbal overlay)

Levels 3 - 5

These levels are bundled because they share an approach, but they differ in sophistication and adaptability.

  • In L3, you rigidly prescribe, step by step, exactly what to click, by providing a screenshot of each target element. If any step or the UI changes, you have to update both the code and the images (see the sketch after this list).
  • In L4, you give a general instruction and the AI agent figures out its own way around, using screenshots and a means of controlling the browser. Even if the steps or the UI change, the agent has a chance of getting it right.
  • In L5, you're no longer restricted to a browser: the AI agent can control the machine itself.
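
For a sense of how rigid L3 is in practice, here's a minimal sketch with PyAutoGUI (with OpenCV installed for confidence matching); submit_button.png is a placeholder for a screenshot you'd capture yourself:

```python
# Minimal Level 3 sketch: hardcoded image search with PyAutoGUI, backed by
# OpenCV for confidence-based matching. "submit_button.png" is a placeholder
# for a screenshot of the element, captured by hand beforehand.
import pyautogui

try:
    # Scan the screen for a region matching the saved image
    location = pyautogui.locateCenterOnScreen("submit_button.png", confidence=0.9)
except pyautogui.ImageNotFoundException:
    # Newer versions raise instead of returning None
    location = None

if location is None:
    raise RuntimeError("Button not found - if the UI changed, recapture the image")

pyautogui.click(location)  # brittle by design: any visual change breaks this
```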

▶️ Video Demo. This is where the AI agent starts to become more intelligent and adaptable: it can handle a wider range of websites, providing a strong foundation for complete automation (demonstrated in L6 and L7).
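
To sketch the core of L4 and L5: a screenshot goes to a vision model, which decides the next action. The model name and prompt below are illustrative, and a real agent would wrap this in an observe-decide-act loop:

```python
# Minimal Level 4 sketch: asking a vision model what to do next, given a
# screenshot. The model name and prompt are illustrative; a real agent would
# loop observe -> decide -> act until the overall goal is reached.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("screen.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Goal: register a new user. What should I click next, "
                     "and where is it on this screenshot?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```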

Depiction of AI that uses screenshots to observe and work with programs

Here are some projects to watch in this space (image above is from UFO, one of the repos listed below):

  • VimGPT, which gives multimodal models a way to interact with the browser using the Vimium browser extension.
  • Self-Operating-Computer, which does something similar but uses x/y coordinates of the screen and controls the mouse/keyboard instead.
  • Open Interpreter, where LLMs can control your computer and automate folder management, file conversion, video editing, and more.
  • UFO, aka UI-Focused agent. The paper and code were published only 2 weeks ago, so it's definitely not production-ready, but it taps into the native Windows UI Automation API and so might have greater long-term potential.

Level 6: Augmenting AI agent with external tools

▶️ Video Demo. This is where people get excited, because it opens up new possibilities for AI agents.

AI with logic based routing and external tools

A simple example I use personally:

  • My voice notes are automatically transcribed, categorized, and filed into the appropriate folders.
  • If there are follow-up actions to take, they are automatically added to my task management list.
  • If I instruct in my voice note to send an email, it is sent without any further action from me.

With steps like these, plus conditional logic, nearly any application or workflow can be automated.
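
Here's a hedged sketch of the routing logic behind that workflow. The three helper functions are hypothetical stand-ins for real integrations (notes folder, task manager, email API), and the classification prompt is illustrative:

```python
# Minimal Level 6 sketch: an LLM classifies a transcribed voice note's
# intent, then conditional logic routes it to an external tool. The three
# helpers are hypothetical stand-ins for real API integrations.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def file_note(text: str) -> None:
    print("Filing note:", text)      # stand-in for a notes/folder API

def add_task(text: str) -> None:
    print("Adding task:", text)      # stand-in for a task manager API

def send_email(text: str) -> None:
    print("Sending email:", text)    # stand-in for an email API

def route_voice_note(transcript: str) -> None:
    # Ask the model for a single-word intent label
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Reply with exactly one word - FILE, TASK, or EMAIL - "
                       f"for the intent of this note:\n\n{transcript}",
        }],
    )
    intent = result.choices[0].message.content.strip().upper()

    # Conditional logic: each intent triggers a different downstream action
    if intent == "EMAIL":
        send_email(transcript)
    elif intent == "TASK":
        add_task(transcript)
    else:
        file_note(transcript)

route_voice_note("Remind me to email Sarah the updated syllabus tomorrow")
```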

Level 7: Agentic interaction and Copilot

▶️ Video Demo, which includes a discussion and demo of potential applications for students using multiple apps.

Dynamic UX support for students
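
To illustrate the split-control idea in miniature (everything below is a hypothetical stand-in for the custom scripts in the demo): the AI proposes each step, and the user either lets it act or takes over manually:

```python
# Minimal Level 7 sketch: control is split between the AI and the user.
# The Agent class and step list are illustrative stand-ins, not a real API.
class Agent:
    def __init__(self, steps):
        self.steps = list(steps)

    def next_step(self):
        return self.steps[0] if self.steps else None

    def execute(self):
        print(f"[AI] doing: {self.steps[0]}")
        self.steps.pop(0)

    def skip(self):
        # The user performed this step manually
        self.steps.pop(0)

agent = Agent(["open the LMS", "find this week's assignment", "upload the file"])
while (step := agent.next_step()) is not None:
    choice = input(f"Next step: {step!r}. Enter = let the AI do it, m = do it yourself: ")
    if choice.strip().lower() == "m":
        input("Press Enter once you've done it... ")
        agent.skip()
    else:
        agent.execute()
```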
