Automation in the AI Era (updated Mar '24)

by Edwin Ong, Founder & CEO

TL;DR:

I would go straight to levels 6 and 7 (video demos right below).

  • Level 6 — this demo gets people excited because it opens up a world of possibilities for automation that they previously thought was impossible.
  • Level 7 — builds on the above, showcasing an elegant approach to UX simplification for students (even when using 3rd party apps).

Overview:

| Level | Automation Approach | Description | Tools Used | Benefits |
|-------|---------------------|-------------|------------|----------|
| 1 | Non-visual GUI automation | Automating web interactions without a graphical interface (terminal-based) | Playwright, Scrapy | Improved speed, accuracy, and ease of use |
| 2 | Visual GUI automation | Rendering dynamic content and simulating user interactions for custom visual analysis | Selenium, Puppeteer | Better handling of JavaScript-heavy sites and custom downstream applications |
| 3 | Hardcoded images for visual search | Using image recognition to find and interact with on-screen elements | OpenCV, PyAutoGUI | Robust when the UI looks stable, even if the page source changes frequently |
| 4 | AI-driven dynamic search | Using AI vision and logic to locate elements, tolerating UI and page-structure changes | OpenAI vision API, LLaVA | Increased flexibility and efficiency |
| 5 | AI agent with overarching intent | An agent given a general intent (e.g. "register user then verify email") and control of the browser/machine, replicating a human exploring and using a website | Vimium, Self-Operating-Computer | Minimal scripting overhead per new website (just change the prompt) |
| 6 | AI agent augmented with external tools | Giving the AI the ability to take further downstream actions based on intent (e.g. call an API, update a database) | LangChain, external APIs | Interoperability and extensibility |
| 7 | Agentic interaction and Copilot | Splitting control between user and AI, dynamically generating guidance to walk users through tasks | Custom scripts | Facilitates user learning and control |

Level 1: The workhorse that still gets the job done

While relatively "basic", these approaches still cover the bulk of automated operations. They're quick, powerful, and easy to implement for your internal use cases.

Externally, a simple example is scraping data from a website: quickly navigating across different pages and scraping content from them, without ever actually seeing the website.
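
To make this concrete, here's a minimal sketch of Level 1 with Playwright. The URL and CSS selector are illustrative placeholders, not a real target:

```python
# Minimal Level 1 sketch: headless (non-visual) scraping with Playwright.
# The URL and selector below are illustrative placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # no window is ever shown
    page = browser.new_page()
    page.goto("https://example.com/articles")
    # Grab the text of every article title on the page
    titles = page.locator("h2.article-title").all_inner_texts()
    browser.close()

for title in titles:
    print(title)
```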

▶️ Video Demo

Basic scraping

Basic scraping application

Level 2: See it visually

Sometimes it's useful to replicate exactly what a user would see when they visit a website. This can be for a variety of reasons, the most common being the two below (a short sketch follows the list):

  • Handling JavaScript-heavy websites
  • Simulating user interactions (e.g. signing in, filling out forms, extracting brand colours and aesthetic)
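
Here's a minimal Level 2 sketch using Selenium; the URL, field names, and credentials are all illustrative placeholders. Because a real browser is driven, JavaScript executes exactly as it would for a human user:

```python
# Minimal Level 2 sketch: driving a visible, fully rendered browser with
# Selenium. The URL, field names, and credentials are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # opens an actual Chrome window
try:
    driver.get("https://example.com/login")
    # Simulate a user signing in
    driver.find_element(By.NAME, "email").send_keys("user@example.com")
    driver.find_element(By.NAME, "password").send_keys("correct-horse")
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
    # Capture exactly what the user would see after JavaScript has rendered
    driver.save_screenshot("after_login.png")
finally:
    driver.quit()
```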

▶️ Video Demo (with some bonus verbal overlay)

Levels 3 - 5

These levels are bundled because they share an approach, but they differ in sophistication and adaptability.

  • In L3, you rigidly prescribe, step by step, exactly what to click, by providing a screenshot of each target element. If any step or the UI changes, you have to update both the code and the images (see the sketch after this list).
  • In L4, you give a general instruction and the AI agent figures out its own way around, using screenshots and a means of controlling the browser. Even if the steps or the UI change, the agent has a chance of getting it right.
  • In L5, you're no longer restricted to a browser: the AI agent can control the machine itself.
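
For a sense of how rigid L3 is in practice, here's a minimal sketch with PyAutoGUI (with OpenCV installed for confidence matching); submit_button.png is a placeholder for a screenshot you'd capture yourself:

```python
# Minimal Level 3 sketch: hardcoded image search with PyAutoGUI, backed by
# OpenCV for confidence-based matching. "submit_button.png" is a placeholder
# for a screenshot of the element, captured by hand beforehand.
import pyautogui

try:
    # Scan the screen for a region matching the saved image
    location = pyautogui.locateCenterOnScreen("submit_button.png", confidence=0.9)
except pyautogui.ImageNotFoundException:
    # Newer versions raise instead of returning None
    location = None

if location is None:
    raise RuntimeError("Button not found - if the UI changed, recapture the image")

pyautogui.click(location)  # brittle by design: any visual change breaks this
```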

▶️ Video Demo. This is where the AI agent starts to become more intelligent and adaptable: it can handle a wider range of websites, providing a strong foundation for complete automation (demonstrated in L6 and L7).
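
To sketch the core of L4 and L5: a screenshot goes to a vision model, which decides the next action. The model name and prompt below are illustrative, and a real agent would wrap this in an observe-decide-act loop:

```python
# Minimal Level 4 sketch: asking a vision model what to do next, given a
# screenshot. The model name and prompt are illustrative; a real agent would
# loop observe -> decide -> act until the overall goal is reached.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("screen.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Goal: register a new user. What should I click next, "
                     "and where is it on this screenshot?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```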

Depiction of AI that uses screenshots to observe and work with programs

Here are some projects to watch in this space (image above is from UFO, one of the repos listed below):

  • VimGPT, which gives multimodal models a way to interact with the browser using the Vimium browser extension.
  • Self-Operating-Computer, which does something similar but uses x/y coordinates of the screen and controls the mouse/keyboard instead.
  • Open Interpreter, where LLMs can control your computer and automate folder management, file conversion, video editing, and more.
  • UFO, aka UI-Focused agent. The paper and code were published only 2 weeks ago, so it's definitely not production-ready, but it taps into the native Windows UI Automation API and so might have greater long-term potential.

Level 6: Augmenting AI agent with external tools

▶️ Video Demo. This is where people get excited, because it opens up new possibilities for AI agents.

AI with logic based routing and external tools

A simple example I use personally:

  • My voice notes are automatically transcribed, categorized, and filed into the appropriate folders.
  • If there are follow-up actions to take, they are automatically added to my task management list.
  • If I instruct in my voice note to send an email, it is sent without any further action from me.

With steps like these, plus conditional logic, nearly any application or workflow can be automated.
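
Here's a hedged sketch of the routing logic behind that workflow. The three helper functions are hypothetical stand-ins for real integrations (notes folder, task manager, email API), and the classification prompt is illustrative:

```python
# Minimal Level 6 sketch: an LLM classifies a transcribed voice note's
# intent, then conditional logic routes it to an external tool. The three
# helpers are hypothetical stand-ins for real API integrations.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def file_note(text: str) -> None:
    print("Filing note:", text)      # stand-in for a notes/folder API

def add_task(text: str) -> None:
    print("Adding task:", text)      # stand-in for a task manager API

def send_email(text: str) -> None:
    print("Sending email:", text)    # stand-in for an email API

def route_voice_note(transcript: str) -> None:
    # Ask the model for a single-word intent label
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Reply with exactly one word - FILE, TASK, or EMAIL - "
                       f"for the intent of this note:\n\n{transcript}",
        }],
    )
    intent = result.choices[0].message.content.strip().upper()

    # Conditional logic: each intent triggers a different downstream action
    if intent == "EMAIL":
        send_email(transcript)
    elif intent == "TASK":
        add_task(transcript)
    else:
        file_note(transcript)

route_voice_note("Remind me to email Sarah the updated syllabus tomorrow")
```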

Level 7: Agentic interaction and Copilot

▶️ Video Demo, which includes a discussion and demo of potential applications for students using multiple apps.

Dynamic UX support for students
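
To illustrate the split-control idea in miniature (everything below is a hypothetical stand-in for the custom scripts in the demo): the AI proposes each step, and the user either lets it act or takes over manually:

```python
# Minimal Level 7 sketch: control is split between the AI and the user.
# The Agent class and step list are illustrative stand-ins, not a real API.
class Agent:
    def __init__(self, steps):
        self.steps = list(steps)

    def next_step(self):
        return self.steps[0] if self.steps else None

    def execute(self):
        print(f"[AI] doing: {self.steps[0]}")
        self.steps.pop(0)

    def skip(self):
        # The user performed this step manually
        self.steps.pop(0)

agent = Agent(["open the LMS", "find this week's assignment", "upload the file"])
while (step := agent.next_step()) is not None:
    choice = input(f"Next step: {step!r}. Enter = let the AI do it, m = do it yourself: ")
    if choice.strip().lower() == "m":
        input("Press Enter once you've done it... ")
        agent.skip()
    else:
        agent.execute()
```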
