Skip to content

gpt + aria = ability to read browser contents

License

Notifications You must be signed in to change notification settings

momentic-ai/gpt-aria

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

77 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

gpt-aria

Experiment to teach gpt to make use of the chrome accessibility tree to turn the web into a textual interface and access it like a user of a screen-reader. This avoids html parsing, supports dynamic content, etc.

https://taras.glek.net/post/gpt-aria-experiment/

Running:

  • Install node
  • run npm install
  • run export OPENAI_API_KEY=<your key>
  • Run gpt-aria: ./gpt-aria.ts --objective "Whats the price of iphone 13 pro"
  • Note first run will take a while as puppeteer has to download chrome
  • Run it starting custom start page: ./gpt-aria.ts --objective "Whats the price of iphone 14 pro" --start-url https://duckduckgo.com

Prompt lives in prompt.ts, log of execution is in log.txt

Questions? @tarasglek on twitter or file github issues

Sample queries:

  • ./gpt-aria.ts --objective "What is the cultural capital of western ukraine" --start-url https://bing.com --headless
  • ./gpt-aria.ts --objective "Who was king of england when lviv was founded" --headless
  • who was president when first starwars was released?

Design

graph TD;
    subgraph GPT
      gpt["decide if enough info\nto return an ObjectiveComplete\nor if a BrowserAction is needed"]
    end
    subgraph gpt-aria
        BrowserAction
        ObjectiveComplete
        BrowserResponse
    end
    gpt-->ObjectiveComplete
    gpt--command-->BrowserAction
    BrowserAction--"command"-->Browser
    Browser --"url,ariaTree"--> BrowserResponse
    UserInput --"{objective, start-url}"--> BrowserAction
    BrowserResponse --"objective,progress[],url,ariaTree"--> gpt
    ObjectiveComplete--"result"--> UserOutput
Loading

Prior art:

Why ARIA is superior to raw html

html:

html

html with ARIA accessibility tree:

accessibility_tree

Follow-up ideas

Scrolling

Would be nice to have a command so gpt could scroll up/down the page to summarize content in it

Have an index of gpt prompts that explain in natural language how to navigate a particular website.

  • Eg for twitter it was say "In order to 'tweet', one goes to twitter.com and posts a tweet. in order to scan the latest news on twitter, one can pick use default timeline or pick a twitter list for a particular category".
  • For buying a house "redfin provides search functionality and ability to narrow down location and prices"
  • for shopping "amazon.com is a shopping site"
  • likewise for google, wikipedia, etc
  • eventually we'd want langchain-style website modules so you could specify "Summarize my inbox and news" which would be a composition of gmail and news modules

About

gpt + aria = ability to read browser contents

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • TypeScript 80.7%
  • JavaScript 19.3%