🚧 Colour based tagging & non-ocr page_to_text #111

Open
wants to merge 48 commits into
base: main
48 commits
9ad4726
✨ Added functionality for taking screenshot of original/raw page prio…
seanmcguire12 Jun 26, 2024
a9ee07f
✨ Reformatted core.py using ruff. Updated signature of page_to_image …
seanmcguire12 Jun 26, 2024
8ce819c
🔨 fix: specified correct type for the combined OCR annotations.
seanmcguire12 Jun 26, 2024
81229ab
♻️ Updated naming for _hide_non_tag_element(), moved logic for two ph…
seanmcguire12 Jun 26, 2024
b1611b3
Merge pull request #95 from reworkd/APE-75
seanmcguire12 Jun 26, 2024
5dd3391
Merge remote-tracking branch 'origin/main' into API-33
seanmcguire12 Jul 3, 2024
b4ad5e9
🔧 Removed "words" key from ImageAnnotatorResponse in core.py.
seanmcguire12 Jul 4, 2024
c597504
Merge remote-tracking branch 'origin/main' into APE-76
seanmcguire12 Jul 18, 2024
0ee8f82
WIP: implemented colour based tagging & page_to_text_new which doesnt…
seanmcguire12 Jul 18, 2024
03f05f2
Fix: google creds can be loaded as JSON. Added bananalyzer download t…
seanmcguire12 Jul 22, 2024
df5cd7e
Fixed sticky/fixed element issue. Added data type for coloured elements.
seanmcguire12 Jul 24, 2024
5812ff5
Fix: added filtering functionality on getElementBoundingBoxes functio…
seanmcguire12 Jul 24, 2024
9c64fd1
implemented sorting before combining annotations to reduce spacing & …
seanmcguire12 Jul 25, 2024
384c55f
- Added functions for recolouring elements so they can be found with …
seanmcguire12 Aug 1, 2024
81025fb
Improved color tagging:
seanmcguire12 Aug 1, 2024
f7bceab
added text child tagging
seanmcguire12 Aug 10, 2024
f51a353
Merge branch 'refs/heads/main' into APE-76
Aug 14, 2024
943a24c
WIP: still need to fix tests and delete debug code
seanmcguire12 Aug 14, 2024
f6ba5be
Merge branch 'main' into APE-76
seanmcguire12 Aug 15, 2024
31a1da3
Fixed missing leaf text issue
seanmcguire12 Aug 19, 2024
a1eedd7
Merge remote-tracking branch 'refs/remotes/origin/main' into APE-76
seanmcguire12 Aug 22, 2024
385e9c5
📈 Rm debug code, added !important for span styles
seanmcguire12 Aug 22, 2024
84a1ecd
📈 Rm transformXpath fn
seanmcguire12 Aug 23, 2024
a49c673
✏️ Avoid tagging separator symbols
asim-shrestha Aug 23, 2024
4b8015f
Merge branch 'main' into APE-76
seanmcguire12 Aug 29, 2024
e644419
Fix: include tags for buttons & icons that don't have text
seanmcguire12 Aug 29, 2024
dab4508
use words.length instead of boundingBoxes to determine if we should r…
seanmcguire12 Aug 29, 2024
03a386d
add spacing between tag characters. eg: [$1] becomes [ $ 1 ]
seanmcguire12 Aug 29, 2024
4016151
fix: make sure checkboxes are coloured
seanmcguire12 Aug 30, 2024
94d1b75
include placeholder text
seanmcguire12 Aug 30, 2024
93fd680
reduce height of bounding boxes to mitigate excesive use of ** in tex…
seanmcguire12 Aug 30, 2024
3d92214
update bounding box width to accommodate extra spacing inside tarsier…
seanmcguire12 Aug 30, 2024
fd1c30e
Merge branch 'main' into APE-76
seanmcguire12 Aug 30, 2024
362c85b
Merge branch 'main' into APE-76
seanmcguire12 Aug 30, 2024
b46f249
added tagless functionality for colour tagging
seanmcguire12 Aug 31, 2024
46dad31
fix: make sure tag_to_xpath returns xpaths of all coloured elements, …
seanmcguire12 Aug 31, 2024
d98989a
Merge branch 'main' into APE-76
seanmcguire12 Aug 31, 2024
7e2bf7a
get the first option of dropdown text if there is no default selected…
seanmcguire12 Aug 31, 2024
5b8c4ff
Merge branch 'main' into APE-76
seanmcguire12 Sep 1, 2024
3e0b294
added functionality to revert webpage after colour tagging. changed r…
seanmcguire12 Sep 1, 2024
0d643e2
Merge branch 'main' into APE-76
seanmcguire12 Sep 26, 2024
960d686
Merge branch 'main' into APE-76
seanmcguire12 Sep 27, 2024
917057c
refactor colour tagging
seanmcguire12 Sep 29, 2024
4c5edf4
more refactoring, store/restore DOM instead of using revert functions…
seanmcguire12 Oct 3, 2024
0e56205
reformat
seanmcguire12 Oct 4, 2024
06f1eaf
Merge branch 'main' into APE-76
seanmcguire12 Oct 4, 2024
54122e1
update lock
seanmcguire12 Oct 4, 2024
ed35707
prettier fix
seanmcguire12 Oct 4, 2024
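Commit 03a386d above describes adding spacing between tag characters ("[$1] becomes [ $ 1 ]"). A minimal Python sketch of that string transformation follows. The helper names and the regular expression are hypothetical and not taken from the PR, and the sketch assumes every character of a tag is separated, so a multi-digit tag such as [@12] would become [ @ 1 2 ]; the commit message only shows the single-digit case.

```python
import re

# Hypothetical pattern (an assumption, not from the PR): matches tags such as
# [1], [$1], [@12] or [#3].
TAG_PATTERN = re.compile(r"\[[#@$]?\d+\]")


def space_out_tag(tag: str) -> str:
    """Put a single space between every character of one tag."""
    return " ".join(tag)


def space_out_tags(text: str) -> str:
    """Apply space_out_tag to each tag in text; all other text is left untouched."""
    return TAG_PATTERN.sub(lambda match: space_out_tag(match.group(0)), text)


print(space_out_tag("[$1]"))  # [ $ 1 ]
print(space_out_tags("[$2] Customize [@7] Careers"))
```

Whether the actual change spaces out digits within an ID is not visible from the truncated commit list, so treat that behaviour as illustrative only.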
859 changes: 562 additions & 297 deletions poetry.lock

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions pyproject.toml
@@ -14,6 +14,9 @@ playwright = "^1.44.0"
selenium = "^4.21.0"
google-cloud-vision = "^3.7.2"
azure-ai-vision-imageanalysis = "^1.0.0b2"
pillow = "^10.4.0"
numpy = "^2.0.1"
tiktoken = "^0.7.0"


[tool.poetry.group.dev.dependencies]
6 changes: 5 additions & 1 deletion scripts/setup.sh
@@ -5,4 +5,8 @@ cd ..
npm install
npm run build

poetry install
poetry install

cd ./tarsier-snapshots || exit 1
poetry install
poetry run bananalyze --download
Comment on lines +9 to +12
Contributor
Not really relevant for setting up tarsier. Would delete.

48 changes: 48 additions & 0 deletions tarsier-snapshots/snapshots/1JWoJWs3uZMt8Wa5ql6pr/non_ocr_2.txt
@@ -0,0 +1,48 @@
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[0] We value your privacy
[1] We use cookies to enhance your browsing experience, serve personalized ads or content, and analyze our traffic. By clicking Accept
All", you consent to our use of cookies.
[$2] Customize [$3] Reject All [$4] Accept All
[@7] Notice of Non-Discrimination [@8] Careers [@9] Patient Portal/Pay my Bill [@10] For Providers [@11] Contact Us
[@12] About Us [@26] For Patients [@35] Our Services [@65] Our Locations [@66] Find a Provider


**[69] Our Locations**
[$71] Search by location name [$73] Filter by category [$75] Search by ZIP code
[$74] All categories Children's Hospital New Orleans East Jefferson General Hospital Lakeside Hospital Lakeview Hospital Lakeview Regional Physician Group New Orleans East Hospital Touro Tulane Medical Center University Medical Center New Orleans Urgent Care West Jefferson Medical Center







[@77] Use my location

[@78] Clear filters

[79] Amelia Health Center Cardiology and
Cardiovascular Surgery
[80] Touro
[81] 3715 Prytania St.
[82] Suite 400
[83] New Orleans, LA 70115
[@84] 504. 897. 8276
[@85] More information
[@86] Get directions

[87] Behavioral Health Center
[88] Children's Hospital New Orleans
[89] 210 State St.
[90] New Orleans, LA 10118
[@91] 504. 896. 7200
[@92] More information
[@93] Get directions

[94] Bienville Health Center Primary Care and
OB/GYN
[95] Touro

[98] New Orleans, LA 70119
[@99] 504. 252. 9488
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
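For reviewers who want to inspect a snapshot like the one above programmatically, here is a small Python sketch that splits one snapshot line into its tagged elements. It is not part of the PR. The `TaggedElement` type, the function name, and the prefix-to-kind mapping are assumptions: the snapshot suggests that `@` marks links and `$` marks other interactable elements, and `#` is assumed to mark text inputs, but this diff does not confirm any of that.

```python
import re
from typing import NamedTuple

# Assumed meanings of the tag prefixes, inferred from the snapshot text only.
TAG_KINDS = {"": "text", "@": "link", "$": "interactable", "#": "input"}
TAG_PATTERN = re.compile(r"\[([#@$]?)(\d+)\]\s*")


class TaggedElement(NamedTuple):
    kind: str
    tag_id: int
    text: str


def parse_tagged_line(line: str) -> list[TaggedElement]:
    """Split one snapshot line into (kind, id, text) entries, one per tag."""
    matches = list(TAG_PATTERN.finditer(line))
    elements = []
    for i, match in enumerate(matches):
        # A tag's text runs until the next tag starts, or to the end of the line.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        # Drop surrounding whitespace and the ** markers used for emphasised text.
        text = line[match.end():end].strip().strip("*").strip()
        kind = TAG_KINDS[match.group(1)]
        elements.append(TaggedElement(kind, int(match.group(2)), text))
    return elements


for element in parse_tagged_line("[$2] Customize [$3] Reject All [$4] Accept All"):
    print(element)
```

Lines without a tag, such as the wrapped continuation `All", you consent to our use of cookies.`, yield an empty list, so a caller would need its own rule for attaching continuation text to the preceding element.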
265 changes: 265 additions & 0 deletions tarsier-snapshots/snapshots/1qkOHewUy0Kqq9RVVSOoQ/non_ocr_2.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,265 @@
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[9] Notice
[10] Road and parking lot construction in Madison, Wis. may result in travel delays and route changes to UW
Health clinic and hospital locations. Please plan accordingly. Read [@11] Read more more


[@20] Find a Doctor [@21] Conditions Services [@22] Locations Clinics [@23] Patients Families [@24] MyChart




[@29] UW Health Specialty Clinic - Sauk Prairie


**[31] Eye Care Clinic**
**Optometry**










[@32] 250 26th Street, Suite 120 Prairie du Sac, WI 53578 [@33] 608 643-6060 [$34] Closed now


[@50] Parking and [@51] Hours of
transportation operation [@52] Providers




**[53] Parking and**
**transportation**



















[$58] Keyboard shortcuts [59] Map data 2023 [$60] 50 m [@61] Terms [@62] Report a map error
Click to toggle between metric and imperial units




































































































[64] UW Health Specialty Clinic - Sauk
Prairie

[66] 250 26th Street, Suite 120 Prairie
du Sac, WI
[68] 608 643-6060
















[70] Hours of operation


[$71] Eye Care Clinic Optometry closed


[74] Monday-Friday [75] 8am-5pm
[76] Upcoming special hours
[77] Nov 22 [78] Closed
[$79] UW Health Specialty Clinic - Sauk
Prairie
closed


[82] Upcoming special hours
[83] Nov 22 [84] Closed









**[85] Providers**













[@91] Telehealth [@92] Priority OrthoCare [@93] e-Visits [@94] Emergency Urgent care [@95] UW School of Medicine and Public Health


**[@96] MyChart**

**[@97] Find a Doctor**

**[@98] Conditions Services**

**[@99] Locations Clinics**

**[@100] Patients Families**

**[@101] Refer a Patient**
[@102] Pay a bill [@103] Careers

[@104] Refill a prescription [@105] News

[@106] Obtain medical records [@107] Clinical Trials

[@108] Order flowers and gifts [@109] Volunteering

[@110] Send a greeting card [@111] About

[@112] Make a donation [@113] Find a class or support group





[@114] About UW Health [@115] Diversity, Equity and Inclusion [@116] Media Center [@117] Contact Us [@118] Make a donation


[@119] Notice of Privacy Practices HIPAA [@121] Language Access Notice of Nondiscrimination
[120] Donations to UW Health are managed by the University of
Wisconsin Foundation, a publicly supported charitable
organization under 501 c 3 of the Internal Revenue Code.
[@122] English [@123] Espa ol Spanish [@124] Hmoob Hmong
[@125] Chinese [@126] Deutsch German [@127] Arabic
[@128] Russian [@129] Korean [@130] Ti ng Vi t Vietnamese
[@131] Deitsch Pennsylvania Dutch [@132] Lao
[@133] Fran ais French [@134] Polski Polish [@135] Hindi
[@136] Shqip Albanian [@137] Tagalog Tagalog Filipino

[138] Copyright 2023 University of Wisconsin Hospitals and Clinics Authority [@142] Terms and conditions [@143] Website privacy policy [@144] Employee home access
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
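The snapshot above contains long runs of consecutive blank lines. As a review aid, the following Python sketch measures the longest such run and shows what the text would look like with the runs collapsed. Both helper names are hypothetical, and nothing here reflects how the PR itself produces or post-processes snapshots; collapsing would also discard vertical layout that the snapshot may preserve on purpose.

```python
from itertools import groupby


def longest_blank_run(snapshot_text: str) -> int:
    """Return the length of the longest run of consecutive blank lines."""
    lines = snapshot_text.splitlines()
    runs = [
        sum(1 for _ in group)
        for is_blank, group in groupby(lines, key=lambda line: not line.strip())
        if is_blank
    ]
    return max(runs, default=0)


def collapse_blank_runs(snapshot_text: str, max_blank: int = 1) -> str:
    """Keep at most max_blank consecutive blank lines; other lines are unchanged."""
    kept = []
    blank_count = 0
    for line in snapshot_text.splitlines():
        # Count blank lines in the current run; reset on any non-blank line.
        blank_count = blank_count + 1 if not line.strip() else 0
        if blank_count <= max_blank:
            kept.append(line)
    return "\n".join(kept)


sample = "[70] Hours of operation\n\n\n\n[$71] Eye Care Clinic Optometry closed"
print(longest_blank_run(sample))
print(collapse_blank_runs(sample))
```

Comparing `longest_blank_run` across snapshots could help spot pages where the text layout leaves large vertical gaps, which may matter for token usage given the `tiktoken` dependency added in `pyproject.toml`.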