Gdrive crawler 2 (#105)
* google drive crawler

* updated gdrive crawler config

* Refactored GdriveCrawler to use date filtering, remove local storage for processed files, and handle PDF files

* Refactor: Switch to using index_file() for direct uploads, sanitize filenames

* added numpy

* Refactor gdrive_crawler.py: use slugify, logging.info, adjust date comparison, and rename byte_stream

* standardize date handling

* changed the file to earlier versions

* changed crawing to crawling

* removed redundant checks on dates and clubbed download() and export() into one

* resolving commit issues

* minor fixes

* small mypy fix

* updated Docker load of credentials.json

* added typing annotations

* added openpyxl

* same run.sh as in main branch

---------

Co-authored-by: Abhilasha Lodha <[email protected]>
Co-authored-by: Ofer Mendelevitch <[email protected]>
3 people authored Jul 23, 2024
1 parent 2fec733 commit 70687ae
Showing 2 changed files with 3 additions and 2 deletions.
3 changes: 2 additions & 1 deletion requirements.txt
@@ -54,4 +54,5 @@ pydub==0.25.1
pytube==15.0.0
openai-whisper==20231117
youtube-transcript-api==0.6.2
-sec-downloader==0.11.1
+sec-downloader==0.11.1
+openpyxl==3.1.4
2 changes: 1 addition & 1 deletion run.sh
@@ -89,4 +89,4 @@ if [ $? -eq 0 ]; then
echo "You can try 'docker logs -f vingest' to see the progress."
else
echo "Ingest container failed to start. Please check the messages above."
-fi
+fi
