google drive crawler #101

Merged 17 commits into main on Jul 18, 2024

Conversation

AbhilashaLodha
Contributor

  • Added a new config file, vectara-gdrive.yaml, for the Google Drive crawler under the config folder.
  • Added the Google Drive crawler script, gdrive_crawler.py, under the crawlers folder.
  • Added a condition in ingest.py to extract delegated-users.
  • Added more packages to requirements.txt.

@AbhilashaLodha AbhilashaLodha requested a review from ofermend July 2, 2024 21:49
crawlers/gdrive_crawler.py Outdated (resolved)
crawlers/gdrive_crawler.py Outdated (resolved)
requirements.txt Outdated
@@ -19,7 +19,8 @@ biopython==1.81
boto3==1.26.116
mwviews==0.2.1
toml==0.10.2
-pandas==1.3.5
+pandas==2.2.2
Collaborator

Why do we need to upgrade pandas and numpy?
This may break other things, so unless it's tested thoroughly I would avoid it without a specific need.

requirements.txt Outdated (resolved)
ingest.py Outdated (resolved)
self.creds = get_credentials(user)
self.service = build("drive", "v3", credentials=self.creds)

list_files = self.list_files(self.service, date_threshold=date_threshold.isoformat() + 'Z')
Collaborator

I see you're using the ISO format plus "Z". Is this because the Google API requires the timestamp to be in UTC, written this way?
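
For context on the question above: the Drive API expects RFC 3339 timestamps in queries and treats UTC as the default time zone, so a trailing "Z" is one valid way to mark UTC. A caveat is that `isoformat() + 'Z'` only produces a valid string when the datetime is naive; on a timezone-aware value it yields something like `+00:00Z`. The helper below is a minimal sketch, not the code from this PR; the name `to_rfc3339_utc` is hypothetical, and it assumes naive datetimes already hold UTC wall-clock time.

```python
from datetime import datetime, timedelta, timezone


def to_rfc3339_utc(dt: datetime) -> str:
    """Format a datetime as an RFC 3339 UTC timestamp ending in 'Z'."""
    if dt.tzinfo is None:
        # Assumption: a naive datetime is already expressed in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # Convert aware datetimes to UTC, then emit second precision with 'Z'.
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A naive value and an aware value two hours ahead of UTC map to the same instant.
naive = datetime(2024, 7, 1, 12, 0, 0)
aware = datetime(2024, 7, 1, 14, 0, 0, tzinfo=timezone(timedelta(hours=2)))
print(to_rfc3339_utc(naive))
print(to_rfc3339_utc(aware))
```

Normalizing to UTC before formatting keeps the threshold unambiguous whether the crawler computes it from `datetime.utcnow()` or from an aware datetime.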


list_files = self.list_files(self.service, date_threshold=date_threshold.isoformat() + 'Z')
for file in list_files:
modified_time = file.get('modifiedTime', None)
Collaborator

This is unneeded. The date_threshold check was already done in list_files(), so all files here should already be in the right date range.
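
To illustrate the point in this comment: if the date filter is passed to the Drive API as part of the `q` query, every returned file already satisfies it, and no second check on `modifiedTime` is needed in the caller. The following is a rough sketch under assumptions, not the `list_files()` from this PR; the function name `iter_files_modified_after`, the `trashed = false` clause, and the chosen `fields` are illustrative. It takes an already-built Drive v3 `service` object.

```python
from typing import Any, Dict, Iterator


def iter_files_modified_after(service: Any, date_threshold: str) -> Iterator[Dict[str, Any]]:
    """Yield Drive files modified after date_threshold (an RFC 3339 string).

    The filter runs server-side through the `q` parameter, so callers
    do not need to re-check modifiedTime on each returned file.
    """
    query = f"modifiedTime > '{date_threshold}' and trashed = false"
    page_token = None
    while True:
        response = service.files().list(
            q=query,
            fields="nextPageToken, files(id, name, mimeType, modifiedTime)",
            pageToken=page_token,
        ).execute()
        yield from response.get("files", [])
        # Follow pagination until the API stops returning a token.
        page_token = response.get("nextPageToken")
        if not page_token:
            break
```

A caller would then loop with `for file in iter_files_modified_after(service, threshold): ...` and process each file directly.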

Dockerfile Outdated
@@ -34,6 +34,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
# install python packages
WORKDIR ${HOME}
COPY requirements.txt requirements-extra.txt $HOME/
+COPY crawlers/credentials.json $HOME/
Collaborator

This is not a good idea. I'll explain why offline.

Collaborator

@ofermend ofermend left a comment
LGTM

@ofermend ofermend merged commit 947d581 into main Jul 18, 2024
2 participants