Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add file-based access permissions for SharePoint ingest #1628

Merged
merged 66 commits into from
Oct 13, 2023

Conversation

ahmetmeleq
Copy link
Contributor

@ahmetmeleq ahmetmeleq commented Oct 3, 2023

This PR:

  • defines rbac_data as a SourceMetadata field,
  • manages connections to an external api for obtaining rbac data with ConnectorRBAC class,
  • serializes rbac data and saves it to the disk,
  • matches the rbac_data in the disk to each IngestDoc, using a common field,
  • forwards rbac data to Elements, via the partition() function

To test the changes, run examples/ingest/sharepoint/ingest.sh with the relevant rbac & connector credentials

@ahmetmeleq ahmetmeleq self-assigned this Oct 3, 2023
@ahmetmeleq
Copy link
Contributor Author

Potential improvements before merging:

  • improve rbac data to ingestdoc matching logic (currently checking filenames in a naive way)
  • better folder management for writing down the rbac data (currently writes all rbac data directly into the output_dir)
  • obtaining rbac_data field as a python object and serializing later, (currently obtains it as a str)
  • handling errors caused by upstream connection issues
    • ie requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: ...

@ahmetmeleq
Copy link
Contributor Author

Note: Reading permissions data as an object rather than a str causes proper formatting in the outputted elements.json file (rather than keeping the permissions data in one line), and results in permission data dominating / polluting the file. Repeating it for each element adds to the problem. Open to suggestions about this

@ahmetmeleq
Copy link
Contributor Author

Note: Sharepoint seems to be having trouble with our parallel requests, giving:

Error: {"error":"invalid_request",
"error_description":"AADSTS900023: Specified tenant identifier 'none' is neither a valid DNS name, nor a valid external domain.
...}

or

office365.runtime.client_request_exception.ClientRequestException: (None, None, "503 Server Error: Service Unavailable for url: 

To address this, I've reduced num-processes in the example .sh to 1, and it seems like this helps

@ahmetmeleq
Copy link
Contributor Author

Example file for testing:

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cd "$SCRIPT_DIR"/../../.. || exit 1

PYTHONPATH=. ./unstructured/ingest/main.py \
    sharepoint \
    --client-id $SHAREPOINT_CLIENT_ID \
    --client-cred $SHAREPOINT_CRED \
    --permissions-application-id $SHAREPOINT_RBAC_CLIENT_APPLICATION_ID \
    --permissions-client-cred $SHAREPOINT_RBAC_CLIENT_SECRET \
    --site $SHAREPOINT_SITE \
    --permissions-tenant $SHAREPOINT_RBAC_TENANT \
    --output-dir sharepoint-ingest-output \
    --num-processes 1 \
    --path "Shared Documents" \
    --verbose

@ahmetmeleq
Copy link
Contributor Author

Test fixtures update PR on the way as well

are these still on route?

We're there now 👍

@ahmetmeleq ahmetmeleq enabled auto-merge October 12, 2023 02:06
Copy link
Contributor

@ryannikolaidis ryannikolaidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple nits on docs and a comment / followup on CliPermissionsConfig

Copy link
Contributor

@ryannikolaidis ryannikolaidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! nice work, I know this was a huge lift.

@ryannikolaidis ryannikolaidis changed the title feat: rbac ingestion for sharepoint feat: add file-based access permissions for SharePoint ingest Oct 13, 2023
@ahmetmeleq ahmetmeleq added this pull request to the merge queue Oct 13, 2023
Merged via the queue into main with commit 94836cf Oct 13, 2023
39 checks passed
@ahmetmeleq ahmetmeleq deleted the ahmet/sharepoint-rbac branch October 13, 2023 01:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants