feat: add file-based access permissions for SharePoint ingest (#1628)
This PR:

- defines `rbac_data` as a SourceMetadata field,
- manages connections to an external API for obtaining RBAC data with the
ConnectorRBAC class,
- serializes the RBAC data and saves it to disk,
- matches the `rbac_data` on disk to each IngestDoc, using a common
field,
- forwards the RBAC data to Elements via the partition() function

To test the changes, run `examples/ingest/sharepoint/ingest.sh` with the
relevant rbac & connector credentials
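
The serialize-then-match steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the function names `write_rbac_to_disk` and `match_rbac_to_doc`, the use of a file id as the common field, and the one-JSON-file-per-document layout are all hypothetical choices made for the example.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def write_rbac_to_disk(rbac_by_file_id: Dict[str, Any], directory: Path) -> None:
    """Serialize one JSON file per document, named after the shared identifier.

    Hypothetical helper: the real on-disk layout in the PR may differ.
    """
    directory.mkdir(parents=True, exist_ok=True)
    for file_id, rbac_data in rbac_by_file_id.items():
        (directory / f"{file_id}.json").write_text(json.dumps(rbac_data))


def match_rbac_to_doc(file_id: str, directory: Path) -> Optional[Dict[str, Any]]:
    """Return the permissions saved for a document, or None if none were saved.

    The common field (here a file id) is an assumption for illustration.
    """
    path = directory / f"{file_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

Writing the data to disk first keeps the permissions lookup separate from document processing, so each document only needs its identifier to find its own permissions later.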

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: ahmetmeleq <[email protected]>
3 people authored Oct 13, 2023
1 parent 3ec3673 commit 94836cf
Showing 24 changed files with 1,481 additions and 15 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
@@ -288,6 +288,9 @@ jobs:
SHAREPOINT_CLIENT_ID: ${{secrets.SHAREPOINT_CLIENT_ID}}
SHAREPOINT_CRED: ${{secrets.SHAREPOINT_CRED}}
SHAREPOINT_SITE: ${{secrets.SHAREPOINT_SITE}}
SHAREPOINT_PERMISSIONS_APP_ID: ${{secrets.SHAREPOINT_PERMISSIONS_APP_ID}}
SHAREPOINT_PERMISSIONS_APP_CRED: ${{secrets.SHAREPOINT_PERMISSIONS_APP_CRED}}
SHAREPOINT_PERMISSIONS_TENANT: ${{secrets.SHAREPOINT_PERMISSIONS_TENANT}}
SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
NOTION_API_KEY: ${{ secrets.NOTION_API_KEY }}
3 changes: 3 additions & 0 deletions .github/workflows/ingest-test-fixtures-update-pr.yml
@@ -85,6 +85,9 @@ jobs:
SHAREPOINT_CLIENT_ID: ${{secrets.SHAREPOINT_CLIENT_ID}}
SHAREPOINT_CRED: ${{secrets.SHAREPOINT_CRED}}
SHAREPOINT_SITE: ${{secrets.SHAREPOINT_SITE}}
SHAREPOINT_PERMISSIONS_APP_ID: ${{secrets.SHAREPOINT_PERMISSIONS_APP_ID}}
SHAREPOINT_PERMISSIONS_APP_CRED: ${{secrets.SHAREPOINT_PERMISSIONS_APP_CRED}}
SHAREPOINT_PERMISSIONS_TENANT: ${{secrets.SHAREPOINT_PERMISSIONS_TENANT}}
SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
NOTION_API_KEY: ${{ secrets.NOTION_API_KEY }}
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -8,9 +8,11 @@
### Features
* **Add `elements_to_text` as a staging helper function** In order to get a single clean text output from unstructured for metric calculations, automate the process of extracting text from elements using this function.

* **Adds permissions (RBAC) data ingestion functionality for the Sharepoint connector.** Problem: Role-based access control is an important component in many data storage systems. Users may need to pass permissions (RBAC) data to downstream systems when ingesting data. Feature: Added permissions data ingestion functionality to the Sharepoint connector.

### Fixes

* **Fixes PDF list parsing creating duplicate list items** Previously a bug in PDF list item parsing caused removal of other elements and duplication of the list items
* **Fixes PDF list parsing creating duplicate list items** Previously a bug in PDF list item parsing caused removal of other elements and duplication of the list item
* **Fixes duplicated elements** Fixes issue where elements are duplicated when embeddings are generated. This will allow users to generate embeddings for their list of Elements without duplicating/breaking the original content.
* **Fixes failure when flagging for embeddings through unstructured-ingest** Currently adding the embedding parameter to any connector results in a failure on the copy stage. This resolves the issue by adding the IngestDoc to the context map in the embedding node's `run` method. This allows users to specify that connectors fetch embeddings without failure.
* **Fix ingest pipeline reformat nodes not discoverable** Fixes issue where reformat nodes raise ModuleNotFoundError on import. This was because the directory was missing the `__init__.py` needed to make it discoverable.
14 changes: 14 additions & 0 deletions docs/source/source_connectors/sharepoint.rst
@@ -22,6 +22,9 @@ Run Locally
--client-id "<Microsoft Sharepoint app client-id>" \
--client-cred "<Microsoft Sharepoint app client-secret>" \
--site "<e.g https://contoso.sharepoint.com or https://contoso.admin.sharepoint.com to process all sites within tenant>" \
--permissions-application-id "<Microsoft Graph API application id, to process per-file access permissions>" \
--permissions-client-cred "<Microsoft Graph API application credentials, to process per-file access permissions>" \
--permissions-tenant "<e.g https://contoso.onmicrosoft.com (tenant URL) to process per-file access permissions>" \
--files-only "Flag to process only files within the site(s)" \
--output-dir sharepoint-ingest-output \
--num-processes 2 \
@@ -46,6 +49,10 @@ Run Locally
client_id="<Microsoft Sharepoint app client-id>",
client_cred="<Microsoft Sharepoint app client-secret>",
site="<e.g https://contoso.sharepoint.com to process all sites within tenant>",
# Credentials to process data about permissions (rbac) within the tenant
permissions_application_id="<Microsoft Graph API application id>",
permissions_client_cred="<Microsoft Graph API application credentials>",
permissions_tenant="<e.g https://contoso.onmicrosoft.com to process permission info within tenant>",
# Flag to process only files within the site(s)
files_only=True,
path="Shared Documents",
@@ -68,6 +75,9 @@ You can also use upstream connectors with the ``unstructured`` API. For this you
--client-id "<Microsoft Sharepoint app client-id>" \
--client-cred "<Microsoft Sharepoint app client-secret>" \
--site "<e.g https://contoso.sharepoint.com or https://contoso.admin.sharepoint.com to process all sites within tenant>" \
--permissions-application-id "<Microsoft Graph API application id, to process per-file access permissions>" \
--permissions-client-cred "<Microsoft Graph API application credentials, to process per-file access permissions>" \
--permissions-tenant "<e.g https://contoso.onmicrosoft.com (tenant URL) to process per-file access permissions>" \
--files-only "Flag to process only files within the site(s)" \
--output-dir sharepoint-ingest-output \
--num-processes 2 \
@@ -98,6 +108,10 @@ You can also use upstream connectors with the ``unstructured`` API. For this you
client_id="<Microsoft Sharepoint app client-id>",
client_cred="<Microsoft Sharepoint app client-secret>",
site="<e.g https://contoso.sharepoint.com to process all sites within tenant>",
# Credentials to process data about permissions (rbac) within the tenant
permissions_application_id="<Microsoft Graph API application id>",
permissions_client_cred="<Microsoft Graph API application credentials>",
permissions_tenant="<e.g https://contoso.onmicrosoft.com to process permission info within tenant>",
# Flag to process only files within the site(s)
files_only=True,
path="Shared Documents",
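
The three permissions parameters documented above line up with the three secrets added to the CI workflows. A minimal sketch of wiring them together, assuming the credentials arrive as environment variables: the helper `permissions_kwargs` is hypothetical and not part of the library, and the all-or-nothing rule (return nothing unless all three values are set) is an assumption made for this example rather than documented behavior.

```python
import os
from typing import Dict, Mapping

# Environment variable names as they appear in the CI workflow; the keyword
# names mirror the parameters shown in the documentation snippet.
ENV_TO_KWARG = {
    "SHAREPOINT_PERMISSIONS_APP_ID": "permissions_application_id",
    "SHAREPOINT_PERMISSIONS_APP_CRED": "permissions_client_cred",
    "SHAREPOINT_PERMISSIONS_TENANT": "permissions_tenant",
}


def permissions_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the three permissions kwargs, or {} unless all three are set.

    Hypothetical helper: treating the credentials as all-or-nothing is an
    assumption of this sketch.
    """
    values = {kwarg: env.get(name, "") for name, kwarg in ENV_TO_KWARG.items()}
    return values if all(values.values()) else {}
```

The returned dictionary could then be unpacked into whatever configuration object accepts these parameters, leaving permissions processing off when the credentials are absent.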
5 changes: 5 additions & 0 deletions examples/ingest/sharepoint/ingest.sh
100644 → 100755
@@ -12,6 +12,8 @@
# To get the credentials for your Sharepoint app, follow these steps:
# https://github.com/vgrem/Office365-REST-Python-Client/wiki/How-to-connect-to-SharePoint-Online-and-and-SharePoint-2013-2016-2019-on-premises--with-app-principal

# To optionally set up your application and obtain permissions related variables (--permissions-application-id, --permissions-client-cred, --permissions-tenant), follow these steps:
# https://tsmatz.wordpress.com/2016/10/07/application-permission-with-v2-endpoint-and-microsoft-graph


SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
@@ -22,6 +24,9 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--client-id "<Microsoft Sharepoint app client-id>" \
--client-cred "<Microsoft Sharepoint app client-secret>" \
--site "<e.g https://contoso.sharepoint.com or https://contoso.admin.sharepoint.com to process all sites within tenant>" \
--permissions-application-id "<Microsoft Graph API application id to process per-file access permissions>" \
--permissions-client-cred "<Microsoft Graph API application credentials to process per-file access permissions>" \
--permissions-tenant "<e.g https://contoso.onmicrosoft.com to process per-file access permissions>" \
--files-only "Flag to process only files within the site(s)" \
--output-dir sharepoint-ingest-output \
--num-processes 2 \
1 change: 1 addition & 0 deletions test_unstructured_ingest/check-diff-expected-output.sh
@@ -46,6 +46,7 @@ if [ "$OVERWRITE_FIXTURES" != "false" ]; then
elif ! diff -ru "$EXPECTED_OUTPUT_DIR" "$OUTPUT_DIR" ; then
"$SCRIPT_DIR"/json-to-clean-text-folder.sh "$EXPECTED_OUTPUT_DIR" "$EXPECTED_OUTPUT_DIR_TEXT"
"$SCRIPT_DIR"/json-to-clean-text-folder.sh "$OUTPUT_DIR" "$OUTPUT_DIR_TEXT"
"$SCRIPT_DIR"/clean-permissions-files.sh "$OUTPUT_DIR_TEXT"
diff -ru "$EXPECTED_OUTPUT_DIR_TEXT" "$OUTPUT_DIR_TEXT"> outputdiff.txt
cat outputdiff.txt
diffstat -c outputdiff.txt
27 changes: 27 additions & 0 deletions test_unstructured_ingest/clean-permissions-files.sh
@@ -0,0 +1,27 @@
#!/usr/bin/env bash

# Description: Delete (cleanup) permissions files in a folder, so that they are not included in
# text diff tests.
#
# Arguments:
# - $1: Name of the folder to do the cleanup operation in.

set +e
if [ "$#" -ne 1 ]; then
    echo "Please provide a folder to clean the files in: $0 <folder_path>"
    exit 1
fi

folder_path="$1"
if [ ! -d "$folder_path" ]; then
    echo "'$folder_path' is not a directory. Please provide a folder / directory."
    exit 1
fi

for file in "$folder_path"/*_SEP_*; do
    if [ -e "$file" ]; then
        rm "$file"
    fi
done

echo "Completed cleanup for permissions files"
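
For comparison, the same cleanup can be sketched in Python. This is an illustrative near-equivalent and not part of the PR: the function name `clean_permissions_files` is hypothetical, and the `_SEP_` marker is taken from the glob pattern in the script above. Unlike the shell version, this sketch raises an exception instead of exiting with status 1, and it skips anything that is not a regular file.

```python
from pathlib import Path
from typing import List


def clean_permissions_files(folder_path: str) -> List[str]:
    """Delete files whose names contain `_SEP_` directly inside folder_path.

    Returns the names of the removed files so callers can log or check them.
    """
    folder = Path(folder_path)
    if not folder.is_dir():
        raise NotADirectoryError(f"'{folder_path}' is not a directory.")
    removed = []
    for path in sorted(folder.glob("*_SEP_*")):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed
```

Returning the removed names is a design choice for this sketch; it makes the helper easy to assert on in a test, whereas the shell script only prints a completion message.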