feat: add file-based access permissions for SharePoint ingest #1628

Merged: 66 commits, Oct 13, 2023
Commits
6d2ecd2
add rbac ingestion via graph api
ahmetmeleq Sep 28, 2023
1b8a109
implement class
ahmetmeleq Sep 28, 2023
9b6400e
run ConnectorRBAC class
ahmetmeleq Sep 28, 2023
559aa2b
integrate rbac ingestion to the cli app
ahmetmeleq Sep 28, 2023
0798cbd
delete prototyping file
ahmetmeleq Sep 28, 2023
990d985
delete prototyping file
ahmetmeleq Sep 28, 2023
9d942b6
write permissions to disk
ahmetmeleq Sep 29, 2023
6cc3bf8
debugging to identify common field between rbac object and ingestdoc …
ahmetmeleq Sep 29, 2023
50f1466
trials on writing rbac data into the ingest doc
ahmetmeleq Sep 29, 2023
18905da
add rbac data ingestion
ahmetmeleq Oct 3, 2023
c0ab58a
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 3, 2023
17f4319
remove irrelevant lines
ahmetmeleq Oct 3, 2023
a97b6cd
change name rbac to permissions
ahmetmeleq Oct 4, 2023
2fc2539
load permissions data as an object rather than str
ahmetmeleq Oct 4, 2023
035cc32
assign folder for permissions data
ahmetmeleq Oct 4, 2023
5a5c80b
remove todo point
ahmetmeleq Oct 4, 2023
65ab961
add exact filename match condition
ahmetmeleq Oct 4, 2023
bba81db
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 4, 2023
c9b1691
revert example file
ahmetmeleq Oct 4, 2023
f3895c4
Merge branch 'ahmet/sharepoint-rbac' of https://github.com/Unstructur…
ahmetmeleq Oct 4, 2023
a384df1
typing
ahmetmeleq Oct 4, 2023
3ec0ca8
typing
ahmetmeleq Oct 4, 2023
adeb806
changelog and version
ahmetmeleq Oct 4, 2023
b557094
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 4, 2023
b74c81e
include sitenames for name matching
ahmetmeleq Oct 5, 2023
1ada530
Merge branch 'ahmet/sharepoint-rbac' of https://github.com/Unstructur…
ahmetmeleq Oct 5, 2023
e1cfab7
revert example
ahmetmeleq Oct 5, 2023
c6eb13a
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 5, 2023
72c21d0
changelog and version
ahmetmeleq Oct 5, 2023
bb3b191
Update unstructured/ingest/interfaces.py
ahmetmeleq Oct 5, 2023
803dcec
remove debugging lines
ahmetmeleq Oct 9, 2023
e2fef65
create permissions config
ahmetmeleq Oct 9, 2023
170bc79
change error message
ahmetmeleq Oct 9, 2023
c491ff0
skip permissions ingestion when args are not provided
ahmetmeleq Oct 9, 2023
1454522
keep access token in post init
ahmetmeleq Oct 9, 2023
62f3262
typing: make permission args for runner optional
ahmetmeleq Oct 9, 2023
bb03eaf
update example
ahmetmeleq Oct 9, 2023
bfe2925
update tests
ahmetmeleq Oct 9, 2023
b4ba750
add guidelines for obtaining permissions related arg variables
ahmetmeleq Oct 9, 2023
128dfaf
work in progress on refactoring cli args
ahmetmeleq Oct 10, 2023
75e1735
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 10, 2023
7a13f2c
update config variable names
ahmetmeleq Oct 10, 2023
8bd7de8
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 10, 2023
f3b7019
add permissions config interface, check cli params in cli interfaces
ahmetmeleq Oct 10, 2023
501f0c9
update docs
ahmetmeleq Oct 10, 2023
be7864f
update test credential names
ahmetmeleq Oct 10, 2023
6311eea
add permissions node, cleanup, and additional site types
ahmetmeleq Oct 11, 2023
27be65f
remove redundant comment
ahmetmeleq Oct 11, 2023
69f9b22
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 11, 2023
633f419
do not run cleanup when permissions args are not provided
ahmetmeleq Oct 11, 2023
6d1a66f
Merge branch 'ahmet/sharepoint-rbac' of https://github.com/Unstructur…
ahmetmeleq Oct 11, 2023
d4bdc15
cleanup output-text folder permissions files before diffing
ahmetmeleq Oct 11, 2023
e18c836
type check fix
ahmetmeleq Oct 11, 2023
b25392e
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 11, 2023
cdf2d20
feat: rbac ingestion for sharepoint <- Ingest test fixtures update (#…
ryannikolaidis Oct 11, 2023
243ef0e
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 11, 2023
fa95e36
update comments
ahmetmeleq Oct 12, 2023
60df65a
Merge branch 'ahmet/sharepoint-rbac' of https://github.com/Unstructur…
ahmetmeleq Oct 12, 2023
e5be437
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 12, 2023
5802146
Update docs/source/source_connectors/sharepoint.rst
ahmetmeleq Oct 12, 2023
3d5fb8e
Update docs/source/source_connectors/sharepoint.rst
ahmetmeleq Oct 12, 2023
64e03c0
Update examples/ingest/sharepoint/ingest.sh
ahmetmeleq Oct 12, 2023
07617b5
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 12, 2023
f7f752b
update comments
ahmetmeleq Oct 12, 2023
ae982a4
remove redundant click options
ahmetmeleq Oct 13, 2023
b898ed5
remove redundant cli options
ahmetmeleq Oct 13, 2023
3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
@@ -288,6 +288,9 @@ jobs:
SHAREPOINT_CLIENT_ID: ${{secrets.SHAREPOINT_CLIENT_ID}}
SHAREPOINT_CRED: ${{secrets.SHAREPOINT_CRED}}
SHAREPOINT_SITE: ${{secrets.SHAREPOINT_SITE}}
SHAREPOINT_PERMISSIONS_APP_ID: ${{secrets.SHAREPOINT_PERMISSIONS_APP_ID}}
SHAREPOINT_PERMISSIONS_APP_CRED: ${{secrets.SHAREPOINT_PERMISSIONS_APP_CRED}}
SHAREPOINT_PERMISSIONS_TENANT: ${{secrets.SHAREPOINT_PERMISSIONS_TENANT}}
SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
NOTION_API_KEY: ${{ secrets.NOTION_API_KEY }}
3 changes: 3 additions & 0 deletions .github/workflows/ingest-test-fixtures-update-pr.yml
@@ -85,6 +85,9 @@ jobs:
SHAREPOINT_CLIENT_ID: ${{secrets.SHAREPOINT_CLIENT_ID}}
SHAREPOINT_CRED: ${{secrets.SHAREPOINT_CRED}}
SHAREPOINT_SITE: ${{secrets.SHAREPOINT_SITE}}
SHAREPOINT_PERMISSIONS_APP_ID: ${{secrets.SHAREPOINT_PERMISSIONS_APP_ID}}
SHAREPOINT_PERMISSIONS_APP_CRED: ${{secrets.SHAREPOINT_PERMISSIONS_APP_CRED}}
SHAREPOINT_PERMISSIONS_TENANT: ${{secrets.SHAREPOINT_PERMISSIONS_TENANT}}
SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
NOTION_API_KEY: ${{ secrets.NOTION_API_KEY }}
6 changes: 4 additions & 2 deletions CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.10.22-dev5
+## 0.10.22-dev6

### Enhancements

@@ -7,9 +7,11 @@

### Features

* **Adds permissions (RBAC) data ingestion functionality for the Sharepoint connector.** Problem: Role-based access control is an important component in many data storage systems. Users may need to pass permissions (RBAC) data to downstream systems when ingesting data. Feature: Added permissions data ingestion functionality to the Sharepoint connector.

### Fixes

-* **Fixes PDF list parsing creating duplicate list items** Previously a bug in PDF list item parsing caused removal of other elements and duplication of the list items
+* **Fixes PDF list parsing creating duplicate list items** Previously a bug in PDF list item parsing caused removal of other elements and duplication of the list item
* **Fixes duplicated elements** Fixes issue where elements are duplicated when embeddings are generated. This will allow users to generate embeddings for their list of Elements without duplicating/breaking the original content.
* **Fixes failure when flagging for embeddings through unstructured-ingest** Currently adding the embedding parameter to any connector results in a failure on the copy stage. This resolves the issue by adding the IngestDoc to the context map in the embedding node's `run` method. This allows users to specify that connectors fetch embeddings without failure.
* **Fix ingest pipeline reformat nodes not discoverable** Fixes issue where reformat nodes raise ModuleNotFoundError on import. This was because the directory was missing the `__init__.py` needed to make it discoverable.
14 changes: 14 additions & 0 deletions docs/source/source_connectors/sharepoint.rst
@@ -22,6 +22,9 @@ Run Locally
--client-id "<Microsoft Sharepoint app client-id>" \
--client-cred "<Microsoft Sharepoint app client-secret>" \
--site "<e.g https://contoso.sharepoint.com or https://contoso.admin.sharepoint.com to process all sites within tenant>" \
--permissions-application-id "<Microsoft Graph API application id, to process per-file access permissions>" \
--permissions-client-cred "<Microsoft Graph API application credentials, to process per-file access permissions>" \
--permissions-tenant "<e.g https://contoso.onmicrosoft.com (tenant URL) to process per-file access permissions>" \
--files-only "Flag to process only files within the site(s)" \
--output-dir sharepoint-ingest-output \
--num-processes 2 \
@@ -46,6 +49,10 @@ Run Locally
client_id="<Microsoft Sharepoint app client-id>",
client_cred="<Microsoft Sharepoint app client-secret>",
site="<e.g https://contoso.sharepoint.com to process all sites within tenant>",
# Credentials to process data about permissions (rbac) within the tenant
permissions_application_id="<Microsoft Graph API application id>",
permissions_client_cred="<Microsoft Graph API application credentials>",
permissions_tenant="<e.g https://contoso.onmicrosoft.com to process permission info within tenant>",
# Flag to process only files within the site(s)
files_only=True,
path="Shared Documents",
@@ -68,6 +75,9 @@ You can also use upstream connectors with the ``unstructured`` API. For this you
--client-id "<Microsoft Sharepoint app client-id>" \
--client-cred "<Microsoft Sharepoint app client-secret>" \
--site "<e.g https://contoso.sharepoint.com or https://contoso.admin.sharepoint.com to process all sites within tenant>" \
--permissions-application-id "<Microsoft Graph API application id, to process per-file access permissions>" \
--permissions-client-cred "<Microsoft Graph API application credentials, to process per-file access permissions>" \
--permissions-tenant "<e.g https://contoso.onmicrosoft.com (tenant URL) to process per-file access permissions>" \
--files-only "Flag to process only files within the site(s)" \
--output-dir sharepoint-ingest-output \
--num-processes 2 \
@@ -98,6 +108,10 @@ You can also use upstream connectors with the ``unstructured`` API. For this you
client_id="<Microsoft Sharepoint app client-id>",
client_cred="<Microsoft Sharepoint app client-secret>",
site="<e.g https://contoso.sharepoint.com to process all sites within tenant>",
# Credentials to process data about permissions (rbac) within the tenant
permissions_application_id="<Microsoft Graph API application id>",
permissions_client_cred="<Microsoft Graph API application credentials>",
permissions_tenant="<e.g https://contoso.onmicrosoft.com to process permission info within tenant>",
# Flag to process only files within the site(s)
files_only=True,
path="Shared Documents",
5 changes: 5 additions & 0 deletions examples/ingest/sharepoint/ingest.sh
100644 → 100755
@@ -12,6 +12,8 @@
# To get the credentials for your Sharepoint app, follow these steps:
# https://github.com/vgrem/Office365-REST-Python-Client/wiki/How-to-connect-to-SharePoint-Online-and-and-SharePoint-2013-2016-2019-on-premises--with-app-principal

# To optionally set up your application and obtain permissions related variables (--permissions-application-id, --permissions-client-cred, --permissions-tenant), follow these steps:
# https://tsmatz.wordpress.com/2016/10/07/application-permission-with-v2-endpoint-and-microsoft-graph


SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
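The three permissions variables mentioned in the comment above map naturally onto an OAuth 2.0 client-credentials request against the Microsoft identity platform. The following is a minimal sketch of how such a request could be assembled, not the connector's actual implementation, which is not shown in this diff. The function names `tenant_from_url` and `build_token_request` are hypothetical, and stripping the `https://` scheme from the tenant value is an assumption about how the tenant URL is turned into a tenant domain.

```python
from typing import Tuple
from urllib.parse import urlencode, urlparse

# Default scope for application (app-only) permissions on Microsoft Graph.
GRAPH_SCOPE = "https://graph.microsoft.com/.default"


def tenant_from_url(tenant: str) -> str:
    """Hypothetical helper: reduce a tenant URL to its bare domain.

    Accepts 'https://contoso.onmicrosoft.com' or 'contoso.onmicrosoft.com'.
    """
    parsed = urlparse(tenant)
    return parsed.netloc or parsed.path.strip("/")


def build_token_request(
    application_id: str, client_cred: str, tenant: str
) -> Tuple[str, bytes]:
    """Hypothetical helper: return (url, form-encoded body) for a token request.

    Nothing is sent over the network; the caller decides how to POST it.
    """
    url = (
        "https://login.microsoftonline.com/"
        f"{tenant_from_url(tenant)}/oauth2/v2.0/token"
    )
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": application_id,
            "client_secret": client_cred,
            "scope": GRAPH_SCOPE,
        }
    ).encode()
    return url, body


# Placeholder values only; real credentials come from the app registration.
url, body = build_token_request(
    "app-id", "app-secret", "https://contoso.onmicrosoft.com"
)
print(url)
```

Posting `body` to `url` with the `application/x-www-form-urlencoded` content type returns a JSON document whose `access_token` field can then be sent as a bearer token on Microsoft Graph requests.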
@@ -22,6 +24,9 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
--client-id "<Microsoft Sharepoint app client-id>" \
--client-cred "<Microsoft Sharepoint app client-secret>" \
--site "<e.g https://contoso.sharepoint.com or https://contoso.admin.sharepoint.com to process all sites within tenant>" \
--permissions-application-id "<Microsoft Graph API application id to process per-file access permissions>" \
--permissions-client-cred "<Microsoft Graph API application credentials to process per-file access permissions>" \
--permissions-tenant "<e.g https://contoso.onmicrosoft.com to process per-file access permissions>" \
--files-only "Flag to process only files within the site(s)" \
--output-dir sharepoint-ingest-output \
--num-processes 2 \
1 change: 1 addition & 0 deletions test_unstructured_ingest/check-diff-expected-output.sh
@@ -46,6 +46,7 @@ if [ "$OVERWRITE_FIXTURES" != "false" ]; then
elif ! diff -ru "$EXPECTED_OUTPUT_DIR" "$OUTPUT_DIR" ; then
"$SCRIPT_DIR"/json-to-clean-text-folder.sh "$EXPECTED_OUTPUT_DIR" "$EXPECTED_OUTPUT_DIR_TEXT"
"$SCRIPT_DIR"/json-to-clean-text-folder.sh "$OUTPUT_DIR" "$OUTPUT_DIR_TEXT"
"$SCRIPT_DIR"/clean-permissions-files.sh "$OUTPUT_DIR_TEXT"
diff -ru "$EXPECTED_OUTPUT_DIR_TEXT" "$OUTPUT_DIR_TEXT"> outputdiff.txt
cat outputdiff.txt
diffstat -c outputdiff.txt
27 changes: 27 additions & 0 deletions test_unstructured_ingest/clean-permissions-files.sh
@@ -0,0 +1,27 @@
#!/usr/bin/env bash

# Description: Delete (cleanup) permissions files in a folder, so that they are not included in
# text diff tests.
#
# Arguments:
# - $1: Name of the folder to do the cleanup operation in.

set +e
if [ "$#" -ne 1 ]; then
  echo "Please provide a folder to clean the files in: $0 <folder_path>"
  exit 1
fi

folder_path="$1"
if [ ! -d "$folder_path" ]; then
  echo "'$folder_path' is not a directory. Please provide a folder / directory."
  exit 1
fi

for file in "$folder_path"/*_SEP_*; do
  if [ -e "$file" ]; then
    rm "$file"
  fi
done

echo "Completed cleanup for permissions files"
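For readers who want to reuse the same cleanup from Python test tooling, the script above can be approximated as follows. This is a sketch under assumptions: the function name `clean_permissions_files` and the example file names are hypothetical, and only the `_SEP_` marker and the top-level-only matching are taken from the script itself.

```python
import tempfile
from pathlib import Path
from typing import List


def clean_permissions_files(folder_path: str, marker: str = "_SEP_") -> List[str]:
    """Delete top-level files whose names contain `marker`; return their names."""
    folder = Path(folder_path)
    if not folder.is_dir():
        raise NotADirectoryError(f"'{folder_path}' is not a directory.")

    removed = []
    # Same pattern as the shell glob "$folder_path"/*_SEP_* (no recursion).
    for path in sorted(folder.glob(f"*{marker}*")):
        if path.is_file():  # directories are skipped rather than removed
            path.unlink()
            removed.append(path.name)
    return removed


# Small demonstration with made-up file names in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "site_SEP_doc.json").write_text("{}")
    (Path(tmp) / "doc.txt").write_text("text")
    removed = clean_permissions_files(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(removed, remaining)
```

Returning the removed names, rather than only printing a completion message as the script does, makes the helper easier to assert on in tests.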