feat: add file-based access permissions for SharePoint ingest #1628

Merged on Oct 13, 2023 (66 commits)
Changes from 36 commits
Commits
6d2ecd2
add rbac ingestion via graph api
ahmetmeleq Sep 28, 2023
1b8a109
implement class
ahmetmeleq Sep 28, 2023
9b6400e
run ConnectorRBAC class
ahmetmeleq Sep 28, 2023
559aa2b
integrate rbac ingestion to the cli app
ahmetmeleq Sep 28, 2023
0798cbd
delete prototyping file
ahmetmeleq Sep 28, 2023
990d985
delete prototyping file
ahmetmeleq Sep 28, 2023
9d942b6
write permissions to disk
ahmetmeleq Sep 29, 2023
6cc3bf8
debugging to identify common field between rbac object and ingestdoc …
ahmetmeleq Sep 29, 2023
50f1466
trials on writing rbac data into the ingest doc
ahmetmeleq Sep 29, 2023
18905da
add rbac data ingestion
ahmetmeleq Oct 3, 2023
c0ab58a
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 3, 2023
17f4319
remove irrelevant lines
ahmetmeleq Oct 3, 2023
a97b6cd
change name rbac to permissions
ahmetmeleq Oct 4, 2023
2fc2539
load permissions data as an object rather than str
ahmetmeleq Oct 4, 2023
035cc32
assign folder for permissions data
ahmetmeleq Oct 4, 2023
5a5c80b
remove todo point
ahmetmeleq Oct 4, 2023
65ab961
add exact filename match condition
ahmetmeleq Oct 4, 2023
bba81db
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 4, 2023
c9b1691
revert example file
ahmetmeleq Oct 4, 2023
f3895c4
Merge branch 'ahmet/sharepoint-rbac' of https://github.com/Unstructur…
ahmetmeleq Oct 4, 2023
a384df1
typing
ahmetmeleq Oct 4, 2023
3ec0ca8
typing
ahmetmeleq Oct 4, 2023
adeb806
changelog and version
ahmetmeleq Oct 4, 2023
b557094
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 4, 2023
b74c81e
include sitenames for name matching
ahmetmeleq Oct 5, 2023
1ada530
Merge branch 'ahmet/sharepoint-rbac' of https://github.com/Unstructur…
ahmetmeleq Oct 5, 2023
e1cfab7
revert example
ahmetmeleq Oct 5, 2023
c6eb13a
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 5, 2023
72c21d0
changelog and version
ahmetmeleq Oct 5, 2023
bb3b191
Update unstructured/ingest/interfaces.py
ahmetmeleq Oct 5, 2023
803dcec
remove debugging lines
ahmetmeleq Oct 9, 2023
e2fef65
create permissions config
ahmetmeleq Oct 9, 2023
170bc79
change error message
ahmetmeleq Oct 9, 2023
c491ff0
skip permissions ingestion when args are not provided
ahmetmeleq Oct 9, 2023
1454522
keep access token in post init
ahmetmeleq Oct 9, 2023
62f3262
typing: make permission args for runner optional
ahmetmeleq Oct 9, 2023
bb03eaf
update example
ahmetmeleq Oct 9, 2023
bfe2925
update tests
ahmetmeleq Oct 9, 2023
b4ba750
add guidelines for obtaining permissions related arg variables
ahmetmeleq Oct 9, 2023
128dfaf
work in progress on refactoring cli args
ahmetmeleq Oct 10, 2023
75e1735
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 10, 2023
7a13f2c
update config variable names
ahmetmeleq Oct 10, 2023
8bd7de8
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 10, 2023
f3b7019
add permissions config interface, check cli params in cli interfaces
ahmetmeleq Oct 10, 2023
501f0c9
update docs
ahmetmeleq Oct 10, 2023
be7864f
update test credential names
ahmetmeleq Oct 10, 2023
6311eea
add permissions node, cleanup, and additional site types
ahmetmeleq Oct 11, 2023
27be65f
remove redundant comment
ahmetmeleq Oct 11, 2023
69f9b22
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 11, 2023
633f419
do not run cleanup when permissions args are not provided
ahmetmeleq Oct 11, 2023
6d1a66f
Merge branch 'ahmet/sharepoint-rbac' of https://github.com/Unstructur…
ahmetmeleq Oct 11, 2023
d4bdc15
cleanup output-text folder permissions files before diffing
ahmetmeleq Oct 11, 2023
e18c836
type check fix
ahmetmeleq Oct 11, 2023
b25392e
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 11, 2023
cdf2d20
feat: rbac ingestion for sharepoint <- Ingest test fixtures update (#…
ryannikolaidis Oct 11, 2023
243ef0e
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 11, 2023
fa95e36
update comments
ahmetmeleq Oct 12, 2023
60df65a
Merge branch 'ahmet/sharepoint-rbac' of https://github.com/Unstructur…
ahmetmeleq Oct 12, 2023
e5be437
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 12, 2023
5802146
Update docs/source/source_connectors/sharepoint.rst
ahmetmeleq Oct 12, 2023
3d5fb8e
Update docs/source/source_connectors/sharepoint.rst
ahmetmeleq Oct 12, 2023
64e03c0
Update examples/ingest/sharepoint/ingest.sh
ahmetmeleq Oct 12, 2023
07617b5
Merge branch 'main' into ahmet/sharepoint-rbac
ahmetmeleq Oct 12, 2023
f7f752b
update comments
ahmetmeleq Oct 12, 2023
ae982a4
remove redundant click options
ahmetmeleq Oct 13, 2023
b898ed5
remove redundant cli options
ahmetmeleq Oct 13, 2023
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,14 @@
## 0.10.20-dev0

### Enhancements

### Fixes

### Features

* **Adds permissions (RBAC) data ingestion functionality for the Sharepoint connector.** Problem: Role-based access control is an important component in many data storage systems. Users may need to pass permissions (RBAC) data to downstream systems when ingesting data. Feature: Added permissions data ingestion functionality to the Sharepoint connector.


## 0.10.19

### Enhancements
examples/ingest/sharepoint/ingest.sh
File mode changed: 100644 → 100755 (no content changes)
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.10.19" # pragma: no cover
+__version__ = "0.10.20-dev0" # pragma: no cover
1 change: 1 addition & 0 deletions unstructured/documents/elements.py
@@ -44,6 +44,7 @@ class DataSourceMetadata:
    date_created: Optional[str] = None
    date_modified: Optional[str] = None
    date_processed: Optional[str] = None
    permissions_data: Optional[List[Dict[str, Any]]] = None

    def to_dict(self):
        return {key: value for key, value in self.__dict__.items() if value is not None}
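A note on how the new field interacts with `to_dict()`: because `None` values are filtered out, documents ingested without permissions arguments should serialize exactly as before. The sketch below illustrates this with a simplified stand-in class (`DataSourceMetadataSketch` is hypothetical and carries only the fields visible in this hunk); the date and the permission entry are made-up sample values, not real Graph API output.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class DataSourceMetadataSketch:
    """Hypothetical stand-in holding only the fields visible in this hunk."""

    date_created: Optional[str] = None
    date_modified: Optional[str] = None
    date_processed: Optional[str] = None
    permissions_data: Optional[List[Dict[str, Any]]] = None

    def to_dict(self) -> Dict[str, Any]:
        # Same filter as in the diff: fields whose value is None are dropped.
        return {key: value for key, value in self.__dict__.items() if value is not None}


# Made-up sample values, for illustration only.
without_permissions = DataSourceMetadataSketch(date_created="2023-10-13").to_dict()
with_permissions = DataSourceMetadataSketch(
    date_created="2023-10-13",
    permissions_data=[{"id": "example-permission", "roles": ["read"]}],
).to_dict()
```

With this filter, the `permissions_data` key only appears in the serialized metadata when permissions were actually loaded for the document.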
21 changes: 21 additions & 0 deletions unstructured/ingest/cli/cmds/sharepoint.py
@@ -25,6 +25,9 @@
class SharepointCliConfig(BaseConfig, CliMixin):
    client_id: t.Optional[str] = None
    client_cred: t.Optional[str] = None
    permissions_application_id: t.Optional[str] = None
    permissions_client_cred: t.Optional[str] = None
    permissions_tenant: t.Optional[str] = None
    site: t.Optional[str] = None
    path: str = "Shared Documents"
    files_only: bool = False
@@ -55,6 +58,24 @@ def add_cli_options(cmd: click.Command) -> None:
                https://[tenant]-admin.sharepoint.com.\
                This requires the app to be registered at a tenant level",
            ),
            click.Option(
                ["--permissions-application-id"],
                default=None,
                type=str,
                help="Application id for ingesting permission (rbac) data",
            ),
            click.Option(
                ["--permissions-client-cred"],
                default=None,
                type=str,
                help="Credentials for ingesting permission (rbac) data",
            ),
            click.Option(
                ["--permissions-tenant"],
                default=None,
                type=str,
                help="Sharepoint permission (rbac) tenant name, such as: abcde.onmicrosoft.com",
            ),
            click.Option(
                ["--path"],
                default="Shared Documents",
238 changes: 237 additions & 1 deletion unstructured/ingest/connector/sharepoint.py
@@ -1,3 +1,5 @@
import json
import os
import typing as t
from dataclasses import dataclass, field
from datetime import datetime
@@ -10,6 +12,7 @@
from unstructured.file_utils.filetype import EXT_TO_FILETYPE
from unstructured.ingest.error import SourceConnectionError
from unstructured.ingest.interfaces import (
    BaseConfig,
    BaseConnectorConfig,
    BaseIngestDoc,
    BaseSourceConnector,
@@ -31,6 +34,26 @@
CONTENT_LABELS = ["CanvasContent1", "LayoutWebpartsContent1", "TimeCreated"]


@dataclass
class SharepointPermissionsConfig(BaseConfig):
    application_id: str = None
    client_credential: str = None
    tenant: str = None

    def __post_init__(self):
        self.provided = False
        if any([self.application_id, self.client_credential, self.tenant]):
            if not all([self.application_id, self.client_credential, self.tenant]):
                raise ValueError(
                    "Please provide either none or all of the following optional values:\n"
                    "--permissions-application-id\n"
                    "--permissions-client-cred\n"
                    "--permissions-tenant",
                )
            else:
                self.provided = True

Review comment (Contributor), on the ValueError message above:
Can this check be part of converting the click options, since this part of the code has no knowledge of the CLI parameters? It can be factored into the from_dict method when this config is created, such as this: CliEmbeddingConfig from_dict()

Reply (Contributor Author):
Addressed with f3b7019
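For readers following the thread, the "either none or all" rule can be isolated into a small pure function, which makes it easy to call from whichever layer ends up owning the CLI parameters. This is only a sketch of the idea: `permissions_args_provided` is a hypothetical helper name, not something taken from the PR or from commit f3b7019.

```python
from typing import Optional


def permissions_args_provided(
    application_id: Optional[str],
    client_credential: Optional[str],
    tenant: Optional[str],
) -> bool:
    """Return True if all three values are set and False if none are.

    Raises ValueError when only some of them are set.
    """
    values = [application_id, client_credential, tenant]
    if not any(values):
        # Nothing supplied: permissions ingestion is simply skipped.
        return False
    if not all(values):
        raise ValueError(
            "Please provide either none or all of the following optional values:\n"
            "--permissions-application-id\n"
            "--permissions-client-cred\n"
            "--permissions-tenant",
        )
    return True
```

Keeping the check free of any config class means the error message that names CLI flags can live next to the CLI code, which is the concern raised in the review comment.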


@dataclass
class SimpleSharepointConfig(BaseConnectorConfig):
    client_id: str
@@ -39,6 +62,7 @@ class SimpleSharepointConfig(BaseConnectorConfig):
    path: str
    process_pages: bool = False
    recursive: bool = False
    permissions_config: t.Optional[SharepointPermissionsConfig] = None

    def __post_init__(self):
        if not (self.client_id and self.client_credential and self.site_url):
@@ -61,6 +85,14 @@ def get_site_client(self, site_url: str = "") -> "ClientContext":
            raise
        return site_client

    def get_permissions_client(self):
        try:
            permissions_connector = SharepointPermissionsConnector(self.permissions_config)
            assert permissions_connector.access_token
            return permissions_connector
        except Exception as e:
            # Pass the exception as a %s argument so the logger can format it.
            logger.error("Couldn't obtain Sharepoint permissions ingestion access token: %s", e)


@dataclass
class SharepointIngestDoc(IngestDocCleanupMixin, BaseIngestDoc):
@@ -146,7 +178,6 @@ def _fetch_file(self, properties_only: bool = False):
            file = site_client.web.get_file_by_server_relative_url(self.server_path)
            if properties_only:
                file = file.get().execute_query()

        except ClientRequestException as e:
            if e.response.status_code == 404:
                return None
@@ -168,6 +199,42 @@ def _fetch_page(self):
            return None
        return page

    def update_permissions_data(self):
        def parent_name_matches(parent_type, permissions_filename, ingest_doc_filepath):
            permissions_filename = permissions_filename.split("_SEP_")
            ingest_doc_filepath = ingest_doc_filepath.split("/")

            if parent_type == "sites":
                return permissions_filename[0] == ingest_doc_filepath[1]

            elif parent_type == "SitePages" or parent_type == "Shared Documents":
                return True

        permissions_data = None
        permissions_dir = Path(self.partition_config.output_dir) / "permissions_data"

        if permissions_dir.is_dir():
            parent_type = self.file_path.split("/")[0]

            if parent_type == "sites":
                read_dir = permissions_dir / "sites"
            elif parent_type == "SitePages" or parent_type == "Shared Documents":
                read_dir = permissions_dir / "other"

            for filename in os.listdir(read_dir):
                permissions_docname = os.path.splitext(filename)[0].split("_SEP_")[1]
                ingestdoc_docname = self.file_path.split("/")[-1]

                if ingestdoc_docname == permissions_docname and parent_name_matches(
                    parent_type=parent_type,
                    permissions_filename=filename,
                    ingest_doc_filepath=self.file_path,
                ):
                    with open(read_dir / filename) as f:
                        permissions_data = json.loads(f.read())

        return permissions_data
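The matching rule in `update_permissions_data` pairs an ingest doc with a permissions file named `<parent>_SEP_<document>.json`. A standalone sketch of that rule follows, mainly to make the edge cases visible. The helper names are hypothetical, and two behaviors are assumptions that differ slightly from the diff: files without the separator are skipped instead of raising, and the name is split at the first `_SEP_` only. The site and file names in the usage note are made up.

```python
import os
from typing import Optional, Tuple

SEPARATOR = "_SEP_"


def split_permissions_filename(filename: str) -> Optional[Tuple[str, str]]:
    """Split '<parent>_SEP_<document>.json' into (parent, document)."""
    stem = os.path.splitext(filename)[0]  # drops only the trailing '.json'
    parent, sep, docname = stem.partition(SEPARATOR)  # splits at the first separator
    if not sep:
        return None  # assumption: ignore files that do not follow the naming scheme
    return parent, docname


def permissions_file_matches(filename: str, ingest_doc_filepath: str) -> bool:
    """Apply the diff's rule: same document name, and same site for 'sites' paths."""
    parsed = split_permissions_filename(filename)
    if parsed is None:
        return False
    parent, docname = parsed
    parts = ingest_doc_filepath.split("/")
    if parts[-1] != docname:
        return False
    if parts[0] == "sites":
        return len(parts) > 1 and parts[1] == parent
    return parts[0] in ("SitePages", "Shared Documents")
```

For example, `permissions_file_matches("teamA_SEP_report.docx.json", "sites/teamA/Shared Documents/report.docx")` matches, while the same file name does not match a document under `sites/teamB/`. Note that two documents with the same name in different folders of one site would still share a permissions file under this scheme.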

    def update_source_metadata(self, **kwargs):
        if self.is_page:
            page = self._fetch_page()
@@ -182,6 +249,9 @@ def update_source_metadata(self, **kwargs):
                version=page.get_property("Version", ""),
                source_url=page.absolute_url,
                exists=True,
                permissions_data=self.update_permissions_data()
                if self.connector_config.permissions_config
                else None,
            )
            return

@@ -200,6 +270,9 @@ def update_source_metadata(self, **kwargs):
            version=file.major_version,
            source_url=file.properties.get("LinkingUrl", None),
            exists=True,
            permissions_data=self.update_permissions_data()
            if self.connector_config.permissions_config
            else None,
        )

    def _download_page(self):
@@ -345,6 +418,12 @@ def initialize(self):

    def get_ingest_docs(self):
        base_site_client = self.connector_config.get_site_client()

        if self.connector_config.permissions_config:
            permissions_client = self.connector_config.get_permissions_client()
            if permissions_client:
                permissions_client.write_all_permissions(self.partition_config.output_dir)

        if not base_site_client.is_tenant:
            return self._ingest_site_docs(base_site_client)
        tenant = base_site_client.tenant
@@ -356,3 +435,160 @@ def get_ingest_docs(self):
            site_client = self.connector_config.get_site_client(site_url)
            ingest_docs = ingest_docs + self._ingest_site_docs(site_client)
        return ingest_docs


@dataclass
class SharepointPermissionsConnector:
    def __init__(self, permissions_config):
        self.permissions_config: SharepointPermissionsConfig = permissions_config
        self.initialize()

    def initialize(self):
        self.access_token: str = self.get_access_token()

    @requires_dependencies(["requests"], extras="sharepoint")
    def get_access_token(self) -> str:
        import requests

        url = (
            f"https://login.microsoftonline.com/{self.permissions_config.tenant}/oauth2/v2.0/token"
        )
        headers = {"Content-Type": "application/x-www-form-urlencoded"}
        data = {
            "client_id": self.permissions_config.application_id,
            "scope": "https://graph.microsoft.com/.default",
            "client_secret": self.permissions_config.client_credential,
            "grant_type": "client_credentials",
        }
        response = requests.post(url, headers=headers, data=data)
        return response.json()["access_token"]
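`get_access_token` indexes `response.json()["access_token"]` directly, so a rejected credential surfaces as a bare `KeyError`. One possible way to make that failure easier to read is to separate request construction from token extraction, as sketched below using only the standard library. Both function names are hypothetical, and relying on an `error` field in the failure payload is an assumption based on the standard OAuth 2.0 error response shape, not something shown in this PR.

```python
from typing import Any, Dict, Tuple
from urllib.parse import urlencode


def build_token_request(
    tenant: str, application_id: str, client_credential: str
) -> Tuple[str, Dict[str, str], str]:
    """Build the URL, headers, and form body for the client-credentials request."""
    url = f"https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token"
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    body = urlencode(
        {
            "client_id": application_id,
            "scope": "https://graph.microsoft.com/.default",
            "client_secret": client_credential,
            "grant_type": "client_credentials",
        },
    )
    return url, headers, body


def extract_access_token(payload: Dict[str, Any]) -> str:
    """Return the token from a decoded JSON response, or fail with a readable message."""
    token = payload.get("access_token")
    if not isinstance(token, str) or not token:
        # Assumption: failed requests carry an 'error' field, as in OAuth 2.0 error responses.
        raise ValueError(f"No access token in response: {payload.get('error', 'unknown error')}")
    return token
```

The values passed in would be the same three settings the PR introduces (`--permissions-tenant`, `--permissions-application-id`, `--permissions-client-cred`); the HTTP call itself stays with `requests.post` as in the diff.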

    def validated_response(self, response):
        if response.status_code == 200:
            return response.json()
        else:
            print(f"Request failed with status code {response.status_code}:")
            print(response.text)

    @requires_dependencies(["requests"], extras="sharepoint")
    def get_sites(self):
        import requests

        url = "https://graph.microsoft.com/v1.0/sites"
        params = {
            "$select": "webUrl, id",
        }

        headers = {
            "Authorization": f"Bearer {self.access_token}",
        }

        response = requests.get(url, params=params, headers=headers)
        return self.validated_response(response)

    @requires_dependencies(["requests"], extras="sharepoint")
    def get_drives(self, site):
        import requests

        url = f"https://graph.microsoft.com/v1.0/sites/{site}/drives"

        headers = {
            "Authorization": f"Bearer {self.access_token}",
        }

        response = requests.get(url, headers=headers)

        return self.validated_response(response)

    @requires_dependencies(["requests"], extras="sharepoint")
    def get_drive_items(self, site, drive_id):
        import requests

        url = f"https://graph.microsoft.com/v1.0/sites/{site}/drives/{drive_id}/root/children"

        headers = {
            "Authorization": f"Bearer {self.access_token}",
        }

        response = requests.get(url, headers=headers)

        return self.validated_response(response)

    def extract_site_name_from_weburl(self, weburl):
        split_path = urlparse(weburl).path.lstrip("/").split("/")

        if split_path[0] == "sites":
            return "sites", split_path[1]

        elif split_path[0] == "Shared%20Documents":
            return "Shared Documents", "Shared Documents"

        elif split_path[0] == "personal":
            return "Personal", "Personal"

        # if other weburl structures are found, additional logic might need to be implemented

        # Adjacent string literals avoid embedding the continuation whitespace in the
        # message, and %s lets the logger format the URL argument.
        logger.warning(
            "Couldn't extract sitename, skipping RBAC ingestion "
            "for the document with the URL: %s",
            weburl,
        )
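`extract_site_name_from_weburl` falls through to the warning and implicitly returns `None` for unrecognized URLs, while its caller unpacks the result into two names. A sketch of a variant with an explicit `Optional` return is shown below, so that callers can check for `None` before unpacking. `classify_weburl` is a hypothetical name, decoding path segments with `unquote` is a substitution for the literal `Shared%20Documents` comparison in the diff, and the `contoso` host in the usage note is a placeholder.

```python
from typing import Optional, Tuple
from urllib.parse import unquote, urlparse


def classify_weburl(weburl: str) -> Optional[Tuple[str, str]]:
    """Return (parent_type, parent_name) for a SharePoint item URL, or None if unknown."""
    # Decode percent-escapes so 'Shared%20Documents' compares as 'Shared Documents'.
    segments = [unquote(part) for part in urlparse(weburl).path.strip("/").split("/")]
    first = segments[0]

    if first == "sites" and len(segments) > 1:
        return "sites", segments[1]
    if first == "Shared Documents":
        return "Shared Documents", "Shared Documents"
    if first == "personal":
        return "Personal", "Personal"
    # Other URL layouts would need additional handling, as the comment in the diff notes.
    return None
```

For instance, `classify_weburl("https://contoso.sharepoint.com/sites/teamA/Shared%20Documents/report.docx")` yields `("sites", "teamA")`, and a URL with an unrecognized first path segment yields `None`.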

    @requires_dependencies(["requests"], extras="sharepoint")
    def get_permissions_for_drive_item(self, site, drive_id, item_id):
        import requests

        # Build the URL from adjacent f-strings; a backslash continuation inside the
        # string literal would embed the next line's leading whitespace in the URL.
        url = (
            f"https://graph.microsoft.com/v1.0/sites/{site}"
            f"/drives/{drive_id}/items/{item_id}/permissions"
        )

        headers = {
            "Authorization": f"Bearer {self.access_token}",
        }

        response = requests.get(url, headers=headers)

        return self.validated_response(response)

    def write_all_permissions(self, output_dir):
        sites = [(site["id"], site["webUrl"]) for site in self.get_sites()["value"]]
        drive_ids = []

        print("Obtaining drive data for sites for permissions (rbac)")
        for site_id, site_url in sites:
            drives = self.get_drives(site_id)
            if drives:
                drives_for_site = drives["value"]
                drive_ids.extend([(site_id, drive["id"]) for drive in drives_for_site])

        print("Obtaining item data from drives for permissions (rbac)")
        item_ids = []
        for site, drive_id in drive_ids:
            drive_items = self.get_drive_items(site, drive_id)
            if drive_items:
                item_ids.extend(
                    # [(site, drive_id, item["id"], item["name"]) for item in drive_items["value"]],
                    [
                        (site, drive_id, item["id"], item["name"], item["webUrl"])
                        for item in drive_items["value"]
                    ],
                )

        permissions_dir = Path(output_dir) / "permissions_data"

        print("Writing permissions data to disk")
        for site, drive_id, item_id, item_name, item_web_url in item_ids:
            res = self.get_permissions_for_drive_item(site, drive_id, item_id)
            if res:
                parent_type, parent_name = self.extract_site_name_from_weburl(item_web_url)

                if parent_type == "sites":
                    write_path = permissions_dir / "sites" / f"{parent_name}_SEP_{item_name}.json"

                if parent_type == "Personal" or parent_type == "Shared Documents":
                    write_path = permissions_dir / "other" / f"{parent_name}_SEP_{item_name}.json"

                if not Path(os.path.dirname(write_path)).is_dir():
                    os.makedirs(os.path.dirname(write_path))

                with open(write_path, "w") as f:
                    json.dump(res["value"], f)

Review comment (Contributor), on the commented-out line inside item_ids.extend(...):
debug? can remove?

Reply (Contributor Author):
That's right, addressed with 27be65f now
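The last loop of `write_all_permissions` decides where each permissions file goes: `permissions_data/sites/` for site documents and `permissions_data/other/` for the rest. A compact sketch of that step is given below, with the path decision pulled into a function that returns `None` for unrecognized parent types, so nothing is written to a stale or undefined path. Both helper names are hypothetical, and skipping unknown types is an assumption about the desired behavior.

```python
import json
from pathlib import Path
from typing import Any, List, Optional, Union


def permissions_write_path(
    output_dir: Union[str, Path],
    parent_type: Optional[str],
    parent_name: Optional[str],
    item_name: str,
) -> Optional[Path]:
    """Return the JSON path for one item's permissions, or None to skip the item."""
    permissions_dir = Path(output_dir) / "permissions_data"
    if parent_type == "sites":
        subdir = "sites"
    elif parent_type in ("Personal", "Shared Documents"):
        subdir = "other"
    else:
        return None  # assumption: unknown parent types are skipped
    return permissions_dir / subdir / f"{parent_name}_SEP_{item_name}.json"


def write_permissions(write_path: Path, permissions: List[Any]) -> None:
    """Create the parent folder if needed and dump the permissions list as JSON."""
    write_path.parent.mkdir(parents=True, exist_ok=True)
    with open(write_path, "w") as f:
        json.dump(permissions, f)
```

A caller would compute the path first and only call `write_permissions` when it is not `None`. The resulting layout is the same one `update_permissions_data` reads from.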