Github "Direct" User Repo Access (#1391)
### Summary

This PR extends the GitHub graph with the repo access granted to every
user who has a 'direct' affiliation to the repo.

Cartography
[already maps](https://cartography-cncf.github.io/cartography/modules/github/schema.html#id6)
some direct user-repo access, but only for collaborators
with an 'outside' affiliation to the repo. This PR broadens that to
include all collaborators, i.e. anybody with a 'direct' affiliation. This
follows GitHub's naming for these concepts, as seen
[here](https://docs.github.com/en/graphql/reference/enums#collaboratoraffiliation).

For readers newer to GitHub: this PR covers only the access that users
are granted directly on a repo, as opposed to access granted via a team.
Access granted via a team is outside the scope of this PR.

We think this is a valuable addition to the graph for a few reasons,
including:
1. Our analysts want few-to-no users to be granted access directly to
repos, on the reasoning that managing access via teams makes access
easier to automate (i.e. with ABAC/RBAC-type logic) and to audit. Graphing
direct access, regardless of whether a user is inside the org or outside
it, will help highlight which grants to clean up.
2. Longer term, we want the graph to show _all_ access
a user has. This PR is a step in that direction. (In a future PR, we
hope to add a user-team membership relationship. Since Cartography maps
the team-repo access relationship, we could then have a user-team-repo graph, and
that would complete the picture of user access in GitHub.)
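
As a rough illustration of the audit use case in point 1, the sketch below builds a Cypher query that lists every user holding one of the new `DIRECT_COLLAB_*` relationships. The relationship naming follows this PR; the helper names `collab_labels` and `build_audit_query` are hypothetical and not part of Cartography, and the query returns whole nodes so that no node property names need to be assumed.

```python
# Hypothetical audit helper; not part of Cartography.
# Permission levels as listed in GitHub's RepositoryPermission enum.
PERMISSIONS = ("ADMIN", "MAINTAIN", "READ", "TRIAGE", "WRITE")


def collab_labels(affiliation: str = "DIRECT") -> list[str]:
    """Return the relationship types for one affiliation, e.g. DIRECT_COLLAB_ADMIN."""
    return [f"{affiliation}_COLLAB_{perm}" for perm in PERMISSIONS]


def build_audit_query(affiliation: str = "DIRECT") -> str:
    """Build a Cypher query matching any collaborator relationship of the given affiliation."""
    rel_types = "|".join(collab_labels(affiliation))
    return (
        f"MATCH (u:GitHubUser)-[r:{rel_types}]->(repo:GitHubRepository) "
        "RETURN u, type(r) AS access, repo"
    )


if __name__ == "__main__":
    print(build_audit_query())
```

Passing `"OUTSIDE"` instead of the default yields the equivalent query over the pre-existing `OUTSIDE_COLLAB_*` relationships.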

#### Illustration of the intention

![Cartography AMPS User Direct Repo Access (3)](https://github.com/user-attachments/assets/83a28a9b-f4f9-40fe-bdc5-153aa5196070)


#### Screencaps

**A REPO WITH OUTSIDE COLLABORATORS**

BEFORE

![CollabsBefore](https://github.com/user-attachments/assets/6806fe08-5e7c-4ced-a8c8-ed43a39566c6)

AFTER

![CollabsAfter](https://github.com/user-attachments/assets/18e9f96d-eb8e-4cad-984a-7e7056615776)


**A REPO WITH NON-OUTSIDE COLLABORATORS**

BEFORE
(no results, because these sorts of users are not graphed)
![Screenshot 2024-11-21 at 5 54 37 PM](https://github.com/user-attachments/assets/d5b371ba-f86e-4a97-a06f-9f62c2548e76)

AFTER
![Screenshot 2024-11-21 at 5 54 25 PM](https://github.com/user-attachments/assets/04941eb1-10ff-4ff9-a95d-172295744978)


**GENERAL COUNTS TO GIVE A SENSE OF CONNECTIONS NOW THERE**

BEFORE

![UserRepoRelsBefore](https://github.com/user-attachments/assets/08d2b96f-41ca-44b6-923d-75247ef09812)

AFTER

![UserRepoRelsAfter](https://github.com/user-attachments/assets/85e3bf51-6a60-4742-865a-454bfab1ef24)



### Related issues or links

None

### Checklist

Provide proof that this works (this makes reviews move faster). Please
perform one or more of the following:
- [X] Update/add unit or integration tests.
- [X] Include a screenshot showing what the graph looked like before and
after your changes.
- [ ] Include console log trace showing what happened before and after
your changes.

If you are changing a node or relationship:
- [X] Update the
[schema](https://github.com/lyft/cartography/tree/master/docs/root/modules)
and
[readme](https://github.com/lyft/cartography/blob/master/docs/schema/README.md).

**N/A** If you are implementing a new intel module:
- [ ] Use the NodeSchema [data
model](https://cartography-cncf.github.io/cartography/dev/writing-intel-modules.html#defining-a-node).

---------

Signed-off-by: Daniel Brauer <[email protected]>
danbrauer authored Nov 25, 2024
1 parent 9b2a3e4 commit 774d67b
Showing 5 changed files with 465 additions and 91 deletions.
25 changes: 25 additions & 0 deletions cartography/data/jobs/cleanup/github_repos_cleanup.json
@@ -63,6 +63,31 @@
"query": "MATCH (:GitHubUser)-[r:OUTSIDE_COLLAB_WRITE]->(:GitHubRepository) WHERE r.lastupdated <> $UPDATE_TAG WITH r LIMIT $LIMIT_SIZE DELETE (r)",
"iterative": true,
"iterationsize": 100
},
{
"query": "MATCH (:GitHubUser)-[r:DIRECT_COLLAB_ADMIN]->(:GitHubRepository) WHERE r.lastupdated <> $UPDATE_TAG WITH r LIMIT $LIMIT_SIZE DELETE (r)",
"iterative": true,
"iterationsize": 100
},
{
"query": "MATCH (:GitHubUser)-[r:DIRECT_COLLAB_MAINTAIN]->(:GitHubRepository) WHERE r.lastupdated <> $UPDATE_TAG WITH r LIMIT $LIMIT_SIZE DELETE (r)",
"iterative": true,
"iterationsize": 100
},
{
"query": "MATCH (:GitHubUser)-[r:DIRECT_COLLAB_READ]->(:GitHubRepository) WHERE r.lastupdated <> $UPDATE_TAG WITH r LIMIT $LIMIT_SIZE DELETE (r)",
"iterative": true,
"iterationsize": 100
},
{
"query": "MATCH (:GitHubUser)-[r:DIRECT_COLLAB_TRIAGE]->(:GitHubRepository) WHERE r.lastupdated <> $UPDATE_TAG WITH r LIMIT $LIMIT_SIZE DELETE (r)",
"iterative": true,
"iterationsize": 100
},
{
"query": "MATCH (:GitHubUser)-[r:DIRECT_COLLAB_WRITE]->(:GitHubRepository) WHERE r.lastupdated <> $UPDATE_TAG WITH r LIMIT $LIMIT_SIZE DELETE (r)",
"iterative": true,
"iterationsize": 100
}],
"name": "cleanup GitHub repos data"
}
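
The five added cleanup entries differ only in the permission level embedded in the relationship type. The sketch below regenerates them programmatically to make that pattern explicit. The function `cleanup_entries` is a hypothetical helper and not something the commit adds; the commit itself keeps the entries written out by hand in the JSON file.

```python
import json

# Permission levels as listed in GitHub's RepositoryPermission enum.
PERMISSIONS = ("ADMIN", "MAINTAIN", "READ", "TRIAGE", "WRITE")


def cleanup_entries(affiliation: str, iteration_size: int = 100) -> list[dict]:
    """Build one stale-relationship cleanup entry per permission level."""
    entries = []
    for perm in PERMISSIONS:
        entries.append({
            "query": (
                f"MATCH (:GitHubUser)-[r:{affiliation}_COLLAB_{perm}]->(:GitHubRepository) "
                "WHERE r.lastupdated <> $UPDATE_TAG WITH r LIMIT $LIMIT_SIZE DELETE (r)"
            ),
            "iterative": True,
            "iterationsize": iteration_size,
        })
    return entries


if __name__ == "__main__":
    # Prints the same five statements that this diff adds for DIRECT collaborators.
    print(json.dumps(cleanup_entries("DIRECT"), indent=2))
```

Each statement deletes relationships whose `lastupdated` value differs from the current sync's update tag, so direct access that has been revoked in GitHub drops out of the graph on the next run.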
203 changes: 176 additions & 27 deletions cartography/intel/github/repos.py
@@ -1,5 +1,6 @@
import configparser
import logging
from collections import namedtuple
from string import Template
from typing import Any
from typing import Dict
@@ -12,11 +13,26 @@
from packaging.utils import canonicalize_name

from cartography.intel.github.util import fetch_all
from cartography.intel.github.util import PaginatedGraphqlData
from cartography.util import run_cleanup_job
from cartography.util import timeit

logger = logging.getLogger(__name__)


# Representation of a user's permission level and affiliation to a GitHub repo. See:
# - Permission: https://docs.github.com/en/graphql/reference/enums#repositorypermission
# - Affiliation: https://docs.github.com/en/graphql/reference/enums#collaboratoraffiliation
UserAffiliationAndRepoPermission = namedtuple(
    'UserAffiliationAndRepoPermission',
    [
        'user',  # Dict
        'permission',  # 'WRITE', 'MAINTAIN', 'ADMIN', etc
        'affiliation',  # 'OUTSIDE', 'DIRECT'
    ],
)


GITHUB_ORG_REPOS_PAGINATED_GRAPHQL = """
query($login: String!, $cursor: String) {
organization(login: $login)
@@ -59,17 +75,11 @@
login
__typename
}
collaborators(affiliation: OUTSIDE, first: 50) {
edges {
permission
}
nodes {
url
login
name
email
company
}
directCollaborators: collaborators(first: 100, affiliation: DIRECT) {
totalCount
}
outsideCollaborators: collaborators(first: 100, affiliation: OUTSIDE) {
totalCount
}
requirements:object(expression: "HEAD:requirements.txt") {
... on Blob {
@@ -89,6 +99,111 @@
# Note: In the above query, `HEAD` references the default branch.
# See https://stackoverflow.com/questions/48935381/github-graphql-api-default-branch-in-repository

GITHUB_REPO_COLLABS_PAGINATED_GRAPHQL = """
query($login: String!, $repo: String!, $affiliation: CollaboratorAffiliation!, $cursor: String) {
organization(login: $login) {
url
login
repository(name: $repo){
name
collaborators(first: 50, affiliation: $affiliation, after: $cursor) {
edges {
permission
}
nodes {
url
login
name
email
company
}
pageInfo{
endCursor
hasNextPage
}
}
}
}
rateLimit {
limit
cost
remaining
resetAt
}
}
"""


def _get_repo_collaborators_for_multiple_repos(
    repo_raw_data: list[dict[str, Any]],
    affiliation: str,
    org: str,
    api_url: str,
    token: str,
) -> dict[str, List[UserAffiliationAndRepoPermission]]:
    """
    For every repo in the given list, retrieve the collaborators.
    :param repo_raw_data: A list of dicts representing repos. See tests.data.github.repos.GET_REPOS for data shape.
    :param affiliation: The type of affiliation to retrieve collaborators for. Either 'DIRECT' or 'OUTSIDE'.
      See https://docs.github.com/en/graphql/reference/enums#collaboratoraffiliation
    :param org: The name of the target Github organization as string.
    :param api_url: The Github v4 API endpoint as string.
    :param token: The Github API token as string.
    :return: A dictionary of repo URL to list of UserAffiliationAndRepoPermission
    """
    result: dict[str, List[UserAffiliationAndRepoPermission]] = {}
    for repo in repo_raw_data:
        repo_name = repo['name']
        repo_url = repo['url']

        if ((affiliation == 'OUTSIDE' and repo['outsideCollaborators']['totalCount'] == 0) or
                (affiliation == 'DIRECT' and repo['directCollaborators']['totalCount'] == 0)):
            # repo has no collabs of the affiliation type we're looking for, so don't waste time making an API call
            result[repo_url] = []
            continue

        collab_users = []
        collab_permission = []
        collaborators = _get_repo_collaborators(token, api_url, org, repo_name, affiliation)
        # nodes and edges are expected to always be present given that we only call for them if totalCount is > 0
        for collab in collaborators.nodes:
            collab_users.append(collab)
        for perm in collaborators.edges:
            collab_permission.append(perm['permission'])

        result[repo_url] = [
            UserAffiliationAndRepoPermission(user, permission, affiliation)
            for user, permission in zip(collab_users, collab_permission)
        ]
    return result


def _get_repo_collaborators(
    token: str, api_url: str, organization: str, repo: str, affiliation: str,
) -> PaginatedGraphqlData:
    """
    Retrieve a list of collaborators for a given repository, as described in
    https://docs.github.com/en/graphql/reference/objects#repositorycollaboratorconnection.
    :param token: The Github API token as string.
    :param api_url: The Github v4 API endpoint as string.
    :param organization: The name of the target Github organization as string.
    :param repo: The name of the target Github repository as string.
    :param affiliation: The type of affiliation to retrieve collaborators for. Either 'DIRECT' or 'OUTSIDE'.
      See https://docs.github.com/en/graphql/reference/enums#collaboratoraffiliation
    :return: A PaginatedGraphqlData holding the repo's collaborators (`nodes`) and their permissions (`edges`).
      See tests.data.github.repos for data shape.
    """
    collaborators, _ = fetch_all(
        token,
        api_url,
        organization,
        GITHUB_REPO_COLLABS_PAGINATED_GRAPHQL,
        'repository',
        resource_inner_type='collaborators',
        repo=repo,
        affiliation=affiliation,
    )
    return collaborators


@timeit
def get(token: str, api_url: str, organization: str) -> List[Dict]:
@@ -111,34 +226,52 @@ def get(token: str, api_url: str, organization: str) -> List[Dict]:
return repos.nodes


def transform(repos_json: List[Dict]) -> Dict:
def transform(
repos_json: List[Dict], direct_collaborators: dict[str, List[UserAffiliationAndRepoPermission]],
outside_collaborators: dict[str, List[UserAffiliationAndRepoPermission]],
) -> Dict:
"""
Parses the JSON returned from GitHub API to create data for graph ingestion
:param repos_json: the list of individual repository nodes from GitHub. See tests.data.github.repos.GET_REPOS for
data shape.
:param repos_json: the list of individual repository nodes from GitHub.
See tests.data.github.repos.GET_REPOS for data shape.
:param direct_collaborators: dict of repo URL to list of direct collaborators.
See tests.data.github.repos.DIRECT_COLLABORATORS for data shape.
:param outside_collaborators: dict of repo URL to list of outside collaborators.
See tests.data.github.repos.OUTSIDE_COLLABORATORS for data shape.
:return: Dict containing the repos, repo->language mapping, owners->repo mapping, outside collaborators->repo
mapping, direct collaborators->repo mapping, and Python requirements files (if any) in a repo.
"""
transformed_repo_list: List[Dict] = []
transformed_repo_languages: List[Dict] = []
transformed_repo_owners: List[Dict] = []
# See https://docs.github.com/en/graphql/reference/enums#repositorypermission
transformed_collaborators: Dict[str, List[Any]] = {
transformed_outside_collaborators: Dict[str, List[Any]] = {
'ADMIN': [], 'MAINTAIN': [], 'READ': [], 'TRIAGE': [], 'WRITE': [],
}
transformed_direct_collaborators: Dict[str, List[Any]] = {
'ADMIN': [], 'MAINTAIN': [], 'READ': [], 'TRIAGE': [], 'WRITE': [],
}
transformed_requirements_files: List[Dict] = []
for repo_object in repos_json:
_transform_repo_languages(repo_object['url'], repo_object, transformed_repo_languages)
_transform_repo_objects(repo_object, transformed_repo_list)
_transform_repo_owners(repo_object['owner']['url'], repo_object, transformed_repo_owners)
_transform_collaborators(repo_object['collaborators'], repo_object['url'], transformed_collaborators)
_transform_collaborators(
repo_object['url'], outside_collaborators[repo_object['url']],
transformed_outside_collaborators,
)
_transform_collaborators(
repo_object['url'], direct_collaborators[repo_object['url']],
transformed_direct_collaborators,
)
_transform_requirements_txt(repo_object['requirements'], repo_object['url'], transformed_requirements_files)
_transform_setup_cfg_requirements(repo_object['setupCfg'], repo_object['url'], transformed_requirements_files)
results = {
'repos': transformed_repo_list,
'repo_languages': transformed_repo_languages,
'repo_owners': transformed_repo_owners,
'repo_collaborators': transformed_collaborators,
'repo_outside_collaborators': transformed_outside_collaborators,
'repo_direct_collaborators': transformed_direct_collaborators,
'python_requirements': transformed_requirements_files,
}
return results
@@ -229,22 +362,27 @@ def _transform_repo_languages(repo_url: str, repo: Dict, repo_languages: List[Di
})


def _transform_collaborators(collaborators: Dict, repo_url: str, transformed_collaborators: Dict) -> None:
def _transform_collaborators(
repo_url: str, collaborators: List[UserAffiliationAndRepoPermission], transformed_collaborators: Dict,
) -> None:
"""
Performs data adjustments for outside collaborators in a GitHub repo.
Performs data adjustments for collaborators in a GitHub repo.
Output data shape = [{permission, repo_url, url (the user's URL), login, name}, ...]
:param collaborators: See cartography.tests.data.github.repos for data shape.
:param collaborators: For data shape, see
cartography.tests.data.github.repos.DIRECT_COLLABORATORS
cartography.tests.data.github.repos.OUTSIDE_COLLABORATORS
:param repo_url: The URL of the GitHub repo.
:param transformed_collaborators: Output dict. Data shape =
{'ADMIN': [{ user }, ...], 'MAINTAIN': [{ user }, ...], 'READ': [ ... ], 'TRIAGE': [ ... ], 'WRITE': [ ... ]}
:return: Nothing.
"""
# `collaborators` is sometimes None
if collaborators:
for idx, user in enumerate(collaborators['nodes']):
user_permission = collaborators['edges'][idx]['permission']
for collaborator in collaborators:
user = collaborator.user
user['repo_url'] = repo_url
transformed_collaborators[user_permission].append(user)
user['affiliation'] = collaborator.affiliation
transformed_collaborators[collaborator.permission].append(user)


def _transform_requirements_txt(
@@ -482,7 +620,7 @@ def load_github_owners(neo4j_session: neo4j.Session, update_tag: int, repo_owner


@timeit
def load_collaborators(neo4j_session: neo4j.Session, update_tag: int, collaborators: Dict) -> None:
def load_collaborators(neo4j_session: neo4j.Session, update_tag: int, collaborators: Dict, affiliation: str) -> None:
query = Template("""
UNWIND $UserData as user
@@ -502,7 +640,7 @@ def load_collaborators(neo4j_session: neo4j.Session, update_tag: int, collaborat
SET o.lastupdated = $UpdateTag
""")
for collab_type in collaborators.keys():
relationship_label = f"OUTSIDE_COLLAB_{collab_type}"
relationship_label = f"{affiliation}_COLLAB_{collab_type}"
neo4j_session.run(
query.safe_substitute(rel_label=relationship_label),
UserData=collaborators[collab_type],
@@ -515,7 +653,12 @@ def load(neo4j_session: neo4j.Session, common_job_parameters: Dict, repo_data: D
load_github_repos(neo4j_session, common_job_parameters['UPDATE_TAG'], repo_data['repos'])
load_github_owners(neo4j_session, common_job_parameters['UPDATE_TAG'], repo_data['repo_owners'])
load_github_languages(neo4j_session, common_job_parameters['UPDATE_TAG'], repo_data['repo_languages'])
load_collaborators(neo4j_session, common_job_parameters['UPDATE_TAG'], repo_data['repo_collaborators'])
load_collaborators(
neo4j_session, common_job_parameters['UPDATE_TAG'], repo_data['repo_direct_collaborators'], 'DIRECT',
)
load_collaborators(
neo4j_session, common_job_parameters['UPDATE_TAG'], repo_data['repo_outside_collaborators'], 'OUTSIDE',
)
load_python_requirements(neo4j_session, common_job_parameters['UPDATE_TAG'], repo_data['python_requirements'])


@@ -561,6 +704,12 @@ def sync(
"""
logger.info("Syncing GitHub repos")
repos_json = get(github_api_key, github_url, organization)
repo_data = transform(repos_json)
direct_collabs = _get_repo_collaborators_for_multiple_repos(
repos_json, "DIRECT", organization, github_url, github_api_key,
)
outside_collabs = _get_repo_collaborators_for_multiple_repos(
repos_json, "OUTSIDE", organization, github_url, github_api_key,
)
repo_data = transform(repos_json, direct_collabs, outside_collabs)
load(neo4j_session, common_job_parameters, repo_data)
run_cleanup_job('github_repos_cleanup.json', neo4j_session, common_job_parameters)
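
To make the data flow in this file easier to follow, here is a minimal standalone sketch of the two steps the diff introduces: pairing each collaborator node with its permission edge, then bucketing users by permission level. The function names `pair_users_with_permissions` and `bucket_by_permission`, the sample users, and the example repo URL are all made up for illustration. Unlike `_transform_collaborators` in the diff, this sketch copies each user dict rather than mutating it in place.

```python
from collections import namedtuple

UserAffiliationAndRepoPermission = namedtuple(
    "UserAffiliationAndRepoPermission", ["user", "permission", "affiliation"],
)


def pair_users_with_permissions(nodes, edges, affiliation):
    """Zip collaborator nodes with their permission edges (same order in both lists)."""
    return [
        UserAffiliationAndRepoPermission(user, edge["permission"], affiliation)
        for user, edge in zip(nodes, edges)
    ]


def bucket_by_permission(repo_url, collaborators):
    """Group collaborators by permission level, tagging each with repo URL and affiliation."""
    buckets = {"ADMIN": [], "MAINTAIN": [], "READ": [], "TRIAGE": [], "WRITE": []}
    for collab in collaborators:
        user = dict(collab.user)  # copy so the input data is left untouched
        user["repo_url"] = repo_url
        user["affiliation"] = collab.affiliation
        buckets[collab.permission].append(user)
    return buckets


# Made-up sample data in the shape of the GraphQL `nodes` and `edges` lists.
nodes = [{"login": "alice"}, {"login": "bob"}]
edges = [{"permission": "ADMIN"}, {"permission": "READ"}]
collabs = pair_users_with_permissions(nodes, edges, "DIRECT")
buckets = bucket_by_permission("https://github.com/example-org/example-repo", collabs)
```

Each key of `buckets` then corresponds to one relationship type at load time, since `load_collaborators` builds the label as `{affiliation}_COLLAB_{collab_type}`.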
18 changes: 16 additions & 2 deletions docs/root/modules/github/schema.md
@@ -39,13 +39,20 @@ Representation of a single GitHubRepository (repo) [repository object](https://d
(GitHubOrganization)-[OWNER]->(GitHubRepository)
```
- GitHubRepositories in an organization can have outside collaborators with different permissions, including ADMIN,
- GitHubRepositories in an organization can have [outside collaborators](https://docs.github.com/en/graphql/reference/enums#collaboratoraffiliation) who may be granted different levels of access, including ADMIN,
WRITE, MAINTAIN, TRIAGE, and READ ([Reference](https://docs.github.com/en/graphql/reference/enums#repositorypermission)).
```
(GitHubUser)-[:OUTSIDE_COLLAB_{ACTION}]->(GitHubRepository)
```
- GitHubRepositories in an organization also mark all [direct collaborators](https://docs.github.com/en/graphql/reference/enums#collaboratoraffiliation), folks who are not necessarily 'outside' but who are granted access directly to the repository (as opposed to via membership in a team). They may be granted different levels of access, including ADMIN,
WRITE, MAINTAIN, TRIAGE, and READ ([Reference](https://docs.github.com/en/graphql/reference/enums#repositorypermission)).
```
(GitHubUser)-[:DIRECT_COLLAB_{ACTION}]->(GitHubRepository)
```
- GitHubRepositories use ProgrammingLanguages
```
(GitHubRepository)-[:LANGUAGE]->(ProgrammingLanguage)
@@ -151,13 +158,20 @@ Representation of a single GitHubUser [user object](https://developer.github.com
(GitHubUser)-[OWNER]->(GitHubRepository)
```
- GitHubRepositories in an organization can have outside collaborators with different permissions, including ADMIN,
- GitHubRepositories in an organization can have [outside collaborators](https://docs.github.com/en/graphql/reference/enums#collaboratoraffiliation) who may be granted different levels of access, including ADMIN,
WRITE, MAINTAIN, TRIAGE, and READ ([Reference](https://docs.github.com/en/graphql/reference/enums#repositorypermission)).
```
(GitHubUser)-[:OUTSIDE_COLLAB_{ACTION}]->(GitHubRepository)
```
- GitHubRepositories in an organization also mark all [direct collaborators](https://docs.github.com/en/graphql/reference/enums#collaboratoraffiliation), folks who are not necessarily 'outside' but who are granted access directly to the repository (as opposed to via membership in a team). They may be granted different levels of access, including ADMIN,
WRITE, MAINTAIN, TRIAGE, and READ ([Reference](https://docs.github.com/en/graphql/reference/enums#repositorypermission)).
```
(GitHubUser)-[:DIRECT_COLLAB_{ACTION}]->(GitHubRepository)
```
- GitHubUsers are members of an organization. In some cases there may be a user who is "unaffiliated" with an org, for example if the user is an enterprise owner, but not member of, the org. [Enterprise owners](https://docs.github.com/en/enterprise-cloud@latest/admin/managing-accounts-and-repositories/managing-users-in-your-enterprise/roles-in-an-enterprise#enterprise-owners) have complete control over the enterprise (i.e. they can manage all enterprise settings, members, and policies) yet may not show up on member lists of the GitHub org.
```