Implementation of the github2prov Prototype Based on gitlab2prov Provenance Model #100

Draft
wants to merge 82 commits into
base: master

Conversation

cdboer
Collaborator

@cdboer cdboer commented Aug 27, 2023

Summary 📝

Implemented the github2prov prototype, which adapts the gitlab2prov provenance model for GitHub. This paves the way for creating provenance documents from GitHub activity, giving users a way to trace the lineage and modification history of their repositories.

Fixes #77

Proposed Changes 👷

  • Adapted the GitlabAnnotationParser class to handle GitHub's unique data structures and API responses. Renamed it to GithubAnnotationParser.
  • Modified provenance operations in prov/operations.py to cater to GitHub-specific behaviors.
  • Integrated pseudonymization techniques to obfuscate user data, ensuring user privacy is maintained.
  • Expanded error handling, especially in deserialization and file writing, to manage potential GitHub-specific issues.
  • ...

Type of Change 🏷️

  • Bug fix (non breaking change which fixes an issue)
  • New feature (non breaking change which adds functionality)
  • Breaking change (fix or feature that could cause existing functionality to not work as expected)

Checklist ✅

  • I have included tests, if necessary
  • I have updated documentation, if necessary
  • I have updated the changelog, if necessary

Differentiate between gitlab and github instances when cloning a git repository using a token over https.
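A minimal sketch of how a platform-specific clone URL might be selected, assuming a dictionary of URL templates keyed by platform name. The class name `ProjectUrl`, the `CLONE_URL_TEMPLATES` mapping, and the exact token formats are illustrative assumptions, not the code from this pull request.

```python
from dataclasses import dataclass, field
from urllib.parse import SplitResult, urlsplit

# Hypothetical templates: the token placement used by the actual project may differ.
CLONE_URL_TEMPLATES = {
    "gitlab": "https://oauth2:{token}@{netloc}/{slug}.git",
    "github": "https://{token}@{netloc}/{slug}.git",
}


@dataclass
class ProjectUrl:
    """Illustrative stand-in for a project URL wrapper."""

    url: str
    parsed: SplitResult = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # Parse once at construction so properties can reuse the result.
        self.parsed = urlsplit(self.url)

    @property
    def slug(self) -> str:
        # "/owner/repo.git" becomes "owner/repo".
        return self.parsed.path.strip("/").removesuffix(".git")

    def clone_url(self, platform: str, token: str) -> str:
        # Dictionary lookup instead of branching on the platform name.
        template = CLONE_URL_TEMPLATES[platform]
        return template.format(
            token=token, netloc=self.parsed.netloc, slug=self.slug
        )
```

With these assumed templates, `ProjectUrl("https://github.com/owner/repo").clone_url("github", "tok")` yields `https://tok@github.com/owner/repo.git`. An unknown platform name raises `KeyError`, which keeps unsupported hosts from silently producing a malformed URL.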
cdboer added 18 commits May 22, 2023 11:02
- Renamed attributes for better clarity (e.g., inserted to insertions, start to authored_at)
- Removed redundant attributes like file_paths
- Consolidated datetime attributes (start and end to authored_at and committed_at)
- Standardized uid attribute to id across various domain classes
- Improved naming clarity: Renamed command groups from gitlab_cli to gitlab2prov and github_cli to github2prov.
- Refined config validation and loading: Split invoke_command_line_from_config into separate functions for loading and validation (load_and_validate_config) and command execution (execute_command_from_config).
- Introduced clear separation between loading and validation of config to provide clearer error messages.
- Replaced --verbose option with --explain for statistics command for better understanding of its purpose.
- Minor cleanups: Removed unnecessary whitespace and lines.
- Introduced post-initialization method __post_init__ to parse the URL upon object creation, thus reducing repetitive calls to urlsplit.
- Simplified slug property logic by utilizing the parsed URL path directly, ensuring clearer and more concise code.
- Removed individual properties for netloc and scheme in favor of the parsed attributes from __post_init__.
- Streamlined the clone_url method by removing redundant parameters and leveraging a dictionary lookup for platform-specific URLs.
- Adapted child classes GitlabProjectUrl and GithubProjectUrl to match the streamlined method signature.
- Add 'deletions', 'insertions', 'lines', and 'files_changed' attributes to 'extract_commits'.
- Rename 'start' to 'authored_at' and 'end' to 'committed_at' in 'extract_commits'.
- Add 'insertions', 'deletions', 'lines', and 'score' attributes to 'FileRevision' in 'extract_revisions'.
- Correct typo in comment from 'remeber' to 'remember' in 'extract_revisions'.
- Introduced a filter_valid method to filter annotations without annotator or start values.
- Modified parse method to use filter_valid after sorting annotations.
- Changed property name uid to id in methods: parse_commit_comment, parse_commit_status, parse_award_reaction, parse_issue_comment, and parse_issue_event.
- Removed extraneous whitespace after parse_award_reaction method.
- Replaced uid with id in the methods for parsing notes, comments, awards, and labels.
- Improved docstrings for clarity in various methods.
- Updated error handling in the read_provenance_file method to handle file not found exceptions.
- Refactored the deserialize_string method to provide better feedback on deserialization failures.
- Modified the file write mode in write_provenance_file based on the overwrite parameter.
- Enhanced ASCII table formatting in format_stats_as_ascii_table.
- Introduced new methods for pseudonymization:
  - generate_pseudonym
  - pseudonymize_agent
  - pseudonymize_relation
- Overhauled the existing pseudonymize method for better clarity and efficiency.
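The pseudonymization helpers named above could look roughly like the following sketch. Only the function names `generate_pseudonym` and `pseudonymize_agent` come from the commit message. The signatures, the salted SHA-256 scheme, and the dictionary representation of an agent are assumptions made for illustration, and the actual code presumably operates on PROV documents instead.

```python
import hashlib


def generate_pseudonym(name: str, salt: str = "") -> str:
    """Return a stable label derived from a user name (assumed scheme)."""
    digest = hashlib.sha256((salt + name).encode("utf-8")).hexdigest()
    # A shortened digest keeps labels readable while staying deterministic.
    return f"user-{digest[:12]}"


def pseudonymize_agent(agent: dict, salt: str = "") -> dict:
    """Replace identifying attributes of an agent with a pseudonym."""
    pseudonym = generate_pseudonym(agent["name"], salt)
    # Keep non-identifying attributes; drop the name and email.
    kept = {k: v for k, v in agent.items() if k not in {"name", "email"}}
    return {**kept, "name": pseudonym}
```

Because the same name always maps to the same pseudonym, relations between agents and activities remain consistent after pseudonymization. Without a secret salt, hashed user names can be guessed by hashing candidate names, so a salt is advisable whenever the output is shared.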
@cdboer cdboer added the feature Implement this feature label Aug 27, 2023
@cdboer cdboer self-assigned this Aug 27, 2023
@cdboer cdboer linked an issue Aug 27, 2023 that may be closed by this pull request
@cdboer cdboer changed the title from "77 github2prov prototype implementation" to "Implementation of the github2prov Prototype Based on gitlab2prov Provenance Model" Aug 27, 2023
@cdboer cdboer marked this pull request as draft August 27, 2023 14:18
@bollwyvl

This seems like a very useful addition! Is it likely this work will continue?

Elsewhere, I've seen some work around ForgeFed, but it hardly seems to have the precision of the data model seen here.

Labels
feature Implement this feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GitHub2PROV prototype implementation
2 participants