Merge pull request #58 from DLR-SC/54-support-more-than-one-output-format-through-argument-chaining

54 Support more than one output format through argument chaining
cdboer authored Jun 12, 2022
2 parents a80db75 + 9a246ef commit 6cf9805
Showing 6 changed files with 142 additions and 35 deletions.
84 changes: 64 additions & 20 deletions README.md
@@ -1,64 +1,96 @@
# :seedling: `gitlab2prov`: Extract Provenance from GitLab Projects
# :seedling: `gitlab2prov`: Extract Provenance from GitLab Projects

[![License: MIT](https://img.shields.io/github/license/dlr-sc/gitlab2prov?label=License)](https://opensource.org/licenses/MIT) [![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/) [![PyPI version fury.io](https://badge.fury.io/py/gitlab2prov.svg)](https://pypi.python.org/pypi/gitlab2prov/) [![DOI](https://zenodo.org/badge/215042878.svg)](https://zenodo.org/badge/latestdoi/215042878) [![Open in Visual Studio Code](https://open.vscode.dev/badges/open-in-vscode.svg)](https://open.vscode.dev/DLR-SC/gitlab2prov)

[![Git commits (by Cauldron.io)](https://cauldron.io/project/4509/export/svg/git_commits.svg)](https://cauldron.io/project/4509) [![Issues created (by Cauldron.io)](https://cauldron.io/project/4509/export/svg/issues_created.svg)](https://cauldron.io/project/4509) [![Issues closed (by Cauldron.io)](https://cauldron.io/project/4509/export/svg/issues_closed.svg)](https://cauldron.io/project/4509)

`gitlab2prov` is a Python library and command line tool for extracting provenance information from GitLab projects.
`gitlab2prov` is a Python library and command line tool for extracting provenance information from GitLab projects.

The data model employed by `gitlab2prov` has been modelled according to [W3C PROV](https://www.w3.org/TR/prov-overview/) [![PROV](https://www.w3.org/Icons/SW/Buttons/sw-prov-blue.png)](https://www.w3.org/TR/prov-overview/) specification.
A representation of the model can be found in `/docs`.
The data model employed by `gitlab2prov` has been modelled according to [W3C PROV](https://www.w3.org/TR/prov-overview/) [![PROV](https://www.w3.org/Icons/SW/Buttons/sw-prov-blue.png)](https://www.w3.org/TR/prov-overview/) specification.
More information regarding the provenance model can be found in `/docs`.

## Installation :wrench:
## ️🏗️ ️Installation

Clone the project and use the provided `setup.py` to install `gitlab2prov`.

```bash
python setup.py install --user
```

## Usage :computer:
## 👩‍💻 Usage

`gitlab2prov` can be used either as a command line script or as a Python lib.
`gitlab2prov` can be used as a command line script and as a Python lib.

To extract provenance from a project, follow these steps:
To extract provenance from a gitlab project, follow these steps:
| Instructions | Config Option |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|
| 1. Obtain an API Token for the GitLab API ([Token Guide](https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html#creating-a-personal-access-token)) | `--token` |
| 2. Set the URL[s] for the GitLab Project[s] | `--project_urls` |
| 3. Choose a PROV serialization format | `--format` |

### As a Command Line Script

`gitlab2prov` can be configured either by command line flags or by using a config file.

### 📋 Config File Example

##### Config File :clipboard:

An example of a configuration file can be found in `/config`.
An example of a configuration file can be found in `/config/example.ini`.

```ini
# This is an example of a configuration file as used by gitlab2prov.
# The configuration options match the command line flags in function.

[GITLAB]
# Gitlab project urls as a comma separated list.
project_urls = project_a_url, project_b_url

# Gitlab personal access token.
# More about tokens and how to create them:
# https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html#create-a-personal-access-token
token = token

[OUTPUT]
format = json
# Provenance serialization format.
# Supported formats: json, rdf, xml, provn, dot
format = json, rdf, xml

# File location to write provenance output to.
# Each specified format will result in a separate file.
# For example:
# format = json, xml
# outfile = out/example
# Creates the files:
# out/example.json
# out/example.xml
outfile = provout/example

[MISC]
# Enables/Disables profiling using the cprofile lib.
# The runtime profile is written to a file called gitlab2prov-run-$TIMESTAMP.profile
# where $TIMESTAMP is the current time in 'YYYY-MM-DD-hh-mm-ss' format.
# The profile can be visualized using tools such as snakeviz.
profile = False

# Enables/Disables verbose output (DEBUG mode logging to stdout)
verbose = False
pseudonymous = False

# Path to double agent mapping to unify duplicated agents.
double_agents = path/to/alias/mapping

# Enables/Disables agent pseudonymization by enumeration.
pseudonymous = False
```
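
The way a comma separated `format` value turns into one output file per format can be sketched as follows. This is a minimal illustration modelled on the `convert_csv` converter in this commit, not the project's exact code path; the inline config text `EXAMPLE` and the `targets` list are assumptions made for the example.

```python
import configparser
import csv


def convert_csv(csv_string: str) -> list[str]:
    # Parse a single comma separated line, then strip whitespace and quotes.
    [items] = list(csv.reader(csv_string.splitlines()))
    return [item.strip().strip("'").strip('"') for item in items]


# Hypothetical, shortened config text used only for this illustration.
EXAMPLE = """
[OUTPUT]
format = json, rdf, xml
outfile = provout/example
"""

# Registering the converter makes parser.getcsv(...) available.
parser = configparser.ConfigParser(converters={"csv": convert_csv})
parser.read_string(EXAMPLE)

formats = parser.getcsv("OUTPUT", "format", fallback=["json"])
outfile = parser.get("OUTPUT", "outfile", fallback=None)

# One target file per requested format: {outfile}.{format}
targets = [f"{outfile}.{fmt}" for fmt in formats]
print(targets)
```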

##### Command Line Flags :flags:
### 🖥️ Command Line Usage ☝ Single Format Serialization

```
usage: gitlab2prov [-h] -p PROJECT_URLS [PROJECT_URLS ...] -t TOKEN [-c CONFIG_FILE] [-f {json,rdf,xml,provn,dot}] [-v] [--double-agents DOUBLE_AGENTS] [--pseudonymous] [--profile]
usage: gitlab2prov [-h] -p PROJECT_URLS [PROJECT_URLS ...] -t TOKEN [-c CONFIG_FILE] [-f {json,rdf,xml,provn,dot}] [-v] [--double-agents DOUBLE_AGENTS] [--pseudonymous] [--profile] {multi-format} ...
Extract provenance information from GitLab projects.
positional arguments:
{multi-format}
multi-format serialize output in multiple formats
options:
-h, --help show this help message and exit
-p PROJECT_URLS [PROJECT_URLS ...], --project-urls PROJECT_URLS [PROJECT_URLS ...]
@@ -75,8 +107,21 @@ options:
--pseudonymous pseudonymize user names by enumeration
--profile enable deterministic profiling, write profile to 'gitlab2prov-run-$TIMESTAMP.profile' where $TIMESTAMP is the current timestamp in 'YYYY-MM-DD-hh-mm-ss' format
```
### 🖥️ Command Line Usage 🖐 Multi Format Serialization
To serialize the extracted provenance information into multiple formats in one go, use the provided `multi-format` mode.

```
usage: gitlab2prov multi-format [-h] [-f {json,rdf,xml,provn,dot} [{json,rdf,xml,provn,dot} ...]] -o OUTFILE
options:
-h, --help show this help message and exit
-f {json,rdf,xml,provn,dot} [{json,rdf,xml,provn,dot} ...], --format {json,rdf,xml,provn,dot} [{json,rdf,xml,provn,dot} ...]
provenance serialization formats
-o OUTFILE, --outfile OUTFILE
serialize to {outfile}.{format} for each specified format
```
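
A reduced sketch of how such a `multi-format` subcommand can be wired with `argparse` is shown below. It mirrors the options in the usage text above, but the `build_parser` helper, the project URL, and the token value are placeholders assumed for illustration.

```python
import argparse

SUPPORTED_FORMATS = ["json", "rdf", "xml", "provn", "dot"]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: only the options needed for the example are defined.
    parser = argparse.ArgumentParser(prog="gitlab2prov")
    parser.add_argument("-p", "--project-urls", nargs="+", required=True)
    parser.add_argument("-t", "--token", required=True)

    subparsers = parser.add_subparsers()
    multi = subparsers.add_parser("multi-format")
    multi.add_argument(
        "-f", "--format", nargs="+", choices=SUPPORTED_FORMATS, default=["json"]
    )
    multi.add_argument("-o", "--outfile", required=True)
    return parser


# Placeholder URL and token; the subcommand comes after the top-level options.
args = build_parser().parse_args(
    ["-p", "https://gitlab.com/group/project", "-t", "TOKEN",
     "multi-format", "-f", "json", "xml", "-o", "out/example"]
)
print(args.format, args.outfile)
```

In this sketch `-t` sits between `-p` and the subcommand name. Because `-p` accepts one or more values, a subcommand placed directly after the URLs would likely be consumed as another URL.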

### Provenance Output Formats
### 🎨 Provenance Output Formats

`gitlab2prov` supports output formats that the [`prov`](https://github.com/trungdong/prov) library provides:
* [PROV-N](http://www.w3.org/TR/prov-n/)
@@ -85,7 +130,6 @@ options:
* [PROV-JSON](http://www.w3.org/Submission/prov-json/)
* [Graphviz](https://graphviz.org/) (DOT)


## Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
@@ -115,7 +159,7 @@ You can also cite specific releases published on Zenodo: [![DOI](https://zenodo.
## References

**Influential Software for `gitlab2prov`**
* Martin Stoffers: "Gitlab2Graph", v1.0.0, October 13. 2019, [GitHub Link](https://github.com/DLR-SC/Gitlab2Graph), DOI 10.5281/zenodo.3469385
* Martin Stoffers: "Gitlab2Graph", v1.0.0, October 13. 2019, [GitHub Link](https://github.com/DLR-SC/Gitlab2Graph), DOI 10.5281/zenodo.3469385

* Quentin Pradet: "How do you rate limit calls with aiohttp?", [GitHub Gist](https://gist.github.com/pquentin/5d8f5408cdad73e589d85ba509091741), MIT LICENSE

@@ -131,4 +175,4 @@ You can also cite specific releases published on Zenodo: [![DOI](https://zenodo.

* Tim Sonnekalb, Thomas S. Heinze, Lynn von Kurnatowski, Andreas Schreiber, Jesus M. Gonzalez-Barahona, and Heather Packer (2020). [Towards automated, provenance-driven security audit for git-based repositories: applied to germany's corona-warn-app: vision paper](https://doi.org/10.1145/3416507.3423190). In *Proceedings of the 3rd ACM SIGSOFT International Workshop on Software Security from Design to Deployment* (pp. 15–18).

* Andreas Schreiber (2020). [Visualization of contributions to open-source projects](https://doi.org/10.1145/3430036.3430057). In *Proceedings of the 13th International Symposium on Visual Information Communication and Interaction*. ACM, USA.
* Andreas Schreiber (2020). [Visualization of contributions to open-source projects](https://doi.org/10.1145/3430036.3430057). In *Proceedings of the 13th International Symposium on Visual Information Communication and Interaction*. ACM, USA.
12 changes: 11 additions & 1 deletion config/example.ini
@@ -13,7 +13,17 @@ token = token
[OUTPUT]
# Provenance serialization format.
# Supported formats: json, rdf, xml, provn, dot
format = json
format = json, rdf, xml

# File location to write provenance output to.
# Each specified format will result in a separate file.
# For example:
# format = json, xml
# outfile = out/example
# Creates the files:
# out/example.json
# out/example.xml
outfile = provout/example

[MISC]
# Enables/Disables profiling using the cprofile lib.
53 changes: 47 additions & 6 deletions gitlab2prov/config.py
@@ -3,7 +3,7 @@
import argparse
import configparser
from dataclasses import dataclass
from typing import Optional
from typing import Optional, Tuple, Union


SUPPORTED_FORMATS = ["json", "rdf", "xml", "provn", "dot"]
@@ -13,23 +13,37 @@
class Config:
    project_urls: list[str]
    token: str
    format: str
    format: Union[str, list[str]]
    outfile: Optional[str]
    pseudonymous: bool
    verbose: bool
    profile: bool
    double_agents: Optional[str]


class ConfigError(Exception):
    pass


def convert_string(s: str) -> str:
    return s.strip("'").strip('"')


def convert_csv(csv_string: str) -> list[str]:
    lines = csv_string.splitlines()
    reader = csv.reader(lines)
    [urls] = list(reader)
    urls = [url.strip().strip("'").strip('"') for url in urls]
    return urls
    [items] = list(reader)
    items = [item.strip().strip("'").strip('"') for item in items]
    return items


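def check_mode_requirements(config: configparser.ConfigParser) -> Tuple[bool, str]:
    # Parse 'format' as a list so that len() counts formats,
    # not the characters of the raw option string.
    if len(config.getcsv("OUTPUT", "format", fallback=["json"])) > 1:
        if "outfile" not in config["OUTPUT"]:
            return False, "Missing option 'outfile' in section 'OUTPUT'"
        # An empty value is read as "" rather than None, so test for falsiness.
        if not config.getstring("OUTPUT", "outfile"):
            return False, "Missing value for option 'outfile' in section 'OUTPUT'"
    return True, ""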
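
The rule this check enforces (more than one format requires an `outfile`) can be exercised in isolation. The sketch below is a simplified stand-in, not the project's function: the `parse` helper and the lambda converter are assumptions made for the example, and the two error cases are collapsed into one message.

```python
import configparser


def parse(text: str) -> configparser.ConfigParser:
    # Hypothetical helper: a simplified csv converter stands in for convert_csv.
    config = configparser.ConfigParser(
        converters={"csv": lambda s: [item.strip() for item in s.split(",")]}
    )
    config.read_string(text)
    return config


def check_mode_requirements(config: configparser.ConfigParser) -> tuple[bool, str]:
    formats = config.getcsv("OUTPUT", "format", fallback=["json"])
    outfile = config.get("OUTPUT", "outfile", fallback=None)
    # Multiple formats cannot all go to stdout, so a file stem is required.
    if len(formats) > 1 and not outfile:
        return False, "Missing option 'outfile' in section 'OUTPUT'"
    return True, ""


ok_single = check_mode_requirements(parse("[OUTPUT]\nformat = json\n"))
missing = check_mode_requirements(parse("[OUTPUT]\nformat = json, xml\n"))
ok_multi = check_mode_requirements(
    parse("[OUTPUT]\nformat = json, xml\noutfile = out/example\n")
)
print(ok_single, missing, ok_multi)
```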


def read_config():
@@ -46,10 +60,16 @@ def read_file(config_file: str) -> Config:
        converters={"string": convert_string, "csv": convert_csv}
    )
    config.read(config_file)

    ok, msg = check_mode_requirements(config)
    if not ok:
        raise ConfigError(msg)

    return Config(
        config.getcsv("GITLAB", "project_urls"),
        config.getstring("GITLAB", "token"),
        config.getstring("OUTPUT", "format", fallback="json"),
        config.getcsv("OUTPUT", "format", fallback=["json"]),
        config.getstring("OUTPUT", "outfile", fallback=None),
        config.getboolean("MISC", "pseudonymous", fallback=False),
        config.getboolean("MISC", "verbose", fallback=False),
        config.getboolean("MISC", "profile", fallback=False),
@@ -70,6 +90,26 @@ def read_cli() -> tuple[Optional[Config], Optional[str]]:
        prog="gitlab2prov",
        description="Extract provenance information from GitLab projects.",
    )

    subparsers = parser.add_subparsers(help="")
    multiformat = subparsers.add_parser(
        "multi-format", help="serialize output in multiple formats"
    )
    multiformat.add_argument(
        "-f",
        "--format",
        help="provenance serialization formats",
        nargs="+",
        choices=SUPPORTED_FORMATS,
        default=["json"],
    )
    multiformat.add_argument(
        "-o",
        "--outfile",
        help="serialize to {outfile}.{format} for each specified format",
        required=True,
    )

    parser.add_argument(
        "-p",
        "--project-urls",
@@ -129,6 +169,7 @@ def read_cli() -> tuple[Optional[Config], Optional[str]]:
        args.project_urls,
        args.token,
        args.format,
        getattr(args, "outfile", None),
        args.pseudonymous,
        args.verbose,
        args.profile,
2 changes: 2 additions & 0 deletions gitlab2prov/domain/commands.py
@@ -1,5 +1,6 @@
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
@@ -23,3 +24,4 @@ class Serialize(Command):
    format: str
    pseudonymize: bool
    uncover_double_agents: str
    out: Optional[str] = None
7 changes: 5 additions & 2 deletions gitlab2prov/entrypoints/cli.py
@@ -21,8 +21,11 @@ def run():
        cmd = commands.Fetch(url, config.token)
        bus.handle(cmd)

    cmd = commands.Serialize(config.format, config.pseudonymous, config.double_agents)
    bus.handle(cmd)
    for fmt in config.format:
        cmd = commands.Serialize(
            fmt, config.pseudonymous, config.double_agents, config.outfile
        )
        bus.handle(cmd)

run()
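
Since `Config.format` is typed `Union[str, list[str]]`, looping over it directly would walk the characters of a plain string such as `"json"`. The sketch below shows one possible way to normalize the value before creating one `Serialize` command per format. The `serialize_commands` helper is hypothetical, and the `Serialize` dataclass here is a local stand-in for `commands.Serialize`.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Serialize:
    # Local stand-in mirroring the fields of commands.Serialize in this commit.
    format: str
    pseudonymize: bool
    uncover_double_agents: Optional[str]
    out: Optional[str] = None


def serialize_commands(
    fmt: Union[str, list[str]],
    pseudonymous: bool,
    double_agents: Optional[str],
    outfile: Optional[str],
) -> list[Serialize]:
    # Wrap a single format string in a list so the loop never iterates characters.
    formats = [fmt] if isinstance(fmt, str) else list(fmt)
    return [Serialize(f, pseudonymous, double_agents, outfile) for f in formats]


single = serialize_commands("json", False, None, None)
multi = serialize_commands(["json", "xml"], False, None, "out/example")
print([cmd.format for cmd in single], [cmd.format for cmd in multi])
```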

19 changes: 13 additions & 6 deletions gitlab2prov/service_layer/handlers.py
@@ -1,16 +1,15 @@
import logging
from urllib.parse import urlsplit
from pathlib import Path
from tempfile import TemporaryDirectory
from urllib.parse import urlsplit

from git import Repo
from gitlab import Gitlab
from prov.dot import prov_to_dot
from prov.model import ProvDocument

from gitlab2prov.domain import commands
from gitlab2prov.prov import operations
from gitlab2prov.prov import model

from gitlab2prov.prov import model, operations

log = logging.getLogger(__name__)

@@ -32,12 +31,16 @@ def clone_with_https_url(url: str, token: str) -> str:
    return f"https://gitlab.com:{token}@{split.netloc}/{project_slug(url)}"


def serialize_graph(graph: ProvDocument, fmt: str):
def serialize_graph(graph: ProvDocument, fmt: str) -> str:
    if fmt == "dot":
        return prov_to_dot(graph)
    return graph.serialize(format=fmt)


def strip_file_extension(s: str) -> Path:
    return Path(s).with_suffix("")


def mine_git(cmd: commands.Fetch, uow, git_miner) -> None:
    url = clone_with_https_url(cmd.project_url, cmd.token)
    with TemporaryDirectory() as tmpdir:
@@ -79,7 +82,11 @@ def serialize(cmd: commands.Serialize, uow) -> None:
    if cmd.pseudonymize:
        graph = operations.pseudonymize(graph)

    # write to stdout
    if cmd.out is not None:
        with open(f"{strip_file_extension(cmd.out)}.{cmd.format}", "w") as f:
            print(serialize_graph(graph, cmd.format), file=f)
        return

    print(serialize_graph(graph, cmd.format))
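
The output file naming used here can be checked on its own. The sketch below reuses the `Path.with_suffix("")` idea from `strip_file_extension`; the `output_path` helper and the sample file names are assumptions made for the example.

```python
from pathlib import Path


def output_path(outfile: str, fmt: str) -> Path:
    # Drop the last suffix (if any), then append the format as the new extension.
    stem = Path(outfile).with_suffix("")
    return Path(f"{stem}.{fmt}")


# A user-supplied extension is replaced by each requested format.
paths = [output_path("out/example.json", fmt) for fmt in ["json", "xml"]]

# Only the last dotted segment is removed, so "example.v1" loses ".v1".
dotted = output_path("out/example.v1", "json")
print(paths, dotted)
```

One consequence of this approach is that an `outfile` whose name contains a dot, such as `out/example.v1`, ends up as `out/example.json`, which may or may not be what the caller intended.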

