docs: add data sources docs (#50)
Signed-off-by: frank-zsy <[email protected]>
frank-zsy authored Sep 29, 2024
1 parent c71ccc8 commit 43d1b4f
Showing 6 changed files with 108 additions and 0 deletions.
24 changes: 24 additions & 0 deletions docs/user_docs/data_sources/gitee.md
@@ -1,2 +1,26 @@
# Gitee

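## Data Source

OpenDigger has an official partnership with [Gitee](https://gitee.com/). It maintains the [GVP](https://gitee.com/gvp) project list internally over the long term and collects historical event logs for all GVP projects through the [Gitee API](https://gitee.com/api/v5/swagger#/getV5ReposOwnerRepoEvents).

The code for data collection, cleaning, and loading is not currently open source as part of the OpenDigger project; it runs as a daily scheduled task that imports data into the database.

Metrics are exported for every Gitee repository that is collected. If your project is not in the [export list](../metrics/metrics_usage_guide#导出范围), please open an Issue in the OpenDigger repository and we will add your repository to the collection list. Adding an entire organization is also supported.

## Note

Gitee uses different numbering schemes for Issues and Pull Requests. To stay compatible with GitHub's purely numeric scheme, we treat each Gitee Issue number as a base-36 number and store it converted to base 10. To recover the original Issue number, convert the stored value back to base 36.

The following example converts between base 10 and base 36 in JavaScript: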

```JavaScript
const rawIssueNumber = 'I1R';

// Convert base-36 to base-10
const issueNumber = parseInt(rawIssueNumber, 36);
console.log(issueNumber); // 23391

// Convert base-10 to base-36
console.log(issueNumber.toString(36).toUpperCase()); // I1R
```
27 changes: 27 additions & 0 deletions docs/user_docs/data_sources/github.md
@@ -1,2 +1,29 @@
# GitHub

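## Data Source

OpenDigger uses [GHArchive](https://www.gharchive.org/) as its data source to collect GitHub-wide event log data, and uses the [ClickHouse](https://github.com/ClickHouse/ClickHouse) cloud service as its data infrastructure.

The code for data collection, cleaning, and loading is not currently open source as part of the OpenDigger project; it runs as an hourly scheduled task that imports data into the database.

## Missing Data

Because GHArchive is occasionally unavailable, OpenDigger's GitHub data source has gaps. The time periods currently missing are: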

- 2016-10-21-18
- 2018-10-21-23
- 2018-10-22-0 ~ 2018-10-22-1
- 2019-05-08-12 ~ 2019-05-08-13
- 2019-09-12-8 ~ 2019-09-13-5
- 2020-03-05-22
- 2020-06-10-12 ~ 2020-06-10-21
- 2020-08-21-9 ~ 2020-08-23-15
- 2020-10-30-17
- 2021-08-25-17 ~ 2021-08-27-22
- 2021-09-11-9
- 2021-10-22-5 ~ 2021-10-22-22
- 2021-10-23-2 ~ 2021-10-23-22
- 2021-10-24-3 ~ 2021-10-24-22
- 2021-10-25-1 ~ 2021-10-25-22
- 2021-10-26-0 ~ 2021-10-29-17
- 2023-05-14-19
2 changes: 2 additions & 0 deletions docs/user_docs/metrics/metrics_usage_guide.md
@@ -335,6 +335,8 @@ OpenDigger 实现的所有指标对所有人开放使用,OpenDigger 的静态
}
```

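> Because OpenDigger's GitHub data source has [periods of missing data](../data_sources/github#数据缺失), a key such as `2021-10-raw`, when present, holds the raw value of the metric. To keep the metric continuous over time, the corresponding `2021-10` value is interpolated from the two months before and the two months after. See the code [here](https://github.com/X-lab2017/open-digger/blob/master/src/cron/tasks/monthly_export.ts#L176).

## Export Range

OpenDigger does not export metrics for every repository and user on the platform. The exported repositories and developers are listed in [`repo_list.csv`](https://oss.open-digger.cn/repo_list.csv) and [`user_list.csv`](https://oss.open-digger.cn/user_list.csv) respectively, where: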
@@ -1 +1,26 @@
# Gitee

## Data Source

OpenDigger has an official partnership with [Gitee](https://gitee.com/) and maintains the [GVP](https://gitee.com/gvp) project list internally. OpenDigger collects event logs for all GVP repositories using the [Gitee API](https://gitee.com/api/v5/swagger#/getV5ReposOwnerRepoEvents).

The code for data collection, cleaning, and loading is not currently open source as part of the OpenDigger project; it runs as a daily scheduled task that imports data into the database.

Metrics are exported for every Gitee repository that is collected. If your project is not in the [export list](../metrics/metrics_usage_guide#export-range), please open an issue in the OpenDigger repository and we will add your repository to the collection list. Adding an entire organization is also supported.

## Note

Gitee uses different numbering schemes for Issues and Pull Requests. To stay compatible with GitHub's purely numeric scheme, we treat each Gitee Issue number as a base-36 number and store it converted to base 10. To recover the original Issue number, convert the stored value back to base 36.

Below is an example of converting between base-10 and base-36 numbers in JavaScript:

```JavaScript
const rawIssueNumber = 'I1R';

// Convert base-36 to base-10
const issueNumber = parseInt(rawIssueNumber, 36);
console.log(issueNumber); // 23391

// Convert base-10 to base-36
console.log(issueNumber.toString(36).toUpperCase()); // I1R
```
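
The conversion above can also be wrapped in two small helpers. This is only a sketch: the function names are hypothetical and not part of OpenDigger. It adds validation because `parseInt` stops at the first invalid character instead of failing, so a malformed input would otherwise be converted silently.

```javascript
// Hypothetical helpers for illustration; these names are not part of OpenDigger.

// Convert a Gitee Issue number such as 'I1R' to the stored base-10 value.
function toStoredIssueNumber(rawIssueNumber) {
  // parseInt ignores everything after the first invalid character,
  // so check the whole string before converting.
  if (!/^[0-9A-Z]+$/i.test(rawIssueNumber)) {
    throw new Error(`Not a base-36 issue number: ${rawIssueNumber}`);
  }
  return parseInt(rawIssueNumber, 36);
}

// Convert a stored base-10 value back to the Gitee Issue number.
function toGiteeIssueNumber(storedIssueNumber) {
  if (!Number.isSafeInteger(storedIssueNumber) || storedIssueNumber < 0) {
    throw new Error(`Not a valid stored issue number: ${storedIssueNumber}`);
  }
  return storedIssueNumber.toString(36).toUpperCase();
}

console.log(toStoredIssueNumber('I1R')); // 23391
console.log(toGiteeIssueNumber(23391)); // I1R
```

Note that a raw value with leading zeros would not survive the round trip, because the base-10 form does not keep them.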
@@ -1 +1,29 @@
# GitHub

## Data Source

OpenDigger uses [GHArchive](https://www.gharchive.org/) as its data source, collecting global event log data from GitHub and using [ClickHouse](https://github.com/ClickHouse/ClickHouse) cloud service as the underlying data infrastructure.

The code for data collection, cleaning, and loading is not currently open source as part of the OpenDigger project; it runs as an hourly scheduled task that imports data into the database.

## Missing Data

Because GHArchive is occasionally unavailable, OpenDigger's GitHub data source has gaps. The time periods currently missing are:

- 2016-10-21-18
- 2018-10-21-23
- 2018-10-22-0 ~ 2018-10-22-1
- 2019-05-08-12 ~ 2019-05-08-13
- 2019-09-12-8 ~ 2019-09-13-5
- 2020-03-05-22
- 2020-06-10-12 ~ 2020-06-10-21
- 2020-08-21-9 ~ 2020-08-23-15
- 2020-10-30-17
- 2021-08-25-17 ~ 2021-08-27-22
- 2021-09-11-9
- 2021-10-22-5 ~ 2021-10-22-22
- 2021-10-23-2 ~ 2021-10-23-22
- 2021-10-24-3 ~ 2021-10-24-22
- 2021-10-25-1 ~ 2021-10-25-22
- 2021-10-26-0 ~ 2021-10-29-17
- 2023-05-14-19
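
When analyzing hourly data, it can help to check whether a given hour falls inside one of the periods listed above. The following is a minimal sketch, not OpenDigger code: the names `MISSING_WINDOWS`, `parseHour`, and `isMissingHour` are hypothetical, and it assumes each entry has the form `YYYY-MM-DD-H` with both ends of a range inclusive. `Date.UTC` is used only to get a comparable number.

```javascript
// Missing periods copied from the list above as [start, end] pairs.
// Assumption: both ends are inclusive; single hours repeat the same stamp.
const MISSING_WINDOWS = [
  ['2016-10-21-18', '2016-10-21-18'],
  ['2018-10-21-23', '2018-10-21-23'],
  ['2018-10-22-0', '2018-10-22-1'],
  ['2019-05-08-12', '2019-05-08-13'],
  ['2019-09-12-8', '2019-09-13-5'],
  ['2020-03-05-22', '2020-03-05-22'],
  ['2020-06-10-12', '2020-06-10-21'],
  ['2020-08-21-9', '2020-08-23-15'],
  ['2020-10-30-17', '2020-10-30-17'],
  ['2021-08-25-17', '2021-08-27-22'],
  ['2021-09-11-9', '2021-09-11-9'],
  ['2021-10-22-5', '2021-10-22-22'],
  ['2021-10-23-2', '2021-10-23-22'],
  ['2021-10-24-3', '2021-10-24-22'],
  ['2021-10-25-1', '2021-10-25-22'],
  ['2021-10-26-0', '2021-10-29-17'],
  ['2023-05-14-19', '2023-05-14-19'],
];

// Turn 'YYYY-MM-DD-H' into a number that sorts chronologically.
function parseHour(stamp) {
  const [year, month, day, hour] = stamp.split('-').map(Number);
  return Date.UTC(year, month - 1, day, hour);
}

// Return true if the given hour lies inside any missing period.
function isMissingHour(stamp) {
  const t = parseHour(stamp);
  return MISSING_WINDOWS.some(
    ([start, end]) => parseHour(start) <= t && t <= parseHour(end)
  );
}

console.log(isMissingHour('2021-10-27-12')); // true
console.log(isMissingHour('2021-10-22-23')); // false
```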
@@ -335,6 +335,8 @@ For example, for the [OpenDigger](https://github.com/X-lab2017/open-digger) repo
}
```

> Because OpenDigger's GitHub data source has [missing data](../data_sources/github#missing-data), a key such as `2021-10-raw`, when present, holds the raw value of the metric. To keep the metric continuous over time, the corresponding `2021-10` value is interpolated from the two months before and the two months after. For the specific code, see [here](https://github.com/X-lab2017/open-digger/blob/master/src/cron/tasks/monthly_export.ts#L176).

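
A consumer of the metric data may want to know which months carry an interpolated value. The sketch below assumes a metric object whose keys are months, with an extra `<month>-raw` key wherever the published value was interpolated, as described in the note. The sample object and the function name `findInterpolatedMonths` are hypothetical, and the numbers are made up.

```javascript
// Hypothetical metric payload; the values are invented for illustration.
const metric = {
  '2021-09': 10,
  '2021-10': 12,
  '2021-10-raw': 3,
  '2021-11': 14,
};

// List every month that has a '-raw' companion key, together with
// its raw value and the interpolated value published under the month key.
function findInterpolatedMonths(data) {
  const suffix = '-raw';
  return Object.keys(data)
    .filter((key) => key.endsWith(suffix))
    .map((key) => {
      const month = key.slice(0, -suffix.length);
      return { month, raw: data[key], interpolated: data[month] };
    });
}

// For the sample above this yields one entry, for month '2021-10'.
console.log(findInterpolatedMonths(metric));
```

Whether to use the raw or the interpolated value depends on the analysis: the raw value reflects what was actually collected, while the interpolated one keeps the time series smooth.
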
## Export Range

OpenDigger does not export metrics for every repository and user. The exported repositories and users are listed in [`repo_list.csv`](https://oss.open-digger.cn/repo_list.csv) and [`user_list.csv`](https://oss.open-digger.cn/user_list.csv) respectively, where: