Commit
Showing 145 changed files with 15,255 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,7 +8,6 @@ __pycache__/ | |
|
||
# Distribution / packaging | ||
.Python | ||
build/ | ||
develop-eggs/ | ||
dist/ | ||
downloads/ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: 3dc720c00edc04fa63c316b36e5c38a2 | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
Binary file not shown.
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
<br> | ||
|
||
# Project Constraints | ||
|
||
A non-exhaustive set of constraints. | ||
|
||
<br> | ||
|
||
## Solution Constraints | ||
|
||
These are "… constraints on the way that the problem must be solved." (4a., [Volere Template](https://homepages.laas.fr/kader/Robertson.pdf)) | ||
|
||
<br> | ||
|
||
## Time Constraints | ||
|
||
What is the client team's project time limit? Beware, time and budget are critical to determining whether a feasible solution exists. | ||
|
||
<br> | ||
|
||
## Implementation Environment Constraints | ||
|
||
Is the solution restricted to a particular cloud platform or to on-premises infrastructure? | ||
|
||
<br> | ||
|
||
## Budget Constraint | ||
|
||
During the viability/feasibility assessment, compare the stated budget with the estimated cost. The estimate must cover every element that incurs a cost; both project delivery costs and lifecycle costs are critical. | ||
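
The comparison can be sketched as simple arithmetic. The block below is a minimal illustration, not a costing method: the `CostItem` type, the `budget_gap` function, the item names, and every amount are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostItem:
    """A single cost element; names and amounts used below are hypothetical."""
    name: str
    amount: float
    lifecycle: bool  # True for a lifecycle cost, False for a project delivery cost


def budget_gap(budget: float, items: list[CostItem]) -> float:
    """Return the budget minus the total estimated cost; negative means a shortfall."""
    return budget - sum(item.amount for item in items)


# Hypothetical estimate covering both delivery and lifecycle costs.
items = [
    CostItem("development", 60_000.0, lifecycle=False),
    CostItem("cloud hosting (3 years)", 27_000.0, lifecycle=True),
    CostItem("model monitoring (3 years)", 18_000.0, lifecycle=True),
]
gap = budget_gap(100_000.0, items)
print(f"budget gap: {gap:,.0f}")
```

A negative gap suggests that the stated budget does not cover the estimate, which is worth raising before any solution design begins.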
|
||
<br> | ||
<br> | ||
<br> | ||
<br> | ||
|
||
<br> | ||
<br> | ||
<br> | ||
<br> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
<br> | ||
|
||
# Data Collection | ||
|
||
<br> | ||
|
||
What was the data acquisition mechanism? For each data instance, was the data acquired via: | ||
|
||
<ul class="disc"> | ||
<li class="disc">A sensor.</li> | ||
<li class="disc">An application programming interface.</li> | ||
<li class="disc">Interviews.</li> | ||
<li class="disc">Questionnaires.</li> | ||
<li class="disc">A mix of mechanisms.</li> | ||
<li class="disc">etc.</li> | ||
</ul> | ||
|
||
<br> | ||
<br> | ||
|
||
<br> | ||
<br> | ||
|
||
<br> | ||
<br> | ||
|
||
<br> | ||
<br> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,208 @@ | ||
<br> | ||
|
||
# Composition | ||
|
||
In general, the questions herein should be studied before data collection. The datasheets paper [1] notes that most of the questions: | ||
|
||
<blockquote> | ||
"… are intended to provide dataset consumers with the information they need to make informed decisions about using the dataset for their chosen tasks. … the questions are designed to elicit information about compliance with the EU's General Data Protection Regulation (GDPR) or comparable regulations in other jurisdictions." | ||
</blockquote> | ||
|
||
<br> | ||
|
||
<details><summary><b>References</b></summary> | ||
<ol class="numeric"> | ||
<li class="numeric"><a href="https://arxiv.org/abs/1803.09010v8" target="_blank">Datasheets for Datasets</a>, arXiv:1803.09010v8, 2021, updated datasheet appendix</li> | ||
</ol> | ||
</details> | ||
|
||
<br> | ||
<br> | ||
|
||
## What does each data set instance represent? | ||
|
||
Describe what each instance, i.e., row, of the data set represents. Although unusual, a data set may be multi-representational; e.g., an instance in a merchant's database table might represent one of: (a) an online reading event, (b) an online product purchasing event, or \(c\) an online download event. In such cases, each representation must be described. | ||
|
||
<br> | ||
<br> | ||
|
||
## The Number of Instances | ||
|
||
How many instances does the data set have? | ||
|
||
<br> | ||
<br> | ||
|
||
## Pre-processed? | ||
|
||
Are any aspects of the data set pre-processed? If yes: | ||
|
||
<ul class="disc"> | ||
<li class="disc">Document the pre-processing steps.</li> | ||
<li class="disc">State whether the underlying raw data is available and, if it is, provide a link to the data.</li> | ||
<li class="disc">If available, provide a link to the pre-processing programs.</li> | ||
</ul> | ||
|
||
<br> | ||
<br> | ||
|
||
## Is the data set a sample of a larger data set? | ||
|
||
**If yes:** | ||
|
||
<ul class="disc"> | ||
<li class="disc"><b>If the data set is representative</b> of the larger data set: How was representativeness verified/validated?</li> | ||
<li class="disc"><b>If the data set is not representative</b> of the larger data set, e.g., is a geographically focused subset, explain why.</li> | ||
</ul> | ||
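
One possible way to check representativeness for a single categorical field is to compare category shares in the sample with those in the larger data set. The sketch below uses the total variation distance; the function names, the `region` values, and the data are hypothetical, and the choice of an acceptable distance is left to the datasheet author.

```python
from collections import Counter


def shares(values):
    """Map each category to its share of the observations."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(sample, population):
    """Half the sum of absolute share differences: 0 for identical shares, 1 for disjoint ones."""
    p, q = shares(sample), shares(population)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Hypothetical region labels of the larger data set and of two candidate samples.
population = ["north"] * 50 + ["south"] * 30 + ["east"] * 20
balanced_sample = ["north"] * 5 + ["south"] * 3 + ["east"] * 2
focused_sample = ["north"] * 10  # a geographically focused subset

print(total_variation_distance(balanced_sample, population))
print(total_variation_distance(focused_sample, population))
```

A distance near zero indicates that the sample mirrors the larger data set for that field only; it says nothing about other fields or about joint distributions.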
|
||
<br> | ||
<br> | ||
|
||
## Lineage | ||
|
||
Summarise the data set's lineage, including linkage options. [1, 2] | ||
|
||
<br> | ||
|
||
<details><summary><b>References</b></summary> | ||
<ol class="numeric"> | ||
<li class="numeric"><a href="https://www.qlik.com/us/data-management/data-lineage" target="_blank">QLIK: What is data lineage?</a></li> | ||
<li class="numeric"><a href="https://www.ibm.com/topics/data-lineage" target="_blank">IBM: What is data lineage?</a></li> | ||
</ol> | ||
</details> | ||
|
||
<br> | ||
<br> | ||
|
||
## Licences & Fees | ||
|
||
If applicable, summarise the data's licence terms and costs. | ||
|
||
<br> | ||
<br> | ||
|
||
## Profiles of Instances | ||
|
||
Herein, the focus is a summary of the instances of a data set, e.g., for a tabular data set: | ||
|
||
**By Field** | ||
|
||
<ul class="disc"> | ||
<li class="disc">The field name.</li> | ||
<li class="disc">Description: What does the element of an instance denote/represent?</li> | ||
<li class="disc">Data type.</li> | ||
<li class="disc">Dictionary of a categorical data type.</li> | ||
<li class="disc">Unit of measure.</li> | ||
<li class="disc">Is this a raw data field or a feature?</li> | ||
<li class="disc">Is this a target field?</li> | ||
<li class="disc">Does the field identify a sub-population?</li> | ||
<li class="disc">Column Profile: Note column profiling <i>"… provides statistical information regarding the distribution of data values and associated patterns that are assigned to each data attribute, …"</i>. [1] If a field/column has missing elements, explain why.</li> | ||
<li class="disc">A graph of the field's data distribution.</li> | ||
</ul> | ||
|
||
<br> | ||
|
||
**Across Fields** | ||
|
||
<ul class="disc"> | ||
<li class="disc">Cross-Column Profiles: Relationships between columns.</li> | ||
</ul> | ||
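
The by-field and across-fields summaries above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: rows are dictionaries, a missing element is encoded as `None`, and the function names and sample rows are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def profile_column(rows, field):
    """Summarise one field: missing count, distinct count, and basic statistics."""
    values = [row.get(field) for row in rows]
    present = [v for v in values if v is not None]
    profile = {
        "field": field,
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        # Numeric field: report range and central tendency.
        profile.update(min=min(present), max=max(present),
                       mean=mean(present), median=median(present))
    else:
        # Categorical field: the most frequent values help check the data dictionary.
        profile["top_values"] = Counter(present).most_common(3)
    return profile


def cross_tabulate(rows, field_a, field_b):
    """Count co-occurrences of two fields, a simple cross-column profile."""
    return Counter((row.get(field_a), row.get(field_b)) for row in rows)


# Hypothetical instances of a merchant's event table.
rows = [
    {"event": "purchase", "amount": 12.5, "region": "north"},
    {"event": "download", "amount": None, "region": "north"},
    {"event": "purchase", "amount": 7.5, "region": "south"},
]
print(profile_column(rows, "amount"))
print(profile_column(rows, "event"))
print(cross_tabulate(rows, "event", "region"))
```

The `missing` count only flags that elements are absent; the explanation of why they are absent still has to be written by hand.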
|
||
<br> | ||
<br> | ||
|
||
<details><summary><b>References</b></summary> | ||
<ol class="numeric"> | ||
<li class="numeric">5.5.2 Profiling for Data Quality Assessment, in <a href="https://www.sciencedirect.com/book/9780123742254/master-data-management" target="_blank">Master Data Management</a>, Page 96, The MK/OMG Press, 2008</li> | ||
<li class="numeric"><a href="https://www.talend.com/resources/what-is-data-profiling/" target="_blank">Data Profiling</a></li> | ||
</ol> | ||
</details> | ||
|
||
<br> | ||
<br> | ||
|
||
|
||
## Errors | ||
|
||
Please detail any errors, sources of noise, or redundancies. | ||
|
||
|
||
<br> | ||
<br> | ||
|
||
|
||
## Recommended Data Splits for Machine Learning | ||
|
||
Are there recommended data splits? | ||
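
If splits are recommended, documenting how instances are assigned makes them reproducible. The sketch below shows one possible approach, a deterministic hash-based assignment; the function name, the 80/10/10 proportions, and the instance identifiers are assumptions, not a recommendation from the datasheets paper.

```python
import hashlib


def assign_split(instance_id: str, train: float = 0.8, validation: float = 0.1) -> str:
    """Assign an instance to a split from a hash of its identifier.

    The same identifier always lands in the same split, independent of row order.
    """
    digest = hashlib.sha256(instance_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"


# Hypothetical instance identifiers.
for instance_id in ("order-0001", "order-0002", "order-0003"):
    print(instance_id, assign_split(instance_id))
```

The realised proportions only approximate the requested ones, and instances that belong together (e.g., records of the same individual) would need to be hashed on a shared key to avoid leakage between splits.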
|
||
<br> | ||
<br> | ||
|
||
## Confidentiality | ||
|
||
Does the data set contain data that might be considered confidential? For example, | ||
|
||
<ul class="disc"> | ||
<li class="disc">Is the data protected by legal privilege or by doctor–patient confidentiality?</li> | ||
<li class="disc">Does the data include the content of private/non-public communications of individuals?</li> | ||
</ul> | ||
|
||
<br> | ||
<br> | ||
|
||
## Identification of Individuals | ||
|
||
Is it possible to identify individuals directly or indirectly? | ||
|
||
<br> | ||
<br> | ||
|
||
## Data Sensitivity | ||
|
||
Does the data set include sensitive data elements? Describe. Examples of sensitive data elements are | ||
elements that directly/indirectly reveal: | ||
|
||
<ul class="disc"> | ||
<li class="disc">Locations.</li> | ||
<li class="disc">Financial details.</li> | ||
<li class="disc">Health details.</li> | ||
<li class="disc">Biometric profiles.</li> | ||
<li class="disc">Genetic profiles.</li> | ||
<li class="disc">Government identification codes of individuals.</li> | ||
<li class="disc">Criminal history.</li> | ||
<li class="disc">Institutionally and/or commercially sensitive data.</li> | ||
<li class="disc">Race or ethnic origin.</li> | ||
<li class="disc">Sexual orientations.</li> | ||
<li class="disc">Religious beliefs.</li> | ||
<li class="disc">Political opinions.</li> | ||
<li class="disc">Trade union memberships.</li> | ||
<li class="disc">And more.</li> | ||
</ul> | ||
|
||
<br> | ||
<br> | ||
|
||
## Distressing Data Elements | ||
|
||
Does "… the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?" [1, 2] | ||
|
||
<br> | ||
|
||
<details><summary><b>References</b></summary> | ||
<ol class="numeric"> | ||
<li class="numeric"><a href="https://dl.acm.org/doi/10.1145/3458723" target="_blank">Datasheets for Datasets</a>, Communications of the ACM, 2021, Volume 64, Issue 12, pages 86–92</li> | ||
<li class="numeric"><a href="https://arxiv.org/abs/1803.09010v8" target="_blank">Datasheets for Datasets</a>, arXiv:1803.09010v8, 2021, updated datasheet appendix</li> | ||
</ol> | ||
</details> | ||
|
||
<br> | ||
<br> | ||
|
||
<br> | ||
<br> | ||
|
||
<br> | ||
<br> | ||
|
||
<br> | ||
<br> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
<br> | ||
|
||
# Controls | ||
|
||
<br> | ||
|
||
Do "… any export controls or other regulatory restrictions apply to the dataset or to individual instances?" [1, 2] | ||
Please study **Section 3.6** of [2] for more details. | ||
|
||
<br> | ||
|
||
<details><summary><b>References</b></summary> | ||
<ol class="numeric"> | ||
<li class="numeric"><a href="https://dl.acm.org/doi/10.1145/3458723" target="_blank">Datasheets for Datasets</a>, Communications of the ACM, 2021, Volume 64, Issue 12, pages 86–92</li> | ||
<li class="numeric"><a href="https://arxiv.org/abs/1803.09010v8" target="_blank">Datasheets for Datasets</a>, arXiv:1803.09010v8, 2021, updated datasheet appendix</li> | ||
</ol> | ||
</details> | ||
|
||
<br> | ||
<br> | ||
|
||
<br> | ||
<br> | ||
|
||
<br> | ||
<br> | ||
|
||
<br> | ||
<br> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
<br> | ||
|
||
# Data & Datasheets | ||
|
||
<br> | ||
|
||
The datasheet of a dataset outlines the data set's provenance, lineage, and a bit more. Each dataset must have its own datasheet. Gebru & colleagues <a href="https://arxiv.org/pdf/1803.09010v1.pdf" target="_blank">first proposed Datasheets for Datasets</a>, for machine learning products or projects, during 2018/2019. Following feedback from a variety of institutions, industries, agencies, etc., a baseline Datasheets for Datasets outline was released.[^acm]<sup>, </sup>[^arXiv] Each datasheet consists of a set of questions, and the datasheet's primary objective vis-à-vis the data set creator is to: | ||
|
||
> … encourage data set creators to reflect carefully upon (a) the "process of creating, distributing, and maintaining a dataset", and (b) "any underlying assumptions, potential risks or harms, and implications of use" | ||
|
||
The primary objective vis-à-vis the data set user is to: | ||
|
||
> … ensure that the data set user has the information required to "… make informed decisions about using a dataset." | ||
|
||
This chapter consists of a set of sections. The sections, except the Natural Language Processing section, reflect the groupings of the questions in the latest datasheet version.[^acm]<sup>, </sup>[^arXiv]<sup>, </sup>[^applicability] The questions of the natural language processing (NLP) section apply to NLP projects only; they were developed by Bender & Friedman.[^bender] | ||
|
||
<br> | ||
<br> | ||
|
||
```{toctree} | ||
:caption: Content | ||
:glob: | ||
|
||
motivation | ||
composition | ||
collection | ||
controls | ||
maintenance | ||
nlp | ||
``` | ||
|
||
<br> | ||
<br> | ||
|
||
<br> | ||
<br> | ||
|
||
<br> | ||
<br> | ||
|
||
<br> | ||
<br> | ||
|
||
[^acm]: [Datasheets for Datasets](https://dl.acm.org/doi/10.1145/3458723), Communications of the ACM, 2021, Volume 64, Issue 12, pages 86–92 | ||
[^arXiv]: [Datasheets for Datasets](https://arxiv.org/abs/1803.09010v8), arXiv:1803.09010v8, 2021, updated datasheet appendix | ||
[^applicability]: If a question is inapplicable, note down its inapplicability. | ||
[^bender]: [Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science](https://doi.org/10.1162/tacl_a_00041), Transactions of the Association for Computational Linguistics, 2018, 6: 587–604 | ||
|
||
<br> |