Commit

Merge branch 'develop'
greyhypotheses committed Apr 22, 2024
2 parents e673a0e + b19c04c commit f422c51
Showing 145 changed files with 15,255 additions and 2 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -8,7 +8,6 @@ __pycache__/

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
4 changes: 4 additions & 0 deletions docs/build/html/.buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 3dc720c00edc04fa63c316b36e5c38a2
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file not shown.
Binary file added docs/build/html/.doctrees/data/collection.doctree
Binary file not shown.
Binary file added docs/build/html/.doctrees/data/controls.doctree
Binary file not shown.
Binary file added docs/build/html/.doctrees/data/data.doctree
Binary file not shown.
Binary file added docs/build/html/.doctrees/data/motivation.doctree
Binary file not shown.
Binary file added docs/build/html/.doctrees/data/nlp.doctree
Binary file not shown.
Binary file added docs/build/html/.doctrees/environment.pickle
Binary file not shown.
Binary file added docs/build/html/.doctrees/index.doctree
Binary file not shown.
Binary file added docs/build/html/.doctrees/model/aim.doctree
Binary file not shown.
Binary file added docs/build/html/.doctrees/model/business.doctree
Binary file not shown.
Binary file added docs/build/html/.doctrees/model/model.doctree
Binary file not shown.
Binary file added docs/build/html/.doctrees/requirements/mr.doctree
Binary file not shown.
Binary file added docs/build/html/.doctrees/requirements/ps.doctree
Binary file not shown.
Binary file added docs/build/html/.doctrees/requirements/ua.doctree
Binary file not shown.
Binary file added docs/build/html/.doctrees/risks/risks.doctree
Binary file not shown.
Empty file added docs/build/html/.nojekyll
Empty file.
4 changes: 4 additions & 0 deletions docs/build/html/_images/logo.svg
Binary file added docs/build/html/_images/ml-lifecycle.png
39 changes: 39 additions & 0 deletions docs/build/html/_sources/constraints/constraints.md.txt
@@ -0,0 +1,39 @@
<br>

# Project Constraints

A non-exhaustive set of constraints.

<br>

## Solution Constraints

These are "… constraints on the way that the problem must be solved." (4a., [Volere Template](https://homepages.laas.fr/kader/Robertson.pdf))

<br>

## Time Constraints

What is the client team's project time limit? Beware, time and budget are critical to determining whether a feasible solution exists.

<br>

## Implementation Environment Constraints

Is the infrastructure restricted to a particular cloud platform, or to on-premises systems?

<br>

## Budget Constraint

During the viability/feasibility assessment, compare the stated budget with the estimated cost. Remember, the estimated cost must cover every cost-bearing element; both project delivery costs and lifecycle costs are critical.
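
The comparison can be sketched as simple arithmetic. The snippet below is an illustrative sketch only: the cost items, the figures, and the function names `estimated_cost` and `within_budget` are assumptions made for this example, and a real assessment would use the project's own cost breakdown.

```python
# Hypothetical cost items and figures, for illustration only.
delivery_costs = {"engineering": 60_000, "data_acquisition": 15_000}
annual_lifecycle_costs = {"hosting": 12_000, "monitoring_and_retraining": 8_000}


def estimated_cost(delivery, annual_lifecycle, years):
    """One-off delivery costs plus recurring lifecycle costs over `years`."""
    return sum(delivery.values()) + years * sum(annual_lifecycle.values())


def within_budget(budget, delivery, annual_lifecycle, years):
    """True if the estimated cost does not exceed the stated budget."""
    return estimated_cost(delivery, annual_lifecycle, years) <= budget


# 75,000 delivery + 3 years x 20,000 lifecycle
total = estimated_cost(delivery_costs, annual_lifecycle_costs, years=3)
```

Keeping delivery and lifecycle costs as separate inputs makes it harder to overlook the recurring costs when the stated budget only mentions delivery.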

<br>
<br>
<br>
<br>

<br>
<br>
<br>
<br>
28 changes: 28 additions & 0 deletions docs/build/html/_sources/data/collection.md.txt
@@ -0,0 +1,28 @@
<br>

# Data Collection

<br>

What was the data acquisition mechanism? Per data instance, was the data acquired via:

<ul class="disc">
<li class="disc">A sensor.</li>
<li class="disc">An application programming interface.</li>
<li class="disc">Interviews.</li>
<li class="disc">Questionnaires.</li>
<li class="disc">A mix of mechanisms.</li>
<li class="disc">etc.</li>
</ul>

<br>
<br>

<br>
<br>

<br>
<br>

<br>
<br>
208 changes: 208 additions & 0 deletions docs/build/html/_sources/data/composition.md.txt
@@ -0,0 +1,208 @@
<br>

# Composition

In general, the questions herein should be studied before data collection. The datasheets paper [1] notes that most of the questions:

<blockquote>
"… are intended to provide dataset consumers with the information they need to make informed decisions about using the dataset for their chosen tasks. … the questions are designed to elicit information about compliance with the EU's General Data Protection Regulation (GDPR) or comparable regulations in other jurisdictions."
</blockquote>

<br>

<details><summary><b>References</b></summary>
<ol class="numeric">
<li class="numeric"><a href="https://arxiv.org/abs/1803.09010v8" target="_blank">Datasheets for Datasets</a>, arXiv:1803.09010v8, 2021, updated datasheet appendix</li>
</ol>
</details>

<br>
<br>

## What does each data set instance represent?

Describe what each instance, i.e., row, of the data set represents. Although unusual, a data set may be multi-representational, e.g., an instance in a merchant's database table might represent one of: (a) an online reading event, (b) an online product purchasing event, or \(c\) an online download event. In such a case, each representation must be described.

<br>
<br>

## The Number of Instances

How many instances does the data set have?

<br>
<br>

## Pre-processed?

Are any aspects of the data set pre-processed? If yes:

<ul class="disc">
<li class="disc">Document the pre-processing steps.</li>
<li class="disc">State whether the underlying raw data is available and, if it is, provide a link to the data.</li>
<li class="disc">If available, provide a link to the pre-processing programs.</li>
</ul>

<br>
<br>

## Is the data set a sample of a larger data set?

**If yes:**

<ul class="disc">
<li class="disc"><b>If the data set is representative</b> of the larger data set: How was representativeness verified/validated?</li>
<li class="disc"><b>If the data set is not representative</b> of the larger data set, e.g., is a geographically focused subset, explain why.</li>
</ul>
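
One simple way to examine representativeness, offered as a sketch rather than a prescribed method, is to compare the proportions of a categorical field in the sample with those in the larger data set. The function names and the `regions` data below are hypothetical, and an acceptable gap is a project decision; a formal statistical test may be more appropriate.

```python
from collections import Counter


def proportions(values):
    """Share of each category among `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute proportions of an empty collection")
    return {category: n / total for category, n in counts.items()}


def max_proportion_gap(sample, population):
    """Largest absolute difference in category share between the two sets."""
    p_sample = proportions(sample)
    p_population = proportions(population)
    categories = set(p_sample) | set(p_population)
    return max(
        abs(p_sample.get(c, 0.0) - p_population.get(c, 0.0)) for c in categories
    )


# Hypothetical data: a field that records the region of each instance.
population_regions = ["north"] * 50 + ["south"] * 30 + ["east"] * 20
sample_regions = ["north"] * 5 + ["south"] * 3 + ["east"] * 2
gap = max_proportion_gap(sample_regions, population_regions)
```

A geographically focused subset, e.g., a sample drawn from one region only, would show a large gap for that field.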

<br>
<br>

## Lineage

Summarise the data set's lineage, including linkage options. [1, 2]

<br>

<details><summary><b>References</b></summary>
<ol class="numeric">
<li class="numeric"><a href="https://www.qlik.com/us/data-management/data-lineage" target="_blank">QLIK: What is data lineage?</a></li>
<li class="numeric"><a href="https://www.ibm.com/topics/data-lineage" target="_blank">IBM: What is data lineage?</a></li>
</ol>
</details>

<br>
<br>

## Licences & Fees

If applicable, summarise the data's licensing terms and costs.

<br>
<br>

## Profiles of Instances

Herein, the focus is a summary of the instances of a data set, e.g., for a tabular data set:

**By Field**

<ul class="disc">
<li class="disc">The field name.</li>
<li class="disc">Description: What does the element of an instance denote/represent?</li>
<li class="disc">Data type.</li>
<li class="disc">Dictionary of a categorical data type.</li>
<li class="disc">Unit of measure.</li>
<li class="disc">Is this a raw data field or a feature?</li>
<li class="disc">Is this a target field?</li>
<li class="disc">Does the field identify a sub-population?</li>
<li class="disc">Column Profile: Note column profiling <i>"… provides statistical information regarding the distribution of data values and associated patterns that are assigned to each data attribute, …"</i>. [1] &nbsp; &nbsp; If a field/column has missing elements, explain why.</li>
<li class="disc">A graph of the field's data distribution.</li>
</ul>
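
A column profile can be sketched with the Python standard library alone. The function `profile_column`, the convention that `None` marks a missing element, and the `ages` data are assumptions made for this illustration; a data profiling tool would normally report much more.

```python
from collections import Counter
from statistics import mean


def profile_column(name, values):
    """Summarise one field; None marks a missing element."""
    present = [v for v in values if v is not None]
    profile = {
        "field": name,
        "instances": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(1)[0] if present else None,
    }
    # Numeric summaries apply only when every present value is a number.
    numeric = present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if numeric:
        profile.update(minimum=min(present), maximum=max(present), mean=mean(present))
    return profile


# Hypothetical field with one missing element.
ages = [34, 29, None, 41, 29]
age_profile = profile_column("age", ages)
```

The count of missing elements indicates which fields need an explanation of why elements are missing.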

<br>

**Across Fields**

<ul class="disc">
<li class="disc">Cross-Column Profiles: Relationships between columns.</li>
</ul>

<br>
<br>

<details><summary><b>References</b></summary>
<ol class="numeric">
<li class="numeric">5.5.2 Profiling for Data Quality Assessment, in <a href="https://www.sciencedirect.com/book/9780123742254/master-data-management" target="_blank">Master Data Management</a>, Page 96, The MK/OMG Press, 2008</li>
<li class="numeric"><a href="https://www.talend.com/resources/what-is-data-profiling/" target="_blank">Data Profiling</a></li>
</ol>
</details>

<br>
<br>


## Errors

Please detail any errors, sources of noise, or redundancies.
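
As an illustration of detecting one kind of redundancy, the sketch below counts exact duplicate instances. It assumes that each instance is a dictionary with string field names and hashable values; the function name and the example rows are hypothetical. Near-duplicates and noise require other checks.

```python
from collections import Counter


def duplicate_instances(rows):
    """Return each instance that occurs more than once, with its count."""
    # Sorting the items by field name makes the key independent of field order.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical instances; the first and third are identical.
rows = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b"},
    {"value": "a", "id": 1},
]
duplicates = duplicate_instances(rows)
```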


<br>
<br>


## Recommended Data Splits for Machine Learning

Are there recommended data splits?
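
If splits are recommended, they should be reproducible. The following is a minimal sketch of one option, assuming each instance has a stable identifier: the identifier is hashed and the hash decides the split, so the assignment does not depend on row order or a random seed. The function name `assign_split` and the default fractions are assumptions for illustration, not a recommendation of this project.

```python
import hashlib


def assign_split(instance_id, train=0.8, validation=0.1):
    """Deterministically map an instance identifier to a split."""
    digest = hashlib.sha256(str(instance_id).encode("utf-8")).hexdigest()
    # The first 32 bits of the hash, scaled to the interval [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"


# The same identifier always lands in the same split.
split = assign_split("instance-0001")
```

Stratified or time-based splits may be more appropriate for some data sets; whichever rule is recommended, document it alongside the data.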

<br>
<br>

## Confidentiality

Does the data set contain data that might be considered confidential? For example,

<ul class="disc">
<li class="disc">Is the data protected by legal privilege or by doctor–patient confidentiality?</li>
<li class="disc">Does the data include the content of private/non-public communications of individuals?</li>
</ul>

<br>
<br>

## Identification of Individuals

Is it possible to identify individuals directly or indirectly?

<br>
<br>

## Data Sensitivity

Does the data set include sensitive data elements? Describe. Examples of sensitive data elements are
elements that directly/indirectly reveal:

<ul class="disc">
<li class="disc">Locations.</li>
<li class="disc">Financial details.</li>
<li class="disc">Health details.</li>
<li class="disc">Biometric profiles.</li>
<li class="disc">Genetic profiles.</li>
<li class="disc">Government identification codes of individuals.</li>
<li class="disc">Criminal history.</li>
<li class="disc">Institutionally and/or commercially sensitive data.</li>
<li class="disc">Race or ethnic origin.</li>
<li class="disc">Sexual orientations.</li>
<li class="disc">Religious beliefs.</li>
<li class="disc">Political opinions.</li>
<li class="disc">Trade union memberships.</li>
<li class="disc">And more.</li>
</ul>

<br>
<br>

## Distressing Data Elements

Does "… the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?" [1, 2]

<br>

<details><summary><b>References</b></summary>
<ol class="numeric">
<li class="numeric"><a href="https://dl.acm.org/doi/10.1145/3458723" target="_blank">Datasheets for Datasets</a>, Communications of the ACM, 2021, Volume 64, Issue 12, pages 86 – 92</li>
<li class="numeric"><a href="https://arxiv.org/abs/1803.09010v8" target="_blank">Datasheets for Datasets</a>, arXiv:1803.09010v8, 2021, updated datasheet appendix</li>
</ol>
</details>

<br>
<br>

<br>
<br>

<br>
<br>

<br>
<br>
29 changes: 29 additions & 0 deletions docs/build/html/_sources/data/controls.md.txt
@@ -0,0 +1,29 @@
<br>

# Controls

<br>

Do "… any export controls or other regulatory restrictions apply to the dataset or to individual instances?" [1, 2] &nbsp;
Please study **Section 3.6** of [2] for more details.

<br>

<details><summary><b>References</b></summary>
<ol class="numeric">
<li class="numeric"><a href="https://dl.acm.org/doi/10.1145/3458723" target="_blank">Datasheets for Datasets</a>, Communications of the ACM, 2021, Volume 64, Issue 12, pages 86 – 92</li>
<li class="numeric"><a href="https://arxiv.org/abs/1803.09010v8" target="_blank">Datasheets for Datasets</a>, arXiv:1803.09010v8, 2021, updated datasheet appendix</li>
</ol>
</details>

<br>
<br>

<br>
<br>

<br>
<br>

<br>
<br>
49 changes: 49 additions & 0 deletions docs/build/html/_sources/data/data.md.txt
@@ -0,0 +1,49 @@
<br>

# Data & Datasheets

<br>

The datasheet of a data set outlines the data set's provenance, lineage, and more. Each data set must have its own datasheet. Gebru & colleagues <a href="https://arxiv.org/pdf/1803.09010v1.pdf" target="_blank">first proposed Datasheets for Datasets</a>, for machine learning products or projects, during the years 2018/2019. Following feedback from a variety of institutions, industries, and agencies, a baseline Datasheets for Datasets outline was released.[^acm]<sup>, </sup>[^arXiv] Each datasheet consists of a set of questions, and the datasheet's primary objective vis-à-vis the data set creator is to:

> … encourage data set creators to reflect carefully upon (a) the "process of creating, distributing, and maintaining a dataset", and (b) "any underlying assumptions, potential risks or harms, and implications of use".

The primary objective vis-à-vis the data set user is to:

> … ensure that the data set user has the information required to "… make informed decisions about using a dataset."

This chapter consists of a set of sections. The sections, except the Natural Language Processing section, reflect the groupings of the questions in the latest datasheet version. [^acm]<sup>, </sup>[^arXiv]<sup>, </sup>[^applicability] The questions of the natural language processing (NLP) section apply to NLP projects only; they were developed by Bender & Friedman. [^bender]

<br>
<br>

```{toctree}
:caption: Content
:glob:

motivation
composition
collection
controls
maintenance
nlp
```

<br>
<br>

<br>
<br>

<br>
<br>

<br>
<br>

[^acm]: [Datasheets for Datasets](https://dl.acm.org/doi/10.1145/3458723), Communications of the ACM, 2021, Volume 64, Issue 12, pages 86 – 92
[^arXiv]: [Datasheets for Datasets](https://arxiv.org/abs/1803.09010v8), arXiv:1803.09010v8, 2021, updated datasheet appendix
[^applicability]: If a question is inapplicable, note down its inapplicability.
[^bender]: [Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science](https://doi.org/10.1162/tacl_a_00041), Transactions of the Association for Computational Linguistics, 2018, 6: 587–604

<br>