diff --git a/docs/index.md b/docs/index.md
index c2ada678..4357b05b 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -2,4 +2,5 @@

 Welcome to the Datachecks Documentation!

-Let's jump to the **[Getting Started!](getting_started.md)**
\ No newline at end of file
+Let's jump to the **[Getting Started!](getting_started.md)**
+
diff --git a/docs/metrics/combined.md b/docs/metrics/combined.md
index 243112f3..67a7bf74 100644
--- a/docs/metrics/combined.md
+++ b/docs/metrics/combined.md
@@ -1,3 +1,34 @@
 # **Combined Metrics**

-## Updating Soon ....
\ No newline at end of file
+Combined metrics in data quality serve as a cornerstone for ensuring the accuracy and efficiency of your data operations. These metrics provide a holistic view of your data ecosystem, amalgamating various aspects to paint a comprehensive picture.
+
+By consistently tracking these combined metrics, you gain invaluable insights into the overall performance of your data infrastructure. This data-driven approach enables you to make informed decisions on optimization, resource allocation, and system enhancements. Moreover, these metrics act as sentinels, promptly detecting anomalies or bottlenecks within your data pipelines. This proactive stance allows you to mitigate potential issues before they escalate, safeguarding the integrity of your data.
+
+Each operation in a combined metric accepts at most two arguments; passing more than two to a single operation raises an error.
+
+
+## **Available Functions**
+
+- `div()`
+- `sum()`
+- `mul()`
+- `sub()`
+- `percentage()`
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+metrics:
+- name: combined_metric_example
+  metric_type: combined
+  expression: sum(count_us_parts, count_us_parts_valid)
+```
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+metrics:
+- name: combined_metric_example
+  metric_type: combined
+  expression: div(sum(count_us_parts, count_us_parts_valid), count_us_parts_not_valid)
+```
\ No newline at end of file
diff --git a/docs/metrics/completeness.md b/docs/metrics/completeness.md
index 1a9a024a..0cddbe65 100644
--- a/docs/metrics/completeness.md
+++ b/docs/metrics/completeness.md
@@ -1,3 +1,69 @@
-# **Completeness Metric**
+# **Completeness Metrics**

-## Updating Soon ....
\ No newline at end of file
+Completeness metrics play a crucial role in data quality assessment, ensuring your datasets are comprehensive and reliable. By regularly monitoring these metrics, you can gain profound insights into the extent to which your data captures the entirety of the intended information. This empowers you to make informed decisions about data integrity and take corrective actions when necessary.
+
+These metrics unveil potential gaps or missing values in your data, enabling proactive data enhancement. Like a well-oiled machine, tracking completeness metrics enhances the overall functionality of your data ecosystem. Just as reliability metrics guarantee up-to-date information, completeness metrics guarantee a holistic, accurate dataset.
+
+
+## **Null Count**
+
+Null count metrics count the missing (null) values in a field, revealing gaps and potential data quality issues.
+
+
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+metrics:
+  - name: null_count_in_dataset
+    metric_type: null_count
+    resource: product_db.products
+    field_name: first_name
+
+```
+
+
+## **Null Percentage**
+
+Null percentage metrics report the proportion of missing (null) values in a field, helping ensure datasets are whole and reliable.
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+metrics:
+  - name: null_percentage_in_dataset
+    metric_type: null_percentage
+    resource: product_db.products
+    field_name: first_name
+
+```
+
+## **Empty String**
+
+Empty string metrics count the values in a field that are empty strings, exposing gaps that impact data completeness and reliability.
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+metrics:
+  - name: empty_string_in_dataset
+    metric_type: empty_string
+    resource: product_db.products
+    field_name: first_name
+
+```
+
+## **Empty String Percentage**
+
+Empty string percentage metrics assess data completeness by measuring the proportion of empty strings in a field.
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+metrics:
+  - name: empty_string_percentage_in_dataset
+    metric_type: empty_string_percentage
+    resource: product_db.products
+    field_name: first_name
+
+```
\ No newline at end of file
diff --git a/docs/metrics/numeric_distribution.md b/docs/metrics/numeric_distribution.md
index 298d1fa5..30d2421b 100644
--- a/docs/metrics/numeric_distribution.md
+++ b/docs/metrics/numeric_distribution.md
@@ -1,3 +1,127 @@
-# **Numeric Distribution Metric**
+# **Numeric Distribution Metrics**

-## Updating Soon ....
\ No newline at end of file
+Numeric distribution metrics serve as vital tools for ensuring the ongoing integrity of your data. These metrics offer valuable insights into the distribution of values within your datasets, aiding in data quality assurance.
+
+By consistently monitoring these metrics, you gain a deeper understanding of how your data behaves. This knowledge empowers you to make informed decisions regarding data cleansing, anomaly detection, and overall data quality improvement.
+
+Furthermore, numeric distribution metrics are your early warning system. They help pinpoint outliers and anomalies, allowing you to address potential data issues before they escalate into significant problems in your data pipelines.
+
+
+## **Average**
+
+Average metrics report the mean value of a numeric field across transactional databases and search engines, giving a quick view of its typical value.
+
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+metrics:
+  - name: avg_price
+    metric_type: avg
+    resource: product_db.products
+    field_name: price
+    filters:
+      where: "country_code = 'IN'"
+```
+
+
+## **Minimum**
+
+Minimum metrics report the lowest value of a numeric field across transactional databases and search engines, helping identify outliers at the lower limit of the data distribution.
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+metrics:
+  - name: min_price
+    metric_type: min
+    resource: product_db.products
+    field_name: price
+```
+
+## **Maximum**
+
+Maximum metrics report the highest value of a numeric field, helping identify outliers and understand the upper limit of the data distribution for quality assessment.
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+metrics:
+  - name: max_price
+    metric_type: max
+    resource: product_db.products
+    field_name: price
+```
+
+```yaml title="dcs_config.yaml"
+- name: max_price_of_products_with_high_rating
+  metric_type: max
+  resource: product_db.products
+  field_name: price
+  filters:
+    where: "rating > 4"
+```
+
+## **Variance**
+
+The variance metric measures the degree of variability or dispersion in a dataset, indicating how spread out the data points are from the mean.
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+metrics:
+  - name: variance_of_price
+    metric_type: variance
+    resource: product_db.products
+    field_name: price
+```
+
+## **Skew**
+
+The skew metric measures the asymmetry of the distribution of data values around the mean. It helps assess how balanced the data distribution is.
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+
+```
+
+## **Kurtosis**
+
+Kurtosis measures how heavy the tails of a dataset's distribution are relative to a normal distribution, often described as the peakedness or flatness of the distribution.
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+
+```
+
+## **Sum**
+
+The sum metric computes the total of a numeric field across records; tracking it helps check the accuracy and consistency of numerical data.
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+
+```
+
+## **Geometric Mean**
+
+The geometric mean metric is a statistical measure that calculates the nth root of the product of n data values, often used to assess the central tendency of a dataset.
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+
+```
+
+## **Harmonic Mean**
+
+The harmonic mean metric is a statistical measure of central tendency, calculated as the reciprocal of the average of the reciprocals of the data values.
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+
+```
\ No newline at end of file
diff --git a/docs/metrics/uniqueness.md b/docs/metrics/uniqueness.md
index 3e249143..873e9195 100644
--- a/docs/metrics/uniqueness.md
+++ b/docs/metrics/uniqueness.md
@@ -1,3 +1,32 @@
 # **Uniqueness Metrics**

-## Updating Soon ....
\ No newline at end of file
+Uniqueness metrics play a pivotal role in upholding data quality standards. Just as reliability metrics ensure timely data updates, uniqueness metrics focus on the distinctiveness of data entries within a dataset.
+
+By consistently tracking these metrics, you gain valuable insights into data duplication, redundancy, and accuracy. This knowledge empowers data professionals to make well-informed decisions about data cleansing and optimization strategies. Uniqueness metrics also serve as a radar for potential data quality issues, enabling proactive intervention to prevent major problems down the line.
+
+
+## **Distinct Count**
+
+A distinct count metric measures the number of unique values in a field of a dataset, which helps verify that the data is accurate and complete.
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+metrics:
+  - name: distinct_count_of_product_categories
+    metric_type: distinct_count
+    resource: product_db.products
+    field_name: product_category
+```
+
+
+## **Duplicate Count**
+
+Duplicate count is a data quality metric that measures the number of identical or highly similar records in a dataset, highlighting potential data redundancy or errors.
+
+**Example**
+
+```yaml title="dcs_config.yaml"
+
+```
+
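The two uniqueness metrics documented in `docs/metrics/uniqueness.md` can be illustrated with a minimal Python sketch. It is not Datachecks code: the function names and sample data are hypothetical, nulls are assumed to be ignored, and duplicate count is assumed to mean exact repeats only (non-null values minus distinct values), which may differ from how the tool handles "highly similar" records.

```python
from collections import Counter


def distinct_count(values):
    """Number of unique non-null values (nulls are assumed to be ignored)."""
    return len({v for v in values if v is not None})


def duplicate_count(values):
    """Number of non-null values that repeat an earlier value.

    Assumed definition for illustration: exact matches only.
    """
    counts = Counter(v for v in values if v is not None)
    return sum(n - 1 for n in counts.values())


# Hypothetical contents of a product_category field
categories = ["toys", "books", "toys", None, "games", "books", "toys"]

print(distinct_count(categories))   # 3 distinct categories
print(duplicate_count(categories))  # 3 repeated entries (2 extra "toys", 1 extra "books")
```

Under these assumptions the two numbers always add up to the count of non-null values, which gives a quick consistency check when both metrics are tracked on the same field.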