From 748722d1d6da5444762670ad44ae1747c0b1ffac Mon Sep 17 00:00:00 2001 From: pierrecamilleri Date: Fri, 22 Nov 2024 08:04:53 +0000 Subject: [PATCH] Apply automatic changes --- .gitignore | 1 + 404.html | 3439 +++++++++++ .../2022/08-22-frictionless-framework-v5.html | 4033 +++++++++++++ blog/2022/09-07-github-integration.html | 3529 +++++++++++ blog/2022/11-07-zenodo-integration.html | 3534 +++++++++++ blog/index.html | 3518 +++++++++++ data/table-output.parq | Bin 884 -> 884 bytes docs/advanced/design.html | 3467 +++++++++++ docs/advanced/extending.html | 3631 ++++++++++++ docs/advanced/system.html | 4141 +++++++++++++ docs/basic-examples.html | 4141 +++++++++++++ docs/checks/baseline.html | 3570 ++++++++++++ docs/checks/cell.html | 4024 +++++++++++++ docs/checks/row.html | 3632 ++++++++++++ docs/checks/table.html | 3691 ++++++++++++ docs/codebase/authors.html | 3507 +++++++++++ docs/codebase/changelog.html | 4183 +++++++++++++ docs/codebase/contributing.html | 3617 ++++++++++++ docs/codebase/license.html | 3479 +++++++++++ docs/codebase/migration.html | 3506 +++++++++++ docs/console/convert.html | 3518 +++++++++++ docs/console/describe.html | 3549 +++++++++++ docs/console/explore.html | 4713 +++++++++++++++ docs/console/extract.html | 3528 +++++++++++ docs/console/index.html | 3589 ++++++++++++ docs/console/list.html | 3499 +++++++++++ docs/console/overview.html | 3550 +++++++++++ docs/console/publish.html | 3474 +++++++++++ docs/console/query.html | 3489 +++++++++++ docs/console/script.html | 3489 +++++++++++ docs/console/validate.html | 3512 +++++++++++ docs/errors/cell.html | 3856 ++++++++++++ docs/errors/data.html | 3535 +++++++++++ docs/errors/file.html | 3597 ++++++++++++ docs/errors/header.html | 3589 ++++++++++++ docs/errors/label.html | 3722 ++++++++++++ docs/errors/metadata.html | 3960 +++++++++++++ docs/errors/resource.html | 3660 ++++++++++++ docs/errors/row.html | 3819 ++++++++++++ docs/errors/table.html | 3713 ++++++++++++ docs/fields/any.html | 3553 +++++++++++ docs/fields/array.html | 3565 +++++++++++ docs/fields/boolean.html | 3578 ++++++++++++ docs/fields/date.html | 3553 +++++++++++ docs/fields/datetime.html | 3553 +++++++++++ docs/fields/duration.html | 3554 +++++++++++ docs/fields/geojson.html | 3552 +++++++++++ docs/fields/geopoint.html | 3552 +++++++++++ docs/fields/integer.html | 3565 +++++++++++ docs/fields/number.html | 3600 ++++++++++++ docs/fields/object.html | 3553 +++++++++++ docs/fields/string.html | 3561 +++++++++++ docs/fields/time.html | 3553 +++++++++++ docs/fields/year.html | 3553 +++++++++++ docs/fields/yearmonth.html | 3553 +++++++++++ docs/formats/csv.html | 3680 ++++++++++++ docs/formats/erd.html | 3472 +++++++++++ docs/formats/excel.html | 3666 ++++++++++++ docs/formats/gsheets.html | 3601 ++++++++++++ docs/formats/html.html | 3619 ++++++++++++ docs/formats/inline.html | 3619 ++++++++++++ docs/formats/json.html | 3646 ++++++++++++ docs/formats/jsonschema.html | 3471 +++++++++++ docs/formats/markdown.html | 3472 +++++++++++ docs/formats/ods.html | 3600 ++++++++++++ docs/formats/pandas.html | 3515 +++++++++++ docs/formats/parquet.html | 3620 ++++++++++++ docs/formats/spss.html | 3518 +++++++++++ docs/formats/sql.html | 3718 ++++++++++++ docs/formats/yaml.html | 3630 ++++++++++++ docs/formats/zip.html | 3472 +++++++++++ docs/framework/actions.html | 3795 ++++++++++++ docs/framework/catalog.html | 3888 ++++++++++++ docs/framework/checklist.html | 3814 ++++++++++++ docs/framework/detector.html | 4106 +++++++++++++ docs/framework/dialect.html | 3990 
+++++++++++++ docs/framework/error.html | 3583 ++++++++++++ docs/framework/inquiry.html | 3850 ++++++++++++ docs/framework/package.html | 4234 ++++++++++++++ docs/framework/pipeline.html | 3783 ++++++++++++ docs/framework/report.html | 4027 +++++++++++++ docs/framework/resource.html | 4701 +++++++++++++++ docs/framework/schema.html | 4256 ++++++++++++++ docs/framework/table.html | 3749 ++++++++++++ docs/getting-started.html | 3776 ++++++++++++ docs/guides/describing-data.html | 4792 +++++++++++++++ docs/guides/extracting-data.html | 4120 +++++++++++++ docs/guides/transforming-data.html | 3751 ++++++++++++ docs/guides/validating-data.html | 4798 +++++++++++++++ docs/portals/ckan.html | 3916 +++++++++++++ docs/portals/github.html | 3972 +++++++++++++ docs/portals/zenodo.html | 4236 ++++++++++++++ docs/resources/file.html | 3492 +++++++++++ docs/resources/json.html | 3512 +++++++++++ docs/resources/table.html | 3516 +++++++++++ docs/resources/text.html | 3514 +++++++++++ docs/schemes/aws.html | 3594 ++++++++++++ docs/schemes/buffer.html | 3514 +++++++++++ docs/schemes/local.html | 3520 +++++++++++ docs/schemes/multipart.html | 3584 ++++++++++++ docs/schemes/remote.html | 3608 ++++++++++++ docs/schemes/stream.html | 3524 +++++++++++ docs/steps/cell.html | 4217 +++++++++++++ docs/steps/field.html | 4622 +++++++++++++++ docs/steps/resource.html | 3891 ++++++++++++ docs/steps/row.html | 4326 ++++++++++++++ docs/steps/table.html | 5191 +++++++++++++++++ docs/universe.html | 3472 +++++++++++ index.html | 3480 +++++++++++ 109 files changed, 400435 insertions(+) create mode 100644 404.html create mode 100644 blog/2022/08-22-frictionless-framework-v5.html create mode 100644 blog/2022/09-07-github-integration.html create mode 100644 blog/2022/11-07-zenodo-integration.html create mode 100644 blog/index.html create mode 100644 docs/advanced/design.html create mode 100644 docs/advanced/extending.html create mode 100644 docs/advanced/system.html create mode 100644 docs/basic-examples.html create mode 100644 docs/checks/baseline.html create mode 100644 docs/checks/cell.html create mode 100644 docs/checks/row.html create mode 100644 docs/checks/table.html create mode 100644 docs/codebase/authors.html create mode 100644 docs/codebase/changelog.html create mode 100644 docs/codebase/contributing.html create mode 100644 docs/codebase/license.html create mode 100644 docs/codebase/migration.html create mode 100644 docs/console/convert.html create mode 100644 docs/console/describe.html create mode 100644 docs/console/explore.html create mode 100644 docs/console/extract.html create mode 100644 docs/console/index.html create mode 100644 docs/console/list.html create mode 100644 docs/console/overview.html create mode 100644 docs/console/publish.html create mode 100644 docs/console/query.html create mode 100644 docs/console/script.html create mode 100644 docs/console/validate.html create mode 100644 docs/errors/cell.html create mode 100644 docs/errors/data.html create mode 100644 docs/errors/file.html create mode 100644 docs/errors/header.html create mode 100644 docs/errors/label.html create mode 100644 docs/errors/metadata.html create mode 100644 docs/errors/resource.html create mode 100644 docs/errors/row.html create mode 100644 docs/errors/table.html create mode 100644 docs/fields/any.html create mode 100644 docs/fields/array.html create mode 100644 docs/fields/boolean.html create mode 100644 docs/fields/date.html create mode 100644 docs/fields/datetime.html create mode 100644 docs/fields/duration.html create mode 
100644 docs/fields/geojson.html create mode 100644 docs/fields/geopoint.html create mode 100644 docs/fields/integer.html create mode 100644 docs/fields/number.html create mode 100644 docs/fields/object.html create mode 100644 docs/fields/string.html create mode 100644 docs/fields/time.html create mode 100644 docs/fields/year.html create mode 100644 docs/fields/yearmonth.html create mode 100644 docs/formats/csv.html create mode 100644 docs/formats/erd.html create mode 100644 docs/formats/excel.html create mode 100644 docs/formats/gsheets.html create mode 100644 docs/formats/html.html create mode 100644 docs/formats/inline.html create mode 100644 docs/formats/json.html create mode 100644 docs/formats/jsonschema.html create mode 100644 docs/formats/markdown.html create mode 100644 docs/formats/ods.html create mode 100644 docs/formats/pandas.html create mode 100644 docs/formats/parquet.html create mode 100644 docs/formats/spss.html create mode 100644 docs/formats/sql.html create mode 100644 docs/formats/yaml.html create mode 100644 docs/formats/zip.html create mode 100644 docs/framework/actions.html create mode 100644 docs/framework/catalog.html create mode 100644 docs/framework/checklist.html create mode 100644 docs/framework/detector.html create mode 100644 docs/framework/dialect.html create mode 100644 docs/framework/error.html create mode 100644 docs/framework/inquiry.html create mode 100644 docs/framework/package.html create mode 100644 docs/framework/pipeline.html create mode 100644 docs/framework/report.html create mode 100644 docs/framework/resource.html create mode 100644 docs/framework/schema.html create mode 100644 docs/framework/table.html create mode 100644 docs/getting-started.html create mode 100644 docs/guides/describing-data.html create mode 100644 docs/guides/extracting-data.html create mode 100644 docs/guides/transforming-data.html create mode 100644 docs/guides/validating-data.html create mode 100644 docs/portals/ckan.html create mode 100644 docs/portals/github.html create mode 100644 docs/portals/zenodo.html create mode 100644 docs/resources/file.html create mode 100644 docs/resources/json.html create mode 100644 docs/resources/table.html create mode 100644 docs/resources/text.html create mode 100644 docs/schemes/aws.html create mode 100644 docs/schemes/buffer.html create mode 100644 docs/schemes/local.html create mode 100644 docs/schemes/multipart.html create mode 100644 docs/schemes/remote.html create mode 100644 docs/schemes/stream.html create mode 100644 docs/steps/cell.html create mode 100644 docs/steps/field.html create mode 100644 docs/steps/resource.html create mode 100644 docs/steps/row.html create mode 100644 docs/steps/table.html create mode 100644 docs/universe.html create mode 100644 index.html diff --git a/.gitignore b/.gitignore index 088d28f607..c24a2d10aa 100644 --- a/.gitignore +++ b/.gitignore @@ -95,3 +95,4 @@ coverage/ site/ tmp/ .vim +!**/*.html diff --git a/404.html b/404.html new file mode 100644 index 0000000000..1662d1404a --- /dev/null +++ b/404.html @@ -0,0 +1,3439 @@ + + + + + + + + +Not Found | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Not Found

+
+ +

Return to the home page.

+
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/blog/2022/08-22-frictionless-framework-v5.html b/blog/2022/08-22-frictionless-framework-v5.html new file mode 100644 index 0000000000..b6143273e3 --- /dev/null +++ b/blog/2022/08-22-frictionless-framework-v5.html @@ -0,0 +1,4033 @@ + + + + + + + + +Welcome Frictionless Framework (v5) | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Welcome Frictionless Framework (v5)

+

+

+ By Evgeny Karev on 2022-08-22 » + Blog Index +

We're releasing a first beta of Frictionless Framework (v5)!

+

Since the initial Frictionless Framework release we had been collecting feedback and analyzing both high-level users' needs and bug reports to identify shortcomings and areas that could be improved in the next version of the framework. Once that process was done, we started working on a new v5 with the goal of making the framework more bullet-proof, easier to maintain, and simpler to use. Today, this version is almost stable and ready to be published. Let's go through the main improvements we have made:

+

Improved Metadata

+

This year we started working on the Frictionless Application; at the same time, we were thinking about next steps for the Frictionless Standards. For both we need a well-defined and easy-to-understand metadata model. Partially it's already published as standards like Table Schema, and partially it's going to be published as standards like File Dialect and possibly validation/transform metadata.

+

Dialect

+

In v4 of the framework we had Control/Dialect/Layout concepts to describe resource details related to different formats and schemes, as well as tabular details like header rows. In v5 they are merged into a single concept called Dialect, which is going to be standardised as a File Dialect spec. Here is an example:

+ +
+
+
header: true
+headerRows: [2, 3]
+commentChar: '#'
+csv:
+  delimiter: ';'
+
+ +
+
+
from frictionless import Dialect, Control, formats
+
+dialect = Dialect(header=True, header_rows=[2, 3], comment_char='#')
+dialect.add_control(formats.CsvControl(delimiter=';'))
+print(dialect)
+
+ +
+

A dialect descriptor can be saved and reused within a resource. Technically, it's possible to provide settings for different schemes and formats within one Dialect (e.g. for CSV and Excel), so one re-usable dialect can serve a whole data package. The legacy CSV Dialect spec is supported and will be supported forever, so it's possible to provide CSV properties on the root level:

+ +
+
+
header: true
+delimiter: ';'
+
+ +
+
+
from frictionless import Dialect, Control, formats
+
+dialect = Dialect.from_descriptor({"header": True, "delimiter": ';'})
+print(dialect)
+
+ +
+
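A dialect can also be stored as its own descriptor file and then dereferenced from a resource. Here is a hedged sketch, assuming the to_yaml helper shared by the metadata classes; the file names are hypothetical:

from frictionless import Dialect

# a sketch: save a dialect so it can be dereferenced from a resource descriptor
dialect = Dialect(header_rows=[2, 3], comment_char='#')
dialect.to_yaml('dialect.yaml')
# a resource descriptor can then reference it as `dialect: dialect.yaml`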

For performance and codebase maintainability reasons some marginal Layout features have been removed completely, such as skip/pick/limit/offsetFields/etc. It's possible to achieve the same results using the Pipeline concept as a part of the transformation workflow.
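For example, a v4 layout that picked fields and limited rows could be expressed as transform steps. This is a hedged sketch using builtin steps from the steps module:

from frictionless import Pipeline, steps

# a sketch: replicate v4 Layout's pickFields/limitRows with transform steps
pipeline = Pipeline(
    steps=[
        steps.field_filter(names=['id', 'name']),  # keep only these fields
        steps.row_slice(head=100),  # keep only the first 100 rows
    ],
)
print(pipeline)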

+

Read an article about Dialect Class for more information.

+

Checklist

+

Checklist is a new concept introduced in v5. It's basically a collection of validation checks and a few other settings that make "validation rules" shareable. For example:

+ +
+
+
checks:
+  - type: ascii-value
+  - type: row_constraint
+    formula: id > 1
+skipErrors:
+  - duplicate-label
+
+ +
+
+
from frictionless import Checklist, checks
+
+checklist = Checklist(
+    checks=[checks.ascii_value(), checks.row_constraint(formula='id > 1')],
+    skip_errors=['duplicate-label'],
+)
+print(checklist)
+
+ +
+

By saving and sharing this checklist it's possible to tune data quality requirements for a data file or a set of data files. This concept provides the ability to create data quality "libraries" within projects or domains. We can use a checklist for validation:

+ +
+
+
frictionless validate table1.csv --checklist checklist.yaml
+frictionless validate table2.csv --checklist checklist.yaml
+
+ +
+
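The same can be done from Python. A hedged sketch, assuming the checklist argument of the validate action; the file names mirror the CLI example above:

from frictionless import Checklist, validate

# a sketch: reuse a saved checklist from Python
checklist = Checklist.from_descriptor('checklist.yaml')
report = validate('table1.csv', checklist=checklist)
print(report.valid)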

Here is a list of other changes:

+ + + + + + + + + + + + + + + + +
From (v4)To (v5)
Check(descriptor)Check.from_descriptor(descriptor)
check.codecheck.type

Read an article about Checklist Class for more information.

+

Pipeline

+

In v4 Pipeline was a complex concept similar to the validation Inquiry. We reworked it for v5 to be a lightweight set of transform steps that can be applied to a data resource or a data package. For example:

+ +
+
+
steps:
+  - type: table-normalize
+  - type: cell-set
+    fieldName: version
+    value: v5
+
+ +
+
+
from frictionless import Pipeline, steps
+
+pipeline = Pipeline(
+    steps=[steps.table_normalize(), steps.cell_set(field_name='version', value='v5')],
+)
+print(pipeline)
+
+ +
+

Similar to the Checklist concept, Pipeline is a reusable (data-abstract) object that can be saved to a descriptor and used in some complex data workflow:

+ +
+
+
frictionless transform table1.csv --pipeline pipeline.yaml
+frictionless transform table2.csv --pipeline pipeline.yaml
+
+ +
+
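And from Python, a hedged sketch assuming the pipeline argument of the transform action:

from frictionless import Pipeline, transform

# a sketch: reuse a saved pipeline from Python
pipeline = Pipeline.from_descriptor('pipeline.yaml')
target = transform('table1.csv', pipeline=pipeline)
print(target.to_view())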

Here is a list of other changes:

+ + + + + + + + + + + + + + + + +
From (v4)To (v5)
Step(descriptor)Step.from_descriptor(descriptor)
step.codestep.type

Read an article about Pipeline Class for more information.

+

Resource

+
+ +

There are no standards-related changes in the Resource, although by default the type property (instead of profile) is now used to mark a resource as a table. This can be changed using the --standards v1 flag.
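From Python, the same effect can be achieved with the standards context described in the System docs. A hedged sketch with a hypothetical file name:

from frictionless import describe, system

# a sketch: output metadata using the v1 notation (profile instead of type)
with system.use_context(standards='v1'):
    resource = describe('table.csv')
    print(resource)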

+

It's now possible to set Checklist and Pipeline as a Resource property similar to Dialect and Schema:

+ +
+
+
path: table.csv
+# ...
+checklist:
+  checks:
+    - type: ascii-value
+    - type: row_constraint
+      formula: id > 1
+pipeline: pipeline.yaml
+  steps:
+    - type: table-normalize
+    - type: cell-set
+      fieldName: version
+      value: v5
+
+ +
+

Or using dereference:

+ +
+
+
path: table.csv
+# ...
+checklist: checklist.yaml
+pipeline: pipeline.yaml
+
+ +
+

In this case the validation/transformation will use it by default, providing an ability to ship validation rules and transformation pipelines within resources and packages. This is an important development for data publishers who want to define what they consider to be valid for their datasets, as well as to share raw data along with cleaning pipeline steps:

+ +
+
+
frictionless validate resource.yaml  # will use the checklist above
+frictionless transform resource.yaml  # will use the pipeline above
+
+ +
+

There are minor changes in the stats property. Now it uses named keys to simplify hash distinction (md5/sha256 are calculated by default and, unlike in v4, this can't be changed, for performance reasons):

+ +
+
+
from frictionless import describe
+
+resource = describe('table.csv', stats=True)
+print(resource.stats)
+
+ +
+
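With the named keys, the hashes can then be read individually. A hedged sketch assuming attribute access on the stats object:

from frictionless import describe

# a sketch: named keys make the individual hashes easy to tell apart
resource = describe('table.csv', stats=True)
print(resource.stats.md5)
print(resource.stats.sha256)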

Here is a list of other changes:

+ + + + + + + + + + + + +
From (v4)To (v5)
for row in resource:for row in resource.row_stream

Read an article about Resource Class for more information.

+

Package

+

There are no standards-related changes in the Package, although it's now possible to use resource dereference:

+ +
+
+
name: package
+resources:
+  - resource1.yaml
+  - resource2.yaml
+
+ +
+

Read an article about Package Class for more information.

+

Catalog

+
+ +

Catalog is a new concept: a collection of data packages that can be written inline or using dereference:

+ +
+
+
name: catalog
+packages:
+  - package1.yaml
+  - package2.yaml
+
+ +
+

Read an article about Catalog Class for more information.

+

Detector

+

Detector is now a metadata class (it wasn't in v4), so it can be saved and shared like other metadata classes:

+ +
+
+
from frictionless import Detector
+
+detector = Detector(sample_size=1000)
+print(detector)
+
+ +
+

Read an article about Detector Class for more information.

+

Inquiry

+

There are a few changes in the Inquiry concept, which is known for its use in the Frictionless Repository project:

+ + + + + + + + + + + + + + + + + + + + +
From (v4)To (v5)
inquiryTask.sourceinquiryTask.path
inquiryTask.sourceinquiryTask.resource
inquiryTask.sourceinquiryTask.package

Read an article about Inquiry Class for more information.

+

Report

+

The Report concept has been significantly simplified by removing the resource property from reportTask. It's been replaced by name/type/place/labels properties. Also, report.time is now report.stats.seconds. The report/reportTask.warnings: List[str] properties have been added to provide non-error information like reached limits:

+ +
+
+
frictionless validate table.csv --yaml
+
+ +
+
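In Python the renamed attributes listed below look roughly like this; a hedged sketch:

from frictionless import validate

report = validate('table.csv')
print(report.stats)  # includes "seconds" (was report.time in v4)
task = report.tasks[0]
print(task.name, task.type, task.place)  # replaced reportTask.resource from v4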

Here is a list of changes:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
From (v4)To (v5)
report.timereport.stats.seconds
reportTask.timereportTask.stats.seconds
reportTask.resource.namereportTask.name
reportTask.resource.profilereportTask.type
reportTask.resource.pathreportTask.place
reportTask.resource.schemareportTask.labels

Read an article about Report Class for more information.

+

Schema

+

Changes in the Schema class:

+ + + + + + + + + + + + +
From (v4)To (v5)
Schema(descriptor)Schema.from_descriptor(descriptor)

Error

+

There are a few changes in the Error data structure:

+ + + + + + + + + + + + + + + + + + + + + + + + +
From (v4)To (v5)
error.codeerror.type
error.nameerror.title
error.rowPositionerror.rowNumber
error.fieldPositionerror.fieldNumber

Types

+

Note that all the metadata entities that have multiple implementations in v5 are based on a unified type model. It means that they use the type property to provide type information:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
From (v4)To (v5)
resource.profileresource.type
check.codecheck.type
control.codecontrol.type
error.codeerror.type
field.typefield.type
step.typestep.type

The new v5 version still supports the old notation in descriptors for backward compatibility.
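For example, the unified type property can be inspected on any builtin entity. A small hedged sketch using a builtin check:

from frictionless import checks

# a sketch: every entity now exposes its type via the "type" property
check = checks.duplicate_row()
print(check.type)  # "duplicate-row" (was check.code in v4)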

+

Improved Model

+

For many years, Frictionless mixed declarative metadata and the object model for historical reasons. Since the first implementation of the datapackage library we used different approaches to sync internal state and provide both interfaces: descriptor and object model. In Frictionless Framework v4 this technique was taken to a really sophisticated level with special observable dictionary classes. It was quite smart and nice to use for quick prototyping in a REPL, but it was really hard to maintain and error-prone.

+

In Framework v5 we finally decided to follow the "right way" of handling this problem and split descriptors and the object model completely.

+

Descriptors

+

In the Frictionless World we deal with a lot of declarative metadata descriptors such as packages, schemas, pipelines, etc. Nothing changes in v5 regarding this. So for example here is a Table Schema:

+ +
+
+
fields:
+  - name: id
+    type: integer
+  - name: name
+    type: string
+
+ +
+

Object Model

+

The difference comes when we create a metadata instance based on this descriptor. In v4 all the metadata classes were subclasses of the dict class, providing a mix between a descriptor and an object model for state management. In v5 there is a clear boundary between descriptor and object model. All the state is managed, as it should be in a normal Python class, using class attributes:

+ +
+
+
from frictionless import Schema
+
+schema = Schema.from_descriptor('schema.yaml')
+# Here we deal with a proper object model
+descriptor = schema.to_descriptor()
+# Here we export it back to be a descriptor
+
+ +
+

There are a few important traits of the new model:

+ +

This separation might require adding a few additional lines of code, but it gives us much less fragile programs in the end. It's especially important for software integrators who want to be sure that they write working code. At the same time, for quick prototyping and discovery Frictionless still provides high-level actions like the validate function that are more forgiving regarding user input.
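To illustrate the contrast, a hedged sketch with a deliberately invalid descriptor:

from frictionless import FrictionlessException, Schema, validate

# object model: strict, so an invalid descriptor raises an exception
try:
    Schema.from_descriptor({'fields': 'not-a-list'})
except FrictionlessException as exception:
    print(exception.error.message)

# high-level action: forgiving, so problems end up in a report instead
report = validate('table.csv')
print(report.valid)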

+

Static Typing

+

One of the most important consequences of "fixing" state management in Frictionless is our new ability to provide static typing for the framework codebase. This work is in progress, but we have already added a lot of types and the codebase successfully passes pyright validation. We highly recommend enabling pyright in your IDE to see all the type problems in advance:

+
+ +

Livemark Docs

+

We're happy to announce that we're finally ready to drop the JavaScript dependency for docs generation, as we have migrated it to Livemark. Moreover, Livemark's ability to execute scripts inside the documentation, and other nifty features like simple Tabs or a reference generator, will save us hours and hours of work on writing better docs.

+

Script Execution

+
+ +

Reference Generation

+
+ +

Happy Contributors

+

We hope that the Livemark docs writing experience will make our contributors happier and allow us to grow our community of Frictionless Authors and Users. Let's chat in our Slack if you have questions or just want to say hi.

+

Read Livemark Docs for more information.

+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/blog/2022/09-07-github-integration.html b/blog/2022/09-07-github-integration.html new file mode 100644 index 0000000000..27b9e1a070 --- /dev/null +++ b/blog/2022/09-07-github-integration.html @@ -0,0 +1,3529 @@ + + + + + + + + +Github Integration | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Github Integration

+

+

+ By Shashi Gharti on 2022-09-07 » + Blog Index +

We are happy to announce the github plugin, which makes sharing data between frictionless and github easier without any extra work or configuration. All the github plugin functionality is built around the PyGithub library. The main idea is to make the interaction between the framework and github seamless, using read and write functions developed on top of the Frictionless python library. Here is a short introduction and examples of the features.

+

Reading from the repo

+

Reading a package from a github repository is easy! The existing Package class can identify the github url and read the packages and resources from the repo. It can read packages from repos with or without package descriptors. If a package descriptor is not defined, it will create one with the resources that it finds in the repo.

+ +
+
+
from frictionless import Package
+
+package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
+print(package)
+
+ +
+

Writing/Publishing to the repo

+

Writing and publishing can be done easily by passing the repository link to the publish function.

+ +
+
+
from frictionless import Package, portals
+
+apikey = 'YOUR-GITHUB-API-KEY'
+package = Package('data/datapackage.json')
+response = package.publish("https://github.com/fdtester/test-repo-write",
+        control=portals.GithubControl(apikey=apikey)
+    )
+
+ +
+

Creating catalog

+

A catalog can be created from repositories by using 'search' queries. Repositories can be searched using a combination of any search text and github qualifiers. A simple example of creating a catalog from a search is as follows:

+ +
+
+
from frictionless import Catalog, portals
+
+catalog = Catalog(
+        control=portals.GithubControl(search="user:fdtester", per_page=1, page=1),
+    )
+
+ +
+
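The resulting catalog can then be inspected. A hedged sketch, assuming a packages attribute as suggested by the Catalog descriptor format:

from frictionless import Catalog, portals

# a sketch: inspect the packages discovered by the search
catalog = Catalog(control=portals.GithubControl(search="user:fdtester"))
for package in catalog.packages:
    print(package.name)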

Happy Contributors

+

We will have more updates in the future and would love to hear from you about this new feature. Let's chat in our Slack if you have questions or just want to say hi.

+

Read Github Plugin Docs for more information.

+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/blog/2022/11-07-zenodo-integration.html b/blog/2022/11-07-zenodo-integration.html new file mode 100644 index 0000000000..55e5b0ff72 --- /dev/null +++ b/blog/2022/11-07-zenodo-integration.html @@ -0,0 +1,3534 @@ + + + + + + + + +Zenodo Integration | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Zenodo Integration

+

+

+ By Shashi Gharti on 2022-11-07 » + Blog Index +

Zenodo integration was a highly requested feature, and we are happy to share our first draft of the plugin, which makes sharing data between frictionless and zenodo easier without any extra work or configuration. This plugin uses the zenodopy library underneath to communicate with the Zenodo REST API. A frictionless user can use the framework functionality and then easily publish data to zenodo and vice versa. Here is a short description of the features with examples.

+

Reading from the repo

+

You can simply read a package, or create a new package from the zenodo repository if one does not exist. No additional configuration is required. The existing Package class identifies the zenodo url and reads the packages and resources from the repo. An example of reading a package from the zenodo repo is as follows:

+ +
+
+
from frictionless import Package
+
+package = Package("https://zenodo.org/record/7078760")
+print(package)
+
+ +
+

Once read, you can apply all the available functions to the package, such as validation, transformation, etc.
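For instance, a hedged sketch validating the package we just read, assuming the package-level validate method:

from frictionless import Package

# a sketch: validate the package read from zenodo
package = Package("https://zenodo.org/record/7078760")
report = package.validate()
print(report.valid)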

+

Writing/Publishing to the repo

+

To write the package, we can simply use the publish function, which will write the package and resource files to the zenodo repository. We need to provide metadata for the repository while publishing data, which we pass as a metadata file as shown in the example below:

+ +
+
+
from frictionless import Package, portals
+
+control = portals.ZenodoControl(
+       metafn="data/zenodo/metadata.json",
+       apikey=apikey
+)
+package = Package("data/datapackage.json")
+deposition_id = package.publish(control=control)
+print(deposition_id)
+
+
+ +
+

Once the package is published, the deposition_id will be returned.

+

Creating catalog

+

A catalog can be created from a single repository or from multiple repositories. Repositories can be searched using any search terms, phrases, field searches, or a combination of all of these. A simple example of creating a catalog from a search is as follows:

+ +
+
+
from frictionless import Catalog, portals
+control=portals.ZenodoControl(search='title:"open science"')
+catalog = Catalog(
+        control=control,
+    )
+
+ +
+

Happy Contributors

+

We will have more updates in the future and would love to hear from you about this new feature. Let's chat in our Slack if you have questions or just want to say hi.

+

Read Zenodo Plugin Docs for more information.

+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/blog/index.html b/blog/index.html new file mode 100644 index 0000000000..42e5a75744 --- /dev/null +++ b/blog/index.html @@ -0,0 +1,3518 @@ + + + + + + + + +Blog | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Blog

+
+

Zenodo Integration

+
+
+

+ + By Shashi Gharti + on 2022-11-07 + +

+ This blog gives the introduction of the zenodo plugin which helps to easily read data from and write data to Zenodo. + Read more » +
+
+ +
+
+
+
+

Github Integration

+
+
+

+ + By Shashi Gharti + on 2022-09-07 + +

+ This blog gives the introduction of the github plugin which helps to seamlessly transfer/read data to/from Github. + Read more » +
+
+ +
+
+
+
+

Welcome Frictionless Framework (v5)

+
+
+

+ + By Evgeny Karev + on 2022-08-22 + +

Since the initial Frictionless Framework release we'd been collecting feedback and analyzing both high-level users' needs and bug reports to identify shortcomings and areas that can be improved in the next version of the framework. Read more »
+
+ +
+
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/data/table-output.parq b/data/table-output.parq index 814813c4bbf245629c3ef7a77b300a5d88c94631..bfe5e0e37c4767d3c320e912c14a0a9ac779e89c 100644 GIT binary patch delta 35 kcmeyu_JwW3awawtJwro1gUM@|3K@+j%QM@+1QMBr0mhRFfB*mh delta 35 kcmeyu_JwW3awaxoJwpRM!^vxz3K + + + + + + + +Design | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Design

+

This guide provides a high-level overview of the Frictionless Framework architecture. It will be useful for plugin authors and advanced users.

+

Reading Flow

+

Frictionless uses a modular approach for its architecture. During reading, a data source goes through various subsystems, which are selected depending on the data characteristics:
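Roughly, in code terms, a hedged sketch mirroring the system factory methods documented in the System reference:

from frictionless import Resource, system

# a sketch: a resource is detected, then a loader and a parser are selected
resource = Resource('data/table.csv')
system.detect_resource(resource)
loader = system.create_loader(resource)
parser = system.create_parser(resource)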

+

Reading

+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/advanced/extending.html b/docs/advanced/extending.html new file mode 100644 index 0000000000..8b2f3fac68 --- /dev/null +++ b/docs/advanced/extending.html @@ -0,0 +1,3631 @@ + + + + + + + + +Extension | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Extension

+
+ +

Frictionless is built on top of a powerful plugin system which is used internally and allows extending the framework.

+

Creating Plugin

+

To create a plugin you need:

+
    +
  • create a module called frictionless_<name> available in PYTHONPATH
  • +
  • subclass the Plugin class and override one of the methods above
  • +
+

Please consult System/Plugin for detailed information about the Plugin interface and how these methods can be implemented.

+

Plugin Example

+

Let's say we're interested in supporting the csv2k format that we have just invented. For simplicity, let's use a format that is exactly the same as CSV.

+

First of all, we need to create a frictionless_csv2k module containing a Plugin implementation and a Parser implementation, but we're going to re-use the CsvParser as our new format is the same:

+
+

frictionless_csv2k.py

+
+ +
+
+
from frictionless import Plugin, system
+from frictionless.plugins.csv import CsvParser
+
+class Csv2kPlugin(Plugin):
+    def create_parser(self, resource):
+        if resource.format == "csv2k":
+            return Csv2kParser(resource)
+
+class Csv2kParser(CsvParser):
+    pass
+
+system.register('csv2k', Csv2kPlugin())
+
+ +
+

Now, we can use our new format in any of the Frictionless functions that accept a table source, for example, extract or Table:

+ +
+
+
from frictionless import extract
+
+rows = extract('data/table.csv2k')
+print(rows)
+
+ +
+

This example is over-simplified to show the high-level mechanics, but writing Frictionless Plugins is designed to be easy. For inspiration, you can check the frictionless/plugins directory and learn from real-life examples. Also, in the Frictionless codebase there are many Check, Control, Dialect, Loader, Parser, and Server implementations: you can read their code for a better understanding of how to write your own subclass, or reach out to us for support.

+

Reference

+
+ + +
+
+ +

Plugin (class)

+ +
+
+ + +
+

Plugin (class)

+

Plugin representation. It's an interface for writing Frictionless plugins. You can implement one or more methods to hook into the Frictionless system.

+
+ + + +
+

plugin.create_adapter (method)

+

Create adapter

+

Signature

+

(source: Any, *, control: Optional[Control] = None, basepath: Optional[str] = None, packagify: bool = False) -> Optional[Adapter]

+

Parameters

+
    +
  • + source + (Any): source
  • +
  • + control + (Optional[Control]): control
  • +
  • + basepath + (Optional[str])
  • +
  • + packagify + (bool)
  • +
+
+
+

plugin.create_loader (method)

+

Create loader

+

Signature

+

(resource: Resource) -> Optional[Loader]

+

Parameters

+
    +
  • + resource + (Resource): loader resource
  • +
+
+
+

plugin.create_parser (method)

+

Create parser

+

Signature

+

(resource: Resource) -> Optional[Parser]

+

Parameters

+
    +
  • + resource + (Resource): parser resource
  • +
+
+
+

plugin.detect_field_candidates (method)

+

Detect field candidates

+

Signature

+

(candidates: List[dict[str, Any]]) -> None

+

Parameters

+
    +
  • + candidates + (List[dict[str, Any]])
  • +
+
+
+

plugin.detect_resource (method)

+

Hook into resource detection

+

Signature

+

(resource: Resource) -> None

+

Parameters

+
    +
  • + resource + (Resource): resource
  • +
+
+ + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/advanced/system.html b/docs/advanced/system.html new file mode 100644 index 0000000000..4af64258b7 --- /dev/null +++ b/docs/advanced/system.html @@ -0,0 +1,4141 @@ + + + + + + + + +System | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

System

+
+ +

System Object

+

The most important underlying object in the Frictionless Framework is system. It's a singleton object available as frictionless.system.

+

System Context

+

Using the system object a user can alter the execution context. It uses a Python context manager, so it can be used in any way that's possible in Python; for example, contexts can be nested or combined.
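For example, a hedged sketch combining two of the context values described below by nesting:

from frictionless import extract, system

# a sketch: nested contexts apply together
with system.use_context(trusted=True):
    with system.use_context(onerror='warn'):
        rows = extract('/path/to/file/is/absolute.csv')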

+

trusted

+

If data or metadata comes from a trusted origin, it's possible to disable safety checks for paths:

+
with system.use_context(trusted=True):
+    extract('/path/to/file/is/absolute.csv')
+
+

onerror

+

To raise warnings or errors on data problems, it's possible to use the onerror context value. It defaults to ignore and can be set to warn or error:

+
with system.use_context(onerror='error'):
+    extract('table-with-error-will-raise-an-exception.csv')
+
+

standards

+

By default, the framework uses the upcoming v2 version of the standards for outputting metadata. It's possible to alter this behaviour:

+
with system.use_context(standards='v1'):
+    describe('metadata-will-be-in-v1.csv')
+
+

http_session

+

It's possible to provide a custom requests.Session:

+
session = requests.Session()
+with system.use_context(http_session=session):
+    with Resource(BASEURL % "data/table.csv") as resource:
+        assert resource.header == ["id", "name"]
+
+

System methods

+

This object can be used to instantiate different kinds of lower-level objects such as Check, Step, or Field. Here is a quick example of using the system object:

+ +
+
+
from frictionless import Resource, system
+
+# Create
+
+adapter = system.create_adapter(source, control=control)
+loader = system.create_loader(resource)
+parser = system.create_parser(resource)
+
+# Detect
+
+system.detect_resource(resource)
+field_candidates = system.detect_field_candidates()
+
+# Select
+
+Check = system.selectCheck('type')
+Control = system.selectControl('type')
+Error = system.selectError('type')
+Field = system.selectField('type')
+Step = system.selectStep('type')
+
+ +
+

As an extension author you might use the system object in various cases. For example, take a look at these MultipartLoader excerpts:

+ +
+
+
def read_line_stream(self):
+    for number, path in enumerate(self.__path, start=1):
+        resource = Resource(path=path)
+        resource.infer(sample=False)
+        with system.create_loader(resource) as loader:
+            for line_number, line in enumerate(loader.byte_stream, start=1):
+                if not self.__headless and number > 1 and line_number == 1:
+                    continue
+                yield line
+
+ +
+

It's important to understand that, in general, it's more correct to create low-level objects using the system object rather than the classes directly, because the system object includes all the available plugins in the process.

+

Plugin API

+

The Plugin API almost fully follows the system object's API. So as a plugin author you need to hook into the same methods. For example, let's take a look at a builtin Csv Plugin:

+ +
+
+
class CsvPlugin(Plugin):
+    """Plugin for CSV"""
+
+    # Hooks
+
+    def create_parser(self, resource: Resource):
+        if resource.format in ["csv", "tsv"]:
+            return CsvParser(resource)
+
+    def detect_resource(self, resource: Resource):
+        if resource.format in ["csv", "tsv"]:
+            resource.type = "table"
+            resource.mediatype = f"text/{resource.format}"
+
+    def select_Control(self, type: str):
+        if type == "csv":
+            return CsvControl
+
+ +
+

Reference

+
+ + +
+
+ +

Adapter (class)

+

Loader (class)

+

Mapper (class)

+

Parser (class)

+

Plugin (class)

+

System (class)

+ +
+
+ + +
+

Adapter (class)

+

+
+ + + + + +
+

Loader (class)

+

Loader representation

+

Signature

+

(resource: Resource)

+

Parameters

+
    +
  • + resource + (Resource): resource
  • +
+
+ +
+

loader.remote (property)

+

+ Specifies if the resource is remote. +

+

Signature

+

bool

+
+ +
+

loader.buffer (property)

+

+

Signature

+

types.IBuffer

+
+
+

loader.byte_stream (property)

+

Resource byte stream + +The stream is available after opening the loader

+

Signature

+

types.IByteStream

+
+
+

loader.closed (property)

+

Whether the loader is closed

+

Signature

+

bool

+
+
+

loader.resource (property)

+

+

Signature

+

Resource

+
+
+

loader.text_stream (property)

+

Resource text stream + +The stream is available after opening the loader

+

Signature

+

types.ITextStream

+
+ +
+

loader.close (method)

+

Close the loader as "filelike.close" does

+

Signature

+

() -> None

+
+
+

loader.open (method)

+

Open the loader as "io.open" does

+
+
+

loader.read_byte_stream (method)

+

Read bytes stream

+

Signature

+

() -> types.IByteStream

+
+
+

loader.read_byte_stream_analyze (method)

+

Detect metadata using a sample

+

Signature

+

(buffer: bytes)

+

Parameters

+
    +
  • + buffer + (bytes): byte buffer
  • +
+
+
+

loader.read_byte_stream_buffer (method)

+

Buffer byte stream

+

Signature

+

(byte_stream: types.IByteStream)

+

Parameters

+
    +
  • + byte_stream + (types.IByteStream): resource byte stream
  • +
+
+
+

loader.read_byte_stream_create (method)

+

Create bytes stream

+

Signature

+

() -> types.IByteStream

+
+
+

loader.read_byte_stream_decompress (method)

+

Decompress byte stream

+

Signature

+

(byte_stream: types.IByteStream) -> types.IByteStream

+

Parameters

+
    +
  • + byte_stream + (types.IByteStream): resource byte stream
  • +
+
+
+

loader.read_byte_stream_process (method)

+

Process byte stream

+

Signature

+

(byte_stream: types.IByteStream) -> ByteStreamWithStatsHandling

+

Parameters

+
    +
  • + byte_stream + (types.IByteStream): resource byte stream
  • +
+
+
+

loader.read_text_stream (method)

+

Read text stream

+
+
+

loader.write_byte_stream (method)

+

Write from a temporary file

+

Signature

+

(path: str) -> Any

+

Parameters

+
    +
  • + path + (str): path to a temporary file
  • +
+
+
+

loader.write_byte_stream_create (method)

+

Create byte stream for writing

+

Signature

+

(path: str) -> types.IByteStream

+

Parameters

+
    +
  • + path + (str): path to a temporary file
  • +
+
+
+

loader.write_byte_stream_save (method)

+

Store byte stream

+

Signature

+

(byte_stream: types.IByteStream) -> Any

+

Parameters

+
    +
  • + byte_stream + (types.IByteStream)
  • +
+
+ + +
+

Mapper (class)

+

+
+ + + + + +
+

Parser (class)

+

Parser representation

+

Signature

+

(resource: Resource)

+

Parameters

+
    +
  • + resource + (Resource): resource
  • +
+
+ +
+

parser.requires_loader (property)

+

+ Specifies if parser requires the loader to load the + data. +

+

Signature

+

ClassVar[bool]

+
+
+

parser.supported_types (property)

+

+ Data types supported by the parser. +

+

Signature

+

ClassVar[List[str]]

+
+ +
+

parser.cell_stream (property)

+

+

Signature

+

types.ICellStream

+
+
+

parser.closed (property)

+

Whether the parser is closed

+

Signature

+

bool

+
+
+

parser.loader (property)

+

+

Signature

+

Loader

+
+
+

parser.resource (property)

+

+

Signature

+

Resource

+
+
+

parser.sample (property)

+

+

Signature

+

types.ISample

+
+ +
+

parser.close (method)

+

Close the parser as "filelike.close" does

+

Signature

+

() -> None

+
+
+

parser.open (method)

+

Open the parser as "io.open" does

+
+
+

parser.read_cell_stream (method)

+

Read list stream

+

Signature

+

() -> types.ICellStream

+
+
+

parser.read_cell_stream_create (method)

+

Create list stream from loader

+

Signature

+

() -> types.ICellStream

+
+
+

parser.read_cell_stream_handle_errors (method)

+

Wrap list stream into error handler

+

Signature

+

(cell_stream: types.ICellStream) -> CellStreamWithErrorHandling

+

Parameters

+
    +
  • + cell_stream + (types.ICellStream)
  • +
+
+
+

parser.read_loader (method)

+

Create and open loader

+

Signature

+

() -> Optional[Loader]

+
+
+

parser.write_row_stream (method)

+

Write row stream from the source resource

+

Signature

+

(source: TableResource) -> Any

+

Parameters

+
    +
  • + source + (TableResource): source resource
  • +
+
+ + +
+

Plugin (class)

+

Plugin representation. It's an interface for writing Frictionless plugins. You can implement one or more methods to hook into the Frictionless system.

+
+ + + +
+

plugin.create_adapter (method)

+

Create adapter

+

Signature

+

(source: Any, *, control: Optional[Control] = None, basepath: Optional[str] = None, packagify: bool = False) -> Optional[Adapter]

+

Parameters

+
    +
  • + source + (Any): source
  • +
  • + control + (Optional[Control]): control
  • +
  • + basepath + (Optional[str])
  • +
  • + packagify + (bool)
  • +
+
+
+

plugin.create_loader (method)

+

Create loader

+

Signature

+

(resource: Resource) -> Optional[Loader]

+

Parameters

+
    +
  • + resource + (Resource): loader resource
  • +
+
+
+

plugin.create_parser (method)

+

Create parser

+

Signature

+

(resource: Resource) -> Optional[Parser]

+

Parameters

+
    +
  • + resource + (Resource): parser resource
  • +
+
+
+

plugin.detect_field_candidates (method)

+

Detect field candidates

+

Signature

+

(candidates: List[dict[str, Any]]) -> None

+

Parameters

+
    +
  • + candidates + (List[dict[str, Any]])
  • +
+
+
+

plugin.detect_resource (method)

+

Hook into resource detection

+

Signature

+

(resource: Resource) -> None

+

Parameters

+
    +
  • + resource + (Resource): resource
  • +
+
+ + +
+

System (class)

+

System representation. This class provides the ability to make system-level Frictionless calls. It's available as the `frictionless.system` singleton.

+

Signature

+

+
+ +
+

system.supported_hooks (property)

+

A list of plugin hook method names supported by the system.

+

Signature

+

ClassVar[List[str]]

+
+
+

system.trusted (property)

+

+ A flag that indicates if resource, path or package is trusted. +

+

Signature

+

bool

+
+
+

system.onerror (property)

+

+ Type of action to take on Error such as "warn", "raise" or "ignore". +

+

Signature

+

types.IOnerror

+
+
+

system.standards (property)

+

By setting this value the user can use features of a specific standards version. The default value is v2.

+

Signature

+

types.IStandards

+
+ +
+

system.http_session (property)

+

Return an HTTP session. This method will return a new session, or the session from the `system.use_http_session` context manager.

+
+ +
+

system.create_adapter (method)

+

Create adapter

+

Signature

+

(source: Any, *, control: Optional[Control] = None, basepath: Optional[str] = None, packagify: bool = False) -> Optional[Adapter]

+

Parameters

+
    +
  • + source + (Any)
  • +
  • + control + (Optional[Control])
  • +
  • + basepath + (Optional[str])
  • +
  • + packagify + (bool)
  • +
+
+
+

system.create_loader (method)

+

Create loader

+

Signature

+

(resource: Resource) -> Loader

+

Parameters

+
    +
  • + resource + (Resource): loader resource
  • +
+
+
+

system.create_parser (method)

+

Create parser

+

Signature

+

(resource: Resource) -> Parser

+

Parameters

+
    +
  • + resource + (Resource): parser resource
  • +
+
+
+

system.deregister (method)

+

Deregister a plugin

+

Signature

+

(name: str)

+

Parameters

+
    +
  • + name + (str): plugin name
  • +
+
+
+

system.detect_field_candidates (method)

+

Create candidates

+

Signature

+

() -> List[dict[str, Any]]

+
+
+

system.detect_resource (method)

+

Hook into resource detection

+

Signature

+

(resource: Resource) -> None

+

Parameters

+
    +
  • + resource + (Resource): resource
  • +
+
+
+

system.register (method)

+

Register a plugin

+

Signature

+

(name: str, plugin: Plugin)

+

Parameters

+
    +
  • + name + (str): plugin name
  • +
  • + plugin + (Plugin): plugin to register
  • +
+
+ + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/basic-examples.html b/docs/basic-examples.html new file mode 100644 index 0000000000..eee956f981 --- /dev/null +++ b/docs/basic-examples.html @@ -0,0 +1,4141 @@ + + + + + + + + +Basic Examples | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Basic Examples

+

Let's start with an example dataset. We will look at a few raw data files that have recently been collected by an anthropologist. The anthropologist wants to publish this data in an open repository so her colleagues can also use this data. Before publishing the data, she wants to add metadata and check the data for errors. We are here to help, so let’s start by exploring the data. We see that the quality of data is far from perfect. In fact, the first row contains comments from the anthropologist! To be able to use this data, we need to clean it up a bit.

+
+

Download countries.csv to reproduce the examples (right-click and "Save link as").

+
+ +
+
+
cat countries.csv
+
+ +
# clean this data!
+id,neighbor_id,name,population
+1,Ireland,Britain,67
+2,3,France,n/a,find the population
+3,22,Germany,83
+4,,Italy,60
+5
+ +
+
+
with open('countries.csv') as file:
+    print(file.read())
+
+ +
# clean this data!
+id,neighbor_id,name,population
+1,Ireland,Britain,67
+2,3,France,n/a,find the population
+3,22,Germany,83
+4,,Italy,60
+5
+ +
+

As we can see, this is data containing information about European countries and their populations. Also, it looks like there are two fields having a relationship based on a country's identifier: neighbor_id is a Foreign Key to id.

+

Describing Data

+

First of all, we're going to describe our dataset. Frictionless uses the powerful Frictionless Data Specifications. They are very handy to describe:

+ +

Let's describe the countries table:

+ +
+
+
frictionless describe countries.csv # optionally add --stats to get statistics
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ countries │ table │ countries.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                   countries
+┏━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id      ┃ neighbor_id ┃ name   ┃ population ┃
+┡━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩
+│ integer │ string      │ string │ string     │
+└─────────┴─────────────┴────────┴────────────┘
+ +
+
+
from pprint import pprint
+from frictionless import describe
+
+resource = describe('countries.csv')
+pprint(resource)
+
+ +
{'name': 'countries',
+ 'type': 'table',
+ 'path': 'countries.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv',
+ 'encoding': 'utf-8',
+ 'dialect': {'headerRows': [2]},
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                       {'name': 'neighbor_id', 'type': 'string'},
+                       {'name': 'name', 'type': 'string'},
+                       {'name': 'population', 'type': 'string'}]}}
+ +
+

As we can see, Frictionless was smart enough to understand that the first row contains a comment. It's good, but we still have a few problems:

+
    +
  • we use n/a as a missing values marker
  • +
  • neighbor_id must be numerical: let's edit the schema
  • +
  • population must be numerical: setting proper missing values will solve it
  • +
  • there is a relation between the id and neighbor_id fields
  • +
+

Let's update our metadata and save it to the disc:

+
+

Open this file in your favorite editor and update as it's shown below

+
+ +
+
+
frictionless describe countries.csv --yaml > countries.resource.yaml
+editor countries.resource.yaml
+
+ +
+
+
from frictionless import Detector, describe
+
+detector = Detector(field_missing_values=["", "n/a"])
+resource = describe("countries.csv", detector=detector)
+resource.schema.set_field_type("neighbor_id", "integer")
+resource.schema.foreign_keys.append(
+    {"fields": ["neighbor_id"], "reference": {"resource": "", "fields": ["id"]}}
+)
+resource.to_yaml("countries.resource.yaml")
+
+ +
+

Let's see what we have created:

+ +
+
+
cat countries.resource.yaml
+
+ +
name: countries
+type: table
+path: countries.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+dialect:
+  headerRows:
+    - 2
+schema:
+  fields:
+    - name: id
+      type: integer
+    - name: neighbor_id
+      type: integer
+    - name: name
+      type: string
+    - name: population
+      type: integer
+  missingValues:
+    - ''
+    - n/a
+  foreignKeys:
+    - fields:
+        - neighbor_id
+      reference:
+        resource: ''
+        fields:
+          - id
+ +
+
+
with open('countries.resource.yaml') as file:
+    print(file.read())
+
+ +
name: countries
+type: table
+path: countries.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+dialect:
+  headerRows:
+    - 2
+schema:
+  fields:
+    - name: id
+      type: integer
+    - name: neighbor_id
+      type: integer
+    - name: name
+      type: string
+    - name: population
+      type: integer
+  missingValues:
+    - ''
+    - n/a
+  foreignKeys:
+    - fields:
+        - neighbor_id
+      reference:
+        resource: ''
+        fields:
+          - id
+ +
+

It has the same metadata as we saw above but also includes our editing related to missing values and data types. We didn't change all the wrong data types manually because providing proper missing values had fixed it automatically. Now we have a resource descriptor. In the next section, we will show why metadata matters and how to use it.

+

Extracting Data

+

It's time to try extracting our data as a table. As a first naive attempt, we will ignore the metadata we saved on the previous step:

+ +
+
+
frictionless extract countries.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ countries │ table │ countries.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                 countries
+┏━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ neighbor_id ┃ name    ┃ population ┃
+┡━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
+│ 1  │ Ireland     │ Britain │ 67         │
+│ 2  │ 3           │ France  │ n/a        │
+│ 3  │ 22          │ Germany │ 83         │
+│ 4  │ None        │ Italy   │ 60         │
+│ 5  │ None        │ None    │ None       │
+└────┴─────────────┴─────────┴────────────┘
+ +
+
+
from pprint import pprint
+from frictionless import extract
+
+rows = extract('countries.csv')
+pprint(rows)
+
+ +
{'countries': [{'id': 1,
+                'name': 'Britain',
+                'neighbor_id': 'Ireland',
+                'population': '67'},
+               {'id': 2,
+                'name': 'France',
+                'neighbor_id': '3',
+                'population': 'n/a'},
+               {'id': 3,
+                'name': 'Germany',
+                'neighbor_id': '22',
+                'population': '83'},
+               {'id': 4,
+                'name': 'Italy',
+                'neighbor_id': None,
+                'population': '60'},
+               {'id': 5,
+                'name': None,
+                'neighbor_id': None,
+                'population': None}]}
+ +
+

Actually, it doesn't look terrible, but in reality, data like this is not quite useful:

+
    +
  • it's not possible to export this data e.g., to SQL because integers are mixed with strings
  • +
  • there is still a basically empty row we don't want to have
  • +
  • there are some mistakes in the neighbor_id column
  • +
+

The output of extract is in the 'utf-8' encoding scheme. Let's use the metadata we saved to try extracting data with the help of the Frictionless Data specifications:

+ +
+
+
frictionless extract countries.resource.yaml
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ countries │ table │ countries.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                 countries
+┏━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ neighbor_id ┃ name    ┃ population ┃
+┡━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
+│ 1  │ None        │ Britain │ 67         │
+│ 2  │ 3           │ France  │ None       │
+│ 3  │ 22          │ Germany │ 83         │
+│ 4  │ None        │ Italy   │ 60         │
+│ 5  │ None        │ None    │ None       │
+└────┴─────────────┴─────────┴────────────┘
+ +
+
+
from pprint import pprint
+from frictionless import extract
+
+rows = extract('countries.resource.yaml')
+pprint(rows)
+
+ +
{'countries': [{'id': 1,
+                'name': 'Britain',
+                'neighbor_id': None,
+                'population': 67},
+               {'id': 2,
+                'name': 'France',
+                'neighbor_id': 3,
+                'population': None},
+               {'id': 3,
+                'name': 'Germany',
+                'neighbor_id': 22,
+                'population': 83},
+               {'id': 4,
+                'name': 'Italy',
+                'neighbor_id': None,
+                'population': 60},
+               {'id': 5,
+                'name': None,
+                'neighbor_id': None,
+                'population': None}]}
+ +
+

It's now much better! Numerical fields are numerical fields, and there are no more textual missing value markers. We can't see it in the command line, but missing values are now None values in Python, and the data can be, e.g., exported to SQL. However, it's still not ready to be published. In the next section, we will validate it!
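As a quick aside before validating, we can also confirm the typing programmatically. A minimal sketch:

from frictionless import Resource

# read typed rows through the descriptor and inspect a cell's Python type
resource = Resource('countries.resource.yaml')
rows = resource.read_rows()
print(type(rows[0]['population']))  # <class 'int'> is expected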

+

Validating Data

+

Data validation with Frictionless is as easy as describing or extracting data:

+ +
+
+
frictionless validate countries.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+                    dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃ status  ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ countries │ table │ countries.csv │ INVALID │
+└───────────┴───────┴───────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                                   countries
+┏━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row ┃ Field ┃ Type         ┃ Message                                         ┃
+┡━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ 4   │ 5     │ extra-cell   │ Row at position "4" has an extra value in field │
+│     │       │              │ at position "5"                                 │
+│ 7   │ 2     │ missing-cell │ Row at position "7" has a missing cell in field │
+│     │       │              │ "neighbor_id" at position "2"                   │
+│ 7   │ 3     │ missing-cell │ Row at position "7" has a missing cell in field │
+│     │       │              │ "name" at position "3"                          │
+│ 7   │ 4     │ missing-cell │ Row at position "7" has a missing cell in field │
+│     │       │              │ "population" at position "4"                    │
+└─────┴───────┴──────────────┴─────────────────────────────────────────────────┘
+ +
+
+
from pprint import pprint
+from frictionless import validate
+
+report = validate('countries.csv')
+pprint(report.flatten(["rowNumber", "fieldNumber", "type"]))
+
+ +
[[4, 5, 'extra-cell'],
+ [7, 2, 'missing-cell'],
+ [7, 3, 'missing-cell'],
+ [7, 4, 'missing-cell']]
+ +
+

Ah, we saw that coming: the data is not valid; there are some missing and extra cells. But wait a minute: in the first step, we created a metadata file with more information about our table. We should use it.

+ +
+
+
frictionless validate countries.resource.yaml
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+                    dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃ status  ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ countries │ table │ countries.csv │ INVALID │
+└───────────┴───────┴───────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                                   countries
+┏━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row ┃ Field ┃ Type         ┃ Message                                         ┃
+┡━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ 3   │ 2     │ type-error   │ Type error in the cell "Ireland" in row "3" and │
+│     │       │              │ field "neighbor_id" at position "2": type is    │
+│     │       │              │ "integer/default"                               │
+│ 4   │ 5     │ extra-cell   │ Row at position "4" has an extra value in field │
+│     │       │              │ at position "5"                                 │
+│ 5   │ None  │ foreign-key  │ Row at position "5" violates the foreign key:   │
+│     │       │              │ for "neighbor_id": values "22" not found in the │
+│     │       │              │ lookup table "" as "id"                         │
+│ 7   │ 2     │ missing-cell │ Row at position "7" has a missing cell in field │
+│     │       │              │ "neighbor_id" at position "2"                   │
+│ 7   │ 3     │ missing-cell │ Row at position "7" has a missing cell in field │
+│     │       │              │ "name" at position "3"                          │
+│ 7   │ 4     │ missing-cell │ Row at position "7" has a missing cell in field │
+│     │       │              │ "population" at position "4"                    │
+└─────┴───────┴──────────────┴─────────────────────────────────────────────────┘
+ +
+
+
from pprint import pprint
+from frictionless import validate
+
+report = validate('countries.resource.yaml')
+pprint(report.flatten(["rowNumber", "fieldNumber", "type"]))
+
+ +
[[3, 2, 'type-error'],
+ [4, 5, 'extra-cell'],
+ [5, None, 'foreign-key'],
+ [7, 2, 'missing-cell'],
+ [7, 3, 'missing-cell'],
+ [7, 4, 'missing-cell']]
+ +
+

Now there are even more errors, but when it comes to data validation, the more errors revealed, the better. Thanks to the metadata, we were able to reveal some critical errors:

+
    +
  • the bad data types, i.e. Ireland instead of an id
  • +
  • the bad relation between id and neighbor_id: we don't have a country with id 22
  • +
+
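It can also be handy to print a human-readable summary of the report before fixing the data. A minimal sketch, assuming the report.to_summary() helper is available:

from frictionless import validate

# print a human-readable summary of the validation report
report = validate('countries.resource.yaml')
print(report.to_summary())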

In the next section, we will clean up the data.

+

Transforming Data

+

We will use the metadata to fix all the data type problems automatically. Only two things need to be handled manually:

+
    +
  • France's population
  • +
  • Germany's neighborhood
  • +
+ +
+
+
cat > countries.pipeline.yaml <<EOF
+steps:
+  - type: cell-replace
+    fieldName: neighbor_id
+    pattern: '22'
+    replace: '2'
+  - type: cell-replace
+    fieldName: population
+    pattern: 'n/a'
+    replace: '67'
+  - type: row-filter
+    formula: population
+  - type: field-update
+    name: neighbor_id
+    descriptor:
+      type: integer
+  - type: field-update
+    name: population
+    descriptor:
+      type: integer
+  - type: table-normalize
+  - type: table-write
+    path: countries-cleaned.csv
+EOF
+frictionless transform countries.csv --pipeline countries.pipeline.yaml
+
+ +
## Schema
+
++-------------+---------+------------+
+| name        | type    | required   |
++=============+=========+============+
+| id          | integer |            |
++-------------+---------+------------+
+| neighbor_id | integer |            |
++-------------+---------+------------+
+| name        | string  |            |
++-------------+---------+------------+
+| population  | integer |            |
++-------------+---------+------------+
+
+## Table
+
++----+-------------+---------+------------+
+| id | neighbor_id | name    | population |
++====+=============+=========+============+
+|  1 | None        | Britain |         67 |
++----+-------------+---------+------------+
+|  2 |           3 | France  |         67 |
++----+-------------+---------+------------+
+|  3 |           2 | Germany |         83 |
++----+-------------+---------+------------+
+|  4 | None        | Italy   |         60 |
++----+-------------+---------+------------+
+ +
+
+
from pprint import pprint
+from frictionless import Resource, Pipeline, describe, transform, steps
+
+pipeline = Pipeline(steps=[
+    steps.cell_replace(field_name='neighbor_id', pattern='22', replace='2'),
+    steps.cell_replace(field_name='population', pattern='n/a', replace='67'),
+    steps.row_filter(formula='population'),
+    steps.field_update(name='neighbor_id', descriptor={"type": "integer"}),
+    steps.table_normalize(),
+    steps.table_write(path="countries-cleaned.csv"),
+])
+
+source = Resource('countries.csv')
+target = source.transform(pipeline)
+pprint(target.read_rows())
+
+ +
[{'id': 1, 'neighbor_id': None, 'name': 'Britain', 'population': '67'},
+ {'id': 2, 'neighbor_id': 3, 'name': 'France', 'population': '67'},
+ {'id': 3, 'neighbor_id': 2, 'name': 'Germany', 'population': '83'},
+ {'id': 4, 'neighbor_id': None, 'name': 'Italy', 'population': '60'}]
+ +
+

Finally, we've got the cleaned version of our data, which can be exported to a database or published. We used CSV as the output format but could have used Excel, JSON, SQL, and others.
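For instance, writing the same cleaned table to another format is a one-line change. A minimal sketch, assuming the excel extra is installed (pip install frictionless[excel]):

from frictionless import Resource

# re-write the cleaned table as an Excel file instead of CSV
resource = Resource('countries-cleaned.csv')
resource.write('countries-cleaned.xlsx')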

+ +
+
+
cat countries-cleaned.csv
+
+ +
id,neighbor_id,name,population
+1,,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,,Italy,60
+ +
+
+
with open('countries-cleaned.csv') as file:
+    print(file.read())
+
+ +
id,neighbor_id,name,population
+1,,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,,Italy,60
+ +
+

Basically, that's it: now we have a valid data file and a corresponding metadata file. They can be shared with other people or stored without fear of type errors or other problems that would make research data irreproducible.
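As a final confidence check, we can describe and validate the cleaned file. A minimal sketch (report.valid is expected to be True for the cleaned data):

from frictionless import describe, validate

# infer fresh metadata for the cleaned file and validate it
resource = describe('countries-cleaned.csv')
report = validate(resource)
print(report.valid)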

+ +
+
+
ls countries-cleaned.*
+
+ +
countries-cleaned.csv
+ +
+
+
import os
+
+files = [f for f in os.listdir('.') if os.path.isfile(f) and f.startswith('countries-cleaned.')]
+print(files)
+
+ +
['countries-cleaned.csv']
+ +
+

In the next articles, we will explore more advanced Frictionless functionality.

\ No newline at end of file
diff --git a/docs/checks/baseline.html b/docs/checks/baseline.html
new file mode 100644
index 0000000000..99a8c7a320
--- /dev/null
+++ b/docs/checks/baseline.html
@@ -0,0 +1,3570 @@
+Baseline Check | Frictionless Framework

Baseline Check

+

Overview

+

The Baseline Check is always enabled. It makes various small checks that reveal a great number of tabular errors. You can create an empty Checklist to see the baseline check's scope:

+
+

Download capital-invalid.csv to reproduce the examples (right-click and "Save link as").

+
+ +
+
+
from pprint import pprint
+from frictionless import Checklist, validate
+
+checklist = Checklist()
+pprint(checklist.scope)
+report = validate('capital-invalid.csv')  # we don't pass the checklist as the empty one is default
+pprint(report.flatten(['type', 'message']))
+
+ +
['hash-count',
+ 'byte-count',
+ 'field-count',
+ 'row-count',
+ 'blank-header',
+ 'extra-label',
+ 'missing-label',
+ 'blank-label',
+ 'duplicate-label',
+ 'incorrect-label',
+ 'blank-row',
+ 'primary-key',
+ 'foreign-key',
+ 'extra-cell',
+ 'missing-cell',
+ 'type-error',
+ 'constraint-error',
+ 'unique-error']
+[['duplicate-label',
+  'Label "name" in the header at position "3" is duplicated to a label: at '
+  'position "2"'],
+ ['missing-cell',
+  'Row at position "10" has a missing cell in field "name2" at position "3"'],
+ ['blank-row', 'Row at position "11" is completely blank'],
+ ['type-error',
+  'Type error in the cell "x" in row "12" and field "id" at position "1": type '
+  'is "integer/default"'],
+ ['extra-cell',
+  'Row at position "12" has an extra value in field at position "4"']]
+ +
+

The Baseline Check is incorporated into base Frictionless classes such as Resource, Header, and Row. There is no exact order in which these errors are revealed, as the process is highly optimized. One should consider the Baseline Check as one unit of validation.
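Conversely, when a table passes all the baseline checks, the report is simply marked as valid. A minimal sketch with inline data:

from frictionless import validate

# inline data without any baseline errors
source = [['id', 'name'], [1, 'english'], [2, 'german']]
report = validate(source)
print(report.valid)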

+

Reference

+
+ + +
+
+ +

checks.baseline (class)

+ +
+
+ + +
+

checks.baseline (class)

+

Check a table for basic errors. This check is enabled by default for any `validate` function run.

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
+
\ No newline at end of file
diff --git a/docs/checks/cell.html b/docs/checks/cell.html
new file mode 100644
index 0000000000..80445b7f04
--- /dev/null
+++ b/docs/checks/cell.html
@@ -0,0 +1,4024 @@
+Cell Checks | Frictionless Framework

Cell Checks

+

ASCII Value

+

If you want to avoid non-ascii characters, this check notifies you if any are present in the data during validation. Here is how we can use this check.

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import validate, checks
+
+source=[["s.no","code"],[1,"ssµ"]]
+report = validate(source, checks=[checks.ascii_value()])
+pprint(report.flatten(["type", "message"]))
+
+ +
[['ascii-value',
+  'The cell ssµ in row at position 2 and field code at position 2 has an '
+  'error: the cell contains non-ascii characters']]
+ +
+

Reference

+
+ + +
+
+ +

checks.ascii_value (class)

+ +
+
+ + +
+

checks.ascii_value (class)

+

Check whether all the string characters in the data are ASCII. This check can be enabled using the `checks` parameter for the `validate` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
+
+ + + + + +
+
+

Deviated Cell

+

This check identifies cells that deviate from the normal ones. To flag a deviated cell, the check compares the length of the characters in each cell with a threshold value. The threshold is either 5000 or a value calculated using Python's built-in statistics module: the average plus three standard deviations. The exact algorithm can be found here. For example:

+

Example

+
+

Download issue-1066.csv to reproduce the examples (right-click and "Save link as").

+
+ +
+
+
from pprint import pprint
+from frictionless import validate, checks
+
+report = validate("issue-1066.csv", checks=[checks.deviated_cell()])
+pprint(report.flatten(["type", "message"]))
+
+ +
[['deviated-cell',
+  'There is a possible error because the cell is deviated: cell at row "35" '
+  'and field "Gestore" has deviated size']]
+ +
+

Reference

+
+ + +
+
+ +

checks.deviated_cell (class)

+ +
+
+ + +
+

checks.deviated_cell (class)

+

Check if the cell size is deviated

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, interval: int = 3, ignore_fields: List[str] = NOTHING) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + interval + (int)
  • +
  • + ignore_fields + (List[str])
  • +
+
+ +
+

checks.deviated_cell.interval (property)

+

Interval specifies the number of standard deviations away from the center. The median is used to find the center of the data. The default value is 3.

+
Signature
+

int

+
+
+

checks.deviated_cell.ignore_fields (property)

+

List of data columns to be skipped by the check. The check will not be applied to any of the data columns listed here. The default value is [].

+
Signature
+

List[str]

+
+ + + + +
+
+

Deviated Value

+

This check uses Python's built-in statistics module to check a field's data for deviations. By default, deviated values are those outside of the average plus or minus three standard deviations. Take a look at the API Reference for more details about available options and default values. The exact algorithm can be found here. For example:

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import validate, checks
+
+source = [["temperature"], [1], [-2], [7], [0], [1], [2], [5], [-4], [1000], [8], [3]]
+report = validate(source, checks=[checks.deviated_value(field_name="temperature")])
+pprint(report.flatten(["type", "message"]))
+
+ +
[['deviated-value',
+  'There is a possible error because the value is deviated: value "1000" in '
+  'row at position "10" and field "temperature" is deviated "[-809.88, '
+  '995.52]"']]
+ +
+

Reference

+
+ + +
+
+ +

checks.deviated_value (class)

+ +
+
+ + +
+

checks.deviated_value (class)

+

Check for deviated values in a field.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, field_name: str, interval: int = 3, average: str = mean) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + field_name + (str)
  • +
  • + interval + (int)
  • +
  • + average + (str)
  • +
+
+ +
+

checks.deviated_value.field_name (property)

+

Name of the field to which the check will be applied. The check will not be applied to other fields.

+
Signature
+

str

+
+
+

checks.deviated_value.interval (property)

+

Interval specifies the number of standard deviations away from the mean. The default value is 3.

+
Signature
+

int

+
+
+

checks.deviated_value.average (property)

+

Specifies the preferred method to calculate the average of the data. The default value is "mean". Supported average calculation methods are "mean", "median", and "mode".

+
Signature
+

str

+
+ + + + +
+
+

Forbidden Value

+

This check ensures that a field doesn't contain any forbidden (denylisted) values.

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import validate, checks
+
+source = b'header\nvalue1\nvalue2'
+checks = [checks.forbidden_value(field_name='header', values=['value2'])]
+report = validate(source, format='csv', checks=checks)
+pprint(report.flatten(['type', 'message']))
+
+ +
[['forbidden-value',
+  'The cell value2 in row at position 3 and field header at position 1 has an '
+  'error: forbidden values are "[\'value2\']"']]
+ +
+

Reference

+
+ + +
+
+ +

checks.forbidden_value (class)

+ +
+
+ + +
+

checks.forbidden_value (class)

+

Check for forbidden values in a field.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, field_name: str, values: List[Any]) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + field_name + (str)
  • +
  • + values + (List[Any])
  • +
+
+ +
+

checks.forbidden_value.field_name (property)

+

The name of the field to apply the check to. The check will not be applied to other fields.

+
Signature
+

str

+
+
+

checks.forbidden_value.values (property)

+

Specify the forbidden values to check for in the field specified by "field_name".

+
Signature
+

List[Any]

+
+ + + + +
+
+

Sequential Value

+

This check gives us an opportunity to validate sequential fields like primary keys or other similar data. The sequence doesn't need to start from 0 or 1; we just provide the field name.

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import validate, checks
+
+source = b'header\n2\n3\n5'
+report = validate(source, format='csv', checks=[checks.sequential_value(field_name='header')])
+pprint(report.flatten(['type', 'message']))
+
+ +
[['sequential-value',
+  'The cell 5 in row at position 4 and field header at position 1 has an '
+  'error: the value is not sequential']]
+ +
+

Reference

+
+ + +
+
+ +

checks.sequential_value (class)

+ +
+
+ + +
+

checks.sequential_value (class)

+

Check that a column has sequential values.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, field_name: str) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + field_name + (str)
  • +
+
+ +
+

checks.sequential_value.field_name (property)

+

The name of the field to apply the check to. The check will not be applied to other fields.

+
Signature
+

str

+
+ + + + +
+
+

Truncated Value

+

Sometimes, during data export from a database or other storage, data values can be truncated. This check tries to detect such truncation. Let's explore some truncation indicators.

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import validate, checks
+
+source = [["int", "str"], ["a" * 255, 32767], ["good", 2147483647]]
+report = validate(source, checks=[checks.truncated_value()])
+pprint(report.flatten(["type", "message"]))
+
+ +
[['truncated-value',
+  'The cell '
+  'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa '
+  'in row at position 2 and field int at position 1 has an error: value  is '
+  'probably truncated'],
+ ['truncated-value',
+  'The cell 32767 in row at position 2 and field str at position 2 has an '
+  'error: value  is probably truncated'],
+ ['truncated-value',
+  'The cell 2147483647 in row at position 3 and field str at position 2 has an '
+  'error: value  is probably truncated']]
+ +
+

Reference

+
+ + +
+
+ +

checks.truncated_value (class)

+ +
+
+ + +
+

checks.truncated_value (class)

+

Check for possible truncated values. This check can be enabled using the `checks` parameter for the `validate` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
+
\ No newline at end of file
diff --git a/docs/checks/row.html b/docs/checks/row.html
new file mode 100644
index 0000000000..f929ad3a31
--- /dev/null
+++ b/docs/checks/row.html
@@ -0,0 +1,3632 @@
+Row Checks | Frictionless Framework

Row Checks

+

Duplicate Row

+

This check finds duplicate rows. Take into account that checking for duplicate rows can lead to high memory consumption on big files. Here is an example.

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import validate, checks
+
+source = b"header\nvalue\nvalue"
+report = validate(source, format="csv", checks=[checks.duplicate_row()])
+pprint(report.flatten(["type", "message"]))
+
+ +
[['duplicate-row',
+  'Row at position 3 is duplicated: the same as row at position "2"']]
+ +
+

Reference

+
+ + +
+
+ +

checks.duplicate_row (class)

+ +
+
+ + +
+

checks.duplicate_row (class)

+

Check for duplicate rows. This check can be enabled using the `checks` parameter for the `validate` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
+
+ + + + + +
+
+

Row Constraint

+

This check is the most powerful one, as it uses the external simpleeval package, allowing you to evaluate arbitrary Python expressions on data rows. Let's look at an example.

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import validate, checks
+
+source = [
+    ["row", "salary", "bonus"],
+    [2, 1000, 200],
+    [3, 2500, 500],
+    [4, 1300, 500],
+    [5, 5000, 1000],
+]
+report = validate(source, checks=[checks.row_constraint(formula="salary == bonus * 5")])
+pprint(report.flatten(["type", "message"]))
+
+ +
[['row-constraint',
+  'The row at position 4 has an error: the row constraint to conform is '
+  '"salary == bonus * 5"']]
+ +
+

Reference

+
+ + +
+
+ +

checks.row_constraint (class)

+ +
+
+ + +
+

checks.row_constraint (class)

+

Check that every row satisfies a provided Python expression.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, formula: str) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + formula + (str)
  • +
+
+ +
+

checks.row_constraint.formula (property)

+

Python expression to apply to all rows. The simpleeval library is used to evaluate the formula.

+
Signature
+

str

+
\ No newline at end of file
diff --git a/docs/checks/table.html b/docs/checks/table.html
new file mode 100644
index 0000000000..23938fe27c
--- /dev/null
+++ b/docs/checks/table.html
@@ -0,0 +1,3691 @@
+Table Checks | Frictionless Framework

Table Checks

+

Table Dimensions

+

This check is used to validate whether your data has the expected dimensions: an exact number of rows, minimum and maximum numbers of rows, an exact number of fields, and minimum and maximum numbers of fields.

+

Basic Example

+ +
+
+
from pprint import pprint
+from frictionless import validate, checks
+
+source = [
+    ["row", "salary", "bonus"],
+    [2, 1000, 200],
+    [3, 2500, 500],
+    [4, 1300, 500],
+    [5, 5000, 1000],
+]
+report = validate(source, checks=[checks.table_dimensions(num_rows=5)])
+pprint(report.flatten(["type", "message"]))
+
+ +
[['table-dimensions',
+  'The data source does not have the required dimensions: number of rows is 4, '
+  'the required is 5']]
+ +
+

Multiple Limits

+

You can also give multiple limits at the same time:

+ +
+
+
from pprint import pprint
+from frictionless import validate, checks
+
+source = [
+    ["row", "salary", "bonus"],
+    [2, 1000, 200],
+    [3, 2500, 500],
+    [4, 1300, 500],
+    [5, 5000, 1000],
+]
+report = validate(source, checks=[checks.table_dimensions(num_rows=5, num_fields=4)])
+pprint(report.flatten(["type", "message"]))
+
+ +
[['table-dimensions',
+  'The data source does not have the required dimensions: number of fields is '
+  '3, the required is 4'],
+ ['table-dimensions',
+  'The data source does not have the required dimensions: number of rows is 4, '
+  'the required is 5']]
+ +
+

Using Declaratively

+

It is possible to use the check declaratively:

+ +
+
+
from pprint import pprint
+from frictionless import Check, validate, checks
+
+source = [
+    ["row", "salary", "bonus"],
+    [2, 1000, 200],
+    [3, 2500, 500],
+    [4, 1300, 500],
+    [5, 5000, 1000],
+]
+
+check = Check.from_descriptor({"type": "table-dimensions", "minFields": 4, "maxRows": 3})
+report = validate(source, checks=[check])
+pprint(report.flatten(["type", "message"]))
+
+ +
[['table-dimensions',
+  'The data source does not have the required dimensions: number of fields is '
+  '3, the minimum is 4'],
+ ['table-dimensions',
+  'The data source does not have the required dimensions: number of rows is 4, '
+  'the maximum is 3']]
+ +
+

Note that, when used declaratively, the table dimensions check arguments num_rows, min_rows, max_rows, num_fields, min_fields and max_fields must be passed in camelCase format, as in the example above, i.e. numRows, minRows, maxRows, numFields, minFields and maxFields.
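The same camelCase descriptor can also be wrapped into a checklist. A minimal sketch; it assumes a checklist descriptor with a checks list and that validate accepts a checklist argument:

from pprint import pprint
from frictionless import Checklist, validate

source = [
    ["row", "salary", "bonus"],
    [2, 1000, 200],
]

# the same declarative check expressed as a checklist descriptor
checklist = Checklist.from_descriptor(
    {"checks": [{"type": "table-dimensions", "minFields": 4, "maxRows": 3}]}
)
report = validate(source, checklist=checklist)
pprint(report.flatten(["type", "message"]))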

+

Reference

+
+ + +
+
+ +

checks.table_dimensions (class)

+ +
+
+ + +
+

checks.table_dimensions (class)

+

Check for minimum and maximum table dimensions.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, num_rows: Optional[int] = None, min_rows: Optional[int] = None, max_rows: Optional[int] = None, num_fields: Optional[int] = None, min_fields: Optional[int] = None, max_fields: Optional[int] = None) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + num_rows + (Optional[int])
  • +
  • + min_rows + (Optional[int])
  • +
  • + max_rows + (Optional[int])
  • +
  • + num_fields + (Optional[int])
  • +
  • + min_fields + (Optional[int])
  • +
  • + max_fields + (Optional[int])
  • +
+
+ +
+

checks.table_dimensions.num_rows (property)

+

Specify the number of rows to compare with the actual number of rows in the table. If the actual number of rows is less than num_rows, the user will be notified with an error.

+
Signature
+

Optional[int]

+
+
+

checks.table_dimensions.min_rows (property)

+

Specify the minimum number of rows that should be in the table. If the actual number of rows is less than min_rows, the user will be notified with an error.

+
Signature
+

Optional[int]

+
+
+

checks.table_dimensions.max_rows (property)

+

Specify the maximum number of rows allowed. If the actual number of rows is more than max_rows, the user will be notified with an error.

+
Signature
+

Optional[int]

+
+
+

checks.table_dimensions.num_fields (property)

+

Specify the number of fields to compare with the actual number of fields in the table. If the actual number of fields is less than num_fields, the user will be notified with an error.

+
Signature
+

Optional[int]

+
+
+

checks.table_dimensions.min_fields (property)

+

Specify the minimum number of fields that should be in the table. If the actual number of fields is less than min_fields, the user will be notified with an error.

+
Signature
+

Optional[int]

+
+
+

checks.table_dimensions.max_fields (property)

+

Specify the maximum number of expected fields. If the actual number of fields is more than max_fields, the user will be notified with an error.

+
Signature
+

Optional[int]

+
\ No newline at end of file
diff --git a/docs/codebase/authors.html b/docs/codebase/authors.html
new file mode 100644
index 0000000000..26ae9862d3
--- /dev/null
+++ b/docs/codebase/authors.html
@@ -0,0 +1,3507 @@
+Authors | Frictionless Framework

Authors

+
+

This page is powered by contributors-img

+
+

This package is a collective effort made by many great people working on various projects. You can click on the pictures below to see their contribution in detail.

+

frictionless-py

+ + + +

datapackage-py

+ + + +

tableschema-py

+ + + +

tableschema-bigquery-py

+ + + +

tableschema-ckan-datastore-py

+ + + +

tableschema-elasticsearch-py

+ + + +

tableschema-pandas-py

+ + + +

tableschema-sql-py

+ + + +

tableschema-spss-py

+ + + +

tabulator-py

+ + + + + +
\ No newline at end of file
diff --git a/docs/codebase/changelog.html b/docs/codebase/changelog.html
new file mode 100644
index 0000000000..a0f83eff1c
--- /dev/null
+++ b/docs/codebase/changelog.html
@@ -0,0 +1,4183 @@
+Changelog | Frictionless Framework

Changelog

+

Only the breaking and most significant changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted commit history.

+

v5.18

+
    +
  • Support ignore_constraints option for the Indexer (#1691)
  • +
  • Various bug fixes
  • +
+

v5.17.1

+
    +
  • fix: deprecated dependencies (PR 1674)
  • +
  • fix: unexpected "missing-label" error with option header_case = False (#1635)
  • +
  • fix: KeyError when a "primaryKey" is missing (#1633)
  • +
  • fix: unexpected field-error for a boolean "example" with "trueValues" or +"falseValues" properties (#1610)
  • +
+

v5.15

+
    +
  • Local development has been migrated to using Hatch
  • +
+

v5.14

+
    +
  • Rebased packaging on PEP 621
  • +
  • Extracted experimental application/server from the codebase
  • +
+

v5.13

+
    +
  • Implemented "Metadata.from_descriptor(allow_invalid=False)" (#1501)
  • +
+

v5.10

+
    +
  • +Various architectural and standards-compatibility improvements (minor breaking changes):
      +
    • +Added new Console commands:
        +
      • list
      • +
      • explore
      • +
      • query
      • +
      • script
      • +
      • convert
      • +
      • publish
      • +
      +
    • +
    • Rebased Console commands on Rich (nice output in the Console)
    • +
    • Fixed extract returning results that depend on the source type (now it's always a dictionary indexed by the resource name)
    • +
    • Enforced type safety -- many tabular commands will be marked as impossible for non-tabular resources if a type checker is used
    • +
    • Improved frictionless.Resource(source) guessing abilities; if you just want to open a table resource, use frictionless.resources.TableResource(path=path)
    • +
    +
  • +
+

v5.8

+
    +
  • Implemented catalog/dataset/package/resource.dereference (#1451)
  • +
+

v5.7

+
    +
  • +Various architectural and standards-compatibility improvements (minor breaking changes):
      +
    • Improved type detection mechanism (including remote descriptors)
    • +
    • Added resources module including File/Text/Json/TableResource
    • +
    • Deprecated resource.type argument -- use the classes above
    • +
    • Changed catalog.packages[] to catalog.datasets[].package
    • +
    • Made resource.schema optional (resource.has_schema is removed)
    • +
    • Made resource.normpath optional (resource.normdata is removed)
    • +
    • Standards-compatibility improvements: profile, stats
    • +
    • Renamed system/plugin.select_Check/etc to system/plugin.select_check_class/etc
    • +
    +
  • +
+

v5.6

+
    +
  • Added support for sqlalchemy@2 (#1427)
  • +
+

v5.5

+
    +
  • Implemented program/resource.index preview (#1395)
  • +
+

v5.4

+
    +
  • Support dialect.skip_blank_rows (#1387)
  • +
+

v5.3

+
    +
  • Support steps.resource_update for resource transformations (#1381)
  • +
+

v5.2

+
    +
  • Added support for wkt format in fields.StringField (#1363 by @jze)
  • +
+

v5.1

+
    +
  • Support descriptor argument for actions/program.extract (#1372)
  • +
+

v5.0

+
    +
  • Frictionless Framework (v5) is out of Beta and released on PyPi
  • +
+

v5.0.0b19

+ +

v5.0.0b8

+
    +
  • ForeignKeyError has been extended with additional information: fieldNames, fieldCells, referenceName, and referenceFieldNames
  • +
+

v5.0.0b2

+ +

v5.0.0b1

+ +

v4.40

+
    +
  • Added Dialect support to packages (#1137)
  • +
+

v4.39

+
    +
  • Fixed processing of incompatible decimal char in table schema and data (#1089)
  • +
  • Added support for Time Zone data (#1097)
  • +
  • Improved validation messages by adding summary and partial validation details (#1106)
  • +
  • +Implemented new feature summary (#1127)
      +
    • schema.to_summary
    • +
    • report.to_summary
    • +
    • Added CLI command summary
    • +
    +
  • +
  • Fixed file compression package.to_zip (#1104)
  • +
  • Implemented feature to validate single resource (#1112)
  • +
  • Improved error message to notify about invalid fields (#1117)
  • +
  • Fixed type conversion of NaN values for data of type Int64 (#1115)
  • +
  • Exposed valid/invalid flags in CLI extract command (#1130)
  • +
  • Implemented feature package.to_er_diagram (#1135)
  • +
+

v4.38

+
    +
  • Implemented checks.ascii_value (#1064)
  • +
  • Implemented checks.deviated_cell (#1069)
  • +
  • Implemented detector.field_true/false_values (#1074)
  • +
+

v4.37

+
    +
  • +Deprecated high-level legacy actions (use class-based alternatives):
      +
    • describe_*
    • +
    • extract_*
    • +
    • transform_*
    • +
    • validate_*
    • +
    +
  • +
+

v4.36

+
    +
  • +Implemented pipeline actions:
      +
    • pipeline.validate (will replace validate_pipeline in v5)
    • +
    • pipeline.transform (will replace transform_pipeline in v5)
    • +
    +
  • +
  • +Implemented inquiry actions:
      +
    • inquiry.validate (will replace validate_inquiry in v5)
    • +
    +
  • +
+

v4.35

+
    +
  • +Implemented schema actions:
      +
    • Schema.describe (will replace describe_schema in v5)
    • +
    • schema.validate (will replace validate_schema in v5)
    • +
    +
  • +
  • +Implemented new transform steps:
      +
    • steps.field_merge
    • +
    • steps.field_pack
    • +
    +
  • +
+

v4.34

+
    +
  • +Implemented package actions:
      +
    • Package.describe (will replace describe_package in v5)
    • +
    • package.extract (will replace extract_package in v5)
    • +
    • package.validate (will replace validate_package in v5)
    • +
    • package.transform (will replace transform_package in v5)
    • +
    +
  • +
+

v4.33

+
    +
  • +Implemented resource actions:
      +
    • Resource.describe (will replace describe_resource in v5)
    • +
    • resource.extract (will replace extract_resource in v5)
    • +
    • resource.validate (will replace validate_resource in v5)
    • +
    • resource.transform (will replace transform_resource in v5)
    • +
    +
  • +
+

v4.32

+
    +
  • Added to_markdown() feature to metadata (#1052)
  • +
+

v4.31

+
    +
  • Added a feature that allows to export table schema as excel (#1040)
  • +
  • Added nontabular note to validation results to indicate nontabular file (#1046)
  • +
  • Excel stats now shows bytes and hash (#1045)
  • +
  • Added pprint feature which displays metadata in a readable and pretty way (#1039)
  • +
  • Improved error message if resource.data is not a string (#1036)
  • +
+

v4.29

+
    +
  • Made Detector's private properties public and writable (#1025)
  • +
+

v4.28

+
    +
  • Improved an order of the metadata in YAML representation
  • +
+

v4.27

+
    +
  • Exposed Dialect options via CLI such as sheet, table, keys, and keyed (#886)
  • +
+

v4.26

+
    +
  • Validate 'schema.fields[].example' (#998)
  • +
+

v4.25

+
    +
  • Allows descriptors that subclass collections.abc.Mapping (#985)
  • +
+

v4.24

+ +

v4.23

+
    +
  • Added table dimensions check (#985)
  • +
+

v4.22

+
    +
  • Added "extract --trusted" flag
  • +
+

v4.21

+
    +
  • Added "--json/yaml" CLI options for transform
  • +
+

v4.20

+
    +
  • Improved layout/schema detection algorithms (#945)
  • +
+

v4.19

+
    +
  • Renamed inlineDialect.keys to inlineDialect.data_keys due to a conflict with dict.keys property
  • +
+

v4.18

+
    +
  • Normalized metadata properties (increased type safety)
  • +
+

v4.17

+
    +
  • Add fields, limit, sort and filter options to CkanDialect (#912)
  • +
+

v4.16

+
    +
  • Implemented system/plugin.create_candidates (#893)
  • +
+

v4.15

+
    +
  • Implemented system.get/use_http_session (#892)
  • +
+

v4.14

+
    +
  • SQL Where Clause (#882)
  • +
+

v4.13

+
    +
  • Implemented descriptor type detection for extract/validate (#881)
  • +
+

v4.12

+
    +
  • Support external profiles for data package (#864)
  • +
+

v4.11

+
    +
  • Added json argument to resource.to_snap
  • +
+

v4.10

+
    +
  • Support resource/field renaming in transform (#843)
  • +
+

v4.9

+
    +
  • Support --path CLI argument (#829)
  • +
+

v4.8

+
    +
  • Added support for Package(innerpath) argument for unzipping a data package's descriptor
  • +
+

v4.7

+
    +
  • Support control/dialect as JSON in CLI (#806)
  • +
+

v4.6

+
    +
  • Implemented describe_dialect and describe(path, type="dialect")
  • +
  • Support --dialect argument in CLI
  • +
+

v4.5

+
    +
  • Implemented Schema.from_jsonschema (#797)
  • +
+

v4.4

+
    +
  • Use field.constraints.maxLength for SQL's VARCHAR (#795)
  • +
+

v4.3

+
    +
  • Implemented resource.to_view() (#781)
  • +
+

v4.2

+
    +
  • Make fields[].arrayItem errors more granular (#767)
  • +
+

v4.1

+
    +
  • Added support for fields[].arrayItem (#750)
  • +
+

v4.0

+
    +
  • Released frictionless@4 :tada:
  • +
+

v4.0.0a15

+
    +
  • +Updated loaders (#658) (BREAKING)
      +
    • Renamed filelike loader to stream loader
    • +
    • Migrated from text loader to buffer loader
    • +
    +
  • +
+

v4.0.0a14

+
    +
  • +Improve transform API (#657) (BREAKING)
      +
    • Switched to the transform_resource(resource) signature
    • +
    • Switched to the transform_package(package) signature
    • +
    +
  • +
+

v4.0.0a13

+
    +
  • +Improved resource/package import/export (#655) (BREAKING)
      +
    • Reworked parser.write_row_stream API
    • +
    • Reworked resource.from/to API
    • +
    • Reworked package.from/to API
    • +
    • Reworked Storage API
    • +
    • Reworked system.create_storage API
    • +
    • Merged PandasStorage into PandasParser
    • +
    • Merged SpssStorage into SpssParser
    • +
    +
  • +
+

v4.0.0a12

+
    +
  • +Improved transformation steps (#650) (BREAKING)
      +
    • Split value/formula/function concepts
    • +
    • Renamed a few minor step arguments
    • +
    +
  • +
+

v4.0.0a11

+
    +
  • +Improved layout and data streams concepts (#648) (BREAKING)
      +
    • Renamed data_stream to list_stream
    • +
    • Renamed readData to readLists
    • +
    • Renamed sample to fragment (sample now is raw lists)
    • +
    • Implemented loader.buffer
    • +
    • Implemented parser.sample
    • +
    • Added support for function based checks
    • +
    • Added support for function based steps
    • +
    +
  • +
+

v4.0.0a10

+
    +
  • Reworked Error.tags (BREAKING)
  • +
  • Reworked Check API and split labels/header (BREAKING)
  • +
+

v4.0.0a9

+
    +
  • +Rebased on Detector class (BREAKING)
      +
    • Migrated all infer_*, sync/patch_schema and detect_encoding parameters to Detector
    • +
    • Made resource.infer omit empty objects
    • +
    • Added resource.read_*(size) argument
    • +
    • Added resource.labels property
    • +
    +
  • +
+

v4.0.0a8

+
    +
  • +Improved checks/steps API (#621) (BREAKING)
      +
    • Updated validate(extra_checks=[...]) to validate(checks=[{"code": 'code', ...}])
    • +
    +
  • +
+

v4.0.0a7

+
    +
  • +Updated describe/extract/transform/validate APIs (BREAKING)
      +
    • Removed validate_table (use validate_resource)
    • +
    • Removed legacy Table and File classes
    • +
    • Removed dataflows plugin
    • +
    • Replaced nopool by parallel (not parallel by default)
    • +
    • Renamed report.tables to report.tasks
    • +
    • Rebased on report.tasks[].resource (instead of plain path/scheme/format/etc)
    • +
    • Flatten Pipeline steps signature
    • +
    +
  • +
+

v4.0.0a6

+
    +
  • +Introduced Layout class (BREAKING)
      +
    • Renamed Query class and arguments/properties to Layout
    • +
    • Moved header options from Dialect to Layout
    • +
    +
  • +
+

v4.0.0a5

+
    +
  • +Updated transform API
      +
    • Added transform(type) argument
    • +
    +
  • +
+

v4.0.0a4

+
    +
  • +Updated describe API (BREAKING)
      +
    • Renamed describe(source_type) argument to type
    • +
    +
  • +
+

v4.0.0a3

+
    +
  • +Updated extract API (BREAKING)
      +
    • Removed extract_table (use extract_resource with the same API)
    • +
    • Renamed extract(source_type) argument to type
    • +
    +
  • +
+

v4.0.0a1

+
    +
  • +Initial API/codebase improvements for v4 (BREAKING)
      +
    • Allow Package/Resource(source) notation (guess descriptor/path/etc)
    • +
    • Renamed schema.infer -> Schema.from_sample
    • +
    • Renamed resource.inline -> resource.memory
    • +
    • Renamed compression_path -> innerpath
    • +
    • Renamed compression: no -> compression: ""
    • +
    • Updated Package/Resource.infer not to infer stats (use stats=True)
    • +
    • Removed Package/Resource.infer(only_sample) argument
    • +
    • Removed Resource.from/to_zip (use Package.from/to_zip)
    • +
    • Removed Resource.source (use Resource.data or Resource.fullpath)
    • +
    • Removed package/resource.infer(source) argument (use constructors)
    • +
    • Added some new API (will be covered in the updated docs after the v4 release)
    • +
    +
  • +
+

v3.48

+
    +
  • +Make Resource independent from Table/File (#607) (BREAKING)
      +
    • Resource can be opened like Table (it's recommended to use Resource instead of Table)
    • +
    • Renamed resource.read_sample() to resource.sample
    • +
    • Renamed resource.read_header() to resource.header
    • +
    • Renamed resource.read_stats() to resource.stats
    • +
    • Removed resource.to_table()
    • +
    • Removed resource.to_file()
    • +
    +
  • +
+

v3.47

+
    +
  • +Optimize Row/Header/Table and rename header errors (#601) (BREAKING)
      +
    • Row object is now lazy; it casts data on-demand preserving the same API
    • +
    • Method resource/table.read_data(_stream) now includes a header row if present
    • +
    • Renamed errors.ExtraHeaderError->ExtraLabelError (extra-label-error)
    • +
    • Renamed errors.MissingHeaderError->MissingLabelError (missing-label-error)
    • +
    • Renamed errors.BlankHeaderError->BlankLabelError (blank-label-error)
    • +
    • Renamed errors.DuplicateHeaderError->DuplicateLabelError (duplicate-label-error)
    • +
    • Renamed errors.NonMatchingHeaderError->IncorrectLabelError (incorrect-label-error)
    • +
    • Renamed schema.read/write_data->read/write_cells
    • +
    +
  • +
+

v3.46

+
    +
  • Renamed aws plugin to s3 (#594) (BREAKING)
  • +
+
$ pip install frictionless[aws] # before
+$ pip install frictionless[s3] # after
+
+

v3.45

+
    +
  • Drafted support for writing Multipart Data (#583)
  • +
+

v3.44

+
    +
  • Added support for writing to Remote Data (#582)
  • +
+

v3.43

+
    +
  • Add support to writing to Google Sheets (#581)
  • +
  • Renamed gsheet plugin/format to gsheets (BREAKING: minor)
  • +
+

v3.42

+
    +
  • Added support for writing to S3 (#580)
  • +
+

v3.41

+
    +
  • Update Loader/Parser API to write to different targets (#579) (BREAKING: minor)
  • +
+

v3.40

+
    +
  • Implemented a standalone multipart loader (#573)
  • +
+

v3.39

+
    +
  • Fixed Header not being an original one (#572)
  • +
  • Fix bad format validation (#571)
  • +
  • Added default errors limit equals to 1000 (#570)
  • +
  • Added support for field.float_number (#569)
  • +
+

v3.38

+
    +
  • Improved ckan plugin (#560)
  • +
+

v3.37

+
    +
  • Remove not working elastic plugin draft (#558)
  • +
+

v3.36

+
    +
  • Support custom types (#557)
  • +
+

v3.35

+
    +
  • Added "resolve" option to "resource/package.to_zip" (#556)
  • +
+

v3.34

+
    +
  • Moved frictionless.controls to frictionless.plugins.* (BREAKING)
  • +
  • Moved frictionless.dialects to frictionless.plugins.* (BREAKING)
  • +
  • Moved frictionless.exceptions.FrictionlessException to frictionless.FrictionlessException (BREAKING)
  • +
  • Moved excel dependencies to frictionless[excel] extras (BREAKING)
  • +
  • Moved json dependencies to frictionless[json] extras (BREAKING)
  • +
  • Consider json files to be a metadata by default (BREAKING)
  • +
+

Code example:

+
# Before
+# pip install frictionless
+from frictionless import dialects, exceptions
+excel_dialect = dialects.ExcelDialect()
+json_dialect = dialects.JsonDialect()
+exception = exceptions.FrictionlessException()
+
+# After
+# pip install frictionless[excel,json]
+from frictionless import FrictionlessException
+from frictionless.plugins.excel import ExcelDialect
+from frictionless.plugins.json import JsonDialect
+excel_dialect = ExcelDialect()
+json_dialect = JsonDialect()
+exception = FrictionlessException()
+
+

v3.33

+
    +
  • Implemented resource.write (#537)
  • +
+

v3.32

+
    +
  • Added url parameter to SQL import/export (#535)
  • +
+

v3.31

+
    +
  • Made tables with header and no data rows valid (#534) (BREAKING: minor)
  • +
+

v3.30

+
    +
  • +Various CLI improvements (#532)
      +
    • Added autocompletion
    • +
    • Added stdin support
    • +
    • Added "extract --csv"
    • +
    • Exposed more options
    • +
    +
  • +
+

v3.29

+
    +
  • Added experimental CKAN support (#528)
  • +
+

v3.28

+
    +
  • Add a "nopool" argument to validate (#527)
  • +
+

v3.27

+
    +
  • Stop sorting keyed sources as the order is now guaranteed by Python (#512) (BREAKING)
  • +
+

v3.26

+
    +
  • Added "nolookup" argument for validate_package (#515)
  • +
+

v3.25

+
    +
  • Add transform functionality (#505)
  • +
  • Methods schema.get/remove_field now raise if not found (#505) (BREAKING)
  • +
  • Methods package.get/remove_resource now raise if not found (#505) (BREAKING)
  • +
+

v3.24

+
    +
  • Lower case resource.scheme/format/hashing/encoding/compression (#499) (BREAKING)
  • +
+

v3.23

+
    +
  • Support "header_case" option for dialects (#488)
  • +
+

v3.22

+
    +
  • Added support for DB2 format (#485)
  • +
+

v3.21

+
    +
  • Improved SPSS plugin (#483)
  • +
  • Improved BigQuery plugin (#470)
  • +
+

v3.20

+
    +
  • Added support for SQL Views (#466)
  • +
+

v3.19

+
    +
  • Rebased AwsLoader on streaming (#460)
  • +
+

v3.18

+
    +
  • Added hashing parameter to describe/describe_package
  • +
  • Removed table.onerror property (BREAKING)
  • +
+

v3.17

+
    +
  • Added timezone for datetime/time parsing (#457) (BREAKING)
  • +
+

v3.16

+
    +
  • Fixed metadata.to_yaml (#455)
  • +
  • Removed the expand argument from metadata.to_dict (BREAKING)
  • +
+

v3.15

+
    +
  • Added native schema support to SqlParser (#452)
  • +
+

v3.14

+
    +
  • Make Resource the main internal interface (#446) (BREAKING: for plugin authors)
  • +
  • Move Resource's stats to resource.stats (BREAKING)
  • +
  • Rename on_error to onerror (BREAKING)
  • +
  • Added resource.stats.fields
  • +
+

v3.13

+
    +
  • Add an on_error argument to Table/Resource/Package (#445)
  • +
+

v3.12

+
    +
  • Added streaming to the extract functions (#442)
  • +
+

v3.11

+
    +
  • Added experimental BigQuery support (#424)
  • +
+

v3.10

+
    +
  • Added experimental SPSS support (#421)
  • +
+

v3.9

+
    +
  • Rebased on a goodtables successor versioning
  • +
+

v3.8

+
    +
  • Added support for SQL/Pandas import/export (#31)
  • +
+

v3.7

+
    +
  • Add support for custom JSONEncoder classes (#24)
  • +
+

v3.6

+
    +
  • Normalize header terminology
  • +
+

v3.5

+
    +
  • Initial public version
  • +
\ No newline at end of file
diff --git a/docs/codebase/contributing.html b/docs/codebase/contributing.html
new file mode 100644
index 0000000000..26f0fc0d75
--- /dev/null
+++ b/docs/codebase/contributing.html
@@ -0,0 +1,3617 @@
+Contributing | Frictionless Framework

Contributing

+

We welcome contributions from anyone! Please read the following guidelines, and feel free to reach out to us if you have questions. Thanks for your interest in helping make Frictionless awesome!

+

Introduction

+

We use GitHub as a code and issue hosting platform. To report a bug or propose a new feature, please open an issue. For pull requests, we ask that you first create an issue and then open a pull request linked to it.

+

Prerequisites

+

To start working on the project:

+
    +
  • Python 3.10+
  • +
+

Install Python headers if they are missing:

+
sudo apt-get install libpython3.10-dev
+
+

Environment

+

For development orchestration we use Hatch for Python (defined in pyproject.toml), and we use make to run high-level commands (defined in the Makefile).

+
pip3 install hatch
+
+

Before starting with the project, we recommend configuring hatch. The following line ensures that all the virtual environments are stored in the .python directory in the project root:

+
hatch config set 'dirs.env.virtual' '.python'
+
+

Now you can set up your IDE to use the proper Python path:

+
.python/frictionless/bin/python
+
+

Enter the virtual environment before starting work. This ensures that all the development dependencies are installed into a virtual environment:

+
hatch shell
+
+

Using Docker

+

Use the following command to build the container:

+ +
+
+
hatch run image
+
+ +
+

This should take care of setting up everything. If the container is built without errors, you can then run commands like hatch inside the container to accomplish various tasks (see the next section for details).

+

To make things easier, we can create an alias:

+ +
+
+
alias "frictionless-dev=docker run --rm -v $PWD:/home/frictionless -it frictionless-dev"
+
+ +
+

Then, for example, to run the tests, we can use:

+ +
+
+
frictionless-dev hatch run test
+
+ +
+

Development

+

Codebase

+

Frictionless is a Python 3.8+ framework, and it uses some common Python tools for the development process (we recommend enabling support for these tools in your IDE):

+
    +
  • linting/formatting: ruff
  • +
  • type checking: pyright
  • +
  • code testing: pytest
  • +
+

You also need git to work on the project.

+

Documentation

+

To contribute to the documentation, please find an article in the docs folder and update its contents. We write our documentation using Livemark. Livemark makes it possible to include examples without writing their output, as it is generated automatically.

+

It's possible to run this documentation portal locally:

+ +
+
+
livemark start
+
+ +
+

Running tests offline

+

The VCR library records the responses from HTTP requests locally as a cassette on its first run. All subsequent calls run against the recorded metadata from the previous HTTP request, which speeds up the testing process. To record a unit test (as a cassette), we mark it with a decorator:

+
@pytest.mark.vcr
+def test_connect_with_server():
+	pass
+
+

The cassette will be recorded as "test_connect_with_server.yaml". A new call is made when the params change. To skip sensitive data, we can use filters:

+
@pytest.fixture(scope="module")
+def vcr_config():
+    return {"filter_headers": ["authorization"]}
+
+

Regenerating cassettes for CKAN

+ +
CKAN_APIKEY=***************************
+
+

Regenerating cassettes for Zenodo

+

Read

+
    +
  • To read, we need to use the live site; the API library uses it by default.
  • +
  • Log in to Zenodo if you have an account and create an access token.
  • +
  • Set the access token in the .env file.
  • +
+
ZENODO_ACCESS_TOKEN=***************************
+
+

Write

+
    +
  • To write, we can use either the live site or the sandbox. We recommend using the sandbox (https://sandbox.zenodo.org/api/).
  • +
  • Log in to Zenodo (sandbox) if you have an account and create an access token.
  • +
  • Set the access token in the .env file.
  • +
+
ZENODO_SANDBOX_ACCESS_TOKEN=***************************
+
+
    +
  • Set base_url in the control params
  • +
+
base_url="https://sandbox.zenodo.org/api/"
+
+

Regenerating cassettes for Github

+
    +
  • Log in to GitHub if you have an account and create an access token (Developer settings > Personal access tokens > Tokens).
  • +
  • Set the access token and other details in the .env file. If the user's email/name is hidden, we need to provide those details as well.
  • +
+
GITHUB_NAME=FD
+GITHUB_EMAIL=frictionlessdata@okfn.org
+GITHUB_ACCESS_TOKEN=***************************
+
+

Releasing

+

To release a new version:

  • check that you have push access to the main branch
  • run hatch version <major|minor|micro> to update the version
  • add changes to CHANGELOG.md if it's not a patch release (major or minor)
  • run hatch run release, which creates a release commit and tag and pushes them to GitHub
  • an actual release will happen on the GitHub CI platform after the tests pass

The MIT License (MIT)

Copyright © 2020 Open Knowledge Foundation

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Migration


Frictionless is a logical continuation of many existing packages created for Frictionless Data, such as datapackage or tableschema. Although most of these packages will be supported going forward, you can migrate to Frictionless, which is Python 3.8+, as it improves many aspects of working with data and metadata. This document also covers migration from one version of the framework to another.


From v4 to v5


Since the initial Frictionless Framework release we had been collecting feedback and analyzing both high-level users' needs and bug reports to identify shortcomings and areas that could be improved in the next version of the framework. Read about the new version of the framework and migration details in this blog post:

  • Frictionless Framework v5 announcement: blog/2022/08-22-frictionless-framework-v5.html

From dataflows


Frictionless Framework provides the frictionless transform function for data transformation. It can be used to migrate from dataflows or datapackage-pipelines:
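
As a hedged sketch of what a former dataflows pipeline can look like (step names come from the frictionless.steps module; the field name and filter condition are illustrative):

from frictionless import Pipeline, Resource, steps

# Declare the transform steps, replacing a dataflows processor chain
pipeline = Pipeline(steps=[
    steps.cell_set(field_name="status", value="ok"),  # illustrative field
    steps.row_filter(formula="id > 1"),  # illustrative condition
])

# Run the pipeline against a resource and read the transformed rows
target = Resource("table.csv").transform(pipeline)
print(target.read_rows())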


From goodtables


Frictionless Framework provides the frictionless validate function, which is at a high level exactly the same as goodtables validate. Also, frictionless describe is an improved version of goodtables init. You need to use the frictionless command instead of the goodtables command:
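
For example, a minimal sketch of the Python equivalents (validate covers the goodtables validate use case, describe covers goodtables init):

from frictionless import describe, validate

# describe infers metadata, as goodtables init used to do
metadata = describe("table.csv")

# validate produces a report, as goodtables validate used to do
report = validate("table.csv")
print(report.valid)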


From datapackage


Frictionless Framework has Package and Resource classes which are almost the same as the ones datapackage has:
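
A minimal sketch, assuming a local datapackage.json descriptor:

from frictionless import Package

# Load a package descriptor and iterate its resources,
# much like datapackage's Package
package = Package("datapackage.json")
for resource in package.resources:
    print(resource.name, resource.read_rows())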


From tableschema


Frictionless Framework has Schema and Field classes which are almost the same as the ones tableschema has:
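
A minimal sketch; Schema.describe infers a schema from a data file, similar to tableschema's infer:

from frictionless import Schema

# Infer a schema from data and inspect its fields
schema = Schema.describe("table.csv")
for field in schema.fields:
    print(field.name, field.type)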


From tabulator


Frictionless has a Resource class which is an equivalent of tabulator's Stream class:
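
A minimal sketch of the Stream-like usage:

from frictionless import Resource

# Open a resource and stream its rows, as tabulator's Stream used to do
with Resource("table.csv") as resource:
    print(resource.header)
    for row in resource.row_stream:
        print(row)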


Convert


With the convert command you can quickly convert a tabular data file from one format to another (or the same format with a different dialect):


Format Conversion


For example, let's convert a CSV file into an Excel file:

frictionless convert table.csv table.xlsx
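
The Python counterpart is a minimal sketch with Resource.write (assuming a local table.csv):

from frictionless import Resource

# Write the resource to a new path; the format is inferred from the extension
Resource("table.csv").write("table.xlsx")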

Downloading Files


The command can be used for downloading files as well. For example, let's cherry-pick one CSV file from a Zenodo dataset:

frictionless convert https://zenodo.org/record/3977957 --name aaawrestlers --to-path test.csv

Dialect Updates


Consider that we want to change the CSV delimiter:

frictionless convert table.csv table-copy.csv --csv-delimiter ;

Describe


With the Frictionless describe command you can get the metadata of a file or a dataset.


Normal Mode


By default, it outputs metadata visually formatted:

frictionless describe tables/*.csv

─────────────────────────────────── Dataset ────────────────────────────────────
               dataset
┏━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ name   ┃ type  ┃ path              ┃
┡━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ chunk1 │ table │ tables/chunk1.csv │
│ chunk2 │ table │ tables/chunk2.csv │
└────────┴───────┴───────────────────┘
──────────────────────────────────── Tables ────────────────────────────────────
       chunk1
┏━━━━━━━━━┳━━━━━━━━┓
┃ id      ┃ name   ┃
┡━━━━━━━━━╇━━━━━━━━┩
│ integer │ string │
└─────────┴────────┘
       chunk2
┏━━━━━━━━━┳━━━━━━━━┓
┃ id      ┃ name   ┃
┡━━━━━━━━━╇━━━━━━━━┩
│ integer │ string │
└─────────┴────────┘

YAML/JSON Mode


It's possible to output as YAML or JSON, for example:

frictionless describe tables/*.csv --yaml

resources:
  - name: chunk1
    type: table
    path: tables/chunk1.csv
    scheme: file
    format: csv
    mediatype: text/csv
    encoding: utf-8
    schema:
      fields:
        - name: id
          type: integer
        - name: name
          type: string
  - name: chunk2
    type: table
    path: tables/chunk2.csv
    scheme: file
    format: csv
    mediatype: text/csv
    encoding: utf-8
    schema:
      fields:
        - name: id
          type: integer
        - name: name
          type: string
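
The same metadata is available from Python; a minimal sketch:

from frictionless import describe

# Describe a single file and render its metadata as YAML
resource = describe("tables/chunk1.csv")
print(resource.to_yaml())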

Explore


With the explore command you can open your dataset in VisiData, which is an amazing visual tool for working with tabular data in the console. For example, try "Shift+F" to create data histograms!


Installation

pip install frictionless[visidata]
pip install frictionless[visidata,zenodo] # for examples in this tutorial

Example


For example, let's explore this interesting dataset:

frictionless explore https://zenodo.org/record/3977957

Documentation


Before entering VisiData, it's highly recommended to read its documentation.


You can read it in the console as well:

vd --help

vd(1)                        Quick Reference Guide                       vd(1)
+
+NAME
+     VisiData — a terminal utility for exploring and arranging tabular data
+
+SYNOPSIS
+     vd [options] [input ...]
+     visidata [options] [input ...]
+     vd [options] --play cmdlog [-w waitsecs] [--batch] [-i] [-o output]
+        [field=value]
+     vd [options] [input ...] +toplevel:subsheet:row:col
+
+DESCRIPTION
+     VisiData is an easy-to-use multipurpose tool to explore, clean, edit, and
+     restructure data. Rows can be selected, filtered, and grouped; columns
+     can be rearranged, transformed, and derived via regex or Python expres‐
+     sions; and workflows can be saved, documented, and replayed.
+
+   REPLAY MODE
+     -p, --play=cmdlog       replay a saved cmdlog within the interface
+     -w, --replay-wait=seconds
+                             wait seconds between commands
+     -b, --batch             replay in batch mode (with no interface)
+     -i, --interactive       launch VisiData in interactive mode after batch
+     -o, --output=file       save final visible sheet to file as .tsv
+     field=value             replace "{field}" in cmdlog contents with value
+
+   Commands During Replay
+        ^K                   cancel current replay
+
+   GLOBAL COMMANDS
+     All keystrokes are case sensitive. The ^ prefix is shorthand for Ctrl.
+
+   Keystrokes to start off with
+      ^Q              abort program immediately
+      ^C              cancel user input or abort all async threads on current
+                      sheet
+     g^C              abort all secondary threads
+       q              quit current sheet or menu
+       Q              quit current sheet and free associated memory
+      gq              quit all sheets (clean exit)
+
+     Alt+H            activate help menu (Enter/left-mouse to expand submenu
+                      or execute command)
+     g^H              view this man page
+     z^H              view sheet of command longnames and keybindings for cur‐
+                      rent sheet
+
+      gb              open sidebar in a new sheet
+       b              toggle sidebar
+
+       U              undo the most recent modification (requires enabled
+                      options.undo)
+       R              redo the most recent undo (requires enabled
+                      options.undo)
+
+     Space longname   open command palette; execute top command by its
+                      longname
+
+     Command Palette
+        Tab              Move to command palette, and cycle through commands
+        0-9              Execute numbered command
+        Enter            Execute highlighted command
+
+   Cursor Movement
+     Arrow PgUp       go as expected
+      h   j   k   l   go left/down/up/right
+     gh  gj  gk  gl   go all the way to the left/bottom/top/right of sheet
+          G  gg       go all the way to the bottom/top of sheet
+     Ic. End  Home    go all the way to the bottom/top of sheet
+     ^B  ^F           scroll one page back/forward
+     ^Left ^Right     scroll one page left/right
+     zz               scroll current row to center of screen
+
+     ^^ (Ctrl+^)      jump to previous sheet (swaps with current sheet)
+
+      /   ? regex     search for regex forward/backward in current column's
+                      displayed values
+     g/  g? regex     search for regex forward/backward over all visible
+                      columns' displayed values
+     z/  z? expr      search by Python expr forward/backward in current column
+                      (with column names as variables)
+      n   N           go to next/previous match from last regex search
+
+      <   >           go up/down current column to next value
+     z<  z>           go up/down current column to next null value
+      {   }           go up/down current column to next selected row
+
+      c regex         go to next column with name matching regex
+      r regex         go to next row with key matching regex
+     zc  zr number    go to column/row number (0-based)
+
+      H   J   K   L   slide current row/column left/down/up/right
+     gH  gJ  gK  gL   slide current row/column all the way to the left/bot‐
+                      tom/top/right of sheet
+     zH  zJ  zK  zK number
+                      slide current row/column number positions to the
+                      left/down/up/right
+
+     zh  zj  zk  zl   scroll one left/down/up/right
+
+   Column Manipulation
+       _ (underbar)   toggle width of current column between full and default
+                      width
+      g_              toggle widths of all visible columns between full and
+                      default width
+      z_ number       adjust width of current column to number
+     gz_ number       adjust widths of all visible columns to Ar number
+
+      - (hyphen)      hide current column
+     z-               reduce width of current column by half
+     gv               unhide all columns
+
+     ! z!             toggle/unset current column as a key column
+     ~  #  %  $  @  z#
+                      set type of current column to str/int/float/cur‐
+                      rency/date/len
+     Alt++  Alt+-     show more/less precision in current numerical column
+       ^              rename current column
+      g^              rename all unnamed visible columns to contents of se‐
+                      lected rows (or current row)
+      z^              rename current column to combined contents of current
+                      cell in selected rows (or current row)
+     gz^              rename all visible columns to combined contents of cur‐
+                      rent column for selected rows (or current row)
+
+       = expr         create new column from Python expr, with column names,
+                      and attributes, as variables
+      g= expr         set current column for selected rows to result of Python
+                      expr
+     gz= expr         set current column for selected rows to the items in
+                      result of Python sequence expr
+      z= expr         evaluate Python expression on current row and set
+                      current cell with result of Python expr
+
+       i              add column with incremental values
+      gi              set current column for selected rows to incremental
+                      values
+      zi step         add column with values at increment step
+     gzi step         set current column for selected rows at increment step
+
+      ' (tick)        add a frozen copy of current column with all cells eval‐
+                      uated
+     g'               open a frozen copy of current sheet with all visible
+                      columns evaluated
+     z'  gz'          add/reset cache for current/all visible column(s)
+
+     Note that regex operations apply to the displayed value in a cell.
+      : regex         add new columns from regex split; number of columns
+                      determined by example row at cursor
+      ; regex         add new columns from capture groups of regex (also
+                      requires example row)
+     z; expr          create new column from bash expr, with $columnNames as
+                      variables
+      * regex/subst   add column derived from current column, replacing regex
+                      with subst (may include \1 backrefs)
+     g*  gz* regex/subst
+                      modify selected rows in current/all visible column(s),
+                      replacing regex with subst (may include \1 backrefs)
+
+      (   g(          expand current/all visible column(s) of lists (e.g. [3])
+                      or dicts (e.g. {3}) one level
+     z(  gz( depth    expand current/all visible column(s) of lists (e.g. [3])
+                      or dicts (e.g. {3}) to given depth (0= fully)
+      )   g(          unexpand current/all visible column(s); restore original
+                      column and remove other columns at this level
+     z)  gz) depth    contract current/all visible column(s) of former lists
+                      (e.g. [3]) or dicts (e.g. {3}) to given depth (0= fully)
+     zM               row-wise expand current column of lists (e.g. [3]) or
+                      dicts (e.g. {3}) within that column
+
+   Row Selection
+       s   t   u      select/toggle/unselect current row
+      gs  gt  gu      select/toggle/unselect all rows
+      zs  zt  zu      select/toggle/unselect all rows from top to cursor
+     gzs gzt gzu      select/toggle/unselect all rows from cursor to bottom
+      |   \ regex     select/unselect rows matching regex in current column
+     g|  g\ regex     select/unselect rows matching regex in any visible
+                      column
+     z|  z\ expr      select/unselect rows matching Python expr in any visible
+                      column
+      , (comma)       select rows matching display value of current cell in
+                      current column
+     g,               select rows matching display value of current row in all
+                      visible columns
+     z, gz,           select rows matching typed value of current cell/row in
+                      current column/all visible columns
+
+   Row Sorting/Filtering
+       [    ]         sort ascending/descending by current column; replace any
+                      existing sort criteria
+      g[   g]         sort ascending/descending by all key columns; replace
+                      any existing sort criteria
+      z[   z]         sort ascending/descending by current column; add to ex‐
+                      isting sort criteria
+     gz[  gz]         sort ascending/descending by all key columns; add to ex‐
+                      isting sort criteria
+      "               open duplicate sheet with only selected rows
+     g"               open duplicate sheet with all rows
+     gz"              open duplicate sheet with deepcopy of selected rows
+
+     The rows in these duplicated sheets (except deepcopy) are references to
+     rows on the original source sheets, and so edits to the filtered rows
+     will naturally be reflected in the original rows.  Use g' to freeze sheet
+     contents in a deliberate copy.
+
+   Editing Rows and Cells
+       a   za         append blank row/column; appended columns cannot be
+                      copied to clipboard
+      ga  gza number  append number blank rows/columns
+       d   gd         delete current/selected row(s)
+       y   gy         yank (copy) current/all selected row(s) to clipboard in
+                      Memory Sheet
+       x   gx         cut (copy and delete) current/all selected row(s) to
+                      clipboard in Memory Sheet
+      zy  gzy         yank (copy) contents of current column for
+                      current/selected row(s) to clipboard in Memory Sheet
+      zd  gzd         set contents of current column for current/selected
+                      row(s) to options.null_value
+      zx  gzx         cut (copy and delete) contents of current column for
+                      current/selected row(s) to clipboard in Memory Sheet
+       p    P         paste clipboard rows after/before current row
+      zp  gzp         set cells of current column for current/selected row(s)
+                      to last clipboard value
+      zP  gzP         paste to cells of current column for current/selected
+                      row(s) using the system clipboard
+       Y   gY         yank (copy) current/all selected row(s) to system
+                      clipboard (using options.clipboard_copy_cmd)
+      zY  gzY         yank (copy) contents of current column for
+                      current/selected row(s) to system clipboard (using
+                      options.clipboard_copy_cmd)
+       f              fill null cells in current column with contents of non-
+                      null cells up the current column
+       e text         edit contents of current cell
+      ge text         set contents of current column for selected rows to text
+
+     Commands While Editing Input
+        Enter  ^C        accept/abort input
+        ^O  g^O          open external $EDITOR to edit contents of current/se‐
+                         lected rows in current column
+        ^R               reload initial value
+        ^A   ^E          go to beginning/end of line
+        ^B   ^F          go back/forward one character
+        ^←   ^→ (arrow)  go back/forward one word
+        ^H   ^D          delete previous/current character
+        ^T               transpose previous and current characters
+        ^U   ^K          clear from cursor to beginning/end of line
+        ^Y               paste from cell clipboard
+        Backspace  Del   delete previous/current character
+        Insert           toggle insert mode
+        Up  Down         set contents to previous/next in history
+        Tab  Shift+Tab   move cursor left/right and re-enter edit mode
+        Shift+Arrow      move cursor in direction of Arrow and re-enter edit
+                         mode
+
+   Data Toolkit
+      o input         open input in VisiData
+     zo               open file or url from path in current cell
+     ^S g^S filename  save current/all sheet(s) to filename in format
+                      determined by extension (default .tsv)
+                      Note: if the format does not support multisave, or the
+                      filename ends in a /, a directory will be created.
+     z^S filename     save current column only to filename in format
+                      determined by extension (default .tsv)
+     ^D filename.vdj  save CommandLog to filename.vdj file
+     A                open new blank sheet with one column
+     T                open new sheet that has rows and columns of current
+                      sheet transposed
+
+      + aggregator    add aggregator to current column (see Frequency Table)
+     z+ aggregator    display result of aggregator over values in selected
+                      rows for current column; store result in Memory Sheet
+      &               append top two sheets in Sheets Stack
+     g&               append all sheets in Sheets Stack
+
+      w nBefore nAfter
+                      add column where each row contains a list of that row,
+                      nBefore rows, and nAfter rows
+
+   Data Visualization
+      . (dot)       plot current numeric column vs key columns. The numeric
+                    key column is used for the x-axis; categorical key column
+                    values determine color.
+     g.             plot a graph of all visible numeric columns vs key
+                    columns.
+
+     If rows on the current sheet represent plottable coordinates (as in .shp
+     or vector .mbtiles sources),  . plots the current row, and g. plots all
+     selected rows (or all rows if none selected).
+
+     Canvas-specific Commands
+         +   -              increase/decrease zoom level, centered on cursor
+         _ (underbar)       zoom to fit full extent
+        z_ (underbar)       set aspect ratio
+         x xmin xmax        set xmin/xmax on graph
+         y ymin ymax        set ymin/ymax on graph
+         s   t   u          select/toggle/unselect rows on source sheet con‐
+                            tained within canvas cursor
+        gs  gt  gu          select/toggle/unselect rows on source sheet visi‐
+                            ble on screen
+         d                  delete rows on source sheet contained within can‐
+                            vas cursor
+        gd                  delete rows on source sheet visible on screen
+         Enter              open sheet of source rows contained within canvas
+                            cursor
+        gEnter              open sheet of source rows visible on screen
+         1 - 9              toggle display of layers
+        ^L                  redraw all pixels on canvas
+         v                  toggle show_graph_labels option
+        mouse scrollwheel   zoom in/out of canvas
+        left click-drag     set canvas cursor
+        right click-drag    scroll canvas
+
+   Split Screen
+      Z             split screen in half, so that second sheet on the stack is
+                    visible in a second pane
+     zZ             split screen, and queries for height of second pane
+
+     Split Window specific Commands
+        gZ                  close an already split screen, current pane full
+                            screens
+         Z                  push second sheet on current pane's stack to the
+                            top of the other pane's stack
+         Tab                jump to other pane
+        gTab                swap panes
+        g Ctrl+^            cycle through sheets
+
+   Other Commands
+     Q                quit current sheet and remove it from the CommandLog
+     v                toggle sheet-specific visibility (multi-line rows on
+                      Sheet, legends/axes on Graph)
+
+      ^E  g^E         view traceback for most recent error(s)
+     z^E              view traceback for error in current cell
+
+      ^L              refresh screen
+      ^R              reload current sheet
+      ^Z              suspend VisiData process
+      ^G              show cursor position and bounds of current sheet on sta‐
+                      tus line
+      ^V              show version and copyright information on status line
+      ^P              open Status History
+     m keystroke      first, begin recording macro; second, prompt for
+                      keystroke , and complete recording. Macro can then be
+                      executed everytime provided keystroke is used. Will
+                      override existing keybinding. Macros will run on current
+                      row, column, sheet.
+     gm               open an index of all existing macros. Can be directly
+                      viewed with Enter, and then modified with ^S.
+
+      ^Y  z^Y  g^Y    open current row/cell/sheet as Python object
+      ^X expr         evaluate Python expr and opens result as Python object
+     z^X expr         evaluate Python expr, in context of current row, and
+                      open result as Python object
+     g^X module       import Python module in the global scope
+
+   Internal Sheets List
+      .  Directory Sheet             browse properties of files in a directory
+      .  Guide Index                 read documentation from within VisiData
+      .  Memory Sheet (Alt+Shift+M)  browse saved values, including clipboard
+
+     Metasheets
+      .  Columns Sheet (Shift+C)     edit column properties
+      .  Sheets Sheet (Shift+S)      jump between sheets or join them together
+      .  Options Sheet (Shift+O)     edit configuration options
+      .  Commandlog (Shift+D)        modify and save commands for replay
+      .  Error Sheet (Ctrl+E)            view last error
+      .  Status History (Ctrl+P)         view history of status messages
+      .  Threads Sheet (Ctrl+T)          view, cancel, and profile
+         asynchronous threads
+
+     Derived Sheets
+      .  Frequency Table (Shift+F)   group rows by column value, with
+         aggregations of other columns
+      .  Describe Sheet (Shift+I)    view summary statistics for each column
+      .  Pivot Table (Shift+W)       group rows by key and summarize current
+         column
+      .  Melted Sheet (Shift+M)      unpivot non-key columns into
+         variable/value columns
+      .  Transposed Sheet (Shift+T)   open new sheet with rows and columns
+         transposed
+
+   INTERNAL SHEETS
+   Directory Sheet
+     (global commands)
+        Space open-dir-current
+                         open the Directory Sheet for the current directory
+     (sheet-specific commands)
+        Enter  gEnter    open current/selected file(s) as new sheet(s)
+         ^O  g^O         open current/selected file(s) in external $EDITOR
+         ^R  z^R  gz^R   reload information for all/current/selected file(s)
+          d   gd         delete current/selected file(s) from filesystem, upon
+                         commit
+          y   gy directory
+                         copy current/selected file(s) to given directory,
+                         upon commit
+          e   ge name    rename current/selected file(s) to name
+          ` (backtick)   open parent directory
+        z^S              commit changes to file system
+
+   Guide Index
+     Browse through a list of available guides. Each guide shows you how to
+     use a particular feature. Gray guides have not been written yet.
+     (global commands)
+        Space open-guide-index
+                         open the Guide Index
+     (sheet-specific commands)
+        Enter            open a guide
+
+   Memory Sheet
+     Browse through a list of stored values, referanceable in expressions
+     through their name.
+     (global commands)
+        Alt+Shift+M      open the Memory Sheet
+        Alt+M name       store value in current cell in Memory Sheet under
+                         name
+     (sheet-specific commands)
+        e                edit either value or name, to edit reference
+
+   METASHEETS
+   Columns Sheet (Shift+C)
+     Properties of columns on the source sheet can be changed with standard
+     editing commands (e ge g= Del) on the Columns Sheet. Multiple aggregators
+     can be set by listing them (separated by spaces) in the aggregators
+     column. The 'g' commands affect the selected rows, which are the literal
+     columns on the source sheet.
+     (global commands)
+        gC               open Columns Sheet with all visible columns from all
+                         sheets
+     (sheet-specific commands)
+         &               add column from appending selected source columns
+        g! gz!           toggle/unset selected columns as key columns on
+                         source sheet
+        g+ aggregator    add Ar aggregator No to selected source columns
+        g- (hyphen)      hide selected columns on source sheet
+        g~ g# g% g$ g@ gz# z%
+                         set type of selected columns on source sheet to
+                         str/int/float/currency/date/len/floatsi
+         Enter           open a Frequency Table sheet grouped by column
+                         referenced in current row
+
+   Sheets Sheet (Shift+S)
+     open Sheets Stack, which contains only the active sheets on the current
+     stack
+     (global commands)
+        gS               open Sheets Sheet, which contains all sheets from
+                         current session, active and inactive
+        Alt number       jump to sheet number
+     (sheet-specific commands)
+         Enter           jump to sheet referenced in current row
+        gEnter           push selected sheets to top of sheet stack
+         a               add row to reference a new blank sheet
+        gC  gI           open Columns Sheet/Describe Sheet with all visible
+                         columns from selected sheets
+        g^R              reload all selected sheets
+        z^C  gz^C        abort async threads for current/selected sheets(s)
+        g^S              save selected or all sheets
+         & jointype      merge selected sheets with visible columns from all,
+                         keeping rows according to jointype:
+                         .  inner  keep only rows which match keys on all
+                            sheets
+                         .  outer  keep all rows from first selected sheet
+                         .  full   keep all rows from all sheets (union)
+                         .  diff   keep only rows NOT in all sheets
+                         .  append combine all rows from all sheets
+                         .  concat similar to 'append' but keep first sheet
+                            type and columns
+                         .  extend copy first selected sheet, keeping all rows
+                            and sheet type, and extend with columns from other
+                            sheets
+                         .  merge  keep all rows from first sheet, updating
+                            any False-y cells with non-False-y values from
+                            second sheet; add unique rows from second sheet
+
+   Options Sheet (Shift+O)
+     (global commands)
+        Shift+O          edit global options (apply to all sheets)
+        zO               edit sheet options (apply to current sheet only)
+        gO               open options.config as TextSheet
+     (sheet-specific commands)
+        Enter  e         edit option at current row
+        d                remove option override for this context
+        ^S               save option configuration to foo.visidatarc
+
+   CommandLog (Shift+D)
+     (global commands)
+        D                open current sheet's CommandLog with all other loose
+                         ends removed; includes commands from parent sheets
+        gD               open global CommandLog for all commands executed in
+                         the current session
+        zD               open current sheet's CommandLog with the parent
+                         sheets commands' removed
+     (sheet-specific commands)
+          x              replay command in current row
+         gx              replay contents of entire CommandLog
+         ^C              abort replay
+
+   Threads Sheet (Ctrl+T)
+     (global commands)
+        ^T               open global Threads Sheet for all asynchronous
+                         threads running
+        z^T              open current sheet's Threads Sheet
+     (sheet-specific commands)
+         ^C              abort thread at current row
+        g^C              abort all threads on current Threads Sheet
+
+   DERIVED SHEETS
+   Frequency Table (Shift+F)
+     A Frequency Table groups rows by one or more columns, and includes
+     summary columns for those with aggregators.
+     (global commands)
+        gF               open Frequency Table, grouped by all key columns on
+                         source sheet
+        zF               open one-line summary for all rows and selected rows
+     (sheet-specific commands)
+         s   t   u       select/toggle/unselect these entries in source sheet
+         Enter  gEnter   open copy of source sheet with rows that are grouped
+                         in current cell / selected rows
+
+   Describe Sheet (Shift+I)
+     A Describe Sheet contains descriptive statistics for all visible columns.
+     (global commands)
+        gI               open Describe Sheet for all visible columns on all
+                         sheets
+     (sheet-specific commands)
+        zs  zu           select/unselect rows on source sheet that are being
+                         described in current cell
+         !               toggle/unset current column as a key column on source
+                         sheet
+         Enter           open a Frequency Table sheet grouped on column
+                         referenced in current row
+        zEnter           open copy of source sheet with rows described in cur‐
+                         rent cell
+
+   Pivot Table (Shift+W)
+     Set key column(s) and aggregators on column(s) before pressing Shift+W on
+     the column to pivot.
+     (sheet-specific commands)
+         Enter           open sheet of source rows aggregated in current pivot
+                         row
+        zEnter           open sheet of source rows aggregated in current pivot
+                         cell
+
+   Melted Sheet (Shift+M)
+     Open Melted Sheet (unpivot), with key columns retained and all non-key
+     columns reduced to Variable-Value rows.
+     (global commands)
+        gM regex         open Melted Sheet (unpivot), with key columns
+                         retained and regex capture groups determining how the
+                         non-key columns will be reduced to Variable-Value
+                         rows.
+
+   Python Object Sheet (^X ^Y g^Y z^Y)
+     (sheet-specific commands)
+         Enter           dive further into Python object
+         v               toggle show/hide for methods and hidden properties
+        gv  zv           show/hide methods and hidden properties
+
+COMMANDLINE OPTIONS
+     Add -n/--nonglobal to make subsequent CLI options sheet-specific
+     (applying only to paths specified directly on the CLI). By default, CLI
+     options apply to all sheets.
+
+     Options can also be set via the Options Sheet or a .visidatarc (see
+     FILES).
+
+     -P=longname                  preplay longname before replay or regular
+                                  launch; limited to Base Sheet bound commands
+     +toplevel:subsheet:row:col   launch vd with subsheet of toplevel at
+                                  top-of-stack, and cursor at row and col; all
+                                  arguments are optional
+     --overwrite=c                Overwrite with confirmation
+     --guides                     open Guide Index
+
+     -f, --filetype=filetype      tsv                set loader to use for
+                                  filetype instead of file extension
+     -d, --delimiter=delimiter    \t                 field delimiter to use
+                                  for tsv/usv filetype
+     -y, --overwrite=y            y                  overwrite existing files
+                                  without confirmation
+     -ro, --overwrite=n           n                  do not overwrite existing
+                                  files
+     -N, --nothing=T              False              disable loading
+                                  .visidatarc and plugin addons
+     --visidata-dir=str           ~/.visidata/       directory to load and
+                                                     store additional files
+     --debug                      False              exit on error and display
+                                                     stacktrace
+     --undo=bool                  True               enable undo/redo
+     --col-cache-size=int         0                  max number of cache en‐
+                                                     tries in each cached col‐
+                                                     umn
+     --scroll-incr=int            -3                 amount to scroll with
+                                                     scrollwheel
+     --force-256-colors           False              use 256 colors even if
+                                                     curses reports fewer
+     --quitguard                  False              confirm before quitting
+                                                     modified sheet
+     --default-width=int          20                 default column width
+     --default-height=int         4                  default column height
+     --name-joiner=str            _                  string to join sheet or
+                                                     column names
+     --value-joiner=str                              string to join display
+                                                     values
+     --max-rows=int               1000000000         number of rows to load
+                                                     from source
+     --wrap                       False              wrap text to fit window
+                                                     width on TextSheet
+     --save-filetype=str          tsv                specify default file type
+                                                     to save as
+     --profile                    False              enable profiling on
+                                                     threads
+     --min-memory-mb=int          0                  minimum memory to con‐
+                                                     tinue loading and async
+                                                     processing
+     --encoding=str               utf-8-sig          encoding passed to
+                                                     codecs.open when reading
+                                                     a file
+     --encoding-errors=str        surrogateescape    encoding_errors passed to
+                                                     codecs.open
+     --mouse-interval=int         1                  max time between
+                                                     press/release for click
+                                                     (ms)
+     --bulk-select-clear          False              clear selected rows be‐
+                                                     fore new bulk selections
+     --some-selected-rows         False              if no rows selected, if
+                                                     True, someSelectedRows
+                                                     returns all rows; if
+                                                     False, fails
+     --regex-skip=str                                regex of lines to skip in
+                                                     text sources
+     --regex-flags=str            I                  flags to pass to re.com‐
+                                                     pile() [AILMSUX]
+     --load-lazy                  False              load subsheets always
+                                                     (False) or lazily (True)
+     --skip=int                   0                  skip N rows before header
+     --header=int                 1                  parse first N rows as
+                                                     column names
+     --delimiter=str                                 field delimiter to use
+                                                     for tsv/usv filetype
+     --row-delimiter=str                             " row delimiter to use
+                                                     for tsv/usv filetype
+     --tsv-safe-newline=str                          replacement for newline
+                                                     character when saving to
+                                                     tsv
+     --tsv-safe-tab=str                              replacement for tab char‐
+                                                     acter when saving to tsv
+     --visibility=int             0                  visibility level
+     --default-sample-size=int    100                number of rows to sample
+                                                     for regex.split (0=all)
+     --fmt-expand-dict=str        %s.%s              format str to use for
+                                                     names of columns expanded
+                                                     from dict (colname, key)
+     --fmt-expand-list=str        %s[%s]             format str to use for
+                                                     names of columns expanded
+                                                     from list (colname, in‐
+                                                     dex)
+     --json-indent=NoneType       None               indent to use when saving
+                                                     json
+     --json-sort-keys             False              sort object keys when
+                                                     saving to json
+     --json-ensure-ascii=bool     True               ensure ascii encode when
+                                                     saving json
+     --default-colname=str                           column name to use for
+                                                     non-dict rows
+     --filetype=str                                  specify file type
+     --safe-error=str             #ERR               error string to use while
+                                                     saving
+     --save-encoding=str          utf-8              encoding passed to
+                                                     codecs.open when saving a
+                                                     file
+     --clean-names                False              clean column/sheet names
+                                                     to be valid Python iden‐
+                                                     tifiers
+     --replay-wait=float          0.0                time to wait between re‐
+                                                     played commands, in sec‐
+                                                     onds
+     --rowkey-prefix=str          キ                 string prefix for rowkey
+                                                     in the cmdlog
+     --clipboard-copy-cmd=str     xclip -selection clipboard -filter
+                                                     command to copy stdin to
+                                                     system clipboard
+     --clipboard-paste-cmd=str    xclip -selection clipboard -o
+                                                     command to send contents
+                                                     of system clipboard to
+                                                     stdout
+     --fancy-chooser              False              a nicer selection inter‐
+                                                     face for aggregators and
+                                                     jointype
+     --null-value=NoneType        None               a value to be counted as
+                                                     null
+     --histogram-bins=int         0                  number of bins for his‐
+                                                     togram of numeric columns
+     --numeric-binning            False              bin numeric columns into
+                                                     ranges
+     --plot-colors=str                               list of distinct colors
+                                                     to use for plotting dis‐
+                                                     tinct objects
+     --motd-url=str                                  source of randomized
+                                                     startup messages
+     --dir-depth=int              0                  folder recursion depth on
+                                                     DirSheet
+     --dir-hidden                 False              load hidden files on
+                                                     DirSheet
+     --config=Path                /home/saul/.visidatarc
+                                                     config file to exec in
+                                                     Python
+     --play=str                                      file.vdj to replay
+     --batch                      False              replay in batch mode
+                                                     (with no interface and
+                                                     all status sent to std‐
+                                                     out)
+     --output=NoneType            None               save the final visible
+                                                     sheet to output at the
+                                                     end of replay
+     --preplay=str                                   longnames to preplay be‐
+                                                     fore replay
+     --imports=str                plugins            imports to preload before
+                                                     .visidatarc (command-line
+                                                     only)
+     --nothing                    False              no config, no plugins,
+                                                     nothing extra
+     --interactive                False              run interactive mode af‐
+                                                     ter batch replay
+     --overwrite=str              c                  overwrite existing files
+                                                     {y=yes|c=confirm|n=no}
+     --plugins-autoload=bool      True               do not autoload plugins
+                                                     if False
+     --theme=str                                     display/color theme to
+                                                     use
+     --airtable-auth-token=str                       Airtable API key from
+                                                     https://airtable.com/ac‐
+                                                     count
+     --matrix-token=str                              matrix API token
+     --matrix-user-id=str                            matrix user ID associated
+                                                     with token
+     --matrix-device-id=str       VisiData           device ID associated with
+                                                     matrix login
+     --reddit-client-id=str                          client_id for reddit api
+     --reddit-client-secret=str                      client_secret for reddit
+                                                     api
+     --reddit-user-agent=str      3.1dev             user_agent for reddit api
+     --zulip-batch-size=int       -100               number of messages to
+                                                     fetch per call (<0 to
+                                                     fetch before anchor)
+     --zulip-anchor=int           1000000000         message id to start
+                                                     fetching from
+     --zulip-delay-s=float        1e-05              seconds to wait between
+                                                     calls (0 to stop after
+                                                     first)
+     --zulip-api-key=str                             Zulip API key
+     --zulip-email=str                               Email for use with Zulip
+                                                     API key
+     --csv-dialect=str            excel              dialect passed to
+                                                     csv.reader
+     --csv-delimiter=str          ,                  delimiter passed to
+                                                     csv.reader
+     --csv-doublequote=bool       True               quote-doubling setting
+                                                     passed to csv.reader
+     --csv-quotechar=str          "                  quotechar passed to
+                                                     csv.reader
+     --csv-quoting=int            0                  quoting style passed to
+                                                     csv.reader and csv.writer
+     --csv-skipinitialspace=bool  True               skipinitialspace passed
+                                                     to csv.reader
+     --csv-escapechar=NoneType    None               escapechar passed to
+                                                     csv.reader
+     --csv-lineterminator=str                        " lineterminator passed
+                                                     to csv.writer
+     --safety-first               False              sanitize input/output to
+                                                     handle edge cases, with a
+                                                     performance cost
+     --f5log-object-regex=NoneType None              A regex to perform on the
+                                                     object name, useful where
+                                                     object names have a
+                                                     structure to extract. Use
+                                                     the (?P<foo>...) named
+                                                     groups form to get column
+                                                     names.
+     --f5log-log-year=NoneType    None               Override the default year
+                                                     used for log parsing. Use
+                                                     all four digits of the
+                                                     year (e.g., 2022). By de‐
+                                                     fault (None) use the year
+                                                     from the ctime of the
+                                                     file, or failing that the
+                                                     current year.
+     --f5log-log-timezone=str     UTC                The timezone the source
+                                                     file is in, by default
+                                                     UTC.
+     --fixed-rows=int             1000               number of rows to check
+                                                     for fixed width columns
+     --fixed-maxcols=int          0                  max number of fixed-width
+                                                     columns to create (0 is
+                                                     no max)
+     --graphviz-edge-labels=bool  True               whether to include edge
+                                                     labels on graphviz dia‐
+                                                     grams
+     --grep-base-dir=NoneType     None               base directory for rela‐
+                                                     tive paths opened with
+                                                     sysopen-row
+     --html-title=str             <h2>{sheet.name}</h2>
+                                                     table header when saving
+                                                     to html
+     --http-max-next=int          0                  max next.url pages to
+                                                     follow in http response
+     --http-req-headers=dict      {}                 http headers to send to
+                                                     requests
+     --http-ssl-verify=bool       True               verify host and certifi‐
+                                                     cates for https
+     --npy-allow-pickle           False              numpy allow unpickling
+                                                     objects (unsafe)
+     --pcap-internet=str          n                  (y/s/n) if save_dot in‐
+                                                     cludes all internet hosts
+                                                     separately (y), combined
+                                                     (s), or does not include
+                                                     the internet (n)
+     --pdf-tables                 False              parse PDF for tables in‐
+                                                     stead of pages of text
+     --postgres-schema=str        public             The desired schema for
+                                                     the Postgres database
+     --s3-endpoint=str                               alternate S3 endpoint,
+                                                     used for local testing or
+                                                     alternative S3-compatible
+                                                     services
+     --s3-glob=bool               True               enable glob-matching for
+                                                     S3 paths
+     --s3-version-aware           False              show all object versions
+                                                     in a versioned bucket
+     --sqlite-onconnect=str                          sqlite statement to exe‐
+                                                     cute after opening a con‐
+                                                     nection
+     --xlsx-meta-columns          False              include columns for cell
+                                                     objects, font colors, and
+                                                     fill colors
+     --xlsx-color-cells=bool      True               color cells based on xlsx
+                                                     source
+     --xml-parser-huge-tree=bool  True               allow very deep trees and
+                                                     very long text content
+     --plt-marker=str             .                  matplotlib.markers
+     --plot-palette=str           Set3               colorbrewer palette to
+                                                     use
+     --server-addr=str            127.0.0.1          IP address to listen for
+                                                     commands
+     --server-port=int            0                  port to listen for com‐
+                                                     mands
+     --fixer-api-key=str                             API Key for api.api‐
+                                                     layer.com/fixer
+     --fixer-cache-days=int       1                  Cache days for currency
+                                                     conversions
+     --describe-aggrs=str         mean stdev         numeric aggregators to
+                                                     calculate on Describe
+                                                     sheet
+     --hello-world=str            ¡Hola mundo!       shown by the hello-world
+                                                     command
+     --incr-base=float            1.0                start value for column
+                                                     increments
+     --ping-count=int             3                  send this many pings to
+                                                     each host
+     --ping-interval=float        0.1                wait between ping rounds,
+                                                     in seconds
+     --regex-maxsplit=int         0                  maxsplit to pass to
+                                                     regex.split
+     --rename-cascade             False              cascade column renames
+                                                     into expressions
+     --faker-locale=str           en_US              default locale to use for
+                                                     Faker
+     --faker-extra-providers=NoneType None           list of additional
+                                                     Provider classes to load
+                                                     via add_provider()
+     --faker-salt=str                                Use a non-empty string to
+                                                     enable deterministic
+                                                     fakes
+     --mailcap-mimetype=str                          force mimetype for
+                                                     sysopen-mailcap
+     --unfurl-empty               False              if unfurl includes rows
+                                                     for empty containers
+
+   DISPLAY OPTIONS
+     Display options can only be set via the Options Sheet or a .visidatarc
+     (see FILES).
+
+     disp_menu           True                show menu on top line when not
+                                             active
+     disp_menu_keys      True                show keystrokes inline in sub‐
+                                             menus
+     color_menu          black on 68 blue    color of menu items in general
+     color_menu_active   223 yellow on black
+                                             color of active menu items
+     color_menu_spec     black on 34 green   color of sheet-specific menu
+                                             items
+     color_menu_help     black italic on 68 blue
+                                             color of helpbox
+     disp_menu_boxchars  ││──┌┐└┘├┤          box characters to use for menus
+     disp_menu_more      »                   command submenu indicator
+     disp_menu_push      ⎘                   indicator if command pushes sheet
+                                             onto sheet stack
+     disp_menu_input     …                   indicator if input required for
+                                             command
+     disp_menu_fmt       | VisiData {vd.version} | {vd.hintStatus}
+                                             right-side menu format string
+     disp_float_fmt      {:.02f}             default fmtstr to format float
+                                             values
+     disp_int_fmt        {:d}                default fmtstr to format int val‐
+                                             ues
+     disp_formatter      generic             formatter to create the text in
+                                             each cell (also used by text
+                                             savers)
+     disp_displayer      generic             displayer to render the text in
+                                             each cell
+     disp_splitwin_pct   0                   height of second sheet on screen
+     disp_note_none      ⌀                   visible contents of a cell whose
+                                             value is None
+     disp_truncator      …                   indicator that the contents are
+                                             only partially visible
+     disp_oddspace       ·                   displayable character for odd
+                                             whitespace
+     disp_more_left      <                   header note indicating more col‐
+                                             umns to the left
+     disp_more_right     >                   header note indicating more col‐
+                                             umns to the right
+     disp_error_val                          displayed contents for computa‐
+                                             tion exception
+     disp_ambig_width    1                   width to use for unicode chars
+                                             marked ambiguous
+     disp_pending                            string to display in pending
+                                             cells
+     disp_note_pending   :                   note to display for pending cells
+     disp_note_fmtexc    ?                   cell note for an exception during
+                                             formatting
+     disp_note_getexc    !                   cell note for an exception during
+                                             computation
+     disp_note_typeexc   !                   cell note for an exception during
+                                             type conversion
+     color_note_pending  bold green          color of note in pending cells
+     color_note_type     226 yellow          color of cell note for non-str
+                                             types in anytype columns
+     color_note_row      220 yellow          color of row note on left edge
+     disp_column_sep     │                   separator between columns
+     disp_keycol_sep     ║                   separator between key columns and
+                                             rest of columns
+     disp_rowtop_sep     │
+     disp_rowmid_sep     ⁝
+     disp_rowbot_sep     ⁝
+     disp_rowend_sep     ║
+     disp_keytop_sep     ║
+     disp_keymid_sep     ║
+     disp_keybot_sep     ║
+     disp_endtop_sep     ║
+     disp_endmid_sep     ║
+     disp_endbot_sep     ║
+     disp_selected_note  •
+     disp_sort_asc       ↑↟⇞⇡⇧⇑              characters for ascending sort
+     disp_sort_desc      ↓↡⇟⇣⇩⇓              characters for descending sort
+     color_default       white on black      the default fg and bg colors
+     color_default_hdr   bold white on black
+                                             color of the column headers
+     color_bottom_hdr    underline white on black
+                                             color of the bottom header row
+     color_current_row   reverse             color of the cursor row
+     color_current_col   bold on 232         color of the cursor column
+     color_current_cell                      color of current cell, if differ‐
+                                             ent from color_cur‐
+                                             rent_row+color_current_col
+     color_current_hdr   bold reverse        color of the header for the cur‐
+                                             sor column
+     color_column_sep    white on black      color of column separators
+     color_key_col       81 cyan             color of key columns
+     color_hidden_col    8                   color of hidden columns on
+                                             metasheets
+     color_selected_row  215 yellow          color of selected rows
+     color_clickable     bold                color of internally clickable
+                                             item
+     color_code          bold white on 237   color of code sample
+     color_heading       bold black on yellow
+                                             color of header
+     color_guide_unwritten 243 on black      color of unwritten guides in
+                                             GuideGuide
+     disp_wrap_max_lines 3                   max lines for multiline view
+     disp_wrap_break_long_words False        break words longer than column
+                                             width in multiline
+     disp_wrap_replace_whitespace False      replace whitespace with spaces in
+                                             multiline
+     disp_wrap_placeholder …                 multiline string to indicate
+                                             truncation
+     disp_multiline_focus True               only multiline cursor row
+     color_aggregator    bold 255 white on 234 black
+                                             color of aggregator summary on
+                                             bottom row
+     disp_rstatus_fmt    {sheet.threadStatus} {sheet.keystrokeStatus}
+                                             [:longname_status]{sheet.longname}[/]
+                                             {sheet.nRows:9d} {sheet.rowtype}
+                                             {sheet.modifiedStatus}{sheet.selectedStatus}{vd.replayStatus}{vd.sidebarStatus}
+                                             right-side status format string
+     disp_status_fmt     {sheet.sheetlist}|
+                                             left-side status format string
+     disp_lstatus_max    0                   maximum length of left status
+                                             line
+     disp_status_sep     │                   separator between statuses
+     color_keystrokes    bold white on 237   color of input keystrokes
+     color_longname_guide 237                color of command longnames
+     color_longname_status white             color of command longnames
+     color_keys          bold reverse        color of keystrokes in help
+     color_status        bold on 238         status line color
+     color_error         202 1               error message color
+     color_warning       166 15              warning message color
+     color_top_status    underline           top window status bar color
+     color_active_status black on 68 blue     active window status bar color
+     color_inactive_status 8 on black        inactive window status bar color
+     color_highlight_status black on green   color of highlighted elements in
+                                             statusbar
+     color_working       118 5               color of system running smoothly
+     color_edit_unfocused 238 on 110         display color for unfocused input
+                                             in form
+     color_edit_cell     233 on 110          cell color to use when editing
+                                             cell
+     disp_edit_fill      _                   edit field fill character
+     disp_unprintable    ·                   substitute character for unprint‐
+                                             ables
+     disp_date_fmt       %Y-%m-%d            default fmtstr passed to strftime
+                                             for date values
+     disp_currency_fmt   %.02f               default fmtstr to format for cur‐
+                                             rency values
+     color_currency_neg  red                 color for negative values in cur‐
+                                             rency displayer
+     disp_replay_play    ▶                   status indicator for active re‐
+                                             play
+     disp_replay_record  ⏺                   status indicator for macro record
+     color_status_replay green               color of replay status indicator
+     disp_histogram      ■                   histogram element character
+     disp_graph_labels   True                show axes and legend on graph
+     disp_canvas_charset
+                                             ⠀⠁⠂⠃⠄⠅⠆⠇⠈⠉⠊⠋⠌⠍⠎⠏⠐⠑⠒⠓⠔⠕⠖⠗⠘⠙⠚⠛⠜⠝⠞⠟⠠⠡⠢⠣⠤⠥⠦⠧⠨⠩⠪⠫⠬⠭⠮⠯⠰⠱⠲⠳⠴⠵⠶⠷⠸⠹⠺⠻⠼⠽⠾⠿⡀⡁⡂⡃⡄⡅⡆⡇⡈⡉⡊⡋⡌⡍⡎⡏⡐⡑⡒⡓⡔⡕⡖⡗⡘⡙⡚⡛⡜⡝⡞⡟⡠⡡⡢⡣⡤⡥⡦⡧⡨⡩⡪⡫⡬⡭⡮⡯⡰⡱⡲⡳⡴⡵⡶⡷⡸⡹⡺⡻⡼⡽⡾⡿⢀⢁⢂⢃⢄⢅⢆⢇⢈⢉⢊⢋⢌⢍⢎⢏⢐⢑⢒⢓⢔⢕⢖⢗⢘⢙⢚⢛⢜⢝⢞⢟⢠⢡⢢⢣⢤⢥⢦⢧⢨⢩⢪⢫⢬⢭⢮⢯⢰⢱⢲⢳⢴⢵⢶⢷⢸⢹⢺⢻⢼⢽⢾⢿⣀⣁⣂⣃⣄⣅⣆⣇⣈⣉⣊⣋⣌⣍⣎⣏⣐⣑⣒⣓⣔⣕⣖⣗⣘⣙⣚⣛⣜⣝⣞⣟⣠⣡⣢⣣⣤⣥⣦⣧⣨⣩⣪⣫⣬⣭⣮⣯⣰⣱⣲⣳⣴⣵⣶⣷⣸⣹⣺⣻⣼⣽⣾⣿
+                                             charset to render 2x4 blocks on
+                                             canvas
+     disp_pixel_random   False               randomly choose attr from set of
+                                             pixels instead of most common
+     disp_zoom_incr      2.0                 amount to multiply current zoom‐
+                                             level when zooming
+     color_graph_hidden  238 blue            color of legend for hidden attri‐
+                                             bute
+     color_graph_selected bold               color of selected graph points
+     color_graph_axis    bold                color for graph axis labels
+     disp_graph_tick_x   ╵                   character for graph x-axis ticks
+     color_graph_refline                     color for graph reference value
+                                             lines
+     disp_graph_reflines_x_charset ▏││▕      charset to render vertical refer‐
+                                             ence lines on graph
+     disp_graph_reflines_y_charset ▔──▁      charset to render horizontal ref‐
+                                             erence lines on graph
+     disp_graph_multiple_reflines_char ▒     char to render multiple parallel
+                                             reflines
+     disp_expert         0                   max level of options and columns
+                                             to include
+     color_add_pending   green               color for rows pending add
+     color_change_pending reverse yellow     color for cells pending modifica‐
+                                             tion
+     color_delete_pending red                color for rows pending delete
+     disp_sidebar        True                whether to display sidebar
+     disp_sidebar_fmt                        format string for default sidebar
+     disp_sidebar_width  0                   max width for sidebar
+     disp_sidebar_height 0                   max height for sidebar
+     color_sidebar       black on 114 blue   base color of sidebar
+     color_sidebar_title black on yellow     color of sidebar title
+     color_match         red                 color for matching chars in pal‐
+                                             ette chooser
+     color_f5log_mon_up  green               color of f5log monitor status up
+     color_f5log_mon_down red                color of f5log monitor status
+                                             down
+     color_f5log_mon_unknown blue            color of f5log monitor status un‐
+                                             known
+     color_f5log_mon_checking magenta        color of monitor status checking
+     color_f5log_mon_disabled black          color of monitor status disabled
+     color_f5log_logid_alarm red             color of alarms
+     color_f5log_logid_warn yellow           color of warnings
+     color_f5log_logid_notice cyan           color of notice
+     color_f5log_logid_info green            color of info
+     color_xword_active  green               color of active clue
+     color_cmdpalette    black on 72         base color of command palette
+     disp_cmdpal_max     10                  max number of suggestions for
+                                             command palette
+     disp_scroll_context 0                   minimum number of lines to keep
+                                             visible above/below cursor when
+                                             scrolling
+     disp_sparkline      ▁▂▃▄▅▆▇             characters to display sparkline
+
+EXAMPLES
+           vd
+     launch DirSheet for current directory
+
+           vd foo.tsv
+     open the file foo.tsv in the current directory
+
+           vd -f ddw
+     open blank sheet of type ddw
+
+           vd new.tsv
+     open new blank tsv sheet named new
+
+           vd -f sqlite bar.db
+     open the file bar.db as a sqlite database
+
+           vd foo.tsv -n -f sqlite bar.db
+     open foo.tsv as tsv and bar.db as a sqlite database
+
+           vd -f sqlite foo.tsv bar.db
+     open both foo.tsv and bar.db as a sqlite database
+
+           vd -b countries.fixed -o countries.tsv
+     convert countries.fixed (in fixed width format) to countries.tsv (in tsv
+     format)
+
+           vd postgres://username:password@hostname:port/database
+     open a connection to the given postgres database
+
+           vd --play tests/pivot.vdj --replay-wait 1 --output tests/pivot.tsv
+     replay tests/pivot.vdj, waiting 1 second between commands, and output the
+     final sheet to tests/pivot.tsv
+
+           ls -l | vd -f fixed --skip 1 --header 0
+     parse the output of ls -l into usable data
+
+           ls | vd | lpr
+     interactively select a list of filenames to send to the printer
+
+           vd newfile.tsv
+     open a blank sheet named newfile if file does not exist
+
+           vd sample.xlsx +:sheet1:2:3
+     launch with sheet1 at top-of-stack, and cursor at column 2 and row 3
+
+           vd -P open-plugins
+     preplay longname open-plugins before starting the session
+
+FILES
+     At the start of every session, VisiData looks for $HOME/.visidatarc, and
+     calls Python exec() on its contents if it exists. For example:
+
+        options.min_memory_mb=100  # stop processing without 100MB free
+
+        bindkey('0', 'go-leftmost')   # alias '0' to go to first column, like vim
+
+        def median(values):
+            L = sorted(values)
+            return L[len(L)//2]
+
+        vd.aggregator('median', median)   # register 'median' as a column aggregator
+
+     Functions defined in .visidatarc are available in python expressions
+     (e.g. in derived columns).
+
+SUPPORTED SOURCES
+     Core VisiData includes these sources:
+
+        tsv (tab-separated value)
+           Plain and simple. VisiData writes tsv format by default. See the
+           --tsv-delimiter option.
+
+        csv (comma-separated value)
+           .csv files are a scourge upon the earth, and still regrettably
+           common.
+           See the --csv-dialect, --csv-delimiter, --csv-quotechar, and
+           --csv-skipinitialspace options.
+           Accepted dialects are excel-tab, unix, and excel.
+
+        fixed (fixed width text)
+           Columns are autodetected from the first 1000 rows (adjustable with
+           --fixed-rows).
+
+        json (single object) and jsonl/ndjson/ldjson (one object per line).
+           Cells containing lists (e.g. [3]) or dicts ({3}) can be expanded
+           into new columns with ( and unexpanded with ).
+
+        sqlite
+           May include multiple tables. The initial sheet is the table
+           directory; Enter loads the entire table into memory. z^S saves
+           modifications to source.
+
+     URL schemes are also supported:
+        http (requires requests); can be used as transport with another
+        filetype
+
+     For a list of all remaining formats supported by VisiData, see
+     https://visidata.org/formats.
+
+     In addition, .zip, .gz, .bz2, .xz, .zstd, and .zst files are decompressed
+     on the fly.
+
+AUTHOR
+     VisiData was made by Saul Pwanson <vd@saul.pw>.
+
+Linux/MacOS                    October 13, 2024                    Linux/MacOS
\ No newline at end of file
diff --git a/docs/console/extract.html b/docs/console/extract.html
new file mode 100644
index 0000000000..d90e1688d5
--- /dev/null
+++ b/docs/console/extract.html
@@ -0,0 +1,3528 @@
+Extract | Frictionless Framework

Extract

+
+ +

With the Frictionless extract command you can extract data from a file or a dataset.

+

Normal Mode

+

By default, it outputs the extracted data visually formatted:

+ +
+
+
frictionless extract tables/*.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃ name   ┃ type  ┃ path              ┃
+┡━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ chunk1 │ table │ tables/chunk1.csv │
+│ chunk2 │ table │ tables/chunk2.csv │
+└────────┴───────┴───────────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+     chunk1
+┏━━━━┳━━━━━━━━━┓
+┃ id ┃ name    ┃
+┡━━━━╇━━━━━━━━━┩
+│ 1  │ english │
+└────┴─────────┘
+    chunk2
+┏━━━━┳━━━━━━━━┓
+┃ id ┃ name   ┃
+┡━━━━╇━━━━━━━━┩
+│ 2  │ 中国人 │
+└────┴────────┘
+ +
+

Yaml/Json Mode

+

It's possible to output as YAML or JSON, for example:

+ + + +
+
+
frictionless extract tables/*.csv --yaml
+
+ +
chunk1:
+- id: 1
+  name: english
+chunk2:
+- id: 2
+  name: 中国人
+ +
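The same extraction can be done from Python. A minimal sketch, assuming the frictionless package and the tables/chunk1.csv file from the examples above (the exact shape of the returned data can differ between framework versions):

from frictionless import extract

# read the table and return its rows as plain Python data
data = extract("tables/chunk1.csv")
print(data)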
\ No newline at end of file
diff --git a/docs/console/index.html b/docs/console/index.html
new file mode 100644
index 0000000000..0e94dce351
--- /dev/null
+++ b/docs/console/index.html
@@ -0,0 +1,3589 @@
+Index | Frictionless Framework

Index

+
+ +
+ +

Indexing a resource, in Frictionless terms, means loading a data table into a database. Let's explore how this feature works in different modes.

+

Installation

+ +
+
+
pip install frictionless[sql]
+
+ +
+

Normal Mode

+

This mode is supported for any database supported by sqlalchemy. Under the hood, Frictionless will infer a Table Schema and populate the data table as it normally reads data. This means that type errors will be replaced by null values and, in general, the process is guaranteed to finish successfully even for highly invalid data.

+ +
+
+
frictionless index table.csv --database sqlite:///index/project.db
+frictionless extract sqlite:///index/project.db --table table --json
+
+ +
──────────────────────────────────── Index ─────────────────────────────────────
+
+[table] Indexed 3 rows in 0.219 seconds
+──────────────────────────────────── Result ────────────────────────────────────
+Succesefully indexed 1 tables
+{
+  "project": [
+    {
+      "id": 1,
+      "name": "english"
+    },
+    {
+      "id": 2,
+      "name": "中国人"
+    }
+  ]
+}
+ +
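To double-check the indexed table outside of Frictionless, the sqlite3 module from the Python standard library is enough. A minimal sketch, assuming the index/project.db database and the table name used above:

import sqlite3

con = sqlite3.connect("index/project.db")
# "table" is the name the resource was indexed under in this example
for row in con.execute('SELECT * FROM "table"'):
    print(row)
con.close()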
+

Fast Mode

+
+ +

Fast mode is supported for SQLite and Postgresql databases. It will infer a Table Schema using a data sample and index data using COPY in Postgresql and .import in SQLite. For big data files this mode will be 10-30x faster than normal indexing, but the speed comes at a price -- if there is invalid data, the indexing will fail.

+ +
+
+
frictionless index table.csv --database sqlite:///index/project.db --fast
+frictionless extract sqlite:///index/project.db --table table --json
+
+ +
──────────────────────────────────── Index ─────────────────────────────────────
+
+[table] Indexed 30 bytes in 0.362 seconds
+──────────────────────────────────── Result ────────────────────────────────────
+Succesefully indexed 1 tables
+{
+  "project": [
+    {
+      "id": 1,
+      "name": "english"
+    },
+    {
+      "id": 2,
+      "name": "中国人"
+    }
+  ]
+}
+ +
+

Solution 1: Fallback

+

To ensure that the data will be successfully indexed, it's possible to use the fallback option. If fast indexing fails, Frictionless will start over in normal mode and finish the process successfully.

+ +
+
+
frictionless index table.csv --database sqlite:///index/project.db --name table --fast --fallback
+
+ +
+

Solution 2: QSV

+

Another option is to provide a path to the QSV binary. In this case, initial schema inference will be based on the whole data file, which guarantees that the table is valid type-wise:

+ + + +
+
+
frictionless index table.csv --database sqlite:///index/project.db --name table --fast --qsv qsv_path
+
+ +
+
\ No newline at end of file
diff --git a/docs/console/list.html b/docs/console/list.html
new file mode 100644
index 0000000000..da9f5c0312
--- /dev/null
+++ b/docs/console/list.html
@@ -0,0 +1,3499 @@
+List | Frictionless Framework

List

+
+ +

With the Frictionless list command you can get a list of resources from a data source. For more detailed output, see the describe command.

+

Normal Mode

+

By default, it outputs metadata visually formatted:

+
frictionless list tables/*.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃ name   ┃ type  ┃ path              ┃
+┡━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ chunk1 │ table │ tables/chunk1.csv │
+│ chunk2 │ table │ tables/chunk2.csv │
+└────────┴───────┴───────────────────┘
+

Yaml/Json Mode

+

It's possible to output as YAML or JSON, for example:

+
frictionless list tables/*.csv --yaml
+
+ +
- name: chunk1
+  type: table
+  path: tables/chunk1.csv
+  scheme: file
+  format: csv
+  mediatype: text/csv
+- name: chunk2
+  type: table
+  path: tables/chunk2.csv
+  scheme: file
+  format: csv
+  mediatype: text/csv
+
\ No newline at end of file
diff --git a/docs/console/overview.html b/docs/console/overview.html
new file mode 100644
index 0000000000..aa431216be
--- /dev/null
+++ b/docs/console/overview.html
@@ -0,0 +1,3550 @@
+Overview | Frictionless Framework

Overview

+

The command-line interface is a vital part of the Frictionless Framework. While working within Python provides more flexibility, the CLI is the easiest way to interact with Frictionless.

+
+ +

Install

+

To install the package, please follow the Getting Started guide. Usually, a simple installation using Pip or Anaconda will install the frictionless binary on your computer, so you don't need to install the CLI additionally.

+

Commands

+

The frictionless binary requires providing a command like describe or validate:

+ +
+
+
frictionless describe # to describe your data
+frictionless explore # to explore your data
+frictionless extract # to extract your data
+frictionless index # to index your data
+frictionless list # to list your data
+frictionless publish # to publish your data
+frictionless query # to query your data
+frictionless script # to script your data
+frictionless validate # to validate your data
+frictionless --help # to get list of the command
+frictionless --version # to get the version
+
+ +
+

Arguments

+

All the arguments for the main CLI command are the same as they are in Python. You can read the Guides and use almost all the information from there within the command line. There is an important difference in how arguments are written (note the dashes):

+
Python: validate('data/table.csv', limit_errors=1)
+CLI: $ validate data/table.csv --limit-errors 1
+
+

To get help for a command and its arguments you can use the help flag with the command:

+ +
+
+
frictionless describe --help # to get help for describe
+frictionless extract --help # to get help for extract
+frictionless validate --help # to get help for validate
+frictionless transform --help # to get help for transform
+
+ +
+

Outputs

+

Usually, Frictionless commands return pretty-formatted tabular data, as extract or validate do. For the describe command you get metadata back, and you can choose in which format to return it:

+ +
+
+
frictionless describe # default YAML with a commented front-matter
+frictionless describe --yaml # standard YAML
+frictionless describe --json # standard JSON
+
+ +
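From Python, the equivalent of choosing an output format is serializing the returned metadata object. A minimal sketch, assuming a local data/table.csv as in the argument example above (to_yaml and to_json are standard metadata methods):

from frictionless import describe

resource = describe("data/table.csv")
print(resource.to_yaml())  # or resource.to_json()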
+

Errors

+

The Frictionless CLI should not fail with internal Python errors and a traceback (a long listing of related code). If you see something like this, please create an issue in the Issue Tracker.

+

Debug

+

To debug a problem please use:

+ + + +
+
+
frictionless command --debug
+
+ +
\ No newline at end of file
diff --git a/docs/console/publish.html b/docs/console/publish.html
new file mode 100644
index 0000000000..780faaff8e
--- /dev/null
+++ b/docs/console/publish.html
@@ -0,0 +1,3474 @@
+Publish | Frictionless Framework

Publish

+
+ +

With the publish command you can publish your dataset to a data publishing platform like CKAN:

+
frictionless publish data/tables/*.csv --target http://ckan:5000/dataset/my-best --title "My best dataset"
+
+

It will ask for an API Key to upload your metadata and data. As a result, the dataset becomes available on the target CKAN instance.

+
+ +
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/console/query.html b/docs/console/query.html new file mode 100644 index 0000000000..39e346c91a --- /dev/null +++ b/docs/console/query.html @@ -0,0 +1,3489 @@ + + + + + + + + +Query | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ Edit page in Livemark
+ (2024-11-22 08:02) +
+ +

Query

+
+ +

With the query command you can explore tabular files within a SQLite database.

+

Installation

+ +
+
+
pip install frictionless[sql]
+pip install frictionless[sql,zenodo] # for examples in this tutorial
+
+ +
+

Usage

+
frictionless query https://zenodo.org/record/3977957
+
\ No newline at end of file
diff --git a/docs/console/script.html b/docs/console/script.html
new file mode 100644
index 0000000000..28a9b60fd3
--- /dev/null
+++ b/docs/console/script.html
@@ -0,0 +1,3489 @@
+Script | Frictionless Framework

Script

+
+ +

With the script command you can explore tabular files with Pandas using a single console command:

+

Installation

+ +
+
+
pip install frictionless[sql]
+pip install frictionless[sql,zenodo] # for examples in this tutorial
+
+ +
+

Usage

+
frictionless script https://zenodo.org/record/3977957
+
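What the command wraps can also be reproduced in Python directly. A minimal sketch, assuming a local table.csv and the Pandas format support installed (for example via pip install frictionless[pandas]):

from frictionless import Resource

resource = Resource("table.csv")
df = resource.to_pandas()  # hand the table over to Pandas
print(df.head())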
\ No newline at end of file
diff --git a/docs/console/validate.html b/docs/console/validate.html
new file mode 100644
index 0000000000..0a24f00bd5
--- /dev/null
+++ b/docs/console/validate.html
@@ -0,0 +1,3512 @@
+Validate | Frictionless Framework

Validate

+
+ +

With the validate command you can validate your tabular files (individual files or a whole dataset). For example:

+ + + +
+
+
frictionless validate table.csv invalid.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+                  dataset
+┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name    ┃ type  ┃ path        ┃ status  ┃
+┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ table   │ table │ table.csv   │ VALID   │
+│ invalid │ table │ invalid.csv │ INVALID │
+└─────────┴───────┴─────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                                    invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row  ┃ Field ┃ Type            ┃ Message                                     ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3     │ blank-label     │ Label in the header in field at position    │
+│      │       │                 │ "3" is blank                                │
+│ None │ 4     │ duplicate-label │ Label "name" in the header at position "4"  │
+│      │       │                 │ is duplicated to a label: at position "2"   │
+│ 2    │ 3     │ missing-cell    │ Row at position "2" has a missing cell in   │
+│      │       │                 │ field "field3" at position "3"              │
+│ 2    │ 4     │ missing-cell    │ Row at position "2" has a missing cell in   │
+│      │       │                 │ field "name2" at position "4"               │
+│ 3    │ 3     │ missing-cell    │ Row at position "3" has a missing cell in   │
+│      │       │                 │ field "field3" at position "3"              │
+│ 3    │ 4     │ missing-cell    │ Row at position "3" has a missing cell in   │
+│      │       │                 │ field "name2" at position "4"               │
+│ 4    │ None  │ blank-row       │ Row at position "4" is completely blank     │
+│ 5    │ 5     │ extra-cell      │ Row at position "5" has an extra value in   │
+│      │       │                 │ field at position "5"                       │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+ +
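The same validation can be run from Python, where the result is a Report object. A minimal sketch, assuming the same invalid.csv as above:

from frictionless import validate

report = validate("invalid.csv")
print(report.valid)  # False, since the file has errors
print(report.flatten(["rowNumber", "fieldNumber", "type"]))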
\ No newline at end of file
diff --git a/docs/errors/cell.html b/docs/errors/cell.html
new file mode 100644
index 0000000000..34ab4144aa
--- /dev/null
+++ b/docs/errors/cell.html
@@ -0,0 +1,3856 @@
+Cell Errors | Frictionless Framework

Cell Errors

+

Cell Error

Name         Value
Type         cell-error
Title        Cell Error
Description  Cell Error
Template     Cell Error
Tags         #table #row #cell

Extra Cell

Name         Value
Type         extra-cell
Title        Extra Cell
Description  This row has more values compared to the header row (the first row in the data source). A key concept is that all the rows in tabular data must have the same number of columns.
Template     Row at position "{rowNumber}" has an extra value in field at position "{fieldNumber}"
Tags         #table #row #cell

Missing Cell

Name         Value
Type         missing-cell
Title        Missing Cell
Description  This row has less values compared to the header row (the first row in the data source). A key concept is that all the rows in tabular data must have the same number of columns.
Template     Row at position "{rowNumber}" has a missing cell in field "{fieldName}" at position "{fieldNumber}"
Tags         #table #row #cell

Type Error

Name         Value
Type         type-error
Title        Type Error
Description  The value does not match the schema type and format for this field.
Template     Type error in the cell "{cell}" in row "{rowNumber}" and field "{fieldName}" at position "{fieldNumber}": {note}
Tags         #table #row #cell

Constraint Error

Name         Value
Type         constraint-error
Title        Constraint Error
Description  A field value does not conform to a constraint.
Template     The cell "{cell}" in row at position "{rowNumber}" and field "{fieldName}" at position "{fieldNumber}" does not conform to a constraint: {note}
Tags         #table #row #cell

Unique Error

Name         Value
Type         unique-error
Title        Unique Error
Description  This field is a unique field but it contains a value that has been used in another row.
Template     Row at position "{rowNumber}" has unique constraint violation in field "{fieldName}" at position "{fieldNumber}": {note}
Tags         #table #row #cell

Truncated Value

Name         Value
Type         truncated-value
Title        Truncated Value
Description  The value is possible truncated.
Template     The cell {cell} in row at position {rowNumber} and field {fieldName} at position {fieldNumber} has an error: {note}
Tags         #table #row #cell

Forbidden Value

Name         Value
Type         forbidden-value
Title        Forbidden Value
Description  The value is forbidden.
Template     The cell {cell} in row at position {rowNumber} and field {fieldName} at position {fieldNumber} has an error: {note}
Tags         #table #row #cell

Sequential Value

Name         Value
Type         sequential-value
Title        Sequential Value
Description  The value is not sequential.
Template     The cell {cell} in row at position {rowNumber} and field {fieldName} at position {fieldNumber} has an error: {note}
Tags         #table #row #cell

Ascii Value

Name         Value
Type         ascii-value
Title        Ascii Value
Description  The cell contains non-ascii characters.
Template     The cell {cell} in row at position {rowNumber} and field {fieldName} at position {fieldNumber} has an error: {note}
Tags         #table #row #cell

Reference

+
+ + +
+
+ +

errors.CellError (class)

+ +
+
+ + +
+

errors.CellError (class)

+

Cell error representation. A base class for all the errors related to the cell value.

+

Signature

+

(*, note: str, cells: List[str], row_number: int, cell: str, field_name: str, field_number: int) -> None

+

Parameters

+
    +
  • + note + (str)
  • +
  • + cells + (List[str])
  • +
  • + row_number + (int)
  • +
  • + cell + (str)
  • +
  • + field_name + (str)
  • +
  • + field_number + (int)
  • +
+
+ +
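Since the constructor is keyword-only, as documented above, a cell error can be created directly; all values below are illustrative. A minimal sketch:

from frictionless import errors

error = errors.CellError(
    note="example note",
    cells=["1", "english"],
    row_number=2,
    cell="english",
    field_name="name",
    field_number=2,
)
print(error.note)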
+

errors.cellError.cell (property)

+

+ Cell where the error occurred. +

+

Signature

+

str

+
+
+

errors.cellError.field_name (property)

+

+ Name of the field that has an error. +

+

Signature

+

str

+
+
+

errors.cellError.field_number (property)

+

+ Index of the field that has an error. +

+

Signature

+

int

+
+ + +
+

errors.CellError.from_row (method) (static)

+

Create an error from a cell

+

Signature

+

(row: Row, *, note: str, field_name: str)

+

Parameters

+
    +
  • + row + (Row): row
  • +
  • + note + (str): note
  • +
  • + field_name + (str): field name
  • +
+
\ No newline at end of file
diff --git a/docs/errors/data.html b/docs/errors/data.html
new file mode 100644
index 0000000000..e7769a83e1
--- /dev/null
+++ b/docs/errors/data.html
@@ -0,0 +1,3535 @@
+Data Errors | Frictionless Framework

Data Errors

+

Data Error

Name         Value
Type         data-error
Title        Data Error
Description  There is a data error.
Template     Data error: {note}

Reference

+
+ + +
+
+ +

errors.DataError (class)

+ +
+
+ + +
+

errors.DataError (class)

+

Error representation. It is a base class from which other error subclasses are derived.

+

Signature

+

(*, note: str) -> None

+

Parameters

+
    +
  • + note + (str)
  • +
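All error classes on this page share this keyword-only constructor, so creating one takes a single note argument. A minimal sketch:

from frictionless import errors

error = errors.DataError(note="something went wrong while reading the data")
print(error.note)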
+
\ No newline at end of file
diff --git a/docs/errors/file.html b/docs/errors/file.html
new file mode 100644
index 0000000000..04b1e00b5b
--- /dev/null
+++ b/docs/errors/file.html
@@ -0,0 +1,3597 @@
+File Errors | Frictionless Framework

File Errors

+

File Error

Name         Value
Type         file-error
Title        File Error
Description  There is a file error.
Template     General file error: {note}
Tags         #file

Hash Count Error

Name         Value
Type         hash-count
Title        Hash Count Error
Description  This error can happen if the data is corrupted.
Template     The data source does not match the expected hash count: {note}
Tags         #file

Byte Count Error

Name         Value
Type         byte-count
Title        Byte Count Error
Description  This error can happen if the data is corrupted.
Template     The data source does not match the expected byte count: {note}
Tags         #file

Reference

+
+ + +
+
+ +

errors.FileError (class)

+ +
+
+ + +
+

errors.FileError (class)

+

Error representation. It is a base class from which other error subclasses are derived.

+

Signature

+

(*, note: str) -> None

+

Parameters

+
    +
  • + note + (str)
  • +
+
\ No newline at end of file
diff --git a/docs/errors/header.html b/docs/errors/header.html
new file mode 100644
index 0000000000..1cf9ac6aba
--- /dev/null
+++ b/docs/errors/header.html
@@ -0,0 +1,3589 @@
+Header Errors | Frictionless Framework

Header Errors

+

Header Error

Name         Value
Type         header-error
Title        Header Error
Description  Cell Error
Template     Cell Error
Tags         #table #header

Blank Header

Name         Value
Type         blank-header
Title        Blank Header
Description  This header is empty. A header should contain at least one value.
Template     Header is completely blank
Tags         #table #header

Reference

+
+ + +
+
+ +

errors.HeaderError (class)

+ +
+
+ + +
+

errors.HeaderError (class)

+

Header error representation. A base class for all the errors related to the resource header.

+

Signature

+

(*, note: str, labels: List[str], row_numbers: List[int]) -> None

+

Parameters

+
    +
  • + note + (str)
  • +
  • + labels + (List[str])
  • +
  • + row_numbers + (List[int])
  • +
+
+ +
+

errors.headerError.labels (property)

+

+ List of labels that has errors. +

+

Signature

+

List[str]

+
+
+

errors.headerError.row_numbers (property)

+

+ Row number where the error occurred. +

+

Signature

+

List[int]

+
+ + + + +
\ No newline at end of file
diff --git a/docs/errors/label.html b/docs/errors/label.html
new file mode 100644
index 0000000000..88df659be7
--- /dev/null
+++ b/docs/errors/label.html
@@ -0,0 +1,3722 @@
+Label Errors | Frictionless Framework

Label Errors

+

Label Error

Name         Value
Type         label-error
Title        Label Error
Description  Label Error
Template     Label Error
Tags         #table #header #label

Extra Label

Name         Value
Type         extra-label
Title        Extra Label
Description  The header of the data source contains label that does not exist in the provided schema.
Template     There is an extra label "{label}" in header at position "{fieldNumber}"
Tags         #table #header #label

Missing Label

Name         Value
Type         missing-label
Title        Missing Label
Description  Based on the schema there should be a label that is missing in the data's header.
Template     There is a missing label in the header's field "{fieldName}" at position "{fieldNumber}"
Tags         #table #header #label

Blank Label

Name         Value
Type         blank-label
Title        Blank Label
Description  A label in the header row is missing a value. Label should be provided and not be blank.
Template     Label in the header in field at position "{fieldNumber}" is blank
Tags         #table #header #label

Duplicate Label

Name         Value
Type         duplicate-label
Title        Duplicate Label
Description  Two columns in the header row have the same value. Column names should be unique.
Template     Label "{label}" in the header at position "{fieldNumber}" is duplicated to a label: {note}
Tags         #table #header #label

Incorrect Label

Name         Value
Type         incorrect-label
Title        Incorrect Label
Description  One of the data source header does not match the field name defined in the schema.
Template     Label "{label}" in field {fieldName} at position "{fieldNumber}" does not match the field name in the schema
Tags         #table #header #label

Reference

+
+ + +
+
+ +

errors.LabelError (class)

+ +
+
+ + +
+

errors.LabelError (class)

+

Label error representation. A base class for all the errors related to the labels of the columns/fields.

+

Signature

+

(*, note: str, labels: List[str], row_numbers: List[int], label: str, field_name: str, field_number: int) -> None

+

Parameters

+
    +
  • + note + (str)
  • +
  • + labels + (List[str])
  • +
  • + row_numbers + (List[int])
  • +
  • + label + (str)
  • +
  • + field_name + (str)
  • +
  • + field_number + (int)
  • +
+
+ +
+

errors.labelError.label (property)

+

+ Label of the field that has an error. +

+

Signature

+

str

+
+
+

errors.labelError.field_name (property)

+

+ Name of the field that has an error. +

+

Signature

+

str

+
+
+

errors.labelError.field_number (property)

+

+ Index of the field that has an error. +

+

Signature

+

int

+
+ + + + +
\ No newline at end of file
diff --git a/docs/errors/metadata.html b/docs/errors/metadata.html
new file mode 100644
index 0000000000..7b3fbf20fb
--- /dev/null
+++ b/docs/errors/metadata.html
@@ -0,0 +1,3960 @@
+Metadata Errors | Frictionless Framework

Metadata Errors

+

Metadata Error

Name         Value
Type         metadata-error
Title        Metadata Error
Description  There is a metadata error.
Template     Metadata error: {note}

Catalog Error

Name         Value
Type         catalog-error
Title        Catalog Error
Description  A validation cannot be processed.
Template     The data catalog has an error: {note}

Dataset Error

Name         Value
Type         dataset-error
Title        Dataset Error
Description  A validation cannot be processed.
Template     The dataset has an error: {note}

Checklist Error

Name         Value
Type         checklist-error
Title        Checklist Error
Description  Provided checklist is not valid.
Template     Checklist is not valid: {note}

Check Error

Name         Value
Type         check-error
Title        Check Error
Description  Provided check is not valid
Template     Check is not valid: {note}

Detector Error

Name         Value
Type         detector-error
Title        Detector Error
Description  Provided detector is not valid.
Template     Detector is not valid: {note}

Dialect Error

Name         Value
Type         dialect-error
Title        Dialect Error
Description  Provided dialect is not valid.
Template     Dialect is not valid: {note}

Control Error

Name         Value
Type         control-error
Title        Control Error
Description  Provided control is not valid.
Template     Control is not valid: {note}

Inquiry Error

Name         Value
Type         inquiry-error
Title        Inquiry Error
Description  Provided inquiry is not valid.
Template     Inquiry is not valid: {note}

Inquiry Task Error

Name         Value
Type         inquiry-task-error
Title        Inquiry Task Error
Description  Provided inquiry task is not valid.
Template     Inquiry task is not valid: {note}

Package Error

Name         Value
Type         package-error
Title        Package Error
Description  A validation cannot be processed.
Template     The data package has an error: {note}

Pipeline Error

Name         Value
Type         pipeline-error
Title        Pipeline Error
Description  Provided pipeline is not valid.
Template     Pipeline is not valid: {note}

Step Error

Name         Value
Type         step-error
Title        Step Error
Description  Provided step is not valid
Template     Step is not valid: {note}

Report Error

Name         Value
Type         report-error
Title        Report Error
Description  Provided report is not valid.
Template     Report is not valid: {note}

Report Task Error

Name         Value
Type         report-task-error
Title        Report Task Error
Description  Provided report task is not valid.
Template     Report task is not valid: {note}

Schema Error

Name         Value
Type         schema-error
Title        Schema Error
Description  Provided schema is not valid.
Template     Schema is not valid: {note}

Field Error

Name         Value
Type         field-error
Title        Field Error
Description  Provided field is not valid.
Template     Field is not valid: {note}

Stats Error

Name         Value
Type         stats-error
Title        Stats Error
Description  Stats object has an error.
Template     Stats object has an error: {note}

Reference

+
+ + +
+
+ +

errors.MetadataError (class)

+ +
+
+ + +
+

errors.MetadataError (class)

+

Error representation. It is a base class from which other error subclasses are derived.

+

Signature

+

(*, note: str) -> None

+

Parameters

+
    +
  • + note + (str)
  • +
+
+ + + + + +
\ No newline at end of file
diff --git a/docs/errors/resource.html b/docs/errors/resource.html
new file mode 100644
index 0000000000..98450b1378
--- /dev/null
+++ b/docs/errors/resource.html
@@ -0,0 +1,3660 @@
+Resource Errors | Frictionless Framework

Resource Errors

+

Resource Error

Name         Value
Type         resource-error
Title        Resource Error
Description  A validation cannot be processed.
Template     The data resource has an error: {note}

Source Error

Name         Value
Type         source-error
Title        Source Error
Description  Data reading error because of not supported or inconsistent contents.
Template     The data source has not supported or has inconsistent contents: {note}

Scheme Error

Name         Value
Type         scheme-error
Title        Scheme Error
Description  Data reading error because of incorrect scheme.
Template     The data source could not be successfully loaded: {note}

Format Error

Name         Value
Type         format-error
Title        Format Error
Description  Data reading error because of incorrect format.
Template     The data source could not be successfully parsed: {note}

Encoding Error

Name         Value
Type         encoding-error
Title        Encoding Error
Description  Data reading error because of an encoding problem.
Template     The data source could not be successfully decoded: {note}

Compression Error

Name         Value
Type         compression-error
Title        Compression Error
Description  Data reading error because of a decompression problem.
Template     The data source could not be successfully decompressed: {note}

Reference

+
+ + +
+
+ +

errors.ResourceError (class)

+ +
+
+ + +
+

errors.ResourceError (class)

+

Error representation. It is a base class from which other error subclasses are derived.

+

Signature

+

(*, note: str) -> None

+

Parameters

+
    +
  • + note + (str)
  • +
+
+ + + + + +
\ No newline at end of file
diff --git a/docs/errors/row.html b/docs/errors/row.html
new file mode 100644
index 0000000000..0c53defe1b
--- /dev/null
+++ b/docs/errors/row.html
@@ -0,0 +1,3819 @@
+Row Errors | Frictionless Framework

Row Errors

+

Row Error

Name         Value
Type         row-error
Title        Row Error
Description  Row Error
Template     Row Error
Tags         #table #row

Blank Row

Name         Value
Type         blank-row
Title        Blank Row
Description  This row is empty. A row should contain at least one value.
Template     Row at position "{rowNumber}" is completely blank
Tags         #table #row

PrimaryKey Error

Name         Value
Type         primary-key
Title        PrimaryKey Error
Description  Values in the primary key fields should be unique for every row
Template     Row at position "{rowNumber}" violates the primary key: {note}
Tags         #table #row

ForeignKey Error

Name         Value
Type         foreign-key
Title        ForeignKey Error
Description  Values in the foreign key fields should exist in the reference table
Template     Row at position "{rowNumber}" violates the foreign key: {note}
Tags         #table #row

Duplicate Row

Name         Value
Type         duplicate-row
Title        Duplicate Row
Description  The row is duplicated.
Template     Row at position {rowNumber} is duplicated: {note}
Tags         #table #row

Row Constraint

Name         Value
Type         row-constraint
Title        Row Constraint
Description  The value does not conform to the row constraint.
Template     The row at position {rowNumber} has an error: {note}
Tags         #table #row

Reference

+
+ + +
+
+ +

errors.RowError (class)

+

errors.ForeignKeyError (class)

+ +
+
+ + +
+

errors.RowError (class)

+

Row error representation. A base class for all the errors related to a row of the tabular data.

+

Signature

+

(*, note: str, cells: List[str], row_number: int) -> None

+

Parameters

+
    +
  • + note + (str)
  • +
  • + cells + (List[str])
  • +
  • + row_number + (int)
  • +
+
+ +
+

errors.rowError.cells (property)

Values of all the cells in the row that has an error.

Signature

List[str]

errors.rowError.row_number (property)

Index of the row that has an error.

Signature

int

errors.RowError.from_row (method) (static)

Create an error from a row

Signature

(row: Row, *, note: str)

Parameters

• row (Row)
• note (str)

errors.ForeignKeyError (class)

Row error representation. A base class for all errors related to a row of tabular data.

Signature

(*, note: str, cells: List[str], row_number: int, field_names: List[str], field_cells: List[str], reference_name: str, reference_field_names: List[str]) -> None

Parameters

• note (str)
• cells (List[str])
• row_number (int)
• field_names (List[str])
• field_cells (List[str])
• reference_name (str)
• reference_field_names (List[str])

errors.foreignKeyError.field_names (property)

Keys in the resource target column.

Signature

List[str]

errors.foreignKeyError.field_cells (property)

Cells not found in the lookup table.

Signature

List[str]

errors.foreignKeyError.reference_name (property)

Name of the lookup table the keys were searched on.

Signature

str

errors.foreignKeyError.reference_field_names (property)

Key names in the lookup table defined as foreign keys in the resource.

Signature

List[str]

errors.ForeignKeyError.from_row (method) (static)

Create a foreign-key-error from a row

Signature

(row: Row, *, note: str, field_names: List[str], field_values: List[Any], reference_name: str, reference_field_names: List[str])

Parameters

• row (Row)
• note (str)
• field_names (List[str])
• field_values (List[Any])
• reference_name (str)
• reference_field_names (List[str])
diff --git a/docs/errors/table.html b/docs/errors/table.html
new file mode 100644

Table Errors

Table Error

Type: table-error
Title: Table Error
Description: There is a table error.
Template: General table error: {note}
Tags: #table

Field Count Error

Type: field-count
Title: Field Count Error
Description: This error can happen if the data is corrupted.
Template: The data source does not match the expected field count: {note}
Tags: #table

Row Count Error

Type: row-count
Title: Row Count Error
Description: This error can happen if the data is corrupted.
Template: The data source does not match the expected row count: {note}
Tags: #table

Table dimensions error

Type: table-dimensions
Title: Table dimensions error
Description: This error can happen if the data is corrupted.
Template: The data source does not have the required dimensions: {note}
Tags: #table
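
This error is typically produced by the table_dimensions check; a minimal sketch (parameter names assumed from the checks documentation):

from frictionless import validate, checks

# require at most one data row; a two-row table should trigger table-dimensions
report = validate(
    [['id'], ['1'], ['2']],
    checks=[checks.table_dimensions(max_rows=1)],
)
print(report.flatten(['type', 'message']))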

Deviated Value

Type: deviated-value
Title: Deviated Value
Description: The value is deviated.
Template: There is a possible error because the value is deviated: {note}
Tags: #table

Deviated cell

Type: deviated-cell
Title: Deviated cell
Description: The cell is deviated.
Template: There is a possible error because the cell is deviated: {note}
Tags: #table

Required Value

Type: required-value
Title: Required Value
Description: The required values are missing.
Template: Required values not found: {note}
Tags: #table

Reference

errors.TableError (class)

Error representation. It is a base class from which other error classes are inherited or derived.

Signature

(*, note: str) -> None

Parameters

• note (str)
diff --git a/docs/fields/any.html b/docs/fields/any.html
new file mode 100644

Any Field

Overview

AnyField provides the ability to skip any cell parsing. Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], [1], ['1']]
rows = extract(data, schema=Schema(fields=[fields.AnyField(name='name')]))
print(rows)

{'memory': [{'name': 1}, {'name': '1'}]}

Reference

fields.AnyField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
diff --git a/docs/fields/array.html b/docs/fields/array.html
new file mode 100644

Array Field

Overview

The field contains data that is a valid JSON array. Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], ['["value1", "value2"]']]
rows = extract(data, schema=Schema(fields=[fields.ArrayField(name='name')]))
print(rows)

{'memory': [{'name': ['value1', 'value2']}]}

Reference

fields.ArrayField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None, array_item: Optional[Dict[str, Any]] = NOTHING) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
• array_item (Optional[Dict[str, Any]])

fields.arrayField.array_item (property)

A dictionary that specifies the type and other constraints for the data that will be read in this data type field.

Signature

Optional[Dict[str, Any]]
diff --git a/docs/fields/boolean.html b/docs/fields/boolean.html
new file mode 100644

Boolean Field

Overview

The field contains boolean (true/false) data.

In physical representations of data where boolean values are represented with strings, the values set in trueValues and falseValues are cast to their logical representation as booleans. trueValues and falseValues are arrays that can be customised to the user's needs (a customisation sketch follows the example below). The default values for these are in the additional properties section below. Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], ['true'], ['false']]
rows = extract(data, schema=Schema(fields=[fields.BooleanField(name='name')]))
print(rows)

{'memory': [{'name': True}, {'name': False}]}
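A sketch of the customisation mentioned above, mapping 'yes'/'no' markers via true_values/false_values (the sample data is illustrative; the parameters are in the reference below):

from frictionless import Schema, extract, fields

# custom truthy/falsy markers instead of the defaults
field = fields.BooleanField(name='name', true_values=['yes'], false_values=['no'])
rows = extract([['name'], ['yes'], ['no']], schema=Schema(fields=[field]))
print(rows)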

Reference

fields.BooleanField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None, true_values: List[str] = NOTHING, false_values: List[str] = NOTHING) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
• true_values (List[str])
• false_values (List[str])

fields.booleanField.true_values (property)

It defines the values to be read as true values while reading data. The default true values are ["true", "True", "TRUE", "1"].

Signature

List[str]

fields.booleanField.false_values (property)

It defines the values to be read as false values while reading data. The default false values are ["false", "False", "FALSE", "0"].

Signature

List[str]
diff --git a/docs/fields/date.html b/docs/fields/date.html
new file mode 100644

Date Field

Overview

A date without a time (by default in ISO8601 format). Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], ['2022-08-22']]
rows = extract(data, schema=Schema(fields=[fields.DateField(name='name')]))
print(rows)

{'memory': [{'name': datetime.date(2022, 8, 22)}]}

Reference

fields.DateField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
diff --git a/docs/fields/datetime.html b/docs/fields/datetime.html
new file mode 100644

Datetime Field

Overview

A date with a time (by default in ISO8601 format). Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], ['2022-08-22T12:00:00']]
rows = extract(data, schema=Schema(fields=[fields.DatetimeField(name='name')]))
print(rows)

{'memory': [{'name': datetime.datetime(2022, 8, 22, 12, 0)}]}

Reference

fields.DatetimeField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
diff --git a/docs/fields/duration.html b/docs/fields/duration.html
new file mode 100644

Duration Field

Overview

A duration of time. We follow the definition of the XML Schema duration datatype directly and that definition is implicitly inlined here. Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], ['P1Y']]
rows = extract(data, schema=Schema(fields=[fields.DurationField(name='name')]))
print(rows)

{'memory': [{'name': isodate.duration.Duration(0, 0, 0, years=1, months=0)}]}

Reference

fields.DurationField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
diff --git a/docs/fields/geojson.html b/docs/fields/geojson.html
new file mode 100644

Geojson Field

The field contains a JSON object according to GeoJSON or TopoJSON spec. Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], ['{"geometry": null, "type": "Feature", "properties": {"k": "v"}}']]
rows = extract(data, schema=Schema(fields=[fields.GeojsonField(name='name')]))
print(rows)

{'memory': [{'name': {'geometry': None, 'type': 'Feature', 'properties': {'k': 'v'}}}]}

Reference

fields.GeojsonField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
diff --git a/docs/fields/geopoint.html b/docs/fields/geopoint.html
new file mode 100644

Geopoint Field

The field contains data describing a geographic point. Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], ["180, -90"]]
rows = extract(data, schema=Schema(fields=[fields.GeopointField(name='name')]))
print(rows)

{'memory': [{'name': [180.0, -90.0]}]}

Reference

fields.GeopointField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
diff --git a/docs/fields/integer.html b/docs/fields/integer.html
new file mode 100644

Integer Field

The field contains integers, that is, whole numbers. Integer values are indicated in the standard way for any valid integer. Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], ['1'], ['2'], ['3']]
rows = extract(data, schema=Schema(fields=[fields.IntegerField(name='name')]))
print(rows)

{'memory': [{'name': 1}, {'name': 2}, {'name': 3}]}

Reference

fields.IntegerField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None, bare_number: bool = True) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
• bare_number (bool)

fields.integerField.bare_number (property)

It specifies that the value is a bare number. If True, the pattern that removes non-digit characters is not applied, and vice versa. The default value is True.

Signature

bool
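As a sketch of the bare_number option above, setting it to False makes the parser strip non-digit characters such as currency symbols (the sample value is illustrative):

from frictionless import Schema, extract, fields

# with bare_number=False, '$100' should be read as the integer 100
field = fields.IntegerField(name='name', bare_number=False)
rows = extract([['name'], ['$100']], schema=Schema(fields=[field]))
print(rows)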
diff --git a/docs/fields/number.html b/docs/fields/number.html
new file mode 100644

Number Field

Overview

The field contains numbers of any kind including decimals. Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], ['1.1'], ['2.2'], ['3.3']]
rows = extract(data, schema=Schema(fields=[fields.NumberField(name='name')]))
print(rows)

{'memory': [{'name': Decimal('1.1')}, {'name': Decimal('2.2')}, {'name': Decimal('3.3')}]}

Reference

fields.NumberField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None, bare_number: bool = True, float_number: bool = False, decimal_char: str = ., group_char: str = ) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
• bare_number (bool)
• float_number (bool)
• decimal_char (str)
• group_char (str)

fields.numberField.bare_number (property)

It specifies that the value is a bare number. If True, the pattern that removes non-digit characters is not applied, and vice versa. The default value is True.

Signature

bool

fields.numberField.float_number (property)

It specifies that the value is a float number.

Signature

bool

fields.numberField.decimal_char (property)

It specifies the char to be used as the decimal character. The default value is "." and it can take values such as ".", "@", etc.

Signature

str

fields.numberField.group_char (property)

It specifies the char to be used as the group character. The default value is "" and it can take values such as ",", "#", etc.

Signature

str
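A sketch combining decimal_char and group_char for European-style numbers (the sample value is illustrative):

from frictionless import Schema, extract, fields

# '1.000,5' uses '.' as the thousands separator and ',' as the decimal mark
field = fields.NumberField(name='name', decimal_char=',', group_char='.')
rows = extract([['name'], ['1.000,5']], schema=Schema(fields=[field]))
print(rows)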
diff --git a/docs/fields/object.html b/docs/fields/object.html
new file mode 100644

Object Field

Overview

The field contains data which is valid JSON. Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], ['{"key": "value"}']]
rows = extract(data, schema=Schema(fields=[fields.ObjectField(name='name')]))
print(rows)

{'memory': [{'name': {'key': 'value'}}]}

Reference

fields.ObjectField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
diff --git a/docs/fields/string.html b/docs/fields/string.html
new file mode 100644

String Field

Overview

The field contains strings, that is, sequences of characters. Read more in Table Schema Standard. Currently supported formats (a format-specific sketch follows the example below):

• default
• uri
• email
• uuid
• binary
• wkt (doesn't work in Python3.10+)

Example

from frictionless import Schema, extract, fields

data = [['name'], ['value']]
rows = extract(data, schema=Schema(fields=[fields.StringField(name='name')]))
print(rows)

{'memory': [{'name': 'value'}]}
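For the non-default formats listed above, pass format to the field constructor; a minimal sketch using the email format (the sample address is illustrative):

from frictionless import Schema, extract, fields

# cells must be valid email addresses to be read as values
field = fields.StringField(name='name', format='email')
rows = extract([['name'], ['frictionless@okfn.org']], schema=Schema(fields=[field]))
print(rows)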

Reference

fields.StringField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
diff --git a/docs/fields/time.html b/docs/fields/time.html
new file mode 100644

Time Field

Overview

A time without a date. Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], ['15:00:00']]
rows = extract(data, schema=Schema(fields=[fields.TimeField(name='name')]))
print(rows)

{'memory': [{'name': datetime.time(15, 0)}]}

Reference

fields.TimeField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
diff --git a/docs/fields/year.html b/docs/fields/year.html
new file mode 100644

Year Field

Overview

A calendar year as per XMLSchema gYear. Usual lexical representation is YYYY. There are no format options. Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], ['2022']]
rows = extract(data, schema=Schema(fields=[fields.YearField(name='name')]))
print(rows)

{'memory': [{'name': 2022}]}

Reference

fields.YearField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
diff --git a/docs/fields/yearmonth.html b/docs/fields/yearmonth.html
new file mode 100644

Yearmonth Field

Overview

A specific month in a specific year as per XMLSchema gYearMonth. Usual lexical representation is: YYYY-MM. Read more in Table Schema Standard.

Example

from frictionless import Schema, extract, fields

data = [['name'], ['2022-08']]
rows = extract(data, schema=Schema(fields=[fields.YearmonthField(name='name')]))
print(rows)

{'memory': [{'name': yearmonth(year=2022, month=8)}]}

Reference

fields.YearmonthField (class)

Field representation

Signature

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None

Parameters

• name (str)
• title (Optional[str])
• description (Optional[str])
• format (str)
• missing_values (List[str])
• constraints (Dict[str, Any])
• rdf_type (Optional[str])
• example (Optional[str])
• schema (Optional[Schema])
diff --git a/docs/formats/csv.html b/docs/formats/csv.html
new file mode 100644

Csv Format

CSV is a file format which you can use in Frictionless for reading and writing. Arguably it's the main Open Data format, so it's supported very well in Frictionless.

Reading Data

You can read this format using Package/Resource, for example:

from pprint import pprint
from frictionless import Resource

resource = Resource('table.csv')
pprint(resource.read_rows())

[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]

Writing Data

The same applies to writing:

from frictionless import Resource

source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
target = source.write('table-output.csv')
print(target)
print(target.to_view())

{'name': 'table-output',
 'type': 'table',
 'path': 'table-output.csv',
 'scheme': 'file',
 'format': 'csv',
 'mediatype': 'text/csv'}
+----+-----------+
| id | name      |
+====+===========+
|  1 | 'english' |
+----+-----------+
|  2 | 'german'  |
+----+-----------+

Configuration

There is a control to configure how Frictionless reads and writes files in this format. For example:

from frictionless import Resource, formats

resource = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
resource.write('tmp/table.csv', control=formats.CsvControl(delimiter=';'))

Reference

formats.CsvControl (class)

Csv dialect representation. Control class to set params for CSV reader/writer.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, delimiter: str = ,, line_terminator: str = \r\n, quote_char: str = ", double_quote: bool = True, escape_char: Optional[str] = None, null_sequence: Optional[str] = None, skip_initial_space: bool = False) -> None

Parameters

• name (Optional[str])
• title (Optional[str])
• description (Optional[str])
• delimiter (str)
• line_terminator (str)
• quote_char (str)
• double_quote (bool)
• escape_char (Optional[str])
• null_sequence (Optional[str])
• skip_initial_space (bool)

formats.csvControl.delimiter (property)

Specify the delimiter used to separate text strings while reading from or writing to the csv file. Default value is ",". For example: delimiter=";"

Signature

str

formats.csvControl.line_terminator (property)

Specify the line terminator for the csv file while reading/writing. For example: line_terminator="\n". Default line_terminator is "\r\n".

Signature

str

formats.csvControl.quote_char (property)

Specify the quote char for fields that contain a special character such as comma, CR, LF or double quote. Default value is '"'. For example: quotechar='|'

Signature

str

formats.csvControl.double_quote (property)

It controls how occurrences of 'quote_char' inside a field should themselves be quoted. When set to True, the 'quote_char' is doubled; otherwise the escape char is used. Default value is True.

Signature

bool

formats.csvControl.escape_char (property)

A one-character string used by the csv writer to escape. Default is None, which disables escaping. It uses 'quote_char' if double_quote is False.

Signature

Optional[str]

formats.csvControl.null_sequence (property)

Specify the null sequence; not set by default. For example: \\N

Signature

Optional[str]

formats.csvControl.skip_initial_space (property)

Ignores spaces following the delimiter if set to True. For example, spaces in the header of a csv file: "Name", "Team"

Signature

bool

formats.csvControl.to_python (method)

Convert to Python's `csv.Dialect`
diff --git a/docs/formats/erd.html b/docs/formats/erd.html
new file mode 100644

Erd Format

Frictionless supports exporting a data package as an ER-diagram dot file. For example:

from frictionless import Package

package = Package('datapackage.zip')
package.to_er_diagram(path='erd.dot')
diff --git a/docs/formats/excel.html b/docs/formats/excel.html
new file mode 100644

Excel Format

Excel is a very popular tabular data format that usually has xlsx (newer) and xls (older) file extensions. Frictionless supports Excel files extensively.

pip install frictionless[excel]
pip install 'frictionless[excel]' # for zsh shell

Reading Data

You can read this format using Package/Resource, for example:

from pprint import pprint
from frictionless import Resource

resource = Resource(path='table.xlsx')
pprint(resource.read_rows())

[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]

Writing Data

The same applies to writing:

from frictionless import Resource

source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
target = source.write('table-output.xlsx')
print(target)
print(target.to_view())

Configuration

There is a dialect to configure how Frictionless reads and writes files in this format. For example:

from frictionless import Resource, formats

resource = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
resource.write('table-output-sheet.xls', control=formats.ExcelControl(sheet='My Table'))

Reference

formats.ExcelControl (class)

Excel control representation. Control class to set params for Excel reader/writer.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, sheet: Union[str, int] = 1, workbook_cache: Optional[Any] = None, fill_merged_cells: bool = False, preserve_formatting: bool = False, adjust_floating_point_error: bool = False, stringified: bool = False) -> None

Parameters

• name (Optional[str])
• title (Optional[str])
• description (Optional[str])
• sheet (Union[str, int])
• workbook_cache (Optional[Any])
• fill_merged_cells (bool)
• preserve_formatting (bool)
• adjust_floating_point_error (bool)
• stringified (bool)

formats.excelControl.sheet (property)

Name of the sheet from where to read or write data.

Signature

Union[str, int]

formats.excelControl.workbook_cache (property)

An empty dictionary which is used to handle workbook caching for remote workbooks. It stores the path to the temporary file while reading remote workbooks.

Signature

Optional[Any]

formats.excelControl.fill_merged_cells (property)

If True, it will unmerge and fill all merged cells with the visible value. Default value is False (see the sketch after this reference).

Signature

bool

formats.excelControl.preserve_formatting (property)

If set to True, it preserves text formatting for numeric and temporal cells. If not set, it will return all cell values as strings. Default value is False.

Signature

bool

formats.excelControl.adjust_floating_point_error (property)

If True, it corrects the Excel behavior regarding floating point numbers. For example: 274.65999999999997 -> 274.66 (when True).

Signature

bool

formats.excelControl.stringified (property)

Stringifies all the cell values. Default value is False.

Note that a table resource schema will still be applied and types coerced to match the schema (either provided or inferred) _after_ the rows are read as strings.

To return all cells as strings, set `stringified=True` and also specify a schema that defines all fields to be of type string (see #1659).

Signature

bool
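A sketch of the fill_merged_cells option described above ('table-merged.xlsx' is a hypothetical workbook with merged header cells):

from frictionless import Resource, formats

# merged cells are unmerged and filled with the visible value while reading
control = formats.ExcelControl(fill_merged_cells=True)
resource = Resource(path='table-merged.xlsx', control=control)
print(resource.read_rows())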
diff --git a/docs/formats/gsheets.html b/docs/formats/gsheets.html
new file mode 100644

Gsheets Format

Frictionless supports parsing Google Sheets data as a file format.

pip install frictionless[gsheets]
pip install 'frictionless[gsheets]' # for zsh shell

Reading Data

You can read from Google Sheets using Package/Resource, for example:

from pprint import pprint
from frictionless import Resource

path = 'https://docs.google.com/spreadsheets/d/1mHIWnDvW9cALRMq9OdNfRwjAthCUFUOACPp0Lkyl7b4/edit?usp=sharing'
resource = Resource(path=path)
pprint(resource.read_rows())

[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]

Writing Data

The same applies to writing:

from frictionless import Resource, formats

control = formats.GsheetsControl(credentials=".google.json")
resource = Resource(path='data/table.csv')
resource.write("https://docs.google.com/spreadsheets/d/<id>/edit", control=control)

Configuration

There is a dialect to configure how Frictionless reads and writes files in this format. For example:

from frictionless import Resource, formats

control = formats.GsheetsControl(credentials=".google.json")
resource = Resource(path='data/table.csv')
resource.write("https://docs.google.com/spreadsheets/d/<id>/edit", control=control)

Reference

formats.GsheetsControl (class)

Gsheets control representation. Control class to set params for Gsheets api.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, credentials: Optional[str] = None) -> None

Parameters

• name (Optional[str])
• title (Optional[str])
• description (Optional[str])
• credentials (Optional[str])

formats.gsheetsControl.credentials (property)

API key to access google sheets.

Signature

Optional[str]
diff --git a/docs/formats/html.html b/docs/formats/html.html
new file mode 100644

Html Format

Frictionless supports parsing HTML format:

pip install frictionless[html]
pip install 'frictionless[html]' # for zsh shell

Reading Data

You can read this file format using Package/Resource, for example:

from pprint import pprint
from frictionless import resources

resource = resources.TableResource(path='table1.html')
pprint(resource.read_rows())

[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]

Writing Data

The same applies to writing:

from frictionless import Resource, resources

source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
target = resources.TableResource(path='table-output.html')
source.write(target)
print(target)
print(target.to_view())

{'name': 'table-output',
 'type': 'table',
 'path': 'table-output.html',
 'scheme': 'file',
 'format': 'html',
 'mediatype': 'text/html'}
+----+-----------+
| id | name      |
+====+===========+
|  1 | 'english' |
+----+-----------+
|  2 | 'german'  |
+----+-----------+

Configuration

There is a dialect to configure HTML, for example:

from frictionless import Resource, resources, formats

control = formats.HtmlControl(selector='#id')
resource = resources.TableResource(path='table1.html', control=control)
print(resource.read_rows())

[]

Reference

formats.HtmlControl (class)

Html control representation. Control class to set params for Html reader/writer.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, selector: str = table) -> None

Parameters

• name (Optional[str])
• title (Optional[str])
• description (Optional[str])
• selector (str)

formats.htmlControl.selector (property)

Any valid css selector. Default selector is 'table'. For example: "table", "#id", ".meme" etc.

Signature

str
diff --git a/docs/formats/inline.html b/docs/formats/inline.html
new file mode 100644

Inline Format

Frictionless supports working with Inline Data from memory.

Reading Data

You can read data in this format using Package/Resource, for example:

from pprint import pprint
from frictionless import Resource

resource = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
pprint(resource.read_rows())

[{'id': 1, 'name': 'english'}, {'id': 2, 'name': 'german'}]

Writing Data

The same applies to writing:

from frictionless import Resource

source = Resource('table.csv')
target = source.write(format='inline', datatype='table')
print(target)
print(target.to_view())

{'name': 'memory',
 'type': 'table',
 'data': [['id', 'name'], [1, 'english'], [2, '中国人']],
 'format': 'inline'}
+----+-----------+
| id | name      |
+====+===========+
|  1 | 'english' |
+----+-----------+
|  2 | '中国人'     |
+----+-----------+

Configuration

There is a dialect to configure this format, for example:

from frictionless import Resource, formats

control = formats.InlineControl(keyed=True, keys=['name', 'id'])
resource = Resource(data=[{'id': 1, 'name': 'english'}, {'id': 2, 'name': 'german'}], control=control)
print(resource.to_view())

+-----------+----+
| name      | id |
+===========+====+
| 'english' |  1 |
+-----------+----+
| 'german'  |  2 |
+-----------+----+

Reference

formats.InlineControl (class)

Inline control representation. Control class to set params for Inline reader/writer.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, keys: Optional[List[str]] = None, keyed: bool = False) -> None

Parameters

• name (Optional[str])
• title (Optional[str])
• description (Optional[str])
• keys (Optional[List[str]])
• keyed (bool)

formats.inlineControl.keys (property)

Specify the keys/columns to read from the resource. For example: keys=["id","name"].

Signature

Optional[List[str]]

formats.inlineControl.keyed (property)

If set to True, it returns the data as key:value pairs.

Signature

bool
diff --git a/docs/formats/json.html b/docs/formats/json.html
new file mode 100644

Json Format

Frictionless supports parsing JSON tables (JSON and JSONL/NDJSON).

pip install frictionless[json]
pip install 'frictionless[json]' # for zsh shell

Reading Data

We use the path argument to ensure that it will not be guessed to be a metadata file.

You can read this format using Package/Resource, for example:

from pprint import pprint
from frictionless import resources

resource = resources.TableResource(path='table.json')
pprint(resource.read_rows())

[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]

Writing Data

The same applies to writing:

from frictionless import Resource, resources

source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
target = resources.TableResource(path='table-output.json')
source.write(target)
print(target)
print(target.to_view())

{'name': 'table-output',
 'type': 'table',
 'path': 'table-output.json',
 'scheme': 'file',
 'format': 'json',
 'mediatype': 'text/json'}
+----+-----------+
| id | name      |
+====+===========+
|  1 | 'english' |
+----+-----------+
|  2 | 'german'  |
+----+-----------+

Configuration

There is a dialect to configure how Frictionless reads and writes files in this format. For example:

from pprint import pprint
from frictionless import Resource, resources, formats

control = formats.JsonControl(keyed=True)
resource = resources.TableResource(path='table.keyed.json', control=control)
pprint(resource.read_rows())

[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]

Reference

formats.JsonControl (class)

Json control representation. Control class to set params for JSON reader/writer class.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, keys: Optional[List[str]] = None, keyed: bool = False, property: Optional[str] = None) -> None

Parameters

• name (Optional[str])
• title (Optional[str])
• description (Optional[str])
• keys (Optional[List[str]])
• keyed (bool)
• property (Optional[str])

formats.jsonControl.keys (property)

Specifies the keys/columns to read from the resource. For example: keys=["id","name"].

Signature

Optional[List[str]]

formats.jsonControl.keyed (property)

If set to True, it returns the data as key:value pairs. Default value is False.

Signature

bool

formats.jsonControl.property (property)

This property specifies the path to the attribute in a json file, if it has nested fields (see the sketch after this reference).

Signature

Optional[str]
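A sketch of the property option described above ('table.nested.json' is a hypothetical file shaped like {"rows": [["id", "name"], ...]}):

from pprint import pprint
from frictionless import resources, formats

# read the table from the nested 'rows' attribute instead of the document root
control = formats.JsonControl(property='rows')
resource = resources.TableResource(path='table.nested.json', control=control)
pprint(resource.read_rows())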
diff --git a/docs/formats/jsonschema.html b/docs/formats/jsonschema.html
new file mode 100644

JsonSchema Format

Frictionless supports importing a JsonSchema profile as a Table Schema. For example:

from frictionless import Schema

schema = Schema.from_jsonschema('table.jsonschema')
diff --git a/docs/formats/markdown.html b/docs/formats/markdown.html
new file mode 100644

Markdown Format

Frictionless supports exporting a metadata object as a Markdown document. For example:

from frictionless import Schema

schema = Schema('schema.json')
schema.to_markdown('schema.md')
diff --git a/docs/formats/ods.html b/docs/formats/ods.html
new file mode 100644

Ods Format

Frictionless supports ODS parsing.

pip install frictionless[ods]
pip install 'frictionless[ods]' # for zsh shell

Reading Data

You can read this format using Package/Resource, for example:

from pprint import pprint
from frictionless import Resource

resource = Resource(path='table.ods')
pprint(resource.read_rows())

[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]

Writing Data

The same applies to writing:

from pprint import pprint
from frictionless import Resource

source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
target = source.write('table-output.ods')
pprint(target)

Configuration

There is a dialect to configure how Frictionless reads and writes files in this format. For example:

from frictionless import Resource, formats

resource = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
resource.write('table-output-sheet.ods', control=formats.OdsControl(sheet='My Table'))

Reference

formats.OdsControl (class)

Ods control representation. Control class to set params for ODS reader/writer.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, sheet: Union[str, int] = 1) -> None

Parameters

• name (Optional[str])
• title (Optional[str])
• description (Optional[str])
• sheet (Union[str, int])

formats.odsControl.sheet (property)

Name or index of the sheet to read/write.

Signature

Union[str, int]
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/formats/pandas.html b/docs/formats/pandas.html new file mode 100644 index 0000000000..f1ed0182df --- /dev/null +++ b/docs/formats/pandas.html @@ -0,0 +1,3515 @@ + + + + + + + + +Pandas Format | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ Edit page in Livemark
+ (2024-11-22 08:02) +
+ +

Pandas Format

+

Frictionless supports reading and writing Pandas dataframes.

+ +
+
+
pip install frictionless[pandas]
+pip install 'frictionless[pandas]' # for zsh shell
+
+ +
+

Reading Data

+

You can read a Pandas dataframe:

+ +
+
+
import pandas as pd
+from pprint import pprint
+from frictionless import Resource
+
+# A small dataframe so the snippet is self-contained (illustrative values)
+df = pd.DataFrame({'id': [1, 2], 'name': ['english', 'german']})
+resource = Resource(df)
+pprint(resource.read_rows())
+
+ +
+

Writing Data

+

You can write a dataset to Pandas:

+ + + +
+
+
from frictionless import Resource
+
+resource = Resource('table.csv')
+df = resource.to_pandas()
+
+ +
+
+
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/formats/parquet.html b/docs/formats/parquet.html new file mode 100644 index 0000000000..93c2f14142 --- /dev/null +++ b/docs/formats/parquet.html @@ -0,0 +1,3620 @@ + + + + + + + + +Parquet Format | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ Edit page in Livemark
+ (2024-11-22 08:02) +
+ +

Parquet Format

+

Frictionless supports reading and writing Parquet files.

+ +
+
+
pip install frictionless[parquet]
+pip install 'frictionless[parquet]' # for zsh shell
+
+ +
+

Reading Data

+

You can read a Parquet file:

+ +
+
+
from frictionless import Resource
+
+resource = Resource('table.parq')
+print(resource.read_rows())
+
+ +
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+ +
+

Writing Data

+

You can write a dataset to Parquet:

+ +
+
+
from frictionless import Resource
+
+resource = Resource('table.csv')
+target = resource.write('table-output.parq')
+print(target)
+print(target.read_rows())
+
+ +
{'name': 'table-output',
+ 'type': 'table',
+ 'path': 'table-output.parq',
+ 'scheme': 'file',
+ 'format': 'parq',
+ 'mediatype': 'application/parquet'}
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+ +
+
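
As a sketch of the control options documented in the reference below (reusing the file written above; the column and filter values are illustrative):

+
from frictionless import Resource, formats
+
+# Read only the 'name' column from row groups where 'id' is in the given set
+control = formats.ParquetControl(columns=['name'], filters=[('id', 'in', [1, 2])])
+resource = Resource('table-output.parq', control=control)
+print(resource.read_rows())
+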

Reference

+
+ + +
+
+ +

formats.ParquetControl (class)

+ +
+
+ + +
+

formats.ParquetControl (class)

+

Parquet control representation. + +Control class to set params for Parquet read/write class.

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, columns: Optional[List[str]] = None, categories: Optional[Any] = None, filters: Optional[Any] = False) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + columns + (Optional[List[str]])
  • +
  • + categories + (Optional[Any])
  • +
  • + filters + (Optional[Any])
  • +
+
+ +
+

formats.parquetControl.columns (property)

+

+ A list of columns to load. By selecting columns, we can only access + parts of file that we are interested in and skip columns that are + not of interest. Default value is None. +

+

Signature

+

Optional[List[str]]

+
+
+

formats.parquetControl.categories (property)

+

+ List of columns that should be returned as Pandas Category-type column. + The second example specifies the number of expected labels for that column. + For example: categories=['col1'] or categories={'col1': 12} +

+

Signature

+

Optional[Any]

+
+
+

formats.parquetControl.filters (property)

+

+ Specifies the condition to filter data(row-groups). + For example: [('col3', 'in', [1, 2, 3, 4])]) +

+

Signature

+

Optional[Any]

+
+ + +
+

formats.parquetControl.to_python (method)

+

Convert to options

+
+ + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/formats/spss.html b/docs/formats/spss.html new file mode 100644 index 0000000000..62304bd28e --- /dev/null +++ b/docs/formats/spss.html @@ -0,0 +1,3518 @@ + + + + + + + + +Spss Format | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ Edit page in Livemark
+ (2024-11-22 08:02) +
+ +

Spss Format

+

Frictionless supports reading and writing SPSS files.

+ +
+
+
pip install frictionless[spss]
+pip install 'frictionless[spss]' # for zsh shell
+
+ +
+

Reading Data

+

You can read SPSS files:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('table.sav')
+pprint(resource.read_rows())
+
+ +
+

Writing Data

+

You can write SPSS files:

+ + + +
+
+
from pprint import pprint
+from frictionless import Resource
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = source.write('table-output.sav')
+pprint(target)
+pprint(target.read_rows())
+
+ +
+
+
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/formats/sql.html b/docs/formats/sql.html new file mode 100644 index 0000000000..961c21724d --- /dev/null +++ b/docs/formats/sql.html @@ -0,0 +1,3718 @@ + + + + + + + + +Sql Format | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ Edit page in Livemark
+ (2024-11-22 08:02) +
+ +

Sql Format

+

Frictionless supports reading and writing SQL databases.

+

Supported Databases

+

Frictionless Framework generally supports many databases that can be used with SQLAlchemy. Here is a list of databases with tested support:

+

SQLite

+
+

https://www.sqlite.org/index.html

+
+

It's a well-tested default database used by Frictionless:

+ +
+
+
pip install frictionless[sql]
+
+ +
+

PostgreSQL

+
+

https://www.postgresql.org/

+
+

This database is well-tested and provides the most data types:

+ +
+
+
pip install frictionless[postgresql]
+
+ +
+

MySQL

+
+

https://www.mysql.com/

+
+

Another popular database that has been tested with Frictionless:

+ +
+
+
pip install frictionless[mysql]
+
+ +
+

DuckDB

+
+

https://duckdb.org/

+
+

DuckDB is a relatively new database and, currently, Frictionless support is experimental:

+ +
+
+
pip install frictionless[duckdb]
+
+ +
+

Reading Data

+

You can read a SQL database:

+ +
+
+
from frictionless import Resource, formats
+
+control = SqlControl(table="test_table", basepath='data')
+with Resource(path="sqlite:///sqlite.db", control=control) as resource:
+    print(resource.read_rows())
+
+ +
+

Writing Data

+

You can write to SQL databases:

+ +
+
+
from frictionless import Package
+
+package = Package('path/to/datapackage.json')
+package.publish('postgresql://database')
+
+ +
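
A single resource can also be written to a database table; a hedged sketch (the SQLite file and table name are illustrative):

+
from frictionless import Resource, formats
+
+resource = Resource('table.csv')
+control = formats.SqlControl(table='output_table')
+resource.write('sqlite:///sqlite.db', control=control)
+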
+

Configuration

+

There is a dialect to configure how Frictionless reads and writes files in this format. For example:

+ +
+
+
from frictionless import Resource, formats
+
+control = formats.SqlControl(table='table', order_by='field', where='field > 20')
+resource = Resource('postgresql://database', control=control)
+
+ +
+

Reference

+
+ + +
+
+ +

formats.SqlControl (class)

+ +
+
+ + +
+

formats.SqlControl (class)

+

SQL control representation. + +Control class to set params for Sql read/write class.

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, table: Optional[str] = None, order_by: Optional[str] = None, where: Optional[str] = None, namespace: Optional[str] = None, basepath: Optional[str] = None, with_metadata: bool = False) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + table + (Optional[str])
  • +
  • + order_by + (Optional[str])
  • +
  • + where + (Optional[str])
  • +
  • + namespace + (Optional[str])
  • +
  • + basepath + (Optional[str])
  • +
  • + with_metadata + (bool)
  • +
+
+ +
+

formats.sqlControl.table (property)

+

+ Table name from which to read the data. +

+

Signature

+

Optional[str]

+
+
+

formats.sqlControl.order_by (property)

+

+ It specifies the ORDER BY keyword for SQL queries to sort the + results that are being read. The default value is None. +

+

Signature

+

Optional[str]

+
+
+

formats.sqlControl.where (property)

+

+ It specifies the WHERE clause to filter the records in SQL + queries. The default value is None. +

+

Signature

+

Optional[str]

+
+
+

formats.sqlControl.namespace (property)

+

+ To refer to table using schema or namespace or database such as + `FOO`.`TABLEFOO1` we can specify namespace. For example: + control = formats.SqlControl(table="test_table", namespace="FOO") +

+

Signature

+

Optional[str]

+
+
+

formats.sqlControl.basepath (property)

+

+ It specifies the base path for the database. The basepath will + be appended to the db path. The default value is None. For example: + formats.SqlControl(table="test_table", basepath="data") +

+

Signature

+

Optional[str]

+
+
+

formats.sqlControl.with_metadata (property)

+

+ Indicates if a table contains metadata columns like + _rowNumber or _rowValid +

+

Signature

+

bool

+
+ + + + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/formats/yaml.html b/docs/formats/yaml.html new file mode 100644 index 0000000000..b7701538ae --- /dev/null +++ b/docs/formats/yaml.html @@ -0,0 +1,3630 @@ + + + + + + + + +Json Format | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ Edit page in Livemark
+ (2024-11-22 08:02) +
+ +

Yaml Format

+

Frictionless supports parsing YAML tables.

+

Reading Data

+
+

We use the path argument to ensure that the file will not be guessed to be a metadata file.

+
+

You can read this format using Package/Resource, for example:

+ +
+
+
from pprint import pprint
+from frictionless import Resource, resources
+
+resource = resources.TableResource(path='table.yaml')
+pprint(resource.read_rows())
+
+ +
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+ +
+

Writing Data

+

The same approach works for writing:

+ +
+
+
from frictionless import Resource, resources
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = resources.TableResource(path='table-output.yaml')
+source.write(target)
+print(target)
+print(target.to_view())
+
+ +
{'name': 'table-output',
+ 'type': 'table',
+ 'path': 'table-output.yaml',
+ 'scheme': 'file',
+ 'format': 'yaml',
+ 'mediatype': 'text/yaml'}
++----+-----------+
+| id | name      |
++====+===========+
+|  1 | 'english' |
++----+-----------+
+|  2 | 'german'  |
++----+-----------+
+ +
+

Configuration

+

There is a dialect to configure how Frictionless reads and writes files in this format. For example:

+ +
+
+
from pprint import pprint
+from frictionless import Resource, resources, formats
+
+control=formats.YamlControl(keyed=True)
+resource = resources.TableResource(path='table.keyed.yaml', control=control)
+pprint(resource.read_rows())
+
+ +
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+ +
+

Reference

+
+ + +
+
+ +

formats.YamlControl (class)

+ +
+
+ + +
+

formats.YamlControl (class)

+

Yaml control representation.

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, keys: Optional[List[str]] = None, keyed: bool = False, property: Optional[str] = None) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + keys + (Optional[List[str]])
  • +
  • + keyed + (bool)
  • +
  • + property + (Optional[str])
  • +
+
+ +
+

formats.yamlControl.keys (property)

+

+ Specifies the keys/columns to read from the resource. + For example: keys=["id","name"]. +

+

Signature

+

Optional[List[str]]

+
+
+

formats.yamlControl.keyed (property)

+

+ If set to True, it returns the data as key:value pairs. The default value is False. +

+

Signature

+

bool

+
+
+

formats.yamlControl.property (property)

+

+ This property specifies the path to the attribute in a YAML file, if it has + nested fields. +

+

Signature

+

Optional[str]

+
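
The property option documented above is not demonstrated in the guide; here is a hedged sketch (the file name and nested key are hypothetical):

+
from frictionless import resources, formats
+
+# Assumes table.nested.yaml holds the rows under a top-level 'data' key
+control = formats.YamlControl(property='data')
+resource = resources.TableResource(path='table.nested.yaml', control=control)
+print(resource.read_rows())
+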
+ + + + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/formats/zip.html b/docs/formats/zip.html new file mode 100644 index 0000000000..036239a384 --- /dev/null +++ b/docs/formats/zip.html @@ -0,0 +1,3472 @@ + + + + + + + + +Zip Format | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ Edit page in Livemark
+ (2024-11-22 08:02) +
+ +

Zip Format

+
+ +

Frictionless supports zipped resources and reading/publishing data packages as a zip archive. For example:

+
from frictionless import Package
+
+package = Package('datapackage.zip')
+package.publish('otherpackage.zip')
+
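
Reading a single zipped resource is also possible; a hedged sketch using the innerpath option (the archive contents are assumed):

+
from frictionless import Resource
+
+# Assumes table.csv is archived inside table.csv.zip
+resource = Resource('table.csv.zip', innerpath='table.csv')
+print(resource.read_rows())
+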
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/framework/actions.html b/docs/framework/actions.html new file mode 100644 index 0000000000..c44c1f0888 --- /dev/null +++ b/docs/framework/actions.html @@ -0,0 +1,3795 @@ + + + + + + + + +Data Actions | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ Edit page in Livemark
+ (2024-11-22 08:02) +
+ +

Data Actions

+

Describe

+

Describe is a high-level function (action) to infer metadata from a data source.

+

Example

+ +
+
+
from frictionless import describe
+
+resource = describe('table.csv')
+print(resource)
+
+ +
{'name': 'table',
+ 'type': 'table',
+ 'path': 'table.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv',
+ 'encoding': 'utf-8',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                       {'name': 'name', 'type': 'string'}]}}
+ +
+

Reference

+
+ + +
+
+ +

describe (function)

+ +
+
+ + +
+

describe (function)

+

Describe the data source

+
Signature
+

(source: Optional[Any] = None, *, name: Optional[str] = None, type: Optional[str] = None, stats: bool = False, **options: Any) -> Metadata

+
Parameters
+
    +
  • + source + (Optional[Any]): data source
  • +
  • + name + (Optional[str]): resource name
  • +
  • + type + (Optional[str]): data type: "package", "resource", "dialect", or "schema"
  • +
  • + stats + (bool): if `True` infer resource's stats
  • +
  • + options + (Any)
  • +
+
+ + +
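
As the type parameter above suggests, a narrower metadata object can be requested directly; a small sketch:

+
from frictionless import describe
+
+# Describe only the schema of the table
+schema = describe('table.csv', type='schema')
+print(schema)
+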
+
+

Extract

+

Extract is a high-level function (action) to read tabular data from a data source. The output is encoded using the 'utf-8' scheme.

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import extract
+
+rows = extract('table.csv')
+pprint(rows)
+
+ +
{'table': [{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]}
+ +
+

Reference

+
+ + +
+
+ +

extract (function)

+ +
+
+ + +
+

extract (function)

+

Extract rows

+
Signature
+

(source: Optional[Any] = None, *, name: Optional[str] = None, type: Optional[str] = None, filter: Optional[types.IFilterFunction] = None, process: Optional[types.IProcessFunction] = None, limit_rows: Optional[int] = None, resource_name: Optional[str] = None, **options: Any)

+
Parameters
+
    +
  • + source + (Optional[Any])
  • +
  • + name + (Optional[str]): extract only resource having this name
  • +
  • + type + (Optional[str])
  • +
  • + filter + (Optional[types.IFilterFunction]): row filter function
  • +
  • + process + (Optional[types.IProcessFunction]): row processor function
  • +
  • + limit_rows + (Optional[int]): limit amount of rows to this number
  • +
  • + resource_name + (Optional[str])
  • +
  • + options + (Any)
  • +
+
+ + +
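
A hedged sketch of the filter and process options documented above, reusing the same table.csv:

+
from pprint import pprint
+from frictionless import extract
+
+# Keep only rows with id > 1; convert each kept row to a plain dict
+data = extract('table.csv', filter=lambda row: row['id'] > 1, process=lambda row: row.to_dict())
+pprint(data)
+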
+
+

Validate

+

Validate is a high-level function (action) to validate data from a data source.

+

Example

+ +
+
+
from frictionless import validate
+
+report = validate('table.csv')
+print(report.valid)
+
+ +
True
+ +
+

Reference

+
+ + +
+
+ +

validate (function)

+ +
+
+ + +
+

validate (function)

+

Validate resource

+
Signature
+

(source: Optional[Any] = None, *, name: Optional[str] = None, type: Optional[str] = None, checklist: Union[frictionless.checklist.checklist.Checklist, str, NoneType] = None, checks: List[frictionless.checklist.check.Check] = [], pick_errors: List[str] = [], skip_errors: List[str] = [], limit_errors: int = 1000, limit_rows: Optional[int] = None, parallel: bool = False, resource_name: Optional[str] = None, **options: Any)

+
Parameters
+
    +
  • + source + (typing.Optional[typing.Any]): a data source
  • +
  • + name + (typing.Optional[str])
  • +
  • + type + (typing.Optional[str]): source type - inquiry, package, resource, schema or table
  • +
  • + checklist + (typing.Union[frictionless.checklist.checklist.Checklist, str, NoneType])
  • +
  • + checks + (typing.List[frictionless.checklist.check.Check])
  • +
  • + pick_errors + (typing.List[str])
  • +
  • + skip_errors + (typing.List[str])
  • +
  • + limit_errors + ()
  • +
  • + limit_rows + (typing.Optional[int])
  • +
  • + parallel + ()
  • +
  • + resource_name + (typing.Optional[str])
  • +
  • + options + (typing.Any)
  • +
+
+ + +
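
A hedged sketch combining a few of the options above (an inline check plus an error cap):

+
from frictionless import validate, checks
+
+report = validate('table.csv', checks=[checks.row_constraint(formula='id > 0')], limit_errors=100)
+print(report.valid)
+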
+
+

Transform

+

Transform is a high-level function (action) to transform tabular data from a data source.

+

Example

+ +
+
+
from frictionless import transform, steps
+
+resource = transform('table.csv', steps=[steps.cell_set(field_name='name', value='new')])
+print(resource.read_rows())
+
+ +
[{'id': 1, 'name': 'new'}, {'id': 2, 'name': 'new'}]
+ +
+

Reference

+
+ + +
+
+ +

transform (function)

+ +
+
+ + +
+

transform (function)

+

Transform resource

+
Signature
+

(source: Optional[Any] = None, *, type: Optional[str] = None, pipeline: Union[frictionless.pipeline.pipeline.Pipeline, str, NoneType] = None, steps: Optional[List[frictionless.pipeline.step.Step]] = None, **options: Any)

+
Parameters
+
    +
  • + source + (typing.Optional[typing.Any]): data source
  • +
  • + type + (typing.Optional[str]): data type - package, resource or pipeline (default: infer)
  • +
  • + pipeline + (typing.Union[frictionless.pipeline.pipeline.Pipeline, str, NoneType])
  • +
  • + steps + (typing.Optional[typing.List[frictionless.pipeline.step.Step]]): transform steps
  • +
  • + options + (typing.Any)
  • +
+
+ + +
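
The pipeline parameter above accepts a reusable Pipeline object; a small sketch equivalent to the steps example:

+
from frictionless import Pipeline, transform, steps
+
+pipeline = Pipeline(steps=[steps.cell_set(field_name='name', value='new')])
+resource = transform('table.csv', pipeline=pipeline)
+print(resource.read_rows())
+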
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/framework/catalog.html b/docs/framework/catalog.html new file mode 100644 index 0000000000..00634335e4 --- /dev/null +++ b/docs/framework/catalog.html @@ -0,0 +1,3888 @@ + + + + + + + + +Catalog Class | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ Edit page in Livemark
+ (2024-11-22 08:02) +
+ +

Catalog Class

+
+ +

Catalog is a set of data packages.

+

Creating Catalog

+

We can create a catalog providing a list of data packages:

+ +
+
+
from frictionless import Catalog, Dataset, Package
+
+catalog = Catalog(datasets=[Dataset(name='name', package=Package('tables/*'))])
+
+ +
+

Describing Catalog

+

Usually, a Catalog is used to describe some external set of datasets, such as a CKAN instance, or a GitHub user or search. For example:

+ +
+
+
from frictionless import Catalog
+
+catalog = Catalog('https://demo.ckan.org/dataset/')
+print(catalog)
+
+ +
+

Dataset Management

+

The core purpose of a catalog is to manage a set of datasets. The Catalog class provides useful methods to manage them:

+ +
+
+
from frictionless import Catalog
+
+catalog = Catalog('https://demo.ckan.org/dataset/')
+catalog.dataset_names
+catalog.has_dataset
+catalog.add_dataset
+catalog.get_dataset
+catalog.clear_datasets
+
+ +
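
A hedged sketch of these methods in action, reusing the catalog from the creation example above:

+
from frictionless import Catalog, Dataset, Package
+
+catalog = Catalog(datasets=[Dataset(name='name', package=Package('tables/*'))])
+print(catalog.dataset_names)        # ['name']
+print(catalog.has_dataset('name'))  # True
+dataset = catalog.get_dataset('name')
+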
+

Saving Descriptor

+

As any of the Metadata classes the Catalog class can be saved as JSON or YAML:

+ +
+
+
from frictionless import Catalog
+
+catalog = Catalog('https://demo.ckan.org/dataset/')
+catalog.to_json('datacatalog.json') # Save as JSON
+catalog.to_yaml('datacatalog.yaml') # Save as YAML
+
+ +
+

Reference

+
+ + +
+
+ +

Catalog (class)

+

Dataset (class)

+ +
+
+ + +
+

Catalog (class)

+

Catalog representation

+

Signature

+

(*, source: Optional[Any] = None, control: Optional[Control] = None, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, datasets: List[Dataset] = NOTHING, basepath: Optional[str] = None) -> None

+

Parameters

+
    +
  • + source + (Optional[Any])
  • +
  • + control + (Optional[Control])
  • +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + datasets + (List[Dataset])
  • +
  • + basepath + (Optional[str])
  • +
+
+ +
+

catalog.source (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[Any]

+
+
+

catalog.control (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[Control]

+
+
+

catalog.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “.”, “_” or “-” characters. +

+

Signature

+

Optional[str]

+
+
+

catalog.type (property)

+

+ Type of the object +

+

Signature

+

ClassVar[Union[str, None]]

+
+
+

catalog.title (property)

+

+ A Catalog title according to the specs. It should be a + human-oriented title of the resource. +

+

Signature

+

Optional[str]

+
+
+

catalog.description (property)

+

+ A Catalog description according to the specs. It should be a + human-oriented description of the resource. +

+

Signature

+

Optional[str]

+
+
+

catalog.datasets (property)

+

+ A list of datasets. Each item in the list is a Dataset. +

+

Signature

+

List[Dataset]

+
+
+

catalog.basepath (property)

+

+ A basepath of the catalog. The normpath of the resource is a join of + `basepath` and `/path`. +

+

Signature

+

Optional[str]

+
+ +
+

catalog.dataset_names (property)

+

Return names of datasets

+

Signature

+

List[str]

+
+ +
+

catalog.add_dataset (method)

+

Add new dataset to the catalog

+

Signature

+

(dataset: Union[Dataset, str]) -> Dataset

+

Parameters

+
    +
  • + dataset + (Union[Dataset, str])
  • +
+
+
+

catalog.clear_datasets (method)

+

Remove all the datasets

+
+
+

catalog.dereference (method)

+

Dereference underlying metadata. + +If any underlying metadata is provided as a string, +it will be replaced by the metadata object.

+
+
+

catalog.get_dataset (method)

+

Get dataset by name

+

Signature

+

(name: str) -> Dataset

+

Parameters

+
    +
  • + name + (str)
  • +
+
+
+

catalog.has_dataset (method)

+

Check if a dataset is present

+

Signature

+

(name: str) -> bool

+

Parameters

+
    +
  • + name + (str)
  • +
+
+
+

catalog.infer (method)

+

Infer catalog's metadata

+

Signature

+

(*, stats: bool = False)

+

Parameters

+
    +
  • + stats + (bool)
  • +
+
+
+

catalog.remove_dataset (method)

+

Remove dataset by name

+

Signature

+

(name: str) -> Dataset

+

Parameters

+
    +
  • + name + (str)
  • +
+
+
+

catalog.set_dataset (method)

+

Set dataset by name

+

Signature

+

(dataset: Dataset) -> Optional[Dataset]

+

Parameters

+
    +
  • + dataset + (Dataset)
  • +
+
+
+

catalog.to_copy (method)

+

Create a copy of the catalog

+

Signature

+

(**options: Any)

+

Parameters

+
    +
  • + options + (Any)
  • +
+
+ + +
+

Dataset (class)

+

Dataset representation.

+

Signature

+

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, package: Union[Package, str], basepath: Optional[str] = None, catalog: Optional[Catalog] = None) -> None

+

Parameters

+
    +
  • + name + (str)
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + package + (Union[Package, str])
  • +
  • + basepath + (Optional[str])
  • +
  • + catalog + (Optional[Catalog])
  • +
+
+ +
+

dataset.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “.”, “_” or “-” characters. +

+

Signature

+

str

+
+
+

dataset.type (property)

+

+ A short name(preferably human-readable) for the Check. + This MUST be lower-case and contain only alphanumeric characters + along with "-" or "_". +

+

Signature

+

ClassVar[str]

+
+
+

dataset.title (property)

+

+ A human-readable title for the Check. +

+

Signature

+

Optional[str]

+
+
+

dataset.description (property)

+

+ A detailed description for the Check. +

+

Signature

+

Optional[str]

+
+
+

dataset._package (property)

+

+ # TODO: add docs +

+

Signature

+

Union[Package, str]

+
+
+

dataset._basepath (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[str]

+
+
+

dataset.catalog (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[Catalog]

+
+ + +
+

dataset.dereference (method)

+

Dereference underlying metadata. + +If any underlying metadata is provided as a string, +it will be replaced by the metadata object.

+
+
+

dataset.infer (method)

+

Infer dataset's metadata

+

Signature

+

(*, stats: bool = False)

+

Parameters

+
    +
  • + stats + (bool)
  • +
+
+ + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/framework/checklist.html b/docs/framework/checklist.html new file mode 100644 index 0000000000..6bdd3d2063 --- /dev/null +++ b/docs/framework/checklist.html @@ -0,0 +1,3814 @@ + + + + + + + + +Checklist Class | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ Edit page in Livemark
+ (2024-11-22 08:02) +
+ +

Checklist Class

+

Creating Checklist

+

Checklist is a set of validation checks and a few additional settings. Let's create a checklist:

+ +
+
+
from frictionless import Checklist, checks
+
+checklist = Checklist(checks=[checks.row_constraint(formula='id > 1')])
+print(checklist)
+
+ +
{'checks': [{'type': 'row-constraint', 'formula': 'id > 1'}]}
+ +
+
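
A checklist like this can be passed directly to validation; a small sketch (assuming the table.csv used elsewhere in these docs):

+
from frictionless import Checklist, checks, validate
+
+checklist = Checklist(checks=[checks.row_constraint(formula='id > 1')])
+report = validate('table.csv', checklist=checklist)
+print(report.valid)
+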

Validation Checks

+

The Check concept is a part of the Validation API. You can create a custom Check to be used as part of resource or package validation.

+
import hashlib
+
+from frictionless import Check, errors
+
+class duplicate_row(Check):
+    code = "duplicate-row"
+    Errors = [errors.DuplicateRowError]
+
+    def __init__(self, descriptor=None):
+        super().__init__(descriptor)
+        self.__memory = {}
+
+    def validate_row(self, row):
+        text = ",".join(map(str, row.values()))
+        hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
+        match = self.__memory.get(hash)
+        if match:
+            note = 'the same as row at position "%s"' % match
+            yield errors.DuplicateRowError.from_row(row, note=note)
+        self.__memory[hash] = row.row_position
+
+    # Metadata
+
+    metadata_profile = {  # type: ignore
+        "type": "object",
+        "properties": {},
+    }
+
+

It's usual to create a custom Error alongside a custom Check.

+

Reference

+
+ + +
+
+ +

Checklist (class)

+

Check (class)

+ +
+
+ + +
+

Checklist (class)

+

Checklist representation. + +A class that combines multiple checks to be applied while validating +a resource or package.

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, checks: List[Check] = NOTHING, pick_errors: List[str] = NOTHING, skip_errors: List[str] = NOTHING) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + checks + (List[Check])
  • +
  • + pick_errors + (List[str])
  • +
  • + skip_errors + (List[str])
  • +
+
+ +
+

checklist.name (property)

+

+ A short name(preferably human-readable) for the Checklist. + This MUST be lower-case and contain only alphanumeric characters + along with "-" or "_". +

+

Signature

+

Optional[str]

+
+
+

checklist.type (property)

+

+ Type of the object +

+

Signature

+

ClassVar[Union[str, None]]

+
+
+

checklist.title (property)

+

+ A human-readable title for the Checklist. +

+

Signature

+

Optional[str]

+
+
+

checklist.description (property)

+

+ A detailed description for the Checklist. +

+

Signature

+

Optional[str]

+
+
+

checklist.checks (property)

+

+ List of checks to be applied during validation such as "deviated-cell", + "required-value" etc. +

+

Signature

+

List[Check]

+
+
+

checklist.pick_errors (property)

+

+ Specify the error names to be picked during validation, such as "sha256-count" or + "byte-count". Errors other than those specified will be ignored. +

+

Signature

+

List[str]

+
+
+

checklist.skip_errors (property)

+

+ Specify the error names to be skipped during validation, such as "sha256-count" or + "byte-count". Other errors will be included. +

+

Signature

+

List[str]

+
+ + +
+

checklist.add_check (method)

+

Add new check to the schema

+

Signature

+

(check: Check) -> None

+

Parameters

+
    +
  • + check + (Check)
  • +
+
+
+

checklist.clear_checks (method)

+

Remove all the checks

+

Signature

+

() -> None

+
+
+

checklist.get_check (method)

+

Get check by type

+

Signature

+

(type: str) -> Check

+

Parameters

+
    +
  • + type + (str)
  • +
+
+
+

checklist.has_check (method)

+

Check if a check is present

+

Signature

+

(type: str) -> bool

+

Parameters

+
    +
  • + type + (str)
  • +
+
+
+

checklist.remove_check (method)

+

Remove check by type

+

Signature

+

(type: str) -> Check

+

Parameters

+
    +
  • + type + (str)
  • +
+
+
+

checklist.set_check (method)

+

Set check by type

+

Signature

+

(check: Check) -> Optional[Check]

+

Parameters

+
    +
  • + check + (Check)
  • +
+
+ + +
+

Check (class)

+

Check representation. + +A base class for all the checks. To add a new custom check, it has to be derived +from this class.

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
+
+ +
+

check.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “.”, “_” or “-” characters. +

+

Signature

+

Optional[str]

+
+
+

check.type (property)

+

+ A short name(preferably human-readable) for the Check. + This MUST be lower-case and contain only alphanumeric characters + along with "-" or "_". +

+

Signature

+

ClassVar[str]

+
+
+

check.title (property)

+

+ A human-readable title for the Check. +

+

Signature

+

Optional[str]

+
+
+

check.description (property)

+

+ A detailed description for the Check. +

+

Signature

+

Optional[str]

+
+
+

check.Errors (property)

+

+ List of errors that are being used in the Check. +

+

Signature

+

ClassVar[List[Type[Error]]]

+
+ +
+

check.resource (property)

+

+

Signature

+

Resource

+
+ +
+

check.connect (method)

+

Connect to the given resource

+

Signature

+

(resource: Resource)

+

Parameters

+
    +
  • + resource + (Resource): data resource
  • +
+
+
+

check.validate_end (method)

+

Called to validate the resource before closing

+

Signature

+

() -> Iterable[Error]

+
+
+

check.validate_row (method)

+

Called to validate the given row (on every row)

+

Signature

+

(row: Row) -> Iterable[Error]

+

Parameters

+
    +
  • + row + (Row): table row
  • +
+
+
+

check.validate_start (method)

+

Called to validate the resource after opening

+

Signature

+

() -> Iterable[Error]

+
+ + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/framework/detector.html b/docs/framework/detector.html new file mode 100644 index 0000000000..cd4422acd1 --- /dev/null +++ b/docs/framework/detector.html @@ -0,0 +1,4106 @@ + + + + + + + + +Detector Class | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ Edit page in Livemark
+ (2024-11-22 08:02) +
+ +

Detector Class

+

The Detector object can be used in various places within the Framework. The main purpose of this class is to tweak how different aspects of metadata are detected.

+

Here is a quick example:

+ +
+
+
frictionless extract table.csv --field-missing-values 1,2
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+           dataset
+┏━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
+┃ name  ┃ type  ┃ path      ┃
+┡━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
+│ table │ table │ table.csv │
+└───────┴───────┴───────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+      table
+┏━━━━━━┳━━━━━━━━━┓
+┃ id   ┃ name    ┃
+┡━━━━━━╇━━━━━━━━━┩
+│ None │ english │
+│ None │ 中国人  │
+└──────┴─────────┘
+ +
+
+
from frictionless import Detector, Resource
+
+detector = Detector(field_missing_values=['1', '2'])
+resource = Resource('table.csv', detector=detector)
+print(resource.read_rows())
+
+ +
[{'id': None, 'name': 'english'}, {'id': None, 'name': '中国人'}]
+ +
+

Many options below have a CLI equivalent. Please consult the CLI help.

+

Detector Usage

+

Detector class instances are accepted by many classes and functions:

+
    +
  • Package
  • +
  • Resource
  • +
  • describe
  • +
  • extract
  • +
  • validate
  • +
  • and more
  • +
+

You just need to create a Detector instance with the desired options and pass it to the classes and functions above.

+

Buffer Size

+

By default, Frictionless will use the first 10000 bytes to detect encoding. Including more bytes by increasing buffer_size can improve the inference: the encoding detection will be more accurate, but slower.

+ +
+
+
from frictionless import Detector, describe
+
+detector = Detector(buffer_size=100000)
+resource = describe("country-1.csv", detector=detector)
+print(resource.encoding)
+
+ +
utf-8
+ +
+

Sample Size

+

By default, Frictionless will use the first 100 rows to detect field types. Including more rows by increasing sample_size can improve the inference: the result will be more accurate, but slower.

+ +
+
+
from frictionless import Detector, describe
+
+detector = Detector(sample_size=1000)
+resource = describe("country-1.csv", detector=detector)
+print(resource.schema)
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'neighbor_id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
+ +
+

Encoding Function

+

By default, encoding_function is None and Frictionless uses its built-in encoding detection. However, users can implement their own encoding detection using this option. The following example simply returns the utf-8 encoding, but more complex logic can be added to the function.

+ +
+
+
from frictionless import Detector, Resource
+
+detector = Detector(encoding_function=lambda sample: "utf-8")
+with Resource("table.csv", detector=detector) as resource:
+  print(resource.encoding)
+
+ +
utf-8
+ +
+

Field Type

+

This option allows manually setting all the field types to a given type. It's useful when you need to skip data casting (setting any type) or have everything as a string (setting string type):

+ +
+
+
from frictionless import Detector, describe
+
+detector = Detector(field_type='string')
+resource = describe("country-1.csv", detector=detector)
+print(resource.schema)
+
+ +
{'fields': [{'name': 'id', 'type': 'string'},
+            {'name': 'neighbor_id', 'type': 'string'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'string'}]}
+ +
+

Field Names

+

Sometimes you don't want to use the existing header row to compose field names. It's possible to provide custom names:

+ +
+
+
from frictionless import Detector, describe
+
+detector = Detector(field_names=["f1", "f2", "f3", "f4"])
+resource = describe("country-1.csv", detector=detector)
+print(resource.schema.field_names)
+
+ +
['f1', 'f2', 'f3', 'f4']
+ +
+

Field Confidence

+

By default, Frictionless uses a 0.9 (90%) confidence level for data type detection. It means that if there are 9 integers in a field and one string, it will be inferred as an integer. If you want a guarantee that an inferred schema will conform to the data, you can set it to 1 (100%):

+ +
+
+
from frictionless import Detector, describe
+
+detector = Detector(field_confidence=1)
+resource = describe("country-1.csv", detector=detector)
+print(resource.schema)
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'neighbor_id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
+ +
+

Field Float Numbers

+

By default, Frictionless will consider that all non integer numbers are decimals. It's possible to make them float which is a faster data type:

+ +
+
+
from frictionless import Detector, describe
+
+detector = Detector(field_float_numbers=True)
+resource = describe("floats.csv", detector=detector)
+print(resource.schema)
+print(resource.read_rows())
+
+ +
{'fields': [{'name': 'number', 'type': 'number', 'floatNumber': True}]}
+[{'number': 1.1}, {'number': 1.2}, {'number': 1.3}, {'number': 1.4}, {'number': 1.5}]
+ +
+

Field Missing Values

+

Missing Values is an important concept in data description. It provides information about what cell values should be considered as nulls. We can customize the defaults:

+ +
+
+
from frictionless import Detector, describe
+
+detector = Detector(field_missing_values=["", "1", "2"])
+resource = describe("table.csv", detector=detector)
+print(resource.schema.missing_values)
+print(resource.read_rows())
+
+ +
['', '1', '2']
+[{'id': None, 'name': 'english'}, {'id': None, 'name': '中国人'}]
+ +
+

As we can see, the textual values equal to "1" and "2" are now considered nulls. Usually, it's handy when you have data with values like: '-', 'n/a', and similar.

+

Schema Sync

+

There is a way to sync provided schema based on a header row's field order. It's very useful when you have a schema that describes a subset or a superset of the resource's fields:

+ +
+
+
from frictionless import Detector, Resource, Schema, fields
+
+# Note the order of the fields
+detector = Detector(schema_sync=True)
+schema = Schema(fields=[fields.StringField(name='name'), fields.IntegerField(name='id')])
+with Resource('table.csv', schema=schema, detector=detector) as resource:
+    print(resource.schema)
+    print(resource.read_rows())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'}]}
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+ +
+

Schema Patch

+

Sometimes we just want to update only a few fields or some of the schema's properties without providing a brand new schema. For example, the examples above can be simplified as:

+ +
+
+
from frictionless import Detector, Resource
+
+detector = Detector(schema_patch={'fields': {'id': {'type': 'string'}}})
+with Resource('table.csv', detector=detector) as resource:
+    print(resource.schema)
+    print(resource.read_rows())
+
+ +
{'fields': [{'name': 'id', 'type': 'string'},
+            {'name': 'name', 'type': 'string'}]}
+[{'id': '1', 'name': 'english'}, {'id': '2', 'name': '中国人'}]
+ +
+

Reference

+
+ + +
+
+ +

Detector (class)

+ +
+
+ + +
+

Detector (class)

+

Detector representation. + +The main purpose of this class is to set the parameters that define +how different aspects of metadata are detected.

+

Signature

+

(*, buffer_size: int = 10000, sample_size: int = 100, encoding_function: Optional[types.IEncodingFunction] = None, encoding_confidence: float = 0.5, field_type: Optional[str] = None, field_names: Optional[List[str]] = None, field_confidence: float = 0.9, field_float_numbers: bool = False, field_missing_values: List[str] = NOTHING, field_true_values: List[str] = NOTHING, field_false_values: List[str] = NOTHING, schema_sync: bool = False, schema_patch: Optional[Dict[str, Any]] = None) -> None

+

Parameters

+
    +
  • + buffer_size + (int)
  • +
  • + sample_size + (int)
  • +
  • + encoding_function + (Optional[types.IEncodingFunction])
  • +
  • + encoding_confidence + (float)
  • +
  • + field_type + (Optional[str])
  • +
  • + field_names + (Optional[List[str]])
  • +
  • + field_confidence + (float)
  • +
  • + field_float_numbers + (bool)
  • +
  • + field_missing_values + (List[str])
  • +
  • + field_true_values + (List[str])
  • +
  • + field_false_values + (List[str])
  • +
  • + schema_sync + (bool)
  • +
  • + schema_patch + (Optional[Dict[str, Any]])
  • +
+
+ +
+

detector.buffer_size (property)

+

+ The amount of bytes to be extracted as a buffer. It defaults to 10000. + The buffer_size can be increased to improve the inference accuracy to + detect file encoding. +

+

Signature

+

int

+
+
+

detector.sample_size (property)

+

+ The amount of rows to be extracted as a sample for dialect/schema inferring. + It defaults to 100. The sample_size can be increased to improve the inference + accuracy. +

+

Signature

+

int

+
+
+

detector.encoding_function (property)

+

+ A custom encoding function for the file. +

+

Signature

+

Optional[types.IEncodingFunction]

+
+
+

detector.encoding_confidence (property)

+

+ Confidence value for encoding function. +

+

Signature

+

float

+
+
+

detector.field_type (property)

+

+ Enforce all the inferred types to be this type. + For more information, please check "Describing Data" guide. +

+

Signature

+

Optional[str]

+
+
+

detector.field_names (property)

+

+ Enforce all the inferred fields to have provided names. + For more information, please check "Describing Data" guide. +

+

Signature

+

Optional[List[str]]

+
+
+

detector.field_confidence (property)

+

+ A number from 0 to 1 setting the infer confidence. + If 1 the data is guaranteed to be valid against the inferred schema. + For more information, please check "Describing Data" guide. + It defaults to 0.9 +

+

Signature

+

float

+
+
+

detector.field_float_numbers (property)

+

+ Flag to indicate desired number type. + By default numbers will be `Decimal`; if `True` - `float`. + For more information, please check "Describing Data" guide. + It defaults to `False` +

+

Signature

+

bool

+
+
+

detector.field_missing_values (property)

+

+ Strings to be considered as missing values. + For more information, please check "Describing Data" guide. + It defaults to `['']` +

+

Signature

+

List[str]

+
+
+

detector.field_true_values (property)

+

+ Strings to be considered as true values. + For more information, please check "Describing Data" guide. + It defaults to `["true", "True", "TRUE", "1"]` +

+

Signature

+

List[str]

+
+
+

detector.field_false_values (property)

+

+ Strings to be considered as false values. + For more information, please check "Describing Data" guide. + It defaults to `["false", "False", "FALSE", "0"]` +

+

Signature

+

List[str]

+
+
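
These two options are not demonstrated in the guide above; a hedged sketch with inline CSV data (the values are illustrative, and the inferred types may vary):

+
from frictionless import Detector, Resource
+
+detector = Detector(field_true_values=['yes'], field_false_values=['no'])
+with Resource(b'flag\nyes\nno', format='csv', detector=detector) as resource:
+    print(resource.read_rows())
+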
+

detector.schema_sync (property)

+

+ Whether to sync the schema. + If it sets to `True` the provided schema will be mapped to + the inferred schema. It means that, for example, you can + provide a subset of fields to be applied on top of the inferred + fields or the provided schema can have different order of fields. +

+

Signature

+

bool

+
+
+

detector.schema_patch (property)

+

+ A dictionary to be used as an inferred schema patch. + The form of this dictionary should follow the Schema descriptor form + except for the `fields` property which should be a mapping with the + key named after a field name and the values being a field patch. + For more information, please check "Extracting Data" guide. +

+

Signature

+

Optional[Dict[str, Any]]

+
+ + +
+

detector.add_missing_required_labels_to_schema_fields (method)

+

This method aims to add missing required labels and + +primary key fields not in labels to the schema fields.

+

Signature

+

(fields_mapping: Dict[str, Field], schema: Schema, labels: List[str], case_sensitive: bool)

+

Parameters

+
    +
  • + fields_mapping + (Dict[str, Field])
  • +
  • + schema + (Schema)
  • +
  • + labels + (List[str])
  • +
  • + case_sensitive + (bool)
  • +
+
+
+

detector.detect_dialect (method)

+

Detect dialect from sample

+

Signature

+

(sample: types.ISample, *, dialect: Optional[Dialect] = None) -> Dialect

+

Parameters

+
    +
  • + sample + (types.ISample): data sample
  • +
  • + dialect + (Optional[Dialect])
  • +
+
+
+

detector.detect_encoding (method)

+

Detect encoding from buffer

+

Signature

+

(buffer: types.IBuffer, *, encoding: Optional[str] = None) -> str

+

Parameters

+
    +
  • + buffer + (types.IBuffer): byte buffer
  • +
  • + encoding + (Optional[str])
  • +
+
+
+

Detector.detect_metadata_type (method) (static)

+

Return a descriptor type as 'resource' or 'package'

+

Signature

+

(source: Any, *, format: Optional[str] = None) -> Optional[str]

+

Parameters

+
    +
  • + source + (Any)
  • +
  • + format + (Optional[str])
  • +
+
+
+

detector.detect_resource (method)

+

Detects path details

+

Signature

+

(resource: Resource) -> None

+

Parameters

+
    +
  • + resource + (Resource)
  • +
+
+
+

detector.detect_schema (method)

+

Detect schema from fragment

+

Signature

+

(fragment: types.IFragment, *, labels: Optional[List[str]] = None, schema: Optional[Schema] = None, field_candidates: List[Dict[str, Any]] = [{type: yearmonth}, {type: geopoint}, {type: duration}, {type: geojson}, {type: object}, {type: array}, {type: datetime}, {type: time}, {type: date}, {type: integer}, {type: number}, {type: boolean}, {type: year}, {type: string}], **options: Any) -> Schema

+

Parameters

+
    +
  • + fragment + (types.IFragment): data fragment
  • +
  • + labels + (Optional[List[str]])
  • +
  • + schema + (Optional[Schema])
  • +
  • + field_candidates + (List[Dict[str, Any]])
  • +
  • + options + (Any)
  • +
+
+
+

Detector.mapped_schema_fields_names (method) (static)

+

Create a dictionary to map field names to schema fields

+

Signature

+

(fields: List[Field], case_sensitive: bool) -> Dict[str, Field]

+

Parameters

+
    +
  • + fields + (List[Field])
  • +
  • + case_sensitive + (bool)
  • +
+
+
+

Detector.rearrange_schema_fields_given_labels (method) (static)

+

Rearrange fields according to the order of labels. All fields + +missing from labels are dropped

+

Signature

+

(fields_mapping: Dict[str, Field], schema: Schema, labels: List[str])

+

Parameters

+
    +
  • + fields_mapping + (Dict[str, Field])
  • +
  • + schema + (Schema)
  • +
  • + labels + (List[str])
  • +
+
+ + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/framework/dialect.html b/docs/framework/dialect.html new file mode 100644 index 0000000000..d65d22080d --- /dev/null +++ b/docs/framework/dialect.html @@ -0,0 +1,3990 @@ + + + + + + + + +Dialect Class | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ Edit page in Livemark
+ (2024-11-22 08:02) +
+ +

Dialect Class

+

The Table Dialect is a core Frictionless Data concept: metadata describing a tabular data source. The Table Dialect concept gives us the ability to manage the table header and any details related to specific formats.

+

Dialect

+

Dialect class instances are accepted by many classes and functions:

+
    +
  • Resource
  • +
  • describe
  • +
  • extract
  • +
  • validate
  • +
  • and more
  • +
+

You just need to create a Dialect instance with the desired options and pass it to the classes and functions above. We will demonstrate with this example table:

+ +
+
+
cat capital-3.csv
+
+ +
id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+ +
+

Header

+

It's a boolean flag which defaults to True indicating whether the data has a header row or not. In the following example the header row will be treated as a data row:

+ +
+
+
from frictionless import Resource, Dialect
+
+dialect = Dialect(header=False)
+with Resource('capital-3.csv', dialect=dialect) as resource:
+      print(resource.header.labels)
+      print(resource.to_view())
+
+ +
[]
++--------+----------+
+| field1 | field2   |
++========+==========+
+| 'id'   | 'name'   |
++--------+----------+
+| '1'    | 'London' |
++--------+----------+
+| '2'    | 'Berlin' |
++--------+----------+
+| '3'    | 'Paris'  |
++--------+----------+
+| '4'    | 'Madrid' |
++--------+----------+
+...
+ +
+

Header Rows

+

If header is True, which is the default, this parameter indicates where to find the header row, or the header rows for a multiline header. Let's see an example of how the first two data rows can be treated as part of the header:

+ +
+
+
from frictionless import Resource, Dialect
+
+dialect = Dialect(header_rows=[1, 2, 3])
+with Resource('capital-3.csv', dialect=dialect) as resource:
+    print(resource.header)
+    print(resource.to_view())
+
+ +
['id 1 2', 'name London Berlin']
++--------+--------------------+
+| id 1 2 | name London Berlin |
++========+====================+
+|      3 | 'Paris'            |
++--------+--------------------+
+|      4 | 'Madrid'           |
++--------+--------------------+
+|      5 | 'Rome'             |
++--------+--------------------+
+ +
+

Header Join

+

If there are multiple header rows, as managed by the header_rows parameter, we can set a string to be the separator for the header cell join operation. Usually it's very handy for some "fancy" Excel files. For the sake of simplicity, we will demonstrate with a CSV file:

+ +
+
+
from frictionless import Resource, Dialect
+
+dialect = Dialect(header_rows=[1, 2, 3], header_join='/')
+with Resource('capital-3.csv', dialect=dialect) as resource:
+    print(resource.header)
+    print(resource.to_view())
+
+ +
['id/1/2', 'name/London/Berlin']
++--------+--------------------+
+| id/1/2 | name/London/Berlin |
++========+====================+
+|      3 | 'Paris'            |
++--------+--------------------+
+|      4 | 'Madrid'           |
++--------+--------------------+
+|      5 | 'Rome'             |
++--------+--------------------+
+ +
+

Header Case

+

By default, a header is validated in a case sensitive mode. To disable this behaviour we can set the header_case parameter to False. This option is accepted by any Dialect, and a dialect can be passed to extract, validate and other functions. Please note that it doesn't affect the resulting header; it only affects how it's validated:

+ +
+
+
from frictionless import Resource, Schema, Dialect, fields
+
+dialect = Dialect(header_case=False)
+schema = Schema(fields=[fields.StringField(name="ID"), fields.StringField(name="NAME")])
+with Resource('capital-3.csv', dialect=dialect, schema=schema) as resource:
+  print(f'Header: {resource.header}')
+  print(f'Valid: {resource.header.valid}')  # without "header_case" it will have 2 errors
+
+ +
Header: ['ID', 'NAME']
+Valid: True
+ +
+

Comment Char

+

Specifies the character used to mark commented rows:

+ +
+
+
from frictionless import Resource, Dialect
+
+dialect = Dialect(comment_char="#")
+with Resource(b'name\n#row1\nrow2', format="csv", dialect=dialect) as resource:
+    print(resource.read_rows())
+
+ +
[{'name': 'row2'}]
+ +
+

Comment Rows

+

A list of rows to ignore:

+ +
+
+
from frictionless import Resource, Dialect
+
+dialect = Dialect(comment_rows=[2])
+with Resource(b'name\nrow1\nrow2', format="csv", dialect=dialect) as resource:
+    print(resource.read_rows())
+
+ +
[{'name': 'row2'}]
+ +
+

Skip Blank Rows

+

Ignores rows if they are completely blank.

+ +
+
+
from frictionless import Resource, Dialect
+
+dialect = Dialect(skip_blank_rows=True)
+with Resource(b'name\n\nrow2', format="csv", dialect=dialect) as resource:
+    print(resource.read_rows())
+
+ +
[{'name': 'row2'}]
+ +
+

Reference

+
+ + +
+
+ +

Dialect (class)

+

Control (class)

+ +
+
+ + +
+

Dialect (class)

+

Dialect representation

+

Signature

+

(*, descriptor: Optional[Union[types.IDescriptor, str]] = None, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, header: bool = True, header_rows: List[int] = NOTHING, header_join: str = , header_case: bool = True, comment_char: Optional[str] = None, comment_rows: List[int] = NOTHING, skip_blank_rows: bool = False, controls: List[Control] = NOTHING) -> None

+

Parameters

+
    +
  • + descriptor + (Optional[Union[types.IDescriptor, str]])
  • +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + header + (bool)
  • +
  • + header_rows + (List[int])
  • +
  • + header_join + (str)
  • +
  • + header_case + (bool)
  • +
  • + comment_char + (Optional[str])
  • +
  • + comment_rows + (List[int])
  • +
  • + skip_blank_rows + (bool)
  • +
  • + controls + (List[Control])
  • +
+
+ +
+

dialect.descriptor (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[Union[types.IDescriptor, str]]

+
+
+

dialect.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +

+

Signature

+

Optional[str]

+
+
+

dialect.type (property)

+

+ Type of the object +

+

Signature

+

ClassVar[Union[str, None]]

+
+
+

dialect.title (property)

+

+ A human-oriented title for the Dialect. +

+

Signature

+

Optional[str]

+
+
+

dialect.description (property)

+

+ A brief description of the Dialect. +

+

Signature

+

Optional[str]

+
+
+

dialect.header (property)

+

+ If true, the header will be read else header will be skipped. +

+

Signature

+

bool

+
+
+

dialect.header_rows (property)

+

+ Specifies the row numbers for the header. Default is [1]. +

+

Signature

+

List[int]

+
+
+

dialect.header_join (property)

+

+ Separator to join the text of two columns. The default value is " " and other values + could be ":", "-" etc. +

+

Signature

+

str

+
+
+

dialect.header_case (property)

+

+ If set to false, it does case insensitive matching of header. The default value + is True. +

+

Signature

+

bool

+
+
+

dialect.comment_char (property)

+

+ Specifies char used to comment the rows. The default value is None. + For example: "#". +

+

Signature

+

Optional[str]

+
+
+

dialect.comment_rows (property)

+

+ A list of rows to ignore. For example: [1, 2] +

+

Signature

+

List[int]

+
+
+

dialect.skip_blank_rows (property)

+

+ Ignores rows if they are completely blank +

+

Signature

+

bool

+
+
+

dialect.controls (property)

+

+ A list of controls which defines different aspects of reading data. +

+

Signature

+

List[Control]

+
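
Format-specific options travel as controls inside a dialect; a hedged sketch using the CSV control (the delimiter value is illustrative):

+
from frictionless import Resource, Dialect, formats
+
+dialect = Dialect(controls=[formats.CsvControl(delimiter=';')])
+with Resource(b'id;name\n1;english', format='csv', dialect=dialect) as resource:
+    print(resource.read_rows())
+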
+ + +
+

dialect.add_control (method)

+

Add new control to the schema

+

Signature

+

(control: Control) -> None

+

Parameters

+
    +
  • + control + (Control)
  • +
+
+
+

Dialect.describe (method) (static)

+

Describe the given source as a dialect

+

Signature

+

(source: Optional[Any] = None, **options: Any) -> Dialect

+

Parameters

+
    +
  • + source + (Optional[Any]): data source
  • +
  • + options + (Any)
  • +
+
+
+

dialect.get_control (method)

+

Get control by type

+

Signature

+

(type: str) -> Control

+

Parameters

+
    +
  • + type + (str)
  • +
+
+
+

dialect.has_control (method)

+

Check if control is present

+

Signature

+

(type: str)

+

Parameters

+
    +
  • + type + (str)
  • +
+
+
+

dialect.set_control (method)

+

Set control by type

+

Signature

+

(control: Control) -> Optional[Control]

+

Parameters

+
    +
  • + control + (Control)
  • +
+
+ + +
+

Control (class)

+

Control representation. + +This class is the base class for all the control classes that are +used to set the states of various different components.

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
+
+ +
+

control.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +

+

Signature

+

Optional[str]

+
+
+

control.type (property)

+

+ Type of the control, matching its plugin, e.g. a zenodo control or a csv control. + For example: "csv", "zenodo", etc. +

+

Signature

+

ClassVar[str]

+
+
+

control.title (property)

+

+ A human-oriented title for the control. +

+

Signature

+

Optional[str]

+
+
+

control.description (property)

+

+ A brief description of the control. +

+

Signature

+

Optional[str]

+
+ + + + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/framework/error.html b/docs/framework/error.html new file mode 100644 index 0000000000..713284a99c --- /dev/null +++ b/docs/framework/error.html @@ -0,0 +1,3583 @@ + + + + + + + + +Error Class | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Error Class

+

The Error class is metadata with no behavior. It is used to describe an error that occurred while the framework was running or during validation.

+

To create a custom error, you just need to fill in the required class fields:

+
from frictionless import errors
+
+class DuplicateRowError(errors.RowError):
+    type = "duplicate-row"
+    title = "Duplicate Row"
+    tags = ["#table", "#row", "#duplicate"]
+    template = "Row at position {rowNumber} is duplicated: {note}"
+    description = "The row is duplicated."
+
+
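Once defined, the class fields are available on the error class itself; a quick check (a sketch using the class defined above):

+
print(DuplicateRowError.type)   # duplicate-row
+print(DuplicateRowError.title)  # Duplicate Row
+print(DuplicateRowError.tags)   # ['#table', '#row', '#duplicate']
+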

Reference

+
+ + +
+
+ +

Error (class)

+ +
+
+ + +
+

Error (class)

+

Error representation. + +It is the base class from which all other error classes are derived.

+

Signature

+

(*, note: str) -> None

+

Parameters

+
    +
  • + note + (str)
  • +
+
+ +
+

error.type (property)

+

+ A short url-usable type name of the error. For example: "duplicate-label". +

+

Signature

+

ClassVar[str]

+
+
+

error.title (property)

+

+ A human-oriented title for the error. For example: "Duplicate Label". +

+

Signature

+

ClassVar[str]

+
+
+

error.description (property)

+

+ A human-readable, comprehensive description of the error. It can be set to any custom text. +

+

Signature

+

ClassVar[str]

+
+
+

error.template (property)

+

+ A template used to render the error message, with placeholders filled from the error's properties and note. +

+

Signature

+

ClassVar[str]

+
+
+

error.tags (property)

+

+ A list of tags categorizing the error. For example: ["#table", "#header", "#label"]. +

+

Signature

+

ClassVar[List[str]]

+
+
+

error.message (property)

+

+ A human-readable message rendered from the template with the error's details filled in. +

+

Signature

+

str

+
+
+

error.note (property)

+

+ A short human readable description of the error. It can be set to any custom text. +

+

Signature

+

str

+
+ + + + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/framework/inquiry.html b/docs/framework/inquiry.html new file mode 100644 index 0000000000..e3b59f473d --- /dev/null +++ b/docs/framework/inquiry.html @@ -0,0 +1,3850 @@ + + + + + + + + +Inquiry Class | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Inquiry Class

+

The Inquiry gives you the ability to create arbitrary validation jobs containing a set of individual validation tasks.

+

Creating Inquiry

+

Let's create an inquiry that includes an individual file validation and a resource validation:

+ +
+
+
from frictionless import Inquiry
+
+inquiry = Inquiry.from_descriptor({'tasks': [
+  {'path': 'capital-valid.csv'},
+  {'path': 'capital-invalid.csv'},
+]})
+inquiry.to_yaml('capital.inquiry-example.yaml')
+print(inquiry)
+
+ +
{'tasks': [{'path': 'capital-valid.csv'}, {'path': 'capital-invalid.csv'}]}
+ +
+

Validating Inquiry

+

Tasks in the Inquiry accept the same arguments, written in camelCase, as the corresponding validate functions. As usual, let's run the validation:

+ +
+
+
frictionless validate capital.inquiry-example.yaml
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+                          dataset
+┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name            ┃ type  ┃ path                ┃ status  ┃
+┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ capital-valid   │ table │ capital-valid.csv   │ VALID   │
+│ capital-invalid │ table │ capital-invalid.csv │ INVALID │
+└─────────────────┴───────┴─────────────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                                capital-invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row  ┃ Field ┃ Type            ┃ Message                                     ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3     │ duplicate-label │ Label "name" in the header at position "3"  │
+│      │       │                 │ is duplicated to a label: at position "2"   │
+│ 10   │ 3     │ missing-cell    │ Row at position "10" has a missing cell in  │
+│      │       │                 │ field "name2" at position "3"               │
+│ 11   │ None  │ blank-row       │ Row at position "11" is completely blank    │
+│ 12   │ 1     │ type-error      │ Type error in the cell "x" in row "12" and  │
+│      │       │                 │ field "id" at position "1": type is         │
+│      │       │                 │ "integer/default"                           │
+│ 12   │ 4     │ extra-cell      │ Row at position "12" has an extra value in  │
+│      │       │                 │ field at position "4"                       │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+ +
+

At first sight, it's not clear why such a construct exists, but when your validation workflow gets complex, the Inquiry can provide a lot of flexibility and power. Last but not least, the Inquiry will use multiprocessing if more than one task is provided, as shown in the sketch below.

+
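A minimal sketch of running the same inquiry from Python with multiprocessing enabled (assuming the capital.inquiry-example.yaml created above):

+
from frictionless import Inquiry
+
+inquiry = Inquiry.from_descriptor('capital.inquiry-example.yaml')
+report = inquiry.validate(parallel=True)  # tasks run in separate processes
+print(report.valid)
+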

Reference

+
+ + +
+
+ +

Inquiry (class)

+

InquiryTask (class)

+ +
+
+ + +
+

Inquiry (class)

+

Inquiry representation.

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, tasks: List[InquiryTask] = NOTHING) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + tasks + (List[InquiryTask])
  • +
+
+ +
+

inquiry.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +

+

Signature

+

Optional[str]

+
+
+

inquiry.type (property)

+

+ Type of the object +

+

Signature

+

ClassVar[Union[str, None]]

+
+
+

inquiry.title (property)

+

+ A human-oriented title for the Inquiry. +

+

Signature

+

Optional[str]

+
+
+

inquiry.description (property)

+

+ A brief description of the Inquiry. +

+

Signature

+

Optional[str]

+
+
+

inquiry.tasks (property)

+

+ List of underlying tasks to be validated. +

+

Signature

+

List[InquiryTask]

+
+ + +
+

inquiry.validate (method)

+

Validate inquiry

+

Signature

+

(*, parallel: bool = False)

+

Parameters

+
    +
  • + parallel + (bool)
  • +
+
+ + +
+

InquiryTask (class)

+

Inquiry task representation.

+

Signature

+

(*, name: Optional[str] = None, type: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, path: Optional[str] = None, scheme: Optional[str] = None, format: Optional[str] = None, encoding: Optional[str] = None, mediatype: Optional[str] = None, compression: Optional[str] = None, extrapaths: Optional[List[str]] = None, innerpath: Optional[str] = None, dialect: Optional[Dialect] = None, schema: Optional[Schema] = None, checklist: Optional[Checklist] = None, resource: Optional[str] = None, package: Optional[str] = None) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + type + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + path + (Optional[str])
  • +
  • + scheme + (Optional[str])
  • +
  • + format + (Optional[str])
  • +
  • + encoding + (Optional[str])
  • +
  • + mediatype + (Optional[str])
  • +
  • + compression + (Optional[str])
  • +
  • + extrapaths + (Optional[List[str]])
  • +
  • + innerpath + (Optional[str])
  • +
  • + dialect + (Optional[Dialect])
  • +
  • + schema + (Optional[Schema])
  • +
  • + checklist + (Optional[Checklist])
  • +
  • + resource + (Optional[str])
  • +
  • + package + (Optional[str])
  • +
+
+ +
+

inquiryTask.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +

+

Signature

+

Optional[str]

+
+
+

inquiryTask.type (property)

+

+ Type of the source to be validated such as "package", "resource" etc. +

+

Signature

+

Optional[str]

+
+
+

inquiryTask.title (property)

+

+ A human-oriented title for the Inquiry task. +

+

Signature

+

Optional[str]

+
+
+

inquiryTask.description (property)

+

+ A brief description of the Inquiry task. +

+

Signature

+

Optional[str]

+
+
+

inquiryTask.path (property)

+

+ Path to the data source. +

+

Signature

+

Optional[str]

+
+
+

inquiryTask.scheme (property)

+

+ Scheme for loading the file (file, http, ...). If not set, it'll be + inferred from `source`. +

+

Signature

+

Optional[str]

+
+
+

inquiryTask.format (property)

+

+ File source's format (csv, xls, ...). If not set, it'll be + inferred from `source`. +

+

Signature

+

Optional[str]

+
+
+

inquiryTask.encoding (property)

+

+ Source encoding. If not set, it'll be inferred from `source`. +

+

Signature

+

Optional[str]

+
+
+

inquiryTask.mediatype (property)

+

+ Mediatype/mimetype of the resource e.g. “text/csv”, or “application/vnd.ms-excel”. + Mediatypes are maintained by the Internet Assigned Numbers Authority (IANA) in a + media type registry. +

+

Signature

+

Optional[str]

+
+
+

inquiryTask.compression (property)

+

+ Source file compression (zip, ...). If not set, it'll be inferred from `source`. +

+

Signature

+

Optional[str]

+
+
+

inquiryTask.extrapaths (property)

+

+ List of paths to concatenate to the main path. It's used for multipart resources. +

+

Signature

+

Optional[List[str]]

+
+
+

inquiryTask.innerpath (property)

+

+ Path within the compressed file. It defaults to the first file in the archive + (if the source is an archive). +

+

Signature

+

Optional[str]

+
+
+

inquiryTask.dialect (property)

+

+ Specific set of formatting parameters applied while reading data source. + The parameters are set as a Dialect class. For more information, please + check the Dialect Class documentation. +

+

Signature

+

Optional[Dialect]

+
+
+

inquiryTask.schema (property)

+

+ Schema descriptor. A string descriptor or path to schema file. +

+

Signature

+

Optional[Schema]

+
+
+

inquiryTask.checklist (property)

+

+ Checklist class with a set of validation checks to be applied to the + data source being read. For more information, please check the + Validation Checks documentation. +

+

Signature

+

Optional[Checklist]

+
+
+

inquiryTask.resource (property)

+

+ Resource descriptor. A string descriptor or path to resource file. +

+

Signature

+

Optional[str]

+
+
+

inquiryTask.package (property)

+

+ Package descriptor. A string descriptor or path to package + file. +

+

Signature

+

Optional[str]

+
+ + + + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/framework/package.html b/docs/framework/package.html new file mode 100644 index 0000000000..f6d10d9b6d --- /dev/null +++ b/docs/framework/package.html @@ -0,0 +1,4234 @@ + + + + + + + + +Package Class | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Package Class

+

A Data Package is a core Frictionless Data concept: a set of resources with additional metadata. You can read the Data Package Standard for more information.

+

Creating Package

+

Let's create a data package:

+ +
+
+
from frictionless import Package, Resource
+
+package = Package('table.csv') # from a resource path
+package = Package('tables/*') # from a resources glob
+package = Package(['tables/chunk1.csv', 'tables/chunk2.csv']) # from a list
+package = Package('package/datapackage.json') # from a descriptor path
+package = Package({'resources': [{'path': 'table.csv'}]}) # from a descriptor
+package = Package(resources=[Resource(path='table.csv')]) # from arguments
+
+ +
+

As you can see, it's possible to create a package from different kinds of sources, whose type is detected automatically (e.g. whether it's a glob or a path). It's possible to make this step more explicit:

+ +
+
+
from frictionless import Package, Resource
+
+package = Package(resources=[Resource(path='table.csv')]) # from arguments
+package = Package('datapackage.json') # from a descriptor
+
+ +
+

Describing Package

+

The standards support a great deal of package metadata, all of which is available in Frictionless Framework too:

+ +
+
+
from frictionless import Package, Resource
+
+package = Package(
+    name='package',
+    title='My Package',
+    description='My Package for the Guide',
+    resources=[Resource(path='table.csv')],
+    # it's possible to provide all the official properties like homepage, version, etc
+)
+print(package)
+
+ +
{'name': 'package',
+ 'title': 'My Package',
+ 'description': 'My Package for the Guide',
+ 'resources': [{'name': 'table',
+                'type': 'table',
+                'path': 'table.csv',
+                'scheme': 'file',
+                'format': 'csv',
+                'mediatype': 'text/csv'}]}
+ +
+

If you have created a package, for example, from a descriptor, you can access these properties:

+ +
+
+
from frictionless import Package
+
+package = Package('datapackage.json')
+print(package.name)
+# and others
+
+ +
test-tabulator
+ +
+

And edit them:

+ +
+
+
from frictionless import Package
+
+package = Package('datapackage.json')
+package.name = 'new-name'
+package.title = 'New Title'
+package.description = 'New Description'
+# and others
+print(package)
+
+ +
{'name': 'new-name',
+ 'title': 'New Title',
+ 'description': 'New Description',
+ 'resources': [{'name': 'first-resource',
+                'type': 'table',
+                'path': 'table.xls',
+                'scheme': 'file',
+                'format': 'xls',
+                'mediatype': 'application/vnd.ms-excel',
+                'schema': {'fields': [{'name': 'id', 'type': 'number'},
+                                      {'name': 'name', 'type': 'string'}]}},
+               {'name': 'number-two',
+                'type': 'table',
+                'path': 'table-reverse.csv',
+                'scheme': 'file',
+                'format': 'csv',
+                'mediatype': 'text/csv',
+                'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                                      {'name': 'name', 'type': 'string'}]}}]}
+ +
+

Resource Management

+

The core purpose of a package is to hold a set of resources. The Package class provides useful methods to manage them:

+ +
+
+
from frictionless import Package, Resource
+
+package = Package('datapackage.json')
+print(package.resources)
+print(package.resource_names)
+package.add_resource(Resource(name='new', data=[['key1', 'key2'], ['val1', 'val2']]))
+resource = package.get_resource('new')
+print(package.has_resource('new'))
+package.remove_resource('new')
+
+ +
[{'name': 'first-resource',
+ 'type': 'table',
+ 'path': 'table.xls',
+ 'scheme': 'file',
+ 'format': 'xls',
+ 'mediatype': 'application/vnd.ms-excel',
+ 'schema': {'fields': [{'name': 'id', 'type': 'number'},
+                       {'name': 'name', 'type': 'string'}]}}, {'name': 'number-two',
+ 'type': 'table',
+ 'path': 'table-reverse.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                       {'name': 'name', 'type': 'string'}]}}]
+['first-resource', 'number-two']
+True
+ +
+

Saving Descriptor

+

Like any of the Metadata classes, the Package class can be saved as JSON or YAML:

+ +
+
+
from frictionless import Package
+package = Package('tables/*')
+package.to_json('datapackage.json') # Save as JSON
+package.to_yaml('datapackage.yaml') # Save as YAML
+
+ +
+

Reference

+
+ + +
+
+ +

Package (class)

+ +
+
+ + +
+

Package (class)

+

Package representation + +This class is one of the cornerstones of the Frictionless framework. +It manages the underlying resources and provides an ability to describe a package. + +```python +package = Package(resources=[Resource(path="data/table.csv")]) +package.get_resource('table').read_rows() == [ + {'id': 1, 'name': 'english'}, + {'id': 2, 'name': '中国人'}, +] +```

+

Signature

+

(*, source: Optional[Any] = None, control: Optional[Control] = None, basepath: Optional[str] = None, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, homepage: Optional[str] = None, profile: Optional[str] = None, licenses: List[Dict[str, Any]] = NOTHING, sources: List[Dict[str, Any]] = NOTHING, contributors: List[Dict[str, Any]] = NOTHING, keywords: List[str] = NOTHING, image: Optional[str] = None, version: Optional[str] = None, created: Optional[str] = None, resources: List[Resource] = NOTHING, dataset: Optional[Dataset] = None, dialect: Optional[Dialect] = None, detector: Optional[Detector] = None) -> None

+

Parameters

+
    +
  • + source + (Optional[Any])
  • +
  • + control + (Optional[Control])
  • +
  • + basepath + (Optional[str])
  • +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + homepage + (Optional[str])
  • +
  • + profile + (Optional[str])
  • +
  • + licenses + (List[Dict[str, Any]])
  • +
  • + sources + (List[Dict[str, Any]])
  • +
  • + contributors + (List[Dict[str, Any]])
  • +
  • + keywords + (List[str])
  • +
  • + image + (Optional[str])
  • +
  • + version + (Optional[str])
  • +
  • + created + (Optional[str])
  • +
  • + resources + (List[Resource])
  • +
  • + dataset + (Optional[Dataset])
  • +
  • + dialect + (Optional[Dialect])
  • +
  • + detector + (Optional[Detector])
  • +
+
+ +
+

package.source (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[Any]

+
+
+

package.control (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[Control]

+
+
+

package._basepath (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[str]

+
+
+

package.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “.”, “_” or “-” characters. +

+

Signature

+

Optional[str]

+
+
+

package.type (property)

+

+ Type of the package +

+

Signature

+

ClassVar[Union[str, None]]

+
+
+

package.title (property)

+

+ A Package title according to the specs. + It should be a human-oriented title of the package. +

+

Signature

+

Optional[str]

+
+
+

package.description (property)

+

+ A Package description according to the specs. + It should be a human-oriented description of the package. +

+

Signature

+

Optional[str]

+
+
+

package.homepage (property)

+

+ A URL for the home on the web that is related to this package. + For example, github repository or ckan dataset address. +

+

Signature

+

Optional[str]

+
+
+

package.profile (property)

+

+ A fully-qualified URL that points directly to a JSON Schema + that can be used to validate the descriptor +

+

Signature

+

Optional[str]

+
+
+

package.licenses (property)

+

+ The license(s) under which the package is provided. +

+

Signature

+

List[Dict[str, Any]]

+
+
+

package.sources (property)

+

+ The raw sources for this data package. + It MUST be an array of Source objects. + Each Source object MUST have a title and + MAY have path and/or email properties. +

+

Signature

+

List[Dict[str, Any]]

+
+
+

package.contributors (property)

+

+ The people or organizations who contributed to this package. + It MUST be an array. Each entry is a Contributor and MUST be an object. + A Contributor MUST have a title property and MAY contain + path, email, role and organization properties. +

+

Signature

+

List[Dict[str, Any]]

+
+
+

package.keywords (property)

+

+ An array of string keywords to assist users in searching. + For example: ['data', 'fiscal'] +

+

Signature

+

List[str]

+
+
+

package.image (property)

+

+ An image to use for this data package. + For example, when showing the package in a listing. +

+

Signature

+

Optional[str]

+
+
+

package.version (property)

+

+ A version string identifying the version of the package. + It should conform to the Semantic Versioning requirements and + should follow the Data Package Version pattern. +

+

Signature

+

Optional[str]

+
+
+

package.created (property)

+

+ The datetime on which this package was created. + It must conform to the RFC3339 datetime string format. +

+

Signature

+

Optional[str]

+
+
+

package.resources (property)

+

+ A list of resource descriptors. + It can contain dicts or Resource instances. +

+

Signature

+

List[Resource]

+
+
+

package.dataset (property)

+

+ A reference to the dataset of the catalog this package is part of. If the package + is not part of any catalog, it is set to None. +

+

Signature

+

Optional[Dataset]

+
+
+

package._dialect (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[Dialect]

+
+
+

package._detector (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[Detector]

+
+ +
+

package.basepath (property)

+

The basepath of the package. + +The normpath of a resource is `basepath` joined with the resource's `path`

+

Signature

+

Optional[str]

+
+
+

package.resource_names (property)

+

Return names of resources

+

Signature

+

List[str]

+
+
+

package.resource_paths (property)

+

Return paths of resources

+

Signature

+

List[str]

+
+ +
+

package.add_resource (method)

+

Add new resource to the package

+

Signature

+

(resource: Union[Resource, str]) -> Resource

+

Parameters

+
    +
  • + resource + (Union[Resource, str])
  • +
+
+
+

package.analyze (method)

+

Analyze the resources of the package + +This feature is currently experimental, and its API may change +without warning.

+

Signature

+

(*, detailed: bool = False)

+

Parameters

+
    +
  • + detailed + (bool)
  • +
+
+
+

package.clear_resources (method)

+

Remove all the resources

+
+
+

package.dereference (method)

+

Dereference underlying metadata + +If any underlying metadata is provided as a string, +it will be replaced by the metadata object

+
+
+

Package.describe (method) (static)

+

Describe the given source as a package

+

Signature

+

(source: Optional[Any] = None, *, stats: bool = False, **options: Any)

+

Parameters

+
    +
  • + source + (Optional[Any]): data source
  • +
  • + stats + (bool)
  • +
  • + options + (Any)
  • +
+
+
+

package.extract (method)

+

Extract rows

+

Signature

+

(*, name: Optional[str] = None, filter: Optional[types.IFilterFunction] = None, process: Optional[types.IProcessFunction] = None, limit_rows: Optional[int] = None) -> types.ITabularData

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + filter + (Optional[types.IFilterFunction]): row filter function
  • +
  • + process + (Optional[types.IProcessFunction]): row processor function
  • +
  • + limit_rows + (Optional[int]): limit amount of rows to this number
  • +
+
+
+

package.flatten (method)

+

Flatten the package + +Parameters + spec (str[]): flatten specification

+

Signature

+

(spec: List[str] = ['name', 'path'])

+

Parameters

+
    +
  • + spec + (List[str])
  • +
+
+
+
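For example, a sketch against the datapackage.json used above (the printed output is illustrative):

+
from frictionless import Package
+
+package = Package('datapackage.json')
+print(package.flatten(['name', 'path']))
+# e.g. [['first-resource', 'table.xls'], ['number-two', 'table-reverse.csv']]
+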

package.get_resource (method)

+

Get resource by name

+

Signature

+

(name: str) -> Resource

+

Parameters

+
    +
  • + name + (str)
  • +
+
+
+

package.get_table_resource (method)

+

Get table resource by name (raise if not table)

+

Signature

+

(name: str) -> TableResource

+

Parameters

+
    +
  • + name + (str)
  • +
+
+
+

package.has_resource (method)

+

Check if a resource is present

+

Signature

+

(name: str) -> bool

+

Parameters

+
    +
  • + name + (str)
  • +
+
+
+

package.has_table_resource (method)

+

Check if a table resource is present

+

Signature

+

(name: str) -> bool

+

Parameters

+
    +
  • + name + (str)
  • +
+
+
+

package.infer (method)

+

Infer metadata

+

Signature

+

(*, stats: bool = False) -> None

+

Parameters

+
    +
  • + stats + (bool): stream files completely and infer stats
  • +
+
+
+

package.publish (method)

+

Publish package to any supported data portal

+

Signature

+

(target: Any = None, *, control: Optional[Control] = None) -> PublishResult

+

Parameters

+
    +
  • + target + (Any): url e.g. "https://github.com/frictionlessdata/repository-demo" of target[CKAN/Github...]
  • +
  • + control + (Optional[Control]): Github control
  • +
+
+
+

package.remove_resource (method)

+

Remove resource by name

+

Signature

+

(name: str) -> Resource

+

Parameters

+
    +
  • + name + (str)
  • +
+
+
+

package.set_resource (method)

+

Set resource by name

+

Signature

+

(resource: Resource) -> Optional[Resource]

+

Parameters

+
    +
  • + resource + (Resource)
  • +
+
+
+

package.to_copy (method)

+

Create a copy of the package

+

Signature

+

(**options: Any) -> Self

+

Parameters

+
    +
  • + options + (Any)
  • +
+
+
+

package.to_er_diagram (method)

+

Generate an ERD (Entity Relationship Diagram) from package resources + +and export it as a .dot file + +Based on: +- https://github.com/frictionlessdata/frictionless-py/issues/1118

+

Signature

+

(path: Optional[str] = None) -> str

+

Parameters

+
    +
  • + path + (Optional[str]): target path
  • +
+
+
+

package.transform (method)

+

Transform package

+

Signature

+

(self: Package, pipeline: Pipeline)

+

Parameters

+
    +
  • + pipeline + (Pipeline)
  • +
+
+
+

package.update_resource (method)

+

Update resource

+

Signature

+

(name: str, descriptor: types.IDescriptor) -> Resource

+

Parameters

+
    +
  • + name + (str)
  • +
  • + descriptor + (types.IDescriptor)
  • +
+
+
+

package.validate (method)

+

Validate package

+

Signature

+

(self: Package, checklist: Optional[Checklist] = None, *, name: Optional[str] = None, parallel: bool = False, limit_rows: Optional[int] = None, limit_errors: int = 1000)

+

Parameters

+
    +
  • + checklist + (Optional[Checklist])
  • +
  • + name + (Optional[str])
  • +
  • + parallel + (bool)
  • +
  • + limit_rows + (Optional[int])
  • +
  • + limit_errors + (int)
  • +
+
+ + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/framework/pipeline.html b/docs/framework/pipeline.html new file mode 100644 index 0000000000..9ba181ba2d --- /dev/null +++ b/docs/framework/pipeline.html @@ -0,0 +1,3783 @@ + + + + + + + + +Pipeline Class | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Pipeline Class

+

Pipeline is an object containing a list of transformation steps.

+

Creating Pipeline

+

Let's create a pipeline using the Python interface:

+ +
+
+
from frictionless import Pipeline, transform, steps
+
+pipeline = Pipeline(steps=[steps.table_normalize(), steps.table_melt(field_name='name')])
+print(pipeline)
+
+ +
{'steps': [{'type': 'table-normalize'},
+           {'type': 'table-melt', 'fieldName': 'name'}]}
+ +
+

Running Pipeline

+

To run a pipeline you need to use a transform function or method:

+ +
+
+
from frictionless import Pipeline, transform, steps
+
+pipeline = Pipeline(steps=[steps.table_normalize(), steps.table_melt(field_name='name')])
+resource = transform('table.csv', pipeline=pipeline)
+print(resource.schema)
+print(resource.read_rows())
+
+ +
{'fields': [{'name': 'name', 'type': 'string'},
+            {'name': 'variable', 'type': 'string'},
+            {'name': 'value', 'type': 'any'}]}
+[{'name': 'english', 'variable': 'id', 'value': 1}, {'name': '中国人', 'variable': 'id', 'value': 2}]
+ +
+

Transform Steps

+

The Step concept is a part of the Transform API. You can create a custom Step to be used as part of resource or package transformation.

+
+

This step uses PETL under the hood.

+
+
from frictionless import Step
+
+class cell_set(Step):
+    type = "cell-set"
+
+    def __init__(self, *, value=None, field_name=None):
+        self.value = value
+        self.field_name = field_name
+        super().__init__()
+
+    def transform_resource(self, resource):
+        # replace every cell of the field with the given value (PETL's update)
+        table = resource.to_petl()
+        resource.data = table.update(self.field_name, self.value)
+
+
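The custom step can then be used like any built-in step; a sketch (assuming a table.csv with a population field):

+
from frictionless import Pipeline, transform
+
+pipeline = Pipeline(steps=[cell_set(value=100, field_name='population')])
+resource = transform('table.csv', pipeline=pipeline)
+print(resource.read_rows())
+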

Reference

+
+ + +
+
+ +

Pipeline (class)

+

Step (class)

+ +
+
+ + +
+

Pipeline (class)

+

Pipeline representation

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, steps: List[Step] = NOTHING) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + steps + (List[Step])
  • +
+
+ +
+

pipeline.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +

+

Signature

+

Optional[str]

+
+
+

pipeline.type (property)

+

+ Type of the pipeline +

+

Signature

+

ClassVar[Union[str, None]]

+
+
+

pipeline.title (property)

+

+ A human-oriented title for the Pipeline. +

+

Signature

+

Optional[str]

+
+
+

pipeline.description (property)

+

+ A brief description of the Pipeline. +

+

Signature

+

Optional[str]

+
+
+

pipeline.steps (property)

+

+ List of transformation steps to apply. +

+

Signature

+

List[Step]

+
+ +
+

pipeline.step_types (property)

+

Return the list of step types

+

Signature

+

List[str]

+
+ +
+

pipeline.add_step (method)

+

Add new step to the pipeline

+

Signature

+

(step: Step) -> None

+

Parameters

+
    +
  • + step + (Step)
  • +
+
+
+

pipeline.clear_steps (method)

+

Remove all the steps

+

Signature

+

() -> None

+
+
+

pipeline.get_step (method)

+

Get step by type

+

Signature

+

(type: str) -> Step

+

Parameters

+
    +
  • + type + (str)
  • +
+
+
+

pipeline.has_step (method)

+

Check if a step is present

+

Signature

+

(type: str) -> bool

+

Parameters

+
    +
  • + type + (str)
  • +
+
+
+

pipeline.remove_step (method)

+

Remove step by type

+

Signature

+

(type: str) -> Step

+

Parameters

+
    +
  • + type + (str)
  • +
+
+
+

pipeline.set_step (method)

+

Set step by type

+

Signature

+

(step: Step) -> Optional[Step]

+

Parameters

+
    +
  • + step + (Step)
  • +
+
+ + +
+

Step (class)

+

Step representation. + +A base class for all the step subclasses.

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
+
+ +
+

step.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +

+

Signature

+

Optional[str]

+
+
+

step.type (property)

+

+ A short url-usable (and preferably human-readable) name/type. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. For example: "cell-fill". +

+

Signature

+

ClassVar[str]

+
+
+

step.title (property)

+

+ A human-oriented title for the Step. +

+

Signature

+

Optional[str]

+
+
+

step.description (property)

+

+ A brief description of the Step. +

+

Signature

+

Optional[str]

+
+ + +
+

step.transform_package (method)

+

Transform package

+

Signature

+

(package: Package)

+

Parameters

+
    +
  • + package + (Package): package
  • +
+
+
+

step.transform_resource (method)

+

Transform resource

+

Signature

+

(resource: Resource)

+

Parameters

+
    +
  • + resource + (Resource): resource
  • +
+
+ + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/framework/report.html b/docs/framework/report.html new file mode 100644 index 0000000000..e9c8c1bb61 --- /dev/null +++ b/docs/framework/report.html @@ -0,0 +1,4027 @@ + + + + + + + + +Report Class | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Report Class

+

Validation Report

+

All the validate functions return a Validation Report. It's a unified object containing information about a validation: source details, found errors, etc. Let's explore a report:

+ +
+
+
from frictionless import validate
+
+report = validate('capital-invalid.csv', pick_errors=['duplicate-label'])
+print(report)
+
+ +
{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.006},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-invalid',
+            'type': 'table',
+            'valid': False,
+            'place': 'capital-invalid.csv',
+            'labels': ['id', 'name', 'name'],
+            'stats': {'errors': 1,
+                      'warnings': 0,
+                      'seconds': 0.006,
+                      'md5': 'dcdeae358cfd50860c18d953e021f836',
+                      'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+                      'bytes': 171,
+                      'fields': 3,
+                      'rows': 11},
+            'warnings': [],
+            'errors': [{'type': 'duplicate-label',
+                        'title': 'Duplicate Label',
+                        'description': 'Two columns in the header row have the '
+                                       'same value. Column names should be '
+                                       'unique.',
+                        'message': 'Label "name" in the header at position "3" '
+                                   'is duplicated to a label: at position "2"',
+                        'tags': ['#table', '#header', '#label'],
+                        'note': 'at position "2"',
+                        'labels': ['id', 'name', 'name'],
+                        'rowNumbers': [1],
+                        'label': 'name',
+                        'fieldName': 'name2',
+                        'fieldNumber': 3}]}]}
+ +
+

As we can see, there is a lot of information; you can find a detailed description in the "API Reference". Errors are grouped by tables; for some validations there can be dozens of tables. Let's use the report.flatten function to simplify the error representation:

+ +
+
+
from pprint import pprint
+from frictionless import validate
+
+report = validate('capital-invalid.csv', pick_errors=['duplicate-label'])
+pprint(report.flatten(['rowNumber', 'fieldNumber', 'code', 'message']))
+
+ +
[[None,
+  3,
+  None,
+  'Label "name" in the header at position "3" is duplicated to a label: at '
+  'position "2"']]
+ +
+

In some situations, an error can't be associated with a table; then it goes to the top-level report.errors property:

+ +
+
+
from frictionless import validate
+
+report = validate('bad.json', type='schema')
+print(report)
+
+ +
{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.0},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'bad',
+            'type': 'json',
+            'valid': False,
+            'place': 'bad.json',
+            'labels': [],
+            'stats': {'errors': 1, 'warnings': 0, 'seconds': 0.0},
+            'warnings': [],
+            'errors': [{'type': 'schema-error',
+                        'title': 'Schema Error',
+                        'description': 'Provided schema is not valid.',
+                        'message': 'Schema is not valid: cannot retrieve '
+                                   'metadata "bad.json" because "[Errno 2] No '
+                                   'such file or directory: \'bad.json\'"',
+                        'tags': [],
+                        'note': 'cannot retrieve metadata "bad.json" because '
+                                '"[Errno 2] No such file or directory: '
+                                '\'bad.json\'"'}]}]}
+ +
+

Validation Errors

+

The Error object is at the heart of the validation process. The Report has report.errors and report.tasks[].errors properties that can contain Error objects. Let's explore one:

+ +
+
+
from frictionless import validate
+
+report = validate('capital-invalid.csv', pick_errors=['duplicate-label'])
+error = report.task.error # it's only available for 1 table / 1 error situation
+print(f'Type: "{error.type}"')
+print(f'Title: "{error.title}"')
+print(f'Tags: "{error.tags}"')
+print(f'Note: "{error.note}"')
+print(f'Message: "{error.message}"')
+print(f'Description: "{error.description}"')
+
+ +
Type: "duplicate-label"
+Title: "Duplicate Label"
+Tags: "['#table', '#header', '#label']"
+Note: "at position "2""
+Message: "Label "name" in the header at position "3" is duplicated to a label: at position "2""
+Description: "Two columns in the header row have the same value. Column names should be unique."
+ +
+

Above, we have listed the universal error properties. Depending on the type of error, there can be additional ones. For example, for our duplicate-label error:

+ +
+
+
from frictionless import validate
+
+report = validate('capital-invalid.csv', pick_errors=['duplicate-label'])
+error = report.task.error # it's only available for 1 table / 1 error situation
+print(error)
+
+ +
{'type': 'duplicate-label',
+ 'title': 'Duplicate Label',
+ 'description': 'Two columns in the header row have the same value. Column '
+                'names should be unique.',
+ 'message': 'Label "name" in the header at position "3" is duplicated to a '
+            'label: at position "2"',
+ 'tags': ['#table', '#header', '#label'],
+ 'note': 'at position "2"',
+ 'labels': ['id', 'name', 'name'],
+ 'rowNumbers': [1],
+ 'label': 'name',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3}
+ +
+

Please explore "Errors Reference" to learn about all the available errors and their properties.

+

Reference

+
+ + +
+
+ +

Report (class)

+

ReportTask (class)

+ +
+
+ + +
+

Report (class)

+

Report representation. + +A class that stores the summary of the validation action.

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, valid: bool, stats: types.IReportStats, warnings: List[str] = NOTHING, errors: List[Error] = NOTHING, tasks: List[ReportTask] = NOTHING) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + valid + (bool)
  • +
  • + stats + (types.IReportStats)
  • +
  • + warnings + (List[str])
  • +
  • + errors + (List[Error])
  • +
  • + tasks + (List[ReportTask])
  • +
+
+ +
+

report.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +

+

Signature

+

Optional[str]

+
+
+

report.type (property)

+

+ Type of the report +

+

Signature

+

ClassVar[Union[str, None]]

+
+
+

report.title (property)

+

+ A human-oriented title for the Report. +

+

Signature

+

Optional[str]

+
+
+

report.description (property)

+

+ A brief description of the Detector. +

+

Signature

+

Optional[str]

+
+
+

report.valid (property)

+

+ Flag to specify if the data is valid or not. +

+

Signature

+

bool

+
+
+

report.stats (property)

+

+ Additional statistics of the data as defined in Stats class. +

+

Signature

+

types.IReportStats

+
+
+

report.warnings (property)

+

+ List of warnings raised while validating the data. +

+

Signature

+

List[str]

+
+
+

report.errors (property)

+

+ List of errors raised while validating the data. +

+

Signature

+

List[Error]

+
+
+

report.tasks (property)

+

+ List of tasks that were applied during data validation. +

+

Signature

+

List[ReportTask]

+
+ +
+

report.error (property)

+

Validation error (if there is only one)

+
+
+

report.task (property)

+

Validation task (if there is only one)

+
+ +
+

report.flatten (method)

+

Flatten the report + +Parameters + spec (str[]): flatten specification

+

Signature

+

(spec: List[str] = ['taskNumber', 'rowNumber', 'fieldNumber', 'type'])

+

Parameters

+
    +
  • + spec + (List[str])
  • +
+
+
+

Report.from_validation (method) (static)

+

Create a report from a validation

+

Signature

+

(*, time: float = 0, tasks: List[ReportTask] = [], errors: List[Error] = [], warnings: List[str] = [])

+

Parameters

+
    +
  • + time + (float)
  • +
  • + tasks + (List[ReportTask])
  • +
  • + errors + (List[Error])
  • +
  • + warnings + (List[str])
  • +
+
+
+

Report.from_validation_reports (method) (static)

+

Create a report from a set of validation reports

+

Signature

+

(*, time: float, reports: List[Report])

+

Parameters

+
    +
  • + time + (float)
  • +
  • + reports + (List[Report])
  • +
+
+
+

Report.from_validation_task (method) (static)

+

Create a report from a validation task

+

Signature

+

(resource: Resource, *, time: float, labels: List[str] = [], errors: List[Error] = [], warnings: List[str] = [])

+

Parameters

+
    +
  • + resource + (Resource)
  • +
  • + time + (float)
  • +
  • + labels + (List[str])
  • +
  • + errors + (List[Error])
  • +
  • + warnings + (List[str])
  • +
+
+
+

report.to_summary (method)

+

Summary of the report

+
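For example, a minimal sketch printing a human-readable summary of a validation:

+
from frictionless import validate
+
+report = validate('capital-invalid.csv')
+print(report.to_summary())
+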
+ + +
+

ReportTask (class)

+

Report task representation.

+

Signature

+

(*, name: str, type: Optional[str], title: Optional[str] = None, description: Optional[str] = None, valid: bool, place: str, labels: List[str], stats: types.IReportTaskStats, warnings: List[str] = NOTHING, errors: List[Error] = NOTHING) -> None

+

Parameters

+
    +
  • + name + (str)
  • +
  • + type + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + valid + (bool)
  • +
  • + place + (str)
  • +
  • + labels + (List[str])
  • +
  • + stats + (types.IReportTaskStats)
  • +
  • + warnings + (List[str])
  • +
  • + errors + (List[Error])
  • +
+
+ +
+

reportTask.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +

+

Signature

+

str

+
+
+

reportTask.type (property)

+

+ Type of the task's resource. If the type is "table", the tabular property is set to True. +

+

Signature

+

Optional[str]

+
+
+

reportTask.title (property)

+

+ A human-oriented title for the Report task. +

+

Signature

+

Optional[str]

+
+
+

reportTask.description (property)

+

+ A brief description of the Report task. +

+

Signature

+

Optional[str]

+
+
+

reportTask.valid (property)

+

+ Flag to specify if the data is valid or not. +

+

Signature

+

bool

+
+
+

reportTask.place (property)

+

+ Specifies the place of the file. For example: "", "data/table.csv" etc. +

+

Signature

+

str

+
+
+

reportTask.labels (property)

+

+ List of labels of the task resource. +

+

Signature

+

List[str]

+
+
+

reportTask.stats (property)

+

+ Additional statistics of the data as defined in Stats class. +

+

Signature

+

types.IReportTaskStats

+
+
+

reportTask.warnings (property)

+

+ List of warnings raised while validating the data. +

+

Signature

+

List[str]

+
+
+

reportTask.errors (property)

+

+ List of errors raised while validating the data. +

+

Signature

+

List[Error]

+
+ +
+

reportTask.error (property)

+

Validation error if there is only one

+
+
+

reportTask.tabular (property)

+

Whether task's resource is tabular

+

Signature

+

bool

+
+ +
+

reportTask.flatten (method)

+

Flatten the report + +Parameters + spec (any[]): flatten specification

+

Signature

+

(spec: List[str] = ['rowNumber', 'fieldNumber', 'type'])

+

Parameters

+
    +
  • + spec + (List[str])
  • +
+
+
+

reportTask.to_summary (method)

+

Generate summary for validation task

+

Signature

+

() -> str

+
+ + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/framework/resource.html b/docs/framework/resource.html new file mode 100644 index 0000000000..470a17c9ed --- /dev/null +++ b/docs/framework/resource.html @@ -0,0 +1,4701 @@ + + + + + + + + +Resource Class | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Resource Class

+

The Resource class is arguably the most important class of the whole Frictionless Framework. It's based on the Data Resource Standard and the Tabular Data Resource Standard.

+

Creating Resource

+

Let's create a data resource:

+ +
+
+
from frictionless import Resource
+
+resource = Resource('table.csv') # from a resource path
+resource = Resource('resource.json') # from a descriptor path
+resource = Resource({'path': 'table.csv'}) # from a descriptor
+resource = Resource(path='table.csv') # from arguments
+
+ +
+

As you can see, it's possible to create a resource from different kinds of sources, whose type is detected automatically (e.g. whether it's a descriptor or a path). It's possible to make this step more explicit:

+ +
+
+
from frictionless import Resource
+
+resource = Resource(path='data/table.csv') # from a path
+resource = Resource('data/resource.json') # from a descriptor
+
+ +
+

Describing Resource

+

The standards support a great deal of resource metadata, all of which is available in Frictionless Framework too:

+ +
+
+
from frictionless import Resource
+
+resource = Resource(
+    name='resource',
+    title='My Resource',
+    description='My Resource for the Guide',
+    path='table.csv',
+    # it's possible to provide all the official properties like mediatype, etc
+)
+print(resource)
+
+ +
{'name': 'resource',
+ 'type': 'table',
+ 'title': 'My Resource',
+ 'description': 'My Resource for the Guide',
+ 'path': 'table.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
+ +
+

If you have created a resource, for example, from a descriptor, you can access these properties:

+ +
+
+
from frictionless import Resource
+
+resource = Resource('resource.json')
+print(resource.name)
+# and others
+
+ +
name
+ +
+

And edit them:

+ +
+
+
from frictionless import Resource
+
+resource = Resource('resource.json')
+resource.name = 'new-name'
+resource.title = 'New Title'
+resource.description = 'New Description'
+# and others
+print(resource)
+
+ +
{'name': 'new-name',
+ 'type': 'table',
+ 'title': 'New Title',
+ 'description': 'New Description',
+ 'path': 'table.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
+ +
+

Saving Descriptor

+

Like any of the Metadata classes, the Resource class can be saved as JSON or YAML:

+ +
+
+
from frictionless import Resource
+resource = Resource('table.csv')
+resource.to_json('resource.json') # Save as JSON
+resource.to_yaml('resource.yaml') # Save as YAML
+
+ +
+

Resource Lifecycle

+

You might have noticed that we had to duplicate the with Resource(...) statement in some examples. The reason is that Resource is a streaming interface. Once it's read, you need to open it again. Let's show it in an example:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('capital-3.csv')
+resource.open()
+pprint(resource.read_rows())
+pprint(resource.read_rows())
+# We need to re-open: there is no data left
+resource.open()
+pprint(resource.read_rows())
+# We need to close manually: no context manager is used
+resource.close()
+
+ +
[{'id': 1, 'name': 'London'},
+ {'id': 2, 'name': 'Berlin'},
+ {'id': 3, 'name': 'Paris'},
+ {'id': 4, 'name': 'Madrid'},
+ {'id': 5, 'name': 'Rome'}]
+[]
+[{'id': 1, 'name': 'London'},
+ {'id': 2, 'name': 'Berlin'},
+ {'id': 3, 'name': 'Paris'},
+ {'id': 4, 'name': 'Madrid'},
+ {'id': 5, 'name': 'Rome'}]
+ +
+

At the same time, you can read data from a resource without opening and closing it explicitly. In this case Frictionless Framework will open and close the resource for you, so it will basically be a one-time operation:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('capital-3.csv')
+pprint(resource.read_rows())
+
+ +
[{'id': 1, 'name': 'London'},
+ {'id': 2, 'name': 'Berlin'},
+ {'id': 3, 'name': 'Paris'},
+ {'id': 4, 'name': 'Madrid'},
+ {'id': 5, 'name': 'Rome'}]
+ +
+

Reading Data

+

The Resource class is also a metadata class, which provides various read and stream functions. The extract functions always read rows into memory; Resource can do the same, but it also gives you a choice of output data: rows, cells, text, or bytes. Let's try reading all of them:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('country-3.csv')
+pprint(resource.read_bytes())
+pprint(resource.read_text())
+pprint(resource.read_cells())
+pprint(resource.read_rows())
+
+ +
(b'id,capital_id,name,population\n1,1,Britain,67\n2,3,France,67\n3,2,Germany,8'
+ b'3\n4,5,Italy,60\n5,4,Spain,47\n')
+''
+[['id', 'capital_id', 'name', 'population'],
+ ['1', '1', 'Britain', '67'],
+ ['2', '3', 'France', '67'],
+ ['3', '2', 'Germany', '83'],
+ ['4', '5', 'Italy', '60'],
+ ['5', '4', 'Spain', '47']]
+[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
+ {'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
+ {'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
+ {'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
+ {'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]
+ +
+

It's really handy to read all your data into memory but it's not always possible if a file is really big. For such cases, Frictionless provides streaming functions:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+with Resource('country-3.csv') as resource:
+    pprint(resource.byte_stream)
+    pprint(resource.text_stream)
+    pprint(resource.cell_stream)
+    pprint(resource.row_stream)
+    for row in resource.row_stream:
+      print(row)
+
+ +
<frictionless.system.loader.ByteStreamWithStatsHandling object at 0x7f5e0aaf69b0>
+<_io.TextIOWrapper name='country-3.csv' encoding='utf-8'>
+<itertools.chain object at 0x7f5e0aa73cd0>
+<generator object TableResource.__open_row_stream.<locals>.row_stream at 0x7f5e0a95ab90>
+{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67}
+{'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67}
+{'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83}
+{'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60}
+{'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}
+ +
+

Indexing Data

+
+ +

Indexing a resource, in Frictionless terms, means loading a data table into a database. Let's explore how this feature works in different modes.

+
+

All the examples are written for SQLite for simplicity

+
+

Normal Mode

+

This mode is supported for any database that is supported by sqlalchemy. Under the hood, Frictionless will infer a Table Schema and populate the data table as it normally reads data. This means that type errors will be replaced by null values, and in general the process is guaranteed to finish successfully for any data, even very invalid data.

+ +
+
+
frictionless index table.csv --database sqlite:///index/project.db --name table
+frictionless extract sqlite:///index/project.db --table table --json
+
+ +
──────────────────────────────────── Index ─────────────────────────────────────
+
+[table] Indexed 3 rows in 0.217 seconds
+──────────────────────────────────── Result ────────────────────────────────────
+Succesefully indexed 1 tables
+{
+  "project": [
+    {
+      "id": 1,
+      "name": "english"
+    },
+    {
+      "id": 2,
+      "name": "中国人"
+    }
+  ]
+}
+ +
+
+
import sqlite3
+from frictionless import Resource, formats
+
+resource = Resource('table.csv')
+resource.index('sqlite:///index/project.db', name='table')
+print(Resource('sqlite:///index/project.db', control=formats.sql.SqlControl(table='table')).extract())
+
+ +
{'project': [{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]}
+ +
+

Fast Mode

+
+ +

Fast mode is supported for SQLite and Postgresql databases. It will infer a Table Schema using a data sample and index the data using COPY in Postgresql and .import in SQLite. For big data files this mode will be 10-30x faster than normal indexing, but the speed comes with a price: if there is invalid data, the indexing will fail.

+ +
+
+
frictionless index table.csv --database sqlite:///index/project.db --name table --fast
+frictionless extract sqlite:///index/project.db --table table --json
+
+ +
──────────────────────────────────── Index ─────────────────────────────────────
+
+[table] Indexed 30 bytes in 0.209 seconds
+──────────────────────────────────── Result ────────────────────────────────────
+Succesefully indexed 1 tables
+{
+  "project": [
+    {
+      "id": 1,
+      "name": "english"
+    },
+    {
+      "id": 2,
+      "name": "中国人"
+    }
+  ]
+}
+ +
+
+
import sqlite3
+from frictionless import Resource, formats
+
+resource = Resource('table.csv')
+resource.index('sqlite:///index/project.db', name='table', fast=True)
+print(Resource('sqlite:///index/project.db', control=formats.sql.SqlControl(table='table')).extract())
+
+ +
{'project': [{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]}
+ +
+

Solution 1: Fallback

+

To ensure that the data will be successfully indexed, it's possible to use the fallback option. If the fast indexing fails, Frictionless will start over in normal mode and finish the process successfully.

+ +
+
+
frictionless index table.csv --database sqlite:///index/project.db --name table --fast --fallback
+
+ +
+
+
import sqlite3
+from frictionless import Resource, formats
+
+resource = Resource('table.csv')
+resource.index('sqlite:///index/project.db', name='table', fast=True, fallback=True)
+
+ +
+

Solution 2: QSV

+

Another option is to provide a path to the QSV binary. In this case, initial schema inference will be based on the whole data file, which guarantees that the table is valid type-wise:

+ +
+
+
frictionless index table.csv --database sqlite:///index/project.db --name table --fast --qsv qsv_path
+
+ +
+
+
import sqlite3
+from frictionless import Resource, formats
+
+resource = Resource('table.csv')
+resource.index('sqlite:///index/project.db', name='table', fast=True, qsv_path='qsv_path')
+
+ +
+

Scheme

+

The scheme, also known as protocol, indicates which loader Frictionless should use to read or write data. It can be file (default), text, http, https, s3, and others.

+ +
+
+
from frictionless import Resource
+
+with Resource(b'header1,header2\nvalue1,value2', format='csv') as resource:
+  print(resource.scheme)
+  print(resource.to_view())
+
+ +
buffer
++----------+----------+
+| header1  | header2  |
++==========+==========+
+| 'value1' | 'value2' |
++----------+----------+
+ +
+

Format

+

The format, also called the extension, helps Frictionless choose a proper parser to handle the file. Popular formats are csv, xlsx, json, and others.

+ +
+
+
from frictionless import Resource
+
+with Resource(b'header1,header2\nvalue1,value2.csv', format='csv') as resource:
+  print(resource.format)
+  print(resource.to_view())
+
+ +
csv
++----------+--------------+
+| header1  | header2      |
++==========+==============+
+| 'value1' | 'value2.csv' |
++----------+--------------+
+ +
+

Encoding

+

Frictionless automatically detects the encoding of files, but sometimes it can be inaccurate. It's possible to provide an encoding manually:

+ +
+
+
from frictionless import Resource
+
+with Resource('country-3.csv', encoding='utf-8') as resource:
+  print(resource.encoding)
+  print(resource.path)
+
+ +
utf-8
+country-3.csv
+ +
+
+
+

Innerpath

+

By default, Frictionless uses the first file found in a zip archive. It's possible to adjust this behaviour:

+ +
+
+
from frictionless import Resource
+
+with Resource('table-multiple-files.zip', innerpath='table-reverse.csv') as resource:
+  print(resource.compression)
+  print(resource.innerpath)
+  print(resource.to_view())
+
+ +
zip
+table-reverse.csv
++----+-----------+
+| id | name      |
++====+===========+
+|  1 | '中国人'     |
++----+-----------+
+|  2 | 'english' |
++----+-----------+
+ +
+

Compression

+

It's possible to adjust compression detection by providing the algorithm explicitly. For the example below it's not required as it would be detected anyway:

+ +
+
+
from frictionless import Resource
+
+with Resource('table.csv.zip', compression='zip') as resource:
+  print(resource.compression)
+  print(resource.to_view())
+
+ +
zip
++----+-----------+
+| id | name      |
++====+===========+
+|  1 | 'english' |
++----+-----------+
+|  2 | '中国人'     |
++----+-----------+
+ +
+

Dialect

+

Please read Table Dialect Guide for more information.

+

Schema

+

Please read Table Schema Guide for more information.

+

Checklist

+

Please read Checklist Guide for more information.

+

Pipeline

+

Please read Pipeline Guide for more information.

+

Stats

+

Resource's stats can be accessed with resource.stats:

+ +
+
+
from frictionless import Resource
+
+resource = Resource('table.csv')
+resource.infer(stats=True)
+print(resource.stats)
+
+ +
<frictionless.resource.stats.ResourceStats object at 0x7f5e09c8e020>
+ +
+
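The printed value is a ResourceStats object rather than plain data. As a minimal sketch (the attribute names rows and bytes are an assumption about the ResourceStats class), individual statistics can be read as attributes:

```python
from frictionless import Resource

resource = Resource('table.csv')
resource.infer(stats=True)  # stream the file completely so stats are computed
# attribute names below are assumptions about ResourceStats
print(resource.stats.rows)
print(resource.stats.bytes)
```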

Reference

+

Resource (class)

+

Resource representation.

This class is one of the cornerstones of the Frictionless framework. It loads a data source, and allows you to stream its parsed contents. At the same time, it's a metadata class for data description.

```python
with Resource("data/table.csv") as resource:
    resource.header == ["id", "name"]
    resource.read_rows() == [
        {'id': 1, 'name': 'english'},
        {'id': 2, 'name': '中国人'},
    ]
```

+

Signature

+

(*, source: Optional[Any] = None, control: Optional[Control] = None, packagify: bool = False, name: Optional[str] = , title: Optional[str] = None, description: Optional[str] = None, homepage: Optional[str] = None, profile: Optional[str] = None, licenses: List[Dict[str, Any]] = NOTHING, sources: List[Dict[str, Any]] = NOTHING, path: Optional[str] = None, data: Optional[Any] = None, scheme: Optional[str] = None, format: Optional[str] = None, datatype: Optional[str] = , mediatype: Optional[str] = None, compression: Optional[str] = None, extrapaths: List[str] = NOTHING, innerpath: Optional[str] = None, encoding: Optional[str] = None, hash: Optional[str] = None, bytes: Optional[int] = None, fields: Optional[int] = None, rows: Optional[int] = None, dialect: Union[Dialect, str] = NOTHING, schema: Union[Schema, str] = NOTHING, basepath: Optional[str] = None, detector: Detector = NOTHING, package: Optional[Package] = None) -> None

+

Parameters

+
    +
  • + source + (Optional[Any])
  • +
  • + control + (Optional[Control])
  • +
  • + packagify + (bool)
  • +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + homepage + (Optional[str])
  • +
  • + profile + (Optional[str])
  • +
  • + licenses + (List[Dict[str, Any]])
  • +
  • + sources + (List[Dict[str, Any]])
  • +
  • + path + (Optional[str])
  • +
  • + data + (Optional[Any])
  • +
  • + scheme + (Optional[str])
  • +
  • + format + (Optional[str])
  • +
  • + datatype + (Optional[str])
  • +
  • + mediatype + (Optional[str])
  • +
  • + compression + (Optional[str])
  • +
  • + extrapaths + (List[str])
  • +
  • + innerpath + (Optional[str])
  • +
  • + encoding + (Optional[str])
  • +
  • + hash + (Optional[str])
  • +
  • + bytes + (Optional[int])
  • +
  • + fields + (Optional[int])
  • +
  • + rows + (Optional[int])
  • +
  • + dialect + (Union[Dialect, str])
  • +
  • + schema + (Union[Schema, str])
  • +
  • + basepath + (Optional[str])
  • +
  • + detector + (Detector)
  • +
  • + package + (Optional[Package])
  • +
+
+ +
+

resource.source (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[Any]

+
+
+

resource.control (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[Control]

+
+
+

resource.packagify (property)

+

+ # TODO: add docs +

+

Signature

+

bool

+
+
+

resource._name (property)

+

+ Resource name according to the specs. + It should be a slugified name of the resource. +

+

Signature

+

Optional[str]

+
+
+

resource.type (property)

+

+ Type of the resource +

+

Signature

+

ClassVar[str]

+
+
+

resource.title (property)

+

Resource title according to the specs. It should be a human-oriented title of the resource.

+

Signature

+

Optional[str]

+
+
+

resource.description (property)

+

Resource description according to the specs. It should be a human-oriented description of the resource.

+

Signature

+

Optional[str]

+
+
+

resource.homepage (property)

+

+ A URL for the home on the web that is related to this package. + For example, github repository or ckan dataset address. +

+

Signature

+

Optional[str]

+
+
+

resource.profile (property)

+

+ A fully-qualified URL that points directly to a JSON Schema + that can be used to validate the descriptor +

+

Signature

+

Optional[str]

+
+
+

resource.licenses (property)

+

+ The license(s) under which the resource is provided. + If omitted it's considered the same as the package's licenses. +

+

Signature

+

List[Dict[str, Any]]

+
+
+

resource.sources (property)

+

+ The raw sources for this data resource. + It MUST be an array of Source objects. + Each Source object MUST have a title and + MAY have path and/or email properties. +

+

Signature

+

List[Dict[str, Any]]

+
+
+

resource.path (property)

+

+ Path to data source +

+

Signature

+

Optional[str]

+
+
+

resource.data (property)

+

+ Inline data source +

+

Signature

+

Optional[Any]

+
+
+

resource.scheme (property)

+

+ Scheme for loading the file (file, http, ...). + If not set, it'll be inferred from `source`. +

+

Signature

+

Optional[str]

+
+
+

resource.format (property)

+

+ File source's format (csv, xls, ...). + If not set, it'll be inferred from `source`. +

+

Signature

+

Optional[str]

+
+
+

resource._datatype (property)

+

+ Frictionless Framework specific data type as "table" or "schema" +

+

Signature

+

Optional[str]

+
+
+

resource.mediatype (property)

+

+ Mediatype/mimetype of the resource e.g. “text/csv”, + or “application/vnd.ms-excel”. Mediatypes are maintained by the + Internet Assigned Numbers Authority (IANA) in a media type registry. +

+

Signature

+

Optional[str]

+
+
+

resource.compression (property)

+

+ Source file compression (zip, ...). + If not set, it'll be inferred from `source`. +

+

Signature

+

Optional[str]

+
+
+

resource.extrapaths (property)

+

+ List of paths to concatenate to the main path. + It's used for multipart resources. +

+

Signature

+

List[str]

+
+
+

resource.innerpath (property)

+

+ Path within the compressed file. + It defaults to the first file in the archive (if the source is an archive). +

+

Signature

+

Optional[str]

+
+
+

resource.encoding (property)

+

+ Source encoding. + If not set, it'll be inferred from `source`. +

+

Signature

+

Optional[str]

+
+
+

resource.hash (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[str]

+
+
+

resource.bytes (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[int]

+
+
+

resource.fields (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[int]

+
+
+

resource.rows (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[int]

+
+
+

resource._dialect (property)

+

+ # TODO: add docs +

+

Signature

+

Union[Dialect, str]

+
+
+

resource._schema (property)

+

+ # TODO: add docs +

+

Signature

+

Union[Schema, str]

+
+
+

resource._basepath (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[str]

+
+
+

resource.detector (property)

+

+ File/table detector. + For more information, please check the Detector documentation. +

+

Signature

+

Detector

+
+
+

resource.package (property)

+

+ Parental to this resource package. + For more information, please check the Package documentation. +

+

Signature

+

Optional[Package]

+
+
+

resource.stats (property)

+

+ # TODO: add docs +

+

Signature

+

ResourceStats

+
+
+

resource.tabular (property)

+

+ Whether the resource is tabular +

+

Signature

+

ClassVar[bool]

+
+ +
+

resource.basepath (property)

+

A basepath of the resource. The normpath of the resource is `basepath` joined with `path`.

+

Signature

+

Optional[str]

+
+
+

resource.buffer (property)

+

File's bytes used as a sample. These buffer bytes are used to infer characteristics of the source file (e.g. encoding, ...).

+

Signature

+

types.IBuffer

+
+
+

resource.byte_stream (property)

+

Byte stream in form of a generator

+

Signature

+

types.IByteStream

+
+
+

resource.closed (property)

+

Whether the table is closed

+

Signature

+

bool

+
+
+

resource.memory (property)

+

Whether resource is not path based

+

Signature

+

bool

+
+
+

resource.multipart (property)

+

Whether resource is multipart

+

Signature

+

bool

+
+
+

resource.normpath (property)

+

Normalized path of the resource or raise if not set

+

Signature

+

Optional[str]

+
+
+

resource.normpaths (property)

+

Normalized paths of the resource

+

Signature

+

List[str]

+
+
+

resource.paths (property)

+

All paths of the resource

+

Signature

+

List[str]

+
+
+

resource.place (property)

+

Stringified resource location

+

Signature

+

str

+
+
+

resource.remote (property)

+

Whether resource is remote

+

Signature

+

bool

+
+
+

resource.text_stream (property)

+

Text stream in form of a generator

+

Signature

+

types.ITextStream

+
+ +
+

resource.close (method)

+

Close the resource as "filelike.close" does

+

Signature

+

() -> None

+
+
+

resource.dereference (method)

+

Dereference underlying metadata. If any of the underlying metadata is provided as a string, it will be replaced by the metadata object.

+
+
+

Resource.describe (method) (static)

+

Describe the given source as a resource

+

Signature

+

(source: Optional[Any] = None, *, name: Optional[str] = None, type: Optional[str] = None, stats: bool = False, **options: Any) -> Metadata

+

Parameters

+
    +
  • + source + (Optional[Any]): data source
  • +
  • + name + (Optional[str]): resource name
  • +
  • + type + (Optional[str]): data type: "package", "resource", "dialect", or "schema"
  • +
  • + stats + (bool): if `True` infer resource's stats
  • +
  • + options + (Any)
  • +
+
+
+

resource.infer (method)

+

Infer metadata

+

Signature

+

(*, stats: bool = False) -> None

+

Parameters

+
    +
  • + stats + (bool): stream file completely and infer stats
  • +
+
+
+

resource.list (method)

+

List dataset resources

+

Signature

+

(*, name: Optional[str] = None) -> List[Resource]

+

Parameters

+
    +
  • + name + (Optional[str]): limit to one resource (if applicable)
  • +
+
+
+

resource.open (method)

+

Open the resource as "io.open" does

+
+
+

resource.read_bytes (method)

+

Read bytes into memory

+

Signature

+

(*, size: Optional[int] = None) -> bytes

+

Parameters

+
    +
  • + size + (Optional[int])
  • +
+
+
+

resource.read_data (method)

+

Read data into memory

+

Signature

+

(*, size: Optional[int] = None) -> Any

+

Parameters

+
    +
  • + size + (Optional[int])
  • +
+
+
+

resource.read_text (method)

+

Read text into memory

+

Signature

+

(*, size: Optional[int] = None) -> str

+

Parameters

+
    +
  • + size + (Optional[int])
  • +
+
+
+
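A hedged usage sketch for the three reading methods above (assuming a local table.csv exists): each can be called on a resource directly, with the optional size argument limiting how much is read:

```python
from frictionless import Resource

resource = Resource('table.csv')
print(resource.read_bytes(size=10))  # at most the first 10 bytes of the file
print(resource.read_text())          # the whole file as a string
```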

resource.to_copy (method)

+

Create a copy from the resource

+

Signature

+

(**options: Any) -> Self

+

Parameters

+
    +
  • + options + (Any)
  • +
+
+
+

resource.validate (method)

+

Validate resource

+

Signature

+

(checklist: Optional[Checklist] = None, *, name: Optional[str] = None, on_row: Optional[types.ICallbackFunction] = None, parallel: bool = False, limit_rows: Optional[int] = None, limit_errors: int = 1000) -> Report

+

Parameters

+
    +
  • + checklist + (Optional[Checklist]): a Checklist object
  • +
  • + name + (Optional[str]): limit validation to one resource (if applicable)
  • +
  • + on_row + (Optional[types.ICallbackFunction]): callback for every row
  • +
  • + parallel + (bool)
  • +
  • + limit_rows + (Optional[int]): limit amount of rows to this number
  • +
  • + limit_errors + (int): limit amount of errors to this number
  • +
+

Schema Class

+

Table Schema is a core Frictionless Data concept: metadata information describing a tabular data source. You can read the Table Schema Standard for more information.

+

Creating Schema

+

Let's create a table schema:

+ +
+
+
from frictionless import Schema, fields, describe
+
+schema = describe('table.csv', type='schema') # from a resource path
+schema = Schema.from_descriptor('schema.json') # from a descriptor path
+schema = Schema.from_descriptor({'fields': [{'name': 'id', 'type': 'integer'}]}) # from a descriptor
+
+ +
+

As you can see, it's possible to create a schema by providing different kinds of sources, whose type will be detected automatically (e.g. whether it's a dict or a path). It's possible to make this step more explicit:

+ +
+
+
from frictionless import Schema, Field
+
+schema = Schema(fields=[fields.StringField(name='id')]) # from fields
+schema = Schema.from_descriptor('schema.json') # from a descriptor
+
+ +
+

Describing Schema

+

The standard supports some additional schema metadata:

+ +
+
+
from frictionless import Schema, fields
+
+schema = Schema(
+    fields=[fields.StringField(name='id')],
+    missing_values=['na'],
+    primary_key=['id'],
+    # foreign_keys
+)
+print(schema)
+
+ +
{'fields': [{'name': 'id', 'type': 'string'}],
+ 'missingValues': ['na'],
+ 'primaryKey': ['id']}
+ +
+

If you have created a schema, for example, from a descriptor, you can access these properties:

+ +
+
+
from frictionless import Schema
+
+schema = Schema.from_descriptor('schema.json')
+print(schema.missing_values)
+# and others
+
+ +
['']
+ +
+

And edit them:

+ +
+
+
from frictionless import Schema
+
+schema = Schema.from_descriptor('schema.json')
+schema.missing_values.append('-')
+# and others
+print(schema)
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'}],
+ 'missingValues': ['', '-']}
+ +
+

Field Management

+

The Schema class provides useful methods to manage fields:

+ +
+
+
from frictionless import Schema, fields
+
+schema = Schema.from_descriptor('schema.json')
+print(schema.fields)
+print(schema.field_names)
+schema.add_field(fields.StringField(name='new-name'))
+field = schema.get_field('new-name')
+print(schema.has_field('new-name'))
+schema.remove_field('new-name')
+
+ +
[{'name': 'id', 'type': 'integer'}, {'name': 'name', 'type': 'string'}]
+['id', 'name']
+True
+ +
+

Saving Descriptor

+

Like any of the Metadata classes, the Schema class can be saved as JSON or YAML:

+ +
+
+
from frictionless import Schema, fields
+schema = Schema(fields=[fields.IntegerField(name='id')])
+schema.to_json('schema.json') # Save as JSON
+schema.to_yaml('schema.yaml') # Save as YAML
+
+ +
+

Reading Cells

+

During the process of data reading, a resource uses a schema to convert data:

+ +
+
+
from frictionless import Schema, fields
+
+schema = Schema(fields=[fields.IntegerField(name='integer'), fields.StringField(name='string')])
+cells, notes = schema.read_cells(['3', 'value'])
+print(cells)
+
+ +
[3, 'value']
+ +
+

Writing Cells

+

During the process of data writing, a resource uses a schema to convert data:

+ +
+
+
from frictionless import Schema, fields
+
+schema = Schema(fields=[fields.IntegerField(name='integer'), fields.StringField(name='string')])
+cells, notes = schema.write_cells([3, 'value'])
+print(cells)
+
+ +
[3, 'value']
+ +
+

Creating Field

+

Let's create a field:

+ +
+
+
from frictionless import fields
+
+field = fields.IntegerField(name='name')
+print(field)
+
+ +
{'name': 'name', 'type': 'integer'}
+ +
+

Usually, we work with fields that were already created by a schema:

+ +
+
+
from frictionless import describe
+
+resource = describe('table.csv')
+field = resource.schema.get_field('id')
+print(field)
+
+ +
{'name': 'id', 'type': 'integer'}
+ +
+

Field Types

+

Frictionless Framework supports all the Table Schema Standard field types along with the ability to create custom types.

+

For some types there are additional properties available:

+ +
+
+
from frictionless import describe
+
+resource = describe('table.csv')
+field = resource.schema.get_field('id') # it's an integer
+print(field.bare_number)
+
+ +
True
+ +
+

See the complete reference at Tabular Fields.

+

Reading Cell

+

During the process of data reading, a schema uses a field internally. If needed, a user can convert their data using this interface:

+ +
+
+
from frictionless import fields
+
+field = fields.IntegerField(name='name')
+cell, note = field.read_cell('3')
+print(cell)
+
+ +
3
+ +
+

Writing Cell

+

During the process of data writing, a schema uses a field internally. The same as with reading, a user can convert their data using this interface:

+ +
+
+
from frictionless import fields
+
+field = fields.IntegerField(name='name')
+cell, note = field.write_cell(3)
+print(cell)
+
+ +
3
+ +
+

Reference

+
+ + +
+

Schema (class)

+

Schema representation.

This class is one of the cornerstones of the Frictionless framework. It allows working with Table Schema and its fields.

```python
schema = Schema('schema.json')
schema.add_field(Field(name='name', type='string'))
```

+

Signature

+

(*, descriptor: Optional[Union[types.IDescriptor, str]] = None, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, fields: List[Field] = NOTHING, missing_values: List[str] = NOTHING, primary_key: List[str] = NOTHING, foreign_keys: List[Dict[str, Any]] = NOTHING) -> None

+

Parameters

+
    +
  • + descriptor + (Optional[Union[types.IDescriptor, str]])
  • +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + fields + (List[Field])
  • +
  • + missing_values + (List[str])
  • +
  • + primary_key + (List[str])
  • +
  • + foreign_keys + (List[Dict[str, Any]])
  • +
+
+ +
+

schema.descriptor (property)

+

+ # TODO: add docs +

+

Signature

+

Optional[Union[types.IDescriptor, str]]

+
+
+

schema.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +

+

Signature

+

Optional[str]

+
+
+

schema.type (property)

+

+ Type of the object +

+

Signature

+

ClassVar[Union[str, None]]

+
+
+

schema.title (property)

+

+ A human-oriented title for the Schema. +

+

Signature

+

Optional[str]

+
+
+

schema.description (property)

+

+ A brief description of the Schema. +

+

Signature

+

Optional[str]

+
+
+

schema.fields (property)

+

+ A List of fields in the schema. +

+

Signature

+

List[Field]

+
+
+

schema.missing_values (property)

+

List of string values to be set as missing values in the schema fields. If any of the strings in missing values is found in a field value, then it is set to None.

+

Signature

+

List[str]

+
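A minimal sketch of this behaviour, using the read_cells interface shown earlier on this page (the expected output in the comment is an assumption):

```python
from frictionless import Schema, fields

schema = Schema(fields=[fields.IntegerField(name='id')], missing_values=['na'])
cells, notes = schema.read_cells(['na'])
print(cells)  # expected: [None], since 'na' is declared as a missing value
```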
+
+

schema.primary_key (property)

+

+ Specifies primary key for the schema. +

+

Signature

+

List[str]

+
+
+

schema.foreign_keys (property)

+

+ Specifies the foreign keys for the schema. +

+

Signature

+

List[Dict[str, Any]]

+
+ +
+

schema.field_names (property)

+

List of field names

+

Signature

+

List[str]

+
+
+

schema.field_types (property)

+

List of field types

+

Signature

+

List[str]

+
+ +
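A short sketch of these two list properties (the values in the comments are the expected, not verified, outputs):

```python
from frictionless import Schema, fields

schema = Schema(fields=[
    fields.IntegerField(name='id'),
    fields.StringField(name='name'),
])
print(schema.field_names)  # expected: ['id', 'name']
print(schema.field_types)  # expected: ['integer', 'string']
```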
+

schema.add_field (method)

+

Add new field to the schema

+

Signature

+

(field: Field, *, position: Optional[int] = None) -> None

+

Parameters

+
    +
  • + field + (Field)
  • +
  • + position + (Optional[int])
  • +
+
+
+

schema.clear_fields (method)

+

Remove all the fields

+

Signature

+

() -> None

+
+
+

Schema.describe (method) (static)

+

Describe the given source as a schema

+

Signature

+

(source: Optional[Any] = None, **options: Any) -> Schema

+

Parameters

+
    +
  • + source + (Optional[Any]): data source
  • +
  • + options + (Any)
  • +
+
+
+

schema.flatten (method)

+

Flatten the schema

Parameters
  spec (str[]): flatten specification

+

Signature

+

(spec: List[str] = [name, type])

+

Parameters

+
    +
  • + spec + (List[str])
  • +
+
+
+

Schema.from_jsonschema (method) (static)

+

Create a Schema from JSONSchema profile

+

Signature

+

(profile: Union[types.IDescriptor, str]) -> Schema

+

Parameters

+
    +
  • + profile + (Union[types.IDescriptor, str]): path or dict with JSONSchema profile
  • +
+
+
+

schema.get_field (method)

+

Get field by name

+

Signature

+

(name: str) -> Field

+

Parameters

+
    +
  • + name + (str)
  • +
+
+
+

schema.has_field (method)

+

Check if a field is present

+

Signature

+

(name: str) -> bool

+

Parameters

+
    +
  • + name + (str)
  • +
+
+
+

schema.read_cells (method)

+

Read a list of cells (normalize/cast)

+

Signature

+

(cells: List[Any])

+

Parameters

+
    +
  • + cells + (List[Any]): list of cells
  • +
+
+
+

schema.remove_field (method)

+

Remove field by name

+

Signature

+

(name: str) -> Field

+

Parameters

+
    +
  • + name + (str)
  • +
+
+
+

schema.set_field (method)

+

Set field by name

+

Signature

+

(field: Field) -> Optional[Field]

+

Parameters

+
    +
  • + field + (Field)
  • +
+
+
+

schema.set_field_type (method)

+

Set field type

+

Signature

+

(name: str, type: str) -> Field

+

Parameters

+
    +
  • + name + (str)
  • +
  • + type + (str)
  • +
+
+
+

schema.to_excel_template (method)

+

Export schema as an excel template

+

Signature

+

(path: str) -> None

+

Parameters

+
    +
  • + path + (str): path of excel file to create with ".xlsx" extension
  • +
+
+
+

schema.to_summary (method)

+

Summary of the schema in table format

+

Signature

+

() -> str

+
+
+

schema.update_field (method)

+

Update field

+

Signature

+

(name: str, descriptor: types.IDescriptor) -> Field

+

Parameters

+
    +
  • + name + (str)
  • +
  • + descriptor + (types.IDescriptor)
  • +
+
+
+

schema.write_cells (method)

+

Write a list of cells (normalize/uncast)

+

Signature

+

(cells: List[Any], *, types: List[str] = [])

+

Parameters

+
    +
  • + cells + (List[Any]): list of cells
  • +
  • + types + (List[str])
  • +
+
+ + +
+

Field (class)

+

Field representation

+

Signature

+

(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None

+

Parameters

+
    +
  • + name + (str)
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + format + (str)
  • +
  • + missing_values + (List[str])
  • +
  • + constraints + (Dict[str, Any])
  • +
  • + rdf_type + (Optional[str])
  • +
  • + example + (Optional[str])
  • +
  • + schema + (Optional[Schema])
  • +
+
+ +
+

field.name (property)

+

+ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +

+

Signature

+

str

+
+
+

field.type (property)

+

+ Type of the field such as "boolean", "integer" etc. +

+

Signature

+

ClassVar[str]

+
+
+

field.title (property)

+

+ A human-oriented title for the Field. +

+

Signature

+

Optional[str]

+
+
+

field.description (property)

+

+ A brief description of the Field. +

+

Signature

+

Optional[str]

+
+
+

field.format (property)

+

+ Format of the field to specify different value readers for the field type. + For example: "default","array" etc. +

+

Signature

+

str

+
+
+

field.missing_values (property)

+

List of string values to be set as missing values in the field. If any of the strings in missing values is found in the field value, then it is set to None.

+

Signature

+

List[str]

+
+
+

field.constraints (property)

+

A dictionary with rules that constrain the data values permitted for the field.

+

Signature

+

Dict[str, Any]

+
+
+

field.rdf_type (property)

+

RDF type. Specifies the RDF type of the field.

+

Signature

+

Optional[str]

+
+
+

field.example (property)

+

+ An example of a value for the field. +

+

Signature

+

Optional[str]

+
+
+

field.schema (property)

+

The Schema of which the field is a part.

+

Signature

+

Optional[Schema]

+
+
+

field.builtin (property)

+

Specifies whether the field is a builtin feature.

+

Signature

+

ClassVar[bool]

+
+
+

field.supported_constraints (property)

+

+ List of supported constraints for a field. +

+

Signature

+

ClassVar[List[str]]

+
+ +
+

field.required (property)

+

Indicates if field is mandatory.

+

Signature

+

bool

+

Table Classes

+

Table Header

+

After opening a resource you get access to a resource.header object which describes the resource in more detail. This is a list of normalized labels but also provides some additional functionality. Let's take a look:

+ +
+
+
from frictionless import Resource
+
+with Resource('capital-3.csv') as resource:
+  print(f'Header: {resource.header}')
+  print(f'Labels: {resource.header.labels}')
+  print(f'Fields: {resource.header.fields}')
+  print(f'Field Names: {resource.header.field_names}')
+  print(f'Field Numbers: {resource.header.field_numbers}')
+  print(f'Errors: {resource.header.errors}')
+  print(f'Valid: {resource.header.valid}')
+  print(f'As List: {resource.header.to_list()}')
+
+ +
Header: ['id', 'name']
+Labels: ['id', 'name']
+Fields: [{'name': 'id', 'type': 'integer'}, {'name': 'name', 'type': 'string'}]
+Field Names: ['id', 'name']
+Field Numbers: [1, 2]
+Errors: []
+Valid: True
+As List: ['id', 'name']
+ +
+

The example above shows a case when a header is valid. For a header that contains errors in its tabular structure, this information can be very useful, revealing discrepancies, duplicates or missing cell information:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+with Resource([['name', 'name'], ['value', 'value']]) as resource:
+    pprint(resource.header.errors)
+
+ +
[{'type': 'duplicate-label',
+ 'title': 'Duplicate Label',
+ 'description': 'Two columns in the header row have the same value. Column '
+                'names should be unique.',
+ 'message': 'Label "name" in the header at position "2" is duplicated to a '
+            'label: at position "1"',
+ 'tags': ['#table', '#header', '#label'],
+ 'note': 'at position "1"',
+ 'labels': ['name', 'name'],
+ 'rowNumbers': [1],
+ 'label': 'name',
+ 'fieldName': 'name2',
+ 'fieldNumber': 2}]
+ +
+

Table Row

+

The extract, resource.read_rows() and other functions return or yield row objects. In Python, this returns a dictionary with the following information. Note: this example uses the Detector object, which tweaks how different aspects of metadata are detected.

+ +
+
+
from frictionless import Resource, Detector
+
+detector = Detector(schema_patch={'missingValues': ['1']})
+with Resource('capital-3.csv', detector=detector) as resource:
+  for row in resource.row_stream:
+    print(f'Row: {row}')
+    print(f'Cells: {row.cells}')
+    print(f'Fields: {row.fields}')
+    print(f'Field Names: {row.field_names}')
+    print(f'Value of field "name": {row["name"]}') # accessed as a dict
+    print(f'Row Number: {row.row_number}') # counted row number starting from 1
+    print(f'Blank Cells: {row.blank_cells}')
+    print(f'Error Cells: {row.error_cells}')
+    print(f'Errors: {row.errors}')
+    print(f'Valid: {row.valid}')
+    print(f'As Dict: {row.to_dict(json=False)}')
+    print(f'As List: {row.to_list(json=True)}') # JSON compatible data types
+    break
+
+ +
Row: {'id': None, 'name': 'London'}
+Cells: ['1', 'London']
+Fields: [{'name': 'id', 'type': 'integer'}, {'name': 'name', 'type': 'string'}]
+Field Names: ['id', 'name']
+Value of field "name": London
+Row Number: 2
+Blank Cells: {'id': '1'}
+Error Cells: {}
+Errors: []
+Valid: True
+As Dict: {'id': None, 'name': 'London'}
+As List: [None, 'London']
+ +
+

As we can see, this output provides a lot of information which is especially useful when a row is not valid. Our row is valid but we demonstrated how it can preserve data about missing values. It also preserves data about all cells that contain errors:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+with Resource([['name'], ['value', 'value']]) as resource:
+    for row in resource.row_stream:
+        pprint(row.errors)
+
+ +
[{'type': 'extra-cell',
+ 'title': 'Extra Cell',
+ 'description': 'This row has more values compared to the header row (the '
+                'first row in the data source). A key concept is that all the '
+                'rows in tabular data must have the same number of columns.',
+ 'message': 'Row at position "2" has an extra value in field at position "2"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': '',
+ 'cells': ['value', 'value'],
+ 'rowNumber': 2,
+ 'cell': 'value',
+ 'fieldName': '',
+ 'fieldNumber': 2}]
+ +
+

Reference

+
+ + +
+

Header (class)

+

Header representation.

> Constructor of this object is not Public API

+

Signature

+

(labels: List[str], *, fields: List[Field], row_numbers: List[int], ignore_case: bool = False)

+

Parameters

+
    +
  • + labels + (List[str]): header row labels
  • +
  • + fields + (List[Field]): table fields
  • +
  • + row_numbers + (List[int]): row numbers
  • +
  • + ignore_case + (bool): ignore case
  • +
+
+ + + +
+

header.to_list (method)

+

Convert to a list

+
+
+

header.to_str (method)

+

+
+ + +
+

Row (class)

+

Row representation.

> Constructor of this object is not Public API

This object is returned by `extract`, `resource.read_rows`, and other functions.

```python
rows = extract("data/table.csv")
for row in rows:
    pass  # work with the Row
```

+

Signature

+

(cells: List[Any], *, field_info: Dict[str, Any], row_number: int)

+

Parameters

+
    +
  • + cells + (List[Any]): array of cells
  • +
  • + field_info + (Dict[str, Any]): special field info structure
  • +
  • + row_number + (int): row number from 1
  • +
+
+ + + +
+

row.to_dict (method)

+

+

Signature

+

(*, csv: bool = False, json: bool = False, types: Optional[List[str]] = None) -> Dict[str, Any]

+

Parameters

+
    +
  • + csv + (bool)
  • +
  • + json + (bool): make data types compatible with JSON format
  • +
  • + types + (Optional[List[str]])
  • +
+
+
+

row.to_list (method)

+

+

Signature

+

(*, json: bool = False, types: Optional[List[str]] = None)

+

Parameters

+
    +
  • + json + (bool): make data types compatible with JSON format
  • +
  • + types + (Optional[List[str]]): list of supported types
  • +
+
+
+

row.to_str (method)

+

+

Signature

+

(**options: Any)

+

Parameters

+
    +
  • + options + (Any)
  • +
+

Getting Started

+

Let's get started with Frictionless! We will learn how to install and use the framework. The simple example below will showcase the framework's basic functionality.

+

Installation

+
+

The framework requires Python 3.8+. Versioning follows the SemVer Standard.

+
+ +
+
+
pip install frictionless
+pip install frictionless[sql] # to install a core plugin (optional)
+pip install 'frictionless[sql]' # for zsh shell
+
+ +
+

The framework supports CSV, Excel, and JSON formats by default. The second command above installs a plugin for SQL support. There are plugins for SQL, Pandas, HTML, and others (all supported formats are listed in the "File Formats" menu and schemes in the "File Schemes" menu). Usually, you don't need to think about it in advance: Frictionless will display a useful error message about a missing plugin with installation instructions.

+

Troubleshooting

+

Did you have an error installing Frictionless? Here are some dependencies and common errors:

+ +

Still having a problem? Ask us for help on our Discord chat or open an issue. We're happy to help!

+

Usage

+

The framework can be used:

+
    +
  • as a Python library
  • +
  • as a command-line interface
  • +
+

For instance, both examples below do the same thing:

+ +
+
+
frictionless extract data/table.csv
+
+ +
+
+
from frictionless import extract
+
+rows = extract('data/table.csv')
+
+ +
+

The interfaces are as much alike as possible regarding naming conventions and the way you interact with them. Usually, it's straightforward to translate, for instance, Python code to a command-line call. Frictionless provides code completion for Python and the command-line, which should help to get useful hints in real time.

+

Arguments conform to the following naming convention:

+
    +
  • for Python interfaces, they are snake_cased, e.g. missing_values
  • +
  • within dictionaries or JSON objects, they are camelCased, e.g. missingValues
  • +
  • in the command line, they use dashes, e.g. --missing-values
  • +
+

To get the documentation for a command-line interface just use the --help flag:

+ +
+
+
frictionless --help
+frictionless describe --help
+frictionless extract --help
+frictionless validate --help
+frictionless transform --help
+
+ +
+

Example

+
+

Download invalid.csv to reproduce the examples (right-click and "Save link as"). For more examples, please take a look at the Basic Examples article.

+
+

We will take a very messy data file:

+ +
+
+
cat invalid.csv
+
+ +
id,name,,name
+1,english
+1,english
+
+2,german,1,2,3
+ +
+
+
with open('invalid.csv') as file:
+    print(file.read())
+
+ +
id,name,,name
+1,english
+1,english
+
+2,german,1,2,3
+ +
+

First of all, let's use describe to infer the metadata directly from the tabular data. We can then edit and save it to provide others with useful information about the data:

+
+

The CLI output is in YAML; it is the default Frictionless output format.

+
+ +
+
+
frictionless describe invalid.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+             dataset
+┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┓
+┃ name    ┃ type  ┃ path        ┃
+┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━┩
+│ invalid │ table │ invalid.csv │
+└─────────┴───────┴─────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                invalid
+┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
+┃ id      ┃ name   ┃ field3  ┃ name2   ┃
+┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
+│ integer │ string │ integer │ integer │
+└─────────┴────────┴─────────┴─────────┘
+ +
+
+
from pprint import pprint
+from frictionless import describe
+
+resource = describe('invalid.csv')
+pprint(resource)
+
+ +
{'name': 'invalid',
+ 'type': 'table',
+ 'path': 'invalid.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv',
+ 'encoding': 'utf-8',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                       {'name': 'name', 'type': 'string'},
+                       {'name': 'field3', 'type': 'integer'},
+                       {'name': 'name2', 'type': 'integer'}]}}
+ +
+

Now that we have inferred a table schema from the data file (e.g., expected format of the table, expected type of each value in a column, etc.), we can use extract to read the normalized tabular data from the source CSV file:

+ +
+
+
frictionless extract invalid.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+             dataset
+┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┓
+┃ name    ┃ type  ┃ path        ┃
+┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━┩
+│ invalid │ table │ invalid.csv │
+└─────────┴───────┴─────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+              invalid
+┏━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
+┃ id   ┃ name    ┃ field3 ┃ name2 ┃
+┡━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
+│ 1    │ english │ None   │ None  │
+│ 1    │ english │ None   │ None  │
+│ None │ None    │ None   │ None  │
+│ 2    │ german  │ 1      │ 2     │
+└──────┴─────────┴────────┴───────┘
+ +
+
+
from pprint import pprint
+from frictionless import extract
+
+rows = extract('invalid.csv')
+pprint(rows)
+
+ +
{'invalid': [{'field3': None, 'id': 1, 'name': 'english', 'name2': None},
+             {'field3': None, 'id': 1, 'name': 'english', 'name2': None},
+             {'field3': None, 'id': None, 'name': None, 'name2': None},
+             {'field3': 1, 'id': 2, 'name': 'german', 'name2': 2}]}
+ +
+

Last but not least, let's get a validation report. This report will help us to identify and fix all the errors present in the tabular data, as comprehensive information is provided for every problem:

+ +
+
+
frictionless validate invalid.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+                  dataset
+┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name    ┃ type  ┃ path        ┃ status  ┃
+┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ invalid │ table │ invalid.csv │ INVALID │
+└─────────┴───────┴─────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                                    invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row  ┃ Field ┃ Type            ┃ Message                                     ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3     │ blank-label     │ Label in the header in field at position    │
+│      │       │                 │ "3" is blank                                │
+│ None │ 4     │ duplicate-label │ Label "name" in the header at position "4"  │
+│      │       │                 │ is duplicated to a label: at position "2"   │
+│ 2    │ 3     │ missing-cell    │ Row at position "2" has a missing cell in   │
+│      │       │                 │ field "field3" at position "3"              │
+│ 2    │ 4     │ missing-cell    │ Row at position "2" has a missing cell in   │
+│      │       │                 │ field "name2" at position "4"               │
+│ 3    │ 3     │ missing-cell    │ Row at position "3" has a missing cell in   │
+│      │       │                 │ field "field3" at position "3"              │
+│ 3    │ 4     │ missing-cell    │ Row at position "3" has a missing cell in   │
+│      │       │                 │ field "name2" at position "4"               │
+│ 4    │ None  │ blank-row       │ Row at position "4" is completely blank     │
+│ 5    │ 5     │ extra-cell      │ Row at position "5" has an extra value in   │
+│      │       │                 │ field at position "5"                       │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+ +
+
+
from pprint import pprint
+from frictionless import validate
+
+report = validate('invalid.csv')
+pprint(report.flatten(["rowNumber", "fieldNumber", "type"]))
+
+ +
[[None, 3, 'blank-label'],
+ [None, 4, 'duplicate-label'],
+ [2, 3, 'missing-cell'],
+ [2, 4, 'missing-cell'],
+ [3, 3, 'missing-cell'],
+ [3, 4, 'missing-cell'],
+ [4, None, 'blank-row'],
+ [5, 5, 'extra-cell']]
+ +
+

Now that we have all this information:

+
    +
  • we can clean up the table to ensure the data quality
  • +
  • we can use the metadata to describe and share the dataset
  • +
  • we can include the validation into our workflow to guarantee the validity
  • +
  • and much more: don't hesitate and read the following sections of the documentation!
  • +
+ + +

Describing Data

+
+

This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start. Also, this guide is meant to be read in order from top to bottom, and reuses examples throughout the text. You can use the menu to skip sections, but please note that you might need to run code from earlier sections to make all the examples work.

+
+

In Frictionless terms, "Describing data" means creating metadata for your data files. Having metadata is important because data files by themselves usually do not provide enough information to fully understand the data. For example, if you have a data table in a CSV format without metadata, you are missing a few critical pieces of information:

+
    +
  • the meaning of the fields e.g., what the size field means (does that field mean geographic size? Or does it refer to the size of the file?)
  • +
  • data type information e.g., is this field a string or an integer?
  • +
  • data constraints e.g., the minimum temperature for your measurements
  • +
  • data relations e.g., identifier connections
  • +
  • and others
  • +
+

For a dataset, there is even more information that can be provided, like the general purpose of a dataset, information about data sources, list of authors, and more. Also, when there are many tabular files, relational rules can be very important. Usually, there are foreign keys ensuring the integrity of the dataset; for example, think of a reference table containing country names and other data tables using it as a reference. Data in this form is called "normalized data" and it occurs very often in scientific and other kinds of research.

+

Now that we have a general understanding of what "describing data" is, we can discuss why it is important:

+
    +
  • data validation: metadata helps to reveal problems in your data during early stages of your workflow
  • +
  • data publication: metadata provides additional information that your data doesn't include
  • +
+

These are not the only positives of having metadata, but they are two of the most important. Please continue reading to learn how Frictionless helps to achieve these advantages by describing your data. This guide will discuss the main describe functions (describe, Schema.describe, Resource.describe, Package.describe) and will then go into more detail about how to create and edit metadata in Frictionless.

+

For the following examples, you will need to have Frictionless installed. See our Quick Start Guide if you need help.

+ +
+
+
pip install frictionless
+
+ +
+

Describe Functions

+

The describe functions are the main Frictionless tool for describing data. In many cases, this high-level interface is enough for data exploration and other needs.

+

The frictionless framework provides 4 different describe functions in Python:

+
    +
  • describe: detects the source type and returns Data Resource or Data Package metadata
  • +
  • Schema.describe: always returns Table Schema metadata
  • +
  • Resource.describe: always returns Data Resource metadata
  • +
  • Package.describe: always returns Data Package metadata
  • +
+

As described in more detail in the Introduction, a resource is a single file, such as a data file, and a package is a set of files, such as a data file and a schema.

+
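As a sketch, the four Python functions side by side (assuming a local table.csv like the one used below):

```python
from frictionless import Schema, Resource, Package, describe

metadata = describe('table.csv')           # detects the source type automatically
schema = Schema.describe('table.csv')      # always Table Schema metadata
resource = Resource.describe('table.csv')  # always Data Resource metadata
package = Package.describe('table.csv')    # always Data Package metadata
```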

In the command-line, there is only 1 command (describe) but there is also a flag to adjust the behavior:

+ +
+
+
frictionless describe your-table.csv
+frictionless describe your-table.csv --type schema
+frictionless describe your-table.csv --type resource
+frictionless describe your-table.csv --type package
+
+ +
+

Please take into account that file names might be used by Frictionless to detect a metadata type for data extraction or validation. It's recommended to use corresponding suffixes when you save your metadata to the disk. For example, you might name your Table Schema as table.schema.yaml, Data Resource as table.resource.yaml, and Data Package as table.package.yaml. If there is no hint in the file name Frictionless will assume that it's a resource descriptor by default.

+

For example, if we want a Data Package descriptor for a single file:

+
+

Download table.csv to reproduce the examples (right-click and "Save link as").

+
+ +
+
+
frictionless describe table.csv --type package
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+           dataset
+┏━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
+┃ name  ┃ type  ┃ path      ┃
+┡━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
+│ table │ table │ table.csv │
+└───────┴───────┴───────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+       table
+┏━━━━━━━━━┳━━━━━━━━┓
+┃ id      ┃ name   ┃
+┡━━━━━━━━━╇━━━━━━━━┩
+│ integer │ string │
+└─────────┴────────┘
+ +
+
+
from frictionless import describe
+
+package = describe("table.csv", type="package")
+print(package.to_yaml())
+
+ +
resources:
+  - name: table
+    type: table
+    path: table.csv
+    scheme: file
+    format: csv
+    mediatype: text/csv
+    encoding: utf-8
+    schema:
+      fields:
+        - name: id
+          type: integer
+        - name: name
+          type: string
+ +
+

Describing a Schema

+

Table Schema is a specification for providing a "schema" (similar to a database schema) for tabular data. This information includes the expected data type for each value in a column ("string", "number", "date", etc.), constraints on the value ("this string can only be at most 10 characters long"), and the expected format of the data ("this field should only contain strings that look like email addresses"). Table Schema can also specify relations between data tables.

+

We're going to use this file for the examples in this section. For this guide, we only use CSV files because of their demonstrativeness, but in general Frictionless can handle data in Excel, JSON, SQL, and many other formats:

+
+

Download country-1.csv to reproduce the examples (right-click and "Save link as").

+
+ +
+
+
cat country-1.csv
+
+ +
id,neighbor_id,name,population
+1,,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,5,Italy,60
+5,4,Spain,47
+ +
+
+
with open('country-1.csv') as file:
+    print(file.read())
+
+ +
id,neighbor_id,name,population
+1,,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,5,Italy,60
+5,4,Spain,47
+ +
+

Let's get a Table Schema using the Frictionless framework (note: this example uses YAML for the schema format, but Frictionless also supports JSON format):

+ +
+
+
from frictionless import Schema
+
+schema = Schema.describe("country-1.csv")
+schema.to_yaml("country.schema.yaml") # use schema.to_json for JSON
+
+ +
+

The high-level functions of Frictionless operate on the dataset and resource levels so we have to use a little bit of Python programming to get the schema information. Below we will show how to use a command-line interface for similar tasks.

+ +
+
+
cat country.schema.yaml
+
+ +
fields:
+  - name: id
+    type: integer
+  - name: neighbor_id
+    type: integer
+  - name: name
+    type: string
+  - name: population
+    type: integer
+ +
+
+
with open('country.schema.yaml') as file:
+    print(file.read())
+
+ +
fields:
+  - name: id
+    type: integer
+  - name: neighbor_id
+    type: integer
+  - name: name
+    type: string
+  - name: population
+    type: integer
+ +
+

As we can see, we were able to infer basic metadata from our data file. But describing data doesn't end here - we can provide additional information that we discussed earlier:

+
+

You can edit "country.schema.yaml" manually instead of running Python

+
+ +
+
+
from frictionless import Schema
+
+schema = Schema.describe("country-1.csv")
+schema.get_field("id").title = "Identifier"
+schema.get_field("neighbor_id").title = "Identifier of the neighbor"
+schema.get_field("name").title = "Name of the country"
+schema.get_field("population").title = "Population"
+schema.get_field("population").description = "According to the year 2020's data"
+schema.get_field("population").constraints["minimum"] = 0
+schema.foreign_keys.append(
+    {"fields": ["neighbor_id"], "reference": {"resource": "", "fields": ["id"]}}
+)
+schema.to_yaml("country.schema-full.yaml")
+
+ +
+

Let's break it down:

+
    +
  • we added a title for all the fields
  • +
  • we added a description to the "Population" field; the year information can be critical to interpret the data
  • +
  • we set a constraint to the "Population" field because it can't be less than 0
  • +
  • we added a foreign key saying that "Identifier of the neighbor" should be present in the "Identifier" field
  • +
+ +
+
+
cat country.schema-full.yaml
+
+ +
fields:
+  - name: id
+    type: integer
+    title: Identifier
+  - name: neighbor_id
+    type: integer
+    title: Identifier of the neighbor
+  - name: name
+    type: string
+    title: Name of the country
+  - name: population
+    type: integer
+    title: Population
+    description: According to the year 2020's data
+    constraints:
+      minimum: 0
+foreignKeys:
+  - fields:
+      - neighbor_id
+    reference:
+      resource: ''
+      fields:
+        - id
+ +
+
+
with open('country.schema-full.yaml') as file:
+    print(file.read())
+
+ +
fields:
+  - name: id
+    type: integer
+    title: Identifier
+  - name: neighbor_id
+    type: integer
+    title: Identifier of the neighbor
+  - name: name
+    type: string
+    title: Name of the country
+  - name: population
+    type: integer
+    title: Population
+    description: According to the year 2020's data
+    constraints:
+      minimum: 0
+foreignKeys:
+  - fields:
+      - neighbor_id
+    reference:
+      resource: ''
+      fields:
+        - id
+ +
+

Later we're going to show how to use the schema we created to ensure the validity of your data; in the next few sections, we will focus on Data Resource and Data Package metadata.

+

To continue learning about table schemas please read:

+ +

Describing a Resource

+

The Data Resource format describes a data resource such as an individual file or data table. The essence of a Data Resource is a path to the data file it describes. A range of other properties can be declared to provide a richer set of metadata, including Table Schema for tabular data.

+

For this section, we will use a file that is slightly more complex to handle. In this example, cells are separated by the ";" character and there is a comment on the top:

+
+

Download country-2.csv to reproduce the examples (right-click and "Save link as").

+
+ +
+
+
cat country-2.csv
+
+ +
# Author: the scientist
+id;neighbor_id;name;population
+1;;Britain;67
+2;3;France;67
+3;2;Germany;83
+4;5;Italy;60
+5;4;Spain;47
+ +
+
+
with open('country-2.csv') as file:
+    print(file.read())
+
+ +
# Author: the scientist
+id;neighbor_id;name;population
+1;;Britain;67
+2;3;France;67
+3;2;Germany;83
+4;5;Italy;60
+5;4;Spain;47
+ +
+

Let's describe it:

+ +
+
+
frictionless describe country-2.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ country-2 │ table │ country-2.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+         country-2
+┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ # Author: the scientist ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ string                  │
+└─────────────────────────┘
+ +
+
+
from frictionless import describe
+
+resource = describe('country-2.csv')
+print(resource.to_yaml())
+
+ +
name: country-2
+type: table
+path: country-2.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+schema:
+  fields:
+    - name: '# Author: the scientist'
+      type: string
+ +
+

OK, that looks wrong -- for example, the schema has only inferred one field, and that field does not seem correct either. As we have seen in the "Introductory Guide" Frictionless is capable of inferring some complicated cases' metadata but our data table is too complex for it to automatically infer. We need to manually program it:

+
+

You can edit "country.resource.yaml" manually instead of running Python

+
+ +
+
+
from frictionless import Schema, describe
+
+resource = describe("country-2.csv")
+resource.dialect.header_rows = [2]
+resource.dialect.get_control('csv').delimiter = ";"
+resource.schema = "country.schema.yaml"
+resource.to_yaml("country.resource-cleaned.yaml")
+
+ +
+

So what we did here:

+
    +
  • we set the header rows to be row number 2; as humans, we can easily see that was the proper row
  • +
  • we set the CSV Delimiter to be ";"
  • +
  • we reuse the schema we created earlier as the data has the same structure and meaning
  • +
+ +
+
+
cat country.resource-cleaned.yaml
+
+ +
name: country-2
+type: table
+path: country-2.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+dialect:
+  headerRows:
+    - 2
+  csv:
+    delimiter: ;
+schema: country.schema.yaml
+ +
+
+
with open('country.resource-cleaned.yaml') as file:
+    print(file.read())
+
+ +
name: country-2
+type: table
+path: country-2.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+dialect:
+  headerRows:
+    - 2
+  csv:
+    delimiter: ;
+schema: country.schema.yaml
+ +
+

Our resource metadata includes the schema metadata we created earlier, but it also has:

+
  • general information about the file's scheme, format, and encoding
  • information about the CSV dialect, which helps software understand how to read it
  • checksum information like hash, bytes, and rows (when stats are inferred)

But the most important difference is that the resource metadata contains the path property. This is a conceptual distinction of the Data Resource specification compared to the Table Schema specification. While a Table Schema descriptor can describe a class of data files, a Data Resource descriptor describes only one exact data file, country-2.csv in our case.

+

Using programming terminology we could say that:

+
  • a Table Schema descriptor is abstract (it describes a class of files)
  • a Data Resource descriptor is concrete (it describes an individual file; see the sketch after this list)
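To make this distinction concrete, here is a minimal sketch that reuses one abstract schema descriptor for two concrete resources; the second file name is hypothetical, standing in for any file with the same layout:

from frictionless import Resource

# one abstract Table Schema descriptor shared by two concrete Data Resources
resource_a = Resource('country-2.csv')
resource_a.schema = 'country.schema.yaml'
resource_b = Resource('another-country.csv')  # hypothetical file, same layout
resource_b.schema = 'country.schema.yaml'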

We will show the practical difference in the "Metadata Importance" section, but in the next section, we will give an overview of the Data Package specification.

+

To continue learning about data resources please read:

+ +

Describing a Package

+

A Data Package consists of:

+
  • Metadata that describes the structure and contents of the package
  • Resources such as data files that form the contents of the package

The Data Package metadata is stored in a "descriptor". This descriptor is what makes a collection of data a Data Package. The structure of this descriptor is the main content of the specification below.
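For orientation, here is a minimal sketch of such a descriptor built in Python (the package and resource names are illustrative):

from frictionless import Package

# a minimal descriptor: a name plus a list of resources
package = Package.from_descriptor({
    'name': 'country-dataset',
    'resources': [{'name': 'country', 'path': 'country-3.csv'}],
})
print(package.to_yaml())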

+

In addition to this descriptor, a data package will include other resources such as data files. The Data Package specification does NOT impose any requirements on their form or structure and can, therefore, be used for packaging any kind of data.

+

The data included in the package may be provided as:

+
  • Files bundled locally with the package descriptor
  • Remote resources, referenced by URL (see the schemes tutorial for more information about supported URLs)
  • "Inline" data, which is included directly in the descriptor (a short sketch follows this list)
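As a quick sketch of the inline case (the resource name here is illustrative):

from frictionless import Resource

# "inline" data is embedded directly in the descriptor instead of a file
resource = Resource(name='capital-inline', data=[['id', 'name'], [1, 'London'], [2, 'Berlin']])
print(resource.read_rows())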

For this section, we will use the following files:

+
+

Download country-3.csv to reproduce the examples (right-click and "Save link as")

+
+ +
+
+
cat country-3.csv
+
+ +
id,capital_id,name,population
+1,1,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,5,Italy,60
+5,4,Spain,47
+ +
+
+
with open('country-3.csv') as file:
+    print(file.read())
+
+ +
id,capital_id,name,population
+1,1,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,5,Italy,60
+5,4,Spain,47
+ +
+
+

Download capital-3.csv to reproduce the examples (right-click and "Save link as").

+
+ +
+
+
cat capital-3.csv
+
+ +
id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+ +
+
+
with open('capital-3.csv') as file:
+    print(file.read())
+
+ +
id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+ +
+

First, let's describe our package. We did this before for a single resource, but now we will use a glob pattern to indicate that there are multiple files:

+ +
+
+
frictionless describe *-3.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ capital-3 │ table │ capital-3.csv │
+│ country-3 │ table │ country-3.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+     capital-3
+┏━━━━━━━━━┳━━━━━━━━┓
+┃ id      ┃ name   ┃
+┡━━━━━━━━━╇━━━━━━━━┩
+│ integer │ string │
+└─────────┴────────┘
+                  country-3
+┏━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id      ┃ capital_id ┃ name   ┃ population ┃
+┡━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩
+│ integer │ integer    │ string │ integer    │
+└─────────┴────────────┴────────┴────────────┘
+ +
+
+
from frictionless import describe
+
+package = describe("*-3.csv")
+print(package.to_yaml())
+
+ +
resources:
+  - name: capital-3
+    type: table
+    path: capital-3.csv
+    scheme: file
+    format: csv
+    mediatype: text/csv
+    encoding: utf-8
+    schema:
+      fields:
+        - name: id
+          type: integer
+        - name: name
+          type: string
+  - name: country-3
+    type: table
+    path: country-3.csv
+    scheme: file
+    format: csv
+    mediatype: text/csv
+    encoding: utf-8
+    schema:
+      fields:
+        - name: id
+          type: integer
+        - name: capital_id
+          type: integer
+        - name: name
+          type: string
+        - name: population
+          type: integer
+ +
+

We have already learned about many concepts that are reflected in this metadata. We can see resources, schemas, fields, and other familiar entities. The difference is that this descriptor has information about multiple files, which is a popular way of sharing data - as datasets. Very often you have not only one data file but several data files, along with textual documents (e.g. PDFs) and other materials. To package all of these files with the corresponding metadata we use data packages.

+

Following the pattern that is already familiar to the guide reader, we add some additional metadata:

+
+

You can edit "country.package.yaml" manually instead of running Python

+
+ +
+
+
from frictionless import describe
+
+package = describe("*-3.csv")
+package.title = "Countries and their capitals"
+package.description = "The data was collected as a research project"
+package.get_resource("country-3").name = "country"
+package.get_resource("capital-3").name = "capital"
+package.get_resource("country").schema.foreign_keys.append(
+    {"fields": ["capital_id"], "reference": {"resource": "capital", "fields": ["id"]}}
+)
+package.to_yaml("country.package.yaml")
+
+ +
+

In this case, we add a relation between different files connecting id and capital_id. Also, we provide dataset-level metadata to explain the purpose of this dataset. We haven't added individual fields' titles and descriptions, but that can be done as it was shown in the "Table Schema" section.

+ +
+
+
cat country.package.yaml
+
+ +
title: Countries and their capitals
+description: The data was collected as a research project
+resources:
+  - name: capital
+    type: table
+    path: capital-3.csv
+    scheme: file
+    format: csv
+    mediatype: text/csv
+    encoding: utf-8
+    schema:
+      fields:
+        - name: id
+          type: integer
+        - name: name
+          type: string
+  - name: country
+    type: table
+    path: country-3.csv
+    scheme: file
+    format: csv
+    mediatype: text/csv
+    encoding: utf-8
+    schema:
+      fields:
+        - name: id
+          type: integer
+        - name: capital_id
+          type: integer
+        - name: name
+          type: string
+        - name: population
+          type: integer
+      foreignKeys:
+        - fields:
+            - capital_id
+          reference:
+            resource: capital
+            fields:
+              - id
+ +
+
+
with open('country.package.yaml') as file:
+    print(file.read())
+
+ +
title: Countries and their capitals
+description: The data was collected as a research project
+resources:
+  - name: capital
+    type: table
+    path: capital-3.csv
+    scheme: file
+    format: csv
+    mediatype: text/csv
+    encoding: utf-8
+    schema:
+      fields:
+        - name: id
+          type: integer
+        - name: name
+          type: string
+  - name: country
+    type: table
+    path: country-3.csv
+    scheme: file
+    format: csv
+    mediatype: text/csv
+    encoding: utf-8
+    schema:
+      fields:
+        - name: id
+          type: integer
+        - name: capital_id
+          type: integer
+        - name: name
+          type: string
+        - name: population
+          type: integer
+      foreignKeys:
+        - fields:
+            - capital_id
+          reference:
+            resource: capital
+            fields:
+              - id
+ +
+

The main role of the Data Package descriptor is describing a dataset; as we can see, it includes previously shown descriptors like schema, dialect, and resource. But it would be a mistake to think that Data Package is the least important specification; actually, it completes the Frictionless Data suite, making it possible to share and validate not only individual files but also complete datasets.
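As a short sketch of that idea, assuming the country.package.yaml descriptor created above is on disk, the whole dataset can be validated in one call:

from frictionless import validate

# validates every resource in the dataset, including the foreign key
# between the country and capital tables
report = validate('country.package.yaml')
print(report.valid)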

+

To continue learning about data packages please read:

+ +

Metadata Importance

+

This documentation contains a great deal of information on how to use metadata and why it's vital for your data. In this section, we're going to provide a quick example based on the "Data Resource" section but please read other documents to get the full picture.

+

Let's get back to this complex data table:

+ +
+
+
cat country-2.csv
+
+ +
# Author: the scientist
+id;neighbor_id;name;population
+1;;Britain;67
+2;3;France;67
+3;2;Germany;83
+4;5;Italy;60
+5;4;Spain;47
+ +
+
+
with open('country-2.csv') as file:
+    print(file.read())
+
+ +
# Author: the scientist
+id;neighbor_id;name;population
+1;;Britain;67
+2;3;France;67
+3;2;Germany;83
+4;5;Italy;60
+5;4;Spain;47
+ +
+

As we tried before, by default Frictionless can't properly describe this file so we got something like:

+ +
+
+
frictionless describe country-2.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ country-2 │ table │ country-2.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+         country-2
+┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ # Author: the scientist ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ string                  │
+└─────────────────────────┘
+ +
+
+
from frictionless import describe
+
+resource = describe("country-2.csv")
+print(resource.to_yaml())
+
+ +
name: country-2
+type: table
+path: country-2.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+schema:
+  fields:
+    - name: '# Author: the scientist'
+      type: string
+ +
+

Trying to extract the data produces equally mangled results:

+ +
+
+
frictionless extract country-2.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ country-2 │ table │ country-2.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+            country-2
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ # Author: the scientist        ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ id;neighbor_id;name;population │
+│ 1;;Britain;67                  │
+│ 2;3;France;67                  │
+│ 3;2;Germany;83                 │
+│ 4;5;Italy;60                   │
+│ 5;4;Spain;47                   │
+└────────────────────────────────┘
+ +
+
+
from pprint import pprint
+from frictionless import extract
+
+rows = extract("country-2.csv")
+pprint(rows)
+
+ +
{'country-2': [{'# Author: the scientist': 'id;neighbor_id;name;population'},
+               {'# Author: the scientist': '1;;Britain;67'},
+               {'# Author: the scientist': '2;3;France;67'},
+               {'# Author: the scientist': '3;2;Germany;83'},
+               {'# Author: the scientist': '4;5;Italy;60'},
+               {'# Author: the scientist': '5;4;Spain;47'}]}
+ +
+

This example highlights a really important idea: without metadata, many programs will not even be able to read this data file correctly. Furthermore, without metadata people cannot understand the purpose of this data. To see how we can use metadata to fix our data, let's now use the country.resource-cleaned.yaml file we created in the "Data Resource" section with Frictionless extract:

+ +
+
+
frictionless extract country.resource-cleaned.yaml
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ country-2 │ table │ country-2.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                 country-2
+┏━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ neighbor_id ┃ name    ┃ population ┃
+┡━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
+│ 1  │ None        │ Britain │ 67         │
+│ 2  │ 3           │ France  │ 67         │
+│ 3  │ 2           │ Germany │ 83         │
+│ 4  │ 5           │ Italy   │ 60         │
+│ 5  │ 4           │ Spain   │ 47         │
+└────┴─────────────┴─────────┴────────────┘
+ +
+
+
from pprint import pprint
+from frictionless import extract
+
+rows = extract("country.resource-cleaned.yaml")
+pprint(rows)
+
+ +
{'country-2': [{'id': 1,
+                'name': 'Britain',
+                'neighbor_id': None,
+                'population': 67},
+               {'id': 2, 'name': 'France', 'neighbor_id': 3, 'population': 67},
+               {'id': 3, 'name': 'Germany', 'neighbor_id': 2, 'population': 83},
+               {'id': 4, 'name': 'Italy', 'neighbor_id': 5, 'population': 60},
+               {'id': 5, 'name': 'Spain', 'neighbor_id': 4, 'population': 47}]}
+ +
+

As we can see, the data is now fixed. The metadata we saved earlier saved the day! If we explore this data in Python, we discover that it also corrected the data types - e.g. id is a Python integer, not a string. We can now export and share this data without any worries.
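Here is a quick check of that point, reusing the descriptor created above:

from frictionless import extract

# id comes back as a Python int, not a str, because the schema casts it
rows = extract('country.resource-cleaned.yaml')
first = rows['country-2'][0]
print(type(first['id']), type(first['name']))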

+

Inferring Metadata

+
+

Many of the Frictionless Framework's classes are metadata classes, such as Schema, Resource, or Package. All the sections below apply to all of these classes. You can read about the base Metadata class in more detail in the API Reference.
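For instance, here is a small sketch showing the shared serialization helpers on Schema (the same calls work on Resource and Package):

from frictionless import Schema

# metadata classes share helpers like from_descriptor/to_descriptor/to_yaml
schema = Schema.from_descriptor({'fields': [{'name': 'id', 'type': 'integer'}]})
print(schema.to_yaml())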

+
+

Many Frictionless functions, such as describe and extract, infer metadata under the hood. At a lower level, it's possible to control this process. To see this, let's create a Resource.

+
from frictionless import Resource
+
+resource = Resource("country-1.csv")
+print(resource)
+
+ +
{'name': 'country-1',
+ 'type': 'table',
+ 'path': 'country-1.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
+
+
+

Frictionless always tries to be as explicit as possible. We didn't provide any metadata except for the path, so the resource only contains basic file details. But now, we'd like to infer additional metadata:

+
+

We can ask for stats using the CLI with frictionless describe country-1.csv --stats. In Python, we pass the stats argument to the resource.infer function.

+
+ +
+
+
frictionless describe country-1.csv --stats --json
+
+ +
{
+  "name": "country-1",
+  "type": "table",
+  "path": "country-1.csv",
+  "scheme": "file",
+  "format": "csv",
+  "mediatype": "text/csv",
+  "encoding": "utf-8",
+  "hash": "sha256:7cf6ce03c75461e1d9862b89250dbacf43e97976d1f25c056173971dfb203671",
+  "bytes": 100,
+  "fields": 4,
+  "rows": 5,
+  "schema": {
+    "fields": [
+      {
+        "name": "id",
+        "type": "integer"
+      },
+      {
+        "name": "neighbor_id",
+        "type": "integer"
+      },
+      {
+        "name": "name",
+        "type": "string"
+      },
+      {
+        "name": "population",
+        "type": "integer"
+      }
+    ]
+  }
+}
+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource("country-1.csv")
+resource.infer(stats=True)
+pprint(resource)
+
+ +
{'name': 'country-1',
+ 'type': 'table',
+ 'path': 'country-1.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv',
+ 'encoding': 'utf-8',
+ 'hash': 'sha256:7cf6ce03c75461e1d9862b89250dbacf43e97976d1f25c056173971dfb203671',
+ 'bytes': 100,
+ 'fields': 4,
+ 'rows': 5,
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                       {'name': 'neighbor_id', 'type': 'integer'},
+                       {'name': 'name', 'type': 'string'},
+                       {'name': 'population', 'type': 'integer'}]}}
+ +
+

The result is really familiar to us already. We have seen it a lot as an output of the describe function or command. Basically, that's what this high-level function does under the hood: create a resource and then infer additional metadata.
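A simplified sketch of that idea (not the actual implementation):

from frictionless import Resource

# roughly what describe() does for a single file: create, then infer
def describe_sketch(path, stats=False):
    resource = Resource(path)
    resource.infer(stats=stats)
    return resource

print(describe_sketch('country-1.csv').to_yaml())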

+

All the main Metadata classes have this method with different available options but with the same conceptual purpose:

+
  • package.infer
  • resource.infer
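For example, here is a short sketch with package.infer, assuming it accepts the same stats flag as resource.infer and that the *-3.csv files from the previous sections are present:

from frictionless import Package

# infers metadata (and stats) for every resource in the package
package = Package('*-3.csv')
package.infer(stats=True)
print(package.resource_names)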

For more advanced detection options, please read the Detector Guide.

+

Validating Metadata

+

Metadata validity is an important topic, and we recommend validating your metadata before publishing. For example, let's first make it invalid:

+ +
+
+
from frictionless import Resource
+
+descriptor = {}
+descriptor['path'] = 'country-1.csv'
+descriptor['title'] = 1
+try:
+    Resource(descriptor)
+except Exception as exception:
+    print(exception.error)
+    print(exception.reasons)
+
+ +
{'type': 'resource-error',
+ 'title': 'Resource Error',
+ 'description': 'A validation cannot be processed.',
+ 'message': 'The data resource has an error: descriptor is not valid',
+ 'tags': [],
+ 'note': 'descriptor is not valid'}
+[{'type': 'resource-error',
+ 'title': 'Resource Error',
+ 'description': 'A validation cannot be processed.',
+ 'message': "The data resource has an error: 'name' is a required property",
+ 'tags': [],
+ 'note': "'name' is a required property"}, {'type': 'resource-error',
+ 'title': 'Resource Error',
+ 'description': 'A validation cannot be processed.',
+ 'message': "The data resource has an error: 1 is not of type 'string' at "
+            "property 'title'",
+ 'tags': [],
+ 'note': "1 is not of type 'string' at property 'title'"}]
+ +
+
+
+

We see the error "1 is not of type 'string' at property 'title'" because we set title to be an integer.

+

Frictionless' high-level functions like validate run all metadata checks by default.

+

Transforming Metadata

+

We have seen this before but let's reiterate: it's possible to transform core metadata properties using the Python interface:

+ +
+
+
from frictionless import Resource
+
+resource = Resource("country.resource-cleaned.yaml")
+resource.title = "Countries"
+resource.description = "It's a research project"
+resource.dialect.header_rows = [2]
+resource.dialect.get_control('csv').delimiter = ";"
+resource.to_yaml("country.resource-updated.yaml")
+
+ +
+

We can add custom options using the custom property:

+ +
+
+
from frictionless import Resource
+
+resource = Resource("country.resource-updated.yaml")
+resource.custom["customKey1"] = "Value1"
+resource.custom["customKey2"] = "Value2"
+resource.to_yaml("country.resource-updated2.yaml")
+
+ +
+

Let's check it out:

+ +
+
+
cat country.resource-updated2.yaml
+
+ +
name: country-2
+type: table
+title: Countries
+description: It's a research project
+path: country-2.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+dialect:
+  headerRows:
+    - 2
+  csv:
+    delimiter: ;
+schema: country.schema.yaml
+customKey1: Value1
+customKey2: Value2
+ +
+
+
with open('country.resource-updated2.yaml') as file:
+    print(file.read())
+
+ +
name: country-2
+type: table
+title: Countries
+description: It's a research project
+path: country-2.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+dialect:
+  headerRows:
+    - 2
+  csv:
+    delimiter: ;
+schema: country.schema.yaml
+customKey1: Value1
+customKey2: Value2
+ +
+

Extracting Data

+
+

This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start.

+
+

Extracting data means reading tabular data from a source. We can use various customizations for this process, such as providing a file format, a table schema, limiting the number of fields or rows, and much more. This guide will discuss the main extract functions (extract, extract_resource, extract_package) and will then go into more advanced details about the Resource Class, Package Class, Header Class, and Row Class. The output from the extract function uses the 'utf-8' encoding.

+

Let's see this with some real files:

+
+

Download country-3.csv to reproduce the examples (right-click and "Save link as").

+
+ +
+
+
cat country-3.csv
+
+ +
id,capital_id,name,population
+1,1,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,5,Italy,60
+5,4,Spain,47
+ +
+
+
with open('country-3.csv') as file:
+    print(file.read())
+
+ +
id,capital_id,name,population
+1,1,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,5,Italy,60
+5,4,Spain,47
+ +
+
+

Download capital-3.csv to reproduce the examples (right-click and "Save link as").

+
+ +
+
+
cat capital-3.csv
+
+ +
id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+ +
+
+
with open('capital-3.csv') as file:
+    print(file.read())
+
+ +
id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+ +
+

To start, we will extract data from a resource:

+ +
+
+
frictionless extract country-3.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ country-3 │ table │ country-3.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                country-3
+┏━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ capital_id ┃ name    ┃ population ┃
+┡━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
+│ 1  │ 1          │ Britain │ 67         │
+│ 2  │ 3          │ France  │ 67         │
+│ 3  │ 2          │ Germany │ 83         │
+│ 4  │ 5          │ Italy   │ 60         │
+│ 5  │ 4          │ Spain   │ 47         │
+└────┴────────────┴─────────┴────────────┘
+ +
+
+
from pprint import pprint
+from frictionless import extract
+
+rows = extract('country-3.csv')
+pprint(rows)
+
+ +
{'country-3': [{'capital_id': 1, 'id': 1, 'name': 'Britain', 'population': 67},
+               {'capital_id': 3, 'id': 2, 'name': 'France', 'population': 67},
+               {'capital_id': 2, 'id': 3, 'name': 'Germany', 'population': 83},
+               {'capital_id': 5, 'id': 4, 'name': 'Italy', 'population': 60},
+               {'capital_id': 4, 'id': 5, 'name': 'Spain', 'population': 47}]}
+ +
+

Extract Functions

+

The high-level interface for extracting data provided by Frictionless is a set of extract functions:

+
  • extract: detects the source file type and extracts data accordingly
  • resource.extract: returns a data table
  • package.extract: returns a map of the package's tables

As described in more detail in the Introduction, a resource is a single file, such as a data file, and a package is a set of files, such as a data file and a schema.

+

The command/function would be used as follows:

+ +
+
+
frictionless extract your-table.csv
+frictionless extract your-resource.json --type resource
+frictionless extract your-package.json --type package
+
+ +
+
+
from frictionless import extract
+
+rows = extract('capital-3.csv')
+resource = extract('capital-3.csv', type="resource")
+package = extract('capital-3.csv', type="package")
+
+ +
+

The extract function always reads data into memory, in the form of rows. The lower-level interfaces allow you to stream data, which you can read about in the Resource Class section below.

+

Extracting a Resource

+

A resource contains only one file. To extract a resource, we have three options. First, we can use the same approach as above, extracting from the data file itself:

+ +
+
+
frictionless extract capital-3.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ capital-3 │ table │ capital-3.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+   capital-3
+┏━━━━┳━━━━━━━━┓
+┃ id ┃ name   ┃
+┡━━━━╇━━━━━━━━┩
+│ 1  │ London │
+│ 2  │ Berlin │
+│ 3  │ Paris  │
+│ 4  │ Madrid │
+│ 5  │ Rome   │
+└────┴────────┘
+ +
+
+
from pprint import pprint
+from frictionless import extract
+
+rows = extract('capital-3.csv')
+pprint(rows)
+
+ +
{'capital-3': [{'id': 1, 'name': 'London'},
+               {'id': 2, 'name': 'Berlin'},
+               {'id': 3, 'name': 'Paris'},
+               {'id': 4, 'name': 'Madrid'},
+               {'id': 5, 'name': 'Rome'}]}
+ +
+

Our second option is to extract the resource from a descriptor file by using the extract_resource function. A descriptor file is useful because it can contain different metadata and be stored on disk.

+

As an example of how to use extract_resource, let's first create a descriptor file (note: this example uses YAML for the descriptor, but Frictionless also supports JSON):

+ +
+
+
from frictionless import Resource
+
+resource = Resource('capital-3.csv')
+resource.infer()
+# as an example, in the next line we customize the schema's missing values
+resource.schema.missing_values.append('3') # "3" will be interpreted as a missing value
+resource.to_yaml('capital.resource-test.yaml') # use resource.to_json for JSON format
+
+ +
+

You can also use a pre-made descriptor file.

+

Now, this descriptor file can be used to extract the resource:

+ +
+
+
frictionless extract capital.resource-test.yaml
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ capital-3 │ table │ capital-3.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+    capital-3
+┏━━━━━━┳━━━━━━━━┓
+┃ id   ┃ name   ┃
+┡━━━━━━╇━━━━━━━━┩
+│ 1    │ London │
+│ 2    │ Berlin │
+│ None │ Paris  │
+│ 4    │ Madrid │
+│ 5    │ Rome   │
+└──────┴────────┘
+ +
+
+
from pprint import pprint
+from frictionless import extract
+
+rows = extract('capital.resource-test.yaml')
+pprint(rows)
+
+ +
{'capital-3': [{'id': 1, 'name': 'London'},
+               {'id': 2, 'name': 'Berlin'},
+               {'id': None, 'name': 'Paris'},
+               {'id': 4, 'name': 'Madrid'},
+               {'id': 5, 'name': 'Rome'}]}
+ +
+

So what has happened in this example? We set the textual representation of the number "3" to be a missing value. In the output we can see how the id number 3 now appears as None representing a missing value. This toy example demonstrates how the metadata in a descriptor can be used; other values like "NA" are more common for missing values.

+

You can read more advanced details about the Resource Class below.

+

Extracting a Package

+

The third way we can extract information is from a package, which is a set of two or more files, for instance, two data files and a corresponding metadata file.

+

As a primary example, we provide two data files to the extract command which will be enough to detect that it's a dataset. Let's start by using the command-line interface:

+ +
+
+
frictionless extract *-3.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+               dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name      ┃ type  ┃ path          ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ capital-3 │ table │ capital-3.csv │
+│ country-3 │ table │ country-3.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+   capital-3
+┏━━━━┳━━━━━━━━┓
+┃ id ┃ name   ┃
+┡━━━━╇━━━━━━━━┩
+│ 1  │ London │
+│ 2  │ Berlin │
+│ 3  │ Paris  │
+│ 4  │ Madrid │
+│ 5  │ Rome   │
+└────┴────────┘
+                country-3
+┏━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ capital_id ┃ name    ┃ population ┃
+┡━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
+│ 1  │ 1          │ Britain │ 67         │
+│ 2  │ 3          │ France  │ 67         │
+│ 3  │ 2          │ Germany │ 83         │
+│ 4  │ 5          │ Italy   │ 60         │
+│ 5  │ 4          │ Spain   │ 47         │
+└────┴────────────┴─────────┴────────────┘
+ +
+
+
from pprint import pprint
+from frictionless import extract
+
+data = extract('*-3.csv')
+pprint(data)
+
+ +
{'capital-3': [{'id': 1, 'name': 'London'},
+               {'id': 2, 'name': 'Berlin'},
+               {'id': 3, 'name': 'Paris'},
+               {'id': 4, 'name': 'Madrid'},
+               {'id': 5, 'name': 'Rome'}],
+ 'country-3': [{'capital_id': 1, 'id': 1, 'name': 'Britain', 'population': 67},
+               {'capital_id': 3, 'id': 2, 'name': 'France', 'population': 67},
+               {'capital_id': 2, 'id': 3, 'name': 'Germany', 'population': 83},
+               {'capital_id': 5, 'id': 4, 'name': 'Italy', 'population': 60},
+               {'capital_id': 4, 'id': 5, 'name': 'Spain', 'population': 47}]}
+ +
+

We can also extract the package from a descriptor file using the package.extract function (Note: the country.package.yaml file was created in the Describing Data guide):

+ +
+
+
frictionless extract country.package.yaml
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+              dataset
+┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name    ┃ type  ┃ path          ┃
+┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ capital │ table │ capital-3.csv │
+│ country │ table │ country-3.csv │
+└─────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+    capital
+┏━━━━┳━━━━━━━━┓
+┃ id ┃ name   ┃
+┡━━━━╇━━━━━━━━┩
+│ 1  │ London │
+│ 2  │ Berlin │
+│ 3  │ Paris  │
+│ 4  │ Madrid │
+│ 5  │ Rome   │
+└────┴────────┘
+                 country
+┏━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ capital_id ┃ name    ┃ population ┃
+┡━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
+│ 1  │ 1          │ Britain │ 67         │
+│ 2  │ 3          │ France  │ 67         │
+│ 3  │ 2          │ Germany │ 83         │
+│ 4  │ 5          │ Italy   │ 60         │
+│ 5  │ 4          │ Spain   │ 47         │
+└────┴────────────┴─────────┴────────────┘
+ +
+
+
from pprint import pprint
+from frictionless import Package
+
+package = Package('country.package.yaml')
+pprint(package.extract())
+
+ +
{'capital': [{'id': 1, 'name': 'London'},
+             {'id': 2, 'name': 'Berlin'},
+             {'id': 3, 'name': 'Paris'},
+             {'id': 4, 'name': 'Madrid'},
+             {'id': 5, 'name': 'Rome'}],
+ 'country': [{'capital_id': 1, 'id': 1, 'name': 'Britain', 'population': 67},
+             {'capital_id': 3, 'id': 2, 'name': 'France', 'population': 67},
+             {'capital_id': 2, 'id': 3, 'name': 'Germany', 'population': 83},
+             {'capital_id': 5, 'id': 4, 'name': 'Italy', 'population': 60},
+             {'capital_id': 4, 'id': 5, 'name': 'Spain', 'population': 47}]}
+ +
+

You can read more advanced details about the Package Class below.

+
+

The following sections contain further, advanced details about the Resource Class, Package Class, Header Class, and Row Class.

+
+

Resource Class

+

The Resource class provides metadata about a resource along with read and stream functions. The extract functions always read rows into memory; Resource can do the same, but it also gives a choice regarding the output data, which can be rows, cells, text, or bytes. Let's try reading all of them.

+

Reading Bytes

+

It's a byte representation of the contents:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('country-3.csv')
+pprint(resource.read_bytes())
+
+ +
(b'id,capital_id,name,population\n1,1,Britain,67\n2,3,France,67\n3,2,Germany,8'
+ b'3\n4,5,Italy,60\n5,4,Spain,47\n')
+ +
+

Reading Text

+

It's a textual representation of the contents:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('country-3.csv')
+pprint(resource.read_text())
+
+ +
('id,capital_id,name,population\n'
+ '1,1,Britain,67\n'
+ '2,3,France,67\n'
+ '3,2,Germany,83\n'
+ '4,5,Italy,60\n'
+ '5,4,Spain,47\n')
+ +
+

Reading Cells

+

For tabular data there is also a raw representation of the contents as lists of cells:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('country-3.csv')
+pprint(resource.read_cells())
+
+ +
[['id', 'capital_id', 'name', 'population'],
+ ['1', '1', 'Britain', '67'],
+ ['2', '3', 'France', '67'],
+ ['3', '2', 'Germany', '83'],
+ ['4', '5', 'Italy', '60'],
+ ['5', '4', 'Spain', '47']]
+ +
+

Reading Rows

+

For tabular data there are also rows available, which are normalized cells presented as dictionaries:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('country-3.csv')
+pprint(resource.read_rows())
+
+ +
[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
+ {'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
+ {'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
+ {'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
+ {'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]
+ +
+

Reading a Header

+

For tabular data there is also the Header object available:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+with Resource('country-3.csv') as resource:
+    pprint(resource.header)
+
+ +
['id', 'capital_id', 'name', 'population']
+ +
+

Streaming Interfaces

+

It's really handy to read all your data into memory but it's not always possible if a file is very big. For such cases, Frictionless provides streaming functions:

+ +
+
+
from frictionless import Resource
+
+with Resource('country-3.csv') as resource:
+    resource.byte_stream
+    resource.text_stream
+    resource.list_stream
+    resource.row_stream
+
+ +
+
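For instance, here is a minimal sketch that consumes the row stream lazily, one row at a time:

from frictionless import Resource

# rows are produced on demand, so even very large files stay memory-friendly
with Resource('country-3.csv') as resource:
    for row in resource.row_stream:
        print(row['name'])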

Package Class

+

The Package class provides functions to read the contents of a package. First of all, let's create a package descriptor:

+ +
+
+
frictionless describe *-3.csv --json > country.package.json
+
+ +
+
+
from frictionless import describe
+
+package = describe('*-3.csv')
+package.to_json('country.package.json')
+
+ +
+

Note that --json is used here to output the descriptor in JSON format. Without this, the default output is in YAML format as we saw above.

+

We can create a package from data files (using their paths) and then read the package's resources:

+ +
+
+
from pprint import pprint
+from frictionless import Package
+
+package = Package('*-3.csv')
+pprint(package.get_resource('country-3').read_rows())
+pprint(package.get_resource('capital-3').read_rows())
+
+ +
[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
+ {'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
+ {'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
+ {'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
+ {'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]
+[{'id': 1, 'name': 'London'},
+ {'id': 2, 'name': 'Berlin'},
+ {'id': 3, 'name': 'Paris'},
+ {'id': 4, 'name': 'Madrid'},
+ {'id': 5, 'name': 'Rome'}]
+ +
+

The package by itself doesn't provide any read functions directly because it's just a container. You can select a package's resource and use the Resource API from above for data reading.


Transforming Data

+
+

This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start.

+
+

Transforming data in Frictionless means modifying data and metadata from state A to state B. For example, it could be transforming a messy Excel file to a cleaned CSV file, or transforming a folder of data files to a data package we can publish more easily. To read more about the concepts behind Frictionless Transform, please check out the Transform Principles section below.

+

In comparison to similar Python software like Pandas, Frictionless provides better control over metadata, has a modular API, and fully supports the Frictionless Specifications. It is also a streaming framework with the ability to work with large data. As a downside of the Frictionless architecture, it might be slower compared to other Python packages, especially to projects like Pandas.

+

Keep reading below to learn about the principles underlying Frictionless Transform, or skip ahead to see how to use the Transform code.

+

Transform Principles

+

Frictionless Transform is based on a few core principles which are shared with other parts of the framework:

+

Conceptual Simplicity

+

Frictionless Transform can be thought of as a list of functions that accept a source resource/package object and return a target resource/package object. Every function updates the input's metadata and data - and nothing more. We tried to make this straightforward and conceptually simple, because we want our users to be able to understand the tools and master them.
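A conceptual sketch of that shape (not real framework code, just the mental model):

# every transform step behaves like a function from a source to a target
def some_step(source):  # a Resource or a Package
    target = source.to_copy()
    # ...update the target's metadata and data here...
    return target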

+

Metadata Matters

+

There are plenty of great ETL-frameworks written in Python and other languages. We use one of them (PETL) under the hood (described in more detail later). The core difference between Frictionless and others is that we treat metadata as a first-class citizen. This means that you don't lose type and other important information during the pipeline evaluation.

+

Data Streaming

+

Whenever possible, Frictionless streams the data instead of reading it into memory. For example, for sorting big tables we use a memory usage threshold and when it is met we use the file system to unload the data. The ability to stream data gives users power to work with files of any size, even very large files.

+

Lazy Evaluation

+

With Frictionless all data manipulation happens on-demand. For example, if you reshape one table in a data package containing 10 big csv files, Frictionless will not even read the 9 other tables. Frictionless tries to be as explicit as possible regarding actions taken. For example, it will not use CPU resources to cast data unless a user adds a normalize step. So it's possible to transform a rather big file without even casting types, for example, if you only need to reshape it.

+

Software Reuse

+

For the core transform functions, Frictionless uses the amazing PETL project under the hood. This library provides lazy-loading functionality in running data pipelines. On top of PETL, Frictionless adds metadata management and a bridge between Frictionless concepts like Package/Resource and PETL's processors.

+

Transform Functions

+

Frictionless supports a few different kinds of data and metadata transformations:

+
  • resource and package transformations
  • transformations based on a declarative pipeline

The main difference between these is that resource and package transforms are imperative while pipelines can be created beforehand or shared as a JSON file. We'll talk more about pipelines in the Transforming Pipeline section below. First, we will introduce the transform functions, then go into detail about how to transform a resource and a package. As a reminder, in the Frictionless ecosystem, a resource is a single file, such as a data file, and a package is a set of files, such as a data file and a schema. This concept is described in more detail in the Introduction.

+
+

Download transform.csv to reproduce the examples (right-click and "Save link as"; you might need to change the file extension from .txt to .csv).

+
+ +
+
+
cat transform.csv
+
+ +
id,name,population
+1,germany,83
+2,france,66
+3,spain,47
+ +
+

The high-level interface to transform data is a set of transform functions:

+
  • transform: detects the source type and transforms data accordingly
  • resource.transform: transforms a resource
  • package.transform: transforms a package

We'll see examples of these functions in the next few sections.

+

Transforming a Resource

+

Let's write our first transformation. Here, we will transform a data file (a resource) by defining a source resource, applying transform steps and getting back a resulting target resource:

+ +
+
+
from frictionless import Resource, Pipeline, steps
+
+# Define source resource
+source = Resource(path="transform.csv")
+
+# Create a pipeline
+pipeline = Pipeline(steps=[
+    steps.table_normalize(),
+    steps.field_add(name="cars", formula='population*2', descriptor={'type': 'integer'}),
+])
+
+# Apply transform pipeline
+target = source.transform(pipeline)
+
+# Print resulting schema and data
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'},
+            {'name': 'cars', 'type': 'integer'}]}
++----+-----------+------------+------+
+| id | name      | population | cars |
++====+===========+============+======+
+|  1 | 'germany' |         83 |  166 |
++----+-----------+------------+------+
+|  2 | 'france'  |         66 |  132 |
++----+-----------+------------+------+
+|  3 | 'spain'   |         47 |   94 |
++----+-----------+------------+------+
+ +
+

Let's break down the transforming steps we applied:

+
  1. steps.table_normalize - casts data types and shapes the table according to the schema, inferred or provided
  2. steps.field_add - adds a field to data and metadata based on the information provided by the user

There are many more available steps that we will cover below.

+

Transforming a Package

+

A package is a set of resources. Transforming a package means adding or removing resources and/or transforming those resources themselves. This example shows how transforming a package is similar to transforming a single resource:

+ +
+
+
from frictionless import Package, Resource, Pipeline, steps
+
+# Define source package
+source = Package(resources=[Resource(name='main', path="transform.csv")])
+
+# Create a pipeline
+pipeline = Pipeline(steps=[
+    steps.resource_add(name="extra", descriptor={"data": [['id', 'cars'], [1, 166], [2, 132], [3, 94]]}),
+    steps.resource_transform(
+        name="main",
+        steps=[
+            steps.table_normalize(),
+            steps.table_join(resource="extra", field_name="id"),
+        ],
+    ),
+    steps.resource_remove(name="extra"),
+])
+
+# Apply transform steps
+target = source.transform(pipeline)
+
+# Print resulting resources, schema and data
+print(target.resource_names)
+print(target.get_resource("main").schema)
+print(target.get_resource("main").to_view())
+
+ +
['main']
+{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'},
+            {'name': 'cars', 'type': 'integer'}]}
++----+-----------+------------+------+
+| id | name      | population | cars |
++====+===========+============+======+
+|  1 | 'germany' |         83 |  166 |
++----+-----------+------------+------+
+|  2 | 'france'  |         66 |  132 |
++----+-----------+------------+------+
+|  3 | 'spain'   |         47 |   94 |
++----+-----------+------------+------+
+ +
+

We have basically done the same as in the Transforming a Resource section. This example is quite artificial and created only to show how to join two resources, but hopefully it provides a basic understanding of how flexible package transformations can be.

+

Transforming Pipeline

+

A pipeline is a declarative way to write out metadata transform steps. With a pipeline, you can transform a resource or a package, and you can write custom plugins too.

+

For resource and package types it's mostly the same functionality as we have seen above, but written declaratively. So let's run the same resource transformation as we did in the Transforming a Resource section:

+ +
+
+
from frictionless import Pipeline, transform
+
+pipeline = Pipeline.from_descriptor({
+    "steps": [
+        {"type": "table-normalize"},
+        {
+            "type": "field-add",
+            "name": "cars",
+            "formula": "population*2",
+            "descriptor": {"type": "integer"}
+        },
+    ],
+})
+print(pipeline)
+
+ +
{'steps': [{'type': 'table-normalize'},
+           {'name': 'cars',
+            'type': 'field-add',
+            'formula': 'population*2',
+            'descriptor': {'type': 'integer'}}]}
+ +
+

So what's the reason to use declarative pipelines if it works the same as the Python code? The main difference is that pipelines can be saved as JSON files which can be shared among different users and used with CLI and API. For example, if you implement your own UI based on Frictionless Framework you can serialize the whole pipeline as a JSON file and send it to the server. This is the same for CLI - if your colleague has given you a pipeline.json file, you can run frictionless transform pipeline.json in the CLI to get the same results as they got.
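For comparison, here is a short sketch applying the declarative pipeline above to a resource in Python, mirroring the imperative example from earlier:

from frictionless import Resource, Pipeline

pipeline = Pipeline.from_descriptor({
    'steps': [
        {'type': 'table-normalize'},
        {'type': 'field-add', 'name': 'cars',
         'formula': 'population*2', 'descriptor': {'type': 'integer'}},
    ],
})
target = Resource('transform.csv').transform(pipeline)
print(target.to_view())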

+

Available Steps

+

Frictionless includes more than 40 built-in transform steps. They are grouped by the object they operate on, so you can find them easily using code auto-completion in a code editor. For example, start typing steps.table... and you will see all the available steps for that group. The available groups are:

+
  • resource
  • table
  • field
  • row
  • cell

See Transform Steps for a list of all available steps. It is also possible to write custom transform steps: see the next section.

+

Custom Steps

+

Here is an example of a custom step written as a Python function. This example step removes a field from a data table (note: Frictionless already has a built-in function that does this same thing: steps.field_remove).

+ +
+
+
from frictionless import Package, Resource, Step, transform, steps
+
+class custom_step(Step):
+    def transform_resource(self, resource):
+        current = resource.to_copy()
+
+        # Data
+        def data():
+            with current:
+                for cells in current.cell_stream:
+                    yield cells[1:]
+
+        # Meta
+        resource.data = data
+        resource.schema.remove_field("id")
+
+source = Resource("transform.csv")
+pipeline = Pipeline(steps=[custom_step()])
+target = source.transform(pipeline)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++-----------+------------+
+| name      | population |
++===========+============+
+| 'germany' |         83 |
++-----------+------------+
+| 'france'  |         66 |
++-----------+------------+
+| 'spain'   |         47 |
++-----------+------------+
+ +
+

As you can see, you can implement any custom step within a Python script. To make it work within a declarative pipeline you need to implement a plugin. Learn more about Custom Steps and Plugins.

+

Transform Utils

+
+

Transform Utils is under construction.

+
+

Working with PETL

+

In some cases, it's better to use a lower-level API to achieve your goal. A resource can be exported as a PETL table. For more information please visit PETL's documentation portal.

+ + + +
+
+
from frictionless import Resource
+
+resource = Resource(path='transform.csv')
+petl_table = resource.to_petl()
+# Use it with PETL framework
+print(petl_table)
+
+ +
+----+---------+------------+
+| id | name    | population |
++====+=========+============+
+| 1  | germany | 83         |
++----+---------+------------+
+| 2  | france  | 66         |
++----+---------+------------+
+| 3  | spain   | 47         |
++----+---------+------------+
+ +
+

Validating Data

+
+

This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start.

+
+

Tabular data validation is a process of identifying problems that have occurred in your data so you can correct them. Let's explore how Frictionless helps to achieve this task using an invalid data table as an example:

+
+

Download capital-invalid.csv to reproduce the examples (right-click and "Save link as").

+
+ +
+
+
cat capital-invalid.csv
+
+ +
id,name,name
+1,London,Britain
+2,Berlin,Germany
+3,Paris,France
+4,Madrid,Spain
+5,Rome,Italy
+6,Zagreb,Croatia
+7,Athens,Greece
+8,Vienna,Austria
+8,Warsaw
+
+x,Tokio,Japan,review
+ +
+
+
with open('capital-invalid.csv') as file:
+    print(file.read())
+
+ +
id,name,name
+1,London,Britain
+2,Berlin,Germany
+3,Paris,France
+4,Madrid,Spain
+5,Rome,Italy
+6,Zagreb,Croatia
+7,Athens,Greece
+8,Vienna,Austria
+8,Warsaw
+
+x,Tokio,Japan,review
+ +
+

We can validate this file by using both the command-line interface and the high-level Python functions. Frictionless provides comprehensive error details so that errors can be understood by the user. Continue reading to learn about the validation process in detail.

+ +
+
+
frictionless validate capital-invalid.csv
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+                          dataset
+┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name            ┃ type  ┃ path                ┃ status  ┃
+┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ capital-invalid │ table │ capital-invalid.csv │ INVALID │
+└─────────────────┴───────┴─────────────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                                capital-invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row  ┃ Field ┃ Type            ┃ Message                                     ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3     │ duplicate-label │ Label "name" in the header at position "3"  │
+│      │       │                 │ is duplicated to a label: at position "2"   │
+│ 10   │ 3     │ missing-cell    │ Row at position "10" has a missing cell in  │
+│      │       │                 │ field "name2" at position "3"               │
+│ 11   │ None  │ blank-row       │ Row at position "11" is completely blank    │
+│ 12   │ 1     │ type-error      │ Type error in the cell "x" in row "12" and  │
+│      │       │                 │ field "id" at position "1": type is         │
+│      │       │                 │ "integer/default"                           │
+│ 12   │ 4     │ extra-cell      │ Row at position "12" has an extra value in  │
+│      │       │                 │ field at position "4"                       │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+ +
+
+
from pprint import pprint
+from frictionless import validate
+
+report = validate('capital-invalid.csv')
+print(report)
+
+ +
{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 5, 'warnings': 0, 'seconds': 0.007},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-invalid',
+            'type': 'table',
+            'valid': False,
+            'place': 'capital-invalid.csv',
+            'labels': ['id', 'name', 'name'],
+            'stats': {'errors': 5,
+                      'warnings': 0,
+                      'seconds': 0.007,
+                      'md5': 'dcdeae358cfd50860c18d953e021f836',
+                      'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+                      'bytes': 171,
+                      'fields': 3,
+                      'rows': 11},
+            'warnings': [],
+            'errors': [{'type': 'duplicate-label',
+                        'title': 'Duplicate Label',
+                        'description': 'Two columns in the header row have the '
+                                       'same value. Column names should be '
+                                       'unique.',
+                        'message': 'Label "name" in the header at position "3" '
+                                   'is duplicated to a label: at position "2"',
+                        'tags': ['#table', '#header', '#label'],
+                        'note': 'at position "2"',
+                        'labels': ['id', 'name', 'name'],
+                        'rowNumbers': [1],
+                        'label': 'name',
+                        'fieldName': 'name2',
+                        'fieldNumber': 3},
+                       {'type': 'missing-cell',
+                        'title': 'Missing Cell',
+                        'description': 'This row has less values compared to '
+                                       'the header row (the first row in the '
+                                       'data source). A key concept is that '
+                                       'all the rows in tabular data must have '
+                                       'the same number of columns.',
+                        'message': 'Row at position "10" has a missing cell in '
+                                   'field "name2" at position "3"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': '',
+                        'cells': ['8', 'Warsaw'],
+                        'rowNumber': 10,
+                        'cell': '',
+                        'fieldName': 'name2',
+                        'fieldNumber': 3},
+                       {'type': 'blank-row',
+                        'title': 'Blank Row',
+                        'description': 'This row is empty. A row should '
+                                       'contain at least one value.',
+                        'message': 'Row at position "11" is completely blank',
+                        'tags': ['#table', '#row'],
+                        'note': '',
+                        'cells': [],
+                        'rowNumber': 11},
+                       {'type': 'type-error',
+                        'title': 'Type Error',
+                        'description': 'The value does not match the schema '
+                                       'type and format for this field.',
+                        'message': 'Type error in the cell "x" in row "12" and '
+                                   'field "id" at position "1": type is '
+                                   '"integer/default"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': 'type is "integer/default"',
+                        'cells': ['x', 'Tokio', 'Japan', 'review'],
+                        'rowNumber': 12,
+                        'cell': 'x',
+                        'fieldName': 'id',
+                        'fieldNumber': 1},
+                       {'type': 'extra-cell',
+                        'title': 'Extra Cell',
+                        'description': 'This row has more values compared to '
+                                       'the header row (the first row in the '
+                                       'data source). A key concept is that '
+                                       'all the rows in tabular data must have '
+                                       'the same number of columns.',
+                        'message': 'Row at position "12" has an extra value in '
+                                   'field at position "4"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': '',
+                        'cells': ['x', 'Tokio', 'Japan', 'review'],
+                        'rowNumber': 12,
+                        'cell': 'review',
+                        'fieldName': '',
+                        'fieldNumber': 4}]}]}
+ +
+

Validate Functions

+

The high-level interface for validating data provided by Frictionless is a set of validate functions:

+
    +
  • validate: detects the source type and validates data accordingly
  • +
  • Schema.validate_descriptor: validates a schema's metadata
  • +
  • resource.validate: validates a resource's data and metadata
  • +
  • package.validate: validates a package's data and metadata
  • +
  • inquiry.validate: validates a special Inquiry object which represents a validation task instruction
  • +
+

On the command line, there is only one command, but there is a flag to adjust the behavior. It's useful when you have a file with an ambiguous type, for example, a JSON file containing data instead of metadata:

+ +
+
+
frictionless validate your-data.csv
+frictionless validate your-schema.yaml --type schema
+frictionless validate your-data.csv --type resource
+frictionless validate your-package.json --type package
+frictionless validate your-inquiry.yaml --type inquiry
+
+ +
+

As a reminder, in the Frictionless ecosystem, a resource is a single file, such as a data file, and a package is a set of files, such as a data file and a schema. This concept is described in more detail in the Introduction.

+
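As a quick illustration of that split, here is a minimal sketch (assuming the files from this guide are in the working directory) of the class-based entry points that mirror the CLI flags above:

from frictionless import Package, Resource
+
+# a resource wraps a single data file
+resource = Resource('capital-invalid.csv')
+print(resource.validate().valid)
+
+# a package aggregates one or more resources
+package = Package(resources=[Resource('capital-invalid.csv')])
+print(package.validate().valid)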

Validating a Schema

+

The Schema.validate_descriptor function is the only function that validates solely metadata. To see it at work, let's create an invalid table schema:

+ +
+
+
import yaml
+from frictionless import Schema
+
+descriptor = {}
+descriptor['fields'] = 'bad' # must be a list
+with open('bad.schema.yaml', 'w') as file:
+    yaml.dump(descriptor, file)
+
+ +
+

And let's validate this schema:

+ +
+
+
frictionless validate bad.schema.yaml
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+                     dataset
+┏━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name       ┃ type ┃ path            ┃ status  ┃
+┡━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ bad.schema │ json │ bad.schema.yaml │ INVALID │
+└────────────┴──────┴─────────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                                   bad.schema
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row  ┃ Field ┃ Type         ┃ Message                                        ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ None  │ schema-error │ Schema is not valid: 'bad' is not of type      │
+│      │       │              │ 'array' at property 'fields'                   │
+└──────┴───────┴──────────────┴────────────────────────────────────────────────┘
+ +
+
+
from pprint import pprint
+from frictionless import validate
+
+report = validate('bad.schema.yaml')
+pprint(report)
+
+ +
{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.001},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'bad.schema',
+            'type': 'json',
+            'valid': False,
+            'place': 'bad.schema.yaml',
+            'labels': [],
+            'stats': {'errors': 1, 'warnings': 0, 'seconds': 0.001},
+            'warnings': [],
+            'errors': [{'type': 'schema-error',
+                        'title': 'Schema Error',
+                        'description': 'Provided schema is not valid.',
+                        'message': "Schema is not valid: 'bad' is not of type "
+                                   "'array' at property 'fields'",
+                        'tags': [],
+                        'note': "'bad' is not of type 'array' at property "
+                                "'fields'"}]}]}
+ +
+

We see that the schema is invalid and the error is displayed. Schema validation can be very useful when you work with different classes of tables and create schemas for them. Using this function will ensure that the metadata is valid.

+
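The same check can be run without the CLI. A minimal sketch, assuming Schema.validate_descriptor (named in the function list above) accepts a path or a descriptor dict and returns a report:

from frictionless import Schema
+
+report = Schema.validate_descriptor('bad.schema.yaml')
+print(report.valid)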

Validating a Resource

+

As was shown in the "Describing Data" guide, a resource is a container having both metadata and data. We need to create a resource descriptor and then we can validate it:

+ +
+
+
frictionless describe capital-invalid.csv > capital.resource.yaml
+
+ +
+
+
from frictionless import describe
+
+resource = describe('capital-invalid.csv')
+resource.to_yaml('capital.resource.yaml')
+
+ +
+

Note: this example uses YAML for the resource descriptor format, but Frictionless also supports JSON.

+
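For instance, the same descriptor can be written as JSON with to_json; the filename here is illustrative:

from frictionless import describe
+
+resource = describe('capital-invalid.csv')
+resource.to_json('capital.resource.json')  # JSON instead of YAML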

Let's now validate to ensure that we are getting the same result that we got without using a resource:

+ +
+
+
frictionless validate capital.resource.yaml
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+                          dataset
+┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name            ┃ type  ┃ path                ┃ status  ┃
+┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ capital-invalid │ table │ capital-invalid.csv │ INVALID │
+└─────────────────┴───────┴─────────────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                                capital-invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row  ┃ Field ┃ Type            ┃ Message                                     ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3     │ duplicate-label │ Label "name" in the header at position "3"  │
+│      │       │                 │ is duplicated to a label: at position "2"   │
+│ 10   │ 3     │ missing-cell    │ Row at position "10" has a missing cell in  │
+│      │       │                 │ field "name2" at position "3"               │
+│ 11   │ None  │ blank-row       │ Row at position "11" is completely blank    │
+│ 12   │ 1     │ type-error      │ Type error in the cell "x" in row "12" and  │
+│      │       │                 │ field "id" at position "1": type is         │
+│      │       │                 │ "integer/default"                           │
+│ 12   │ 4     │ extra-cell      │ Row at position "12" has an extra value in  │
+│      │       │                 │ field at position "4"                       │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+ +
+
+
from frictionless import validate
+
+report = validate('capital.resource.yaml')
+print(report)
+
+ +
{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 5, 'warnings': 0, 'seconds': 0.004},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-invalid',
+            'type': 'table',
+            'valid': False,
+            'place': 'capital-invalid.csv',
+            'labels': ['id', 'name', 'name'],
+            'stats': {'errors': 5,
+                      'warnings': 0,
+                      'seconds': 0.004,
+                      'md5': 'dcdeae358cfd50860c18d953e021f836',
+                      'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+                      'bytes': 171,
+                      'fields': 3,
+                      'rows': 11},
+            'warnings': [],
+            'errors': [{'type': 'duplicate-label',
+                        'title': 'Duplicate Label',
+                        'description': 'Two columns in the header row have the '
+                                       'same value. Column names should be '
+                                       'unique.',
+                        'message': 'Label "name" in the header at position "3" '
+                                   'is duplicated to a label: at position "2"',
+                        'tags': ['#table', '#header', '#label'],
+                        'note': 'at position "2"',
+                        'labels': ['id', 'name', 'name'],
+                        'rowNumbers': [1],
+                        'label': 'name',
+                        'fieldName': 'name2',
+                        'fieldNumber': 3},
+                       {'type': 'missing-cell',
+                        'title': 'Missing Cell',
+                        'description': 'This row has less values compared to '
+                                       'the header row (the first row in the '
+                                       'data source). A key concept is that '
+                                       'all the rows in tabular data must have '
+                                       'the same number of columns.',
+                        'message': 'Row at position "10" has a missing cell in '
+                                   'field "name2" at position "3"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': '',
+                        'cells': ['8', 'Warsaw'],
+                        'rowNumber': 10,
+                        'cell': '',
+                        'fieldName': 'name2',
+                        'fieldNumber': 3},
+                       {'type': 'blank-row',
+                        'title': 'Blank Row',
+                        'description': 'This row is empty. A row should '
+                                       'contain at least one value.',
+                        'message': 'Row at position "11" is completely blank',
+                        'tags': ['#table', '#row'],
+                        'note': '',
+                        'cells': [],
+                        'rowNumber': 11},
+                       {'type': 'type-error',
+                        'title': 'Type Error',
+                        'description': 'The value does not match the schema '
+                                       'type and format for this field.',
+                        'message': 'Type error in the cell "x" in row "12" and '
+                                   'field "id" at position "1": type is '
+                                   '"integer/default"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': 'type is "integer/default"',
+                        'cells': ['x', 'Tokio', 'Japan', 'review'],
+                        'rowNumber': 12,
+                        'cell': 'x',
+                        'fieldName': 'id',
+                        'fieldNumber': 1},
+                       {'type': 'extra-cell',
+                        'title': 'Extra Cell',
+                        'description': 'This row has more values compared to '
+                                       'the header row (the first row in the '
+                                       'data source). A key concept is that '
+                                       'all the rows in tabular data must have '
+                                       'the same number of columns.',
+                        'message': 'Row at position "12" has an extra value in '
+                                   'field at position "4"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': '',
+                        'cells': ['x', 'Tokio', 'Japan', 'review'],
+                        'rowNumber': 12,
+                        'cell': 'review',
+                        'fieldName': '',
+                        'fieldNumber': 4}]}]}
+ +
+

Okay, why do we need to use a resource descriptor if the result is the same? The reason is metadata + data packaging. Let's extend our resource descriptor to show how you can edit and validate metadata:

+ +
+
+
from frictionless import describe
+
+resource = describe('capital-invalid.csv')
+resource.add_defined('stats')  # TODO: fix and remove this line
+resource.stats.md5 = 'ae23c74693ca2d3f0e38b9ba3570775b'  # a made-up, incorrect hash
+resource.stats.bytes = 100  # a wrong size on purpose
+resource.to_yaml('capital.resource-bad.yaml')
+
+ +
+

We have added a few made-up, incorrect attributes to our resource descriptor as an example. The validation below should then report these errors in addition to all the errors we had before (though, as the TODO in the next snippet notes, the stats errors are not yet reported). This example shows how concepts like Data Resource can be extremely useful when working with data.

+ +
+
+
frictionless validate capital.resource-bad.yaml  # TODO: it should have 7 errors
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+                          dataset
+┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name            ┃ type  ┃ path                ┃ status  ┃
+┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ capital-invalid │ table │ capital-invalid.csv │ INVALID │
+└─────────────────┴───────┴─────────────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                                capital-invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row  ┃ Field ┃ Type            ┃ Message                                     ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3     │ duplicate-label │ Label "name" in the header at position "3"  │
+│      │       │                 │ is duplicated to a label: at position "2"   │
+│ 10   │ 3     │ missing-cell    │ Row at position "10" has a missing cell in  │
+│      │       │                 │ field "name2" at position "3"               │
+│ 11   │ None  │ blank-row       │ Row at position "11" is completely blank    │
+│ 12   │ 1     │ type-error      │ Type error in the cell "x" in row "12" and  │
+│      │       │                 │ field "id" at position "1": type is         │
+│      │       │                 │ "integer/default"                           │
+│ 12   │ 4     │ extra-cell      │ Row at position "12" has an extra value in  │
+│      │       │                 │ field at position "4"                       │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+ +
+
+
from frictionless import validate
+
+report = validate('capital.resource-bad.yaml')
+print(report)
+
+ +
{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 5, 'warnings': 0, 'seconds': 0.004},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-invalid',
+            'type': 'table',
+            'valid': False,
+            'place': 'capital-invalid.csv',
+            'labels': ['id', 'name', 'name'],
+            'stats': {'errors': 5,
+                      'warnings': 0,
+                      'seconds': 0.004,
+                      'md5': 'dcdeae358cfd50860c18d953e021f836',
+                      'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+                      'bytes': 171,
+                      'fields': 3,
+                      'rows': 11},
+            'warnings': [],
+            'errors': [{'type': 'duplicate-label',
+                        'title': 'Duplicate Label',
+                        'description': 'Two columns in the header row have the '
+                                       'same value. Column names should be '
+                                       'unique.',
+                        'message': 'Label "name" in the header at position "3" '
+                                   'is duplicated to a label: at position "2"',
+                        'tags': ['#table', '#header', '#label'],
+                        'note': 'at position "2"',
+                        'labels': ['id', 'name', 'name'],
+                        'rowNumbers': [1],
+                        'label': 'name',
+                        'fieldName': 'name2',
+                        'fieldNumber': 3},
+                       {'type': 'missing-cell',
+                        'title': 'Missing Cell',
+                        'description': 'This row has less values compared to '
+                                       'the header row (the first row in the '
+                                       'data source). A key concept is that '
+                                       'all the rows in tabular data must have '
+                                       'the same number of columns.',
+                        'message': 'Row at position "10" has a missing cell in '
+                                   'field "name2" at position "3"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': '',
+                        'cells': ['8', 'Warsaw'],
+                        'rowNumber': 10,
+                        'cell': '',
+                        'fieldName': 'name2',
+                        'fieldNumber': 3},
+                       {'type': 'blank-row',
+                        'title': 'Blank Row',
+                        'description': 'This row is empty. A row should '
+                                       'contain at least one value.',
+                        'message': 'Row at position "11" is completely blank',
+                        'tags': ['#table', '#row'],
+                        'note': '',
+                        'cells': [],
+                        'rowNumber': 11},
+                       {'type': 'type-error',
+                        'title': 'Type Error',
+                        'description': 'The value does not match the schema '
+                                       'type and format for this field.',
+                        'message': 'Type error in the cell "x" in row "12" and '
+                                   'field "id" at position "1": type is '
+                                   '"integer/default"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': 'type is "integer/default"',
+                        'cells': ['x', 'Tokio', 'Japan', 'review'],
+                        'rowNumber': 12,
+                        'cell': 'x',
+                        'fieldName': 'id',
+                        'fieldNumber': 1},
+                       {'type': 'extra-cell',
+                        'title': 'Extra Cell',
+                        'description': 'This row has more values compared to '
+                                       'the header row (the first row in the '
+                                       'data source). A key concept is that '
+                                       'all the rows in tabular data must have '
+                                       'the same number of columns.',
+                        'message': 'Row at position "12" has an extra value in '
+                                   'field at position "4"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': '',
+                        'cells': ['x', 'Tokio', 'Japan', 'review'],
+                        'rowNumber': 12,
+                        'cell': 'review',
+                        'fieldName': '',
+                        'fieldNumber': 4}]}]}
+ +
+

Validating a Package

+

A package is a set of resources plus additional metadata. To showcase package validation, we need one more tabular file:

+
+

Download capital-valid.csv to reproduce the examples (right-click and "Save link as").

+
+ +
+
+
cat capital-valid.csv
+
+ +
id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+ +
+
+
with open('capital-valid.csv') as file:
+    print(file.read())
+
+ +
id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+ +
+

Now let's describe and validate a package which contains the data files we have seen so far:

+ +
+
+
frictionless describe capital-*id.csv > capital.package.yaml
+frictionless validate capital.package.yaml
+
+ +
──────────────────────────────────── Tables ────────────────────────────────────
+                                    dataset
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row  ┃ Field ┃ Type          ┃ Message                                       ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ None  │ package-error │ The data package has an error: cannot         │
+│      │       │               │ retrieve metadata "capital.package.yaml"      │
+│      │       │               │ because ""                                    │
+└──────┴───────┴───────────────┴───────────────────────────────────────────────┘
+ +
+
+
from frictionless import describe, validate
+
+# create package descriptor
+package = describe("capital-*id.csv")
+package.to_yaml("capital.package.yaml")
+# validate
+report = validate("capital.package.yaml")
+print(report)
+
+ +
{'valid': False,
+ 'stats': {'tasks': 2, 'errors': 5, 'warnings': 0, 'seconds': 0.008},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-invalid',
+            'type': 'table',
+            'valid': False,
+            'place': 'capital-invalid.csv',
+            'labels': ['id', 'name', 'name'],
+            'stats': {'errors': 5,
+                      'warnings': 0,
+                      'seconds': 0.004,
+                      'md5': 'dcdeae358cfd50860c18d953e021f836',
+                      'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+                      'bytes': 171,
+                      'fields': 3,
+                      'rows': 11},
+            'warnings': [],
+            'errors': [{'type': 'duplicate-label',
+                        'title': 'Duplicate Label',
+                        'description': 'Two columns in the header row have the '
+                                       'same value. Column names should be '
+                                       'unique.',
+                        'message': 'Label "name" in the header at position "3" '
+                                   'is duplicated to a label: at position "2"',
+                        'tags': ['#table', '#header', '#label'],
+                        'note': 'at position "2"',
+                        'labels': ['id', 'name', 'name'],
+                        'rowNumbers': [1],
+                        'label': 'name',
+                        'fieldName': 'name2',
+                        'fieldNumber': 3},
+                       {'type': 'missing-cell',
+                        'title': 'Missing Cell',
+                        'description': 'This row has less values compared to '
+                                       'the header row (the first row in the '
+                                       'data source). A key concept is that '
+                                       'all the rows in tabular data must have '
+                                       'the same number of columns.',
+                        'message': 'Row at position "10" has a missing cell in '
+                                   'field "name2" at position "3"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': '',
+                        'cells': ['8', 'Warsaw'],
+                        'rowNumber': 10,
+                        'cell': '',
+                        'fieldName': 'name2',
+                        'fieldNumber': 3},
+                       {'type': 'blank-row',
+                        'title': 'Blank Row',
+                        'description': 'This row is empty. A row should '
+                                       'contain at least one value.',
+                        'message': 'Row at position "11" is completely blank',
+                        'tags': ['#table', '#row'],
+                        'note': '',
+                        'cells': [],
+                        'rowNumber': 11},
+                       {'type': 'type-error',
+                        'title': 'Type Error',
+                        'description': 'The value does not match the schema '
+                                       'type and format for this field.',
+                        'message': 'Type error in the cell "x" in row "12" and '
+                                   'field "id" at position "1": type is '
+                                   '"integer/default"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': 'type is "integer/default"',
+                        'cells': ['x', 'Tokio', 'Japan', 'review'],
+                        'rowNumber': 12,
+                        'cell': 'x',
+                        'fieldName': 'id',
+                        'fieldNumber': 1},
+                       {'type': 'extra-cell',
+                        'title': 'Extra Cell',
+                        'description': 'This row has more values compared to '
+                                       'the header row (the first row in the '
+                                       'data source). A key concept is that '
+                                       'all the rows in tabular data must have '
+                                       'the same number of columns.',
+                        'message': 'Row at position "12" has an extra value in '
+                                   'field at position "4"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': '',
+                        'cells': ['x', 'Tokio', 'Japan', 'review'],
+                        'rowNumber': 12,
+                        'cell': 'review',
+                        'fieldName': '',
+                        'fieldNumber': 4}]},
+           {'name': 'capital-valid',
+            'type': 'table',
+            'valid': True,
+            'place': 'capital-valid.csv',
+            'labels': ['id', 'name'],
+            'stats': {'errors': 0,
+                      'warnings': 0,
+                      'seconds': 0.003,
+                      'md5': 'e7b6592a0a4356ba834e4bf1c8e8c7f8',
+                      'sha256': '04202244cbb3662b0f97bfa65adfad045724cbc8d798a7c0eb85533e9da40a5b',
+                      'bytes': 50,
+                      'fields': 2,
+                      'rows': 5},
+            'warnings': [],
+            'errors': []}]}
+ +
+

As we can see, the result is in a similar format to what we have already seen, and shows errors as we expected: we have one invalid resource and one valid resource.

+
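If you only need that per-resource summary programmatically, a small sketch over the report's tasks (whose name and valid fields appear in the output above) does the trick:

from frictionless import validate
+
+report = validate('capital.package.yaml')
+for task in report.tasks:
+    print(task.name, 'valid' if task.valid else 'INVALID')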

Validating an Inquiry

+
+

The Inquiry is an advanced concept mostly used by software integrators. For example, under the hood, Frictionless Framework uses inquiries to implement client-server validation within the built-in API. Feel free to skip this section if this information isn't relevant to you.

+
+

An Inquiry is a declarative representation of a validation job. It gives you the ability to create, export, and share arbitrary validation jobs containing a set of individual validation tasks. Tasks in the Inquiry accept the same arguments, written in camelCase, as the corresponding validate functions.

+

Let's create an Inquiry that includes an individual file validation and a resource validation. In this example we will use the data file capital-valid.csv and the resource capital.resource.yaml, which describes the invalid data file we have already seen:

+ +
+
+
from frictionless import Inquiry, InquiryTask
+
+inquiry = Inquiry(tasks=[
+    InquiryTask(path='capital-valid.csv'),
+    InquiryTask(resource='capital.resource.yaml'),
+])
+inquiry.to_yaml('capital.inquiry.yaml')
+
+ +
+

As usual, let's run validation:

+ +
+
+
frictionless validate capital.inquiry.yaml
+
+ +
─────────────────────────────────── Dataset ────────────────────────────────────
+                          dataset
+┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name            ┃ type  ┃ path                ┃ status  ┃
+┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ capital-valid   │ table │ capital-valid.csv   │ VALID   │
+│ capital-invalid │ table │ capital-invalid.csv │ INVALID │
+└─────────────────┴───────┴─────────────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+                                capital-invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row  ┃ Field ┃ Type            ┃ Message                                     ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3     │ duplicate-label │ Label "name" in the header at position "3"  │
+│      │       │                 │ is duplicated to a label: at position "2"   │
+│ 10   │ 3     │ missing-cell    │ Row at position "10" has a missing cell in  │
+│      │       │                 │ field "name2" at position "3"               │
+│ 11   │ None  │ blank-row       │ Row at position "11" is completely blank    │
+│ 12   │ 1     │ type-error      │ Type error in the cell "x" in row "12" and  │
+│      │       │                 │ field "id" at position "1": type is         │
+│      │       │                 │ "integer/default"                           │
+│ 12   │ 4     │ extra-cell      │ Row at position "12" has an extra value in  │
+│      │       │                 │ field at position "4"                       │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+ +
+
+
from frictionless import validate
+
+report = validate("capital.inquiry.yaml")
+print(report)
+
+ +
{'valid': False,
+ 'stats': {'tasks': 2, 'errors': 5, 'warnings': 0, 'seconds': 0.011},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-valid',
+            'type': 'table',
+            'valid': True,
+            'place': 'capital-valid.csv',
+            'labels': ['id', 'name'],
+            'stats': {'errors': 0,
+                      'warnings': 0,
+                      'seconds': 0.004,
+                      'md5': 'e7b6592a0a4356ba834e4bf1c8e8c7f8',
+                      'sha256': '04202244cbb3662b0f97bfa65adfad045724cbc8d798a7c0eb85533e9da40a5b',
+                      'bytes': 50,
+                      'fields': 2,
+                      'rows': 5},
+            'warnings': [],
+            'errors': []},
+           {'name': 'capital-invalid',
+            'type': 'table',
+            'valid': False,
+            'place': 'capital-invalid.csv',
+            'labels': ['id', 'name', 'name'],
+            'stats': {'errors': 5,
+                      'warnings': 0,
+                      'seconds': 0.003,
+                      'md5': 'dcdeae358cfd50860c18d953e021f836',
+                      'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+                      'bytes': 171,
+                      'fields': 3,
+                      'rows': 11},
+            'warnings': [],
+            'errors': [{'type': 'duplicate-label',
+                        'title': 'Duplicate Label',
+                        'description': 'Two columns in the header row have the '
+                                       'same value. Column names should be '
+                                       'unique.',
+                        'message': 'Label "name" in the header at position "3" '
+                                   'is duplicated to a label: at position "2"',
+                        'tags': ['#table', '#header', '#label'],
+                        'note': 'at position "2"',
+                        'labels': ['id', 'name', 'name'],
+                        'rowNumbers': [1],
+                        'label': 'name',
+                        'fieldName': 'name2',
+                        'fieldNumber': 3},
+                       {'type': 'missing-cell',
+                        'title': 'Missing Cell',
+                        'description': 'This row has less values compared to '
+                                       'the header row (the first row in the '
+                                       'data source). A key concept is that '
+                                       'all the rows in tabular data must have '
+                                       'the same number of columns.',
+                        'message': 'Row at position "10" has a missing cell in '
+                                   'field "name2" at position "3"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': '',
+                        'cells': ['8', 'Warsaw'],
+                        'rowNumber': 10,
+                        'cell': '',
+                        'fieldName': 'name2',
+                        'fieldNumber': 3},
+                       {'type': 'blank-row',
+                        'title': 'Blank Row',
+                        'description': 'This row is empty. A row should '
+                                       'contain at least one value.',
+                        'message': 'Row at position "11" is completely blank',
+                        'tags': ['#table', '#row'],
+                        'note': '',
+                        'cells': [],
+                        'rowNumber': 11},
+                       {'type': 'type-error',
+                        'title': 'Type Error',
+                        'description': 'The value does not match the schema '
+                                       'type and format for this field.',
+                        'message': 'Type error in the cell "x" in row "12" and '
+                                   'field "id" at position "1": type is '
+                                   '"integer/default"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': 'type is "integer/default"',
+                        'cells': ['x', 'Tokio', 'Japan', 'review'],
+                        'rowNumber': 12,
+                        'cell': 'x',
+                        'fieldName': 'id',
+                        'fieldNumber': 1},
+                       {'type': 'extra-cell',
+                        'title': 'Extra Cell',
+                        'description': 'This row has more values compared to '
+                                       'the header row (the first row in the '
+                                       'data source). A key concept is that '
+                                       'all the rows in tabular data must have '
+                                       'the same number of columns.',
+                        'message': 'Row at position "12" has an extra value in '
+                                   'field at position "4"',
+                        'tags': ['#table', '#row', '#cell'],
+                        'note': '',
+                        'cells': ['x', 'Tokio', 'Japan', 'review'],
+                        'rowNumber': 12,
+                        'cell': 'review',
+                        'fieldName': '',
+                        'fieldNumber': 4}]}]}
+ +
+

At first sight, it might not be clear why such a construct exists, but when your validation workflow gets complex, the Inquiry can provide a lot of flexibility and power.

+
+

The Inquiry will use multiprocessing if the parallel flag is provided. It can speed up your validation dramatically, especially on a processor with 4+ cores.

+
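A minimal sketch of that option, assuming parallel is exposed as a keyword on the Python validate function as well as a CLI flag:

from frictionless import validate
+
+# run the inquiry tasks across multiple processes (assumed keyword)
+report = validate('capital.inquiry.yaml', parallel=True)
+print(report.valid)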
+

Validation Report

+

All the validate functions return a Validation Report. This is a unified object containing information about a validation: source details, errors, etc. Let's explore a report:

+ +
+
+
from frictionless import validate
+
+report = validate('capital-invalid.csv', pick_errors=['duplicate-label'])
+print(report)
+
+ +
{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.006},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-invalid',
+            'type': 'table',
+            'valid': False,
+            'place': 'capital-invalid.csv',
+            'labels': ['id', 'name', 'name'],
+            'stats': {'errors': 1,
+                      'warnings': 0,
+                      'seconds': 0.006,
+                      'md5': 'dcdeae358cfd50860c18d953e021f836',
+                      'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+                      'bytes': 171,
+                      'fields': 3,
+                      'rows': 11},
+            'warnings': [],
+            'errors': [{'type': 'duplicate-label',
+                        'title': 'Duplicate Label',
+                        'description': 'Two columns in the header row have the '
+                                       'same value. Column names should be '
+                                       'unique.',
+                        'message': 'Label "name" in the header at position "3" '
+                                   'is duplicated to a label: at position "2"',
+                        'tags': ['#table', '#header', '#label'],
+                        'note': 'at position "2"',
+                        'labels': ['id', 'name', 'name'],
+                        'rowNumbers': [1],
+                        'label': 'name',
+                        'fieldName': 'name2',
+                        'fieldNumber': 3}]}]}
+ +
+

As we can see, there is a lot of information; you can find a detailed description of the Validation Report in the API Reference. Errors are grouped by tasks (i.e. data files); for some validations there can be dozens of tasks. Let's use the report.flatten function to simplify the representation of errors. This function represents a report as a list of errors:

+ +
+
+
from pprint import pprint
+from frictionless import validate
+
+report = validate("capital-invalid.csv", pick_errors=["duplicate-label"])
+pprint(report.flatten(["rowNumber", "fieldNumber", "type", "message"]))
+
+ +
[[None,
+  3,
+  'duplicate-label',
+  'Label "name" in the header at position "3" is duplicated to a label: at '
+  'position "2"']]
+ +
+

In some situations, an error can't be associated with a task; then it goes to the top-level report.errors property:

+ +
+
+
from frictionless import validate
+
+report = validate("bad.json", type='schema')
+print(report)
+
+ +
{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.0},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'bad',
+            'type': 'json',
+            'valid': False,
+            'place': 'bad.json',
+            'labels': [],
+            'stats': {'errors': 1, 'warnings': 0, 'seconds': 0.0},
+            'warnings': [],
+            'errors': [{'type': 'schema-error',
+                        'title': 'Schema Error',
+                        'description': 'Provided schema is not valid.',
+                        'message': 'Schema is not valid: cannot retrieve '
+                                   'metadata "bad.json" because "[Errno 2] No '
+                                   'such file or directory: \'bad.json\'"',
+                        'tags': [],
+                        'note': 'cannot retrieve metadata "bad.json" because '
+                                '"[Errno 2] No such file or directory: '
+                                '\'bad.json\'"'}]}]}
+ +
+

Validation Errors

+

The Error object is at the heart of the validation process. The Report has report.errors and report.tasks[].errors properties that can contain Error objects. Let's explore one by taking a deeper look at the duplicate-label error:

+ +
+
+
from frictionless import validate
+
+report = validate("capital-invalid.csv", pick_errors=["duplicate-label"])
+error = report.error  # this is only available for one table / one error situation
+print(f'Type: "{error.type}"')
+print(f'Title: "{error.title}"')
+print(f'Tags: "{error.tags}"')
+print(f'Note: "{error.note}"')
+print(f'Message: "{error.message}"')
+print(f'Description: "{error.description}"')
+
+ +
Type: "duplicate-label"
+Title: "Duplicate Label"
+Tags: "['#table', '#header', '#label']"
+Note: "at position "2""
+Message: "Label "name" in the header at position "3" is duplicated to a label: at position "2""
+Description: "Two columns in the header row have the same value. Column names should be unique."
+ +
+

Above, we have listed the universal error properties. Depending on the error type, there can be additional ones. For example, for our duplicate-label error:

+ +
+
+
from frictionless import validate
+
+report = validate("capital-invalid.csv", pick_errors=["duplicate-label"])
+error = report.error  # this is only available for one table / one error situation
+print(error)
+
+ +
{'type': 'duplicate-label',
+ 'title': 'Duplicate Label',
+ 'description': 'Two columns in the header row have the same value. Column '
+                'names should be unique.',
+ 'message': 'Label "name" in the header at position "3" is duplicated to a '
+            'label: at position "2"',
+ 'tags': ['#table', '#header', '#label'],
+ 'note': 'at position "2"',
+ 'labels': ['id', 'name', 'name'],
+ 'rowNumbers': [1],
+ 'label': 'name',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3}
+ +
+
+
+

Please explore the Errors Reference to learn about all the available errors and their properties.

+

Available Checks

+

There are various validation checks included in the core Frictionless Framework along with an ability to create custom checks. See Validation Checks for a list of available checks.

+ +
+
+
from pprint import pprint
+from frictionless import validate, checks
+
+extra_checks = [checks.sequential_value(field_name='id')]
+report = validate('capital-invalid.csv', checks=extra_checks)
+pprint(report.flatten(["rowNumber", "fieldNumber", "type", "note"]))
+
+ +
[[None, 3, 'duplicate-label', 'at position "2"'],
+ [10, 3, 'missing-cell', ''],
+ [10, 1, 'sequential-value', 'the value is not sequential'],
+ [11, None, 'blank-row', ''],
+ [12, 1, 'type-error', 'type is "integer/default"'],
+ [12, 4, 'extra-cell', '']]
+ +
+
+
+
+

Note that only the Baseline Check is enabled by default. Other built-in checks need to be activated as shown above.

+
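For example, several built-in checks can be activated alongside the baseline in one call; this sketch assumes the duplicate_row and row_constraint checks listed in Validation Checks:

from frictionless import validate, checks
+
+report = validate(
+    'capital-invalid.csv',
+    checks=[
+        checks.duplicate_row(),
+        checks.row_constraint(formula='id > 0'),
+    ],
+)
+print(report.valid)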
+

Custom Checks

+

There are many cases when the built-in Frictionless checks are not enough, for instance, when you need a business-logic rule or a specific quality requirement for the data. With Frictionless it's very easy to use your own custom checks. Let's see how with an example:

+ +
+
+
from pprint import pprint
+from frictionless import Check, validate, errors
+
+# Create check
+class forbidden_two(Check):
+    Errors = [errors.CellError]
+    def validate_row(self, row):
+        if row['header'] == 2:
+            note = '2 is forbidden!'
+            yield errors.CellError.from_row(row, note=note, field_name='header')
+
+# Validate table
+source = b'header\n1\n2\n3'
+report = validate(source, format='csv', checks=[forbidden_two()])
+pprint(report.flatten(["rowNumber", "fieldNumber", "type", "note"]))
+
+ +
[[3, 1, 'cell-error', '2 is forbidden!']]
+ +
+

Usually, it also makes sense to create a custom error for your custom check. The Check class provides other useful methods, such as validate_header; please read the API Reference for more details.

+
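As a rough sketch of pairing a custom error with a custom check (the class attributes follow the pattern of the built-in errors; all names here are made up):

from frictionless import Check, errors
+
+# a hypothetical custom error reusing the cell-error machinery
+class forbidden_two_error(errors.CellError):
+    type = 'forbidden-two'
+    title = 'Forbidden Two'
+    description = 'The value 2 is forbidden in this field.'
+    template = 'Forbidden value in field "{fieldName}": {note}'
+
+class forbidden_two(Check):
+    Errors = [forbidden_two_error]
+    def validate_row(self, row):
+        if row['header'] == 2:
+            yield forbidden_two_error.from_row(row, note='2 is forbidden!', field_name='header')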

Learn more about custom checks in the Check Guide.

+

Pick/Skip Errors

+

We can pick or skip errors by providing a list of error types. This is useful when you already know your data has some errors, but you want to ignore them for now, for instance, a data table with repeating header names. Let's see an example of how to pick and skip errors:

+ +
+
+
from pprint import pprint
+from frictionless import validate
+
+report1 = validate("capital-invalid.csv", pick_errors=["duplicate-label"])
+report2 = validate("capital-invalid.csv", skip_errors=["duplicate-label"])
+pprint(report1.flatten(["rowNumber", "fieldNumber", "type"]))
+pprint(report2.flatten(["rowNumber", "fieldNumber", "type"]))
+
+ +
[[None, 3, 'duplicate-label']]
+[[10, 3, 'missing-cell'],
+ [11, None, 'blank-row'],
+ [12, 1, 'type-error'],
+ [12, 4, 'extra-cell']]
+ +
+

It's also possible to use error tags (for more information please consult the Errors Reference):

+ +
+
+
from pprint import pprint
+from frictionless import validate
+
+report1 = validate("capital-invalid.csv", pick_errors=["#header"])
+report2 = validate("capital-invalid.csv", skip_errors=["#row"])
+pprint(report1.flatten(["rowNumber", "fieldNumber", "type"]))
+pprint(report2.flatten(["rowNumber", "fieldNumber", "type"]))
+
+ +
[[None, 3, 'duplicate-label']]
+[[None, 3, 'duplicate-label']]
+ +
+

Limit Errors

+

This option allows you to limit the number of errors, and can be used when you need a quick check or want to "fail fast". For instance, here we use limit_errors to find just the first error and add it to our report:

+
from pprint import pprint
+from frictionless import validate
+
+report = validate("capital-invalid.csv", limit_errors=1)
+pprint(report.flatten(["rowNumber", "fieldNumber", "type"]))
+
+
[[None, 3, 'duplicate-label']]
+
+
\ No newline at end of file
diff --git a/docs/portals/ckan.html b/docs/portals/ckan.html
new file mode 100644
index 0000000000..eb7ec3a33f
--- /dev/null
+++ b/docs/portals/ckan.html
@@ -0,0 +1,3916 @@

Ckan Portal

+

With the CKAN portal feature you can load packages from, and publish packages to, CKAN, an open-source Data Management System.

+

Installation

+

To install this plugin, run:

+ +
+
+
pip install frictionless[ckan] --pre
+pip install 'frictionless[ckan]' --pre # for zsh shell
+
+ +
+

Reading a Package

+

To import a Dataset from a CKAN instance as a Frictionless Package, you can do as below:

+ +
+
+
from frictionless.portals import CkanControl
+from frictionless import Package
+
+ckan_control = CkanControl()
+package = Package('https://legado.dados.gov.br/dataset/bolsa-familia-pagamentos', control=ckan_control)
+
+ +
+

Where 'https://legado.dados.gov.br/dataset/bolsa-familia-pagamentos' is the URL of the CKAN dataset. This will download the dataset and the metadata of all its resources.

+
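Once loaded, the package behaves like any other Frictionless Package; for instance, a small sketch listing the downloaded resources:

from frictionless.portals import CkanControl
+from frictionless import Package
+
+ckan_control = CkanControl()
+package = Package('https://legado.dados.gov.br/dataset/bolsa-familia-pagamentos', control=ckan_control)
+print(package.resource_names)  # names of the downloaded resources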

You can pass parameters to CKAN Control to configure it, such as the CKAN instance base URL (baseurl) and the dataset that you want to download (dataset):

+ +
+
+
from frictionless.portals import CkanControl
+from frictionless import Package
+
+ckan_control = CkanControl(baseurl='https://legado.dados.gov.br', dataset='bolsa-familia-pagamentos')
+package = Package(control=ckan_control)
+
+ +
+

You don't need to pass the dataset parameter to CkanControl. If you pass only the baseurl, you can download a package as:

+ +
+
+
from frictionless.portals import CkanControl
+from frictionless import Package
+
+ckan_control = CkanControl(baseurl='https://legado.dados.gov.br')
+package = Package('bolsa-familia-pagamentos', control=ckan_control)
+
+ +
+

Ignoring a Resource Schema

+

If the CKAN dataset has a resource whose schema contains errors, you can still load the package by passing the parameter ignore_schema=True to CKAN Control:

+ +
+
+
from frictionless.portals import CkanControl
+from frictionless import Package
+
+ckan_control = CkanControl(baseurl='https://legado.dados.gov.br', ignore_schema=True)
+package = Package('bolsa-familia-pagamentos', control=ckan_control)
+
+ +
+

This will download the dataset and all its resources, saving each resource's original schema under original_schema.

+

Publishing a package

+

To publish a Package to a CKAN instance you will need an API key from a CKAN user with permission to create datasets. This key can be passed to CKAN Control as the parameter apikey.

+ +
+
+
from frictionless.portals import CkanControl
+from frictionless import Package
+
+ckan_control = CkanControl(baseurl='https://legado.dados.gov.br', apikey='YOUR-SECRET-API-KEY')
+package = Package(...) # Create your package
+package.publish(control=ckan_control)
+
+ +
+

Reading a Catalog

+

You can download a list of CKAN datasets using the Catalog.

+ +
+
+

+from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br')
+c = Catalog(control=ckan_control)
+
+ +
+

This will download all datasets from the instance, limited only by the maximum number of datasets returned by the instance's CKAN API. If the instance returns only 10 datasets by default, you can request more by passing the parameter num_packages. For example, to download 1000 datasets:

+ +
+
+

+from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', num_packages=1000)
+c = Catalog(control=ckan_control)
+
+ +
+

When you request a large number of packages from CKAN, some of them may not have a valid Package descriptor according to the specifications. In that case the standard behaviour is to stop downloading and raise an exception. If you want to ignore individual package errors, you can pass the parameter ignore_package_errors=True:

+ +
+
+

+from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', ignore_package_errors=True, num_packages=1000)
+c = Catalog(control=ckan_control)
+
+ +
+

The output of the command above lists the ids of the CKAN datasets with errors and the total number of packages returned by your query to the CKAN instance:

+
Error in CKAN dataset 8d60eff7-1a46-42ef-be64-e8979117a378: [package-error] The data package has an error: descriptor is not valid (The data package has an error: property "contributors[].email" is not valid "email")
+Error in CKAN dataset 933d7164-8128-4e12-97e6-208bc4935bcb: [package-error] The data package has an error: descriptor is not valid (The data package has an error: property "contributors[].email" is not valid "email")
+Error in CKAN dataset 93114fec-01c2-4ef5-8dfe-67da5027d568: [package-error] The data package has an error: descriptor is not valid (The data package has an error: property "contributors[].email" is not valid "email") (The data package has an error: property "contributors[].email" is not valid "email")
+Total number of packages: 13786
+
+

In the example above, 1000 packages were downloaded out of a total of 13786. You can download the remaining packages by passing an offset:

+ +
+
+

+from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', ignore_package_errors=True, results_offset=1000)
+c = Catalog(control=ckan_control)
+
+ +
+

This will download the next 1000 packages, skipping the first 1000.

+
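Putting num_packages and results_offset together, a paging loop might look like this sketch (the page size and range are arbitrary):

from frictionless import portals, Catalog
+
+page_size = 1000
+for offset in range(0, 3000, page_size):
+    control = portals.CkanControl(
+        baseurl='https://legado.dados.gov.br',
+        ignore_package_errors=True,
+        num_packages=page_size,
+        results_offset=offset,
+    )
+    catalog = Catalog(control=control)
+    print(offset, len(catalog.packages))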

Fetching the datasets from an Organization or Group

+

To fetch all packages from an organization, you can use the CKAN Control parameter organization_name. For example, to fetch all datasets from the organization https://legado.dados.gov.br/organization/agencia-espacial-brasileira-aeb you can do as follows:

+ +
+
+
from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', organization_name='agencia-espacial-brasileira-aeb')
+c = Catalog(control=ckan_control)
+
+ +
+

Similarly, if you want to download all datasets from a CKAN Group, you can pass the parameter group_id to CKAN Control as:

+ +
+
+
from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', group_id='ciencia-informacao-e-comunicacao')
+c = Catalog(control=ckan_control)
+
+ +
+

Using CKAN search

+

You can also fetch only the datasets that are returned by the CKAN Package Search endpoint. You can pass the search parameters as the parameter search to CKAN Control.

+ +
+
+
from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', search={'q': 'name:bolsa*'})
+c = Catalog(control=ckan_control)
+
+ +
+

Reference

+
+ + +
+
+ +

portals.CkanControl (class)

+ +
+
+ + +
+

portals.CkanControl (class)

+

Ckan control representation

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, baseurl: Optional[str] = None, dataset: Optional[str] = None, apikey: Optional[str] = None, ignore_package_errors: Optional[bool] = False, ignore_schema: Optional[bool] = False, group_id: Optional[str] = None, organization_name: Optional[str] = None, search: Optional[Dict[str, Any]] = None, num_packages: Optional[int] = None, results_offset: Optional[int] = None, allow_update: Optional[bool] = False) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + baseurl + (Optional[str])
  • +
  • + dataset + (Optional[str])
  • +
  • + apikey + (Optional[str])
  • +
  • + ignore_package_errors + (Optional[bool])
  • +
  • + ignore_schema + (Optional[bool])
  • +
  • + group_id + (Optional[str])
  • +
  • + organization_name + (Optional[str])
  • +
  • + search + (Optional[Dict[str, Any]])
  • +
  • + num_packages + (Optional[int])
  • +
  • + results_offset + (Optional[int])
  • +
  • + allow_update + (Optional[bool])
  • +
+
+ +
+

portals.ckanControl.baseurl (property)

+

+ Endpoint URL for the CKAN instance, e.g. https://dados.gov.br +

+

Signature

+

Optional[str]

+
+
+

portals.ckanControl.dataset (property)

+

+ Unique identifier of the dataset to read or write. +

+

Signature

+

Optional[str]

+
+
+

portals.ckanControl.apikey (property)

+

+ The access token to authenticate to the CKAN instance. It is required to write files to the CKAN instance. +

+

Signature

+

Optional[str]

+
+
+

portals.ckanControl.ignore_package_errors (property)

+

+ Ignore Package errors in a Catalog. If multiple packages are being downloaded and one fails with an invalid descriptor, continue downloading the rest. +

+

Signature

+

Optional[bool]

+
+
+

portals.ckanControl.ignore_schema (property)

+

+ Ignore dataset resources schemas +

+

Signature

+

Optional[bool]

+
+
+

portals.ckanControl.group_id (property)

+

+ CKAN Group id to get datasets in a Catalog +

+

Signature

+

Optional[str]

+
+
+

portals.ckanControl.organization_name (property)

+

+ CKAN Organization name to get datasets in a Catalog +

+

Signature

+

Optional[str]

+
+
+

portals.ckanControl.search (property)

+

+ CKAN Search parameters as defined on https://docs.ckan.org/en/2.9/api/#ckan.logic.action.get.package_search +

+

Signature

+

Optional[Dict[str, Any]]

+
+
+

portals.ckanControl.num_packages (property)

+

+ Maximum number of packages to fetch +

+

Signature

+

Optional[int]

+
+
+

portals.ckanControl.results_offset (property)

+

+ Results page number +

+

Signature

+

Optional[int]

+
+
+

portals.ckanControl.allow_update (property)

+

+ Update a dataset on publish if an id is provided in the package descriptor +

+

Signature

+

Optional[bool]

+
+ + + + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/portals/github.html b/docs/portals/github.html new file mode 100644 index 0000000000..171bc9de10 --- /dev/null +++ b/docs/portals/github.html @@ -0,0 +1,3972 @@ + + + + + + + + +Github Portal | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Github Portal

+

The Github read and publish features make it easy to share data between Frictionless and GitHub repositories. All read/write functionality is a wrapper around the PyGithub library, which is used under the hood to connect to the GitHub API.

+

Installation

+

We need to install github extra dependencies to use this feature:

+ +
+
+
pip install frictionless[github] --pre
+pip install 'frictionless[github]' --pre # for zsh shell
+
+ +
+

Reading Package

+

You can read data from a github repository as follows:

+ +
+
+
from frictionless import Package
+
+package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
+print(package)
+
+ +
+
{'name': 'test-package',
+    'resources': [{'name': 'first-resource',
+        'type': 'table',
+        'path': 'table.xls',
+        'scheme': 'file',
+        'format': 'xls',
+        'mediatype': 'application/vnd.ms-excel',
+        'schema': {'fields': [{'name': 'id', 'type': 'number'},
+            {'name': 'name', 'type': 'string'}]}}]}
+
+

To increase the access limit, pass 'apikey' as the param to the reader function as follows:

+ +
+
+
from frictionless import portals, Package
+
+control = portals.GithubControl(apikey=apikey)
+package = Package("https://github.com/fdtester/test-repo-with-datapackage-json", control=control)
+print(package)
+
+ +
+

The reader function can read a package from repos with or without a data package descriptor. If the repo does not have a descriptor, it will create one with the same name as the repo. By default, the function reads files of type csv, xlsx and xls, but we can set the file types using control parameters.

+

If the repo has a descriptor it simply returns the descriptor as shown above.

+

Once you have read the package from the repo, you can easily access the resources and their data, for example:

+ +
+
+
from frictionless import Package
+
+package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
+print(package.get_resource('first-resource').read_rows())
+
+ +
+
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+

Reading Catalog

+

A catalog is a container for packages. We can read single or multiple repositories from GitHub and create a catalog.

+ +
+
+
from frictionless import portals, Catalog
+
+control = portals.GithubControl(search="'TestAction: Read' in:readme", apikey=apikey)
+catalog = Catalog(
+        "https://github.com/fdtester", control=control
+    )
+print("Total packages", len(catalog.packages))
+print(catalog.packages[:2])
+
+ +
+
Total packages 4
+[{'resources': [{'name': 'capitals',
+                'type': 'table',
+                'path': 'data/capitals.csv',
+                'scheme': 'file',
+                'format': 'csv',
+                'encoding': 'utf-8',
+                'mediatype': 'text/csv',
+                'dialect': {'csv': {'skipInitialSpace': True}},
+                'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                                      {'name': 'cid', 'type': 'integer'},
+                                      {'name': 'name', 'type': 'string'}]}}]},
+ {'name': 'test-repo-jquery',
+ 'resources': [{'name': 'country-1',
+                'type': 'table',
+                'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
+                'scheme': 'https',
+                'format': 'csv',
+                'mediatype': 'text/csv'}]}]
+
+

To read a catalog, we need an authenticated user, so we have to pass the token as 'apikey' to the function. In the above example we are using search text to filter the repositories down to a small number. The search field is not mandatory.

+

We can simply use 'control' parameters and get the same result as above, for example:

+ +
+
+
from frictionless import portals, Catalog
+
+control = portals.GithubControl(search="'TestAction: Read' in:readme", user="fdtester", apikey=apikey)
+catalog = Catalog(control=control)
+print("Total packages", len(catalog.packages))
+print(catalog.packages[:2])
+
+ +
+

As shown in the example above, we can use different qualifiers to search the repos. The above example searches for all the repos which have 'TestAction: Read' text in their readme files. Similarly, we can use many different qualifiers and combinations of those. To get the full list of qualifiers you can check the GitHub documentation here.

+

Some examples of the qualifiers:

+
'jquery' in:name
+'jquery' in:name user:name
+sort:updated-asc 'TestAction: Read' in:readme
+
+

If we want to read the list of repositories of user 'fdtester' which have 'jquery' in their name, then we write the search query as follows:

+ +
+
+
from frictionless import portals, Catalog
+
+control = portals.GithubControl(apikey=apikey, search="user:fdtester jquery in:name")
+catalog = Catalog(control=control)
+print(catalog.packages)
+
+ +
+
[{'name': 'test-repo-jquery',
+ 'resources': [{'name': 'country-1',
+                'type': 'table',
+                'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
+                'scheme': 'https',
+                'format': 'csv',
+                'mediatype': 'text/csv'}]}]
+
+

There is only one repository with 'jquery' in its name in this user's account, so only one repository was returned.

+

We can also read repositories in a defined order using the 'sort' param or qualifier. Here we are reading the repos with 'TestAction: Read' text in the readme file, most recently updated first, for example:

+ +
+
+
from frictionless import portals, Catalog
+
+control = portals.GithubControl(apikey=apikey, search="user:fdtester sort:updated-desc 'TestAction: Read' in:readme")
+catalog = Catalog(control=control)
+for index,package in enumerate(catalog.packages):
+    print(f"package:{index}", "\n")
+    print(package)
+
+ +
+
package:0
+
+{'name': 'test-repo-jquery',
+ 'resources': [{'name': 'country-1',
+                'type': 'table',
+                'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
+                'scheme': 'https',
+                'format': 'csv',
+                'mediatype': 'text/csv'}]}
+package:1
+
+{'resources': [{'name': 'capitals',
+                'type': 'table',
+                'path': 'data/capitals.csv',
+                'scheme': 'file',
+                'format': 'csv',
+                'encoding': 'utf-8',
+                'mediatype': 'text/csv',
+                'dialect': {'csv': {'skipInitialSpace': True}},
+                'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                                      {'name': 'cid', 'type': 'integer'},
+                                      {'name': 'name', 'type': 'string'}]}}]}
+package:2
+
+{'name': 'test-tabulator',
+ 'resources': [{'name': 'first-resource',
+                'path': 'table.xls',
+                'schema': {'fields': [{'name': 'id', 'type': 'number'},
+                                      {'name': 'name', 'type': 'string'}]}},
+               {'name': 'number-two',
+                'path': 'table-reverse.csv',
+                'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                                      {'name': 'name', 'type': 'string'}]}}]}
+
+

Publishing Data

+

To write data to the repository, we use the Package.publish function as follows:

+ +
+
+
from frictionless import portals, Package
+
+package = Package('1174/datapackage.json')
+control = portals.GithubControl(repo="test-new-repo-doc", name='FD', email=email, apikey=apikey)
+response = package.publish(control=control)
+print(response)
+
+ +
+
Repository(full_name="fdtester/test-new-repo-doc")
+
+

We need to provide the name and email explicitly if the user doesn't have a name set in their GitHub account, or if the email is private and hidden. Otherwise, this info will be taken from the user account. In order to be able to publish/write to a repository, we need an api token with 'repository write' access.

+

If the package is successfully published, the response is a 'Repository' instance.
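Since this is a PyGithub Repository object, its standard attributes are available for inspection; a small sketch (the attribute names come from PyGithub, not from Frictionless itself):

# 'response' is the PyGithub Repository returned by package.publish above
print(response.full_name)  # e.g. fdtester/test-new-repo-doc
print(response.html_url)   # link to the newly created repository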

+

Configuration

+

We can control the behavior of all three functions above using various params.

+

For example, to read only 'csv' files in a package, we use the following code:

+ +
+
+
from frictionless import portals, Package
+
+control = portals.GithubControl(user="fdtester", formats=["csv"], repo="test-repo-without-datapackage")
+package = Package(control=control)
+print(package)
+
+ +
+
{'name': 'test-package',
+ 'resources': [{'name': 'first-resource',
+                'type': 'table',
+                'path': 'table.xls',
+                'scheme': 'file',
+                'format': 'xls',
+                'mediatype': 'application/vnd.ms-excel',
+                'schema': {'fields': [{'name': 'id', 'type': 'number'},
+                                      {'name': 'name', 'type': 'string'}]}}]}
+
+

In order to read the first page of the search result and create a catalog, we use the per_page and page params as follows:

+ +
+
+
from frictionless import portals, Catalog
+
+control = portals.GithubControl(apikey=apikey, search="user:fdtester sort:updated-desc 'TestAction: Read' in:readme", per_page=1, page=1)
+catalog = Catalog(control=control)
+print(catalog.packages)
+
+ +
+
[{'name': 'test-repo-jquery',
+ 'resources': [{'name': 'country-1',
+                'type': 'table',
+                'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
+                'scheme': 'https',
+                'format': 'csv',
+                'mediatype': 'text/csv'}]}]
+
+

Similarly, we can also control the write function using params as follows:

+
from frictionless import portals, Package
+
+package = Package('datapackage.json')
+control = portals.GithubControl(repo="test-repo", name='FD Test', email="test@gmail", apikey=apikey)
+response = package.publish(control=control)
+print(response)
+
+
Repository(full_name="fdtester/test-repo")
+
+

Reference

+
+ + +
+
+ +

portals.GithubControl (class)

+ +
+
+ + +
+

portals.GithubControl (class)

+

Github control representation

+

Signature

+

(*, title: Optional[str] = None, description: Optional[str] = None, apikey: Optional[str] = None, basepath: Optional[str] = None, email: Optional[str] = None, formats: Optional[List[str]] = [csv, tsv, xlsx, xls, jsonl, ndjson], name: Optional[str] = None, order: Optional[str] = None, page: Optional[int] = None, per_page: Optional[int] = 30, repo: Optional[str] = None, search: Optional[str] = None, sort: Optional[str] = None, user: Optional[str] = None, filename: Optional[str] = None, enable_pages: Optional[bool] = None) -> None

+

Parameters

+
    +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + apikey + (Optional[str])
  • +
  • + basepath + (Optional[str])
  • +
  • + email + (Optional[str])
  • +
  • + formats + (Optional[List[str]])
  • +
  • + name + (Optional[str])
  • +
  • + order + (Optional[str])
  • +
  • + page + (Optional[int])
  • +
  • + per_page + (Optional[int])
  • +
  • + repo + (Optional[str])
  • +
  • + search + (Optional[str])
  • +
  • + sort + (Optional[str])
  • +
  • + user + (Optional[str])
  • +
  • + filename + (Optional[str])
  • +
  • + enable_pages + (Optional[bool])
  • +
+
+ +
+

portals.githubControl.apikey (property)

+

The access token to authenticate to the GitHub API. It is required to write files to a GitHub repo. For reading it is optional, however using an apikey increases the api access limit from 60 to 5000 requests per hour. To write, the access token has to have write repository access. +

+

Signature

+

Optional[str]

+
+
+

portals.githubControl.basepath (property)

+

Base path is the base folder that the package and resource files will be written to.

+

Signature

+

Optional[str]

+
+
+

portals.githubControl.email (property)

+

Email is used while publishing the data to the GitHub repo. It should be set explicitly if the primary email for the GitHub account is not set to public.

+

Signature

+

Optional[str]

+
+
+

portals.githubControl.formats (property)

+

Formats instructs the plugin to only read specified types of files. By default it is set to 'csv', 'tsv', 'xlsx', 'xls', 'jsonl' and 'ndjson'. +

+

Signature

+

Optional[List[str]]

+
+
+

portals.githubControl.name (property)

+

Name of the GitHub user, used while publishing the data. It should be provided explicitly if the name of the user is not set in the GitHub account. +

+

Signature

+

Optional[str]

+
+
+

portals.githubControl.order (property)

+

The order in which to retrieve the data, sorted by the 'sort' param. It can be one of: 'asc', 'desc'. This parameter is ignored if 'sort' is not provided. +

+

Signature

+

Optional[str]

+
+
+

portals.githubControl.page (property)

+

If specified, only the given page is returned.

+

Signature

+

Optional[int]

+
+
+

portals.githubControl.per_page (property)

+

The number of results per page. Default value is 30. Max value is 100.

+

Signature

+

Optional[int]

+
+
+

portals.githubControl.repo (property)

+

Name of the repo to read or write.

+

Signature

+

Optional[str]

+
+
+

portals.githubControl.search (property)

+

Search query containing one or more search keywords and qualifiers to filter the repositories. For example, 'windows+label:bug+language:python'.

+

Signature

+

Optional[str]

+
+
+

portals.githubControl.sort (property)

+

Sorts the result of the query by number of stars, forks, help-wanted-issues or updated. By default the results are sorted by best match in descending order.

+

Signature

+

Optional[str]

+
+
+

portals.githubControl.user (property)

+

Username of the GitHub account.

+

Signature

+

Optional[str]

+
+
+

portals.githubControl.filename (property)

+

Custom data package file name while publishing the data. By default it will use 'datapackage.json'.

+

Signature

+

Optional[str]

+
+
+

portals.githubControl.enable_pages (property)

+

+

Signature

+

Optional[bool]

+
+ + + + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/portals/zenodo.html b/docs/portals/zenodo.html new file mode 100644 index 0000000000..7f6e6db960 --- /dev/null +++ b/docs/portals/zenodo.html @@ -0,0 +1,4236 @@ + + + + + + + + +Zenodo Portal | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Zenodo Portal

+

The Zenodo API makes data sharing between the Frictionless framework and Zenodo easy. Data can be read from as well as written to Zenodo seamlessly. The integration uses the 'zenodopy' library under the hood to communicate with the Zenodo REST API.

+

Installation

+

We need to install zenodo extra dependencies to use this feature:

+ +
+
+
pip install frictionless[zenodo] --pre
+pip install 'frictionless[zenodo]' --pre # for zsh shell
+
+ +
+

Reading Package

+

You can read data from a zenodo repository as follows:

+ +
+
+
from pprint import pprint
+from frictionless import portals, Package
+
+package = Package("https://zenodo.org/record/7078768")
+package.infer()
+print(package)
+
+ +
+
{'title': 'Frictionless Data Test Dataset Without Descriptor',
+ 'resources': [{'name': 'capitals',
+                'type': 'table',
+                'path': 'capitals.csv',
+                'scheme': 'https',
+                'format': 'csv',
+                'encoding': 'utf-8',
+                'mediatype': 'text/csv',
+                'dialect': {'csv': {'skipInitialSpace': True}},
+                'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                                      {'name': 'cid', 'type': 'integer'},
+                                      {'name': 'name', 'type': 'string'}]}},
+               {'name': 'table',
+                'type': 'table',
+                'path': 'table.xls',
+                'scheme': 'https',
+                'format': 'xls',
+                'encoding': 'utf-8',
+                'mediatype': 'application/vnd.ms-excel',
+                'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                                      {'name': 'name', 'type': 'string'}]}}]}
+
+

To increase the access limit, pass 'apikey' as the param to the reader function as follows:

+ +
+
+
from pprint import pprint
+from frictionless import portals, Package
+
+control = portals.ZenodoControl(apikey=apikey)
+package = Package("https://zenodo.org/record/7078768", control=control)
+print(package)
+
+ +
+

The reader function can read a package from repos with or without a data package descriptor. If the repo does not have a descriptor, it will create one with the same name as the repo, as shown in the example above. By default, the function reads the file types supported by the Frictionless framework, such as csv, xlsx and xls, but we can also set the file types using control parameters.

+

If the repo has a descriptor it simply returns the descriptor as shown below:

+ +
+
+
from pprint import pprint
+from frictionless import portals, Package
+
+package = Package("https://zenodo.org/record/7078760")
+package.infer()
+print(package)
+
+ +
+
{'name': 'testing',
+ 'title': 'Frictionless Data Test Dataset',
+ 'resources': [{'name': 'data',
+                'path': 'data.csv',
+                'schema': {'fields': [{'name': 'id',
+                                       'type': 'string',
+                                       'constraints': {'required': True}},
+                                      {'name': 'name', 'type': 'string'},
+                                      {'name': 'description', 'type': 'string'},
+                                      {'name': 'amount', 'type': 'number'}],
+                           'primaryKey': ['id']}},
+               {'name': 'data2',
+                'path': 'data2.csv',
+                'schema': {'fields': [{'name': 'parent', 'type': 'string'},
+                                      {'name': 'comment', 'type': 'string'}],
+                           'foreignKeys': [{'fields': ['parent'],
+                                            'reference': {'resource': 'data',
+                                                          'fields': ['id']}}]}}]}
+
+

Once you have read the package from the repo, you can easily access the resources and their data, for example:

+ +
+
+
from pprint import pprint
+from frictionless import portals, Package
+
+package = Package("https://zenodo.org/record/7078760")
+pprint(package.get_resource('data').read_rows())
+
+ +
+
[{'amount': Decimal('10000.5'),
+  'description': 'Taxes we collect',
+  'id': 'A3001',
+  'name': 'Taxes'},
+ {'amount': Decimal('2000.5'),
+  'description': 'Parking fees we collect',
+  'id': 'A5032',
+  'name': 'Parking Fees'}]
+
+

You can apply any of the functions available in the Frictionless framework. Here is an example of validating the package that was read.

+ +
+
+
from pprint import pprint
+from frictionless import portals, Package
+
+package = Package("https://zenodo.org/record/7078760")
+report = package.validate()
+pprint(report)
+
+ +
+
{'valid': True,
+ 'stats': {'tasks': 1, 'warnings': 0, 'errors': 0, 'seconds': 0.655},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'valid': True,
+            'name': 'first-http-resource',
+            'type': 'table',
+            'place': 'https://raw.githubusercontent.com/fdtester/test-repo-with-datapackage-yaml/master/data/capitals.csv',
+            'labels': ['id', 'cid', 'name'],
+            'stats': {'md5': '154d822b8c2aa259867067f01c0efee5',
+                      'sha256': '5ec3d8a4d137891f2f19ab9d244cbc2c30a7493f895c6b8af2506d9b229ed6a8',
+                      'bytes': 76,
+                      'fields': 3,
+                      'rows': 5,
+                      'warnings': 0,
+                      'errors': 0,
+                      'seconds': 0.651},
+            'warnings': [],
+            'errors': []}]}
+
+
+

Reading Catalog

+

A catalog is a container for packages. We can read single or multiple repositories from Zenodo and create a catalog.

+ +
+
+
from pprint import pprint
+from frictionless import portals, Catalog
+
+control = portals.ZenodoControl(search='notes:"TDWD"')
+catalog = Catalog(control=control)
+catalog.infer()
+print("Total packages", len(catalog.packages))
+print(catalog.packages)
+
+ +
+
Total packages 2
+[{'title': 'Frictionless Data Test Dataset Without Descriptor',
+ 'resources': [{'name': 'countries',
+                'type': 'table',
+                'path': 'countries.csv',
+                'scheme': 'https',
+                'format': 'csv',
+                'encoding': 'utf-8',
+                'mediatype': 'text/csv',
+                'dialect': {'headerRows': [2]},
+                'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                                      {'name': 'neighbor_id', 'type': 'string'},
+                                      {'name': 'name', 'type': 'string'},
+                                      {'name': 'population',
+                                       'type': 'string'}]}}]}, {'title': 'Frictionless Data Test Dataset Without Descriptor',
+ 'resources': [{'name': 'capitals',
+                'type': 'table',
+                'path': 'capitals.csv',
+                'scheme': 'https',
+                'format': 'csv',
+                'encoding': 'utf-8',
+                'mediatype': 'text/csv',
+                'dialect': {'csv': {'skipInitialSpace': True}},
+                'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                                      {'name': 'cid', 'type': 'integer'},
+                                      {'name': 'name', 'type': 'string'}]}},
+               {'name': 'table',
+                'type': 'table',
+                'path': 'table.xls',
+                'scheme': 'https',
+                'format': 'xls',
+                'encoding': 'utf-8',
+                'mediatype': 'application/vnd.ms-excel',
+                'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                                      {'name': 'name', 'type': 'string'}]}}]}]
+
+

In the above example we are using search text to filter the repositories and reduce the result size to a small number. However, the search field is not mandatory. We can simply use the 'control' parameters and create the catalog from a single repo, for example:

+ +
+
+
from pprint import pprint
+from frictionless import portals, Catalog
+
+control = portals.ZenodoControl(record="7078768")
+catalog = Catalog(control=control)
+catalog.infer()
+print("Total packages", len(catalog.packages))
+print(catalog.packages)
+
+ +
+
Total packages 1
+[{'title': 'Frictionless Data Test Dataset Without Descriptor',
+ 'resources': [{'name': 'capitals',
+                'type': 'table',
+                'path': 'capitals.csv',
+                'scheme': 'https',
+                'format': 'csv',
+                'encoding': 'utf-8',
+                'mediatype': 'text/csv',
+                'dialect': {'csv': {'skipInitialSpace': True}},
+                'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                                      {'name': 'cid', 'type': 'integer'},
+                                      {'name': 'name', 'type': 'string'}]}},
+               {'name': 'table',
+                'type': 'table',
+                'path': 'table.xls',
+                'scheme': 'https',
+                'format': 'xls',
+                'encoding': 'utf-8',
+                'mediatype': 'application/vnd.ms-excel',
+                'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                                      {'name': 'name', 'type': 'string'}]}}]}]
+
+

As shown in the first catalog example above, we can use different search queries to filter the repos. The above example searches for all the repos which have 'notes:"TDWD"' text in their readme files. Similarly, we can use many different queries combining terms, phrases or field search. To get the full list of supported queries you can check the official Zenodo documentation here.

+

Some examples of the search queries are:

+
"open science"
+title:"open science"
++description:"frictionless" +title:"Bionomia"
++publication_date:[2022-10-01 TO 2022-11-01] +title:"frictionless"
+
+

We can search for different terms such as "open science" and use '+' to mark a term as mandatory. If '+' is not specified, the term is optional and 'OR' logic is applied to the search. We can also use field search. All the search queries supported by the Zenodo REST API can be used.

+

If we want to read the list of repositories matching the term "+frictionlessdata +science", then we write the search query as follows:

+ +
+
+
from pprint import pprint
+from frictionless import portals, Catalog
+
+control = portals.ZenodoControl(search='+frictionlessdata +science')
+catalog = Catalog(control=control)
+print("Total Packages", len(catalog.packages))
+
+ +
+
Total Packages 1
+
+

There is only one repository matching '+frictionlessdata +science', so only one repository was returned.

+

We can also read repositories in a defined order using the 'sort' param. Here we are reading the repos with 'creators.name:"FD Tester"' in most recently updated order, for example:

+ +
+
+
from pprint import pprint
+from frictionless import portals, Catalog
+
+catalog = Catalog(
+       control=portals.ZenodoControl(
+           search='creators.name:"FD Tester"',
+           sort="mostrecent",
+           page=1,
+           size=1,
+       ),
+   )
+catalog.infer()
+
+ +
+
[{'name': 'test-repo-resources-with-http-data-csv',
+ 'title': 'Test Write File - Remote',
+ 'resources': [{'name': 'first-http-resource',
+                'path': 'https://raw.githubusercontent.com/fdtester/test-repo-with-datapackage-yaml/master/data/capitals.csv',
+                'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                                      {'name': 'cid', 'type': 'string'},
+                                      {'name': 'name', 'type': 'string'}]}}]}]
+
+

Publishing Data

+

To write data to the repository, we use the Package.publish function as follows:

+ +
+
+
from pprint import pprint
+from frictionless import portals, Package
+
+control = portals.ZenodoControl(
+        metafn="data/zenodo/meta.json",
+        apikey=apikey
+    )
+package = Package("484/package-to-write/datapackage.json")
+deposition_id = package.publish(control=control)
+print(deposition_id)
+
+ +
+
1123500
+
+

To publish the data, we need to provide metadata for the Zenodo record, which we send using "meta.json". In order to be able to publish/write to a repository, we need an api token with write access to the deposit. If the package is successfully published, the deposition_id will be returned as shown in the example above.
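A minimal sketch of such a metadata file, following Zenodo's deposition metadata format (the field values are illustrative, and the exact fields accepted should be checked against the Zenodo REST API documentation):

import json

# illustrative Zenodo deposition metadata; adjust the values for your record
meta = {
    "metadata": {
        "title": "Frictionless Data Test Dataset",
        "upload_type": "dataset",
        "description": "A test dataset published with the frictionless framework",
        "creators": [{"name": "FD Tester"}],
    }
}
with open("data/zenodo/meta.json", "w") as f:
    json.dump(meta, f, indent=2)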

+

For testing, we can pass the sandbox URL using the base_url param:

+ +
+
+
from pprint import pprint
+from frictionless import portals, Package
+
+control = portals.ZenodoControl(
+        metafn="data/zenodo/meta.json",
+        apikey=apikey_sandbox,
+        base_url="https://sandbox.zenodo.org/api/"
+    )
+package = Package("484/package-to-write/datapackage.json")
+deposition_id = package.publish(control=control)
+
+ +
+

If the metadata file is not provided, the api will read the available data from the package file. The metadata will be generated using the title, contributors and description from the Package descriptor.

+ +
+
+
from pprint import pprint
+from frictionless import portals, Package
+
+control = portals.ZenodoControl(
+        apikey=apikey_sandbox,
+        base_url="https://sandbox.zenodo.org/api/"
+    )
+package = Package("484/package-to-write/datapackage.json")
+deposition_id = package.publish(control=control)
+
+ +
+

Configuration

+

We can control the behavior of all three functions above using various params.

+

For example, to read only 'csv' files in a package, we use the following code:

+ +
+
+
from pprint import pprint
+from frictionless import portals, Package
+
+control = portals.ZenodoControl(formats=["csv"], record="7078725", apikey=apikey)
+package = Package(control=control)
+print(package)
+
+ +
+
{'name': 'test-repo-without-datapackage',
+ 'resources': [{'name': 'capitals',
+                'type': 'table',
+                'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/capitals.csv',
+                'scheme': 'https',
+                'format': 'csv',
+                'mediatype': 'text/csv'},
+               {'name': 'countries',
+                'type': 'table',
+                'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/countries.csv',
+                'scheme': 'https',
+                'format': 'csv',
+                'mediatype': 'text/csv'}]}
+
+

In order to read the first page of the search result and create a catalog, we use the page and size params as follows:

+ +
+
+
from pprint import pprint
+from frictionless import portals, Catalog
+
+catalog = Catalog(
+       control=portals.ZenodoControl(
+           search='creators.name:"FD Tester"',
+           sort="mostrecent",
+           page=1,
+           size=1,
+       ),
+   )
+print(catalog.packages)
+
+ +
+
[{'name': 'test-repo-resources-with-http-data-csv',
+ 'title': 'Test Write File - Remote',
+ 'resources': [{'name': 'first-http-resource',
+                'path': 'https://raw.githubusercontent.com/fdtester/test-repo-with-datapackage-yaml/master/data/capitals.csv',
+                'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                                      {'name': 'cid', 'type': 'string'},
+                                      {'name': 'name', 'type': 'string'}]}}]}]
+
+

Reference

+
+ + +
+
+ +

portals.ZenodoControl (class)

+ +
+
+ + +
+

portals.ZenodoControl (class)

+

Zenodo control representation

+

Signature

+

(*, all_versions: Optional[int] = None, apikey: Optional[str] = None, base_url: str = https://zenodo.org/api/, title: Optional[str] = None, description: Optional[str] = None, author: Optional[str] = None, company: Optional[str] = None, bounds: Optional[str] = None, communities: Optional[str] = None, deposition_id: Optional[int] = None, doi: Optional[str] = None, formats: Optional[List[str]] = [csv, tsv, xlsx, xls, jsonl, ndjson, csv.zip, tsv.zip, xlsx.zip, xls.zip, jsonl.zip, ndjson.zip], name: Optional[str] = None, metafn: Optional[str] = None, page: Optional[str] = None, rcustom: Optional[str] = None, record: Optional[str] = None, rtype: Optional[str] = None, search: Optional[str] = None, size: Optional[int] = None, sort: Optional[str] = None, status: Optional[str] = None, subtype: Optional[str] = None, tmp_path: Optional[str] = None) -> None

+

Parameters

+
    +
  • + all_versions + (Optional[int])
  • +
  • + apikey + (Optional[str])
  • +
  • + base_url + (str)
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + author + (Optional[str])
  • +
  • + company + (Optional[str])
  • +
  • + bounds + (Optional[str])
  • +
  • + communities + (Optional[str])
  • +
  • + deposition_id + (Optional[int])
  • +
  • + doi + (Optional[str])
  • +
  • + formats + (Optional[List[str]])
  • +
  • + name + (Optional[str])
  • +
  • + metafn + (Optional[str])
  • +
  • + page + (Optional[str])
  • +
  • + rcustom + (Optional[str])
  • +
  • + record + (Optional[str])
  • +
  • + rtype + (Optional[str])
  • +
  • + search + (Optional[str])
  • +
  • + size + (Optional[int])
  • +
  • + sort + (Optional[str])
  • +
  • + status + (Optional[str])
  • +
  • + subtype + (Optional[str])
  • +
  • + tmp_path + (Optional[str])
  • +
+
+ +
+

portals.zenodoControl.all_versions (property)

+

Show (true or 1) or hide (false or 0) all versions of records.

+

Signature

+

Optional[int]

+
+
+

portals.zenodoControl.apikey (property)

+

The access token to authenticate to the Zenodo API. It is required to write files to a Zenodo deposit resource. For reading it is optional, however using an apikey increases the api access limit from 60 to 100 requests per hour. To write, the access token has to have deposit:write access. +

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.base_url (property)

+

Endpoint for Zenodo. By default it is set to the live site (https://zenodo.org/api). For testing uploads, we can use the sandbox, for example https://sandbox.zenodo.org/api. The sandbox does not work for reading.

+

Signature

+

str

+
+
+

portals.zenodoControl.title (property)

+

Title of the deposition, used while publishing data to Zenodo.

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.description (property)

+

Description of the deposition, used while publishing data to Zenodo.

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.author (property)

+

Author of the deposition, used while publishing data to Zenodo.

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.company (property)

+

Company/affiliation of the author, used while publishing data to Zenodo.

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.bounds (property)

+

Return records filtered by a geolocation bounding box. For example, (Format bounds=143.37158,-38.99357,146.90918,-37.35269)

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.communities (property)

+

Return records that are part of the specified communities. (Use of community identifier).

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.deposition_id (property)

+

Id of the deposition resource. The deposition resource is used for uploading and editing files on Zenodo.

+

Signature

+

Optional[int]

+
+
+

portals.zenodoControl.doi (property)

+

Digital Object Identifier (DOI). When the deposition is published, a unique DOI is registered by Zenodo, or the user can set it manually. This applies only to published depositions. If set, it returns the record that matches this DOI.

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.formats (property)

+

Formats instructs the plugin to only read specified types of files. By default it is set to 'csv', 'tsv', 'xlsx', 'xls', 'jsonl' and 'ndjson', plus their zipped variants. +

+

Signature

+

Optional[List[str]]

+
+
+

portals.zenodoControl.name (property)

+

Custom name for a catalog or a package. Default name is 'catalog' or 'package'

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.metafn (property)

+

Metadata file path for the deposition resource. The deposition resource is used for uploading and editing records on Zenodo.

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.page (property)

+

Page number to retrieve from the search result.

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.rcustom (property)

+

Return records containing the specified custom keywords. (Format custom=[field_name]:field_value)

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.record (property)

+

Unique identifier of a record. We can use it to find a specific record while creating a package or a catalog. For example, 7078768.

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.rtype (property)

+

Return records of the specified type. (Publication, Poster, Presentation…)

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.search (property)

+

Search query containing one or more search keywords to filter the records. For example, 'notes:"TDBASIC"'.

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.size (property)

+

Number of results to return per page.

+

Signature

+

Optional[int]

+
+
+

portals.zenodoControl.sort (property)

+

Sort order (bestmatch or mostrecent). Prefix with a minus to change from ascending to descending (e.g. -mostrecent).

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.status (property)

+

Filter result based on the deposit status (either draft or published)

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.subtype (property)

+

Return records of the specified subtype (for record types that have subtypes, e.g. publication subtypes).

+

Signature

+

Optional[str]

+
+
+

portals.zenodoControl.tmp_path (property)

+

Temp path to create intermediate package/resource file/s to upload to the zenodo instance

+

Signature

+

Optional[str]

+
+ + + + +
+
+

+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/resources/file.html b/docs/resources/file.html new file mode 100644 index 0000000000..f5345b86a7 --- /dev/null +++ b/docs/resources/file.html @@ -0,0 +1,3492 @@ + + + + + + + + +File Resource | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

File Resource

+

A file resource is the most basic one. Actually, every data file can be treated as a file resource. For example:

+ + + +
+
+
from frictionless.resources import FileResource
+
+resource = FileResource(path='text.txt')
+resource.infer(stats=True)
+print(resource)
+
+ +
{'name': 'text',
+ 'type': 'file',
+ 'path': 'text.txt',
+ 'scheme': 'file',
+ 'format': 'txt',
+ 'mediatype': 'text/txt',
+ 'encoding': 'utf-8',
+ 'hash': 'sha256:b9e68e1bea3e5b19ca6b2f98b73a54b73daafaa250484902e09982e07a12e733',
+ 'bytes': 5}
+ +
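We can also read the raw contents; read_bytes is a generic Resource method, so this sketch shows typical usage:

from frictionless.resources import FileResource

resource = FileResource(path='text.txt')
print(resource.read_bytes())  # the raw bytes of the file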
+
+
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/resources/json.html b/docs/resources/json.html new file mode 100644 index 0000000000..f7877ac03e --- /dev/null +++ b/docs/resources/json.html @@ -0,0 +1,3512 @@ + + + + + + + + +Json Resource | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Json Resource

+

A json resource contains structured data such as JSON or YAML (validation with JSON Schema is under development):

+ +
+
+
from frictionless.resources import JsonResource
+
+resource = JsonResource(path='data.json')
+resource.infer(stats=True)
+print(resource)
+
+ +
{'name': 'data',
+ 'type': 'json',
+ 'path': 'data.json',
+ 'scheme': 'file',
+ 'format': 'json',
+ 'mediatype': 'text/json',
+ 'encoding': 'utf-8',
+ 'hash': 'sha256:80af3283a5c57e5d3a8d1d4099bebe639c610c4ecc8ce39fe53f9f9d9c441c4a',
+ 'bytes': 21}
+ +
+

We can read the contents:

+ + + +
+
+
from frictionless.resources import JsonResource
+
+resource = JsonResource(path='data.json')
+resource.infer(stats=True)
+print(resource.read_data())
+
+ +
{'key': 'value'}
+ +
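Since the class also covers YAML, the same code can point at a YAML file; a sketch, assuming an equivalent data.yaml fixture exists:

from frictionless.resources import JsonResource

resource = JsonResource(path='data.yaml')
resource.infer()
print(resource.read_data())  # parsed into the same Python structures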
+
+
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/resources/table.html b/docs/resources/table.html new file mode 100644 index 0000000000..6d974cc516 --- /dev/null +++ b/docs/resources/table.html @@ -0,0 +1,3516 @@ + + + + + + + + +Table Resource | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Table Resource

+

A table resource contains a tabular data file (can be validated with Table Schema):

+ +
+
+
from frictionless.resources import TableResource
+
+resource = TableResource(path='table.csv')
+resource.infer(stats=True)
+print(resource)
+
+ +
{'name': 'table',
+ 'type': 'table',
+ 'path': 'table.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv',
+ 'encoding': 'utf-8',
+ 'hash': 'sha256:a1fd6c5ff3494f697874deeb07f69f8667e903dd94a7bc062dd57550cea26da8',
+ 'bytes': 30,
+ 'fields': 2,
+ 'rows': 2,
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+                       {'name': 'name', 'type': 'string'}]}}
+ +
+

We can read the contents:

+ + + +
+
+
from frictionless.resources import TableResource
+
+resource = TableResource(path='table.csv')
+resource.infer(stats=True)
+print(resource.read_rows())
+
+ +
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+ +
+
+
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/resources/text.html b/docs/resources/text.html new file mode 100644 index 0000000000..3a8d9a0ef2 --- /dev/null +++ b/docs/resources/text.html @@ -0,0 +1,3514 @@ + + + + + + + + +Text Resource | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Text Resource

+

A text resource represents a textual file, such as a markdown document. For example:

+ +
+
+
from frictionless.resources import TextResource
+
+resource = TextResource(path='article.md')
+resource.infer(stats=True)
+print(resource)
+
+ +
{'name': 'article',
+ 'type': 'text',
+ 'path': 'article.md',
+ 'scheme': 'file',
+ 'format': 'md',
+ 'mediatype': 'text/markdown',
+ 'encoding': 'utf-8',
+ 'hash': 'sha256:c3d88243a8bbb2d95787af6edd6b0017791a090d18c80765f92b486ab502cebb',
+ 'bytes': 20}
+ +
+

We can read the contents:

+ + + +
+
+
from frictionless.resources import TextResource
+
+resource = TextResource(path='article.md')
+resource.infer(stats=True)
+print(resource.read_text())
+
+ +
# Article
+
+Contents
+ +
+
+
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/schemes/aws.html b/docs/schemes/aws.html new file mode 100644 index 0000000000..b6280f3832 --- /dev/null +++ b/docs/schemes/aws.html @@ -0,0 +1,3594 @@ + + + + + + + + +AWS Schemes | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

AWS Schemes

+

Frictionless supports reading data from an AWS cloud source. You can read files in any format that is available in your S3 bucket. We need to install the aws extra dependencies to use this feature:

+ +
+
+
pip install frictionless[aws]
+pip install 'frictionless[aws]' # for zsh shell
+
+ +
+

Reading Data

+

You can read from this source using Package/Resource, for example:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource(path='s3://bucket/table.csv')
+pprint(resource.read_rows())
+
+ +
+

For reading from a private bucket you need to set up AWS credentials as described in the Boto3 documentation.
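For example, boto3 picks credentials up from its standard environment variables, so one option is to set them before opening the resource (a sketch; the values are placeholders):

import os

# standard boto3 credential variables; replace the placeholders with real keys
os.environ['AWS_ACCESS_KEY_ID'] = '<your-key-id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<your-secret-key>'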

+

Writing Data

+

A similar approach can be used for writing:

+ +
+
+
from frictionless import Resource
+
+resource = Resource(path='data/table.csv')
+resource.write('s3://bucket/table.csv')
+
+ +
+

Configuration

+

There is a Control to configure how Frictionless reads files in this storage. For example:

+ +
+
+
from frictionless import Resource
+from frictionless.plugins.s3 import S3Control
+
+resource = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+resource.write('s3://bucket/table.new.csv', control=S3Control(endpoint_url='<url>'))
+
+ +
+

Reference

+
+ + +
+
+ +

schemes.AwsControl (class)

+ +
+
+ + +
+

schemes.AwsControl (class)

+

Aws control representation

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, s3_endpoint_url: str = https://s3.amazonaws.com) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + s3_endpoint_url + (str)
  • +
+
+ +
+

schemes.awsControl.s3_endpoint_url (property)

+

+

Signature

+

str

+
+ + + + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/schemes/buffer.html b/docs/schemes/buffer.html new file mode 100644 index 0000000000..46fa60aa7e --- /dev/null +++ b/docs/schemes/buffer.html @@ -0,0 +1,3514 @@ + + + + + + + + +Buffer Scheme | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Buffer Scheme

+

Frictionless supports working with bytes loaded into memory.

+

Reading Data

+

You can read Buffer Data using Package/Resource API, for example:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource(b'id,name\n1,english\n2,german', format='csv')
+pprint(resource.read_rows())
+
+ +
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': 'german'}]
+ +
+

Writing Data

+

A similar approach can be used for writing:

+ + + +
+
+
from frictionless import Resource
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = source.write(scheme='buffer', format='csv')
+print(target)
+print(target.read_rows())
+
+ +
{'name': 'memory',
+ 'type': 'table',
+ 'data': [],
+ 'scheme': 'buffer',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': 'german'}]
+ +
+
+
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/schemes/local.html b/docs/schemes/local.html new file mode 100644 index 0000000000..28f23c2df1 --- /dev/null +++ b/docs/schemes/local.html @@ -0,0 +1,3520 @@ + + + + + + + + +Local Scheme | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Local Scheme

+

You can read and write files locally with Frictionless. This is a basic functionality of Frictionless.

+

Reading Data

+

You can read using Package/Resource, for example:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource(path='table.csv')
+pprint(resource.read_rows())
+
+ +
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+ +
+

Writing Data

+

A similar approach can be used for writing:

+ + + +
+
+
from frictionless import Resource
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = source.write('table-output.csv')
+print(target)
+print(target.to_view())
+
+ +
{'name': 'table-output',
+ 'type': 'table',
+ 'path': 'table-output.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
++----+-----------+
+| id | name      |
++====+===========+
+|  1 | 'english' |
++----+-----------+
+|  2 | 'german'  |
++----+-----------+
+ +
+
+
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/schemes/multipart.html b/docs/schemes/multipart.html new file mode 100644 index 0000000000..55c68af614 --- /dev/null +++ b/docs/schemes/multipart.html @@ -0,0 +1,3584 @@ + + + + + + + + +Multipart Scheme | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Multipart Scheme

+

You can read and write files split into chunks with Frictionless.

+

Reading Data

+

You can read using Package/Resource, for example:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource(path='chunk1.csv', extrapaths=['chunk2.csv'])
+pprint(resource.read_rows())
+
+ +
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+ +
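The same approach works when the chunks are remote; a sketch, assuming both chunk URLs exist:

from frictionless import Resource

resource = Resource(
    path='https://example.com/chunk1.csv',
    extrapaths=['https://example.com/chunk2.csv'],
)
print(resource.read_rows())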
+

Writing Data

+

A similar approach can be used for writing:

+ +
+
+
from frictionless import Resource
+
+resource = Resource(path='table.json')
+resource.write('table{number}.json', scheme="multipart", control={"chunkSize": 1000000})
+
+ +
+

Configuration

+

There is a Control to configure how Frictionless reads files using this scheme. For example:

+ +
+
+
from frictionless import Resource
+from frictionless.plugins.multipart import MultipartControl
+
+control = MultipartControl(chunk_size=1000000)
+resource = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+resource.write('table{number}.json', scheme="multipart", control=control)
+
+ +
+

Reference

+
+ + +
+
+ +

schemes.MultipartControl (class)

+ +
+
+ + +
+

schemes.MultipartControl (class)

+

Multipart control representation

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, chunk_size: int = 100000000) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + chunk_size + (int)
  • +
+
+ +
+

schemes.multipartControl.chunk_size (property)

+

+ Specifies chunk size for the multipart file. +

+

Signature

+

int

+
+ + + + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/schemes/remote.html b/docs/schemes/remote.html new file mode 100644 index 0000000000..707839493e --- /dev/null +++ b/docs/schemes/remote.html @@ -0,0 +1,3608 @@ + + + + + + + + +Remote Scheme | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Remote Scheme

+

You can read files remotely with Frictionless. This is a basic functionality of Frictionless.

+

Reading Data

+

You can read using Package/Resource, for example:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+path='https://raw.githubusercontent.com/frictionlessdata/frictionless-py/master/data/table.csv'
+resource = Resource(path=path)
+pprint(resource.read_rows())
+
+ +
+
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+

Writing Data

+

A similar approach can be used for writing:

+ +
+
+
from frictionless import Resource
+
+resource = Resource(path='data/table.csv')
+resource.write('https://example.com/data/table.csv') # will POST the file to the server
+
+ +
+

Configuration

+

There is a Control to configure remote data, for example:

+ +
+
+
from frictionless import Resource
+from frictionless.plugins.remote import RemoteControl
+
+control = RemoteControl(http_timeout=10)
+path='https://raw.githubusercontent.com/frictionlessdata/frictionless-py/master/data/table.csv'
+resource = Resource(path=path, control=control)
+print(resource.to_view())
+
+ +
+
+----+-----------+
+| id | name      |
++====+===========+
+|  1 | 'english' |
++----+-----------+
+|  2 | '中国人'     |
++----+-----------+
+
+

Reference

+
+ + +
+
+ +

schemes.RemoteControl (class)

+ +
+
+ + +
+

schemes.RemoteControl (class)

+

Remote control representation

+

Signature

+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, http_timeout: int = 10, http_preload: bool = False) -> None

+

Parameters

+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + http_timeout + (int)
  • +
  • + http_preload + (bool)
  • +
+
+ +
+

schemes.remoteControl.http_timeout (property)

+

+ Specifies the time to wait for the remote server to respond before raising an error. The default value is 10. +

+

Signature

+

int

+
+
+

schemes.remoteControl.http_preload (property)

+

+ Preloads the data into memory if set to True. It is set to False by default. +

+

Signature

+

bool

+
+ + + + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/schemes/stream.html b/docs/schemes/stream.html new file mode 100644 index 0000000000..a85bad84dd --- /dev/null +++ b/docs/schemes/stream.html @@ -0,0 +1,3524 @@ + + + + + + + + +Stream Scheme | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Stream Scheme

+

Frictionless supports using data stored as File-Like objects in Python.

+

Reading Data

+
+

It's recommended to open files in byte-mode. If the file is opened in text-mode, Frictionless will try to re-open it in byte-mode.

+
+

You can read Stream using Package/Resource, for example:

+ +
+
+
from pprint import pprint
+from frictionless import Resource
+
+with open('table.csv', 'rb') as file:
+  resource = Resource(file, format='csv')
+  pprint(resource.read_rows())
+
+ +
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+ +
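Following the note above, a text-mode handle can also be passed and Frictionless will try to re-open it in byte-mode behind the scenes; a sketch:

from pprint import pprint
from frictionless import Resource

# opened in text mode; Frictionless attempts the byte-mode re-open internally
with open('table.csv') as file:
    resource = Resource(file, format='csv')
    pprint(resource.read_rows())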
+

Writing Data

+

A similar approach can be used for writing:

+ + + +
+
+
from frictionless import Resource
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = source.write(scheme='stream', format='csv')
+print(target)
+print(target.to_view())
+
+ +
{'name': 'memory',
+ 'type': 'table',
+ 'data': [],
+ 'scheme': 'stream',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
++----+-----------+
+| id | name      |
++====+===========+
+|  1 | 'english' |
++----+-----------+
+|  2 | 'german'  |
++----+-----------+
+ +
+
+
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/steps/cell.html b/docs/steps/cell.html new file mode 100644 index 0000000000..a4a2b44814 --- /dev/null +++ b/docs/steps/cell.html @@ -0,0 +1,4217 @@ + + + + + + + + +Cell Steps | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ +

Cell Steps

+

The Cell steps are responsible for cell operations like converting, replacing, or formatting, among others.

+

Convert Cells

+

Converts cell values of one or more fields using arbitrary functions, method invocations or dictionary translations.

+

Using Value

+

We can provide a value to be set in all cells of this field. Take into account that the value type needs to conform to the field type, otherwise it will lead to a validation error:

+ +
+
+
from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.cell_convert(field_name='population', value="100"),
+    ],
+)
+print(target.to_view())
+
+ +
+----+-----------+------------+
+| id | name      | population |
++====+===========+============+
+|  1 | 'germany' |        100 |
++----+-----------+------------+
+|  2 | 'france'  |        100 |
++----+-----------+------------+
+|  3 | 'spain'   |        100 |
++----+-----------+------------+
+ +
+

Using Mapping

+

Another option to modify a field's cells is to provide a mapping. It's a translation table that uses literal matching to replace values. It's usually used for string fields:

+ +
+
+
from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.cell_convert(field_name='name', mapping = {'germany': 'GERMANY'}),
+    ],
+)
+print(target.to_view())
+
+ +
+----+-----------+------------+
+| id | name      | population |
++====+===========+============+
+|  1 | 'GERMANY' |         83 |
++----+-----------+------------+
+|  2 | 'france'  |         66 |
++----+-----------+------------+
+|  3 | 'spain'   |         47 |
++----+-----------+------------+
+ +
+

Using Function

+
+ +

We can provide an arbitrary function to update the field cells. If you want to modify a non-string field, it's really important to normalize the table first, otherwise the function will be applied to a non-parsed value:

+ +
+
+
from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.table_normalize(),
+        steps.cell_convert(field_name='population', function=lambda v: v*2),
+    ],
+)
+print(target.to_view())
+
+ +
+----+-----------+------------+
+| id | name      | population |
++====+===========+============+
+|  1 | 'germany' |        166 |
++----+-----------+------------+
+|  2 | 'france'  |        132 |
++----+-----------+------------+
+|  3 | 'spain'   |         94 |
++----+-----------+------------+
+ +
+

Reference

+
+ + +
+
+ +

steps.cell_convert (class)

+ +
+
+ + +
+

steps.cell_convert (class)

+

Convert cell. Converts cell values of one or more fields using arbitrary functions, method invocations or dictionary translations.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, value: Optional[Any] = None, mapping: Optional[Dict[str, Any]] = None, function: Optional[Any] = None, field_name: Optional[str] = None) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + value + (Optional[Any])
  • +
  • + mapping + (Optional[Dict[str, Any]])
  • +
  • + function + (Optional[Any])
  • +
  • + field_name + (Optional[str])
  • +
+
+ +
+

steps.cell_convert.value (property)

+

Value to set in the field's cells

+
Signature
+

Optional[Any]

+
+
+

steps.cell_convert.mapping (property)

+

Mapping to apply to the column

+
Signature
+

Optional[Dict[str, Any]]

+
+
+

steps.cell_convert.function (property)

+

Function to apply to the column

+
Signature
+

Optional[Any]

+
+
+

steps.cell_convert.field_name (property)

+

Name of the field to apply the transform on

+
Signature
+

Optional[str]

+
+ + + + +
+
+

Fill Cells

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.cell_replace(pattern="france", replace=None),
+        steps.cell_fill(field_name="name", value="FRANCE"),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name      | population |
++====+===========+============+
+|  1 | 'germany' |         83 |
++----+-----------+------------+
+|  2 | 'FRANCE'  |         66 |
++----+-----------+------------+
+|  3 | 'spain'   |         47 |
++----+-----------+------------+
+ +
+
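Besides a constant value, the step exposes a direction option (see the reference below). A sketch, assuming the "down" direction fills a missing cell from the last non-missing value above it:

+
+from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.cell_replace(pattern="france", replace=None),
+        # assumption: direction="down" copies the value from the row above
+        steps.cell_fill(field_name="name", direction="down"),
+    ]
+)
+print(target.to_view())
+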

Reference

+
+ + +
+
+ +

steps.cell_fill (class)

+ +
+
+ + +
+

steps.cell_fill (class)

+

Fill cell. Replaces missing values with non-missing values from the adjacent row/column.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, value: Optional[Any] = None, field_name: Optional[str] = None, direction: Optional[str] = None) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + value + (Optional[Any])
  • +
  • + field_name + (Optional[str])
  • +
  • + direction + (Optional[str])
  • +
+
+ +
+

steps.cell_fill.value (property)

+

Value to set in the field's cells that have a missing value

+
Signature
+

Optional[Any]

+
+
+

steps.cell_fill.field_name (property)

+

Name of the field whose missing-value cells will be replaced

+
Signature
+

Optional[str]

+
+
+

steps.cell_fill.direction (property)

+

Direction to read the non-missing value from (left/right/above)

+
Signature
+

Optional[str]

+
+ + + + +
+
+

Format Cells

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.cell_format(template="Prefix: {0}", field_name="name"),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++----+-------------------+------------+
+| id | name              | population |
++====+===================+============+
+|  1 | 'Prefix: germany' |         83 |
++----+-------------------+------------+
+|  2 | 'Prefix: france'  |         66 |
++----+-------------------+------------+
+|  3 | 'Prefix: spain'   |         47 |
++----+-------------------+------------+
+ +
+

Reference

+
+ + +
+
+ +

steps.cell_format (class)

+ +
+
+ + +
+

steps.cell_format (class)

+

Format cell. Formats all values in the given or all string fields using the `template` format string.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, template: str, field_name: Optional[str] = None) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + template + (str)
  • +
  • + field_name + (Optional[str])
  • +
+
+ +
+

steps.cell_format.template (property)

+

Format string to apply to the cells

+
Signature
+

str

+
+
+

steps.cell_format.field_name (property)

+

Field name to apply the template format to

+
Signature
+

Optional[str]

+
+ + + + +
+
+

Interpolate Cells

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.cell_interpolate(template="Prefix: %s", field_name="name"),
+    ]
+)
+pprint(target.schema)
+pprint(target.read_rows())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
+[{'id': 1, 'name': 'Prefix: germany', 'population': 83},
+ {'id': 2, 'name': 'Prefix: france', 'population': 66},
+ {'id': 3, 'name': 'Prefix: spain', 'population': 47}]
+ +
+

Reference

+
+ + +
+
+ +

steps.cell_interpolate (class)

+ +
+
+ + +
+

steps.cell_interpolate (class)

+

Interpolate cell. Interpolates all values in the given or all string fields using the `template` string.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, template: str, field_name: Optional[str] = None) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + template + (str)
  • +
  • + field_name + (Optional[str])
  • +
+
+ +
+

steps.cell_interpolate.template (property)

+

Template string to apply to the field cells

+
Signature
+

str

+
+
+

steps.cell_interpolate.field_name (property)

+

Field name to apply the template string to

+
Signature
+

Optional[str]

+
+ + + + +
+
+

Replace Cells

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.cell_replace(pattern="france", replace="FRANCE"),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name      | population |
++====+===========+============+
+|  1 | 'germany' |         83 |
++----+-----------+------------+
+|  2 | 'FRANCE'  |         66 |
++----+-----------+------------+
+|  3 | 'spain'   |         47 |
++----+-----------+------------+
+ +
+
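By default the replacement is applied across all fields; the field_name parameter from the signature below limits it to a single field. A minimal sketch:

+
+from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        # limit the replacement to the 'name' field only
+        steps.cell_replace(pattern="france", replace="FRANCE", field_name="name"),
+    ]
+)
+print(target.to_view())
+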
Reference

+
+ + +
+
+ +

steps.cell_replace (class)

+ +
+
+ + +
+

steps.cell_replace (class)

+

Replace cell. Replaces cell values in a given field or all fields using a user-defined pattern.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, pattern: str, replace: str, field_name: Optional[str] = None) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + pattern + (str)
  • +
  • + replace + (str)
  • +
  • + field_name + (Optional[str])
  • +
+
+ +
+

steps.cell_replace.pattern (property)

+

Pattern to search for in a single field or in all fields

+
Signature
+

str

+
+
+

steps.cell_replace.replace (property)

+

String to replace the matched pattern with

+
Signature
+

str

+
+
+

steps.cell_replace.field_name (property)

+

Field name to which the replacement is applied

+
Signature
+

Optional[str]

+
+ + + + +
+
+

Set Cells

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+          steps.cell_set(field_name="population", value=100),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name      | population |
++====+===========+============+
+|  1 | 'germany' |        100 |
++----+-----------+------------+
+|  2 | 'france'  |        100 |
++----+-----------+------------+
+|  3 | 'spain'   |        100 |
++----+-----------+------------+
+ +
+

Reference

+
+ + +
+
+ +

steps.cell_set (class)

+ +
+
+ + +
+

steps.cell_set (class)

+

Set cell. Sets cell values of the given field to the provided value.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, value: Any, field_name: str) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + value + (Any)
  • +
  • + field_name + (str)
  • +
+
+ +
+

steps.cell_set.value (property)

+

+ Value to be set in the cells of the given field. +

+
Signature
+

Any

+
+
+

steps.cell_set.field_name (property)

+

+ Specifies the field name where the value is set or replaced. +

+
Signature
+

str

+
+ + + + +
+
+
+ + +
+
+ +
+ +
+ +
+
+
+
+
+ + + + + + + + + + + + +
+ +
+ + + +
+
+ +
+
+ +
+
+ +
+
+ +
+
+ + + +
+
+ + + +
+
+ + + + + + + + + + + \ No newline at end of file diff --git a/docs/steps/field.html b/docs/steps/field.html new file mode 100644 index 0000000000..58bfcc8b1f --- /dev/null +++ b/docs/steps/field.html @@ -0,0 +1,4622 @@ + + + + + + + + +Field Steps | Frictionless Framework + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+ +
+
+
+
+ Edit page in Livemark
+ (2024-11-22 08:02) +
+ +

Field Steps

+

The Field steps are responsible for managing a Table Schema's fields. You can add or remove fields, along with more complex operations like unpacking.

+

Add Field

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.field_add(name="note", value="eu", descriptor={"type": "string"}),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'},
+            {'name': 'note', 'type': 'string'}]}
++----+-----------+------------+------+
+| id | name      | population | note |
++====+===========+============+======+
+|  1 | 'germany' |         83 | 'eu' |
++----+-----------+------------+------+
+|  2 | 'france'  |         66 | 'eu' |
++----+-----------+------------+------+
+|  3 | 'spain'   |         47 | 'eu' |
++----+-----------+------------+------+
+ +
+
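Besides a constant value, the signature below also accepts a formula (processed with the simpleeval library) or a Python function, plus a position for the new field. A sketch, assuming the formula can reference the row's other fields by name:

+
+from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.table_normalize(),
+        # assumption: the formula sees typed row values by field name
+        steps.field_add(name="double", formula="population * 2", position=2),
+    ]
+)
+print(target.to_view())
+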

Reference

+
+ + +
+
+ +

steps.field_add (class)

+ +
+
+ + +
+

steps.field_add (class)

+

Add field. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, title: Optional[str] = None, description: Optional[str] = None, name: str, value: Optional[Any] = None, formula: Optional[Any] = None, function: Optional[Any] = None, position: Optional[int] = None, descriptor: Optional[types.IDescriptor] = None, incremental: bool = False) -> None

+
Parameters
+
    +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + name + (str)
  • +
  • + value + (Optional[Any])
  • +
  • + formula + (Optional[Any])
  • +
  • + function + (Optional[Any])
  • +
  • + position + (Optional[int])
  • +
  • + descriptor + (Optional[types.IDescriptor])
  • +
  • + incremental + (bool)
  • +
+
+ +
+

steps.field_add.name (property)

+

+ A human-oriented name for the field. +

+
Signature
+

str

+
+
+

steps.field_add.value (property)

+

+ Specifies value for the field. +

+
Signature
+

Optional[Any]

+
+
+

steps.field_add.formula (property)

+

+ Evaluatable expressions to set the value for the field. The expressions are processed using the simpleeval library. +

+
Signature
+

Optional[Any]

+
+
+

steps.field_add.function (property)

+

+ Python function to set the value for the field. +

+
Signature
+

Optional[Any]

+
+
+

steps.field_add.position (property)

+

+ Position index at which to add the field. For example, to add the field in the second position, set 'position=2'. +

+
Signature
+

Optional[int]

+
+
+

steps.field_add.descriptor (property)

+

+ A dictionary of metadata that describes the properties of the field. +

+
Signature
+

Optional[types.IDescriptor]

+
+
+

steps.field_add.incremental (property)

+

+ Indicates if it is an incremental value. If True, a sequential value is set in the new field. The default value is False. +

+
Signature
+

bool

+
+ + + + +
+
+

Filter Fields

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.field_filter(names=["id", "name"]),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'}]}
++----+-----------+
+| id | name      |
++====+===========+
+|  1 | 'germany' |
++----+-----------+
+|  2 | 'france'  |
++----+-----------+
+|  3 | 'spain'   |
++----+-----------+
+ +
+

Reference

+
+ + +
+
+ +

steps.field_filter (class)

+ +
+
+ + +
+

steps.field_filter (class)

+

Filter fields. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, names: List[str]) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + names + (List[str])
  • +
+
+ +
+

steps.field_filter.names (property)

+

+ Names of the fields to keep. Other fields will be ignored. +

+
Signature
+

List[str]

+
+ + + + +
+
+

Merge Fields

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+     source,
+     steps=[
+         # separator argument can be used to set the delimiter. Default value is '-'
+    	 # preserve argument keeps the original fields
+         steps.field_merge(name="details", from_names=["name", "population"], preserve=True)
+     ],
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'},
+            {'name': 'details', 'type': 'string'}]}
++----+-----------+------------+--------------+
+| id | name      | population | details      |
++====+===========+============+==============+
+|  1 | 'germany' |         83 | 'germany-83' |
++----+-----------+------------+--------------+
+|  2 | 'france'  |         66 | 'france-66'  |
++----+-----------+------------+--------------+
+|  3 | 'spain'   |         47 | 'spain-47'   |
++----+-----------+------------+--------------+
+ +
+

Reference

+
+ + +
+
+ +

steps.field_merge (class)

+ +
+
+ + +
+

steps.field_merge (class)

+

Merge fields. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, title: Optional[str] = None, description: Optional[str] = None, name: str, from_names: List[str], separator: str = -, preserve: bool = False) -> None

+
Parameters
+
    +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + name + (str)
  • +
  • + from_names + (List[str])
  • +
  • + separator + (str)
  • +
  • + preserve + (bool)
  • +
+
+ +
+

steps.field_merge.name (property)

+

+ Name of the new field that will be created after merge. +

+
Signature
+

str

+
+
+

steps.field_merge.from_names (property)

+

+ List of field names to merge. +

+
Signature
+

List[str]

+
+
+

steps.field_merge.separator (property)

+

+ Separator to use while merging values of the two fields. +

+
Signature
+

str

+
+
+

steps.field_merge.preserve (property)

+

+ Indicates whether the original fields are preserved after merging. If True, the fields will not be removed. +

+
Signature
+

bool

+
+ + + + +
+
+

Move Field

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.field_move(name="id", position=3),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'},
+            {'name': 'id', 'type': 'integer'}]}
++-----------+------------+----+
+| name      | population | id |
++===========+============+====+
+| 'germany' |         83 |  1 |
++-----------+------------+----+
+| 'france'  |         66 |  2 |
++-----------+------------+----+
+| 'spain'   |         47 |  3 |
++-----------+------------+----+
+ +
+

Reference

+
+ + +
+
+ +

steps.field_move (class)

+ +
+
+ + +
+

steps.field_move (class)

+

Move field. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, title: Optional[str] = None, description: Optional[str] = None, name: str, position: int) -> None

+
Parameters
+
    +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + name + (str)
  • +
  • + position + (int)
  • +
+
+ +
+

steps.field_move.name (property)

+

+ Field name to move. +

+
Signature
+

str

+
+
+

steps.field_move.position (property)

+

+ New position for the field being moved. +

+
Signature
+

int

+
+ + + + +
+
+

Pack Fields

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        # as_object=True returns the packed fields as a JSON object; the default is an array
+    	# preserve argument keeps the original fields
+        steps.field_pack(name="details", from_names=["name", "population"], as_object=True, preserve=True)
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'},
+            {'name': 'details', 'type': 'object'}]}
++----+-----------+------------+-----------------------------------------+
+| id | name      | population | details                                 |
++====+===========+============+=========================================+
+|  1 | 'germany' |         83 | {'name': 'germany', 'population': '83'} |
++----+-----------+------------+-----------------------------------------+
+|  2 | 'france'  |         66 | {'name': 'france', 'population': '66'}  |
++----+-----------+------------+-----------------------------------------+
+|  3 | 'spain'   |         47 | {'name': 'spain', 'population': '47'}   |
++----+-----------+------------+-----------------------------------------+
+ +
+

Reference

+
+ + +
+
+ +

steps.field_pack (class)

+ +
+
+ + +
+

steps.field_pack (class)

+

Pack fields. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, title: Optional[str] = None, description: Optional[str] = None, name: str, from_names: List[str], as_object: bool = False, preserve: bool = False) -> None

+
Parameters
+
    +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + name + (str)
  • +
  • + from_names + (List[str])
  • +
  • + as_object + (bool)
  • +
  • + preserve + (bool)
  • +
+
+ +
+

steps.field_pack.name (property)

+

+ Name of the new field. +

+
Signature
+

str

+
+
+

steps.field_pack.from_names (property)

+

+ List of fields to be packed. +

+
Signature
+

List[str]

+
+
+

steps.field_pack.as_object (property)

+

+ The packed value of the field will be stored as an object if set to True. +

+
Signature
+

bool

+
+
+

steps.field_pack.preserve (property)

+

+ Specifies whether the source fields should be preserved. If True, the fields that are part of the packing process will be preserved. +

+
Signature
+

bool

+
+ + + + +
+
+

Remove Field

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.field_remove(names=["id"]),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++-----------+------------+
+| name      | population |
++===========+============+
+| 'germany' |         83 |
++-----------+------------+
+| 'france'  |         66 |
++-----------+------------+
+| 'spain'   |         47 |
++-----------+------------+
+ +
+

Reference

+
+ + +
+
+ +

steps.field_remove (class)

+ +
+
+ + +
+

steps.field_remove (class)

+

Remove field. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, names: List[str]) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + names + (List[str])
  • +
+
+ +
+

steps.field_remove.names (property)

+

+ List of fields to remove. +

+
Signature
+

List[str]

+
+ + + + +
+
+

Split Field

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.field_split(name="name", to_names=["name1", "name2"], pattern="a"),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'population', 'type': 'integer'},
+            {'name': 'name1', 'type': 'string'},
+            {'name': 'name2', 'type': 'string'}]}
++----+------------+--------+-------+
+| id | population | name1  | name2 |
++====+============+========+=======+
+|  1 |         83 | 'germ' | 'ny'  |
++----+------------+--------+-------+
+|  2 |         66 | 'fr'   | 'nce' |
++----+------------+--------+-------+
+|  3 |         47 | 'sp'   | 'in'  |
++----+------------+--------+-------+
+ +
+

Reference

+
+ + +
+
+ +

steps.field_split (class)

+ +
+
+ + +
+

steps.field_split (class)

+

Split field. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, title: Optional[str] = None, description: Optional[str] = None, name: str, to_names: List[str], pattern: str, preserve: bool = False) -> None

+
Parameters
+
    +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + name + (str)
  • +
  • + to_names + (List[str])
  • +
  • + pattern + (str)
  • +
  • + preserve + (bool)
  • +
+
+ +
+

steps.field_split.name (property)

+

+ Name of the field to split. +

+
Signature
+

str

+
+
+

steps.field_split.to_names (property)

+

+ List of names of new fields. +

+
Signature
+

List[str]

+
+
+

steps.field_split.pattern (property)

+

+ Pattern to split the field value, for example: "a". +

+
Signature
+

str

+
+
+

steps.field_split.preserve (property)

+

+ Whether to preserve the fields after the split. If True, the fields are not removed after the split. +

+
Signature
+

bool

+
+ + + + +
+
+

Unpack Field

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.field_update(name="id", value=[1, 1], descriptor={"type": "string"}),
+        steps.field_unpack(name="id", to_names=["id2", "id3"]),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'},
+            {'name': 'id2', 'type': 'any'},
+            {'name': 'id3', 'type': 'any'}]}
++-----------+------------+-----+-----+
+| name      | population | id2 | id3 |
++===========+============+=====+=====+
+| 'germany' |         83 |   1 |   1 |
++-----------+------------+-----+-----+
+| 'france'  |         66 |   1 |   1 |
++-----------+------------+-----+-----+
+| 'spain'   |         47 |   1 |   1 |
++-----------+------------+-----+-----+
+ +
+

Reference

+
+ + +
+
+ +

steps.field_unpack (class)

+ +
+
+ + +
+

steps.field_unpack (class)

+

Unpack field. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, title: Optional[str] = None, description: Optional[str] = None, name: str, to_names: List[str], preserve: bool = False) -> None

+
Parameters
+
    +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + name + (str)
  • +
  • + to_names + (List[str])
  • +
  • + preserve + (bool)
  • +
+
+ +
+

steps.field_unpack.name (property)

+

+ Name of the field to unpack. +

+
Signature
+

str

+
+
+

steps.field_unpack.to_names (property)

+

+ List of names for the new fields that will be created after unpacking. +

+
Signature
+

List[str]

+
+
+

steps.field_unpack.preserve (property)

+

+ Whether to preserve the source fields after unpacking. +

+
Signature
+

bool

+
+ + + + +
+
+

Update Field

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.field_update(name="id", value=str, descriptor={"type": "string"}),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'string'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++------+-----------+------------+
+| id   | name      | population |
++======+===========+============+
+| None | 'germany' |         83 |
++------+-----------+------------+
+| None | 'france'  |         66 |
++------+-----------+------------+
+| None | 'spain'   |         47 |
++------+-----------+------------+
+ +
+
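The signature below also accepts a Python function instead of a constant value. A sketch, assuming the function receives each cell value of the named field:

+
+from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        # assumption: the function is applied to every cell of the 'name' field
+        steps.field_update(name="name", function=lambda v: v.upper()),
+    ]
+)
+print(target.to_view())
+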

Reference

+
+ + +
+
+ +

steps.field_update (class)

+ +
+
+ + +
+

steps.field_update (class)

+

Update field. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, title: Optional[str] = None, description: Optional[str] = None, name: str, value: Optional[Any] = None, formula: Optional[Any] = None, function: Optional[Any] = None, descriptor: Optional[types.IDescriptor] = None) -> None

+
Parameters
+
    +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + name + (str)
  • +
  • + value + (Optional[Any])
  • +
  • + formula + (Optional[Any])
  • +
  • + function + (Optional[Any])
  • +
  • + descriptor + (Optional[types.IDescriptor])
  • +
+
+ +
+

steps.field_update.name (property)

+

+ Name of the field to update. +

+
Signature
+

str

+
+
+

steps.field_update.value (property)

+

+ Cell value to set for the field. +

+
Signature
+

Optional[Any]

+
+
+

steps.field_update.formula (property)

+

+ Evaluatable expressions to set the value for the field. The expressions are processed using the simpleeval library. +

+
Signature
+

Optional[Any]

+
+
+

steps.field_update.function (property)

+

+ Python function to set the value for the field. +

+
Signature
+

Optional[Any]

+
+
+

steps.field_update.descriptor (property)

+

+ A descriptor for the field to set the metadata. +

+
Signature
+

Optional[types.IDescriptor]

+
+ + + + +
+
+
\ No newline at end of file
diff --git a/docs/steps/resource.html b/docs/steps/resource.html
new file mode 100644
index 0000000000..85f5ead0af
--- /dev/null
+++ b/docs/steps/resource.html
@@ -0,0 +1,3891 @@
+Resource Steps | Frictionless Framework

Resource Steps

+

The Resource steps are only available for a package transformation (except for steps.resource_update, which is also available for standalone resources). This includes some basic resource management operations like adding or removing resources, along with the hierarchical resource_transform step.

+

Add Resource

+

This step adds a new resource to a data package.

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Package(resources=[Resource(name='main', path="transform.csv")])
+target = transform(
+    source,
+    steps=[
+        steps.resource_add(name='extra', descriptor={'path': 'transform.csv'}),
+    ],
+)
+print(target.resource_names)
+print(target.get_resource('extra').schema)
+print(target.get_resource('extra').to_view())
+
+ +
['main', 'extra']
+{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name      | population |
++====+===========+============+
+|  1 | 'germany' |         83 |
++----+-----------+------------+
+|  2 | 'france'  |         66 |
++----+-----------+------------+
+|  3 | 'spain'   |         47 |
++----+-----------+------------+
+ +
+

Reference

+
+ + +
+
+ +

steps.resource_add (class)

+ +
+
+ + +
+

steps.resource_add (class)

+

Add resource. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, title: Optional[str] = None, description: Optional[str] = None, name: str, descriptor: Dict[str, Any]) -> None

+
Parameters
+
    +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + name + (str)
  • +
  • + descriptor + (Dict[str, Any])
  • +
+
+ +
+

steps.resource_add.name (property)

+

+ Name of the resource to add. +

+
Signature
+

str

+
+
+

steps.resource_add.descriptor (property)

+

+ A descriptor for the resource. +

+
Signature
+

Dict[str, Any]

+
+ + + + +
+
+

Remove Resource

+

This step removes an existing resource from a data package.

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Package(resources=[Resource(name='main', path="transform.csv")])
+target = transform(
+    source,
+    steps=[
+        steps.resource_remove(name='main'),
+    ],
+)
+print(target)
+
+ +
{'resources': []}
+ +
+

Reference

+
+ + +
+
+ +

steps.resource_remove (class)

+ +
+
+ + +
+

steps.resource_remove (class)

+

Remove resource. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, title: Optional[str] = None, description: Optional[str] = None, name: str) -> None

+
Parameters
+
    +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + name + (str)
  • +
+
+ +
+

steps.resource_remove.name (property)

+

+ Name of the resource to remove. +

+
Signature
+

str

+
+ + + + +
+
+

Transform Resource

+

This is a hierarchical step that transforms one of a data package's resources. It's possible to use any resource steps as part of this package step.

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Package(resources=[Resource(name='main', path="transform.csv")])
+target = transform(
+    source,
+    steps=[
+        steps.resource_transform(name='main', steps=[
+            steps.row_sort(field_names=['name'])
+        ]),
+    ],
+)
+print(target.resource_names)
+print(target.get_resource('main').schema)
+print(target.get_resource('main').to_view())
+
+ +
['main']
+{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name      | population |
++====+===========+============+
+|  2 | 'france'  |         66 |
++----+-----------+------------+
+|  1 | 'germany' |         83 |
++----+-----------+------------+
+|  3 | 'spain'   |         47 |
++----+-----------+------------+
+ +
+

Reference

+
+ + +
+
+ +

steps.resource_transform (class)

+ +
+
+ + +
+

steps.resource_transform (class)

+

Transform resource. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, title: Optional[str] = None, description: Optional[str] = None, name: str, steps: List[Step]) -> None

+
Parameters
+
    +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + name + (str)
  • +
  • + steps + (List[Step])
  • +
+
+ +
+

steps.resource_transform.name (property)

+

+ Name of the resource to transform. +

+
Signature
+

str

+
+
+

steps.resource_transform.steps (property)

+

+ List of transformation steps to apply to the given resource. +

+
Signature
+

List[Step]

+
+ + + + +
+
+

Update Resource

+

This step updates a resource's metadata. It can be used for both resource and package transformations.

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Package(resources=[Resource(name='main', path="transform.csv")])
+target = transform(
+    source,
+    steps=[
+        steps.resource_update(
+          name='main',
+          descriptor={'title': 'Main Resource', 'description': 'For the docs'}
+        ),
+    ],
+)
+print(target.get_resource('main'))
+
+ +
{'name': 'main',
+ 'type': 'table',
+ 'title': 'Main Resource',
+ 'description': 'For the docs',
+ 'path': 'transform.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
+ +
+
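As noted at the top of this page, this is the one resource step that also works in a standalone resource transformation. Since name is optional in the signature below, a sketch assuming it can be omitted in that case:

+
+from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        # assumption: name can be omitted when transforming a single resource
+        steps.resource_update(descriptor={'title': 'Main Resource'}),
+    ],
+)
+print(target.title)
+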

Reference

+
+ + +
+
+ +

steps.resource_update (class)

+ +
+
+ + +
+

steps.resource_update (class)

+

Update resource. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, title: Optional[str] = None, description: Optional[str] = None, name: Optional[str] = None, descriptor: types.IDescriptor) -> None

+
Parameters
+
    +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + name + (Optional[str])
  • +
  • + descriptor + (types.IDescriptor)
  • +
+
+ +
+

steps.resource_update.name (property)

+

+ Name of the resource to update. +

+
Signature
+

Optional[str]

+
+
+

steps.resource_update.descriptor (property)

+

+ New descriptor for the resource to update metadata. +

+
Signature
+

types.IDescriptor

+
+ + + + +
+
+
\ No newline at end of file
diff --git a/docs/steps/row.html b/docs/steps/row.html
new file mode 100644
index 0000000000..721a0031be
--- /dev/null
+++ b/docs/steps/row.html
@@ -0,0 +1,4326 @@
+Row Steps | Frictionless Framework

Row Steps

+

These steps are row-based, including row filtering, slicing, and many more.

+

Filter Rows

+

This step filters rows based on a provided formula or function.

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.table_normalize(),
+        steps.row_filter(formula="id > 1"),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++----+----------+------------+
+| id | name     | population |
++====+==========+============+
+|  2 | 'france' |         66 |
++----+----------+------------+
+|  3 | 'spain'  |         47 |
++----+----------+------------+
+ +
+
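Alternatively, a Python function can be used instead of a formula. A sketch, assuming the function receives the row and returns True to keep it:

+
+from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.table_normalize(),
+        # assumption: the function gets a row mapping and returns a boolean
+        steps.row_filter(function=lambda row: row["population"] > 50),
+    ]
+)
+print(target.to_view())
+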

Reference

+
+ + +
+
+ +

steps.row_filter (class)

+ +
+
+ + +
+

steps.row_filter (class)

+

Filter rows. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, formula: Optional[Any] = None, function: Optional[Any] = None) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + formula + (Optional[Any])
  • +
  • + function + (Optional[Any])
  • +
+
+ +
+

steps.row_filter.formula (property)

+

+ Evaluatable expressions to filter the rows. Rows that match the formula are returned and the others are ignored. The expressions are processed using the simpleeval library. +

+
Signature
+

Optional[Any]

+
+
+

steps.row_filter.function (property)

+

+ Python function to filter the row. +

+
Signature
+

Optional[Any]

+
+ + + + +
+
+

Search Rows

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.row_search(regex=r"^f.*"),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++----+----------+------------+
+| id | name     | population |
++====+==========+============+
+|  2 | 'france' |         66 |
++----+----------+------------+
+ +
+
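The match can be inverted with the negate option from the signature below; a sketch keeping only the rows that do not match the pattern:

+
+from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        # keep rows that do NOT match the regex
+        steps.row_search(regex=r"^f.*", negate=True),
+    ]
+)
+print(target.to_view())
+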

Reference

+
+ + +
+
+ + + +
+
+ + +
+ +

steps.row_search (class)

Search rows. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, regex: str, field_name: Optional[str] = None, negate: bool = False) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + regex + (str)
  • +
  • + field_name + (Optional[str])
  • +
  • + negate + (bool)
  • +
+
+ +
+

steps.row_search.regex (property)

+

+ Regex pattern to search for rows. If field_name is set, it will only be applied to the specified field. For example, regex=r"^e.*". +

+
Signature
+

str

+
+
+

steps.row_search.field_name (property)

+

+ Field name in which to search. +

+
Signature
+

Optional[str]

+
+
+

steps.row_search.negate (property)

+

+ Whether to invert the result. If True, all the rows that do not match the pattern will be returned. +

+
Signature
+

bool

+
+ + + + +
+
+

Slice Rows

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.row_slice(head=2),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name      | population |
++====+===========+============+
+|  1 | 'germany' |         83 |
++----+-----------+------------+
+|  2 | 'france'  |         66 |
++----+-----------+------------+
+ +
+
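head reads rows from the top; the signature below also supports start/stop/step slicing. A sketch selecting the second and third rows, assuming the usual Python slice semantics (start inclusive, stop exclusive):

+
+from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        # rows are sliced like a Python sequence
+        steps.row_slice(start=1, stop=3),
+    ]
+)
+print(target.to_view())
+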

Reference

+
+ + +
+
+ +

steps.row_slice (class)

+ +
+
+ + +
+

steps.row_slice (class)

+

Slice rows. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, start: Optional[int] = None, stop: Optional[int] = None, step: Optional[int] = None, head: Optional[int] = None, tail: Optional[int] = None) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + start + (Optional[int])
  • +
  • + stop + (Optional[int])
  • +
  • + step + (Optional[int])
  • +
  • + head + (Optional[int])
  • +
  • + tail + (Optional[int])
  • +
+
+ +
+

steps.row_slice.start (property)

+

+ Starting point from which to read the rows. If None, defaults to the beginning. +

+
Signature
+

Optional[int]

+
+
+

steps.row_slice.stop (property)

+

+ Stopping point for reading rows. If None, defaults to the end. +

+
Signature
+

Optional[int]

+
+
+

steps.row_slice.step (property)

+

+ Step size for reading the next row. If None, it defaults to 1. +

+
Signature
+

Optional[int]

+
+
+

steps.row_slice.head (property)

+

+ Number of rows to read from the top. +

+
Signature
+

Optional[int]

+
+
+

steps.row_slice.tail (property)

+

+ Number of rows to read from the bottom. +

+
Signature
+

Optional[int]

+
+ + + + +
+
+

Sort Rows

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.row_sort(field_names=["name"]),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name      | population |
++====+===========+============+
+|  2 | 'france'  |         66 |
++----+-----------+------------+
+|  1 | 'germany' |         83 |
++----+-----------+------------+
+|  3 | 'spain'   |         47 |
++----+-----------+------------+
+ +
+

Reference

+
+ + +
+
+ +

steps.row_sort (class)

+ +
+
+ + +
+

steps.row_sort (class)

+

Sort rows. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, field_names: List[str], reverse: bool = False) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + field_names + (List[str])
  • +
  • + reverse + (bool)
  • +
+
+ +
+

steps.row_sort.field_names (property)

+

+ List of field names by which the rows will be sorted. If more than one field is given, the sort applies from left to right. +

+
Signature
+

List[str]

+
+
+

steps.row_sort.reverse (property)

+

+ The sort will be reversed if it is set to True. +

+
Signature
+

bool

+
+ + + + +
+
+

Split Rows

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.row_split(field_name="name", pattern="a"),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++----+--------+------------+
+| id | name   | population |
++====+========+============+
+|  1 | 'germ' |         83 |
++----+--------+------------+
+|  1 | 'ny'   |         83 |
++----+--------+------------+
+|  2 | 'fr'   |         66 |
++----+--------+------------+
+|  2 | 'nce'  |         66 |
++----+--------+------------+
+|  3 | 'sp'   |         47 |
++----+--------+------------+
+...
+ +
+

Reference

+
+ + +
+
+ +

steps.row_split (class)

+ +
+
+ + +
+

steps.row_split (class)

+

Split rows. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, pattern: str, field_name: str) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + pattern + (str)
  • +
  • + field_name + (str)
  • +
+
+ +
+

steps.row_split.pattern (property)

+

+ Pattern on which the field's cell value will be split. +

+
Signature
+

str

+
+
+

steps.row_split.field_name (property)

+

+ Field name whose cell value will be split. +

+
Signature
+

str

+
+ + + + +
+
+

Subset Rows

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.field_update(name="id", value=1),
+        steps.row_subset(subset="conflicts", field_name="id"),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name      | population |
++====+===========+============+
+|  1 | 'germany' |         83 |
++----+-----------+------------+
+|  1 | 'france'  |         66 |
++----+-----------+------------+
+|  1 | 'spain'   |         47 |
++----+-----------+------------+
+ +
+
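The other subsets listed in the reference below work the same way; a sketch, assuming "distinct" keeps only the first row per distinct value of the given field:

+
+from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.field_update(name="id", value=1),
+        # assumption: "distinct" deduplicates rows on the 'id' field
+        steps.row_subset(subset="distinct", field_name="id"),
+    ]
+)
+print(target.to_view())
+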

Reference

+
+ + +
+
+ +

steps.row_subset (class)

+ +
+
+ + +
+

steps.row_subset (class)

+

Subset rows. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, subset: str, field_name: Optional[str] = None) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + subset + (str)
  • +
  • + field_name + (Optional[str])
  • +
+
+ +
+

steps.row_subset.subset (property)

+

+ It can take values such as "conflicts", "distinct", "duplicates" and "unique". +

+
Signature
+

str

+
+
+

steps.row_subset.field_name (property)

+

+ Name of the field to which the subset functions will be applied. +

+
Signature
+

Optional[str]

+
+ + + + +
+
+

Ungroup Rows

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform-groups.csv")
+target = transform(
+    source,
+    steps=[
+        steps.row_ungroup(group_name="name", selection="first"),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'},
+            {'name': 'year', 'type': 'integer'}]}
++----+-----------+------------+------+
+| id | name      | population | year |
++====+===========+============+======+
+|  3 | 'france'  |         66 | 2020 |
++----+-----------+------------+------+
+|  1 | 'germany' |         83 | 2020 |
++----+-----------+------------+------+
+|  5 | 'spain'   |         47 | 2020 |
++----+-----------+------------+------+
+ +
+

Reference

+
+ + +
+
+ +

steps.row_ungroup (class)

+ +
+
+ + +
+

steps.row_ungroup (class)

+

Ungroup rows. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, selection: str, group_name: str, value_name: Optional[str] = None) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + selection + (str)
  • +
  • + group_name + (str)
  • +
  • + value_name + (Optional[str])
  • +
+
+ +
+

steps.row_ungroup.selection (property)

+

+ Specifies which row to return per group. The value can be "first", "last", "min" or "max". +

+
Signature
+

str

+
+
+

steps.row_ungroup.group_name (property)

+

+ Field name which will be used to group the rows. The first or last row within each group is returned, based on the 'selection'. +

+
Signature
+

str

+
+
+

steps.row_ungroup.value_name (property)

+

+ If the selection is set to "min" or "max", the rows will be grouped by the "group_name" field and the min or max value will then be selected from the "value_name" field. +

+
Signature
+

Optional[str]

+
+ + + + +
+
+
\ No newline at end of file
diff --git a/docs/steps/table.html b/docs/steps/table.html
new file mode 100644
index 0000000000..692aeaa454
--- /dev/null
+++ b/docs/steps/table.html
@@ -0,0 +1,5191 @@
+Table Steps | Frictionless Framework

Table Steps

+

These steps are meant to be used on the table level of a resource. This includes various operations, from simple validation or writing to disk to complex re-shaping like pivoting or melting.

+

Aggregate Table

+

Groups rows by the given group_name, then applies the aggregation functions provided in the aggregation dictionary (see the example below).

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform-groups.csv")
+target = transform(
+    source,
+    steps=[
+        steps.table_normalize(),
+        steps.table_aggregate(
+            group_name="name", aggregation={"sum": ("population", sum)}
+        ),
+    ],
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'name', 'type': 'string'}, {'name': 'sum', 'type': 'any'}]}
++-----------+-----+
+| name      | sum |
++===========+=====+
+| 'france'  | 120 |
++-----------+-----+
+| 'germany' | 160 |
++-----------+-----+
+| 'spain'   |  80 |
++-----------+-----+
+ +
+
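Several aggregations can be computed at once by adding more entries to the dictionary; a sketch, assuming each value is a (field, function) pair as in the example above:

+
+from frictionless import Resource, transform, steps
+
+source = Resource(path="transform-groups.csv")
+target = transform(
+    source,
+    steps=[
+        steps.table_normalize(),
+        steps.table_aggregate(
+            group_name="name",
+            # each output field maps to a (source field, function) pair
+            aggregation={
+                "min": ("population", min),
+                "max": ("population", max),
+            },
+        ),
+    ],
+)
+print(target.to_view())
+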

Reference

+
+ + +
+
+ +

steps.table_aggregate (class)

+ +
+
+ + +
+

steps.table_aggregate (class)

+

Aggregate table. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, aggregation: Dict[str, Any], group_name: str) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + aggregation + (Dict[str, Any])
  • +
  • + group_name + (str)
  • +
+
+ +
+

steps.table_aggregate.aggregation (property)

+

+ A dictionary of aggregation functions. The values could be max, min, len and sum. +

+
Signature
+

Dict[str, Any]

+
+
+

steps.table_aggregate.group_name (property)

+

+ Field by which the rows will be grouped. +

+
Signature
+

str

+
+ + + + +
+
+

Attach Table

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+      steps.table_attach(resource=Resource(data=[["note"], ["large"], ["mid"]])),
+    ],
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'},
+            {'name': 'note', 'type': 'string'}]}
++----+-----------+------------+---------+
+| id | name      | population | note    |
++====+===========+============+=========+
+|  1 | 'germany' |         83 | 'large' |
++----+-----------+------------+---------+
+|  2 | 'france'  |         66 | 'mid'   |
++----+-----------+------------+---------+
+|  3 | 'spain'   |         47 | None    |
++----+-----------+------------+---------+
+ +
+

Reference

+
+ + +
+
+ +

steps.table_attach (class)

+ +
+
+ + +
+

steps.table_attach (class)

+

Attach table. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, resource: Union[Resource, str]) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + resource + (Union[Resource, str])
  • +
+
+ +
+

steps.table_attach.resource (property)

+

+ Data Resource to attach to the existing table. +

+
Signature
+

Union[Resource, str]

+
+ + + + +
+
+

Debug Table

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+      steps.table_debug(function=print),
+    ],
+)
+print(target.to_view())
+
+ +
{'id': 1, 'name': 'germany', 'population': 83}
+{'id': 2, 'name': 'france', 'population': 66}
+{'id': 3, 'name': 'spain', 'population': 47}
++----+-----------+------------+
+| id | name      | population |
++====+===========+============+
+|  1 | 'germany' |         83 |
++----+-----------+------------+
+|  2 | 'france'  |         66 |
++----+-----------+------------+
+|  3 | 'spain'   |         47 |
++----+-----------+------------+
+ +
+

Reference

+
+ + +
+
+ +

steps.table_debug (class)

+ +
+
+ + +
+

steps.table_debug (class)

+

Debug table. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, function: Any) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + function + (Any)
  • +
+
+ +
+

steps.table_debug.function (property)

+

+ Debug function to apply to each table row. +

+
Signature
+

Any

+
+ + + + +
+
+

Diff Tables

+

Example

+ +
+
+
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+    source,
+    steps=[
+        steps.table_normalize(),
+        steps.table_diff(
+            resource=Resource(
+                data=[
+                    ["id", "name", "population"],
+                    [1, "germany", 83],
+                    [2, "france", 50],
+                    [3, "spain", 47],
+                ]
+            )
+        ),
+    ]
+)
+print(target.schema)
+print(target.to_view())
+
+ +
{'fields': [{'name': 'id', 'type': 'integer'},
+            {'name': 'name', 'type': 'string'},
+            {'name': 'population', 'type': 'integer'}]}
++----+----------+------------+
+| id | name     | population |
++====+==========+============+
+|  2 | 'france' |         66 |
++----+----------+------------+
+ +
+

Reference

+
+ + +
+
+ +

steps.table_diff (class)

+ +
+
+ + +
+

steps.table_diff (class)

+

Diff tables. This step can be added using the `steps` parameter for the `transform` function.

+
Signature
+

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, resource: Union[Resource, str], ignore_order: bool = False, use_hash: bool = False) -> None

+
Parameters
+
    +
  • + name + (Optional[str])
  • +
  • + title + (Optional[str])
  • +
  • + description + (Optional[str])
  • +
  • + resource + (Union[Resource, str])
  • +
  • + ignore_order + (bool)
  • +
  • + use_hash + (bool)
  • +
+
+ +
+

steps.table_diff.resource (property)

+

+ Resource with which to compare. +

+
Signature
+

Union[Resource, str]

+
+
+

steps.table_diff.ignore_order (property)

Specifies whether to ignore the order of the rows.

Signature

bool

steps.table_diff.use_hash (property)

Specifies whether to use hashing. If True, an alternative implementation is used in which the complement is computed by building an in-memory set of all rows found in the right-hand table. For more information see: https://petl.readthedocs.io/en/stable/transform.html#petl.transform.setops.hashcomplement

Signature

bool
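As a minimal sketch (not from the original reference), `ignore_order` and `use_hash` can be combined when the comparison table arrives in a different row order; the data below is illustrative:

from frictionless import Resource, transform, steps

# Sketch only: diff against a reordered comparison table using the
# hash-based complement, so rows are matched by value, not position.
source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.table_normalize(),
        steps.table_diff(
            resource=Resource(
                data=[
                    ["id", "name", "population"],
                    [3, "spain", 47],
                    [1, "germany", 83],
                ]
            ),
            ignore_order=True,
            use_hash=True,
        ),
    ],
)
print(target.to_view())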

Intersect Tables


Example

from pprint import pprint
from frictionless import Package, Resource, transform, steps

source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.table_normalize(),
        steps.table_intersect(
            resource=Resource(
                data=[
                    ["id", "name", "population"],
                    [1, "germany", 83],
                    [2, "france", 50],
                    [3, "spain", 47],
                ]
            ),
        ),
    ]
)
print(target.schema)
print(target.to_view())

{'fields': [{'name': 'id', 'type': 'integer'},
            {'name': 'name', 'type': 'string'},
            {'name': 'population', 'type': 'integer'}]}
+----+-----------+------------+
| id | name      | population |
+====+===========+============+
|  1 | 'germany' |         83 |
+----+-----------+------------+
|  3 | 'spain'   |         47 |
+----+-----------+------------+

Reference


steps.table_intersect (class)

Intersect tables. This step can be added using the `steps` parameter for the `transform` function.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, resource: Union[Resource, str], use_hash: bool = False) -> None

Parameters

  • name (Optional[str])
  • title (Optional[str])
  • description (Optional[str])
  • resource (Union[Resource, str])
  • use_hash (bool)

steps.table_intersect.resource (property)

Resource with which to apply intersection.

Signature

Union[Resource, str]

steps.table_intersect.use_hash (property)

Specifies whether to use hashing. If True, an alternative implementation is used. For more information see: https://petl.readthedocs.io/en/stable/transform.html#petl.transform.setops.hashintersection

Signature

bool

Join Tables


Example

from pprint import pprint
from frictionless import Package, Resource, transform, steps

source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.table_normalize(),
        steps.table_join(
            resource=Resource(data=[["id", "note"], [1, "beer"], [2, "vine"]]),
            field_name="id",
        ),
    ]
)
print(target.schema)
print(target.to_view())

{'fields': [{'name': 'id', 'type': 'integer'},
            {'name': 'name', 'type': 'string'},
            {'name': 'population', 'type': 'integer'},
            {'name': 'note', 'type': 'string'}]}
+----+-----------+------------+--------+
| id | name      | population | note   |
+====+===========+============+========+
|  1 | 'germany' |         83 | 'beer' |
+----+-----------+------------+--------+
|  2 | 'france'  |         66 | 'vine' |
+----+-----------+------------+--------+

Reference


steps.table_join (class)

Join tables. This step can be added using the `steps` parameter for the `transform` function.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, resource: Union[Resource, str], field_name: Optional[str] = None, use_hash: bool = False, mode: str = 'inner') -> None

Parameters

  • name (Optional[str])
  • title (Optional[str])
  • description (Optional[str])
  • resource (Union[Resource, str])
  • field_name (Optional[str])
  • use_hash (bool)
  • mode (str)

steps.table_join.resource (property)

Resource with which to apply the join.

Signature

Union[Resource, str]

steps.table_join.field_name (property)

Field name on which the join is performed, comparing its values between the two tables. If not provided, a natural join is attempted. For more information see: https://petl.readthedocs.io/en/stable/_modules/petl/transform/joins.html

Signature

Optional[str]

steps.table_join.use_hash (property)

Specifies whether to use hashing. If True, an alternative implementation of the join is used.

Signature

bool

steps.table_join.mode (property)

Specifies which mode to use. The available modes are: "inner", "left", "right", "outer", "cross" and "negate". The default mode is "inner".

Signature

str
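A minimal sketch (not from the original reference) of the `mode` parameter: with mode="left", the unmatched row for 'spain' should be kept with an empty note, reusing the same illustrative lookup data as the example above:

from frictionless import Resource, transform, steps

# Sketch: a left join keeps left-table rows that have no match.
source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.table_normalize(),
        steps.table_join(
            resource=Resource(data=[["id", "note"], [1, "beer"], [2, "vine"]]),
            field_name="id",
            mode="left",
        ),
    ],
)
print(target.to_view())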

Melt Table


Example

from pprint import pprint
from frictionless import Package, Resource, transform, steps

source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.table_normalize(),
        steps.table_melt(field_name="name"),
    ]
)
print(target.schema)
print(target.to_view())

{'fields': [{'name': 'name', 'type': 'string'},
            {'name': 'variable', 'type': 'string'},
            {'name': 'value', 'type': 'any'}]}
+-----------+--------------+-------+
| name      | variable     | value |
+===========+==============+=======+
| 'germany' | 'id'         |     1 |
+-----------+--------------+-------+
| 'germany' | 'population' |    83 |
+-----------+--------------+-------+
| 'france'  | 'id'         |     2 |
+-----------+--------------+-------+
| 'france'  | 'population' |    66 |
+-----------+--------------+-------+
| 'spain'   | 'id'         |     3 |
+-----------+--------------+-------+
...

Reference


steps.table_melt (class)

Melt tables. This step can be added using the `steps` parameter for the `transform` function.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, field_name: str, variables: Optional[str] = None, to_field_names: List[str] = NOTHING) -> None

Parameters

  • name (Optional[str])
  • title (Optional[str])
  • description (Optional[str])
  • field_name (str)
  • variables (Optional[str])
  • to_field_names (List[str])

steps.table_melt.field_name (property)

Field name used to melt the table. The field 'field_name' is kept as is, while the other fields are melted into the data.

Signature

str

steps.table_melt.variables (property)

List of names of the fields that will be melted into the data.

Signature

Optional[str]

steps.table_melt.to_field_names (property)

Labels for the two new fields that will be created, "variable" and "value".

Signature

List[str]
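A minimal sketch (assumption, not from the original reference) of renaming the generated fields via `to_field_names`:

from frictionless import Resource, transform, steps

# Sketch: melt as above, but name the new fields "key" and "val"
# instead of the default "variable" and "value".
source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.table_normalize(),
        steps.table_melt(field_name="name", to_field_names=["key", "val"]),
    ],
)
print(target.schema)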

Merge Tables


Example

from pprint import pprint
from frictionless import Package, Resource, transform, steps

source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.table_merge(
            resource=Resource(data=[["id", "name", "note"], [4, "malta", "island"]])
        ),
    ]
)
print(target.schema)
print(target.to_view())
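The output is not shown here; given the merge semantics described in the reference below, the result should contain the three rows of transform.csv plus the new malta row, with a note field that is empty for the original rows.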

Reference


steps.table_merge (class)

Merge tables. This step can be added using the `steps` parameter for the `transform` function.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, resource: Union[Resource, str], field_names: List[str] = NOTHING, sort_by_field: Optional[str] = None, ignore_fields: bool = False) -> None

Parameters

  • name (Optional[str])
  • title (Optional[str])
  • description (Optional[str])
  • resource (Union[Resource, str])
  • field_names (List[str])
  • sort_by_field (Optional[str])
  • ignore_fields (bool)

steps.table_merge.resource (property)

Resource to merge with.

Signature

Union[Resource, str]

steps.table_merge.field_names (property)

Specifies fixed headers for the output table.

Signature

List[str]

steps.table_merge.sort_by_field (property)

Field name by which to sort the records after merging.

Signature

Optional[str]

steps.table_merge.ignore_fields (property)

If set to True, the two resources are merged without matching their headers.

Signature

bool
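A minimal sketch (assumed data, hedged on the exact semantics) of `field_names` fixing the output headers; columns not listed should be dropped from the merged result:

from frictionless import Resource, transform, steps

# Sketch: restrict the merged output to the listed fields only.
source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.table_merge(
            resource=Resource(data=[["id", "name", "note"], [4, "malta", "island"]]),
            field_names=["id", "name"],
        ),
    ],
)
print(target.to_view())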

Normalize Table

The table_normalize step normalizes the underlying tabular stream (casting types and fixing dimensions) according to a provided or inferred schema. Unless your data is very large, it's recommended to normalize a table before any other steps.

Example

from pprint import pprint
from frictionless import Package, Resource, transform, steps

source = Resource("table.csv")
print(source.read_cells())
target = transform(
    source,
    steps=[
        steps.table_normalize(),
    ]
)
print(target.read_cells())

[['id', 'name'], ['1', 'english'], ['2', '中国人']]
[['id', 'name'], [1, 'english'], [2, '中国人']]

Reference


steps.table_normalize (class)

Normalize table. This step can be added using the `steps` parameter for the `transform` function.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None

Parameters

  • name (Optional[str])
  • title (Optional[str])
  • description (Optional[str])

Pivot Table


Example

from pprint import pprint
from frictionless import Package, Resource, transform, steps

source = Resource(path="transform-pivot.csv")
target = transform(
    source,
    steps=[
        steps.table_normalize(),
        steps.table_pivot(f1="region", f2="gender", f3="units", aggfun=sum),
    ]
)
print(target.schema)
print(target.to_view())
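The output is not shown here; the pivoted table should contain one row per region, one column per gender value, and the summed units in the cells.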

Reference


steps.table_pivot (class)

Pivot table. This step can be added using the `steps` parameter for the `transform` function.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, f1: str, f2: str, f3: str, aggfun: Any) -> None

Parameters

  • name (Optional[str])
  • title (Optional[str])
  • description (Optional[str])
  • f1 (str)
  • f2 (str)
  • f3 (str)
  • aggfun (Any)

steps.table_pivot.f1 (property)

Field that makes the rows in the output pivot table.

Signature

str

steps.table_pivot.f2 (property)

Field that makes the columns in the output pivot table.

Signature

str

steps.table_pivot.f3 (property)

Field that forms the data in the output pivot table.

Signature

str

steps.table_pivot.aggfun (property)

Aggregation function used to produce the data in the output pivot table; for example sum, max, min, or len.

Signature

Any
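A minimal sketch (assumption, not from the original reference) of a different aggregation: passing Python's built-in len should count the rows in each (region, gender) group instead of summing units:

from frictionless import Resource, transform, steps

# Sketch: count rows per group rather than summing their units.
source = Resource(path="transform-pivot.csv")
target = transform(
    source,
    steps=[
        steps.table_normalize(),
        steps.table_pivot(f1="region", f2="gender", f3="units", aggfun=len),
    ],
)
print(target.to_view())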

Print Table


Example

from pprint import pprint
from frictionless import Package, Resource, transform, steps

source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.table_normalize(),
        steps.table_print(),
    ]
)

==  =======  ==========
id  name     population
==  =======  ==========
 1  germany          83
 2  france           66
 3  spain            47
==  =======  ==========

Reference


steps.table_print (class)

Print table. This step can be added using the `steps` parameter for the `transform` function.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None

Parameters

  • name (Optional[str])
  • title (Optional[str])
  • description (Optional[str])

Recast Table


Example

from pprint import pprint
from frictionless import Package, Resource, transform, steps

source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.table_normalize(),
        steps.table_melt(field_name="id"),
        steps.table_recast(field_name="id"),
    ]
)
print(target.schema)
print(target.to_view())

{'fields': [{'name': 'id', 'type': 'integer'},
            {'name': 'name', 'type': 'string'},
            {'name': 'population', 'type': 'integer'}]}
+----+-----------+------------+
| id | name      | population |
+====+===========+============+
|  1 | 'germany' |         83 |
+----+-----------+------------+
|  2 | 'france'  |         66 |
+----+-----------+------------+
|  3 | 'spain'   |         47 |
+----+-----------+------------+

Reference


steps.table_recast (class)

Recast table. This step can be added using the `steps` parameter for the `transform` function.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, field_name: str, from_field_names: List[str] = NOTHING) -> None

Parameters

  • name (Optional[str])
  • title (Optional[str])
  • description (Optional[str])
  • field_name (str)
  • from_field_names (List[str])

steps.table_recast.field_name (property)

Field name by which the table will be recast.

Signature

str

steps.table_recast.from_field_names (property)

List of field names for the output table.

Signature

List[str]

Transpose Table


Example

from pprint import pprint
from frictionless import Package, Resource, transform, steps

source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.table_normalize(),
        steps.table_transpose(),
    ]
)
print(target.schema)
print(target.to_view())

{'fields': [{'name': 'id', 'type': 'string'},
            {'name': '1', 'type': 'any'},
            {'name': '2', 'type': 'any'},
            {'name': '3', 'type': 'any'}]}
+--------------+-----------+----------+---------+
| id           | 1         | 2        | 3       |
+==============+===========+==========+=========+
| 'name'       | 'germany' | 'france' | 'spain' |
+--------------+-----------+----------+---------+
| 'population' |        83 |       66 |      47 |
+--------------+-----------+----------+---------+

Reference


steps.table_transpose (class)

Transpose table. This step can be added using the `steps` parameter for the `transform` function.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None

Parameters

  • name (Optional[str])
  • title (Optional[str])
  • description (Optional[str])

Validate Table


Example

from pprint import pprint
from frictionless import Package, Resource, transform, steps

source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.cell_set(field_name="population", value="bad"),
        steps.table_validate(),
    ]
)
pprint(target.schema)
try:
    pprint(target.to_view())
except Exception as exception:
    pprint(exception)

{'fields': [{'name': 'id', 'type': 'integer'},
            {'name': 'name', 'type': 'string'},
            {'name': 'population', 'type': 'integer'}]}
FrictionlessException('[step-error] Step is not valid: "table_validate" raises "[type-error] Type error in the cell "bad" in row "2" and field "population" at position "3": type is "integer/default" " ')

Reference


steps.table_validate (class)

Validate table. This step can be added using the `steps` parameter for the `transform` function.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None

Parameters

  • name (Optional[str])
  • title (Optional[str])
  • description (Optional[str])

Write Table


Example

from pprint import pprint
from frictionless import Package, Resource, transform, steps

source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.table_write(path='transform.json'),
    ]
)

Let's read the output:

cat transform.json

[
  [
    "id",
    "name",
    "population"
  ],
  [
    1,
    "germany",
    83
  ],
  [
    2,
    "france",
    66
  ],
  [
    3,
    "spain",
    47
  ]
]
The same file can be read back in Python, printing the same JSON document:

with open('transform.json') as file:
    print(file.read())

Reference


steps.table_write (class)

Write table. This step can be added using the `steps` parameter for the `transform` function.

Signature

(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, path: str) -> None

Parameters

  • name (Optional[str])
  • title (Optional[str])
  • description (Optional[str])
  • path (str)

steps.table_write.path (property)

Path of the file to which the table content is written.

Signature

str
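A minimal sketch of writing a different format; this assumes (not confirmed by the reference above) that the output format is detected from the path extension, as it is for reading:

from frictionless import Resource, transform, steps

# Sketch: write the transformed table as CSV instead of JSON
# (assumes format detection from the ".csv" extension).
source = Resource(path="transform.csv")
target = transform(
    source,
    steps=[
        steps.table_write(path="transform-copy.csv"),
    ],
)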

frictionless-py

Build | Coverage | Release | Citation | Codebase | Support

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data (DEVT Framework). It supports a wide range of data sources and formats and provides integrations with popular platforms. The framework is powered by the lightweight yet comprehensive Frictionless Standards.


Purpose

  • Describe your data: You can infer, edit and save metadata of your data tables. It's a first step for ensuring data quality and usability. Frictionless metadata includes general information about your data like a textual description, as well as field types and other tabular data details.
  • Extract your data: You can read your data using a unified tabular interface. Data quality and consistency are guaranteed by a schema. Frictionless supports various file schemes like HTTP, FTP, and S3 and data formats like CSV, XLS, JSON, SQL, and others.
  • Validate your data: You can validate data tables, resources, and datasets. Frictionless generates a unified validation report and supports many options to customize the validation process.
  • Transform your data: You can clean, reshape, and transfer your data tables and datasets. Frictionless provides a pipeline capability and a lower-level interface to work with the data. A minimal sketch of the first three capabilities follows this list.
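As a minimal sketch in Python (the file path is illustrative):

from frictionless import describe, extract, validate

# Infer metadata, read the rows, and validate a local table
print(describe("data/table.csv"))
print(extract("data/table.csv"))
print(validate("data/table.csv").valid)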

Features

  • Open Source (MIT)
  • Powerful Python framework
  • Convenient command-line interface
  • Low memory consumption for data of any size
  • Reasonable performance on big data
  • Support for compressed files
  • Custom checks and formats
  • Fully pluggable architecture
  • More than 1000 tests

Installation

$ pip install frictionless

Example

$ frictionless validate data/invalid.csv
[invalid] data/invalid.csv

  row    field  code              message
-----  -------  ----------------  --------------------------------------------
             3  blank-header      Header in field at position "3" is blank
             4  duplicate-header  Header "name" in field "4" is duplicated
    2        3  missing-cell      Row "2" has a missing cell in field "field3"
    2        4  missing-cell      Row "2" has a missing cell in field "name2"
    3        3  missing-cell      Row "3" has a missing cell in field "field3"
    3        4  missing-cell      Row "3" has a missing cell in field "name2"
    4           blank-row         Row "4" is completely blank
    5        5  extra-cell        Row "5" has an extra value in field  "5"

Documentation


Please visit our documentation portal: https://framework.frictionlessdata.io
