Frictionless primary key (#597)
* Unique keyword arg (#580)

* add copy button to docs (#448)

* Add missing inplace arg to SchemaModel's validate (#450)

* link documentation to github (#449)

Co-authored-by: Niels Bantilan <[email protected]>

* intermediate commit for review by @cosmicBboy

* link documentation to github (#449)

Co-authored-by: Niels Bantilan <[email protected]>

* intermediate commit for review by @cosmicBboy

* WIP

* fix test errors, re-factor allow_duplicates handling

* fix io tests

* fix docs, remove _allow_duplicates private var

* update unique type signature in strategies

* completing tests for setters and lazy evaluation of unique kw

* small fix for the linting errors

* support dataframe-level uniqueness in strategies

* add docs, fix error formatting, add multiindex support

Co-authored-by: Jean-Francois Zinque <[email protected]>
Co-authored-by: tfwillems <[email protected]>
Co-authored-by: fkroll8 <[email protected]>
Co-authored-by: fkroll8 <[email protected]>

* Add support for timezone-aware datetime strategies (#595)

* add support for Any annotation in schema model (#594)

* add support for Any annotation in schema model

The motivation behind this feature is to support column annotations
that can have any type, covering use cases like the one described in
#592, where custom checks can be applied to any column except those
explicitly defined as schema model class attributes.
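
A minimal sketch of that use case (not part of the commit; assumes the
``Series[Any]`` annotation form and hypothetical column names):

    from typing import Any

    import pandera as pa
    from pandera.typing import Series

    class Schema(pa.SchemaModel):
        defined_col: Series[int]  # dtype validated as usual
        anything: Series[Any]     # column must exist, but any dtype passes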

* update pylint, fix lint

* Docs/scaling - Bring Pandera to Spark and Dask (#588)

* scaling.rst

* edited conf

* finished first pass

* removing FugueWorkflow

* Update index.rst

* Update docs/source/scaling.rst

Co-authored-by: Niels Bantilan <[email protected]>

* add support for timezone-aware datetime strategies

* fix le/ge strategies with datetime

* fix mypy errors

Co-authored-by: Niels Bantilan <[email protected]>
Co-authored-by: Kevin Kho <[email protected]>
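
A quick sketch of what the timezone-aware strategy support enables (not
from the commit; assumes the hypothesis extra is installed and that
``Check.le`` accepts a tz-aware timestamp):

    import pandas as pd
    import pandera as pa

    schema = pa.DataFrameSchema(
        {
            "ts": pa.Column(
                pd.DatetimeTZDtype(tz="UTC"),
                checks=pa.Check.le(pd.Timestamp("2022-01-01", tz="UTC")),
            )
        }
    )
    # draw a hypothesis-generated sample whose "ts" column is tz-aware
    df = schema.example(size=3)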

* support frictionless primary keys with multiple fields

Co-authored-by: Jean-Francois Zinque <[email protected]>
Co-authored-by: tfwillems <[email protected]>
Co-authored-by: fkroll8 <[email protected]>
Co-authored-by: fkroll8 <[email protected]>
Co-authored-by: Kevin Kho <[email protected]>
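
In short, this commit introduces a ``unique`` keyword (deprecating
``allow_duplicates``) that works at both the column and the dataframe
level, and maps frictionless primary keys onto it. A minimal sketch of
the new surface (column names are hypothetical):

    import pandera as pa

    schema = pa.DataFrameSchema(
        {
            "id": pa.Column(int, unique=True),  # column-level uniqueness
            "a": pa.Column(int),
            "b": pa.Column(int),
        },
        unique=["a", "b"],  # ("a", "b") pairs must be jointly unique
    )
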
6 people authored Sep 9, 2021
1 parent 84ea3c2 commit 86a0e19
Showing 16 changed files with 552 additions and 133 deletions.
32 changes: 32 additions & 0 deletions docs/source/dataframe_schemas.rst
@@ -467,6 +467,38 @@ To validate the order of the Dataframe columns, specify ``ordered=True``:

.. _index:

+Validating the joint uniqueness of columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In some cases you might want to ensure that a group of columns are unique:
+
+.. testcode:: joint_column_uniqueness
+
+    import pandas as pd
+    import pandera as pa
+
+    schema = pa.DataFrameSchema(
+        columns={col: pa.Column(int) for col in ["a", "b", "c"]},
+        unique=["a", "c"],
+    )
+    df = pd.DataFrame.from_records([
+        {"a": 1, "b": 2, "c": 3},
+        {"a": 1, "b": 2, "c": 3},
+    ])
+    schema.validate(df)
+
+.. testoutput:: joint_column_uniqueness
+
+    Traceback (most recent call last):
+    ...
+    SchemaError: columns '('a', 'c')' not unique:
+       column  index  failure_case
+    0       a      0             1
+    1       a      1             1
+    2       c      0             3
+    3       c      1             3
+
+
Index Validation
----------------

13 changes: 7 additions & 6 deletions docs/source/schema_inference.rst
@@ -107,7 +107,7 @@ You can also write your schema to a python script with :func:`~pandera.io.to_scr
Check.less_than_or_equal_to(max_value=20.0),
],
nullable=False,
-allow_duplicates=True,
+unique=False,
coerce=False,
required=True,
regex=False,
@@ -116,7 +116,7 @@ You can also write your schema to a python script with :func:`~pandera.io.to_scr
dtype=pandera.engines.numpy_engine.Object,
checks=None,
nullable=False,
-allow_duplicates=True,
+unique=False,
coerce=False,
required=True,
regex=False,
@@ -132,7 +132,7 @@ You can also write your schema to a python script with :func:`~pandera.io.to_scr
),
],
nullable=False,
-allow_duplicates=True,
+unique=False,
coerce=False,
required=True,
regex=False,
@@ -185,15 +185,15 @@ is a convenience method for this functionality.
checks:
greater_than_or_equal_to: 5.0
less_than_or_equal_to: 20.0
-allow_duplicates: true
+unique: false
coerce: false
required: true
regex: false
column2:
dtype: object
nullable: false
checks: null
-allow_duplicates: true
+unique: false
coerce: false
required: true
regex: false
@@ -203,7 +203,7 @@ is a convenience method for this functionality.
checks:
greater_than_or_equal_to: '2010-01-01 00:00:00'
less_than_or_equal_to: '2012-01-01 00:00:00'
-allow_duplicates: true
+unique: false
coerce: false
required: true
regex: false
@@ -218,6 +218,7 @@ is a convenience method for this functionality.
coerce: false
coerce: true
strict: false
+unique: null

You can edit this yaml file by specifying column names under the ``column``
key. The respective values map onto key-word arguments in the
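
To illustrate the serialization change above, a small round-trip sketch
(not from the commit; assumes the ``to_yaml``/``from_yaml`` pair on
``DataFrameSchema``):

    import pandera as pa

    schema = pa.DataFrameSchema({"column1": pa.Column(float)})
    yaml_schema = schema.to_yaml()
    # columns now serialize ``unique`` instead of ``allow_duplicates``
    assert "unique: false" in yaml_schema
    reloaded = pa.DataFrameSchema.from_yaml(yaml_schema)
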
8 changes: 7 additions & 1 deletion pandera/engines/pandas_engine.py
@@ -157,7 +157,13 @@ def numpy_dtype(cls, pandera_dtype: dtypes.DataType) -> np.dtype:
alias = "bool"
elif alias.startswith("string"):
alias = "str"
-return np.dtype(alias)
+
+try:
+    return np.dtype(alias)
+except TypeError as err:
+    raise TypeError(
+        f"Data type '{pandera_dtype}' cannot be cast to a numpy dtype."
+    ) from err


###############################################################################
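
A hedged sketch of the failure mode this guards against (assumes
``Engine.dtype`` accepts the ``"category"`` alias; pandas' categorical
dtype has no numpy equivalent):

    from pandera.engines import pandas_engine

    dtype = pandas_engine.Engine.dtype("category")
    pandas_engine.Engine.numpy_dtype(dtype)
    # TypeError: Data type 'category' cannot be cast to a numpy dtype.
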
47 changes: 31 additions & 16 deletions pandera/io.py
@@ -108,7 +108,7 @@ def _serialize_component_stats(component_stats):
key: component_stats.get(key)
for key in [
"name",
"allow_duplicates",
"unique",
"coerce",
"required",
"regex",
@@ -148,6 +148,7 @@ def _serialize_schema(dataframe_schema):
"index": index,
"coerce": dataframe_schema.coerce,
"strict": dataframe_schema.strict,
"unique": dataframe_schema.unique,
}


@@ -195,6 +196,9 @@ def _deserialize_component_stats(serialized_component_stats):
for key in [
"name",
"nullable",
"unique",
# deserialize allow_duplicates property for backwards
# compatibility. Remove this for 0.8.0 release
"allow_duplicates",
"coerce",
"required",
@@ -255,6 +259,7 @@ def _deserialize_schema(serialized_schema):
index=index,
coerce=serialized_schema.get("coerce", False),
strict=serialized_schema.get("strict", False),
+unique=serialized_schema.get("unique", None),
)


@@ -310,7 +315,7 @@ def _write_yaml(obj, stream):
dtype={dtype},
checks={checks},
nullable={nullable},
-allow_duplicates={allow_duplicates},
+unique={unique},
coerce={coerce},
required={required},
regex={regex},
@@ -397,7 +402,7 @@ def to_script(dataframe_schema, path_or_buf=None):
),
checks=_format_checks(properties["checks"]),
nullable=properties["nullable"],
-allow_duplicates=properties["allow_duplicates"],
+unique=properties["unique"],
coerce=properties["coerce"],
required=properties["required"],
regex=properties["regex"],
@@ -418,6 +423,7 @@
coerce=dataframe_schema.coerce,
strict=dataframe_schema.strict,
name=dataframe_schema.name.__repr__(),
+unique=dataframe_schema.unique,
).strip()

# add pandas imports to handle datetime and timedelta.
@@ -445,15 +451,15 @@ class FrictionlessFieldParser:
formats, titles, descriptions).
:param field: a field object from a frictionless schema.
-:param primary_keys: the primary keys from a frictionless schema. These are used
-    to ensure primary key fields are treated properly - no duplicates,
-    no missing values etc.
+:param primary_keys: the primary keys from a frictionless schema. These
+    are used to ensure primary key fields are treated properly - no
+    duplicates, no missing values etc.
"""

def __init__(self, field, primary_keys) -> None:
self.constraints = field.constraints or {}
+self.primary_keys = primary_keys
self.name = field.name
-self.is_a_primary_key = self.name in primary_keys
self.type = field.get("type", "string")

@property
@@ -544,18 +550,22 @@ def nullable(self) -> bool:
"""Determine whether this field can contain missing values.
If a field is a primary key, this will return ``False``."""
-if self.is_a_primary_key:
+if self.name in self.primary_keys:
return False
return not self.constraints.get("required", False)

@property
-def allow_duplicates(self) -> bool:
+def unique(self) -> bool:
"""Determine whether this field can contain duplicate values.
-If a field is a primary key, this will return ``False``."""
-if self.is_a_primary_key:
-    return False
-return not self.constraints.get("unique", False)
+If a field is a primary key, this will return ``True``.
+"""
+
+# only set the column-level uniqueness property if `primary_keys`
+# contains exactly one field name.
+if len(self.primary_keys) == 1 and self.name in self.primary_keys:
+    return True
+return self.constraints.get("unique", False)

@property
def coerce(self) -> bool:
@@ -587,10 +597,10 @@ def regex(self) -> bool:
def to_pandera_column(self) -> Dict:
"""Export this field to a column spec dictionary."""
return {
"allow_duplicates": self.allow_duplicates,
"checks": self.checks,
"coerce": self.coerce,
"nullable": self.nullable,
"unique": self.unique,
"dtype": self.dtype,
"required": self.required,
"name": self.name,
@@ -645,8 +655,8 @@ def from_frictionless_schema(
[<Check in_range: in_range(10, 99)>]
>>> schema.columns["column_1"].required
True
>>> schema.columns["column_1"].allow_duplicates
False
>>> schema.columns["column_1"].unique
True
>>> schema.columns["column_2"].checks
[<Check str_length: str_length(None, 10)>, <Check str_matches: str_matches(re.compile('^\\\\S+$'))>]
"""
@@ -664,5 +674,10 @@
"checks": None,
"coerce": True,
"strict": True,
+# only set dataframe-level uniqueness if the frictionless primary
+# key property specifies more than one field
+"unique": (
+    None if len(schema.primary_key) == 1 else list(schema.primary_key)
+),
}
return _deserialize_schema(assembled_schema)
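
Illustrative sketch of the new frictionless behaviour (not from the
commit; the descriptor and field names are hypothetical):

    from pandera.io import from_frictionless_schema

    frictionless_schema = {
        "fields": [
            {"name": "region", "type": "string"},
            {"name": "store_id", "type": "integer"},
            {"name": "sales", "type": "number"},
        ],
        # a composite primary key spanning two fields
        "primaryKey": ["region", "store_id"],
    }

    schema = from_frictionless_schema(frictionless_schema)
    # multi-field primary keys map to dataframe-level uniqueness;
    # a single-field primary key would instead set Column.unique = True
    assert schema.unique == ["region", "store_id"]
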
4 changes: 4 additions & 0 deletions pandera/model.py
@@ -82,6 +82,9 @@ class BaseConfig: # pylint:disable=R0903
name: Optional[str] = None #: name of schema
coerce: bool = False #: coerce types of all schema components

+#: make sure certain column combinations are unique
+unique: Optional[Union[str, List[str]]] = None
+
#: make sure all specified columns are in the validated dataframe -
#: if ``"filter"``, removes columns not specified in the schema
strict: Union[bool, str] = False
@@ -218,6 +221,7 @@ def to_schema(cls) -> DataFrameSchema:
strict=cls.__config__.strict,
name=cls.__config__.name,
ordered=cls.__config__.ordered,
+unique=cls.__config__.unique,
)
if cls not in MODEL_CACHE:
MODEL_CACHE[cls] = cls.__schema__ # type: ignore
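
The new config option mirrors the ``unique`` argument on
``DataFrameSchema``; a short sketch with hypothetical columns:

    import pandera as pa
    from pandera.typing import Series

    class Schema(pa.SchemaModel):
        a: Series[int]
        b: Series[int]

        class Config:
            # combinations of (a, b) values must be unique across rows
            unique = ["a", "b"]
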
15 changes: 12 additions & 3 deletions pandera/model_components.py
@@ -48,6 +48,7 @@ class FieldInfo:
__slots__ = (
"checks",
"nullable",
"unique",
"allow_duplicates",
"coerce",
"regex",
@@ -61,7 +62,8 @@ def __init__(
self,
checks: Optional[_CheckList] = None,
nullable: bool = False,
-allow_duplicates: bool = True,
+unique: bool = False,
+allow_duplicates: Optional[bool] = None,
coerce: bool = False,
regex: bool = False,
alias: Any = None,
@@ -70,6 +72,7 @@
) -> None:
self.checks = _to_checklist(checks)
self.nullable = nullable
+self.unique = unique
self.allow_duplicates = allow_duplicates
self.coerce = coerce
self.regex = regex
@@ -118,6 +121,7 @@ def to_column(
pandas_dtype,
Column,
nullable=self.nullable,
+unique=self.unique,
allow_duplicates=self.allow_duplicates,
coerce=self.coerce,
regex=self.regex,
@@ -137,6 +141,7 @@ def to_index(
pandas_dtype,
Index,
nullable=self.nullable,
+unique=self.unique,
allow_duplicates=self.allow_duplicates,
coerce=self.coerce,
name=name,
@@ -161,7 +166,8 @@ def Field(
str_matches: Optional[str] = None,
str_startswith: Optional[str] = None,
nullable: bool = False,
-allow_duplicates: bool = True,
+unique: bool = False,
+allow_duplicates: Optional[bool] = None,
coerce: bool = False,
regex: bool = False,
ignore_na: bool = True,
@@ -183,6 +189,7 @@ def Field(
to the built-in `~pandera.checks.Check` methods.
:param nullable: whether or not the column/index is nullable.
+:param unique: whether column values should be unique
:param allow_duplicates: whether or not to accept duplicate values.
:param coerce: coerces the data type if ``True``.
:param regex: whether or not the field name or alias is a regex pattern.
@@ -194,7 +201,8 @@
:param check_name: Whether to check the name of the column/index during
validation. `None` is the default behavior, which translates to `True`
for columns and multi-index, and to `False` for a single index.
-:param dtype_kwargs: The parameters to be forwarded to the type of the field.
+:param dtype_kwargs: The parameters to be forwarded to the type of the
+    field.
:param kwargs: Specify custom checks that have been registered with the
:class:`~pandera.extensions.register_check_method` decorator.
"""
@@ -229,6 +237,7 @@ def Field(
return FieldInfo(
checks=checks or None,
nullable=nullable,
+unique=unique,
allow_duplicates=allow_duplicates,
coerce=coerce,
regex=regex,
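
And the same switch as seen from ``pa.Field`` (a sketch; per this diff
``allow_duplicates`` remains accepted for backwards compatibility):

    import pandera as pa
    from pandera.typing import Series

    class Users(pa.SchemaModel):
        user_id: Series[int] = pa.Field(unique=True)
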
(diff for the remaining 10 files not shown)
