Dataset Spec Proposal
Felipe Olmos edited this page Apr 10, 2024 · 3 revisions
- Provide an easier way to construct an input multi-table dataset.
  - Notably, one close to that of popular packages such as FeatureTools.
- Serve as the pivot object for helper functions such as:
  - Sorting a multi-table dataset
  - Fine-grained and convenient access to samples
The `Dataset` objects (for the moment `FileDataset` and `PandasDataset`) implement a "Builder" pattern by means of an empty constructor and mutator methods. The mutator methods "fail early", so if all methods succeed the dataset should have only minor problems (e.g. dangling tables).
This proposal also provides:
- Export/import functions to build Khiops `DictionaryDomain` objects.
- A sort-by-key function.
- Constructor:
  - `PandasDataset()`: Normal empty constructor.
  - `FileDataset(header=True, sep="\t")`: Construction option to specify the file format.
- `add_table(self, name, source, key=None, main_table=False)`:
  - Adds a table to the dataset.
  - Parameters:
    - `name`: `str`. Name of the table.
    - `source`:
      - `FileDataset`: `str` path (or URL) of the table.
      - `PandasDataset`: `pandas.DataFrame`.
    - `key`: `str` or `list` of `str`, optional. Key column(s) of the table.
  - Fails if:
    - `key` is not contained in the column list.
    - `main_table == True` but `key is None`.
    - `main_table == True` but there is already a main table set.
  - Notes:
    - This method obtains the column types with the corresponding heuristics.
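The "fail early" behaviour of `add_table` can be illustrated with a minimal sketch. It is not the proposed implementation: `SketchDataset` is a hypothetical stand-in, and it takes a plain list of column names (`columns`) instead of the `source` parameter so that it runs without pandas or files.

```python
class SketchDataset:
    """Hypothetical stand-in for the proposed Dataset builder (not the real API)."""

    def __init__(self):
        # Empty constructor: tables are added afterwards with mutator methods.
        self.tables = {}        # table name -> (columns, key)
        self.main_table = None  # name of the main table, if any

    def add_table(self, name, columns, key=None, main_table=False):
        # Normalize the key to a list of column names (or None).
        if isinstance(key, str):
            key = [key]
        # Fail early: every key column must exist in the table.
        if key is not None:
            missing = [col for col in key if col not in columns]
            if missing:
                raise ValueError(f"key columns not in table '{name}': {missing}")
        if main_table:
            if key is None:
                raise ValueError("the main table must have a key")
            if self.main_table is not None:
                raise ValueError(f"main table already set: '{self.main_table}'")
            self.main_table = name
        self.tables[name] = (list(columns), key)


ds = SketchDataset()
ds.add_table("Accidents", ["AccidentId", "Gravity"], key="AccidentId", main_table=True)
ds.add_table("Vehicles", ["AccidentId", "VehicleId"], key=["AccidentId", "VehicleId"])
```

Because every check runs before the table is stored, a failed call leaves the dataset unchanged, which is the property the builder approach relies on.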
- `remove_table(self, name)`
  - Removes a table from the dataset. Any relation containing this table is also removed.
  - Fails if: No table named `name` exists.
- `add_relation(self, parent_table_name, child_table_name, one_to_one=False)`
  - Adds a relation to the dataset.
  - Parameters:
    - `parent_table_name`: `str`. Name of the parent table.
    - `child_table_name`: `str`. Name of the child table.
    - No relation with that pair of tables exists.
  - Fails if:
    - No table named `parent_table_name` exists.
    - No table named `child_table_name` exists.
    - `parent_table_name == child_table_name`.
    - The key of any of the tables is `None`.
    - The key of `parent_table_name` is not contained in that of `child_table_name`.
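The failure conditions listed for `add_relation` could be checked along the following lines. This is only a sketch: `check_relation` and the `keys` mapping are hypothetical, and "contained" is interpreted here as set containment of key column names.

```python
def check_relation(keys, parent, child):
    """Raise ValueError if the parent/child pair violates the listed conditions.

    `keys` is a hypothetical mapping: table name -> list of key columns (or None).
    """
    for table in (parent, child):
        if table not in keys:
            raise ValueError(f"no table named '{table}'")
    if parent == child:
        raise ValueError("parent and child must be different tables")
    if keys[parent] is None or keys[child] is None:
        raise ValueError("both tables need a key")
    # Assumption: the parent key columns must be a subset of the child key columns.
    if not set(keys[parent]) <= set(keys[child]):
        raise ValueError("parent key is not contained in the child key")


keys = {
    "Accidents": ["AccidentId"],
    "Vehicles": ["AccidentId", "VehicleId"],
    "Places": None,
}
check_relation(keys, "Accidents", "Vehicles")  # passes silently
```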
- `remove_relation(self, parent_table_name, child_table_name)`
  - Parameters:
    - `parent_table_name`: `str`. Name of the parent table.
    - `child_table_name`: `str`. Name of the child table.
  - Fails if:
    - No table named `parent_table_name` exists.
    - No table named `child_table_name` exists.
- `add_external_relation(self, parent_table_name, foreign_key, child_table_name)`
  - Adds an external relation.
  - Parameters:
    - `parent_table_name`: `str`. Name of the parent table.
    - `foreign_key`: `str` or `list` of `str`. Column name(s) of the parent table matching the key of `child_table_name`.
    - `child_table_name`: `str`. Name of the child table.
  - Notes:
    - An external relation is always one-to-one.
  - Fails if:
    - No table named `parent_table_name` exists.
    - No table named `child_table_name` exists.
    - `parent_table_name == child_table_name`.
    - The key of the child table is `None`.
    - `foreign_key` is not equal to the key of `child_table_name`.
- `remove_external_relation(self, parent_table_name, child_table_name)`
  - Parameters:
    - `parent_table_name`: `str`. Name of the parent table.
    - `child_table_name`: `str`. Name of the child table.
  - Fails if:
    - No table named `parent_table_name` exists.
    - No table named `child_table_name` exists.
    - No external relation with that pair of tables exists.
- `sort_dataset(ds, engine="native", **kwargs)`
  - Sorts each of the tables of the dataset by their keys.
  - Parameters:
    - `ds`: `FileDataset` or `PandasDataset`.
    - `engine`: `str`.
      - `default`: Uses the default sorting engine (Khiops for `FileDataset`, `pandas.DataFrame.sort_values` for `PandasDataset`).
      - `khiops`: Uses Khiops as the sorting engine.
    - `kwargs`: Parameters for `pandas.DataFrame.sort_values`.
  - Returns: Another `Dataset` instance with the sorted tables.
  - Fails if:
    - There is a table with no `key`.
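As a rough illustration of the sort-by-key behaviour described above, the sketch below sorts every table by its key columns and returns new tables rather than modifying the input. The function `sort_tables` and the data layout (row dicts plus a separate `keys` mapping) are assumptions made for this example, not part of the proposal.

```python
def sort_tables(tables, keys):
    """Return a new mapping with each table's rows sorted by its key columns.

    Hypothetical data layout: `tables` maps a table name to a list of row dicts,
    `keys` maps a table name to its list of key columns (or None).
    """
    sorted_tables = {}
    for name, rows in tables.items():
        key = keys.get(name)
        if key is None:
            raise ValueError(f"table '{name}' has no key")
        # sorted() builds a new list, so the input tables are left untouched.
        sorted_tables[name] = sorted(
            rows, key=lambda row: tuple(row[col] for col in key)
        )
    return sorted_tables


tables = {
    "Vehicles": [
        {"AccidentId": 2, "VehicleId": "A"},
        {"AccidentId": 1, "VehicleId": "B"},
        {"AccidentId": 1, "VehicleId": "A"},
    ]
}
keys = {"Vehicles": ["AccidentId", "VehicleId"]}
result = sort_tables(tables, keys)
```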
- `create_khiops_dictionary_domain(ds, override_types=None)`
  - Creates a `DictionaryDomain` instance representing the schema of the dataset.
  - Parameters:
    - `ds`: `FileDataset` or `PandasDataset`. The input dataset object.
    - `override_types`: `dict`. A dictionary whose keys are table names. The values are `dict`s whose keys are column names and whose values are Khiops types.
      - Ex: `{ "Tweets": {"Body": "Text"} }`
  - Returns:
    - A `DictionaryDomain` instance with the schema for the dataset.
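One way to picture how `override_types` might interact with the heuristically obtained column types is a simple nested-dictionary merge. The helper `resolve_types`, the inferred types in the example, and the choice to raise on unknown table or column names are all assumptions of this sketch.

```python
def resolve_types(inferred_types, override_types=None):
    """Merge per-table type overrides into heuristically inferred types.

    Both arguments use the layout described for `override_types`:
    {table name: {column name: Khiops type}}.
    """
    # Copy the inner dicts so the inferred types are not modified in place.
    resolved = {table: dict(columns) for table, columns in inferred_types.items()}
    for table, columns in (override_types or {}).items():
        if table not in resolved:
            raise ValueError(f"no table named '{table}'")
        for column, khiops_type in columns.items():
            if column not in resolved[table]:
                raise ValueError(f"no column '{column}' in table '{table}'")
            resolved[table][column] = khiops_type
    return resolved


# Illustrative inferred types; only the "Body" override comes from the text above.
inferred = {"Tweets": {"Id": "Categorical", "Body": "Categorical", "Likes": "Numerical"}}
resolved = resolve_types(inferred, {"Tweets": {"Body": "Text"}})
```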
- `create_dataset(dictionary_file_path_or_domain, data_table_path, additional_data_tables)`:
  - Creates a `FileDataset` instance from a `DictionaryDomain` and the paths for its tables.
  - Notes:
    - It strips any derivation rule in the Khiops dictionaries.
    - Not sure if it is worth it.
  - Returns: A `FileDataset` instance.
- Versions of the `core.api` using `FileDataset`
  - The file dataset replaces the following parameters:
    - `dictionary_file_path_or_domain` (when it describes the input data)
    - `dictionary_name`
    - `data_table_path`
    - `additional_data_tables`
    - `header_line`
    - `field_separator`
    - `output_additional_data_tables` (see below)
  - Examples:
    - `train_predictor_ds(ds, target_variable, results_dir, ...)`
      - Returns: Same as `train_predictor`
    - `deploy_model_ds(dictionary_file_path_or_domain, ds, results_dir, ...)`
      - Returns: An output `FileDataset` with its files stored in `results_dir`
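To make the parameter replacement more concrete, the sketch below derives the file-related keyword arguments from a single dataset object. `SketchFileDataset`, its attribute names, and `to_legacy_kwargs` are hypothetical. Using the main table name as `dictionary_name` and keying `additional_data_tables` by plain table name are simplifying assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SketchFileDataset:
    """Hypothetical stand-in for FileDataset; attribute names are illustrative."""
    header: bool = True
    sep: str = "\t"
    main_table: str = ""
    paths: dict = field(default_factory=dict)  # table name -> file path


def to_legacy_kwargs(ds):
    """Build the file-related keyword arguments that the dataset would replace."""
    secondary = {
        name: path for name, path in ds.paths.items() if name != ds.main_table
    }
    return {
        # Assumption: the main dictionary is named after the main table.
        "dictionary_name": ds.main_table,
        "data_table_path": ds.paths[ds.main_table],
        # Assumption: keyed by plain table name; the data-path key format
        # expected by Khiops is not reproduced here.
        "additional_data_tables": secondary,
        "header_line": ds.header,
        "field_separator": ds.sep,
    }


ds = SketchFileDataset(
    main_table="Accidents",
    paths={"Accidents": "data/Accidents.txt", "Vehicles": "data/Vehicles.txt"},
)
kwargs = to_legacy_kwargs(ds)
```

A `*_ds` variant of an API function could then accept `ds` alone and expand it internally, which is the simplification this part of the proposal aims for.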
- `get_sample_dataset(name, type="pandas")`
  - Gets a dataset for one of the Khiops samples.
  - Parameters:
    - `name`: `str`. Name of the dataset.
    - `type`: `str`. Type of the dataset to construct. Either `pandas` or `file`.
  - Returns:
    - `FileDataset` if `type="file"`
    - `PandasDataset` if `type="pandas"`
  - Example: `get_sample_dataset("Accidents", type="file")`
  - Notes:
    - It is mainly to simplify samples/tutorials.
    - It downloads the dataset if not stored locally.