Releases: BiomedDAR/copula-tabular
v0.1.6
Description
Constraints update
- minor bug fixes
CleanData bug fixes
- `gen_data_report`: minor bug fixes for variable mismatch
Package URL
v0.1.5
Description
CleanData Updates
- Previously, when reading `.csv` data files, all `na`-type strings were automatically removed. As a result, columns of `object` datatype might be converted to `float`, which might conflict with the user's definition in the data dictionary. This is still the default behaviour, but a new option allows such values to be loaded as they are, unless a data field/column is explicitly defined in the dictionary as `numeric`.
- The Generate Data Report feature now includes additional fields. Readers can now identify fields that are out-of-range (numerical types) or not-defined (categorical types), based on what is defined in the data dictionary.
- The attribute `var_list` was previously unavailable when discrepancies were detected between the data fields listed in the dictionary and those in the data files. It is now available, and defaults to the fields found in the data file.
- Previously, the CleanData module created the required output data folders only after it had tried to record its actions in a log file inside the not-yet-existing output data folder. It now ensures the data folders exist first.
- An additional option is now available to modify the dataframe "index" by concatenating existing "Index"-type data in the data dictionary, so that rows are uniquely identified when the existing "Index"-type columns do not already do so. This is useful when generating reports and pinpointing the exact rows that are problematic. To activate this option, set `CREATE_UNIQUE_INDEX` to `True` in `definitions.py`. Other settings include `UNIQUE_INDEX_COMPOSITION_LIST` and `UNIQUE_INDEX_DELIMITER`.
- If the value for OUTPUT_TYPE_DATA was `xlsx` in the `definitions.py` file, `converting_ascii` crashed when there were `<NA>`-type values in the data. This is now fixed: ASCII conversion is skipped for `<NA>`-type entries.
- An additional function, `add_dictionary_row`, is now available to add entries to the Data Dictionary. This is useful when creating secondary variables, as it keeps the data dictionary in sync with the newly created variables.
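Two of the behaviours described above can be illustrated with plain pandas. This is a minimal sketch and not the package's own code: `build_unique_index` is a hypothetical helper, and the column names and delimiter are made up for the example. It shows how the literal string `NA` turns a column into floats under default parsing, how `keep_default_na=False` loads it as is, and how concatenating columns yields a unique row index.

```python
import io

import pandas as pd

# Tiny in-memory CSV; the "code" column contains the literal string "NA".
CSV_TEXT = "id,visit,code\nA,1,NA\nA,2,7\nB,1,12\n"

# Default pandas parsing: "NA" is treated as missing, so "code" becomes float.
df_default = pd.read_csv(io.StringIO(CSV_TEXT))

# keep_default_na=False loads such strings as they are, so "code" stays text.
df_raw = pd.read_csv(io.StringIO(CSV_TEXT), keep_default_na=False)


def build_unique_index(df, columns, delimiter="_"):
    """Return a copy of df whose index concatenates the given columns.

    Hypothetical helper for illustration only; not part of the package.
    """
    key = df[columns[0]].astype(str)
    for column in columns[1:]:
        key = key + delimiter + df[column].astype(str)
    indexed = df.copy()
    indexed.index = key
    return indexed


# "id" alone repeats, but "id" combined with "visit" identifies each row.
df_indexed = build_unique_index(df_raw, ["id", "visit"], delimiter="_")
print(df_default["code"].tolist())
print(df_raw["code"].tolist())
print(df_indexed.index.tolist())
```

In the package itself, the equivalent behaviour is switched on through the settings named above rather than by calling a helper like this one.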
TabulaCopula Updates
- Bug fix for data paths on non-Windows systems.
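The release note does not say what caused the path problem. A common source of such bugs is a hard-coded Windows separator, and the sketch below shows the usual remedy under that assumption: building paths with `pathlib` so the separator follows the platform. The helper name and the folder names are hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath


def join_data_path(path_cls, root, *parts):
    """Join path parts using the separator of the given path flavour.

    Hypothetical helper; in real code, pathlib.Path picks the flavour
    of the running platform automatically.
    """
    return str(path_cls(root, *parts))


# The same parts produce a platform-appropriate path for each flavour.
posix_path = join_data_path(PurePosixPath, "project", "data", "raw.csv")
windows_path = join_data_path(PureWindowsPath, "project", "data", "raw.csv")
print(posix_path)
print(windows_path)
```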
Constraints Updates
- Updated functions "multiparent_conditions" and "evaluate_df_column" with new options. They can now create secondary columns whose names carry appended suffixes, instead of replacing the original variables. They also generate more comprehensive logs of the rows that have been replaced.
- Updated function "convertBlankstoValue" to also convert strings that are empty, on top of those that are `null`.
- New functionality "find_mismatch" to find mismatches between any two columns in a dataframe.
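The ideas behind the blank conversion and the column mismatch check can be sketched in pandas as follows. Both function names are hypothetical stand-ins, and their signatures are assumptions; the package's "convertBlankstoValue" and "find_mismatch" may behave differently.

```python
import pandas as pd


def blanks_to_value(series, value):
    """Replace nulls and empty strings in a Series with `value`.

    Hypothetical stand-in, not the package's implementation.
    """
    is_blank = series.isna() | series.eq("")
    return series.mask(is_blank, value)


def find_mismatch_rows(df, col_a, col_b):
    """Return the rows of df where the two columns disagree.

    Rows where both values are missing are treated as matching.
    Hypothetical stand-in, not the package's implementation.
    """
    a, b = df[col_a], df[col_b]
    both_missing = a.isna() & b.isna()
    differs = a.ne(b) & ~both_missing
    return df.loc[differs]


filled = blanks_to_value(pd.Series(["x", "", None]), "UNKNOWN")
frame = pd.DataFrame(
    {"a": [1.0, 2.0, float("nan")], "b": [1.0, 3.0, float("nan")]}
)
mismatches = find_mismatch_rows(frame, "a", "b")
print(filled.tolist())
print(mismatches)
```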
Utils Updates
- New function `extract_year_month_day` is available to extract the year, month, and day from a given string-type date using a specified format.
- Minor bug fixes in "mapping_dictDateFormatConversion".
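The date extraction described above can be sketched with the standard library. The function name `split_date` and its signature are assumptions made for illustration; the actual arguments of `extract_year_month_day` are not given in these notes.

```python
from datetime import datetime


def split_date(date_string, date_format="%Y-%m-%d"):
    """Parse a date string with the given format and return (year, month, day).

    Hypothetical helper; raises ValueError if the string does not
    match the format.
    """
    parsed = datetime.strptime(date_string, date_format)
    return parsed.year, parsed.month, parsed.day


year, month, day = split_date("15/03/2021", "%d/%m/%Y")
print(year, month, day)
```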
VisualPlot Updates
- Added "bins" option to histogram plots.
Package URL
v0.1.4
Description
Utilities update
- new function `gen_interpolation` for creating new datapoints via interpolation
- new function `conversionFromTIMSTxtToCSV` for reading oddly delimited `.txt` files and converting them to `.csv` format
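The notes do not state which interpolation method `gen_interpolation` uses. As a minimal sketch, assuming simple linear interpolation, new datapoints can be generated with NumPy as shown here. The helper name `interpolate_points` is hypothetical.

```python
import numpy as np


def interpolate_points(x, y, n_points):
    """Create n_points evenly spaced datapoints by linear interpolation.

    Hypothetical helper; assumes x is sorted in increasing order,
    as required by np.interp.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_new = np.linspace(x.min(), x.max(), n_points)
    y_new = np.interp(x_new, x, y)
    return x_new, y_new


# Three known points are expanded to five, including the midpoints.
x_new, y_new = interpolate_points([0, 1, 2], [0, 10, 20], n_points=5)
print(x_new)
print(y_new)
```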
CleanData bug fixes
- `gen_data_report` no longer ignores `TYPE` categories in the data dictionary when they come with trailing spaces
- `gen_data_report` now accepts a variety of `TYPE` categories in the data dictionary, on top of the standard `numeric`, `string`, `date`, `bool`.
- CleanData now allows users to define `sheetname` for Excel outputs, using the `RAWDICTXLSX_SHEETNAME` attribute in `definitions`.
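The trailing-space fix amounts to normalising each `TYPE` entry before comparing it. A minimal sketch of that idea, with hypothetical helper names and no claim about how the package implements it:

```python
# The four standard TYPE categories named in the notes above.
STANDARD_TYPES = ("numeric", "string", "date", "bool")


def clean_type(raw_type):
    """Strip surrounding whitespace from a TYPE entry (hypothetical helper)."""
    return str(raw_type).strip()


def is_standard_type(raw_type):
    """Check whether a TYPE entry is one of the standard categories."""
    return clean_type(raw_type) in STANDARD_TYPES


print(clean_type("numeric  "))
print(is_standard_type("date "))
print(is_standard_type("category"))
```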
Package URL
v0.1.3
First release to PyPI
Description
Package includes
- Data cleaning tools
- Transformation tools for converting non-numeric data into numeric equivalents
- Univariate Marginal Distribution modelling from raw data
- Conditional-Copula Implementations for generating synthetic data
- Privacy Metric evaluation wrapper