modify clean_date #1

Open
wants to merge 1 commit into
base: master
200 changes: 146 additions & 54 deletions clean_date.ipynb
@@ -4,81 +4,173 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"clean_date(\n",
"# `clean_date()`\n",
"\n",
"**dataframe**, \n",
"The task is to identify, clean and impute empty part of each component of an date, and transform the date into a standardized form or split the date into its components. Users are allowed to define their own transformation functions. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# API\n",
" \n",
"## Function header"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def clean_date(\n",
" df: Union[pd.DataFrame, dd.DataFrame],\n",
" column: str,\n",
" target_format: str = 'YYYY-MM-DD HH:MM:SS',\n",
" fix_empty: str = 'auto_minimum',\n",
" show_report: bool = False,\n",
" customized_rule: str = None,\n",
" split: bool = False\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Format specification\n",
"\n",
"**column**,\n",
"The user can specify the target date format using the following formats \n",
"\n",
"**target_format** = 'yyyy-mm-dd hh:mm:ss', \n",
"* `YYYY/MM/DD HH:MM:SS`\n",
"\n",
"**target_timezone** = 'local',\n",
"* `YYYY-MM-DD HH:MM:SS`\n",
"\n",
"**fix_empty** = 'minimum',\n",
"* `YY-MM-DD HH:MM:SS`\n",
"\n",
"**show_report** = False,\n",
"* `YY/MM/DD HH:MM:SS`\n",
"\n",
"**customized_rule** = None\n",
"* `DD/MM/YYYY HH:MM:SS`\n",
"\n",
"**split** = False\n",
"* `MM/DD/YYYY HH:MM:SS`\n",
"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**dataframe: pandas.DataFrame**\n",
"* `DD-MM-YYYY HH:MM:SS`\n",
"\n",
"+ Input dataframe\n",
"* `MM-DD-YYYY HH:MM:SS`\n",
"\n",
"**column: str**\n",
"* `DD-MM-YY HH:MM:SS`\n",
"\n",
"+ The column to be transformed\n",
"+ Each value within the column is detected, if it's detected as date type, then it's cleaned. Otherwise, the function does nothing to the value.\n",
"+ Date_Type_Detection: Parse input string, if time component (year/month/day/hour/minute/second) can be parsed, then it's recognized as a date value.\n",
"+ Date_Type_Detection: Refers to implementation in datautil._parser https://github.com/dateutil/dateutil/blob/master/dateutil/parser/_parser.py\n",
"* `MM-DD-YY HH:MM:SS`\n",
"\n",
"**target_format: str**\n",
"* `DD/MM/YY HH:MM:SS`\n",
"\n",
"+ The target date format to be transformmed\n",
"+ Format should be represented as the combinition of y,m,d,h,mn,s, then after each individual component is separeted from the initial string, it can be reformatted.\n",
"+ Should be able to support customized formats\n",
"+ Missing components are filled according to fix_empty parameter\n",
"* `MM/DD/YY HH:MM:SS`\n",
"\n",
"**fix_empty: str**\n",
"### Parameter `target_format` \n",
"\n",
"+ Define the rule to fix the missing time component of parsed value\n",
"+ value set: {'empty','auto'}\n",
"+ empty: just left the missing component as it is\n",
"+ auto: do everything for users automatically\n",
" + have default setting for every individual component\n",
" + for seconds, just fill it with zeros\n",
" + for years, fill it with the nearest value/minimum value\n",
"The user can specify the `target_format` as one of specified format. The **default target format** is `\"YYYY-MM-DD HH:MM:SS\"`\n",
"\n",
"**show_report: Boolean**\n",
"Example.* df = \n",
"\n",
"+ Define whether to generate and return cleaning report\n",
"+ Should design a class like Cleaning_Report()\n",
"+ Contains: \n",
" + 1) how many values are repaired \n",
" + 2) how they are repaired \n",
" + 3) visualization \n",
" + 4) values that need repair but the library doesn't know the rule to repair (human-in-the-loop required)\n",
" + 5) ?\n",
"| DateTime |\n",
"| --- |\n",
"| 2020-09-22 21:25:18 |\n",
"\n",
"**customized_rule: Function**:\n",
"`clean_date(df, \"DateTime\", target_format=\"mm/dd hh:mm\")` returns\n",
"\n",
"+ Defines customized format transfomation function\n",
"+ Takes input string and output the corresponding output string\n",
"| DateTime |\n",
"| --- |\n",
"| 09/22 21:25 |\n",
"\n",
"### Parameter `fixed_empty` \n",
"\n",
"The user can specify the way of fixing empty value from value set: {'empty', 'auto_nearest', 'auto_minimum'}. The **default fixed_empty** is `\"auto_minimum\"`\n",
"\n",
"* empty: just left the missing component as it is\n",
"\n",
"*Example .* df = \n",
"\n",
"| DateTime |\n",
"| --- |\n",
"| 2020-09-22|\n",
"`clean_date(df, \"DateTime\", fixed_empty='empty')` returns\n",
"\n",
"| DateTime | cleaned_DateTime | \n",
"| --- | --- | \n",
"| 2020-09-22 | 2020-09-22 --:--:-- | \n",
"\n",
"* auto_nearest: \n",
" * For hours, minutes and seconds, just fill them with zeros\n",
" * for years, months and days, fill it with the nearest value\n",
" \n",
"*Example .* df = \n",
"\n",
"| DateTime |\n",
"| --- |\n",
"| 09-22|\n",
"`clean_date(df, \"DateTime\", fixed_empty='auto_nearest')` returns\n",
"\n",
"| DateTime | cleaned_DateTime | \n",
"| --- | --- | \n",
"| 09-22 | 2020-09-22 00:00:00 | \n",
"\n",
"* auto_minimum: \n",
" * For hours, minutes and seconds, just fill them with zeros\n",
" * for years, months and days, fill it with the minimum value\n",
" \n",
"*Example .* df = \n",
"\n",
"| DateTime |\n",
"| --- |\n",
"| 2020-09|\n",
"`clean_date(df, \"DateTime\", fixed_empty='auto_minimum')` returns\n",
"\n",
"| DateTime | cleaned_DateTime | \n",
"| --- | --- | \n",
"| 2020-09 | 2020-09-01 00:00:00 | \n",
"\n",
"### Parameter `show_report` \n",
"If `show_report = True`, a report contains:\n",
"\n",
"1) how many values are repaired\n",
"2) how they are repaired\n",
"3) visualization\n",
"4) values that need repair but the library doesn't know the rule to repair (human-in-the-loop required)\n",
"\n",
"will be generated.\n",
"Example and illustration with picture need to be added.\n",
"\n",
"### Parameter `customized_rule` \n",
"Users can define customized format transfomation function, which takes input string and outputs the corresponding output string. Example of costomized_rule need to be added.\n",
"\n",
"### Parameter `split`\n",
"\n",
"If True, the output will be split into each identified attribute of the date.\n",
"\n",
"*Example.* df = \n",
"\n",
"| DateTime |\n",
"| --- |\n",
"| 2020-09-22 21:25:18 |\n",
"\n",
"`clean_date(df, \"DateTime\", split=True)` returns\n",
"\n",
"| DateTime | Year | Month | Day | Hour | Minute | Second |\n",
"| --- | --- | --- | --- | --- | --- | --- |\n",
"| 2020-09-22 21:25:18 | 2020 | 09 | 22 | 21 | 25 | 18 |"
]
},
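The `auto_minimum` and `split` behaviour described in the cell above can be approximated with a minimal sketch. This is not the proposed `clean_date` implementation: `fill_minimum`, `split_components`, and `_CANDIDATE_LAYOUTS` are hypothetical names introduced only for illustration, and the sketch relies on `datetime.strptime` with a fixed list of layouts instead of the dateutil-based detection the notebook refers to.

```python
from datetime import datetime

import pandas as pd

# Hypothetical list of partial layouts to try, most specific first (an assumption,
# not part of the proposed API).
_CANDIDATE_LAYOUTS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%Y-%m"]


def fill_minimum(value: str) -> str:
    """Parse a possibly partial date; absent components take their minimum value."""
    for layout in _CANDIDATE_LAYOUTS:
        try:
            # strptime sets a missing month/day to 1 and missing time fields to 0.
            parsed = datetime.strptime(value, layout)
        except ValueError:
            continue
        return parsed.strftime("%Y-%m-%d %H:%M:%S")
    return value  # not recognized as a date: leave the value untouched


def split_components(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with one extra column per date component."""
    parsed = pd.to_datetime(df[column].map(fill_minimum), format="%Y-%m-%d %H:%M:%S")
    out = df.copy()
    for name in ["year", "month", "day", "hour", "minute", "second"]:
        out[name.capitalize()] = getattr(parsed.dt, name)
    return out


df = pd.DataFrame({"DateTime": ["2020-09", "2020-09-22 21:25:18"]})
df["cleaned_DateTime"] = df["DateTime"].map(fill_minimum)
result = split_components(df, "DateTime")
```

In this sketch the component columns hold integers, whereas the `split` example table shows zero-padded values such as `09`; which representation the final API should return is left open here. A column containing unrecognized values would make `pd.to_datetime` raise, so a real implementation would need to handle that case.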
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Supplementary material\n",
"\n",
"**split: Boolean**\n",
"## Python libraries\n",
"\n",
"+ Define whether to split time attribute into different columns\n",
"+ Parse and split string to individual components\n",
"+ components: year,month,date,hour,minute,second\n",
"+ Add parameters to limit components to be split?"
"1. [dateutil](https://github.com/dateutil/dateutil). Generic parsing of dates in almost any string format.\n",
"\n"
]
},
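As a rough illustration of how dateutil could back the date-type detection step, the sketch below treats a value as a date if `dateutil.parser.parse` accepts it. The helper name `is_date` is hypothetical, and relying on the parser's success or failure is an assumption about how detection might work, not the confirmed design.

```python
from dateutil import parser


def is_date(value: str) -> bool:
    """Return True if dateutil can parse the string as a date (hypothetical helper)."""
    try:
        parser.parse(value)
    except (ValueError, OverflowError):
        # dateutil raises ValueError (ParserError in recent versions) for
        # unrecognized strings and OverflowError for out-of-range numbers.
        return False
    return True


print(is_date("2020-09-22 21:25:18"))
print(is_date("hello world"))
```

Note that `parser.parse` fills components absent from the input from a `default` datetime (today's date unless specified), so detection alone does not say which components were actually present in the string.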
{
@@ -124,7 +216,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.6.8"
}
},
"nbformat": 4,