brandonlockhart · qidanrui · Sep 22, 2020
diff --git a/clean_date.ipynb b/clean_date.ipynb
@@ -4,81 +4,173 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "clean_date(\n",
+    "# `clean_date()`\n",
     "\n",
-    "**dataframe**, \n",
+    "The task is to identify dates, clean them, impute any missing components, and then either transform each date into a standardized format or split it into its components. Users can also define their own transformation functions."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# API\n",
+    "  \n",
+    "## Function header"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def clean_date(\n",
+    "    df: Union[pd.DataFrame, dd.DataFrame],\n",
+    "    column: str,\n",
+    "    target_format: str = 'YYYY-MM-DD HH:MM:SS',\n",
+    "    fix_empty: str = 'auto_minimum',\n",
+    "    show_report: bool = False,\n",
+    "    customized_rule: Optional[Callable[[str], str]] = None,\n",
+    "    split: bool = False\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Format specification\n",
     "\n",
-    "**column**,\n",
+    "The user can specify the target date format as one of the following:\n",
     "\n",
-    "**target_format** = 'yyyy-mm-dd hh:mm:ss', \n",
+    "* `YYYY/MM/DD HH:MM:SS`\n",
     "\n",
-    "**target_timezone** = 'local',\n",
+    "* `YYYY-MM-DD HH:MM:SS`\n",
     "\n",
-    "**fix_empty** = 'minimum',\n",
+    "* `YY-MM-DD HH:MM:SS`\n",
     "\n",
-    "**show_report** = False,\n",
+    "* `YY/MM/DD HH:MM:SS`\n",
     "\n",
-    "**customized_rule** = None\n",
+    "* `DD/MM/YYYY HH:MM:SS`\n",
     "\n",
-    "**split** = False\n",
+    "* `MM/DD/YYYY HH:MM:SS`\n",
     "\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**dataframe: pandas.DataFrame**\n",
+    "* `DD-MM-YYYY HH:MM:SS`\n",
     "\n",
-    "+ Input dataframe\n",
+    "* `MM-DD-YYYY HH:MM:SS`\n",
     "\n",
-    "**column: str**\n",
+    "* `DD-MM-YY HH:MM:SS`\n",
     "\n",
-    "+ The column to be transformed\n",
-    "+ Each value within the column is detected, if it's detected as date type, then it's cleaned. Otherwise, the function does nothing to the value.\n",
-    "+ Date_Type_Detection: Parse input string, if time component (year/month/day/hour/minute/second) can be parsed, then it's recognized as a date value.\n",
-    "+ Date_Type_Detection: Refers to implementation in datautil._parser https://github.com/dateutil/dateutil/blob/master/dateutil/parser/_parser.py\n",
+    "* `MM-DD-YY HH:MM:SS`\n",
     "\n",
-    "**target_format: str**\n",
+    "* `DD/MM/YY HH:MM:SS`\n",
     "\n",
-    "+ The target date format to be transformmed\n",
-    "+ Format should be represented as the combinition of y,m,d,h,mn,s, then after each individual component is separeted from the initial string, it can be reformatted.\n",
-    "+ Should be able to support customized formats\n",
-    "+ Missing components are filled according to fix_empty parameter\n",
+    "* `MM/DD/YY HH:MM:SS`\n",
     "\n",
-    "**fix_empty: str**\n",
+    "### Parameter `target_format` \n",
     "\n",
-    "+ Define the rule to fix the missing time component of parsed value\n",
-    "+ value set: {'empty','auto'}\n",
-    "+ empty: just left the missing component as it is\n",
-    "+ auto: do everything for users automatically\n",
-    "    + have default setting for every individual component\n",
-    "    + for seconds, just fill it with zeros\n",
-    "    + for years, fill it with the nearest value/minimum value\n",
+    "The user can set `target_format` to one of the formats listed above. The **default target format** is `\"YYYY-MM-DD HH:MM:SS\"`.\n",
     "\n",
-    "**show_report: Boolean**\n",
+    "*Example.* df = \n",
     "\n",
-    "+ Define whether to generate and return cleaning report\n",
-    "+ Should design a class like Cleaning_Report()\n",
-    "+ Contains: \n",
-    "    + 1) how many values are repaired \n",
-    "    + 2) how they are repaired \n",
-    "    + 3) visualization \n",
-    "    + 4) values that need repair but the library doesn't know the rule to repair (human-in-the-loop required)\n",
-    "    + 5) ?\n",
+    "| DateTime |\n",
+    "| --- |\n",
+    "| 2020-09-22 21:25:18 |\n",
     "\n",
-    "**customized_rule: Function**:\n",
+    "`clean_date(df, \"DateTime\", target_format=\"MM/DD HH:MM\")` returns\n",
     "\n",
-    "+ Defines customized format transfomation function\n",
-    "+ Takes input string and output the corresponding output string\n",
+    "| DateTime |\n",
+    "| --- |\n",
+    "| 09/22 21:25 |\n",
+    "\n",
+    "### Parameter `fix_empty` \n",
+    "\n",
+    "The user can choose how missing components are filled from the value set {'empty', 'auto_nearest', 'auto_minimum'}. The **default `fix_empty`** is `\"auto_minimum\"`.\n",
+    "\n",
+    "* `empty`: leave the missing components as they are\n",
+    "\n",
+    "*Example.* df = \n",
+    "\n",
+    "| DateTime |\n",
+    "| --- |\n",
+    "| 2020-09-22 |\n\n",
+    "`clean_date(df, \"DateTime\", fix_empty='empty')` returns\n",
+    "\n",
+    "| DateTime | cleaned_DateTime | \n",
+    "| --- | --- | \n",
+    "| 2020-09-22 | 2020-09-22 --:--:-- | \n",
+    "\n",
+    "* `auto_nearest`: \n",
+    "    * for hours, minutes and seconds, fill them with zeros\n",
+    "    * for years, months and days, fill them with the nearest value\n",
+    "    \n",
+    "*Example.* df = \n",
+    "\n",
+    "| DateTime |\n",
+    "| --- |\n",
+    "| 09-22 |\n\n",
+    "`clean_date(df, \"DateTime\", fix_empty='auto_nearest')` returns\n",
+    "\n",
+    "| DateTime | cleaned_DateTime | \n",
+    "| --- | --- | \n",
+    "| 09-22 | 2020-09-22 00:00:00 | \n",
+    "\n",
+    "* `auto_minimum`: \n",
+    "    * for hours, minutes and seconds, fill them with zeros\n",
+    "    * for years, months and days, fill them with the minimum value\n",
+    "    \n",
+    "*Example.* df = \n",
+    "\n",
+    "| DateTime |\n",
+    "| --- |\n",
+    "| 2020-09 |\n\n",
+    "`clean_date(df, \"DateTime\", fix_empty='auto_minimum')` returns\n",
+    "\n",
+    "| DateTime | cleaned_DateTime | \n",
+    "| --- | --- | \n",
+    "| 2020-09 | 2020-09-01 00:00:00 | \n",
+    "\n",
+    "### Parameter `show_report` \n",
+    "If `show_report = True`, a report containing\n",
+    "\n",
+    "1) how many values were repaired\n",
+    "2) how they were repaired\n",
+    "3) a visualization\n",
+    "4) values that need repair but for which the library has no repair rule (human-in-the-loop required)\n",
+    "\n",
+    "is generated.\n",
+    "An example and an illustration with a picture still need to be added.\n",
+    "\n",
+    "### Parameter `customized_rule` \n",
+    "Users can define a customized format transformation function, which takes an input string and returns the corresponding output string. An example of `customized_rule` still needs to be added.\n",
+    "\n",
+    "### Parameter `split`\n",
+    "\n",
+    "If `True`, the output is split into one column for each identified component of the date.\n",
+    "\n",
+    "*Example.* df = \n",
+    "\n",
+    "| DateTime |\n",
+    "| --- |\n",
+    "| 2020-09-22 21:25:18 |\n",
+    "\n",
+    "`clean_date(df, \"DateTime\", split=True)` returns\n",
+    "\n",
+    "| DateTime | Year | Month | Day | Hour | Minute | Second |\n",
+    "| --- | --- | --- | --- | --- | --- | --- |\n",
+    "| 2020-09-22 21:25:18 | 2020 | 09 | 22 | 21 | 25 | 18 |"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Supplementary material\n",
     "\n",
-    "**split: Boolean**\n",
+    "## Python libraries\n",
     "\n",
-    "+ Define whether to split time attribute into different columns\n",
-    "+ Parse and split string to individual components\n",
-    "+ components: year,month,date,hour,minute,second\n",
-    "+ Add parameters to limit components to be split?"
+    "1. [dateutil](https://github.com/dateutil/dateutil). Generic parsing of dates in almost any string format.\n",
+    "\n"
    ]
   },
   {
@@ -124,7 +216,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.3"
+   "version": "3.6.8"
   }
  },
  "nbformat": 4,