intel-analytics · hkvision · Sep 15, 2022 · Aug 18, 2022 · Aug 18, 2022 · Aug 23, 2022
diff --git a/python/friesian/example/multi_task/README.md b/python/friesian/example/multi_task/README.md
@@ -0,0 +1,123 @@
+# Multi-task Recommendation with BigDL
+In addition to providing a personalized recommendation, recommendation systems need to output diverse 
+predictions to meet the needs of real-world applications, such as user click-through rates and browsing (or watching) time predictions for products.
+This example demonstrates how to use the [MMoE](https://dl.acm.org/doi/pdf/10.1145/3219819.3220007) or [PLE](https://dl.acm.org/doi/pdf/10.1145/3383313.3412236?casa_token=8fchWD8CHc0AAAAA:2cyP8EwkhIUlSFPRpfCGHahTddki0OEjDxfbUFMkXY5fU0FNtkvRzmYloJtLowFmL1en88FRFY4Q) model to implement multi-task recommendations with large-scale data.
+
+## Prepare environments
+We highly recommend using [Anaconda](https://www.anaconda.com/distribution/#linux) to prepare the environment, especially if you want to run on a YARN cluster.
+```
+conda create -n bigdl python=3.7  # "bigdl" is the conda environment name; you can choose another name.
+conda activate bigdl
+pip install bigdl-orca[ray]
+pip install bigdl-friesian
+pip install tensorflow==2.9.1
+pip install deepctr[cpu]
+```
+Refer to [this document](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/python.html#install) for more installation guides.
+
+## Data Preparation
+In this example, a news dataset is used to demonstrate the training and testing process.
+Each row contains several feature values, timestamps and two labels. The timestamp is used to split the data into training and testing sets.
+The two output targets are click prediction (classification) and duration prediction (regression). Examples of the original data are as follows:
+```
++----------+----------+-------------------+----------+----------+-------------+-----+--------+------+-------+--------+--------+------+------+-------------------+-------+-------------+--------------------+
+|   user_id|article_id|          expo_time|net_status|flush_nums|exop_position|click|duration|device|     os|province|    city|   age|gender|              ctime|img_num|        cat_1|               cat_2|
++----------+----------+-------------------+----------+----------+-------------+-----+--------+------+-------+--------+--------+------+------+-------------------+-------+-------------+--------------------+
+|1000541010| 464467760|2021-06-30 09:57:14|         2|         0|           13|    1|      28|V2054A|Android|Shanghai|Shanghai|A_0_24|female|2021-06-29 14:46:43|      3|Entertainment| Entertainment/Stars|
+|1000541010| 463850913|2021-06-30 09:57:14|         2|         0|           15|    0|       0|V2054A|Android|Shanghai|Shanghai|A_0_24|female|2021-06-27 22:29:13|     11|     Fashions|Fashions/Female F...|
+|1000541010| 464022440|2021-06-30 09:57:14|         2|         0|           17|    0|       0|V2054A|Android|Shanghai|Shanghai|A_0_24|female|2021-06-28 12:22:54|      7|        Rural|Rural/Agriculture...|
+|1000541010| 464586545|2021-06-30 09:58:31|         2|         1|           20|    0|       0|V2054A|Android|Shanghai|Shanghai|A_0_24|female|2021-06-29 13:25:06|      5|Entertainment| Entertainment/Stars|
+|1000541010| 465352885|2021-07-03 18:13:03|         5|         0|           18|    0|       0|V2054A|Android|Shanghai|Shanghai|A_0_24|female|2021-07-02 10:43:51|     18|Entertainment| Entertainment/Stars|
++----------+----------+-------------------+----------+----------+-------------+-----+--------+------+-------+--------+--------+------+------+-------------------+-------+-------------+--------------------+
+```
+
+With the built-in high-level preprocessing operations in FeatureTable, we can easily perform distributed pre-processing for large-scale data.
+The details of pre-processing can be found [here](https://github.com/intel-analytics/BigDL/blob/main/apps/wide-deep-recommendation/feature_engineering.ipynb). Examples of processed data are as follows:
+
+```
++-------------------+-----+--------+-------------------+-----------+-----+-------+----------+----------+----------+-------------+------+---+--------+----+---+------+-----+
+|          expo_time|click|duration|              ctime|    img_num|cat_2|user_id|article_id|net_status|flush_nums|exop_position|device| os|province|city|age|gender|cat_1|
++-------------------+-----+--------+-------------------+-----------+-----+-------+----------+----------+----------+-------------+------+---+--------+----+---+------+-----+
+|2021-06-30 09:57:14|    1|      28|2021-06-29 14:46:43|0.016574586|   60|  14089|     87717|         4|        73|         1003|    36|  2|      38| 308|  5|     1|    5|
+|2021-06-30 09:57:14|    0|       0|2021-06-27 22:29:13| 0.06077348|   47|  14089|     35684|         4|        73|           43|    36|  2|      38| 308|  5|     1|   32|
+|2021-06-30 09:57:14|    0|       0|2021-06-28 12:22:54|0.038674034|  157|  14089|     20413|         4|        73|          363|    36|  2|      38| 308|  5|     1|   20|
+|2021-06-30 09:58:31|    0|       0|2021-06-29 13:25:06|0.027624309|   60|  14089|     15410|         4|       312|          848|    36|  2|      38| 308|  5|     1|    5|
+|2021-07-03 18:13:03|    0|       0|2021-07-02 10:43:51| 0.09944751|   60|  14089|     81707|         2|        73|          313|    36|  2|      38| 308|  5|     1|    5|
++-------------------+-----+--------+-------------------+-----------+-----+-------+----------+----------+----------+-------------+------+---+--------+----+---+------+-----+
+```
+Data pre-processing command:
+```bash
+python data_processing.py \
+    --input_path  path/to/input/dataset \
+    --output_path path/to/save/processed/dataset \
+    --cluster_mode local \
+    --executor_cores 8 \
+    --executor_memory 24g \
+    --num_executors 4 \
+    --driver_cores 2 \
+    --driver_memory 24g
+```
+
+__Options for data_processing:__
+* `input_path`: The path to the input dataset.
+* `output_path`: The path to save the processed dataset.
+* `cluster_mode`: The cluster mode: local, yarn, standalone or spark-submit. Defaults to local.
+* `master`: The master URL, only used when the cluster mode is standalone. Defaults to None.
+* `executor_cores`: The number of executor cores. Defaults to 8.
+* `executor_memory`: The executor memory. Defaults to 24g.
+* `num_executors`: The number of executors. Defaults to 4.
+* `driver_cores`: The number of driver cores. Defaults to 2.
+* `driver_memory`: The driver memory. Defaults to 24g.
+
+__NOTE:__ 
+When the *cluster_mode* is yarn, *input_path* and *output_path* can be HDFS paths. 
+
+## Train and test Multi-task models
+After data preprocessing, train the MMoE or PLE model as follows:
+```bash
+python run_multi_task.py \
+    --do_train \
+    --model_type mmoe \
+    --train_data_path path/to/training/dataset \
+    --test_data_path path/to/testing/dataset \
+    --model_save_path path/to/save/the/trained/model \
+    --cluster_mode local \
+    --executor_cores 8 \
+    --executor_memory 24g \
+    --num_executors 4 \
+    --driver_cores 2 \
+    --driver_memory 24g
+```
+
+Evaluate the results as follows:
+```bash
+python run_multi_task.py \
+    --do_test \
+    --model_type mmoe \
+    --test_data_path path/to/testing/dataset \
+    --model_save_path path/to/save/the/trained/model \
+    --cluster_mode local \
+    --executor_cores 8 \
+    --executor_memory 24g \
+    --num_executors 4 \
+    --driver_cores 2 \
+    --driver_memory 24g
+```
+
+__Options for run_multi_task:__
+* `do_train`: Train the model.
+* `do_test`: Evaluate the model on the testing dataset.
+* `model_type`: The multi-task model, mmoe or ple. Defaults to mmoe.
+* `train_data_path`: The path to the training dataset.
+* `test_data_path`: The path to the testing dataset.
+* `model_save_path`: The path to save the model.
+* `cluster_mode`: The cluster mode: local, yarn, standalone or spark-submit. Defaults to local.
+* `master`: The master URL, only used when the cluster mode is standalone. Defaults to None.
+* `executor_cores`: The number of executor cores. Defaults to 8.
+* `executor_memory`: The executor memory. Defaults to 24g.
+* `num_executors`: The number of executors. Defaults to 4.
+* `driver_cores`: The number of driver cores. Defaults to 2.
+* `driver_memory`: The driver memory. Defaults to 24g.
+
+__NOTE:__ 
+When the *cluster_mode* is yarn, *train_data_path*, *test_data_path* and *model_save_path* can be HDFS paths.
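The README states that the timestamp is used to separate training data from testing data, and the preprocessing script in this change does so by comparing `expo_time` with the literal `'2021-07-06'`. The following is a minimal plain-Python sketch of that idea, not the Friesian code path. It assumes `expo_time` is held as a `YYYY-MM-DD HH:MM:SS` string, in which case lexicographic order matches chronological order. The helper `split_by_time` and the third row are hypothetical and exist only for illustration.

```python
# Two rows reuse expo_time values from the sample table in the README.
rows = [
    {"article_id": 464467760, "expo_time": "2021-06-30 09:57:14", "click": 1},
    {"article_id": 465352885, "expo_time": "2021-07-03 18:13:03", "click": 0},
    # Hypothetical row, added so that the test split is not empty.
    {"article_id": 1, "expo_time": "2021-07-06 00:00:00", "click": 0},
]

CUTOFF = "2021-07-06"  # same literal as in data_processing.py


def split_by_time(rows, cutoff=CUTOFF):
    """Return (train, test): rows strictly before the cutoff, and the rest."""
    train = [r for r in rows if r["expo_time"] < cutoff]
    test = [r for r in rows if r["expo_time"] >= cutoff]
    return train, test


train_rows, test_rows = split_by_time(rows)
print(len(train_rows), len(test_rows))
```

Because the cutoff is a prefix of any timestamp on that day, a row stamped exactly at midnight on 2021-07-06 compares as greater than the cutoff and lands in the test split.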
diff --git a/python/friesian/example/multi_task/data_processing.py b/python/friesian/example/multi_task/data_processing.py
@@ -0,0 +1,158 @@
+#
+# Copyright 2016 The BigDL Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+import argparse
+import os
+from argparse import ArgumentParser
+from bigdl.friesian.feature import FeatureTable
+from bigdl.orca import init_orca_context, stop_orca_context
+
+
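+def transform(x):
+    """Convert an img_num value to float."""
+    # '上海' ("Shanghai") is a stray non-numeric value; map it to 0.0.
+    if x == '上海':
+        return 0.0
+    # float() accepts both floats and numeric strings.
+    return float(x)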
+
+
+def transform_cat_2(x):
+    return '-'.join(sorted(x.split('/')))
+
+
+def read_and_split(data_input_path, sparse_int_features, sparse_string_features, dense_features):
+    header_names = ['user_id', 'article_id', 'expo_time', 'net_status', 'flush_nums',
+                    'exop_position', 'click', 'duration', 'device', 'os', 'province', 'city',
+                    'age', 'gender', 'ctime', 'img_num', 'cat_1', 'cat_2'
+                    ]
+    if data_input_path.split('.')[-1] == 'csv':
+        data_pd = FeatureTable.read_csv(data_input_path, header=False, names=header_names)
+    else:
+        data_pd = FeatureTable.read_parquet(data_input_path)
+    data_pd = data_pd.cast(sparse_int_features, 'string')
+    data_pd = data_pd.cast(dense_features, 'string')
+
+    # fill missing values
+    for feature in (sparse_int_features + sparse_string_features):
+        data_pd = data_pd.fillna("", feature)
+    for dense_feature in dense_features:
+        data_pd = data_pd.fillna('0.0', dense_feature)
+    print(data_pd.df.dtypes)
+
+    # Convert img_num to float.
+    data_pd = data_pd.apply("img_num", "img_num", transform, "float")
+    # Sort the '/'-separated parts of cat_2 so that their order does not matter.
+    data_pd = data_pd.apply("cat_2", "cat_2", transform_cat_2, "string")
+
+    train_tbl = FeatureTable(data_pd.df[data_pd.df['expo_time'] < '2021-07-06'])
+    valid_tbl = FeatureTable(data_pd.df[data_pd.df['expo_time'] >= '2021-07-06'])
+    print('train_data.size: ', train_tbl.size())
+    print('test_data.size: ', valid_tbl.size())
+    return train_tbl, valid_tbl
+
+
+def feature_engineering(train_tbl, valid_tbl, model_path, model_path_json, sparse_int_features,
+                        sparse_string_features, dense_features):
+    import json
+    train_tbl, min_max_dict = train_tbl.min_max_scale(dense_features)
+    valid_tbl = valid_tbl.transform_min_max_scale(dense_features, min_max_dict)
+    cat_cols = sparse_string_features[-1:] + sparse_int_features + sparse_string_features[:-1]
+    for feature in cat_cols:
+        train_tbl, feature_idx = train_tbl.category_encode(feature)
+        valid_tbl = valid_tbl.encode_string(feature, feature_idx)
+        valid_tbl = valid_tbl.fillna(0, feature)
+        print("The class number of feature: {}/{}".format(feature, feature_idx.size()))
+        feature_idx.write_parquet(model_path)
+        fea_dict = feature_idx.to_dict()
+        with open(model_path_json + "/" + feature + '.json', 'w', encoding='utf-8') as ff:
+            ff.write(json.dumps(fea_dict, ensure_ascii=False, indent=2))
+    return train_tbl, valid_tbl
+
+
+def _parse_args():
+    parser = ArgumentParser(description="Transform dataset for multi task demo")
+    parser.add_argument('--input_path', type=str,
+                        default='/path/to/input/dataset',
+                        help='The path for input dataset')
+    parser.add_argument('--output_path', type=str, default='/path/to/save/processed/dataset',
+                        help='The path for output dataset')
+    parser.add_argument('--cluster_mode', type=str, default="local",
+                        help='The cluster mode, such as local, yarn, standalone or spark-submit.')
+    parser.add_argument('--master', type=str, default=None,
+                        help='The master url, only used when cluster mode is standalone.')
+    parser.add_argument('--executor_cores', type=int, default=8,
+                        help='The executor core number.')
+    parser.add_argument('--executor_memory', type=str, default="24g",
+                        help='The executor memory.')
+    parser.add_argument('--num_executors', type=int, default=4,
+                        help='The number of executors.')
+    parser.add_argument('--driver_cores', type=int, default=2,
+                        help='The driver core number.')
+    parser.add_argument('--driver_memory', type=str, default="24g",
+                        help='The driver memory.')
+    args_ = parser.parse_args()
+    return args_
+
+
+if __name__ == '__main__':
+    args = _parse_args()
+    if args.cluster_mode == "local":
+        sc = init_orca_context("local", cores=args.executor_cores,
+                               memory=args.executor_memory)
+    elif args.cluster_mode == "standalone":
+        sc = init_orca_context("standalone", master=args.master,
+                               cores=args.executor_cores, num_nodes=args.num_executors,
+                               memory=args.executor_memory,
+                               driver_cores=args.driver_cores,
+                               driver_memory=args.driver_memory)
+    elif args.cluster_mode == "yarn":
+        sc = init_orca_context("yarn-client", cores=args.executor_cores,
+                               num_nodes=args.num_executors, memory=args.executor_memory,
+                               driver_cores=args.driver_cores, driver_memory=args.driver_memory)
+    elif args.cluster_mode == "spark-submit":
+        sc = init_orca_context("spark-submit")
+    else:
+        raise ValueError(
+            "cluster_mode should be one of 'local', 'yarn', 'standalone' and"
+            " 'spark-submit', but got " + args.cluster_mode)
+
+    sparse_int_features_ = [
+        'user_id', 'article_id',
+        'net_status', 'flush_nums',
+        'exop_position',
+    ]
+    sparse_string_features_ = [
+        'device', 'os', 'province',
+        'city', 'age',
+        'gender', 'cat_1', 'cat_2'
+    ]
+    dense_features_ = ['img_num']
+    model_path_ = os.path.join(args.output_path, 'feature_maps')
+    model_path_json_ = os.path.join(args.output_path, 'feature_maps_json')
+    os.makedirs(model_path_, exist_ok=True)
+    os.makedirs(model_path_json_, exist_ok=True)
+    # read, reformat and split data
+    df_train, df_test = read_and_split(args.input_path, sparse_int_features_,
+                                       sparse_string_features_, dense_features_)
+    train_tbl_, valid_tbl_ = feature_engineering(df_train, df_test,
+                                                 model_path_, model_path_json_,
+                                                 sparse_int_features_,
+                                                 sparse_string_features_, dense_features_)
+    print(train_tbl_.size())
+    print(valid_tbl_.size())
+    train_tbl_.write_parquet(os.path.join(args.output_path, 'train_processed'))
+    valid_tbl_.write_parquet(os.path.join(args.output_path, 'test_processed'))
+    stop_orca_context()
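The `feature_engineering` function above fits the scaling statistics and the category indices on the training table only, then reuses them for the validation table and fills categories that were never seen in training with 0. The sketch below restates that fit-on-train, apply-to-validation pattern in plain Python. It is a stand-in under stated assumptions, not the Friesian implementation: all four helper functions are hypothetical, and assigning ids in sorted order starting at 1 is a simplification that may differ from how `category_encode` orders its ids.

```python
def fit_min_max(values):
    """Return (min, max) computed on the training values only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with previously fitted statistics: (v - lo) / (hi - lo)."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def fit_category_index(values):
    """Map each distinct training value to an integer id, starting at 1."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


def apply_category_index(values, index, unseen=0):
    """Encode values with a fitted index; unseen categories map to `unseen`."""
    return [index.get(v, unseen) for v in values]


# Hypothetical toy columns standing in for img_num and os.
train_img, valid_img = [3.0, 11.0, 7.0], [5.0, 18.0]
train_os, valid_os = ["Android", "iOS", "Android"], ["iOS", "HarmonyOS"]

lo, hi = fit_min_max(train_img)
os_index = fit_category_index(train_os)

train_scaled = apply_min_max(train_img, lo, hi)
valid_scaled = apply_min_max(valid_img, lo, hi)
valid_encoded = apply_category_index(valid_os, os_index)
print(train_scaled, valid_scaled, valid_encoded)
```

In this sketch, a validation value above the training maximum scales to more than 1.0, since the statistics come from the training data alone.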