
Fix merge conflicts [skip ci] #3892

Merged
merged 9 commits on May 24, 2021

Commits on May 19, 2021

  1. Remove Base.enable_rmm_pool method as it is no longer needed (#3875)

    This should resolve the confusion reported in rapidsai/raft#228. Tagging @dantegd for review.
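    
    With `Base.enable_rmm_pool` removed, a memory pool is presumably configured through RMM directly before constructing estimators. A minimal sketch, assuming standard RMM usage (the pool size below is an illustrative value, not taken from this PR):
    
    ```python
    # Hedged sketch: configure an RMM pool allocator directly instead of the
    # removed Base.enable_rmm_pool helper. The pool size is an arbitrary example.
    import rmm
    
    rmm.reinitialize(
        pool_allocator=True,        # use a pooled device memory resource
        initial_pool_size=2 << 30,  # e.g. start with a 2 GiB pool
    )
    ```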
    
    Authors:
      - Thejaswi. N. S (https://github.com/teju85)
    
    Approvers:
      - Dante Gama Dessavre (https://github.com/dantegd)
    
    URL: #3875
    teju85 authored May 19, 2021
    Commit: 3be1545

Commits on May 20, 2021

  1. Deterministic UMAP with floating point rounding. (#3848)

    Use floating point rounding to make the UMAP optimization deterministic. This is a breaking change, as the batch size parameter is removed.
    
    * Add procedure for rounding the gradient updates.
    * Add buffer for gradient updates.
    * Add an internal parameter `deterministic`, which should be set to `true` when `random_state` is set.
    
    The test file is removed due to #3849.
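    
    A minimal usage sketch, assuming only the public behaviour described above (passing `random_state` routes through the new deterministic path), so repeated fits produce identical embeddings:
    
    ```python
    # Hedged sketch: with random_state fixed, repeated UMAP fits are expected
    # to produce identical embeddings after this change.
    import numpy as np
    import cuml
    
    X = np.random.RandomState(0).rand(1000, 16).astype(np.float32)
    
    emb1 = cuml.UMAP(random_state=42).fit_transform(X)
    emb2 = cuml.UMAP(random_state=42).fit_transform(X)
    assert np.array_equal(emb1, emb2)
    ```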
    
    Authors:
      - Jiaming Yuan (https://github.com/trivialfis)
    
    Approvers:
      - Corey J. Nolet (https://github.com/cjnolet)
    
    URL: #3848
    trivialfis authored May 20, 2021
    Commit: 99a80c8
  2. Update CHANGELOG.md links for calver (#3883)

    This PR updates the `0.20` references in `CHANGELOG.md` to be `21.06`.
    
    Authors:
      - AJ Schmidt (https://github.com/ajschmidt8)
    
    Approvers:
      - Dillon Cullinan (https://github.com/dillon-cullinan)
    
    URL: #3883
    ajschmidt8 authored May 20, 2021
    Commit: 0b33f9d
  3. Make sure __init__ is called in graph callback. (#3881)

    I made this mistake myself and got a segmentation fault; raising a ValueError would be a nicer failure mode.
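    
    A minimal sketch of the pitfall; the base-class name and import path (`cuml.internals.GraphBasedDimRedCallback`) are assumptions here, not taken from this PR:
    
    ```python
    # Hedged sketch: a callback subclass that forgets super().__init__()
    # previously segfaulted; after this change the expectation is a clear
    # Python-level error instead.
    from cuml.internals import GraphBasedDimRedCallback  # assumed import path
    
    class BrokenCallback(GraphBasedDimRedCallback):
        def __init__(self):
            # Bug: missing super().__init__(), so the underlying callback
            # state is never initialized.
            self.embeddings = []
    
    class FixedCallback(GraphBasedDimRedCallback):
        def __init__(self):
            super().__init__()  # initialize the underlying callback first
            self.embeddings = []
    ```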
    
    Authors:
      - Jiaming Yuan (https://github.com/trivialfis)
    
    Approvers:
      - Corey J. Nolet (https://github.com/cjnolet)
    
    URL: #3881
    trivialfis authored May 20, 2021
    Commit: b7a634a
  4. AgglomerativeClustering: support single cluster and ignore only zero distances from self-loops (#3824)
    
    Closes #3801 
    Closes #3802 
    
    Corresponding RAFT PR: rapidsai/raft#217
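    
    A minimal usage sketch of the newly supported single-cluster case (the data and parameter values are illustrative, not from this PR):
    
    ```python
    # Hedged sketch: n_clusters=1 is now accepted by AgglomerativeClustering.
    import numpy as np
    from cuml.cluster import AgglomerativeClustering
    
    X = np.random.RandomState(0).rand(50, 4).astype(np.float32)
    
    model = AgglomerativeClustering(n_clusters=1).fit(X)
    # Every sample should end up in the same (single) cluster.
    assert len(np.unique(np.asarray(model.labels_))) == 1
    ```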
    
    Authors:
      - Corey J. Nolet (https://github.com/cjnolet)
    
    Approvers:
      - Dante Gama Dessavre (https://github.com/dantegd)
    
    URL: #3824
    cjnolet authored May 20, 2021
    Commit: 05124c4

Commits on May 21, 2021

  1. Fix RF regression performance (#3845)

    This PR rewrites the mean squared error objective. The mean squared error is much easier to compute when factored algebraically into a slightly different form (a short sketch is given below). This should bring regression performance in line with classification.
    
    I've also removed the MAE objective, as it is not correct: leaf predictions with MAE use the mean, whereas the correct minimiser is the median. See also scikit-learn's implementation, where streaming median calculations are required: https://github.com/scikit-learn/scikit-learn/blob/de1262c35e2aa4ee062d050281ee576ce9e35c94/sklearn/tree/_criterion.pyx#L976.
    
    Implementing this correctly for GPU would be very challenging.
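    
    For intuition, the factorization this likely relies on (the exact kernel-level form is an assumption) is the standard identity that a node's sum of squared errors can be accumulated from the running sums of `y` and `y*y`, avoiding a second pass over the residuals:
    
    ```python
    # Hedged sketch: sum((y - mean)^2) == sum(y^2) - n * mean^2, so the MSE of
    # a candidate split can be computed from per-sample sums in a single pass.
    import numpy as np
    
    y = np.random.RandomState(0).rand(1000)
    n, s, s2 = len(y), y.sum(), (y * y).sum()
    
    sse_direct = ((y - y.mean()) ** 2).sum()
    sse_factored = s2 - s * s / n
    
    assert np.isclose(sse_direct, sse_factored)
    ```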
    
    Performance before:
    ![rf_regression_perf](https://user-images.githubusercontent.com/7307640/117608125-8c884280-b1b1-11eb-8cb4-e92f39dad0f3.png)
    After:
    ![rf_regression_perf_fix](https://user-images.githubusercontent.com/7307640/117608145-94e07d80-b1b1-11eb-939f-b96cafbd3e35.png)
    
    Script:
    ```python
    from cuml import RandomForestRegressor as cuRF
    from sklearn.ensemble import RandomForestRegressor as sklRF
    from sklearn.datasets import make_regression
    from sklearn.metrics import mean_squared_error
    import numpy as np
    import pandas as pd
    import matplotlib
    import matplotlib.pyplot as plt
    import seaborn as sns
    import time
    
    matplotlib.use("Agg")
    sns.set()
    
    X, y = make_regression(n_samples=100000, random_state=0)
    X = X.astype(np.float32)
    y = y.astype(np.float32)
    rs = np.random.RandomState(92)
    df = pd.DataFrame(columns=["algorithm", "Time(s)", "MSE"])
    d = 10
    n_repeats = 5
    bootstrap = False
    max_samples = 1.0
    max_features = 0.5
    n_estimators = 10
    n_bins = min(X.shape[0], 128)
    for _ in range(n_repeats):
        clf = sklRF(
            n_estimators=n_estimators,
            max_depth=d,
            random_state=rs,
            max_features=max_features,
            bootstrap=bootstrap,
            max_samples=max_samples if max_samples < 1.0 else None,
        )
    
        start = time.perf_counter()
        clf.fit(X, y)
        skl_time = time.perf_counter() - start
        pred = clf.predict(X)
        cu_clf = cuRF(
            n_estimators=n_estimators,
            max_depth=d,
            random_state=rs.randint(0, 1 << 32),
            n_bins=n_bins,
            max_features=max_features,
            bootstrap=bootstrap,
            max_samples=max_samples,
            use_experimental_backend=True,
        )
    
        start = time.perf_counter()
        cu_clf.fit(X, y)
        cu_time = time.perf_counter() - start
        cu_pred = cu_clf.predict(X, predict_model="CPU")
        df = df.append(
            {
                "algorithm": "cuml",
                "Time(s)": cu_time,
                "MSE": mean_squared_error(y, cu_pred),
            },
            ignore_index=True,
        )
        df = df.append(
            {
                "algorithm": "sklearn",
                "Time(s)": skl_time,
                "MSE": mean_squared_error(y, pred),
            },
            ignore_index=True,
        )
    print(df)
    fig, ax = plt.subplots(1, 2)
    sns.barplot(data=df, x="algorithm", y="Time(s)", ax=ax[0])
    sns.barplot(data=df, x="algorithm", y="MSE", ax=ax[1])
    plt.savefig("rf_regression_perf_fix.png")
    ```
    
    Authors:
      - Rory Mitchell (https://github.com/RAMitchell)
    
    Approvers:
      - Philip Hyunsu Cho (https://github.com/hcho3)
      - Thejaswi. N. S (https://github.com/teju85)
      - John Zedlewski (https://github.com/JohnZed)
    
    URL: #3845
    RAMitchell authored May 21, 2021
    Commit: cb6ef52

Commits on May 22, 2021

  1. Commit: ea662e8
  2. Fix MNMG test test_rf_regression_dask_fil (#3830)

    See #3820
    
    Authors:
      - Philip Hyunsu Cho (https://github.com/hcho3)
    
    Approvers:
      - Victor Lafargue (https://github.com/viclafargue)
      - Dante Gama Dessavre (https://github.com/dantegd)
    
    URL: #3830
    hcho3 authored May 22, 2021
    Commit: 3e89f04

Commits on May 24, 2021

  1. Commit: b52db0b