
Parallelise duckdb resulting in e.g. 2-4x speedup on 6 core machine #1796

Merged
merged 19 commits into master, Jan 10, 2024

Conversation

RobinL
Member

@RobinL RobinL commented Dec 12, 2023

See here for investigation into best salting settings

At the moment, DuckDB parallelises little, if any, of the Splink workflow. This PR causes DuckDB to fully parallelise salted workloads, meaning operations like predict() are several times faster (especially on machines with many cores).

See duckdb/duckdb#9710

time it example
import time

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
    jaro_winkler_at_thresholds,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.historical_50k


df = df.drop("cluster", axis=1)

settings_dict = {
    "probability_two_random_records_match": 0.0001,
    "link_type": "dedupe_only",
}

linker = DuckDBLinker(df, settings_dict)
brs = linker._find_blocking_rules_below_threshold(1e7)
list(brs.sort_values("comparison_count", ascending=True)["blocking_columns_sanitised"])

br = block_on("first_name")


def get_brs(salt):
    brs = [
        ["gender", "occupation", "full_name"],
        ["gender", "occupation", "postcode_fake"],
        ["occupation", "full_name"],
        ["occupation", "postcode_fake"],
        ["gender", "occupation", "first_and_surname"],
        ["first_name", "gender", "postcode_fake"],
        ["gender", "occupation", "dob"],
        ["first_name", "gender", "dob"],
        ["occupation", "first_and_surname"],
        ["gender", "occupation", "surname"],
        ["first_name", "gender", "full_name"],
        ["gender", "full_name"],
        ["first_name", "postcode_fake"],
        ["occupation", "dob"],
        ["gender", "occupation", "birth_place"],
        ["first_name", "dob"],
        ["gender", "postcode_fake"],
        ["occupation", "surname"],
        ["full_name"],
        ["first_name", "full_name"],
        ["first_name", "gender", "birth_place"],
        ["occupation", "birth_place"],
        ["postcode_fake"],
        ["first_name", "birth_place"],
        ["first_name", "gender", "surname"],
        ["gender", "first_and_surname"],
        ["first_name", "gender", "first_and_surname"],
        ["first_name", "gender", "occupation"],
        ["first_name", "surname"],
        ["first_name", "first_and_surname"],
        ["first_and_surname"],
        ["first_name", "occupation"],
        # ["gender", "surname"],
        # ["surname"],
        # ["gender", "dob"],
        # ["dob"],
        # ["gender", "birth_place"],
        # ["birth_place"],
    ]

    if salt > 1:
        brs = [block_on(x, salting_partitions=salt) for x in brs]
    else:
        brs = [block_on(x) for x in brs]
    return brs


for apply_sort in [True, False]:
    for salt in [1, 10]:
        brs = get_brs(salt)

        settings_dict = {
            "probability_two_random_records_match": 0.0001,
            "link_type": "dedupe_only",
            "blocking_rules_to_generate_predictions": brs,
            "comparisons": [
                jaro_winkler_at_thresholds("first_name"),
                jaro_winkler_at_thresholds("surname"),
                levenshtein_at_thresholds("dob"),
                exact_match("birth_place"),
                levenshtein_at_thresholds("postcode_fake"),
                exact_match("occupation"),
            ],
            "retain_matching_columns": False,
            "retain_intermediate_calculation_columns": False,
        }

        linker = DuckDBLinker(df, settings_dict)

        linker.__apply_sort = apply_sort

        start_time = time.time()
        df_e = linker.predict()
        end_time = time.time()

        print(
            f"Execution time for salt={salt} and "
            f"apply_sort={apply_sort}: {end_time - start_time} seconds"
        )

In the example above I've hacked line 367 of blocking.py to allow __apply_sort

    # see https://github.com/duckdb/duckdb/discussions/9710
    # this generates a huge speedup because it triggers parallelisation
    if linker._sql_dialect == "duckdb" and linker.__apply_sort:
        unioned_sql = f"""
        {unioned_sql}
        order by 1
        """
Execution time for salt=1 and apply_sort=True: 4.619340658187866 seconds
Execution time for salt=10 and apply_sort=True: 11.904595136642456 seconds
Execution time for salt=1 and apply_sort=False: 10.588774919509888 seconds
Execution time for salt=10 and apply_sort=False: 20.690167665481567 seconds

Speedup seems to be even greater for a single salted blocking rule. I would expect the gains to be greater still the more cores the machine has.

Edit:

This seems to make u training slower, despite all cores being used. Definitely solvable, but converting to draft for now

Edit2: salting for u:

import math


def get_salting(max_pairs):
    logged = math.log(max_pairs, 10)
    logged = max(logged - 4, 0)
    return math.ceil(2.5**logged)


for max_pairs in [1, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 2e8, 1e9, 1e10, 1e11]:
    salt = get_salting(max_pairs)
    print(f"{max_pairs:.0e} {salt}")

@RobinL RobinL changed the title Parallelise duckdb Parallelise duckdb resulting in e.g. 2-4x speedup on 6 core machine Dec 12, 2023
@RobinL
Member Author

RobinL commented Dec 12, 2023

@NickCrews just for info, this is the speedup I mentioned last night; hopefully it will be merged and released soon.

To get best performance:

  • order blocking rules in ascending order of number of comparisons
  • if you have only 1 blocking rule (or very few) consider salting them
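To illustrate the first tip: ordering rules is just a sort on their comparison counts, which can be estimated with the `_find_blocking_rules_below_threshold` helper used earlier in this thread. The counts below are invented purely for illustration:

```python
# Hypothetical comparison counts per blocking rule (illustrative numbers only;
# in splink these could be estimated via _find_blocking_rules_below_threshold)
rule_counts = {
    ("first_name", "surname"): 1_200,
    ("postcode_fake",): 350_000,
    ("dob",): 45_000,
}

# Order rules ascending by comparison count, so scoring work is spread
# across rules rather than concentrated in one huge final rule
ordered_rules = sorted(rule_counts, key=rule_counts.get)
```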

@NickCrews
Contributor

Thanks! Do you have evidence that ordering blocking rules matters? Wouldn't the engine always have to compute every rule, regardless of order?

@RobinL
Member Author

RobinL commented Dec 12, 2023

it's a hunch, but it's because of the way blocking pairs are created and then scored:

  • For the first blocking rule, all pairs are generated and then scored
  • For the second blocking rule, all pairs are generated, then pairs already created by the first rule are rejected before being scored
  • For the third, all pairs are generated, then pairs created by the 1st or 2nd rule are rejected before being scored
    etc.

Since scoring is the computationally intensive part, you want it to happen in parallel as much as possible, so it's probably better for scoring to be balanced across the rules, rather than a single blocking rule being responsible for most of the scored comparisons.
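A toy sketch (plain Python, not splink's actual SQL) of that rejection logic: a pair is scored under rule k only if no earlier rule would also have generated it.

```python
records = [
    {"id": 1, "first_name": "john", "surname": "smith"},
    {"id": 2, "first_name": "john", "surname": "smith"},
    {"id": 3, "first_name": "jane", "surname": "smith"},
]

blocking_rules = [
    lambda l, r: l["first_name"] == r["first_name"],
    lambda l, r: l["surname"] == r["surname"],
]

pairs_by_rule = []
for k, rule in enumerate(blocking_rules):
    pairs = [
        (l["id"], r["id"])
        for l in records
        for r in records
        if l["id"] < r["id"]
        and rule(l, r)
        # reject pairs an earlier rule already generated
        and not any(prev(l, r) for prev in blocking_rules[:k])
    ]
    pairs_by_rule.append(pairs)

# rule 0 (first_name) produces (1, 2); rule 1 (surname) produces the
# remaining surname matches, with (1, 2) rejected as a duplicate
```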

@RobinL RobinL marked this pull request as draft December 12, 2023 17:30
@RobinL
Member Author

RobinL commented Dec 12, 2023

time train u
for apply_sort in [True]:
    settings_dict = {
        "probability_two_random_records_match": 0.0001,
        "link_type": "dedupe_only",
        "blocking_rules_to_generate_predictions": [],
        "comparisons": [
            jaro_winkler_at_thresholds("first_name"),
            jaro_winkler_at_thresholds("surname"),
            levenshtein_at_thresholds("dob"),
            exact_match("birth_place"),
            levenshtein_at_thresholds("postcode_fake"),
            exact_match("occupation"),
        ],
        "retain_matching_columns": False,
        "retain_intermediate_calculation_columns": False,
    }

    linker = DuckDBLinker(df, settings_dict)

    # linker.__apply_sort = apply_sort

    start_time = time.time()
    df_e = linker.estimate_u_using_random_sampling(1e6)
    end_time = time.time()

    print(
        f"Execution time for apply_sort={apply_sort}: "
        f"{end_time - start_time} seconds"
    )

Hypothesis: could retain_intermediate_calculation_columns be relevant?

@RobinL
Member Author

RobinL commented Dec 12, 2023

This is very weird but:

  • order by 1 increases the speed of predict() because it results in parallelisation
  • salting usually slows down the speed of predict(), but improves it in the case of a single blocking rule
  • in the case of estimate_u_using_random_sampling, you get parallelisation by salting and you don't need order by 1
  • adding order by 1 slows things down a lot for estimate_u_using_random_sampling
playground
import time

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
    jaro_winkler_at_thresholds,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.historical_50k

import duckdb

duckdb.__version__


df = df.drop("cluster", axis=1)


import logging

settings_dict = {
    "probability_two_random_records_match": 0.0001,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on("first_name", salting_partitions=3)
    ],
    "comparisons": [
        jaro_winkler_at_thresholds("first_name"),
        jaro_winkler_at_thresholds("surname"),
        levenshtein_at_thresholds("dob"),
        exact_match("birth_place"),
        levenshtein_at_thresholds("postcode_fake"),
        exact_match("occupation"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
}

linker = DuckDBLinker(df, settings_dict)

logging.getLogger("splink").setLevel(1)
# linker.__apply_sort = True
# linker.estimate_u_using_random_sampling(1e6)
t = linker._initialise_df_concat_with_tf()

con = linker._con

con.execute(
    f"""
            CREATE TABLE __splink__df_concat_with_tf_sample_745b357b3
        AS
        (WITH __splink__df_concat_with_tf as (select * from {t.physical_name})
    select *
    from __splink__df_concat_with_tf
    USING SAMPLE 5.593196980782878% (bernoulli)
    )
            """
)

start_time = time.time()

sql = f"""
 CREATE TABLE __splink__m_u_counts_fd157dff2
        AS
        (WITH __splink__df_concat_with_tf_sample as (select * from __splink__df_concat_with_tf_sample_745b357b3),
__splink__df_blocked as (

            select
            "l"."unique_id" AS "unique_id_l", "r"."unique_id" AS "unique_id_r", "l"."first_name" AS "first_name_l", "r"."first_name" AS "first_name_r", "l"."surname" AS "surname_l", "r"."surname" AS "surname_r", "l"."dob" AS "dob_l", "r"."dob" AS "dob_r", "l"."birth_place" AS "birth_place_l", "r"."birth_place" AS "birth_place_r", "l"."postcode_fake" AS "postcode_fake_l", "r"."postcode_fake" AS "postcode_fake_r", "l"."occupation" AS "occupation_l", "r"."occupation" AS "occupation_r"
            , '0' as match_key

            from __splink__df_concat_with_tf_sample as l
            inner join __splink__df_concat_with_tf_sample as r
            on
            (1=1)  AND (ceiling(l.__splink_salt * 2) = 1)

            where l."unique_id" < r."unique_id"


            UNION ALL

            select
            "l"."unique_id" AS "unique_id_l", "r"."unique_id" AS "unique_id_r", "l"."first_name" AS "first_name_l", "r"."first_name" AS "first_name_r", "l"."surname" AS "surname_l", "r"."surname" AS "surname_r", "l"."dob" AS "dob_l", "r"."dob" AS "dob_r", "l"."birth_place" AS "birth_place_l", "r"."birth_place" AS "birth_place_r", "l"."postcode_fake" AS "postcode_fake_l", "r"."postcode_fake" AS "postcode_fake_r", "l"."occupation" AS "occupation_l", "r"."occupation" AS "occupation_r"
            , '0' as match_key

            from __splink__df_concat_with_tf_sample as l
            inner join __splink__df_concat_with_tf_sample as r
            on
            (1=1)  AND (ceiling(l.__splink_salt * 2) = 2)

            where l."unique_id" < r."unique_id"







        ),
__splink__df_comparison_vectors as (
    select "unique_id_l","unique_id_r",CASE WHEN "first_name_l" IS NULL OR "first_name_r" IS NULL THEN -1 WHEN "first_name_l" = "first_name_r" THEN 3 WHEN jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9 THEN 2 WHEN jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7 THEN 1 ELSE 0 END as gamma_first_name,CASE WHEN "surname_l" IS NULL OR "surname_r" IS NULL THEN -1 WHEN "surname_l" = "surname_r" THEN 3 WHEN jaro_winkler_similarity("surname_l", "surname_r") >= 0.9 THEN 2 WHEN jaro_winkler_similarity("surname_l", "surname_r") >= 0.7 THEN 1 ELSE 0 END as gamma_surname,CASE WHEN "dob_l" IS NULL OR "dob_r" IS NULL THEN -1 WHEN "dob_l" = "dob_r" THEN 3 WHEN levenshtein("dob_l", "dob_r") <= 1 THEN 2 WHEN levenshtein("dob_l", "dob_r") <= 2 THEN 1 ELSE 0 END as gamma_dob,CASE WHEN "birth_place_l" IS NULL OR "birth_place_r" IS NULL THEN -1 WHEN "birth_place_l" = "birth_place_r" THEN 1 ELSE 0 END as gamma_birth_place,CASE WHEN "postcode_fake_l" IS NULL OR "postcode_fake_r" IS NULL THEN -1 WHEN "postcode_fake_l" = "postcode_fake_r" THEN 3 WHEN levenshtein("postcode_fake_l", "postcode_fake_r") <= 1 THEN 2 WHEN levenshtein("postcode_fake_l", "postcode_fake_r") <= 2 THEN 1 ELSE 0 END as gamma_postcode_fake,CASE WHEN "occupation_l" IS NULL OR "occupation_r" IS NULL THEN -1 WHEN "occupation_l" = "occupation_r" THEN 1 ELSE 0 END as gamma_occupation
    from __splink__df_blocked

    ),
__splink__df_predict as (
    select *, cast(0.0 as float8) as match_probability
    from __splink__df_comparison_vectors
    )
    select
    gamma_first_name as comparison_vector_value,
    sum(match_probability * 1) as m_count,
    sum((1-match_probability) * 1) as u_count,
    'first_name' as output_column_name
    from __splink__df_predict
    group by gamma_first_name
     union all
    select
    gamma_surname as comparison_vector_value,
    sum(match_probability * 1) as m_count,
    sum((1-match_probability) * 1) as u_count,
    'surname' as output_column_name
    from __splink__df_predict
    group by gamma_surname
     union all
    select
    gamma_dob as comparison_vector_value,
    sum(match_probability * 1) as m_count,
    sum((1-match_probability) * 1) as u_count,
    'dob' as output_column_name
    from __splink__df_predict
    group by gamma_dob
     union all
    select
    gamma_birth_place as comparison_vector_value,
    sum(match_probability * 1) as m_count,
    sum((1-match_probability) * 1) as u_count,
    'birth_place' as output_column_name
    from __splink__df_predict
    group by gamma_birth_place
     union all
    select
    gamma_postcode_fake as comparison_vector_value,
    sum(match_probability * 1) as m_count,
    sum((1-match_probability) * 1) as u_count,
    'postcode_fake' as output_column_name
    from __splink__df_predict
    group by gamma_postcode_fake
     union all
    select
    gamma_occupation as comparison_vector_value,
    sum(match_probability * 1) as m_count,
    sum((1-match_probability) * 1) as u_count,
    'occupation' as output_column_name
    from __splink__df_predict
    group by gamma_occupation
     union all
    select 0 as comparison_vector_value,
           sum(match_probability * 1) /
               sum(1) as m_count,
           sum((1-match_probability) * 1) /
               sum(1) as u_count,
           '_probability_two_random_records_match' as output_column_name
    from __splink__df_predict

    )
"""
con.execute(sql)
end_time = time.time()


apply_sort = True
print(f"Execution time for apply_sort={apply_sort}: {end_time - start_time} seconds")


# sql = f"""
#  CREATE TABLE __splink__m_u_counts_fd157dff2
#         AS
#         (WITH __splink__df_concat_with_tf_sample as (select * from __splink__df_concat_with_tf_sample_745b357b3),
# __splink__df_blocked as (

#             select
#             "l"."unique_id" AS "unique_id_l", "r"."unique_id" AS "unique_id_r", "l"."first_name" AS "first_name_l", "r"."first_name" AS "first_name_r", "l"."surname" AS "surname_l", "r"."surname" AS "surname_r", "l"."dob" AS "dob_l", "r"."dob" AS "dob_r", "l"."birth_place" AS "birth_place_l", "r"."birth_place" AS "birth_place_r", "l"."postcode_fake" AS "postcode_fake_l", "r"."postcode_fake" AS "postcode_fake_r", "l"."occupation" AS "occupation_l", "r"."occupation" AS "occupation_r"
#             , '0' as match_key

#             from __splink__df_concat_with_tf_sample as l
#             inner join __splink__df_concat_with_tf_sample as r
#             on
#             (1=1)

#             where l."unique_id" < r."unique_id"


#         ),
# __splink__df_comparison_vectors as (
#     select "unique_id_l","unique_id_r",CASE WHEN "first_name_l" IS NULL OR "first_name_r" IS NULL THEN -1 WHEN "first_name_l" = "first_name_r" THEN 3 WHEN jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.9 THEN 2 WHEN jaro_winkler_similarity("first_name_l", "first_name_r") >= 0.7 THEN 1 ELSE 0 END as gamma_first_name,CASE WHEN "surname_l" IS NULL OR "surname_r" IS NULL THEN -1 WHEN "surname_l" = "surname_r" THEN 3 WHEN jaro_winkler_similarity("surname_l", "surname_r") >= 0.9 THEN 2 WHEN jaro_winkler_similarity("surname_l", "surname_r") >= 0.7 THEN 1 ELSE 0 END as gamma_surname,CASE WHEN "dob_l" IS NULL OR "dob_r" IS NULL THEN -1 WHEN "dob_l" = "dob_r" THEN 3 WHEN levenshtein("dob_l", "dob_r") <= 1 THEN 2 WHEN levenshtein("dob_l", "dob_r") <= 2 THEN 1 ELSE 0 END as gamma_dob,CASE WHEN "birth_place_l" IS NULL OR "birth_place_r" IS NULL THEN -1 WHEN "birth_place_l" = "birth_place_r" THEN 1 ELSE 0 END as gamma_birth_place,CASE WHEN "postcode_fake_l" IS NULL OR "postcode_fake_r" IS NULL THEN -1 WHEN "postcode_fake_l" = "postcode_fake_r" THEN 3 WHEN levenshtein("postcode_fake_l", "postcode_fake_r") <= 1 THEN 2 WHEN levenshtein("postcode_fake_l", "postcode_fake_r") <= 2 THEN 1 ELSE 0 END as gamma_postcode_fake,CASE WHEN "occupation_l" IS NULL OR "occupation_r" IS NULL THEN -1 WHEN "occupation_l" = "occupation_r" THEN 1 ELSE 0 END as gamma_occupation
#     from __splink__df_blocked
#     ),
# __splink__df_predict as (
#     select *, cast(0.0 as float8) as match_probability
#     from __splink__df_comparison_vectors
#     )
#     select
#     gamma_first_name as comparison_vector_value,
#     sum(match_probability * 1) as m_count,
#     sum((1-match_probability) * 1) as u_count,
#     'first_name' as output_column_name
#     from __splink__df_predict
#     group by gamma_first_name
#      union all
#     select
#     gamma_surname as comparison_vector_value,
#     sum(match_probability * 1) as m_count,
#     sum((1-match_probability) * 1) as u_count,
#     'surname' as output_column_name
#     from __splink__df_predict
#     group by gamma_surname
#      union all
#     select
#     gamma_dob as comparison_vector_value,
#     sum(match_probability * 1) as m_count,
#     sum((1-match_probability) * 1) as u_count,
#     'dob' as output_column_name
#     from __splink__df_predict
#     group by gamma_dob
#      union all
#     select
#     gamma_birth_place as comparison_vector_value,
#     sum(match_probability * 1) as m_count,
#     sum((1-match_probability) * 1) as u_count,
#     'birth_place' as output_column_name
#     from __splink__df_predict
#     group by gamma_birth_place
#      union all
#     select
#     gamma_postcode_fake as comparison_vector_value,
#     sum(match_probability * 1) as m_count,
#     sum((1-match_probability) * 1) as u_count,
#     'postcode_fake' as output_column_name
#     from __splink__df_predict
#     group by gamma_postcode_fake
#      union all
#     select
#     gamma_occupation as comparison_vector_value,
#     sum(match_probability * 1) as m_count,
#     sum((1-match_probability) * 1) as u_count,
#     'occupation' as output_column_name
#     from __splink__df_predict
#     group by gamma_occupation
#      union all
#     select 0 as comparison_vector_value,
#            sum(match_probability * 1) /
#                sum(1) as m_count,
#            sum((1-match_probability) * 1) /
#                sum(1) as u_count,
#            '_probability_two_random_records_match' as output_column_name
#     from __splink__df_predict
#     )
# """
# con.execute(sql)
# end_time = time.time()


# apply_sort = True
# print(
#     f"Execution time for  "
#     f"apply_sort={apply_sort}: {end_time - start_time} seconds"
# )
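The salted join in the playground above partitions pairs with `ceiling(l.__splink_salt * 2) = 1` / `= 2`. A minimal Python sketch of that partitioning scheme, assuming (as the generated SQL suggests) that `__splink_salt` is uniform in (0, 1]:

```python
import math
import random

def salt_partition(salt_value, n_partitions):
    # mirrors the SQL condition ceiling(l.__splink_salt * n) = k,
    # bucketing each row into a partition in 1..n
    return math.ceil(salt_value * n_partitions)

random.seed(0)
salts = [random.uniform(1e-9, 1.0) for _ in range(10_000)]
counts = {1: 0, 2: 0}
for s in salts:
    counts[salt_partition(s, 2)] += 1
# the two partitions end up roughly balanced, so each partition's pairs
# can be generated and scored on a separate core
```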

@RobinL
Member Author

RobinL commented Dec 12, 2023

master:

Execution time train u: 8.24 seconds
Execution time predict no salt: 3.64 seconds
Execution time predict salt multiple rules: 8.93 seconds
Execution time predict salt one rules: 9.27 seconds

this branch
Execution time train u: 4.03 seconds
Execution time predict no salt: 2.22 seconds
Execution time predict salt multiple rules: 4.06 seconds
Execution time predict salt one rules: 3.72 seconds



example
import time

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
    jaro_winkler_at_thresholds,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.historical_50k


df = df.drop("cluster", axis=1)


def get_brs(salt):
    brs = [
        ["gender", "occupation", "full_name"],
        ["gender", "occupation", "postcode_fake"],
        ["occupation", "full_name"],
        ["occupation", "postcode_fake"],
        ["gender", "occupation", "first_and_surname"],
        ["first_name", "gender", "postcode_fake"],
        ["gender", "occupation", "dob"],
        ["first_name", "gender", "dob"],
        ["occupation", "first_and_surname"],
        ["gender", "occupation", "surname"],
    ]

    if salt > 1:
        brs = [block_on(x, salting_partitions=salt) for x in brs]
    else:
        brs = [block_on(x) for x in brs]
    return brs


settings_dict = {
    "probability_two_random_records_match": 0.0001,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [],
    "comparisons": [
        jaro_winkler_at_thresholds("first_name"),
        jaro_winkler_at_thresholds("surname"),
        levenshtein_at_thresholds("dob"),
        exact_match("birth_place"),
        levenshtein_at_thresholds("postcode_fake"),
        exact_match("occupation"),
    ],
    "retain_matching_columns": False,
    "retain_intermediate_calculation_columns": False,
}

# Time train u
linker = DuckDBLinker(df, settings_dict)


start_time = time.time()
df_e = linker.estimate_u_using_random_sampling(5e6)
end_time = time.time()

print(f"Execution time train u: {(end_time - start_time):,.2f} seconds")

# Time blocking no salting
settings_dict["blocking_rules_to_generate_predictions"] = get_brs(1)

linker = DuckDBLinker(df, settings_dict)


start_time = time.time()
df_e = linker.predict()
end_time = time.time()

print(f"Execution time predict no salt: {(end_time - start_time):,.2f} seconds")

# Time blocking with salting
settings_dict["blocking_rules_to_generate_predictions"] = get_brs(4)
linker = DuckDBLinker(df, settings_dict)


start_time = time.time()
df_e = linker.predict()
end_time = time.time()

print(
    f"Execution time predict salt multiple rules: {(end_time - start_time):,.2f} seconds"
)


# Time blocking with salting
settings_dict["blocking_rules_to_generate_predictions"] = [
    block_on("birth_place", salting_partitions=4)
]
linker = DuckDBLinker(df, settings_dict)


start_time = time.time()
df_e = linker.predict()
end_time = time.time()

print(f"Execution time predict salt one rules: {(end_time - start_time):,.2f} seconds")

@NickCrews
Contributor

  • For the first blocking rule, all pairs are generated and then scored
  • For the second blocking rule, all pairs are generated, then pairs already created by the first rule are rejected before being scored
  • For the third, all pairs are generated, then pairs created by the 1st or 2nd rule are rejected before being scored
    etc.

Ah, I figured all pairs were generated, and then compared. If the comparing happens in a streaming fashion as pairs are generated, then that makes sense. Thanks!

@RobinL RobinL marked this pull request as ready for review December 12, 2023 19:51
@OlivierBinette
Contributor

@RobinL chiming in here, it'd be great to be able to split up predict() into two separate steps, one for doing the inner join, and then another to compute similarity scores. This could help provide more control for parallelization.

I can also see myself wanting to experiment with different models using the same blocking rule, and in that case I could use a persisted record pairs table to speed things up across models.

@RobinL
Member Author

RobinL commented Dec 15, 2023

Yeah, I agree - I have been thinking about that too. I'm pretty sure it would guarantee parallelisation, but only at a big performance cost

It means you have to persist (to memory or disk) all comparisons, whereas if you create the comparisons and score as a single step many of them can be created and immediately rejected (because their score doesn't meet the minimum threshold).

Even if no threshold is set in your workflow (i.e. you're keeping all pairwise comparisons irrespective of score), it means the value comparisons get persisted, which you can avoid with the retain_intermediate_calculation_columns and retain_matching_columns settings.

For smaller workflows none of this is really a problem because things are fast. But I'm not sure the benefits of parallelisation outweigh the costs: creating the pairwise comparisons is relatively cheap vs scoring them.

Ultimately a lot of the above is a hunch rather than something I've rigorously tested, so we should probably do some experiments. I think there should probably be an option to split the steps up anyway (because it might be something the user wants for reasons other than performance).

Just to add a bit of clarity, this difference is this:

Combined steps algorithm
As a single step, DuckDB:

  • creates the comparison
  • scores it
  • deletes the matching columns (e.g. first_name_l, first_name_r exist for the purposes of scoring, but are never output if retain_matching_columns is set to false)
  • if the score exceeds the threshold match score, the pair is output to the final table

Two step algorithm
Duckdb:

  • creates the comparison (including e.g. first_name_l, first_name_r) and persists it to disk (or memory, if you're using an in-memory duckdb database)
  • Reads in this large table, which can be done in parallel
  • outputs a new table, potentially deleting e.g. first_name_l, first_name_r if retain_matching_columns is false, and removing rows below threshold

The latter results in potentially much higher disk and memory usage (and more work in outputting and reading back in).
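A toy illustration (pure Python, not splink code) of the memory difference: the combined approach streams pairs through the score-and-filter step, while the two-step approach materialises every candidate pair first.

```python
def generate_pairs(n):
    # lazily yield all n*(n-1)/2 candidate pairs
    for i in range(n):
        for j in range(i + 1, n):
            yield (i, j)

def score(pair):
    # stand-in for model scoring
    return (pair[0] + pair[1]) % 10 / 10

THRESHOLD = 0.8

# Combined: pairs below threshold are discarded as they stream past,
# so memory is proportional to the number of matches
combined = [p for p in generate_pairs(100) if score(p) >= THRESHOLD]

# Two-step: all 4,950 pairs are persisted before filtering,
# so memory is proportional to the number of candidate pairs
all_pairs = list(generate_pairs(100))
two_step = [p for p in all_pairs if score(p) >= THRESHOLD]
```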

@OlivierBinette
Contributor

OlivierBinette commented Dec 15, 2023

@RobinL That makes sense. I think both solutions make sense depending on the context.

Some other thoughts that you've probably considered:

  • Could it be possible to have the comparisons table only contain two columns, for the IDs in the left and right tables? I don't know if lookups to other tables would create too much overhead, or if that would enable parallelization in following steps.
  • What about blocking on an index column or working with a "partition by" clause on a blocking key?

For my use case, I'll be disambiguating around 50M records, with around 5-10 complex comparisons per record pair. Since I'm working with labeled data for evaluation, I can quickly get feedback on the accuracy of a model, and I'll want to try many models in a short period of time. I can work with an r6a.metal instance on AWS that has 192 cores and 1.5 TB of RAM. The cost is not prohibitive ($10 an hour), as long as the compute resources are fully utilised to get results fast.

@@ -117,7 +117,23 @@ def estimate_u_values(linker: Linker, max_pairs, seed=None):
training_linker._enqueue_sql(sql, "__splink__df_concat_with_tf_sample")
df_sample = training_linker._execute_sql_pipeline([nodes_with_tf])

settings_obj._blocking_rules_to_generate_predictions = []
if linker._sql_dialect == "duckdb" and max_pairs > 1e5:
if max_pairs < 1e6:
Member Author


This is just a heuristic to make duckdb parallelise more when the user is asking for a bigger computation

@RobinL RobinL mentioned this pull request Dec 21, 2023
@RobinL
Member Author

RobinL commented Dec 23, 2023

I've been doing some more formal benchmarking using aws ec2 instances.

Good news: for estimating u, runtime seems to scale inversely with the number of CPUs (i.e. doubling the cores roughly halves the time).

image

Here we're comparing the latest pypi release with this PR.

I've tested on:

https://instances.vantage.sh/aws/ec2/c6gd.2xlarge
https://instances.vantage.sh/aws/ec2/c6gd.4xlarge
image

We can see going from 2xlarge -> 4xlarge halved runtime

Here's the code:
https://github.com/RobinL/run_splink_benchmarks_in_ec2

@RobinL
Member Author

RobinL commented Dec 24, 2023

Now showing a 7.7x speedup against the c6gd.4xlarge, but I think it was limited to using only 20 CPUs:

image

@@ -51,6 +52,12 @@ def _proportion_sample_size_link_only(
return proportion, sample_size


def _get_duckdb_salting(max_pairs):
logged = math.log(max_pairs, 10)
Member Author


this is a heuristic.

max_pairs, salting
1e+00 1
1e+01 1
1e+02 1
1e+03 1
1e+04 1
1e+05 3
1e+06 7
1e+07 16
1e+08 40
2e+08 52
1e+09 98
1e+10 245
1e+11 611

Contributor


Might be useful to have a very brief explanation as a comment, as otherwise this function is maybe a little cryptic

@RobinL
Member Author

RobinL commented Dec 30, 2023

In terms of picking AWS EC2 instance types for benchmarking, it looks like memory isn't very important. Here's 100 million comparisons on a c6gd.xlarge (4 vCPU, 8 GB mem):
image

(this is estimate u only, so we're not persisting large datasets to memory or disk)

@RobinL
Copy link
Member Author

RobinL commented Jan 4, 2024

Even faster with 64 cpus,
image
That's now a 9.7x speedup on a fast workload, so we should definitely see 10x+ on bigger workloads

@RobinL
Member Author

RobinL commented Jan 4, 2024

image

It's able to use 100% of a 64 core instance to compare 1.1bn rows in <44 seconds

@RobinL
Member Author

RobinL commented Jan 4, 2024

Adding a further billion takes only 11 seconds longer (which illustrates that there is a significant 'startup/shutdown' cost, i.e. the algorithm only pins CPUs to 100% for part of its operation).

image

Total time taken: 243.48 seconds at cost of $0.17

Contributor

@ADBond ADBond left a comment


This looks great, amazing results!


@ADBond
Contributor

ADBond commented Jan 8, 2024

Also would be good to add this to the changelog

@RobinL RobinL merged commit 6e7f760 into master Jan 10, 2024
10 checks passed
@RobinL RobinL deleted the faster_duckdb branch January 10, 2024 14:25