All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Added "guesstimate" as the default value for `n_blocks`. This guesses an optimal number of blocks based on empirical observation.
- matrix-blocking/splitting as a performance-enhancer (see README.md for details)
- new keyword arguments `force_symmetries` and `n_blocks` (see README.md for details)
- new dependency on packages `topn` and `sparse_dot_topn_for_blocks` to help with the matrix-blocking
- capability to reuse a previously initialized StringGrouper (that is, the corpus can now persist across high-level function calls like `match_strings()`. See README.md for details.)
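The matrix-blocking idea mentioned above can be illustrated with a minimal, library-free sketch. It is not the package's implementation: the helper names (`dot_rows`, `split_rows`, `blocked_dot_rows`) and the `n_left`/`n_right` parameters are hypothetical, and plain lists stand in for sparse tf-idf matrices. It only shows that computing a similarity matrix block by block gives the same result as computing it in one pass.

```python
def dot_rows(a, b):
    """Dot product of every row of a with every row of b (i.e. a times b-transposed)."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def split_rows(matrix, n_blocks):
    """Split a matrix into at most n_blocks contiguous blocks of rows."""
    size = max(1, -(-len(matrix) // n_blocks))  # ceiling division, never zero
    return [matrix[i:i + size] for i in range(0, len(matrix), size)]


def blocked_dot_rows(a, b, n_left, n_right):
    """Same result as dot_rows(a, b), computed one block pair at a time."""
    result = []
    for a_block in split_rows(a, n_left):
        # One horizontal strip of the result per block of left-hand rows.
        strip = [[] for _ in a_block]
        for b_block in split_rows(b, n_right):
            part = dot_rows(a_block, b_block)
            for row, part_row in zip(strip, part):
                row.extend(part_row)
        result.extend(strip)
    return result


# Hypothetical toy data standing in for tf-idf row vectors.
left = [[1, 0, 2], [0, 3, 1], [4, 1, 0], [2, 2, 2], [0, 0, 5]]
right = [[1, 1, 0], [0, 2, 1], [3, 0, 1]]

full = dot_rows(left, right)
blocked = blocked_dot_rows(left, right, n_left=2, n_right=2)
```

Each block pair only needs its own slice of the operands in memory at once, which is the usual motivation for blocking large products.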
- Improved the performance of the function `match_most_similar`.
- The `Series` `duplicates` is now the left operand, while `master` is the right operand in the underlying left-join operation that does the string-matching.
- Changed the default value of the keyword argument `max_n_matches` to the total number of strings in `master`. (`max_n_matches` is now defined as the maximum number of matches allowed per string in `duplicates` [or `master` if `duplicates` is not given]).
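As a rough illustration of what "maximum number of matches allowed per string in `duplicates`" means, the sketch below caps the matches kept for each left-hand string. It is an assumption-laden toy, not the package's code: `top_matches` is a hypothetical helper, and `difflib.SequenceMatcher` stands in for the tf-idf cosine similarity the package actually uses.

```python
from difflib import SequenceMatcher


def top_matches(duplicates, master, max_n_matches=None, min_similarity=0.6):
    """For each string in duplicates, keep its best matches from master."""
    if max_n_matches is None:
        # Mirrors the default described above: allow as many matches as master has strings.
        max_n_matches = len(master)
    rows = []
    for left in duplicates:
        scored = [(SequenceMatcher(None, left, right).ratio(), right) for right in master]
        scored = [pair for pair in scored if pair[0] >= min_similarity]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # The cap applies per left-hand string, not to the result as a whole.
        for score, right in scored[:max_n_matches]:
            rows.append((left, right, score))
    return rows


# Hypothetical toy inputs.
duplicates = ["foo", "bar"]
master = ["foo", "fooo", "baz"]

all_rows = top_matches(duplicates, master)
capped_rows = top_matches(duplicates, master, max_n_matches=1)
```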
- Added new keyword argument `tfidf_matrix_dtype` (the datatype for the tf-idf values of the matrix components). Allowed values are `numpy.float32` and `numpy.float64` (used by the required external package `sparse_dot_topn` version 0.3.1). Default is `numpy.float32`. (Note: `numpy.float32` often leads to faster processing and a smaller memory footprint albeit less numerical precision than `numpy.float64`.)
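The precision-versus-memory trade-off noted above can be seen with the standard library alone. This sketch does not call the package or NumPy; it uses `struct` to round-trip a value through IEEE 754 single precision, which is assumed here to be representative of what `numpy.float32` stores.

```python
import struct

value = 0.1

# Round-trip through 4-byte single precision and 8-byte double precision.
as_float32 = struct.unpack("<f", struct.pack("<f", value))[0]
as_float64 = struct.unpack("<d", struct.pack("<d", value))[0]

error32 = abs(as_float32 - value)  # small but nonzero rounding error
error64 = abs(as_float64 - value)  # Python floats are already double precision

# Storage needed for one million matrix components of each type.
n_components = 1_000_000
bytes32 = n_components * struct.calcsize("<f")
bytes64 = n_components * struct.calcsize("<d")
```

Halving the bytes per component is where the smaller memory footprint comes from; the rounding error is the price paid for it.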
- Changed dependency on `sparse_dot_topn` from version 0.2.9 to 0.3.1
- Changed the default datatype for cosine similarities from `numpy.float64` to `numpy.float32` to boost computational performance at the expense of numerical precision.
- Changed the default value of the keyword argument `max_n_matches` from 20 to the number of strings in `duplicates` (or `master`, if `duplicates` is not given).
- Changed the warning, issued when the condition [`include_zeroes=True` and `min_similarity` ≤ 0 and `max_n_matches` is not sufficiently high to capture all nonzero-similarity-matches] is met, to an exception.
- Removed the keyword argument `suppress_warning`
- Added group representative functionality - by default the centroid is used. From @ParticularMiner
- Added string_grouper_utils package with additional group-representative functionality:
  - new_group_rep_by_earliest_timestamp
  - new_group_rep_by_completeness
  - new_group_rep_by_highest_weight

  From @ParticularMiner
- Original indices are now added by default to output of `group_similar_strings`, `match_most_similar` and `match_strings`. From @ParticularMiner
- `compute_pairwise_similarities` function. From @ParticularMiner
- Default group representative is now the centroid. Used to be the first string in the series belonging to a group. From @ParticularMiner
- Output of `match_most_similar` and `match_strings` is now a `pandas.DataFrame` object instead of a `pandas.Series` by default. From @ParticularMiner
- Fixed a bug which occurs when min_similarity=0. From @ParticularMiner
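To make the centroid-based group representative mentioned in the entries above more concrete, here is a small sketch. It assumes that "centroid" means the group member with the largest total similarity to the other members; that definition, the `centroid` helper, and the bigram Jaccard similarity are all illustrative assumptions, not taken from the package.

```python
def bigrams(text):
    """Set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Toy similarity: overlap of character bigrams (stand-in for tf-idf cosine)."""
    x, y = bigrams(a), bigrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def centroid(group, similarity=jaccard):
    """Return the member with the highest summed similarity to the other members."""
    def total(i):
        return sum(similarity(group[i], group[j]) for j in range(len(group)) if j != i)

    best = max(range(len(group)), key=total)
    return group[best]


# Hypothetical group of similar strings; the first member is deliberately not the centroid.
group = ["acme incorporated", "acme inc.", "acme inc"]
representative = centroid(group)
first_member = group[0]
```

With this toy data the representative differs from the first string in the group, which is the behavioural difference between the old and new defaults.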