Add new distances #304

Merged: merged 45 commits into main, Oct 5, 2023
Conversation

mojtababahrami (Contributor):

PR Checklist

  • Referenced issue is linked
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes

New distance metrics in the perturbation space are implemented.

Additional context

@mojtababahrami mojtababahrami self-assigned this Jul 3, 2023
@Zethson Zethson requested a review from yugeji July 18, 2023 08:34
Review threads on pertpy/tools/_distances/_distances.py:
    self.accepts_precomputed = False

    def __call__(self, X: np.ndarray, Y: np.ndarray, bins=10, **kwargs) -> float:
        kl_all = []
Contributor:

This implementation is ok! Did you test it? I think it's going to break depending on the shapes of the distributions of X and Y, but I'm not sure. However, I should note that typically we calculate KL divergence for continuous distributions assuming a Gaussian-distributed variable, in which case the KL divergence can be simply parameterized by the mean and variance of the two distributions. I would prefer if this could be the default implementation, as the Gaussian-distribution assumption is fairly standard, and the formula will also be more familiar to other people who work in ML.
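
A minimal sketch of that closed-form Gaussian KL, fitting an independent Gaussian per gene (the function name and the summation over genes are assumptions, not the final implementation):

    import numpy as np

    def gaussian_kl(X: np.ndarray, Y: np.ndarray, eps: float = 1e-8) -> float:
        # Fit an independent Gaussian per gene: column-wise mean and variance.
        mu_x, var_x = X.mean(axis=0), X.var(axis=0) + eps
        mu_y, var_y = Y.mean(axis=0), Y.var(axis=0) + eps
        # Closed-form KL(N(mu_x, var_x) || N(mu_y, var_y)), summed over genes.
        kl = 0.5 * (np.log(var_y / var_x) + (var_x + (mu_x - mu_y) ** 2) / var_y - 1)
        return float(kl.sum())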

mojtababahrami (Contributor Author):

I was also thinking of going this way at first, but I decided to keep it the way we had discussed... I'll change it to assume a Gaussian distribution. This also solves the problem of the divergence going to infinity in undefined regions.

yugeji (Contributor) commented Jul 19, 2023

In addition, can you add "sum of t-test over all genes" (assuming variables are normally distributed) as a metric, as discussed?

For KL divergence, there should either be a check or (better) a parameter to switch to the version which takes in count data. You should be able to use scipy's entropy for this implementation, I believe.
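
For reference, a hedged sketch of that count-data variant for a single gene, using scipy.stats.entropy over shared histogram bins (the smoothing term and the names are assumptions; without smoothing, empty bins in the second group send the divergence to infinity):

    import numpy as np
    from scipy.stats import entropy

    def kl_from_counts(x: np.ndarray, y: np.ndarray, bins: int = 10, eps: float = 1e-8) -> float:
        # Histogram both groups over a shared set of bin edges.
        edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=bins)
        p, _ = np.histogram(x, bins=edges)
        q, _ = np.histogram(y, bins=edges)
        # entropy(p, q) normalizes and computes sum(p * log(p / q));
        # eps keeps zero bins in q from producing infinities.
        return float(entropy(p + eps, q + eps))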

mojtababahrami (Contributor Author):

You said you have a t-test implementation that you'd send me. Would you?

yugeji (Contributor) commented Jul 21, 2023

@mojtababahrami1993 No... I thought there was a fast t-test implementation (where you skip computing the p-value) but we actually don't have it. I would make sure to get the same results as scipy.stats.ttest_ind but simply write it manually so that the additional computation of the p-value is dropped.

mojtababahrami (Contributor Author):

> For KL divergence, there should either be a check or (better) a parameter to switch to the version which takes in count data. You should be able to use scipy's entropy for this implementation, I believe.

Setting a parameter to calculate the KL divergence from count data will not work, because you easily go to infinity (a division by zero) whenever the second group's count is zero in a bin where the first group's count is greater than zero. You always have to fit a Gaussian/NB distribution first to avoid this.

implement T-test statistic
rename the distances
mojtababahrami (Contributor Author):

> @mojtababahrami1993 No... I thought there was a fast t-test implementation (where you skip computing the p-value) but we actually don't have it. I would make sure to get the same results as scipy.stats.ttest_ind but simply write it manually so that the additional computation of the p-value is dropped.

Done. Just to mention that I had to sum over the absolute values of the t-statistics across all genes, to avoid positive and negative statistics cancelling each other out. Makes sense?
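
Roughly what that looks like (a sketch of a manual Welch t-statistic per gene that matches scipy.stats.ttest_ind with equal_var=False but skips the p-value; names are illustrative, not the exact pertpy code):

    import numpy as np

    def t_test_distance(X: np.ndarray, Y: np.ndarray, eps: float = 1e-8) -> float:
        # Welch t-statistic per gene (column), computed manually so no p-value is needed.
        mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
        se2 = X.var(axis=0, ddof=1) / X.shape[0] + Y.var(axis=0, ddof=1) / Y.shape[0]
        t = (mu_x - mu_y) / np.sqrt(se2 + eps)
        # Sum the absolute values so positive and negative statistics don't cancel.
        return float(np.abs(t).sum())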

mojtababahrami and others added 6 commits July 31, 2023 16:22
RTD config updates (several commits; Signed-off-by: zethson <[email protected]>)
@Zethson Zethson changed the base branch from development to main August 21, 2023 12:22
Zethson (Member) left a comment:

FAILED tests/tools/_distances/test_distances.py::TestDistances::test_distance[kl_divergence] - assert nan == 0
FAILED tests/tools/_distances/test_distances.py::TestDistances::test_distance[t_test] - assert nan == 0

These tests don't pass yet. Texted @mojtababahrami1993 on Slack

yugeji (Contributor) left a comment:

NBNLL looks great!

    theta = np.repeat(1 / nb_params[1], x.shape[0])

    # calculate the nll of y
    eps = np.repeat(1e-8, x.shape[0])
Contributor:

Thoughts about allowing epsilon to be adjustable, as in the original scvi implementation of the NLL, @Zethson?

Member:

Yeah, I don't see why not. I'd strongly encourage you to set a default value though.

    class NBNLL(AbstractDistance):
        """
        Average negative log likelihood (scalar) of group B cells
        according to an NB distribution fitted over group A
Contributor:

@mojtababahrami1993 Can you add here credit for the equation below to scvi authors? Although I did check this parameterization of the NLL equation myself, the code is technically from them. @Zethson Let us know if there's something else we should do here.

Member:

If it's 100% copied we should add the scVI license here. If it's only adapted or something we can link to them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By adding the license I literally mean copying it into the folder where the Distance implementation lives. And state in the docstring nevertheless that you got it from scvi-tools.

Contributor:

If that's not too bad, let's do that then. The formula is the general formula, but even so there are different ways to call gammaln, for example.
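
For context, a hedged sketch of the general NB negative log likelihood under discussion, in the mean/inverse-dispersion parameterization used by scvi-tools (variable names and the eps placement are assumptions):

    import numpy as np
    from scipy.special import gammaln

    def nb_nll(y: np.ndarray, mu: np.ndarray, theta: np.ndarray, eps: float = 1e-8) -> float:
        # Negative log likelihood of counts y under NB(mu, theta),
        # where theta is the inverse dispersion; eps stabilizes the logs.
        log_theta_mu = np.log(theta + mu + eps)
        ll = (
            theta * (np.log(theta + eps) - log_theta_mu)
            + y * (np.log(mu + eps) - log_theta_mu)
            + gammaln(y + theta)
            - gammaln(theta)
            - gammaln(y + 1)
        )
        return float(-ll.sum())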

mojtababahrami (Contributor Author):

> FAILED tests/tools/_distances/test_distances.py::TestDistances::test_distance[kl_divergence] - assert nan == 0
> FAILED tests/tools/_distances/test_distances.py::TestDistances::test_distance[t_test] - assert nan == 0
>
> These tests don't pass yet. Texted @mojtababahrami1993 on Slack

@yugeji @Zethson
The tests are failing because the distances are tested on a subsampled dataset (adata_subsampled = sc.pp.subsample(adata, 0.001, copy=True)) with only 1 sample in each group. This makes the standard deviation of each group zero and some distances nan.
Do you suggest handling such a condition (a group with only 1 sample) in the distance functions, or reverting to testing on the previous, non-subsampled real data?

Zethson (Member) commented Sep 1, 2023

@mojtababahrami1993 we have to subsample because otherwise the test takes ages; Wasserstein in particular is really slow. I actually thought I had tested that your implementation breaks even without the subsampling, but that's apparently not the case. It would be really good if we could handle such cases in the code, because reverting the subsampling is something I'd really like to avoid.

Zethson (Member) commented Sep 6, 2023

@mojtababahrami1993 we are now conditionally subsampling:

    @fixture
    def adata(self, request):
        no_subsample_distances = ["kl_divergence", "t_test"]
        distance = request.node.callspec.params["distance"]

        adata = pt.dt.distance_example()
        if distance not in no_subsample_distances:
            adata = sc.pp.subsample(adata, 0.001, copy=True)
        else:
            adata = sc.pp.subsample(adata, 0.1, copy=True)

        return adata

which should ensure that your object has enough samples. A 10% subsample should be doable, right?

@stefanpeidli stefanpeidli self-assigned this Oct 2, 2023
stefanpeidli and others added 6 commits October 4, 2023 14:40
Moved from argument to Distance class attribute. Affects how
precomputed distances are stored and named.
Changed metric used in Edistance to sqeuclidean as in original paper.
Also fixed / added some tests.
…eislab/pertpy into implement_additional_distance_metrics
…eislab/pertpy into implement_additional_distance_metrics
@Zethson Zethson enabled auto-merge (squash) October 4, 2023 17:01
codecov bot commented Oct 4, 2023

Codecov Report

Merging #304 (ed254c4) into main (7a2f823) will decrease coverage by 37.19%.
Report is 120 commits behind head on main.
The diff coverage is 0.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #304       +/-   ##
==========================================
- Coverage   37.18%   0.00%   -37.19%     
==========================================
  Files          32      40        +8     
  Lines        3577    4888     +1311     
  Branches      661       0      -661     
==========================================
- Hits         1330       0     -1330     
- Misses       2126    4888     +2762     
+ Partials      121       0      -121     
Files Coverage Δ
pertpy/data/__init__.py 0.00% <ø> (-100.00%) ⬇️
pertpy/data/_datasets.py 0.00% <ø> (-100.00%) ⬇️
pertpy/plot/_scgen.py 0.00% <ø> (-40.63%) ⬇️
pertpy/tools/_scgen/_utils.py 0.00% <ø> (-100.00%) ⬇️
pertpy/preprocessing/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
pertpy/data/_dataloader.py 0.00% <0.00%> (-69.45%) ⬇️
pertpy/preprocessing/_guide_rna.py 0.00% <0.00%> (-66.67%) ⬇️
pertpy/tools/_kernel_pca.py 0.00% <0.00%> (-46.67%) ⬇️
pertpy/tools/_scgen/_base_components.py 0.00% <0.00%> (-86.28%) ⬇️
pertpy/plot/_guide_rna.py 0.00% <0.00%> (-38.10%) ⬇️
... and 29 more

... and 1 file with indirect coverage changes

@Zethson Zethson disabled auto-merge October 5, 2023 08:31
@Zethson Zethson merged commit 3a8e597 into main Oct 5, 2023
3 of 6 checks passed
wxicu added a commit that referenced this pull request Oct 16, 2023
…o dev_metadata

* 'dev_metadata' of https://github.com/theislab/pertpy:
  Documentation examples (#391)
  [pre-commit.ci] pre-commit autoupdate (#395)
  Speed up tests by subsampling (#398)
  Installation Apple Silicon (#393)
  Add new distances (#304)
  Fix cinema OT test (#392)
  [pre-commit.ci] pre-commit autoupdate (#390)
  wasserstein distance return type float (#386)
  fix naming of example data in doc examples (#387)
  Add test for test_distances.py Catches error as reported in Issue #385.
  Fix mypy warning for distances Type hint for `groups` reverted, Iterable is too general.
@Zethson Zethson deleted the implement_additional_distance_metrics branch November 2, 2023 15:09