Add new distances #304

Merged: merged 45 commits into main, Oct 5, 2023
Conversation

mojtababahrami (Contributor):

PR Checklist

  • Referenced issue is linked
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes

New distance metrics in the perturbation space are implemented.

Additional context

@mojtababahrami mojtababahrami self-assigned this Jul 3, 2023
@Zethson Zethson requested a review from yugeji July 18, 2023 08:34
Review threads on pertpy/tools/_distances/_distances.py:
    self.accepts_precomputed = False

    def __call__(self, X: np.ndarray, Y: np.ndarray, bins=10, **kwargs) -> float:
        kl_all = []
Contributor:

This implementation is ok! Did you test it? I think it's going to break depending on the shapes of the distributions of X and Y, but I'm not sure. However, I should note that typically we calculate KL divergence for continuous distributions assuming a Gaussian-distributed variable, in which case the KL divergence can be simply parameterized by the mean and variance of the two distributions. I would prefer if this could be the default implementation, as the Gaussian-distribution assumption is fairly standard, and the formula will also be more familiar to other people who work in ML.
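
A minimal sketch of that closed-form Gaussian KL, fitting an independent Gaussian per gene (the function name and the summation over genes are assumptions, not the final implementation):

    import numpy as np

    def gaussian_kl(X: np.ndarray, Y: np.ndarray, eps: float = 1e-8) -> float:
        # Fit an independent Gaussian per gene: column-wise mean and variance.
        mu_x, var_x = X.mean(axis=0), X.var(axis=0) + eps
        mu_y, var_y = Y.mean(axis=0), Y.var(axis=0) + eps
        # Closed-form KL(N(mu_x, var_x) || N(mu_y, var_y)), summed over genes.
        kl = 0.5 * (np.log(var_y / var_x) + (var_x + (mu_x - mu_y) ** 2) / var_y - 1)
        return float(kl.sum())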

mojtababahrami (Contributor Author):

I was also thinking of going this way at first, but I decided to keep it the way we had discussed... I'll change it to assume a Gaussian distribution. This also solves the problem of the divergence going to infinity in undefined regions.

yugeji (Contributor) commented Jul 19, 2023

In addition, can you add "sum of t-test over all genes" (assuming variables are normally distributed) as a metric, as discussed?

For KL divergence, there should either be a check or (better) a parameter to switch to the version which takes in count data. You should be able to use scipy's entropy for this implementation, I believe.
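
For reference, a hedged sketch of that count-data variant for a single gene, using scipy.stats.entropy over shared histogram bins (the smoothing term and the names are assumptions; without smoothing, empty bins in the second group send the divergence to infinity):

    import numpy as np
    from scipy.stats import entropy

    def kl_from_counts(x: np.ndarray, y: np.ndarray, bins: int = 10, eps: float = 1e-8) -> float:
        # Histogram both groups over a shared set of bin edges.
        edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=bins)
        p, _ = np.histogram(x, bins=edges)
        q, _ = np.histogram(y, bins=edges)
        # entropy(p, q) normalizes and computes sum(p * log(p / q));
        # eps keeps zero bins in q from producing infinities.
        return float(entropy(p + eps, q + eps))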

mojtababahrami (Contributor Author):

You said you have a t-test implementation that you'd send me. Would you?

yugeji (Contributor) commented Jul 21, 2023

@mojtababahrami1993 No... I thought there was a fast t-test implementation (where you skip computing the p-value) but we actually don't have it. I would make sure to get the same results as scipy.stats.ttest_ind but simply write it manually so that the additional computation of the p-value is dropped.

mojtababahrami (Contributor Author):

> For KL divergence, there should either be a check or (better) a parameter to switch to the version which takes in count data. You should be able to use scipy's entropy for this implementation, I believe.

Setting a parameter to calculate the KL divergence from count data will not work, because you easily go to infinity (a division by zero) whenever the second group's count is zero in a bin where the first group's count is greater than zero. You always have to fit a Gaussian/NB distribution first to avoid this.

implement T-test statistic
rename the distances
mojtababahrami (Contributor Author):

> @mojtababahrami1993 No... I thought there was a fast t-test implementation (where you skip computing the p-value) but we actually don't have it. I would make sure to get the same results as scipy.stats.ttest_ind but simply write it manually so that the additional computation of the p-value is dropped.

Done. Just to mention that I had to sum over the absolute values of the t-statistics across all genes, to avoid positive and negative statistics cancelling each other out. Makes sense?
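
Roughly what that looks like (a sketch of a manual Welch t-statistic per gene that matches scipy.stats.ttest_ind with equal_var=False but skips the p-value; names are illustrative, not the exact pertpy code):

    import numpy as np

    def t_test_distance(X: np.ndarray, Y: np.ndarray, eps: float = 1e-8) -> float:
        # Welch t-statistic per gene (column), computed manually so no p-value is needed.
        mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
        se2 = X.var(axis=0, ddof=1) / X.shape[0] + Y.var(axis=0, ddof=1) / Y.shape[0]
        t = (mu_x - mu_y) / np.sqrt(se2 + eps)
        # Sum the absolute values so positive and negative statistics don't cancel.
        return float(np.abs(t).sum())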

mojtababahrami and others added 6 commits July 31, 2023 16:22
RTD config updates (several commits; Signed-off-by: zethson <[email protected]>)
@Zethson Zethson changed the base branch from development to main August 21, 2023 12:22
Zethson (Member) left a comment:

FAILED tests/tools/_distances/test_distances.py::TestDistances::test_distance[kl_divergence] - assert nan == 0
FAILED tests/tools/_distances/test_distances.py::TestDistances::test_distance[t_test] - assert nan == 0

These tests don't pass yet. Texted @mojtababahrami1993 on Slack

yugeji (Contributor) left a comment:

NBNLL looks great!

    theta = np.repeat(1 / nb_params[1], x.shape[0])

    # calculate the nll of y
    eps = np.repeat(1e-8, x.shape[0])
Contributor:

Thoughts about allowing epsilon to be adjustable, as in the original scvi implementation of the NLL, @Zethson?

Member:

Yeah, I don't see why not. I'd strongly encourage you to set a default value though.

    class NBNLL(AbstractDistance):
        """
        Average negative log likelihood (scalar) of group B cells
        according to an NB distribution fitted over group A
Contributor:

@mojtababahrami1993 Can you add here credit for the equation below to scvi authors? Although I did check this parameterization of the NLL equation myself, the code is technically from them. @Zethson Let us know if there's something else we should do here.

Member:

If it's 100% copied we should add the scVI license here. If it's only adapted or something we can link to them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By adding the license I literally mean copying it into the folder where the Distance implementation lives. And state in the docstring nevertheless that you got it from scvi-tools.

Contributor:

If that's not too bad, let's do that then. The formula is the general formula, but even so there are different ways to call gammaln, for example.
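
For context, a hedged sketch of the general NB negative log likelihood under discussion, in the mean/inverse-dispersion parameterization used by scvi-tools (variable names and the eps placement are assumptions):

    import numpy as np
    from scipy.special import gammaln

    def nb_nll(y: np.ndarray, mu: np.ndarray, theta: np.ndarray, eps: float = 1e-8) -> float:
        # Negative log likelihood of counts y under NB(mu, theta),
        # where theta is the inverse dispersion; eps stabilizes the logs.
        log_theta_mu = np.log(theta + mu + eps)
        ll = (
            theta * (np.log(theta + eps) - log_theta_mu)
            + y * (np.log(mu + eps) - log_theta_mu)
            + gammaln(y + theta)
            - gammaln(theta)
            - gammaln(y + 1)
        )
        return float(-ll.sum())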

mojtababahrami (Contributor Author):

> FAILED tests/tools/_distances/test_distances.py::TestDistances::test_distance[kl_divergence] - assert nan == 0
> FAILED tests/tools/_distances/test_distances.py::TestDistances::test_distance[t_test] - assert nan == 0
>
> These tests don't pass yet. Texted @mojtababahrami1993 on Slack

@yugeji @Zethson
The tests are failing because the distances are tested on a subsampled dataset (adata_subsampled = sc.pp.subsample(adata, 0.001, copy=True)) with only 1 sample in each group. This makes the standard deviation of each group zero and some distances nan.
Do you suggest handling such a condition (a group with only 1 sample) in the distance functions, or reverting to testing on the previous, non-subsampled real data?

Zethson (Member) commented Sep 1, 2023

@mojtababahrami1993 we have to subsample because otherwise the test takes ages; Wasserstein in particular is really slow. I actually thought I had tested that your implementation breaks even without the subsampling, but that's apparently not the case. It would be really good if we could handle such cases in the code, because reverting the subsampling is something I'd really like to avoid.

Zethson (Member) commented Sep 6, 2023

@mojtababahrami1993 we are now conditionally subsampling:

    @fixture
    def adata(self, request):
        no_subsample_distances = ["kl_divergence", "t_test"]
        distance = request.node.callspec.params["distance"]

        adata = pt.dt.distance_example()
        if distance not in no_subsample_distances:
            adata = sc.pp.subsample(adata, 0.001, copy=True)
        else:
            adata = sc.pp.subsample(adata, 0.1, copy=True)

        return adata

which should ensure that your object has enough samples. A 10% subsample should be doable, right?

@stefanpeidli stefanpeidli self-assigned this Oct 2, 2023
stefanpeidli and others added 6 commits October 4, 2023 14:40
Moved from argument to Distance class attribute. Affects how
precomputed distances are stored and named.
Changed metric used in Edistance to sqeuclidean as in original paper.
Also fixed / added some tests.
…eislab/pertpy into implement_additional_distance_metrics
…eislab/pertpy into implement_additional_distance_metrics
@Zethson Zethson enabled auto-merge (squash) October 4, 2023 17:01
codecov bot commented Oct 4, 2023

Codecov Report

Merging #304 (ed254c4) into main (7a2f823) will decrease coverage by 37.19%.
Report is 120 commits behind head on main.
The diff coverage is 0.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #304       +/-   ##
==========================================
- Coverage   37.18%   0.00%   -37.19%     
==========================================
  Files          32      40        +8     
  Lines        3577    4888     +1311     
  Branches      661       0      -661     
==========================================
- Hits         1330       0     -1330     
- Misses       2126    4888     +2762     
+ Partials      121       0      -121     
Files Coverage Δ
pertpy/data/__init__.py 0.00% <ø> (-100.00%) ⬇️
pertpy/data/_datasets.py 0.00% <ø> (-100.00%) ⬇️
pertpy/plot/_scgen.py 0.00% <ø> (-40.63%) ⬇️
pertpy/tools/_scgen/_utils.py 0.00% <ø> (-100.00%) ⬇️
pertpy/preprocessing/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
pertpy/data/_dataloader.py 0.00% <0.00%> (-69.45%) ⬇️
pertpy/preprocessing/_guide_rna.py 0.00% <0.00%> (-66.67%) ⬇️
pertpy/tools/_kernel_pca.py 0.00% <0.00%> (-46.67%) ⬇️
pertpy/tools/_scgen/_base_components.py 0.00% <0.00%> (-86.28%) ⬇️
pertpy/plot/_guide_rna.py 0.00% <0.00%> (-38.10%) ⬇️
... and 29 more

... and 1 file with indirect coverage changes

@Zethson Zethson disabled auto-merge October 5, 2023 08:31
@Zethson Zethson merged commit 3a8e597 into main Oct 5, 2023
3 of 6 checks passed
wxicu added a commit that referenced this pull request Oct 16, 2023
…o dev_metadata

* 'dev_metadata' of https://github.com/theislab/pertpy:
  Documentation examples (#391)
  [pre-commit.ci] pre-commit autoupdate (#395)
  Speed up tests by subsampling (#398)
  Installation Apple Silicon (#393)
  Add new distances (#304)
  Fix cinema OT test (#392)
  [pre-commit.ci] pre-commit autoupdate (#390)
  wasserstein distance return type float (#386)
  fix naming of example data in doc examples (#387)
  Add test for test_distances.py Catches error as reported in Issue #385.
  Fix mypy warning for distances Type hint for `groups` reverted, Iterable is too general.
@Zethson Zethson deleted the implement_additional_distance_metrics branch November 2, 2023 15:09