Add new distances #304
Conversation
```python
self.accepts_precomputed = False

def __call__(self, X: np.ndarray, Y: np.ndarray, bins=10, **kwargs) -> float:
    kl_all = []
```
This implementation is ok! Did you test it? I think it's going to break depending on the shapes of the distributions of X and Y, but I'm not sure. However, I should note that typically we calculate KL divergence for continuous distributions assuming a Gaussian-distributed variable, in which case the KL divergence can be simply parameterized by the mean and variance of the two distributions. I would prefer if this could be the default implementation, as the Gaussian-distribution assumption is fairly standard, and the formula will also be more familiar to other people who work in ML.
I was also thinking of going this way at first, but I decided to keep it the way we had discussed... I'll change it to assume a Gaussian distribution. This also solves the problem of the divergence going to infinity in undefined regions.
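For reference, a minimal sketch of the Gaussian-assumption version discussed above (not the merged pertpy code; the function name `gaussian_kl` and the `eps` stabilizer are illustrative). Per gene, KL(N(mu1, var1) || N(mu2, var2)) = 0.5 * log(var2/var1) + (var1 + (mu1 - mu2)^2) / (2 * var2) - 0.5, summed over genes:

```python
import numpy as np

def gaussian_kl(X: np.ndarray, Y: np.ndarray, eps: float = 1e-8) -> float:
    """Sum over genes of KL(N(mu_x, var_x) || N(mu_y, var_y)) fitted per gene.

    X and Y are cells-by-genes matrices; eps keeps variances strictly positive.
    """
    mu_x, var_x = X.mean(axis=0), X.var(axis=0) + eps
    mu_y, var_y = Y.mean(axis=0), Y.var(axis=0) + eps
    kl = 0.5 * np.log(var_y / var_x) + (var_x + (mu_x - mu_y) ** 2) / (2 * var_y) - 0.5
    return float(kl.sum())
```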
In addition, can you add "sum of t-test over all genes" (assuming variables are normally distributed) as a metric, as discussed? For KL divergence, there should either be a check or (better) a parameter to switch to the version that takes count data. You should be able to use scipy's entropy for this implementation, I believe.
You said you have a t-test implementation you'd send me. Could you?
@mojtababahrami1993 No... I thought there was a fast t-test implementation (one that skips computing the p-value), but we actually don't have it. I would make sure to get the same results as
Setting a parameter to calculate the KL divergence from count data won't work, because you'll easily go to infinity (a division by zero) whenever the gene count of the second group is zero in a bin where the first group's count is greater than zero. You always have to fit a Gaussian/NB distribution first to avoid this.
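To illustrate the failure mode: `scipy.stats.entropy(p, q)` returns inf as soon as any bin of q is empty where p has counts, which happens whenever the two groups' supports don't fully overlap. A self-contained sketch (`histogram_kl` is a hypothetical helper, not pertpy code):

```python
import numpy as np
from scipy.stats import entropy

def histogram_kl(x: np.ndarray, y: np.ndarray, bins: int = 10) -> float:
    """KL divergence between binned counts of two 1-D samples."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    # entropy() normalizes p and q internally; it returns inf
    # whenever q has an empty bin where p has counts.
    return entropy(p, q)

rng = np.random.default_rng(0)
print(histogram_kl(rng.normal(0, 1, 500), rng.normal(5, 1, 500)))  # -> inf
```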
Implement t-test statistic; rename the distances
Done. Just to mention that I had to sum over the absolute values of the t-statistic across all genes to keep positive and negative statistics from cancelling each other out. Makes sense?
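A minimal sketch of the summed-|t| metric described above, assuming cells-by-genes inputs (the name `t_test_distance` and the nan handling are illustrative, not the final pertpy implementation):

```python
import numpy as np
from scipy import stats

def t_test_distance(X: np.ndarray, Y: np.ndarray) -> float:
    """Sum of absolute two-sample t-statistics across genes."""
    t, _ = stats.ttest_ind(X, Y, axis=0)
    # Genes with zero variance in both groups yield nan statistics;
    # nansum skips them instead of propagating nan into the sum.
    return float(np.nansum(np.abs(t)))
```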
```
FAILED tests/tools/_distances/test_distances.py::TestDistances::test_distance[kl_divergence] - assert nan == 0
FAILED tests/tools/_distances/test_distances.py::TestDistances::test_distance[t_test] - assert nan == 0
```

These tests don't pass yet. Texted @mojtababahrami1993 on Slack.
NBNLL looks great!
```python
theta = np.repeat(1 / nb_params[1], x.shape[0])

# calculate the nll of y
eps = np.repeat(1e-8, x.shape[0])
```
Thoughts about allowing epsilon to be adjustable, as in the original scvi implementation of the NLL, @Zethson?
Yeah, I don't see why not. I'd strongly encourage you to set a default value though.
```python
class NBNLL(AbstractDistance):
    """
    Average negative log-likelihood (scalar) of group B cells
    under an NB distribution fitted on group A
    """
```
@mojtababahrami1993 Can you credit the scvi authors here for the equation below? Although I did check this parameterization of the NLL equation myself, the code is technically theirs. @Zethson Let us know if there's anything else we should do here.
If it's 100% copied, we should add the scVI license here. If it's only adapted, we can link to them.
By adding the license I literally mean copying it into the folder where the Distance implementation lives. And still state in the docstring that you got it from scvi-tools.
If that's not too bad, let's do that then. The formula is the general one, but even so there are different ways to call gammaln, for example.
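For context, a sketch of the parameterization under discussion, adapted from scvi-tools' negative binomial log-likelihood (the function name and the eps default here are illustrative; credit and license handling per the comments above):

```python
import numpy as np
from scipy.special import gammaln

def nb_nll(x: np.ndarray, mu: np.ndarray, theta: np.ndarray, eps: float = 1e-8) -> float:
    """Mean negative log-likelihood of counts x under NB(mu, theta).

    Parameterization adapted from scvi-tools: mu is the mean,
    theta the inverse dispersion; eps stabilizes the logs.
    """
    log_theta_mu_eps = np.log(theta + mu + eps)
    log_lik = (
        theta * (np.log(theta + eps) - log_theta_mu_eps)
        + x * (np.log(mu + eps) - log_theta_mu_eps)
        + gammaln(x + theta)
        - gammaln(theta)
        - gammaln(x + 1)
    )
    return float(-log_lik.mean())
```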
@yugeji @Zethson
@mojtababahrami1993 We have to subsample because otherwise the test takes ages; Wasserstein in particular is really slow. I actually thought I had tested that your implementation breaks even without the subsampling, but that's apparently not the case. It would be really good if the code could handle such cases, because reverting the subsampling is something I'd really like to avoid.
@mojtababahrami1993 We are now conditionally subsampling (rough sketch below), which should ensure that your object has enough samples. A 10% subsample should be doable, right?
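The actual snippet isn't reproduced in this thread; a rough sketch of what conditional subsampling could look like, assuming scanpy's `sc.pp.subsample` and a hypothetical minimum-cell threshold:

```python
import scanpy as sc

MIN_CELLS = 100  # hypothetical threshold, not the value used in pertpy

def maybe_subsample(adata, fraction: float = 0.1):
    """Subsample only when enough cells would remain afterwards."""
    if adata.n_obs * fraction >= MIN_CELLS:
        sc.pp.subsample(adata, fraction=fraction)
    return adata
```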
Moved from argument to Distance class attribute; affects how precomputed distances are stored and named. Changed the metric used in Edistance to sqeuclidean, as in the original paper. Also fixed/added some tests.
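For reference, the E-distance with squared Euclidean pairwise distances, as in the original paper: twice the mean cross-group distance minus both within-group means. A minimal sketch (`edistance` is an illustrative name, not the pertpy class):

```python
import numpy as np
from scipy.spatial.distance import cdist

def edistance(X: np.ndarray, Y: np.ndarray) -> float:
    """E-distance: 2 * mean cross-group distance minus both within-group means."""
    delta = cdist(X, Y, metric="sqeuclidean").mean()
    sigma_x = cdist(X, X, metric="sqeuclidean").mean()
    sigma_y = cdist(Y, Y, metric="sqeuclidean").mean()
    return float(2 * delta - sigma_x - sigma_y)
```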
Codecov Report

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main     #304       +/-   ##
===========================================
- Coverage   37.18%    0.00%    -37.19%
===========================================
  Files          32       40         +8
  Lines        3577     4888      +1311
  Branches      661        0       -661
===========================================
- Hits         1330        0      -1330
- Misses       2126     4888      +2762
+ Partials      121        0       -121
```
…o dev_metadata

* 'dev_metadata' of https://github.com/theislab/pertpy:
  - Documentation examples (#391)
  - [pre-commit.ci] pre-commit autoupdate (#395)
  - Speed up tests by subsampling (#398)
  - Installation Apple Silicon (#393)
  - Add new distances (#304)
  - Fix cinema OT test (#392)
  - [pre-commit.ci] pre-commit autoupdate (#390)
  - wasserstein distance return type float (#386)
  - fix naming of example data in doc examples (#387)
  - Add test for test_distances.py (catches error as reported in Issue #385)
  - Fix mypy warning for distances (type hint for `groups` reverted; Iterable is too general)
PR Checklist

- docs is updated

Description of changes

New distance metrics in the perturbation space are implemented.

Additional context