problem #496

Open
Hananebh opened this issue Jan 13, 2023 · 8 comments

Comments

@Hananebh

Hello,
I am working with infercnv version 1.14, and both the dendrogram and the final result are completely different from what I obtained with infercnv version 1.9.
Screenshot 2023-01-13 at 10 16 51

I would like to know what changed between the two versions and why I have a problem with the new dendrogram.
Your help is highly appreciated! Thank you.

@GeorgescuC
Collaborator

GeorgescuC commented Jan 13, 2023 via email

@Hananebh
Author

Hananebh commented Jan 14, 2023

Hi @GeorgescuC,
Thank you for your answer and for your useful explanations.
The code I used for both versions is the same:

infercnv_obj = infercnv::run(
  infercnv_obj,
  cutoff = 0.1,
  out_dir = "sampleoutput",  # a comma was missing after this argument
  cluster_by_groups = TRUE,
  denoise = TRUE,
  HMM = TRUE,
  plot_steps = FALSE,
  num_threads = 8
  )

Screenshot 2023-01-14 at 13 39 52
How can I correct the dendrogram, please? I want to keep the new result, but with a good dendrogram.
Thanks,
Regards

@nigiord

nigiord commented Feb 8, 2023

Hi @Hananebh, I have the same issue with the latest version of inferCNV (1.14). I think you could try the following parameters to get something similar to how the clustering was done previously in 1.9:

infercnv::run(
  infercnv_obj,
  […]
  k_nn=30,  # 1.9.1 default param
  leiden_resolution = 1,  # 1.9.1 default param
  leiden_method = "simple",  # 1.9.1 default param
  leiden_function = "modularity",  # 1.9.1 default param
  […]
  )

The results are not exactly the same, but at least they are rather similar.

I initially tried to reduce the resolution parameter with the PCA+CPM new-default approach as suggested by @GeorgescuC, but I need to use reaaaaaally low values (down to 0.0005) to have at least some groups that are not singletons, and the original subgroup annotations are poorly clustered, so I don’t think that’s the way to go.

Cheers,
Nils

@GeorgescuC
Collaborator

Hi @Hananebh and @nigiord ,

The settings that @nigiord posted should indeed give the closest results to previous versions of infercnv, which used the Python implementation of the Leiden algorithm instead of igraph's. One more setting, added in 1.14, can affect the subclustering and might be worth changing: the masking of genes that have a z score over (by default) 0.8 in references, controlled by the z_score_filter option. This masking is done to ignore genes that show a strong signal in references, as is common for MHC genes on chromosome 6, for example. Looking at your plot, however, it might mask more genes than expected because of the cluster of cells at the top of the references, which look different from the rest and have a residual expression pattern very similar to your observations.
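
For readers following along, here is a minimal sketch of how the options discussed so far might be combined in a single call. It assumes `infercnv_obj` was already created with `infercnv::CreateInfercnvObject()` and reuses the argument values quoted earlier in this thread; the `z_score_filter` value shown is simply the default mentioned above, written out explicitly, not a recommendation.

```r
library(infercnv)

# Assumption: `infercnv_obj` already exists (built with CreateInfercnvObject()).
infercnv_obj <- infercnv::run(
  infercnv_obj,
  cutoff = 0.1,
  out_dir = "sampleoutput",
  cluster_by_groups = TRUE,
  denoise = TRUE,
  HMM = TRUE,
  # Subclustering settings suggested above to approximate the 1.9 behaviour
  k_nn = 30,
  leiden_resolution = 1,
  leiden_method = "simple",
  leiden_function = "modularity",
  # Genes with a z score above this threshold in references are masked
  # during subclustering; a higher threshold masks fewer genes.
  z_score_filter = 0.8
)
```

Changing one option at a time and comparing the resulting plots is probably the easiest way to see which setting drives the difference.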

Regards,
Christophe.

@deevdevil88

deevdevil88 commented Feb 16, 2023

Hi @GeorgescuC,
I have a follow-up question. I also ran into a similar subclustering problem with Leiden, which I fixed thanks to the settings @nigiord posted. Since you mentioned the z_score_filter option, I am wondering when one should increase this value; or rather, to keep it from masking more genes than expected, is a z score of 2 a reasonable filter? My final plot after improving the Leiden clustering looks like this. Our initial thought was that we still have some "potential tumour cells" in the reference, but based on what you just said it might be the z_score_filter setting. I don't know how to tell whether it is one or the other. Your opinion is appreciated, thanks!
Screenshot 2023-02-16 at 10 45 10

Best,
Devika

@GeorgescuC
Collaborator

Hi @deevdevil88 ,

The z score filtering does not affect in any way the residual expression values shown in the figures. The z score only masks genes when calculating the nearest neighbors for cells (either directly on the residual expression, or on a PCA of the residual expression) at the start of the Leiden clustering, and nothing else. The visible downstream effect is on the HMM predictions, which use subclusters to combine information from clonal populations of cells, so inaccurate or overly fragmented subclusters would reduce HMM prediction accuracy.

In the figure you shared, there are 2 things I notice:

  • Some extremely small groups of reference-annotated cells. If they do not just look small because of compression on the plot, this could mean that mostly noise is used as one of the references.
  • Within your 1st and 2nd groups of reference cells, there appear to be at least 3 and 2 different consistent patterns of expression, respectively, that are not properly "zeroed" because they are mixed together. One of those patterns, the top cluster of cells in your 2nd reference group, seems to at least partially overlap with the signal seen in most observation cells on chromosomes 10 and 19.

I cannot tell from this if the cells you used as reference are actually healthy or not as infercnv is an analysis relative to what is provided as references, but it is worth looking more into the accuracy of the clusterings you defined for your references.

Regards,
Christophe.

@deevdevil88

deevdevil88 commented Feb 17, 2023

Hi @GeorgescuC,
Thank you for replying. You are right: this is currently our pilot, and we don't have "proper normal" cells sampled for sure. For now the reference is all non-immune cells that aren't CA9 positive, but of course it seems that this isn't removing all tumour cells. So it seems I need to clean up the reference cells further; I will remove the small groups of reference-annotated cells for now and re-run. Thank you, these observations have been helpful. In the future we should have data from normal regions, which would solve the problem of the reference not being clean enough.

Best,
Devika

@GeorgescuC
Collaborator

Hi @deevdevil88 ,

One thing to keep in mind is that the normal baseline expression is defined as the average expression in each reference group. This means that signal showing up in your references does not necessarily mean that some of those cells are malignant/tumor. It simply means there is heterogeneity within the groups of cells you have defined as references. Conversely, if you define a homogeneous group of tumor cells as references (a clonal expansion, for example), they will not display any signal, since they are treated as one of the baseline expression levels, and normal cells used as observations might show signal that is the opposite of the event that happened in the tumor, because the difference is relative.

A very common example of signal showing up in healthy references is MHC genes on chromosome 6 for immune cells: usually about half the cells show a loss signal while the other half show a gain signal. This observation was also the basis for adding the z score masking option during subclustering.
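
To make the "relative to the references" point concrete, here is a toy sketch in base R. It is not infercnv's actual computation (which involves more steps than a plain mean subtraction); the values and the `relative_signal` helper are made up purely for illustration.

```r
# Toy log-expression for one region: normal cells at 0, tumor cells with a gain (+1).
expr <- c(normal1 = 0, normal2 = 0, tumor1 = 1, tumor2 = 1)

# Hypothetical helper: signal relative to the mean of the chosen reference cells.
relative_signal <- function(expr, reference) {
  expr - mean(expr[reference])
}

# Normal cells as reference: tumor cells show a gain, normal cells are flat.
normal_ref <- relative_signal(expr, c("normal1", "normal2"))

# Homogeneous tumor cells as reference: tumor cells are flat,
# and normal cells show an apparent loss (the opposite of the real event).
tumor_ref <- relative_signal(expr, c("tumor1", "tumor2"))

# Heterogeneous reference (MHC-like case): half the cells low, half high.
# The mean is 0, so every reference cell still shows signal.
mixed <- c(a = -1, b = -1, c = 1, d = 1)
mixed_ref <- relative_signal(mixed, names(mixed))

print(normal_ref)
print(tumor_ref)
print(mixed_ref)
```

In the last case the reference contains no "tumor" cells at all, yet signal still appears, which is the heterogeneity effect described above.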

Regards,
Christophe.
