
Cell Typist Providing different results between iterations #126

Open
ManuelSokolov opened this issue Jul 26, 2024 · 6 comments

Comments

@ManuelSokolov

ManuelSokolov commented Jul 26, 2024

Hi! I am doing label transfer from a reference dataset and classifying two query sets that should contain exactly the same cell types. I noticed that, across several iterations, the classifications differ each time.

import scanpy as sc
import pandas as pd
import celltypist

reference = sc.read_h5ad("data/combined_ref.h5ad")
query1 = sc.read_h5ad("querys/unnorm_sc_C32-24h.h5ad")
query2 = sc.read_h5ad("querys/unnorm_sc_C32-72h.h5ad")

# Normalise to 10,000 counts per cell and log1p-transform, as CellTypist expects
sc.pp.normalize_total(query1, target_sum=1e4)
sc.pp.log1p(query1)

sc.pp.normalize_total(query2, target_sum=1e4)
sc.pp.log1p(query2)

sc.pp.normalize_total(reference, target_sum=1e4)
sc.pp.log1p(reference)

predictions24h = pd.DataFrame()
predictions72h = pd.DataFrame()
predictions24h['id'] = list(query1.obs_names)
predictions72h['id'] = list(query2.obs_names)

features = []

for i in range(25):
    print(f"iteration {i}")
    # Retrain the model from scratch in every iteration
    model2 = celltypist.train(reference, labels='CellClass', n_jobs=10, feature_selection=True)
    if i == 0:
        features = model2.features
    # Keep only the features selected in every iteration so far
    extracted = model2.features
    features = list(set(extracted) & set(features))
    prediction_query1 = celltypist.annotate(query1, model=model2, majority_voting=True)
    prediction_query2 = celltypist.annotate(query2, model=model2, majority_voting=True)
    adata2_query1 = prediction_query1.to_adata()
    adata2_query2 = prediction_query2.to_adata()
    predictions24h[f'run{i}'] = list(prediction_query1.predicted_labels.majority_voting)
    predictions72h[f'run{i}'] = list(prediction_query2.predicted_labels.majority_voting)

As you can see in the next plot, I plotted, for each sample (rows), the percentage of predicted cell types across the 25 iterations (e.g. the first sample in the graph was classified as radial glia 40% of the time and as glioblast 60% of the time).
[Screenshot 2024-07-26, 12:46:56: per-sample percentages of predicted cell types across the 25 iterations]
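(For reference, percentages like these can be computed from the predictions24h DataFrame built in the loop above; a minimal sketch, assuming matplotlib is installed:)

import matplotlib.pyplot as plt

run_cols = [c for c in predictions24h.columns if c.startswith('run')]

# For each cell, the fraction of the 25 iterations assigning each label
label_fractions = (
    predictions24h[run_cols]
    .apply(lambda row: row.value_counts(normalize=True), axis=1)
    .fillna(0)
)

# Stacked bars: one bar per cell, coloured by predicted cell type
label_fractions.plot(kind='bar', stacked=True, width=1.0)
plt.ylabel('fraction of iterations')
plt.xlabel('cell')
plt.show()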
Is this behaviour expected/documented for CellTypist? What is recommended in this case?

Best Regards,

Manuel

@ChuanXu1
Collaborator

@ManuelSokolov, the training process involves various sources of randomness. For example, the first round of training uses SGD, which shuffles the data before each epoch starts and therefore introduces randomness. If you want a more stable model, a better approach is to increase the number of iterations during training (e.g., max_iter = 2000), at the cost of a longer runtime.
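A minimal sketch of that suggestion, reusing the training call from the original post (2000 is just the example value above, not a tuned optimum):

model2 = celltypist.train(
    reference,
    labels='CellClass',
    n_jobs=10,
    feature_selection=True,
    max_iter=2000,  # more training iterations for a more converged, stabler model
)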

@ManuelSokolov
Author

ManuelSokolov commented Jul 28, 2024

@ChuanXu1 thank you for your response. The use_SGD flag is set to False by default, so that source of randomness should not exist. Is there any other reason that could be driving this randomness? Disabling feature selection during training seems to have removed the randomness from the model.
Also, my goal, in addition to stability, is to obtain correct results: a model that classifies wrongly with high confidence scores is not helpful in this case (the UMAP below shows the result of one iteration).
[Screenshot 2024-07-28, 22:23:03: UMAP of predictions from a single iteration with feature selection]
If I disable feature selection, the result is always the same:
[Screenshot 2024-07-28, 22:24:14: UMAP of predictions with feature selection disabled]
However, since the results with and without feature selection seem to be completely different, I am not sure whether I can trust the model. Can you please comment on this?

@ChuanXu1
Collaborator

@ManuelSokolov, the first round of training always uses SGD. use_SGD = False (the default) only applies to the second round of training, after feature selection.
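Conceptually, the two rounds can be sketched with scikit-learn as follows; this is an illustration of the idea only, not CellTypist's actual code, and the toy data, gene count and top_genes value are made up for the example:

import numpy as np
from sklearn.linear_model import SGDClassifier, LogisticRegression

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(500, 2000))              # toy cells x genes matrix
y = rng.choice(['radial glia', 'glioblast'], 500)   # toy cell labels

# Round 1: SGD-based logistic regression on all genes.
# SGD shuffles the data before each epoch, so this step is stochastic.
round1 = SGDClassifier(loss='log_loss')
round1.fit(X, y)

# Feature selection: keep the genes with the largest absolute coefficients.
top_genes = 300  # illustrative number of genes to keep
selected = np.argsort(-np.abs(round1.coef_).max(axis=0))[:top_genes]

# Round 2: refit on the selected genes only. This is the step that
# use_SGD=False (the default) runs as a plain logistic regression.
round2 = LogisticRegression(max_iter=1000)
round2.fit(X[:, selected], y)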

@ManuelSokolov
Author

ManuelSokolov commented Jul 28, 2024

Sorry @ChuanXu1, you seem to have responded before I edited my reply. Disabling feature selection seems to have stabilized the results, but it is difficult to know which result is right or wrong; please see the message above.

@ChuanXu1
Collaborator

@ManuelSokolov, it is usually recommended to use feature selection to speed up the run and increase the accuracy.

@ManuelSokolov
Author

ManuelSokolov commented Jul 28, 2024

In this case it seems to be reducing accuracy by producing different results across iterations. I also looked into the annotate method: it performs standard scaling before classification, and this option cannot be turned off. What is your recommendation given this example?
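As a rough way to quantify that instability, the predictions24h DataFrame from my first post can be summarised as the fraction of iterations agreeing with each cell's most frequent label (a sketch using only pandas):

run_cols = [c for c in predictions24h.columns if c.startswith('run')]

# Most frequent label per cell across the 25 runs, and per-cell agreement with it
modal_label = predictions24h[run_cols].mode(axis=1)[0]
agreement = predictions24h[run_cols].eq(modal_label, axis=0).mean(axis=1)

print(f"mean per-cell agreement: {agreement.mean():.2%}")
print(f"cells with a unanimous call: {(agreement == 1).mean():.2%}")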
