Working with continuous expression data? #94

Closed
stevenagl12 opened this issue Feb 19, 2024 · 12 comments

@stevenagl12

I have a potentially dumb question. As I understand it, to use this package on continuous biological data, such as gene expression or cytometry data, we first need to discretize the data. However, the built-in bn.discretize function takes an already-built graph as input. With our data, we can't infer which nodes and edges we have to start a random graph. How can we use this package with such continuous data? As I understand it, the R bnlearn library comes with the iamb and Hartemink discretization options, but I don't see those in this package.

@erdogant
Owner

erdogant commented Feb 21, 2024

When you only have data and want to start without a structure, try structure learning. However, the methods in bnlearn do require the data to be discrete.

Two suggestions for how to approach this:

  1. Discretize your data based on your domain knowledge, optionally in combination with other statistics. For example, for your gene expression profiles you could do a t-test against a control group and set a threshold (alpha of 0.05), with or without multiple-testing correction. This would give three states for each gene (up, baseline, down). If you don't have a control group, try fitting the data to a theoretical distribution (check out distfit) and make a cut at the 95% CI or so. Do this on both sides of the distribution and you would again have three states per gene. This comes close to the constraint-based approach: https://erdogant.github.io/bnlearn/pages/html/Structure%20learning.html#constraint-based

  2. Try using the built-in functionality of bnlearn to automatically discretize and create states based on the continuous expression profiles. This is again a starting point towards structure learning. See the documentation for more details.

https://erdogant.github.io/bnlearn/pages/html/Continuous%20Data.html
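The no-control-group variant of suggestion 1 can be sketched roughly as follows. This is a minimal illustration in plain pandas and numpy, not bnlearn or distfit functionality: it assumes each gene is approximately normal, and the helper name `three_state_discretize`, the gene names, and the `z = 1.96` cut-off are all hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd


def three_state_discretize(expr: pd.DataFrame, z: float = 1.96) -> pd.DataFrame:
    """Label each value as 'down', 'baseline', or 'up', column by column.

    Assumption: each column (gene) is roughly normal, so values outside
    mean +/- z * std (about the central 95% for z = 1.96) are flagged.
    """
    mean = expr.mean(axis=0)
    std = expr.std(axis=0, ddof=0)
    lower = mean - z * std
    upper = mean + z * std

    # Start with everything at 'baseline', then overwrite both tails.
    states = pd.DataFrame("baseline", index=expr.index, columns=expr.columns)
    states = states.mask(expr.lt(lower, axis=1), "down")
    states = states.mask(expr.gt(upper, axis=1), "up")
    return states


# Synthetic expression matrix: 200 samples x 3 hypothetical genes.
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.normal(size=(200, 3)), columns=["geneA", "geneB", "geneC"])
expr.loc[0, "geneA"] = 10.0   # clearly up-regulated sample
expr.loc[1, "geneB"] = -10.0  # clearly down-regulated sample

states = three_state_discretize(expr)
print(states.head())
```

The resulting frame contains only categorical states, so it could in principle be passed on to structure learning. For skewed data such as raw counts, a different theoretical distribution or a log transform would likely be a better fit than the normal assumption used here.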

There are no methods like iamb. However, check out what's available in pgmpy. If there is something that could help you, I am open to merging commits.

Asking questions makes you smart btw. Keep it up 👍🏻

@stevenagl12
Author

stevenagl12 commented Feb 21, 2024 via email

@erdogant
Owner

You are right. The second part does need a DAG at the start. Unfortunately, there is no other implementation yet.

@akshatakarjun

Hi,

By continuous biological data, did you mean continuous data like various numbers (e.g. 103.2, 102, 99, 2.5, etc.) or time-series data?
If it isn't either of these, could you please explain what the data you mentioned looks like?

Also, if it is different, is this package applicable to continuous data like the kind I mentioned above?

@stevenagl12
Author

stevenagl12 commented Jul 16, 2024 via email

@erdogant
Owner

If you would like a comparison with other causal packages, you can read it in my blog over here. The last time I checked, only CausalImpact can model continuous values, but that is for time-series data, so it is not applicable when you are using RNA-seq data.

@Loominarty

I also have a dumb question:

I have a dataset that mixes continuous and discrete data. I noticed that the bn.discretize function takes a lot of time (my dataset has roughly 11,000 rows and 9 columns, 4 of which are continuous).
Is it possible to discretize outside of bnlearn, or is that not compatible?

I tried using the pandas functions to circumvent the issue and generate Interval Indexes in my dataset, but with very little success.

@akshatakarjun

I'm unsure what kind of continuous data you have, but if possible you can manually put the values into discrete ranges. For example, if a feature called BloodPressure takes various values, we know which values of BP are considered normal, high, and low. You can use an if statement: if a value falls in a given range, replace it in those rows with the categorical value you want.

Just a thought!!
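The range-based idea above can also be written without an explicit if statement, using `pd.cut` with labels so that the result holds plain category names rather than Interval objects. This is only a sketch: the cut-offs of 90 and 130 and the sample values are placeholders chosen for illustration, not clinical thresholds.

```python
import pandas as pd

df = pd.DataFrame({"BloodPressure": [85, 110, 128, 145, 95]})

# Hypothetical cut-offs for illustration only; replace with domain knowledge.
bins = [-float("inf"), 90, 130, float("inf")]
labels = ["low", "normal", "high"]

# right=False makes each bin closed on the left: [-inf, 90), [90, 130), [130, inf).
df["BloodPressure_state"] = pd.cut(
    df["BloodPressure"], bins=bins, labels=labels, right=False
).astype(str)

print(df)
```

Casting to `str` leaves an ordinary column of state names, which tends to be easier to hand to downstream code than a categorical or interval dtype.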

@Loominarty

Hi @akshatakarjun ,

I found something that works alright but is not very convenient. I discretized outside of the library and used bn.df2onehot to encode the indexes as integers.
Then I just translate my new incoming data into one of these numbers.
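One way to make the "translate new incoming data" step repeatable is to keep the bin edges learned on the training data and reuse them. The sketch below uses only pandas and numpy and is not the bn.df2onehot workflow described above; the column name `x`, the choice of quartiles, and the helper `to_bin` are assumptions for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
train = pd.DataFrame({"x": rng.normal(size=1000)})

# Learn quartile edges once. labels=False returns integer codes 0..3
# instead of Interval objects; retbins=True also returns the edges.
codes, edges = pd.qcut(train["x"], q=4, labels=False, retbins=True)
train["x_bin"] = codes


def to_bin(values, edges):
    """Map values onto the training bins.

    Only the inner edges are used, so values below or above the
    training range fall into the first or last bin respectively.
    """
    inner = edges[1:-1]
    # right=True matches the right-closed intervals produced by qcut.
    return np.digitize(values, inner, right=True)


new_codes = to_bin(np.array([-100.0, 0.0, 100.0]), edges)
print(new_codes)
```

Storing `edges` alongside the fitted model means later data is encoded with exactly the same integer codes as the training set.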

@erdogant
Owner

erdogant commented Aug 2, 2024

You can indeed manipulate your data as you wish. The df2onehot function was included in bnlearn to provide one of the steps from start to results. So you are right, it brings some comfort, but at the same time it is generally slow.

@erdogant
Owner

erdogant commented Oct 8, 2024

I implemented LiNGAM methods (Direct and ICA) to model datasets with continuous variables (without discretizing!). See docs here.

Update to the latest version with:

pip install -U bnlearn

@erdogant
Owner

This functionality was added with the last update! Re-open if needed.


4 participants