Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add cores arg to methSeg #120

Open
wants to merge 13 commits into
base: master
Choose a base branch
from
Open

add cores arg to methSeg #120

wants to merge 13 commits into from

Conversation

katwre
Copy link
Contributor

@katwre katwre commented Jun 28, 2018

Hi guys,
I added parallel methSeg() with methylDB objects and non-tabix objects.
I separated code for running fastseg and mclust into two auxiliary function .run.fastseg and .run.mclust. Parallelization is in the step of fastseg and each fastseg run is concatenated. For that, I added to return.type in applyTbxByChr "GRanges" to concatenate GRanges.
I wrote some tests, but I think I maybe should add more of them
Let me know what do you think about it
Kasia

katwre and others added 9 commits June 18, 2018 15:40
- changed parameter name from `estimate.params.density` to `initialize.on.subset`
- allow for values higher than 1, which directly yields number of samples
- priorize Mclust argument `initialization`
- check for to few samples, 9 seems to be the magic number
- add tests
Remove duplicate check for sample size.
Copy link
Collaborator

@alexg9010 alexg9010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Job! @katwre

expect_equal(a,b)
})

test_that("check if methSeg with cores > 1 is the same as cores=1 (non-tabix file)" ,{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the same test as before.

expect_equal(a,b)
})

methylRawDB.obj <- methRead(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a tabix based object?
I think you have to set save.db=TRUE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I am correct it's dbtype="tabix"

R/methSeg.R Outdated
gr0 = gr0[,"meth"]
}else if("meth.diff" %in% names(mcols(gr0))){
gr0 = gr0[,"meth.diff"]
}else if (class(obj) != "GRanges"){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary since you are already checking above for the class.

Copy link
Contributor Author

@katwre katwre Jun 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right!

R/methSeg.R Outdated
# match argument names to fastseg arguments
args.fastseg=dots[names(dots) %in% names(formals(fastseg)[-1] ) ]
initialize.on.subset=1,
cores=1, ...){
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cores should be mc.cores to keep argument names consistent with other methylKit functions

seg.res <- do.call("fastseg", args.fastseg)

# stop if segmentation produced only one range
if(length(seg.res)==1) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this if clause be called after fastseg and before calling mclust?

Copy link
Contributor Author

@katwre katwre Jun 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think I know what you mean, this if clause is here in the original code and you return seg.res if there is only 1 segment

return(seg.res)
but since I separated lines fo code for calling mclust into an auxiliary function, I need to make sure that this function ( .run.mclust() ) won't be run if there is only 1 segment..

R/methSeg.R Outdated
# methylKit naming convention
df2getcolnames = as.data.frame(gr0[1])
df2getcolnames$width = NULL
methylKit:::.setMethylDBNames(df2getcolnames)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this line ?
methylKit:::.setMethylDBNames(df2getcolnames)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is used to predict the column names of the given data.frame.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we do not need the methylKit:::!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry! my fault

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I got confused, I thought we are resetting names on the tabix files but this doesn't do that, right ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, actually the data.frame that we get from the tabix file does not have column names and with that function we retrieve them.

R/methSeg.R Outdated
## Tabix files
} else if(class(obj)=="methylDiffDB" | class(obj)=="methylRawDB"){

.run.fastseg.tabix = function(gr0, ...){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest a function which actually takes the class of the object as argument:

.run.fastseg.tabix = function(gr0, class ,...) {

### and then you can directly set the colnames
.setMethylDBNames(df2getcolnames,class)
}

R/methSeg.R Show resolved Hide resolved
@katwre
Copy link
Contributor Author

katwre commented Jun 29, 2018

I improved the code according to your suggestions besides @al2na suggestion about the if clause #120 (diff)

@al2na
Copy link
Owner

al2na commented Jun 29, 2018

could we also comment the code wherever possible, please think about people who will maintain this in the future or your future selves. Certain things that are trivial are not going to be trivial after 3 months of not looking at the code.

@katwre
Copy link
Contributor Author

katwre commented Jul 2, 2018

@al2na I added more comments, hope it's better now

@katwre
Copy link
Contributor Author

katwre commented Jul 2, 2018

there is something wrong when join.neighbours=TRUE and initialize.on.subset!=1, I am checking it

@katwre
Copy link
Contributor Author

katwre commented Jul 4, 2018

I checked if with methylRawDB and multiple cores is faster than using methylRaw object on example of data with ~350K Cs (two chromosomes) and methylRaw is faster. I don't know why. Maybe it depends on the size of the input, I will check that

b <- benchmark(methylRaw.cores.1 =methSeg(obj.methylraw, diagnostic.plot = F, join.neighbours = FALSE),
               methylRaw.cores.2 =methSeg(obj.methylraw, diagnostic.plot = F, join.neighbours = FALSE, cores=2),
               methylRawDB.cores.1 = methSeg(obj, diagnostic.plot = F, join.neighbours = FALSE),
               methylRawDB.cores.2 = methSeg(obj, diagnostic.plot = F, join.neighbours = FALSE, cores=2),
               replications=5,
               columns=c('test', 'replications', 'elapsed'))
> print(b)
                test replications elapsed
1   methylRaw.cores.1            5  39.026
2   methylRaw.cores.2            5  38.495
3 methylRawDB.cores.1            5  46.146
4 methylRawDB.cores.2            5  45.640

@al2na
Copy link
Owner

al2na commented Jul 4, 2018 via email

@katwre
Copy link
Contributor Author

katwre commented Jul 16, 2018

I checked it using 5 chromosomes and it's not better.

> myRaw
methylRaw object with 3784497 rows
--------------
  chr   start     end strand coverage numCs numTs
1 chr21 9411552 9411552      +       45    12    33
2 chr21 9411553 9411553      -       70    27    43
3 chr21 9411784 9411784      +       31     4    27
4 chr21 9411785 9411785      -       46    12    34
5 chr21 9412099 9412099      +       26    15    11
6 chr21 9412100 9412100      -       35    16    19
--------------
  sample.id: id 
assembly: assembly 
context: CpG 
resolution: base 

library(rbenchmark)

b <- benchmark(methylRaw.cores.1 =methSeg(myRaw, diagnostic.plot = F, join.neighbours = FALSE),
               methylRaw.cores.5 =methSeg(myRaw, diagnostic.plot = F, join.neighbours = FALSE, mc.cores=5),
               methylRawDB.cores.1 = methSeg(mymethylRawDB, diagnostic.plot = F, join.neighbours = FALSE),
               methylRawDB.cores.5 = methSeg(mymethylRawDB, diagnostic.plot = F, join.neighbours = FALSE, mc.cores=5),
               replications=3,
               columns=c('test', 'replications', 'elapsed'))
> print(b)
test replications elapsed
1   methylRaw.cores.1            3 257.613
2   methylRaw.cores.5            3 259.420
3 methylRawDB.cores.1            3 295.970
4 methylRawDB.cores.5            3 297.785

thanks @alexg9010 for the suggestion to use profvis, but it didnt work for me, I got an error that I didn't what to do with. I used profmem instead and it showed that memory usage when there are parallel cores is smaller than without using multiple cores.

methylRaw.cores.1 = 47888 bytes 
methylRaw.cores.5 = 39656 bytes
methylRawDB.cores.1 = 151380128 bytes
methylRawDB.cores.5 = 121104112 bytes

@katwre
Copy link
Contributor Author

katwre commented Aug 13, 2018

@al2na @alexg9010 I didn't manage to show that this method is faster. Should we close this pull request?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

3 participants