Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multivariate GEMMA hangs forever #243

Open
biona001 opened this issue Mar 20, 2021 · 6 comments
Open

Multivariate GEMMA hangs forever #243

biona001 opened this issue Mar 20, 2021 · 6 comments
Assignees
Labels

Comments

@biona001
Copy link

On a simulated dataset with 1000 samples, 10000 SNPs, and 2 traits, multivariate GEMMA ran for over an hour. Is this reasonable?

My commands are:

# generate relatedness matrix
gemma -bfile multivariate_2traits -gk 1 -o multivariate

# run gemma
gemma -bfile multivariate_2traits -k output/multivariate.cXX.txt -maf 0.0000001 -lmm 4 -n 1 2 -o gemma.polygenic.result

Output: (the 100% shows up almost immediately, but hangs forever and never displays the "done").

GEMMA 0.98.4 (2021-01-29) by Xiang Zhou and team (C) 2012-2021
Reading Files ...
## number of total individuals = 1000
## number of analyzed individuals = 1000
## number of covariates = 1
## number of phenotypes = 2
## number of total SNPs/var        =    10000
## number of analyzed SNPs         =    10000
Start Eigen-Decomposition...
REMLE estimate for Vg in the null model:
23.5963
6.2719	23.8002
se(Vg):
4.4394
2.7440	5.4686
REMLE estimate for Ve in the null model:
2.6355
0.0002	0.0002
se(Ve):
1.3386
0.8088	1.6734
REMLE likelihood = -4988.2498
MLE estimate for Vg in the null model:
25.3766
6.2899	23.8008
se(Vg):
5.8054
0.8981	1.0649
MLE estimate for Ve in the null model:
2.0757
0.0011	0.0000
se(Ve):
1.7524
0.0009	0.0000
MLE likelihood = -4981.3187
================================================== 100%

If I analyze the traits separately, the analysis finishes in ~5 seconds. I assume this is a bug but I'm not sure.

For reproducibility, my test PLINK files are zipped and attached.

multivariate.zip

@biona001
Copy link
Author

So I left the program running overnight, and it finished after 176 minutes. I am still suspicious it took so long with such small data, but I suppose this is not a bug. I will ask on the Google group instead.

FWIW, the output is

## REMLE estimate for Vg in the null model:
23.5963
6.27189 23.8002
## se(Vg):
4.43943
2.74395 5.46859
## REMLE estimate for Ve in the null model:
2.63552
0.00017173      0.000237989
## se(Ve):
1.33862
0.808768        1.67335
## MLE estimate for Vg in the null model:
25.3766 6.28991
6.28991 23.8008
## se(Vg):
5.80539
0.898146        1.06494
## MLE estimate for Ve in the null model:
2.07569 0.00105176
0.00105176      5.43669e-07
## se(Ve):
1.75241
0.000879422     4.27974e-07
## estimate for B (d by c) in the null model (columns correspond to the covariates provided in the file):
0.506032
1.00318
## se(B):
0.0513373
0.000487841
##
## Computation Time:
## total computation time = 176.405 min
## computation time break down:
##      time on eigen-decomposition = 0.0822538 min
##      time on calculating UtX = 0.0785359 min
##      time on optimization = 176.135 min
##

@biona001
Copy link
Author

I did 100 more simulations, and 25/100 simulations ran for over 1000 seconds. In contrast, the other 75 simulations usually run in 30-60 seconds.

I am reopening this issue, but feel free to close it.

@biona001 biona001 reopened this Mar 26, 2021
@pjotrp pjotrp added the bug label Aug 14, 2021
@pjotrp pjotrp self-assigned this Aug 14, 2021
@pjotrp
Copy link
Member

pjotrp commented Aug 15, 2021

Thank you for your data file. Not sure why timings are so divergent. Much of the time is spent in:

#4  0x0000000000480f62 in MphEM (func_name=func_name@entry=82 'R', max_iter=1000, max_prec=0.001,
    eval=eval@entry=0x5233c0, X=X@entry=0x540930, Y=Y@entry=0x5408d0, U_hat=U_hat@entry=0x53e380,
    E_hat=0x53e3e0, OmegaU=0x53e440, OmegaE=0x53e4a0, UltVehiY=0x53e500, UltVehiBX=0x53e560,
    UltVehiU=0x53e5c0, UltVehiE=0x53e620, V_g=0x540990, V_e=0x540a20, B=0x540ab0) at src/mvlmm.cpp:704

We need to look into optimizing the mvlmm code - there should be ample opportunity.

@biona001
Copy link
Author

biona001 commented Aug 15, 2021

Sorry I forgot about this issue.
FWIT, I think the main problem might be how I did the simulation. I believe it was something like

b ~ N(0, 1) # for k position of b only, others are 0
yi ~ N(xi^T * b, 1)

where yi is length 2 vector of phenotype, xi is length 10000 genotype vector (entries 0, 1 or 2), and b is length 10000 effect size vector but only k position of b is drawn from standard Normal. This might be a poor simulation model for GEMMA. Later I changed the simulation model and the divergence problem went away.

@pjotrp
Copy link
Member

pjotrp commented Aug 15, 2021

It is good to have edge cases. I'll see what I can do with it :)

@pjotrp
Copy link
Member

pjotrp commented Aug 16, 2021

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants