Releases: choishingwan/PRSice
Temporary Fix
Update Log
This is a temporary fix while we restructure PRSice for unit testing and improved extensibility. Unfortunately, this fix stretched over a rather long period of time, so I might not have an accurate log of changes. Here is what I remember:
- Fix --perm. --perm should now run
- --prevalence should now provide the PRS.R2.adj information correctly
- We changed the calculation of PRS.R2 when covariates were provided. Previously, it was calculated as Full.R2 - Null.R2. Now it is calculated as 1 - (1 - Full.R2) / (1 - Null.R2)
- Some slight changes to code such that printing full score matrix should be faster when we have enough memory
- Some attempts to reduce memory usage for the bgen format, with mixed success (it still requires quite a lot of memory)
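To make the PRS.R2 change listed above concrete, here is a minimal Python sketch of the two formulas quoted in this log. The function names and the example R2 values are made up for illustration; this is not PRSice's own code.

```python
def r2_difference(full_r2, null_r2):
    """Earlier calculation: plain difference between full and null model R2."""
    return full_r2 - null_r2


def r2_rescaled(full_r2, null_r2):
    """Current calculation as quoted in the log: 1 - (1 - Full.R2) / (1 - Null.R2)."""
    if null_r2 >= 1.0:
        # The null model already explains everything, so the ratio is undefined.
        raise ValueError("Null.R2 must be below 1")
    return 1.0 - (1.0 - full_r2) / (1.0 - null_r2)


# Hypothetical values: covariates alone explain 20%, covariates + PRS explain 30%.
full_r2, null_r2 = 0.30, 0.20
old_value = r2_difference(full_r2, null_r2)
new_value = r2_rescaled(full_r2, null_r2)
print(f"difference: {old_value:.3f}, rescaled: {new_value:.3f}")
```

With these made-up inputs the rescaled value is larger than the plain difference, because it expresses the PRS contribution relative to the variance left unexplained by the covariates.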
Known unfixed issue
- bgen would not work without --allow-inter
- bgen requires more memory than it should
- For some reason, we have to manually modify the Eigen library for Windows compilation. So there might be potential for bugs in the Windows build.
I will try to fix those whenever I have time, but I will mainly focus on the restructuring of PRSice.
BGEN sample selection bug fix
Update Log
- Thanks to a report from @charlisech, we were able to pinpoint a bug related to sample selection when using bgen data.
Minor Bug Fix
Update Log
- Fix off by one error in PRSet best score output
- Fix problem for bgen file when sample selection is performed on bgen files containing sample information
Update bug fix
- Update the Rscript so that it matches the features in the executable (thus avoiding problems in plotting)
- Fix a bug where PRSice will crash when there are missing covariates
Major Release - Increased unit testing
Update Log (2020-05-21)
- The previous bug fix resolved the problem for --no-regress, but caused all normal PRSice runs to fail.
Update Log (2020-05-19)
- Fix output error where we always reported that 0 valid phenotypes were included for continuous traits
- Fix problem with permutation where PRSice would crash if the input was rank deficient
- Fix problem where, when a binary phenotype file was provided together with a fam file containing -9 as phenotype, PRSice wrongly stated that no phenotype was present
- Fix problem in the Rscript where, if the sample ID is numeric and starts with 0, the best file would not merge with the phenotype file, causing 0 valid PRS to be observed
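The leading-zero problem in the last item is a general failure mode when one table has its IDs parsed as numbers and the other as text. The Python sketch below only illustrates that failure mode with made-up data and a hypothetical helper (`read_table`); it is not the Rscript's code.

```python
import csv
import io

# Made-up contents standing in for a "best" file and a phenotype file.
best_text = "FID IID PRS\n0012 0012 0.5\n0340 0340 -1.2\n"
pheno_text = "FID IID Pheno\n0012 0012 1\n0340 0340 0\n"


def read_table(text, id_as_number):
    """Index rows by IID, optionally converting the ID to an integer."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text), delimiter=" "):
        key = int(row["IID"]) if id_as_number else row["IID"]
        rows[key] = row
    return rows


# IDs parsed as numbers in one table lose their leading zeros (0012 -> 12),
# so they no longer match the text IDs in the other table.
mismatched = set(read_table(best_text, True)) & set(read_table(pheno_text, False))

# Keeping the IDs as text in both tables preserves the match.
matched = set(read_table(best_text, False)) & set(read_table(pheno_text, False))

print(len(mismatched), len(matched))
```

Treating sample IDs as text everywhere is the usual way to avoid this class of problem.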
Update Log
- We now support multi-threaded clumping (separated by chromosome)
- Genotypes will be stored to memory during clumping (increase memory usage, significantly speed up clumping)
- Will only generate one .prsice file for all phenotypes
- .prsice file now has an additional column called "Pheno"
- Introduced --chr-id, which generates an rs ID based on a user-provided formula
- Format of --base-maf and --base-info is now changed to <name>:<value> from <name>,<value>
- Fix a bug related to ambiguous allele dosage flipping when --keep-ambig is used
- Better mismatch handling. For example, if your base file only provides the effective allele A without the non-effective allele information, PRSice will now do dosage flipping if your target file has G/C as the effective allele and A/T as the non-effective allele (whereas previously this SNP would be considered a mismatch)
- Fix bug in 2.2.13 where PRSice won't output the error message during command parsing stage
- If the user provides the --stat information, PRSice will now error out instead of trying to look for BETA or OR in the file.
- PRSice should now better recognize if the phenotype file contains a header
- Various small bug fixes
New Compilation and Unit Testing
Update Log
- Implement unit testing for command parsing module
- command parsing should now be more consistent and should be less likely to be source of bugs
- Now allow the use of --a1 and --a2 instead of --A1 and --A2 to save one shift click
- Can now properly handle bgen files with phasing information
- Re-implement code for covariate parsing. --cov-factor and --cov-col should now have a more well defined behaviour
- --full-back no longer requires an argument (the expected behaviour)
- Fixed the default distance for --clump-kb; the default was Mb instead of Kb (only affects version 2.2.12)
- Correctly capture negative values in --binary-target
- Correctly capture out of bound p-values and other parameters
- Use of --memory will no longer error out PRSice unexpectedly
- Behaviour change for --keep-ambig: Previously, when --keep-ambig was set, PRSice would keep all ambiguous SNPs and would not perform any form of flipping, e.g. strand flipping A/C to T/G or dosage flipping A/C to C/A. Now, when --keep-ambig is set, PRSice will perform dosage flipping but NOT strand flipping, i.e. Base = A/T, Target = T/A, change Target dosage from 0,1,2 for T to 0,1,2 for A. You should only really use --keep-ambig if you are certain that the strand information between your base and target data is identical
- Format for --base-info and --base-maf is changed to use : instead of , as the separator
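The difference between dosage flipping and strand flipping in the --keep-ambig item above can be sketched in Python as follows. This is an illustration under stated assumptions, not PRSice's implementation: the function names are hypothetical, alleles are given as (effective, non-effective) pairs, dosages count the target's effective allele and are assumed to be 0, 1 or 2 with no missing values, and ambiguous SNPs are dropped when `keep_ambig` is off.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(a1, a2):
    """A SNP is strand-ambiguous when its two alleles are complementary (A/T, C/G)."""
    return COMPLEMENT[a1] == a2


def align_dosage(base, target, dosages, keep_ambig=False):
    """Return dosages counting the base effective allele, or None if the SNP is unusable."""
    if is_ambiguous(*target):
        if not keep_ambig:
            return None  # assumption: ambiguous SNPs are excluded unless kept explicitly
        if target == base:
            return list(dosages)
        if target == base[::-1]:
            # Dosage flipping only: swap which allele is counted (0,1,2 -> 2,1,0).
            return [2 - d for d in dosages]
        return None  # no strand flipping is attempted for ambiguous SNPs
    # Non-ambiguous SNP: try the alleles as given, then their strand complement.
    flipped_strand = tuple(COMPLEMENT[a] for a in target)
    for candidate in (target, flipped_strand):
        if candidate == base:
            return list(dosages)
        if candidate == base[::-1]:
            return [2 - d for d in dosages]
    return None  # alleles cannot be reconciled: treat as a mismatch


# Example from the log: Base = A/T, Target = T/A, with --keep-ambig set.
ambig_kept = align_dosage(("A", "T"), ("T", "A"), [0, 1, 2], keep_ambig=True)
print(ambig_kept)
```

For an ambiguous SNP a strand flip and an allele swap look identical, which is why the log warns that --keep-ambig is only safe when both data sets are known to be on the same strand.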
Bug fix and update
Update Log
- We have fixed some problems observed in the beta version of 2.2.12
  a. Clumping now functions as expected
  b. Standard error calculated should now always be correct
- Fix problem where PRSice doesn't honor the --model setting
- Fix INFO score and MAF filtering in the base
- Fix output of --no-regress. --no-regress should now also generate a *.prsice file which contains the number of SNPs included in the PRS
- Fix problem related to set-based permutation
  - We also now drastically speed up set-based permutation when the --ultra option is used (requires more memory).
- PRSice should now be able to handle special characters in the base file
- Add --num-auto. Users can now change the number of autosomes in their samples (Note: we assume all autosomes to be diploid)
- Add --keep-ambig-as-is. When set, ambiguous SNPs that were kept will never be flipped. This should allow for slightly better control for the user
- Completely remove --pearson as we don't have the manpower to maintain this feature
- Also remove --enable-mmap as that doesn't help too much
Note
- Windows builds are completely failing and I have no idea why. We will try to figure out the problem but it is unlikely that it will be anytime soon. As a result, the Windows build will be unavailable until further notice.
- Currently, due to the flexibility of PRSice, there is a large number of functions that need to be tested before we can be confident that PRSice works as expected. Therefore, I am hoping that until we complete the unit tests for all features of PRSice (which is extremely time consuming, and our current coverage is less than 5%), we will not add any new features.
Fixing Standardization
Update Log
- We have fixed the problem where the parameters --score, --missing and --model were not honored by PRSice
- In addition, --score con_std should now work as expected (standardizing among controls only)
- Fix problem where PRSice didn't automatically remove invalid covariates.
Update (Nightly build)
- Fix memory and distance unit parsing
Note
- Standardization and Control standardization will only calculate the mean and sd based on samples with valid phenotype and covariates (i.e. samples included in the regression model)
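As an illustration of the control-only standardization described above (--score con_std), here is a minimal Python sketch. It assumes the inputs are already restricted to samples with valid phenotype and covariates, that controls are coded 0 and cases 1, and that the sample standard deviation (n - 1 denominator) is used; the function name is hypothetical and none of these details are confirmed by this log.

```python
from statistics import mean, stdev


def standardise_by_controls(scores, phenotypes):
    """Centre and scale all scores using the mean and SD of the control samples only."""
    controls = [s for s, p in zip(scores, phenotypes) if p == 0]
    if len(controls) < 2:
        raise ValueError("need at least two controls to estimate an SD")
    mu = mean(controls)
    sd = stdev(controls)  # assumption: sample SD with an n - 1 denominator
    if sd == 0:
        raise ValueError("control scores have zero variance")
    return [(s - mu) / sd for s in scores]


# Made-up scores: three controls (0) and one case (1).
scores = [1.0, 2.0, 3.0, 5.0]
phenotypes = [0, 0, 0, 1]
standardised = standardise_by_controls(scores, phenotypes)
print(standardised)
```

Cases are transformed with the controls' mean and SD, so their standardized scores are expressed relative to the control distribution.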
Quick Fix on Permutation
Update Log
- In 2.2.10, there was a bug which caused 0 permutations to be performed even when --perm or --set-perm was set to a higher number. This is now fixed
- Introduced --score con_std, which performs standardization only in control samples (it was introduced in 2.2.10, but there were some serious bugs that were only fixed in 2.2.10.b. Note: This is untested)
Regression, Set based clumping and Refactoring
Update Log
- Refactored almost the whole code base to make the code cleaner and easier to read, thus hopefully reducing the number of bugs etc. (the code related to permutation has not been refactored)
- A bug was found in set-based clumping. When more than 62 sets (or 30 on a 32-bit machine) were provided, only the last few sets were properly clumped, possibly leaving some correlated SNPs in earlier sets
- New glm algorithm for PRSice was sensitive to collinearity and can give very different result when compared to those calculated from R. This problem is now fixed
- Problem regarding --target-list in the Rscript is now fixed
- Some changes to the log to make things a bit clearer
- Add some more unit tests
- Fix problem when bgen files are used for --ld, where the sample size can be wrong when no external sample is provided and a phenotype file is provided for the target.
Manually tested feature
PRSice has a lot of functionality and I have not been able to write unit tests for most of the features (currently unit test coverage is less than 20%). The following features were tested manually using some toy data:
- Binary PLINK input
  - Clumping should generate identical results as PLINK 1.9
  - PRS calculation should be identical to PLINK 1.9 (after considering flipping and when using the same input)
  - MAF filtering should be identical to that calculated in PLINK (when there's no founder)
  - Genotype missingness calculated should be identical to that calculated in PLINK
  - Clumping with a reference panel should generate identical results as PLINK
  - Filtering on the LD reference panel works as expected
- Binary GEN input
  - Clumping should generate identical results as PLINK 1.9 (it doesn't matter whether we use --hard or --allow-inter)
  - Automatic hard coding (--hard) should generate identical PRS as those calculated using PLINK
  - PRS calculated using dosage scoring (without --hard) are highly correlated with those generated with --hard
  - Geno filtering and MAF filtering on the target sample worked as expected.
  - Geno filtering and MAF filtering on the reference sample worked as expected.
Things that we have not tested
- We have not tested data with founder samples
- We have only used the default --missing and --score parameters for our testing
- All permutation algorithms (--perm and --set-perm)
- Only tested the default genetic model (additive) of the --model parameter
- The Windows compilation is not tested
- --no-regress and --all-score were also not tested, but should in theory be ok
Features that might be problematic (use with caution)
- The INFO scores calculated for --info filtering differ from those calculated by qctool (correlated, but in some situations they differ quite a lot). We have contacted the author of qctool to see if that's an algorithmic difference or if there's a bug in PRSice (we have tried to follow the algorithm in the MaCH paper and, in our manual testing, the numbers calculated by PRSice and those calculated manually are identical)
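For readers who want to check an INFO value by hand, the sketch below shows the commonly cited MaCH-style r-squared: the observed variance of the allele dosages divided by the variance expected from the estimated allele frequency, 2p(1 - p). Whether PRSice computes exactly this quantity is an assumption based on the note above, and the function name and example dosages are hypothetical.

```python
def mach_rsq(dosages):
    """Ratio of observed dosage variance to the expected binomial variance 2p(1 - p)."""
    n = len(dosages)
    if n == 0:
        return None
    mean_dosage = sum(dosages) / n
    p = mean_dosage / 2.0  # estimated frequency of the counted allele
    expected_var = 2.0 * p * (1.0 - p)
    if expected_var == 0:
        return None  # monomorphic SNP: the ratio is undefined
    observed_var = sum(d * d for d in dosages) / n - mean_dosage ** 2
    return observed_var / expected_var


# Made-up dosages for four samples.
certain = mach_rsq([0.0, 1.0, 2.0, 1.0])    # hard calls in Hardy-Weinberg proportions
uncertain = mach_rsq([0.5, 1.0, 1.5, 1.0])  # dosages shrunk towards the mean
print(certain, uncertain)
```

Dosages that are shrunk towards the mean have a smaller observed variance, so the ratio falls below 1 as imputation certainty drops.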
Note
I try my best to test run as many features as possible and am trying to implement as many unit tests as possible. However, the lack of manpower means that there will always be features that I missed or things that are not thoroughly tested.
Please let us know if there are any problems or if PRSice didn't generate the expected results.