Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with function load_rhb_at_positions_no_NAs #93

Open
jamonterotena opened this issue Jan 22, 2024 · 4 comments
Open

Error with function load_rhb_at_positions_no_NAs #93

jamonterotena opened this issue Jan 22, 2024 · 4 comments

Comments

@jamonterotena
Copy link

jamonterotena commented Jan 22, 2024

I got the following error when executing STITCH with reference files:

Error in which((colClasses == "integer") & (h_row_1 != "-")) :
  dims [product 200] do not match the length of object [832]
Calls: STITCH ... extract_validate_and_load_haplotypes -> load_rhb_at_positions_no_NAs -> which
In addition: Warning message:
In (colClasses == "integer") & (h_row_1 != "-") :
  longer object length is not a multiple of shorter object length
Execution halted

It comes from the following function of the script reference-binary.R:

    ## first initialization
    ##
    ## want non-NULL values
    ## determine columns to get
    h_row_1 <- read.table(reference_haplotype_file, nrow = 1)
    if (!is.na(colClasses[1])) {
        haps_to_get <- which((colClasses == "integer") & (h_row_1 != "-")) - 1
    } else {
        haps_to_get <- which(h_row_1 != "-") - 1
    }

The first row of the reference hap file looks like this:

0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 1* 1* 0* 0* 1* 1* 0* 0* 1* 1* 0* 0* 0* 0* 1* 1* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 1* 1* 0* 0* 1* 1* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 1* 1* 0* 0* 1* 1* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 1* 0* 0* 0* 0* 1* 1* 0* 0* 1* 1* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 1* 1* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0* 0*

Notice there are 200 haplotypes for 100 founders. I suppose STITCH is taking only the 16x2 haplotypes from each corresponding specified 16way population.

What could be causing the issue? I find strange that the colClasses vector has 832 elements.

Jose

@jamonterotena
Copy link
Author

Hi again,

The problem is that the founders can belong to different 16way populations, so they were duplicated in the ID column of the .SAMPLES file, that contained 416 rows for the founders instead of 100. I wanted to ask you if it's possible to indicate that a founder ID can belong to two or more populations. Otherwise I will just split the .SAMPLES by families.

Best,
Jose

@rwdavies
Copy link
Owner

Thanks for the message, and identifying the problem, I've added a check to the code that should throw an error for that in the future

The only use of the reference sample file (alongside the reference haplotypes and legend file) is to exclude certain populations. Let's say you had prepared a file using all 1000 Genomes samples (from multiple populations), and in one run you wanted to do only Europeans, and another only East Asians, or something like that

Note STITCH can use reference haplotypes in one of three ways
Way 1 - if K matches the number of provided reference haplotypes, it will use them directly
Way 2 - if K is > number of reference haplotypes, it will fill the other haplotypes (ancestral ones used in STITCH) with noise
Way 3 - if K < number of reference haplotypes, it will initialize with some of the reference haplotypes (K of them), and then run a haploid version of EM to refine that estimate, using all samples

@rwdavies
Copy link
Owner

PS there are no actual "*" in the file right? The haps file? It should be only 0 and 1 (and spaces and newlines / enters)

@jamonterotena
Copy link
Author

bcftools convert -h -Oz -o reference_${chr} -S samples fbjcvars_${chr}.vcf.gz #Convert vcf with all founders
Produced those haplotype files. I read it means that the allele is unphased. I don't know how to turn that off, so I will just remove the asterisks from the hap files with a script.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants