Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Likely stupid question -> kmer u64 to kmer [u8] #14

Open
stela2502 opened this issue Feb 17, 2023 · 8 comments
Open

Likely stupid question -> kmer u64 to kmer [u8] #14

stela2502 opened this issue Feb 17, 2023 · 8 comments

Comments

@stela2502
Copy link

I have created u64 representations of a kmer like this:

for kmer in needletail::kmer::Kmers::new(seq, 32 as u8 ) {
     let km = Kmer::from(kmer).into_u64();

Is there a function to reverse that? To get from this u64 a kmer representation back?
I am quite sure there is and I am absolutely sure I will not find it ;-)
Please help!

I tried this, but it fails with an utf8 error:

std::str::from_utf8( &km.to_le_bytes()).unwrap().to_string()

I am sure it can be done and it should be simple. But I am not getting it ;-)

@stela2502
Copy link
Author

stela2502 commented Feb 17, 2023

Took me quite some time and a lot of ChatGPG 'talks' to get there but this is the way to go:

let km:u64 = 15561;
let mut data:String = "".to_string();
let kmer_size = 7; // otherwise you get the overhang on the end of the last u8 as 'A' in this example all 8th nucleotide would be 'A's
for u8_rep in &km.to_le_bytes(){
    if i < kmer_size{
         match u8_rep & 0x03{
               0b00000000 => data +="A",
               0b00000001 => data +="C",
               0b00000010 => data +="G",
               0b00000011 => data +="T",
                _ => data +="N",
         };
         i += 1;
    }
    if i < self.kmer_size{
         match u8_rep & 0x0C{
                0b00000000 => data +="A",
                0b00000100 => data +="C",
                0b00001000 => data +="G",
                0b00001100 => data +="T",
                 _ => data +="N",
          };
          i += 1;
    }
    if i < self.kmer_size{
          match u8_rep & 0x30{
                   0b00000000 => data +="A",
                   0b00010000 => data +="C",
                   0b00100000 => data +="G",
                   0b00110000 => data +="T",
                   _ => data +="N",
           };
           i += 1;
    }
    if i < self.kmer_size{
           match u8_rep & 0xC0{
                 0b00000000 => data +="A",
                 0b01000000 => data +="C",
                 0b10000000 => data +="G",
                 0b11000000 => data +="T",
                  _ => data +="N",
            };
            i += 1;
    }
}
print ("{}", data.to_string() )// is what you need then.

Would print CGATATT

likely not perfect, but it is working and I am glad.

@stela2502
Copy link
Author

fn u64_to_str (kmer_size:usize, km:&u64,  data:&mut String ){

    let mut i = 0;
    for u8_rep in km.to_le_bytes(){
        if i < kmer_size{
            match u8_rep & 0x03{
                0b00000000 => *data +="A",
                0b00000001 => *data +="C",
                0b00000010 => *data +="G",
                0b00000011 => *data +="T",
                _ => *data +="N",
            };
            i += 1;
        }
        if i < kmer_size{
            match u8_rep & 0x0C{
                0b00000000 => *data +="A",
                0b00000100 => *data +="C",
                0b00001000 => *data +="G",
                0b00001100 => *data +="T",
                _ => *data +="N",
            };
            i += 1;
        }
        if i < kmer_size{
            match u8_rep & 0x30{
                0b00000000 => *data +="A",
                0b00010000 => *data +="C",
                0b00100000 => *data +="G",
                0b00110000 => *data +="T",
                _ => *data +="N",
            };
            i += 1;
        }
        if i < kmer_size{
            match u8_rep & 0xC0{
                0b00000000 => *data +="A",
                0b01000000 => *data +="C",
                0b10000000 => *data +="G",
                0b11000000 => *data +="T",
                _ => *data +="N",
            };
            i += 1;
        }
        if i == kmer_size{
            break;
        }
   }
}

@rob-p
Copy link
Contributor

rob-p commented Feb 20, 2023

Hi @stela2502,

Sorry for the slow reply here. The other potential approach would be to use the bitmer_to_bytes function, which returns a Vec<u8> which can then be interpreted / converted to a string of the nucleotides. One benefit of your approach, however, is that the owned String argument is passed in, and so can be reused between many subsequent calls. It probably make sense to have a variant of bitmer_to_bytes that also takes an owned Vec<u8> argument as well for the same reason.

@stela2502
Copy link
Author

Thank you for this input. Even with the bitmer_to_bytes function I would have ended up with my problem to convert bits to sting. Do you know if the usage of an external &[u8;8] array would actually make a difference in the compiled version? Regarding the huge speed improvement between debug and release version of the program I assume the compiler is already cleaning out a lot of these problems. In addition I only actually need this function for error messages.

Would it make sense to actually add this functionality to the kmers library? It does not actually use any kmer object. But possibly in the utilities? Just grab the code if you think it makes sense. Most of the intellectual work was done by ChatGPT anyhow...

@theJasonFan
Copy link
Member

theJasonFan commented Feb 27, 2023

Hmm, @stela2502 what is the specific use case and type conversion(s) you would like?

Regarding speed / usage: if you only need kmers of length <= 32 then naive_impl::kmers is fast and has an easy into_u64() function that simply returns the internal u64 representation used for "naive" kmers.

Regarding error messages: Kmer derives Debug so you can always print the internal representation with something like:println!("{:?}", km) for some kmer km.

Hope this helps!

@stela2502
Copy link
Author

Hi @theJasonFan. The specific case was to check how my gene matching function actually matches sequences. As all the matching is performed on basically bits level it was REALLY hard for me to grasp what is going on. Being a bioinformatician I do not transcode u64 to DNA in my brain ;-).
By the way - do you know of an efficient matching algorithm for sequences in Rust? What I have seams to not be giving the results I want and alevin-fry does not seam to actually do the sequence matching in Rust as this seams to be done by salmon and that is cpp... What I am really hunting for is a genome wide matching approach as salmon does.

@theJasonFan
Copy link
Member

Would calling bitmer_to_bytes() then [std::str::from_utf8](https://doc.rust-lang.org/std/str/fn.from_utf8.html) work for your use case?

We do something like this in the tests where we compare kmer representations to a sequence of bytes:

assert_eq!(b"GTAC".to_vec(), s);

@stela2502
Copy link
Author

Hehehe, that is exactly what I had expected, but as I stated before (in my first post) I was unable to find the function.
More documentation please :-D
And for me as a bioinformatician it was hard to understand the function name 'bitmer_to_bytes'. I did neither get that it would 'digest' u64 nor that it would give back Utf8... The function is so much more beautiful than my hack ;-).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants