Cloud compatibility #85

Open
Farkal opened this issue Aug 3, 2020 · 6 comments

Comments

@Farkal
Member

Farkal commented Aug 3, 2020

I'm opening this issue to discuss the best way to support cloud compatibility.

What does that mean? Simply that we should be able to decode the header step by step. Because the IFDs can be anywhere in the file, we should be able to decode them one by one and supply the data to decode each time.

Example:

  • I get the first 1024 bytes of a file
  • Get the first IFD and decode it (here we could keep the current init method that decodes the header and the first IFD)
  • If there is another IFD and it's outside those 1024 bytes, I should be able to call next_image(new_bytes) and get an error if we can't decode the next IFD (better would be to report the size we need = 2 + the entry count at the beginning of the IFD * 12, as sketched below)

If there were a way to dynamically add data to a cursor and fake the position, we could already have something working, but I didn't find any way to do that. I'll wait for your advice on this use case before making a PR 😉
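
For concreteness, here is a minimal sketch of that size calculation for classic (non-Big) TIFF; ifd_bytes_needed is a hypothetical helper, not part of the current API:

// Given the first two bytes of an IFD (the entry count), compute how many
// bytes the whole IFD occupies, so the caller knows how much to fetch:
// a 2-byte entry count, 12 bytes per entry, then the 4-byte next-IFD offset.
fn ifd_bytes_needed(entry_count_bytes: [u8; 2], little_endian: bool) -> u64 {
    let entry_count = if little_endian {
        u16::from_le_bytes(entry_count_bytes)
    } else {
        u16::from_be_bytes(entry_count_bytes)
    };
    2 + entry_count as u64 * 12 + 4
}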

@fintelia
Contributor

fintelia commented Aug 3, 2020

I'm not sure if there is a crate for it, but it should be possible to make your own struct that takes a reader and implements Read + Seek by buffering all data that is read. Something like:

use std::io::{self, Read, Seek, SeekFrom};

struct BufferCursor<R: Read> {
    r: R,
    data: Vec<u8>,
    cursor: usize,
}
impl<R: Read> Read for BufferCursor<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Read data into the buffer until it covers the cursor position.
        while self.cursor >= self.data.len() {
            let mut chunk = [0u8; 4096];
            match self.r.read(&mut chunk)? {
                0 => return Ok(0), // EOF before reaching the cursor
                n => self.data.extend_from_slice(&chunk[..n]),
            }
        }
        // Copy data from self.data into buf.
        let len = buf.len().min(self.data.len() - self.cursor);
        buf[..len].copy_from_slice(&self.data[self.cursor..self.cursor + len]);
        self.cursor += len;
        Ok(len)
    }
}
impl<R: Read> Seek for BufferCursor<R> {
    fn seek(&mut self, pos: SeekFrom) -> io::Result<u64> {
        // Update cursor, but don't modify self.data or self.r.
        self.cursor = match pos {
            SeekFrom::Start(n) => n as usize,
            SeekFrom::Current(d) => (self.cursor as i64 + d) as usize,
            SeekFrom::End(_) => return Err(io::Error::new(io::ErrorKind::Other, "SeekFrom::End unsupported")),
        };
        Ok(self.cursor as u64)
    }
}
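
Usage would then be roughly (a sketch, assuming the struct above is filled in; Decoder::new is the crate's existing constructor, get_network_stream is hypothetical):

// Wrap a byte stream (anything implementing Read) and hand it to the
// decoder as if it were a seekable file.
let stream = get_network_stream();
let cursor = BufferCursor { r: stream, data: Vec::new(), cursor: 0 };
let mut decoder = tiff::decoder::Decoder::new(cursor)?;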

If you do implement something like this, you should totally release it on crates.io though!

The bigger question is whether image-tiff will actually do something sane when passed a reader like this. From skimming a bit of the code it looks plausible, but I can't promise that the decoder won't seek to the end of the stream without warning, or before it is actually necessary.

@Farkal
Member Author

Farkal commented Aug 7, 2020

After some investigation, I found that the best way is to add a new limit field:

/// Decoding limits
#[derive(Clone, Debug)]
pub struct Limits {
    /// The maximum size of any `DecodingResult` in bytes, the default is
    /// 256MiB. If the entire image is decoded at once, then this will
    /// be the maximum size of the image. If it is decoded one strip at a
    /// time, this will be the maximum size of a strip.
    pub decoding_buffer_size: usize,
    /// The maximum size of any ifd value in bytes, the default is
    /// 1MiB.
    pub ifd_value_size: usize,
    /// An optional file size limit: decoding raises an error if it would
    /// require data beyond this offset (for cloud range-request purposes).
    pub file_limit: Option<u64>,
    /// The purpose of this is to prevent all the fields of the struct from
    /// being public, as this would make adding new fields a major version
    /// bump.
    _non_exhaustive: (),
}
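
For illustration, a caller doing range requests could then opt in like this (a sketch assuming the proposed file_limit field lands; Limits::default() and Decoder::with_limits already exist in the crate):

// Only the first 1024 bytes have been fetched so far, so any seek past
// that point should fail fast instead of reading garbage.
let mut limits = Limits::default();
limits.file_limit = Some(1024); // proposed field, not yet in the crate
let mut decoder = Decoder::new(cursor)?.with_limits(limits);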

Then, on each call of goto_offset_u64, I check the value and return an error if the offset exceeds the file_limit value:

#[inline]
pub fn goto_offset_u64(&mut self, offset: u64) -> TiffResult<()> {
    if let Some(file_limit) = self.limits.file_limit {
        if file_limit < offset {
            return Err(TiffError::DataUnreachable(offset));
        }
    }
    self.reader.seek(io::SeekFrom::Start(offset))?;
    Ok(())
}

I do the same in the read_ifd function:

let num_tags = if self.bigtiff {
    self.read_long8()?
} else {
    self.read_short()?.into()
};

if let Some(file_limit) = self.limits.file_limit {
    let tag_size = if self.bigtiff { 20 } else { 12 };
    if file_limit < offset + num_tags * tag_size {
        return Err(TiffError::DataUnreachable(offset + num_tags * tag_size));
    }
}

I will also add something similar to the decoding_offset function, as sketched below.
The library that calls the decoder and sets the limit can then decide how to handle the error. The buffer cursor is not very useful if an IFD sits at the end of the file: we would need to fetch a lot of data for nothing.
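
A sketch of what that analogous check could look like before fetching a tag's out-of-line value (value_offset, value_byte_length, and the entry fields here are illustrative, not the crate's exact names):

// Verify the whole value range is reachable before seeking to it;
// value_byte_length would return count * size-of-type for the tag.
if let Some(file_limit) = self.limits.file_limit {
    let value_end = value_offset + value_byte_length(entry.type_, entry.count);
    if file_limit < value_end {
        return Err(TiffError::DataUnreachable(value_end));
    }
}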

@HeroicKatora
Member

> If there is another IFD and it's outside those 1024 bytes, I should be able to call next_image(new_bytes) and get an error if we can't decode the next IFD (better would be to report the size we need = 2 + the entry count at the beginning of the IFD * 12)

Could you explain how this error would have to differ from the one already produced by any beyond-the-end seek or read? What doesn't exist is recovery after such an error; in other words, reading a strip or IFD isn't transactional, and any error may leave the decoder in an improper state for resumption.

@Farkal
Member Author

Farkal commented Aug 10, 2020

There are two issues.
The first is that an error leaves the reader in an improper state for resumption, so you have to decode everything again.
The second (and what I am trying to fix here) is that you don't know where and how many bytes you need to read. But everything is in the IFD.
For each IFD you know that a tag is 12 bytes (20 for BigTIFF), and you have the number of tags at the beginning of the IFD, so you can easily tell whether you will be able to read all the tags or whether you need to fetch more data.
For each tag offset you have the type of the value (BYTE, SBYTE, SHORT, LONG, FLOAT...) and the count, so you can also tell, just by reading the IFD, whether you will be able to access the whole data.
I have implemented this in my custom branch; here is the commit about managing offsets -> Farkal@2adf263

We could solve the first issue by using a custom reader: if the seek or the read fails, we could fetch the missing data, BUT we don't know how much we need, and on high-resolution images this could lead to a lot of requests 😕
That's why I think returning an error carrying the amount of data needed could solve the issue and keep the decoder in a good state (because we check before seeking).
This works great when all the IFDs and TileOffsets or StripOffsets are packed at the beginning of the file (which is the case for cloud-optimised GeoTIFF), but with a layout like ifd1 image1 ifd2 image2 ifd3 image3 we would also need to fetch the images just to get all the IFDs. To solve this case we should be able to provide the buffer when we run next_image (it would be better to have decode_ifd(data: Option<Vec<u8>>): if the data is set we use it, otherwise we read from the file; or to have some kind of CloudDecoder).
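
For reference, the per-type value sizes that this calculation relies on are fixed by the TIFF 6.0 spec; a sketch (type_size is a hypothetical helper keyed on the raw field-type code):

// Size in bytes of a single value of each TIFF field type, so the data a
// tag needs is count * type_size(field_type).
fn type_size(field_type: u16) -> u64 {
    match field_type {
        1 | 2 | 6 | 7 => 1,              // BYTE, ASCII, SBYTE, UNDEFINED
        3 | 8 => 2,                      // SHORT, SSHORT
        4 | 9 | 11 | 13 => 4,            // LONG, SLONG, FLOAT, IFD
        5 | 10 | 12 | 16 | 17 | 18 => 8, // RATIONAL, SRATIONAL, DOUBLE, LONG8, SLONG8, IFD8
        _ => 0,                          // unknown type code
    }
}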

@Farkal
Member Author

Farkal commented Oct 14, 2020

Also, for better cloud support, we should provide a step between parsing the tag types and decoding their values.
For example, TileOffsets can hold a lot of data, and we don't want to fetch all of it; we would rather know the index and the size of the values, so we can then compute a given tile's entry position easily for a 2D image:

// tile_offset = { index: 1253, value_size: 4 }
let tiles_per_row = (level.width as f32 / level.tile_width as f32).ceil() as u32;
let index = tile_offset.index + (y * tiles_per_row + x) * tile_offset.value_size;

If we just add a get_ifd() method on Decoder, this should solve the issue.
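
To make the intent concrete, a sketch of how a caller could combine such a get_ifd() with a range request; get_ifd, tag_position, tile_offset, and fetch_range are all hypothetical names here:

// Locate the TileOffsets entry for tile (x, y) and fetch only its 4 bytes.
let ifd = decoder.get_ifd()?;
let tile_offset = ifd.tag_position(Tag::TileOffsets)?; // -> { index, value_size }
let tiles_per_row = (level.width as f32 / level.tile_width as f32).ceil() as u32;
let pos = tile_offset.index + (y * tiles_per_row + x) as u64 * tile_offset.value_size;
let bytes = fetch_range(pos, tile_offset.value_size)?; // e.g. an HTTP range request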

@feefladder

feefladder commented Sep 18, 2024

While poking around in this code once in a blue moon, I added an ifd::Value::Offset([u8; 8]) variant to the enum and made read_chunk first read the offset and byte-count tag data and then read the chunk, all of which is driven by the tag (offset, count, index). In the end, however, I added a RangeReader trait with a blanket implementation, so the reader could use that for range requests. That would mean that Image.chunk_offsets/Image.chunk_bytes become different.

I'd propose an enum of the shape:

enum ChunkData {
    // Entry kept for easy retrieval of the value; it's also only 12 bytes,
    // so data duplication is not really a problem I'd say.
    Uninitialized(Entry),
    Sparse(Entry, HashMap<u64, u64>),
    Dense(Vec<u64>),
}

impl ChunkData {
    /// Tries to get the value, returning None if it isn't cached.
    fn get(&self, index: u64) -> Option<u64> {
        match self {
            ChunkData::Uninitialized(_) => None,
            ChunkData::Sparse(_, hm) => hm.get(&index).copied(),
            ChunkData::Dense(v) => match v.get(index as usize) {
                Some(&offset) if offset != 0 => Some(offset),
                _ => None,
            },
        }
    }

    fn retrieve_single<R: Read + Seek>(
        &mut self,
        reader: &mut R,
        byte_order: ByteOrder,
        bigtiff: bool,
        index: u64,
    ) -> TiffResult<u64> {
        // match-based retrieval logic, caching the value as it is read
        todo!()
    }
}
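
A hypothetical call site inside read_chunk might then look like this (sketch only):

// Use the cached offset when present; otherwise read just that one entry.
let offset = match image.chunk_offsets.get(chunk_index) {
    Some(offset) => offset,
    None => image
        .chunk_offsets
        .retrieve_single(&mut reader, byte_order, bigtiff, chunk_index)?,
};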

I put the rough outline in #249 with superfluous features for rumination.
