Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ngclient: reveal final form (finish API changes) #1580

Closed
jku opened this issue Sep 15, 2021 · 18 comments
Closed

ngclient: reveal final form (finish API changes) #1580

jku opened this issue Sep 15, 2021 · 18 comments
Assignees
Labels
backlog Issues to address with priority for current development goals ngclient
Milestone

Comments

@jku
Copy link
Member

jku commented Sep 15, 2021

There's a bunch of small API changes that I think would still make the API better (more consistent, simpler, safer). The implementations of the individual changes can probably be separate but I think it makes sense to discuss the design as a whole: it'll be a lot easier to discuss when the final form of the API is visible.

  • simplify the caching support: I've advocated for just dropping the "cache support" in updated_targets() before (since it's not really a specification feature) but... it looks like it can be just simplified
  • solve potential issues with target paths as local file paths
  • download_target() should return the local file path, not force client to guess
  • make the API more obvious and consistent: Updater.updated_targets() return value is not obvious, and using a list is weird
  • Updater.get_one_valid_targetinfo(): name is long without adding value
  • Updater.refresh(): there is no reason to force client to explicitly call this

Planned API usage

NOTE: these examples have been updated as the ideas have crystallized: earlier comments may refer to a slightly different example

    updater = Updater(
        metadata_dir,
        "https://example.com/metadata/",
        targets_cache_dir,
        "https://example.com/targets/",
    )
    # Update metadata for a target, then download target into cache if needed
    info = updater.get_targetinfo("file.txt")
    path = updater.find_cached_target(info)
    if path is None:
        path = updater.download_target(info)

    # Alternatively if client knows the local filename it wants to use
    info = updater.get_targetinfo("file.txt")
    if not updater.find_cached_target(info, "./file.txt"):
        updater.download_target(info, "./file.txt")

Things to note there:

  • refresh() is available but not required: it will be done on first get_targetinfo() if not explicitly called before
  • the functions for finding a local file and downloading a file work the same way: signatures are almost identical
  • both functions return a file path: caller does not have to guess where the file is stored locally
  • the local path argument to both functions can be a filename or a directory. If it is a directory, then Updater chooses a safe filename for target path (similar to current functionality). If the client knows the filename it wants, it can just use that as the argument. This allows using the Updater in a caching manner (using the same cache dir over time), or just to download a file without any cache
  • no local directories are created by any of the calls: If download_target() chooses the filename, it will encode the target path before using it as a filename.
  • find_cached_target() name is not great but I want it to be a counterpart to download_target (and I believe it's much better than the ambiguous updated_targets())

See #1604 for more details.

Considered, but not included changes

I considered these as well, but am currently not planning to include them:

  • Does ngclient: feature request "download target as bytes" #1556 make sense: should there be version of download_target() that returns the target file as bytes for the use case where writing to a real file is not necessary? As far as I can tell these would be entirely new functions though so this could be done at a later time without breaking API, so I'm planning to not do this now.
  • Should we support pathlib Paths instead of / in addition to str file paths? I think pathlib is complete enough in the Python versions we support so that's no longer a barrier... I'm leaning on "no we should not" as it feels more like a purity test that makes API more complicated to average developer and not a major practical benefit: Sure, filenames are not strings but I think I'm fine limiting Updater to the subset of filenames that are strings.
@jku jku added the ngclient label Sep 15, 2021
@sechkova sechkova added the backlog Issues to address with priority for current development goals label Sep 15, 2021
@sechkova sechkova added this to the Sprint 8 milestone Sep 15, 2021
@jku
Copy link
Member Author

jku commented Sep 15, 2021

related #1412, #1571

@jku
Copy link
Member Author

jku commented Sep 20, 2021

re-implementing the current directory-creating cache style would be easy for a client that wants to do that (and knows that it is safe to do so).

@jku
Copy link
Member Author

jku commented Sep 20, 2021

Would appreciate comments on this design: @sechkova, @joshuagl maybe?

This includes several things that are related enough that I wanted to cover them in one document: I don't intend to implement all in one PR, just wanted to document the end result I think would make sense.

@joshuagl
Copy link
Member

  • download_target() should return the local file path, not force client to guess

💯 this will make the API much nicer in use.

  • make the API more obvious and consistent: Updater.updated_targets() return value is not obvious, and using a list is weird

💯 the pluralised API has always felt a little off.

  • Updater.get_one_valid_targetinfo(): name is long without adding value

👍

  • Updater.refresh(): there is no reason to force client to explicitly call this

I am less certain about this. We should try and get this right for 1.0 so we don't have to break API too soon. Is always refreshing going to cause problems anywhere? It might be problematic for an air-gapped client (with local/cached files) that can't reach out to the network to refresh()?

  • the local path argument to both functions can be a filename or a directory. If it is a directory, then Updater chooses a safe filename for target path (similar to current functionality). If the client knows the filename it wants, it can just use that as the argument. This allows using the Updater in a caching manner (using the same cache dir over time), or just to download a file without any cache

How do we indicate the difference between a directory and a filename? directory paths must end with /? Is this too magical?

  • no local directories are created by any of the calls: If download_target() chooses the filename, it will encode the target path before using it as a filename.

Sounds reasonable. Will this work with how, for example, pip does caching today?

  • get_local_target() name is not great but I want it to be a counterpart to download_target and that was the best I could come up with (and I believe it's much better than the ambiguous updated_targets())

Other options: get_cached_target(), find_cached_target(), find_local_target()

Signatures

This is what I think the changes will look like in function signatures

    # local_path can be dir or file path
    # return value is file path or None
    @staticmethod
    def get_local_target(
        targetinfo: TargetFile, 
        local_path: str
    ) -> Optional[str]

    # local_path can be dir or file path (if it's a path, then a safe file name is generated)
    # return value is a file path
    def download_target(
        self,
        targetinfo: TargetFile,
        local_path: str,
        target_base_url: Optional[str] = None,
    ) -> str

Considered, but not included changes

I considered these as well, but am currently not planning to include them:

  • Does ngclient: feature request "download target as bytes" #1556 make sense: should there be version of download_target() that returns the target file as bytes for the use case where writing to a real file is not necessary? As far as I can tell these would be entirely new functions though so this could be done at a later time without breaking API, so I'm planning to not do this now.

metadata.py is already too large, let's avoid adding these until we have a use-case.

it feels more like a purity test that makes API more complicated to average developer and not a major practical benefit

😆 yes!

@jku
Copy link
Member Author

jku commented Sep 20, 2021

  • Updater.refresh(): there is no reason to force client to explicitly call this

I am less certain about this. We should try and get this right for 1.0 so we don't have to break API too soon. Is always refreshing going to cause problems anywhere? It might be problematic for an air-gapped client (with local/cached files) that can't reach out to the network to refresh()?

Doing the actual refresh has always been required so that's not changing with my plan... not being able to connect to remote always means failure (in this plan and current implementations). The question of "should we be able to do something with local metadata without checking remote metadata" is a reasonable one. Examples:

  • if I just ran the updater, then fix a typo in the target name and run again should it be possible for the client to not check root/timestamp?
  • (your example) should it be possible to check if a local target is valid according to local metadata, without updating newest timestamp?

These sounds like post 1.0 material to me but I'll write some ideas on implementation:

I think the first case is an optimization that might not be worth it... but a possible implementation could maybe be a Updater configuration where client could define cache_validity = 15 # seconds and for that time from last reference time Updater would then break specification and not update root/timestamp (and then would typically not need to update snapshot or targets or any delegated targets it happens to have locally). This would require the client or Updater to store the reference time and would require the Updater to believe the stored reference time later.

The second case I guess could be done with similar configuration -- but a practical limit with this will be the timestamp expiry: I don't think we want to add code for allowing expired metadata in some cases. But imagining an update system designed for this (with long enough expiries), I guess the same approach would work?

  • the local path argument to both functions can be a filename or a directory. If it is a directory, then Updater chooses a safe filename for target path (similar to current functionality). If the client knows the filename it wants, it can just use that as the argument. This allows using the Updater in a caching manner (using the same cache dir over time), or just to download a file without any cache

How do we indicate the difference between a directory and a filename? directory paths must end with /? Is this too magical?

my prototype code uses os.path.isdir(path_arg).

  • no local directories are created by any of the calls: If download_target() chooses the filename, it will encode the target path before using it as a filename.

Sounds reasonable. Will this work with how, for example, pip does caching today?

pips actual cache is the http cache (and we planned to let that be the case for TUF downloads as well). For the project index download case, the file would be temporary and filename does not matter so I suppose pip will let Updater choose the filename. For the real target files, I think pip will want to build the local filename itself (but it will want to use get_local_target() as in some cases the file might already exist).

@joshuagl
Copy link
Member

Thanks for working through these with me Jussi.

  • if I just ran the updater, then fix a typo in the target name and run again should it be possible for the client to not check root/timestamp?
  • (your example) should it be possible to check if a local target is valid according to local metadata, without updating newest timestamp?

These sounds like post 1.0 material to me but I'll write some ideas on implementation:

I agree that this is post 1.0 material. I just want to keep it in mind so we are able to satisfy the requirements.

The second case I guess could be done with similar configuration -- but a practical limit with this will be the timestamp expiry: I don't think we want to add code for allowing expired metadata in some cases. But imagining an update system designed for this (with long enough expiries), I guess the same approach would work?

Agree, we do not want code to allow for expired metadata in some cases. What I think is the sensible approach for air-gaps is for the client to be able to use local valid metadata but not choke if fetching remote metadata fails. If expires has passed, the client needs new metadata.

Looking over ngclient interfaces, this might be an additional method (added post 1.0) to explicitly load local metadata and a config optional + checking in appropriate places (i.e. if config.offline: return in refresh()) to ignore refresh calls?

  • the local path argument to both functions can be a filename or a directory. If it is a directory, then Updater chooses a safe filename for target path (similar to current functionality). If the client knows the filename it wants, it can just use that as the argument. This allows using the Updater in a caching manner (using the same cache dir over time), or just to download a file without any cache

How do we indicate the difference between a directory and a filename? directory paths must end with /? Is this too magical?

my prototype code uses os.path.isdir(path_arg).

and caller owns creating directories, makes sense. I'm still a bit concerned about how intuitive this method will feel to users. Are we OK with requiring users to ensure directories exist and possibly doing unexpected things when they do not?

For example, the following to calls to download_target() behave the same if /tmp/badger is an existing directory, but what if it isn't?

updater.download_target(tgtinfo, "/tmp/badger/")
updater.download_target(tgtinfo, "/tmp/badger")
  • no local directories are created by any of the calls: If download_target() chooses the filename, it will encode the target path before using it as a filename.

Sounds reasonable. Will this work with how, for example, pip does caching today?

pips actual cache is the http cache (and we planned to let that be the case for TUF downloads as well). For the project index download case, the file would be temporary and filename does not matter so I suppose pip will let Updater choose the filename. For the real target files, I think pip will want to build the local filename itself (but it will want to use get_local_target() as in some cases the file might already exist).

👍

@jku
Copy link
Member Author

jku commented Sep 20, 2021

  • get_local_target() name is not great but I want it to be a counterpart to download_target and that was the best I could come up with (and I believe it's much better than the ambiguous updated_targets())

Other options: get_cached_target(), find_cached_target(), find_local_target()

Hmm, yes, maybe 'cache' is a better term than 'local' (when contrasted with download_target())...

@jku
Copy link
Member Author

jku commented Sep 21, 2021

What I think is the sensible approach for air-gaps is for the client to be able to use local valid metadata but not choke if fetching remote metadata fails. If expires has passed, the client needs new metadata.

Looking over ngclient interfaces, this might be an additional method (added post 1.0) to explicitly load local metadata and a config optional + checking in appropriate places (i.e. if config.offline: return in refresh()) to ignore refresh calls?

refresh() is currently responsible for loading local metadata and that probably still makes sense... my hunch is that supporting a local-only use case would not be terribly complex, but would be complex enough that implementation should happen when we know that a real use case exists.

How do we indicate the difference between a directory and a filename? directory paths must end with /? Is this too magical?

my prototype code uses os.path.isdir(path_arg).

and caller owns creating directories, makes sense. I'm still a bit concerned about how intuitive this method will feel to users. Are we OK with requiring users to ensure directories exist and possibly doing unexpected things when they do not?

requiring directories to exist seems fine to me but...

For example, the following to calls to download_target() behave the same if /tmp/badger is an existing directory, but what if it isn't?

This is a reasonable criticism. My prototype code would indeed try to write to a file (first one would fail and second would succeed...). The only alternatives I can think of are adding duplicate functions for the two types of arguments (this seems unappealing) and adding separate arguments for filename and directory which looks a bit unique but could work:

# separate arguments for directory (required, must exist) and filename (optional)
path = updater.download_target(tgtinfo, "/tmp")
path = updater.download_target(tgtinfo, "/tmp", filename="badger")

# or optional arguments, either dir or filepath must be set
path = updater.download_target(tgtinfo, dir="/tmp")
path = updater.download_target(tgtinfo, filepath="/tmp/badger")

I'm not super into either one of these.

@joshuagl
Copy link
Member

This is a reasonable criticism. My prototype code would indeed try to write to a file (first one would fail and second would succeed...). The only alternatives I can think of are adding duplicate functions for the two types of arguments (this seems unappealing) and adding separate arguments for filename and directory which looks a bit unique but could work:

# separate arguments for directory (required, must exist) and filename (optional)
path = updater.download_target(tgtinfo, "/tmp")
path = updater.download_target(tgtinfo, "/tmp", filename="badger")

# or optional arguments, either dir or filepath must be set
path = updater.download_target(tgtinfo, dir="/tmp")
path = updater.download_target(tgtinfo, filepath="/tmp/badger")

I'm not super into either one of these.

Of these options I prefer the latter, users either tell us where to store a file we name or precisely where to write the file. Separate arguments makes it clear they are different use-cases. Mutually exclusive optional arguments might be a little confusing, but feels safer for users than a single argument.

@sechkova any thoughts on the proposed API here?

@sechkova
Copy link
Contributor

From what I understand, the intention is that if download_target gets a directory and it does not exist, it raises an exception but if it gets a file path then it encodes it as the filename?

I guess mutual exclusive if I have to choose from the options above, otherwise the result from this misunderstanding of the API:

path = updater.download_target(tgtinfo, dir="/tmp", filename"/tmp/badger")

(if /tmp exists) may come as a surpeise ( /tmp/%2Ftmp%2Fbadger)

Other options: get_cached_target(), find_cached_target(), find_local_target()

The find* names make it clearer to me why we use info and a name to return a path or None.

@jku
Copy link
Member Author

jku commented Sep 23, 2021

From what I understand, the intention is that if download_target gets a directory and it does not exist, it raises an exception but if it gets a file path then it encodes it as the filename?

can you explain which variant exactly you mean? I can't quite follow the question

To summarize there have been three options so far I think:

# original: single argument, will be understood as directory if the directory already exists
# case 1: argument is directory, target is saved in  "/tmp/<encoded_filename>"
path = updater.download_target(tgtinfo, "/tmp")
# case 2: argument is not directory, target is saved in  "/tmp/filename"
path = updater.download_target(tgtinfo, "/tmp/filename")

# variant 1: separate arguments for directory (required, dir must exist) and filename (optional)
# case 1: target is saved in  "/tmp/<encoded_filename>"
path = updater.download_target(tgtinfo, "/tmp")
# case 2: target is saved in  "/tmp/badger"
path = updater.download_target(tgtinfo, "/tmp", filename="badger")

# variant2: optional arguments, one of dir or filepath must be set
# case 1: target is saved in  "/tmp/<encoded_filename>"
path = updater.download_target(tgtinfo, dir="/tmp")
# case 1: target is saved in  "/tmp/badger"
path = updater.download_target(tgtinfo, filepath="/tmp/badger")

Variant 1 feels ugly and unusual. I can live with variant 2 or original.

@joshuagl
Copy link
Member

My preference is variant 2, due to the concern I raised with the original proposal.

@sechkova
Copy link
Contributor

can you explain which variant exactly you mean? I can't quite follow the question

Sorry, I've misunderstood partially. Let's go for variant 2 then.

jku pushed a commit to jku/python-tuf that referenced this issue Sep 27, 2021
Doing so is not always safe and has various other issues
(like target paths "a/../b" and "b" ending up as the same
local path).

Instead URL-encode the target path to make it a plain filename. This
removes any opportunity for path trickery and removes the need to create
the required sub directories (which we were not doing currently, leading
to failed downloads). URL-encoding encodes much more than we really need
but doing so should not hurt: the important thing is that it encodes
all path separators.

Return the actual filepath as return value. I would like to modify the
destination argument to be either a directory or a filename so caller
could decide the filename if they want to. But I won't do it now
because updated_targets() (the caching mechanism) relies on filenames
being chosen by TUF. The plan is to make it possible for caller to
choose the filename though.

This is part of bigger plan in theupdateframework#1580
Fixes theupdateframework#1571

Signed-off-by: Jussi Kukkonen <[email protected]>
jku pushed a commit to jku/python-tuf that referenced this issue Sep 27, 2021
Doing so is not always safe and has various other issues
(like target paths "a/../b" and "b" ending up as the same
local path).

Instead URL-encode the target path to make it a plain filename. This
removes any opportunity for path trickery and removes the need to create
the required sub directories (which we were not doing currently, leading
to failed downloads). URL-encoding encodes much more than we really need
but doing so should not hurt: the important thing is that it encodes
all path separators.

Return the actual filepath as return value. I would like to modify the
arguments so caller could decide the filename if they want to. But I
won't do it now because updated_targets() (the caching mechanism)
relies on filenames being chosen by TUF. The plan is to make it
possible for caller to choose the filename though.

This is clearly a "filesystem API break" for anyone depending on the
actual target file names, and does not make sense if we do not plan to
go forward with other updated_targets()/download_target() changes
listed in theupdateframework#1580.

This is part of bigger plan in theupdateframework#1580
Fixes theupdateframework#1571

Signed-off-by: Jussi Kukkonen <[email protected]>
jku pushed a commit to jku/python-tuf that referenced this issue Sep 27, 2021
Doing so is not always safe and has various other issues
(like target paths "a/../b" and "b" ending up as the same
local path).

Instead URL-encode the target path to make it a plain filename. This
removes any opportunity for path trickery and removes the need to create
the required sub directories (which we were not doing currently, leading
to failed downloads). URL-encoding encodes much more than we really need
but doing so should not hurt: the important thing is that it encodes
all path separators.

Return the actual filepath as return value. I would like to modify the
arguments so caller could decide the filename if they want to. But I
won't do it now because updated_targets() (the caching mechanism)
relies on filenames being chosen by TUF. The plan is to make it
possible for caller to choose the filename though.

This is clearly a "filesystem API break" for anyone depending on the
actual target file names, and does not make sense if we do not plan to
go forward with other updated_targets()/download_target() changes
listed in theupdateframework#1580.

This is part of bigger plan in theupdateframework#1580
Fixes theupdateframework#1571

Signed-off-by: Jussi Kukkonen <[email protected]>
@jku
Copy link
Member Author

jku commented Sep 27, 2021

Status report for this:

Other than those my WIP branch contains

  • a few test cleanups
  • change of updated_targets -> get_cached_target (including the directory and filepath arguments)
  • get_one_valid_targetinfo rename
  • implicit refresh

I think those will be ~two more PRs -- I will make them when current open one merges, but if you want to see the end result, see https://github.com/jku/python-tuf/commits/ngclient-final-form

@jku
Copy link
Member Author

jku commented Oct 7, 2021

Curently active PR for this is #1604 : Since there's a lot of target dir / filepath discussion here already, I'll mention that has a design that improves on the ones discussed here

After this the only thing left is the implicit refresh which is quite trivial.

MVrachev pushed a commit to MVrachev/tuf that referenced this issue Oct 8, 2021
Doing so is not always safe and has various other issues
(like target paths "a/../b" and "b" ending up as the same
local path).

Instead URL-encode the target path to make it a plain filename. This
removes any opportunity for path trickery and removes the need to create
the required sub directories (which we were not doing currently, leading
to failed downloads). URL-encoding encodes much more than we really need
but doing so should not hurt: the important thing is that it encodes
all path separators.

Return the actual filepath as return value. I would like to modify the
arguments so caller could decide the filename if they want to. But I
won't do it now because updated_targets() (the caching mechanism)
relies on filenames being chosen by TUF. The plan is to make it
possible for caller to choose the filename though.

This is clearly a "filesystem API break" for anyone depending on the
actual target file names, and does not make sense if we do not plan to
go forward with other updated_targets()/download_target() changes
listed in theupdateframework#1580.

This is part of bigger plan in theupdateframework#1580
Fixes theupdateframework#1571

Signed-off-by: Jussi Kukkonen <[email protected]>
@jku jku mentioned this issue Oct 8, 2021
3 tasks
@jku jku modified the milestones: Sprint 9, Sprint 10 Oct 13, 2021
@sechkova
Copy link
Contributor

First I think all points from this issue are done and it can be closed.
But just before that I wanted to suggest adding a user-friendly error message when refresh is called a second time during the lifetime of an Updater object. This is already documented but the error you get is "RuntimeError: Cannot update timestamp after snapshot". I suggest we add something more appropriate like "refresh()" can be done only once during the lifetime of an Updater".

@MVrachev
Copy link
Collaborator

Whoever closes that issue please have a look at #1135 as well as I think that's the only issue that stops it from being closed.

@jku
Copy link
Member Author

jku commented Nov 30, 2021

closing: client looks complete to me

@jku jku closed this as completed Nov 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlog Issues to address with priority for current development goals ngclient
Projects
None yet
Development

No branches or pull requests

4 participants