antelope.txt

==========
Wed Apr 04 14:00:53 -0700 2018

Context for this jam session:
 (1) ARMA project to modernize excel-based study; implying a redo of the calrecycle app
 (2) Allbirds potential project: "formalizing our approach to 'product footprints'... imagining a simple excel model with ~30 assumptions we can flex." ==> implying no, you want a redone calrecycle app
 (3) Eco Data Science meetup on R-Shiny, with free opensource, paid hosting, and high-dollar enterprise solutions for shiny apps.

Now then: Moving parts in antelope:

 * we have the frontend, which is 100% javascript and can be provided from a dumb http server
  - we have config, which is currently bundled in the js but could easily be user-specific
  - the config could, furthermore, also easily contain API keys
  = the actual frontend app needs to be more or less rebuilt from scratch
 * we have the foreground, which is the BasicArchive that stores fragments, flows, and quantities that make up the product model
  ** the scenarios are still homeless **
 * we have the catalog, which is necessary to instantiate the foreground
  - and the resources, which are necessary to compute the foreground

The big question right now is where to store the scenarios? scenarios are per-privileged-user (though the argument is that every user should be able to test hypotheses, given a model)

To me, that really means that every instance-- every session-- is a live antelope server, spun up from scratch, and volatile.  The model author can write her own scenarios, and they would get stored natively within the foreground. Anyone who clones the foreground would get all those scenarios, and could also add their own.  But to save their own scenarios, that person would need their own storage, distinct from that of the model author.

The workflow would look like this:

 * User logs in to service
   = user's account includes a toybox of models, each of which includes:
     - a semantic foreground reference (username.project.module[.sub].foreground)
     - an API key which is actually a signed JWT that proves authorization + specifies access level
   = user selects a model to play with
     : server creates an AntelopeV1 container
     : server clones the specified foreground, which is equivalent to a github repo, into the container

OKAY STOP. The current Av1 server requires a full catalog, and not just a basic archive, and that is because the catalog needs to have resource files (each with their own access tokens) in order to compute the foreground.  Fine.  It's still a github repo, but it's a larger one.

what if upstream foregrounds get updated? ans: the updates get a new semantic reference.  Propagating the changes to the fork becomes equivalent to a git merge-- not trivial, but at least a familiar problem.

this is easier anyway because then each model is atomic / unitary. 
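
A minimal sketch of how the semantic foreground reference (username.project.module[.sub].foreground) could be split into parts. It assumes the reference literally ends in "foreground" and that [.sub] can repeat; ForegroundRef and parse_foreground_ref are hypothetical names, not anything that exists in antelope today.

```python
from typing import NamedTuple, Tuple


class ForegroundRef(NamedTuple):
    """Hypothetical container for a parsed semantic foreground reference."""
    username: str
    project: str
    module: str
    sub: Tuple[str, ...]  # zero or more optional sub-module parts


def parse_foreground_ref(ref: str) -> ForegroundRef:
    """Split 'username.project.module[.sub].foreground' into its parts.

    Assumption: the final component is the literal string 'foreground'.
    """
    parts = ref.split(".")
    if len(parts) < 4 or parts[-1] != "foreground" or not all(parts):
        raise ValueError(f"not a foreground reference: {ref!r}")
    username, project, module, *sub = parts[:-1]
    return ForegroundRef(username, project, module, tuple(sub))
```

Under this scheme an updated upstream foreground would simply parse to a different ForegroundRef, which is what makes each model atomic.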

   = foreground publication itself has to implement access control
     : each foreground creates a public/private keypair; sends the public key to the auth server
     : owner is specified; owner key is generated by encoding owner's id with the private key
     : owner key is not changed with a fork-- thus preventing the forker from generating new keys
     : only owner is permitted to generate new keys.
     : owner requests key by providing recipient's user id.  key is generated by encoding the recipient's id with the private key, thus tying it to the recipient (and preventing the auth server from hijacking the key)
     : new key gets stored in the foreground and sent to the auth server, which generates a JWT, and registers it with the billing server, NOT KNOWING THE RECIPIENT
     : key specifies access as 'agg', 'fg', or 'full'
       -- 'agg' indicates stage aggregation + lcia-level access (private)
       -- 'fg' indicates non-aggregated foreground + lcia-level background
       -- 'full' indicates non-aggregated foreground + exchange-level background
       --- probably need a theory for this
     : Av1 server then sends the token to the recipient
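
One possible way to encode the three access levels, pending the "theory" mentioned above. This sketch assumes the levels are strictly ordered (agg < fg < full), which the notes do not actually establish; Access and the helper names are hypothetical.

```python
from enum import IntEnum


class Access(IntEnum):
    """Assumed ordering: each level includes everything below it."""
    AGG = 1   # stage aggregation + lcia-level access
    FG = 2    # non-aggregated foreground + lcia-level background
    FULL = 3  # non-aggregated foreground + exchange-level background


def parse_access(label: str) -> Access:
    """Turn 'agg', 'fg', or 'full' into an Access level (KeyError otherwise)."""
    return Access[label.upper()]


def can_see_foreground_exchanges(level: Access) -> bool:
    return level >= Access.FG


def can_see_background_exchanges(level: Access) -> bool:
    return level >= Access.FULL
```

If the levels turn out not to be nested, a set of named permissions per level would be a better fit than an ordered enum.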

   = when an authorized user connects, he performs a GET request with his user ID as a q param and the JWT as the AUTH payload.  antelope v1 server:
     : validates the token
     : notes the access level
     : confirms that the encrypted user ID matches the key
     : confirms that the key is authorized
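
The four checks above can be sketched as follows. Big assumption: a keyed hash (HMAC-SHA256 from the standard library) stands in for "encoding the id with the private key" and for the JWT signature, so this is NOT the public-key scheme described, only the same sequence of checks. All function names, the token layout, and the secrets are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def _mac(secret: bytes, message: bytes) -> str:
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def issue_key(fg_secret: bytes, user_id: str) -> str:
    """Tie a key to one recipient by keyed-hashing their user id."""
    return _mac(fg_secret, user_id.encode())


def issue_token(auth_secret: bytes, key: str, access: str) -> str:
    """Stand-in for the JWT: base64 claims plus a keyed-hash signature."""
    claims = json.dumps({"key": key, "access": access}).encode()
    body = base64.urlsafe_b64encode(claims).decode()
    return body + "." + _mac(auth_secret, body.encode())


def check_request(token, user_id, fg_secret, auth_secret, authorized_keys) -> str:
    body, _, sig = token.partition(".")
    # 1. validate the token
    if not hmac.compare_digest(sig, _mac(auth_secret, body.encode())):
        raise PermissionError("invalid token")
    claims = json.loads(base64.urlsafe_b64decode(body))
    # 2. note the access level
    access = claims["access"]
    # 3. confirm that the user id matches the key
    if not hmac.compare_digest(claims["key"], issue_key(fg_secret, user_id)):
        raise PermissionError("key was not issued to this user")
    # 4. confirm that the key is (still) authorized
    if claims["key"] not in authorized_keys:
        raise PermissionError("key is not authorized")
    return access
```

Check 3 is what keeps a stolen token from being useful to a different user id; check 4 is where owner-side revocation would bite.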

   = an authorized user (agg, fg, full) may clone the foreground for the purposes of parameterization.
     : stage-level aggregation is not really protective of foreground data if exchange parameterization is allowed
     : agg level must prohibit exchange parameterization (only permits terminal parameterization)
     : the clone must have the auth information scrubbed of all keys not belonging to the requestor

     : clone is REQUIRED to parameterize? seems a bit heavyweight, but it isn't with git clone --depth=1.  the AUOMA repo, which INCLUDES the calrecycle model, is only 2.7 MB and that includes 1 MB of reports + eps.  uolca is about 1 MB all told. and that's a big model.
     : so I think clone is good- especially since we don't need to save the image unless the owner has an account.
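
A small sketch of the two clone rules above: scrubbing other people's keys out of the clone, and limiting 'agg' users to terminal parameterization. It assumes the auth information is a plain mapping of user id to key, which is not settled anywhere in these notes; both function names are hypothetical.

```python
def scrub_keys(auth_info: dict, requestor_id: str) -> dict:
    """Return a copy of the auth section keeping only the requestor's key."""
    return {uid: key for uid, key in auth_info.items() if uid == requestor_id}


def may_parameterize(access: str, target: str) -> bool:
    """Decide whether an access level may parameterize a given target.

    target is 'exchange' or 'termination'. 'agg' access is limited to
    terminations, since exchange parameterization would leak foreground data.
    """
    if target not in ("exchange", "termination"):
        raise ValueError(f"unknown parameterization target: {target!r}")
    if access == "agg":
        return target == "termination"
    return access in ("fg", "full")
```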

   = The last complicated bit is the private key.  We need a way of storing it that is persistent, but still protects it from the auth server / service provider, which owns all the repos.
     : I don't know how to solve that. for that we need some consultation.
     : https://stackoverflow.com/questions/11575398
     : https://softwareengineering.stackexchange.com/questions/205606
     : https://medium.freecodecamp.org/how-to-securely-store-api-keys-4ff3ea19ebda
     : TL;DR: BlackBox, Docker Secrets

OK, I think we have this worked out.

So in order to bring this about:
 - we need the Av1 server
 - we need the foreground publication
 - we need an auth server with encryption capabilities

In order to replicate the CalRecycle model we also need:
 - Av2 servers with privacy protection
 

==========
Thu Mar 29 13:58:23 -0700 2018

The conceptual center of the antelope framework is a small set of distinct, covering INTERFACES. (see also [interfaces.md])

Let's talk through the USED OIL example, in its truest implementation.

We have:
 - USLCI as open background database
   * how is it configured? it has to be configured by the model author; but maybe the data user would like to see the effects of different configurations (to wit: different allocation approaches). Currently that sort of thing is NOT SUPPORTED because the USLCI datasets are pre-rolled from GaBi
   * HOWEVER, the way one could imagine doing it is by (1) creating an alternative USLCI (new semantic origin) that has the modified allocation treatment and (2) applying alternative background terminations using SCENARIOS.  This is fully possible, and in fact suggests that the data provider can easily provide alternative system models, just as ecoinvent does. Note, though, that the utility of those alternative system models is much greater if the semantic origin (and not the external reference) is the only thing that changes.

 - Ecoinvent and PE data as private contributed data
   * the prior condition was that exchange data were not exposed in any query
   * but the private data still needed to be hosted within the antelope instance, which ultimately meant that the antelope data store became private.
   * instead, the data should be hosted elsewhere and the antelope server should have a resource file that provides access to it.  The resource file should (as it currently does) designate the privacy, and the antelope server, rather than *enforcing* privacy, should simply *report it*.  The outside host should be responsible for enforcing the privacy, saying in effect "access with that resource token only provides aggregated results."
   * where is this instantiated?  The remote av2 server which hosts, say, the thinkstep SP24 datasets, will process a request with an attached, signed, JWT that authorizes the requestor to access the background interface with a designated privacy flag.  the request has a privacy level associated with it, and that level is determined during the authorization step.


An issue to think about: UUID as distinct from semantic reference

e.g. in ecoinvent, it is perfectly desirable for every column of every system model of every minor version to have a truly unique UUID, and it is UNDESIRABLE to manufacture UUID collisions by using the same uuid and different origins for alternatively-configured versions of the same activity.

This is exactly the reason for having the external reference be the main identifying key (and really, either the external ref or the uuid should be accepted)

what this means is that the data provider needs to maintain an authoritative list of external references, and then a mapping of those references to UUIDs within the system models, so that an entered reference like ecoinvent.3.2.apos/market_for_flow_[geography] and ecoinvent.3.2.cutoff/market_for_flow_[geography] both resolve to the correct [same] process

this means that the EntityStore needs to support this mapping, needs to obtain it from somewhere, needs to enable data providers to manage it.  that way a user will be able to easily parameterize a flow termination by simply changing the origin without changing the reference (assuming the reference still exists in the new origin)
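
A rough sketch of the mapping the EntityStore would need, assuming links are written as origin/external_ref (as in the ecoinvent examples above). ReferenceMap and its methods are hypothetical, and the UUID strings in the usage note are placeholders, not real ecoinvent UUIDs.

```python
from typing import Dict, Set, Tuple


class ReferenceMap:
    """Maps (origin, external_ref) to the UUID used inside that origin."""

    def __init__(self) -> None:
        self._uuids: Dict[Tuple[str, str], str] = {}
        self._known: Dict[str, Set[str]] = {}  # origin -> UUIDs registered there

    def register(self, origin: str, external_ref: str, uuid: str) -> None:
        self._uuids[(origin, external_ref)] = uuid
        self._known.setdefault(origin, set()).add(uuid)

    def resolve(self, origin: str, ref_or_uuid: str) -> str:
        """Accept either the external ref or the UUID; return the UUID."""
        if (origin, ref_or_uuid) in self._uuids:
            return self._uuids[(origin, ref_or_uuid)]
        if ref_or_uuid in self._known.get(origin, set()):
            return ref_or_uuid
        raise KeyError(f"{origin}/{ref_or_uuid}")

    def retarget(self, link: str, new_origin: str) -> str:
        """Swap the origin of 'origin/external_ref', keeping the reference."""
        _, external_ref = link.split("/", 1)
        self.resolve(new_origin, external_ref)  # KeyError if absent there
        return f"{new_origin}/{external_ref}"
```

Usage would look something like registering market_for_flow_[geography] under both ecoinvent.3.2.apos and ecoinvent.3.2.cutoff with distinct placeholder UUIDs, then calling retarget() to parameterize a termination by origin alone.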


==========
Thu Mar 22 13:25:52 -0700 2018

In our new architecture, we have a number of different pieces to develop:

 * A web infrastructure for providing information on (a) individual studies and (b) databases (open source, public, backend, NAL supported) [Antelope v1/v2 server]
 * A research infrastructure for creating and reviewing product system models (open source, public, frontend + backend, UCSB supported) [js model editor; interacts w Av2 server]
 * A data infrastructure for intelligent access, retrieval, and review of data (closed source, commercial, frontend + backend, vault.lc) [js data reviewer; interacts w Av2 servers and proprietary backend [e.g. graph / tag db]]
 * A free model interactor (open source, public, frontend, interacts with Av1) [CalRecycle FrontEnd]
 * a premium model interactor + publisher

Thu 2018-03-22 13:59:30 -0700

The security model for the Av1 server is unclear (as ever).

How is the Av1 server used?

 - Study authors use it to publish studies to specific audiences, or to the general public
   = study content should not be available to someone who is not authorized by the study creator

 - authorized users may create their own scenarios, which *may or may not* need to be stored on the server
   = scenario specifications and results should not be available to people not authorized by the scenario creator

In the basic (non-privacy-preserving) imagining, the av1 server is housed in a container which is enclosed within a private cloud.  Users access a wrapper service which handles authentication and forwards authorized requests (internally) to the servers that can answer them.
 * here the wrapper service is aware of the contents of all requests and presumably has access to all resources in the internal cloud
 * this is not desirable if people want to use the service to share secret models with collaborators

Therefore, the privacy-preserving case requires that the av1 server be able to negotiate authorization with users directly.  This means that the study author needs to be able to specify access rights in a way that enables a 3rd party (vault.lc) to adjudicate access without being able to obtain access.

Thu 2018-03-22 15:23:43 -0700

OK, so the way I think this works is:
A: Study author; owns secret data
C: Client; granted authorization to view data or results
V: vault.lc; provides hosting and auth adjudication
R: repository to contain secret data; owned by hard private secure storage
 + A must have exclusive access to write to R
 + A must be able to mediate access to read from R (how?)
S: Antelope V1 server instance; created by V

0. A creates secret repo R, including an auth specification with:
 - R contains distinct token per authorized client C
 - R contains mapping of token to authorized views
 - stored non-volatile in R and can be updated, replaced, or revoked by A

1. A registers address of repo @R with vault V [case 0: provides $$$]

2. V instantiates Av1 server S, gives it @R and id of A
 - S generates pubkey pair, provides pubkey K_r,pub to V

3. S attempts to access R; A mediates access (how?)
 - A grants persistent access (how?)
 - R must be able to push out updates to S

4. A gives C her dedicated token
--- A no longer needs to participate, until a new Av1 is required

5. C contacts V with @R [case 1: provides $]
 - V constructs JWT with prudent expiry and signs it with K_r,pub

6. V provides @S and JWT to C 

7. C transmits request and JWT to S; supplies token as query param (alt.: uses token to establish session)
 - S validates JWT: ensures 
 - S validates query token
 - S answers request
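
Steps 5-7 can be sketched with nothing but the standard library. Loud assumption: this uses HS256 (a shared secret) in place of the keypair described above, because the stdlib has no asymmetric signing; the compact header.payload.signature layout and the "exp" claim are standard JWT, while make_jwt, verify_jwt, answer_request, the "repo" claim, and views_by_token are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(secret: bytes, signing_input: str) -> str:
    return _b64(hmac.new(secret, signing_input.encode(), hashlib.sha256).digest())


def make_jwt(secret: bytes, claims: dict, ttl: int = 300, now=None) -> str:
    """Step 5 on V: build a short-lived token (the 'prudent expiry')."""
    now = time.time() if now is None else now
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps({**claims, "exp": int(now + ttl)}).encode())
    return f"{header}.{payload}.{_signature(secret, header + '.' + payload)}"


def verify_jwt(secret: bytes, token: str, now=None) -> dict:
    """Check signature and expiry; the algorithm is fixed, never read from the header."""
    now = time.time() if now is None else now
    header, payload, sig = token.split(".")
    if not hmac.compare_digest(sig, _signature(secret, header + "." + payload)):
        raise PermissionError("bad signature")
    claims = json.loads(_unb64(payload))
    if claims["exp"] <= now:
        raise PermissionError("token expired")
    return claims


def answer_request(secret, token, query_token, views_by_token, view) -> dict:
    """Step 7 on S: validate the JWT, then the client's own token from R."""
    claims = verify_jwt(secret, token)
    if view not in views_by_token.get(query_token, ()):
        raise PermissionError("query token does not grant this view")
    return {"repo": claims["repo"], "view": view}
```

The two checks are deliberately separate: the JWT proves V vouched for the session, while the query token is the one A handed to C and mapped to views inside R.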

Where do scenarios live? if S is stateful, there needs to be a way to make them persistent across instantiations
if they live on C, then S still has to be stateful, unless the full param list is specified with every query (seems wasteful)

I guess the easiest thing is for C to replicate the model and perform its own traversals to implement scenarios

Anyways, it's clear that the flask app needs to be able to receive and validate JWTs.


==========
Wed Dec 06 10:03:19 -0800 2017

I have come up with at least five different things that "Antelope" means:

 * the existing antelope v1 CLIENT interface for a finite set of integer-enumerated entities included in a collection of fragments

 * the SERVER for that, which may or may not be the same as the fragment builder

 * the antelope v2 server, which is meant to be a data clearinghouse that sits on top of an LcCatalog and can translate semantic.ref/entity/query into serialized results

 * the antelope v2 CLIENT, which allows me to talk to a remotely stashed ecoinvent so that I don't need it on my local machine

 * the antelope node server, which acts like an antelope v2 server for a single semantic endpoint and MAY or MAY NOT include Qdb capabilities.

On top of that, there are two persistent issues that are causing me anxiety moving forward:

 - is_elementary and the compartment manager in general was always a stopgap (written in the West Branch library one morning in 2016?) and could be either (a) better aligned with synlist or (b) part of a newly reimagined graph-based Qdb

 - flows properly being flowables, compartments being shifted to EXCHANGE TERMINATIONS, which would be a radical reimagining of pretty much everything.

I'm really excited about reducing flows to flowables, but it would break compatibility with just about everything, starting with the existing antelope v1 (not to mention the J Cleaner paper) which steadfastly applied a category to every flow / declared all flows have compartments. [J Cleaner fairly situated compartments as distinct semantic entities]

So-- if compartments are distinct semantic entities-- do we need to store them in archives?  If compartments are terminations, does that mean they are really processes??

NO, they are not processes, or if they are they are the mothers of all multifunctional processes.

It wouldn't be meaningful to call a compartment a process, because in order to do LCIA on it the process would have to be allocated (by CF) across all flows into that compartment that have impacts, and suddenly we are right back to having an exploded set of entities.

How would this even work? everything would have to get reimagined.  Background emissions, which are presently flow + direction, would need to be flow + termination (with direction implicit??? )

right? there's a deeper problem- the implicit natural directionality of compartments, of contexts. we've already had to start dealing with that in characterization.set_natural_direction() [which must be supplied a compartment manager] -- well so we've already acknowledged it. that's not to say we've solved it.

I think I need to spend some time thinking about how Qdb is supposed to work in this brave new world


Wed 2017-12-06 11:59:40 -0800

Back to antelope servers.  We're going to keep the current system of having flows be about paired flowable + context, but we want the antelope interface to be forward thinking.  So let's go through the API and make sure that it makes sense.

Well.. api.md is rather hopelessly out of date.  It can be simplified a lot.

Wed 2017-12-06 13:18:06 -0800

Worked on this for too long... FWIW I really need to get going on Swagger.  I am putting this down, eating some food, and then doing my important TODO for the day.


==========
Fri Jan 05 09:33:43 -0800 2018

Long month...

Swagger spec for antelope v2 is coming along great--

reading about JWT right now as the most likely form of authentication for the resource to use

Good discussion of shortcomings, mainly that there is a single secret key (in our case per resource) that can be compromised:
https://medium.com/@rahulgolwalkar/pros-and-cons-in-using-jwt-json-web-tokens-196ac6d41fb4

Also helpful comparison of JWT and OAuth, which are totally different:
http://www.seedbox.com/en/blog/2015/06/05/oauth-2-vs-json-web-tokens-comment-securiser-un-api/

lots of discussion etc etc:
https://auth0.com/blog/ten-things-you-should-know-about-tokens-and-cookies/
http://www.bubblecode.net/en/2016/01/22/understanding-oauth2/

https://auth0.com/blog/critical-vulnerabilities-in-json-web-token-libraries/

These people are (a) self-styled crypto badasses and (b) dead-set against all things JOSE
https://paragonie.com/blog/2017/03/jwt-json-web-tokens-is-bad-standard-that-everyone-should-avoid

All that said, JWTs seem like a perfect (at least near-term) solution to our particular security problem, particularly public-key-based JWTs, because they allow the resources to autonomously validate requests without any communication with the server.

The problem: if the tokens are ever supposed to be limited-use, then EITHER the resource has to check with the auth server OR the resource has to become stateful. There is no third option. the TOKEN can't be stateful.

Since we're adamant about resources being stateless, and since we ALREADY operate an auth server, I think we should just go that route.

The JWTs should have three different types:

 - unrestricted access with a privacy level (integer?)-- strong expiry-- server may still need to be contacted for revocation checks
 - metered access (notifies auth server per query for billing-- server returns 200 OK or 401 if revoked)
 - limited access (asks auth server per query for sufficiency-- server returns 200 OK or 401 insufficient)

Both of the latter two do away with the "federated" benefit of the JWT, but they preserve query privacy.
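
To make the three token types concrete, here is a sketch in which the resource stays stateless and all counting lives on the auth server. The "kind" claim and the class and function names are hypothetical ("jti" is the standard JWT ID claim); the auth server is modeled as an in-process object rather than an HTTP service, and the unrestricted case skips the optional revocation check.

```python
class AuthServer:
    """Stand-in for the auth server; it holds all of the state."""

    def __init__(self) -> None:
        self.revoked = set()
        self.usage = {}   # token id -> queries recorded so far
        self.limits = {}  # token id -> query allowance (limited tokens)

    def check(self, token_id: str, kind: str) -> int:
        """Return an HTTP-style status: 200 OK, or 401 if revoked/insufficient."""
        if token_id in self.revoked:
            return 401
        used = self.usage.get(token_id, 0)
        if kind == "limited" and used >= self.limits.get(token_id, 0):
            return 401
        self.usage[token_id] = used + 1  # billing record or limit counter
        return 200


def resource_allows(claims: dict, auth: AuthServer) -> bool:
    """Stateless resource: decide from the claims plus at most one auth call."""
    kind = claims["kind"]
    if kind == "unrestricted":
        return True  # relies on the strong expiry; no auth-server call here
    if kind in ("metered", "limited"):
        return auth.check(claims["jti"], kind) == 200
    return False
```

Note that the auth server only ever sees a token id and a kind, never the query itself, which is the query-privacy property mentioned above.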

Here's how the system will work:

PLAYERS:

AS Authentication server: vault.lc
RP Resource Provider: ecoinvent.org
RC Resource Container: stateless Antelope V2 server: eiv3.2.apos
UU User: Ralph Fishnet
CA Client application: antelope container

STAGES:

Stage 1. Initialize resource
Stage 2. Authenticate session
Stage 3. Query resource

STEPS:

Stage 1: Initialize Resource

 * RP approaches AS with data to be made available as a resource
 * AS creates RC with one-time session key
 * RC creates asymmetric keypair
 * RC transmits public key with session key to AS

public+private key pair for RC
 $ AS creates stateless RC and seeds it with the "public" key which remains secret
 * AS deploys RC

Stage 2: Authenticate session

(mode a: user is already an RP licensee)

 * UU approaches AS and logs in to RP via OAuth2
 * OAuth2 grant establishes UU's account status
 * AS generates an unrestricted JWT with a 1-day expiry
 * AS provides a resource file to grant CA data access to RC using JWT

(mode b: user is a vault.lc user with pay-per-use for protected resources)

 $ monthly billing
 * UU logs into account and invokes CA
 * AS provides resource file supporting index interface to RC
 * AS provides resource file supporting data proxy interface to RC

(mode c: user is a vault.lc user with an ecoinvent limited-use)

 $ during monthly billing, AS generates a limit JWT with a 1-month expiry and a query limit
 * UU logs into account and invokes CA
 * AS provides resource file supporting index interface to RC
 * AS provides resource file to grant CA data access to RC using limit JWT
 * AS provides resource file supporting data proxy interface to RC

Stage 3: Query Resource

(mode a: user is already an ecoinvent licensee)

 * UU submits query
 * CA performs query to RC, using unlimited-use JWT (DWR! not revocable!)
 * RC receives query; decodes JWT; validates permission; [optionally increments auth server]; answers query
 * CA receives query result
 * UU is happy
 
(mode b: user is a vault.lc user with an ecoinvent pay-per-use)

 * UU submits query
 $ CA performs query; proxy negotiates payment
 * proxy returns a JWT
 * CA forwards query with JWT to RC
 * RC receives query; decodes JWT; consumes token [notifies auth server]; answers query
 * CA receives query result
 * UU is happy

(mode c: user is a vault.lc user with an ecoinvent limited-use)

 * UU submits query
 * CA performs query
 * RC receives query; decodes JWT; validates permission; increments auth server; answers query
 * CA receives query result
 * UU is happy
 - if limit is exhausted, CA removes resource containing limit JWT; go to mode b


RC Constructor:

To create an RC, I need: a one-time key, an auth server...
Thu 2018-01-11 10:01:04 -0800
see notes in spiral notebook.


==========
Thu Jan 11 13:34:46 -0800 2018

Components required for an Antelope V2 Server:

 - AntelopeV2Server - main constructor
   :param auth_server:
   :param auth_pubkey:
   :param privacy:
   :param qdb_server:
   :param archive_init_args:
 - AntelopeQuery - replaces CatalogQuery
   - new QuantityKey - replaces QuantityRef - drop-in as argument in call to load_lcia_factors()
     :param qdb_server:
 - flask config + route-to-query mapping

I think that could be more or less it.
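
For reference, the component list above as a skeleton. The parameter names come from the list; the types, the defaults, and the origin/entity/query route shape are all assumptions, and nothing here talks to flask or to a real archive.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class AntelopeV2Server:
    """Skeleton of the main constructor; types and defaults are guesses."""
    auth_server: str                  # address of the auth server
    auth_pubkey: str                  # key used to validate incoming JWTs
    privacy: int = 0                  # privacy level (integer is an assumption)
    qdb_server: Optional[str] = None  # where QuantityKey lookups would go
    archive_init_args: Dict[str, Any] = field(default_factory=dict)


def parse_route(path: str) -> Tuple[str, str, str]:
    """Hypothetical route-to-query mapping: '<origin>/<entity>/<query>'."""
    origin, entity, query = path.strip("/").split("/")
    return origin, entity, query
```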

Thu 2018-03-22 13:25:41 -0700

This is updated in deadtree format in journal