
Raising of Traffic Threshold NFRs specified in the CDS #541

Closed
jimbasiq opened this issue Sep 4, 2022 · 11 comments

Comments

@jimbasiq

jimbasiq commented Sep 4, 2022

Description

Basiq would like to raise concerns and propose a review and uplift of the currently specified Traffic Thresholds NFRs for CDR Data Holders. We believe the current limits are too low to support a data recipient serving the Australian consumer.

Can I please propose this topic as a priority for the Maintenance Iteration 13 starting in a couple of weeks?

Area Affected

To provide some detail to hopefully validate this as a worthwhile topic, our primary concern is the rate limit for refreshing data for all consumers of a given institution (Data Holder) for a given software product (Data Recipient):

The NFRs specify a limit for unattended traffic per software product ID that applies outside of business hours; during business hours only "best effort" is expected, which means effective rates could be even lower. A Data Holder we have been working with has confirmed that we were hitting their limit for private endpoints, which is 50 TPS, and that they will continue to apply that limit unless the CDS specifies a higher throughput.

In order to understand the rate limit and current limitations, here is an example of all requests we are sending within one data refresh job for one consumer:
GET access token (sent once - we are not 100% sure whether this counts towards the rate limit)
GET the list of accounts
GET the list of balances
GET account details - should be targeted only if account.detail scope is present (number of requests equals the number of accounts)
GET transactions - should be targeted only if transaction.detail scope is present (number of requests is greater than or equal to the number of accounts - to simplify, let's say it is equal)
GET customer details (sent once)
This means that if we have all required scopes, we are sending at least 2*(n+2) requests, where n equals the number of accounts.

Now let’s imagine a perfect scenario where we are sending 50 requests every second of the day, and let’s see how many jobs we could do depending on the average number of accounts per job:

n=3: 86,400 * 50 / (2 * (3 + 2)) = 432,000 jobs in total during a day

n=5: 86,400 * 50 / (2 * (5 + 2)) = 308,571 jobs in total during a day

300-430k refreshes per day for one software product at any of the big 4 banks is not enough, and that is under a perfect scenario. Realistically, actual throughput could be 5-10 times worse than the perfect scenario, and we see this as a serious limitation for several of our Partners.
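For reference, here is a minimal sketch of the arithmetic above; it assumes every request in the list, including the token call, counts against the 50 TPS unattended limit quoted by the Data Holder:

```python
SECONDS_PER_DAY = 86_400
TPS_LIMIT = 50  # limit quoted by the Data Holder for private endpoints

def requests_per_job(accounts: int) -> int:
    """Token + list accounts + list balances + customer details, plus account details and transactions per account."""
    return 2 * (accounts + 2)

def jobs_per_day(accounts: int, tps: int = TPS_LIMIT) -> int:
    return SECONDS_PER_DAY * tps // requests_per_job(accounts)

print(jobs_per_day(3))  # 432000
print(jobs_per_day(5))  # 308571
```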

Change Proposed

The largest Data Holder in Australia has just under 18m customers. It is very feasible that a successful Australian Fintech could attract half of Australian consumers, which would mean roughly 9m consumers.

Rounding up to 10m consumers (roughly 33 times the ~300k daily refreshes the current 50 TPS limit supports) to allow some growth, I have two proposals:

  1. All major banks to adhere to a traffic NFR for unattended traffic per software product ID of 1,650 TPS (= 33 * 50). Non-major banks to have a lower TPS, to be negotiated.
  2. The TPS provided by a Data Holder is calculated as a ratio of their customer base, e.g. 15 TPS per 100k customers (see the sketch below).
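As an illustration only, proposal 2 could be expressed as a simple ratio; the 15 TPS per 100k customers figure is just the example above, not an agreed number, and the customer counts are approximate:

```python
import math

def unattended_tps(customer_count: int, tps_per_100k: float = 15.0) -> int:
    """Proposal 2: scale the unattended-traffic TPS obligation with the Data Holder's customer base."""
    return math.ceil(customer_count / 100_000 * tps_per_100k)

print(unattended_tps(18_000_000))  # largest holder (~18m customers) -> 2700 TPS
print(unattended_tps(300_000))     # a small holder -> 45 TPS
```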

Hopefully the issue is clear, please let me know if not and I can elaborate further.

@dpostnikov

@jimbasiq What's the actual customer use case requiring frequent refresh with complete re-load?

@jimbasiq

jimbasiq commented Sep 7, 2022

Hi @dpostnikov,
We have many partners/customers on our platform serving Australian consumers with services such as PFM (Personal Financial Management) or wealth/investment round-ups. Both are must-have use cases that depend on the consumer's data being up to date.

@jimbasiq

jimbasiq commented Sep 7, 2022

It is also worth mentioning that the current Web Scraping Connections we provide to our Partners/Customers are able to support hundreds of thousands of refreshes in a 24 hour period. It is hard to encourage our Partners to move over to CDR Open Banking connections if there is a severe degradation of service capacity in doing so.

@ShaneDoolanFZ

Adatree supports this request. We've detailed a similar experience in #534. Asynchronous collection of data is an often used pattern with obvious benefits.

As the CDR grows, more users mean more requests. Competing priorities will emerge if all refreshes can only occur during a customer-present session. Consumer-facing apps typically have a high-traffic period, so ADRs and data holders can expect huge spikes in traffic during those periods if customer-present is the only real option (which is the case right now). Asynchronous collection avoids this by spreading load across a sensible timeframe. Real-time collection is not required in all cases.

It also allows for a cached fallback when a data holder is unavailable during a customer-present session. If the ADH is not available for a real-time call, the latest data presented to the consumer is not stale to the point of being unusable, i.e. the balance or transaction list might have been fetched an hour ago as opposed to 24 hours ago.
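A minimal sketch of that cached-fallback pattern, assuming the live balance call is supplied by the caller and a simple in-memory cache (both are illustrative placeholders, not part of any standard):

```python
import time
from typing import Callable

# account_id -> (fetched_at_epoch_seconds, balances_payload)
_CACHE: dict[str, tuple[float, dict]] = {}

def get_balances(account_id: str, fetch_live: Callable[[str], dict]) -> dict:
    """Prefer a live call; fall back to the most recent cached copy if the holder is unavailable."""
    try:
        balances = fetch_live(account_id)  # live CDR call supplied by the caller
        _CACHE[account_id] = (time.time(), balances)
        return balances
    except Exception:
        fetched_at, balances = _CACHE[account_id]  # raises KeyError if never fetched successfully
        # Surface the age of the data so the UI can show "as at ..." rather than failing outright.
        return {**balances, "cachedAt": fetched_at}
```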

All of this results in better consumer outcomes regardless of use case by providing consumers with a more resilient CDR ecosystem.

@perlboy

perlboy commented Sep 10, 2022

There are a number of issues raised in this thread that are worthy of being broken out.

NFR Suitability

The core focus of the original thread is "raising" the NFR thresholds. While this may seem like the right approach, the reality is that it likely isn't. The way the threshold is described seems inappropriate and penalises small data holders and successful ADRs alike.

Biza.io raised this in DP208 and instead suggested that the NFR be bound to the number of active arrangements at a particular holder. Biza.io also requested usage data to enable an evidence-based decision. That is to say, Holders would gain the benefit of being able to correlate real usage with the requirement, could integrate it into their capacity management planning and could therefore design solutions that scale 1:1.

This would, by and large, resolve the upper bound problem because the upper bound would be relative to arrangement count. An ADR would get guaranteed throughput per arrangement; even if the per-arrangement TPS was lower, overall parallelisation could compensate. Additionally, Biza.io outlined a number of implementation patterns we had observed Holders implementing, to give the DSB and the broader industry knowledge of the challenges Holders face when weighing up cost vs. capability. As a nascent ecosystem the CDR has very low utilisation, which makes it quite difficult to justify huge capital expenditure at the smaller end of town.

Despite these suggestions, and in the face of considerable opposition, with the participation of a number of ADRs (RAB, Xero, Intuit) but not those involved in this thread, the DSB bound the NFRs "as-is" with immediate effect. It would appear that the ADRs involved in this thread are now encountering the same challenges that others on both the Holder and Recipient side identified.

As a result of this decision, organisations have made architectural decisions on this basis, and consequently any alteration of the defined NFRs is likely to need a long-dated FDO - it would be inappropriate to do otherwise.

Implementation Suitability

There is a reference in the original thread to a "data refresh job". This seems to imply a batch process which essentially resets a complete data set on a daily basis. In essence, a synchronous interface (a non-batch API) is being used to complete asynchronous activity. This is architecturally unsuitable, possibly a hangover of applying existing collection approaches (i.e. screen scraping) to the CDR, and I would also question its appropriateness with respect to recipients' data minimisation obligations. Put another way, why is a full batch run being done across all endpoints rather than requesting (and keeping hot) only data which has been requested by the Consumer themselves?

Nonetheless, even assuming there are justified reasons for obtaining all of the data, it seems inappropriate to be doing this daily. I believe this is the context for the question @dpostnikov posed. Additionally, the scenario used for comparison was described as "perfect" when that seems like a stretch.

Taking the use case given and assuming unattended behaviour (i.e. the Consumer isn't waiting around):

  • GET access token: This is required at most every 120 seconds. This can be used as an upper bound for splay.
  • GET the list of accounts: It's unclear why this needs to be called every refresh, because list balances provides account identifiers anyway and account detail provides more detail if available.
  • GET the list of balances: This can be used to simultaneously provide the list of balances and the account identifiers needed to assess whether there are changes.
  • GET account details: It is unclear why this should be called regularly. The reality is that most accounts don't change their behaviour frequently, except possibly after external events such as interest rate rises. It seems appropriate to apply a multi-day splay to this information, particularly for unattended traffic. Put another way, if the Consumer is present update the specific account; if the Consumer isn't present, eventual consistency seems appropriate.
  • GET transactions: Listing of transactions is limited to 1000 records and supports oldest-time. It is highly unusual for a personal bank account to have this many transactions in a 24 hour period (or in a month), so checkpointing of local data and limiting the data set would be expected from a data recipient (a minimal sketch follows this list).
  • GET customer details: It is unclear, again, why this should be called regularly. A customer's details don't change regularly, so eventual consistency seems appropriate.
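As an illustration of the checkpointing idea in the transactions point above, here is a minimal sketch assuming the Get Transactions For Account endpoint and its oldest-time parameter; the base URL, token handling, x-v value and pagination are simplified placeholders:

```python
import requests

def fetch_new_transactions(base_url: str, access_token: str,
                           account_id: str, checkpoint_iso: str) -> list[dict]:
    """Request only transactions newer than the locally stored checkpoint."""
    resp = requests.get(
        f"{base_url}/banking/accounts/{account_id}/transactions",
        headers={"Authorization": f"Bearer {access_token}", "x-v": "1"},  # version handling simplified
        params={"oldest-time": checkpoint_iso, "page-size": 1000},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["transactions"]
```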

Calculations

I'll stick with n being the number of accounts. I'll also stick with the access token (AT) request being counted against the threshold, although personally I don't think it should be. There's no reason why an authorisation server can't produce many ATs, and penalising the ADR because a Holder chose a low AT lifespan isn't appropriate. In Biza.io's case we don't include what we consider administrative actions in our traffic thresholds; we only apply thresholds to APIs attached to source systems, which we believe are what the NFR upper bounds are intended to protect.

First Run

This is the absolute worst-case scenario because it involves a completely new Consumer coming onboard with zero prior data and retrieving every detail.

1 x Access Token
1 x List of Accounts
1 x List of Balances
n x Account Details
n x Transactions. Assuming 1,000 tx is enough; in our observations two years' worth of history is less than 10,000 tx, and that would be a very "busy" account. I've followed the OP's idea of 1:1 with accounts.
1 x Customer Details

Result: 4 + 2n

Taking the OP's 50 TPS limit, there is a budget of 4,320,000 API calls per day (86,400 x 50).

n=3: 4,320,000 / (4 + 6) = 432,000 sessions per day
n=5: 4,320,000 / (4 + 10) = 308,571 sessions per day

🥳 Huzzah, the numbers align with the OP's, but what's important here is that they represent the absolute worst case of doing a full load of all data in the background every day. I disagree with the statement that the "real scenario could be 5-10 times worse" because separate partners should have separate software products, but maybe I'm not following something.

Incremental Detail Calculation

Let's now assume we want to maintain the same level of detail but optimise, and that we have all detail scopes. We don't need to call list accounts because list balances gives us the account identifiers and account details provides the same information.

1 x Access Token
1 x List of Balances
n x Account Details
n x Transactions. This is very likely to be acceptable and quite possibly high performance if checkpointing is used
1 x Customer Details

Result: 3 + 2n

n=3: 4,320,000 / (3 + 6) = 480,000 sessions per day
n=5: 4,320,000 / (3 + 10) = 332,308 sessions per day

No Detail Calculation

Let's assume that, after the first run or because we haven't been granted detail scopes, we have no detail at all. This appears to be most aligned with a pure PFM use case, especially if the Recipient has aligned its use of the PRD data via productName (and Holders are aligning it too), because much of the account-specific detail can be derived.

I've left list of accounts in here, but this could be stripped further or called less than once a day, since list of balances contains the accountId needed for the transactions call anyway.

1 x Access Token
1 x List of Balances
1 x List of Accounts
n x Transactions. This is very likely to be acceptable if checkpointing is used.

Result: 3 + n

n=3: 4,320,000 / (3 + 3) = 720,000 sessions per day
n=5: 4,320,000 / (3 + 5) = 540,000 sessions per day

Eventually Consistent Detail

The reality is that, across a total sample set, very few Consumers will actively engage with an app every day. If they do, it's because of a prompt driven by the value proposition, and possibly this can be enabled by the CDR in a different way (i.e. a shared signal of account changes etc.). On this basis, being eventually consistent, especially in an unattended scenario, seems appropriate.

On this basis I'll hypothesise, after initial load, the following:

  • Update Customer Detail once every 3 days
  • Update Account Detail once every 5 days, I know the most likely change here is interest rates but the RBA only meets monthly so I'm going to go with averages
  • Update transactions once per day, this could be optimised further by comparing the previous balance values and a synchronisation sweep with customer present

1 x Access Token
1 x List of Balances
0.2n x Account Details
n x Transactions
0.3 x Customer Details

Result: 2.3 + 1.2n

n=3: 4,320,000 / (2.3 + 3.6) = 732,203 sessions per day
n=5: 4,320,000 / (2.3 + 6) = 520,481 sessions per day

Eventually Consistent No Detail

Same concept as above but this time we don't need detail updated continuously. Realistically updating detail could occur as a Consumer present call.

On this basis I'll hypothesise, after initial load, the following:

  • Update Accounts every 3 days
  • Update Balance every day (possibly use as trigger for transaction details)
  • Update transactions every 1.5 days

1 x Access Token
1 x List of Balances
0.3 x List of Accounts
0.66n x Transactions.

Result: 2.3 + 0.66n

n=3: 4,320,000 / (2.3 + 1.98) = 1,009,345 sessions per day
n=5: 4,320,000 / (2.3 + 3.3) = 771,428 sessions per day
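Pulling these scenarios together, a minimal sketch of the sessions-per-day arithmetic under the same 50 TPS assumption (the per-session request formulas are the ones derived above):

```python
SECONDS_PER_DAY = 86_400
TPS_LIMIT = 50
DAILY_BUDGET = SECONDS_PER_DAY * TPS_LIMIT  # 4,320,000 requests per day

# Requests per session as a function of n (number of accounts), per the scenarios above.
SCENARIOS = {
    "first run":                       lambda n: 4 + 2 * n,
    "incremental detail":              lambda n: 3 + 2 * n,
    "no detail":                       lambda n: 3 + n,
    "eventually consistent detail":    lambda n: 2.3 + 1.2 * n,
    "eventually consistent no detail": lambda n: 2.3 + 0.66 * n,
}

for name, per_session in SCENARIOS.items():
    for n in (3, 5):
        print(f"{name:32} n={n}: {int(DAILY_BUDGET / per_session(n)):,} sessions/day")
```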

Suitability

Without real usage data for the ecosystem it is difficult to assess what is "not enough", but suffice to say some basic optimisation appears to roughly double the upper bound. In 2020 Frollo had 100,000 customers and represented 90% of the utilisation. It's unclear whether demand has 10x'd in two years, hence the desire for usage data to inform the decision.

Alternatives

To me the NFR discussion seems to be more symptomatic of a broader set of problems including:

  1. Recipient implementations designed for historical batch retrieval being used in a live API context
  2. Lack of batch job lodgement and retrieval capability. This seems likely to become much more relevant in the Energy context because some C&I customers have literally thousands of accounts, impacting the overall scalability of a poll-based system.
  3. Lack of event signalling mechanism. The DSB has mooted the use of SSE for this but it's a long way off.

I think, overall, the concern I have with simply increasing the NFRs is that it patches over features of the CDR that aren't yet present. This combines with the need for recipients, many of which have come from a batch-based bank feed or cache-based screen scraping environment, to change mindset and build solutions which align with best practices in a CDR context rather than wedging the CDR into existing approaches.

Put another way, there seems to be a better power-to-weight ratio in focusing on feature capability that resolves the underlying problem, versus forcing ever-higher performance requirements that will simply be revisited over and over again.

@dpostnikov


Nonetheless, even assuming there are justified reasons for obtaining all of the data, it seems inappropriate to be doing this daily. I believe this is the context for the question @dpostnikov posed. Additionally, the scenario used for comparison was described as "perfect" when that seems like a stretch.
Exactly, @perlboy you get me.

"PFM" can be designed in so many ways, more efficient or less efficient ways.

Unnecessary calls aside, I agree there is definitely a scalability issue with the current design (both the CDR framework and, as a result, data recipient designs).

The way to solve this problem is not to get a bigger hammer or build a bigger pipe (e.g. replicating a batch design via APIs or increasing thresholds). A secure event notification mechanism is missing and should probably be prioritised to solve for these use cases.

@ShaneDoolanFZ

@perlboy absolutely fair point on our lack of participation on this topic before now. An arrangement-based approach makes sense, so as not to require all implementers to provision excess capacity "just in case" when the practical reality is that throughput thresholds will only really be tested with the majors. A reasonable FDO is also not something we'd complain about given this feedback is after the fact, but it is feedback based on metrics, not theory, so I would hope it would be considered valuable even at this stage.

@RobHale-Truelayer

Just to chime in with another dimension - that of the DH customer profile. Not all DHs are equal, even within a single industry or industry vertical. Some banks focus on lending rather than transactional accounts. Loan accounts have low transaction volumes - typically a monthly interest charge and perhaps one or two monthly payments. Sometimes there might be redraws or deposits, but these aren't typical. A transaction account, by contrast, might have 50 or more times this volume. Profiling DHs ahead of imposing NFRs might be beneficial. The activity around non-bank lender participation is a case in point - personal loans would fit into the low-volatility category. When combined with comparatively low customer volumes, it seems inappropriate to impose the same NFR thresholds on both categories of DH...

@jimbasiq

Hi All,

It is great to see that we seem to have a general consensus that the current traffic rates are inadequate.

I look forward to discussing with you all on Wednesday what would be adequate and how the rate NFRs could differ depending on industry vertical, DH size (members, loan book, other) and other factors. With my ADR hat on I'd like to see a fair usage rate; with my DH hat on I'd like not to cripple the little guys with unfair obligations.

Let's not forget the lack of penalties and the "best efforts" caveat. Both could damage businesses IMO.

@CDR-API-Stream

A Decision Proposal is required. #92 DSB Item - Reassess Non Functional Requirements has been added to the DSB's future-plan backlog.

@CDR-API-Stream

Closing as this issue will be considered as a Decision Proposal, see comment above.
