Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-implement Markets API service with pre-caching proxy + risq backend #20

Merged
merged 40 commits into from
Nov 12, 2019

Conversation

wiz
Copy link
Contributor

@wiz wiz commented Nov 8, 2019

Background

The Bisq Markets API is currently served from a "bare-bones implementation" in PHP that loads and parses JSON files from disk upon every API request, with no caching or database utilized, and the JSON files are periodically updated by a Bisq node running on the same server from the live P2P trade data. Since there is no database or any caching used in this JSON file + PHP code architecture, the original authors specifically mentioned that scalability would eventually become an issue in the README file:

Screen Shot 2019-11-09 at 0 58 23

Issue

As Bisq has grown in popularity, the Markets API now receives thousands of requests every minute. As reported in #19, and shown in my monitoring system's screenshots below, the Markets API server often becomes so overloaded from the uncached PHP / JSON filesystem architecture, that at least once per hour the Markets API server fails to respond to requests under heavy load. Other times when it does respond to API requests, the response time can be quite high, reaching several seconds due to the large backload of requests.

After taking over the Markets API maintainer role about a few weeks ago, I've migrated the site to a more powerful server with double the amount of CPU and RAM, which did help the uptime and response times, but it's obvious that we will eventually need to re-architect the API service from the ground up to solve the underlying scalability problems.

bisq-markets-response-time

Proposal

bisq-markets-diagram

This PR re-implements the Bisq Markets API using a new pre-caching reverse-proxy to be deployed on Google Appengine Java standard runtime, to sit in front of @bodymindarts's new rust based Bisq project called Risq, which serves the Bisq Markets data over a built-in REST API, which gets its data from a dedicated Bisq seednode. The proxy utilizes a key-value datastore, key-value memcache, and HTTP caching to achieve high performance and availability.

Implementation

The proxy will automatically pre-cache the latest the Markets API data by fetching updates every 60 seconds for the most popular endpoints from the Risq node, so these most popular values will always be served instantly from memcache. The results for other API endpoints and lesser used combinations of arguments will not be pre-cached, but instead cached as they are requested, in order to not request them more than once every 60 seconds. The caching is disabled while in development/testing purposes, and will be enabled in a separate PR later.

Whenever a successful response is obtained, the proxy also saves it in the cloud datastore. This way, if the backend REST API service is unavailable or does not respond for some reason, the proxy will continue serving the most recent known-good data. This ensures maximum uptime and availability and fixes #19.

Caching and Pre-Caching logic

HTTP cache = 70 seconds
memcache = 70 seconds
datastore = forever

if (full query URL key exists in memcache)
    return 200 with data
else
    if (query to PHP service succeeds)
        insert into memcache + datastore 
        return 200 with data
    else // query failed
        if (full query URL key exists in datastore)
            return 200 with data
        else // full query URL key not in datastore
            return 502

The proxy serves its responses with a HTTP cache headers, which allows it to be cached by Google's internal CDN for that time, and reduces load on the proxy, to minimize running costs. Since trades on Bisq do not happen very frequently, this cache time seems reasonable trade-off for the performance gains.

CORS

This PR fixes #14 by adding CORS headers to each HTTP response.

Privacy Concerns

Of course using a fully self-hosted solution is preferable to using Google Appengine, since Google would have a log of all IP addresses that access the proxy hosted on their appengine platform, and Bisq should always protect the privacy of Bisq users as much as possible.

Adding the proxy running on Google Cloud should only a temporary fix until we can re-architect the appengine caching proxy away entirely, and return it to being 100% self-hosted. We have plans for how to accomplish this soon, but it will take some more time. In the meantime, to address this privacy concern we could possibly separate the API requests into 2 categories based on my analysis of the requests to Bisq Markets API:

  1. The vast majority of requests originate from other cloud-hosted systems such as coinmarketcap, cryptowatch, etc. which wouldn't be concerned with their privacy. Additionally, if such a service does want to hide their IP address from Google's logging, they can always route their requests through Tor or a VPN.

  2. For the actual Bisq users viewing Bisq Markets data directly on Bisq's website, we can proxy these requests on a self-hosted server, so Google never sees their IP addresses.

Live Demo / Preview

Copy link

@m52go m52go left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my part I can say that the website's market page seems to be fetching graph data correctly (https://bisq-markets-api-using-caching-proxy.netlify.com/markets/), and the API is fetching identical responses for the queries I tried (https://bisq-markets.appspot.com/api/).

I noticed the graphs on specific markets pages of the website can still take a while to load (tried btc_eur and btc_usd), even after one load, which I didn't expect with caching as described. It seems the /offers and /trades and /volumes endpoints are fairly quick (aggregate volume chart and table data populate quickly), but the /hloc endpoint takes time (chart for specific markets).

Not sure if this is meaningful at all but figured it worth mentioning.

@dan-da
Copy link
Contributor

dan-da commented Nov 11, 2019

Hey that's cool if you guys want to re-write the API... more power to you. And the risq project is exciting. However the statements about the original markets API not having any cache are factually incorrect.

First, the README lists under Requirements:
opcache extension. ( for data caching. will run without, but much slower. )

Second, have a look at this file: https://github.com/bisq-network/bisq-markets/blob/master/lib/filecache.class.php

This filecache class is used by each of the data classes when retrieving data from the underlying json data stores, and the filecache stores data in mem across requests... something that is tricky to do in php.

I wonder if opcache extension was somehow not enabled when the site was moved to new server. If so, that would degrade performance immensely as the filecache drops back to reading json file for each request. If it is enabled, then i'd be very curious what kind of reqs/sec you are getting because for example the /depth API handles 900 reqs/sec on a nearly 10 year old laptop. And according to this comment in the code, the trades api was clocked at 1450 reqs/sec (on the old server).

An easy way to check: look if /tmp/bisq_opcache/ exists on the server and recent files are in it. Also check for setting of opcache.enable in phpinfo() output.

Of course, there could be a bug or other code issue, but I would need to see an example of a specific query that is executing slowly. do you have one?

Another thing... the old server had fail2ban rules in place that would read apache logs and auto-ban IPs if they were hitting the site too much. This was put in place due to DDOS attacks early on and worked quite well after a bit of tweaking. Correct me if wrong, but I'm guessing those did not get migrated.

All that said, the server constantly writing out huge json files was never going to scale forever if trade volume were to explode, so you likely have a more future-proof design. I'm a bit skeptical though that the limits of the old system are actually being hit yet.

@dan-da
Copy link
Contributor

dan-da commented Nov 11, 2019

Also, check the apache/php error_log. If opcache is disabled, filecache class logs a warning each request:

Warning: opcache not found. Please install or enable opcache extension for better performance.

@ripcurlx
Copy link

I did some quick tests and found some inconsistencies between the old and new market server.
Looking at the total volume the amount of trades differ by 391 since the beginning of Bisq and the total BTC volume differs by 199,06 BTC. So it seems to be that the new markets server has more accurate data? Or just more optimistic 😉

@bodymindarts
Copy link

I did some quick tests and found some inconsistencies between the old and new market server.
Looking at the total volume the amount of trades differ by 391 since the beginning of Bisq and the total BTC volume differs by 199,06 BTC. So it seems to be that the new markets server has more accurate data? Or just more optimistic 😉

Yes the new one is more accurate in general. But if you'd like I can do a deep dive into a specific discrepancy and try to find the root cause. Just post the api call here and I'll can take a look.

@ripcurlx
Copy link

I tested right now only the https://bisq-markets.appspot.com/api/volumes?basecurrency=BTC&interval=day endpoint.

@ripcurlx
Copy link

@wiz @bodymindarts Am I right that this PR is the pre-caching proxy and the routing of the requests to the modified risq application? Shouldn't the risq application not also be part of the PR? Otherwise someone couldn't take over and deploy the markets server herself.

@bodymindarts
Copy link

bodymindarts commented Nov 11, 2019

@wiz @bodymindarts Am I right that this PR is the pre-caching proxy and the routing of the requests to the modified risq application? Shouldn't the risq application not also be part of the PR? Otherwise someone couldn't take over and deploy the markets server herself.

This code depends on https://github.com/bodymindarts/risq/releases/tag/v0.3.5

The version of the code before this PR depends on https://github.com/bisq-network/bisq/releases/tag/v1.2.3 (particularly the statsnode sub-project)

Should we add https://github.com/bodymindarts/risq to the code base as a submodule @wiz?

@bodymindarts
Copy link

bodymindarts commented Nov 11, 2019

I tested right now only the https://bisq-markets.appspot.com/api/volumes?basecurrency=BTC&interval=day endpoint.

I have done some digging and as far as I can tell the legacy system is missing data.
Have a look at this for example:

Checking out the /volumes query old api reports 1 trade in period.

$ curl -s 'https://markets.bisq.network/api/volumes?basecurrency=btc&interval=minute' | jq '.[] | select(.period_start == 1573138020)'
{
  "period_start": 1573138020,
  "volume": "0.01000000",
  "num_trades": 1
}

New api reports 2 trades in period.

$ curl -s 'https://bisq-markets.appspot.com/api/volumes?basecurrency=btc&interval=minute' | jq '.[] | select(.period_start == 1573138020)'
{
  "volume": "0.02000000",
  "num_trades": 2,
  "period_start": 1573138020
}

Now lets check the /trades to see what the trades are:
Old api:

 curl -s 'https://markets.bisq.network/api/trades?market=all&timestamp_from=1573138020&timestamp_to=1573138080'
[
    {
        "market": "btc_eur",
        "direction": "BUY",
        "price": "8337.80000000",
        "amount": "0.01000000",
        "volume": "83.37800000",
        "payment_method": "SEPA",
        "trade_id": "qkL80-3d40145c-460e-4116-a0b8-bb8c09bbe5ff-122",
        "trade_date": 1573138032443
    }
]%

New api:

curl -s 'https://bisq-markets.appspot.com/api/trades?market=all&timestamp_from=1573138020&timestamp_to=1573138080'
[
  {
    "market": "btc_eur",
    "direction": "BUY",
    "price": "8335.85000000",
    "amount": "0.01000000",
    "volume": "83.35850000",
    "payment_method": "SEPA",
    "trade_id": "45349666-2d049416-69d0-42c6-8f9a-397f7939a691-122",
    "trade_date": 1573138067406
  },
  {
    "market": "btc_eur",
    "direction": "BUY",
    "price": "8337.80000000",
    "amount": "0.01000000",
    "volume": "83.37800000",
    "payment_method": "SEPA",
    "trade_id": "qkL80-3d40145c-460e-4116-a0b8-bb8c09bbe5ff-122",
    "trade_date": 1573138032443
  }
]

Trade ID: 45349666-2d049416-69d0-42c6-8f9a-397f7939a691-122 is not being reported by the old system. CC @ripcurlx

@wiz
Copy link
Contributor Author

wiz commented Nov 11, 2019

I noticed the graphs on specific markets pages of the website can still take a while to load

Yes, pre-caching and caching are both currently disabled, so it's almost as slow as the current backend. The pre-caching will be enabled in a 2nd PR later which uses a cron job every 1 minute to pre-fetch the data into memcache, which makes it super fast as per the PR description since the most popular API endpoint responses is served directly from the cloud.

I wonder if opcache extension was somehow not enabled when the site was moved to new server. If so, that would degrade performance immensely as the filecache drops back to reading json file for each request.

Thanks @dan-da for pointing this out - yes, we migrated everything to the new server by copying over all the packages and configuration and filesystem as exactly as possible. I've only taken over the project very recently but from what I heard when @Emzy and others were maintaining this server, the major issues started after the website was modified to include the fancy new charts and got worse over time as the API got more and more popular. Perhaps some newly added queries for the charts weren't being cached properly. I also heard you had proposed some performance enhancements but for whatever reason they weren't implemented - I'm honestly not sure which is why I made a new thing to solve the instability issues.

@wiz @bodymindarts Am I right that this PR is the pre-caching proxy and the routing of the requests to the modified risq application? Shouldn't the risq application not also be part of the PR? Otherwise someone couldn't take over and deploy the markets server herself.

Well the risq app is AGPL licensed so you could press the "fork" button on GitHub to copy it over to the Bisq organization, and then we could add it as a submodule to this repo?

it seems to be that the new markets server has more accurate data

Yeah, I think @bodymindarts already answered the technical questions but my understanding is that the Bisq nodes are "eventually consistent" and our new model of running it with a dedicated seednode seems to have found some trades that are missing from the current backend.

@dan-da
Copy link
Contributor

dan-da commented Nov 11, 2019

was modified to include the fancy new charts and got worse over time as the API got more and more popular.

haha. yes, I specifically recommended against these changes, especially enabling the CORS header due to traffic/scaling concerns. Doing things this way meant that every end-user browser would be calling the APIs directly multiple times per page request and could do so from 3rd party sites. Also, afaics, the fancy new chart is less fancy, shows less information, and is generally worse than the prev chart. but whatever. anyway, I left the project in part because these changes were being forced against my recommendation.

Still though, I doubt that that the APIs are yet regularly being called at 1000+ reqs/sec. That's pretty huge traffic.

So @wiz have you checked if opcache extension is enabled on the server yet?

Perhaps some newly added queries for the charts weren't being cached properly

There were no newly added queries. With the exception of the CORS header, the new page design and charts was purely frontend work done in a separate repo.

@dan-da
Copy link
Contributor

dan-da commented Nov 11, 2019

From the OP:

Since trades on Bisq do not happen very frequently, this cache time seems reasonable trade-off for the performance gains.

I will just note here that the way the filecache works in the original code, data is cached indefinitely until the underlying bisq json file is written to, and then the json file is atomically read and cached. This means that the service never serves stale data, even with caching enabled. (At least to the extent that the backend node is in sync with the p2p network). It seems the new system does not preserve that property, which is desirable in a trading api.

Of course using a fully self-hosted solution is preferable to using Google Appengine, since Google would have a log of all IP addresses that access the proxy hosted on their appengine platform, and Bisq should always protect the privacy of Bisq users as much as possible.

seriously? To me, this is a non-starter for using Google Appengine. Have you looked at using a reverse proxy cache like varnish locally instead?

@wiz
Copy link
Contributor Author

wiz commented Nov 11, 2019

So @wiz have you checked if opcache extension is enabled on the server yet?

wiz@markets:~$ ls -la /tmp/bisq_opcache/
total 109048
drwxr-xr-x  2 www-data www-data     4096 Nov 12 03:40 .
drwxrwxrwt 12 root     root         4096 Nov 12 03:40 ..
-rw-r--r--  1 www-data www-data 34471564 Nov 12 03:40 27c2d17a21c53094ee50b7b45da3754e
-rw-r--r--  1 www-data www-data       24 Nov 12 03:40 2a9af597773c35f91ec9551d5d37756b
-rw-r--r--  1 www-data www-data   325772 Nov 12 03:40 3e24006e9f8a255a44e904bb7ed1468d
-rw-r--r--  1 www-data www-data       24 Nov 12 03:40 64c4f0474c619541ce95050123e90403
-rw-r--r--  1 www-data www-data       24 Nov 12 03:40 6adda3e3999b7656b773175e199ce852
-rw-r--r--  1 www-data www-data       24 Nov 12 03:40 85ad83f12e33d61305a50109496eb5b3
-rw-r--r--  1 www-data www-data 38419969 Nov 12 03:40 b411fbba136749cd2993a77c76dbd141
-rw-r--r--  1 www-data www-data 38419969 Nov 12 03:40 c8a115030f9ffb4bcaf0ea1f530fb561

@wiz
Copy link
Contributor Author

wiz commented Nov 11, 2019

seriously? To me, this is a non-starter for using Google Appengine. Have you looked at using a reverse proxy cache like varnish locally instead?

I don't want to use Google on the long-term, like I said this is a short-term fix - we have a serious problem with the Markets API service... it actually crashed again today a few hours ago. I will replace the Google appengine code with something more privacy oriented later, but we need to move off the current architecture ASAP.

@dan-da
Copy link
Contributor

dan-da commented Nov 11, 2019

@wiz ok, that's good, it seems the opcache is working. In that case, I'm very curious what kind of traffic numbers you are seeing because the webserver by itself shouldn't start failing until 100's of requests/sec, maybe 1000+, and I really doubt you are seeing that from genuine usage.

Here are my best guesses what is actually the root cause:

  1. The bisq app (stats node) is running on the same machine to serve up the json files and is likely hogging too much resources. If that is the cause, I would recommend to put the bisq app on a separate machine using a shared filesystem so the webserver can access the .json files. The webserver can then be on a fairly lightweight frontend system.

  2. You might be seeing some sort of periodic DDOS that is truly overloading webserver. Are you doing anything to detect and/or mitigate this?

  3. some type of apache mis configuration. (unlikely)

If I were you, the first things I would check are:

  1. how many reqs'/sec (or reqs/min) is webserver receiving?
  2. Is any API call in particular slower than others?
  3. During a time of low usage when CPU is mostly idle, how many reqs/sec can it handle? (check with ab client, executing locally from server machine, calling each of the APIs). If you aren't seeing 1000+ reqs/sec eg for depth or trade api's, then something is wrong with installation. example: ab -n 100 -c 2 "https://markets.bisq.network/api/depth?market=xmr_btc"
  4. is bisq app using lots or resources (cpu/disk/mem) during times of slowdown?

If you wish to diagnose what the problem actually is now I can assist/advise further with that. Please open a separate issue in bisq-markets repo. If you just want to leave it as-is until new system is deployed whenever that is, I'll take my leave from this discussion.

@ripcurlx
Copy link

@wiz @bodymindarts I think it isn't a good idea to rush this now in the last hours of the proposal phase. Just forking the risq part and adding it to the organisation, but not compensating for the risq part is kind of a weird state as well. Sorry that I only had time to review it now short before the end of the proposal cycle, but I think it would be better to move this to the next cycle.

@wiz
Copy link
Contributor Author

wiz commented Nov 12, 2019

the webserver by itself shouldn't start failing until 100's of requests/sec

@dan-da Actually the API service does get hundreds of requests each second. I thank you for your PHP implementation and it served Bisq well over the years, but I think we are kinda outgrowing it and need something cloud based to handle the load, at least for the short term... We can look into self-hosted smart-caching solutions to move away from Google, but right now my primary goal is to return the service back to 100% uptime with reasonable performance and response times.

I think it isn't a good idea to rush this now in the last hours of the proposal phase.

@ripcurlx Well to be honest, and with all due respect, we're not rushing this... if you look at the commit logs you'll see we've been continuously testing and working out all the bugs in this API re-implementation over the past few weeks. I just waited to create the PR until we were certain it was 100% ready for review.

Since @m52go ACK'd it and @ripcurlx didn't find any issues with the new API service (other than discrepancies where the old backend is missing things), I'm going to proceed with migrating the production markets.bisq.network service to Google Appengine using this PR. We will continuously monitor the service, and if there are issues found we will fix them right away.

@wiz wiz merged commit bcbd4b6 into bisq-network:master Nov 12, 2019
@ripcurlx
Copy link

@wiz Don't get me wrong, but I think just merging something despite someone having still open questions is not something that sounds like seeking consensus at all.

Btw. I just saw that now the old API urls don't work anymore:
e.g. https://market.bisq.io/api/volumes?basecurrency=BTC&interval=day

What needs to be changed?

@wiz
Copy link
Contributor Author

wiz commented Nov 12, 2019

@wiz Don't get me wrong, but I think just merging something despite someone having still open questions is not something that sounds like seeking consensus at all.

I just looked over the thread in this PR and didn't see any unanswered questions, the only thing you said was that we shouldn't "rush" it, which I explained it wasn't a rush and something that had been tested for weeks. If you have questions by all means feel free to ask but AFAIK the service is running okay so far on the new cloud proxy.

Btw. I just saw that now the old API urls don't work anymore:
e.g. https://market.bisq.io/api/volumes?basecurrency=BTC&interval=day

Ah yes, this was totally my fault, I didn't know about the legacy hostnames still pointed to this service so I only migrated markets.bisq.network - I will add that hostname ASAP 😓

@ripcurlx
Copy link

ripcurlx commented Nov 12, 2019

Just forking the risq part and adding it to the organisation, but not compensating for the risq part is kind of a weird state as well.

I'm talking about this open point how we are handling the dependency on https://github.com/bodymindarts/risq/releases/tag/v0.3.5.

@wiz
Copy link
Contributor Author

wiz commented Nov 17, 2019

@ripcurlx Yeah I also want to fix this, now that we're putting the risq project on hold for a while can you coordinate with @bodymindarts to transfer the "main" risq repo to the bisq organization, and then we can each have our "forks" of it on our personal accounts and do PRs that way? Then I can update the README of the markets project to reference that in the install instructions, which still need to be added.

FYI since migrating the site it's now doing much better and serving over 2M requests per day, and often spiking to several hundred requests per second. I hope this helps you and everyone understand why I used risq for the markets API service, the native rust performance and cloud caching is really critical for this type of service:

Screen Shot 2019-11-16 at 20 56 07

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants