Testnet & Mainnet Status Page testnet.polykey.com mainnet.polykey.com #599
Comments
We should also consider whether this should just be part of the general backend for Polykey Enterprise, since we are setting up a Next.js application server there anyway. It is important to know that the public web page cannot directly connect to the PK testnet node's client service, since that requires authentication, and the websocket transport on PK testnet requires the specialised js-ws library, which doesn't yet work in the browser anyway.
There are also other inspirations like https://status.hey.xyz/.
@tegefaulkes thoughts on using SRV records? @amydevs any notes on SRV records?
Are SRV records really what we want? AFAIK they're only used to specify a port for a service on an existing address.
It's one of the alternatives to the A and AAAA records.
Why not expose ports 80 and 443 from the container and serve static HTML/JS files with the dashboard, or proxy traffic somewhere? Also, I'm probably still missing context: do you want Polykey to be a self-sufficient service (so we can run it fully functional with pure Docker), or do you want it to use AWS / Cloudflare infrastructure? Depending on that we can think of a clear and easy solution.
We have an issue regarding an HTTP status page for the PK agent itself: #412. That's separate from this issue, which is about a status page for the entire network. This would be unique to testnet or mainnet, and not part of any PK node. So exposing 80/443 wouldn't be sufficient, as that would just show the agent's own status page.
How do you think we should collect and store historical data? I can imagine a couple of scenarios:
There is also a commercial all-in-one solution called Datadog: https://www.datadoghq.com/pricing/. In my view it's quite costly, and I had issues with its flexibility and maintenance. What do you think?
About the DNS record problem: the most straightforward solution I can see is to use another domain, say a subdomain like status.testnet.polykey.com or panel.testnet.polykey.com.
We want to keep the agent process minimal, so the pull model is probably better. Logs-wise we can output in different formats.
It's not just a technical issue, it's also optics. It's just smoother to point everybody to testnet.polykey.com or mainnet.polykey.com. We should be able to switch to using SRV records since we control the entire DNS resolution process. The main issue is hosting the dashboard. Not sure whether a Cloudflare Worker can do this with live updates, or whether we extend our current PKE to handle it.
I've assigned this to @okneigres; please spec out the task list in the OP. I still need to set up your email and some account access, which I'll do after our meeting.
The name of the game right now will be speed, so if we can get away with hosting our logs and metrics data elsewhere, that would be best. Later we can incorporate this into our PKE infrastructure.
Just tried this:
It gives you:
So you still need a special hostname to point to the cluster IP addresses via A/AAAA records. That does mean we can still reserve
I think a simple solution is to use https://no-ip.com/ to get a hostname to point to an IP, which can then be used with the corresponding
Some notes about SRV records. Basically when an SRV record is created, it always has this structure:
Technically it could even be:
So what that means is that you then have to use
So anyway, I talked to ChatGPT (https://chat.openai.com/share/48b02e63-414c-4a8f-838c-c441c3c2e1c4) about this, and this is what it suggests:
This also enables us to support
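To make the mechanics concrete, here is a minimal sketch (Node.js/TypeScript) of what resolving such a record could look like. The record name _polykey-agent._udp.testnet.polykey.com is purely illustrative, not an actual record we have deployed; the point is that the SRV record only hands back a target hostname and port, so the target still needs its own A/AAAA records.

```typescript
// Minimal sketch (Node.js) of resolving a hypothetical SRV record for the testnet.
// The record name '_polykey-agent._udp.testnet.polykey.com' is illustrative only.
import dns from 'node:dns/promises';

async function resolveAgents(): Promise<void> {
  // resolveSrv returns priority, weight, port and a target hostname;
  // the target itself still needs A/AAAA records to yield an IP address.
  const records = await dns.resolveSrv('_polykey-agent._udp.testnet.polykey.com');
  for (const { name, port, priority, weight } of records) {
    const addresses = await dns.resolve4(name);
    console.log({ name, port, priority, weight, addresses });
  }
}

resolveAgents().catch(console.error);
```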
The place to modify this resolution process would be the
@okneigres can you explore this https://developers.cloudflare.com/workers/examples/websockets/ to see if it is a viable option for hosting the page? Or if not, let's just go straight to implementing it on the PKE.
Changing to #599 (comment) would give us:
Based on today's discussion, some things still need to be figured out:
As I mentioned, it would be neat to see what the network looks like with a force-directed graph: https://observablehq.com/@d3/disjoint-force-directed-graph/2?intent=fork. It would be a neat visualisation, but also useful for viewing the connectivity of the network and how it forms.
While that is cool, I don't think it would be possible for us to efficiently or accurately represent such a map. I also imagine such information might be a privacy issue, even though it's a public network. The most we could do is show some representation of the node graph, but the geo visualisation would be the most impactful for now. @okneigres this https://github.com/maxmind/GeoIP2-node is the most official library for this; it works only server side atm, not client side. You might need to investigate.
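For reference, a rough sketch of how a server-side lookup with that library could look. The database path and IP address are placeholders, and a GeoLite2/GeoIP2 City database would need to be downloaded from MaxMind first:

```typescript
// Rough sketch of a server-side geo lookup with @maxmind/geoip2-node.
// The database path and IP address are placeholders.
import { Reader } from '@maxmind/geoip2-node';

async function lookup(ip: string) {
  // Reader.open loads the MaxMind .mmdb database file into memory.
  const reader = await Reader.open('/var/lib/geoip/GeoLite2-City.mmdb');
  const city = reader.city(ip);
  return {
    country: city.country?.isoCode,
    latitude: city.location?.latitude,
    longitude: city.location?.longitude,
  };
}

lookup('128.101.101.101').then(console.log).catch(console.error);
```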
Make sure you're not doing derived calculations on every RPC request. That should just be done later by the analysis system.
The Prometheus-compatible metrics system should be a new issue. You want to complete this epic for just deployment and show all of that in the dashboard.
MatrixAI/Polykey-CLI#93 cannot close this epic. We still need the final design of the deployment version table on the dashboard to close this off.
@amydevs new issues required for:
Going to try out Mimir locally, and submit some metrics.
@amydevs I think your diagram is missing a critical piece of the puzzle here: the Supabase database storing all the relevant metadata and transactional data (especially if metric data is going to Mimir).
This was necessary to get Mimir working:
mimir = {
enable = true;
configuration = {
ingester = {
ring = {
replication_factor = 1;
};
};
multitenancy_enabled = false;
no_auth_tenant = "anonymous";
server = {
http_listen_network = "tcp";
http_listen_address = "";
http_listen_port = 8080;
http_listen_conn_limit = 0;
grpc_listen_network = "tcp";
grpc_listen_address = "";
grpc_listen_port = 9095;
grpc_listen_conn_limit = 0;
};
common = {
storage = {
backend = "s3";
s3 = {
# this is using R2 storage with special Mimir Prototype tokens
# note that this is not the same as the tokens for Cloudflare
# endpoint must not have protocol attached
endpoint = "....r2.cloudflarestorage.com";
region = "auto";
secret_access_key = "";
access_key_id = "";
};
};
};
blocks_storage = {
s3 = {
bucket_name = "mimir-blocks";
};
};
alertmanager_storage = {
s3 = {
bucket_name = "mimir-alertmanager";
};
};
ruler_storage = {
s3 = {
bucket_name = "mimir-ruler";
};
};
};
};
It wasn't well documented, but basically the 3 buckets need to be created ahead of time. Then an HTTP endpoint and a gRPC endpoint are needed, the replication factor has to be 1 otherwise it will refuse, and you have to disable multitenancy if there's only 1 org using this thing. The URL to push Prometheus records to is then:
I don't yet see any blocks uploaded to R2, but apparently:
But the docs seem wrong in a few places, so buyer beware: grafana/mimir#4187. Debugging service config takes too long; it's best to run the commands themselves first in the foreground in a nix-shell, then transfer the working config to the system level later.
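As a sanity check that ingestion works, a small sketch follows. It assumes multitenancy is disabled as in the config above, that the HTTP port is the 8080 configured there, and that the host is a placeholder; Mimir serves a Prometheus-compatible read API under the /prometheus prefix, while the remote-write push endpoint is /api/v1/push on the same port.

```typescript
// Hedged sketch: query Mimir's Prometheus-compatible API to confirm that
// pushed metrics are actually queryable. Host and query are placeholders.
const MIMIR_HTTP = 'http://localhost:8080';

async function queryMimir(promql: string): Promise<unknown> {
  const url = `${MIMIR_HTTP}/prometheus/api/v1/query?query=${encodeURIComponent(promql)}`;
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Mimir query failed: ${response.status}`);
  }
  return response.json();
}

// 'up' lists any targets whose samples have been remote-written in.
queryMimir('up').then((result) => console.log(JSON.stringify(result, null, 2)));
```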
To get build time information into PK CLI, we use
However during a nix-build, it ignores the
One way is to use Nix to get just the info we need: https://chat.openai.com/share/849f4c37-2cc1-48c1-9df7-00a2904f9819
Alternatively just bring
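One hypothetical shape for this (all names below are placeholders, not the actual Polykey-CLI implementation): have the build step, whether an npm script or the Nix derivation, write a small JSON file that the CLI simply reads at runtime.

```typescript
// Hypothetical sketch: a prebuild step (npm script or Nix derivation) writes
// buildinfo.json next to the compiled output, and the CLI reads it at runtime.
// None of these names come from the actual Polykey-CLI codebase.
import fs from 'node:fs';

interface BuildInfo {
  version: string;      // e.g. taken from package.json at build time
  commitHash?: string;  // e.g. git rev-parse HEAD, if available in the build sandbox
  buildDate: string;    // ISO timestamp recorded when the artifact was built
}

const buildInfo: BuildInfo = JSON.parse(
  fs.readFileSync(new URL('./buildinfo.json', import.meta.url), 'utf-8'),
);

console.log(
  `Polykey CLI ${buildInfo.version} (${buildInfo.commitHash ?? 'unknown'}) built ${buildInfo.buildDate}`,
);
```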
@tegefaulkes as mentioned earlier today, the version metadata should be working, deployed, and fetchable from PKNS. So you should be able to complete the entire CI loop in #94 now. That will be the priority. @amydevs for the remainder of this epic, it's about fixing up how we get the build information per the above comment, storing deployment information into the Supabase DB, graduating the configuration of Supabase to Pulumi, and updating PKND to show that deployment information. Metrics infrastructure can be done separately.
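For the deployment-information part, a hedged sketch of what writing a deployment row with supabase-js could look like; the deployments table name and its columns are assumptions, not the actual PKE/PKNS schema:

```typescript
// Hedged sketch of recording deployment metadata in Supabase with supabase-js.
// Table name and columns are assumptions; env var names are placeholders.
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!,
);

async function recordDeployment(
  network: 'testnet' | 'mainnet',
  version: string,
  commitHash: string,
): Promise<void> {
  const { error } = await supabase.from('deployments').insert({
    network,
    version,
    commit_hash: commitHash,
    deployed_at: new Date().toISOString(),
  });
  if (error) {
    throw new Error(`Failed to record deployment: ${error.message}`);
  }
}
```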
We will begin dog-fooding Polykey as soon as this is done. @tegefaulkes @amydevs @brynblack
After that, that should be it, yeah. I have a separate issue on PKNS for tracking the state stuff @CMCDragonkai.
I think we should be changing our
Did an issue get created for this @tegefaulkes? Also I think to fully finish off this issue we should have the rest of the deployment information. But as it stands, we can leave this closed.
I have an issue MatrixAI/Polykey-CLI#97 for it.
@amydevs mentioned that the metrics are now batched up by PKNS and stored in TimescaleDB, so PKND is not re-requesting data from the AWS backend. So it's pretty good right now. Fetched every 30 minutes atm with a whole batch of data.
The rest of the deployment table information is being discussed in the Orchestrator project now.
@amydevs since PKNS has to be contacted by PKND, is it exposed to the wider internet, or is there a CF worker API acting as a gateway under
If it is exposed to the wider internet we should have TLS support for these. If it's locked to the CF worker gateway (access-control wise), then we can be less stringent.
The Cloudflare DNS records have the proxy enabled, which means TLS is automatically enabled. The proxy forwards the TLS traffic to our insecure endpoints on the AWS IPv4 address. This is only done for
Hmm, I'm not understanding this. Is it possible to publicly hit the PKNS endpoint via a non-TLS route?
Specification
The testnet 6 deployment (#551) now being done shows us the value of having a single dashboard for tracking analytics and operational metrics of the testnet.
Right now AWS's dashboards and logging really suck. The CloudWatch dashboard is hard to configure and doesn't automatically update in relation to changes in our infrastructure (it's not configured through our infrastructure deployment code). The logging system is also hard to navigate: there are a lot of IDs in AWS that relate to different resources, and it's hard to correlate all these resources together in relation to the actual nodes that we have deployed.
Of particular note are these pages:
What we would like instead is to aggregate information and place it on testnet.polykey.com.
Here are some examples.
Current cloudwatch:
There are some challenges though. Right now we use A records on Cloudflare to route testnet.polykey.com. You can see here that the 2 A records correspond to the Polykey testnet node container tasks.
If we navigate to testnet.polykey.com, that would try to use one of those IPs and access via port 80 or 443. We should prefer 443 of course (HTTPS by default). Now browsers will do some sort of resolution:
So actually it's a bit problematic. We can't use those A records, as they are going to point to Polykey nodes directly. We would want to route to a service, potentially a Cloudflare Worker, to show the testnet network status page visualisation.
One way to do this is through Cloudflare proxying. You can enable proxying and add rules to Cloudflare so that it can show different DNS records. I think this may not work, and the simplest solution is actually to use a different record type.
So DNS record types that are relevant could be:
If we do change to using SRV records, we also need to address the changes for bootstrapping into private networks.
Also, in terms of setting up the dashboard, we could use a Cloudflare Worker, which would not be long-running. Not sure how to set this up. Another way is to always route to a Cloudflare Worker and have the worker do all the routing between the HTTP status page and the actual nodes. Cloudflare Workers seem quite flexible: https://developers.cloudflare.com/workers/examples/websockets/ (see the sketch below).
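For reference, the linked Cloudflare example boils down to something like the following sketch. It only echoes messages back; a real status page worker would instead stream dashboard updates or proxy to wherever the dashboard data lives. WebSocketPair and the webSocket response field are specific to the Workers runtime.

```typescript
// Minimal sketch of a WebSocket-capable Cloudflare Worker, following the
// linked example. It echoes messages back; a real status page worker would
// push dashboard updates instead.
export default {
  async fetch(request: Request): Promise<Response> {
    if (request.headers.get('Upgrade') !== 'websocket') {
      return new Response('Expected a WebSocket upgrade request', { status: 426 });
    }
    const [client, server] = Object.values(new WebSocketPair());
    server.accept();
    server.addEventListener('message', (event) => {
      // Echo back whatever the browser sent; replace with status updates.
      server.send(`echo: ${event.data}`);
    });
    return new Response(null, { status: 101, webSocket: client });
  },
};
```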
Additional context
The above shows a sort of global network status for Rocket Pool, but I think Grafana can show all of that too.
Tasks
1. testnet.polykey.com and mainnet.polykey.com to Dashboards.
2. ${nodeId}.testnet.polykey.com to nodes_polykey_agent._udp.testnet.polykey.com SRV records to ${nodeId}.testnet.polykey.com A records.
3. testnet.polykey.com and mainnet.polykey.com records to point towards the dashboard.