test stable branch git fix up #687
Closed
Runner lacks instrumentation. It is responsible for many things, and it has become hard to understand which tasks contribute to an indexer's overall latency. In addition, we are now at a point where we need to drive down latencies to facilitate new * indexer use cases such as access keys. I've chosen to instrument Runner with OpenTelemetry.

Tracing generally requires three components: an instrumented service, a trace collector, and a trace visualizer. The service collects and transmits trace data to the collector. The collector should be able to receive trace data with little fuss, to avoid impacting the performance of the instrumented service. The collector then processes the trace data and forwards it to the visualizer, which renders traces and allows filtering on them.

The benefit of OpenTelemetry over other options like Zipkin and Jaeger is that GCP already supports ingesting OpenTelemetry data. As such, we don't need to provision a collector ourselves, and can instead leverage GCP's existing collector & visualizer, the Cloud Trace service. For local development, traces can be output to the console, to a Zipkin all-in-one container, or to GCP (which requires the Cloud Trace Agent role and a project ID). This is done by simply initializing the NodeSDK differently. We do not yet want to enable traces in prod, so no exporter is specified there; this creates a no-op trace exporter which won't attempt to record traces.

No code changes were made to the execution path. All tests pass unchanged, aside from replaced snapshots due to changes in the indentation of mutation strings. I have manually verified the mutation strings are still the same by stripping whitespace and comparing against the originals.
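The exporter-selection idea above can be sketched as follows. This is a minimal illustration, not the actual Runner wiring: the env var names (`TRACING_EXPORTER`, `GCP_PROJECT_ID`) and the `selectExporter` helper are assumptions, and the exporter objects are stand-ins for the real OpenTelemetry exporter classes passed to the NodeSDK.

```typescript
// Illustrative stand-in for an OpenTelemetry SpanExporter.
type SpanExporter = { name: string };

// Choose an exporter from the environment. Returning undefined means the
// NodeSDK is initialized without an exporter, i.e. no-op tracing (prod).
function selectExporter(env: Record<string, string | undefined>): SpanExporter | undefined {
  if (env.TRACING_EXPORTER === 'console') return { name: 'console' };
  if (env.TRACING_EXPORTER === 'zipkin') return { name: 'zipkin' };
  // GCP export needs a project ID (and the Cloud Trace Agent role at runtime)
  if (env.TRACING_EXPORTER === 'gcp' && env.GCP_PROJECT_ID) return { name: 'gcp' };
  return undefined;
}
```

The key point is the default branch: leaving the exporter unset keeps tracing disabled without any code-path changes.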
Add support for `context.db.Table.select({column_name: ['a.near', 'b.near']})`. The same support is added for `delete`, and frontend support is included. I also improved parameter naming to reflect SQL clauses like `where`, `values` and `set`.
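The array-valued select above maps naturally to a SQL `IN` clause. The sketch below shows one way that translation could work; `buildSelectQuery` is a hypothetical helper for illustration, not the actual QueryApi internals.

```typescript
// Build a parameterized SELECT. Array values become IN clauses,
// scalar values become equality checks.
function buildSelectQuery(
  table: string,
  where: Record<string, unknown>
): { sql: string; params: unknown[] } {
  const params: unknown[] = [];
  const clauses = Object.entries(where).map(([column, value]) => {
    if (Array.isArray(value)) {
      // One placeholder per array element: column IN ($1, $2, ...)
      const placeholders = value.map((v) => {
        params.push(v);
        return `$${params.length}`;
      });
      return `${column} IN (${placeholders.join(', ')})`;
    }
    params.push(value);
    return `${column} = $${params.length}`;
  });
  return { sql: `SELECT * FROM ${table} WHERE ${clauses.join(' AND ')}`, params };
}
```

For example, `buildSelectQuery('posts', { account_id: ['a.near', 'b.near'] })` yields `SELECT * FROM posts WHERE account_id IN ($1, $2)` with the two account IDs as parameters.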
`context.db` build failures are logged rather than blocking, so complex schemas can still run with only GraphQL calls available. However, these logs were output repeatedly and were not tagged with an indexer name. This PR adds the indexer name to the log to aid debugging.
Checking provisioning status through Hasura takes 70-100ms on Dev. This PR caches the provisioning status inside `Provisioner` so we no longer make extra requests to Hasura on every run.
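The caching idea can be sketched as below. This is a minimal illustration assuming an injected Hasura check function; the class and method names are not the actual `Provisioner` API.

```typescript
// Cache the (expensive) Hasura provisioning check.
class ProvisionerCache {
  private provisioned = false;

  constructor(private readonly checkHasura: () => Promise<boolean>) {}

  async isProvisioned(): Promise<boolean> {
    // Only a positive result is cached: while unprovisioned we keep
    // checking, but once provisioned we never hit Hasura again.
    if (!this.provisioned) {
      this.provisioned = await this.checkHasura(); // the 70-100ms call
    }
    return this.provisioned;
  }
}
```

Caching only the positive result is the safer design choice here, since provisioning is a one-way transition: once a schema exists it stays provisioned.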
… 1st (Next Release) (#597) A temporary change in our codebase. Replacing the usage of node-sql-parser with kevin-node-sql-parser until April 1st. The reason for this substitution is that the official release of node-sql-parser lacks a version release for the additional SQL statements required for our current project needs. In the interim, this forked version addresses these shortcomings and allows us to incorporate the required SQL statements. Please note that this is a temporary measure, and we plan to revert to the official node-sql-parser version after April 1st, once the required features are officially available. See last comment for details
QueryApi has experienced issues with Postgres connections since the introduction of * indexers, due to how QueryApi creates these connections through application-level connection pools. Since we can't use one pool for all workers, I've introduced PgBouncer as middleware to serve as an additional connection pooler in front of the DB.
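For reference, a PgBouncer setup of this shape can be expressed with a config fragment like the one below. The values are assumptions for illustration only (pool mode, sizes, and hostnames are not taken from the actual deployment); only the 6432 listen port is referenced elsewhere in these notes.

```ini
; Illustrative pgbouncer.ini fragment - values are assumptions,
; not the actual QueryApi deployment config.
[databases]
; Route every database through the pooler to the real Postgres.
* = host=postgres port=5432

[pgbouncer]
listen_port = 6432        ; clients connect here instead of 5432
pool_mode = transaction   ; share server connections between transactions
max_client_conn = 1000    ; many workers, few real connections
default_pool_size = 20
```

Transaction pooling is the usual choice when many short-lived workers each open their own pool, since server connections are returned between transactions rather than held per client.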
- Upgrade `near-lake-primitives` to `0.2.0`, which includes `borsh` - Expose entire `near-lake-primitives` library to VM via `primitives`, e.g. borsh can be accessed via `primitives.borsh.fromBorsh()`
This PR expands provisioning to also schedule the cron jobs for adding/deleting log partitions. It assumes:

1. The `cron` database exists and has `pg_cron` enabled (near/near-ops#1665)
2. The `__logs` table exists and has the partition functions defined (#608)

In relation to this flow, the high-level steps are:

1. Use an admin connection to the `cron` database to grant the required access to the user
2. Use a user connection to the `cron` database to schedule the jobs

The cron job is executed under the user who schedules it, therefore the user _must_ schedule the job themselves, as they are the only ones with access to their schemas. If the admin were to schedule the job, the job itself would fail as it doesn't have the required access. Merging this before assumption 2 is in place is fine; the jobs will just fail, and should start to succeed after it has been implemented.
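The scheduling step above could look roughly like the sketch below. The `pg_cron` `cron.schedule` call and the `fn_create_partition`/`fn_delete_partition` functions are from this repo's flow, but the job names, cron expressions, and interval arguments here are assumptions for illustration.

```typescript
// Generate the SQL run over the *user* connection to the cron database.
// Schedules/intervals are illustrative, not the production values.
function scheduleLogPartitionJobs(schema: string): string[] {
  return [
    // Daily: create the upcoming partition of the user's __logs table.
    `SELECT cron.schedule('${schema}_create_partition', '0 1 * * *',
      $$SELECT fn_create_partition('${schema}.__logs', CURRENT_DATE, '1 day', '2 day')$$);`,
    // Daily: drop partitions that have aged out of the retention window.
    `SELECT cron.schedule('${schema}_delete_partition', '0 2 * * *',
      $$SELECT fn_delete_partition('${schema}.__logs', CURRENT_DATE, '-15 day', '-14 day')$$);`,
  ];
}
```

Because `pg_cron` runs each job as the user who scheduled it, generating these statements for the user connection (rather than the admin one) is what gives the jobs access to the user's schema.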
This PR adds a very basic integration test for `Indexer`. It uses `testcontainers` to stand up both `postgres` and `hasura` so that `Indexer` can talk to real components rather than mocks. The test uses `Indexer` directly, which means S3/Redis are still somewhat mocked/ignored. We can add those later if need be. This is essentially just the scaffolding for integration testing, which can be expanded over time. The suite includes only one very basic test which, if successful, should provide a fair amount of confidence that things are working as expected. The flow includes: provisioning, writing data to Postgres, and then asserting its existence via GraphQL. All errors bubble up from `Indexer`, so this test should catch most problems. This PR points at #625, so that I could test the `pg_cron` flow via this integration test :)
Indexer schemas can have quoted or unquoted table & column names. However, QueryApi always quotes table names and never quotes column names during SQL query construction for `context.db`. This is because the AST generated from parsing the schema did not include whether the identifier was quoted. Recent updates to the parsing library have added this functionality. I've updated QueryApi to quote both table names and column names if they were quoted originally, and to leave them unquoted otherwise. In addition, I've replaced kevin-node-sql-parser with the original package now that the 5.0 update has been released. I've also added a typescript examples folder for convenience, as well as a script to clear the local postgres database.
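The quoting rule can be captured in a small helper. This is a sketch (the function name is illustrative, not the actual QueryApi code) of rendering an identifier based on whether the schema author quoted it.

```typescript
// Quote an identifier only if it was quoted in the original schema.
// Postgres folds unquoted identifiers to lower case, so quoting only
// when the user quoted preserves their intent exactly.
function renderIdentifier(name: string, wasQuoted: boolean): string {
  // Double any embedded quotes per SQL identifier-escaping rules.
  return wasQuoted ? `"${name.replace(/"/g, '""')}"` : name;
}
```

So a schema declaring `"MyTable"` keeps its exact casing via quoting, while a plain `posts` stays unquoted and is folded by Postgres as usual.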
DmlHandler was using the port number handed to it by Hasura, which is 5432. We want it to use 6432, the port specified by the env variable, which points to PgBouncer.
The admin/cron connection is currently hard-coded to the `cron` database, but this needs to be configurable so that we can use the default DB (`postgres`) locally. Additionally, this PR combines the `pgbouncer` / `pg_cron` init scripts and uses the combined output in both docker compose and integration tests.
Prototype Draft: Integrated the new logs schema. We are writing to both the default `public.indexer_logs_entries` table and provisioning a new logs table for new indexers under the user's provisioned database and respective schema. https://www.loom.com/share/ff21d7099cac403d9152c905f7e4ddcc?sid=5828ae99-377b-4510-ac8c-76c02fd232f2
Quick & dirty PR which short-circuits updating the status via GraphQL when it is unchanged. We don't need to check whether block height has changed, as that is updated in a separate call later on. I've also updated the integration tests to assert the output of status/logs.
I introduce the code necessary to perform status and last processed block height writes through Postgres. I also refactored DmlHandler and its usage in Indexer as caching of the database credentials allows for a simplification of its constructor.
The provisioning flow will not be run for existing Indexers, so this PR adds a separate provisioning check/step which sets up the partitioned logs table for existing users. I've opted for an in-code approach, as a "manual" migration script requires specific timing, i.e. we'd need to deploy the logs change, ensure all new Indexers are provisioned correctly, and then migrate all existing users to ensure that no Indexers are missed. But since the logs provisioning change is coupled with the logging itself, existing Indexers would fail to log until the migration is complete. My only concern with this approach is a "thundering herd": after this is deployed, all Indexers will attempt to provision their logs table at the same time - I'll monitor this in Dev. As this code is temporary, I didn't bother adding instrumentation/unit tests, nor worry about the performance impact. It will be removed promptly. This is dependent on #608 and should be merged after it.
This PR adds [tracing-stackdriver](https://github.com/NAlexPear/tracing-stackdriver), which outputs logs in a GCP compatible JSON format. To enable this, the `GCP_LOGGING_ENABLED` environment variable must be set. Further, I've added additional context to errors to aid debugging. near/near-ops#1695
The first few commits are from branch #640; this branch was created based off it. This PR introduces creation of the logs table by provisioning the logs schema and the accompanying cron jobs for new users, but does not yet use or write logs to the new logs table itself. If this is merged by itself, new users will have an unused logs table, but the provisioning will occur. To provision existing users, see #636.
Migrates any data related to the Indexer into a common class, to simplify interactions with common data like AccountId. I've also added an integration test for context DB.
Enable conditional provisioning of metadata table.
The object we receive from node-sql-parser's astify() changed. We can think about version control in the future. Quick fix here: returning the object to its previous shape, and adding an additional layer in Editor to ensure types are mounted.
Uncommented functionality so we actually start writing logs to the new tables that were provisioned in #643. The old logging implementation remains untouched as it still functions (although it has been renamed from writeLog -> writeLogOld). We are writing to both log tables.

### 1. Provisioning and Logging (to both tables) for a new Indexer

https://www.loom.com/share/3ad6d6ea3368412e8896340a74759ffb?sid=4d5379e8-5401-41bf-9e38-d0f8e8c4eca5

### 2. Logging (to both tables) for an existing Indexer

https://www.loom.com/share/4ba411f2bcb740e1842650f695ffb347?sid=253ced68-9d4c-459f-871b-b0a3ee00cd91

### 3. Provisioning and Logging the new logs table for an existing Indexer (that does not have a logs table)

https://www.loom.com/share/2aa7c0cc882f4dbdb9e51fc2a9e9b7b9?sid=1aa511fe-3054-4d27-9996-2b9fddc44ed8
Depends on near/near-lake-framework-rs#102. This PR exposes a new metric which counts the number of Get requests made to S3 by `near-lake-framework`. I wanted to start tracking this metric _before_ I merge the change which reduces them, so I can measure the impact of that change. The easiest way to track these requests was to pass a custom `S3Client` to `near-lake-framework`, so we can hook into the actual requests made. The custom `S3Client` (`LakeS3Client`) is exactly the same as the default implementation in `near-lake-framework` itself, but with the added metric. This is essentially part 1 of #419, as the "reduction" in requests will build on this custom client, adding caching/de-duplication.
The cron statement functions were attempting to access a non-existent scoped property. Added syntax for dynamic SQL generation to properly traverse the scope. Tested by setting the `fn_create_partition` cron job to trigger every 30 seconds. Previously we would not see the Non Trackable functions and would get the original error message below. Now we are able to view the 2 Non Trackable functions and the row succeeds. <img width="686" alt="Screenshot 2024-04-16 at 7 39 10 PM" src="https://github.com/near/queryapi/assets/42101107/b67e6e49-2f66-46d8-a41e-e1b51a6a2f06"> <img width="1378" alt="Screenshot 2024-04-16 at 8 02 24 PM" src="https://github.com/near/queryapi/assets/42101107/fbb4946e-8f23-4fc3-8765-0fad676897d1"> Original error: `ERROR: function fn_delete_partition(unknown, date, unknown, unknown) does not exist LINE 1: SELECT fn_delete_partition('kevin33_near_component_01.__logs... ^ HINT: No function matches the given name and argument types. You might need to add explicit type casts.`
Enable writes of Status and Last Processed Block Height to Metadata table. Reorganizes provisioning to ensure writing of PROVISIONING status. Ensures IndexerMeta is available for writing error logs.
- fix: Use compatible types across inter-dependent crates - fix: Clippy
Each `BlockStream` uses its own dedicated `near-lake-framework` instance, and hence manages its own connection to S3. This leads to many duplicate S3 requests, particularly across the large majority of Indexers which follow the network tip and therefore request the same block data at the same time. This PR introduces a shared S3 client to be used across all `near-lake-framework` instances. `SharedLakeS3Client` ensures that duplicate requests made within a short time-frame, including those made in parallel, result in at most a single request to S3. ## Cache Strategy This implementation will mostly impact `BlockStream`s following the network tip, i.e. `From Latest`. These streams wait for new data in Near Lake S3 and request it as soon as it is available, all at the same time. Caching the result alone would therefore not be enough: by the time we actually prime the cache, all other requests would have missed it and fired a request of their own. Locking while the request is in-flight is also not feasible, as this would force _every_ request to execute in sequence. So instead of caching the result of the request, we cache its computation. The first request initiates the request and stores its `Future`; all subsequent requests retrieve that `Future` from cache and `await` its result, ensuring at most one underlying request. ## Performance Impact My main concern with this implementation is its impact on performance. Each request must block to check the cache, introducing contention/delays. The lock is only held while checking the cache, not while the request is being made, so my hope is that the impact is small. This may need to be iterated on over time. From local testing the impact seemed negligible, but that was with 5 Indexers; it may be worse with many. I've added a metric to measure lock wait time, to determine whether this contention becomes a problem.
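The "cache the computation, not the result" strategy can be shown in a few lines. The real `SharedLakeS3Client` is Rust and additionally bounds the cache and measures lock wait time; the sketch below is a TypeScript illustration of just the deduplication idea, with hypothetical names.

```typescript
// Concurrent callers for the same key share one in-flight promise.
class DedupCache<T> {
  private inFlight = new Map<string, Promise<T>>();

  get(key: string, fetch: () => Promise<T>): Promise<T> {
    let p = this.inFlight.get(key);
    if (!p) {
      p = fetch();               // first caller starts the request...
      this.inFlight.set(key, p); // ...and stores the promise itself
    }
    return p; // later callers await the same promise, no extra request
  }
}
```

A production version would also evict entries once they resolve (or after a short TTL), so the cache only covers the window in which duplicate requests actually occur.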
Logs Table and Metadata Table are necessary for important functions of QueryApi. Hasura often invisibly fails to track tables and add permissions. These operations need to be successful, so I added a check which verifies tracking and permissions are correct, and reattempts them if not. When successful, the result is cached. In a successful case, the expensive hasura calls (getTableNames and getTrackedTablesWithPermissions) are done at most twice. I also combined the conditional provisioning functions since they are verified to work already.
The logs and metadata tables were created with a `__` prefix. Unfortunately, it turns out that prefix is reserved by Hasura. So, we are renaming the tables to be prefixed with `sys_`, which is not reserved, to the best of my knowledge. The specific migration process is:

1. Delete and recreate the cron DB in dev. This deletes the DB and any scheduled jobs.
2. Delete the logs/metadata tables and any created partitions.
3. Create the new tables.
4. Use the new tables successfully.
…ilures (#675) Errors in provisioning should be logged to the machine, as they can be overwritten by errors in the finally block of the parent try/catch. We ideally want to move provisioning out to its own try/catch, but this is a simple fix for the time being. In addition, the PRs to replace the logs/metadata tables failed because untracking was only partially successful. This PR tolerates untracking errors.
Everything has been migrated to the new tables. This code hook for deleting the old tables is no longer necessary.
This PR adds `winston` to introduce structured logging, and also write GCP compatible logs when `GCP_LOGGING_ENABLED` is set.
`context.set` was constructing an incorrect query #646 - this corrects that query.
Frontend now queries the new metadata table instead of the old one. <img width="1013" alt="image" src="https://github.com/near/queryapi/assets/22734869/4eae0f19-eda6-4e28-b244-cb78453aeeea"> Indexers which have not successfully run since the introduction of the new logs table will not have a last processed block height, since they never processed a block successfully. So, I opted to display it as N/A with a tooltip on hover which explains why it's N/A. <img width="1015" alt="image" src="https://github.com/near/queryapi/assets/22734869/38bc3565-41e1-4562-9966-371337e673fb">
My due diligence was insufficient. As it turns out, yarn.lock IS used by the frontend during development. Adding it back.
Creates a new `winston` transport to count logs by level, and exposes a new Prometheus metric to record this value. Additionally, metrics recorded on the main thread were not captured. This was not an issue as the majority of metrics were recorded within the worker, but since we also log on the main thread, this PR updates metrics aggregation to expose main thread metrics.
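The counting transport can be sketched as below. To keep the sketch self-contained it omits winston's actual `Transport` base class; the `log(info, next)` shape mirrors how winston transports receive entries, and a Prometheus counter would read from `counts`.

```typescript
// Transport-like sink that tallies log entries per level.
class LogCounter {
  readonly counts: Record<string, number> = {};

  log(info: { level: string }, next?: () => void): void {
    this.counts[info.level] = (this.counts[info.level] ?? 0) + 1;
    next?.(); // signal winston that the entry was handled
  }
}
```

Counting inside a transport means every logger in the process contributes to the metric, regardless of which thread or module emitted the log line.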
Commits:

- DPLT-927 Write `current_historical_block_height` during historical backfill (#102)
- chore: Add `CODEOWNERS` (#125)
- DPLT-1042 Enable `backend_only` mutations by default (#124)
- DPLT-1044 Change `HISTORICAL` dimension to `EXECUTION_TYPE` (#130)
- DPLT-1044 Expose `fetch` to VM runner (#131)
- DPLT-1044 Add `context.fetchFromSocialApi` (#134)
- chore: Remove `backend_only` mutation default (#150)
- fix: Local development & add steps in `README.md` (#173)
- feat: Add `indexer_log_entries` indexes to `hasura` migrations (#238)
- fix: Rename `close-completed-issues` to `close-completed-issues.yml` (#327)
- fix: Use `Histogram` to prevent stale metrics being scraped (#332)
- fix: Expose `HASURA_ENDPOINT_V2` env var to Lambda pipeline (#349)
- fix: Write `UNPROCESSED_STREAM_MESSAGES` metric on both failure/success (#352)
- refactor: Remove async `indexer_rules_engine` (#387)
- test: Make `block-streamer` unit-testable and add tests (#442)
- feat: Make log/state Hasura tables `backend_only` (#462)
- fix: Ensure array is returned to `Promise.all` (#504)
- refactor: Configure `coordinator`/`block-streamer` via environment (#503)
- feat: Only start indexers set within the `allowlist` (#518)
- feat: Add `Dockerfile` for Coordinator V2 (#519)
- feat: Handle `StartBlock` options within Coordinator & Block Streamer (#553)
- feat: Remove `allowlist` from Coordinator V2 (#570)
- feat: Expose `near-lake-primitives` to VM (#613)
- test: Add integration tests for `Indexer` (#627)
- feat: Count S3 get requests made by `near-lake-framework` (#662)
- fix: Write to `winston` instead of `console` (#681)
- fix: Correct `context.set` graphql query (#683)