[RFC] OpenSearch.org search functionality #1219
Comments
Shouldn't this RFC be in opensearch-project/project-website?
This design doc is a good example of a deep dive. I have a couple of high-level comments/questions:
We generally get more eyeballs on this repo, and since I wanted this to be a reference implementation for OpenSearch, I wanted the eyes that would be here on it :) But happy to move it over in a bit, because you're right, that's where the rubber will hit the road. Thanks!
@CEHENKLE It's also being tracked on project-website #229. I understand the 'higher profile' but, to a degree, this sits between several repos. From the perspective of someone trying to find or comment on it, I think it belongs in project-website or documentation-website.
Architecturally, this is pretty Amazon-centric. For someone to recreate this service, they would pretty much have to be on AWS. This seems to defeat tenet #1. I think there are some very interesting things that are in neither the diagram nor the description. How do you get search queries to this search service? How do you take web pages and extract content from them for OpenSearch, etc.? Also, am I correct that the only real code here lies in Lambda and the rest is configuration?
This repository provides a lot of resources that use AWS to create and manage different products. If we add investment to this proposal and get it added to AWS, it will be another feather in the cap of OpenSearch.
Opensource in
Clients (browsers) use the exposed search endpoint (provided by API Gateway) and render results based on the response from the search service.
It is left to users to decide how to extract content from their website and ingest it into the cluster (general idea). They can either crawl the web pages and extract content, or use the content directly if they have the source.
Mostly true.
Thanks @stockholmux . I will add more details/diagrams to showcase those parts.
Tracking Issue: #696
Overview
We plan to introduce search capabilities for end-user documentation, Javadocs, etc. on opensearch.org. The current search functionality is based on a client-side lookup against a local index file. We want to power this search functionality with an OpenSearch cluster (well, why not!) and provide a reference implementation so that other users can reuse it to power their own app search using OpenSearch.
We are looking for feedback and to discuss this potential solution.
Tenets
Requirements
How would it work?
Proposed Architecture
A common way to create a search application with OpenSearch is to use web forms to send user queries to a server. You can then authorize the server to call the OpenSearch APIs directly and have the server send requests to the OpenSearch cluster.
We could write client-side code that doesn't rely on a server; however, we would have to compensate for the security and performance risks. Allowing unsigned, public access to the OpenSearch APIs is inadvisable. Users might access unsecured endpoints or impact cluster performance through overly broad queries (or too many queries). For examples, see #687 and #1078.
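To illustrate why the server should mediate queries rather than expose the OpenSearch APIs directly, here is a minimal sketch of the kind of query-building a search Lambda could do. The field names, size caps, and function name are illustrative assumptions, not part of the proposal:

```python
# Hypothetical sketch: the server-side handler accepts only a plain text
# string from the client and builds a bounded query itself, so clients can
# never send raw (overly broad or malicious) query DSL to the cluster.

MAX_QUERY_LEN = 256   # assumed limits, not from the RFC
MAX_RESULTS = 25

def build_search_body(user_query: str, size: int = 10) -> dict:
    """Turn raw user input into a bounded OpenSearch query DSL body."""
    if not user_query or len(user_query) > MAX_QUERY_LEN:
        raise ValueError("query must be 1-%d characters" % MAX_QUERY_LEN)
    return {
        "size": min(size, MAX_RESULTS),           # cap the result-set size
        "query": {
            "multi_match": {
                "query": user_query,              # full-text match only
                "fields": ["title^2", "content"]  # illustrative field names
            }
        },
    }
```

The point of the sketch is the shape of the boundary: the client supplies a string, the server supplies the DSL.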
NOTE: We will be using AWS for infrastructure, but users are free to use other cloud providers that provide similar functionality or use self-hosted solutions.
Based on the above requirements, we have come up with a minimal set of components to achieve the search functionality.
Components
1. OpenSearch cluster
The OpenSearch cluster will comprise a set of data nodes and master nodes, each inside their own auto-scaling group. The nodes will be hosted on EC2. Auto-scaling groups (ASGs) will protect against node drops by replacing a failed node with a new one.
To protect against availability zone (AZ) outages, all nodes within an ASG will be spread across at least 3 AZs. This configuration requires at least 3 data nodes and 3 master nodes.
A Network Load Balancer (NLB) will balance incoming requests across the data nodes and provide a single endpoint to access the cluster. All the resources will be created in a private subnet, and thus will not be reachable over the internet.
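The AZ-spread requirement above can be sketched as a simple placement rule. This is only an illustration of the "at least one node per AZ" constraint; actual placement would be handled by the ASG configuration:

```python
# Illustrative round-robin placement of ASG nodes across AZs, showing why
# "spread across at least 3 AZs" implies a minimum of 3 nodes per node type.

def spread_across_azs(node_count: int, azs: list) -> dict:
    """Return a per-AZ node count, requiring at least one node per AZ."""
    if node_count < len(azs):
        raise ValueError("need at least one node per AZ")
    placement = {az: 0 for az in azs}
    for i in range(node_count):
        placement[azs[i % len(azs)]] += 1   # round-robin assignment
    return placement
```

With 3 AZs, 3 nodes is the minimum that survives a single-AZ outage without losing a whole node type.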
2. AWS API Gateway
3. AWS Lambda (for search)
4. AWS Lambda (for monitoring)
A monitoring Lambda function will allow us to continuously monitor the OpenSearch cluster for problems like red cluster health, master issues, missing search indices, and failing search sanity checks. This function will be triggered periodically by CloudWatch Events (typically every minute) and emit CloudWatch metrics, which will then be used to create CloudWatch alarms. The user can take appropriate action based on those alarms.
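A minimal sketch of the checks such a monitoring function might run, mapping a `_cluster/health` response to metric values. The metric names are illustrative assumptions; the health fields (`status`, `discovered_master`) follow the cluster health API, but the exact checks are not specified in the RFC:

```python
# Hypothetical monitoring check: translate cluster state into numeric
# metrics that could be emitted to CloudWatch and alarmed on.

def evaluate_health(health: dict, expected_indices: set, existing_indices: set) -> dict:
    """Map a cluster-health response to 0/1-style metric values."""
    return {
        # 1 when the cluster reports red status
        "ClusterStatusRed": 1 if health.get("status") == "red" else 0,
        # 1 when no master has been discovered
        "MasterUnreachable": 1 if health.get("discovered_master") is False else 0,
        # count of search indices we expected but did not find
        "MissingIndices": len(expected_indices - existing_indices),
    }
```

Each metric would feed an alarm; a search sanity check (issuing a known query and verifying hits) would emit a similar metric.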
5. VPCLink
This feature allows us to map API Gateway endpoints to the NLB and acts as a proxy. We use this for the ingestion/operations path. With HTTP proxy integration, API Gateway passes the entire request and response between the frontend and the backend.
6. Security, Authentication and Authorization
Security is the top priority. We will use the OpenSearch security plugin to manage controlled access to the cluster.
For the initial implementation we will use HTTP Basic authentication (username, password). We will create individual users for the use cases below, each scoped to a limited set of OpenSearch APIs using roles.
Since all the Lambda functions, the NLB, and the nodes are in a private subnet, we will disable HTTPS on the security plugin.
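The "users scoped to a limited set of APIs using roles" idea can be sketched as least-privilege role definitions. The role names, index patterns, and action groups below are illustrative assumptions, not the final design:

```python
# Hypothetical least-privilege roles for the security plugin: the search
# path gets read-only access, the ingestion path gets write access.

ROLES = {
    "search_user": {   # used by the search Lambda
        "index_permissions": [
            {"index_patterns": ["docs-*"], "allowed_actions": ["read"]}
        ]
    },
    "ingest_user": {   # used by the ingestion/operations path
        "index_permissions": [
            {"index_patterns": ["docs-*"],
             "allowed_actions": ["write", "create_index"]}
        ]
    },
}

def allowed(role: str, action: str) -> bool:
    """Check whether a role's permissions include the given action."""
    perms = ROLES[role]["index_permissions"][0]["allowed_actions"]
    return action in perms
```

The design choice being illustrated: a compromised search credential cannot write to or delete indices.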
7. AWS Secrets Manager
We will use AWS Secrets Manager to create and rotate credentials. The credentials will be encrypted at rest using an AWS KMS key. Only certain users can fetch those secrets, enforced via restricted IAM policies attached to those users.
Apart from the security plugin credentials, we will also store the private keys used to SSH into data nodes and master nodes.
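Because credentials are rotated, consumers (e.g. the search Lambda) typically cache a secret briefly and re-fetch it when the cache expires. A minimal sketch, where `fetch` stands in for a real Secrets Manager call (e.g. boto3 `get_secret_value`); the TTL value is an assumption:

```python
import time

class CachedSecret:
    """Cache a rotated secret so every request doesn't hit Secrets Manager.

    `fetch` is injected so the sketch stays testable; in practice it would
    wrap a Secrets Manager client call.
    """

    def __init__(self, fetch, ttl_seconds: float = 300.0):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._value = None
        self._expires = 0.0

    def get(self) -> str:
        now = time.monotonic()
        if self._value is None or now >= self._expires:
            self._value = self._fetch()          # refresh from the source
            self._expires = now + self._ttl      # cache until TTL elapses
        return self._value
```

On rotation, the stale cached value is served for at most one TTL window, so the cluster should briefly accept both the old and the new credential during rotation.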
8. Logging
In order for the service team to properly monitor the service for security and operational issues, and in order for security teams to investigate issues, logs must be recorded and stored appropriately. Relevant logs should be built intentionally with security use cases - monitoring and investigations - in mind. Logs are consumed both by automation and by humans, and should address the needs of both.
We will enable logging in the following places:
9. Bastion Hosts
For operations, users may require access to nodes. Since all the cluster nodes are in a private subnet, we need bastion hosts, with access to the SSH port limited to restricted IP ranges (typically a corporate VPN). This will be done by setting appropriate rules via security groups (SGs).
We will create one bastion host in each of at least three AZs to safeguard against AZ outages.
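The SG rule shape described above (SSH only, restricted source CIDRs) can be sketched as follows; the rule structure mirrors the EC2 ingress-rule fields, and the CIDR ranges are placeholders:

```python
# Hypothetical generator for bastion-host SG ingress rules: allow SSH
# (TCP 22) only from the given CIDR ranges (e.g. a corporate VPN).

def bastion_ingress_rules(allowed_cidrs: list) -> list:
    """Build one SSH-only ingress rule per allowed CIDR range."""
    return [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22, "CidrIp": cidr}
        for cidr in allowed_cidrs
    ]
```

No rule opens any port other than 22, and nothing is admitted from 0.0.0.0/0 unless explicitly listed.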
Artifacts
As part of this implementation we will provide users with AWS CDK code (infrastructure as code) to spin up their own infrastructure, along with auxiliary tools to set up and manage it. These artifacts will be hosted in a separate public GitHub repository. (Don't forget the tenets 😉)
Future Enhancements