Idea: Notification (CDC) Support #633

Closed
Xuanwo opened this issue May 12, 2021 · 28 comments

@Xuanwo
Contributor

Xuanwo commented May 12, 2021

Our storage services may support sending notifications so that users can learn about changes to their storage.

This feature is like CDC (Change Data Capture) for a DBMS.

We may need to:

@Xuanwo
Contributor Author

Xuanwo commented Jun 27, 2021

Maybe we can't finish this work in go-storage alone?

We need to build a whole CDC service:

  • a service which is able to
    • read/receive a storage service's events and convert them to a unified style.
    • send the unified-style events to the specified target (webhook / mq / ...)
  • APIs that allow setting this service as the receiver of the storage services' native notifications
    • notify / eBPF
    • S3 webhook

@Xuanwo
Contributor Author

Xuanwo commented Jun 28, 2021

@xxchan is working on this idea.

@xxchan
Contributor

xxchan commented Jun 30, 2021

Here are some of my thoughts.

Why might we need to support this feature?

I think go-storage currently provides a unified interface for accessing storage services, and this feature would begin to support configuring a more complex feature of storage services.

Supporting notification configuration is the first step (feature 1). We should consider how users use notifications and how to help them.

When might users need this feature?

Although we may design a feature that can be used without go-storage, I think we should start from go-storage.

I think a user is willing to use go-storage for:

  • cross-cloud business (e.g., migration)
  • good portability, no vendor lock-in

In the first case, notification is probably not needed(?).
In the second case, let's consider how the user would use notifications.

Notification data flow

An event notification may flow in different paths:

(image: notification data flow diagram)

If the user uses Lambda, I guess he may be willing to stick to the vendor and doesn't need us(?).
If he uses a queue service, I'm not sure.

If he uses go-storage and sets the notification destination to a custom server (does this mean the feature has limited use cases?), then he will need to handle the vendor-specific notification format (e.g., oss event message, s3 event message), which defeats the purpose of being "vendor agnostic".

So we can define a unified storage event message format for users. We can provide a library to convert vendor event message formats into ours (feature 2). (This can be analogous to https://github.com/xo/dburl, with which users can convert a unified connection string format into vendor ones)
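As a rough illustration of feature 2, here is a minimal sketch of what a unified event type and one vendor converter could look like. The `Event` struct, its fields, and the `FromS3` function are hypothetical names invented for this example; only the S3 JSON field names (`Records`, `eventName`, `eventTime`, `s3.bucket.name`, `s3.object.key`) come from the S3 event message structure.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// Event is a hypothetical unified storage event. The field set is an
// assumption for illustration, not a format agreed on in this thread.
type Event struct {
	Service string // source service, e.g. "s3"
	Kind    string // "created", "deleted", or "other"
	Bucket  string
	Key     string
	Time    time.Time
}

// s3Message mirrors only the parts of the S3 event JSON used below.
type s3Message struct {
	Records []struct {
		EventName string    `json:"eventName"`
		EventTime time.Time `json:"eventTime"`
		S3        struct {
			Bucket struct {
				Name string `json:"name"`
			} `json:"bucket"`
			Object struct {
				Key string `json:"key"`
			} `json:"object"`
		} `json:"s3"`
	} `json:"Records"`
}

// FromS3 converts an S3 event message into unified events.
// S3 URL-encodes object keys in event messages; decoding is omitted here.
func FromS3(payload []byte) ([]Event, error) {
	var msg s3Message
	if err := json.Unmarshal(payload, &msg); err != nil {
		return nil, fmt.Errorf("decode s3 event: %w", err)
	}
	events := make([]Event, 0, len(msg.Records))
	for _, r := range msg.Records {
		kind := "other"
		switch {
		case strings.HasPrefix(r.EventName, "ObjectCreated:"):
			kind = "created"
		case strings.HasPrefix(r.EventName, "ObjectRemoved:"):
			kind = "deleted"
		}
		events = append(events, Event{
			Service: "s3",
			Kind:    kind,
			Bucket:  r.S3.Bucket.Name,
			Key:     r.S3.Object.Key,
			Time:    r.EventTime,
		})
	}
	return events, nil
}

// samplePayload is a trimmed, made-up S3 event message.
const samplePayload = `{"Records":[{"eventName":"ObjectCreated:Put","eventTime":"2021-07-13T08:00:00.000Z","s3":{"bucket":{"name":"my-bucket"},"object":{"key":"photos/a.jpg"}}}]}`

func main() {
	events, err := FromS3([]byte(samplePayload))
	if err != nil {
		panic(err)
	}
	for _, e := range events {
		fmt.Printf("%s %s %s/%s\n", e.Service, e.Kind, e.Bucket, e.Key)
	}
}
```

A converter per vendor (oss, cos, and so on) would follow the same shape, each with its own input struct.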

As @Xuanwo mentioned, we may support different notification receivers. The event "destination" (a customer-managed server that subscribes to notifications as an HTTP endpoint) may further send event messages downstream, so we could help by providing a unified interface for publishing messages (like a unified MQ interface, or maybe more) (feature 3). We could even let the server simply forward messages as a dedicated halfway station (feature 4, using features 2 & 3).

Summary

Now we have 4 possible features:

  1. Configure notification in go-storage
  2. A library to convert vendor event message formats into a unified one
  3. A unified interface for publishing messages (analogous to go-storage?)
  4. A message forwarding application combining 2 & 3.

I think features 1 & 2 are very reasonable.
But I doubt the use cases of feature 4. Will users use a server just to forward messages without processing data? If so, it may also involve tricky things to consider, e.g., message delivery guarantees (retries, ordering, and deduplication).
Finally, it seems that feature 3 (a general one, not only serving feature 4) is beyond the scope of our organization.

@Xuanwo
Contributor Author

Xuanwo commented Jun 30, 2021

Nice thoughts! Let's resolve questions here.

In the first case, notification is probably not needed(?).

Take data migration and backup as examples: notifications are needed to implement incremental processing. For example, with notification support, we can implement incremental migration so that we don't need to list all objects (which is very slow on huge buckets).
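To make the incremental idea concrete, here is a minimal sketch that touches only the keys named in a batch of change events instead of listing the whole source. `Bucket`, `Change`, and `ApplyChanges` are hypothetical names, and in-memory maps stand in for real storage backends.

```go
package main

import "fmt"

// Change is a hypothetical normalized event: only creates and deletes.
type Change struct {
	Kind string // "created" or "deleted"
	Key  string
}

// Bucket is an in-memory stand-in for a storage bucket (key -> content).
type Bucket map[string][]byte

// ApplyChanges copies or deletes only the keys named in changes and
// returns how many objects it touched. It never lists src.
func ApplyChanges(src, dst Bucket, changes []Change) int {
	touched := 0
	for _, c := range changes {
		switch c.Kind {
		case "created":
			data, ok := src[c.Key]
			if !ok {
				continue // object was removed again after the event fired
			}
			dst[c.Key] = data
			touched++
		case "deleted":
			delete(dst, c.Key)
			touched++
		}
	}
	return touched
}

func main() {
	src := Bucket{"a": []byte("1"), "b": []byte("2")}
	dst := Bucket{"a": []byte("1"), "old": []byte("0")}
	changes := []Change{
		{Kind: "created", Key: "b"},
		{Kind: "deleted", Key: "old"},
	}
	n := ApplyChanges(src, dst, changes)
	fmt.Println("objects touched:", n, "objects in dst:", len(dst))
}
```

The work done is proportional to the number of events, not to the size of the bucket, which is the point made above about huge buckets.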

If the user uses Lambda, I guess he may be willing to stick to the vendor and doesn't need us(?).

We are focused on the storage layer itself, so the notification here is the native notification provided by storage services. That means:

  • We don't need to handle Lambda, Queue Service, and so on, they are out of our scope. We only need to handle notifications sent from storage services themselves.
  • After we implement the notification features, users/developers could build serverless services that can handle all storage backends.

Will users use a server just to forward messages without processing data?

Nice question.

Features 3 & 4 are indeed outside our community's scope. The reason I include them here is this: Between features 1 and 2, we need a service to receive the events. Features 3 & 4 are extensions of this service.

The workflow looks like this:

  • Set up the service
  • Configure the storage notification URL to this service's URL
  • Receive the events normalized by this service
    • Users could integrate the service directly into their code
    • Or users can configure this service to send the normalized events to a specific target (this is why I include them)
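The workflow above could be sketched as a small HTTP handler that normalizes each incoming notification and hands it to the user's code over a channel. Everything here is an assumption for illustration: `Event`, `NewReceiver`, the `decodeDemo` payload format, and the `post` helper are hypothetical, and a real receiver would plug in a vendor-specific normalizer.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Event is a hypothetical normalized storage event.
type Event struct {
	Kind string `json:"kind"`
	Key  string `json:"key"`
}

// NewReceiver returns a handler that normalizes each request body and
// sends the resulting events to out. A full channel blocks the handler,
// so the consumer must keep up or the channel must be buffered.
func NewReceiver(normalize func([]byte) ([]Event, error), out chan<- Event) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		events, err := normalize(body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		for _, e := range events {
			out <- e
		}
		w.WriteHeader(http.StatusNoContent)
	})
}

// decodeDemo is a stand-in normalizer for a made-up one-event JSON body.
func decodeDemo(body []byte) ([]Event, error) {
	var e Event
	if err := json.Unmarshal(body, &e); err != nil {
		return nil, err
	}
	return []Event{e}, nil
}

// post sends body to h in-process and returns the HTTP status code.
func post(h http.Handler, body string) int {
	req := httptest.NewRequest(http.MethodPost, "/events", strings.NewReader(body))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	out := make(chan Event, 8)
	h := NewReceiver(decodeDemo, out)
	status := post(h, `{"kind":"created","key":"a.txt"}`)
	e := <-out
	fmt.Println(status, e.Kind, e.Key)
}
```

In a deployment the handler would be served with `http.ListenAndServe`, and the storage notification URL would point at it.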

It's OK for me to wipe this service and features 3 & 4 out of this proposal; we can discuss them later (maybe when dm plans to implement incremental data migration).

@Xuanwo
Contributor Author

Xuanwo commented Jun 30, 2021

ping dm's maintainer @Prnyself to take a look.

@xxchan
Contributor

xxchan commented Jun 30, 2021

Between features 1 and 2, we need a service to receive the events.

I think this is just an HTTP server, so it should be decided by users themselves?

@Xuanwo
Contributor Author

Xuanwo commented Jun 30, 2021

I think this is just an HTTP server, so it should be decided by users themselves?

You are right. Let's focus on our job and not take the service into consideration.

@Prnyself
Member

Prnyself commented Jul 1, 2021

Nice thoughts!

As a service user, especially for a Go application, being able to receive notifications on a channel is necessary and fundamental.

What's more, webhooks or third-party message queues should also be supported in the future.

So it is really similar to the relationship between go-storage and go-service-xxx, if we want to support different message services.

But for now, I think we can first define the notification struct and find out what information we need to send in notification. Maybe take badger's db.Subscribe as a reference?

@Xuanwo
Contributor Author

Xuanwo commented Jul 7, 2021

@xxchan Hi, what's the progress?

@xxchan
Contributor

xxchan commented Jul 7, 2021

find out what information we need to send in notification

@Prnyself, to make it clear, I think we are not going to support "sending notifications", since this is an internal feature of storage services. We just enable users to turn it on with go-storage, and we cannot decide "what information to send in notification".

We can decide "what information is commonly needed in received notification" and define a unified format.


@Xuanwo My current plan is:

  1. Support notification configuration in go-storage (Set receiver to the cloud notification service or an HTTP endpoint).
  2. Define a unified storage event message format (or simply a go struct) along with a library to convert vendor event message formats into it.

If this is okay, I will draft an RFC for 1 soon.

@Xuanwo
Contributor Author

Xuanwo commented Jul 8, 2021

@Xuanwo My current plan is:

1. Support notification configuration in go-storage (Set receiver to the cloud notification service or an HTTP endpoint).

2. Define a unified storage event message format (or simply a go struct) along with a library to convert vendor event message formats into it.

If this is okay, I will draft an RFC for 1 soon.

The plan looks good to me!

@Xuanwo Xuanwo transferred this issue from beyondstorage/specs Jul 9, 2021
@xxchan
Contributor

xxchan commented Jul 13, 2021

Here's a (not verified) table of storage event types. We can see that they vary a lot:

  1. Different services have different advanced features, e.g., download for s3, metadata update for gcs, abort_multipart for qingstor. If we omit them, the most basic and common events are only create & delete.
  2. Only half of the services support fine-grained event types.
  3. There's inconsistent behaviour: e.g., oss counts InitiateMultipartUpload & UploadPart as create event, while s3 and cos don't.

So I think this means that storage events are highly service-specific, and thus it is hard to provide a comprehensive unified event format.

oss s3 cos gcs qingstor azblob
ObjectCreated *
ObjectCreated:PutObject
ObjectCreated:PostObject
ObjectCreated:CopyObject
ObjectCreated:InitiateMultipartUpload
ObjectCreated:UploadPart
ObjectCreated:UploadPartCopy
ObjectCreated:CompleteMultipartUpload
ObjectCreated:AppendObject
ObjectDownloaded ObjectDownloaded:GetObject
ObjectRemoved *
ObjectRemoved:DeleteObject
ObjectRemoved:DeleteObjects
version delete
ObjectReplication *
ObjectReplication:ObjectCreated
ObjectReplication:ObjectRemoved
ObjectReplication:ObjectModified
OperationFailedReplication
metadata update
abort_multipart
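If only create and delete are treated as common, a coarse classifier over the event names in the table might look like the sketch below. `Kind` and `Classify` are hypothetical names, and mapping the multipart part-level events to "other" (so that only finished objects count as created) is one possible design choice for the oss inconsistency noted above, not something decided in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// Kind is a hypothetical coarse event category.
type Kind string

const (
	Created Kind = "created"
	Removed Kind = "removed"
	Other   Kind = "other"
)

// Classify maps a "Group:Operation" event name from the table to a
// coarse kind. Anything that is not a create or delete becomes Other.
func Classify(name string) Kind {
	parts := strings.SplitN(name, ":", 2)
	group, op := parts[0], ""
	if len(parts) == 2 {
		op = parts[1]
	}
	switch group {
	case "ObjectCreated":
		switch op {
		case "InitiateMultipartUpload", "UploadPart", "UploadPartCopy":
			// Part-level events do not produce a finished object yet.
			return Other
		}
		return Created
	case "ObjectRemoved":
		return Removed
	}
	return Other
}

func main() {
	for _, name := range []string{
		"ObjectCreated:PutObject",
		"ObjectCreated:UploadPart",
		"ObjectRemoved:DeleteObject",
		"ObjectDownloaded:GetObject",
	} {
		fmt.Println(name, "->", Classify(name))
	}
}
```

Service-specific events such as downloads or metadata updates are lost in this mapping, which is the trade-off behind the "hard to unify" conclusion.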

@xxchan
Contributor

xxchan commented Jul 13, 2021

The APIs for configuring notifications are similar (but oss does not have this API!). The params are: bucket name and event (type, filter, id, arn). The trickiest part is the event type: it seems hard to define a global event type (like global pairs).

@Xuanwo
Contributor Author

Xuanwo commented Jul 13, 2021

Another thing I found out is that some services (like s3) only support sending events to internal services like Amazon SNS, Amazon SQS, or AWS Lambda.

@xxchan
Contributor

xxchan commented Jul 13, 2021

Another thing I found out is that some services (like s3) only support sending events to internal services like Amazon SNS, Amazon SQS, or AWS Lambda.

Actually, only qingstor supports an HTTP endpoint directly.

@xxchan
Contributor

xxchan commented Jul 13, 2021

And it seems to be encouraged to configure notification in the console instead of using API 🤔

@Xuanwo
Contributor Author

Xuanwo commented Jul 13, 2021

So we now have two difficulties.

  • The event types are so different that it is impossible to abstract them into a uniform event type
  • Most services do not support sending events directly to a specific HTTP endpoint

@xxchan
Contributor

xxchan commented Jul 13, 2021

For the second problem, my previous idea was to use SNS as an intermediary and add an HTTP endpoint subscription to the SNS topic. If so, the user will have to also provide the SNS arn besides HTTP endpoint.

@Xuanwo
Contributor Author

Xuanwo commented Jul 13, 2021

If so, the user will have to also provide the SNS arn besides HTTP endpoint.

But SNS arn is also very different between services? Can we create SNS for user?

@xxchan
Contributor

xxchan commented Jul 13, 2021

Can we create SNS for user?

Some quick results (whether a CreateTopic API exists):

  • AWS SNS: yes
  • google Pub/Sub: no (gcloud cli tool/console)
  • aliyun MNS: yes (but no oss notification API)

@xxchan
Contributor

xxchan commented Jul 13, 2021

But SNS arn is also very different between services?

Not sure. Example:

  • AWS TopicArn: arn:aws:sns:us-east-2:123456789012:MyTopic
  • gcs notificationConfigs "topic": "projects/PROJECT_ID/topics/TOPIC_NAME"
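The two identifier shapes above differ, but both can be carried as a plain string and told apart by prefix. The sketch below assumes only the two example formats shown; `Topic` and `ParseTopic` are hypothetical names, and other vendors' identifiers are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// Topic is a hypothetical provider-neutral view of a topic identifier.
type Topic struct {
	Provider string // "aws" or "gcp"
	Name     string
}

// ParseTopic recognizes the two identifier shapes quoted above:
//   arn:aws:sns:<region>:<account>:<name>
//   projects/<project>/topics/<name>
func ParseTopic(id string) (Topic, error) {
	if strings.HasPrefix(id, "arn:") {
		parts := strings.SplitN(id, ":", 6)
		if len(parts) != 6 || parts[2] != "sns" || parts[5] == "" {
			return Topic{}, fmt.Errorf("not an SNS topic ARN: %q", id)
		}
		return Topic{Provider: "aws", Name: parts[5]}, nil
	}
	parts := strings.Split(id, "/")
	if len(parts) == 4 && parts[0] == "projects" && parts[2] == "topics" && parts[3] != "" {
		return Topic{Provider: "gcp", Name: parts[3]}, nil
	}
	return Topic{}, fmt.Errorf("unrecognized topic identifier: %q", id)
}

func main() {
	for _, id := range []string{
		"arn:aws:sns:us-east-2:123456789012:MyTopic",
		"projects/PROJECT_ID/topics/TOPIC_NAME",
	} {
		t, err := ParseTopic(id)
		if err != nil {
			panic(err)
		}
		fmt.Println(t.Provider, t.Name)
	}
}
```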

My previous concern was that if the user will go to the console to create a topic, why doesn't he just continue to configure the notification there? So "Can we create SNS for user?" is a problem.

@Xuanwo
Contributor Author

Xuanwo commented Jul 13, 2021

Let's discuss event type later, it's a bit simpler.

My previous concern was that if the user will go to the console to create a topic, why doesn't he just continue to configure the notification there? So "Can we create SNS for user?" is a problem.

So there are two methods:

  • API that accepts the dst endpoint: that means we need to create an SNS service for the user if the service doesn't have native support.
  • API that accepts service internal ARN (in a plain string): that means the user needs to create the SNS service themselves.

@Xuanwo
Contributor Author

Xuanwo commented Jul 13, 2021

Maybe related to #634

@xxchan
Contributor

xxchan commented Jul 13, 2021

Is creating a service implicitly acceptable to users? One thing is that it involves billing.

@Xuanwo
Contributor Author

Xuanwo commented Jul 13, 2021

So there are two methods:

  • API that accepts the dst endpoint: that means we need to create an SNS service for the user if the service doesn't have native support.
  • API that accepts service internal ARN (in a plain string): that means the user needs to create the SNS service themselves.

For method 1: I agree with your concern, it's not acceptable.
For method 2: It looks meaningless for users (why not configure them in the console directly?)

Maybe implementing the notification config API is out of our scope (and we don't have the ability to do it). Let's wipe it out.


Without notification API support, do you think it is still useful to implement a global event struct type?

@xxchan
Contributor

xxchan commented Jul 14, 2021

I think users can write this themselves with a few lines of code and won't go looking for a simple library to do so.

@Xuanwo Xuanwo added the backlog label Jul 14, 2021
@Xuanwo
Contributor Author

Xuanwo commented Jul 14, 2021

Let's mark this idea as backlog and drop it for now. Thanks for your research!

@Xuanwo Xuanwo added the idea label Jul 14, 2021
@Xuanwo
Contributor Author

Xuanwo commented Aug 4, 2021

How about implementing CDC via scanning, like Rockset does? https://rockset.com/blog/change-data-capture-what-it-is-and-how-to-use-it/
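One way to read "CDC via scanning" is to list the bucket periodically and diff consecutive listings. The sketch below shows only the diff step under that assumption; `Snapshot`, `Change`, and `Diff` are hypothetical names, and the string value per key stands for any content fingerprint a listing returns (for example an ETag).

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot maps object key to a content fingerprint from one listing.
type Snapshot map[string]string

// Change is a hypothetical derived event.
type Change struct {
	Kind string // "created", "modified", or "deleted"
	Key  string
}

// Diff derives change events by comparing two listings of one bucket.
// The result is sorted by key so the output is deterministic.
func Diff(prev, curr Snapshot) []Change {
	var changes []Change
	for key, tag := range curr {
		old, ok := prev[key]
		switch {
		case !ok:
			changes = append(changes, Change{Kind: "created", Key: key})
		case old != tag:
			changes = append(changes, Change{Kind: "modified", Key: key})
		}
	}
	for key := range prev {
		if _, ok := curr[key]; !ok {
			changes = append(changes, Change{Kind: "deleted", Key: key})
		}
	}
	sort.Slice(changes, func(i, j int) bool { return changes[i].Key < changes[j].Key })
	return changes
}

func main() {
	prev := Snapshot{"a": "v1", "b": "v1", "c": "v1"}
	curr := Snapshot{"a": "v1", "b": "v2", "d": "v1"}
	for _, c := range Diff(prev, curr) {
		fmt.Println(c.Kind, c.Key)
	}
}
```

Scanning works on any service that can list objects, but each scan pays the full listing cost mentioned earlier in the thread, and changes that happen and revert between two scans are not seen.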

