feat(warehouse): added base support for s3 datalake deletion as part of regulation API. #2515
Conversation
Codecov Report: Base 43.77% // Head 43.84% // Increases project coverage by +0.06%.
Additional details and impacted files:
@@ Coverage Diff @@
## master #2515 +/- ##
==========================================
+ Coverage 43.77% 43.84% +0.06%
==========================================
Files 186 187 +1
Lines 39890 40015 +125
==========================================
+ Hits 17463 17545 +82
- Misses 21332 21370 +38
- Partials 1095 1100 +5
☔ View full report at Codecov.
Minor issues, or issues not introduced in this PR, but I will need a second pass.
return fmt.Errorf("error while opening file, %w", err) | ||
} | ||
defer cleanCompressedFilePtr.Close() | ||
f, err = os.OpenFile(filepath.Join(statusTrackerTmpDir, StatusTrackerFileName), os.O_CREATE|os.O_RDWR, 0o644) |
Since we removed os.O_TRUNC, if the file already exists, we are going to error, correct?
It is safer this way, I just wanted to make sure this was the intention behind this change.
I haven't changed this implementation.
Probably did a bit more code movement, so it's blurring the boundaries of the diff.
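For reference, a small runnable sketch of the flag semantics in question (the file name is just a placeholder): without os.O_TRUNC an existing file is opened with its contents intact, and the open only fails on an existing file if os.O_EXCL is added.

package main

import (
	"fmt"
	"os"
)

func main() {
	// O_CREATE|O_RDWR opens an existing file without truncating it and only
	// creates it when missing; it does not error if the file already exists.
	f, err := os.OpenFile("tracker.txt", os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		fmt.Println("open:", err)
		return
	}
	defer f.Close()

	// Adding O_EXCL is what makes the open fail when the file already exists.
	_, err = os.OpenFile("tracker.txt", os.O_CREATE|os.O_RDWR|os.O_EXCL, 0o644)
	fmt.Println("exclusive open:", err) // prints a "file exists" error
}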
@@ -63,6 +63,10 @@ func init() {
// New returns FileManager backed by configured provider
func (*FileManagerFactoryT) New(settings *SettingsT) (FileManager, error) {
switch settings.Provider {
case "S3_DATALAKE": |
Why do we need a duplicate fileManager for S3 Datalake when we can use S3?
The provider comes in from above and it has S3_DATALAKE hardcoded in it. Let me see what change I can make!
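One way to avoid the duplicate case, sketched below with simplified stand-in types (SettingsT, FileManager, and the s3Manager stub are not the package's real definitions): normalize the provider before the switch so S3_DATALAKE reuses the S3 branch.

package filemanager

import "fmt"

// Simplified stand-ins for the real types in this package.
type SettingsT struct {
	Provider string
	Config   map[string]interface{}
}
type FileManager interface{}
type FileManagerFactoryT struct{}
type s3Manager struct{ config map[string]interface{} }

// New reuses the S3 branch for S3_DATALAKE by normalizing the provider name,
// instead of adding a duplicate case for the datalake.
func (*FileManagerFactoryT) New(settings *SettingsT) (FileManager, error) {
	provider := settings.Provider
	if provider == "S3_DATALAKE" { // a datalake bucket is plain S3 underneath
		provider = "S3"
	}
	switch provider {
	case "S3":
		return &s3Manager{config: settings.Config}, nil
	default:
		return nil, fmt.Errorf("service provider not supported: %s", provider)
	}
}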
StatusTrackerFileName = "rudderDeleteTracker.txt"
supportedDestinations = []string{"S3"}
supportedDestinations = []string{"S3", "S3_DATALAKE"}
)

const listMaxItem int64 = 1000 |
I believe since we have defined FilesLimit in the BatchManager struct, we don't need this anymore.
I'll fix it.
method := "GET" | ||
genEndPoint := "/dataplane/workspaces/{workspace_id}/regulations/workerJobs" | ||
url := fmt.Sprint(j.URLPrefix, prepURL(genEndPoint, j.WorkspaceID)) | ||
url := fmt.Sprintf("%s/dataplane/workspaces/%s/regulations/workerJobs", j.URLPrefix, j.WorkspaceID) |
Creating url like this is indeed more readable & better. We can use this same approach to create the url in the UpdateStatus func also & get rid of the prepURL func.
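A minimal sketch of the same approach applied to UpdateStatus, which would let prepURL be removed entirely. JobAPI here is a stand-in for the real client type, and the trailing per-job path segment is an assumption for illustration.

package client

import "fmt"

// JobAPI is a stand-in for the job client used in this package.
type JobAPI struct {
	URLPrefix   string
	WorkspaceID string
}

// statusURL builds the UpdateStatus endpoint with fmt.Sprintf, mirroring the
// workerJobs URL above; the job-ID segment is assumed.
func (j *JobAPI) statusURL(jobID string) string {
	return fmt.Sprintf("%s/dataplane/workspaces/%s/regulations/workerJobs/%s",
		j.URLPrefix, j.WorkspaceID, jobID)
}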
@@ -0,0 +1,21 @@
package filehandler |
I feel the name of this file decreases its readability. Not sure what would be an apt name; how about fileHandler.go? Also, we might want to change the directory name from filehandler to fileHandler too.
Usually names are camelCased in golang, but filenames are snake_cased in popular repositories. The name local_file indicates the interface for local file handling. Let me know if you wanna have an offline discussion.
defer tmpFilePtr.Close()
absPath, err := filepath.Abs(tmpFilePtr.Name())

defer f.Close() |
The reason I am ignoring the error here is that it is overkill to abort the job & send a failed status to the manager just because we failed to close the file. I believe we can afford to ignore it.
defer f.Close()
defer func() { _ = f.Close() }()
if err := f.Close(); err != nil {
return nil, fmt.Errorf("closing file: %w", err)
} |
Since we are already deferring the Close on line 121.
if err := f.Close(); err != nil {
return nil, fmt.Errorf("closing file: %w", err)
}
This is a somewhat different way to cleanly handle closing the file. Usually we want to handle the error that comes with closing a file, which is why it is handled in multiple places in this fashion.
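For reference, a self-contained sketch of the two idioms being discussed (function names and paths are illustrative only): ignoring the Close error on a read path, and propagating it through a named return on a write path, where Close may be the call that flushes data.

package example

import (
	"fmt"
	"io"
	"os"
)

// readIgnoringClose deliberately drops the Close error, matching the
// defer func() { _ = f.Close() }() suggestion above: for a read-only handle
// there is little useful to do if Close fails.
func readIgnoringClose(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, fmt.Errorf("opening file: %w", err)
	}
	defer func() { _ = f.Close() }()

	return io.ReadAll(f)
}

// writePropagatingClose surfaces the Close error through a named return,
// which matters on writes because Close can flush buffered data.
func writePropagatingClose(path string, data []byte) (err error) {
	f, err := os.Create(path)
	if err != nil {
		return fmt.Errorf("creating file: %w", err)
	}
	defer func() {
		if cerr := f.Close(); cerr != nil && err == nil {
			err = fmt.Errorf("closing file: %w", cerr)
		}
	}()

	_, err = f.Write(data)
	return err
}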
// Note: download happens concurrently in 5 go routine by default.
func (b *Batch) download(_ context.Context, completeFileName string) (string, error) {
pkgLogger.Debugf("downloading file: %v", completeFileName)
func (_ *Batch) cleanedFiles(_ context.Context, path string, job *model.Job) ([]string, error) { |
Given that the list of cleaned files could be really long, won't it make sense to return a pointer instead of []string?
Func name suggestion: getCleanedFiles
Go doesn't provide automatic support for getters and setters. There's nothing wrong with providing getters and setters yourself, and it's often appropriate to do so, but it's neither idiomatic nor necessary to put Get into the getter's name. If you have a field called owner (lower case, unexported), the getter method should be called Owner (upper case, exported), not GetOwner. The use of upper-case names for export provides the hook to discriminate the field from the method. A setter function, if needed, will likely be called SetOwner.
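A tiny illustration of the convention in that quote (the type and field below are made up):

package job

type Job struct {
	owner string // unexported field
}

func (j *Job) Owner() string     { return j.owner } // getter: Owner, not GetOwner
func (j *Job) SetOwner(o string) { j.owner = o }    // setter: SetOwner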
I was reading this in Effective Go a while back, where I came across this advice to not place get in the name of the function.
Also, why would a pointer be beneficial? In Go, I think slices are passed by value, but only the small slice header is copied, so the cost is pretty minimal, right?
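A small runnable sketch of why returning []string is already cheap: a slice value is just a header (pointer, length, capacity) that gets copied, while the backing array is shared, so returning *[]string buys very little.

package main

import (
	"fmt"
	"unsafe"
)

func main() {
	files := make([]string, 1_000_000)

	// Only this header is copied when a []string is returned by value;
	// on 64-bit platforms it is 24 bytes regardless of length.
	fmt.Println(unsafe.Sizeof(files)) // 24

	// Assigning the slice copies the header, not the million elements,
	// and both values share the same backing array.
	same := files
	same[0] = "cleaned.json.gz"
	fmt.Println(files[0]) // cleaned.json.gz
}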
return model.JobStatusFailed
}

pkgLogger.Infof("successfully completed loop of ") |
pkgLogger.Infof("successfully completed loop of ")
pkgLogger.Infof("successfully completed loop")
mu sync.Mutex
FM filemanager.FileManager |
Won't it be better to put the file handler inside the Batch struct, given it is specific to an instance of the deletion job?
mu sync.Mutex
FM filemanager.FileManager
mu sync.Mutex
FM filemanager.FileManager
filehandlers map[string]filehandler.LocalFileHandler |
The thing is, a filehandler needs to be created when processing every file, as it stores the intermediate results to allow for a cleaner interface. We have added a factory to instantiate a filehandler every time we see a file of that sort. So this might not work.
}
}
absPath, err := filepath.Abs(statusTrackerFilePtr.Name())
fName, err := batch.download(ctx, filepath.Join(prefix, StatusTrackerFileName)) |
Suggestion to increase code readability.
fName, err := batch.download(ctx, filepath.Join(prefix, StatusTrackerFileName))
statusTrackerFile, err := batch.download(ctx, filepath.Join(prefix, StatusTrackerFileName))
Usually, within a function whose name clearly defines the intent, we can shorten variable names that refer to common structures. I like to indicate files through f etc. Wdyt?
switch destType {
case "S3":
return map[string]filehandler.LocalFileHandler{
".json.gz": filehandler.NewGZIPLocalFileHandler(filehandler.CamelCase), |
Given that we anyway support handling parquet files, and any file type could be stored in S3, won't it be better to add all the supported filehandlers for all destinations?
Let me think about it. That can be done.
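A hedged, self-contained sketch of how the factory idea described earlier and the suggestion above could fit together, using stand-in types and stub handlers rather than the package's real ones: register every supported suffix for both S3 and S3_DATALAKE, and build a fresh handler per file since each handler keeps per-file intermediate state.

package filehandler

import (
	"context"
	"fmt"
	"strings"
)

// LocalFileHandler is a stand-in for the interface in this package; the
// concrete handlers below are stubs with the per-file state elided.
type LocalFileHandler interface {
	RemoveIdentity(ctx context.Context, users []string) error
}

type gzipJSONHandler struct{} // would hold per-file intermediate state

func (h *gzipJSONHandler) RemoveIdentity(ctx context.Context, users []string) error { return nil }

type parquetHandler struct{}

func (h *parquetHandler) RemoveIdentity(ctx context.Context, users []string) error { return nil }

// constructors is keyed by file suffix and is shared by S3 and S3_DATALAKE,
// so every supported file type gets a handler regardless of destination.
var constructors = map[string]func() LocalFileHandler{
	".json.gz": func() LocalFileHandler { return &gzipJSONHandler{} },
	".parquet": func() LocalFileHandler { return &parquetHandler{} },
}

// handlerFor returns a fresh handler (with fresh state) for each file.
func handlerFor(fileName string) (LocalFileHandler, error) {
	for suffix, newHandler := range constructors {
		if strings.HasSuffix(fileName, suffix) {
			return newHandler(), nil
		}
	}
	return nil, fmt.Errorf("no file handler registered for %s", fileName)
}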
return nil
}

func (h *GZIPLocalFileHandler) RemoveIdentity(ctx context.Context, attributes []model.User) error { |
Given that we have changed the logic of RemoveIdentity, and since this is going to be one of the most time-consuming parts of the loop during deletion, I strongly feel we should benchmark the old & new deletion logic.
I have a feeling that if sed knows about all the users that need to be matched, it will do some kind of optimization to avoid literally iterating over each OR condition of the pattern file and looking for a match in each line.
Based on the man page of sed, -f just reads the commands from a file and appends them to the list of commands for sed; -e, which we have replaced it with, does the same thing as well.
The main thing we have changed is that instead of letting sed handle the file from which the data needs to be removed, we are now doing it in memory. This is something which needs to be benchmarked and checked.
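The real comparison needs the old sed-based and new in-memory RemoveIdentity implementations, but a toy, self-contained micro-benchmark like the sketch below (made-up user IDs and line format, plain substring matching assumed) at least shows the shape of the question: one combined pattern per line versus literally iterating over every user ID per line.

// bench_test.go — a toy comparison, not the project's deletion code.
package deletion

import (
	"fmt"
	"regexp"
	"strings"
	"testing"
)

var (
	// 500 made-up user IDs to match against.
	userIDs = func() []string {
		ids := make([]string, 500)
		for i := range ids {
			ids[i] = fmt.Sprintf("user-%04d", i)
		}
		return ids
	}()
	// One combined alternation — the kind of single-pass optimization being
	// speculated about above.
	combined = regexp.MustCompile(strings.Join(userIDs, "|"))
	line     = `{"anonymousId":"user-0499","event":"page_view"}`
)

func BenchmarkCombinedPattern(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = combined.MatchString(line)
	}
}

func BenchmarkPerUserLoop(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// Literally checking each user ID against the line, one by one.
		for _, id := range userIDs {
			if strings.Contains(line, id) {
				break
			}
		}
	}
}

Run with go test -bench . to compare the two strategies on the same input.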
This feature is good to go. QA Approved.
Description
As per the regulation API, we are adding support to remove user information data from the S3 datalake.
Notion Ticket
https://www.notion.so/rudderstacks/S3-Datalake-Data-Deletion-Request-Dev-tasks-dfdba07974da4ffdb4b40af163bfdefc
Security