
ADR for file uploads #207

Merged · 6 commits · May 28, 2021
Conversation

hotzevzl
Member

@vercel

vercel bot commented May 24, 2021

This pull request is being automatically deployed with Vercel.
To see the status of your deployments, click below or on the icon next to each commit.

marxan – ./app

🔍 Inspect: https://vercel.com/vizzuality1/marxan/CBhxNwzb9gsSc3EtiKFaVioJPZKY
✅ Preview: https://marxan-git-docs-apimarxan-438workflow-for-file-u-0045f4.vercel.app

marxan-storybook – ./app

🔍 Inspect: https://vercel.com/vizzuality1/marxan-storybook/EwXUPwVAEnJZLAJk6ZE3R8WBEKYX
✅ Preview: https://marxan-storybo-git-docs-apimarxan-438workflow-for-file-u-d80470.vercel.app

@hotzevzl
Member Author

@aagm @kgajowy @Dyostiq @alexeh please see a draft of the ADR on file upload strategies, outlining the context.

We currently support dispatching BullMQ job requests from the API service only, which is basically the source of this discussion. If we could (and wanted to) place jobs on a queue from the geoprocessing service (for itself), then we could simply reverse-proxy all upload requests to the geoprocessing service and let it deal with job dispatching, while already having the file locally (this is, in fact, one of the options I outlined in the ADR).
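The self-dispatch option described above could be sketched roughly as follows. This is illustrative only: an in-memory queue stands in for a real BullMQ `Queue`, and all names (`handleProxiedUpload`, `process-upload`, etc.) are hypothetical, not the actual Marxan API.

```typescript
// Sketch of the "reverse-proxy + self-dispatch" option: the geoprocessing
// service receives the upload directly and places a job on its own queue.
// A minimal in-memory queue stands in for BullMQ here; its add(name, data)
// mirrors the shape of BullMQ's Queue#add.

interface UploadJob {
  name: string;
  data: { filePath: string; projectId: string };
}

class LocalQueue {
  private jobs: UploadJob[] = [];

  async add(name: string, data: UploadJob['data']): Promise<UploadJob> {
    const job = { name, data };
    this.jobs.push(job);
    return job;
  }

  size(): number {
    return this.jobs.length;
  }
}

const geoprocessingQueue = new LocalQueue();

// Handler for the reverse-proxied upload: the file is already on local
// disk, so the service only needs to enqueue work for itself.
async function handleProxiedUpload(filePath: string, projectId: string): Promise<UploadJob> {
  return geoprocessingQueue.add('process-upload', { filePath, projectId });
}
```

The key property is that the file never has to cross a service boundary after upload: the service that owns the file also owns the job.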

@aagm
Member

aagm commented May 24, 2021

First, thank you for putting this ADR document together; it is amazing.

Secondly, and pending Dominik's suggestions: IMO, as a safety net and last resort, I would proxy directly to geoprocessing. But prior to that, and as we discussed today, I would say: shared volume through K8s, and if that does not work, try the HTTP push/pull option.

@Dyostiq
Contributor

Dyostiq commented May 25, 2021

My suggestions for staying flexible:

  1. Use a shared volume on-premise; when deployed to the cloud, an S3 FUSE filesystem can be mounted in the same place.
  2. Use object storage, but provide an adapter mimicking a bucket/blob on the local filesystem – for tests, on-premise installations, etc.

Also, we can utilize a saga/process manager to decouple storage in the geoprocessing service.

Pseudocode:

import { EventBus } from '@nestjs/cqrs';
import * as fs from 'fs'

declare const processingQueue: any
declare const eventBus: EventBus


class ProcessingRequested {
  constructor(public readonly id: string, public readonly data: unknown) {
  }
}

processingQueue.on('something', (id, data) => {
  eventBus.publish(new ProcessingRequested(id, data))
})

class FileUploaded {
  constructor(public readonly id: string, public readonly origin: string) {}
}

// object storage
declare const newObjectEvents: any

newObjectEvents.subscribe((id) => {
  eventBus.publish(new FileUploaded(id, 'object storage'))
})

// file storage
fs.watch('...', () => {
  eventBus.publish(new FileUploaded('id', 'file storage'))
})


// http/grpc/whatever push/pull
class Controller {
  handleFile() {
    // will handle the same way as file storage
  }
}

// the saga

// assumed available in this pseudocode: event and scheduling decorators,
// plus a failure event published when stale state is cleaned up
declare function On(event: unknown): MethodDecorator
declare function Schedule(spec: string): MethodDecorator
class ProcessingFailed {}

class Saga {
  sagaState: any = {}

  constructor(
    private readonly sagaStateRepo: any,
    private readonly finalizer: any,
    private readonly process: any,
    private readonly streamResolver: any
  ) {
  }

  @On(ProcessingRequested)
  handleNewProcessing(event: ProcessingRequested) {
    const state = this.sagaStateRepo.get(event.id)
    state.handle(event)
    state.save()
  }

  @On(FileUploaded)
  handleFileUploaded(event: FileUploaded) {
    const state = this.sagaStateRepo.get(event.id)
    state.handle(event)
    if (state.readyForProcessing) {
      const stream = this.streamResolver(event)
      this.process(state, stream)
      this.finalizer.cleanup(state)
    }
    state.save()
  }

  @Schedule('every 15 minutes')
  cleanup() {
    const states = this.sagaStateRepo.getAllStale()
    this.finalizer.cleanup(states)
    eventBus.publish(new ProcessingFailed())
  }
}

Actually, it may overcomplicate things. We need to be careful while designing it and be sure that we'll benefit from the decoupling.

@Dyostiq
Contributor

Dyostiq commented May 25, 2021

Another idea – use HTTP pull, but provide a generic link, maybe even embedded in the queue message. It can be a signed link to an S3 bucket, to another service, or to an API instance. If we want to treat geoprocessing purely as a processing service, we keep all file management on the API side: geoprocessing is not aware of where the file comes from; it just gets an HTTP link from which it can read a stream. Then geoprocessing and file storage are decoupled. It all depends on where we want to assign the responsibility.
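The HTTP-pull idea above could look roughly like this on the geoprocessing side. A sketch only: the fetcher is injected so the worker depends on nothing but the URL contract, and all names (`ProcessingMessage`, `pullAndProcess`) are hypothetical.

```typescript
// Sketch of the HTTP-pull option: the queue message carries an opaque,
// possibly signed URL; geoprocessing streams from it and never learns
// whether the file lives in S3, another service, or an API instance.

interface ProcessingMessage {
  jobId: string;
  // e.g. a pre-signed S3 URL, or an API endpoint serving the file
  fileUrl: string;
}

// Minimal fetch-like shape, injected so storage can change freely
// (and so the worker is trivially testable without a network).
type Fetcher = (url: string) => Promise<{ ok: boolean; text(): Promise<string> }>;

async function pullAndProcess(
  msg: ProcessingMessage,
  fetchImpl: Fetcher,
  process: (body: string) => void,
): Promise<void> {
  const res = await fetchImpl(msg.fileUrl);
  if (!res.ok) {
    throw new Error(`cannot pull file for job ${msg.jobId}`);
  }
  process(await res.text());
}
```

Swapping storage backends then only changes who signs and serves `fileUrl`, never this worker code.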

@hotzevzl
Member Author

@Dyostiq thank you for writing up your suggestions - I have added them to the ADR options, except this one, as I am not sure I understand it:

2. Use object storage, but provide an adapter mimicking a bucket/blob on the local filesystem – for a test, on-premise installations, etc.

Do you mean to use an in-cluster (Compose or k8s) blob storage service such as MinIO, and use an S3 interface throughout? Or to use bucket storage via a FUSE interface (so similar to your first suggestion, but in all cases, rather than using shared volumes in local dev environments and bucket storage via FUSE only in Kubernetes clusters)?

I haven't added the saga suggestion to the ADR as I think it could be useful as a reference for implementation if relevant, but it may be orthogonal to the choice of backing storage.

@Dyostiq
Contributor

Dyostiq commented May 27, 2021

@hotzevzl yes, the saga is an implementation detail, as I said, but I wrote the comment to keep the idea somewhere.

With the second option, what I meant is to always use the S3 interface for files, and make storage available to the API/geoprocessing services through that interface; we can then switch seamlessly to regular S3/Azure/MinIO and so on.
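The "always use the S3 interface" option could be sketched as a small storage port with a local-filesystem adapter for dev and tests; a real adapter backed by S3/Azure/MinIO would implement the same interface. The interface and class names here are illustrative, not the actual Marxan code.

```typescript
import { promises as fs } from 'fs';
import * as path from 'path';
import * as os from 'os';

// Minimal blob-storage port: callers only ever see put/get by key,
// so the backing store (local disk, S3, Azure, MinIO) is swappable.
interface BlobStorage {
  put(key: string, body: Buffer): Promise<void>;
  get(key: string): Promise<Buffer>;
}

// Local-filesystem adapter: mimics a bucket by mapping object keys
// to paths under a root directory. Suitable for tests and on-premise.
class LocalFsBlobStorage implements BlobStorage {
  constructor(private readonly root: string) {}

  private resolve(key: string): string {
    return path.join(this.root, key);
  }

  async put(key: string, body: Buffer): Promise<void> {
    const target = this.resolve(key);
    await fs.mkdir(path.dirname(target), { recursive: true });
    await fs.writeFile(target, body);
  }

  async get(key: string): Promise<Buffer> {
    return fs.readFile(this.resolve(key));
  }
}
```

Since both services would depend only on `BlobStorage`, switching from this adapter to a cloud-backed one is a wiring change, not a code change.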

@hotzevzl hotzevzl marked this pull request as ready for review May 28, 2021 15:31
@hotzevzl hotzevzl merged commit 018cb59 into develop May 28, 2021
@hotzevzl hotzevzl deleted the docs/api/MARXAN-438_workflow-for-file-uploads branch May 28, 2021 15:32