DRAFT failed inserts recovery #139
base: master
Conversation
Looks like it should work @lukeindykiewicz !
👍
object Recover {

  //TODO: hardcode proper GCS path
  val DeadQueuePath = Path("gs://sp-storage-loader-failed-inserts-dev1-com_snplow_eng_gcp/dead_queue")
Should not appear on a public repo like this.
Good point - thanks. Do you have an idea where to put it for such a one-off job? An env var feels right, but I would prefer not to touch TF for this task.
Thanks Dilyan. This means changes in TF unfortunately.
Sorry, not sure I follow. Why does it mean changes in TF? Isn't it the same bucket that is being passed as an argument to Repeater on deploy?
Well there are only 2 possibilities for configuration:
- either it's hard-coded
- or it needs to be provided, meaning TF
(Dilyan's answer appeared while I was writing)
Thanks @dilyand. You might be right, we already have it! Awesome - thanks!
Force-pushed to remove the hardcoded path.
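For reference, a minimal sketch of how the path could be supplied without hardcoding it, assuming the same blobstore `Path` type as in the snippet above. The object name and env var are hypothetical; in practice the bucket may simply reuse the argument Repeater already receives on deploy, as discussed above.

```scala
import blobstore.Path
import cats.effect.Sync
import cats.syntax.all._

object RecoverConfig {
  // Hypothetical: read the dead-letter bucket from an env var so the value
  // never lives in the source tree. If Repeater already receives this bucket
  // as a deploy-time argument, that value should be reused instead.
  def deadQueuePath[F[_]: Sync]: F[Path] =
    Sync[F].delay(sys.env.get("DEAD_QUEUE_PATH")).flatMap {
      case Some(p) => Path(p).pure[F]
      case None    => Sync[F].raiseError(new RuntimeException("DEAD_QUEUE_PATH is not set"))
    }
}
```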
resources
  .store
  .list(DeadQueuePath)
  .evalMap(resources.store.getContents)
I'm not familiar with Store.getContents - does all the data need to fit into memory? If yes, are we sure the data is not too big?
I assumed that every single file should fit in memory without problems. Is that a wrong assumption?
Maybe you can read the files line by line to be sure?
Yes, I definitely can. I just thought these files are pretty small. Thank you both!
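For what it's worth, a sketch of the line-by-line alternative suggested here, assuming the store exposes a byte stream via `get(path, chunkSize)` as fs2-blobstore does; the helper name is illustrative.

```scala
import blobstore.{Path, Store}
import fs2.{Stream, text}

// Illustrative only: stream each dead-queue object and split it into lines,
// so a whole file never has to sit in memory at once.
def failedInsertLines[F[_]](store: Store[F], deadQueue: Path): Stream[F, String] =
  store
    .list(deadQueue)
    .flatMap { path =>
      store
        .get(path, 4096)             // raw bytes of a single object
        .through(text.utf8Decode)    // decode per object
        .through(text.lines)         // one failed insert per line
    }
    .filter(_.nonEmpty)
```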
import cats.effect.Sync
import cats.syntax.all._

object Recover {
Don't we want a main() to be able to recover already existing bad rows with the percentage issue?
What do you mean? It's added in Repeater.scala, to main for the repeater:
https://github.com/snowplow-incubator/snowplow-bigquery-loader/pull/139/files#diff-d0e5513b7035a18d3239b7e5b227c63eb1db4386df406d89c23b3b6e7e331cb5
Do you have something else in mind?
I thought that there were 2 use cases:
- auto recover failed inserts live
- recover already existing bad rows
The main() in Repeater.scala does the first one, but I struggle to see where 2) happens.
This job is only for point 2). The main starts the Recover stream, which does the job. Am I missing something?
My understanding is that Repeater is a long-running Scala app that reads from PubSub and writes to BQ. How will you make it run as a "batch" job to just read GCS?
As I see it recover in repeater should only be called for current failed inserts, not for the whole history.
I added a stream to Repeater that will work next to the main flow. It will read the data from GCS and write to Pub/Sub - kind of similar to what the repeater does, but the other way round. It will read all the data that is in the filtered buckets, recover it, write to Pub/Sub and that's it; the stream will stop. All other streams in the main flow will continue to work normally. I'm only adding functionality, not changing the existing one.
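A minimal sketch of that composition, assuming fs2 and illustrative stream names: the finite recovery stream runs next to the long-running flows and simply terminates once it has re-published everything it found in GCS, while the main flow keeps running.

```scala
import cats.effect.Concurrent
import fs2.Stream

// Illustrative: run the one-off recovery stream alongside the normal flow.
def run[F[_]: Concurrent](
  mainFlow: Stream[F, Unit],  // PubSub -> BigQuery, never terminates
  recovery: Stream[F, Unit]   // GCS -> PubSub, finite one-off stream
): Stream[F, Unit] =
  Stream(mainFlow, recovery).parJoinUnbounded
```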
It means that we will start the repeater and will expect it to go and try to recover failed inserts from weeks ago. I guess what is troubling me is that we want to use a long-running streaming app to perform a batch job (for this very use case).
That's why it's a one-off job. There is nothing wrong with this solution, imho. We just add a recovery feature to the app so that we don't have to spin up a new app and bother with all the infrastructure code and the trouble of spinning up something new. The long-term goal would be to use the recovery project for this purpose, but currently it cannot recover events in this part of the pipeline.
Instead of having BQ loader write failed inserts to GCS, and then BQ loader reading these files from GCS, can't we directly try to recover on the fly and then write to GCS only if recovery was not successful?
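A rough sketch of this alternative, with all names hypothetical: try to fix and re-insert a failed row immediately, and only sink it to GCS when that second attempt fails as well.

```scala
import cats.effect.Sync
import cats.syntax.all._

// Hypothetical shape of on-the-fly recovery inside the loader.
def handleFailedInsert[F[_]: Sync, Row](
  row: Row,
  fix: Row => Option[Row],                    // e.g. rename the offending column
  insert: Row => F[Either[Throwable, Unit]],  // retry the BigQuery insert
  sinkToGcs: Row => F[Unit]                   // last resort: dead-letter bucket
): F[Unit] =
  fix(row) match {
    case Some(fixed) =>
      insert(fixed).flatMap {
        case Right(_) => Sync[F].unit
        case Left(_)  => sinkToGcs(row)       // keep the original payload
      }
    case None => sinkToGcs(row)
  }
```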
def recoverFailedInserts[F[_]: Timer: Concurrent](resources: Resources[F]): Stream[F, Unit] =
  resources
    .store
    .list(DeadQueuePath)
Do we have one Path for each batch of failed inserts, or is it always the same? In the latter case, if failed inserts keep being added we will always try to reprocess them all. I didn't see a place where we remove a successfully recovered file.
Good point. It's not removed at the moment; I'll probably ask support to remove it after a successful recovery. To not read the same file twice, the filtering part (in the TODO) will ensure we only recover the folder once.
I don't think there will be problems like every second event failing.
I rather assume that all will be ok, or all will fail (if there is a mistake in the recovery).
> To not read the same file twice, the filtering part (in TODO) will ensure we only recover the folder once.

So you want to maintain a manifest outside of BQ loader?
No. I will simply filter for the proper dates on which the failed events occurred. I will not do it for all past and current events in dead_queue, only for particular dates.
We would do it like you say, but these events are already there and waiting for us. The wrong column name has already been corrected and we only need to reprocess past failed events.
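A sketch of that date filtering, assuming the date appears somewhere in the object key; the exact key layout and the dates themselves are hypothetical.

```scala
import blobstore.{Path, Store}
import fs2.Stream

// Only the days on which the bad column name caused failed inserts (assumed).
val datesToRecover = List("2020-11-09", "2020-11-10")

def pathsToRecover[F[_]](store: Store[F], deadQueue: Path): Stream[F, Path] =
  store
    .list(deadQueue)
    .filter(path => datesToRecover.exists(date => path.toString.contains(date)))
```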
Force-pushed from 73ef80b to 6ddd52e
Maybe I'm just missing some context. Do we just want to run it once and then remove the recover part and go back to the normal job?
Yes, that's exactly what we're planning to do.
Ok sorry, I totally missed that. I thought that the plan was also to add auto-recovery for
No worries, thanks for asking questions Ben!
I agree with @benjben that it feels a bit hacky to combine this job with the repeater, but I also respect the time pressure, so I agree it might be justified. But I also see two possible correctness mistakes:
- The payload: the original one does not contain eventId, etlTstamp and payload keys. If you look at the EventContainer decoder you'll see that it's just the content of payload. I might be missing something, but that's how I remember it.
- The global _% replace. The chance of corrupting data is small, but still there; I would strongly recommend switching to keys only.

Also, I strongly recommend using the BadRow type for parsing.
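A small sketch of the BadRow suggestion. It assumes a circe Decoder[BadRow] is available from the badrows library (the exact import of the codecs may differ), instead of a hand-rolled failed-insert parser.

```scala
import com.snowplowanalytics.snowplow.badrows.BadRow
import io.circe.parser.decode

// Parse a dead-queue line as a typed BadRow.
// Assumes the badrows circe decoder is in implicit scope.
def parseFailedInsert(line: String): Either[String, BadRow] =
  decode[BadRow](line).left.map(err => s"Cannot parse line as BadRow: ${err.getMessage}")
```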
{
  "eventId": "90aebb1b-1d49-455b-9049-15ec72dfe5a9",
  "etlTstamp": "2020-11-10T16:56:39.283Z",
I might be missing something, but I don't think there are eventId and etlTstamp anywhere in repeater's output.
So, I'm also not sure where data like this should go - it's neither a bad row (not an SDJ), nor a failed insert that can be forwarded to BQ.
def recover: String => Either[String, EventContainer] =
  b =>
    stringToFailedInsertBadRow(b).map { ev =>
      val fixed = fix(ev.payload)
This can potentially corrupt a lot of good data: if there are any values with _% it will change them without a reason. It should operate only on keys.
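A sketch of a keys-only fix, assuming the payload is handled as circe Json; the concrete replacement rule for the percentage issue is an assumption here.

```scala
import io.circe.{Json, JsonObject}

// Rename object keys recursively instead of doing a global string replace,
// so values that happen to contain the same substring are never touched.
def renameKeys(fixKey: String => String)(json: Json): Json =
  json.arrayOrObject(
    json,                                                  // leave scalars unchanged
    arr => Json.fromValues(arr.map(renameKeys(fixKey))),   // recurse into arrays
    obj =>
      Json.fromJsonObject(
        JsonObject.fromIterable(
          obj.toList.map { case (k, v) => fixKey(k) -> renameKeys(fixKey)(v) }
        )
      )
  )

// Hypothetical usage; the actual key rewrite for the percentage issue may differ:
// val fixed = renameKeys(k => k.replace("_%", "_percentage"))(payload)
```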
This is not a ready solution; it's a stub of a solution that should contain the most important parts. The PR is to verify the concept and the direction it follows.