
Instagram Connector: Airbyte Reset/Upgrade and Connector Failing #5435

Closed
jim-barlow opened this issue Aug 16, 2021 · 23 comments

@jim-barlow

jim-barlow commented Aug 16, 2021

Environment

  • Airbyte version: 0.29.7-alpha (Upgraded from 0.29.2-alpha)
  • OS Version / Instance: GCP n1-standard-1 (3.75 GB RAM, 15 GB disk)
  • Deployment: Docker
  • Source Connector and version: Instagram 0.1.7 upgraded to 0.1.8
  • Destination Connector and version: BigQuery 0.3.12
  • Severity: High
  • Step where error happened: Sync job / Update connector / Upgrade Airbyte

Current Behavior

So some really strange behaviour happened today (around 2021-08-16 06:52:43 UTC, I think): I connected to my VM running Airbyte via SSH and everything looked fine. Then I refreshed a short while later and it went back to the screen as if it was a fresh Airbyte installation. Luckily I only have one critical job running on it, and it wasn't running at the time, so I set it back up (pointing to the same destination dataset), upgrading Airbyte and the Instagram connector in the process, and left it to sync.

However, it is now failing (logs below) and going into the second job after 10 hrs.

Expected Behavior

It normally syncs the full dataset in 5-6hrs.

Logs

Scheduler Logs
scheduler-logs.log

Server Logs
server-logs.log

Latest Fail Logs
logs-425-0.log

Steps to Reproduce

  1. Sync a load of Instagram accounts
  2. Wait & watch

Are you willing to submit a PR?

I wish I had your jedi skills

@jim-barlow jim-barlow added the type/bug Something isn't working label Aug 16, 2021
@jim-barlow
Author

More logs here from the failed run.

logs-425-1.log

Weirdly (I don't know if it's related, but) I got some webhook notifications which seemed to come from the old Airbyte installation, which I've since upgraded. I have since fully reset the configs, upgraded everything and set it back up, but with a much more recent start date, and it seems to be running OK; however, I will update...

@jim-barlow
Author

This is still failing after approx 6 hrs, logs here:
logs-2-0.log

I think the issue is with the DBT normalization, check out these log entries:

2021-08-17 13:22:15 INFO () LineGobbler(voidCall):85 - Completed with 1 error and 0 warnings:
2021-08-17 13:22:15 INFO () LineGobbler(voidCall):85 - 
2021-08-17 13:22:15 INFO () LineGobbler(voidCall):85 - Database Error in model media_children (models/generated/airbyte_tables/airbyte_instagram_sync/media_children.sql)
2021-08-17 13:22:15 INFO () LineGobbler(voidCall):85 -   No matching signature for function JSON_EXTRACT for argument types: ARRAY<STRING>, STRING. Supported signature: JSON_EXTRACT(STRING, STRING) at [18:5]
2021-08-17 13:22:15 INFO () LineGobbler(voidCall):85 -   compiled SQL at ../build/run/airbyte_utils/models/generated/airbyte_tables/airbyte_instagram_sync/media_children.sql
2021-08-17 13:22:15 INFO () LineGobbler(voidCall):85 - 
2021-08-17 13:22:15 INFO () LineGobbler(voidCall):85 - Done. PASS=9 WARN=0 ERROR=1 SKIP=1 TOTAL=11
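
For context, that signature error is reproducible in isolation: BigQuery's JSON_EXTRACT only accepts a JSON string as its first argument, so any column that already holds an ARRAY<STRING> fails to compile. A minimal sketch (the literals are illustrative, not from the actual sync):

-- Fails with the same "No matching signature" error: first argument is ARRAY<STRING>
SELECT JSON_EXTRACT(['{"id": "1"}', '{"id": "2"}'], '$.id');

-- Compiles: extract from each element individually after UNNEST
SELECT JSON_EXTRACT(child, '$.id') AS id
FROM UNNEST(['{"id": "1"}', '{"id": "2"}']) AS child;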

@jim-barlow
Author

As some additional fun, this somehow managed to delete ALL my historic data as part of the process! Fortunately I can restore from backups, but it's taken all day. Not sure who triages these (@keu @yevhenii-ldv), but it would be great for that not to happen again!

@jim-barlow
Author

Attempted to connect the same source to the BigQuery (denormalized typed struct) destination; this time it didn't fail, but it is still running 14 hrs later and has no logs for the past 9 hrs:

logs-5-0.log

@keu
Contributor

keu commented Aug 18, 2021

Hi @jimbeepbeep, sorry to hear that. Our team is looking into the issue; we'll let you know about the result shortly.

@jim-barlow
Author

jim-barlow commented Aug 18, 2021

Thanks @keu, I actually think the previous connection (with basic normalization) was only failing on the DBT job because of this error message:

No matching signature for function JSON_EXTRACT for argument types: ARRAY<STRING>, STRING. Supported signature: JSON_EXTRACT(STRING, STRING) at [18:5]

This data type (ARRAY<STRING>) should be extracted from JSON using JSON_VALUE_ARRAY in BigQuery, hence the error and the fail.
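
To illustrate the distinction with a hypothetical payload (not from the connector):

-- JSON_EXTRACT returns the whole array as one JSON-formatted STRING
SELECT JSON_EXTRACT('{"tags": ["a", "b"]}', '$.tags');     -- '["a","b"]'

-- JSON_VALUE_ARRAY returns an ARRAY<STRING> of the scalar elements
SELECT JSON_VALUE_ARRAY('{"tags": ["a", "b"]}', '$.tags'); -- ['a', 'b']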

I have resorted to creating a new connection without normalization and then writing the SQL in BigQuery to UNION with the backup dataset, which will hopefully get me to a consistent dataset without any gaps. It has been running for an hour and has read about 1/3 of the records, which is pretty normal. Fingers crossed this solution works as an interim!

I have had a little exposure to DBT but would like to understand more, so happy to apply some of my resources to help solve this issue, although we would need some guidance on where to start and the best approach. @cgardens I know you're the resident expert, so any pointers would be great!

@jim-barlow
Author

OK @keu good news (from my perspective anyway!) - the job without normalization (to the normal BigQuery destination 0.3.12) succeeds! Luckily I'm pretty used to working with JSON as well as nested/repeating fields in BigQuery, so I can work with the raw data and decode it into the same schema as when the sync was functioning properly, then UNION into the backup data, and we are back up and running. However, it's obviously not good that neither the normalization nor the denormalized destination seems to work with this connector... as I said, I am pretty handy with BigQuery transformations (especially ARRAY/STRUCT and UNNEST syntax), so I'd like to help fix this if you can point me in the right direction! I am a DBT beginner but keen to level up.

@cgardens
Contributor

@ChristopheDuong is actually the expert on this one. 😄

Is the issue that the type for the field is wrong in the catalog for this source? Or is the issue that this destination is not handling array types properly? No problem if you're not sure, but that's probably the next diagnostic question for us to figure out.

@ChristopheDuong
Contributor

ChristopheDuong commented Aug 19, 2021

This data type (ARRAY<STRING>) should be extracted from JSON using JSON_VALUE_ARRAY in BigQuery, hence the error and the fail.

It should be using this json extract function (through dbt macros), but I am not sure what actually happened such that the generated SQL ended up using the non-array function (or the casts were not done properly) in your use case:

json_extract_array({{ json_column }}, {{ format_json_path(normalized_json_path) }})
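
For reference, on BigQuery that macro should render to something along these lines (a sketch; the exact column reference and path depend on how format_json_path and the aliasing resolve):

    JSON_EXTRACT_ARRAY(_airbyte_data, "$['children']") AS children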

I would need to dig deeper to gain a better view of what is happening here

In the meantime, it could help if you could share the catalog.json file that ended up failing. You can follow this guide to access it:
https://docs.airbyte.io/operator-guides/browsing-output-logs#reading-the-content-of-the-catalog-json-file

@marcosmarxm
Member

@ChristopheDuong the field is an object, so it is calling the correct json_extract function; the problem is the argument Airbyte sent to it. #5467 solves this problem.

@jim-barlow
Author

I would need to dig deeper to gain a better view of what is happening here

In the meantime, it could help if you could share the catalog.json file that ended up failing. You can follow this guide to access it:
https://docs.airbyte.io/operator-guides/browsing-output-logs#reading-the-content-of-the-catalog-json-file

Thanks @ChristopheDuong, this is very cool - awesome! In the failing workspace I have a destination_catalog.json and a source_catalog.json; which one do you need, and is there a good way of copying them out? I can view them using a cat command, but they're too long to copy/paste and they have dodgy line breaks etc.

I had a dig around and found the dbt.log file, within which I found the compiled SQL which is failing - the problem seems to be:

    json_EXTRACT(table_alias.children, "$['owner']") AS owner

Aside from the weird capitalization (which doesn't affect execution, but might signify a lack of coffee), I think the issue stems from the fact that the owner is not a string; it's an object containing one field, the id. Also, I'm not sure the table_alias should be there. In my initial attempt to decode this I just pushed objects and arrays as JSON strings, but if it helps at all, my equivalent SQL to decode the _airbyte_raw_media table into a view follows, although it does leverage complex data types so the syntax is not equivalent to your normalized version. Let me know next steps.

WITH
inbound_raw_data AS (
  SELECT *
  FROM `[PROJECT].[DATASET]._airbyte_raw_media`
),

base_data_scaffold AS (
  SELECT
    _airbyte_emitted_at,
    _airbyte_ab_id,
  FROM inbound_raw_data
),

extracted_json_values AS (
  SELECT
    _airbyte_ab_id,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.id') AS STRING) AS id,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.ig_id') AS STRING) AS ig_id,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.caption') AS STRING) AS caption,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.page_id') AS STRING) AS page_id,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.username') AS STRING) AS username,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.media_url') AS STRING) AS media_url,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.permalink') AS STRING) AS permalink,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.shortcode') AS STRING) AS shortcode,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.timestamp') AS TIMESTAMP) AS timestamp,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.like_count') AS INT64) AS like_count,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.media_type') AS STRING) AS media_type,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.thumbnail_url') AS STRING) AS thumbnail_url,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.comments_count') AS INT64) AS comments_count,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.is_comment_enabled') AS BOOL) AS is_comment_enabled,
    SAFE_CAST(JSON_VALUE(_airbyte_data, '$.business_account_id') AS STRING) AS business_account_id,
  FROM inbound_raw_data
),

extracted_json_objects AS (
  SELECT
    _airbyte_ab_id,
    STRUCT(JSON_VALUE(JSON_QUERY(_airbyte_data, '$.owner'), '$.id') AS id) AS owner,
  FROM inbound_raw_data
),

extracted_json_arrays AS (
  SELECT
    _airbyte_ab_id,
    JSON_QUERY_ARRAY(_airbyte_data, '$.children') AS children_array
  FROM inbound_raw_data
),

children_array_extracted AS (
  SELECT
    _airbyte_ab_id,
    children,
    children_array_offset
  FROM extracted_json_arrays
  LEFT JOIN UNNEST(children_array) AS children WITH OFFSET AS children_array_offset
  WHERE children_array IS NOT NULL
),

children_array_struct AS (
  SELECT
    _airbyte_ab_id,
    ARRAY_AGG(
      STRUCT(
        children_array_offset,
        STRUCT(JSON_VALUE(JSON_QUERY(children, '$.owner'), '$.id') AS id) AS owner,
        SAFE_CAST(JSON_VALUE(children, '$.media_type') AS STRING) AS media_type,
        SAFE_CAST(JSON_VALUE(children, '$.id') AS INT64) AS id,
        SAFE_CAST(JSON_VALUE(children, '$.ig_id') AS INT64) AS ig_id,
        SAFE_CAST(JSON_VALUE(children, '$.shortcode') AS STRING) AS shortcode,
        SAFE_CAST(JSON_VALUE(children, '$.permalink') AS STRING) AS permalink,
        SAFE_CAST(JSON_VALUE(children, '$.media_url') AS STRING) AS media_url,
        SAFE_CAST(JSON_VALUE(children, '$.username') AS STRING) AS username,
        SAFE_CAST(JSON_VALUE(children, '$.timestamp') AS TIMESTAMP) AS timestamp
      )
    ) AS children
  FROM children_array_extracted
  GROUP BY _airbyte_ab_id
),

joined_output_data AS (
  SELECT * FROM base_data_scaffold
  LEFT JOIN extracted_json_values USING (_airbyte_ab_id)
  LEFT JOIN extracted_json_objects USING (_airbyte_ab_id)
  LEFT JOIN children_array_struct USING (_airbyte_ab_id)
)

SELECT * FROM joined_output_data

@jim-barlow
Author

OK I figured out the log export, here are the log files from the failed run @ChristopheDuong:
2_0_dbt.log
2_0_logs.log

I had to change the .json extensions to .txt to attach here, all files here as requested:
2_0_source_catalog.txt
2_0_source_config.txt
2_0_destination_catalog.txt
2_0_destination_config.txt

I did look in the logs etc. for the failed BigQuery (denormalized typed struct) destination, as I was interested in the DBT transformation and compiled SQL; however, the /dbt directory did not exist, so I'm parking this for now.

FYI the raw sync which I am decoding in BigQuery using SQL seems to have stabilised today (125k records, 4-6hrs) after a few overly long runs yesterday (218-245k records, 8-9hrs).

@marcosmarxm
Member

marcosmarxm commented Aug 23, 2021

@jimbeepbeep can you update to the latest version? The normalization should work now.

@jim-barlow
Author

OK thanks @marcosmarxm, I have upgraded Airbyte to 0.29.12-alpha, but the Instagram connector version (which was previously 0.1.8) was not incremented (see screenshot; I was expecting 0.1.9):

[Screenshot: Screen Shot 2021-08-24 at 4 54 57 pm]

My raw export job (which I had to implement as an emergency with manually-coded post-load transforms to align to the historic schema) now fails, whereas it was working great only this morning, oof. Logs follow. I would have thought that this was the most robust, so this is strange... there's an error in there I haven't seen before: "message": "(#10) Application does not have permission for this action"

logs-27-0.log
logs-27-2.log
logs-27-1.log

However, the normalized sync appears to have worked this time, with the denormalized one still running. I can work with this, but I will have to reconstruct the historic dataset from the different sources, which is a pain but workable. Let me know if there's anything I should do/expect - every time I upgrade Airbyte, something seems to break on this connector!

@sherifnada sherifnada added the area/connectors Connector related issues label Aug 25, 2021
@jim-barlow
Author

Hi @marcosmarxm / @ChristopheDuong, do you know if this connector ever got upgraded? I haven't seen an updated version past 0.1.8 (which I am currently running on 0.29.12-alpha), and the syncs have started to fail again (or take a really long time, meaning that the scheduling gets knocked off and then the monitoring we have in place is impacted). Please also let me know if I need to be running a more recent version of Airbyte to pick up the changes, and I can action, thanks.

@jim-barlow
Author

@johnlafleur this is the issue related to the Instagram connector which we discussed today. It was apparently fixed, but the connector upgrade never appeared. Starting yesterday it has been failing on every configuration I try (No Normalization, Basic Normalization, and Denormalized). I have upgraded everything from the Airbyte version to all connectors and even my VM's hard drive, but it's still failing after about 6 hrs (which is normally the time it takes for a full sync).

This is the (potential Airbyte Cloud) client who's now testing Fivetran because of reliability issues with this connector, so anything I can do to fix this is critically important!

Environment

Airbyte version: 0.29.22-alpha
OS Version / Instance: GCP n1-standard-2 (7.5 GB RAM, 25 GB disk)
Deployment: Docker
Source Connector and version: Instagram 0.1.8
Destination Connector and version: BigQuery 0.4.0 and BigQuery (denormalized typed struct) 0.1.5
Severity: Critical
Step where error happened: Sync job

Logs

Basic Normalization (FAIL)

logs-208-0.txt

Denormalized (FAIL)

logs-209-0.txt

No Normalization (PENDING)

I have the connection without any normalization running now (I was trying to offset the times so they don't run at the same time), and will post the logs once Attempt 1 is complete.

Please help!

@ChristopheDuong
Contributor

I see multiple exceptions in your logs...

One of them is always showing on the source connector side:

Syncing stream: users 
2021-09-28 10:00:04 ERROR () LineGobbler(voidCall):85 - /usr/local/lib/python3.7/site-packages/facebook_business/utils/api_utils.py:30: UserWarning: media does not allow field children
2021-09-28 10:00:04 ERROR () LineGobbler(voidCall):85 -   warnings.warn(message)
2021-09-28 10:00:04 ERROR () LineGobbler(voidCall):85 - /usr/local/lib/python3.7/site-packages/facebook_business/utils/api_utils.py:30: UserWarning: value of metric might not be compatible.  Expect list<metric_enum>;  got <class 'list'>
2021-09-28 10:00:04 ERROR () LineGobbler(voidCall):85 -   warnings.warn(message)
2021-09-28 10:00:04 ERROR () LineGobbler(voidCall):85 - /usr/local/lib/python3.7/site-packages/facebook_business/utils/api_utils.py:30: UserWarning: value of period might not be compatible.  Expect list<period_enum>;  got <class 'str'>
2021-09-28 10:00:04 ERROR () LineGobbler(voidCall):85 -   warnings.warn(message)

And another is tied to the BigQuery destination which may be related to this here:
https://github.com/airbytehq/airbyte/pull/5614/files#r717475197
which causes:

Exception in thread "main" java.lang.RuntimeException: com.google.cloud.bigquery.BigQueryException: Provided Schema does not match Table tripscout-151203:airbyte_instagram_sync._airbyte_raw_story_insights. Field _airbyte_emitted_at has changed type from TIMESTAMP to STRING 

@jim-barlow
Author

Thanks @ChristopheDuong - on the destination side that error is really strange, as the _airbyte_raw_story_insights table has the field _airbyte_emitted_at as a TIMESTAMP. However, I also have a number of _airbyte_tmp...story_insights tables which might be causing the issues, as I can see that at least one has _airbyte_emitted_at as a STRING. Do you think it would be a good approach to stop the in-flight sync, clear out all of the _airbyte_tmp tables and attempt a re-sync? If so I can script that, but please let me know if you think it should work...
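
One way to see which tables carry the drifted type is to query the dataset's column metadata (a sketch using the same [PROJECT]/[DATASET] placeholders as the SQL above):

SELECT table_name, data_type
FROM `[PROJECT].[DATASET].INFORMATION_SCHEMA.COLUMNS`
WHERE column_name = '_airbyte_emitted_at'
ORDER BY table_name;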

@ChristopheDuong
Contributor

Yes, you can safely delete all _airbyte_tmp tables.

Once the sync is done, those tables should normally be deleted, but in your case there might have been exceptions, and the cleanup did not manage to clear them out.
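
A sketch of how that cleanup could be scripted, generating one DROP statement per leftover table (same placeholders as above; review the output before running it):

SELECT CONCAT('DROP TABLE `[PROJECT].[DATASET].', table_name, '`;')
FROM `[PROJECT].[DATASET].INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE '%airbyte_tmp%';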

@ChristopheDuong
Contributor

ChristopheDuong commented Sep 28, 2021

It seems that for the BigQuery exception, the related issue is #5959 / PR #5981.

@jim-barlow
Author

Thanks @ChristopheDuong, I have cleared all the _tmp tables out and changed the schema on the _raw tables to STRING STRING STRING, so hopefully I might get one step further this time. Not sure if the conversion from _raw to final tables will cause another issue, but if so I can deal with that in the same manner. Fingers crossed.
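
For anyone hitting the same mismatch: since BigQuery won't change an existing column's type in place, the schema change can be done by rewriting each _raw table, e.g. (a sketch assuming only _airbyte_emitted_at needs casting):

CREATE OR REPLACE TABLE `[PROJECT].[DATASET]._airbyte_raw_media` AS
SELECT
  _airbyte_ab_id,
  CAST(_airbyte_emitted_at AS STRING) AS _airbyte_emitted_at,
  _airbyte_data
FROM `[PROJECT].[DATASET]._airbyte_raw_media`;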

@jim-barlow
Author

Thanks team (@marcosmarxm, @ChristopheDuong), I saw that Instagram Connector 0.1.9 became available and upgraded - whatever you did seems to have fixed my issues, and I now have two parallel syncs (one raw and decoded in BigQuery, one base normalized) running like a dream. I never could get the denormalized one to work, but that's not a problem. Closing this issue, thanks.

@ChristopheDuong
Contributor

Great news!!
