Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support for conversation_id #115

Merged
merged 17 commits into from
Oct 28, 2024
Merged

Conversation

ri72miieop
Copy link
Contributor

@ri72miieop ri72miieop commented Oct 10, 2024

This PR adds the field conversation_id to the tweets table. It has a value when the whole chain of replies up to the main tweet is found on the archive, otherwise it stays NULL.

When the user finishes the upload of the archive and the upload_stage is changed to 'complete' a new job with the key 'update_conversation_ids' is queued. This job calls a function which runs the update over all the tweets (this is needed to also update conversation_id of other users that have interacted with tweets from the one who uploaded, but there's some space for optimizations here) and refreshes the materialized view.

The materialized view 'main_thread_view' contains only the tweets in the main thread, so no replies from other users nor a continuation of a conversation between the user who wrote the thread and others.

related to #70

Copy link

vercel bot commented Oct 10, 2024

@ri72miieop is attempting to deploy a commit to the theexgenesis' projects Team on Vercel.

A member of the Team first needs to authorize it.

Copy link

vercel bot commented Oct 11, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
community-archive ✅ Ready (Inspect) Visit Preview 💬 Add feedback Oct 28, 2024 8:00pm

@TheExGenesis
Copy link
Owner

@ri72miieop you can run pnpm build locally to check if it'll build in production - in this case it was this error

image

@TheExGenesis
Copy link
Owner

btw for long running materialized view refreshes it's best to do it concurrently - otherwise selects are prevented REFRESH MATERIALIZED VIEW CONCURRENTLY main_thread_view;

@TheExGenesis
Copy link
Owner

I also wouldn't put conversation_id in the main tweets table, I'd make a separate table, just bc it's something we're adding on

@TheExGenesis
Copy link
Owner

I would also put all these functions in the private schema, the only reason to be in public is if we need to call them from the client side, as a user

@TheExGenesis
Copy link
Owner

I'm also wondering if that materialized view is necessary. Do you have a specific purpose in mind for it? Might it be enough to have a regular view that computes just in time?

* 'main' of https://github.com/ri72miieop/community-archive:
  fix: import formatteduser type
  script: check archives in storage but not db and vice versa
  refactor: user info, add user followers to featured profiles, likes to user dir
  feat: word occurrences pg fn
  Update archive_data.md
  Update archive_data.md
  style: format counts
  script: anon key download storage
  fix: don't require like.js
  fix: delete timeout migration
  doc: what data we use from archive
  fix: delete archive timeout
Added index on main_thread_view
Refresh materialized view concurrently
@TheExGenesis
Copy link
Owner

for context here

image
image

…tions

Change materilized view to calculate info for all threads for a function that does that same for a single conversation_id
Create view tweets_w_conversation_id
@ri72miieop
Copy link
Contributor Author

Changes made:

  • Move conversation_id from tweets table to a new table called conversations
  • Change materialized view to calculate info for all threads for a function that does that same for a single conversation_id
  • Create view tweets_w_conversation_id

It's compiling on my machine but it seems to be failing on Vercel? Not sure if that's because there's some manual process necessary or it is failing to compile, if there's any issue let me know.

@ri72miieop ri72miieop changed the title feat: Add conversation_id to tweets and created materialized view with the main threads of every user feat: support for conversation_id Oct 18, 2024
@TheExGenesis
Copy link
Owner

@ri72miieop Thanks for updating! Looping over individual tweets seems inefficient, have you considered starting from root tweets and processing them in layers? I asked Claude to draft a proposal. Haven't tested yet.

image

@ri72miieop
Copy link
Contributor Author

Yeah this is similar to what I was doing before. I tested it now, results:
imagem

Going one by one seems inefficient but I am taking advantage of the order of the tweets, so when I search for the tweet with the id reply_to_tweet_id in the temporary table I have a guarantee that it either is there and already has the conversation_id computed or is null which means we don't have the complete thread in the db yet so we can ignore for now.

Meanwhile in the recursive case if it is a thread with like 300 elements it will have to recursively search many times for each of the tweet (this is why I it was taking 1m with my test data that had 10% of the tweets of the archive and with the full archive it was timing out).

my SQL is rusty, so if I missed something let me know, but I think this is the reason recursively it scales so much worse with more data than just going through each one of them one by one

@TheExGenesis
Copy link
Owner

testing the latest functions
btw there's a typo in conversations table, needs a comma after "text"


CREATE TABLE IF NOT EXISTS "public"."conversations" (
    "tweet_id" text NOT NULL PRIMARY KEY,
    "conversation_id" text,
    FOREIGN KEY (tweet_id) REFERENCES public.tweets(tweet_id)
);

@TheExGenesis TheExGenesis merged commit e4639d6 into TheExGenesis:main Oct 28, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants