Initial Proof of Concept for Targeting with embeddings #818

Merged
merged 21 commits into main from embedding-poc
Feb 19, 2024

Conversation

@ericholscher ericholscher commented Feb 2, 2024

This is very much a draft,
but shows what creating and storing embeddings in Postgres looks like.

Doesn't implement any querying, but it can be done with something like:

# Load shell (shell_plus auto-imports models, including AnalyzedUrl)

ADSERVER_ANALYZER_BACKEND=adserver.analyzer.backends.SentenceTransformerAnalyzerBackend ./manage.py shell_plus

# Load example data into DB

import yaml
from yaml import Loader
from adserver.analyzer.tasks import analyze_url

data = yaml.load(open("/model/assets/categorized-data.yml"), Loader)

for dat in data:
    url = dat['url']
    analyze_url(url, publisher_slug='ethicaladsio', force=True)


# Run initial test query

from pgvector.django import L2Distance

url = "https://observablehq.com/"
analyze_url(url, publisher_slug='ethicaladsio', force=True)

aurl = AnalyzedUrl.objects.get(url=url)

# 10 nearest neighbors by L2 distance, excluding the query URL itself
for url in AnalyzedUrl.objects.exclude(url=aurl.url).order_by(L2Distance('embedding', aurl.embedding))[:10]:
    print(url)

This was setting folks back after we re-enabled paid ads.
I'm not sure this is the cleanest way to do this,
but seems reasonable.
@ericholscher ericholscher requested a review from a team as a code owner February 2, 2024 22:28
@ericholscher ericholscher changed the title from "embedding poc" to "Initial Proof of Concept for Targeting with embeddings" on Feb 2, 2024
@davidfischer davidfischer left a comment

I couldn't get the migrations to run correctly even after building a new docker image. Am I missing something?

Running migrations:
  Applying adserver_analyzer.0003_add_embeddings...Traceback (most recent call last):
...
django.db.utils.ProgrammingError: type "vector" does not exist
LINE 1: ...rver_analyzer_analyzedurl" ADD COLUMN "embedding" vector(3) ...

  for publisher in Publisher.objects.filter(
-     allow_paid_campaigns=True, created__lt=threshold
+     allow_paid_campaigns=True, created__lt=threshold, modified__lt=threshold
Collaborator

I think this will not currently work as intended although I do think we want this. We calculate publisher CTRs nightly and this updates the modified time:

@app.task()
def calculate_publisher_ctrs(days=7):
    """Calculate average CTRs for paid ads on a publisher for the last X days."""
    sample_cutoff = get_ad_day() - datetime.timedelta(days=days)

    for publisher in Publisher.objects.all():
        queryset = AdImpression.objects.filter(
            date__gte=sample_cutoff,
            publisher=publisher,
            advertisement__flight__campaign__campaign_type=PAID_CAMPAIGN,
        )
        report = PublisherReport(queryset)
        report.generate()

        publisher.sampled_ctr = report.total["ctr"]
        publisher.save()

@davidfischer davidfischer Feb 8, 2024

We could change the publisher CTR calculations to only run on those where paid ads are approved. That way most publishers won't be updated nightly. Or we could make the save query into an update so the mod time isn't updated (and a historical record isn't created)
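
A minimal sketch of that second option, inside the existing loop: a queryset update() bypasses save(), so the auto_now modified timestamp isn't bumped and no save()-based history record is written.

# Sketch only: persist the sampled CTR without calling publisher.save()
Publisher.objects.filter(pk=publisher.pk).update(sampled_ctr=report.total["ctr"])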

Member Author

Ah yea.. this must have snuck in from a PR I branched off... definitely didn't do it as part of this PR.

Member Author

Will roll this back, since it's a different change.

@ericholscher
Member Author

@davidfischer ah, yea. You need to enable the pgvector extension in the DB. I forgot to note that: https://github.com/pgvector/pgvector?tab=readme-ov-file#getting-started


davidfischer commented Feb 8, 2024

I think there are a few problems here.

  • This branch hasn't taken any of the updates from main since ~August, before we upgraded Postgres for Django 4.2. I think we need to recreate/rebase the PR, as a bunch of things are going to be off and there are going to be conflicts.
  • Rather than using a 3rd party Docker image with an unknown version of Postgres and an unknown version of pgvector, let's just stick with the pinned version of PG we are using (15.2) and add building the extension to the Dockerfile. Hopefully it's as easy as adding a few steps to the Dockerfile.
  • Seems fairly easy to add migrations.RunSQL('CREATE EXTENSION IF NOT EXISTS vector;') to the migration (can we collapse the two migrations into one?). A sketch follows below.
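
A rough sketch of what that single, collapsed migration could look like (the dependency name is a placeholder, not the real previous migration):

# Hypothetical collapsed migration: enable pgvector, then add the vector column.
from django.db import migrations

from pgvector.django import VectorField


class Migration(migrations.Migration):
    dependencies = [
        ("adserver_analyzer", "0002_previous_migration"),  # placeholder name
    ]

    operations = [
        # The extension must exist before the "vector" column type is used
        migrations.RunSQL(
            "CREATE EXTENSION IF NOT EXISTS vector;",
            reverse_sql="DROP EXTENSION IF EXISTS vector;",
        ),
        migrations.AddField(
            model_name="analyzedurl",
            name="embedding",
            field=VectorField(dimensions=384, default=None, null=True, blank=True),
        ),
    ]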

@@ -55,6 +56,8 @@ class AnalyzedUrl(TimeStampedModel):
         ),
     )

+    embedding = VectorField(dimensions=384, default=None, null=True, blank=True)
Collaborator

I suspect we will need some sort of approximate index here, but that can come later.

@ericholscher ericholscher Feb 9, 2024

Aye, that's definitely a next step once we start querying it.

Definitely interesting:

You can add an index to use approximate nearest neighbor search, which trades some recall for speed. Unlike typical indexes, you will see different results for queries after adding an approximate index.

https://github.com/pgvector/pgvector?tab=readme-ov-file#indexing
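
When we get there, pgvector's Django integration ships index classes for this; a sketch of what adding one inside AnalyzedUrl's Meta might look like (the index name and parameters are illustrative, and HNSW needs pgvector 0.5+):

# Sketch only: approximate-nearest-neighbor index on the embedding column,
# declared inside the model's Meta.
from pgvector.django import HnswIndex

class Meta:
    indexes = [
        HnswIndex(
            name="analyzedurl_embedding_hnsw",
            fields=["embedding"],
            m=16,
            ef_construction=64,
            opclasses=["vector_l2_ops"],
        ),
    ]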

ericholscher and others added 4 commits February 8, 2024 12:12
- Create vector extension in the migration
- Ensure psql on the django docker image
- Use our maintenance scripts in the pg image
  while still using pgvector


sentence-transformers
pgvector
Collaborator

Because we import this directly in models.py, this probably has to go in the base requirements.

Collaborator

Actually, this is probably just going to be a nightmare for testing. Having a Postgres-specific field may require our testing setup to change, since our tests run with an in-memory SQLite setup.

Member Author

FWIW, sqlite has a similar extension: https://github.com/asg017/sqlite-vss -- but might be worth just running tests in postgres? 🤷

Member Author

Looks like there isn't an easy way to use the sqlite package in Django. Another idea I had that probably makes sense:

Break the embeddings out into their own model, with a FK or OneToOne to the AnalyzedURL? That way we could keep this all self-contained.
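
A sketch of how that break-out could look, assuming a OneToOne back to AnalyzedUrl (the model and related names are made up for illustration):

# Sketch only: keep the pgvector-specific column in its own model so the
# rest of the app never touches the vector type.
from django.db import models

from pgvector.django import VectorField


class AnalyzedUrlEmbedding(models.Model):
    analyzed_url = models.OneToOneField(
        "adserver_analyzer.AnalyzedUrl",
        on_delete=models.CASCADE,
        related_name="embedding_record",
    )
    embedding = VectorField(dimensions=384, null=True, blank=True)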

Collaborator

I think it's fine to have the embeddings on AnalyzedUrl. I think we just have to change the test skip logic (this) for the analyzer. We could ensure that adserver.analyzer is excluded from testing entirely and when it is tested that it uses Postgres. Thoughts?
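
One way that skip could look, gating on the active database backend rather than a setting (the placement and wording are assumptions, not from this PR):

# Sketch only: skip analyzer tests unless the test database is Postgres,
# since the embedding field needs the pgvector extension.
from unittest import SkipTest

from django.db import connection


def setUpModule():
    if connection.vendor != "postgresql":
        raise SkipTest("adserver.analyzer tests require Postgres with pgvector")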

Member Author

Yea, that likely makes sense as well, if we don't have tests for the code currently that we'd be skipping.

@davidfischer davidfischer left a comment

This looks great and I think disabling the analyzer is a sensible default (especially for OSS users of our project). It might be nice to find a way to run tests on the analyzer but not a blocker for merging this.

@ericholscher ericholscher merged commit 4db23c9 into main Feb 19, 2024
1 check passed
@ericholscher ericholscher deleted the embedding-poc branch February 19, 2024 16:58