Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comments on "Implementing faceted search with Django and PostgreSQL" #1

Open
simonw opened this issue Oct 5, 2017 · 4 comments
Open

Comments

@simonw
Copy link
Owner

simonw commented Oct 5, 2017

Comments on https://simonwillison.net/2017/Oct/5/django-postgresql-faceted-search/

@pauloxnet
Copy link

Hi Simon, and thank you for sharing your article.

We used the postgresql's full-text search module in django for some projects at work and I think that striping html tags from your Entry's body is not necessary.

I tried to apply to_tsvector on a snippet of html code from your article, before and after strip_tags, and the resulting vectors are the same:

SQL Fiddle

PostgreSQL 9.6 Schema Setup:

Query 1:

SELECT to_tsvector('<p>I''ve added <a href="https://simonwillison.net/search/">a faceted search engine</a> to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL"like" queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:</p><ul><li><a href="https://simonwillison.net/search/?q=postgresql">All content matching "postgresql"</a></li><li><a href="https://simonwillison.net/search/?q=django&amp;type=quotation">Just quotations matching "django"</a></li><li><a href="https://simonwillison.net/search/?q=python+javascript&amp;tag=mozilla&amp;year=2007">All content matching "python" and "javascript" with the tag "mozilla" posted in 2007</a></li></ul><p>It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.</p>')

Results:

|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        to_tsvector |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| '2007':70 'ad':3 'also':72 'appli':89 'back':86 'blog':10 'blogmark':38 'combin':43 'content':34,51,59 'corpus':104 'count':75 'date':31 'django':57 'engin':7 'entri':36 'exampl':48 'facet':5,74 'feel':98 'filter':26,29,32,93 'general':97 'get':85,95 'javascript':63 'like':24 'mani':81 'match':52,56,60 'mozilla':67 'navig':107 'one':90 'post':68 'postgresql':13,53 'power':11 'proper':19 'provid':73 'python':61 'queri':25 'quotat':40,55 'regular':16 'result':82 'search':6,18,20,49 'shape':101 'sql':23 'support':15 'tag':28,66 'tell':79 'text':17 'type':35 've':2 'vs':37,39 |

Query 2:

SELECT to_tsvector('I''ve added a faceted search engine to this blog, powered by PostgreSQL. It supports regular text search (proper search, not just SQL"like" queries), filter by tag, filter by date, filter by content type (entries vs blogmarks vs quotation) and any combination of the above. Some example searches:All content matching "postgresql"Just quotations matching "django"All content matching "python" and "javascript" with the tag "mozilla" posted in 2007It also provides facet counts, so you can tell how many results you will get back before you apply one of these filters - and get a general feeling for the shape of the corpus as you navigate it.')

Results:

|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          to_tsvector |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| '2007it':70 'ad':3 'also':71 'appli':88 'back':85 'blog':10 'blogmark':38 'combin':43 'content':34,51,59 'corpus':103 'count':74 'date':31 'django':57 'engin':7 'entri':36 'exampl':48 'facet':5,73 'feel':97 'filter':26,29,32,92 'general':96 'get':84,94 'javascript':63 'like':24 'mani':80 'match':52,56,60 'mozilla':67 'navig':106 'one':89 'post':68 'postgresql':13,53 'power':11 'proper':19 'provid':72 'python':61 'queri':25 'quotat':40,55 'regular':16 'result':81 'search':6,18,20,49 'shape':100 'sql':23 'support':15 'tag':28,66 'tell':78 'text':17 'type':35 've':2 'vs':37,39 |

@simonw
Copy link
Owner Author

simonw commented Oct 6, 2017

You're absolutely right! Thanks, that's really interesting. I'll update the entry to reflect this.

@simonw
Copy link
Owner Author

simonw commented Oct 6, 2017

Added this to the post:

Update 6th October 8:23pm UTC - it turns out this step is entirely unnecessary. Paolo Melchiorre points out that the PostgreSQL ts_vector() function already handles tag removal. Sure enough, executing SELECT to_tsvector('<div>Hey look what happens to <blockquote>this tag</blockquote></div>') using SQL Fiddle returns 'happen':4 'hey':1 'look':2 'tag':7, with the tags already stripped.

I'm leaving the regexp_replace stuff in there though as it's a nice demonstration of how to run regular expressions inside of PostgreSQL, even though they aren't actually needed here.

@pauloxnet
Copy link

Thank you for mentioning me.

I think you have done well to leave the regexp_replace stuff, because it is very interesting.

I think I will use some of the ideas contained in your article in my upcoming talk about full-text search in django with postgresql, obviously quoting you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants