Update firstSignificantSubdomain function #3601

Merged
merged 1 commit into from
Nov 21, 2018

hatarist
Contributor

Added gov, mil and edu 2nd level domains.

I also parsed the public suffix list (https://publicsuffix.org/list/public_suffix_list.dat) to find the most popular 2nd level domains. Here are the top ones
(the number on the left is the count of TLDs that have that 2nd level domain):

    145 org
    134 com
    128 net
    122 gov
    121 edu
     66 co
     58 mil
     54 blogspot
     42 nom
     40 ac
     32 info
     22 nym
     21 biz
     19 barsy
     18 int
     17 name
     16 or
     15 gob
     13 web
     12 go
     11 tm
     11 sch
     11 pro
     11 med
     11 cloudns
     10 asso
      9 tv
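
A count like the one above can be reproduced roughly as follows. This is a minimal Python sketch, not the script actually used for this PR: the function name `count_second_level_labels` and the sample input are made up for illustration, and it assumes that only plain two-label rules (such as `com.tr`) are counted, skipping wildcard (`*.`) and exception (`!`) rules.

```python
from collections import Counter


def count_second_level_labels(psl_text: str) -> Counter:
    """Count, for each 2nd level label, how many TLDs list it in the given rules.

    Hypothetical helper: counts only plain two-label rules like "com.tr".
    """
    pairs = set()
    for raw in psl_text.splitlines():
        line = raw.strip()
        # Blank lines and "//" comments carry no rule.
        if not line or line.startswith("//"):
            continue
        # A rule ends at the first whitespace on the line.
        rule = line.split()[0].lower()
        # Assumption of this sketch: ignore wildcard and exception rules.
        if rule.startswith("*") or rule.startswith("!"):
            continue
        labels = rule.split(".")
        if len(labels) == 2:
            # (2nd level label, TLD); the set drops accidental duplicates.
            pairs.add((labels[0], labels[1]))
    return Counter(label for label, _tld in pairs)


# Tiny made-up sample in the public suffix list format.
sample = """
// comment line
tr
com.tr
org.tr
au
com.au
org.au
edu.au
*.ck
!www.ck
blogspot.com.au
"""

counts = count_second_level_labels(sample)
for label, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{n:7d} {label}")
```

To run it on the real list, pass the contents of a downloaded `public_suffix_list.dat` instead of `sample`.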

I guess it'd be a performance issue to use the whole public suffix list?

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Added gov, mil, edu 2nd level domains
@alexey-milovidov
Member

alexey-milovidov commented Nov 18, 2018

> I guess it'd be a performance issue to use the whole public suffix list?

It should not be, if we use a carefully crafted hash table.
(That is not required for now.)
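
To illustrate the idea only: a lookup against the full list can be a handful of hash probes per host, one for each candidate suffix. The sketch below is Python, not the ClickHouse implementation. It uses Python's built-in `frozenset` as the hash table, takes a bare host rather than a URL, ignores wildcard and exception rules, and returns an empty string when no suffix matches. The function name and the tiny suffix set are assumptions made for the example.

```python
def first_significant_subdomain(host: str, suffixes: frozenset) -> str:
    """Return the label directly to the left of the longest known suffix.

    Hypothetical sketch: `suffixes` holds plain rules such as "ru" or "com.tr".
    """
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest;
    # each membership test is a single hash table lookup.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # i == 0 means the host itself is a suffix: nothing significant.
            return labels[i - 1] if i > 0 else ""
    return ""


# Made-up miniature suffix set; a real one would be built from the full list.
SUFFIXES = frozenset({"ru", "tr", "com", "com.tr", "gov.tr"})

print(first_significant_subdomain("news.yandex.ru", SUFFIXES))      # yandex
print(first_significant_subdomain("news.yandex.com.tr", SUFFIXES))  # yandex
```

The number of probes is bounded by the number of labels in the host, so the cost depends on the host's depth and not on the size of the list.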

Member

@alexey-milovidov alexey-milovidov left a comment


A test is missing.

Look at the dbms/tests/queries directory.

@alexey-milovidov
Member

You can also provide a performance test to check for performance degradation.
Look at the dbms/tests/performance directory.

@alexey-milovidov alexey-milovidov merged commit 8590348 into ClickHouse:master Nov 21, 2018
@alexey-milovidov
Member

Performance is degraded by 12% on URLs from Yandex.Metrica.

clickhouse-benchmark <<< "SELECT count() FROM test.hits WHERE NOT ignore(firstSignificantSubdomain(URL))"

From 0.137 sec. to 0.153 sec.
That's not satisfactory.

@alexey-milovidov
Member

Performance has been improved to 0.116 sec. by 2662594
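
For anyone checking the numbers, the relative changes between the three timings quoted in this thread can be worked out as below. This is a small Python sketch; the helper name `relative_change` is made up for the example, and the only inputs are the 0.137, 0.153 and 0.116 sec. figures reported above.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change of `after` relative to `before` (positive = slower)."""
    return (after - before) / before * 100.0


baseline, regressed, optimized = 0.137, 0.153, 0.116  # seconds, from this thread

print(f"{relative_change(baseline, regressed):+.1f}%")   # +11.7%, the ~12% degradation
print(f"{relative_change(regressed, optimized):+.1f}%")  # -24.2% versus the regressed build
print(f"{relative_change(baseline, optimized):+.1f}%")   # -15.3% versus the original baseline
```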
