Pinot dev docs #14401

Open
gortiz opened this issue Nov 6, 2024 · 1 comment
Comments

@gortiz
Contributor

gortiz commented Nov 6, 2024

Pinot dev docs

I've recently read the Velox Developer Guide and I really
think that it would be super useful to have something like that.
Quoting that page:

This guide is intended for Velox contributors and developers of Velox-based applications.

That is exactly what I think we need for Pinot.
We should have a developer guide that centralizes all the information that a developer needs to contribute to
Pinot.
That includes how to build Pinot, how to run Pinot, how to write tests, how to write documentation, etc. but also
information on the design choices that were made, explaining key classes and concepts, etc.

Current state

We already have some developer documentation for Pinot, but it is spread across many different places and sometimes
the information is outdated or incomplete.

The most important source of developer documentation is the
Contribution Guidelines page in User documentation.
This page is focused on the process of contributing to Pinot, but it also has some information on how to build Pinot.

There are other sources of information like:

Developer documentation we need

There are many things that we need to document for developers.
For example, the following are questions I had to answer in the past months:

  • Explaining the datatypes in Pinot. How are they stored? How are they converted? For example see
    types in Velox.
  • Explaining type validation in Pinot. For example, MSQ does it in a different way than how SSQ does it.
  • Explaining key implementation differences between MSQ and SSQ.
  • Explaining how queries are parsed, validated and optimized, how a broker decides which server executes which parts
    of the query, how these plans are sent to the servers, etc.
  • Explaining how different join types are implemented in Pinot.
  • Explaining that queries need to deal with the fact that segments may not be refreshed and therefore contain
    different indexes than the ones indicated in the latest table config. Also explaining how we deal with that.

I don't think we would need to write all this documentation from scratch and explain every detail.
That could change very often and in the worst case it would end up being a translation of what the code does but in
English.
Instead I think we should explain the key points (ideally with diagrams) and refer to the important classes and methods
in the codebase.

Example: Timestamp indexes

Here is around a page of important information about timestamp indexes in Pinot that I learned by solving issues and
reading the code. I would have loved to have this information synthesized in a single place at the time.
This is the kind of information our developers may need and that we currently have no place for.


Timestamp indexes are a key feature in Pinot, but they are very different from other indexes.
Although they are called indexes in the user documentation, some committers call them "syntactic sugar" because they
are not indexes in the codebase.
Instead, when the user configures a timestamp index in their TableConfig:

  1. A new column is created for each granularity of the timestamp index (one for days, one for months, etc).
  2. A range index is created for each of these columns.
  3. Whenever a query is received, the broker rewrites the query to use these columns instead of the original
    timestamp column if the query has a filter using one of the granularities.

Some of these steps are described in the timestamp page of the user documentation, but not all of
them.
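
As a rough illustration of step 1 in the list above, the sketch below derives the names of the extra columns from the
original column name and the configured granularities. The class and method names are hypothetical and are not Pinot
code; only the `$event_time$YEAR` naming pattern is taken from this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT Pinot code: it only illustrates how the derived
// column names mentioned in this issue (e.g. $event_time$YEAR) are shaped.
final class TimestampColumnNames {
    private TimestampColumnNames() {}

    // Builds the derived column name for one granularity.
    static String withGranularity(String column, String granularity) {
        return "$" + column + "$" + granularity;
    }

    // Builds one derived column name per configured granularity, in order.
    static List<String> derivedColumns(String column, List<String> granularities) {
        List<String> result = new ArrayList<>();
        for (String granularity : granularities) {
            result.add(withGranularity(column, granularity));
        }
        return result;
    }
}
```

With this sketch, a timestamp index on `event_time` with granularities `DAY` and `MONTH` would yield two extra column
names, one per granularity, and each of them would get its own range index.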

Specifically, there is one key point on timestamp indexes.
All other column indexes optimize queries at the segment level (in the servers) by changing the way FilterPlanNode
instances are transformed into different Operators (in FilterPlanNode.constructPhysicalOperator).
Meanwhile, timestamp indexes optimize queries at the broker level (as explained above).
The broker analyzes the query to look for all usages of the original column that can be optimized.
For example if there is a timestamp index on event_time that includes the YEAR granularity,
the broker marks in the meta-information that any call to dateTrunc('YEAR', event_time) can be rewritten as
$event_time$YEAR.
Brokers do this in BaseSingleStageBrokerRequestHandler.handleExpressionOverride(), setting the expressionOverrideMap
attribute of QueryConfig.
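
To make the broker side more concrete, here is a minimal sketch of what such an override map could contain. It is an
assumption-laden simplification: expressions are plain strings here, while the real code works with Pinot's own
expression representation, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, NOT the actual broker code. Expressions are modeled
// as plain strings purely for illustration.
final class BrokerOverrideSketch {
    private BrokerOverrideSketch() {}

    // For each granularity, maps the dateTrunc call on the original column
    // to the derived column that holds the pre-truncated values.
    static Map<String, String> buildOverrides(String column, List<String> granularities) {
        Map<String, String> overrides = new LinkedHashMap<>();
        for (String granularity : granularities) {
            String original = "dateTrunc('" + granularity + "', " + column + ")";
            String replacement = "$" + column + "$" + granularity;
            overrides.put(original, replacement);
        }
        return overrides;
    }
}
```

In this sketch, a `YEAR` granularity on `event_time` produces a single entry that maps `dateTrunc('YEAR', event_time)`
to `$event_time$YEAR`, which matches the example given above.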

Then the server that receives the query verifies that the column $event_time$YEAR exists in the segment
(remember that the segment may not be updated to the latest table config!) and if it does, it rewrites the query
before FilterPlanNode.constructPhysicalOperator is called.
Servers do this when building TableCache.TableConfigInfo, which obtains the information from
QueryConfig.getExpressionOverrideMap().
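
The server-side check can be sketched in the same simplified style. The code below is hypothetical and only captures
the rule described above: an override is applied only when the target column actually exists in the segment, otherwise
the original expression is kept.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, NOT the actual server code. Expressions and column
// names are modeled as plain strings purely for illustration.
final class ServerOverrideSketch {
    private ServerOverrideSketch() {}

    // Returns the derived column if there is an override for the expression
    // AND the segment contains that column; otherwise returns the expression
    // unchanged, so segments built with an older table config still work.
    static String rewrite(String expression, Map<String, String> overrides, Set<String> segmentColumns) {
        String target = overrides.get(expression);
        if (target != null && segmentColumns.contains(target)) {
            return target;
        }
        return expression;
    }
}
```

The fallback branch is the important part: a segment that has not been refreshed since the timestamp index was added
does not contain `$event_time$YEAR`, so the query has to be evaluated against the original column for that segment.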

At least this is how it works in the single-stage query engine (SSQ).
In the multi-stage query engine (MSQ), the broker doesn't rewrite the query and therefore timestamp indexes are not
used (#11409 tried to add support for it).

How do we add these new granularity columns?
They are added in TimestampIndexUtils.applyTimestampIndex, which modifies the schema and the table config of the
table.

Do we store the modified schema and table config somewhere?
No, it is not stored in Zookeeper nor in the segment metadata.
Therefore it is very important for developers to know there are two kinds of Schema and TableConfig objects:

  • The ones that are not enriched. They are the ones that are persisted and shown to the users.
  • The ones that are used at runtime. They are the ones that have been enriched with the timestamp indexes.

And given that the type system doesn't help (we don't have EnrichedSchema and EnrichedTableConfig classes), developers
need to know when they are working with one or the other.
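
To show what "the type system doesn't help" means in practice, here is a purely hypothetical sketch of how distinct
types could make the difference visible. None of these classes exist in Pinot, and the schema is reduced to a list of
column names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: Pinot does NOT have these classes. A schema is
// reduced to a list of column names to keep the example small.
final class EnrichedTypesSketch {
    private EnrichedTypesSketch() {}

    // The schema as persisted and shown to users.
    static final class PersistedSchema {
        final List<String> columns;

        PersistedSchema(List<String> columns) {
            this.columns = List.copyOf(columns);
        }
    }

    // The schema used at runtime, including the derived timestamp columns.
    // It can only be created through enrich(), so holding an instance
    // proves that enrichment has already happened.
    static final class EnrichedSchema {
        final List<String> columns;

        private EnrichedSchema(List<String> columns) {
            this.columns = List.copyOf(columns);
        }
    }

    // Adds one derived column per granularity of the timestamp index.
    static EnrichedSchema enrich(PersistedSchema persisted, String timestampColumn, List<String> granularities) {
        List<String> columns = new ArrayList<>(persisted.columns);
        for (String granularity : granularities) {
            columns.add("$" + timestampColumn + "$" + granularity);
        }
        return new EnrichedSchema(columns);
    }
}
```

With types like these, a method that needs the derived columns could require an `EnrichedSchema` parameter and the
compiler would reject a persisted schema. Today both are plain Schema objects, so that knowledge lives only in the
developer's head.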

Proposal

I propose creating a new site for developer documentation.
This site would be written using MkDocs and the code would be stored in the docs folder of
the Pinot GitHub repository.

I already opened #14346, which includes all the machinery to build the site and
a couple of (I hope useful) pages describing some key aspects of the lifecycle of a multi-stage query in Pinot.

Creating a new site for the developer documentation is not the only way to go.
In fact, it may remind you of the famous XKCD comic about standards:

a new standard!

But there are reasons to think that the tools we use right now are not the best for the job:

  • The user documentation is written in GitBook, which is not the best tool to write developer documentation.
    Specifically, although GitBook supports markdown, it is biased to be used through the GitBook website, whose UI
    is very confusing (AFAIK no committer likes GitBook).
  • The user documentation is focused on how to use Pinot, while the developer documentation should be focused on how
    Pinot works. It is ok to have some overlap and to link between them, but it should be clear for readers which one
    they are reading.
  • The user documentation is external to the code repository, which makes it harder to keep it in sync with the
    codebase. For example, if a PR in the code changes a feature, it is harder for both the committer and the reviewers
    to remember to update the documentation in GitBook.
  • Google Drive is good to discuss design documents while they are being written, but it is not a good place to
    store them. It is hard to search, hard to link to, hard to keep in sync with the codebase, etc.

A dev site written in MkDocs solves all these issues.
Being written in markdown, it is easy to write and to review.
Being hosted in the code repository, it is easy to keep in sync with the codebase.
It is also very easy to publish these pages.
There is a lot of information online on how to publish MkDocs pages in different places, including GitHub Pages.
The Apache Foundation has a page on how to publish MkDocs pages in the
ASF infrastructure.

Closing notes

This is probably not a high priority task, but I think it is a very important one.
I think that having a centralized place for developer documentation would make it easier for new contributors to start
contributing to Pinot and for current contributors to understand the codebase better, which would make them more
productive and reduce the number of bugs.

Writing documentation is not a task that can be done in a single week, but instead it is a task that should be done
incrementally and continuously.
The easier the tools are to use, the more likely it is that the documentation will be written and maintained.

I would like to hear your thoughts on this proposal.

@robertzych
Contributor

@gortiz Wow and thanks! This type of documentation is invaluable for both contributors and committers. I look forward to seeing more developer documentation and would be happy to contribute!
