Create a PandasDatasource #3302
Comments
There are Druid and SQLAlchemy connectors in the source code and a PR for Elasticsearch. The idea is sound, looking forward to a PR.
So we want the web server(s) to be stateless (Superset isn't a database!); that probably means that at each web request the data would have to be fetched and processed all over again. Seems pretty inefficient. Why not flush your dataframe to a proper SQL-speaking database and use the SQLA connector? An Apache Arrow connector seems like a more sensible idea, though I'm unclear on the specifics.
@mistercrunch I agree that we want the web server(s) to be stateless. Our use case is that we want to read data from a REST API that serves up (among other things) csv-formatted data. I can't just flush the dataframe to a SQLA database as a one-off exercise because I need the slice to be able to show up-to-date information (bearing in mind the cache timeout on the slice). I'm new to Superset, so I may have misunderstood the processing - in which case, please correct me. There seem to be two main types of web request going on.
In the event that you have a SQLA connector, each web request involves creating a connection to the DB (unless pooled) and executing a query which does the read/filter/aggregate/compute and returns the data. I.e. the work to be done is the same, but in the SQLA case we are doing it on the database server, whereas I was proposing doing it on the web server. I recognize that doing it on the database server is sensible for most use cases.

My understanding of how Superset works is that once the slices have been designed, the results of the query can be cached. I.e. once a slice has been published that does its filter/aggregate/compute in Pandas, it would only be doing it once per cache expiry period, because the result of that computation would be stored in the cache and each time the slice is viewed the result would be returned directly from the cache.

I did think of an alternative approach where we read the data into Pandas and then write it out to a SQLA table for Superset to query with the existing SQLA connector. If we take this approach, then each time we need to read the data from the SQLA table, we need to first fetch the up-to-date data from the remote URL, read it into Pandas, drop the existing SQL table, write a new table from the Pandas DataFrame, and then execute the query. As far as I can tell, executing the query involves reading the results back into a Pandas DataFrame anyway.
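To make that round trip concrete, here is a minimal sketch under my own assumptions (the URL, table name, and column names are invented, and SQLite stands in for whichever SQLA-compatible database Superset points at):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical CSV-serving endpoint and a stand-in local database
DATA_URL = "https://example.com/api/data.csv"
engine = create_engine("sqlite:///pandas_cache.db")

# Fetch the up-to-date data and load it into a DataFrame
df = pd.read_csv(DATA_URL)

# Replace the existing table with the fresh data
df.to_sql("remote_data", engine, if_exists="replace", index=False)

# The SQLA connector would then run its generated SQL against the table;
# the results end up back in a DataFrame either way
result = pd.read_sql_query(
    "SELECT category, SUM(value) AS total FROM remote_data GROUP BY category",
    engine,
)
```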
I'm guessing your REST API is backed by some sort of database, couldn't you just access that? Alternatively, if you don't own that database and want to run analytical workloads off of it, shouldn't you just write a data pipeline to bring that data over into a local datastore? How would the owner of that service feel when you front-load the cache in the morning? You'd potentially be hitting that REST API pretty hard. The issue with the Pandas approach is you can't "pushdown" the filtering and/or aggregations, meaning the web server has to do all that heavy lifting. It's true that things would get cached, but any slicing and dicing beyond the cache would fall off a perf cliff. There's no reason why it wouldn't work for small datasets, though I doubt it would be very useful to others.
@mistercrunch you're right that the REST API is backed by a database, but although it is ours and we control it, the data comes from a third-party product and so we don't control the schema. The data is largely stored in a monolithic table which contains a JSONB column that holds the data of real interest. The fields in the JSONB column vary in a NoSQL fashion. The REST API extracts this data and presents it in a recognizable tabular form to the analysts.

I don't think I could give the analysts access to the underlying table using Superset because it would be too hard for them to write the SQL to extract the data values that they want from the JSONB column. We have considered creating views on the monolithic table that provide the correct view of the data, and we could automate that step. But if we give analysts direct access to those views then we lose all the fine-grained permissions that are enforced by the REST API, and we would have to duplicate the permissions model in PostgreSQL.

We think we have a design for pushing down the filtering to our specific REST API in a subclass of a generic PandasConnector. I can also imagine how some other specific implementations might be able to push down grouping and aggregating. At the moment, though, any custom connector has to be prepared to implement the whole query interface itself.

I accept that my use case with small data volumes from a REST API is not the main target audience for Superset and it might not be that valuable to others.
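As a rough illustration of what pushing the filtering down to a REST API could look like - none of these class or method names exist in Superset, and the URL and filter operators are invented for the example:

```python
import pandas as pd


class PandasConnector:
    """Hypothetical generic connector: fetch everything, then filter in Pandas."""

    url = "https://example.com/api/records.csv"

    def fetch(self, filters):
        # The generic implementation ignores the filters and fetches everything
        return pd.read_csv(self.url)

    def get_dataframe(self, filters):
        df = self.fetch(filters)
        # Apply Superset-style (column, operator, value) filters in memory
        for col, op, value in filters:
            if op == "==":
                df = df[df[col] == value]
            elif op == "in":
                df = df[df[col].isin(value)]
        return df


class RestApiConnector(PandasConnector):
    """Hypothetical subclass: push equality filters down to the REST API as
    query parameters so that only matching rows come over the wire."""

    def fetch(self, filters):
        params = "&".join(
            f"{col}={value}" for col, op, value in filters if op == "=="
        )
        return pd.read_csv(f"{self.url}?{params}" if params else self.url)
```

Any filters the API cannot handle still fall back to the in-memory filtering in the base class, so the subclass only has to override the fetch step.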
If you want to build it and get to a point where it works well, we could add it to a contrib folder.
@mistercrunch please can you provide some guidance on how we would handle migrations in a contrib module, and/or lay out the directory structure you would expect. For example, if we have a contrib module containing its own migration scripts, then we need a way to tell Superset to include those migrations when we migrate the db. We can do that by changing the value of version_locations in the Alembic / Flask-Migrate configuration so that it includes the contrib module's versions directory as well as the main one.

A separate migrations directory keeps a loose coupling between the contrib folder and Superset, but it adds more complexity to the setup and management.
@mistercrunch please can you also provide some guidance on how we would handle changes to the JavaScript client? Specifically, we would need to change (at a minimum) the client-side code that controls which form sections are shown for each datasource type. A solution where the DataSource registers which sections apply, and they are passed through to the client, would be preferable to hard-coding the new datasource type in the JavaScript.

I think that these issues (both setting up the JavaScript client and adding new migrations) will apply to most other custom connectors, so if we want to encourage development of a wider range of connectors then we will probably need to solve them.
I really don't see how having migrations for the Superset database separated from the upstream ones can work. Any chance you could make your code public so we can take a look at what is needed?
@xrmx the code is still very rough - I haven't pushed it to GitHub yet - partially because I want to make sure the directory structure matches your expectations in order to get it merged! I am also not planning to push it to GitHub until I have an actual working implementation that can do filtering and grouping, and we haven't got that far yet.

However, we need to store metadata about the DataSources and the associated Columns and Metrics. The required metadata is very similar to that required for SqlaTable datasources, but not identical. So the migration just adds three new tables.
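The actual migration code is not shown above; purely as an illustration, an Alembic migration that adds three such tables might look roughly like this (the table names, columns, and revision identifiers are invented, not taken from the real PR):

```python
"""Hypothetical sketch: add tables for a Pandas datasource, its columns and metrics."""
import sqlalchemy as sa
from alembic import op

# Illustrative revision identifiers only
revision = "pandas_connector_0001"
down_revision = None


def upgrade():
    op.create_table(
        "pandas_datasources",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("name", sa.String(255), nullable=False),
        sa.Column("source_url", sa.Text),
        sa.Column("format", sa.String(20)),  # e.g. csv, hdf, excel
        sa.Column("cache_timeout", sa.Integer),
    )
    op.create_table(
        "pandas_columns",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("datasource_id", sa.Integer, sa.ForeignKey("pandas_datasources.id")),
        sa.Column("column_name", sa.String(255), nullable=False),
        sa.Column("type", sa.String(32)),
        sa.Column("groupby", sa.Boolean, default=False),
        sa.Column("filterable", sa.Boolean, default=False),
    )
    op.create_table(
        "pandas_metrics",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("datasource_id", sa.Integer, sa.ForeignKey("pandas_datasources.id")),
        sa.Column("metric_name", sa.String(255), nullable=False),
        sa.Column("expression", sa.Text),
    )


def downgrade():
    op.drop_table("pandas_metrics")
    op.drop_table("pandas_columns")
    op.drop_table("pandas_datasources")
```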
@xrmx said:

> I really don't see how having migrations for the superset database separated from the upstream one can work.

In theory, we could change the default configuration so that users can register additional datasource modules and additional migration locations. And then they could update their local superset_config.py:

```python
# Include additional data sources
ADDITIONAL_MODULE_DS_MAP = {
    'xmymodule.connectors.pandas.models': ['PandasDatasource'],
}

# And their migrations
ADDITIONAL_VERSION_LOCATIONS = ['xmymodule/migrations/versions']
```

Possibly you want to specify the version locations as modules rather than directories:

```python
ADDITIONAL_VERSION_LOCATIONS = ['xmymodule.migrations.versions']
```

It would create an extension point that would allow people to add new functionality to Superset that requires migrations, without needing them to maintain a fork. And it would make it easier for other people to choose to deploy those extensions in their Superset installation without having to merge code into their own fork - they could just deploy Superset and install the extension alongside it.
For reference, the PR for the Elasticsearch connector is at https://github.com/apache/incubator-superset/pull/2953/files
I think you could enable an extension point for this with Flask-Migrate's configure callback, something like:

```python
migrate = app.extensions['migrate'].migrate

@migrate.configure
def configure_version_locations(migrate_config):
    # Get additional migration locations
    locations = config.get('ADDITIONAL_VERSION_LOCATIONS', [])
    # Include the main superset migrations
    locations.append('%(here)s/versions')
    migrate_config.set_main_option('version_locations', ' '.join(locations))
    return migrate_config
```
@rhunwicks my worry is that contrib migrations may have references to upstream models that will change in the future, so when applied at a different time they may break.
@xrmx it's true that if someone developed a model that had a dependency on a core Superset model, and that model then changed significantly, the extension might stop working. But that would be the case whether or not the dependent migrations lived in the core migrations directory.

Alembic, and thus Flask-Migrate, is quite capable of dealing with a branched migration tree, so changes can be made to the main models in the main tree without affecting the extension migrations, provided that either they are not dependent on any upstream models or the upstream models don't change in a breaking way. For example, our models have a foreign key to one of the core Superset tables.

If someone wrote an extension that did stop working as a result of a change to an upstream model, then they would need to update their component to account for the changes in order to continue working with the most recent version of Superset. This is a normal workflow for an add-on component for an open source project.

On balance, I think that it is preferable to have an extension point that lets contrib modules keep their own migrations. However, my main objective is to write a connector that can work with core Superset on an ongoing basis. So if the core team decide that connectors should live in a contrib directory and keep their migrations in the main superset migrations directory, I am happy to do it that way.
One approach that might be relevant for writing custom connectors is https://github.com/yhat/pandasql. That project allows you to use SQLite-compatible SQL to query a Pandas DataFrame. It does it (AFAIK) by automatically serializing the dataframe to a temporary SQLite DB, executing the query against it, and reading back the result. In theory, that would allow us to create a new connector that inherited from the existing SQL-based connector code and ran the generated query against the DataFrame.

I am not planning on using that approach myself, because I think that the overhead of serializing and then deserializing the rows will be too much, but it does solve some of the other problems, and so I am mentioning it here in case it is relevant for anyone else.
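For anyone unfamiliar with pandasql, a small usage sketch (the DataFrame contents here are invented):

```python
import pandas as pd
from pandasql import sqldf

# A made-up DataFrame standing in for data fetched from a remote source
df = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "sales": [100, 250, 175, 90],
})

# pandasql copies the DataFrame into a temporary SQLite database behind the
# scenes, runs the SQL there, and returns the result as a new DataFrame
result = sqldf(
    "SELECT region, SUM(sales) AS total_sales FROM df GROUP BY region",
    locals(),
)
print(result)
```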
@rhunwicks Thanks for the thorough analysis. I'd really hope we can have a Pandas connector as a first-class citizen. Imagine how useful that would be for non-tech-savvy organizations, where they could just upload a CSV or an Excel file and explore it.
@mistercrunch when is the order of records returned by the query significant? Looking at the existing connectors it isn't obvious to me. Is it important that the records come back in a particular order? I assume not.
@mistercrunch the Travis-CI run for the PR fails because it doesn't run the migration for the PandasConnector, but I can't fix that until I know whether you want to: (a) keep the connector's migrations in the contrib directory and add an extension point to include them, or (b) move them into the main superset migrations directory.

Please can you look at the PR when you get a chance and let me know which way you want me to set it up?
I think not, because it looks like 816c517 has changed the relevant behavior anyway. I'll update the tests in the PR to account for that.
@mistercrunch I have added some docs to the PR for a Pandas Connector, and fixed it up to pass the CI checks.

I'd appreciate some guidance on whether you think this will make it into the Superset repo at some point, because of the effort required to keep it in a ready-to-merge state (e.g. fixing merge conflicts in the requirements and migrations).

We are using the Pandas Connector on our deployment, and for us at least it is much more useful than the Import CSV functionality merged in #3643, because it allows Superset dashboards to include current data from remote files or APIs without someone having to manually upload a new version of the CSV.
@rhunwicks I want to do a POC with this PandasDatasource, but I'm not sure how to connect Superset to a REST API that already gets data from a db and adds it to a pandas dataframe. Can you help me?
@jevanio unrelated to this PR, please use the mailing list or gitter.
Having the migrations and requirements mixed in with the main Superset ones without ever merging this PR causes frequent merge conflicts. Therefore, I have closed PR #3492. We will continue to maintain the connector in order to allow us to use Superset with APIs and remote files, but we will do so in our fork rather than in a PR. If Superset publishes a specification for developing third-party connectors in future, then we will repackage our connector according to that specification.
Superset version
0.19.0
Expected results
There are a large number of issues asking about adding new Datasources / Connectors.
Unfortunately, I can't find any examples of a working third-party datasource / connector on GitHub, and I think this is possibly because of the complexity and level of effort required to implement a BaseDatasource subclass with all the required methods. In particular, it needs to be able to report the schema and do filtering, grouping and aggregating.
Pandas has great import code, and I have seen Pandas proposed as a method for implementing a CSV connector - see #381: read the CSV using Pandas, output it to SQLite, and then connect to SQLite using the SQLA Datasource to create the slices.
This approach could be extended to other data formats that Pandas can read, e.g. Excel, HDF5, etc.
However, it is not ideal because the SQLite file will potentially be out of date as soon as it is loaded.
I'd like to propose an alternative: a PandasDatasource that allows the user to specify the import method (read_csv, read_table, read_hdf, etc.) and a URL, and which then queries the URL using that method to create a DataFrame. It reports the columns available and their types based on the dtypes of the DataFrame. And by default it allows grouping, filtering and aggregating using Pandas' built-in functionality (a rough sketch follows below).

I realize that this approach won't work for very large datasets that could overwhelm the memory of the server, but it would work for my use case and probably for many others. The results of the read, filter, group and aggregate would be cached anyway, so the large memory usage is potentially only temporary.
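To make the proposal concrete, here is a minimal sketch under my own assumptions; the class shape, method names, and example URL are illustrative only and not taken from any existing Superset code:

```python
import pandas as pd

# Map of user-selectable import methods to pandas readers
READERS = {
    "csv": pd.read_csv,
    "table": pd.read_table,
    "hdf": pd.read_hdf,
}


class PandasDatasourceSketch:
    """Illustrative only: fetch a URL into a DataFrame and query it in memory."""

    def __init__(self, url, fmt="csv"):
        self.url = url
        self.reader = READERS[fmt]

    def get_dataframe(self):
        return self.reader(self.url)

    def columns(self):
        # Report available columns and their types from the DataFrame dtypes
        df = self.get_dataframe()
        return {col: str(dtype) for col, dtype in df.dtypes.items()}

    def query(self, groupby, metrics):
        # Filtering would slot in here; grouping/aggregating use pandas directly
        df = self.get_dataframe()
        return df.groupby(groupby).agg(metrics).reset_index()


# Hypothetical usage
ds = PandasDatasourceSketch("https://example.com/data.csv", fmt="csv")
# ds.query(groupby=["region"], metrics={"sales": "sum"})
```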
This would also make it very much easier for people working with larger datasets to create a custom connector to suit their purposes. For example, someone wanting to use BigQuery (see #945) could extend the PandasDatasource to use read_gbq and to pass the filter options through to BigQuery, but still rely on Pandas to do grouping and aggregating. Given that starting point, someone else might come along later and add the additional code necessary to pass some group options through to BigQuery.

The point is that instead of having to write an entire Datasource and implement all the methods, you could extend an existing one to scratch your particular itch, and over time, as more itches get scratched, we would end up with a much broader selection of datasources for Superset.
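In the same illustrative spirit as the sketch above, a BigQuery-flavoured variant might build a simple WHERE clause from equality filters and hand it to read_gbq, leaving grouping and aggregating to Pandas. read_gbq is a real pandas entry point (it requires the pandas-gbq package); the class, its methods, and the table/project names are invented:

```python
import pandas as pd


class GbqDatasourceSketch:
    """Illustrative only: push equality filters down to BigQuery as SQL,
    while grouping and aggregating stay in pandas."""

    def __init__(self, table, project_id):
        self.table = table
        self.project_id = project_id

    def get_dataframe(self, filters=()):
        # Naive quoting, for illustration only - a real connector would
        # parameterize or escape the values properly
        where = " AND ".join(f"{col} = '{value}'" for col, value in filters)
        sql = f"SELECT * FROM `{self.table}`" + (f" WHERE {where}" if where else "")
        return pd.read_gbq(sql, project_id=self.project_id)

    def query(self, filters, groupby, metrics):
        df = self.get_dataframe(filters)
        return df.groupby(groupby).agg(metrics).reset_index()
```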