This repository contains code for an early-stage statistical visualization app using ShinyLive/Python. You can test it at https://jm-rpc.github.io/plotit/ . If you're here, I don't need to extol the virtues of the ShinyLive deployment strategy. Fair warning: this code is pretty new, hence buggy; use at your own risk. This project started as mostly a proof-of-concept exercise and then got completely out of hand. Having said that, plotit offers a minimalist and fairly easy-to-use interface (well, easier than doing everything from the command line), and will accommodate relatively large datasets (beyond about 2M rows it can be a bit slow). As far as possible, plotit repackages off-the-shelf functions from seaborn, statsmodels, scipy, etc. while at the same time trying to keep the number of required packages within reason (and within ShinyLive's limitations).
The design philosophy of plotit starts from the observation that researchers using either Python or R, programmers and non-programmers alike, can now fairly easily get an LLM to write single-purpose scripts that estimate a specific model on a specific data set and create specific graphics. However, the process of repeatedly writing a specific prompt for a specific model, debugging the LLM output, running the model, checking results, thinking up a new model, etc. can become very tedious. Even for experienced programmers, trying different data sets and different models is faster and more fun when a simple graphical user interface is available.
plotit was meant to be a simple, easy-to-use, model- and graphics-agnostic data exploration tool. It does not try to fit the data into a fixed format, graphic type, or model. Instead, it offers some standard statistical graphics for the user to try. Careful attention was paid to handling missing data. Rows with missing data (NaN or NA values in Python) are not dropped on input. At any point where rows containing missing data need to be dropped, only the rows that are missing data in the currently active columns (either being displayed or part of a linear model) are dropped. This means, for example, that when you use a variable to color a plot there may be an NA category. You can remove it with the subsetting feature of plotit. Switching between data graphics and modeling is easy in plotit and invites a trial-and-error approach to exploratory modeling. plotit keeps a log of everything it does so that you always have a record of what you have done. It also lets you save copies of the data files for models you have fit and graphics you have made. The log, combined with conscientiously saving data files, makes your analysis reproducible.
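The missing-data rule described above can be sketched in a few lines of pandas. This is only an illustration of the idea, not plotit's actual code: the data frame, the column names, and the helper `active_rows` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data; names and values are made up for illustration.
df = pd.DataFrame({
    "height": [1.6, 1.7, np.nan, 1.8, 1.9],
    "weight": [60.0, np.nan, 72.0, 80.0, 85.0],
    "group": ["a", "b", None, "a", "b"],
})


def active_rows(data, active_columns):
    """Drop only the rows that are missing a value in the active columns."""
    return data.dropna(subset=active_columns)


# A histogram of height only needs height to be present.
hist_data = active_rows(df, ["height"])

# A scatter plot of height against weight needs both columns.
scatter_data = active_rows(df, ["height", "weight"])

# One way to keep missing values of a coloring variable visible
# is to turn them into an explicit "NA" category.
color = df["group"].fillna("NA")

print(len(df), len(hist_data), len(scatter_data))
print(color.tolist())
```

The original data frame is never modified, so the number of rows in use depends only on which columns are active at that moment.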
Here's what plotit will do (so far):
- Open a .csv file of data from its local computing environment and give a simple summary of the data in the file. Currently, rows containing NaNs are dropped on input only at the user's request.
- Create a grid of scatter plots and calculate Pearson correlations for chosen variables
- Statistical graphics. The user chooses variables (X, Y, Z, and a color variable) and the type of plot:
  - One variable: histogram, box plot, KDE
  - Two variables (X and Y): 2D scatter plot with coloring
  - Three variables: interactive 3D plotting with plotly (I am working on how to use rgl widgets to support matplotlib interactive 3D; so far no luck)
  - Subsetting: for variables with fewer than 50 or so unique values, choose subsets based on the outcomes of the variables. You can subset on more than one variable.
- Linear models: fit either OLS, logistic, Poisson, or negative binomial models. NOTE: the linear model tab will fit the model to the subset chosen in the Plots tab. However, it will remove rows with missing data in the columns of the current model. Therefore, the number of observations may fluctuate if there are missing observations in the original data. Always check the number of observations!
- After fitting a model, you can go back to the plotting page and you will see that "Model Data" has been chosen. You can then use the plotting tools to visualize your model (plot predictions, residuals, confidence intervals, etc.).
- A collection of standard plots for linear model diagnostics: ROC curve for logistic regression, leverage and influence for linear regression, scatter plots against independent variables, etc.
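The warning above about fluctuating observation counts is easy to reproduce. The sketch below uses plain pandas with made-up data; `subsettable_columns`, `model_rows`, and the cutoff `MAX_LEVELS` are hypothetical names chosen for illustration and are not taken from plotit's code.

```python
import numpy as np
import pandas as pd

# Hypothetical data; names and values are made up for illustration.
df = pd.DataFrame({
    "y": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "x1": [0.1, 0.2, 0.3, np.nan, 0.5, 0.6],
    "x2": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "site": ["north", "north", "south", "south", "north", "north"],
})

MAX_LEVELS = 50  # assumed cutoff, standing in for "fewer than 50 or so"


def subsettable_columns(data, max_levels=MAX_LEVELS):
    """Columns with few enough unique values to offer as subset filters."""
    return [c for c in data.columns if data[c].nunique() < max_levels]


def model_rows(data, model_columns, filters=None):
    """Apply the subset filters, then drop rows missing any model column."""
    for column, levels in (filters or {}).items():
        data = data[data[column].isin(levels)]
    return data.dropna(subset=model_columns)


north = {"site": ["north"]}
n_small = len(model_rows(df, ["y", "x1"], north))
n_large = len(model_rows(df, ["y", "x1", "x2"], north))

# Same subset, different models, different numbers of observations.
print(n_small, n_large)
```

Adding `x2` to the model removes one more row from the same subset, which is why it pays to check the number of observations after every fit.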
The documentation goes into more detail and has some examples.
To Do's
- Data set splitting (training and test)
- Predictions using new data
- Learn how to use Shiny modules
- Maybe take requests for features......
- As always, polite bug reports are appreciated. Suggestions for how to improve the code are greatly appreciated and will be cited if used. Gripes about my awful programming style are not appreciated (I already know this).