Add content to format_metadata, different_tools and help #25

Merged
merged 8 commits into from
Nov 23, 2021
Changes from 6 commits
138 changes: 78 additions & 60 deletions BigData/different_tools.md
@@ -1,126 +1,144 @@
# Identifying which languages/tools are best suited to specific tasks
## Python

This page contains:
- [Python](#python)
- [R](#r)
- [MATLAB](#matlab)
- [NCO](#nco-netcdf-operators)
- [CDO](#cdo-climate-data-operators)

Other languages and tools exist which can work with netCDF data (e.g. C, FORTRAN, ArcGIS, QGIS, ParaView, Panoply, Ferret, as well as the deprecated NCL), but on this page we focus on tools commonly used for *analysis* of large-scale (typically netCDF) climate data.

## Python
This is a free, open-source language that is a standard tool used in many organisations and industries. It interfaces with other programs and tools like ArcGIS. Packages like xarray are great for analysing large gridded time-series data in climate and environmental science fields. Python creates beautiful plots.

Os, sys, glob: to handle directories and files
`os, sys, glob` - to handle directories and files

`numpy` - numerical python

Numpy: numerical math
`matplotlib` - to create plots

Matplotlib: to create plots
`cartopy` - plots maps from geospatial data

Other plotting packages: https://mode.com/blog/python-data-visualization-libraries/: plotly, seaborn, holoviews
Other plotting packages: https://mode.com/blog/python-data-visualization-libraries/ - `plotly, seaborn, holoviews, bokeh`

Pandas: timeseries, integrates with numpy
`pandas` - timeseries, integrates with numpy

Xarray: gridded data, integrates with pandas, include basic plotting capabilities
`xarray` - gridded data, integrates with pandas, includes basic plotting capabilities

Dask: to parallelise tasks and manage memory more efficiently , integrates with xarray
`dask` - to parallelise tasks and manage memory more efficiently, integrates with xarray

Calendar: to handle calendars and time information
`calendar, datetime` - to handle calendars and time information

Netcdf4 - to handle netcdf files, usually integrated in tools like xarray, pandas
`netCDF4` - to handle netCDF files, usually integrated in tools like xarray, pandas

hdf5, hdf4, h4netcdf, hdfeos2, hdfeos5, h5py, pyhdf - to handle various hdf formats they have different advantages
`hdf5, hdf4, h5netcdf, hdfeos2, hdfeos5, h5py, pyhdf` - to handle various HDF formats; they have different advantages

Pygrib -m to handle grib file
`pygrib` - to handle GRIB files

Requests: download/upload from/to website (not specifically analysis but can be useful for data handling)
`requests` - download/upload from/to website (not specifically analysis but can be useful for data handling)

Csv - to handle csv files
`csv` - to handle CSV files

Json - to handle json files (often useful to store table information and pass schema, vocabularies and other dictionary style information to programs)
`json` - to handle JSON files (often useful to store table information and pass schema, vocabularies and other dictionary style information to programs)

Yaml - to handle yaml files - often use to handle program configurations
`yaml` - to handle YAML files, often used to handle program configurations

Rasterio, rasterstats, rio-xarray, geopandas, fiona - to handle raster and shapefiles
`rasterio, rasterstats, rioxarray, geopandas, fiona` - to handle raster data and shapefiles

Zarr -
`gdal` - useful for reprojecting data and interfacing with geoTIFFs

Specific tools:
`scipy` - scientific python tools

Iris -
`zarr` - to read and write datasets as zarr archives
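
As a minimal sketch of how the standard-library entries above (`os`, `glob`, `json`, `calendar`, `datetime`) fit together, the snippet below lists files by pattern, stores dictionary-style information as JSON, and does simple calendar arithmetic. The file names and the variable table are made up for illustration and are not part of any real dataset.

```python
import calendar
import glob
import json
import os
import tempfile
from datetime import date


def list_files(directory, pattern="*.nc"):
    """Return the sorted paths in `directory` that match `pattern`."""
    return sorted(glob.glob(os.path.join(directory, pattern)))


def days_in_year(year):
    """Number of days in `year` in the standard (Gregorian) calendar."""
    return 366 if calendar.isleap(year) else 365


with tempfile.TemporaryDirectory() as tmpdir:
    # Create empty placeholder files; the names are hypothetical.
    for name in ("tas_2000.nc", "tas_2001.nc", "notes.txt"):
        open(os.path.join(tmpdir, name), "w").close()
    nc_files = [os.path.basename(p) for p in list_files(tmpdir)]

    # Store a hypothetical variable table as JSON and read it back.
    vocab = {"tas": {"units": "K", "long_name": "near-surface air temperature"}}
    vocab_path = os.path.join(tmpdir, "vocab.json")
    with open(vocab_path, "w") as f:
        json.dump(vocab, f, indent=2)
    with open(vocab_path) as f:
        loaded = json.load(f)

# Days between two dates, consistent with days_in_year(2000).
elapsed = (date(2001, 1, 1) - date(2000, 1, 1)).days
print(nc_files, loaded["tas"]["units"], elapsed)
```

For real netCDF, GRIB or HDF content, the third-party packages listed above (for example `xarray`) would replace the placeholder files used here.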

Cfcheker.py - checking against CF and ACDD conventions
### Specific tools:

marineHeatwaves / xmhw - calculate MHW statistics
`Iris` - MetOffice tool for working with CF-compliant netCDF data

CleF - discovering ESGF datasets at NCI
`cfchecker.py` - checks netCDF files against CF and ACDD conventions

ClimTas - makes it easier to apply and extend dask functions
`marineHeatwaves` / `xmhw` - calculate marine heatwave (MHW) statistics

Xclim -
`CleF` - discovering ESGF datasets at NCI

Cosima cookbook
`ClimTas` - makes it easier to apply and extend dask functions

Cdo - to call cdo operators (Scott has a regridding function that exploit this)
`Xclim` -

Wrf-python -
`Cosima cookbook` - various python libraries for ocean and sea ice

Siphon - to navigate thredds servers
`cdo` - to call cdo operators (Scott has a regridding function that exploits this)

Xesmf -
`Wrf-python` -

Udunits2 -
`Siphon` - to query and navigate THREDDS servers

Eofs -
`Xesmf` - regridding tool

Eccodes -
`udunits2` - Library used to interpret units of measurement

Earthpy -
`Eofs` -

xgcm - work with offset grids
`Eccodes` -

Specific distributions:
`Earthpy` -

Anaconda
`xgcm` - work with offset grids

miniconda
### Specific toolsets:

Pangeo
[Anaconda](https://www.anaconda.com/): Contains pretty much all the python libraries you'd want to get started, great for newcomers but takes up a lot of space. Not recommended on shared systems with quotas but good on local laptops. Includes Spyder, a Matlab-like programming environment (IDE).

scipy
[miniconda](https://docs.conda.io/en/latest/miniconda.html): A lightweight version of anaconda which by default only includes core libraries, good for building specific environments for data analysis. This underpins the `conda` modules in the `hh5` project at NCI.

[Pangeo](https://pangeo.io/): A community for analysis of large scale climate data. Built on tools like python, xarray, dask, iris, cartopy.

## R
This is a free, open-source statistical programming language. It is used mainly in research, but it is also a standard tool in many organisations. This tool is great for statistical analysis.

Dplyr, tidyr, tidyverse - Dataframe manipulation
`dplyr, tidyr, tidyverse` - Dataframe manipulation

ggplot2 - Creating graphics
`ggplot2` - creating graphics

purrr - data wrangling
`purrr` - data wrangling

rio - data import/export
`rio` - data import/export

Shiny - report results, e.g., build interactive web apps
`Shiny` - report results, e.g., build interactive web apps

Mlr - machine learning tasks
`Mlr` - machine learning tasks

Leaflet - mapping and working on interactive maps
`Leaflet` - mapping and working on interactive maps

tidymodels - modeling and machine learning
`tidymodels` - modeling and machine learning

sp, maptools - processing spatial data
`sp, maptools` - processing spatial data

Zoo,xls - for time series data
`zoo, xts` - for time series data

climpact - https://github.com/ARCCSS-extremes/climpact heatwave/extremes statistics
`climpact` - https://github.com/ARCCSS-extremes/climpact Heatwave/extremes statistics

https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages
Recommended list of packages: https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages

### Specific toolsets

[RStudio](https://support.rstudio.com/hc/en-us) IDE

## MATLAB
MATLAB (Matrix Laboratory) is a licenced tool. It is the best tool when dealing with large matrices and matrix manipulations. It allows examining the content of data quickly in a built-in docked or undocked window within the tool to gain an overview of the pattern and structures presented in the data. This tool is helpful because many data types, for example, large image files and large tabular data, can be converted into matrices and analysed efficiently in MATLAB. MATLAB provides an easy-to-use environment with interactive applications, which is excellent for novel programmers.
MATLAB (Matrix Laboratory) is a licensed tool. It is a good tool when dealing with large matrices and matrix manipulations. It allows you to examine the content of data quickly in a built-in docked or undocked window to gain an overview of the patterns and structures present in the data. This is helpful because many data types, for example large image files and large tabular data, can be converted into matrices and analysed efficiently in MATLAB. MATLAB provides an easy-to-use environment with interactive applications, which is excellent for novice programmers. MATLAB also has excellent help resources and a useful online community.

As a licensed tool matlab might not be available to other researchers and collaborators, so even if you are producing data with matlab, avoid saving the data as mat files, use the best alternative open source format instead.
As a licensed tool, MATLAB might not be available to other researchers and collaborators, so even if you are producing data with MATLAB, it is best to avoid saving the data as `.mat` files; use the best alternative open-source format instead.

## NCO - NetCDF Operators
NetCDF Operators toolkit of command-line operators to both handle and perform analysis on netCDF files. It is the tool of choices to add, rename, modified attributes and variables. It can add internal compression to netcdf4 files and convert between different formats. It is also useful to concatenate files, performing averages and other simple mathematical operations on an entire variable, extracting or deleting variables. The advantage is that the results will be automatically saved in a netcdf file.
Limitations: memory? File size?
[NetCDF Operators](http://nco.sourceforge.net/) is a toolkit of command-line operators to both handle and perform analysis on netCDF files. It is the tool of choice to add, rename, and modify attributes and variables. It can add internal compression to netCDF4 files and convert between different formats. It is also useful for concatenating files, performing averages and other simple mathematical operations on an entire variable, and extracting or deleting variables. The advantage is that the results will be automatically saved in a netCDF file.
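
The operations described above map onto individual NCO commands. The sketch below only builds the command lines as argument lists and runs them if NCO happens to be installed. The file names, the variable names `t2m` and `tas`, and the helper names `nco_commands` and `run` are assumptions for illustration, and the flags shown are common usage rather than a complete reference.

```python
import shutil
import subprocess


def nco_commands(infile, outfile, extra_infile="in2.nc"):
    """Build example NCO command lines (all file and variable names are placeholders)."""
    return {
        # Rename variable "t2m" to "tas"; with a single file this edits it in place.
        "rename": ["ncrename", "-v", "t2m,tas", infile],
        # Overwrite (o) the character (c) attribute "units" of variable "tas".
        "attribute": ["ncatted", "-a", "units,tas,o,c,K", infile],
        # Write netCDF4 output with internal compression (deflate level 4).
        "compress": ["ncks", "-4", "-L", "4", infile, outfile],
        # Concatenate files along the record (usually time) dimension.
        "concatenate": ["ncrcat", infile, extra_infile, outfile],
        # Average over the record dimension.
        "average": ["ncra", infile, outfile],
    }


def run(cmd):
    """Run a command if its executable is available, otherwise just report it."""
    if shutil.which(cmd[0]) is None:
        print("not installed, would run:", " ".join(cmd))
        return None
    return subprocess.run(cmd, check=True)


for label, cmd in nco_commands("in.nc", "out.nc").items():
    print(f"{label}: {' '.join(cmd)}")
```

Calling `run(cmd)` on any of these lists executes it; because some operators modify the input file in place, it is safer to try them on a copy first.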


## CDO - Climate Data Operators
CDO, like NCO is a large tool set to handle and analyse climate and weather data. CDO can also work with grib files, in fact it is a useful tool to convert from grib to netcdf and vice versa. CDO can also be used to compress, convert and concatenate files. However this is usually in conjunction with another operation.
[CDO](https://code.mpimet.mpg.de/projects/cdo/), like NCO, is a large command-line tool set to handle and analyse climate and weather data. CDO can also work with GRIB files; in fact, it is a useful tool to convert from GRIB to netCDF and vice versa. CDO can also be used to compress, convert and concatenate files, often in conjunction with another operation.

One of the strengths of CDO is its ability to combine operations in a succession of steps without creating intermediate files, using little additional memory in the process.

One of the strengths of CDO is its ability to combine operations in succession of steps without creating intermediate files.
CDO is useful to calculate climatologies, regrid datasets, select subset both spatially and temporally. It can be used to perform simple transformations across an entire variable as for NCO. It is useful to handle time axis operations as going from unlimited to limited dimension and setting a new reference time. CDO can integrate with other languages such as python using the ‘cdo’ module.
CDO is useful to calculate climatologies, regrid datasets, and select subsets both spatially and temporally. It can be used to perform simple transformations across an entire variable, as with NCO. It is useful to handle time axis operations such as going from an unlimited to a limited dimension and setting a new reference time. CDO can integrate with other languages such as python using the `cdo` module.
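
To illustrate the chaining described above, here is a minimal sketch that assembles a chained CDO command line without running it. The helper name `cdo_chain`, the file names, the variable name `tas` and the longitude/latitude box are hypothetical. On the command line, CDO applies chained operators from right to left, so the helper takes the steps in the order they should be applied and reverses them.

```python
def cdo_chain(steps, infile, outfile):
    """Build a chained CDO command from `steps` given in order of application.

    CDO evaluates chained operators right to left, so the first step to be
    applied must appear last on the command line, next to the input file.
    """
    chained = ["-" + step for step in reversed(steps)]
    return ["cdo"] + chained + [infile, outfile]


steps = [
    "selname,tas",                   # 1. select one variable
    "sellonlatbox,110,155,-45,-10",  # 2. subset a lon/lat box
    "ymonmean",                      # 3. multi-year monthly mean (a climatology)
]
cmd = cdo_chain(steps, "in.nc", "out.nc")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` where CDO is installed. Given the threading limitations noted for some CDO versions, running each step separately with intermediate files is the more cautious alternative when a chain misbehaves.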

Limitations: specific versions can have issues with threading
Limitations: specific versions can have issues with threading, meaning chained commands are not always safe. CDO **cannot** be built in thread-safe mode because of its underlying HDF dependencies, which means some versions are unreliable and can cause random segfaults when using chained operations.