Earth science, or geoscience, includes all fields of natural science related to the planet Earth. It deals with the planet's physical, chemical, and biological constitution and with the interactions among these components. Earth science encompasses four main branches of study: the biosphere, the hydrosphere, the atmosphere, and the lithosphere, each of which is further divided into more specialized fields.
Python is a widely used, open-source programming language. In Earth science, scientific programming languages like Python help you speed up and automate lengthy tasks, such as selecting and downloading large datasets or performing repetitive calculations that you might otherwise have to do manually.
Depending on the scientific application, sensor, or measuring device, geoscience data is stored in many formats and data types. We need to be aware of this so that we can plan how to read the data into our analysis environment.
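As a rough illustration of planning how to read data, the sketch below maps a file extension to a library that commonly reads that format. The `READERS` table and the `suggest_reader` helper are hypothetical names introduced only for this example, and the pairings are suggestions rather than a complete or official list.

```python
from pathlib import Path

# Hypothetical lookup table: file extension -> library that commonly reads it.
# The pairings are illustrative suggestions, not an exhaustive list.
READERS = {
    ".csv": "pandas",
    ".nc": "xarray",
    ".zarr": "xarray",
    ".tif": "rasterio",
    ".tiff": "rasterio",
    ".shp": "geopandas",
    ".geojson": "geopandas",
    ".las": "laspy",
    ".laz": "laspy",
}


def suggest_reader(filename: str) -> str:
    """Return the name of a library that typically reads this file type."""
    suffix = Path(filename).suffix.lower()
    return READERS.get(suffix, "unknown")


# The file names below are made up for the example.
for name in ["stations.csv", "sst_2020.nc", "dem.TIF", "notes.txt"]:
    print(f"{name}: {suggest_reader(name)}")
```

Matching on the lowercased suffix keeps the lookup insensitive to how a file happens to be capitalized. A real workflow would also need to inspect the file contents, since an extension alone does not guarantee the format.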
A set of general-purpose Python libraries lets us perform data analysis and data visualization. They provide data structures, mathematical operations, statistical analysis functions, and a collection of functions for visualizing data properties.
- numpy. NumPy is the fundamental library for scientific computing in Python. It provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and more.
- matplotlib. Matplotlib is the main plotting library for Python. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits. Seaborn is a higher-level, statistics-oriented visualization library built on top of Matplotlib.
- pandas. Pandas is a Python software library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.
- scipy. SciPy is a free and open-source Python library used for scientific computing and technical computing. It includes modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.
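A minimal sketch of how two of these libraries fit together, assuming NumPy and pandas are installed: NumPy generates a synthetic daily temperature series and pandas groups it by calendar month. The values, the variable names, and the seasonal model are invented for illustration and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic daily temperatures for one year (made-up values, illustration only).
rng = np.random.default_rng(seed=0)
days = pd.date_range("2023-01-01", "2023-12-31", freq="D")
day_number = np.arange(days.size)

# A sine wave stands in for the seasonal cycle; Gaussian noise stands in for weather.
seasonal = 15.0 + 10.0 * np.sin(2.0 * np.pi * (day_number - 100) / 365.0)
noise = rng.normal(0.0, 2.0, days.size)
temperature = pd.Series(seasonal + noise, index=days, name="temp_c")

# Group by calendar month and compute the mean of each group.
monthly_mean = temperature.groupby(temperature.index.month).mean()

print(monthly_mean.round(1))
```

The resulting series is indexed by month number (1 to 12), which makes it easy to pass on to Matplotlib for plotting or to SciPy for further analysis.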
Several Python libraries are available for working with spatially distributed data. The most relevant ones are listed below.
- arraylake. Arraylake is a data lake platform for managing multidimensional arrays and metadata in the cloud.
- dask. Dask is a flexible library for parallel computing in Python.
- kerchunk. Kerchunk is a library that provides a unified way to represent a variety of chunked, compressed data formats (e.g. NetCDF/HDF5, GRIB2, TIFF, …), allowing efficient access to the data from traditional file systems or cloud object storage.
- numba. Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code.
- pooch. Pooch is a data file downloader.
- pystac. PySTAC is a library for working with SpatioTemporal Asset Catalogs (STAC).
- stackstac. stackstac loads a STAC collection into xarray with dask.
- xarray. Xarray is an open source project and Python package that introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays, which allows for more intuitive, more concise, and less error-prone user experience.
- xarray-spatial. Xarray-Spatial implements common raster analysis functions using Numba and provides an easy-to-install, easy-to-extend codebase for raster analysis.
- zarr. Zarr is a format for the storage of chunked, compressed, N-dimensional arrays.
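To make the idea of labeled arrays in the xarray entry more concrete, here is a small sketch that assumes xarray and NumPy are installed. The array holds random placeholder values, and the coordinate values and variable names are invented for the example.

```python
import numpy as np
import xarray as xr

# Synthetic 3-D field (time, lat, lon); the values are random placeholders.
rng = np.random.default_rng(seed=0)
data = rng.random((4, 3, 5))

temperature = xr.DataArray(
    data,
    dims=("time", "lat", "lon"),
    coords={
        "time": np.arange(4),
        "lat": [10.0, 20.0, 30.0],
        "lon": [-120.0, -110.0, -100.0, -90.0, -80.0],
    },
    name="temperature",
)

# Select by coordinate value instead of by integer position.
point_series = temperature.sel(lat=20.0, lon=-100.0)

# Reduce over a named dimension instead of an axis number.
time_mean = temperature.mean(dim="time")

print(point_series.shape)
print(time_mean.dims)
```

Referring to `lat`, `lon`, and `time` by name removes the need to remember which axis is which, and this is where the reduction in errors mentioned in the xarray entry comes from.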
- GDAL. GDAL is a translator library for raster and vector geospatial data formats, released under an MIT-style open source license by the Open Source Geospatial Foundation.
- RasterFrames. RasterFrames brings together Earth-observation (EO) data access, cloud computing, and DataFrame-based data science. The recent explosion of EO data from public and private satellite operators presents both a huge opportunity and a huge challenge to the data analysis community.
- rasterio. Rasterio allows access to geospatial raster data.
- rasterstats. Rasterstats is a Python module for summarizing geospatial raster datasets based on vector geometries. It includes functions for zonal statistics and interpolated point queries. The command-line interface allows for easy interoperability with other GeoJSON tools.
- RSGISLib. The Remote Sensing and Geographical Information Systems software Library (RSGISLib) contains a number of algorithms for processing and analysing remote sensing data that are the product of research carried out by the authors and their collaborators.
- fiona. Fiona focuses on reading and writing data in standard Python IO style and relies upon familiar Python types and protocols such as files, dictionaries, mappings, and iterators. Fiona can read and write real-world data using multi-layered GIS formats and zipped virtual file systems, and it integrates readily with other Python GIS packages such as pyproj, Rtree, and Shapely.
- GDAL/OGR. Several software programs use the GDAL/OGR libraries to read and write multiple GIS formats.
- geomesa. GeoMesa is an open source suite of tools that enables large-scale geospatial querying and analytics on distributed computing systems. GeoMesa provides spatio-temporal indexing on top of the Accumulo, HBase, Google Bigtable and Cassandra databases for massive storage of point, line, and polygon data. GeoMesa also provides near real time stream processing of spatio-temporal data by layering spatial semantics on top of Apache Kafka.
- geopandas. GeoPandas is an open source Python library for working with geospatial data. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. Geometric operations are performed by shapely. GeoPandas further depends on fiona for file access and matplotlib for data visualization.
- pyproj. pyproj is the Python interface to PROJ, a library for cartographic projections and coordinate transformations.
- shapely. Shapely is a BSD-licensed Python package for manipulation and analysis of planar geometric objects.
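The planar geometric operations mentioned for shapely can be sketched as follows, assuming Shapely is installed. The coordinates are arbitrary illustrative values in a flat Cartesian plane, not geographic coordinates.

```python
from shapely.geometry import Point, Polygon

# A unit-square polygon and two test points (arbitrary illustrative coordinates).
square = Polygon([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)])
inside = Point(0.5, 0.5)
outside = Point(2.0, 2.0)

print(square.area)
print(square.contains(inside))
print(square.contains(outside))

# Buffering a point by a distance yields a polygon that approximates a circle.
circle = inside.buffer(0.25)
print(circle.within(square))
```

GeoPandas applies these same predicates and operations to whole columns of geometries, so a test like `contains` can be run across many features at once.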
- PDAL. PDAL is a C++ library for translating and manipulating point cloud data. It is very much like the GDAL library which handles raster and vector data.
- laspy. LAS, and its compressed counterpart LAZ, is a popular format for lidar point cloud and full-waveform data. laspy reads and writes these formats and provides a Python API via NumPy arrays.
- numpy. NumPy is the fundamental library for scientific computing in Python. When combined with a reader/writer library like laspy, it lets us store point cloud data in a NumPy array and then filter and process the data. NumPy is also useful across the geospatial domain in general.
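A minimal sketch of filtering a point cloud held in a NumPy array, assuming only NumPy is installed. The points are random synthetic values so that the example runs on its own; in practice they would come from a reader such as laspy. The 2 m height threshold and the 50 m window are hypothetical choices.

```python
import numpy as np

# Synthetic point cloud: N rows of (x, y, z). Random values replace real lidar
# returns so the sketch is self-contained.
rng = np.random.default_rng(seed=0)
points = np.column_stack([
    rng.uniform(0.0, 100.0, 1000),  # x in metres
    rng.uniform(0.0, 100.0, 1000),  # y in metres
    rng.uniform(0.0, 30.0, 1000),   # z (height) in metres
])

# Boolean mask: keep points above a hypothetical 2 m height threshold
# that also fall inside a 50 m x 50 m window.
mask = (points[:, 2] > 2.0) & (points[:, 0] < 50.0) & (points[:, 1] < 50.0)
selected = points[mask]

print(selected.shape)
```

Boolean masks evaluate the condition for every point in a single vectorized step, which matters because lidar files routinely contain millions of points.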
- cartopy. Cartopy is a Python package designed for geospatial data processing in order to produce maps and other geospatial data analyses. Cartopy makes use of the powerful PROJ, NumPy and Shapely libraries and includes a programmatic interface built on top of Matplotlib for the creation of publication quality maps.
- geoplotlib. geoplotlib is a Python toolbox for visualizing geographical data and making maps.
- ipyleaflet. ipyleaflet creates interactive maps in a Jupyter Notebook.
- Folium. An alternative to ipyleaflet, Folium is also a bridge to Leaflet.js. The difference between the two is that Folium is built toward static visualizations, whereas ipyleaflet builds interactive widgets. A useful feature of Folium is that it provides easy functionality to export an interactive map to HTML, making it a useful tool in web development.
- TorchGeo. TorchGeo is a PyTorch domain library, similar to torchvision, providing datasets, samplers, transforms, and pre-trained models specific to geospatial data.
- A course on Geographic Data Science. University of Liverpool. Dani Arribas-Bel.
- An Introduction to Earth and Environmental Data Science. Ryan Abernathey.
- Earth Lab. Resources developed by Earth Lab at University of Colorado, Boulder. The website contains course lessons and blog posts related to Earth data science.
- Geographic Data Science with Python. Sergio J. Rey, Dani Arribas-Bel, Levi J. Wolf.
- Geospatial Data Science. Michael Szell. University of Copenhagen.
- Introduction to GIS Programming. Qiusheng Wu.
- Introduction to Python for Geographic Data Analysis. Henrikki Tenkanen, Vuokko Heikinheimo & David Whipp.
- Introduction to Spatial Data Programming with Python. Michael Dorman. Department of Geography and Environmental Development, Ben-Gurion University of the Negev.
- Jupyter Meets the Earth.
- Project Pythia. An education and training hub for the geoscientific Python community.
- Spatial Data Management. Qiusheng Wu. University of Tennessee, Knoxville.
- The Ultimate List of GIS Formats and Geospatial File Extensions. GISGeography.
Created: 08/18/2022; Updated: 11/22/2023
Carlos Lizárraga.