Why share when you can have the entire Gaia database locally? MyGaiaDB is a simple Python package that provides scripts to help you set up a local Gaia DR3 database, along with local 2MASS, ALLWISE, and CATWISE databases, without needing administrative privileges. It is compatible with all major platforms (Linux, macOS, and Windows) because MyGaiaDB is serverless and only requires Python with SQLite, as long as you have enough disk space.
This code was primarily developed to assist with my research project involving Gaia DR3 XP spectra. It is not intended to cover every aspect of research using Gaia's 1 billion stars. The main goal of this code is to make setting up a local Gaia database, along with 2MASS, ALLWISE, and CATWISE, accessible to everyone. Potential use cases include, but are not limited to, making complex and lengthy cross-matching queries with multiple databases, which can be time-consuming (especially given the online ESA Gaia archive's timeout limitations).
Feel free to modify the code and submit pull requests to enhance its functionality for yourself and others.
Parts of this code are not continuously tested, as it is impractical to run this code with several terabytes of Gaia data on GitHub Actions.
This code requires python >= 3.10 with numpy, pandas, h5py, astropy and tqdm.
This code only needs the sqlite3 library that comes bundled with your Python installation. You need to make sure the sqlite3ext.h header is available in your include path and that your Python is compiled with the SQLite extension enabled (i.e., compiled with PYTHON_CONFIGURE_OPTS="--enable-loadable-sqlite-extensions").
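A quick way to check whether your Python build can load SQLite extensions (a minimal sketch; the enable_load_extension method is only compiled in when extension loading is enabled):

import sqlite3

# enable_load_extension is only present when Python was compiled with
# --enable-loadable-sqlite-extensions
if hasattr(sqlite3.Connection, "enable_load_extension"):
    print("loadable SQLite extensions are supported")
else:
    print("this Python build cannot load SQLite extensions")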
For some optional functionality, galpy and mwdust are required. The downloading functions require wget.
You can install the precompiled MyGaiaDB wheels by running pip install mygaiadb. Alternatively, you can compile the code locally from source. You'll need to add the folder containing sqlite3ext.h to your INCLUDE environment variable. Depending on your platform, the commands are:
- Windows Command Prompt: set INCLUDE=%CONDA_PREFIX%\Library\include;%INCLUDE%
- Windows PowerShell: $env:INCLUDE="$env:CONDA_PREFIX\Library\include"
- macOS with Conda: export CPPFLAGS=-I$CONDA_PREFIX/include
- macOS with sqlite installed via Homebrew: export LDFLAGS="-L$(brew --prefix sqlite)/lib" CPPFLAGS="-I$(brew --prefix sqlite)/include"
- Linux: usually no action needed
Then, you can run python -m pip install . to install the latest commits from GitHub, or python -m pip install -e . to develop MyGaiaDB locally.
Ensure you have at least ~8TB of free disk space with fast random read speed for optimal query performance. First, set an environment variable called MY_ASTRO_DATA that points to a folder containing your astronomical data. To be compatible with other Python packages, there should be a folder called gaia_mirror under MY_ASTRO_DATA that contains all Gaia data (similar to the GAIA_TOOLS_DATA environment variable used by Jo Bovy's gaia_tools).
If starting from scratch on a clean computer, set the MY_ASTRO_DATA environment variable and MyGaiaDB will populate the necessary files and folders. MyGaiaDB uses the ~/.mygaiadb folder to save user-specific settings and tables.
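For example, you could set the variable for the current session before importing MyGaiaDB (a minimal sketch with a placeholder path, assuming the variable is read when MyGaiaDB is imported; setting it in your shell profile is the more permanent option):

import os

# placeholder path; point this at your own data folder
os.environ["MY_ASTRO_DATA"] = "/data/my_astro_data"

# import MyGaiaDB only after the variable is set
from mygaiadb.data import download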
If you already have the data on your computer but in a different directory structure and cannot move it, you can use symbolic links to create the required folder structure without duplicating files. On Linux and macOS, use ln -s {source-dir-or-file-path} {symbolic-dir-or-file-path}. On Windows, use mklink {symbolic-file-path} {source-file-path} for files or mklink /D {symbolic-dir-path} {source-dir-path} for directories.
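You can also create these links from Python with os.symlink (a minimal sketch with hypothetical paths; on Windows this may require administrator rights or developer mode):

import os

# hypothetical locations; adjust to wherever your Gaia data actually lives
source = "/data/shared/gaia_dr3"
target = os.path.join(os.environ["MY_ASTRO_DATA"], "gaia_mirror", "Gaia", "gdr3")

os.makedirs(os.path.dirname(target), exist_ok=True)
os.symlink(source, target, target_is_directory=True)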
The case-sensitive folder structure should look something like the following chart:
~/
├── .mygaiadb
$MY_ASTRO_DATA/
├── gaia_mirror/
│   ├── Gaia/
│   │   ├── gdr3/
│   │   │   ├── Astrophysical_parameters/astrophysical_parameters/
│   │   │   │   ├── _MD5SUM.txt
│   │   │   │   ├── AstrophysicalParameters_000000-003111.csv.gz
│   │   │   │   ├── ******
│   │   │   │   └── AstrophysicalParameters_786097-786431.csv.gz
│   │   │   ├── gaia_source/
│   │   │   │   ├── _MD5SUM.txt
│   │   │   │   ├── GaiaSource_000000-003111.csv.gz
│   │   │   │   ├── ******
│   │   │   │   └── GaiaSource_786097-786431.csv.gz
│   │   │   ├── cross_match/
│   │   │   │   ├── allwise_best_neighbour/
│   │   │   │   │   ├── _MD5SUM.txt
│   │   │   │   │   ├── allwiseBestNeighbour0001.csv.gz
│   │   │   │   │   ├── ******
│   │   │   │   │   └── allwiseBestNeighbour0033.csv.gz
│   │   │   │   ├── tmasspscxsc_best_neighbour/
│   │   │   │   │   ├── _MD5SUM.txt
│   │   │   │   │   ├── tmasspscxscBestNeighbour0001.csv.gz
│   │   │   │   │   ├── ******
│   │   │   │   │   └── tmasspscxscBestNeighbour0047.csv.gz
│   │   │   ├── Spectroscopy/
│   │   │   │   ├── rvs_mean_spectrum/
│   │   │   │   │   ├── _MD5SUM.txt
│   │   │   │   │   ├── RvsMeanSpectrum_000000-003111.csv.gz
│   │   │   │   │   ├── ******
│   │   │   │   │   └── RvsMeanSpectrum_786097-786431.csv.gz
│   │   │   │   ├── xp_continuous_mean_spectrum/
│   │   │   │   │   ├── _MD5SUM.txt
│   │   │   │   │   ├── XpContinuousMeanSpectrum_000000-003111.csv.gz
│   │   │   │   │   ├── ******
│   │   │   │   │   └── XpContinuousMeanSpectrum_786097-786431.csv.gz
│   │   │   │   ├── xp_sampled_mean_spectrum/
│   │   │   │   │   ├── _MD5SUM.txt
│   │   │   │   │   ├── XpSampledMeanSpectrum_000000-003111.csv.gz
│   │   │   │   │   ├── ******
│   │   │   │   │   └── XpSampledMeanSpectrum_786097-786431.csv.gz
├── 2mass_mirror/
│   ├── psc_aaa.gz
│   ├── ******
│   └── xsc_baa.gz
├── allwise_mirror/
│   ├── wise-allwise-cat-part01.bz2
│   ├── ******
│   └── wise-allwise-cat-part48.bz2
└── catwise_mirror/
    └── 2020/
        ├── 000/
        │   ├── 0000m016_opt1_20191208_213403_ab_v5_cat_b0.tbl.gz
        │   └── ******
        ├── 001/
        │   ├── 0015m016_opt1_20191209_054819_ab_v5_cat_b0.tbl.gz
        │   └── ******
        └── ******
To download and populate the data with MyGaiaDB, you can do
from mygaiadb.data import download
# for gaia_source
download.download_gaia_source()
# for gaia astrophysical_parameters
download.download_gaia_astrophysical_parameters()
# for 2MASS best neighbour
download.download_2mass_best_neightbour()
# for ALLWISE best neighbour
download.download_allwise_best_neightbour()
# for 2MASS
download.download_2mass()
# for ALLWISE
download.download_allwise()
# for CATWISE
download.download_catwise()
# for xp continuous
download.download_gaia_xp_continuous()
# for xp sampled
download.download_gaia_xp_sampled()
# for rvs spectra
download.download_gaia_rvs()
Official data links:
- Official Gaia data can be accessed here: https://cdn.gea.esac.esa.int/Gaia/
- Official 2MASS data can be accessed here: https://irsa.ipac.caltech.edu/2MASS/download/allsky/
- Official ALLWISE data can be accessed here: https://irsa.ipac.caltech.edu/data/download/wise-allwise/
- Official CATWISE data can be accessed here: https://catwise.github.io/
The following functions are used to compile the databases. Each function only needs to be run once per computer where the data is stored, and each function will generate large files. If you are using a shared computing server, only one user needs to run these functions. Once the databases are compiled, share the MY_ASTRO_DATA folder path with other users; they can set their own MY_ASTRO_DATA environment variable to point to that folder. Multiple users on the same computer can access the SQL databases simultaneously, as MyGaiaDB sets the databases to read-only mode before loading them, preventing accidental modifications.
from mygaiadb.data import compile
# compile Gaia SQL dataset
compile.compile_gaia_sql_db()
# compile 2MASS SQL dataset
compile.compile_tmass_sql_db()
# compile ALLWISE SQL dataset
compile.compile_allwise_sql_db()
# compile CATWISE SQL dataset
compile.compile_catwise_sql_db()
# turn the compressed XP coeffs files into h5, with the option to save the correlation matrices too
# a large amount of disk space (~3TB) is required if save_correlation_matrix=True
compile.compile_xp_continuous_h5(save_correlation_matrix=False)
# compile all XP coeffs into a single h5, partitioned into batches of stars by their HEALPix,
# with the option to save the correlation matrices too, BUT it requires you to run compile_xp_continuous_h5(save_correlation_matrix=True) first
# a large amount of disk space (~3TB) is required if save_correlation_matrix=True
compile.compile_xp_continuous_allinone_h5(save_correlation_matrix=False)
In MyGaiaDB, the following databases are available if all have been compiled: gaiadr3.gaia_source, gaiadr3.allwise_best_neighbour, gaiadr3.tmasspscxsc_best_neighbour, gaiadr3.astrophysical_parameters, tmass.twomass_psc, allwise.allwise and catwise.catwise.
A few utility functions are available to list tables and their columns. Below is a brief description of each table:
- gaiadr3.gaia_source
- gaiadr3.allwise_best_neighbour: Identical to allwise_best_neighbour on the Gaia Archive
- gaiadr3.tmasspscxsc_best_neighbour: Identical to tmass_psc_xsc_best_neighbour on the Gaia Archive
- gaiadr3.astrophysical_parameters: A simplified version of astrophysical_parameters on the Gaia Archive with only essential columns retained
- tmass.twomass_psc: A simplified version of the 2MASS Point Source Catalog (PSC) with only essential columns retained. Official description: https://irsa.ipac.caltech.edu/2MASS/download/allsky/format_psc.html
- allwise.allwise: A simplified version of the ALLWISE source catalog with only essential columns retained. Official description: https://wise2.ipac.caltech.edu/docs/release/allwise/expsup/sec2_1a.html
- catwise.catwise: A simplified version of the CATWISE source catalog with only essential columns retained. Official description: https://irsa.ipac.caltech.edu/data/WISE/CatWISE/gator_docs/catwise_colDescriptions.html
You can use the list_all_tables() function to get a list of tables, excluding user_table. For example:
>>> from mygaiadb.query import LocalGaiaSQL
>>> # initialize a local Gaia SQL database instance
>>> local_db = LocalGaiaSQL()
>>> # print a list of tables
>>> print(local_db.list_all_tables())
['gaiadr3.gaia_source', ...]
You can use get_table_column(table_name) to get a list of columns of a table. The table name must be in the format {database_name}.{table_name}, gaiadr3.gaia_source in this case:
>>> from mygaiadb.query import LocalGaiaSQL
>>> # initialize a local Gaia SQL database instance
>>> local_db = LocalGaiaSQL()
>>> # print a list of columns of a table
>>> print(local_db.get_table_column("gaiadr3.gaia_source"))
['source_id', 'random_index', ...]
If you want to manage and edit the databases with a GUI, you can use SQLiteStudio or DB Browser for SQLite.
SQL queries in MyGaiaDB are largely the same as on the Gaia Archive. MyGaiaDB implements all mathematical functions in a way aligned with ADQL, but only a few of the advanced ADQL functions like DISTANCE, as well as some of the additional functions available on the ESA Gaia TAP+.
For example, the following query uses a special function called DISTANCE defined in ADQL
SELECT DISTANCE(179., 10., G.ra, G.dec) as ang_sep
FROM gaiadr3.gaia_source as G
WHERE G.source_id = 154378304526208
returns 132.172604 on the Gaia Archive. You can use such a query in the same manner as you would on the Gaia Archive
>>> from mygaiadb.query import LocalGaiaSQL
>>> # initialize a local Gaia SQL database instance
>>> local_db = LocalGaiaSQL()
>>> local_db.query("""
... SELECT DISTANCE(179., 10., G.ra, G.dec) as ang_sep
... FROM gaiadr3.gaia_source as G
... WHERE G.source_id = 154378304526208
... """)
ang_sep
0 132.172604
which gives the same result of 132.172604.
The following query uses conventional math functions to approximate the uncertainty in the Gaia G magnitude:
>>> from mygaiadb.query import LocalGaiaSQL
>>> # initialize a local Gaia SQL database instance
>>> local_db = LocalGaiaSQL()
>>> # CDS equation for conversion: http://vizier.cds.unistra.fr/viz-bin/VizieR-n?-source=METAnot&catid=1350&notid=63&-out=text
>>> local_db.query("""
... SELECT sqrt(power(((2.5 / log(10)) * (1 / G.phot_g_mean_flux_over_error)), 2) + power(0.0027553202, 2)) as phot_g_mean_mag_error
... FROM gaiadr3.gaia_source as G
... WHERE G.source_id = 154378304526208
... """)
phot_g_mean_mag_error
0 0.003619
Another example: the following query, which works on the Gaia Archive, also works in MyGaiaDB to select the first 100 Gaia sources with XP coefficients
SELECT TOP 100 *
FROM gaiadr3.gaia_source as G
WHERE (G.has_xp_continuous = 'True')
To run this query in MyGaiaDB, you can do the following and will get a pandas DataFrame back as the result
>>> from mygaiadb.query import LocalGaiaSQL
>>> # initialize a local Gaia SQL database instance
>>> local_db = LocalGaiaSQL()
>>> local_db.query("""SELECT TOP 100 * FROM gaiadr3.gaia_source as G WHERE (G.has_xp_continuous = 'True')""")
source_id random_index ... has_xp_sampled has_rvs
0 ...
The following example query is too complex for the Gaia Archive, so you would get a timeout error there, but luckily you've got MyGaiaDB to do the job. It queries the gaia_source table, the astrophysical_parameters table, and the 2MASS and ALLWISE tables all at once. Moreover, MyGaiaDB sets each database to read-only before loading it. If you want to edit a database afterward, you must set the appropriate permissions manually each time you have used MyGaiaDB.
>>> from mygaiadb.query import LocalGaiaSQL
>>> # initialize a local Gaia SQL database instance
>>> local_db = LocalGaiaSQL()
>>> query = """
... SELECT G.source_id, G.ra, G.dec, G.pmra, G.pmdec, G.parallax, G.parallax_error, G.phot_g_mean_mag, GA.logg_gspspec,
... TM.j_m, AW.w1mpro
... FROM gaiadr3.gaia_source as G
... INNER JOIN gaiadr3.astrophysical_parameters as GA on GA.source_id = G.source_id
... INNER JOIN gaiadr3.tmasspscxsc_best_neighbour as T on G.source_id = T.source_id
... INNER JOIN gaiadr3.allwise_best_neighbour as W on W.source_id = T.source_id
... INNER JOIN tmass.twomass_psc as TM on TM.designation = T.original_ext_source_id
... INNER JOIN allwise.allwise as AW on AW.designation = W.original_ext_source_id
... WHERE (G.has_xp_continuous = 1) AND (G.ruwe < 1.4) AND (G.ipd_frac_multi_peak <= 2) AND (G.ipd_gof_harmonic_amplitude<0.1) AND (GA.logg_gspspec < 3.0)
... """
>>> # takes ~12 hours to complete
>>> local_db.save_csv(query, "output.csv", chunksize=50000, overwrite=True, comments=True)
...
As you can see for has_xp_continuous, we can also use 1 to represent true instead of 'True' as used by the Gaia Archive; both are fine with MyGaiaDB.
The overwrite=True option means the function will save the file even if a file with the same name already exists, and comments=True means the query will be saved as comments in the csv file so you know how to reproduce the result. To read the comments from the csv file, you can use the following code
>>> from itertools import takewhile
>>> with open("output.csv", "r") as fobj:
... headiter = takewhile(lambda s: s.startswith("#"), fobj)
... header = list(headiter)
>>> print(" ".join(header).replace(" # ", "").replace("# ", ""))
SELECT G.source_id, G.ra, G.dec, G.pmra, G.pmdec, G.parallax, G.parallax_error, G.phot_g_mean_mag, GA.logg_gspspec,
TM.j_m, AW.w1mpro
FROM gaiadr3.gaia_source as G
INNER JOIN gaiadr3.astrophysical_parameters as GA on GA.source_id = G.source_id
INNER JOIN gaiadr3.tmasspscxsc_best_neighbour as T on G.source_id = T.source_id
INNER JOIN gaiadr3.allwise_best_neighbour as W on W.source_id = T.source_id
INNER JOIN tmass.twomass_psc as TM on TM.designation = T.original_ext_source_id
INNER JOIN allwise.allwise as AW on AW.designation = W.original_ext_source_id
WHERE (G.has_xp_continuous = 1) AND (G.ruwe < 1.4) AND (G.ipd_frac_multi_peak <= 2) AND (G.ipd_gof_harmonic_amplitude<0.1) AND (GA.logg_gspspec < 3.0)
MyGaiaDB also has callback functionality called LambdaCallback; these callbacks can be applied while you run a query. For example, you can create a callback that converts ra in degrees to ra_rad in radians, so your csv file will end up with a new column called ra_rad. Functions in a LambdaCallback must have argument names that exactly match column names in your query so that MyGaiaDB knows which columns to use on the fly.
>>> import numpy as np
>>> from mygaiadb.query import LocalGaiaSQL, LambdaCallback
>>> # initialize a local Gaia SQL database instance
>>> local_db = LocalGaiaSQL()
>>> query = """
... SELECT G.source_id, G.ra, G.dec
... FROM gaiadr3.gaia_source as G
... LIMIT 100
... """
>>> ra_conversion = LambdaCallback(new_col_name="ra_rad", func=lambda ra: ra / 180 * np.pi)
>>> local_db.save_csv(query, "output.csv", chunksize=50000, overwrite=True, callbacks=[ra_conversion], comments=True)
We also have a few useful callbacks included by default to add columns like zero-point corrected parallax or extinction:
>>> from mygaiadb.query import ZeroPointCallback, DustCallback
>>> query = """
... SELECT G.source_id, G.ra, G.dec, G.parallax, G.phot_bp_mean_mag, G.nu_eff_used_in_astrometry, G.pseudocolour, G.astrometric_params_solved
... FROM gaiadr3.gaia_source as G
... LIMIT 100000
... """
>>> # adding zero-point corrected parallax using official Gaia DR3 parallax zero-point python package
>>> zp_callback = ZeroPointCallback(new_col_name="parallax_w_zp")
>>> # adding SFD E(B-V) in 2MASS H band filter using mwdust python package
>>> dust_callback = DustCallback(new_col_name="sfd_ebv", filter="2MASS H", dustmap="SFD")
>>> local_db.save_csv(query, "output.csv", chunksize=50000, overwrite=True, callbacks=[zp_callback, dust_callback])
MyGaiaDB supports user-uploaded tables. You can load your table with pandas first and then do
>>> from mygaiadb.query import LocalGaiaSQL
>>> import pandas as pd
>>> local_db = LocalGaiaSQL()
>>> local_db.upload_user_table(pd.DataFrame({"source_id": [158329674677120, 27373239622051072, 11638124421673088]}), tablename="my_table_1")
and then carry on querying with my_table_1, cross-matching with other tables like
>>> local_db.query("""SELECT * FROM gaiadr3.gaia_source as G INNER JOIN user_table.my_table_1 as MY on MY.source_id = G.source_id""")
source_id random_index ... has_rvs source_id
0 ...
You can check the list of your own user tables, along with their column names, by using list_user_tables()
>>> local_db.list_user_tables()
{'my_table_1': ['source_id']}
and you can remove a user table, my_table_1 in this case, by using remove_user_table()
>>> local_db.remove_user_table("my_table_1")
There can be use cases where you want to run a function (e.g., a machine learning model) on a large batch of Gaia source ids with reasonable memory usage. You can use MyGaiaDB to do that in batches, provided you have compiled a single h5 file with mygaiadb.data.compile.compile_xp_continuous_allinone_h5().
from mygaiadb.spec import yield_xp_coeffs
for i in yield_xp_coeffs(a_very_long_source_id_array):
    coeffs, idx = i  # unpack
    # XP coeffs of idx from the original a_very_long_source_id_array

# alternatively, if you also want the coeffs errors
for i in yield_xp_coeffs(a_very_long_source_id_array, return_errors=True):
    coeffs, idx, coeffs_err = i  # unpack

# alternatively, if you want the coeffs errors and some other columns like bp_n_relevant_bases and rp_n_relevant_bases
# ref: https://gea.esac.esa.int/archive/documentation/GDR3//Gaia_archive/chap_datamodel/sec_dm_spectroscopic_tables/ssec_dm_xp_summary.html
for i in yield_xp_coeffs(a_very_long_source_id_array, return_errors=True, return_additional_columns=["bp_n_relevant_bases", "rp_n_relevant_bases"]):
    coeffs, idx, coeffs_err, bp_n_relevant_bases, rp_n_relevant_bases = i  # unpack
For example, if you want to infer M_H with your machine learning model on many XP spectra:
import numpy as np
from mygaiadb.spec import yield_xp_coeffs

# your_ml_model is a placeholder for your own trained model
m_h = np.ones(len(a_very_long_source_id_array)) * np.nan
for i in yield_xp_coeffs(a_very_long_source_id_array):
    coeffs, idx = i  # unpack
    m_h[idx] = your_ml_model(coeffs)
This package was originally developed for the paper Towards an astronomical foundation model for stars with a Transformer-based model.
This project is licensed under the MIT License - see the LICENSE file for details.