Slow Simbad.query_objects & IRSA.query_region searches #3025

Open
ericasaw opened this issue Jun 7, 2024 · 6 comments

ericasaw commented Jun 7, 2024

Hi!
I have what might be considered an unusual use case for astroquery--I cross match (~20,000) objects with different catalogs using the Simbad, IRSA, and Xmatch queries for an instrument archive. I wrote code that completed all of this cross-matching for me several years ago and have been using it to update the archive I manage since then. I recently updated my environment moving astroquery to the newest 4.7 release, but my prior code doesn't work like it used to in astroquery 4.3 and I was wondering if something changed.

In particular:

  • Simbad.query_objects has become unusable for name searching with my list of objects (~17,000). I primarily use this function to reverse search 2MASS names returned from another cross match method to verify the integrity of the cross match. Even after increasing the timeout limit to over 24 hours, the function still failed to search all of the objects in that time. In astroquery 4.3 I did not have this issue; a similar number of objects was often searched in 1-2 hours. I am mostly confused because if I loop through the objects one at a time and search the names using Simbad.query_object, the results are incredibly fast (done in under an hour).
  • IRSA.query_region using 2MASS names (I query a 5 arcsecond cone) has also become incredibly slow per object and appears to become even slower for objects later in the list, which again, I didn't struggle with in astroquery 4.3.
  • I am a little confused by some of the search behavior from IRSA.query_region when using names to search for objects in the 2MASS catalog. Some of the names I put into the function return no results when searched with a radius of 5 arcsec, but show up in the results if I widen the search radius to 10 arcsec. I thought that by using the 2MASS names it would just return a result if the name appeared in the 2MASS catalog, but it seems like the search is still coordinate based? Is there a way to search the 2MASS catalog directly using the 2MASS name and not the coordinates?

I do feel like the Xmatch function has sped up significantly since astroquery 4.3 which I love! I was just wondering if there were any changes made there that could have affected the Simbad and IRSA search functions.

@bsipocz
Member

bsipocz commented Jun 7, 2024

I would suggest splitting this into two separate issues, one for SIMBAD and one for IRSA. If possible, please include code examples too: they help with debugging and benchmarking, and they let us spot whether something is being used in an unintended way (so that we can improve the docs to point out what not to do).

For IRSA, I can say that we completely switched out the backend. Not much has changed in the method's code, but a lot could have happened on the server side over the past 3 years. Example code would also help us narrow the problem down to a useful suggestion (e.g. new methods have been added since then).

@ManonMarchand
Member

On the SIMBAD part

If I assume that you want the list of identifiers, the main identifier, and the positions for your 2MASS objects, then the proper way to do your query for now is with a TAP query (in the next astroquery version, this will be used behind the scenes by query_objects).

Let's first generate a sample of 10k 2MASS identifiers:

# let's get 10000 random 2MASS objects
from astroquery.simbad import Simbad
query = """SELECT TOP 10000 id from ident
WHERE id like '2MASS%'
"""
random_2MASS = Simbad.query_tap(query)
print(random_2MASS)
           id          
-----------------------
2MASS J00000002+7417074
2MASS J00000007-0529397
2MASS J00000007-3044366
2MASS J00000009-5455467
2MASS J00000011+0522500
2MASS J00000014+6055141
2MASS J00000015-2913020
2MASS J00000016+3208474
2MASS J00000019-1924498
2MASS J00000021+0105203
2MASS J00000022-3008557
2MASS J00000023-5709445
2MASS J00000024-5742487
2MASS J00000025+5210402
2MASS J00000025-7541166
2MASS J00000026-3441523
.
.
.

You can skip this part, as you already have your own list. You should, however, have an astropy table with a single column holding your own sample (if there are more columns, you will lose upload time when the table is sent to SIMBAD).

We will now write the TAP query:

query = """SELECT main_id, ra, dec, ids 
FROM random_2MASS 
JOIN ident ON ident.id = random_2MASS.id
JOIN basic ON basic.oid = ident.oidref 
JOIN ids ON basic.oid = ids.oidref 
"""

result = Simbad.query_tap(query, random_2MASS=random_2MASS)
<Table length=10000>
        main_id          ...
                         ...
         object          ...
------------------------ ...
        UCAC4 822-000001 ...
               HD 224701 ...
              CTLGD 2509 ...
   GES J00000009-5455467 ...
        UCAC4 477-000001 ...
        UCAC4 755-000001 ...
              CTLGD 9869 ...
   ATO J000.0007+32.1464 ...
        UCAC4 353-000001 ...
               HD 224700 ...
              CTLGD 5514 ...
        UCAC4 165-000001 ...
        UCAC4 162-000001 ...
         TYC 3258-1994-1 ...
        UCAC4 072-000001 ...
          TYC 6992-893-1 ...
   ATO J000.0011+31.2017 ...
.
.
.

It took 5.2 seconds on my machine.

Query explanation

We select

  • main_id : the one that appears at the top of SIMBAD's pages
  • ra, dec : the position in ICRS
  • ids : a string with all the identifiers known to SIMBAD for this object

You could choose more columns from Simbad.list_columns().

random_2MASS is our astropy table that we sent to SIMBAD's servers. It has to be joined to the tables containing the columns we want:

  • ident contains all the ids, so it's the only one that can be joined to our random_2MASS
  • basic has main_id, ra, and dec
  • ids has the string with all the identifiers

See this help page for more explanation.
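
For the use case described in the issue (checking that a 2MASS name really belongs to the matched object), the ids column can be tested directly. This is a minimal sketch, assuming ids is a single string with the identifiers separated by "|". The helper name has_identifier and the sample string are made up for illustration and are not real SIMBAD output.

```python
def has_identifier(ids_string, name):
    """Return True if `name` is one of the '|'-separated identifiers."""

    def normalize(text):
        # Collapse runs of whitespace so "2MASS  J..." matches "2MASS J..."
        return " ".join(text.split())

    identifiers = {normalize(part) for part in ids_string.split("|")}
    return normalize(name) in identifiers


# Illustrative string only, not a real SIMBAD record
example_ids = "HD 000000|2MASS J00000000+0000000|TYC 0000-0000-0"

print(has_identifier(example_ids, "2MASS J00000000+0000000"))  # True
print(has_identifier(example_ids, "2MASS J11111111+1111111"))  # False
```

Applied to each row of the TAP result, this flags the objects whose identifier list does not contain the name that was searched.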

Another possible speed-up is to make sure that you use the SIMBAD mirror closest to you (there is one in Europe and one in the USA).

On Xmatch

@fxpineau : you have a happy user 🙂

@ericasaw
Author

ericasaw commented Jun 12, 2024

@ManonMarchand Thank you for the SIMBAD example! I've never used the tap search function before since query_objects has always worked for me up until now so this is super helpful :-)

@bsipocz Here is an example for the IRSA behavior I'm noticing (particularly for the name matching using IRSA.query_region where it still seems to be using a coordinate match rather than searching using the 2MASS identifier)

These are a few example 2MASS identifiers I have noticed the behavior for: 2MASS J21065473+3844265, 2MASS J21065341+3844529, 2MASS J11052903+4331357, 2MASS J05420897+1229252, 2MASS J23055131-3551130

If you run the following code:

from astroquery.ipac.irsa import Irsa
import astropy.units as u

#this is just one of the example names
result = Irsa.query_region('2MASS J21065473+3844265', catalog="fp_psc", radius=5 * u.arcsec)

result turns up as an astropy table with no entries.

If instead you expand the radius to 10 arcseconds using the same code above, the appropriate object is found. Perhaps I am making the same mistake here as I was with SIMBAD as @ManonMarchand pointed out and instead I should be using a TAP query?
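
A note on names versus coordinates: a 2MASS designation is itself built from the source's J2000 position (JHHMMSSss±DDMMSSs, with the last digits truncated), so the coordinates can be read off the name without going through a name resolver. The following is a minimal sketch in plain Python. parse_2mass_designation is a hypothetical helper, not part of astroquery, and whether querying by these coordinates fixes the 5 arcsec misses has not been tested here.

```python
import re

# "2MASS J" followed by HHMMSSss, a sign, and DDMMSSs
_DESIGNATION = re.compile(
    r"^2MASS J(\d{2})(\d{2})(\d{2})(\d{2})([+-])(\d{2})(\d{2})(\d{2})(\d)$"
)


def parse_2mass_designation(name):
    """Return (ra_deg, dec_deg) encoded in a '2MASS Jhhmmssss+ddmmsss' name."""
    match = _DESIGNATION.match(name.strip())
    if match is None:
        raise ValueError(f"not a 2MASS designation: {name!r}")
    hh, mm, ss, ss_frac, sign, dd, dm, ds, ds_frac = match.groups()

    # Right ascension: hours -> degrees (1 hour = 15 degrees)
    ra_hours = int(hh) + int(mm) / 60 + float(f"{ss}.{ss_frac}") / 3600
    ra_deg = ra_hours * 15

    # Declination: apply the sign last so "-00..." stays negative
    dec_abs = int(dd) + int(dm) / 60 + float(f"{ds}.{ds_frac}") / 3600
    dec_deg = -dec_abs if sign == "-" else dec_abs
    return ra_deg, dec_deg


ra_deg, dec_deg = parse_2mass_designation("2MASS J21065473+3844265")
print(f"{ra_deg:.5f} {dec_deg:.5f}")
```

The two values can then be wrapped as SkyCoord(ra_deg * u.deg, dec_deg * u.deg) and passed to Irsa.query_region in place of the name string. Because the digits in the designation are truncated, the position is only good to a fraction of an arcsecond.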

As for the time, I used IRSA.query_region to look for 16,055 objects in a loop one by one (the 16,055 is not a unique list, there are some objects repeated multiple times) which took 13 hours to run. Granted there are a few other things happening in the loop (saving the results table to a dictionary and printing out a progress report for the loop) so that is likely an exaggerated run time, but still the querying takes much longer than in astroquery 4.3.
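
Since the list contains repeated names, one way to reduce the number of requests is to query each distinct name only once and reuse the stored result. This is a sketch under that assumption. query_unique and the query_fn argument are hypothetical names, and query_fn stands in for whatever call is made per object.

```python
def query_unique(names, query_fn):
    """Call query_fn once per distinct name and return {name: result}."""
    cache = {}
    for name in names:
        if name not in cache:
            # Only the first occurrence of a name triggers a request
            cache[name] = query_fn(name)
    return cache


# Stand-in for a remote query, used here only to count the calls
calls = []

def fake_query(name):
    calls.append(name)
    return f"result for {name}"

results = query_unique(["a", "b", "a", "c", "b"], fake_query)
print(len(calls))  # 3
```

With astroquery, query_fn could be something like lambda name: Irsa.query_region(name, catalog="fp_psc", radius=10 * u.arcsec). The saving depends entirely on how many names are repeated.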

The loop looks like this:

from astroquery.ipac.irsa import Irsa
Irsa.TIMEOUT = 3600
from termcolor import colored
import astropy.units as u

#for the objects with found 2MASS names search for them in the IRSA catalog
results = {}
i = 0
for name in names_2mass:
    #5 arcsec is the size of the IGRINS slit, 10 arcsec is required to search the names well
    result = Irsa.query_region(name, catalog="fp_psc", radius=10 * u.arcsec)
    #if there is a result returned
    if len(result) > 0:
        #if the result is multiple objects, keep the one closest in distance
        if len(result) > 1:
            results[has_2mass[i]] = result.to_pandas().head(1)
        #save the results df to a dictionary for later
        else:
            results[has_2mass[i]] = result.to_pandas()
    #if the name search doesnt return an object, print the object name
    else:
        print(colored(f"FAILED {name}", 'light_red'))
    #update the terminal with loop progress
    print(colored(f"{i+1}", 'magenta'), colored(f"/ {len(names_2mass)}", 'light_blue')) 
    i += 1
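
One caveat about the loop above: head(1) keeps the first row returned, which is the closest match only if the service happens to return rows sorted by distance. A hedged alternative is to compute the separation explicitly and keep the minimum. The sketch below uses plain Python and assumes each row exposes ra and dec in degrees; closest_row and angular_separation_deg are hypothetical helper names.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2)
    sin_dra = math.sin((ra2 - ra1) / 2)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # Clamp to 1.0 to guard against rounding slightly above the domain of asin
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def closest_row(target_ra, target_dec, rows):
    """Return the row whose (ra, dec) lies nearest to the target position."""
    return min(
        rows,
        key=lambda row: angular_separation_deg(
            target_ra, target_dec, row["ra"], row["dec"]
        ),
    )


# Made-up rows for illustration; the nearer one is listed second on purpose
rows = [
    {"name": "far", "ra": 10.0020, "dec": 20.0},
    {"name": "near", "ra": 10.0005, "dec": 20.0},
]
print(closest_row(10.0, 20.0, rows)["name"])  # near
```

The same key function should work on the rows of an astropy table, since those also support row["ra"] style access.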

@ericasaw
Author

I spent some time this afternoon looking into this and it seems like the Irsa.query_region function in 4.7 builds a TAP query based on input coordinates (which I guess come from the 2MASS identifier name) and then uses the Irsa.query_tap function to look for the object within a specified radius. It's still unclear to me why the TAP query doesn't return the object as expected, maybe it is the type of shape I choose to query with (cone)?

Looking through the IRSA VO Table Access Protocol (TAP) Instructions, there is no way to TAP query by name as there is for SIMBAD, which is kind of frustrating. I think that the old Irsa.query_region function in 4.3 worked via requests but also seems to have used coordinates instead of names? Looking at the IRSA Catalog Search Service Application Program Interface, it looks like you can feed in names, but the search still seems to use coordinates even if the name is given.

My guess is that the search is now slower than in astroquery 4.3 because of IRSA's response time. Having seen this afternoon how fast Simbad.query_tap is, I find it interesting how slowly Irsa.query_tap (which runs behind the scenes of Irsa.query_region) seems to work. I'm not sure it is worth my time to build an ADQL query for all of the objects, since that is basically what Irsa.query_region does anyway.
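
A possible way to search by name rather than by position, assuming the fp_psc table exposes a designation column that holds the name without the "2MASS J" prefix: filter on that column in ADQL and send several names per query. The sketch below only builds the query strings. The column names, the batch size, and the helper names are assumptions to check against the catalog's column list before relying on them.

```python
import re

# hhmmssss, a sign, then ddmmsss: the part after "2MASS J"
_CORE = re.compile(r"(\d{8}[+-]\d{7})")


def designation(name):
    """Strip any prefix and return the bare 'hhmmssss+ddmmsss' string."""
    match = _CORE.search(name)
    if match is None:
        raise ValueError(f"no 2MASS designation found in {name!r}")
    return match.group(1)


def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_adql(names, columns=("designation", "ra", "dec")):
    """Build one ADQL query matching all `names` on the designation column."""
    quoted = ", ".join(f"'{designation(name)}'" for name in names)
    return (
        f"SELECT {', '.join(columns)} FROM fp_psc "
        f"WHERE designation IN ({quoted})"
    )


names = ["2MASS J21065473+3844265", "2MASS J23055131-3551130"]
for batch in chunks(names, 500):  # 500 is an arbitrary batch size
    adql = build_adql(batch)
    print(adql)
    # Each string could then be passed to Irsa.query_tap(adql)
```

Whether this is faster than the cone search depends on how the server handles the designation column, which has not been measured here.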

@ManonMarchand
Member

Perhaps I am making the same mistake here as I was with SIMBAD as @ManonMarchand pointed out

Sorry that I made it sound like a mistake; query_tap is new for Simbad as of astroquery 0.4.7.

@aoberto
Contributor

aoberto commented Jun 14, 2024

If we want to dig further into the SIMBAD timing issue with query_objects, it would help to have more details on the columns selected in the output and a list of example names. I just tried 5000 object names, 2MASS or not, in SIMBAD or not, and it took about 30 s.
But as the new version of astroquery.simbad is on its way to being released, maybe it is not so necessary to dig here.
