feat: server mode collections search #909

Merged
merged 10 commits into from
Mar 8, 2024
95 changes: 63 additions & 32 deletions docs/notebooks/api_user_guide/4_search.ipynb
Expand Up @@ -2228,7 +2228,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous request we made use of the whoosh query language which can be used to do complex text search. It supports the boolean operators AND, OR and NOT to combine the search terms. If a space is given between two words as in the example above, this corresponds to the operator AND. Brackets '()' can also be used. The example above also shows the use of the wildcard operator '*' which can represent any numer of characters. The wildcard operator '?' always represents only one character. It is also possible to match a range of terms by using square brackets '[]' and TO, e.g. [A TO D] will match all words in the lexical range between A and D. Below you can find some examples for the different operators."
"In the previous request we made use of the [whoosh query language](https://whoosh.readthedocs.io/en/latest/querylang.html#the-default-query-language), which can be used to do complex text searches. It supports the boolean operators `AND`, `OR` and `NOT` to combine the search terms. A space between two words, as in the example above, corresponds to the operator `AND`. Brackets `()` can also be used. The example above also shows the use of the wildcard operator `*`, which can represent any number of characters. The wildcard operator `?` always represents exactly one character. It is also possible to match a range of terms by using square brackets `[]` and `TO`, e.g. `[A TO D]` will match all words in the lexical range between A and D. Below you can find some examples for the different operators."
]
},
{
Expand Down Expand Up @@ -2274,9 +2274,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"returns all product types where the platform is either LANDSAT or SENTINEL1:\n",
"\n",
"['L57_REFLECTANCE', 'LANDSAT_C2L1', 'LANDSAT_C2L2', 'LANDSAT_C2L2ALB_BT', 'LANDSAT_C2L2ALB_SR', 'LANDSAT_C2L2ALB_ST', 'LANDSAT_C2L2ALB_TA', 'LANDSAT_C2L2_SR', 'LANDSAT_C2L2_ST', 'LANDSAT_ETM_C1', 'LANDSAT_ETM_C2L1', 'LANDSAT_ETM_C2L2', 'LANDSAT_TM_C1', 'LANDSAT_TM_C2L1', 'LANDSAT_TM_C2L2', 'S1_SAR_GRD', 'S1_SAR_OCN', 'S1_SAR_RAW', 'S1_SAR_SLC']"
"returns all product types where the platform is either LANDSAT or SENTINEL1."
]
},
{
Expand Down Expand Up @@ -2319,9 +2317,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"returns all product types which contain either the keywords LANDSAT and collection2 or the keyword SAR:\n",
"\n",
"['LANDSAT_C2L1', 'LANDSAT_C2L2', 'LANDSAT_C2L2ALB_BT', 'LANDSAT_C2L2ALB_SR', 'LANDSAT_C2L2ALB_ST', 'LANDSAT_C2L2ALB_TA', 'LANDSAT_C2L2_SR', 'LANDSAT_C2L2_ST', 'LANDSAT_ETM_C2L1', 'LANDSAT_ETM_C2L2', 'LANDSAT_TM_C2L1', 'LANDSAT_TM_C2L2', 'S1_SAR_GRD', 'S1_SAR_OCN', 'S1_SAR_RAW', 'S1_SAR_SLC']"
"returns all product types which contain either the keywords LANDSAT and collection2 or the keyword SAR."
]
},
{
Expand Down Expand Up @@ -2366,9 +2362,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"returns all product types where the platformSerialIdentifier is composed of 'L' and one other character:\n",
"\n",
"['L57_REFLECTANCE', 'L8_OLI_TIRS_C1L1', 'L8_REFLECTANCE', 'LANDSAT_C2L1', 'LANDSAT_C2L2', 'LANDSAT_C2L2ALB_BT', 'LANDSAT_C2L2ALB_SR', 'LANDSAT_C2L2ALB_ST', 'LANDSAT_C2L2ALB_TA', 'LANDSAT_C2L2_SR', 'LANDSAT_C2L2_ST', 'LANDSAT_ETM_C1', 'LANDSAT_ETM_C2L1', 'LANDSAT_ETM_C2L2', 'LANDSAT_TM_C1', 'LANDSAT_TM_C2L1', 'LANDSAT_TM_C2L2']"
"returns all product types where the platformSerialIdentifier is composed of 'L' and one other character."
]
},
{
Expand Down Expand Up @@ -2439,9 +2433,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"returns all product types where the platform is SENTINEL1, SENTINEL2 or SENTINEL3:\n",
"\n",
"['S1_SAR_GRD', 'S1_SAR_OCN', 'S1_SAR_RAW', 'S1_SAR_SLC', 'S2_MSI_L1C', 'S2_MSI_L2A', 'S2_MSI_L2A_COG', 'S2_MSI_L2A_MAJA', 'S2_MSI_L2B_MAJA_SNOW', 'S2_MSI_L2B_MAJA_WATER', 'S2_MSI_L3A_WASP', 'S3_EFR', 'S3_ERR', 'S3_LAN', 'S3_OLCI_L2LFR', 'S3_OLCI_L2LRR', 'S3_OLCI_L2WFR', 'S3_OLCI_L2WRR', 'S3_RAC', 'S3_SLSTR_L1RBT', 'S3_SLSTR_L2AOD', 'S3_SLSTR_L2FRP', 'S3_SLSTR_L2LST', 'S3_SLSTR_L2WST', 'S3_SRA', 'S3_SRA_A', 'S3_SRA_BS', 'S3_SY_AOD', 'S3_SY_SYN', 'S3_SY_V10', 'S3_SY_VG1', 'S3_SY_VGP', 'S3_WAT']"
"returns all product types where the platform is SENTINEL1, SENTINEL2 or SENTINEL3."
]
},
{
Expand All @@ -2454,7 +2446,7 @@
},
{
"cell_type": "code",
"execution_count": 74,
"execution_count": 4,
"metadata": {},
"outputs": [
{
Expand All @@ -2477,7 +2469,7 @@
" 'LANDSAT_TM_C2L2']"
]
},
"execution_count": 74,
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -2491,29 +2483,68 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"['LANDSAT_C2L1', \n",
"'L57_REFLECTANCE', \n",
"'LANDSAT_C2L2', \n",
"'LANDSAT_C2L2ALB_BT', \n",
"'LANDSAT_C2L2ALB_SR', \n",
"'LANDSAT_C2L2ALB_ST', \n",
"'LANDSAT_C2L2ALB_TA', \n",
"'LANDSAT_C2L2_SR', \n",
"'LANDSAT_C2L2_ST', \n",
"'LANDSAT_ETM_C1', \n",
"'LANDSAT_ETM_C2L1', \n",
"'LANDSAT_ETM_C2L2', \n",
"'LANDSAT_TM_C1', \n",
"'LANDSAT_TM_C2L1', \n",
"'LANDSAT_TM_C2L2']"
"The product types in the result are ordered by how well they match the criteria. In the example above, only the first product type (LANDSAT_C2L1) matches the second parameter (platformSerialIdentifier=\"L1\"); all other product types only match the first criterion. Therefore, it is usually best to use the first product type in the list, as it will be the one that fits best."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The product types in the result are ordered by how well they match the criteria. In the example above only the first product type (LANDSAT_C2L1) matches the second parameter (platformSerialIdentifier=\"L1\"), all other product types only match the first criterion. Therefore, it is usually best to use the first product type in the list as it will be the one that fits best."
"Per-parameter guesses are joined using a `UNION` by default (`intersect=False`). This can be changed to an intersection with `intersect=True`:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['LANDSAT_C2L1']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dag.guess_product_type(platform=\"LANDSAT\", platformSerialIdentifier=\"L1\", intersect=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A *free text search* written in the [Whoosh query language](https://whoosh.readthedocs.io/en/latest/querylang.html#the-default-query-language) can also be passed to the method; it is used to search the `title`, `abstract` and `keywords` fields:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['ERA5_SL_MONTHLY',\n",
" 'ERA5_PL_MONTHLY',\n",
" 'ERA5_LAND_MONTHLY',\n",
" 'ERA5_SL',\n",
" 'ERA5_PL',\n",
" 'GLOFAS_SEASONAL_REFORECAST',\n",
" 'SEASONAL_MONTHLY_PL',\n",
" 'SEASONAL_MONTHLY_SL']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dag.guess_product_type(\"ECMWF AND MONTHLY\")"
]
},
{
Expand Down
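Note on the wildcard operators described in the notebook text above: the difference between `*` (any number of characters) and `?` (exactly one character) can be illustrated without an index. This is only a sketch that uses Python's standard `fnmatch` module as a stand-in; it is not how whoosh evaluates queries, and the `serial_ids` list is made-up sample data.

```python
from fnmatch import fnmatchcase

# Hypothetical sample values, not taken from the eodag product types index
serial_ids = ["L1", "L8", "L57", "S1A"]

# "?" stands for exactly one character: "L" followed by a single character
one_char = [s for s in serial_ids if fnmatchcase(s, "L?")]

# "*" stands for any number of characters: everything starting with "L"
any_chars = [s for s in serial_ids if fnmatchcase(s, "L*")]

print(one_char)
print(any_chars)
```

With this sample data, `L57` is only matched by the `*` pattern because it has two characters after the `L`.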
48 changes: 41 additions & 7 deletions eodag/api/core.py
Expand Up @@ -289,15 +289,15 @@ def build_index(self) -> None:
product_types_schema = Schema(
ID=fields.STORED,
alias=fields.ID,
abstract=fields.STORED,
abstract=fields.TEXT,
instrument=fields.IDLIST,
platform=fields.ID,
platformSerialIdentifier=fields.IDLIST,
processingLevel=fields.ID,
sensorType=fields.ID,
md5=fields.ID,
license=fields.ID,
title=fields.ID,
title=fields.TEXT,
missionStartDate=fields.ID,
missionEndDate=fields.ID,
keywords=fields.KEYWORD(analyzer=kw_analyzer),
Expand Down Expand Up @@ -914,16 +914,32 @@ def get_alias_from_product_type(self, product_type: str) -> str:

return self.product_types_config[product_type].get("alias", product_type)

def guess_product_type(self, **kwargs: Any) -> List[str]:
"""Find eodag product types codes that best match a set of search params
def guess_product_type(
self,
free_text_filter: Optional[str] = None,
intersect: bool = False,
**kwargs: Any,
) -> List[str]:
"""Find eodag product type ids that best match a set of search params

See https://whoosh.readthedocs.io/en/latest/querylang.html#the-default-query-language
for syntax.

:param free_text_filter: whoosh compatible free text search filter used to search
`title`, `abstract` and `keywords`
:type free_text_filter: Optional[str]
:param intersect: join results for each parameter using INTERSECT instead of UNION
:type intersect: bool
:param kwargs: A set of search parameters as keyword arguments
:returns: The best match for the given parameters
:rtype: list[str]
:raises: :class:`~eodag.utils.exceptions.NoMatchingProductType`
"""
if kwargs.get("productType", None):
return [kwargs["productType"]]
free_text_search_params = (
["title", "abstract", "keywords"] if free_text_filter else []
)
supported_params = {
param
for param in (
Expand All @@ -934,26 +950,44 @@ def guess_product_type(self, **kwargs: Any) -> List[str]:
"sensorType",
"keywords",
"md5",
"abstract",
"title",
)
if kwargs.get(param, None) is not None
}
if not self._product_types_index:
raise EodagError("Missing product types index")
with self._product_types_index.searcher() as searcher:
results = None
# For each search key, do a guess and then upgrade the result (i.e. when
# merging results, if a hit appears in both results, its position is raised
# to the top. This way, the top most result will be the hit that best
# Using `upgrade_and_extend`, for each search key, do a guess and
# then upgrade the result (i.e. when merging results,
# if a hit appears in both results, its position is raised
# to the top). This way, the top most result will be the hit that best
# matches the given queries. Put another way, this best guess is the one
# that crosses the highest number of search params from the given queries

# Always use UNION to join free_text_search results
for search_key in free_text_search_params:
query = QueryParser(search_key, self._product_types_index.schema).parse(
free_text_filter
)
if results is None:
results = searcher.search(query, limit=None)
else:
results.upgrade_and_extend(searcher.search(query, limit=None))

# join results from kwargs using UNION or INTERSECT
for search_key in supported_params:
query = QueryParser(search_key, self._product_types_index.schema).parse(
kwargs[search_key]
)
if results is None:
results = searcher.search(query, limit=None)
elif intersect:
results.filter(searcher.search(query, limit=None))
else:
results.upgrade_and_extend(searcher.search(query, limit=None))

guesses: List[str] = [r["ID"] for r in results or []]
if guesses:
return guesses
Expand Down
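Note on the merge logic in `guess_product_type`: the effect of joining per-parameter results with UNION or INTERSECT can be sketched in plain Python. This is an illustration only, assuming each parameter yields an ordered list of ids; `merge_guesses` is a hypothetical helper that mimics the documented behaviour of whoosh's `Results.upgrade_and_extend` and `Results.filter`, it is not part of eodag or whoosh.

```python
from typing import Dict, List, Optional


def merge_guesses(
    per_param_hits: Dict[str, List[str]], intersect: bool = False
) -> List[str]:
    """Merge per-parameter hit lists (hypothetical helper for illustration)."""
    results: Optional[List[str]] = None
    for hits in per_param_hits.values():
        if results is None:
            # First parameter: take its hits as the starting point
            results = list(hits)
        elif intersect:
            # INTERSECT: drop ids that this parameter did not match
            keep = set(hits)
            results = [pt for pt in results if pt in keep]
        else:
            # UNION: ids matched by both move to the top, then the remaining
            # previous ids, then the ids that only this parameter matched
            new = set(hits)
            old = set(results)
            in_both = [pt for pt in results if pt in new]
            only_old = [pt for pt in results if pt not in new]
            only_new = [pt for pt in hits if pt not in old]
            results = in_both + only_old + only_new
    return results or []


sample_hits = {
    "platform": ["L57_REFLECTANCE", "LANDSAT_C2L1", "LANDSAT_C2L2"],
    "platformSerialIdentifier": ["LANDSAT_C2L1"],
}
print(merge_guesses(sample_hits))
print(merge_guesses(sample_hits, intersect=True))
```

In the union case the id matched by both parameters ends up first, which is why the notebook recommends taking the first guess; in the intersect case only that id remains.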
1 change: 1 addition & 0 deletions eodag/resources/stac.yml
Expand Up @@ -62,6 +62,7 @@ conformance:
- https://api.stacspec.org/v1.0.0/ogcapi-features#query
- https://api.stacspec.org/v1.0.0/ogcapi-features#sort
- https://api.stacspec.org/v1.0.0/collections
- https://api.stacspec.org/v1.0.0/collection-search#free-text
- http://www.opengis.net/spec/ogcapi-features-1/1.0/conf/core
- http://www.opengis.net/spec/ogcapi-features-1/1.0/conf/oas30
- http://www.opengis.net/spec/ogcapi-features-1/1.0/conf/geojson
Expand Down
9 changes: 9 additions & 0 deletions eodag/resources/stac_api.yml
Expand Up @@ -174,9 +174,12 @@ paths:
operationId: getCollections
parameters:
- $ref: '#/components/parameters/provider'
- $ref: '#/components/parameters/q'
responses:
'200':
$ref: '#/components/responses/Collections'
'202':
$ref: '#/components/responses/Accepted'
'500':
$ref: '#/components/responses/ServerError'
/collections/{collectionId}:
Expand Down Expand Up @@ -1913,6 +1916,12 @@ components:
text/html:
schema:
type: string
Accepted:
description: The request has been accepted, but the data is not yet ready. Please wait a few minutes before trying again.
content:
application/json:
schema:
$ref: '#/components/schemas/exception'
Collections:
description: >-
The feature collections shared by this API.
Expand Down
21 changes: 20 additions & 1 deletion eodag/rest/stac.py
Expand Up @@ -637,10 +637,29 @@ def __get_product_types(
"""
if filters is None:
filters = {}
free_text_filter = filters.pop("q", None)

# product types matching filters
try:
guessed_product_types = self.eodag_api.guess_product_type(**filters)
guessed_product_types = (
self.eodag_api.guess_product_type(**filters) if filters else []
)
except NoMatchingProductType:
guessed_product_types = []

# product types matching free text filter
if free_text_filter and not guessed_product_types:
whooshable_filter = " OR ".join(
[f"({x})" for x in free_text_filter.split(",")]
)
try:
guessed_product_types = self.eodag_api.guess_product_type(
whooshable_filter
)
except NoMatchingProductType:
guessed_product_types = []

# list product types with all metadata using guessed ids
if guessed_product_types:
product_types = [
pt
Expand Down
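Note on the `q` handling in `__get_product_types`: the comma-separated value is turned into a whoosh `OR` query by wrapping each term in brackets. A minimal standalone sketch of that transformation follows; `to_whoosh_or_query` is a hypothetical name used here for illustration, the diff itself builds the string inline.

```python
def to_whoosh_or_query(q: str) -> str:
    """Join comma-separated terms into a bracketed OR query (illustrative)."""
    # Each term is wrapped in brackets so that operators inside a term
    # (e.g. "foo AND bar") stay grouped when the terms are OR-ed together.
    # As in the diff, whitespace around the commas is not stripped.
    return " OR ".join(f"({term})" for term in q.split(","))


print(to_whoosh_or_query("foo,bar"))
print(to_whoosh_or_query("foo AND bar,baz"))
```

The bracketing is the relevant design choice: without it, a term that already contains `AND` could combine with the neighbouring `OR` in an unintended way.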
59 changes: 59 additions & 0 deletions tests/resources/ext_product_types_free_text_search.json
@@ -0,0 +1,59 @@
{
"astraea_eod": {
"providers_config": {
"foo": {
"productType": "foo",
"metadata_mapping": {
"cloudCover": "$.null"
}
},
"bar": {
"productType": "bar",
"metadata_mapping": {
"cloudCover": "$.null"
}
},
"foobar": {
"productType": "foobar",
"metadata_mapping": {
"cloudCover": "$.null"
}
}
},
"product_types_config": {
"foo": {
"abstract": "abstractFOO - This is FOO. FooAndBar",
"instrument": "Not Available",
"platform": "Not Available",
"platformSerialIdentifier": "Not Available",
"processingLevel": "Not Available",
"keywords": "suspendisse",
"license": "WTFPL",
"title": "titleFOO - Lorem FOO collection",
"missionStartDate": "2012-12-12T00:00:00.000Z"
},
"bar": {
"abstract": "abstractBAR - This is BAR",
"instrument": "Not Available",
"platform": "Not Available",
"platformSerialIdentifier": "Not Available",
"processingLevel": "Not Available",
"keywords": "lectus,lectus_bar_key",
"license": "WTFPL",
"title": "titleBAR - Lorem BAR collection (FooAndBar)",
"missionStartDate": "2012-12-12T00:00:00.000Z"
},
"foobar": {
"abstract": "abstract FOOBAR - This is FOOBAR",
"instrument": "Not Available",
"platform": "Not Available",
"platformSerialIdentifier": "Not Available",
"processingLevel": "Not Available",
"keywords": "tortor",
"license": "WTFPL",
"title": "titleFOOBAR - Lorem FOOBAR collection",
"missionStartDate": "2012-12-12T00:00:00.000Z"
}
}
}
}