Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Technical evaluation of the proposed source identifiers #31

Open
chad-earthscope opened this issue Oct 8, 2020 · 18 comments
Open

Technical evaluation of the proposed source identifiers #31

chad-earthscope opened this issue Oct 8, 2020 · 18 comments

Comments

@chad-earthscope
Copy link
Member

Pre-release versions of libmseed implement functions to convert the originally proposed identifiers to and from individual SEED codes. Furthermore, at the IRIS DMC, we have implemented database functions to convert between these identifiers and SEED codes.

To further evaluate the technical feasibility and practicalities of these identifiers, I have written a Python module that validates the identifiers and converts to and from SEED codes. The module includes a "main" for use as a standalone converter, and is included in the repository: fdsn_source_identifiers.py

For all of these software packages, mapping of details between the two standards follows the description in the specification:
http://docs.fdsn.org/projects/source-identifiers/en/v1.0/definition.html#mapping-of-seed-2-4-codes

With this experience, I'm quite confident in the feasibility of the proposed identifiers. They are usable and a good path exists between them and SEED codes.

@crotwell, @WayneCrawford, and @flofux I encourage you to try the converter to see if this matches your expectations.

@crotwell
Copy link

crotwell commented Oct 9, 2020

Hi

Three issues I see immediately. First is minor, but I think you want to decorate the from_nslc method with @staticmethod to avoid it accidentally being called as an instance method, where network would effectively become self. See Python Docs.

Second, is it looks like the regular expressions allow lower case letters, but the new spec only allows upper case letters. I know SEED technically allows lower case, but we probably do not want to have a disagreement between the example code and the new spec. They should be consistent, either both allowing lower case or both forbidding it I think.

python3 fdsn_source_identifiers.py xx abc 00 bhz
Input Network: 'xx', Station: 'abc', Location: '00', Channel: 'bhz'
=> SourceID: 'FDSN:xx_abc_00_b_h_z'

but

Each of these codes must be composed of the following ASCII character sets:

Uppercase [A-Z], ASCII 65 through 90

Numeric [0-9], ASCII 48 through 57

The station and location codes may further be composed of the following ASCII character:

Dash “-”, ASCII 45

Third, it doesn't check for length, so the mapping is wrong in cases where the mapping should fail. For example an invalid SEED code like this fails (correctly):

python3 fdsn_source_identifiers.py XX ABCD 01 BHZZ       
Invalid channel code:'BHZZ'

but this passes even though it is wrong in the same way, just location instead of channel length:

python3 fdsn_source_identifiers.py XX ABCD 0123456789 BHZ
Input Network: 'XX', Station: 'ABCD', Location: '0123456789', Channel: 'BHZ'
=> SourceID: 'FDSN:XX_ABCD_0123456789_B_H_Z'

and of course this location code is illegal even in the new spec. Would be good in this converter checked that input NSLC were valid SEED lengths (2,5,2,3) and on the reverse mapping failed for valid source ids that are too big to map to SEED style NSLC. It should also check the lengths of the N S L B S sS elements of the source id, 8,8,8,...

Actually, looking at the spec, are their length restrictions on the band, source and subsource codes? I know the intent is for source codes to be FDSN issued, but subsource codes are mostly up to the user. Regardless a length limit would probably be really helpful to software. Did we miss that or am I just not seeing it?

@crotwell
Copy link

crotwell commented Oct 9, 2020

Also be nice if the temp network convention was part of the converter.

fdsn_source_identifiers.py FDSN:XA2020_ABC_00_B_H_Z
Input SourceID: 'FDSN:XA2020_ABC_00_B_H_Z'
=> Network: 'XA2020'Station: 'ABC'Location: '00'Channel: 'BHZ'

could do this instead:

fdsn_source_identifiers.py FDSN:XA2020_ABC_00_B_H_Z
Input SourceID: 'FDSN:XA2020_ABC_00_B_H_Z'
=> Network: 'XA'Station: 'ABC'Location: '00'Channel: 'BHZ'

@chad-earthscope
Copy link
Member Author

Thanks @crotwell. The code started out as a mapping exercise with a little sprinkling of validation. I was unsure how much validation to add, but I agree that this is a good place to be strict. Essentially it will reduce to the strictest combination of SEED and SourceID rules.

Note: there is a rather important class of usage where decomposing the SourceID into components that are SEED-like but do not conform to SEED rules. This is needed to store the longer-than-SEED codes in databases and other systems/formats; it will be the norm for a long time.

First is minor, but I think you want to decorate the from_nslc method with @staticmethod

Fixed.

Second, is it looks like the regular expressions allow lower case letters

Fixed.

Third, it doesn't check for length

Fixed.

Actually, looking at the spec, are their length restrictions on the band, source and subsource codes? I know the intent is for source codes to be FDSN issued, but subsource codes are mostly up to the user. Regardless a length limit would probably be really helpful to software. Did we miss that or am I just not seeing it?

There are no length restrictions beyond using defined defined codes with fixed lengths. I disagree that subsource are mostly up to the user, and I doubt folks would come up with huge values because it made sense for their case. In libmseed the entire SourceID is capped at 64 bytes, seems a rational approach for any software.

@crotwell
Copy link

crotwell commented Oct 9, 2020

Note: there is a rather important class of usage where decomposing the SourceID into components that are SEED-like but do not conform to SEED rules. This is needed to store the longer-than-SEED codes in databases and other systems/formats; it will be the norm for a long time.

Can you clarify the purpose of the converter? Is it an example of how sourceIds work in code, to show how to convert valid NSLC to/from valid SourceIds or to provide a tool to store sourceIds in a database that was set up to handle separate NSLC as 4 string fields?

The first two seem plausible uses as a "technical evaluation", but the last seems too specific to whatever system will use it.

I guess my concern is that this current code shows a "mapping from sourceId to NSLC" that isn't in line with the spec, which says if the lengths are not ok, then there is no mapping.

python3 fdsn_source_identifiers.py FDSN:XA2020_ABCDEFG_12345678_BB_HH_ZZ 
Input SourceID: 'FDSN:XA2020_ABCDEFG_12345678_BB_HH_ZZ'
=> Network: 'XA2020' Station: 'ABCDEFG' Location: '12345678' Channel: 'BB_HH_ZZ'

which seems to imply that there is a mapping from these long codes to NSLC.

I should say this code makes total sense in terms of IRIS storing things or searching for things in its database, but less sure that it makes sense in terms of a technical example of source ids in code in practice and may generate the idea that there are mappings that are not valid.

I tossed together some javascript and a web page to do conversions with I think more strict checking. Perhaps it is useful in addition?

http://www.seis.sc.edu/~crotwell/fdsn_source-identifiers/converter/

@crotwell
Copy link

crotwell commented Oct 9, 2020

Pull request here: #32

@chad-earthscope
Copy link
Member Author

Can you clarify the purpose of the converter? Is it an example of how sourceIds work in code, to show how to convert valid NSLC to/from valid SourceIds or to provide a tool to store sourceIds in a database that was set up to handle separate NSLC as 4 string fields?

It was for technical evaluation of the specification, with the hope that it would be useful for generic decomposition into individual codes and back.

But this did not allow strict adherence to SEED limits and, as you pointed out, may leave the wrong impression of validity for all cases. So, I've re-arranged the code to focus on SourceID to/from SEED conversion and enforce SEED limitations.

python3 fdsn_source_identifiers.py FDSN:XA2020_ABCDEFG_12345678_BB_HH_ZZ
Input SourceID: 'FDSN:XA2020_ABCDEFG_12345678_BB_HH_ZZ'
=> Network: 'XA2020' Station: 'ABCDEFG' Location: '12345678' Channel: 'BB_HH_ZZ'

that now produces:

fdsn_source_identifiers.py FDSN:XA2020_ABCDEFG_12345678_BB_HH_ZZ
Invalid SEED network code:'XA2020'
Invalid SEED station code:'ABCDEFG'
Invalid SEED location code:'12345678'
Invalid SEED channel code:'BB_HH_ZZ'

I guess my concern is that this current code shows a "mapping from sourceId to NSLC" that isn't in line with the spec, which says if the lengths are not ok, then there is no mapping.

But it is, depending on what you think NSLC means. This specification redefines all assumptions of what "NSLC" means, and we should be very careful to clarify when it is "SEED NSLC" versus the expanded codes.

I should say this code makes total sense in terms of IRIS storing things or searching for things in its database, but less sure that it makes sense in terms of a technical example of source ids in code in practice and may generate the idea that there are mappings that are not valid.

Another example: the FDSN web service request formats/parameters require network, station, location, and channel fields. There will be oodles of places like this as nearly all existing codes deal with the NSLQ-quartet as individual codes at some level. The mapping between Source IDs and its constituent codes will be happening for a long time and have nothing to do with the SEED specification directly.

@crotwell
Copy link

But it is, depending on what you think NSLC means. This specification redefines all assumptions of what "NSLC" means, and we should be very careful to clarify when it is "SEED NSLC" versus the expanded codes.

Good point, when I hear NSLC I think SEED, but that is maybe no longer true. But if I am confused then there are lots of others that will be too, so being extra clear is worth while.

Does the spec need to address this mapping to "generic" NSLC in addition to SEED NSLC? While the NSL parts are clear, I suppose there is some ambiguity in how to make a single string "channel" from the band, source, subsource. Basically should the spec say something along the lines of:

When using source ids in systems that expect 4 separate strings, but SEED lengths do not need to be maintained, the network, station and location codes should be used unmodified. The band, source and subsource codes should be mapped to a single string keeping the underscore separator, for example 'B_ABC_Z`?

There is I guess some choice in how to map 'B_H_Z' as it could be 'BHZ' to correspond to SEED style, but perhaps also 'B_H_Z' to be consistent with longer source codes. If both are allowed, then any "equals" operation has to check two cases, which can be inefficient and lead to potential bugs. Maybe this doesn't matter too much, but would be good to be proactive to ensure as much consistency as possible.

@chad-earthscope
Copy link
Member Author

Does the spec need to address this mapping to "generic" NSLC in addition to SEED NSLC?

I vote no, mostly because there is no one clear way:

There is I guess some choice in how to map 'B_H_Z' as it could be 'BHZ' to correspond to SEED style, but perhaps also 'B_H_Z' to be consistent with longer source codes.

I think that's the problem, different implementations/uses will have different needs.

What is clear is that BHZ is not a valid channel designation in a Source Identifier, there can be no ambiguity in the specification.

As you say, the channel is where it gets potentially confusing. I think it will be very common to collapse the extended B_H_Z in many systems, but others may leave it expanded. A likely scenario is that FDSN request formats will be implemented to accept either BHZ and B_H_Z and treat them equivalently.

I've tried to address that in the expanded "Note" at the top of the channel page: http://docs.fdsn.org/projects/source-identifiers/en/v1.0/channel-codes.html

Let me know what you think.

@crotwell
Copy link

So, I guess I am still feeling that this may add more confusion. It seems like there are 3 concepts floating around:

  1. old style SEED NSLC
  2. new style Source Ids
  3. Hybrid non-SEED NSLC that split Source Ids into 4 substrings in order to use in NSLC systems that have been modified to accept longer NSLC strings but not modified to use source ids directly.

The spec is about 2, and includes a mapping to/from 1 which is needed as it is the current default. All makes sense to me. You also say mapping to 3 should not be in the spec, I agree.

But the new Note and the python example seem to be mostly about 3. Just to be clear, there is nothing wrong with 3, it probably will be a common way to use things for some time to come, but I am not sure it belongs in the source id spec as it creates the idea that there is a cannonical way to break a single string source id into 4 substrings that look like NSLC string, just longer and different. This mapping makes total sense in a note on how to use source ids in existing fdsn web services, or in documentation on how to use them in stationxml, but maybe this mapping idea would be better given there instead of here so it doesn't muddy the waters.

Another concern is that the Note is kind of duplicating the SEED mapping idea from the earlier section. I feel like all stuff about how to map to/from should live in just one section.

My $0.02 is that you are promoting using a single string channel id, which is a really good idea. Don't weaken the argument by sort of kind of introducing a variation of the single string that is split into 4 strings. My advice would be to not say anything about idea 3 in the spec at all, and have the example code illustrate the strict SEED NSLC <-> SourceId mapping, but not go beyond that. The spec will be stronger for being more limited in what it does.

@chad-earthscope
Copy link
Member Author

My $0.02 is that you are promoting using a single string channel id, which is a really good idea. Don't weaken the argument by sort of kind of introducing a variation of the single string that is split into 4 strings. My advice would be to not say anything about idea 3 in the spec at all, and have the example code illustrate the strict SEED NSLC <-> SourceId mapping, but not go beyond that. The spec will be stronger for being more limited in what it does.

Agreed. I've removed the comment in 2aa0507

@crotwell
Copy link

Thoughts on javascript example/pull request?

@chad-earthscope
Copy link
Member Author

Thoughts on javascript example/pull request?

I added comments to the PR. Seems incomplete. I'm not opposed to adding it, but I'm curious what you consider the value is beyond another try at converting?

Also, is this holding up the technical review addition to the report?

@crotwell
Copy link

Don't have to add it, I was just thinking a straight converter web page would be useful. Making it available somewhere so people can play might be a easier than having to run python locally. I really don't care that much about it, unless having multiple languages represented as examples is useful?

@crotwell
Copy link

Thinking some more, I would suggest you reject this pull request. I can release this separately and it doesn't need to be part of the fdsn repo.

@WayneCrawford
Copy link

Hi all,

I did a very basic check of the code, much less complete than Philip. The only question I have is that when I enter:

fdsn_source_identifiers.py` `FDSN:4G2000_LSVCO_00_B_H_3

I get

Invalid SEED network code:'4G2000'

Whereas I think Philip had suggested that it give

Invalid SEED network code:'4G2000'
Input Network: '4G' Station: 'LSVCO' Location: '00' Channel: 'BH3'

Is this too unspecific to code, or just not desirable?

Cheers
Wayne

@chad-earthscope
Copy link
Member Author

Whereas I think Philip had suggested that it give

Invalid SEED network code:'4G2000'
Input Network: '4G' Station: 'LSVCO' Location: '00' Channel: 'BH3'
Is this too unspecific to code, or just not desirable?

My nit is that this mapping is not universally true and adding it to the converter may imply that it is. While the FDSN will reserve the codes it's up to data centers if they will accept the codes equally. Second nit is that it's irreversible, one-way mapping. But I agree that its useful for illustrate as long as it's clear the mapping is being employed, so now the converter produces:

% fdsn_source_identifiers.py FDSN:4G2000_LSVCO_00_B_H_3
Invalid SEED network code:'4G2000', but mappable via transitional convention
Input SourceID: 'FDSN:4G2000_LSVCO_00_B_H_3'
=> Network: '4G' Station: 'LSVCO' Location: '00' Channel: 'BH3'

Now that I re-read that section of the spec. I note we've left too much vagueness in this:

<2-character code><4-digit start year>

which needs to be qualified that the 2-char code needs to start with XYZ0-9 explicitly. Obviously we cannot have CO2020 registered, which could map to CO. So I've added:

where the initial character is a digit (0-9) or the letters X, Y or Z.

Let me know if you disagree.

@crotwell
Copy link

+1 on your code and on the change to the spec.

@WayneCrawford
Copy link

Sounds good to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants