Requirement: A new method of identifying a time series #4
Comments
I am strongly in favor of defining a single time series identifier for two main reasons:
While this would set the stage for having a complete new identifier scheme, we have a large volume of miniSEED 2.x that uses the old identifiers that must be accommodated. To this end, IRIS proposed a solution during the previous technical discussion, which has been modified slightly since then, to define an identifier constructed as a Uniform Resource Name (URN) with the following pattern:
where the network, station and channel codes are required to be non-empty and the location code may be empty. The 3 underscore (ASCII 95) delimiters must always be present. Example identifiers: The "FDSN:" namespace would identify this combination of SEED codes. Alternate schemes could be defined or adopted in the future. For reference, the current working draft of this proposal is attached: There are a few areas that are known to still need work:
|
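The rules stated above (a namespace prefix, three mandatory underscores, non-empty network, station and channel codes, an optionally empty location code) can be sketched as a small parser. This is only an illustration, not the working draft's definition: the field order network_station_location_channel is an assumption taken from the `CH_DAVOX__HHZ` value quoted later in this thread, and the names `SourceId` and `parse_source_id` are hypothetical.

```python
from typing import NamedTuple


class SourceId(NamedTuple):
    """Hypothetical container for the parts of a single identifier."""
    namespace: str
    network: str
    station: str
    location: str
    channel: str


def parse_source_id(identifier: str) -> SourceId:
    """Split an 'FDSN:'-style identifier into its codes.

    Assumes the field order network_station_location_channel.
    """
    namespace, sep, codes = identifier.partition(":")
    if not sep or namespace != "FDSN":
        raise ValueError("identifier must start with the 'FDSN:' namespace")

    # The 3 underscore delimiters must always be present, so a split
    # must yield exactly 4 fields (the location field may be empty).
    parts = codes.split("_")
    if len(parts) != 4:
        raise ValueError("expected exactly 3 underscore delimiters")

    network, station, location, channel = parts
    if not (network and station and channel):
        raise ValueError("network, station and channel must be non-empty")
    return SourceId(namespace, network, station, location, channel)
```

For example, `parse_source_id("FDSN:CH_DAVOX__HHZ")` yields an empty location code, while `"FDSN:CH_DAVOX_HHZ"` is rejected because one delimiter is missing.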
I am also in favor of the single identifier. It has many advantages, but also a couple of disadvantages worth noting. An advantage of a single identifier is that what is likely the most common operation on miniseed data, matching records with channels, becomes a single string comparison instead of the current 4. The truncation of null bytes or searching for the '~' step is also reduced from 4 to 1, perhaps making extraction quicker. But a single identifier will use additional bytes. The IRIS proposal, for example, has effectively 8 extra bytes compared with existing miniseed2. Also, extraction of the network or station code becomes a more expensive string splitting operation. I think the tradeoff is acceptable, but the disadvantages are worth noting. Instead of storing "FDSN:", perhaps we could define that a stored identifier starting with ':' implies 'FDSN:', which would save 4 bytes. All other, non-FDSN namespaces would have to be fully specified. |
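The byte accounting in this comment can be checked with a short sketch. The miniSEED 2 field widths below are the fixed-width codes of the SEED 2.4 data header (station 5, location 2, channel 3, network 2); the function names and the `expand_shorthand` helper are hypothetical, and the ':' shorthand is only the suggestion made above, not an agreed rule.

```python
# Fixed-width identifier fields in a miniSEED 2 data header, in bytes.
MSEED2_WIDTHS = {"station": 5, "location": 2, "channel": 3, "network": 2}


def mseed2_id_bytes() -> int:
    """Bytes used by the 4 separate codes in miniSEED 2."""
    return sum(MSEED2_WIDTHS.values())


def single_id_max_bytes(prefix: str = "FDSN:") -> int:
    """Worst-case bytes for a single identifier with the same code widths:
    the namespace prefix, 3 underscore delimiters, and the codes."""
    return len(prefix) + 3 + mseed2_id_bytes()


def expand_shorthand(stored: str) -> str:
    """Expand the suggested shorthand: a leading ':' implies 'FDSN:'."""
    return "FDSN" + stored if stored.startswith(":") else stored


def matches(record_id: str, wanted_id: str) -> bool:
    """Matching a record to a channel is one string comparison."""
    return expand_shorthand(record_id) == expand_shorthand(wanted_id)


overhead_full = single_id_max_bytes("FDSN:") - mseed2_id_bytes()   # 8
overhead_short = single_id_max_bytes(":") - mseed2_id_bytes()      # 4
```

With the full prefix the overhead is the 5 bytes of "FDSN:" plus 3 delimiters; the ':' shorthand removes 4 of those bytes. Identifiers with short or empty codes can be smaller than the worst case computed here.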
@chad-iris
is there going to be an opportunity to comment on this working draft? If the consensus is to move to a single identifier that simply extends SNCL, the actual proposed extension requires significant discussion [1]. Does it fit into this stage of the discussion, or when would you see this taking place? John [1] e.g. Generally I agree with what is being proposed in the working draft, though I think the channel code could be extended beyond 4 characters to add some additional information about synthetic / processed data without having to scamper off to an alternative URL. For example, it would be useful to add a new data type code to indicate whether data is raw / processed or synthetic (default is raw). Then e.g. a synthetic BHN stream can be identified as BHN-X or a strong motion channel converted to acceleration can be identified as HGZ-Y. At the moment, using X as the band code loses all this information for processed / synthetic streams. |
Yes, absolutely. From the NGF perspective, the important part is whether we agree to this kind of identifier. If we collectively agree, then the definition of NGF can move forward (imposing, perhaps, only a maximum identifier length of 255) while the discussion of what form the extensions take is split off into a separate conversation. I suggest one of these options:
Those are in my order of preference. I volunteer to create another GitHub project for FDSN identifiers and create issues to discuss form and expansion and rules for each of the 4 codes (network, station, location, and channel) if there is agreement to do this. Even if the consensus is to keep 4 fields for each of the 4 codes, we need to discuss how to expand them. Expanding the codes is a very important topic; of all the changes we are discussing, it is the one that will affect end users the most, in my opinion. It merits a separate conversation that is not muddled with the rest of the NGF details. |
I agree it's very important, and it also couples into some of the discussions in other conversations, e.g. if we agree that identification of processed data is part of the new naming convention, then we can agree and close #10 , so we need to begin discussing it soon. I propose we move forward on this with Chad's first suggestion - it's going to get too messy to fold this entire topic into the single issue here. |
Can we keep it here instead of a second github? Add as many issues as you think you need, but following similar discussions arbitrarily split into 2 repositories makes it harder to follow I feel. |
I think that this is a key topic and I do think that the 4 key fields that correspond to fields in miniSeed2 are still the correct ones. Since there is flexibility in how many actual characters can be used for each field in general this could result in space savings even with the added field separators. Also I do not think that the size of the combined identifier is that important and would not make that an issue. Life today would be simpler if the original miniSeed had not been so stingy. As time passes, lengths, bytes, and such things that relate to the size become less and less important. In general I am in favor of the time series construct proposed by Chad above |
I think there should be some discussion related to the Channel field, since it is really trying to specify three different attributes of a channel in a single field. Would it make sense to break out the current three attributes into separate fields: band code, instrument code, and orientation? That would give greater flexibility than keeping them together as one. Users could still specify things such as BHZ, but the interfaces would map those into B_H_Z, for instance, for query processing. Users might not be impacted, but data generators could have greater flexibility and capability. |
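The mapping suggested here, where a user still types BHZ and an interface expands it to B_H_Z, could look roughly like the sketch below. It assumes the classic 3-character SEED channel code with one character each for band, instrument and orientation; the function names are hypothetical.

```python
from typing import Tuple


def split_channel(channel: str) -> Tuple[str, str, str]:
    """Split a 3-character channel code into
    (band, instrument, orientation)."""
    if len(channel) != 3:
        raise ValueError("expected a 3-character channel code, e.g. 'BHZ'")
    band, instrument, orientation = channel
    return band, instrument, orientation


def to_delimited(channel: str) -> str:
    """Map a user-facing code such as 'BHZ' to the separated
    form 'B_H_Z' for query processing."""
    return "_".join(split_channel(channel))
```

Here `to_delimited("BHZ")` gives `"B_H_Z"`. Separate fields would let each attribute grow beyond one character without changing what users type for existing codes.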
I read we all agree that
Now, if we define "our" FDSN URIs as FDSN:_, in order to correct for this, I would propose to allow more noise on the FDSN-style URIs, e.g. by
e.g.: "FDSN:" - fixed |
"FDSN:ch.ethz.sed/streams?sncl=CH_DAVOX__HHZ&version=2" is 53 bytes, which is bigger than the entire header for many miniseed2 records. That seems a bit excessive. While more structure and flexibility in the identifier is a good thing, it has to be weighed against the cost of the overhead, especially since it will be repeated in every single NGF record. |
@crotwell in a legacy environment,
|
Summary(Please let me know if I missed a point or misunderstood something) There seems to be consensus to using a single time series identifier in the approximate form of Please vote on the following issues:
|
Yes.
Yes. This is critical for providing future ability to create other identifiers. I do not believe all FDSN identifiers need to go under FDSN: namespace, the FDSN can create other name spaces.
The ASCII subset used for SEED 2.4 plus a few extra characters already proposed. The transition to a URN-style identifier with a name space puts us on the path for creating new identifiers in the future that support broader encodings, up to full UTF-8, but there are a lot of changes to systems and implications for usability if we did that now. |
This is to some degree independent of the format. Each namespace (depending on how we do it) could still allow only a subset of what the format itself can store. But if we choose anything "less" than a UTF variant we limit the format, and moving to a UTF encoding would require a new revision of the core data format. |
I would think we could define a new identifier type and use it with the same core format just like we can define a new encoding type and not change the core format. For example, a "FDSN-U8:" namespace could be created in the future to have some kind of identifiers that allow UTF-8, the format does not need to change. Just like with encodings, it's the readers that need to support those new variations. |
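The idea that readers, not the core format, carry the support for new identifier types can be sketched as a registry of per-namespace checks. Everything here is illustrative: the registry, the function names, and the rule that "FDSN:" bodies are plain ASCII are assumptions (the thread only says an ASCII subset), and "FDSN-U8" is the hypothetical future namespace mentioned above.

```python
from typing import Callable, Dict


def _is_ascii(body: str) -> bool:
    """Assumed rule for 'FDSN:' identifiers: ASCII characters only."""
    return body.isascii()


def _any_text(body: str) -> bool:
    """Assumed rule for a UTF-8 namespace: accept any decoded text."""
    return True


# Namespaces this particular reader understands today.
VALIDATORS: Dict[str, Callable[[str], bool]] = {"FDSN": _is_ascii}


def check_identifier(
    identifier: str,
    validators: Dict[str, Callable[[str], bool]] = VALIDATORS,
) -> bool:
    """Validate an identifier using the rule registered for its namespace."""
    namespace, sep, body = identifier.partition(":")
    if not sep or namespace not in validators:
        raise ValueError("unsupported namespace: %r" % namespace)
    return validators[namespace](body)


# A newer reader adds a namespace without any change to the record layout.
NEWER_VALIDATORS = dict(VALIDATORS, **{"FDSN-U8": _any_text})
```

An older reader raises on `"FDSN-U8:..."` identifiers, while a reader using `NEWER_VALIDATORS` accepts them. The namespace prefix itself stays ASCII in both cases, which is the constraint raised in the following reply.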
This works as long as the namespace itself is limited to some defined text encoding. But this is likely only an academic and not a practical problem. |
1 - YES |
1 yes |
|
Yes
Yes
255
ASCII, since the identifier will have fixed length and accents are mostly a problem in IDs. But, what happens for non-latin alphabets? |
|
|
|
A new method of identifying a time series is required for NGF: It should be adequate to meet the need to deploy multiple sensors, retain semantic meaning where possible and support a significant increase in the number of sensors deployed as a single project.