Purpose of CSIP structMap #426

Closed
koit opened this issue May 1, 2019 · 4 comments
Labels: enhancement, Proof Read, Solved?, v2-structMap

Comments

@koit
Contributor

koit commented May 1, 2019

The explanation of the purpose of structMap in mets.xsd and METSPrimer.pdf is clear and METSPrimer has some good examples of its use (see pages 62 and 65). The purpose of CSIP structMap is less clear and this makes it hard to contextualise the 33 requirements (and thus, to create a valid IP).

The intro text of 5.3.6. "Use of the METS structural map (element structMap)" states:

In CSIP the structMap describes the higher level structure of all the content in the root and may link to representations.
/…/

  • The internal structure of the structural map (expressed by div elements) follows the CSIP high level physical structure as described in Section 4, therefore grouping together metadata, representations, schemas, documentation and user-defined folders into their own div elements;
    /…/
  • In case both root and representation METS files exist, the structural map in the root METS file
    • Reference the fileGrp which describes all files in all folders with the exception of the content of the representation folders
    • Lists all representations (as separate div elements)
    • Lists only the appropriate representation METS file using the mptr element as the content of the representation

This can be summed up as: "The Purpose of CSIP structMap is to mirror the physical folder structure of the IP and if representations are present then point to the METS.xml files that describe them." This conclusion is mirrored by the examples:

  • structMapExample1 in CSIP.xml is pure folder structure
  • structMapExample2 is folder structure + mptr to the METS.xml that describes the files in specific folders.
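
To make that summary concrete, here is a minimal sketch of a root structMap that mirrors folders and points to one representation METS file. The XML is hypothetical and is not copied from CSIP.xml; the `TYPE`, `LABEL` and path values, and the helper name `summarise`, are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical root structMap: one div per folder, plus an mptr for the representation.
STRUCTMAP = """
<mets:structMap xmlns:mets="http://www.loc.gov/METS/"
                xmlns:xlink="http://www.w3.org/1999/xlink"
                TYPE="PHYSICAL" LABEL="CSIP">
  <mets:div LABEL="SIP_001">
    <mets:div LABEL="metadata"/>
    <mets:div LABEL="schemas"/>
    <mets:div LABEL="representations/rep1">
      <mets:mptr LOCTYPE="URL" xlink:type="simple"
                 xlink:href="representations/rep1/METS.xml"/>
    </mets:div>
  </mets:div>
</mets:structMap>
"""

def summarise(structmap_xml):
    """Return the folder divs and, per div, the representation METS it points to."""
    root = ET.fromstring(structmap_xml)
    top = root.find("mets:div", NS)          # the single top-level div
    folders, pointers = [], {}
    for div in top.findall("mets:div", NS):  # one child div per folder
        label = div.get("LABEL")
        folders.append(label)
        mptr = div.find("mets:mptr", NS)
        if mptr is not None:
            pointers[label] = mptr.get(XLINK_HREF)
    return folders, pointers

folders, pointers = summarise(STRUCTMAP)
```

Read this way, the structMap carries nothing beyond folder names and one pointer per representation.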

But why is it necessary? The same information can be derived by parsing the fileSec/fileGrp/file and mdRef elements of all METS.xml files in the IP. Processing speed could be the added value here, as the structure can be read quickly from structMap rather than reverse engineered from the descriptions of individual files. However, this duplicates information, which creates a risk of conflicting descriptions.
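
As a rough illustration of the "reverse engineering" route, the sketch below rebuilds a folder listing from the `xlink:href` values of FLocat and mdRef elements. The sample METS fragment and the function name `folders_from_mets` are made up for this example, and it assumes all hrefs are relative paths that use forward slashes.

```python
import posixpath
import xml.etree.ElementTree as ET
from collections import defaultdict

METS = "http://www.loc.gov/METS/"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical METS fragment with one mdRef and two files.
SAMPLE = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:dmdSec ID="dmd1">
    <mets:mdRef LOCTYPE="URL" MDTYPE="OTHER" xlink:href="metadata/descriptive/ead.xml"/>
  </mets:dmdSec>
  <mets:fileSec>
    <mets:fileGrp USE="Documentation">
      <mets:file ID="f1">
        <mets:FLocat LOCTYPE="URL" xlink:href="documentation/readme.txt"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="Schemas">
      <mets:file ID="f2">
        <mets:FLocat LOCTYPE="URL" xlink:href="schemas/mets.xsd"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""

def folders_from_mets(mets_xml):
    """Group every referenced file name under the folder part of its href."""
    root = ET.fromstring(mets_xml)
    tree = defaultdict(list)
    for tag in ("FLocat", "mdRef"):
        for el in root.iter(f"{{{METS}}}{tag}"):
            href = el.get(XLINK_HREF)
            if href:
                tree[posixpath.dirname(href)].append(posixpath.basename(href))
    return {folder: sorted(names) for folder, names in tree.items()}

structure = folders_from_mets(SAMPLE)
```

This has to visit every file entry, which is the cost a folder-level structMap would avoid, but it also cannot disagree with the fileSec.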

There is also a rather softly posed requirement "Reference the fileGrp which describes all files in all folders /…/" to be used in the case of representations. It seems mandatory when representations are present, so it should be made an explicit SHOULD rule. Also, if it makes sense for representations, it is equally reasonable to have it for non-rep cases, too. There should be clear instructions on where and how to place the fileGrp references.

On another thought, shouldn't the purpose of CSIP structMap be to describe the conceptual, rather than the physical structure of the package? As the folder structure is not mandatory any more, we might see folder structures like this:

SIP_001/
       |- data1/
       |- data2/
       |- data3/
       |- data4/
       |- metadata/

In such a case, the content files could be spread arbitrarily across the data folders, e.g. to make each data folder fit some size limit. Or they could be grouped by file name, as is often done in uuid-named or sequentially named file and folder structures, e.g. the actual structure of MS Outlook 2011 for Mac (the letters seem to mean Trillion, Billion, Million, Kilo, with each folder containing up to 1000 items):

Messages/
        |- 0T/
             |- 0B/
                  |- 0M/
                       |- 6K/
                       |    |- x00_6021.olk14Message
                       |    |- x00_6022.olk14Message
                       |    |- x00_6023.olk14Message
                       |
                       |- 186K/
                       |      |- x00_186029.olk14Message
                       |      |- x00_186030.olk14Message
                       |      |- x00_186031.olk14Message
                       |      |- x00_186032.olk14Message
                       |
                       |- 192K/
                              |- x00_192058.olk14Message
                              |- x00_192059.olk14Message
                              

A conceptual CSIP structMap would make a lot of sense in such cases.
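
A minimal sketch of what "conceptual" could mean here: divs group files by meaning and reference them with fptr, regardless of which physical folder each file sits in. The group names, the file IDs, the `TYPE="LOGICAL"` value and the function name are all assumptions for illustration; nothing in the Outlook listing above says which mailbox folder a message belongs to.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS)

# Hypothetical mapping: conceptual group -> file IDs declared in the fileSec,
# independent of the physical folder each file lives in.
GROUPS = {
    "Inbox": ["file-6021", "file-186029"],
    "Sent": ["file-6022", "file-192058"],
}

def conceptual_structmap(groups, label="Mailbox"):
    """Build a structMap whose divs follow the conceptual grouping."""
    struct_map = ET.Element(f"{{{METS}}}structMap", {"TYPE": "LOGICAL"})
    top = ET.SubElement(struct_map, f"{{{METS}}}div", {"LABEL": label})
    for name, file_ids in groups.items():
        div = ET.SubElement(top, f"{{{METS}}}div", {"LABEL": name})
        for file_id in file_ids:
            # fptr points from the structMap to a file entry in the fileSec
            ET.SubElement(div, f"{{{METS}}}fptr", {"FILEID": file_id})
    return struct_map

structmap = conceptual_structmap(GROUPS)
xml_text = ET.tostring(structmap, encoding="unicode")
```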

Anyway, no matter what the purpose of CSIP structMap, it should be stated clearly, supported by the requirements and illustrated with intuitive examples.

I know it sounds like useless theorising, but structure-related requirements are currently not clear (we experienced this when creating minimal valid IPs, see DILCISBoard/eark-ip-test-corpus#211 and DILCISBoard/eark-ip-test-corpus#212), and I've got a hunch that the lack of clarity might stem from unclear purpose statements.

@koit
Contributor Author

koit commented May 5, 2019

Having analysed it for a few days, I find the division of labour between structMap and fileSec overwhelming, with too many loose ends.

A good example is referencing. mets.xsd prescribes referencing in only one direction and only at the individual file level: from structMap/div/fptr to fileSec/fileGrp/file. There is no "proper" way to reference a fileGrp (the workaround via structMap/div/@CONTENTIDS involves a type mismatch of xs:anyURI vs xs:ID).
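
The two kinds of reference can be sketched as follows. The sample document, its IDs and the function name `resolve` are hypothetical. The CONTENTIDS lookup simply treats each whitespace-separated token as a bare ID string, which is exactly the workaround in question and not something the schema validates.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
NS = {"mets": METS}

# Hypothetical METS fragment: one div uses fptr, the other uses CONTENTIDS.
SAMPLE = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp ID="grp-doc" USE="Documentation">
      <mets:file ID="f1"/>
      <mets:file ID="f2"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap>
    <mets:div LABEL="root">
      <mets:div LABEL="by-file"><mets:fptr FILEID="f1"/></mets:div>
      <mets:div LABEL="by-group" CONTENTIDS="grp-doc"/>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""

def resolve(mets_xml):
    """For each div, list the file IDs and fileGrp IDs it resolves to."""
    root = ET.fromstring(mets_xml)
    file_ids = {f.get("ID") for f in root.iter(f"{{{METS}}}file")}
    group_ids = {g.get("ID") for g in root.iter(f"{{{METS}}}fileGrp")}
    result = {}
    struct_map = root.find("mets:structMap", NS)
    for div in struct_map.iter(f"{{{METS}}}div"):
        # Schema-supported direction: fptr/@FILEID -> file/@ID
        files = [p.get("FILEID") for p in div.findall("mets:fptr", NS)
                 if p.get("FILEID") in file_ids]
        # Workaround: CONTENTIDS tokens are typed as URIs, compared here as plain IDs
        groups = [t for t in div.get("CONTENTIDS", "").split() if t in group_ids]
        result[div.get("LABEL")] = {"files": files, "groups": groups}
    return result

refs = resolve(SAMPLE)
```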

Alternatives for structMap/div:
A) folders-only, i.e. no references to fileSec;
B) references to fileSec/fileGrp using structMap/div/@CONTENTIDS;
C) references to all individual files using structMap/div/fptr.

Each has its strengths and weaknesses. Also, A and B would greatly benefit from structuring fileGrp to mirror the folder structure. Should it be a complete tree of fileGrp elements, or just a flat list where each fileGrp describes the files of one folder? The latter could be facilitated by adding a new attribute fileGrp/@csip:folderName to indicate the folder name (this is possible because fileGrpType allows xs:anyAttribute). Another option would be to introduce reverse referencing by creating fileGrp/@csip:structMapDivID to point to a folder div in the structMap.
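
A minimal sketch of the flat-list variant, with one fileGrp per folder carrying the proposed folderName attribute. The attribute does not exist in any released schema, the namespace URI `urn:example:csip` is a placeholder rather than the real CSIP namespace, and the `file-N` ID scheme and function name are made up for the example.

```python
import posixpath
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
CSIP = "urn:example:csip"  # placeholder namespace, NOT the real CSIP one

for prefix, uri in (("mets", METS), ("xlink", XLINK), ("csip", CSIP)):
    ET.register_namespace(prefix, uri)

def flat_file_groups(paths):
    """Build a fileSec with one fileGrp per folder, tagged with csip:folderName."""
    file_sec = ET.Element(f"{{{METS}}}fileSec")
    groups = {}
    for number, path in enumerate(sorted(paths), start=1):
        folder = posixpath.dirname(path)
        if folder not in groups:
            groups[folder] = ET.SubElement(
                file_sec, f"{{{METS}}}fileGrp",
                {f"{{{CSIP}}}folderName": folder},  # hypothetical attribute
            )
        file_el = ET.SubElement(groups[folder], f"{{{METS}}}file",
                                {"ID": f"file-{number}"})
        ET.SubElement(file_el, f"{{{METS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": path})
    return file_sec

file_sec = flat_file_groups(["data1/a.txt", "data2/c.txt", "data1/b.txt"])
folder_names = [g.get(f"{{{CSIP}}}folderName") for g in file_sec]
```

With such a list, a folders-only structMap (alternative A) could be matched to the fileSec by folder name alone, at the price of a non-standard attribute.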

While making the choice we also need to consider:

  • The differences between root METS.xml and representation METS.xml;
  • Should the representations be completely independent or should they be "aware" of the package they are in (i.e. should representation METS.xml contain references to the IP);
  • Where to base the folder/file paths: parent folder of the IP (e.g. the path to representation METS file could be SIP_001/representations/rep1/METS.xml) or the root folder of the IP (e.g. representations/rep1/METS.xml), and what about the paths in representation METS.xml;
  • The differences in case of segmentation;
  • Performance issues of the alternatives (see structMap #83 for a METS.xml sample with comments from @andersbonielsen);
  • Is it feasible to strictly prescribe the usage model of fileGrp;
  • How to handle flat IPs (some reviewers of v.2.0 draft pointed out that they don't use any folder structure).
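
The path-base question in the list above can be illustrated with a small sketch that normalises an href to a path relative to the IP root under three possible conventions. The convention names (`ip-root`, `ip-parent`, `mets-folder`) and the function itself are invented for this example; the specification does not define them.

```python
import posixpath

def to_ip_relative(href, base, ip_name="SIP_001", mets_path="METS.xml"):
    """Normalise href to a path relative to the IP root folder.

    base:
      "ip-root"     href is already relative to the IP root
      "ip-parent"   href starts with the IP folder name
      "mets-folder" href is relative to the folder of the METS file at mets_path
    """
    if base == "ip-root":
        return posixpath.normpath(href)
    if base == "ip-parent":
        prefix = ip_name + "/"
        if not href.startswith(prefix):
            raise ValueError(f"{href!r} does not start with {prefix!r}")
        return posixpath.normpath(href[len(prefix):])
    if base == "mets-folder":
        folder = posixpath.dirname(mets_path)
        return posixpath.normpath(posixpath.join(folder, href))
    raise ValueError(f"unknown base convention: {base!r}")

from_parent = to_ip_relative("SIP_001/representations/rep1/METS.xml", "ip-parent")
from_root = to_ip_relative("representations/rep1/METS.xml", "ip-root")
from_rep = to_ip_relative("data/file.txt", "mets-folder",
                          mets_path="representations/rep1/METS.xml")
```

Whichever convention is chosen, a validator needs to know it up front, because the same href string resolves to different files under each one.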

The version 2 schedule leaves no time to properly consider all these aspects. So I propose we fix only the obvious mistakes and otherwise leave the current solution as it is. Soon after the release of v.2.0 we should create a task force to develop a complete solution for structMap and fileSec. This should involve analysis of real-life IPs from different institutions and prototyping complete IPs for the different alternative solutions.

carlwilson added this to the CSIP version 2.0 milestone May 7, 2019
carlwilson added the v2-structMap label May 7, 2019
karinbredenberg added the Solved? and Proof Read labels May 10, 2019
@karinbredenberg
Contributor

What hasn't been handled is moved to the next milestone.

karinbredenberg modified the milestones: CSIP version 2.0, CSIP Version 2.1 May 16, 2019
carlwilson modified the milestones: CSIP Version 2.1, CSIP v2.0.4 Apr 30, 2020
@carlwilson
Collaborator

I feel that what hasn't been handled should be pushed to the next milestone, provided that it gets serious consideration then. I think a response now might be rushed, as we're likely to need some good test cases to illustrate all of the issues. In general, I'm against repetition (I tend to regard all repetition as unnecessary) as it leads to internal inconsistency, i.e. chaos.

@karinbredenberg
Contributor

This needs to be pushed to the next major version update. Needs more discussion and investigation to see if the concerns have already been handled and if more rewording is needed.

3 participants