Bulk export of documents in specific output format #3402
Comments
If you export the entire project from the project settings, all documents are in there. |
Thank you for the hint! The file name inside the ZIP file seems to be hardcoded to "admin.xmi". This makes the bulk export nearly unusable for the XMI format, as there is no link back to the original document file name. |
Yeah, I guess the name of that zip file should be fixed to properly represent the document name... |
Actually, the XMI file itself should contain a "DocumentMetaData" annotation and the name of the document should be in there. Not perfect, but also not a lost cause. |
just found it (documentTitle). |
I don't know if anybody needs the "webanno...export.zip" file name. |
Any plans to add this to the next release or should I write a reformat script meanwhile? |
Additionally, for the project-level export I think the TypeSystem.xml would be required only once. |
If you need a solution now, you better write a script. I'll mark the issue for v25, but no promises. |
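A reformat script along these lines could serve as a stopgap. This is only a minimal sketch: it assumes the export ZIP contains `.xmi` members and that each XMI carries a `DocumentMetaData` element with a `documentTitle` attribute, as mentioned above. The function names are made up for illustration, and the element is matched by local name so that no namespace URI has to be guessed.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET


def document_title(xmi_bytes):
    """Return the documentTitle of the first DocumentMetaData element, or None."""
    root = ET.fromstring(xmi_bytes)
    for elem in root.iter():
        # Compare the local name only, so the namespace URI is not hard-coded.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "DocumentMetaData":
            return elem.get("documentTitle")
    return None


def safe_name(title):
    """Replace characters that are awkward in file names with underscores."""
    return re.sub(r"[^\w.\-]+", "_", title).strip("_") or "untitled"


def rename_xmi_members(src_zip_bytes):
    """Return {new_file_name: xmi_bytes} for every .xmi member of a ZIP archive."""
    renamed = {}
    with zipfile.ZipFile(io.BytesIO(src_zip_bytes)) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".xmi"):
                continue
            data = archive.read(member)
            # Fall back to the member's own base name if no title is found.
            title = document_title(data) or member.rsplit("/", 1)[-1][:-4]
            renamed[safe_name(title) + ".xmi"] = data
    return renamed
```

Documents with identical titles would overwrite each other in the returned mapping, so a real script would need some disambiguation.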
Exporting only one TS file would require quite a bit of refactoring at the moment... The new UIMA JSON format is better in this situation, because it embeds the type system - dkpro-cassis can also read/write it. |
... read: there's no need to wrap the new UIMA JSON format in a ZIP |
Guess this will slip to at least v26. |
Regarding UIMA JSON: IMHO it is not a good idea to embed the whole type system, as this will blow up the json unnecessarily. |
You are absolutely right here and I have noticed that as well. By default, only a minimal type system should be included (i.e. one that covers only the types actually used in the respective JSON file). However, I didn't have the time to investigate yet why the included type system is not minimized and if that is a bug/missing setting in INCEpTION or in the underlying UIMA library. |
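To illustrate what "minimal" means here: starting from the types actually used in a document, one would also have to keep their supertypes and the types their features refer to. The sketch below uses a made-up dictionary layout (`super`, `ranges`), not the UIMA JSON schema or any UIMA API, and only shows the closure computation.

```python
def minimal_type_system(types, used_type_names):
    """Return the subset of `types` needed to describe `used_type_names`.

    `types` maps a type name to {"super": name or None, "ranges": [names]}.
    This layout is hypothetical and is not the UIMA JSON schema.
    """
    needed = set()
    pending = list(used_type_names)
    while pending:
        name = pending.pop()
        if name in needed or name not in types:
            # Already handled, or a built-in type that needs no definition.
            continue
        needed.add(name)
        definition = types[name]
        if definition.get("super"):
            pending.append(definition["super"])
        # Feature range types must be kept too, or the features dangle.
        pending.extend(definition.get("ranges", []))
    return {name: types[name] for name in types if name in needed}
```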
I think a pointer to the underlying json schema would be best. |
Having a pointer would mean the pointer needs to be resolvable... |
What about having a pointer to schema + embed the schema in zip file to be always in sync? |
As said before, having only a single copy of the type system in the export ZIP is currently not possible - it requires refactoring. The current architecture passes each document in turn to the exporter and the exporter is the one that creates the type system - the exporter is not aware that it is being called multiple times and the caller of the exporter has no knowledge that the export might be creating a type system... Having the type system be large/redundant is a bit annoying but at least it "just" costs space... |
Can you create a zip with multiple zips inside as you do for the project export? |
In principle yes, but I am not sure if I want to. I'd personally prefer keeping the JSON file as a self-contained file with an embedded type system. My impression is that having a fully self-contained format is also important. It could be considered to offer two export options, one with a shared type system (after some refactoring) and one with redundantly embedded type systems, but at the moment it seems too early to me to do that. When implementing the bulk-export for XMIs, it will be as you say: ZIP with nested ZIPs, each nested ZIP containing the XMI file and a
Could it be that the type system differs if exported from INCEpTION? XMI: OK for me, but same issue with same TypeSystem.xml (as from same document storage). |
INCEpTION has the concept of a project-specific type system, so all annotations in a project are expected to adhere to the same type system - though they might be (internally) lagging behind because upgrading a document to the project type system is a lazy action that usually happens only when opening the document for annotation and also as part of exporting the document. |
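Whether the exported type systems really are identical can be checked on an export. The sketch below assumes an outer ZIP containing nested ZIPs, each with a `TypeSystem.xml` member (the layout discussed in this thread, not verified against an actual export), and groups the nested ZIPs by the hash of that file. The function name is illustrative.

```python
import hashlib
import io
import zipfile
from collections import defaultdict


def group_type_systems(outer_zip_bytes, ts_name="TypeSystem.xml"):
    """Map the SHA-256 digest of each nested TypeSystem.xml to nested ZIP names."""
    groups = defaultdict(list)
    with zipfile.ZipFile(io.BytesIO(outer_zip_bytes)) as outer:
        for inner_name in outer.namelist():
            if not inner_name.lower().endswith(".zip"):
                continue
            with zipfile.ZipFile(io.BytesIO(outer.read(inner_name))) as inner:
                for member in inner.namelist():
                    # Match on the base name so sub-folders are tolerated.
                    if member.rsplit("/", 1)[-1] == ts_name:
                        digest = hashlib.sha256(inner.read(member)).hexdigest()
                        groups[digest].append(inner_name)
    return dict(groups)
```

A single group means all copies are byte-identical and one copy would suffice. Several groups only show that the bytes differ; the type systems could still be equivalent but serialized differently.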
Thank you for your explanation! I think that the lazy propagation of new properties should be mentioned somewhere in the docs (if that is not already the case). If an attribute is removed, that is forcefully propagated through all documents. |
Yes. Otherwise, the lazy update should not make a difference to users because it is an internal behavior. Most internal behavior is implicit in the code and not documented - otherwise there'd be no end to writing documentation. |
OK, agreed, thank you! |
Is your feature request related to a problem? Please describe.
Currently it seems only possible to export one document in the annotation view in a desired format (e.g. Apache UIMA CAS XMI 1.1)
Describe the solution you'd like
It would be nice to have the option to export multiple documents (e.g. in Settings - documents list)