Bulk export of documents in specific output format #3402

Open
jfiala opened this issue Sep 26, 2022 · 27 comments

Comments

@jfiala
Contributor

jfiala commented Sep 26, 2022

Is your feature request related to a problem? Please describe.
Currently, it seems that documents can only be exported one at a time from the annotation view in a desired format (e.g. Apache UIMA CAS XMI 1.1).

Describe the solution you'd like
It would be nice to have the option to export multiple documents at once (e.g. from the documents list in the settings).

@reckart
Member

reckart commented Sep 26, 2022

If you export the entire project from the project settings, all documents are in there.

@reckart
Member

reckart commented Sep 26, 2022

In v25 you will be able to export multiple documents from the documents list in the project settings - but this export will just retrieve the originally uploaded documents again, not the annotated documents.

[Image: Screenshot 2022-09-26 at 15 44 34]

I think adding an export option to the workload management pages might be worth considering...

@jfiala
Contributor Author

jfiala commented Sep 26, 2022

Thank you for the hint!
However, there are two obstacles for Apache UIMA CAS 1.1 here:

- Each document is contained in a ZIP file with a random number in its file name, e.g. webanno5173705207688145129export.zip, without a backlink to the document name.
- The file name inside the ZIP file seems to be hardcoded to "admin.xmi".

This makes the bulk export nearly unusable for the XMI format, as there is no backlink to the original document file name.
So I have to open and export each document manually, additionally selecting the output format for each document :(.

@jfiala
Contributor Author

jfiala commented Sep 26, 2022

> In v25 you will be able to export multiple documents from the documents list in the project settings - but this export will just retrieve the originally uploaded documents again, not the annotated documents.
>
> [Image: Screenshot 2022-09-26 at 15 44 34]
>
> I think adding an export option to the workload management pages might be worth considering...

It would be great to be able to export the annotated documents :).

@reckart
Member

reckart commented Sep 26, 2022

Yeah, I guess the name of that zip file should be fixed to properly represent the document name...

@reckart
Member

reckart commented Sep 26, 2022

Actually, the XMI file itself should contain a "DocumentMetaData" annotation and the name of the document should be in there. Not perfect, but also not a lost cause.

@jfiala
Contributor Author

jfiala commented Sep 26, 2022

Just found it (documentTitle).
That's good enough for a workaround with a post-processing script, thank you!
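
A minimal sketch of what such a post-processing script could look like, using only the Python standard library. It assumes the exported ZIP files have been collected in one directory and that each XMI contains a DocumentMetaData element with a documentTitle attribute, as described above. The function names and the directory layout are hypothetical and not part of INCEpTION.

```python
import shutil
import tempfile
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path


def document_title(xmi_path):
    """Return the documentTitle of the first DocumentMetaData element, or None."""
    for elem in ET.parse(xmi_path).iter():
        # Tags look like "{namespace}DocumentMetaData"; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "DocumentMetaData":
            return elem.get("documentTitle")
    return None


def rename_exports(export_dir, out_dir):
    """Copy every XMI found in the ZIPs of export_dir to out_dir/<documentTitle>.xmi.

    Assumption: export_dir holds the per-document ZIP files; adjust the glob
    if your export is laid out differently.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    renamed = []
    for zip_path in sorted(Path(export_dir).glob("*.zip")):
        with tempfile.TemporaryDirectory() as tmp, zipfile.ZipFile(zip_path) as zf:
            zf.extractall(tmp)
            for xmi_path in Path(tmp).rglob("*.xmi"):
                title = document_title(xmi_path)
                if title is None:
                    continue  # no metadata found, leave this file alone
                # Keep only the last path component so the title cannot escape out_dir.
                target = out_dir / (Path(title).name + ".xmi")
                shutil.copyfile(xmi_path, target)
                renamed.append(target.name)
    return renamed
```

Calling `rename_exports("exports", "renamed")` would leave the original ZIPs untouched and write one XMI per document into `renamed`. Documents sharing the same title would overwrite each other, so this only works if titles are unique within the project.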

@jfiala
Contributor Author

jfiala commented Sep 26, 2022

I don't know if anybody needs the "webanno...export.zip" file name.
I'd suggest simply using the document title for the ZIP and XMI file names (both for document-level and project-level export).

@jfiala
Contributor Author

jfiala commented Sep 26, 2022

Any plans to add this to the next release, or should I write a reformatting script in the meantime?

@jfiala
Contributor Author

jfiala commented Sep 26, 2022

Additionally, for the project-level export, I think the TypeSystem.xml would be required only once.
So could there be one ZIP file with all the documents plus a single TypeSystem.xml?

@reckart
Member

reckart commented Sep 26, 2022

If you need a solution now, you'd better write a script. I'll mark the issue for v25, but no promises.

@reckart
Member

reckart commented Sep 26, 2022

Exporting only one TS file would require quite a bit of refactoring at the moment...

The new UIMA JSON format is better in this situation, because it embeds the type system - dkpro-cassis can also read/write it.

@reckart
Member

reckart commented Sep 26, 2022

> The new UIMA JSON format is better in this situation, because it embeds the type system - dkpro-cassis can also read/write it.

... read: there is no need to wrap the new UIMA JSON format in a ZIP.

@reckart reckart added this to the 25.0 milestone Sep 26, 2022
@reckart reckart added the ⭐️ Enhancement New feature or request label Sep 26, 2022
@reckart
Member

reckart commented Oct 3, 2022

Guess this will slip to at least v26.

@reckart reckart modified the milestones: 25.0, 26.0 Oct 3, 2022
@jfiala
Contributor Author

jfiala commented Oct 4, 2022

Regarding UIMA JSON: IMHO it is not a good idea to embed the whole type system, as this will blow up the JSON unnecessarily, especially when you have thousands of documents.
The XMI we use currently is a great and compact format.

@reckart
Member

reckart commented Oct 4, 2022

You are absolutely right here and I have noticed that as well. By default, only a minimal type system should be included (i.e. one that covers only the types actually used in the respective JSON file). However, I haven't had time yet to investigate why the included type system is not minimized and whether that is a bug/missing setting in INCEpTION or in the underlying UIMA library.
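
To get a feeling for how much of an embedded type system is actually redundant, something like the following rough sketch could be run over an exported file. It assumes the UIMA JSON layout uses a top-level "%TYPES" object keyed by type name (with "%SUPER_TYPE" and per-feature "%RANGE" / "%ELEMENT_TYPE" entries) and a "%FEATURE_STRUCTURES" list whose items carry a "%TYPE"; these key names are assumptions that should be checked against a real export. The function names are hypothetical.

```python
import json


def unused_types(cas):
    """Return the embedded type names that no feature structure requires.

    A type counts as required if a feature structure uses it, or if it is a
    supertype or a feature range of a required type.
    """
    types = cas.get("%TYPES", {})
    pending = [fs.get("%TYPE") for fs in cas.get("%FEATURE_STRUCTURES", [])]
    needed = set()
    while pending:
        name = pending.pop()
        if name not in types or name in needed:
            continue  # built-in type (not embedded) or already visited
        needed.add(name)
        declaration = types[name]
        pending.append(declaration.get("%SUPER_TYPE"))
        for feature in declaration.values():
            if isinstance(feature, dict):  # feature declarations are nested objects
                pending.append(feature.get("%RANGE"))
                pending.append(feature.get("%ELEMENT_TYPE"))
    return sorted(set(types) - needed)


def unused_types_in_file(path):
    """Convenience wrapper that loads a JSON file from disk first."""
    with open(path, encoding="utf-8") as handle:
        return unused_types(json.load(handle))
```

If this reports most of the project's types for a document that only uses a few layers, the embedded type system is clearly not being minimized for that file.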

@jfiala
Contributor Author

jfiala commented Oct 4, 2022

I think a pointer to the underlying JSON schema would be best.
That might also allow keeping document properties in a versioned way...

@reckart
Member

reckart commented Oct 4, 2022

Having a pointer would mean the pointer needs to be resolvable...

@jfiala
Contributor Author

jfiala commented Oct 4, 2022

What about having a pointer to the schema and also embedding the schema in the ZIP file, so that the two are always in sync?

@reckart
Member

reckart commented Oct 4, 2022

As said before, having only a single copy of the type system in the export ZIP is currently not possible - it requires refactoring. The current architecture passes each document in turn to the exporter and the exporter is the one that creates the type system - the exporter is not aware that it is being called multiple times and the caller of the exporter has no knowledge that the export might be creating a type system...

Having the type system be large/redundant is a bit annoying but at least it "just" costs space...

@jfiala
Contributor Author

jfiala commented Oct 4, 2022

Can you create a zip with multiple zips inside as you do for the project export?
And use a different file name for the document json.
Then we can export and then simply extract everything and overwrite the json x times (as it is the same for all, we don't mind).

@reckart
Member

reckart commented Oct 4, 2022

> Can you create a zip with multiple zips inside as you do for the project export?

In principle yes, but I am not sure if I want to. I'd personally prefer keeping the JSON file as a self-contained file with an embedded type system. My impression is that having a fully self-contained format is also important. Offering two export options could be considered, one with a shared type system (after some refactoring) and one with redundantly embedded type systems, but at the moment it seems too early to me to do that.

When implementing the bulk-export for XMIs, it will be as you say: ZIP with nested ZIPs, each nested ZIP containing the XMI file and a typesystem.xml file - and if you extracted all of these into the same folder, the typesystem.xml files would override each other.
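
For illustration, unpacking such a nested export into a single folder could be sketched as follows. Since the bulk export does not exist yet, the layout is an assumption: an outer ZIP containing one nested ZIP per document, with the nested ZIP named after the document. The function name is hypothetical. Each XMI is renamed after its nested ZIP so the files do not collide, while the type system file is simply overwritten each time.

```python
import io
import zipfile
from pathlib import Path


def flatten_export(outer_zip, out_dir):
    """Extract all nested ZIPs of outer_zip into out_dir and return the file names written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = set()
    with zipfile.ZipFile(outer_zip) as outer:
        for entry in outer.namelist():
            if not entry.lower().endswith(".zip"):
                continue  # only nested ZIPs are of interest here
            stem = Path(entry).stem  # e.g. "doc1.txt" for "doc1.txt.zip"
            # Nested ZIPs are opened from memory instead of being written to disk first.
            with zipfile.ZipFile(io.BytesIO(outer.read(entry))) as inner:
                for member in inner.namelist():
                    if member.endswith("/"):
                        continue  # skip directory entries
                    name = Path(member).name
                    if name.lower().endswith(".xmi"):
                        name = stem + ".xmi"  # avoid every XMI landing on the same name
                    # Identically named files (the type system) overwrite each other.
                    (out_dir / name).write_bytes(inner.read(member))
                    written.add(name)
    return sorted(written)
```

Because all documents in a project are expected to share the same type system, letting the last copy win should be harmless in this setup.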

@jfiala
Contributor Author

jfiala commented Oct 4, 2022

Could it be that the type system differs if exported from Inception?
IMHO using a schema with a pointer in each json/xml would be the cleanest solution.
However, if it is simpler I think it is OK to have the type system included, but it would be nice to have it minimized in the json.

XMI: OK for me, but same issue with same TypeSystem.xml (as from same document storage).
Using an xml schema would have the advantage of validation against the xml files, but is not necessary.

@reckart
Member

reckart commented Oct 4, 2022

> Could it be that the type system differs if exported from Inception?

INCEpTION has the concept of a project-specific type system, so all annotations in a project are expected to adhere to the same type system - though they might be (internally) lagging behind because upgrading a document to the project type system is a lazy action that usually happens only when opening the document for annotation and also as part of exporting the document.

@jfiala
Contributor Author

jfiala commented Oct 4, 2022

Thank you for your explanation! I think the lazy propagation of new properties should be mentioned somewhere in the docs (if that is not already the case). If an attribute is removed, that is forcefully propagated through all documents.

@reckart
Member

reckart commented Oct 4, 2022

> If an attribute is removed, that is forcefully propagated through all documents.

Yes.

Otherwise, the lazy update should not make a difference to users because it is an internal behavior. Most internal behavior is implicit in the code and not documented - otherwise there'd be no end to writing documentation.

@jfiala
Contributor Author

jfiala commented Oct 4, 2022

OK, agreed, thank you!
Let me know if you want a doc PR for that.

@reckart reckart modified the milestones: 26.0, 27.0 Nov 5, 2022
@reckart reckart modified the milestones: 27.0, 28.0 Jan 4, 2023
@reckart reckart modified the milestones: 28.0, 29.0 May 2, 2023
@reckart reckart modified the milestones: 29.0, 30.0 Jul 29, 2023
@reckart reckart modified the milestones: 30.0, 31.0 Nov 5, 2023
@reckart reckart modified the milestones: 31.0, 32.0 Jan 6, 2024
@reckart reckart modified the milestones: 32.0, ⭐️ Feature backlog Feb 6, 2024