Bulk export of documents in specific output format #3402

jfiala · 2022-09-26T13:41:28Z

Is your feature request related to a problem? Please describe.
Currently, it seems that documents can only be exported in a desired format (e.g. Apache UIMA CAS XMI 1.1) one at a time, from the annotation view.

Describe the solution you'd like
It would be nice to have the option to export multiple documents at once (e.g. from the documents list in the project settings).

reckart · 2022-09-26T13:42:30Z

If you export the entire project from the project settings, all documents are in there.

reckart · 2022-09-26T13:46:09Z

In v25 you will be able to export multiple documents from the documents list in the project settings - but this export will just retrieve the originally uploaded documents again, not the annotated documents.

I think it might be worth considering adding an export option to the workload management pages...

jfiala · 2022-09-26T13:46:39Z

Thank you for the hint!
However, there are two obstacles for Apache UIMA CAS XMI 1.1 here:
- Each document is contained in a ZIP file with a random numeric file name, e.g. webanno5173705207688145129export.zip, without a backlink to the document name.
- The file name inside the ZIP file seems to be hardcoded to "admin.xmi".

This makes the bulk export nearly unusable for the XMI format, as there is no backlink to the original document file name.
So I have to open and export each document manually, selecting the output format again for each document :(.

jfiala · 2022-09-26T13:47:04Z

In v25 you will be able to export multiple documents from the documents list in the project settings - but this export will just retrieve the originally uploaded documents again, not the annotated documents.

I think it might be worth considering adding an export option to the workload management pages...

It would be great to be able to export the annotated documents :).

reckart · 2022-09-26T13:47:41Z

Yeah, I guess the name of that zip file should be fixed to properly represent the document name...

reckart · 2022-09-26T13:48:16Z

Actually, the XMI file itself should contain a "DocumentMetaData" annotation and the name of the document should be in there. Not perfect, but also not a lost cause.

jfiala · 2022-09-26T13:48:58Z

Just found it (documentTitle).
That's good enough for a workaround with a post-processing script, thank you!
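
A minimal sketch of such a post-processing script, assuming the export ZIP contains an XMI whose DocumentMetaData element carries a documentTitle attribute, as described above. The function names `document_title` and `extract_renamed` are made up for illustration, and the element is matched by its local name only, because the namespace prefix in a given export is not something this thread specifies.

```python
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path
from typing import Optional


def document_title(xmi_bytes: bytes) -> Optional[str]:
    """Return the documentTitle of the first DocumentMetaData element, if any."""
    root = ET.fromstring(xmi_bytes)
    for elem in root.iter():
        # Compare the local name only, so the namespace prefix does not matter.
        if elem.tag.rsplit("}", 1)[-1] == "DocumentMetaData":
            return elem.get("documentTitle")
    return None


def extract_renamed(zip_path: Path, out_dir: Path) -> list:
    """Extract every .xmi member of zip_path, named after its documentTitle."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".xmi"):
                continue
            data = archive.read(member)
            # Fall back to the ZIP file name if no title is present (assumption).
            title = document_title(data) or Path(zip_path).stem
            # Keep only the last path component so nothing is written outside out_dir.
            target = out_dir / (Path(title).name + ".xmi")
            target.write_bytes(data)
            written.append(target)
    return written
```

Calling `extract_renamed(Path("webanno5173705207688145129export.zip"), Path("out"))` would then write the contained "admin.xmi" under a name derived from its document title. The type system file in the ZIP is ignored here on purpose.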

jfiala · 2022-09-26T13:50:00Z

I don't know if anybody needs the "webanno...export.zip" file name.
I'd suggest simply using the document title for the ZIP and XMI file names (both for document-level and project-level export).

jfiala · 2022-09-26T13:50:22Z

Any plans to add this to the next release, or should I write a reformat script in the meantime?

jfiala · 2022-09-26T13:51:32Z

Additionally, for the project-level export I think the TypeSystem.xml would be required only once.
So there could be one zip file with all the documents + one TypeSystem.xml?

reckart · 2022-09-26T13:51:35Z

If you need a solution now, you'd better write a script. I'll mark the issue for v25, but no promises.

reckart · 2022-09-26T13:52:45Z

Exporting only one TS file would require quite a bit of refactoring at the moment...

The new UIMA JSON format is better in this situation, because it embeds the type system - dkpro-cassis can also read/write it.

reckart · 2022-09-26T13:54:24Z

The new UIMA JSON format is better in this situation, because it embeds the type system - dkpro-cassis can also read/write it.

... read: there's no need to wrap the new UIMA JSON format in a ZIP.

reckart · 2022-10-03T21:05:39Z

Guess this will slip at least to v26.

jfiala · 2022-10-04T08:19:10Z

Regarding UIMA JSON: IMHO it is not a good idea to embed the whole type system, as this will bloat the JSON unnecessarily, especially when you have thousands of documents.
The XMI we use currently is a great and compact format.

reckart · 2022-10-04T08:28:50Z

You are absolutely right here and I have noticed that as well. By default, only a minimal type system should be included (i.e. one that covers only the types actually used in the respective JSON file). However, I didn't have the time to investigate yet why the included type system is not minimized and if that is a bug/missing setting in INCEpTION or in the underlying UIMA library.

jfiala · 2022-10-04T08:34:17Z

I think a pointer to the underlying JSON schema would be best.
That might also make it possible to keep document properties versioned...

reckart · 2022-10-04T08:34:58Z

Having a pointer would mean the pointer needs to be resolvable...

jfiala · 2022-10-04T08:40:47Z

What about having a pointer to the schema and also embedding the schema in the ZIP file, so that the two are always in sync?

reckart · 2022-10-04T08:43:11Z

As said before, having only a single copy of the type system in the export ZIP is currently not possible - it requires refactoring. The current architecture passes each document in turn to the exporter and the exporter is the one that creates the type system - the exporter is not aware that it is being called multiple times and the caller of the exporter has no knowledge that the export might be creating a type system...

Having the type system be large/redundant is a bit annoying but at least it "just" costs space...

jfiala · 2022-10-04T08:51:52Z

Can you create a zip with multiple zips inside as you do for the project export?
And use a different file name for the document json.
Then we can export and then simply extract everything and overwrite the json x times (as it is the same for all, we don't mind).

reckart · 2022-10-04T09:14:51Z

Can you create a zip with multiple zips inside as you do for the project export?

In principle yes, but I am not sure if I want to. I'd personally prefer keeping the JSON file as a self-contained file with an embedded type system. My impression is that having a fully self-contained format is also important. It could be considered to offer two export options, one with a shared type system (after some refactoring) and one with redundantly embedded type systems, but at the moment it seems too early to me to do that.

When implementing the bulk-export for XMIs, it will be as you say: ZIP with nested ZIPs, each nested ZIP containing the XMI file and a typesystem.xml file - and if you extracted all of these into the same folder, the typesystem.xml files would override each other.
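
A rough sketch of how such a nested export could be unpacked on the user side, assuming the layout described above (an outer ZIP containing one nested ZIP per document, each holding an XMI file and a type system file). That layout does not exist yet, so the details are assumptions: the function name `flatten_nested_export` is made up, each XMI is named after its nested ZIP, and the type system file is matched case-insensitively and written only once.

```python
import io
import zipfile
from pathlib import Path


def flatten_nested_export(outer_zip: Path, out_dir: Path) -> None:
    """Unpack the ZIPs nested inside outer_zip into a single folder.

    Each XMI is named after the nested ZIP that contained it. The type system
    file is written once, since later copies would only overwrite it.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(outer_zip) as outer:
        for inner_name in outer.namelist():
            if not inner_name.lower().endswith(".zip"):
                continue
            stem = Path(inner_name).stem
            # Open the nested ZIP in memory instead of extracting it to disk first.
            with zipfile.ZipFile(io.BytesIO(outer.read(inner_name))) as inner:
                for member in inner.namelist():
                    lower = member.lower()
                    if lower.endswith(".xmi"):
                        (out_dir / f"{stem}.xmi").write_bytes(inner.read(member))
                    elif Path(lower).name == "typesystem.xml":
                        target = out_dir / "typesystem.xml"
                        if not target.exists():
                            target.write_bytes(inner.read(member))
```

Keeping only the first type system file relies on all documents in a project sharing the same type system at export time. A stricter variant could compare the bytes of each copy and report any difference.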

jfiala · 2022-10-04T10:27:49Z

Could it be that the type system differs if exported from Inception?
IMHO using a schema with a pointer in each json/xml would be the cleanest solution.
However, if it is simpler I think it is OK to have the type system included, but it would be nice to have it minimized in the json.

XMI: OK for me, but same issue with same TypeSystem.xml (as from same document storage).
Using an xml schema would have the advantage of validation against the xml files, but is not necessary.

reckart · 2022-10-04T10:42:41Z

Could it be that the type system differs if exported from Inception?

INCEpTION has the concept of a project-specific type system, so all annotations in a project are expected to adhere to the same type system - though they might be (internally) lagging behind because upgrading a document to the project type system is a lazy action that usually happens only when opening the document for annotation and also as part of exporting the document.

jfiala · 2022-10-04T11:26:11Z

Thank you for your explanation! I think the lazy propagation of new properties should be mentioned somewhere in the docs (if that is not already the case). If an attribute is removed, that is forcefully propagated through all documents.

reckart · 2022-10-04T11:28:20Z

If an attribute is removed, that is forcefully propagated through all documents.

Yes.

Otherwise, the lazy update should not make a difference to users because it is an internal behavior. Most internal behavior is implicit in the code and not documented - otherwise there'd be no end to writing documentation.

jfiala · 2022-10-04T12:12:44Z

OK, agreed, thank you!
Let me know if you want a doc PR for that.

reckart added this to the 25.0 milestone Sep 26, 2022

reckart added the ⭐️ Enhancement New feature or request label Sep 26, 2022

reckart mentioned this issue Sep 26, 2022

Fix name of ZIP file used for exported XMIs #3403

Closed

reckart modified the milestones: 25.0, 26.0 Oct 3, 2022

reckart mentioned this issue Oct 4, 2022

Type system in exported JSON should be minimized #3422

Closed

reckart added Module: Dynamic Workload Module: Static Workload labels Oct 25, 2022

reckart modified the milestones: 26.0, 27.0 Nov 5, 2022

reckart modified the milestones: 27.0, 28.0 Jan 4, 2023

reckart modified the milestones: 28.0, 29.0 May 2, 2023

reckart modified the milestones: 29.0, 30.0 Jul 29, 2023

reckart modified the milestones: 30.0, 31.0 Nov 5, 2023

reckart modified the milestones: 31.0, 32.0 Jan 6, 2024

reckart modified the milestones: 32.0, ⭐️ Feature backlog Feb 6, 2024

reckart added the Module: Import/Export label Sep 30, 2024
