HTML document work #61

hmdne · 2024-05-23T06:04:37Z

This branch aims to convert the HTML file from metanorma/reverse_adoc#90.

Metanorma PR checklist

Breaking changes (list related PRs)
Documentation update required (create task for this)
External dependency introduced (documentation update need)
Gem with native library introduced

codecov · 2024-05-23T06:06:44Z

Codecov Report

Attention: Patch coverage is 97.44246%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 98.46%. Comparing base (defb04a) to head (d8963e8).
Report is 13 commits behind head on main.

Files	Patch %	Lines
lib/coradoc/reverse_adoc/html_converter.rb	87.50%	8 Missing ⚠️
lib/coradoc/reverse_adoc/converters/table.rb	97.97%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      metanorma/reverse_adoc#61      +/-   ##
==========================================
+ Coverage   96.67%   98.46%   +1.78%     
==========================================
  Files          42       46       +4     
  Lines        1054     1306     +252     
==========================================
+ Hits         1019     1286     +267     
+ Misses         35       20      -15


ronaldtse · 2024-05-23T06:08:07Z

These tasks will be necessary for the task:

hmdne · 2024-05-23T06:52:31Z

I use AsciiDoctor to round-trip a document. This is one of the first issues I found that turned out to be an issue with AsciiDoctor actually (unless I am mistaken and this is not possible in AsciiDoc):

asciidoctor/asciidoctor#4595

Anyway, the document round trips successfully at this point, though there are still a lot of issues remaining.

ronaldtse · 2024-05-23T06:55:28Z

That's fine. We will need to ensure we test Coradoc against AsciiDoctor behavior.

Coradoc is meant to be a replacement for AsciiDoctor:

Coradoc should parse HTML and return AsciiDoc
Coradoc should parse AsciiDoc and return the resulting Document model tree
Coradoc should convert that Document model tree into other formats, including HTML (AsciiDoctor processes AsciiDoc into HTML)

ronaldtse · 2024-05-23T06:57:45Z

> I use AsciiDoctor to round-trip a document. This is one of the first issues I found that turned out to be an issue with AsciiDoctor actually (unless I am mistaken and this is not possible in AsciiDoc):
>
> asciidoctor/asciidoctor#4595

A normal AsciiDoc table cell only accepts inline content. To allow a block image in a table cell, you need to mark it as an "AsciiDoc table cell" with the a style:

[cols="1,1"]
|===
|cell1
a|image::images/004.webp["",200,100]
|===
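
A converter has to decide per cell whether to emit the plain `|` or the `a|` prefix. Commits 2eee88f and 4e40b00 in this PR do that for cells containing images that have a SRC. The following is only a minimal sketch of that decision in Ruby: the Hash-based cell model and the method names are hypothetical and are not Coradoc's actual API.

```ruby
# Hypothetical cell model: { text: String, images: [{ src: String }, ...] }.
# A cell needs the AsciiDoc ("a|") style only if it holds at least one
# image with a non-empty src; images without a src are ignored.
def adoc_cell?(cell)
  cell.fetch(:images, []).any? { |img| !img[:src].to_s.strip.empty? }
end

# Return the cell prefix to emit in front of the cell content.
def cell_prefix(cell)
  adoc_cell?(cell) ? "a|" : "|"
end

plain      = { text: "cell1", images: [] }
with_image = { text: "", images: [{ src: "images/004.webp" }] }
spacer     = { text: "", images: [{ src: "" }] }

puts cell_prefix(plain)      # |
puts cell_prefix(with_image) # a|
puts cell_prefix(spacer)     # |
```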

hmdne · 2024-05-23T06:57:48Z

I just realized this was a bogus issue report, and it's an issue on our side actually.

ronaldtse · 2024-05-23T06:59:09Z

Let's gather up any questions within Coradoc first and the team will answer any questions so we don't affect others' repositories.

cc @opoudjis @Intelligent2013 @manuelfuenmayor @anermina

hmdne · 2024-05-23T08:33:11Z

6c4a059 makes it so that tables are now computed correctly (mostly, still in testing).

This makes the following fragment:

Being roundtripped into:

What's apparent is a difference in column widths: I add an attribute such as cols="3*" to the table, which gives the resulting HTML predefined column widths. The original document just relies on the web browser to deduce column widths. I have found no way to disable this behavior.
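
For illustration, a cols attribute like this could be derived from a list of relative column widths roughly as follows. This is a sketch under assumptions: `cols_attribute` is a hypothetical helper, and the real column size computation in this branch may work differently.

```ruby
# Build an AsciiDoc cols attribute from relative column widths.
# Equal widths collapse to the "N*" shorthand; otherwise the widths
# are listed explicitly, separated by commas.
def cols_attribute(widths)
  raise ArgumentError, "at least one column is required" if widths.empty?

  if widths.uniq.size == 1
    %(cols="#{widths.size}*")
  else
    %(cols="#{widths.join(',')}")
  end
end

puts cols_attribute([1, 1, 1]) # cols="3*"
puts cols_attribute([9, 4])    # cols="9,4"
```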

Another difference is a lack of BGCOLOR. Should I pass this attribute along? Perhaps when some setting is enabled?

hmdne · 2024-05-23T09:03:42Z

After this commit, the document is mostly readable in my opinion. There are still some crucial issues that I can see, but the document is now, let's say, testable.

Note: I still haven't implemented the --split-sections option, so there's just a single .adoc file being output.

Below is an archive that contains an adoc file created using this branch and an HTML file that is the result of AsciiDoctor processing that file:
document.tar.gz

ronaldtse · 2024-05-23T14:19:32Z

Thanks @hmdne , this is respectable progress!

The only thing is that the document is to be tested using Metanorma, not AsciiDoctor. The sample document for that is in the mn-samples-plateau repository (001-v3 is the v3 of this document, the new HTML version is 001-v4)

This HTML document was developed to adhere to Metanorma styling.

By this, we mean: if a link is preceded by a space or sits at the beginning of a block, we don't need to add another space. In fact, we shouldn't, because with code like <div><a href="test">test</a></div>, adding a space before the link opens a code block, and we just get source code instead of a link.
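
The rule above can be sketched like this. It is an illustration only: `append_link` is a hypothetical helper, not the method used in this branch, and it assumes the text generated so far is available as a string.

```ruby
# Append an inline link to the text generated so far. A separating space
# is inserted only when the preceding text is non-empty and does not
# already end in whitespace, so a link at the start of a block never
# gets a leading space.
def append_link(preceding, href, label)
  link = "link:#{href}[#{label}]"
  separator = preceding.empty? || preceding.match?(/\s\z/) ? "" : " "
  "#{preceding}#{separator}#{link}"
end

puts append_link("", "test", "test")     # link:test[test]
puts append_link("see ", "test", "test") # see link:test[test]
puts append_link("see", "test", "test")  # see link:test[test]
```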

In particular, I was curious what caused a performance problem on a large document I'm working on. It turned out to be the remove_inner_whitespace procedure in Cleaner. With a simple fix, I managed to make it finish in 1 second instead of 170 seconds. All the rest of the processing combined takes 10 seconds, so we will be able to progress much faster on the next issues.

Happened to me once, but could happen at any time in production.

The idea here is that HTML content generators often introduce a lot of unnecessary markup that only makes sense in the HTML+CSS context. Certain cases can be simplified so that the result is equivalent but much simpler, allowing us to generate nicer AsciiDoc syntax for those cases.

hmdne · 2024-05-24T02:35:42Z

@ronaldtse Thanks for the clarification. I will take a deeper look at how they compare. For now, I need to work a little more on tables, so that we reliably produce correct AsciiDoc output.

hmdne · 2024-05-24T03:12:18Z

@ronaldtse A question: this document is not necessarily semantic HTML; it sometimes uses styling instead. For instance:

Instead of <h2> it does <div class="subtitledata">. Instead of <th> it does <td BGCOLOR="#dddddd">

Creating a proper document won't be possible with that in mind. We can't add exceptions like this to the reverse_adoc logic, since they are specific to this document and its styling (or should we? I think the purpose of reverse_adoc is to be agnostic to formats). Otherwise, we will need to add a script to preprocess it, and perhaps even postprocess it if Metanorma-compatible content is desired. Can you give us some hints on that? (For example: is this in scope for this task, and in which repo should such pre/postprocessors land?)

Let's move the logic of delimiting tables to Coradoc, as I think it makes more sense there. This changes the semantics a little: one-line rows are now generated if there are any AsciiDoc cells. Previously, it was up to each Cell to decide whether to be generated multiline or not. This results in nicer tables.

hmdne · 2024-05-25T11:52:34Z

@ronaldtse Handling lists was very tricky, but it's ready now. I have also uncovered something like a definition list in 7.2.4, but since their use of markup (.text2data, .text3data) is not consistent, I can't reliably detect them.

What I can see as remaining tasks to be done in this PR:

Investigate what to do with .text2data and .text3data
Correct an issue with \<< Something >> and with \n +
Split sections into files
Correct an edge case with table column size computation
Add some tests for new features introduced
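
On the second item: commit 7550de2 later stops escaping things like << ABC >> because the backslash is passed through. A minimal sketch of such selective escaping follows. The character class used to recognize a real cross-reference start is an assumption made for illustration, and `escape_xref` is a hypothetical helper, not the code in this branch.

```ruby
# Escape "<<" only when it is immediately followed by a character that
# could begin a cross-reference target (assumed set: word characters and
# a few punctuation marks). "<< ABC >>" is left untouched, so no stray
# backslash ends up in the output.
XREF_START = /<<(?=[\p{Word}#\/.:{])/

def escape_xref(text)
  text.gsub(XREF_START) { "\\<<" }
end

puts escape_xref("see <<sec1>>")
puts escape_xref("a << B >> c")
```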

hmdne · 2024-05-25T18:14:47Z

To make things easier, I'm uploading the current version of the document generated:

document.tar.gz

I plan to continue development tomorrow (Sunday) at 4-6 AM GMT+2.

hmdne · 2024-05-26T14:33:30Z

We have generated a section tree at this point, so we may split sections into individual files. I am not entirely sure this approach will correctly translate into all documents, not only the one we are working on.
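
As a rough illustration of splitting by section level, here is a sketch that works on a flat, document-ordered list of sections. The `Section` struct, the `split_sections` helper, and the interpretation of the level argument (a new file starts at every section at or above that level) are all assumptions, not the implementation in this branch.

```ruby
# Hypothetical flat section record, in document order.
Section = Struct.new(:level, :title, :body)

# Group sections into files. A new group starts at every section whose
# level is <= split_level; deeper sections stay with the group opened
# by the nearest preceding section at or above that level.
def split_sections(sections, split_level)
  groups = []
  sections.each do |section|
    groups << [] if groups.empty? || section.level <= split_level
    groups.last << section
  end
  groups
end

sections = [
  Section.new(1, "Scope", "text"),
  Section.new(2, "General", "text"),
  Section.new(3, "Details", "text"),
  Section.new(2, "Terms", "text")
]

split_sections(sections, 2).each_with_index do |group, index|
  # The file naming scheme here is made up for the example.
  puts format("sections/%02d.adoc: %s", index + 1, group.map(&:title).join(", "))
end
```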

hmdne · 2024-05-27T09:37:58Z

Thanks to a suggestion from @xyz65535 I have handled indentation in the document with [none] unordered lists. This should preserve as much semantics from the incoming document as possible.

In addition, I finalized the plugin implementation. It is now possible to plug in at any meaningful stage of AsciiDoc generation. I suppose this could be used to add something like a Metanorma plugin that would, for instance, try to extract and produce data that is meaningful to Metanorma but not necessarily part of standard AsciiDoc. The plugin architecture should support multiple plugins being used for any conversion.
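
To make the idea concrete, here is a generic sketch of a hook registry that lets several plugins run at named stages. It is not Coradoc's plugin API: the class name, the stage name, and the method names are all hypothetical.

```ruby
# Hypothetical plugin registry: callables are registered per stage and
# applied in registration order, each receiving the previous result.
class PluginRegistry
  def initialize
    @hooks = Hash.new { |hash, stage| hash[stage] = [] }
  end

  # Register a block to run at the given stage of the conversion.
  def on(stage, &block)
    @hooks[stage] << block
    self
  end

  # Pass a value through every hook registered for the stage.
  # With no hooks registered, the value is returned unchanged.
  def run(stage, value)
    @hooks[stage].reduce(value) { |acc, hook| hook.call(acc) }
  end
end

registry = PluginRegistry.new
registry.on(:postprocess_asciidoc) { |adoc| adoc.gsub(/[ \t]+$/, "") }
registry.on(:postprocess_asciidoc) { |adoc| "#{adoc}\n" }

puts registry.run(:postprocess_asciidoc, "== Title  ").inspect
```

Chaining the hooks with reduce is one simple way to let multiple plugins take part in the same conversion without knowing about each other.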

hmdne · 2024-05-27T09:45:58Z

Here's an example from 7.1.2.4:

Original document:

Our document:

AsciiDoc for that fragment:

ronaldtse · 2024-05-27T09:54:01Z

@hmdne the ideal AsciiDoc encoding:

==== 変換規則

===== スキーマ変換規則

* スキーマ変換規則は、i-UR3.0及びCityGML2.0に従う。
* なお、標準製品仕様書は、応用スキーマクラス図及びこれに対応するXMLSchemaを新規に作成するのではなく、i-UR3.0及びCityGML2.0から必要な部分のみを選択し、使用している。
* 応用スキーマクラス図に示す、クラス名、属性名及び関連役割名は、i-UR3.0及びCityGML2.0において定義されたタグに一致させている。
* また、複数の名前空間から選択しているため、全てのクラス名に、i-UR3.0又はCityGML2.0名前空間の接頭辞を付ける。

===== インスタンス変換規則

GMLに準拠する。

* オブジェクト識別子（gml:id）
+
--
データ製品に含まれる全ての地物には、gml:idによる識別可能な値を与えることとし、その値には［接頭辞］_［UUID］を使用する。

［接頭辞］は、CityGML及びi-URの各パッケージに与えられた接頭辞（表7-4）を使用する。

［UUID］は、Universally Unique Identifier（UUID）［2］とする。UUIDとは、ソフトウェア上でオブジェクトを一意に識別するための識別子であり、128ビット（16バイト）の値で表す。先頭から4ビットごとに16進数の値（0～f）に変換し、8桁-4桁-4桁-4桁-12桁に切って表現する。
--

* 集成の実装
+
--
応用スキーマに示された地物間の集成は、部品となるオブジェクトを、全体となるオブジェクトの子要素として記述する。

この時、部品となるオブジェクトの識別子（gml:id）を、全体となるオブジェクト以外のオブジェクトが参照してもよい。
--

* 空間参照系の識別
+
--
幾何オブジェクトに適用される空間参照系は、都市モデル（core:CityModel）に挿入されるEnvelope要素の属性srsNameにおいて、以下のEPSGコードを挿入することにより識別する。

[cols="9,4"]
|===
| 空間参照系の名称 | srsNameに挿入する値

| 日本測地系2011における経緯度座標系と東京湾平均海面を基準とする標高の複合座標参照系
| http://www.opengis.net/def/crs/EPSG/0/6697
|===
--

* schemaLocationの指定
+
i-URの符号化仕様は、3D都市モデル内のschemasフォルダ（7.2.4）に格納したXMLSchemaファイルへの相対パスによりschemaLocationを指定する。

The interesting thing about the PLATEAU documents is they use the clause scheme like this:

So the Level 4 and Level 5 are actually not lists, they are clauses (sections).

hmdne · 2024-05-27T10:07:53Z

The last clause level is not something we can extract programmatically, as the only class we have available is "text2data" - all we can deduce from that is that the author intended a "level 2 indentation". This class is used a lot in the document, for instance the underlined parts are also "text2data":

We handle this particular example specially, as per your request, and compile it into a numbered list, but in other parts of the document, these are also "text2data":

I see no way to programmatically interpret "text2data" as anything other than "level 2 indentation", and that's what I try to express with lists.
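
In other words, the only information that can be extracted from such a class is a number. A minimal sketch of that extraction, assuming the class names follow the textNdata pattern seen in this document (`indent_level` is a hypothetical helper):

```ruby
# Extract the indentation level from a class attribute value such as
# "text2data". Returns nil when no class matches the textNdata pattern,
# e.g. for "subtitledata".
def indent_level(class_attr)
  class_attr.to_s.split.each do |name|
    match = name.match(/\Atext(\d+)data\z/)
    return match[1].to_i if match
  end
  nil
end

puts indent_level("text2data").inspect    # 2
puts indent_level("subtitledata").inspect # nil
```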

ronaldtse · 2024-05-27T10:53:38Z

@hmdne there is always a balance between automated processing and manual processing, and I do agree that there are some portions we have to fix up manually after automated processing. As long as we know what work remains (ping @metanorma/editors ), that's fine.

hmdne · 2024-05-27T13:46:35Z

I have completed the last task on this issue. This will still need some testing, but other than that, I don't see any more remaining problems with conversion.

Below is the (hopefully) final version of the document, ready for review:

document.tar.gz

hmdne · 2024-05-27T17:13:29Z

@ronaldtse There was a minor fix uncovered by the test suite, but it doesn't affect the document. I think this PR is ready.

ronaldtse · 2024-05-29T04:43:58Z

@hmdne can you let me know how you've tested the feature?

This is what I used.

$ bundle exec reverse_adoc -rcoradoc/reverse_adoc/plugins/plateau --split-sections 2 --external-images -o plateau/index.adoc index.html

I have additional issues that I will file separately now.

ronaldtse

Thank you @hmdne!

ronaldtse · 2024-05-29T05:22:35Z

The remaining issues are at:

hmdne added 5 commits May 23, 2024 08:48

Fixup: coradoc/reverse_adoc

1cf71fd

Don't die on empty images

cee010c

Ensure URI is loaded

3895b1e

Output a correct extension for SVG files

3cfb2df

Skip empty images - those are used most often for CSS flow etc.

52a5336

hmdne force-pushed the html-document-work branch from 848952c to 52a5336 Compare May 23, 2024 06:48

hmdne added 2 commits May 23, 2024 09:01

Ensure asciidoc markup for a table cell if it contains images

2eee88f

Set column number if we are not able to provide good syntax

6c4a059

hmdne added 2 commits May 24, 2024 03:29

hmdne force-pushed the html-document-work branch from c3c0d5a to bdd1dab Compare May 24, 2024 01:38

hmdne added 3 commits May 24, 2024 04:23

img extraction tempfile - Fix race condition

2b55df9

Happened to me once, but could happen at any time in production.

For adoccell, only consider images that have SRC

4e40b00

hmdne added 4 commits May 24, 2024 05:22

Make simplification more aggressive

b0a5059

More verbose name for simplification method

26d1232

Ignore image w/h if given with %

3ddfbc1

hmdne added 2 commits May 25, 2024 14:36

Correct table column size computation

67d20a7

Tests: Add Plugin

1283734

hmdne added 3 commits May 26, 2024 07:14

Tests: Table computation

29502c5

Tests: Test a href==text case

7c2e301

Introduce visitor pattern to CoraDoc nodes; generate a section tree

9fa7e51

hmdne added 2 commits May 27, 2024 11:26

Plugin system: finalize the implementation

02de9ea

Plateau: Correctly handle indentation

83ee91a

hmdne added 2 commits May 27, 2024 12:36

Move text processing to Coradoc

e208c72

Text: Don't escape things like << ABC >>, because \ is passed thru

7550de2

hmdne added 4 commits May 27, 2024 13:40

Plateau: Correct a problem after Text node change

0546576

More robust handling for link constrainment

0c76c6c

Fix the " +\n" issue

3494176

Section splitting support

ed667b2

Tests: Section splitting

d8963e8

hmdne marked this pull request as ready for review May 27, 2024 17:08

ronaldtse approved these changes May 29, 2024


ronaldtse merged commit e85eaa8 into metanorma:main May 29, 2024
15 of 16 checks passed

ronaldtse mentioned this pull request May 29, 2024

Remaining Plateau issues with reverse_adoc #71

Open
