Compendium is a processor for generating, unifying and converting different input sources such as AsciiDoc files, Markdown files, Confluence pages and HTML websites into different output formats. You can select all of the content, or only parts of it, from the input files and generate AsciiDoc, Markdown, HTML or PDF as output. The name reflects the goal: a compendium is a composition of concise but detailed information about a particular subject, obtained from different sources, with the aim of being published in a single document, book or other publication.
Compendium uses a JSON config file with two parts, sources and documents, that define where and how to get the desired input data.
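As a sketch, the overall shape of the file is as follows (both parts are described in detail below):

```json
{
  "sources": [],
  "documents": []
}
```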
Compendium currently accepts AsciiDoc, Markdown, Confluence and HTML as input formats.
In this part of the configuration file you define the sources of the input files and their types, and assign a reference ID or name to each of them.
- reference: ID of the source.
- source_type: type of the source (e.g. asciidoc, markdown, url-html, confluence).
- source: URL or path where the information is located (e.g. https://adcenter.pl.s2-eu.capgemini.com/confluence/).
"sources": [
{
"reference": "project1",
"source_type": "asciidoc",
"source": "./test-data/input/input-data1"
},
{
"reference": "project2",
"source_type": "asciidoc",
"source": "./test-data/input/input-data2"
},
{
"reference": "confluence2",
"source_type": "confluence",
"source": "https://adcenter.pl.s2-eu.capgemini.com/confluence/",
"space": "JQ",
"context": "capgemini"
}
]
To read from the Confluence internal network, add these arguments to the source part:

- context: capgemini
- space: space key of the project; all the URLs of the project contain these letters, e.g. for https://adcenter.pl.s2-eu.capgemini.com/confluence/display/HD/2.+Objectives the space is HD.
To read from a private Confluence account, add these arguments to the source part:

- context: external
- space: depends on the account; all the URLs contain a two- or three-letter space key after /<context>/.
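For example, a private-account source entry might look like this (the URL and space key below are illustrative values, not real ones):

```json
{
  "reference": "confluence-private",
  "source_type": "confluence",
  "source": "https://example.atlassian.net/wiki/",
  "space": "ABC",
  "context": "external"
}
```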
- reference: refers to the source reference defined above; it must match that source ID.
- document: file name, or the name/ID of the project page inside the referenced source path (e.g. 6.+Entity+relationship+diagram).
- sections: the section or sections that you want to extract. To extract all of the content of the document, leave this argument blank; to extract specific sections, list them in an array (e.g. sections: [h1, h3]).
"documents": [
{
"reference": "project1",
"document": "manual"
},
{
"reference": "project2",
"document": "brownfox2"
},
{
"reference": "project2",
"document": "paragraph1"
},
{
"reference": "confluence2",
"document": "Jump+the+queue+Home",
"sections": ["Epic 2. Consult the queue"]
}
]
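Putting the two parts together, a minimal complete config file, using one source and one document from the examples above, looks like this:

```json
{
  "sources": [
    {
      "reference": "project1",
      "source_type": "asciidoc",
      "source": "./test-data/input/input-data1"
    }
  ],
  "documents": [
    {
      "reference": "project1",
      "document": "manual"
    }
  ]
}
```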
- AsciiDoc documents:
  - source_type: asciidoc (reads directly from local .adoc documents)
  - source: local path.
- Markdown documents:
  - source_type: markdown (reads directly from local .md documents)
  - source: local path.
- Confluence pages:
  - source_type: confluence
  - source: base URL of the Confluence account
  - context: capgemini (internal network) or external (private Confluence account)
  - space: JQ (project space key)
- HTML pages directly from a website:
  - source_type: url-html
  - source: URL
In the url-html type, the document part has an optional attribute: when the document is an index, Compendium extracts all the links from it and includes the linked pages in the output file, so that all the pages of a site are downloaded. The document has to be unique and consider the following:

- document: index URL
- is_index: true or false (to indicate whether an index has to be read)
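For example, a url-html source with an index document could be configured like this (the URLs below are illustrative):

```json
"sources": [
  {
    "reference": "website1",
    "source_type": "url-html",
    "source": "https://example.com/docs/"
  }
],
"documents": [
  {
    "reference": "website1",
    "document": "https://example.com/docs/index.html",
    "is_index": true
  }
]
```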
Compendium can generate the following output formats:

- HTML
- AsciiDoc
- Markdown
- PDF
Compendium works like a merger and compiler. It gets pieces of information from different sources and formats, merges them into a single file and generates an output file with the desired output format (PDF, HTML, AsciiDoc or Markdown).
In this section the main compilation and merging process is described:
- Lexical Analysis:
  - The sequences of characters from the input files are tokenized in the scanner or tokenization process.
  - TextIn objects transform the input source code into HTML tokenized code, using Asciidoctor.js for the AsciiDoc files or Showdown.js for the Markdown files. Confluence data is recovered via JSON and can be transcoded directly, and HTML files are already tokenized.
- Syntax Analysis:
  - The HTML tokenized code is parsed into a parse tree in the parsing or hierarchical analysis process.
  - TextIn objects parse the XML (HTML) code using html-parser, generating a tree data structure that represents the content. TextIn objects then go through all of the branches of the parsed tree, returning `Array<TextSegment>` elements that will be used to generate the Transcript objects.
- Semantic Analysis:
  - The `Array<TextSegment>` is iterated through filtering functions that remove the unwanted information.
  - At this point all the input information has been parsed and filtered: we have an `Array<TextSegment>` with all the pieces of information from the sources.
- Intermediate Code Generation:
  - The `Array<TextSegment>` generated in the previous steps is now used to create Transcript objects that contain all the TextSegment information.
  - The Transcript elements are an intermediate representation of the input source data that makes it easier to work with the data when generating the output file.
- Code Optimization:
  - In this step, all the Transcript elements are merged by the Merger into a single Transcript. The output is then almost ready to generate a file in the selected output format.
- Output Code Generation:
  - In this final step, the Transcript elements are the input for the TextOut object, which generates the output code in the desired format.
  - The Transcript elements are used to generate AsciiDoc, which is transformed to HTML using Asciidoctor and reused as intermediate code for the other formats: Markdown is generated from the HTML using the Turndown service, PDF from the HTML using htmlto, and AsciiDoc directly from the code generated from the Transcript elements.
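A rough sketch of this pipeline in TypeScript is shown below. The type and method names are illustrative only, not Compendium's actual API; they just mirror the TextIn → TextSegment → Transcript → Merger → TextOut flow described above.

```typescript
// Illustrative types only; Compendium's real interfaces may differ.
interface TextSegment {
  kind: "heading" | "paragraph" | "list" | "table";
  content: string;
}

interface Transcript {
  segments: TextSegment[];
}

// Lexical + syntax + semantic analysis: read one document from a source
// and parse it into a Transcript of filtered TextSegments.
interface TextIn {
  getTranscript(document: string, sections?: string[]): Promise<Transcript>;
}

// Output code generation: turn the merged Transcript into the desired format.
interface TextOut {
  generate(transcripts: Transcript[]): Promise<void>;
}

// Code optimization step: merge all Transcripts into a single one.
function merge(transcripts: Transcript[]): Transcript {
  return { segments: transcripts.flatMap((t) => t.segments) };
}

// Driving the whole pipeline for a list of configured documents,
// assuming inputs[i] is the source referenced by docs[i].
async function compend(inputs: TextIn[], docs: string[], out: TextOut): Promise<void> {
  const transcripts = await Promise.all(
    docs.map((doc, i) => inputs[i].getTranscript(doc))
  );
  await out.generate([merge(transcripts)]);
}
```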