fix(slb-495): add some pdf convert scripts (experimental)

fix(slb-495): update pdf converter to pdf2md and more
AmazeeLabs · Dec 19, 2024 · c974d64
1 parent 98976d1
Showing 10 changed files with 812 additions and 111 deletions.
diff --git a/apps/converter/708651f5f96a/content.md b/apps/converter/708651f5f96a/content.md
@@ -0,0 +1,238 @@
+## Sample PDF Document 
+
+#### Robert Maron 
+
+#### Grzegorz Grudziński 
+
+#### February 20, 1999 
+
+
+##### 2 
+
+
+# Contents 
+
+**1 Template 5** 1.1 How to compile a .tex file to a .pdf file............. 5 1.1.1 Tools............................ 5 1.1.2 How to use the tools.................... 5 1.2 How to write a document...................... 6 1.2.1 The main document..................... 6 1.2.2 Chapters.......................... 6 1.2.3 Spell-checking....................... 6 1.3 LATEX and pdfLATEX capabilities................... 7 1.3.1 Overview.......................... 7 1.3.2 LATEX............................ 7 1.3.3 pdfLATEX.......................... 7 1.3.4 Examples.......................... 7 
+
+##### 3 
+
+
+##### 4 CONTENTS 
+
+
+# Chapter 1 
+
+# Template 
+
+## 1.1 How to compile a .tex file to a .pdf file 
+
+#### 1.1.1 Tools 
+
+To process the files you (may) need: 
+
+- pdflatex (for example from tetex package ≥ 0.9-6, which you can     get from Red Hat 5.2); 
+
+- acroread (a PDF viewer, available from [http://www.adobe.com/](http://www.adobe.com/)); 
+
+- ghostscript ≥ 5.10 (for example from Red Hat Contrib) and ghostview     or gv (from RedHat Linux); 
+
+- efax package could be useful, if you plan to fax documents. 
+
+#### 1.1.2 How to use the tools 
+
+Follow these steps: 
+
+1. put all source .tex files in one directory, then chdir to the directory (or put     some of them in the LATEX search path — if you know how to do this); 
+
+2. run “pdflatex file.tex” on the main file of the document three times     (three — to prepare valid table of contents); 
+
+3. to see or print the result use acroread (unfortunately some versions of     acroread may produce PostScript which is too complex), or 
+
+##### 5 
+
+
+##### 6 CHAPTER 1. TEMPLATE 
+
+4. run ghostscript: “gv file.pdf” to display or:     “gs -dNOPAUSE -sDEVICE=pswrite -q -dBATCH -sOutputFile=file.ps file.pdf”     to produce a PostScript file; 
+
+5. run “fax send phone-number file.ps” as root to send a fax, or — if you     know how to do this — modify the fax script to be able to fax .pdf files directly     (you have to insert “|%PDF*” somewhere... ). 
+
+### 1.2 How to write a document 
+
+#### 1.2.1 The main document 
+
+Choose the name of the document, say document. Copy template.tex to document.tex, then edit it, change the title, the authors and set proper include(s) for all the chapters. 
+
+#### 1.2.2 Chapters 
+
+Each chapter should be included in the main document as a separate file. You can choose any name for the file, but we suggest adding a suffix to the name of the main file. For our example we use the file name document_chapter1.tex. 
+
+First, copy template_chapter.tex to document_chapter1.tex and add the line 
+
+\include{document_chapter1} 
+
+in the document.tex, then edit document_chapter1.tex, change the chapter title and edit the body of the chapter appropriately. 
+
+#### 1.2.3 Spell-checking 
+
+_Do_ use a spell-checker, please! 
+
+You may also want to check grammar, style and so on. Actually you should do it (if you have enough spare time). But you _must_ check spelling! 
+
+You can use the ispell package for this, from within emacs, or from the command line: 
+
+ispell -t document_chapter1.tex 
+
+
+##### 1.3. LATEX AND PDFLATEX CAPABILITIES 7 
+
+### 1.3 LATEX and pdfLATEX capabilities 
+
+#### 1.3.1 Overview 
+
+First you edit your source .tex file. In LATEX you compile it using the latex command to a .dvi file (which stands for device-independent). The .dvi file can be converted to any device-dependent format you like using an appropriate driver, for example dvips. When producing .pdf files you should use pdflatex, which produces directly .pdf files out of .tex sources. Note that in the .tex file you may need to use some PDF specific packages. For viewing .tex files use your favourite text editor, for viewing .dvi files under X Window System use xdvi command, .ps files can be viewed with gv (or ghostview) and .pdf files with acroread, gv or xpdf. 
+
+#### 1.3.2 LATEX 
+
+A lot of examples can be found in this document. You should also print 
+
+- doc/latex/general/latex2e.dvi and 
+
+- doc/latex/general/lshort2e.dvi 
+
+from your tetex distribution (usually in 
+
+- /usr/share/texmf or 
+
+- /usr/lib/texmf/texmf). 
+
+#### 1.3.3 pdfLATEX 
+
+Consult doc/pdftex/manual.pdf from your tetex distribution for more details. Very useful informations can be found in the hyperref and graphics package manuals: 
+
+- doc/latex/hyperref/manual.pdf and 
+
+- doc/latex/graphics/grfguide.dvi. 
+
+#### 1.3.4 Examples 
+
+**References** 
+
+MIMUW 
+
+
+##### 8 CHAPTER 1. TEMPLATE 
+
+**Hyperlinks** 
+
+This is a target. And this is a link. 
+
+**Dashes, etc.** 
+
+There are three kinds of horizontal dash: 
+
+- - (use inside words; for example “home-page”, “X-rated”) 
+
+- – (use this one between numbers; for example “pages 2–22”) 
+
+- — (use this one as a sentence separator — like here) 
+
+**National characters** 
+
+- ó, é, í,... 
+
+- è, à, ì,... 
+
+- ô, ê,... 
+
+- õ, ñ,... 
+
+- ö, ë,... 
+
+- ż 
+
+- ą, ę 
+
+- ł, ø, ß 
+
+There are other ways to do this, see the documentation for inputenc package. 
+
+**Reserved characters** 
+
+Some characters have some special meaning, thus cannot be entered in the usual way. 
+
+- $ & % # _ { } 
+
+- \ 
+
+- ~ ^ 
+
+
+##### 1.3. LATEX AND PDFLATEX CAPABILITIES 9 
+
+**Math** 
+
+- 12 , 12 n,... 
+
+- i 1 , i 2 n,... 
+
+• 12 , (^22) −n 3 ,... 
+
+- α, β, γ, Ω,... 
+
+- →, ⇒, ≥, ≠, ∈, ?,... 
+
+- 
+
+##### √ 
+
+##### 2 ,... 
+
+##### • 2 + 2,... 
+
+ For more examples and symbols see chapter 3 of lshort2e.dvi. 
+
+**Fonts** 
+
+- Roman 
+
+- _Emphasis_ 
+
+- Medium weight — the default 
+
+- **Boldface** 
+
+- Upright 
+
+- Slanted 
+
+- Sans serif 
+
+- SMALL CAPS 
+
+- Typewriter 
+
+- and sizes: 
+
+**-** tiny 
+
+**-** scriptsize 
+
+**-** footnotesize 
+
+**-** small 
+
+**-** normalsize 
+
+
+##### 10 CHAPTER 1. TEMPLATE 
+
+**-** large 
+
+## – Large 
+
+## – LARGE 
+
+# – huge 
+
+# – Huge
diff --git a/apps/converter/htmlToMarkdown.js b/apps/converter/htmlToMarkdown.js
@@ -183,7 +183,7 @@ export async function htmlToMarkdown(url) {
 
   const html = await extractMainContentFromUrl(url);
   // Generate folder name based on HTML content
-  const folderName = generateFolderName(html);
+  const folderName = generateFolderName(url);
   const outputDir = path.join(__dirname, folderName);
   const imagesDir = path.join(outputDir, 'images');
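
Note on the change above: keying the folder name on the URL rather than the fetched HTML means repeated conversions of the same page land in the same directory even when the page content changes. The body of `generateFolderName` is not part of this diff, so the following is only a minimal sketch of the idea. `folderNameFromUrl` is a hypothetical name, and the FNV-1a hash is a dependency-free stand-in, not necessarily what the project uses.

```javascript
// Hypothetical stand-in for generateFolderName(url); the real helper is not shown in this diff.
function folderNameFromUrl(url) {
  // Normalise first so trivially different spellings of one URL map to one folder.
  // URL.href is ASCII (percent-encoded), so each char code below is a single byte.
  const normalized = new URL(url).href;

  // 32-bit FNV-1a: deterministic and small, but not cryptographic.
  let hash = 0x811c9dc5;
  for (let i = 0; i < normalized.length; i++) {
    hash ^= normalized.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }

  // Fixed-width lowercase hex keeps directory names uniform.
  return hash.toString(16).padStart(8, '0');
}
```

A 32-bit hash can collide once many URLs are converted, so a longer digest would be the safer choice if the output directories are kept long term.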
 

diff --git a/apps/converter/index.js b/apps/converter/index.js
@@ -6,6 +6,7 @@ import { fromMarkdown } from 'mdast-util-from-markdown';
 import { toHast } from 'mdast-util-to-hast';
 
 import { htmlToMarkdown } from './htmlToMarkdown.js';
+import { fetchContentJinaAi } from './jinaAi.js';
 import { pdfToMarkdown } from './pdfToMarkdown.js';
 import { wordToMarkdown } from './wordToMarkdown.js';
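
The new `jinaAi.js` module is imported here, but its contents are not visible in this section of the diff. As a rough sketch of what a fetch through Jina AI's Reader endpoint could look like, assuming the public `https://r.jina.ai/` prefix form: `buildJinaReaderUrl` and `fetchMarkdownViaJina` are hypothetical names, and the real `fetchContentJinaAi` also writes the result to disk and returns `{ markdownPath, warnings, outputDir }`, which this sketch does not attempt.

```javascript
// Assumption: Jina AI's Reader returns a page as Markdown when the target
// URL is appended to this prefix. Not taken from the project's jinaAi.js.
const JINA_READER_BASE = 'https://r.jina.ai/';

function buildJinaReaderUrl(targetUrl) {
  // Parse early so malformed input fails before any network call.
  const parsed = new URL(targetUrl);
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new TypeError(`Unsupported protocol: ${parsed.protocol}`);
  }
  return JINA_READER_BASE + parsed.href;
}

// fetchImpl is injectable so the function can be exercised without network access.
async function fetchMarkdownViaJina(targetUrl, fetchImpl = fetch) {
  const response = await fetchImpl(buildJinaReaderUrl(targetUrl));
  if (!response.ok) {
    throw new Error(`Reader request failed with status ${response.status}`);
  }
  // The response body is the Markdown text.
  return response.text();
}
```

Rejecting non-HTTP protocols up front also keeps values such as `file:` URLs from being forwarded to an external service.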
 
@@ -223,6 +224,53 @@ app.get('/html-convert', async (req, res) => {
   }
 });
 
+app.get('/jina-convert', async (req, res) => {
+  const url = req.query.path;
+
+  if (!url) {
+    return res.status(400).json({
+      error: "Please provide a URL as 'path' query parameter",
+    });
+  }
+
+  try {
+    // First fetch the URL content and save it as Markdown
+    const { markdownPath, warnings, outputDir } = await fetchContentJinaAi(url);
+
+    // Then read and process the Markdown
+    const markdown = readFileSync(markdownPath, 'utf-8');
+    const mdast = fromMarkdown(markdown);
+
+    const md = readFileSync(markdownPath, 'utf-8');
+    const ast = parse(md);
+
+    mdast.children.forEach((element, index) => {
+      const hast = toHast(element, { allowDangerousHtml: true });
+      const html = toHtml(hast, { allowDangerousHtml: true });
+      element.type = ast.children[index].type;
+      element.raw = ast.children[index].raw;
+      element.htmlValue = html;
+    });
+
+    const enhanced = await enhanceMdastNodesRecursive(mdast, outputDir);
+    // Return the processed content along with conversion info
+    res.json({
+      content: enhanced.children,
+      outputDirectory: outputDir,
+      warnings: warnings,
+    });
+  } catch (error) {
+    if (error.code === 'ENOENT') {
+      res.status(404).json({ error: `File not found: ${url}` });
+    } else {
+      res.status(500).json({
+        error: 'Error processing document',
+        details: error.message,
+      });
+    }
+  }
+});
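
Because the route reads the target from `req.query.path`, a caller has to percent-encode the target URL, otherwise any `?` or `&` inside it would be parsed as part of the converter request. A small client-side sketch follows. `buildConvertRequestUrl` and the `http://localhost:3000` base are hypothetical; the diff does not show which host or port the app listens on.

```javascript
// Hypothetical base URL; adjust to wherever the converter app is actually served.
const CONVERTER_BASE = 'http://localhost:3000';

function buildConvertRequestUrl(endpoint, target, base = CONVERTER_BASE) {
  const requestUrl = new URL(endpoint, base);
  // searchParams.set percent-encodes the value, so "?" and "&" inside the
  // target URL are not mistaken for parameters of the converter request.
  requestUrl.searchParams.set('path', target);
  return requestUrl.href;
}

// Example: the target's own query string survives as a single "path" value.
const exampleRequest = buildConvertRequestUrl(
  '/jina-convert',
  'https://example.com/a?b=1&c=2',
);
console.log(exampleRequest);
```

On success the handler responds with JSON containing `content`, `outputDirectory`, and `warnings`. A missing `path` parameter yields a 400 response.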
+
 app.get('/pdf-convert', async (req, res) => {
   const filePath = req.query.path;
 
@@ -234,7 +282,7 @@ app.get('/pdf-convert', async (req, res) => {
 
   try {
     // First convert Word to Markdown
-    const { markdownPath, outputDir } = await pdfToMarkdown(filePath);
+    const { markdownPath, warnings, outputDir } = await pdfToMarkdown(filePath);
 
     // Then read and process the Markdown
     const markdown = readFileSync(markdownPath, 'utf-8');
@@ -252,12 +300,11 @@ app.get('/pdf-convert', async (req, res) => {
     });
 
     const enhanced = await enhanceMdastNodesRecursive(mdast, outputDir);
-
     // Return the processed content along with conversion info
     res.json({
       content: enhanced.children,
       outputDirectory: outputDir,
-      // warnings: warnings,
+      warnings: warnings,
     });
   } catch (error) {
     if (error.code === 'ENOENT') {