Skip to content

Commit

Permalink
fix(slb-495): add some pdf convert scripts (experimental)
Browse files Browse the repository at this point in the history
fix(slb-495): update pdf converter to pdf2md and more
  • Loading branch information
dspachos committed Dec 19, 2024
1 parent 98976d1 commit c974d64
Show file tree
Hide file tree
Showing 10 changed files with 812 additions and 111 deletions.
238 changes: 238 additions & 0 deletions apps/converter/708651f5f96a/content.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,238 @@
## Sample PDF Document

#### Robert Maron

#### Grzegorz Grudzi´nski

#### February 20, 1999


##### 2


# Contents

**1 Template 5** 1.1 How to compile a .tex file to a .pdf file............. 5 1.1.1 Tools............................ 5 1.1.2 How to use the tools.................... 5 1.2 How to write a document...................... 6 1.2.1 The main document..................... 6 1.2.2 Chapters.......................... 6 1.2.3 Spell-checking....................... 6 1.3 LATEX and pdfLATEX capabilities................... 7 1.3.1 Overview.......................... 7 1.3.2 LATEX............................ 7 1.3.3 pdfLATEX.......................... 7 1.3.4 Examples.......................... 7

##### 3


##### 4 CONTENTS


# Chapter 1

# Template

## 1.1 How to compile a .tex file to a .pdf file

#### 1.1.1 Tools

To process the files you (may) need:

- pdflatex (for example from tetex package ≥ 0.9-6, which you can get from Red Hat 5.2);

- acroread (a PDF viewer, available from [http://www.adobe.com/);](http://www.adobe.com/);)

- ghostscript ≥ 5.10 (for example from Red Hat Contrib) and ghostview or gv (from RedHat Linux);

- efax package could be useful, if you plan to fax documents.

#### 1.1.2 How to use the tools

Follow these steps:

1. put all source .tex files in one directory, then chdir to the directory (or put some of them in the LATEXsearch path — if you know how to do this);

2. run “pdflatex file.tex” on the main file of the document three times (three — to prepare valid table of contents);

3. to see or print the result use acroread (unfortunately some versions of acroread may produce PostScript which is too complex), or

##### 5


##### 6 CHAPTER 1. TEMPLATE

4. run ghostscript: “gv file.pdf” to display or: “gs -dNOPAUSE -sDEVICE=pswrite -q -dBATCH -sOutputFile=file.ps file.pdf” to produce a PostScript file;

5. run “fax send phone-number file.ps” as root to send a fax, or — if you know how to do this — modify the fax script to be able to fax .pdf files directly (you have to insert “|%PDF*” somewhere... ).

### 1.2 How to write a document

#### 1.2.1 The main document

Choose the name of the document, say document. Copy template.tex to document.tex, then edit it, change the title, the authors and set proper include(s) for all the chapters.

#### 1.2.2 Chapters

Each chapter should be included in the main document as a separate file. You can choose any name for the file, but we suggest adding a suffix to the name of the main file. For our example we use the file name document_chapter1.tex.

First, copy template_chapter.tex to document_chapter1.tex and add the line

\include{document_chapter1}

in the document.tex, then edit document_chapter1.tex, change the chapter title and edit the body of the chapter appropriately.

#### 1.2.3 Spell-checking

_Do_ use a spell-checker, please!

You may also want to check grammar, style and so on. Actually you should do it (if you have enough spare time). But you _must_ check spelling!

You can use the ispell package for this, from within emacs, or from the command line:

ispell -t document_chapter1.tex


##### 1.3. LATEX AND PDFLATEX CAPABILITIES 7

### 1.3 LATEX and pdfLATEX capabilities

#### 1.3.1 Overview

First you edit your source .tex file. In LATEX you compile it using the latex command to a .dvi file (which stands for device-independent). The .dvi file can be converted to any device-dependent format you like using an appropriate driver, for example dvips. When producing .pdf files you should use pdflatex, which produces directly .pdf files out of .tex sources. Note that in the .tex file you may need to use some PDF specific packages. For viewing .tex files use your favourite text editor, for viewing .dvi files under X Window System use xdvi command, .ps files can be viewed with gv (or ghostview) and .pdf files with acroread, gv or xpdf.

#### 1.3.2 LATEX

A lot of examples can be found in this document. You should also print

- doc/latex/general/latex2e.dvi and

- doc/latex/general/lshort2e.dvi

from your tetex distribution (usually in

- /usr/share/texmf or

- /usr/lib/texmf/texmf).

#### 1.3.3 pdfLATEX

Consult doc/pdftex/manual.pdf from your tetex distribution for more details. Very useful informations can be found in the hyperref and graphics package manuals:

- doc/latex/hyperref/manual.pdf and

- doc/latex/graphics/grfguide.dvi.

#### 1.3.4 Examples

**References**

MIMUW


##### 8 CHAPTER 1. TEMPLATE

**Hyperlinks**

This is a target. And this is a link.

**Dashes, etc.**

There are three kinds of horizontal dash:

- - (use inside words; for example “home-page”, “X-rated”)

- – (use this one between numbers; for example “pages 2–22”)

- — (use this one as a sentence separator — like here)

**National characters**

- ó, é, í,...

- è, à, ì,...

- ô, ê,...

- õ, ñ,...

- ö, ë,...

-

- a, ˛˛e

- ł, ø, ß

There are other ways to do this, see the documentation for inputenc package.

**Reserved characters**

Some characters have some special meaning, thus cannot be entered in the usual way.

- $ & % # _ { }

- \

- ˜ ˆ


##### 1.3. LATEX AND PDFLATEX CAPABILITIES 9

**Math**

- 12 , 12 n,...

- i 1 , i 2 n,...

• 12 , (^22) −n 3 ,...

- α, β, γ, Ω,...

- →, ⇒, ≥, 6 =, ∈, ?,...

-

#####

##### 2 ,...

##### • 2 + 2,...

For more examples and symbols see chapter 3 of lshort2e.dvi.

**Fonts**

- Roman

- _Emphasis_

- Medium weight — the default

- **Boldface**

- Upright

- Slanted

- Sans serif

- SMALL CAPS

- Typewriter

- and sizes:

**-** tiny

**-** scriptsize

**-** footnotesize

**-** small

**-** normalsize


##### 10 CHAPTER 1. TEMPLATE

**-** large

## – Large

## – LARGE

# – huge

# – Huge
2 changes: 1 addition & 1 deletion apps/converter/htmlToMarkdown.js
Original file line number Diff line number Diff line change
Expand Up @@ -183,7 +183,7 @@ export async function htmlToMarkdown(url) {

const html = await extractMainContentFromUrl(url);
// Generate folder name based on HTML content
const folderName = generateFolderName(html);
const folderName = generateFolderName(url);
const outputDir = path.join(__dirname, folderName);
const imagesDir = path.join(outputDir, 'images');

Expand Down
53 changes: 50 additions & 3 deletions apps/converter/index.js
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ import { fromMarkdown } from 'mdast-util-from-markdown';
import { toHast } from 'mdast-util-to-hast';

import { htmlToMarkdown } from './htmlToMarkdown.js';
import { fetchContentJinaAi } from './jinaAi.js';
import { pdfToMarkdown } from './pdfToMarkdown.js';
import { wordToMarkdown } from './wordToMarkdown.js';

Expand Down Expand Up @@ -223,6 +224,53 @@ app.get('/html-convert', async (req, res) => {
}
});

app.get('/jina-convert', async (req, res) => {
const url = req.query.path;

if (!url) {
return res.status(400).json({
error: "Please provide a URLas 'path' query parameter",
});
}

try {
// First convert Word to Markdown
const { markdownPath, warnings, outputDir } = await fetchContentJinaAi(url);

// Then read and process the Markdown
const markdown = readFileSync(markdownPath, 'utf-8');
const mdast = fromMarkdown(markdown);

const md = readFileSync(markdownPath, 'utf-8');
const ast = parse(md);

mdast.children.forEach(async (element, index) => {
const hast = toHast(element, { allowDangerousHtml: true });
const html = toHtml(hast, { allowDangerousHtml: true });
element.type = ast.children[index].type;
element.raw = ast.children[index].raw;
element.htmlValue = html;
});

const enhanced = await enhanceMdastNodesRecursive(mdast, outputDir);
// Return the processed content along with conversion info
res.json({
content: enhanced.children,
outputDirectory: outputDir,
warnings: warnings,
});
} catch (error) {
if (error.code === 'ENOENT') {
res.status(404).json({ error: `File not found: ${url}` });
} else {
res.status(500).json({
error: 'Error processing document',
details: error.message,
});
}
}
});

app.get('/pdf-convert', async (req, res) => {
const filePath = req.query.path;

Expand All @@ -234,7 +282,7 @@ app.get('/pdf-convert', async (req, res) => {

try {
// First convert Word to Markdown
const { markdownPath, outputDir } = await pdfToMarkdown(filePath);
const { markdownPath, warnings, outputDir } = await pdfToMarkdown(filePath);

// Then read and process the Markdown
const markdown = readFileSync(markdownPath, 'utf-8');
Expand All @@ -252,12 +300,11 @@ app.get('/pdf-convert', async (req, res) => {
});

const enhanced = await enhanceMdastNodesRecursive(mdast, outputDir);

// Return the processed content along with conversion info
res.json({
content: enhanced.children,
outputDirectory: outputDir,
// warnings: warnings,
warnings: warnings,
});
} catch (error) {
if (error.code === 'ENOENT') {
Expand Down
Loading

0 comments on commit c974d64

Please sign in to comment.