[BUG] Text extraction for lists does not work as expected #508

traitman · 2023-01-25T15:46:00Z

Description

Text that sits on the same line as its list number in the PDF is no longer on the same line after extraction to text.

For example:

6. Miu to kekkon shitai desu.

becomes

6.

Miu to kekkon shitai desu.

Expected Behavior

Text extraction for lists should keep each number on the same line as its text. The result for the first list should be:

6. Miu to kekkon shitai desu.

7. TENDŌ RAIZŌ: Ore.... Miu .... (Fun) 
8. ŌZOYA HARUYA: A... sumimasen. Boku wa Miu-san to kekkon shitai desu. 

9. O, o-jō-san o kudasai.

10. TENDŌ RAIZŌ: Kimi wa... Nani ga hoshii? Kane ka? Ie ka? A?

11. Soretomo uchi no kaisha ga hoshii no ka?

12. TENŌ MIU: Papa!

The second list should result in:

ENGLISH

1. ŌZORA HARUYA: (Tendō Family)Ah, nice to meet you. I am Ōzora Haruya.
2. TENZO RAIZO: Huh?
3. ŌZORAHARUYA: ah, um...sir, please give me your daughter.
4. TENDOU RAIZO: Huh?
5. ŌZORA HARUYA: I...I want to be with Miu forever.
6. I want to marry Miu. 
7. TENDO RAIZO: I? Miu? (harrumph)
8. ŌZORA HARUYA: I...I'm sorry. I want to marry Miu.

Actual Behavior

List extraction separates the numbers from their text.

For example, the first list is extracted as:

6.

Miu to kekkon shitai desu.

7. TENDŌ RAIZŌ: Ore.... Miu .... (Fun) 
8. ŌZOYA HARUYA: A... sumimasen. Boku wa Miu-san to kekkon shitai desu. 

9.

10.

11.

12.

O, o-jō-san o kudasai.

TENDŌ RAIZŌ: Kimi wa... Nani ga hoshii? Kane ka? Ie ka? A?

Soretomo uchi no kaisha ga hoshii no ka?

TENŌ MIU: Papa!

The second list is extracted as:

ENGLISH

1. ŌZORA HARUYA: 
2. TENZO RAIZO: 
3. ŌZORAHARUYA: 
4. TENDOU RAIZO: 
5. ŌZORA HARUYA: 
6. I want to marry Miu. 
7. TENDO RAIZO: 
8. ŌZORA HARUYA:  (Tendō Family)Ah, nice to meet you. I am Ōzora Haruya.

Huh?

ah, um...sir, please give me your daughter.

Huh?

I...I want to be with Miu forever.

I? Miu? (harrumph)

I...I'm sorry. I want to marry Miu.

Attachments

Include a self-contained reproducible code snippet and PDF file that demonstrates the issue.
B_S4L4_p4_github.pdf

package main

import (
	"fmt"
	"os"

	"github.com/unidoc/unipdf/v3/extractor"
	"github.com/unidoc/unipdf/v3/model"
)

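func main() {
	// Require the PDF path as the first command-line argument.
	if len(os.Args) < 2 {
		fmt.Printf("Usage: go run main.go input.pdf\n")
		os.Exit(1)
	}

	if err := outputPdfText(os.Args[1]); err != nil {
		panic(err)
	}
}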

// outputPdfText prints out contents of PDF file to stdout.
func outputPdfText(inputPath string) error {
	f, err := os.Open(inputPath)
	if err != nil {
		return err
	}

	defer f.Close()

	pdfReader, err := model.NewPdfReader(f)
	if err != nil {
		return err
	}

	numPages, err := pdfReader.GetNumPages()
	if err != nil {
		return err
	}

	fmt.Printf("--------------------\n")
	fmt.Printf("PDF to text extraction:\n")
	fmt.Printf("--------------------\n")
	for i := 0; i < numPages; i++ {
		pageNum := i + 1

		page, err := pdfReader.GetPage(pageNum)
		if err != nil {
			return err
		}

		ex, err := extractor.New(page)
		if err != nil {
			return err
		}

		pt, _, _, err := ex.ExtractPageText()
		if err != nil {
			return err
		}

		text := pt.Text()
		// text, err := ex.ExtractText()
		// if err != nil {
		// 	return err
		// }

		fmt.Println("------------------------------")
		fmt.Printf("Page %d:\n", pageNum)
		fmt.Printf("\"%s\"\n", text)
		fmt.Println("------------------------------")
	}

	return nil
}


github-actions · 2023-01-25T15:46:36Z

Welcome! Thanks for posting your first issue. The way things work here is that while customer issues are prioritized, other issues go into our backlog where they are assessed and fitted into the roadmap when suitable. If you need to get this done, consider buying a license which also enables you to use it in your commercial products. More information can be found on https://unidoc.io/

traitman · 2023-01-25T15:52:43Z

related: #38

anovik · 2024-11-25T06:38:20Z

@traitman We improved the simple mode of text extraction in the latest release of UniPDF https://github.com/unidoc/unipdf/releases/tag/v3.64.0 and were able to get the correct result with your file.

Please update UniPDF to the latest version.

Here is an example of how to enable simple mode: https://github.com/unidoc/unipdf-examples/blob/master/extract/pdf_simple_extraction.go. You just need to set UseSimplerExtractionProcess: true in the extractor options.

I am closing the issue, feel free to re-open or post a new one in case of any problem.

anovik closed this as completed Nov 25, 2024
