[BUG] Text extraction for lists does not work as expected #508

Closed
traitman opened this issue Jan 25, 2023 · 3 comments

@traitman
traitman commented Jan 25, 2023

Description

Text that sits on a single line of a list in the PDF no longer sits on a single line after extraction to text.

For example:

6. Miu to kekkon shitai desu.

becomes

6.

Miu to kekkon shitai desu.

Expected Behavior

Text extraction for lists should keep each list number on the same line as its text. The first list should result in:

6. Miu to kekkon shitai desu.

7. TENDŌ RAIZŌ: Ore.... Miu .... (Fun) 
8. ŌZOYA HARUYA: A... sumimasen. Boku wa Miu-san to kekkon shitai desu. 

9. O, o-jō-san o kudasai.

10. TENDŌ RAIZŌ: Kimi wa... Nani ga hoshii? Kane ka? Ie ka? A?

11. Soretomo uchi no kaisha ga hoshii no ka?

12. TENŌ MIU: Papa!

The second list should result in:

ENGLISH

1. ŌZORA HARUYA: (Tendō Family)Ah, nice to meet you. I am Ōzora Haruya.
2. TENZO RAIZO: Huh?
3. ŌZORAHARUYA: ah, um...sir, please give me your daughter.
4. TENDOU RAIZO: Huh?
5. ŌZORA HARUYA: I...I want to be with Miu forever.
6. I want to marry Miu. 
7. TENDO RAIZO: I? Miu? (harrumph)
8. ŌZORA HARUYA: I...I'm sorry. I want to marry Miu. 

Actual Behavior

The list extraction is buggy. For example:

[image]

The first list is extracted as:

6.

Miu to kekkon shitai desu.

7. TENDŌ RAIZŌ: Ore.... Miu .... (Fun) 
8. ŌZOYA HARUYA: A... sumimasen. Boku wa Miu-san to kekkon shitai desu. 

9.

10.

11.

12.

O, o-jō-san o kudasai.

TENDŌ RAIZŌ: Kimi wa... Nani ga hoshii? Kane ka? Ie ka? A?

Soretomo uchi no kaisha ga hoshii no ka?

TENŌ MIU: Papa!

The second list is extracted as:

ENGLISH

1. ŌZORA HARUYA: 
2. TENZO RAIZO: 
3. ŌZORAHARUYA: 
4. TENDOU RAIZO: 
5. ŌZORA HARUYA: 
6. I want to marry Miu. 
7. TENDO RAIZO: 
8. ŌZORA HARUYA:  (Tendō Family)Ah, nice to meet you. I am Ōzora Haruya.

Huh?

ah, um...sir, please give me your daughter.

Huh?

I...I want to be with Miu forever.

I? Miu? (harrumph)

I...I'm sorry. I want to marry Miu.
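
The first symptom above (a list number left alone on its own line) can be checked mechanically in the extracted text. The following is only a minimal sketch: `orphanedMarkers` and `bareMarker` are hypothetical names, not part of UniPDF, and the check assumes list markers have the form of digits followed by a period. It does not detect the second list's failure mode, where a speaker label stays with the number but the dialogue is moved below.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bareMarker matches a line that holds only a list number such as "6." or "12.".
// Assumption: markers are digits followed by a single period.
var bareMarker = regexp.MustCompile(`^\d+\.$`)

// orphanedMarkers (hypothetical helper) returns every list marker that appears
// alone on a line, i.e. a marker that was separated from its item text.
func orphanedMarkers(text string) []string {
	var found []string
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if bareMarker.MatchString(line) {
			found = append(found, line)
		}
	}
	return found
}

func main() {
	// Shortened samples taken from the outputs quoted in this report.
	broken := "6.\n\nMiu to kekkon shitai desu.\n\n7. TENDŌ RAIZŌ: Ore.... Miu .... (Fun)\n\n9.\n\n10.\n"
	expected := "6. Miu to kekkon shitai desu.\n7. TENDŌ RAIZŌ: Ore.... Miu .... (Fun)\n"

	fmt.Println(orphanedMarkers(broken))        // [6. 9. 10.]
	fmt.Println(len(orphanedMarkers(expected))) // 0
}
```

Running the same check on the `text` value printed by the program in the Attachments section would show whether a given UniPDF version still splits markers from their items.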

Attachments

Include a self-contained reproducible code snippet and PDF file that demonstrates the issue.
B_S4L4_p4_github.pdf

package main

import (
	"fmt"
	"os"

	"github.com/unidoc/unipdf/v3/extractor"
	"github.com/unidoc/unipdf/v3/model"
)

func main() {
	if err := outputPdfText(os.Args[1]); err != nil {
		panic(err)
	}
}

// outputPdfText prints out contents of PDF file to stdout.
func outputPdfText(inputPath string) error {
	f, err := os.Open(inputPath)
	if err != nil {
		return err
	}

	defer f.Close()

	pdfReader, err := model.NewPdfReader(f)
	if err != nil {
		return err
	}

	numPages, err := pdfReader.GetNumPages()
	if err != nil {
		return err
	}

	fmt.Printf("--------------------\n")
	fmt.Printf("PDF to text extraction:\n")
	fmt.Printf("--------------------\n")
	for i := 0; i < numPages; i++ {
		pageNum := i + 1

		page, err := pdfReader.GetPage(pageNum)
		if err != nil {
			return err
		}

		ex, err := extractor.New(page)
		if err != nil {
			return err
		}

		pt, _, _, err := ex.ExtractPageText()
		if err != nil {
			return err
		}

		text := pt.Text()
		// text, err := ex.ExtractText()
		// if err != nil {
		// 	return err
		// }

		fmt.Println("------------------------------")
		fmt.Printf("Page %d:\n", pageNum)
		fmt.Printf("\"%s\"\n", text)
		fmt.Println("------------------------------")
	}

	return nil
}
@github-actions

Welcome! Thanks for posting your first issue. The way things work here is that while customer issues are prioritized, other issues go into our backlog where they are assessed and fitted into the roadmap when suitable. If you need to get this done, consider buying a license which also enables you to use it in your commercial products. More information can be found on https://unidoc.io/

@traitman

related: #38

@anovik

anovik commented Nov 25, 2024

@traitman We improved the simple mode of text extraction in the latest release of UniPDF https://github.com/unidoc/unipdf/releases/tag/v3.64.0 and were able to get the correct result with your file.

Please update UniPDF to the latest version.

This example shows how to enable simple mode: https://github.com/unidoc/unipdf-examples/blob/master/extract/pdf_simple_extraction.go (you just need to set UseSimplerExtractionProcess: true in the extractor options).

I am closing the issue; feel free to reopen it or post a new one if any problem remains.

@anovik anovik closed this as completed Nov 25, 2024