`block_pour_docx` but only for part of file to be poured #97

SchmidtPaul · 2023-01-04T09:13:04Z

SchmidtPaul
Jan 4, 2023

I am writing a docx-document using RMarkdown and would like to include parts of another file otherdoc.docx. I guess it is like using block_pour_docx, but only for parts of the document.

Let's say otherdoc.docx looks like this:

In my RMarkdown document I would like to work like so:

---
output: officedown::rdocx_document
---

```{r setup, include=FALSE}
library(officedown)
library(officer)
```

# Introduction

Some text and r chunks here

```{r}
source("somecode.R")
result
```

# Part 1

More text and r chunks here

```{r}
# Pour ONLY THE PART 1 content of the docx
block_pour_docx("otherdoc.docx")
```

# Part 2

More text and r chunks here

```{r}
# Pour ONLY THE PART 2 content of the docx
block_pour_docx("otherdoc.docx")
```

What are my options to achieve my goal here?

P.S.: Thanks for the fantastic officeverse

davidgohel · 2023-01-04T10:46:35Z

davidgohel
Jan 4, 2023
Maintainer

Hello

I don't know. There is no function like this in officer nor officedown.

Did you try with officer::cursor_* in a loop with officer::body_remove()?

SchmidtPaul · 2023-01-05T08:51:41Z

SchmidtPaul
Jan 5, 2023
Author

Ok, thanks - I see.
I've come up with a first solution that works as long as each part I want starts with a heading 1 whose text names it, so that the parts are separated by heading 1 headers. This is the R code that extracts one part (here, Part 2).

However, please find my question at the end.

slicedocx.R

library(officedown)
library(officer)
library(tidyverse)

# gather necessary info ---------------------------------------------------
doc <- read_docx("otherdoc.docx")
doc_summ <- docx_summary(doc)

index_max <- max(doc_summ$doc_index)

doc_slices <- doc_summ %>%
  filter(style_name == "heading 1") %>%
  transmute(text = text,
            index_start = doc_index,
            index_end = lead(doc_index)-1)


# get a part --------------------------------------------------------------
want <- "Part 2"

docpart <- doc

i <- doc_slices %>% 
  filter(text == want) %>% 
  pivot_longer(-text) %>% 
  pull(value)


# delete everything after part
if (!is.na(i[2])) {
  for (j in (i[2] + 1):index_max) {
    docpart <- docpart %>% cursor_end() %>% body_remove()
  }
}

# delete everything before part
if (i[1]>1) {
  for (j in 1:(i[1]-1)) {
    docpart <- docpart %>% cursor_begin() %>% body_remove()
  }
}

For comparison:

docx_summary(doc) # before
#    doc_index content_type     style_name                         text level num_id row_id is_header cell_id col_span row_span
# 1           1    paragraph      heading 1                       Part 1    NA     NA     NA        NA      NA       NA       NA
# 2           2    paragraph           <NA> This is some text in part 1.    NA     NA     NA        NA      NA       NA       NA
# 3           3    paragraph List Paragraph                     Part 1 a     1      1     NA        NA      NA       NA       NA
# 4           4    paragraph List Paragraph                     Part 1 b     1      1     NA        NA      NA       NA       NA
# 5           5    paragraph           <NA>     And more text in part 1.    NA     NA     NA        NA      NA       NA       NA
# 6           6    paragraph      heading 1                       Part 2    NA     NA     NA        NA      NA       NA       NA
# 7           7    paragraph           <NA> This is some text in part 2.    NA     NA     NA        NA      NA       NA       NA
# 1.1         8   table cell     Table Grid                    Tabhead A    NA     NA      1     FALSE       1        1        1
# 1.4         8   table cell     Table Grid                            1    NA     NA      2     FALSE       1        1        1
# 2.2         8   table cell     Table Grid                    Tabhead B    NA     NA      1     FALSE       2        1        1
# 2.5         8   table cell     Table Grid                            2    NA     NA      2     FALSE       2        1        1
# 3.3         8   table cell     Table Grid                    Tabhead C    NA     NA      1     FALSE       3        1        1
# 3.6         8   table cell     Table Grid                            3    NA     NA      2     FALSE       3        1        1
# 11          9    paragraph           <NA>                                 NA     NA     NA        NA      NA       NA       NA
# 12         10    paragraph      heading 1                       Part 3    NA     NA     NA        NA      NA       NA       NA
# 13         11    paragraph           <NA> This is some text in part 3.    NA     NA     NA        NA      NA       NA       NA

doc_slices
#      text index_start index_end
# 1  Part 1           1         5
# 6  Part 2           6         9
# 12 Part 3          10        NA

docx_summary(docpart) # after
#     doc_index content_type style_name                         text level num_id row_id is_header cell_id col_span row_span
# 1           1    paragraph  heading 1                       Part 2    NA     NA     NA        NA      NA       NA       NA
# 2           2    paragraph       <NA> This is some text in part 2.    NA     NA     NA        NA      NA       NA       NA
# 1.1         3   table cell Table Grid                    Tabhead A    NA     NA      1     FALSE       1        1        1
# 1.4         3   table cell Table Grid                            1    NA     NA      2     FALSE       1        1        1
# 2.2         3   table cell Table Grid                    Tabhead B    NA     NA      1     FALSE       2        1        1
# 2.5         3   table cell Table Grid                            2    NA     NA      2     FALSE       2        1        1
# 3.3         3   table cell Table Grid                    Tabhead C    NA     NA      1     FALSE       3        1        1
# 3.6         3   table cell Table Grid                            3    NA     NA      2     FALSE       3        1        1
# 11          4    paragraph       <NA>                                 NA     NA     NA        NA      NA       NA       NA
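
The start and end indices in doc_slices can be reproduced without a docx file. The following is a minimal base R sketch on a hand-built data frame that only mimics the doc_index, style_name and text columns of the docx_summary() output above. The data frame and the names summ, is_h1, starts and slices are illustrative assumptions, not part of officer.

```r
# Hand-built stand-in for docx_summary(doc); only three columns are mimicked.
summ <- data.frame(
  doc_index  = c(1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11),
  style_name = c("heading 1", NA, "List Paragraph", "List Paragraph", NA,
                 "heading 1", NA, "Table Grid", "Table Grid", NA,
                 "heading 1", NA),
  text       = c("Part 1", "text", "a", "b", "more",
                 "Part 2", "text", "Tabhead A", "1", "",
                 "Part 3", "text"),
  stringsAsFactors = FALSE
)

# Rows that open a part: style_name is "heading 1" (NA styles are skipped).
is_h1  <- !is.na(summ$style_name) & summ$style_name == "heading 1"
starts <- summ$doc_index[is_h1]

# A part ends one element before the next heading; the last part has no
# following heading, so its end stays NA (as in doc_slices above).
slices <- data.frame(
  text        = summ$text[is_h1],
  index_start = starts,
  index_end   = c(starts[-1] - 1, NA),
  stringsAsFactors = FALSE
)

print(slices)
# index_start is 1, 6, 10 and index_end is 5, 9, NA
```

Keeping NA for the last part matches the `if (!is.na(i[2]))` check in slicedocx.R, which skips the "delete everything after" loop when there is no following heading.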

@davidgohel How would I now pour the docpart into my RMarkdown file? If I understand it correctly I could print a new .docx file first in order to use block_pour_docx(), but there is probably an easier way? This obviously does not work:

mymarkdown.Rmd

---
output: officedown::rdocx_document
---

# Introduction

Some text and r chunks here

```{r}
source("slicedocx.R")
print(docpart)
```

More text and r chunks here

SchmidtPaul · 2023-02-01T10:59:22Z

SchmidtPaul
Feb 1, 2023
Author

Hey @davidgohel, I took another shot at this and would be thankful for a comment.

Basically, I created a function get_docx_part() that

1. reads a docx file,
2. removes everything except the part I am interested in. The part I am interested in is simply defined as everything after a certain header and before the next header (both of style heading 1).
3. prints a temporary docx file temp.docx

I can then use block_pour_docx("temp.docx") inside my RMarkdown file.

Function

library(officedown)
library(officer)
library(tidyverse)

get_docx_part <- function(infile, heading1title, outfile){
  
  # import and summary ------------------------------------------------------
  doc <- read_docx(infile)
  doc_summ <- docx_summary(doc)
  index_max <- max(doc_summ$doc_index)
  
  # info on all parts (i.e. sections separated by "heading 1" headers)
  doc_parts_info <- doc_summ %>%
    filter(style_name == "heading 1") %>%
    transmute(text = text,
              index_start = doc_index,
              index_end = lead(doc_index)-1)
  

  # delete unwanted parts ---------------------------------------------------
  # prepare
  docpart <- doc
  
  i <- doc_parts_info %>% 
    filter(text == heading1title) %>% 
    pivot_longer(-text) %>% 
    pull(value)
  
  # delete everything after part
  if (!is.na(i[2])) {
    for (j in (i[2] + 1):index_max) {
      docpart <- docpart %>% cursor_end() %>% body_remove()
    }
  }
  
  # delete everything before part
  if (i[1]>1) {
    for (j in 1:(i[1]-1)) {
      docpart <- docpart %>% cursor_begin() %>% body_remove()
    }
  }
  

  # print -------------------------------------------------------------------
  # create temporary docx file
  print(docpart, target = outfile)

}
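
get_docx_part() relies on exactly one heading 1 matching heading1title. As far as the code above shows, a title that matches no heading leaves `i` empty, and a duplicated title silently selects the first match. A hypothetical guard could make that assumption explicit before anything is deleted. This is a base R sketch; pick_part() and the example data frame are illustrative and not part of officer or of the function above.

```r
# Hypothetical guard: return the single row of `parts_info` whose text equals
# `title`, or stop with a readable message when there are 0 or several matches.
pick_part <- function(parts_info, title) {
  hit <- parts_info[which(parts_info$text == title), , drop = FALSE]
  if (nrow(hit) != 1) {
    stop(sprintf("Expected exactly one 'heading 1' titled '%s', found %d.",
                 title, nrow(hit)), call. = FALSE)
  }
  hit
}

# Stand-in for doc_parts_info, with a deliberately duplicated title.
parts <- data.frame(
  text        = c("Part 1", "Part 2", "Part 2"),
  index_start = c(1, 6, 10),
  index_end   = c(5, 9, NA),
  stringsAsFactors = FALSE
)

one <- pick_part(parts, "Part 1")                      # unique title: one row
dup <- try(pick_part(parts, "Part 2"), silent = TRUE)  # duplicated title: error
```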

Example Rmd

---
output: officedown::rdocx_document
---

# Introduction

Some text and r chunks here

```{r, echo=FALSE, message=FALSE}
source("get_docx_part.R") # get function

get_docx_part(infile = "BaseDocument.docx",
              heading1title = "Second header title",
              outfile = "temp.docx")

block_pour_docx("temp.docx")
```

# Conclusion

More text and r chunks here

Check results

Questions

As you can see, it works as intended so far. However,

1. What do you generally think about the approach?
2. I actually wanted to include block_pour_docx("temp.docx") in the get_docx_part() function, but it did not work inside the function.
3. I wanted to delete temp.docx immediately after pouring, but that led to an error.

davidgohel Feb 1, 2023
Maintainer

Hello

1. I did not test but it seems ok to me
2. yes, it needs to be executed in a knitr context, otherwise knitr::knit_print() is not called
3. wait the final document is created and then you can delete the file you want to pour
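
One reading of the second point: knitr calls knit_print() only on the visible value of a top-level chunk expression, so a block created in the middle of a function body is never printed. A sketch under that assumption returns the block as the function's value instead. pour_docx_part() and its `extract` argument are hypothetical names; `extract` stands for a function with the arguments of get_docx_part() from the post above.

```r
# Hypothetical wrapper: extract one part to `outfile`, then RETURN the pour
# block so that knitr can print it when the call is a top-level expression.
# `extract` must accept infile, heading1title and outfile.
pour_docx_part <- function(infile, heading1title, outfile, extract) {
  extract(infile = infile, heading1title = heading1title, outfile = outfile)
  # Last expression of the function: returned visibly, not printed here.
  officer::block_pour_docx(outfile)
}
```

In an Rmd chunk the call would then stand on its own line, for example `pour_docx_part("BaseDocument.docx", "Second header title", "temp.docx", get_docx_part)`, without assigning the result to a variable.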

SchmidtPaul · 2023-03-10T10:37:06Z

SchmidtPaul
Mar 10, 2023
Author

Hey @davidgohel, I have been using this method successfully, but realized it does not extend well to repeating the step multiple times in the same knitting process. This is because every time a "temp.docx" is created to be poured, it overwrites the previous one, and apparently the pouring only happens at the very end. So I ultimately end up with all poured sections being identical to the last one that was poured.

I guess this goes along with you telling me to "wait the final document is created and then you can delete the file you want to pour".

Any suggestions how I could tackle this issue?

davidgohel Mar 10, 2023
Maintainer

I would use tempfile(fileext = ".docx") instead of a hard-coded value ("temp.docx")
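
A minimal sketch of that suggestion: every call gets its own path from tempfile(), so a later part cannot overwrite an earlier one before the final document is built. extract_to_tempfile() and its `extract` argument are hypothetical names; `extract` stands for a function with the arguments of get_docx_part() from the earlier post.

```r
# Hypothetical helper: write one extracted part to a fresh temporary file and
# return the path of that file.
extract_to_tempfile <- function(infile, heading1title, extract) {
  outfile <- tempfile(fileext = ".docx")  # new, unused path on every call
  extract(infile = infile, heading1title = heading1title, outfile = outfile)
  outfile
}

# Stand-in for get_docx_part(), used here only to show that paths differ.
stub_extract <- function(infile, heading1title, outfile) {
  writeLines(heading1title, outfile)
}

p1 <- extract_to_tempfile("BaseDocument.docx", "Part 1", stub_extract)
p2 <- extract_to_tempfile("BaseDocument.docx", "Part 2", stub_extract)
p1 != p2  # TRUE: each part lives in its own file
```

In the Rmd, each chunk would then call `block_pour_docx(extract_to_tempfile("BaseDocument.docx", "Part 1", get_docx_part))` as a top-level expression. Files created this way live in the session's temporary directory, which R removes when the session ends, so no manual cleanup is needed after knitting.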
