This converts HTML files to org-mode, focusing on keeping the formatted text (no embedded span or div in the output). It is based on Google’s gumbo parser. It comes in two versions (Python and Nim), which are kept in sync (they do the same thing) as much as possible.
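To illustrate the idea of keeping formatting while dropping span and div wrappers, here is a minimal sketch. It is not the project's code: it uses Python's standard `html.parser` instead of gumbo, and the names `OrgSketch`, `html_to_org_sketch`, `INLINE_MARKERS` and `HEADINGS` are made up for this example. It only covers a handful of tags.

```python
from html.parser import HTMLParser

# Hypothetical mapping from inline HTML tags to org-mode emphasis markers.
INLINE_MARKERS = {"b": "*", "strong": "*", "i": "/", "em": "/", "code": "~", "u": "_"}
# Hypothetical mapping from heading tags to org-mode headline prefixes.
HEADINGS = {"h1": "* ", "h2": "** ", "h3": "*** "}


class OrgSketch(HTMLParser):
    """Toy converter: keeps formatting, drops span/div wrappers."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_MARKERS:
            self.parts.append(INLINE_MARKERS[tag])
        elif tag in HEADINGS:
            self.parts.append("\n" + HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n")
        # Any other tag (span, div, ...) produces no output of its own.

    def handle_endtag(self, tag):
        if tag in INLINE_MARKERS:
            self.parts.append(INLINE_MARKERS[tag])
        elif tag in HEADINGS or tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text is kept as-is; only the surrounding markup changes.
        self.parts.append(data)


def html_to_org_sketch(html):
    """Return a rough org-mode rendering of an HTML string."""
    parser = OrgSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


if __name__ == "__main__":
    sample = "<div><h1>Title</h1><p>Some <span><b>bold</b></span> text.</p></div>"
    print(html_to_org_sketch(sample))
```

The div and span tags leave no trace in the output, while the heading and bold text are mapped to their org-mode equivalents.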
- Install gumbo
- Run your executable of choice with one HTML file (or URL) as its argument. The org-mode output goes to stdout.
- Install the gumbo-parser Python binding
pip install gumbo (add --user for a per-user install)
- To compile the Nim executable (tested with Nim 0.18):
cd nim
nim c html_to_org.nim
Requires libgumbo-dev.
- handle HTML anchor/fragment links
- probably needs a uid for each header
- fix incorrect text wrapping in the Nim version
- “browse the web in org-mode” mode
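One possible approach to the anchor/fragment and uid items is sketched below. It is only an assumption about how this could be done, not something the project implements: each header gets a unique org-mode `CUSTOM_ID` property, and fragment links become `[[#id][text]]` links. The helper names `slugify`, `org_header` and `org_fragment_link`, and the slug scheme itself, are hypothetical.

```python
import re


def slugify(title):
    """Derive an id from a header title (hypothetical scheme)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "section"


def org_header(level, title, seen, html_id=None):
    """Emit an org headline with a CUSTOM_ID property drawer.

    `seen` is a dict used to keep generated ids unique across the document.
    If the HTML element already had an id, it is reused unchanged so that
    existing fragment links keep pointing at the right header.
    """
    if html_id:
        uid = html_id
    else:
        base = slugify(title)
        count = seen.get(base, 0)
        seen[base] = count + 1
        uid = base if count == 0 else f"{base}-{count}"
    stars = "*" * level
    return f"{stars} {title}\n:PROPERTIES:\n:CUSTOM_ID: {uid}\n:END:"


def org_fragment_link(fragment, text):
    """Turn <a href="#fragment">text</a> into an org internal link."""
    return f"[[#{fragment}][{text}]]"


if __name__ == "__main__":
    seen = {}
    print(org_header(1, "Getting Started", seen))
    print(org_fragment_link("getting-started", "see the intro"))
```

Reusing the original HTML id when one exists keeps in-page links working without any rewriting; the generated slug is only a fallback for headers that have none.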