plugin to scrape website & convert HTML to markdown #923

100ideas · 2017-09-30T11:41:09Z

I'm enjoying boostnote after switching from evernote & quiver.app - thank you to everyone who has contributed to this promising open source tool.

I keep a "code" notebook for technical notes-to-self and today I wanted to add a "clipping" of a blog post to it. I wasn't sure what the best way was (sometimes I try copying-and-pasting directly from the browser, which worked OK in quiver's rich-text note mode... but rtf, gross), so I tried out a few tools for automatically converting from HTML to markdown.

pandoc has a command-line option to fetch content from URL and can convert to/from HTML, markdown, and many other formats. Install on osx with brew install pandoc, then:

pandoc -f html --normalize --wrap=none -t markdown_github+backtick_code_blocks+autolink_bare_uris -o output.md <URL>

as a handy fish shell function:

❯ function panscrape --description='usage: panscrape [URL] > blog_clipping.md'
      pandoc -f html --normalize --wrap=none -t markdown_github+backtick_code_blocks+autolink_bare_uris $argv
  end
❯ funcsave panscrape
❯ panscrape "https://shapeshed.com/command-line-utilities-with-nodejs/" > clipping.md

# or to copy directly to system clipboard
❯ panscrape "https://shapeshed.com/command-line-utilities-with-nodejs/" | pbcopy

Pandoc does an OK job but is definitely not perfect, so some manual editing of the output may be necessary, for instance deleting header & footer content.
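For a plugin, the same pandoc call could be made from Node. This is a minimal sketch, assuming pandoc is installed and on the PATH. The names `pandocArgs` and `panscrape` are only illustrative, and the flags are adjusted on the assumption of pandoc 2.0 or later, where (as far as I know) `--normalize` was removed and `gfm` superseded `markdown_github`:

```javascript
const { execFile } = require("child_process");

// Build the pandoc argument list for converting a web page to Markdown.
// Assumption: pandoc >= 2.0, so `gfm` replaces `markdown_github` and
// `--normalize` is left out.
function pandocArgs(url) {
  return ["-f", "html", "-t", "gfm", "--wrap=none", url];
}

// Run pandoc on the URL and resolve with the Markdown printed on stdout.
function panscrape(url) {
  return new Promise((resolve, reject) => {
    execFile(
      "pandoc",
      pandocArgs(url),
      { maxBuffer: 16 * 1024 * 1024 }, // allow large pages
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}
```

Usage would look like `panscrape(url).then(md => console.log(md))`. Passing the arguments as an array to `execFile` avoids shell quoting problems with URLs that contain `&` or `?`.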

If you don't want to install anything, fuckyeahmarkdown.com seems to have an alright hosted converter.

feature request

Add a command (plugin?) to Boostnote that takes a URL as input, scrapes the page, converts the html to markdown, and creates a new note filled with the result.

Starting points:

node-europa "is a Node.js module for converting HTML into valid Markdown that uses the Europa Core engine."
scrape-markdown CLI tool based on node-europa
- note: npm package is out-of-date and does not work; install from source repo with npm install github:evangoer/scrape-markdown
- run locally ./node_modules/.bin/scrape-markdown [URL]
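The requested command boils down to three steps: fetch, convert, create a note. Here is a rough sketch of that pipeline, not Boostnote's actual API. `importNoteFromUrl`, the injected `fetchHtml` and `htmlToMarkdown` functions, and the shape of the returned note object are all assumptions made for illustration:

```javascript
// Hypothetical pipeline: fetch a page, convert it, and build a note object.
// `fetchHtml` and `htmlToMarkdown` are injected, so any fetcher or converter
// (node-europa, reMarked.js, ...) can be plugged in. The note shape is made up.
async function importNoteFromUrl(url, fetchHtml, htmlToMarkdown) {
  const html = await fetchHtml(url);
  const markdown = htmlToMarkdown(html).trim();

  // Use the first ATX-style Markdown heading as the title, else the URL.
  const heading = markdown.match(/^#{1,6}[ \t]+(.+)$/m);

  return {
    title: heading ? heading[1].trim() : url,
    content: markdown,
    source: url,
  };
}
```

Keeping the converter as a parameter means the choice of HTML-to-markdown library can be swapped without touching the rest of the command.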

I would be happy to help with implementation.

#405

IssueHunt Summary

awolf81 has been rewarded.

Backers (Total: $100.00)

boostio ($100.00)

Submitted pull Requests

#3099 Html to md feature



kentchiu · 2017-09-30T12:54:50Z

I use the copy as markdown plugin of chrome. I find it very convenient.

100ideas · 2017-09-30T21:36:10Z

Thanks for sharing! Looks like copy-as-markdown uses reMarked.js internally, another option besides node-europa for the putative Boostnote plugin.

Screenshot comparing reMarked.js vs pandoc - reMarked has trouble parsing the code blocks for some reason:

reMarked code blocks fenced with 'true'?

This is funny. I couldn't figure out why the reMarked demo was fencing code blocks with 'true'. I think it's just a mistake in how the reMarker object is configured on the demo page:

// code blocks will be delimited with the string 'true'
var reMarker = new reMarked({gfm_code: true});

// this is what we want
// try it by pasting into the console at reMarked demo site
var reMarker = new reMarked({gfm_code: "```"});
reMarker.render(document.getElementById('html-inp').value)
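To illustrate why a boolean would end up as the fence, here is a tiny stand-in. It is only a guess at the mechanism: `fenceCode` is a hypothetical function, not reMarked.js's actual source, and it simply assumes the option value is used verbatim as the delimiter string:

```javascript
// Hypothetical stand-in for a renderer that uses the `gfm_code` option
// verbatim as the fence string. NOT reMarked.js's real implementation.
function fenceCode(code, options) {
  const fence = String(options.gfm_code); // `true` becomes the text "true"
  return [fence, code, fence].join("\n");
}

// Boolean option: the code ends up wrapped in lines reading "true".
console.log(fenceCode("chmod u+x yourscript", { gfm_code: true }));

// String option: the code is wrapped in proper backtick fences.
console.log(fenceCode("chmod u+x yourscript", { gfm_code: "```" }));
```

Under that assumption the first call prints `true`, the command, then `true` again, which matches what the demo page shows.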

example reMarked.js output w/ {gfm_code: "```"}:

The basics
----------

To create an executable Node.js script all you need is a Node.js shebang at the top of the script and then some code to execute.

```
#!/usr/bin/env node

console.log('hello world');
```

Assuming you are on a UNIX like system you can do this to make the script executable

```
chmod u+x yourscript
```

Now you can run it and you should see ‘hello world’ printed.

```
./yourscript
hello world
```

Handling arguments
------------------

As you get beyond basic scripts you’ll want to pass arguments into the script. The arguments passed to a script are available as `process.argv`.

If you pass arguments to the simple example above and add `console.log(process.argv)` you’ll see the arguments are available as an array. For example if you run

conclusion

I think reMarked.js - when properly configured - produces better output compared with pandoc, and possibly node-europa.

tgrrr · 2018-03-15T22:55:17Z

I just found copycat, and am testing it against copy as markdown (no affiliation). Combined with One Tab, my research aka open tabs aka browsing history are becoming useful articles and lists.

IssueHuntBot · 2018-05-08T17:54:20Z

@kazup01 has boosted this issue with $100. Visit this issue on Issuehunt

IssueHuntBot · 2018-05-26T04:39:44Z

@StormBurpee has started working. Visit this issue on Issuehunt

IssueHuntBot · 2018-05-26T07:35:50Z

@StormBurpee has submitted output. Visit this issue on Issuehunt

StormBurpee · 2018-05-26T07:47:16Z

Hey guys, feel free to take a look at the pull request I made for this feature over at #1981
Based on the URL you suggested in the original post, it works great. I've been testing it with a bunch of other websites that I look at, including ones you probably wouldn't expect to work.

In the issue I've attached a few example photos for you to see.

IssueHuntBot · 2018-07-19T08:05:06Z

@Rokt33r has stopped working. Visit this issue on Issuehunt

IssueHuntBot · 2018-08-28T11:32:00Z

@kazup01 cancelled funding, $100, of this issue. Visit this issue on Issuehunt

IssueHuntBot · 2018-08-28T11:32:11Z

@BoostIO funded this issue with $100. Visit this issue on Issuehunt

IssueHuntBot · 2018-10-05T13:20:09Z

@edokan has started working. Visit this issue on Issuehunt

liuhoward · 2019-03-19T16:07:59Z

a good web clipper: https://github.com/mika-cn/maoxian-web-clipper/

laike9m · 2019-06-02T01:27:55Z

Would a web clipper (like Evernote's browser extension) be a better solution for this?

issuehunt-oss · 2019-09-05T02:49:53Z

@ZeroX-DG has rewarded $90.00 to @AWolf81. See it on IssueHunt

💰 Total deposit: $100.00
🎉 Repository reward(0%): $0.00
🔧 Service fee(10%): $10.00

Flexo013 · 2019-10-20T14:47:41Z

This feature is now available as of 0.13.0, when creating a new note.

100ideas · 2019-10-22T21:23:32Z

sweet!

100ideas mentioned this issue Sep 30, 2017

demo outputs code blocks fenced with 'true' not '```' leeoniya/reMarked.js#47

Open

Rokt33r added the feature request 🌟 label Mar 16, 2018

StormBurpee mentioned this issue May 26, 2018

Import note from url with markdown #1981

Closed

StormBurpee mentioned this issue May 26, 2018

Browser extension(s) to save web pages to Boostnote #1356

Open

kazup01 added the bounty label Nov 7, 2018

nickdotht added a commit to nickdotht/Boostnote that referenced this issue May 18, 2019

[Patch BoostIO#1356 & BoostIO#923] Update to latest source

41838b2

AWolf81 mentioned this issue Jun 29, 2019

Html to md feature #3099

Merged

Flexo013 added the funded on issuehunt 💵 label and removed the 💵 Funded on IssueHunt label Jul 25, 2019

issuehunt-oss bot added the 🎁 Rewarded on Issuehunt label Sep 5, 2019

Flexo013 removed the funded on issuehunt 💵 label Sep 5, 2019

Flexo013 mentioned this issue Oct 9, 2019

HTML code blocks copied/pasted from HTML pages not converted to Markdown format correctly #2913

Closed

Flexo013 closed this as completed Oct 20, 2019

Flexo013 added the rewarded on issuehunt 🎁 label Feb 16, 2020

Flexo013 removed the 🎁 Rewarded on Issuehunt label Feb 16, 2020

100ideas mentioned this issue Nov 11, 2020

using nb with other markdown notebook editors xwmx/nb#68

Open
