plugin to scrape website & convert HTML to markdown #923

Closed
100ideas opened this issue Sep 30, 2017 · 16 comments
Labels
feature request 🌟 Issue is a new feature request. rewarded on issuehunt 🎁 Issue has been resolved and a contributor has been rewarded.


100ideas commented Sep 30, 2017

I'm enjoying boostnote after switching from evernote & quiver.app - thank you to everyone who has contributed to this promising open source tool.

I keep a "code" notebook for technical notes-to-self and today I wanted to add a "clipping" of a blog post to it. I wasn't sure what the best way was (sometimes I try copying-and-pasting directly from the browser, which worked OK in quiver's rich-text note mode... but rtf, gross), so I tried out a few tools for automatically converting from HTML to markdown.

pandoc has a command-line option to fetch content from URL and can convert to/from HTML, markdown, and many other formats. Install on osx with brew install pandoc, then:

pandoc -f html --normalize --wrap=none -t markdown_github+backtick_code_blocks+autolink_bare_uris -o output.md <URL>

as a handy fish shell function:

function panscrape --description='usage: panscrape [URL] > blog_clipping.md'
      pandoc -f html --normalize --wrap=none -t markdown_github+backtick_code_blocks+autolink_bare_uris $argv
  end
funcsave panscrape

❯ panscrape "https://shapeshed.com/command-line-utilities-with-nodejs/" > clipping.md

# or to copy directly to system clipboard
❯ panscrape "https://shapeshed.com/command-line-utilities-with-nodejs/" | pbcopy

Pandoc does an OK job but is definitely not perfect, so some manual editing of the output may be necessary, for instance deleting header & footer content.
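
For anyone who would rather call pandoc from Node than from fish, here is a minimal sketch of the same idea. It assumes pandoc is installed and on the PATH; `buildPandocArgs` and `panscrape` are illustrative names, and the flags are simply copied from the command above.

```javascript
// Sketch only: a Node.js counterpart to the panscrape fish function.
// Assumes pandoc is installed and on PATH. buildPandocArgs and panscrape
// are illustrative names, not part of pandoc or Boostnote.
const { execFile } = require('node:child_process');

// Same output format string as the command in this post.
const OUTPUT_FORMAT =
  'markdown_github+backtick_code_blocks+autolink_bare_uris';

function buildPandocArgs(url) {
  // Only accept http(s) URLs so a file path or option-like string
  // is never handed to pandoc. new URL() throws on malformed input.
  const parsed = new URL(url);
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`unsupported protocol: ${parsed.protocol}`);
  }
  return [
    '-f', 'html',
    '--normalize',
    '--wrap=none',
    '-t', OUTPUT_FORMAT,
    parsed.href,
  ];
}

function panscrape(url) {
  // execFile avoids a shell, so the URL is passed as a single argument.
  return new Promise((resolve, reject) => {
    execFile(
      'pandoc',
      buildPandocArgs(url),
      { maxBuffer: 16 * 1024 * 1024 },
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}
```

Usage would look like `panscrape(url).then(md => console.log(md))`. Newer pandoc releases (2.x and later) dropped `--normalize` and deprecate `markdown_github` in favor of `gfm`, so the flags may need adjusting there.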

[screenshot: html-md-conversion-pandoc-boostnote]

If you don't want to install anything, fuckyeahmarkdown.com seems to have an alright hosted converter.

feature request

Add a command (plugin?) to Boostnote that takes a URL as input, scrapes the page, converts the html to markdown, and creates a new note filled with the result.

Starting points:

  • node-europa "is a Node.js module for converting HTML into valid Markdown that uses the Europa Core engine."
  • scrape-markdown CLI tool based on node-europa
    • note: npm package is out-of-date and does not work; install from source repo with npm install github:evangoer/scrape-markdown
    • run locally ./node_modules/.bin/scrape-markdown [URL]

I would be happy to help with implementation.
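
To make the request concrete, the three steps (scrape, convert, create note) could be wired together roughly like this. This is only a sketch under assumptions: the note shape, `createNoteFromUrl`, `extractTitle`, and the injected `fetchHtml`/`convert` functions are hypothetical stand-ins, not Boostnote APIs. `convert` could wrap node-europa or any other HTML-to-Markdown converter.

```javascript
// Sketch of the requested flow: URL -> HTML -> Markdown -> new note.
// All names and the note shape below are hypothetical, not Boostnote APIs.

// Use the page <title> as the note title, falling back to a given string.
function extractTitle(html, fallback) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  const title = match ? match[1].replace(/\s+/g, ' ').trim() : '';
  return title || fallback;
}

// fetchHtml and convert are injected so any scraper or converter can be used.
async function createNoteFromUrl(url, { fetchHtml, convert }) {
  const html = await fetchHtml(url);     // step 1: scrape the page
  const markdown = await convert(html);  // step 2: HTML -> Markdown
  return {                               // step 3: fill a new note
    title: extractTitle(html, url),
    content: `${markdown.trim()}\n\n[source](${url})\n`,
  };
}

// One possible fetcher, using the global fetch available in Node 18+.
async function fetchHtml(url) {
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`GET ${url} failed with status ${res.status}`);
  }
  return res.text();
}
```

Keeping the converter as a parameter means the choice between node-europa, pandoc, or something else can be made later without touching the note-creation step.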

#405


IssueHunt Summary

awolf81 has been rewarded.

Backers (Total: $100.00)


@kentchiu

I use the Copy as Markdown extension for Chrome. I find it very convenient.

@100ideas
Author

100ideas commented Sep 30, 2017

Thanks for sharing! Looks like copy-as-markdown uses reMarked.js internally, another option besides node-europa for the putative Boostnote plugin.

Screenshot comparing reMarked.js vs pandoc - reMarked has trouble parsing the code blocks for some reason:
html-md-conversion-pandoc-remarked_vs_pandoc

reMarked code blocks fenced with 'true'?

This is funny. I couldn't figure out why the reMarked demo was fencing code blocks with 'true'. I think it's just a mistake in how the reMarker object is configured on the demo page:

// code blocks will be delimited with the string 'true'
var reMarker = new reMarked({gfm_code: true});

// this is what we want
// try it by pasting into the console at reMarked demo site
var reMarker = new reMarked({gfm_code: "```"});
reMarker.render(document.getElementById('html-inp').value)

example reMarked.js output w/ {gfm_code: "```"}:

The basics
----------

To create an executable Node.js script all you need is a Node.js shebang at the top of the script and then some code to execute.

```
#!/usr/bin/env node

console.log('hello world');
```

Assuming you are on a UNIX like system you can do this to make the script executable

```
chmod u+x yourscript
```

Now you can run it and you should see ‘hello world’ printed.

```
./yourscript
hello world
```

Handling arguments
------------------

As you get beyond basic scripts you’ll want to pass arguments into the script. The arguments passed to a script are available as `process.argv`.

If you pass arguments to the simple example above and add `console.log(process.argv)` you’ll see the arguments are available as an array. For example if you run

conclusion

I think reMarked.js - when properly configured - produces better output compared with pandoc, and possibly node-europa.

@tgrrr

tgrrr commented Mar 15, 2018

I just found copycat, and am testing it against copy as markdown (no affiliation). Combined with One Tab, my research aka open tabs aka browsing history are becoming useful articles and lists.

@Rokt33r Rokt33r added the feature request 🌟 Issue is a new feature request. label Mar 16, 2018
@IssueHuntBot

@kazup01 has boosted this issue with $100. Visit this issue on Issuehunt

@IssueHuntBot

@StormBurpee has started working. Visit this issue on Issuehunt

@IssueHuntBot

@StormBurpee has submitted output. Visit this issue on Issuehunt

@StormBurpee
Contributor

Hey guys, feel free to take a look at the pull request I made for this feature over at #1981
Based on the URL you suggested in the original post, it works great. I've been testing it with a bunch of other websites that I look at, including ones that you probably wouldn't expect to work.

In the issue I've attached a few example photos for you to see.

@IssueHuntBot

@Rokt33r has stopped working. Visit this issue on Issuehunt

@IssueHuntBot

@kazup01 cancelled funding, $100, of this issue. Visit this issue on Issuehunt

@IssueHuntBot

@BoostIO funded this issue with $100. Visit this issue on Issuehunt

@IssueHuntBot

@edokan has started working. Visit this issue on Issuehunt

@kazup01 kazup01 added the bounty label Nov 7, 2018
@liuhoward

a good web clipper: https://github.com/mika-cn/maoxian-web-clipper/

nickdotht added a commit to nickdotht/Boostnote that referenced this issue May 18, 2019
@laike9m

laike9m commented Jun 2, 2019

Would a web clipper (like Evernote's browser extension) be a better solution for this?

@Flexo013 Flexo013 added funded on issuehunt 💵 Issue has received funding that will be rewarded to the contributor solving this issue. and removed 💵 Funded on IssueHunt labels Jul 25, 2019
@issuehunt-oss

issuehunt-oss bot commented Sep 5, 2019

@ZeroX-DG has rewarded $90.00 to @AWolf81. See it on IssueHunt

  • 💰 Total deposit: $100.00
  • 🎉 Repository reward(0%): $0.00
  • 🔧 Service fee(10%): $10.00

@Flexo013 Flexo013 removed the funded on issuehunt 💵 Issue has received funding that will be rewarded to the contributor solving this issue. label Sep 5, 2019
@Flexo013
Contributor

This feature is now available as of 0.13.0, when creating a new note:

image

@100ideas
Author

sweet!

@Flexo013 Flexo013 added the rewarded on issuehunt 🎁 Issue has been resolved and a contributor has been rewarded. label Feb 16, 2020