Several improvements around handling CSS and others. #33

miguelitoelgrande · 2019-03-20T10:21:05Z

This is my fork, which provides the following improvements (mainly in the extractHtml.js logic):

- support for additional CSS attributes (display, colspan, border-collapse, ...)
- support for fancy article headers {background-image, position, z-index, background-*}
- retain original image filenames where applicable (makes editing of the epub easier)
- better naming for resulting CSS rules (makes editing of the epub easier)
- "Dedup" of CSS rules: only store relevant changes compared to parent elements (smaller CSS files)
- additional tags and attributes
- some bug fixes (e.g. syntax highlighting in pre and code environments...)

Related:
- leverage the ModHeader extension to ensure that images in the ePub are in accepted formats, not WebP
- pushed "page cleanups" via CSS and JavaScript to userscripts for Tampermonkey
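
As an illustration of the "additional tags and attributes" item, a per-tag allowlist could look roughly like the sketch below. `KEPT_ATTRIBUTES` and `filterAttributes` are hypothetical names that do not come from extractHtml.js, and the table only lists attributes mentioned in this thread (colspan, start) plus rowspan as an assumed companion.

```javascript
// Hypothetical per-tag allowlist; the real list in extractHtml.js may differ.
const KEPT_ATTRIBUTES = new Map([
  ['td', ['colspan', 'rowspan']],
  ['th', ['colspan', 'rowspan']],
  ['ol', ['start']],
]);

// Return only the attributes worth keeping for the given tag.
// `attributes` is a plain object such as { colspan: '2', onclick: 'x()' }.
function filterAttributes(tagName, attributes) {
  const allowed = KEPT_ATTRIBUTES.get(tagName.toLowerCase()) || [];
  const kept = {};
  for (const name of allowed) {
    if (Object.prototype.hasOwnProperty.call(attributes, name)) {
      kept[name] = attributes[name];
    }
  }
  return kept;
}
```

In a content script, the `attributes` object could be built by iterating over `element.attributes`; that wiring is left out here to keep the sketch independent of the DOM.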

Thank you for your great work. save-as-ebook is really a great extension.

- better image filenames (most cases); might cause de-dup on name collisions
- CSS: background-image and related attributes are now taken over with more fidelity
- handle styling of links and other exceptions better (a, strong, em tags)

TODO: need to produce better class names
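
One possible way to keep original image filenames while avoiding the name collisions mentioned above is sketched below. This is not the code from the fork: `imageFileName` and the `usedNames` set are hypothetical, and the numeric-suffix scheme is just one assumed way to resolve collisions.

```javascript
// Derive an ebook-internal file name from an image URL, keeping the original
// base name and appending a numeric suffix if that name is already taken.
// `usedNames` is a Set of names handed out so far (hypothetical bookkeeping).
function imageFileName(imageUrl, usedNames) {
  // Last path segment, ignoring query string and fragment.
  const lastSegment = new URL(imageUrl).pathname.split('/').pop() || 'image';
  // Replace characters that are awkward inside an epub archive.
  const safe = lastSegment.replace(/[^A-Za-z0-9._-]/g, '_');
  const dot = safe.lastIndexOf('.');
  const base = dot > 0 ? safe.slice(0, dot) : safe;
  const ext = dot > 0 ? safe.slice(dot) : '';
  let candidate = safe;
  for (let i = 1; usedNames.has(candidate); i++) {
    candidate = `${base}-${i}${ext}`;
  }
  usedNames.add(candidate);
  return candidate;
}
```

With this scheme, two different images that are both called `chart.png` would end up as `chart.png` and `chart-1.png` instead of overwriting each other.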

alexadam · 2019-03-20T12:22:18Z

Wow, you made a lot of changes. I need a few days to review them... I think I'll allocate next week to work on this project because there are a lot of issues waiting to be solved. Thanks for your help, I might come back with questions about your changes.

miguelitoelgrande · 2019-03-20T12:52:37Z

Hi Adam,
anytime. Your extension is a great help in reading tech stuff on the Kindle instead of printing etc.
My changes focus on producing more accurate output and helping to edit the resulting epub afterwards.
The most relevant changes are in extractHtml.js.
The userscripts do a great job in preprocessing.

Looking forward to hearing from you,
Michael

alexadam · 2019-03-30T12:57:55Z

Sorry for the late reply. I’ll address your changes in two parts: first the additional CSS/JS files, and then the extractHtml.js changes.

There is a problem with cleaning the page before generating the ebook. I don’t like the current feature of inserting custom CSS to remove unwanted elements… It’s not user friendly because of the UI/UX and because a lot of people don’t know CSS. I’m thinking about removing it and finding a better solution.

At the time I was working on it there was a bug in FF: you couldn’t access the reader mode from a web extension, see https://bugzilla.mozilla.org/show_bug.cgi?id=1286387. It doesn’t look like they fixed it, but maybe there is a workaround. I’ll investigate more.

I’ve never used Tampermonkey but it seems you have to add a script for each page you want to clean, and… I don’t want to add thousands of scripts, for every possible site that can be saved as ebook, to the main repo. Those scripts should be stored locally, if you need them.

I think that a ‘save as ebook’ app should do (only) what it says: save as ebook. I would remove everything related to ‘cleaning’ a web page, because cleaning is a non-mandatory, unrelated preliminary step. And mixing them causes a lot of problems because you cannot make everybody happy.

There should be something like what uBlock is for ads: a universal ‘reader mode’ extension, with a database of scripts and styles for as many sites as possible. So you don’t have to maintain anything or write code or waste time trying to identify which elements should be removed. I’ll take a look to see what’s available…

I didn’t have time to look at the extractHtml.js changes yet; I’ll do it later.

miguelitoelgrande · 2019-04-01T21:56:10Z

Hi Adam,

absolutely on the same page. I would also focus save-as-ebook on extracting given pages to epub and leave the preprocessing to different tooling. My user scripts only serve as an example of this. As I use several PCs throughout the week, I was looking for an easy way to keep my CSS definitions in sync. Tampermonkey can do this (even though its main focus is JS, not CSS), and this way I can also simplify the document structure where needed.

I had not thought about the "reader mode" so far. I thought about switching to the print layout where applicable, but anyway, this would be preprocessing. And some page owners probably do not really care about use cases like printing or conversion to ebooks.

Another reason for the user scripts: I started using an extra script for hyphenation in Chrome (I think FF has this built in?). Very handy, as I do not need an extra pass through Calibre for this (Kindle KF8 does not perform hyphenation unless there are soft hyphens in the document).

So, let's focus on extractHtml.js. I am pretty sure you will like some of the fixes. I can provide some URLs of sample pages if some improvements are not clear. As you might imagine, I am primarily converting technical pages including tables and source code highlighting, so there is a lot more markup in the text than in novels and such.

Kind regards,
Michael

alexadam · 2019-04-15T14:23:20Z

sorry for taking so long... you made a lot of changes and I don't have enough time :) I'm trying to think of a way to automatically test the web ext. before a release, maybe with Puppeteer...

alexadam · 2019-04-15T17:35:56Z

ok, so I created a 'tests' folder with a small Puppeteer app that starts a Chrome instance + the extension. I want to add some test pages & epub references and find a way to compare them with what is being generated.
I don't know if this is the best way to do it, but it is the quickest for now...
In the next few days I'll add as many references as possible, plus pages that didn't work or have issues.
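
A minimal sketch of how such a Puppeteer app might start Chrome with an unpacked extension is shown below. It is not the code from the 'tests' folder: `buildLaunchOptions`, `launchWithExtension`, and the extension directory argument are hypothetical, and the sketch assumes the `puppeteer` package is installed. The two Chrome flags are the ones commonly used for loading unpacked extensions; newer Chrome or Puppeteer versions may need a different setup.

```javascript
const path = require('path');

// Build Puppeteer launch options that load an unpacked extension.
// `extensionDir` is a hypothetical path to the extension's source folder.
function buildLaunchOptions(extensionDir) {
  const dir = path.resolve(extensionDir);
  return {
    headless: false, // assumption: run with a visible browser window
    args: [
      `--disable-extensions-except=${dir}`,
      `--load-extension=${dir}`,
    ],
  };
}

// Start the browser; `puppeteer` is required lazily so the option builder
// above can be used (and tested) without the package being installed.
async function launchWithExtension(extensionDir) {
  const puppeteer = require('puppeteer');
  return puppeteer.launch(buildLaunchOptions(extensionDir));
}
```

Comparing the generated epub against a stored reference is a separate step that is not covered here.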

poire-z · 2019-08-21T07:20:53Z

Hello, and thanks for that extension!

Just letting you know of a few fixes and improvements I've made for my own reading of the generated EPUBs with KOReader on eInk devices (some context here).
I've taken many bits from @miguelitoelgrande's work (and from #19), so it feels a bit awkward opening a PR with my changes :) Also, I'm using it with an older version of Firefox, and I can't check (I don't really have time) how it would work with newer Firefox or Chrome.
But feel free to pick any of my fixes that make sense.

@miguelitoelgrande : regarding your commit:

handle styling of links and other exceptions better (a, strong, em - tags)

This is a bit wrong. These are not exceptions, and what you did to these should be done to all tags, for all CSS properties that are inherited (per specs) and only them. I followed up on your huge improvements commit in poire-z@8daadb5 if you're interested.
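
The rule described here, dropping a property only when it is inherited and equal to the parent's value, can be sketched as follows. This is an illustration, not the code from poire-z@8daadb5: `INHERITED` is a deliberately small, assumed subset of the inherited properties, `stylesToStore` is a hypothetical name, and the styles are passed as plain objects rather than read from the DOM.

```javascript
// A small subset of CSS properties that are inherited per the CSS specs.
// A real list would be longer; this one is only for illustration.
const INHERITED = new Set([
  'color', 'font-family', 'font-size', 'font-style', 'font-weight',
  'line-height', 'text-align', 'letter-spacing', 'white-space',
]);

// Return the styles that still need to be written for the child.
// An inherited property equal to the parent's value is redundant, because
// the child receives it through inheritance anyway. A non-inherited property
// (e.g. margin-top) is kept even if the parent has the same value, since
// the child would otherwise fall back to the property's initial value.
function stylesToStore(childStyle, parentStyle) {
  const kept = {};
  for (const [prop, value] of Object.entries(childStyle)) {
    const redundant = INHERITED.has(prop) && parentStyle[prop] === value;
    if (!redundant) {
      kept[prop] = value;
    }
  }
  return kept;
}
```

In the browser, the two style objects could be filled from `getComputedStyle(element).getPropertyValue(prop)` for the element and its parent.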

Note: the choice of which styles to include or not is quite use-case dependent :|
@miguelitoelgrande added background-image, but I don't want them (as well as letter-spacing and others). But I want "float", which I understand many other EPUB readers will not want. So, the choice of which styles to save may require manual tweaking of the code (until there is some UI configuration for that :)

miguelitoelgrande added 30 commits December 14, 2018 22:02

bugfix for Drop Caps

820379f

need to accept "float" CSS attribute. Bugfix in including style classes: per tag - had a <span> inside <p> with same class name.

Update extractHtml.js

73a332c

Update extractHtml.js

93c837d

some fixes for smaller/more accurate formatting and CSS (experimented with iX Select and Wikipedia pages)

Update extractHtml.js

e514cdf

experimental stuff to handle CSS (reuse) better - more descriptive class names and easier to merge/modify multiple articles from same site later...

Create AdditionalCSSstyles_MM.css

10b3a87

just my personal notebook to keep those for future use

Update AdditionalCSSstyles_MM.css

895d445

Update AdditionalCSSstyles_MM.css

74e1d35

Update AdditionalCSSstyles_MM.css

92d2203

Update AdditionalCSSstyles_MM.css

a2be4f2

Update AdditionalCSSstyles_MM.css

e091980

Update AdditionalCSSstyles_MM.css

f985170

Update AdditionalCSSstyles_MM.css

373ed39

Update AdditionalCSSstyles_MM.css

34b6d3a

Update AdditionalCSSstyles_MM.css

162ac06

Update AdditionalCSSstyles_MM.css

ee6000a

Update AdditionalCSSstyles_MM.css

3a8cac0

Update AdditionalCSSstyles_MM.css

eb70ecb

Update extractHtml.js

2d13ba3

Update AdditionalCSSstyles_MM.css

e065c7d

Update AdditionalCSSstyles_MM.css

8b5f958

keeping some attributes of tags (e.g. colspan)

7db15ac

...useful to retain table layouts as these are not covered with CSS.

allowing empty <td> elems.

90b53fc

... to retain table formatting.

added css border-collapse

7e098a9

Update AdditionalCSSstyles_MM.css

592304e

first config for ModHeader

47d8afd

Update README.md

0a435c1

Update ModHeader-config-MM.txt

9e7f8f0

remove redundant style rules

9780552

..check if parent node has exact same values..

Update extractHtml.js

aa3e788

additional Userscripts

c68856e

e.g. to use with Tampermonkey. Pre-Processing for cleaner output with save-as-ebook

miguelitoelgrande added 19 commits February 22, 2019 19:18

Add files via upload

e6bc34b

Update WhatsApp.user.js

b7c4467

Update Heise News (inkl TechnologyReview).user.js

ada36c9

Update Golem.de.user.js

9a4d211

Add files via upload

b79b20c

Delete sqlbi.com.user.js

35611bc

Add files via upload

7434ce8

Rename Chris Webb's BI Blog.user.js to BI Blogs_ Chris Webb & Kasper On BI.user.js

9ae9fe5

Add files via upload

1ac0e6e

Update Microsoft Docs.user.js

2c40ad1

added 'display' to CSS attributes

91abe54

reason: some paragraphs used as list-items.

avoid extra DIV below BODY tag

68a4732

Rename BI Blogs_ Chris Webb & Kasper On BI.user.js to WordPress Blogs_ Chris Webb,Kasper On BI, Radacad, etc..user.js

5895537

Add files via upload

773fba2

Add files via upload

5a19c24

handle CSS background-image attribute

52232ab

well, first attempt.

Update extractHtml.js

d94a84f

Update README.md

2915638

Update extractHtml.js

507b2b8

added "start" attribute for <ol> lists with explicit numbering

bunglegrind mentioned this pull request Apr 25, 2022

Exploit cascading properties in order to reduce CSS size bunglegrind/save-as-ebook#7

Open
