Several improvements around handling CSS and others. #33
need to accept "float" CSS attribute. Bugfix in including style classes: per tag - had a <span> inside <p> with same class name.
some fixes for smaller/more accurate formatting and CSS (experimented with iX Select and Wikipedia pages).
experimental stuff to handle CSS (reuse) better - more descriptive class names and easier to merge/modify multiple articles from same site later...
just my personal notebook to keep those for future use
...useful to retain table layouts as these are not covered with CSS.
... to retain table formatting.
..check if parent node has exact same values..
e.g. to use with Tampermonkey. Pre-Processing for cleaner output with save-as-ebook
reason: some paragraphs used as list-items.
…_ Chris Webb,Kasper On BI, Radacad, etc..user.js
well, first attempt.
- better image filenames (most cases) - might cause de-dup on name collisions.
- CSS: background-image and related attributes are now taken over with more fidelity.
- handle styling of links and other exceptions better (a, strong, em - tags)
TODO: need to produce better class names
Wow, you made a lot of changes. I need a few days to review it... I think I'll allocate my next week to work on this project because there are a lot of issues waiting to be solved. Thanks for your help, I might come back with questions about your changes.
Hi Adam, Looking forward to hearing from you.
Sorry for the late reply. I’ll address your changes in two parts: first the additional CSS/JS files, then the extractHtml.js changes.

There is a problem with cleaning the page before generating the ebook. I don’t like the current feature of inserting custom CSS to remove unwanted elements. It’s not user friendly, both because of the UI/UX and because a lot of people don’t know CSS. I’m thinking about removing it and finding a better solution. At the time I was working on it there was a bug in FF: you couldn’t access the reader mode from a web extension (https://bugzilla.mozilla.org/show_bug.cgi?id=1286387). It doesn’t look like they fixed it, but maybe there is a workaround. I’ll investigate more.

I’ve never used Tampermonkey, but it seems you have to add a script for each page you want to clean, and I don’t want to add thousands of scripts, for every possible site that can be saved as an ebook, to the main repo. Those scripts should be stored locally, if you need them.

I think that a ‘save as ebook’ app should do (only) what it says: save as ebook. I would remove everything related to ‘cleaning’ a web page, because cleaning is a non-mandatory, unrelated preliminary step. Mixing the two causes a lot of problems because you cannot make everybody happy. There should be something like what uBlock is for ads: a universal ‘reader mode’ extension, with a database of scripts and styles for as many sites as possible, so you don’t have to maintain anything, write code, or waste time trying to identify which elements should be removed. I’ll take a look to see what’s available.

I haven’t had time to look at the extractHtml.js changes; I’ll do it later.
Hi Adam,
absolutely on the same page.
I would also focus save-as-ebook on the part of extracting given pages
to epub and leave the preprocessing to different tooling.
My user scripts only serve as an example for this. As I am using several
PCs throughout the week, I was looking for an easy way to keep my CSS
definitions in sync. Tampermonkey can do this (even though its main
focus is JS, not CSS). But this way I can also simplify the document
structure where needed.
I did not think about the "reader mode" so far. I thought about
switching to the print layout where applicable, but anyways, this would
be preprocessing. And some page owners probably do not really care for
use cases like printing or conversion to ebooks.
Another reason for the user scripts: I started using an extra script for
hyphenation in Chrome (I think FF has this built in?). Very handy as I do
not need an extra iteration through Calibre for this (Kindle KF8 does
not perform the hyphenation unless there are soft hyphens in the document).
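
To illustrate what such a pre-processing script does, here is a toy sketch of inserting soft hyphens (U+00AD) into text before conversion. It is not Michael's actual user script: the function name `softHyphenate` and the tiny `BREAKS` word list are hypothetical, and a real script would rely on a pattern-based hyphenation library rather than a hand-written dictionary.

```javascript
// Soft hyphen: invisible unless the renderer breaks the line at this point.
const SHY = "\u00AD";

// Hypothetical, hand-written break points; a real script would use
// language-specific hyphenation patterns instead of this tiny map.
const BREAKS = new Map([
  ["hyphenation", "hy-phen-ation"],
  ["document", "doc-u-ment"],
]);

// Replace every known word with a copy that carries soft hyphens at the
// listed break points, preserving the original letter case.
function softHyphenate(text, breaks = BREAKS) {
  return text.replace(/[A-Za-z]+/g, (word) => {
    const pattern = breaks.get(word.toLowerCase());
    if (!pattern) return word; // unknown words pass through untouched
    let out = "";
    let i = 0;
    for (const ch of pattern) {
      out += ch === "-" ? SHY : word[i++];
    }
    return out;
  });
}
```

In a user script, a function like this would be applied to the text nodes of the article body before the page is saved.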
So, let's focus on "extractHtml.js". I am pretty sure you will like
some of the fixes. I can provide some URLs of sample pages if some
improvements are not clear. As you might imagine, I am primarily
converting technical pages including tables and source code
highlighting, so a lot more markup in the text than in novels and such.
Kind regards,
Michael
Sorry for taking so long... you made a lot of changes and I don't have enough time :) I'm trying to think of a way to automatically test the web extension before a release, maybe with Puppeteer...
OK, so I created a 'tests' folder with a small Puppeteer app that starts a Chrome instance with the extension loaded. I want to add some test pages and EPUB references and find a way to compare them with what is being generated.
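
For readers who want to try a similar setup, the following is a minimal sketch of how Chrome can be started with an unpacked extension through Puppeteer. It is not the code in the 'tests' folder: the helper name `extensionLaunchOptions` and the `./web-extension` directory are assumptions. The two Chrome flags are the usual way to load an unpacked extension, and running headed is assumed here because extensions have historically not loaded in classic headless mode.

```javascript
const path = require("node:path");

// Build the options object for puppeteer.launch() so that Chrome starts
// with exactly one unpacked extension loaded.
function extensionLaunchOptions(extensionDir) {
  const dir = path.resolve(extensionDir); // Chrome expects an absolute path
  return {
    headless: false, // assumption: run headed so the extension is loaded
    args: [
      `--disable-extensions-except=${dir}`,
      `--load-extension=${dir}`,
    ],
  };
}

// Usage (requires `npm install puppeteer`; not executed in this sketch):
//   const puppeteer = require("puppeteer");
//   const browser = await puppeteer.launch(extensionLaunchOptions("./web-extension"));
//   const page = await browser.newPage();
//   await page.goto("file:///path/to/test-page.html");
```

Keeping the option building in a plain function makes it easy to check without starting a browser.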
Hello, and thanks for that extension! Just letting you know of a few fixes and improvements I've made for my own reading of the generated EPUBs with KOReader on eInk devices (some context here). @miguelitoelgrande : regarding your commit:
This is a bit wrong. These are not exceptions, and what you did to these should be done to all tags, for all CSS properties that are inherited (per specs) and only them. I followed up on your huge improvements commit in poire-z@8daadb5 if you're interested. Note: the choice of which styles to include or not is quite use-case dependent :|
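
The rule described above can be sketched as follows. This is an illustration only, not the code from poire-z@8daadb5: the function name `pruneInheritedStyles` and the `INHERITED` set are hypothetical, the set is a partial list of properties that the CSS specifications define as inherited, and plain objects stand in for the computed styles of an element and its parent.

```javascript
// Partial list of CSS properties that are inherited by default.
const INHERITED = new Set([
  "color", "font-family", "font-size", "font-style", "font-weight",
  "line-height", "text-align", "text-indent", "letter-spacing",
  "word-spacing", "white-space", "list-style-type", "visibility",
]);

// Drop a declaration only when the property is inherited AND the parent
// already has the same value. Non-inherited properties (margin, float, ...)
// are always kept, because the child would not pick them up from the parent.
function pruneInheritedStyles(own, parent) {
  const kept = {};
  for (const [prop, value] of Object.entries(own)) {
    const redundant = INHERITED.has(prop) && parent[prop] === value;
    if (!redundant) kept[prop] = value;
  }
  return kept;
}
```

Applying the same check to every tag, instead of special-casing a, strong and em, keeps the generated CSS small without changing how the page renders.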
added "start" attribute for <ol> lists with explicit numbering
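
As a rough illustration of this change (not the commit's actual code), a helper could carry an explicit list start over into the generated markup. The name `openOlTag` is hypothetical; its argument is whatever `element.getAttribute("start")` would return, which is `null` when the attribute is absent.

```javascript
// Build the opening <ol> tag, keeping the "start" attribute only when the
// source list uses explicit numbering other than the default of 1.
function openOlTag(startAttr) {
  const start = Number.parseInt(startAttr, 10);
  if (Number.isNaN(start) || start === 1) return "<ol>";
  return `<ol start="${start}">`;
}
```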
This is my fork providing the following improvements (mainly improvements in extractHTML.js logic):
Thank you for your great work. save-as-ebook is really a great extension.