0.12.47 September 20, 2024
- because of a validator bug, the W3C recommended markup for images with `role="presentation"` fails W3C validation. We have made `data-role` a synonym for `role` in this context, allowing files that use the "presentation" role to do so without raising a validation error. https://github.com/validator/validator/issues/1599
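A minimal Python sketch of the synonym described above. The helper name `is_presentational` and the plain-dict representation of an element's attributes are assumptions made for illustration; this is not Ebookmaker's actual code.

```python
def is_presentational(attrs):
    """Return True if an img's attributes mark it as purely decorative.

    `attrs` is a plain dict of attribute names to values (an assumption;
    a real parser would hand over its own element object).
    """
    # `data-role` is accepted as a synonym for `role` in this context
    role = attrs.get("role") or attrs.get("data-role") or ""
    return role.strip().lower() == "presentation"


# Both spellings are treated the same way
decorative = {"src": "rule.png", "alt": "", "data-role": "presentation"}
print(is_presentational(decorative))
```

A source file can then use `data-role="presentation"` and still pass the W3C validator, while the image is treated like one marked `role="presentation"`.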
0.12.45 September 18, 2024
- generated covers are now 1600x2400 to comply with Apple Books recommended minimum width and DP guidelines https://www.pgdp.net/wiki/DP_Official_Documentation:PP_and_PPV/Post-Processing_FAQ#Information_for_all_types_of_cover #234
- added accessibility metadata to EPUB3 content.ocf as suggested by ACE
- stub implementation to allow assertions of good alt text in config.
- added aria labels and roles to nav elements of EPUB3 content.ocf and toc.xhtml
- added lang attribute to wrapper file html elements as suggested by ACE
- fix opengraph urls in HTML metadata #235
- update cchardet to solve problems installing on python 3.11
- alt-text logging is restructured
- empty alt-text warnings are now suppressed in figures
- empty alt-text warnings are now suppressed when role='presentation' or aria-labelledby attributes are present
- the alt text examination is moved from the Spider module to the HTMLParser module.
- ids are assigned to all img elements to facilitate alt-text mitigation.
- alt-text logging is improved.
- empty alt-text warnings now reference a newly added doc page: https://github.com/gutenbergtools/ebookmaker/blob/master/docs/alt-text.md
- bug in undeployed 0.12.44 fixed
0.12.43 May 22, 2024
- fixed chunker bugs:
  - no longer emits empty chunks (was happening with large child elements of body) #224
  - no longer splits elements in NEVER_SPLIT list when they are only children. #226
- fix missing empty line in txt output for copyrighted books #222
- made copyright addition in header/footer case-insensitive
- adds NFC unicode normalization to text parser #218
- libgutenberg 0.10.5
- don't strip periods from title_no_subtitle
- The `heading` column in the database's author-book many-to-many table was being ignored by much of our code. The result was that multiple authors were being listed in alphabetical order. Now, the heading column is used as the first sort column for the authors of a book, and the authors other than the first author have `heading=2` (instead of the default `heading=1`) set on initial metadata load. The cataloguer can reset the heading numbers, but does not wish the order of authors other than the "main" author to be tracked in the database.
- fixed a reversion in 0.10.10 that made author name matching case sensitive.
- get ebook number from filename if parse fails #225
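The NFC normalization added to the text parser in this release can be illustrated with the standard library. This is a sketch only; the helper name `to_nfc` is hypothetical and the parser's real call site is not shown here.

```python
import unicodedata


def to_nfc(text):
    """Return `text` in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
decomposed = "Cafe\u0301"
composed = to_nfc(decomposed)

# The two strings render the same but differ in length and code points
print(len(decomposed), len(composed))
```

Normalizing on input means that text comparisons and searches do not depend on which form the source file happened to use.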
0.12.42 April 01, 2024
- fixed rst -> epub3 conversion
0.12.41 March 01, 2024
- clean up txt boilerplate. fixes easy parts of #220
- refactor `enclose_text()` to properly deal with html4 transitional files
- don't allow non-images as img src
- libgutenberg 0.10.22 (fixes 508 attribute issue)
0.12.40 February 9, 2024
- fixed an issue parsing large text files. "The Entire Project Gutenberg Works of Mark Twain" was 1.2M.
- fixed an issue detecting multiline boilerplate markers.
- added an id 'pg-title-no-subtitle' to a span containing dc.title_no_subtitle
0.12.39 January 31, 2024
- fixed an EPUB3 problem affecting kobo reader when a book has more than 10 chapters - read order was being sorted as a string.
- for EPUB2, added a deprecation for `u` elements. fixes #210
- fixed multiline handling of MARKER_END in text boilerplate. fixes #209
- libgutenberg 0.10.20
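The read-order problem fixed in this release is the usual string-versus-number sorting pitfall. The sketch below uses hypothetical item ids of the form `chunk-N`; the real ids and the real sort key in Ebookmaker may differ.

```python
# Hypothetical spine item ids for a book with more than 10 chapters
spine = ["chunk-10", "chunk-2", "chunk-1", "chunk-11"]


def play_order(item_id):
    """Extract the numeric suffix so items sort as numbers, not strings."""
    return int(item_id.rsplit("-", 1)[1])


as_strings = sorted(spine)                  # lexicographic: 10 sorts before 2
as_numbers = sorted(spine, key=play_order)  # numeric: 1, 2, 10, 11

print(as_strings)
print(as_numbers)
```

With fewer than ten items the two orders coincide, which is why the problem only appears in longer books.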
0.12.38 December 8, 2023
- fixed reversion where adding display: initial over-wrote the addition of display:flex
0.12.37 December 7, 2023
- added `bgsound` to list of removed elements
- added `initial` as an allowed value for the CSS properties `display`, `margin-*`, `font-*`, `text-*`, and `padding-*`. This has been part of W3C REC for over 2 years and is primarily useful for reproducing the behavior of the deprecated HTML4 align attribute in HTML5.
- Improved rendering of tables with respect to the deprecated `align` attribute.
- added handling for the deprecated `color` attribute on `hr` elements.
- updates pg urls in the backfile to https, updates links to html4 files to html5 files. fixed #194
- fixes #207 bug where gif files are converted to png, but the manifest gives them the wrong mediatype.
- libgutenberg 0.10.19
0.12.36 September 20, 2023
Nothing in this release should affect usage without Ebookconverter (e.g. Online Ebookmaker).
- Ebookmaker no longer needs a parsable source file to use metadata-only writers. So, RDF, and QRcodes, and posts to Mastodon and Facebook, can be made using only the database. This change should also save execution time, as parsing of the source files is no longer done for each of these outputs. RDF files can now be made for "books" that aren't books, such as data files, audio files, etc.
- ebook number on Logger is set even if parse fails.
0.12.35 August 10, 2023
- fixed crash when END sentinel is on the last line
- turned on the SMALLPRINT_MARKERS
- cleaned up spacing and formatting of generated txt files
0.12.34 July 27, 2023
- fixed typo affecting in-copyright books
0.12.33 July 20, 2023
- replaced CSS4 value in CSS - was logging an error
0.12.32 July 19, 2023
- Updated a deprecated constant in ImageParser for compatibility with Pillow 10.0
- to facilitate their use by other applications, boilerplate strings are now exposed as strings in the writers.TemplateStrings module.
- to improve accessibility in the backfile, ebookmaker now recognizes figures and captions which use class names like 'figcenter' and 'caption'. `role='figure'` and `aria-labelledby` attributes are added to the element used as a figure. New source files should not use this markup; the HTML5 `figure` and `figcaption` elements are preferred. Addresses #178.
- ebookmaker was not adding boilerplate (including metadata) when the source file was txt. Now it does.
- added font-variant-numeric to the custom PG css profile. CSS Utils implemented CSS3-fonts before it was finalized, so it was missing font-variant-numeric. Fixes #188
- removed polyfill css for `figure` in down-conversion to EPUB2. The added CSS just caused problems. Fixes #151
- the HTML5 element `article` has been added to the down-conversion list for EPUB2 output. Fixes #172.
- when ebookmaker transforms `a[name]` attributes into `id` attributes, whitespace or non-alphanumeric characters in the name result in invalid identifiers. These are dropped in the transformation to EPUB. Code has been added to transform these name attributes into valid xml identifiers. Fixes #173
- `title=''` has been added to the `h2` element of the boilerplate header. This suppresses an awkward TOC entry in EPUB. fixes #117
- CSS for the boilerplate header has been tweaked to better protect its format from the document CSS, and to better match the formatting used by the workflow tool.
- Ebookmaker's parser no longer crashes when fed a document with SCRIPT elements containing bare "&" characters. Script elements are still removed. This was a problem exhibited when trying to convert HTML from PG Australia.
- Ebookmaker has been normalizing case in css selectors for EPUB. However, the regexp being used was overbroad and didn't exclude text in selectors that used attribute value comparisons. At the same time, the regexp was missing selectors that used `,`, `~`, `+` and `>` expressions. This regexp and the lower-casing code have been fixed. In addition, we now also normalize element selector case in generated HTML, since we are normalizing the HTML element case.
- add ' | Project Gutenberg' to title element for uniformity. fixes #183
- add helpful comment in by-heading warning message
- in metadata header, omit role label for all but first when multiple creators have the same role. Fixes #181.
- when there is no pg header or pg-footer recognized, an INFO message is logged instead of a warning. The message indicates that it is an error for white-washed files. Fixes #184.
- Ebookmaker no longer inserts generated boilerplate when none is recognized in an html source file.
When the source file is plain text only, boilerplate IS inserted. The only things needed to trigger boilerplate insertion are the strings `*** START OF THE PROJECT GUTENBERG EBOOK ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ***` in a `pre`, `div`, or `p` element.
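The `a[name]` to `id` transformation mentioned in this release has to produce valid XML identifiers. The following is a minimal sketch of one way to do that, restricted to ASCII for simplicity; the function name `name_to_id`, the replacement character, and the `id_` prefix are assumptions, not Ebookmaker's actual rules.

```python
import re


def name_to_id(name):
    """Turn an arbitrary a[name] value into a valid XML identifier.

    ASCII-only sketch: real XML names allow many more characters.
    """
    # Replace each run of characters outside a conservative safe set
    cleaned = re.sub(r"[^A-Za-z0-9_.-]+", "_", name.strip())
    # An XML name may not start with a digit, '.', or '-'
    if not cleaned or not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "id_" + cleaned
    return cleaned


for raw in ["Page 12", "12", "foot:note#3"]:
    print(repr(raw), "->", name_to_id(raw))
```

A production version would also need to rewrite the matching `href` fragments and guard against two different names mapping to the same identifier; neither is handled here.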
0.12.32 July 18, 2023. not released.
0.12.30 March 8, 2023
Note about side-loading files on Kobo devices.
An issue with side-loading files onto Kobo devices causes the wrong rendering engine to be used for our EPUB3 files, resulting in, among other things, the absence of a TOC for the book. This version of ebookmaker adds an EPUB2-style table of contents in the EPUB3 files to mitigate this issue, as well as to improve compatibility of our EPUB3 files on older devices, such as Nook. This is common practice in the ebook industry. We will continue to produce EPUB2 files, which often render better on older devices, as well as on the older Kobo rendering engine currently invoked for side-loaded books. Kobo is aware of the issue with side-loaded EPUB3, and we expect that the issue will eventually be fixed. In the meantime, Kobo rendering of EPUB3 should NOT be considered when authoring new HTML5 books.
- include an NCX toc file to improve EPUB3 compatibility with side-loaded kobo/older devices #167
- remove public identifier from NCX files (including EPUB2) - causes validation error for EPUB3
- only try to add credit from text if it's not empty
- improve css for h2 in pg-header with `all:inherit`. (`all:revert` is what we really want, but it's not REC yet). addresses #165
- don't qualify heading elements in CSS - use ids instead. https://github.com/CSSLint/csslint/wiki/Disallow-qualified-headings
- remove author from boilerplate section heading #161
- libgutenberg 0.10.17 removes $c and $v subfield markers from titles
0.12.29 February 15, 2023
- the boilerplate header and footer has been refactored to move css styling to a stylesheet.
- css styling for boilerplate header and footer revised to better match current submission style
- metadata listing revised to better match current submission style
- TODO: improve creator listing. Needs a spec in line with the reality of names and authorlists and with intended use cases.
- reorganized test directory to facilitate new tests
- add css `all` to supported attributes. (Introduced in 2013, supported mostly everywhere, REC since 2021.)
- add workaround for bad checksums in a small number of PNG files
- don't delete links to external resources when the fragment is not a valid xml identifier
- add a method to CommonCode to help map libgutenberg.Models.File objects to files in the filesystem
- ebookmaker strips floating pagenumbers, replacing them with page anchors when the content of the float is < 13 characters. Turns out Roman numeral page numbers can exceed the limit. This limit has been removed for floating elements with class 'pagenum'. fixed #155
- 'most recently updated' in header is now taken from the file modification date for the source file. when run from Ebookconverter, the file.modified value from the database is used.
- for EPUB2, HTML5 elements such as `figure` are converted to `div` with a class corresponding to the name of the replaced class. Since the default style for `figure` is different than for `div` some CSS is inserted to change the default style. With this version, that style is now inserted at the beginning of the `head` element rather than at the end so that it is easier to override with css from the source file. fixes #151
- display of original publication data is cleaned up
- text added to credit from parsed html is logged to facilitate moving it to the db.
0.12.25 January 1, 2023
- forgot to strip marc subfields in pubinfo
0.12.24 December 26, 2022
- use same method as Autocat3 for rendering PubInfo
- in generated header, render each creator on a separate line
- don't set exclude_encodings param for html5lib parser
- improve info log message for parser selection
- header removal was messed up by language tags in the marker. removed them
- MIN_CHUNK_SIZE reduced from 1024 bytes to 256 bytes. fixes #147
- update to libgutenberg 0.10.13 to include plural creator roles
0.12.23 November 29, 2022
- fix the fix to the whitespace around `<br>` to fix issue with 'None' appearing in EPUBS
- switch to `html5lib` for files likely to be html5 (not xml). This has the effect of closing any tags for empty elements not closed in the source file. Much slower than the C-based parsers used for XHTML, but probably not a big hit because results are cached.
0.12.22 November 28, 2022
- add whitespace to `<br>` elements in headers. tidy used to put whitespace around `<br>` elements. PPs relied on this behavior, so we're stuck.
- add pagebreak css for EPUB2 boilerplate
- changed log level for empty img[alt] from ERROR to WARNING
- fix replacement css for table[align] and img[align]
- fix incorrect css for bgcolor attributes
- fix incorrect css for clear='all'
- changed the approach to replacing `<center>` - there's additional css for tables in a center element.
- the HTML void element `wbr` is not recognized by lxml. For this reason, we had to:
  - switch to `html.parser` for files likely to be html5 (not xml)
  - we're using `html.parser` for files that don't set `xmlns` attribute and don't declare a PUBLIC doctype `-//W3C//DTD`, which should be any non-xml HTML5 file. Only 151 files in our collection satisfy this criterion so far.
  - explicitly remove stray end tags: `</wbr>` after writing to out bytes for the html5 file
- add svg cover from #132
- remove 'svg' property for stand-alone svg files; this property is meant for files that embed svg.
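The criterion given above for routing a file to the HTML5 parser (no `xmlns` attribute and no PUBLIC `-//W3C//DTD` doctype) can be sketched as a simple check on the start of the file. The function name `likely_html5` and the 2048-byte probe window are assumptions made for illustration, not Ebookmaker's actual implementation.

```python
def likely_html5(raw, probe=2048):
    """Guess whether `raw` bytes are non-XML HTML5.

    Only the first `probe` bytes are examined (an assumed window size).
    """
    head = raw[:probe].decode("ascii", errors="ignore").lower()
    has_xmlns = "xmlns=" in head
    has_w3c_public_doctype = "-//w3c//dtd" in head
    return not has_xmlns and not has_w3c_public_doctype


html5 = b'<!DOCTYPE html>\n<html lang="en"><head><title>t</title></head>'
xhtml = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
)
print(likely_html5(html5), likely_html5(xhtml))
```

Files that pass the check go to the lenient HTML parser; everything else stays on the XML-oriented path.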
0.12.21 November 1, 2022
- add logging for empty img[alt] attributes
0.12.20 October 23, 2022
- don't wipe source credit if there's an update credit from the db.
- require libgutenberg>=0.10.8
0.12.19 October 17, 2022
- tweak boilerplate to align with PG practice
- use `*** START OF THE` instead of `*** START OF THIS`. PG shifted from `THIS` to `THE` around #64000.
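Because older files use `THIS` and newer ones use `THE`, any code that looks for the start marker has to accept both. A minimal sketch follows; the pattern and the helper name `is_start_marker` are assumptions, not the expressions Ebookmaker actually uses.

```python
import re

# Accept both the older "THIS" and the newer "THE" wording
START_MARKER = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK\b"
)


def is_start_marker(line):
    """Return True if `line` looks like a PG start-of-book marker."""
    return bool(START_MARKER.match(line.strip()))


print(is_start_marker("*** START OF THE PROJECT GUTENBERG EBOOK MOBY DICK ***"))
print(is_start_marker("*** START OF THIS PROJECT GUTENBERG EBOOK MOBY DICK ***"))
```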
0.12.18 October 12, 2022
- changed caching strategy after analysis showed large cost for changes made in 0.12.12
0.12.17 October 6, 2022
- handle pre-html5 type attribute of `ul` properly
- continue processing job queue after a parse failure
0.12.16 October 3, 2022
- fix bug in removal of legacy elements
- add back enclose_text removed in 0.12.14; it wasn't redundant
- when an img is removed, replace it with alt text. fixes #123
- It seems we have to use the EPUB2 method to declare the cover image for EPUB3 files.
0.12.15 October 1, 2022
- fixed bug exposed by implementation of downloaded pubinfo in libgutenberg
- updated libgutenberg to 0.10.5 to fix an older bug revealed by rendering bugfix
0.12.14 September 30, 2022
- add tests for conversion from txt source
- boilerplate marking was duplicating an id. The code was incoherent. Cleaned this up.
- don't turn top-level html comments into non-comments. in BS4, Comment is a subclass of NavigableString! Thanks @G4OEU!
- removed redundant enclose_text()
- `--make=kindle` and `--make=all` no longer make `kindle.noimages`
- stop warning about external links that start with 'https://www.gutenberg.org/'
- More helpful message when file not found
0.12.13 September 23, 2022
- fix reversion in txt production because GutenbergTextParser needs caching
0.12.12 September 22, 2022
- caching the html tree in the parser was causing the epub build to interfere with the epub3 build, so a reset method has been added for parsers to make themselves safe for reuse. Raw bytes are still cached, since most of the benefit of caching is reading from disk.
0.12.11 September 19, 2022
- allow figure in body
- let PDFWriter create a directory
0.12.10 September 18, 2022
- fixed bad bug finding footer in texts generated from TEI
0.12.9 September 15, 2022
- restore stylesheet in cover wrapper
- PG producers used to do all sorts of crazy things in css comments, such as nesting CDATA sections: `/*<![CDATA[ */`. tidy used to clean this up for us. Now we're removing css comments entirely.
- what's more, the css sometimes contained malformed xml comments
0.12.8 September 6, 2022
- fixed parsing no-footer text files (stupid mistake!)
- fixed an issue where css files disappear because they are empty but there are still links to them
- added some file checks and a custom exception so that it's clearer what has happened if you give ebookmaker a bad file.
- removed a css link in wrapper that was getting stripped, then re-inserted later.
- ancient browsers didn't understand stylesheets, so xml comments were used to hide the style text. Our CSS parser is too modern to remember this, it seems. So we needed to un-comment style text. Probably was another thing that tidy was doing without telling us.
0.12.7 September 5, 2022
- fixed an ancient bug in EPUB2 pageno handling. Having two children ids in a pageno-class element no longer generates validation errors for EPUB2 files. Yes, it seems odd that a book would have two page anchors in one page number floating element, but it makes sense if you look at the rendered HTML (#501), and it's no reason to mock people.
0.12.6 September 3, 2022
- bad things happen if there's text at the top level. often this is a result of bad html. pre_parse now takes care of this
0.12.5 September 1, 2022
There remain problems converting HTML 4.0 files.
- fix failed txt build when boilerplate is not found
- fix failed txt build when boilerplate marker appears twice
0.12.4 August 31, 2022
- fixed bug in colgroup wrapping
- Ebookmaker will now NEVER break a page in the middle of a table, a list, or a figure.
- PG boilerplate is inserted in EPUB2 files as well
- Fixed issue with special characters in the boilerplate dividers causing txt builds to fail
- recognize previously marked pg-header, etc., such as from rst
0.12.3 August 24, 2022
- fixed a mismatch between the classname given to the cover and the corresponding css
- added CSS page breaks before footer and after header
- BeautifulSoup doesn't convert entities in script or style elements because HTML5 specifies these as CDATA. So our code has to handle cases where there are unexpected entities there.
- remove ALL non-default attributes in replaced img, not just alt
0.12.2 August 23, 2022
- xml escape headers extracted from text files
0.12.1 August 21, 2022
- drop alt attribute of img elements replaced with span
0.12.0 August 18, 2022
- start EPUB2 playorders at 1 not 0
- wrap bare col elements in colgroup
- instead of dropping img elements in noimages epubs, replace them with span tags to preserve link targets.
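The play-order change in this release (start at 1, not 0) can be sketched as a renumbering pass that also closes any gaps. The helper name `renumber` is hypothetical, and keeping equal input values equal in the output is an assumption about the desired behavior.

```python
def renumber(play_orders):
    """Map arbitrary play-order values onto 1, 2, 3, ... without gaps.

    Entries that share a value keep sharing one (an assumption).
    """
    ranks = {value: rank
             for rank, value in enumerate(sorted(set(play_orders)), start=1)}
    return [ranks[value] for value in play_orders]


print(renumber([0, 3, 3, 7]))
```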
0.12.0b2 August 7, 2022, very possibly final candidate
- Changes to the cli were needed for ebookconverter integration
- `--notify` and `--validate` are now flags that turn on validation and notification
- to prevent any issues with picked jobs being sent via stdin to subprocesses, the newer subprocess api is now used to run validators and mobi generators.
- TxtWriter also creates a target directory if it doesn't exist.
0.12.0b1 August 4, 2022. possibly final candidate
- fix windows exception for unpadded date format
- add target directory creation to epubwriter
- remove gaps in playOrder for EPUB2
- don't count the size of the chunk template for chunking
- remove xml:space attributes - not allowed in EPUB or HTML5
- EpubWriter now creates a target directory if it doesn't exist, as HTMLWriter does
0.12.0b0 July 12, 2022. beta, almost for production.
- update to libgutenberg 0.10.0 - much improved logging when run from ebookconverter
- always set the lang attribute on html element
- added `--validate=(true/false)` to CommonCode so that EbookConverter can set/unset it via CLI. option can turn off validation even when a validator is installed - needed for rebuild script
- added `--notify=(true/false)` to CommonCode so that EbookConverter can set/unset it via CLI.
0.12.0a1 June 17, 2022. alpha, not for production.
- update to libgutenberg 0.9.3 - much improved logging
- fix boilerplate insertion; only replace boilerplate in the first document
- catch errors for each job in a job queue so that the rest of the queue can execute
- fixed disappearing wrapped images
- add a pyproject.toml file. Seems to get rid of the SetuptoolsDeprecationWarning
- moved code to a src directory so as to keep test code out of distributions and play more nicely with new packaging standards.
0.12.0a0 June 14, 2022. alpha, not for production.
With 0.12, Ebookmaker adds EPUB3 and MOBI(KF8) as output formats.
- This version is being tested and deployed on Python 3.8. We will continue to address any issues with Python 3.7. We no longer support Python 3.5. We have not yet tested on Python 3.9 but we expect it works without change.
- replaces Tidy with Beautiful Soup. Ebookmaker has used HTML Tidy to make sure that source files produced over the course of ~ 25 years can be parsed into a reasonably modern HTML DOM. With the advent of HTML5, Tidy has begun to show its age, and maintenance of Tidy has not kept up with the times. Bugs in Tidy are not being fixed, and we find we can no longer rely on Tidy. To replace Tidy, we are using Beautiful Soup, a very popular python package widely used for web scraping.
Tidy did some other things that made Ebookmaker's HTML5 output poorly suited for PG:
- it reorganized style attributes into css style elements. While this made the CSS easier to manage, it resulted in less readable source code.
- it normalized whitespace in block elements. In almost all cases, this had no effect on the HTML display, but many PG contributors have used this whitespace to reproduce the printed pages in the source code, making it easy to maintain.
Beautiful Soup, by contrast, only changes the source when absolutely needed to make parsable unicode HTML. We expect the resulting HTML5 files will be more pleasing for PG contributors. Some code was added to the Ebookmaker HTML parser to reproduce some of the functionality that Tidy provided.
- Beautiful soup required some minor modification in error catching for missing files
- Incoming DOCTYPE is ignored
- Tidy provided some conversion of obsolete elements/attributes into xhtml4 elements with added CSS Rules.
- `font` elements are replaced with `span`s.
- `center` elements are replaced by `div`s. See note below about the CSS3 properties needed to reproduce the behavior of the `center` elements.
- when elements not permitted as `body` content are present as a child of the `body` element, they are wrapped in `div` elements
- A special formatter for Beautiful soup enforces Unicode Normal Form Composed.
- Ebookmaker has been somewhat heavy-handed when removing deprecated elements and attributes. With this version of Ebookmaker, we make more of an effort to preserve the formatting of the source document. This will impact EPUB2, EPUB3 and HTML5 produced files.
- size attributes in `font` tags are translated to css rather than ignored.
- list styles are translated to css rather than ignored.
- size and width attributes on `hr` are translated to css rather than ignored.
- width attributes on `hr` are translated to css rather than ignored.
- deprecated align attributes on most elements are translated to css rather than ignored.
- bgcolor attributes on elements other than body are translated to css rather than ignored.
- values for the attributes align, frame, and rules are changed to lower case
- a customization has been added for the cssutils module to permit us to add selected CSS properties we want to use (the built-in tables are getting old.) We needed to do this because certain conversions for obsolete elements could not be duplicated without using newer CSS properties. In particular:
- to reproduce the legacy `center` element, we added `display: flex` and `justify-content: center`.
- `speak` and `speak-as` css properties have been updated.
- for HTML5, a validation hook has been added. As with EPUB validation, add the path of your command-line HTML5 validator to the .ebookmaker config file and set the --validate flag. Tested with the W3C "Nu" validator - https://validator.github.io/validator/
- for HTML5, move col@valign to css
- for HTML5, change 3 letter language codes to 2 letter codes where available
- for HTML5, fill empty title elements
- for HTML5, improve handling of HTML4 table@frame and table@rules
- for HTML5, `article`, `section`, `header`, and `footer` are now allowed as top-level elements in `body`
- fixed crash in text file analysis when number of lines in paragraph exceeds log(max float). 700-ish
- include opentype fonts in EPUB file (.ttf, .otf, .woff), requires libgutenberg >= 0.8.14. fixes #106
- added an EPUB3 writer. In addition to producing valid EPUB3 files, some changes have been made to the produced EPUB.
- There is only an "-images" flavor. We continue to produce EPUB2 in images and no-images flavors
- Many changes in the HTML and CSS that were done for compatibility with e-readers are not done for EPUB3. The changes remain in place for EPUB2. For example:
- Floats are not removed.
- CSS absolute units are not changed.
- Uncommon characters and ligatures are not simplified.
- <q> elements are not rewritten.
- Preformatted sections are not reflowed.
- data elements are not stripped
- img class="dropcap" are not changed to spans
- any of the above that prove to be needed can be added back as needed
- all html4 -> html5 changes are made, no matter the source.
- it turns out that producers have long used workarounds to adjust for all the changes in support of limited-capability ereaders. For example, drop-caps in the HTML versions used `@media(handheld)` and `x-ebookmaker` css rules to remove drop-caps that didn't work in EPUB. Now that we are no longer removing floats and the like, we had hoped to undo most of these accommodations for EPUB3. This proved to be too complex. `@media(handheld)` rules are now replaced by `@media (max-width: 480px)` for EPUB3, and `x-ebookmaker` is supplemented by `x-ebookmaker-2` for EPUB2 files and `x-ebookmaker-3` for EPUB3 files. Going forward, producers should try to avoid, as much as possible, using the `x-ebookmaker-3` class and instead use media queries so that customizations will also benefit small-screen users of the html files.
- For EPUB3, we still need to remove CSS rules that use the position property. Apple iBooks only allows the position property for fixed-layout EPUBs; for reflowable EPUBS, it appears to remove any elements that use `position: absolute`. It looks like absolute positioning is used mostly for page number anchors in the PG corpus, so we are retaining the behavior of hiding page number anchors when they use absolute positioning. Producers who want visible page number anchors should use floating elements.
- For EPUB3, in our initial testing, we found that setting a default body margin hurt more books than it helped, and we are now using different default CSS sheets for EPUB3 and EPUB2.
- CSS for the EPUB cover has been updated to better handle small or oddly sized cover images.
- Ebookmaker breaks HTML source into chunks to improve performance on EPUB readers. For EPUB3 files, the chunker treats `section` elements the same way it treated `div.section` elements for EPUB2. Similarly section elements in HTML5 source are converted to div.section elements for EPUB2. In addition, the maximum chunk size for EPUB3 is 300KB compared to 100KB for EPUB2.
- Ebookmaker now supports attributes in the epub namespace {http://www.idpf.org/2007/ops} These can be entered in source file in two ways:
- any `data-epub-*` attribute in an html or xhtml source file is moved to the epub namespace for EPUB3, stripped for EPUB2, and preserved as-is for HTML5. This option permits validation with the W3C 'nu' validator.
- any 'epub:*' attribute in a properly namespaced XHTML file will be preserved for EPUB3, stripped for EPUB2, and converted to a `data-epub-*` attribute in HTML5.
- This version expands support for accessibility attributes.
- the epub:role attribute (see above for using the epub namespace)
- HTML5 attributes `role`, `aria-label` and `aria-labelledby` help screen readers interpret HTML. See https://idpf.github.io/epub-guides/epub-aria-authoring/ for guidance about how to use these. Ebookmaker will strip these attributes for EPUB2 files.
- obsolete values of the `speak` CSS property are now updated to current CSS2/3 equivalents.
- as discussed above, `speak` and `speak-as` css properties are now included in EPUB, EPUB3 and HTML5 files.
- tibetan (bo) added to list of languages for mobi conversion by calibre
- fixed issue where backlinks required an id set on the original element
- HTML5 `wbr` tags (line break opportunity) are removed for EPUB2
- HTML5 and EPUB3 files no longer duplicate the lang attribute in xml:lang
- Ebookmaker is phasing out the use of Kindlegen, which has been unsupported by Amazon for a while. While kindlegen can still be specified as the converter app in the config file, Calibre is now the default conversion app. The generated EPUB2 file is used as the source for MOBI (version 6) files, while EPUB3 files are used as the source for MOBI (KF8 format) files.
- fixed bug where dangling references were created by `x-ebookmaker-drop`
- for EPUB2, added the required summary attribute on table elements.
- for EPUB2, when an x-ebookmaker-page element is added, a `div` is made instead of an `a` when the element is a direct child of `body`
- for EPUB2 and EPUB3, when an x-ebookmaker-drop element containing an `id` is removed, a `div` is added instead of a `span` when the element was a direct child of `body`.
- for EPUB, fixed bug for irregular heading hierarchies
- work around bug in lxml >= 4.7 causing parse failures for rst conversions
- restored newlines in validation logging to make validation issues readable
- for conversions from RST: removed invalid 'classes' attribute
- for conversions from RST: added pg_boilerplate to generated headers
- for conversions from RST: stop printing the encoding as metadata
- for EPUB2 and EPUB3: Ebookmaker no longer makes an invalid reference when 'mailto:' links are present
- for EPUB2 and EPUB3: Adds a MIN_CHUNK_SIZE to avoid empty chunks when `body` begins with a section.
- when HTML or TXT source files are parsed, we attempt to identify Project Gutenberg "Boilerplate". When detected, these sections are wrapped in `section` tags for HTML and `pre` for TXT, with appropriate ids. three types of boilerplate identified are:
pg_header
usually a title and license declaration
sometimes, title, book number, release date, authors, language, encoding, credits
when detected, metadata will be parsed and enclosed in a pg_metadata_raw sub-section
pg_footer
usually the trademark license
pg_smallprint
on older books, this will contain license-ish language and other material. it's usually
found at the top of the text, and is often comically dated.
- for HTML5 and EPUB3, replace old boilerplate with up-to-date, generated boilerplate!
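The handling of `data-epub-*` and `epub:*` attributes described in this release can be pictured with the rough sketch below. It uses only Python's standard `xml.etree.ElementTree`; the function name `map_epub_attrs` and the target strings `'epub3'`, `'epub2'` and `'html5'` are hypothetical and are not Ebookmaker's actual API.

```python
import xml.etree.ElementTree as ET

EPUB_NS = "http://www.idpf.org/2007/ops"
DATA_PREFIX = "data-epub-"


def map_epub_attrs(root, target):
    """Rewrite epub-related attributes in place for 'epub3', 'epub2' or 'html5'.

    Hypothetical helper for illustration only.
    """
    ns_prefix = "{%s}" % EPUB_NS
    for elem in root.iter():
        for name in list(elem.attrib):
            if name.startswith(DATA_PREFIX):
                local = name[len(DATA_PREFIX):]
            elif name.startswith(ns_prefix):
                local = name[len(ns_prefix):]
            else:
                continue
            value = elem.attrib.pop(name)
            if target == "epub3":
                # EPUB3 keeps the attribute in the epub namespace
                elem.set(ns_prefix + local, value)
            elif target == "html5":
                # HTML5 carries it as a data-epub-* attribute
                elem.set(DATA_PREFIX + local, value)
            # for 'epub2' the attribute is simply dropped
    return root
```

For example, a `data-epub-type="footnote"` attribute would become `epub:type="footnote"` for the `'epub3'` target and would disappear for `'epub2'`.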
0.11.30 December 10, 2021
- for EPUB, down-convert HTML5 tags to divs so the files validate as EPUB2. The new div elements will add a class named the same as the html5 tag, so `<section>` becomes `<div class="section">`. Other attributes are preserved. In addition CSS selectors involving these elements will be transformed accordingly: for example `section` becomes `div.section`
- `section`
- `figure` (initial style set to "margin: 1em 40px;", copying from Firefox internal stylesheet.)
- `figcaption`
- `header`
- `footer`
Users of these HTML5-only tags need to check that their CSS does not conflict with the added classes or changed CSS. In almost all cases, avoiding HTML5 element names for CSS classes will prevent any conflict. Users of HTML5 input may still encounter unresolved issues with other parts of the DP/PG tool chain; please examine output files carefully for unexpected behavior.
- for EPUB, move 'tfoot' elements to before 'tbody' (the order used in HTML4)
- for EPUB, remove any 'meta' elements using the 'property' attribute.
- add 'CRITICAL' notification for 'too-deep' errors
- reset parsers after txt jobs. fixes a bug when the plain text source file is linked from the html.
- EPUBCheck validation was broken. To use EPUBCheck validation, first download and install EPUBCheck from https://www.w3.org/publishing/epubcheck/. If the command to invoke it is `java -jar /Applications/epubcheck-4.2.6/epubcheck.jar`, then add this line to ~/.ebookmaker or /etc/ebookmaker.conf: `epub_validator: java -jar /Applications/epubcheck-4.2.6/epubcheck.jar` then turn on validation by adding `--validate` to Ebookmaker's command line invocation or by setting validate to true in ~/.ebookmaker
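The down-conversion of HTML5 tags for EPUB2 described in this release can be sketched as follows. This is a simplified illustration using the standard library, not Ebookmaker's own code: the function names `downconvert` and `downconvert_selector` are hypothetical, and the selector rewrite is a naive regular expression rather than a real CSS parser.

```python
import re
import xml.etree.ElementTree as ET

HTML5_TAGS = ("section", "figure", "figcaption", "header", "footer")


def downconvert(root):
    """Turn HTML5-only elements into div elements with a class named after the tag."""
    for elem in root.iter():
        if elem.tag in HTML5_TAGS:
            classes = elem.get("class", "").split()
            classes.append(elem.tag)
            elem.set("class", " ".join(classes))
            elem.tag = "div"  # other attributes are left untouched
    return root


def downconvert_selector(selector):
    """Rewrite bare element names in a selector, e.g. 'section' -> 'div.section'."""
    # Skip names that are already part of a class, id or longer identifier.
    pattern = r"(?<![\w.#-])(%s)(?![\w-])" % "|".join(HTML5_TAGS)
    return re.sub(pattern, r"div.\1", selector)
```

The sketch also shows why a CSS class that reuses an HTML5 element name can collide with the added class: after conversion, `.section` matches both the author's own elements and every converted `section`.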
0.11.29 November 30, 2021
- for HTML5, remove Content-Language metas
- when converting a presentational attribute to css in a style attribute, put the added css *before* existing content of the style, so as not to override it. this mimics browser behavior for cases when the two styles conflict. This won't do much good right away because tidy strips the styles into named classes.
- stop adding a viewport meta tag. it turns out this interferes with good HTML5 designs for mobile.
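The ordering rule for converted presentational attributes in this release can be illustrated with a minimal sketch. The helper name `prepend_style` is hypothetical; the point is only that, within one style attribute, a later declaration overrides an earlier one for the same property, so the added css goes first.

```python
def prepend_style(existing_style, added_css):
    """Put css derived from a presentational attribute before the existing style content."""
    added = added_css.strip().rstrip(";")
    existing = (existing_style or "").strip()
    if not existing:
        return added + ";"
    # The author's own declarations come last, so they still win on conflict.
    return added + "; " + existing
```

With `width="200"` converted to `width: 200px` and an existing `style="width: 50%"`, the result is `width: 200px; width: 50%`, and the author's 50% still applies.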
0.11.28 November 24, 2021
- fix #100. the behavior of --output-file has changed. a string passed using this argument is used to name the file where the Gutenberg ID would be. Previously it would be just the name of the output file, no matter the file type, except for Kindle, PDF and TeX. File naming for kindle was broken completely. In the past (version <0.11) --title would override the parsed or looked-up title. Title would be used in the file name if there was no Gutenberg id or --output-file.
- docutils rst conversion introduced a typo in 0.18 resulting in some css problems
- added exception handling in ImageParser for broken images
- don't select cover until it's needed. Ebookmaker has been generating unneeded covers in the txt step because it hasn't parsed an html file.
- for HTML5, fixed a css syntax error in the css added for the table@cols attribute
- for HTML5, make sure lang and xml:lang attributes are in sync; put invalid langs in data-invalid-lang attribute.
- for HTML5, remove height or width attributes that are 0 or empty
0.11.27 November 18, 2021
- one more fix for docutils 0.18+
0.11.26 November 11, 2021
- fixed a problem with covers selected from linked images based on the file name. (The image file would be added twice).
- fixed a problem with linked images being omitted if they also were used as the cover image
- cover images are stripped from the flow because they are re-added to the flow in the coverpages. This behavior can now be over-ridden with the x-ebookmaker-important class (as has been advertised).
- added `--config-dir` command line argument to help guiguts integrate the included tidy config file.
- update docutils to 0.18+
- fixed a problem with noimages files caused by the parsing for .images jobs. Build order reversed!
- fixed a problem with noimages files due to broken link removal
- changed file naming methods so that Calibre file checking no longer complains.
- 0.11.25 was not deployed due to test failures.
0.11.24 November 8, 2021
- compatibility with docutils 0.18+. docutils' node traversal was changed from a list to an iterator.
- fixed duplicated generated cover bug. This caused errors in epubs generated by Online Ebookmaker or when the source directory is the same as the target directory.
- fix entities in generated CSS. When we generate HTML from txt or rst we must not entify '>'
- fix problem when an html source file links to a text file. Ebookmaker was trying to convert these files to html, and including them in the ebook reading order. Now, linked plain text files are only converted to utf8, nothing more.
- make sure that every img element has an alt attribute
- replace obsolete attributes for HTML5: td@background, td@bordercolor, tr@bordercolor, table@bordercolor, table@height, table@background background (the last one was never a thing!)
- remove blink elements
- add missing dd elements
- make sure that lang attribute == xml:lang attributes, everywhere.
- fix issue with <CR> in metadata when setting title for conversion from text files
- update to libgutenberg 0.8.12 to fix issue with control characters in dc.title meta tags
0.11.23 October 28, 2021
- moves tfoot to end of tables for HTML5
- removes superfluous span attributes in tables for HTML5
- replaces frame and rules attributes in tables with equivalent css for HTML5
- checks all values of the lang and xml:lang attributes for validity, fixes common invalid values for HTML5
- fix thead@align, tfoot@align, thead@valign, tfoot@valign for HTML5
0.11.21 October 23, 2021
- fixed file scanner to not scan parent directory when asked to scan a directory
0.11.20 October 22, 2021
- fixed file scanner used to find covers
0.11.19 October 21, 2021
- tt elements replaced by span with monospace font in HTML5
- newlines properly escaped in meta attributes
- removes meta elements with scheme attribute
- fixed missing coverpages in epub
0.11.18 October 19, 2021
- fixed reversion in TeX conversion
0.11.17 October 18, 2021
- fixed missing subtype in link rel setter
0.11.16 October 15, 2021
- fixed cover setting issues in 0.11.15
- parser now converts unicode to Normal Form C. So "A"+"combining-`" -> "À"
- CSS serializer now omits invalid CSS properties in the derived HTML5. The CSS profile used is CSS2, which includes a small number of properties that are marked as errors by the HTML5 validator because it considers them deprecated by CSS3 (for example, the "speak" property, which is replaced by "speak-as"). We'll need to move to CSS3 eventually, but for now we need to also target EPUB2 and ereaders that don't do CSS3 yet. The supported CSS Properties are defined by profiles in the cssutils module in python; CSS3 properties are somewhat modular, and there needs to be discussion around which properties we should be using in PG files.
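The Normal Form C conversion mentioned in this release can be reproduced with Python's standard `unicodedata` module. This is only a demonstration of the normalization itself, not of where the parser applies it.

```python
import unicodedata

# LATIN CAPITAL LETTER A followed by COMBINING GRAVE ACCENT (two code points)
decomposed = "A\u0300"

# NFC composes the pair into the single code point U+00C0
composed = unicodedata.normalize("NFC", decomposed)
```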
0.11.15 October 14, 2021
Not deployed due to test failures
- fix syntax and position of style element replacing HTML5 deprecated elements
- remove xml:space from HTML5 pre and style elements
- make removal of http-equiv meta elements case-insensitive
- try to remove problematic carriage returns in meta tags
- tr@align, tr@valign, tbody@align, tbody@valign changed to css equivalents
- remove img@longdesc; it never did anything
- add title to files produced from txt
- to allow for better HTML5 validation the preferred mechanism for denoting a cover image has been changed from `<link rel="coverpage" type href="a_relative_url.jpg" />` to `<link rel="icon" href="a_relative_url.jpg" type="image/x-cover" />`. type is optional unless there is more than one link@rel=icon. The issue is that "coverpage" is not an HTML5-registered link relation. The registered "icon" relation is described as "Imports an icon to represent the document." The "coverpage" mechanism will continue to be supported and does not need to be changed, especially for XHTML source files.
- Ebookmaker now looks for unlinked cover files if there are none linked. Cover file names must contain the string "cover" and must have an extension in '.jpg', '.jpeg', '.png', or '.gif'. Cover files must be in the same directory as the source file or one of its subdirectories. At some time in the past, cover files for display on the website were similarly identified by name. Some covers in the backfile were replaced by generated files when Ebookmaker added the capability of generating cover files. With Ebookmaker now identifying cover files, many of the unlinked covers should be restored. When a cover file is supplied, it is still a best practice to use a link element in the html file.
- adds utility functions CommonCode.dir_from_url and CommonCode.find_candidates to refactor directory walking and url to path conversions
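The search for unlinked cover files described in this release could look roughly like the sketch below. It is not the code of `CommonCode.find_candidates`; the function name `find_cover_candidates` is hypothetical, and matching the name and extension case-insensitively is an assumption.

```python
import os

COVER_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def find_cover_candidates(source_dir):
    """Return files under source_dir whose name contains 'cover' and has an image extension."""
    candidates = []
    # os.walk covers the source directory and all of its subdirectories
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in sorted(filenames):
            _stem, ext = os.path.splitext(name)
            # assumption: the comparison ignores case
            if "cover" in name.lower() and ext.lower() in COVER_EXTENSIONS:
                candidates.append(os.path.join(dirpath, name))
    return candidates
```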
0.11.14 October 11, 2021
- fix error when a style element is empty
- fix error when style contains non-ascii text
- fix issue with aux files not being copied to html destination
- use secure version of Pillow
0.11.13 September 30, 2021
- fix reversion in 0.11.10 leaving out added meta elements
- remove encoding meta element originating in HTML5 source from EPUB2 files, as it was causing validation failures.
- remove @media handheld rules for HTML
- move content of table summary attribute to data-summary attribute
- width attribute on table and col is converted to css in a style attribute
- non-integer width or height on img is converted to css in a style attribute
- fixed epub builds for books with images in external css sheets
- the 'big' element is obsolete; changed to <span class="xhtml_big">. css is altered or added as appropriate
- a number of table attributes are obsolete in html5 and changed to the corresponding css styles: table@width, col@width, table@cellpadding, table@cellspacing, table@border, td@align, th@align, td@valign, th@valign
- html5 doesn't allow elision of dd in dl. Where they are missing, we add empty dd
- html5 doesn't permit carriage returns. these are replaced by newlines when represented as numeric entities.
- libgutenberg dependency updated to 0.8.11
- fix issue of polymorphism in dc.languages. Without a db, it's a list of structs; with a db, it's a related collection.
- ebookmaker will no longer ignore xml:lang or DC meta attributes
- fix windows path comparison - ebookmaker will behave properly when input file is in outputdir
- fixed style element bug in unreleased 0.11.12
0.11.11 September 22, 2021
- remove encoding meta element originating in HTML5 source from EPUB2 files, as it was causing validation failures.
- fix bug when an XHTML source file set an xml:lang attribute
0.11.10 September 20, 2021
- use tidy for _all_ html source files
- addressed long-standing issue where images referenced in css were not included in epubs. This issue surfaced because the images were also missing from generated files.
- addressed some simple issues preventing derived HTML5 files from validating. More complex issues involving incompatibilities between XHTML and HTML5 have been enumerated and will be addressed in subsequent updates.
- removed http-equiv meta elements for Content-Type and Content-Style-Type
- set lang attribute when xml:lang attribute is present
- removed duplicate encoding meta elements introduced by HTML5 source.
- removed type attribute from style elements
- update requirements so that stand-alone installs will work better
0.11.9 September 3, 2021
- more aggressive session closing
0.11.8 September 2, 2021
- fix crash when source document contains html comments
0.11.7 September 2, 2021
- Using libgutenberg 0.8.7, which includes the type of meta tags used in HTML5.
- Ebookmaker was not saving the derived HTML files if the main source file was in the output directory. This prevented online Ebookmaker from displaying the files. Now, Ebookmaker will put derived files in an "out" directory. This turned out to require some code restructuring.
- The pseudo-xhtml files produced by 0.11.5 were causing problems with browser compatibility, most noticeably by doubling break elements. It turns out that the quirky output from lxml was caused by xml namespacing of elements. When xml namespaces were removed, the html output method worked as desired, resulting in files that in many cases validated as html5. This solved a number of problems for us, and puts us in a position to start remediating problem files in the backfile in preparation for EPUB3.
- The encoding for all python source files was changed to UTF8. A mis-encoded python file caused a problem with mdashes in titles.
- Sessions are now closed after every set of jobs. Ebookconverter was running out of database connections.
- Some superfluous logging was removed.
- There is documentation for the changes introduced in version 0.11
0.11.5 August 28, 2021
- fix bug in stand-alone kindle generation
0.11.4 August 26, 2021
- one more change. can't use xml write mode for html, because Chrome and Safari no longer support self-closing tags.
0.11.3 August 18, 2021
Adds notification support
- Add queue_notifications method in CommonCode, usable by both EbookConverter and EbookMaker
- configure notifications for missing file problems
- remove parsers for missing files - bug exposed by html generation
- fixed regression in WrapperParser
- add coverage by test file
- enhanced log formatting
- refactored log setup for use with
- started using CRITICAL logs to trigger notifications
0.11.2 August 4, 2021
Bugfixes for stand-alone use. We should not have released 0.11.1 on pypi.
- Ebookmaker 0.11.1 did not work without psycopg2 and the PG database. Neither did libgutenberg 0.7.2. This version, with libgutenberg 0.8.1, works without the PG database or psycopg2.
- uses libcountry for language name lists.
- uses old dc object if database not present
0.11.1 July 19, 2021
Fixes for Ebookconverter compatibility. 0.11.0 was never deployed.
- stop using old dc object, now use only ORM for db access
- since the ORM dc object contains a session, it can't be pickled. But EbookConverter sends a pickled job queue to EbookMaker to process, presumably to enable processing on multiple servers. So job queues can no longer contain dc objects. EbookMaker now gets a new (ORM) dc object for every job.
- when making txt output, EBM was relying on not also generating html except for rst. so now we check directly for txt source when creating.
- assorted delinting
0.11.0 June 30, 2021
Ebookmaker version 0.11 makes enhanced HTML files for all types of input, including HTML source files. Here are the improvements and other changes made to HTML source:
- all HTML files are cleaned by HTML Tidy. Tidy does the following:
- converts all HTML to well-formed UTF8-encoded XHTML files. This will allow the PG server to add encoding to MIME headers, improving browser compatibility and accessibility.
- LF is used as the newline character for all files (unix standard)
- html entities such as `&rsquo;`, `&Aacute;` etc. are converted to unicode characters
- correct badly formed HTML, improving browser compatibility and standards conformance.
- Because the files are now guaranteed to be well-formed, DOM manipulation can be done reliably by browser plugins, mobile apps, proxy servers, accessibility tools and PG's own file processors.
- inline style attributes are moved to a generated inline stylesheet for better rendering performance.
- a doctype declaration for XHTML+RDFa 1.1 is used for all files to allow validation with included RDFa metadata.
- tags are now uniformly lower case
- some legacy presentational tags (`<i>`, `<b>`, `<center>` when enclosed within appropriate inline tags, and `<font>`) are replaced with CSS `<style>` tags and structural markup as appropriate.
- empty paragraphs are discarded.
- any text in the body element is wrapped in a `<p>` element.
- added RDFa data, Dublin Core, and schema.org metadata to head element of HTML for better SEO and facebook unfurls. Changes in the metadata are now reflected in the HTML presentation
Some incidental changes were necessary to make this possible:
- Because the generated html is moved to a new directory, linked files also needed to be moved.
- Because the generated file has a different name, back-links needed to be changed
It is possible that rendering of the HTML is changed by this additional processing; however, the changed rendering would be aligned with what has long existed in PG EPUB files.
Note that the unprocessed source files will continue to be available without URL change on the PG web site.
- Don't stop generating html with first html file.
- Don't generate wrapper files when spidering to generate html
- Move @media handling to EpubWriter, not in parser.
- Also copy css and images to target directory
- Don't rewrite urls on output; they're already relative
- Let Spider follow "nofollow" links; instead have EpubWriter remove the nofollow links and corresponding files
- added USAGE.md to provide better documentation for html authors preparing files for Ebookmaker
- removed data-* attributes for epub because these attributes are not allowed in EPUB 2.0.1 and files were thus failing EpubCheck
- add RDFa data and schema.org metadata to head element of generated HTML for better SEO and facebook unfurls
- now using the doctype declaration for XHTML+RDFa 1.1 for generated HTML from libgutenberg >= 0.7.1
- added a tidy config to eliminate dependence on system configured tidy and to turn off drop-empty-elements, an option not available at the command line. Dropping empty spans/divs was having unexpected effects on css rendering; easily worked around, but confusing for producers.
Boilerplate generation will follow in v0.12
0.10.4 April 6, 2021
- add a minimal css stylesheet to the html generated for txt files
- delint
0.10.3 February 25, 2021
- added rendering for <q>: ebookmaker will now change all <q> tags to <span> for epub builds, keeping any attributes on the tags. curly quotes will be added inside the spans, double for top level q and single for all q nested in other qs.
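The `<q>` rendering rule can be sketched with the standard library as follows. This is an illustration under stated assumptions rather than Ebookmaker's own code: the function name `convert_q` is hypothetical, and the quote characters chosen are the English curly quotes U+201C/U+201D and U+2018/U+2019.

```python
import xml.etree.ElementTree as ET


def convert_q(elem, inside_q=False):
    """Recursively turn q elements into span elements with curly quotes inside."""
    is_q = elem.tag == "q"
    for child in list(elem):
        convert_q(child, inside_q or is_q)
    if is_q:
        # double quotes at top level, single quotes for q nested in another q
        open_q, close_q = ("\u2018", "\u2019") if inside_q else ("\u201c", "\u201d")
        elem.tag = "span"  # attributes on the element are kept as they are
        elem.text = open_q + (elem.text or "")
        if len(elem):
            # closing quote goes after the last child, still inside the span
            last = elem[-1]
            last.tail = (last.tail or "") + close_q
        else:
            elem.text += close_q
    return elem
```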
0.10.2 January 18, 2021
- corrected text in PG footer for RST - thanks Roger Frank. Note that boilerplate generation is being revised for v0.11
- when reflowing pre, don't make it one long line. This was causing problems in the Kobo reader.
- don't drop a heading that starts with "by " if class 'x-ebookmaker-important' is on it.
- also log headings that are dropped because they start with "by "
- fix bug in anchor fixing where if an <a> tag had both id and name attributes, both were deleted.
- delinting
0.10.1 November 25, 2020
- fixed minor issue where "too deep" errors were emitted for self-links. Thanks to rfrank for the error report.
- fixed deprecation warning from Docutils; should be ok for Docutils > 0.1
0.10.0 November 2, 2020
- SVG files are now considered images and included in EPUBs. They were being discarded. SVG files are not scaled or compressed by ebookmaker - the renderer should be able to auto-scale. This appears to fix kindlegen failures associated with svg images.
- fixed the rst test
0.9.7 September 14, 2020
- changed font for rst conversion. Linux Libertine was unmaintained since 2012 and no package was available for CENTOS8. We switched to the closest replacement, Libertinus https://github.com/alerque/libertinus
- added documentation about configuration for rst conversion
- the deprecated 'handheld' @media query was being used to prevent ebookmaker from stripping floats. to preserve this feature, ebookmaker no longer strips floats when the css selector contains the x-ebookmaker class. Most likely, float stripping was originally needed because html pages were designed before the advent of EPUB. Today, we can assume that if the html designer uses the x-ebookmaker class, they've considered the impact of the float on the generated EPUB.
- assorted delinting
0.9.6 September 8, 2020
- added 'x-ebookmaker' class to epub body elements. There are now 4 "x-ebookmaker" classes
- css can now apply styles that are triggered by being a descendent of .x-ebookmaker. This addition is meant to replace the 'handheld' @media query that is deprecated in HTML 5
- the 'x-ebookmaker-important' class on an image element tells ebookmaker not to remove the image, even in no-images builds.
- the 'x-ebookmaker-drop' class tells ebookmaker to remove an element and its descendents from ebook builds.
- the 'x-ebookmaker-pageno' class is applied to some span elements whose content has been stripped because they use a class that indicates they represent page numbers: pagenum pageno page pb folionum foliono
- added mayan as a language not supported by kindlegen
- typos fixed in README - thanks Joseph Koshy
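The effect of the 'x-ebookmaker-drop' class introduced in this release can be pictured with the sketch below. It is a simplified illustration using `xml.etree.ElementTree`, not Ebookmaker's implementation; the function name `drop_marked` is hypothetical, and the sketch does not deal with links that pointed into the removed content.

```python
import xml.etree.ElementTree as ET

DROP_CLASS = "x-ebookmaker-drop"


def drop_marked(root):
    """Remove every element carrying DROP_CLASS, together with its descendants."""
    for parent in list(root.iter()):
        for child in list(parent):
            if DROP_CLASS not in child.get("class", "").split():
                continue
            if child.tail:
                # ElementTree stores the following text on the removed node, so hand it on.
                index = list(parent).index(child)
                if index > 0:
                    previous = parent[index - 1]
                    previous.tail = (previous.tail or "") + child.tail
                else:
                    parent.text = (parent.text or "") + child.tail
            parent.remove(child)
    return root
```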
0.9.5 July 6, 2020
- fixed minor issue where the spider was getting confused when iterating on WrapperParsers.
0.9.4 June 30, 2020
- handle invalid quantization table in jpeg files when using quality 'keep'
- respect rel="nofollow" attribute: this allows authors to link to an alternate version file in html without duplicating content in the EPUB file.
- set wrappers to nonlinear in spine.
- fixed bugs and ugliness in toc generation
- when the same header level is consecutive, only one toc item is generated. (We see this used to make multiline heads or titles)
- toc normalization made a hash of the toc. Now the toc is normalized in the epubwriter, not in the parser.
- add display:block to standard css sheet to prevent hidden headings from breaking kindlegen
- added a configuration option to use calibre (or whatever!) for languages not supported by kindlegen.
0.9.3 June 23, 2020
- Fix reversion in 0.9.2 which caused CoverWriter in EbookConverter to fail. (It uses ImageParser to convert images from PNG to JPEG.)
- add a test to check this
0.9.2 June 19, 2020
- note that EbookMaker is no longer installable in Python 2.7 (thanks cpeel)
- clean up pipfile and gitignore pipfile.lock (thanks cpeel)
- fixed bug where filepaths need escape in HTML
- fixed issue where compression was expanding compressed jpegs (thanks choward)
- added EBOK flag in MOBI when using ebook-convert (thanks rfrank)
0.9.1 June 11, 2020
Minor bug fix and optimization.
- picsdir builds weren't recognizing when it was copying files to themselves
- "broken" images are now inserted when an image is missing
- title attribute in wrappers needed escaping
- build times are now reported to logs
- small optimization with preparse on image parsers
0.9.0 June 2, 2020
Image handling has changed starting with 0.9
- linked images (image files as targets of links) are now wrapped in html, fixing display in ADE.
- linked images are compressed to 1MB if possible (changed from 128K)
- inline images are compressed to 256K if possible (changed from 128K)
- all images are limited to 5000x5000 pixels (was 800 x 1280)
- PNG images are scaled to meet the image filesize targets (previously no scaling of PNG images)
- L format JPEG images (greyscale) are no longer converted to RGB (thanks rfrank)
- generated covers are now 1200 x 1800
- covers and "important" images in noimages builds are scaled to 64K (previously no scaling)
General bug fixes
- eliminated double parsing due to first pass using raw paths
- when tidy fails, the (huge) error trace is only logged once.
- generated html is no longer overwritten by empty results
0.8.12 May 5, 2020
- corrected the exception to catch for missing files
0.8.11 May 4, 2020
- It turns out that ebookmaker gets called both with bare paths and file: urls depending on ebookconverter config files. So 0.8.10 broke on the production machine, though it seemed just fine on mac and windows. So we went back to the drawing board to figure out how to support posix and windows, with or without windows mount points (not sure if that's the right term), file:/// urls and bare paths. We also figured out some issues involving spaces in paths.
0.8.10 April 30, 2020
- fixed numerous file path nits on Windows
- kindlegen
- figsdir
- cover
- pdf
- PEP8 delinted:
- EbookMaker.py
- ParserFactory.py
- Spider.py
- parsers/__init__.py
- writers/EpubWriter.py
0.8.9 April 13, 2020
- fixed issues preventing successful deployments on Windows
- added logging for books deeper than max_depth
- improved documentation for tidy and cairo prerequisites
- try to catch and report exceptions when tidy and cairo aren't installed
- added install-on-windows notes
0.8.8 March 6, 2020
- fixed issue preventing chunking when text file is latin-1
- fixed failure when source links to a directory
- improved parse error message
0.8.7 February 10, 2020
- fixed issue causing failed build when file encoding doesn't match platform default
- fixed issue where cover set on command line is excluded in build.
0.8.6 February 6, 2020
- fixed issue where covers aren't set when source is txt
0.8.5 January 27, 2020
- set "huge_tree" attribute for the xml parser to keep big files from blocking a build
0.8.4 January 23, 2020
- fixed bugs in rst->tex conversion
- class = 'medium' extra '}'
- removed '%' in noindent
0.8.3 January 21, 2020
- fixed bug exposed by setting cover with a command line argument
0.8.2 January 16, 2020
- refined embedded media error message to include referrer
- added external link warning
0.8.1 January 15, 2020
- Ebookmaker has been downloading embedded media, for example images, from arbitrary websites and including them in epubs. This has caused build errors when a remote site goes away. ebookmaker has command line parameters for including and excluding urls that have governed document files; these same rules now apply to media files.
In general, PG books should not embed media from other sites.
0.8.0 January 10, 2020
- support ebook-convert tool from Calibre as alternative to kindlegen
To use calibre instead of kindlegen
1. install calibre
2. change MOBIGEN setting to 'ebook-convert' or the path to ebook-convert
0.7.10 January 9, 2020
- build failures on our production system suggest that the parser's output for empty style elements may be os-dependent. Parser now handles style elements with null text.
- allow the build to succeed even if a css file is missing
- allow px only in border properties. see discussion below
- fix false encoding error (complaining about "Klingon") when content file is missing
Ebookmaker has been removing any css rules that use 'px' measurements. 'px' measurements are discouraged in css styles for epub because of poor scaling when user changes font size. However fixed lengths are still useful in properties such as table borders. see discussion on DP forum:
https://www.pgdp.net/phpBB3/viewtopic.php?f=3&t=41237&p=1188773&hilit=windymilla#p1188746
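The px rule in this release can be illustrated with the sketch below. It is an assumption-laden simplification, not Ebookmaker's code: the function name `filter_px` is hypothetical, the input is a plain declaration string split on semicolons rather than parsed CSS, it filters individual declarations rather than whole rules, and any property whose name starts with "border" is treated as a border property.

```python
import re

# a digit followed by the px unit, e.g. "10px" or "1 px"
PX_VALUE = re.compile(r"\d\s*px\b", re.IGNORECASE)


def filter_px(declarations):
    """Drop declarations that use px lengths unless the property is a border property."""
    kept = []
    for decl in declarations.split(";"):
        if ":" not in decl:
            continue
        prop, value = decl.split(":", 1)
        prop, value = prop.strip().lower(), value.strip()
        if PX_VALUE.search(value) and not prop.startswith("border"):
            continue  # px is discouraged here because it scales poorly with font size
        kept.append("%s: %s" % (prop, value))
    return "; ".join(kept)
```

For `margin: 10px; border: 1px solid black; font-size: 1.2em`, the sketch keeps the border and font-size declarations and drops the margin.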
0.7.9 January 6, 2020
- Ebookmaker was failing to make epubs if a linked file was missing. Now it emits an Error message to the log with url of missing file, but goes ahead and makes an epub without it.
0.7.8 January 4, 2020
- updated libgutenberg to 0.5.1
- added warning when mediatype of a linked file cannot be determined
0.7.7 December 19, 2019
- fixed bugs in ParserFactory that raised exceptions for parsers built from package-supplied resources
- fixed bug that raised exceptions when a cover was rejected for being too small
0.7.6 December 2, 2019
- disabled html to txt conversion because no conversion was being done
- changed a truth eval of an etree element to silence deprecation warnings
- corrected documentation of config behavior
- better install documentation
0.7.5 October 22, 2019
- deconflicted an overloaded "parsers" variable
- fix parse failure when an htm file links to valid xml (not xhtml)
- updated libgutenberg to correctly handle music files.
0.7.4 October 21, 2019
- fixed bug where a cover is always generated during conversion to kindle
- epub generated files are now only saved to log for -v -v -v
0.7.3 October 9, 2019
- fixed bug where if generated cover can't be written, ebookfile is not made
- when outputdir is specified, cover is written to outputdir
- updated libgutenberg to support 'aut' in marcrel covers
- unloaded outputdir from job queue, as it had been passing in options
0.7.1 September 30, 2019
- fixed bug for pickled jobs; dc was previously passed silently via builtins in options
- fixed bug in cover generator when url is a file: url
0.7.0 September 20, 2019
- config files can now be used to provide defaults for most command line arguments.
0.6.4 September 16, 2019
- added updates so that gutenberg canonical url is always https
- updated libgutenberg requirement
0.6.3 September 12, 2019
- fixed parsing of rst files in MyDocUtils
- fixed undefined options in TxtWriter and PdfWriter
0.6.2 May 10, 2019
- updated libgutenberg requirement. its DublinCore class depended on the installation of '_' that was removed in ebookmaker 0.4.1a1
0.6.1 May 9, 2019
- added strip_links command-line option. turn it off to stop EbookMaker from stripping links in EPUB and MOBI output
- the test_htm no longer tries conversion to pdf or rst
- fixed conversion failure when html contains '<pre><code /></pre>'
- corrected install directions
- readme is now markdown
- fix python 3.6.7 compatibility issue when dbm_gnu not present
- added tests for loading parsers, loaders and packagers
0.6.0 February 14, 2019
- command line additions
1. set cover url
2. generate cover flag
- cover should be first in reading order
- covers can be png
- added tests
- added travis
- fixed some import issues exposed by testing
- expected conversion skipping is a warning, not an error
- fixed issue when there's no stylesheet -- thanks to ray
- stop putting silly tag attribute into html
0.5.0 January 10, 2019
Moved Borg options class to libgutenberg
distutils --> setuptools
0.4.1a1
don't be using builtins as a backdoor global
updates for docutils 0.14 compatibility
0.4.0a2
Fix legalese in PG boilerplate.
0.4.0a1
Port to Python3.
Lots of refactoring.
Code cleanup as suggested by 2to3 and pylint.
Package renamed to ebookmaker.
0.3.20
Do not make special kindlegen epub anymore. Requires kindlegen 2.7+.
Better coverpage handling.
Works with docutils 0.11+.
0.3.19
0.3.19b6
Floats now support 'here'.
0.3.19b5
Fix typo in license text.
Fix "strip_links" debug message crash.
Extend styles directive.
- Add display option to hide the element.
- Allow for negative matches.
Don't use \marginpar for page numbers in TeX.
0.3.19b4
Style directive extended.
Now preserves all trailing whitespace except U+0020.
Added "table de matières" to auto toc detection.
Convert U+2015 to single hyphen in plain text.
0.3.19b3
Fix keyerror hrules and vrules.
Fix unescaped characters in html meta attribute values.
Fix default block image alignment.
Fix: use numeric entities in xhtml writer.
0.3.19b2
Fixed text-indent in page nos (made pagenos disappear in line blocks).
Fixed whitespace collapsing in <pre> nodes.
Fixed: honors newlines in metadata fields.
Internal fix: correct format name is: "txt.utf-8".
Can use docinfo in addition to meta directive.
0.3.19b1
New formats: html.noimages and pdf.noimages.
No-image builds use a placeholder 'broken' image instead of nothing.
Figure directives without a filename create a placeholder 'broken' image.
New option :selector: in lof and lot directives for filtering.
Turn off italics with class no-italics (and bold with no-bold).
nbsp now works in ascii txt, soft hyphens now removed from ascii txt.
Insert line numbers with [ln 42] and [ln!42].
Works with kindlegen 2.0.
0.3.18
Allow unicode line separator U+2028 as line feed.
Fix XetexWriter bug with tables without explicit width.
Add language support in XetexWriter.
Works with docutils 0.8
Support docutils-0.8-style :class: language-<code>.
0.3.17
Fix line height of large text.
Fix images with spaces in src attribute.
0.3.16
Add image_dir to Xetex writer.
Use quotation environment instead of quote.
Don't automatically insert \frontmatter.
Page nos. for kindlegen 1.2.
Call kindlegen.
Integrate changes into PG environment.
0.3.15
Reduce vertical margin of images to 1 in TXT.
Fixed link targets in NROFF, PDF.
Report error on xetex errors.
Escape characters in PDF info.
0.3.14
Fixed crash on HTML comments in Kindle writer.
0.3.13
Start on Kindle writer.
Fix spurious space in PDF literal blocks with classes.
Fix 'flat' TOC.