All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.5.2 - 2023-09-24
- Improved handling of empty attribute values (<img alt="">) and valueless attributes (<iframe seamless>).
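The distinction between the two attribute forms can be illustrated with Python's standard html.parser module. This is only a sketch and not pywebarchive's actual code; the AttrCollector class and render_attr() helper are hypothetical names introduced for the example.

```python
from html import escape
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: record each start tag's attributes by tag name."""

    def __init__(self):
        super().__init__()
        self.seen = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when the
        # attribute has no value at all, and '' when it is explicitly empty.
        self.seen[tag] = dict(attrs)


def render_attr(name, value):
    """Serialize one attribute, keeping empty and valueless forms distinct."""
    if value is None:
        return name  # valueless, as in <iframe seamless>
    return f'{name}="{escape(value, quote=True)}"'  # includes alt=""


collector = AttrCollector()
collector.feed('<img alt=""><iframe seamless></iframe>')
```

Here collector.seen maps img to an alt value of '' and iframe to a seamless value of None, so a rewriter that treats both as "missing" would lose the difference.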
0.5.1 - 2022-10-08
- Document the function of the WebResource.frame_name property.
0.5.0 - 2022-04-16
- More complete documentation for the WebArchive and WebResource classes.
- Documentation on pywebarchive's internals.
- Unit test for subresource URLs occurring as literal text.
- Massively overhaul the README.
- Improved the documentation for the webarchive module.
- Expanded and clarified various code comments.
- Use a with clause for proper cleanup in test/extracted_archive_display.py.
- Rename WebArchive.extract()'s single_file argument to the more descriptive embed_subresources (potentially backwards-incompatible change).
- Raise a WebArchiveError when attempting to extract a webarchive with no main resource.
- Raise a WebArchiveError when attempting to convert a webarchive with no main resource to HTML.
- Return the correct value for WebArchive.resource_count() if no main resource is present.
- Removed the unnecessary <!-- Processed by pywebarchive --> tag previously added to extracted pages.
0.4.1 - 2022-03-26
- Call close() in WebArchive.__exit__().
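For readers unfamiliar with the pattern, the following is a minimal sketch of a context manager whose __exit__() delegates to close(). The ArchiveLike class is a hypothetical stand-in, not the real WebArchive implementation.

```python
class ArchiveLike:
    """Hypothetical stand-in for a closable resource; not the real WebArchive."""

    def __init__(self):
        self.closed = False

    def close(self):
        # Release whatever the object holds; here we only flip a flag.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always clean up, whether or not the with block raised.
        self.close()
        return False  # do not suppress exceptions


with ArchiveLike() as archive:
    closed_inside_block = archive.closed
```

After the with block ends, archive.closed is True even though the caller never called close() explicitly.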
0.4.0 - 2022-03-26
- Context manager (with statement) support in the WebArchive class.
- The WebArchive.close() method.
- The WebArchive.parent property.
- Support for the mode argument in webarchive.open() (though only read mode remains implemented).
- Further cleaned up internal APIs.
- Improved module documentation.
- Ensure an encoding is always specified when creating a text WebResource.
- Removed duplicated code in test/extracted_archive_display.py.
0.3.3 - 2021-11-05
- Unit tests for HTML- and CSS-rewriting logic.
- Build script for the Windows version of Webarchive Extractor.
- Clean up the WebResource class's internal API.
- Do not force a newline after the doctype in HTMLRewriter.handle_decl().
- Moved test_extracted_archive_display from the unit tests to a separate script.
- Removed test_extracted_archive_display's dependency on Tkinter.
- Rewrite URLs in inline CSS code when extracting.
0.3.2 - 2021-09-26
- The module version number in webarchive.__version__.
- Initial support for command-line arguments in extractor-gui.py.
- The --version argument in extractor.py and extractor-gui.py.
- Further code cleanup.
- Give more descriptive names to various internals.
- Support HTML subresources.
- Handle non-HTML subresources incorrectly served as text/html.
- Update the module description in setup.py to match its documentation.
- Specify a text encoding in WebArchiveTest.test_webarchive_to_html() so the test will pass on Windows.
- Make webbrowser an optional dependency in extractor.py to match extractor-gui.py.
0.3.1 - 2021-09-25
- Unit test for WebArchive.to_html().
- Massively expanded module documentation.
- Don't delete the srcset attribute from <img>.
- Embed style sheets in single-file mode using data URIs rather than <style>.
- Cleaned up various internals.
- Handle srcset entries without a width or pixel density descriptor.
- Embed subresources recursively when calling WebResource.to_data_uri() on an archive's main resource.
- Don't escape HTML entities in a <script> or <style> block.
- Correctly handle non-HTML main resources.
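As an illustration of the srcset case mentioned above, here is a simplified sketch of splitting a srcset value into (URL, descriptor) pairs, where the descriptor may be absent. The parse_srcset() function is hypothetical and is not taken from pywebarchive; it assumes URLs contain no commas, so it would mishandle data URIs.

```python
def parse_srcset(value):
    """Split a srcset value into (url, descriptor) pairs.

    The descriptor is None when an entry gives only a URL.
    Simplifying assumption: URLs themselves contain no commas.
    """
    entries = []
    for candidate in value.split(","):
        parts = candidate.split()
        if not parts:
            continue  # skip empty entries such as a trailing comma
        url = parts[0]
        descriptor = parts[1] if len(parts) > 1 else None
        entries.append((url, descriptor))
    return entries


example = parse_srcset("a.png, b.png 2x, c.png 480w")
```

In this sketch the first entry, a.png, comes back with a descriptor of None rather than being dropped or raising an error.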
0.3.0 - 2021-07-18
- Experimental support for extracting webarchives to single-file HTML documents.
  - External scripts and style sheets are replaced with inline content.
  - External images are embedded using data URIs.
- New command-line options for extractor.py:
  - -s/--single-file to extract archive contents to a single HTML file.
  - -o/--open-page to open the extracted webpage when finished.
- New WebArchive class methods:
  - get_local_path() returns the basename of the file created when a specified subresource is extracted.
  - get_subframe_archive() returns the subframe archive corresponding to a specified URL.
  - get_subresource() returns the subresource corresponding to a specified URL.
  - to_html() returns the archive's contents as a single-file HTML document.
- The WebResource.archive property, which identifies a given resource's parent WebArchive.
- The WebArchiveError exception.
- Moved the development status up to beta.
- Correctly handle "empty" tags like <img /> in XHTML documents.
- Fixed local resource paths for extracted subframe archives.
- Removed the Extractor class, included only for backwards compatibility with the poorly thought-out 0.1.0 API.
0.2.4 - 2020-02-22
- Unit tests.
- extractor-gui.py can now open converted files on non-Windows platforms.
0.2.3 - 2019-09-02
- Code cleanup release; no user-visible changes.
0.2.2 - 2018-10-21
- Various bugfixes, mainly involving subframe archives.
0.2.1 - 2018-10-20
- Graphical extraction tool.
- Support for subframe archives.
- Various bugfixes.
Note: Version 0.2.0 was pulled shortly after posting due to problems with its setup.py script.
0.1.1 - 2018-10-19
- The open() function as the preferred way to open a WebArchive.
- Moved extraction into the main WebArchive class.
- Massive internal cleanup.
- Deprecated the Extractor class from the poorly thought-out initial API.
0.1.0 - 2018-10-16
- Initial public release.