<html>
<head>
<style>
html {
background-color: #000207;
color: white;
font-family: sans-serif;
}
table, th, td {
border: 1px solid white;
border-collapse: collapse;
}
span.NoLineBreak {
white-space: nowrap;
}
abbr{cursor: help;}
</style>
</head>
<body>
<h1><center>How to save pages using the Internet Archive's savepagenow@archive.org</center></h1>
<p>While it's good news that the Internet Archive has implemented email-based automated saving (announced
<a href="https://blog.archive.org/2019/10/23/the-wayback-machines-save-page-now-is-new-and-improved/">here</a>),
it isn't all sunshine and rainbows, because pages can fail to save. This tutorial will help you deal with such failures effectively.</p>
<p>I recommend the following tools:
<ul>
<li><a href="https://notepad-plus-plus.org/downloads/">Notepad++</a> (<a href="https://github.com/notepad-plus-plus/notepad-plus-plus">GitHub</a>). It is needed because
this tutorial relies on several features that Microsoft's Notepad doesn't have.<br><br>
Plugins for Notepad++:
<ul>
<li>Compare (<a href="https://github.com/jsleroy/compare-plugin">GitHub</a>). This is needed to check whether all the URLs have been saved correctly (mainly
to catch random redirects, like on Twitter).</li>
</ul>
</li>
</ul>
</p>
<h2>Notepad++'s macro feature</h2>
<p>Most of the work takes place in Notepad++, and almost all of it uses the Replace function. If you find these steps tedious, I recommend
setting up <a href="https://npp-user-manual.org/docs/macros/">macros</a> by recording yourself doing them.</p>
<h2>Saving pages</h2>
<ol>
<li id="ListOfLinksToSend">Once you have a list of URLs, format it:
<ol>
<li>Make sure that every URL is on its own line (don't have multiple URLs on the same line).</li>
<li>Sort the lines alphabetically (Edit → Line Operations → Sort Lines Lexicographically Ascending).</li>
<li>Remove duplicates (Edit → Line Operations → Remove Consecutive Duplicate Lines).</li>
<li>Make sure you save that list, so you can search it for links that have failed.</li>
</ol>
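<p>The formatting steps above can also be sketched in Python. This is only an illustration, not part of the Notepad++ workflow: the function name <kbd>prepare_url_list</kbd> is made up, and it assumes the input already has one URL per line.</p>

```python
def prepare_url_list(raw_text):
    """Return the URLs in raw_text, sorted and with duplicates removed."""
    # Assumption: one URL per line; surrounding whitespace and blank lines are dropped.
    urls = [line.strip() for line in raw_text.splitlines() if line.strip()]
    # set() removes duplicates; sorted() orders the remaining URLs by code point.
    return sorted(set(urls))


if __name__ == "__main__":
    sample = "https://b.example/\nhttps://a.example/\n\nhttps://b.example/\n"
    print("\n".join(prepare_url_list(sample)))
```

<p>Save the printed list to a file, the same way you would save the list in Notepad++.</p>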
</li>
<li>
Send the links:
<ul>
<li>
If any URL contains special characters, such as non-ASCII characters (for example, Japanese characters) or a space character, it is prone to being saved
incorrectly. Such characters are often <a href="https://en.wikipedia.org/wiki/Percent-encoding">percent encoded</a>.
If the URLs contain raw characters, convert them to percent-encoded form, and instead of submitting these
links as raw text, create a blank HTML file, paste your links, then open Replace (CTRL+H) and do the following:
<ol>
<li>Wrap the URLs in an HTML hyperlink by prepending <kbd>&lt;a href="</kbd> to the left of each link: ...
<table>
<tr>
<td>Find what:</td>
<td><kbd>^</kbd></td>
</tr>
<tr>
<td>Replace with:</td>
<td><kbd>&lt;a href="</kbd></td>
</tr>
<tr>
<td>Search mode:</td>
<td>Regular Expression</td>
</tr>
</table>
and click “Replace All”.
</li>
<li>...Now append <kbd>"&gt;Link&lt;/a&gt;</kbd> to close the opening tag, add the clickable link text, and end the hyperlink:
<table>
<tr>
<td>Find what:</td>
<td><kbd>$</kbd></td>
</tr>
<tr>
<td>Replace with:</td>
<td><kbd>"&gt;Link&lt;/a&gt;</kbd></td>
</tr>
<tr>
<td>Search mode:</td>
<td>Regular Expression</td>
</tr>
</table>
</li>
<li>
All the URLs should now look like this:
<table><tr><td>
<pre>&lt;a href="https://www.example.com"&gt;Link&lt;/a&gt;
&lt;a href="https://www.example2.com"&gt;Link&lt;/a&gt;
&lt;a href="https://www.example3.com"&gt;Link&lt;/a&gt;
...
</pre>
</td></tr></table>
Then save the file. Open the HTML file in any browser and copy the displayed links. Paste them into an email (it should retain the HTML formatting) and send it.
</li>
</ol>
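<p>For reference, the percent-encoding and wrapping steps above can be sketched in Python. This is a rough illustration under assumptions: <kbd>to_html_links</kbd> is a made-up name, the "safe" character set is my own choice, and it only encodes spaces and non-ASCII characters (as UTF-8) while leaving existing <kbd>%XX</kbd> sequences alone. It does not handle non-ASCII host names.</p>

```python
from html import escape
from urllib.parse import quote

# Assumption: these reserved URL characters (and "%", so that existing
# percent-escapes are not encoded twice) are left untouched.
SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def to_html_links(urls):
    """Percent-encode each URL and wrap it as <a href="...">Link</a>, one per line."""
    lines = []
    for url in urls:
        encoded = quote(url, safe=SAFE_CHARS)
        # escape() keeps characters such as & from breaking the href attribute.
        lines.append('<a href="{}">Link</a>'.format(escape(encoded, quote=True)))
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_html_links(["https://example.com/a b", "https://example.com/?a=1&b=2"]))
```

<p>The output can be saved as the HTML file described above and opened in a browser in the same way.</p>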
</li>
<li>
Otherwise, if the links contain no such characters, you can submit them as raw text.
</li>
</ul>
</li>
<li>
Then wait until you get a reply.
</li>
</ol>
<h2>Retry saving failed links</h2>
<ol>
<li>
The reply should list all the URLs found in your email. The list is formatted like this:
<table><tr><td>
<ol>
<li><kbd>&lt;URL address1&gt;</kbd> <kbd>&lt;Status&gt;</kbd></li>
<li><kbd>&lt;URL address2&gt;</kbd> <kbd>&lt;Status&gt;</kbd></li>
<li><kbd>&lt;URL address3&gt;</kbd> <kbd>&lt;Status&gt;</kbd></li>
...
</ol>
</td></tr></table>
This shows every URL you attempted to save and indicates which links failed to save. The legend follows:
<ul>
<li><kbd>&lt;URL addressX&gt;</kbd> The URL you submitted. When saving fails, this shows the original URL; when it succeeds, it shows the
Internet Archive version of the URL
(<span class="NoLineBreak"><kbd>https://web.archive.org/web/YYYYMMDDhhmmss/&lt;URL addressX&gt;</kbd></span>) instead.</li>
<li>
<kbd>&lt;Status&gt;</kbd> The status indicating whether the page has been saved. Statuses include, but are not limited to:
<ul>
<li>When saving fails:
<ul>
<li>Error! Browser timeout for &lt;URL address&gt;</li>
<li>Error! Capture timed out</li>
<li>Error! Expecting value: line 1 column 1 (char 0)</li>
<li>Error! Internal Server Error for &lt;URL address&gt; (HTTP status=500)</li>
<li>Error! Job failed</li>
<li>Error! Live page is not available: &lt;URL address&gt;</li>
<li>Error! System proxy error for URL &lt;URL address&gt;</li>
<li>Error! The server didn't respond in time for &lt;URL address&gt;</li>
<li>Error! The server response status was 502.</li>
</ul>
</li>
</ul>
On success, the status is “First Archive” if the URL was saved for the first time; otherwise no status is shown.
</li>
</ul><br>
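<p>To make the reply format concrete, here is a small Python sketch that splits one reply line into its parts. The name <kbd>parse_reply_line</kbd> is hypothetical, and the sketch assumes that the URL contains no spaces (it is percent encoded) and that a single space separates it from the status.</p>

```python
import re

# Prefix the Wayback Machine puts in front of a successfully saved URL.
WAYBACK_PREFIX = re.compile(r"^https://web\.archive\.org/web/[0-9]+/")


def parse_reply_line(line):
    """Return (original_url, status, saved) for one line of the reply."""
    # Assumption: everything before the first space is the URL, the rest is the status.
    url, _, status = line.strip().partition(" ")
    saved = WAYBACK_PREFIX.match(url) is not None
    original_url = WAYBACK_PREFIX.sub("", url)
    return original_url, status.strip(), saved


if __name__ == "__main__":
    print(parse_reply_line("https://example.com/ Error! Job failed"))
```

<p>An empty status together with <kbd>saved</kbd> being true corresponds to a successful save that was not the first archive of that URL.</p>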
Note: During my testing, if the Wayback Machine gets redirected while saving a URL, it just displays the redirect destination URL, not the URL you sent (with no explicit indication that a redirect occurred).
This is problematic if pages randomly redirect when they shouldn't. Twitter, for example, can randomly redirect, and the reply then displays only one of these in place of
the URL you sent (including but not limited to):
<ul>
<li><kbd>https://web.archive.org/web/YYYYMMDDhhmmss/https://api.twitter.com/2/timeline/conversation/&lt;TweetID&gt;.json?&lt;long string of commands&gt;</kbd></li>
<li><kbd>https://web.archive.org/web/YYYYMMDDhhmmss/https://twitter.com/i/js_inst?c_name=ui_metrics</kbd></li>
<li><kbd>https://web.archive.org/web/YYYYMMDDhhmmss/https://pbs.twimg.com/hashflag/config-&lt;year&gt;-&lt;month&gt;-&lt;day&gt;-16.json</kbd></li>
</ul>
See <a href="#RandomRedirect" target="_blank">here</a> for how to retry saving links that have been redirected.
</li>
<li>Saving errors:
<h3>For links that errored out</h3>
<ol>
<li>
To obtain the URLs that have errored out, copy all the text of the reply and paste it into Notepad++. Use the Find function to extract the URLs that have an error:
<table>
<tr>
<td>Find what:</td>
<td><kbd>Error!</kbd></td>
</tr>
<tr>
<td>Search mode:</td>
<td><kbd>Normal</kbd></td>
</tr>
</table>
Then click “Find All in Current Document”. This lists, line by line, every line that contains “Error!”, which should
be all the URLs that failed to save. Right-click the top bar of the search results and copy; this should copy all the lines in the search result.
</li>
<li>
Paste the links into a new blank document/tab in Notepad++.
</li>
<li>
<span id="GetOriginalLinksFromArchive">Extract the original URLs from the Internet Archive URLs as follows:</span>
<ol>
<li>
Remove the Internet Archive part of the URL:
<table>
<tr>
<td>Find what:</td>
<td><kbd>https://web\.archive\.org/web/[0-9]*/</kbd></td>
</tr>
<tr>
<td>Replace with:</td>
<td>(nothing)</td>
</tr>
<tr>
<td>Wrap around:</td>
<td>Checked</td>
</tr>
<tr>
<td>Search mode:</td>
<td>Regular Expression</td>
</tr>
</table>
Note: Depending on which browser you are using, extra characters may be added to the copied text. Firefox, for example,
prepends 4 spaces to anything within an HTML tag that has a closing tag, so make sure you remove those.
</li>
<li>
Remove the status at the end of the URL:
<table>
<tr>
<td>Find what:</td>
<td><pre> Error! .*$</pre> (note: keep the space character before “Error!”)</td>
</tr>
<tr>
<td>Replace with:</td>
<td>(nothing)</td>
</tr>
<tr>
<td>Wrap around:</td>
<td>Checked</td>
</tr>
<tr>
<td>Search mode:</td>
<td>Regular Expression</td>
</tr>
<tr>
<td>. matches newline:</td>
<td>Unchecked</td>
</tr>
</table>
</li>
</ol>
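<p>The same extraction can be sketched in Python, reusing the two regular expressions from the tables above. The function name <kbd>failed_urls</kbd> is made up for this illustration, and the sketch assumes the reply text was pasted with one entry per line.</p>

```python
import re


def failed_urls(reply_text):
    """Return the original URL of every reply line that contains "Error!"."""
    urls = []
    for line in reply_text.splitlines():
        if "Error!" not in line:
            continue
        # Drop padding such as the 4 spaces some browsers add when copying.
        line = line.strip()
        # Remove the Internet Archive part of the URL, if there is one.
        line = re.sub(r"https://web\.archive\.org/web/[0-9]*/", "", line)
        # Remove the status, keeping the space before "Error!" in the pattern.
        line = re.sub(r" Error! .*$", "", line)
        urls.append(line)
    return urls


if __name__ == "__main__":
    reply = (
        "https://web.archive.org/web/20200101000000/https://ok.example/ First Archive\n"
        "    https://bad.example/page Error! Job failed\n"
    )
    print("\n".join(failed_urls(reply)))
```

<p>The returned list is what you would resend in the next step.</p>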
</li>
<li>
Now resend those failed links. Make sure you keep a copy of the list so you can check the results with the Compare plugin. Repeat until all links are successfully saved.
</li>
</ol>
<h3 id="RandomRedirect">Randomly redirected links</h3>
<ol>
<li>
Before doing this, make sure you have the list of links you sent to the Internet Archive (formatted following <a href="#ListOfLinksToSend" target="_blank">these rules</a>). Once you get a reply showing the save status, open a new
tab/document in Notepad++ and paste the list of links from the reply.
</li>
<li>
Do <a href="#GetOriginalLinksFromArchive" target="_blank">this familiar move</a> to convert the list of links to original links.
</li>
<li>
Once you have the original links, copy all of them, go to the list of URLs you sent, and paste them below it. Essentially,
you are combining the two lists into one. Make sure the pasted text doesn't start on the same line as the last sent URL.
</li>
<li>
Sort the lines alphabetically (Edit → Line Operations → Sort Lines Lexicographically Ascending). Any link that wasn't redirected should now
exist exactly twice (one copy from the list you sent, the other from the reply), placed next to each other, while URLs that were redirected
will exist only once (because the reply shows the redirect destination in place of the URL you sent). A made-up example:
<table><tr>
<td>
<pre>
https://www.google.com
https://example.com
https://twitter.com
https://wikipedia.org
https://www.google.com
https://example.com
</pre>
</td>
<td>
Becomes:
</td>
<td>
<pre>https://example.com
https://example.com
https://twitter.com
https://wikipedia.org
https://www.google.com
https://www.google.com</pre>
</td>
</tr></table>
example.com and www.google.com each exist twice, so they were saved properly, but wikipedia.org and twitter.com exist only once, so a redirect has occurred.
</li>
<li>
Go to the end of the last URL and add a line break (there must be a blank last line after the last URL).
</li>
<li>
Now replace:
<table>
<tr>
<td>Find what:</td>
<td><kbd>(?-s)^(.+\R)\1+</kbd></td>
</tr>
<tr>
<td>Replace with:</td>
<td>(nothing)</td>
</tr>
<tr>
<td>Search mode:</td>
<td>Regular Expression</td>
</tr>
<tr>
<td>Wrap around:</td>
<td>Checked</td>
</tr>
</table>
This deletes every URL that has 2 or more copies placed next to each other (not to be confused with Notepad++'s “Remove Consecutive Duplicate Lines”, which
reduces duplicates to one copy instead of erasing them entirely), while URLs that exist only once are kept. Using the same example as above, it should
result in this:
<table><tr>
<td>
<pre>https://example.com
https://example.com
https://twitter.com
https://wikipedia.org
https://www.google.com
https://www.google.com</pre>
</td>
<td>
Becomes:
</td>
<td>
<pre>https://twitter.com
https://wikipedia.org
</pre>
</td>
</tr></table>
</li>
<li>
Now resend those links and repeat (check again in case a random redirect still happens).
</li>
</ol>
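<p>As a cross-check on the sort-and-replace steps above, the same comparison can be sketched in Python. Note that this is a different technique from the regular expression: it is a plain set difference, and <kbd>redirected_urls</kbd> is a made-up name. It assumes the returned list has already been converted back to original URLs.</p>

```python
def redirected_urls(sent_urls, returned_urls):
    """Return the sent URLs that do not appear in the reply, sorted."""
    returned = set(returned_urls)
    # A sent URL that is missing from the reply was presumably replaced
    # by its redirect destination, so it needs to be resent.
    return sorted(url for url in set(sent_urls) if url not in returned)


if __name__ == "__main__":
    sent = [
        "https://www.google.com",
        "https://example.com",
        "https://twitter.com",
        "https://wikipedia.org",
    ]
    returned = ["https://www.google.com", "https://example.com"]
    print("\n".join(redirected_urls(sent, returned)))
```

<p>Unlike the Notepad++ replace, this keeps only URLs from the list you sent, so redirect destinations that appear once in the reply are left out of the result.</p>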
</li>
</ol>