How to archive an entire web page in a single HTML file

Use SingleFile to archive an entire web page in a single HTML file.

Browser extension

Install the browser extension for Firefox, Chrome, or Edge.

This is enough for occasional use. It is a quick and clean solution; just remember to back up the saved pages.
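
Backing up the saved pages can be as simple as bundling them into a dated archive. The following is a minimal sketch, assuming all saved pages sit in one directory; the function name backup_pages and both directory arguments are hypothetical and not part of SingleFile.

```shell
# Hypothetical helper: bundle every .html file from SRC_DIR into a dated
# tar.gz archive inside DEST_DIR and print the path of that archive.
backup_pages() {
  src_dir=$1
  dest_dir=$2
  archive="$dest_dir/saved-pages-$(date +%Y-%m-%d).tar.gz"
  mkdir -p "$dest_dir" || return 1
  # Run tar from inside SRC_DIR so the archive holds plain file names.
  # The ./ prefix keeps file names that start with a dash from being
  # read as tar options.
  (cd "$src_dir" && tar -czf - ./*.html) > "$archive" || return 1
  printf '%s\n' "$archive"
}
```

For example, backup_pages ~/Downloads ~/backup would archive the pages saved to ~/Downloads; both paths are placeholders.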

CLI utility

You can also use the command-line utility, which runs a headless browser.

Install npm.

$ sudo apt install npm

Install puppeteer.

$ npm install puppeteer

Install SingleFile.

$ npm install "gildas-lormeau/SingleFile#master"

Make sure that PATH includes the directory with the installed executables. Add the same line to your .bashrc file to make the change permanent.

$ export PATH=$PATH:~/node_modules/.bin/
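
Updating .bashrc by hand works, but the line is easy to append twice. Here is a small sketch of an idempotent update, assuming the packages were installed from the home directory so that the executables are in ~/node_modules/.bin; the function name add_path_line is hypothetical.

```shell
# Hypothetical helper: append a PATH export to RC_FILE unless the exact
# line is already there, so repeated runs do not duplicate it.
add_path_line() {
  rc_file=$1
  # Single quotes keep $PATH and $HOME unexpanded in the stored line.
  line='export PATH="$PATH:$HOME/node_modules/.bin"'
  # -F: fixed string, -x: match the whole line, -q: no output.
  if ! grep -qxF "$line" "$rc_file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$rc_file"
  fi
}
```

Call it as add_path_line ~/.bashrc and then start a new shell or source the file to pick up the change.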

Display help information.

$ single-file --help
single-file [url] [output]

Save a page into a single HTML file.

Positionals:
  url     URL or path on the filesystem of the page to save  [string]
  output  Output filename  [string]

Options:
  --help                                  Show help  [boolean]
  --version                               Show version number  [boolean]
  --back-end                              Back-end to use  [choices: "jsdom", "puppeteer", "webdriver-chromium", "webdriver-gecko", "puppeteer-firefox", "playwright-firefox", "playwright-chromium"] [default: "puppeteer"]
  --block-mixed-content                   Block mixed contents  [boolean] [default: false]
  --browser-server                        Server to connect to (puppeteer only for now)  [string] [default: ""]
  --browser-headless                      Run the browser in headless mode (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: true]
  --browser-executable-path               Path to chrome/chromium executable (puppeteer, webdriver-gecko, webdriver-chromium)  [string] [default: ""]
  --browser-width                         Width of the browser viewport in pixels  [number] [default: 1280]
  --browser-height                        Height of the browser viewport in pixels  [number] [default: 720]
  --browser-load-max-time                 Maximum delay of time to wait for page loading in ms (puppeteer, webdriver-gecko, webdriver-chromium)  [number] [default: 60000]
  --browser-wait-delay                    Time to wait before capturing the page in ms  [number] [default: 0]
  --browser-wait-until                    When to consider the page is loaded (puppeteer, webdriver-gecko, webdriver-chromium)  [choices: "networkidle0", "networkidle2", "load", "domcontentloaded"] [default: "networkidle0"]
  --browser-wait-until-fallback           Retry with the next value of --browser-wait-until when a timeout error is thrown  [boolean] [default: true]
  --browser-debug                         Enable debug mode (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: false]
  --browser-script                        Path of a script executed in the page (and all the frames) before it is loaded  [array] [default: []]
  --browser-stylesheet                    Path of a stylesheet file inserted into the page (and all the frames) after it is loaded  [array] [default: []]
  --browser-args                          Arguments provided as a JSON array and passed to the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [string] [default: ""]
  --browser-start-minimized               Minimize the browser (puppeteer)  [boolean] [default: false]
  --browser-cookie                        Ordered list of cookie parameters separated by a comma: name,value,domain,path,expires,httpOnly,secure,sameSite,url (puppeteer, webdriver-gecko, webdriver-chromium, jsdom)  [array] [default: []]
  --browser-cookies-file                  Path of the cookies file formatted as a JSON file or a Netscape text file (puppeteer, webdriver-gecko, webdriver-chromium, jsdom)  [string] [default: ""]
  --compress-CSS                          Compress CSS stylesheets  [boolean] [default: false]
  --compress-HTML                         Compress HTML content  [boolean] [default: true]
  --crawl-links                           Crawl and save pages found via inner links  [boolean] [default: false]
  --crawl-inner-links-only                Crawl pages found via inner links only if they are hosted on the same domain  [boolean] [default: true]
  --crawl-no-parent                       Crawl pages found via inner links only if their URLs are not parent of the URL to crawl  [boolean]
  --crawl-load-session                    Name of the file of the session to load (previously saved with --crawl-save-session or --crawl-sync-session)  [string]
  --crawl-remove-url-fragment             Remove URL fragments found in links  [boolean] [default: true]
  --crawl-save-session                    Name of the file where to save the state of the session  [string]
  --crawl-sync-session                    Name of the file where to load and save the state of the session  [string]
  --crawl-max-depth                       Max depth when crawling pages found in internal and external links (0: infinite)  [number] [default: 1]
  --crawl-external-links-max-depth        Max depth when crawling pages found in external links (0: infinite)  [number] [default: 1]
  --crawl-replace-urls                    Replace URLs of saved pages with relative paths of saved pages on the filesystem  [boolean] [default: false]
  --crawl-rewrite-rule                    Rewrite rule used to rewrite URLs of crawled pages  [array] [default: []]
  --dump-content                          Dump the content of the processed page in the console ('true' when running in Docker)  [boolean] [default: false]
  --emulate-media-feature                 Emulate a media feature. The syntax is <name>:<value>, e.g. "prefers-color-scheme:dark" (puppeteer)  [array]
  --error-file  [string]
  --filename-template                     Template used to generate the output filename (see help page of the extension for more info)  [string] [default: "{page-title} ({date-iso} {time-locale}).html"]
  --filename-conflict-action              Action when the filename is conflicting with existing one on the filesystem. The possible values are "uniquify" (default), "overwrite" and "skip"  [string] [default: "uniquify"]
  --filename-replacement-character        The character used for replacing invalid characters in filenames  [string] [default: "_"]
  --filename-max-length                   Specify the maximum length of the filename  [number] [default: 192]
  --filename-max-length-unit              Specify the unit of the maximum length of the filename ('bytes' or 'chars')  [string] [default: "bytes"]
  --group-duplicate-images                Group duplicate images into CSS custom properties  [boolean] [default: true]
  --http-header                           Extra HTTP header (puppeteer, jsdom)  [array] [default: []]
  --include-BOM                           Include the UTF-8 BOM into the HTML page  [boolean] [default: false]
  --include-infobar                       Include the infobar  [boolean] [default: false]
  --load-deferred-images                  Load deferred (a.k.a. lazy-loaded) images (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: true]
  --load-deferred-images-max-idle-time    Maximum delay of time to wait for deferred images in ms (puppeteer, webdriver-gecko, webdriver-chromium)  [number] [default: 1500]
  --load-deferred-images-keep-zoom-level  Load deferred images by keeping zoomed out the page  [boolean] [default: false]
  --max-parallel-workers                  Maximum number of browsers launched in parallel when processing a list of URLs (cf --urls-file)  [number] [default: 8]
  --max-resource-size-enabled             Enable removal of embedded resources exceeding a given size  [boolean] [default: false]
  --max-resource-size                     Maximum size of embedded resources in MB (i.e. images, stylesheets, scripts and iframes)  [number] [default: 10]
  --move-styles-in-head                   Move style elements outside the head element into the head element  [boolean] [default: false]
  --remove-frames                         Remove frames (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: false]
  --remove-hidden-elements                Remove HTML elements which are not displayed  [boolean] [default: true]
  --remove-unused-styles                  Remove unused CSS rules and unneeded declarations  [boolean] [default: true]
  --remove-unused-fonts                   Remove unused CSS font rules  [boolean] [default: true]
  --remove-imports                        Remove HTML imports  [boolean] [default: true]
  --remove-scripts                        Remove JavaScript scripts  [boolean] [default: true]
  --remove-audio-src                      Remove source of audio elements  [boolean] [default: true]
  --remove-video-src                      Remove source of video elements  [boolean] [default: true]
  --remove-alternative-fonts              Remove alternative fonts to the ones displayed  [boolean] [default: true]
  --remove-alternative-medias             Remove alternative CSS stylesheets  [boolean] [default: true]
  --remove-alternative-images             Remove images for alternative sizes of screen  [boolean] [default: true]
  --save-original-urls                    Save the original URLS in the embedded contents  [boolean] [default: false]
  --save-raw-page                         Save the original page without interpreting it into the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [boolean] [default: false]
  --urls-file                             Path to a text file containing a list of URLs (separated by a newline) to save  [string]
  --user-agent                            User-agent of the browser (puppeteer, webdriver-gecko, webdriver-chromium)  [string]
  --user-script-enabled                   Enable the event API allowing to execute scripts before the page is saved  [boolean] [default: true]
  --web-driver-executable-path            Path to Selenium WebDriver executable (webdriver-gecko, webdriver-chromium)  [string] [default: ""]
  --output-directory                      Path to where to save files, this path must exist.  [string] [default: ""]
  --accept-headers  [default: {"font":"application/font-woff2;q=1.0,application/font-woff;q=0.9,*/*;q=0.8","image":"image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8","stylesheet":"text/css,*/*;q=0.1","script":"*/*","document":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"}]

Test command execution.

$ single-file --back-end puppeteer --browser-executable-path /snap/bin/chromium  https://www.debian.org --dump-content
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html lang="pl"><!--
 Page saved with SingleFile 
 url: https://www.debian.org 
 saved date: Wed Mar 23 2022 00:15:53 GMT+0100 (czas środkowoeuropejski standardowy)
--><head><meta charset="utf-8">
  
  <title>Debian -- The Universal Operating System </title>
  <link rel="author" href="mailto:webmaster@debian.org">
  <meta name="Description" content="Debian to system operacyjny i dystrybucja Wolnego Oprogramowania. Opiekuje się nią wielu użytkowników, którzy poświęcają jej swój czas i wysiłek.">
  <meta name="Generator" content="WML 2.12.2">
  <meta name="Modified" content="2022-03-02 07:57:56">
  <meta name="viewport" content="width=device-width">
  <meta name="mobileoptimized" content="300">
  <meta name="HandheldFriendly" content="true">
[...]
Last Built: śro, 2. mar 2022r, 07:57:56 UTC
  <br>
  Copyright © 1997-2022
 <a href="https://www.spi-inc.org/">SPI</a> i inni; Zobacz <a href="https://www.debian.org/license" rel="copyright">warunki umowy</a><br>
  Debian jest zarejestrowanym <a href="https://www.debian.org/trademark">znakiem handlowym</a> Software in the Public Interest, Inc.
</p>
</div>
<!--/UdmComment-->
</div> <!-- end footer -->


</body></html>
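
The dumped content starts with a comment written by SingleFile, which gives a quick way to confirm that a test run produced a real capture. The sketch below assumes the marker text matches the sample output above (other versions may word it differently); the function name is_singlefile_page is hypothetical.

```shell
# Hypothetical helper: succeed when FILE contains the comment that
# SingleFile wrote into the sample output shown above.
is_singlefile_page() {
  grep -q 'Page saved with SingleFile' "$1"
}
```

One way to use it is to redirect the --dump-content output to a file, for example /tmp/debian.html, and then run is_singlefile_page /tmp/debian.html && echo ok.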

Save a web page.

$ single-file --back-end puppeteer --browser-executable-path /snap/bin/chromium  https://www.debian.org
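
According to the help output, the utility also accepts an optional output positional after the URL, so the file name does not have to come from the default template. As a rough sketch, a predictable name can be derived from the host part of the URL; the function name page_filename is hypothetical and the naming scheme is only one possible choice.

```shell
# Hypothetical helper: turn a URL into a predictable file name based on
# its host part, e.g. https://www.debian.org/intro -> www.debian.org.html
page_filename() {
  # First expression strips the scheme, second drops path, query, fragment.
  host=$(printf '%s\n' "$1" |
    sed -e 's|^[A-Za-z][A-Za-z0-9+.-]*://||' -e 's|[/?#].*$||')
  printf '%s.html\n' "$host"
}
```

The result can then be passed as the second positional argument:

$ single-file --back-end puppeteer --browser-executable-path /snap/bin/chromium https://www.debian.org "$(page_filename https://www.debian.org)"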

List the archived web page.

$ ls *.html
'Debian -- The Universal Operating System (2022-03-22 00_18_11).html'

Open archived web page.

$ chromium 'Debian -- The Universal Operating System (2022-03-22 00_18_11).html'
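
Generated file names contain spaces, parentheses and a timestamp, so typing them by hand is error-prone. A small sketch that picks the most recently modified page in a directory instead; the function name newest_page is hypothetical, and the -nt file test is assumed to be available (it is in bash and dash, although it is not strictly POSIX).

```shell
# Hypothetical helper: print the most recently modified .html file in DIR.
newest_page() {
  newest=
  for f in "$1"/*.html; do
    [ -e "$f" ] || continue   # the glob matched nothing
    # -nt: true when the left file is newer than the right one.
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest=$f
    fi
  done
  # Return a non-zero status when no page was found.
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

To open the latest page from the current directory:

$ chromium "$(newest_page .)"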

Alternatively, create a list of URLs and save all of them in one run.

$ cat <<EOF | tee /tmp/urls.txt
https://linux.com
https://debian.org
EOF
$ single-file --back-end puppeteer --browser-executable-path /snap/bin/chromium --urls-file=/tmp/urls.txt

List the archived web pages.

$ ls *.html
'Debian -- The Universal Operating System (2022-03-22 00_18_11).html'
'Debian -- The Universal Operating System (2022-03-22 00_29_31).html'
'Linux.com - News For Open Source Professionals (2022-03-22 00_29_32).html'
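
When the URL list is maintained by hand, it tends to collect blank lines, notes and duplicates. The following sketch cleans such a list before it is passed to --urls-file. The function name clean_urls is hypothetical, and treating lines that start with # as comments is a convention of this helper, not a documented SingleFile feature.

```shell
# Hypothetical helper: read a raw URL list on stdin, drop blank lines and
# lines starting with '#', remove duplicates, and keep the original order.
clean_urls() {
  # NF: line has at least one field; seen[] remembers lines already printed.
  awk 'NF && $1 !~ /^#/ && !seen[$0]++'
}
```

A possible run, using --output-directory from the help output (the directory must already exist, hence mkdir); my-urls.txt and archive are placeholder names:

$ clean_urls < my-urls.txt > /tmp/urls.txt
$ mkdir -p archive
$ single-file --back-end puppeteer --browser-executable-path /snap/bin/chromium --urls-file=/tmp/urls.txt --output-directory=archive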

I prefer the command-line utility, but choose whatever works best for you.