Install ArchiveBox, an open-source, self-hosted web archive, to preserve websites you care about.
Installation and configuration
Install required dependencies. Skip youtube-dl if you do not plan to use it. You can assume that you will need at least 1 GB of free space to install these packages.
$ sudo apt install python3 python3-pip git curl wget youtube-dl chromium
Clone the source code to the /srv/archivebox/ directory.
$ sudo git clone https://github.com/pirate/ArchiveBox.git /srv/archivebox --depth 1
Ensure that the output directory exists.
$ sudo mkdir -p /srv/archivebox/output
Create the /srv/archivebox/etc/ArchiveBox.conf configuration file.
# Example config file for ArchiveBox: The self-hosted internet archive.
# Copy this file to ~/.ArchiveBox.conf before editing it.
# Config file is in both Python and .env syntax (all strings must be quoted).
# For documentation, see:
#    https://github.com/pirate/ArchiveBox/wiki/Configuration

################################################################################
## General Settings
################################################################################

#OUTPUT_DIR="output"
#OUTPUT_PERMISSIONS=755
ONLY_NEW=True
TIMEOUT=3600
MEDIA_TIMEOUT=7200
#TEMPLATES_DIR="archivebox/templates"
#FOOTER_INFO="Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests."

################################################################################
## Archive Method Toggles
################################################################################

FETCH_TITLE=True
FETCH_FAVICON=True
FETCH_WGET=True
FETCH_WARC=True
FETCH_PDF=True
FETCH_SCREENSHOT=True
FETCH_DOM=True
FETCH_GIT=True
FETCH_MEDIA=False
SUBMIT_ARCHIVE_DOT_ORG=True

################################################################################
## Archive Method Options
################################################################################

CHECK_SSL_VALIDITY=True
FETCH_WGET_REQUISITES=True
#RESOLUTION="1440,900"
#WGET_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"
#CHROME_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"
#GIT_DOMAINS="github.com,bitbucket.org,gitlab.com"
#COOKIES_FILE="path/to/cookies.txt"
#CHROME_USER_DATA_DIR="~/.config/google-chrome/Default"

################################################################################
## Shell Options
################################################################################

USE_COLOR=False
SHOW_PROGRESS=False
LC_ALL=C.UTF-8

################################################################################
## Dependency Options
################################################################################

#CURL_BINARY="curl"
#GIT_BINARY="git"
#WGET_BINARY="wget"
#YOUTUBEDL_BINARY="youtube-dl"
#CHROME_BINARY="chromium-browser"
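The file above mixes active settings with commented-out defaults. As a quick sanity check, the following sketch prints only the settings that are actually in effect. The function name list_active_settings is made up for this illustration and is not part of ArchiveBox; it assumes the plain KEY=VALUE layout used in this file.

```shell
# Hypothetical helper (not part of ArchiveBox): print the active
# KEY=VALUE lines of a config file, skipping comments and blank lines.
list_active_settings() {
    local conf_file="$1"
    # Keep only lines that start with a variable name followed by "=".
    grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$conf_file"
}

# Example (path taken from the step above):
# list_active_settings /srv/archivebox/etc/ArchiveBox.conf
```

Commented-out entries such as #OUTPUT_DIR are filtered out because they do not start with a variable name.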
Change owner and group to www-data/www-data.
$ sudo chown -R www-data:www-data /srv/archivebox
Ensure that the application can store data in the output directory.
$ sudo chmod 770 /srv/archivebox/output
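To confirm that the ownership and mode changes took effect, a small check along these lines may help. This is only a sketch: check_owner_mode is a hypothetical helper, and it relies on GNU stat (the -c option), which is what Debian-based systems ship.

```shell
# Hypothetical helper: succeed only if the path has the expected
# owner:group and octal mode. Relies on GNU stat (-c option).
check_owner_mode() {
    local path="$1" expected_owner="$2" expected_mode="$3"
    local actual
    # %U = owner name, %G = group name, %a = permissions in octal
    actual=$(stat -c '%U:%G %a' "$path") || return 1
    [ "$actual" = "$expected_owner $expected_mode" ]
}

# Example:
# check_owner_mode /srv/archivebox/output www-data:www-data 770 && echo "ok"
```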
Web-server configuration
Install the nginx web server.
$ sudo apt install nginx
Disable the default configuration.
$ sudo unlink /etc/nginx/sites-enabled/default
Create the /etc/nginx/sites-available/archivebox configuration file.
server {
    listen 80;
    server_name _;

    root /srv/archivebox/output/;
    index index.html;

    location / {
        try_files $uri $uri/ =404;
    }

    location /archive/ {
        autoindex on;
    }
}
Enable this specific configuration.
$ sudo ln -s /etc/nginx/sites-available/archivebox /etc/nginx/sites-enabled/
Reload the nginx service.
$ sudo systemctl reload nginx
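Before reloading, it can be worth validating the configuration so that a typo is not applied to the running server. nginx -t tests the configuration files and exits with a non-zero status on errors. The wrapper below is only a sketch, and its name is illustrative.

```shell
# Hypothetical wrapper: reload nginx only if the configuration test passes.
reload_nginx_if_valid() {
    if sudo nginx -t; then
        sudo systemctl reload nginx
    else
        echo "nginx configuration test failed; not reloading" >&2
        return 1
    fi
}

# Example:
# reload_nginx_if_valid
```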
Archive URL
Finally, use the following code snippet to archive a specific URL.
$ URL="http://lwn.net"; sudo -u www-data bash -c "cd /srv/archivebox/; set -a; source etc/ArchiveBox.conf; echo $URL | /srv/archivebox/archive"
[*] [2019-06-16 21:35:13] Parsing new links from output/sources/stdin-1560720913.txt...
    > Adding 1 new links to index (parsed import as Plain Text)
[*] [2019-06-16 21:35:13] Saving main index files...
    √ output/index.json
    √ output/index.html
[▶] [2019-06-16 21:35:13] Updating content for 1 pages in archive...
[+] [2019-06-16 21:35:13] "http://lwn.net"
    http://lwn.net
    > output/archive/1560720913
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > archive_org
[√] [2019-06-16 21:35:32] Update of 1 pages complete (19.55 sec)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors
    To view your archive, open: output/index.html
[*] [2019-06-16 21:35:32] Saving main index files...
    √ output/index.json
    √ output/index.html
This can be made more or less complicated. You will see next week why I perform the archiving process in this particular way.
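One caveat about the one-liner above: $URL is expanded into the command string before the inner bash parses it, so a URL containing characters such as & or ; would be interpreted as shell syntax. A possible variant, sketched below, passes the URL as a positional argument instead. The function name archive_url is hypothetical; the paths and the archive script are the ones used in this section.

```shell
# Hypothetical wrapper around the command from this section.
# The URL reaches the inner shell as "$1", so characters such as
# "&" or ";" inside it are not interpreted as shell syntax.
archive_url() {
    local url="$1"
    sudo -u www-data bash -c '
        cd /srv/archivebox/ || exit 1
        set -a                        # export every variable set from here on
        source etc/ArchiveBox.conf    # load the settings into the environment
        printf "%s\n" "$1" | /srv/archivebox/archive
    ' _ "$url"
}

# Example:
# archive_url "http://lwn.net"
```

The underscore after the script fills $0 for the inner shell, so that the URL lands in $1.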
Additional notes
Do not forget to create and configure an SSL certificate.