How to generate a static website for an old ArchiveBox archive

I have an old archive that I want to keep, but I cannot import it into the new ArchiveBox instance because the format of the JSON files has changed, so I will generate a static website from it instead.

Inspect the backup directory.

$ ls
archive

Inspect the ArchiveBox archive.

$ ls archive/
1561294668	1561294668.128	1561294668.159	1561294668.19	1561294668.22	1561294668.250	1561294668.281	1561294668.311	1561294668.342	1561294668.373	1561294668.403	1561294668.434	1561294668.465	1561294668.496	1561294668.526	1561294668.557	1561294668.83
1561294668.0	1561294668.129	1561294668.16	1561294668.190	1561294668.220	1561294668.251	1561294668.282	1561294668.312	1561294668.343	1561294668.374	1561294668.404	1561294668.435	1561294668.466	1561294668.497	1561294668.527	1561294668.558	1561294668.84
1561294668.1	1561294668.13	1561294668.160	1561294668.191	1561294668.221	1561294668.252	1561294668.283	1561294668.313	1561294668.344	1561294668.375	1561294668.405	1561294668.436	1561294668.467	1561294668.498	1561294668.528	1561294668.559	1561294668.85
1561294668.10	1561294668.130	1561294668.161	1561294668.192	1561294668.222	1561294668.253	1561294668.284	1561294668.314	1561294668.345	1561294668.376	1561294668.406	1561294668.437	1561294668.468	1561294668.499	1561294668.529	1561294668.56	1561294668.86
1561294668.100	1561294668.131	1561294668.162	1561294668.193	1561294668.223	1561294668.254	1561294668.285	1561294668.315	1561294668.346	1561294668.377	1561294668.407	1561294668.438	1561294668.469	1561294668.5	1561294668.53	1561294668.560	1561294668.87
1561294668.101	1561294668.132	1561294668.163	1561294668.194	1561294668.224	1561294668.255	1561294668.286	1561294668.316	1561294668.347	1561294668.378	1561294668.408	1561294668.439	1561294668.47	1561294668.50	1561294668.530	1561294668.57	1561294668.88
1561294668.102	1561294668.133	1561294668.164	1561294668.195	1561294668.225	1561294668.256	1561294668.287	1561294668.317	1561294668.348	1561294668.379	1561294668.409	1561294668.44	1561294668.470	1561294668.500	1561294668.531	1561294668.58	1561294668.89
1561294668.103	1561294668.134	1561294668.165	1561294668.196	1561294668.226	1561294668.257	1561294668.288	1561294668.318	1561294668.349	1561294668.38	1561294668.41	1561294668.440	1561294668.471	1561294668.501	1561294668.532	1561294668.59	1561294668.9
1561294668.104	1561294668.135	1561294668.166	1561294668.197	1561294668.227	1561294668.258	1561294668.289	1561294668.319	1561294668.35	1561294668.380	1561294668.410	1561294668.441	1561294668.472	1561294668.502	1561294668.533	1561294668.6	1561294668.90
1561294668.105	1561294668.136	1561294668.167	1561294668.198	1561294668.228	1561294668.259	1561294668.29	1561294668.32	1561294668.350	1561294668.381	1561294668.411	1561294668.442	1561294668.473	1561294668.503	1561294668.534	1561294668.60	1561294668.91
1561294668.106	1561294668.137	1561294668.168	1561294668.199	1561294668.229	1561294668.26	1561294668.290	1561294668.320	1561294668.351	1561294668.382	1561294668.412	1561294668.443	1561294668.474	1561294668.504	1561294668.535	1561294668.61	1561294668.92
1561294668.107	1561294668.138	1561294668.169	1561294668.2	1561294668.23	1561294668.260	1561294668.291	1561294668.321	1561294668.352	1561294668.383	1561294668.413	1561294668.444	1561294668.475	1561294668.505	1561294668.536	1561294668.62	1561294668.93
1561294668.108	1561294668.139	1561294668.17	1561294668.20	1561294668.230	1561294668.261	1561294668.292	1561294668.322	1561294668.353	1561294668.384	1561294668.414	1561294668.445	1561294668.476	1561294668.506	1561294668.537	1561294668.63	1561294668.94
1561294668.109	1561294668.14	1561294668.170	1561294668.200	1561294668.231	1561294668.262	1561294668.293	1561294668.323	1561294668.354	1561294668.385	1561294668.415	1561294668.446	1561294668.477	1561294668.507	1561294668.538	1561294668.64	1561294668.95
1561294668.11	1561294668.140	1561294668.171	1561294668.201	1561294668.232	1561294668.263	1561294668.294	1561294668.324	1561294668.355	1561294668.386	1561294668.416	1561294668.447	1561294668.478	1561294668.508	1561294668.539	1561294668.65	1561294668.96
1561294668.110	1561294668.141	1561294668.172	1561294668.202	1561294668.233	1561294668.264	1561294668.295	1561294668.325	1561294668.356	1561294668.387	1561294668.417	1561294668.448	1561294668.479	1561294668.509	1561294668.54	1561294668.66	1561294668.97
1561294668.111	1561294668.142	1561294668.173	1561294668.203	1561294668.234	1561294668.265	1561294668.296	1561294668.326	1561294668.357	1561294668.388	1561294668.418	1561294668.449	1561294668.48	1561294668.51	1561294668.540	1561294668.67	1561294668.98
1561294668.112	1561294668.143	1561294668.174	1561294668.204	1561294668.235	1561294668.266	1561294668.297	1561294668.327	1561294668.358	1561294668.389	1561294668.419	1561294668.45	1561294668.480	1561294668.510	1561294668.541	1561294668.68	1561294668.99
1561294668.113	1561294668.144	1561294668.175	1561294668.205	1561294668.236	1561294668.267	1561294668.298	1561294668.328	1561294668.359	1561294668.39	1561294668.42	1561294668.450	1561294668.481	1561294668.511	1561294668.542	1561294668.69
1561294668.114	1561294668.145	1561294668.176	1561294668.206	1561294668.237	1561294668.268	1561294668.299	1561294668.329	1561294668.36	1561294668.390	1561294668.420	1561294668.451	1561294668.482	1561294668.512	1561294668.543	1561294668.7
1561294668.115	1561294668.146	1561294668.177	1561294668.207	1561294668.238	1561294668.269	1561294668.3	1561294668.33	1561294668.360	1561294668.391	1561294668.421	1561294668.452	1561294668.483	1561294668.513	1561294668.544	1561294668.70
1561294668.116	1561294668.147	1561294668.178	1561294668.208	1561294668.239	1561294668.27	1561294668.30	1561294668.330	1561294668.361	1561294668.392	1561294668.422	1561294668.453	1561294668.484	1561294668.514	1561294668.545	1561294668.71
1561294668.117	1561294668.148	1561294668.179	1561294668.209	1561294668.24	1561294668.270	1561294668.300	1561294668.331	1561294668.362	1561294668.393	1561294668.423	1561294668.454	1561294668.485	1561294668.515	1561294668.546	1561294668.72
1561294668.118	1561294668.149	1561294668.18	1561294668.21	1561294668.240	1561294668.271	1561294668.301	1561294668.332	1561294668.363	1561294668.394	1561294668.424	1561294668.455	1561294668.486	1561294668.516	1561294668.547	1561294668.73
1561294668.119	1561294668.15	1561294668.180	1561294668.210	1561294668.241	1561294668.272	1561294668.302	1561294668.333	1561294668.364	1561294668.395	1561294668.425	1561294668.456	1561294668.487	1561294668.517	1561294668.548	1561294668.74
1561294668.12	1561294668.150	1561294668.181	1561294668.211	1561294668.242	1561294668.273	1561294668.303	1561294668.334	1561294668.365	1561294668.396	1561294668.426	1561294668.457	1561294668.488	1561294668.518	1561294668.549	1561294668.75
1561294668.120	1561294668.151	1561294668.182	1561294668.212	1561294668.243	1561294668.274	1561294668.304	1561294668.335	1561294668.366	1561294668.397	1561294668.427	1561294668.458	1561294668.489	1561294668.519	1561294668.55	1561294668.76
1561294668.121	1561294668.152	1561294668.183	1561294668.213	1561294668.244	1561294668.275	1561294668.305	1561294668.336	1561294668.367	1561294668.398	1561294668.428	1561294668.459	1561294668.49	1561294668.52	1561294668.550	1561294668.77
1561294668.122	1561294668.153	1561294668.184	1561294668.214	1561294668.245	1561294668.276	1561294668.306	1561294668.337	1561294668.368	1561294668.399	1561294668.429	1561294668.46	1561294668.490	1561294668.520	1561294668.551	1561294668.78
1561294668.123	1561294668.154	1561294668.185	1561294668.215	1561294668.246	1561294668.277	1561294668.307	1561294668.338	1561294668.369	1561294668.4	1561294668.43	1561294668.460	1561294668.491	1561294668.521	1561294668.552	1561294668.79
1561294668.124	1561294668.155	1561294668.186	1561294668.216	1561294668.247	1561294668.278	1561294668.308	1561294668.339	1561294668.37	1561294668.40	1561294668.430	1561294668.461	1561294668.492	1561294668.522	1561294668.553	1561294668.8
1561294668.125	1561294668.156	1561294668.187	1561294668.217	1561294668.248	1561294668.279	1561294668.309	1561294668.34	1561294668.370	1561294668.400	1561294668.431	1561294668.462	1561294668.493	1561294668.523	1561294668.554	1561294668.80
1561294668.126	1561294668.157	1561294668.188	1561294668.218	1561294668.249	1561294668.28	1561294668.31	1561294668.340	1561294668.371	1561294668.401	1561294668.432	1561294668.463	1561294668.494	1561294668.524	1561294668.555	1561294668.81
1561294668.127	1561294668.158	1561294668.189	1561294668.219	1561294668.25	1561294668.280	1561294668.310	1561294668.341	1561294668.372	1561294668.402	1561294668.433	1561294668.464	1561294668.495	1561294668.525	1561294668.556	1561294668.82
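The listing is too long to check by eye, so counting the snapshot directories gives a number that can be compared with what ArchiveBox reports later. This is a minimal sketch; count_snapshots is a hypothetical helper name, and it assumes a find that supports -mindepth and -maxdepth (GNU and BSD find both do).

```shell
# Count the snapshot directories directly under an archive folder.
# The default path "archive" matches the layout shown above.
count_snapshots() {
    find "${1:-archive}" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}
```

Running count_snapshots archive from the backup directory should print the same number of links that init reports when it collects the existing archive folders (562 in this run).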

Get an older ArchiveBox Docker image to generate the static website.

$ docker run archivebox/archivebox:82838b0f974cab16d46c77f0bfa4d92dd9eafae3 version
ArchiveBox v0.5.3
Cpython Linux Linux-5.13.0-27-generic-x86_64-with-glibc2.28 x86_64 (in Docker)

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.5.3          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.9.1          valid     /usr/local/bin/python3.9                                                    
 √  DJANGO_BINARY         v3.1.3          valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.5.1         valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.1.14         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.1.0          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.01.03     valid     /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v87.0.4280.88   valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/themes                                                      

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              


[i] Data locations:

Initialize ArchiveBox.

$ docker run -v $PWD:/data archivebox/archivebox:82838b0f974cab16d46c77f0bfa4d92dd9eafae3 init
[i] [2022-01-30 22:38:46] ArchiveBox v0.5.3: archivebox init
    > /data

[+] Initializing a new ArchiveBox collection in this folder...
    /data
------------------------------------------------------------------

[+] Building archive folder structure...
    √ /data/sources
    √ /data/archive
    √ /data/logs
    √ /data/ArchiveBox.conf

[+] Building main SQL index and running migrations...
    √ /data/index.sqlite3

    Operations to perform:
    Apply all migrations: admin, auth, contenttypes, core, sessions
    Running migrations:
    Applying contenttypes.0001_initial... OK
    Applying auth.0001_initial... OK
    Applying admin.0001_initial... OK
    Applying admin.0002_logentry_remove_auto_add... OK
    Applying admin.0003_logentry_add_action_flag_choices... OK
    Applying contenttypes.0002_remove_content_type_name... OK
    Applying auth.0002_alter_permission_name_max_length... OK
    Applying auth.0003_alter_user_email_max_length... OK
    Applying auth.0004_alter_user_username_opts... OK
    Applying auth.0005_alter_user_last_login_null... OK
    Applying auth.0006_require_contenttypes_0002... OK
    Applying auth.0007_alter_validators_add_error_messages... OK
    Applying auth.0008_alter_user_username_max_length... OK
    Applying auth.0009_alter_user_last_name_max_length... OK
    Applying auth.0010_alter_group_name_max_length... OK
    Applying auth.0011_update_proxy_permissions... OK
    Applying auth.0012_alter_user_first_name_max_length... OK
    Applying core.0001_initial... OK
    Applying core.0002_auto_20200625_1521... OK
    Applying core.0003_auto_20200630_1034... OK
    Applying core.0004_auto_20200713_1552... OK
    Applying core.0005_auto_20200728_0326... OK
    Applying core.0006_auto_20201012_1520... OK
    Applying core.0007_archiveresult... OK
    Applying core.0008_auto_20210105_1421... OK
    Applying sessions.0001_initial... OK

[*] Collecting links from any existing indexes and archive folders...
    √ Added 562 orphaned links from existing archive directories.
    ! Skipped adding 562 invalid link data directories.
        X /data/archive/1561294668.556 [1561294668.556] http://aa.quae.nl/en/reken/zonpositie.html "None"
        [...]
        X /data/archive/1561294668.558 [1561294668.558] http://aa.quae.nl/en/antwoorden/manen.html "None"

    Hint: For more information about the link data directories that were skipped, run:
        archivebox status
        archivebox list --status=invalid

[*] [2022-01-30 22:39:42] Writing 562 links to main index...
    √ /data/index.sqlite3

------------------------------------------------------------------
[√] Done. A new ArchiveBox collection was initialized (0 links).

    Hint: To view your archive index, run:
        archivebox server  # then visit http://127.0.0.1:8000

    To add new links, you can run:
        archivebox add ~/some/path/or/url/to/list_of_links.txt

    For more usage and examples, run:
        archivebox help

Inspect the current directory.

$ ls
archive  ArchiveBox.conf  ArchiveBox.conf.bak  index.sqlite3  logs  sources

List the export options.

$ docker run -v $PWD:/data archivebox/archivebox:82838b0f974cab16d46c77f0bfa4d92dd9eafae3 list --help
[i] [2022-01-30 22:45:45] ArchiveBox v0.5.3: archivebox list --help
    > /data

usage: archivebox list [-h] [--csv CSV | --json | --html] [--with-headers]
                       [--sort SORT] [--before BEFORE] [--after AFTER]
                       [--status {indexed,archived,unarchived,present,valid,invalid,duplicate,orphaned,corrupted,unrecognized}]
                       [--filter-type {exact,substring,domain,regex,tag,search}]
                       [filter_patterns ...]

List, filter, and export information about archive entries

positional arguments:
  filter_patterns       List only URLs matching these filter patterns.

optional arguments:
  -h, --help            show this help message and exit
  --csv CSV             Print the output in CSV format with the given columns,
                        e.g.: timestamp,url,extension
  --json                Print the output in JSON format with all columns
                        included.
  --html                Print the output in HTML format
  --with-headers        Include the headers in the output document
  --sort SORT           List the links sorted using the given key, e.g.
                        timestamp or updated.
  --before BEFORE       List only links bookmarked before the given timestamp.
  --after AFTER         List only links bookmarked after the given timestamp.
  --status {indexed,archived,unarchived,present,valid,invalid,duplicate,orphaned,corrupted,unrecognized}
                        List only links or data directories that have the given status
                            indexed       indexed links without checking archive status or data directory validity (the default)
                            archived      indexed links that are archived with a valid data directory
                            unarchived    indexed links that are unarchived with no data directory or an empty data directory
                        
                            present       dirs that actually exist in the archive/ folder
                            valid         dirs with a valid index matched to the main index and archived content
                            invalid       dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized
                        
                            duplicate     dirs that conflict with other directories that have the same link URL or timestamp
                            orphaned      dirs that contain a valid index but aren't listed in the main index
                            corrupted     dirs that don't contain a valid index and aren't listed in the main index
                            unrecognized  dirs that don't contain recognizable archive data and aren't listed in the main index
  --filter-type {exact,substring,domain,regex,tag,search}
                        Type of pattern matching to use when filtering URLs

Create the index.html file.

$ docker run -v $PWD:/data archivebox/archivebox:82838b0f974cab16d46c77f0bfa4d92dd9eafae3 list --html --with-headers > /tmp/index.html
[i] [2022-01-30 22:41:30] ArchiveBox v0.5.3: archivebox list --html --with-headers
    > /data

Create the index.json file.

$ docker run -v $PWD:/data archivebox/archivebox:82838b0f974cab16d46c77f0bfa4d92dd9eafae3 list --json --with-headers > /tmp/index.json
[i] [2022-01-30 22:43:20] ArchiveBox v0.5.3: archivebox list --json --with-headers
    > /data
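Both exports rely on the log banner staying out of the redirected file, so a quick look at the first character of each file can catch a polluted export. The sketch below is a hypothetical helper, not part of ArchiveBox. It assumes the JSON export is a single object (starting with "{") and the HTML export starts with a tag ("<").

```shell
# Succeed if the first non-blank character of a file is the expected one.
# Hypothetical helper; "{" suits a JSON object export, "<" an HTML page.
starts_with() {
    first=$(tr -d ' \t\r\n' < "$1" | head -c 1)
    [ "$first" = "$2" ]
}
```

For the files created above, this would be used as starts_with /tmp/index.json "{" and starts_with /tmp/index.html "<". A non-zero exit status suggests that something other than the export ended up in the file.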

Copy these files to the target server that holds the archive.

$ scp /tmp/index.json archivebox@172.16.1.2:~/archivebox/archivebox_old/
index.json                                                                         100% 4130KB   2.3MB/s   00:01    
$ scp /tmp/index.html archivebox@172.16.1.2:~/archivebox/archivebox_old/
index.html                                                                         100%  617KB   2.3MB/s   00:01
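After the copy, comparing checksums is a cheap way to confirm that the files arrived intact. This is a sketch with hypothetical helper names (digest, same_digest), assuming sha256sum from GNU coreutils is available on both machines.

```shell
# Print only the SHA-256 digest of a file (sha256sum is from GNU coreutils).
digest() {
    sha256sum "$1" | cut -d ' ' -f 1
}

# Succeed if a file's digest equals an expected digest string.
same_digest() {
    [ "$(digest "$1")" = "$2" ]
}
```

The remote digest could be fetched over ssh with the host and path from the scp commands above, for example: same_digest /tmp/index.json "$(ssh archivebox@172.16.1.2 sha256sum archivebox/archivebox_old/index.json | cut -d ' ' -f 1)".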

Create the nginx configuration.

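server {
	listen 80 default_server;

	server_name _;

	root /opt/archivebox/archivebox_old/;

	# Serve the generated index.html at the top level.
	location / {
		index index.html;
		try_files $uri $uri/ =404;
	}

	# In subdirectories, show a directory listing instead of an index file.
	location ~ /.*/ {
		index nonexistent;
		autoindex on;
	}
}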

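Copy the static files (the stylesheets, scripts and images that index.html references) to the server as well. To be safe, take them from the ArchiveBox application itself, so that they match the version that generated the index.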

Visit the website.

Additional notes

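You can also use the jq utility to generate a simple index file. Run it from inside the archive directory, so that each snapshot's index.json sits one level down.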

$ echo "<table>" | tee index.html
$ find * -maxdepth 1 -name index.json -exec bash -c 'url=$(jq --raw-output .url "$1"); title=$(jq --raw-output ".history.title[0].output" "$1"); date=$(jq --raw-output ".history.title[0].start_ts" "$1"); dir=$(dirname "$1"); echo "<tr><td><img style=\"width:156px;height:100px;border:1px solid black;\" src=\"$dir/screenshot.png\" loading=\"lazy\"/></td><td style=\"vertical-align:top;\"><p><strong>$title</strong></p><p><a href=\"$dir/index.html\">$url</a></p><p>$date</p></td></tr><tr><td colspan=2><hr/></td></tr>"' _ {} \; | tee -a index.html
$ echo "</table>" | tee -a index.html
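Titles and URLs can contain characters such as < or &, which would break the generated table if pasted in verbatim. As a possible variation on the jq command above, the sketch below lets jq build the row itself and escapes the text with its @html filter. snapshot_row is a hypothetical helper name. It reads the same .url and .history.title[0].output fields and falls back to the URL when no title is recorded.

```shell
# Build one HTML table row from a snapshot's index.json, escaping text for HTML.
# Assumes the fields used above: .url and .history.title[0].output.
snapshot_row() {
    json=$1
    dir=$(dirname "$json")
    jq --raw-output --arg dir "$dir" '
        (.history.title[0].output // .url) as $title
        | "<tr><td><a href=\"\($dir | @html)/index.html\">\($title | @html)</a></td>"
          + "<td>\(.url | @html)</td></tr>"
    ' "$json"
}
```

Calling snapshot_row 1561294668.1/index.json from inside the archive directory prints a single row. The rows still need to be wrapped in <table> and </table> in the same way as above.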