Today I finished repacking all the .zip files found so far within the https://pushbx.org/ecm/download/old/ directory, uploading an lzipped tarball in their place. This reduces the disk space use from 22 GB to about 2.3 GB.
rsync -haxHAX [redacted]@pushbx.org:wwwecm/download/ . --progress
This ran from 17:26 on Friday until 18:06. It downloaded 2338 files totalling 23 GB.
There are 2056 files in the old subdirectory, most of them .zip files.
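Counts along these lines (a sketch; the exact commands were not recorded) give those numbers:
find ../20251107/old/ -type f | wc -l
find ../20251107/old/ -type f -a -iname '*.zip' | wc -l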
The following scriptlet depacked each zipball into old/<dir>/<name>/, where <dir> is the name of the directory containing the zipball and <name> is the zipball's basename without the .zip extension:
find ../20251107/old/ -type f -a -iname '*.zip' -print0 | xargs -r0 bash -c 'for pathname; do dir="${pathname%/*.zip}"; dir="${dir##*/}"; dt="${pathname##*/}"; dt="${dt%.zip}"; mkdir -p "old/$dir/$dt"; unzip -d "old/$dir/$dt" "$pathname"; done' scriptlet
This scriptlet then hardlinked every depacked file below old/msdos4 to the same relative pathname below old/ldos:
find old/msdos4 -type f -print0 | xargs -r0 bash -c 'for file; do dirname="old/ldos/${file##old/msdos4/}"; dirname="${dirname%/*}"; mkdir -p "$dirname"; link "$file" "old/ldos/${file##old/msdos4/}"; done' scriptlet
hardlink -potO .
This deduplicates identical files by hardlinking them to one another. (With util-linux hardlink, the -p, -o and -t flags make it ignore differing mode, owner and timestamps when comparing files, and -O keeps the oldest copy as the one that the others are linked to.) I left this command running starting at 20:38 on Friday, and it had finished by 09:57 on Saturday at the latest.
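With util-linux hardlink, a dry run can preview what would be linked before committing to it; this optional step was not part of the original run:
hardlink -n -potO .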
The list of files to pack was produced using a scriptlet that sorts the 1.8 million files first by their basename (the part after the last slash) and then by their pathname (the part before the last slash). At first I attempted to use perl's spaceship operator <=> for the sorting, but found that it treats its operands as numbers, failing to sort as desired. The correct operator for text operands apparently is cmp.
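As a quick illustration (not from the original session) of how the two operators behave on string operands:
perl -e 'print join(" ", sort { $a <=> $b } qw(b10 a2 a10)), "\n"'
perl -e 'print join(" ", sort { $a cmp $b } qw(b10 a2 a10)), "\n"'
With <=> every string numifies to 0, so the comparison provides no useful ordering; with cmp the result is the lexicographic a10 a2 b10.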
find old -type f | perl -e 'my @array = (); while (<<>>) { push(@array, $_); }; my @sorted = sort { ($a =~ s/^(.*\/)([^\/]+?)[\r\n]*$/$2\/$1/r) cmp ($b =~ s/^(.*\/)([^\/]+?)[\r\n]*$/$2\/$1/r) } @array; print join "", @sorted;' > sorted.txt
cat sorted.txt | tr '\n' '\0' > sortedz.txt
cat sortedz.txt | tar --owner=0 --group=0 --numeric-owner -cf sorted-nolink.tar --null -T - --totals --hard-dereference
Terminal output:
Total bytes written: 88992112640 (83GiB, 22MiB/s)
This ran from 11:36 on Saturday until 12:42. The resulting file contains 89 GB.
cat sortedz.txt | tar --owner=0 --group=0 --numeric-owner -cf sorted.tar --null -T -
This ran from 10:34 on Saturday until 11:32. The resulting file contains 36 GB.
This is more efficient to depack in full as tar will recreate the hardlinks between files with identical contents. (Some files have as many as 9000 hardlinks pointing to the same file.) However, it is more difficult to use this for depacking individual files. Quoth the GNU tar manual:
Although creating special records for hard links helps keep a faithful record of the file system contents and makes archives more compact, it may present some difficulties when extracting individual members from the archive. For example, trying to extract file ‘one’ from the archive created in previous examples produces, in the absence of file ‘jeden’:
$ tar xf archive.tar ./one
tar: ./one: Cannot hard link to './jeden': No such file or directory
tar: Error exit delayed from previous errors

The reason for this behavior is that tar cannot seek back in the archive to the previous member (in this case, ‘jeden’), to extract it. If you wish to avoid such problems at the cost of a bigger archive, use the following option:

That option is --hard-dereference, as used above to create sorted-nolink.tar.
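The hardlink records in sorted.tar can be seen in a verbose listing, where GNU tar marks such members with "link to"; an optional check that was not part of the original session:
tar -tvf sorted.tar | grep -F 'link to' | head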
Next, pv (Pipe Viewer) provides the progress display and lzip compresses the tarballs.
cat sorted-nolink.tar | pv -paterb -s"83G" | lzip -9 -c > sorted-nolink.tar.lz
Final pv progress line:
82.9GiB 14:43:51 [1.60MiB/s] [1.60MiB/s] [===================> ] 99%
The file contains 2.1 GB.
cat sorted.tar | pv -paterb -s"36G" | lzip -9 -c > sorted.tar.lz
Final pv progress line:
33.1GiB 7:57:26 [1.18MiB/s] [1.18MiB/s] [===================> ] 91%
The file contains 2.0 GB.
Because of the difficulty of extracting individual files, and because the compressed file size savings come out to only about 5%, I did not upload this file (sorted.tar.lz) to the server.
lzip -kcd ../sorted-nolink.tar.lz | pv -paterb -s83G | tar -xf -
Final pv progress line:
82.9GiB 0:25:40 [55.1MiB/s] [55.1MiB/s] [====================> ] 99%
rsync -haxHAX sorted-nolink.tar.lz [redacted]@pushbx.org:wwwecm/download/old/202511.tlz --progress
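Because the uploaded tarball was created with --hard-dereference, an individual file can be depacked from it without running into the hardlink problem quoted above. For example, with a made-up member pathname:
lzip -cd 202511.tlz | tar -xf - old/lmacros/20250101/lmacros.asm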
The following command was used to delete old zipballs from the old subdirectories. It only processes files that have fewer than 2 (i.e., exactly 1) hardlinks pointing to them. Therefore, each most recent build (which has a hardlink from the https://pushbx.org/ecm/download/ directory) was preserved.
~/wwwecm/download/old$ find -type f -a -iname '*.zip' -a -links -2 | LC_ALL=C sort | xargs -r rm
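The hardlink counts that the -links test keys on can be inspected with GNU find, for instance like this (an extra check, not part of the original cleanup):
~/wwwecm/download/old$ find -type f -a -iname '*.zip' -printf '%n %p\n' | sort -rn | head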