Repacked ecm files in download/old/

Today I finished repacking all the .zip files found so far within the https://pushbx.org/ecm/download/old/ directory and uploading a lzipped tarball in their place. This reduces the disk space use from 22 GB to about 2.3 GB.

Downloading the ecm/download directory

rsync -haxHAX [redacted]@pushbx.org:wwwecm/download/ . --progress

This ran from Friday, 17:26 until 18:06. It downloaded 2338 files, worth 23 GB.

Unpacking the zipballs

There are 2056 files in the old subdirectory, most of them .zip files.

find ../20251107/old/ -type f -a -iname '*.zip' -print0 | xargs -r0 bash -c 'for pathname; do dir="${pathname%/*.zip}"; dir="${dir##*/}"; dt="${pathname##*/}"; dt="${dt%.zip}"; mkdir -p "old/$dir/$dt"; unzip -d "old/$dir/$dt" "$pathname"; done' scriptlet
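The parameter expansions in this scriptlet derive the target directory from each zipball's pathname. A minimal sketch with a hypothetical pathname (the subdirectory and date names are assumptions for illustration only):

```shell
#!/bin/sh
# Hypothetical pathname under the downloaded tree; names are examples only.
pathname='../20251107/old/ldos/2020-01-01.zip'
dir="${pathname%/*.zip}"   # strip the "/<basename>.zip" suffix
dir="${dir##*/}"           # keep the last directory component: "ldos"
dt="${pathname##*/}"       # basename: "2020-01-01.zip"
dt="${dt%.zip}"            # drop the extension: "2020-01-01"
printf '%s\n' "old/$dir/$dt"
```

This prints `old/ldos/2020-01-01`, the directory the scriptlet would pass to mkdir -p and unzip -d.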

Hardlinking old/msdos4/ files to old/ldos/

find old/msdos4 -type f -print0 | xargs -r0 bash -c 'for file; do dirname="old/ldos/${file##old/msdos4/}"; dirname="${dirname%/*}"; mkdir -p "$dirname"; link "$file" "old/ldos/${file##old/msdos4/}"; done' scriptlet
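The prefix-stripping expansion rewrites each old/msdos4/ pathname to its old/ldos/ counterpart. A sketch with a hypothetical file name (only the old/msdos4/ prefix is taken from the command above):

```shell
#!/bin/sh
# Hypothetical pathname; the date and file names are examples only.
file='old/msdos4/2023-10-12/bin/msdos4.zip'
target="old/ldos/${file##old/msdos4/}"  # swap the leading directory
dirname="${target%/*}"                  # parent directory for mkdir -p
printf '%s\n%s\n' "$target" "$dirname"
```

The scriptlet then creates `$dirname` and hardlinks the msdos4 file to `$target` with link(1).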

Hardlinking identical files

hardlink -potO .
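hardlink(1) scans the tree for files with identical contents and replaces duplicates with hard links to a single inode. The effect is the same as linking by hand; a minimal sketch using ln(1), with the link count read back via GNU stat(1):

```shell
#!/bin/sh
# Two files with identical contents, deduplicated into one inode,
# as hardlink(1) would do. stat -c %h prints the hard link count.
tmp=$(mktemp -d)
printf 'same contents\n' > "$tmp/a"
printf 'same contents\n' > "$tmp/b"
ln -f "$tmp/a" "$tmp/b"          # b now shares a's inode
links=$(stat -c %h "$tmp/a")     # hard link count: 2
printf '%s\n' "$links"
rm -r "$tmp"
```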

I left this command running starting at 20:38 on Friday; it had finished by 09:57 on Saturday at the latest.

Creating the sorted list of files

This was done using a scriptlet that sorts the 1.8 million files first by their basename (the part after the last slash) and second by their pathname (the part before the last slash). At first I attempted to use perl's spaceship operator <=> for the sorting, but that operator treats its operands as numbers, so it failed to sort as desired. The correct operator for text operands is cmp, apparently.

find old -type f | perl -e 'my @array = (); while (<<>>) { push(@array, $_); }; my @sorted = sort { ($a =~ s/^(.*\/)([^\/]+?)[\r\n]*$/$2\/$1/r) cmp ($b =~ s/^(.*\/)([^\/]+?)[\r\n]*$/$2\/$1/r) } @array; print join "", @sorted;' > sorted.txt

Converting the sorted list to NUL separators

cat sorted.txt | tr '\n' '\0' > sortedz.txt

Creating the tarball

cat sortedz.txt | tar --owner=0 --group=0 --numeric-owner -cf sorted-nolink.tar --null -T - --totals --hard-dereference

Terminal output:

Total bytes written: 88992112640 (83GiB, 22MiB/s)

This ran from 11:36 on Saturday until 12:42. The resulting file contains 89 GB.
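The name-list-to-tarball pipeline can be tried at small scale. A sketch with two hypothetical members, using the same tr conversion and tar switches as above (minus --hard-dereference, which only matters once hardlinks exist):

```shell
#!/bin/sh
# Small-scale rerun of the pipeline: two hypothetical members, a
# NUL-separated name list, and the same tar invocation as above.
tmp=$(mktemp -d)
cd "$tmp"
printf 'A\n' > a.zip
printf 'B\n' > b.zip
printf 'a.zip\nb.zip\n' | tr '\n' '\0' > sortedz.txt
cat sortedz.txt | tar --owner=0 --group=0 --numeric-owner -cf test.tar --null -T -
listing=$(tar -tf test.tar)    # members appear in name-list order
printf '%s\n' "$listing"
cd /
rm -r "$tmp"
```

--null -T - makes tar read NUL-separated member names from stdin, so the archive order is exactly the sorted order.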

Creating the tarball without --hard-dereference switch

cat sortedz.txt | tar --owner=0 --group=0 --numeric-owner -cf sorted.tar --null -T -

This ran from 10:34 on Saturday until 11:32. The resulting file contains 36 GB.

This is more efficient to depack in full as tar will recreate the hardlinks between files with identical contents. (Some files have as many as 9000 hardlinks pointing to the same file.) However, it is more difficult to use this for depacking individual files. Quoth the GNU tar manual:

Although creating special records for hard links helps keep a faithful record of the file system contents and makes archives more compact, it may present some difficulties when extracting individual members from the archive. For example, trying to extract file ‘one’ from the archive created in previous examples produces, in the absence of file ‘jeden’:

$ tar xf archive.tar ./one
tar: ./one: Cannot hard link to './jeden': No such file or directory
tar: Error exit delayed from previous errors

The reason for this behavior is that tar cannot seek back in the archive to the previous member (in this case, ‘one’), to extract it. If you wish to avoid such problems at the cost of a bigger archive, use the following option:

GNU tar manual, Hard Links
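The failure mode the manual describes is easy to reproduce: archive two hardlinked files, then extract only the member that was stored as a link record. A sketch (file names taken from the manual's example; GNU tar exits with status 2 on such errors):

```shell
#!/bin/sh
# Reproduce the hard-link extraction failure from the manual's example.
tmp=$(mktemp -d)
cd "$tmp"
printf 'hello\n' > jeden
ln jeden one                    # one is a hard link to jeden
tar -cf archive.tar jeden one   # one is stored as a link record
mkdir extract
cd extract
tar -xf ../archive.tar one 2>/dev/null   # fails: jeden is absent here
status=$?
printf 'exit status: %s\n' "$status"
cd /
rm -r "$tmp"
```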

Compressing the repacked tarball

Using pv (pipe view) for the progress display and lzip to compress.

cat sorted-nolink.tar | pv -paterb -s"83G" | lzip -9 -c > sorted-nolink.tar.lz

Final pv progress line:

82.9GiB 14:43:51 [1.60MiB/s] [1.60MiB/s] [===================> ] 99%

The file contains 2.1 GB.

Compressing the repacked tarball without --hard-dereference switch

cat sorted.tar | pv -paterb -s"36G" | lzip -9 -c > sorted.tar.lz

Final pv progress line:

33.1GiB 7:57:26 [1.18MiB/s] [1.18MiB/s] [===================> ] 91%

The file contains 2.0 GB.

Because of the difficulty of extracting individual files, and because the compressed file size savings come out to only about 5%, I did not upload this file to the server.

Depacking the lzipped tarball

lzip -kcd ../sorted-nolink.tar.lz | pv -paterb -s83G | tar -xf -

Final pv progress line:

82.9GiB 0:25:40 [55.1MiB/s] [55.1MiB/s] [====================> ] 99%

Uploading the repacked file

rsync -haxHAX sorted-nolink.tar.lz [redacted]@pushbx.org:wwwecm/download/old/202511.tlz --progress

Deleting the old files

The following command was used to delete old zipballs from the old subdirectories. It only processes files that have fewer than 2 (i.e., exactly 1) hardlinks pointing to them. Therefore, each most recent build (which has a hardlink from the https://pushbx.org/ecm/download/ directory) was preserved.

~/wwwecm/download/old$ find -type f -a -iname '*.zip' -a -links -2 | LC_ALL=C sort | xargs -r rm
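find's -links -2 test is what protects the current builds: any file still reachable through a second hard link has link count 2 and is skipped. A sketch (all file names here are hypothetical):

```shell
#!/bin/sh
# lone.zip has link count 1 (eligible for deletion); keep.zip has a
# second link elsewhere, so its count is 2 and -links -2 skips it.
tmp=$(mktemp -d)
mkdir "$tmp/old" "$tmp/current"
printf 'x\n' > "$tmp/old/keep.zip"
ln "$tmp/old/keep.zip" "$tmp/current/keep.zip"   # simulated current build
printf 'y\n' > "$tmp/old/lone.zip"
matched=$(cd "$tmp/old" && find . -type f -a -iname '*.zip' -a -links -2)
printf '%s\n' "$matched"
rm -r "$tmp"
```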

blog/pushbx/2025/1109_repacked_ecm_files_in_download/old.txt · Last modified: 2025-11-09 12:26:44 +0100 (Sun, 09 Nov) by ecm