blog:pushbx:2025:1109_repacked_ecm_files_in_download:old [2025-11-09 12:26:44 +0100 Nov Sun] (current)
ecm created
 +====== Repacked ecm files in download/old/ ======
 +
 +Today I finished repacking all the .zip files found so far
 +within the https://pushbx.org/ecm/download/old/ directory,
 +and uploaded a single lzipped tarball in their place.
 +This reduces the disk space used from 22 GB to about 2.3 GB.
 +
 +===== Downloading the ecm/download directory =====
 +
 +''rsync -haxHAX [redacted]@pushbx.org:wwwecm/download/ . -%%%%-progress''
 +
 +This ran on Friday from 17:26 until 18:06. It downloaded 2338 files, totalling 23 GB.
 +
 +===== Unpacking the zipballs =====
 +
 +There are 2056 files in the old subdirectory, most of them .zip files.
 +
 +''find ../20251107/old/ -type f -a -iname '*.zip' -print0 | xargs -r0 bash -c 'for pathname; do dir="${pathname%/*.zip}"; dir="${dir##*/}"; dt="${pathname##*/}"; dt="${dt%.zip}"; mkdir -p "old/$dir/$dt"; unzip -d "old/$dir/$dt" "$pathname"; done' scriptlet''
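
The parameter expansions in the scriptlet above can be traced on a single path. This is a minimal sketch; the example path (``ldos/20250101.zip``) is made up for illustration and is not taken from the actual directory.

```shell
#!/usr/bin/env bash
# Hypothetical example path; the directory and file names are made up.
pathname='../20251107/old/ldos/20250101.zip'

dir="${pathname%/*.zip}"   # strip the trailing "/<name>.zip": ../20251107/old/ldos
dir="${dir##*/}"           # keep only the last directory component: ldos
dt="${pathname##*/}"       # keep only the file name: 20250101.zip
dt="${dt%.zip}"            # strip the extension: 20250101

target="old/$dir/$dt"      # directory that unzip -d would extract into
echo "$target"
```

Note that ``-iname`` also matches names ending in ``.ZIP``, whereas the ``%.zip`` expansion only strips a lowercase extension, so such a file would keep its extension in the directory name.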
 +
 +===== Hardlinking old/msdos4/ files to old/ldos/ =====
 +
 +''find old/msdos4 -type f -print0 | xargs -r0 bash -c 'for file; do dirname="old/ldos/${file##old/msdos4/}"; dirname="${dirname%/*}"; mkdir -p "$dirname"; link "$file" "old/ldos/${file##old/msdos4/}"; done' scriptlet''
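
As a rough illustration of what the scriptlet above does for one file, the following sketch builds a scratch tree, creates the parent directory, and hard-links the file into it. The file name ``20250101/readme.txt`` is hypothetical, and ``ln`` without ``-s`` is used here in place of ``link``; both create a hard link.

```shell
#!/usr/bin/env bash
set -eu
work="$(mktemp -d)"                    # scratch directory with a made-up layout
mkdir -p "$work/old/msdos4/20250101"
echo 'demo' > "$work/old/msdos4/20250101/readme.txt"

file="old/msdos4/20250101/readme.txt"
dest="old/ldos/${file##old/msdos4/}"   # same relative path under old/ldos/
(
  cd "$work"
  mkdir -p "${dest%/*}"                # create the parent directory first
  ln "$file" "$dest"                   # hard link, not a copy
)

# Both names now refer to the same inode, so the -ef test succeeds.
if [ "$work/$file" -ef "$work/$dest" ]; then same=yes; else same=no; fi
echo "same inode: $same"
rm -rf "$work"
```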
 +
 +===== Hardlinking identical files =====
 +
 +''hardlink -potO .''
 +
 +I started this command at 20:38 on Friday and left it running; it had finished by 09:57 on Saturday at the latest.
 +
 +===== Creating the sorted list of files =====
 +
 +This was done using a scriptlet that sorts the 1.8 million files
 +first by their basename (the part after the last slash)
 +and second by their directory name (the part before the last slash).
 +At first I tried perl's spaceship operator ''<%%%%=%%%%>'' for the sorting,
 +but it compares its operands as numbers,
 +so the names were not sorted as desired.
 +The correct operator for string operands is ''cmp''.
 +
 +''find old -type f | perl -e 'my @array = (); while (<%%%%<>%%%%>) { push(@array, $_); }; my @sorted = sort { ($a =~ s/^(.*\/)([^\/]+?)[\r\n]*$/$2\/$1/r) cmp ($b =~ s/^(.*\/)([^\/]+?)[\r\n]*$/$2\/$1/r) } @array; print join "", @sorted;' > sorted.txt''
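
The ordering itself (basename first, directory second) can also be sketched with standard tools. This is not the command used above. It is an alternative that prepends the basename as a tab-separated key, so it assumes no tabs or newlines in file names, and its tie-breaking can differ slightly from the Perl key, which joins basename and directory with a slash. The three paths and the function name ``sort_by_basename`` are made up.

```shell
#!/usr/bin/env bash
# Sort paths by basename first, then by the full path (hypothetical helper).
sort_by_basename() {
  # 1. prepend the basename as a key, 2. sort byte-wise, 3. drop the key again
  awk -F/ '{ print $NF "\t" $0 }' | LC_ALL=C sort | cut -f2-
}

# Made-up file list standing in for the output of "find old -type f".
sorted="$(printf '%s\n' old/b/x.txt old/a/y.txt old/a/x.txt | sort_by_basename)"
printf '%s\n' "$sorted"
```

Both copies of ``x.txt`` end up adjacent, which is the point of the ordering: files that are likely similar sit next to each other in the tarball, which presumably helps the compressor.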
 +
 +==== Converting the sorted list to NUL separators ====
 +
 +''cat sorted.txt | tr '\n' '\0' > sortedz.txt''
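
NUL separators keep a name list unambiguous whatever characters the names contain. The sketch below illustrates the general point with ``xargs``, which splits plain input on whitespace. The two names are hypothetical.

```shell
#!/usr/bin/env bash
# Made-up names; one contains a space.
names=('old/a/read me.txt' 'old/b/x.txt')

# NUL-separated input: each name arrives as exactly one argument.
nul_count="$(printf '%s\n' "${names[@]}" | tr '\n' '\0' | xargs -r0 -n1 echo | wc -l)"

# Plain input: xargs also splits on the space inside the first name.
plain_count="$(printf '%s\n' "${names[@]}" | xargs -n1 echo | wc -l)"

echo "NUL-separated: $nul_count arguments, whitespace-separated: $plain_count arguments"
```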
 +
 +===== Creating the tarball =====
 +
 +''cat sortedz.txt | tar -%%%%-owner=0 -%%%%-group=0 -%%%%-numeric-owner -cf sorted-nolink.tar -%%%%-null -T - -%%%%-totals -%%%%-hard-dereference''
 +
 +Terminal output:
 +
 +''Total bytes written: 88992112640 (83GiB, 22MiB/s)''
 +
 +This ran from 11:36 on Saturday until 12:42. The resulting file is 89 GB in size.
 +
 +==== Creating the tarball without --hard-dereference switch ====
 +
 +''cat sortedz.txt | tar -%%%%-owner=0 -%%%%-group=0 -%%%%-numeric-owner -cf sorted.tar -%%%%-null -T -''
 +
 +This ran from 10:34 on Saturday until 11:32. The resulting file is 36 GB in size.
 +
 +This archive is more efficient to depack in full, as tar will recreate the hardlinks
 +between files with identical contents.
 +(Some files have as many as 9000 hardlinks pointing to the same file.)
 +However, it is more difficult to depack individual files from it.
 +Quoth the GNU tar manual:
 +
 +<blockquote>Although creating special records for hard links helps keep a faithful record of the file system contents and makes archives more compact, it may present some difficulties when extracting individual members from the archive. For example, trying to extract file ‘one’ from the archive created in previous examples produces, in the absence of file ‘jeden’:
 +
 +<code>$ tar xf archive.tar ./one
 +tar: ./one: Cannot hard link to './jeden': No such file or directory
 +tar: Error exit delayed from previous errors</code>
 +
 +The reason for this behavior is that tar cannot seek back in the archive to the previous member (in this case, ‘one’), to extract it. If you wish to avoid such problems at the cost of a bigger archive, use the following option:
 +
 +<cite>[[https://www.gnu.org/software/tar/manual/html_node/hard-links.html|GNU tar manual, Hard Links]]</cite></blockquote>
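
The size trade-off between the two tarballs can be reproduced on a small scale. This is a minimal sketch assuming GNU tar; the file names ``one`` and ``two`` and the 100 KiB size are arbitrary choices for the demonstration.

```shell
#!/usr/bin/env bash
set -eu
work="$(mktemp -d)"
cd "$work"
head -c 102400 /dev/zero > one     # 100 KiB of data
ln one two                         # second name for the same inode

# Default: 'two' is stored as a hard-link record pointing at 'one'.
tar -cf linked.tar one two
# With --hard-dereference: 'two' is stored with its own copy of the data.
tar -cf nolink.tar --hard-dereference one two

linked_size="$(wc -c < linked.tar)"
nolink_size="$(wc -c < nolink.tar)"
link_records="$(tar -tvf linked.tar | grep -c 'link to' || true)"
echo "linked: $linked_size bytes, nolink: $nolink_size bytes, link records: $link_records"
cd /
rm -rf "$work"
```

The archive made with ``-%%%%-hard-dereference`` is roughly twice as large here, but every member in it can be extracted on its own.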
 +
 +
 +===== Compressing the repacked tarball =====
 +
 +pv (Pipe Viewer) provides the progress display and lzip does the compression.
 +
 +''cat sorted-nolink.tar | pv -paterb -s"83G" | lzip -9 -c > sorted-nolink.tar.lz''
 +
 +Final pv progress line:
 +
 +''82.9GiB 14:43:51 [1.60MiB/s] [1.60MiB/s] [===================%%%%> ] 99%''
 +
 +The compressed file is 2.1 GB in size.
 +
 +==== Compressing the repacked tarball without --hard-dereference switch ====
 +
 +''cat sorted.tar | pv -paterb -s"36G" | lzip -9 -c > sorted.tar.lz''
 +
 +Final pv progress line:
 +
 +''33.1GiB 7:57:26 [1.18MiB/s] [1.18MiB/s] [===================%%%%>  ] 91%''
 +
 +The compressed file is 2.0 GB in size.
 +
 +Because of the difficulties in extracting individual files,
 +and because the compressed file is only about 5% smaller,
 +I did not upload this file to the server.
 +
 +
 +===== Depacking the lzipped tarball =====
 +
 +''lzip -kcd ../sorted-nolink.tar.lz | pv -paterb -s83G | tar -xf -''
 +
 +Final pv progress line:
 +
 +''82.9GiB 0:25:40 [55.1MiB/s] [55.1MiB/s] [====================%%%%> ] 99%''
 +
 +===== Uploading the repacked file =====
 +
 +''rsync -haxHAX sorted-nolink.tar.lz [redacted]@pushbx.org:wwwecm/download/old/202511.tlz -%%%%-progress''
 +
 +===== Deleting the old files =====
 +
 +The following command was used to delete the old zipballs from the old subdirectories.
 +It only processes files that have fewer than 2 (i.e., exactly 1) hardlinks
 +pointing to them.
 +Therefore, the most recent build of each project (which has a hardlink from the
 +https://pushbx.org/ecm/download/ directory) was preserved.
 +
 +''~/wwwecm/download/old$ find -type f -a -iname '*.zip' -a -links -2 | LC_ALL=C sort | xargs -r rm''
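
The selection can be checked before anything is deleted by leaving out the final ``xargs -r rm``. The sketch below does this in a scratch directory, assuming GNU find. The names ``current.zip`` and ``20250101.zip`` are hypothetical.

```shell
#!/usr/bin/env bash
set -eu
work="$(mktemp -d)"
mkdir -p "$work/old/ldos"
echo a > "$work/old/ldos/current.zip"
echo b > "$work/old/ldos/20250101.zip"
# Second link to the newest build, like the copy kept in download/.
ln "$work/old/ldos/current.zip" "$work/current.zip"

# -links -2 matches files with fewer than 2 links, that is exactly 1.
selected="$(cd "$work/old" && find . -type f -iname '*.zip' -links -2 | LC_ALL=C sort)"
echo "$selected"
rm -rf "$work"
```

Only the singly-linked ``20250101.zip`` is listed, so the build that is still linked from the parent directory would survive the deletion.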
 +
 +
 +{{tag>server webecm}}
 +
 +
 +~~DISCUSSION~~
  