As discovered by Andrew Bird on the 2017 iniload review post, our hgweb server now requires a sign-in.
My reply there:
That's intended. While the prompt unfortunately doesn't (yet?) indicate how to sign in, it is described on the website:
2025-10-21 Oct Tue
Access to the hg.pushbx.org hgweb server is now password-protected. This has become a necessity due to unruly bot activity. Enter anonymous as the username and any nonempty string as the password. Alternatively, download a snapshot (< 100 MiB per file) of the hg repos from the backups.hg directory. (They are transient, but usually the one for the last elapsed day will be available.)
The pushbx.org and ulukai.org domains, backed by the same server, recently received as many as 100_000 requests per day. Most of the involved IP addresses did not send more than 2 requests each, so blocking IPs wouldn't be workable. Their user agents were obviously false and partially randomised, so blocking based on those wouldn't work well either. It also appears they didn't honour the robots.txt.
I assume that the bot activity is from scrapers searching for input to generative AI training datasets. I'd gladly let them download the occasional backups.hg file, which contains all the public ecm hgweb repos. But there doesn't seem to be any intelligence (natural or artificial) behind them, and I don't see a way to notify their creators to arrange for a smart approach like that.
Therefore, they ended up brute-forcing the hgweb contents. Consider that the lDebug repo alone, with history stretching back to 2010-10-24, has more than 6000 changesets. Now assume that there are about 20 files in every revision. Even with these very conservative estimates, there are 120_000 possible single-file pages, and the scrapers do appear to indiscriminately download whatever they find.
Recently the traffic generated by this effort effectively constituted a Distributed Denial of Service, often slowing the entire https server to a crawl or even causing timeouts on legitimate accesses. That includes the hgweb, the dokuwiki, the files and downloads, and of course the webpage and the documentation files.
As of Monday we're considering moving to a different hosting solution for the Mercurial (hg) repositories. The folks at foss.heptapod.net seem like they may welcome at least the essential repos that provide the components of lDebug and lDOS. I already considered that platform back when Bitbucket announced it would sunset hg repo support. However, at least one feature seems to be missing from GitLab/heptapod: the view of a single changeset does not link to both the parents and the children of that changeset.
For now, we added the sign-in requirement for the local hgweb. Responsiveness has already improved a lot, although the bots are still not dissuaded. Another alternative is to download one of the backups.hg files and unpack it on another machine, then set up hgweb there. I recently managed to do this using the Apache web server on an amd64 Debian machine, allowing the repos to be accessed at an address like http://127.0.0.1/hg/ecm/ (I may detail the setup I did on another day.)
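Roughly, the unpacking and serving can be sketched as follows. This is not the exact Apache setup I used (that may follow another day); the sketch recreates one repo from a downloaded bundle and then serves it with Mercurial's bundled hgweb WSGI application via wsgiref. All paths, file names and the port here are placeholders.

#!/usr/bin/env python3
# Sketch: unbundle a downloaded backups.hg snapshot into a local repo and
# serve it with Mercurial's hgweb WSGI application. Paths, file names and
# the port are placeholders, not the actual pushbx.org setup.
import pathlib
import subprocess

BUNDLE = pathlib.Path("ldebug-backup.hg")   # downloaded snapshot (placeholder name)
REPOS = pathlib.Path("/srv/hg/repos")
REPO = REPOS / "ldebug"

# Recreate the repository from the bundle using the hg command line.
REPOS.mkdir(parents=True, exist_ok=True)
subprocess.run(["hg", "init", str(REPO)], check=True)
subprocess.run(["hg", "-R", str(REPO), "unbundle", str(BUNDLE)], check=True)
subprocess.run(["hg", "-R", str(REPO), "update"], check=True)

# Minimal hgweb configuration publishing everything below REPOS under /ecm/.
config = pathlib.Path("/srv/hg/hgweb.config")
config.write_text("[paths]\n/ecm = %s/*\n" % REPOS)

# Build the standard hgweb WSGI application; under Apache the same
# 'application' object would be used through mod_wsgi instead of wsgiref.
from mercurial import demandimport
demandimport.enable()
from mercurial.hgweb import hgweb
application = hgweb(str(config).encode())

if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    # Serve locally, e.g. http://127.0.0.1:8000/ecm/ldebug/
    make_server("127.0.0.1", 8000, application).serve_forever()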
Discussion
As an example, the following page was requested on 2025-10-21: https://hg.pushbx.org/ecm/ldebug/comparison/82eaae368e0e/.hgignore?revcount=30
Such a page is completely useless, but that didn't stop the scraper from loading it. Among the logged requests, several accessed "comparison" pages like this, which generally have no difference to show.
All of this could be avoided if the scrapers would just download the backup files to obtain all the repos.
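For reference, scripting such a download is simple. The sketch below fetches one backup file through the password prompt with the anonymous credentials; the URL and file name are placeholders, and the real names can be seen in the backups.hg directory listing.

#!/usr/bin/env python3
# Sketch: download one backups.hg snapshot through the HTTP basic auth
# prompt. The URL and file name are placeholders, not the real ones.
import shutil
import urllib.request

url = "https://hg.pushbx.org/backups.hg/ecm.hg"   # placeholder file name

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Username "anonymous", any nonempty string as the password.
password_mgr.add_password(None, "https://hg.pushbx.org/", "anonymous", "x")
opener = urllib.request.build_opener(
    urllib.request.HTTPBasicAuthHandler(password_mgr))

with opener.open(url) as response, open("ecm.hg", "wb") as out:
    shutil.copyfileobj(response, out)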