
CPU performance comparison

2023-03-21T20:50:10+01:00 Tue

Recently, Bret Johnson and some others on the freedos-devel mailing list discussed how different CPUs perform. I want to add some results to this topic.

The LZMA and LZSA2 depack test

I extracted the files tmp/lz/tdebug.com and tmp/sa2/tdebug.com from the lDebug release 5 packages. I then ran these on three different machines:

The new box

I uploaded the tests for the new box as well as their results. I ran two tests: the LZMA-lzip-packed lDebug and the LZSA2-packed lDebug. LZMA has the best compression ratio, whereas LZSA2 is among the fastest depackers.

This is the lzip test:

$ ls -l ldebug5/tmp/lz/tdebug.com -gG
-rw-rw-r-- 1 180736 Mar  8 18:21 ldebug5/tmp/lz/tdebug.com
$ DEFAULT_MACHINE=qemu QEMU=./qemu-kvm.sh ./test.sh 1024 lz
   27.08s for 1024 runs (   26ms / run), method               lz
$ DEFAULT_MACHINE=dosemu ./test.sh 1024 lz
Info: 1SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS [cut for brevity]
   27.98s for 1024 runs (   27ms / run), method               lz
$ 

We can see that dosemu2 (it is dosemu2-2.0pre8-20210107-2657-gd9079b724) and qemu (QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-10+b2)) are almost equally fast. From experience, dosemu2 tends to take more time to start up, which could contribute to the small difference, even though we're doing 1024 runs of the same program within a single VM and thus incur the startup costs only once.

Next, the LZSA2 test:

$ ls -l ldebug5/tmp/sa2/tdebug.com -gG
-rw-rw-r-- 1 186368 Mar  8 18:21 ldebug5/tmp/sa2/tdebug.com
$ DEFAULT_MACHINE=qemu QEMU=./qemu-kvm.sh ./test.sh 1024 sa2
    2.15s for 1024 runs (    2ms / run), method              sa2
$ DEFAULT_MACHINE=dosemu ./test.sh 1024 sa2
Info: 1SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS [cut for brevity]
    2.92s for 1024 runs (    2ms / run), method              sa2
$ 

The time is under 3ms on both VMs. At this point the per-run time is reported at too coarse a resolution, so we can enhance the test using the INICOMP_SPEED_SCALE variable. This results in:

$ DEFAULT_MACHINE=qemu QEMU=./qemu-kvm.sh INICOMP_SPEED_SCALE=2 ./test.sh 1024 sa2
    2.19s for 1024 runs (   2.14ms / run), method              sa2
$ DEFAULT_MACHINE=dosemu INICOMP_SPEED_SCALE=2 ./test.sh 1024 sa2
Info: 1SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS [cut for brevity]
    2.94s for 1024 runs (   2.87ms / run), method              sa2
$ 

Speaking of the variable, this test.sh script was extracted from the mak script of lDebug, which had the speed test option added in March 2022, almost a year ago. The cfg.sh is an exact copy of lDebug's, whereas ovr.sh is a copy of the former with three paths adjusted to work with the testperf directory structure. The lmacros, ldosboot, and bootimg repos are full clones of the current revisions from our server. qemu-kvm.sh is a little script that just executes qemu-system-i386 with the --enable-kvm option and all other parameters passed through. It looks like this:

#! /bin/bash
qemu-system-i386 --enable-kvm "$@"

Finally, I also extracted lDebug's misc/quit.asm, which is a small utility program to shut down a QEMU machine from within the guest. The test script will, when using QEMU rather than dosemu2, assemble this utility as well as a boot sector loader, and prepare a FreeDOS boot diskette using the bootimg NASM script. The kernel.sys and command.com files are expected in ~/.dosemu/drive_c/, but the source pathnames can be adjusted in the ovr.sh overrides file.
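As an aside, a small sanity check along these lines (my own sketch, not part of the test harness) can confirm that the expected FreeDOS files are in place before kicking off a long run:

#! /bin/bash
# Sketch of a pre-flight check: make sure the FreeDOS system files that
# the boot diskette is built from actually exist at the default location.
for f in kernel.sys command.com; do
  if ! [ -f ~/.dosemu/drive_c/"$f" ]; then
    echo "Missing ~/.dosemu/drive_c/$f -- adjust the source paths in ovr.sh" >&2
    exit 1
  fi
done
echo "FreeDOS system files found."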

The old box

To run the tests on the old box, I transferred the two tdebug.com files one after another (using the trusty PD ZModem, of course) from the new box. I also added a script file which simplifies the timing compared to keeping watch manually. It is named testtdeb.bat and these are its contents:

rem print the current time, run the test program with the passed parameter, print the time again
echo.|time
tdebug.com b %1
echo.|time

The lzip test took from 20:32:59,91 to 20:33:30,61 for 256 runs. That's just over 30s. 30_000ms per 256 runs results in 117.2ms per run. LZSA2 took from 20:38:18,15 to 20:38:23,75 for 1024 runs. That's about 5.5s. 5_500ms per 1024 runs results in 5.37ms per run.
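For the record, the per-run arithmetic is easy to reproduce with bc, using the rounded totals from above:

$ # old box, LZMA-lzip: milliseconds per run
$ echo 'scale=2; 30 * 1000 / 256' | bc
117.18
$ # old box, LZSA2: milliseconds per run
$ echo 'scale=2; 5.5 * 1000 / 1024' | bc
5.37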

The HP 95LX

To run these tests, I set up the 1 MiB model of the HP 95LX, as the tdebug test programs need in excess of 350 KiB of memory to run. As usual, I first transferred Public Domain ZModem, then used that to transfer all the other programs and scripts I needed. Without an SRAM card, the disk space on the internal RAM drive quickly filled up, so I transferred the lzip-compressed test program first, tested it, then deleted it. Only then did I transfer the LZSA2-compressed test program and test that.

The lzip test took a long time, as expected. It ran from 19:03:59 to 19:50:04 for 16 runs, just over 46 minutes. 46 minutes is 2760s, and 2_760_000ms per 16 runs results in 172_500ms per run. Nearly 3 minutes per run matches prior experience on another HP 95LX, which was the reason we switched the inicomp winner to the faster LZSA2.

That depacker, in turn, took from 20:02:51 to 20:04:42 for 16 runs, for a duration of 111s. 111_000ms per 16 runs results in 6_938ms per run, also matching our experiences with compressed executables on the 95LX.

Comparison

For LZMA-lzip: The HP 95LX takes 1471 times as long as the 686 machine. And the 95LX takes 6388 times as long as the A10. The A10 is about 4.3 times as fast as the 686.

For LZSA2: The HP 95LX takes 1292 times as long as the 686. And the 95LX takes 2417 times as long as the A10. LZSA2 depacks only 1.87 times as fast on the A10 as it does on the 686.
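These factors come straight from the per-run times above (172_500ms and 6_938ms on the 95LX, 117.2ms and 5.37ms on the 686, 27ms and 2.87ms on the A10), as a quick bc check confirms:

$ # LZMA-lzip: 95LX vs 686, 95LX vs A10, A10 vs 686
$ echo 'scale=2; 172500/117.2; 172500/27; 117.2/27' | bc
1471.84
6388.88
4.34
$ # LZSA2: the same three ratios
$ echo 'scale=2; 6938/5.37; 6938/2.87; 5.37/2.87' | bc
1291.99
2417.42
1.87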

For reference, the 5.37 MHz NEC V20 runs at a frequency that's about one 186th of the 686, and one 740th of the A10. And the 686 of course runs at a quarter the (maximum) frequency of the A10.

Conclusion? CPU-bound tasks like depacking do greatly benefit from higher frequencies, though frequency alone cannot account for the speed-up: the time ratios far exceed the frequency ratios. Presumably pipelining and caching come into this as well, though the decompressed data is larger than 100 kB, so it will not fully fit in small caches. LZMA-lzip apparently does many multiplications and shifts, which may be more costly than most instructions on the good old NEC V20 as well.

Bret Johnson's SLOWDOWN problems

There were three problems I ran into while using Bret Johnson's SLOWDOWN program. First, the reason I started this article: It was noted on the mailing list that SLOWDOWN uses an in al, dx instruction within its Waste function. On modern systems, the port I/O access is very likely much slower than other instructions. Although I only tested in V86M and KVM, I suppose that an in instruction may be slower in R86M as well. I created a Script for lDebug file, slowfix2.sld, to modify SLOWDOWN in memory so that it does not do port input in the Waste function. This is that script:

a ss:sp - 10
 mov dx, 21
 in al, dx
 .
r v0 := aao - (sp - 10)
s cs:100 length bxcx range ss:sp - 10 length v0
if (src == 0) then goto :error
f cs:sro length v0 90
goto :eof
:error
; Waste loop input instruction not found

It assembles a signature code sequence on the stack, determines its length automatically, searches the program image for the sequence using an S command with a RANGE parameter, checks the src variable to see whether any matches occurred, and then patches the first match, addressed via the sro variable, with NOP opcodes to disable the port input.

The second problem I encountered was that, when running SLOWDOWN in dosemu2 KVM on the new box, it kept crashing while trying to execute a wbinvd instruction in its DisableCache function, after attempting to disable the cache by writing to cr0. Passing the parameter /C:NO did not help. This was fixed by another patch, which disables the wbinvd instruction. It is based on the first Script for lDebug patch, is named slowfix3.sld, and contains this text:

a ss:sp - 10
 wbinvd
 .
r v0 := aao - (sp - 10)
s cs:100 length bxcx range ss:sp - 10 length v0
if (src == 0) then goto :error
f cs:sro length v0 90
goto :eof
:error
; WBINVD instruction not found

The final problem was that, after using the two patches, the program was interrupted by a division overflow. The division itself is in the TestWaste function, which calls Waste in a loop until 4 timer ticks have elapsed. It counts how often it was able to run a Waste call in this timeframe. It does use a 32-bit value in a register pair to count how many calls occurred. However, to get an average per-tick value, it uses a narrowing 32-bit-to-16-bit divide instruction with a constant divisor of 4. This consistently overflows on the new box (dosemu2 KVM), causing the interrupt.
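The overflow condition is easy to quantify: a narrowing divide by 4 can only return a 16-bit quotient, so the count of Waste calls within the four ticks must stay below 4 * 65536.

$ # the quotient must fit in 16 bits, so the 32-bit call count
$ # overflows the divide once it reaches this value:
$ echo $(( 4 * 65536 ))
262144

A fast machine under KVM evidently completes more than a quarter of a million Waste calls in four timer ticks, hence the consistent overflow.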

My solution to this is to manually set ax to FFFFh (quotient), dx to 3 (remainder), and then skip past the div instruction. This appears to work, though I am unsure if everything else is in order after this failure. Oddly enough, the division overflow appears to occur in the TestWaste call from the CalcCacheSFactor function, not the call in CalcSpeed.

This is the command to run in order to use only the wbinvd patch:

lcdebugu /c=yslowfix3.sld;g slowdown.com /t

This results in a Slowdown-Unit rating of around 620 in dosemu2 KVM on the A10. The "MHz of an equivalent 80486" figure comes out at about 50 MHz.

This is the command to run in order to use both patches, and continue past the overflowing division:

lcdebugu /c=yslowfix3.sld;yslowfix2.sld;g;r,ax,-1;r,dx,3;g=abo slowdown.com /t

This results in a Slowdown-Unit rating of up to 63400 with the fixes applied, and a 486 equivalent of 5350 MHz.

On the 686 box, neither the wbinvd fault nor the division overflow occurs. Without the slowfix2.sld script this machine gets a rating of 1780 SUs and a 486 equivalent of 150 MHz. With that script, the SUs rise to nearly 19750 and the 486 equivalent to 1666 MHz.

On the NEC V20 we get 17 SUs, regardless of whether the in al, dx instruction is executed in the time-wasting loop or not.

Conclusion

CPU-bound benchmarks are much faster on a modern machine than they are on older ones. The frequency increase alone does not suffice to explain the speed-up. Some things, like port I/O, were not sped up nearly as much, however.