====== CPU performance comparison ======

''2023-03-21T20:50:10+01:00 Tue''

Recently, Bret Johnson and some others on the freedos-devel mailing list discussed how different CPUs perform. I want to add some results to this topic.

===== The LZMA and LZSA2 depack test =====

I extracted the files ''tmp/lz/tdebug.com'' and ''tmp/sa2/tdebug.com'' from the lDebug release 5 packages. I then ran these on three different machines:

  * The 1 MiB model of the HP 95LX, which [[https://en.wikipedia.org/wiki/HP_95LX|is said to have]] a 5.37 MHz NEC V20. This machine runs MS-DOS 3.20 in Real 86 Mode.
  * The old box, running a single-core 686 (Intel Pentium 3) at 1 GHz. This machine runs MS-DOS 7.10 and JemmEx v5.69, thus in Virtual 86 Mode.
  * The new box, running a quad-core AMD A10-7870K at nearly 4 GHz. This machine runs Debian Linux, which in turn runs dosemu2 or qemu, both with KVM, in both cases running a recent FreeDOS kernel in Real/Virtual 86 Mode.


==== The new box ====

I uploaded [[https://pushbx.org/ecm/test/20230321/testperf.tlz|the tests for the new box]] as well as [[https://pushbx.org/ecm/test/20230321/rel5test.txt|the results of the tests]]. I ran two tests: the LZMA-lzip-packed lDebug and the LZSA2-packed lDebug. LZMA has the best compression ratio, whereas LZSA2 is among the fastest depackers.

This is the lzip test:

<code>$ ls -l ldebug5/tmp/lz/tdebug.com -gG
-rw-rw-r-- 1 180736 Mar  8 18:21 ldebug5/tmp/lz/tdebug.com
$ DEFAULT_MACHINE=qemu QEMU=./qemu-kvm.sh ./test.sh 1024 lz
   27.08s for 1024 runs (   26ms / run), method               lz
$ DEFAULT_MACHINE=dosemu ./test.sh 1024 lz
Info: 1SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS [cut for brevity]
   27.98s for 1024 runs (   27ms / run), method               lz
$ </code>

We can see that dosemu2 (''dosemu2-2.0pre8-20210107-2657-gd9079b724'') and qemu (''QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-10+b2)'') are almost equally fast. From experience, dosemu2 tends to take more time to start up, which could contribute to the small difference, even though we're doing 1024 runs of the same program within a single VM and thus incurring the startup costs only once.

Next, the LZSA2 test:

<code>$ ls -l ldebug5/tmp/sa2/tdebug.com -gG
-rw-rw-r-- 1 186368 Mar  8 18:21 ldebug5/tmp/sa2/tdebug.com
$ DEFAULT_MACHINE=qemu QEMU=./qemu-kvm.sh ./test.sh 1024 sa2
    2.15s for 1024 runs (    2ms / run), method              sa2
$ DEFAULT_MACHINE=dosemu ./test.sh 1024 sa2
Info: 1SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS [cut for brevity]
    2.92s for 1024 runs (    2ms / run), method              sa2
$ </code>

The time is under 3ms on both VMs. At this point the per-run time is reported at too coarse a resolution, so we can refine the test using the ''INICOMP_SPEED_SCALE'' variable. This results in:

<code>$ DEFAULT_MACHINE=qemu QEMU=./qemu-kvm.sh INICOMP_SPEED_SCALE=2 ./test.sh 1024 sa2
    2.19s for 1024 runs (   2.14ms / run), method              sa2
$ DEFAULT_MACHINE=dosemu INICOMP_SPEED_SCALE=2 ./test.sh 1024 sa2
Info: 1SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS [cut for brevity]
    2.94s for 1024 runs (   2.87ms / run), method              sa2
$ </code>

Speaking of the variable, this ''test.sh'' script was extracted [[https://hg.pushbx.org/ecm/ldebug/file/release5/source/mak.sh|from the mak script of lDebug]], which [[https://hg.pushbx.org/ecm/ldebug/rev/64ad0acaf3c9|had the speed test option added in March 2022]], almost a year ago. The ''cfg.sh'' is an exact copy of lDebug's, whereas ''ovr.sh'' is a copy of lDebug's with three paths adjusted to work with the testperf directory structure. The lmacros, ldosboot, and bootimg repos are full clones of the current revisions from our server. ''qemu-kvm.sh'' is a little script that just executes ''qemu-system-i386'' with the ''%%--enable-kvm%%'' option and all other parameters passed through. It looks like this:

<code>#! /bin/bash
qemu-system-i386 --enable-kvm "$@"</code>

Finally, I also extracted lDebug's ''misc/quit.asm'', which is a small utility program to shut down a QEMU machine from within itself. The test script will, when using QEMU rather than dosemu2, assemble this utility as well as a boot sector loader, and prepare a FreeDOS boot diskette image using the bootimg NASM script. The ''kernel.sys'' and ''command.com'' files are expected in ''~/.dosemu/drive_c/'', but the source pathnames can be adjusted in the ''ovr.sh'' overrides file.
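
For illustration, here is a minimal sketch of such a quit utility. This is not the actual ''misc/quit.asm''; it assumes QEMU's default PC machine, where writing 2000h (the SLP_EN bit with SLP_TYP 0) to the ACPI PM1a control port at 604h initiates an S5 poweroff:

<code>	cpu 8086
	org 100h		; assembles to a DOS .com program (NASM)

	mov dx, 604h		; QEMU's ACPI PM1a control port
	mov ax, 2000h		; SLP_EN set, SLP_TYP = 0 (S5 on QEMU)
	out dx, ax		; request poweroff; QEMU quits here
	ret			; only reached if the write was ignored</code>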


==== The old box ====

To run the tests on the old box, I transferred the two ''tdebug.com'' files one after another (using the trusty PD ZModem, of course) from the new box. I also added a script file which simplifies the timing compared to manually keeping watch. Piping an empty line into ''time'' makes it display the current time, with the empty line then answering the new-time prompt. The script is named ''testtdeb.bat'' and these are its contents:

<code>echo.|time
tdebug.com b %1
echo.|time</code>

The lzip test took from 20:32:59,91 to 20:33:30,61 for 256 runs. That's just over 30s. 30_000ms per 256 runs results in 117.2ms per run. LZSA2 took from 20:38:18,15 to 20:38:23,75 **for 1024 runs**. That's about 5.5s. 5_500ms per 1024 runs results in 5.37ms per run.


==== The HP 95LX ====

To run these tests, I set up the 1 MiB model of the HP 95LX, as the tdebug test programs need in excess of 350 KiB of memory to run. As usual I first transferred Public Domain ZModem, then used that to transfer all the other programs and scripts I needed. Without an SRAM card, the disk space on the internal RAM drive quickly filled up, so I actually transferred the lzip-compressed test program first, tested it, then deleted it. Only then did I transfer the LZSA2-compressed test program and test that.

The lzip test took a long time, as expected. It ran from 19:03:59 to 19:50:04 for 16 runs, just over 46 minutes. 46 minutes are 2760s. And 2_760_000ms per 16 runs results in 172_500ms per run. Nearly 3 minutes per run matches prior experience on another HP 95LX, which was the reason we switched the inicomp winner to the faster LZSA2.

That depacker, in turn, took from 20:02:51 to 20:04:42 for 16 runs, for a duration of 111s. 111_000ms per 16 runs results in 6_938ms per run, also matching our experiences with compressed executables on the 95LX.


==== Comparison ====

For LZMA-lzip: The HP 95LX takes 1471 times as long as the 686 machine. And the 95LX takes 6388 times as long as the A10. The A10 is about 4.3 times as fast as the 686.

For LZSA2: The HP 95LX takes 1292 times as long as the 686. And the 95LX takes 2417 times as long as the A10. LZSA2 depacks only 1.87 times as fast on the A10 as it does on the 686.

For reference, the 5.37 MHz NEC V20 runs at a frequency that's about one 186th of the 686, and one 740th of the A10. And the 686 of course runs at a quarter the (maximum) frequency of the A10.

Conclusion? CPU-bound tasks like depacking greatly benefit from higher frequencies, though frequency alone cannot account for the entire speed-up. Presumably pipelining and caching come into this as well, though the decompressed data is larger than 100 KiB, so it will not fully fit into small caches. LZMA-lzip apparently does many multiplications and shifts, which may be more costly than most instructions on the good old NEC V20 as well.


===== Bret Johnson's SLOWDOWN problems =====

There were three problems I ran into while using [[https://bretjohnson.us/|Bret Johnson's SLOWDOWN program]]. First, the reason I started this article: It was noted on the mailing list that SLOWDOWN uses an ''in al, dx'' instruction within its ''Waste'' function. Port I/O access is very likely much slower than other instructions on modern systems. Although I only tested in V86M and KVM, I suppose that an ''in'' instruction may be slower in R86M as well. I created a Script for lDebug file, ''slowfix2.sld'', to modify SLOWDOWN in memory so as not to do port input in the ''Waste'' function. This is that script:

<code>a ss:sp - 10
 mov dx, 21
 in al, dx
 .
r v0 := aao - (sp - 10)
s cs:100 length bxcx range ss:sp - 10 length v0
if (src == 0) then goto :error
f cs:sro length v0 90
goto :eof
:error
; Waste loop input instruction not found</code>

It assembles a signature code sequence on the stack, determines its length automatically, searches the program image for the sequence using an S command with a RANGE parameter, checks whether any matches occurred, and then patches the (first) match with NOP opcodes to disable the port input.
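
For reference, the two instructions of the signature assemble to just four bytes, so ''v0'' ends up as 4 and four NOP (90h) opcodes overwrite the match:

<code>BA 21 00   mov dx, 21	; 21h is the PIC's interrupt mask register port
EC         in al, dx</code>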

The second problem I encountered was that, when running SLOWDOWN in dosemu2 KVM on the new box, it kept crashing trying to execute a ''wbinvd'' instruction in its ''DisableCache'' function, after trying to disable the cache by writing to ''cr0''. Passing the parameter ''/C:NO'' did not help. This was fixed by another patch, which disables the ''wbinvd'' instruction. It is based on the first patch Script for lDebug. It is named ''slowfix3.sld'' and contains this text:
 +
 +<code>a ss:sp - 10
 + wbinvd
 + .
 +r v0 := aao - (sp - 10)
 +s cs:100 length bxcx range ss:sp - 10 length v0
 +if (src == 0) then goto :error
 +f cs:sro length v0 90
 +goto :eof
 +:error
 +; WBINVD instruction not found</code>

The final problem was that, after using the two patches, the program was interrupted by a division overflow. The division itself is in the ''TestWaste'' function, which calls ''Waste'' in a loop until 4 timer ticks have elapsed. It counts how often it was able to run a ''Waste'' call in this timeframe, using a 32-bit value in a register pair. However, to get an average per-tick value, it uses a narrowing 32-bit to 16-bit divide instruction, with a constant 4 as the divisor. This consistently overflows on the new box (dosemu2 KVM), causing the interrupt.
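
A minimal sketch of that failure mode (not SLOWDOWN's actual source):

<code>	; assume dx:ax already holds the 32-bit Waste call count
	mov cx, 4		; divisor: the 4 timer ticks
	div cx			; dx:ax / cx -> quotient in ax, remainder in dx
	; if the count divided by 4 exceeds 0FFFFh, the quotient does
	; not fit into ax and the CPU raises a division overflow (interrupt 0)</code>

The overflow thus occurs as soon as the 32-bit count reaches 40000h (262_144) calls, which a fast machine easily manages within the roughly 220ms that 4 ticks of the 18.2 Hz timer take.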

My solution to this is to manually set ''ax'' to FFFFh (quotient), ''dx'' to 3 (remainder), and then skip past the ''div'' instruction. This appears to work, though I am unsure if everything else is in order after this failure. Oddly enough, the division overflow appears to occur in the ''TestWaste'' call from the ''CalcCacheSFactor'' function, not the call in ''CalcSpeed''.

This is the command to run in order to use only the ''wbinvd'' patch. (The ''/C='' switch passes startup commands to lDebug: ''y'' runs a script file and ''g'' runs the debuggee.)

<code>lcdebugu /c=yslowfix3.sld;g slowdown.com /t</code>

This results in a Slowdown-Unit rating of around 620 in dosemu2 KVM on the A10. The "MHz of an equivalent 80486" rating comes out at about 50 MHz.

This is the command to run in order to use both patches and continue past the overflowing division:

<code>lcdebugu /c=yslowfix3.sld;yslowfix2.sld;g;r,ax,-1;r,dx,3;g=abo slowdown.com /t</code>

With both patches, this results in a Slowdown-Unit rating of up to 63400, and a 486 equivalent of 5350 MHz.


On the 686 box, neither the ''wbinvd'' fault nor the division overflow occurs. Without the ''slowfix2.sld'' script, this machine gets a rating of 1780 SUs and a 486 equivalent of 150 MHz. With that script, the SUs rise to nearly 19750 and the 486 equivalent to 1666 MHz.


On the NEC V20 we get 17 SUs, regardless of whether the ''in al, dx'' instruction is executed in the time-wasting loop or not.


===== Conclusion =====

CPU-bound benchmarks are much faster on a modern machine than on older ones. The frequency increase alone does not suffice to explain the speedup. However, some things, like doing I/O, were not sped up nearly as much.

{{tag>ldebug slowdown dosemu2}}


~~DISCUSSION~~
  