2023-08-27
This week some development happened. I also finished an audit of all lDebug changes in recent months and did a pass over the entire lDebug manual to update the worst outdated parts. Finally, I prepared the lDebug release 6 yesterday.
The range parameter type's manual entry was updated to specify that the default length is clamped to the end of a segment if the start is close to that.
While trying to build EDR-DOS using JWasm, WarpLink, x2b2, OpenWatcom 1.9, as well as the tools shipped with the OpenDOS release, I encountered a bug in WarpLink. When invoked from the mak.bat or command/make.bat scripts, WarpLink reported not finding a file despite it existing.
First I attempted to debug the problem by running ldebug /p warplink @resp2 but that made the problem disappear. Next I tried intercep c:\bin\warplink.exe @resp2 but that also made the problem disappear. Likewise intercep c:\command.com warplink @resp2.
I finally ran lDebug (with lh, but this only puts the environment into the UMA), changed the allocation strategy to last fit (int 21h function 5801h with bx equal to 2), then ran an N command to load lCDebugU, then L, then G. In the lCDebug application I entered a TSR command and then G. Next, I instructed the first debugger (lDebug) to quit itself. This left the second debugger resident near the top of the Low Memory Area (using 116 KiB), though with a memory gap behind it (of 82 KiB for the compressed lCDebug, only 46 KiB for the uncompressed lCDebugU).
This finally allowed the linker to exhibit the problem while we had a debugger resident. However, the problem would take some more time to debug. I first re-entered the debugger with a small 6-byte utility called int3.com and ran these commands:
install indos
uninstall debug
bp new ptr ri21p when ah == 3D
This allowed us to Go again and have the applications break into the debugger whenever they opened a file using int 21h service 3Dh. After this, I ran WarpLink again and continued running it until the offending open. Sure enough, DOS returned an error opening this file.
First I checked the buffer with the pathname passed to DOS. It contained the expected name, .\BIN\CMDLIST.OBJ. Next I assembled a little helper on the stack of the application, to get the current directory, like so:
r sp -= 80
a ss:sp
push ax
push dx
push ds
push si
mov ah, 47
mov dl, 0
push ss
pop ds
mov si, (sp + 30)
int 21
pop si
pop ds
pop dx
pop ax
jmp (cs):(ip)
.
r csip sssp
Tracing this yielded no new insight; the cwd was COMMAND as expected. The open still failed the same way afterwards.
The next step was to trace into the DOS. Just using the T command was not sufficient, as it seemed the dosemu2 handler would loop somehow. So I used a di 21 command to find the DOS's entrypoint. I used a G command with a temporary breakpoint to trace into this handler. (I discovered during this part that the DOS code segment was in the Low Memory Area, which I did not fix immediately because the exact configuration was needed to reproduce the bug. The fix was to add a dos=high directive to the FDCONFIG.SYS file.)
I eventually traced into the DOS dispatcher, the DosOpen function, the DosOpenSft function, and the truename function. Finally I found that the truename function was underflowing the input buffer to check for a second dot before the dot that was the first text in the buffer. And sure enough, one byte before the pathname buffer, when the bug happened there was a dot in that spot in memory.
I recalled that this bug may have already been fixed in fdpp and quickly found the relevant patch and issue, by comparing the kernel/newstuff.c file to FreeDOS's and using github blame to find the commit. It happened to be an issue (on github) in which I had commented, actually. The comments weren't related to that bug, rather, asking about financially contributing to dosemu2. However, I certainly must have scanned the patch back then.
I have since adapted the patch to the FreeDOS kernel, with credit to stsp. It was merged last night.
As it happens, the changed kernel no longer hits the bug case on its own. However, I tested the specific call with the buffer that exhibited the bug before, and it is indeed fixed.
The other work on the FreeDOS kernel involves four FCB find bugs that I found:
I did not prepare patches for these problems yet, but reported them to the FreeDOS kernel repo's issue tracker.
Trying to run the EDR-DOS build utilising WarpLink on the local machine, where dosemu2 can use KVM V86M and KVM PM for running DOS software, I actually ran into a different WarpLink problem. It hung. Running ldebug /p warplink @wlbios.rsp I encountered the same hang. Next, I ran dosemu2 with a serial port connected to a local terminal application. Then in lDebug I ran the command install serial, timer. Upon reaching the hang I sent two Control-C keypresses to the serial input of the debugger. It immediately disassembled a most suspect instruction: mov ax, [FFFF]. Of course that would hang! But only on KVM, as dosemu2's software emulation of the CPU does not respect segment limits. Rerunning the debugger with install intfaults also caught the fault immediately, as expected.
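To illustrate why that instruction is suspect: a word read at offset FFFFh also needs the byte at offset 10000h, which lies past the limit of a 64 KiB segment. A minimal Python sketch of that check follows. The function name is made up for illustration, and it models only the limit comparison, not how KVM or dosemu2 actually handle the fault.

```python
SEGMENT_LIMIT = 0xFFFF  # highest valid offset in a 64 KiB segment


def exceeds_segment_limit(offset, size, limit=SEGMENT_LIMIT):
    """Return True if an access of `size` bytes at `offset` touches a byte past `limit`."""
    last_byte = offset + size - 1
    return last_byte > limit


if __name__ == "__main__":
    # Word read as in: mov ax, [FFFF]
    print(exceeds_segment_limit(0xFFFF, 2))
    # A word read one byte earlier still fits.
    print(exceeds_segment_limit(0xFFFE, 2))
    # A byte read at FFFFh fits as well.
    print(exceeds_segment_limit(0xFFFF, 1))
```

Only the first case crosses the limit, which matches the hang showing up on KVM but not under software emulation that ignores segment limits.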
Now, what was the cause of that? It turned out my memory access fix script that I used to port WarpLink to NASM didn't pick up negative numbers as being numeric properly. Thus mov ax, -1 was converted wrongly to mov ax, [-1]. I am lucky to have found this bug, actually.
There was an entire class of related bugs. The cases other than simple negative numbers involved either an equate in some numeric expression, or a numeric expression consisting of only numbers. All of these were added to the memory access fix script, though its support is not perfect. (For example, an expression involving two equates would not be recognised. Equates that are actually references to memory labels would also not be handled correctly.)
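The negative number case can be sketched roughly as follows. This is not the actual fixmem.pl logic (that script is Perl and handles much more); the patterns and the function name here are assumptions, meant only to show how a numeric test without a sign fails to recognise -1 as an immediate.

```python
import re

# Hypothetical number syntax: hex with h suffix, hex with 0x prefix, or decimal.
_DIGITS = r"(?:[0-9][0-9a-f]*h|0x[0-9a-f]+|[0-9]+)"
UNSIGNED = re.compile(rf"^{_DIGITS}$", re.IGNORECASE)
SIGNED = re.compile(rf"^[+-]?{_DIGITS}$", re.IGNORECASE)


def is_numeric(operand, allow_sign=True):
    """Return True if the operand looks like a plain number (an immediate)."""
    pattern = SIGNED if allow_sign else UNSIGNED
    return pattern.match(operand.strip()) is not None


if __name__ == "__main__":
    for operand in ["1", "0FFh", "-1", "-0FFh", "somelabel"]:
        # Without the sign, "-1" is not seen as numeric and would get brackets.
        print(operand, is_numeric(operand, allow_sign=False), is_numeric(operand))
```

An operand that is not recognised as numeric would be treated as a memory reference and gain brackets, which is how mov ax, -1 could end up as mov ax, [-1].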
Another fix involves a numeric expression involving an OFFSET keyword, but with the keyword not appearing at the very beginning of the operand. This was fixed by detecting an offset keyword with the regexp /\boffset\b/i where before there was an ^ anchor at the beginning.
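The difference between the two patterns can be shown with a small Python sketch. The anchored pattern below is an assumption about what the earlier regexp looked like, and the operand strings are made up for illustration.

```python
import re

ANCHORED = re.compile(r"^offset\b", re.IGNORECASE)   # assumed earlier form
ANYWHERE = re.compile(r"\boffset\b", re.IGNORECASE)  # form quoted in the text


def has_offset(operand, anchored=False):
    """Return True if the operand contains the OFFSET keyword per the chosen pattern."""
    pattern = ANCHORED if anchored else ANYWHERE
    return pattern.search(operand) is not None


if __name__ == "__main__":
    for operand in ["OFFSET buffer", "2 + OFFSET buffer", "offsetvar"]:
        print(operand, has_offset(operand, anchored=True), has_offset(operand))
```

The \b word boundaries keep an identifier such as offsetvar from matching, while dropping the ^ anchor lets the keyword be found in the middle of an expression.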
I also learned to construct a scriptlet which copies the original files and applies the fixmem.pl script to all ported files all over again, then re-applies several patches that did some manual fixes. In one case, a patch had to be applied before running fixmem.pl. With the scriptlet completed, it was very easy to rerun the entire port and then find the differences from the prior revision using a hg d or hg d | diffrr command. To avoid having to re-invent such scriptlets, I noted them down in the commit messages of the changes to the repo.
Finally, I shortened several jumps to largely identicalise the NASM build output to the TASM build.
I did this by running the NASM build (using the host NASM with ./mak.sh, not the DOS NASM) then running bdiff wl.exe wltasm.exe. Often, the bdiff result would not be very useful as is, as it stops when encountering too many different bytes. In this case, say, if it ended on finding 426 different bytes, rerun as bdiff -426 wl.exe wltasm.exe. (It is important to put the number switch before the filename parameters.) Then scroll to the offset at which it previously stopped.
Next, run ldebug /f wl.exe. (Instead of re-running the debugger, I simply reloaded the changed file using a subsequent L command.) If the change was on, for example, offset 1138, disassemble the nearby bytes using a command like u 1138 + F0. The F0h displacement is 100h for the offset in the PSP segment minus 10h to start disassembly a few instructions earlier. (Sometimes this requires retrying with a few different offsets to get the disassembly to synchronise properly.)
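The displacement arithmetic can be written out as a short Python sketch. The function and constant names are made up for illustration; the only assumption is the one stated in the text, that the file contents start at offset 100h of the segment.

```python
PSP_SIZE = 0x100   # file contents start at offset 100h in the PSP segment
BACKTRACK = 0x10   # start a few instructions before the changed byte


def disassembly_start(file_offset, backtrack=BACKTRACK):
    """Map a file offset reported by bdiff to an offset to pass to the U command."""
    return file_offset + PSP_SIZE - backtrack


if __name__ == "__main__":
    # Same address as entering: u 1138 + F0
    print(f"u {disassembly_start(0x1138):X}")  # prints "u 1228"
```

Passing a different backtrack value corresponds to retrying with another offset when the disassembly does not synchronise.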
Next, the hardest part: Identify a particularly unique instruction. It is best if it involves only registers, as those are the easiest to search for in the sources. Search for it in the *.nas files using grep, specifying some context lines, and pipe the output to less. Then search for the particular spot in the sources.
All cases of wildly different bytes in the NASM build, after fixing all the bugs (some of which I found while identicalising), were jumps specified with NEAR PTR but actually optimised to short jumps by TASM. However, letting NASM optimise all jumps by simply dropping all the NEAR PTR uses also wasn't correct. Perhaps it depends on whether the jump is backwards or forwards.
After fixing all the jumps, the only remaining differences are in the data segment. It appears that the alignment bytes differ, and that the TASM build creates a slightly longer load image. I assume this is due to it emitting some data that is nobits for the NASM build. There are some warnings related to this in the TASM build.
Discussion
Actually both choices leave a gap of about 45 KiB or 46 KiB. I'm not sure how I had an 82 KiB gap on one occasion, but it isn't down to the choice of the compressed or uncompressed executable.
The executable image size plus minimum allocation is very similar for both choices, too:
The above calculation is exePages times 512 minus exeHeaderSize times 16 plus exeMinAlloc times 16. Refer to the structure at https://hg.pushbx.org/ecm/ldebug/file/7a0c4551b99b/source/debug.mac#l755
More information on the involved sizes:
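The size formula described above (pages of 512 bytes, header size and minimum allocation in 16-byte paragraphs) can be sketched in Python as below. The parameter names mirror the field names from the text, and the example values are made up rather than taken from either lCDebug build.

```python
def image_plus_min_alloc(exe_pages, exe_header_size, exe_min_alloc):
    """Bytes needed: exePages * 512 - exeHeaderSize * 16 + exeMinAlloc * 16."""
    return exe_pages * 512 - exe_header_size * 16 + exe_min_alloc * 16


if __name__ == "__main__":
    # Made-up example values: 100 pages, 2 header paragraphs, 256 paragraphs minimum allocation.
    size = image_plus_min_alloc(exe_pages=100, exe_header_size=2, exe_min_alloc=0x100)
    print(size)  # 55264
```

Like the formula in the text, this sketch ignores any partially filled last page.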