Late August work

2023-08-27

This week some development happened. I also finished an audit of all lDebug changes in recent months and did a pass over the entire lDebug manual to update the most outdated parts. Finally, I prepared lDebug release 6 yesterday.

MSDebug changes

The range parameter type's manual entry was updated to specify that the default length is clamped to the end of a segment if the start is close to the end of that segment.

FreeDOS kernel

While trying to build EDR-DOS using JWasm, WarpLink, x2b2, OpenWatcom 1.9, as well as the tools shipped with the OpenDOS release, I encountered a bug in WarpLink. When invoked from the mak.bat or command/make.bat scripts, WarpLink reported not finding a file despite it existing.

First I attempted to debug the problem by running ldebug /p warplink @resp2 but that made the problem disappear. Next I tried intercep c:\bin\warplink.exe @resp2 but that also made the problem disappear. The same happened with intercep c:\command.com warplink @resp2.

I finally ran lDebug (with lh, but this only puts the environment into the UMA), changed the allocation strategy to last fit (int 21h function 5801h with bx equal to 2), then ran an N command to load lCDebugU, then L, then G. In the lCDebug application I entered a TSR command and then G. Next, I instructed the first debugger (lDebug) to quit itself. This left the second debugger resident near the top of the Low Memory Area (using 116 KiB), though with a memory gap behind it (of 82 KiB for the compressed lCDebug, only 46 KiB for the uncompressed lCDebugU).

This finally allowed the linker to exhibit the problem while we had a debugger resident. However, the problem would take some more time to debug. I first re-entered the debugger with a small 6-byte utility called int3.com and ran these commands:

install indos
uninstall debug
bp new ptr ri21p when ah == 3D

This allowed us to Go again and have the applications break into the debugger whenever they opened a file using int 21h service 3Dh. After this, I ran WarpLink again and continued running it until the offending open. Sure enough, DOS returned an error opening this file.

First I checked the buffer with the pathname passed to DOS. It contained the expected name, .\BIN\CMDLIST.OBJ. Next I assembled a little helper on the stack of the application, to get the current directory, like so:

r sp -= 80
a ss:sp
 push ax
 push dx
 push ds
 push si
 mov ah, 47
 mov dl, 0
 push ss
 pop ds
 mov si, (sp + 30)
 int 21
 pop si
 pop ds
 pop dx
 pop ax
 jmp (cs):(ip)
 .
r csip sssp

Tracing this turned up nothing unusual: the current directory was COMMAND, as expected. The open still failed the same way afterwards.

The next step was to trace into DOS. Just using the T command was not sufficient, as it seemed the dosemu2 handler would loop somehow. So I used a di 21 command to find the DOS's entrypoint. I used a G command with a temporary breakpoint to trace into this handler. (I discovered during this part that the DOS code segment was in the Low Memory Area, which I did not fix immediately because the exact configuration was needed to reproduce the bug. The fix was to add a dos=high directive to the FDCONFIG.SYS file.)

I eventually traced into the DOS dispatcher, the DosOpen function, the DosOpenSft function, and the truename function. Finally I found that the truename function was underflowing the input buffer to check for a second dot before the dot that was the first text in the buffer. And sure enough, one byte before the pathname buffer, when the bug happened there was a dot in that spot in memory.
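As a rough illustration of this class of bug, the following Python sketch models memory as a byte string in which the pathname buffer starts at a known offset. It is not the kernel's C code: the function names, the `start` parameter, and the memory layout are assumptions made up for the example, and the real check in truename differs in detail.

```python
DOT = ord(".")


def preceded_by_dot_buggy(mem: bytes, start: int, pos: int) -> bool:
    """Check whether the dot at mem[pos] follows another dot.

    Buggy variant: reads mem[pos - 1] even when pos == start,
    that is, one byte before the pathname buffer.
    """
    return mem[pos] == DOT and mem[pos - 1] == DOT


def preceded_by_dot_fixed(mem: bytes, start: int, pos: int) -> bool:
    """Same check, but never looks in front of the buffer start."""
    return mem[pos] == DOT and pos > start and mem[pos - 1] == DOT


path = b".\\BIN\\CMDLIST.OBJ\x00"

# Hypothetical memory layouts: one stray byte, then the pathname buffer.
mem_with_stray_dot = b"." + path
mem_with_other_byte = b"\x00" + path
start = 1  # offset of the pathname buffer within the modelled memory

# The buggy check depends on whatever byte happens to sit before the buffer.
print(preceded_by_dot_buggy(mem_with_stray_dot, start, start))   # True
print(preceded_by_dot_buggy(mem_with_other_byte, start, start))  # False
# The bounded check gives the same answer for both layouts.
print(preceded_by_dot_fixed(mem_with_stray_dot, start, start))   # False
```

This also shows why the problem disappeared whenever the memory layout changed: the outcome depended on a byte that was not part of the input at all.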

I recalled that this bug may have already been fixed in fdpp and quickly found the relevant patch and issue, by comparing the kernel/newstuff.c file to FreeDOS's and using GitHub's blame view to find the commit. It happened to be an issue (on GitHub) in which I had commented, actually. My comments weren't related to that bug; rather, they asked about contributing financially to dosemu2. However, I certainly must have scanned the patch back then.

I have since adapted the patch to the FreeDOS kernel, with credit to stsp. It was merged last night.

As it happens, the changed kernel no longer hit the bug case in the original scenario anyway. However, I tested the specific call with the buffer contents that exhibited the bug before, and it is indeed fixed.

The other work on the FreeDOS kernel involves four FCB find bugs that I found:

  • FreeDOS defaults to search for any directory entry (all attributes except volume label) when FCB find first is used without an extended search FCB. EDR-DOS defaults to a zero attribute.
  • FreeDOS would truncate the current directory cluster written to and read from a search FCB to 16 bits, which would lose a high word of a 32-bit cluster number on a FAT32 FS.
  • The second bug was masked by the third, however. That was the fact that the kernel always retained its internal search DTA for FCB Find Next, not reloading from the search FCB. The logic to do just that was reversed, running for FCB Find First (in which case it was useless) but not for FCB Find Next. This disabled concurrent searches from ever working.
  • The logic to update the search DTA from the search FCB was local-drive-specific. A proper solution has to copy nearly all of the reserved fields of the DTA in order to support redirectors, which may use different fields in the DTA than DOS.
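The 16-bit truncation from the second item can be illustrated with a short Python sketch. The function names and the sample cluster number are hypothetical; the sketch only shows the arithmetic effect, not the kernel's actual FCB layout.

```python
def store_cluster_16(cluster: int) -> int:
    """Model storing a cluster number in a 16-bit field (high word is lost)."""
    return cluster & 0xFFFF


def store_cluster_32(cluster: int) -> int:
    """Model storing a cluster number in a 32-bit field."""
    return cluster & 0xFFFFFFFF


# Hypothetical FAT32 directory cluster beyond the 16-bit range.
cluster = 0x00012345

truncated = store_cluster_16(cluster)
preserved = store_cluster_32(cluster)

print(hex(truncated))  # 0x2345, a different (wrong) cluster
print(hex(preserved))  # 0x12345
```

Cluster numbers below 10000h survive the round trip unchanged, so the problem would only show on FAT32 volumes with directories located in high clusters.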

I did not prepare patches for these problems yet, but reported them to the FreeDOS kernel repo's issue tracker.

Trying to run the EDR-DOS build utilising WarpLink on the local machine, where dosemu2 can use KVM V86M and KVM PM for running DOS software, I actually ran into a different WarpLink problem. It hung. Running ldebug /p warplink @wlbios.rsp I encountered the same hang. Next, I ran dosemu2 with a serial port connected to a local terminal application. Then in lDebug I ran the command install serial, timer. Upon reaching the hang I sent two Control-C keypresses to the serial input of the debugger. It immediately disassembled a most suspect instruction: mov ax, [FFFF]. Of course that would hang! But only on KVM, as dosemu2's software emulation of the CPU does not respect segment limits. Rerunning the debugger with install intfaults also caught the fault immediately, as expected.
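Why a word read at offset FFFFh runs past the segment can be sketched in a few lines of Python. The sketch assumes the usual 64 KiB segment limit (FFFFh) of real and virtual 8086 mode; the function name is hypothetical.

```python
SEGMENT_LIMIT = 0xFFFF  # last valid offset in a 64 KiB segment


def exceeds_limit(offset: int, size: int, limit: int = SEGMENT_LIMIT) -> bool:
    """Return True if an access of `size` bytes at `offset` crosses the limit."""
    last_byte = offset + size - 1
    return last_byte > limit


# mov ax, [FFFF] reads two bytes: offsets FFFFh and 10000h.
print(exceeds_limit(0xFFFF, 2))  # True
# A word read one byte lower still fits.
print(exceeds_limit(0xFFFE, 2))  # False
# A byte read at FFFFh fits as well.
print(exceeds_limit(0xFFFF, 1))  # False
```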

Now, what was the cause of that? It turned out my memory access fix script that I used to port WarpLink to NASM didn't pick up negative numbers as being numeric properly. Thus mov ax, -1 was converted wrongly to mov ax, [-1]. I am lucky to have found this bug, actually.

There was an entire class of related bugs. The cases other than simple negative numbers involved either an equate in some numeric expression, or a numeric expression consisting of only numbers. All of these were added to the memory access fix script, though its support is not perfect. (For example, an expression involving two equates would not be recognised. Equates that are actually references to memory labels would also not be handled correctly.)
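The negative-number case can be sketched in Python as follows. The real fixmem.pl is a Perl script that is not shown here, so both patterns and the `convert` helper are hypothetical stand-ins that only illustrate how a sign in front of a number can defeat a numeric check.

```python
import re

# Hypothetical "is this operand a plain number?" patterns.
# OLD does not allow a leading sign, NEW does.
NUMBER_OLD = re.compile(r"^(?:[0-9][0-9a-f]*h|[0-9]+)$", re.IGNORECASE)
NUMBER_NEW = re.compile(r"^[-+]?\s*(?:[0-9][0-9a-f]*h|[0-9]+)$", re.IGNORECASE)


def convert(operand: str, number_pattern: re.Pattern) -> str:
    """Wrap the operand in brackets (NASM memory access) unless it is numeric."""
    if number_pattern.match(operand):
        return operand
    return "[" + operand + "]"


print(convert("-1", NUMBER_OLD))       # [-1]  (wrong: treated as memory access)
print(convert("-1", NUMBER_NEW))       # -1
print(convert("0FFh", NUMBER_NEW))     # 0FFh
print(convert("counter", NUMBER_NEW))  # [counter]
```

As in the real script, a sketch like this cannot tell an equate from a memory label by pattern matching alone; that needs knowledge of the symbol definitions.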

Another fix involves a numeric expression involving an OFFSET keyword, but with the keyword not appearing at the very beginning of the operand. This was fixed by detecting an offset keyword with the regexp /\boffset\b/i where before there was an ^ anchor at the beginning.
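The effect of dropping the anchor can be shown with Python's re module, which treats \b and the case-insensitive flag the same way for this pattern. The anchored pattern below is an assumption about what the earlier check looked like, and the sample operands are made up.

```python
import re

ANCHORED = re.compile(r"^offset\b", re.IGNORECASE)  # assumed earlier form
ANYWHERE = re.compile(r"\boffset\b", re.IGNORECASE)  # form quoted in the text


def has_offset(operand: str, pattern: re.Pattern) -> bool:
    """Return True if the pattern finds an OFFSET keyword in the operand."""
    return pattern.search(operand) is not None


# Keyword at the very beginning: both patterns detect it.
print(has_offset("OFFSET buffer", ANCHORED))      # True
print(has_offset("OFFSET buffer", ANYWHERE))      # True
# Keyword later in the expression: only the unanchored pattern detects it.
print(has_offset("2 + OFFSET buffer", ANCHORED))  # False
print(has_offset("2 + OFFSET buffer", ANYWHERE))  # True
# The word boundaries keep identifiers that merely contain "offset" out.
print(has_offset("myoffset", ANYWHERE))           # False
```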

Another thing I learned was to construct a scriptlet which copies the original files and applies the fixmem.pl script to all ported files all over again, then re-applies several patches that did some manual fixes. In one case, a patch had to be applied before running fixmem.pl. With the scriptlet completed, it was very easy to rerun the entire port and then find the differences from the prior revision using a hg d or hg d | diffrr command. To avoid having to re-invent such scriptlets, I noted them down in the commit messages of the changes to the repo.

Finally, I shortened several jumps to largely identicalise the NASM build output to the TASM build.

I did this by running the NASM build (using the host NASM with ./mak.sh, not the DOS NASM) then running bdiff wl.exe wltasm.exe. Often, the bdiff result would not be very useful as is, as it stops when encountering too many different bytes. In this case, say, if it ended on finding 426 different bytes, rerun as bdiff -426 wl.exe wltasm.exe. (It is important to put the number switch before the filename parameters.) Then scroll to the offset at which it previously stopped.

Next, run ldebug /f wl.exe. (Instead of re-running the debugger, I simply reloaded the changed file using a subsequent L command.) If the change was on, for example, offset 1138, disassemble the nearby bytes using a command like u 1138 + F0. The F0h displacement is 100h for the offset in the PSP segment minus 10h to start disassembly a few instructions earlier. (Sometimes this requires retrying with a few different offsets to get the disassembly to synchronise properly.)
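The address arithmetic for the U command is small enough to write out. This Python sketch just restates the calculation from the paragraph above; the constant and function names are made up for the example.

```python
PSP_SIZE = 0x100  # the file image starts at offset 100h in the PSP segment
LEAD_IN = 0x10    # start disassembling a few instructions earlier


def disassembly_start(file_offset: int) -> int:
    """Map a file offset reported by bdiff to the offset to pass to U."""
    return file_offset + PSP_SIZE - LEAD_IN


# Same as the debugger expression 1138 + F0.
print(hex(disassembly_start(0x1138)))  # 0x1228
```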

Next, the hardest part: Identify a particularly unique instruction. It is best if it involves only registers, as such an instruction is the easiest to search for in the sources. Search for it in the *.nas files using grep, specify to include some context lines, and pipe the output to less. Then search for the particular spot in the sources.

All cases of wildly different bytes in the NASM build, after fixing all the bugs (some of which I found while identicalising), were jumps specified with NEAR PTR but actually optimised to short jumps by TASM. However, letting NASM optimise all jumps by simply dropping all the NEAR PTR uses also wasn't correct. Perhaps it depends on whether the jump is backwards or forwards.

After fixing all the jumps, the only remaining differences are in the data segment. It appears that the alignment bytes differ, and that the TASM build creates a slightly longer load image. I assume this is due to it emitting some data that is nobits for the NASM build. There are some warnings related to this in the TASM build.

lDebug

Discussion

C. Masloch, 2023-08-29 14:00:33 +0200 Aug Tue

This left the second debugger resident near the top of the Low Memory Area (using 116 KiB), though with a memory gap behind it (of 82 KiB for the compressed lCDebug, only 46 KiB for the uncompressed lCDebugU).

Actually both choices leave a gap of about 45 KiB or 46 KiB. I'm not sure how I had an 82 KiB gap on one occasion, but it isn't down to the choice of the compressed or uncompressed executable.

The executable image size plus minimum allocation is very similar for both choices, too:

ldebug /f
&; Welcome to lDebug!
-n lcdebugu.com
-l
-h as bytes word [104] * #512 - word [108] * #16 + word [10A] * #16
00028810   162 KiB
-n lcdebug.com
-l
-h as bytes word [104] * #512 - word [108] * #16 + word [10A] * #16
00028790   161 KiB

The above calculation is exePages times 512 minus exeHeaderSize times 16 plus exeMinAlloc times 16. Refer to the structure at https://hg.pushbx.org/ecm/ldebug/file/7a0c4551b99b/source/debug.mac#l755

C. Masloch, 2023-08-29 14:06:58 +0200 Aug Tue

More information on the involved sizes:

type readexeh.sld
h as pages word [104] ; pages
h as paras word [108] ; header
h as paras word [10A] ; minalloc
h as bytes word [104] * #512 - word [108] * #16 ; image
h as bytes word [104] * #512 - word [108] * #16 + word [10A] * #16 ; alloc
ldebug /f
&; Welcome to lDebug!
-n lcdebugu.com
-l
-y readexeh.sld
-h as pages word [104] ; pages
0001E600   121 KiB
-h as paras word [108] ; header
0FD0     3 KiB
-h as paras word [10A] ; minalloc
B1E0    44 KiB
-h as bytes word [104] * #512 - word [108] * #16 ; image
0001D630   117 KiB
-h as bytes word [104] * #512 - word [108] * #16 + word [10A] * #16 ; alloc
00028810   162 KiB
-n lcdebug.com
-l
-y readexeh.sld
-h as pages word [104] ; pages
00014400    81 KiB
-h as paras word [108] ; header
0FD0     3 KiB
-h as paras word [10A] ; minalloc
00015360    84 KiB
-h as bytes word [104] * #512 - word [108] * #16 ; image
00013430    77 KiB
-h as bytes word [104] * #512 - word [108] * #16 + word [10A] * #16 ; alloc
00028790   161 KiB
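The same arithmetic can be cross-checked outside the debugger. The following Python sketch repeats the formula from the sessions above; the raw field values are back-derived from the byte totals in the listing (dividing by 512 or 16), which assumes that the "as pages" and "as paras" displays show byte totals. The function name is hypothetical.

```python
def exe_sizes(pages: int, header_paras: int, minalloc_paras: int) -> tuple:
    """Return (image bytes, image plus minimum allocation bytes).

    Mirrors the formula used in the sessions above:
    exePages * 512 - exeHeaderSize * 16 (+ exeMinAlloc * 16).
    """
    image = pages * 512 - header_paras * 16
    alloc = image + minalloc_paras * 16
    return image, alloc


# Field values back-derived from the listing (assumption, see lead-in).
lcdebugu = exe_sizes(pages=0xF3, header_paras=0xFD, minalloc_paras=0xB1E)
lcdebug = exe_sizes(pages=0xA2, header_paras=0xFD, minalloc_paras=0x1536)

for name, (image, alloc) in (("lcdebugu.com", lcdebugu), ("lcdebug.com", lcdebug)):
    print(name, hex(image), hex(alloc), alloc // 1024, "KiB")
```

The totals for the two executables differ by only 80h bytes, which matches the observation that the choice of compressed or uncompressed executable does not explain an 82 KiB gap.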
blog/pushbx/2023/0828_late_august_work.txt · Last modified: 2023-08-28 21:25:47 +0200 Aug Mon by ecm