====== lDebugX dual code bug, TSR checks, Allegory symbol relocations ======

**2023-02-13**

===== Bug in lDebugX symbolic (dual code) init =====

Yesterday, while testing the fix that makes the ''EXECUTING'' keyword use ''cs:cip'' rather than ''cs:eip'', I intentionally traced into the init section of the (compressed) lCDebugX using ''p F0''. I then modified the outcome of the machine-level tests so that they reported no 386 or higher processor. Trying to enter Protected Mode is not a good idea while the debugger believes that no 32-bit processor is in use, but simple Real/Virtual 86 Mode tasks should work fine.

They didn't work fine.

Right away the debugger emitted a message about a stack overflow, noting that a certain magic word signature did not match. Further, the debugger didn't execute commands as intended and spewed forth a lot of random messages whenever I attempted to enter anything. I had to kill the dosemu2 process because I couldn't return control to the outer lCDebugX. (It might have worked if I had previously installed the timer interrupt handler and switched to serial I/O, then entered a double Control-C on the serial terminal. Even then, the outer debugger cannot terminate the inner debugger without the latter's cooperation.)
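
The stack overflow message comes from a signature check. Below is a toy model of that general technique in Python. It assumes a downward-growing stack with a magic word stored at its lowest address. ''STACK_SIGNATURE'', the stack size and the function names are hypothetical; they are not lDebug's actual values or code.

```python
# Toy model of a stack signature ("canary") check: a known magic word sits
# at the lowest address of a downward-growing stack. If the stack grows
# past its limit, the word is overwritten and the check fails.
# STACK_SIGNATURE and STACK_SIZE are hypothetical values for illustration.
import struct

STACK_SIGNATURE = 0x5A4B  # hypothetical magic word, not lDebug's real one
STACK_SIZE = 64           # bytes, hypothetical


def make_stack() -> bytearray:
    """Create a stack area with the signature word at offset 0 (bottom)."""
    stack = bytearray(STACK_SIZE)
    struct.pack_into("<H", stack, 0, STACK_SIGNATURE)
    return stack


def push_words(stack: bytearray, sp: int, count: int) -> int:
    """Push `count` dummy words, growing down from `sp`; return the new sp.

    There is deliberately no check against the signature area, so an
    overflow tramples the magic word the way a runaway stack would.
    """
    for _ in range(count):
        sp -= 2
        if sp < 0:
            raise IndexError("pushed below the start of the buffer")
        struct.pack_into("<H", stack, sp, 0xFFFF)
    return sp


def check_stack(stack: bytearray) -> bool:
    """Return True if the signature word at the stack bottom is intact."""
    (word,) = struct.unpack_from("<H", stack, 0)
    return word == STACK_SIGNATURE


stack = make_stack()
sp = push_words(stack, STACK_SIZE, 31)  # fills bytes 2..63, word intact
print(check_stack(stack))               # True
push_words(stack, sp, 1)                # one more push overwrites offset 0
print(check_stack(stack))               # False
```

A check like this detects an overflow only after the fact. It cannot tell which code overran the stack.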

I repeated the attempt, and this time traced through the entire initialisation. I didn't notice anything obviously wrong. However, when I ran the ''cmd3'' loop stored at the beginning of the lDEBUG_CODE section, I soon found that a test instruction had part of its address displacement and its immediate byte overwritten with NOPs (opcode 90h). This was a tell-tale sign that something was being patched incorrectly.
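
Spotting this kind of damage amounts to comparing the code against a pristine copy. The following is a minimal Python sketch that assumes the original bytes are available for comparison. The instruction bytes are a made-up ''test byte [bx+1234h], 08h'' encoding (F6 /0 with a 16-bit displacement), not the actual ''cmd3'' code.

```python
# Sketch: report every byte of a code image that differs from a pristine
# copy, marking bytes that were overwritten with NOP (90h). The instruction
# bytes are illustrative, not lDebug's actual cmd3 code.
NOP = 0x90


def find_corruption(pristine: bytes, current: bytes) -> list[tuple[int, int, int]]:
    """Return (offset, original byte, current byte) for each differing byte."""
    if len(pristine) != len(current):
        raise ValueError("images must have the same length")
    return [
        (offset, old, new)
        for offset, (old, new) in enumerate(zip(pristine, current))
        if old != new
    ]


# test byte [bx+1234h], 08h = F6 87 34 12 08
# (opcode, ModR/M, disp16 low, disp16 high, imm8)
pristine = bytes([0xF6, 0x87, 0x34, 0x12, 0x08])
current = bytes([0xF6, 0x87, 0x34, NOP, NOP])  # disp high and imm8 clobbered

for offset, old, new in find_corruption(pristine, current):
    tag = " (NOP)" if new == NOP else ""
    print(f"offset {offset}: {old:02X}h -> {new:02X}h{tag}")
```

This shows where the damage is, but not which code caused it.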

I repeated the run once more, this time using a repeated trace with a ''while'' condition that checked whether the offending instruction still held its original value. Sure enough, the 386-related patch function showed up. It took me a moment to grasp the error. The ''es'' and ''dx'' registers had been set up to point at the first and second code segment respectively, and only ''es'' was relevant for the 386-related patches. There was another kind of patch, the far dual call relocation patches. However, these were only included if ''_DUALCALL'' was enabled and ''_PM'' disabled. (The DPMI-enabled build always uses thunks rather than far-immediate calls, so that it can operate in both modes.)
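
A repeated trace with a ''while'' condition acts like a software watchpoint. It single-steps, re-evaluates the condition after each step, and stops as soon as the condition becomes false. Below is a toy model in Python. Here the "program" is a hypothetical list of named steps that act on a ''bytearray''. The step names are only illustrative.

```python
# Toy model of "repeated trace while a condition holds": run one step at a
# time and stop at the first step after which the watched bytes no longer
# hold their original value. The steps and their names are hypothetical;
# in the debugger each step would be a traced instruction.
from typing import Callable, Optional

Step = tuple[str, Callable[[bytearray], None]]


def trace_while(steps: list[Step], memory: bytearray,
                condition: Callable[[bytearray], bool]) -> Optional[str]:
    """Execute steps until `condition` turns false; return that step's
    name, or None if the condition held after every step."""
    for name, action in steps:
        action(memory)
        if not condition(memory):
            return name
    return None


def init_data(mem: bytearray) -> None:
    mem[5] = 0x01           # harmless write elsewhere


def patch_386(mem: bytearray) -> None:
    mem[3:5] = b"\x90\x90"  # faulty patch: NOPs over disp16 high and imm8


def init_more(mem: bytearray) -> None:
    mem[6] = 0x02           # another harmless write


# Bytes 0..4 hold the instruction being watched.
memory = bytearray([0xF6, 0x87, 0x34, 0x12, 0x08, 0x00, 0x00])
watched = bytes(memory[0:5])

steps: list[Step] = [("init_data", init_data), ("patch_386", patch_386),
                     ("init_more", init_more)]
culprit = trace_while(steps, memory, lambda m: bytes(m[0:5]) == watched)
print(culprit)  # patch_386
```

The trace stops right after the offending step, so ''init_more'' never runs.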

It turned out that I had swapped ''es'' and ''dx'' in the middle of the far call immediate patches so that ''es'' would hold the lDEBUG_CODE2 section's segment address. I then depended on that in the calls to the 386-related patches for the second code segment. The solution was simply to always swap the two registers, even when no call patches are to be applied. This splits that part of the code into three fragments, the outer two of which are conditional on ''!_PM''.

Now why didn't I notice this earlier? Simple: In the lCDebugX symbolic build I was running, there were precisely zero patches to apply to the second code segment when running on a 386 or higher machine. So it didn't matter that the patch function got the wrong target segment, because it didn't write anything. The same was not true of the patches to apply on a non-386 machine.
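
The bug and its fix can be modelled in a few lines of Python. This is a sketch with several assumptions. The segment values, the patch offsets and the ''init'' and ''apply_386_patches'' functions are hypothetical. The far call patches are placeholders, and a "patch" is reduced to writing a NOP byte. The model shows why the wrong ''es'' went unnoticed as long as the second segment had no patches to apply.

```python
# Toy model of the init patching bug. `es` and `dx` start out holding the
# segments of the first and second code section; the 386-related patch
# routine only looks at `es`. Segment values, patch offsets and function
# names are hypothetical, and a "patch" is reduced to writing a NOP byte.
NOP = 0x90
CODE1, CODE2 = 0x1000, 0x2000  # hypothetical segments of the two sections


def apply_386_patches(memory: dict[int, bytearray], es: int,
                      offsets: list[int]) -> None:
    """Write a NOP at each offset in whatever segment `es` points to."""
    for offset in offsets:
        memory[es][offset] = NOP


def init(pm_build: bool, fixed: bool,
         patches_code2: list[int]) -> dict[int, bytearray]:
    memory = {CODE1: bytearray(8), CODE2: bytearray(8)}
    es, dx = CODE1, CODE2
    if not pm_build:
        pass  # far call patches, first part (only built with !_PM)
    # Buggy version: the swap sat inside the !_PM far call code.
    # Fixed version: the swap always happens.
    if fixed or not pm_build:
        es, dx = dx, es  # es now addresses the second code section
    if not pm_build:
        pass  # far call patches, second part (only built with !_PM)
    apply_386_patches(memory, es, patches_code2)  # second code section
    return memory


# _PM build on a 386+: no patches for the second section, bug stays hidden.
print(init(True, False, [])[CODE1].hex())      # 0000000000000000
# _PM build on a non-386: the patches land in the first section instead.
print(init(True, False, [3, 4])[CODE1].hex())  # 0000009090000000
# With the unconditional swap they land in the second section.
print(init(True, True, [3, 4])[CODE2].hex())   # 0000009090000000
```

In the buggy variant the swap exists only when ''pm_build'' is false. This mirrors its original placement inside the far call patching code. The fixed variant swaps unconditionally.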

Luckily, I tested this build on what it detected as a non-386 machine, and the failure was very obvious. Otherwise this bug could eventually have bitten us much further down the line.


===== TSR init checks =====

**2023-02-14**

As the Pofo tests showed, our TSRs depend on certain features, such as the interrupt vectors 2Dh and 2Fh being valid and callable. There is also an 8-byte test intended to detect an MS-DOS version 2 or higher compatible system. It calls interrupt 21h service 4Dh with CY and expects it to return NC. (This test used to fail on DOSBox and DOSBox-X.)

It would be good to expand the existing check and add two more checks for the interrupt vectors, with somewhat verbose error messages that say what the problem is. The interrupt vector checks could suggest installing the inst2d2f program, which fixes up the handlers so that the TSRs can proceed.
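
The proposed checks might look roughly like this. The following Python sketch runs against a simulated machine state and makes two assumptions. First, a vector counts as invalid when it is 0000:0000. Second, interrupt 21h service 4Dh is modelled as a callable that maps the carry flag on entry to the carry flag on exit. The function names and the message wording are hypothetical.

```python
# Sketch of the proposed TSR init checks against a simulated machine state.
# Assumptions (not taken from the actual TSR sources): a vector is treated
# as invalid if it is 0000:0000, and interrupt 21h service 4Dh is modelled
# as a callable taking the carry flag on entry and returning it on exit.
from typing import Callable

Vector = tuple[int, int]  # (segment, offset)


def tsr_init_checks(ivt: dict[int, Vector],
                    int21_4d: Callable[[bool], bool]) -> list[str]:
    """Return verbose error messages; an empty list means all checks pass."""
    errors = []
    # Call 21h.4D with CY; a DOS 2+ compatible system is expected to return NC.
    if int21_4d(True):
        errors.append("Error: Interrupt 21h service 4Dh did not clear the "
                      "Carry Flag. An MS-DOS version 2 or higher compatible "
                      "system is required.")
    for number in (0x2D, 0x2F):
        if ivt.get(number, (0, 0)) == (0, 0):
            errors.append(f"Error: Interrupt {number:02X}h vector is not valid. "
                          "Install inst2d2f to provide handlers, then retry.")
    return errors


# Hypothetical machine: DOS 2+ behaviour, but no interrupt 2Dh handler.
ivt = {0x2D: (0x0000, 0x0000), 0x2F: (0x0070, 0x0100)}
for message in tsr_init_checks(ivt, lambda carry_in: False):
    print(message)
```

Collecting all the messages before giving up means a user sees every missing feature at once, rather than one per attempt.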

Another thing that may need fixing is the PSP relocation, which doesn't seem to work on the Pofo's DIP-DOS. We have two options. The first is to adapt it to work there, which would require a nontrivial amount of research on the device itself; I'm not aware of any in-depth information on DIP-DOS's process handling. The second is to provide a way to avoid the PSP relocation. (We already do that, but only as a build-time option.)


===== Relocations to support in Allegory =====

**2023-02-17**

Allegory, the code name for lDebug symbolic, is supposed to support symbol relocation that mirrors code or data being relocated. This is needed to handle nontrivial programs. Examples include:

  * Our TSRs relocating their process to free the original process memory block (multi-section relocation)
  * Our TSRs installing their resident to an allocated memory block (single-section relocation)
  * The TSRs free their transient (section deletion) while the resident stays installed
  * Debugger relocates its init section higher
  * Debugger relocates its code section(s) to final location
  * The bootloaded initial loader may relocate in order to fit yet-to-be-loaded parts of its payload in available memory
  * Compressed debugger/kernel depacker (along with compressed payload) is relocated to the top of available memory to make space for decompression
  * Compression depacker writes decompressed payload (section creation)
  * Debugger or kernel discards init sections one by one
  * Kernel may relocate DOSCODE multiple times, to LMA top, then to LMA bottom, UMA, or HMA (I believe this is called HMATEXT for the FreeDOS kernel)
  * Bootloaded or device driver debugger relocates its data/entry section
  * Debugger creates space for its DATASTACK (nobits) section, then starts initialising and using it
  * Bootloaded debugger may relocate multiple times for everything to end up at desired location (init, relocator paragraph, whole load image with prepended init, placement of data entry tables section and code section(s))
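
One way to cover most of these cases is to store every symbol as an offset into a section and keep only each section's current base address. Relocation, creation and deletion then become operations on sections rather than on individual symbols. Below is a minimal Python sketch. ''SymbolTable'', the section names and the addresses are hypothetical and do not describe Allegory's actual implementation.

```python
# Sketch of section-relative symbols: each symbol is an offset into a named
# section, and each section has a current base (linear) address. Moving a
# section only updates its base; deleting a section drops its symbols.
# Illustrative model only; all names and addresses are hypothetical.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Section:
    base: int                                               # linear address
    symbols: dict[str, int] = field(default_factory=dict)   # name -> offset


class SymbolTable:
    def __init__(self) -> None:
        self.sections: dict[str, Section] = {}

    def create_section(self, name: str, base: int) -> None:
        """E.g. a depacker writing decompressed payload."""
        self.sections[name] = Section(base)

    def add_symbol(self, section: str, name: str, offset: int) -> None:
        self.sections[section].symbols[name] = offset

    def relocate(self, moves: dict[str, int]) -> None:
        """Move one or more sections at once (single or multi-section)."""
        for name, new_base in moves.items():
            self.sections[name].base = new_base

    def delete_section(self, name: str) -> None:
        """E.g. a TSR freeing its transient, or init being discarded."""
        del self.sections[name]

    def address_of(self, symbol: str) -> Optional[int]:
        """Resolve a symbol to its current linear address, if it exists."""
        for section in self.sections.values():
            if symbol in section.symbols:
                return section.base + section.symbols[symbol]
        return None


table = SymbolTable()
table.create_section("RESIDENT", 0x12340)
table.create_section("TRANSIENT", 0x12800)
table.add_symbol("RESIDENT", "handler2F", 0x40)
table.add_symbol("TRANSIENT", "main", 0x10)
table.relocate({"RESIDENT": 0x9F000})      # resident moved to allocated block
table.delete_section("TRANSIENT")          # transient freed
print(hex(table.address_of("handler2F")))  # 0x9f040
print(table.address_of("main"))            # None
```

With this layout, a multi-section move such as the PSP relocation is a single ''relocate'' call with several entries. A nobits section like DATASTACK can be created before anything is stored in it.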



{{tag>ldebug pofo tsr allegory}}


~~DISCUSSION~~
  
blog/pushbx/2023/0221_ldebugx_dual_code_bug_tsr_checks_allegory_symbol_relocations.txt · Last modified: 2023-02-21 15:41:42 +0100 Feb Tue by ecm