unmak explanations and plans

2022-10-30

I didn't get to use the (third) 95LX that much this week, though I did read a few more chapters of a story. Right now my modern mobile device shut down due to a discharged battery (without warning) so I might as well spend some time writing on the 95LX again.

Friday (the day before yesterday) I started to work on some tools to disassemble and re-assemble programs. For now I have been working on Public Domain ZModem (PDZM) version 1.26 which is, as the name suggests, released into the public domain but does not ship with sources. Another case of "free software without sources" that I may want to work on later is the assembler and linker included with the free software release of MS-DOS version 2.

For now the main script, called unmak.pl, will read in both a binary executable MZ .EXE file as well as lDebug trace or disassembly logs. It can parse these logs from an lDebug session's output, as recorded from a terminal application connected to a serial I/O controlled lDebug. The easiest way to get these logs is to trace the execution of the original program using a command like tp FFFFFF. However, it is difficult to obtain trace log code coverage for all the code of a nontrivial program. And another problem is that in the case of PDZM, tracing makes the application too slow to successfully transfer a file.
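
To illustrate what reading the MZ .EXE file involves, here is a minimal Python sketch of decoding the header fields that matter for disassembly. It is not taken from unmak.pl (which is a Perl script); the function name parse_mz_header and the dictionary keys are made up for this example. The field offsets follow the standard MZ header layout.

```python
import struct

# The first 0x1A bytes of an MZ header: signature plus twelve 16-bit words.
MZ_FORMAT = "<2s12H"

def parse_mz_header(image: bytes) -> dict:
    """Decode entrypoint, image start, and relocation table of an MZ file.

    The name and return layout are hypothetical, chosen for this sketch.
    """
    if len(image) < struct.calcsize(MZ_FORMAT) or image[:2] not in (b"MZ", b"ZM"):
        raise ValueError("not an MZ executable")
    (_magic, _last_page_bytes, _pages, reloc_count, header_paragraphs,
     _min_alloc, _max_alloc, _ss, _sp, _checksum,
     ip, cs, reloc_table_offset) = struct.unpack_from(MZ_FORMAT, image, 0)
    relocations = []
    for index in range(reloc_count):
        # Each relocation entry is an offset word followed by a segment word.
        offset, segment = struct.unpack_from(
            "<HH", image, reloc_table_offset + 4 * index)
        relocations.append((segment, offset))
    return {
        "entry": (cs, ip),                           # relative to the load segment
        "entry_linear": ((cs << 4) + ip) & 0xFFFFF,  # offset within the image
        "image_start": header_paragraphs * 16,       # file offset of the image
        "relocations": relocations,                  # (segment, offset) pairs
    }
```

The single entrypoint of the program is the entry pair, and the relocation list is what later allows relocated words in data to be emitted as dw directives.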

The next way to obtain disassembly logs is to load the application up into the debugger then simply instruct the debugger explicitly to disassemble ranges of memory identified as machine code. It helps to run two debuggers in two separate machines so one can be used to examine and test runs of machine code, which are then disassembled with the other which is logged so as to obtain only correct disassembly ranges. The unmak data files previously created from the trace logs can help to find machine code runs and the segments they belong to.

The next step will be to script lDebug (either with Script for lDebug (.sld) files, or by attaching a special-purpose program to the debugger) in order to disassemble code automatically. For performance reasons we probably want to disassemble more than one instruction at a time. However, the output must be examined one instruction at a time. The correct disassembly should be deduced from several criteria:

  1. Find initial entrypoints. (For an MZ .EXE this is only one, taken from the header.)
  2. Disassemble instructions until an unconditional, non-call, non-interrupt branch is encountered. (Those are jmp, retn, retn imm16, retf, retf imm16, iret, and special cases like ud2.)
  3. Collect all immediate branch targets from what was disassembled. (These are from conditional or unconditional jumps and loops, as well as calls, both short, near, and far.)
  4. Repeat the process with the newly gathered entrypoints. However, keep track of what was already disassembled to avoid redundant disassembly as well as infinite loops.
  5. Record all disassembly ranges in the least amount of data.
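
The steps above can be sketched as a worklist loop. This is only a model in Python, not how the lDebug scripting would look: the decode callback is a hypothetical stand-in for examining one disassembled instruction, and it works on flat addresses where the real tool would have to track segment and offset.

```python
from collections import deque

def trace_code(decode, entrypoints):
    """Follow immediate branches from the entrypoints.

    decode(address) is an assumed callback returning None (nothing to
    disassemble there) or a tuple (length, falls_through, target):
      length        size of the instruction in bytes
      falls_through False for jmp, retn, retf, iret, ud2 and the like
      target        immediate branch target, or None if there is none
    Returns a dict mapping each instruction address to its length.
    """
    seen = {}
    pending = deque(entrypoints)
    while pending:
        address = pending.popleft()
        # Stop as soon as we run into something already disassembled;
        # this avoids both redundant work and infinite loops.
        while address not in seen:
            instruction = decode(address)
            if instruction is None:
                break
            length, falls_through, target = instruction
            seen[address] = length
            if target is not None and target not in seen:
                pending.append(target)  # new entrypoint for a later round
            if not falls_through:
                break
            address += length
    return seen
```

A toy program can be supplied as a dict, for example program = {0x00: (3, True, 0x10), 0x03: (2, False, None)}, with trace_code(program.get, [0x00]) as the call.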

There are a few special cases that can come up. As already mentioned there can be invalid instructions like ud2. Moreover, an interrupt or call instruction can behave as a no return function call. In these cases it may be wrong to disassemble past these instructions. It is probably best left to the user to identify these functions.

Additionally, not all entrypoints may be reachable purely from immediate branches. There are indirect branches (where trace logs shine), and function pointers may be passed elsewhere, such as when installing an interrupt handler.

During the repeated disassembly we may want to remember the linear and offset addresses of every disassembled instruction. However, to conserve space and time, the final result should be given in ranges to disassemble, using one U command per continuous range. This result can then be used as input either for another session of the automated disassembly process or for a clean slate debugger session whose output is captured as a disassembly log for unmak.
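
Collapsing the per-instruction records into ranges could look like the following Python sketch. The function names are invented for this example. The command text assumes the classic DEBUG range form "address l length" with hexadecimal numbers, and a single segment for all ranges, which is a simplification.

```python
def merge_ranges(instructions):
    """Merge {address: length} records into sorted (start, end) ranges.

    The end of each range is exclusive. Adjacent or overlapping
    instructions end up in the same range.
    """
    ranges = []
    for start in sorted(instructions):
        end = start + instructions[start]
        if ranges and start <= ranges[-1][1]:
            # Continues (or overlaps) the previous range: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges

def u_commands(segment, ranges):
    """Format one U command per continuous range (assumed syntax)."""
    return [f"u {segment:04X}:{start:04X} l {end - start:X}"
            for start, end in ranges]
```

For instance, instructions at offsets 0 (3 bytes), 3 (2 bytes) and 8 (1 byte) merge into the two ranges (0, 5) and (8, 9), so only two U commands are needed.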

Another area of improvement is to indicate in some way that a string (counted or terminated) occurs somewhere in the data, and have unmak recreate the string data as strings in its output data. (PDZM contains German and English language strings. For portability, the umlauts and eszett can be rendered numerically when they occur in a string. The numbers would reflect the Code Page 437 and 850/858 codepoints for these letters, while allowing the source to be read as valid UTF-8 text.)
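
One way such string output might be produced is sketched here in Python. This feature does not exist in unmak yet, so the function name db_string and the exact output style are assumptions. Printable ASCII bytes are collected into double-quoted runs (NASM does not process escapes inside double quotes), and every other byte, including the code page bytes for umlauts and eszett, is written as a number.

```python
def db_string(data: bytes) -> str:
    """Render bytes as one NASM db directive, mixing strings and numbers."""
    if not data:
        raise ValueError("nothing to emit")
    parts = []
    run = []

    def flush():
        # Close the current quoted run, if any.
        if run:
            parts.append('"' + "".join(run) + '"')
            run.clear()

    for byte in data:
        # Printable ASCII except the double quote goes into a quoted run.
        if 0x20 <= byte <= 0x7E and byte != 0x22:
            run.append(chr(byte))
        else:
            flush()
            parts.append(f"0x{byte:02X}")
    flush()
    return "db " + ", ".join(parts)
```

With the Code Page 437 byte 81h for a lowercase u umlaut, the bytes 47h 72h 81h 6Eh come out as db "Gr", 0x81, "n", which stays plain ASCII in the source file.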

More generally, data should be formattable in arbitrary data structure layouts. The only case of data being formatted specifically so far is relocations found in data spaces, which will emit a dw directive as they only make sense as words. Everything else is formatted into db directives, either 1 or 8 bytes per line.
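
The current formatting rule, as described, can be modelled roughly like this in Python. It is a sketch under assumptions, not unmak's actual output routine: the function name dump_data is made up, relocation positions are given as byte offsets into the data, and the numbers are written in 0x notation.

```python
def dump_data(data: bytes, relocations=()) -> list:
    """Emit dw for relocated words and db (up to 8 bytes a line) otherwise."""
    relocs = set(relocations)
    lines = []
    pos = 0
    while pos < len(data):
        if pos in relocs:
            if pos + 2 > len(data):
                raise ValueError("relocation extends past the end of the data")
            # A relocation only makes sense as a little-endian word.
            word = int.from_bytes(data[pos:pos + 2], "little")
            lines.append(f"dw 0x{word:04X}")
            pos += 2
            continue
        end = min(pos + 8, len(data))
        # Cut the db line short if a relocation starts inside it.
        stop = min((r for r in relocs if pos < r < end), default=end)
        lines.append("db " + ", ".join(f"0x{b:02X}" for b in data[pos:stop]))
        pos = stop
    return lines
```

Supporting arbitrary data structure layouts would presumably mean replacing the fixed "8 bytes or one relocated word" choice with a per-range layout description.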

Some things that unmak already does well are to take note of segments, optionally allowing them to be named, and picking up immediate branch targets within the disassemblies and inventing label names for the branch targets. The segments' sectioning directives and the invented labels are correctly emitted already. (Minus loop* and j(e)cxz branch targets which I did not yet detect as of the time of this writing.)

All of these features depend a lot on the original application being well-behaved. For example, the skipping instructions (documented in the ACEGALS and also used by the likes of doslfn) are not well-behaved for our purposes. Having fun with segments, and all sorts of relocation, much like self-unpacking executables and self-modifying code (SMC) generally, are also not well-behaved.

However, PDZM did use SMC in one instance. This is likely a library function that runs an int instruction (machine code CD xx) and patches the second byte of this instruction to pick at run time an interrupt to call. This has been observed to be used for interrupt 10h as well as interrupt 21h. A small fix to unmak made it so it will not disassemble this instruction, noting in a comment that SMC was detected before dumping the machine code with two db directives. (That's the machine code for int 0 as we pull the data from the executable file's image. In this image the second byte is uninitialised.)

However, other than that PDZM appears to be very well-behaved. Comments in the documentation indicate that it was compiled by Turbo Pascal. I expect that the Microsoft programs are similarly well-behaved, as there is no obvious need for relocating or much SMC.

The unmak.asm source file contains some macros to enable building an output executable from the unmak output data. It supplies macros such as addsectionunmak, padins (pad instruction), and relocation. The padins and relocation macros can check that their position in the section and in the total image correctly match those in the original executable. The padins macro also receives a parameter indicating how long its instruction was in the original file. If the assembler uses a shorter encoding for an instruction then padins will insert trailing NOPs to result in the same space being used. (This has just given me the idea to check that no such NOP is inserted before a relocation *, -2 macro call.)
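
The padding rule of padins can be modelled in a few lines of Python. The real padins is a NASM macro; this sketch only shows the arithmetic, and the name pad_instruction is invented for it.

```python
NOP = 0x90  # single-byte NOP opcode on x86

def pad_instruction(encoded: bytes, original_length: int) -> bytes:
    """Pad a re-assembled instruction to the length it had originally."""
    if len(encoded) > original_length:
        # A longer encoding cannot be fixed by padding.
        raise ValueError("new encoding is longer than the original")
    return encoded + bytes([NOP]) * (original_length - len(encoded))
```

For example, add bx, 1 may have been encoded in the original as 81 C3 01 00 (with a 16-bit immediate), whereas an assembler may pick the shorter 83 C3 01 (with a sign-extended 8-bit immediate). Padding the latter to 4 bytes gives 83 C3 01 90, so everything after it stays at its original offset.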

The bulk of the assembly source file concerns segmentation macros, defines, and equates. The only actual data written by this file is the relocation table entries. The source, along with the data file, creates an MZ .EXE file without the use of an external linker, utilising the NASM -f bin format with multiple sections. (This is technically using a linker that is internal to NASM.)

The assembly source makes use of a NASM trick that blurs the boundary between the preprocessor and the assembler. In the addsectionunmak macro, an equate is created for the linear base address of the segment. This "early equate" is set to equal a "late equate". The early equate is then used by an %assign directive, which magically receives the numeric value of the early equate before the late equate has been encountered. I believe this trick only works because of how NASM is a multi-pass assembler, and as mentioned there is some entanglement between the preprocessor and assembler.

The point of the trick is to allow preprocessor directives like %assign and %if to use the numeric value of the section's linear base before this base is known. This is not strictly needed for the unmak tooling at this point because we could insert "end section" macros before every usesection, and then we could give the next section's linear base define before it is ever used by padins and relocation. This is possible because unmak outputs one section after another, one at a time, in the final image order.

However, to avoid needing end section macros and to allow for future enhancements (or simply manual editing of the data), we want to support writing to sections in a non-linear fashion. As in, write to section A, then section B, then A again. The image order should be section A followed by B. So during section B's writing, the linear base of this section is not yet known. The early equate trick supports this use case.

Something related to this trick is the fact that many calculations in NASM sources can only be done on scalars. This includes shifting and division, such as in the dreaded mov dx, (residentsize + 15) / 16 for the DOS interrupt 21h service 31h. Putting a label like residentsize or, if you prefer, residentend, within a normal section in NASM's -f bin format will make the calculation fail. As I have often mentioned before, you need to get a number made from only scalars or deltas, each delta between two labels in the same section. So to get from the ordinary residentend: label to a residentsize equate, you may use a line like residentsize equ residentend - $$ + 256 (if you are in an org 256 section). The same lesson applies trivially to cases using more than one delta, which is especially useful if you use multiple sections.

Most assembly programs (for NASM or not) use multi-instruction sequences to calculate a number of paragraphs at run time where a properly crafted expression could get the job done at build time. For example:

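mov dx, residentend     ; offset of the end of the resident part
add dx, 15              ; round up to the next paragraph boundary
mov cl, 4
shr dx, cl              ; divide by 16 to get the number of paragraphs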

I admit that my lesson on this part may be arcane, but it is a distinct possibility. The run time solution simply does not satisfy me.
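
To check that both approaches agree, the arithmetic can be modelled in Python. The two function names are made up for this sketch. The run time model mimics the 16-bit dx register of the four-instruction sequence, while the build time model is the expression (residentsize + 15) / 16 as the assembler would evaluate it.

```python
def paragraphs_at_build_time(size: int) -> int:
    """The value of (residentsize + 15) / 16 with integer division."""
    return (size + 15) // 16

def paragraphs_at_run_time(dx: int) -> int:
    """Model of: add dx, 15 / mov cl, 4 / shr dx, cl on a 16-bit register."""
    dx = (dx + 15) & 0xFFFF  # add dx, 15 (wraps around at 16 bits)
    cl = 4                   # mov cl, 4
    dx >>= cl                # shr dx, cl
    return dx
```

As a worked case, a resident end 234h bytes into an org 256 section gives a size of 234h + 100h = 334h, and (334h + 15) / 16 = 34h paragraphs. The two models agree for every size up to FFF0h; beyond that the 16-bit addition wraps around.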

So this is how I spent the last three hours! (Thought it was four but I didn't change the 95LX's clock for the timezone switchover earlier today yet.) I may be an amateur but I am certainly dedicated.

Discussion

C. Masloch, 2022-11-01 10:07:26 +0100 Nov Tue

I was asked why I didn't use an existing intelligent disassembler.

For one thing, some of the choices are not free software and/or cost a lot of money. (NB, these two criteria are not the same concern.) The various variants of IDA fall into this category.

The other two suggestions I've received are Ghidra and radare2. They seem to be sophisticated and support a lot of architectures and executable formats and features.

However, I want to make use of the assembler and debugger that I have helped develop and expand the use cases for them. So it is less of a question of what's the best tool for the tasks and more of how can I learn from this and make use of my pet projects. This project certainly has an educational nature for me.

blog/pushbx/2022/1031_unmak_explanations_and_plans.txt · Last modified: 2022-10-31 16:40:11 +0100 Oct Mon by ecm