2024-07-14
This week I started the ident86 project, which combines several parts to aid in comparing ports from one assembly language dialect to another.
The build using the mak.sh script now creates NASM listing files. These are for use with the convert listing files script.
Switch to use the convlist.pl script over the prior convmasm.pl.
Switch to use convlist.pl as well. Further, use WarpLink to link the debugger. This results in the same file, except that the WarpLink build is shorter: the trailing data that is present only in the Microsoft linker build consists entirely of zeroes.
The TracList repository (for historic reasons called tractest) got several updates to the conversion scripts that read map and listing files to create a trace listing file.
A trace listing file is simply a file in the format expected as input by TracList. This format is based on what NASM outputs as a listing file, specifically for its -f bin (flat binary) output format. The characteristics of this are that every instruction's assembled machine code bytes are listed next to the instruction, along with some sort of offset indicating where the machine code was written to. Some bytes may be subject to later relocation, so their contents in the listing file may not match the values found in the final executable. In a trace listing file, these bytes must be specifically marked to indicate the relocation. NASM happens to mark such relocations using square brackets or round parentheses, so that is what TracList expects.
(To NASM, the flat binary output format is said to be akin to a linker that is built into the assembler. This is why the listing file does not list the final byte values for those relocations that would likewise make it into an object file for other output formats.)
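To make the format more concrete, here is a minimal Python sketch that splits one NASM-style listing line into its offset, its byte chunks, and its source text. It is not code from TracList or the conversion scripts. The function name and the regular expressions are assumptions, and the sketch ignores special cases such as lines that emit no bytes or that reserve space.

```python
import re

# Assumed layout of a NASM-style listing line: line number, hexadecimal
# offset, byte field, then the source text.
LINE_RE = re.compile(r"^\s*(\d+)\s+([0-9A-Fa-f]{8})\s+(\S+)\s+(.*)$")
# Within the byte field, [..] or (..) mark relocated bytes; bare hex is literal.
FIELD_RE = re.compile(r"\[([0-9A-Fa-f]+)\]|\(([0-9A-Fa-f]+)\)|([0-9A-Fa-f]+)")


def parse_listing_line(line):
    """Return (offset, [(hex_digits, relocated), ...], source) or None.

    Hypothetical helper; None is returned for lines without an offset.
    """
    match = LINE_RE.match(line)
    if match is None:
        return None
    offset = int(match.group(2), 16)
    chunks = []
    for square, paren, plain in FIELD_RE.findall(match.group(3)):
        if plain:
            chunks.append((plain, False))
        else:
            chunks.append((square or paren, True))
    return offset, chunks, match.group(4)
```

A consumer can then skip the chunks flagged as relocated when comparing the listing against the bytes of the final executable.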
Originally, a trace listing file was meant for the TracList application, which connects to the output of an lDebug instance (in its serial I/O mode) and follows the control flow of the debugged program so as to always show the current instruction in the input trace listing file. This automated the tedious manual search for the relevant part of the listing file or source texts, especially when branching to different bits of the executable and thus source.
The conversion scripts were added then for two tasks: Converting (single) listing files to the NASM-inspired format expected by TracList, and compiling a map file and several listing files into a single trace listing file.
The first task was implemented by convdebx.pl first, which simply read in the (single) listing file created by JWasm when building the current (FreeDOS) Debug/X sources directly into a binary output file. (This requires one or another patch to JWasm in order to mark the relocations so that the conversion script can convert the markers into those expected in a trace listing file.)
The second task, compiling several listing files plus a map file, was first served by convmasm.pl, which accepted Microsoft MASM listing files and the linker map files created by the build tools included with MSDebug and MS-DOS v4. Subsequently, convedr.pl was forked from convmasm.pl to do the same but for RASM-86 and JWasm listing files plus the extended map file created by WarpLink (with its /mx switch) when building Enhanced DR-DOS's kernel.
This faced a complication: Neither RASM-86 nor the existing JWasm builds provided adequate relocation markers in their listing files. JWasm didn't provide any, although my patch and the one eventually implemented in JWasm v2.18 rectified this. RASM-86 does provide a marker, but the marker only indicates that some relocation lives in a given line. It doesn't tell you which bytes in that line are affected by the relocation, making it approximately useless.
The solution was found in the Enhanced DR-DOS build tools as well. The RASM-86 object files were postprocessed by a utility called fixupp. This closed source tool was re-implemented as open source in the SvarDOS repo of EDR-DOS. (To be quite honest, I do not understand exactly what the tool does to this day.)
Using the provided source and a reference of the OMF object file format, I extended fixupp so it can output relocation markers for the object file that it is processing. After I'd started down this path, I was pointed to another existing OMF dump utility, but at that point I was far enough along to finish it.
For the RASM-86 object files, fixupp is now called with the RELOC keyword and its stdout is redirected to create the corresponding .rel (relocations) file. To support older versions of JWasm, fixupp can be run with its output file passed as NUL (on DOS) or /dev/null (on Linux), but with a RELOC keyword as well. (Whatever magic fixupp was originally intended to do, it doesn't break down when applying it to JWasm-created object files.)
(The fixupp relocations file can also be used for the patched variants of JWasm, particularly to mark relocations in data instead of code.)
It was undesirable to develop three similar scripts in parallel. The motivation to unify them led to the discovery that most of the assemblers, both those already recognised and those yet to be added, put a header in the listing file (either once or repeated throughout) that identifies the assembler. Initially I had filtered these headers out already, simply to prettify the output trace listing file. Later, I started to detect RASM-86 (as opposed to JWasm) in convedr.pl using this header.
The looming unification made me add detection for all assemblers: Old MASM, mid recent MASM, JWasm (not differentiating old lacking relocations / NASM style relocations / new [sor] style relocations), RASM-86, and eventually NASM.
NASM is actually the odd one out, being not as well behaved. It does not identify itself in a header to the listing file. To detect it, we rather check that the first line of a listing file starts with 5 blanks, then the number "1", and 34 more blanks. This does depend on the very first source line not emitting any bytes though. But this was already fulfilled by all relevant files.
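The check described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post, not the code used in convlist.pl (which is a Perl script). The names are hypothetical.

```python
# Prefix expected on the first line of a NASM listing: 5 blanks, the line
# number "1", and 34 more blanks (the empty offset and byte columns).
NASM_FIRST_LINE_PREFIX = " " * 5 + "1" + " " * 34


def looks_like_nasm_listing(first_line):
    """Heuristic: True if the first listing line matches the NASM layout.

    Only works if the first source line does not emit any bytes.
    """
    return first_line.startswith(NASM_FIRST_LINE_PREFIX)
```

If the first source line emitted bytes, the offset column would be filled in and the check would fail, which is the dependency noted above.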
The most advanced of the three scripts, convedr.pl, was chosen as the base for what became convlist.pl, the all-in-one converter script. It was updated to support all three JWasm revisions and NASM as well.
NASM support became necessary in order to create trace listings for the WarpLink build, as ported to NASM previously. The listing files created by NASM are already similar to the expected trace listing format, save for minor differences like the [ssss] relocation (with literal S letters). However, convlist.pl is used in this case for its second task: Compiling all the listing files corresponding to each object file, along with the extended map file, into one trace listing file. Chiefly this involves referencing the per-object start of sections to update the offsets from the individual listing files.
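The offset update can be pictured with the following Python sketch. It assumes the map file has already been parsed into a dictionary keyed by object and section name. That dictionary, the tuple layout, and the function name are all hypothetical, and convlist.pl itself may organise this differently.

```python
def rebase_listing(entries, section_starts, object_name):
    """Turn per-object listing offsets into offsets within the linked image.

    entries: iterable of (section, offset, text) from one listing file.
    section_starts: {(object_name, section): start offset} taken from the
    map file (assumed to be parsed elsewhere).
    """
    rebased = []
    for section, offset, text in entries:
        base = section_starts[(object_name, section)]
        rebased.append((base + offset, text))
    return rebased
```

For example, a line at offset 3 of a section that this object contributes at 0x200 ends up at 0x203 in the combined trace listing.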
Last Tuesday I started out creating a Python program to identicalise files.
The term "identicalise" is my invention. It describes comparing two files so as to ensure there are no (unintended) differences between them that change the meaning of the program. In particular, assemblers may choose a different encoding for the same instruction, which should be detected as meaning the same. Further, some assemblers may optimise better than others. In this case identicalising means identifying when padding (usually a nop byte) is needed to preserve the addresses, and inserting it.
(Robert has asked whether ident86 can be used to fingerprint binaries and tell whether they were built eg by A86 or NASM. This was famously intended by the creator of A86 to identify unregistered users of that assembler. I replied that this is not among my goals, as I usually possess both source texts to the programs I am trying to identicalise, so I know what assemblers could be (or are) used to build them.)
The initial ident86 revisions were quite poor in features. In particular, when a run of differences was detected, the program simply assumed that the instructions to be disassembled must start in the (up to) 16 bytes found in the file before the first different byte. (An x86 instruction cannot be longer than 15 bytes on a 386 or later processor, so a window of 16 bytes is sufficient.) So, it would feed the area around the differences to the lDebug running in the background and have it disassemble in an "I'm feeling lucky" approach.
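The naive approach can be sketched as follows. This is an illustration only, not the code of ident86.py; the function names and the representation of runs as half-open ranges are assumptions.

```python
def difference_runs(first, second):
    """Yield half-open (start, end) ranges in which two byte strings differ.

    Only the common length is compared; a length difference needs
    separate handling.
    """
    length = min(len(first), len(second))
    start = None
    for index in range(length):
        if first[index] != second[index]:
            if start is None:
                start = index
        elif start is not None:
            yield start, index
            start = None
    if start is not None:
        yield start, length


def guess_disassembly_start(run_start, lookback=16):
    """Start disassembling up to `lookback` bytes before the first difference."""
    return max(0, run_start - lookback)
```

The weak point is the second function: nothing guarantees that the guessed start is an actual instruction boundary.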
While it turns out this is not useless, it can be tripped up by wrongly detected instruction boundaries. So what's the solution? Well, you simply provide a trace listing file that corresponds to the build and read this in to detect instruction boundaries.
While this is not (yet) terribly hardened, the naive way this is done for now works well enough to completely eliminate the problem. The only hurdle is disassembling of data, but that cannot work well regardless. The support for actually scanning the trace listing was added in a single changeset and remains unmodified to this day.
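Looking up the instruction boundary for a differing byte amounts to a search in a sorted list of offsets. The following Python sketch assumes the offsets of all listing lines have been collected into a sorted list and already translated to file positions; the function name is hypothetical and ident86 may do this differently.

```python
from bisect import bisect_right


def instruction_start(boundaries, offset):
    """Return the greatest boundary that is <= offset, or None if there is none.

    boundaries: sorted list of offsets at which listing lines begin.
    """
    index = bisect_right(boundaries, offset)
    if index == 0:
        return None
    return boundaries[index - 1]
```

Disassembly can then begin at the returned offset instead of at a guessed position before the difference.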
Apart from consuming trace listing files created by convlist.pl, ident86 started out from the test.py script that runs lDebug in a VM in the background. This lDebug is instructed to connect to a serial port, and its serial I/O is hooked into the script to control and read from the debugger. ident86 uses this lDebug process exclusively to enter and disassemble data. While this is likely slower than using a dedicated disassembly library, it does re-use the debugger in a satisfying way.
The tool supports MZ executables in three ways:
This is just what was useful to me while using ident86 for the first time, which was to identicalise the WarpLink port. None of it is set in stone, this is just what I needed then. Given that ident86 is a Python program it shouldn't be terribly difficult to modify.
I expect that matching the DRDOS trace listing file to the intermediate binary will require a new switch to indicate an offset to the match. (And the DRDOS module must be built without compression, requiring a small special step to produce the correct file.)
As opposed to convlist.pl, ident86.py already supports several switches. Apart from some environment variables that it reads mainly to control the lDebug process (inherited from lDebug's test.py) it also supports a few command line switches proper. They are the following:
The complete processing of a program may take several minutes, and may repeat parts already known to be fine or continue into parts whose addresses don't match thus generating a lot of useless output.
Among the first switches were small -m and large -M, for respectively a minimum and maximum offset. It took a while to properly handle the numbers in all cases: excluding data byte differences, contrasting data bytes with EOF, and disassembling the differing ranges.
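The effect of a minimum and maximum offset on the ranges to process can be sketched like this. It is a simplified illustration rather than ident86's actual handling. In particular, treating the maximum as inclusive is an assumption, as are the function and parameter names.

```python
def clip_runs(runs, minimum=0, maximum=None):
    """Clip half-open (start, end) runs to the offsets minimum..maximum.

    The maximum is treated as inclusive here, which is an assumption.
    Runs that fall entirely outside the window are dropped.
    """
    clipped = []
    for start, end in runs:
        start = max(start, minimum)
        if maximum is not None:
            end = min(end, maximum + 1)
        if start < end:
            clipped.append((start, end))
    return clipped
```

Runs that straddle a limit are shortened rather than dropped, so a difference that begins just before the maximum is still reported up to that point.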
Similar to -m, this switch makes it so differences in MZ executable header data bytes are not listed on their own. That means only the brief messages about differences in the allocation size, init CS:IP or SS:SP, or relocation tables are displayed.
These switches are moderately useful for limiting the portions of the binaries that are disassembled. This helps when scanning for initial differences that desynchronise the files so that all later addresses differ. When it works as intended, this avoids the long wait needed for the program to disassemble all these differences.
The -a switch specifies how many bytes must be in a run to activate the automatic truncation. The -A switch adds a small detail; if given, it is expected to have a smaller number than -a. The big letter switch specifies how much to disassemble from the point at which the automatic truncation was activated. This allows specifying larger lengths such as 128 or 1024 without disassembling as much in the last range.
Usually, disassembly output text is listed in two runs, one for the first file and another for the second file. With the -s switch, the disassembly lines are displayed side by side.
There are some convenience features to this display: If the second line exactly matches the first, it is replaced by the display "samesame". If the in-file address of the second line matches, but the length or instruction do not, then the address is replaced by the word "same". Furthermore, if the two displays aren't at the same address but would be again by processing a single line from one disassembly, then this line is shown alone on a line and the lines are synced up again.
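The first two convenience rules can be sketched in Python as below. The column width, the address formatting, and the function name are assumptions, and the resynchronisation rule (showing a single line alone to sync up again) is deliberately left out to keep the sketch short. This is not the code of ident86.py.

```python
def side_by_side(first, second, width=40):
    """Pair up disassembly lines given as (address, text) tuples.

    Simplified: assumes both lists are already aligned line by line.
    """
    rows = []
    for (addr1, text1), (addr2, text2) in zip(first, second):
        left = f"{addr1:04X} {text1}"
        if (addr1, text1) == (addr2, text2):
            right = "samesame"            # second line matches exactly
        elif addr1 == addr2:
            right = f"same {text2}"       # same address, other content
        else:
            right = f"{addr2:04X} {text2}"
        rows.append(f"{left:<{width}}{right}")
    return rows
```

With this display, the rows that need attention are the ones whose right-hand column is neither "samesame" nor prefixed by "same".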
The disassembly text generated by the debugger is massaged in several ways yet:
Not yet: Expanding disp8 offsets in an a16 address to an unsigned 16-bit quantity, and deleting word keywords in the same spot
The first use of ident86 was to verify that the current NASM port of WarpLink is valid. This needed extending convlist.pl to accept NASM listing files as input. It was a success: the port was indeed accurate already.
The second use was to pick up the last weeks' patches from the SvarDOS repo that port the Enhanced DR-DOS DRBIO module to completely build with JWasm. As noted by Bernd, he'd hoped not to have committed any errors in the port:
Porting required some manual steps, opening the door for human errors. Time will tell how many of them I added to the source…
Instead of merely hoping, I identicalised the port at every step. This does mean I dropped a few changes (such as aligning the stack) to be added at a later time, but I was able to verify that the port was accurate.
In this way, ident86 at this point automates the process of identicalising halfway: It does still require some fiddling, trying out different parameters, and parsing the generated output. But it greatly eases the work of spotting and testing differences as opposed to using a more general purpose tool (such as Jason Hood's bdiff).
The next project is likely to identicalise the port of the DRDOS kernel module from RASM-86 to JWasm. This port is already done in the SvarDOS repo, but it is not yet validated. As mentioned, I will likely have to add an offset adjustment to use the trace listing file with this module.
After that, the moon is the limit.
Discussion
Mentioned this blog post on the forum and also linked to it in a relevant issue of the SvarDOS repo.
In one of the EDR-DOS changesets I mentioned some scriptlets of usage examples of ident86: