— |
blog:pushbx:2024:0714_early_mid_july_work [2024-07-14 20:48:06 +0200 Jul Sun] (current) ecm created |
||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Early mid July work ====== | ||
+ | |||
+ | **2024-07-14** | ||
+ | |||
+ | This week I started the ident86 project, which combines several parts to aid in comparing ports from one assembly language dialect to another. | ||
+ | |||
+ | |||
+ | ===== WarpLink ===== | ||
+ | |||
+ | The build using the mak.sh script [[https:// | ||
+ | |||
+ | |||
+ | ===== MS-DOS v4 ===== | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | |||
+ | ===== MSDebug ===== | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | |||
+ | ===== TracList ===== | ||
+ | |||
+ | The TracList repository (for historic reasons called tractest) [[https:// | ||
+ | |||
+ | |||
+ | ==== What is a trace listing file? ==== | ||
+ | |||
+ | A trace listing file is simply a file in the format expected as input by TracList. This format is based on what NASM outputs as a listing file, specifically for its '' | ||
+ | |||
+ | (To NASM, the flat binary output format is said to be akin to a linker that is built into the assembler. This is why the listing file does not list the final byte values for those relocations that would likewise make it into an object file for other output formats.) | ||
+ | |||
+ | |||
+ | ==== What is the use of a trace listing file? ==== | ||
+ | |||
+ | The use of a trace listing file originally was [[https:// | ||
+ | |||
+ | |||
+ | ==== The conversion script ==== | ||
+ | |||
+ | The conversion scripts were added then for two tasks: Converting (single) listing files to the NASM-inspired format expected by TracList, and compiling a map file and several listing files into a single trace listing file. | ||
+ | |||
+ | The first task was initially implemented by convdebx.pl, which simply read in the (single) listing file created by JWasm when building the current (FreeDOS) Debug/X sources directly into a binary output file. (This [[https:// | ||
+ | |||
+ | The second task, compiling several listing files plus a map file, was first served by convmasm.pl, | ||
+ | |||
+ | |||
+ | === Enter the fixupp tool === | ||
+ | |||
+ | This faced a complication: | ||
+ | |||
+ | The solution was found in the Enhanced DR-DOS build tools as well. The RASM-86 object files were postprocessed by a utility called fixupp. This closed source tool [[https:// | ||
+ | |||
+ | Using the provided source and [[http:// | ||
+ | |||
+ | For the RASM-86 object files, fixupp is now called with the RELOC keyword and its stdout is redirected to create the corresponding .rel (relocations) file. To support older versions of JWasm, fixupp can be run with its output file passed as '' | ||
+ | |||
+ | (The fixupp relocations file can also be used for the patched variants of JWasm, particularly to mark relocations in data instead of code.) | ||
+ | |||
+ | |||
+ | ==== The unification ==== | ||
+ | |||
+ | It was undesirable to develop three similar scripts in parallel. The motivation to unify them led to the discovery that most of the assemblers, both those already recognised and those yet to be added, write a header to the listing file (either once or repeated throughout) that identifies the assembler. Initially I had merely filtered these headers out to prettify the output trace listing file. However, I started to use this header in convedr.pl to detect RASM-86 (as opposed to JWasm). | ||
+ | |||
+ | The looming unification made me add detection for all assemblers: Old MASM, mid recent MASM, JWasm (not differentiating old lacking relocations / NASM style relocations / new '' | ||
+ | |||
+ | NASM is actually the odd one out, being not as well behaved. It does not identify itself in a header to the listing file. To detect it, we rather check that the first line of a listing file starts with 5 blanks, then the number " | ||
+ | |||
+ | |||
+ | === Unifying the scripts === | ||
+ | |||
+ | The most advanced of the three scripts, convedr.pl, was chosen as the base for what became convlist.pl, | ||
+ | |||
+ | NASM support became necessary to create trace listings for the WarpLink build, which was previously ported to NASM. The listing files created by NASM are already similar to the expected trace listing format save for minor differences like the '' | ||
+ | |||
+ | |||
+ | ===== ident86 ===== | ||
+ | |||
+ | Last Tuesday I started out creating [[https:// | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | (Robert has asked whether ident86 can be used to fingerprint binaries and tell whether they were built eg by A86 or NASM. This was famously intended by the creator of A86 to identify unregistered users of that assembler. I replied that this is not among my goals, as I usually possess both source texts to the programs I am trying to identicalise, | ||
+ | |||
+ | |||
+ | ==== Conv that list! ==== | ||
+ | |||
+ | The initial ident86 revisions were quite poor in features. In particular, when a run of differences was detected, the program simply assumed that the instructions to be disassembled must start in the (up to) 16 bytes found in the file before the first differing byte. (On current x86 processors an instruction cannot be longer than 15 bytes, so 16 bytes are enough.) So, it would feed the area around the differences to the lDebug instance running in the background and have it disassemble in an " | ||
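
The lookback described above can be sketched as follows. This is a minimal illustration and not the ident86 source: the function names `difference_runs` and `disassembly_window` are hypothetical, and it assumes both files are compared byte by byte at the same offsets.

```python
LOOKBACK = 16  # bytes inspected before the first differing byte, as in the text


def difference_runs(first, second):
    """Yield (start, end) offsets, end exclusive, of each run of differing bytes."""
    length = min(len(first), len(second))
    start = None
    for offset in range(length):
        if first[offset] != second[offset]:
            if start is None:
                start = offset          # a new run begins here
        elif start is not None:
            yield (start, offset)       # the run ended at the previous byte
            start = None
    if start is not None:
        yield (start, length)           # run extends to the end of the data


def disassembly_window(run, lookback=LOOKBACK):
    """Return the (start, end) area to disassemble for one difference run."""
    start, end = run
    return (max(0, start - lookback), end)


# Hypothetical sample data: two 40-byte buffers differing at 5 and at 30..31.
first = bytes(range(40))
second = bytearray(first)
second[5] = 0xFF
second[30] = 0xEE
second[31] = 0xEF
runs = list(difference_runs(first, bytes(second)))
windows = [disassembly_window(run) for run in runs]
print(runs, windows)
```

The window start is only a guess at an instruction boundary, which is exactly the weakness that the trace listing file addresses.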
+ | |||
+ | While it turns out this is not useless, it can be tripped up by wrongly detected instruction boundaries. So what's the solution? Well, you simply provide a trace listing file that corresponds to the build and read this in to detect instruction boundaries. | ||
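
One way to use listing data for this is a sorted lookup. The sketch below assumes that the instruction start offsets have already been extracted from the trace listing file into a sorted list. How ident86 actually reads the listing is not shown here, and `instruction_start` is a hypothetical name.

```python
import bisect


def instruction_start(boundaries, offset):
    """Return the greatest known instruction start that is <= offset, or None."""
    index = bisect.bisect_right(boundaries, offset)
    if index == 0:
        return None                     # offset lies before the first listed instruction
    return boundaries[index - 1]


# Hypothetical instruction starts taken from a listing, in ascending order.
boundaries = [0x00, 0x03, 0x05, 0x0A]
print(instruction_start(boundaries, 0x04))
```

Starting the disassembly at the returned offset avoids decoding from the middle of an instruction.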
+ | |||
+ | While this is not (yet) terribly hardened, the naive way this is done for now works well enough to completely eliminate the problem. The only remaining hurdle is the disassembly of data, which cannot work well regardless. [[https:// | ||
+ | |||
+ | |||
+ | ==== Integration ==== | ||
+ | |||
+ | Apart from consuming trace listing files created by convlist.pl, | ||
+ | |||
+ | |||
+ | ==== MZ executable support ==== | ||
+ | |||
+ | The tool supports MZ executables in three ways: | ||
+ | |||
+ | * Using the header size as an offset of the trace listing in the binary file | ||
+ | * Checking that the minimum allocation size, init CS:IP, init SS:SP, and relocation table match (though the relocations are allowed to use a different encoding or order between the two binaries) | ||
+ | * Marking that differences before the image are in the header, and refusing to disassemble them | ||
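
The header checks listed above can be sketched like this. The field offsets follow the standard MZ header layout. The function names, the returned dictionary, and the sample file builder are hypothetical and not taken from ident86. Relocations are normalised to linear image addresses and collected in a set, so a different segment:offset encoding or a different order still compares equal.

```python
import struct


def parse_mz(data):
    """Extract the header fields named in the list above from an MZ image."""
    (magic, _cblp, _cp, crlc, cparhdr, minalloc, _maxalloc,
     ss, sp, _csum, ip, cs, lfarlc) = struct.unpack_from("<2s12H", data, 0)
    if magic not in (b"MZ", b"ZM"):
        raise ValueError("not an MZ executable")
    relocations = set()
    for index in range(crlc):
        # Each table entry is an offset word followed by a segment word.
        offset, segment = struct.unpack_from("<HH", data, lfarlc + 4 * index)
        # Normalise so that 0001:0003 and 0000:0013 compare equal.
        relocations.add(segment * 16 + offset)
    return {
        "image_offset": cparhdr * 16,   # header size in bytes
        "minalloc": minalloc,
        "cs_ip": (cs, ip),
        "ss_sp": (ss, sp),
        "relocations": relocations,
    }


def headers_match(first, second):
    """Compare the fields that must agree between the two binaries."""
    a, b = parse_mz(first), parse_mz(second)
    keys = ("minalloc", "cs_ip", "ss_sp", "relocations")
    return all(a[key] == b[key] for key in keys)


def build_example(offset, segment):
    """Build a tiny hypothetical MZ file with one relocation (test data only)."""
    header = struct.pack("<2s12H", b"MZ", 0, 1, 1, 2, 0x10, 0xFFFF,
                         0x20, 0x100, 0, 0, 0, 0x1C)
    header += b"\0\0"                               # overlay number, pads to 0x1C
    header += struct.pack("<HH", offset, segment)   # the single relocation entry
    return header + b"\x90" * 32                    # 32-byte header, then the image


first = build_example(0x0013, 0x0000)
second = build_example(0x0003, 0x0001)
print(parse_mz(first)["image_offset"], headers_match(first, second))
```

Any difference found at a file offset below `image_offset` would then be reported as a header difference instead of being disassembled.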
+ | |||
+ | This is just what was useful to me while using ident86 for the first time, which was to identicalise the WarpLink port. None of it is set in stone. Given that ident86 is a Python program it shouldn' | ||
+ | |||
+ | I expect that matching the DRDOS trace listing file to the intermediate binary will require a new switch to indicate an offset to the match. (And the DRDOS module must be built without compression, | ||
+ | |||
+ | |||
+ | ==== The switches ==== | ||
+ | |||
+ | As opposed to convlist.pl, | ||
+ | |||
+ | === -m and -M: Minimum and maximum offsets === | ||
+ | |||
+ | The complete processing of a program may take several minutes. It may also repeat parts already known to be fine, or continue into parts whose addresses don't match, generating a lot of useless output. | ||
+ | |||
+ | Among the first switches were small '' | ||
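
A minimal sketch of how a minimum and a maximum offset might restrict the work, assuming inclusive bounds and difference runs given as (start, end) pairs with an exclusive end. Whether ident86 clips runs like this or skips them entirely is an assumption, and `clip_run` is a hypothetical name.

```python
def clip_run(run, minimum=None, maximum=None):
    """Clip a (start, end) run to the inclusive range minimum..maximum.

    Returns None when nothing of the run remains inside the limits.
    """
    start, end = run
    if minimum is not None:
        start = max(start, minimum)
    if maximum is not None:
        end = min(end, maximum + 1)     # end is exclusive, maximum is inclusive
    if start >= end:
        return None
    return (start, end)


print(clip_run((0x10, 0x20), minimum=0x18))
```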
+ | |||
+ | === -z: Skip header differences === | ||
+ | |||
+ | Similar to '' | ||
+ | |||
+ | === -a and -A: Automatic difference length handling === | ||
+ | |||
+ | These switches are moderately useful to put a limit to the portions of the binaries that are disassembled, | ||
+ | |||
+ | The '' | ||
+ | |||
+ | === -s: Side by side view === | ||
+ | |||
+ | Usually, disassembly output text is listed in two runs, one for the first file and another for the second file. With the '' | ||
+ | |||
+ | There are some convenience features to this display: If the second line exactly matches the first, it is replaced by the display " | ||
+ | |||
+ | |||
+ | ==== Disassembly ==== | ||
+ | |||
+ | The disassembly text generated by the debugger is further massaged in several ways: | ||
+ | |||
+ | * The modrm keyword is filtered out, as different encodings are to be matched as meaning the same | ||
+ | * The machine code bytes are replaced by a single digit indicating their length; we do want to mismatch on differing lengths but match on different encodings of the same mnemonic instruction | ||
+ | * The original segmented address (in the lDebug debuggee segment) is replaced by a single 6-digit address that relates to the binary file offset (relative branch offsets are not fixed however) | ||
+ | * A " | ||
+ | * An immediate operand after a 16-bit operand that uses an imms8, indicated by a plus or minus sign followed by two hexits, is expanded to an unsigned 16-bit value consisting of four hexits | ||
+ | * Encodings of the same mnemonic with different lengths, where the shorter instruction is followed by enough NOPs, are considered matches (no difference) | ||
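
The imms8 expansion from the list above can be sketched with a regular expression. This assumes the immediate is the last operand, directly follows a comma, and is written as a sign followed by the magnitude in two hexits. The exact lDebug text format is an assumption here, and `expand_imms8` is a hypothetical name. Signed displacements inside a memory operand are deliberately left alone.

```python
import re

# A sign and two hexits as the final operand, directly after a comma.
IMMS8 = re.compile(r"(,\s*)([+-])([0-9A-Fa-f]{2})$")


def expand_imms8(line):
    """Rewrite a trailing signed 8-bit immediate as four unsigned hexits."""
    def widen(match):
        value = int(match.group(3), 16)
        if match.group(2) == "-":
            value = -value
        # Masking yields the 16-bit two's complement representation.
        return match.group(1) + format(value & 0xFFFF, "04X")
    return IMMS8.sub(widen, line.rstrip())


print(expand_imms8("add ax, +05"))
print(expand_imms8("sub bx, -01"))
```

After this step an instruction encoded with a sign-extended byte and one encoded with a full word immediate produce the same operand text, so only their lengths differ.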
+ | |||
+ | Not yet: Expanding disp8 offsets in an a16 address to an unsigned 16-bit quantity, and deleting '' | ||
+ | |||
+ | |||
+ | ==== The uses ==== | ||
+ | |||
+ | The first use of ident86 was to verify that the current NASM port of WarpLink is valid. This required extending convlist.pl to accept NASM listing files as input. It was a success: the port was indeed already accurate. | ||
+ | |||
+ | The second use was to pick up the last weeks' patches from [[https:// | ||
+ | |||
+ | < | ||
+ | |||
+ | Instead of merely hoping, I identicalised the port at every step. This does mean I dropped a few changes (such as aligning the stack) to be added at a later time, but I was able to verify that the port was accurate. | ||
+ | |||
+ | In this way, ident86 at this point automates the process of identicalising halfway: It does still require some fiddling, trying out different parameters, and parsing the generated output. But it greatly eases the work of spotting and testing differences as opposed to using a more general purpose tool (such as [[https:// | ||
+ | |||
+ | The next project is likely to identicalise the port of the DRDOS kernel module from RASM-86 to JWasm. This port is already done in the SvarDOS repo, but it is not yet validated. As mentioned, I will likely have to add an offset adjustment to use the trace listing file with this module. | ||
+ | |||
+ | After that, the moon is the limit. | ||
+ | |||
+ | {{tag> | ||
+ | |||
+ | |||
+ | ~~DISCUSSION~~ | ||