freedos-devel mail on EDR-DOS single-file load, JWasm port, and ident86

Hello Eric, hello list,

On 2024-07-26 at 13:04 +0200, Eric Auer via Freedos-devel wrote:

Hi! News from BTTR:

https://www.bttr-software.de/forum/board_entry.php?id=20959&page=0&order=time&category=0

while working on a single-file version of the EDR-DOS kernel,
GPT partition support was discussed and ECM mentioned a new
interesting tool: IDENT86.

This mail is now a few weeks old but I wanted to correct this. While the thread in the forum was originally about single-file load, the single-file load topic goes back to my first releases of it in late December 2023 and is not directly related to the JWasm port or the identicalising tool ident86. (We have the bad habit of occasionally re-using forum threads.)

The single-file load comes in several variants. One branch is based on my lDOS boot stages. My builds include four different kernel variants based on this branch of development. Any single file of these will act as a complete kernel. The variants are:

  • Either using lDOS iniload (*.com named file, can be loaded in many different formats, including as a DOS application) or lDOS drload (*.sys named file, can only be loaded as an EDR-DOS/FreeDOS kernel file) as the outermost wrapper.
  • Either using lDOS inicomp (initial loader compression) stage (edrpack named file), or not (edrdos named file).
  • Also, the internal zerocomp compression of the drbio and drdos stages may be enabled or disabled.

In addition, the inicomp stage may utilise a number of different compression formats. The compression occurs at build time using the mak script from the kernwrap repo (based on lDebug's mak.sh script). The depacker is included in the inicomp stage, and is selected at build time to match the format of the compressed payload. The default builds ship files built with several alternative inicomp methods, in the tmp/ subdirectories of the build. These range from the super-fast zerocomp (based on DR-DOS's original kernel compression), through the still fairly fast LZSA2, to methods with a better compression ratio that produce smaller files, such as Exomizer 3.x, APL, or LZMA-lzip.

The edrpack.* files in the bin/ subdirectory are currently taken from the method that yields the smallest files, which is LZMA-lzip. This can take several minutes to depack on slow machines, which is why I added a progress indicator that can be switched to one of several types using the patchpro tool. The orders of magnitude can also be observed in the results of the INICOMP_SPEED_TEST, which I posted in another discussion for several formats as run on our Debian Linux amd64 server running dosemu2 (no KVM). The results range from 8 ms per run (zerocomp) to 488 ms per run (LZMA-lzip), a factor of 61.

There is another branch of single-file load, developed by Bernd (Böckmann). This one works without the lDOS staged model, resulting in a kernel that is only loadable as an EDR-DOS or FreeDOS kernel file (similar to lDOS's drload), and it also limits the choice of compression. Only the zerocomp compression adapted from DR-DOS's original kernel compression is currently available for this branch. Lacking the overhead of the lDOS stages, the kernel is smaller than the uncompressed lDOS edrdos.* files, but not as small as the lDOS edrpack.* files with better ratio compression. The kernel filename for this "flavor" (as the repo's action artefacts call them) is kernel.sys, matching the FreeDOS kernel's conventional name.

This is able to confirm that the
JWASM port of the kernel is identical to a version made with
another Assembler down to the single machine code instruction
level, only leaving encoding differences without influence on
semantics between the original and the JWASM port.

This is true, but ident86 has learned some additional tricks since.

An overview, as I still haven't written much of any documentation for ident86:

The basic data item is a range of differing bytes. Such a range is typically bookended by at least 16 bytes without a difference, both before and after the range. The length of 16 bytes was chosen because a valid x86 instruction cannot exceed it (on the 386 and later processors the architectural limit is 15 bytes).
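The range finding described above can be illustrated with a small Python sketch. This is not ident86's code; the function name find_ranges and the guard parameter are made up for the example, and it simply assumes both inputs are compared over their common length.

```python
def find_ranges(a: bytes, b: bytes, guard: int = 16):
    """Return (start, end) offsets of differing ranges, end exclusive.

    Differences separated by fewer than `guard` identical bytes are merged,
    so every reported range is bookended by at least `guard` matching bytes
    (or by the start or end of the compared data).
    Hypothetical helper for illustration only.
    """
    ranges = []
    start = None  # offset where the currently open range began
    last = None   # offset of the most recent differing byte
    for i in range(min(len(a), len(b))):
        if a[i] != b[i]:
            if start is None:
                start = i
            last = i
        elif start is not None and i - last >= guard:
            # `guard` identical bytes seen since the last difference
            ranges.append((start, last + 1))
            start = None
    if start is not None:
        ranges.append((start, last + 1))
    return ranges
```

With differences at offsets 10, 20, and 50 of two 64-byte buffers, the first two are merged into one range (only 9 identical bytes lie between them) while the third forms a range of its own.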

When a range is being handled (in the function handlerange) the corresponding data from either file is fed to an lDebug instance running in the background in a VM (qemu or dosemu2). Then this data is disassembled. Instruction boundaries are found by referencing an optional trace listing file, specified as the third file on ident86's command line. The function disassemble does some initial postprocessing to drop irrelevant differences in the disassembly: it drops the MODRM keyword that indicates a certain operand encoding order, expands imms8 signed 8-bit immediates, and changes the "segmented-address hexdump disassembly" format to a "file-seek length disassembly" format.

Then the two disassemblies are compared. This usually proceeds line by line, but sometimes multiple lines from one side may be matched to a single line from the other side. Matching lines are hidden from display. If an entire range is made up of matching lines, that is, if processing ends up having matched all lines from both disassemblies, then the "no difference" line is displayed. If a difference is found, then the remaining disassembly lines (after any possible leading matches) are displayed.

Some magic starts to happen when this occurs and the -s switch (side by side view) is specified. In the side by side view, disassembly lines from file 1 are displayed at the beginning of a line while lines from file 2 are displayed at an offset of at least 40 columns to the right of the beginning of a line. One display line may contain disassembly lines from both files, or from only one of them. ident86 will try to sync up lines so disassemblies with the same starting address are paired.
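A rough Python sketch of such a layout follows. It is only an illustration of the pairing idea, not how ident86 actually does it; the function name side_by_side, the (address, text) tuples, and the four-digit address format are assumptions made for the example.

```python
def side_by_side(file1, file2, column=40):
    """Lay out two address-sorted lists of (address, text) lines.

    Lines with equal addresses share one display line, all others stand
    alone. File 2 text starts at least `column` characters into the line.
    Hypothetical helper for illustration only.
    """
    out = []
    i = j = 0
    while i < len(file1) or j < len(file2):
        left = file1[i] if i < len(file1) else None
        right = file2[j] if j < len(file2) else None
        take_left = left is not None and (right is None or left[0] <= right[0])
        take_right = right is not None and (left is None or right[0] <= left[0])
        line = ""
        if take_left:
            line = f"{left[0]:04X} {left[1]}"
            i += 1
        if take_right:
            # pad the file 1 part, keeping at least one separating blank
            line = line.ljust(column - 1) + " " + f"{right[0]:04X} {right[1]}"
            j += 1
        out.append(line)
    return out
```

A display line then holds a pair when both files have an instruction starting at the same address, and otherwise holds just the line with the lower address.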

If the address of a file 2 line matches the address of the paired file 1 line, then the file 2 line address is replaced by the keyword "same". If the address, the length, and the disassembly all match, the file 2 line is replaced entirely by the text "samesame". If the disassemblies compare as being similar in a fuzzy logic comparison, then the file 2 disassembly is marked with a comment reading "; fuzzysame".

The fuzzy logic is needed because we want to match lines that may have slightly different addresses encoded into them (as immediate operands, branch target operands, or address offsets) but encode the same meaning of an instruction. These lines are usually uninteresting for identicalising work, but may occur en masse if a later difference has a differing length so that subsequent addresses (that may be referenced from before that difference) are all shifted by a small number. The first line that differs such that the line from file 2 is neither "samesame" nor "fuzzysame" is marked as the earliest definitive difference. This comes into play next.
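The classification can be sketched in Python as below. This is a guess at the general idea, not ident86's actual comparison: the sketch treats two disassembly texts as fuzzily equal when they match after every hexadecimal number is blanked out, and it assumes numbers are written with a leading digit (so that mnemonics such as "add" or "dec" are not mistaken for numbers). All names are hypothetical.

```python
import re

# Assumption: numbers in the disassembly start with a decimal digit.
_NUMBER = re.compile(r"\b[0-9][0-9A-Fa-f]*\b")

def fuzzy_equal(text1, text2):
    """Compare two disassembly texts while ignoring numeric operands."""
    return _NUMBER.sub("#", text1) == _NUMBER.sub("#", text2)

def classify(line1, line2):
    """Classify a file 2 line against its paired file 1 line.

    Each line is an (address, length, text) tuple.
    """
    if line1 == line2:
        return "samesame"
    if fuzzy_equal(line1[2], line2[2]):
        return "fuzzysame"
    return "different"

def earliest_definitive_difference(pairs):
    """Index of the first pair that is neither samesame nor fuzzysame."""
    for index, (line1, line2) in enumerate(pairs):
        if classify(line1, line2) == "different":
            return index
    return None
```

A branch whose target merely shifted by a few bytes classifies as fuzzysame, whereas swapped register operands count as a definitive difference.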

ident86 ships with some logic to inspect an earliest difference to figure out what change needs to be made to undo the difference. This recognises differing length but (fuzzy) matching disassembly texts where the shorter instruction is not followed by enough NOP instructions to level the length difference, as well as missing or differing segment override prefixes. ident86 will display a hint as to what change is needed.
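The NOP padding check mentioned above might look roughly like this in Python. It is a sketch of my reading of the paragraph, not the tool's real logic; the function name and its parameters are invented for the example, and only the single-byte NOP (opcode 90h) is considered.

```python
def missing_nops(short_len: int, long_len: int, following: bytes) -> int:
    """Count how many NOP bytes are missing after the shorter instruction.

    `following` holds the bytes that come right after the shorter
    instruction in its file. A result of 0 means the length difference
    is already levelled by NOPs. Hypothetical helper for illustration.
    """
    needed = long_len - short_len
    present = 0
    for byte in following[:needed]:
        if byte != 0x90:  # 90h is the single-byte NOP opcode
            break
        present += 1
    return needed - present
```

A non-zero result would correspond to a hint that the shorter encoding needs that many bytes of padding, or a longer encoding, to match the other file.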

When the -e switch and either or both of the -S and -E switches are used, and a trace listing file is passed as the third filename, then the address of the first hint displayed is cross-referenced with the trace listing to obtain the source text file that needs to be edited. (The -p switch can be specified once or multiple times to give regular expression patterns that convert the "trace listing source" filenames to source text filenames.) With the -S switch, the relevant part of the source is displayed. With the -E switch, ident86 will go ahead and edit the source appropriately. (This assumes that the trace listing file and the source text are in NASM format.)

If the -r and -b switches are both specified along with -e and -E, then after an edit is done, ident86 will loop back to its beginning, re-build the binary to be identicalised using the scriptlet specified with the -b switch, then re-start the comparison of both files. This allows it to automatically apply several edits in a row to aid in identicalising the source text.

The -c switch allows specifying a cookie file, which will be used in subsequent runs to skip all bytes before the prior run's earliest definitive difference. The skipped part must have contained only samesame or fuzzysame lines, so it is assumed that it would still match. It is expected that after an -e -E -b -r -c run, the full resulting binary is checked by another run without -c or -m (minimum offset to examine).

I'm using ident86, along with the fixmem script and associated NASM macros, to port several programs to NASM. You may want to watch my blog to learn more about this.

Creating a
byte for byte identical version (which a binary checksum would
be able to confirm) would have required manually enforcing the
choice of encodings, which does not make code nicer, I think.

Yes, this is true. Especially as some assemblers, including my preferred target NASM (the Netwide Assembler), lack a way to specify which register operand is to be encoded as the ModR/M operand. My debugger now allows disassembling and assembling with a MODRM keyword to depict or enforce a particular order.

So to encode a non-default order of operands, NASM requires using db (Define Data Bytes) directives to directly emit machine code bytes rather than assembling mnemonic instructions. This is obviously not desirable if an exacting byte-by-byte match is not required, hence ident86. ident86 automates several tasks I used to carry out manually when working to identicalise ports of assembly language programs.
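To make the encoding choice concrete, here is a small Python sketch that computes both valid encodings of a 16-bit register-to-register mov. The opcodes 89h (MOV r/m16, r16) and 8Bh (MOV r16, r/m16) and the ModR/M layout are standard x86; the function names are made up for the example.

```python
# 16-bit register numbers as used in the ModR/M byte
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3,
         "sp": 4, "bp": 5, "si": 6, "di": 7}

def modrm(reg: int, rm: int) -> int:
    """Build a ModR/M byte with mod = 11b (register-direct r/m operand)."""
    return 0xC0 | (reg << 3) | rm

def mov_r16_r16(dest: str, src: str):
    """Return both encodings of `mov dest, src` for 16-bit registers."""
    d, s = REG16[dest], REG16[src]
    # Opcode 89h: destination is the r/m operand, source is the reg field
    dest_as_rm = bytes([0x89, modrm(reg=s, rm=d)])
    # Opcode 8Bh: destination is the reg field, source is the r/m operand
    src_as_rm = bytes([0x8B, modrm(reg=d, rm=s)])
    return dest_as_rm, src_as_rm
```

For mov ax, bx this yields the byte sequences 89 D8 and 8B C3. Both execute identically, so whichever one an assembler does not choose by itself has to be spelled out, for example as db 8Bh, 0C3h.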

I guess there are several technology news bits in this thread
which can be interesting for us :-)

Regards,
ecm

Discussion

E. C. Masloch, 2024-08-09 17:50:15 +0200 Aug Fri
lack a way to specify which register operand is to be encoded as the ModR/M operand

The majority of "no difference" ranges are due to such encoding choice differences, particularly for two-register-operand instructions that can encode either the source or the destination register as the ModR/M operand.

blog/pushbx/2024/0809_freedos-devel_mail_on_edr-dos_single-file_load_jwasm_port_and_ident86.txt · Last modified: 2024-08-09 17:51:14 +0200 Aug Fri by ecm