Possible Execute-In-Place scheme #691

nickd4 · 2019-02-14T02:44:25Z

hi,

I did substantial work on an UZI port for a cash register some years (too many years) back. This machine had a Z180 CPU and an interesting feature of the machine was it had 256 kbyte ordinary RAM and 768 kbyte RAM disk, which acted essentially as Flash does to day (memory mapped and fast to read but slow to write).

So the scheme implemented was pretty sophisticated, I made sure that no function had more than 4 kbytes, and I implemented a custom linker that would pack all the code into 4 kbyte windows. All code executed in the F000-FFFF region of the address space, with the currently executing function's window being paged in as part of the calling sequence and the region 1000-EFFF containing the process's data and 0000-0FFF interrupt handlers and system stack.

This behaved pretty much like a PDP-11 with split I&D except code wasn't limited to 64 kbyte, kernel had about 80 kbytes code and some larger utilities such as the shell and the editor had large code too. Code pointers were 3 bytes and data pointers 2 bytes. Smaller utilities did not have to use the scheme and could simply use the tiny memory model with all 2 byte pointers.

The next interesting thing about the scheme was the execute-in-place ability, which I nicknamed "XIP". After downloading new code, e.g. a new kernel, you could run the command "align uzi.bin" as root, and it would arrange the 1 kbyte blocks of the file such that each 4 kbyte window was a contiguous region of up to 4 x 1 kbyte blocks in the filesystem, aligned to a 4 kbyte boundary in physical RAM and thus addressible by CBR register to bring it in at logical F000-FFFF. Hence program code could be executed directly from RAM disk and the 256 kbyte ordinary RAM could be used for kernel data plus 3 or more large processes with up to 52 kbyte data each.

As part of this I modified the filesystem and utilities such as fsck so that instead of the block free list it uses a bitmap. This should be a substantial performance improvement at the cost of a few blocks lost to the bitmap when the filesystem is full.

I made some progress to digging out this old code and putting it into a git repo, there is one commit per development release that I made at the time so it's not a file by file history but there are release notes at least.

If anyone wants to pick this up I am happy to MIT license it or similar and make it available. I also have loads of the hardware it ran on, from the decommissioned network of cash registers, which I am happy to give away for the cost of the postage.

cheers, Nick
nick "at" ndcode "dot" org

EtchedPixels · 2019-02-14T12:31:32Z

That would be really interesting to see and possibly very useful as a concept because we have a bunch of Z80 (and 6502) systems with a single 16K pageable window and lots of memory.

The filesystem one is an odd one. On spinning rust I think you are right - a bit map would be a huge win, and later System5 actually load the lists into a bitmap at mount time. On compact flash and SD which is what everyone uses now even with ancient machines it's less clear.

The hardware would be interesting too and I suspect you can probably shift quite a few boards on retrobrewcomputers or vcfed.

nickd4 · 2019-02-18T22:32:10Z

I am glad to hear that there is positive interest in the idea.

Yes, I had been thinking the execute-in-place might have limited interest whereas the bitmap idea that sprung out of the execute-in-place might be a useful and more self-contained change to merge back into mainline.

One thing that occurred to me since writing the original post is that this used a commercial compiler, the IAR Z180 embedded development system. The nice thing about that is it supports the 3 byte function pointer, which is quite fundamental (I didn't use any of their bank switching routines as I provided compatible replacements).

It might be possible to get this effect by hacking on SDCC, especially if it has a 32-bit int type which I think it does. Or even provide a preprocessor as a source to source translation that replaces function pointers with 32-bit ints. The calling sequence could be munged so that "myfunc(arg1, arg2, ...)" becomes "docall(myfunc, arg1, arg2, ...)". I have a separate project which is intended to do source to source translation like this, I had some limited success with it but it does need a fair bit more work to complete it.

In the meantime a more practical solution might be to copy what 2.11BSD does and have the linker provide stubs for functions in far memory so that 2 byte function pointers can be used. I have another project along these lines which is a cross compiler for 2.11BSD so that the 2.11BSD toolchain can run on a modern system. It works quite well though it is a bit complicated to set up the build directories and staging for a full build. In this context what we'd basically be doing is modifying the Z80 assembler used by SDCC to produce a 2.11BSD object file (routines can be lifted from the PDP-11 assembler which I have translated to C) and then using the 2.11BSD linker on it more or less directly, only the stub generation would have to be changed to make Z80 stubs.

The GCC ports of the 2.11BSD toolchain is done in a bit of a manual way, quite a lot of conditional compilation to get around library and word size differences. Later I came up with an automatic scheme which I call the "Xifier", it is a source to source transformation that puts "x" in front of all symbols and changes ints to shorts, etc. Using the "Xifier" I made GCC ports of the 4.3BSD C library and toolchain, someday I might redo the 2.11BSD toolchain in this way, but for the moment it's useable.

I will definitely put all of this stuff in public repositories, I'm not sure I can commit further time at the moment but possibly in a few months I could.

cheers, Nick

EtchedPixels · 2019-02-19T20:12:35Z

I already have the SDCC compiler and linker taught to generate banked code. It's not smart enough to work out how to lay stuff out itself (and there are advantages in hand layout because same bank calls are faster.

Basically the sdcc in my git tree can be told to generate

push af
call foo
pop af

and to expect an extra word offset on arguments.

The linker rewrites those between banks to call _bank%dto%d .dw foo

Anything with a function address generates a short stub in common space so that the C rule about all functions having a unique address remain valid and the pointer fits 16bits.

So I have some of the pieces today it seems.

Alan

nickd4 · 2019-02-20T00:04:27Z

Cool, that is extremely promising. Well I decided that instead of further theorizing I should check out the tree and take a look. So I gather that you use a standard release SDCC with standard release sdasz80 and the customized sdld from /Kernel/tools/bankld? And that asking SDCC to generate the push af/pop af sequence is basically a workaround to get 2 spare bytes without having to customize SDCC?

One thing that is extremely promising, is that the assembler and linker that I used with the IAR embedded C compiler are actually more or less compatible with yours, since all are derived from Alan R. Baldwin's code. (I did not use IAR's assembler and linker, but rather I temporarily modified Alan R. Baldwin's assembler to accept IAR Z180 syntax, eventually I planned to replace the compiler as well).

So I think it would not be a big project to drop in my window packing code, the main incompatibility is that my way does not generate stubs whereas your way does. By combining mine and your code I think we could do both things. I will look at this later. In the meantime could you explain more about the manual setup, i.e. how does the linker get instructed to put particular functions in particular banks at present? I vaguely recall that in 2.11BSD there's a linker switch that says "new bank" and is placed between .o files.

For the time being, I've made my code publicly available, I haven't done anything about licensing it yet.

See:
https://git.ndcode.org/public/uzi.git (my Z180 port of uzi, it also has a TCP/IP stack that needs work)
https://git.ndcode.org/public/211bsd.git (my cross compiler setup for 2.11BSD and partial ANSI C port)
https://git.ndcode.org/public/43bsd.git (my cross compiler setup for 4.3BSD and partial ANSI C port)
https://git.ndcode.org/public/ccom.git (a project to modernize Dennis Ritchie's PDP-11 C compiler)

If you would like to have a look at the linker window packing code that I propose to merge into sdld, see:
https://git.ndcode.org/public/gitweb.cgi?p=uzi.git;a=tree;f=src/mkutil/link-z80
It looks like the file "lkarea.c" is the important one, changes are marked by "#if 1 /* Nick */" and "#endif".

cheers, Nick

EtchedPixels · 2019-02-20T00:58:04Z

No it's a slightly customised SDCC (in my github tree). The actual changes for the banking are tiny - most of what it contains is the initial Z280 work.

The linker banking is a hack where it has rules for section names and banks. As you can tell SDCC to use different section names it means you can have _CODE1 _CODE2 etc to match banks. Not pretty but it got me going 8)

nickd4 · 2019-02-20T22:25:08Z

OK thanks I have looked at that. I have done a fairly thorough survey of the different banked builds in order to try and determine the current constraints and the way forward, and I will explain my thoughts.

My work is useful for a fairly specific use case, that of (1) small code windows and (2) having a second pageable region for data. The Z180 fits these criteria, I use CBR to bring in the code and BBR to bring in the data. If assumption (1) is violated, there is not really a huge benefit to having the linker do the automatic packing into windows, since it is quite unlikely that clever packing will make the difference between say 2 x 16k kbyte code windows or 3 x 16 kbyte code windows. If assumption (2) is violated, then a much more pressing concern is to get the kernel data out of the way of the userspace data, and in my way of thinking, each kernel bank should contain associated code AND its data.

Another issue concerns kernel vs. userspace banked executables with respect to the the total memory size. I think there is always a case for the kernel to be banked, since it helps keep the kernel out of the way of userspace. I think there is only a case for banked userspace executables on a system with large RAM. For instance, some ports can only support 2 or 3 userspace executables at once and these are limited to 16 kbyte. So a monster userspace executable with 16 kbytes data and 2 x 16 kbytes code windows would tend to take over the entire system, and still would be quite limited by the 16 kbytes data.

If only the kernel is going to be banked, there is not much penalty in applying specific programming techniques such as telling SDCC the area to use manually and/or using complex build scripts. If the applications are going to be banked as well, then there is a much more compelling case for modifying the build system to allow programs to be written in a more compatible and natural way.

Tell me, with the current banked kernel, do the different banks contain both code and data? What I'm envisaging is that the bank which contains all process management stuff should contain also the process table, the bank which contains all file management stuff should contain also the open file table, and so on. Trying to access the process table or open file table from an unrelated function in another bank would cause a crash in such case. (Also, things wouldn't always have unique addresses for sleeping on, although I'm sure we could work around that). If the kernel data segment could be got down to only a few essential globals then perhaps it could go into common, leaving say 4000-BFFF for userspace data and C000-FFFF for banked userspace code? (With userspace using a more conventional memory model).

What do you think? Even on the Z180 case, having a kernel data segment from 1000-EFFF and kernel banked code from F000-FFFF does tend to run out of data room when TCP/IP and much file handling is involved (I know this from experience). Perhaps, moving the boundary down to say C000 or 8000 and putting specific kernel data tables in with the associated code, would be a more sustainable approach.

cheers, Nick

EtchedPixels · 2019-02-23T15:31:04Z

In the cases where the kernel code is banked the data is usually not. The only real exception is the disk buffers. With no TCP/IP and the disk buffers banked you need about 9K for kernel data and stacks. That could probably be trimmed to 8K with some careful tweaking.

TCP/IP adds a chunk but is not too horrible because the protocol end runs in user space using uIP and the kernel and user space use the buffer cache as a disk backed store for all the queues. The TCP/IP layer is almost entirely separated, and splitting it completely is on the TODO list so it can live in its own banks.

As with 2.11BSD the big and difficult things to squash down are the inode cache and process structures. Unlike 2.11BSD it's at least theoretically possible to write active inodes to and from disk when the inode cache is full.

nickd4 · 2019-03-01T23:10:31Z

Thanks for the feedback. I will do some experiments along the lines of the above. I have since looked closely at SDCC and I must say I am hugely impressed with the progress since I evaluated it >15years ago and found it to be not ready at that time.

As to a platform for experiments I started to put together a Z180 emulator using the CPU code from MAME and the device code from your 8085 emulator or from z80pack or both. I will get it so that it can run the basically unmodified 8085 or z80pack distributions as a first step. The reason I want to do it like this is for ease of experimentation with 16K, 32K or 60K window which are needed by the various ports and the use of 2 or 3 windows (I think only the Z180 supports 3 windows though).

Perhaps we could eventually define some executables and magic numbers for the different window sizes and let the Z180 version run whatever kind of executable you throw at it, for testing userspace distributions for the various different ports.

I saw what you did with relocatable userspace executables. I am not as anxious to do it this way and would be happy to have a separate userspace per port which is compiled with appropriate base addresses etc. However, this would need much more infrastructure and to my mind relocatable is handy until we do have that infrastructure.

nickd4 · 2019-03-03T08:05:49Z

I put something here:
https://git.ndcode.org/public/fuzix_sim.git
It should be able to run z80pack images, but with a Z180 not a Z80.
At the moment it boots the fuzix image distributed with z80pack (not sure how up to date this is), but doesn't seem to get past init at the moment.

EtchedPixels · 2019-03-04T13:32:41Z

Catching up a bit

The disk images from Z80pack are pretty ancient. The images on fuzix.org are somewhat newer.

In theory you should be able to run with just the Z180 inbuilt I/O. The serial ports provide console, the MMU provides the memory management, the CSIO drivers can support an SD card for the file system.

nickd4 · 2019-03-05T15:49:38Z

Done a bit of midnight hacking. The simulator now works well. Actually it was already working, I just hadn't booted from the correct drive. Yesterday I connected up the timer interrupts, and today I have enhanced the simulator to properly support the Z180's expanded I/O and address spaces, while retaining the ability to run z80pack binary images with some clever backward compatibility measures.

Yes, eventually I am planning to hook up the Z180 inbuilt I/O and mass storage in a realistic way, but this isn't really necessary since I am happy to use the fake z80pack devices that do not resemble any real hardware. This will save time, since the device drivers are very simple and also already written. At the moment I just want a platform for memory management experiments and testing userspace binaries.

Since the CPU comes from MAME and I have preserved the object-oriented interface, it is easy to add different MAME CPUs. I plan to make command line switches for other CPUs we support, such as 6809, without changing the z80pack devices any more than necessary (just make them memory mapped on devices with no I/O space, allow a low common area on devices where page 0 is special, etc).

I think I'm ready to compile FUZIX now. I will first just try to duplicate what I have, then I will see about optimizing the bank layout of the existing Z80 or Z180 kernels, and/or possibly modifying SDCC to improve bank handling, such as by extending the generic pointer stuff from 8051 to other architectures. I also have in mind various possible ideas for how we could implement banked userspace executables.

nickd4 · 2019-03-09T05:14:19Z

I'm definitely making progress now. I've got a dev system up and running and I have been making exploratory changes. I'm focusing on understanding the platform specific code as much as possible and how it relates to the generic C code such as process creation/fork/exec, signal delivery and so on.

Since this morning, I feel that I basically have the picture in regards to the different platforms' build processes, the bank*.c files, the lib/z80*.s, the callbacks from this into platform specific code, etc.

My investigation has involved test-removal of various platform files to see what breaks and then looking at the callsites that are trying to enter the platform specific code. The PORTING file is also helpful.

I want to start making changes to the bankfixed.c and swap.c code to implement more of a first-fit allocator for z80pack rather than whole banks (similar to what is in unbanked.c), one question that arises in the course of this, is what do you think about requiring processes to declare their stack size upfront?

Obviously, in a system with a paged MMU like Linux, this is an unnecessary restriction. And also in the fixed bank case it is attractive to just let brk increase until it meets the stack pointer as we do now (on the other hand, I am not sure that there is any theoretical justification for the current margin of 512 bytes).

On plenty of other systems (68000 perhaps?) it's probably not really practical to just let processes have a huge stack region and use as much or as little as they want. So would it be okay if we moved towards a standardized system where all executables declare in their header the maximum stack they use?

For backward compatibility we could take the declaration as something big like say 0x1000 if it's not present. And then in the z80pack case, I want to use this to compute the brk limit instead of SP - 512.

nickd4 · 2019-03-09T10:51:53Z

OK so in the process of trying to improve the bank-switching stuff I sidetracked slightly onto improving the swap, and thinking about it carefully I realized there is sort of a logic error in the current swap stuff.

If the PTABSIZE is at least the number of banks plus the number of swaps, it is possible to completely fill all banks and swaps, which makes swapping impossible as you can neither swap in nor out.

I verified this by writing a fork-bomb type program that forks as many times as it can, and then tries to exchange signals with the children. Eventually I got it to happen and it panics with a message "nopage!".

Admittedly, this does not happen in normal z80pack which has a huge number of swaps compared with PTABSIZE, but that won't be the case on smaller systems that are swapping to floppy disk or similar.

A simple fix would be to limit PTABSIZE to one less than the number of banks plus swaps, we could do this statically since I believe at the moment we will always know the size of the swap at compile time.

On the other hand if I make the swap usage more efficient, so that only the used portion of a sleeping process's address space occupies swap, then that will no longer suffice. If we know that processes can't be bigger than say 48 kbytes, then we can reserve the last 48 kbytes of swap, and not let it be used except to avoid the deadlock. But this might get unmanageable with larger processes that have several segments, once my work on the banking is complete. That's how I came to considering the issue.

With this in mind, I'd like to know what we see as the eventual goal for swap so I can work towards it.

Do we agree that fixed size processes in swap is wasteful? If so, then do we want to have a special format for swap with a malloc()-like allocator and possible fragmentation management? Or would we prefer it to behave more like a filesystem, with the ability to map blocks to a process in any order?

Another goal to keep in mind is simplicity, since if we have a large source file just for managing swap, it will compile to a relatively big object file and take up precious kernel room, I don't want that either.

EtchedPixels · 2019-03-09T17:06:46Z

At the moment you need enough swap that you can't run out. That's breaks in several places and it's got a bug number for some of it #686 for nready. nopage! means we tried to swap in something that had no memory allocated to swap it in, which means we tried to run something and couldn't swap anything out so exploded. It's not entirely easy to fix because swapin occurs during switchin so a doexit() would recurse into switchin which would get a bit strange and we need to swap something in to exit it. I guess therefore preallocating enough swap as we go would be necessary to fix it properly.

Making swap size things sensibly on Z80 and 8080 is one of the things I just haven't gotten around too. For 8bit it's mostly a detail, but for 68000 it matters so definitely needs fixing.

Do you want the stack size up front or the total stack and space it might allocate up front ? When then binary format gets fixed (0.4 I hope) then stack definitely needs to go in there because for 4x16K and similar layouts you can do some interesting tricks that Tormod pointed out where you initially point any spare or high banks at the top bank, and then as you allocate memory the map changes.

Initially for example your map might be 1 1 1 1 if your max stack plus code/data/bss fitted into 16K. When brk() grows it then you might end up 1 1 1 2 and in time 1 1 3 2 and so on, but your stack would be copied in part so it would always appear in the right place and your bss would grow like Unix expects.

We probably need both and where '0' means "whatever you can give me".

For swap allocators it is in theory a win if they are linear chunks, at least with spinning rust. With modern disks or CF it doesn't matter so much. Z280 will also be a bit different here as it has a real virtual memory system. The rules get a bit odd with CF and a Z80 because memcpy() and disk read/write are the same speed.

nickd4 · 2019-03-11T02:58:41Z

OK, I didn't pay attention to the bug #686 until I read it more carefully and realized it is relevant to me.

I want the stack size upfront, and I propose to make it static, i.e. it never changes once the executable has been loaded. I realize we could do more with this. For instance, code in the function prologue could check the remaining stack space and make a brk-like call to increase the stack segment. But, this seems unacceptable in terms of run-time overhead, and the program couldn't adjust/recover if it fails either.

In regards to the Tormod stack suggestion, it is certainly a cool idea. I had to read it a few times to get the idea. And I can propose an extension, which is that the bank used for stack does not necessarily have to be one of the program's own banks. So essentially if lots of programs are running, we have a pool of banks with varying amounts of space at the end of them, that can be used for stacks generally.

On the other hand, I'm trying to strictly simplify things to come up with essentially a HAL that does not change too much for the different types of hardware. I realize that significant flexibility is needed given the large number of different banked memory schemes in use, nevertheless it is a worthwhile goal to try to analyze the common points and provide the simplest possible abstraction (but no simpler...).

And in this goal, I see a worthwhile simplification in requiring all application programs to declare a stack size upfront, that never changes. I also don't want to have an exceptional value meaning "give me all". So if we did it this way, stack can be placed just after bss, and then the brk-allocated memory goes after that (I don't propose any upfront declaration of brk-memory). This wouldn't need the Tormod scheme.

With my proposal, the process address space is by default completely linear, which simplifies allocation and swapping, although there is no reason we can't choose to break it into segments for added flexibility.

I see three main categories of memory, with analogous cases for swap:

(1) Huge banks that are either there or not there. For memory, this corresponds to the z80pack port with at most one application running in each bank and wasting the rest of the bank. For swap, this corresponds to the z80pack port with a fixed 56kbyte or similar slot in swap for each possible process.

(2) Memory or swap that requires contiguous allocation. For memory, this corresponds to hardware that has a base register, for instance the 8088 or Z180 ports. So you can allocate varying size segments up to 64 kbytes, and transparently place them anywhere in memory, but they must be contiguous. For swap, this corresponds to spinning rust where the seek delay is significant, so we want contiguous storage.

(3) Memory or swap that allows block-by-block allocation. For memory, this corresponds to hardware that has a page table, for instance the PDP-11 or https://github.com/EtchedPixels/Virtual65 or the 4x16k layout that would allow the Tormod tricks. So you can map each 8 kbyte or similar region of the logical address space to any physical memory independently. For swap, this is similar to a filesystem in the sense that there is a dynamic indirection table between the logical addresses and the physical block numbers. It allows all swap to be used without defragmenting, but has overhead and seek delays.

So here is my proposal for each of these:

(1) No change. Looking at bankfixed.c and the "static unsigned char pfree[MAX_MAPS];" scheme, or analogously swap.c and the "static uint8_t swapmap[MAX_SWAPS];", the beautiful thing about this is that it's so simple and requires hardly any code to implement, a big consideration for some ports. Having said that, an improvement might be possible if you were willing to give up swapping. Then with each executable having a relocation table, we can load any executable anywhere in any bank. But it must stay there for its entire lifetime. So it does not really make sense to swap it out and then swap it back in at the same logical address as before. It WOULD be possible (treating the z80pack memory as sort of like a 7-way set-associative cache so that a swapped process can go at the same location in any of the 7 banks), but I really don't favour this scheme. I think it DOES make sense if we are okay to have no swap.

(2) We need a contiguous allocator that supports alloc/realloc/free, but it also needs to be able to transparently defragment the memory by physical copying (and in the memory case as opposed to swap, the base registers of processes' logical address spaces must adjusted transparently as memory is moved). Without the defragmenting, we'll only be able to use 30-50% of memory, and we are also vulnerable to deadlocks in swapping when we cannot swap in to free enough swap for a swap out.

(3) Each process can have a page table, or in the case of swap, a "swap table". There can be a free list of blocks or similar. However, as we implement processes that can be much larger than currently (but don't have to be), this will either involve wasteful amounts of static page table arrays in the process structure, or will require dynamic allocation of page tables in the kernel, which makes me a little bit grumpy. So what I'm proposing is that we have ONE page table, or in the case of swap, ONE "swap table". For instance if process 0 requires 3 x 16k banks which are 7, 4 and 3 whereas process 1 requires 2 x 16k banks which are 5 and 2, then the page table could contain [-1, -1, 7, 4, 3, -1, -1, 5, 2, -1, -1, -1, -1, -1, -1, -1] which defines logical pages 0..15, and process 0's process table entry tells that it uses 3 logical pages starting at logical page 2, whereas process 1 tells that it uses 2 logical pages starting at 7.

Further comments on the page table or "swap table" for devices that allow block-by-block mapping:

(a) For memory, the page table can be statically allocated in the kernel's data segment depending on how much memory we expect to have. (For the Z180, maybe a 256-entry table corresponding to 4 kbyte logical pages in a 1 Mbyte address space, indeed I implemented exactly this in earlier Z180 work. though this was based on a model where the base register selects a single page not a contiguous region). For swap the "swap table" would similarly be statically allocated in the first "n" sectors of the swap partition.

(b) Some interesting tricks are possible with this table. For instance, if we have the moveable allocator, then the table can be the same size as physical memory, and defragmentation is a very cheap operation since the page table entries are only a few bytes, so quick to move. (In the swap case, defragmentation might require reading and writing a few blocks of the first "n" sectors, but is still relatively cheap). Or, we could simplify by using the simpler first-fit non-moveable allocator (similar to malloc), and just make the page table or "swap table" somewhat larger, a factor of 4 wouldn't be too costly as entries are small.

So, having settled on what is possibly quite a grand vision, but which makes sense to me, I went ahead and implemented a defragmenting allocator. It would link into the current kernel more or less directly.

The kernel must provide routines that can move a process's image up and down in physical memory, and/or routines that can move a process's swap image up and down in swap space. The allocator does not care whether this is the actual memory, or the entries of the indirection table mentioned above, or whether they reside in memory or disk. The kernel must also provide an 8-byte structure in the process table for each of memory and swap, which are privately used by the allocator to track the allocations. The kernel would also refer to these in order to set up the process's base register or access swap data.

I put the initial commit of the allocator here:
https://git.ndcode.org/public/moveable_pool.git

To use it, build it by running "make", and then run "./n.sh". This will generate a test script containing a list of alloc, realloc and free operations to do. It runs them firstly in non-moveable mode, so that some of the alloc and realloc operations fail. Then it runs them in moveable mode, all succeed and you can see the diagnostic messages from the callback that moves stuff around. Dummy memory contents are provided when blocks are allocated, and verified before freeing or making the blocks smaller. It works really well.

There's a bit of duplication in the allocator code, that can be got down by subroutinizing etc later on.

The next thing I will do is build a model that includes two pools (memory and swap), and generate a similar test script containing process wakeups and sleeps, process creation and brk requests, etc. This will verify that we can move processes smoothly between memory and swap, and never deadlock.

EtchedPixels · 2019-03-12T01:00:10Z

For Tormod's trick the stack does want to be part of the same banks as the code/data because in many cases that saves you space (18K of program and 6K of stack is 2x16K not 3x for example). Unix and traditional unix always had data/bss/brk/hole/stack, and some apps break if you change it.

V7 and the like on PC were of course even crazier because of the segments. They set SS = DS and placed the data and bss at the bottom and stack at the top. To avoid wasting 64K chunks they interleaved programs into the gaps in each other !

First bit

Yes although on some platforms the I/O is slow enough it would be worth writing out / reading back only the used chunks. The maps don't change however.
The other case of this is systems with a single flat 32bit space, but right now I've not done swap on them because the memory management and copying back and forth for forked and overlapped processes is horrible. MAPUX did it on the Amiga with UZI but the complexity is eww...
Agreed, and if you have an in memory map it's not that expensive as well as being possible to try and find groups when doing swapping. Paging is different altogether when you have real VM (eg Z280)
Agreed - and there are some machines with lots of banked memory and crap I/O where swap makes no real sense.
For swap yes, and for Z180/Rabbit style definitely
Makes sense I think

I am not sure you want to be able to move things up and down. The classic mainframe systems with base/limit pairs always used a first fit algorithm until there was no room and then used a first hole algorithm until they'd done one pass through memory. In other words once they had something that should fit but didn't it started booting stuff out to make space gradually walking up the memory.

I believe the theory is that most stuff is long lived and stable so as well as avoiding thrashing the sweep tended to push all the stable stuff down one end and the rest remained the transient pool under memory pressure. I am not sure it translates - another reason of course was such machines had to use the CPU to memcpy but the disk interfaces were point and fire so a compaction didn't burn CPU.

The memory compaction is trivial enough (memmove), the disk one is a bit hairier but certainly
doable. I guess a swap table is preferable to compacting disk swap.

nickd4 · 2019-03-13T06:58:17Z

I see what you mean about the mainframe style algorithms. One difference that we have to consider in our case is that we might have quite a small swap partition. I'd think that most mainframes would have been swapping to a hard disk measured in Megabytes, especially multiuser systems.

If we had this luxury then I would consider preallocating all swap. So for instance with 1 Mbyte memory and 5 Mbyte swap, you could run only 5 MBytes worth of processes, and process creation would return an error if space in the swap could not be allocated, even for short lived processes that never actually need to swap.

This would be great because (1) you would never have swap deadlocks since you'd always be swapping out to a different swap location than swapping in, and (2) you could do a lovely suspend operation where all processes get swapped and the system shuts down, then reverses this next boot.

On the other hand, with say a 128 kbyte Apple IIe having a 140 kbyte swap floppy, you'd expect and want to be able to run about 256 kbytes of processes (allowing 12 kbyte for overhead). Similarly if a Linux user mounts an 8 Gbyte swap partition on a machine with 4 Gbytes RAM (as I ve often done) he/she expects 12 Gbytes not 8.

With these use cases I don't think it is feasible to defragment memory via the swap. You could do it in 12 kbyte pieces in the Apple IIe case above (since I would be reserving about that much space from the combined memory plus swap to avoid swap deadlocks), but there'd be no real advantage as compared with just defragmenting the memory directly.

Admittedly the Apple IIe might not be the best example as you'd likely be using the bankfixed.c-like scheme, however I think you can see what I am getting at.

Anyhow, I will proceed to do a prototype for the Z180 case and we can then discuss the merits of extending the ideas to other platforms, I think they do generalize mostly.

EtchedPixels · 2019-03-13T23:08:08Z

Sounds good.

nickd4 · 2019-03-14T14:47:31Z

I did a further simulation where there are two pools, one representing core and the other swap. It runs a test script containing the following commands:

alloc - create a process (like fork), victimizing others as needed
realloc - resize a process which is fully in core (like brk), victimizing others as needed
run - bring a process fully into core, victimizing others as needed
free - delete a process (like exit)

To execute, build and then run ./o.sh, it will run a simulation as follows:

64 blocks of core, with 4 blocks reserved to hopefully reduce fragmentation (optional feature)
192 blocks of swap, with 12 blocks reserved to hopefully reduce fragmentation (optional feature)
a further 16 blocks that reduces the available number of blocks for processes from 240 to 224 (essential, as it avoids swap deadlocks, and impacts the size of memory<->swap transfers).
16 processes of up to 32 kbytes each (randomly fluctuates between 0 and 32 kbytes).

The probabilities of each action are set so that the memory and swap are under heavy pressure. In such case, swapping a large process out and another large process in, is done in 16 block chunks.

In the running state there can also be a partially swapped process (the least recently used process). This works because swap is arranged like a stack, that accepts core blocks in LIFO arrangement.

In the simulation, dummy process data is provided and checked at process exit, by which time it has moved through swap possibly multiple times. This works, so it seems likely my logic is correct.

To use the system the user has to provide the routines in core.c and swap.c, which are as follows:

copy memory to memory, down
copy memory to memory, up
copy swap to swap, down
copy swap to swap, up
copy memory to swap
copy swap to memory

I will do further work on this so that swap-to-swap copies are done via memory, reducing the number of routines you have to write (via the 4 blocks reserved core, which would become mandatory).

I will then make it so that the swap pool is not moveable but rather it uses indirection like a filesystem, this will avoid swap-to-swap copies altogether. (I want to evaluate with and without indirection).

See the repo at: https://git.ndcode.org/public/moveable_pool.git

nickd4 · 2019-03-18T13:52:50Z

I made a lot of improvements to the simulator, so that you can choose various compile time options like MOVEABLE_CORE, MOVEABLE_SWAP, PREALLOCATE_CORE, PREALLOCATE_SWAP, and just now, INDIRECT_SWAP. These correspond in a complicated way to different kinds of hardware.

So, for instance, the Z180 would use MOVEABLE_CORE, because the Z180's MMU works with contiguous regions, so we need to be able to get things contiguous without risking fragmentation which would destroy the system's guarantees. If swapping to CompactFlash it would also use INDIRECT_SWAP. That means that block allocation is random and no compacting is required.

When using INDIRECT_SWAP, the MOVEABLE_SWAP and PREALLOCATE_SWAP options refer to the indirection table, not the backing store. We would like to turn MOVEABLE_SWAP off when using INDIRECT_SWAP, because there is no real point in being able to defragment the indirection table. We can instead just make it a bit larger than needed. The risk with turning MOVEABLE_SWAP off, is that guarantees could be destroyed -- suppose we are trying to kick a process out, and there is no way to get a contiguous region of the indirection table because of fragmentation. Thus, I made it that when MOVEABLE_SWAP is turned off, PREALLOCATE_SWAP must be turned on -- meaning that any problems due to fragmentation (of memory or indirection table) are detected and rejected to the application by an error return from a fork or brk system call, instead of a time-bomb further along.

If PREALLOCATE_SWAP is used without indirection then it means that all processes must have their backing store allocated in the swap in order to be able to execute at all. This could be handy for implementing a suspend feature. It also compiles to less code, since swap allocation is simply a malloc-like operation and all guarantees are preserved without ever having to defragment the swap.

Hmm! The code is pretty complicated, but will compile down quite small. One problem is, that all the conditional compiles make it hard to understand the code. So what I could do is, for each of the common cases, pre-run the C preprocessor to get a much smaller module that can be put into a port.

I think I might try compiling this as a userspace program and see what the cost will be in code space.

nickd4 · 2019-03-28T12:55:40Z

I haven't had time to work on this lately. Finally got a chance to do something this evening.

So I've decided that the moveable swap is not a good idea. The deciding factor was basically the difficulty of copying swap from one place to another, when core is full (which it is when you're trying to swap out, and that is when you need to defragment the swap). Without some spare core you can't do it, especially as that use case was supposed to be for mechanical disks so you want a big batch size.

For the indirect swap I had planned to implement an indirection table on disk, taking the first "n" sectors of the swap partition, and then use contiguous parts of this indirection table per process to locate the blocks of the process. This would have been cute if the indirection was conditionally compiled, but since the indirection will always be there, I decided to do it another way. I will just use the filesystem code.

So basically I see two ways of organizing the swap. You could swap to a mounted filesystem, in which case the swap works a bit like a pipe -- each swapped process has an invisible file that eats some disk space but can't be accessed through the directory structure. Alternatively you could have a dedicated swap partition which is created with mkswap, that will be identical to mkfs but won't create a root inode.

I think the versatility of being able to use a mounted filesystem for swap is helpful for very constrained systems (for example, running from 2 x 1 Megabyte floppies), and also I think that the reuse of the filesystem code for the block allocation and indirection stuff will save significant kernel code space.

A slight complicating factor of doing it this way, is the accounting for the allowable total process size. As I have mentioned previously, I am fine to reject process-creation or brk-setting if it exceeds the total amount of core and swap available, I am not fine to reject a swap attempt because we find out that we do not have enough core or swap at some arbitrary time later (and OOM-killing is out of the question).

So I have to account for the maximum total size of a process, including any indirect blocks, rounded up to an appropriate block size to account for internal fragmentation in core or swap (4 kbytes for the Z180). Then I have to make sure this total doesn't exceed what is available (with dedicated swap) or reserve the swap part of the total away from normal filesystem activities (when swapping to a mounted filesystem).

To test the concept I am working on changing the test script to use the FUZIX filesystem for the swapping. I have taken the filesystem code from the UCP utility, although I since found out it's a bit out of date and buggy. I may be able to integrate the kernel and UCP filesystem code at some point, for the moment it is sufficient for my purpose. I started by just writing a separate inode test script, which works.

To run the inode test script you run ./p.sh, it will create an 8 Mbyte filesystem in "fs.bin" and then randomly create and destroy up to 64 inodes simultaenously, each of which can have up to 2 Mbytes of dummy data stored in it. I set it for a pretty long test (16384 events) where each event is either creating an inode with some dummy data, resizing an existing inode with dummy data, or freeing an inode.

The test script runs perfectly, according to the predicted total sizes of the inodes and the number of blocks stored concurrently, and the dummy data is preserved as it should be. So seems from this that I can correctly store the dummy data in the inodes and account for the sizes including the indirect blocks. The next step will be to integrate this code into the process test script, where it will store the swap data.

Note that to get this to work I extended the f_trunc() function to allow the size to be set arbitrarily, instead of only being allowed to destroy a file with it. It recursively frees all blocks (direct, indirect or double indirect) beyond the given threshold, after rounding the threshold up to the nearest block. I will integrate this code into the kernel when I'm ready to integrate the rest of it, and add truncate()/ftruncate() APIs.

See the repository here:
https://git.ndcode.org/public/moveable_pool.git

EtchedPixels · 2019-03-28T14:18:49Z

I am fine to reject process-creation or brk-setting if it exceeds the total amount of core and swap
available, I am not fine to reject a swap attempt because we find out that we do not have enough
core or swap at some arbitrary time later (and OOM-killing is out of the question).

I would agree - it makes sense for a paging system to overcommit but not really a swapping one

Alternatively you could have a dedicated swap partition which is created with mkswap, that will be
identical to mkfs but won't create a root inode.

Linux does something a little different for fs based swap. It's a lot more complicated now in implementation but the original idea looks like it would fit in. In the original Linux swapping you either swapped to a block device, or a to a single swap file (that acts like a swap device). The only real difference is that your swap table of indirections also does a bmap() on the inode unless the file was created sequentially. That short circuits all the expensive readi/writei paths.

The ftruncate is definitely useful if it can be made small enough.

nickd4 · 2019-03-31T11:23:37Z

Thanks a lot for the feedback. I'm not sure I understand fully about the original Linux swap implementation, from what I gather you're saying that it partitions the swap into a dedicated set of indirect blocks plus a dedicated set of data blocks, and it may or may not use the indirect blocks depending on whether a contiguous region is available? So quite similar to my original design?

Anyhow, I made some progress. What I decided to do is to reverse the order of how processes are swapped in and out, so that to swap out, the data is essentially popped off the start of the process's core image and appended to the process's swap image, whereas to swap in, the data is "dis-appended" from the process's swap image by reading the tail of the file and truncating it, then pushed onto the start of the process's core image. This means the core image has to behave sort of like a deque, in the sense that the application itself by brk-setting can push/pop the tail whereas swapping can push/pop the head.

How I had it before was more like Towers of Hanoi, it would pop off the tail of the core image and then push onto the tail of the swap image, making the swap image end up in reverse order. Then I had some rather tricky code to complement all addressing to re-reverse the swap image and thus enable multiple blocks to be written linearly from core to swap. It was kind of confusing, although efficient in code space. By reversing things this way, I had it so that the realloc() operation was actually creating space at the START of the swap image rather than the end, although it didn't know or care about the physical layout.

So with the new deque approach, it compiles to more code space, but I don't think that will matter in the Z180 (and PC/XT) case that will use it -- because the Z180 (and PC/XT) have much better memory capacity and management ability than the other, more constrained, platforms that we support, and I'm aiming for a >64 kbyte kernel with far calls, the way I had originally implemented for the cash register.

Also, the deque is good because in the ideal case you'll have processes A followed by B in memory, and each chunk swapped out from B will create room to extend A without moving stuff around. The Towers of Hanoi approach couldn't do this. And in similar cases where say there's process A, B then C and you're swapping out C to swap in A, it will move B out of the way if possible, otherwise only move process B. Oops no that doesn't compute. Possibly could implement extra code to make that happen.

So the deque is working and hence the current test script is using swap in a file-like manner. I've started to incorporate the inodes code, so that it will use the filesystem for swap rather than my pool allocator.

About the f_trunc() code size, have a look: Lines 1359 to 1455 inclusive at this link:
https://git.ndcode.org/public/gitweb.cgi?p=moveable_pool.git;a=blob;f=fuzix_fs.c;h=b198505e21e04b811165d53b9bbdf77fbfcf27d7;hb=af94a9859755031ef4ad37283e7ebfe07196c24d
See the new routines freeblk_partial2() and freeblk_partial1() which are called for the case of the double or single indirect block respectively, that straddles the truncation point. This code took an evening of head scratching and several failed attempts to get right, I am fairly happy with how it turned out.

nickd4 · 2019-04-13T22:59:17Z

Making progress -- I'm ready to start integrating it into the kernel.

I improved the code a bit and the separation between generic and test-code. The generic code is in process.c, pool.c, core.c and optionally swap.c (depending on whether we are going to handle space allocation in the swap partition or whether we will leave this up to the filesystem code).

The generic code defines the following abstract routines:

If managing swap ourselves: core_to_swap_copy() and swap_to_core_copy() which take core and swap addresses and a size.
If doing inodes: swap_read() and swap_write(), which take an inode, an offset within the inode file, a core address and a size.
If doing moveable core: core_copy() and core_copy_up() which take source and destination addresses and a size, and implement basically memmove() semantics, except that the generic code tells you which direction to copy and also for the copy-up version it gives end-addresses not start-addresses.

The core and swap addresses and size are implemented as long which gives 32-bit addressing, this is necessary for the Z180 and/or PC/XT which allow up to 1 Mbyte physical. I believe the current kernel readi() and writei() can already specify far addresses, but I will have to look into that as part of the integration process since at the moment I'm using the fuzix_fs.c from UCP utility which is much simpler.

The test-code is in process_test_run.c which provides test-versions of the abstract routines based on memcpy() or equivalent, and/or fuzix_fs.c, and runs a test script as described earlier in this thread.

Because the code was nearly impossible to understand (simple in principle, complex in practice), and also the options for conditional compilation are rather confusing (certain options require other options, certain combinations not allowed), I pre-ran the C preprocessor to generate 6 nice clean versions of each file, and suddenly I can see my logic clearly. Essential since in 6 months time I probably won't understand the original :) But the original source is now just a private note, and won't go into FUZIX.

To see the preprocessed code, you check out the repo and run ./gen.sh, this populates 6 directories:

Indirect core: for platforms like PDP-11 or the emulated 8085 which allow each page of logical address space to be mapped independently, and thus do not require a process's core image to be contiguous:
indirect_core_indirect_swap
indirect_core_inode_swap
indirect_core_preallocate_swap

Moveable core: for platforms like Z180 or PC/XT, which have MMU base registers or segment registers allowing code or data to be accessed from anywhere in physical memory as long as it's contiguous:
moveable_core_indirect_swap
moveable_core_inode_swap
moveable_core_preallocate_swap

If we are going to manage space allocation in the swap partition ourselves, there are two ways of doing it: If swap is not plentiful then we use xxx_core_indirect_swap which keeps a free bitmap and a block indirection table in kernel data space. If swap is plentiful then we use xxx_core_preallocate_swap which restricts the total amount of processes to be run to equal the swap space ignoring the core space. The preallocated way does not require an indirection table as everything is contiguous in swap. And note that both of these options do not have good guarantees, since they omit the moveable pool code and will simply reject process creation or brk-setting if failed due to fragmentation. For the indirect case, swap fragmentation is less of a problem since the indirection memory can simply be made bigger by say 3x.

If the swap is going to be a filesystem with inodes and indirect blocks, we use xxx_core_inode_swap. This is the recommended way, because fragmentation is never an issue, since the indirect blocks do not need to be contiguous. At present it would have to be a dedicated partition, as I have not implemented reservations to allow it to share with ordinary filesystem use and still maintain the system's guarantees.

Another advantage of inode swap over own-managed swap is the saving of kernel data space, since the indirection information is stored in the filesystem and not in precious kernel data space. I was originally going to make the own-managed swap system do this too, which is why I separated out the indirection stuff into core.c and swap.c, so that swap.c could get a more complicated implementation that uses the buffer cache to manage the indirection table. But I haven't done this, as I think the inode way is much superior and the own-managed way shouldn't be supported long term. I only generated these versions because it was easy to do, so we may as well evaluate them. The xxx_core_preallocate_swap does have some advantage of extreme simplicity and small code/data space, if you really have plenty of swap.

Anyway, back to the usage intructions: you run ./gen.sh to generate the 6 directories as above, and then you go into one of these directories and run make and then ./o.sh to generate and run the test script. You also have to have done make in the top level to get the test script generator and mkfs.

If anybody is interested, follow the above steps then inspect moveable_core_inode_swap/process.c to see how I am going to manage the core and swap for the Z180 implementation. It is just 320 lines of code and very clear and easy to understand. The routines in this file are as follows:

estimate_size() -- calculate size of the core image (rounded up to 4 kbytes) and the swap image (rounded up to 0.5 kbyte block and indirect blocks added). Take the maximum -- this is necessary for the system's guarantees so that process creation or brk-setting can be rejected if problems will occur later.

process_init() -- call this once in the beginning, to tell it how many processes to use and how much spare space to reserve for the transfers between core and swap (under heavy pressure we must swap out small pieces to make room to swap in each small piece until a process to run is fully swapped in).

do_swap_out() -- an internal routine which frees enough core to swap in a specified amount of the process we want to run, it has its own routine because it may need to loop through several victims.

process_alloc() -- supports the fork() operation or possibly in future an execve() after a vfork(). Possibly swaps out and/or moves core to find a contiguous core region of the requested size, or fails.

process_realloc() -- supports the brk() operation. Possibly swaps out and/or moves core to extend the current contiguous core region (which must be fully swapped in; the process must be executing) or fails.

process_run() -- called from the scheduler to bring the process fully into core if it isn't already. Repeatedly calls do_swap_out() and then does the swapping in part itself. Note that this can't fail.

process_free() -- supports the exit() or kill() operation. Process can be partially in core or swap.

The other files pool.c and core.c are basically support code and should be mostly self-explanatory.

I'm having a problem with SDCC that it hangs when compiling the test-system, I'm investigating. Well, I don't really need the test-system for kernel integration, but I want it, so as to do things step by step.

nickd4 · 2019-04-19T13:49:56Z

I got a bit disappointed with SDCC, it seems to be over-complex, slow and bug-prone. A significant amount of work has gone into it, and kudos to the devs since it has some very nice features that I was initially quite excited about using (the global pointers and so forth), but on balance I feel that a ground-up redesign and new approach would be needed to get it to something I'd consider robust and scalable.

So I started to look seriously at ACK again, I have looked at it several times over the years and been put off by its design which is also a bit over-complicated in different ways. But after hacking on it for 1-2 weeks and making some experimental changes to the EM machine assembler-link-editor and simulator, I'm starting to come around to the ACK approach. Also, the code it generates seems surprisingly good.

So for my Z180 port I'm quite keen to resurrect the ACK Z80 target. I have got it basically working and I am ready to try to compile FUZIX with it. As a preliminary step I'm trying to compile the v85 and v8080 target, and I do not understand one thing: where does the ack -mfuzix come from? I expected it to be ack -mi80 or similar. I cannot see any reference to a platform called fuzix in David Given's ack?

Any tips would be helpful. I would also like to know whether we use ack style executables in those ports or whether we are converting them to put a standard FUZIX a.outish header on them? Or did we hack on the platform's linker to produce the standard FUZIX executables hence the -mfuzix in the compilation step? I saw the ack2kernel tool, although I haven't looked at it all that closely at this stage.

EtchedPixels · 2019-04-19T14:17:48Z

The fuzix file is in the Build/ directory. Drop it into your ACK enviironment. I know @davidgiven was also looking at the Z80 state with what we learned from 8080.

ack2kernel and friends just turn an ack image into a Fuzix one. It doesn't currently know how to build relocatable binaries but that will be much the same as with SDCC (build it twice, binary diff). I did poke at ack a bit for Z280 but it's obsession with a framepointer is a killer (as with native 8085 rather than 8080) so I didn't get too far because I couldn't figure out how to make it generate all the offsets as SP relative. I also looked at ANSI pcc a bit but Z80/Z180 was just too weird to encode the tables easily.

I am not sure how well it will work in general - the ack linker is dumb as molasses and can't cope with the complex layouts needed for many systems. OTOH I am very interested to see how small an ack for Z80 targetted at size could get. SDCC is good at fast, it's not so good at small.

ack2kernel should be usable as is for Z80, and the same for the user tools providing you build them at a valid load address for your system. The bigger problem will be signal handling. ACK on 8080 at least does not generate re-entrant code. SDCC on Z80 does and our signal paths currently asume that on Z80 (it's also why Z80 really can't run 8080 binaries right now).

Alan

nickd4 · 2019-04-19T23:54:52Z

OK, thanks for the tips and the detailed run-down of current ACK problems. I think those things could be fixable given the will to really get down and dirty in the ACK code and make structural modifications. I have not looked closely at the target-specific backends yet, if they are similar to what I see for the EM-machine it should be OK.

I also looked at pcc a bit, and although Steve Johnson is a genius (and is still active on early unix mailing lists) I am not sure that pcc is his finest work, the table matching algorithm seems much too heuristic to me. Perhaps there were good reasons at the time. But a detailed study of the AT&T documents about the internal design of the Ritchie vs. Johnson compilers convinces me that Ritchie had the better algorithm. I made some efforts to pick up Ritchie's code as basis for a multi platform compiler, it is highly intricate work and I eventually decided replacement is easier.

Yes, I am highly focused on the size of ACK vs SDCC and the self hosting potential. It is tough to choose between these compilers as both have good points and bad points, for me the potential that FUZIX could some day compile itself on Z80 (like 2.11BSD can on PDP-11) is a deciding factor for ACK.

feilipu · 2019-04-20T06:07:07Z

@nickd4 have you looked at sccz80 in the z88dk?

It is under active improvement, is much faster than sdcc, and supports both classic and new libraries.

Most of the older z80 hardware is supported via sccz80 and the classic library.

davidgiven · 2019-04-20T09:54:08Z

I think the ACK's unlikely to self-host on the Z80 --- the smallest architecture I know of that it runs on is the i86, which both has much better code density than the Z80 and also cheats hugely by using split I/D address spaces, effectively doubling the available space. The ncg ACK binary (the core code generator) is 65kB, and the code segment is essentially full.

The ACK's table-based code generation algorithm is interesting, however, because it's a single-pass non-basic-block non-AST architecture, which makes it suitable for streaming from/to disk. It doesn't keep much state in RAM. With simplified bytecode it may be possible to write a new, smaller code generator that would fit on these systems.

(Do you have any links to the papers you described on Ritchie and Johnson compilers?)

nick-lifx · 2019-05-21T13:44:04Z

Yup, that is almost exactly the problem, it was touching _inint (which is in kdata.c thus in kernel space) just after calling map_restore. The code looks like it was moved to the correct spot at some point, but the original code accidentally wasn't deleted afterwards. The following patch has fixed it:

diff --git a/Kernel/lowlevel-6502.s b/Kernel/lowlevel-6502.s
index f5d4c462..7e85fd9f 100644
--- a/Kernel/lowlevel-6502.s
+++ b/Kernel/lowlevel-6502.s
@@ -102,14 +102,9 @@ interrupt_handler:
        sta _inint
        lda _kernel_flag
        bne interrupt_k
-       jsr map_process_always          ; may have switched task
-       jmp int_switch
+       jmp map_process_always          ; may have switched task
 interrupt_k:
-       jsr map_restore
-int_switch:
-       lda #0
-       sta _inint
-       rts
+       jmp map_restore
 
 ;
 ;      The following is taken from the debugger example as referenced in

In the process I also did some nice debugging stuff in the Virtual65 emulator, it now has a 6502 disassembler and the ability to print an execution trace on stderr. I tackled the problem by creating good and bad execution traces and comparing them for the first divergence (with some caveats). When I discovered this occurred during a timer interrupt, I then tried commenting out parts of the interrupt handling code, and bisected in from there to discover what part of the interrupt handling caused it.

I am not sure if it's worth PRing these fixes for the moment, I can do so if there is a pressing reason to. Similarly, the platform-v8080 changes. Anyway, the 6502 and 8080 seem to have already been in a pretty good state, so it only required a tiny bit of tinkering to get things bootable and seemingly robust.

nick-lifx · 2019-05-21T13:50:33Z

About the 65c816 I have flagged the following as possibly incorrect, should arguments be swapped?

diff --git a/Kernel/lib/65c816.s b/Kernel/lib/65c816.s
index 71cce952..9b01d84b 100644
--- a/Kernel/lib/65c816.s
+++ b/Kernel/lib/65c816.s
@@ -279,12 +279,14 @@ fork_patch_3:
        sep     #$30            ; back to 8bit mode for C
        .a8
        .i8
+       ; NICK: IS THIS CORRECT? SHOULDN'T THE ORDER OF ARGUMENTS BE SWAPPED??
        lda     #<_udata
        ldx     #>_udata
        jsr     pushax
        lda     ptr1
        ldx     ptr1+1
        jsr     _makeproc
+       ; NICK: END IS THIS CORRECT?
        ; We are now being the child properly
        lda     #0
        sta     _runticks

That's the only reason I thought the 65c816 might not be in current use, if it is then we should check it.

Also, I was wondering if there is an easy way to remove specific files from the packager for specific platforms? Since a few things were too big to fit or wouldn't compile on 6502. Obviously I can get around it for now, but that's something that I would have to tidy up before I could PR the changes officially. (Or the problem goes away when/if I implement the larger address space, which is envisaged in comments).

EtchedPixels · 2019-05-21T18:49:02Z

Possibly .. I will check that. I have not tested 65C816 since I did the big makeproc change, I've just fixed banked memory Z80 with banked kernel for that same bug!

EtchedPixels · 2019-05-21T20:53:32Z

I've added the v8080, 6502 and 65c816 fixes you mentioned here. Any other 6502 bits would be good to get into the tree as I am currently bringing up a 6502 system with 512K of banked RAM (in 16K banks), 16550A UART, CF adapter and (once I've finished soldering it) a 6522 VIA for timers etc.

nick-lifx · 2019-05-21T22:16:03Z

That's good. As a bit of an old-timer I am pretty comfortable with exchanging text files and patch files and I feel that project owners / maintainers should not look a gift horse in the mouth, if the information about a bug or feature reaches the owner / maintainer by some means then that is better than it not doing so. Since the advent of github style interfaces though, I have increasingly had patches rejected for not being formal PRs -- and then critiqued and delayed multiple times once formally PRed, as if it were my responsibility to get critical fixes polished and into the style because I want them in the mainline, when in reality it makes little difference to me since I can always merge the fixes I see as critical in a private version of the repo! As a highly productive developer who throws off such patches regularly as a side effect of my work, I do not want the hassle of shepherding them through someone else's process, when I could spend that time developing! (Having said that, there is a time and a place for PRs and there are occasions where it is important to me to get stuff into mainline). So anyway, to conclude the rant, I am happy that you take a practical approach to such contributions!

I believe that all critical bugs I found are mentioned above, i.e. those that took time to isolare, there are also quite a few hacky fixes or commentings-out for obvious problems and some changes for my use such as making /dev/tty1 refer to z80pack aux not lpt. These are tricky to PR since they would arguably leave the source in a messier state than I found it, but anyway I'll review my branches for any missed fixes.

EtchedPixels · 2019-05-21T22:52:40Z

On the 6502 side btw my RC2014 tree now has an emulator for the 6502 RC2014 configuration I am currently working on, and the RC2014-ROM directory my rather minimal flash ROM I am using for bring up.

Still have to write the VIA6522 emulation bits I need - such an incredibly complicated bit of silicon.

nick-lifx · 2019-05-22T02:34:36Z

Yes, an interesting chip. I have VIA6522 (if I recall correctly) which were supposedly brand new from a guy in USA, but actually are bad Chinese fakes, well they might actually work, I don't know. Haha. I complained to him that the tops of the chips had been sanded down and repainted, but he acted all innocent and I could not be bothered sending them back at huge expense just to prove a point. Hmm.

nickd4 · 2019-05-24T15:25:18Z

I am making great progress on building my mental model of how the banking and swapping works. It was a bit unfortunate really that I started on z80pack because that's really quite complicated. It is best to start with v65 since all the code is relatively together and comprehensible, also it's more friendly than the fixed-bank scheme. Once you understand the v65 way, the z80pack way is just a small twist on it.

So an important misconception I had concerns the common area, I had been thinking of this in the z80pack way as memory that's shared between all processes and the kernel, but that's not accurate, in FUZIX the terminology means memory that's shared between a process and the kernel. Once you understand that each process has its own "common" area, the udata management makes more sense!

So in v65 the important thing to understand is that when in kernel mode, the kernel's private stuff is at 0x2000 and above, whereas the first page of the process is mapped in below 0x2000. The udata and various stacks are kept there. So when a process goes to sleep in kernel mode and another process wakes up in kernel mode, it's quick to remap the first page and get the correct stack and udata for it.

The difference with z80pack is that when kernel is mapped in, that consumes the one-and-only base register so we do not get the choice of mapping in a particular page containing the right udata and stack, we are stuck with the kernel's one. So the only option is when a process goes to sleep, we must copy its user and stack to somewhere else, and then copy back the user and stack of the waking-up process.

As a matter of mechanics, in the z80pack case we make the following decisions (i) the udata and stack of the running process will be kept in z80pack common which is above 0xf000, you can change the boundary by writing an I/O port but I believe we set it to 0xf000 giving the smallest possible common; (ii) the stashed udata and stack of a sleeping process will be copied to/from the end of the process's bank.

A downside to this scheme is that in both cases (v65 and z80pack) we lose the corresponding amount of logical address space in the process, even though the process has no interest in the udata or kernel stack and never accesses them. The space lost is 512 bytes. And this is kind of significant for the largest processes, one of them being the Bourne shell, and another would be the linker if self-hosting.

What I had imagined before I realized all of this last night (while trying to make room for my test program to run in kernel space), was that the kernel was somehow privately managing the udata and kernel stacks in a way that wasn't visible to the process. And I'd kind of like to see it do that, because it worries me that processes can scribble on data that is important to the kernel (at least, it can in the v65 case).

I realize we do not have formal memory protection, and so in theory we cannot really protect from a bad process, and I also wouldn't advocate putting a lot of effort into things like illegal instruction trapping or stack checks on entry to the kernel etc, but still, if the MMU can be used to keep things separate then it's good to do so, as it would save a lot of head-scratching in the presence of tricky corruption problems.

When considering if the kernel should try to keep the udata and kernel stack invisible to the process and allow the process the largest possible logical address space, the difficulty for v65 is that space at the start of a page is a limited resource and we need some of this for the kernel's zero page and CPU stack, whereas the difficulty for z80pack is that the memory is not very granular, if we take a bank for the kernel's private usage to stash udatas and stacks in, then we have to take an entire bank or not at all.

So basically if we let the process have all of its 64 kbyte address space we have to find somewhere else to put the udatas and stacks. And, in the v65 case we want it so we can quickly map it in with the MMU, which implies for each process we have to take an 8 kbyte page, use the first 512 bytes and waste the rest. That's not very palatable. On the other hand, it would simplify the processing of interrupts that occur in user mode, because the process's zero page and CPU stack could be left alone during the switch to kernel mode and possible sleeping in kernel mode (that occupies the kernel stack) if pre-emption occurs.

This is an interesting issue, and I went digging in the Unix V7 source for quite a while to see what Ritchie had done about it. (And I saw what @EtchedPixels was referring to in regards to processes getting booted until enough contiguous core is available; also references to "likelihood" of core or swap running out, which says to me that Unix V7 did not have good guarantees of the kind that I consider essential).

It turns out that the magic elixir with Unix V7 is that the PDP-11, while it has only 8 kbyte granularity in dealing with logical memory, has a smaller 64 byte granularity in dealing with physical memory (it makes sense really, since in those days physical memory was precious and logical memory abundant). So what they're able to do is to tack the udata and kernel stack onto the start of the process's core (or swap) image, taking only the 512 kbytes or so, and map it into the kernel, but keep it invisible to the process. The only way we can do the same thing in v65, is to waste the remainder of an 8 kbyte page as noted.

Thinking about all of this I came up with a bit of a radical plan that I think could optimize things a bit, though it may seem like a de-optimization in some ways. For me, probably the biggest priority is that processes have a full 64 kbytes to scribble on, and a secondary priority is that the kernel have same. (Or, as a compromise, the kernel might have say 56 kbytes plus an 8 kbyte window for uget-type stuff).

What I want to do is to allocate pages (in v65) or one or more banks (in z80pack) just to hold udatas and stacks, and I guess possibly other memory-intensive stuff like disk or network buffers eventually. For each process we'd have a udata, a stack, and in the v65 case also a zero page and a CPU stack, total would be 1 kbyte, so an 8 kbyte page could support 8 processes. To switch processes we would do this:

Have the 8 kbyte page holding the current process's udata and stacks mapped in already up high.
Copy the zero page and the CPU stack into the correct slot (one of the 8) in the high-mapped page.
Change the mapping to bring in the new process's udata and stacks (may be in a different page).
Copy the zero page and the CPU stack from the correct slot (one of the 8) down to the fixed locations.
Continue to access udata through a pointer (because it's in one of 8 slots), C stack already is like this.

This isn't nearly as bad as it sounds, because the cc65 compiler only uses about 20 bytes or so of the zero page, and only a similar amount of CPU stack would ever be used as we'd only be nested a couple of C function calls deep when we go to sleep. The worst part is probably the udata access through a pointer, to mitigate that I wonder if cc65 could be told that the pointer is permanently in the zero page?

For z80pack it really wouldn't be any worse than now, just a different choice of stashing location, though we'd have to take a hit in having one fewer banks available. We would still have a lot of wasted space in the reserved bank, but this gave me another idea which I think would be helpful in those systems that have a large fixed common (like 48 kbyte banks and 16 kbyte always mapped common). We can let processes grow into common, and stash the overflow part somewhere else on task switch.

With these ideas, I think we would move closer to having a self-hosting system that can compile itself.

@EtchedPixels what do you think? It somewhat depends on what we see as the priorities (fast task switching, fast interrupt handling, more logical memory for processes, separation of private data, etc).

jcw · 2019-05-24T19:47:05Z

It seems to me that common is not even an exchange area between a process and the kernel, but in many cases the only way FUZIX can switch contexts. Like a trampoline to bounce between banks.

As I understand it, with the eZ80 (and prob Z{1,2}80), context switches could be done without common area (the switch into ADL mode can jump to any 24-bit address, while the stack switches to a separate 24-bit one). And since the kernel context can access any address using special "long" instructions, this also means that it can manage process data without having that process actively switched in.

So, to add to your mix of considerations and variants: with the eZ80, it may be possible to support a full 64K per process, no FUZIX kernel data at all. This means accidental scribbling is no longer a risk, for a more robust setup in case of crashes (even more so when combined with eZ80's watchdog).

EtchedPixels · 2019-05-25T19:07:47Z

Fuzix used the term 'common' the way CP/M 3 and MP/M did - the area of memory that is always mapped. It's not actually quite that simple any more because

Z180 flips the entire 64K and keeps copies of the common in the top of each process map
Bank16K keeps copies of common space in the top 16K block but each process has a 16K page split between common and private space
We have systems with no true common data memory at all where we do some rather tricky stuff when switching.

Historically though the udata was in common because it held the interrupt stack and whilst it no longer does that it does still contain a pile of variables that are easier to manage if always mapped (syscall argments etc). You can avoid it with some gymnastics (see z80-thunked).

Z180 doesn't have a single 'change bank and branch' instruction nor a way to do interrupt vectors into far space so it needs a small amount of 'common' space. Likewise you need a bit for disk I/O so you can do direct I/O into other banks.

Z280 has a real MMU, paging and supervisor mode so is a very different beast. Getting Bill Shen's Z280RC running Fuzix is on the longer end of the TODO list.

eZ80 you probably can do it because you've got "far" (in 16bit think anyway) addressing. 65C816 you can do most of it that way but you need some common code to handle interrupt switching, zero page, and I/O really.

@nickd4
On Z80 having udata via a pointer kills you. 6809 probably not so bad, 68000 does this - A5 is a register global holding the udata pointer. It's actually faster and cleaner because everything is xx(A5) and gcc knows how to do it right.

For 6502 I always wanted to put some of the udata stuff into zero page, and some key system variables, but never had time to play with that. I think you are right though that if you put a udata pointer in zero page it will generate code that's as compact and not much slower.

The real question though is about shared data. Small chunks of shared code are easy - you put a copy in each bank.

nick-lifx · 2019-05-27T22:08:58Z

@EtchedPixels you are right that Z80 indexing udata would be weak, I guess if the compiler could be persuaded it could use IX or IY for this purpose (like 68000) but it would still be bigger and slower than the direct access. Also about 6502, you are right that it wouldn't be hugely worse, but OTOH we'd lose the opportunity to put frequently accessed variables in ZP (did not think of that aspect).

I have also been thinking about a tricky problem with the granularity of core vs swap, basically you have to round up all swap allocation to a multiple of core block size or risk running out. That's a bit nasty if you have say a 140 kbyte floppy and 8 processes with 8 kbyte core blocks as you could have up to 64 kbytes you can never use due to them being reserved to maintain guarantees. It's really frustrating, that.

Limiting the number of in-core processes to n would almost fix the problem because the reserved space for rounding would be only n * 8 kbytes. But another problem then is if those n processes happen to be quite small the remaining ones could overflow swap.

Considering all of these issues I was thinking of having a "pre swap" phase that does not go to disk but basically compacts a less recently run process from RAM to RAM, so basically copy its ZP, udata and partial last 8 kbyte page to unaligned non contiguous RAM (I have all the mechanics to manage this already so there would be significant code space sharing too).

I am wary of adding complexity as it wastes kernel code space. But in this case I think the saving in memory and time would compensate. The only question is whether it's worthwhile to jump through such hoops to make kernel data invisible to processes and increase process address space a bit. For my part, I do think that it is worthwhile.

nick-lifx · 2019-05-29T06:58:55Z

So I thought about this for a few days and I came up with a scheme, it is rather complex, so perhaps more of a thought experiment at this stage, but I think it is certainly doable and on the right track.

In what follows, when I mention udata I mean also the kernel stack and possibly the interrupt stack, and in the 6502 context I also include the zero page and the CPU stack (kernel stack is C stack).

Recap about udata stash

Firstly, a recap about stashing of udata in the process's address space: this is a good idea in general, because it simplifies things, you don't have to deal with udata specifically when doing the swapping.

If you stash the udata at 0 in the process's address space (for 6502), then it takes logical memory away from the process, and interrupt handling is slower (we are sharing the zero page between process and kernel, which does not matter for system calls but must be saved for interrupts), and context switching is fast (the udata is already at the correct address so it's an MMU-only operation).

If you stash the udata somewhere invisible to the process (for 6502), then the process has its full logical memory, and interrupt handling is faster (process's zero page and CPU stack can be left completely alone while we switch to kernel mode on a different zero page and CPU stack), but you need an extra page with the udata at the correct address, or you must copy it at context switching.

V7 Unix on the PDP-11 does it the latter way, and does not pay a price because it has very small physical pages (only 64 bytes) so there is no problem having udata aligned to the correct address.

Aligned udata cache

So my idea to get the full logical address space of the process without paying too high a price in time or memory, is to allocate some smallish fixed number of udata's, say 4 or 8, and have this many spare pages available, with a udata at the same address in each page. Then the 4 to 8 most recently run processes will already have their udata somewhere that it can be quickly mapped in by the MMU.

As well as that, each process has a stash area at the end of its in-core image for longer term storage of its udata, and where the udata can be placed before swapping so that it forms part of the process's swap image (for simplicity we have this for all fully in-core processes, including those whose udata is also in the aligned udata cache, and for those processes the contents of the udata stash is ignored).

If the process's logical address space is a multiple of the core page size, then the stash area forces a new full page to be allocated, but usually there is some spare room in the last page for the stash area.

Also, the stash area should divide the page size (e.g. page size of 8192 bytes and stash area of 1024 bytes, which is how it is currently for 6502 IIRC), and the stash area should be aligned to this so that it does not straddle a page, which will simplify swapping. Also, the core image can be larger than the process's logical address space, so that if it's completely full, the stash area can still be placed after it.

Running a process

We maintain a marker in the LRU list of processes, to divide processes whose udata is in the udata cache vs processes whose udata is in the stash. So given an in-core process to run, if its udata is in the udata cache we just run it, otherwise we try to get one of the spare pages to copy unstash its udata into, if none, we take a victim from just before the marker, stash its udata, and move the marker first.

A possible optimization is that if the stash happened to fall at the start of a page (for 6502), i.e. the case where the stash area caused a whole extra page to be allocated, then that page can simply be mapped in without bothering to copy the udata to a cache page first. When reclaiming pages, we have to keep choosing a victim and moving the marker until we get a victim that does use a cache page.

Reclaiming partial pages

A significant problem for a system with large in-core pages, e.g. Virtual65 which provides the example of 8192 byte pages as discussed above, is the waste of space at the end of a page. And even more so, if process used an extra number of pages, causing the udata stash to waste an entire extra page.

What I propose about this is to have a third LRU list for in-core processes, to hold processes that have been demoted from "fully in core with stashed udata" to "mostly in core with partial last page swapped out". So there would be potentially mutiple processes that are in-core and nearly ready to run, but only their last page is in swap. And of course, potentially multiple processes that are fully swapped out.

If we do this, there is a trade-off to make when we need more core. If we swap out the last page from a older (but not the oldest) process, we can get more core with potentially less disk I/O. If we swap out a full page from the oldest process, then there is less likelihood that the process will shortly need to run.

Core vs swap block sizes

An annoying assumption in the current swapper code is that the swap block is the same as the core block, this is done solely to simplify the calculations as to the system's available space and to deny process creation or "brk" requests when this gets too low. The calculations must be done in advance to avoid problems occurring later, because we cannot return an error to the process while swapping.

If you have say a 128 kbyte floppy with 256 x 512 byte blocks, and 128 kbytes of core with 16 x 8 kbyte blocks, the problem is that you'll never be able to pack the floppy with swapped out processes, you will always have a gap in the last page since you have rounded all process sizes up to the nearest 8 kbytes. Even if you do pack the swap images more closely, you just waste a lot at the end of the floppy.

That's where the third LRU list comes in: you can do all your calculations in 512 byte blocks instead of 8 kbyte blocks, and create enough processes or answer enough "brk" requests to completely fill up the swap floppy, and then if you find yourself running out of core because the in-core processes wasted big amounts of their last page, you run the routine that swaps out only the last page of each process.

Use of the disk block cache

Generally we do not want to do swapping through the disk block cache because it causes thrashing, once you've sent a large process through it, all the other blocks that were there before have now been flushed and you've filled the cache up with something you'll only access again once. So we bypass it.

BUT, when swapping the last page it does make sense to use the cache. Because we are only dealing with small amounts of data, it hopefully will not thrash, and because the process is not the oldest, there's some likelihood we'll need it again soon and so it would be good to avoid disk I/O in that case.

Note that we cannot reserve an area of RAM just to hold the compacted last pages. This does not work because it destroys the system's guarantees unless there is a way to unreserve last-page RAM and send it to full-page RAM and vice versa, which implies a defragmentation routine. BUT, if we use the disk cache then we do it without guarantees but just opportunistically, and I think it would work well.

Further, the udata cache can with some gymnastics provide far memory for the disk and inode caches, because we only need to use a certain address in each block, the other addresses are free for other uses. I think it's a high priority to do this and get the disk and inode cache out of kernel data segment.

Summary

A multi-stage demotion scheme from fully runnable processes gradually down to swapped ones, that allows efficient use of core and swap, fast context switching and interrupt handling, less I/O and so on.

EtchedPixels · 2019-05-29T20:45:26Z

What I'm doing for RC2014-6502 is to load the low 16K page with per process stuff as well as the necessary zero page and stack page. At the moment I have various stuff stuck in 0x000-1FFF because
I am playing with ideas but much of that could (and if I can't figure out how to mod the I/O window on the board from C000 to FE00 will shove it up high).

Switching that low 16K page on the actual task switch really helps, and because the banking is flexible doesn't mess up swapping too much. It means the only ZP value exchange I have to do is on interrupts and I don't have to copy stack or other stuff. It's more expensive on memory than copying but the 6502 is not very good at block copies anyway.

The cache is also an odd one - it might help a bit on floppy but floppy swap will be painfully slow, but on hard disk it's usually slower and adds latency to go via the cache so the cache is generally only of real use with metadata. On hard disk it's actually as cheap to write udata to disk when swapping as copying it somewhere else.

nick-lifx · 2019-05-30T02:11:24Z

@EtchedPixels that's interesting, I would like to know more about what you're doing with the low 16K. Where do you put (i) kernel ZP, kernel CPU stack, kernel C stack (ii) process ZP, process CPU stack, process C stack (iii) interrupt stacks (iv) udata? Is process otherwise allowed to use the low 16K?

As for me, I made some minor changes in 6502 to the link script and discard segment, was not going to post them yet, but I can do so if you think it would help. I implemented the discard segment, but frustratingly it seems we can't really use it because of the initialization order of devices, once we get into the kernel proper it's slightly too late to do things like putting disk block buffers in the discard?

Regarding cache, I think we need to make a lot more use of cache, at least on spinning-rust type systems. Anyone who has used SMARTDRV can confirm that it makes an incredible difference if you have a large enough cache, but the current 6ish disk buffers is useless. I think on systems like the Virtual65 or Z180 where you have up to 1 Mbyte, we need to have more like 64 / 128 kbyte of buffers, and a corresponding increase in the number of in-core inodes as well. For me this is a high priority. I'm considering two main schemes for how the far memory can be accessed here, will write about it later.

EtchedPixels · 2019-05-30T19:25:59Z

$2000+ is available to the application to use directly, $0-$1FFF is a mix of stuff that is switched with the task in switchin() and stuff that is identical and copied into each low bank. See platform-rc2014-6502. It mostly works except for signal handling.

On spinning rust (unless you happen to be running an original TRS80 hard disk unit or similar) the drives have upwards of 1MB of cache and the interface runs at close to ldir speed. So whilst a cache is useful the drive has one. In fact big drives have 32MB or more cache so you don't really run Fuzix from the disk at all 8)

Different platforms do different things with discard. Some place it in memory that will be used by userspace, others try and place it straight after the buffers section so that they can indeed just expand buffers over it and empty space in platform_discard(). The most general arrangement is probably the the the kernel runs from low memory upwards ending with buffers then discard, then there is a gap and common. The platform discard code reclaims the space between buffers and common.

You can also put buffers in their own memory bank and the overhead for most ports when doing this is pretty small as only metadata goes wandering off into buffers anyway.

Another problem if you bump the numbers up is that we don't hash inodes and buffers nor do we have pointer chains for things like runnable, so it doesn't scale.

nickd4 · 2019-06-01T11:43:34Z

Yes, fair enough, I agree that there are some weird caveats when it comes to caching in our situation. So perhaps it is not the magic elixir that it was on early 386s... although, having said that, I DO have in the cupboard some ST506 drives which I plan to put onto an AT motherboard and run FUZIX on eventually... not an original TRS-80 but close :) About the hashing and so on, that is trivial to do although I agree it would not bring much benefit right at the moment. I was also reading a bit about segmented LRU and combined LRU/LFU algorithms the other day and it seems they're not too difficult to implement, we probably should eventually do something like that since at present any longer burst wipes the cache. I guess we are going direct to userspace for any full block transfers though? If not, we probably should.

As for me, I got time to do a little more the last 24 hours, I am back to working on the test script / swapping simulator. Given that I have had quite a few additional insights as described in earlier posts, much as I want to move away from the simulator and make an implementation in the kernel ASAP, I decided it's worth doing some reworking and it's coming along quite well. I've been putting in support for an expand-down stack segment among other things. Although I'm not keen on the memory-hole idea that was apparently used in early systems (Xenix?) I would like to have a separate stack segment on the hardware that supports it. Unfortunately that doesn't include 8088 or 286 unless we move to large model.

There are some interesting issues around the stack segment, well I mentioned earlier about the idea of having a fixed-size stack declared in the executable's header and not necessarily in the top of memory, I know this could cause compatibility issues and so on, but nevertheless there are arguments in favour.

Lately my thinking on this has centred on the guarantees, it's a bit of a problem that we have no way of informing the process that it has run out of stack, besides killing it with a signal. That applies even if the user-space cooperates by detecting out-of-stack (perhaps by comparing SP with something in the function prologue) and goes into a brk()-like routine to ask for more stack -- what does it do if this fails.

The best I could suggest is program uses setjmp(), then calls stack intensive routine such as qsort(), ends up in a signal handler on an interrupt stack to tell it it's out of stack, then longjmp() back to the main program to tell the user that the data was too large to be sorted? I suppose it works but it's quite messy.

So to my way of thinking, if there's a chance we could run out of stack during the execution of a process and (in the case we can expand stacks on the fly) not be able to expand for lack of core and/or swap, the right thing to do is deny the fork() or execve(). I realize that this sounds pedantic, but the alternative is basically to overcommit the memory and do what is effectively an OOM-killer, and I don't agree with that.

Because of this, my idea for the expand-down stack segment is essentially to reserve the needed space for the lifetime of the process, but for the system to be aware of the unused portion and avoid copying it or swapping it. This would save significant effort compared with the current way where the entire image of the process is swapped. I was thinking of putting the stack at the start of the process's core and swap images, which can expand and contract in both directions (at the start from stack, at the end from brk), and on some hardware it might just appear there, on other hardware we can trickily remap it to the end.

EtchedPixels · 2019-06-02T21:52:47Z

Aligned full blocks go direct to user space unless a platform has specifically decided not to for that device. Even on fastish floppies it seems to be a win not to cache data pages only metadata (and maybe directories). Interestingly the authors of the ZSystem found similar results and went the same way except that they added a hack so if you have lots of buffers (and on a big box you could set up a lot of buffers if you wanted) then you could mark some binaries to be cached on load and then subject to the normal expiry rules.

Having a fixed size stack is definitely a win, and on some architectures it's also relatively cheap to
spot stack overflows in the function call entry code. I am not sure why you need the stack not to be at the top of the space. What kind of platform does that help ?

Likewise agreed on fork/execve failing on non paged systems.

Modern Unix has the notion of signal stacks for the out of stack and signal case. Nobody ever uses them 8)

nick-lifx · 2019-06-03T03:37:23Z

About the stack being at the start of the space, it is really more of an internal thing (internal to the kernel), and mostly for the platforms like 8088 or Z180 that use segment or MMU base registers.

Assuming you use a separate data and stack segment, you would not want to put the stack segment right after the data segment, because you would have to move it each time the process calls brk().

So the solution is either treat the data and stack segments as separate from the viewpoint of memory managemant, thus a brk() call might move the stack segment or might move some other segment, or another way is to put the stack segment before the data segment, that way brk() does not affect it, and indeed if the stack can also grow with a brk()-like call, it can grow in the other direction. You would still have to move other processes' segments if they get in the way of extending the stack or data.

If data and stack are treated as separate segments, it does have some small advantage because the defragmentation might be able to do smaller block-moves, but I don't think it would make much real difference, since basically to satisfy a brk() request you have to compact other segments, which means copying everything along to fill the gaps, so having them in smaller pieces would not help that much.

So what I have decided is to concatenate the two, with stack first and the low stuff (zero page, text, data, bss, heap) after. That way, swapping does not have to care where the stack and the low stuff begins. It just keeps a pointer into the conjoined segment to know what is swapped and what isn't.

That does not mean that the stack has to appear this way to the program. In the Z180 case, for instance, we can use the BBR to map in the low stuff and the CBR to map in the stack. On the 8088 it's a bit more tricky since we would have to go to the large memory model to accommodate this.

For the 8088, if we want to get the benefit of less copying (not swapping out the unused stack space) then we could put the stack low, and only swap the used portion (going from the SP register up to the brk level). If we decided it's more compatible to put a fixed size stack above bss and below heap, then we can do that, but unused stack would get swapped unless we added complicated extra swap code.

On z80pack the problem is slightly different and there is some argument in favour of putting the stack up high and then copying it in two goes (perhaps stack followed by low stuff) to the swap, but that's more complicated. Aside from the compatibility issue of programs assuming stack is above heap, I don't see a big advantage. It would be much easier simply to put the stack down low at all times.

Of course there is also the memory hole idea. Give each program a 64 kbyte segment with low stuff, gap, and then stack. And try to interleave the processes so that low stuff of the next goes in the hole. But it would ruin absolutely everything, firstly it would make a dog's breakfast of the code, and secondly once you start introducing weird alignment requirements it's not possible to guarantee the aligned size.

nickd4 · 2019-06-04T14:41:49Z

I am working in the swapping simulator still, it is looking awesome.

Yesterday I revamped the swap-in and swap-out loops to greatly simplify the calculations and rationalize the state saved per process to manage this, and also to combine common code into a swap_read_write() routine which saves significant code space by using a read/write flag.

Then today I did another major rewrite which makes it keep track of the internal fragmentation in a better way. It has meant changing the test script, the test script runner, and the allocate/reallocate code.

Say for example that you started with 4 blocks and then you added 0.25 of a block to each end. This wouldn't change the alignment relative to the backing store, so you'd be left with a region spanning -0.25 up to 4.25 relative to the original allocation.

Well previously I stored this as the tuple (0.75, 4.5) meaning that in the now 6 block allocation, there is 0.75 of a block wasted, then 4.5 blocks used, then another 0.75 blocks wasted. What I now do is store it as the tuple (-0.25, 4.25) and this implicitly encodes an origin.

This makes it much simpler for the client to resize either end (by specifying that it wants to extend the -0.25 end down to -0.75 say, without having to consider the 4.25 end at all), and also greatly simplifies the internal calculations for swapping and for allocating the backing store, since I can round the number ~~up and down~~ down and up respectively to (-1, 5) to get a description of the block level backing storage.

Interestingly, the counter for how much is currently swapped has to be expressed in terms of the block level storage, because of a problem with null blocks that had me scratching my head for quite a while yesterday. If for example the allocation is (0.5, 0.5) then instead of allocating no blocks like you would think, the system considers this to be half a block wasted, no data, and then half a block wasted. So the block level description is (0, 1) by rounding 0.5 first down and then up. And when swapping it needs to know whether this dummy block has been brought into the core. So the in-core blocks ranges from 0 to 1. If the in-core blocks pointer was restricted to the range of valid data, it would lose information.

Anyhow, all of this work has been aimed towards making it simple to take the plunge and give the core and swap different block sizes, I have rationalized everything to the best extent possible so that the rewrite needed to accomplish this will not be too involved. The holy grail is within grasp!

Recall from my posts last week, that it needs an extra LRU list and the pre-swap phase to maintain guarantees in that situation, because wasted space due to larger core pages isn't budgeted for and can exhaust both core and swap. To recover, we will pre-swap some number of processes that have internal fragmentation. Last week I rewrote the LRU code so that this will also not need much of a rewrite.

The code size needed to accomplish all this is now really tiny, because of all the clever rationalizations.

nickd4 · 2019-07-30T23:30:10Z

Apologies that I was absent for quite a while. This is because I meandered onto some other fascinating "spare time" projects. In order to stay sane I need a dose of novelty from time to time. However please do not close the issue as I remain committed to completing and merging the swapper. From what I recall there was a small amount still to do in the swapping simulator, to complete implementing the feature that swap blocks could be smaller than core blocks, while not having to round everything up and waste egregious amounts of space to allow for the worst case. I will have to reorient my brain to this problem and try to remember the solution :)

In the meantime I implemented a kind of P-machine for my personal unix projects, but optimized to execute C not Pascal, and sacrificing binary compatibility across platforms for source code compatibility, with the assembler producing a kind of a hybrid or customized P-code for a specific target (Z80 or 6502 currently) that can run very fast by reducing the instruction decoding time in the interpreter significantly. I think it would not be all that much slower than native Z80 or 6502 code given that indexing the stack and doing double byte arithmetic cost many cycles.

My P-code assembler is based on asxxxx and I must say I really like that platform. To reorient myself with asxxxx internals I also undertook another project to port asxxxx to PDP-11 and a.out object format not its own object format, it needs a little work on encoding the PC-relative instructions to a.out but otherwise is close to useability.

Another large project that I undertook has been to create an integer math package for the P-code interpreter, also I preserved an API separation so it can be used standalone and I think it would be a highly competitive replacement for the math package in ACK or SDCC Z80 (

nickd4 · 2019-07-30T23:41:26Z

Apologies that I was absent for quite a while. This is because I meandered onto some other fascinating "spare time" projects. In order to stay sane I need a dose of novelty from time to time. However please do not close the issue as I remain committed to completing and merging the swapper. From what I recall there was a small amount still to do in the swapping simulator, to complete implementing the feature that swap blocks could be smaller than core blocks, while not having to round everything up and waste egregious amounts of space to allow for the worst case. I will have to reorient my brain to this problem and try to remember the solution :)

In the meantime I implemented a kind of P-machine for my personal unix projects, but optimized to execute C not Pascal, and sacrificing binary compatibility across platforms for source code compatibility, with the assembler producing a kind of a hybrid or customized P-code for a specific target (Z80 or 6502 currently) that can run very fast by reducing the instruction decoding time in the interpreter significantly. I think it would not be all that much slower than native Z80 or 6502 code given that indexing the stack and doing double byte arithmetic cost many cycles.

My P-code assembler is based on asxxxx and I must say I really like that platform. To reorient myself with asxxxx internals I also undertook another project to port asxxxx to PDP-11 and a.out object format not its own object format, it needs a little work on encoding the PC-relative instructions to a.out but otherwise is close to useability.

Another large project that I undertook has been to create an integer math package for the P-code interpreter, also I preserved an API separation so it can be used standalone and I think it would be a highly competitive replacement for the math package in ACK or SDCC Z80 (have not done the 6502 or 8080 versions yet but have thoughts in those directions). It provides 16x16->32 bit and 32x32->64 bit multiplication and the inverse operations 32/16 bit division with 16 bit remainder and 64/32 bit division with 32 bit remainder. It does these in a highly optimized and sophisticated way. It also has fast shifts and compares. Signed and unsigned versions are provided. No negation is done, so the worst case time taken is the same for positive and negative arguments. Division is non-restoring.

More recently I moved on to doing a transcendental package, this was originally for something work related since I do not envisage doing a lot of scientific computing on Z80, but it has evolved in a way that will slot fairly neatly into the P-code math package when I get around to doing floating point support (I have some innovative ideas on that point, to speed up Z80 soft float).

So lots of things are happening that I hope will eventually merge into a very tight package to run FUZIX on Z80 from floppy, but in a fairly chaotic way with progress on many fronts, I suppose in a breadth first search if you take my meaning :) :)

feilipu · 2019-07-31T03:24:39Z

More recently I moved on to doing a transcendental package, this was originally for something work related since I do not envisage doing a lot of scientific computing on Z80, but it has evolved in a way that will slot fairly neatly into the P-code math package when I get around to doing floating point support (I have some innovative ideas on that point, to speed up Z80 soft float).

If you're interested, I've recently finished a 32-bit IEEE 754 compatible floating library for z80, z180, and z80n (Spectrum Next) in the z88dk. The library is derived from the Digi Rabbit example with Cephes and Hi-tech C transcendental functions, but after three rewrites there's not that much of the Digi code remaining.

Further input of innovative ideas would be welcome.

nickd4 · 2019-07-31T04:07:57Z

Whoops double posting and seems I closed the issue by mistake, well it does not matter, is there a better place to continue the discussion such as a forum or mailing list? Anyhow I will aim to PR the swapper at some point in the next month or two.

@feilipu very interested to see what techniques you used, do you have a link? Having said that I must say that IEEE is not totally the best for soft float due to all the infinity and NaN tests that have to be put everywhere and weird cases to handle. The sticky bit and round to even I am sort of persuaded has value, but not too sure if the overhead can be justified on all platforms.

feilipu · 2019-07-31T05:13:54Z

@nickd4 sure yes. Link in blue here. A 32-bit IEEE 754 compatible floating library for z80, z180, and z80n (Spectrum Next) in the z88dk. Most of the techniques are explained there. Generally, it is more like "SDCC 32-bit float", being mostly compatible with IEEE 754, than fully letter of law IEEE 754 compliant, because of the reasons you mention.

But, you might want to comment on the Issue on the library in z88dk, rather than polluting your thread here.

EtchedPixels · 2019-08-02T23:03:25Z

@nickd4 btw your ndcode domain appears a bit busted - expired certificate with wrong name for www and bad gateway for anything else ?

nick-lifx · 2019-08-04T22:44:56Z

OK I noticed that a while ago, thanks for the reminder. I cannot work out why it is so troublesome, never seems to work for more than a few months without attention. The gitweb is the most troublesome, seems to overwrite itself with stock configuration from time to time? The certificate update is also troublesome, even though it's configured and all that. Will fix it when I finish work.

jcw · 2019-08-05T10:58:05Z

FWIW, I can highly recommend Gitea as code server. I use it behind a Caddy reverse proxy, but it can also handle (and auto-refresh) LetsEncrypt certificates on its own. Updates are a matter of replacing a single executable and restarting the service. See my https://git.jeelabs.org/ setup as example, but it's easy to set up your own on any platform, and play with it as admin.

I'm not trying to convince you in any way, just sharing a happy experience.

nickd4 · 2019-08-11T08:51:34Z

@jcw I will certainly look into gitea. For me, the use of gitweb was kind of temporary until I could look at doing some development in that direction. Because I was not happy with gitlab and some other heavyweight solutions that I looked at. Your gitea server has a very clean look, I'd be keen to know more. I fiddled the css slightly on gitweb to make it look more github / gitea-like but only took it so far.

About my server the issue was not too drastic. I have got git and running again, with some notes:

I am not trying to get https://ndcode.org (which is attached to https://www.ndcode.org) working at the moment, it will be my personal blog but uses an experimental webserver, more on this below.
The https://git.ndcode.org was returning "bad gateway" because the nginx server only proxies a FastCGI backend that runs gitweb and the FastCGI server was not running, more on this below.

About the experimental webserver, it is written in node.js and I have uploaded most modules to npm, unfortunately the actual main module (which uses all the npm packages already uploaded) is not quite ready for release -- I decided to do some minor refactoring first, and I think it was going well, but I haven't looked at that in 9 months or so. See https://www.npmjs.com/~nick_d2 for the modules already uploaded, and in particular the https://www.npmjs.com/package/@ndcode/jst package.

About the FastCGI not running, inspecting the logs revealed that the OOM killer had been doing its awful work -- how I hate that nondeterministic piece of crap! I know that memory overcommit can be turned off in /proc and I am thinking I will do that, but hesitating slightly because I know that many popular components including probably some that I am running as part of iRedMail, rely on sparse allocation and will allocate huge blocks without the intention of using all of it. So it's a yuck situation.

For the time being I increased my server memory from 2GB to 4GB with corresponding increase in monthly fee. To compensate for this I shut down one of my servers I use for web projects which was running another highly experimental web service that was very much a WIP, to do with sandboxing and executing arbitrary user code (I wasn't very happy with how my approach panned out and will have to solve it differently when I return to it). So I now have just 2 webservers to maintain. Haha!

jcw · 2019-08-13T18:58:47Z

We're veering waaay off-topic here, but node.js (or rather: v8) can be a real memory hog. That's why I switched to Go a long time ... (wait for it) ... aGo ;)

Current memory use on my server: Gitea 161M, MySQL 487M, Caddy 30M. MySQL also handles Nextcloud (PHP) and Redmine (Rails), so that will not reflect its needs when used just for Gitea.

EtchedPixels added the discussion label Feb 14, 2019

nickd4 closed this as completed Jul 30, 2019

Possible Execute-In-Place scheme #691

Possible Execute-In-Place scheme #691

Comments

nickd4 commented Feb 14, 2019

EtchedPixels commented Feb 14, 2019

nickd4 commented Feb 18, 2019

EtchedPixels commented Feb 19, 2019

nickd4 commented Feb 20, 2019 • edited Loading

EtchedPixels commented Feb 20, 2019

nickd4 commented Feb 20, 2019

EtchedPixels commented Feb 23, 2019

nickd4 commented Mar 1, 2019

nickd4 commented Mar 3, 2019

EtchedPixels commented Mar 4, 2019

nickd4 commented Mar 5, 2019

nickd4 commented Mar 9, 2019

nickd4 commented Mar 9, 2019

EtchedPixels commented Mar 9, 2019

nickd4 commented Mar 11, 2019

EtchedPixels commented Mar 12, 2019

nickd4 commented Mar 13, 2019

EtchedPixels commented Mar 13, 2019

nickd4 commented Mar 14, 2019

nickd4 commented Mar 18, 2019

nickd4 commented Mar 28, 2019

EtchedPixels commented Mar 28, 2019

nickd4 commented Mar 31, 2019 • edited Loading

nickd4 commented Apr 13, 2019 • edited Loading

nickd4 commented Apr 19, 2019

EtchedPixels commented Apr 19, 2019

nickd4 commented Apr 19, 2019

feilipu commented Apr 20, 2019

davidgiven commented Apr 20, 2019

nick-lifx commented May 21, 2019

nick-lifx commented May 21, 2019

EtchedPixels commented May 21, 2019 • edited Loading

EtchedPixels commented May 21, 2019

nick-lifx commented May 21, 2019

EtchedPixels commented May 21, 2019

nick-lifx commented May 22, 2019

nickd4 commented May 24, 2019

jcw commented May 24, 2019

EtchedPixels commented May 25, 2019

nick-lifx commented May 27, 2019

nick-lifx commented May 29, 2019

Recap about udata stash

Aligned udata cache

Running a process

Reclaiming partial pages

Core vs swap block sizes

Use of the disk block cache

Summary

EtchedPixels commented May 29, 2019

nick-lifx commented May 30, 2019

EtchedPixels commented May 30, 2019

nickd4 commented Jun 1, 2019

EtchedPixels commented Jun 2, 2019

nick-lifx commented Jun 3, 2019

nickd4 commented Jun 4, 2019 • edited Loading

nickd4 commented Jul 30, 2019

nickd4 commented Jul 30, 2019

feilipu commented Jul 31, 2019 • edited Loading

nickd4 commented Jul 31, 2019

feilipu commented Jul 31, 2019 • edited Loading

EtchedPixels commented Aug 2, 2019

nick-lifx commented Aug 4, 2019

jcw commented Aug 5, 2019

nickd4 commented Aug 11, 2019

jcw commented Aug 13, 2019

nickd4 commented Feb 20, 2019 •

edited

Loading

nickd4 commented Mar 31, 2019 •

edited

Loading

nickd4 commented Apr 13, 2019 •

edited

Loading

EtchedPixels commented May 21, 2019 •

edited

Loading

nickd4 commented Jun 4, 2019 •

edited

Loading

feilipu commented Jul 31, 2019 •

edited

Loading

feilipu commented Jul 31, 2019 •

edited

Loading