Sunday, October 18, 2009

Musings

Since resources and space are so limited on the PIC, one always has to think of clever ways to squeeze more out of very little. For instance, on the PIC 16F84 the memory, or working RAM, is limited to 80 bytes. Three of those (for the 2 s / 20 MHz delay) to six (for the 1 year / 20 MHz delay) were used up in the previous code simply to implement a delay. We needed 6 variables each counting to 256, because roughly (2*6+1)*256*256*256... six times over, i.e. (2*6+1)*256^6, is what it takes to execute the needed amount of instructions for 1 year. 256 is the biggest count we can get out of a single memory location on the PIC.
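Just to sanity-check those orders of magnitude, here is a throwaway C snippet (run on a PC, not the PIC; the only PIC-specific assumption baked in is that one instruction cycle is Fosc/4, i.e. 5 million instructions per second at 20 MHz) comparing a year's worth of instruction cycles against what k nested byte counters can burn:

#include <stdio.h>

int main(void) {
    double ips = 20.0e6 / 4.0;               /* PIC instruction rate at 20 MHz: Fosc/4 = 5 MIPS */
    double year = 365.0 * 24 * 3600 * ips;   /* instruction cycles in one year, about 1.6e14 */
    double two_s = 2.0 * ips;                /* instruction cycles in two seconds, about 1.0e7 */
    double count = 1.0;
    int k;
    for (k = 1; k <= 6; k++) {
        count *= 256.0;                      /* k nested byte counters give 256^k iterations */
        printf("k=%d: 256^k = %.3g (2 s needs %.3g, 1 year needs %.3g)\n", k, count, two_s, year);
    }
    return 0;
}

With k = 3 you land in the 2-second ballpark, and with k = 6 within a factor of two of the 1-year figure, which is where the 3-to-6 register counts above come from; the exact numbers depend on how many cycles each turn of the real loop costs.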
I was thinking perhaps there was a way to get away with fewer variables to make big numbers. After all, if I'm given two bytes, d1 and d2, each able to count to 256, my options are 256+256, or 256*256, or... how about 256^256? Now that's a huge number. Even 256^8 is huge. So is there a way to get away with, say, d1 storing 255 and d2 storing 8? What are the mechanics under the hood; can we generate the huge number of operations needed for a delay with just these two registers? In C one would just write a routine:
int pow(int x, int y) { int i, result = 1; for (i = 0; i != y; i++) result = result * x; return result; }
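If you actually want to compile and poke at that routine, here is a little harness for it (my own throwaway code; the function is renamed pow_int so it doesn't collide with the pow() declared in math.h):

#include <stdio.h>

int pow_int(int x, int y) { int i, result = 1; for (i = 0; i != y; i++) result = result * x; return result; }

int main(void) {
    printf("%d\n", pow_int(3, 3));   /* 27, fits a 32-bit int comfortably */
    printf("%d\n", pow_int(4, 4));   /* 256, still fine */
    printf("%d\n", pow_int(256, 8)); /* 256^8 needs 64 bits; a 32-bit int overflows (signed overflow is undefined behavior, in practice you typically see 0 here) */
    return 0;
}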

Well, here is where C programmers, or high level programmers in general, think abstractly, forget about the underlying architecture, and expect things to just work out. In practice, first of all, this function overflows pretty fast, because a power such as 10000^10000 is not going to fit in an int result variable. The very reason C has datatypes is to enable efficient use of memory, and to make the programmer keep the limits of the architecture at hand in mind. VB does not require type declarations, and anything undeclared ends up as a 16-byte Variant type; note that even the biggest integer type in C, long long int, is only 8 bytes. Still, 16 bytes is not enough for everything, because it can overflow too.
Abstraction makes it possible to write code independent of the architecture and hardware it runs on, but eventually it does come down to limits; PC hardware is not unlimited. Think back to the days when IBM decided that 1 MB of address space should be enough memory for everyone. That decision has been laughed at ever since. (Note that in the picdelay.c program below, I made a similar decision that 1 year should be enough for anyone, in view of the architecture.) In the end that 1 MB decision turned out to be very wrong, and since then we have learned our lesson: proper new design tries to eliminate any built-in limits.
For instance, in the old 8086 DOS PC, working memory went up to 640 KB, and above that sat the video memory and other things, up to 1 MB. When practical memory grew to 8 MB or so in the 80286 era, the computer had to keep running circles around that video memory: load a contiguous block from above 1 MB down into the 0-640 KB region, work on it until it ran out of things to do, then move everything back above 1 MB and load a new dose of work from the 1 MB-8 MB region into 0-640 KB. On top of that, loaded device drivers, such as CD-ROM, mouse or joystick drivers, each took a few KB of memory, and you ended up with 600 KB, or 586 KB, or even less "conventional" memory, which was not enough for a lot of games that needed, say, at least 604 KB contiguous to run. You had the option of not loading the CD-ROM driver to save some memory, or there were memory optimizers like QEMM that sought out contiguous empty space between 640 KB and 1 MB not used by any hardware and loaded the CD-ROM driver there. The drivers would stay resident, but this way you could squeeze out 629 KB(!) of working RAM AND still have a gazillion device drivers loaded "high."
Had DOS been designed without this fixed-location limitation, say with the video memory start address read from a pointer value stored somewhere instead of hard-wired, things might have been easier to adapt as technology advanced. But once you had to live with the status quo and stay compatible, it was hard to change things. One option, of course, would have been to require every computer to come with 8 MB and declare in a newer design spec that video memory now lives between 7.64 and 8.00 MB, but then any old program written for the 0-1 MB design would have had to be rewritten. That would have been a mess.
Win32 has a flat memory model: every program gets a sandbox in which it thinks it has all the memory on the computer to itself, and Windows translates what the program's address 0, or its 1 MB mark, corresponds to in the real machine, which might be the 12.5 MB to 13.5 MB region. This way you don't have to rewrite Windows programs to adapt to new limits.
So back to 256^256: can you implement code with only 2 registers and do that many iterations, a cleverer way than having to use 6 registers? Would it be more elegant, maybe use even fewer instructions? Well, the answer is no. I actually tried. What happens is that you need a way to represent and store the intermediate numbers. In order to raise to a power you need to multiply, and in order to multiply you need to add; on the PIC you have to add manually to multiply, and then use that multiply to raise to a power, manually. Something like 3^3 = 27 fits into a register, and 4^4 = 256 is right at the edge, but anything bigger doesn't. You could try to consume the stored-up count before doing the next step, i.e. before you come around and multiply again, do those iterations now, one by one, instead of later. This does not work, because the very essence of raising to a power is something like 256*(current product of the previous 256*256*...) = 256*123456789012 = 123456789012 + 123456789012 + ... 256 times. Should you use up any of these intermediate values, you won't have the huge number to add up on the next turn of multiplying by 256. Basically + and - are one operation step away from *, but ^ is two steps away, and there is no quick and dirty way to decrement in between so that you avoid overflow while raising to the power. So you can't raise to a power unless you're willing to store huge intermediate values. You can make your own custom type and dedicate 16 or more bytes to storing the values in it, but that negates the whole purpose of trying to save registers in the first place: using 16 registers to save 7, and making the code very complicated too, similar to old Windows juggling values wider than the registers on 16-bit PC architecture such as the 286 - low byte, high byte, what a mess that was. Linux came along later, and by design it does not run on anything less than a 386, which is natively 32-bit, simply to avoid the low byte / high byte mess in its internal memory addressing calculations.
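Going back to the multiply-versus-power mechanics for a second, here is a small C sketch of the difference (my own simulation on a PC of what the PIC loops do; the names and the tick counter are illustrative only). A delay of a*b steps never needs anything wider than the two byte counters themselves, because the outer counter gets consumed one step at a time while the inner one spins; a delay of x^y steps has no such structure with just two bytes, because each round of the multiplication needs the full product accumulated so far.

#include <stdio.h>

/* a*b iterations using only two byte-sized counters; the product a*b is never stored anywhere */
static unsigned long delay_product(unsigned char a, unsigned char b) {
    unsigned long ticks = 0;            /* only here so the PC can count what the PIC would simply burn */
    unsigned char i = a, j;
    do {
        j = b;
        do { ticks++; } while (--j);    /* inner counter: b steps (256 when b starts at 0, like the decfsz wrap-around) */
    } while (--i);                      /* outer counter consumed one step per full inner run */
    return ticks;
}

int main(void) {
    printf("%lu\n", delay_product(0, 0));   /* 256*256 = 65536 steps out of two bytes */
    /* For 256^6 steps you need six such nested counters, one per factor.
       With only the two bytes x and y there is no loop structure that yields
       x^y steps without an accumulator wide enough to hold 256^2, 256^3, ... */
    return 0;
}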
Basically the delay code below, using as many registers as needed and going the 256*256*256*256... way to generate large numbers, is the most elegant and efficient way I can think of. In its mechanics, the intermediate values that need to be stored are never greater than 256, and they can be decremented along the way while the multiplication happens: you don't have to first generate the full product and then count down from a huge number one step at a time; you can take one counter, decrement it all the way, then continue with the multiplication, and the final count of how many instructions were ultimately executed is not affected. Chaining registers with + is not as efficient as * at creating large numbers, and ^ is unimplementable without storing large intermediate values greater than 256, so it's not more elegant either. Unless you can think of something I didn't think of.
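Just to put rough numbers on that + versus * comparison (another throwaway PC-side snippet, nothing PIC-specific about it): with k byte-sized counters, chaining them by addition buys you on the order of k*256 steps, while nesting them as multiplication buys 256^k.

#include <stdio.h>

int main(void) {
    unsigned long long add_total, mul_total;
    int k, i;
    for (k = 1; k <= 6; k++) {
        add_total = 256ULL * k;                          /* k counters run one after another */
        mul_total = 1ULL;
        for (i = 0; i < k; i++) mul_total *= 256ULL;     /* k counters nested inside each other */
        printf("k=%d  chained: %llu  nested: %llu\n", k, add_total, mul_total);
    }
    return 0;
}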
By the way, while surfing around for answers on this topic, I came across the factoradic number system, which is a mixed radix number system (it's not base 2 like binary, or base 10 like decimal, or base 16 like hex; instead each digit position has its own weight, the factorial of that position, so the digit in position k only runs from 0 to k). Because the factorial grows roughly like N^N, huge numbers can be stored with few digits past a certain threshold. The factoradic system is in a sense the next step beyond Arabic numeral notation: tally marks are +-fast, the Arabic positional system with a fixed radix develops * to its finesse, and factoradic gives a notation that is ^-fast. I'm not enough of a mathematician to really analyze this, but there might be ways to implement factoradic in hardware that could give more efficient computation, at least when the number of digits stored is huge, probably well beyond 64 bits. Instead of the highest representable number being around, say, 2^64, one might be able to get closer to a 64^64-magnitude number, with accurate individual-step granularity, squished into something like 64 bits.
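For the curious, here is a quick C sketch of how a number breaks down into factoradic digits (my own toy code, nothing to do with the PIC project): divide by 1, then 2, then 3 and so on, keeping the remainders, which is why the digit in position k never exceeds k.

#include <stdio.h>

/* Produce factoradic digits of n, least significant first; returns how many digits were written. */
static int to_factoradic(unsigned long long n, int digits[], int max_digits) {
    int count = 0;
    unsigned long long base = 1;   /* divide by 1, then 2, then 3, ... */
    do {
        digits[count++] = (int)(n % base);
        n /= base;
        base++;
    } while (n > 0 && count < max_digits);
    return count;
}

int main(void) {
    int d[32], count, k;
    count = to_factoradic(463ULL, d, 32);
    /* prints 3 4 1 0 1 0, i.e. 463 = 3*5! + 4*4! + 1*3! + 0*2! + 1*1! + 0*0! */
    for (k = count - 1; k >= 0; k--) printf("%d ", d[k]);
    printf("\n");
    return 0;
}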
