Pack/Unpack Tutorial

A recent conversation in the chatterbox gave me the idea to write this. A beginning programmer was trying to encode some information with pack and unpack but was having trouble coming to grips with exactly how they work. I have never had trouble with them, but I came to programming from a hardware background and I'm very familiar with assembly and C programming. People who have come to programming recently have probably never dealt with things at such a low level and may not understand how a computer stores data. A little understanding at this level might make pack and unpack a little easier to figure out.
Why we need pack and unpack
Perl can handle strings, integers and floating point values. Occasionally a Perl programmer will need to exchange data with programs written in other languages. These other languages have a much larger set of datatypes. They have integer values of different sizes. They may only be capable of dealing with fixed length strings (dare I say COBOL?). Sometimes, there may be a need to exchange binary data over a network with other machines. These machines may have different word sizes or even store values differently. Somehow, we need to get our data into a format that these other programs and machines can understand. We also need to be able to interpret the responses we get back.
Perl's pack and unpack functions allow us to read and write buffers of data according to a template string. The template string allows us to indicate specific byte orderings and word sizes or use the local system's default sizes and ordering. This gives us a great deal of flexibility when dealing with external programs.
In order to understand how all of this works, it helps to understand how computers store different types of information.
Integer Formats
Computer memory can be looked at as a large array of bytes. A byte contains eight bits and can represent unsigned values between 0 and 255 or signed values between -128 and 127. You can't do a whole lot of computation with such a small range of values so a modern computer's registers are larger than a byte. Most modern processors use 32 bit registers and there are some processors with 64 bit registers. A 32 bit register can store unsigned values between 0 and 4294967295 or signed values between -2147483648 and 2147483647.
When storing values greater than 8 bits long to memory, the value is broken up into 8 bit segments and stored in multiple consecutive storage locations. Some processors will store the segment containing the most significant bits in the first memory location and work up in memory with lesser segments. This is referred to as "big-endian" format. Other processors will store the least significant segment in the first byte and store more significant segments into higher memory locations. This is referred to as "little-endian" format.
This might be easier to see with a picture. Suppose a register contains the value 0x12345678 and we're trying to store it to memory at address 1000. Here's how it looks.
Address   Big-Endian Machine   Little-Endian Machine
1000      0x12                 0x78
1001      0x34                 0x56
1002      0x56                 0x34
1003      0x78                 0x12
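The table can be reproduced from Perl itself. The following is a minimal sketch, not part of the original examples: it packs 0x12345678 with the 'N' and 'V' formats (described in the next table) and prints the resulting bytes as hex. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# Pack the same 32 bit value in both byte orders and show the bytes as hex.
my $value = 0x12345678;

my $big    = unpack('H*', pack('N', $value));   # big-endian ("network") order
my $little = unpack('H*', pack('V', $value));   # little-endian ("VAX") order

print "big-endian:    $big\n";      # 12345678
print "little-endian: $little\n";   # 78563412

# 'L' packs in the host's own order, so comparing reveals the local byte order.
my $native = unpack('H*', pack('L', $value));
my $order  = $native eq $big    ? "big-endian"
           : $native eq $little ? "little-endian"
           :                      "neither of the two orders above";
print "this machine is $order\n";
```

Reading the hex strings left to right corresponds to reading addresses 1000 through 1003 in the table.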
If you have looked at perldoc -f pack or have looked up the pack function in Programming Perl, you have seen a table listing template characters with a description of the type of datum they match. That table lists integer formats of several sizes and byte orders. There are also signed and unsigned versions.
Format   Description
c,C      A signed/unsigned char (8-bit integer) value
s,S      A signed/unsigned short, always 16 bits
l,L      A signed/unsigned long, always 32 bits
q,Q      A signed/unsigned quad (64-bit integer) value
i,I      A signed/unsigned integer, native format
n,N      A 16/32 bit value in "network" (big-endian) order
v,V      A 16/32 bit value in "VAX" (little-endian) order
The s, l, and q formats pack 16, 32, and 64 bit values in the host machine's native order (q and Q are available only if your perl was built with 64 bit integer support). The i format packs a value the size of the local C compiler's int, which is at least 32 bits wide. The n and v formats fix both the size and the storage order, which makes them useful for interchange with other systems.
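To see the sizes for yourself, you can pack a zero with each format and measure the result. This is only a sketch; the hash and variable names are invented here, and q is left out because not every perl supports it.

```perl
use strict;
use warnings;

# Byte lengths of the fixed-size integer formats.
my %size = map { $_ => length(pack($_, 0)) } qw(c s l n N v V);
printf "%s => %d byte(s)\n", $_, $size{$_} for sort keys %size;

# 'i' follows the local C int, so its length can differ between machines.
printf "i => %d byte(s) on this machine\n", length(pack('i', 0));

# The same byte read back as unsigned (C) and as signed (c).
my $byte = pack('C', 200);
my ($unsigned) = unpack('C', $byte);   # 200
my ($signed)   = unpack('c', $byte);   # -56, because 200 - 256 = -56
print "unsigned: $unsigned, signed: $signed\n";
```

The last three lines show that signed and unsigned formats store the same bits; they differ only in how the bits are interpreted when unpacking.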
Character Formats
Strings are stored as arrays of characters. Traditionally, each character was encoded in a single byte using some coding system like ASCII or EBCDIC. Newer encoding systems like Unicode use either multi-byte or variable length encodings to represent characters.
Perl's pack function accepts the following template characters for strings.
Format   Description
a,A      A null/space padded string
b,B      A bit (binary) string in ascending/descending bit order
h,H      A hexadecimal string, low/high nybble first
Z        A null terminated string
Strings are stored in successive increasing memory locations with the first character in the lowest address location.
Perl's pack function
The pack function accepts a template string and a list of values. It returns a scalar containing the list of values stored according to the formats specified in the template. This allows us to write data in a format that would be readable by a program written in C or another language or to pass data to a remote system through a network socket.
The template contains a series of letters from the tables above. Each letter is optionally followed by a repeat count (for numeric values) or a length (for strings). A '*' on an integer format tells pack to use this format for the rest of the values. A '*' on a string format tells pack to use the length of the string.
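The difference between a count and a '*' is easiest to see side by side. Here is a small sketch with made-up values; the strings are shown as hex where they contain nulls.

```perl
use strict;
use warnings;

# A repeat count on a numeric format consumes that many values.
my $three = pack('C3', 65, 66, 67);        # same as pack('C C C', 65, 66, 67)
# '*' on a numeric format consumes every remaining value.
my $rest  = pack('C*', 65, 66, 67, 68);
# A count on a string format is a field length; a short string is padded.
my $fixed = pack('a6', 'abc');
# '*' on a string format uses the length of the string itself.
my $exact = pack('a*', 'abc');

printf "C3 -> %s\n", $three;                                           # ABC
printf "C* -> %s\n", $rest;                                            # ABCD
printf "a6 -> %s (%d bytes)\n", unpack('H*', $fixed), length $fixed;   # 616263000000 (6 bytes)
printf "a* -> %s (%d bytes)\n", unpack('H*', $exact), length $exact;   # 616263 (3 bytes)
```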
Now, let's try an example. Suppose we're collecting some information from a web form and posting it for processing by our backend system which is written in C. The form allows a monk to request office supplies. The backend system wants to see input in the following format.
struct SupplyRequest {
    time_t request_time;    // time request was entered
    int employee_id;        // employee making request
    char item[32];          // item requested
    short quantity;         // quantity needed
    short urgent;           // request is urgent
};
After looking through our system header files, we determine that time_t is a long. To create a suitable record for sending to the backend, we could use the following.
$rec = pack( "l i Z32 s2", time, $emp_id, $item, $quan, $urgent);
That template says 'a long, an int, a 32 character null terminated string and two shorts'.
If monk number 217641 (hey! that's me!) placed an urgent order for two boxes of paperclips on January 1, 2003 at 1pm EST, $rec would contain the following (first line in decimal, second in hex, third as characters where applicable). Pipe characters indicate field boundaries.
Offset   Contents (increasing addresses left to right)
     0   160  44  19  62| 41  82   3   0| 98 111 120 101 115  32 111 102
          A0  2C  13  3E| 29  52  03  00| 62  6f  78  65  73  20  6f  66
                                        |  b   o   x   e   s       o   f
    16    32 112  97 112 101 114  99 108 105 112 115   0   0   0   0   0
          20  70  61  70  65  72  63  6c  69  70  73  00  00  00  00  00
               p   a   p   e   r   c   l   i   p   s
    32     0   0   0   0   0   0   0   0|  2   0|  1   0
          00  00  00  00  00  00  00  00| 02  00| 01  00
Let's figure out where all of that came from. The first template item is an 'l' which packs a long. A long is 32 bits or four bytes. The value that was stored came from the time function. The actual value was 1041444000 or 0x3e132ca0. See how that fits into the beginning of the buffer? My system has an Intel Pentium processor which is little endian, so the least significant byte (0xa0) comes first.
The second template item is an 'i'. This calls for an integer of the machine's native size. The Pentium is a 32 bit processor so again we pack into four bytes. The monk's number is 217641 or 0x00035229.
The third template item is 'Z32'. This specifies a 32 character null terminated field. You can see the string 'boxes of paperclips' stored next in the buffer followed by zeros (null characters) until the 32 bytes have been filled.
The last template item is 's2'. This calls for two shorts which are 16 bit integers. This consumes two values from the list of values passed to pack. 16 bits get stored in two bytes. The first value was the quantity 2 and the second was the 1 indicating urgent. These two values occupy the last four bytes of the buffer.
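If you want to reproduce that buffer on a machine that is not a 32 bit little-endian system, one option is to swap the native formats for explicit ones. The sketch below is an assumption-laden variant of the example, not the template used above: it replaces 'l', 'i' and 's' with 'V' and 'v' so the bytes come out in the same order everywhere, and it hard-codes the time value from the walkthrough.

```perl
use strict;
use warnings;

# Fixed inputs from the walkthrough, so the output is repeatable.
my $when   = 1041444000;            # January 1, 2003 at 1pm EST
my $emp_id = 217641;
my $item   = "boxes of paperclips";

# 'V' and 'v' force 32 and 16 bit little-endian values on any host.
my $rec = pack('V V Z32 v2', $when, $emp_id, $item, 2, 1);

print "total length: ", length($rec), "\n";                     # 44
print "time:     ", unpack('H*', substr($rec,  0,  4)), "\n";   # a02c133e
print "employee: ", unpack('H*', substr($rec,  4,  4)), "\n";   # 29520300
print "item:     ", unpack('Z*', substr($rec,  8, 32)), "\n";   # boxes of paperclips
print "shorts:   ", unpack('H*', substr($rec, 40,  4)), "\n";   # 02000100
```

The substr offsets (0, 4, 8 and 40) are the field boundaries marked with pipe characters in the dump.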
Perl's unpack function
Unbeknownst to us when we wrote the web side of this application, someone was porting the backend from C to Perl (something about eating dog food, I don't think I heard it right). But, since we've already written the web side of the application, they figured they would just use the same data format. Therefore, they need to use unpack to read the data we sent them.
unpack is kind of the opposite of pack. pack takes a template string and a list of values and returns a scalar. unpack takes a template string and a scalar and returns a list of values.
Theoretically, if we give unpack the same template string and the scalar produced by pack, we should get back the list of values we passed to pack. I say theoretically because if the unpacking is done on a machine with a different byte order (big vs. little endian) or a different word size (16, 32, 64 bit), unpack might interpret the data differently than pack wrote it. The formats we used all used our machine's native byte order and 'i' could be different sizes on different machines so we could be in trouble. But in our simple case, we'll assume the backend runs on the same machine as the web interface.
To unpack the data we wrote, the backend program would use a statement like this.
($order_time, $monk, $itemname, $quantity, $ignore) =
    unpack( "l i Z32 s2", $rec );
Notice that the template string is identical to the one we used above for packing and the same information is returned in the same order (except they used $ignore where we packed with $urgent, what are they trying to say?).
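A quick way to convince yourself of the round trip is to compare what went in with what comes out. This is a minimal sketch using the values from the example; the array names are invented, and it assumes packing and unpacking happen on the same machine.

```perl
use strict;
use warnings;

my $template = 'l i Z32 s2';
my @in  = (1041444000, 217641, "boxes of paperclips", 2, 1);

my $rec = pack($template, @in);       # list of values -> one scalar
my @out = unpack($template, $rec);    # one scalar -> list of values

print join(', ', @out), "\n";
print "round trip ", ("@in" eq "@out" ? "ok" : "FAILED"), "\n";
```

Because 'Z32' strips the padding nulls on the way out, the item name comes back exactly as it was passed in.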
Integer Formats
aka, Why all those template types?
You may be asking why there are so many different ways to write the same data type. 'i', 'l', 'N', and 'V' could all be used to write a 32 bit integer to a buffer. Why use any specific one? Well, that depends on what you are trying to exchange information with.
If you are only going to be exchanging information with programs on the same machine, you can use 'i', 'l', 's', and 'q' and their uppercase unsigned counterparts. Since both the reading and writing programs will be running on the same system architecture, you might as well use the native formats.
If you are writing a program to read files whose layout is architecture specific, use the 'n', 'N', 'v' and 'V' formats. This way, you will know that you are interpreting the information correctly no matter what architecture your program is running on. For example, the 'wav' file format is defined for Windows on the Intel processor which is little endian. If you were trying to read the header of a 'wav' file, you should use 'v' and 'V' to read out 16 and 32 bit values respectively.
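As a rough illustration of the 'wav' case, the sketch below builds a stand-in for the first 36 bytes of a PCM file and reads it back with 'v' and 'V'. The field values are hypothetical (16 bit stereo at 44100 Hz with 1000 bytes of samples), the variable names are mine, and a real reader would take the bytes from a file and should not assume the format chunk always comes first.

```perl
use strict;
use warnings;

# Layout assumed here: RIFF chunk header, then a 16 byte PCM format chunk.
my $template = 'a4 V a4 a4 V v v V V v v';

# Hypothetical header, built in memory instead of being read from a file.
my $header = pack($template,
    'RIFF', 36 + 1000, 'WAVE',      # chunk id, remaining file size, form type
    'fmt ', 16,                     # format chunk id and its size
    1, 2, 44100, 176400, 4, 16);    # PCM, channels, rate, bytes/sec, block align, bits

my ($riff, $size, $wave, $fmt, $fmt_size,
    $format, $channels, $rate, $byte_rate, $align, $bits)
    = unpack($template, $header);

printf "%s/%s: %d channel(s), %d Hz, %d bits per sample\n",
    $riff, $wave, $channels, $rate, $bits;
```

Because the template uses 'v' and 'V' rather than 's' and 'l', the same code reads the header correctly on a big-endian machine.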
The 'n' and 'N' formats are called "network order" for a reason: they are the order specified for TCP/IP communications. If you are doing certain types of network programming, you will need to use these formats.
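Here is a sketch of what that can look like. The wire format is invented for illustration (a 16 bit message type, a 32 bit payload length, then the payload); it is not taken from any real protocol.

```perl
use strict;
use warnings;

# Hypothetical message: 16 bit type, 32 bit payload length, payload bytes.
my $payload = "hello";
my $msg = pack('n N a*', 7, length($payload), $payload);

print unpack('H*', $msg), "\n";   # 00070000000568656c6c6f

# The receiver reads the fixed-size header first, then the payload it describes.
my ($type, $len) = unpack('n N', $msg);
my $body = substr($msg, 6, $len);   # the header is 2 + 4 = 6 bytes long
print "type $type, $len bytes: $body\n";
```

In the hex output the most significant bytes come first in both header fields, whatever the byte order of the machine that ran the code.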
String formats
Choosing between the string formats is a little different. You would probably choose between 'a', 'A' and 'Z' depending on the language of the other program. If the other program is written in C or C++, you probably want 'a' or 'Z'. 'A' would be a good choice for COBOL or FORTRAN.
'a', 'A', and 'Z' formats
When packing, 'a' and 'Z' with a count fill extra locations with nulls. 'A' fills the extra locations with spaces. When unpacking, 'A' removes trailing spaces and nulls, 'Z' strips everything after the first null, and 'a' returns the full field as is.
Examples
pack('a8',"hello") produces "hello\0\0\0"
pack('Z8',"hello") produces "hello\0\0\0"
pack('A8',"hello") produces "hello   "
unpack('a8',"hello\0\0\0") produces "hello\0\0\0"
unpack('Z8',"hello\0\0\0") produces "hello"
unpack('A8',"hello   ") produces "hello"
unpack('A8',"hello\0\0\0") produces "hello"
'b' and 'B' formats
The 'b' and 'B' formats pack strings consisting of '0' and '1' characters to bytes and unpack bytes to strings of '0' and '1' characters. Perl treats even valued characters as 0 and odd valued characters as 1 while packing. The difference between the two is the order of the bits within each byte. With 'b', the bits are specified in increasing order. With 'B', in descending order. The count represents the number of bits to pack.
Examples
ord(pack('b8','00100110')) produces 100 (4 + 32 + 64)
ord(pack('B8','00100110')) produces 38 (32 + 4 + 2)
'h' and 'H' formats
The 'h' and 'H' formats pack strings containing hexadecimal digits. 'h' takes the low nybble first, 'H' takes the high nybble first. The count represents the number of nybbles to pack. In case you were wondering, a nybble is half a byte.
Examples
Each of the following returns a two byte scalar.
pack('h4','1234') produces 0x21,0x43
pack('H4','1234') produces 0x12,0x34
Additional Information
Perl 5.8 includes its own tutorial for pack and unpack. That tutorial is a bit more in-depth than this one but some of the things it covers may be specific to perl 5.8. If you are still using perl 5.6, check your own documentation if things don't work as that tutorial describes.
There are more template characters that I haven't covered here. There are also ways to read and write counted ASCII fields as well as some additional tricks you can play with pack and unpack. Try perldoc -f pack on your system or refer to Programming Perl. And above all, don't be afraid to experiment (except on live programs). Use the DumpString function below to examine the buffers returned by pack until you understand how it manipulates data.
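For quick experiments you may not need a full dump routine; unpack itself can show you a buffer. The following is just a sketch with made-up values, using the 'H' and 'B' formats described earlier with a '*' count.

```perl
use strict;
use warnings;

my $buf = pack('n C', 0x1234, 38);

# 'H*' shows every byte as two hex digits, high nybble first.
my $hex  = unpack('H*', $buf);          # 123426
# 'B*' shows every byte as eight bits, most significant bit first.
my $bits = unpack('B*', $buf);          # 000100100011010000100110

# Split the hex string into one pair of digits per byte for readability.
my $spaced = join ' ', $hex =~ /../g;   # 12 34 26

print "$hex\n$bits\n$spaced\n";
```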
References
Programming Perl, Third Edition, Larry Wall, Tom Christiansen, and Jon Orwant, © 2000, 1996, 1991 O'Reilly & Associates, Inc. ISBN 0-596-00027-8
Thanks to bart for the reference to the pack/unpack tutorial from perl 5.8.
Thanks to Zaxo and jeffa for reviewing this document and sharing their own efforts at creating a tutorial.
Thanks to sulfericacid and PodMaster for inspiring this on the CB.
Example Code
The following program contains the examples in this document.
#!/usr/bin/perl -w
use strict;

# dump the contents of a string as decimal and hex bytes and characters
sub DumpString {
    my @a = unpack('C*',$_[0]);
    my $o = 0;
    while (@a) {
        my @b = splice @a,0,16;                   # next 16 bytes
        my @d = map sprintf("%03d",$_), @b;       # decimal values
        my @x = map sprintf("%02x",$_), @b;       # hex values
        my $c = substr($_[0],$o,16);
        $c =~ s/[[:^print:]]/ /g;                 # blank out unprintable characters
        # the join widths keep the three rows in four character columns
        printf "%6d %s\n",$o,join(' ',@d);
        print " " x 8,join('  ',@x),"\n";
        print " " x 9,join('   ',split(//,$c)),"\n";
        $o += 16;
    }
}

# place our web order
my $t = time;
my $emp_id = 217641;
my $item = "boxes of paperclips";
my $quan = 2;
my $urgent = 1;
my $rec = pack( "l i Z32 s2", $t, $emp_id, $item, $quan, $urgent);
DumpString($rec);

# process a web order
my ($order_time, $monk, $itemname, $quantity, $ignore) =
    unpack( "l i Z32 s2", $rec );
print "Order time: ",scalar localtime($order_time),"\n";
print "Placed by monk #$monk for $quantity $itemname\n";

# string formats
$rec = pack('a8',"hello");                # should produce 'hello\0\0\0'
DumpString($rec);
$rec = pack('Z8',"hello");                # should produce 'hello\0\0\0'
DumpString($rec);
$rec = pack('A8',"hello");                # should produce 'hello   '
DumpString($rec);
($rec) = unpack('a8',"hello\0\0\0");      # should produce 'hello\0\0\0'
DumpString($rec);
($rec) = unpack('Z8',"hello\0\0\0");      # should produce 'hello'
DumpString($rec);
($rec) = unpack('A8',"hello   ");         # should produce 'hello'
DumpString($rec);
($rec) = unpack('A8',"hello\0\0\0");      # should produce 'hello'
DumpString($rec);

# bit format
$rec = pack('b8',"00100110");             # should produce 0x64 (100)
DumpString($rec);
$rec = pack('B8',"00100110");             # should produce 0x26 (38)
DumpString($rec);

# hex format
$rec = pack('h4',"1234");                 # should produce 0x21,0x43
DumpString($rec);
$rec = pack('H4',"1234");                 # should produce 0x12,0x34
DumpString($rec);

http://nbpfaus.net/~pfau/
 
Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by diotalevi on Jan 06, 2003 at 18:11 UTC
I have a few comments and I'll just leave them here in no particular order:
I wish you would have defined "word" prior to using it willy-nilly. It's jargon that your tutorial's audience isn't likely to be familiar with. From WordNet: a word is a string of bits stored in computer memory; "large computers use words up to 64 bits long".
Bytes are almost always eight bits though that's not a universal constant. Perhaps it's infrequent enough that I didn't even need to mention it but this one always gets my goat.
Your use of "most" and "least" significant byte was also jargon. If you assume the value 0x12345678 then the most significant byte has the value 0x12 and the least significant has the value 0x78. From there the point on differently endian machines is just which order you start with when transcribing bytes.
Your use of memory addresses is obfuscatory. This is better written as "Byte 0, byte 1, byte 2, byte 3". The only point at which a perl programmer cares about memory addresses is when doing non-perl programming or with the 'p' or 'P' format options. The point here is to indicate an order to the bytes in memory - that byte 0 might be located at a memory address 1000 is entirely beside the point.
White space is allowed without consequence in an unpack/pack format. It's just ignored except when it's a fatal error. I haven't nailed it down but some uses of whitespace just don't parse. That may be a bug but it's worth noting. This just means that in general people should use whitespace in a format to enhance readability - it doesn't affect its operation.
I've never been clear on the bit order within a byte - can you expand on that? I used to think that the differently endian machines also shuffled the bit order around as well. At this point I'm just confused.
Re^2: Pack/Unpack Tutorial ("bit order")
by tye on Jun 08, 2005 at 02:26 UTC
I've never been clear on the bit order within a byte - can you expand on that? I used to think that the differently endian machines also shuffled the bit order around as well. At this point I'm just confused.
In many cases, you can ignore bit order since bits do not have separate memory addresses. "Byte order" matters because you can access a byte as either the "first" byte (lowest address) in a "string" of bytes or as the "high" or "low" byte in a multi-byte numeric value. If you don't try to do both, then "byte order" doesn't matter, but using pack or unpack often means you are looking at bytes both ways. But, there is no "first" bit in packed data.
If you have a text format ("unpacked" string) that shows bits (or hexadecimal nybbles or octal "digits") in a specific order, then you may have to worry about "bit order" (or other sub-byte order) if you've got something not using the near-universal "most significant digit first" ordering that is used when writing numbers in any base. Of course, pack and unpack (quite unfortunately) make a mess of this, as noted in Re^2: pack/unpack 6-bit fields. (precision) and (tye)Re: Ascending vs descending bit order.
Put another way, "byte order" is usually used in reference to a detail of a computer‘s design and "bit order" doesn‘t matter in this context. However, both "bit order" and "byte order" can be applied to text representations of data (or even other "unpacked" representations where bits from within a byte get encoded into multiple bytes/characters of some other representation).
-tye
Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by fredopalus on Jan 08, 2003 at 01:36 UTC
People having trouble understanding some things in this tutorial may find Assembly Language Step-by-Step a very helpful book. It focuses more on the hardware aspects of programming.
Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by NetWallah on Sep 12, 2003 at 19:06 UTC
Thanks, pfaut, for a good tutorial.
Your "Example code" has errors that need to be fixed. Undefined subroutine &main::DumpScalar called at line 38. Replacing all 11 occurrences of "DumpScalar" by "DumpString" corrects the problem.
Re: Pack/Unpack Tutorial (aka How the System Stores Data)
by planetscape on Jul 04, 2006 at 14:40 UTC
For a Spanish translation, please see Spanish translation of pack/unpack tutorial, by Hue-Bond.
planetscape