diff --git a/Doc/library/struct.rst b/Doc/library/struct.rst index 620f50376be..50d70731f77 100644 --- a/Doc/library/struct.rst +++ b/Doc/library/struct.rst @@ -12,21 +12,25 @@ -------------- -This module performs conversions between Python values and C structs represented -as Python :class:`bytes` objects. This can be used in handling binary data -stored in files or from network connections, among other sources. It uses -:ref:`struct-format-strings` as compact descriptions of the layout of the C -structs and the intended conversion to/from Python values. +This module converts between Python values and C structs represented +as Python :class:`bytes` objects. Compact :ref:`format strings ` +describe the intended conversions to/from Python values. +The module's functions and objects can be used for two largely +distinct applications, data exchange with external sources (files or +network connections), or data transfer between the Python application +and the C layer. .. note:: - By default, the result of packing a given C struct includes pad bytes in - order to maintain proper alignment for the C types involved; similarly, - alignment is taken into account when unpacking. This behavior is chosen so - that the bytes of a packed struct correspond exactly to the layout in memory - of the corresponding C struct. To handle platform-independent data formats - or omit implicit pad bytes, use ``standard`` size and alignment instead of - ``native`` size and alignment: see :ref:`struct-alignment` for details. + When no prefix character is given, native mode is the default. It + packs or unpacks data based on the platform and compiler on which + the Python interpreter was built. + The result of packing a given C struct includes pad bytes which + maintain proper alignment for the C types involved; similarly, + alignment is taken into account when unpacking. In contrast, when + communicating data between external sources, the programmer is + responsible for defining byte ordering and padding between elements. + See :ref:`struct-alignment` for details. Several :mod:`struct` functions (and methods of :class:`Struct`) take a *buffer* argument. This refers to objects that implement the :ref:`bufferobjects` and @@ -102,10 +106,13 @@ The module defines the following exception and functions: Format Strings -------------- -Format strings are the mechanism used to specify the expected layout when -packing and unpacking data. They are built up from :ref:`format-characters`, -which specify the type of data being packed/unpacked. In addition, there are -special characters for controlling the :ref:`struct-alignment`. +Format strings describe the data layout when +packing and unpacking data. They are built up from :ref:`format characters`, +which specify the type of data being packed/unpacked. In addition, +special characters control the :ref:`byte order, size and alignment`. +Each format string consists of an optional prefix character which +describes the overall properties of the data and one or more format +characters which describe the actual data values and padding. .. _struct-alignment: @@ -116,6 +123,11 @@ Byte Order, Size, and Alignment By default, C types are represented in the machine's native format and byte order, and properly aligned by skipping pad bytes if necessary (according to the rules used by the C compiler). +This behavior is chosen so +that the bytes of a packed struct correspond exactly to the memory layout +of the corresponding C struct. +Whether to use native byte ordering +and padding or standard formats depends on the application. .. index:: single: @ (at); in struct format strings @@ -144,12 +156,10 @@ following table: If the first character is not one of these, ``'@'`` is assumed. -Native byte order is big-endian or little-endian, depending on the host -system. For example, Intel x86 and AMD64 (x86-64) are little-endian; -IBM z and most legacy architectures are big-endian; -and ARM, RISC-V and IBM Power feature switchable endianness -(bi-endian, though the former two are nearly always little-endian in practice). -Use ``sys.byteorder`` to check the endianness of your system. +Native byte order is big-endian or little-endian, depending on the +host system. For example, Intel x86, AMD64 (x86-64), and Apple M1 are +little-endian; IBM z and many legacy architectures are big-endian. +Use :data:`sys.byteorder` to check the endianness of your system. Native size and alignment are determined using the C compiler's ``sizeof`` expression. This is always combined with native byte order. @@ -231,9 +241,9 @@ platform-dependent. +--------+--------------------------+--------------------+----------------+------------+ | ``d`` | :c:expr:`double` | float | 8 | \(4) | +--------+--------------------------+--------------------+----------------+------------+ -| ``s`` | :c:expr:`char[]` | bytes | | | +| ``s`` | :c:expr:`char[]` | bytes | | \(9) | +--------+--------------------------+--------------------+----------------+------------+ -| ``p`` | :c:expr:`char[]` | bytes | | | +| ``p`` | :c:expr:`char[]` | bytes | | \(8) | +--------+--------------------------+--------------------+----------------+------------+ | ``P`` | :c:expr:`void \*` | integer | | \(5) | +--------+--------------------------+--------------------+----------------+------------+ @@ -292,8 +302,33 @@ Notes: format `_ for more information. (7) - For padding, ``x`` inserts null bytes. + When packing, ``'x'`` inserts one NUL byte. +(8) + The ``'p'`` format character encodes a "Pascal string", meaning a short + variable-length string stored in a *fixed number of bytes*, given by the count. + The first byte stored is the length of the string, or 255, whichever is + smaller. The bytes of the string follow. If the string passed in to + :func:`pack` is too long (longer than the count minus 1), only the leading + ``count-1`` bytes of the string are stored. If the string is shorter than + ``count-1``, it is padded with null bytes so that exactly count bytes in all + are used. Note that for :func:`unpack`, the ``'p'`` format character consumes + ``count`` bytes, but that the string returned can never contain more than 255 + bytes. + +(9) + For the ``'s'`` format character, the count is interpreted as the length of the + bytes, not a repeat count like for the other format characters; for example, + ``'10s'`` means a single 10-byte string mapping to or from a single + Python byte string, while ``'10c'`` means 10 + separate one byte character elements (e.g., ``cccccccccc``) mapping + to or from ten different Python byte objects. (See :ref:`struct-examples` + for a concrete demonstration of the difference.) + If a count is not given, it defaults to 1. For packing, the string is + truncated or padded with null bytes as appropriate to make it fit. For + unpacking, the resulting bytes object always has exactly the specified number + of bytes. As a special case, ``'0s'`` means a single, empty string (while + ``'0c'`` means 0 characters). A format character may be preceded by an integral repeat count. For example, the format string ``'4h'`` means exactly the same as ``'hhhh'``. @@ -301,15 +336,6 @@ the format string ``'4h'`` means exactly the same as ``'hhhh'``. Whitespace characters between formats are ignored; a count and its format must not contain whitespace though. -For the ``'s'`` format character, the count is interpreted as the length of the -bytes, not a repeat count like for the other format characters; for example, -``'10s'`` means a single 10-byte string, while ``'10c'`` means 10 characters. -If a count is not given, it defaults to 1. For packing, the string is -truncated or padded with null bytes as appropriate to make it fit. For -unpacking, the resulting bytes object always has exactly the specified number -of bytes. As a special case, ``'0s'`` means a single, empty string (while -``'0c'`` means 0 characters). - When packing a value ``x`` using one of the integer formats (``'b'``, ``'B'``, ``'h'``, ``'H'``, ``'i'``, ``'I'``, ``'l'``, ``'L'``, ``'q'``, ``'Q'``), if ``x`` is outside the valid range for that format @@ -319,17 +345,6 @@ then :exc:`struct.error` is raised. Previously, some of the integer formats wrapped out-of-range values and raised :exc:`DeprecationWarning` instead of :exc:`struct.error`. -The ``'p'`` format character encodes a "Pascal string", meaning a short -variable-length string stored in a *fixed number of bytes*, given by the count. -The first byte stored is the length of the string, or 255, whichever is -smaller. The bytes of the string follow. If the string passed in to -:func:`pack` is too long (longer than the count minus 1), only the leading -``count-1`` bytes of the string are stored. If the string is shorter than -``count-1``, it is padded with null bytes so that exactly count bytes in all -are used. Note that for :func:`unpack`, the ``'p'`` format character consumes -``count`` bytes, but that the string returned can never contain more than 255 -bytes. - .. index:: single: ? (question mark); in struct format strings For the ``'?'`` format character, the return value is either :const:`True` or @@ -345,18 +360,36 @@ Examples ^^^^^^^^ .. note:: - All examples assume a native byte order, size, and alignment with a - big-endian machine. + Native byte order examples (designated by the ``'@'`` format prefix or + lack of any prefix character) may not match what the reader's + machine produces as + that depends on the platform and compiler. -A basic example of packing/unpacking three integers:: +Pack and unpack integers of three different sizes, using big endian +ordering:: - >>> from struct import * - >>> pack('hhl', 1, 2, 3) - b'\x00\x01\x00\x02\x00\x00\x00\x03' - >>> unpack('hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03') - (1, 2, 3) - >>> calcsize('hhl') - 8 + >>> from struct import * + >>> pack(">bhl", 1, 2, 3) + b'\x01\x00\x02\x00\x00\x00\x03' + >>> unpack('>bhl', b'\x01\x00\x02\x00\x00\x00\x03' + (1, 2, 3) + >>> calcsize('>bhl') + 7 + +Attempt to pack an integer which is too large for the defined field:: + + >>> pack(">h", 99999) + Traceback (most recent call last): + File "", line 1, in + struct.error: 'h' format requires -32768 <= number <= 32767 + +Demonstrate the difference between ``'s'`` and ``'c'`` format +characters:: + + >>> pack("@ccc", b'1', b'2', b'3') + b'123' + >>> pack("@3s", b'123') + b'123' Unpacked fields can be named by assigning them to variables or by wrapping the result in a named tuple:: @@ -369,35 +402,132 @@ the result in a named tuple:: >>> Student._make(unpack('<10sHHb', record)) Student(name=b'raymond ', serialnum=4658, school=264, gradelevel=8) -The ordering of format characters may have an impact on size since the padding -needed to satisfy alignment requirements is different:: +The ordering of format characters may have an impact on size in native +mode since padding is implicit. In standard mode, the user is +responsible for inserting any desired padding. +Note in +the first ``pack`` call below that three NUL bytes were added after the +packed ``'#'`` to align the following integer on a four-byte boundary. +In this example, the output was produced on a little endian machine:: - >>> pack('ci', b'*', 0x12131415) - b'*\x00\x00\x00\x12\x13\x14\x15' - >>> pack('ic', 0x12131415, b'*') - b'\x12\x13\x14\x15*' - >>> calcsize('ci') + >>> pack('@ci', b'#', 0x12131415) + b'#\x00\x00\x00\x15\x14\x13\x12' + >>> pack('@ic', 0x12131415, b'#') + b'\x15\x14\x13\x12#' + >>> calcsize('@ci') 8 - >>> calcsize('ic') + >>> calcsize('@ic') 5 -The following format ``'llh0l'`` specifies two pad bytes at the end, assuming -longs are aligned on 4-byte boundaries:: +The following format ``'llh0l'`` results in two pad bytes being added +at the end, assuming the platform's longs are aligned on 4-byte boundaries:: - >>> pack('llh0l', 1, 2, 3) + >>> pack('@llh0l', 1, 2, 3) b'\x00\x00\x00\x01\x00\x00\x00\x02\x00\x03\x00\x00' -This only works when native size and alignment are in effect; standard size and -alignment does not enforce any alignment. - .. seealso:: Module :mod:`array` Packed binary storage of homogeneous data. - Module :mod:`xdrlib` - Packing and unpacking of XDR data. + Module :mod:`json` + JSON encoder and decoder. + + Module :mod:`pickle` + Python object serialization. + + +.. _applications: + +Applications +------------ + +Two main applications for the :mod:`struct` module exist, data +interchange between Python and C code within an application or another +application compiled using the same compiler (:ref:`native formats`), and +data interchange between applications using agreed upon data layout +(:ref:`standard formats`). Generally speaking, the format strings +constructed for these two domains are distinct. + + +.. _struct-native-formats: + +Native Formats +^^^^^^^^^^^^^^ + +When constructing format strings which mimic native layouts, the +compiler and machine architecture determine byte ordering and padding. +In such cases, the ``@`` format character should be used to specify +native byte ordering and data sizes. Internal pad bytes are normally inserted +automatically. It is possible that a zero-repeat format code will be +needed at the end of a format string to round up to the correct +byte boundary for proper alignment of consective chunks of data. + +Consider these two simple examples (on a 64-bit, little-endian +machine):: + + >>> calcsize('@lhl') + 24 + >>> calcsize('@llh') + 18 + +Data is not padded to an 8-byte boundary at the end of the second +format string without the use of extra padding. A zero-repeat format +code solves that problem:: + + >>> calcsize('@llh0l') + 24 + +The ``'x'`` format code can be used to specify the repeat, but for +native formats it is better to use a zero-repeat format like ``'0l'``. + +By default, native byte ordering and alignment is used, but it is +better to be explicit and use the ``'@'`` prefix character. + + +.. _struct-standard-formats: + +Standard Formats +^^^^^^^^^^^^^^^^ + +When exchanging data beyond your process such as networking or storage, +be precise. Specify the exact byte order, size, and alignment. Do +not assume they match the native order of a particular machine. +For example, network byte order is big-endian, while many popular CPUs +are little-endian. By defining this explicitly, the user need not +care about the specifics of the platform their code is running on. +The first character should typically be ``<`` or ``>`` +(or ``!``). Padding is the responsibility of the programmer. The +zero-repeat format character won't work. Instead, the user must +explicitly add ``'x'`` pad bytes where needed. Revisiting the +examples from the previous section, we have:: + + >>> calcsize('>> pack('>> calcsize('@llh') + 18 + >>> pack('@llh', 1, 2, 3) == pack('>> calcsize('>> calcsize('@llh0l') + 24 + >>> pack('@llh0l', 1, 2, 3) == pack('>> calcsize('>> calcsize('@llh0l') + 12 + >>> pack('@llh0l', 1, 2, 3) == pack('