From 98175119b0c93597fd9ced4653f5a0f3afa06399 Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Thu, 14 Nov 2024 13:05:50 +0000 Subject: [PATCH 1/5] gh-119786: add code object doc, inline locations.md into it --- InternalDocs/README.md | 4 +- InternalDocs/_code_objects.md | 43 +++++++++++ InternalDocs/code_objects.md | 141 +++++++++++++++++++++++++++++++++- InternalDocs/compiler.md | 8 +- InternalDocs/interpreter.md | 2 +- InternalDocs/locations.md | 69 ----------------- Objects/lnotab_notes.txt | 2 +- 7 files changed, 187 insertions(+), 82 deletions(-) create mode 100644 InternalDocs/_code_objects.md delete mode 100644 InternalDocs/locations.md diff --git a/InternalDocs/README.md b/InternalDocs/README.md index 2ef6e653ac19d4..dbc858b276833c 100644 --- a/InternalDocs/README.md +++ b/InternalDocs/README.md @@ -24,9 +24,7 @@ Compiling Python Source Code Runtime Objects --- -- [Code Objects (coming soon)](code_objects.md) - -- [The Source Code Locations Table](locations.md) +- [Code Objects](code_objects.md) - [Generators (coming soon)](generators.md) diff --git a/InternalDocs/_code_objects.md b/InternalDocs/_code_objects.md new file mode 100644 index 00000000000000..6cd6098132fdfd --- /dev/null +++ b/InternalDocs/_code_objects.md @@ -0,0 +1,43 @@ + +Code objects +============ + +The interpreter uses a code object (``frame->f_code``) as its starting point. +Code objects contain many fields used by the interpreter, as well as some for use by debuggers and other tools. +In 3.11, the final field of a code object is an array of indeterminate length containing the bytecode, ``code->co_code_adaptive``. +(In previous versions the code object was a :class:`bytes` object, ``code->co_code``; it was changed to save an allocation and to allow it to be mutated.) + +Code objects are typically produced by the bytecode :ref:`compiler `, although they are often written to disk by one process and read back in by another. 
+The disk version of a code object is serialized using the :mod:`marshal` protocol. +Some code objects are pre-loaded into the interpreter using ``Tools/scripts/deepfreeze.py``, which writes ``Python/deepfreeze/deepfreeze.c``. + +Code objects are nominally immutable. +Some fields (including ``co_code_adaptive``) are mutable, but mutable fields are not included when code objects are hashed or compared. + +The locations table +------------------- + +Whenever an exception is raised, we add a traceback entry to the exception. +The ``tb_lineno`` field of a traceback entry is (lazily) set to the line number of the instruction that raised it. +This field is computed from the locations table, ``co_linetable`` (this name is an understatement), using :c:func:`PyCode_Addr2Line`. +This table has an entry for every instruction rather than for every ``try`` block, so a compact format is very important. + +The full design of the 3.11 locations table is written up in :cpy-file:`InternalDocs/locations.md`. +While there are rumors that this file is slightly out of date, it is still the best reference we have. +Don't be confused by :cpy-file:`Objects/lnotab_notes.txt`, which describes the 3.10 format. +For backwards compatibility this format is still supported by the ``co_lnotab`` property. + +The 3.11 location table format is different because it stores not just the starting line number for each instruction, but also the end line number, *and* the start and end column numbers. +Note that traceback objects don't store all this information -- they store the start line number, for backward compatibility, and the "last instruction" value. +The rest can be computed from the last instruction (``tb_lasti``) with the help of the locations table. +For Python code, a convenient method exists, :meth:`~codeobject.co_positions`, which returns an iterator of :samp:`({line}, {endline}, {column}, {endcolumn})` tuples, one per instruction. 
+There is also ``co_lines()`` which returns an iterator of :samp:`({start}, {end}, {line})` tuples, where :samp:`{start}` and :samp:`{end}` are bytecode offsets. +The latter is described by :pep:`626`; it is more compact, but doesn't return end line numbers or column offsets. +From C code, you have to call :c:func:`PyCode_Addr2Location`. + +Fortunately, the locations table is only consulted by exception handling (to set ``tb_lineno``) and by tracing (to pass the line number to the tracing function). +In order to reduce the overhead during tracing, the mapping from instruction offset to line number is cached in the ``_co_linearray`` field. + + +TODO: +- co_consts, co_names, co_varnames, and their ilk diff --git a/InternalDocs/code_objects.md b/InternalDocs/code_objects.md index 284a8b7aee5765..f81494bee0390e 100644 --- a/InternalDocs/code_objects.md +++ b/InternalDocs/code_objects.md @@ -1,5 +1,140 @@ -Code objects -============ +# Code objects -Coming soon. +A `CodeObject` is a builtin Python type that represents a compiled executable, +such as a compiled function or class. +It contains a sequence of bytecode instructions along with its associated +metadata: data which is necessary to execute the bytecode instructions (such +as the values of the constants they access) or context information such as +the source code location, which is useful for debuggers and other tools. + +Since 3.11, the final field of the `PyCodeObject` C struct is an array +of indeterminate length containing the bytecode, `code->co_code_adaptive`. +(In older versions the code object was a +[`bytes`](https://docs.python.org/dev/library/stdtypes.html#bytes) +object, `code->co_code`; this was changed to save an allocation and to +allow it to be mutated.) + +Code objects are typically produced by the bytecode [compiler](compiler.md), +although they are often written to disk by one process and read back in by another. 
+The disk version of a code object is serialized using the +[marshal](https://docs.python.org/dev/library/marshal.html) protocol. +Some code objects are pre-loaded into the interpreter using +[`Tools/build/deepfreeze.py`](../Tools/build/deepfreeze.py), +which writes +[`Python/deepfreeze/deepfreeze.c`](../Python/deepfreeze/deepfreeze.c). + +Code objects are nominally immutable. +Some fields (including `co_code_adaptive` and fields for runtime +information such as `_co_monitoring`) are mutable, but mutable fields are +not included when code objects are hashed or compared. + +## Source code locations + +Whenever an exception occurs, the interpreter adds a traceback entry to +the exception for the current frame, as well as each frame on the stack that +it unwinds. +The `tb_lineno` field of a traceback entry is (lazily) set to the line +number of the instruction that was executing in the frame at the time of +the exception. +This field is computed from the locations table, `co_linetable`, by the function +[`PyCode_Addr2Line`](https://docs.python.org/dev/c-api/code.html#c.PyCode_Addr2Line). +Despite its name, `co_linetable` includes more than line numbers; it represents +a 4-number source location for every instruction, indicating the precise line +and column at which it begins and ends. This is a significant amount of data, +so a compact format is very important. + +Note that traceback objects don't store all this information -- they store the start line +number, for backward compatibility, and the "last instruction" value. +The rest can be computed from the last instruction (`tb_lasti`) with the help of the +locations table. For Python code, there is a convenience method +[`codeobject.co_positions`](https://docs.python.org/dev/reference/datamodel.html#codeobject.co_positions) +which returns an iterator of `({line}, {endline}, {column}, {endcolumn})` tuples, +one per instruction. 
+There is also `co_lines()` which returns an iterator of `({start}, {end}, {line})` tuples, +where `{start}` and `{end}` are bytecode offsets. +The latter is described by [`PEP 626`](https://peps.python.org/pep-0626/); it is more +compact, but doesn't return end line numbers or column offsets. +From C code, you need to call +[`PyCode_Addr2Location`](https://docs.python.org/dev/c-api/code.html#c.PyCode_Addr2Location). + +As the locations table is only consulted by exception handling (to set ``tb_lineno``) +and by tracing (to pass the line number to the tracing function), lookup is not +performance critical. +In order to reduce the overhead during tracing, the mapping from instruction offset to +line number is cached in the ``_co_linearray`` field. + +### Format of the locations table + +The `co_linetable` bytes object of code objects contains a compact +representation of the source code positions of instructions, which are +returned by the `co_positions()` iterator. + +> [!NOTE] +> Not to be confused by [`Objects/lnotab_notes.txt`](Objects/lnotab_notes.txt), +> which describes the 3.10 format, that stores only that start line for each instruction. +> For backwards compatibility this format is still supported by the `co_lnotab` property. + +`co_linetable` consists of a sequence of location entries. +Each entry starts with a byte with the most significant bit set, followed by zero or more bytes with most significant bit unset. + +Each entry contains the following information: +* The number of code units covered by this entry (length) +* The start line +* The end line +* The start column +* The end column + +The first byte has the following format: + +Bit 7 | Bits 3-6 | Bits 0-2 + ---- | ---- | ---- + 1 | Code | Length (in code units) - 1 + +The codes are enumerated in the `_PyCodeLocationInfoKind` enum. 
+ +## Variable length integer encodings + +Integers are often encoded using a variable length integer encoding + +### Unsigned integers (varint) + +Unsigned integers are encoded in 6 bit chunks, least significant first. +Each chunk but the last has bit 6 set. +For example: + +* 63 is encoded as `0x3f` +* 200 is encoded as `0x48`, `0x03` + +### Signed integers (svarint) + +Signed integers are encoded by converting them to unsigned integers, using the following function: +```Python +def convert(s): + if s < 0: + return ((-s)<<1) | 1 + else: + return (s<<1) +``` + +*Location entries* + +The meaning of the codes and the following bytes are as follows: + +Code | Meaning | Start line | End line | Start column | End column + ---- | ---- | ---- | ---- | ---- | ---- + 0-9 | Short form | Δ 0 | Δ 0 | See below | See below + 10-12 | One line form | Δ (code - 10) | Δ 0 | unsigned byte | unsigned byte + 13 | No column info | Δ svarint | Δ 0 | None | None + 14 | Long form | Δ svarint | Δ varint | varint | varint + 15 | No location | None | None | None | None + +The Δ means the value is encoded as a delta from another value: +* Start line: Delta from the previous start line, or `co_firstlineno` for the first entry. +* End line: Delta from the start line + +*The short forms* + +Codes 0-9 are the short forms. The short form consists of two bytes, the second byte holding additional column information. The code is the start column divided by 8 (and rounded down). +* Start column: `(code*8) + ((second_byte>>4)&7)` +* End column: `start_column + (second_byte&15)` diff --git a/InternalDocs/compiler.md b/InternalDocs/compiler.md index 37964bd99428df..ed4cfb23ca51f7 100644 --- a/InternalDocs/compiler.md +++ b/InternalDocs/compiler.md @@ -443,14 +443,12 @@ reference to the source code (filename, etc). 
All of this is implemented by Code objects ============ -The result of `PyAST_CompileObject()` is a `PyCodeObject` which is defined in +The result of `_PyAST_Compile()` is a `PyCodeObject` which is defined in [Include/cpython/code.h](../Include/cpython/code.h). And with that you now have executable Python bytecode! -The code objects (byte code) are executed in [Python/ceval.c](../Python/ceval.c). -This file will also need a new case statement for the new opcode in the big switch -statement in `_PyEval_EvalFrameDefault()`. - +The code objects (byte code) are executed in `_PyEval_EvalFrameDefault()` +in [Python/ceval.c](../Python/ceval.c). Important files =============== diff --git a/InternalDocs/interpreter.md b/InternalDocs/interpreter.md index dcfddc99370c0e..4c10cbbed37735 100644 --- a/InternalDocs/interpreter.md +++ b/InternalDocs/interpreter.md @@ -16,7 +16,7 @@ from the instruction definitions in [Python/bytecodes.c](../Python/bytecodes.c) which are written in [a DSL](../Tools/cases_generator/interpreter_definition.md) developed for this purpose. -Recall that the [Python Compiler](compiler.md) produces a [`CodeObject`](code_object.md), +Recall that the [Python Compiler](compiler.md) produces a [`CodeObject`](code_objects.md), which contains the bytecode instructions along with static data that is required to execute them, such as the consts list, variable names, [exception table](exception_handling.md#format-of-the-exception-table), and so on. diff --git a/InternalDocs/locations.md b/InternalDocs/locations.md deleted file mode 100644 index 91a7824e2a8e4d..00000000000000 --- a/InternalDocs/locations.md +++ /dev/null @@ -1,69 +0,0 @@ -# Locations table - -The `co_linetable` bytes object of code objects contains a compact -representation of the source code positions of instructions, which are -returned by the `co_positions()` iterator. - -`co_linetable` consists of a sequence of location entries. 
-Each entry starts with a byte with the most significant bit set, followed by zero or more bytes with most significant bit unset. - -Each entry contains the following information: -* The number of code units covered by this entry (length) -* The start line -* The end line -* The start column -* The end column - -The first byte has the following format: - -Bit 7 | Bits 3-6 | Bits 0-2 - ---- | ---- | ---- - 1 | Code | Length (in code units) - 1 - -The codes are enumerated in the `_PyCodeLocationInfoKind` enum. - -## Variable length integer encodings - -Integers are often encoded using a variable length integer encoding - -### Unsigned integers (varint) - -Unsigned integers are encoded in 6 bit chunks, least significant first. -Each chunk but the last has bit 6 set. -For example: - -* 63 is encoded as `0x3f` -* 200 is encoded as `0x48`, `0x03` - -### Signed integers (svarint) - -Signed integers are encoded by converting them to unsigned integers, using the following function: -```Python -def convert(s): - if s < 0: - return ((-s)<<1) | 1 - else: - return (s<<1) -``` - -## Location entries - -The meaning of the codes and the following bytes are as follows: - -Code | Meaning | Start line | End line | Start column | End column - ---- | ---- | ---- | ---- | ---- | ---- - 0-9 | Short form | Δ 0 | Δ 0 | See below | See below - 10-12 | One line form | Δ (code - 10) | Δ 0 | unsigned byte | unsigned byte - 13 | No column info | Δ svarint | Δ 0 | None | None - 14 | Long form | Δ svarint | Δ varint | varint | varint - 15 | No location | None | None | None | None - -The Δ means the value is encoded as a delta from another value: -* Start line: Delta from the previous start line, or `co_firstlineno` for the first entry. -* End line: Delta from the start line - -### The short forms - -Codes 0-9 are the short forms. The short form consists of two bytes, the second byte holding additional column information. The code is the start column divided by 8 (and rounded down). 
-* Start column: `(code*8) + ((second_byte>>4)&7)` -* End column: `start_column + (second_byte&15)` diff --git a/Objects/lnotab_notes.txt b/Objects/lnotab_notes.txt index 0f3599340318f0..003f78acc32193 100644 --- a/Objects/lnotab_notes.txt +++ b/Objects/lnotab_notes.txt @@ -1,7 +1,7 @@ Description of the internal format of the line number table in Python 3.10 and earlier. -(For 3.11 onwards, see Objects/locations.md) +(For 3.11 onwards, see InternalDocs/locations.md) Conceptually, the line number table consists of a sequence of triples: start-offset (inclusive), end-offset (exclusive), line-number. From cd4142dd0f3faa023c4f61e72933e49560a583bf Mon Sep 17 00:00:00 2001 From: Irit Katriel <1055913+iritkatriel@users.noreply.github.com> Date: Thu, 14 Nov 2024 16:02:37 +0000 Subject: [PATCH 2/5] Update Objects/lnotab_notes.txt --- Objects/lnotab_notes.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Objects/lnotab_notes.txt b/Objects/lnotab_notes.txt index 003f78acc32193..335e441cfded3d 100644 --- a/Objects/lnotab_notes.txt +++ b/Objects/lnotab_notes.txt @@ -1,7 +1,7 @@ Description of the internal format of the line number table in Python 3.10 and earlier. -(For 3.11 onwards, see InternalDocs/locations.md) +(For 3.11 onwards, see InternalDocs/code_objects.md) Conceptually, the line number table consists of a sequence of triples: start-offset (inclusive), end-offset (exclusive), line-number. 
From b4d819de55ad1bce754a3f3124c974fb80ada246 Mon Sep 17 00:00:00 2001 From: Irit Katriel <1055913+iritkatriel@users.noreply.github.com> Date: Thu, 14 Nov 2024 16:03:11 +0000 Subject: [PATCH 3/5] Delete InternalDocs/_code_objects.md --- InternalDocs/_code_objects.md | 43 ----------------------------------- 1 file changed, 43 deletions(-) delete mode 100644 InternalDocs/_code_objects.md diff --git a/InternalDocs/_code_objects.md b/InternalDocs/_code_objects.md deleted file mode 100644 index 6cd6098132fdfd..00000000000000 --- a/InternalDocs/_code_objects.md +++ /dev/null @@ -1,43 +0,0 @@ - -Code objects -============ - -The interpreter uses a code object (``frame->f_code``) as its starting point. -Code objects contain many fields used by the interpreter, as well as some for use by debuggers and other tools. -In 3.11, the final field of a code object is an array of indeterminate length containing the bytecode, ``code->co_code_adaptive``. -(In previous versions the code object was a :class:`bytes` object, ``code->co_code``; it was changed to save an allocation and to allow it to be mutated.) - -Code objects are typically produced by the bytecode :ref:`compiler `, although they are often written to disk by one process and read back in by another. -The disk version of a code object is serialized using the :mod:`marshal` protocol. -Some code objects are pre-loaded into the interpreter using ``Tools/scripts/deepfreeze.py``, which writes ``Python/deepfreeze/deepfreeze.c``. - -Code objects are nominally immutable. -Some fields (including ``co_code_adaptive``) are mutable, but mutable fields are not included when code objects are hashed or compared. - -The locations table -------------------- - -Whenever an exception is raised, we add a traceback entry to the exception. -The ``tb_lineno`` field of a traceback entry is (lazily) set to the line number of the instruction that raised it. 
-This field is computed from the locations table, ``co_linetable`` (this name is an understatement), using :c:func:`PyCode_Addr2Line`. -This table has an entry for every instruction rather than for every ``try`` block, so a compact format is very important. - -The full design of the 3.11 locations table is written up in :cpy-file:`InternalDocs/locations.md`. -While there are rumors that this file is slightly out of date, it is still the best reference we have. -Don't be confused by :cpy-file:`Objects/lnotab_notes.txt`, which describes the 3.10 format. -For backwards compatibility this format is still supported by the ``co_lnotab`` property. - -The 3.11 location table format is different because it stores not just the starting line number for each instruction, but also the end line number, *and* the start and end column numbers. -Note that traceback objects don't store all this information -- they store the start line number, for backward compatibility, and the "last instruction" value. -The rest can be computed from the last instruction (``tb_lasti``) with the help of the locations table. -For Python code, a convenient method exists, :meth:`~codeobject.co_positions`, which returns an iterator of :samp:`({line}, {endline}, {column}, {endcolumn})` tuples, one per instruction. -There is also ``co_lines()`` which returns an iterator of :samp:`({start}, {end}, {line})` tuples, where :samp:`{start}` and :samp:`{end}` are bytecode offsets. -The latter is described by :pep:`626`; it is more compact, but doesn't return end line numbers or column offsets. -From C code, you have to call :c:func:`PyCode_Addr2Location`. - -Fortunately, the locations table is only consulted by exception handling (to set ``tb_lineno``) and by tracing (to pass the line number to the tracing function). -In order to reduce the overhead during tracing, the mapping from instruction offset to line number is cached in the ``_co_linearray`` field. 
- - -TODO: -- co_consts, co_names, co_varnames, and their ilk From 93cc1921fa8c011b193ef65c39f803a0deb3d6fe Mon Sep 17 00:00:00 2001 From: Irit Katriel <1055913+iritkatriel@users.noreply.github.com> Date: Tue, 19 Nov 2024 19:23:51 +0000 Subject: [PATCH 4/5] Apply suggestions from code review Co-authored-by: Alex Waygood --- InternalDocs/code_objects.md | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/InternalDocs/code_objects.md b/InternalDocs/code_objects.md index f81494bee0390e..04172705827f9e 100644 --- a/InternalDocs/code_objects.md +++ b/InternalDocs/code_objects.md @@ -71,12 +71,14 @@ representation of the source code positions of instructions, which are returned by the `co_positions()` iterator. > [!NOTE] -> Not to be confused by [`Objects/lnotab_notes.txt`](Objects/lnotab_notes.txt), -> which describes the 3.10 format, that stores only that start line for each instruction. -> For backwards compatibility this format is still supported by the `co_lnotab` property. +> `co_linetable` is not to be confused with `co_lnotab`. +> For backwards compatibility, `co_lnotab` stores the format +> as it existed in Python 3.10 and lower: this older format +> stores only the start line for each instruction. +> See [`Objects/lnotab_notes.txt`](../Objects/lnotab_notes.txt) for more details. `co_linetable` consists of a sequence of location entries. -Each entry starts with a byte with the most significant bit set, followed by zero or more bytes with most significant bit unset. +Each entry starts with a byte with the most significant bit set, followed by zero or more bytes with the most significant bit unset. Each entry contains the following information: * The number of code units covered by this entry (length) @@ -93,20 +95,20 @@ Bit 7 | Bits 3-6 | Bits 0-2 The codes are enumerated in the `_PyCodeLocationInfoKind` enum. 
-## Variable length integer encodings +## Variable-length integer encodings -Integers are often encoded using a variable length integer encoding +Integers are often encoded using a variable-length integer encoding -### Unsigned integers (varint) +### Unsigned integers (`varint`) -Unsigned integers are encoded in 6 bit chunks, least significant first. +Unsigned integers are encoded in 6-bit chunks, least significant first. Each chunk but the last has bit 6 set. For example: * 63 is encoded as `0x3f` * 200 is encoded as `0x48`, `0x03` -### Signed integers (svarint) +### Signed integers (`svarint`) Signed integers are encoded by converting them to unsigned integers, using the following function: ```Python From dfc672bffd5ec6d5c3938abc7fb28de452fcfb9d Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Fri, 22 Nov 2024 18:10:55 +0000 Subject: [PATCH 5/5] mark's comments --- InternalDocs/code_objects.md | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/InternalDocs/code_objects.md b/InternalDocs/code_objects.md index 04172705827f9e..bee4a9d0a08915 100644 --- a/InternalDocs/code_objects.md +++ b/InternalDocs/code_objects.md @@ -19,10 +19,6 @@ Code objects are typically produced by the bytecode [compiler](compiler.md), although they are often written to disk by one process and read back in by another. The disk version of a code object is serialized using the [marshal](https://docs.python.org/dev/library/marshal.html) protocol. -Some code objects are pre-loaded into the interpreter using -[`Tools/build/deepfreeze.py`](../Tools/build/deepfreeze.py), -which writes -[`Python/deepfreeze/deepfreeze.c`](../Python/deepfreeze/deepfreeze.c). Code objects are nominally immutable. Some fields (including `co_code_adaptive` and fields for runtime @@ -58,8 +54,8 @@ compact, but doesn't return end line numbers or column offsets. From C code, you need to call [`PyCode_Addr2Location`](https://docs.python.org/dev/c-api/code.html#c.PyCode_Addr2Location). 
-As the locations table is only consulted by exception handling (to set ``tb_lineno``) -and by tracing (to pass the line number to the tracing function), lookup is not +As the locations table is only consulted when displaying a traceback and when +tracing (to pass the line number to the tracing function), lookup is not performance critical. In order to reduce the overhead during tracing, the mapping from instruction offset to line number is cached in the ``_co_linearray`` field. @@ -72,9 +68,10 @@ returned by the `co_positions()` iterator. > [!NOTE] > `co_linetable` is not to be confused with `co_lnotab`. -> For backwards compatibility, `co_lnotab` stores the format +> For backwards compatibility, `co_lnotab` exposes the format > as it existed in Python 3.10 and lower: this older format > stores only the start line for each instruction. +> It is lazily created from `co_linetable` when accessed. > See [`Objects/lnotab_notes.txt`](../Objects/lnotab_notes.txt) for more details. `co_linetable` consists of a sequence of location entries.
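
Reviewer note (not part of the patch): the `varint`/`svarint` encodings, the first-byte layout, and the short-form column rule documented in `InternalDocs/code_objects.md` can be checked with a small sketch. The helper names below (`encode_varint`, `decode_varint`, `to_unsigned`, `from_unsigned`, `split_first_byte`, `decode_short_form`) are hypothetical and chosen for illustration; they are not CPython APIs. The sketch covers only the rules stated in the document and does not parse a real `co_linetable`.

```python
def encode_varint(value):
    """Encode an unsigned int in 6-bit chunks, least significant first."""
    if value < 0:
        raise ValueError("varint encodes unsigned integers only")
    out = bytearray()
    while value >= 64:
        out.append(0x40 | (value & 0x3F))  # bit 6 set: more chunks follow
        value >>= 6
    out.append(value)  # last chunk has bit 6 clear
    return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at data[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x3F) << shift
        shift += 6
        if not byte & 0x40:  # bit 6 clear marks the last chunk
            return value, pos


def to_unsigned(s):
    """Same mapping as the document's convert(): sign goes into bit 0."""
    return ((-s) << 1) | 1 if s < 0 else s << 1


def from_unsigned(u):
    """Inverse of to_unsigned (hypothetical helper, not in the document)."""
    return -(u >> 1) if u & 1 else u >> 1


def split_first_byte(byte):
    """Split the first byte of a location entry into (code, length)."""
    if not byte & 0x80:
        raise ValueError("first byte of an entry must have bit 7 set")
    code = (byte >> 3) & 0x0F   # bits 3-6
    length = (byte & 0x07) + 1  # bits 0-2 store length - 1
    return code, length


def decode_short_form(code, second_byte):
    """Return (start_column, end_column) for codes 0-9."""
    start_column = (code * 8) + ((second_byte >> 4) & 7)
    end_column = start_column + (second_byte & 15)
    return start_column, end_column
```

For instance, `encode_varint(200)` reproduces the `0x48`, `0x03` pair given in the document, and `decode_varint` reverses it. A full decoder would also need the per-code handling from the location-entries table, which this sketch leaves out.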