bpo-47189: What's New in 3.11: Faster CPython (GH-32235)

Co-authored-by: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Co-authored-by: Guido van Rossum <gvanrossum@users.noreply.github.com>
Co-authored-by: Irit Katriel <1055913+iritkatriel@users.noreply.github.com>
Ken Jin 2022-04-06 18:38:25 +07:00 committed by GitHub
parent 074da78802
commit 9ffe47df54
3 changed files with 223 additions and 7 deletions


@ -211,6 +211,8 @@ directory. This is an error unless the replacement is intended. See section
.. %
Do we need stuff on zip files etc. ? DUBOIS
.. _tut-pycache:
"Compiled" Python files
-----------------------


@ -62,6 +62,8 @@ Summary -- Release highlights
.. This section singles out the most important changes in Python 3.11.
Brevity is key.
- Python 3.11 is between 10-60% faster than Python 3.10. On average, we measured a
  1.22x speedup on the standard benchmark suite. See `Faster CPython`_ for details.
.. PEP-sized items next.
@ -477,13 +479,6 @@ Optimizations
almost eliminated when no exception is raised.
(Contributed by Mark Shannon in :issue:`40222`.)
* Method calls with keywords are now faster due to bytecode
changes which avoid creating bound method instances. Previously, this
optimization was applied only to method calls with purely positional
arguments.
(Contributed by Ken Jin and Mark Shannon in :issue:`26110`, based on ideas
implemented in PyPy.)
* Pure ASCII strings are now normalized in constant time by :func:`unicodedata.normalize`.
(Contributed by Dong-hee Na in :issue:`44987`.)
@ -498,6 +493,223 @@ Optimizations
(Contributed by Inada Naoki in :issue:`46845`.)
Faster CPython
==============
CPython 3.11 is on average `1.22x faster <https://github.com/faster-cpython/ideas/blob/main/main-vs-310.rst>`_
than CPython 3.10 when measured with the
`pyperformance <https://github.com/python/pyperformance>`_ benchmark suite,
and compiled with GCC on Ubuntu Linux. Depending on your workload, the speedup
could be anywhere from 10% to 60%.
This project focuses on two major areas in Python: faster startup and faster
runtime. Other optimizations not under this project are listed in `Optimizations`_.
Faster Startup
--------------
Frozen imports / Static code objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Python caches bytecode in the :ref:`__pycache__<tut-pycache>` directory to
speed up module loading.
Previously in 3.10, Python module execution looked like this:

.. code-block:: text

   Read __pycache__ -> Unmarshal -> Heap allocated code object -> Evaluate

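The 3.10 pipeline above can be imitated by hand with the standard ``marshal`` module. This is only a rough sketch: the real import system also reads and validates a ``.pyc`` header, and the source string and the ``namespace`` name here are made up for illustration.

```python
import marshal

# Hypothetical module source, used only for this illustration.
source = "answer = 6 * 7\n"

# Compile the source text to a code object.
code = compile(source, "<example>", "exec")

# Serialize the code object; a .pyc file stores bytes like these after a header.
payload = marshal.dumps(code)

# "Unmarshal": rebuild a heap-allocated code object from the bytes.
restored = marshal.loads(payload)

# "Evaluate": run the restored code object in a fresh namespace.
namespace = {}
exec(restored, namespace)
print(namespace["answer"])  # 42
```

Frozen modules skip the read and unmarshal steps because their code objects are already part of the interpreter binary.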
In Python 3.11, the core modules essential for Python startup are "frozen".
This means that their code objects (and bytecode) are statically allocated
by the interpreter. This reduces the steps in the module execution process to this:

.. code-block:: text

   Statically allocated code object -> Evaluate

Interpreter startup is now 10-15% faster in Python 3.11. This has a big
impact for short-running programs using Python.
(Contributed by Eric Snow, Guido van Rossum and Kumar Aditya in numerous issues.)
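One way to see which already-imported modules are frozen is to look at the ``origin`` of each module's import spec. This is a minimal sketch; the helper name ``loaded_frozen_modules`` is hypothetical, and the list it returns depends on the Python version and on how the interpreter was built.

```python
import sys


def loaded_frozen_modules():
    """Return the sorted names of loaded modules whose spec origin is 'frozen'."""
    names = []
    # Copy the items so that imports triggered elsewhere cannot change the
    # dictionary while we iterate over it.
    for name, module in list(sys.modules.items()):
        spec = getattr(module, "__spec__", None)
        if spec is not None and spec.origin == "frozen":
            names.append(name)
    return sorted(names)


print(loaded_frozen_modules())
```

On a release build of 3.11 the list is expected to be longer than on 3.10, since more startup modules are frozen.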
Faster Runtime
--------------
Cheaper, lazy Python frames
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Python frames are created whenever Python calls a Python function. This frame
holds execution information. The following are new frame optimizations:
- Streamlined the frame creation process.
- Avoided memory allocation by generously re-using frame space on the C stack.
- Streamlined the internal frame struct to contain only essential information.
Frames previously held extra debugging and memory management information.
Old-style frame objects are now created only when required by debuggers. For
most user code, no frame objects are created at all. As a result, nearly all
Python function calls have sped up significantly. We measured a 3-7% speedup
in pyperformance.
(Contributed by Mark Shannon in :issue:`44590`.)
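Code that asks for a frame explicitly still gets a full frame object. The sketch below uses ``sys._getframe()``, which is a CPython implementation detail, to walk two levels of the call stack; the function names are made up for illustration.

```python
import sys


def current_function_names():
    # Requesting the frame forces CPython to materialize a frame object
    # for this call, which ordinary code never needs.
    frame = sys._getframe()
    names = []
    # Record this function and its immediate caller only.
    while frame is not None and len(names) < 2:
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    return names


def caller():
    return current_function_names()


print(caller())  # ['current_function_names', 'caller']
```

Functions that never inspect frames, which is most user code, avoid this cost in 3.11.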
.. _inline-calls:
Inlined Python function calls
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
During a Python function call, Python will call an evaluating C function to
interpret that function's code. This effectively limits pure Python recursion to
what's safe for the C stack.
In 3.11, when CPython detects Python code calling another Python function,
it sets up a new frame, and "jumps" to the new code inside the new frame. This
avoids calling the C interpreting function altogether.
Most Python function calls now consume no C stack space. This speeds up
most such calls. In simple recursive functions like fibonacci or
factorial, a 1.7x speedup was observed. This also means recursive functions
can recurse significantly deeper (if the user increases the recursion limit).
We measured a 1-3% improvement in pyperformance.
(Contributed by Pablo Galindo and Mark Shannon in :issue:`45256`.)
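As a rough sketch of the recursion point above, the helper below temporarily raises the recursion limit and runs a pure-Python recursive function. The names ``depth`` and ``recurse_with_limit`` are hypothetical, and the depth of 3000 is an arbitrary modest value chosen so that it is also safe on older interpreters.

```python
import sys


def depth(n):
    """Recurse n levels deep in pure Python and return the depth reached."""
    return 0 if n == 0 else 1 + depth(n - 1)


def recurse_with_limit(n, limit):
    """Run depth(n) with the recursion limit raised to at least `limit`."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(max(old, limit))
    try:
        return depth(n)
    finally:
        # Always restore the previous limit.
        sys.setrecursionlimit(old)


print(recurse_with_limit(3000, 5000))  # 3000
```

How far the limit can be raised safely still depends on the platform and on any C code in the call chain.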
PEP 659: Specializing Adaptive Interpreter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
:pep:`659` is one of the key parts of the faster CPython project. The general
idea is that while Python is a dynamic language, most code has regions where
objects and types rarely change. This concept is known as *type stability*.
At runtime, Python will try to look for common patterns and type stability
in the executing code. Python will then replace the current operation with a
more specialized one. This specialized operation uses fast paths available only
to those use cases/types, which generally outperform their generic
counterparts. This also brings in another concept called *inline caching*, where
Python caches the results of expensive operations directly in the bytecode.
The specializer will also combine certain common instruction pairs into one
superinstruction. This reduces the overhead during execution.
Python will only specialize
when it sees code that is "hot" (executed multiple times). This prevents Python
from wasting time on run-once code. Python can also de-specialize when code is
too dynamic or when the use changes. Specialization is attempted periodically,
and specialization attempts are not too expensive. This allows specialization
to adapt to new circumstances.
(PEP written by Mark Shannon, with ideas inspired by Stefan Brunthaler.
See :pep:`659` for more information.)
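Specialization can be observed with the ``dis`` module, whose ``adaptive`` parameter was added in 3.11. The sketch below falls back to plain disassembly on older versions; ``add_numbers`` and ``opnames`` are hypothetical names, and the exact instruction names printed vary between Python versions.

```python
import dis
import sys


def add_numbers(a, b):
    return a + b


def opnames(func):
    """Return the opcode names of func, specialized forms included if supported."""
    if sys.version_info >= (3, 11):
        # adaptive=True shows specialized instructions once code is "hot".
        instructions = dis.get_instructions(func, adaptive=True)
    else:
        instructions = dis.get_instructions(func)
    return [ins.opname for ins in instructions]


before = opnames(add_numbers)
# Call the function many times with stable types so it becomes "hot".
for _ in range(10_000):
    add_numbers(1, 2)
after = opnames(add_numbers)

print(before)
print(after)
```

On 3.11 and later, the second list is expected to show a specialized form of the binary operation in place of the generic one.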
..
If I missed out anyone, please add them.
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Operation     | Form               | Specialization                                        | Operation speedup | Contributor(s)    |
|               |                    |                                                       | (up to)           |                   |
+===============+====================+=======================================================+===================+===================+
| Binary        | ``x+x; x*x; x-x;`` | Binary add, multiply and subtract for common types    | 10%               | Mark Shannon,     |
| operations    |                    | such as ``int``, ``float``, and ``str`` take custom   |                   | Dong-hee Na,      |
|               |                    | fast paths for their underlying types.                |                   | Brandt Bucher,    |
|               |                    |                                                       |                   | Dennis Sweeney    |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Subscript     | ``a[i]``           | Subscripting container types such as ``list``,        | 10-25%            | Irit Katriel,     |
|               |                    | ``tuple`` and ``dict`` directly index the underlying  |                   | Mark Shannon      |
|               |                    | data structures.                                      |                   |                   |
|               |                    |                                                       |                   |                   |
|               |                    | Subscripting custom ``__getitem__``                   |                   |                   |
|               |                    | is also inlined similar to :ref:`inline-calls`.       |                   |                   |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Store         | ``a[i] = z``       | Similar to subscripting specialization above.         | 10-25%            | Dennis Sweeney    |
| subscript     |                    |                                                       |                   |                   |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Calls         | ``f(arg)``         | Calls to common builtin (C) functions and types such  | 20%               | Mark Shannon,     |
|               | ``C(arg)``         | as ``len`` and ``str`` directly call their underlying |                   | Ken Jin           |
|               |                    | C version. This avoids going through the internal     |                   |                   |
|               |                    | calling convention.                                   |                   |                   |
|               |                    |                                                       |                   |                   |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Load          | ``print``          | The object's index in the globals/builtins namespace  | [1]_              | Mark Shannon      |
| global        | ``len``            | is cached. Loading globals and builtins requires      |                   |                   |
| variable      |                    | zero namespace lookups.                               |                   |                   |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Load          | ``o.attr``         | Similar to loading global variables. The attribute's  | [2]_              | Mark Shannon      |
| attribute     |                    | index inside the class/object's namespace is cached.  |                   |                   |
|               |                    | In most cases, attribute loading will require zero    |                   |                   |
|               |                    | namespace lookups.                                    |                   |                   |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Load          | ``o.meth()``       | The actual address of the method is cached. Method    | 10-20%            | Ken Jin,          |
| methods for   |                    | loading now has no namespace lookups -- even for      |                   | Mark Shannon      |
| call          |                    | classes with long inheritance chains.                 |                   |                   |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Store         | ``o.attr = z``     | Similar to load attribute optimization.               | 2%                | Mark Shannon      |
| attribute     |                    |                                                       | in pyperformance  |                   |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
| Unpack        | ``*seq``           | Specialized for common containers such as ``list``    | 8%                | Brandt Bucher     |
| Sequence      |                    | and ``tuple``. Avoids internal calling convention.    |                   |                   |
+---------------+--------------------+-------------------------------------------------------+-------------------+-------------------+
.. [1] A similar optimization has existed since Python 3.8. 3.11
       specializes for more forms and reduces some overhead.

.. [2] A similar optimization has existed since Python 3.10.
       3.11 specializes for more forms. Furthermore, all attribute loads should
       be sped up by :issue:`45947`.
Misc
----
* Objects now require less memory due to lazily created object namespaces. Their
namespace dictionaries now also share keys more freely.
(Contributed by Mark Shannon in :issue:`45340` and :issue:`40116`.)
* A more concise representation of exceptions in the interpreter reduced the
time required for catching an exception by about 10%.
(Contributed by Irit Katriel in :issue:`45711`.)
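To check exception-handling cost on a particular machine, a small ``timeit`` comparison is one option. This is only a measurement sketch: the function names are hypothetical, the iteration count is arbitrary, and the absolute timings will differ between machines and Python versions.

```python
import timeit


def no_raise():
    # A try block in which nothing is raised.
    try:
        return 1
    except ValueError:
        return 0


def raise_and_catch():
    # Raise an exception and catch it immediately.
    try:
        raise ValueError("boom")
    except ValueError:
        return 0


t_plain = timeit.timeit(no_raise, number=100_000)
t_caught = timeit.timeit(raise_and_catch, number=100_000)
print(f"no exception raised: {t_plain:.4f}s")
print(f"raise and catch:     {t_caught:.4f}s")
```

Running the same script under 3.10 and 3.11 gives a direct comparison for this one pattern.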
FAQ
---
| Q: How should I write my code to utilize these speedups?
|
| A: You don't have to change your code. Write Pythonic code that follows common
best practices. The Faster CPython project optimizes for common code
patterns we observe.
|
|
| Q: Will CPython 3.11 use more memory?
|
| A: Maybe not. We don't expect memory use to be more than 20% higher than in 3.10.
This is offset by memory optimizations for frame objects and object
dictionaries as mentioned above.
|
|
| Q: I don't see any speedups in my workload. Why?
|
| A: Certain code won't have noticeable benefits. If your code spends most of
its time on I/O operations, or already does most of its
computation in a C extension library like numpy, there won't be significant
speedup. This project currently benefits pure-Python workloads the most.
|
| Furthermore, the pyperformance figures are a geometric mean. Even within the
pyperformance benchmarks, certain benchmarks have slowed down slightly, while
others have sped up by nearly 2x!
|
|
| Q: Is there a JIT compiler?
|
| A: No. We're still exploring other optimizations.
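The geometric-mean point in the FAQ above can be illustrated with a few numbers. The per-benchmark speedups below are invented for this sketch and are not pyperformance results.

```python
from statistics import geometric_mean

# Hypothetical per-benchmark speedups (new speed / old speed).
# A value below 1.0 means that benchmark got slower.
speedups = [0.95, 1.05, 1.10, 1.25, 1.90]

mean = geometric_mean(speedups)
print(f"geometric mean speedup: {mean:.2f}x")
print(f"slowest benchmark:      {min(speedups):.2f}x")
print(f"fastest benchmark:      {max(speedups):.2f}x")
```

With these made-up figures the mean comes out at roughly 1.2x, even though one benchmark slowed down and another nearly doubled in speed.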
About
-----
Faster CPython explores optimizations for :term:`CPython`. The main team is
funded by Microsoft to work on this full-time. Pablo Galindo Salgado is also
funded by Bloomberg LP to work on the project part-time. Finally, many
contributors are volunteers from the community.
CPython bytecode changes
========================


@ -0,0 +1,2 @@
Add a What's New in Python 3.11 entry for the Faster CPython project.
Documentation by Ken Jin and Kumar Aditya.