cpython/Lib
Cody Maloney 2f5f19e783
gh-120754: Reduce system calls in full-file FileIO.readall() case (#120755)
This reduces the system call count of a simple program[0] that reads all
the `.rst` files in Doc by over 10% (5706 -> 4734 system calls on my
Linux system, 5813 -> 4875 on my macOS system).

This always reduces the number of `fstat()` calls, and reduces seek calls
most of the time. Stat was always called twice, once at open (to error early on
directories), and a second time to get the size of the file to be able
to read the whole file in one read. Now the size is cached with the
first call.

The code keeps an optimization that if the user had previously read a
lot of data, the current position is subtracted from the number of bytes
to read. That is somewhat expensive, so only do it on larger files;
otherwise just try to read the extra bytes and resize the PyBytes as
needed.
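
A rough Python sketch of the sizing logic described above, not the actual C implementation. The names `readall_sketch`, `read_file`, `DEFAULT_CHUNK` and `LARGE_FILE_CUTOFF`, and both constant values, are assumptions made for illustration only:

```python
import os

# Hypothetical values chosen for illustration; not taken from the commit.
DEFAULT_CHUNK = 8192
LARGE_FILE_CUTOFF = 64 * 1024


def readall_sketch(fd, cached_size):
    """Read fd to EOF, sizing the first read from a size cached at open time."""
    if cached_size is None or cached_size <= 0:
        bufsize = DEFAULT_CHUNK
    else:
        # Ask for one byte more than the cached size, matching the
        # 344-byte read of a 343-byte file in the trace.
        bufsize = cached_size + 1
        if cached_size > LARGE_FILE_CUTOFF:
            # Only larger files pay for the lseek() that finds the position.
            try:
                pos = os.lseek(fd, 0, os.SEEK_CUR)
            except OSError:
                pos = 0
            if 0 < pos <= cached_size:
                bufsize -= pos

    result = bytearray()
    while True:
        chunk = os.read(fd, bufsize)
        if not chunk:
            return bytes(result)
        result += chunk
        # Keep reading the unread remainder; fall back to a default chunk
        # if the file turned out to be bigger than the cached size.
        remaining = bufsize - len(chunk)
        bufsize = remaining if remaining > 0 else DEFAULT_CHUNK


def read_file(path):
    """Open, stat once, and read everything (hypothetical helper)."""
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        size = os.fstat(fd).st_size  # the single fstat(), done at open time
        return readall_sketch(fd, size)
    finally:
        os.close(fd)
```

For small files the sketch skips the position lookup and simply over-asks; the read loop still terminates correctly because a short or empty read is handled either way.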

I built a little test program to validate the behavior + assumptions
around relative costs and then ran it under `strace` to get a log of the
system calls. Full samples below[1].

After the changes, this is everything in one `filename.read_text()`:

```
openat(AT_FDCWD, "cpython/Doc/howto/clinic.rst", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0
ioctl(3, TCGETS, 0x7ffdfac04b40)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
read(3, ":orphan:\n\n.. This page is retain"..., 344) = 343
read(3, "", 1)                          = 0
close(3)                                = 0
```

This does make some tradeoffs:
1. If the file size changes between open() and readall(), this will
still get all the data but might have more read calls.
2. I experimented with avoiding the stat + cached result for small files
in general, but on my dev workstation at least that tended to reduce
performance compared to using the fstat().
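
Tradeoff 1 can be checked from Python. A minimal sketch, assuming a scratch file in a temporary directory; `read_after_growth` is a hypothetical helper name, and the sketch only verifies that all data is returned, not how many `read()` calls were needed:

```python
import os
import tempfile


def read_after_growth():
    """Grow a file between open() and readall(), then return what was read."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "grow.bin")
        with open(path, "wb") as f:
            f.write(b"a" * 100)
        # buffering=0 gives a raw FileIO object, the class this change touches.
        with open(path, "rb", buffering=0) as reader:
            # Append after the reader is open, so any size seen at open is stale.
            with open(path, "ab") as writer:
                writer.write(b"b" * 50)
            return reader.readall()


data = read_after_growth()
print(len(data))
```

Running this under `strace` would be one way to see the extra read calls the tradeoff mentions.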

[0]

```python3
from pathlib import Path

nlines = []
for filename in Path("cpython/Doc").glob("**/*.rst"):
    nlines.append(len(filename.read_text()))
```

[1]
Before small file:

```
openat(AT_FDCWD, "cpython/Doc/howto/clinic.rst", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0
ioctl(3, TCGETS, 0x7ffe52525930)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_CUR)                   = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0
read(3, ":orphan:\n\n.. This page is retain"..., 344) = 343
read(3, "", 1)                          = 0
close(3)                                = 0
```

After small file:

```
openat(AT_FDCWD, "cpython/Doc/howto/clinic.rst", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0
ioctl(3, TCGETS, 0x7ffdfac04b40)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
read(3, ":orphan:\n\n.. This page is retain"..., 344) = 343
read(3, "", 1)                          = 0
close(3)                                = 0
```

Before large file:

```
openat(AT_FDCWD, "cpython/Doc/c-api/typeobj.rst", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=133104, ...}) = 0
ioctl(3, TCGETS, 0x7ffe52525930)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_CUR)                   = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=133104, ...}) = 0
read(3, ".. highlight:: c\n\n.. _type-struc"..., 133105) = 133104
read(3, "", 1)                          = 0
close(3)                                = 0
```

After large file:

```
openat(AT_FDCWD, "cpython/Doc/c-api/typeobj.rst", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=133104, ...}) = 0
ioctl(3, TCGETS, 0x7ffdfac04b40)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_CUR)                   = 0
read(3, ".. highlight:: c\n\n.. _type-struc"..., 133105) = 133104
read(3, "", 1)                          = 0
close(3)                                = 0
```

Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
Co-authored-by: Erlend E. Aasland <erlend.aasland@protonmail.com>
Co-authored-by: Victor Stinner <vstinner@python.org>
2024-07-04 09:17:00 +02:00
..
__phello__
_pyrepl gh-118908: Use __main__ for the default PyREPL namespace (#121054) 2024-06-26 15:01:10 -04:00
asyncio gh-87744: fix waitpid race while calling send_signal in asyncio (#121126) 2024-07-01 10:17:36 +05:30
collections gh-120417: Fix "imported but unused" linter warnings (#120461) 2024-06-14 20:39:50 +02:00
concurrent gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
ctypes gh-61103: Support float and long double complex types in ctypes module (#121248) 2024-07-03 11:08:11 +02:00
curses gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
dbm gh-120417: Remove unused imports in the stdlib (#120420) 2024-06-12 20:56:42 +02:00
email Remove almost all unpaired backticks in docstrings (#119231) 2024-05-22 12:35:18 -04:00
encodings gh-85287: Change codecs to raise precise UnicodeEncodeError and UnicodeDecodeError (#113674) 2024-03-17 04:58:42 +00:00
ensurepip gh-120888: Bump bundled pip to 24.1.1 (#120889) 2024-06-27 09:09:54 +00:00
html
http gh-120485: Add an override of `allow_reuse_port` on classes subclassing `socketserver.TCPServer` (GH-120488) 2024-06-16 13:15:03 +01:00
idlelib gh-121008: Fix idlelib.run tests (#121046) 2024-06-26 15:41:16 +02:00
importlib gh-117983: Defer import of threading for lazy module loading (#120233) 2024-07-03 20:50:46 +00:00
json gh-95382: Improve performance of json encoder with indent (GH-118105) 2024-05-06 11:04:39 +03:00
logging gh-105623 Fix performance degradation in logging RotatingFileHandler (GH-105887) 2024-06-27 16:44:40 +00:00
multiprocessing gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
pathlib GH-73991: Support copying directory symlinks on older Windows (#120807) 2024-07-03 04:30:29 +01:00
pydoc_data Python 3.13.0b1 2024-05-08 11:21:00 +02:00
re gh-111259: Optimize complementary character sets in RE (GH-120742) 2024-06-20 07:19:32 +00:00
site-packages
sqlite3 gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
sysconfig gh-116622: Android sysconfig updates (#118352) 2024-05-01 16:47:54 +00:00
test gh-121141: add support for `copy.replace` to AST nodes (#121162) 2024-07-03 20:10:54 -07:00
tkinter gh-120211: Fix tkinter.ttk with Tcl/Tk 9.0 (GH-120213) 2024-06-07 10:49:07 +00:00
tomllib
turtledemo gh-120633: Move scrollbar and remove tear-off menus in turtledemo (#120634) 2024-06-19 02:20:54 -04:00
unittest gh-120732: Fix `name` passing to `Mock`, when using kwargs to `create_autospec` (#120737) 2024-06-19 21:35:11 +01:00
urllib gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
venv gh-90329: Add _winapi.GetLongPathName and GetShortPathName and use in venv to reduce warnings (GH-117817) 2024-04-15 15:36:06 +01:00
wsgiref Remove almost all unpaired backticks in docstrings (#119231) 2024-05-22 12:35:18 -04:00
xml gh-120417: Fix "imported but unused" linter warnings (#120461) 2024-06-14 20:39:50 +02:00
xmlrpc gh-120485: Add an override of `allow_reuse_port` on classes subclassing `socketserver.TCPServer` (GH-120488) 2024-06-16 13:15:03 +01:00
zipfile gh-119588: Implement zipfile.Path.is_symlink (zipp 3.19.0). (#119591) 2024-06-03 11:13:07 -04:00
zoneinfo gh-106233: Fix stacklevel in zoneinfo.InvalidTZPathWarning (GH-106234) 2024-02-06 15:08:56 +02:00
__future__.py
__hello__.py
_aix_support.py
_android_support.py gh-116622: Redirect stdout and stderr to system log when embedded in an Android app (#118063) 2024-04-30 16:00:31 +02:00
_collections_abc.py GH-120097: Make FrameLocalsProxy a mapping (#120101) 2024-06-19 17:54:13 +01:00
_colorize.py gh-117225: Move colorize functionality to own internal module (#118283) 2024-05-01 12:27:06 -06:00
_compat_pickle.py
_compression.py
_ios_support.py gh-119253: use ImportError in _ios_support (#119254) 2024-05-20 16:39:30 -04:00
_markupbase.py
_opcode_metadata.py GH-120507: Lower the `BEFORE_WITH` and `BEFORE_ASYNC_WITH` instructions. (#120640) 2024-06-18 12:17:46 +01:00
_osx_support.py gh-102362: Fix macOS version number in result of sysconfig.get_platform() (GH-112942) 2023-12-18 18:51:58 -05:00
_py_abc.py
_pydatetime.py gh-120713: Normalize year with century for datetime.strftime (GH-120820) 2024-06-29 09:32:42 +03:00
_pydecimal.py gh-118164: str(10**10000) hangs if the C _decimal module is missing (#118503) 2024-05-04 18:22:33 -05:00
_pyio.py gh-120754: Reduce system calls in full-file FileIO.readall() case (#120755) 2024-07-04 09:17:00 +02:00
_pylong.py gh-119057: Use better error messages for zero division (#119066) 2024-06-03 19:03:56 +03:00
_sitebuiltins.py
_strptime.py GH-70647: Deprecate strptime day of month parsing without a year present to avoid leap-year bugs (GH-117107) 2024-04-03 14:19:49 +02:00
_threading_local.py
_weakrefset.py
abc.py
antigravity.py
argparse.py gh-121018: Fix more cases of exiting in argparse when exit_on_error=False (GH-121056) 2024-06-28 17:21:59 +03:00
ast.py gh-121210: handle nodes with missing attributes/fields in `ast.compare` (#121211) 2024-07-02 16:23:17 +05:30
base64.py gh-118673: Remove shebang and executable bits from stdlib modules. (#119658) 2024-05-29 12:43:19 -04:00
bdb.py gh-58933: Make pdb return to caller frame correctly when f_trace is not set (#118979) 2024-05-13 13:38:21 +01:00
bisect.py
bz2.py gh-115961: Add name and mode attributes for compressed file-like objects (GH-116036) 2024-04-21 11:46:39 +03:00
cProfile.py gh-118673: Remove shebang and executable bits from stdlib modules. (#119658) 2024-05-29 12:43:19 -04:00
calendar.py gh-120567: Clarify weekday return in calendar.monthrange docstring (#120570) 2024-06-16 16:43:57 -04:00
cmd.py Remove almost all unpaired backticks in docstrings (#119231) 2024-05-22 12:35:18 -04:00
code.py gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
codecs.py gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
codeop.py gh-119521: Rename IncompleteInputError to _IncompleteInputError and remove from public API/ABI (GH-119680) 2024-06-24 14:08:12 +02:00
colorsys.py
compileall.py gh-117205: Increase chunksize when compiling pyc in parallel (#117206) 2024-04-03 15:24:24 -07:00
configparser.py Remove almost all unpaired backticks in docstrings (#119231) 2024-05-22 12:35:18 -04:00
contextlib.py gh-103791: handle `BaseExceptionGroup` in `contextlib.suppress()` (#111910) 2023-11-10 13:32:36 +00:00
contextvars.py
copy.py gh-121300: Add `replace` to `copy.__all__` (#121302) 2024-07-03 20:33:56 +05:30
copyreg.py
csv.py gh-114628: Display csv.Error without context (#115005) 2024-02-04 20:57:54 -05:00
dataclasses.py gh-120417: Remove unused imports in the stdlib (#120420) 2024-06-12 20:56:42 +02:00
datetime.py gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
decimal.py gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
difflib.py gh-115801: Only allow sequence of strings as input for difflib.unified_diff (GH-118333) 2024-06-10 14:06:18 +03:00
dis.py gh-120780: Show attribute name for LOAD_SPECIAL in dis output (#120781) 2024-06-20 07:07:24 -07:00
doctest.py Remove almost all unpaired backticks in docstrings (#119231) 2024-05-22 12:35:18 -04:00
enum.py gh-118650: Exclude `_repr_*` methods from Enum's _sunder_ reservation (GH-118651) 2024-05-07 12:35:51 +02:00
filecmp.py gh-57141: Add dircmp shallow option (GH-109499) 2024-03-04 17:27:43 +00:00
fileinput.py Use bool in fileinput.input() docstring and tests for the inplace argument (GH-111998) 2024-01-27 23:47:55 +02:00
fnmatch.py GH-72904: Add `glob.translate()` function (#106703) 2023-11-13 17:15:56 +00:00
fractions.py gh-119838: Treat Fraction as a real value in mixed arithmetic operations with complex (GH-119839) 2024-06-03 12:29:01 +03:00
ftplib.py Remove almost all unpaired backticks in docstrings (#119231) 2024-05-22 12:35:18 -04:00
functools.py gh-121027: Make the functools.partial object a method descriptor (GH-121089) 2024-07-03 09:02:15 +03:00
genericpath.py gh-117114: Make os.path.isdevdrive available on all platforms (GH-117115) 2024-03-25 22:55:11 +00:00
getopt.py Remove almost all unpaired backticks in docstrings (#119231) 2024-05-22 12:35:18 -04:00
getpass.py gh-76912: Raise OSError from any failure in getpass.getuser() (#29739) 2023-11-27 10:05:55 -08:00
gettext.py gh-88434: Emit deprecation warnings for non-integer numbers in gettext if translation not found (GH-110574) 2023-10-14 09:07:02 +03:00
glob.py GH-116380: Move pathlib-specific code from `glob` to `pathlib._abc`. (#120011) 2024-06-07 17:59:34 +01:00
graphlib.py
gzip.py gh-112346: Always set OS byte to 255, simpler gzip.compress function. (GH-120486) 2024-06-15 18:46:39 +00:00
hashlib.py gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
heapq.py gh-119721: Integrate documentation fixes into heapq module docstring. (gh-119722) 2024-05-29 11:39:34 -05:00
hmac.py gh-112999: Replace the outdated "deprecated" directives with "versionchanged" (GH-113000) 2023-12-12 18:31:04 +02:00
imaplib.py Remove almost all unpaired backticks in docstrings (#119231) 2024-05-22 12:35:18 -04:00
inspect.py gh-121027: Add a future warning in functools.partial.__get__ (#121086) 2024-06-27 11:47:20 +00:00
io.py gh-111356: io: Add missing documented objects to io.__all__ (#111370) 2023-11-10 16:18:52 +09:00
ipaddress.py gh-120128: fix description of argument to ipaddress.collapse_addresses() (#120131) 2024-06-06 00:52:40 +03:00
keyword.py
linecache.py linecache: Fix docstring location (#117948) 2024-04-16 15:37:18 -07:00
locale.py gh-91565: Replace bugs.python.org links with Devguide/GitHub ones (GH-91568) 2024-04-01 13:02:07 +00:00
lzma.py gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
mailbox.py gh-117467: Add preserving of mailbox owner on flush (GH-117510) 2024-04-04 13:32:53 +03:00
mimetypes.py Remove almost all unpaired backticks in docstrings (#119231) 2024-05-22 12:35:18 -04:00
modulefinder.py gh-114099 - Add iOS framework loading machinery. (GH-116454) 2024-03-19 08:36:19 -04:00
netrc.py
ntpath.py gh-120417: Remove unused imports in the stdlib (#120420) 2024-06-12 20:56:42 +02:00
nturl2path.py
numbers.py
opcode.py gh-120780: Show attribute name for LOAD_SPECIAL in dis output (#120781) 2024-06-20 07:07:24 -07:00
operator.py gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
optparse.py
os.py gh-120057: Add os.environ.refresh() method (#120059) 2024-06-10 16:34:17 +00:00
pdb.py gh-118714: Make the pdb post-mortem restart/quit behavior more reasonable (#118725) 2024-07-03 11:30:20 -07:00
pickle.py gh-120380: fix Python implementation of `pickle.Pickler` for `bytes` and `bytearray` objects in protocol version 5. (GH-120422) 2024-06-21 14:22:38 +02:00
pickletools.py gh-115146: Fix typo in pickletools.py documentation (GH-115148) 2024-02-08 10:12:58 +02:00
pkgutil.py
platform.py gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
plistlib.py gh-111803: Support loading more deeply nested lists in binary plist format (GH-114024) 2024-01-13 15:26:55 +02:00
poplib.py
posixpath.py pathlib ABCs: remove duplicate `realpath()` implementation. (#119178) 2024-06-05 18:54:50 +01:00
pprint.py [pprint]: Add docstring about `PrettyPrinter.underscore_numbers` parameter (#112963) 2023-12-13 12:04:17 +00:00
profile.py gh-118673: Remove shebang and executable bits from stdlib modules. (#119658) 2024-05-29 12:43:19 -04:00
pstats.py gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
pty.py gh-118824: Remove deprecated `master_open` and `slave_open` from `pty` (#118826) 2024-05-28 16:42:35 +03:00
py_compile.py
pyclbr.py
pydoc.py gh-120541: Improve the "less" prompt in pydoc (GH-120543) 2024-06-15 20:56:40 +03:00
queue.py gh-117531: Unblock getters after non-immediate queue shutdown (#117532) 2024-04-10 08:01:42 -07:00
quopri.py gh-118673: Remove shebang and executable bits from stdlib modules. (#119658) 2024-05-29 12:43:19 -04:00
random.py gh-118131: Command-line interface for the `random` module (#118132) 2024-05-05 06:30:03 +00:00
reprlib.py
rlcompleter.py gh-113978: Ignore warnings on text completion inside REPL (#113979) 2024-05-21 18:28:21 +02:00
runpy.py gh-99437: runpy: decode path-like objects before setting globals 2024-01-15 16:58:50 +00:00
sched.py
secrets.py
selectors.py
shelve.py
shlex.py
shutil.py GH-73991: Use same signature for `shutil._rmtree_[un]safe()`. (#120517) 2024-06-18 22:15:18 +01:00
signal.py gh-112559: Avoid unnecessary conversion attempts to enum_klass in signal.py (#113040) 2023-12-23 17:07:52 -08:00
site.py gh-121245: Amend d611c4c8e9 (correct import) (#121255) 2024-07-02 09:40:01 +00:00
smtplib.py gh-118673: Remove shebang and executable bits from stdlib modules. (#119658) 2024-05-29 12:43:19 -04:00
socket.py gh-110383: Document `socket.makefile()` accepts combined modes (#119150) 2024-05-21 16:23:50 +00:00
socketserver.py
sre_compile.py
sre_constants.py
sre_parse.py
ssl.py gh-107361: strengthen default SSL context flags (#112389) 2024-03-06 13:44:58 -08:00
stat.py gh-120417: Remove unused imports in the stdlib (#120420) 2024-06-12 20:56:42 +02:00
statistics.py Refactor (mostly rearrange) the statistics module (gh-119930) 2024-06-01 22:07:46 -05:00
string.py
stringprep.py
struct.py gh-120417: Add #noqa to used imports in the stdlib (#120421) 2024-06-13 16:14:50 +02:00
subprocess.py gh-120417: Fix "imported but unused" linter warnings (#120461) 2024-06-14 20:39:50 +02:00
symtable.py gh-119698: symtable: Fix merge race (#120779) 2024-06-20 05:42:30 +00:00
tabnanny.py gh-120495: Fix incorrect exception handling in Tab Nanny (#120498) 2024-06-15 05:04:14 -06:00
tarfile.py gh-118673: Remove shebang and executable bits from stdlib modules. (#119658) 2024-05-29 12:43:19 -04:00
tempfile.py gh-59616: Support os.chmod(follow_symlinks=True) and os.lchmod() on Windows (GH-113049) 2023-12-14 13:28:37 +02:00
textwrap.py
this.py
threading.py gh-114271: Fix race in `Thread.join()` (#114839) 2024-03-16 13:56:30 +01:00
timeit.py gh-118673: Remove shebang and executable bits from stdlib modules. (#119658) 2024-05-29 12:43:19 -04:00
token.py
tokenize.py gh-115154: Fix untokenize handling of unicode named literals (#115171) 2024-02-19 14:54:10 +00:00
trace.py gh-118673: Remove shebang and executable bits from stdlib modules. (#119658) 2024-05-29 12:43:19 -04:00
traceback.py gh-99180: Make `StackSummary.should_show_carets` private (#119554) 2024-05-25 17:08:32 +00:00
tracemalloc.py
tty.py gh-114328: tty cbreak mode should not alter ICRNL (#114335) 2024-01-21 15:25:52 -08:00
turtle.py
types.py
typing.py gh-114053: Fix another edge case involving `get_type_hints`, PEP 695 and PEP 563 (#120272) 2024-06-25 16:53:18 +01:00
uuid.py gh-113308: Remove some internal parts of `uuid` module (#115934) 2024-03-14 13:01:41 +03:00
warnings.py gh-121163: Add "all" as an valid alias for "always" in warnings.simplefilter() (#121164) 2024-06-30 19:48:00 +02:00
wave.py
weakref.py
webbrowser.py gh-118673: Remove shebang and executable bits from stdlib modules. (#119658) 2024-05-29 12:43:19 -04:00
zipapp.py
zipimport.py Remove references to private symbols from zipimport module docstring (GH-119015) 2024-05-15 11:21:52 -05:00