Tarfile in the default write mode spends much of its time resolving UIDs
into usernames and GIDs into group names. By caching these mappings, a
significant speedup can be achieved.
In my simple benchmark[1], this extra caching speeds up tarfile by 8x.
[1] https://gist.github.com/jforberg/86af759c796199740c31547ae828aef2
---------
Co-authored-by: Tian Gao <gaogaotiantian@hotmail.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
* Remove backtracking when parsing tarfile headers
* Rewrite PAX header parsing to be stricter
* Optimize parsing of GNU extended sparse headers v0.0
Co-authored-by: Kirill Podoprigora <kirill.bast9@mail.ru>
Co-authored-by: Gregory P. Smith <greg@krypto.org>
Co-authored-by: Tomas R <tomas.roun8@gmail.com>
Co-authored-by: Scott Odle <scott@sjodle.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Petr Viktorin <encukou@gmail.com>
* gh-118673: Remove shebang and executable bits from stdlib modules.
* Removed shebangs and exe bits on turtledemo scripts.
The setting was inappropriate for '__main__' and inconsistent across the other modules. The scripts can still be executed directly by invoking with the desired interpreter.
As reported in #117847 and #115366, an unpaired backtick in a docstring
tends to confuse e.g. Sphinx running on subclasses of standard library
objects, and the typographic style of using a backtick as an opening
quote is no longer in favor. Convert almost all uses of the form
The variable `foo' should do xyz
to
The variable 'foo' should do xyz
and also fix up miscellaneous other unpaired backticks (extraneous /
missing characters).
No functional change is intended here other than in human-readable
docstrings.
* Add name and mode attributes for compressed and archived file-like objects
in modules bz2, lzma, tarfile and zipfile.
* Change the value of the mode attribute of GzipFile from integer (1 or 2)
to string ('rb' or 'wb').
* Change the value of the mode attribute of ZipExtFile from 'r' to 'rb'.
Tarfile.addfile now throws an ValueError when the user passes
in a non-zero size tarinfo but does not provide a fileobj,
instead of writing an incomplete entry.
Avoid race conditions in the creation of directories during concurrent
extraction in tarfile and zipfile.
Co-authored-by: Samantha Hughes <shughes-uk@users.noreply.github.com>
Co-authored-by: Peder Bergebakken Sundt <pbsds@hotmail.com>
In the stack call of: _init_read_gz()
```
_read, tarfile.py:548
read, tarfile.py:526
_init_read_gz, tarfile.py:491
```
a try;except exists that uses `self.exception`, so it needs to be set before
calling _init_read_gz().
`tarfile` already accepts a compressionlevel argument for creating
files. This patch adds the same for stream-based tarfile usage.
The default is 9, the value that was previously hard-coded.
Numeric fields of type float, notably mtime, can't be represented
exactly in the ustar header, so the pax header is used. But it is
helpful to set them to the nearest int (i.e. second rather than
nanosecond precision mtimes) in the ustar header as well, for the
benefit of unarchivers that don't understand the pax header.
Add test for tarfile.TarInfo.create_pax_header to confirm correct
behaviour.
* during tarfile parsing, a zlib error indicates invalid data
* tarfile.open now raises a descriptive exception from the zlib error
* this makes it clear to the user that they may be trying to open a
corrupted tar file
tarfile writes full path to FNAME field of GZIP format instead of just basename if user specified absolute path. Some archive viewers may process file incorrectly. Also it creates security issue because anyone can know structure of directories on system and know username or other personal information.
RFC1952 says about FNAME:
This is the original name of the file being compressed, with any directory components removed.
So tarfile must remove directory names from FNAME and write only basename of file.
Automerge-Triggered-By: @jaraco
The GNU docs describe the `devmajor` and `devminor` fields of the tar
header struct only in the context of character and block special files,
suggesting that in other cases they are not populated. Typical utilities
behave accordingly; this patch teaches `tarfile` to do the same.
tarfile._Stream has two buffer for compressed and uncompressed data.
Those buffers are not aligned so unnecessary bytes slicing happens
for every reading chunks.
This commit bypass compressed buffering.
In this benchmark [1], user time become 250ms from 300ms.
[1]: https://bugs.python.org/msg320763
During buffered read, use a list followed by join instead of extending a bytes object.
This is how it was done before but changed in commit b506dc32c1.
* Fix multiple typos in code comments
* Add spacing in comments (test_logging.py, test_math.py)
* Fix spaces at the beginning of comments in test_logging.py