Commit Graph

77 Commits

Author SHA1 Message Date
Lysandros Nikolaou abfbab6415
gh-105718: Fix buffer allocation in tokenizer with readline (#105728) 2023-06-13 16:18:11 +01:00
Pablo Galindo Salgado c0a6ed3934
gh-105259: Ensure we don't show newline characters for trailing NEWLINE tokens (#105364) 2023-06-06 12:52:16 +01:00
Pablo Galindo Salgado 9216e69a87
gh-105069: Add a readline-like callable to the tokenizer to consume input iteratively (#105070) 2023-05-30 22:43:34 +01:00
Marta Gómez Macías 96fff35325
gh-105017: Include CRLF lines in strings and column numbers (#105030)
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
2023-05-28 15:15:53 +01:00
Stepfen Shawn 705e387dd8
Fix typo in the tokenizer (#104950) 2023-05-26 02:50:33 +00:00
Marta Gómez Macías 6715f91edc
gh-102856: Python tokenizer implementation for PEP 701 (#104323)
This commit replaces the Python implementation of the tokenize module with an implementation
that reuses the real C tokenizer via a private extension module. The tokenize module now implements
a compatibility layer that transforms tokens from the C tokenizer into Python tokenize tokens for backward
compatibility.

As the C tokenizer does not emit some tokens that the Python tokenizer provides (such as comments and non-semantic newlines), a new special mode has been added to the C tokenizer mode that currently is only used via
the extension module that exposes it to the Python layer. This new mode forces the C tokenizer to emit these new extra tokens and add the appropriate metadata that is needed to match the old Python implementation.

Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
2023-05-21 01:03:02 +01:00
Pablo Galindo Salgado ff7f731632
gh-104658: Fix location of unclosed quote error for multiline f-strings (#104660) 2023-05-20 14:07:05 +01:00
jx124 5078eedc5b
gh-104016: Fixed off by 1 error in f string tokenizer (#104047)
Co-authored-by: sunmy2019 <59365878+sunmy2019@users.noreply.github.com>
Co-authored-by: Ken Jin <kenjin@python.org>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
2023-05-01 19:15:47 +00:00
Lysandros Nikolaou 9169a56fad
gh-103656: Transfer f-string buffers to parser to avoid use-after-free (GH-103896)
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
2023-04-27 01:33:31 +00:00
Lysandros Nikolaou 05b3ce7339
GH-103718: Correctly cache and restore f-string buffers when needed (GH-103719) 2023-04-23 13:06:10 -06:00
Pablo Galindo Salgado d4aa8578b1
gh-102856: Clean some of the PEP 701 tokenizer implementation (#103634) 2023-04-19 14:51:31 -06:00
Pablo Galindo Salgado 1ef61cf71a
gh-102856: Initial implementation of PEP 701 (#102855)
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Batuhan Taskaya <isidentical@gmail.com>
Co-authored-by: Marta Gómez Macías <mgmacias@google.com>
Co-authored-by: sunmy2019 <59365878+sunmy2019@users.noreply.github.com>
2023-04-19 11:18:16 -05:00
Pablo Galindo Salgado 417206a05c
gh-99891: Fix infinite recursion in the tokenizer when showing warnings (GH-99893)
Automerge-Triggered-By: GH:pablogsal
2022-11-30 03:36:06 -08:00
Lysandros Nikolaou 3de08ce8c1
gh-97997: Add col_offset field to tokenizer and use that for AST nodes (#98000) 2022-10-07 14:38:35 -07:00
Lysandros Nikolaou cbf0afd8a1
gh-97973: Return all necessary information from the tokenizer (GH-97984)
Right now, the tokenizer only returns type and two pointers to the start and end of the token.
This PR modifies the tokenizer to return the type and set all of the necessary information,
so that the parser does not have to this.
2022-10-06 16:07:17 -07:00
Victor Stinner 5115a16831
gh-93103: Parser uses PyConfig.parser_debug instead of Py_DebugFlag (#93106)
* Replace deprecated Py_DebugFlag with PyConfig.parser_debug in the
  parser.
* Add Parser.debug member.
* Add tok_state.debug member.
* Py_FrozenMain(): Replace Py_VerboseFlag with PyConfig.verbose.
2022-05-24 22:35:08 +02:00
Victor Stinner da5727a120
gh-92651: Remove the Include/token.h header file (#92652)
Remove the token.h header file. There was never any public tokenizer
C API. The token.h header file was only designed to be used by Python
internals.

Move Include/token.h to Include/internal/pycore_token.h. Including
this header file now requires that the Py_BUILD_CORE macro is
defined. It no longer checks for the Py_LIMITED_API macro.

Rename functions:

* PyToken_OneChar() => _PyToken_OneChar()
* PyToken_TwoChars() => _PyToken_TwoChars()
* PyToken_ThreeChars() => _PyToken_ThreeChars()
2022-05-11 23:22:50 +02:00
Pablo Galindo Salgado 4f006a789a
Ensure the str member of the tokenizer is always initialised (GH-29681) 2021-11-21 02:06:39 +00:00
Victor Stinner 713bb19356
bpo-45434: Mark the PyTokenizer C API as private (GH-28924)
Rename PyTokenize functions to mark them as private:

* PyTokenizer_FindEncodingFilename() => _PyTokenizer_FindEncodingFilename()
* PyTokenizer_FromString() => _PyTokenizer_FromString()
* PyTokenizer_FromFile() => _PyTokenizer_FromFile()
* PyTokenizer_FromUTF8() => _PyTokenizer_FromUTF8()
* PyTokenizer_Free() => _PyTokenizer_Free()
* PyTokenizer_Get() => _PyTokenizer_Get()

Remove the unused PyTokenizer_FindEncoding() function.

import.c: remove unused #include "errcode.h".
2021-10-13 17:22:14 +02:00
Christian Clauss 5f401f1040
Fix typos in the Objects directory (GH-28766) 2021-10-06 16:57:10 -07:00
Serhiy Storchaka 058fb35b57
bpo-44854: Remove trailing whitespaces (GH-27689) 2021-08-09 21:32:54 +03:00
Pablo Galindo bd7476dae3
bpo-44201: Avoid side effects of "invalid_*" rules in the REPL (GH-26298)
When the parser does a second pass to check for errors, these rules can
have some small side-effects as they may advance the parser more than
the point reached in the first pass. This can cause the tokenizer to ask
for extra tokens in interactive mode causing the tokenizer to show the
prompt instead of failing instantly.

To avoid this, add a new mode to the tokenizer that is activated in the
second pass and deactivates asking for new tokens when the interactive
line is finished. As the parsing should have reached the last line in
the first pass, the second pass should not need to ask for more tokens.
2021-05-22 23:05:00 +01:00
Pablo Galindo 261a452a13
bpo-25643: Refactor the C tokenizer into smaller, logical units (GH-25050) 2021-03-28 23:48:05 +01:00
Pablo Galindo cd8dcbc851
bpo-43410: Fix crash in the parser when producing syntax errors when reading from stdin (GH-24763) 2021-03-14 04:38:40 +01:00
Pablo Galindo d6d6371447
bpo-42864: Improve error messages regarding unclosed parentheses (GH-24161) 2021-01-19 23:59:33 +00:00
Lysandros Nikolaou e5fe509054
bpo-42827: Fix crash on SyntaxError in multiline expressions (GH-24140)
When trying to extract the error line for the error message there
are two distinct cases:

1. The input comes from a file, which means that we can extract the
   error line by using `PyErr_ProgramTextObject` and which we already
   do.
2. The input does not come from a file, at which point we need to get
   the source code from the tokenizer:
   * If the tokenizer's current line number is the same with the line
     of the error, we get the line from `tok->buf` and we're ready.
   * Else, we can extract the error line from the source code in the
     following two ways:
     * If the input comes from a string we have all the input
       in `tok->str` and we can extract the error line from it.
     * If the input comes from stdin, i.e. the interactive prompt, we
       do not have access to the previous line. That's why a new
       field `tok->stdin_content` is added which holds the whole input for the
       current (multiline) statement or expression. We can then extract the
       error line from `tok->stdin_content` like we do in the string case above.

Co-authored-by: Pablo Galindo <Pablogsal@gmail.com>
2021-01-14 21:36:30 +00:00
Andy Lester 384f3c536d
closes bpo-39721: Fix constness of members of tok_state struct. (GH-18600)
The function PyTokenizer_FromUTF8 from Parser/tokenizer.c had a comment:

    /* XXX: constify members. */

This patch addresses that.

In the tok_state struct:
    * end and start were non-const but could be made const
    * str and input were const but should have been non-const

Changes to support this include:
    * decode_str() now returns a char * since it is allocated.
    * PyTokenizer_FromString() and PyTokenizer_FromUTF8() each creates a
        new char * for an allocate string instead of reusing the input
        const char *.
    * PyTokenizer_Get() and tok_get() now take const char ** arguments.
    * Various local vars are const or non-const accordingly.

I was able to remove five casts that cast away constness.
2020-02-27 18:44:52 -08:00
Pablo Galindo f2cf1e3e28
bpo-36623: Clean parser headers and include files (GH-12253)
After the removal of pgen, multiple header and function prototypes that lack implementation or are unused are still lying around.
2019-04-13 17:05:14 +01:00
Guido van Rossum 495da29225 bpo-35975: Support parsing earlier minor versions of Python 3 (GH-12086)
This adds a `feature_version` flag to `ast.parse()` (documented) and `compile()` (hidden) that allow tweaking the parser to support older versions of the grammar. In particular if `feature_version` is 5 or 6, the hacks for the `async` and `await` keyword from PEP 492 are reinstated. (For 7 or higher, these are unconditionally treated as keywords, but they are still special tokens rather than `NAME` tokens that the parser driver recognizes.)



https://bugs.python.org/issue35975
2019-03-07 12:38:08 -08:00
Pablo Galindo 1f24a719e7
bpo-35808: Retire pgen and use pgen2 to generate the parser (GH-11814)
Pgen is the oldest piece of technology in the CPython repository, building it requires various #if[n]def PGEN hacks in other parts of the code and it also depends more and more on CPython internals. This commit removes the old pgen C code and replaces it for a new version implemented in pure Python. This is a modified and adapted version of lib2to3/pgen2 that can generate grammar files compatibles with the current parser.

This commit also eliminates all the #ifdef and code branches related to pgen, simplifying the code and making it more maintainable. The regen-grammar step now uses $(PYTHON_FOR_REGEN) that can be any version of the interpreter, so the new pgen code maintains compatibility with older versions of the interpreter (this also allows regenerating the grammar with the current CI solution that uses Python3.5). The new pgen Python module also makes use of the Grammar/Tokens file that holds the token specification, so is always kept in sync and avoids having to maintain duplicate token definitions.
2019-03-01 15:34:44 -08:00
Guido van Rossum dcfcd146f8 bpo-35766: Merge typed_ast back into CPython (GH-11645) 2019-01-31 12:40:27 +01:00
Anthony Sottile 995d9b9297 bpo-16806: Fix `lineno` and `col_offset` for multi-line string tokens (GH-10021) 2019-01-13 13:05:13 +09:00
Serhiy Storchaka 94cf308ee2
bpo-33306: Improve SyntaxError messages for unbalanced parentheses. (GH-6516) 2018-12-17 17:34:14 +02:00
Victor Stinner f2ddc6ac93
tokenizer: Remove unused tabs options (#4422)
Remove the following fields from tok_state structure which are now
used unused:

* altwarning: "Issue warning if alternate tabs don't match"
* alterror: "Issue error if alternate tabs don't match"
* alttabsize: "Alternate tab spacing"

Replace alttabsize variable with ALTTABSIZE define.
2017-11-17 01:25:47 -08:00
Jelle Zijlstra ac317700ce bpo-30406: Make async and await proper keywords (#1669)
Per PEP 492, 'async' and 'await' should become proper keywords in 3.7.
2017-10-05 23:24:46 -04:00
Jim Fasarakis-Hilliard cf1958af4c Remove obsolete declaration in tokenizer.h (#962) 2017-04-03 19:18:32 +03:00
Yury Selivanov 96ec934e75 Issue #24619: Simplify async/await tokenization.
This commit simplifies async/await tokenization in tokenizer.c,
tokenize.py & lib2to3/tokenize.py.  Previous solution was to keep
a stack of async-def & def blocks, whereas the new approach is just
to remember position of the outermost async-def block.

This change won't bring any parsing performance improvements, but
it makes the code much easier to read and validate.
2015-07-23 15:01:58 +03:00
Yury Selivanov 8fb307cd65 Issue #24619: New approach for tokenizing async/await.
This commit fixes how one-line async-defs and defs are tracked
by tokenizer.  It allows to correctly parse invalid code such
as:

>>> async def f():
...     def g(): pass
...     async = 10

and valid code such as:

>>> async def f():
...     async def g(): pass
...     await z

As a consequence, is is now possible to have one-line
'async def foo(): await ..' functions:

>>> async def foo(): return await bar()
2015-07-22 13:33:45 +03:00
Yury Selivanov 7544508f02 PEP 0492 -- Coroutines with async and await syntax. Issue #24017. 2015-05-11 22:57:16 -04:00
Serhiy Storchaka c679227e31 Issue #1772673: The type of `char*` arguments now changed to `const char*`. 2013-10-19 21:03:34 +03:00
Victor Stinner fe7c5b5bdf Issue #9319: Include the filename in "Non-UTF8 code ..." syntax error. 2011-04-05 01:48:03 +02:00
Victor Stinner 7f2fee3640 Issue #10785: Store the filename as Unicode in the Python parser. 2011-04-05 00:39:01 +02:00
Georg Brandl 2b15bd810d #10222: fix for overzealous AIX compiler. 2010-10-29 04:54:13 +00:00
Victor Stinner 4c7c8c3023 Issue #9713, #10114: Parser functions (eg. PyParser_ASTFromFile) expects
filenames encoded to the filesystem encoding with surrogateescape error handler
(to support undecodable bytes), instead of UTF-8 in strict mode.
2010-10-16 13:14:10 +00:00
Victor Stinner 22a351aabf Issue #10095: fp_setreadl() doesn't reopen the file, reuse instead the file
descriptor.
2010-10-14 12:04:34 +00:00
Antoine Pitrou f95a1b3c53 Recorded merge of revisions 81029 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r81029 | antoine.pitrou | 2010-05-09 16:46:46 +0200 (dim., 09 mai 2010) | 3 lines

  Untabify C files. Will watch buildbots.
........
2010-05-09 15:52:27 +00:00
Benjamin Peterson aeaa592516 Merged revisions 76230 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r76230 | benjamin.peterson | 2009-11-12 17:39:44 -0600 (Thu, 12 Nov 2009) | 2 lines

  fix several compile() issues by translating newlines in the tokenizer
........
2009-11-13 00:17:59 +00:00
Benjamin Peterson f5b52246ed ignore the coding cookie in compile(), exec(), and eval() if the source is a string #4626 2009-03-02 23:31:26 +00:00
Brett Cannon da78043237 Latin-1 source code was not being properly decoded when passed through
compile(). This was due to left-over special-casing before UTF-8 became the
default source encoding.

Closes issue #3574. Thanks to Victor Stinner for help with the patch.
2008-10-17 03:38:50 +00:00
Guido van Rossum 40d20bcf1f Issue 1267, continued.
Additional patch by Christian Heimes to deal more cleanly with the
FILE* vs file-descriptor issues.
I cleaned up his code a bit, and moved the lseek() call into import.c.
2007-10-22 00:09:51 +00:00