Commit Graph

71 Commits

Author SHA1 Message Date
Greg Price 6954be815a closes bpo-37758: Extend unicodedata checksum tests to cover all of Unicode. (GH-15125)
Unicode has grown since Python first gained support for it,
when Unicode itself was still rather new.

This pair of test cases was added in commit 6a20ee7de back in 2000,
and they haven't needed to change much since then.  But do change
them to look beyond the Basic Multilingual Plane (range(0x10000))
and cover all 17 planes of Unicode's final form.

This adds about 5 seconds to the test suite's runtime.  Mark the
tests as CPU-using accordingly.
2019-09-12 10:25:25 +01:00
Greg Price 1ad0c776cb bpo-38043: Move unicodedata.normalize tests into test_unicodedata. (GH-15712)
Having these in a separate file from the one that's named after the
module in the usual way makes it very easy to miss them when looking
for tests for these two functions.

(In fact when working recently on is_normalized, I'd been surprised to
see no tests for it here and concluded the function had evaded being
tested at all.  I'd gone as far as to write up some tests myself
before I spotted this other file.)

Mostly this just means moving all the one file's code into the other,
and moving code from the module toplevel to inside the test class to
keep it tidily separate from the rest of the file's code.

There's one substantive change, which reduces by a bit the amount of
code to be moved: we drop the `x > sys.maxunicode` conditional and all
the `RangeError` logic behind it.  Now if that condition ever occurs
it will cause an error at `chr(x)`, and a test failure.  That's the
right result because, since PEP 393 in Python 3.3, there is no longer
such a thing as an "unsupported character".
2019-09-10 10:29:26 +01:00
Greg Price 2f09413947 closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)
The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalized(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm.  This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.

At a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

  $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  5000000 loops, best of 5: 58.2 nsec per loop

This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.

With this, that case is actually faster than in master!

$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop

$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
2019-09-03 19:45:44 -07:00
Greg Price def97c988b bpo-37758: Clean out vestigial script-bits from test_unicodedata. (GH-15126)
This file started life as a script, before conversion to a
`unittest` test file.  Clear out some legacies of that conversion
that are a bit confusing about how it works.

Most notably, it's unlikely there's still a good reason to try
to recover from `unicodedata` failing to import -- as there was
when that logic was first added, when the module was very new.
So take that out entirely.  Keep `self.db` working, though, to
avoid a noisy diff.
2019-08-12 22:58:01 -07:00
Benjamin Peterson 3aca40d3cb
closes bpo-36861: Update Unicode database to 12.1.0. (GH-13214)
Adds ㋿.
2019-05-08 20:59:35 -07:00
Benjamin Peterson 738c19f4c5
closes bpo-33376: Update to Unicode 12.0.0. (GH-12256) 2019-03-09 16:25:55 -08:00
Wonsup Yoon d134809cd3 bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958)
Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3]).
2018-06-15 20:03:14 +08:00
Benjamin Peterson 7c69c1c0fb
update to Unicode 11.0.0 (closes bpo-33778) (GH-7439)
Also, standardize indentation of generated tables.
2018-06-06 20:14:28 -07:00
Benjamin Peterson 279a96206f bpo-30736: upgrade to Unicode 10.0 (#2344)
Straightforward. While we're at it, though, strip trailing whitespace from generated tables.
2017-06-22 22:31:08 -07:00
Benjamin Peterson 6775231597 Unicode 9.0.0
Not completely mechanical since support for East Asian Width changes—emoji
codepoints became Wide—had to be added to unicodedata.
2016-09-14 23:53:47 -07:00
Berker Peksag 33a7fcc066 Issue #23981: Update test_unicodedata to use script_helpers
Patch by Christie.
2015-10-22 03:29:10 +03:00
Benjamin Peterson 4801383c29 upgrade to Unicode 8.0.0 2015-06-27 15:45:56 -05:00
Zachary Ware 38c707e7e0 Issue #21741: Update 147 test modules to use test discovery.
I have compared output between pre- and post-patch runs of these tests
to make sure there's nothing missing and nothing broken, on both
Windows and Linux.  The only differences I found were actually tests
that were previously *not* run.
2015-04-13 15:00:43 -05:00
Benjamin Peterson 96baaae46f for some reason, you don't get the right checksum from an incremental build 2014-07-06 22:07:08 -07:00
Benjamin Peterson 3032ed7cb1 upgrade to unicode 7.0.0 2014-07-06 13:04:20 -07:00
Antoine Pitrou 73abc527eb Fix expected checksum for new unicodedata (after full rebuild) 2013-10-11 21:40:55 +02:00
Benjamin Peterson 94d08d908b upgrade unicode db to 6.3.0 (closes #19221) 2013-10-10 17:24:45 -04:00
Benjamin Peterson b8350f1c7d upgrade to UCD 6.2 2012-09-29 13:47:39 -04:00
Benjamin Peterson 71f660e00f update to Unicode 6.1 2012-02-20 22:24:29 -05:00
Benjamin Peterson b2bf01d824 use full unicode mappings for upper/lower/title case (#12736)
Also broaden the category of characters that count as lowercase/uppercase.
2012-01-11 18:17:06 -05:00
Ezio Melotti f503673c4d Move UCS4-specific tests with the "normal" tests. 2011-09-29 03:14:56 +03:00
Alexander Belopolsky 86f65d5dbb Issue #10254: Fixed a crash and a regression introduced by the implementation of PRI 29. 2010-12-23 02:27:37 +00:00
Ezio Melotti b3aedd4862 #9424: Replace deprecated assert* methods in the Python test suite. 2010-11-20 19:04:17 +00:00
Antoine Pitrou 849e12bfe9 Fix resource warning in test_unicodedata. Patch by Brian Brazil. 2010-10-30 14:24:33 +00:00
Martin v. Löwis baecd7243a Upgrade to Unicode 6.0.0.
makeunicodedata.py: download all data files from unicode.org,
  switch to extracting Unihan data from zip file.
  Read linebreakprops and derivednormalizationprops even for
  old versions, even though they are not used in delta records.
test:unicode.py: U+11000 is now assigned, use U+14000 instead.
2010-10-11 22:42:28 +00:00
Amaury Forgeot d'Arc 7e44b6b0c5 Add more tests to unicodedata with large code points
(the other functions where not affected by the recent change)
2010-08-18 22:07:15 +00:00
Amaury Forgeot d'Arc 56ab01b66a Fix stupid typo in test. 2010-08-18 21:12:52 +00:00
Amaury Forgeot d'Arc 324ac65ceb #5127: Even on narrow unicode builds, the C functions that access the Unicode
Database (Py_UNICODE_TOLOWER, Py_UNICODE_ISDECIMAL, and others) now accept
and return characters from the full Unicode range (Py_UCS4).

The differences from Python code are few:
- unicodedata.numeric(), unicodedata.decimal() and unicodedata.digit()
  now return the correct value for large code points
- repr() may consider more characters as printable.
2010-08-18 20:44:58 +00:00
Mark Dickinson 388122d43b Issue #9337: Make float.__str__ identical to float.__repr__.
(And similarly for complex numbers.)
2010-08-04 20:56:28 +00:00
Florent Xicluna 806d8cf0e8 Merged revisions 79494,79496 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r79494 | florent.xicluna | 2010-03-30 10:24:06 +0200 (mar, 30 mar 2010) | 2 lines

  #7643: Unicode codepoints VT (0x0B) and FF (0x0C) are linebreaks according to Unicode Standard Annex #14.
........
  r79496 | florent.xicluna | 2010-03-30 18:29:03 +0200 (mar, 30 mar 2010) | 2 lines

  Highlight the change of behavior related to r79494.  Now VT and FF are linebreaks.
........
2010-03-30 19:34:18 +00:00
Florent Xicluna faa663f03d Fixed a failure in test_bigmem.
Merged revision 79059 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r79059 | florent.xicluna | 2010-03-18 22:50:06 +0100 (jeu, 18 mar 2010) | 2 lines

  Issue #8024: Update the Unicode database to 5.2
........
2010-03-19 13:37:08 +00:00
Florent Xicluna f1789dee30 Revert Unicode UCD 5.2 upgrade in 3.x. It broke repr() for unicode objects, and gave failures in test_bigmem. Revert 79062, 79065 and 79083. 2010-03-19 01:17:46 +00:00
Florent Xicluna 0106250f0d Fix bad unicodedata checksum merge from trunk in r79062 2010-03-19 00:03:01 +00:00
Florent Xicluna 657de43f97 Merged revisions 79059 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r79059 | florent.xicluna | 2010-03-18 22:50:06 +0100 (jeu, 18 mar 2010) | 2 lines

  Issue #8024: Update the Unicode database to 5.2
........
2010-03-18 22:11:01 +00:00
Victor Stinner 931bb02d96 oops, fix the test of my previous commit about unicodedata and PR #29 (r78647) 2010-03-04 12:47:32 +00:00
Victor Stinner 7ed9c4c910 Merged revisions 78646 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r78646 | victor.stinner | 2010-03-04 13:09:33 +0100 (jeu., 04 mars 2010) | 5 lines

  Issue #1054943: Fix unicodedata.normalize('NFC', text) for the Public Review
  Issue #29.

  PR #29 was released in february 2004!
........
2010-03-04 12:14:57 +00:00
Benjamin Peterson 577473fe68 use assert[Not]In where appropriate
A patch from Dave Malcolm.
2010-01-19 00:09:57 +00:00
Amaury Forgeot d'Arc 7d52079395 Merged revisions 75272-75273 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r75272 | amaury.forgeotdarc | 2009-10-06 21:56:32 +0200 (mar., 06 oct. 2009) | 5 lines

  #1571184: makeunicodedata.py now generates the functions _PyUnicode_ToNumeric,
  _PyUnicode_IsLinebreak and _PyUnicode_IsWhitespace.

  It now also parses the Unihan.txt for numeric values.
........
  r75273 | amaury.forgeotdarc | 2009-10-06 22:02:09 +0200 (mar., 06 oct. 2009) | 2 lines

  Add Anders Chrigstrom to Misc/ACKS for his work on unicodedata.
........
2009-10-06 21:03:20 +00:00
Benjamin Peterson c9c0f201fe convert old fail* assertions to assert* 2009-06-30 23:06:06 +00:00
Martin v. Löwis 54d9d07806 Rename the surrogates handler to surrogatepass. 2009-05-10 09:33:21 +00:00
Martin v. Löwis db12d454e6 Issue #3672: Reject surrogates in utf-8 codec; add surrogates error
handler.
2009-05-02 18:52:14 +00:00
Walter Dörwald e250775d53 Merged revisions 71972 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r71972 | walter.doerwald | 2009-04-26 21:11:43 +0200 (So, 26 Apr 2009) | 2 lines

  Fix typo.
........
2009-04-26 19:16:11 +00:00
Martin v. Löwis 71efeb7cbf Merged revisions 71947 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r71947 | martin.v.loewis | 2009-04-26 02:53:18 +0200 (So, 26 Apr 2009) | 3 lines

  Issue #4971: Fix titlecase for characters that are their own
  titlecase, but not their own uppercase.
........
2009-04-26 01:02:07 +00:00
Walter Dörwald 1b08b30743 Merged revisions 71894 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r71894 | walter.doerwald | 2009-04-25 16:03:16 +0200 (Sa, 25 Apr 2009) | 4 lines

  Issue #5828 (Invalid behavior of unicode.lower): Fixed bogus logic in
  makeunicodedata.py and regenerated the Unicode database (This fixes
  u'\u1d79'.lower() == '\x00').
........
2009-04-25 14:13:56 +00:00
Benjamin Peterson 6f7fad16bc Merged revisions 67320 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r67320 | benjamin.peterson | 2008-11-21 16:27:24 -0600 (Fri, 21 Nov 2008) | 4 lines

  don't segfault when \N escapes are used and unicodedata fails to load

  Fixes #4367
........
2008-11-21 22:58:57 +00:00
Martin v. Löwis 93cbca33f2 Merged revisions 66362 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r66362 | martin.v.loewis | 2008-09-10 15:38:12 +0200 (Mi, 10 Sep 2008) | 3 lines

  Issue #3811: The Unicode database was updated to 5.1.
  Reviewed by Fredrik Lundh and Marc-Andre Lemburg.
........
2008-09-10 14:08:48 +00:00
Walter Dörwald f342bfcbd4 Change all functions that expect one unicode character to accept a pair of
surrogates in narrow builds. Fixes issue #1706460. (Port of r63899).
2008-06-03 11:45:02 +00:00
Benjamin Peterson ee8712cda4 #2621 rename test.test_support to test.support 2008-05-20 21:35:26 +00:00
Guido van Rossum 254348e201 Rename buffer -> bytearray. 2007-11-21 19:29:53 +00:00
Guido van Rossum 98297ee781 Merging the py3k-pep3137 branch back into the py3k branch.
No detailed change log; just check out the change log for the py3k-pep3137
branch.  The most obvious changes:

  - str8 renamed to bytes (PyString at the C level);
  - bytes renamed to buffer (PyBytes at the C level);
  - PyString and PyUnicode are no longer compatible.

I.e. we now have an immutable bytes type and a mutable bytes type.

The behavior of PyString was modified quite a bit, to make it more
bytes-like.  Some changes are still on the to-do list.
2007-11-06 21:34:58 +00:00