Commit Graph

323 Commits

Author SHA1 Message Date
Fredrik Lundh 6471ee4f18 needforspeed: use "fastsearch" for count and findstring helpers. this
results in a 2.5x speedup on the stringbench count tests, and a 20x (!)
speedup on the stringbench search/find/contains test, compared to 2.5a2.

for more on the algorithm, see:

    http://effbot.org/zone/stringlib.htm

if you get weird results, you can disable the new algoritm by undefining
USE_FAST in Objects/unicodeobject.c.

enjoy /F
2006-05-24 14:28:11 +00:00
Fredrik Lundh 240bf2a8e4 use Py_ssize_t for string indexes (thanks, neal!) 2006-05-24 10:20:36 +00:00
Fredrik Lundh 7763351808 return 0 on misses, not -1. 2006-05-23 19:47:35 +00:00
Fredrik Lundh b63588c188 needforspeed: use append+reverse for rsplit, use "bloom filters" to
speed up splitlines and strip with charsets; etc.  rsplit is now as
fast as split in all our tests (reverse takes no time at all), and
splitlines() is nearly as fast as a plain split("\n") in our tests.
and we're not done yet... ;-)
2006-05-23 18:44:25 +00:00
Fredrik Lundh 833bf9422e needforspeed: fixed unicode "in" operator to use same implementation
approach as find/index
2006-05-23 10:12:21 +00:00
Tim Peters 1bacc641a0 unicode_repeat(): Change type of local to Py_ssize_t,
since that's what it should be.
2006-05-23 05:47:16 +00:00
Tim Peters 286085c781 PyUnicode_Join(): Recent code changes introduced new
compiler warnings on Windows (signed vs unsigned mismatch
in comparisons).  Cleaned that up by switching more locals
to Py_ssize_t.  Simplified overflow checking (it can _be_
simpler because while these things are declared as
Py_ssize_t, then should in fact never be negative).
2006-05-22 19:17:04 +00:00
Fredrik Lundh 8a8e05a2b9 needforspeed: use memcpy for "long" strings; use a better algorithm
for long repeats.
2006-05-22 17:12:58 +00:00
Fredrik Lundh f1d60a5384 needforspeed: speed up unicode repeat, unicode string copy 2006-05-22 16:29:30 +00:00
Fredrik Lundh 763b50f9d9 docstring tweaks: count counts non-overlapping substrings, not
total number of occurences
2006-05-22 15:35:12 +00:00
Neal Norwitz 1004a5339a Patch #1488312, Fix memory alignment problem on SPARC in unicode. Will backport 2006-05-15 07:17:23 +00:00
Thomas Wouters 715a4cdea2 Use %zd instead of %i as format character (in call to PyErr_Format) for
Py_ssize_t argument.
2006-04-16 22:04:49 +00:00
Martin v. Löwis 5cb6936672 Make Py_BuildValue, PyObject_CallFunction and
PyObject_CallMethod aware of PY_SSIZE_T_CLEAN.
2006-04-14 09:08:42 +00:00
Martin v. Löwis f15da6995b Remove another INT_MAX limitation 2006-04-13 07:24:50 +00:00
Martin v. Löwis 412fb67368 Change more ints to Py_ssize_t. 2006-04-13 06:34:32 +00:00
Martin v. Löwis 80d2e591d5 Revert 34153: Py_UNICODE should not be signed. 2006-04-13 06:06:08 +00:00
Anthony Baxter ac6bd46d5c spread the extern "C" { } magic pixie dust around. Python itself builds now
using a C++ compiler. Still lots and lots of errors in the modules built by
setup.py, and a bunch of warnings from g++ in the core.
2006-04-13 02:06:09 +00:00
Anthony Baxter a62862120d More low-hanging fruit. Still need to re-arrange some code (or find a better
solution) in the same way as listobject.c got changed. Hoping for a better
solution.
2006-04-11 07:42:36 +00:00
Georg Brandl ecdc0a9f46 That one was a mistake. 2006-03-30 12:19:07 +00:00
Georg Brandl 347b30042b Remove unnecessary casts in type object initializers. 2006-03-30 11:57:00 +00:00
Thomas Wouters a96affe1fc - Reindent a confusingly indented piece of code (no intended code changes
there)
 - Add missing DECREFs of inner-scope 'temp' variable
 - Add various missing DECREFs by changing 'return NULL' into 'goto onError'
 - Avoid double DECREF when last _PyUnicode_Resize() fails

Coverity found one of the missing DECREFs, but oddly enough not the others.
2006-03-12 00:29:36 +00:00
Martin v. Löwis 480f1bb67b Update Unicode database to Unicode 4.1. 2006-03-09 23:38:20 +00:00
Guido van Rossum 38fff8c4e4 Checking in the code for PEP 357.
This was mostly written by Travis Oliphant.
I've inspected it all; Neal Norwitz and MvL have also looked at it
(in an earlier incarnation).
2006-03-07 18:50:55 +00:00
Hye-Shik Chang 4af5c8cee4 SF #1444030: Fix several potential defects found by Coverity.
(reviewed by Neal Norwitz)
2006-03-07 15:39:21 +00:00
Martin v. Löwis 15e62742fa Revert backwards-incompatible const changes. 2006-02-27 16:46:16 +00:00
Thomas Wouters de01774dae Use correct PyArg_Parse format char for Py_ssize_t in unicode.center().
Fixes:

>>> u"".center(10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError

on 64-bit systems.
2006-02-16 19:34:37 +00:00
Martin v. Löwis eb079f1c25 Use Py_ssize_t for counts and sizes.
Convert Py_ssize_t using PyInt_FromSsize_t
2006-02-16 14:32:27 +00:00
Martin v. Löwis 2c95cc6d72 Support %zd in PyErr_Format and PyString_FromFormat. 2006-02-16 06:54:25 +00:00
Tim Peters 15231548d2 doubletounicode(), longtounicode():
Py_SAFE_DOWNCAST can evaluate its first argument multiple
times in a debug build.  This caused two distinct assert-
failures in test_unicode run under a debug build.  Rewrote
the code in trivial ways so that multiple evaluation of the
first argument doesn't hurt.
2006-02-16 01:08:01 +00:00
Thomas Wouters 4701af5bf5 Remove two unused Py_ssize_t variables (merge glitches, looks like.) 2006-02-15 23:10:32 +00:00
Martin v. Löwis 18e165558b Merge ssize_t branch. 2006-02-15 17:27:45 +00:00
Neal Norwitz fc76d633e8 - Patch #1400181, fix unicode string formatting to not use the locale.
This is how string objects work.  u'%f' could use , instead of .
  for the decimal point.  Now both strings and unicode always use periods.

This is the code that would break:

import locale
locale.setlocale(locale.LC_NUMERIC, 'de_DE')
u'%.1f' % 1.0
assert '1.0' == u'%.1f' % 1.0

I couldn't create a test case which fails, but this fixes the problem.

Will backport.
2006-01-10 06:03:13 +00:00
Neal Norwitz d43069ce95 Fix icc warnings: remove (sometimes) unused variable conditionally 2006-01-08 01:12:10 +00:00
Martin v. Löwis dea59e5755 Stop maintaining the buildno file.
Also, stop determining Unicode sizes with PyString_GET_SIZE.
2006-01-05 10:00:36 +00:00
Hye-Shik Chang 835b243c71 Bug #1379994: Fix *unicode_escape codecs to encode r'\' as r'\\'
just like string codecs.
2005-12-17 04:38:31 +00:00
Jeremy Hylton af68c874a6 Add const to several API functions that take char *.
In C++, it's an error to pass a string literal to a char* function
without a const_cast().  Rather than require every C++ extension
module to put a cast around string literals, fix the API to state the
const-ness.

I focused on parts of the API where people usually pass literals:
PyArg_ParseTuple() and friends, Py_BuildValue(), PyMethodDef, the type
slots, etc.  Predictably, there were a large set of functions that
needed to be fixed as a result of these changes.  The most pervasive
change was to make the keyword args list passed to
PyArg_ParseTupleAndKewords() to be a const char *kwlist[].

One cast was required as a result of the changes:  A type object
mallocs the memory for its tp_doc slot and later frees it.
PyTypeObject says that tp_doc is const char *; but if the type was
created by type_new(), we know it is safe to cast to char *.
2005-12-10 18:50:16 +00:00
Walter Dörwald d4fff1731c Fix leaked reference to None. 2005-11-28 22:15:56 +00:00
Andrew M. Kuchling 8294de5673 Another comment typo fix 2005-11-02 16:36:12 +00:00
Walter Dörwald 2e2c02fedb Fix typo in comment. 2005-11-02 08:57:11 +00:00
Fred Drake db390c1ad8 fix typos, mostly in comments 2005-10-28 14:39:47 +00:00
Michael W. Hudson b2308bb9be Fix bug:
[ 1327110 ] wrong TypeError traceback in generator expressions

by removing the code that can stomp on the users' TypeError raised by the
iterable argument to ''.join() -- PySequence_Fast (now?) gives a perfectly
reasonable message itself.  Also, a couple of tests.
2005-10-21 11:45:01 +00:00
Marc-André Lemburg 5c4a9d6591 Whitespace corrections. 2005-10-19 22:39:02 +00:00
Marc-André Lemburg e115ec832c Bug fix for [ 1331062 ] utf 7 codec broken.
Backport candidate.
2005-10-19 22:33:31 +00:00
Walter Dörwald d1c1e10f70 Part of SF patch #1313939: Speedup charmap decoding by extending
PyUnicode_DecodeCharmap() the accept a unicode string as the mapping
argument which is used as a mapping table.

This code isn't used by any of the codecs yet.
2005-10-06 20:29:57 +00:00
Walter Dörwald a47d1c08d0 SF bug #1251300: On UCS-4 builds the "unicode-internal" codec will now complain
about illegal code points. The codec now supports PEP 293 style error handlers.
(This is a variant of the Nik Haldimann's patch that detects truncated data)
2005-08-30 10:23:14 +00:00
Marc-André Lemburg a9cadcd41b Correct the handling of 0-termination of PyUnicode_AsWideChar()
and its usage in PyLocale_strcoll().

Clarify the documentation on this.

Thanks to Andreas Degert for pointing this out.
2004-11-22 13:02:31 +00:00
Marc-André Lemburg 204bd6d9d2 Applied patch for [ 1047269 ] Buffer overwrite in PyUnicode_AsWideChar.
Python 2.3.x candidate.
2004-10-15 07:45:05 +00:00
Skip Montanaro 6543b45b0c Initialize sep and seplen to suppress warning from gcc. 2004-09-16 03:28:13 +00:00
Thomas Heller ca0d2cb66e Add a missing line continuation character. 2004-09-15 11:41:32 +00:00
Walter Dörwald 065a32f550 Make the hint about the None default less ambiguous. 2004-09-14 09:45:10 +00:00
Walter Dörwald 782afc5927 Enhance the docstrings for unicode.split() and string.split()
to make it clear that it is possible to pass None as the
separator argument to get the default "any whitespace" separator.
2004-09-14 09:40:45 +00:00
Walter Dörwald 69652035bc SF patch #998993: The UTF-8 and the UTF-16 stateful decoders now support
decoding incomplete input (when the input stream is temporarily exhausted).
codecs.StreamReader now implements buffering, which enables proper
readline support for the UTF-16 decoders. codecs.StreamReader.read()
has a new argument chars which specifies the number of characters to
return. codecs.StreamReader.readline() and codecs.StreamReader.readlines()
have a new argument keepends. Trailing "\n"s will be stripped from the lines
if keepends is false. Added C APIs PyUnicode_DecodeUTF8Stateful and
PyUnicode_DecodeUTF16Stateful.
2004-09-07 20:24:22 +00:00
Tim Peters 91879ab8ea PyUnicode_Join(): Bozo Alert. While this is chugging along, it may
need to convert str objects from the iterable to unicode.  So, if
someone set the system default encoding to something nasty enough,
the conversion process could mutate the input iterable as a side
effect, and PySequence_Fast doesn't hide that from us if the input was
a list.  IOW, can't assume the size of PySequence_Fast's result is
invariant across PyUnicode_FromObject() calls.
2004-08-27 22:35:44 +00:00
Tim Peters 05eba1fdc8 PyUnicode_Join(): Rewrote to use PySequence_Fast(). This doesn't do
much to reduce the size of the code, but greatly improves its clarity.
It's also quicker in what's probably the most common case (the argument
iterable is a list).  Against it, if the iterable isn't a list or a tuple,
a temp tuple is materialized containing the entire input sequence, and
that's a bigger temp memory burden.  Yawn.
2004-08-27 21:32:02 +00:00
Tim Peters 894c512c2f PyUnicode_Join(): Missed a spot where I intended a cast from size_t to
int.  I sure wish MS would gripe about that!  Whatever, note that the
statement above it guarantees that the cast loses no info.
2004-08-27 05:08:36 +00:00
Tim Peters 8ce9f16259 PyUnicode_Join(): Two primary aims:
1. u1.join([u2]) is u2
2. Be more careful about C-level int overflow.

Since PySequence_Fast() isn't needed to achieve #1, it's not used -- but
the code could sure be simpler if it were.
2004-08-27 01:49:32 +00:00
Hye-Shik Chang e9ddfbb412 SF #989185: Drop unicode.iswide() and unicode.width() and add
unicodedata.east_asian_width().  You can still implement your own
simple width() function using it like this:
    def width(u):
        w = 0
        for c in unicodedata.normalize('NFC', u):
            cwidth = unicodedata.east_asian_width(c)
            if cwidth in ('W', 'F'): w += 2
            else: w += 1
        return w
2004-08-04 07:38:35 +00:00
Marc-André Lemburg d25c650461 Let u'%s' % obj try obj.__unicode__() first and fallback to obj.__str__(). 2004-07-23 16:13:25 +00:00
Nicholas Bastin 9ba301e589 Moved SunPro warning suppression into pyport.h and out of individual
modules and objects.
2004-07-15 15:54:05 +00:00
Marc-André Lemburg 126b44cd41 Fix a copy&paste typo. 2004-07-10 12:04:20 +00:00
Marc-André Lemburg 1dffb120b7 .encode()/.decode() patch part 2. 2004-07-08 19:13:55 +00:00
Marc-André Lemburg d2d4598ec2 Allow string and unicode return types from .encode()/.decode()
methods on string and unicode objects. Added unicode.decode()
which was missing for no apparent reason.
2004-07-08 17:57:32 +00:00
Nicholas Bastin 1ce9e4cfc1 Fixed end-of-loop code not reached warning when using SunPro C 2004-06-17 18:27:18 +00:00
Hye-Shik Chang 974ed7cfa5 - SF #962502: Add two more methods for unicode type; width() and
iswide() for east asian width manipulation. (Inspired by David
Goodger, Reviewed by Martin v. Loewis)
- Move _PyUnicode_TypeRecord.flags to the end of the struct so that
no padding is added for UCS-4 builds. (Suggested by Martin v. Loewis)
2004-06-02 16:49:17 +00:00
Hye-Shik Chang 4057483164 SF Patch #926375: Remove a useless UTF-16 support code that is never
been used. (Suggested by Martin v. Loewis)
2004-04-06 07:24:51 +00:00
Walter Dörwald cd736e71a3 Fix reallocation bug in unicode.translate(): The code was comparing
characters instead of character pointers to determine space requirements.
2004-02-05 17:36:00 +00:00
Hye-Shik Chang 1bc09b7c2a Cosmetic fix for wrongly indented tabs with ts=4. 2004-01-03 19:35:43 +00:00
Hye-Shik Chang 7fc4cf57b8 Fix unicode.rsplit()'s bug that ignores separater on the end of string when
using specialized splitter for 1 char sep.
2003-12-23 09:10:16 +00:00
Hye-Shik Chang 40e9509dc7 Fix broken xmlcharrefreplace by rev 2.204.
(Pointy hat goes to perky)
2003-12-22 01:31:13 +00:00
Hye-Shik Chang 4a264fb054 SF #859573: Reduce compiler warnings on gcc 3.2 and above. 2003-12-19 01:59:56 +00:00
Hye-Shik Chang 3ae811b57d Add rsplit method for str and unicode builtin types.
SF feature request #801847.
Original patch is written by Sean Reifschneider.
2003-12-15 18:49:53 +00:00
Guido van Rossum 6c9e130524 - Removed FutureWarnings related to hex/oct literals and conversions
and left shifts.  (Thanks to Kalle Svensson for SF patch 849227.)
  This addresses most of the remaining semantic changes promised by
  PEP 237, except for repr() of a long, which still shows the trailing
  'L'.  The PEP appears to promise warnings for operations that
  changed semantics compared to Python 2.3, but this is not
  implemented; we've suffered through enough warnings related to
  hex/oct literals and I think it's best to be silent now.
2003-11-29 23:52:13 +00:00
Raymond Hettinger 4f8f976576 Add optional fillchar argument to ljust(), rjust(), and center() string methods. 2003-11-26 08:21:35 +00:00
Walter Dörwald 4894c30626 Fix a bug in the memory reallocation code of PyUnicode_TranslateCharmap().
charmaptranslate_makespace() allocated more memory than required for the
next replacement but didn't remember that fact, so memory size was growing
exponentially every time a replacement string is longer that one character.
This fixes SF bug #828737.
2003-10-24 14:25:28 +00:00
Martin v. Löwis 6828e18a6a Patch #825679: Clarify semantics of .isfoo on empty strings.
Backported to 2.3.
2003-10-18 09:55:08 +00:00
Jeremy Hylton 504de6bd2c Fix for SF bug [ 817156 ] invalid \U escape gives 0=length unistr. 2003-10-06 05:08:26 +00:00
Tim Peters ced69f8a20 On c.l.py, Martin v. Löwis said that Py_UNICODE could be of a signed type,
so fiddle Jeremy's fix to live with that.  Also added more comments.

Bugfix candidate (this bug is in all versions of Python, at least since
2.1).
2003-09-16 20:30:58 +00:00
Jeremy Hylton d808279be3 Double-fix of crash in Unicode freelist handling.
If a length-1 Unicode string was in the freelist and it was
uninitialized or pointed to a very large (magnitude) negative number,
the check

	 unicode_latin1[unicode->str[0]] == unicode

could cause a segmentation violation, e.g. unicode->str[0] is 0xcbcbcbcb.

Fix this in two ways:

1. Change guard befor unicode_latin1[] to test against 256U.  If I
   understand correctly, the unsigned long used to store UCS4 on my
   box was getting converted to a signed long to compare with the
   signed constant 256.

2. Change _PyUnicode_New() to make sure the first element of str is
   always initialized to zero.  There are several places in the code
   where the caller can exit with an error before initializing any
   of str, which would leave junk in str[0].

Also, silence a compiler warning on pointer vs. int arithmetic.

Bug fix candidate.
2003-09-16 19:41:39 +00:00
Jeremy Hylton deb2dc6658 Change checks of PyUnicode_Resize() return value for clarity.
The unicode_resize() family only returns -1 or 0 so simply checking
for != 0 is sufficient, but somewhat unclear.  Many Python API
functions return < 0 on error, reserving the right to return 0 or 1 on
success.  Change the call sites for consistency with these calls.
2003-09-16 03:41:45 +00:00
Raymond Hettinger 9bfe533c69 SF bug #795506: Wrong handling of string format code for float values.
Adding missing support for '%F'.

Will backport to 2.3.1.
2003-08-27 04:55:52 +00:00
Walter Dörwald 150523efa5 Fix refcounting leak in charmaptranslate_lookup() 2003-08-15 16:52:19 +00:00
Walter Dörwald 9b30f206ee Fix another refcounting leak in PyUnicode_EncodeCharmap(). 2003-08-15 16:26:34 +00:00
Walter Dörwald d4ade0885c Fix another refcounting leak (in PyUnicode_DecodeUnicodeEscape()). 2003-08-15 15:00:26 +00:00
Walter Dörwald e5402fb340 Fix refcount leak in PyUnicode_EncodeCharmap(). The bug surfaces
when an encoding error occurs and the callback name is unknown,
i.e. when the callback has to be called. The problem was that
the fact that the callback has already been looked up was only
recorded in a local variable in charmap_encoding_error(), because
charmap_encoding_error() got it's own copy of the errorHandler
pointer instead of a pointer to the pointer in
PyUnicode_EncodeCharmap().
2003-08-14 20:25:29 +00:00
Mark Hammond 0ccda1ee10 Support 'mbcs' as a 'built-in' encoding, so the C API can use it without
defering to the encodings package.
As described in [ 763111 ] mbcs encoding should skip encodings package
2003-07-01 00:13:27 +00:00
Raymond Hettinger f466793fcc SF patch 703666: Several objects don't decref tmp on failure in subtype_new
Submitted By: Christopher A. Craig

Fillin some missing decrefs.
2003-06-28 20:04:25 +00:00
Martin v. Löwis 9a3a9f7791 Consider \U-escapes in raw-unicode-escape. Fixes #444514. 2003-05-18 12:31:09 +00:00
Neal Norwitz ffe33b7f24 Attempt to make all the various string *strip methods the same.
* Doc - add doc for when functions were added
 * UserString
 * string object methods
 * string module functions
'chars' is used for the last parameter everywhere.

These changes will be backported, since part of the changes
have already been made, but they were inconsistent.
2003-04-10 22:35:32 +00:00
Guido van Rossum a7132189d2 Reformat a few docstrings that caused line wraps in help() output. 2003-04-09 19:32:45 +00:00
Walter Dörwald 44f527fea4 Change formatchar(), so that u"%c" % 0xffffffff now raises
an OverflowError instead of a TypeError to be consistent
with "%c" % 256. See SF patch #710127.
2003-04-02 16:37:24 +00:00
Raymond Hettinger c8df5780e1 Sf patch #700047: unicode object leaks refcount on resizing
Contributed by Hye-Shik Chang.
2003-03-09 07:30:43 +00:00
Neal Norwitz ec74f2fda7 Add more missing PyErr_NoMemory() after failled memory allocs 2003-02-11 23:05:40 +00:00
Walter Dörwald f6b56aecad Fix two refcounting bugs 2003-02-09 23:42:56 +00:00
Walter Dörwald 2e0b18af30 Change the treatment of positions returned by PEP293
error handers in the Unicode codecs: Negative
positions are treated as being relative to the end of
the input and out of bounds positions result in an
IndexError.

Also update the PEP and include an explanation of
this in the documentation for codecs.register_error.

Fixes a small bug in iconv_codecs: if the position
from the callback is negative *add* it to the size
instead of substracting it.

From SF patch #677429.
2003-01-31 17:19:08 +00:00
Guido van Rossum 5d9113d8be Implement appropriate __getnewargs__ for all immutable subclassable builtin
types.  The special handling for these can now be removed from save_newobj().
Add some testing for this.

Also add support for setting the 'fast' flag on the Python Pickler class,
which suppresses use of the memo.
2003-01-29 17:58:45 +00:00
Walter Dörwald adc727490b Fix charmapencode_lookup(), so that a None value in the mapping
is treated as "character maps to <undefined>" and not as
"character mapping must return integer, None or str".
2003-01-08 22:01:33 +00:00
Walter Dörwald 034d97605d Remove variable owned from PyUnicode_FromEncodedObject, which is unused
(except for Py_DECREF calls) since the introduction of __unicode__.
2003-01-08 20:38:39 +00:00
Marc-André Lemburg 79f57833f3 Patch for bug #659709: bogus computation of float length
Python 2.2.x backport candidate. (This bug has been around since
Python 1.6.)
2002-12-29 19:44:06 +00:00
Neil Schemenauer ce30bc9f49 Add nb_remainder (i.e. __mod__) slot to unicode type. Fixes SF bug #615506. 2002-11-18 16:10:18 +00:00
Neal Norwitz 80a1bf4b5d Fix SF # 635969, No error "not all arguments converted"
When mwh added extended slicing, strings and unicode became mappings.
Thus, dict was set which prevented an error when doing:
	newstr = 'format without a percent' % string_value

This fix raises an exception again when there are no formats
and % with a string value.
2002-11-12 23:01:12 +00:00