Commit Graph

1406 Commits

Author SHA1 Message Date
Marc-André Lemburg e5034378cc Removing UTF-16 aware Unicode comparison code. This kind of compare
function (together with other locale aware ones) should go into a new collation
support module. See python-dev for a discussion of this removal.

Note: This patch should also be applied to the 1.6 branch.
2000-08-08 08:04:29 +00:00
Marc-André Lemburg bff879cabb This patch finalizes the move from UTF-8 to a default encoding in
the Python Unicode implementation.

The internal buffer used for implementing the buffer protocol
is renamed to defenc to make this change visible. It now holds the
default encoded version of the Unicode object and is calculated
on demand (NULL otherwise).

Since the default encoding defaults to ASCII, this will mean that
Unicode objects which hold non-ASCII characters will no longer
work on C APIs using the "s" or "t" parser markers. C APIs must now
explicitly provide Unicode support via the "u", "U" or "es"/"es#"
parser markers in order to work with non-ASCII Unicode strings.

(Note: this patch will also have to be applied to the 1.6 branch
 of the CVS tree.)
2000-08-03 18:46:08 +00:00
Guido van Rossum 16b1ad9c7d Changing the CNRI copyright notice according to CNRI's instructions.
This is a notice without a date, which apparently is not a claim to
copyright but only advice to the reader.  IANAL. :-)
2000-08-03 16:24:25 +00:00
Peter Schneider-Kamp 7e01890986 merge Include/my*.h into Include/pyport.h
marked my*.h as obsolete
2000-07-31 15:28:04 +00:00
Thomas Wouters 7889010731 Miscellaneous ANSIfications. I'm assuming here 'main' should take (int,
char**) and return an int even on PC platforms. If not, please fix
PC/utils/makesrc.c ;-P
2000-07-22 19:25:51 +00:00
Marc-André Lemburg 9542f48fd5 Fixed problems with UTF error reporting macros and some formatting bugs. 2000-07-17 18:23:13 +00:00
Greg Stein af36a3aa20 gcc is being stupid with if/else constructs
clean out some other warnings
2000-07-17 09:04:43 +00:00
Greg Stein ff975003cf stop messing around with goto and just write the macro correctly. 2000-07-16 21:39:49 +00:00
Fredrik Lundh 0e19e76aba - change \x to mean "byte" also in unicode literals
(patch #100912)
2000-07-16 18:47:43 +00:00
Tim Peters 855ffac224 Fix fatal compiler (MSVC6) error:
unicodeobject.c(735) :
    error C2143: syntax error : missing ';' before '}'
2000-07-16 17:10:50 +00:00
Marc-André Lemburg fb625847bf Fix to a bug found by Florian Weimer:
The UTF-8 decoder is still buggy (i.e. it doesn't pass Markus Kuhn's
stress test), mainly due to the following construct:

    #define UTF8_ERROR(details)  do {                       \
        if (utf8_decoding_error(&s, &p, errors, details))   \
            goto onError;                                   \
        continue;                                           \
    } while (0)

(The "continue" statement is supposed to exit from the outer loop,
but of course, it doesn't.  Indeed, this is a marvelous example of
the dangers of the C programming language and especially of the C
preprocessor.)
2000-07-16 13:29:13 +00:00
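The pitfall described in the commit above can be reproduced in a few lines. This is a minimal sketch, not the decoder code itself: SKIP_BROKEN, count_with_macro and count_with_plain_continue are hypothetical names made up for the illustration. The macro has the same shape as UTF8_ERROR, so its "continue" binds to the macro's own do/while (0) and never reaches the caller's loop.

```c
#include <stddef.h>

/* Hypothetical macro with the same shape as the buggy UTF8_ERROR:
   the "continue" belongs to this do/while (0), so it only jumps to
   the "while (0)" test and leaves the macro body. */
#define SKIP_BROKEN(flag)  do {     \
        if (flag)                   \
            continue;               \
    } while (0)

/* Intended to skip negative items, but the macro's "continue" never
   reaches this for loop, so every item is counted. */
size_t count_with_macro(const int *items, size_t n)
{
    size_t processed = 0;
    for (size_t i = 0; i < n; i++) {
        SKIP_BROKEN(items[i] < 0);
        processed++;                /* still reached for negative items */
    }
    return processed;
}

/* The same loop with a plain "continue", which does skip negative items. */
size_t count_with_plain_continue(const int *items, size_t n)
{
    size_t processed = 0;
    for (size_t i = 0; i < n; i++) {
        if (items[i] < 0)
            continue;
        processed++;
    }
    return processed;
}
```

For an array holding two negative and two non-negative values, the macro version counts all four items while the plain version counts two.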
Thomas Wouters 7e47402264 Spelling fixes supplied by Rob W. W. Hooft. All these are fixes in either
comments, docstrings or error messages. I fixed two minor things in
test_winreg.py ("didn't" -> "Didn't" and "Didnt" -> "Didn't").

There is a minor style issue involved: Guido seems to have preferred English
grammar (behaviour, honour) in a couple places. This patch changes that to
American, which is the more prominent style in the source. I prefer English
myself, so if English is preferred, I'd be happy to supply a patch myself ;)
2000-07-16 12:04:32 +00:00
Jeremy Hylton 03657cfdb0 replace PyXXX_Length calls with PyXXX_Size calls 2000-07-12 13:05:33 +00:00
Marc-André Lemburg 566d8a64eb Jeremy Hylton:
better error message for unicode coercion failure
2000-07-11 09:47:04 +00:00
Fredrik Lundh dde6164402 - changed hash calculation for unicode strings. the new
  value is calculated from the character values, in a way
  that makes sure an 8-bit ASCII string and a unicode string
  with the same contents get the same hash value.

  (as a side effect, this also works for ISO Latin 1 strings).

  for more details, see the python-dev discussion.
2000-07-10 18:27:47 +00:00
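The idea in the commit above, hashing character values rather than raw bytes so that both string types agree, can be sketched as follows. The mixing step and its constants are assumptions chosen for the illustration and are not taken from the commit; hash_8bit and hash_16bit are hypothetical names.

```c
#include <stddef.h>

/* One mixing step over a character value. The constant is an
   assumption for this sketch, not the one used by the commit. */
static unsigned long mix(unsigned long x, unsigned long ch)
{
    return (1000003UL * x) ^ ch;
}

/* Hash an 8-bit string by its character values. */
unsigned long hash_8bit(const unsigned char *s, size_t len)
{
    unsigned long x = len ? (unsigned long)s[0] << 7 : 0;
    for (size_t i = 0; i < len; i++)
        x = mix(x, s[i]);
    return x ^ (unsigned long)len;
}

/* Hash a 16-bit string with exactly the same steps. Because only the
   character values enter the computation, equal contents give equal
   hashes regardless of the storage width. */
unsigned long hash_16bit(const unsigned short *s, size_t len)
{
    unsigned long x = len ? (unsigned long)s[0] << 7 : 0;
    for (size_t i = 0; i < len; i++)
        x = mix(x, s[i]);
    return x ^ (unsigned long)len;
}
```

Since Latin-1 code points coincide with the first 256 Unicode ordinals, a scheme of this kind also agrees for Latin-1 strings, which matches the side effect the commit mentions.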
Marc-André Lemburg e12896ec98 New surrogate support in the UTF-8 codec. By Bill Tutt. 2000-07-07 17:51:08 +00:00
Marc-André Lemburg 5a5c81a0e9 Added new API PyUnicode_FromEncodedObject() which supports decoding
objects including instance objects.

The old API PyUnicode_FromObject() is still available as shortcut.
2000-07-07 13:46:42 +00:00
Marc-André Lemburg 063e0cb4c6 Fix to bug #393 (UTF16 codec didn't like empty strings) and
corrected some usage of 'unsigned long' where Py_UNICODE
should have been used.
2000-07-07 11:27:45 +00:00
Sjoerd Mullender 2629bd5a33 Two more places where long should be used instead of int. Especially
true after revision 2.36 was checked in...
2000-07-07 09:47:24 +00:00
Marc-André Lemburg 449c325303 Fixed some code that used 'short' to use 'long' instead. 2000-07-06 20:13:23 +00:00
Marc-André Lemburg 85cc4d8940 Fixed a couple of places where 'int' was used where 'long'
should have been used.
2000-07-06 19:43:31 +00:00
Marc-André Lemburg a7acf425f6 Added new .isalpha() and .isalnum() methods which provide interfaces
to the new alphabetic lookup APIs in unicodectype.c.
2000-07-05 09:49:44 +00:00
Marc-André Lemburg 1e7205a62a Bill Tutt:
Make unicode_compare a true UTF-16 compare function (includes
support for surrogates).
2000-07-04 09:51:07 +00:00
Marc-André Lemburg d49e5b4667 Marc-Andre Lemburg <mal@lemburg.com>:
A previous patch by Jack Jansen was accidentally reverted.
2000-06-30 14:58:20 +00:00
Marc-André Lemburg f28dd83b86 Marc-Andre Lemburg <mal@lemburg.com>:
New buffer overflow checks for formatting strings.

By Trent Mick.
2000-06-30 10:29:57 +00:00
Guido van Rossum 4f4b799b33 Jack Jansen: Use include "" instead of <>; and staticforward declarations 2000-06-29 00:06:39 +00:00
Marc-André Lemburg 0f774e3987 Marc-Andre Lemburg <mal@lemburg.com>:
Patch to the standard unicode-escape codec which dynamically
loads the Unicode name to ordinal mapping from the module
ucnhash.

By Bill Tutt.
2000-06-28 16:43:35 +00:00
Marc-André Lemburg 7c014684c2 Marc-Andre Lemburg <mal@lemburg.com>:
Better error message for "1 in unicodestring". Submitted
by Andrew Kuchling.
2000-06-28 08:11:47 +00:00
Marc-André Lemburg 49ef6dc1f4 Marc-Andre Lemburg <mal@lemburg.com>:
Fixed a bug in PyUnicode_Count() which would have caused a
core dump in case of substring coercion failure.

Synchronized .count() with the string method of the same name
to return len(s)+1 for s.count('').
2000-06-18 22:25:22 +00:00
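The s.count('') convention mentioned in the commit above can be sketched in C. count_substring is a hypothetical helper written for this illustration, assuming NUL-terminated 8-bit strings and non-overlapping matches; it is not the PyUnicode_Count() implementation.

```c
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of sub in s. An empty sub matches
   before every character and once at the end, giving strlen(s) + 1. */
size_t count_substring(const char *s, const char *sub)
{
    size_t n = strlen(s);
    size_t m = strlen(sub);
    size_t count = 0;
    const char *p = s;

    if (m == 0)
        return n + 1;

    while ((p = strstr(p, sub)) != NULL) {
        count++;
        p += m;             /* resume after the match just found */
    }
    return count;
}
```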
Marc-André Lemburg bea47e768d Vladimir MARANGOZOV <Vladimir.Marangozov@inrialpes.fr>:
This patch fixes an optimisation mystery in _PyUnicodeNew causing segfaults
on AIX when the interpreter is compiled with -O.
2000-06-17 20:31:17 +00:00
Marc-André Lemburg 60bc809d9a Marc-Andre Lemburg <mal@lemburg.com>:
Added code so that .isXXX() testing returns 0 for empty strings.
2000-06-14 09:18:32 +00:00
Marc-André Lemburg 07ceb67d9c Marc-Andre Lemburg <mal@lemburg.com>:
Fixed a typo and removed a debug printf(). Thanks to Finn Bock
for finding these.
2000-06-10 09:32:51 +00:00
Andrew M. Kuchling cb95a1470a Patch from Michael Hudson: improve unclear error message 2000-06-09 14:04:53 +00:00
Marc-André Lemburg d4ab4a5905 Marc-Andre Lemburg <mal@lemburg.com>:
Fixed %c formatting to check for one character arguments. Thanks
to Finn Bock for finding this bug.

Added a fix for bug PR#348 which originated from not resetting
the globals correctly in _PyUnicode_Fini().
2000-06-08 17:54:00 +00:00
Marc-André Lemburg 90e8147118 Marc-Andre Lemburg <mal@lemburg.com>:
Change the default encoding to 'ascii' (it was previously
defined as UTF-8).

Note: The implementation still uses UTF-8 to implement
the buffer protocol, so C APIs will still see UTF-8. This
is on purpose: rather than fixing the Unicode implementation,
the C APIs should be made Unicode aware.
2000-06-07 09:13:21 +00:00
Fred Drake 785d14f965 Minimal change so I can add the rest of MAL's checkin message:
M.-A. Lemburg <mal@lemburg.com>:
Fixed a core dump in PyUnicode_Format().
2000-05-09 19:54:43 +00:00
Fred Drake e4315f58d2 M.-A. Lemburg <mal@lemburg.com>:
Added support for user settable default encodings. The
current implementation uses a per-process global which
defines the value of the encoding parameter in case it
is set to NULL (meaning: use the default encoding).
2000-05-09 19:53:39 +00:00
Guido van Rossum b8872e61c6 Trent Mick:
Fix the string methods that implement slice-like semantics with
optional args (count, find, endswith, etc.) to properly handle
indices outside [INT_MIN, INT_MAX]. Previously the "i" formatter
for PyArg_ParseTuple was used to get the indices. These could overflow.

This patch changes the string methods to use the "O&" formatter with
the slice_index() function from ceval.c which is used to do the same
job for Python code slices (e.g. 'abcabcabc'[0:1000000000L]).
2000-05-09 14:14:27 +00:00
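The overflow issue in the commit above comes down to saturating an out-of-range index instead of letting it wrap. The sketch below shows only that clamping step. clamp_to_int is a hypothetical stand-in that takes a long long, whereas the real slice_index() in ceval.c works on Python objects.

```c
#include <limits.h>

/* Saturate a wide index to the int range so that values outside
   [INT_MIN, INT_MAX] behave like "as far as possible" slice bounds. */
int clamp_to_int(long long value)
{
    if (value > INT_MAX)
        return INT_MAX;
    if (value < INT_MIN)
        return INT_MIN;
    return (int)value;
}
```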
Guido van Rossum 03e29f1ae9 Mark Hammond should get his act into gear (his words :-). Zero length
strings _are_ valid!
2000-05-04 15:52:20 +00:00
Guido van Rossum 42c29aaeb5 Fix warning detected by VC++ on assignment of Py_UNICODE to char. 2000-05-03 23:58:29 +00:00
Guido van Rossum b18618dab7 Vladimir Marangozov's long-awaited malloc restructuring.
For more comments, read the patches@python.org archives.
For documentation read the comments in mymalloc.h and objimpl.h.

(This is not exactly what Vladimir posted to the patches list; I've
made a few changes, and Vladimir sent me a fix in private email for a
problem that only occurs in debug mode.  I'm also holding back on his
change to main.c, which seems unnecessary to me.)
2000-05-03 23:44:39 +00:00
Guido van Rossum 4e751c3d12 Mark Hammond withdraws his fix -- the size includes the trailing 0 so
a size of 0 *is* illegal.
2000-05-03 12:27:22 +00:00
Guido van Rossum a6edfd9737 Mark Hammond:
Fixes the MBCS codec to work correctly with zero length strings.
2000-05-03 11:03:24 +00:00
Guido van Rossum 0e4f657a50 Marc-Andre Lemburg:
Fixed \OOO interpretation for Unicode objects. \777 now
correctly produces the Unicode character with ordinal 511.
2000-05-01 21:27:20 +00:00
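The figure 511 in the commit above is simply octal 777 read without truncation to 8 bits. A minimal sketch of that arithmetic, using a hypothetical helper parse_octal_escape that is not part of the Unicode implementation:

```c
/* Read up to three octal digits from s and return their value.
   The result is not truncated to 8 bits, so "777" yields
   7*64 + 7*8 + 7 = 511. */
unsigned int parse_octal_escape(const char *s)
{
    unsigned int value = 0;
    for (int i = 0; i < 3 && s[i] >= '0' && s[i] <= '7'; i++)
        value = value * 8 + (unsigned int)(s[i] - '0');
    return value;
}
```

Storing the same result in an unsigned char would keep only the low 8 bits, that is 255.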
Guido van Rossum 3c1bb8043f Marc-Andre Lemburg:
Fixed a reference leak in the allocator.

Renamed utf8_string to _PyUnicode_AsUTF8String() and made
it external for use by other parts of the interpreter.
2000-04-27 20:13:50 +00:00
Guido van Rossum 86662914be Marc-Andre Lemburg:
The maxsplit functionality in .splitlines() was replaced by the keepends
functionality which allows keeping the line end markers together
with the string.
2000-04-11 15:38:46 +00:00
Guido van Rossum fd4b957b06 Marc-Andre Lemburg:
* New exported API PyUnicode_Resize()

* The experimental Keep-Alive optimization was turned back
  on after some tweaks to the implementation. It should now
  work without causing core dumps... this has yet to be tested
  though (switching it off is easy: see the unicodeobject.c
  file for details).

* Fixed a memory leak in the Unicode freelist cleanup code.

* Added tests to correctly process the return code from
  _PyUnicode_Resize().

* Fixed a bug in the 'ignore' error handling routines
  of some builtin codecs. Added test cases for these to
  test_unicode.py.
2000-04-10 13:51:10 +00:00
Guido van Rossum 5db862dd0c Skip Montanaro: add string precisions to calls to PyErr_Format
to prevent possible buffer overruns.
2000-04-10 12:46:51 +00:00
Guido van Rossum ba47704943 Conrad Huang points out that "if (0 < ch < 256)", while legal C,
doesn't mean what the Python programmer thought...
2000-04-06 18:18:10 +00:00
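The point made in the commit above can be shown with two small functions. C parses 0 < ch < 256 as (0 < ch) < 256; the sketch spells that parse out with an intermediate variable. The function names are hypothetical and chosen for the illustration.

```c
/* What "0 < ch < 256" means in C: the first comparison yields 0 or 1,
   and that result is then compared with 256, so the answer is always 1. */
int chained_in_range(long ch)
{
    int inner = (0 < ch);       /* 0 or 1 */
    return inner < 256;         /* always true */
}

/* What the Python-style chained comparison was meant to express. */
int explicit_in_range(long ch)
{
    return 0 < ch && ch < 256;
}
```

chained_in_range() returns 1 for any input, including negative values and values of 256 or more, while explicit_in_range() returns 1 only for 1 through 255.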
Guido van Rossum 34888ed689 Fredrik Lundh: eliminate a MSVC compiler warning. 2000-04-05 21:29:50 +00:00
Guido van Rossum 9e896b37c7 Marc-Andre's third try at this bulk patch seems to work (except that
his copy of test_contains.py seems to be broken -- the lines he
deleted were already absent).  Checkin messages:


New Unicode support for int(), float(), complex() and long().

- new APIs PyInt_FromUnicode() and PyLong_FromUnicode()
- added support for Unicode to PyFloat_FromString()
- new encoding API PyUnicode_EncodeDecimal() which converts
  Unicode to a decimal char* string (used in the above new
  APIs)
- shortcuts for calls like int(<int object>) and float(<float obj>)
- tests for all of the above

Unicode compares and contains checks:
- comparing Unicode and non-string types now works; TypeErrors
  are masked, all other errors such as ValueError during
  Unicode coercion are passed through (note that PyUnicode_Compare
  does not implement the masking -- PyObject_Compare does this)
- contains now works for non-string types too; TypeErrors are
  masked and 0 returned; all other errors are passed through

Better testing support for the standard codecs.

Misc minor enhancements, such as an alias dbcs for the mbcs codec.

Changes:
- PyLong_FromString() now applies the same error checks as
  does PyInt_FromString(): trailing garbage is reported
  as error and no longer silently ignored. The only characters
  which may be trailing the digits are 'L' and 'l' -- these
  are still silently ignored.
- string.ato?() now directly interface to int(), long() and
  float(). The error strings are now a little different, but
  the type still remains the same. These functions are now
  ready to get declared obsolete ;-)
- PyNumber_Int() now also does a check for embedded NULL chars
  in the input string; PyNumber_Long() already did this (and
  still does)

Followed by:

Looks like I've gone a step too far there... (and test_contains.py
seems to have a bug too).

I've changed back to reporting all errors in PyUnicode_Contains()
and added a few more test cases to test_contains.py (plus corrected
the join() NameError).
2000-04-05 20:11:21 +00:00
Guido van Rossum 2ea3e143f0 Some blank lines. 2000-03-31 17:24:09 +00:00
Guido van Rossum b7a40ba8d3 MBCS codecs. (Win32 only.) By Mark Hammond. 2000-03-28 02:01:52 +00:00
Barry Warsaw 51ac58039f On 17-Mar-2000, Marc-Andre Lemburg said:
Attached you find an update of the Unicode implementation.

    The patch is against the current CVS version. I would appreciate
    if someone with CVS checkin permissions could check the changes
    in.

    The patch contains all bugs and patches sent this week and also
    fixes a leak in the codecs code and a bug in the free list code
    for Unicode objects (which only shows up when compiling Python
    with Py_DEBUG; thanks to MarkH for spotting this one).
2000-03-20 16:36:48 +00:00
Guido van Rossum 403d68b484 Add sq_contains implementation. 2000-03-13 15:55:09 +00:00
Guido van Rossum d57fd91488 Unicode implementation by Marc-Andre Lemburg based on original code by
Fredrik Lundh.
2000-03-10 22:53:23 +00:00