.. _sortinghowto:

Sorting HOW TO
**************

:Author: Andrew Dalke and Raymond Hettinger
:Release: 0.1

Python lists have a built-in :meth:`list.sort` method that modifies the list
in-place. There is also a :func:`sorted` built-in function that builds a new
sorted list from an iterable.

In this document, we explore the various techniques for sorting data using Python.

Sorting Basics
==============

A simple ascending sort is very easy: just call the :func:`sorted` function. It
returns a new sorted list::

    >>> sorted([5, 2, 3, 1, 4])
    [1, 2, 3, 4, 5]

You can also use the :meth:`list.sort` method of a list. It modifies the list
in-place (and returns ``None`` to avoid confusion). Usually it's less convenient
than :func:`sorted`, but if you don't need the original list, it's slightly
more efficient.

    >>> a = [5, 2, 3, 1, 4]
    >>> a.sort()
    >>> a
    [1, 2, 3, 4, 5]

Another difference is that the :meth:`list.sort` method is only defined for
lists. In contrast, the :func:`sorted` function accepts any iterable.

    >>> sorted({1: 'D', 2: 'B', 3: 'B', 4: 'E', 5: 'A'})
    [1, 2, 3, 4, 5]

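As a small illustration of the point above, :func:`sorted` can consume a tuple,
a string, a set, or a generator, and it returns a new list each time. The sample
values are made up for this sketch.

```python
# sorted() accepts any iterable and always returns a new list.
from_tuple = sorted((3, 1, 2))                       # tuple input
from_string = sorted("python")                       # iterates over characters
from_set = sorted({"pear", "apple", "fig"})          # set input
from_generator = sorted(n * n for n in (3, -2, 1))   # generator expression

print(from_tuple)       # [1, 2, 3]
print(from_string)      # ['h', 'n', 'o', 'p', 't', 'y']
print(from_set)         # ['apple', 'fig', 'pear']
print(from_generator)   # [1, 4, 9]
```

None of these inputs has a ``sort`` method of its own, which is why
:func:`sorted` is the tool to reach for when the data is not already a list.
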
Key Functions
=============

Starting with Python 2.4, both :meth:`list.sort` and :func:`sorted` accept a
*key* parameter to specify a function to be called on each list element prior to
making comparisons.

For example, here's a case-insensitive string comparison:

    >>> sorted("This is a test string from Andrew".split(), key=str.lower)
    ['a', 'Andrew', 'from', 'is', 'string', 'test', 'This']

The value of the *key* parameter should be a function that takes a single argument
and returns a key to use for sorting purposes. This technique is fast because
the key function is called exactly once for each input record.

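The "called exactly once" claim can be checked with a key function that records
its own calls. This is only a sketch: ``counting_lower`` and ``calls`` are
hypothetical names introduced for the illustration.

```python
calls = []

def counting_lower(word):
    """Hypothetical key function that logs each call, then lowercases."""
    calls.append(word)
    return word.lower()

words = "This is a test string from Andrew".split()
result = sorted(words, key=counting_lower)

print(result)                     # ['a', 'Andrew', 'from', 'is', 'string', 'test', 'This']
print(len(calls) == len(words))   # True: one key call per element
```
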
A common pattern is to sort complex objects using some of the object's indices
as keys. For example:

    >>> student_tuples = [
    ...     ('john', 'A', 15),
    ...     ('jane', 'B', 12),
    ...     ('dave', 'B', 10),
    ... ]
    >>> sorted(student_tuples, key=lambda student: student[2])   # sort by age
    [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

The same technique works for objects with named attributes. For example:

    >>> class Student:
    ...     def __init__(self, name, grade, age):
    ...         self.name = name
    ...         self.grade = grade
    ...         self.age = age
    ...     def __repr__(self):
    ...         return repr((self.name, self.grade, self.age))

    >>> student_objects = [
    ...     Student('john', 'A', 15),
    ...     Student('jane', 'B', 12),
    ...     Student('dave', 'B', 10),
    ... ]
    >>> sorted(student_objects, key=lambda student: student.age)   # sort by age
    [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

Operator Module Functions
=========================

The key-function patterns shown above are very common, so Python provides
convenience functions to make accessor functions easier and faster. The
:mod:`operator` module has :func:`operator.itemgetter`,
:func:`operator.attrgetter`, and, starting in Python 2.6, a
:func:`operator.methodcaller` function.

Using those functions, the above examples become simpler and faster:

    >>> from operator import itemgetter, attrgetter

    >>> sorted(student_tuples, key=itemgetter(2))
    [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

    >>> sorted(student_objects, key=attrgetter('age'))
    [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

The operator module functions allow multiple levels of sorting. For example, to
sort by *grade* then by *age*:

    >>> sorted(student_tuples, key=itemgetter(1,2))
    [('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]

    >>> sorted(student_objects, key=attrgetter('grade', 'age'))
    [('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]

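Multi-level sorting works because :func:`operator.itemgetter` with several
indices returns a tuple, and tuples compare item by item. The sketch below
repeats the student data so that it runs on its own and compares the result
with an explicit ``lambda`` that builds the same key.

```python
from operator import itemgetter

student_tuples = [
    ('john', 'A', 15),
    ('jane', 'B', 12),
    ('dave', 'B', 10),
]

# itemgetter(1, 2) returns a (grade, age) tuple for each record ...
first_key = itemgetter(1, 2)(student_tuples[0])
by_itemgetter = sorted(student_tuples, key=itemgetter(1, 2))
# ... which is the same key an explicit lambda would build.
by_lambda = sorted(student_tuples, key=lambda s: (s[1], s[2]))

print(first_key)                    # ('A', 15)
print(by_itemgetter == by_lambda)   # True
```
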
The :func:`operator.methodcaller` function makes method calls with fixed
parameters for each object being sorted. For example, the :meth:`str.count`
method could be used to compute message priority by counting the
number of exclamation marks in a message:

    >>> from operator import methodcaller
    >>> messages = ['critical!!!', 'hurry!', 'standby', 'immediate!!']
    >>> sorted(messages, key=methodcaller('count', '!'))
    ['standby', 'hurry!', 'immediate!!', 'critical!!!']

Ascending and Descending
========================

Both :meth:`list.sort` and :func:`sorted` accept a *reverse* parameter with a
boolean value. This is used to flag descending sorts. For example, to get the
student data in reverse *age* order:

    >>> sorted(student_tuples, key=itemgetter(2), reverse=True)
    [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

    >>> sorted(student_objects, key=attrgetter('age'), reverse=True)
    [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

Sort Stability and Complex Sorts
================================

Starting with Python 2.3, sorts are guaranteed to be `stable
<https://en.wikipedia.org/wiki/Sorting_algorithm#Stability>`_\. That means that
when multiple records have the same key, their original order is preserved.

    >>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]
    >>> sorted(data, key=itemgetter(0))
    [('blue', 1), ('blue', 2), ('red', 1), ('red', 2)]

Notice how the two records for *blue* retain their original order so that
``('blue', 1)`` is guaranteed to precede ``('blue', 2)``.

This wonderful property lets you build complex sorts in a series of sorting
steps. For example, to sort the student data by descending *grade* and then
ascending *age*, do the *age* sort first and then sort again using *grade*:

    >>> s = sorted(student_objects, key=attrgetter('age'))     # sort on secondary key
    >>> sorted(s, key=attrgetter('grade'), reverse=True)       # now sort on primary key, descending
    [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

The `Timsort <https://en.wikipedia.org/wiki/Timsort>`_ algorithm used in Python
does multiple sorts efficiently because it can take advantage of any ordering
already present in a dataset.

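The multi-pass technique can be wrapped in a small helper. The following is a
minimal sketch, not a standard library function: ``multisort`` and its ``specs``
argument are names invented here, and the student data is repeated as plain
tuples so the block is self-contained.

```python
from operator import itemgetter

def multisort(records, specs):
    """Sort by several keys (hypothetical helper).

    `specs` is a list of (key_function, reverse) pairs, primary key first.
    """
    result = list(records)
    # Apply the passes from the least significant key to the most significant;
    # stability keeps the earlier ordering among records that tie.
    for key, reverse in reversed(specs):
        result.sort(key=key, reverse=reverse)
    return result

student_tuples = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

# Descending grade, then ascending age.
ranked = multisort(student_tuples, [(itemgetter(1), True), (itemgetter(2), False)])
print(ranked)   # [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
```

Each pass is an ordinary stable sort, so the helper relies on nothing beyond
the stability guarantee described in this section.
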
The Old Way Using Decorate-Sort-Undecorate
==========================================

This idiom is called Decorate-Sort-Undecorate after its three steps:

* First, the initial list is decorated with new values that control the sort order.

* Second, the decorated list is sorted.

* Finally, the decorations are removed, creating a list that contains only the
  initial values in the new order.

For example, to sort the student data by *grade* using the DSU approach:

    >>> decorated = [(student.grade, i, student) for i, student in enumerate(student_objects)]
    >>> decorated.sort()
    >>> [student for grade, i, student in decorated]               # undecorate
    [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

This idiom works because tuples are compared lexicographically; the first items
are compared; if they are the same then the second items are compared, and so
on.

It is not strictly necessary in all cases to include the index *i* in the
decorated list, but including it gives two benefits:

* The sort is stable -- if two items have the same key, their order will be
  preserved in the sorted list.

* The original items do not have to be comparable because the ordering of the
  decorated tuples will be determined by at most the first two items. So for
  example the original list could contain complex numbers which cannot be sorted
  directly.

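Both benefits of the index can be seen with complex numbers. The sketch below
uses made-up values and sorts them by magnitude; the names ``values`` and
``tie_error`` are introduced only for this illustration.

```python
values = [3 + 4j, 1 + 0j, 0 - 5j, -1 + 0j]   # made-up sample data

# Decorate: (sort key, original position, original value).
decorated = [(abs(z), i, z) for i, z in enumerate(values)]
decorated.sort()          # ties on abs(z) are settled by the index i
undecorated = [z for magnitude, i, z in decorated]

# Without the index, a tie on the key falls through to comparing the
# complex numbers themselves, which raises TypeError.
try:
    sorted((abs(z), z) for z in values)
    tie_error = False
except TypeError:
    tie_error = True

print(undecorated)
print(tie_error)   # True
```

Values with equal magnitude keep their original relative order, and the complex
numbers themselves are never compared.
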
Another name for this idiom is
`Schwartzian transform <https://en.wikipedia.org/wiki/Schwartzian_transform>`_\,
after Randal L. Schwartz, who popularized it among Perl programmers.

In Python versions before 2.4, DSU is likely to be the fastest way to sort
large lists and lists whose comparison information is expensive to calculate.
For 2.4 and later, key functions provide the same functionality.

The Old Way Using the *cmp* Parameter
=====================================

Many constructs given in this HOWTO assume Python 2.4 or later. Before that,
there was no :func:`sorted` builtin and :meth:`list.sort` took no keyword
arguments. Instead, all of the Python 2.x versions supported a *cmp* parameter
to handle user-specified comparison functions.

In Python 3, the *cmp* parameter was removed entirely (as part of a larger effort to
simplify and unify the language, eliminating the conflict between rich
comparisons and the :meth:`__cmp__` magic method).

In Python 2, :meth:`~list.sort` accepted an optional function that is called to
do the comparisons. That function should take two arguments to be compared and
then return a negative value for less-than, zero if they are equal, or a
positive value for greater-than. For example, we can do:

    >>> def numeric_compare(x, y):
    ...     return x - y
    >>> sorted([5, 2, 4, 1, 3], cmp=numeric_compare)
    [1, 2, 3, 4, 5]

Or you can reverse the order of comparison with:

    >>> def reverse_numeric(x, y):
    ...     return y - x
    >>> sorted([5, 2, 4, 1, 3], cmp=reverse_numeric)
    [5, 4, 3, 2, 1]

When porting code from Python 2.x to 3.x, you may have a user-supplied
comparison function that needs to be converted to a key function. The
following wrapper makes that easy to do::

    def cmp_to_key(mycmp):
        'Convert a cmp= function into a key= function'
        class K(object):
            def __init__(self, obj, *args):
                self.obj = obj
            def __lt__(self, other):
                return mycmp(self.obj, other.obj) < 0
            def __gt__(self, other):
                return mycmp(self.obj, other.obj) > 0
            def __eq__(self, other):
                return mycmp(self.obj, other.obj) == 0
            def __le__(self, other):
                return mycmp(self.obj, other.obj) <= 0
            def __ge__(self, other):
                return mycmp(self.obj, other.obj) >= 0
            def __ne__(self, other):
                return mycmp(self.obj, other.obj) != 0
        return K

To convert to a key function, just wrap the old comparison function:

    >>> sorted([5, 2, 4, 1, 3], key=cmp_to_key(reverse_numeric))
    [5, 4, 3, 2, 1]

Python 2.7 added this wrapper to the standard library as
:func:`functools.cmp_to_key`.

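With the standard library version, no hand-written wrapper is needed. The sketch
below assumes Python 2.7 or later; ``compare_length_then_alpha`` is a
hypothetical old-style comparison function written for this example.

```python
from functools import cmp_to_key

def compare_length_then_alpha(a, b):
    """Hypothetical old-style comparison: shorter strings first, ties alphabetical."""
    if len(a) != len(b):
        return len(a) - len(b)
    return (a > b) - (a < b)   # -1, 0, or 1, like the old cmp() builtin

words = ['pear', 'fig', 'banana', 'kiwi', 'apple']
result = sorted(words, key=cmp_to_key(compare_length_then_alpha))
print(result)   # ['fig', 'kiwi', 'pear', 'apple', 'banana']
```

When the comparison can be expressed as a plain key, such as
``key=lambda w: (len(w), w)`` here, the key function is usually the simpler
choice; :func:`functools.cmp_to_key` is mainly useful when an existing
comparison function has to be reused as-is.
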
Odds and Ends
=============

* For locale-aware sorting, use :func:`locale.strxfrm` for a key function or
  :func:`locale.strcoll` for a comparison function.

* The *reverse* parameter still maintains sort stability (so that records with
  equal keys retain their original order). Interestingly, that effect can be
  simulated without the parameter by using the builtin :func:`reversed` function
  twice:

    >>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]
    >>> assert sorted(data, reverse=True) == list(reversed(sorted(reversed(data))))

* To create a standard sort order for a class, just add the appropriate rich
  comparison methods:

    >>> Student.__eq__ = lambda self, other: self.age == other.age
    >>> Student.__ne__ = lambda self, other: self.age != other.age
    >>> Student.__lt__ = lambda self, other: self.age < other.age
    >>> Student.__le__ = lambda self, other: self.age <= other.age
    >>> Student.__gt__ = lambda self, other: self.age > other.age
    >>> Student.__ge__ = lambda self, other: self.age >= other.age
    >>> sorted(student_objects)
    [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

  For general-purpose comparisons, the recommended approach is to define all six
  rich comparison operators. The :func:`functools.total_ordering` class
  decorator makes this easy to implement.

* Key functions need not depend directly on the objects being sorted. A key
  function can also access external resources. For instance, if the student grades
  are stored in a dictionary, they can be used to sort a separate list of student
  names:

    >>> students = ['dave', 'john', 'jane']
    >>> grades = {'john': 'F', 'jane':'A', 'dave': 'C'}
    >>> sorted(students, key=grades.__getitem__)
    ['jane', 'dave', 'john']

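The :func:`functools.total_ordering` decorator mentioned in the list above can
be sketched as follows. The class ``OrderedStudent`` is a hypothetical variant
of the ``Student`` class, written out in full so the block runs on its own; it
assumes Python 2.7 or later.

```python
from functools import total_ordering

@total_ordering
class OrderedStudent(object):
    """Hypothetical student record ordered by age."""
    def __init__(self, name, grade, age):
        self.name = name
        self.grade = grade
        self.age = age
    def __repr__(self):
        return repr((self.name, self.grade, self.age))
    # Only __eq__ and one ordering method are written by hand;
    # total_ordering derives the remaining comparison methods.
    def __eq__(self, other):
        return self.age == other.age
    def __lt__(self, other):
        return self.age < other.age

students = [
    OrderedStudent('john', 'A', 15),
    OrderedStudent('jane', 'B', 12),
    OrderedStudent('dave', 'B', 10),
]
ordered = sorted(students)
print(ordered)                       # [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
print(students[0] >= students[1])    # True: __ge__ was supplied by the decorator
```
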