BUG: make sure that the multi-index is lex-sorted before passing to _lexsort_indexer (GH8017) #8282

Merged
merged 1 commit into from
Sep 17, 2014
5 changes: 2 additions & 3 deletions doc/source/v0.15.0.txt
@@ -667,7 +667,6 @@ Enhancements



- Bug in ``get`` where an ``IndexError`` would not cause the default value to be returned (:issue:`7725`)



@@ -745,10 +744,10 @@ Bug Fixes
- Bug in DataFrameGroupby.transform when transforming with a passed non-sorted key (:issue:`8046`)
- Bug in repeated timeseries line and area plot may result in ``ValueError`` or incorrect kind (:issue:`7733`)
- Bug in inference in a MultiIndex with ``datetime.date`` inputs (:issue:`7888`)

- Bug in ``get`` where an ``IndexError`` would not cause the default value to be returned (:issue:`7725`)
- Bug in ``offsets.apply``, ``rollforward`` and ``rollback`` may reset nanosecond (:issue:`7697`)
- Bug in ``offsets.apply``, ``rollforward`` and ``rollback`` may raise ``AttributeError`` if ``Timestamp`` has ``dateutil`` tzinfo (:issue:`7697`)

- Bug in sorting a multi-index frame with a Float64Index (:issue:`8017`)

- Bug in ``is_superperiod`` and ``is_subperiod`` cannot handle higher frequencies than ``S`` (:issue:`7760`, :issue:`7772`, :issue:`7803`)

13 changes: 10 additions & 3 deletions pandas/core/format.py
@@ -625,10 +625,17 @@ def is_numeric_dtype(dtype):
             fmt_columns = columns.format(sparsify=False, adjoin=False)
             fmt_columns = lzip(*fmt_columns)
             dtypes = self.frame.dtypes.values
+
+            # if we have a Float level, they don't use leading space at all
+            restrict_formatting = any([ l.is_floating for l in columns.levels ])
Contributor Author

@jorisvandenbossche this seems kind of hacky, but not sure what else to do here...thoughts?

Member

I have to say I do not really understand this. What was the issue this solved?
Because I don't see how this solves the incorrect repetitions of 'red' in the first level.

Contributor Author

It DID solve it, because the subsequent routines basically stringify numbers (e.g. column headings, by putting a space before them). But I'm not entirely sure why/how that is.

Member

But what does it have to do with the level being float or not? The issue also occurred with e.g. integer indices?

Contributor Author

Integers were OK (I don't really understand why a space was added in the first place).

Member

Ah, so it solved something other than the wrong repetition of the first level? Like #8017 (comment)

             need_leadsp = dict(zip(fmt_columns, map(is_numeric_dtype, dtypes)))
-            str_columns = list(zip(*[
-                [' ' + y if y not in self.formatters and need_leadsp[x]
-                 else y for y in x] for x in fmt_columns]))
+
+            def space_format(x,y):
+                if y not in self.formatters and need_leadsp[x] and not restrict_formatting:
+                    return ' ' + y
+                return y
+
+            str_columns = list(zip(*[ [ space_format(x,y) for y in x ] for x in fmt_columns ]))
             if self.sparsify:
                 str_columns = _sparsify(str_columns)
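
One plausible reading of the exchange above, sketched in plain Python rather than with pandas internals: when a frame mixes numeric and non-numeric columns under the same first-level label, only the numeric columns get the leading space, so the first-level labels stop being identical and can no longer be collapsed into one header. The function name `build_header_rows` and the sample data are assumptions made for illustration; `restrict_formatting` mirrors the flag added in this diff.

```python
def build_header_rows(fmt_columns, is_numeric, formatters=(), restrict_formatting=False):
    """Pad labels of numeric columns with a leading space, then transpose
    the per-column tuples into one row of labels per index level."""
    need_leadsp = dict(zip(fmt_columns, is_numeric))

    def space_format(x, y):
        # Same condition as in the diff: pad only numeric columns that have
        # no custom formatter, and only when padding is not restricted.
        if y not in formatters and need_leadsp[x] and not restrict_formatting:
            return ' ' + y
        return y

    return list(zip(*[[space_format(x, y) for y in x] for x in fmt_columns]))


# Hypothetical data: three columns under 'red'; the last one holds strings.
fmt_columns = [('red', '1.0'), ('red', '2.0'), ('red', '4.0')]
is_numeric = [True, True, False]

padded = build_header_rows(fmt_columns, is_numeric)
plain = build_header_rows(fmt_columns, is_numeric, restrict_formatting=True)

print(padded[0])  # (' red', ' red', 'red')
print(plain[0])   # ('red', 'red', 'red')
```

With padding, the first level contains both `' red'` and `'red'`, which a sparsifying step that compares adjacent labels would treat as different values.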

6 changes: 6 additions & 0 deletions pandas/core/frame.py
@@ -2770,6 +2770,12 @@ def trans(v):
                                        na_position=na_position)
 
         elif isinstance(labels, MultiIndex):
+
+            # make sure that the axis is lexsorted to start
+            # if not we need to reconstruct to get the correct indexer
+            if not labels.is_lexsorted():
+                labels = MultiIndex.from_tuples(labels.values)
+
             indexer = _lexsort_indexer(labels.labels, orders=ascending,
                                        na_position=na_position)
             indexer = com._ensure_platform_int(indexer)
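
The reason for rebuilding the index before sorting can be sketched without pandas. A lexsort over integer codes only gives the right order if the codes follow the order of the values they stand for. The snippet below is a simplified model, not the library's implementation: `stale_levels` is an assumed example of what a level might look like after a new value is appended at the end, and `factorize_sorted` and `argsort_by_codes` are hypothetical helper names.

```python
def factorize_sorted(values):
    """Encode values as integer codes against a sorted list of levels,
    so that comparing codes is equivalent to comparing values."""
    levels = sorted(set(values))
    position = {v: i for i, v in enumerate(levels)}
    return levels, [position[v] for v in values]


def argsort_by_codes(codes):
    """Return the positions that would sort the codes (stable)."""
    return sorted(range(len(codes)), key=lambda i: codes[i])


# Second-level column labels after 4.0 is added to [1, 3, 2, 5].
values = [1.0, 3.0, 2.0, 5.0, 4.0]

# Assumed stale encoding: the new value was appended to the end of the
# level list, so the level list itself is no longer sorted.
stale_levels = [1.0, 2.0, 3.0, 5.0, 4.0]
stale_codes = [stale_levels.index(v) for v in values]

wrong = argsort_by_codes(stale_codes)

# Rebuilding the encoding from the values restores the invariant.
fresh_levels, fresh_codes = factorize_sorted(values)
right = argsort_by_codes(fresh_codes)

print(wrong)  # [0, 2, 1, 3, 4]
print(right)  # [0, 2, 1, 4, 3]
```

The second result matches the column order `[0,2,1,4,3]` that the new test expects for the five-column frame.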
7 changes: 4 additions & 3 deletions pandas/core/generic.py
@@ -1628,6 +1628,7 @@ def sort_index(self, axis=0, ascending=True):

new_axis = labels.take(sort_index)
return self.reindex(**{axis_name: new_axis})

_shared_docs['reindex'] = """
Conform %(klass)s to new index with optional filling logic, placing
NA/NaN in locations having no value in the previous index. A new object
@@ -3558,10 +3559,10 @@ def _tz_convert(ax, tz):
result = self._constructor(self._data, copy=copy)
result.set_axis(axis,ax)
return result.__finalize__(self)

@deprecate_kwarg(old_arg_name='infer_dst', new_arg_name='ambiguous',
mapping={True: 'infer', False: 'raise'})
def tz_localize(self, tz, axis=0, level=None, copy=True,
def tz_localize(self, tz, axis=0, level=None, copy=True,
ambiguous='raise'):
"""
Localize tz-naive TimeSeries to target time zone
@@ -3583,7 +3584,7 @@ def tz_localize(self, tz, axis=0, level=None, copy=True,
- 'raise' will raise an AmbiguousTimeError if there are ambiguous times
infer_dst : boolean, default False (DEPRECATED)
Attempt to infer fall dst-transition hours based on order

Returns
-------
"""
38 changes: 38 additions & 0 deletions pandas/tests/test_multilevel.py
@@ -214,6 +214,44 @@ def test_sort_index_preserve_levels(self):
         result = self.frame.sort_index()
         self.assertEqual(result.index.names, self.frame.index.names)
 
+    def test_sorting_repr_8017(self):
+
+        np.random.seed(0)
+        data = np.random.randn(3,4)
+
+        for gen, extra in [([1.,3.,2.,5.],4.),
+                           ([1,3,2,5],4),
+                           ([Timestamp('20130101'),Timestamp('20130103'),Timestamp('20130102'),Timestamp('20130105')],Timestamp('20130104')),
+                           (['1one','3one','2one','5one'],'4one')]:
+            columns = MultiIndex.from_tuples([('red', i) for i in gen])
+            df = DataFrame(data, index=list('def'), columns=columns)
+            df2 = pd.concat([df,DataFrame('world',
+                                          index=list('def'),
+                                          columns=MultiIndex.from_tuples([('red', extra)]))],axis=1)
+
+            # check that the repr is good
+            # make sure that we have a correct sparsified repr
+            # e.g. only 1 header of red
+            self.assertEqual(str(df2).splitlines()[0].split(),['red'])
+
+            # GH 8017
+            # sorting fails after columns added
+
+            # construct single-dtype then sort
+            result = df.copy().sort_index(axis=1)
+            expected = df.iloc[:,[0,2,1,3]]
+            assert_frame_equal(result, expected)
+
+            result = df2.sort_index(axis=1)
+            expected = df2.iloc[:,[0,2,1,4,3]]
+            assert_frame_equal(result, expected)
+
+            # setitem then sort
+            result = df.copy()
+            result[('red',extra)] = 'world'
+            result = result.sort_index(axis=1)
+            assert_frame_equal(result, expected)
+
     def test_repr_to_string(self):
         repr(self.frame)
         repr(self.ymd)
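
The expected positions used in the test, `[0,2,1,3]` and `[0,2,1,4,3]`, can be checked independently of pandas. The sketch below sorts the second-level labels for each of the four cases with the standard library only; `datetime` stands in for `Timestamp` as an assumption, and `sorted_positions` is a hypothetical helper, not a pandas function.

```python
from datetime import datetime


def sorted_positions(labels):
    """Return the positions that put the labels in ascending order."""
    return sorted(range(len(labels)), key=labels.__getitem__)


# Same label sets as the test; datetime replaces pandas' Timestamp here.
cases = [
    ([1., 3., 2., 5.], 4.),
    ([1, 3, 2, 5], 4),
    ([datetime(2013, 1, 1), datetime(2013, 1, 3),
      datetime(2013, 1, 2), datetime(2013, 1, 5)], datetime(2013, 1, 4)),
    (['1one', '3one', '2one', '5one'], '4one'),
]

# One (before, after) pair of position lists per case.
results = [(sorted_positions(gen), sorted_positions(gen + [extra]))
           for gen, extra in cases]

for before, after in results:
    print(before, after)
```

Every case yields the same pair of position lists, which is why the test can reuse one `expected` selection across floats, integers, timestamps and strings.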