BUG: get_indexer for MultiIndex with nans returns wrong indexer #37222

Closed
3 tasks done
phofl opened this issue Oct 18, 2020 · 0 comments · Fixed by #49442
Labels
Bug, Indexing (related to indexing on series/frames, not to indexes themselves), MultiIndex

Comments

phofl (Member) commented Oct 18, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

idx1 = pd.MultiIndex.from_product([['A'], [1.0, 2.0]], names=['id1', 'id2'])
idx2 = pd.MultiIndex.from_product([['A'], [np.nan, 2.0]], names=['id1', 'id2'])

print(idx2.get_indexer(idx1))
print(idx1.get_indexer(idx2))

Problem description

This snippet returns

[0 1]
[-1  1]

Expected Output

Both statements should return [-1, 1].
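The expected result can be checked against a brute-force lookup that does not involve pandas at all. This is only a sketch: `reference_indexer` is a hypothetical helper, not a pandas function, and it assumes that NaN should match NaN and nothing else.

```python
import math

NAN = float("nan")


def reference_indexer(index_tuples, target_tuples):
    """Return the position of each target tuple in index_tuples, or -1 if absent."""

    def key(tup):
        # Assumption: NaN only matches NaN, so map it to a sentinel (None)
        # that compares equal to itself, unlike float("nan").
        return tuple(
            None if isinstance(v, float) and math.isnan(v) else v for v in tup
        )

    positions = {key(t): i for i, t in enumerate(index_tuples)}
    return [positions.get(key(t), -1) for t in target_tuples]


idx1 = [("A", 1.0), ("A", 2.0)]
idx2 = [("A", NAN), ("A", 2.0)]

print(reference_indexer(idx2, idx1))  # [-1, 1]
print(reference_indexer(idx1, idx2))  # [-1, 1]
```

In both directions only ('A', 2.0) is present on both sides, at position 1, which is why [-1, 1] is expected twice.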

Part of the problem is in

def _engine(self):
    # Calculate the number of bits needed to represent labels in each
    # level, as log2 of their sizes (including -1 for NaN):
    sizes = np.ceil(np.log2([len(l) + 1 for l in self.levels]))
    # Sum bit counts, starting from the _right_....
    lev_bits = np.cumsum(sizes[::-1])[::-1]
    # ... in order to obtain offsets such that sorting the combination of
    # shifted codes (one for each level, resulting in a unique integer) is
    # equivalent to sorting lexicographically the codes themselves. Notice
    # that each level needs to be shifted by the number of bits needed to
    # represent the _previous_ ones:
    offsets = np.concatenate([lev_bits[1:], [0]]).astype("uint64")
    # Check the total number of bits needed for our representation:
    if lev_bits[0] > 64:
        # The levels would overflow a 64 bit uint - use Python integers:
        return MultiIndexPyIntEngine(self.levels, self.codes, offsets)
    return MultiIndexUIntEngine(self.levels, self.codes, offsets)
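To make the offset arithmetic above concrete, here is a rough stdlib-only sketch that repeats the same computation for the level sizes of idx1 (one label in the first level, two in the second). `level_offsets` and `pack_codes` are hypothetical helpers written for this illustration. The packing step assumes that codes are shifted by +1 before being combined, so that -1 (NaN) lands on 0, as the "including -1 for NaN" comment suggests; the engine's actual packing code is not shown in this issue.

```python
import math
from itertools import accumulate


def level_offsets(level_sizes):
    """Bit widths, cumulative bit counts and offsets, mirroring the quoted arithmetic."""
    # Bits per level: log2 of (number of labels + 1); the +1 leaves room for NaN.
    sizes = [math.ceil(math.log2(n + 1)) for n in level_sizes]
    # Cumulative bit counts, summed starting from the right.
    lev_bits = list(accumulate(reversed(sizes)))[::-1]
    # Each level is shifted by the bits needed for the levels to its right.
    offsets = lev_bits[1:] + [0]
    return sizes, lev_bits, offsets


def pack_codes(codes, offsets):
    """Hypothetical helper: combine one code per level into a single integer.

    Assumption: each code is shifted by +1 first, so -1 (NaN) becomes 0.
    """
    packed = 0
    for code, offset in zip(codes, offsets):
        packed |= (code + 1) << offset
    return packed


# Level sizes for idx1: ['A'] has 1 label, [1.0, 2.0] has 2 labels.
sizes, lev_bits, offsets = level_offsets([1, 2])
print(sizes, lev_bits, offsets)  # [1, 2] [3, 2] [2, 0]

print(pack_codes((0, 0), offsets))   # ('A', 1.0) -> 5
print(pack_codes((0, 1), offsets))   # ('A', 2.0) -> 6
print(pack_codes((0, -1), offsets))  # ('A', NaN) -> 4
```

Under these assumptions each row maps to a distinct integer, and the integers sort in the same order as the code tuples, which is the property the comments in `_engine` describe.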

self.levels does not include the NaNs, so they are not handed over to the engine. But even after fixing this, the snippet returns

[ 0 -1]
[-1  1]

The error must be somewhere deeper, but I could not figure out where it goes wrong. I also tested on 1.0.5 and 0.25.3, so this does not seem to be a regression.

Output of pd.show_versions()

master
