ENH: implement fast isin() for nullable dtypes #38340

Closed
jorisvandenbossche opened this issue Dec 7, 2020 · 4 comments · Fixed by #38379
Labels
Enhancement; NA - MaskedArrays (related to pd.NA and nullable extension arrays); Performance (memory or execution speed performance)

@jorisvandenbossche
Member

Currently, isin() on a nullable dtype can be roughly 8x slower than on the equivalent NumPy dtype:

In [41]: arr = np.random.randint(0, 10, 1_000_001)

In [42]: s = pd.Series(arr)

In [43]: %timeit s.isin([1, 2, 3, 20])
2.71 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [44]: s = pd.Series(arr, dtype="Int64")

In [45]: %timeit s.isin([1, 2, 3, 20])
22.9 ms ± 96.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
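
For anyone reproducing this outside IPython, the comparison can be sketched as a standalone script with the standard-library `timeit` module. The array is smaller than in the session above so that it runs quickly, and the variable names are only illustrative; absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1_000_001 elements used above so the script finishes quickly.
arr = np.random.randint(0, 10, 100_001)
values = [1, 2, 3, 20]

s_numpy = pd.Series(arr)                    # NumPy-backed integer dtype
s_nullable = pd.Series(arr, dtype="Int64")  # nullable Int64 (masked array)

for label, s in [("numpy", s_numpy), ("Int64", s_nullable)]:
    # Best of 5 repeats, 10 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(lambda: s.isin(values), number=10, repeat=5)) / 10
    print(f"{label}: {best * 1e3:.3f} ms per call")
```

Both calls should give the same membership result; only the time per call is expected to differ.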
jorisvandenbossche added the Enhancement, Performance, and NA - MaskedArrays labels Dec 7, 2020
jorisvandenbossche added this to the Contributions Welcome milestone Dec 7, 2020
@tushushu
Contributor

tushushu commented Dec 8, 2020

@jorisvandenbossche Hi, I'm new here and would like to work on this issue.

@tushushu
Contributor

tushushu commented Dec 8, 2020

@jorisvandenbossche I found that the function ensure_int64 could be the bottleneck: it takes about 20 ms in the example provided.

The full call path is:
-> pandas.Series.isin
-> pandas.core.algorithms.isin
-> pandas.core.algorithms._ensure_data
-> pandas.core.dtypes.common.ensure_int64
-> pandas._libs.algos.ensure_int64

But I cannot find this function in algos.pyx or algos.pxd. Could you please give me some advice? Thanks very much.
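
One way to check where the time goes along a call path like the one above is to profile a single call with the standard-library `cProfile` module. This is only a sketch: the array size is reduced, and compiled (Cython) functions may not appear as separate entries, in which case their time shows up under the Python function that calls them.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

arr = np.random.randint(0, 10, 100_001)
s = pd.Series(arr, dtype="Int64")

# Profile one isin() call on the nullable Series.
profiler = cProfile.Profile()
profiler.enable()
result = s.isin([1, 2, 3, 20])
profiler.disable()

# Sort by cumulative time and show the ten most expensive entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```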

@tushushu
Contributor

tushushu commented Dec 8, 2020

@jorisvandenbossche

We can see that the Series.isin method calls the algorithms.isin function:

result = algorithms.isin(self._values, values)

If we pass s2._values._data to algorithms.isin instead, it runs about as fast as it does for the NumPy-backed Series:

import pandas as pd
import numpy as np
from pandas.core import algorithms

arr = np.random.randint(0, 10, 1_000_001)
s1 = pd.Series(arr)
s2 = pd.Series(arr, dtype="Int64")

%timeit algorithms.isin(s1._values, [1, 2, 3, 20])
1.87 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit algorithms.isin(s2._values, [1, 2, 3, 20])
22.7 ms ± 851 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit algorithms.isin(s2._values._data, [1, 2, 3, 20])
1.86 ms ± 39.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
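
The fast path works because `_data` is the plain NumPy array behind the nullable array. A small sketch of what sits behind an Int64 Series, assuming the private `_data` and `_mask` attributes (both are internal and may change between versions; `_mask` is not mentioned in the comment above):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None, 20], dtype="Int64")
masked = s._values      # the IntegerArray behind the Series

data = masked._data     # NumPy integer array; missing slots hold a placeholder value
mask = masked._mask     # NumPy bool array; True marks a missing entry

print(data.dtype, mask.tolist())

# Membership on the raw data ignores the mask, so the result at a missing
# slot depends on whatever placeholder happens to be stored there.
raw = np.isin(data, [1, 2, 3, 20])
print(raw.tolist())
```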

So the solution could be:

from pandas.core import algorithms
from pandas.core.arrays.integer import IntegerArray

class Series:
    def isin(self, values) -> "Series":
        if isinstance(self._values, IntegerArray):
            # Nullable integers: run isin on the underlying NumPy array.
            result = algorithms.isin(self._values._data, values)
        else:
            result = algorithms.isin(self._values, values)
        return self._constructor(result, index=self.index).__finalize__(
            self, method="isin"
        )
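
One caveat with passing `_data` directly is that the missing-value mask is dropped, so a missing entry is compared by whatever placeholder it holds. A minimal sketch of a mask-aware variant follows. The helper name `isin_masked` is hypothetical, `_mask` is a private attribute, `np.isin` stands in for `algorithms.isin`, and treating missing entries as non-matching is an assumption made here, not necessarily what the linked PR does.

```python
import numpy as np
import pandas as pd


def isin_masked(s: pd.Series, values) -> pd.Series:
    """Hypothetical helper: isin() on the raw data of a nullable Series."""
    arr = s._values
    # Fast path: membership test on the plain NumPy array.
    result = np.isin(arr._data, np.asarray(values))
    # Assumption: missing entries never match, whatever placeholder they hold.
    result[arr._mask] = False
    return pd.Series(result, index=s.index)


s = pd.Series([1, 5, None, 20], dtype="Int64")
out = isin_masked(s, [1, 2, 3, 20])
print(out.tolist())
```

Keeping the mask handling next to the fast path means the result does not depend on the placeholder stored in a missing slot.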

Looking forward to your reply.

@tushushu
Contributor

tushushu commented Dec 9, 2020

Here is the link to the PR: #38379. Please let me know if anything needs to be modified. Thanks.

3 participants