ENH: implement fast isin() for nullable dtypes #38340

jorisvandenbossche · 2020-12-07T08:19:19Z

Currently, isin() on a nullable dtype can be much slower than on the equivalent NumPy dtype (roughly 8x in this example):

In [41]: arr = np.random.randint(0, 10, 1_000_001)

In [42]: s = pd.Series(arr)

In [43]: %timeit s.isin([1, 2, 3, 20])
2.71 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [44]: s = pd.Series(arr, dtype="Int64")

In [45]: %timeit s.isin([1, 2, 3, 20])
22.9 ms ± 96.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


tushushu · 2020-12-08T14:23:51Z

@jorisvandenbossche Hi, I'm new here and would like to work on this issue.

tushushu · 2020-12-08T15:24:26Z

@jorisvandenbossche I found that ensure_int64 could be the bottleneck: it takes about 20 ms in the example provided.

The call path I traced while debugging is:
-> pandas.Series.isin
-> pandas.core.algorithms.isin
-> pandas.core.algorithms._ensure_data
-> pandas.core.dtypes.common.ensure_int64
-> pandas._libs.algos.ensure_int64

But I cannot find this function in algos.pyx or algos.pxd. Could you please give me some advice? Thanks very much.
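A hedged sketch of one way to check where the time goes, using the standard-library cProfile module. The helper name `profile_isin` is hypothetical, and compiled (Cython) functions such as ensure_int64 may not show up as separate rows, so the report mainly narrows down which Python-level caller is expensive:

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_isin(series, values, top=10):
    """Profile one Series.isin call and return the report as text.

    Hypothetical helper for illustration; rows are sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    series.isin(values)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


arr = np.random.randint(0, 10, 1_000_001)
report = profile_isin(pd.Series(arr, dtype="Int64"), [1, 2, 3, 20])
print(report)
```

The exact rows depend on the installed pandas version, so the call path listed above may not be reproduced on newer releases.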

tushushu · 2020-12-08T16:39:41Z

@jorisvandenbossche

We can see that the Series.isin method calls the algorithms.isin function.

pandas/pandas/core/series.py

Line 4633 in 5cafae7

result = algorithms.isin(self._values, values)

If we pass the underlying NumPy array, s2._values._data, to algorithms.isin instead, the performance is much better:

import pandas as pd
import numpy as np
from pandas.core import algorithms

arr = np.random.randint(0, 10, 1_000_001)
s1 = pd.Series(arr)
s2 = pd.Series(arr, dtype="Int64")

%timeit algorithms.isin(s1._values, [1, 2, 3, 20])
1.87 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit algorithms.isin(s2._values, [1, 2, 3, 20])
22.7 ms ± 851 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit algorithms.isin(s2._values._data, [1, 2, 3, 20])
1.86 ms ± 39.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So the solution could be:

from pandas.core import algorithms
from pandas.core.arrays.integer import IntegerArray

class Series:
    def isin(self, values) -> "Series":
        if isinstance(self._values, IntegerArray):
            # Fast path: operate on the underlying NumPy data
            # (the NA mask is not consulted here).
            result = algorithms.isin(self._values._data, values)
        else:
            result = algorithms.isin(self._values, values)
        return self._constructor(result, index=self.index).__finalize__(
            self, method="isin"
        )
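One caveat with passing `_data` directly: masked (missing) slots still hold some placeholder integer in the raw buffer, so they could be reported as matches. Below is a minimal sketch of combining the fast lookup with the mask. It uses plain NumPy arrays to stand in for an IntegerArray's data and mask. `masked_isin` is a hypothetical helper, not a pandas function, and treating missing entries as non-matches is an assumption rather than the behaviour pandas settled on:

```python
import numpy as np


def masked_isin(data, mask, values):
    """Membership test on a masked integer array (hypothetical helper).

    data:   raw integer buffer, including placeholder values in masked slots
    mask:   boolean array, True where the entry is missing
    values: candidate values to look up
    """
    data = np.asarray(data)
    mask = np.asarray(mask, dtype=bool)

    # Fast path: test membership on the raw integer buffer.
    result = np.isin(data, values)

    # Assumption: a missing entry never counts as a match, even if the
    # placeholder stored in its slot happens to equal one of `values`.
    result[mask] = False
    return result


data = np.array([1, 2, 1, 5])
mask = np.array([False, False, True, False])  # third entry is missing
print(masked_isin(data, mask, [1, 20]).tolist())
# [True, False, False, False]
```

Without the mask step, the third entry would be reported as a match because its placeholder value is 1.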

Looking forward to your reply.

tushushu · 2020-12-09T01:10:40Z

Here is the link to the PR; please let me know if anything needs to be changed. Thanks.
#38379

jorisvandenbossche added the Enhancement, Performance (memory or execution speed performance), and NA - MaskedArrays (related to pd.NA and nullable extension arrays) labels Dec 7, 2020

jorisvandenbossche added this to the Contributions Welcome milestone Dec 7, 2020

tushushu mentioned this issue Dec 9, 2020

fix series.isin slow issue with Dtype IntegerArray #38379

Merged


jbrockmendel mentioned this issue Dec 12, 2020

ENH/POC: EA.isin #38422

Closed


jreback modified the milestones: Contributions Welcome, 1.3 Jan 3, 2021

jreback closed this as completed in #38379 Jan 20, 2021
