ENH: Implement searchsorted for DataFrames #43907

Closed
charlesbluca opened this issue Oct 7, 2021 · 2 comments
Labels
Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

@charlesbluca
Contributor

Is your feature request related to a problem?

pandas currently supports searchsorted only for Series:

>>> ser = pd.Series([1, 2, 3])
>>> ser.searchsorted([0, 4])
array([0, 3])

It would be nice if it were also supported for DataFrames, returning the insertion indices that maintain order in a lexicographically sorted DataFrame. This is supported by cuDF:

>>> df = cudf.DataFrame({"a": [1, 1, 1, 2, 2, 2], "b": list(range(6))})
>>> values = cudf.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 4, 6]})
>>> df.searchsorted(values, side="right")
array([2, 3, 5, 6], dtype=int32)

Describe the solution you'd like

The addition of a searchsorted(value, ...) method to pandas.DataFrame, returning the indices at which the rows of value should be inserted to maintain order in a DataFrame lexicographically sorted by all columns.

API breaking implications

None that I can think of.

Describe alternatives you've considered

This could probably be accomplished by converting the dataframe and values to a series of tuples and then using Series.searchsorted, but I imagine there's a more performant way to do this.
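As a rough illustration of the tuple-based workaround, here is a minimal sketch that works on plain lists of row tuples and uses the standard-library bisect module in place of Series.searchsorted. The helper name searchsorted_rows is hypothetical; with pandas, the row tuples could be obtained from something like list(df.itertuples(index=False, name=None)). It relies on Python's built-in lexicographic tuple comparison and assumes the rows are already sorted.

```python
from bisect import bisect_left, bisect_right


def searchsorted_rows(sorted_rows, values, side="left"):
    """Return the insertion position of each tuple in `values`.

    Hypothetical helper, not a pandas API. `sorted_rows` must already be
    in lexicographic order, since bisect assumes sorted input.
    """
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    locate = bisect_left if side == "left" else bisect_right
    return [locate(sorted_rows, v) for v in values]


# Same data as the cuDF example above, written as row tuples (a, b).
rows = list(zip([1, 1, 1, 2, 2, 2], range(6)))
values = list(zip([1, 1, 2, 2], [1, 2, 4, 6]))

right = searchsorted_rows(rows, values, side="right")
left = searchsorted_rows(rows, values, side="left")
print(right)  # [2, 3, 5, 6]
print(left)   # [1, 2, 4, 6]
```

Each lookup is a binary search, but building and comparing Python tuples row by row is slow for large frames, which is why a native implementation would likely be preferable.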

Additional context

If this functionality were added, along with multi-column quantiles (#43881), it would enable Dask DataFrames to compute sort_values with multiple sort-by columns, using an algorithm roughly similar to that of dask-cudf.
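To make the connection to sorting concrete, here is a generic sketch of how a row-wise searchsorted could route rows to output partitions in a distributed sort. This is an assumption about the general shape of such an algorithm, not the actual Dask or dask-cudf code: the function names are hypothetical, and the division boundaries are hard-coded where a real implementation would presumably derive them from multi-column quantiles.

```python
from bisect import bisect_right


def assign_partitions(rows, divisions):
    """Return, for each row tuple, the index of its output partition.

    Hypothetical helper: `divisions` is a sorted list of boundary tuples,
    so k boundaries define k + 1 partitions.
    """
    return [bisect_right(divisions, row) for row in rows]


def partitioned_sort(rows, divisions):
    """Bucket rows by partition, sort each bucket, then concatenate."""
    buckets = [[] for _ in range(len(divisions) + 1)]
    for row, part in zip(rows, assign_partitions(rows, divisions)):
        buckets[part].append(row)
    result = []
    for bucket in buckets:
        # In a distributed setting each bucket would be sorted independently.
        result.extend(sorted(bucket))
    return result


# Unsorted rows (a, b) and assumed boundaries standing in for quantiles.
rows = [(2, 5), (1, 0), (2, 3), (1, 2), (1, 1), (2, 4)]
divisions = [(1, 1), (2, 3)]

parts = assign_partitions(rows, divisions)
result = partitioned_sort(rows, divisions)
print(parts)   # [2, 0, 2, 1, 1, 2]
print(result)  # rows in lexicographic order
```

Because every row in partition i compares less than or equal to every row in partition i + 1, sorting the partitions independently and concatenating them gives a globally sorted result.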

@charlesbluca added the Enhancement and Needs Triage labels on Oct 7, 2021
@mzeitlin11
Member

Is this covered in #42872?

@charlesbluca
Contributor Author

Yes! I will close this in favor of that issue.
