Skip to content

pandas.*.sample may return empty objects if frac <= 2^{-axis_length} #14396

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
datajanko opened this issue Oct 11, 2016 · 6 comments
Closed

pandas.*.sample may return empty objects if frac <= 2^{-axis_length} #14396

datajanko opened this issue Oct 11, 2016 · 6 comments
Labels
API Design Needs Discussion Requires discussion from core team before further action

Comments

@datajanko
Copy link
Contributor

datajanko commented Oct 11, 2016

A small, complete example of the issue

import pandas as pd

axis_length = 1
df = pd.Series([0])
df.sample(frac=2**(-axis_length))

Series([], dtype: int64)

Expected Output

Series([1], dtype: int64)

Why?

I tried to sample fractions of groups, some groups containted only one entry and were not returned in the sampled frame ( I used groupby(key).apply(lambda x:x.sample)).
A potential fix may be easy: in generic.py line 2634 might be changed from

n = int(round(frac * axis_length))

to

n = max(int(round(frac * axis_length)),1)

If this behaviour is not desired, maybe one could add to the API some keyword like, min_sample_size

Further Remarks

The example can be generalized using np.arange(axis_length). The bound $2^{-axis_length}$ seems to be hard. Increasing the value would return at least one sample entry.

Edit: Depending on the discussion, I'd love to provide a pull request by myself. I think, it's a good start to support pandas and it fits my skill level

Output of pd.show_versions()

Series([], dtype: int64)

# Paste the output here ## INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.0
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

@chris-b1
Copy link
Contributor

chris-b1 commented Oct 11, 2016

Thanks for the report! It's not obvious to me what the expected behavior is if a fraction is smaller than 1 row.

Either of your suggestions (flooring at 1, adding a min_sample kwarg) seem reasonable - will let others weigh in.

@datajanko
Copy link
Contributor Author

@chris-b1 you ar right it is not obvious. However, what should be the result when sampling from one element if frac is less than 1? Should it be zero or one. I want to have a sample, so maybe one element at least should be returned. This could be nicely controlled with a min_sample keyword.
Additionally, this would implement sample x percent but at least y elements.

One thing that might be an issue here is that python rounds .5 to 0 instead of 1, in fact, that was the first thing i realized when trying to get counterexamples. By the way, an "inconsistency" the other way round appears in the following case:

import pandas as pd
import numpy as np

tmp = pd.Series(np.arange(6))
tmp.sample(frac=0.25)

1    1
0    0
dtype: int32

So here, we python rounds 1.5 to 2 and we get two values, instead of one? I'm not sure what is desired here

@jorisvandenbossche
Copy link
Member

IMO rounding between 0-1 row should follow the same rule as for > 1 rows (which is just the python rounding rule of round(..) I think).
But a min_sample (or other name) for specifying a minimum sample size seems indeed reasonable.

@jorisvandenbossche jorisvandenbossche added the Needs Discussion Requires discussion from core team before further action label Oct 12, 2016
@datajanko
Copy link
Contributor Author

Yes, it is just pythons rounding rule. However, the rounding rule between 0 and 1 row is not the same between 1 and 2 rows since round always rounds to the nearest even integer.

Personally, I'd prefer the min_sample approach

@shoyer
Copy link
Member

shoyer commented Oct 13, 2016

My two cents is that this behavior is working as it should. If the user cares about rounding in a particular way, they should calculate n exactly as they desire.

@datajanko
Copy link
Contributor Author

thanks for the reply. However, one example where the min_sample_size might be practical if you have a large number of groups and want to sample a certain fraction of each group but obtain at least one element. In this case, it would be convenient to have the min_sample keyword when you "apply" the sample function to the data frame (or groups of the dataframe).

But if such a behaviour is not desired, I'm okay with that and we could close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

4 participants