-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
pandas.*.sample may return empty objects if frac <= 2^{-axis_length} #14396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks for the report! It's not obvious to me what the expected behavior is if a fraction is smaller than 1 row. Either of your suggestions (flooring at 1, adding a |
@chris-b1 you ar right it is not obvious. However, what should be the result when sampling from one element if frac is less than 1? Should it be zero or one. I want to have a sample, so maybe one element at least should be returned. This could be nicely controlled with a min_sample keyword. One thing that might be an issue here is that python rounds .5 to 0 instead of 1, in fact, that was the first thing i realized when trying to get counterexamples. By the way, an "inconsistency" the other way round appears in the following case: import pandas as pd
import numpy as np
tmp = pd.Series(np.arange(6))
tmp.sample(frac=0.25)
1 1
0 0
dtype: int32 So here, we python rounds 1.5 to 2 and we get two values, instead of one? I'm not sure what is desired here |
IMO rounding between 0-1 row should follow the same rule as for > 1 rows (which is just the python rounding rule of |
Yes, it is just pythons rounding rule. However, the rounding rule between 0 and 1 row is not the same between 1 and 2 rows since round always rounds to the nearest even integer. Personally, I'd prefer the min_sample approach |
My two cents is that this behavior is working as it should. If the user cares about rounding in a particular way, they should calculate |
thanks for the reply. However, one example where the min_sample_size might be practical if you have a large number of groups and want to sample a certain fraction of each group but obtain at least one element. In this case, it would be convenient to have the min_sample keyword when you "apply" the sample function to the data frame (or groups of the dataframe). But if such a behaviour is not desired, I'm okay with that and we could close this issue. |
Uh oh!
There was an error while loading. Please reload this page.
A small, complete example of the issue
Expected Output
Series([1], dtype: int64)
Why?
I tried to sample fractions of groups, some groups containted only one entry and were not returned in the sampled frame ( I used
groupby(key).apply(lambda x:x.sample)
).A potential fix may be easy: in generic.py line 2634 might be changed from
to
If this behaviour is not desired, maybe one could add to the API some keyword like, min_sample_size
Further Remarks
The example can be generalized using np.arange(axis_length). The bound$2^{-axis_length}$ seems to be hard. Increasing the value would return at least one sample entry.
Edit: Depending on the discussion, I'd love to provide a pull request by myself. I think, it's a good start to support pandas and it fits my skill level
Output of
pd.show_versions()
Series([], dtype: int64)
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.0
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None
The text was updated successfully, but these errors were encountered: