pandas.*.sample may return empty objects if frac <= 2^{-axis_length} #14396

datajanko · 2016-10-11T12:18:37Z

A small, complete example of the issue

import pandas as pd

axis_length = 1
df = pd.Series([0])
df.sample(frac=2**(-axis_length))

Series([], dtype: int64)

Expected Output

Series([1], dtype: int64)

Why?

I tried to sample fractions of groups, some groups containted only one entry and were not returned in the sampled frame ( I used groupby(key).apply(lambda x:x.sample)).
A potential fix may be easy: in generic.py line 2634 might be changed from

n = int(round(frac * axis_length))

to

n = max(int(round(frac * axis_length)),1)

If this behaviour is not desired, maybe one could add to the API some keyword like, min_sample_size

Further Remarks

The example can be generalized using np.arange(axis_length). The bound $2^{-axis_length}$ seems to be hard. Increasing the value would return at least one sample entry.

Edit: Depending on the discussion, I'd love to provide a pull request by myself. I think, it's a good start to support pandas and it fits my skill level

Output of `pd.show_versions()`

Series([], dtype: int64)

# Paste the output here ## INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.0
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

chris-b1 · 2016-10-11T16:10:25Z

Thanks for the report! It's not obvious to me what the expected behavior is if a fraction is smaller than 1 row.

Either of your suggestions (flooring at 1, adding a min_sample kwarg) seem reasonable - will let others weigh in.

datajanko · 2016-10-12T06:00:18Z

@chris-b1 you ar right it is not obvious. However, what should be the result when sampling from one element if frac is less than 1? Should it be zero or one. I want to have a sample, so maybe one element at least should be returned. This could be nicely controlled with a min_sample keyword.
Additionally, this would implement sample x percent but at least y elements.

One thing that might be an issue here is that python rounds .5 to 0 instead of 1, in fact, that was the first thing i realized when trying to get counterexamples. By the way, an "inconsistency" the other way round appears in the following case:

import pandas as pd
import numpy as np

tmp = pd.Series(np.arange(6))
tmp.sample(frac=0.25)

1    1
0    0
dtype: int32

So here, we python rounds 1.5 to 2 and we get two values, instead of one? I'm not sure what is desired here

jorisvandenbossche · 2016-10-12T15:25:32Z

IMO rounding between 0-1 row should follow the same rule as for > 1 rows (which is just the python rounding rule of round(..) I think).
But a min_sample (or other name) for specifying a minimum sample size seems indeed reasonable.

datajanko · 2016-10-12T20:40:29Z

Yes, it is just pythons rounding rule. However, the rounding rule between 0 and 1 row is not the same between 1 and 2 rows since round always rounds to the nearest even integer.

Personally, I'd prefer the min_sample approach

shoyer · 2016-10-13T02:08:16Z

My two cents is that this behavior is working as it should. If the user cares about rounding in a particular way, they should calculate n exactly as they desire.

datajanko · 2016-10-16T20:18:37Z

thanks for the reply. However, one example where the min_sample_size might be practical if you have a large number of groups and want to sample a certain fraction of each group but obtain at least one element. In this case, it would be convenient to have the min_sample keyword when you "apply" the sample function to the data frame (or groups of the dataframe).

But if such a behaviour is not desired, I'm okay with that and we could close this issue.

chris-b1 added the API Design label Oct 11, 2016

jorisvandenbossche added the Needs Discussion Requires discussion from core team before further action label Oct 12, 2016

datajanko closed this as completed Dec 22, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

pandas.*.sample may return empty objects if frac <= 2^{-axis_length} #14396

pandas.*.sample may return empty objects if frac <= 2^{-axis_length} #14396

datajanko commented Oct 11, 2016 •

edited

Loading

chris-b1 commented Oct 11, 2016 •

edited

Loading

Uh oh!

datajanko commented Oct 12, 2016

Uh oh!

jorisvandenbossche commented Oct 12, 2016

Uh oh!

datajanko commented Oct 12, 2016

Uh oh!

shoyer commented Oct 13, 2016

Uh oh!

datajanko commented Oct 16, 2016

Uh oh!

Uh oh!

pandas.*.sample may return empty objects if frac <= 2^{-axis_length} #14396

pandas.*.sample may return empty objects if frac <= 2^{-axis_length} #14396

Comments

datajanko commented Oct 11, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

A small, complete example of the issue

Expected Output

Why?

Further Remarks

Output of pd.show_versions()

chris-b1 commented Oct 11, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

datajanko commented Oct 12, 2016

Uh oh!

jorisvandenbossche commented Oct 12, 2016

Uh oh!

datajanko commented Oct 12, 2016

Uh oh!

shoyer commented Oct 13, 2016

Uh oh!

datajanko commented Oct 16, 2016

Uh oh!

datajanko commented Oct 11, 2016 •

edited

Loading

Output of `pd.show_versions()`

chris-b1 commented Oct 11, 2016 •

edited

Loading