DOC: df.iterrows() returns view for homogenous columns, and copy for heterogenous / perf warning #7194

Closed
shashispace opened this issue May 21, 2014 · 18 comments · Fixed by #10680

@shashispace

perf discussion: http://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues/24871316#24871316

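import pandas as pd
df = pd.DataFrame({'a':range(5),'b':range(5)})
print(df)

# all-int frame: the change to row.a shows up in df (second output below)
for _, row in df.iterrows():
    row.a += 1
print(df)

df['c'] = 'what'

# mixed-dtype frame: the change to row.b does not show up in df (third output below)
for _, row in df.iterrows():
    row.b += 1
print(df)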

   a  b
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4

[5 rows x 2 columns]
   a  b
0  1  0
1  2  1
2  3  2
3  4  3
4  5  4

[5 rows x 2 columns]
   a  b     c
0  1  0  what
1  2  1  what
2  3  2  what
3  4  3  what
4  5  4  what

[5 rows x 3 columns]
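
One way to see why the two loops behave differently is to look at the dtype of the rows that iterrows hands back. This is only a sketch: it assumes each row arrives as a Series with a single dtype, and it deliberately inspects dtypes rather than write-through behaviour, since whether the all-int case modifies df may depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': range(5)})

# All columns share one integer dtype, so each row keeps that dtype.
_, row = next(df.iterrows())
homogeneous_dtype = row.dtype
print(homogeneous_dtype)

# Adding a string column makes the frame heterogeneous; each row is now
# assembled as an object-dtype Series holding both ints and strings.
df['c'] = 'what'
_, row = next(df.iterrows())
mixed_dtype = row.dtype
print(mixed_dtype)
```

In the mixed case the row cannot share memory with the integer columns, which is consistent with the third output above leaving column b unchanged.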
@jreback
Contributor

jreback commented May 21, 2014

and? lots of pandas operations do this
more efficient this way

@shashispace
Author

Hmmn... that's probably true. But it leads to newbies wasting lots of time... anyways, I thought consistency should be preferred over speed.

@jreback
Contributor

jreback commented May 21, 2014

This is a very common idiom in Python in general: iterating over a list-like. You should NEVER modify something you are iterating over. This is NOT a copy in Python space either (though I think it IS allowed to be a copy; maybe in Python 3 it is, not sure).

So maybe a doc warning in that section would be good, informing users that setting values using ANY of the iteration methods IS NOT a good idea.
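
A minimal sketch of what such a warning could recommend, assuming the goal is simply "column b plus one": treat each row as read-only, collect the results, and assign them to the frame after the loop. The frame and the name `new_b` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': range(5)})
df['c'] = 'what'

# Read from each row, but collect results instead of assigning to `row`.
new_b = [row['b'] + 1 for _, row in df.iterrows()]

# Write back through the frame once the loop has finished.
df['b'] = new_b

print(df)
```

For this particular update, `df['b'] += 1` does the same thing without a loop; the loop form is only worth keeping when the per-row computation cannot be vectorized.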

can you do a pull-request?

@jreback jreback added the Docs label May 21, 2014
@jreback jreback added this to the 0.14.1 milestone May 21, 2014
@shashispace
Author

So why not make it consistent and return a copy even for homogenous dataframes? (I guess you've already said that it is more efficient.)

I had this issue with something I was running, and I couldn't figure out what was going on, because a toy example I constructed behaved differently from the main code (because all its columns were the same type). I think others would have this issue too; it would be best to make it consistent between the two cases. I have nothing against returning a copy.

Sorry, I don't know enough programming to create a proper pull request... embarrassing, I know. I should dive into this stuff. It would be good for me... any tips on how to get started?

@cpcloud
Member

cpcloud commented May 21, 2014

Check the wiki; there are a lot of tips on contributing.

@jreback
Contributor

jreback commented May 21, 2014

@shashispace well, the performance gain is worth not making a copy when people use iterrows (which, FYI, is never necessary, as there are much better methods). So a doc warning would be in order, but that's about it.

@shashispace
Author

Sounds good. Thanks for indulging me. What are some of the better ways to iterate? I was doing:
for i in df.index:
    df.ix[i,'a'] += 1

(that's a bad example, but you get the idea - I need to pick the column inside the loop...) I found this way considerably slower, though I didn't benchmark it or anything.

@jreback jreback reopened this May 21, 2014
@jreback
Contributor

jreback commented May 21, 2014

you don't iterate or use loops, use a mask instead, e.g.

df.ix[df['a']>5,'a'] += 1

the df['a']>5 produces a mask, see here: http://pandas-docs.github.io/pandas-docs-travis/indexing.html#boolean-indexing
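
The `.ix` indexer used in this thread was later deprecated and removed from pandas; `.loc` accepts the same boolean mask. A small sketch of the masked update with `.loc`, using a made-up ten-row frame:

```python
import pandas as pd

df = pd.DataFrame({'a': range(10), 'b': range(10)})

mask = df['a'] > 5        # boolean Series, one entry per row
df.loc[mask, 'a'] += 1    # only rows where the mask is True are changed

print(df['a'].tolist())
```

Rows where `a` is 5 or less are left alone, and no Python-level loop is involved.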

@shashispace
Author

I think I didn't explain it well, let me try again:

I have a dataframe with a large number of columns, and I need to add a number to a different column in each row. In row 1, I might add to column 'a'; in row 13, I might need to add to column 'b', etc.

Writing this, I realize that I can iterate over the columns.. which would be way more efficient and I can write vectorized code for that. I can't remember if matlab allowed writing to random columns in different rows, is there a vectorized way to do that in pandas?

Thanks so so much - greatly appreciate your replies. It's definitely motivated me to get more involved than just using pandas.

@jreback
Contributor

jreback commented May 21, 2014

pandas aligns, so just create the rhs with the values and it will work, e.g. something like

from pandas import Series
s = Series([1,2,3],index=[10,12,20])
for col in ['a','c']:
    df[col] = df[col].add(s, fill_value=0)  # aligns s on the row index

would add the indicated values from s to columns a and c

so easiest to do this column by column, of course depending how you are determining what you are adding in the first place.

might be even easier to simply construct a frame that you are adding, e.g.

from pandas import DataFrame, Series
frame_to_add = DataFrame({'a': Series([1,2,3],index=[10,20,30]), 'c': Series([1,2,3],index=[4,5,6])})
df += frame_to_add.reindex_like(df).fillna(0)

does it all at once
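
For the case raised earlier in the thread, where each row needs a different column updated, another option is to drop to NumPy and index by position. This is a sketch under assumptions: the frame is all-numeric, and the names `rows`, `cols` and `amounts` are hypothetical inputs describing which cell to touch and by how much.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0], 'b': [0, 0, 0], 'c': [0, 0, 0]},
                  index=[10, 12, 20])

# Hypothetical updates: one (row label, column label, amount) triple each.
rows = [10, 12, 20]
cols = ['a', 'c', 'b']
amounts = [1, 2, 3]

# Translate labels to integer positions (get_indexer returns -1 for
# labels that are missing, so real code should check for that).
row_pos = df.index.get_indexer(rows)
col_pos = df.columns.get_indexer(cols)

# Work on a copy of the values; np.add.at also accumulates correctly
# when the same (row, column) pair appears more than once.
values = df.to_numpy(copy=True)
np.add.at(values, (row_pos, col_pos), amounts)

df = pd.DataFrame(values, index=df.index, columns=df.columns)
print(df)
```

Rebuilding the frame from the array keeps the original index and columns; with mixed dtypes this approach would need to be applied to the numeric subset only.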

@shashispace
Author

awesome. thanks again.

@immerrr
Contributor

immerrr commented May 22, 2014

@jreback

make a copy when people do iterrows (which FYI, is never necessary as there are much better methods)

I kind of use iterrows when there's a row-by-row operation that includes dot-product of a subset of columns, i.e. something like:

x = pd.DataFrame(..., columns=['a', 'b', 'c', 'timestamp'])

for (_, abc), timestamp in izip(x[['a', 'b', 'c']].iterrows(),
                                x['timestamp']):
    m = get_matrix(timestamp)
    yield m.dot(abc)

Is there a better way to perform something like that?

@jreback
Contributor

jreback commented May 22, 2014

@immerrr no, that is reasonable. I would do this by transposing first and using iteritems(), though: if it's a large frame, I think you might get better slicing perf, as you are then indexing off of the info axis (if I am reading what you are doing correctly).

I often have this issue with say panels (and higher ndims) when I am doing an apply like operation, need to transpose to get good slicing perf (and also want to store them in consistent dimensional space).

@immerrr
Contributor

immerrr commented May 22, 2014

Wouldn't timestamp column force it to be cast to object dtype when transposed?

@jreback
Contributor

jreback commented May 22, 2014

subset then transpose I think would work (assume abc are floats)

e.g.

for (_, abc), timestamp in izip(df[['a','b','c']].T.iteritems(), df['timestamp']):
   ...
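
The snippets above are Python 2 era: `izip` became the built-in `zip` in Python 3, and `iteritems()` was later renamed `items()` in pandas. A rough Python 3 sketch of the same row-by-row dot product follows; it pulls the float columns out as one NumPy array instead of transposing. `get_matrix` here is a hypothetical stand-in for the function in the comment above, and `transformed_rows` is a name made up for illustration.

```python
import numpy as np
import pandas as pd


def get_matrix(timestamp):
    # Hypothetical stand-in: the real function would depend on `timestamp`.
    return np.eye(3) * 2.0


def transformed_rows(x):
    # Subset the float columns first so the timestamp column does not
    # force an object dtype, then iterate over plain NumPy rows.
    abc = x[['a', 'b', 'c']].to_numpy()
    for vec, timestamp in zip(abc, x['timestamp']):
        yield get_matrix(timestamp).dot(vec)


x = pd.DataFrame({
    'a': [1.0, 2.0],
    'b': [3.0, 4.0],
    'c': [5.0, 6.0],
    'timestamp': pd.to_datetime(['2014-05-22', '2014-05-23']),
})

results = list(transformed_rows(x))
print(results)
```

As noted in the thread, the matrix operation itself is likely to dominate the running time, so this mainly avoids building a Series per row.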

@immerrr
Contributor

immerrr commented May 22, 2014

Ah, that. I didn't see any tangible improvement in 0.12.0 (which we're still using) so I went with the natural way (transpose does help with to_dict though).

@jreback
Contributor

jreback commented May 22, 2014

yeh I suspect your perf limit will be from the actual operation (and not slicing)

@jreback jreback modified the milestones: 0.15.0, 0.14.1 Jun 17, 2014
@jreback jreback modified the milestones: 0.15.0, 0.15.1 Jul 7, 2014
@jreback jreback changed the title df.iterrows() returns view for homogenous columns, and copy for heterogenous. DOC: df.iterrows() returns view for homogenous columns, and copy for heterogenous / perf warning Jul 7, 2014
@jreback jreback modified the milestones: 0.16, 0.15.0 Sep 9, 2014
@jreback jreback modified the milestones: 0.16, 0.15.1 Oct 7, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@dashesy
Contributor

dashesy commented Mar 31, 2015

In my use case I have a dataframe that I need to transform into a list of dictionaries (to record to a timeseries db), and I was seeing odd behaviour until I read the docs (more carefully) and realized iterrows may actually change the underlying data type; for example, an integer could become a double. Now I am going to use itertuples as suggested, but I just wanted to mention that sometimes the reason for iterating over rows is some sort of transformation and not really math, so apply is not intuitive.
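
A small sketch of the upcast described above, using a made-up frame with one integer and one float column (the column names are illustrative). It compares iterrows with itertuples and with `to_dict(orient='records')`, which builds the list of dictionaries directly.

```python
import pandas as pd

df = pd.DataFrame({'sensor_id': [1, 2], 'reading': [0.5, 1.5]})

# iterrows builds one Series per row, so the integer column is upcast
# to float to share a dtype with the float column.
_, row = next(df.iterrows())
via_iterrows = row.to_dict()

# itertuples yields namedtuples and keeps each column's own type.
via_itertuples = [t._asdict() for t in df.itertuples(index=False)]

# to_dict with orient='records' returns one dict per row.
via_to_dict = df.to_dict(orient='records')

print(via_iterrows)
print(via_itertuples)
print(via_to_dict)
```

Note that itertuples only produces named fields when the column names are valid Python identifiers, so frames with unusual column names may need the `to_dict` route instead.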
