DOC: df.iterrows() returns view for homogenous columns, and copy for heterogenous / perf warning #7194
Comments
and? lots of pandas operations do this |
Hmm... that's probably true. But it leads to newbies wasting lots of time... anyway, I thought consistency should be preferred over speed. |
This is a very common idiom in Python in general: iterating over a list-like. You should NEVER modify something you are iterating over. This is NOT a copy in Python space either (though I think it IS allowed to be a copy; maybe in Python 3 it is, not sure). So maybe a doc warning in that section would be good, informing users that setting values using ANY of the iteration methods IS NOT a good idea. Can you do a pull request? |
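A minimal sketch of why setting values through the iteration methods is unreliable, assuming a small frame with hypothetical columns `a` and `label` (not taken from the thread). With mixed dtypes, each row yielded by `iterrows()` is a newly built object Series, so assignments to it never reach the frame. The single-dtype case is deliberately left out, since its behaviour is exactly what this issue is about and may differ between pandas versions.

```python
import pandas as pd

# Hypothetical mixed-dtype frame: one integer column, one string column.
mixed = pd.DataFrame({"a": [1, 2], "label": ["x", "y"]})

for _, row in mixed.iterrows():
    # `row` is a new object-dtype Series built for this iteration,
    # so this assignment changes only the temporary row.
    row["a"] = 99

# The frame still holds its original values.
unchanged = mixed["a"].tolist()

# Setting values on a frame directly (vectorized) is the reliable route.
updated = mixed.copy()
updated["a"] = updated["a"] * 10
```

Here `unchanged` keeps the original column values, while `updated` carries the intended change.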
So why not make it consistent and return a copy even for homogenous dataframes? (I guess you've already said that it is more efficient.) I had this issue with something I was running, and I couldn't figure out what was going on, because a toy example I constructed behaved differently from the main code (because all its columns were the same type). I think others would have this issue too, so it would be best to make it consistent between the two cases. I have nothing against returning a copy. Sorry, I don't know enough programming to create a proper pull request... embarrassing, I know. I should dive into this stuff. It would be good for me... any tips on how to get started? |
Check the wiki there are a lot of tips on contributing |
@shashispace well performance is worth it not to make a copy when people do |
Sounds good. Thanks for indulging me. What are some of the better ways to iterate? I was doing: (that's a bad example, but you get the idea - I need to pick the column inside the loop...) I found this way considerably slower, though I didn't benchmark it or anything. |
you don't iterate or use loops, use a mask instead, e.g.
the |
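The snippet that accompanied this reply is not preserved in this copy of the thread. As a rough sketch of the mask idea only, assuming hypothetical columns `a` and `b` and an arbitrary condition:

```python
import pandas as pd

# Hypothetical data; the column names and the threshold are illustrative.
df = pd.DataFrame({"a": [1, 5, 10], "b": [0, 0, 0]})

# Boolean mask selecting the rows to change.
mask = df["a"] > 3

# Assign to the selected rows in one vectorized step, with no Python loop.
df.loc[mask, "b"] = df.loc[mask, "a"] * 2
```

Rows where the mask is False are left untouched.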
I think I didn't explain it well, let me try again: I have a dataframe with a large number of columns, and I need to add a number to a different column in each row. In row one, I might add to 'a'; in row 13, I might need to add to column 'b', etc. Writing this, I realize that I can iterate over the columns, which would be way more efficient, and I can write vectorized code for that. I can't remember if MATLAB allowed writing to random columns in different rows. Is there a vectorized way to do that in pandas? Thanks so so much, I greatly appreciate your replies. It's definitely motivated me to get more involved than just using pandas. |
pandas aligns, so just create the rhs with the values and it will work, e.g. something like
would add the indicated values from s in columns a and c, so it is easiest to do this column by column, depending of course on how you are determining what you are adding in the first place. It might be even easier to simply construct a frame that you are adding, e.g.
does it all at once |
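The snippets from this reply are missing in this copy, so the following is only a sketch of the alignment idea, not the original code. It assumes a hypothetical float frame with columns `a`, `b`, `c`, and uses `add(..., fill_value=0)` so that labels absent from the right-hand side count as zero.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0], "c": [100.0, 200.0, 300.0]}
)

# Column by column: `s` only has entries for rows 0 and 2.
s = pd.Series([5.0, 7.0], index=[0, 2])
by_column = df.copy()
by_column["a"] = by_column["a"].add(s, fill_value=0)

# All at once: a sparse "delta" frame covering only the cells to change.
delta = pd.DataFrame(
    {"a": [5.0, float("nan")], "c": [float("nan"), 7.0]}, index=[0, 2]
)
all_at_once = df.add(delta, fill_value=0)
```

Both operations align on index (and column) labels, so the order of rows in `s` or `delta` does not matter.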
awesome. thanks again. |
I kind of use iterrows when there's a row-by-row operation that includes a dot product of a subset of columns, i.e. something like:
x = pd.DataFrame(..., columns=['a', 'b', 'c', 'timestamp'])
for (_, abc), timestamp in izip(x[['a', 'b', 'c']].iterrows(),
                                x['timestamp']):
    m = get_matrix(timestamp)
    yield m.dot(abc)
Is there a better way to perform something like that? |
@immerrr no that is reasonable (I would do this by transposing first and use I often have this issue with say panels (and higher ndims) when I am doing an apply like operation, need to transpose to get good slicing perf (and also want to store them in consistent dimensional space). |
Wouldn't timestamp column force it to be cast to object dtype when transposed? |
subset then transpose I think would work (assume abc are floats) e.g.
|
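The example for "subset then transpose" did not survive in this copy. A possible reading, sketched under assumptions: `get_matrix` below is a hypothetical stand-in for the helper mentioned earlier, and the data is made up. Selecting the float columns before transposing keeps the result float64, whereas transposing the whole frame would cast it to object because of the timestamp column.

```python
import numpy as np
import pandas as pd


def get_matrix(timestamp):
    # Hypothetical stand-in for the helper named in the thread.
    return np.eye(3) * timestamp.day


x = pd.DataFrame(
    {
        "a": [1.0, 2.0],
        "b": [3.0, 4.0],
        "c": [5.0, 6.0],
        "timestamp": pd.to_datetime(["2014-05-01", "2014-05-02"]),
    }
)

# Subset first, then transpose: each original row becomes a float column.
abc_t = x[["a", "b", "c"]].T

results = [
    get_matrix(ts).dot(abc_t[label].to_numpy())
    for label, ts in zip(abc_t.columns, x["timestamp"])
]
```

Whether this is faster than `iterrows()` depends on the pandas version and on how expensive the per-row operation is, as the replies in the thread point out.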
Ah, that. I didn't see any tangible improvement in 0.12.0 (which we're still using) so I went with the natural way (transpose does help with |
yeh I suspect your perf limit will be from the actual operation (and not slicing) |
In my use case I have a dataframe that I need to transform into a list of dictionaries (to record to a timeseries db), and I was seeing odd behaviour until I read the docs (more carefully) and realized iterrows may actually change the underlying data type, e.g. an integer could become a double. Now I am going to use itertuples as suggested, but I just wanted to mention that sometimes the reason for iterating over rows is some sort of transformation and not really math, so apply is not intuitive. |
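A small sketch of the dtype effect described above, assuming a hypothetical frame with an integer column `n` and a float column `x`. `iterrows()` packs each row into one Series, so the integer is upcast to float; `itertuples()` reads each column separately and keeps the integer.

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2], "x": [0.5, 1.5]})

# iterrows: the row is a single float64 Series, so `n` arrives as a float.
_, first_row = next(df.iterrows())
n_from_iterrows = first_row["n"]

# itertuples: each field keeps its column's type; _asdict() gives a mapping.
records = [row._asdict() for row in df.itertuples(index=False)]
```

The `records` list can then be handed to whatever writes to the database.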
perf discussion: http://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues/24871316#24871316