DOC: df.iterrows() returns view for homogenous columns, and copy for heterogenous / perf warning #7194

shashispace · 2014-05-21T17:34:24Z

perf discussion: http://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues/24871316#24871316

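import pandas as pd
df = pd.DataFrame({'a': range(5), 'b': range(5)})
print(df)

# all columns share one dtype: the rows behave like views, so df changes (see output)
for _, row in df.iterrows():
    row.a += 1
print(df)

df['c'] = 'what'

# mixed dtypes: the rows are copies, so df is left unchanged (see output)
for _, row in df.iterrows():
    row.b += 1
print(df)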

   a  b
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4

[5 rows x 2 columns]
   a  b
0  1  0
1  2  1
2  3  2
3  4  3
4  5  4

[5 rows x 2 columns]
   a  b     c
0  1  0  what
1  2  1  what
2  3  2  what
3  4  3  what
4  5  4  what

[5 rows x 3 columns]


jreback · 2014-05-21T17:36:49Z

and? lots of pandas operations do this
more efficient this way

shashispace · 2014-05-21T18:35:37Z

Hmm... that's probably true. But it leads to newbies wasting lots of time... anyway, I thought consistency should be preferred over speed.

jreback · 2014-05-21T18:50:56Z

This is a very common idiom in Python in general: iterating over a list-like. You should NEVER modify something you are iterating over. This is NOT a copy in Python space either (though I think it IS allowed to be a copy; maybe in Python 3 it is, not sure).

So maybe a doc warning in that section would be good, informing users that setting values using ANY of the iteration methods IS NOT a good idea.

can you do a pull-request?
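
A minimal sketch of what such a warning could point users to: instead of assigning to the row object that `iterrows()` yields, update the frame itself. The frame below just mirrors the toy example at the top of the thread. Whether a row is a view or a copy depends on the pandas version and the column dtypes, so treat the commented-out loop as unreliable, not as guaranteed to fail.

```python
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": range(5)})
df["c"] = "what"  # mixed dtypes, as in the second half of the toy example

# Unreliable: changes made to `row` may or may not reach `df`.
# for _, row in df.iterrows():
#     row["b"] += 1

# Vectorized update: behaves the same whatever the column dtypes are.
df["b"] += 1

# If a loop really is needed, write through the frame, not the row object.
for label in df.index:
    df.at[label, "a"] = df.at[label, "a"] + 10

print(df)
```

Both updates go through the DataFrame's own setters, so the result does not depend on the view-versus-copy distinction discussed above.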

shashispace · 2014-05-21T20:22:26Z

So why not make it consistent and return a copy even for homogenous dataframes? (I guess you've already said that it is more efficient.)

I had this issue with something I was running, and I couldn't figure out what was going on, because a toy example I constructed behaved differently from the main code (all of its columns were the same type). I think others would have this issue too, so it would be best to make it consistent between the two cases. I have nothing against returning a copy.

Sorry, I don't know enough programming to create a proper pull request... embarrassing, I know. I should dive into this stuff. It would be good for me... any tips on how to get started?

cpcloud · 2014-05-21T20:23:44Z

Check the wiki; there are a lot of tips on contributing.

jreback · 2014-05-21T21:46:15Z

@shashispace well, the performance is worth it: we avoid making a copy when people do iterrows (which, FYI, is never necessary, as there are much better methods). So a doc warning would be in order, but that's about it.

shashispace · 2014-05-21T22:04:29Z

Sounds good. Thanks for indulging me. What are some of the better ways to iterate? I was doing:
for i in df.index:
    df.loc[i, 'a'] += 1

(that's a bad example, but you get the idea - I need to pick the column inside the loop...) I found this way considerably slower, though I didn't benchmark it or anything.

jreback · 2014-05-21T22:09:03Z

you don't iterate or use loops, use a mask instead, e.g.

df.loc[df['a']>5,'a'] += 1

the df['a']>5 produces a mask, see here: http://pandas-docs.github.io/pandas-docs-travis/indexing.html#boolean-indexing
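
As a small illustration of the mask approach, here is a sketch on a made-up three-row frame (the data is an assumption, not from the thread). It uses `.loc`, the label-based indexer; `.ix` no longer exists in pandas 1.0 and later.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 6, 9], "b": [1, 1, 1]})

mask = df["a"] > 5        # boolean Series, one entry per row
df.loc[mask, "a"] += 1    # only the rows where the mask is True are updated

print(mask.tolist())
print(df)
```

The comparison and the update each run once over the whole column, so no Python-level loop is involved.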

shashispace · 2014-05-21T22:43:18Z

I think I didn't explain it well, let me try again:

I have a dataframe with a large number of columns, and I need to add a number to a different column in each row. In row 1 I might add to 'a', in row 13 I might need to add to column 'b', etc.

Writing this, I realize that I can iterate over the columns, which would be way more efficient, and I can write vectorized code for that. I can't remember if MATLAB allowed writing to random columns in different rows; is there a vectorized way to do that in pandas?

Thanks so so much - greatly appreciate your replies. It's definitely motivated me to get more involved than just using pandas.

jreback · 2014-05-21T22:54:32Z

pandas aligns, so just create the rhs with the values and it will work, e.g. something like

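s = Series([1,2,3],index=[10,12,20])
# axis=0 aligns s on the row index (a plain += would align s with the column names)
df[['a','c']] = df[['a','c']].add(s.reindex(df.index, fill_value=0), axis=0)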

would add the indicated values from s to columns a and c

so easiest to do this column by column, of course depending how you are determining what you are adding in the first place.
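
To make the alignment concrete, here is a hedged sketch with invented data. One detail worth stating: when a Series is combined with a DataFrame through the arithmetic operators, pandas matches the Series index against the column names by default, so matching on row labels needs `axis=0` on the method form.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [0.0, 0.0, 0.0], "c": [1.0, 1.0, 1.0]},
    index=[10, 12, 20],
)
# Deliberately out of order: alignment is by label, not by position.
s = pd.Series([1, 2, 3], index=[20, 10, 12])

# axis=0 matches s against the row labels of the selected columns.
df[["a", "c"]] = df[["a", "c"]].add(s, axis=0)

print(df)
```

Row 10 receives 2, row 12 receives 3 and row 20 receives 1, regardless of the order in which `s` lists them.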

might be even easier to simply construct a frame that you are adding, e.g.

frame_to_add = DataFrame({'a': Series([1,2,3],index=[10,20,30]), 'c': Series([1,2,3],index=[4,5,6])})
df = df.add(frame_to_add, fill_value=0)

does it all at once
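
A sketch of the "construct a frame and add it" idea, on invented data. It uses `DataFrame.add` with `fill_value=0` so that cells present in only one of the two frames are treated as zero instead of turning into NaN; this is one possible way to do it, not necessarily what the comment above had in mind.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, 1.0, 1.0], "c": [5.0, 5.0, 5.0]},
    index=[10, 20, 30],
)

# Each column carries its own sparse set of row labels.
frame_to_add = pd.DataFrame({
    "a": pd.Series([1, 2, 3], index=[10, 20, 30]),
    "c": pd.Series([7], index=[20]),
})

# Missing cells on either side count as 0 before the addition.
result = df.add(frame_to_add, fill_value=0)

print(result)
```

Note that the result covers the union of both indexes, so labels that exist only in `frame_to_add` would show up as new rows.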

shashispace · 2014-05-21T23:02:46Z

awesome. thanks again.

immerrr · 2014-05-22T12:15:58Z

@jreback

"make a copy when people do iterrows (which FYI, is never necessary as there are much better methods)"

I kind of use iterrows when there's a row-by-row operation that includes dot-product of a subset of columns, i.e. something like:

x = pd.DataFrame(..., columns=['a', 'b', 'c', 'timestamp'])

for (_, abc), timestamp in izip(x[['a', 'b', 'c']].iterrows(),
                                x['timestamp']):
    m = get_matrix(timestamp)
    yield m.dot(abc)

Is there a better way to perform something like that?

jreback · 2014-05-22T12:27:47Z

@immerrr no, that is reasonable. I would do this by transposing first and using iteritems(), though: if it's a large frame, I think you might get better slicing perf, as you are indexing off of the info axis, if I am reading what you are doing correctly.

I often have this issue with, say, panels (and higher ndims): when I am doing an apply-like operation, I need to transpose to get good slicing perf (and I also want to store them in a consistent dimensional space).

immerrr · 2014-05-22T12:36:54Z

Wouldn't timestamp column force it to be cast to object dtype when transposed?

jreback · 2014-05-22T12:40:39Z

subset then transpose I think would work (assume abc are floats)

e.g.

for (_, abc), timestamp in izip(df[['a','b','c']].T.iteritems(), df['timestamp']):
   ...
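
For readers on Python 3 and a recent pandas, the same row-by-row pattern can be sketched without `izip` or `iteritems()` by iterating over the NumPy array of the numeric subset. `get_matrix` below is a hypothetical stand-in (the real function is not shown in the thread), and `transform_rows` is just an illustrative name.

```python
import numpy as np
import pandas as pd


def get_matrix(timestamp):
    """Hypothetical stand-in: a 3x3 matrix that depends on the timestamp."""
    return np.eye(3) * timestamp.day


def transform_rows(frame):
    # Subset the float columns first so the array keeps a numeric dtype.
    values = frame[["a", "b", "c"]].to_numpy()
    for abc, timestamp in zip(values, frame["timestamp"]):
        yield get_matrix(timestamp).dot(abc)


x = pd.DataFrame({
    "a": [1.0, 2.0],
    "b": [3.0, 4.0],
    "c": [5.0, 6.0],
    "timestamp": pd.to_datetime(["2014-05-02", "2014-05-03"]),
})

results = list(transform_rows(x))
print(results)
```

Each `abc` here is a plain NumPy row, so no Series is built per iteration; as noted in the thread, the matrix operation itself may still dominate the runtime.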

immerrr · 2014-05-22T13:02:29Z

Ah, that. I didn't see any tangible improvement in 0.12.0 (which we're still using) so I went with the natural way (transpose does help with to_dict though).

jreback · 2014-05-22T13:04:07Z

yeh I suspect your perf limit will be from the actual operation (and not slicing)

dashesy · 2015-03-31T01:59:23Z

In my use-case I have a dataframe that I need to transform to a list of dictionaries (to record to a timeseries db), and I was seeing odd behaviour until I read the docs (more carefully) and realized iterrows actually may change the underlying data type, e.g. an integer could become a double. Now I am going to use itertuples as suggested, but I just wanted to mention that sometimes the reason for iterating over rows is some sort of transformation and not really math, so apply is not intuitive.
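
The dtype change described here can be reproduced on a small made-up frame (the column names `hits` and `value` are assumptions for illustration). `iterrows()` packs each row into a single Series, so an integer next to a float is upcast, while `itertuples()` and `to_dict(orient="records")` keep the per-column values.

```python
import pandas as pd

df = pd.DataFrame({"hits": [1, 2], "value": [1.5, 2.5]})

# iterrows(): one Series per row, so the whole row shares a single dtype.
_, row = next(df.iterrows())
iterrows_dtype = str(row.dtype)

# itertuples(): values come straight from each column, integers stay integers.
first = next(df.itertuples(index=False))
itertuples_value = first.hits

# For a list of dictionaries, to_dict can do the whole conversion.
records = df.to_dict(orient="records")

print(iterrows_dtype, itertuples_value, records)
```

For the list-of-dictionaries use case, `to_dict(orient="records")` avoids writing the loop at all.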

jreback added the Docs label May 21, 2014

jreback added this to the 0.14.1 milestone May 21, 2014

shashispace closed this as completed May 21, 2014

jreback reopened this May 21, 2014

jreback modified the milestones: 0.15.0, 0.14.1 Jun 17, 2014

jreback mentioned this issue Jul 7, 2014

.iterrows takes too long and generate large memory footprint #7683

Closed

jreback modified the milestones: 0.15.0, 0.15.1 Jul 7, 2014

jreback changed the title ~~df.iterrows() returns view for homogenous columns, and copy for heterogenous.~~ DOC: df.iterrows() returns view for homogenous columns, and copy for heterogenous / perf warning Jul 7, 2014

jreback added the Good as first PR label Jul 7, 2014

jreback modified the milestones: 0.16, 0.15.0 Sep 9, 2014

jreback modified the milestones: 0.16, 0.15.1 Oct 7, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jorisvandenbossche mentioned this issue Jul 29, 2015

DOC: improve docs on iteration #10680

Merged

jorisvandenbossche closed this as completed in #10680 Aug 2, 2015

jorisvandenbossche modified the milestones: 0.17.0, Next Major Release Aug 2, 2015
