Skip to content

Fixes timezone drop when encoding it in msgpack #25195

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

cgangwar11
Copy link
Contributor

@cgangwar11 cgangwar11 commented Feb 6, 2019

After

In [1]: import numpy as np
   ...: import pandas as pd
   ...:
   ...: idx = pd.date_range(start='2018-12-02 14:50:00-07:00', end='2018-12-03 03:11:00-07:00', freq='1min')
   ...:
   ...:
   ...: df = pd.DataFrame(np.random.randn(len(idx), 1), index=idx, columns=['A'])

In [2]: df.index[0]
Out[2]: Timestamp('2018-12-02 14:50:00-0700', tz='pytz.FixedOffset(-420)', freq='T')

In [3]: df.to_msgpack('~/foo.msgpack')

In [4]: df = pd.read_msgpack('~/foo.msgpack')

In [5]: df.index[0]
Out[5]: Timestamp('2018-12-02 14:50:00-0700', tz='pytz.FixedOffset(-420)', freq='T')

@pep8speaks
Copy link

pep8speaks commented Feb 6, 2019

Hello @cgangwar11! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 09, 2019 at 18:27 Hours UTC

@cgangwar11 cgangwar11 changed the title BUG: #25148 Fixes timezone drop when encoding it in msgpack Feb 6, 2019
@codecov
Copy link

codecov bot commented Feb 6, 2019

Codecov Report

Merging #25195 into master will decrease coverage by 49.5%.
The diff coverage is 33.33%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #25195       +/-   ##
===========================================
- Coverage   92.35%   42.84%   -49.51%     
===========================================
  Files         166      166               
  Lines       52314    52315        +1     
===========================================
- Hits        48312    22414    -25898     
- Misses       4002    29901    +25899
Flag Coverage Δ
#multiple ?
#single 42.84% <33.33%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/packers.py 14.89% <33.33%> (-73.32%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.15%) ⬇️
... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 638ddeb...99cab5a. Read the comment docs.

@codecov
Copy link

codecov bot commented Feb 6, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@2448e52). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #25195   +/-   ##
=========================================
  Coverage          ?   92.16%           
=========================================
  Files             ?      166           
  Lines             ?    52314           
  Branches          ?        0           
=========================================
  Hits              ?    48214           
  Misses            ?     4100           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.62% <100%> (?)
#single 42.25% <9.52%> (?)
Impacted Files Coverage Δ
pandas/io/packers.py 88.85% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2448e52...b1347ae. Read the comment docs.

@@ -52,7 +52,7 @@ Bug Fixes
**I/O**

- Bug in reading a HDF5 table-format ``DataFrame`` created in Python 2, in Python 3 (:issue:`24925`)
-
- Bug in retaining timezone in DateTimeIndex when saving and loading the DataFrame from msgpack. (:issue:`25148`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double back ticks around msgpack. Single backticks around DatetimeIndex and DataFrame with :class: before each

@@ -393,6 +393,12 @@ def categorical_index(self):
result = self.encode_decode(df)
tm.assert_frame_equal(result, df)

def test_timezone_encode_decode(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you parameterize using the tz_aware_fixture?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure!!

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I suspect the fix is too narrowly focused on FixedOffets.

@@ -390,7 +390,8 @@ def encode(obj):

# store tz info and data as UTC
if tz is not None:
tz = u(tz.zone)
tz = u(tz.zone) if not isinstance(tz, _FixedOffset) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the timezone is not a FixedOffset? e.g. pytz.UTC or dateutil.tz.tzutc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke
Encoding process of TimeStamp is using zone attribute and normalize the datetime to utc.
Now there is no zone attribute in FixedOffset hence we will not be able to encode it and loosing the offset value in the process.
So i extracted offset value and encode that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> Timestamp('2018-12-02 14:50:00',tz='UTC').tzinfo.zone
'UTC'
>>> Timestamp('2018-12-02 14:50:00',tz=pytz.FixedOffset(-420)).tzinfo.zone is None
True

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks for the explanation.

I'd change this to a try/except block instead of an if/else.

Another option would be to str the timezone otherwise?

In [3]: str(tz)
Out[3]: 'pytz.FixedOffset(60)'

In [4]: import dateutil

In [5]: str(dateutil.tz.tzutc())
Out[5]: 'tzutc()'

In [7]: dateutil.tz.tzutc().zone
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-d8163524e913> in <module>()
----> 1 dateutil.tz.tzutc().zone

AttributeError: 'tzutc' object has no attribute 'zone'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke
Converting the timezone object into string to encode is a great idea.
But I will have to use something like this in decoding side.

tz = eval('pytz.FixedOffset(60)' )

Using eval doesn't seem like a good idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I'd prefer your minutes extraction method then.

I imagine there will need to be separate handling for dateutil timezones all together.

@mroeschke mroeschke added IO Msgpack Timezones Timezone data dtype labels Feb 6, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there must be something going else wrong here. The data is stored as utc and the tz is stored as well. what exactly is the problem?

@@ -52,7 +52,7 @@ Bug Fixes
**I/O**

- Bug in reading a HDF5 table-format ``DataFrame`` created in Python 2, in Python 3 (:issue:`24925`)
-
- Bug in retaining timezone in :class:`DateTimeIndex` when saving and loading the :class"`DataFrame` from ``msgpack``. (:issue:`25148`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DatetimeIndex, can you use :meth:read_msgpack here

Copy link
Contributor Author

@cgangwar11 cgangwar11 Feb 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
Yes we were encoding tz using two value (zone as a str) and (Time value in utc format).
As per pandas DateTimeIndex documentation tz could be

tz : pytz.timezone or dateutil.tz.tzfile

The current encoding version is capable of encoding pytz.timezone but not the dateutil timezones.
Infact code will throw this exception when we try that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> df1 = pd.DataFrame({'A':pd.date_range(start="2018-08-01",end="2018-08-30",tz=dateutil.tz.tzutc())})
>>> df1.to_msgpack("~/xyz.msg")
>>> a = pd.read_msgpack("~/xyz.msg")
Traceback (most recent call last):
  File "/home/chandan/Softwares/pandas/pandas/core/dtypes/common.py", line 2037, in pandas_dtype
    npdtype = np.dtype(dtype)
TypeError: Invalid datetime unit in metadata string "[ns, tzutc()]"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chandan/Softwares/pandas/pandas/io/packers.py", line 207, in read_msgpack
    return read(fh)
  File "/home/chandan/Softwares/pandas/pandas/io/packers.py", line 187, in read
    unpacked_obj = list(unpack(fh, encoding=encoding, **kwargs))
  File "pandas/io/msgpack/_unpacker.pyx", line 483, in pandas.io.msgpack._unpacker.Unpacker.__next__
    return self._unpack(unpack_construct, None, 1)
  File "pandas/io/msgpack/_unpacker.pyx", line 410, in pandas.io.msgpack._unpacker.Unpacker._unpack
    ret = execute(&self.ctx, self.buf, self.buf_tail, &self.buf_head)
  File "/home/chandan/Softwares/pandas/pandas/io/packers.py", line 667, in decode
    blocks = [create_block(b) for b in obj[u'blocks']]
  File "/home/chandan/Softwares/pandas/pandas/io/packers.py", line 667, in <listcomp>
    blocks = [create_block(b) for b in obj[u'blocks']]
  File "/home/chandan/Softwares/pandas/pandas/io/packers.py", line 648, in create_block
    b[u'compress']), b[u'shape'])
  File "/home/chandan/Softwares/pandas/pandas/io/packers.py", line 322, in unconvert
    dtype = pandas_dtype(dtype).base
  File "/home/chandan/Softwares/pandas/pandas/core/dtypes/common.py", line 2043, in pandas_dtype
    dtype))
TypeError: data type 'datetime64[ns, tzutc()]' not understood

@jreback
Copy link
Contributor

jreback commented Mar 20, 2019

an you merge master and update

@jreback
Copy link
Contributor

jreback commented May 12, 2019

good idea, but closing as stale

@jreback jreback closed this May 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

msgpack drops timezone info
4 participants