Skip to content

Add metadata to Session objects #659

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

alixdamman
Copy link
Collaborator

Needs changelog

@alixdamman alixdamman requested a review from gdementen June 29, 2018 13:33
@alixdamman alixdamman mentioned this pull request Jul 4, 2018
create a session with metadata

>>> # Python <= 3.5
>>> s = Session(arr1=arr1, arr2=arr2, meta=[('title', 'my title'), ('author', 'John Smith')])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is a good idea to recommend the keyword argument syntax (for arrays) for <= Python 3.5

return self._objects[key]
else:
raise AttributeError("'{}' object has no attribute '{}'".format(self.__class__.__name__, key))

def __setattr__(self, key, value):
self._objects[key] = value
if key is 'meta':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem

@@ -172,13 +202,21 @@ def __delitem__(self, key):
del self._objects[key]

def __getattr__(self, key):
if key in self._objects:
if key is 'meta':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't it possible to move this to a property? If that is possible it would probably be cleaner...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started by trying to implement meta as a property but I faced some difficulties to make it work with Python 2.7

return tmpl.format(key=k, value=str(v))
else:
return tmpl(k, v)

return '\n'.join(display(k, v) for k, v in self.items())
return '\n'.join(display(k, v) for k, v in list(self.meta.items()) + list(self.items()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above. Having metadata "items" at the same level than self.items is not a good option IMO.

s = pickle.dumps(original)
res = pickle.loads(s)
assert res.equals(original)
if PY2:
from larray.core.metadata import Metadata
assert Metadata(res.meta) == meta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is odd. I can understand the unpickling of the metadata does not return a Metadata object on python2, but shouldn't it be converted to one when it is "reattached" to the Session?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not proud of it. After trying to implement meta in Session as a property and facing difficulties with Python 2.7 I just gave up for this part. I will change that.

@@ -1091,7 +1165,7 @@ def summary(self, template=None):
- for groups: 'key', 'name', 'axis_name', 'labels' and 'length',
- for axes: 'key', 'name', 'labels' and 'length',
- for arrays: 'key', 'axes_names', 'shape', 'dtype' and 'title',
- for all other types: 'key', 'value'.
- for metadata: 'key', 'value'.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about other kinds of objects?


def display(k, v):
t = Group if isinstance(v, Group) else type(v)
if not isinstance(v, (Axis, Group, LArray)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would much rather have a single template for Metadata objects, so that users can chose to indent it more than the rest or not to display it at all. It would also (re-) allow user-defined objects in the Session.

def save(self):
if len(self.axes) > 0:
df = _axes_to_df(self.axes.values())
df.to_excel(self.handle, '__axes__', engine='xlsxwriter')
df.to_excel(self.handle, '__axes__', index=False, engine='xlsxwriter')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this change really related to session metadata???

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, sorry.

# DATAFRAME <--> AXES, GROUPS #
# ################################## #
def _metadata_to_series(metadata):
s = pd.Series(data=list(metadata.values()), index=metadata.keys(), name='')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you use a 1D object LArray for this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because LArray objects are converted to pandas objects before to be dumped or loaded. I think using 1D array here instead of a pandas Series will add an unnecessary layer. Especially since we need to iterate over each element at reading to convert each metadata to the right type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a case of premature optimization. Passing via LArray has negligible cost in my working copy (disclaimer: I modified .series slightly to go faster for 1D and I had already a few speed optimizations all over the place in my working copy, so this might not be the case on your computer).

>>> metadata = {'title': 'yad aez dgrz jgzeigfejzi', 'date': datetime.now()}
>>> %timeit l = LArray(list(metadata.values()), Axis(metadata.keys())); s = l.series; s.name = ''
1.22 ms ± 57.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit s = pd.Series(data=list(metadata.values()), index=metadata.keys(), name='')
1.3 ms ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Besides, I think that it might be a case where avoiding pandas structures altogether would be faster. Going via pandas is a net win for reading .csv files because their parser is so fast, but for anything else, we only gain convenience, not performance. This is probably especially true when we go via xlwings, which in the end must transform everything to "simple" Python types. Anyway, you can leave this as-is, but please at least add a comment that this is done for performance reasons.


def _series_to_metadata(series):

def _convert_value(key, value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function does not seem to use the key argument. Remove it.

# DATAFRAME <--> AXES, GROUPS #
# ################################## #
def _metadata_to_series(metadata):
s = pd.Series(data=list(metadata.values()), index=metadata.keys(), name='')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a case of premature optimization. Passing via LArray has negligible cost in my working copy (disclaimer: I modified .series slightly to go faster for 1D and I had already a few speed optimizations all over the place in my working copy, so this might not be the case on your computer).

>>> metadata = {'title': 'yad aez dgrz jgzeigfejzi', 'date': datetime.now()}
>>> %timeit l = LArray(list(metadata.values()), Axis(metadata.keys())); s = l.series; s.name = ''
1.22 ms ± 57.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit s = pd.Series(data=list(metadata.values()), index=metadata.keys(), name='')
1.3 ms ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Besides, I think that it might be a case where avoiding pandas structures altogether would be faster. Going via pandas is a net win for reading .csv files because their parser is so fast, but for anything else, we only gain convenience, not performance. This is probably especially true when we go via xlwings, which in the end must transform everything to "simple" Python types. Anyway, you can leave this as-is, but please at least add a comment that this is done for performance reasons.

def _series_to_metadata(series):

def _convert_value(key, value):
value = str(value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this line needed? It seems weird to convert numeric types to string to convert them back on the next line to numeric?


def _convert_value(key, value):
value = str(value)
value = pd.to_numeric([value], errors='ignore')[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you use [value] instead of just value? pd.to_numeric seems to handle scalars just fine (even if that is undocumented)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What version of pandas do you have? If I try to pass value instead of [value], I get the following error message: TypeError: 'int' object is not subscriptable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> pd.to_numeric('10', errors='ignore')
10
>>> pd.to_numeric(10, errors='ignore')
10
>>> pd.__version__
'0.20.3'

@alixdamman alixdamman requested a review from gdementen July 6, 2018 13:51
@@ -498,6 +564,7 @@ def to_excel(self, fname, names=None, overwrite=True, display=False, **kwargs):
- each array is saved in a separate sheet
- all Axis objects are saved together in the same sheet named __axes__
- all Group objects are saved together in the same sheet named __groups__
- all metadata are saved together in the same sheet named __metadata__
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/metadata/session metadata/ (to avoid confusion with array metadata)
s/are/is/

@@ -540,6 +610,7 @@ def to_csv(self, fname, names=None, display=False, **kwargs):
- each array is saved in a separate file
- all Axis objects are saved together in the same CSV file named __axes__.csv
- all Group objects are saved together in the same CSV file named __groups__.csv
- all metadata are saved together in the same CSV file named __metadata__.csv
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/metadata/session metadata/ (to avoid confusion with array metadata)
s/are/is

@@ -815,6 +889,10 @@ def element_equals(self, other):
-------
Boolean LArray

Notes
-----
Metadata are ignored.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/are/is

@@ -885,9 +963,13 @@ def equals(self, other):
-------
True if arrays of both sessions are all equal, False otherwise.

Notes
-----
Metadata are ignored.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/are/is

def _read_metadata(self):
filepath = self._to_filepath('__metadata__')
if os.path.isfile(filepath):
meta = read_csv(filepath, wide=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't expect you to actually make the change from pandas to larray in this PR, but it is a nice surprise. I find it much cleaner this way.

def __larray__(self):
from larray.core.array import LArray
from larray.core.axis import Axis
return LArray(list(self.values()), Axis(self.keys(), name='METADATA'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1150,13 +1240,26 @@ def summary(self, template=None):
template[Group] = "{key}: {axis_name}{labels} >> {name} ({length})"
if LArray not in template:
template[LArray] = "{key}: {axes_names} ({shape}) [{dtype}]\n {title}"
if Metadata not in template:
template[Metadata] = "{key} (metadata): {value}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already better than before, but I don't think you understood what I meant. I meant something like:

if Metadata not in template:
    template[Metadata] = "{value}"

res = ''
if len(self.meta) > 0:
res = '\n'.join(display_metadata(k, v) for k, v in self.meta.items()) + '\n'
res += '\n'.join(display_items(k, v) for k, v in self.items())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then here, something like:

if self.meta:
    res += display_items('Metadata', v) + '\n'
res += '\n'.join(display_items(k, v) for k, v in self.items())

The goal being that users can produce something like the following if they want to (maybe it should be the default):

Metadata: 
    title: my title
    author: John Smith
axis1: a ['a0' 'a1' 'a2'] (3)
group1 -> a01: a['a0', 'a1'] (2)

Now, I am unsure it's worth it to change what you have but having no way to distinguish a metadata item from a "normal" item clearly bothers me.

@alixdamman alixdamman requested a review from gdementen July 9, 2018 15:42
@alixdamman
Copy link
Collaborator Author

There is a bug with conda for Python 2.7 : conda/conda#7503

@@ -48,7 +48,7 @@ New features

Warnings:

- Currently, only the HDF (.h5) file format supports saving and loading metadata.
- Currently, only the HDF (.h5) file format supports saving and loading array metadata.
- Metadata are not kept when actions or methods are applied on a array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Metadata are/Metadata is/
s/a/an/

@@ -67,6 +67,31 @@ New features
>>> arr.meta.title = 'array for testing'


* allowed sessions to have metadata.
Like for arrays, session metadata can be accessed using the syntax ``session.meta.name``.
See bellow for use cases:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/bellow/below/


Warnings:

- Excel, CSV and HDF (.h5) file formats support saving and loading session metadata.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • These are all the common formats, so it is a bit weird to have this in a Warning section, but maybe putting that in a Notes section by itself would be impractical.
  • I would emphasize more the difference with arrays by saying something like "Contrary to array metadata, saving and loading session metadata is supported for all current session file formats: Excel, CSV and HDF (.h5).

Warnings:

- Excel, CSV and HDF (.h5) file formats support saving and loading session metadata.
- Metadata are not kept when actions or methods are applied on a session
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/are/is/ 😄


Warnings
--------
Metadata are not kept when actions or methods are applied on a session
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are...is...

return '\n'.join(display(k, v) for k, v in self.items())
res = ''
if len(self.meta) > 0:
res = 'Metadata:\n'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not what I had in mind, but oh well... I am tired of this...

@alixdamman alixdamman requested a review from gdementen July 11, 2018 14:23
Copy link
Contributor

@gdementen gdementen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alixdamman alixdamman force-pushed the add_metadata_to_larray_objects branch from cce9350 to d6bf4c0 Compare July 11, 2018 15:24
@alixdamman alixdamman merged commit 0c7af11 into larray-project:master Jul 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants