-
Notifications
You must be signed in to change notification settings - Fork 6
Add metadata to Session objects #659
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metadata to Session objects #659
Conversation
larray/core/session.py
Outdated
create a session with metadata | ||
|
||
>>> # Python <= 3.5 | ||
>>> s = Session(arr1=arr1, arr2=arr2, meta=[('title', 'my title'), ('author', 'John Smith')]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is a good idea to recommend the keyword argument syntax (for arrays) for <= Python 3.5
larray/core/session.py
Outdated
return self._objects[key] | ||
else: | ||
raise AttributeError("'{}' object has no attribute '{}'".format(self.__class__.__name__, key)) | ||
|
||
def __setattr__(self, key, value): | ||
self._objects[key] = value | ||
if key is 'meta': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idem
larray/core/session.py
Outdated
@@ -172,13 +202,21 @@ def __delitem__(self, key): | |||
del self._objects[key] | |||
|
|||
def __getattr__(self, key): | |||
if key in self._objects: | |||
if key is 'meta': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't it possible to move this to a property? If that is possible it would probably be cleaner...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started by trying to implement meta
as a property but I faced some difficulties to make it work with Python 2.7
larray/core/session.py
Outdated
return tmpl.format(key=k, value=str(v)) | ||
else: | ||
return tmpl(k, v) | ||
|
||
return '\n'.join(display(k, v) for k, v in self.items()) | ||
return '\n'.join(display(k, v) for k, v in list(self.meta.items()) + list(self.items())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above. Having metadata "items" at the same level than self.items is not a good option IMO.
larray/tests/test_session.py
Outdated
s = pickle.dumps(original) | ||
res = pickle.loads(s) | ||
assert res.equals(original) | ||
if PY2: | ||
from larray.core.metadata import Metadata | ||
assert Metadata(res.meta) == meta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is odd. I can understand the unpickling of the metadata does not return a Metadata object on python2, but shouldn't it be converted to one when it is "reattached" to the Session?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not proud of it. After trying to implement meta
in Session as a property and facing difficulties with Python 2.7 I just gave up for this part. I will change that.
larray/core/session.py
Outdated
@@ -1091,7 +1165,7 @@ def summary(self, template=None): | |||
- for groups: 'key', 'name', 'axis_name', 'labels' and 'length', | |||
- for axes: 'key', 'name', 'labels' and 'length', | |||
- for arrays: 'key', 'axes_names', 'shape', 'dtype' and 'title', | |||
- for all other types: 'key', 'value'. | |||
- for metadata: 'key', 'value'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about other kinds of objects?
larray/core/session.py
Outdated
|
||
def display(k, v): | ||
t = Group if isinstance(v, Group) else type(v) | ||
if not isinstance(v, (Axis, Group, LArray)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would much rather have a single template for Metadata objects, so that users can chose to indent it more than the rest or not to display it at all. It would also (re-) allow user-defined objects in the Session.
def save(self): | ||
if len(self.axes) > 0: | ||
df = _axes_to_df(self.axes.values()) | ||
df.to_excel(self.handle, '__axes__', engine='xlsxwriter') | ||
df.to_excel(self.handle, '__axes__', index=False, engine='xlsxwriter') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this change really related to session metadata???
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, sorry.
larray/inout/pandas.py
Outdated
# DATAFRAME <--> AXES, GROUPS # | ||
# ################################## # | ||
def _metadata_to_series(metadata): | ||
s = pd.Series(data=list(metadata.values()), index=metadata.keys(), name='') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you use a 1D object LArray for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because LArray objects are converted to pandas objects before to be dumped or loaded. I think using 1D array here instead of a pandas Series will add an unnecessary layer. Especially since we need to iterate over each element at reading to convert each metadata to the right type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a case of premature optimization. Passing via LArray has negligible cost in my working copy (disclaimer: I modified .series slightly to go faster for 1D and I had already a few speed optimizations all over the place in my working copy, so this might not be the case on your computer).
>>> metadata = {'title': 'yad aez dgrz jgzeigfejzi', 'date': datetime.now()}
>>> %timeit l = LArray(list(metadata.values()), Axis(metadata.keys())); s = l.series; s.name = ''
1.22 ms ± 57.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit s = pd.Series(data=list(metadata.values()), index=metadata.keys(), name='')
1.3 ms ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Besides, I think that it might be a case where avoiding pandas structures altogether would be faster. Going via pandas is a net win for reading .csv files because their parser is so fast, but for anything else, we only gain convenience, not performance. This is probably especially true when we go via xlwings, which in the end must transform everything to "simple" Python types. Anyway, you can leave this as-is, but please at least add a comment that this is done for performance reasons.
larray/inout/pandas.py
Outdated
|
||
def _series_to_metadata(series): | ||
|
||
def _convert_value(key, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function does not seem to use the key argument. Remove it.
larray/inout/pandas.py
Outdated
# DATAFRAME <--> AXES, GROUPS # | ||
# ################################## # | ||
def _metadata_to_series(metadata): | ||
s = pd.Series(data=list(metadata.values()), index=metadata.keys(), name='') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a case of premature optimization. Passing via LArray has negligible cost in my working copy (disclaimer: I modified .series slightly to go faster for 1D and I had already a few speed optimizations all over the place in my working copy, so this might not be the case on your computer).
>>> metadata = {'title': 'yad aez dgrz jgzeigfejzi', 'date': datetime.now()}
>>> %timeit l = LArray(list(metadata.values()), Axis(metadata.keys())); s = l.series; s.name = ''
1.22 ms ± 57.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit s = pd.Series(data=list(metadata.values()), index=metadata.keys(), name='')
1.3 ms ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Besides, I think that it might be a case where avoiding pandas structures altogether would be faster. Going via pandas is a net win for reading .csv files because their parser is so fast, but for anything else, we only gain convenience, not performance. This is probably especially true when we go via xlwings, which in the end must transform everything to "simple" Python types. Anyway, you can leave this as-is, but please at least add a comment that this is done for performance reasons.
larray/inout/pandas.py
Outdated
def _series_to_metadata(series): | ||
|
||
def _convert_value(key, value): | ||
value = str(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this line needed? It seems weird to convert numeric types to string to convert them back on the next line to numeric?
larray/inout/pandas.py
Outdated
|
||
def _convert_value(key, value): | ||
value = str(value) | ||
value = pd.to_numeric([value], errors='ignore')[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you use [value] instead of just value? pd.to_numeric seems to handle scalars just fine (even if that is undocumented)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What version of pandas do you have? If I try to pass value instead of [value], I get the following error message: TypeError: 'int' object is not subscriptable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
>>> pd.to_numeric('10', errors='ignore')
10
>>> pd.to_numeric(10, errors='ignore')
10
>>> pd.__version__
'0.20.3'
larray/core/session.py
Outdated
@@ -498,6 +564,7 @@ def to_excel(self, fname, names=None, overwrite=True, display=False, **kwargs): | |||
- each array is saved in a separate sheet | |||
- all Axis objects are saved together in the same sheet named __axes__ | |||
- all Group objects are saved together in the same sheet named __groups__ | |||
- all metadata are saved together in the same sheet named __metadata__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/metadata/session metadata/ (to avoid confusion with array metadata)
s/are/is/
larray/core/session.py
Outdated
@@ -540,6 +610,7 @@ def to_csv(self, fname, names=None, display=False, **kwargs): | |||
- each array is saved in a separate file | |||
- all Axis objects are saved together in the same CSV file named __axes__.csv | |||
- all Group objects are saved together in the same CSV file named __groups__.csv | |||
- all metadata are saved together in the same CSV file named __metadata__.csv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/metadata/session metadata/ (to avoid confusion with array metadata)
s/are/is
larray/core/session.py
Outdated
@@ -815,6 +889,10 @@ def element_equals(self, other): | |||
------- | |||
Boolean LArray | |||
|
|||
Notes | |||
----- | |||
Metadata are ignored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/are/is
larray/core/session.py
Outdated
@@ -885,9 +963,13 @@ def equals(self, other): | |||
------- | |||
True if arrays of both sessions are all equal, False otherwise. | |||
|
|||
Notes | |||
----- | |||
Metadata are ignored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/are/is
def _read_metadata(self): | ||
filepath = self._to_filepath('__metadata__') | ||
if os.path.isfile(filepath): | ||
meta = read_csv(filepath, wide=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't expect you to actually make the change from pandas to larray in this PR, but it is a nice surprise. I find it much cleaner this way.
larray/core/metadata.py
Outdated
def __larray__(self): | ||
from larray.core.array import LArray | ||
from larray.core.axis import Axis | ||
return LArray(list(self.values()), Axis(self.keys(), name='METADATA')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- we really need a from_dict or to fix implement LArray.from_dict or from_dict top-level function #581
- s/METADATA/metadata/? I usually don't like all capital words :)
larray/core/session.py
Outdated
@@ -1150,13 +1240,26 @@ def summary(self, template=None): | |||
template[Group] = "{key}: {axis_name}{labels} >> {name} ({length})" | |||
if LArray not in template: | |||
template[LArray] = "{key}: {axes_names} ({shape}) [{dtype}]\n {title}" | |||
if Metadata not in template: | |||
template[Metadata] = "{key} (metadata): {value}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already better than before, but I don't think you understood what I meant. I meant something like:
if Metadata not in template:
template[Metadata] = "{value}"
larray/core/session.py
Outdated
res = '' | ||
if len(self.meta) > 0: | ||
res = '\n'.join(display_metadata(k, v) for k, v in self.meta.items()) + '\n' | ||
res += '\n'.join(display_items(k, v) for k, v in self.items()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then here, something like:
if self.meta:
res += display_items('Metadata', v) + '\n'
res += '\n'.join(display_items(k, v) for k, v in self.items())
The goal being that users can produce something like the following if they want to (maybe it should be the default):
Metadata:
title: my title
author: John Smith
axis1: a ['a0' 'a1' 'a2'] (3)
group1 -> a01: a['a0', 'a1'] (2)
Now, I am unsure it's worth it to change what you have but having no way to distinguish a metadata item from a "normal" item clearly bothers me.
There is a bug with conda for Python 2.7 : conda/conda#7503 |
@@ -48,7 +48,7 @@ New features | |||
|
|||
Warnings: | |||
|
|||
- Currently, only the HDF (.h5) file format supports saving and loading metadata. | |||
- Currently, only the HDF (.h5) file format supports saving and loading array metadata. | |||
- Metadata are not kept when actions or methods are applied on a array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Metadata are/Metadata is/
s/a/an/
@@ -67,6 +67,31 @@ New features | |||
>>> arr.meta.title = 'array for testing' | |||
|
|||
|
|||
* allowed sessions to have metadata. | |||
Like for arrays, session metadata can be accessed using the syntax ``session.meta.name``. | |||
See bellow for use cases: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/bellow/below/
|
||
Warnings: | ||
|
||
- Excel, CSV and HDF (.h5) file formats support saving and loading session metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- These are all the common formats, so it is a bit weird to have this in a Warning section, but maybe putting that in a Notes section by itself would be impractical.
- I would emphasize more the difference with arrays by saying something like "Contrary to array metadata, saving and loading session metadata is supported for all current session file formats: Excel, CSV and HDF (.h5).
Warnings: | ||
|
||
- Excel, CSV and HDF (.h5) file formats support saving and loading session metadata. | ||
- Metadata are not kept when actions or methods are applied on a session |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/are/is/ 😄
larray/core/session.py
Outdated
|
||
Warnings | ||
-------- | ||
Metadata are not kept when actions or methods are applied on a session |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are...is...
return '\n'.join(display(k, v) for k, v in self.items()) | ||
res = '' | ||
if len(self.meta) > 0: | ||
res = 'Metadata:\n' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not what I had in mind, but oh well... I am tired of this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
… be consistent with XLWingsHandler --> do not include index of the generated pandas dataframe when dumping axes and groups
…a Session instance
cce9350
to
d6bf4c0
Compare
Needs changelog