Add metadata to Session objects #659

alixdamman · 2018-06-29T13:33:02Z

Needs changelog

gdementen · 2018-07-02T10:12:26Z

larray/core/session.py

+    create a session with metadata
+
+    >>> # Python <= 3.5
+    >>> s = Session(arr1=arr1, arr2=arr2, meta=[('title', 'my title'), ('author', 'John Smith')])


I don't think it is a good idea to recommend the keyword argument syntax (for arrays) for <= Python 3.5

gdementen · 2018-07-02T10:13:54Z

larray/core/session.py

            return self._objects[key]
        else:
            raise AttributeError("'{}' object has no attribute '{}'".format(self.__class__.__name__, key))

    def __setattr__(self, key, value):
-        self._objects[key] = value
+        if key is 'meta':


gdementen · 2018-07-02T10:14:45Z

larray/core/session.py

@@ -172,13 +202,21 @@ def __delitem__(self, key):
        del self._objects[key]

    def __getattr__(self, key):
-        if key in self._objects:
+        if key is 'meta':


isn't it possible to move this to a property? If that is possible it would probably be cleaner...

I started by trying to implement meta as a property but I faced some difficulties to make it work with Python 2.7

gdementen · 2018-07-02T10:18:20Z

larray/core/session.py

                    return tmpl.format(key=k, value=str(v))
            else:
                return tmpl(k, v)

-        return '\n'.join(display(k, v) for k, v in self.items())
+        return '\n'.join(display(k, v) for k, v in list(self.meta.items()) + list(self.items()))


see above. Having metadata "items" at the same level than self.items is not a good option IMO.

gdementen · 2018-07-02T14:47:24Z

larray/tests/test_session.py

    s = pickle.dumps(original)
    res = pickle.loads(s)
    assert res.equals(original)
+    if PY2:
+        from larray.core.metadata import Metadata
+        assert Metadata(res.meta) == meta


This is odd. I can understand the unpickling of the metadata does not return a Metadata object on python2, but shouldn't it be converted to one when it is "reattached" to the Session?

I'm not proud of it. After trying to implement meta in Session as a property and facing difficulties with Python 2.7 I just gave up for this part. I will change that.

gdementen · 2018-07-04T14:25:50Z

larray/core/session.py

@@ -1091,7 +1165,7 @@ def summary(self, template=None):
                - for groups: 'key', 'name', 'axis_name', 'labels' and 'length',
                - for axes: 'key', 'name', 'labels' and 'length',
                - for arrays: 'key', 'axes_names', 'shape', 'dtype' and 'title',
-                - for all other types: 'key', 'value'.
+                - for metadata: 'key', 'value'.


what about other kinds of objects?

gdementen · 2018-07-04T14:31:35Z

larray/core/session.py


        def display(k, v):
-            t = Group if isinstance(v, Group) else type(v)
+            if not isinstance(v, (Axis, Group, LArray)):


I would much rather have a single template for Metadata objects, so that users can chose to indent it more than the rest or not to display it at all. It would also (re-) allow user-defined objects in the Session.

gdementen · 2018-07-04T14:45:09Z

larray/inout/excel.py

    def save(self):
        if len(self.axes) > 0:
            df = _axes_to_df(self.axes.values())
-            df.to_excel(self.handle, '__axes__', engine='xlsxwriter')
+            df.to_excel(self.handle, '__axes__', index=False, engine='xlsxwriter')


is this change really related to session metadata???

Nope, sorry.

gdementen · 2018-07-04T14:50:25Z

larray/inout/pandas.py

-#    DATAFRAME <--> AXES, GROUPS     #
-# ################################## #
+def _metadata_to_series(metadata):
+    s = pd.Series(data=list(metadata.values()), index=metadata.keys(), name='')


Why don't you use a 1D object LArray for this?

Because LArray objects are converted to pandas objects before to be dumped or loaded. I think using 1D array here instead of a pandas Series will add an unnecessary layer. Especially since we need to iterate over each element at reading to convert each metadata to the right type.

This might be a case of premature optimization. Passing via LArray has negligible cost in my working copy (disclaimer: I modified .series slightly to go faster for 1D and I had already a few speed optimizations all over the place in my working copy, so this might not be the case on your computer).

>>> metadata = {'title': 'yad aez dgrz jgzeigfejzi', 'date': datetime.now()} >>> %timeit l = LArray(list(metadata.values()), Axis(metadata.keys())); s = l.series; s.name = '' 1.22 ms ± 57.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) >>> %timeit s = pd.Series(data=list(metadata.values()), index=metadata.keys(), name='') 1.3 ms ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Besides, I think that it might be a case where avoiding pandas structures altogether would be faster. Going via pandas is a net win for reading .csv files because their parser is so fast, but for anything else, we only gain convenience, not performance. This is probably especially true when we go via xlwings, which in the end must transform everything to "simple" Python types. Anyway, you can leave this as-is, but please at least add a comment that this is done for performance reasons.

gdementen · 2018-07-04T14:51:20Z

larray/inout/pandas.py

+
+def _series_to_metadata(series):
+
+    def _convert_value(key, value):


this function does not seem to use the key argument. Remove it.

gdementen · 2018-07-06T08:30:17Z

larray/inout/pandas.py

-#    DATAFRAME <--> AXES, GROUPS     #
-# ################################## #
+def _metadata_to_series(metadata):
+    s = pd.Series(data=list(metadata.values()), index=metadata.keys(), name='')


This might be a case of premature optimization. Passing via LArray has negligible cost in my working copy (disclaimer: I modified .series slightly to go faster for 1D and I had already a few speed optimizations all over the place in my working copy, so this might not be the case on your computer).

>>> metadata = {'title': 'yad aez dgrz jgzeigfejzi', 'date': datetime.now()} >>> %timeit l = LArray(list(metadata.values()), Axis(metadata.keys())); s = l.series; s.name = '' 1.22 ms ± 57.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) >>> %timeit s = pd.Series(data=list(metadata.values()), index=metadata.keys(), name='') 1.3 ms ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Besides, I think that it might be a case where avoiding pandas structures altogether would be faster. Going via pandas is a net win for reading .csv files because their parser is so fast, but for anything else, we only gain convenience, not performance. This is probably especially true when we go via xlwings, which in the end must transform everything to "simple" Python types. Anyway, you can leave this as-is, but please at least add a comment that this is done for performance reasons.

gdementen · 2018-07-06T08:44:28Z

larray/inout/pandas.py

+def _series_to_metadata(series):
+
+    def _convert_value(key, value):
+        value = str(value)


Why is this line needed? It seems weird to convert numeric types to string to convert them back on the next line to numeric?

gdementen · 2018-07-06T08:46:33Z

larray/inout/pandas.py

+
+    def _convert_value(key, value):
+        value = str(value)
+        value = pd.to_numeric([value], errors='ignore')[0]


why do you use [value] instead of just value? pd.to_numeric seems to handle scalars just fine (even if that is undocumented)

What version of pandas do you have? If I try to pass value instead of [value], I get the following error message: TypeError: 'int' object is not subscriptable

>>> pd.to_numeric('10', errors='ignore') 10 >>> pd.to_numeric(10, errors='ignore') 10 >>> pd.__version__ '0.20.3'

gdementen · 2018-07-06T15:36:19Z

larray/core/session.py

@@ -498,6 +564,7 @@ def to_excel(self, fname, names=None, overwrite=True, display=False, **kwargs):
        - each array is saved in a separate sheet
        - all Axis objects are saved together in the same sheet named __axes__
        - all Group objects are saved together in the same sheet named __groups__
+        - all metadata are saved together in the same sheet named __metadata__


s/metadata/session metadata/ (to avoid confusion with array metadata)
s/are/is/

gdementen · 2018-07-06T15:36:27Z

larray/core/session.py

@@ -540,6 +610,7 @@ def to_csv(self, fname, names=None, display=False, **kwargs):
        - each array is saved in a separate file
        - all Axis objects are saved together in the same CSV file named __axes__.csv
        - all Group objects are saved together in the same CSV file named __groups__.csv
+        - all metadata are saved together in the same CSV file named __metadata__.csv


s/metadata/session metadata/ (to avoid confusion with array metadata)
s/are/is

gdementen · 2018-07-06T15:37:45Z

larray/core/session.py

@@ -815,6 +889,10 @@ def element_equals(self, other):
        -------
        Boolean LArray

+        Notes
+        -----
+        Metadata are ignored.


gdementen · 2018-07-06T15:37:59Z

larray/core/session.py

@@ -885,9 +963,13 @@ def equals(self, other):
        -------
        True if arrays of both sessions are all equal, False otherwise.

+        Notes
+        -----
+        Metadata are ignored.


gdementen · 2018-07-06T15:42:09Z

larray/inout/csv.py

+    def _read_metadata(self):
+        filepath = self._to_filepath('__metadata__')
+        if os.path.isfile(filepath):
+            meta = read_csv(filepath, wide=False)


I didn't expect you to actually make the change from pandas to larray in this PR, but it is a nice surprise. I find it much cleaner this way.

gdementen · 2018-07-06T15:44:07Z

larray/core/metadata.py

+    def __larray__(self):
+        from larray.core.array import LArray
+        from larray.core.axis import Axis
+        return LArray(list(self.values()), Axis(self.keys(), name='METADATA'))


we really need a from_dict or to fix implement LArray.from_dict or from_dict top-level function #581

s/METADATA/metadata/? I usually don't like all capital words :)

gdementen · 2018-07-06T15:52:35Z

larray/core/session.py

@@ -1150,13 +1240,26 @@ def summary(self, template=None):
            template[Group] = "{key}: {axis_name}{labels} >> {name} ({length})"
        if LArray not in template:
            template[LArray] = "{key}: {axes_names} ({shape}) [{dtype}]\n    {title}"
+        if Metadata not in template:
+            template[Metadata] = "{key} (metadata): {value}"


This is already better than before, but I don't think you understood what I meant. I meant something like:

if Metadata not in template: template[Metadata] = "{value}"

gdementen · 2018-07-06T16:06:26Z

larray/core/session.py

+        res = ''
+        if len(self.meta) > 0:
+            res = '\n'.join(display_metadata(k, v) for k, v in self.meta.items()) + '\n'
+        res += '\n'.join(display_items(k, v) for k, v in self.items())


then here, something like:

if self.meta: res += display_items('Metadata', v) + '\n' res += '\n'.join(display_items(k, v) for k, v in self.items())

The goal being that users can produce something like the following if they want to (maybe it should be the default):

Metadata: title: my title author: John Smith axis1: a ['a0' 'a1' 'a2'] (3) group1 -> a01: a['a0', 'a1'] (2)

Now, I am unsure it's worth it to change what you have but having no way to distinguish a metadata item from a "normal" item clearly bothers me.

alixdamman · 2018-07-10T08:43:22Z

There is a bug with conda for Python 2.7 : conda/conda#7503

gdementen · 2018-07-10T10:01:12Z

doc/source/changes/version_0_29.rst.inc

@@ -48,7 +48,7 @@ New features

  Warnings:

-  - Currently, only the HDF (.h5) file format supports saving and loading metadata.
+  - Currently, only the HDF (.h5) file format supports saving and loading array metadata.
  - Metadata are not kept when actions or methods are applied on a array


s/Metadata are/Metadata is/
s/a/an/

gdementen · 2018-07-10T10:02:03Z

doc/source/changes/version_0_29.rst.inc

@@ -67,6 +67,31 @@ New features
    >>> arr.meta.title = 'array for testing'


+* allowed sessions to have metadata.
+  Like for arrays, session metadata can be accessed using the syntax ``session.meta.name``.
+  See bellow for use cases:


s/bellow/below/

gdementen · 2018-07-10T10:07:53Z

doc/source/changes/version_0_29.rst.inc

+
+  Warnings:
+
+  - Excel, CSV and HDF (.h5) file formats support saving and loading session metadata.


These are all the common formats, so it is a bit weird to have this in a Warning section, but maybe putting that in a Notes section by itself would be impractical.

I would emphasize more the difference with arrays by saying something like "Contrary to array metadata, saving and loading session metadata is supported for all current session file formats: Excel, CSV and HDF (.h5).

gdementen · 2018-07-10T10:08:25Z

doc/source/changes/version_0_29.rst.inc

+  Warnings:
+
+  - Excel, CSV and HDF (.h5) file formats support saving and loading session metadata.
+  - Metadata are not kept when actions or methods are applied on a session


s/are/is/ 😄

gdementen · 2018-07-10T10:10:43Z

larray/core/session.py

+
+    Warnings
+    --------
+    Metadata are not kept when actions or methods are applied on a session


are...is...

gdementen · 2018-07-10T10:44:17Z

larray/core/session.py

-        return '\n'.join(display(k, v) for k, v in self.items())
+        res = ''
+        if len(self.meta) > 0:
+            res = 'Metadata:\n'


Not what I had in mind, but oh well... I am tired of this...

gdementen

LGTM

…attached.

… be consistent with XLWingsHandler --> do not include index of the generated pandas dataframe when dumping axes and groups

…a Session instance

alixdamman requested a review from gdementen June 29, 2018 13:33

alixdamman mentioned this pull request Jul 4, 2018

Misc #661

Merged

gdementen reviewed Jul 4, 2018

View reviewed changes

gdementen reviewed Jul 6, 2018

View reviewed changes

alixdamman requested a review from gdementen July 6, 2018 13:51

gdementen reviewed Jul 6, 2018

View reviewed changes

alixdamman requested a review from gdementen July 9, 2018 15:42

gdementen reviewed Jul 10, 2018

View reviewed changes

alixdamman requested a review from gdementen July 11, 2018 14:23

gdementen approved these changes Jul 11, 2018

View reviewed changes

alixdamman added 4 commits July 11, 2018 17:17

fixed test_hdf_roundtrip() --> arrays used for tests had no metadata …

f8ec332

…attached.

updated _load_axes_and_groups() and save() from PandasExcelHandler to…

b154bf7

… be consistent with XLWingsHandler --> do not include index of the generated pandas dataframe when dumping axes and groups

fixed as_table() when ndim = 1 and wide = False

3f1bb4c

fixed larray-project#640: added possibility to associate metadata to …

d6bf4c0

…a Session instance

alixdamman force-pushed the add_metadata_to_larray_objects branch from cce9350 to d6bf4c0 Compare July 11, 2018 15:24

alixdamman merged commit 0c7af11 into larray-project:master Jul 11, 2018


		def _series_to_metadata(series):

		def _convert_value(key, value):


		Warnings:

		- Excel, CSV and HDF (.h5) file formats support saving and loading session metadata.

Add metadata to Session objects #659

Add metadata to Session objects #659

Uh oh!

Conversation

alixdamman commented Jun 29, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alixdamman commented Jul 10, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gdementen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!