Data Storage and Access Using HDF5

An alternative data storage system is available for Larch, built on the HDF5 format and the pytables package. This system is made available through a DT object, which wraps a tables.File object.

Creating DT Objects

class larch.DT(filename=None, mode='a')

A wrapper for a pytables File used to get data for models.

This object wraps a tables.File, adding a number of methods designed specifically for working with the choice-based data used in Larch.

Parameters:
  • filename (str or None) – The filename of the HDF5/pytables file to open. If None (the default), a named temporary file is created to serve as the backing for an in-memory HDF5 file, which is very fast as long as you have enough memory to store the whole thing.
  • mode (str) – The mode used to open the HDF5 file. Common values are ‘a’ for append and ‘r’ for read-only. See the pytables documentation for more detail.
  • complevel (int) – The compression level to use for newly created objects. By default no compression is used, but substantial disk savings may be available by enabling it.
  • inmemory (bool) – If True (default False), the H5FD_CORE driver is used, and data will generally not be written to disk until the file is closed, when all accumulated changes will be written in a single batch. This can be fast if you have sufficient memory, but if an error occurs all your intermediate changes can be lost.
  • temp (bool) – If True (default False), the inmemory switch is activated and no changes will be written to disk when the file is closed. This is automatically set to True if the filename is None.

Warning

The normal constructor creates a DT object linked to an existing HDF5 file. Editing the object edits the file as well.
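
For example (the filename here is hypothetical), you might open a file-backed DT, or create a fast temporary in-memory one:

>>> import larch
>>> d = larch.DT('mydata.h5', mode='a')   # opens or creates mydata.h5; edits apply to the file
>>> t = larch.DT()                        # temporary in-memory file, discarded on close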

Similar to the DB class, the DT class can be used with example data files.

static DT.Example(dataset='MTC', filename='{}.h5', temp=True)

Generate an example data object in memory.

Larch comes with a few example data sets, which are used in documentation and testing. This function copies the data into an HDF5 file, which you can freely edit without damaging the original data.

Parameters:
  • dataset ({'MTC', 'SWISSMETRO', 'MINI', 'AIR'}) – Which example dataset should be used.
  • filename (str) – A filename for the HDF5 file (even in-memory files need a name).
  • temp (bool) – If True (the default), the example database is created in-memory; if temp is False, the file will be written to disk when closed.
Returns:

An open connection to the HDF5 example data.

Return type:

DT
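
For instance, to load the MTC example data:

>>> import larch
>>> d = larch.DT.Example('MTC')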

static DT.CSV_idco(filename, caseid=None, choice=None, weight=None, savename=None, alts={}, csv_args=(), csv_kwargs={}, complib='zlib', complevel=1, overwrite=0, **kwargs)

Creates a new larch DT based on an idco Format CSV data file.

The input data file should be an idco Format data file, with the first line containing the column headings. The reader will attempt to determine the format (csv, tab-delimited, etc.) automatically.

Parameters:
  • filename (str) – File name (absolute or relative) for CSV (or other text-based delimited) source data.
  • caseid (str) – Column name that contains the unique case ids. If the data is in idco format, case ids can be generated automatically based on line numbers by setting caseid to None (the default).
  • choice (str or None) – Column name that contains the id of the alternative that is selected (if applicable). If not given, and if choice is not included in alts below, no _choice_ h5f node will be autogenerated, and it will need to be set manually later. If the choices are given in alts (see below) then this value is ignored.
  • weight (str or None) – Column name of the weight for each case. If None, defaults to equal weights.
  • savename (str or None) – If not None, the name of the location to save the HDF5 file that is created.
  • alts (dict) – A dictionary with keys of alt codes, and values of (alt name, avail column, choice column) tuples. The third item in the tuple can be omitted if choice is given.
Other Parameters:
 
  • csv_args (tuple) – A tuple of positional arguments to pass to DT.import_idco() (and by extension to pandas.read_csv() or pandas.read_excel()).
  • csv_kwargs (dict) – A dictionary of keyword arguments to pass to DT.import_idco() (and by extension to pandas.read_csv() or pandas.read_excel()).
  • complib (str) – The compression library to use for the HDF5 file default filter.
  • complevel (int) – The compression level to use for the HDF5 file default filter.
  • Keyword arguments not listed here are passed to the DT constructor.
Returns:

An open DT file.

Return type:

DT
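
A sketch of typical usage (the file and column names here are hypothetical):

>>> d = larch.DT.CSV_idco(
...     'survey.csv',
...     caseid='respondent_id',
...     choice='chosen_mode',
...     alts={1: ('Car', 'car_avail'), 2: ('Bus', 'bus_avail')},
... )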

larch.DTL(source, newfile=None)

Create a new DT with linked data.

This special function creates a new DT object and individually links in the caseids, alternatives, and the idca and idco data from the source DT as read-only data nodes.

Parameters:
  • source (DT) – The original DT.
  • newfile (str, optional) – A filename for the new file.
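
For example, assuming d is an existing DT (the new filename here is hypothetical):

>>> d_linked = larch.DTL(d, 'linked.h5')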

Required HDF5 Structure

To be used with Larch, the HDF5 file must have a particular structure. The group node structure is created automatically when you open a new DT object with a file that does not already have the necessary structure.

larch  (group)
├─ caseids          shape=(N)     dtype=Int64
├─ screen           shape=(N)     dtype=Bool     (optional)
├─ idco  (group)
│  ├─ ...various... shape=(N)
│  ├─ ...various... (group)
│  │  ├─ _index_    shape=(N)     dtype=Int64
│  │  └─ _values_   shape=(?)
│  └─ _weight_      shape=(N)     dtype=Float64  (optional)
├─ idca  (group)
│  ├─ ...various... shape=(N,A)
│  ├─ ...various... (group)
│  │  ├─ _index_    shape=(N)     dtype=Int64
│  │  └─ _values_   shape=(?,A)
│  ├─ _choice_      shape=(N,A)   dtype=Float64
│  └─ _avail_       shape=(N,A)   dtype=Bool     (optional)
└─ alts  (group)
   ├─ altids        shape=(A)     dtype=Int64
   └─ names         shape=(A)     dtype=Unicode

Here N is the number of cases and A is the number of alternatives. Nodes
marked (group) are group nodes; all others are data array nodes. Nodes
marked (optional) may be omitted.

The details are as follows:

════════════════════════════════════════════════════════════════════════════════
larch.DT Validation for MTC.h5 (with mode 'w')
─────┬──────────────────────────────────────────────────────────────────────────
 >>> │ There should be a designated `larch` group node under which all other
     │ nodes reside.
─────┼──────────────────────────────────────────────────────────────────────────
     │ CASES
 >>> │ Under the top node, there must be an array node named `caseids`.
 >>> │ The `caseids` array dtype should be Int64.
 >>> │ The `caseids` array should be 1 dimensional.
     ├ Case Filtering ──────────────────────────────────────────────────────────
 >>> │ If there may be some data cases that are not to be included in the
     │ processing of the discrete choice model, there should be a node named
     │ `screen` under the top node.
 >>> │ If it exists `screen` must be a Bool array.
 >>> │ And `screen` must have the same shape as `caseids`.
─────┼──────────────────────────────────────────────────────────────────────────
     │ ALTERNATIVES
 >>> │ Under the top node, there should be a group named `alts` to hold
     │ alternative data.
 >>> │ Within the `alts` node, there should be an array node named `altids` to
     │ hold the identifying code numbers of the alternatives.
 >>> │ The `altids` array dtype should be Int64.
 >>> │ The `altids` array should be one dimensional.
 >>> │ Within the `alts` node, there should also be a VLArray node named `names`
     │ to hold the names of the alternatives.
 >>> │ The `names` node should hold unicode values.
 >>> │ The `altids` and `names` arrays should be the same length, and this will
     │ be the number of elemental alternatives represented in the data.
─────┼──────────────────────────────────────────────────────────────────────────
     │ IDCO FORMAT DATA
 >>> │ Under the top node, there should be a group named `idco` to hold that
     │ data.
 >>> │ Every child node name in `idco` must be a valid Python identifier (i.e.
     │ starts with a letter or underscore, and only contains letters, numbers,
     │ and underscores) and not a Python reserved keyword.
 >>> │ Every child node in `idco` must be (1) an array node with shape the same
     │ as `caseids`, or (2) a group node with child nodes `_index_` as an array
     │ with the correct shape and an integer dtype, and `_values_` such that
     │ _values_[_index_] reconstructs the desired data array.
     ├ Case Weights ────────────────────────────────────────────────────────────
 >>> │ If the cases are to have non-uniform weights, then there should be a
     │ `_weight_` node (or a name link to a node) within the `idco` group.
 >>> │ If weights are given, they should be of Float64 dtype.
─────┼──────────────────────────────────────────────────────────────────────────
     │ IDCA FORMAT DATA
 >>> │ Under the top node, there should be a group named `idca` to hold that
     │ data.
 >>> │ Every child node name in `idca` must be a valid Python identifier (i.e.
     │ starts with a letter or underscore, and only contains letters, numbers,
     │ and underscores) and not a Python reserved keyword.
 >>> │ Every child node in `idca` must be (1) an array node with the first
     │ dimension the same as the length of `caseids`, and the second dimension
     │ the same as the length of `altids`, or (2) a group node with child nodes
     │ `_index_` as a 1-dimensional array with the same length as the length of
     │ `caseids` and an integer dtype, and a 2-dimensional `_values_` with the
     │ second dimension the same as the length of `altids`, such that
     │ _values_[_index_] reconstructs the desired data array.
     ├ Alternative Availability ────────────────────────────────────────────────
 >>> │ If there may be some alternatives that are unavailable in some cases,
     │ there should be a node named `_avail_` under `idca`.
 >>> │ If given as an array, it should contain an appropriately sized Bool array
     │ indicating the availability status for each alternative.
 >>> │ If given as a group, it should have an attribute named `stack` that is a
     │ tuple of `idco` expressions indicating the availability status for each
     │ alternative. The length and order of `stack` should match that of the
     │ altid array.
     ├ Chosen Alternatives ────────────────────────────────────────────────────
 >>> │ There should be a node named `_choice_` under `idca`.
 >>> │ If given as an array, it should be a Float64 array indicating the chosen-
     │ ness for each alternative. Typically, this will take a value of 1.0 for
     │ the alternative that is chosen and 0.0 otherwise, although it is possible
     │ to have other values, including non-integer values, in some applications.
 >>> │ If given as a group, it should have an attribute named `stack` that is a
     │ tuple of `idco` expressions indicating the choice status for each
     │ alternative. The length and order of `stack` should match that of the
     │ altid array.
─────┼──────────────────────────────────────────────────────────────────────────
     │ OTHER TECHNICAL DETAILS
 >>> │ The set of child node names within `idca` and `idco` should not overlap
     │ (i.e. there should be no node names that appear in both).
═════╧══════════════════════════════════════════════════════════════════════════

Note that the _choice_ and _avail_ nodes are special: they can be expressed as a stack of idco expressions instead of as a single idca array. To do so, replace the array node with a group node, and attach a stack attribute that gives the list of idco expressions. The order of the list should match the order of the alternatives. One way to do this automatically is to use the avail_idco and choice_idco attributes of the DT (see Special Data below).

To check if your file has the correct structure, you can use the validate function:

DT.validate(log=<built-in function print>, errlog=None)

Generate a validation report for this DT.

The generated report is fairly detailed and describes each requirement for a valid DT file and whether or not it is met.

Parameters:
  • log (callable) – Typically “print”, but can be replaced with a different callable to accept a series of unicode strings for each line in the report.
  • errlog (callable or None) – By default, None. If not None, the report will print as with log, but only if there are errors.
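
For example, the following prints a validation report like the one shown above:

>>> d = larch.DT.Example('MTC')
>>> d.validate()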

Importing Data

There are methods available to import data from external sources into the correct format for use with the larch DT facility.

DT.import_idco(filepath_or_buffer, caseid_column=None, overwrite=0, *args, **kwargs)

Import an existing CSV or similar file in idco format into this HDF5 file.

This function relies on pandas.read_csv() to read and parse the input data. All arguments other than those described below are passed through to that function.

Parameters:
  • filepath_or_buffer (str or buffer or pandas.DataFrame) – This argument will be fed directly to the pandas.read_csv() function. If a string is given and the file extension is “.xlsx” then the pandas.read_excel() function will be used instead, or if the file extension is “.dbf” then simpledbf.Dbf5.to_dataframe() is used. Alternatively, you can just pass a pre-loaded pandas.DataFrame.
  • caseid_column (None or str) – If given, this is the column of the input data file to use as caseids. If not given, arbitrary sequential caseids will be created. If it is given and the caseids do already exist, a LarchError is raised.
  • overwrite (int) – If positive, existing data with the same name will be overwritten. If zero (the default), existing data with the same name will not be overwritten and a tables.exceptions.NodeError will be raised. If negative, existing data will not be overwritten but no errors will be raised.
Returns:

self

Return type:

DT

Raises:

LarchError – If caseids exist and are also given, or if the caseids are not integer values.
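
A minimal sketch (the file and column names here are hypothetical):

>>> d = larch.DT()
>>> d.import_idco('survey.csv', caseid_column='respondent_id')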

DT.import_idca(filepath_or_buffer, caseid_col, altid_col, choice_col=None, force_int_as_float=True, chunksize=1e+300)

Import an existing CSV or similar file in idca format into this HDF5 file.

This function relies on pandas.read_csv() to read and parse the input data. All arguments other than those described below are passed through to that function.

Parameters:
  • filepath_or_buffer (str or buffer) – This argument will be fed directly to the pandas.read_csv() function.
  • caseid_col (None or str) – If given, this is the column of the input data file to use as caseids. It must be given if the caseids do not already exist in the HDF5 file. If it is given and the caseids do already exist, a LarchError is raised.
  • altid_col (None or str) – If given, this is the column of the input data file to use as altids. It must be given if the altids do not already exist in the HDF5 file. If it is given and the altids do already exist, a LarchError is raised.
  • choice_col (None or str) – If given, use this column as the choice indicator.
  • force_int_as_float (bool) – If True, data columns that appear to be integer values will still be stored as double precision floats (defaults to True).
  • chunksize (int) – The number of rows of the source file to read as a chunk. Reading a giant file in moderately sized chunks can be much faster and less memory intensive than reading the entire file.
Returns:

self

Return type:

DT

Raises:

LarchError – Various errors.

Notes

Chunking may not work on Mac OS X due to a known bug in the pandas.read_csv function.
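
A minimal sketch (the file and column names here are hypothetical); each row of the source file holds one case-alternative observation:

>>> d = larch.DT()
>>> d.import_idca('longdata.csv', caseid_col='case_id', altid_col='alt_id', choice_col='chosen')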

DT.merge_into_idco(other, self_on, other_on=None, dupe_suffix='_copy', original_source=None, names=None, log=<function DT.<lambda>>, **kwargs)

Merge data into the idco group of this DT.

Every case in the current (receiving) DT should match one or zero cases in the imported data.

Parameters:
  • other (DT or pandas.DataFrame or str) – The other data table to be merged. Can be another DT, or a DataFrame, or the path to a file of type {csv, xlsx, dbf}.
  • self_on (label or list, or array-like) – Field names to join on in this DT. Can be a vector or a list of vectors of length DT.nAllCases(), to use a particular vector as the join key instead of columns.
  • other_on (label or list, or array-like) – Field names to join on in the other DataFrame, or a vector or list of vectors as for self_on.
  • dupe_suffix (str) – A suffix to add to variable names that duplicate names already in this DT.
  • original_source (str, optional) – Give the original source of this data. If not given and the filename can be inferred from other, that name will be used.
  • names (None or list or dict) – If given as a list, only these names will be merged from the other data. If given as a dict, the keys are the names that will be merged and the values are the new names in this DT.
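
For example (the file and key names here are hypothetical), to merge household attributes keyed on a household id:

>>> d.merge_into_idco('households.csv', self_on='hhid')
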
DT.pluck_into_idco(other_omx, rowindexes, colindexes, names=None, overwrite=False)

Pluck values from an OMX file into new idco Format variables.

This method takes O and D index numbers and plucks the individual matrix values at those coordinates. New idco variables will be created in the DT file that contain the plucked values, so that the new variables represent actual arrays and not links to the original matrix. The OMX filename is marked as the original source of the data.

Parameters:
  • other_omx (OMX or str) – Either an OMX or a filename to an OMX file.
  • rowindexes, colindexes (array-like) – Zero-based index arrays for the rows (origins) and columns (destinations) that will be plucked into the new variables.
  • names (str or list or dict) – If a str, only that single named matrix table in the OMX will be plucked. If a list, all of the named matrix tables in the OMX will be plucked. If a dict, the keys give the matrix tables to pluck from and the values give the new variable names to create in this DT.

See also

GroupNode.add_external_omx()
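
A sketch (the file, matrix, and variable names here are hypothetical), where otaz and dtaz are zero-based row and column index arrays with one entry per case:

>>> d.pluck_into_idco('skims.omx', otaz, dtaz, names={'AUTO_TIME': 'auto_time'})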

Creating Data

DT.new_idco(name, expression, dtype=numpy.float64, *, overwrite=False, title=None, dictionary=None)

Create a new idco Format variable.

Creating a new variable in the data might be convenient in some instances. Although using the full expression as a data term in a model might be valid, the whole expression will need to be evaluated every time the data is loaded. By using this method, you can evaluate the expression just once, and save the resulting array to the file.

Note that this command does not (yet) evaluate the expression in kernel using the numexpr module.

Parameters:
  • name (str) – The name of the new idco Format variable.
  • expression (str) – An expression to evaluate as the new variable.
  • dtype (dtype) – The dtype for the array of new data.
  • overwrite (bool) – Should the variable be overwritten if it already exists? Defaults to False. Explicitly set to None to suppress the NodeError if the node exists.
  • title (str, optional) – A descriptive title for the variable, typically a short phrase but an arbitrary length description is allowed.
  • dictionary (dict, optional) – A data dictionary explaining some or all of the values in this field. Even for otherwise self-explanatory numerical values, the dictionary may give useful information about particular out of range values.
Raises:
  • tables.exceptions.NodeError – If a variable of the same name already exists and overwrite is False.
  • NameError – If the expression contains a name that cannot be evaluated from within the existing idco Format data.
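
For example, assuming the idco data already includes an hhinc (household income) variable, as in the MTC example data:

>>> d.new_idco('hhinc_squared', 'hhinc ** 2')
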
DT.new_idca(name, expression, title=None, dtype=None, original_source='externally defined array')

Create a new idca Format variable.

Creating a new variable in the data might be convenient in some instances. Although using the full expression as a data term in a model might be valid, the whole expression will need to be evaluated every time the data is loaded. By using this method, you can evaluate the expression just once, and save the resulting array to the file.

Note that this command does not (yet) evaluate the expression in kernel using the numexpr module.

Parameters:
  • name (str) – The name of the new idca Format variable.
  • expression (str or array) – An expression to evaluate as the new variable, or an array of data.
  • title (str, optional) – Give a description of the data in this array.
  • dtype (dtype, optional) – The numpy dtype for the new array; defaults to float64.
  • original_source (str, optional) – If expression is an array, provide this string as the “original source” of the data. If omitted, the original source is set as the generic “externally defined array”.
Raises:
  • tables.exceptions.NodeError – If a variable of the same name already exists.
  • NameError – If the expression contains a name that cannot be evaluated from within the existing idca Format or idco Format data.
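
For example, assuming the idca data already includes a tottime (travel time) variable, as in the MTC example data:

>>> d.new_idca('tottime_sq', 'tottime ** 2')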

Special Data

DT.choice_idco

The stack manager for choice data in idco format.

To set a stack of idco expressions to represent choice data, assign a dictionary to this attribute with keys as alternative codes and values as idco expressions.

You can also get and assign individual alternative values using the usual dictionary operations:

DT.choice_idco[key]            # get expression
DT.choice_idco[key] = value    # set expression

DT.avail_idco

The stack manager for avail data in idco format.

To set a stack of idco expressions to represent availability data, assign a dictionary to this attribute with keys as alternative codes and values as idco expressions.

You can also get and assign individual alternative values using the usual dictionary operations:

DT.avail_idco[key]            # get expression
DT.avail_idco[key] = value    # set expression
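
For example (the idco column names here are hypothetical), availability and choice can each be set as a stack of idco expressions in one assignment:

>>> d.avail_idco = {1: 'car_avail', 2: 'bus_avail'}
>>> d.choice_idco = {1: 'chose == 1', 2: 'chose == 2'}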

Filtering Cases

It is common in discrete choice modeling to apply screening filters to a dataset before estimating parameters. These filters could be used for cleaning purposes (e.g. to remove apparently erroneous data) or to estimate a model for only a particular subset of observations (e.g. to pull out only home-based work trips from a data file that contains trips of many purposes).

DT.set_screen(exclude_idco=(), exclude_idca=(), exclude_unavail=False, exclude_unchoosable=False, dynamic=False)

Set a screen, determining which cases will be used. The parameters are the same as for DT.rescreen() below.

DT.rescreen(exclude_idco=None, exclude_idca=None, exclude_unavail=None, exclude_unchoosable=None, dynamic=None)

Rebuild the screen based on the indicated exclusion criteria.

Parameters:
  • exclude_idco (iterable of str) – A sequence of expressions that are evaluated as booleans using DT.array_idco(). For each case, if any of these expressions evaluates as true then the entire case is excluded.
  • exclude_idca (iterable of (altcode,str)) – A sequence of (altcode, expression) tuples, where the expression is evaluated as boolean using DT.array_idca(). If the expression evaluates as true for any alternative matching any of the codes in the altcode part of the tuple (which can be an integer or an array of integers) then the case is excluded. Note that this excludes the whole case, not just the alternative in question.
  • exclude_unavail (bool) – If true, then any case with no available alternatives is excluded.
  • exclude_unchoosable (bool or int) – If true, then any case where an unavailable alternative is chosen is excluded. Set to an integer greater than 1 to increase the verbosity of the reporting.

Notes

Any method parameter can be omitted, in which case the previously used value of that parameter is retained. To explicitly clear previous screens, pass an empty tuple for each parameter.
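
For example (the expression here is hypothetical), to exclude cases with non-positive income as well as any case with no available alternatives:

>>> d.rescreen(exclude_idco=('hhinc <= 0',), exclude_unavail=True)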

DT.exclude_idco(expr, count=True)

Add an exclusion factor based on idco data.

This is primarily a convenience method, which calls rescreen. Future implementations may be modified to be more efficient.

Parameters:
  • expr (str) – An expression to evaluate using array_idco(), with dtype set to bool. Any cases that evaluate as positive are excluded from the dataset when provisioning.
  • count (bool) – Count the number of cases impacted by adding the screen.
Returns:

The number of cases excluded as a result of adding this exclusion factor.

Return type:

int
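
For example (the expression here is hypothetical):

>>> d.exclude_idco('age < 18')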

DT.exclude_idca(altids, expr, count=True)

Add an exclusion factor based on idca data.

This is primarily a convenience method, which calls rescreen. Future implementations may be modified to be more efficient.

Parameters:
  • altids (iterable of int) – A set of alternatives to consider. Any cases for which the expression evaluates as positive for any of the listed altids are excluded from the dataset when provisioning.
  • expr (str) – An expression to evaluate using array_idca(), with dtype set to bool.
  • count (bool) – Count the number of cases impacted by adding the screen.
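
For example (the alternative codes and expression here are hypothetical), to exclude any case where travel time exceeds 300 for alternative 5 or 6:

>>> d.exclude_idca((5, 6), 'tottime > 300')
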
DT.get_screen_indexes()

Get the index values of all currently active cases.

This is just the active cases, and omits those cases that are excluded. Also, it returns the zero-based indexes and not the caseids, making this method useful for mapping a vector of filtered data into a vector of all-cases data, or vice versa.

>>> import larch
>>> d = larch.DT.Example('swissmetro')
>>> d.get_screen_indexes()
array([   0,    1,    2, ..., 8448, 8449, 8450])
>>> d.nCases()
6768
>>> d.nAllCases()
10728
>>> d.get_screen_indexes().shape
(6768,)