Data Storage and Access Using SQLite

The default storage of data within Larch is handled using SQLite. This portable, open-source database system provides a common file format that is flexible and practical for storing data.

The interactions with data in Python take place through a DB object, which is derived from the apsw.Connection class in APSW, the Python interface wrapper for SQLite.

Creating DB Objects

class larch.DB(filename=None, readonly=False)

An SQLite database connection used to get data for models.

This object wraps an apsw.Connection, adding a number of methods designed specifically for working with the choice-based data used in Larch.

Parameters:
  • filename (str or None) – The filename or URI of the database to open. It must be encoded as a UTF-8 string. (If your string contains only usual English characters you probably don’t need to worry about it.) The default is an in-memory database opened with a URI of file:larchdb?mode=memory, which is very fast as long as you’ve got enough memory to store the whole thing.
  • readonly (bool) – If true, the database connection is opened with a read-only flag set. If the file does not already exist, an exception is raised.

Warning

The normal constructor creates a DB object linked to an existing SQLite database file. Editing the object edits the file as well. There is currently no “undo” so be careful when manipulating the database.
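The default in-memory URI described above is ordinary SQLite behavior, so it can be illustrated with the standard library's sqlite3 module (larch itself is not required for this sketch):

```python
import sqlite3

# Open an in-memory SQLite database via a URI, analogous to the default
# URI (file:larchdb?mode=memory) that larch.DB uses when no filename is
# given. Everything lives in memory and vanishes when the connection closes.
conn = sqlite3.connect("file:larchdb?mode=memory", uri=True)
conn.execute("CREATE TABLE demo (x INTEGER)")
conn.execute("INSERT INTO demo VALUES (1), (2), (3)")
total = conn.execute("SELECT SUM(x) FROM demo").fetchone()[0]
print(total)  # 6
conn.close()
```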

In addition to opening an existing SQLite database directly, there are a number of methods available to create a DB object without having it linked to an original database file.

static DB.Copy(source, destination='file:larchdb?mode=memory', shared=False)

Create a copy of a database and link it to a DB object.

It is often desirable to work on a copy of your data instead of the original file. If your data file is not very large and you are working with multiple models, there can be significant speed advantages to copying the entire database into memory first and working on it there, instead of reading from disk every time you want data.

Parameters:
  • source (str) –

    The source SQLite database from which the contents will be copied. Can be given as a plain filename or a URI.

  • destination (str) –

    The destination SQLite database to which the contents will be copied. Can be given as a plain filename or a URI. If it does not exist it will be created. If the destination is not given, an in-memory database will be opened with a URI of file:larchdb?mode=memory.

  • shared (bool) – If True, the destination database is opened with a shared cache, so additional database connections can share the same in-memory data. Defaults to False.

Returns:

An open connection to the destination database.

Return type:

DB

static DB.Example(dataset='MTC', shared=False)

Generate an example data object in memory.

Larch comes with a few example data sets, which are used in documentation and testing. It is important that you do not edit the original data, so this function copies the data into an in-memory database, which you can freely edit without damaging the original data.

Parameters:
  • dataset ({'MTC', 'SWISSMETRO', 'MINI', 'AIR'}) – Which example dataset should be used.
  • shared (bool) – If True, the new copy of the database is opened with a shared cache, so additional database connections can share the same in-memory data. Defaults to False.
Returns:

An open connection to the in-memory copy of the example database.

Return type:

DB
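The shared-cache behavior enabled by shared=True is standard SQLite, so it can be sketched with the stdlib sqlite3 module; this is only a conceptual illustration, not larch's own code:

```python
import sqlite3

# Two connections sharing one in-memory database via a shared cache,
# analogous to what shared=True enables. The URI parameters here are
# standard SQLite; the database name "exampledb" is arbitrary.
uri = "file:exampledb?mode=memory&cache=shared"
a = sqlite3.connect(uri, uri=True)
b = sqlite3.connect(uri, uri=True)
a.execute("CREATE TABLE t (v INTEGER)")
a.execute("INSERT INTO t VALUES (42)")
a.commit()
value = b.execute("SELECT v FROM t").fetchone()[0]  # visible from b
print(value)  # 42
a.close()
b.close()
```

Without cache=shared, each mode=memory connection would get its own private, empty database.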

classmethod DB.CSV_idco(filename, caseid='_rowid_', choice=None, weight=None, tablename='data', savename=None, alts={}, safety=True, shared=False)

Creates a new larch DB based on an idco Format CSV data file.

The input data file should be an idco Format data file, with the first line containing the column headings. The reader will attempt to determine the format (csv, tab-delimited, etc.) automatically.

Parameters:
  • filename (str) – File name (absolute or relative) for CSV (or other text-based delimited) source data.
  • caseid (str) – Column name that contains the unique case ids. If the data is in idco format, case ids can be generated automatically based on line numbers by using the reserved keyword ‘_rowid_’.
  • choice (str or None) – Column name that contains the id of the alternative that is selected (if applicable). If not given, no sql_choice table will be autogenerated, and it will need to be set manually later.
  • weight (str or None) – Column name of the weight for each case. If None, defaults to equal weights.
  • tablename (str) – The name of the sql table into which the data is to be imported. Do not give a reserved name (i.e. any name beginning with sqlite or larch).
  • savename (str or None) – If not None, the name of the location to save the SQLite database file that is created.
  • alts (dict) – A dictionary with keys of alt codes, and values of (alt name, avail column, choice column) tuples. If choice is given, the third item in the tuple is ignored and can be omitted.
  • safety (bool) – If true, all alternatives that are chosen, even if not given in alts, will be automatically added to the alternatives table.
Returns:

An open connection to the database.

Return type:

DB

classmethod DB.CSV_idca(filename, caseid=None, altid=None, choice=None, weight=None, avail=None, tablename='data', tablename_co='_co', savename=None, alts={}, safety=True, index=False, shared=False)

Creates a new larch DB based on an idca Format CSV data file.

The input data file should be an idca Format data file, with the first line containing the column headings. The reader will attempt to determine the format (csv, tab-delimited, etc.) automatically.

Parameters:
  • filename (str) – File name (absolute or relative) for CSV (or other text-based delimited) source data.
  • caseid (str or None) – Column name that contains the caseids. Because multiple rows will share the same caseid, caseids cannot be generated automatically based on line numbers by using the reserved keyword ‘_rowid_’. If None, the column titled ‘caseid’ will be used if it exists; otherwise the first column of data in the file will be used.
  • altid (str or None) – Column name that contains the altids. If None, the second column of data in the file will be used.
  • choice (str or None) – Column name that contains the id of the alternative that is selected (if applicable). If None, the third column of data in the file will be used.
  • weight (str or None) – Column name of the weight for each case. If None, defaults to equal weights. Note that the weight needs to be identical for all altids sharing the same caseid.
  • avail (str or None) – Column name of the availability indicator. If None, it is assumed that unavailable alternatives have the entire row of data missing from the table.
  • tablename (str) – The name of the sql table into which the data is to be imported. Do not give a reserved name (i.e. any name beginning with sqlite or larch).
  • tablename_co (str or None) – The name of the sql table into which idco format data is to be imported. Do not give a reserved name (i.e. any name beginning with sqlite or larch). If None, then no automatic cracking will be attempted and all data will be imported into the idca table. If the given name begins with an underscore, it will be used as a suffix added onto tablename.
  • savename (str or None) – If not None, the name of the location to save the SQLite database file that is created.
  • alts (dict) – A dictionary with integer keys of alt codes, and string values of alt names.
  • safety (bool) – If true, all alternatives that appear in the altid column, even if not given in alts, will be automatically added to the alternatives table.
  • index (bool) – If true, automatically create indexes for caseids and altids on the idca Format table, and (if it is created) caseids on the idco Format table.

Notes

If you are importing a HUGE data file, set tablename_co to None, and do any post-processing separately. If it is not None, the initial import is to a temporary file, which greatly speeds the process for small to medium size files but can choke on larger ones, especially if the temporary file location (usually on the main startup drive) does not have enough space.

Returns:

An open connection to the database.

Return type:

DB

Importing Data

There are a variety of methods available to import data from external sources into a SQLite table for use with the larch DB facility.

DB.import_csv(rawdata, table='data', drop_old=False, progress_callback=None, temp=False, column_names=None)

Import raw csv or tab-delimited data into SQLite.

Parameters:
  • rawdata (str) – The filename (relative or absolute) of the raw csv or tab delimited data file. If the filename has a .gz extension, it is assumed to be in gzip format instead of plain text.
  • table (str) – The name of the table into which the data is to be imported
  • drop_old (bool) – If true and the table already exists in the SQLite database, then the pre-existing table is deleted.
  • progress_callback (callback function) – If given, this callback function takes a single integer as an argument and is called periodically while loading with the current percentage complete.
  • temp (bool) – If true, the data is imported into a temporary table in the database, which will be deleted automatically when the connection is closed.
  • column_names (list, optional) – If given, use these column names and assume the first line of the data file is data, not headers.
Returns:

A list of column headers from the imported csv file

Return type:

list
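The basic pattern underlying a CSV import can be sketched with only the standard library; larch's import_csv layers delimiter sniffing, gzip support, and progress callbacks on top of something like this (the column names here are purely illustrative):

```python
import csv
import io
import sqlite3

# A minimal sketch of importing delimited text into SQLite: read the
# header line for column names, create the table, bulk-insert the rows.
rawdata = io.StringIO("caseid,cost,time\n1,70.63,15.38\n2,35.32,20.38\n")
reader = csv.reader(rawdata)
headers = next(reader)                 # first line holds the column names
conn = sqlite3.connect(":memory:")
cols = ", ".join(headers)
placeholders = ", ".join("?" * len(headers))
conn.execute(f"CREATE TABLE data ({cols})")
conn.executemany(f"INSERT INTO data ({cols}) VALUES ({placeholders})", reader)
n = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(headers, n)  # ['caseid', 'cost', 'time'] 2
```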

DB.import_dataframe(rawdataframe, table='data', if_exists='fail')

Imports data from a pandas dataframe into an existing larch DB.

Parameters:
  • rawdataframe (pandas.DataFrame) – The DataFrame containing the data to be imported.
  • table (str) – The name of the table into which the data is to be imported
  • if_exists ({'fail', 'replace', 'append'}) – If the table does not exist this parameter is ignored, otherwise, fail: If table exists, raise a ValueError exception. replace: If table exists, drop it, recreate it, and insert data. append: If table exists, insert data.
Returns:

A list of column headers from the imported DataFrame

Return type:

list
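The if_exists semantics described above can be sketched directly against SQLite with the standard library; this hypothetical prepare_table helper is not larch's implementation (which goes through pandas), just an illustration of the three behaviors:

```python
import sqlite3

def table_exists(conn, table):
    q = "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?"
    return conn.execute(q, (table,)).fetchone() is not None

def prepare_table(conn, table, if_exists):
    # Mirrors the documented semantics: ignored if the table is new.
    if not table_exists(conn, table):
        conn.execute(f"CREATE TABLE {table} (x INTEGER)")
        return
    if if_exists == "fail":
        raise ValueError(f"table {table!r} already exists")
    if if_exists == "replace":
        conn.execute(f"DROP TABLE {table}")
        conn.execute(f"CREATE TABLE {table} (x INTEGER)")
    # 'append': leave the table in place and just insert afterwards

conn = sqlite3.connect(":memory:")
prepare_table(conn, "data", "fail")        # table is new, so it is created
conn.execute("INSERT INTO data VALUES (1)")
prepare_table(conn, "data", "append")      # keeps the existing row
conn.execute("INSERT INTO data VALUES (2)")
count_after_append = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
prepare_table(conn, "data", "replace")     # drops and recreates, now empty
count_after_replace = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(count_after_append, count_after_replace)  # 2 0
```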

DB.import_xlsx(io, sheetname=0, table='data', if_exists='fail', **kwargs)

Imports data from an Excel spreadsheet into an existing larch DB.

Parameters:
  • io (string, file-like object, or xlrd workbook) – The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx
  • sheetname (string or int, default 0) – Name of Excel sheet or the page number of the sheet
  • table (str) – The name of the table into which the data is to be imported
  • if_exists ({'fail', 'replace', 'append'}) – If the table does not exist this parameter is ignored, otherwise, fail: If table exists, raise a ValueError exception. replace: If table exists, drop it, recreate it, and insert data. append: If table exists, insert data.
Returns:

A list of column headers from the imported Excel sheet

Return type:

list

Notes

This method uses a pandas.DataFrame as an intermediate step, first calling pandas.io.excel.read_excel() and then calling import_dataframe(). All keyword arguments other than those listed here are simply passed to pandas.io.excel.read_excel().

DB.import_dbf(rawdata, table='data', drop_old=False)

Imports data from a DBF file into an existing larch DB.

Parameters:
  • rawdata (str) – The filename (relative or absolute) of the raw DBF data file.
  • table (str) – The name of the table into which the data is to be imported
  • drop_old (bool) – If true and the table already exists in the SQLite database, then the pre-existing table is deleted.
Returns:

A list of column headers from the imported DBF file

Return type:

list

Note

This method requires the dbfpy module (available using pip).

Exporting Data

Sometimes it will be necessary to get your data out of the database, for use in other programs or for other sundry purposes. There will eventually be some documented methods to conveniently allow you to export data in a few standard formats. Of course, since the DB object links to a standard SQLite database, it is possible to access your data directly from SQLite in other programs, or through apsw (included as part of Larch) or sqlite3 (included in standard Python distributions).

DB.export_idca(file, include_idco='intersect', exclude=[], **formats)

Export the idca Format data to a csv file.

Parameters:
  • file (str or file-like) – If a string, this is the file name to give to the open command. Otherwise, this object is passed to csv.writer directly.
  • include_idco ({'intersect', 'all', 'none'}) – Unless this is ‘none’, the idca and idco tables are joined on caseids before exporting. For ‘intersect’, a natural join is used, so that all columns with the same name are used for the join. This may cause problems if columns in the idca and idco tables have the same name but different data. For ‘all’, the join is made on caseids only, and every column in both tables is included in the output. When ‘none’, only the idca table is exported and the idco table is ignored.
  • exclude (set or list) – A list of variable names to exclude from the output. This could be useful in shrinking the file size if you don’t need all the output columns, or suppressing duplicate copies of caseid and altid columns.

Notes

This method uses a csv.writer object to write the output file. Any keyword arguments not listed here are passed through to the writer.
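The difference between the ‘intersect’ and ‘all’ joins can be sketched with the stdlib sqlite3 module; the table and column names below are hypothetical, and this is only an illustration of the join semantics, not larch's export code:

```python
import sqlite3

# 'intersect' corresponds to a NATURAL JOIN (joined on every shared
# column name, duplicates suppressed); 'all' joins on caseid only and
# keeps every column from both tables, so caseid appears twice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ca (caseid INTEGER, altid INTEGER, cost REAL);
CREATE TABLE co (caseid INTEGER, income REAL);
INSERT INTO ca VALUES (1, 1, 70.6), (1, 2, 35.3), (2, 1, 20.1);
INSERT INTO co VALUES (1, 50.0), (2, 75.0);
""")
intersect_rows = conn.execute("SELECT * FROM ca NATURAL JOIN co").fetchall()
all_rows = conn.execute(
    "SELECT * FROM ca JOIN co ON ca.caseid = co.caseid"
).fetchall()
# Same row count, but the natural join has 4 columns while the
# caseid-only join has 5 (caseid from both tables).
print(len(intersect_rows[0]), len(all_rows[0]))  # 4 5
```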

DB.export_idco(file, exclude=[], **formats)

Export the idco Format data to a csv file.

Only the idco Format table is exported, the idca Format table is ignored. Future versions of Larch may provide a facility to export idco and idca data together in a single idco output file.

Parameters:
  • file (str or file-like) – If a string, this is the file name to give to the open command. Otherwise, this object is passed to csv.writer directly.
  • exclude (set or list) – A list of variable names to exclude from the output. This could be useful in shrinking the file size if you don’t need all the output columns, or suppressing duplicate copies of caseid and altid columns.

Notes

This method uses a csv.writer object to write the output file. Any keyword arguments not listed here are passed through to the writer.

Reviewing Data

DB.seer(file=None, counts=False, **kwargs)

Display a variety of information about the DB connection in an HTML report.

Parameters:
  • file (str, optional) – A name for the HTML file that will be created. If not given, a temporary file will automatically be created.
  • counts (bool, optional) – If true, the number of rows in each table is calculated. This may take a long time if the database is large.

Notes

The report will pop up in Chrome or your default browser after it is generated.

Loading Data into Arrays

DB.array_caseids(*, table=None, caseid=None, sort=True, n_cases=None)

Extract the caseids from the DB based on preset queries.

Generally you won’t need to specify any parameters to this method, as most values are determined automatically from the preset queries. However, if you need to override things for this array without changing the queries more permanently, you can use the input parameters to do so. Note that all parameters must be passed by keyword, not as positional arguments.

Parameters:
  • table (str) – The caseids will be found in this table.
  • caseid (str) – This sets the column name where the caseids can be found.
  • sort (bool) – If true (the default) the resulting array will be sorted in ascending order.
  • n_cases (int) – If you know the number of cases, you can specify it here to speed up the return of the results, particularly if the caseids query is complex. You can safely ignore this and the number of cases will be calculated for you. If you give the wrong number, an exception will be raised.
Returns:

An int64 array of shape (n_cases,1).

Return type:

ndarray
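Conceptually, this method pulls the distinct caseids from the data table in ascending order; a stdlib sketch of that query (with illustrative table and column names) looks like the following. larch returns the result as an int64 ndarray of shape (n_cases, 1) rather than a plain list:

```python
import sqlite3

# Sketch of the query behind array_caseids: distinct caseids, sorted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (caseid INTEGER, altid INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?)",
                 [(2, 1), (1, 1), (1, 2), (3, 1)])
caseids = [row[0] for row in conn.execute(
    "SELECT DISTINCT caseid FROM data ORDER BY caseid")]
print(caseids)  # [1, 2, 3]
```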

DB.array_idca(*vars, table=None, caseid=None, altid=None, altcodes=None, dtype='float64', sort=True, n_cases=None)

Extract a set of idca values from the DB based on preset queries.

Generally you won’t need to specify any parameters to this method beyond the variables to include in the array, as most values are determined automatically from the preset queries. However, if you need to override things for this array without changing the queries more permanently, you can use the input parameters to do so. Note that all override parameters must be passed by keyword, not as positional arguments.

Parameters:

vars (tuple of str) – A tuple giving the expressions (often column names, but any valid SQLite expression works) to extract as idca Format variables.

Other Parameters:
 
  • table (str) – The idca data will be found in this table, view, or self-contained query (if the latter, it should be surrounded by parentheses).
  • caseid (str) – This sets the column name where the caseids can be found.
  • altid (str) – This sets the column name where the altids can be found.
  • altcodes (tuple of int) – This is the set of alternative codes used in the data. The second (middle) dimension of the result array will match these codes in length and order.
  • dtype (str or dtype) – Describe the data type you would like the output array to adopt, probably ‘int64’, ‘float64’, or ‘bool’.
  • sort (bool) – If true (the default) the resulting arrays (both of them) will be sorted in ascending order by caseid.
  • n_cases (int) – If you know the number of cases, you can specify it here to speed up the return of the results, particularly if the caseids query is complex. You can safely ignore this and the number of cases will be calculated for you. If you give the wrong number, an exception will be raised.
Returns:

  • data (ndarray) – An array with specified dtype, of shape (n_cases,len(altcodes),len(vars)).
  • caseids (ndarray) – An int64 array of shape (n_cases,1).

Examples

Extract a cost and time array from the MTC example data:

>>> import larch
>>> db = larch.DB.Example()
>>> x, c = db.array_idca('totcost','tottime')
>>> x.shape
(5029, 6, 2)
>>> x[0]
Array([[  70.63,   15.38],
       [  35.32,   20.38],
       [  20.18,   22.38],
       [ 115.64,   41.1 ],
       [   0.  ,   42.5 ],
       [   0.  ,    0.  ]])

DB.array_idco(*vars, table=None, caseid=None, dtype='float64', sort=True, n_cases=None)

Extract a set of idco values from the DB based on preset queries.

Generally you won’t need to specify any parameters to this method beyond the variables to include in the array, as most values are determined automatically from the preset queries. However, if you need to override things for this array without changing the queries more permanently, you can use the input parameters to do so. Note that all override parameters must be passed by keyword, not as positional arguments.

Parameters:

vars (tuple of str) – A tuple (or other iterable) giving the expressions (often column names, but any valid SQLite expression works) to extract as idco Format variables.

Other Parameters:
 
  • table (str) – The idco data will be found in this table, view, or self-contained query (if the latter, it should be surrounded by parentheses).
  • caseid (str) – This sets the column name where the caseids can be found.
  • dtype (str or dtype) – Describe the data type you would like the output array to adopt, probably ‘int64’, ‘float64’, or ‘bool’.
  • sort (bool) – If true (the default) the resulting arrays (both of them) will be sorted in ascending order by caseid.
  • n_cases (int) – If you know the number of cases, you can specify it here to speed up the return of the results, particularly if the caseids query is complex. You can safely ignore this and the number of cases will be calculated for you. If you give the wrong number, an exception will be raised.
Returns:

  • data (ndarray) – An array with specified dtype, of shape (n_cases,len(vars)).
  • caseids (ndarray) – An int64 array of shape (n_cases,1).
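The underlying query evaluates one SQLite expression per variable for each case, sorted by caseid. A stdlib sketch of that idea follows; the table and column names are purely illustrative, and larch returns ndarrays rather than these plain lists:

```python
import sqlite3

# Sketch of the query behind array_idco: one value per (case, expression),
# sorted by caseid. Note that any valid SQLite expression works as a
# "variable", not just bare column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_co (caseid INTEGER, hhinc REAL, numveh INTEGER)")
conn.executemany("INSERT INTO data_co VALUES (?, ?, ?)",
                 [(2, 75.0, 2), (1, 50.0, 1)])
exprs = ("hhinc", "numveh", "hhinc/10")
rows = conn.execute(
    f"SELECT caseid, {', '.join(exprs)} FROM data_co ORDER BY caseid"
).fetchall()
caseids = [r[0] for r in rows]   # parallel caseids, shape (n_cases, 1) in larch
data = [r[1:] for r in rows]     # shape (n_cases, len(vars)) in larch
print(caseids, data)  # [1, 2] [(50.0, 1, 5.0), (75.0, 2, 7.5)]
```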

Convenience Methods

DB.attach(sqlname, filename)

Attach another SQLite database.

Parameters:
  • sqlname (str) – The name SQLite will use to reference the other database.
  • filename (str) – The filename or URI to attach.

Notes

If the other database is already attached, or if the name is already taken by another attached database, the command will be ignored. Otherwise, this command is the equivalent of executing:

ATTACH filename AS sqlname;

See also

DB.detach()

DB.detach(sqlname)

Detach another SQLite database.

Parameters:
  • sqlname (str) – The name SQLite uses to reference the other database.

Notes

If the name is not an attached database, the command will be ignored. Otherwise, this command is the equivalent of executing:

DETACH sqlname;

See also

DB.attach()
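The ATTACH and DETACH statements these methods wrap are plain SQLite, so their effect can be demonstrated with the stdlib sqlite3 module (the file name and table here are hypothetical):

```python
import os
import sqlite3
import tempfile

# Create a second database file to attach.
other = os.path.join(tempfile.mkdtemp(), "other.db")
aux = sqlite3.connect(other)
aux.execute("CREATE TABLE zones (id INTEGER)")
aux.execute("INSERT INTO zones VALUES (7)")
aux.commit()
aux.close()

# Attach it under the short name "other", query across it, then detach.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH ? AS other", (other,))
zone = conn.execute("SELECT id FROM other.zones").fetchone()[0]
conn.execute("DETACH other")
print(zone)  # 7
```

After DETACH, the `other.` prefix is no longer resolvable, just as the documentation above describes.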

DB.crack_idca(tablename, caseid=None, ca_tablename=None, co_tablename=None)

Crack an existing idca table into idca and idco component tables.

This method will automatically analyze an existing idca table and identify columns of data that are invariant within individual cases. Those variables will be segregated into a new idco table, and the remaining variables will be put into a new idca table.

Parameters:
  • tablename (str) – The name of the existing idca table
  • caseid (str or None) – The name of the column representing the caseids in the existing table. If not given, it is assumed these are in the first column.
  • ca_tablename (str or None) – The name of the table that will be created to hold the new (with fewer columns) idca table.
  • co_tablename (str or None) – The name of the table that will be created to hold the new idco table.
Raises:

apsw.SQLError – If the name of one of the tables to be created already exists in the database.
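The analysis step behind this cracking can be sketched in SQL: a column belongs in the idco table exactly when its value never varies within a caseid. This stdlib sketch (with illustrative column names) shows one way to detect that; it is not larch's implementation:

```python
import sqlite3

# 'income' is identical across alternatives within each case, so it is
# case-invariant and would be moved to the idco table; 'cost' varies
# across alternatives and stays in the idca table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (caseid INTEGER, altid INTEGER, cost REAL, income REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", [
    (1, 1, 70.6, 50.0), (1, 2, 35.3, 50.0),
    (2, 1, 20.1, 75.0), (2, 2, 15.0, 75.0),
])
invariant = []
for col in ("cost", "income"):
    max_distinct = conn.execute(
        f"SELECT MAX(n) FROM (SELECT COUNT(DISTINCT {col}) AS n "
        f"FROM data GROUP BY caseid)"
    ).fetchone()[0]
    if max_distinct <= 1:       # never more than one value per case
        invariant.append(col)
print(invariant)  # ['income']
```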

Using Data in Models

The DB class primarily presents an interface between python and SQLite. The interface between a DB and a Model is governed by a special attribute of the DB class:

DB.queries

This attribute defines the automatic queries used to provision a Model with data. It should be an object that is a specialized subtype of the core.QuerySet abstract base class.