====
Ohio
====
.. automodule:: ohio
For higher-level examples of what Ohio can do for you, see `Extensions`_ and `Recipes`_.
.. contents::
Installation
============
Ohio is a distributed library with support for Python v3. It is available from `pypi.org `_::
$ pip install ohio
Modules
=======
.. automodule:: ohio.csvio
.. autofunction:: ohio.encode_csv
.. autoclass:: ohio.CsvTextIO
.. autoclass:: ohio.CsvDictTextIO
.. autofunction:: ohio.iter_csv
.. autofunction:: ohio.iter_dict_csv
.. autoclass:: ohio.CsvWriterTextIO
.. autoclass:: ohio.CsvDictWriterTextIO
.. automodule:: ohio.iterio
.. autoclass:: ohio.IteratorTextIO
.. automodule:: ohio.pipeio
.. autofunction:: ohio.pipe_text
.. automodule:: ohio.baseio
.. autoclass:: ohio.StreamTextIOBase
.. autoexception:: ohio.IOClosed
.. _Extensions:
.. automodule:: ohio.ext
.. automodule:: ohio.ext.numpy
.. autofunction:: ohio.ext.numpy.pg_copy_to_table
.. autofunction:: ohio.ext.numpy.pg_copy_from_table
.. autofunction:: ohio.ext.numpy.pg_copy_from_query
.. automodule:: ohio.ext.pandas
.. autoclass:: ohio.ext.pandas.DataFramePgCopyTo
.. autofunction:: ohio.ext.pandas.to_sql_method_pg_copy_to
.. autofunction:: ohio.ext.pandas.data_frame_pg_copy_from
Benchmarking
~~~~~~~~~~~~
Ohio extensions for pandas were benchmarked to test their speed and
memory-efficiency relative both to pandas built-in functionality and
to custom implementations which do not utilize Ohio.
Interfaces and syntactical niceties aside, Ohio generally features
memory stability. Its tools enable pipelines which may also
improve speed, (and which do so in standard use-cases).
In the below benchmark, Ohio extensions ``pg_copy_from`` &
``pg_copy_to`` reduced memory consumption by 84% & 61%, and
completed in 39% & 91% less time, relative to pandas built-ins
``read_sql`` & ``to_sql``, (respectively).
Compared to purpose-built extensions – which utilized PostgreSQL
``COPY``, but using ``io.StringIO`` in place of ``ohio.PipeTextIO``
and ``ohio.CsvTextIO`` – ``pg_copy_from`` & ``pg_copy_to`` also
reduced memory consumption by 60% & 32%, respectively.
``pg_copy_from`` & ``pg_copy_to`` also completed in 16% & 13%
less time than the ``io.StringIO`` versions.
The benchmarks plotted below were produced from averages and
standard deviations over 3 randomized trials per target. Input data
consisted of 896,677 rows across 83 columns: 1 of these of type
timestamp, 51 integers and 31 floats. The benchmarking package,
``prof``, is preserved in `Ohio's repository
`_.
.. Rather than relative links to assets in the repository, we use
absolute links to githubusercontent.com, to ensure document
presents the same outside of Github, (e.g. on PyPI).
.. Also note the query string ?sanitize=true
This is necessary to ensure that SVG is returned with the image content type.
.. image:: https://raw.githubusercontent.com/dssg/ohio/0.5.0/doc/img/profile-copy-from-database-to-datafram-1554345457.svg?sanitize=true
ohio_pg_copy_from_X
``pg_copy_from(buffer_size=X)``
A PostgreSQL database-connected cursor writes the results of
``COPY`` to a ``PipeTextIO``, from which pandas constructs a
``DataFrame``.
pandas_read_sql
``pandas.read_sql()``
Pandas constructs a ``DataFrame`` from a given database query.
pandas_read_sql_chunks_100
``pandas.read_sql(chunksize=100)``
Pandas is instructed to generate ``DataFrame`` slices of the
database query result, and these slices are concatenated into a
single frame, with: ``pandas.concat(chunks, copy=False)``.
pandas_read_csv_stringio
``pandas.read_csv(StringIO())``
A PostgreSQL database-connected cursor writes the results of
``COPY`` to a ``StringIO``, from which pandas constructs a
``DataFrame``.
.. image:: https://raw.githubusercontent.com/dssg/ohio/0.5.0/doc/img/profile-copy-from-dataframe-to-databas-1555458507.svg?sanitize=true
ohio_pg_copy_to
``pg_copy_to()``
``DataFrame`` data are encoded through a ``CsvTextIO``, and read
by a PostgreSQL database-connected cursor's ``COPY`` command.
pandas_to_sql
``pandas.DataFrame.to_sql()``
Pandas inserts ``DataFrame`` data into the database row by row.
pandas_to_sql_multi_100
``pandas.DataFrame.to_sql(method='multi', chunksize=100)``
Pandas inserts ``DataFrame`` data into the database in chunks of
rows.
copy_stringio_to_db
``DataFrame`` data are written and encoded to a ``StringIO``,
and then read by a PostgreSQL database-connected cursor's
``COPY`` command.
.. _Recipes:
.. automodule:: ohio.recipe
.. automodule:: ohio.recipe.dbjoin
.. autofunction:: ohio.recipe.dbjoin.pg_join_queries
.. only:: not noindex
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`