ETL process with Python and TURBODBC

Erick Gomes Anastácio

2018-11-08


Reasoning

In this demo we will upload data to a SQL Server database using TURBODBC.

The main reason for using turbodbc: for uploading real data, pandas.to_sql is painfully slow, and the workarounds to make it faster are pretty hairy, if you ask me.

The first time I came across this problem, I had 8 tables, each with 1.6 million rows and 240 columns. pandas.to_sql ran for an entire day before I gave up on the upload. Until then, I had been using the company's proprietary tool while trying to get pandas to work.

After many hours of running in circles around pandas workarounds, I gave up on it, but only because I discovered TURBODBC, this piece of pure love!

Get TURBODBC at https://turbodbc.readthedocs.io/en/latest/index.html

Comparison

Quick comparison: in this script, for loading 10,000 rows with 77 columns, we have:

  • pandas.to_sql took almost 200 seconds to finish
  • turbodbc took only 3 seconds…

Step-by-step summary

In this Python script, we will:

  • load and treat some data using pandas (in my case, a DataFrame with 77 columns and 350k+ rows); a setup sketch follows this list
  • create a sqlAlchemy connection to our SQL Server database
  • use pandas.to_sql to insert just the head of our data, to automate the table creation
  • create a turbodbc connection
  • create turbodbc code for data insertion
  • upload a sample of data using turbodbc
  • upload the same sample of data, but this time using sqlAlchemy
  • compare user time for both methods
  • profit!
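To set the stage, here is a minimal sketch of the load-and-treat step; the CSV path and the column clean-up are placeholders for whatever your data actually needs:

```python
import time

import pandas as pd

# load the raw data; the path is a placeholder for your own source
df = pd.read_csv('my_data.csv')

# treat the data as needed, e.g. normalizing column names
df.columns = [col.lower().replace(' ', '_') for col in df.columns]
```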

Environment and conditions

  • Python 3.6.7 :: Anaconda, Inc.
  • TURBODBC version ‘3.0.0’
  • sqlAlchemy version ‘1.2.12’
  • pandas version ‘0.23.4’
  • Microsoft SQL Server 2014
  • user with bulk operations privileges

The code

Create the table using sqlAlchemy

Unfortunately, turbodbc requires a lot of overhead, with a lot of manual SQL labor both for creating the tables and for inserting data into them.

Fortunately, Python is pure joy and we can automate the process of writing SQL code.

The first step is creating the table that will receive our data. However, creating the table by manually writing SQL code can be problematic if your table has more than a few columns. In my case, the tables very often have 240 columns!

This is where sqlAlchemy and pandas can still help us: pandas is bad at writing a large number of rows (10,000 in this example), but what about just 6 rows, the head of the table? By inserting only the head, we let pandas create the table for us, automating the process.

Create table on SQL Server

We use pandas + sqlAlchemy here just to prepare room for turbodbc, as previously mentioned. Note the df.head() call: we are using pandas + sqlAlchemy to insert only 6 rows of our data. This runs pretty fast and is done only to automate the table creation.
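A hedged sketch of that step; the DSN, credentials and table name are placeholders for your own environment:

```python
from sqlalchemy import create_engine

# SQL Server via pyodbc; user, password and DSN are placeholders
engine = create_engine('mssql+pyodbc://user:password@my_dsn')

# insert just the head: this is fast and lets pandas create the table for us
df.head(6).to_sql('my_table', con=engine, index=False, if_exists='replace')
```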

Turbodbc workflow

Now that the table is in place, let's get serious.

Preparing SQL commands and data for turbodbc

Turbodbc's most basic usage goes like this (see the getting started guide at https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html).
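A minimal sketch of that pattern, assuming an ODBC data source name configured on the system (the DSN and the query are placeholders):

```python
from turbodbc import connect

# connect through an ODBC data source name configured on the system
connection = connect(dsn='My awesome database')
cursor = connection.cursor()

# run a trivial query and print the result rows
cursor.execute('SELECT 42')
for row in cursor:
    print(row)
```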

Another good demonstration of how to use it, this time inserting a DataFrame with 4 columns, appears in this post on Stack Overflow (thanks, Pylander!).
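A minimal sketch of that pattern, assuming a table named test_table with four columns a, b, c and d that match the DataFrame (all of these names are hypothetical):

```python
from turbodbc import connect

connection = connect(dsn='My awesome database')
cursor = connection.cursor()

# one column name and one placeholder per column, typed by hand
sql = 'INSERT INTO test_table (a, b, c, d) VALUES (?, ?, ?, ?)'

# executemanycolumns expects one numpy array per column
values = [df['a'].values, df['b'].values, df['c'].values, df['d'].values]
cursor.executemanycolumns(sql, values)
connection.commit()
```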

As you can see, this is OK for 4 columns of data, but what about 77 (as in this particular case)? Will you manually type all 77 column names and all 77 placeholders? Well, we don't need to.

Let’s automate this code creation, being creative:
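Here is a sketch of that automation, building the SQL string straight from the DataFrame's columns (the table name my_table is a placeholder):

```python
# build the column list and the placeholder list from the DataFrame itself
columns = ', '.join(df.columns)
placeholders = ', '.join('?' for _ in df.columns)

sql = f'INSERT INTO my_table ({columns}) VALUES ({placeholders})'

# one numpy array per column, in the same order as the column list
values = [df[col].values for col in df.columns]
```

This works the same for 4 columns or 240: the SQL statement always matches the DataFrame.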

Writing data using turbodbc

I got 10,000 rows (77 columns) written in 3 seconds:
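A sketch of that final step, reusing the sql and values built above; deleting the pandas-inserted head first is an assumption about keeping the table clean:

```python
start = time.time()

cursor = connection.cursor()

# drop the 6 rows that pandas.to_sql inserted when creating the table
cursor.execute('DELETE FROM my_table')

# bulk insert the whole sample in one shot
cursor.executemanycolumns(sql, values)
connection.commit()

print(f'turbodbc finished in {time.time() - start:.1f} seconds')
```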

The author

Written on 2018-11-07 by Erick Gomes Anastácio

Data Scientist, physicist, living in São Paulo, Brazil.

Senior Consultant at Control Risks

erickfis@gmail.com

https://erickfis.github.io/portfolio/

https://www.linkedin.com/in/erick-anastácio-15241717/