In this demo we will upload data to a SQL Server database using turbodbc.
The principal reason for turbodbc: for uploading real data, pandas.to_sql is painfully slow, and the workarounds to make it faster are pretty hairy, if you ask me.
The first time I came across this problem, I had 8 tables with 1.6 million rows and 240 columns each. pandas.to_sql ran for an entire day before I gave up on the upload. Until now I was using the company's proprietary tool while trying to get pandas working.
After many hours running in circles around pandas workarounds, I gave up on them, but only because I discovered turbodbc, this piece of pure love!
Get turbodbc at https://turbodbc.readthedocs.io/en/latest/index.html
Quick comparison: in this script, loading 10,000 rows of 77 columns takes 3 seconds with turbodbc versus 198 seconds with pandas.to_sql (full numbers below).
In this python script, we will read data from a pickle file, clean it up with pandas, create the destination table with pandas + sqlAlchemy, and then upload the data with turbodbc.
Substitute my sample.pkl with yours:
import pandas as pd
import numpy as np

df = pd.read_pickle('sample.pkl')
df.columns = df.columns.str.strip()  # remove white spaces around column names
df = df.applymap(str.strip)  # remove white spaces around values
df = df.replace('', np.nan)  # map empty strings to NaN, to drop all-NA rows and columns next
df = df.dropna(how='all', axis=0)  # remove rows containing only NAs
df = df.dropna(how='all', axis=1)  # remove columns containing only NAs
df = df.replace(np.nan, 'NA')  # turbodbc hates null values...
Unfortunately, turbodbc requires a lot of overhead and a lot of manual SQL labor, both for creating the tables and for inserting data into them.
Fortunately, Python is pure joy and we can automate this process of writing SQL code.
The first step is creating the table that will receive our data. However, writing that SQL by hand can be problematic if your table has more than a few columns. In my case, very often the tables have 240 columns!
This is where sqlAlchemy and pandas can still help us: pandas is bad at writing a large number of rows (10,000 in this example), but what about just the few rows returned by df.head()? This way, we automate the process of creating the tables.
So we use pandas + sqlAlchemy, but just to prepare room for turbodbc, as previously mentioned. Please note the df.head() here: we use pandas + sqlAlchemy to insert only the first few rows of our data. This runs pretty fast and exists solely to automate the table creation.
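A minimal sketch of that step, assuming a pyodbc-backed sqlAlchemy engine; the connection string and table name below are placeholders for your own:

from sqlalchemy import create_engine

# hypothetical connection string: swap in your own user, password, server and database
engine = create_engine(
    'mssql+pyodbc://my_user:my_password@my_server/db_name'
    '?driver=ODBC+Driver+17+for+SQL+Server'
)

# insert only the head of the DataFrame: fast, and it creates the table for us
df.head().to_sql('test_table', engine, if_exists='replace', index=False)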
Now that the table is already in place, let's get serious here.
Turbodbc's very basic usage goes like this:
parameter_sets = [[42, 17],
                  [23, 19],
                  [314, 271]]
cursor.executemany("INSERT INTO TABLE my_integer_table VALUES (?, ?)",
                   parameter_sets)
Extracted from https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html
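Both this snippet and the next assume an open turbodbc connection and cursor. A minimal sketch of setting one up (every connection parameter below is a placeholder, adjust for your environment):

from turbodbc import connect

# hypothetical credentials: replace with your own
connection = connect(driver='ODBC Driver 17 for SQL Server',
                     server='my_server',
                     database='db_name',
                     uid='my_user',
                     pwd='my_password')
cursor = connection.cursor()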
Another good demonstration of how to use it, this time with DataFrames; this DataFrame has 4 columns:
test_query = """
INSERT INTO [db_name].[schema].[test] (id, transaction_dt, units, measures)
VALUES (?, ?, ?, ?)
"""
cursor.executemanycolumns(test_query,
                          [df_test['id'].values,
                           df_test['transaction_dt'].values,
                           df_test['units'].values,
                           df_test['measures'].values])
Extracted from this post on stackoverflow, thanks to Pylander!
As you can see, this is fine for 4 columns of data, but what about 77, as in this particular case? Will you manually type in all 77 column names and all 77 placeholders? Well, we don't need to.
Let's get creative and automate this code generation:
import time

def turbo_write(mydb, df, table):
    """Use turbodbc to insert data into sql."""
    # uses the global turbodbc `connection` opened above
    start = time.time()
    # preparing columns
    colunas = '('
    colunas += ', '.join(df.columns)
    colunas += ')'
    # preparing value place holders
    val_place_holder = ['?' for col in df.columns]
    sql_val = '('
    sql_val += ', '.join(val_place_holder)
    sql_val += ')'
    # writing sql query for turbodbc
    sql = f"""
    INSERT INTO {mydb}.dbo.{table} {colunas}
    VALUES {sql_val}
    """
    # writing array of values for turbodbc
    valores_df = [df[col].values for col in df.columns]
    # cleans the previous head insert
    with connection.cursor() as cursor:
        cursor.execute(f"delete from {mydb}.dbo.{table}")
        connection.commit()
    # inserts data, for real
    with connection.cursor() as cursor:
        try:
            cursor.executemanycolumns(sql, valores_df)
            connection.commit()
        except Exception:
            connection.rollback()
            print('something went wrong')
    stop = time.time() - start
    print(f'finished in {stop} seconds')
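Calling it is then a one-liner. The database and table names below are placeholders matching the head insert above:

# upload the full DataFrame into the table created by the head insert
turbo_write('db_name', df, 'test_table')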
With turbodbc, I've got the 10,000 rows (77 columns) written in 3 seconds.
With pandas.to_sql, the same 10,000 rows (77 columns) took 198 seconds…
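For reference, the 198-second pandas figure comes from the plain to_sql route, roughly like this (reusing the hypothetical engine from the table-creation step):

# the same full upload through pandas, for timing comparison
df.to_sql('test_table', engine, if_exists='replace', index=False)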
Written on 2018-11-07 by Erick Gomes Anastácio
Data Scientist, physicist, living in São Paulo, Brazil.
Senior Consultant at Control Risks