Often as programmers we have a go-to language or library that we want to use for every problem. This is especially true of Data Analysts and Data Scientists whose backgrounds may not be in Computer Science. In fact, I'll be the first to admit I wish everything could be done in Python. However, it's important to remember that no matter how much we like a certain programming language or software, it may not be the best tool for the job.
For example, just because you love to work with data in Excel, it doesn't mean Excel should be your go-to for multiple linear regression. Is it possible? Absolutely. Are there better tools for this? Absolutely! While this is a fairly obvious example, it's not always so clear cut in the real world.
Recently, a friend reached out asking for advice on how to optimize his current ETL automation. It was written in Python and (as you can see in his sample code below) performed the following actions:
- looped through a number of (very large) CSV files and read each one using Python's csv.DictReader,
- created a SQLite table for the data using Python's sqlite3 module,
- looped through each row of the data, adding it to a list,
- added that list of lists to the database table,
- queried a different table in the database using Python,
- and finally, performed string operations on certain data depending on how it was formatted.
import csv
import sqlite3

# cur and con are an existing sqlite3 cursor/connection; statename_list,
# num_results, and the members table t are defined earlier in his script.
for state in statename_list:
    full_file_name = state + "CDIST.csv"
    print(full_file_name)
    with open("./full/unzipped/" + full_file_name) as code_file:
        # Rebuild the scratch table for the current state's CSV
        cur.execute('DROP TABLE IF EXISTS cur_state')
        cur.execute('''
            CREATE TABLE cur_state (
                full_zip TEXT,
                city TEXT,
                city_fips TEXT,
                county TEXT,
                county_fips TEXT,
                state TEXT,
                state_fips TEXT,
                cong_district TEXT
            )
        ''')
        # Read every CSV row into a list of tuples, then bulk-insert it
        read_csv_data = csv.DictReader(code_file)
        to_db = []
        for i in read_csv_data:
            to_db += [(i['ZIP+4'], i['City Name'], i['City Place FIPS'], i['County Name'], i['County FIPS'], i['State Code'], i['State FIPS'], i['Congressional District'])]
        cur.executemany('INSERT INTO cur_state (full_zip,city,city_fips,county,county_fips,state,state_fips,cong_district) VALUES (?,?,?,?,?,?,?,?);', to_db)
        # Look up members in this state and normalize their ZIP+4 strings
        members_in_state = cur.execute('SELECT * FROM t WHERE State = ?', (state,)).fetchall()
        for row in members_in_state:
            num_results += 1
            total = len(row[14]) + len(row[13])
            if total > 9:
                split_string = row[14].split("-")
                fixed_full_9 = "".join(split_string)
            else:
                fixed_full_9 = row[14] + row[13]
            result = cur.execute('SELECT * FROM cur_state WHERE full_zip = ?', (fixed_full_9,)).fetchone()
        # CONCLUSION: This does go through all 1000 results
con.close()
Despite Python being one of the best languages for automation, in this particular case Python was not ideal for performing all the ETL actions from beginning to end. For starters, each CSV file had over 1 million rows of data. Even though csv.DictReader
is blazingly fast at reading in data, the rest of his code took nearly a minute to complete per file as it had to loop through millions of dictionaries and add them to lists before inserting them into the database and performing queries.
Now, we could have spent a good amount of time trying to improve and optimize his existing Python code. But why, when at least half of his code could be replaced entirely by SQLite? For those of you who don't know, SQLite is one of the fastest Relational Database Management Systems (RDBMSs) when it comes to single-user access. It is written in C and is server-less, so nothing has to be transferred over a network. Most importantly, however, the SQLite command-line interface (CLI) supports reading a CSV directly to a table — something that Python's sqlite3
module can't do.
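For instance, here's roughly what that looks like in an interactive CLI session (the database, file, and table names below are placeholders, not anything from his project):

$ sqlite3 example.db
sqlite> .mode csv
sqlite> .import ./my_data.csv my_table
sqlite> SELECT COUNT(*) FROM my_table;

If my_table doesn't already exist, .import creates it on the spot, using the CSV's header row for the column names.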
What does this mean in terms of efficiency? It means that if we could figure out a way to automate the SQLite CLI, most of the work could be done with the speed of C, while simultaneously cutting out the step that looped through the DictReader
output line by line. Once each CSV was converted into a database table, SQL queries could be performed as needed and Python could do the rest.
In the end, my solution was to write a bash script to loop through each CSV and load it directly into SQLite as a table, perform various queries, and then call a Python script to finish the remaining processes. This improved execution time from roughly one minute per CSV to around 3 to 6 seconds per CSV, depending on the file size and how many actions were performed on the data. Absent any Python, the SQL processes averaged ~1.8 seconds per file.
Coming full circle, we see that software and programming languages are nothing more than tools at our disposal. As such, it's important that we recognize their strengths and weaknesses accordingly. Just as a carpenter wouldn't use a hammer for every job, neither should we limit ourselves to our favorite digital tools if we have the knowledge and skills to do otherwise. And for God's sake, stop trying to avoid SQL; it may not be the most exciting code in the world, but it can make your life a whole lot easier.
Example Code
Though I can't provide his company's exact data here, if you're interested in trying this out on your own you can practice by downloading the Kaggle National Baby Names dataset here. This CSV is about 41 MB, with nearly 2 million rows of data. You will need a Unix operating system (Mac or Linux), or WSL on Windows to run the following script, and if you don't already have sqlite3
installed, follow the Mac or Linux instructions here to do so.
Save the NationalNames.csv
in a folder on your computer, along with the following code as a .sh
file (I called mine load.sh
):
#!/bin/bash

# Loop over every CSV in the current directory and load each one into SQLite.
# ${f%%.*} strips the file extension, so each table is named after its CSV.
for f in *.csv
do
    sqlite3 <<EOF
.open Names.db
.mode csv
.import ./$f ${f%%.*}
CREATE TABLE ${f%%.*}_girl_names AS
    SELECT *
    FROM ${f%%.*}
    WHERE Gender = 'F';
.quit
EOF
done

# run python script here (e.g., python script.py)
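The comment on the final line is where Python would take over. What that script does is up to you, but a minimal sketch of a hypothetical script.py that reads the tables created above might look like this:

# script.py -- a hypothetical follow-up step run after load.sh;
# the real post-processing depends entirely on your use case.
import sqlite3

con = sqlite3.connect("Names.db")
cur = con.cursor()

# Count the rows the bash script loaded into each table.
total_rows = cur.execute("SELECT COUNT(*) FROM NationalNames").fetchone()[0]
female_rows = cur.execute("SELECT COUNT(*) FROM NationalNames_girl_names").fetchone()[0]
print(f"All records: {total_rows}, female records: {female_rows}")

con.close()

Because the loading and filtering already happened inside SQLite, the Python side only has to run whatever queries and string operations are left.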
Now, use the terminal to navigate to your folder, make the script executable with chmod +x load.sh, and run the following command (replace "load" with whatever you named the file):
time ./load.sh
If you look in your newly created SQLite database, Names.db
, you will see two different tables: one containing all the baby names and the other containing only the female baby names, at 1.8 million and 1 million rows, respectively. If you'd like to time how fast this simple script can iterate over several CSV files of this size, copy NationalNames.csv a few times in the same folder and give each copy a different name (with no spaces). Then, delete Names.db so you're starting fresh, run the same command given above, and prepare to be amazed at just how painless automation with SQLite can be.
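And if you'd like to verify those row counts yourself, the CLI makes that easy as well (the table names here assume you kept the original NationalNames.csv file name):

$ sqlite3 Names.db
sqlite> .tables
sqlite> SELECT COUNT(*) FROM NationalNames;
sqlite> SELECT COUNT(*) FROM NationalNames_girl_names;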