TDM 10200: Project 3 — 2023
Motivation: Learning about Big Data. When working with large data sets, it is important to know how we can use loops to find our information, a little bit at a time, without reading in all of the files at once. We will need to set our cores to 4 for this when we spin up Juypter Notebook. This will give us more space, to handle processing this week’s datasets. If we do not adjust the cores, then our kernel will crash every time we try to run a cell.
Scope: Python, If statements, for loops,
Questions
ONE
First let’s go ahead and see what files are in this dataset
ls /anvil/projects/tdm/data/flights/subset/
This allows you to see all of the files that are in this dataset.
-
Considering the files in this directory, which years of data are available?
-
What would the path be to access only the 2003 file in this data set?
-
Go ahead and import the library
pandas
aspd
, and importPath
frompathlib
Helpful Hints
/anvil/projects/tdm/data/flights/subset/2003.csv
import pandas as pd
from pathlib import Path
-
Code used to answer the question.
TWO
files = [Path(f'/anvil/projects/tdm/data/flights/subset/{year}.csv') for year in range(1987,2009)]
files
This code uses list comprehension to create a list of file paths, ranging from the years of 1987-2008
files
will now contain strings of file paths for all the csv files in this directory.
Notice that, in a range in Python, the final number in the range is not included.
Let’s test out the first file (from 1987) to see if we can find the column names
-
How many columns are there?
-
What are the column names?
-
Display the data from the first five rows.
Helpful Hints
#reads the first file into a df called eightseven
eightyseven = pd.read_csv(files[0])
#looks for column names from the df
column_names = eightyseven.columns
print(column_names)
-
Code used to answer the question.
-
Result of code.
THREE
Let’s look at the column Origin
-
Print out the unique elements of the
Origin
column. -
Find the number of times that
IND
occurs in theOrigin
column.
Helpful Hint
eightyseven['Origin'].value_counts()['IND']
-
Code used to answer the question.
-
Result of the code
FOUR
Let’s do the same thing for the 1988, 1989, and 1990 data sets, naming them eightyeight
, eightynine
, ninety
.
-
How many times is 'IND' the
Origin
airport ineightyeight
? -
How many times is 'IND' the
Origin
airport ineightynine
? -
How many times is 'IND' the
Origin
airport inninety
?
-
Code used to answer questions
-
Result of the code
FIVE
Knowing that we can find how many times the value IND
shows up in Origin
, we could do this for each of the years, and add them up, to find the total number of times that Indianapolis airport was the origin airport for flights taken during the years 1987-2008. But that is a lot of work!! We can shorten the tedious work of keeping track and adding things manually, by (instead) using a for loop, to go thru all of the subset files and keep track of the total number of times that IND
shows up in the Origin
column.
We could use this code
count = 0
for file in files:
df = pd.read_csv(file)
count += len(df[df['Origin'] == 'IND'])
print(count)
BUT before you go ahead and copy and paste, this code will take ALL of the files and read them into memory, and this will crash the kernel, even if we upped the cores to 4.
This means we have to think of another way to do it.
We have to use for loops
which is a way to iterate to check for certain conditions and repeatedly execute them. This is very helpful when you come across situations in which you need to use a specific code over and over again but you don’t want to write the same line of code multiple times.
We need to consider a way that will allow us to go thru the files line by line, and read them but then not commit them to memory. In this way, we can go thru all of the data files and still keep track of how many occurrences we have, for a specific value.
total_count = 0
for file in files:
for df in pd.read_csv(file, chunksize=10000):
for index, row in df.iterrows():
if row['Origin'] == 'IND':
total_count += 1
print(total_count)
You will note that doing the above code DOES produce the correct answer BUT it take a very long time to run! Is there a shorter way to run this code?
Helpful Hint
origin_ind = 0
for file in files:
with open(file,'r') as f:
for line in f:
if line.split(",")[16] == 'IND':
origin_ind += 1
print(origin_ind)
-
Code used to answer the question.
-
Result of code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |