TDM 40100: Project 2 — 2022
Motivation: The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a sqlite3 database, design a schema, insert data, create indexes, etc. This series of projects is focused on SQL and sqlite3, with the opportunity to use other skills you’ve built throughout the previous years.
Context: In TDM 20100 (formerly STAT 29000), you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with sqlite3 — a powerful database engine. In this project (and the following projects), we will branch into SQL and sqlite3-specific topics and techniques that you haven’t yet had exposure to in The Data Mine.
Scope: sqlite3, lmod, SQL
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json
- /anvil/projects/tdm/data/goodreads/goodreads_book_series.json
- /anvil/projects/tdm/data/goodreads/goodreads_books.json
- /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json
Questions
Question 1
The goodreads dataset has a variety of files (see /anvil/projects/tdm/data/goodreads/original). That being said, there are 4 files which hold the bulk of the data; the rest are mostly derivatives of those 4 files.
- /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json
- /anvil/projects/tdm/data/goodreads/goodreads_book_series.json
- /anvil/projects/tdm/data/goodreads/goodreads_books.json
- /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json
Take a look at the 4 files included in this dataset. How many bytes of data total do the 4 files take up on the filesystem?
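One way to total the sizes is GNU du with -b (apparent size in bytes) and -c (grand total). A minimal, runnable sketch using tiny stand-in files — on Anvil you would point du at the real /anvil/projects/tdm/data/goodreads/goodreads_*.json paths instead:

```shell
# Stand-in files so the sketch runs anywhere; substitute the real paths on Anvil.
dir=$(mktemp -d)
printf 'aaaa' > "$dir/a.json"    # 4 bytes
printf 'bbbbbb' > "$dir/b.json"  # 6 bytes

# -b reports apparent size in bytes, -c appends a grand-total line
du -cb "$dir"/*.json | tail -n 1   # prints "10  total"
```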
Approximately how many books and how many reviews are included in the datasets?
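The records in these files appear to be newline-delimited JSON (one object per line — see the first book below), so wc -l gives an approximate record count. A sketch on a stand-in file; swap in goodreads_books.json or goodreads_reviews_dedup.json on Anvil:

```shell
# Stand-in file; on Anvil:
#   wc -l < /anvil/projects/tdm/data/goodreads/goodreads_books.json
dir=$(mktemp -d)
printf '{"book_id": "1"}\n{"book_id": "2"}\n{"book_id": "3"}\n' > "$dir/books.json"

# one JSON object per line, so line count ~= number of books
wc -l < "$dir/books.json"   # prints 3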
Finally, take a look at the first book.
{"isbn": "0312853122", "text_reviews_count": "1", "series": [], "country_code": "US", "language_code": "", "popular_shelves": [{"count": "3", "name": "to-read"}, {"count": "1", "name": "p"}, {"count": "1", "name": "collection"}, {"count": "1", "name": "w-c-fields"}, {"count": "1", "name": "biography"}], "asin": "", "is_ebook": "false", "average_rating": "4.00", "kindle_asin": "", "similar_books": [], "description": "", "format": "Paperback", "link": "https://www.goodreads.com/book/show/5333265-w-c-fields", "authors": [{"author_id": "604031", "role": ""}], "publisher": "St. Martin's Press", "num_pages": "256", "publication_day": "1", "isbn13": "9780312853129", "publication_month": "9", "edition_information": "", "publication_year": "1984", "url": "https://www.goodreads.com/book/show/5333265-w-c-fields", "image_url": "https://images.gr-assets.com/books/1310220028m/5333265.jpg", "book_id": "5333265", "ratings_count": "3", "work_id": "5400751", "title": "W.C. Fields: A Life on Film", "title_without_series": "W.C. Fields: A Life on Film"}
As you can see, there is an image_url included for each book. Use bash tools to download one of the images to $HOME/p02output. How much space does it take up (in bytes)?
It is okay to manually copy/paste the link from the json.
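For example, wget can fetch the image and stat can report its size in bytes. The URL here is the image_url from the first book's record shown above; this is one possible approach, not the only one:

```shell
mkdir -p "$HOME/p02output"

# image_url copied from the first book's JSON record shown above
url="https://images.gr-assets.com/books/1310220028m/5333265.jpg"
wget -q "$url" -O "$HOME/p02output/5333265.jpg" || echo "download failed"

# %s prints the file size in bytes
stat -c %s "$HOME/p02output/5333265.jpg"
```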
- Code used to solve this problem.
- Output from running the code.
Question 2
We decided we want to download more than 1 image in order to approximate how much space the images take on average.
In the previous question we said it was okay to manually copy/paste the link.
Use bash tools (and only bash tools, from within a bash cell) to download 25 random book images to $HOME/p02output, and calculate the average amount of space that each image takes up. Use that information to estimate how much space it would take to store the images for all of the books in the dataset.
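One possible pipeline, sketched on a stand-in file (assuming, as above, one JSON object per line): shuf samples random records, then grep -o and cut pull out the image_url values. On Anvil, point shuf at goodreads_books.json instead of the stand-in:

```shell
# Tiny stand-in for goodreads_books.json
dir=$(mktemp -d)
{
  printf '{"book_id": "1", "image_url": "https://images.gr-assets.com/books/1/1.jpg"}\n'
  printf '{"book_id": "2", "image_url": "https://images.gr-assets.com/books/2/2.jpg"}\n'
  printf '{"book_id": "3", "image_url": "https://images.gr-assets.com/books/3/3.jpg"}\n'
} > "$dir/books.json"

# sample 2 random records and extract the image_url value from each
shuf -n 2 "$dir/books.json" | grep -o '"image_url": "[^"]*"' | cut -d'"' -f4

# after wget-ing each URL into $HOME/p02output, average the sizes with, e.g.:
#   du -b "$HOME"/p02output/* | awk '{s += $1} END {print s / NR}'
```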
- Code used to solve this problem.
- Output from running the code.
Question 3
Okay, so roughly, in total, we are looking at around 34 GB of data. At that size it will definitely be useful for us to create a database. After all, answering questions like:
- What is the average rating of Brandon Sanderson’s books?
- What are the titles of the 5 books with the most ratings?
- Etc.

is not very straightforward if we handed you this data and said "get that info please". But if we had a nice sqlite database, it would be easy! So let’s start planning this out.
First, before we do that, it would make sense to get a sample of each of the datasets. Working with samples just makes it a lot easier to load the data up and parse through it.
Use shuf to get a random sample of the goodreads_books.json and goodreads_reviews_dedup.json datasets. Approximate how many rows you’d need in order to get the datasets down to around 100 MB each, and do so. Put the samples, and copies of goodreads_book_authors.json and goodreads_book_series.json, in $HOME/goodreads_samples.
It just needs to be approximately 100 MB — no need to fuss; as long as it is within 50 MB we are good.
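A sketch of the sampling logic on a generated stand-in file: measure the average bytes per line, divide the target size by it, then shuf -n that many rows. The stand-in file, the 5000-byte target, and the output name are all placeholders; on Anvil use the real files and a target of roughly 100000000 bytes:

```shell
mkdir -p "$HOME/goodreads_samples"

# generated stand-in for goodreads_books.json (1000 one-line JSON objects)
dir=$(mktemp -d)
seq 1 1000 | sed 's/^/{"book_id": "/; s/$/"}/' > "$dir/books.json"

bytes=$(stat -c %s "$dir/books.json")
lines=$(wc -l < "$dir/books.json")
target=5000                      # placeholder; ~100000000 for the real 100 MB goal
rows=$(( target / (bytes / lines) ))   # target size / average bytes per row

shuf -n "$rows" "$dir/books.json" > "$HOME/goodreads_samples/books_sample.json"
wc -l < "$HOME/goodreads_samples/books_sample.json"
```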
- Code used to solve this problem.
- Output from running the code.
Question 4
Check out the 5 storage classes (which you can think of as types) that sqlite3 uses: www.sqlite.org/datatype3.html
In a markdown cell, write out each of the keys in each of the json files, and list the appropriate storage class to use. I’ve provided an example of what we are looking for, for goodreads_reviews_dedup.json.
- user_id: TEXT
- book_id: INTEGER
- review_id: TEXT
- rating: INTEGER
- review_text: TEXT
- date_added: TEXT
- date_updated: TEXT
- read_at: TEXT
- started_at: TEXT
- n_votes: INTEGER
- n_comments: INTEGER
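One quick way to list a record’s keys (assuming one JSON object per line and a mostly flat record) is to grep the quoted names that are followed by a colon out of the first line. Sketched on a stand-in record; on Anvil, run it against each of the 4 json files:

```shell
# Stand-in record; on Anvil, e.g.:
#   head -n 1 goodreads_reviews_dedup.json | grep -o '"[a-z_]*":' | tr -d '":'
dir=$(mktemp -d)
printf '{"user_id": "abc", "book_id": "5", "rating": 4}\n' > "$dir/reviews.json"

# every `"name":` occurrence in the first record, stripped of quotes and colon
head -n 1 "$dir/reviews.json" | grep -o '"[a-z_]*":' | tr -d '":'
```

Note that on the real goodreads_books.json this also surfaces keys nested inside objects like popular_shelves, so skim the output rather than trusting it blindly.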
You don’t need to copy/paste the provided solution for goodreads_reviews_dedup.json.
You do not need to assign a type to the following keys in goodreads_books.json: series, popular_shelves, similar_books, and authors.
- Code used to solve this problem.
- Output from running the code.
Question 5
To create the database, launch sqlite3 in a new terminal with your database file as the argument; you will then be inside an interactive sqlite3 session.
For now, let’s ignore the "problematic" columns in the goodreads_books.json dataset (series, popular_shelves, similar_books, and authors).
Translate the work you did in the previous question into 4 CREATE TABLE statements that will be used to create your sqlite3 database tables. Check out some examples here. For now, let’s keep it straightforward — ignore primary and foreign keys, and just focus on building the 4 tables with the correct types. Similarly, don’t worry about any restrictions like NOT NULL or UNIQUE. Name your tables: reviews, books, series, and authors.
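A minimal sketch of the pattern for one table, driven from a bash cell with a heredoc. The reviews columns are the ones listed in the previous question; the database path here is a throwaway for the sketch, not the $HOME/my.db you will create below:

```shell
db="$HOME/demo.db"   # throwaway path for this sketch; use $HOME/my.db for the project
rm -f "$db"

# feed a CREATE TABLE statement to sqlite3 via a heredoc
sqlite3 "$db" <<'SQL'
CREATE TABLE reviews (
    user_id TEXT,
    book_id INTEGER,
    review_id TEXT,
    rating INTEGER,
    review_text TEXT,
    date_added TEXT,
    date_updated TEXT,
    read_at TEXT,
    started_at TEXT,
    n_votes INTEGER,
    n_comments INTEGER
);
SQL

# confirm the table exists
sqlite3 "$db" "SELECT name FROM sqlite_master WHERE type='table';"   # prints "reviews"
```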
Once you’ve created your CREATE TABLE statements, create a database called my.db in your $HOME directory — so $HOME/my.db. Run your CREATE TABLE statements, and, in your notebook, verify the database has been created properly by running the following.
%sql sqlite:////home/x-kamstut/my.db # change x-kamstut to your username
%%sql
SELECT sql FROM sqlite_master WHERE name='reviews';
%%sql
SELECT sql FROM sqlite_master WHERE name='books';
%%sql
SELECT sql FROM sqlite_master WHERE name='series';
%%sql
SELECT sql FROM sqlite_master WHERE name='authors';
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted was what you actually submitted. In addition, please review our submission guidelines before submitting your project.