TDM 30100: Project 4 — 2022
Motivation: Documentation is one of the most critical parts of a project. There are many tools specifically designed to help document a project, and each has its own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can’t go wrong with tools like Sphinx or pdoc.
Context: This is the third project in a 3-project series where we explore thoroughly documenting Python code while solving data-driven problems.
Scope: Python, documentation
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json
- /anvil/projects/tdm/data/goodreads/goodreads_book_series.json
- /anvil/projects/tdm/data/goodreads/goodreads_books.json
- /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json
Questions
Question 1
The listed datasets are fairly large, and interesting! They are json-formatted data. Each row of a single json file can be individually read in and processed. Take a look at a single row.
%%bash
head -n 1 /anvil/projects/tdm/data/goodreads/goodreads_books.json
This is nice, because you can individually process a single row. Anytime you can do something like this, it is easy to break a problem into smaller pieces and speed up processing. The following demonstrates how you can read in a single line and process it.
import json

with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f:
    for line in f:
        print(line)

        # parse the single line of json into a dict
        parsed = json.loads(line)

        # access individual fields by key
        print(f"{parsed['isbn']=}")
        print(f"{parsed['num_pages']=}")

        break
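For instance, here is a sketch that parses only the first few rows and never reads the rest of the file (islice is one convenient way to cap the loop; the book_id and title keys appear in the sample row shown earlier):

import json
from itertools import islice

with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f:
    # parse just the first 3 rows; the rest of the file is never read
    for line in islice(f, 3):
        parsed = json.loads(line)
        print(parsed['book_id'], parsed['title'])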
In this project, the overall goal will be to implement functions that perform certain operations, write the best docstrings you can, and use your choice of pdoc or sphinx to generate a pretty set of documentation.

Begin this project by choosing a tool, pdoc or sphinx, and setting up a firstname-lastname-project04.py module that will host your Python functions. In addition, create a Jupyter notebook that will be used to test out your functions and generate your documentation. At the end of this project, your deliverables will be your .ipynb notebook and either a series of screenshots that capture your documentation, or a PDF created by exporting the resulting webpage of documentation.
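If you go the pdoc route, for example, a recent pdoc release can write static HTML documentation from the command line roughly as follows (a sketch, not the only workflow; older pdoc3 releases use pdoc --html instead, and sphinx has its own quickstart-based workflow):

%%bash
# write static HTML docs for the module into a ./docs folder;
# assumes a recent pdoc release is installed
python -m pdoc firstname-lastname-project04.py -o ./docs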
- Code used to solve this problem.
- Output from running the code.
Question 2
Write a function called scrape_image_from_url that accepts a URL (as a string) and returns a bytes object of the data. Make sure scrape_image_from_url cleans up after itself and doesn’t leave any image files on the filesystem.
- Create a variable with a temporary file name using the uuid package.
- Use the requests package to get the response:

import requests

response = requests.get(url, stream=True)
# then the first argument to copyfileobj will be response.raw

- Open the file and use the shutil package's copyfileobj method to copy response.raw to the file.
- Open the file and read the contents into a bytes object. You can verify a bytes object by checking that type(my_object) outputs bytes.
- Use os.remove to remove the image file.
- Return the bytes object.
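One way to put these hints together is sketched here (not the required solution; the .jpg suffix on the temporary filename is an assumption that happens to suit this project's image URLs):

import os
import shutil
import uuid

import requests

def scrape_image_from_url(url):
    """
    Download the image at the given URL and return its data as a bytes object.

    Args:
        url (str): The URL of the image to download.

    Returns:
        bytes: The raw bytes of the downloaded image.
    """
    # unique temporary filename, so repeated calls never collide
    temp_file = f"{uuid.uuid4()}.jpg"

    # stream the response so response.raw can be copied straight to disk
    response = requests.get(url, stream=True)

    with open(temp_file, "wb") as f:
        shutil.copyfileobj(response.raw, f)

    # read the bytes back in, then clean up after ourselves
    with open(temp_file, "rb") as f:
        my_bytes = f.read()

    os.remove(temp_file)

    return my_bytes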
You can verify your function works by running the following:
import shutil
import requests
import os
import uuid
import hashlib
url = 'https://images.gr-assets.com/books/1310220028m/5333265.jpg'
my_bytes = scrape_image_from_url(url)
m = hashlib.sha256()
m.update(my_bytes)
m.hexdigest()
The output should be:

ca2d4506088796d401f0ba0a72dda441bf63ca6cc1370d0d2d1d2ab949b00d02
- Code used to solve this problem.
- Output from running the code.
Question 3
Write a function called json_to_sql that accepts a single row of the goodreads_books.json file (as a string), a table name (as a string), as well as a set of values to "skip". This function should then return a string that is a valid INSERT INTO SQL statement. See here for an example of an INSERT INTO statement.
The following is a real example you can test out.
with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f:
    for line in f:
        first_line = str(line)
        break

first_line
json_to_sql(first_line, 'books', {'series', 'popular_shelves', 'authors', 'similar_books'})
"INSERT INTO books (isbn,text_reviews_count,country_code,language_code,asin,is_ebook,average_rating,kindle_asin,description,format,link,publisher,num_pages,publication_day,isbn13,publication_month,edition_information,publication_year,url,image_url,book_id,ratings_count,work_id,title,title_without_series) VALUES ('0312853122','1','US','','','false','4.00','','','Paperback','https://www.goodreads.com/book/show/5333265-w-c-fields','St. Martin's Press','256','1','9780312853129','9','','1984','https://www.goodreads.com/book/show/5333265-w-c-fields','https://images.gr-assets.com/books/1310220028m/5333265.jpg','5333265','3','5400751','W.C. Fields: A Life on Film','W.C. Fields: A Life on Film');"
Here is some (maybe) helpful logic:
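One plausible shape for that logic is sketched below. It assumes every non-skipped value can be inserted as a single-quoted string, and it does not escape embedded single quotes, which matches the example output above:

import json

def json_to_sql(line, table_name, skip):
    """
    Convert one json-formatted row into a valid INSERT INTO statement.

    Args:
        line (str): A single row of the json file.
        table_name (str): The name of the table to insert into.
        skip (set): Keys to exclude from the statement.

    Returns:
        str: The INSERT INTO statement.
    """
    parsed = json.loads(line)

    # keep only the keys that are not skipped
    columns = [key for key in parsed if key not in skip]

    # single-quote every value; embedded quotes are left unescaped here
    values = ",".join(f"'{parsed[column]}'" for column in columns)

    return f"INSERT INTO {table_name} ({','.join(columns)}) VALUES ({values});"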
- Code used to solve this problem.
- Output from running the code.
Question 4
Create a new function that does something interesting with one or more of these datasets. Just like all the previous functions, make sure to include detailed and clear docstrings.
- Code used to solve this problem.
- Output from running the code.
Question 5
Generate your final documentation, and assemble and submit your deliverables:
- The .ipynb file testing out your functions.
- The firstname-lastname-project04.py module that includes all of your functions and associated docstrings.
- Screenshots and/or a PDF exported from your resulting documentation web page. Basically, something that shows us your resulting documentation.
- Code used to solve this problem.
- Output from running the code.
Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted was what you actually submitted. In addition, please review our submission guidelines before submitting your project.