Our main research question was to find an explanation for the "disappearance" of "Humanities" articles from the front page of the New York Times from the 1980s onward.
This seemed like an interesting research question: "How come Humanities falls off the map completely and stops being featured on the front page of The New York Times?" Sadly, the answer was much simpler than our assumptions and theories: within the New York Times API, the value Front-Page was simply no longer used for articles published from the 1980s onward.
This weird inconsistency in the New York Times API was a little bit of a surprise. But on the other hand it meant that we found our answer and could hand in our results.
But we didn't…
Instead we got together and tried to come up with a different research question, one we hoped would be a bit more of a challenge than our initial question.
And so our virtual journey continued into the digital wonderland of an inconsistent API and a lot of grep, wget, sed, AWK and jq.
During our digital journey some of us got lost, but we managed to find our bearings again and stick together. Determined to tackle every problem thrown at us by our teacher, the theory and the sometimes dirty hands-on hardcore coding, we manned up, pulled ourselves together and came up with several questions and the accompanying answers.
It was hard… It was dirty, and someone even got lost in jq-limbo, not to be seen again until the very end of our journey.
Now we're ready to present our results to you.
This script will perform the different actions needed to accomplish its goal. (wow, that's vague…)
How to run nytimesdata.sh
Issues
View and run the New York Times Data Script by downloading it.
This script draws information from the NY-Times article-search API.
Developments needed:
Instead of counting the total number of ALL news items, the script should rule out blogs, since those form a very specific type of news item that only came into existence over the last few decades. They do not form part of the actual newspaper. Or do they? Is somebody willing to find out?
Can someone figure out a way to show the results for each decade automatically?
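One possible starting point for the per-decade question is a small helper that generates the date range of every decade, so the rest of the script can loop over them. This is only a sketch: decade_ranges is a hypothetical function that is not part of nytimesdata.sh, and the begin_date / end_date names in the usage loop assume the script filters on the Article Search parameters of that name in YYYYMMDD form.

```shell
#!/bin/sh
# decade_ranges FIRST_YEAR LAST_YEAR
# Hypothetical helper: prints one "begin end" pair (YYYYMMDD) per decade.
decade_ranges() {
  # Round the first year down to the start of its decade.
  year=$(( $1 / 10 * 10 ))
  while [ "$year" -le "$2" ]; do
    printf '%d0101 %d1231\n' "$year" "$(( year + 9 ))"
    year=$(( year + 10 ))
  done
}

# Usage: walk over the decades and hand each range to whatever runs the query.
# Here the range is only printed; no request is sent.
decade_ranges 1850 2010 | while read -r begin end; do
  echo "${begin%????}s: begin_date=$begin end_date=$end"
done
```

Replacing the echo with the existing query command would give one result per decade without editing the dates by hand.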
View and run Script One by downloading it.
This script is the sum of multiple previous scripts from which the best scripting-solutions have remained and the broken or ugly (yes, we are judgemental) bits have been thrown away. The output of the original scripts was meant to help us get a more detailed look into the spread of Humanities front page articles viewed over different kinds of time-periods.
When writing the script we were still under the impression that only articles that had "The New York Times" as source were actually published in the NYT. Now we know this is not the case, as the New York Times sometimes (re)publishes articles written by other news agencies. We've kept NYT as a selection criterion in the script though, since we value originality but, more importantly, it is one of the 'scars' that shows our learning curve. In the end, the script now shows:
Although the graphs appear to show some interesting things (May and October aren't popular Humanities months (1); the 10th and 15th of each month seem a bad time to publish a Humanities article if your goal is to reach the front page (2)), it is hard to say that any of these differences are valid and significant. The most eye-catching results, the high percentages of front-page articles during the 1850s and 1870s (graph 1), can easily be discredited once we look at the low number of articles published in total during those decades. In short, we still need to spend more time on interpreting our results.
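To make the low-total problem visible next to each percentage, the share could be printed together with a warning whenever the total is small. The sketch below is not taken from Script One: front_page_share is a hypothetical helper, and the default threshold of 100 articles is an arbitrary assumption, not a statistical test.

```shell
#!/bin/sh
# front_page_share FRONT TOTAL [MIN_TOTAL]
# Hypothetical helper: prints the front-page share as a percentage and
# flags it when TOTAL is below MIN_TOTAL (default 100, chosen arbitrarily).
front_page_share() {
  awk -v front="$1" -v total="$2" -v min="${3:-100}" 'BEGIN {
    if (total == 0) { print "n/a (no articles)"; exit }
    note = (total < min) ? " (low total, treat with care)" : ""
    printf "%.1f%%%s\n", 100 * front / total, note
  }'
}

# Usage with made-up numbers, purely to show the output format:
# front_page_share 5 10
# front_page_share 25 1000
```

A flag like this does not make a difference significant or not; it only keeps a high percentage from a thin decade from looking as solid as one from a decade with many articles.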
View and run Script Two by downloading it.
This script extracts the total number of articles and the number of front-page articles for two free-to-choose queries directly from the API. The script, however, stops working after the 1980s, since the front-page indication is stored in a different data field from then on. Since the new data field is, ridiculously, treated as a string by the API, the API won't allow us to do a valid search on it. Therefore, the entire dataset from the two queries is downloaded into .JSON files and searched with some help from jq.
However, since the dataset of 'Science' is too big to download in one go, the script now uses two ready-made .JSON files. Otherwise, we would still be sitting here tomorrow... The ready-made example .JSON files will be delivered alongside the scripts in our final product.
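The local search on the downloaded files could look roughly like the sketch below. It rests on assumptions that should be checked against the real files: that the documents sit under .response.docs, that the page is stored in a field called print_page, and that front-page articles carry the value "1". count_front_page is a hypothetical name, not a function from Script Two.

```shell
#!/bin/sh
# count_front_page FILE
# Hypothetical helper: counts the documents in a downloaded .JSON file whose
# page field equals 1. The value is converted with tostring first, so it
# matches whether the field holds the string "1" or the number 1.
count_front_page() {
  jq '[.response.docs[] | select((.print_page | tostring) == "1")] | length' "$1"
}

# Usage (the file name is a placeholder for one of the ready-made files):
# count_front_page humanities.json
```

Doing the comparison in jq sidesteps the API-side search on the string field entirely, at the cost of having to download the whole dataset first.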