Write a Script to Crawl News Feed

Automatic news scraping with Python, Newspaper and Feedparser

September 17, 2017 | 13 Minute Read

I recently joined an AI hackathon where we took on the challenging task of trying to recognize fake news. Early on, I worked on automatically scraping news articles from various news sites. I was surprised by how easy this was to implement using a really nice Python library called Newspaper.

Note: The code repository contains improvements that are not included in this tutorial. Please do read through to understand how the code works, but make sure to also have a look at the source code afterwards.

I haven't worked much with Python before, and I never realized how many great libraries are available to Python users. Some are so well made and feature-rich that, with an interface, they could work as standalone products. Few programming languages or frameworks can compete with the resources available to Python users.

We wanted to gather large amounts of news articles to train our network so that it could distinguish real news from fake news. It was important to have the data in a tidy format so that it would be easy for us to work with. To automate the process, I created a scraping script with the help of Newspaper. Do take a look at the library; it can do much more than just scrape articles from the web! I also use Feedparser to read RSS-feeds, as I did not realize until later that Newspaper has this feature built in as well. The script scrapes articles from a website's RSS-feed when one is available; as a fallback, Newspaper's automatic article scraper is used for sites where I could not find an RSS-feed. I decided to scrape from the RSS-feed first because the data gathered that way was much more consistent. In particular, the publish date/time of an article would often be missing when using the automatic article scraper. Since the publish date was important for our solution, I put extra focus on getting it included in the dataset.

```python
import feedparser as fp
import json
import newspaper
from newspaper import Article
from time import mktime
from datetime import datetime

# Set the limit for number of articles to download
LIMIT = 4

data = {}
data['newspapers'] = {}
```

We start by importing some libraries. mktime and datetime will be used to convert various date formats to a common one. The download limit for each website is set to 4 here, but can of course be higher. We also initialize a data object that we will store our scraped data in.

Next, we create a file called NewsPapers.json where we can easily add and remove the websites/newspapers we want the script to scrape. It is a JSON file with the following format:

```json
{
  "cnn": {
    "link": "http://edition.cnn.com/"
  },
  "bbc": {
    "rss": "http://feeds.bbci.co.uk/news/rss.xml",
    "link": "http://www.bbc.com/"
  },
  "theguardian": {
    "rss": "https://www.theguardian.com/uk/rss",
    "link": "https://www.theguardian.com/international"
  },
  "breitbart": {
    "link": "http://www.breitbart.com/"
  },
  "infowars": {
    "link": "https://www.infowars.com/"
  },
  "foxnews": {
    "link": "http://www.foxnews.com/"
  },
  "nbcnews": {
    "link": "http://www.nbcnews.com/"
  },
  "washingtonpost": {
    "rss": "http://feeds.washingtonpost.com/rss/world",
    "link": "https://www.washingtonpost.com/"
  },
  "theonion": {
    "link": "http://www.theonion.com/"
  }
}
```

It's a good mix of websites, you could say…
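Since the script trusts this file completely, a small sanity check can catch typos before a long scraping run. The helper below is hypothetical (not part of the original script) and only assumes the schema above: every site must have a link, and rss, when present, should look like a URL.

```python
# Hypothetical config check, not in the original script.
def validate_config(companies):
    problems = []
    for name, value in companies.items():
        if 'link' not in value:
            problems.append(f"{name}: missing 'link'")
        if 'rss' in value and not value['rss'].startswith('http'):
            problems.append(f"{name}: 'rss' does not look like a URL")
    return problems

# Example config with one broken entry.
companies = {
    "bbc": {"rss": "http://feeds.bbci.co.uk/news/rss.xml",
            "link": "http://www.bbc.com/"},
    "cnn": {"link": "http://edition.cnn.com/"},
    "broken": {"rss": "not-a-url"},
}
print(validate_config(companies))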

So we open this json file in our python script:

                          # Loads the JSON files with news sites                            with              open              (              'NewsPapers.json'              )              as              data_file              :              companies              =              json              .              load              (              data_file              )                      

Note that the naming is a little inconsistent (e.g. companies/newspaper/website is all the same), it was created on a hackathon with limited time spent thinking about variable names.

                          count              =              1              # Iterate through each news company                            for              company              ,              value              in              companies              .              items              ():              # If a RSS link is provided in the JSON file, this will be the first choice.                            # Reason for this is that, RSS feeds often give more consistent and correct data.                            # If you do not want to scrape from the RSS-feed, just leave the RSS attr empty in the JSON file.                            if              'rss'              in              value              :              d              =              fp              .              parse              (              value              [              'rss'              ])              print              (              "Downloading articles from "              ,              company              )              newsPaper              =              {              "rss"              :              value              [              'rss'              ],              "link"              :              value              [              'link'              ],              "articles"              :              []              }              for              entry              in              d              .              entries              :              # Check if publish date is provided, if no the article is skipped.                            # This is done to keep consistency in the data and to keep the script from crashing.                            
if              hasattr              (              entry              ,              'published'              ):              if              count              >              LIMIT              :              break              article              =              {}              article              [              'link'              ]              =              entry              .              link              date              =              entry              .              published_parsed              article              [              'published'              ]              =              datetime              .              fromtimestamp              (              mktime              (              date              )).              isoformat              ()              try              :              content              =              Article              (              entry              .              link              )              content              .              download              ()              content              .              parse              ()              except              Exception              as              e              :              # If the download for some reason fails (ex. 404) the script will continue downloading                            # the next article.                            print              (              e              )              print              (              "continuing..."              )              continue              article              [              'title'              ]              =              content              .              title              article              [              'text'              ]              =              content              .              text              newsPaper              [              'articles'              ].              
append              (              article              )              print              (              count              ,              "articles downloaded from"              ,              company              ,              ", url: "              ,              entry              .              link              )              count              =              count              +              1                      

We will break this into parts and see what is going on.

                          count              =              1              # Iterate through each news company                            for              company              ,              value              in              companies              .              items              ():              if              'rss'              in              value              :              d              =              fp              .              parse              (              value              [              'rss'              ])              print              (              "Downloading articles from "              ,              company              )              newsPaper              =              {              "rss"              :              value              [              'rss'              ],              "link"              :              value              [              'link'              ],              "articles"              :              []              }                      

What we do here is to iterate through our imported JSON-file and checking whether a rss key is provided for each company website. If this value exists, we'll use FeedParser to load the RSS-feed of the given website. We start building the structure for the data we want to gather by constructing a dictionary newsPaper.

                          for              entry              in              d              .              entries              :              if              hasattr              (              entry              ,              'published'              ):              if              count              >              LIMIT              :              break              article              =              {}              article              [              'link'              ]              =              entry              .              link              date              =              entry              .              published_parsed              article              [              'published'              ]              =              datetime              .              fromtimestamp              (              mktime              (              date              )).              isoformat              ()                      

The variable d contains a list of links to articles taken from the RSS-feed that we will loop through. To get consistent data a check is done to see if the entry has a publish date. If it does not have one the entry is discarded. An article dictionary is created to store data for each article. To get the publish date, we extract the published_parsed value from the entry and do some formatting to get it on the same form as dates given by the Newspaper library (Note: I can imagine are better ways to do this).

                          try              :              content              =              Article              (              entry              .              link              )              content              .              download              ()              content              .              parse              ()              except              Exception              as              e              :              print              (              e              )              print              (              "continuing..."              )              continue                      

While we have gone through the RSS-feed, we have not actually scraped the articles yet. To do this we use the Newspaper library to scrape the content of the links we got from the RSS-feed. We put this into a try block just in case the loading fails, ensuring that the script continues without crashing. If anything weird happens, the script will dump some text and then the continue will jump the script to the next loop.

                          article              [              'title'              ]              =              content              .              title              article              [              'text'              ]              =              content              .              text              newsPaper              [              'articles'              ].              append              (              article              )              print              (              count              ,              "articles downloaded from"              ,              company              ,              ", url: "              ,              entry              .              link              )              count              =              count              +              1                      

If everything works fine we will store the title and text to our article object and then add this to the list of articles in the newsPaper dictionary.

Now, not every site has a RSS-feed anymore as it is to some degree, a dying technology (Did I just say that?). I wanted to get all articles via RSS because the data would be much more consistent, but for those websites that do not have one we need a backup.

                          else              :              # This is the fallback method if a RSS-feed link is not provided.                            # It uses the python newspaper library to extract articles                            print              (              "Building site for "              ,              company              )              paper              =              newspaper              .              build              (              value              [              'link'              ],              memoize_articles              =              False              )              newsPaper              =              {              "link"              :              value              [              'link'              ],              "articles"              :              []              }              noneTypeCount              =              0              for              content              in              paper              .              articles              :              if              count              >              LIMIT              :              break              try              :              content              .              download              ()              content              .              parse              ()              except              Exception              as              e              :              print              (              e              )              print              (              "continuing..."              )              continue              # Again, for consistency, if there is no found publish date the article will be skipped.                            # After 10 downloaded articles from the same newspaper without publish date, the company will be skipped.                            if              content              .              
publish_date              is              None              :              print              (              count              ,              " Article has date of type None..."              )              noneTypeCount              =              noneTypeCount              +              1              if              noneTypeCount              >              10              :              print              (              "Too many noneType dates, aborting..."              )              noneTypeCount              =              0              break              count              =              count              +              1              continue              article              =              {}              article              [              'title'              ]              =              content              .              title              article              [              'text'              ]              =              content              .              text              article              [              'link'              ]              =              content              .              url              article              [              'published'              ]              =              content              .              publish_date              .              isoformat              ()              newsPaper              [              'articles'              ].              append              (              article              )              print              (              count              ,              "articles downloaded from"              ,              company              ,              " using newspaper, url: "              ,              content              .              url              )              count              =              count              +              1              noneTypeCount              =              0                      

The else-block is pretty similar to the if-block, the only difference is that the articles are scraped directly from the frontpage of the website.

                          paper              =              newspaper              .              build              (              value              [              'link'              ],              memoize_articles              =              False              )                      

This builds the list of articles found on the frontpage of the website.

                          if              content              .              publish_date              is              None              :              print              (              count              ,              " Article has date of type None..."              )              noneTypeCount              =              noneTypeCount              +              1              if              noneTypeCount              >              10              :              print              (              "Too many noneType dates, aborting..."              )              noneTypeCount              =              0              break              count              =              count              +              1              continue                      

Because the Newspaper library often failed to extract the publishing time of the article, I added a part to check if mulitple articles in a row were missing a publish time then the script would just skip the whole newspaper.

                          data              [              'newspapers'              ][              company              ]              =              newsPaper              try              :              with              open              (              'scraped_articles.json'              ,              'w'              )              as              outfile              :              json              .              dump              (              data              ,              outfile              )              except              Exception              as              e              :              print              (              e              )                      

Finally, the data is stored to each individual company (website) and the data object is saved to file as JSON.

That was a quick overview of how to easily scrape any news site you want. If you have any questions regarding the code, please do not hesitate to leave an issue on the GitHub project page. :)

cookpelf1987.blogspot.com

Source: https://holwech.github.io/blog/Automatic-news-scraper/

0 Response to "Write a Script to Crawl News Feed"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel