Alliance Project · Programming

Extracting meta data from Factiva to CSV via Python

A researcher asked me to write a script that extracts the headings, sources, dates and word counts from a search of articles on the Factiva news platform into a table, which she would use for further study. Factiva shows a summary of these fields in the web browser, and that page can be saved as an HTML file. Because the format of the HTML file is predictable, I could write a script to parse the relevant elements and text, then save each article as a row in a delimited text file. I chose Python because it is quick to write, widely available and easy to change later (compared to a compiled language like Java).

Source files

[Screenshots: Factiva summary in HTML; Factiva summary HTML source code]

The summary from Factiva has the following pattern (simplified here for clarity):

<td>
<b>Title</b>
<div class="leadFields"><a>Source</a>, Time, Date, Word_count, Language</div>
<div>Excerpt</div></td>

From this, it was possible to predict the position of each of Title, Source, Time, Date, Word_count. (Time was not always provided, so the code had to allow for this.)
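As a rough illustration of how the optional Time field can be handled, here is a minimal sketch. The function name `split_lead_fields` and the "N/A" placeholder are my own choices for this example, and the second input (without a time) is a hypothetical variation of the first. It assumes the source name itself contains no commas.

```python
def split_lead_fields(raw):
    """Split 'Source, Time, Date, Word_count, Language' into a dict.

    Time is optional; it is detected by the presence of a colon.
    """
    parts = [p.strip() for p in raw.split(',')]
    source = parts[0]
    rest = parts[1:]
    # The time field, when present, looks like '13:33'.
    if rest and ':' in rest[0]:
        time, rest = rest[0], rest[1:]
    else:
        time = 'N/A'
    date = rest[0] if len(rest) > 0 else 'N/A'
    words = rest[1] if len(rest) > 1 else 'N/A'
    return {'source': source, 'time': time, 'date': date, 'words': words}


with_time = split_lead_fields(
    'Reuters News, 13:33, 27 April 2016, 896 words, (English)')
without_time = split_lead_fields(
    'Reuters News, 27 April 2016, 896 words, (English)')
print(with_time)
print(without_time)
```

Both calls should give the same date and word count; only the time differs.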

Results

[Screenshot: Tab-delimited data]

The output is actually a tab-separated rather than comma-separated file, which avoids problems with commas in the article titles. The format is:

Title \t Source \t Time \t Date \t Word_count

The file can be opened in any spreadsheet tool such as Excel. I use the Data > Get External Data > From Text function to control how the file is interpreted.
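To show why tabs sidestep the comma problem, here is a small sketch using Python's standard csv module with a tab delimiter. This is not what the script below does (it writes the tabs by hand); the data row is hypothetical, and the snippet is written for Python 3.

```python
import csv
import io

rows = [
    ['TITLE', 'SOURCE', 'TIME', 'DATE', 'WORD COUNT'],
    # Hypothetical row; the title contains a comma on purpose.
    ['Markets rise, then fall', 'Reuters News', '13:33', '27 April 2016', '896'],
]

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerows(rows)

# Read the text back with the same delimiter.
buffer.seek(0)
parsed = list(csv.reader(buffer, delimiter='\t'))
print(parsed[1][0])
```

The comma in the title survives the round trip because only tabs are treated as separators.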

Technique

[Screenshot: IPython notebook, Factiva Python]

I used Python 2.7.6, specifically an IPython Notebook via Anaconda Python. (The IPython Notebook is now called Jupyter, but I have an old version on my PC which works fine.) This way of using Python is available as an installable package where I work and includes the most commonly used libraries. For parsing the HTML I used the lxml library and some XPath queries. The source code is given further down this post and follows this method:

  1. Open HTML file and parser, open output text file.
  2. Find all the td elements.
  3. For each of these:
    • Ignore first two, as these are in the page header.
    • Find the b element; this is the Title.
    • Find the leadFields class element; this has a list of the other fields separated by commas.
    • Write the fields to a new line in the output file, separated by tabs.
  4. Close the files.
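
The steps above can be sketched on a tiny in-memory sample. To keep the example self-contained I have swapped in the standard library's xml.etree.ElementTree for lxml, which only works here because the sample is well-formed; a real saved Factiva page would need a forgiving HTML parser such as lxml.html. The sample markup is a hypothetical stand-in built from the simplified pattern, and the sketch keeps every lead field rather than selecting four columns.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a saved Factiva page.
SAMPLE = """
<table>
  <tr><td>header cell</td><td>header cell</td></tr>
  <tr><td>
    <b>Example title</b>
    <div class="leadFields"><a>Reuters News</a>, 13:33, 27 April 2016, 896 words, (English)</div>
    <div>Excerpt</div>
  </td></tr>
</table>
"""

root = ET.fromstring(SAMPLE.strip())
lines = []
for i, cell in enumerate(root.findall('.//td')):
    if i < 2:
        continue  # the first two cells belong to the page header
    title = cell.find('.//b')
    lead = cell.find(".//div[@class='leadFields']")
    title_text = title.text.strip() if title is not None and title.text else 'N/A'
    # itertext() joins the <a> text with the comma-separated tail after it.
    lead_text = ''.join(lead.itertext()) if lead is not None else ''
    fields = [f.strip() for f in lead_text.split(',')] if lead_text else []
    lines.append('\t'.join([title_text] + fields))

for line in lines:
    print(line)
```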

Instructions

To use the code, open a new IPython/Jupyter notebook, which works in the C:\Work directory by default (on Windows). Click Start > All Programs > Anaconda (64-bit) > IPython (Py 2.7) Notebook (or equivalent) and wait for it to load (a black screen appears, then a new page opens in your default web browser), then click the New Notebook button.

Save the HTML content as a file named Factiva.htm in the folder C:\Work. Copy the code below into the input field of the notebook labelled In [ ]: and click the Run Cell button with the play triangle icon. The script will execute and write a summary of the HTML file to a new text file called C:\Work\factiva-parsed.txt.


from lxml.html import parse
parsed = parse('Factiva.htm')
doc = parsed.getroot()
cells = doc.findall('.//td')
len(cells)
fout = open('factiva-parsed.txt', 'w')
fout.write("TITLE\tSOURCE\tTIME\tDATE\tWORD COUNT\n")
fout.flush()
NA = "N/A"
T = "\t"
NL = "\n"
for (i, cell) in enumerate(cells):
  if (i<2):
    #ignore first two results, they are from the page header
    continue
  #col1: title, from <b>
  titles = cell.findall('.//b')
  if titles:
    #Two unicode chars (left/right single quotes) cause problems --> replace with plain apostrophes
    fout.write(titles[0].text_content().strip().replace(u"\u2018", "'").replace(u"\u2019", "'") + u"\t")
  else:
    fout.write(NA+T)
  #col2-5: from div class=leadFields, comma-separated
  fields = cell.find_class('leadFields')
  if fields:
    raw_fields = fields[0].text_content()
    field_split = raw_fields.split(',')
    # ['Reuters News', ' 13:33', ' 27 April 2016', ' 896 words', ' (English)']
    #col2: source
    fout.write(field_split[0].strip() + T)
    #col3: time - optional
    has_time = ":" in field_split[1]
    date_col = 2
    words_col = 3
    if has_time:
      fout.write(field_split[1].strip() + T)
    else:
      #skip the time field
      fout.write(NA+T)
      date_col = 1
      words_col = 2
    #col4: date
    fout.write(field_split[date_col].strip() + T)
    #col5: number of words
    word_counts = [int(s) for s in field_split[words_col].split() if s.isdigit()] #get the integers
    if word_counts:
      fout.write(str(word_counts[0]) + NL)
    else:
      fout.write(NA+NL)
  else:
    fout.write(NA+T+NA+T+NA+T+NA+NL)
  fout.flush()
fout.close()

More help

The code may need tweaking if your notebook is configured to use Python 3 instead of Python 2.7. See the Jupyter documentation for more information on using notebooks.
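
One possible tweak, offered as a sketch rather than a tested fix for the script above: open the output file with an explicit UTF-8 encoding via io.open, which behaves the same in Python 2.7 and Python 3 as long as everything written is unicode text. With that in place the curly-quote replacement may no longer be needed. The title below is a hypothetical example, and the temporary directory is only there to keep the snippet self-contained.

```python
import io
import os
import tempfile

# Hypothetical title containing the curly quotes that caused trouble.
title = u"\u2018Quoted\u2019 headline"

path = os.path.join(tempfile.mkdtemp(), 'factiva-parsed.txt')

# io.open accepts an encoding argument in both Python 2.7 and Python 3.
with io.open(path, 'w', encoding='utf-8') as fout:
    fout.write(u"TITLE\tSOURCE\n")
    fout.write(title + u"\t" + u"Reuters News" + u"\n")

# Read the file back to confirm the quotes survived unchanged.
with io.open(path, 'r', encoding='utf-8') as fin:
    lines = fin.read().splitlines()
```

In Python 3 the built-in open takes the same encoding argument, so io.open is only needed if the notebook might still run under 2.7.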

2 thoughts on “Extracting meta data from Factiva to CSV via Python”

  1. pandas.read_html is your friend:

    import pandas as pd
    tables = pd.read_html(f, index_col=0)
    data = pd.concat([art for art in tables
                      if 'HD' in art.index.values], axis=1).T.set_index('AN')
