A researcher asked me to write a script that extracts the headings, sources, dates and word counts from a Factiva news search into a table, which she would use for further study. Factiva shows a summary of these fields in the web browser, and that results page can be saved as an HTML file. Because the format of the HTML file is predictable, I could write a script to parse the relevant elements and save their text as rows in a delimited text file. I chose Python because it is quick to write, widely available and easy to change later (compared to a compiled language like Java).
The summary from Factiva has the following pattern (simplified here for clarity):
<td> <b>Title</b> <div class="leadFields"><a>Source</a>, Time, Date, Word_count, Language</div> <div>Excerpt</div></td>
From this, it was possible to predict the position of each of Title, Source, Time, Date, Word_count. (Time was not always provided, so the code had to allow for this.)
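The handling of the optional Time field can be sketched on its own, without any HTML. This is a minimal illustration, not the script itself: the function name split_lead_fields is made up for this example, and it assumes the source name contains no commas.

```python
def split_lead_fields(raw):
    """Split 'Source, Time, Date, N words, (Language)' into named fields.

    Time is optional, so its presence is detected by looking for a
    colon (as in "13:33") in the second comma-separated field.
    """
    parts = [p.strip() for p in raw.split(",")]
    has_time = len(parts) > 1 and ":" in parts[1]
    if has_time:
        source, time, rest = parts[0], parts[1], parts[2:]
    else:
        # No time given: everything after the source shifts left by one
        source, time, rest = parts[0], "N/A", parts[1:]

    date = rest[0] if len(rest) > 0 else "N/A"

    # The word count arrives as e.g. "896 words"; keep only the digits
    words = "N/A"
    if len(rest) > 1:
        digits = [s for s in rest[1].split() if s.isdigit()]
        if digits:
            words = digits[0]

    return {"source": source, "time": time, "date": date, "words": words}


with_time = split_lead_fields("Reuters News, 13:33, 27 April 2016, 896 words, (English)")
without_time = split_lead_fields("Reuters News, 27 April 2016, 896 words, (English)")
```

Both calls give the same source, date and word count; only the time differs ("13:33" in the first, "N/A" in the second).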
The output is a tab-separated rather than a comma-separated file, which avoids problems with commas in the article titles. The format is:
Title \t Source \t Time \t Date \t Word_count
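As a small sketch of why tabs are the safer delimiter here, the helper below builds one output row. The name make_row and the sample headline are made up for illustration; collapsing whitespace inside each field is an extra precaution that the original script does not take.

```python
def make_row(fields):
    """Join fields with tabs and end the row with a newline.

    Whitespace inside a field (including stray tabs or line breaks)
    is collapsed to single spaces so each row keeps its column count.
    """
    cleaned = [" ".join(f.split()) for f in fields]
    return "\t".join(cleaned) + "\n"


# The comma in the headline is harmless because the delimiter is a tab
row = make_row(["Oil prices rise, then fall", "Reuters News", "13:33", "27 April 2016", "896"])
```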
The file can be opened in any spreadsheet tool such as Excel. I use the Data > Get External Data > From Text function to control how the file is interpreted.
I used Python 2.7.6, specifically an IPython Notebook via Anaconda Python. (IPython Notebook is now called Jupyter, but I have an old version on my PC which works fine.) This way of using Python is available as an installable package where I work and includes the most used libraries. For parsing the HTML I used the lxml library and some XPath queries. The source code is available at the end of this post and it basically follows this method:
- Open HTML file and parser, open output text file.
- Find all the td elements.
- For each of these:
  - Ignore the first two, as these are in the page header.
  - Find the b element; this is the Title.
  - Find the element with the leadFields class; this has a list of the other fields separated by commas.
  - Write the fields to a new line in the output file, separated by tabs.
- Close the files.
To use the code, open a new IPython/Jupyter notebook, which works in the C:\Work directory by default (on Windows): click Start > All Programs > Anaconda (64-bit) > IPython (Py 2.7) Notebook (or equivalent), wait for it to load (a black screen appears, then a new page opens in your default web browser), then click the New Notebook button.
Save the HTML content as a file called Factiva.htm in the folder C:\Work. Copy the code below into the input field of the notebook labelled In [ ]: and click the Run Cell button with the play triangle icon. The script will execute and produce a summary of the HTML file in a new text file called C:\Work\factiva-parsed.txt.
from lxml.html import parse

parsed = parse('Factiva.htm')
doc = parsed.getroot()
cells = doc.findall('.//td')
len(cells)

fout = open('factiva-parsed.txt', 'w')
fout.write("TITLE\tSOURCE\tTIME\tDATE\tWORD COUNT\n")
fout.flush()

NA = "N/A"
T = "\t"
NL = "\n"

for (i, cell) in enumerate(cells):
    if (i<2): #ignore first two results, they are from the page header
        continue

    #col1: title, from <b>
    titles = cell.findall('.//b')
    if titles:
        #Two unicode chars (left/right quo) cause problems --> replace
        fout.write(titles[0].text_content().strip().replace(u"\u2018", "'").replace(u"\u2019", "'") + u"\t")
    else:
        fout.write(NA+T)

    #col2-5: from div class=leadFields, comma-separated
    fields = cell.find_class('leadFields')
    if fields:
        raw_fields = fields[0].text_content()
        field_split = raw_fields.split(',') # ['Reuters News', ' 13:33', ' 27 April 2016', ' 896 words', ' (English)']

        #col2: source
        fout.write(field_split[0] + T)

        #col3: time - optional
        has_time = ":" in field_split[1]
        date_col = 2
        words_col = 3
        if has_time:
            fout.write(field_split[1].strip() + T)
        else: #skip the time field
            fout.write(NA+T)
            date_col = 1
            words_col = 2

        #col4: date
        fout.write(field_split[date_col].strip() + T)

        #col5: number of words
        fwordss = [int(s) for s in field_split[words_col].split() if s.isdigit()] #get the integers
        if fwordss:
            fout.write(str(fwordss[0]) + NL)
        else:
            fout.write(NA+NL)
    else:
        fout.write(NA+T+NA+T+NA+T+NA+NL)
    fout.flush()

fout.close()
The code may need tweaking if your notebook is configured to use Python 3 instead of Python 2.7. See the Jupyter documentation for more information on using notebooks.
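One tweak that may help, assuming the trouble is with non-ASCII characters in titles: open the output file with an explicit encoding instead of relying on the platform default. The sketch below uses io.open, which exists in both Python 2.7 and Python 3. The sample title and the temporary folder are made up for illustration, and this is not a tested port of the script above.

```python
import io
import os
import tempfile

# Write to a temporary folder so the example does not touch C:\Work
path = os.path.join(tempfile.mkdtemp(), "factiva-parsed.txt")

# Made-up title containing a right single quotation mark (U+2019)
title = u"Bank\u2019s profit rises"

# With an explicit encoding the curly quote can be written as-is,
# so replacing it with a plain apostrophe is no longer required
with io.open(path, "w", encoding="utf-8") as fout:
    fout.write(u"TITLE\tSOURCE\n")
    fout.write(title + u"\t" + u"Reuters News" + u"\n")

# Read the file back to check that the row survived unchanged
with io.open(path, "r", encoding="utf-8") as fin:
    lines = fin.read().splitlines()
```

When importing a file written this way into a spreadsheet, choose UTF-8 as the file encoding if the import dialog offers the option.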