NLP and Information Retrieval with Julia

05 AUGUST 2019

Purpose

We are going to talk about a topic often talked about in the context of Python. With many well supported packages and a vibrant community of developers around them, why wouldn’t we use Python? Well, the simplicity and user friendly syntax of Julia along with the ability to run as fast as C at compile time, would seme like a good reason for one. So how good is Julia when the task is understanding the naturual language and it’s subtleties? We are going to take a look below. Code for the following can be found here: https://github.com/GdMacmillan/Julia-NLP-and-Information-Retrieval

Setup

The data source is usually a document database such as MongoDB. I have started a client with a local mongo service and loaded the data from a json document that is supposed to mimic the real-life document one might recieve from a web scraper. I will not go into how to do this with this tutorial so lets assume the documents are already loaded in a local Mongo service.

Loading Data from Mongo

For my pipeline, I want to extract the text content from my database to flat files (txt) located in a directory called data. I also will write a section_names.csv file with the ids and names of the sections. To do this I will call a python script, aptly called load_nyt_data.py. I did this in python as I already had pymongo installed and the main goal is to perform common nlp tasks with julia, not implement python code. Number of files written will be 0 if files already exist locally.

run(`python src/load_nyt_data.py`)

Number of files written: 0

Process(`python src/load_nyt_data.py`, ProcessExited(1)) [1]

Text Processing Pipeline

The goal is to build a basic text processing pipeline involving tokenization, stripping stopwords and stemming. Ultimately what we want is a sparse representation of the data where 1 row of data is a document and each column is a unique term, such as a unigram, bigram or trigram. The values herein will be generated from a vectorization method which assigns each document term a value which is proportional to its frequency in the document, but inversely proportional to the number of documents in which it occurs.

Load data using TextAnalysis and Glob

The main package we will be using for this pipeline is Text Analysis. Text analysis is supported by the JuliaText organization and parallels several supported packages for working with data in the form of text.

using TextAnalysis
using Glob

Create an array of filenames

fnames = glob("data/*.txt");

Read an example file

# example file
readlines(fnames[1])

42-element Array{String,1}:
 ""

 ""

 ""

 ""

 ""

 ""

 ""

 ""

 ""

 ""

 ⋮

 "Reservations Accepted.        "

 "Wheelchair access Entrance is up a short flight of stairs from the sidewa
lk. Restrooms have handrails.        "
 ""

 ""

 ""

 " "

 ""

 ""

 ""

We can also map a file document type to the filenames in fnames. This produces an array of FileDocuments.

fds = map(FileDocument, fnames);

We can read a file using the text function.

text(fds[1])

"\n\n\n\n\n\n\n\n\n\n\nHey, the man on the phone said. Are you still coming
 tonight?        \n \n\nIt took a moment for me to realize that he was call
ing from Distilled to confirm my dinner reservation.        \nYes, I replie
d. Cool, he said, and sounded as if he meant it.        \nDistilled opened
in June on the corner of Franklin Street and West Broadway in TriBeCa, the
former home of Drew Nieporents Layla and Centrico. The belly dancers and th
e frozen-margarita machine are gone, but a certain effervescence remains. S
o does Mr. Nieporent, hovering in the background as guru to Distilleds owne
rs, the first-time restaurateur Nick Iovacchini and Shane Lyons, the 25-yea
r-old chef.        \nThe space is blandly handsome, with dark woods and cha
rcoal banquettes, breathlessly high ceilings and quasi-medieval wheel chand
eliers like crowns of fire. One side is devoted to the bar, where the drink
s, by Benjamin Wood, are lady-killers, elegant with a knife twist. Occasion
ally 1980s mope rock shimmers from the speakers.        \nService is confou
ndingly friendly, almost coddling. When I stood outside reading the posted
menu, someone came hurrying down the steps to hand me my own copy, so I wou
ldnt crane my neck, he said. On arrival and departure, a host leapt to open
 the door.        \nThe mission statement that preceded one meal (We are a
modern American public house, the waiter intoned) was both unnecessary and
slightly coy about Mr. Lyonss ambitions. Yes, wings are on the menu, but th
ey are jacked up with gochujang, Korean fermented soybean and chile paste
borrowed, perhaps, from the larder at Momofuku Noodle Bar, where Mr. Lyons
worked for a year.        \nThere are occasional technical flourishes, like
 watermelon cubes Cryovaced to intensify their flavor, and mushrooms surrou
nded by puffs of buttery onion soubise, aerated by an iSi siphon. A tousle
of dehydrated and shaved bacon adorns an open-face sandwich of heirloom tom
atoes and basil on sourdough: a B.L.T., of course. Sunflower sprouts make u
p for the missing crunch.        \nOther classics are updated rather than u
pended: popcorn dusted with garlic, cumin and brewers yeast, which evokes c
heese; snappy pickles fermented with gochugaru, Korean red chile powder; po
rk ribs glazed with more gochujang, teasing the sweet-salty border without
straying too far in either direction.        \nOnion rings, battered with Y
uengling beer and tapioca flour, are fried, then frozen (a theatrical waite
r boasted that they had been brought to negative 60 degrees) and fried agai
n. They arrive nicely sturdy, the sole purpose of their existence to ferry
the narcotic-like condiments, burned scallions cut with jalapeos and mayonn
aise with the sting of preserved lime.        \nThe substantial burger (whi
ch the menu modestly refrains from telling you is made with grass-fed, orga
nic-grain-finished beef) is abetted by what may be the finest version of Ta
ter Tots in town, the grated potatoes cooked until just underdone and then
crisped with Wondra flour.        \nEverything here goes to 11, one diner m
arveled. Duck breaded and fried like chicken is brazen and irresistible, de
spite the oversweet accompanying waffle, a spongy slab of brioche dredged i
n custard, like French toast. Even broccoli turns wanton, flung with sliver
s of duck bacon and pickled watermelon rind in a fish sauce vinaigrette, wi
th (wait for it) a dollop of duck fat.        \nBut liver pt served with pl
umes of baked, dehydrated chicken skin? Now this is confrontational. My mis
sion in life is to make skinny girls fat, Mr. Lyons told my table. He got a
 laugh, but the cracklings were abandoned after one bite.        \nThe only
 logical end to such a meal is smores  deconstructed, as is the fashion, wi
th a hickory-smoked graham-flour cake, a torched smear of marshmallow fluff
, dark chocolate pudding and graham crackers broken on top. It is obvious a
nd no less pleasurable for it. (Kari Rak, previously at Bouchon Bakery, wil
l introduce a new dessert menu this month.)        \nOr take a shot of moon
shine, with an apricot shrub as a chaser. It edges everything in halos and
can make you believe that a modern American public house, whatever that is,
 is where you want to be.        \nDistilled \n211 West Broadway (Franklin
Street); (212) 601-9514; distilledny.com.        \nRecommended B.L.T.; chil
led charred broccoli; porgy; burger; wings; country fried duck and waffles;
 pork ribs; smores.        \nPrices \$5 to \$29.        \nOpen Nightly for
dinner, Saturday and Sunday for brunch.        \nReservations Accepted.
    \nWheelchair access Entrance is up a short flight of stairs from the si
dewalk. Restrooms have handrails.        \n\n\n\n \n\n\n\n"

Another way to do this would be to use the core Julia functions to load text. We can push strings into an iterable data structure.

slist = String[]
for fname in fnames
    s = open(fname) do file
        # read the contents of a file all at once
        read(file, String)
    end
    push!(slist, s)
end

metadata for our document can be accesed as a property of the FileDocument instance

a = fds[1];
a.metadata

TextAnalysis.DocumentMetadata(Languages.English(), "data/5233240838f0d8062f
ddf624.txt", "Unknown Author", "Unknown Time")

Tokenization and Stop Words

Next we will be removing stop words and tokenizing the document. Tokens are individual words split on whitespace. Stop words are high frequency words that we want to filter out. These words often have low lexical meaning and they don't help distinguish one text from another. Below I've created my_prepare which takes care of preparation tasks such as stripping punctuation, articles, pronouns, numbers and non-letters. This also removes stopwords. and stems the document which removes morphological affixes to the words leaving only the stem.

using WordTokenizers
using Languages

set_tokenizer(WordTokenizers.nltk_word_tokenize)

STOPWORDS = stopwords(Languages.English());

"""
my_prepare(text)

Returns prepared text string
"""
function my_prepare(text)
    sd = StringDocument(text)
    prepare!(sd, strip_punctuation
        | strip_articles
        | strip_pronouns
        | strip_numbers
        | strip_non_letters)
    remove_words!(sd, STOPWORDS)
    stem!(sd)
    remove_case!(sd)
    return sd.text
end

my_prepare

my_prepare(text(fds[1]))

"hey phone are come tonight it moment realiz call distil confirm dinner res
erv yes repli cool sound meant distil june corner franklin street west broa
dway tribeca former home drew niepor layla centrico the belli dancer frozen
 margarita machin gone effervesc remain so mr niepor hover background guru
distil owner time restaurateur nick iovacchini shane lyon chef the space bl
and handsom dark wood charcoal banquett breathless ceil quasi mediev wheel
chandeli crown fire one devot bar drink benjamin wood ladi killer eleg knif
e twist occasion mope rock shimmer speaker servic confound friend coddl whe
n stood outsid read post menu hurri step hand copi wouldnt crane neck on ar
riv departur host leapt door the mission statement preced meal we modern am
erican public hous waiter inton unnecessari slight coy mr lyonss ambit yes
wing menu jack gochujang korean ferment soybean chile past borrow larder mo
mofuku noodl bar mr lyon there occasion technic flourish watermelon cube cr
yovac intensifi flavor mushroom surround puff butteri onion soubis aerat is
i siphon a tousl dehydr shave bacon adorn sandwich heirloom tomato basil so
urdough b l t cours sunflow sprout miss crunch other classic updat upend po
pcorn dust garlic cumin brewer yeast evok chees snappi pickl ferment gochug
aru korean red chile powder pork rib glaze gochujang teas sweet salti borde
r stray direct onion ring batter yuengl beer tapioca flour fri frozen theat
ric waiter boast brought negat degre fri they arriv nice sturdi sole purpos
 exist ferri narcot condiment burn scallion cut jalapeo mayonnais sting pre
serv lime the substanti burger menu modest refrain tell grass fed organ gra
in finish beef abet finest version tater tot town grate potato cook underdo
n crisp wondra flour everyth goe diner marvel duck bread fri chicken brazen
 irresist despit oversweet accompani waffl spongi slab brioch dredg custard
 french toast even broccoli wanton flung sliver duck bacon pickl watermelon
 rind fish sauc vinaigrett wait dollop duck fat but liver pt serv plume bak
e dehydr chicken skin now confront my mission life skinni girl fat mr lyon
told tabl he laugh crackl abandon bite the logic meal smore deconstruct fas
hion hickori smoke graham flour cake torch smear marshmallow fluff dark cho
col pud graham cracker broken top it obvious pleasur kari rak previous bouc
hon bakeri introduc dessert menu month or shot moonshin apricot shrub chase
r it edg halo believ modern american public hous whatev distil west broadwa
y franklin street distilledni com recommend b l t chill char broccoli porgi
 burger wing countri fri duck waffl pork rib smore price open night dinner
saturday sunday brunch reserv accept wheelchair access entranc short flight
 stair sidewalk restroom handrail"

Bag of Words and TFIDF

Using the TextAnalysis package we will create a DirectoryCorpus to use when constructing counts over the whole corpus. A text corpus is a large body of text.

crps = DirectoryCorpus("data");
pop!(crps) # remove last item because this is our section_names.csv document

TextAnalysis.FileDocument("/Users/gmacmillan/projects/Case_study/julia_nlp_
case_study/data/section_names.csv", TextAnalysis.DocumentMetadata(Languages
.English(), "/Users/gmacmillan/projects/Case_study/julia_nlp_case_study/dat
a/section_names.csv", "Unknown Author", "Unknown Time"))

We can use the standardize inplace function to make sure all the documents in our corpus are standardized to the StringDocument type.

standardize!(crps, StringDocument)

I use some of the preparation steps from my_preparation function above but applied to the entire corpus. These work in-place.

remove_case!(crps)
prepare!(crps, strip_punctuation
    | strip_articles
    | strip_pronouns
    | strip_numbers
    | strip_non_letters)
remove_words!(crps, STOPWORDS)
stem!(crps)

A lexicon is what is going to keep track of the words and the counts associated with each word. The is in the form of a dictionary. Update the lexicon with the convenience function provided by TextAnalysis.

update_lexicon!(crps)

lexicon(crps)

Dict{String,Int64} with 22718 entries:
  "nuhu"     => 1
  "ironwe"   => 1
  "wintri"   => 2
  "flatb"    => 1
  "economix" => 2
  "curv"     => 21
  "skylight" => 4
  "unoffici" => 5
  "touchpad" => 1
  "bidder"   => 4
  "whiz"     => 3
  "beckett"  => 5
  "brandt"   => 5
  "apiec"    => 4
  "il"       => 3
  "msnbc"    => 3
  "archiv"   => 25
  "overdos"  => 2
  "ankl"     => 26
  ⋮          => ⋮

If we wish to have a reverse lookup for each word with row index for the document/s in which it appears, we need to create an inverse index. Fortunately, TextAnalysis makes this easy for us.

update_inverse_index!(crps)

inverse_index(crps)

Dict{String,Array{Int64,1}} with 22718 entries:
  "nuhu"     => [603]
  "ironwe"   => [630]
  "wintri"   => [159, 583]
  "flatb"    => [744]
  "economix" => [611, 809]
  "curv"     => [23, 47, 51, 272, 284, 422, 493, 522, 559, 575, 584, 599, 6
13, …
  "skylight" => [360, 376, 530, 634]
  "unoffici" => [24, 123, 281, 719, 999]
  "touchpad" => [757]
  "bidder"   => [104, 235, 881]
  "whiz"     => [51, 321, 857]
  "beckett"  => [397, 601, 701, 750, 992]
  "brandt"   => [91, 493, 706, 800, 916]
  "apiec"    => [328, 773, 816, 869]
  "il"       => [234, 833, 939]
  "msnbc"    => [89, 107, 746]
  "archiv"   => [203, 243, 245, 368, 549, 560, 599, 676, 716, 826, 833, 878
, 88…
  "overdos"  => [639, 758]
  "ankl"     => [147, 158, 168, 241, 266, 375, 393, 408, 447, 462, 572, 662
, 72…
  ⋮          => ⋮

m = DocumentTermMatrix(crps);

The DocumentTermMatrix is a struct with properties containing the components necessary to create a term frequency inverse document frequency (tfidf) matrix for the corpus. This will be applied in later procedures involving information retrieval or sentiment analysis.

The document term matrix is stored in a data structure called SparseMatrixCSC. Sparse matrices are distinct from dense matrices in that the only values stores are non-zero values. In julia, zero values can be stored but only manually. Sparse matrices are common in machine learning, such as in data that contains counts and data encodings that map values to n dimensional arrays, because these computationally efficient data structures can elicit performance gains when used by algorithms meant to take advantage of sparsity.

The inverse index also provides us with a count of the number of documents each word appears in. This is known as document frequencies.

To obtain the tfidf matrix we will simply use the function below applied to the document term matrix.

tfidf = tf_idf(m);

Steps for computing tfidf

What if we want to do all of the above manually?

Create the bag of words (bow), a set of words unique over the corpus. A set is a good datatype for this since it doesn't allow duplicates. At the end you'll want to convert it to a list so that we can deal with our words in a consistent order.

cleaned_docs = []
bow = Set{String}();
for doc in fds
    cleaned = tokenize(my_prepare((text(doc))))
    union!(bow, Set(cleaned))
    push!(cleaned_docs, cleaned)
end
filter!(!isempty, cleaned_docs);

Create a reverse lookup for the vocab list. This is a dictionary whose keys are the words and values are the indices of the words (the word id). This will make things much faster than using the list index function.

indexer = Dict{String,Int64}()
for (i, word) in enumerate(bow)
    indexer[word] = i
end

Create a word count matrix. This is an array data type where each row corresponds to a document and each column a word. The value should be the count of the number of times that word appeared in that document.

num_docs = length(fds)
num_words = length(indexer);

counts = zeros((num_docs, num_words));

for (idx, doc) in enumerate(cleaned_docs)
    C = Dict{String,Int64}()
    for word in doc
        C[word] = get(C, word, 0) + 1
    end
    for (word, count) in C
        counts[idx, indexer[word]] = count
    end
end

Create the document frequencies. For each word, get a count of the number of documents the word appears in. This is different from the total number of times the word appears.

df = sum(counts, dims=1);

Normalize the word count matrix to get the term frequencies. This means dividing each term frequency by the l2 (euclidean) norm. This makes each document vector has a length of 1.

# document_frequencies
tf_norm = sqrt.(sum(counts .^ 2, dims=2));
tf_norm[tf_norm .== 0] .= 1;
tfs = counts ./ tf_norm;

Multiply the term frequency matrix by the log of the inverse of the document frequencies to get the tf-idf matrix. We add one to the denominator to avoid dividing by 0.

idf = log.((num_docs + 1) ./ (1 .+ df)) .+ 1;
tfidf_m = tfs .* idf;

Normalize the tf-idf matrix as well by dividing by the l2 norm.

tfidf_norm = sqrt.(sum(tfidf_m .^ 2, dims=2));
tfidf_norm[tfidf_norm .== 0] .= 1;
tfidf_m ./= tfidf_norm;

Cosine Similarity Using TFIDF

The tfidf approach is common for understanding relative term importances in a document. It is a classic way to encode documents so that arbitrary queries, different forms of clustering and document classification tasks can now be executed. We have a vector space model as the representation of a set of documents. Queries can now be viewed in the same space as the document collection in order to allow for information retrieval.

One way to do scoring of the distance between vectors is by similarity score. Cosine similarity quantifies how much two multidimensional vectors point in the same direction on a scale from 1 (pointing in the same direction) to -1 (pointing in opposite directions). A cosine similarity of 0 means that the vectors are orthogonal. I use the Distances.jl library to calculate this.

using Distances

"""
readlines_remove_leading_whitespace(doc)

Returns an array of strings without leading whitespace lines
"""
function my_readlines(doc_title)
    lines = readlines(doc_title)
    i = 1
    while true
        if length(strip(lines[i])) !== 0
            break
        end
        i += 1
    end
    return lines[i:lastindex(lines)]
end


ix_A = 1; doc_A = tfidf[ix_A, :]
ix_B = 589 ; doc_B = tfidf[ix_B, :]
println("\ndocument A: ")

document A:

println.(my_readlines(crps[ix_A].metadata.title)[1:5]) # print just 5 lines

Hey, the man on the phone said. Are you still coming tonight?


It took a moment for me to realize that he was calling from Distilled to co
nfirm my dinner reservation.
Yes, I replied. Cool, he said, and sounded as if he meant it.

println("\ndocument B: ")

document B:

println.(my_readlines(crps[ix_B].metadata.title)[1:5]) # print just 5 lines

The last time we tasted mencas from Bierzo, Spain, my recipe included rosem
ary, the wines dominant herbal note. Consistency may be someone elses hobgo
blin, but its nice to see it in wines; there was that rosemary again.

Pairings often welcome contrasts, but this time I picked up the flavors and
 aromas of the wine and tossed them into the pan. Green peppers, a hit of c
hile, olives and a beefy smoke. All have been incorporated in this dish.

println("\nCosine Similarity of documents A & B: ")

Cosine Similarity of documents A & B:

1 - evaluate(CosineDist(), doc_A, doc_B)

0.028020811811042434

If we want to compute the pairwise distance between all documents simultaneously, we can do that using the pairwise function from Distances.jl.

r = zeros((num_docs, num_docs));
@time pairwise!(r, CosineDist(), transpose(tfidf), dims=2);

27.148998 seconds (2.52 M allocations: 110.984 MiB, 0.49% gc time)

println("Similarity measure pairwise: ")

Similarity measure pairwise:

1 .- r

999×999 Array{Float64,2}:
 1.0         0.0207873   0.0153      …  0.00754566  0.0156072   0.00758319
 0.0207873   1.0         0.0136661      0.0189835   0.0316948   0.0200642
 0.0153      0.0136661   1.0            0.018636    0.0193947   0.0231317
 0.047558    0.050975    0.0118592      0.0125809   0.0156882   0.0181751
 0.0895226   0.0314027   0.0172356      0.00877987  0.016188    0.00906719
 0.0627999   0.00912273  0.00434174  …  0.0244229   0.00867607  0.0144011
 0.0113251   0.00990173  0.0127414      0.0132327   0.037472    0.0323638
 0.0131596   0.0066572   0.0140031      0.0192138   0.0185132   0.0222539
 0.00620686  0.0139168   0.00896439     0.35506     0.0228831   0.0416952
 0.00730156  0.0116336   0.0103688      0.0137953   0.0483865   0.0080328
 ⋮                                   ⋱
 0.0110392   0.00678634  0.00443852  …  0.0317835   0.0223228   0.0220557
 0.0185624   0.0143821   0.014936       0.0112221   0.0321573   0.0177684
 0.00954421  0.00613039  0.0112613      0.0266921   0.00133801  0.0369548
 0.0151518   0.00557817  0.0166722      0.0127879   0.00899566  0.0288063
 0.00818376  0.00588771  0.0102004      0.0153659   0.0416719   0.0120247
 0.00384303  0.00421517  0.00990962  …  0.0150054   0.00692421  0.0363021
 0.00754566  0.0189835   0.018636       1.0         0.0309683   0.0790843
 0.0156072   0.0316948   0.0193947      0.0309683   1.0         0.0392533
 0.00758319  0.0200642   0.0231317      0.0790843   0.0392533   1.0

Feature Importances: Applications of TFIDF

Given a corpus of documents, related to each other or not, we can use the tfidf score to rank words in order of importance to the text. We can also just use the word frequencies by passing in the term frequency sparse matrix instead.

"""
Returns top n tfidf values in row and return them with their corresponding feature names
"""
function get_top_tfidf_feats(row, features, top_n::Int64=10)
    if length(row) > top_n
        top_idxs = sortperm(row, rev=true)[1:top_n]
    else
        top_idxs = sortperm(row, rev=true)
    end
    return [(features[i], row[i]) for i in top_idxs]
end

get_top_tfidf_feats

"""
Returns top tfidf features in specific document (matrix row)
"""
function get_top_feats_in_doc(X_sparse, features, row_id::Int64, top_n::Int64=10)
    get_top_tfidf_feats(X_sparse[row_id, :], features, top_n)
end

get_top_feats_in_doc

feature_array = collect(keys(crps.lexicon));

# top 10 words for document 23
get_top_feats_in_doc(tfidf, feature_array, 23)

10-element Array{Tuple{String,Float64},1}:
 ("carr", 0.0837980511755615)
 ("wayward", 0.067933919942415)
 ("paperback", 0.055194433110973204)
 ("laboratorio", 0.05197697721460046)
 ("rein", 0.051330724026278404)
 ("strength", 0.04892547199630457)
 ("longitudin", 0.04666726201789564)
 ("make", 0.04400478467419274)
 ("wrink", 0.04198383512222033)
 ("pari", 0.03963874160568814)

using StatsBase # import StatsBase to use the mean function

"""
Returns the top n features that on average are most important amongst documents in rows
indentified by indices in grp_ids
"""
function get_top_mean_feats(X_sparse, features, grp_ids=nothing, top_n::Int64=10)
    if isnothing(grp_ids)
        D = X_sparse
    else
        D = X_sparse[grp_ids, :]
    end
    tfidf_means = vec(mean(D, dims=1))
    return get_top_tfidf_feats(tfidf_means, features, top_n)
end

get_top_mean_feats

get_top_mean_feats(tfidf, feature_array)

10-element Array{Tuple{String,Float64},1}:
 ("unfulfil", 0.004298837611037763)
 ("earthflight", 0.003694854911730176)
 ("paperback", 0.0036727650633894353)
 ("rebound", 0.0035198602372548795)
 ("lama", 0.003464234955213334)
 ("uneas", 0.003222486091908161)
 ("scholar", 0.0031950129131286284)
 ("schonberg", 0.003039835421215477)
 ("zz", 0.003027595424161008)
 ("dire", 0.0029909884549977864)

Using JuliaDB with Labeled Data

So far we’ve only considered words as individual units, and considered their relationships to the larger corpus of documents. Many interesting features can be found by performing text analyses based on the relationships between words and examining which words tend to follow others immediately, or that tend to co-occur within the same documents.

Lets say we want to confirm our hypothesis that documents from different New York Times news sections have different combinations of words. I use JuliaDB to load data from section_names.csv and combine this data with our texts. I then create different corpora used for finding top bigram features.

using JuliaDB

col_names = ["id", "section"]
sections_table = loadtable("data/section_names.csv"; header_exists=false, colnames=col_names);
# reindex has inplace function
sections_table = reindex(sections_table, :id);

# create a mask for just the text in the book section
msk_books = select(sections_table, :section => x -> x == "Books" );

bi_grams = NGramDocument.(text.(crps), 2);

bi_gram_crps = Corpus(bi_grams)
update_lexicon!(bi_gram_crps) # update our bigram lexicon

tfidf2 = tf_idf(DocumentTermMatrix(bi_gram_crps));
feature_array2 = collect(keys(bi_gram_crps.lexicon));

# "10 most important words by mean tfidf score in Books sections: "
get_top_mean_feats(tfidf2, feature_array2, findall(!!, msk_books), 20)

20-element Array{Tuple{String,Float64},1}:
 ("approv keyston", 0.012769003009089008)
 ("dan mccabe", 0.009084626784167946)
 ("nw washington", 0.008826231299062898)
 ("lemon garnish", 0.006859985948229485)
 ("colleg game", 0.006457915014267387)
 ("book regain", 0.005745935812141618)
 ("list rentbro", 0.005745935812141618)
 ("ban agenc", 0.005617242437875209)
 ("honey", 0.004994429413701268)
 ("historian name", 0.004977841281908868)
 ("art park", 0.004830438541265002)
 ("greav", 0.004380921925796904)
 ("miss proposit", 0.004340074866415672)
 ("justin bieber", 0.004302304441744708)
 ("jasper tex", 0.004211579209546832)
 ("attend colleg", 0.004065387616803975)
 ("prison serv", 0.00390747355666676)
 ("wipe entir", 0.003862566285592531)
 ("design elicit", 0.0037377902666676363)
 ("wife peopl", 0.003494950648622221)

msk_sports = select(sections_table, :section => x -> x == "Sports" );

# "10 most important words by mean tfidf score in Sports sections: "
get_top_mean_feats(tfidf2, feature_array2, findall(!!, msk_sports), 20) # !! is a double negator

20-element Array{Tuple{String,Float64},1}:
 ("economist depart", 0.011148866380634209)
 ("tablespoon chinkiang", 0.008779464262457137)
 ("concert half", 0.008471186667189075)
 ("lead downtown", 0.007550361586371777)
 ("suppos braini", 0.006811770657058267)
 ("hop guest", 0.006192600712560494)
 ("judg ive", 0.005836589364014666)
 ("afghan relat", 0.005685508385988465)
 ("bootsi collin", 0.005682484391653045)
 ("scene depriv", 0.00536856407437897)
 ("five card", 0.005312194038379561)
 ("plan improv", 0.004952347354472934)
 ("accolad", 0.00478566813792554)
 ("wine vegan", 0.004778521884857718)
 ("receiv call", 0.004675518934544374)
 ("watch liriano", 0.004577184649360663)
 ("support republ", 0.0045527193515291125)
 ("mystic invent", 0.004533042470476335)
 ("seddiqi disput", 0.004482410268596228)
 ("bronx identifi", 0.0041501025347195665)

Ranking Document Relevance using Cosine Similarity

queries_text = readlines("queries.txt");

q_crps = Corpus(StringDocument.(my_prepare.(queries_text)));
q_tfidf = tf_idf(dtm(DocumentTermMatrix(q_crps, crps.lexicon)));

function print_result(crps, row_id::Int64, num_lines::Int64=5)
    println("\n")
    println.(my_readlines(crps[row_id].metadata.title)[1:num_lines])
    println("...\n")
    return
end

print_result (generic function with 2 methods)

function get_similar_documents(x, top_n::Int64=3)
    arr = []
    for i in 1:tfidf.m
        relevance = 1 - evaluate(CosineDist(), tfidf[i,:], x)
        push!(arr, relevance)
    end

    filter!(!isnan, arr);
    l = lastindex(sortperm(arr))
    print("\nsearch results:\n ")
    [print_result(crps, sortperm(arr)[i]) for i in l - top_n:l];
end

get_similar_documents (generic function with 2 methods)

function search(query_idx, top_n::Int64=3)
    print("\nquery\n: ")
    print(queries_text[query_idx])
    get_similar_documents(q_tfidf[query_idx, :], top_n)
    return
end

search (generic function with 2 methods)

search(7)

query
: Watching the Game of Thrones contingent take their final (and somewhat co
mplicated) bow this weekend at Comic-Con International I found myself wonde
ring: Will we ever see the likes of the HBO hit series again? Fans will con
tinue to debate the creative choices made in the finale season but the over
all accomplishment of David Benioff & D.B. Weiss and their writing staff wa
s truly staggering. They not only delivered the most stirring fantasy epic
since Peter Jackson’s The Lord of The Rings they did it on a weekly basis a
nd with nuances that made the sword-and-sorcery show a must-see favorite ev
en among non-fantasy fans.
search results:


AMC wants more Dead and who could be surprised?
The Walking Dead, the top show in television among the advertiser-preferred
 group of viewers between the ages of 18 and 49 is going to get what AMC is
 calling a companion series, with an expected airdate sometime in 2015.

The network announced on Monday that Robert Kirkman, who wrote the comic bo
ok series that inspired the television show, will develop the new version,
which is still untitled. It will, fans will be happy to note, feature zombi
es.
In a statement, Mr. Kirkman said, The opportunity to make a show that isnt
tethered by the events of the comic book, and is truly a blank page, has se
t my creativity racing.
Other members of the Walking Dead creative team, including the producers Ga
le Anne Hurd and David Alpert, will be involved in the new effort as well.

...



When the Mets moved into Citi Field five years ago, they were also, as it t
urned out, settling into fourth place in the National League East and makin
g themselves comfortable. If their season ended Friday, they would finish i
n that very undistinguished spot for the fifth straight season. It would be
 a franchise record, dubious as it might seem, for never in their uphill hi
story have they ended up staying in the same place for so long.


Even in the 1960s, when the Mets made a name for themselves as one of the w
orst teams in baseball history, they finished 10th, and last, only in their
 first four years of existence. In the fifth year, 1966, they finished next
 to last, in ninth.
In baseballs current division format, fourth is the new ninth, a home for t
eams that are usually overmatched and not going anywhere in particular. Tha
t certainly describes the current Mets, who have now been in fourth place l
ong enough to buy a new recliner, hang up some pictures and get to know the
 neighbors (hello, Miami Marlins).
...



FLORHAM PARK, N.J.  When Rex Ryan was hired to coach the Jets in 2009, he i
nsisted on running effectively, a philosophy that he captured in the catchp
hrase ground and pound. Even as the Jets develop Geno Smith, a rookie quart
erback who understandably lacks polish, that is no longer Ryans mantra.



Do I expect us to run more than pass? Not really, Ryan said after Thursdays
 practice. Id like to be close to balanced. I think thats where weve been t
he first couple of games. So I think thats pretty good.
The Jets, who will host the Buffalo Bills on Sunday in a meeting of A.F.C.
East teams with 1-1 records, have attempted 74 passes and 61 rushes. Smith
dropped back 42 times in a 13-10 loss at the New England Patriots on Sept.
12 in a game that raised questions about how committed Marty Mornhinweg, th
e new offensive coordinator, would be to the run. The Jets backed off the g
round game despite rushing 32 times for 129 yards; Smith threw three interc
eptions in the fourth quarter.
...



                            Several times in Rush, Ron Howards excitingly t
orqued movie set in the Formula One race world, the camera gets so close to
 a drivers eye that you can see each trembling lash. Its a startlingly beau
tiful but also naked image, partly because theres no hiding for an actor wh
en the camera gets that close. In moments like these, youre no longer watch
ing a performance with its layers of art and technique: youve crossed the b
order between fiction and documentary to go eye to eye with another persons
 nervous system. Mr. Howard doesnt just want you to crawl inside a Formula
One racecar, he also wants you to crawl inside its drivers head.


                            Specifically, he wants to get inside those of J
ames Hunt (Chris Hemsworth) and Niki Lauda (Daniel Brhl), Formula One titan
s and rivals, who, in 1976, helped push the sport into mainstream conscious
ness. (Well, at least in much of the rest of the world: Formula One has lon
g struggled in the United States.) In 1976, when both men were in their lat
e 20s, they raced after each other while chasing the world championship ove
r wet, dry and terrifyingly gnarly tracks. Tucked, very much alone, into op
en-wheel machines that could easily have become coffins, Hunt and Lauda cut
 corners and grazed death lap after lap  whooshing over racetracks, city st
reets and deceptively pastoral roads into the sort of sports legend that tr
anslates only occasionally into good cinema.

...

Wrapping Up

This tutorial showed a text analysis approach that is useful for exploring the relationships and connections between words. Relationships involving n-grams, which help to see what words tend to appear after others, or co-occurences can help us find which terms are most important to a document. It is my aim to convey that Julia, as a tool for text analysis, is just as flexible as Python and that it will play an important role in the work data scientists do moving forward.

Gordon MacMillan

Data Science Professional