If you’ve ever looked at Hadoop MapReduce you’ll have come across the tutorial to count words in a document or series of documents (see https://wiki.apache.org/hadoop/WordCount). It’s a nice way to learn Hadoop because it’s easy to picture multiple mappers breaking up files into individual words and reducers performing the aggregation. MapReduce will output a file containing a word on each line and a count of occurrences.

Looking at the tutorials for Azure Data Lake Analytics (ADLA), I found myself wondering whether it is possible to carry out an equivalent word count out-of-the-box, without any custom code outside of a U-SQL script. One of the key benefits of ADLA over Hadoop is that we have no infrastructure to worry about: there are no servers, virtual machines, or clusters to wait for, manage, or tune. ADLA is priced on a per-job basis – a key benefit is you avoid waking up in the night in a panic because you left your 60-node Hadoop cluster running!

Source data

First, we need a document to count words in, so a quick copy/paste of Theresa May’s Brexit letter to Donald Tusk provided the source. Here’s how mine looks:

The document needs to be uploaded to Azure Data Lake Store, which is easily done through the Azure Portal. Azure Data Lake Store and ADLA go together like beans on toast – Data Lake Store provides the massive throughput needed to run analytic jobs in ADLA (or Hadoop/HDInsight) which may have thousands of concurrent executors.

There are three out-of-the-box extractors available with ADLA: Tsv, Csv, and Text. We know that Tsv and Csv won’t work here, so let’s try Extractors.Text:

@words = EXTRACT wholeline string
FROM "/Books/MayLetter.txt"
USING Extractors.Text(delimiter: '\n');

OUTPUT @words
TO "/output/booklines.txt"
USING Outputters.Text();

By using the \n delimiter we are able to ‘schematize’ the file into a single-column rowset. Here is the output file preview in Visual Studio:

Now all we need to do is tokenise the string, output each individual word/token as a row, group by each word and sum.

Tokenise with Split and SQL.ARRAY

We can tokenise each row by using the C# string Split method and create an array using the SQL.ARRAY complex type:

@wordsarray = SELECT new SQL.ARRAY<string>(wholeline.Split(' ')) AS WordsArray
FROM @words;

You’ve probably realised at this point that this doesn’t take care of punctuation (because we’re tokenising on a space alone). Additional string processing can be added here to remove punctuation – we’ll leave it out of the main script for simplicity and readability, but a sketch follows below.
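
If you did want to fold that clean-up in, a minimal sketch (my assumption, not part of the original script) could extend the Split expression – the char array and StringSplitOptions are standard .NET:

// A hedged variant of @wordsarray (assumption, not the original script):
// split on common punctuation as well as spaces, drop empty tokens, and
// lower-case everything so "The" and "the" are counted together.
@wordsarray = SELECT
    new SQL.ARRAY<string>(
        wholeline.ToLowerInvariant()
                 .Split(new char[] { ' ', ',', '.', ';', ':', '!', '?' },
                        StringSplitOptions.RemoveEmptyEntries)) AS WordsArray
FROM @words;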

Details of the SQL.ARRAY complex type can be found at https://msdn.microsoft.com/en-us/library/azure/mt764126.aspx. Now all that’s left to do is turn our SQL.ARRAY into a rowset.
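
As a quick aside, a SQL.ARRAY instance behaves like a read-only .NET list inside C# expressions, so – as an illustrative sketch with a hypothetical rowset name, assuming the Count property exposed by that list interface – you could inspect how many tokens each line produced:

// Hypothetical sketch: count tokens per line via SQL.ARRAY's Count property.
@tokencounts = SELECT WordsArray.Count AS TokensPerLine
FROM @wordsarray;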

CROSS APPLY EXPLODE

If you’re familiar with T-SQL you’ll probably know CROSS APPLY. When processing a value in a column, you can extract a whole rowset of information per column value by using CROSS APPLY. This is exactly what we want to do here – we need to extract a whole rowset of information per SQL.ARRAY row. This is where the EXPLODE expression helps us: EXPLODE turns an instance of SQL.ARRAY into a rowset (see https://msdn.microsoft.com/en-us/library/azure/mt621306.aspx).

@wordslist = SELECT r.eachword AS individualword
FROM @wordsarray AS wa
CROSS APPLY EXPLODE(wa.WordsArray) AS r(eachword);

Here’s what the script looks like so far and the output:

@words = EXTRACT wholeline string
FROM "/Books/MayLetter.txt"
USING Extractors.Text(delimiter: '\n');

@wordsarray = SELECT new SQL.ARRAY<string>(wholeline.Split(' ')) AS WordsArray
FROM @words;

@wordslist = SELECT r.eachword AS individualword
FROM @wordsarray AS wa
CROSS APPLY EXPLODE(wa.WordsArray) AS r(eachword);

OUTPUT @wordslist
TO "/output/wordslist.txt"
USING Outputters.Text();


As you can see, we have a single word per row.

GROUP BY and COUNT

All that remains now is to group by each word and perform a count, and we’ve managed to achieve the same as the Hadoop word count with no custom code beyond the U-SQL script! Here is the final script:

@words = EXTRACT wholeline string
FROM "/Books/MayLetter.txt"
USING Extractors.Text(delimiter: '\n');

@wordsarray = SELECT new SQL.ARRAY<string>(wholeline.Split(' ')) AS WordsArray
FROM @words;

@wordslist = SELECT r.eachword AS individualword
FROM @wordsarray AS wa
CROSS APPLY EXPLODE(wa.WordsArray) AS r(eachword);

@wordaggregate = SELECT individualword, COUNT(*) AS Count
FROM @wordslist
GROUP BY individualword;

OUTPUT @wordaggregate
TO "/output/wordscount.txt"
ORDER BY Count DESC
USING Outputters.Text();

And the output (the right-hand column shows the word frequency):

We did it!

If we exclude the English-language articles, conjunctions, and so on, we find that the most common word in the letter is, somewhat unsurprisingly, “European”.
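
If you wanted the script itself to ignore those filler words, one hedged option (the stop-word list is illustrative only, and this assumes U-SQL’s NOT IN predicate) is to filter before aggregating:

// Hypothetical refinement: drop a handful of common stop words
// (illustrative list, nowhere near exhaustive) before the GROUP BY.
@wordaggregate = SELECT individualword, COUNT(*) AS Count
FROM @wordslist
WHERE individualword.ToLowerInvariant() NOT IN ("the", "and", "of", "to", "a", "that")
GROUP BY individualword;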

What would be interesting to try next is multiple-file processing, which can be achieved by using a wildcard in the EXTRACT expression – see the sketch below.
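
As a minimal sketch (the folder layout is assumed from the earlier examples), a U-SQL file set pattern would look something like this:

// The {*} pattern matches every .txt file under /Books, so all of them
// are extracted – and therefore counted – in a single job.
@words = EXTRACT wholeline string
FROM "/Books/{*}.txt"
USING Extractors.Text(delimiter: '\n');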

About the author