Writing is very different today compared with fifteen years ago.
To be published used to mean in print, which constrained space but was less hasty. There were more gatekeepers, and way less content competing for readers’ attention. It only took a few years, but technology completely overhauled the economics of the written word.
I didn’t think the process was over.
I like writing, and at the very least, it’s proved central to my career so far. But with decades of work ahead of me, I felt it would be risky to ignore the progress being made on NLP. In late 2017, I decided it was time to upskill.
In the process, I’ve come across lots of ‘learning guides’, which are essentially compilations of freely available courses and books. Problem is, just a single course will take up weeks of your time. Likewise, digesting a single technical book is incredibly demanding — let alone a list of them.
As someone wanting to learn how to apply NLP tools and techniques, this didn’t seem practical. This article traces the actual learning path I ended up paving.
I’ve written it for people who want to extract value from text. If you only take away one thing, let it be that none of this stuff is beyond your reach. Not instantly, maybe not even quickly, but it’s doable.
I say this because it’s always ego-bruising to venture into a new field. One, it’s a lot harder to build a knowledge base from first principles than to expand an existing one. And two, in seeking to learn from those who know more, you glumly accept this means pretty much anyone.
Oftentimes I thought, wow, I’m such a loser for struggling this much. I had to actively remind myself that learning is both a choice and a massive privilege.
So it’s important not to lose that sense of adventure. Looking back, it struck me how much it’s like being a tourist, taking in sights and experiences filtered by what you already are. Then let’s be tourists.
Welcome to London.
1. The absolute must-haves
Where do most tourists head to first?
When people start with NLP, many go straight for the ready-made data: the movie reviews, the newsgroups, the Twitter sentiment.
Mainstream thinking is it’s good to start with well-trodden examples and techniques, and eventually apply this knowledge to “real” texts later. This means you’re directing your learning down a path of shoehorning problems into stuff you’ve learned to do.
Don’t.
Instead, I recommend you initially scrape some text. This will allow you to play with textual content relevant to your specific needs, and give you a hands-on feel for what breaking these kinds of texts apart can yield.
Just as importantly, it will get you thinking about what labels you can easily extract for your data, and which ones you’d potentially like to calculate.
Start by learning how to use the requests and urllib libraries, along with a parser (like Beautiful Soup). Scraping text is often just three short lines of code. The effort is to then look at the parsed HTML and pull out the bits you need. Any search will reveal a ton of step-by-step guides.
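To give a feel for the "pull out the bits you need" step, here is a minimal sketch. It assumes you only want the text inside paragraph tags, and it uses the standard library's html.parser as a stand-in for Beautiful Soup so that it runs without installing anything. The class name, the function name, the sample HTML and the URL in the comment are all made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            self.paragraphs.append("".join(self._chunks).strip())
            self._inside_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here.
        if self._inside_p:
            self._chunks.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    return parser.paragraphs


# In practice the HTML would come from the web, for example (hypothetical URL):
#   from urllib.request import urlopen
#   html = urlopen("https://example.com/article").read().decode("utf-8")
html = """
<html><body>
  <nav>Home | About</nav>
  <h1>Rates on hold</h1>
  <p>The central bank kept rates <b>unchanged</b> on Thursday.</p>
  <p>Analysts had expected a cut.</p>
</body></html>
"""

paragraphs = extract_paragraphs(html)
print(paragraphs)
```

With Beautiful Soup the extraction step is much shorter (its `find_all("p")` does the same job), but the work of deciding which tags hold the text you care about stays the same.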
Then there is also a clever library called newspaper3k, which may or may not work for your needs (which is why it’s good to gain general scraping skills). However, when it works, it is a thing of beauty.
Now, you’ll want to place your data into some workable format: one that’s easy to load and process, but also where you can open up Excel and easily look at the data. For this, you should get familiar with dataframes. Don’t waste any time at first beyond how to initialize, add stuff and access data. (Here, try this guide.) Everything else, you’ll pick up as you go.
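As a rough illustration of "initialize, add stuff and access data", here is a small sketch using pandas. The records, column names and URLs are invented for the example, and writing to an in-memory buffer is just a way to avoid creating a file here.

```python
import io

import pandas as pd

# Hypothetical scraped records: one dict per article.
records = [
    {
        "url": "https://example.com/a",
        "title": "Rates on hold",
        "text": "The central bank kept rates unchanged.",
    },
    {
        "url": "https://example.com/b",
        "title": "Shares slide",
        "text": "Markets fell sharply after the announcement on Thursday.",
    },
]

df = pd.DataFrame(records)                         # initialize
df["n_words"] = df["text"].str.split().str.len()   # add a column
first_title = df.loc[0, "title"]                   # access a single cell
long_rows = df[df["n_words"] > 6]                  # keep only some rows

# Write CSV so the same data can be opened in Excel.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

To save an actual file, pass a filename such as "articles.csv" to `to_csv` instead of the buffer.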
The other absolute must-have is regular expressions. Language feels almost rule-based. Only by actually penning some rules can you experience the swelling burden of exceptions. That said, sometimes easily extracting 70% of the relevant info really is enough.
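Here is a small sketch of what penning a rule looks like, using Python's built-in re module. The pattern and the sample sentence are my own illustration: a single rule for money amounts that start with a currency symbol.

```python
import re

# One rule: a currency symbol, a number, and an optional magnitude word.
MONEY = re.compile(
    r"[$£€]\s?\d+(?:[.,]\d+)*(?:\s?(?:million|billion|m|bn)\b)?",
    re.IGNORECASE,
)

text = (
    "Acme raised $12.5 million in March, after a £300m loan. "
    "Rivals spent 40 million dollars, and one paid €1,200 per seat."
)

matches = MONEY.findall(text)
print(matches)  # ['$12.5 million', '£300m', '€1,200']
```

The rule finds three of the four amounts. "40 million dollars" has no currency symbol, so it slips through, and covering it means adding another rule, then another for the next exception.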
2. Toying with texts
It’s weird, but it’s become common practice to process language while completely ignoring it. A lot of NLP kicks off with converting words and word-combinations into numerical objects, and then mathematically manipulating those objects. It’s easily forgotten that what you’re manipulating is a representation of the language, not the thing itself.
This happens because manipulating the thing itself is what NLP practitioners had been doing for years, with limited success. And the tool you were told to do it with was NLTK, a sprawling educational NLP library.
Well, I tried using NLTK and didn’t like it. I tried reading the accompanying manual, which only made things worse. Then, I turned to spaCy and never looked back.
SpaCy is clean and efficient. Its lucid, visually pleasing documentation makes it highly accessible, so getting things done is straightforward. It even has an online course.
The entry point is its pipeline: By default, when you process, say, a paragraph, you end up with an object that is already split into sentences and broken into word-objects (called tokens); the sentences are parsed and the tokens are tagged according to their parts of speech; and some tokens are even recognized as named entities (for example, that ‘Germany’ is a country). Later, should you need to, you’ll be able to remove some of these steps, or add others on top. But this default setting sets you off nicely.
Picking up linguistic sentence parsing is like getting your Oyster card: a small initial inconvenience, followed by obvious ease-of-use. When I tried grasping it, I needed help. To pay it forward, I’ve written this short explainer article.
Start playing little games with texts: can you count the verbs? Can you pick text apart? Can you put some text together?
As you become more acquainted with spaCy, these games can become richer. And before you know it, there you are, manipulating text not with a keyboard or a pen, but as if it were bits of Lego.
3. Lots of shallow reading
Shallow reading is a vastly underrated learning tool.
We learn best by hooking new ideas onto familiar concepts. When everything is new — there are no hooks. By skim-reading a volume of articles on a topic, even if you actively understand very little, your pattern-seeking brain will notice how certain terms often appear together. You’ll start recognizing concepts and ideas long before understanding them fully.
Soon enough, you’ve got yourself a coat hanger.
I suggest steering well clear of academic papers or blog posts about them. The authors write to gain sway within their specialist community; their target audience is not the curious noob.
Instead, my go-to source for a torrent of NLP articles is Medium, and particularly the Towards Data Science publication. Other great sources are the fast.ai blog, the Analytics Vidhya blog and Sebastian Ruder’s newsletter. You can choose others, of course; what matters is consistently reading a variety of articles.
There is also Jay Alammar’s blog. There aren’t many posts, but each is a hyper-visualised explanation of a machine learning concept — with a strong tilt towards recent progress in deep-learning NLP.
4. Books
Books are like shopping: there’s limitless supply, competing for your limited resources. Also, whatever you choose, half the stuff will end up proving useless. Still, the experience exposes you to a bunch of new merch all at once.
Like I said, currently people seem more focused on the “Processing” bit, kind of forgetting about the “Natural Language” bit. I feel this can be a mistake.
To me, it meant reading one book about linguistics, and as many books as possible about what people pass off as language. Some examples: I spent a while reading all that was available about Koko, the gorilla who was taught sign language and had learnt something that many people accepted as English. I wasn’t interested in whether or not Koko had spoken (sign language) English; I wanted to know what made people believe that she had.
I spent a similar amount of time reading about how IBM’s Watson had won Jeopardy, and the book authored by the team’s leader, David Ferrucci, about his efforts to build story-writing software.
The one I would most recommend, though, if you really want to think about language and how it’s used, is Wittgenstein’s Philosophical Investigations. I know this sounds like a huge overkill, which has nothing to do with Python or fancy neural networks. But I posit that the better you understand the Natural Language part, the cleverer you can get with simpler algorithms.
Back in the Processing department, I was sorely disappointed with books.
Of the O’Reilly animal series, I got Applied Text Analysis with Python (which is fairly new), and Programming Collective Intelligence (which is not). I found both books pretty useless. Ironically, since they’re meant to be coding books, in both cases the problem was the code:
The older one is riddled with outdated code, involving now-defunct APIs and libraries. The new one may be up-to-date, but the code illustrates the ideas with one main project, which gets progressively built. This means almost every chapter requires familiarity with the preceding codebase. In both books, the text explaining the principles is severely lacking. Whoever their target reader is, it wasn’t me.
This was a major low point. I wanted to learn NLP techniques, yet here I was: finding all the content in the world about theory, and getting nowhere on the practice.
5. Tasks and libraries
Instead, I went back to articles. If at first shallow reading was helpful to pick up concepts, over time patterns had begun to emerge, both in the tasks people explore (sentiment, summarisation, topic modelling, text generation, data visualisation), and the techniques (algorithms, libraries) with which they create solutions.
Just like how, though everyone has their own special London, lots of us happen to like the same things.
I began to map the problem space, and the commonly used techniques.
When people share their learning, the majority of articles feature an implementation of ideas at their most accessible level, using toy datasets or toy problems. By reading several articles on any given topic and following the code, I’d eventually get it.
In essence, any problem I could solve by means of search, I considered a solved problem. This last statement differs from most of the advice you’ll get, yet it’s absolutely key to applied NLP.
Take, for example, text summarisation, a difficult task by any measure. You could spend your time reading lots of academic research, which competes on improving performance on a single dataset. Or, you could find out how to implement TextRank, the most common extractive method, by hand.
Or, you could understand what TextRank does, and test it using PyTextRank (recently released, coded by a well-known expert, and an improvement on the original algorithm; also, something you’ve found only because you’ve been diligently shallow-reading).
It’s not that one approach beats the others — each delivers different gains. Your job is to define the goal and assess where best to spend your time.
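For a sense of what the hand-rolled option involves, here is a bare-bones sketch of the TextRank idea: treat sentences as nodes, link them by word overlap, and run a PageRank-style iteration to score them. It is a simplification under my own assumptions (naive sentence splitting, unique-word overlap, a fixed number of iterations, made-up function names) and is not how PyTextRank is implemented.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    """Shared words, normalised by the (log) sizes of both sentences."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    overlap = len(words_a & words_b)
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(text, top_n=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    bags = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    n = len(sentences)

    # weights[i][j] is the edge weight between sentence i and sentence j.
    weights = [
        [similarity(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]

    # PageRank-style update: each sentence inherits score from its neighbours.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]

    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]  # keep original order


text = (
    "The cat sat on the mat. "
    "The cat chased the mouse across the mat. "
    "Stock prices fell sharply today. "
    "The mouse hid under the mat from the cat."
)
summary = textrank(text, top_n=2)
print(summary)
```

The sentence about stock prices shares no words with the rest, so it receives the minimum score and never makes the summary. Real texts are far messier, which is exactly where a maintained library earns its keep.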
The same principle also holds true for code. There are so many quality open-source libraries. And let me tell you now — most of them I have yet to use. Instead, I’ve been maintaining a document listing useful libraries, and using them as needs arise. Here is one resource to get you started with your own list.
Briefly back to articles, there is also the exciting small minority. On occasion, you’ll see someone trying to adapt ideas to a highly specific task, where they’re laser-focused on making something that actually works. More often than not, they’ll be using familiar tools, but in a different way.
Make sure you save any article of this type; odds are, you’ll be wanting it later.
6. Small Projects
Say you’ve read 25 reviews of a tourist attraction, and all of them were bad. The attraction ain’t cheap either, plus it’ll eat up an entire day.
Would you go?
The articles you read will point you in the direction of stuff you can do, but not necessarily towards stuff you should do. I refused to spend time implementing algorithms where I had never seen any meaningful results.
Why? Because, broadly speaking, out-of-the-box algorithms perform poorly on useful tasks. You could spend months and years learning every trick in the book — and still be no closer to adding value.
This is bad advice for anyone wanting to clock up coding hours or just starting their career. But if you’d like to make use of NLP within your domain, the most important skill to cultivate is asking commercial questions, and figuring what it would take to solve them.
What’s better — one big project, or several smaller ones? I’d say at first, the smaller the better. Think of it as going off on little adventures.
Initially, I implemented interesting stuff I saw other people doing. Then I started applying my own ideas, just to see what happens.
One thing to have happened, for example, is that my technique for sentiment-tagging financial news is currently ranked top or nearly-top in any relevant Google search. I mention this to reinforce my earlier point: Yes, you can deliver new value by looking at known problems from fresh perspectives.
As to what you should implement — that’s where your own point of view comes in. I do recommend reading previous posts by the following two writers, for their consistent sense of adventure and can-do attitude: Susan Li, who blogs mostly on NLP stuff, and Will Koehrsen, who blogged about all sorts of machine learning topics.
And as a tourist, it’s probably time to go home.
What next?
So now the learning path forks: one way is towards a more continuous view of language, where the swing sways towards “Processing”: this is all about making use of large language models, by means of transfer learning.
The other is towards a more discrete view of language, which I see as closer to the “Natural” part: this is the stuff encompassing knowledge graphs and combinatorial representations.
Just like “Natural Language Processing” is a single idea, these paths do reconnect eventually. But initially they’re different. Because I’m theory-agnostic — a fancy way of saying I’ll happily pick and choose as befits the problem — I wanted to explore both.
For practical reasons, I decided to start off with transfer learning, and took the fast.ai NLP course. I’d like to write about it at some point.
In the meantime, let’s conclude: To become sufficiently acquainted with tools and practices so I could experiment with them, I followed the learning path traced above. I hope parts of it will help you too.
Thanks to Ludovic Benistant.