What is information extraction?

Information extraction refers to the process of starting from unstructured sources (e.g., text documents written in ordinary English) and automatically extracting structured information (i.e., data in a clearly defined format that’s easily understood by computers). It’s an important concept in natural language processing (NLP).

For example, you might think of using news articles full of celebrity gossip to automatically create a database of the relationships between the celebrities mentioned (e.g., married, dating, divorced, feuding). You would end up with data in a structured format, something like MarriageBetween(celebrity1,celebrity2,date).

The challenge involves developing systems that can “understand” the text well enough to extract this kind of data from it.