Data and Clue Extraction Rules

The MetaSeeker Toolkit is directed by a set of rules that specify where and how to extract data and clues from target pages. These rules are called Data and Clue Extraction Rules (DCER). MetaStudio generates them after a data schema has been defined for a group of pages. The rules consist of XSLT, XPath and proprietary XML commands, and they are stored in Data and Clue Extraction Instruction Files (DCEIF).

There are three types of rules:

  • Data extraction rule: specifies where and how to extract data snippets from a target page. It is described in detail on the page Data Extraction Instruction File.
  • Clue extraction rule: specifies where and how to extract new clues from a target page. It is described in detail on the page Clue Extraction Instruction File.
  • Data schema recognition rule: checks whether the target page matches the data schema. DataScraper begins extracting data and clues only if the page matches; otherwise it tries to find another suitable data schema.
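
The way the three rule types work together can be illustrated with a minimal Python sketch. It is not MetaSeeker's implementation and does not reproduce the DCEIF format: the rule dictionaries, the XPath expressions, the scrape function and the sample page are all hypothetical, and Python's standard library supports only a subset of XPath. The sketch only shows the order of operations described above: recognize the schema first, then extract data and clues, and fall through to the next schema when recognition fails.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule sets, one per data schema. A real DCEIF holds XSLT,
# XPath and proprietary XML commands; plain XPath strings stand in here.
RULE_SETS = [
    {
        "schema": "article",
        "recognize": ".//div[@class='article']",      # schema recognition rule
        "data": {"title": ".//h1"},                    # data extraction rule
        "clues": ".//a[@class='related']",             # clue extraction rule
    },
    {
        "schema": "product_list",
        "recognize": ".//ul[@class='products']",
        "data": {
            "name": ".//li/span[@class='name']",
            "price": ".//li/span[@class='price']",
        },
        "clues": ".//a[@class='next']",
    },
]


def scrape(page_xml):
    """Return (schema, data, clues) for the first matching schema, else None."""
    root = ET.fromstring(page_xml)
    for rules in RULE_SETS:
        # Recognition: skip this schema if the page does not match it.
        if root.find(rules["recognize"]) is None:
            continue
        # Data extraction: collect the text of every node each path selects.
        data = {
            field: [node.text for node in root.findall(path)]
            for field, path in rules["data"].items()
        }
        # Clue extraction: collect link targets to visit later.
        clues = [link.get("href") for link in root.findall(rules["clues"])]
        return rules["schema"], data, clues
    return None  # no schema recognized this page


# Hypothetical, well-formed sample page.
PAGE = """<html><body>
<ul class="products">
<li><span class="name">Widget</span><span class="price">9.99</span></li>
<li><span class="name">Gadget</span><span class="price">19.50</span></li>
</ul>
<a class="next" href="/list?page=2">Next</a>
</body></html>"""

if __name__ == "__main__":
    print(scrape(PAGE))
```

In this sketch the sample page fails the recognition rule of the first schema, so the second schema is tried and its data and clue rules are applied. A page that matches no schema yields None and nothing is extracted.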