Extract data and clues

Tools used: DataScraper, a data and clue extraction tool.

Take the following steps to extract data and clues for the theme ComList_mic_en:

  1. Run DataScraper.
  2. In the left column, enter the query condition *mic* to retrieve all matching themes.
  3. Select the theme ComList_mic_en, right-click it, and choose Crawl from the pop-up menu.
  4. Enter the number of clues to extract and submit it.
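
The query condition in step 2 reads like a wildcard pattern. The following Python sketch illustrates how such a pattern would select themes, assuming glob-style matching in which * stands for any run of characters. DataScraper's actual matching rules are not documented here, and every theme name other than ComList_mic_en is hypothetical.

```python
from fnmatch import fnmatchcase


def matching_themes(theme_names, pattern):
    """Return the theme names that match a glob-style pattern.

    Assumption: '*' matches any run of characters, as in shell globs.
    This only illustrates the idea; it is not DataScraper's own code.
    """
    return [name for name in theme_names if fnmatchcase(name, pattern)]


# Hypothetical theme list; only ComList_mic_en comes from this guide.
themes = ["ComList_mic_en", "ProductList_en", "mic_news"]
print(matching_themes(themes, "*mic*"))
```

Under that assumption, the pattern selects every theme whose name contains "mic", which is why ComList_mic_en shows up in the result list.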

DataScraper then turns page after page, extracting data and clues from each one. To check the result files, run Harvest Manager. To check the status of the clues, right-click the theme and choose Statistics from the pop-up menu; it shows status information for that theme.

Note: When data and clues are extracted for a theme for the first time, the number of clues should be small, e.g. 1. The newly defined data schema might not be suitable for all pages. If it is not, the following log message is shown in the output window:

Suitable schema file(dsd) cannot be found for this SpiderClue in CCCst inthread cycle

Detailed information on this log message is given in the DataScraper User's Guide. To resolve the problem, define one more data schema as described in Appendix B.
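
When a first trial run produces a lot of output, the log message quoted above can be searched for mechanically. The sketch below assumes the output-window text has been copied or saved into a string; the function name and the marker constant are illustrative and are not part of DataScraper.

```python
# Assumption: the text of the output window is available as a plain string.
# The marker is the leading part of the log message quoted in this section.
SCHEMA_MISMATCH_MARKER = "Suitable schema file(dsd) cannot be found"


def find_schema_mismatches(log_text):
    """Return (line_number, line) pairs for lines that contain the marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if SCHEMA_MISMATCH_MARKER in line:
            hits.append((number, line.strip()))
    return hits
```

An empty result suggests that the current data schema matched every page in the trial run; any hit points to a page for which an additional schema is needed.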