Appendix A: Extract sub-categories

This appendix describes an optional phase that follows phase 1. It is unnecessary in most cases when extracting data from eCommerce sites. The reason for adding this phase is explained in phase 1#What next. In this phase, the clues used to visit sub-category pages are extracted.



Recognize new theme

Tools used: MetaStudio, a data schema definition tool.

On MetaStudio's Theme List work board, the theme ComYellowPage_mic_en_l2 is shown with the status torecognize (figure 1). Right-click the theme list and choose the pop-up menu item recognize to load a sample page, which MetaStudio selects automatically.

Note: Because a data schema was already defined in phase 1, MetaStudio asks the operator whether the current work board should be cleared. This must be confirmed before a new data schema can be defined.


Figure 1



Edit theme information

Tools used: MetaStudio, a data schema definition tool.

This step can be skipped because the default information is sufficient.



Create clues

Tools used: MetaStudio, a data schema definition tool.

On the Clue Editor work board, take the following steps to create a clue of type Pattern:

  1. Create a new clue by pushing the newClue button;
  2. Set the type of the clue to Pattern by selecting the Pattern radio button;
  3. In the Pattern tab window, right-click and select the menu item Insert to create a pattern record for the new clue.


Map clues

Tools used: MetaStudio, a data schema definition tool.

Operators must tell MetaStudio at which position, or within which scope, one or more clues are to be extracted. This is done by mapping a DOM node that represents the position or the scope to the newly created clue. The steps are as follows:

  1. Enable reverse selection;
  2. Browse the sample page that has been loaded into the embedded Web browser. Click the string "Auto Parts" so that MetaStudio positions on the DOM node with row No. 1091;
  3. Choose an ancestor of node 1091, the node with row No. 1085, as the scope within which clues are to be extracted. When the node is selected in the DOM Tree Viewer, the border of the corresponding HTML area blinks 3 times, which lets you check whether the DOM node is suitable;
  4. Right-click and choose the pop-up menu item Clue Mapping>>Clue Mapping>>s_clue 0 to map node 1085 to clue No. 0.
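
The idea behind steps 2 and 3, starting from a clicked link and climbing to an ancestor that encloses all the sub-category links, can be illustrated outside MetaStudio. The following is only a sketch of the concept, not how MetaStudio works internally: the sample markup, the function name find_scope, and the min_links threshold are all hypothetical, and the row numbers 1091 and 1085 have no counterpart here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the category menu of the sample page.
SAMPLE = """
<div id="page">
  <ul id="categories">
    <li><a href="/auto-parts/">Auto Parts</a></li>
    <li><a href="/chemicals/">Chemicals</a></li>
    <li><a href="/toys/">Toys</a></li>
  </ul>
  <p><a href="/about/">About us</a></p>
</div>
"""

def find_scope(root, link_text, min_links=2):
    """Climb from the clicked link to the nearest ancestor holding several links."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = next(a for a in root.iter("a") if a.text == link_text)
    while node in parents:
        node = parents[node]
        if len(list(node.iter("a"))) >= min_links:
            return node
    return None

root = ET.fromstring(SAMPLE)
scope = find_scope(root, "Auto Parts")
links_in_scope = [a.get("href") for a in scope.iter("a")]
print(scope.tag, scope.get("id"), links_in_scope)
```

In this sketch the scope stops at the list element, so the unrelated "About us" link outside the list is not included. Choosing too high an ancestor would pull such links in, which is why checking the blinking border in the embedded browser is useful.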


Map patterns

Tools used: MetaStudio, a data schema definition tool.

Take the following steps to map pattern values and to name target themes:

  1. With reverse selection enabled, click a hyperlink in the embedded browser window, e.g. "Auto Parts", from which a clue will be extracted, to position on the corresponding HTML A element node in the DOM tree viewer;
  2. Expand the sub-tree below the node and select the attribute node @href;
  3. Right-click and choose the pop-up menu item Clue Mapping>>Pattern Mapping>>scope 0 to fill the Loc Prefix edit box of the pattern record automatically with the URL of this hyperlink;
  4. Edit the pattern value so that it covers all hyperlinks. Unfortunately, in this example, only the shortest string "/" is valid;
  5. Fill the Target Theme edit box with the name of the target theme, i.e. ComList_mic_en.

The following figure shows the clue and the pattern after mapping:
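
Step 4 is easier to follow with a small illustration of why the prefix has to be shortened. The sketch below assumes that Loc Prefix is compared against each href as a plain string prefix, which the steps above suggest but do not state explicitly. The href values and the function name extract_clues are hypothetical; only the target theme name ComList_mic_en comes from this appendix.

```python
import os.path

# Hypothetical hrefs found inside the mapped scope (not taken from the real site).
hrefs = ["/auto-parts/", "/chemicals/", "/toys/"]

def extract_clues(hrefs, loc_prefix, target_theme):
    """Keep links whose href starts with loc_prefix and tag them with the target theme."""
    return [(target_theme, h) for h in hrefs if h.startswith(loc_prefix)]

# The prefix filled in automatically from one link matches only that link.
too_narrow = extract_clues(hrefs, "/auto-parts/", "ComList_mic_en")

# Shortening the prefix to "/" matches every link in the scope.
all_links = extract_clues(hrefs, "/", "ComList_mic_en")

# The longest prefix shared by all hrefs shows why nothing longer than "/" works here.
shared_prefix = os.path.commonprefix(hrefs)
print(len(too_narrow), len(all_links), repr(shared_prefix))
```

Because a prefix as short as "/" matches almost any site-relative link, the scope chosen in the previous section is what keeps unrelated links from becoming clues.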


Figure 2


Upload work files

Tools used: MetaStudio, a data schema definition tool.

Push the Schema button on the right side of the toolbar to upload the work files.



Extract clues

Tools used: DataScraper, a Web data and clue extraction tool.

The following steps are taken to extract clues with DataScraper:

  1. Run DataScraper;
  2. Enter the query condition "*mic*" to find the matching themes;
  3. Select theme ComYellowPage_mic_en_l2;
  4. Right-click the theme list and choose the pop-up menu item Crawl;
  5. Enter the number of clues to be extracted for this theme. In this case, enter "1".
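
The query condition in step 2 can be read as a wildcard pattern. The sketch below assumes that "*" matches any run of characters, as in shell-style wildcards; DataScraper's actual matching rules are not documented here, so treat this as an illustration only. The theme BookList_demo and the function name query_themes are hypothetical; the other two theme names appear in this appendix.

```python
from fnmatch import fnmatchcase

# Two theme names from this appendix plus one hypothetical non-matching theme.
themes = ["ComYellowPage_mic_en_l2", "ComList_mic_en", "BookList_demo"]

def query_themes(themes, condition):
    """Return the themes whose names match a shell-style wildcard condition."""
    return [t for t in themes if fnmatchcase(t, condition)]

matched = query_themes(themes, "*mic*")
print(matched)
```

Under this assumption, both the current theme and the target theme ComList_mic_en would be listed, so step 3 is needed to pick the one to crawl.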


What next

The phase described in this appendix is an optional phase between phase 1 and phase 2. Go to phase 2 to continue with extracting commodity information.