extract-web

Extract content from web pages, including link URLs, image URLs and entire web page contents.

Author: KunihikoKido

Version: 0.3.2

License: MIT


Overview

Commands

Settings

Customizing the output of the Extract Contents command

The Extract Contents command outputs a JSON or YAML document containing an array of objects, one object per extracted web page.

Within each page's object, the properties object holds the properties extracted from that page.

To customize which properties are extracted from each page, prepare a configuration file like the example below. The elements to extract are specified with CSS selectors.

Example:

{
  "target": [
    {
      "pattern": {
        "url": "https://atom.io/packages/.*"
      },
      "properties": {
        "title": {
          "text": "title"
        },
        "body": {
          "text": "body"
        },
        "bodyAsHtml": {
          "html": "body"
        },
        "package_meta": {
          "text": ".package-meta ul li a",
          "isArray": true
        },
        "meta_description": {
          "attr": "meta[name=description]",
          "args": ["content"]
        },
        "domain": {
          "default": "atom.io"
        }
      }
    }
  ]
}
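
To make the structure of this configuration easier to read, here is a minimal Python sketch that walks a trimmed copy of the example. It is not the package's implementation. It assumes that pattern.url is a regular expression matched against the whole page URL, that isArray means "collect every matching element rather than only the first", and that the first entry in args names the attribute to read. The helpers find_target and describe_rule are hypothetical names introduced only for illustration.

```python
import json
import re

# A trimmed copy of the example configuration above.
CONFIG = json.loads("""
{
  "target": [
    {
      "pattern": {"url": "https://atom.io/packages/.*"},
      "properties": {
        "title": {"text": "title"},
        "package_meta": {"text": ".package-meta ul li a", "isArray": true},
        "meta_description": {"attr": "meta[name=description]", "args": ["content"]},
        "domain": {"default": "atom.io"}
      }
    }
  ]
}
""")


def find_target(config, url):
    """Return the first target whose URL pattern matches, or None."""
    for target in config["target"]:
        # Assumption: the pattern is a regex that must match the whole URL.
        if re.fullmatch(target["pattern"]["url"], url):
            return target
    return None


def describe_rule(name, rule):
    """Summarise one property rule in plain words (hypothetical helper)."""
    # Assumption: isArray selects every matching element, not just the first.
    scope = "all matches" if rule.get("isArray") else "first match"
    if "text" in rule:
        return f"{name}: text of {rule['text']!r} ({scope})"
    if "html" in rule:
        return f"{name}: HTML of {rule['html']!r} ({scope})"
    if "attr" in rule:
        # Assumption: args[0] is the name of the attribute to read.
        attribute = rule.get("args", [None])[0]
        return f"{name}: attribute {attribute!r} of {rule['attr']!r} ({scope})"
    # No selector given: the rule only supplies a fixed value.
    return f"{name}: constant {rule.get('default')!r}"


if __name__ == "__main__":
    target = find_target(CONFIG, "https://atom.io/packages/extract-web")
    if target is not None:
        for name, rule in target["properties"].items():
            print(describe_rule(name, rule))
```

Run as a script, this prints one line per property, which can be a quick way to sanity-check a configuration before using it with the Extract Contents command.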

Screenshots

[Screenshot: Extract Contents]