Easily convert an HTML page into structured JSON data.
npm i html-to-json-data
This module only provides convenience methods to transform an HTML string into JSON. You'll have to fetch your pages by whatever means you prefer.
const convert = require('html-to-json-data');
const { group, text, number, href, src, uniq } = require('html-to-json-data/definitions');
const html = '<html>...</html>'; // in this example https://github.com/piuccio?tab=repositories
const json = convert(html, {
  page: 'GitHub',
  name: text('.vcard-fullname'),
  nickname: text('.vcard-username'),
  avatar: src('img.avatar', 'https://github.com'),
  languages: uniq('span[itemprop="programmingLanguage"]'),
  repos: group('#user-repositories-list li', {
    name: text('h3'),
    link: href('h3 a', 'https://github.com'),
    stars: number('a[href$="stargazers"]'),
  }),
});
The resulting object looks like the following:
{
  page: 'GitHub',
  name: 'Fabio Crisci',
  nickname: 'piuccio',
  avatar: 'https://avatars1.githubusercontent.com/u/680284?s=460&v=4',
  languages: ['JavaScript', 'HTML', 'Python'],
  repos: [{
    name: 'cowsay',
    link: 'https://github.com/piuccio/cowsay',
    stars: 314,
  }, {
    name: 'flat-earth',
    link: 'https://github.com/piuccio/flat-earth',
    stars: 1,
  }], // the list goes on
}
Have a look at the tests for more detailed examples.
The functions exported by html-to-json-data/definitions let you select data from the HTML page and convert it to the desired type.
They all take a selector as their first parameter. Any selector that is valid for cheerio will work.
text(selector)
Returns the trimmed text content of the selected node.
If the selector matches multiple nodes, it returns an array with all selected values.
uniq(selector)
Similar to text, but always returns an array of unique values.
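As a rough illustration of the difference between text and uniq, the sketch below deduplicates a list of already-extracted strings with a Set. The languages array is invented sample data, and the library may implement uniqueness differently.

```javascript
// Invented sample: what text() might return when several nodes match.
const languages = ['JavaScript', 'HTML', 'JavaScript', 'Python', 'HTML'];

// A Set keeps the first occurrence of each value, in insertion order.
const uniqueLanguages = [...new Set(languages)];

console.log(uniqueLanguages); // [ 'JavaScript', 'HTML', 'Python' ]
```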
number(selector)
Converts the text content to a number. Returns 0 if the selector doesn't match any element.
attr(selector, name)
Returns the value of the attribute name of the node selected by selector.
For instance, if selector matches <a title="Link" />, the definition attr('a', 'title') returns Link.
href(selector, baseURI)
Convenience method that returns the value of the href attribute.
Similar to attr(selector, 'href'), but it resolves relative paths against baseURI.
src(selector, baseURI)
Convenience method that returns the value of the src attribute.
Similar to attr(selector, 'src'), but it resolves relative paths against baseURI.
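To illustrate what resolving a path against baseURI means for href and src, here is a minimal sketch using Node's built-in URL class. It is not the library's implementation: resolveAgainst is a hypothetical helper name, and the library's exact resolution rules may differ.

```javascript
// Hypothetical helper (not part of html-to-json-data): resolves a possibly
// relative href/src value against a base URI using the WHATWG URL class.
function resolveAgainst(value, baseURI) {
  return new URL(value, baseURI).href;
}

const base = 'https://github.com';
const relative = resolveAgainst('/piuccio/cowsay', base);
const absolute = resolveAgainst('https://avatars1.githubusercontent.com/u/680284', base);

console.log(relative); // https://github.com/piuccio/cowsay
console.log(absolute); // https://avatars1.githubusercontent.com/u/680284
```

Relative values pick up the origin of the base, while values that are already absolute pass through unchanged.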
prop(selector, name)
Similar to attr, but returns a property of the node.
For instance, in <input type="checkbox" />, the definition prop('input', 'checked') returns false.
data(selector, name)
Similar to attr, but returns the data attribute.
For instance, in <div data-apple-color="pink" />, the definition data('div', 'apple-color') returns pink.
input(selector)
Utility method that extracts the data of a form input.
For instance, for <input type="radio" name="gender" value="fluid"> it returns { type: 'radio', name: 'gender', value: 'fluid' }.
group(selector, definitions)
Creates a list of objects described by definitions.
The selectors inside definitions are scoped to selector.
For instance, group('li', { title: text('h3') }) returns an array of objects with title extracted from li h3.
If you need to access the element selected by the group selector in a nested definition, you can use the special selector :self.
For instance:
group('select option', {
  value: attr(':self', 'value'),
  name: text(':self'),
});
definitions can be either an object with nested data or any other definition provided by the library, for instance:
group('table tr', text('td:first-child'));
The definition above returns an array of strings extracted from the first td of every table row.
The group function exposes the following functions, which can be chained to manipulate the list of results.
If you need to filter out some elements from the list but the CSS selector is not powerful enough, you can use group('table tr', {}).slice(1, -1).
slice works exactly like Array.prototype.slice.
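Since slice is described as behaving like Array.prototype.slice, the plain-array sketch below shows what slice(1, -1) keeps. The rows array is made-up illustration data standing in for the results of a group; no library call is involved.

```javascript
// Made-up stand-in for the list of results produced by a group definition.
const rows = ['header', 'row 1', 'row 2', 'footer'];

// slice(1, -1) drops the first and the last element, e.g. a table's
// header and footer rows, and leaves the original array untouched.
const body = rows.slice(1, -1);

console.log(body); // [ 'row 1', 'row 2' ]
```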
When your selectors return an array, you can flatten the list of results by calling group().flat().
const html = `
  <table>
    <tr>
      <td><a>One</a></td>
      <td><a>Two</a></td>
    </tr>
    <tr>
      <td><a>Three</a></td>
      <td><a>Four</a></td>
    </tr>
  </table>
`;
group('table tr', text('a')); // [ ['One', 'Two'], ['Three', 'Four'] ]
group('table tr', text('a')).flat(); // ['One', 'Two', 'Three', 'Four']
Allows complex filtering of the nodes selected by group.
filterBy(definition, filterFn)
const html = `
  <table>
    <tr>
      <td class="price">Free</td>
      <td class="product">One</td>
    </tr>
    <tr>
      <td class="price">Expensive</td>
      <td class="product">Two</td>
    </tr>
  </table>
`;
group('table tr', text('.product')).filterBy(text('.price'), (price) => price === 'Free')
// -> ['One']
The arguments of filterBy are:
definition: any definition that selects a value from the group node.
filterFn: gets called with the result of definition. Return true to keep the value or false to skip it.
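The filterBy semantics can be modelled with plain arrays. This is a sketch of the described behaviour, not the library's code: each entry of the hypothetical rows array pairs the value the group would return (value) with the result of the filtering definition (filterValue), and filterByModel is an invented name.

```javascript
// Hypothetical model: one entry per node selected by group.
// `value` stands for what the group definition extracts (text('.product')),
// `filterValue` for what the filterBy definition extracts (text('.price')).
const rows = [
  { value: 'One', filterValue: 'Free' },
  { value: 'Two', filterValue: 'Expensive' },
];

// Keep a value only when filterFn returns true for its filterValue.
function filterByModel(entries, filterFn) {
  return entries
    .filter((entry) => filterFn(entry.filterValue))
    .map((entry) => entry.value);
}

const freeProducts = filterByModel(rows, (price) => price === 'Free');
console.log(freeProducts); // [ 'One' ]
```

The point of the model is that the filter is evaluated on one definition while the returned list is built from another.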
If the combination of CSS selectors and filterBy is not enough, you can use a function instead of a CSS selector string as the first argument.
const html = `
  <table>
    <tr>
      <td class="price">Free</td>
      <td class="product">One</td>
    </tr>
    <!-- more rows ... -->
  </table>
`;
const selector = ($) => $('table').filter((i, table) => $(table).children().length > 5).find('tr');
group(selector, text('.product'));
The selector above iterates over all the tables in the page and returns all the tr elements of the tables with more than 5 rows.
The selector function receives a cheerio object as its argument; refer to the cheerio documentation for advanced usage.