;;; -*- mode: fundamental; coding: utf-8; indent-tabs-mode: t; -*-
;;;
;;; Copyright (c) 2012 -- 2012 Stephan Oepen ([email protected]);
;;; see `LICENSE' for conditions.
;;;
;;;
;;; _fix_me_
;;; the following is a set of `cheap and cheerful' REPP rules to make simplified
;;; HTML text palatable to parsing with the ERG.  for the time being, we mostly
;;; just throw out markup, with the exception of italics, emphasis, et al.,
;;; which can signal use--mention distinctions.
;;;
!</?(?:a|b|h[0-9]|li|small|strong|title)>		
!<em(?: [^>]*)?>((?:(?!</em>).)*)</em>		¦i \1 i¦
!<i(?: [^>]*)?>((?:(?!</i>).)*)</i>		¦i \1 i¦
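;;
;; a sketch of the intended effect of the two rules above (the input is made
;; up for illustration):
;;
;;   the word <em>dog</em> is a noun  -->  the word ¦i dog i¦ is a noun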
;;
;; in blogs, authors at times try to be witty and use HTML strike-through to
;; mark changes of mind or otherwise inappropriate content; drop such text.
;;
!<s(?: [^>]*)?>(?:(?!</s>).)*</s>		
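;;
;; for illustration (hypothetical input; shown after the final whitespace
;; normalization at the end of this file):
;;
;;   this is <s>terrible</s> great  -->  this is great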
;;
;; _fix_me_
;; what about sub- and super-scripts? a task for the GMLC. (5-mar-12; oe)
;;
!</?su(?:b|p)>
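;;
;; for illustration (hypothetical input): only the tags are dropped, so the
;; script content is glued onto its neighbours, e.g.
;;
;;   H<sub>2</sub>O  -->  H2O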
;;
;; as we do for wikipedia text, collapse pieces of non-English into a token
;; that the grammar treats much like a proper name. for robustness, enforce
;; token boundaries around these.
;;
!<code[^>]*>(?:(?!</code>).)*</code> <code/>
!<img[^>]*>(?:(?!</img>).)*</img> <img/>
!<kbd[^>]*>(?:(?!</kbd>).)*</kbd> <kbd/>
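;;
;; for illustration (hypothetical input; shown after the final whitespace
;; normalization below):
;;
;;   press <kbd>Ctrl-C</kbd> to quit  -->  press <kbd/> to quit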
;;
;; finally, for peace of mind, normalize whitespace sequences to a single space
;;
! +