;;; -*- mode: fundamental; coding: utf-8; indent-tabs-mode: t; -*-
;;;
;;; Copyright (c) 2009 -- 2010 Stephan Oepen ([email protected]);
;;; see `LICENSE' for conditions.
;;;
;;
;; deviating from the PTB conventions, we use one-character double quote marks
;; (i.e. |“| and |”| instead of |``| and |''|); much like the PTB, however, we
;; aim to disambiguate neutral quotes (|"| and |''|) at the string level, i.e.
;; opening quotes are preceded by a token boundary (white space), with a small
;; number of additional, token-initial characters that can intervene; anything
;; else, we assume, is a closing quote.
;;
;; convert quotes to single characters prior to tokenizing off other characters
;; (group #1 below) to make adjacent whitespace detection easier, as e.g. in
;; |``$20!''|. closing quotes can double as apostrophes and units of measure,
;; i.e. feet and inches, or minutes and seconds.
;;
;;
;; it appears we cannot trust writers to use `funny' quotes properly, hence we
;; neuter them to straight double or single quotes, which will then go through
;; disambiguation, based on adjacency to token boundaries.
;;
![«»] "
![‹›] '
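;;
;; for illustration (a hand-traced sketch, assuming rules apply in file order
;; and each rule rewrites every match): |«oui»| is first neutered to |"oui"|,
;; and the rules further down then yield |“oui”|; likewise, |‹oui›| becomes
;; |'oui'| and eventually |‘oui’|.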
;;
;; _fix_me_
;; in bio-medical texts we see names with double or triple apostrophes, e.g.
;; |Figure B''| or |(A–C'')| (presumably in a figure caption). clearly, the
;; LaTeX-style conventions are incompatible with such usage of the apostrophe,
;; and probably we should limit support for LaTeX-style quotes to the LaTeX
;; REPP module. at present, however, i doubt the ERG would do the right thing
;; for double-apostrophe inputs anyway, and a full analysis of |A'| and |A''|
;; could be expensive in terms of extra ambiguity. discuss this with dan, one
;; fine day. (13-mar-10; oe)
;;
;;
;; the ‘raw’ WSJ texts every now and again contain sentence-final quotes that
;; are preceded by whitespace (many of these look like quotes that actually
;; should be part of the following sentence, i.e. were moved across a sentence
;; boundary). for robustness, force directionality of quotes in sentence-
;; initial and final positions.
;;
!^([[({“‘' ]*)("|``) \1“
!^([[({“‘ ]*)'(?![0-9]{2}s?) \1‘
!("|'')([])}”’' ]*)$ ”\2
!'([])}”’ ]*)$ ’\1
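;;
;; for illustration (hand-traced, not verified against a REPP run): in
;; |He left . "| the final quote follows white space, so the general rules in
;; group #1 would read it as opening; the third rule above forces |”| instead.
;; the look-ahead in the second rule leaves decade abbreviations alone, so the
;; apostrophe in sentence-initial |'90s| ends up as |’| rather than |‘|.
;;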
#1
!(„|``) “
!(^| [[({“‘]*)("|'') \1“
!("|'') ”
!(‚|`) ‘
!(^| [[({“‘]*)'(?![0-9]{2}s?) \1‘
!' ’
#
>1
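;;
;; for illustration (hand-traced, assuming the rules in group #1 apply in the
;; order given): |He said "hi" and 'bye'.| becomes |He said “hi” and ‘bye’.|,
;; i.e. quotes preceded by white space open, and all remaining ones close.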