-
Notifications
You must be signed in to change notification settings - Fork 22
/
glimpse.1
962 lines (962 loc) · 37.9 KB
/
glimpse.1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
.TH GLIMPSE l "November 10, 1997"
.SH NAME
\fIglimpse 4.1\fP - search quickly through entire file systems
.SH OVERVIEW
\fIGlimpse\fP (which stands for GLobal IMPlicit SEarch)
is a very popular UNIX indexing and query system that allows
you to search through a large set of files very quickly.
Glimpse supports most of \fIagrep\fP's options
(\fIagrep\fP is our powerful version of \fIgrep\fP)
including approximate matching (e.g., finding misspelled words),
Boolean queries, and even some limited forms of regular expressions.
It is used in the same way, except that you don't have to
specify file names.
So, if you are looking for a \fIneedle\fP
anywhere in your file system, all you have to do is say
\fIglimpse needle\fR
and all lines containing \fIneedle\fP will appear preceded
by the file name.
.LP
To use glimpse you first need to index your files with glimpseindex.
For example, \fIglimpseindex -o ~\fR will index everything at or below
your home directory. See man glimpseindex for more details.
.LP
Glimpse is also available for web sites,
as a set of tools called \fIWebGlimpse\fP.
(The old glimpseHTTP is no longer supported and is not recommended.)
See http://webglimpse.net/ for more
information.
.LP
Glimpse includes all of agrep and can be used instead of agrep
by giving a file name(s) at the end of the command.
This will cause glimpse to ignore the index and run agrep as usual.
For example, \fIglimpse -1 pattern file\fR is the same as \fIagrep -1
pattern file\fR.
Agrep is distributed as a self-contained package within glimpse,
and can be used separately.
We added a new option to agrep: -r searches recursively the
directory and everything below it (see agrep options below);
it is used only when glimpse reverts to agrep.
.LP
Mail [email protected] with SUBSCRIBE wgusers in the body to be added
to the Webglimpse users mailing list. This is now the location where glimpse
questions are also discussed.
Bugs can be reported at http://webglimpse.net/bugzilla/
HTML version of these manual pages
can be found in
http://webglimpse.net/docs/glimpsehelp.html
Also, see the glimpse home pages in
http://webglimpse.net/glimpse
.SH SYNOPSIS
.B glimpse
\- [almost all letters]
.I pattern
.SH INTRODUCTION
We start with simple ways to use glimpse and describe all the
options in detail later on.
Once an index is built, using glimpseindex,
searching for \fIpattern\fP is as easy as saying
.LP
\fIglimpse pattern\fR
.LP
The output of glimpse is similar to that of \fIagrep\fP (or any other
grep).
The pattern can be any agrep legal pattern including a regular
expression or a Boolean query (e.g., searching for Tucson AND Arizona
is done by \fIglimpse 'Tucson;Arizona'\fR).
.LP
The speed of glimpse depends mainly on the number and sizes
of the files that
contain a match and only to a second degree on the total size of all
indexed files. If the pattern is reasonably uncommon, then all
matches will be reported in a few seconds even if the indexed files
total 500MB or more. Some information on how glimpse works and a
reference to a detailed article are given below.
.LP
Most of agrep (and other grep's) options are supported, including
approximate matching. For example,
.LP
\fIglimpse -1 'Tuson;Arezona'\fR
.LP
will output all lines containing both patterns allowing
one spelling error in any of the patterns
(either insertion, deletion, or
substitution), which in this case is definitely needed.
.LP
\fIglimpse -w -i 'parent'\fR
.LP
specifies case insensitive (\-i) and match on complete words (\-w).
So 'Parent' and 'PARENT' will match, 'parent/child' will match,
but 'parenthesis' or 'parents' will not match.
(Starting at version 3.0, glimpse can be much faster when these
two options are specified, especially for very large indexes.
You may want to set an alias especially for "glimpse -w -i".)
.LP
The -F option provides a pattern that must match the file name.
For example,
.LP
\fIglimpse -F '\\.c$' needle\fR
.LP
will find the pattern \fIneedle\fP in all files whose name
ends with .c.
(Glimpse will first check its index to determine which files may
contain the pattern and then run agrep on the file names to further
limit the search.)
The -F option \fIshould not\fR be put at the end after the main pattern
(e.g., "glimpse needle -F hay" is incorrect).
.SH "A Detailed Description of All the Options of Glimpse"
.TP
.B \-\fI#\fP
\fI#\fP is an integer between 1 and 8
specifying the maximum number of errors
permitted in finding the approximate matches (the default is zero).
Generally, each insertion, deletion, or substitution counts as one error.
It is possible to adjust the relative cost of insertions,
deletions and substitutions (see -I -D and -S options).
Since the index stores only lower case characters, errors of
substituting upper case with lower case may be missed
(see LIMITATIONS).
Allowing errors in the match requires more time and can slow down
the match by a factor of 2-4.
Be very careful when specifying more than one error, as the number
of matches tend to grow very quickly.
.TP
.B \-a
prints attribute names. This option applies only to Harvest SOIF structured
data (used with glimpseindex -s).
(See http://harvest.sourceforge.net/ for more information
about the Harvest project.)
.TP
.B \-A
used for glimpse internals.
.TP
.B \-b
prints the byte offset (from the beginning of the file)
of the end of each match.
The first character in a file has offset 0.
.TP
.B \-B
Best match mode. (Warning: -B sometimes misses matches. It is safer
to specify the number of errors explicitly.)
When \-B is specified and no exact matches are found, glimpse
will continue to search until the closest matches (i.e., the ones
with minimum number of errors)
are found, at which point the following message will be shown:
"the best match contains x errors, there are y matches, output them? (y/n)"
This message refers to the number of matches found in the index.
There may be many more matches in the actual text (or there may be none
if -F is used to filter files).
When the \-#, \-c, or \-l options are specified, the \-B option is ignored.
In general, \-B may be slower than \-#, but not by very much.
Since the index stores only lower case characters, errors of
substituting upper case with lower case may be missed
(see LIMITATIONS).
.TP
.B \-c
Display only the count of matching records. Only files with count > 0
are displayed.
.TP
.B \-C
tells glimpse to send its queries to \fIglimpseserver\fP.
.TP
.B \-d "'\fIdelim\fP'"
Define \fIdelim\fP to be the separator between two records.
The default value is '$', namely a record is by default
a line.
\fIdelim\fP can be a string of size at most 8
(with possible use of ^ and $), but not
a regular expression.
Text between two \fIdelim\fP's, before the first \fIdelim\fP,
and after the last \fIdelim\fP is considered as one record.
For example, -d '$$' defines paragraphs as records and -d '^From\ '
defines mail messages as records.
\fIglimpse\fP matches each record separately.
\fBThis option does not currently work with regular expressions.\fP
The -d option is especially useful for Boolean AND queries,
because the patterns need not appear in the same line but in the
same record.
For example,
\fIglimpse -F mail -d '^From\ ' 'glimpse;arizona;announcement'\fR
will output all mail messages (in their entirety) that have
the 3 patterns anywhere in the message (or the header),
assuming that files with 'mail' in their name contain mail
messages.
If you want the scope of the record to be the whole file,
use the -W option.
\fBGlimpse warning\fP:
Use this option with care. If the delimiter is set to
match mail messages, for example, and glimpse finds the pattern
in a regular file, it may not find the delimiter and will
therefore output the whole file.
(The -t option - see below - can be used to put the \fIdelim\fP at
the end of the record.)
\fBPerformance Note:\fP
Agrep (and glimpse) resorts to more complex search when the \-d
option is used. The search is slower and unfortunately no more than
32 characters can be used in the pattern.
.TP
.B \-D\fIk\fP
Set the cost of a deletion to \fIk\fP (\fIk\fP is a positive integer).
This option does not currently work with regular expressions.
.TP
.BI \-e " pattern"
Same as a simple
.I pattern
argument, but useful when the
.I pattern
begins with a
.RB ` \- '.
.TP
.B \-E
prints the lines in the index (as they appear in the index)
which match the pattern. Used mostly for debugging and
maintenance of the index.
This is not an option that a user needs to know about.
.TP
.B \-f \fIfile_name\fR
this option has a different meaning for agrep than for glimpse:
In glimpse, only the files whose names are
listed in \fIfile_name\fP are matched.
(The file names have to appear as in .glimpse_filenames.)
In agrep, the file_name contains the list of the patterns that are searched.
(Starting at version 3.6, this option for glimpse is much faster
for large files.)
.TP
.B \-F \fIfile_pattern\fR
limits the search to those files whose name (including the whole
path) matches \fIfile_pattern\fP.
This option can be used in a variety of applications to provide
limited search even for one large index.
If \fIfile_pattern\fP matches a directory, then all files with this directory on
their path will be considered. To limit the search to actual file
names, use $ at the end of the pattern. \fIfile_pattern\fP can be a
regular expression and even a Boolean pattern.
This option is implemented by running agrep \fIfile_pattern\fP on the
list of file names obtained from the index. Therefore, searching the
index itself takes the same amount of time, but limiting the
second phase of the search to only a few files can speed up the
search significantly.
For example,
.sp 1
glimpse -F 'src#\\.c$' needle
.sp 1
will search for needle in all .c files with src somewhere along the
path.
The -F \fIfile_pattern\fP must appear before the search pattern
(e.g., glimpse needle -F '\\.c$' will not work).
It is possible to use some of agrep's options when
matching file names. In this case all options as well as the
file_pattern should be in quotes. (-B and -v do not work very well
as part of a file_pattern.)
For example,
.sp
glimpse -F '-1 \\.html' pattern
.sp
will allow one spelling error when matching .html to the file names
(so ".htm" and ".shtml" will match as well).
.sp
glimpse -F '-v \\.c$' counter
.sp
will search for 'counter' in all files \fIexcept\fP for .c files.
.TP
.B \-g
prints the file number (its position in the .glimpse_filenames
file) rather than its name.
.TP
.B \-G
Output the (whole) files that contain a match.
.TP
.B \-h
Do not display filenames.
.TP
.B \-H \fIdirectory_name\fR
searches for the index and the other .glimpse files in
\fIdirectory_name\fP. The default is the home directory.
This option is useful, for example, if several different indexes are
maintained for different archives (e.g., one for mail messages, one
for source code, one for articles).
.TP
.B \-i
Case-insensitive search \(em e.g., "A" and "a" are considered equivalent.
Glimpse's index stores all patterns in lower case (see LIMITATIONS below).
\fBPerformance Note:\fP
When \-i is used together with the \-w option,
the search may become much faster.
It is recommended to have \-i and \-w as defaults, for example,
through an alias. We use the following alias in our .cshrc file
.br
alias glwi 'glimpse -w -i'
.TP
.B \-I\fIk\fP
Set the cost of an insertion to \fIk\fP (\fIk\fP is a positive integer).
This option does not currently work with regular expressions.
.TP
.B \-j
If the index was constructed with the -t option, then \-j
will output the files last modification dates in addition to
everything else.
There are no major performance penalties for this option.
.TP
.B \-J \fIhost_name\fP
used in conjunction with glimpseserver (\-C) to connect to
one particular server.
.TP
.B \-k
No symbol in the pattern is treated as a meta character.
For example, glimpse -k 'a(b|c)*d' will find
the occurrences of a(b|c)*d whereas glimpse 'a(b|c)*d'
will find substrings that match the regular expression 'a(b|c)*d'.
(The only exception is ^ at the beginning of the pattern and $ at the
end of the pattern, which are still interpreted in the usual way.
Use \\^ or \\$ if you need them verbatim.)
.TP
.B \-K \fIport_number\fP
used in conjunction with glimpseserver (\-C) to connect to
one particular server at the specified TCP port number.
.TP
.B \-l
Output only the files names that contain a match.
This option differs from the \-N option in that the files
themselves \fIare\fP searched, but the matching lines are
not shown.
.TP
.B \-L x | x:y | x:y:z
if one number is given, it is a limit on the total number of matches.
Glimpse outputs only the first x matches.
If \-l is used (i.e., only file names
are sought), then the limit is on the number of files;
otherwise, the limit is on the number of records.
If two numbers are given (x:y), then y is an added limit on the total
number of files.
If three numbers are given (x:y:z), then z is an added limit on the
number of matches per file.
If any of the x, y, or z is set to 0, it means to ignore it
(in other words 0 = infinity in this case); for example,
\-L 0:10 will output all matches to the first 10 files that
contain a match.
This option is particularly
useful for servers that needs to limit the amount of
output provided to clients.
.TP
.B \-m
used for glimpse internals.
.TP
.B \-M
used for glimpse internals.
.TP
.B \-n
Each matching record (line) is prefixed by its record (line) number in the file.
\fBPerformance Note:\fP
To compute the record/line number, agrep needs to search for all
record delimiters (or line breaks), which can slow down the search.
.TP
.B \-N
searches only the index (so the search is faster).
If -o or -b are used then the result is the number of files
that have a potential match plus a prompt to ask if you want to
see the file names.
(If \-y is used, then there is no prompt and the names of the
files will be shown.)
This could be a way to get the matching file names without even having
access to the files themselves.
However, because only the index is searched, some potential matches
may not be real matches.
In other words, with \-N you will not miss any file but you may get
extra files.
For example, since the index stores everything in lower case,
a case-sensitive query may match a file that has only a case-insensitive
match. Boolean queries may match a file that has all the keywords
but not in the same line (indexing with \-b allows glimpse to
figure out whether the keywords are close, but it cannot figure out
from the index whether they are exactly on the same line or in the same record
without looking at the file).
If the index was not build with \-o or \-b, then this option
outputs the number of \fIblocks\fP matching the pattern.
This is useful as an indication of how long the search will take.
All files are partitioned into usually 200-250 blocks.
The file \fB.glimpse_statistics\fP contains the total number of blocks
(or \fBglimpse -N a\fP will give a pretty good estimate; only blocks with no
occurrences of 'a' will be missed).
.TP
.B \-o
the opposite of \-t: the delimiter
is not output at the tail, but at the beginning of the matched record.
.TP
.B \-O
the file names are not printed before every matched record;
instead, each filename is printed just once,
and all the matched records within it are printed after it.
.TP
.B \-p
(from version 4.0B1 only) Supports reading compressed
set of filenames.
The -p option allows you to utilize compressed `neighborhoods'
(sets of filenames) to limit your search, without uncompressing them.
Added mostly for WebGlimpse.
The usage is:
.br
"-p filename:X:Y:Z"
where "filename" is the file with compressed neighborhoods, X is an
offset into that file (usually 0, must be a multiple of sizeof(int)),
Y is the length glimpse must access from that file (if 0, then whole file;
must be a multiple of sizeof(int)), and Z must be 2 (it indicates
that "filename" has the sparse-set representation of compressed
neighborhoods: the other values are for internal use only). Note that
any colon ":" in filename must be escaped using a backslash \.
.TP
.B \-P
used for glimpse internals.
.TP
.B \-q
prints the offsets of the beginning and end of each matched record.
The difference between \-q and \-b is that \-b prints the offsets
of the actual matched string, while \-q prints the offsets of the
whole record where the match occurred.
The output format is @x{y}, where x is the beginning offset
and y is the end offset.
.TP
.B \-Q
when used together with \-N glimpse not only displays the filename where
the match occurs, but the exact occurrences (offsets) as seen in the
index. This option is relevant only if the index was built
with -b; otherwise, the offsets are not available in the index.
This option is ignored when used not with \-N.
.TP
.B \-r
This option is an agrep option and it will be ignored in glimpse,
unless glimpse is used with a file name at the end which makes it
run as agrep.
If the file name is a directory name, the \-r option will search
(recursively) the whole directory and everything below it.
(The glimpse index will not be used.)
.TP
.B \-R \fIk\fP
defines the maximum size (in bytes) of a record.
The maximum value (which is the default) is 48K.
Defining the maximum to be lower than the deafult may speed
up some searches.
.TP
.B \-s
Work silently, that is, display nothing except error messages.
This is useful for checking the error status.
.TP
.B \-S\fIk\fP
Set the cost of a substitution to \fIk\fP (\fIk\fP is a positive integer).
This option does not currently work with regular expressions.
.TP
.B \-t
Similar to the \-d option, except that the delimiter is assumed
to appear at the \fIend\fP of the record.
Glimpse will output the record starting from the end of
.I delim
to (and including) the next
.I delim.
(See warning for the \-d option.)
.TP
.B \-T directory
Use \fIdirectory\fP as a place where temporary files are built.
(Glimpse produces some small temporary files usually in /tmp.)
This option is useful mainly in the context of structured queries
for the Harvest project, where the temporary files may be non-trivial,
and the /tmp directory may not have enough space for them.
.TP
.B \-U
(starting at version 4.0B1) Interprets an index created
with the -X or the -U option in glimpseindex.
Useful mostly for WebGlimpse or similar web applications.
When glimpse outputs matches, it
will display the filename, the URL, and the title automatically.
.TP
.B \-v
(This option is an agrep option and it will be ignored in glimpse,
unless glimpse is used with a file name at the end which makes it
run as agrep.)
Output all records/lines that do \fInot\fP contain a match.
(Glimpse does not support the NOT operator yet.)
.TP
.B \-V
prints the current version of glimpse.
.TP
.B \-w
Search for the pattern as a word \(em i.e., surrounded by non-alphanumeric
characters. For example,
\fIglimpse -w car\fR will match car, but not characters and not
car10.
The non-alphanumeric \fImust\fP
surround the match; they cannot be counted as errors.
This option does not work with regular expressions.
\fBPerformance Note:\fP
When \-w is used together with the \-i option,
the search may become much faster.
The \-w will not work with $, ^, and _ (see BUGS below).
It is recommended to have \-i and \-w as defaults, for example,
through an alias. We use the following alias in our .cshrc file
.br
alias glwi 'glimpse -w -i'
.TP
.B \-W
The default for Boolean AND queries is that they cover one record
(the default for a record is one line) at a time.
For example, glimpse 'good;bad' will output all lines containing
both 'good' and 'bad'.
The \-W option changes the scope of Booleans to be the whole file.
Within a file glimpse will output all matches to any of the patterns.
So, glimpse -W 'good;bad' will output all lines containing 'good'
\fIor\fP 'bad', but only in files that contain both patterns.
The NOT operator '~' can be used only with \-W.
It is described later on.
The OR operator is essentially unaffected (unless it is
in combination with the other Boolean operations).
For structured queries, the scope is always the whole attribute
or file.
.TP
.B \-x
The pattern must match the whole line.
(This option is translated to -w when the index is searched
and it is used only when the actual text is searched.
It is of limited use in glimpse.)
.TP
.B \-X
(from version 4.0B1 only) Output the names of files that
contain a match even if these files have been deleted since the
index was built.
Without this option glimpse will simply ignore these files.
.TP
.B \-y
Do not prompt.
Proceed with the match as if the answer to any prompt is y.
Servers (or any other scripts) using glimpse will probably want
to use this option.
.TP
.B \-Y \fIk\fP
If the index was constructed with the -t option, then \-Y x
will output only matches to files that were created or
modified within the last x days.
There are no major performance penalties for this option.
.TP
.B \-z
Allow customizable filtering, using the file .glimpse_filters
to perform the programs listed there for each match.
The best example is
compress/decompress. If .glimpse_filters include the line
.br
*.Z uncompress <
.br
(separated by tabs)
then before indexing any file that matches the pattern "*.Z" (same
syntax as the one for .glimpse_exclude) the command listed is
executed first (assuming input is from stdin, which is why uncompress
needs <) and its output (assuming it goes to stdout) is indexed.
The file itself is not changed (i.e., it stays compressed).
Then if glimpse -z is used, the same program is used on these files
on the fly. Any program can be used (we run 'exec'). For example,
one can filter out parts of files that should not be indexed.
Glimpseindex tries to apply all filters in .glimpse_filters in the
order they are given.
For example, if you want to uncompress a file and then extract
some part of it, put the compression command (the example above)
first and then another line that specifies the extraction.
Note that this can slow down the search because the filters need to
be run before files are searched.
(See also glimpseindex.)
.TP
.B \-Z
No op. (It's useful for glimpse's internals. Trust us.)
.LP
The characters
.RB ` $ ',
.RB `^ ',
.RB ` \(** ',
.RB ` [ ' ,
.RB ` ] ' ,
.RB ` \s+2^\s0 ',
.RB ` | ',
.RB ` ( ',
.RB ` ) ',
.RB ` ! ',
and
.RB ` \e '
can cause unexpected results when included in the
.IR pattern ,
as these characters are also meaningful
to the shell. To avoid these problems, enclose the entire
pattern in single quotes, i.e., 'pattern'.
Do not use double quotes (").
.ne 4
.SH PATTERNS
.LP
\fIglimpse\fP
supports a large variety of patterns, including simple
strings, strings with classes of characters, sets of strings,
wild cards, and regular expressions (see LIMITATIONS).
.TP
\fBStrings \fP
Strings are any sequence of characters, including the special symbols
`^' for beginning of line and `$' for end of line.
The following special characters (
.RB ` $ ',
.RB `^ ',
.RB ` \(** ',
.RB ` [ ' ,
.RB ` \s+2^\s0 ',
.RB ` | ',
.RB ` ( ',
.RB ` ) ',
.RB ` ! ',
and
.RB ` \e '
)
as well as the following meta characters special to glimpse (and agrep):
.RB ` ; ',
.RB ` , ',
.RB ` # ',
.RB ` < ',
.RB ` > ',
.RB ` - ',
and
.RB ` . ',
should be preceded by `\\' if they are to be matched as regular
characters. For example, \\^abc\\\\ corresponds to the string ^abc\\,
whereas ^abc corresponds to the string abc at the beginning of a
line.
.TP
\fBClasses of characters\fP
a list of characters inside [] (in order) corresponds to any character
from the list. For example, [a-ho-z] is any character between a and h
or between o and z. The symbol `^' inside [] complements the list.
For example, [^i-n] denote any character in the character set except
character 'i' to 'n'.
The symbol `^' thus has two meanings, but this is consistent with
egrep.
The symbol `.' (don't care) stands for any symbol (except for the
newline symbol).
.TP
\fBBoolean operations\fP
.B Glimpse
supports an `AND' operation denoted by the symbol `;'
an `OR' operation denoted by the symbol `,',
a limited version of a 'NOT' operation (starting at version 4.0B1)
denoted by the symbol `~',
or any combination.
For example,
\fIglimpse 'pizza;cheeseburger'\fR will output all lines containing
both patterns.
\fIglimpse -F 'gnu;\\.c$' 'define;DEFAULT'\fR
will output all lines containing both 'define' and 'DEFAULT'
(anywhere in the line, not necessarily in order) in
files whose name contains 'gnu' and ends with .c.
\fIglimpse '{political,computer};science'\fR will match 'political science'
or 'science of computers'.
The NOT operation works only together with the -W option and it is
generally applies only to the whole file rather to individual records.
Its output may sometimes seem counterintuitive.
Use with care.
\fIglimpse -W 'fame;~glory'\fR will output all lines containing 'fame'
in all files that contain 'fame' but do not contain 'glory';
This is the most common use of NOT, and in this case it works
as expected.
\fIglimpse -W '~{fame;glory}'\fR will be limited to files that do
not contain both words, and will output all lines containing one
of them.
.TP
\fBWild cards\fP
The symbol '#' is used to denote a sequence
of any number (including 0)
of arbitrary characters (see LIMITATIONS).
The symbol # is equivalent to .* in egrep.
In fact, .* will work too, because it is a valid regular expression
(see below), but unless this is part of an actual regular expression,
# will work faster.
(Currently glimpse is experiencing some problems with #.)
.TP
\fBCombination of exact and approximate matching\fP
Any pattern inside angle brackets <> must match the text exactly even
if the match is with errors. For example, <mathemat>ics matches
mathematical with one error (replacing the last s with an a), but
mathe<matics> does not match mathematical no matter how many errors are
allowed.
(This option is buggy at the moment.)
.TP
\fBRegular expressions\fP
Since the index is word based, a regular expression must match
words that appear in the index for glimpse to find it.
Glimpse first strips the regular expression from all non-alphabetic
characters, and searches the index for all remaining words.
It then applies the regular expression matching algorithm to the
files found in the index.
For example, \fIglimpse\fP 'abc.*xyz' will search the index
for all files that contain both 'abc' and 'xyz', and then
search directly for 'abc.*xyz' in those files.
(If you use glimpse \-w 'abc.*xyz', then 'abcxyz' will not be found,
because glimpse
will think that abc and xyz need to be matches to whole words.)
The syntax of regular expressions in \fBglimpse\fP is in general the same as
that for \fBagrep\fP. The union operation `|', Kleene closure `*',
and parentheses () are all supported.
Currently '+' is not supported.
Regular expressions are currently limited to approximately 30
characters (generally excluding meta characters). Some options
(\-d, \-w, \-t, \-x, \-D, \-I, \-S) do not
currently work with regular expressions.
The maximal number of errors for regular expressions that use '*'
or '|' is 4. (See LIMITATIONS.)
.TP
\fBstructured queries\fP
Glimpse supports some form of structured queries using Harvest's SOIF
format. See STRUCTURED QUERIES below for details.
.SH EXAMPLES
.LP
(Run "glimpse '^glimpse' this-file" to get a list of all examples, some
of which were given earlier.)
.TP
glimpse -F 'haystack.h$' needle
finds all needles in all haystack.h's files.
.TP
glimpse -2 -F html Anestesiology
outputs all occurrences of Anestesiology with two errors in files with
html somewhere in their full name.
.TP
glimpse -l -F '\\.c$' variablename
lists the names of all .c files that contain variablename
(the -l option lists file names rather than output the matched lines).
.TP
glimpse -F 'mail;1993' 'windsurfing;Arizona'
finds all lines containing \fIwindsurfing\fP and \fIArizona\fP in all files
having `mail' and '1993' somewhere in their full name.
.TP
glimpse -F mail 't.j@#uk'
finds all mail addresses (search only files with mail somewhere in
their name) from the uk, where the login name ends with
t.j, where the . stands for any one character.
(This is very useful to find a login name of someone whose middle name
you don't know.)
.TP
glimpse -F mbox -h -G . > MBOX
concatenates all files whose name matches `mbox' into one big one.
.SH "SEARCHING IN COMPRESSED FILES"
.LP
Glimpse includes an optional new compression program,
called \fIcast\fP,
which allows glimpse (and agrep) to search the compressed files
without having to decompress them. The search is actually significantly
faster when the files are compressed. However, we have not tested
\fIcast\fP as thoroughly as we would have liked, and a mishap in
a compression algorithm can cause loss of data, so we recommend
at this point to use \fIcast\fP very carefully.
We do not support or maintain cast.
(Unless you specifically use \fIcast\fP, the default is to ignore it.)
.SH "GLIMPSEINDEX FILES"
.LP
All files used by glimpse are located at the directory(ies) where
the index(es) is (are) stored and have .glimpse_ as a prefix.
The first two files (.glimpse_exclude and .glimpse_include) are
optionally supplied by the user. The other files are built and
read by glimpse.
.LP
.IP "\fB.glimpse_exclude\fR"
contains a list of files that glimpseindex is explicitly told to ignore.
In general, the syntax of .glimpse_exclude/include is the same as
that of agrep (or any other grep). The lines in the .glimpse_exclude
file are matched to the file names, and if they match, the files
are excluded. Notice that agrep matches to parts of the string!
e.g., agrep /ftp/pub will match /home/ftp/pub and /ftp/pub/whatever.
So, if you want to exclude /ftp/pub/core, you just list
it, as is, in the .glimpse_exclude file.
If you put "/home/ftp/pub/cdrom" in .glimpse_exclude, every file
name that matches that string will be excluded, meaning all files
below it.
You can use ^ to indicate the beginning of a file name, and $ to
indicate the end of one, and you can use * and ? in the usual way.
For example /ftp/*html will exclude /ftp/pub/foo.html, but will
also exclude /home/ftp/pub/html/whatever; if you want to exclude
files that start with /ftp and end with html use ^/ftp*html$
Notice that putting a * at the beginning or at the end is redundant
(in fact, in this case glimpseindex will remove the * when it
does the indexing).
No other meta characters are allowed in .glimpse_exclude
(e.g., don't use .* or # or |).
Lines with * or ? must have no more than 30 characters.
Notice that, although the index itself will not be indexed,
the list of file names (.glimpse_filenames) will be indexed
unless it is explicitly listed in .glimpse_exclude.
.IP "\fB.glimpse_filters\fR"
See the description above for the -z option.
.IP "\fB.glimpse_include\fR"
contains a list of files that glimpseindex
is explicitly told to \fIinclude\fP in the index even though they may look
like non-text files. Symbolic links are followed by glimpseindex
only if they are specifically included here.
If a file is in both .glimpse_exclude and .glimpse_include it will be
excluded.
.IP "\fB.glimpse_filenames\fP"
contains the list of all indexed file names, one per line.
This is an ASCII file that can also be used with agrep to search
for a file name leading to a fast find command.
For example,
.br
glimpse 'count#\\.c$' ~/.glimpse_filenames
.br
will output the names of all (indexed) .c files that have 'count' in
their name (including anywhere on the path from the index).
Setting the following alias in the .login file may be useful:
.br
alias findfile 'glimpse -h \!:1 ~/.glimpse_filenames'
.IP ".\fBglimpse_index\fP"
contains the index. The index consists of lines, each starting with a
word followed by a list of block numbers (unless the -o or -b options
are used, in which case each word is followed by an offset into
the file .glimpse_partitions where all pointers are kept).
The block/file numbers are stored in binary form, so this is not an ASCII file.
.IP "\fB.glimpse_messages\fP"
contains the output of the -w option (see above).
.IP "\fB.glimpse_partitions\fP"
contains the partition of the indexed space into blocks
and, when the index is built with the -o or -b options, some part of the
index. This file is used internally by glimpse and it is
a non-ASCII file.
.IP "\fB.glimpse_statistics\fP"
contains some statistics about the makeup of the index. Useful for
some advanced applications and customization of glimpse.
.IP "\fB.glimpse_turbo\fP"
An added data structure (used under glimpseindex -o or -b only)
that helps to speed up
queries significantly for large indexes. Its size is 0.25MB.
Glimpse will work without it if needed.
.SH "STRUCTURED QUERIES"
Glimpse can search for Boolean combinations of "attribute=value" terms
by using the Harvest SOIF parser library (in glimpse/libtemplate).
To search this way, the index must be made by using the -s option of
glimpseindex (this can be used in conjunction with other glimpseindex
options). For glimpse and glimpseindex to recognize "structured" files,
they must be in SOIF format. In this format, each value is prefixed by
an attribute-name with the size of the value (in bytes) present in "{}"
after the name of the attribute.
For example, The following lines are part of an SOIF file:
.br
.nf
type{17}: Directory-Listing
md5{32}: 3858c73d68616df0ed58a44d306b12ba
.fi
Any string can serve as an attribute name.
Glimpse "pattern;type=Directory-Listing" will search for "pattern"
only in files whose type is "Directory-Listing".
The file itself is considered to be
one "object" and its name/url appears as the first attribute with an
"@" prefix; e.g.,
@FILE { http://xxx... }
The scope of Boolean operations changes from records
(lines) to whole files when structured queries are used in glimpse
(since individual query terms can look at different attributes and they
may not be "covered" by the record/line). Note that glimpse can only
search for patterns in the value parts of the SOIF file: there are some
attributes (like the TTL, MD5, etc.) that are interpreted by Harvest's
internal routines.
See RFC 2655 for more detailed information of the SOIF format.
.SH "REFERENCES"
.IP 1.
U. Manber and S. Wu,
"GLIMPSE: A Tool to Search Through Entire File Systems,"
\fIUsenix Winter 1994 Technical Conference\fP
(best paper award),
San Francisco (January 1994), pp. 23\-32.
Also, Technical Report #TR 93-34, Dept. of Computer Science,
University of Arizona, October 1993 (a postscript file
is available by anonymous ftp at
ftp://webglimpse.net/pub/glimpse/TR93-34.ps).
.IP 2.
S. Wu and U. Manber,
"Fast Text Searching Allowing Errors,"
\fICommunications of the ACM\fP
\fB35\fP (October 1992), pp. 83\-91.
.SH "SEE ALSO"
.BR agrep (1),
.BR ed (1),
.BR ex (1),
.BR glimpseindex (1),
.BR glimpseserver (1),
.BR grep (1),
.BR sh (1),
.BR csh (1).
.SH LIMITATIONS
.LP
The index of glimpse is word based. A pattern that contains more than
one word cannot be found in the index. The way glimpse overcomes this
weakness is by splitting any multi-word pattern into its set of words
and looking for all of them in the index.
For example, \fBglimpse 'linear programming'\fR will first consult the index
to find all files containing both \fIlinear\fP and \fIprogramming\fP,
and then apply agrep to find the combined pattern.
This is usually an effective solution, but it can be slow for
cases where both words are very common, but their combination is not.
.LP
As was mentioned in the section on PATTERNS above, some characters
serve as meta characters for glimpse and need to be
preceded by '\\' to search for them. The most common
examples are the characters '.' (which stands for a wild card),
and '*' (the Kleene closure).
So, "glimpse ab.de" will match abcde, but "glimpse ab\\.de"
will not, and "glimpse ab*de" will not match ab*de, but
"glimpse ab\\*de" will.
The meta character - is translated automatically to a hypen
unless it appears between [] (in which case it denotes a range of
characters).
.LP
The index of glimpse stores all patterns in lower case.
When glimpse searches the index it first converts
all patterns to lower case, finds the appropriate files,
and then searches the actual files using the original
patterns.
So, for example, \fIglimpse ABCXYZ\fR will first find all
files containing abcxyz in any combination of lower and upper
cases, and then searches these files directly, so only the
right cases will be found.
One problem with this approach is discovering misspellings
that are caused by wrong cases.
For example, \fIglimpse -B abcXYZ\fR will first search the
index for the best match to abcxyz (because the pattern is
converted to lower case); it will find that there are matches
with no errors, and will go to those files to search them
directly, this time with the original upper cases.
If the closest match is, say AbcXYZ, glimpse may miss it,
because it doesn't expect an error.
Another problem is speed. If you search for "ATT", it will look
at the index for "att". Unless you use -w to match the whole word,
glimpse may have to search all files containing, for example, "Seattle"
which has "att" in it.
.LP
There is no size limit for simple patterns and simple patterns
within Boolean expressions.
More complicated patterns, such as regular expressions,
are currently limited to approximately 30 characters.
Lines are limited to 1024 characters.
Records are limited to 48K, and may be truncated if they are larger
than that.
The limit of record length can be
changed by modifying the parameter Max_record in agrep.h.
.LP
Glimpseindex does not index words of size > 64.
.SH BUGS
.LP
In some rare cases, regular expressions using * or # may not match correctly.
.LP
A query that contains no alphanumeric characters is not
recommended (unless glimpse is used as agrep and the file names
are provided). This is an understatement.
.LP
The notion of "match to the whole word" (the \-w option) can be tricky
sometimes. For example, glimpse -w 'word$' will not match 'word'
appearing at the end of a line, because the extra '$' makes the pattern
more than just one simple word.
The same thing can happen with ^ and with _.
To be on the safe side,
use the -w option only when the patterns are actual words.
.LP
Please send bug reports or comments to [email protected].
.SH DIAGNOSTICS
Exit status is 0 if any matches are found,
1 if none, 2 for syntax errors or inaccessible files.
.SH AUTHORS
Udi Manber and Burra Gopal, Department of Computer Science,
University of Arizona, and Sun Wu, the National Chung-Cheng University,
Taiwan. Now maintained by Golda Velez at Internet WorkShop
(Email: [email protected])