-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
203 lines (148 loc) · 7.58 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
= External
Indexing and array-like access to data stored on disk rather than in memory.
== Description
External provides a way to index and access array data directly from a file
without loading it into memory. Indexes may be cached in memory or stored
on disk with the data file, in essence giving you arbitrarily large arrays.
Externals automatically chunk and buffer methods like <tt>each</tt> so that
the memory footprint remains low even during enumeration.
The main External classes are:
* ExternalIndex -- for formatted binary data
* ExternalArchive -- for string data
* ExternalArray -- for objects (serialized as YAML)
The array-like behavior of these classes is developed using modified versions
of the RubySpec[http://rubyspec.org] specification for Array. The idea is to
eventually duck-type all Array methods, including sort and collect, with
acceptable performance.
* Rubyforge[http://rubyforge.org/projects/external]
* Lighthouse[http://bahuvrihi.lighthouseapp.com/projects/10590-external]
* Github[http://github.com/bahuvrihi/external/tree/master]
==== Bugs/Known Issues
* only a limited set of array methods are currently supported
* currently only [] and []= are fully tested vs RubySpec
* documentation is patchy
Note also that YAML dump/load of some objects doesn't work or doesn't
reproduce the object; such objects will not be properly stored in an
ExternalArray. Problematic objects include:
Proc and Class:
block = lambda {}
YAML.load(YAML.dump(block)) # !> TypeError: allocator undefined for Proc
YAML.dump(Object) # !> TypeError: can't dump anonymous class Class
Carriage returns ("\r"):
YAML.load(YAML.dump("\r")) # => nil
YAML.load(YAML.dump("\r\n")) # => ""
YAML.load(YAML.dump("string with \r\n inside")) # => "string with \n inside"
Chains of newlines ("\n"):
YAML.load(YAML.dump("\n")) # => ""
YAML.load(YAML.dump("\n\n")) # => ""
DateTime is loaded as Time:
YAML.load(YAML.dump(DateTime.now)).class # => Time
== Usage
=== ExternalArray
ExternalArray can be initialized from data using the [] operator and used like
an array.
a = ExternalArray['str', {'key' => 'value'}]
a[0] # => 'str'
a.last # => {'key' => 'value'}
a << [1,2]; a.to_a # => ['str', {'key' => 'value'}, [1,2]]
ExternalArray serializes and stores entries to an io while building an io_index
that tracks the start and length of each entry. By default ExternalArray
will serialize to a Tempfile and use an Array as the io_index:
a.io.class # => Tempfile
a.io.rewind; a.io.read # => "--- str\n--- \nkey: value\n--- \n- 1\n- 2\n"
a.io_index.class # => Array
a.io_index.to_a # => [[0, 8], [8, 16], [24, 13]]
To save this data more permanently, provide a path to <tt>close</tt>; the tempfile
is moved to the path and a binary index file will be created:
a.close('example.yml')
File.read('example.yml') # => "--- str\n--- \nkey: value\n--- \n- 1\n- 2\n"
index = File.read('example.index')
index.unpack('I*') # => [0, 8, 8, 16, 24, 13]
ExternalArray provides <tt>open</tt> to create ExternalArrays from an existing
file; the instance will use an index file if it exists and automatically
reindex the data if it does not. Manual calls to reindex may be necessary when
you initialize an ExternalArray with <tt>new</tt> instead of <tt>open</tt>:
# use of an existing index file
ExternalArray.open('example.yml') do |b|
File.basename(b.io_index.io.path) # => 'example.index'
b.to_a # => ['str', {'key' => 'value'}, [1,2]]
end
# automatic reindexing
FileUtils.rm('example.index')
ExternalArray.open('example.yml') do |b|
b.to_a # => ['str', {'key' => 'value'}, [1,2]]
end
# manual reindexing
file = File.open('example.yml')
c = ExternalArray.new(file)
c.to_a # => []
c.reindex
c.to_a # => ['str', {'key' => 'value'}, [1,2]]
=== ExternalArchive
ExternalArchive is exactly like ExternalArray except that it only stores
strings (ExternalArray is actually a subclass of ExternalArchive which
dumps/loads strings).
arc = ExternalArchive["swift", "brown", "fox"]
arc[2] # => "fox"
arc.to_a # => ["swift", "brown", "fox"]
arc.io.rewind; arc.io.read # => "swiftbrownfox"
ExternalArchive is useful as a base for classes to access archival data.
Here is a simple parser for FASTA[http://en.wikipedia.org/wiki/Fasta_format]
data:
# A sample FASTA entry
# >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
# LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
# EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
# LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
# GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
# IENY
class FastaEntry
attr_reader :header, :body
def initialize(str)
@body = str.split(/\r?\n/)
@header = body.shift
end
end
class FastaArchive < ExternalArchive
def str_to_entry(str); FastaEntry.new(str); end
def entry_to_str(entry); ([entry.header] + entry.body).join("\n"); end
def reindex
reindex_by_sep('>', :entry_follows_sep => true)
end
end
require 'open-uri'
fasta = FastaArchive.new open('http://external.rubyforge.org/doc/tiny_fasta.txt')
fasta.reindex
fasta.length # => 5
fasta[0].body # => ["MEVNILAFIATTLFVLVPTAFLLIIYVKTVSQSD"]
The non-redundant {NCBI protein database}[ftp://ftp.ncbi.nih.gov/blast/db/FASTA/]
contains greater than 7 million FASTA entries in a 3.56 GB file; ExternalArchive
is targeted at files that size, where lazy loading of data and a small memory
footprint are critical.
=== ExternalIndex
ExternalIndex provides array-like access to formatted binary data. The index of an
uncached ExternalArray is an ExternalIndex configured for binary data like 'II'; two
integers corresponding to the start position and length an entry.
index = ExternalIndex[1, 2, 3, 4, 5, 6, {:format => 'II'}]
index.format # => 'I*'
index.frame # => 2
index[1] # => [3,4]
index.to_a # => [[1,2], [3,4], [5,6]]
ExternalIndex handles arbitrary packing formats, opening many possibilities:
Tempfile.new('sample.txt') do |file|
file << [1,2,3].pack("IQS")
file << [4,5,6].pack("IQS")
file << [7,8,9].pack("IQS")
file.flush
index = ExternalIndex.new(file, :format => "IQS")
index[1] # => [4,5,6]
index.to_a # => [[1,2,3], [4,5,6], [7,8,9]]
end
== Installation
External is available from RubyForge[http://rubyforge.org/projects/external]. Use:
% gem install external
== Info
Copyright (c) 2006-2008, Regents of the University of Colorado.
Developer:: {Simon Chiang}[http://bahuvrihi.wordpress.com], {Biomolecular Structure Program}[http://biomol.uchsc.edu/], {Hansen Lab}[http://hsc-proteomics.uchsc.edu/hansenlab/]
Support:: CU Denver School of Medicine Deans Academic Enrichment Fund
Licence:: {MIT-Style}[link:files/MIT-LICENSE.html]