Collaborators:
Calvin Chan: 304144970
Simon Zou: 804347338
Ethan Schreiber: the iterator in HeapFile.java is taken from the lab 1 solution written by Ethan/the TA.
Exercise 1: Life of a SimpleDB query
Step 1: simpledb.Parser.main() and simpledb.Parser.start()
simpledb.Parser.main() is the entry point for the SimpleDB system. It calls simpledb.Parser.start(), which performs three main actions:
- It populates the SimpleDB catalog from the catalog text file provided by the user as an argument (Database.getCatalog().loadSchema(argv[0]);).
- For each table defined in the system catalog, it computes statistics over the table's data by calling TableStats.computeStatistics(), which runs TableStats s = new TableStats(tableid, IOCOSTPERPAGE); on each table
- It processes the statements submitted by the user (processNextStatement(new ByteArrayInputStream(statementBytes));)
Step 2: simpledb.Parser.processNextStatement()
This method takes two key actions:
- First, it gets a physical plan for the query by invoking handleQueryStatement((ZQuery)s);
- Then it executes the query by calling query.execute();
Step 3: simpledb.Parser.handleQueryStatement()
- First, it creates a new Query object
- The program then gets a logical plan using parseQueryLogicalPlan(), instantiates a physical plan from the logical plan, and, if the plan is valid, prints it out.
Step 4: simpledb.Parser.parseQueryLogicalPlan()
- First, the function gets all the tables in the FROM clause of the query and adds a scan for each one.
- Then, if there is a WHERE clause, it uses the processExpression function to add the filter and join conditions
- Then, if there is a GROUP BY clause, it checks that there is only one grouping field (multiple grouping fields are unsupported) and then does the grouping
- Then, it looks at the SELECT clause, picking out the aggregate fields and checking that the query is valid
- If there is an ORDER BY clause, the results of the query are sorted
- Finally, the logical plan is returned
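Putting these steps together, the overall call sequence looks roughly like this (a simplified sketch stitched together from the calls named above; the real Parser also starts and commits a transaction around each statement, as the Exercise 5 output shows):

    // Step 1: load the catalog and compute statistics for every table
    Database.getCatalog().loadSchema(argv[0]);
    TableStats.computeStatistics();

    // Step 2: for each statement the user submits...
    // Steps 3 and 4: parse it into a logical plan, pick a physical plan
    Query query = handleQueryStatement((ZQuery) s);
    // back in Step 2: execute the chosen plan
    query.execute();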
Exercise 2: IntHistogram.java
IntHistogram is implemented according to the spec, using an equi-width, bucket-based approach: an array of integers stores the count of values that fall in each bucket. Selectivity estimates follow the formulas given in the spec; avgSelectivity is not implemented. Private helper methods handle the equality and inequality operators, and the remaining operators are defined in terms of them. A sketch of this structure follows.
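A minimal sketch of the approach, assuming the constructor and method names from the lab spec (the real class takes a simpledb.Predicate.Op rather than a String operator; bucketOf and the helper names are ours):

    public class IntHistogramSketch {
        private final int[] counts;   // number of values per bucket
        private final int min, max;
        private final double width;   // width of each bucket (at least 1)
        private int ntups;            // total number of values added

        public IntHistogramSketch(int buckets, int min, int max) {
            this.counts = new int[buckets];
            this.min = min;
            this.max = max;
            this.width = Math.max(1.0, (max - min + 1) / (double) buckets);
        }

        private int bucketOf(int v) {
            return Math.min((int) ((v - min) / width), counts.length - 1);
        }

        public void addValue(int v) {
            counts[bucketOf(v)]++;
            ntups++;
        }

        // selectivity of (field = v): spread the bucket's count evenly
        // over its width, i.e. (height / width) / ntups
        private double eqSel(int v) {
            if (v < min || v > max) return 0.0;  // out-of-range values match nothing
            return (counts[bucketOf(v)] / width) / ntups;
        }

        // selectivity of (field > v): the part of v's bucket above v,
        // plus every bucket entirely to its right
        private double gtSel(int v) {
            if (v < min) return 1.0;
            if (v >= max) return 0.0;
            int b = bucketOf(v);
            double right = min + (b + 1) * width;  // exclusive right edge
            double sel = (counts[b] / (double) ntups)
                       * Math.max(0.0, (right - v - 1) / width);
            for (int i = b + 1; i < counts.length; i++)
                sel += counts[i] / (double) ntups;
            return sel;
        }

        // the remaining operators are defined in terms of the two above
        public double estimateSelectivity(String op, int v) {
            switch (op) {
                case "=":  return eqSel(v);
                case "<>": return 1.0 - eqSel(v);
                case ">":  return gtSel(v);
                case ">=": return gtSel(v) + eqSel(v);
                case "<":  return 1.0 - gtSel(v) - eqSel(v);
                case "<=": return 1.0 - gtSel(v);
                default:   throw new IllegalArgumentException(op);
            }
        }
    }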
Exercise 3: TableStats.java
The TableStats constructor makes use of several HashMaps. It first iterates through the whole given table to find the max and min of each field, storing them in two HashMaps that map field names to their max and min values. It then creates two more HashMaps mapping field names to histograms (one for ints, one for strings), and iterates through the file a second time to add each value to the appropriate histogram. The remaining methods are built on top of this: estimateSelectivity delegates to the histograms, while estimateTableCardinality and estimateScanCost use the formulas given in the spec. avgSelectivity is not implemented. The two formula-based methods are sketched below.
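A minimal sketch of the two formula-based methods, assuming fields the constructor fills in (numPages from the underlying HeapFile, totalTuples counted during the scan, and the ioCostPerPage the constructor was given):

    private int numPages;       // pages in the underlying HeapFile
    private int totalTuples;    // tuples counted during the scan
    private int ioCostPerPage;  // cost parameter given to the constructor

    public double estimateScanCost() {
        // a sequential scan reads every page of the heap file once
        return (double) numPages * ioCostPerPage;
    }

    public int estimateTableCardinality(double selectivityFactor) {
        // scale the table's tuple count by the predicate's selectivity
        return (int) (totalTuples * selectivityFactor);
    }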
Exercise 4: JoinOptimizer.java
For join cost estimation we assume a nested-loops join, and we translated the formula from the spec into code, using the parameters our function was given:
joincost(t1 join t2) = scancost(t1) + ntups(t1) x scancost(t2)   // IO cost
                     + ntups(t1) x ntups(t2)                     // CPU cost
In estimating the join cardinality, we set the default return value to 40% of the cross product: a bit higher than 30%, but still a reasonable estimate. Both estimators are sketched below.
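A minimal sketch of the two estimators, assuming the method signatures from the lab's JoinOptimizer (LogicalJoinNode and the pkey flags come from the SimpleDB codebase):

    public double estimateJoinCost(LogicalJoinNode j, int card1, int card2,
                                   double cost1, double cost2) {
        // nested-loops join: scan t1 once, rescan t2 per t1 tuple (IO),
        // then compare every pair of tuples (CPU)
        return cost1 + card1 * cost2          // IO cost
             + (double) card1 * card2;        // CPU cost
    }

    public int estimateJoinCardinality(LogicalJoinNode j, int card1, int card2,
                                       boolean t1pkey, boolean t2pkey) {
        // default guess: 40% of the cross product
        return (int) (0.4 * card1 * card2);
    }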
For join ordering, we followed the provided pseudocode to implement the function. We didn't run into many problems here, since the helper functions we were given did most of the heavy lifting. A rough sketch follows.
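A rough sketch of the Selinger-style dynamic program from the pseudocode, assuming the handout's helpers enumerateSubsets() and computeCostAndCardOfSubplan() and its PlanCache/CostCard classes (exact signatures may differ; joins, stats, and filterSelectivities are orderJoins' parameters):

    PlanCache pc = new PlanCache();
    for (int size = 1; size <= joins.size(); size++) {
        // consider every subset of the join nodes of this size
        for (Set<LogicalJoinNode> subset : enumerateSubsets(joins, size)) {
            double bestCost = Double.MAX_VALUE;
            CostCard best = null;
            // try each join in the subset as the one performed last,
            // reusing cached best subplans for the remainder
            for (LogicalJoinNode last : subset) {
                CostCard cc = computeCostAndCardOfSubplan(stats,
                        filterSelectivities, last, subset, bestCost, pc);
                if (cc != null) {
                    bestCost = cc.cost;
                    best = cc;
                }
            }
            if (best != null)
                pc.addPlan(subset, best.cost, best.card, best.plan);
        }
    }
    // the cached plan for the full join set is the chosen order
    return pc.getOrder(new HashSet<LogicalJoinNode>(joins));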
Exercise 5: IMDB database queries
We ran the following query on the 0.1% database:
SimpleDB> select d.fname, d.lname
SimpleDB> from Actor a, Casts c, Movie_Director m, Director d
SimpleDB> where a.id=c.pid and c.mid=m.mid and m.did=d.id
SimpleDB> and a.fname='John' and a.lname='Spicer';
Started a new transaction tid = 6
Added scan of table a
Added scan of table c
Added scan of table m
Added scan of table d
Added join between a.id and c.pid
Added join between c.mid and m.mid
Added join between m.did and d.id
Added select list field d.fname
Added select list field d.lname
[d:m, m:c, a:c]
PATH SO FAR = [d:m]
PATH SO FAR = [d:m, m:c]
PATH SO FAR = [d:m, a:c, m:c]
The query plan is:
π(d.fname,d.lname),card:3008
|
⨝(a.id=c.pid),card:3008
__________________________|___________________________
| |
σ(a.lname=Spicer),card:1 ⨝(m.mid=c.mid),card:3008
| ________________|_________________
σ(a.fname=John),card:1 | |
| ⨝(d.id=m.did),card:278 |
| _________|_________ |
| | | scan(Casts c)
scan(Actor a) scan(Director d) scan(Movie_Director m)
d.fname d.lname
------------------------
Chris Malazdrewicz
Thomas Parkinson
Alain Zaloum
3 rows.
Transaction 6 committed.
----------------
13.83 seconds
We ran the following query on the 1% database. It selects the movie names and director names for movies that
have both an actor named Ryan and a director named Ryan.
SimpleDB> select m.name, d.fname, d.lname from Movie m, Movie_Director md, Casts c, Actor a, Director d where md.mid=m.id and
SimpleDB> md.did=d.id and c.mid=m.id and c.pid=a.id and a.fname='Ryan' and d.fname='Ryan';
Started a new transaction tid = 7
Added scan of table m
Added scan of table md
Added scan of table c
Added scan of table a
Added scan of table d
Added join between md.mid and m.id
Added join between md.did and d.id
Added join between c.mid and m.id
Added join between c.pid and a.id
Added select list field m.name
Added select list field d.fname
Added select list field d.lname
[m:md, d:md, m:c, c:a]
PATH SO FAR = [m:md]
PATH SO FAR = [d:md, m:md]
PATH SO FAR = [m:c, d:md, m:md]
PATH SO FAR = [m:c, c:a, d:md, m:md]
The query plan is:
π(m.name,d.fname,d.lname),card:1
|
⨝(c.pid=a.id),card:1
____________________|____________________
| |
⨝(m.id=c.mid),card:29729 σ(a.fname=Ryan),card:1
_________________________|_________________________ |
| | |
⨝(d.id=md.did),card:2791 | |
________________|_________________ | |
| | | |
σ(d.fname=Ryan),card:1 ⨝(m.id=md.mid),card:2791 | |
| _______|________ | |
scan(Director d) | | | scan(Actor a)
scan(Movie m) scan(Movie_Director md) scan(Casts c)
m.name d.fname d.lname
-----------------------------------
'LTD.' Ryan Arnold
Boybrand: A Documentary Ryan McClure
Notes: We're not sure exactly how many hours we spent on this lab, but it was probably around 20-25. We had some trouble understanding what the first exercise was asking us to do, and we had some trouble implementing IntHistogram because we initially weren't accounting for values that did not fall into any bucket. We also ran into various errors (too many files open) while executing the IMDB queries, which turned out to be because we forgot to close our RandomAccessFile accessors in HeapFile.java. Our code currently passes all tests and systemtest, and we made no changes to the API.