(.s) signifies that there are either .s or .gz versions of these files in the gz/ subfolder, which can be opened with the Python gzip module or Unix zcat.
da0 is the only server with these .s/.gz files.
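For example, one of these files can be read from Python like this (a minimal sketch; the path is one of the a2c basemap files referenced below, and the author;commit line format follows the a2c examples later in this document):

import gzip

# .s/.gz basemap files are gzip-compressed; open as text and iterate lines
with gzip.open("/da0_data/basemaps/gz/a2cFullP0.s", "rt") as f:
    for line in f:
        author, commit = line.rstrip("\n").split(";")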
Key to the identifying letters:
- a = Author
- b = Blob
- c = Commit
- cc = Child Commit
- f = File
- h = Head Commit
- p = Project
- pc = Parent Commit
- ta = Time Author
- trp = Torvald Path
List of relationships:
* a2c (.s) * a2f * a2ft * a2L (.s only) * a2p (.s) * a2trp0 (.s)
* b2c (.s) * b2f (.s)
* c2b (.s) * c2cc * c2f (.s)
* c2h * c2pc * c2p (.s) * c2ta (.s)
* f2b (.s) * f2c (.s)
* p2a (.s) * p2c (.s)
These thruMaps directories contain mappings of repositories to the modules they used at a given UNIX timestamp under a specific commit. The mappings are in c2bPtaPkgO{$LANG}.{0-31}.gz files.
Format: commit;repo_name;timestamp;author;blob;module1;module2;...
Each thruMaps directory covers a different language ($LANG) and contains modules relevant to that language.
da3 contains the same files located on da0, except for b2f, c2cc, f2b, and f2c.
This folder can be used for faster reading, hence the directory name (/fast).
In the context of oscar.py, the values listed in the PATHS dictionary can be changed from /da0_data/basemaps/... to /fast/... when referencing oscar.py in another program.
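For instance, a program importing oscar could patch the module-level PATHS dictionary before use. A minimal sketch, assuming a hypothetical key name (the real key names are listed in oscar.py's PATHS dictionary):

import oscar

# 'author_commits' is a hypothetical key; check PATHS in oscar.py for the real entries
oscar.PATHS['author_commits'] = ('/fast/a2cFullP.{key}.tch', 5)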
Note: "/<function_name>" after a function name denotes the version of that function that returns a Generator object
These are the corresponding functions in oscar.py that open the .tch files listed above for a given entity:
Author('...')
- initialized with a combination of name and email
.commit_shas/commits
.project_names
.torvald
- returns the torvald path of an Author, i.e., who did this Author work with that also worked with Linus Torvald
Blob('...')
- initialized with SHA of blob
.commit_shas/commits
- commits removing this blob are not included
Commit('...')
- initialized with SHA of commit
.blob_shas/blobs
.child_shas/children
.changed_file_names/files_changed
.parent_shas/parents
.project_names/projects
Commit_info('...')
- initialized like Commit()
.head
.time_author
File('...')
- initialized with a path, starting from a commit root tree
.commit_shas/commits
Project('...')
- initialized with project name/URI
.author_names
.commit_shas/commits
The non-Generator versions of these functions return a tuple of items, which can then be iterated:
for commit in Author(author_name).commit_shas:
    print(commit)
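The Generator versions (the names listed after the slash, per the note above) can be iterated the same way without building the whole tuple in memory:

for commit in Author(author_name).commits:
    print(commit)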
- Get a list of commits and repositories that imported Tensorflow for .py files:
On da0:UNIX> zcat /data/play/PYthruMaps/c2bPtaPkgOPY.0.gz | grep tensorflow
Output:
0000331084e1a567dbbaae4cc12935b610cd341a;abdella-mohamed_BreastCancer;1553266304;abdella <[email protected]>;0dd695391117e784d968c111f010cff802c0e6d1;sns;keras.models;np;random;tensorflow;os;pd;sklearn.metrics;plt;keras.layers;yaml
00034db68f89d3d2061b763deb7f9e5f81fef27;lucaskjaero_chinese-character-recognizer;1497547797;Lucas Kjaero <[email protected]>;0629a6caa45ded5f4a2774ff7a72738460b399d4;tensorflow;preprocessing;sklearn
000045f6a3601be885b0b028011440dd5a5b89f2;yjernite_DeepCRF;1451682395;yacine <[email protected]>;4aac89ae85b261dba185d5ee35d12f6939fc2e44;nn_defs;utils;tensorflow
000069240776f2b94acb9420e042f5043ec869d0;tickleliu_tf_learn;1530460653;tickleliu <[email protected]>;493f0fc310765d62b03390ddd4a7a8be96c7d48c;np;tf;tensorflow
.....
- Get a list of commits made by a specific author:
On da0:UNIX> zcat /data/basemaps/gz/a2cFullP0.s | grep '"Albert Krawczyk" <[email protected]>'
Output:
"Albert Krawczyk" <[email protected]>;17abdbdc90195016442a6a8dd8e38dea825292ae
"Albert Krawczyk" <[email protected]>;9cdc918bfba1010de15d0c968af8ee37c9c300ff
"Albert Krawczyk" <[email protected]>;d9fc680a69198300d34bc7e31bbafe36e7185c76
- Do the same thing as above using oscar.py:
UNIX> python
>>> from oscar import Author
>>> Author('"Albert Krawczyk" <[email protected]>').commit_shas
('17abdbdc90195016442a6a8dd8e38dea825292ae', '9cdc918bfba1010de15d0c968af8ee37c9c300ff', 'd9fc680a69198300d34bc7e31bbafe36e7185c76')
- Get the URL of a project's repository using the oscar.py Project(...).toURL() function:
UNIX> python
>>> from oscar import Project
>>> Project('notcake_gcad').toURL()
'https://github.com/notcake/gcad'
Given the data available, this is a fairly simple task: an application that detects the first time a repo adopted an AI module gives a better idea of when the module was first used, and also when it started to gain popularity.
A good example of this lies in popmods.py. This application reads all 32 c2bPtaPkgO{$LANG}.{0-31}.gz files of a given language and looks for a given module with the earliest import times. The program then creates a <module_name>.first file, with each line formatted as repo_name;UNIX_timestamp.
Usage: UNIX> python popmods.py language_file_extension module_name
Before anything else (and this can be applied to many other programs), you want to know what your input looks like ahead of time and how you are going to parse it. Each line of the file has this format:
commit;repo_name;timestamp;author;blob;module1;module2;...
We can use the str.split() method to turn a line into a list of fields, split on the semicolon (;). By turning the line into a list and giving it a variable name, entry = ['commit', 'repo_name', 'timestamp', ...], we can then grab the pieces of information we need with repo, time = entry[1], entry[2].
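A minimal sketch of this parsing step, using one of the output lines shown earlier:

line = "00034db68f89d3d2061b763deb7f9e5f81fef27;lucaskjaero_chinese-character-recognizer;1497547797;Lucas Kjaero <[email protected]>;0629a6caa45ded5f4a2774ff7a72738460b399d4;tensorflow;preprocessing;sklearn"
entry = line.rstrip("\n").split(";")  # split the line into fields on ';'
repo, time = entry[1], entry[2]       # repo name and UNIX timestamp
modules = entry[5:]                   # everything after the blob is a module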
An important idea to keep in mind is that we only want to count unique timestamps once. This is because repositories forked off of another repository carry imports with the exact same timestamps, and we do not want to count those twice. An easy way to handle this is to keep a running list of the times we have come across; if we have already seen a timestamp, we simply skip that line of the file:
...
if time in times:
    continue
else:
    times.append(time)
...
We also want to find the earliest timestamp for a repository importing a given module. Again, this is fairly simple:
...
if repo not in dict.keys() or time < dict[repo]:
    for word in entry[5:]:
        if module in word:
            dict[repo] = time
            break
...
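Putting these pieces together, a simplified sketch of the whole popmods.py loop might look like this (an illustration of the approach, not the actual program; it assumes the c2bPtaPkgO files are in the current directory and compares timestamps numerically):

import gzip

module = "tensorflow"   # module to search for
ext = "PY"              # language file extension ($LANG)
times = set()           # timestamps we have already seen (forks)
first = {}              # repo_name -> earliest import timestamp

for i in range(32):
    with gzip.open("c2bPtaPkgO%s.%d.gz" % (ext, i), "rt") as f:
        for line in f:
            entry = line.rstrip("\n").split(";")
            repo, time = entry[1], int(entry[2])
            if time in times:   # a fork imported at the exact same time: skip
                continue
            times.add(time)
            if repo not in first or time < first[repo]:
                for word in entry[5:]:
                    if module in word:
                        first[repo] = time
                        break

with open(module + ".first", "w") as out:
    for repo, time in first.items():
        out.write("%s;%d\n" % (repo, time))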
Now that we have the .first files put together, we can take this one step further and graph a module's first-time usage over time on a line graph, or even compare multiple modules to see how they stack up against each other. modtrends.py accomplishes this by:
- reading 1 or more .first files
- converting each timestamp for each repository into a datetime date
- "rounding" those dates by year and month
- putting those dates in a dictionary with dict["year-month"] += 1
- graphing the dates and frequencies using matplotlib (see the sketch after this list)
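A minimal sketch of those steps, assuming the .first files produced above (the real modtrends.py may differ in details):

import sys
from collections import defaultdict
from datetime import datetime
import matplotlib.pyplot as plt

for fname in sys.argv[1:]:                  # one or more .first files
    counts = defaultdict(int)
    with open(fname) as f:
        for line in f:
            repo, time = line.rstrip("\n").split(";")
            d = datetime.fromtimestamp(float(time))
            counts["%d-%02d" % (d.year, d.month)] += 1  # "round" to year-month
    months = sorted(counts)
    xs = [datetime.strptime(m, "%Y-%m") for m in months]
    plt.plot(xs, [counts[m] for m in months], label=fname)

plt.xlabel("year-month")
plt.ylabel("first-time imports")
plt.legend()
plt.show()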
If you want to compare first-time usage over time for Tensorflow and Keras for the .ipynb language .first files you created, run: UNIX> python3.6 modtrends.py tensorflow.first keras.first
The final graph is a line plot of year-month against first-time import counts, one line per module.
An application to calculate how different authors changed languages over a range of years, based on the commits they have made to different files, would also be useful.
In order to accomplish this task, we will modify an existing program from the swsc/lookup repo (a2fBinSorted.perl) and create a new program (a2L.py) that will get language counts per year per author.
For the first part, we look at what a2fBinSorted.perl currently does: it takes one of the 32 a2cFullP{0-31}.s files through STDIN, opens the 32 c2fFullO.{0-31}.tch files for reading, and writes a corresponding a2fFullP.{0-31}.tch file based on the a2c file number, with records of the form author_id;file1;file2;file3...
Example usage: UNIX> zcat /da0_data/basemaps/gz/a2cFullP0.s | ./a2fBinSorted.perl 0
We can modify this program so that it will write the earliest commit dates made by that author for those files, which will become useful for a2L.py later on. To accomplish this, we will have the program additionally read from the c2taFullP.{0-31}.tch files so we can get the time of each commit made by a given author:
my %c2ta;
for my $s (0..($sections-1)){
    tie %{$c2ta{$s}}, "TokyoCabinet::HDB", "/fast/c2taFullP.$s.tch",
        TokyoCabinet::HDB::OREADER | TokyoCabinet::HDB::ONOLCK,
        16777213, -1, -1, TokyoCabinet::HDB::TLARGE, 100000
        or die "can't open /fast/c2taFullP.$s.tch\n";
}
We will also ensure the files to be written will have the relationship a2ft as opposed to a2f:
my %a2ft;
tie %a2ft, "TokyoCabinet::HDB", "/data/play/dkennard/a2ftFullP.$part.tch",
    TokyoCabinet::HDB::OWRITER | TokyoCabinet::HDB::OCREAT,
    16777213, -1, -1, TokyoCabinet::HDB::TLARGE, 100000
    or die "can't open /data/play/dkennard/a2ftFullP.$part.tch\n";
Another important part of the file we want to change is inside the output function:
sub output {
    my $a = $_[0];
    my %fs;
    for my $c (@cs){
        my $sec = segB ($c, $sections);
        if (defined $c2f{$sec}{$c} and defined $c2ta{$sec}{$c}){
            my @fs = split(/\;/, safeDecomp ($c2f{$sec}{$c}, $a), -1);
            my ($time, $au) = split(/\;/, $c2ta{$sec}{$c}, -1); # added: grab the time of the commit
            for my $f (@fs){
                if (defined $time and (!defined $fs{$f} or $time < $fs{$f})){ # modified condition: keep the earliest time
                    $fs{$f} = $time;
                }
            }
        }
    }
    $a2ft{$a} = safeComp (join ';', %fs); # changed: store file;time pairs
}
Now when we run the new program, it should write individual a2ftFullP.{0-31}.tch files with the format:
author_id;file1;file1_timestamp;file2;file2_timestamp;...
We can then create a new PATHS dictionary entry in oscar.py, as well as another function under the Author class to read our newly created .tch files:
In PATHS dictionary:
...
'author_file_times': ('/data/play/dkennard/a2ftFullP.{key}.tch', 5)
...
In class Author(_Base):
...
    @cached_property
    def file_times(self):
        data = decomp(self.read_tch('author_file_times'))
        return tuple(file for file in (data and data.split(";")))
...
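Once the a2ft files exist and oscar.py knows about them, pairing each file with its earliest commit time is a matter of stepping through the returned tuple two items at a time (the author string below is the one from the earlier examples):

from oscar import Author

ft = Author('"Albert Krawczyk" <[email protected]>').file_times
for f, t in zip(ft[::2], ft[1::2]):  # (file, timestamp) pairs
    print(f, t)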
Our next task involves creating a2LFullP{0-31}.s files utilizing the new .tch files we have just created. We want these files to have each line filled with the author name, each year, and the language counts for each year. A possible format could look something like this:
"[email protected]" <>;year2015;2;py;31;js;30;year2016;1;py;29;year2017;8;c;2;doc;1;py;386;html;6;sh;1;js;3;other;3;build;1
where the number after each year represents the number of languages used for that year, followed by pairs of languages and the number of files written in that language for that year. As an example, in the year 2015, this author made initial commits to files in 2 languages: 31 files in Python and 30 in JavaScript.
There are a number of things that have to happen to get to this point, so let's break it down:
- Iterating Author().file_times and grouping timestamps into years
We will start by reading in an a2cFullP{0-31}.s file to get a list of authors, which we hold as a tuple in memory, and start building our dictionary:
a2L[author] = {}
file_times = Author(author).file_times
for j in range(0, len(file_times), 2):
    try:
        year = str(datetime.fromtimestamp(float(file_times[j+1]))).split(" ")[0].split("-")[0]
    except ValueError:
        continue  # have to skip years either in the 20th century or somewhere far in the future
    except IndexError:
        break  # in case the last file listed doesn't have a time
    year = specifier + year  # specifier is the string 'year'
    if year not in a2L[author]:
        a2L[author][year] = []
    a2L[author][year].append(file_times[j])
The datetime.fromtimestamp() function turns the timestamp into a datetime of the format year-month-day hour-min-sec, which we split by a space to get the first half of the string (year-month-day), and then split again to get the year.
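As a side note, the same year can be pulled from the datetime object directly, without string splitting:

year = str(datetime.fromtimestamp(float(file_times[j+1])).year)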
- Detecting the language of a file based on file extension
for year, files in a2L[author].items():
    build_list = []
    for file in files:
        la = "other"
        if re.search(r"\.(js|iced|liticed|iced.md|coffee|litcoffee|coffee.md|ts|cs|ls|es6|es|jsx|sjs|co|eg|json|json.ls|json5)$", file):
            la = "js"
        elif re.search(r"\.(py|py3|pyx|pyo|pyw|pyc|whl|ipynb)$", file):
            la = "py"
        elif re.search(r"(\.[Cch]|\.cpp|\.hh|\.cc|\.hpp|\.cxx)$", file):
            la = "c"
        .......
The simplest way to check for a language based on a file extension is to use the re module for regular expressions. If a given file matches a certain expression, like \.py$, then that file was written in Python; la = "other" if no matches were found in any of the searches. We then keep track of these languages, putting each one in a list with build_list.append(la), and counting how many times each language occurred when we looped through the files with build_list.count(lang). The final format for an author in the a2L dictionary will be a2L[author][year][lang] = lang_count.
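A sketch of this counting step, reusing the fragments above (build_list and a2L are the same variables as before; only two of the language branches are shown):

import re

for year, files in a2L[author].items():
    build_list = []
    for file in files:
        la = "other"
        if re.search(r"\.(py|py3|pyx|pyo|pyw|pyc|whl|ipynb)$", file):
            la = "py"
        elif re.search(r"(\.[Cch]|\.cpp|\.hh|\.cc|\.hpp|\.cxx)$", file):
            la = "c"
        build_list.append(la)
    # replace the file list with {language: count} for this year
    a2L[author][year] = {lang: build_list.count(lang) for lang in set(build_list)}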
- Writing each author's information into the file
See a2L.py for how the information is written into each file.
Usage: UNIX> python a2L.py 2 for writing a2LFullP2.s
Now that we have our a2L files, we can run some interesting statistics as to how significantly language usage changes over time for different authors. The program langtrend.py runs the chi-squared contingency test (via the stats.chi2_contingency() function from the scipy module) for authors from an a2LFullP{0-31}.s file on STDIN and calculates a p-value for each pair of years for each language for each author.
This p-value approximates the chance of finding another person (say, out of 1000 people) with this same extreme of change in language use, whether that be an increase or a decrease. For example, if a given author edited 300 different Python files in 2006, but then edited 500 different Java files in 2007, the chance of seeing this extreme of a change in another author is very low. In fact, if the p-value is less than 0.001, then the change in language use between a pair of years is considered "significant".
In order for this p-value to be a more accurate approximation, we need a larger sample size of language counts. When reading the a2LFullP{0-31}.s files, you may want to rule out people who don't meet certain criteria:
- the author has at least 5 consecutive years of commits for files
- the author has edited at least 100 different files for all of their years of commits
If an author does not meet these criteria, we do not consider them for the chi-squared test, simply because their results would be "uninteresting" and not worth investigating further.
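For illustration, such a test can be run on a 2x2 contingency table of counts: files in the language of interest vs. all other files, for each of the two years being compared (the exact table langtrend.py builds may differ). Using the 2015/2016 py counts from the example output below:

from scipy import stats

table = [[127, 74],   # 2015: py files, non-py files
         [163, 24]]   # 2016: py files, non-py files
chi2, pfactor, dof, expected = stats.chi2_contingency(table)
print(pfactor)        # a very small value -> significant rise/drop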
Here's one of the authors from the program's output:
----------------------------------
Ben Niemann <[email protected]>
{ '2015': {'doc': 3, 'markup': 2, 'obj': 1, 'other': 67, 'py': 127, 'sh': 1},
'2016': {'doc': 1, 'other': 23, 'py': 163},
'2017': {'build': 36, 'c': 116, 'lsp': 1, 'other': 81, 'py': 160},
'2018': { 'build': 12,
'c': 134,
'lsp': 2,
'markup': 2,
'other': 133,
'py': 182},
'2019': { 'build': 13,
'c': 30,
'doc': 8,
'html': 10,
'js': 1,
'lsp': 2,
'markup': 16,
'other': 67,
'py': 134}}
pfactors for obj language
2015--2016 pfactor == 0.9711606775110577 no change
pfactors for doc language
2015--2016 pfactor == 0.6669499228133753 no change
2016--2017 pfactor == 0.7027338745275937 no change
2018--2019 pfactor == 0.0009971248193242038 rise/drop
pfactors for markup language
2015--2016 pfactor == 0.5104066960256399 no change
2017--2018 pfactor == 0.5532258789014389 no change
2018--2019 pfactor == 1.756929555308731e-05 rise/drop
pfactors for py language
2015--2016 pfactor == 1.0629725495084215e-07 rise/drop
2016--2017 pfactor == 1.2847558344252341e-25 rise/drop
2017--2018 pfactor == 0.7125543569718793 no change
2018--2019 pfactor == 0.026914075872778477 no change
pfactors for sh language
2015--2016 pfactor == 0.9711606775110577 no change
pfactors for other language
2015--2016 pfactor == 1.7143130378377696e-06 rise/drop
2016--2017 pfactor == 0.020874234589765908 no change
2017--2018 pfactor == 0.008365948846657284 no change
2018--2019 pfactor == 0.1813919210757513 no change
pfactors for c language
2016--2017 pfactor == 2.770649054044977e-16 rise/drop
2017--2018 pfactor == 0.9002187643203734 no change
2018--2019 pfactor == 1.1559110387953382e-08 rise/drop
pfactors for lsp language
2016--2017 pfactor == 0.7027338745275937 no change
2017--2018 pfactor == 0.8855759560371912 no change
2018--2019 pfactor == 0.9944669523033288 no change
pfactors for build language
2016--2017 pfactor == 4.431916568235125e-05 rise/drop
2017--2018 pfactor == 5.8273175348446296e-05 rise/drop
2018--2019 pfactor == 0.1955154860787908 no change
pfactors for html language
2018--2019 pfactor == 0.0001652525618661536 rise/drop
pfactors for js language
2018--2019 pfactor == 0.7989681687355706 no change
----------------------------------
Although it is currently not implemented, one could take this one step further and visually represent an author's language changes on a graph, which would be simpler to interpret than a long list of pfactors such as the one shown above.
Similar to the C function system(), the subprocess module allows you to run UNIX processes, and also allows you to gather any input, output, or error from those processes, all from within a Python script. Documentation can be found here: https://docs.python.org/3/library/subprocess.html
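For example, one of the zcat pipelines shown earlier could be reproduced from Python like this (a minimal sketch):

import subprocess

# run zcat and capture its output as text
p = subprocess.run(["zcat", "/data/play/PYthruMaps/c2bPtaPkgOPY.0.gz"],
                   stdout=subprocess.PIPE, universal_newlines=True)
for line in p.stdout.splitlines():
    if "tensorflow" in line:
        print(line)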
The re module is good for evaluating regular expressions in files and extracting lines that have certain words or patterns. Documentation can be found here: https://docs.python.org/3/library/re.html
matplotlib is a popular graphing library for Python. Documentation can be found here: https://matplotlib.org/index.html