
GrainSizeTools script manual

GrainSizeTools is a free, open-source, cross-platform script written in Python that provides several tools for characterizing the population of grain sizes in dynamically recrystallized mylonites. The script is suitable for paleopiezometry studies using different grain size measures, as well as for deriving the actual population of grain sizes from thin sections. It requires the grain sectional areas to be measured beforehand from a thin section. No prior knowledge of Python is needed to use the script and obtain the results. For advanced users, the script is organized in a modular way using Python functions, which makes it easy to modify, reuse or extend the code.

[TOC]

Features

  • It loads text files -txt or csv- containing the sectional areas of the grains
  • It calculates the apparent diameters of the grains from the sectional areas via the equivalent circular diameter
  • It allows correcting the calculated diameters by adding the perimeter of the grains
  • It returns the mean, the median and the area-weighted grain size of the apparent population of grain sizes
  • It returns the peak of the frequency grain size via the Gaussian kernel density estimator and the central values of the modal intervals in the number- and area-weighted approaches
  • It implements several algorithms to estimate the optimal bin size of the histogram and the bandwidth of the Gaussian KDE based on the features of the population
  • It estimates the actual 3D population of grains from the population of apparent (2D) grain sizes using the Scheil-Schwartz-Saltykov method to unfold the 2D population, similar to the StripStar script
  • It produces ready-to-publish plots and allows saving the graphical output as bitmap or vector images

Requirements

The script requires Python 2.7.x or 3.4.x and the scientific libraries Numpy, Scipy and Matplotlib installed on the system. We recommend installing the Continuum Anaconda or the Enthought Canopy (perhaps more beginner-friendly) distributions, since they are free (at least in their basic versions) and provide the most popular Python scientific packages, including those named above. Both also offer free academic licenses for their more advanced versions. If disk space is a concern, there is a distribution named Miniconda that installs only the packages you choose/need.

Since the approach of the script is based on the grain sectional areas measured from a thin section, it is necessary to measure the areas of the grains in advance and store the data in a text file before using the script to derive the grain size. To measure the grain sectional areas, we highly encourage you to use the ImageJ program or similar, since it is a public-domain Java-based image processing program widely used in scientific research that runs on Windows, Mac OS X and Linux platforms. The aim of this text is not to describe how to measure the grain areas with ImageJ or similar applications, but how to treat the data obtained from them. If you are not familiar with the ImageJ application, don't worry; there are many tutorials on the web. Just search for the terms 'ImageJ' and 'areas' in your favorite search engine and you will find your answers. Furthermore, a quick tutorial is available here (available soon).

Why Python?

Python is a general-purpose, high-level, interpreted programming language characterized by a clear syntax (i.e. a highly readable language) and ease of learning. In addition, the Python language has other advantages: i) it is free and open-source; ii) it is interpreted, which means that there is no need to compile and that the code runs on all major platforms (Windows, Mac OS X, Linux or Unix); iii) there is a large number of open and freely available scientific libraries, such as Numpy, Scipy, Matplotlib, IPython or Pandas, that provide an interactive environment for algorithm development, data analysis, data visualization and numeric computation; and iv) Python is becoming increasingly popular in academia and science in general.

A brief tutorial on how to use the script

Note This simple tutorial assumes that you know nothing about the Python programming language; no prior knowledge of Python is necessary to use this script and obtain the results. Just follow the instructions.

Running and organization of the script

Once the required software is installed on your system (see requirements above) and the script is downloaded, the first step is to open the script in a Python editor such as the Canopy editor -if you installed the Enthought Canopy package-, IDLE -the default Python editor- or Spyder -a scientific Python development environment that comes with the Anaconda package-. Then, to use the script, it is necessary to run the code. In the Canopy environment, just click on the green play icon or go to the Run menu and click on “Run file” (Fig. 1). If you opened the script in IDLE, just press F5 or go to the Run menu and click “Run module” (Fig. 1).

Figure 1. Run a script in the Canopy editor (left) and in the IDLE editor (right)

The following statement will appear in the Python shell (Fig. 2):

Welcome to the GrainSizeTools script v. 0.3
See release notes in the Readme.txt file

Figure 2. The Python editor and the Python shell in the Enthought Canopy environment

Now we can use the different functions implemented within the script to derive different parameters and values from the population of grain sectional areas. A Python function looks like this in the editor:

# sqrt and pi are imported at the top of the script, e.g.:
from numpy import sqrt, pi

def calc_diameters(areas, addPerimeter=0):
    """ Calculate the diameters from the sectional areas assuming that the grains
    have near-equant shapes.

    PARAMETERS
    areas: a numpy array with the sectional areas of the grains
    addPerimeter: a float or integer. Correct the diameters estimated from the
           areas by adding the perimeter of the grain. If addPerimeter is not
           declared, it defaults to 0.
    """

    # calculate the diameters via the equivalent circular diameter
    diameters = 2 * sqrt(areas / pi)

    # apply the perimeter correction (if applicable)
    if addPerimeter != 0:
        diameters = diameters + addPerimeter

    return diameters

Briefly, the name following the Python keyword def -in this example calc_diameters- is the function name. Within the parentheses following the function name are the formal parameters of the function -the inputs-. In this case there are two inputs: the array with the measured areas of the grains and the value used to correct the size of the grains. The text between the triple quotation marks provides information about the function, in this case describing the conditions that must be met by the user and the output obtained. Below this comes the code block.

As we shall see, the names of the Python functions defined in the script are intuitive and self-explanatory. To obtain the results, it is only necessary to use four of the functions implemented within the script (specifically, the first four). For function details, see the section "Specifications of main functions in the GrainSizeTools script".

Using the script to derive grain size

The first step is to load the data. It is assumed that, prior to this step, the sectional areas were calculated using an image analysis software and the results saved as a txt or csv file (Fig. 3).

Figure 3. Format of the txt file

To load the data set into memory we will use the importdata function. To invoke this function we write in the Python shell:

>>> areas = importdata('C:/yourFileLocation/nameOfTheFile.txt')

where areas is just a name for the Python object in which the data will be stored in memory; this is also known as a variable. This will allow us to manipulate the data set of areas later when required. Any name can be used to create a variable. As an example, if the data set to load were the diameters of the grains instead of the areas, it would make sense to call the variable diameters, and if you want to load several files with different data sets of areas you can name them areas1, areas2 and so on. In Python, variable names can contain upper- and lowercase letters, digits and the special character _; however, they cannot start with a digit.

importdata is the function responsible for loading the data into the variable. Between the parentheses goes the file location path in the OS, in quotes (single or double). To avoid problems, it is advisable to use forward slashes or double backslashes to define the file path (e.g. "C:/yourfilelocation.../nameofthefile.txt") instead of single backslashes (see the note after the example below). Once the data set is loaded, the data can be viewed simply by invoking the name of the variable in the Python shell and pressing the enter key, as follows:

>>> areas

we would obtain (as a random example):

array([99.6535, 41.9045, ..., 79.5712, 119.777])
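
As a note on file paths, any of the following styles avoid the single-backslash problem mentioned above (the path shown is just a placeholder):

>>> areas = importdata('C:/yourFileLocation/nameOfTheFile.txt')    # forward slashes
>>> areas = importdata('C:\\yourFileLocation\\nameOfTheFile.txt')  # double backslashes
>>> areas = importdata(r'C:\yourFileLocation\nameOfTheFile.txt')   # raw string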

A useful way to check that all the data loaded properly is to verify that the size of the data set is correct. This can be done using the Python built-in function len as follows:

>>> len(areas)

This will return the number of items in the variable declared within the parenthesis.

The second step is to convert the areas into diameters via the equivalent circular diameter. For this, a function named calc_diameters was implemented (please note that if you are using a version of the script previous to v0.1.1, the function is named calcDiameters instead). To invoke the function we write in the shell:

>>> diameters = calc_diameters(areas)

In the example above, the only parameter declared is the variable areas previously loaded. In some cases a perimeter correction of the grains is needed. We can correct the calculated diameters by adding the following parameter (note that different inputs are separated by commas):

>>> diameters = calc_diameters(areas, addPerimeter = 0.05)

or just

>>> diameters = calc_diameters(areas, 0.05)

This example means that 0.05 is added to each apparent diameter calculated from the sectional areas. If the parameter addPerimeter is not declared within the function, as in the first example, it is assumed that no perimeter correction is needed.
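
As a quick sanity check of the conversion (the value is made up for illustration), a grain with a sectional area of 100 squared units yields:

>>> import numpy as np
>>> 2 * np.sqrt(100 / np.pi)          # equivalent circular diameter, approx. 11.28
>>> 2 * np.sqrt(100 / np.pi) + 0.05   # with the 0.05 perimeter correction, approx. 11.33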

Once the sectional areas and the apparent grain sizes are loaded into memory, we have two choices: (1) estimate a single value of grain size for paleopiezometry/paleowattometry studies, or (2) derive the actual 3D population of grain sizes from the population of apparent 2D grain sizes using the Scheil-Schwartz-Saltykov algorithm.

In case we need to obtain a single value of grain size, we invoke the function find_grain_size as follows:

>>> find_grain_size(areas, diameters)

Note that, contrary to what was shown so far, the function is called directly in the Python shell since it is not necessary to store any data into a variable. The inputs are the arrays with the areas and diameters previously stored into memory and, if desired, the bin size. As specified in Lopez-Sanchez and Llana-Fúnez (in discussion) and within the specification of the function, the function uses by default the Freedman-Diaconis rule (Freedman and Diaconis 1981) to estimate an optimal bin size for the histogram. If we want to use the Scott rule (Scott 1992) instead, or an ad hoc bin size, it can be done as follows:

>>> find_grain_size(areas, diameters, binsize='Scott')

to use the Scott rule (note that the name is in quotes and that the first letter is capitalized) or

>>> find_grain_size(areas, diameters, binsize = 10.0)

for an ad hoc bin size (in this example set to ten). The user-defined bin size can be of type integer or float (i.e. a decimal number).
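
For reference, both plug-in rules have simple closed forms. Below is a minimal sketch of how such bin sizes can be computed with Numpy; it only illustrates the rules, and the function names fd_binsize and scott_binsize are ours, not the script's internal code:

import numpy as np

def fd_binsize(data):
    # Freedman-Diaconis rule: h = 2 * IQR * n^(-1/3)
    iqr = np.percentile(data, 75) - np.percentile(data, 25)
    return 2 * iqr * len(data) ** (-1.0 / 3.0)

def scott_binsize(data):
    # Scott rule: h = 3.49 * standard deviation * n^(-1/3)
    return 3.49 * np.std(data) * len(data) ** (-1.0 / 3.0)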

After pressing the enter key, the function returns several values of grain size typically used in paleopiezometry studies, including the mean, the median, the area-weighted mean and the frequency peak grain sizes (see details in Lopez-Sanchez and Llana-Fúnez 2015). Other parameters of interest, such as the estimated bin size (indicating the method used in the estimation) and the bandwidth used for the Gaussian kernel density estimator according to the Silverman rule (Silverman 1986), are also provided. As stated in Lopez-Sanchez and Llana-Fúnez (2015), a minimum of 433 measured grains is needed to yield consistent results, although we recommend measuring at least 965 if possible.
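
To illustrate the idea behind the frequency peak estimate, here is a minimal sketch using the Gaussian KDE implemented in Scipy with a Silverman-type bandwidth. It is for illustration only (the helper name kde_peak is ours); the script's internal implementation may differ:

import numpy as np
from scipy.stats import gaussian_kde

def kde_peak(diameters):
    # fit a Gaussian KDE using the Silverman rule-of-thumb bandwidth
    kde = gaussian_kde(diameters, bw_method='silverman')
    # evaluate the density over a fine grid and locate its maximum
    grid = np.linspace(diameters.min(), diameters.max(), 1000)
    return grid[np.argmax(kde(grid))]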

Furthermore, a new window with the number- and area-weighted plots appears (Fig. 4), showing the location of the different grain sizes estimated. The advantages and disadvantages of these plots are explained in Lopez-Sanchez and Llana-Fúnez (2015). You can save the plots by clicking the floppy disk icon, either as bitmap (8 file types to choose from) or vector images (5 file types to choose from) in case you need to post-edit them. Another interesting option is to modify the plot within the Matplotlib environment before saving. For this, just click the green tick icon and choose the subplot you want to modify; a new window with several options will appear.

To derive the actual 3D population of grain sizes using the Scheil-Schwartz-Saltykov algorithm, we need to invoke the function derive3D as follows:

>>> derive3D(diameters, numbins=15)

The inputs are an array with the apparent diameters of the grains and the number of bins/classes of the histogram. If the number of bins is not declared, it is set to ten by default. To define the number of classes, the user can use any positive integer value. However, it is advisable to choose a number between 5 and 20 (see below for details). After pressing the enter key, the function returns the estimated bin size and an array with the normalized frequencies of the different classes. Also, a new window with two plots appears (Fig. 5): on the left, the derived 3D grain size distribution in the form of a histogram and, on the right, a volume-weighted cumulative density plot. The latter allows the user to estimate by eye what percentage of the volume is occupied by a given range of grain sizes.

Regarding the number of bins/classes, it depends on several factors, such as the size and features of the data set, and therefore there is no universally best number of bins. Previous works using the Scheil-Schwartz-Saltykov algorithm proposed -based on experience- using between 10 and 20 classes (e.g. Exner 1972) or even fewer than ten (e.g. Higgins 2000). So far, no method (i.e. algorithm) appears to be generally best for choosing an optimal number of classes (or bin size) for a particular population of apparent diameters. Hence, the only way to proceed is trial and error, to find the maximum number of classes that still produces a consistent result. Figure 6 explains graphically, with a brief explanation, what a consistent result is and what its limitations are.

As a last cautionary note, unfolding the distribution of apparent diameters into the actual 3D distribution with a Saltykov-type method/algorithm generally requires large samples (n ~ 1000 or larger) to obtain consistent results.
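
For the curious, the core of a Saltykov-type unfolding can be sketched in a few lines. The following is a simplified illustration of the classic back-substitution scheme, assuming spherical grains and taking the upper edge of each class as the sphere diameter; it is our own sketch (the helper saltykov_sketch is hypothetical), not the script's actual implementation:

import numpy as np

def saltykov_sketch(diameters, numbins=10):
    """Simplified Saltykov-type unfolding, for illustration only."""
    n_a, edges = np.histogram(diameters, bins=numbins)
    n_a = n_a.astype(float)        # number of sections per class
    n_v = np.zeros(numbins)        # unfolded number of spheres per class

    def p_cut(D, d1, d2):
        # probability that a random plane section of a sphere of diameter D
        # has an apparent diameter between d1 and d2 (Wicksell geometry)
        return (np.sqrt(D ** 2 - d1 ** 2) - np.sqrt(max(D ** 2 - d2 ** 2, 0.0))) / D

    # back-substitution: unfold from the largest class to the smallest
    for j in range(numbins - 1, -1, -1):
        D_j = edges[j + 1]         # sphere diameter assigned to class j
        # subtract the sections that larger spheres contribute to class j
        tails = sum(n_v[k] * edges[k + 1] * p_cut(edges[k + 1], edges[j], edges[j + 1])
                    for k in range(j + 1, numbins))
        n_v[j] = max((n_a[j] - tails) / (D_j * p_cut(D_j, edges[j], edges[j + 1])), 0.0)

    midpoints = (edges[:-1] + edges[1:]) / 2.0
    return midpoints, n_v / n_v.sum()  # class midpoints and normalized frequencies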

### Other methods of interest

Calculate the mean of a stored array: mean(*name of the variable*)

Calculate the standard deviation of a stored array: std(*name of the variable*)

Merge two or more data sets (or Numpy arrays)

A useful method to merge two or more data sets, implemented in Numpy, is called hstack; it stacks arrays in sequence as follows:

np.hstack((*name of array 1*, *name of array 2*, ...))

Note the use of double parentheses: the arrays to merge are passed as a single tuple.

As an example, if we have two data sets of diameters and we want to merge them into a single variable, we write into the Python shell (variable names are just an example):

>>> AllDiameters = np.hstack((diameters1, diameters2))

Note that in this example we merged the original data sets into a new variable named AllDiameters. This means that we do not overwrite any of the original data sets and we can use them later if required.

Specifications of main functions in the GrainSizeTools script

#### importdata(filePath)

Load the data from a text file (txt or csv).

Parameters:

filePath: string The file location in the OS in quotes.

Returns: A numpy array with the values loaded (e.g. the areas or the diameters of the sectional grains).

Note: importdata is just a function that wraps and renames a Numpy method called "genfromtxt". We opted to write our own function to allow its specification within the script and to simplify things. The user can call "genfromtxt" directly instead of "importdata" if desired. More information about the "genfromtxt" and "loadtxt" Numpy methods can be found in the Numpy documentation.
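
In essence, such a wrapper could look as follows (a minimal sketch for illustration; the actual function in the script may differ in its details):

import numpy as np

def importdata(filePath):
    """Load a single column of values (e.g. sectional areas) from a text file."""
    # for comma-separated files a delimiter may need to be specified,
    # e.g. np.genfromtxt(filePath, delimiter=',')
    return np.genfromtxt(filePath)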

#### calc_diameters(areas, addPerimeter=0)

Calculate the diameters from the sectional areas via the equivalent circular diameter, $$ d = 2\sqrt{area/\pi} $$, assuming that the grains have near-equant shapes.

Parameters:

areas: array_like The sectional areas of the grains.

addPerimeter: int or float; optional Correct the diameters estimated from the areas by adding the perimeter of the grain. If addPerimeter is not declared it is considered zero.

Returns: An array with the diameters of the grains.

#### find_grain_size(areas, diameters, binsize='FD')

Estimate a representative numeric value of grain size from the populations of apparent diameters and areas.

Parameters:

areas: array_like The sectional areas of the grains.

diameters: array_like The apparent diameters of the grains.

binsize: string, int or float; optional The method used to calculate the bin size. This can be: 'FD' (Freedman-Diaconis rule), 'Scott' (Scott rule) or a user-defined scalar constant of type integer or float. If not specified, 'FD' is used by default.

Returns: A number of grain size values to use in paleopiezometry or paleowattometry studies, plus the number- and area-weighted plots. The values include the mean, the median and the area-weighted mean grain size, as well as the frequency peak via the Gaussian kernel density estimator (preferred option) and the mid-point of the modal interval. It also provides other values of interest, such as the estimated bin size and bandwidth and the methods chosen.

#### derive3D(diameters, numbins=10)

Estimate the actual 3D population of grains from the population of apparent (2D) grain sizes using the Scheil-Schwartz-Saltykov method to unfold the 2D population.

Parameters:

diameters: array_like The apparent diameters of the grains.

numbins: int; optional The number of classes or bins used to unfold the population. If not declared, it is set to ten by default.

Returns: The bin size, the frequencies (probabilities) of the different classes and a plot containing two subplots: i) the distribution of the actual grain size population according to the Scheil-Schwartz-Saltykov method and ii) the volume cumulative distribution function.

References

Freedman D and Diaconis P (1981) On the histogram as a density estimator: L2 theory. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 57, 453–476

Heilbronner R and Bruhn D (1998) The influence of three-dimensional grain size distributions on the rheology of polyphase rocks. Journal of Structural Geology 20:695–705

Higgins MD (2000) Measurement of crystal size distributions. American Mineralogist 85:1105-1116

Lopez-Sanchez MA and Llana-Fúnez S (2015) An evaluation of different measures of dynamically recrystallized grain size for paleopiezometry or paleowattometry studies. Solid Earth...

Scott DW (1992) Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, New York, Chicester

Silverman BW (1986) Density estimation for statistics and data analysis. Monographs on Statistics and Applied Probability, Chapman and Hall, London.

