Merge pull request #236 from prisms-center/0.2a2
0.2a2
bpuchala authored Aug 14, 2016
2 parents ebd0e00 + 98e283c commit d0c341e
Showing 27 changed files with 859 additions and 299 deletions.
161 changes: 97 additions & 64 deletions INSTALL.md


13 changes: 4 additions & 9 deletions README.md
@@ -9,6 +9,7 @@ This version of CASM supports:
- Occupational degrees of freedom.
- High-throughput calculations using:
- VASP: [https://www.vasp.at](https://www.vasp.at)
- Semi-grand canonical Monte Carlo calculations

CASM is updated frequently with support for new effective Hamiltonians, new interfaces for first-principles electronic structure codes, and new Monte Carlo methods. Collaboration is welcome and new features can be incorporated by forking the repository on GitHub, creating a new feature, and submitting pull requests. If you are interested in developing features that involve a significant time investment we encourage you to first contact the CASM development team at <[email protected]>.

@@ -57,7 +58,7 @@ CASM is developed by the Van der Ven group, originally at the University of Mich

**Developers**: John Goiri and Anirudh Natarajan.

**Other contributors**: Min-Hua Chen, Jonathon Bechtel, Max Radin, Elizabeth Decolvenaere and Anna Belak
**Other contributors**: Min-Hua Chen, Jonathon Bechtel, Max Radin, Elizabeth Decolvenaere, Anna Belak, Liang Tian, and Naga Sri Harsha Gunda

#### Acknowledgements ####

@@ -89,7 +90,7 @@ See INSTALL.md

The ``casm`` executable includes extensive help documentation describing the various commands and options. Simply executing ``casm`` will display a list of possible commands, and executing ``casm <cmd> -h`` will display help documentation particular to the chosen command.

For a beginner, the best place to start is to follow the suggestions printed when calling ``casm status -n``. This provides step-by-step instructions for creating a CASM project, generating symmetry information, setting composition axes, enumerating configurations, calculating energies with VASP, setting reference states, and fitting an effective Hamiltonian. The subcommand ``casm format`` provides information on the directory structure of the CASM project and the format of all the CASM files.
For a beginner, the best place to start is to follow the suggestions printed when calling ``casm status -n``. This provides step-by-step instructions for creating a CASM project, generating symmetry information, setting composition axes, enumerating configurations, calculating energies with VASP, setting reference states, and fitting an effective Hamiltonian using the program ``casm-learn``. The subcommand ``casm format`` provides information on the directory structure of the CASM project and the format of all the CASM files.

All that is needed to start a new project is a ``prim.json`` file describing the crystal structure of the material being studied. See ``casm format --prim`` for a description and examples. Typically one will create a new project directory containing the ``prim.json`` file and then initialize the casm project. For example:

@@ -108,15 +109,9 @@ All that is needed to start a new project is a ``prim.json`` file describing the

After initializing a casm project:

- ``casm`` generates code that is compiled and linked at runtime in order to evaluate effective Hamiltonians in a highly optimized manner. If you installed the CASM header files and libraries in a location that is not in your default search path you must specify where to find them. Often the default compilation options work well, but there are some cases when the c++ compiler, compiler flags, or shared object construction flags might need to be customized. You can inspect the current settings via ``casm settings -l`` and see the options for changing them via ``casm settings --desc``.


An HTML tutorial describing the creation of an example CASM project and typical steps is coming soon.


6 changes: 3 additions & 3 deletions SConstruct
@@ -6,7 +6,8 @@ import sys, os, glob, copy, shutil, subprocess, imp, re
from os.path import join

Help("""
Type: 'scons configure' to run configuration checks,
'scons' to build all binaries,
'scons install' to install all libraries, binaries, scripts and python packages,
'scons test' to run all tests,
'scons unit' to run all unit tests,
@@ -43,10 +44,9 @@ Help("""
Sets to compile with debugging symbols. In this case, the optimization level gets
set to -O0, and NDEBUG does not get set.
$LD_LIBRARY_PATH (Linux) or $DYLD_FALLBACK_LIBRARY_PATH (Mac):
Search path for dynamic libraries, may need $CASM_BOOST_PREFIX/lib
and $CASM_PREFIX/lib added to it.
This should be added to your ~/.bash_profile (Linux) or ~/.profile (Mac).
$CASM_BOOST_NO_CXX11_SCOPED_ENUMS:
38 changes: 19 additions & 19 deletions casmenv.sh
@@ -15,6 +15,8 @@
#
#export CASM_BOOST_PREFIX=""

#

# Recognized by install scripts. Use this if linking to boost libraries compiled without c++11. If defined, (i.e. CASM_BOOST_NO_CXX11_SCOPED_ENUMS=1) will compile with -DBOOST_NO_CXX11_SCOPED_ENUMS option.
# Order of precedence:
# 1) if $CASM_BOOST_NO_CXX11_SCOPED_ENUMS defined
@@ -105,6 +107,17 @@ if [ ! -z ${CASM_PREFIX} ]; then

fi

# If CASM_BOOST_PREFIX is set, update library search path
if [ ! -z ${CASM_BOOST_PREFIX} ]; then

# For Linux, set LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CASM_BOOST_PREFIX/lib:$LD_LIBRARY_PATH

# For Mac, set DYLD_FALLBACK_LIBRARY_PATH
export DYLD_FALLBACK_LIBRARY_PATH=$CASM_BOOST_PREFIX/lib:$DYLD_FALLBACK_LIBRARY_PATH

fi

# If testing:
if [ ! -z ${CASM_REPO} ]; then

@@ -114,25 +127,12 @@ if [ ! -z ${CASM_REPO} ]; then
export PATH=$CASM_REPO/bin:$CASM_REPO/python/casm/scripts:$PATH
export PYTHONPATH=$CASM_REPO/python/casm:$PYTHONPATH

# For testing on Linux, use LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$CASM_REPO/lib:$LD_LIBRARY_PATH

# For testing on Mac, use DYLD_FALLBACK_LIBRARY_PATH:
export DYLD_FALLBACK_LIBRARY_PATH=$CASM_REPO/lib:$DYLD_FALLBACK_LIBRARY_PATH

fi


3 changes: 2 additions & 1 deletion python/casm/casm/learn/__init__.py
@@ -70,7 +70,7 @@ def create_halloffame(maxsize, rel_tol=1e-6):
from fit import example_input_Lasso, example_input_LassoCV, example_input_RFE, \
example_input_GeneticAlgorithm, example_input_IndividualBestFirst, \
example_input_PopulationBestFirst, example_input_DirectSelection, \
open_input, set_input_defaults, \
FittingData, TrainingData, \
print_input_help, print_individual, print_population, print_halloffame, print_eci, \
to_json, open_halloffame, save_halloffame, \
@@ -90,6 +90,7 @@ def create_halloffame(maxsize, rel_tol=1e-6):
'example_input_IndividualBestFirst',
'example_input_PopulationBestFirst',
'example_input_DirectSelection',
'open_input',
'set_input_defaults',
'FittingData',
'TrainingData',
25 changes: 25 additions & 0 deletions python/casm/casm/learn/fit.py
@@ -1101,6 +1101,31 @@ def set_input_defaults(input, input_filename=None):
return input


def open_input(input_filename):
"""
Read casm-learn input file into a dict.

Arguments
---------
input_filename: str
  The path to the input file

Returns
-------
input: dict
  The result of reading the input file and running it through
  casm.learn.set_input_defaults
"""
# open input and always set input defaults before doing anything else
with open(input_filename, 'r') as f:
try:
input = set_input_defaults(json.load(f), input_filename)
except Exception as e:
print "Error parsing JSON in", input_filename
raise e
return input

class FittingData(object):
"""
FittingData holds feature values, target values, sample weights, etc. used
188 changes: 168 additions & 20 deletions python/casm/scripts/casm-learn
@@ -9,6 +9,7 @@ import deap.tools
if __name__ == "__main__":

parser = argparse.ArgumentParser(description = 'Fit cluster expansion coefficients (ECI)')
parser.add_argument('--desc', help='Print extended usage description', action="store_true")
parser.add_argument('-s', '--settings', nargs=1, help='Settings input filename', type=str)
parser.add_argument('--format', help='Hall of fame print format. Options are "details", "json", or "csv".', type=str, default=None)
#parser.add_argument('--path', help='Path to CASM project. Default assumes the current directory is in the CASM project.', type=str, default=os.getcwd())
@@ -61,13 +62,7 @@ if __name__ == "__main__":
if args.verbose:
print "Loading", args.settings[0]

input = casm.learn.open_input(args.settings[0])

if args.hall:

@@ -132,28 +127,181 @@
# pickle hall of fame
casm.learn.save_halloffame(hall, halloffame_filename, args.verbose)

elif args.desc:

print \
"""
Learning is performed in four steps:
1) Specify the problem:
'casm-learn' helps solve the problem:
X*b = y,
where:
X: 2d matrix of shape (n_samples, n_features)
The correlation matrix, holding the evaluated basis functions. The
entry X[config, bfunc] holds the average value of the 'bfunc' cluster
basis function for configuration 'config'. The number of configurations
is 'n_samples' and the number of cluster basis functions is 'n_features'.
y: 1d matrix of shape (n_samples, 1)
The calculated properties being fit to. The most common case is that
y[config] holds the formation energy calculated for configuration
'config'.
b: 1d matrix of shape (n_features, 1)
The effective cluster interactions (ECI) being solved for.
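As a sketch, the X*b = y problem above can be set up and solved with ordinary least squares on synthetic data (NumPy, with hypothetical sizes and values; in a real project X holds correlations queried from CASM and y the calculated formation energies):

```python
import numpy as np

# Synthetic stand-ins for the CASM quantities (hypothetical sizes):
# in practice X holds evaluated cluster basis functions and y the
# calculated formation energies.
rng = np.random.RandomState(0)
n_samples, n_features = 20, 5

X = rng.rand(n_samples, n_features)            # correlation matrix
b_true = np.array([1.0, -0.5, 0.0, 0.2, 0.0])  # "true" ECI
y = X.dot(b_true)                              # noiseless targets

# Ordinary least-squares solution for the ECI
b_fit = np.linalg.lstsq(X, y, rcond=None)[0]
```

With noiseless targets and more samples than features, least squares recovers b_true exactly; real fits must instead balance noise, weighting, and cross-validation error.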
To specify this problem, the 'casm-learn' input file specifies which
configurations to fit to (the training data), how to weight the
configurations, and how to compare solutions via cross-validation.
Training data may be input via a 'casm select' output file. The default
name expected is 'train'. So to use all calculated configurations, you
could create a directory in your CASM project where you will perform
fitting and generate a 'train' file:
cd /my/casm/project
mkdir fit_1 && cd fit_1
casm select --set is_calculated -o train
Example 'casm-learn' JSON input files can be output by the
'casm-learn --exMethodName' options:
casm-learn --exGeneticAlgorithm > fit_1_ga.json
casm-learn --exRFE > fit_1_rfe.json
...etc..
By default, these settings files are prepared for fitting formation_energy,
using the 'train' configuration selection. Edit the file as needed, and
see 'casm-learn --settings-format' for help.
When weighting configurations, the problem is transformed:
X*b = y -> L*X*b = L*y,
where W = L*L.transpose():
W: 2d matrix of shape (n_samples, n_samples)
The weight matrix is specified in the casm-learn input file. If the
weighting method provides 1-dimensional input (this is typical, i.e.
a weight for each configuration), in an array called 'w', then:
W = diag(w)*n_samples/sum(w),
diag(w) being the diagonal matrix with 'w' along the diagonal.
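A minimal NumPy sketch of this weighting transform (the weight vector 'w' is a made-up example, not CASM output):

```python
import numpy as np

rng = np.random.RandomState(1)
n_samples, n_features = 6, 3
X = rng.rand(n_samples, n_features)
y = rng.rand(n_samples)

# Hypothetical per-configuration weights: emphasize the first configuration
w = np.array([4.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# W = diag(w) * n_samples / sum(w)
W = np.diag(w) * n_samples / w.sum()

# W = L * L.transpose(); for a diagonal W, L is simply sqrt(W)
L = np.sqrt(W)

# Transformed problem: (L*X) * b = (L*y)
b = np.linalg.lstsq(L.dot(X), L.dot(y), rcond=None)[0]
```

Note the normalization keeps the total weight equal to n_samples, so weighted and unweighted cv scores remain on a comparable scale.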
A cross-validation score is used for comparing generated ECI. The cv score
reported is:
cv = sqrt(mean(scores)) + N_nonzero_eci*penalty,
where:
scores: 1d array of shape (number of train/test sets)
The mean squared error calculated for each training/testing set
N_nonzero_eci: int
The number of basis functions with non-zero ECI
penalty: number, optional, default=0.0
Is the user-input penalty per basis function that can be used to
favor solutions with a small number of non-zero ECI
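The cv score above can be written out directly (a sketch; the 'scores' and 'eci' values are invented for illustration):

```python
import numpy as np

def cv_score(scores, eci, penalty=0.0):
    """cv = sqrt(mean(scores)) + N_nonzero_eci * penalty, where
    'scores' are mean squared errors, one per train/test set."""
    n_nonzero = np.count_nonzero(eci)
    return np.sqrt(np.mean(scores)) + n_nonzero * penalty

scores = [0.04, 0.01, 0.04]         # MSE for three train/test splits
eci = [1.2, 0.0, -0.3, 0.0, 0.5]    # three non-zero ECI

base = cv_score(scores, eci)                     # sqrt(0.03), about 0.173
penalized = cv_score(scores, eci, penalty=0.01)  # adds 3 * 0.01
```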
See 'casm-learn --settings-format' for help specifying the cross-validation
training and test sets using options from scikit-learn. It is usually
important to use the 'shuffle'=true option so that configurations are
randomly added to train/test sets and not ordered by supercell size.
When you run 'casm-learn' with a new problem specification the first time,
it generates a "problem specs" file that stores the training data, weights,
and cross-validation train/test sets. Then, when running subsequent times,
the data can be loaded more quickly, and the cross-validation can be
performed using the same train/test sets. 'casm-learn' will attempt to
prevent you from re-running with a different problem specification so that
solutions can be compared via their cv score in an "apples-to-apples"
manner. The default name for the "specs" file is determined from the input
filename. For example, 'my_input_specs.pkl' is used if the input file is
named 'my_input.json'. See 'casm-learn --settings-format' for more help.
The '--checkspecs' option can be used to write output files with the
generated problem specs data. Among other things, this can be used to
adjust weights manually or save and re-use train/test sets. See
'casm-learn --settings-format' for more help.
2) Select estimator and feature selection methods
The "estimator" option specifies a linear model estimator that determines
how to solve the linear problem L*X*b = L*y, for b.
The "feature_selection" option specifies a feature selection method that
determines which features (ECI) should be considered for the solution. The
remaining are effectively set to 0.0 when calculating the cluster
expansion. Generally there is a tradeoff: By limiting the number of
features included in the cluster expansion Monte Carlo calculations can be
more efficient, but at a possible loss of accuracy. Be careful to avoid
overfitting however. If your cross validation scheme does not provide
enough testing data, you may fit your training data very well, but not
have an accurate extrapolation to other configurations.
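For instance, an l-1 penalized estimator such as scikit-learn's Lasso performs estimation and implicit feature selection at once. This is only an illustration on synthetic data, not the casm-learn input format:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(2)
X = rng.rand(50, 10)
b_true = np.zeros(10)
b_true[0], b_true[3] = 2.0, -1.0   # only two basis functions matter
y = X.dot(b_true)

# The l-1 penalty drives irrelevant coefficients toward exactly zero,
# effectively excluding those basis functions from the expansion.
est = Lasso(alpha=0.01)
est.fit(X, y)

n_nonzero = np.count_nonzero(np.abs(est.coef_) > 1e-6)
```

Larger alpha prunes more features (cheaper Monte Carlo, possibly less accurate); cross-validation guards against overfitting in either direction.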
See 'casm-learn --settings-format' for help specifying the estimator and
feature selection methods. Assuming you are using the GeneticAlgorithm and
have named your input file 'fit_1_ga.json', run:
casm-learn -s fit_1_ga.json
'casm-learn' will run and eventually store its results. For a single
problem specification (step 1, the settings in "problem_specs"
in the 'casm-learn' input file), you may try many different estimation
and feature selection methods and use the cv score to compare results. All
the results for a single problem specification can be stored in a 'Hall Of
Fame' that collects the N individual solutions with the best cv scores. To
view these results use:
casm-learn -s fit_1_ga.json --hall
For more details, or to output the results for further analysis in JSON or
CSV format, there is a '--format' option. To view only particular
individuals in the hall of fame, there is a '--indiv' option.
3) Analyze results
The above steps (1) and (2) may be repeated many times as you attempt to
optimize your ECI. Solutions for different problems (i.e. different
weighting schemes, re-calculating with more training data) may be compared
based on scientific knowledge, for instance, which predicts the 0K ground
state configurations correctly, or from analysis of Monte Carlo results.
The '--checkhull' option provides a simple way to check the 0K ground
states and can create 'casm select' style output files with enumerated but
uncalculated configurations that are predicted to be low energy. These can
then be used to generate more training data and re-fit the ECI.
When you have generated ECI that you wish to use in Monte Carlo
calculations, use the '--select' option to write an 'eci.json' file into
your CASM project for the currently selected cluster expansion (as listed
by 'casm settings -l').
4) Use results
Once an 'eci.json' file has been written, you can run Monte Carlo
calculations. See 'casm monte -h' and 'casm format --monte' for help.
"""

else:

parser.print_help()

