From a66861b42480e12add1ea6aa37bed84ad433bcf5 Mon Sep 17 00:00:00 2001 From: James Nightingale Date: Wed, 24 Apr 2024 07:34:43 +0100 Subject: [PATCH 1/8] science workflow written --- docs/cookbooks/analysis.rst | 2 +- docs/cookbooks/database.rst | 2 +- docs/cookbooks/multiple_datasets.rst | 5 + docs/cookbooks/result.rst | 62 +++ docs/cookbooks/search.rst | 20 +- docs/overview/backup.rst | 380 ++++++++++++++++ docs/overview/scientific_workflow.rst | 625 +++++++++++++++++++++++++- docs/overview/statistical_methods.rst | 27 ++ docs/overview/the_basics.rst | 544 +++++----------------- 9 files changed, 1202 insertions(+), 465 deletions(-) create mode 100644 docs/overview/backup.rst create mode 100644 docs/overview/statistical_methods.rst diff --git a/docs/cookbooks/analysis.rst b/docs/cookbooks/analysis.rst index 2114cffdb..50188c610 100644 --- a/docs/cookbooks/analysis.rst +++ b/docs/cookbooks/analysis.rst @@ -181,7 +181,7 @@ Visualization If a ``name`` is input into a non-linear search, all results are output to hard-disk in a folder. -By overwriting the ``Visualizer`` object of an ``Analysis`` class with a custom `Visualizer` class, custom results of the +By overwriting the ``Visualizer`` object of an ``Analysis`` class with a custom ``Visualizer`` class, custom results of the model-fit can be visualized during the model-fit. The ``Visualizer`` below has the methods ``visualize_before_fit`` and ``visualize``, which perform model specific diff --git a/docs/cookbooks/database.rst b/docs/cookbooks/database.rst index 85c5fc854..6690959ea 100644 --- a/docs/cookbooks/database.rst +++ b/docs/cookbooks/database.rst @@ -3,7 +3,7 @@ Database ======== -The default behaviour of model-fitting results output is to be written to hard-disc in folders. These are simple to +The default behaviour of model-fitting results output is to be written to hard-disk in folders. These are simple to navigate and manually check. For small model-fitting tasks this is sufficient, however it does not scale well when performing many model fits to diff --git a/docs/cookbooks/multiple_datasets.rst b/docs/cookbooks/multiple_datasets.rst index 2ecf90735..16eba4af5 100644 --- a/docs/cookbooks/multiple_datasets.rst +++ b/docs/cookbooks/multiple_datasets.rst @@ -164,6 +164,11 @@ To fit multiple datasets via a non-linear search we use this summed analysis obj result_list = search.fit(model=model, analysis=analysis) +In the example above, the same ``Analysis`` class was used twice (to set up ``analysis_0`` and ``analysis_1``) and summed. + +**PyAutoFit** supports the summing together of different ``Analysis`` classes, which may take as input completely different +datasets and fit the model to them (via the ``log_likelihood_function``) following a completely different procedure. + Result List ----------- diff --git a/docs/cookbooks/result.rst b/docs/cookbooks/result.rst index 8ab288b38..d1c9fadc3 100644 --- a/docs/cookbooks/result.rst +++ b/docs/cookbooks/result.rst @@ -12,6 +12,7 @@ This cookbook provides an overview of using the results. - **Model Fit**: Perform a simple model-fit to create a ``Result`` object. - **Info**: Print the ``info`` attribute of the ``Result`` object to display a summary of the model-fit. +- **Loading From Hard-disk**: Loading results from hard-disk to Python variables via the aggregator. - **Samples**: The ``Samples`` object contained in the ``Result``, containing all non-linear samples (e.g. parameters, log likelihoods, etc.). - **Maximum Likelihood**: The maximum likelihood model instance. 
- **Posterior / PDF**: The median PDF model instance and PDF vectors of all model parameters via 1D marginalization. @@ -98,6 +99,67 @@ The output appears as follows: normalization 24.79 (24.65, 24.94) sigma 9.85 (9.78, 9.90) +Loading From Hard-disk +---------------------- + +When performing fits which output results to hard-disc, a ``files`` folder is created containing .json / .csv files of +the model, samples, search, etc. + +These files can be loaded from hard-disk to Python variables via the aggregator, making them accessible in a +Python script or Jupyter notebook. + +Below, we will access these results using the aggregator's ``values`` method. A full list of what can be loaded is +as follows: + + - ``model``: The ``model`` defined above and used in the model-fit (``model.json``). + - ``search``: The non-linear search settings (``search.json``). + - ``samples``: The non-linear search samples (``samples.csv``). + - ``samples_info``: Additional information about the samples (``samples_info.json``). + - ``samples_summary``: A summary of key results of the samples (``samples_summary.json``). + - ``info``: The info dictionary passed to the search (``info.json``). + - ``covariance``: The inferred covariance matrix (``covariance.csv``). + - ``data``: The 1D noisy data used that is fitted (``data.json``). + - ``noise_map``: The 1D noise-map fitted (``noise_map.json``). + +The ``samples`` and ``samples_summary`` results contain a lot of repeated information. The ``samples`` result contains +the full non-linear search samples, for example every parameter sample and its log likelihood. The ``samples_summary`` +contains a summary of the results, for example the maximum log likelihood model and error estimates on parameters +at 1 and 3 sigma confidence. + +Accessing results via the ``samples_summary`` is much faster, because as it does reperform calculations using the full +list of samples. Therefore, if the result you want is accessible via the ``samples_summary`` you should use it +but if not you can revert to the ``samples. + +.. code-block:: python + + from autofit.aggregator.aggregator import Aggregator + + agg = Aggregator.from_directory( + directory=path.join("output", "cookbook_result"), + ) + +Before using the aggregator to inspect results, lets discuss Python generators. + +A generator is an object that iterates over a function when it is called. The aggregator creates all of the objects +that it loads from the database as generators (as opposed to a list, or dictionary, or another Python type). + +This is because generators are memory efficient, as they do not store the entries of the database in memory +simultaneously. This contrasts objects like lists and dictionaries, which store all entries in memory all at once. +If you fit a large number of datasets, lists and dictionaries will use a lot of memory and could crash your computer! + +Once we use a generator in the Python code, it cannot be used again. To perform the same task twice, the +generator must be remade it. This cookbook therefore rarely stores generators as variables and instead uses the +aggregator to create each generator at the point of use. + +To create a generator of a specific set of results, we use the `values` method. This takes the `name` of the +object we want to create a generator of, for example inputting `name=samples` will return the results `Samples` +object (which is illustrated in detail below). + +.. 
code-block:: python + + for samples in agg.values("samples"): + print(samples.parameter_lists[0]) + Samples ------- diff --git a/docs/cookbooks/search.rst b/docs/cookbooks/search.rst index e62567b65..8bb92caaf 100644 --- a/docs/cookbooks/search.rst +++ b/docs/cookbooks/search.rst @@ -60,24 +60,30 @@ Output To Hard-Disk ------------------- By default, a non-linear search does not output its results to hard-disk and its results can only be inspected -in Python via the ``result`` object. +in a Jupyter Notebook or Python script via the ``result`` object. However, the results of any non-linear search can be output to hard-disk by passing the ``name`` and / or ``path_prefix`` attributes, which are used to name files and output the results to a folder on your hard-disk. The benefits of doing this include: -- Inspecting results via folders on your computer can be more efficient than using a Jupyter Notebook. -- Results are output on-the-fly, making it possible to check that a fit i progressing as expected mid way through. -- Additional information about a fit (e.g. visualization) is output. +- Inspecting results via folders on your computer is more efficient than using a Jupyter Notebook for multiple datasets. +- Results are output on-the-fly, making it possible to check that a fit is progressing as expected mid way through. +- Additional information about a fit (e.g. visualization) can be output. - Unfinished runs can be resumed from where they left off if they are terminated. -- On high performance super computers which use a batch system, results must be output in this way. +- On high performance super computers results often must be output in this way. -These outputs are fully described in the scientific workflow example. +The code below shows how to enable outputting of results to hard-disk: .. code-block:: python - search = af.Emcee(path_prefix=path.join("folder_0", "folder_1"), name="example_mcmc") + search = af.Emcee( + path_prefix=path.join("folder_0", "folder_1"), + name="example_mcmc" + ) + + +These outputs are fully described in the scientific workflow example. Output Customization -------------------- diff --git a/docs/overview/backup.rst b/docs/overview/backup.rst new file mode 100644 index 000000000..76c76620f --- /dev/null +++ b/docs/overview/backup.rst @@ -0,0 +1,380 @@ + +Extending Models +---------------- + +The model composition API is designed to make composing complex models, consisting of multiple components with many +free parameters, straightforward and scalable. + +To illustrate this, we will extend our model to include a second component, representing a symmetric 1D Exponential +profile, and fit it to data generated with both profiles. + +Lets begin by loading and plotting this data. + +.. code-block:: python + + dataset_path = path.join("dataset", "example_1d", "gaussian_x1__exponential_x1") + data = af.util.numpy_array_from_json(file_path=path.join(dataset_path, "data.json")) + noise_map = af.util.numpy_array_from_json( + file_path=path.join(dataset_path, "noise_map.json") + ) + xvalues = range(data.shape[0]) + plt.errorbar( + x=xvalues, y=data, yerr=noise_map, color="k", ecolor="k", elinewidth=1, capsize=2 + ) + plt.show() + plt.close() + +The data appear as follows: + +.. image:: https://raw.githubusercontent.com/rhayes777/PyAutoFit/main/docs/images/data_2.png + :width: 600 + :alt: Alternative text + +We define a Python class for the ``Exponential`` model component, exactly as we did for the ``Gaussian`` above. + +.. 
code-block:: python + + class Exponential: + def __init__( + self, + centre=30.0, # <- **PyAutoFit** recognises these constructor arguments + normalization=1.0, # <- are the Exponentials``s model parameters. + rate=0.01, + ): + """ + Represents a symmetric 1D Exponential profile. + + Parameters + ---------- + centre + The x coordinate of the profile centre. + normalization + Overall normalization of the profile. + ratw + The decay rate controlling has fast the Exponential declines. + """ + + self.centre = centre + self.normalization = normalization + self.rate = rate + + def model_data_1d_via_xvalues_from(self, xvalues: np.ndarray): + """ + Returns the symmetric 1D Exponential on an input list of Cartesian + x coordinates. + + The input xvalues are translated to a coordinate system centred on + the Exponential, via its ``centre``. + + The output is referred to as the ``model_data`` to signify that it + is a representation of the data from the + model. + + Parameters + ---------- + xvalues + The x coordinates in the original reference frame of the data. + """ + + transformed_xvalues = np.subtract(xvalues, self.centre) + return self.normalization * np.multiply( + self.rate, np.exp(-1.0 * self.rate * abs(transformed_xvalues)) + ) + + +We can easily compose a model consisting of 1 ``Gaussian`` object and 1 ``Exponential`` object using the ``af.Collection`` +object: + +.. code-block:: python + + model = af.Collection(gaussian=af.Model(Gaussian), exponential=af.Model(Exponential)) + +A ``Collection`` behaves analogous to a ``Model``, but it contains a multiple model components. + +We can see this by printing its ``paths`` attribute, where paths to all 6 free parameters via both model components +are shown. + +The paths have the entries ``.gaussian.`` and ``.exponential.``, which correspond to the names we input into +the ``af.Collection`` above. + +.. code-block:: python + + print(model.paths) + +The output is as follows: + +.. code-block:: bash + + [ + ('gaussian', 'centre'), + ('gaussian', 'normalization'), + ('gaussian', 'sigma'), + ('exponential', 'centre'), + ('exponential', 'normalization'), + ('exponential', 'rate') + ] + +We can use the paths to customize the priors of each parameter. + +.. code-block:: python + + model.gaussian.centre = af.UniformPrior(lower_limit=0.0, upper_limit=100.0) + model.gaussian.normalization = af.UniformPrior(lower_limit=0.0, upper_limit=1e2) + model.gaussian.sigma = af.UniformPrior(lower_limit=0.0, upper_limit=30.0) + model.exponential.centre = af.UniformPrior(lower_limit=0.0, upper_limit=100.0) + model.exponential.normalization = af.UniformPrior(lower_limit=0.0, upper_limit=1e2) + model.exponential.rate = af.UniformPrior(lower_limit=0.0, upper_limit=10.0) + +All of the information about the model created via the collection can be printed at once using its ``info`` attribute: + +.. code-block:: python + + print(model.info) + +The output appears as follows: + +.. 
code-block:: bash + + Total Free Parameters = 6 + model Collection (N=6) + gaussian Gaussian (N=3) + exponential Exponential (N=3) + + gaussian + centre UniformPrior [13], lower_limit = 0.0, upper_limit = 100.0 + normalization UniformPrior [14], lower_limit = 0.0, upper_limit = 100.0 + sigma UniformPrior [15], lower_limit = 0.0, upper_limit = 30.0 + exponential + centre UniformPrior [16], lower_limit = 0.0, upper_limit = 100.0 + normalization UniformPrior [17], lower_limit = 0.0, upper_limit = 100.0 + rate UniformPrior [18], lower_limit = 0.0, upper_limit = 10.0 + + +A model instance can again be created by mapping an input ``vector``, which now has 6 entries. + +.. code-block:: python + + instance = model.instance_from_vector(vector=[0.1, 0.2, 0.3, 0.4, 0.5, 0.01]) + +This ``instance`` contains each of the model components we defined above. + +The argument names input into the ``Collection`` define the attribute names of the ``instance``: + +.. code-block:: python + + print("Instance Parameters \n") + print("x (Gaussian) = ", instance.gaussian.centre) + print("normalization (Gaussian) = ", instance.gaussian.normalization) + print("sigma (Gaussian) = ", instance.gaussian.sigma) + print("x (Exponential) = ", instance.exponential.centre) + print("normalization (Exponential) = ", instance.exponential.normalization) + print("sigma (Exponential) = ", instance.exponential.rate) + +The output appear as follows: + +.. code-block:: bash + +The ``Analysis`` class above assumed the ``instance`` contained only a single model-component. + +We update its ``log_likelihood_function`` to use both model components in the ``instance`` to fit the data. + +.. code-block:: python + + class Analysis(af.Analysis): + def __init__(self, data: np.ndarray, noise_map: np.ndarray): + """ + The `Analysis` class acts as an interface between the data and + model in **PyAutoFit**. + + Its `log_likelihood_function` defines how the model is fitted to + the data and it is called many times by the non-linear search + fitting algorithm. + + In this example the `Analysis` `__init__` constructor only + contains the `data` and `noise-map`, but it can be easily + extended to include other quantities. + + Parameters + ---------- + data + A 1D numpy array containing the data (e.g. a noisy 1D signal) + fitted in the workspace examples. + noise_map + A 1D numpy array containing the noise values of the data, + used for computing the goodness of fit metric, the log likelihood. + """ + + super().__init__() + + self.data = data + self.noise_map = noise_map + + def log_likelihood_function(self, instance) -> float: + """ + Returns the log likelihood of a fit of a 1D Gaussian to the dataset. + + The data is fitted using an `instance` of multiple 1D profiles + (e.g. a `Gaussian`, `Exponential`) where + their `model_data_1d_via_xvalues_from` methods are called and summed + in order to create a model data representation that is fitted to the data. + """ + + """ + The `instance` that comes into this method is an instance of the + `Gaussian` and `Exponential` models above, which were created + via `af.Collection()`. + + It contains instances of every class we instantiated it with, where + each instance is named following the names given to the Collection, + which in this example is a `Gaussian` (with name `gaussian) and + Exponential (with name `exponential`). + + The parameter values are again chosen by the non-linear search, + based on where it thinks the high likelihood regions of parameter + space are. 
The lines of Python code are commented out below to + prevent excessive print statements. + + + # print("Gaussian Instance:") + # print("Centre = ", instance.gaussian.centre) + # print("Normalization = ", instance.gaussian.normalization) + # print("Sigma = ", instance.gaussian.sigma) + + # print("Exponential Instance:") + # print("Centre = ", instance.exponential.centre) + # print("Normalization = ", instance.exponential.normalization) + # print("Rate = ", instance.exponential.rate) + """ + """ + Get the range of x-values the data is defined on, to evaluate + the model of the Gaussian. + """ + xvalues = np.arange(self.data.shape[0]) + + """ + Internally, the `instance` variable is a list of all model + omponents pass to the `Collection` above. + + we can therefore iterate over them and use their + `model_data_1d_via_xvalues_from` methods to create the + summed overall model data. + """ + model_data = sum( + [ + profile_1d.model_data_1d_via_xvalues_from(xvalues=xvalues) + for profile_1d in instance + ] + ) + + """ + Fit the model gaussian line data to the observed data, computing the residuals, chi-squared and log likelihood. + """ + residual_map = self.data - model_data + chi_squared_map = (residual_map / self.noise_map) ** 2.0 + chi_squared = sum(chi_squared_map) + noise_normalization = np.sum(np.log(2 * np.pi * noise_map**2.0)) + log_likelihood = -0.5 * (chi_squared + noise_normalization) + + return log_likelihood + + + +We can now fit this model to the data using the same API we did before. + +.. code-block:: python + + analysis = Analysis(data=data, noise_map=noise_map) + + search = af.DynestyStatic( + nlive=100, + number_of_cores=1, + ) + + result = search.fit(model=model, analysis=analysis) + + +The ``info`` attribute shows the result in a readable format, showing that all 6 free parameters were fitted for. + +.. code-block:: python + + print(result.info) + +The output appears as follows: + +.. code-block:: bash + + Bayesian Evidence 144.86032973 + Maximum Log Likelihood 181.14287034 + Maximum Log Posterior 181.14287034 + + model Collection (N=6) + gaussian Gaussian (N=3) + exponential Exponential (N=3) + + Maximum Log Likelihood Model: + + gaussian + centre 50.223 + normalization 26.108 + sigma 9.710 + exponential + centre 50.057 + normalization 39.948 + rate 0.048 + + + Summary (3.0 sigma limits): + + gaussian + centre 50.27 (49.63, 50.88) + normalization 26.22 (21.37, 32.41) + sigma 9.75 (9.25, 10.27) + exponential + centre 50.04 (49.60, 50.50) + normalization 40.06 (37.60, 42.38) + rate 0.05 (0.04, 0.05) + + + Summary (1.0 sigma limits): + + gaussian + centre 50.27 (50.08, 50.49) + normalization 26.22 (24.33, 28.39) + sigma 9.75 (9.60, 9.90) + exponential + centre 50.04 (49.90, 50.18) + normalization 40.06 (39.20, 40.88) + rate 0.05 (0.05, 0.05) + +We can again use the max log likelihood instance to visualize the model data of the best fit model compared to the +data. + +.. 
code-block:: python + + instance = result.max_log_likelihood_instance + + model_gaussian = instance.gaussian.model_data_1d_via_xvalues_from( + xvalues=np.arange(data.shape[0]) + ) + model_exponential = instance.exponential.model_data_1d_via_xvalues_from( + xvalues=np.arange(data.shape[0]) + ) + model_data = model_gaussian + model_exponential + + plt.errorbar( + x=xvalues, y=data, yerr=noise_map, color="k", ecolor="k", elinewidth=1, capsize=2 + ) + plt.plot(range(data.shape[0]), model_data, color="r") + plt.plot(range(data.shape[0]), model_gaussian, "--") + plt.plot(range(data.shape[0]), model_exponential, "--") + plt.title("Dynesty model fit to 1D Gaussian + Exponential dataset.") + plt.xlabel("x values of profile") + plt.ylabel("Profile normalization") + plt.show() + plt.close() + +The plot appears as follows: + +.. image:: https://raw.githubusercontent.com/rhayes777/PyAutoFit/main/docs/images/toy_model_fit.png + :width: 600 + :alt: Alternative text \ No newline at end of file diff --git a/docs/overview/scientific_workflow.rst b/docs/overview/scientific_workflow.rst index 2f0509215..e367246ea 100644 --- a/docs/overview/scientific_workflow.rst +++ b/docs/overview/scientific_workflow.rst @@ -3,17 +3,611 @@ Scientific Workflow =================== -NOTE: A complete description of the scientific workflow, including Python code extracts and visuals, is not complete -yet. It will be written in the near future. This script gives a concise overview of how **PyAutoFit** can be used to -perform a scientific workflow. +A scientific workflow are the tasks that you perform to undertake scientific study. This includes fitting models to +datasets, interpreting the results and gaining insight and intuition about your scientific problem. -The functionality necessary to develop an effective scientific workflow is described in the cookbooks, which are -available on readthedocs and in the `cookbooks` package of the workspace. +Different problems require different scientific workflows, depending on the complexity of the model, the size of the +dataset and the computational run times of the analysis. For example, some problems have a single dataset, which +is fitted with many different models in order to gain scientific insight. Other problems have many thousands of datasets, +that are fitted with a single model, in order to perform a large scale scientific study. +The **PyAutoFit** API is flexible, customisable and extensible, enabling user to a develop scientific workflow +that is tailored to their specific problem. -A scientific workflow is the series of tasks that someone performs to undertake scientific study though model fitting. -This includes fitting model a dataset, interpret the results and gaining insight and intuition for what works for their -problem. +This overview covers the key features of **PyAutoFit** that enable the development of effective scientific workflows, +which are as follows: + +- **On The Fly**: Displaying results on-the-fly (e.g. in a Jupyter notebooks), providing immediate feedback for adapting a scientific workflow. +- **Hard Disk Output**: Output results to hard-disk with high levels of customization, enabling quick but detailed inspection of fits to many datasets. +- **Visualization**: Output model specific visualization to produce custom plots that further streamline result inspection. +- **Loading Results**: Load results from hard-disk to inspect and interpret the results of a model-fit. 
+- **Result Customization**: Customize the result returned by a fit to streamline scientific interpretation. +- **Model Composition**: Extensible model composition makes straight forward to fit many models with different parameterizations and assumptions. +- **Searches**: Support for many non-linear searches (nested sampling, MCMC etc.), so that a user finds the right method for their problem. +- **Configs**: Configuration files which set default model, fitting and visualization behaviour, streamlining model-fitting. +- **Multiple Datasets**: Dedicated support for simultaneously fitting multiple datasets, enabling scalable analysis of large datasets. +- **Database**: Store results in a relational sqlite3 database, enabling streamlined management of large modeling results. +- **Scaling Up**: Advise on how to scale up your scientific workflow from small datasets to large datasets. + +On The Fly +---------- + +When a model-fit is running, information about the fit is displayed in user specified intervals on-the-fly. + +The frequency of on-the-fly output is controlled by a search's ``iterations_per_update``, which specifies how often +this information is output. The example code below outputs on-the-fly information every 1000 iterations: + +.. code-block:: python + + search = af.DynestyStatic( + iterations_per_update=1000 + ) + +In a Jupyter notebook, the default behaviour is for this information to appear in the cell being run and for it to include: + +- Text displaying the maximum likelihood model inferred so far and related information. +- A visual showing how the search has sampler parameter space so far, giving intuition on how the search is performing. + +Here is an image of how this looks: + +The most valuable on-the-fly output is often specific to the model and dataset you are fitting. + +For example, it might be a ``matplotlib`` subplot showing the maximum likelihood model's fit to the dataset, complete +with residuals and other diagnostic information. + +The on-the-fly output can be fully customized by extending the ``on_the_fly_output`` method of the ``Analysis`` class +being used to fit the model. + +The example below shows how this is done for the simple example of fitting a 1D Gaussian profile: + +.. code-block:: python + + class Analysis(af.Analysis): + def __init__(self, data: np.ndarray, noise_map: np.ndarray): + """ + Example Analysis class illustrating how to customize the on-the-fly output of a model-fit. + """ + super().__init__() + + self.data = data + self.noise_map = noise_map + + def on_the_fly_output(self, instance): + """ + During a model-fit, the `on_the_fly_output` method is called throughout the non-linear search. + + The `instance` passed into the method is maximum log likelihood solution obtained by the model-fit so far and it can be + used to provide on-the-fly output showing how the model-fit is going. + """ + xvalues = np.arange(analysis.data.shape[0]) + + model_data = instance.model_data_1d_via_xvalues_from(xvalues=xvalues) + + """ + The visualizer now outputs images of the best-fit results to hard-disk (checkout `visualizer.py`). + """ + import matplotlib.pyplot as plt + + plt.errorbar( + x=xvalues, + y=self.data, + yerr=self.noise_map, + color="k", + ecolor="k", + elinewidth=1, + capsize=2, + ) + plt.plot(xvalues, model_data, color="r") + plt.title("Maximum Likelihood Fit") + plt.xlabel("x value of profile") + plt.ylabel("Profile Normalization") + plt.show() # By using `plt.show()` the plot will be displayed in the Jupyter notebook. 
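
To use this on-the-fly output, the customized ``Analysis`` is simply passed to a non-linear search in the usual way,
with ``iterations_per_update`` controlling how often the on-the-fly output is produced. The sketch below assumes the
``data``, ``noise_map`` and ``model`` set up in the basics overview:

.. code-block:: python

    analysis = Analysis(data=data, noise_map=noise_map)

    # The on-the-fly output defined above is produced every 1000 iterations of the search.
    search = af.DynestyStatic(iterations_per_update=1000)

    result = search.fit(model=model, analysis=analysis)
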
+ +Here is how the visuals appear in a Jupyter Notebook: + +In the early stages of setting up a scientific workflow, on-the-fly output is invaluable. + +It provides immediate feedback on how your model fitting is performing (which at the start of a project is often not very well!). +It also forces you to think first and foremost about how to visualize your fit and diagnose whether things are performing +well or not. + +We recommend users starting a new model-fitting problem should always begin by setting up on-the-fly output! + +.. note:: + + The function ``on_the_fly_output`` is not implemented yet, we are working on this currently! + +Hard Disk Output +---------------- + +By default, a non-linear search does not output its results to hard-disk and its results can only be inspected +in a Jupyter Notebook or Python script via the ``result`` that is returned. + +However, the results of any non-linear search can be output to hard-disk by passing the ``name`` and / or ``path_prefix`` +attributes, which are used to name files and output the results to a folder on your hard-disk. + +The benefits of doing this include: + +- Inspecting results via folders on your computer is more efficient than using a Jupyter Notebook for multiple datasets. +- Results are output on-the-fly, making it possible to check that a fit is progressing as expected mid way through. +- Additional information about a fit (e.g. visualization) can be output (see below). +- Unfinished runs can be resumed from where they left off if they are terminated. +- On high performance super computers results often must be output in this way. + +The code below shows how to enable outputting of results to hard-disk: + +.. code-block:: python + + search = af.Emcee( + path_prefix=path.join("folder_0", "folder_1"), + name="example_mcmc" + ) + +The screenshot below shows the output folder where all output is enabled: + + + +Lets consider three parts of this output folder: + +- **Unique Identifier**: Results are output to a folder which is a collection of random characters, which is uniquely generated based on the model fit. For scientific workflows where many models are fitted this means many fits an be performed without manually updating the output paths. +- **Info Files**: Files containing useful information about the fit are available, for example ``model.info`` contains the full model composition and ``search.summary`` contains information on how long the search has been running. +- **Files Folder**: The ``files`` folder contains detailed information about the fit, as ``.json`` files which can be loaded as (e.g. ``model.json`` can be used to load the``Model``), so that if you return to results at a later date you can remind yourself how the fit was performed. + +**PyAutoFit** has lots more tools for customizing hard-disk output, for example configuration files controlling what gets output in order to manage hard-disk space use and model-specific ``.json`` files. + +For many scientific workflows, being able to output so much information about each fit is integral to ensuring you inspect and interpret the results +accurately. On the other hand, there are many problems where outputting so much information to hard-disk may overwhelm a user and prohibit +scientific study, which is why it can be easily disabled by not passing the search a ``name`` or ``path prefix``! 
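
To make the description above concrete, the output of the search above lands in a folder structure along the
following lines (an illustrative sketch; the unique identifier and the exact files present depend on the model,
search and configuration):

.. code-block:: bash

    output/folder_0/folder_1/example_mcmc/<unique_identifier>/
        model.info
        search.summary
        files/
            model.json
            search.json
            samples.csv
            samples_summary.json
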
+ +Visualization +------------- + +If search hard-disk output is enabled, visualization of the model-fit can also be output to hard-disk, which +for many scientific workflows is integral to assessing the quality of a fit quickly and effectively. + +This is done by overwriting the ``Visualizer`` object of an ``Analysis`` class with a custom ``Visualizer`` class, +as illustrated below. + +.. code-block:: python + + class Visualizer(af.Visualizer): + + @staticmethod + def visualize_before_fit( + analysis, + paths: af.DirectoryPaths, + model: af.AbstractPriorModel + ): + """ + Before a model-fit, the `visualize_before_fit` method is called to perform visualization. + + The function receives as input an instance of the `Analysis` class which is being used to perform the fit, + which is used to perform the visualization (e.g. it contains the data and noise map which are plotted). + + This can output visualization of quantities which do not change during the model-fit, for example the + data and noise-map. + + The `paths` object contains the path to the folder where the visualization should be output, which is determined + by the non-linear search `name` and other inputs. + """ + + import matplotlib.pyplot as plt + + xvalues = np.arange(analysis.data.shape[0]) + + plt.errorbar( + x=xvalues, + y=analysis.data, + yerr=analysis.noise_map, + color="k", + ecolor="k", + elinewidth=1, + capsize=2, + ) + plt.title("Maximum Likelihood Fit") + plt.xlabel("x value of profile") + plt.ylabel("Profile Normalization") + plt.savefig(path.join(paths.image_path, f"data.png")) + plt.clf() + + @staticmethod + def visualize( + analysis, + paths: af.DirectoryPaths, + instance, + during_analysis + ): + """ + During a model-fit, the `visualize` method is called throughout the non-linear search. + + The function receives as input an instance of the `Analysis` class which is being used to perform the fit, + which is used to perform the visualization (e.g. it generates the model data which is plotted). + + The `instance` passed into the visualize method is maximum log likelihood solution obtained by the model-fit + so far and it can be used to provide on-the-fly images showing how the model-fit is going. + + The `paths` object contains the path to the folder where the visualization should be output, which is determined + by the non-linear search `name` and other inputs. + """ + xvalues = np.arange(analysis.data.shape[0]) + + model_data = instance.model_data_1d_via_xvalues_from(xvalues=xvalues) + residual_map = analysis.data - model_data + + """ + The visualizer now outputs images of the best-fit results to hard-disk (checkout `visualizer.py`). + """ + import matplotlib.pyplot as plt + + plt.errorbar( + x=xvalues, + y=analysis.data, + yerr=analysis.noise_map, + color="k", + ecolor="k", + elinewidth=1, + capsize=2, + ) + plt.plot(xvalues, model_data, color="r") + plt.title("Maximum Likelihood Fit") + plt.xlabel("x value of profile") + plt.ylabel("Profile Normalization") + plt.savefig(path.join(paths.image_path, f"model_fit.png")) + plt.clf() + + plt.errorbar( + x=xvalues, + y=residual_map, + yerr=analysis.noise_map, + color="k", + ecolor="k", + elinewidth=1, + capsize=2, + ) + plt.title("Residuals of Maximum Likelihood Fit") + plt.xlabel("x value of profile") + plt.ylabel("Residual") + plt.savefig(path.join(paths.image_path, f"model_fit.png")) + plt.clf() + +The ``Analysis`` class is defined following the same API as before, but now with its `Visualizer` class attribute +overwritten with the ``Visualizer`` class above. + +.. 
code-block:: python + + class Analysis(af.Analysis): + + """ + This over-write means the `Visualizer` class is used for visualization throughout the model-fit. + + This `VisualizerExample` object is in the `autofit.example.visualize` module and is used to customize the + plots output during the model-fit. + + It has been extended with visualize methods that output visuals specific to the fitting of `1D` data. + """ + Visualizer = Visualizer + + def __init__(self, data, noise_map): + """ + An Analysis class which illustrates visualization. + """ + super().__init__() + + self.data = data + self.noise_map = noise_map + + def log_likelihood_function(self, instance): + """ + The `log_likelihood_function` is identical to the example above + """ + xvalues = np.arange(self.data.shape[0]) + + model_data = instance.model_data_1d_via_xvalues_from(xvalues=xvalues) + residual_map = self.data - model_data + chi_squared_map = (residual_map / self.noise_map) ** 2.0 + chi_squared = sum(chi_squared_map) + noise_normalization = np.sum(np.log(2 * np.pi * noise_map**2.0)) + log_likelihood = -0.5 * (chi_squared + noise_normalization) + + return log_likelihood + +Visualization of the results of the search, such as the corner plot of what is called the "Probability Density +Function", are also automatically output during the model-fit on the fly. + +Loading Results +--------------- + +Your scientific workflow will likely involve many model-fits, which will be output to hard-disk in folders. + +A streamlined API for loading these results from hard-disk to Python variables, so they can be manipulated and +inspected in a Python script or Jupiter notebook is therefore essential. + +The **PyAutoFit** aggregator provides this API, you simply point it at the folder containing the results and it +loads the results (and other information) of all model-fits in that folder. + +.. code-block:: python + + from autofit.aggregator.aggregator import Aggregator + + agg = Aggregator.from_directory( + directory=path.join("output", "result_folder"), + ) + +The ``values`` method is used to specify the information that is loaded from the hard-disk, for example the +``samples`` of the model-fit. + +The for loop below iterates over all results in the folder passed to the aggregator above. + +.. code-block:: python + + for samples in agg.values("samples"): + print(samples.parameter_lists[0]) + +Result loading uses Python generators to ensure that memory use is minimized, meaning that even when loading +thousands of results from hard-disk the memory use of your machine is not exceeded. + +The `result cookbook `_ gives a full run-through of +the tools that allow results to be loaded and inspected. + +Result Customization +-------------------- + +The ``Result`` object is returned by a non-linear search after running the following code: + +.. code-block:: python + + result = search.fit(model=model, analysis=analysis) + +An effective scientific workflow ensures that this object contains all information a user needs to quickly inspect +the quality of a model-fit and undertake scientific interpretation. + +The result can be can be customized to include additional information about the model-fit that is specific to your +model-fitting problem. + +For example, for fitting 1D profiles, the ``Result`` could include the maximum log likelihood model 1D data: + +.. 
code-block:: python + + print(result.max_log_likelihood_model_data_1d) + +To do this we use the custom result API, where we first define a custom ``Result`` class which includes the +property ``max_log_likelihood_model_data_1d``: + +.. code-block:: python + + class ResultExample(af.Result): + + @property + def max_log_likelihood_model_data_1d(self) -> np.ndarray: + """ + Returns the maximum log likelihood model's 1D model data. + + This is an example of how we can pass the `Analysis` class a custom `Result` object and extend this result + object with new properties that are specific to the model-fit we are performing. + """ + xvalues = np.arange(self.analysis.data.shape[0]) + + return self.instance.model_data_1d_via_xvalues_from(instance=xvalues) + +The custom result has access to the analysis class, meaning that we can use any of its methods or properties to +compute custom result properties. + +To make it so that the ``ResultExample`` object above is returned by the search we overwrite the ``Result`` class attribute +of the ``Analysis`` and define a ``make_result`` object describing what we want it to contain: + +.. code-block:: python + + class Analysis(af.Analysis): + + """ + This overwrite means the `ResultExample` class is returned after the model-fit. + """ + Result = ResultExample + + def __init__(self, data, noise_map): + """ + An Analysis class which illustrates custom results. + """ + super().__init__() + + self.data = data + self.noise_map = noise_map + + def log_likelihood_function(self, instance): + """ + The `log_likelihood_function` is identical to the example above + """ + xvalues = np.arange(self.data.shape[0]) + + model_data = instance.model_data_1d_via_xvalues_from(xvalues=xvalues) + residual_map = self.data - model_data + chi_squared_map = (residual_map / self.noise_map) ** 2.0 + chi_squared = sum(chi_squared_map) + noise_normalization = np.sum(np.log(2 * np.pi * noise_map**2.0)) + log_likelihood = -0.5 * (chi_squared + noise_normalization) + + return log_likelihood + + def make_result( + self, + samples_summary: af.SamplesSummary, + paths: af.AbstractPaths, + samples: Optional[af.SamplesPDF] = None, + search_internal: Optional[object] = None, + analysis: Optional[object] = None, + ) -> Result: + """ + Returns the `Result` of the non-linear search after it is completed. + + The result type is defined as a class variable in the `Analysis` class (see top of code under the python code + `class Analysis(af.Analysis)`. + + The result can be manually overwritten by a user to return a user-defined result object, which can be extended + with additional methods and attribute specific to the model-fit. + + This example class does example this, whereby the analysis result has been overwritten with the `ResultExample` + class, which contains a property `max_log_likelihood_model_data_1d` that returns the model data of the + best-fit model. This API means you can customize your result object to include whatever attributes you want + and therefore make a result object specific to your model-fit and model-fitting problem. + + The `Result` object you return can be customized to include: + + - The samples summary, which contains the maximum log likelihood instance and median PDF model. + + - The paths of the search, which are used for loading the samples and search internal below when a search + is resumed. + + - The samples of the non-linear search (e.g. MCMC chains) also stored in `samples.csv`. 
+ + - The non-linear search used for the fit in its internal representation, which is used for resuming a search + and making bespoke visualization using the search's internal results. + + - The analysis used to fit the model (default disabled to save memory, but option may be useful for certain + projects). + + Parameters + ---------- + samples_summary + The summary of the samples of the non-linear search, which include the maximum log likelihood instance and + median PDF model. + paths + An object describing the paths for saving data (e.g. hard-disk directories or entries in sqlite database). + samples + The samples of the non-linear search, for example the chains of an MCMC run. + search_internal + The internal representation of the non-linear search used to perform the model-fit. + analysis + The analysis used to fit the model. + + Returns + ------- + Result + The result of the non-linear search, which is defined as a class variable in the `Analysis` class. + """ + return self.Result( + samples_summary=samples_summary, + paths=paths, + samples=samples, + search_internal=search_internal, + analysis=self + ) + +Model Composition +----------------- + +Many scientific workflows require composing and fitting many different models. + +The simplest examples are when slight tweaks to the model are required, for example: + +- **Parameter Assignment**: Fix certain parameters to input values or linking parameters in the model together so they have the same values. +- **Parameter Assertions**: Place assertions on model parameters, for example requiring that one parameter is higher than another parameter. +- **Model Arithmitic**: Use arithmitic to define relations between parameters, for example a ``y = mx + c`` where ``m`` and ``c`` are model parameters. + +In more complex situations, models with many thousands of parameters consisting of many model components may be fitted. + +**PyAutoFit**'s advanced model composition API has many tools for compositing complex models, including constructing +models from lists of Python classes and hierarchies of Python classes. + +The `model cookbook `_ gives a full run-through of +the model composition API. + +Searches +-------- + +Different model-fitting problems require different methods to fit the model. + +The search appropriate for your problem depends on many factors: + +- **Model Dimensions**: How many parameters does the model and its non-linear parameter space consist of? +- **Model Complexity**: Different models have different parameter degeneracies requiring different non-linear search techniques. +- **Run Times**: How fast does it take to evaluate a likelihood and perform the model-fit? + +**PyAutoFit** supports many non-linear searches ensuring that each user can find the best method for their problem. + +In the early stages of setting up your scientific workflow, you should experiment with different searches, determine which +reliable infer the maximum likelihood fits to the data and profile which ones do so in the faster times. + +The `search cookbook `_ gives a full run-through of +all non-linear searches that are available and how to customize them. + +.. note:: + + There are currently no documentation guiding reads on what search might be appropriate for their problem and how to + profile and experiment with different methods. Writing such documentation is on the to do list and will appear + in the future. However, you can make progress now simply using visuals output by PyAutoFit and the ``search.summary` file. 
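
As a concrete example of experimenting with different searches, the same model and analysis can be fitted with a
nested sampler and an MCMC method by changing only the search object; the settings below are illustrative rather
than recommendations:

.. code-block:: python

    # Nested sampling via Dynesty.
    search = af.DynestyStatic(name="fit_dynesty", nlive=100)
    result_nest = search.fit(model=model, analysis=analysis)

    # MCMC via emcee, reusing the same model and analysis unchanged.
    search = af.Emcee(name="fit_emcee", nwalkers=30, nsteps=1000)
    result_mcmc = search.fit(model=model, analysis=analysis)

Comparing the inferred maximum likelihood models and the run times recorded in each fit's ``search.summary`` file is
a quick way to profile which method is reliable and fast for your problem.
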
+ +Configs +------- + +As you develop your scientific workflow, you will likely find that you are often setting up models with the same priors +every time and using the same non-linear search settings. + +This can lead to long lines of Python code repeating the same inputs. + +Configuration files can be set up to input default values, meaning that the same prior inputs and settings do not need +to be repeated in every script. This produces more concise Python code and means you have less to think about when +performing model-fitting. + +The `configs cookbook `_ gives a full run-through of +configuration file setup. + +Multiple Datasets +----------------- + +Many model-fitting problems require multiple datasets to be fitted simultaneously in order to provide the best constraints on the model. + +**PyAutoFit** makes it straight forward to scale-up your scientific workflow to fits to multiple datasets. All you +do is define ``Analysis`` classes describing how the model fits each dataset and sum them together: + +.. code-block:: python + + analysis_0 = Analysis(data=data_0, noise_map=noise_map_0) + analysis_1 = Analysis(data=data_1, noise_map=noise_map_1) + + # This means the model is fitted to both datasets simultaneously. + + analysis = analysis_0 + analysis_1 + + # summing a list of analysis objects is also a valid API: + + analysis = sum([analysis_0, analysis_1]) + +By summing analysis objects the following happen: + +- The log likelihood values computed by the ``log_likelihood_function`` of each individual analysis class are summed to give an overall log likelihood value that the non-linear search samples when model-fitting. + +- The output path structure of the results goes to a single folder, which includes sub-folders for the visualization of every individual analysis object based on the ``Analysis`` object's ``visualize`` method. + +In the example above, the same ``Analysis`` class was used twice (to set up ``analysis_0`` and ``analysis_1``) and summed. + +**PyAutoFit** supports the summing together of different ``Analysis`` classes, which may take as input completely different +datasets and fit the model to them (via the ``log_likelihood_function``) following a completely different procedure. + +The `multiple datasets cookbook `_ gives a full run-through +of fitting multiple dataset. This includes a dedicated API for customizing how the model changes over the different datasets +and how the result return becomes a list containing information on the fit to every dataset. + +Database +-------- + +The default behaviour of model-fitting results output is to be written to hard-disk in folders. These are simple to +navigate and manually check. + +For small scientific workflows and model-fitting tasks this is sufficient, however it does not scale well when +performing many model fits to large datasets, because manual inspection of results becomes time consuming. + +All results can therefore be output to an sqlite3 (https://docs.python.org/3/library/sqlite3.html) relational database, +meaning that results can be loaded into a Jupyter notebook or Python script for inspection, analysis and interpretation. +This database supports advanced querying, so that specific model-fits (e.g., which fit a certain model or dataset) can +be loaded. + +The `database cookbook `_ gives a full run-through +of how to use the database functionality. + +Scaling Up +---------- + +Irrespective of your final scientific goal, you should always Initially, the study will be performed on a small number of datasets (e.g. 
~10s of datasets), as the user develops their model and gains insight into what works well. This is a manual process of trial and error, which often involves @@ -29,16 +623,9 @@ must be suitable for this environment. **PyAutoFit** enables the development of effective scientific workflows for both small and large datasets, thanks to the following features: -- **On The Fly Feedback**: The results of a model-fit are shown to the user on-the-fly in Jupiter notebooks, providing quick feedback allowing them adapt their scientific workflow accordingly. - -- **Hard-disk Output**: All results of an analysis can output to hard-disk with high levels of customization, ensuring that quick inspection of results ~100s of datasets is feasible. - -- **Visualization**: Results output includes model specific visualization, allowing the user to produce custom plots that further streamline result inspection. - -- **Model Composition**: The API for model composition is extensible, meaning a user can easily experiment with different model assumptions and priors, as well as comparing many different models. - -- **Searches**: Support for a wide variety of non-linear searches (nested sampling, MCMC etc.) means a user can find the fitting algorithm that is optimal for their model and dataset. -- **Multiple Datasets**: Dedicated support for simultaneously fitting multiple datasets simultaneously means a user can easily combine different datasets in their analysis in order to fit more complex models. +**PyAutoFit** supports **on-the-fly** output whilst the model-fit is running. -- **Database**: The results of a model-fit can be output to a relational sqlite3 database, meaning that once you are ready to scale up to large datasets you can easily do so. \ No newline at end of file +For example, in a Jupyter notebook, text displaying the maximum likelihood model inferred so far alongside visuals +showing the parameter sampling are output to the cell as the search runs. On the fly output can be fully customized +with visualization specific to your model and data, as shown in the following overviews and cookbooks. \ No newline at end of file diff --git a/docs/overview/statistical_methods.rst b/docs/overview/statistical_methods.rst new file mode 100644 index 000000000..45432cba4 --- /dev/null +++ b/docs/overview/statistical_methods.rst @@ -0,0 +1,27 @@ +.. _statistical_methods: + +Statistical Methods +=================== + +**PyAutoFit** supports numerous statistical methods that allow for advanced Bayesian inference to be performed. + +Graphical Models +---------------- + +Hierarchical Models +------------------- + +Model Comparison +---------------- + +Search Chaining +--------------- + +Interpolation +------------- + +Search Grid Search +------------------ + +Sensitivity Mapping +------------------- \ No newline at end of file diff --git a/docs/overview/the_basics.rst b/docs/overview/the_basics.rst index 3740d6de9..60b55ab14 100644 --- a/docs/overview/the_basics.rst +++ b/docs/overview/the_basics.rst @@ -3,6 +3,27 @@ The Basics ========== +**PyAutoFit** is a Python based probabilistic programming language for model fitting and Bayesian inference +of large datasets. + +The basic **PyAutoFit** API allows us a user to quickly compose a probabilistic model and fit it to data via a +log likelihood function, using a range of non-linear search algorithms (e.g. MCMC, nested sampling). + +This overview gives a run through of: + + - **Models**: Use Python classes to compose the model which is fitted to data. 
+ - **Instances**: Create instances of the model via its Python class. + - **Analysis**: Define an ``Analysis`` class which includes the log likelihood function that fits the model to the data. + - **Searches**: Choose an MCMC, nested sampling or maximum likelihood estimator non-linear search algorithm that fits the model to the data. + - **Model Fit**: Fit the model to the data using the chosen non-linear search, with on-the-fly results and visualization. + - **Results**: Use the results of the search to interpret and visualize the model fit. + +This overviews provides a high level of the basic API, with more advanced functionality described in the following +overviews and the **PyAutoFit** cookbooks. + +Example +------- + To illustrate **PyAutoFit** we'll use the example modeling problem of fitting a 1D Gaussian profile to noisy data. To begin, lets import ``autofit`` (and ``numpy``) using the convention below: @@ -12,12 +33,10 @@ To begin, lets import ``autofit`` (and ``numpy``) using the convention below: import autofit as af import numpy as np -Example -------- -The example ``data`` with errors (black) and the model-fit (red), are shown below: +The example ``data`` with errors (black) is shown below: -.. image:: https://raw.githubusercontent.com/rhayes777/PyAutoFit/feature/docs_update/docs/images/data.png +.. image:: https://raw.githubusercontent.com/rhayes777/PyAutoFit/main/docs/images/data.png :width: 600 :alt: Alternative text @@ -29,30 +48,23 @@ The 1D signal was generated using a 1D Gaussian profile of the form: Where: -``x``: Is the x-axis coordinate where the ``Gaussian`` is evaluated. + ``x``: The x-axis coordinate where the ``Gaussian`` is evaluated. -``N``: Describes the overall normalization of the Gaussian. + ``N``: The overall normalization of the Gaussian. -``sigma``: Describes the size of the Gaussian (Full Width Half Maximum = $\mathrm {FWHM}$ = $2{\sqrt {2\ln 2}}\;\sigma$) + ``sigma``: Describes the size of the Gaussian. -Our modeling task is to fit the signal with a 1D Gaussian and recover its parameters (``x``, ``N``, ``sigma``). +Our modeling task is to fit the data with a 1D Gaussian and recover its parameters (``x``, ``N``, ``sigma``). Model ----- -We therefore need to define a 1D Gaussian as a "model component" in **PyAutoFit**. - -A model component is written as a Python class using the following format: +We therefore need to define a 1D Gaussian as a **PyAutoFit** model. -- The name of the class is the name of the model component, in this case, "Gaussian". - -- The input arguments of the constructor (the ``__init__`` method) are the parameters of the model, in this case ``centre``, ``normalization`` and ``sigma``. - -- The default values of the input arguments define whether a parameter is a single-valued ``float`` or a multi-valued ``tuple``. In this case, all 3 input parameters are floats. - -- It includes functions associated with that model component, which will be used when fitting the model to data. +We do this by writing it as the following Python class: .. code-block:: python + class Gaussian: def __init__( self, @@ -100,6 +112,16 @@ A model component is written as a Python class using the following format: np.exp(-0.5 * np.square(np.divide(transformed_xvalues, self.sigma))), ) +The **PyAutoFit** model above uses the following format: + +- The name of the class is the name of the model, in this case, "Gaussian". 
+ +- The input arguments of the constructor (the ``__init__`` method) are the parameters of the model, in this case ``centre``, ``normalization`` and ``sigma``. + +- The default values of the input arguments define whether a parameter is a single-valued ``float`` or a multi-valued ``tuple``. In this case, all 3 input parameters are floats. + +- It includes functions associated with that model component, which are used when fitting the model to data. + To compose a model using the ``Gaussian`` class above we use the ``af.Model`` object. @@ -119,15 +141,19 @@ This gives the following output: (normalization, LogUniformPrior [2], lower_limit = 1e-06, upper_limit = 1000000.0), (sigma, UniformPrior [3], lower_limit = 0.0, upper_limit = 25.0) +.. note:: + + You may be wondering where the priors above come from. **PyAutoFit** allows a user to set up configuration files that define + the default priors on every model parameter, to ensure that priors are defined in a consistent and concise way. More + information on configuration files is provided in the next overview and the cookbooks. + The model has a total of 3 parameters: .. code-block:: python print(model.total_free_parameters) -All model information is given by printing its ``info`` attribute. - -This shows that each model parameter has an associated prior. +All model information is given by printing its ``info`` attribute: .. code-block:: python @@ -145,9 +171,7 @@ This gives the following output: normalization LogUniformPrior [2], lower_limit = 1e-06, upper_limit = 1000000.0 sigma UniformPrior [3], lower_limit = 0.0, upper_limit = 25.0 - -The priors can be manually altered as follows, noting that these updated files will be used below when we fit the -model to data. +The priors can be manually altered as follows: .. code-block:: python @@ -155,7 +179,6 @@ model to data. model.normalization = af.UniformPrior(lower_limit=0.0, upper_limit=1e2) model.sigma = af.UniformPrior(lower_limit=0.0, upper_limit=30.0) - Printing the ``model.info`` displayed these updated priors. .. code-block:: python @@ -177,8 +200,8 @@ This gives the following output: Instances --------- -Instances of the model components above (created via ``af.Model``) can be created, where an input ``vector`` of -parameters is mapped to create an instance of the Python class of the model. +Instances of a **PyAutoFit** model (created via ``af.Model``) can be created, where an input ``vector`` of parameter +values is mapped to create an instance of the model's Python class. We first need to know the order of parameters in the model, so we know how to define the input ``vector``. This information is contained in the models ``paths`` attribute: @@ -193,13 +216,7 @@ This gives the following output: [('centre',), ('normalization',), ('sigma',)] -We input values for the 3 free parameters of our model following the order of paths above: - -1) ``centre=30.0`` -2) ``normalization=2.0`` -3) ``sigma=3.0`` - -This creates an ``instance`` of the Gaussian class via the model. +We input values for the 3 free parameters of our model following the order of paths above (``centre=30.0``, ``normalization=2.0`` and ``sigma=3.0``): .. code-block:: python @@ -257,26 +274,27 @@ create a realization of the ``Gaussian`` and plot it. Here is what the plot looks like: -.. image:: https://raw.githubusercontent.com/rhayes777/PyAutoFit/feature/docs_update/docs/images/model_gaussian.png +.. 
image:: https://raw.githubusercontent.com/rhayes777/PyAutoFit/main/docs/images/model_gaussian.png :width: 600 :alt: Alternative text -This "model mapping", whereby models map to an instances of their Python classes, is integral to the core **PyAutoFit** -API for model composition and fitting. +.. note:: + + Mapping models to instance of their Python classes is an integral part of the core **PyAutoFit** API. It enables + the advanced model composition and results management tools illustrated in the following overviews and cookbooks. Analysis -------- -Now we've defined our model, we need to inform **PyAutoFit** how to fit it to data. +We now tell **PyAutoFit** how to fit the model to data. -We therefore define an ``Analysis`` class, which includes: +We define an ``Analysis`` class, which includes: -- An ``__init__`` constructor, which takes as input the ``data`` and ``noise_map``. This could be extended to include anything else necessary to fit the model to the data. +- An ``__init__`` constructor, which takes as input the ``data`` and ``noise_map`` (this can be extended with anything else necessary to fit the model to the data). -- A ``log_likelihood_function``, which defines how given an ``instance`` of the model we fit it to the data and return a log likelihood value. +- A ``log_likelihood_function``, defining how for an ``instance`` of the model we fit it to the data and return a log likelihood value. -Read the comments and docstrings of the ``Analysis`` object below in detail for more insights into how this object -works. +Read the comments and docstrings of the ``Analysis`` object below in detail for a full description of how an analysis works. .. code-block:: python @@ -297,8 +315,8 @@ works. Parameters ---------- data - A 1D numpy array containing the data (e.g. a noisy 1D signal) f - itted in the workspace examples. + A 1D numpy array containing the data (e.g. a noisy 1D signal) + fitted in the readthedocs and workspace examples. noise_map A 1D numpy array containing the noise values of the data, used for computing the goodness of fit metric, the log likelihood. @@ -322,15 +340,14 @@ works. The ``instance`` that comes into this method is an instance of the ``Gaussian`` model above, which was created via ``af.Model()``. - The parameter values are chosen by the non-linear search, based on where - it thinks the high likelihood regions of parameter space are. + The parameter values are chosen by the non-linear search and therefore are based + on where it has mapped out the high likelihood regions of parameter space are. The lines of Python code are commented out below to prevent excessive print statements when we run the non-linear search, but feel free to uncomment them and run the search to see the parameters of every instance that it fits. - # print("Gaussian Instance:") # print("Centre = ", instance.centre) # print("Normalization = ", instance.normalization) @@ -359,7 +376,6 @@ works. return log_likelihood - Create an instance of the ``Analysis`` class by passing the ``data`` and ``noise_map``. .. code-block:: python @@ -370,24 +386,31 @@ Create an instance of the ``Analysis`` class by passing the ``data`` and ``noise Non Linear Search ----------------- -We have defined the model that we want to fit the data, and the analysis class that performs this fit. +We now have a model to fit to the data, and an analysis class that performs this fit. We now choose our fitting algorithm, called the "non-linear search", and fit the model to the data. 
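+Before handing the model and analysis to a search, it can be worth a quick sanity check that the
+``log_likelihood_function`` runs end-to-end. A minimal sketch, reusing the ``instance_from_vector`` mapping shown
+earlier (the parameter values below are arbitrary and purely illustrative):
+
+.. code-block:: python
+
+    # Map an arbitrary (centre, normalization, sigma) vector to a Gaussian instance
+    # and evaluate its log likelihood against the data.
+    instance = model.instance_from_vector(vector=[50.0, 25.0, 10.0])
+
+    print(analysis.log_likelihood_function(instance=instance))
+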
-For this example, we choose the nested sampling algorithm Dynesty. A wide variety of non-linear searches are -available in **PyAutoFit** (see ?). +**PyAutoFit** supports many non-linear searches, which broadly speaking fall into three categories, MCMC, nested +sampling and maximum likelihood estimators. + +For this example, we choose the nested sampling algorithm Dynesty. .. code-block:: python search = af.DynestyStatic( nlive=100, - number_of_cores=1, ) +.. note:: + + The default settings of the non-linear search are specified in the configuration files of **PyAutoFit**, just + like the default priors of the model components above. More information on configuration files is provided in the + next overview and the cookbooks. + Model Fit --------- -We begin the non-linear search by calling its ``fit`` method. +We begin the non-linear search by passing the model and analysis to its ``fit`` method. .. code-block:: python @@ -443,7 +466,7 @@ The output is as follows: normalization 24.80 (24.54, 25.11) sigma 9.84 (9.73, 9.97) -Results are returned as instances of the model, as we illustrated above in the model mapping section. +Results are returned as instances of the model, as illustrated above in the instance section. For example, we can print the result's maximum likelihood instance. @@ -488,7 +511,7 @@ Below, we use the maximum likelihood instance to compare the maximum likelihood The plot appears as follows: -.. image:: https://raw.githubusercontent.com/rhayes777/PyAutoFit/feature/docs_update/docs/images/toy_model_fit.png +.. image:: https://raw.githubusercontent.com/rhayes777/PyAutoFit/main/docs/images/toy_model_fit.png :width: 600 :alt: Alternative text @@ -510,409 +533,56 @@ corner of the results. The plot appears as follows: -.. image:: https://raw.githubusercontent.com/rhayes777/PyAutoFit/feature/docs_update/docs/images/corner.png - :width: 600 - :alt: Alternative text - -Extending Models ----------------- - -The model composition API is designed to make composing complex models, consisting of multiple components with many -free parameters, straightforward and scalable. - -To illustrate this, we will extend our model to include a second component, representing a symmetric 1D Exponential -profile, and fit it to data generated with both profiles. - -Lets begin by loading and plotting this data. - -.. code-block:: python - - dataset_path = path.join("dataset", "example_1d", "gaussian_x1__exponential_x1") - data = af.util.numpy_array_from_json(file_path=path.join(dataset_path, "data.json")) - noise_map = af.util.numpy_array_from_json( - file_path=path.join(dataset_path, "noise_map.json") - ) - xvalues = range(data.shape[0]) - plt.errorbar( - x=xvalues, y=data, yerr=noise_map, color="k", ecolor="k", elinewidth=1, capsize=2 - ) - plt.show() - plt.close() - -The data appear as follows: - -.. image:: https://raw.githubusercontent.com/rhayes777/PyAutoFit/feature/docs_update/docs/images/data_2.png +.. image:: https://raw.githubusercontent.com/rhayes777/PyAutoFit/main/docs/images/corner.png :width: 600 :alt: Alternative text -We define a Python class for the ``Exponential`` model component, exactly as we did for the ``Gaussian`` above. - -.. code-block:: python - - class Exponential: - def __init__( - self, - centre=30.0, # <- **PyAutoFit** recognises these constructor arguments - normalization=1.0, # <- are the Exponentials``s model parameters. - rate=0.01, - ): - """ - Represents a symmetric 1D Exponential profile. - - Parameters - ---------- - centre - The x coordinate of the profile centre. 
- normalization - Overall normalization of the profile. - ratw - The decay rate controlling has fast the Exponential declines. - """ - - self.centre = centre - self.normalization = normalization - self.rate = rate - - def model_data_1d_via_xvalues_from(self, xvalues: np.ndarray): - """ - Returns the symmetric 1D Exponential on an input list of Cartesian - x coordinates. - - The input xvalues are translated to a coordinate system centred on - the Exponential, via its ``centre``. - - The output is referred to as the ``model_data`` to signify that it - is a representation of the data from the - model. - - Parameters - ---------- - xvalues - The x coordinates in the original reference frame of the data. - """ - - transformed_xvalues = np.subtract(xvalues, self.centre) - return self.normalization * np.multiply( - self.rate, np.exp(-1.0 * self.rate * abs(transformed_xvalues)) - ) - - -We can easily compose a model consisting of 1 ``Gaussian`` object and 1 ``Exponential`` object using the ``af.Collection`` -object: - -.. code-block:: python - - model = af.Collection(gaussian=af.Model(Gaussian), exponential=af.Model(Exponential)) - -A ``Collection`` behaves analogous to a ``Model``, but it contains a multiple model components. - -We can see this by printing its ``paths`` attribute, where paths to all 6 free parameters via both model components -are shown. - -The paths have the entries ``.gaussian.`` and ``.exponential.``, which correspond to the names we input into -the ``af.Collection`` above. - -.. code-block:: python - - print(model.paths) - -The output is as follows: - -.. code-block:: bash - - [ - ('gaussian', 'centre'), - ('gaussian', 'normalization'), - ('gaussian', 'sigma'), - ('exponential', 'centre'), - ('exponential', 'normalization'), - ('exponential', 'rate') - ] - -We can use the paths to customize the priors of each parameter. - -.. code-block:: python - - model.gaussian.centre = af.UniformPrior(lower_limit=0.0, upper_limit=100.0) - model.gaussian.normalization = af.UniformPrior(lower_limit=0.0, upper_limit=1e2) - model.gaussian.sigma = af.UniformPrior(lower_limit=0.0, upper_limit=30.0) - model.exponential.centre = af.UniformPrior(lower_limit=0.0, upper_limit=100.0) - model.exponential.normalization = af.UniformPrior(lower_limit=0.0, upper_limit=1e2) - model.exponential.rate = af.UniformPrior(lower_limit=0.0, upper_limit=10.0) - -All of the information about the model created via the collection can be printed at once using its ``info`` attribute: - -.. code-block:: python - - print(model.info) - -The output appears as follows: - -.. code-block:: bash - - Total Free Parameters = 6 - model Collection (N=6) - gaussian Gaussian (N=3) - exponential Exponential (N=3) - - gaussian - centre UniformPrior [13], lower_limit = 0.0, upper_limit = 100.0 - normalization UniformPrior [14], lower_limit = 0.0, upper_limit = 100.0 - sigma UniformPrior [15], lower_limit = 0.0, upper_limit = 30.0 - exponential - centre UniformPrior [16], lower_limit = 0.0, upper_limit = 100.0 - normalization UniformPrior [17], lower_limit = 0.0, upper_limit = 100.0 - rate UniformPrior [18], lower_limit = 0.0, upper_limit = 10.0 - - -A model instance can again be created by mapping an input ``vector``, which now has 6 entries. - -.. code-block:: python - - instance = model.instance_from_vector(vector=[0.1, 0.2, 0.3, 0.4, 0.5, 0.01]) - -This ``instance`` contains each of the model components we defined above. - -The argument names input into the ``Collection`` define the attribute names of the ``instance``: - -.. 
code-block:: python - - print("Instance Parameters \n") - print("x (Gaussian) = ", instance.gaussian.centre) - print("normalization (Gaussian) = ", instance.gaussian.normalization) - print("sigma (Gaussian) = ", instance.gaussian.sigma) - print("x (Exponential) = ", instance.exponential.centre) - print("normalization (Exponential) = ", instance.exponential.normalization) - print("sigma (Exponential) = ", instance.exponential.rate) - -The output appear as follows: - -.. code-block:: bash - -The ``Analysis`` class above assumed the ``instance`` contained only a single model-component. - -We update its ``log_likelihood_function`` to use both model components in the ``instance`` to fit the data. - -.. code-block:: python - - class Analysis(af.Analysis): - def __init__(self, data: np.ndarray, noise_map: np.ndarray): - """ - The `Analysis` class acts as an interface between the data and - model in **PyAutoFit**. - - Its `log_likelihood_function` defines how the model is fitted to - the data and it is called many times by the non-linear search - fitting algorithm. - - In this example the `Analysis` `__init__` constructor only - contains the `data` and `noise-map`, but it can be easily - extended to include other quantities. - - Parameters - ---------- - data - A 1D numpy array containing the data (e.g. a noisy 1D signal) - fitted in the workspace examples. - noise_map - A 1D numpy array containing the noise values of the data, - used for computing the goodness of fit metric, the log likelihood. - """ - - super().__init__() - - self.data = data - self.noise_map = noise_map - - def log_likelihood_function(self, instance) -> float: - """ - Returns the log likelihood of a fit of a 1D Gaussian to the dataset. - - The data is fitted using an `instance` of multiple 1D profiles - (e.g. a `Gaussian`, `Exponential`) where - their `model_data_1d_via_xvalues_from` methods are called and summed - in order to create a model data representation that is fitted to the data. - """ - - """ - The `instance` that comes into this method is an instance of the - `Gaussian` and `Exponential` models above, which were created - via `af.Collection()`. - - It contains instances of every class we instantiated it with, where - each instance is named following the names given to the Collection, - which in this example is a `Gaussian` (with name `gaussian) and - Exponential (with name `exponential`). - - The parameter values are again chosen by the non-linear search, - based on where it thinks the high likelihood regions of parameter - space are. The lines of Python code are commented out below to - prevent excessive print statements. - - - # print("Gaussian Instance:") - # print("Centre = ", instance.gaussian.centre) - # print("Normalization = ", instance.gaussian.normalization) - # print("Sigma = ", instance.gaussian.sigma) - - # print("Exponential Instance:") - # print("Centre = ", instance.exponential.centre) - # print("Normalization = ", instance.exponential.normalization) - # print("Rate = ", instance.exponential.rate) - """ - """ - Get the range of x-values the data is defined on, to evaluate - the model of the Gaussian. - """ - xvalues = np.arange(self.data.shape[0]) - - """ - Internally, the `instance` variable is a list of all model - omponents pass to the `Collection` above. - - we can therefore iterate over them and use their - `model_data_1d_via_xvalues_from` methods to create the - summed overall model data. 
- """ - model_data = sum( - [ - profile_1d.model_data_1d_via_xvalues_from(xvalues=xvalues) - for profile_1d in instance - ] - ) - - """ - Fit the model gaussian line data to the observed data, computing the residuals, chi-squared and log likelihood. - """ - residual_map = self.data - model_data - chi_squared_map = (residual_map / self.noise_map) ** 2.0 - chi_squared = sum(chi_squared_map) - noise_normalization = np.sum(np.log(2 * np.pi * noise_map**2.0)) - log_likelihood = -0.5 * (chi_squared + noise_normalization) - - return log_likelihood - - - -We can now fit this model to the data using the same API we did before. - -.. code-block:: python - - analysis = Analysis(data=data, noise_map=noise_map) - - search = af.DynestyStatic( - nlive=100, - number_of_cores=1, - ) - - result = search.fit(model=model, analysis=analysis) - - -The ``info`` attribute shows the result in a readable format, showing that all 6 free parameters were fitted for. - -.. code-block:: python - - print(result.info) +Wrap Up +------- -The output appears as follows: +This overview explains the basic **PyAutoFit**. It used a simple model, a simple dataset, a simple model-fitting problem +and pretty much the simplest parts of the **PyAutoFit** API. -.. code-block:: bash +You should now have a good idea for how you would define and compose your own model, fit it to data with a chosen +non-linear search, and use the results to interpret the fit. - Bayesian Evidence 144.86032973 - Maximum Log Likelihood 181.14287034 - Maximum Log Posterior 181.14287034 +The **PyAutoFit** API introduce here is very extensible and very customizable, and therefore easily adapted +to your own model-fitting problems. - model Collection (N=6) - gaussian Gaussian (N=3) - exponential Exponential (N=3) +The next overview describes how to use **PyAutoFit** to set up a scientific workflow, in a nutshell how to use the API +to make your model-fitting efficient, manageable and scalable to large datasets. - Maximum Log Likelihood Model: - - gaussian - centre 50.223 - normalization 26.108 - sigma 9.710 - exponential - centre 50.057 - normalization 39.948 - rate 0.048 +Resources +--------- +The `autofit_workspace: `_ has numerous examples of how to perform +more complex model-fitting tasks: - Summary (3.0 sigma limits): +The following cookbooks describe how to use the API for more complex model-fitting tasks: - gaussian - centre 50.27 (49.63, 50.88) - normalization 26.22 (21.37, 32.41) - sigma 9.75 (9.25, 10.27) - exponential - centre 50.04 (49.60, 50.50) - normalization 40.06 (37.60, 42.38) - rate 0.05 (0.04, 0.05) +**Model Cookbook: https://pyautofit.readthedocs.io/en/latest/cookbooks/model.html** +Compose complex models from multiple Python classes, lists, dictionaries and customize their parameterization. - Summary (1.0 sigma limits): +**Analysis Cookbook: https://pyautofit.readthedocs.io/en/latest/cookbooks/search.html** - gaussian - centre 50.27 (50.08, 50.49) - normalization 26.22 (24.33, 28.39) - sigma 9.75 (9.60, 9.90) - exponential - centre 50.04 (49.90, 50.18) - normalization 40.06 (39.20, 40.88) - rate 0.05 (0.05, 0.05) +Customize the analysis with model-specific output and visualization. -We can again use the max log likelihood instance to visualize the model data of the best fit model compared to the -data. +**Searches Cookbook: https://pyautofit.readthedocs.io/en/latest/cookbooks/analysis.html** -.. 
code-block:: python +Choose from a variety of non-linear searches and customize their behaviour, for example outputting results to hard-disk and parallelizing the search. - instance = result.max_log_likelihood_instance +**Results Cookbook: https://pyautofit.readthedocs.io/en/latest/cookbooks/result.html** - model_gaussian = instance.gaussian.model_data_1d_via_xvalues_from( - xvalues=np.arange(data.shape[0]) - ) - model_exponential = instance.exponential.model_data_1d_via_xvalues_from( - xvalues=np.arange(data.shape[0]) - ) - model_data = model_gaussian + model_exponential - - plt.errorbar( - x=xvalues, y=data, yerr=noise_map, color="k", ecolor="k", elinewidth=1, capsize=2 - ) - plt.plot(range(data.shape[0]), model_data, color="r") - plt.plot(range(data.shape[0]), model_gaussian, "--") - plt.plot(range(data.shape[0]), model_exponential, "--") - plt.title("Dynesty model fit to 1D Gaussian + Exponential dataset.") - plt.xlabel("x values of profile") - plt.ylabel("Profile normalization") - plt.show() - plt.close() - -The plot appears as follows: - -.. image:: https://raw.githubusercontent.com/rhayes777/PyAutoFit/feature/docs_update/docs/images/toy_model_fit.png - :width: 600 - :alt: Alternative text +The variety of results available from a fit, including parameter estimates, error estimates, model comparison and visualization. -Cookbooks ----------- +**Configs Cookbook: https://pyautofit.readthedocs.io/en/latest/cookbooks/configs.html** -This overview shows the basics of model-fitting with **PyAutoFit**. +Customize default settings using configuration files, for example setting priors, search settings and visualization. -The API is designed to be intuitive and extensible, and you should have a good feeling for how you would define -and compose your own model, fit it to data with a chosen non-linear search, and use the results to interpret the -fit. +**Multiple Dataset Cookbook: https://pyautofit.readthedocs.io/en/latest/cookbooks/multiple_datasets.html** -The following cookbooks give a concise API reference for using **PyAutoFit**, and you should use them as you define -your own model to get a fit going: +Fit multiple datasets simultaneously, by combining their analysis classes such that their log likelihoods are summed. -- Model Cookbook: https://pyautofit.readthedocs.io/en/latest/cookbooks/model.html -- Searches Cookbook: https://pyautofit.readthedocs.io/en/latest/cookbooks/analysis.html -- Analysis Cookbook: https://pyautofit.readthedocs.io/en/latest/cookbooks/search.html -- Results Cookbook: https://pyautofit.readthedocs.io/en/latest/cookbooks/result.html -There are additioal cookbooks which explain advanced PyAutoFit functionality -which you should look into after you have a good understanding of the basics. -The next overview describes how to set up a scientific workflow, where many other tasks required to perform detailed but -scalable model-fitting can be delegated to **PyAutoFit**. 
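+A minimal sketch of the multiple-dataset fitting summarised in the cookbook list above is shown below. It assumes two
+datasets and their noise-maps (``data_0``, ``noise_map_0``, ``data_1``, ``noise_map_1``) have already been loaded;
+these names are placeholders for illustration, and the ``Analysis``, ``model`` and ``search`` objects are those
+defined earlier in this overview.
+
+.. code-block:: python
+
+    # One Analysis per dataset; summing them instructs the search to add the
+    # individual log likelihoods, so both datasets are fitted simultaneously.
+    analysis_0 = Analysis(data=data_0, noise_map=noise_map_0)
+    analysis_1 = Analysis(data=data_1, noise_map=noise_map_1)
+
+    analysis = analysis_0 + analysis_1
+
+    result_list = search.fit(model=model, analysis=analysis)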
From f28057c0b056d7e8d06d82da7ae940f1b9f250d6 Mon Sep 17 00:00:00 2001 From: James Nightingale Date: Wed, 24 Apr 2024 07:55:23 +0100 Subject: [PATCH 2/8] add networkx to requirements --- docs/requirements.txt | 1 + requirements.txt | 1 + 2 files changed, 2 insertions(+) diff --git a/docs/requirements.txt b/docs/requirements.txt index 7a9f013ae..1638ca322 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -6,6 +6,7 @@ typing-inspect>=0.4.0 emcee>=3.0.2 gprof2dot==2021.2.21 matplotlib +networkx numpydoc>=1.0.0 pyprojroot==0.2.0 pyswarms==1.3.0 diff --git a/requirements.txt b/requirements.txt index c52af6cc3..e9c4a0fea 100644 --- a/requirements.txt +++ b/requirements.txt @@ -6,6 +6,7 @@ dynesty==2.1.2 typing-inspect>=0.4.0 emcee>=3.1.3 matplotlib +networkx numpydoc>=1.0.0 pyprojroot==0.2.0 pyswarms==1.3.0 From b121791aaa215d951fd3920b6bc39c1b21959b9a Mon Sep 17 00:00:00 2001 From: James Nightingale Date: Wed, 24 Apr 2024 08:12:19 +0100 Subject: [PATCH 3/8] added pyvis to doc requiremnets --- docs/requirements.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/requirements.txt b/docs/requirements.txt index 1638ca322..92f40d109 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -17,6 +17,7 @@ astunparse==1.6.3 autoconf sphinx==5.2.3 xxhash==3.0.0 +pyvis==0.3.2 furo myst-parser sphinx_copybutton From 4e2d1e7f14cb8d8adddf3fdbf5798d498f85839b Mon Sep 17 00:00:00 2001 From: James Nightingale Date: Wed, 24 Apr 2024 08:13:18 +0100 Subject: [PATCH 4/8] docs requiremnets --- docs/requirements.txt | 5 ++++- requirements.txt | 1 + 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/requirements.txt b/docs/requirements.txt index 92f40d109..701716a5c 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -1,3 +1,4 @@ +anesthetic==2.8.1 corner==2.2.1 decorator>=4.2.1 dill>=0.3.1.1 @@ -14,9 +15,11 @@ h5py>=2.10.0 SQLAlchemy==1.3.20 scipy>=1.5.1 astunparse==1.6.3 +threadpoolctl>=3.1.0,<=3.2.0 autoconf +timeout-decorator==0.5.0 sphinx==5.2.3 -xxhash==3.0.0 +xxhash<=3.4.1 pyvis==0.3.2 furo myst-parser diff --git a/requirements.txt b/requirements.txt index e9c4a0fea..8f11787e7 100644 --- a/requirements.txt +++ b/requirements.txt @@ -5,6 +5,7 @@ dill>=0.3.1.1 dynesty==2.1.2 typing-inspect>=0.4.0 emcee>=3.1.3 +gprof2dot==2021.2.21 matplotlib networkx numpydoc>=1.0.0 From 5f32dac262bbb99e0098b0a086e010ef9cc86dde Mon Sep 17 00:00:00 2001 From: James Nightingale Date: Wed, 24 Apr 2024 08:43:46 +0100 Subject: [PATCH 5/8] methods doc --- docs/index.rst | 1 + docs/overview/statistical_methods.rst | 92 ++++++++++++++++++++++++++- 2 files changed, 90 insertions(+), 3 deletions(-) diff --git a/docs/index.rst b/docs/index.rst index 2b654b35d..7c5878543 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -144,6 +144,7 @@ model and marginalized probability density functions. overview/the_basics overview/scientific_workflow + overview/statistical_methods .. toctree:: :caption: Cookbooks: diff --git a/docs/overview/statistical_methods.rst b/docs/overview/statistical_methods.rst index 45432cba4..030f5e7c3 100644 --- a/docs/overview/statistical_methods.rst +++ b/docs/overview/statistical_methods.rst @@ -8,20 +8,106 @@ Statistical Methods Graphical Models ---------------- +For inference problems consisting of many datasets, the model composition is often very complex. Model parameters +can depend on multiple datasets, and the datasets themselves may be interdependent. 
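+The sketch below outlines how such a simultaneous fit might be composed. It assumes a list of ``Analysis`` objects
+(``analysis_list``), one per dataset, and the ``Gaussian`` model component from the previous overview; the
+``AnalysisFactor`` / ``FactorGraphModel`` names follow the graphical models feature notebook linked in this section,
+so treat this as an illustrative sketch rather than a complete example.
+
+.. code-block:: python
+
+    # One model shared across every dataset; each dataset gets its own Analysis.
+    model = af.Collection(gaussian=af.Model(Gaussian))
+
+    analysis_factor_list = [
+        af.AnalysisFactor(prior_model=model, analysis=analysis)
+        for analysis in analysis_list
+    ]
+
+    # The factor graph ties the per-dataset factors into one global model, which is
+    # then fitted simultaneously with a single non-linear search.
+    factor_graph = af.FactorGraphModel(*analysis_factor_list)
+
+    search = af.DynestyStatic()
+    result = search.fit(model=factor_graph.global_prior_model, analysis=factor_graph)
+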
+
+Graphical models concisely describe these model and dataset dependencies and allow for
+them to be fitted simultaneously, with a concise API and scientific workflow that ensures scalability to big datasets.
+
+A full description of using graphical models is given below:
+
+https://github.com/Jammy2211/autofit_workspace/blob/release/notebooks/features/graphical_models.ipynb
+
 Hierarchical Models
 -------------------
 
+Hierarchical models are where multiple parameters in the model are assumed to be drawn from a common distributio.
+The parameters of this parent distribution are themselves inferred from the data, which for many problems enables
+more robust and informative model fitting.
+
+Anfull description of using hierarchical models is given below:
+
+https://github.com/Jammy2211/autofit_workspace/blob/release/notebooks/howtofit/chapter_graphical_models/tutorial_4_hierachical_models.ipynb
+
 Model Comparison
 ----------------
 
-Search Chaining
----------------
+Common questions when fitting a model to data are: what model should I use? How many parameters should the model have?
+Is the model too complex or too simple?
+
+Model comparison answers these questions. It amounts to composing and fitting many different models to the data
+and comparing how well they fit the data.
+
+A full description of using model comparison is given below:
+
+https://github.com/Jammy2211/autofit_workspace/blob/release/notebooks/features/model_comparison.ipynb
 
 Interpolation
 -------------
 
+It is common to fit a model to many similar datasets, where it is anticipated that one or more model parameters vary
+smoothly across the datasets.
+
+For example, the datasets may be taken at different times, where the signal in the data and therefore model parameters
+vary smoothly as a function of time.
+
+It may be desirable to fit the datasets one-by-one and then interpolate the results in order
+to determine the most likely model parameters at any point in time.
+
+**PyAutoFit**'s interpolation feature allows for this, and a full description of its use is given below:
+
+https://github.com/Jammy2211/autofit_workspace/blob/release/notebooks/features/interpolate.ipynb
+
 Search Grid Search
 ------------------
 
+A classic method to perform model-fitting is a grid search, where the parameters of a model are divided on to a grid of
+values and the likelihood of each set of parameters on this grid is sampled. For low dimensionality problems this
+simple approach can be sufficient to locate high likelihood solutions, however it scales poorly to higher dimensional
+problems.
+
+**PyAutoFit** can perform a search grid search, which allows one to perform a grid-search over a subset of parameters
+within a model, but use a non-linear search to fit for the other parameters. The parameters over which the grid-search
+is performed are also included in the model fit and their values are simply confined to the boundaries of their grid
+cell.
+
+This can help ensure robust results are inferred for complex models, and can remove multi-modality in a parameter
+space to further aid the fitting process.
+
+A full description of using search grid searches is given below:
+
+https://github.com/Jammy2211/autofit_workspace/blob/release/notebooks/features/search_grid_search.ipynb
+
+Search Chaining
+---------------
+
+To perform a model-fit, we typically compose one model and fit it to our data using one non-linear search.
+
+Search chaining fits many different models to a dataset using a chained sequence of non-linear searches. 
Initial
+fits are performed using simplified models and faster non-linear fitting techniques. The results of these simplified
+fits are then used to initialize fits using a higher dimensionality model with a more detailed non-linear search.
+
+Search chaining therefore allows us to granularize the fitting of a highly complex model into a series
+of **bite-sized** searches, which are faster and more reliable than fitting the complex model straight away.
+
+A full description of using search chaining is given below:
+
+https://github.com/Jammy2211/autofit_workspace/blob/release/notebooks/features/search_grid_search.ipynb
+
 Sensitivity Mapping
-------------------- \ No newline at end of file
+-------------------
+
+Bayesian model comparison allows us to take a dataset, fit it with multiple models and use the Bayesian evidence to
+quantify which model objectively gives the best-fit following the principles of Occam's Razor.
+
+However, a complex model may be disfavoured by model comparison not because it is the 'wrong' model, but simply
+because the dataset being fitted is not of a sufficient quality for the more complex model to be favoured. Sensitivity
+mapping addresses what quality of data would be needed for the more complex model to be favoured.
+
+In order to do this, sensitivity mapping involves writing a function that uses the model(s) to simulate a dataset.
+We then use this function to simulate many datasets, for different models, and fit each dataset to quantify
+how much the change in the model led to a measurable change in the data. This is called computing the sensitivity.
+
+A full description of using sensitivity mapping is given below:
+
+https://github.com/Jammy2211/autofit_workspace/blob/release/notebooks/features/sensitivity_mapping.ipynb \ No newline at end of file

From d3fa9663462102543d9477918c6242aa9f0e14a0 Mon Sep 17 00:00:00 2001
From: James Nightingale
Date: Wed, 24 Apr 2024 10:02:29 +0100
Subject: [PATCH 6/8] latent vaeriables in docs

---
 docs/overview/scientific_workflow.rst | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/docs/overview/scientific_workflow.rst b/docs/overview/scientific_workflow.rst
index e367246ea..d0f38235a 100644
--- a/docs/overview/scientific_workflow.rst
+++ b/docs/overview/scientific_workflow.rst
@@ -145,7 +145,9 @@ The code below shows how to enable outputting of results to hard-disk:
 
 The screenshot below shows the output folder where all output is enabled:
 
+.. note::
 
+    Screenshot needs to be added here.
 
 Lets consider three parts of this output folder:
@@ -492,6 +494,15 @@ of the ``Analysis`` and define a ``make_result`` object describing what we want
             analysis=self
         )
 
+Result customization has full support for **latent variables**, which are parameters that are not sampled by the non-linear
+search but are computed from the sampled parameters.
+
+They are often integral to assessing and interpreting the results of a model-fit, as they present information
+on the model in a different way to the sampled parameters.
+
+The `result cookbook <https://pyautofit.readthedocs.io/en/latest/cookbooks/result.html>`_ gives a full run-through of
+all the different ways the result can be customized. 
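+As a concrete illustration, for the 1D ``Gaussian`` fits used throughout these overviews a natural latent variable is
+the FWHM, which is derived from the sampled ``sigma`` rather than being sampled itself. A hypothetical post-processing
+sketch, assuming the fit's maximum likelihood instance is a single ``Gaussian``:
+
+.. code-block:: python
+
+    import numpy as np
+
+    instance = result.max_log_likelihood_instance
+
+    # FWHM = 2 * sqrt(2 ln 2) * sigma, computed from the sampled sigma parameter.
+    fwhm = 2.0 * np.sqrt(2.0 * np.log(2.0)) * instance.sigma
+
+    print(fwhm)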
+
+Model Composition
+-----------------
+

From 1b9dc65f24c8aea3a477ed5b0b4f1f2a5c53534e Mon Sep 17 00:00:00 2001
From: James Nightingale
Date: Mon, 29 Apr 2024 12:46:38 +0100
Subject: [PATCH 7/8] docs

---
 .../non_linear/search/nest/dynesty/search/static.py |  2 +-
 docs/cookbooks/result.rst                           | 12 +++++++-----
 requirements.txt                                    |  1 -
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/autofit/non_linear/search/nest/dynesty/search/static.py b/autofit/non_linear/search/nest/dynesty/search/static.py
index a08d3f8a9..65b9b5dee 100644
--- a/autofit/non_linear/search/nest/dynesty/search/static.py
+++ b/autofit/non_linear/search/nest/dynesty/search/static.py
@@ -21,7 +21,6 @@ def __init__(self, function):
     def grad(self):
         import jax
         from jax import grad
-        print("Compiling gradient")
 
         return jax.jit(grad(self.function))
 
@@ -135,6 +134,7 @@ def search_internal_from(
             The number of CPU's over which multiprocessing is performed, determining how many samples are stored
             in the dynesty queue for samples.
         """
+
         if self.use_gradient:
             gradient = GradWrapper(fitness)
         else:
diff --git a/docs/cookbooks/result.rst b/docs/cookbooks/result.rst
index d1c9fadc3..993e573fb 100644
--- a/docs/cookbooks/result.rst
+++ b/docs/cookbooks/result.rst
@@ -102,11 +102,13 @@ The output appears as follows:
 Loading From Hard-disk
 ----------------------
 
-When performing fits which output results to hard-disc, a ``files`` folder is created containing .json / .csv files of
-the model, samples, search, etc.
+When performing fits which output results to hard-disk, a ``files`` folder is created containing .json / .csv files of
+the model, samples, search, etc. You should check it out now for a completed fit on your hard-disk if you have
+not already!
 
-These files can be loaded from hard-disk to Python variables via the aggregator, making them accessible in a
-Python script or Jupyter notebook.
+These files can be loaded from hard-disk to Python variables via the aggregator, making them accessible in a
+Python script or Jupyter notebook. They are loaded as the internal **PyAutoFit** objects we are familiar with,
+for example the ``model`` is loaded as the ``Model`` object we passed to the search above.
 
 Below, we will access these results using the aggregator's ``values`` method. A full list of what can be loaded is
 as follows:
@@ -126,7 +128,7 @@ the full non-linear search samples, for example every parameter sample and its l
 contains a summary of the results, for example the maximum log likelihood model and error estimates on parameters
 at 1 and 3 sigma confidence.
 
-Accessing results via the ``samples_summary`` is much faster, because as it does reperform calculations using the full
+Accessing results via the ``samples_summary`` is much faster, because it does not reperform calculations using the full
 list of samples. Therefore, if the result you want is accessible via the ``samples_summary`` you should use it
 but if not you can revert to the ``samples. 
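+To make the ``values`` method concrete, a short illustrative loop is shown below. It assumes an ``Aggregator`` has
+already been built from the output directory, as described earlier in this cookbook, and simply prints the ``info``
+of every model that was fitted:
+
+.. code-block:: python
+
+    # Each entry of the generator is the Model of one completed fit, loaded from its model.json file.
+    for model in agg.values("model"):
+        print(model.info)
+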
diff --git a/requirements.txt b/requirements.txt index 8f11787e7..f50f3246a 100644 --- a/requirements.txt +++ b/requirements.txt @@ -7,7 +7,6 @@ typing-inspect>=0.4.0 emcee>=3.1.3 gprof2dot==2021.2.21 matplotlib -networkx numpydoc>=1.0.0 pyprojroot==0.2.0 pyswarms==1.3.0 From 04ef5e3842aa83137e3514fb692dc68bbe0bbfaf Mon Sep 17 00:00:00 2001 From: James Nightingale Date: Mon, 29 Apr 2024 15:01:58 +0100 Subject: [PATCH 8/8] typos --- docs/overview/statistical_methods.rst | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/overview/statistical_methods.rst b/docs/overview/statistical_methods.rst index 030f5e7c3..784213dc4 100644 --- a/docs/overview/statistical_methods.rst +++ b/docs/overview/statistical_methods.rst @@ -11,8 +11,8 @@ Graphical Models For inference problems consisting of many datasets, the model composition is often very complex. Model parameters can depend on multiple datasets, and the datasets themselves may be interdependent. -Graphical models concisely describe these model and dataset dependencies and allow for -them to be fitted simultaneously, with a concise API and scientific workflow that ensures scalability to big datasets. +Graphical models concisely describe these model and dataset dependencies, allowing them to be fitted simultaneously. +This is achieved through a concise API and scientific workflow that ensures scalability to large datasets. A full description of using graphical models is given below: @@ -21,11 +21,11 @@ https://github.com/Jammy2211/autofit_workspace/blob/release/notebooks/features/g Hierarchical Models ------------------- -Hierarchical models are where multiple parameters in the model are assumed to be drawn from a common distributio. -The parameters of this parent distribution are themselves inferred from the data, which for many problems enables +Hierarchical models are where multiple parameters in the model are assumed to be drawn from a common distribution. +The parameters of this parent distribution are themselves inferred from the data, enabling more robust and informative model fitting. -Anfull description of using hierarchical models is given below: +A full description of using hierarchical models is given below: https://github.com/Jammy2211/autofit_workspace/blob/release/notebooks/howtofit/chapter_graphical_models/tutorial_4_hierachical_models.ipynb @@ -61,7 +61,7 @@ https://github.com/Jammy2211/autofit_workspace/blob/release/notebooks/features/i Search Grid Search ------------------ -A classic method to perform model-fitting is a grid search, where the parameters of a model are divided on to a grid of +A classic method to perform model-fitting is a grid search, where the parameters of a model are divided onto a grid of values and the likelihood of each set of parameters on this grid is sampled. For low dimensionality problems this simple approach can be sufficient to locate high likelihood solutions, however it scales poorly to higher dimensional problems.