Update CmdStan read and logic #1565

ahartikainen · 2021-02-15T12:25:06Z

Description

Read CmdStan CSV files manually instead of with pandas. This lets us parse large models (100k parameters) much faster than pandas does.

This PR also adds a dtypes argument, which users can use to set the dtypes of specific parameters, for example:

dtypes = {"theta": int}
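A minimal sketch of what such a dtypes mapping could do once the CSV has been parsed. The helper name `apply_dtypes` and its signature are hypothetical, chosen for illustration only, and this is not the ArviZ implementation; it assumes flat CmdStan-style column names such as `theta.1`.

```python
import numpy as np


def apply_dtypes(columns, data, dtypes=None):
    """Group flat CSV columns by parameter name and cast selected ones.

    Hypothetical helper for illustration, not the ArviZ implementation.
    columns: dict mapping a CSV column name (e.g. "theta.1") to its index.
    data: 2D float array with shape (draws, columns).
    dtypes: dict mapping a parameter name to the desired dtype.
    """
    dtypes = dtypes or {}
    grouped = {}
    for name, idx in columns.items():
        base = name.split(".", 1)[0]  # "theta.1" -> "theta"
        grouped.setdefault(base, []).append(idx)
    # Parameters not listed in dtypes stay float64.
    return {
        base: data[:, idxs].astype(dtypes.get(base, np.float64))
        for base, idxs in grouped.items()
    }


columns = {"lp__": 0, "theta.1": 1, "theta.2": 2}
data = np.array([[-7.5, 1.0, 3.0], [-6.9, 2.0, 4.0]])
result = apply_dtypes(columns, data, dtypes={"theta": int})
print(result["theta"].dtype.kind, result["lp__"].dtype)
```

Only `theta` is cast to an integer dtype here; `lp__` keeps the default float64.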

Checklist

Follows official PR format
Includes a sample plot to visually illustrate the changes (only for plot-related functions)
New features are properly documented (with an example if appropriate)?
Includes new or updated tests to cover the new feature
Code style correct (follows pylint and black guidelines)
Changes are listed in changelog

OriolAbril · 2021-02-15T17:20:58Z

I thought the pandas reader was written in C to be fast. Do you have some advice on which cases make having a custom reader worth it?

ahartikainen · 2021-02-16T07:07:42Z

I will give you an example later today.

ahartikainen · 2021-02-16T14:17:29Z

import shutil
import tempfile
import time
from pathlib import Path
from uuid import uuid4

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def read_output_file_manual(path):
    comments = []
    data = []
    columns = None
    with open(path, "rb") as f_obj:
        # read header
        for line in f_obj:
            if line.startswith(b"#"):
                comments.append(line.decode("utf-8").strip())
                continue
            columns = {key: idx for idx, key in enumerate(line.strip().decode("utf-8").split(","))}
            break

        for line in f_obj:
            line = line.strip()
            if line.startswith(b"#"):
                comments.append(line.decode("utf-8"))
                continue
            if line:
                data.append(line.split(b","))

        data = np.array(data, dtype=np.float64)

    return columns, data, comments

def read_output_file_pandas(path):
    comments = []
    data = []
    columns = None
    with open(path, "rb") as f_obj:
        # read header
        for line in f_obj:
            if line.startswith(b"#"):
                comments.append(line.decode("utf-8").strip())
                continue
            columns = {key: idx for idx, key in enumerate(line.strip().decode("utf-8").split(","))}
            break

        f_obj_loc = f_obj.tell()
        for line in f_obj:
            if line.startswith(b"#"):
                comments.append(line.strip().decode("utf-8"))
                continue
        f_obj.seek(f_obj_loc)
        data = pd.read_csv(f_obj, header=None, comment="#", float_precision="high", dtype=np.float64).values

    return columns, data, comments


%%time
np.random.seed(10)
reference_files = {}
with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir)
    for parameter_size in [10, 100, 1000, 10_000, 20_000, 50_000]:
        for draw_size in [100, 500, 1000, 10_000]:
            data = np.random.randn(draw_size, parameter_size)
            columns = [str(uuid4()) for _ in range(parameter_size)]
            df = pd.DataFrame(data=data, columns=columns)
            output_path = Path(tmpdir) / f"parameters_{parameter_size}_draws_{draw_size}.csv"
            output_path2 = Path(tmpdir) / f"parameters_{parameter_size}_draws_{draw_size}_2.csv"
            df.to_csv(str(output_path))
            shutil.copy(output_path, output_path2)
            reference_files[(parameter_size, draw_size)] = (output_path, output_path2)
    
    results = {"pandas": {}, "manual": {}}
    for key, (path, path2) in reference_files.items():
        res_manual = []
        res_pandas = []
        for n in range(1):
            st = time.time()
            val = read_output_file_manual(path)
            et = time.time()
            res_manual.append(et - st)
            
            st = time.time()
            val = read_output_file_pandas(path2)
            et = time.time()
            res_pandas.append(et - st)
            
        results["manual"][key] = res_manual
        results["pandas"][key] = res_pandas

res_df = pd.DataFrame(results).applymap(lambda x: np.mean(x)).reset_index().rename(columns={"level_0": "parameters", "level_1": "draws"})

fig, ax = plt.subplots(1, dpi=100, figsize=(7,7))
for i, (group, gdf) in enumerate(res_df.groupby(by="draws")):
    # each group is one draw count, so label the lines by draws
    plt.plot(gdf["parameters"], gdf["pandas"], marker='.', label=f"pandas draws:{group}", c=f"C{i%8}", lw=1)
    plt.plot(gdf["parameters"], gdf["manual"], marker='.', label=f"manual draws:{group}", c=f"C{i%8}", ls="--", lw=1)
    
plt.xlabel("Parameters")
plt.ylabel("Timing (s)")
plt.yscale("log")
plt.xscale("log")
plt.legend()
plt.grid()
# hide the top and right spines
for key in ["top", "right"]:
    plt.gca().spines[key].set_visible(False)
plt.savefig("./csv_read_comparison.png", dpi=200, bbox_inches="tight")

The run took approximately 40 minutes.
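Before timing a reader, it is worth checking that it parses a small file correctly. The following is a minimal sketch of such a check, using a compact variant of the manual reader above that takes a binary file object. The function name `parse_cmdstan_like` and the sample content are assumptions made for illustration; the sample only imitates the CmdStan layout (comment lines, one header line, numeric rows).

```python
import io

import numpy as np


def parse_cmdstan_like(f_obj):
    """Parse comment lines, one header line, and numeric rows.

    Compact variant of the manual reader, for illustration only.
    """
    comments, rows, columns = [], [], None
    for raw in f_obj:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(b"#"):
            comments.append(line.decode("utf-8"))
        elif columns is None:
            # the first non-comment line is the header
            names = line.decode("utf-8").split(",")
            columns = {key: idx for idx, key in enumerate(names)}
        else:
            rows.append(line.split(b","))
    return columns, np.array(rows, dtype=np.float64), comments


sample = b"""# model = example
lp__,theta.1,theta.2
# Adaptation terminated
-7.5,1.0,3.0

-6.9,2.0,4.0
# Elapsed Time: 0.1 seconds
"""

columns, data, comments = parse_cmdstan_like(io.BytesIO(sample))
print(columns)
print(data.shape, len(comments))
```

Comment lines can appear before, between, and after the data rows, so the sketch collects them wherever they occur instead of assuming they all sit at the top of the file.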

codecov · 2021-02-17T12:43:01Z

Codecov Report

Merging #1565 (3df02e7) into main (1e3356e) will decrease coverage by 0.03%.
The diff coverage is 83.45%.

@@            Coverage Diff             @@
##             main    #1565      +/-   ##
==========================================
- Coverage   90.28%   90.25%   -0.04%     
==========================================
  Files         105      105              
  Lines       11405    11419      +14     
==========================================
+ Hits        10297    10306       +9     
- Misses       1108     1113       +5

Impacted Files	Coverage Δ
arviz/data/io_cmdstan.py	`91.03% <83.45%> (-0.87%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ahartikainen · 2021-02-17T15:29:23Z

Results (in seconds)

	parameters	draws	pandas	manual	numpy
2	1000	10000	2.103543	4.698918	11.429663
5	50000	10000	420.375299	227.831740	1131.707759
8	100000	10000	1750.873820	1399.882980	5438.172931

Difference against pandas (x - pandas) (in seconds)

	parameters	draws	manual	numpy
2	1000	10000	2.595374	9.326120
5	50000	10000	-192.543559	711.332459
8	100000	10000	-350.990840	3687.299111

I don't know, maybe go with the manual handling? All CSV files should be "good" (we don't need to consider ill-formed ones).
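The differences in the second table can also be read as ratios. The sketch below recomputes them from the timings reported in the first table (all rows use 10000 draws). The function name `speedup_vs_pandas` is made up for this example; a ratio above 1 means the reader was faster than pandas.

```python
# Timings in seconds, copied from the results table (10000 draws each).
timings = {
    1_000: {"pandas": 2.103543, "manual": 4.698918, "numpy": 11.429663},
    50_000: {"pandas": 420.375299, "manual": 227.831740, "numpy": 1131.707759},
    100_000: {"pandas": 1750.873820, "manual": 1399.882980, "numpy": 5438.172931},
}


def speedup_vs_pandas(row):
    """Return pandas_time / reader_time for every non-pandas reader."""
    return {
        name: row["pandas"] / seconds
        for name, seconds in row.items()
        if name != "pandas"
    }


speedups = {n_params: speedup_vs_pandas(row) for n_params, row in timings.items()}
for n_params, ratio in speedups.items():
    print(
        f"{n_params:>7} parameters: "
        f"manual x{ratio['manual']:.2f}, numpy x{ratio['numpy']:.2f}"
    )
```

On these numbers the manual reader is slower than pandas at 1000 parameters and faster at 50000 and 100000, while the numpy reader is slower than pandas in all three rows.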

ahartikainen · 2021-02-18T12:05:54Z

@OriolAbril are you happy with the changes?

The code should now be a bit clearer than before.

OriolAbril · 2021-02-18T12:06:44Z

Looks good, thanks!

Ari Hartikainen and others added 22 commits February 18, 2021 14:04

rewrite cmdstan logic

0c27723

clean sample_stats

86255c8

fix

70df093

Handle empty lines

8c12b17

use numbers

b1a627b

fix typo

b7ec08c

temporarily downgrade pylint

bfb3ba0

downgrade astroid

caf3077

fix

60ddacd

fix errors

80e0213

change dict kw order

f31121e

fix handling

0a11a52

update test

896214c

combine pandas and manual

1f0546e

remove requirement restrictions

bbf9783

use numpy gentext for fileloading

c150419

remove pandas import

22f15c9

fix typo

257e1d0

change test

c9b6617

update csv reader

4aa068c

clean file

66f8342

add info to changelog

568a296

ahartikainen force-pushed the bugfixes/cmdstan branch from d6b93a5 to 568a296 Compare February 18, 2021 12:04

OriolAbril approved these changes Feb 18, 2021

ahartikainen merged commit 3d788cc into main Feb 18, 2021

ahartikainen deleted the bugfixes/cmdstan branch February 18, 2021 12:38
