diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 8a196a1..fede2c7 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -1,5 +1,11 @@
 name: codecov
-on: [push, pull_request]
+on:
+  pull_request:
+    branches-ignore:
+      - main
+  push:
+    branches:
+      - main
 jobs:
   run:
     runs-on: ${{ matrix.os }}
@@ -16,6 +22,6 @@ jobs:
         uses: actions/setup-python@v2
         with:
          python-version: ${{ matrix.python-version }}
       - name: Install dependencies
         run: |
           pip install -r requirements.txt --use-pep517
diff --git a/README.md b/README.md
index 2df31ff..c3e275f 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
 # obsidiantools 🪨⚒️
 **obsidiantools** is a Python package for getting structured metadata about your [Obsidian.md notes](https://obsidian.md/) and analysing your vault. Complement your Obsidian workflows by getting metrics and detail about all your notes in one place through the widely-used Python data stack.
 
-It's incredibly easy to explore structured data on your vault through this fluent interface. This is all the code you need to generate a `vault` object that stores the key data:
+It's incredibly easy to explore structured data on your vault through this fluent interface. This is all the code you need to generate a `vault` object that stores all the data:
 
 ```python
 import obsidiantools.api as otools
@@ -12,13 +12,13 @@ import obsidiantools.api as otools
 vault = otools.Vault().connect().gather()
 ```
 
-These are the basics of the function calls:
-- `connect()`: connect your notes together in a graph structure and get metadata on links (e.g. wikilinks, backlinks, etc.)
+These are the basics of the method calls:
+- `connect()`: connect your notes together in a graph structure and get metadata on links (e.g. wikilinks, backlinks, etc.) There is the option to include 'attachment' files in the graph.
 - `gather()`: gather the plaintext content from your notes in one place. This includes the 'source text' that represents how your notes are written. There are arguments to specify what text you want to remove, e.g. remove code.
 
 See some of the **key features** below - all accessible from the `vault` object either through a method or an attribute.
 
-As this package relies upon note (file)names, it is only recommended for use on vaults where wikilinks are not formatted as paths and where note names are unique. This should cover the vast majority of vaults that people create.
+The package is built to support the 'shortest path when possible' option for links. This should cover the vast majority of vaults that people create.
 See the [wiki](https://github.com/mfarragher/obsidiantools/wiki) for more info on what sort of wikilink syntax is not well-supported and how the graph may be slightly different to what you see in the Obsidian app.
 
 ## 💡 Key features
 This is how **`obsidiantools`** can complement your workflows for note-taking:
@@ -26,28 +26,30 @@ This is how **`obsidiantools`** can complement your workflows for note-taking:
   - NetworkX is the main Python library for network analysis, enabling sophisticated analyses of your vault.
   - NetworkX also supports the ability to export your graph to other data formats.
   - When instantiating a `vault`, the analysis can also be filtered on specific subdirectories.
-- **Get summary stats about your notes, e.g. number of backlinks and wikilinks, in a Pandas dataframe**
-  - Get the dataframe via `vault.get_note_metadata()`
+- **Get summary stats about your notes & files, e.g. number of backlinks and wikilinks, in a Pandas dataframe**
+  - Get the dataframe via `vault.get_note_metadata()` (notes / md files), `vault.get_media_file_metadata()` (media files that can be embedded in notes) and `vault.get_canvas_file_metadata()` (canvas files).
 - **Retrieve detail about your notes' links and metadata as built-in Python types**
+  - The main indices of files are `md_file_index`, `media_file_index` and `canvas_file_index`.
+  - Check whether files included as links in the vault actually exist, via `vault` attributes like `nonexistent_notes`, `nonexistent_media_files` and `nonexistent_canvas_files`.
+  - Check whether actual files are isolated in the graph ('orphans'), via `vault` attributes like `isolated_notes`, `isolated_media_files` and `isolated_canvas_files`.
+  - You can access all the note & file links in one place, or you can load them for an individual note:
+    - e.g. `vault.backlinks_index` for all backlinks in the vault
+    - e.g. `vault.get_backlinks()` for the backlinks of an individual note
   - **md note info:**
     - The various types of links:
       - Wikilinks (incl. header links, links with alt text)
       - Embedded files
       - Backlinks
       - Markdown links
-    - You can access all the links in one place, or you can load them for an individual note:
-      - e.g. `vault.backlinks_index` for all backlinks in the vault
-      - e.g. `vault.get_backlinks()` for the backlinks of an individual note
     - Front matter via `vault.get_front_matter()` or `vault.front_matter_index`
     - Tags via `vault.get_tags()` or `vault.tags_index`. Nested tags are supported.
     - LaTeX math via `vault.get_math()` or `vault.math_index`
-    - Check which notes are isolated (`vault.isolated_notes`)
-    - Check which notes do not exist as files yet (`vault.nonexistent_notes`)
   - As long as `gather()` is called:
     - Get source text of note (via `vault.get_source_text()`). This tries to represent how a note's text appears in Obsidian's 'source mode'.
     - Get readable text of note (via `vault.get_readable_text()`). This tries to reduce note text to minimal markdown formatting, e.g. preserving paragraphs, headers and punctuation. Only slight processing is needed for various forms of NLP analysis.
   - **canvas file info:**
     - The JSON content of each canvas file is stored as a Python dict in `vault.canvas_content_index`
+    - Data to recreate the layout of content in a canvas file via the `vault.canvas_graph_detail_index` dict
 
 Check out the functionality in the demo repo. Launch the '15 minutes' demo in a virtual machine via Binder:
@@ -58,7 +60,7 @@ There are other API features that try to mirror the Obsidian.md app, for your convenience.
 The text from vault notes goes through this process: markdown → split out front matter from text → HTML → ASCII plaintext.
 
 ## ⏲️ Installation
-``pip install obsidiantools``
+`pip install obsidiantools`
 
 Requires Python 3.9 or higher.
diff --git a/obsidiantools/__init__.py b/obsidiantools/__init__.py
index 70a293b..7300a0e 100644
--- a/obsidiantools/__init__.py
+++ b/obsidiantools/__init__.py
@@ -2,3 +2,4 @@
 from . import md_utils
 from . import html_processing
 from . import canvas_utils
+from . import media_utils
diff --git a/obsidiantools/_constants.py b/obsidiantools/_constants.py
index 6d0ecd1..a5dcff2 100644
--- a/obsidiantools/_constants.py
+++ b/obsidiantools/_constants.py
@@ -13,3 +13,24 @@
 # helpers:
 WIKILINK_AS_STRING_REGEX = r'\[[^\]]+\]\([^)]+\)'
 EMBEDDED_FILE_LINK_AS_STRING_REGEX = r'!?\[{2}([^\]\]]+)\]{2}'
+
+# Sets of extensions via https://help.obsidian.md/How+to/Embed+files :
+# NB: file.ext and file.EXT can exist in same folder
+IMG_EXT_SET = {'.png', '.jpg', '.jpeg', '.gif', '.bmp', '.svg',
+               '.PNG', '.JPG', '.JPEG', '.GIF', '.BMP', '.SVG'}
+AUDIO_EXT_SET = {'.mp3', '.webm', '.wav', '.m4a', '.ogg', '.3gp', '.flac',
+                 '.MP3', '.WEBM', '.WAV', '.M4A', '.OGG', '.3GP', '.FLAC'}
+VIDEO_EXT_SET = {'.mp4', '.webm', '.ogv', '.mov', '.mkv',
+                 '.MP4', '.WEBM', '.OGV', '.MOV', '.MKV'}
+PDF_EXT_SET = {'.pdf',
+               '.PDF'}
+# canvas files:
+CANVAS_EXT_SET = {'.canvas',
+                  '.CANVAS'}
+
+# metadata df cols order:
+METADATA_DF_COLS_GENERIC_TYPE = [
+    'rel_filepath', 'abs_filepath',
+    'file_exists',
+    'n_backlinks', 'n_wikilinks', 'n_tags',
+    'n_embedded_files',
+    'modified_time']
diff --git a/obsidiantools/_io.py b/obsidiantools/_io.py
index e08ef8f..ce86d95 100644
--- a/obsidiantools/_io.py
+++ b/obsidiantools/_io.py
@@ -1,5 +1,6 @@
 from pathlib import Path
 from glob import glob
+import numpy as np
 
 
 def get_relpaths_from_dir(dir_path: Path, *, extension: str) -> list[Path]:
@@ -81,3 +82,31 @@ def get_relpaths_matching_subdirs(dir_path: Path,
                 *, extension=extension)
             if str(i.parent.as_posix()) in include_subdirs_final]
+
+
+def _get_valid_filepaths_by_ext_set(dirpath: Path, *,
+                                    exts: set[str]):
+    all_files = [p.relative_to(dirpath)
+                 for p in Path(dirpath).glob("**/*")
+                 if p.suffix in exts]
+    return all_files
+
+
+def _get_shortest_path_by_filename(relpaths_list: list[Path]) -> dict[str, Path]:
+    # get filename w/ ext only:
+    all_file_names_list = [f.name for f in relpaths_list]
+
+    # get indices of dupe 'filename w/ ext':
+    _, inverse_ix, counts = np.unique(
+        np.array(all_file_names_list),
+        return_inverse=True,
+        return_counts=True,
+        axis=0)
+    dupe_names_ix = np.where(counts[inverse_ix] > 1)[0]
+
+    # get shortest paths via mask:
+    shortest_paths_arr = np.array(all_file_names_list, dtype=object)
+    shortest_paths_arr[dupe_names_ix] = np.array(
+        [str(fpath)
+         for fpath in relpaths_list])[dupe_names_ix]
+    return {fn: path for fn, path in zip(shortest_paths_arr, relpaths_list)}
diff --git a/obsidiantools/api.py b/obsidiantools/api.py
index 148e756..9d893e3 100644
--- a/obsidiantools/api.py
+++ b/obsidiantools/api.py
@@ -1,13 +1,15 @@
-import os
+import warnings
 import networkx as nx
 import numpy as np
 import pandas as pd
 from collections import Counter
 from pathlib import Path
+from itertools import chain
 
 # init
 from .md_utils import (get_md_relpaths_matching_subdirs)
-from .canvas_utils import (get_canvas_relpaths_matching_subdirs)
+from .canvas_utils import (get_canvas_relpaths_matching_subdirs,
+                           _get_all_valid_canvas_file_relpaths)
 # connect
 from .md_utils import (_get_md_front_matter_and_content,
                        _get_html_from_md_content,
@@ -17,11 +19,13 @@
                        _get_all_wikilinks_from_source_text,
                        _get_all_embedded_files_from_source_text,
                        get_tags,
-                       get_source_text_from_html,
                        _get_all_latex_from_html_content)
-# gather
-from .md_utils import (get_source_text_from_md_file,
-                       get_readable_text_from_md_file)
+from ._constants import METADATA_DF_COLS_GENERIC_TYPE
+from ._io import _get_shortest_path_by_filename
+from .media_utils import _get_all_valid_media_file_relpaths
+# gather:
+from .md_utils import (get_source_text_from_html,
+                       _get_readable_text_from_html)
 # canvas:
 from .canvas_utils import (get_canvas_content,
                            get_canvas_graph_detail)
@@ -55,6 +59,7 @@ def __init__(self, dirpath: Path, *,
         The class supports subdirectories and relies heavily on relative
         paths for the API.
 
+        -- ARGS --
         Args:
             dirpath (pathlib Path): the directory to analyse. This would
                 typically be the vault's directory. If you have a
@@ -67,6 +72,7 @@
             include_root (bool, optional): include files that are directly
                 in the dir_path (root dir). Defaults to True.
+        -- METHODS --
         Methods for setup:
             connect: connect notes in a graph
             gather: gather text content of notes
@@ -86,13 +92,21 @@
         Methods for analysis across multiple notes:
             get_note_metadata
-
+        Methods for analysis across multiple media & canvas files:
+            get_media_file_metadata
+            get_canvas_file_metadata
+        Method for all file types:
+            get_all_file_metadata
+
+        -- ATTRIBUTES --
+        - The main file lookups have (*) next to them -
         Attributes - general:
             dirpath (arg)
+            attachments (kwarg)
             is_connected
             is_gathered
         Attributes - md-related:
-            file_index
+            md_file_index (*)
             graph
             backlinks_index
             wikilinks_index
@@ -106,13 +120,22 @@
             isolated_notes
             source_text_index
             readable_text_index
+        Attributes - media files:
+            media_file_index (*)
+            nonexistent_media_files
+            isolated_media_files
         Attributes - canvas-related:
-            canvas_file_index
+            canvas_file_index (*)
+            nonexistent_canvas_files
+            isolated_canvas_files
             canvas_content_index
             canvas_graph_detail_index
         """
+        # args:
         self._dirpath = dirpath
-        self._file_index = self._get_md_relpaths_by_name(
+        self._attachments = None  # connect()
+
+        self._md_file_index = self._get_md_relpaths_by_name(
             include_subdirs=include_subdirs,
             include_root=include_root)
         self._canvas_file_index = self._get_canvas_relpaths_by_name(
@@ -138,9 +161,16 @@
         self._source_text_index = {}
         self._readable_text_index = {}
 
+        # via media files:
+        self._media_file_index = {}
+        self._nonexistent_media_files = []
+        self._isolated_media_files = []
+
         # via canvas content:
         self._canvas_content_index = {}
         self._canvas_graph_detail_index = {}
+        self._nonexistent_canvas_files = []
+        self._isolated_canvas_files = []
 
     @property
     def dirpath(self) -> Path:
@@ -148,13 +178,24 @@
         return self._dirpath
 
     @property
-    def file_index(self) -> dict[str, Path]:
+    def attachments(self) -> bool:
+        """bool: argument for connect method.
+        True to include 'attachment'
+        files.
+        """
+        return self._attachments
+
+    @attachments.setter
+    def attachments(self, value) -> bool:
+        self._attachments = value
+
+    @property
+    def md_file_index(self) -> dict[str, Path]:
         """dict: one-to-one mapping of md filename (k) to relative path (v)"""
-        return self._file_index
+        return self._md_file_index
 
-    @file_index.setter
-    def file_index(self, value) -> dict[str, Path]:
-        self._file_index = value
+    @md_file_index.setter
+    def md_file_index(self, value) -> dict[str, Path]:
+        self._md_file_index = value
 
     @property
     def canvas_file_index(self) -> dict[str, Path]:
@@ -171,54 +212,90 @@ def graph(self) -> nx.MultiDiGraph:
         """networkx Graph"""
         return self._graph
 
+    @graph.setter
+    def graph(self, value) -> nx.MultiDiGraph:
+        self._graph = value
+
     @property
     def backlinks_index(self) -> dict[str, list[str]]:
         """dict of lists: note name (k) to lists (v).
         v is [] if k has no backlinks."""
         return self._backlinks_index
 
+    @backlinks_index.setter
+    def backlinks_index(self, value) -> dict[str, list[str]]:
+        self._backlinks_index = value
+
     @property
     def wikilinks_index(self) -> dict[str, list[str]]:
         """dict of lists: filename (k) to lists (v).
         v is [] if k has no wikilinks."""
         return self._wikilinks_index
 
+    @wikilinks_index.setter
+    def wikilinks_index(self, value) -> dict[str, list[str]]:
+        self._wikilinks_index = value
+
     @property
     def unique_wikilinks_index(self) -> dict[str, list[str]]:
         """dict of lists: filename (k) to lists (v).
         v is [] if k has no wikilinks."""
         return self._unique_wikilinks_index
 
+    @unique_wikilinks_index.setter
+    def unique_wikilinks_index(self, value) -> dict[str, list[str]]:
+        self._unique_wikilinks_index = value
+
     @property
     def embedded_files_index(self) -> dict[str, list[str]]:
         """dict: note name (k) to list of embedded file string (v).
         v is [] if k has no embedded files."""
         return self._embedded_files_index
 
+    @embedded_files_index.setter
+    def embedded_files_index(self, value) -> dict[str, list[str]]:
+        self._embedded_files_index = value
+
     @property
     def math_index(self) -> dict[str, list[str]]:
         """dict: note name (k) to list of LaTeX math string (v).
         v is [] if k has no LaTeX."""
         return self._math_index
 
+    @math_index.setter
+    def math_index(self, value) -> dict[str, list[str]]:
+        self._math_index = value
+
     @property
     def md_links_index(self) -> dict[str, list[str]]:
         """dict of lists: filename (k) to lists (v).
         v is [] if k has no markdown links."""
         return self._md_links_index
 
+    @md_links_index.setter
+    def md_links_index(self, value) -> dict[str, list[str]]:
+        self._md_links_index = value
+
     @property
     def unique_md_links_index(self) -> dict[str, list[str]]:
         """dict of lists: filename (k) to lists (v).
         v is [] if k has no markdown links."""
         return self._unique_md_links_index
 
+    @unique_md_links_index.setter
+    def unique_md_links_index(self, value) -> dict[str, list[str]]:
+        self._unique_md_links_index = value
+
     @property
     def tags_index(self) -> dict[str, list[str]]:
         """dict of lists: filename (k) to lists (v).
         v is [] if k has no tags."""
         return self._tags_index
 
+    @tags_index.setter
+    def tags_index(self, value) -> dict[str, list[str]]:
+        self._tags_index = value
+
     @property
     def nonexistent_notes(self) -> list[str]:
         """list: notes without files, i.e.
         the notes have backlink(s) but
@@ -228,6 +305,10 @@ def nonexistent_notes(self) -> list[str]:
         be created as actual notes one day :-)"""
         return self._nonexistent_notes
 
+    @nonexistent_notes.setter
+    def nonexistent_notes(self, value) -> list[str]:
+        self._nonexistent_notes = value
+
     @property
     def isolated_notes(self) -> list[str]:
         """list: notes (with their own md files) that lack backlinks and
@@ -235,22 +316,91 @@
         Obsidian graph at all."""
         return self._isolated_notes
 
+    @isolated_notes.setter
+    def isolated_notes(self, value) -> list[str]:
+        self._isolated_notes = value
+
     @property
     def front_matter_index(self) -> dict[str, list[str]]:
         """dict: note name (k) to front matter (v). v is {} if no front
         matter was extracted from note."""
         return self._front_matter_index
 
+    @front_matter_index.setter
+    def front_matter_index(self, value) -> dict[str, list[str]]:
+        self._front_matter_index = value
+
+    @property
+    def media_file_index(self) -> dict[str, Path]:
+        """dict: media file (k) to relative path (v).
+
+        These will appear in the index:
+        1. Embedded files that exist.
+        2. Embedded files that don't exist.
+        3. Files that exist in the vault but haven't been embedded.
+        """
+        return self._media_file_index
+
+    @media_file_index.setter
+    def media_file_index(self, value) -> dict[str, Path]:
+        self._media_file_index = value
+
+    @property
+    def nonexistent_media_files(self) -> list[str]:
+        """list: media files that don't exist on the file system yet."""
+        return self._nonexistent_media_files
+
+    @nonexistent_media_files.setter
+    def nonexistent_media_files(self, value) -> list[str]:
+        self._nonexistent_media_files = value
+
+    @property
+    def isolated_media_files(self) -> list[str]:
+        """list: media files that lack backlinks from md files.
+        They are not connected to other notes in the Obsidian graph at all."""
+        return self._isolated_media_files
+
+    @isolated_media_files.setter
+    def isolated_media_files(self, value) -> list[str]:
+        self._isolated_media_files = value
+
+    @property
+    def nonexistent_canvas_files(self) -> list[str]:
+        """list: canvas files that don't exist on the file system yet."""
+        return self._nonexistent_canvas_files
+
+    @nonexistent_canvas_files.setter
+    def nonexistent_canvas_files(self, value) -> list[str]:
+        self._nonexistent_canvas_files = value
+
+    @property
+    def isolated_canvas_files(self) -> list[str]:
+        """list: canvas files that lack backlinks from md files.
+        They are not connected to other notes in the Obsidian graph at all."""
+        return self._isolated_canvas_files
+
+    @isolated_canvas_files.setter
+    def isolated_canvas_files(self, value) -> list[str]:
+        self._isolated_canvas_files = value
+
     @property
     def is_connected(self) -> bool:
         """Bool: has the connect function been called to set up graph?"""
         return self._is_connected
 
+    @is_connected.setter
+    def is_connected(self, value) -> bool:
+        self._is_connected = value
+
     @property
     def is_gathered(self) -> bool:
         """Bool: has the gather function been called to gather text?"""
         return self._is_gathered
 
+    @is_gathered.setter
+    def is_gathered(self, value) -> bool:
+        self._is_gathered = value
+
     @property
     def source_text_index(self) -> dict[str, str]:
         """dict of strings: filename (k) to source text string (v). v is ''
@@ -307,7 +457,8 @@ def canvas_graph_detail_index(self, value) -> \
                  ]:
         self._canvas_graph_detail_index = value
 
-    def connect(self, *, show_nested_tags: bool = False):
+    def connect(self, *, show_nested_tags: bool = False,
+                attachments=False):
         """connect your notes together by representing the vault as a
         Networkx graph object, G.
 
@@ -321,82 +472,315 @@
             show_nested_tags (Boolean): show nested tags in the output.
                 Defaults to False (which would mean only the highest level
                 of any nested tags are included in the output).
+            attachments (Boolean): Defaults to False. 'Attachments' refers
+                to the graph toggle option in the Obsidian app. By default,
+                obsidiantools will only include md files (notes) in the
+                graph (i.e. like Attachments is toggled off in Obsidian app).
+                To include media files in the graph, set this option to True.
+                This will lead to the inclusion of media files in the
+                backlinks_index.
         """
-        # md content:
         if not self._is_connected:
+            self._attachments = attachments
+
+            # md content:
             # index dicts, where k is a note name in the vault:
-            md_links_ix = {}
-            md_links_unique_ix = {}
-            embedded_files_ix = {}
-            tags_ix = {}
-            math_ix = {}
-            front_matter_ix = {}
-            wikilinks_ix = {}
-            wikilinks_unique_ix = {}
+            self._md_links_index = {}
+            self._unique_md_links_index = {}
+            self._embedded_files_index = {}
+            self._tags_index = {}
+            self._math_index = {}
+            self._front_matter_index = {}
+            # to be used for graph:
+            self._wikilinks_index = {}
+            self._unique_wikilinks_index = {}
 
             # loop through md files:
-            for f, relpath in self._file_index.items():
-                # MAIN file read:
-                front_matter, content = _get_md_front_matter_and_content(
+            for f, relpath in self._md_file_index.items():
+                self._connect_update_based_on_new_relpath(
+                    relpath, note=f,
+                    show_nested_tags=show_nested_tags)
+
+            # canvas content:
+            # loop through canvas files:
+            self._canvas_content_index = {}
+            self._canvas_graph_detail_index = {}
+            for f, relpath in self._canvas_file_index.items():
+                content_c = get_canvas_content(
                     self._dirpath / relpath)
-                html = _get_html_from_md_content(content)
-                src_txt = get_source_text_from_html(
-                    html, remove_code=True)
-
-                # info from core text:
-                md_links_ix[f] = _get_md_links_from_source_text(src_txt)
-                md_links_unique_ix[f] = _get_unique_md_links_from_source_text(src_txt)
-                embedded_files_ix[f] = _get_all_embedded_files_from_source_text(
-                    src_txt, remove_aliases=True)
-                wikilinks_ix[f] = _get_all_wikilinks_from_source_text(
-                    src_txt, remove_aliases=True)
-                wikilinks_unique_ix[f] = _get_unique_wikilinks_from_source_text(
-                    src_txt, remove_aliases=True)
-                # info from html:
-                math_ix[f] = _get_all_latex_from_html_content(html)
-                # split out front matter:
-                front_matter_ix[f] = front_matter
-
-                # MORE file reads needed for extra info:
-                tags_ix[f] = get_tags(self._dirpath / relpath,
-                                      show_nested=show_nested_tags)
-
-            self._md_links_index = md_links_ix
-            self._unique_md_links_index = md_links_unique_ix
-            self._embedded_files_index = embedded_files_ix
-            self._tags_index = tags_ix
-            self._math_index = math_ix
-            self._front_matter_index = front_matter_ix
-            # to be used for graph:
-            self._wikilinks_index = wikilinks_ix
-            self._unique_wikilinks_index = wikilinks_unique_ix
-
-            # graph:
-            G = nx.MultiDiGraph(wikilinks_ix)
+                self._canvas_content_index[f] = content_c
+                G_c, pos_c, edge_labels_c = get_canvas_graph_detail(
+                    content_c)
+                self._canvas_graph_detail_index[f] = G_c, pos_c, edge_labels_c
+
+            # set these up before graph is created:
+            self._set_canvas_file_attrs()
+            self._set_media_file_attrs()
+
+            # graph setup:
+            graph_data_dict = self.__get_graph_data_dict(
+                attachments=attachments)
+            G = nx.MultiDiGraph(graph_data_dict)
             self._graph = G
 
-            # info obtained from graph:
-            self._backlinks_index = self._get_backlinks_index(graph=G)
-            self._nonexistent_notes = self._get_nonexistent_notes()
-            self._isolated_notes = self._get_isolated_notes(graph=G)
+            self._set_graph_related_attributes()
 
-            self._is_connected = True
+            # set these again so that they are finally correct
+            # (to remove notes / md files from the 'nonexistent_*' attrs,
+            # the nonexistent_notes are required from the graph)
+            self._set_canvas_file_attrs()
+            self._set_media_file_attrs()
 
-            # canvas content:
-            # loop through canvas files:
-            canvas_content_ix = {}
-            canvas_graph_detail_ix = {}
-            for f, relpath in self._canvas_file_index.items():
-                content_c = get_canvas_content(
-                    self._dirpath / relpath)
-                canvas_content_ix[f] = content_c
-                G_c, pos_c, edge_labels_c = get_canvas_graph_detail(
-                    content_c)
-                canvas_graph_detail_ix[f] = G_c, pos_c, edge_labels_c
-            self._canvas_content_index = canvas_content_ix
-            self._canvas_graph_detail_index = canvas_graph_detail_ix
+            self._is_connected = True
 
         return self  # fluent
 
+    def _connect_update_based_on_new_relpath(self, relpath: Path, *,
+                                             note: str,
+                                             show_nested_tags: bool):
+        """Individual file read & associated attrs update for the
+        connect method."""
+        exclude_canvas = not self._attachments
+
+        # MAIN file read:
+        front_matter, content = _get_md_front_matter_and_content(
+            self._dirpath / relpath)
+        html = _get_html_from_md_content(content)
+        src_txt = get_source_text_from_html(
+            html, remove_code=True)
+
+        # info from core text:
+        self._md_links_index[note] = (
+            _get_md_links_from_source_text(src_txt))
+        self._unique_md_links_index[note] = (
+            _get_unique_md_links_from_source_text(src_txt))
+        self._embedded_files_index[note] = (
+            _get_all_embedded_files_from_source_text(
+                src_txt, remove_aliases=True)
+            # (aliases are redundant for connect method)
+        )
+        self._wikilinks_index[note] = (
+            _get_all_wikilinks_from_source_text(
+                src_txt, remove_aliases=True,
+                exclude_canvas=exclude_canvas))
+        self._unique_wikilinks_index[note] = (
+            _get_unique_wikilinks_from_source_text(
+                src_txt, remove_aliases=True,
+                exclude_canvas=exclude_canvas))
+        # info from html:
+        self._math_index[note] = (_get_all_latex_from_html_content(
+            html))
+        # split out front matter:
+        self._front_matter_index[note] = front_matter
+
+        # MORE file reads needed for extra info:
+        self._tags_index[note] = get_tags(
+            self._dirpath / relpath,
+            show_nested=show_nested_tags)
+
+    def _set_media_file_attrs(self):
+        (embedded_files_by_short_path,
+         non_embedded_files_by_short_path,
+         nonexistent_files_by_short_path) = (
+            self._get_media_file_dicts_tuple())
+
+        # only set media file index once:
+        if not self._media_file_index:
+            files_ix = {**embedded_files_by_short_path,
+                        **non_embedded_files_by_short_path}
+            self._media_file_index = files_ix
+        # these attrs can be set again, once graph is created:
+        self._nonexistent_media_files = list(
+            nonexistent_files_by_short_path.keys())
+        self._isolated_media_files = list(
+            non_embedded_files_by_short_path.keys())
+
+    def _set_canvas_file_attrs(self):
+        (linked_files_by_short_path,
+         non_linked_files_by_short_path,
+         nonexistent_files_by_short_path) = (
+            self._get_canvas_file_dicts_tuple())
+
+        self._nonexistent_canvas_files = list(
+            nonexistent_files_by_short_path.keys())
+        self._isolated_canvas_files = list(
+            non_linked_files_by_short_path.keys())
+
+    def _get_media_file_dicts_tuple(self) \
+            -> tuple[dict[str, Path], dict[str, Path], dict[str, Path]]:
+        """Return (existent files embedded,
+        existent files not embedded,
+        nonexistent files embedded).
+
+        The reason this logic is complex is that media files are embedded in
+        md files in the Obsidian app using the shortest possible filepath,
+        but they all need to be cross-checked against actual media filepaths.
+        """
+
+        # detail on all embedded files AND ones that exist:
+        all_files_embedded_in_notes = list(
+            chain.from_iterable(self._embedded_files_index.values()))
+        media_file_relpaths_existent = _get_all_valid_media_file_relpaths(
+            self._dirpath)
+        return self.__get_file_dicts_tuple(
+            all_files_embedded_in_notes,
+            links_index=self._embedded_files_index,
+            existing_file_relpaths=media_file_relpaths_existent,
+            file_type='media')
+
+    def _get_canvas_file_dicts_tuple(self) \
+            -> tuple[dict[str, Path], dict[str, Path], dict[str, Path]]:
+        """Return (existent files linked,
+        existent files not linked,
+        nonexistent files linked).
+
+        The reason this logic is complex is that canvas files are linked in
+        md files in the Obsidian app using the shortest possible filepath,
+        but they all need to be cross-checked against actual canvas filepaths.
+        """
+
+        # detail on all linked files AND ones that exist:
+        all_files_linked_in_notes = list(
+            chain.from_iterable(self._wikilinks_index.values()))
+        canvas_file_relpaths_existent = _get_all_valid_canvas_file_relpaths(
+            self._dirpath)
+        return self.__get_file_dicts_tuple(
+            all_files_linked_in_notes,
+            links_index=self._wikilinks_index,
+            existing_file_relpaths=canvas_file_relpaths_existent,
+            file_type='canvas')
+
+    def __get_file_dicts_tuple(self, linked_files_list: list[str], *,
+                               links_index: dict[str, list[str]],
+                               existing_file_relpaths: list[Path],
+                               file_type: str):
+        # get shortest path for each 'linked' file of chosen type;
+        # check whether each exists
+        shortest_names_existent = _get_shortest_path_by_filename(
+            existing_file_relpaths)
+        # for nonexistent files, don't want to catch other types:
+        short_names_not_wanted_set = (
+            set(shortest_names_existent)
+            .union(set(self._nonexistent_notes))
+            .union(set(self._md_file_index)))
+        if file_type == 'canvas':
+            other_fpaths_not_wanted_set = _get_all_valid_media_file_relpaths(
+                self._dirpath)
+        elif file_type == 'media':
+            other_fpaths_not_wanted_set = _get_all_valid_canvas_file_relpaths(
+                self._dirpath)
+        else:
+            raise ValueError(
+                'Value for file_type must be either "canvas" or "media".')
+        shortest_names_nonexistent = {
+            fn: Path(fn) for fn in chain(*links_index.values())
+            if fn not in short_names_not_wanted_set
+            and Path(fn) not in other_fpaths_not_wanted_set}
+        shortest_names = {**shortest_names_existent,
+                          **shortest_names_nonexistent}
+
+        # SETS
+        # existent files (either linked or not):
+        set_files_existent_linked = (
+            set(shortest_names_existent)
+            .intersection(set(linked_files_list)))
+        set_files_existent_not_linked = (
+            set(shortest_names_existent)
+            .difference(set_files_existent_linked))
+        # nonexistent files:
+        set_files_nonexistent_linked = (
+            set(linked_files_list)
+            .intersection(set(shortest_names_nonexistent)))
+
+        # DICTS
+        # existent files (either linked or not):
+        linked_files_by_short_path = {
+            short_path: rel_path
+            for short_path, rel_path in shortest_names.items()
+            if short_path in set_files_existent_linked}
+        non_linked_files_by_short_path = {
+            short_path: rel_path
+            for short_path, rel_path in shortest_names.items()
+            if short_path in set_files_existent_not_linked}
+        # nonexistent files:
+        nonexistent_files_by_short_path = {
+            short_path: np.NaN
+            for short_path in shortest_names_nonexistent.keys()
+            if short_path in set_files_nonexistent_linked}
+
+        return (linked_files_by_short_path,
+                non_linked_files_by_short_path,
+                nonexistent_files_by_short_path)
+
+    def _get_backlink_counts_for_media_files_only(self) -> dict[str, int]:
+        dict_out = dict.fromkeys(self._media_file_index.keys(), 0)
+        dict_counts = dict(
+            Counter(list(chain(*self._embedded_files_index.values()))))
+        # merge counts into dict_out:
+        dict_out = {**dict_out, **dict_counts}
+        return dict_out
+
+    def _get_backlink_counts_for_canvas_files_only(self) -> dict[str, int]:
+        if not self._attachments:
+            raise AttributeError(
+                'Set attachments=True in connect() to get backlink counts '
+                'for canvas files.')
+        dict_out = dict.fromkeys(self._canvas_file_index.keys(), 0)
+        dict_counts = dict(
+            Counter(list(chain(*self._wikilinks_index.values()))))
+        # merge counts into dict_out:
+        dict_out = {**dict_out, **dict_counts}
+        return dict_out
+
+    def __get_graph_data_dict(self, *, attachments=False) -> \
+            dict[str, list[str]]:
+        """Get the dict {k: v} of the graph's data:
+        where k is a note name and v is a list of the 'wikilinks' in
+        a note.
+
+        The data are used to build the graph, based on the 'wikilinks'
+        in each note. Media files cannot have wikilinks, so they are not
+        in the dict keys, but can be inside the dict values as backlinks.
+        The detail in the dictionary is used to build the nodes and
+        edges in the graph.
+
+        Args:
+            attachments (Bool): Defaults to False. If True, then 'Attachments'
+                files will be included as nodes in the graph.
+                The shortest possible filepath will be used for those files
+                (as they would appear in the note editor itself, rather than
+                the full relative paths in the Obsidian app's graph view).
+
+        Returns:
+            dict
+        """
+        if not attachments:
+            # i) graph uses wikilinks (no embedded files).
+            # ii) the wikilinks index will have been set before based on the
+            #     attachments kwarg to exclude canvas files
+            return self._wikilinks_index
+        else:
+            # attachments include 'media' files and canvas files:
+            # i) use wikilinks & embedded file info for graph edges:
+            d_out = {
+                n: (self._wikilinks_index.get(n, [])
+                    + self._embedded_files_index.get(n, [])
+                    )
+                for n in (set(list(self._wikilinks_index.keys())
+                              + list(self._embedded_files_index.keys())))
+            }
+            # ii) add isolated media files & canvas files as nodes:
+            isolated_files_dict = {
+                short_path: [] for short_path
+                in [*self._isolated_media_files,
+                    *self._isolated_canvas_files]}
+            d_out = {**d_out,
+                     **isolated_files_dict}
+            return d_out
+
+    def _set_graph_related_attributes(self):
+        self._backlinks_index = self._get_backlinks_index(
+            graph=self._graph)
+        self._nonexistent_notes = self._get_nonexistent_notes()
+        self._isolated_notes = self._get_isolated_notes(
+            graph=self._graph)
+
     def gather(self, *, tags: list[str] = None):
         """gather the content of your notes so that all the plaintext is
         stored in one place for easy access.
 
@@ -419,21 +803,32 @@
             will remove all header formatting (e.g. '#', '##' chars) and
             produces a one-line string.
""" - # source text will not remove any content: - self._source_text_index = { - k: get_source_text_from_md_file(self._dirpath / v, - remove_code=True, - remove_math=True) - for k, v in self._file_index.items()} - self._readable_text_index = { - k: get_readable_text_from_md_file(self._dirpath / v, - tags=tags) - for k, v in self._file_index.items()} - + for f, relpath in self._md_file_index.items(): + self._gather_update_based_on_new_relpath( + relpath, + note=f, tags=tags) self._is_gathered = True return self # fluent + def _gather_update_based_on_new_relpath(self, relpath: Path, *, + note: str, tags: list[str]): + """Individual file read & associated attrs update for the + gather method.""" + # MAIN file read: + _, content = _get_md_front_matter_and_content( + self._dirpath / relpath) + html = _get_html_from_md_content(content) + # (also remove LaTeX for source text:) + src_txt = get_source_text_from_html( + html, remove_code=True, remove_math=True) + + # 'source' text will not remove any content, but 'readable' will: + self._source_text_index[note] = src_txt + self._readable_text_index[note] = ( + _get_readable_text_from_html( + html, tags=tags)) + def get_backlinks(self, note_name: str) -> list[str]: """Get backlinks for a note (given its name). @@ -481,12 +876,12 @@ def get_wikilinks(self, file_name: str) -> list[str]: """Get wikilinks for a note (given its filename). Wikilinks can only appear in notes that already exist, so if a - note is not in the file_index at all then the function will raise + note is not in the md_file_index at all then the function will raise a ValueError. Args: - file_name (str): the filename string that is in the file_index. - This is NOT the filepath! + file_name (str): the filename string that is in the + md_file_index. This is NOT the filepath! 
Returns: list @@ -494,7 +889,7 @@ def get_wikilinks(self, file_name: str) -> list[str]: if not self._is_connected: raise AttributeError('Connect notes before calling the function') - if file_name not in self._file_index: + if file_name not in self._md_file_index: raise ValueError('"{}" does not exist so it cannot have wikilinks.'.format(file_name)) else: return self._wikilinks_index[file_name] @@ -522,11 +917,11 @@ def get_embedded_files(self, file_name: str) -> list[str]: """Get embedded files for a note (given its filename). Embedded files can only appear in notes that already exist, so if a - note is not in the file_index at all then the function will raise + note is not in the md_file_index at all then the function will raise a ValueError. Args: - file_name (str): the filename string that is in the file_index. + file_name (str): the filename string that is in the md_file_index. This is NOT the filepath! Returns: @@ -535,7 +930,7 @@ def get_embedded_files(self, file_name: str) -> list[str]: if not self._is_connected: raise AttributeError('Connect notes before calling the function') - if file_name not in self._file_index: + if file_name not in self._md_file_index: raise ValueError('"{}" does not exist so it cannot have embedded files.'.format(file_name)) else: return self._embedded_files_index[file_name] @@ -544,11 +939,11 @@ def get_md_links(self, file_name: str) -> list[str]: """Get markdown links for a note (given its filename). Markdown links can only appear in notes that already exist, so if a - note is not in the file_index at all then the function will raise + note is not in the md_file_index at all then the function will raise a ValueError. Args: - file_name (str): the filename string that is in the file_index. + file_name (str): the filename string that is in the md_file_index. This is NOT the filepath! 
Returns: @@ -557,7 +952,7 @@ def get_md_links(self, file_name: str) -> list[str]: if not self._is_connected: raise AttributeError('Connect notes before calling the function') - if file_name not in self._file_index: + if file_name not in self._md_file_index: raise ValueError('"{}" does not exist so it cannot have md links.'.format(file_name)) else: return self._md_links_index[file_name] @@ -566,11 +961,11 @@ def get_front_matter(self, file_name: str) -> list[dict]: """Get front matter for a note (given its filename). Front matter can only appear in notes that already exist, so if a - note is not in the file_index at all then the function will raise + note is not in the md_file_index at all then the function will raise a ValueError. Args: - file_name (str): the filename string that is in the file_index. + file_name (str): the filename string that is in the md_file_index. This is NOT the filepath! Returns: @@ -578,31 +973,30 @@ def get_front_matter(self, file_name: str) -> list[dict]: """ if not self._is_connected: raise AttributeError('Connect notes before calling the function') - if file_name not in self._file_index: + if file_name not in self._md_file_index: raise ValueError('"{}" does not exist so it cannot have front matter.'.format(file_name)) else: return self._front_matter_index[file_name] - def get_tags(self, file_name: str, *, - show_nested: bool = False) -> list[str]: + def get_tags(self, file_name: str) -> list[str]: """Get tags for a note (given its filename). By default, only the highest level of any nested tags is shown in the output. Tags can only appear in notes that already exist, so if a - note is not in the file_index at all then the function will raise + note is not in the md_file_index at all then the function will raise a ValueError. Args: - file_name (str): the filename string that is in the file_index. - This is NOT the filepath! + file_name (str): the filename string that is in the + md_file_index. This is NOT the filepath! 
Returns: list """ if not self._is_connected: raise AttributeError('Connect notes before calling the function') - if file_name not in self._file_index: + if file_name not in self._md_file_index: raise ValueError('"{}" does not exist so it cannot have tags.'.format(file_name)) else: return self._tags_index[file_name] @@ -612,11 +1006,11 @@ def get_source_text(self, file_name: str) -> str: function 'gather' to have been called. Text can only appear in notes that already exist, so if a - note is not in the file_index at all then the function will raise + note is not in the md_file_index at all then the function will raise a ValueError. Args: - file_name (str): the filename string that is in the file_index. + file_name (str): the filename string that is in the md_file_index. This is NOT the filepath! Returns: @@ -624,7 +1018,7 @@ def get_source_text(self, file_name: str) -> str: """ if not self._is_gathered: raise AttributeError('Gather notes before calling the function') - if file_name not in self._file_index: + if file_name not in self._md_file_index: raise ValueError('"{}" does not exist so it cannot have text.'.format(file_name)) else: return self._source_text_index[file_name] @@ -638,11 +1032,11 @@ def get_readable_text(self, file_name: str) -> str: before the final output. Text can only appear in notes that already exist, so if a - note is not in the file_index at all then the function will raise + note is not in the md_file_index at all then the function will raise a ValueError. Args: - file_name (str): the filename string that is in the file_index. + file_name (str): the filename string that is in the md_file_index. This is NOT the filepath! 
Returns: @@ -650,7 +1044,7 @@ def get_readable_text(self, file_name: str) -> str: """ if not self._is_gathered: raise AttributeError('Gather notes before calling the function') - if file_name not in self._file_index: + if file_name not in self._md_file_index: raise ValueError('"{}" does not exist so it cannot have text.'.format(file_name)) else: return self._readable_text_index[file_name] @@ -713,7 +1107,9 @@ def __get_relpaths_by_name(self, *, extension, **kwargs) -> dict[str, Path]: shortest_paths_arr[dupe_names_ix] = np.array( [str(fpath) for fpath in relpaths_list])[dupe_names_ix] - return {n: p for n, p in zip(shortest_paths_arr, relpaths_list)} + + dict_out = {n: p for n, p in zip(shortest_paths_arr, relpaths_list)} + return dict_out def _get_md_relpaths_by_name(self, **kwargs) -> dict[str, Path]: return self.__get_relpaths_by_name(extension='md', @@ -723,13 +1119,14 @@ def _get_canvas_relpaths_by_name(self, **kwargs) -> dict[str, Path]: return self.__get_relpaths_by_name(extension='canvas', **kwargs) - def _get_backlinks_index(self, *, + @staticmethod + def _get_backlinks_index(*, graph: nx.MultiDiGraph) -> dict[str, list[str]]: """Return k,v pairs where k is the md note name and v is list of ALL backlinks found in k""" return {n: [n[0] for n in list(graph.in_edges(n))] - for n in self._graph.nodes} + for n in graph.nodes} def get_note_metadata(self) -> pd.DataFrame: """Structured dataset of metadata on the vault's notes. 
This @@ -743,13 +1140,20 @@ def get_note_metadata(self) -> pd.DataFrame: Returns: pd.DataFrame """ - if not self._is_connected: raise AttributeError('Connect notes before calling the function') - df = (pd.DataFrame(index=list(self._backlinks_index.keys())) - .rename_axis('note') - .pipe(self._create_note_metadata_columns) + ix_list = list(set(self._backlinks_index.keys()) + .difference(set(self._media_file_index)) + .difference(set(self._nonexistent_media_files)) + .difference(set(self._canvas_file_index)) + ) + + df = (pd.DataFrame(index=ix_list, + columns=METADATA_DF_COLS_GENERIC_TYPE) + .rename(columns={'file_exists': 'note_exists'}) + .rename_axis('note')) + df = (df.pipe(self._create_note_metadata_columns) .pipe(self._clean_up_note_metadata_dtypes) ) return df @@ -757,31 +1161,32 @@ def get_note_metadata(self) -> pd.DataFrame: def _create_note_metadata_columns(self, df: pd.DataFrame) -> pd.DataFrame: """pipe func for mutating df""" - df['rel_filepath'] = [self._file_index.get(f, np.NaN) - for f in df.index] + df['rel_filepath'] = [self._md_file_index.get(f, np.NaN) + for f in df.index.tolist()] df['abs_filepath'] = np.where(df['rel_filepath'].notna(), - [self._dirpath / Path(str(f)) - for f in df['rel_filepath']], + [self._dirpath / str(f) + for f in df['rel_filepath'].tolist()], np.NaN) df['note_exists'] = np.where(df['rel_filepath'].notna(), True, False) df['n_backlinks'] = [len(self.get_backlinks(f)) for f in df.index] df['n_wikilinks'] = np.where(df['note_exists'], [len(self._wikilinks_index.get(f, [])) - for f in df.index], + for f in df.index.tolist()], np.NaN) df['n_tags'] = np.where(df['note_exists'], [len(self._tags_index.get(f, [])) - for f in df.index], + for f in df.index.tolist()], np.NaN) df['n_embedded_files'] = np.where(df['note_exists'], [len(self._embedded_files_index.get( f, [])) - for f in df.index], + for f in df.index.tolist()], np.NaN) df['modified_time'] = pd.to_datetime( - [os.path.getmtime(f) if not pd.isna(f) else np.NaN - for f in 
df['abs_filepath']], + [f.lstat().st_mtime if not pd.isna(f) + else pd.NaT + for f in df['abs_filepath'].tolist()], unit='s' ) return df @@ -796,13 +1201,170 @@ def _clean_up_note_metadata_dtypes, df['n_wikilinks'] = df['n_wikilinks'].astype(float) # for consistency return df + def get_media_file_metadata(self) -> pd.DataFrame: + """Get a structured dataset of metadata on the vault's + media files. This includes filepaths and counts of different + link types. + + The df is indexed by media 'file' (i.e. nodes in the graph). + These will appear in the index: + 1. Embedded files that exist. + 2. Embedded files that don't exist. + 3. Files that exist in the vault but haven't been embedded. + + This dataset is available regardless of how the vault object has + been set up: it will have metadata on the media files whether + or not you have configured media files to appear in the + obsidiantools graph. + + Files that haven't been created will only have info on the number + of backlinks - other columns will have NaN.
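The shape of the frame this method returns can be sketched standalone with toy indexes (file names are hypothetical and this is a column subset only; the real method also derives filepaths, backlink counts and modified times from the vault):

```python
import numpy as np
import pandas as pd

# toy indexes, for illustration only:
media_file_index = {'img.png': 'attachments/img.png'}
nonexistent_media_files = ['missing.mp3']

# index holds both existing and nonexistent embedded files:
ix = [*media_file_index, *nonexistent_media_files]
df = pd.DataFrame(index=ix).rename_axis('file')
df['rel_filepath'] = [media_file_index.get(f, np.nan) for f in df.index]
df['file_exists'] = ~df.index.isin(nonexistent_media_files)
```

Nonexistent files get NaN for path columns but still occupy a row, mirroring the docstring above.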
+ + Returns: + pd.DataFrame + """ + ix_list = [*list(self._media_file_index.keys()), + *self._nonexistent_media_files] + df = (pd.DataFrame(index=ix_list, + columns=METADATA_DF_COLS_GENERIC_TYPE) + .rename_axis('file')) + if not ix_list: + return df + else: + df = df.pipe(self._create_media_file_metadata_columns) + return df + + def _create_media_file_metadata_columns(self, + df: pd.DataFrame) -> pd.DataFrame: + """pipe func for mutating df""" + df['rel_filepath'] = [self._media_file_index.get(f, np.NaN) + for f in df.index.tolist()] + df['abs_filepath'] = np.where(df['rel_filepath'].notna(), + [self._dirpath / str(f) + for f in df['rel_filepath'].tolist()], + np.NaN) + df['file_exists'] = pd.Series( + np.logical_not(df.index.isin(self._nonexistent_media_files)), + index=df.index) + df['n_backlinks'] = self._get_backlink_counts_for_media_files_only() + df['modified_time'] = pd.to_datetime( + [f.lstat().st_mtime if not pd.isna(f) + else pd.NaT + for f in df['abs_filepath'].tolist()], + unit='s') + return df + + def get_canvas_file_metadata(self) -> pd.DataFrame: + """Get a structured dataset of metadata on the vault's + canvas files. This includes filepaths and counts of different + link types. + + The df is indexed by canvas 'file' (i.e. nodes in the graph). + These will appear in the index: + 1. Linked files that exist. + 2. Linked files that don't exist. + 3. Files that exist in the vault but haven't been linked. + + This dataset is available regardless of how the vault object has + been set up: it will have metadata on the canvas files whether + or not you have configured canvas files to appear in the + obsidiantools graph. However, the n_backlinks column will only be + calculated if attachments=True in the connect() method. + + Files that haven't been created will only have info on the number + of backlinks - other columns will have NaN.
+ + Returns: + pd.DataFrame + """ + ix_list = [*list(self._canvas_file_index.keys()), + *self._nonexistent_canvas_files] + df = (pd.DataFrame(index=ix_list, + columns=METADATA_DF_COLS_GENERIC_TYPE) + .rename_axis('file')) + if not ix_list: + return df + else: + df = df.pipe(self._create_canvas_file_metadata_columns) + return df + + def _create_canvas_file_metadata_columns(self, + df: pd.DataFrame) -> pd.DataFrame: + """pipe func for mutating df""" + df['rel_filepath'] = [self._canvas_file_index.get(f, np.NaN) + for f in df.index.tolist()] + df['abs_filepath'] = np.where(df['rel_filepath'].notna(), + [self._dirpath / str(f) + for f in df['rel_filepath'].tolist()], + np.NaN) + df['file_exists'] = pd.Series( + np.logical_not(df.index.isin(self._nonexistent_canvas_files)), + index=df.index) + if self._attachments: + df['n_backlinks'] = ( + self._get_backlink_counts_for_canvas_files_only()) + else: + df['n_backlinks'] = np.NaN + df['modified_time'] = pd.to_datetime( + [f.lstat().st_mtime if not pd.isna(f) + else pd.NaT + for f in df['abs_filepath'].tolist()], + unit='s') + return df + + def get_all_file_metadata(self) -> pd.DataFrame: + """Get a structured dataset of metadata on the vault's files, where + they are supported by the Obsidian app. This includes detail on + notes (md files), canvas files and media files. + + The df is indexed by 'file' (i.e. nodes in the graph). + These will appear in the index: + 1. Linked/embedded files that exist. + 2. Linked/embedded files that don't exist. + 3. Files that exist in the vault but haven't been linked/embedded. + + If attachments=False was set in the connect method, then only notes + (md files) will appear in the dataset. + Otherwise, notes, media files and canvas files will appear in the + dataset. + In both situations, n_backlinks = n_wikilinks + n_embedded_files. + + Files that haven't been created will only have info on the number + of backlinks; other columns in the dataset will have NaN values. 
+ + Returns: + pd.DataFrame + """ + df = (self.get_note_metadata() + .rename(columns={'note_exists': 'file_exists'})) + df['graph_category'] = np.where( + df['file_exists'], 'note', 'nonexistent') + if not self._attachments: + warnings.warn('Only notes (md files) were used to build the graph. Set attachments=True in the connect method to show all file metadata.') + else: + df_media = self.get_media_file_metadata() + df_media['graph_category'] = np.where( + df_media['file_exists'], 'attachment', 'nonexistent') + df_canvas = self.get_canvas_file_metadata() + df_canvas['graph_category'] = np.where( + df_canvas['file_exists'], 'attachment', 'nonexistent') + + df = (pd.concat( + [df, df_media, df_canvas]) + .rename_axis('file')) + return df + + def _get_nonexistent_notes(self) -> list[str]: + """Get notes that have backlinks but don't have md files. + + The comparison is done with sets but the result is returned + as a list.""" - return list(set(self.backlinks_index.keys()) - .difference(set(self.file_index))) + return list(set(self._backlinks_index.keys()) + # anything remaining that isn't a file is a nonexistent note: + .difference(set(self._md_file_index)) + .difference(set(self._media_file_index)) + .difference(set(self._nonexistent_media_files)) + .difference(set(self._canvas_file_index))) + + def _get_isolated_notes(self, *, graph: nx.MultiDiGraph) -> list[str]: + """Get notes that are not connected to any other notes in the vault, + i.e. they have 0 wikilinks and 0 backlinks.
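The concat step in get_all_file_metadata can be pictured with toy frames: notes are labelled 'note' or 'nonexistent', attachments 'attachment' or 'nonexistent', then everything is stacked under one 'file' index. A standalone sketch with made-up data (column subset only):

```python
import numpy as np
import pandas as pd

# toy frames standing in for the note/media metadata (illustrative only):
notes = pd.DataFrame({'file_exists': [True, False]},
                     index=['Alimenta', 'Tarpeia'])
notes['graph_category'] = np.where(notes['file_exists'],
                                   'note', 'nonexistent')

media = pd.DataFrame({'file_exists': [True]}, index=['img.png'])
media['graph_category'] = np.where(media['file_exists'],
                                   'attachment', 'nonexistent')

# stack everything under a single 'file' index:
all_files = pd.concat([notes, media]).rename_axis('file')
```

The `graph_category` column is what lets downstream code filter one combined frame back down to notes, attachments or missing targets.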
These notes are retrieved from the graph.""" - return list(nx.isolates(graph)) + return [fn for fn in nx.isolates(graph) + if fn in self._md_file_index] diff --git a/obsidiantools/canvas_utils.py b/obsidiantools/canvas_utils.py index ccc77dc..c40d009 100644 --- a/obsidiantools/canvas_utils.py +++ b/obsidiantools/canvas_utils.py @@ -1,8 +1,10 @@ import json import networkx as nx from pathlib import Path +from ._constants import CANVAS_EXT_SET from ._io import (get_relpaths_from_dir, - get_relpaths_matching_subdirs) + get_relpaths_matching_subdirs, + _get_valid_filepaths_by_ext_set) def get_canvas_relpaths_from_dir(dir_path: Path) -> list[Path]: @@ -59,6 +61,12 @@ def get_canvas_relpaths_matching_subdirs(dir_path: Path, *, include_root=include_root) +def _get_all_valid_canvas_file_relpaths(dirpath): + return (_get_valid_filepaths_by_ext_set( + dirpath, + exts=CANVAS_EXT_SET)) + + def get_canvas_content(filepath: Path) -> dict: """Get JSON content from canvas file as a Python dict. diff --git a/obsidiantools/html_processing.py b/obsidiantools/html_processing.py index c70325d..f04ad86 100644 --- a/obsidiantools/html_processing.py +++ b/obsidiantools/html_processing.py @@ -28,20 +28,30 @@ def _get_plaintext_from_html(html: str) -> str: def _remove_code(html: str) -> str: # exclude 'code' tags from link output: soup = BeautifulSoup(html, 'lxml') - for s in soup.select('code'): - s.extract() + soup = _remove_code_via_soup(soup) html_str = str(soup) return html_str +def _remove_code_via_soup(soup): + for s in soup.select('code'): + s.extract() + return soup + + def _remove_del_text(html: str) -> str: soup = BeautifulSoup(html, 'lxml') - for s in soup.select('del'): - s.extract() + soup = _remove_del_text_via_soup(soup) html_str = str(soup) return html_str +def _remove_del_text_via_soup(soup): + for s in soup.select('del'): + s.extract() + return soup + + def _remove_main_formatting( html: str, *, tags: list[str] = ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']) -> str: @@ 
-50,16 +60,21 @@ def _remove_main_formatting( def _remove_latex(html: str) -> str: soup = BeautifulSoup(html, 'lxml') - for s in soup.select('span', {'class': 'MathJax_Preview'}): - s.extract() + soup = _remove_latex_via_soup(soup) html_str = str(soup) return html_str +def _remove_latex_via_soup(soup): + for s in soup.select('span.MathJax_Preview'): + s.extract() + return soup + + def _get_all_latex_from_html_content(html: str) -> list[str]: soup = BeautifulSoup(html, 'html.parser') s_content = soup.find_all('span', {'class': 'MathJax_Preview'}, - text=True) - latex_found_list = [i.text for i in s_content] + string=True) + latex_found_list = [i.string for i in s_content] return latex_found_list diff --git a/obsidiantools/md_utils.py b/obsidiantools/md_utils.py index b5aabbc..c99a52e 100644 --- a/obsidiantools/md_utils.py +++ b/obsidiantools/md_utils.py @@ -1,7 +1,7 @@ import re import yaml from pathlib import Path -from glob import glob +from bs4 import BeautifulSoup import markdown import frontmatter from ._constants import (WIKILINK_REGEX, @@ -13,7 +13,9 @@ from ._io import (get_relpaths_from_dir, get_relpaths_matching_subdirs) from .html_processing import (_get_plaintext_from_html, - _remove_code, _remove_latex, _remove_del_text, + _remove_code_via_soup, + _remove_latex_via_soup, + _remove_del_text_via_soup, + _remove_main_formatting, _get_all_latex_from_html_content) @@ -72,7 +74,8 @@ def get_md_relpaths_matching_subdirs(dir_path: Path, *, include_root=include_root) -def get_wikilinks(filepath: Path) -> list[str]: +def get_wikilinks(filepath: Path, *, + exclude_canvas: bool = True) -> list[str]: """Get ALL wikilinks from a md file. The links' order of appearance in the file IS preserved in the output. @@ -87,6 +90,8 @@ def get_wikilinks(filepath: Path) -> list[str]: Args: filepath (pathlib Path): Path object representing the file from which info will be extracted. + exclude_canvas (bool): Defaults to True.
Exclude canvas files from + the list of wikilinks. Returns: list of strings @@ -94,7 +99,8 @@ def get_wikilinks(filepath: Path) -> list[str]: src_txt = get_source_text_from_md_file(filepath, remove_code=True) wikilinks = _get_all_wikilinks_from_source_text( - src_txt, remove_aliases=True) + src_txt, remove_aliases=True, + exclude_canvas=exclude_canvas) return wikilinks @@ -122,7 +128,8 @@ def get_embedded_files(filepath: Path) -> list[str]: return files -def get_unique_wikilinks(filepath: Path) -> list[str]: +def get_unique_wikilinks(filepath: Path, *, + exclude_canvas: bool = True) -> list[str]: """Get UNIQUE wikilinks from a md file. The links' order of appearance in the file IS preserved in the output. @@ -135,13 +142,17 @@ def get_unique_wikilinks(filepath: Path) -> list[str]: Args: filepath (pathlib Path): Path object representing the file from which info will be extracted. + exclude_canvas (bool): Defaults to True. Exclude canvas files from + the list of wikilinks. Returns: list of strings """ src_txt = get_source_text_from_md_file(filepath, remove_code=True) - wikilinks = _get_unique_wikilinks_from_source_text(src_txt, remove_aliases=True) + wikilinks = _get_unique_wikilinks_from_source_text( + src_txt, remove_aliases=True, + exclude_canvas=exclude_canvas) return wikilinks @@ -296,11 +307,13 @@ def get_source_text_from_html(html: str, *, remove_code: bool = False, remove_math: bool = False) -> str: """html (without front matter) -> ASCII plaintext""" + soup = BeautifulSoup(html, 'lxml') if remove_code: - html = _remove_code(html) + soup = _remove_code_via_soup(soup) if remove_math: - html = _remove_latex(html) - return _get_plaintext_from_html(html) + soup = _remove_latex_via_soup(soup) + new_str = str(soup) + return _get_plaintext_from_html(new_str) def get_source_text_from_md_file(filepath: Path, *, @@ -323,20 +336,33 @@ def get_readable_text_from_md_file(filepath: Path, *, # strip out front matter (if any): html = _get_html_from_md_file( filepath) + html 
= _get_readable_text_from_html( + html, tags=tags) + return html + + +def _get_readable_text_from_html(html: str, *, + tags: list[str] = None) -> str: + # -str or regex- # wikilinks and md links as text: html = _replace_md_links_with_their_text(html) html = _replace_wikilinks_with_their_text(html) html = _remove_embedded_file_links_from_text(html) + + # -bs4- # remove code and remove major formatting on text: - html = _remove_code(html) - html = _remove_latex(html) - html = _remove_del_text(html) + soup = BeautifulSoup(html, 'lxml') + soup = _remove_code_via_soup(soup) + soup = _remove_latex_via_soup(soup) + soup = _remove_del_text_via_soup(soup) + new_str = str(soup) + # -BLEACH- if tags is not None: - html = _remove_main_formatting(html, tags=tags) + new_str = _remove_main_formatting(new_str, tags=tags) else: # defaults - html = _remove_main_formatting(html) + new_str = _remove_main_formatting(new_str) - return _get_plaintext_from_html(html) + return _get_plaintext_from_html(new_str) def _get_all_wikilinks_and_embedded_files(src_txt: str) -> list[str]: @@ -355,7 +381,8 @@ def _remove_aliases_from_wikilink_regex_matches(link_matches_list: list[str]) -> def _get_all_wikilinks_from_source_text(src_txt: str, *, - remove_aliases: bool = True) -> list[str]: + remove_aliases: bool = True, + exclude_canvas: bool = True) -> list[str]: matches_list = _get_all_wikilinks_and_embedded_files(src_txt) link_matches_list = [g[1] for g in matches_list if g[0] == ''] @@ -367,6 +394,9 @@ def _get_all_wikilinks_from_source_text(src_txt: str, *, # remove .md: link_matches_list = [name.removesuffix('.md') for name in link_matches_list] + if exclude_canvas: + link_matches_list = [n for n in link_matches_list + if not n.endswith('.canvas')] return link_matches_list @@ -388,9 +418,11 @@ def _get_all_latex_from_md_file(filepath: Path) -> list[str]: def _get_unique_wikilinks_from_source_text(src_txt: str, *, - remove_aliases: bool = True) -> list[str]: + remove_aliases: bool = True, + 
exclude_canvas: bool = True) -> list[str]: wikilinks = _get_all_wikilinks_from_source_text( - src_txt, remove_aliases=remove_aliases) + src_txt, remove_aliases=remove_aliases, + exclude_canvas=exclude_canvas) return list(dict.fromkeys(wikilinks)) diff --git a/obsidiantools/media_utils.py b/obsidiantools/media_utils.py new file mode 100644 index 0000000..63fa54f --- /dev/null +++ b/obsidiantools/media_utils.py @@ -0,0 +1,12 @@ +from pathlib import Path +import numpy as np +from ._constants import (IMG_EXT_SET, AUDIO_EXT_SET, + VIDEO_EXT_SET, PDF_EXT_SET) +from ._io import _get_valid_filepaths_by_ext_set + + +def _get_all_valid_media_file_relpaths(dirpath): + return (_get_valid_filepaths_by_ext_set( + dirpath, + exts=(IMG_EXT_SET | AUDIO_EXT_SET + | VIDEO_EXT_SET | PDF_EXT_SET))) diff --git a/setup.py b/setup.py index 3ffd601..d1a32f3 100644 --- a/setup.py +++ b/setup.py @@ -32,7 +32,7 @@ setuptools.setup( name="obsidiantools", - version="0.9.0", + version="0.10.0", author="Mark Farragher", description="Obsidian Tools - a Python interface for Obsidian.md vaults", long_description=LONG_DESCRIPTION, diff --git a/tests/test_api_setup.py b/tests/test_api_setup.py index e69fa9b..dfdb8a3 100644 --- a/tests/test_api_setup.py +++ b/tests/test_api_setup.py @@ -19,8 +19,8 @@ def test_vault_instantiation(tmp_path): # dirpath assert actual_vault.dirpath == tmp_path - # file_index - assert isinstance(actual_vault.file_index, dict) + # md_file_index + assert isinstance(actual_vault.md_file_index, dict) # graph and connections assert not actual_vault.graph diff --git a/tests/test_api_vault_stub_attr_setters.py b/tests/test_api_vault_stub_attr_setters.py index 4b29712..f941134 100644 --- a/tests/test_api_vault_stub_attr_setters.py +++ b/tests/test_api_vault_stub_attr_setters.py @@ -1,12 +1,12 @@ import pytest -import os from pathlib import Path +import networkx as nx from obsidiantools.api import Vault # NOTE: run the tests from the project dir. 
-WKD = Path(os.getcwd()) +WKD = Path().cwd() @pytest.fixture @@ -14,15 +14,88 @@ def actual_connected_vault(): return Vault(WKD / 'tests/vault-stub').connect().gather() -def test_attr_setters(actual_connected_vault): - actual_connected_vault.file_index = {} - assert actual_connected_vault.file_index == {} +def test_attr_setters_main_setup(actual_connected_vault): + actual_connected_vault.md_file_index = {} + assert actual_connected_vault.md_file_index == {} actual_connected_vault.canvas_file_index = {'New.canvas': ''} assert actual_connected_vault.canvas_file_index == {'New.canvas': ''} + +def test_attr_setters_md_connect_related(actual_connected_vault): + actual_connected_vault.attachments = True + assert actual_connected_vault.attachments + + actual_connected_vault.is_connected = True + assert actual_connected_vault.is_connected + + actual_connected_vault.is_gathered = True + assert actual_connected_vault.is_gathered + + actual_connected_vault.front_matter_index = {} + assert actual_connected_vault.front_matter_index == {} + + actual_connected_vault.backlinks_index = {} + assert actual_connected_vault.backlinks_index == {} + + actual_connected_vault.wikilinks_index = {} + assert actual_connected_vault.wikilinks_index == {} + + actual_connected_vault.unique_wikilinks_index = {} + assert actual_connected_vault.unique_wikilinks_index == {} + + actual_connected_vault.embedded_files_index = {} + assert actual_connected_vault.embedded_files_index == {} + + actual_connected_vault.math_index = {} + assert actual_connected_vault.math_index == {} + + actual_connected_vault.md_links_index = {} + assert actual_connected_vault.md_links_index == {} + + actual_connected_vault.unique_md_links_index = {} + assert actual_connected_vault.unique_md_links_index == {} + + actual_connected_vault.tags_index = {} + assert actual_connected_vault.tags_index == {} + + actual_connected_vault.nonexistent_notes = [] + assert actual_connected_vault.nonexistent_notes == [] + + 
actual_connected_vault.isolated_notes = [] + assert actual_connected_vault.isolated_notes == [] + + actual_connected_vault.media_file_index = {} + assert actual_connected_vault.media_file_index == {} + + actual_connected_vault.nonexistent_media_files = [] + assert actual_connected_vault.nonexistent_media_files == [] + + actual_connected_vault.isolated_media_files = [] + assert actual_connected_vault.isolated_media_files == [] + + actual_connected_vault.nonexistent_canvas_files = [] + assert actual_connected_vault.nonexistent_canvas_files == [] + + actual_connected_vault.isolated_canvas_files = [] + assert actual_connected_vault.isolated_canvas_files == [] + + # check that graph is set AND recognised as empty: + actual_connected_vault.graph = nx.MultiDiGraph() + assert nx.is_empty(actual_connected_vault.graph) + + +def test_attr_setters_md_gather_related(actual_connected_vault): actual_connected_vault.source_text_index = {'Isolated note': '`new text`'} assert actual_connected_vault.source_text_index == {'Isolated note': '`new text`'} actual_connected_vault.readable_text_index = {'Isolated note': 'Test'} assert actual_connected_vault.readable_text_index == {'Isolated note': 'Test'} + + +def test_attr_setters_canvas_connect_related(actual_connected_vault): + actual_connected_vault.canvas_content_index = {} + assert actual_connected_vault.canvas_content_index == {} + + actual_connected_vault.canvas_graph_detail_index = {} + assert actual_connected_vault.canvas_graph_detail_index == {} diff --git a/tests/test_api_vault_stub_connect_attachments_true.py b/tests/test_api_vault_stub_connect_attachments_true.py new file mode 100644 index 0000000..9d57b72 --- /dev/null +++ b/tests/test_api_vault_stub_connect_attachments_true.py @@ -0,0 +1,350 @@ +import pytest +import numpy as np +import pandas as pd +from pathlib import Path +from pandas.testing import assert_series_equal + + +from obsidiantools.api import Vault + +# NOTE: run the tests from the project dir. 
+WKD = Path().cwd() + + +@pytest.fixture +def actual_connected_vault(): + return (Vault(WKD / 'tests/vault-stub') + .connect(attachments=True)) + + +@pytest.fixture +def expected_note_metadata_dict(): + return { + 'rel_filepath': {'Sussudio': Path('Sussudio.md'), + 'Isolated note': Path('Isolated note.md'), + 'Brevissimus moenia': Path('lipsum/Brevissimus moenia.md'), + 'Ne fuit': Path('lipsum/Ne fuit.md'), + 'Alimenta': Path('lipsum/Alimenta.md'), + 'Vulnera ubera': Path('lipsum/Vulnera ubera.md'), + 'lipsum/Isolated note': Path('lipsum/Isolated note.md'), + 'Causam mihi': Path('lipsum/Causam mihi.md'), + 'American Psycho (film)': np.NaN, + 'Tarpeia': np.NaN, + 'Caelum': np.NaN, + 'Vita': np.NaN, + 'Aras Teucras': np.NaN, + 'Manus': np.NaN, + 'Bacchus': np.NaN, + 'Amor': np.NaN, + 'Virtus': np.NaN, + 'Tydides': np.NaN, + 'Dives': np.NaN, + 'Aetna': np.NaN}, + # abs_filepath would be here + 'note_exists': {'Sussudio': True, + 'Isolated note': True, + 'Brevissimus moenia': True, + 'Ne fuit': True, + 'Alimenta': True, + 'Vulnera ubera': True, + 'lipsum/Isolated note': True, + 'Causam mihi': True, + 'American Psycho (film)': False, + 'Tarpeia': False, + 'Caelum': False, + 'Vita': False, + 'Aras Teucras': False, + 'Manus': False, + 'Bacchus': False, + 'Amor': False, + 'Virtus': False, + 'Tydides': False, + 'Dives': False, + 'Aetna': False}, + 'n_backlinks': {'Sussudio': 0, + 'Isolated note': 0, + 'Brevissimus moenia': 1, + 'Ne fuit': 2, + 'Alimenta': 0, + 'Vulnera ubera': 0, + 'lipsum/Isolated note': 0, + 'Causam mihi': 1, + 'American Psycho (film)': 1, + 'Tarpeia': 3, + 'Caelum': 3, + 'Vita': 3, + 'Aras Teucras': 1, + 'Manus': 3, + 'Bacchus': 5, + 'Amor': 2, + 'Virtus': 1, + 'Tydides': 1, + 'Dives': 1, + 'Aetna': 1}, + 'n_wikilinks': {'Sussudio': 1.0, + 'Isolated note': 0.0, + 'Brevissimus moenia': 3.0, + 'Ne fuit': 6.0, + 'Alimenta': 12.0, + 'Vulnera ubera': 3.0, + 'lipsum/Isolated note': 0.0, + 'Causam mihi': 4.0, + 'American Psycho (film)': np.NaN, + 'Tarpeia': 
np.NaN, + 'Caelum': np.NaN, + 'Vita': np.NaN, + 'Aras Teucras': np.NaN, + 'Manus': np.NaN, + 'Bacchus': np.NaN, + 'Amor': np.NaN, + 'Virtus': np.NaN, + 'Tydides': np.NaN, + 'Dives': np.NaN, + 'Aetna': np.NaN}, + 'n_tags': {'Sussudio': 5.0, + 'Isolated note': 0.0, + 'Brevissimus moenia': 0.0, + 'Ne fuit': 0.0, + 'Alimenta': 0.0, + 'Vulnera ubera': 0.0, + 'lipsum/Isolated note': 0.0, + 'Causam mihi': 0.0, + 'American Psycho (film)': np.NaN, + 'Tarpeia': np.NaN, + 'Caelum': np.NaN, + 'Vita': np.NaN, + 'Aras Teucras': np.NaN, + 'Manus': np.NaN, + 'Bacchus': np.NaN, + 'Amor': np.NaN, + 'Virtus': np.NaN, + 'Tydides': np.NaN, + 'Dives': np.NaN, + 'Aetna': np.NaN}, + 'n_embedded_files': {'Isolated note': 0.0, + 'Sussudio': 2.0, + 'Brevissimus moenia': 0.0, + 'Ne fuit': 0.0, + 'Alimenta': 0.0, + 'Vulnera ubera': 0.0, + 'lipsum/Isolated note': 0.0, + 'Causam mihi': 0.0, + 'American Psycho (film)': np.NaN, + 'Tarpeia': np.NaN, + 'Caelum': np.NaN, + 'Vita': np.NaN, + 'Aras Teucras': np.NaN, + 'Manus': np.NaN, + 'Bacchus': np.NaN, + 'Amor': np.NaN, + 'Virtus': np.NaN, + 'Tydides': np.NaN, + 'Dives': np.NaN, + 'Aetna': np.NaN} + } + + +@pytest.fixture +def expected_media_file_metadata_dict(): + return { + 'rel_filepath': {'1999.flac': np.NaN, + 'Sussudio.mp3': np.NaN}, + # abs_filepath would be here + 'file_exists': {'1999.flac': False, + 'Sussudio.mp3': False}, + 'n_backlinks': {'1999.flac': 1, + 'Sussudio.mp3': 1}, + } + + +@pytest.fixture +def expected_embedded_files_index(): + return {'Isolated note': [], + 'lipsum/Isolated note': [], + 'Sussudio': ['Sussudio.mp3', '1999.flac'], + 'Brevissimus moenia': [], + 'Ne fuit': [], + 'Alimenta': [], + 'Vulnera ubera': [], + 'Causam mihi': []} + + +@pytest.fixture +def actual_note_metadata_df(actual_connected_vault): + return actual_connected_vault.get_note_metadata() + + +@pytest.fixture +def actual_media_file_metadata_df(actual_connected_vault): + return actual_connected_vault.get_media_file_metadata() + + +@pytest.fixture +def 
actual_canvas_file_metadata_df(actual_connected_vault): + return actual_connected_vault.get_canvas_file_metadata() + + +def test_get_metadata_cols(actual_note_metadata_df): + assert isinstance(actual_note_metadata_df, pd.DataFrame) + + expected_cols = ['rel_filepath', 'abs_filepath', + 'note_exists', + 'n_backlinks', 'n_wikilinks', + 'n_tags', + 'n_embedded_files', + 'modified_time'] + assert actual_note_metadata_df.columns.tolist() == expected_cols + + +def test_get_metadata_dtypes(actual_note_metadata_df): + assert actual_note_metadata_df['rel_filepath'].dtype == 'object' + assert actual_note_metadata_df['abs_filepath'].dtype == 'object' + assert actual_note_metadata_df['note_exists'].dtype == 'bool' + assert actual_note_metadata_df['n_backlinks'].dtype == 'int' + assert actual_note_metadata_df['n_wikilinks'].dtype == 'float' + assert actual_note_metadata_df['n_tags'].dtype == 'float' + assert actual_note_metadata_df['n_embedded_files'].dtype == 'float' + assert actual_note_metadata_df['modified_time'].dtype == 'datetime64[ns]' + + +def test_get_metadata_backlinks(actual_note_metadata_df, + expected_note_metadata_dict): + TEST_COL = 'n_backlinks' + + actual_series = actual_note_metadata_df[TEST_COL] + expected_series = (pd.Series(expected_note_metadata_dict.get(TEST_COL), + name=TEST_COL) + .rename_axis('note')) + assert_series_equal(actual_series, expected_series, + check_like=True) + + +def test_backlink_and_wikilink_totals_not_equal_for_test_vault(actual_note_metadata_df): + # every wikilink is another note's backlink + # INEQUALITY is expected when canvas files are INCLUDED in wikilinks list + # for this vault + assert (actual_note_metadata_df['n_backlinks'].sum() + != actual_note_metadata_df['n_wikilinks'].sum()) + + +def test_backlink_counts(actual_connected_vault): + expected_bl_count_subset = { + 'Sussudio': {}, + 'Alimenta': {}, + 'Tarpeia': {'Brevissimus moenia': 1, + 'Alimenta': 1, + 'Vulnera ubera': 1}, + 'Ne fuit': {'Alimenta': 1, + 'Causam mihi': 
1}, + 'Bacchus': {'Ne fuit': 1, + 'Alimenta': 4}, + '1999.flac': {'Sussudio': 1}, + 'Sussudio.mp3': {'Sussudio': 1} + } + + for k in list(expected_bl_count_subset.keys()): + assert (actual_connected_vault.get_backlink_counts(k) + == expected_bl_count_subset.get(k)) + + with pytest.raises(ValueError): + actual_connected_vault.get_backlink_counts("Note that isn't in vault at all") + + +def test_isolated_notes(actual_connected_vault): + expected_isol_notes = ['Isolated note', 'lipsum/Isolated note'] + + assert isinstance(actual_connected_vault.isolated_notes, list) + + assert (set(actual_connected_vault.isolated_notes) + == set(expected_isol_notes)) + + # isolated notes can't have backlinks + for n in actual_connected_vault.isolated_notes: + assert actual_connected_vault.get_backlink_counts(n) == {} + # isolated notes can't have wikilinks + for n in actual_connected_vault.isolated_notes: + assert actual_connected_vault.get_wikilinks(n) == [] + + +def test_get_canvas_file_dicts_tuple(actual_connected_vault): + # (linked_files_by_short_path, + # non_linked_files_by_short_path, + # nonexistent_files_by_short_path) + actual_tuple = (actual_connected_vault. + _get_canvas_file_dicts_tuple()) + expected_tuple = ( + {'Crazy wall 2.canvas': Path('Crazy wall 2.canvas')}, + {'Crazy wall.canvas': Path('Crazy wall.canvas')}, + {}) + assert actual_tuple == expected_tuple + + +def test_get_media_file_dicts_tuple(actual_connected_vault): + # (embedded_files_by_short_path, + # non_embedded_files_by_short_path, + # nonexistent_files_by_short_path) + actual_tuple = (actual_connected_vault. 
+ _get_media_file_dicts_tuple()) + expected_tuple = ( + {}, + {}, + {'Sussudio.mp3': np.NaN, '1999.flac': np.NaN}) + assert actual_tuple == expected_tuple + + +def test_nonexistent_canvas_files(actual_connected_vault, + actual_canvas_file_metadata_df): + expected_non_e_files = [] + + assert isinstance(actual_connected_vault.nonexistent_canvas_files, list) + + assert (set(actual_connected_vault.nonexistent_canvas_files) + == set(expected_non_e_files)) + assert (set(actual_canvas_file_metadata_df.loc[~actual_canvas_file_metadata_df['file_exists'], :] + .index.tolist()) + == set(expected_non_e_files)) + + +def test_nonexistent_media_files(actual_connected_vault, actual_media_file_metadata_df): + expected_non_e_files = ['1999.flac', 'Sussudio.mp3'] + + assert isinstance(actual_connected_vault.nonexistent_media_files, list) + + assert (set(actual_connected_vault.nonexistent_media_files) + == set(expected_non_e_files)) + assert (set(actual_media_file_metadata_df.loc[~actual_media_file_metadata_df['file_exists'], :] + .index.tolist()) + == set(expected_non_e_files)) + + +def test_isolated_canvas_files(actual_connected_vault): + expected_isol_files = ['Crazy wall.canvas'] + + assert isinstance(actual_connected_vault.isolated_canvas_files, list) + + assert (set(actual_connected_vault.isolated_canvas_files) + == set(expected_isol_files)) + + +def test_n_backlinks_not_null_in_canvas_file_metadata(actual_connected_vault): + df_canvas = actual_connected_vault.get_canvas_file_metadata() + assert df_canvas['n_backlinks'].isna().mean() == 0 + + +def test_all_file_metadata_df(actual_connected_vault): + actual_all_df = actual_connected_vault.get_all_file_metadata() + + actual_notes_df = actual_connected_vault.get_note_metadata() + actual_media_df = actual_connected_vault.get_media_file_metadata() + actual_canvas_df = actual_connected_vault.get_canvas_file_metadata() + + # check all dataframes concatenated successfully: + assert len(actual_all_df) == (len(actual_notes_df) + + 
len(actual_media_df) + + len(actual_canvas_df)) + + # check that media files are included in graph under attachments=True, + # with equality involving graph edges: + assert (actual_all_df['n_backlinks'].sum() + == (actual_all_df['n_wikilinks'].sum() + + actual_all_df['n_embedded_files'].sum())) diff --git a/tests/test_api_vault_stub_connect_canvas.py b/tests/test_api_vault_stub_connect_defaults_canvas.py similarity index 91% rename from tests/test_api_vault_stub_connect_canvas.py rename to tests/test_api_vault_stub_connect_defaults_canvas.py index 590449f..cb6310a 100644 --- a/tests/test_api_vault_stub_connect_canvas.py +++ b/tests/test_api_vault_stub_connect_defaults_canvas.py @@ -1,5 +1,4 @@ import pytest -import os import networkx as nx from pathlib import Path @@ -7,7 +6,7 @@ from obsidiantools.api import Vault # NOTE: run the tests from the project dir. -WKD = Path(os.getcwd()) +WKD = Path().cwd() @pytest.fixture @@ -93,5 +92,11 @@ def test_canvas_graph_detail_index_graph_other_attributes(actual_connected_vault actual_non_blank_edge_labels = { pair: label for pair, label in edge_labels.items() if label != ''} - expected_non_blank_edge_labels = {('d3f112f83760095a', 'c168506f5b075d91'): 'inspires?'} + expected_non_blank_edge_labels = { + ('d3f112f83760095a', 'c168506f5b075d91'): 'inspires?'} assert actual_non_blank_edge_labels == expected_non_blank_edge_labels + + +def test_n_backlinks_null_in_canvas_file_metadata(actual_connected_vault): + df_canvas = actual_connected_vault.get_canvas_file_metadata() + assert df_canvas['n_backlinks'].isna().mean() == 1 diff --git a/tests/test_api_vault_stub_connect_md.py b/tests/test_api_vault_stub_connect_defaults_md.py similarity index 90% rename from tests/test_api_vault_stub_connect_md.py rename to tests/test_api_vault_stub_connect_defaults_md.py index 334ae7e..031910e 100644 --- a/tests/test_api_vault_stub_connect_md.py +++ b/tests/test_api_vault_stub_connect_defaults_md.py @@ -1,15 +1,16 @@ import pytest import numpy as 
np import pandas as pd -import os from pathlib import Path -from pandas.testing import assert_series_equal +from pandas.testing import (assert_series_equal, + assert_frame_equal) from obsidiantools.api import Vault +from obsidiantools._constants import METADATA_DF_COLS_GENERIC_TYPE # NOTE: run the tests from the project dir. -WKD = Path(os.getcwd()) +WKD = Path().cwd() @pytest.fixture @@ -223,6 +224,11 @@ def actual_connected_vault(): return Vault(WKD / 'tests/vault-stub').connect() +@pytest.fixture +def actual_connected_vault_md_files_only(): + return Vault(WKD / 'tests/vault-stub/lipsum').connect() + + @pytest.fixture def actual_metadata_df(actual_connected_vault): return actual_connected_vault.get_note_metadata() @@ -313,6 +319,8 @@ def test_get_metadata_tags(actual_metadata_df, def test_backlink_and_wikilink_totals_equal(actual_metadata_df): # every wikilink is another note's backlink + # equality is expected when canvas files are excluded from wikilinks list + # for ANY VAULT under the defaults assert (actual_metadata_df['n_backlinks'].sum() == actual_metadata_df['n_wikilinks'].sum()) @@ -429,9 +437,9 @@ def test_wikilink_individual_notes(actual_connected_vault): actual_connected_vault.get_wikilinks('Tarpeia') # check that every existing note (file) has wikilink info - assert len(actual_wl_ix) == len(actual_connected_vault.file_index) + assert len(actual_wl_ix) == len(actual_connected_vault.md_file_index) for k in list(actual_wl_ix.keys()): - assert isinstance(actual_connected_vault.file_index.get(k), + assert isinstance(actual_connected_vault.md_file_index.get(k), Path) @@ -518,7 +526,7 @@ def test_embedded_files_sussudio(actual_connected_vault): def test_nodes_gte_files(actual_connected_vault): - act_f_len = len(actual_connected_vault.file_index) + act_f_len = len(actual_connected_vault.md_file_index) act_n_len = len(actual_connected_vault.wikilinks_index) assert act_n_len >= act_f_len @@ -589,3 +597,50 @@ def 
test_front_matter_not_existing(actual_connected_vault): def test_embedded_notes_not_existing(actual_connected_vault): with pytest.raises(ValueError): actual_connected_vault.get_embedded_files('Tarpeia') + + +def test_media_file_metadata_df_empty(actual_connected_vault_md_files_only): + # use the lipsum dir as the 'vault' dir (md only) + df_media = (actual_connected_vault_md_files_only + .get_media_file_metadata()) + + assert len(df_media) == 0 + + expected_cols = METADATA_DF_COLS_GENERIC_TYPE + actual_cols = df_media.columns.tolist() + assert actual_cols == expected_cols + + +def test_canvas_file_metadata_df_empty(actual_connected_vault_md_files_only): + # use the lipsum dir as the 'vault' dir (md only) + df_media = (actual_connected_vault_md_files_only + .get_canvas_file_metadata()) + + assert len(df_media) == 0 + + expected_cols = METADATA_DF_COLS_GENERIC_TYPE + actual_cols = df_media.columns.tolist() + assert actual_cols == expected_cols + + +def test_all_file_metadata_df(actual_connected_vault): + with pytest.warns(UserWarning): + actual_all_df = actual_connected_vault.get_all_file_metadata() + + actual_note_df = actual_connected_vault.get_note_metadata() + + # check that notes metadata was only used: + assert_frame_equal( + actual_all_df.drop(columns=['graph_category']), + actual_note_df.rename(columns={'note_exists': 'file_exists'})) + + # check that only notes are used for backlinks: + assert (actual_all_df['n_backlinks'].sum() + == (actual_all_df['n_wikilinks'].sum())) + + +def test_internal_canvas_backlink_counts_func_errors( + actual_connected_vault): + with pytest.raises(AttributeError): + (actual_connected_vault. 
+ _get_backlink_counts_for_canvas_files_only()) diff --git a/tests/test_api_vault_stub_gather.py b/tests/test_api_vault_stub_gather.py index 6bf1edb..7adda27 100644 --- a/tests/test_api_vault_stub_gather.py +++ b/tests/test_api_vault_stub_gather.py @@ -1,12 +1,11 @@ import pytest -import os from pathlib import Path from obsidiantools.api import Vault # NOTE: run the tests from the project dir. -WKD = Path(os.getcwd()) +WKD = Path().cwd() @pytest.fixture @@ -49,7 +48,7 @@ def test_source_text_existing_file(actual_gathered_vault_defaults): actual_in_text = (actual_gathered_vault_defaults .get_source_text('Isolated note')) expected_start = '# Isolated note' - expected_end = 'an isolated note.\n' + expected_end = 'an isolated note ~~an orphan~~.\n' assert actual_in_text.startswith(expected_start) assert actual_in_text.endswith(expected_end) @@ -66,7 +65,7 @@ def test_readable_text_existing_file(actual_gathered_vault_defaults): def test_isolated_note_md_text(actual_gathered_vault_defaults): expected_text = r"""# Isolated note -This is an isolated note. +This is an isolated note ~~an orphan~~. 
""" assert actual_gathered_vault_defaults.is_gathered @@ -76,12 +75,37 @@ def test_isolated_note_md_text(actual_gathered_vault_defaults): def test_all_files_are_in_source_text_index(actual_gathered_vault_defaults): - file_keys = set(actual_gathered_vault_defaults.file_index.keys()) + file_keys = set(actual_gathered_vault_defaults.md_file_index.keys()) text_keys = set(actual_gathered_vault_defaults.source_text_index.keys()) assert file_keys == text_keys def test_all_files_are_in_readable_text_index(actual_gathered_vault_defaults): - file_keys = set(actual_gathered_vault_defaults.file_index.keys()) + file_keys = set(actual_gathered_vault_defaults.md_file_index.keys()) text_keys = set(actual_gathered_vault_defaults.readable_text_index.keys()) assert file_keys == text_keys + + +def test_sussudio_readable_text(actual_gathered_vault_defaults): + """Some nuances on how readable text is different vs source text: + - Code, LaTeX and embedded files are removed for readable text. + - Wikilinks, links & tags will better reflect how they look in md preview + in the readable text. + - Double spaces in source text become single spaces in readable text. + + Neither form of text will have front matter. + """ + expected_text = r"""# Sussudio + +Another word with absolutely no meaning 😄 + +This will be a note inside the vault dir. Others will be lipsum in a subdirectory. + +The song has been compared to the Prince's "1999" ( #y1982 ) <\- oh look, a tag! 
+ +More tags: \- #y_1982 \- #y-1982 \- #y1982/sep \- #y2000/party-over/oops/out-of-time + +However these shouldn't be recognised as tags: \- (#y1985 ) \- #1985 \- American Psycho (film)#Patrick Bateman \- #hash_char_not_tag +""" + actual_text = actual_gathered_vault_defaults.get_readable_text('Sussudio') + assert actual_text == expected_text diff --git a/tests/test_canvas_utils_fpath_funcs.py b/tests/test_canvas_utils_fpath_funcs.py index 0a2215b..a674f99 100644 --- a/tests/test_canvas_utils_fpath_funcs.py +++ b/tests/test_canvas_utils_fpath_funcs.py @@ -1,4 +1,4 @@ -import os +from pathlib import Path import pytest @@ -18,5 +18,5 @@ def test_get_canvas_relpaths_from_dir(tmp_path): assert isinstance(actual_relpaths, list) for p in actual_relpaths: - assert isinstance(p, os.PathLike) + assert isinstance(p, Path) assert p.suffix == 'canvas' diff --git a/tests/test_canvas_utils_vault_stub.py b/tests/test_canvas_utils_vault_stub.py index 746d01a..bb28b25 100644 --- a/tests/test_canvas_utils_vault_stub.py +++ b/tests/test_canvas_utils_vault_stub.py @@ -1,4 +1,3 @@ -import os import pytest from pathlib import Path @@ -7,7 +6,7 @@ # NOTE: run the tests from the project dir. -WKD = Path(os.getcwd()) +WKD = Path().cwd() @pytest.fixture diff --git a/tests/test_html_processing.py b/tests/test_html_processing.py new file mode 100644 index 0000000..4fec610 --- /dev/null +++ b/tests/test_html_processing.py @@ -0,0 +1,54 @@ +from pathlib import Path + +from obsidiantools.html_processing import (_remove_code, + _remove_del_text, + _remove_latex) +from obsidiantools.md_utils import _get_html_from_md_file + +# NOTE: run the tests from the project dir. +WKD = Path().cwd() + + +def test_remove_code(): + fpath = Path('.') / 'tests/general/wikilinks_exclude-code.md' + + actual_html = _get_html_from_md_file(fpath) + actual_proc_html = _remove_code(actual_html) + actual_html_string = str(actual_proc_html) + + expected_html_string = """
+<h1>code-avoid-wikilink</h1>
+<p>The snippets above are R code: they should not give a wikilink.</p>
""" + assert actual_html_string == expected_html_string + + +def test_remove_del_text(): + fpath = Path('.') / 'tests/general/readable-text_all-deleted.md' + + actual_html = _get_html_from_md_file(fpath) + actual_proc_html = _remove_del_text(actual_html) + actual_html_string = str(actual_proc_html) + + expected_html_string = """

""" + assert actual_html_string == expected_html_string + + +def test_remove_latex_in_note_with_highly_formatted_text(): + fpath = Path('.') / 'tests/general/latex.md' + + actual_html = _get_html_from_md_file(fpath) + actual_proc_html = _remove_latex(actual_html) + actual_html_string = str(actual_proc_html) + + expected_html_string = """
+<h1>Note with LaTeX</h1>
+<h2>GEE</h2>
+<p>Regression coefficients estimated through GEE are asymptotically normal:</p>
+<p>The underscore chars above need to be caught through MathJax - capture subscripts rather than emphasis in the markdown parsing.</p>
+<h2>GEE estimation</h2>
+<p>A few eqs more using deeper LaTeX functionality:</p>
+<p>Equations for GEE are solved for the regression parameters using:</p>
+<p>Taking the expectation of the equation system in ...</p>
""" + assert actual_html_string == expected_html_string diff --git a/tests/test_md_utils.py b/tests/test_md_utils.py index 2b1f418..b926080 100644 --- a/tests/test_md_utils.py +++ b/tests/test_md_utils.py @@ -237,10 +237,10 @@ def test_front_matter_parse_double_curly(): def test_hash_char_parsing_func(): # '\#' in md file keeps # but stops text from being a tag - in_str = "\#hash #tag" + in_str = r"\#hash #tag" out_str = _transform_md_file_string_for_tag_parsing(in_str) - expected_str = "hash #tag" + expected_str = r"hash #tag" assert out_str == expected_str diff --git a/tests/test_md_utils_fpath_funcs.py b/tests/test_md_utils_fpath_funcs.py index 688b441..9797b3d 100644 --- a/tests/test_md_utils_fpath_funcs.py +++ b/tests/test_md_utils_fpath_funcs.py @@ -1,4 +1,4 @@ -import os +from pathlib import Path import pytest from obsidiantools.md_utils import (_get_html_from_md_file, @@ -22,7 +22,7 @@ def test_get_md_relpaths_from_dir(tmp_path): assert isinstance(actual_relpaths, list) for p in actual_relpaths: - assert isinstance(p, os.PathLike) + assert isinstance(p, Path) assert p.suffix == 'md' diff --git a/tests/test_md_utils_vault_stub.py b/tests/test_md_utils_vault_stub.py index 7d971d4..0534d69 100644 --- a/tests/test_md_utils_vault_stub.py +++ b/tests/test_md_utils_vault_stub.py @@ -1,4 +1,3 @@ -import os import pytest from pathlib import Path @@ -7,7 +6,7 @@ # NOTE: run the tests from the project dir. -WKD = Path(os.getcwd()) +WKD = Path().cwd() @pytest.fixture diff --git a/tests/vault-stub/Isolated note.md b/tests/vault-stub/Isolated note.md index 11a0df2..8722aab 100644 --- a/tests/vault-stub/Isolated note.md +++ b/tests/vault-stub/Isolated note.md @@ -1,2 +1,2 @@ # Isolated note -This is an isolated note. +This is an isolated note ~~an orphan~~. 
diff --git a/tests/vault-stub/lipsum/Alimenta.md b/tests/vault-stub/lipsum/Alimenta.md index c759dae..ade3201 100755 --- a/tests/vault-stub/lipsum/Alimenta.md +++ b/tests/vault-stub/lipsum/Alimenta.md @@ -1,4 +1,5 @@ # Alimenta +Food for thought: [[Crazy wall 2.canvas]] ## Ulciscitur ingemuere