feature/truncate-sentences-in-transcript-str #153

isaacna · 2022-01-24T02:23:41Z

Link to Relevant Issue

This pull request resolves #149

Description of Changes

Changing Transcript's __str__ method to truncate sentences. I did this with regex which was the least hacky thing I could think of. Alternatively, I was thinking about doing something like creating a deepcopy of self in __str__ and replacing sentences with [...], but that seems more hacky. There's not a standard way to hide/truncate a single property of a class afaik.

repr() will still show the full sentences.

Testing

Running print(EXAMPLE_TRANSCRIPT)
Before:

Transcript(generator='JacksonGen -- Lib Version: 0.0.0', confidence=0.93325, session_datetime='2021-01-10T15:00:00', created_datetime='2022-01-24T02:05:41.708769', sentences=[Sentence(index=0, confidence=0.9, start_time=0.0, end_time=1.0, words=[Word(index=0, start_time=0.0, end_time=0.5, text='hello', annotations=None), Word(index=1, start_time=0.5, end_time=1.0, text='everyone', annotations=None)], text='Hello everyone.', speaker_index=0, speaker_name='Jackson Maxfield Brown', annotations=None), Sentence(index=1, confidence=0.95, start_time=1.0, end_time=2.0, words=[Word(index=0, start_time=1.0, end_time=1.5, text='hi', annotations=None), Word(index=1, start_time=1.5, end_time=2.0, text='all', annotations=None)], text='Hi all.', speaker_index=1, speaker_name='Isaac Na', annotations=None)], annotations=TranscriptAnnotations(sections=[SectionAnnotation(name='Call to Order', start_sentence_index=0, stop_sentence_index=2, generator='Jackson Maxfield Brown', description=None)]))

After

Transcript(generator='JacksonGen -- Lib Version: 0.0.0', confidence=0.93325, session_datetime='2021-01-10T15:00:00', created_datetime='2022-01-24T02:13:21.769205', sentences=[...], annotations=TranscriptAnnotations(sections=[SectionAnnotation(name='Call to Order', start_sentence_index=0, stop_sentence_index=2, generator='Jackson Maxfield Brown', description=None)]))

codecov · 2022-01-24T02:28:31Z

Codecov Report

Merging #153 (439bf1a) into main (310902d) will decrease coverage by 0.29%.
The diff coverage is 11.11%.

@@            Coverage Diff             @@
##             main     #153      +/-   ##
==========================================
- Coverage   94.82%   94.53%   -0.30%     
==========================================
  Files          50       50              
  Lines        2531     2543      +12     
==========================================
+ Hits         2400     2404       +4     
- Misses        131      139       +8

Impacted Files	Coverage Δ
cdp_backend/pipeline/transcript_model.py	`86.88% <11.11%> (-13.12%)`	⬇️
...kend/tests/sr_models/test_google_cloud_sr_model.py	`100.00% <0.00%> (ø)`
cdp_backend/sr_models/google_cloud_sr_model.py	`98.66% <0.00%> (+0.03%)`	⬆️
cdp_backend/pipeline/event_gather_pipeline.py	`85.67% <0.00%> (+0.03%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 310902d...439bf1a. Read the comment docs.

evamaxfield · 2022-01-24T02:34:50Z

I didn't even think of using a regex to strip. Cool!

Could you show what happens when you simply call the object? In ipython instead of using print just call it:

t = read_some_transcript()
t

calling t like this calls the __repr__ function. Idk if object standard is to use the __str__ for __repr__ if __repr__ is undefined

isaacna · 2022-01-24T02:55:26Z

This is a little tricky. In ipython, we get the non-truncated version so it must use repr if you just do t.

Looks like str uses repr if there's no implementation provided for the class. So to change repr to truncate the string, that would mean we'd have to rewrite the logic to convert to a string itself, since that's what repr does in the first place.

I'm not familiar with Jupyter, but how are you using the transcript objects? According to this post, __str__ is supposed to be user friendly and __repr__ is supposed to represent the object in more detail.

Based on this I think we can/should only use str when we want to view Transcript with truncated sentences

evamaxfield · 2022-01-24T18:42:46Z

I'm not familiar with Jupyter, but how are you using the transcript objects?

It's a pretty common operation to do the following in notebooks:

Where you load / compute some data and then just "print" out the loaded object at the end of the cell. I think this has become the norm in the jupyter world because if Python is running in an ipython client (jupyter) it will actually call to a __repr_html__ function first if there is one. I.e. pandas dataframes, dask clusters, dask arrays, etc. all have HTML reprs that are quite useful.

Anyway, its the norm in jupyter world to do that. I think there may be a way to do it too. Technically all that is being printed is just dataclass_json.to_dict() so I think we could do something like this:

string = "Transcript("
for k, v in self.to_dict().items():
    if k != "sentences":
        string += f"{k}={v}, "
    else:
        string += f"{k}=[...], "

# remove last comma and space
string = string[:-2]
# close paren
string += ")"

I am not the happiest with that but it probably works?

What may be interesting is to make it more pretty? like instead of just "[...]" we could say "[...] (n=1002)" for like "number of sentences" or something? Idk. now I am just tossing ideas out.

Happy to merge this in because it is an improvement from where we are currently but your call. Feel free to merge or keep tinkering.

isaacna · 2022-01-25T02:14:18Z

I think this has become the norm in the jupyter world because if Python is running in an ipython client (jupyter) it will actually call to a repr_html function first if there is one. I.e. pandas dataframes, dask clusters, dask arrays, etc. all have HTML reprs that are quite useful.

Anyway, its the norm in jupyter world to do that.

Gotcha, yeah I was mainly curious whether changing repr is something we should do, but since that's the standard in Jupyter I think it's fine to do that.

Technically all that is being printed is just dataclass_json.to_dict() so I think we could do something like this

Good idea! I did something similar except I used vars so we can preserve subclassing since to_dict was a pure json representation without class names. I also added (n=...) to the string generation

isaacna · 2022-01-25T02:15:39Z

It will now output the following when just calling the object (same content as before):

Transcript(generator='JacksonGen -- Lib Version: 0.0.0', confidence=0.93325, session_datetime='2021-01-10T15:00:00', created_datetime='2022-01-24T21:30:40.583302', sentences=[...] (n=2), annotations=TranscriptAnnotations(sections=[SectionAnnotation(name='Call to Order', start_sentence_index=0, stop_sentence_index=2, generator='Jackson Maxfield Brown', description=None)]))

evamaxfield

Wow. I have never even heard of the vars method. Very neat! Thanks!!

isaacna added 2 commits January 23, 2022 18:16

Truncate Transcript's sentences in tostring

78bd9b3

Add r in front of regex string

96e913d

isaacna requested a review from evamaxfield January 24, 2022 02:23

quiet mypy

7b56c23

Generate string from scratch in __repr__

439bf1a

evamaxfield approved these changes Jan 25, 2022

View reviewed changes

evamaxfield merged commit 4f8b6c7 into CouncilDataProject:main Jan 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feature/truncate-sentences-in-transcript-str #153

feature/truncate-sentences-in-transcript-str #153

isaacna commented Jan 24, 2022 •

edited

Loading

codecov bot commented Jan 24, 2022 •

edited

Loading

evamaxfield commented Jan 24, 2022

isaacna commented Jan 24, 2022 •

edited

Loading

evamaxfield commented Jan 24, 2022

isaacna commented Jan 25, 2022

isaacna commented Jan 25, 2022

evamaxfield left a comment

feature/truncate-sentences-in-transcript-str #153

feature/truncate-sentences-in-transcript-str #153

Conversation

isaacna commented Jan 24, 2022 • edited Loading

Link to Relevant Issue

Description of Changes

Testing

codecov bot commented Jan 24, 2022 • edited Loading

Codecov Report

evamaxfield commented Jan 24, 2022

isaacna commented Jan 24, 2022 • edited Loading

evamaxfield commented Jan 24, 2022

isaacna commented Jan 25, 2022

isaacna commented Jan 25, 2022

evamaxfield left a comment

Choose a reason for hiding this comment

isaacna commented Jan 24, 2022 •

edited

Loading

codecov bot commented Jan 24, 2022 •

edited

Loading

isaacna commented Jan 24, 2022 •

edited

Loading