Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature/truncate-sentences-in-transcript-str #153

Conversation

isaacna
Copy link
Collaborator

@isaacna isaacna commented Jan 24, 2022

Link to Relevant Issue

This pull request resolves #149

Description of Changes

Changing Transcript's __str__ method to truncate sentences. I did this with regex which was the least hacky thing I could think of. Alternatively, I was thinking about doing something like creating a deepcopy of self in __str__ and replacing sentences with [...], but that seems more hacky. There's not a standard way to hide/truncate a single property of a class afaik.

repr() will still show the full sentences.

Testing

Running print(EXAMPLE_TRANSCRIPT)
Before:

Transcript(generator='JacksonGen -- Lib Version: 0.0.0', confidence=0.93325, session_datetime='2021-01-10T15:00:00', created_datetime='2022-01-24T02:05:41.708769', sentences=[Sentence(index=0, confidence=0.9, start_time=0.0, end_time=1.0, words=[Word(index=0, start_time=0.0, end_time=0.5, text='hello', annotations=None), Word(index=1, start_time=0.5, end_time=1.0, text='everyone', annotations=None)], text='Hello everyone.', speaker_index=0, speaker_name='Jackson Maxfield Brown', annotations=None), Sentence(index=1, confidence=0.95, start_time=1.0, end_time=2.0, words=[Word(index=0, start_time=1.0, end_time=1.5, text='hi', annotations=None), Word(index=1, start_time=1.5, end_time=2.0, text='all', annotations=None)], text='Hi all.', speaker_index=1, speaker_name='Isaac Na', annotations=None)], annotations=TranscriptAnnotations(sections=[SectionAnnotation(name='Call to Order', start_sentence_index=0, stop_sentence_index=2, generator='Jackson Maxfield Brown', description=None)]))

After

Transcript(generator='JacksonGen -- Lib Version: 0.0.0', confidence=0.93325, session_datetime='2021-01-10T15:00:00', created_datetime='2022-01-24T02:13:21.769205', sentences=[...], annotations=TranscriptAnnotations(sections=[SectionAnnotation(name='Call to Order', start_sentence_index=0, stop_sentence_index=2, generator='Jackson Maxfield Brown', description=None)]))

@isaacna isaacna requested a review from evamaxfield January 24, 2022 02:23
@codecov
Copy link

codecov bot commented Jan 24, 2022

Codecov Report

Merging #153 (439bf1a) into main (310902d) will decrease coverage by 0.29%.
The diff coverage is 11.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #153      +/-   ##
==========================================
- Coverage   94.82%   94.53%   -0.30%     
==========================================
  Files          50       50              
  Lines        2531     2543      +12     
==========================================
+ Hits         2400     2404       +4     
- Misses        131      139       +8     
Impacted Files Coverage Δ
cdp_backend/pipeline/transcript_model.py 86.88% <11.11%> (-13.12%) ⬇️
...kend/tests/sr_models/test_google_cloud_sr_model.py 100.00% <0.00%> (ø)
cdp_backend/sr_models/google_cloud_sr_model.py 98.66% <0.00%> (+0.03%) ⬆️
cdp_backend/pipeline/event_gather_pipeline.py 85.67% <0.00%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 310902d...439bf1a. Read the comment docs.

@evamaxfield
Copy link
Member

I didn't even think of using a regex to strip. Cool!

Could you show what happens when you simply call the object? In ipython instead of using print just call it:

t = read_some_transcript()
t

calling t like this calls the __repr__ function. Idk if object standard is to use the __str__ for __repr__ if __repr__ is undefined

@isaacna
Copy link
Collaborator Author

isaacna commented Jan 24, 2022

This is a little tricky. In ipython, we get the non-truncated version so it must use repr if you just do t.

Looks like str uses repr if there's no implementation provided for the class. So to change repr to truncate the string, that would mean we'd have to rewrite the logic to convert to a string itself, since that's what repr does in the first place.

I'm not familiar with Jupyter, but how are you using the transcript objects? According to this post, __str__ is supposed to be user friendly and __repr__ is supposed to represent the object in more detail.

Based on this I think we can/should only use str when we want to view Transcript with truncated sentences

@evamaxfield
Copy link
Member

I'm not familiar with Jupyter, but how are you using the transcript objects?

It's a pretty common operation to do the following in notebooks:

image

Where you load / compute some data and then just "print" out the loaded object at the end of the cell. I think this has become the norm in the jupyter world because if Python is running in an ipython client (jupyter) it will actually call to a __repr_html__ function first if there is one. I.e. pandas dataframes, dask clusters, dask arrays, etc. all have HTML reprs that are quite useful.

Anyway, its the norm in jupyter world to do that. I think there may be a way to do it too. Technically all that is being printed is just dataclass_json.to_dict() so I think we could do something like this:

string = "Transcript("
for k, v in self.to_dict().items():
    if k != "sentences":
        string += f"{k}={v}, "
    else:
        string += f"{k}=[...], "

# remove last comma and space
string = string[:-2]
# close paren
string += ")"

I am not the happiest with that but it probably works?

What may be interesting is to make it more pretty? like instead of just "[...]" we could say "[...] (n=1002)" for like "number of sentences" or something? Idk. now I am just tossing ideas out.

Happy to merge this in because it is an improvement from where we are currently but your call. Feel free to merge or keep tinkering.

@isaacna
Copy link
Collaborator Author

isaacna commented Jan 25, 2022

I think this has become the norm in the jupyter world because if Python is running in an ipython client (jupyter) it will actually call to a repr_html function first if there is one. I.e. pandas dataframes, dask clusters, dask arrays, etc. all have HTML reprs that are quite useful.

Anyway, its the norm in jupyter world to do that.

Gotcha, yeah I was mainly curious whether changing repr is something we should do, but since that's the standard in Jupyter I think it's fine to do that.

Technically all that is being printed is just dataclass_json.to_dict() so I think we could do something like this

Good idea! I did something similar except I used vars so we can preserve subclassing since to_dict was a pure json representation without class names. I also added (n=...) to the string generation

@isaacna
Copy link
Collaborator Author

isaacna commented Jan 25, 2022

It will now output the following when just calling the object (same content as before):

Transcript(generator='JacksonGen -- Lib Version: 0.0.0', confidence=0.93325, session_datetime='2021-01-10T15:00:00', created_datetime='2022-01-24T21:30:40.583302', sentences=[...] (n=2), annotations=TranscriptAnnotations(sections=[SectionAnnotation(name='Call to Order', start_sentence_index=0, stop_sentence_index=2, generator='Jackson Maxfield Brown', description=None)]))

Copy link
Member

@evamaxfield evamaxfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow. I have never even heard of the vars method. Very neat! Thanks!!

@evamaxfield evamaxfield merged commit 4f8b6c7 into CouncilDataProject:main Jan 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change Transcript object repr / to string
2 participants