Export CSV - data quality issues #2271

Closed
mikewray opened this issue Jul 26, 2014 · 18 comments

@mikewray

Hi, I'm testing 0.12.2 and on my last field trip I noticed something odd in the exported CSV file, so I have done a bit of testing. It turns out that some fields in the export are reported as zero and do not reflect the correct data shown in the web view. To illustrate:

Here is the dataset for a specific small group of 5 students; I know all this data is correct i.e. reflects what has happened:

kasisi staff data report

When I export the CSV data for the entire date range, here is an extract of various fields of interest (I took it into Excel to make it look pretty):

kasisi staff export report

Analysis: when you compare the two you notice the total_exercises, videos watched and the underlying details are correct and match what is on the web page view. The errors are:

  1. total_hours reports zero - clearly wrong and does not agree to web view
  2. total_logins reports zero - clearly wrong and does not agree to web view
  3. Coach stats are on the report and shown as zero. There are two errors here. First, the coach should not be in this report at all, as I'm just looking at a particular group. Second, the numbers are wrong: there has been activity, and it is shown if you go into the coach section.
  4. Finally, if you want to capture activity done on the 26th, for example, you need to set the outer date of the export to the 27th. I think the date comparison in the code should be less than or equal to; at the moment it appears to be just less than the date.
@mikewray
Author

Hi there - just following up on this?

@aronasorman
Collaborator

Hi @mikewray, thanks for reporting this. Will assign for someone to look at.

@aronasorman aronasorman added this to the 0.12.2 milestone Jul 30, 2014
@cpauya
Contributor

cpauya commented Aug 6, 2014

Hi @mikewray

I've checked this on the release-0.12.0 release and used the python manage.py generaterealdata management command to, well, generate real data on my laptop. :)

Can you please elaborate on what you meant by "When I export the csv data for the entire date range"? The default is the date range for the previous month.

For items 1, 2 and 3: I'm assuming that you used the default of the previous month. This differs from the values displayed in the web browser because the latter doesn't apply any date filters. I changed the date range to 01-Jan-2014 up to 31-Dec-2014 on my local test data, and the CSV values and the web values are the same.

For item 4: I have checked the code and we are using the inclusive operators lte (less than or equal to) and gte (greater than or equal to) for date ranges. I've checked the login values in the export and so far they are already inclusive.
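One possible explanation for item 4, offered as an assumption rather than something confirmed in this thread: if the log timestamps are stored as datetimes and the end of the range is a plain date, then even an inclusive `lte` comparison treats the end date as midnight at the start of that day, so activity later that day falls outside the range. A minimal pure-Python sketch of that effect (the function names and the sample timestamp are hypothetical, not KA Lite code):

```python
from datetime import date, datetime, time, timedelta

def in_range_naive(ts, start, end):
    """Inclusive comparison, with the end date taken as midnight (00:00) of that day."""
    return datetime.combine(start, time.min) <= ts <= datetime.combine(end, time.min)

def in_range_whole_day(ts, start, end):
    """Treat the end date as covering the whole day: ts must be before end + 1 day."""
    upper = datetime.combine(end + timedelta(days=1), time.min)
    return datetime.combine(start, time.min) <= ts < upper

# Hypothetical activity logged at 2:30 pm on the 26th
activity = datetime(2014, 7, 26, 14, 30)
start, end = date(2014, 7, 1), date(2014, 7, 26)

print(in_range_naive(activity, start, end))                 # False
print(in_range_whole_day(activity, start, end))             # True
print(in_range_naive(activity, start, date(2014, 7, 27)))   # True
```

If this is what is happening, it would match the observation that the outer date has to be set one day later to capture the last day's activity, even though the operators themselves are inclusive.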

Can you please try to change the date range when you Export to csv and see if that validates?

cc: @aronasorman

@mikewray
Author

mikewray commented Aug 6, 2014

Hi there, thanks for your work on this. I just tested on 0.12.3 and a lot of the issues have disappeared, so something has changed. Please ignore the above and focus on the testing I've just done. I can now match most of the web page data with the CSV report.

One field seems odd: videos watched does not match, and the CSV reports fewer than the web page. I did a bit of testing: if I fully watch a video it appears in the CSV; if I don't, it does not. My theory was that the web page counts partially watched videos, but when I tested this (watched the first minute of a video) it did not appear on either the report or the web page.

I can also confirm that, for me, extracting activity for say the 7th August requires the outer date of the range to be the 8th August; maybe you can test this for yourself.

The other thing is total_reports_view: is it how many times the student looked at their progress report, or is it the coach? If the former, it does not seem to increment, at least in my test. Maybe you can check at your end.


fyi -> to answer your question, the summary stats you see on the web page are from inception, i.e. from the date when that instance of KA Lite was installed. Therefore, to run a report that produces the same data, the date range should cover the entire period from inception to present. You can set the dates when you run the report (on the calendar, in the top right and top left, there is an invisible button that can take you back as far as you want to go). This is what I was doing before; it was not working, but now it seems to be.

@mikewray
Author

mikewray commented Aug 6, 2014

Further to the above, I just did a bit more testing. I ran a report for the date range June 1st to June 30th. What is weird is the zero-value fields, which make no sense: you can see exercises and videos were done, and this must have taken some time. There is something not quite right here...

csv export example

@cpauya
Contributor

cpauya commented Aug 6, 2014

Thank you for the clarifications. Will trace these up.

@mikewray
Author

mikewray commented Aug 8, 2014

Hi there - I'm just following up on this. Is there any progress, or do you need my help with anything?

@mikewray
Author

Hello again - I'm just checking on this. As we've just cleared off many bugs and issues, this is now top of my list to sort out. I want to use these reports for detailed analysis, so hopefully someone can look at this. It is just a SQL statement that pulls the info, right?

@aronasorman
Collaborator

Hi Mike, unfortunately we've put this off for next week as there are some pressing issues with our nalanda deployment. We'll have someone working on this next week!

@mikewray
Author

Hi there - just chasing up on this one. Now that data sync is working, this issue is top of the list for me. If this report works correctly, it will allow great data analysis opportunities for M&E. Let me know how I can help. Hopefully it is a relatively simple fix, as the underlying data is there.

@mikewray
Author

Hi there, after a little break I'm just checking on my open tickets and see no progress on this. Anything I can do? tx!

@mikewray
Author

mikewray commented Oct 7, 2014

Hi there - just checking on this issue again. This is an important one for me; I would like to be able to trust these reports. Let me know if I can help in any way. Tx!

@dylanjbarth
Contributor

@mikewray see the discussion in #2466

This will be the eventual fix

dylanjbarth added a commit that referenced this issue Oct 20, 2014
Fixes #2271-Export CSV - data quality issues
@aronasorman
Collaborator

Reopening this due to this comment.

@aronasorman aronasorman reopened this Nov 25, 2014
@djallado
Contributor

djallado commented Dec 1, 2014

The diagram attached below illustrates how to handle filtering by date range.

Diagram:
alt text

This diagram illustrates how our CSV report should work.
The box is our whole set, and Q1, Q2, Q3, Q4, Q5 and Q6 are all subsets of the box.
For now we only have Q1 and Q2, which only select within the set.
If the filter extends outside the set, Q3 through Q6 should be looked into.
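Since the diagram is not reproduced here, the exact meaning of Q1 through Q6 is not known. As a rough sketch only, the ways a query date range can relate to the range covered by the data can be enumerated as below. The labels, function names and dates are hypothetical and are not claimed to match the diagram:

```python
from datetime import date

def classify(q_start, q_end, d_start, d_end):
    """Describe how query range [q_start, q_end] relates to data range [d_start, d_end]."""
    if q_end < d_start:
        return "entirely before"
    if q_start > d_end:
        return "entirely after"
    if q_start >= d_start and q_end <= d_end:
        return "within"
    if q_start < d_start and q_end > d_end:
        return "covers all"
    if q_start < d_start:
        return "overlaps start"
    return "overlaps end"

def clamp(q_start, q_end, d_start, d_end):
    """Return the part of the query range that intersects the data, or None if there is none."""
    lo, hi = max(q_start, d_start), min(q_end, d_end)
    return (lo, hi) if lo <= hi else None

# Hypothetical data set covering June 2014
data_start, data_end = date(2014, 6, 1), date(2014, 6, 30)

print(classify(date(2014, 6, 5), date(2014, 6, 10), data_start, data_end))   # within
print(classify(date(2014, 5, 20), date(2014, 6, 10), data_start, data_end))  # overlaps start
print(clamp(date(2014, 5, 20), date(2014, 6, 10), data_start, data_end))
```

Clamping the query to the data range first means a filter that extends outside the set can reuse the same logic as one that lies within it.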

@cpauya
Contributor

cpauya commented Dec 2, 2014

Hi @mikewray

We have fixed the VideoLog totals in @djallado's PR #2742. The same PR also has a potential fix for the total_logins and total_hours columns, but we found that these data are summed up into a monthly record.

So when you check out the PR, please test with a monthly date range first to check whether the login hours are correct. Then you can narrow your filters to validate the issue.

Here are our issues so far:

  1. The total_hours is based on the synced model UserLogSummary, with records summed up on a monthly basis. So narrowing your filters within the month will yield incorrect user log data.
  2. The details are in the UserLog model, but this is not synced to the central server. Thus we cannot use it as the basis for your report.

As recommended by @djallado in his PR #2742, can we narrow the generation of UserLogSummary to daily instead of monthly, so that we can filter total_hours per day?
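To illustrate why monthly summaries cannot answer a narrower query, here is a minimal sketch in plain Python. It assumes, purely for illustration, that each monthly record is keyed by the first day of its month; how UserLogSummary actually stores its period is not stated in this thread, and all names and numbers below are hypothetical:

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw sessions: (day, hours logged)
sessions = [
    (date(2014, 6, 3), 1.5),
    (date(2014, 6, 17), 2.0),
    (date(2014, 7, 2), 0.5),
]

def summarize(sessions, key):
    """Sum hours into buckets chosen by key(day)."""
    totals = defaultdict(float)
    for day, hours in sessions:
        totals[key(day)] += hours
    return dict(totals)

# Assumption: a monthly bucket is dated the 1st of its month
monthly = summarize(sessions, lambda d: d.replace(day=1))
daily = summarize(sessions, lambda d: d)

def total_hours(summary, start, end):
    """Sum the buckets whose date falls in the inclusive range [start, end]."""
    return sum(h for bucket, h in summary.items() if start <= bucket <= end)

start, end = date(2014, 6, 10), date(2014, 6, 30)
print(total_hours(monthly, start, end))  # 0: the June bucket (dated 1 June) is outside the range
print(total_hours(daily, start, end))    # 2.0: only the 17 June session
```

With monthly buckets the filter either misses the whole month or includes all of it; daily buckets give the per-day resolution the report filter needs.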

/cc: @aronasorman @jamalex

@mikewray
Author

mikewray commented Dec 2, 2014

Hi there! Can you confirm that total_hours will work in all circumstances, including date ranges that cross months? That is, an inclusive report for the whole period of existence should equal the sum of two or more reports covering the same period.

I've been doing all this testing through the central server; the data derives from 3 RPis in the field in Zambia. Can you put this change into staging or live so I can test it there? I don't have the test data on my laptop, and I think it will take me a while to find an old data set that will work for this test. It is much easier to work with data I know, and I can run the same tests I've been doing. Thanks for your work on this.
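The additivity property described above can be written down as a small check. This is a sketch under the assumption that hours are available per day; the data and the helper name are made up for illustration:

```python
from datetime import date

# Hypothetical per-day hours spanning a month boundary
daily_hours = {
    date(2014, 6, 29): 1.0,
    date(2014, 6, 30): 0.5,
    date(2014, 7, 1): 2.0,
    date(2014, 7, 2): 0.25,
}

def total_hours(start, end):
    """Sum hours for days in the inclusive range [start, end]."""
    return sum(h for d, h in daily_hours.items() if start <= d <= end)

whole = total_hours(date(2014, 6, 1), date(2014, 7, 31))
june = total_hours(date(2014, 6, 1), date(2014, 6, 30))
july = total_hours(date(2014, 7, 1), date(2014, 7, 31))

print(whole, june + july)  # 3.75 3.75

# With inclusive ends, sub-reports must not share a boundary day,
# otherwise that day is counted twice.
overlapping = june + total_hours(date(2014, 6, 30), date(2014, 7, 31))
print(overlapping)  # 4.25
```

Running the same comparison against exports from the central server would be one way to validate the fix across a month boundary.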

@aronasorman
Collaborator

Hi @mikewray, I've updated the central server with the new csv export fixes. Please check if it's fixed on your side now.
