
Preserve items' timestamp upon upload and download. #13

Closed
matteotenca opened this issue Oct 31, 2021 · 9 comments
Labels: enhancement (New feature or request)

Comments
@matteotenca

Good morning,

I believe it would be nice to preserve items' timestamps upon upload (as described here, where it seems to be the intended behaviour) and download.

The docs here describe the fields createdDateTime and lastModifiedDateTime as read-only properties, so the fileSystemInfo facet should be used to set those values. This means every item has two createdDateTime and two lastModifiedDateTime values: the ones in fileSystemInfo are set by the client, while the others are set by OneDrive.

Updating an uploaded file's fileSystemInfo is quite straightforward using the PATCH method (docs).
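A minimal sketch of that PATCH request, assuming the standard Graph endpoint for a user's drive and a hypothetical item_id (the actual HTTP call would be made with whatever client the library already uses, with an Authorization header):

```python
import json

GRAPH_BASE = "https://graph.microsoft.com/v1.0"

def build_timestamp_patch(item_id: str, created_iso: str, modified_iso: str):
    """Return the URL and JSON body for a PATCH that updates fileSystemInfo."""
    url = f"{GRAPH_BASE}/me/drive/items/{item_id}"
    body = {
        "fileSystemInfo": {
            "createdDateTime": created_iso,
            "lastModifiedDateTime": modified_iso,
        }
    }
    return url, json.dumps(body)
```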

The nasty part is reading/writing the file's creation time on the local machine. Under Windows, it is possible to read the creation time of a file with Python's built-in stat call (although some filesystems may not report it properly). Under Linux, as far as I know, it is not possible to read the creation time via Python's stat call. I managed to read the creation time on Linux in pure Python via ctypes; kernel version >= 4.11 and glibc version >= 2.28 are needed.
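A sketch of that ctypes approach, assuming glibc >= 2.28 (which exposes a statx() wrapper) and kernel >= 4.11; it returns None when btime is unavailable (older kernel, older libc, or a filesystem that does not record birth time):

```python
import ctypes
import os

AT_FDCWD = -100
STATX_BTIME = 0x00000800

class StatxTimestamp(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_int64),
                ("tv_nsec", ctypes.c_uint32),
                ("_reserved", ctypes.c_int32)]

class Statx(ctypes.Structure):
    # Layout of struct statx from <linux/stat.h> (256 bytes total).
    _fields_ = [("stx_mask", ctypes.c_uint32),
                ("stx_blksize", ctypes.c_uint32),
                ("stx_attributes", ctypes.c_uint64),
                ("stx_nlink", ctypes.c_uint32),
                ("stx_uid", ctypes.c_uint32),
                ("stx_gid", ctypes.c_uint32),
                ("stx_mode", ctypes.c_uint16),
                ("_spare0", ctypes.c_uint16),
                ("stx_ino", ctypes.c_uint64),
                ("stx_size", ctypes.c_uint64),
                ("stx_blocks", ctypes.c_uint64),
                ("stx_attributes_mask", ctypes.c_uint64),
                ("stx_atime", StatxTimestamp),
                ("stx_btime", StatxTimestamp),
                ("stx_ctime", StatxTimestamp),
                ("stx_mtime", StatxTimestamp),
                ("stx_rdev_major", ctypes.c_uint32),
                ("stx_rdev_minor", ctypes.c_uint32),
                ("stx_dev_major", ctypes.c_uint32),
                ("stx_dev_minor", ctypes.c_uint32),
                ("_spare1", ctypes.c_uint64 * 14)]

def get_btime(path):
    """Return the file's creation (birth) time as a Unix timestamp, or None."""
    try:
        libc = ctypes.CDLL("libc.so.6", use_errno=True)
        statx = libc.statx
    except (OSError, AttributeError):
        return None  # not Linux, or no glibc statx() wrapper
    buf = Statx()
    ret = statx(AT_FDCWD, os.fsencode(path), 0, STATX_BTIME, ctypes.byref(buf))
    if ret != 0 or not (buf.stx_mask & STATX_BTIME):
        return None  # call failed or filesystem does not record btime
    return buf.stx_btime.tv_sec + buf.stx_btime.tv_nsec / 1e9
```

The stx_mask check matters: the kernel reports which fields it actually filled in, so a successful call can still come back without a birth time.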

Writing the creation time under Windows is possible via ctypes and maybe via Python's utime. Under Linux, I believe it is not possible to alter the creation time of a file by any means.
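The Windows route would go through the Win32 SetFileTime API. A sketch, guarded so it is a no-op on other platforms (the FILETIME conversion is portable: FILETIME counts 100-nanosecond ticks since 1601-01-01 UTC, 11644473600 seconds before the Unix epoch):

```python
import ctypes
import platform

EPOCH_DELTA_SECONDS = 11644473600  # seconds between 1601-01-01 and 1970-01-01

def unix_to_filetime_ticks(unix_ts: float) -> int:
    """Convert a Unix timestamp to Windows FILETIME ticks (100 ns units)."""
    return int((unix_ts + EPOCH_DELTA_SECONDS) * 10_000_000)

def set_creation_time(path: str, unix_ts: float) -> bool:
    """Set a file's creation time via kernel32.SetFileTime (Windows only)."""
    if platform.system() != "Windows":
        return False
    from ctypes import wintypes
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateFileW.restype = wintypes.HANDLE
    kernel32.SetFileTime.argtypes = [wintypes.HANDLE] + \
        [ctypes.POINTER(wintypes.FILETIME)] * 3
    kernel32.CloseHandle.argtypes = [wintypes.HANDLE]
    ticks = unix_to_filetime_ticks(unix_ts)
    ft = wintypes.FILETIME(ticks & 0xFFFFFFFF, ticks >> 32)
    # FILE_WRITE_ATTRIBUTES = 0x100, OPEN_EXISTING = 3, FILE_ATTRIBUTE_NORMAL = 0x80
    handle = kernel32.CreateFileW(path, 0x100, 0, None, 3, 0x80, None)
    if not handle or handle == wintypes.HANDLE(-1).value:
        return False  # INVALID_HANDLE_VALUE
    ok = kernel32.SetFileTime(handle, ctypes.byref(ft), None, None)
    kernel32.CloseHandle(handle)
    return bool(ok)
```

Passing None for the access-time and write-time arguments leaves those timestamps untouched, so only the creation time changes.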

Let me know your opinion!

Regards

Repository owner deleted a comment from demin13 Nov 1, 2021
@dariobauer dariobauer added the enhancement New feature or request label Nov 1, 2021
@dariobauer
Owner

Yep, this would be a good improvement to add, although I think it is better to specify it when uploading rather than uploading and then patching.

Looking at the Graph docs, changing the body sent within the upload_large_file method to something along these lines should work:

body = {
    "item": {
        "@odata.type": "microsoft.graph.driveItemUploadableProperties",
        "@microsoft.graph.conflictBehavior": conflict_behavior,
        "name": "file name.txt",
        "description": "String",
        "fileSize": 1024,
        "fileSystemInfo": {
            "@odata.type": "microsoft.graph.fileSystemInfo",
            "createdDateTime": "datetime",
            "lastModifiedDateTime": "datetime",
        },
    }
}

I haven't tried this, and obviously there are some issues to consider, such as the "datetime" having to be in UTC.
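The UTC formatting itself is straightforward; a minimal sketch of turning a local mtime into the ISO-8601 UTC string Graph expects (the helper name is hypothetical):

```python
import os
from datetime import datetime, timezone

def to_graph_timestamp(unix_ts: float) -> str:
    """Format a Unix timestamp as an ISO-8601 UTC string, e.g. 2021-11-02T12:34:56Z."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# e.g. to_graph_timestamp(os.path.getmtime(path_to_file))
```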

Modification time is fairly easy using mtime, but as you said, creation time is an issue. So far the best of the sub-optimal solutions seems to be:

last_modified = os.path.getmtime(path_to_file)

if platform.system() == 'Windows':
    creation = os.path.getctime(path_to_file)
else:
    stat = os.stat(path_to_file)
    try:
        creation = stat.st_birthtime
    except AttributeError:
        # Likely using Linux, fall back to last modified.
        creation = stat.st_mtime

If this idea proceeds, then we should probably get rid of the standard upload method, as it does not support a body with the request. I'm also thinking of rewriting upload to use async.

@dariobauer
Owner

I have created a new branch to implement this.

So far I can get the creation and modification timestamps of the file into the correct format, but I can't get Graph to accept the data. Oddly, when item is in the JSON, Graph flat-out rejects it, and if you don't include item and move everything up a level, then Graph appears to just ignore the fileSystemInfo.

Maybe you have any ideas on how to solve this?

@matteotenca
Author

matteotenca commented Nov 2, 2021

Well, I tested your code and it works well. The format you chose seems to work:

# Create request body for the upload session
body = {
    "item": {
        "@odata.type": "microsoft.graph.driveItemUploadableProperties",
        "@microsoft.graph.conflictBehavior": conflict_behavior,
        "name": destination_file_name,
        "fileSize": file_size,
        "fileSystemInfo": {
            "@odata.type": "microsoft.graph.fileSystemInfo",
            "createdDateTime": file_created_str,
            "lastModifiedDateTime": file_modified_str,
        },
    }
}

The only problem I had was related to the fact that the files I uploaded had their creation date set later than their last-modified date on my local machine; they were files I had manipulated via utime. Once uploaded, the creation time was correct, but the modification time was different from the one provided, which I think makes sense.

Apart from that (and a problem with httpx not being the latest version on this laptop, which prevented downloads from working), your code works like a charm, as far as I can tell.

@matteotenca
Author

I added this code after line no. 369 to get more info from detail_item:

            print("fileSystemInfo:")
            print(
                "    createdDateTime:",
                response_data.get("fileSystemInfo", {}).get("createdDateTime"),
            )
            print(
                "    lastModifiedDateTime:",
                response_data.get("fileSystemInfo", {}).get("lastModifiedDateTime"),
            )
            if "file" in response_data.keys():
                sha1 = (
                    response_data.get("file", {}).get("hashes", {}).get("sha1Hash", "")
                )
                sha256 = (
                    response_data.get("file", {})
                        .get("hashes", {})
                        .get("sha256Hash", "")
                )
                print("sha1Hash:", sha1)
                print("sha256Hash:", sha256)

@dariobauer
Owner

dariobauer commented Nov 3, 2021

I fixed my issue by removing some of the unnecessary content in the body. Since the Graph API integrates with OneDrive Personal, OneDrive Business, and SharePoint, I have seen some issues where something works for one service but not for the others, and I think that was the case here.

I have also added the suggested item details, printed when using verbose, although none of my items seem to have hashes.

Can you please test that the latest push of the branch works for your uploads?

I still need to look into this for downloads.

@dariobauer dariobauer self-assigned this Nov 3, 2021
@matteotenca
Author

Sorry for not answering you before; I had a hardware crash and I'm having some problems with backups. I took a look at the new code, and as far as I can understand, this issue is resolved.

I'm also thinking of rewriting upload to use async.

This point is really interesting. I already made a lot of testing when I was developing the async download code.

The biggest issue I met was that it is not possible to parallelize the upload of the chunks, since each chunk upload must be completed before the next may start. This rules out a truly asynchronous approach once and for all. I.e., if an upload of bytes 0-999 is ongoing, I got an API error when trying to start a second concurrent upload of bytes 1000-1999 over another connection. It seems to be mandatory to wait until the first operation completes before starting the second upload request.
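The sequential constraint above means the upload loop reduces to iterating over contiguous byte ranges and sending each with a Content-Range header before moving on. A sketch of the range bookkeeping (per the Graph docs, each chunk except the last should also be a multiple of 320 KiB):

```python
def chunk_ranges(file_size: int, chunk_size: int):
    """Yield (start, end, Content-Range header value) for each sequential chunk.

    Chunks are contiguous and must be uploaded strictly in this order;
    `end` is inclusive, matching the Content-Range convention.
    """
    for start in range(0, file_size, chunk_size):
        end = min(start + chunk_size, file_size) - 1
        yield start, end, f"bytes {start}-{end}/{file_size}"
```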

So, while keeping a single-connection approach, I tried to switch from synchronous I/O (both disk- and network-wise) to async I/O (using httpx and aiofiles) to try to achieve greater upload speeds. The result was disappointing: the upload was slower.

Then I tried to mix synchronous and asynchronous operations: aiofiles with requests, and Python's native synchronous I/O with httpx's asynchronous requests, but the speed was always worse than before.

It seems that mimicking synchronous behaviour using asyncio-based code just adds overhead. This makes sense, since asyncio (as far as I understand) only makes progress while something is awaited, so awaiting a disk operation and a network operation that depend on each other in the same loop makes no sense (correct me if I am wrong).

Maybe switching to "real" concurrent code (like threads) might do the trick, but I have not tried it.

Looking forward to hearing what you think!

Regards

@dariobauer
Owner

Yes, chunks must be uploaded sequentially in order per the Graph docs:

The fragments of the file must be uploaded sequentially in order. Uploading fragments out of order will result in an error.

I came to a similar conclusion: little to no gain from disk I/O. I think the only gains that could be made are when uploading multiple items, in which case you could have a connection for each item.

In regards to updating the file metadata of a downloaded file - I have made no progress on this.

@matteotenca
Author

In regards to updating the file metadata of a downloaded file - I have made no progress on this.

I have a (somewhat) working piece of code that can read and write timestamps under Windows using ctypes (creation time included), and a badly working piece of code that can read the creation time under Linux, also using ctypes, but it is totally unreliable - it does not work on CentOS 8, yet it is fine on Ubuntu 21.x, even though the kernel and glibc versions are capable of reading the btime on both platforms.

Anyway, I believe that updating a file's creation time under Linux is not possible. The docs here say:

File creation (birth) timestamp (btime)
              (not returned in the stat structure); statx.stx_btime

              The file's creation timestamp.  This is set on file
              creation and not changed subsequently.

              The btime timestamp was not historically present on UNIX
              systems and is not currently supported by most Linux
              filesystems.

If btime behaves as described, it is useless to waste time on this. Under Windows, the ctypes solution works.

The cleanest approach could be to write a tiny library in C or C++ which reads/writes files' timestamps and to use it via ctypes, in such a way that any issue with the Windows API would affect only the binary library, leaving the Python code out of trouble.

In my opinion, all of this work should be undertaken only if a sync process between local and remote depends on it to become possible. I don't know if this is the case.

Bye!

@dariobauer
Owner

Not currently implementable cross-platform. Issue is stale.

@dariobauer dariobauer closed this as not planned Oct 10, 2023