More persistent logs #28
A more complex solution (to set up) could be something like https://kube-logging.dev/, which forwards logs to a central place and can store them on object storage. It looks like it's meant to be run once per cluster (using daemonsets), so maybe not ideal here.
A simple solution would be for the spider to save logs to node-local storage and push them to object storage when it terminates (including when evicted, for example). Then the logs endpoint (#12) needs to know about it.
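A minimal sketch of the push-on-termination part, assuming the spider process handles SIGTERM (which Kubernetes sends before terminating or evicting a pod); the log path and upload helper are hypothetical:

```python
import signal
import sys

LOG_PATH = "/tmp/spider.log"  # hypothetical node-local log file

def ship_logs_and_exit(signum, frame):
    # upload_to_object_storage(LOG_PATH)  # assumed upload helper (e.g. boto3 / libcloud)
    sys.exit(0)

# Kubernetes sends SIGTERM before terminating or evicting a pod, giving the
# spider a grace period to push its log file to object storage.
signal.signal(signal.SIGTERM, ship_logs_and_exit)
```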
Something like scrapy-logexport could be a simple solution.
@wvengen I discussed the persistent logs problem with an experienced colleague; he is familiar with the Questionmark project from one of the Pythoneers days we had in the past. He suggested that one interesting and suitable solution for us might be storing logs on a persistent volume. We can create persistent volumes and mount them to all our pods; the pods write their own log files there, and even if a pod dies, the files stay on the persistent volume. I believe this overlaps with your second idea in the comments on this issue. https://stackoverflow.com/questions/63479814/how-to-store-my-pod-logs-in-a-persistent-storage What do you think about it?
Thank you for looking into this!
Thank you for your thoughts, I will use them as guidance to look deeper into this possible solution. I'm still learning a lot about k8s and can't answer right away :)
Yes, we can, according to the solution provided on Stack Overflow and the docs: "A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. It is a resource in the cluster just like a node is a cluster resource." So we first create a volume in the cluster and then claim pieces of it for each deployment.
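For illustration, a minimal sketch of creating such a claim with the Kubernetes Python client (name, size and namespace are made up; the real setup may rely on dynamic provisioning via a StorageClass):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in the cluster
core = client.CoreV1Api()

# Hypothetical claim that a spider pod could then mount at e.g. /logs
pvc_manifest = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "spider-logs"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "1Gi"}},
    },
}
core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc_manifest)
```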
Yes, good point.
Depends on the solution we choose. If I understand correctly, by object storage you mean something like distributed storage such as S3 on AWS, not something from the k8s world, right? If yes, we can rotate logs from the volume to object storage.
This one is not a problem if we use streaming, but if we use streaming, do we even need a volume? Why not just stream logs from k8s to object storage? What is better for us? Cheaper? If we don't want to implement streaming, there should be a mechanism to rotate logs after a job is done; I need to read a bit more to understand how to implement something like this.
Curious to hear more about how persistent volumes can be used by different pods at the same time (sometimes running on the same node, sometimes on different nodes). Object storage is not a full citizen in Kubernetes (see COSI for recent developments, though our cloud provider doesn't support that), but it is almost always used in conjunction with it.
Not sure I understand your first question. If you follow the link https://stackoverflow.com/questions/63479814/how-to-store-my-pod-logs-in-a-persistent-storage, the solution mounts a volume into each container, which directs its logs there; so it's not that many containers access the same mounted volume, every container has its own mounted volume.
Thanks for making this clear! It's what I would expect from the PV implementation.
With this approach, ideally, there would be a log file per spider run stored on S3, streamed. Unfortunately, you cannot append to an existing file on S3, but you can use multipart upload to upload logs in batches (see e.g. this SO question). scrapy-logexport can do something like this, though not streaming.
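A minimal sketch of the batched approach with boto3 multipart upload (bucket and key are made up; note that every part except the last must be at least 5 MB):

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "spider-logs", "runs/example-run.log"

# Start a multipart upload; parts can then be added as log batches accumulate.
upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []

def upload_batch(data: bytes, part_number: int) -> None:
    """Upload one batch of log data as a multipart part."""
    resp = s3.upload_part(
        Bucket=bucket, Key=key, PartNumber=part_number,
        UploadId=upload["UploadId"], Body=data,
    )
    parts.append({"ETag": resp["ETag"], "PartNumber": part_number})

# ... call upload_batch() periodically during the spider run ...

# When the spider is done, combine the parts into the final log object.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload["UploadId"],
    MultipartUpload={"Parts": parts},
)
```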
We can't just delete the PVC because it's coupled with the PV, so I would suggest that we use logrotate to manage the log files. This way we regularly manage the files, and then it would be nice to store them somewhere in the cloud as a final destination, so we need a tool to put them there.
Here is a nice tutorial and some additional info about the logrotate tool: https://betterstack.com/community/guides/logging/how-to-manage-log-files-with-logrotate-on-ubuntu-20-04/
Well, as you mentioned before, a PVC is linked to a single pod. So when a job is finished, there is just one log file of that one spider run (or perhaps multiple attempts in case of an error). So we can just drop the PV when we don't need it anymore - no logrotate comes into play here. Note that when you make a PVC, a PV is dynamically provisioned. (Also note that Kubernetes has support for e.g. NFS volumes, which would enable sharing across pods - but I'm not sure our cloud provider supports that - and there also seems to be support for this with CSI, e.g. with ceph CSI - so it is possible to use PVCs with multiple attachments, see access modes, but it really depends on the driver whether this is supported.) I think what is possible with CSI really depends on the cloud environment. And we'd like to not use too specific features for now (like PVCs shared across pods - but maybe that is available everywhere), for ease of migration. I would think that object storage is more portable in that sense. For a system design, I see three different options:
- Storing logs on a persistent volume
- Storing logs on object storage, 'streaming'. Perhaps this could be more cleanly implemented as a sidecar container (with logs directed to a file), which does the uploading. That would be a more general solution for streaming logs to object storage, not only for Scrapy.
- Storing logs on object storage afterwards
Hi Willem, could you please tell me what the usage pattern of scrapyd is in production? How many spiders are in the cluster? Is there a designated location where the log files are stored?
After migrating, this will be the case:
Not yet, but I expect there to be a single bucket for each instance.
Note that we're making a generic software component here that could be used by various people, so try to make it useful for general use-cases you can think of (without making it too complex, and making sure the above fits in).
While working on the problem, I ran into this issue:
nobody@scrapyd-k8s-7f59f447d4-tjhjj:/opt/app$ skopeo inspect docker://ghcr.io/q-m/scrapyd-k8s-spider-example:latest
Willem, would it be difficult to add an image for this platform, please, so I could test and run things on k8s locally? I don't have access to the docker repo, otherwise I could also add an image for this architecture.
Ah, good point, an arm64 image would be welcome! I've added it. Does it work now?
Nope, I guess it is important to have v8; the expected target platform is linux/arm64/v8. When I run a pod with the arm64 image I get the error:
Ah, bummer. From what I could find,
skopeo inspect docker://ghcr.io/q-m/scrapyd-k8s-spider-example:main
containers/skopeo#1617 (comment) inspired me to try
docker buildx imagetools inspect ghcr.io/q-m/scrapyd-k8s-spider-example:main
which does show arm64 too.
Yes, thank you, main does work!
The solutions I have been working on that comply with the criteria "simple, universal, on Kubernetes, no spider modification":
@wvengen you asked me to look into some sort of webhooks that detect resource changes; is this something you had in mind?
When the pod with the spider completes the run, its logs are still available; I just need to find a way to collect the logs from code when the pod is done.
I'll look back at what you've written next week (time permitting, sorry I'm a bit busy these days). Will comment on some small things, and am curious what you would still recommend here.
That is almost true, but not quite: logs are truncated each night (I think), and the spider can run for longer than a day, so it still needs to do this periodically.
A new service for parsing and storing logs, running separate from the spider jobs, is not what I really had in mind for this issue. If we would go the standard-k8s-logging-stack route, one benefit would be that it could integrate well with clusters already having this. A downside is that this requires cluster resources running all the time, increasing costs (in our case, where we have no full logging stack present already).
I think it would be ok to use a persistent volume (not preferred, but perhaps a simple solution). That would be one persistent volume per spider job. There are some size considerations, i.e. if jobs are not deleted automatically and persistent volumes remain, that would add up quickly. And we'd still need to migrate from the persistent volume to object storage.
You can basically run multiple containers in a single pod (as you mention), and indeed when one container finishes, it affects the others. By using a custom entry-point (either taking care to chain them properly, if there is an existing entry-point, or else requiring that the container image doesn't use a custom entry-point), it may be possible to run the spider and then wait for log shipping to finish (or adapt scrapyd-k8s with a custom spider run command to do so, not requiring adapting the entry-point).
Great that you found that. Note that in #6, we may want to start listening to changes to spider pods. So if we need to work with listening to changes for log handling, scrapyd-k8s might be a suitable place for some parts here.
The Python library for K8s, which is being used to set up containers and jobs, makes it possible to retrieve logs via the core API that is already configured in the code. The question is how to use this correctly and not lose logs that are truncated at night; I need to think about possible ways. Another "difficulty" is managing multiple containers with jobs, but it seems we can list all of them and then work with that.
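For reference, a minimal sketch of fetching job pod logs with the Kubernetes Python client's core API (namespace and label selector are placeholders):

```python
from kubernetes import client, config

config.load_incluster_config()  # scrapyd-k8s runs inside the cluster
core = client.CoreV1Api()

# List the job pods (label selector is a placeholder) and fetch their logs.
pods = core.list_namespaced_pod(namespace="default", label_selector="app=spider")
for pod in pods.items:
    text = core.read_namespaced_pod_log(name=pod.metadata.name, namespace="default")
    # ... append `text` to per-job log storage ...
```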
After considering different solutions, we decided to implement persistent volumes. It's important to keep in mind that the volume shouldn't be too big, and that we need to ship logs somewhere else to free up space if we want to store them longer, or just clean it every once in a while when it's almost full, removing old log files first.
In a production environment, a persistent volume is implemented by configuring external storage; it can be a hard drive, S3, or other options. @wvengen are you ok with implementing the persistent volume using S3?
S3 seems the sensible option for long-term storage, yes. If k8s persistent volumes can use S3 (and are supported in general, i.e. not only for a specific cloud provider), then that would be great.
From what I learned, it can support S3 from different providers; the only thing I am not sure about is whether we can unify the configuration of the different S3 providers. For this I need to find out how to configure them, highlight the overlapping parts, and understand whether it's possible to pass credentials as env variables or in some other way.
Hi @wvengen, I was reading the Logging Architecture page in the Kubernetes docs and it made me think: how many resources do we have now for log files? How much do we need at maximum?
If we can just expand the limits for log file sizes, that's the easiest way. Kubernetes is responsible for log rotation when we are talking about stdout/stderr; my assumption is that it wipes logs when they reach the default limits. It's not preserving logs, but it is another angle to tackle the problem if we need to let logs live a bit longer than the jobs.
Ah, that is very interesting, thank you! It is indeed tangential to the issue, but may actually satisfy our direct need. Thank you for sharing this! We are currently running on Kubernetes v1.26, would that support the beta kubelet config? Is it possible to change logging settings only for scrapyd-k8s spider pods / jobs?
These parameters are available starting from v1.21, so yes, they are available for us on v1.26.
Concerning the other solution: if you want S3 specifically, then we don't need a persistent volume, but there is a challenge in collecting logs from all pods (which is solved by fluentbit, but we are moving further with something custom). I am looking into one possible solution for that, though I am not sure it's a good one: to write anything to S3 we need to aggregate logs using logstash and find a way to collect logs and send them to logstash. My current idea is to use a TCP socket and try to configure the worker pods to send logs via that socket. Not very easy. A persistent volume natively supports different storage backends; if we have multiple pods and we want to collect logs to the same storage, we need to choose a provider that supports ReadWriteMany mode, so many pods can write to the same storage at the same time. This way I can just redirect stdout/stderr to files stored on the persistent volume. An example of such a storage is Google Cloud Filestore; note that k8s natively supports the different file storage providers listed here: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#types-of-persistent-volumes Both solutions have certain limitations in the infrastructure we can use, so it's always a trade-off if we don't want to use industry-standard approaches for all clusters.
What about asking Scrapy to log to a file on a persistent volume? There is a standard way to do this when invoking the spider. Then at the end of the spider run, move the file to object storage. Or, if we can have a persistent volume on object storage, that would be great, but I think that has some drawbacks / corner cases to think about. Regarding persistent volumes and EBS/Azure/GCE: these are deprecated; if there are CSI implementations doing this, that could be useful.
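A minimal sketch of that flow, assuming the persistent volume is mounted at /logs and boto3 is used for the final copy (bucket name, spider name and paths are made up):

```python
import subprocess
import boto3

LOG_FILE = "/logs/spider-run.log"  # file on the mounted persistent volume

# Scrapy's standard way of logging to a file: the LOG_FILE setting.
subprocess.run(["scrapy", "crawl", "example", "-s", f"LOG_FILE={LOG_FILE}"], check=False)

# When the spider run is done, move the log file to object storage.
boto3.client("s3").upload_file(LOG_FILE, "spider-logs", "runs/spider-run.log")
```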
On our Wednesday call I suggested collecting logs from code, so "moving the file at the end of a spider run" is the same as collecting logs from code: we need data structures to keep track of running pods, and we need to invoke a script that checks the status of each pod on a schedule, like every minute. But at the same time we are facing the problem of logs being truncated, so this way of solving the problem does not really solve our problem. With quite some engineering it's probably doable; there are corner cases that make it complex. That is why introducing persistent volumes is a good way to persist logs compared to the previous way, I think: we ask pods to write to a persistent volume, and even if a pod crashes or is deleted, we still have its logs; and if we constantly write to the file, then we don't need to come up with a solution for truncated logs, they will be preserved as well. Concerning CSI, we can still use the old annotation for disks and it will be automatically redirected to the new abstraction with the driver. But I can also spend some time and look into how to deploy this CSI.
Are there drawbacks/corner cases with persistent volumes you see right now and want me to think about?
Agree! Let's go that route.
Good to know that old annotations still work - are they 'translated' to CSI by Kubernetes? I think it is useful to look a bit into CSI, also because there are probably limitations that would be good to know about.
I do read that S3 with CSI has limitations. I think it is good to see what is possible with that, but also come up with an approach without. So if we can use S3 as PV storage (incl. updates of logs, which I think is harder), that would be great. When I think KISS, I still come to letting Scrapy store logs on a regular persistent volume, and copying them to object storage when the spider is done (e.g. with an entrypoint).
What is a regular persistent volume for you? In production it's always implemented through a storage backend that fits the requirements. If we use a file storage that allows ReadWriteMany and, for example, pay for that, why would we need another storage like S3 which is not compatible with ReadWriteMany? This way you store the same files in two different types of storage and have to pay for both. If you want precisely S3 and only S3 (or another object storage) which is not compatible with ReadWriteMany, then we don't need a persistent volume; we can aggregate logs with logstash, but the challenge is collecting them into logstash (sorry, I am repeating myself a bit, I mentioned this solution above), and logstash can be configured to ship aggregated logs to S3 directly, so we don't need an additional step and resources to set up any sort of volume. I will look into CSI to learn about this abstraction.
Good question :) I would expect that dynamically provisioned network-attached storage would be common among Kubernetes cloud providers. I think local storage may or may not work (depending on local disk space), but as you mentioned before, it is not recommended to depend on node-local storage. I think we have three scenarios now:
Currently I have managed to access the logs of job pods from the managing pod with scrapyd. I created a watcher to monitor events, and when another pod with a job is running, I ask a second watcher to monitor and send its logs. The problem with the second watcher is that it does not really watch: it reads and sends the logs and quits until the next event triggers log reading. This is not the behavior I would like to have, so I am looking into this problem now and need advice from more experienced colleagues. In the meanwhile I made a workaround: I append logs to a file located in the scrapyd pod with every read, so the file does not get rewritten, and in case logs were truncated, I still have them in the file; then I can delete duplicated lines by running pandas commands and the files are ready to be shipped to S3. But this is a workaround, and not watching the logs has some potential for log loss: since reading logs is triggered by events, it is possible that between one event and another some logs were written and then truncated, or the pod failed, and we lose some part. It's probably not a big part, and you said we can afford it. In case I don't manage to make the log watcher work as expected, I can present that workaround solution next time. I had to put the watcher in a separate thread in Flask, and I guess there are some details about async and threading that I did not fully grasp, which is why the log watcher does not act as desired.
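For context, a rough sketch of this event-driven approach (namespace and label selector are placeholders); as described, it only reads the logs when an event fires rather than following them:

```python
from kubernetes import client, config, watch

config.load_incluster_config()
core = client.CoreV1Api()

w = watch.Watch()
# Watch pod events; on each event, read the pod's current logs and append
# them to a local file inside the scrapyd-k8s container.
for event in w.stream(core.list_namespaced_pod, namespace="default",
                      label_selector="app=spider"):
    pod = event["object"]
    if pod.status.phase in ("Running", "Succeeded", "Failed"):
        logs = core.read_namespaced_pod_log(name=pod.metadata.name, namespace="default")
        with open(f"/tmp/{pod.metadata.name}.log", "a") as f:
            f.write(logs)
```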
Thank you for the progress update! I'm curious to see any configuration and code you have - but feel free to polish it a bit more if you like (though I'm a fan of early sharing).
Ok, for now we collect logs into files inside the scrapyd-k8s container; they are collected by watchers (Kubernetes watch and client). Instead of cleaning up duplicates, we want to keep only unique log lines in the files, because pandas or other data science frameworks are quite big and do not really belong in our project. To keep it clean and simple, we can track which lines have already been added to the log file. There are multiple ways to do that; I see one really easy and effective way which I am going to implement and try on the cluster. We have many lines like this in our job logs:
So we cannot really rely on the date-time stamp, but if we take the last two lines in the file with job logs, we have a unique combination of lines; we can then parse the job container logs, skip lines until we find a match of those two unique lines, and append to the log file only the lines that come after that (a minimal sketch of this matching follows after this comment). This is an interesting algorithm! Also, hashing long lines will take more time than comparing them symbol by symbol, so I want to leave hashing out of this algorithm. Future plans we discussed:
Still keeping other possible solutions in mind, but working on this one because it does not require heavy frameworks and is simple enough for small clusters.
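As mentioned above, a minimal sketch of the last-two-lines matching idea (file path and function name are hypothetical):

```python
def append_only_new_lines(log_file: str, fresh_logs: str) -> None:
    """Append only the lines of fresh_logs that are not yet in log_file.

    Uses the last two lines already in the file as a unique anchor and skips
    everything in fresh_logs up to and including that anchor.
    """
    with open(log_file) as f:
        existing = f.read().splitlines()

    new_lines = fresh_logs.splitlines()
    start = 0
    if len(existing) >= 2:
        anchor = existing[-2:]
        for i in range(len(new_lines) - 1):
            if new_lines[i:i + 2] == anchor:
                start = i + 2
                break
        # If the anchor is not found (e.g. logs were already truncated),
        # everything is appended, which may duplicate some lines.

    with open(log_file, "a") as f:
        for line in new_lines[start:]:
            f.write(line + "\n")
```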
There are two S3 settings that will be important for us, when using non-AWS S3 object storage providers. Would you like to check if these can be configured (e.g. in the environment - maybe no code change is even needed), and how? It's setting the endpoint, and whether to use host or path addressing style (we need the latter) (see e.g. boto3's explanation).
As I understand it, both are used when the driver is initialized, and in the code the dynamic initialization is already implemented. So depending on the provider, a set of arguments is required. The arguments are parsed from the section [joblogs.storage.PROVIDER_INITIALS]; if a user did not provide a parameter required for the driver initialization, an error is thrown.
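For illustration, here is how those two settings would look with a boto3-style S3 client (the endpoint URL is made up, and the actual configuration keys in scrapyd-k8s may differ):

```python
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",   # non-AWS S3 endpoint
    config=Config(s3={"addressing_style": "path"}),   # path-style addressing
)
```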
Ah, great! And thank you for looking this up & documenting! |
Merged, will try this in production. Looking at the README, I think it could use a section (below Configuration) on this feature, and how to configure it. Feel free to suggest something, if you like (otherwise I'll get round to it someday). |
Modified the README, let me know if you are missing anything in that PR!
Regarding #28 (comment), I think this is still missing. You've done so much to develop this, so let's help people to actually know about its existence, and how to set it up!
Currently, Docker / Kubernetes logs are used for logging. This is sometimes good enough, but in many situations it is not. These logs are often truncated at night (and potentially more often when they grow to a large size), especially on Kubernetes, so inspecting errors of long-running jobs can be difficult.
Find a way to keep logs around for at least a bit longer, e.g. for the lifetime of the job (as that could be configurable).
The focus is on Kubernetes, where this is the most pressing issue.
Note that if you have a large, mature Kubernetes cluster, it likely includes components to handle logs. But for smaller clusters, it brings a lot of overhead, and something else is desired.
Either Kubernetes has some hints or so to keep logs around for longer, or logs need to be stored elsewhere (and also removed by some system).