httpHeaders Set-Cookie is single string #33

Open
Austinb opened this issue Jul 17, 2019 · 1 comment

Austinb commented Jul 17, 2019

When a server response in a WARC record contains multiple Set-Cookie headers, the value of httpHeaders.Set-Cookie is always the last one in the list. It should instead be returned as an array of the Set-Cookie values, if that change doesn't break other things, or there should be another method to get all of the cookies from the headers block. Another option would be to keep the line endings (\n) between the values, so the result is still a string but can be split when needed.
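
To illustrate why a newline separator (or an array) is a safer choice here than the comma-joining commonly used for other repeated headers, here is a minimal sketch in plain JavaScript. It does not use any node-warc API; the `cookies` array is just two of the Set-Cookie values from the example record, and the variable names are made up for illustration.

```javascript
// Two Set-Cookie values taken from the example response record.
const cookies = [
  'OCSESSID=57bfbd42e2fe9d4d5af66485f7; path=/',
  'language=tr-tr; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=tr-tr'
]

// Joining with ", " is ambiguous: the expires date itself contains ", ".
const splitOnComma = cookies.join(', ').split(', ')

// Joining with "\n" round-trips, because a header value cannot contain a bare newline.
const splitOnNewline = cookies.join('\n').split('\n')

console.log(splitOnComma.length) // 3: the expires date was cut in two
console.log(splitOnNewline.length) // 2: the original values
```

An array avoids the question of separators entirely, at the cost of changing the type of the value for callers that expect a string.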

Example WARC record (minus the content block):

WARC/1.0
WARC-Type: request
WARC-Date: 2019-06-15T21:54:45Z
WARC-Record-ID: <urn:uuid:1e7aaba9-c5b9-49cd-b0a8-6a4d7460c9b3>
Content-Length: 296
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:07d8abda-2416-492c-b139-8fb526d5f792>
WARC-IP-Address: 95.216.246.36
WARC-Target-URI: https://www.bpazar.com/index.php?route=product/search&search=Sarj&page=4

GET /index.php?route=product/search&search=Sarj&page=4 HTTP/1.1
User-Agent: CCBot/2.0 (https://commoncrawl.org/faq/)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Host: www.bpazar.com
Connection: Keep-Alive
Accept-Encoding: gzip



WARC/1.0
WARC-Type: response
WARC-Date: 2019-06-15T21:54:45Z
WARC-Record-ID: <urn:uuid:3f3d6e43-9e5d-42ba-a111-43fcd90dd633>
Content-Length: 1043231
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:07d8abda-2416-492c-b139-8fb526d5f792>
WARC-Concurrent-To: <urn:uuid:1e7aaba9-c5b9-49cd-b0a8-6a4d7460c9b3>
WARC-IP-Address: 95.216.246.36
WARC-Target-URI: https://www.bpazar.com/index.php?route=product/search&search=Sarj&page=4
WARC-Payload-Digest: sha1:N2WQFAUYKKXT6MRWSCXCQC7FOZRQCLTI
WARC-Block-Digest: sha1:S3FKWWFJ7LCYFOHUZ4RBPFAMYNQSVQMH
WARC-Identified-Payload-Type: text/html

HTTP/1.1 200 OK
Server: nginx
Date: Sat, 15 Jun 2019 21:54:44 GMT
Content-Type: text/html; charset=UTF-8
X-Crawler-Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Set-Cookie: OCSESSID=d4163e3479bec29a507792acc4; path=/
Set-Cookie: OCSESSID=57bfbd42e2fe9d4d5af66485f7; path=/
Set-Cookie: language=tr-tr; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=tr-tr
Set-Cookie: currency=TRY; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=www.bpazar.com
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Nginx-Cache-Status: BYPASS
X-Server-Powered-By: Engintron
X-Crawler-Content-Encoding: gzip

Output of console.log(record.httpHeaders) when called in the record callback:

{ Server: 'nginx',
  Date: 'Sat, 15 Jun 2019 21:54:44 GMT',
  'Content-Type': 'text/html; charset=UTF-8',
  'X-Crawler-Transfer-Encoding': 'chunked',
  Connection: 'keep-alive',
  Vary: 'Accept-Encoding',
  'Set-Cookie':
   'currency=TRY; expires=Mon, 15-Jul-2019 21:54:40 GMT; Max-Age=2592000; path=/; domain=www.bpazar.com',
  'X-XSS-Protection': '1; mode=block',
  'X-Content-Type-Options': 'nosniff',
  'X-Nginx-Cache-Status': 'BYPASS',
  'X-Server-Powered-By': 'Engintron',
  'X-Crawler-Content-Encoding': 'gzip' }

BubuAnabelas (Contributor) commented

I think that is done in the following lines (specifically 236):

  static _parseHeaders (headerBuffs) {
    const headers = {}
    let len = headerBuffs.length
    let i = 1
    let key
    let lastKey = ''
    let sepPos
    let currentBuffer
    let curLen
    while (i < len) {
      currentBuffer = headerBuffs[i]
      curLen = currentBuffer.length
      sepPos = currentBuffer.indexOf(ColonSpace)
      if (sepPos !== -1) {
        key = ContentParser.utf8BufferSlice(currentBuffer, 0, sepPos)
        lastKey = key
        headers[key] = ContentParser.utf8BufferSlice(
          currentBuffer,
          sepPos + 2,
          ContentParser.bufEndPosNoCRLF(currentBuffer, curLen)
        )
      } else {
        headers[lastKey] = ContentParser.utf8BufferSlice(
          currentBuffer,
          0,
          ContentParser.bufEndPosNoCRLF(currentBuffer, curLen)
        )
      }
      i++
    }
    return headers
  }
}

Maybe you could fix it and open a PR.
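
One possible shape for such a fix, as a rough sketch only: the quoted loop assigns headers[key] unconditionally, so each repeated Set-Cookie line replaces the previous one. The function below is a hypothetical stand-alone helper, not node-warc code. It works on plain strings instead of Buffers, assumes (like the quoted loop, which starts at i = 1) that index 0 holds the status line, ignores lines without ': ', and matches header names by exact case.

```javascript
// Hypothetical helper for illustration; not part of node-warc.
function parseHeaderLines (lines) {
  const headers = {}
  // Start at 1 to skip the status line, mirroring the quoted loop (assumption).
  for (let i = 1; i < lines.length; i++) {
    const line = lines[i].replace(/\r?\n$/, '') // drop a trailing CRLF or LF
    const sepPos = line.indexOf(': ')
    if (sepPos === -1) continue // lines without ": " are ignored in this sketch
    const key = line.slice(0, sepPos)
    const value = line.slice(sepPos + 2)
    if (!Object.prototype.hasOwnProperty.call(headers, key)) {
      headers[key] = value // first occurrence stays a plain string
    } else if (Array.isArray(headers[key])) {
      headers[key].push(value) // third and later occurrences
    } else {
      headers[key] = [headers[key], value] // second occurrence: promote to array
    }
  }
  return headers
}

const parsed = parseHeaderLines([
  'HTTP/1.1 200 OK\r\n',
  'Server: nginx\r\n',
  'Set-Cookie: OCSESSID=d4163e3479bec29a507792acc4; path=/\r\n',
  'Set-Cookie: OCSESSID=57bfbd42e2fe9d4d5af66485f7; path=/\r\n',
  'Vary: Accept-Encoding\r\n'
])
console.log(parsed['Set-Cookie'].length) // 2
console.log(parsed.Server) // nginx
```

Headers that appear once stay strings here, so existing callers would only see a different type for headers that actually repeat. Always returning an array for Set-Cookie, as Node's own http module does for its `set-cookie` header, would be more predictable but is a larger behaviour change.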
