-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix line breaks detection in edge-case #41
Fix line breaks detection in edge-case #41
Conversation
@zajca chcel by som tu pripravit PR na fixnutie jedneho problemu, ale nespusta sa Travis: Vedel by si mi prosim s tym pomoct? Nemam prava na upravu tohto repa na GitHub aby som to zapol ... |
tak uz sa o to postaral ujovlado: https://keboola.slack.com/archives/C02CFS0H3/p1595587122049500?thread_ts=1595586938.049000&cid=C02CFS0H3 |
Čus, su na dovolence, ale mrknul sem na ten PR co sem dělal a tam sa travis nespustil, takže netuším. Asi @odinuv @ondrejhlavacek by mohl vědět. |
9089145
to
f62f9c4
Compare
…nger intend to maintain compatibility with PHP, https://hhvm.com/blog/2017/09/18/the-future-of-hhvm.html
f62f9c4
to
ee66641
Compare
75e68dd
to
bba5f75
Compare
@@ -2,7 +2,6 @@ language: php | |||
php: | |||
- 5.6 | |||
- 7.0 | |||
- hhvm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stare ani nove buildy uz s HHVM neprechadzaju, HHVM uz nepodporuje PHPcko:
https://hhvm.com/blog/2017/09/18/the-future-of-hhvm.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jasnacka, ale pridej tam nove verze PHP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. vdaka pridane v a6d0826
*/ | ||
public static function detectLineBreaks($sample, $enclosure, $escapedBy) | ||
{ | ||
$cleared = self::clearCsvValues($sample, $enclosure, $escapedBy); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ked sa tento riadok upravi na $cleared = $sample;
, tak pekne vidno ako spadnu testy na detekcii zalamovania riadkov.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Najprv sa zo vzorky odstrania hodnoty a potom sa pokracuje povodnym sposobom.
@@ -364,14 +382,6 @@ public function invalidSkipLinesProvider() | |||
]; | |||
} | |||
|
|||
public function testInvalidNewLines() | |||
{ | |||
self::expectException(Exception::class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tento test teraz uz prechadza - nenastane exception, ... binarny subor ktory sa tam pouzival, obsahuje vela uvodzoviek ... a kedze sa teraz uz obsah medzi uvodzovkami ignoruje, ked sa detekuje zalomenie riadkov ... tak sa to nahodou teraz zdetekuje ako \n
, ... pomocou CsvReader
sa daju potom z toho subora nacitat aj data v textovej podobe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nevim jestli to neni dost brutalni zmena, co t bude delat kdyz to bude parsovat jpgcko nebo zip :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nemyslim si, ze ide o velku zmenu. Tento test prechadzal iba nahodou, ak by sa tam dal iny binarny subor, ktory by obsahoval skor kod znaku \n
ako \r
, tak by sa to spravalo inak -> zdetekovalo by to aj v starom kode zalomenie pomocou \n
a islo dalej.
Toto repo CSVcka nijako nevaliduje, mozes mat napr rozny pocet stlpcov v jednotlivych riadkoch.
Tak ako doteraz to pri binarnom subore nahodne prejde/neprejde, ... len teraz sa beru aj uvodzovky, takze to prejde v inej mnozine nahodnych dat.
Tu su pozície klucovych znakov zo suboru binary
:
0x29E
: 22
- "
0x2AB
: 22
- "
0x328
: 22
- "
0x627
: 0D
- \r
0x65F
: 22
- "
0x683
: 0A
- \n
Teda v starom kode sa ako prve naslo \r
a zdetekovalo sa to ako nevalidne zalomenie riadkov.
Teraz sa zdetekuju aj uvodzovky a zoberie sa teda ako prve \n
.
Mozem ten binarny subor upravit aby na pozicii 0x683
bolo \r
- a test zachovat - ale nevidim v tom zmysel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jojo, super! vubec me to vcera takhle nedoslo
@@ -0,0 +1,2 @@ | |||
"test","some textwith\r line breaksinsidebutrowsareusing \n \"line\" break" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Text je zalomeny pomocou \r
znakov, ktore tu v diffe nevidno, vid test kde sa pouziva.
@@ -137,6 +137,24 @@ public function testParseEscapedBy() | |||
self::assertEquals($expected, iterator_to_array($csvFile)); | |||
} | |||
|
|||
public function testParseMacLineEndsInField() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Najprv som do PR pridal test, ktory testuje tento edge-case, bez fixu to padalo:
https://travis-ci.com/github/keboola/php-csv/jobs/364605649#L287
/* | ||
* Regexp examples: | ||
* enclosure: |"|, escapedBy: none, regexp: ~"(?>(?>"")|[^"])*"~ | ||
* enclosure: |"|, escapedBy: |\|, regexp: ~"(?>(?>\\"|\\\\)|[^"])*"~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generovanie regularneho vyrazu som okomentoval, nie je nejaky dlhy, ale aby bolo jasne co sa tam deje.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Musim pouzit viac-riadkovy komentar, kedze sa tam nachadza ?>
(koniec PHP suboru).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jako dobry poctenicko :)
@ujovlado skusil som to radsej fixnut ako to obchadzat v extractoroch, ... hodil by som tu na review zajcu alebo odina, no obaja su off. |
ja su dost mimo provoze jeste, takze jsem to jen tak proletel, ale prosim zkus perf test na nejaky XX GB CSV - vidim, ze by se ta zmena mela tykat jen detekce a ta se stejne dela na samplu, a nemyslim si, ze by tam mel byt problem, ale rad bych mel jistotu |
overit backtrack limit |
6c792e0
to
bf6e4a9
Compare
bf6e4a9
to
86f24a0
Compare
$csvFile->writeRow($rows[1]); | ||
self::fail('Expected exception was not thrown.'); | ||
} catch (Exception $e) { | ||
// Exception message differs between PHP versions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pridal som do CI novsie verzie PHP, v tomto teste sa mierne lisi chybova sprava, medzi verziami PHP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
me nikdy nenapadlo se s tim takhle srat, ja bych zkratil tu message :D ale kazdopadne 👍
$csvFile->writeRow($rows[0]); | ||
self::fail('Expected exception was not thrown.'); | ||
} catch (Exception $e) { | ||
// Exception message differs between PHP versions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pridal som do CI novsie verzie PHP, v tomto teste sa tiez mierne lisi chybova sprava, medzi verziami PHP.
@@ -90,7 +90,7 @@ public function writeRow(array $row) | |||
"Cannot write to CSV file " . $this->fileName . | |||
($ret === false && error_get_last() ? 'Error: ' . error_get_last()['message'] : '') . | |||
' Return: ' . json_encode($ret) . | |||
' To write: ' . strlen($str) . ' Written: ' . $ret, | |||
' To write: ' . strlen($str) . ' Written: ' . (int) $ret, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ret moze byt aj false, no v chybovej sprave chceme cislo.
@@ -148,7 +129,7 @@ protected function readLine() | |||
// allow empty enclosure hack | |||
$enclosure = !$this->getEnclosure() ? chr(0) : $this->getEnclosure(); | |||
$escapedBy = !$this->getEscapedBy() ? chr(0) : $this->getEscapedBy(); | |||
return fgetcsv($this->getFilePointer(), null, $this->getDelimiter(), $enclosure, $escapedBy); | |||
return @fgetcsv($this->getFilePointer(), null, $this->getDelimiter(), $enclosure, $escapedBy); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
V PHP 7.4 (pridane do CI) funkcie fgetcsv
a fread
vyhodia chybu - ak invalid file pointer.
V starsich verziach sa vrati iba null
alebo false
, chyba sa tam nevyhodi.
Z pohladu uzivatela sa vrati prazdne pole, testuje sa to tu (tento test padal):
Lines 477 to 484 in 86f24a0
public function testInvalidPointer() | |
{ | |
$fileName = sys_get_temp_dir() . DIRECTORY_SEPARATOR . uniqid('csv-test'); | |
$file = fopen($fileName, 'w'); | |
$csvFile = new CsvReader($file); | |
self::assertEquals([], $csvFile->getHeader()); | |
self::assertEquals([], iterator_to_array($csvFile)); | |
} |
V CsvWriter
si naopak generujeme chybovu spravu sami, tam to zostalo bez zmeny, a testuje sa to tu:
php-csv/tests/CsvWriteTest.php
Lines 190 to 215 in 86f24a0
public function testInvalidPointer() | |
{ | |
$fileName = sys_get_temp_dir() . DIRECTORY_SEPARATOR . uniqid('csv-test'); | |
touch($fileName); | |
$pointer = fopen($fileName, 'r'); | |
$csvFile = new CsvWriter($pointer); | |
$rows = [['col1', 'col2']]; | |
try { | |
$csvFile->writeRow($rows[0]); | |
self::fail('Expected exception was not thrown.'); | |
} catch (Exception $e) { | |
// Exception message differs between PHP versions. | |
$or = new PHPUnit_Framework_Constraint_Or(); | |
$or->setConstraints([ | |
new PHPUnit_Framework_Constraint_StringContains( | |
'Cannot write to CSV file Return: 0 To write: 14 Written: 0' | |
), | |
new PHPUnit_Framework_Constraint_StringContains( | |
'Cannot write to CSV file Error: fwrite(): ' . | |
'write of 14 bytes failed with errno=9 Bad file descriptor Return: false To write: 14 Written: 0' | |
) | |
]); | |
self::assertThat($e->getMessage(), $or); | |
} | |
} |
} | ||
} | ||
|
||
public function getPerformanceTestInputs() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pridal som perf testy.
Podla nich na mojom PC trva spracovanie regularneho vyrazu 190ms
, ktory sa pridaval, pri 10000 vzorke, ktora je nastavena v kode.
Napr. pre prvy data set 1M-simple-rows
, bola duration
:
- v starom kode
4.0193018913269
- v novom kode
4.1560678482056
- teda rozdiel je do nejakych
5%
, pri opakovanych spusteniach to trocha kolisalo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ja bych rek, ze dobry - melo by to byt zhorseni o konstantu, takze v pohode
Ad:
Backtrack limit:
|
@odinuv snazil som sa zapracovat vsetky tvoje pripomienky, pozries prosim na to? vdaka. |
(este pozeram ze na Travis CI su tie performance testy ovela pomalsie ako lokalne, idem to este fixnut) |
Vypol som spustanie preformance testov v CI. BTW: Lines 144 to 147 in b77f1c7
|
tvl "pripominky", ja to jen prolitl v polohalucinacispanku a napsal jsem nejake vyblitky co me napadly :) jasne, jdu na to! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probral jsem se tim, zadnej (ani potencionalni) bugisek me nenapadl, no tak snad to bude dobry, ale je to takovej edge edge case, ze si myslim, ze jestli to nebude fungovat tak na to prijdem za 3 roky :)
tests/CsvReadTest.php
Outdated
@@ -7,7 +7,9 @@ | |||
use Keboola\Csv\CsvWriter; | |||
use Keboola\Csv\Exception; | |||
use Keboola\Csv\InvalidArgumentException; | |||
use phpDocumentor\Reflection\Types\Void_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to se tu pripletlo
@@ -364,14 +382,6 @@ public function invalidSkipLinesProvider() | |||
]; | |||
} | |||
|
|||
public function testInvalidNewLines() | |||
{ | |||
self::expectException(Exception::class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jojo, super! vubec me to vcera takhle nedoslo
$csvFile->writeRow($rows[1]); | ||
self::fail('Expected exception was not thrown.'); | ||
} catch (Exception $e) { | ||
// Exception message differs between PHP versions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
me nikdy nenapadlo se s tim takhle srat, ja bych zkratil tu message :D ale kazdopadne 👍
} | ||
} | ||
|
||
public function getPerformanceTestInputs() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ja bych rek, ze dobry - melo by to byt zhorseni o konstantu, takze v pohode
/* | ||
* Regexp examples: | ||
* enclosure: |"|, escapedBy: none, regexp: ~"(?>(?>"")|[^"])*"~ | ||
* enclosure: |"|, escapedBy: |\|, regexp: ~"(?>(?>\\"|\\\\)|[^"])*"~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jako dobry poctenicko :)
src/LineBreaksHelper.php
Outdated
* enclosure: |"|, escapedBy: none, regexp: ~"(?>(?>"")|[^"])*"~ | ||
* enclosure: |"|, escapedBy: |\|, regexp: ~"(?>(?>\\"|\\\\)|[^"])*"~ | ||
*/ | ||
$regexpDelimiter = '~'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
konstanta?
@odinuv vdaka ze si to pozrel ;) fixol som este tu konstantu a chybu v |
No ted jsem si uvedomil, ze jsem zapomnel na ten validateLineBreak uvnitr readLine, rekl bych, ze mas pravdu, ze tam nemusi byt a stacil by v konstruktoru - imho je to pozustatek toho jak to bylo implementovane ve verzi 1 |
ok. vdaka. hodil som to do osobitneho PR: |
Solving: https://keboola.atlassian.net/browse/COM-350
Changes:
\r
in field, in the first line are correctly readed byCsvReader
, no exceptionKeboola\Csv\InvalidArgumentException: Invalid line break. Please use unix \n or win \r\n line breaks.
is thrown.Problem
ex-mssql
som zistil, ze tam vznika problem pri parsovani CSV z programu BCP.\r
(Mac styl).CsvReader
to nevie spravne vyhodnotit.$sample
je obmedzena na 10000 znakov, takze pre regularny vyraz to nebude ziadny problem.CodeClimate hlasi nejake chyby a neviem preco, ale coverage urcite neklesol, len tam berie chybne subor z testov do coverage, nepodarilo sa mi zistit kde je problem, mozno nejaky bug: