Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix line breaks detection in edge-case #41

Merged
merged 13 commits into from
Jul 28, 2020

Conversation

michaljurecko
Copy link
Contributor

@michaljurecko michaljurecko commented Jul 24, 2020

Solving: https://keboola.atlassian.net/browse/COM-350

Changes:

  • CSV files that contains \r in field, in the first line are correctly readed by CsvReader, no exception Keboola\Csv\InvalidArgumentException: Invalid line break. Please use unix \n or win \r\n line breaks. is thrown.
  • Line end is correctly detected in this situation.

Problem

  • Pri extractoroch pouzivame niekedy externe nastroje na vygenerovanie CSV, pretoze su rychlejsie.
  • Pri nasadeni novej verzie ex-mssql som zistil, ze tam vznika problem pri parsovani CSV z programu BCP.
  • Viac info: Catch invalid line breaks exception from CsvReader db-extractor-mssql#172
  • Vyexportovany CSV subor nema hlavicky a hned v prvom riadku sa nachazda XML obsah, ktory ma riadky zalomene pomocou \r (Mac styl).
  • Jedna sa o edge case, no kedze sa extractory pouzivaju vo velke miere, tak by to bolo dobre fixnut.
  • Kedze je to v uvodzovkach jedna sa o validne CSVcko, no `CsvReader nevedel spravne detekovat sposob zalomenia riadkov a tato kontrola sa neda ani vypnut.
  • Teda v praxi moze napr 20min bezat export pomocou BCP, a potom dalsich 30min cez PDO fallback, kedze CsvReader to nevie spravne vyhodnotit.
  • Upravil som to tak, ze sa cez regularny vyraz odstrania hodnoty medzi uvodzovkami a az tak sa spravi detekcia zalomenia riadkov + napisal som na to testy.
  • Velkost vzorky $sample je obmedzena na 10000 znakov, takze pre regularny vyraz to nebude ziadny problem.

CodeClimate hlasi nejake chyby a neviem preco, ale coverage urcite neklesol, len tam berie chybne subor z testov do coverage, nepodarilo sa mi zistit kde je problem, mozno nejaky bug:
image

@michaljurecko
Copy link
Contributor Author

@zajca chcel by som tu pripravit PR na fixnutie jedneho problemu, ale nespusta sa Travis:
https://travis-ci.org/github/keboola/php-csv

Vedel by si mi prosim s tym pomoct? Nemam prava na upravu tohto repa na GitHub aby som to zapol ...
Je to vypnute z nejakeho dovodu? Alebo je to chyba?

@michaljurecko
Copy link
Contributor Author

@zajca chcel by som tu pripravit PR na fixnutie jedneho problemu, ale nespusta sa Travis:
https://travis-ci.org/github/keboola/php-csv

Vedel by si mi prosim s tym pomoct? Nemam prava na upravu tohto repa na GitHub aby som to zapol ...
Je to vypnute z nejakeho dovodu? Alebo je to chyba?

tak uz sa o to postaral ujovlado: https://keboola.slack.com/archives/C02CFS0H3/p1595587122049500?thread_ts=1595586938.049000&cid=C02CFS0H3

@zajca
Copy link
Member

zajca commented Jul 24, 2020

@zajca chcel by som tu pripravit PR na fixnutie jedneho problemu, ale nespusta sa Travis:
https://travis-ci.org/github/keboola/php-csv

Vedel by si mi prosim s tym pomoct? Nemam prava na upravu tohto repa na GitHub aby som to zapol ...
Je to vypnute z nejakeho dovodu? Alebo je to chyba?

Čus, su na dovolence, ale mrknul sem na ten PR co sem dělal a tam sa travis nespustil, takže netuším. Asi @odinuv @ondrejhlavacek by mohl vědět.
// tak nic když vyřešeno

@michaljurecko michaljurecko force-pushed the webrouse-COM-350-fix-line-breaks-detect branch from 9089145 to f62f9c4 Compare July 24, 2020 12:08
@michaljurecko michaljurecko force-pushed the webrouse-COM-350-fix-line-breaks-detect branch from f62f9c4 to ee66641 Compare July 24, 2020 12:22
@michaljurecko michaljurecko force-pushed the webrouse-COM-350-fix-line-breaks-detect branch from 75e68dd to bba5f75 Compare July 24, 2020 13:39
@michaljurecko michaljurecko marked this pull request as ready for review July 24, 2020 13:50
@@ -2,7 +2,6 @@ language: php
php:
- 5.6
- 7.0
- hhvm
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stare ani nove buildy uz s HHVM neprechadzaju, HHVM uz nepodporuje PHPcko:
https://hhvm.com/blog/2017/09/18/the-future-of-hhvm.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jasnacka, ale pridej tam nove verze PHP

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. vdaka pridane v a6d0826

*/
public static function detectLineBreaks($sample, $enclosure, $escapedBy)
{
$cleared = self::clearCsvValues($sample, $enclosure, $escapedBy);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ked sa tento riadok upravi na $cleared = $sample;, tak pekne vidno ako spadnu testy na detekcii zalamovania riadkov.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Najprv sa zo vzorky odstrania hodnoty a potom sa pokracuje povodnym sposobom.

@@ -364,14 +382,6 @@ public function invalidSkipLinesProvider()
];
}

public function testInvalidNewLines()
{
self::expectException(Exception::class);
Copy link
Contributor Author

@michaljurecko michaljurecko Jul 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tento test teraz uz prechadza - nenastane exception, ... binarny subor ktory sa tam pouzival, obsahuje vela uvodzoviek ... a kedze sa teraz uz obsah medzi uvodzovkami ignoruje, ked sa detekuje zalomenie riadkov ... tak sa to nahodou teraz zdetekuje ako \n, ... pomocou CsvReader sa daju potom z toho subora nacitat aj data v textovej podobe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nevim jestli to neni dost brutalni zmena, co t bude delat kdyz to bude parsovat jpgcko nebo zip :)

Copy link
Contributor Author

@michaljurecko michaljurecko Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nemyslim si, ze ide o velku zmenu. Tento test prechadzal iba nahodou, ak by sa tam dal iny binarny subor, ktory by obsahoval skor kod znaku \n ako \r, tak by sa to spravalo inak -> zdetekovalo by to aj v starom kode zalomenie pomocou \n a islo dalej.

Toto repo CSVcka nijako nevaliduje, mozes mat napr rozny pocet stlpcov v jednotlivych riadkoch.
Tak ako doteraz to pri binarnom subore nahodne prejde/neprejde, ... len teraz sa beru aj uvodzovky, takze to prejde v inej mnozine nahodnych dat.

Tu su pozície klucovych znakov zo suboru binary:
0x29E: 22 - "
0x2AB: 22 - "

0x328: 22 - "
0x627: 0D - \r
0x65F: 22 - "

0x683: 0A - \n

Teda v starom kode sa ako prve naslo \r a zdetekovalo sa to ako nevalidne zalomenie riadkov.
Teraz sa zdetekuju aj uvodzovky a zoberie sa teda ako prve \n.

Mozem ten binarny subor upravit aby na pozicii 0x683 bolo \r - a test zachovat - ale nevidim v tom zmysel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jojo, super! vubec me to vcera takhle nedoslo

@@ -0,0 +1,2 @@
"test","some textwith\r line breaksinsidebutrowsareusing \n \"line\" break"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Text je zalomeny pomocou \r znakov, ktore tu v diffe nevidno, vid test kde sa pouziva.

@@ -137,6 +137,24 @@ public function testParseEscapedBy()
self::assertEquals($expected, iterator_to_array($csvFile));
}

public function testParseMacLineEndsInField()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Najprv som do PR pridal test, ktory testuje tento edge-case, bez fixu to padalo:
https://travis-ci.com/github/keboola/php-csv/jobs/364605649#L287

/*
* Regexp examples:
* enclosure: |"|, escapedBy: none, regexp: ~"(?>(?>"")|[^"])*"~
* enclosure: |"|, escapedBy: |\|, regexp: ~"(?>(?>\\"|\\\\)|[^"])*"~
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generovanie regularneho vyrazu som okomentoval, nie je nejaky dlhy, ale aby bolo jasne co sa tam deje.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Musim pouzit viac-riadkovy komentar, kedze sa tam nachadza ?> (koniec PHP suboru).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jako dobry poctenicko :)

@michaljurecko
Copy link
Contributor Author

@ujovlado skusil som to radsej fixnut ako to obchadzat v extractoroch, ... hodil by som tu na review zajcu alebo odina, no obaja su off.

@odinuv
Copy link
Member

odinuv commented Jul 25, 2020

ja su dost mimo provoze jeste, takze jsem to jen tak proletel, ale prosim zkus perf test na nejaky XX GB CSV - vidim, ze by se ta zmena mela tykat jen detekce a ta se stejne dela na samplu, a nemyslim si, ze by tam mel byt problem, ale rad bych mel jistotu

@odinuv
Copy link
Member

odinuv commented Jul 25, 2020

overit backtrack limit

@michaljurecko michaljurecko force-pushed the webrouse-COM-350-fix-line-breaks-detect branch 5 times, most recently from 6c792e0 to bf6e4a9 Compare July 27, 2020 07:32
@michaljurecko michaljurecko force-pushed the webrouse-COM-350-fix-line-breaks-detect branch from bf6e4a9 to 86f24a0 Compare July 27, 2020 07:35
$csvFile->writeRow($rows[1]);
self::fail('Expected exception was not thrown.');
} catch (Exception $e) {
// Exception message differs between PHP versions.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pridal som do CI novsie verzie PHP, v tomto teste sa mierne lisi chybova sprava, medzi verziami PHP.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

me nikdy nenapadlo se s tim takhle srat, ja bych zkratil tu message :D ale kazdopadne 👍

$csvFile->writeRow($rows[0]);
self::fail('Expected exception was not thrown.');
} catch (Exception $e) {
// Exception message differs between PHP versions.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pridal som do CI novsie verzie PHP, v tomto teste sa tiez mierne lisi chybova sprava, medzi verziami PHP.

@@ -90,7 +90,7 @@ public function writeRow(array $row)
"Cannot write to CSV file " . $this->fileName .
($ret === false && error_get_last() ? 'Error: ' . error_get_last()['message'] : '') .
' Return: ' . json_encode($ret) .
' To write: ' . strlen($str) . ' Written: ' . $ret,
' To write: ' . strlen($str) . ' Written: ' . (int) $ret,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ret moze byt aj false, no v chybovej sprave chceme cislo.

@@ -148,7 +129,7 @@ protected function readLine()
// allow empty enclosure hack
$enclosure = !$this->getEnclosure() ? chr(0) : $this->getEnclosure();
$escapedBy = !$this->getEscapedBy() ? chr(0) : $this->getEscapedBy();
return fgetcsv($this->getFilePointer(), null, $this->getDelimiter(), $enclosure, $escapedBy);
return @fgetcsv($this->getFilePointer(), null, $this->getDelimiter(), $enclosure, $escapedBy);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

V PHP 7.4 (pridane do CI) funkcie fgetcsv a fread vyhodia chybu - ak invalid file pointer.
V starsich verziach sa vrati iba null alebo false, chyba sa tam nevyhodi.

Z pohladu uzivatela sa vrati prazdne pole, testuje sa to tu (tento test padal):

public function testInvalidPointer()
{
$fileName = sys_get_temp_dir() . DIRECTORY_SEPARATOR . uniqid('csv-test');
$file = fopen($fileName, 'w');
$csvFile = new CsvReader($file);
self::assertEquals([], $csvFile->getHeader());
self::assertEquals([], iterator_to_array($csvFile));
}

V CsvWriter si naopak generujeme chybovu spravu sami, tam to zostalo bez zmeny, a testuje sa to tu:

public function testInvalidPointer()
{
$fileName = sys_get_temp_dir() . DIRECTORY_SEPARATOR . uniqid('csv-test');
touch($fileName);
$pointer = fopen($fileName, 'r');
$csvFile = new CsvWriter($pointer);
$rows = [['col1', 'col2']];
try {
$csvFile->writeRow($rows[0]);
self::fail('Expected exception was not thrown.');
} catch (Exception $e) {
// Exception message differs between PHP versions.
$or = new PHPUnit_Framework_Constraint_Or();
$or->setConstraints([
new PHPUnit_Framework_Constraint_StringContains(
'Cannot write to CSV file Return: 0 To write: 14 Written: 0'
),
new PHPUnit_Framework_Constraint_StringContains(
'Cannot write to CSV file Error: fwrite(): ' .
'write of 14 bytes failed with errno=9 Bad file descriptor Return: false To write: 14 Written: 0'
)
]);
self::assertThat($e->getMessage(), $or);
}
}

}
}

public function getPerformanceTestInputs()
Copy link
Contributor Author

@michaljurecko michaljurecko Jul 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pridal som perf testy.

Podla nich na mojom PC trva spracovanie regularneho vyrazu 190ms, ktory sa pridaval, pri 10000 vzorke, ktora je nastavena v kode.

Napr. pre prvy data set 1M-simple-rows, bola duration:

  • v starom kode 4.0193018913269
  • v novom kode 4.1560678482056
  • teda rozdiel je do nejakych 5%, pri opakovanych spusteniach to trocha kolisalo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ja bych rek, ze dobry - melo by to byt zhorseni o konstantu, takze v pohode

@michaljurecko
Copy link
Contributor Author

michaljurecko commented Jul 27, 2020

Ad:

overit backtrack limit

Backtrack limit:

  • je v PHP default 1000000, kedze je to kniznica, tak to nevieme tu upravit
  • velkost vzorky nad ktorou bezi regexp je max. 10000
  • v regularnom vyraze pouzivam ?> Once-only operator, alebo inak nazyvany aj Atomic grouping, ktory sluzi aj ako prevencia problemov s backtrackingom ...
  • pri pouziti tohto operatora sa postupuje iba jednym smerom, teda k backtrackingu nedochadza, ... ak je napr. text 12345abc a regexp: (?>\d+xyz) ... tak ako \d+ sa matchne 12345, kratsie alternativy 1234 .... 123 ... sa neskusaju.
  • staci takto, alebo co si si predstavoval pod overenim?, ... pridal som testy na performance, a tie to tiez svojim sposobom testuju (kedze sa jedna o velke subory).

@michaljurecko
Copy link
Contributor Author

@odinuv snazil som sa zapracovat vsetky tvoje pripomienky, pozries prosim na to? vdaka.

@michaljurecko michaljurecko removed the request for review from odinuv July 27, 2020 09:32
@michaljurecko
Copy link
Contributor Author

(este pozeram ze na Travis CI su tie performance testy ovela pomalsie ako lokalne, idem to este fixnut)

@michaljurecko
Copy link
Contributor Author

Vypol som spustanie preformance testov v CI.

BTW:
Dalo by sa to o ~20% zrychlit ak by sa pri kazdom readLine nevolalo validateLineBreak, staci to zvalidovat raz, v konstruktore, nie? Volanie metody stoji nejaky cas, a potom je tam este nejake inArray a dalsi kod v validateLineBreak.

php-csv/src/CsvReader.php

Lines 144 to 147 in b77f1c7

protected function readLine()
{
$this->validateLineBreak();

@odinuv
Copy link
Member

odinuv commented Jul 27, 2020

@odinuv snazil som sa zapracovat vsetky tvoje pripomienky, pozries prosim na to? vdaka.

tvl "pripominky", ja to jen prolitl v polohalucinacispanku a napsal jsem nejake vyblitky co me napadly :)

jasne, jdu na to!

Copy link
Member

@odinuv odinuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probral jsem se tim, zadnej (ani potencionalni) bugisek me nenapadl, no tak snad to bude dobry, ale je to takovej edge edge case, ze si myslim, ze jestli to nebude fungovat tak na to prijdem za 3 roky :)

@@ -7,7 +7,9 @@
use Keboola\Csv\CsvWriter;
use Keboola\Csv\Exception;
use Keboola\Csv\InvalidArgumentException;
use phpDocumentor\Reflection\Types\Void_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to se tu pripletlo

@@ -364,14 +382,6 @@ public function invalidSkipLinesProvider()
];
}

public function testInvalidNewLines()
{
self::expectException(Exception::class);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jojo, super! vubec me to vcera takhle nedoslo

$csvFile->writeRow($rows[1]);
self::fail('Expected exception was not thrown.');
} catch (Exception $e) {
// Exception message differs between PHP versions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

me nikdy nenapadlo se s tim takhle srat, ja bych zkratil tu message :D ale kazdopadne 👍

}
}

public function getPerformanceTestInputs()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ja bych rek, ze dobry - melo by to byt zhorseni o konstantu, takze v pohode

/*
* Regexp examples:
* enclosure: |"|, escapedBy: none, regexp: ~"(?>(?>"")|[^"])*"~
* enclosure: |"|, escapedBy: |\|, regexp: ~"(?>(?>\\"|\\\\)|[^"])*"~
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jako dobry poctenicko :)

* enclosure: |"|, escapedBy: none, regexp: ~"(?>(?>"")|[^"])*"~
* enclosure: |"|, escapedBy: |\|, regexp: ~"(?>(?>\\"|\\\\)|[^"])*"~
*/
$regexpDelimiter = '~';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

konstanta?

@michaljurecko
Copy link
Contributor Author

@odinuv vdaka ze si to pozrel ;) fixol som este tu konstantu a chybu v use a mergujem to.

@michaljurecko michaljurecko merged commit a439656 into master Jul 28, 2020
@odinuv
Copy link
Member

odinuv commented Jul 28, 2020

No ted jsem si uvedomil, ze jsem zapomnel na ten validateLineBreak uvnitr readLine, rekl bych, ze mas pravdu, ze tam nemusi byt a stacil by v konstruktoru - imho je to pozustatek toho jak to bylo implementovane ve verzi 1

@michaljurecko
Copy link
Contributor Author

ok. vdaka. hodil som to do osobitneho PR:
#42

@odinuv odinuv deleted the webrouse-COM-350-fix-line-breaks-detect branch January 17, 2023 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants