"incomplete final line found by readTableHeader" when using read.delim() on a tab-delimited file with Chinese character

Question

8.9k views Asked by TerenceLam At 09 June 2015 at 04:49

I got this "incomplete final line found by readTableHeader" warning when using read.delim() to read a tab-delimited text file. There are Traditional Chinese characters in the header and content, so I am already specifying the file encoding, like this:

kg = read.delim("KG_EDB_20150505.csv",fileEncoding="UTF-16LE")

Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote,  :
  incomplete final line found by readTableHeader on 'KG_EDB_20150505.csv'

I have read other posts with similar issues, e.g.:

'Incomplete final line' warning when trying to read a .csv file into R
In read.table(): incomplete final line found by readTableHeader

But unfortunately the suggested solutions in these posts cannot solve the problem.

A summary of what I tried:

1. Pressing ENTER at the end of the last line of the text file: same error.
2. Trimming the text file to the header plus a single line of data, then making sure there is a new line (ENTER) between the header and the content: same error.
3. Trimming the text file until only the header is left, then copying and pasting the header onto the next line to act as a fake line of data, and adding a new line (ENTER) after that fake line: WORKS! The Chinese is all garbage, but I do not need it anyway.
4. Removing the trailing new line (ENTER) in #3: same error, but the 1 line of fake data is still read into the data.frame.
5. Opening the file in Excel directly: works, but not the workflow I want.

So what gives?

Is there a way I can read in such file?

or

Is there a way to massage the file (preferably in R) and then read it in?

The file is here:

https://dl.dropboxusercontent.com/u/5860015/KG_EDB_20150505.csv

It was from a government webpage here:

http://www1.map.gov.hk/gih3/view/index.jsp
(Map Tools > Data Download > Kindergarten-cum-child Care Centres)

Many thanks in advance!

Update:

By a stroke of luck, I isolated an offending character inside the text file, namely this Chinese character "稚". It may not be the only one, but if I add it to the file in #3, I get the same error again. I do not know what is special about this character, and I do not need any of the Chinese information in the text file anyway.

So now there are more questions:

Is there a way to skip reading this offending character?

or

Is there a way in R to replace this offending character in the file, before reading in the text file?

**IRTFM** · Answer 1 · 2015-06-09T05:42:45+00:00

It's full of Chinese characters (every other field in fact).

First line:

"ENGLISH CATEGORY" "中文類別" "ENGLISH NAME" "中文名稱" "ENGLISH ADDRESS" "中文地址" "LONGITUDE" "經度" "LATITUDE" "緯度" "EASTING" "坐標東" "NORTHING" "坐標北" "STUDENTS GENDER" "就讀學生性別" "SESSION" "學校授課時間" "DISTRICT" "分區" "FINANCE TYPE" "資助種類" "SCHOOL LEVEL" "學校類型" "OPENING HOURS" "開放時間" "TELEPHONE" "聯絡電話" "FAX NUMBER" "傳真號碼" "EMAIL ADDRESS" "電郵地址" "WEBSITE" "網頁" "RELIGION" "宗教"

And my editor thinks it is UTF-16 and that it is "Little Endian".
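If you would rather check the byte order from R than trust an editor, one rough approach is to look at the first two bytes of the file for a byte order mark (BOM). This is only a sketch: detect_utf16_bom is a made-up helper name, and it assumes the file actually starts with a BOM, which not every UTF-16 file does.

```r
# Hypothetical helper (not part of base R): guess the UTF-16 byte order
# from the first two bytes of a file.
detect_utf16_bom <- function(path) {
  bom <- readBin(path, what = "raw", n = 2L)
  if (length(bom) < 2L) {
    return(NA_character_)
  }
  if (identical(bom, as.raw(c(0xff, 0xfe)))) {
    return("UTF-16LE")   # bytes FF FE: little endian
  }
  if (identical(bom, as.raw(c(0xfe, 0xff)))) {
    return("UTF-16BE")   # bytes FE FF: big endian
  }
  NA_character_          # no UTF-16 BOM found
}

# Usage (path as in the question):
# detect_utf16_bom("KG_EDB_20150505.csv")
```

An NA result only means there is no UTF-16 BOM; the file could still be UTF-16 without one.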

Unless you are set up with the right fonts and understand the ins and outs of encodings, it is much easier to use an external editor, especially since you say you do not want the info that is in the Chinese fields. I succeeded with the TextWrangler editor from Bare Bones Software. It's the free version of their more fully featured editor, but it has the capacity to remove non-ASCII characters and save the result as a UTF-8 encoded file.
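Since the question asks for a way to massage the file in R, the same "remove non-ASCII, then read" step can probably be done without an editor. The following is a minimal sketch, not what I actually ran: read_ascii_only is a hypothetical helper name, and it assumes the file is UTF-16 with a byte order mark and small enough to hold in memory.

```r
# Hypothetical helper (not from the original workflow): read a UTF-16 file,
# delete every non-ASCII character, and parse what is left as tab-delimited.
read_ascii_only <- function(path, encoding = "UTF-16") {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  # Convert the whole file to UTF-8 first; this step loses nothing.
  text <- iconv(list(bytes), from = encoding, to = "UTF-8")
  # Drop everything outside 7-bit ASCII (this also removes a leftover BOM),
  # then drop carriage returns so only "\n" line endings remain.
  text <- gsub("[^\\x01-\\x7F]", "", text, perl = TRUE)
  text <- gsub("\r", "", text, fixed = TRUE)
  read.delim(text = text, stringsAsFactors = FALSE)
}

# Usage (path as in the question):
# kg <- read_ascii_only("KG_EDB_20150505.csv")
```

Converting to UTF-8 before stripping is deliberate: the regular expression then removes whole characters rather than individual bytes of a two-byte UTF-16 unit.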

> inp <- read.table("~/Downloads/KG_EDB_20150505.txt", header=TRUE)
> str(inp)
'data.frame':   1385 obs. of  36 variables:
 $ ENGLISH.CATEGORY: Factor w/ 1 level "Kindergartens": 1 1 1 1 1 1 1 1 1 1 ...
 $ X               : logi  NA NA NA NA NA NA ...
 $ ENGLISH.NAME    : Factor w/ 1368 levels "A-ONE KINDERGARTEN",..: 137 38 835 714 858 551 455 533 1073 396 ...
 $ X.1             : Factor w/ 68 levels "","-()","()",..: 5 3 3 5 3 3 3 3 3 3 ...
 $ ENGLISH.ADDRESS : Factor w/ 562 levels "(INCLUDING 1-STOREY SCHOOL EXTENSION) 23 NAM LONG SHAN ROAD ABERDEEN HONG KONG",..: 448 40 34 316 396 55 326 160 273 483 ...
 $ X.2             : Factor w/ 294 levels "","()","()29",..: 257 1 21 1 1 112 1 59 1 289 ...
 $ LONGITUDE       : Factor w/ 416 levels "113-51-49","113-51-54",..: 101 302 406 60 314 167 189 104 330 363 ...
 $ X.3             : Factor w/ 416 levels "113-51-49","113-51-54",..: 101 302 406 60 314 167 189 104 330 363 ...
 $ LATITUDE        : Factor w/ 397 levels "22-12-36","22-13-10",..: 150 257 139 357 388 139 167 160 383 377 ...
 $ X.4             : Factor w/ 397 levels "22-12-36","22-13-10",..: 150 257 139 357 388 139 167 160 383 377 ...
 $ EASTING         : num  836221 828924 834914 818325 828492 ...
 $ X.5             : num  836221 828924 834914 818325 828492 ...
 $ NORTHING        : num  821002 826433 820623 835893 840814 ...
 $ X.6             : num  821002 826433 820623 835893 840814 ...
 $ STUDENTS.GENDER : Factor w/ 2 levels "CO-ED","GIRLS": 1 1 1 1 1 1 1 1 1 1 ...
 $ X.7             : logi  NA NA NA NA NA NA ...
 snipped.

The fields that had Chinese in the header are all now blank. Note that it's NOT a csv file: there are no commas. If I were doing it again for myself, I'd use stringsAsFactors = FALSE.
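As a rough illustration of that last point, here is a small sketch on made-up data shaped like the stripped file (two tab-separated rows typed inline, not the real download). It reads the text columns as character rather than factor and then drops the columns that came out entirely NA; the variable names are only for illustration.

```r
# Inline stand-in for the stripped file: header fields that held Chinese
# are now empty strings, and fields are separated by tabs.
stripped <- paste(
  "\"ENGLISH NAME\"\t\"\"\t\"EASTING\"\t\"\"",
  "\"A-ONE KINDERGARTEN\"\t\"\"\t836221\t836221",
  sep = "\n"
)

# read.delim() uses sep = "\t" by default; keep strings as character.
inp <- read.delim(text = stripped, stringsAsFactors = FALSE)

# Drop the columns that contain nothing but NA.
all_na <- vapply(inp, function(col) all(is.na(col)), logical(1))
inp <- inp[, !all_na, drop = FALSE]

str(inp)
```

Columns that duplicate numeric data (such as the second copy of EASTING) survive this filter, so they would still have to be dropped by name if they are not wanted.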

It's also possible to input the file with the correct encoding. This works on the original file with no editing at all:

> inp2 <- read.table("~/Downloads/KG_EDB_20150505.csv", header=TRUE, fileEncoding="UTF-16")
> str(inp2)
'data.frame':   1385 obs. of  36 variables:
 $ ENGLISH.CATEGORY: Factor w/ 1 level "Kindergartens": 1 1 1 1 1 1 1 1 1 1 ...
 $ 中文類別        : Factor w/ 1 level "幼稚園": 1 1 1 1 1 1 1 1 1 1 ...
 $ ENGLISH.NAME    : Factor w/ 1368 levels "A-ONE KINDERGARTEN",..: 137 38 835 714 858 551 455 533 1073 396 ...
 $ 中文名稱        : Factor w/ 1355 levels "","DISCOVERY BAY INTERNATIONAL SCHOOL (A.M.)",..: 1186 507 854 630 64 134 1298 147 520 1256 ...
 $ ENGLISH.ADDRESS : Factor w/ 562 levels "(INCLUDING 1-STOREY SCHOOL EXTENSION) 23 NAM LONG SHAN ROAD ABERDEEN HONG KONG",..: 448 40 34 316 396 55 326 160 273 483 ...
 $ 中文地址        : Factor w/ 554 levels "34 PRICE ROAD HONG KONG",..: 32 395 51 259 173 37 58 28 176 370 ...
 $ LONGITUDE       : Factor w/ 416 levels "113-51-49","113-51-54",..: 101 302 406 60 314 167 189 104 330 363 ...
 $ 經度            : Factor w/ 416 levels "113-51-49","113-51-54",..: 101 302 406 60 314 167 189 104 330 363 ...
 $ LATITUDE        : Factor w/ 397 levels "22-12-36","22-13-10",..: 150 257 139 357 388 139 167 160 383 377 ...
 $ 緯度            : Factor w/ 397 levels "22-12-36","22-13-10",..: 150 257 139 357 388 139 167 160 383 377 ...
 $ EASTING         : num  836221 828924 834914 818325 828492 ...
 $ 坐標東          : num  836221 828924 834914 818325 828492 ...
 $ NORTHING        : num  821002 826433 820623 835893 840814 ...
 $ 坐標北          : num  821002 826433 820623 835893 840814 ...
snipped.
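Once the file is read with the correct encoding, the Chinese-named columns can be dropped afterwards if they are not needed. This is a sketch on a toy data frame standing in for inp2 (the real one has 36 columns); the names inp2_demo and english_only are just for illustration.

```r
# Toy stand-in for inp2: one English-named and one Chinese-named column.
inp2_demo <- data.frame(a = "A-ONE KINDERGARTEN", b = 836221,
                        stringsAsFactors = FALSE)
# "\u5750\u6a19\u6771" is the header "坐標東" written with Unicode escapes.
names(inp2_demo) <- c("ENGLISH.NAME", "\u5750\u6a19\u6771")

# Keep only the columns whose names are pure ASCII.
has_non_ascii <- grepl("[^\\x01-\\x7F]", names(inp2_demo), perl = TRUE)
english_only <- inp2_demo[, !has_non_ascii, drop = FALSE]

names(english_only)
```

This filters on the column names only; Chinese text inside the remaining columns is left untouched.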
