Python: Reading a file and adding keys and values to dictionaries from different lines


I'm very new to Python and I'm having trouble with an assignment, which basically goes like this:

#Read line by line a WARC file to identify string1.

#When string1 found, add part of the string as a key to a dictionary.

#Then continue reading file to identify string2, and add part of string2 as a value to the previous key.

#Keep going through file and doing the same to build the dictionary.

I can't import anything, which is causing me a bit of trouble. The hardest part is adding the key, leaving its value empty, and then continuing through the file to find string2 to use as the value.

My idea so far is to save the key in an intermediate variable, keep reading until I find the value, store that in a second intermediate variable, and only then add the pair to the dictionary.

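def main():
    # Open the file and split it into lines (WARC header lines end in CRLF)
    with open("warc_file.warc", "rb") as file:
        filetxt = file.read().decode("ascii", "ignore")
    filedata = filetxt.split("\r\n")
    dictionary = dict()
    urlkey = None  # intermediate variable for the most recent key
    in_response = False
    for line in filedata:
        if "WARC-Type: response" in line:
            in_response = True
        elif in_response and line.startswith("WARC-Target-URI: "):
            # Slice off the header name; strip() removes a set of characters, not a prefix
            urlkey = line[len("WARC-Target-URI: "):]
            # Stuck here: need to keep reading until string2 is found,
            # then store it as dictionary[urlkey]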

There are 2 answers

akaessens On BEST ANSWER

Your idea with storing the key to an intermediate value is good.

I also suggest using the following snippet to iterate over the lines.

filename = "warc_file.warc"  # path to your WARC file

with open(filename, "rb") as file:
    lines = file.readlines()  # read all lines into a list of bytes objects
    for line in lines:
        print(line)

To create dictionary entries in Python, you can use the dict.update() method. It creates the key if it is missing and overwrites the value if the key already exists.

d = dict() # create empty dict
d.update({"key" : None}) # create entry without value
d.update({"key" : 123}) # update the value
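
For a single key, plain subscript assignment has the same effect as update() and is the more common spelling. A small sketch comparing the two (the key names here are only placeholders):

```python
d = {}                 # create an empty dict
d["key"] = None        # create an entry without a real value yet
d["key"] = 123         # overwrite the value for the same key

# update() is handy when several entries are set at once
d.update({"other": 456, "key": 789})

print(d)
```

Either form works without importing anything.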
Matthew Strawbridge On

It's not entirely clear what you're trying to do, but I'll have a go at answering.

Suppose you have a WARC file like this:

WARC-Type: response
WARC-Target-URI: http://example.example
something
WARC-IP-Address: 88.88.88.88

WARC-Type: response
WARC-Target-URI: http://example2.example2
something else
WARC-IP-Address: 99.99.99.99

Then you could create a dictionary that maps the target URIs to the IP addresses like this:

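dictionary = dict()

URI_PREFIX = b"WARC-Target-URI: "
IP_PREFIX = b"WARC-IP-Address: "

with open("warc_file.warc", "rb") as file:
  urlkey = None
  value = None

  for line in file:
    if line.startswith(URI_PREFIX):
      assert urlkey is None
      # Slice off the header name; bytes.strip(x) would remove any of the
      # characters in x from both ends, which can eat into the URL itself.
      # The bare strip() then drops the trailing "\n" or "\r\n".
      urlkey = line[len(URI_PREFIX):].strip().decode("ascii")

    if line.startswith(IP_PREFIX):
      assert urlkey is not None
      assert value is None

      value = line[len(IP_PREFIX):].strip().decode("ascii")

      dictionary[urlkey] = value

      # Reset both variables, ready for the next record
      urlkey = None
      value = None

print(dictionary)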

This prints the following result:

{'http://example.example': '88.88.88.88', 'http://example2.example2': '99.99.99.99'}
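
One caveat when cutting the header name off a line: strip() with an argument treats that argument as a set of characters to remove from both ends, not as a literal prefix. It happens to work for the example URLs here, but it can damage other values. A short illustration, using a made-up host name:

```python
prefix = "WARC-Target-URI: "
line = "WARC-Target-URI: test.example"  # hypothetical header line

# strip() removes any of the characters in `prefix` from both ends,
# so it also eats the leading "te" and the trailing "e" of the host name.
stripped = line.strip(prefix)

# Slicing by the prefix length removes exactly the header name.
sliced = line[len(prefix):]

print(repr(stripped))  # 'st.exampl'
print(repr(sliced))    # 'test.example'
```

Checking with startswith() and then slicing avoids the problem and needs no imports.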

Note that this approach reads the file one line at a time instead of loading it all into memory, which matters if the file is very large.
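
If it helps, the same idea can be wrapped in a function that accepts any iterable of byte lines, so it can be tried on a small list before pointing it at the real file. This is only a sketch: the function name build_uri_map and the sample data are made up for illustration, and it assumes each URI line is followed by its IP address line as in the example above.

```python
def build_uri_map(lines):
    """Map each WARC-Target-URI value to the WARC-IP-Address that follows it."""
    uri_prefix = b"WARC-Target-URI: "
    ip_prefix = b"WARC-IP-Address: "
    result = {}
    pending_key = None  # URI seen but not yet paired with an IP address

    for raw in lines:
        line = raw.strip()  # drops a trailing "\n" or "\r\n"
        if line.startswith(uri_prefix):
            pending_key = line[len(uri_prefix):].decode("ascii", "ignore")
        elif line.startswith(ip_prefix) and pending_key is not None:
            result[pending_key] = line[len(ip_prefix):].decode("ascii", "ignore")
            pending_key = None  # ready for the next record
    return result


# Sample lines mirroring the example file above, with CRLF line endings
sample = [
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.example\r\n",
    b"something\r\n",
    b"WARC-IP-Address: 88.88.88.88\r\n",
    b"\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example2.example2\r\n",
    b"something else\r\n",
    b"WARC-IP-Address: 99.99.99.99\r\n",
]

uri_map = build_uri_map(sample)
print(uri_map)
```

A file opened in "rb" mode is itself an iterable of byte lines, so passing the open file object to build_uri_map should still read one line at a time.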